Secondary analysis of existing large-scale data sets is a viable option for both faculty and student research. Students may use existing data for dissertations, doctoral projects, theses, and/or independent research projects. Existing data sets provide a cost-effective and valuable way to contribute to the current knowledge base, and they allow you to answer important research questions that might otherwise be out of reach.

There are thousands of public domain data sets covering a wide range of populations. The policies and steps for releasing the data vary by data set. Some data sets require a faculty supervisor before the data are released, others require a proposal (e.g., research aims, planned analyses), and some can be downloaded directly from the hosting organization in SPSS format. Numerous agencies and organizations maintain existing data sets.

For example, the Inter-university Consortium for Political and Social Research (ICPSR), an international consortium of more than 750 academic institutions and research organizations, manages a large number of data sets. Based at the University of Michigan, ICPSR hosts data in a repository with powerful search capabilities. Indexed by all the major search engines, ICPSR data are easily discoverable and widely accessible to the public. However, ICPSR is not the only source, so a basic internet search should also yield several leads.

If you are interested in using a particular data set, we recommend that you first review the study’s “codebook” to see what data were collected and what variables may be of interest to you for a research project.
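As a rough illustration of that first pass through a codebook, the Python sketch below filters a small, made-up variable list by keyword. The variable names and labels are hypothetical and do not come from any actual study; a real codebook would usually be a PDF or metadata file supplied with the data set.

```python
# Hypothetical codebook excerpt: variable name -> descriptive label.
# These entries are invented for illustration only.
codebook = {
    "AGE": "Respondent age in years",
    "ALC30": "Days of alcohol use in past 30 days",
    "DRUG30": "Days of illicit drug use in past 30 days",
    "GRADE": "Current grade in school",
}


def find_variables(codebook, keyword):
    """Return the variables whose label contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return {
        name: label
        for name, label in codebook.items()
        if needle in label.lower()
    }


# List every variable whose label mentions "use".
matches = find_variables(codebook, "use")
for name, label in matches.items():
    print(f"{name}: {label}")
```

A keyword search like this only narrows the list; you would still read the full codebook entries for the matching variables to check question wording, response options, and skip patterns.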

Please see below for a listing of selected existing data sets, organized by category, compiled in December 2016 by Dr. Steven Proctor.

Juvenile Justice

Survey of Youth in Residential Placement (SYRP) 2003 [United States]

The Survey of Youth in Residential Placement (SYRP) is the only national survey that gathers data directly from youth in the juvenile justice system. The Office of Juvenile Justice and Delinquency Prevention (OJJDP) designed the survey in 2000 and 2001 to survey offender youth between the ages of 10 and 20. SYRP asks the youth about their backgrounds; offense histories and problems; the facility environment; experiences in the facility; experiences with alcohol and drugs; experiences of victimization in placement; medical needs and services received; and their expectations for the future. As a result, SYRP research provides answers to a number of questions about the characteristics and experiences of youth in custody.

SYRP’s findings are based on anonymous interviews, conducted with audio computer-assisted self-interview (ACASI) technology, with a nationally representative sample of youth in custody during the spring of 2003. SYRP joins two ongoing data collections that OJJDP designed and implemented in the 1990s, the Census of Juveniles in Residential Placement and the Juvenile Residential Facility Census, to provide updated statistics on youth in custody in the juvenile justice system.

The codebook for the project, which can be downloaded from the main page of the project website, provides the following access information: “The data are restricted from general dissemination. Users interested in obtaining these data must complete a Restricted Data Use Agreement Restrictions form and specify the reasons for the request. A copy of the Restricted Data Use Agreement form can be requested by calling 800-999-0960. Researchers can also download this form as a Portable Document Format (PDF) file from the download page associated with this dataset. Completed forms should be returned to: Director, National Archive of Criminal Justice Data, Inter-university Consortium for Political and Social Research, Institute for Social Research, P.O. Box 1248, University of Michigan, Ann Arbor, MI 48106-1248, or by fax: 734-647-8200.”

SYRP bulletins, reports, and a simplified online analysis tool are available from the SYRP website. Click the tabs across the top to link to specific information.

The Survey of Youth in Residential Placement (SYRP) is a unique addition to the Office of Juvenile Justice and Delinquency Prevention’s (OJJDP’s) constellation of surveys on youth in custody in the juvenile justice system. In contrast to OJJDP’s Census of Juveniles in Residential Placement and Juvenile Residential Facility Census, which are mail surveys of residential facility administrators, SYRP gathers information directly from youth through anonymous interviews.

Adolescent Health

The National Longitudinal Study of Adolescent to Adult Health (Add Health)

Add Health is a longitudinal study of a nationally representative sample of adolescents in grades 7-12 in the United States during the 1994-95 school year. The Add Health cohort has been followed into young adulthood with four in-home interviews, the most recent in 2008, when the sample was aged 24-32. Add Health is re-interviewing cohort members in a Wave V follow-up from 2016 to 2018 to collect social, environmental, behavioral, and biological data with which to track the emergence of chronic disease as the cohort moves through its fourth decade of life.

Add Health combines longitudinal survey data on respondents’ social, economic, psychological, and physical well-being with contextual data on the family, neighborhood, community, school, friendships, peer groups, and romantic relationships, providing unique opportunities to study how social environments and behaviors in adolescence are linked to health and achievement outcomes in young adulthood. The fourth wave of interviews expanded the collection of biological data in Add Health to understand the social, behavioral, and biological linkages in health trajectories as the Add Health cohort ages through adulthood, and the fifth wave of data collection continues this biological data expansion.

Additional information about the project is available on the Add Health website.

If you use Add Health data, you must include the following acknowledgement in any books, articles, conference papers, theses, dissertations, reports, or other publications that employ the data:

This research uses data from Add Health, a program project designed by J. Richard Udry, Peter S. Bearman, and Kathleen Mullan Harris, and funded by a grant P01-HD31921 from the Eunice Kennedy Shriver National Institute of Child Health and Human Development, with cooperative funding from 17 other agencies. Special acknowledgment is due Ronald R. Rindfuss and Barbara Entwisle for assistance in the original design. Persons interested in obtaining data files from Add Health should contact Add Health, Carolina Population Center, 123 W. Franklin Street, Chapel Hill, NC 27516-2524 (addhealth@unc.edu). No direct support was received from grant P01-HD31921 for this analysis.

Youth Risk Behavior Surveillance System (YRBSS)

The Youth Risk Behavior Surveillance System (YRBSS) monitors six types of health-risk behaviors that contribute to the leading causes of death and disability among youth and adults: behaviors that contribute to unintentional injuries and violence; sexual behaviors related to unintended pregnancy and sexually transmitted diseases, including HIV infection; alcohol and other drug use; tobacco use; unhealthy dietary behaviors; and inadequate physical activity. YRBSS also measures the prevalence of obesity and asthma and other priority health-related behaviors, plus sexual identity and the gender of sexual contacts.

YRBSS includes a national school-based survey conducted by CDC and state, territorial, tribal, and local surveys conducted by state, territorial, and local education and health agencies and tribal governments.

National YRBS data sets and documentation are available for download at YRBSS Data & Documentation. There is no charge for the data, nor is permission needed to download or use them.

Criminal Justice

U.S. Department of Justice

The U.S. Department of Justice believes that publishing high-value data sets that increase transparency and accountability can improve public knowledge of the Department of Justice and its operations. The Department hopes that in implementing the Open Government Plan, it will not only respond to the needs and demands of the public but also create economic opportunity. To date, the Department has registered dozens of data sets to Data.gov, a clearinghouse for data from the Executive Branch of the Federal Government. New data sets will continue to be registered as they become available for publication. Every bureau, office, and division at the Department of Justice has been asked to identify and inventory potential data sets for release to the public. This includes new data as well as data that may be in existence, but unpublished.

The National Archive of Criminal Justice Data (NACJD)

The mission of the National Archive of Criminal Justice Data (NACJD) is to facilitate research in criminal justice and criminology through the preservation, enhancement, and sharing of computerized data resources; through the production of original research based on archived data; and through specialized training workshops in quantitative analysis of crime and justice data. NACJD provides the following services to assist those using its data collections:

 

  • The identification of appropriate criminological and criminal justice data collections on specific topics
  • Custom subsetting of selected data files through an online Survey Documentation and Analysis program
  • Assistance with the retrieval and use of files obtained from the archive

Early Care & Education

National Research Center on Hispanic Children and Families Data Sets

The National Research Center on Hispanic Children and Families recently released a series of data briefs and interactive tools to facilitate the access and use of existing, large-scale data sets to examine policy-relevant questions about early care and education use among low-income Hispanic families.

The Families’ Utilization of Early Care and Education data tool allows users to unpack early care and education utilization among Hispanic families, separately for each data set. It gives you the capability to dig deeper into national surveys and see which ones include questions about number of arrangements, provider type, financial assistance, and more.

The Early Care and Education Search and Decision Making data tool allows users to unpack the early care and education preferences, priorities, and search and decision-making among Hispanic families, separately for each data set. It gives you the capability to quickly and efficiently dig deeper into different national surveys and see which ones include questions about the variables of interest, such as satisfaction with options, access barriers, difficulty of ECE search, and more.

The Unpacking Hispanic Diversity data tool allows users to unpack the diversity of Hispanic populations, separately for each data set. It gives you the capability to dig deeper into national surveys and see which ones include questions about citizenship, literacy, heritage, and more.

The Early Childhood Longitudinal Study (ECLS)

The Early Childhood Longitudinal Study (ECLS) program includes three longitudinal studies that examine child development, school readiness, and early school experiences, as described below. The ECLS program provides national data on children’s status at birth and at various points thereafter; children’s transitions to non-parental care, early education programs, and school; and children’s experiences and growth through the eighth grade. The program also provides data to analyze the relationships among a wide range of family, school, community, and individual variables with children’s development, early learning, and performance in school.

The Early Childhood Longitudinal Study, Birth Cohort (ECLS-B) was designed to provide policy makers, researchers, child care providers, teachers, and parents with detailed information about children’s early life experiences. Data collected for the ECLS-B focus on children’s health, development, care, and education during the formative years from birth through kindergarten entry.

A subset of variables from the ECLS-B 9-month data collection is available to the general public in the Data Analysis System (DAS). DAS is an on-line tool that allows users to build tables of weighted estimates, calculate t-tests, and produce correlation matrices for use in linear regression analyses. The DAS webpage has links to an on-line tutorial and a guide to help users navigate through the application and produce desired tables.
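To make the idea of a weighted estimate concrete, the short Python sketch below computes a weighted proportion from a handful of invented records. The values and weights are assumptions for illustration, not ECLS-B data, and the calculation is a generic survey-weighting formula rather than a description of how DAS works internally.

```python
def weighted_proportion(values, weights):
    """Weighted share of records where the indicator is 1.

    values  -- 0/1 indicators, one per sampled child (hypothetical)
    weights -- sampling weights: how many children each record represents
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Invented example: 1 = child is in non-parental care, 0 = child is not.
in_care = [1, 0, 1, 0]
weights = [100.0, 300.0, 200.0, 400.0]

estimate = weighted_proportion(in_care, weights)
unweighted = sum(in_care) / len(in_care)
print(f"weighted: {estimate:.2f}, unweighted: {unweighted:.2f}")
```

The weighted and unweighted figures differ here because the records represent different numbers of children, which is why analyses of large-scale survey data typically apply the weights supplied with the data set.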

The Early Childhood Longitudinal Study, Kindergarten Class of 1998-99 (ECLS-K) focuses on early school experiences, beginning with kindergarten and following children through middle school. The ECLS-K data provide descriptive information on children’s status at entry to school, their transition into school, and their progression through 8th grade. The longitudinal nature of the ECLS-K data enables researchers to study how a wide range of family, school, community, and individual factors are associated with school performance.

The Longitudinal Kindergarten Through Eighth Grade Full Sample Public-Use Data and Documentation is now available. This longitudinal K-8 data file includes all released data for all cases that ever participated in the study, including those that became non-respondents at some point after kindergarten. This is the only file that is needed for analysis of publicly available data for any round of ECLS-K data collection.

The Early Childhood Longitudinal Study, Kindergarten Class of 2010-11 (ECLS-K:2011) is sponsored by the National Center for Education Statistics (NCES) within the Institute of Education Sciences (IES) of the U.S. Department of Education. In addition, the study benefits from its partnership with and sponsorship by several additional federal agencies. The study is also endorsed by many professional organizations in the area of education.

Broad in its scope and coverage of child development, early learning, and school progress, the ECLS-K:2011 draws together information from multiple sources to provide rich data on children’s early school experiences, beginning with kindergarten and following children through fifth grade. The ECLS-K:2011 provides descriptive information on children’s status at entry to school, their transition into school, and their progression through the elementary grades. The longitudinal nature of the ECLS-K:2011 data enables researchers to study how a wide range of family, school, community, and individual factors are associated with school performance over time.

Round of data collection  Dates of data collection      Date of release or tentative date for release
Fall Kindergarten         August 2010 to December 2010  July 2013 (restricted data); April 2015 (public data)
Spring Kindergarten       March 2011 to June 2011       July 2013 (restricted data); April 2015 (public data)
Fall First Grade          August 2011 to December 2011  November 2014 (restricted data); April 2015 (public data)
Spring First Grade        March 2012 to June 2012       November 2014 (restricted data); April 2015 (public data)
Fall Second Grade         August 2012 to December 2012  July 2015 (restricted data); Fall 2016 (public data)
Spring Second Grade       March 2013 to June 2013       July 2015 (restricted data); Fall 2016 (public data)
Spring Third Grade        March 2014 to June 2014       Fall 2016 (restricted data); Fall 2017 (public data; released with fourth-grade data)
Spring Fourth Grade       March 2015 to June 2015       Summer 2017 (restricted data); Fall 2017 (public data)
Spring Fifth Grade        March 2016 to June 2016       Summer 2018 (restricted data); Fall 2018 (public data)

 

Addiction

PhenX

The PhenX (consensus measures for Phenotypes and eXposures) Toolkit is a catalog of recommended, standard measures of phenotypes and environmental exposures for use in biomedical research. PhenX measures can be used to expand a study design beyond the primary research focus. The PhenX Toolkit is a web-based resource and is available for use at no cost.

The PhenX Toolkit offers well-established, broadly validated measures of phenotypes and exposures relevant to investigators in human genomics, epidemiology, and biomedical research. The measures in the Toolkit are selected by working groups of domain experts using a consensus process. The Toolkit provides detailed protocols, information about the measures, and tools to help investigators incorporate PhenX measures into their studies. Inclusion of PhenX measures facilitates cross-study analysis downstream, thus increasing the scientific impact of each individual study.

The National Addiction & HIV Data Archive Program (NAHDAP)

The National Addiction & HIV Data Archive Program (NAHDAP) acquires, preserves, and disseminates data relevant to drug addiction and HIV research. The scope of the data housed at NAHDAP covers a wide range of legal and illicit drugs (alcohol, tobacco, marijuana, cocaine, synthetic drugs, and others) and the trajectories, patterns, and consequences of drug use as well as related predictors and outcomes. By preserving and making available an easily accessible library of electronic data on drug addiction and HIV infection in the United States, NAHDAP offers scholars the opportunity to conduct secondary analysis on major issues of social and behavioral sciences and public policy.

The research community benefits when researchers can use data from original research projects to test conclusions – verifying, refining, or refuting published findings. Sharing data through NAHDAP fosters the development and testing of new conclusions, as data collected for one purpose can be used to pursue inquiries not addressed by the original investigators.

National Institute on Drug Abuse (NIDA) Data Share for Clinical Trials

The NIDA Data Share website is an electronic environment that allows data from completed clinical trials to be distributed to investigators and the public in order to promote new research, encourage further analyses, and disseminate information to the community. Secondary analyses produced from data sharing multiply the scientific contribution of the original research. NIH expects and supports the timely release and sharing of final research data from NIH-supported studies for use by other researchers to expedite the translation of research results into knowledge, products and procedures to improve human health. The website was created in order to make the NIDA Clinical Trial data available to as wide an audience as possible. As studies are completed and their data become available, the website will be linked to those data.

National Epidemiologic Survey on Alcohol and Related Conditions-III (NESARC-III)

The National Epidemiologic Survey on Alcohol and Related Conditions-III (NESARC-III) was sponsored, designed, and directed by the National Institute on Alcohol Abuse and Alcoholism (NIAAA). NESARC-III is cross-sectional, based on a nationally representative sample of the civilian noninstitutionalized population of the United States aged 18 years and older. Fieldwork was conducted by Westat through a contract under the data collection authorization of Title 42 USC 285n. Utilizing the NIAAA Alcohol Use Disorder and Associated Disabilities Interview Schedule (AUDADIS-5), NESARC-III collected information on DSM-5 alcohol and drug use and disorders, related risk factors, and associated physical and mental disabilities. In addition, DNA was obtained through saliva samples. The final sample size was 36,309 and included persons living in households and select noninstitutional group quarters.

Prescription Drug Abuse Policy System (PDAPS)

The Prescription Drug Abuse Policy System (PDAPS) is funded by the National Institute on Drug Abuse to track key state laws related to prescription drug abuse. Click on any topic area to reach an interactive page where you can investigate the history and features of the law or download data and other documentation for research.

National Institute on Alcohol Abuse and Alcoholism (NIAAA) Alcohol Policy Information System (APIS)

The Alcohol Policy Information System (APIS) provides detailed information on a wide variety of alcohol-related policies in the United States at both state and federal levels. Detailed state-by-state information is available for the 35 alcohol-related policies listed on the website. APIS also provides a variety of informational resources of interest to alcohol policy researchers and others involved with alcohol policy issues.

Other

LawAtlas Policy Surveillance Program

The goal of the Policy Surveillance Program is to increase the use of policy surveillance and legal mapping as tools for improving the nation’s health. Researchers, policymakers, public health practitioners, and the media are recognizing the need for access to reliable information about laws and policies that influence the public’s health.

Legal mapping can help policymakers, advocates, and researchers understand what the laws are on a given topic, know how the laws differ over time and across jurisdictions, and obtain data to evaluate their impact. On the Policy Surveillance Program website, you can access maps, tables, data, and reports describing both the current state of health laws and how they have changed over time. Health is impacted by a wide-ranging array of laws and policies. Search by public health topic or alphabetically to find legal maps and begin exploring the law.

You can search laws in the following categories (which include subcategories): alcohol, tobacco, and other drugs; chronic disease; environmental health; food safety; health communication and information technology; health services; infectious disease prevention and control; injury and violence prevention; maternal, infant, and child health; mental health and mental disorders; occupational safety and health; oral health; public health infrastructure; and social determinants of health and health disparities.

National Institutes of Health (NIH) Data Sharing Repositories

This table lists NIH-supported data repositories (including some of those listed elsewhere on this page) that make data accessible for reuse. Most accept submissions of appropriate data from NIH-funded investigators (and others), but some restrict data submission to only those researchers involved in a specific research network. Also included are resources that aggregate information about biomedical data and information sharing systems. The table can be sorted by name and by NIH Institute or Center and may be searched using keywords so that you can find repositories more relevant to your data. Links are provided to information about submitting data to and accessing data from the listed repositories. Additional information about the repositories and points-of-contact for further information or inquiries can be found on the websites of the individual repositories.

United States Census Bureau

The mission of the United States Census Bureau is to serve as the leading source of quality data about the nation’s people and economy. The Bureau’s Data Tools and Apps page provides information on using interactive applications to get statistics from multiple surveys. The survey topics include, among many others, housing; economic and social factors; county business patterns; occupation groupings by race, ethnicity, and sex; migration flows; and small area health insurance, income, and poverty estimates.

Child Trends

Child Trends is the nation’s leading nonprofit research organization focused exclusively on improving the lives and prospects of children, youth, and their families. It accomplishes this goal by conducting high-quality research and sharing the resulting knowledge with practitioners and policymakers. The organization’s online databank enables you to search by a wide variety of indicators.

Pew Research Center

The Pew Research Center includes a robust group of data sets on the following topics:

 

Click on a link from the list above to access the topic of your interest, then on “Datasets” or “Data” in the menu at the top of the page. You will see a list of the data sets available for downloading, along with links to the reports already released from that data. Pew Research Center staff is available to answer questions and to provide limited assistance in importing and analyzing the data.