A Crash Course in Secondary Data Sources for Berkeley Researchers Jon Stiles D-Lab


Lesson 1 & 2: Plan Ahead!

Take a little time to check out the landscape and see what you might want to look for.

Lesson 1 & 2: Look where you’re going!

…and don’t go so fast that you lose control of what you’re doing!

Secondary Data Resources

Road Map for Today

Secondary data: what is it and where does it come from?
  Why and how would you want to use it?
Secondary data: where can you find it?
  Sites (archives, research organizations, government agencies)
  Strategies (keyword, literature, snowball)
Tools to help you extract and use secondary data
Local resources to help you

Secondary data: what is it and where does it come from?

Secondary data: what is it?

Data: plural of "datum" ….. from the Latin "something given."

Plural : Right on!

Something Given: Not so much….

Secondary data: what is it?

Primary data
  “New” data
  Typically collected to answer specific questions or serve specific needs
  Known universe/sample, intentional design
  Tailored data items

Secondary data
  “Recycled” data
  Collected by others and re-used
  Often (but not always) collected for a different use
  Value reliant on meta-data (information about the data)

Secondary data: basic characteristics

Secondary data tend to emerge from three kinds of collection processes:
  Survey data: collection for research purposes, coherent research design, well-defined sampling process, intent to generalize
  Administrative data: collection for program administration or routine record-keeping
  Digital exhaust: an electronic byproduct or residue of activities

Secondary data may be available either as:
  Microdata: individual-level records for a unit of analysis
  Aggregate data: summary counts or statistics across multiple units

Secondary data may be available either as:
  Cross-sectional data: data collected at a single point in time
  Longitudinal data: data collected for the same unit of observation at multiple points in time
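To make the microdata/aggregate and cross-sectional/longitudinal distinctions concrete, here is a minimal Python sketch. The records, field names, and helper functions are all hypothetical and invented for illustration; they do not come from any of the data sources discussed here.

```python
from collections import defaultdict

# Hypothetical microdata: one record per person per survey year.
microdata = [
    {"person_id": 1, "year": 2010, "state": "CA", "income": 40000},
    {"person_id": 2, "year": 2010, "state": "CA", "income": 60000},
    {"person_id": 3, "year": 2010, "state": "NY", "income": 50000},
    {"person_id": 1, "year": 2012, "state": "CA", "income": 44000},
    {"person_id": 2, "year": 2012, "state": "CA", "income": 58000},
]


def cross_section(records, year):
    """Cross-sectional view: every record observed at a single point in time."""
    return [r for r in records if r["year"] == year]


def aggregate_mean_income(records):
    """Aggregate view: collapse microdata to a count and mean income per state."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["state"]].append(rec["income"])
    return {
        state: {"n": len(vals), "mean_income": sum(vals) / len(vals)}
        for state, vals in groups.items()
    }


def longitudinal(records):
    """Longitudinal view: the same unit followed across years."""
    history = defaultdict(list)
    for rec in sorted(records, key=lambda r: r["year"]):
        history[rec["person_id"]].append((rec["year"], rec["income"]))
    return dict(history)


wave_2010 = cross_section(microdata, 2010)
by_state_2010 = aggregate_mean_income(wave_2010)
panels = longitudinal(microdata)
```

Note that you can always get from microdata to aggregates, but not the other way around, which is one reason microdata are usually the more flexible (and more restricted) product.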

Data Characteristics

Survey data characteristics
  Well-defined sampling process
  Usually fewer observations
    American Community Survey (~200K/mon)
    GSS (~1500-6000)
    Public Opinion (~1200)
  Individual opinions and characteristics often gathered

Administrative data characteristics
  Restricted universe, but can have large amounts of data (millions of observations)
  Data collected only for program administration
  Other data spotty, even if described in program
  Often linkable to other data
  Rarely includes participant opinion

“Data Exhaust” characteristics
  Often very large
  Skewed populations – unclear sampling frame
  Uncertain but developing capacity to link

Secondary data: origins

Secondary data emerge from several kinds of collection processes:

Survey data: collection for research purposes, coherent research design, well-defined sampling process, intent to generalize
  Examples:
    General Social Survey (GSS)
    National Health Interview Surveys (NHIS)
    Current Population Survey (CPS)

Administrative data: collection for program administration or routine record-keeping
  Examples:
    Marriage Records
    Property Sales
    Hospital Discharge Records
    Court Records

Data exhaust: byproduct or residue of activities
  Examples:
    Twitter collections
    Cell phone location data
    Newspaper articles

Advantages of Secondary Data

Cost: original data collector bears the burden
Comparability: results may be contrasted with others using same/similar sources
Chronology: research process can be shortened dramatically
Coverage: data may address points in time or geographies not directly available to researcher
Credibility: data collection may use specially trained/knowledgeable staff

Disadvantages / Concerns about Secondary Data

Sample design may be unknown/undocumented
Quality of data elements may vary dramatically
Data collection challenges may be difficult to ascertain
Data may be gathered for different purposes / coded in inappropriate ways
Data may be outdated
Cost / Availability: proprietary or confidential data

Break & Introduction

Next we are going to talk about places which serve as repositories for data, and how to locate data….

But before we do that, let’s take a break and talk about your interests and needs.

Secondary data: where can you find it?

ICPSR (Inter-University Consortium for Political and Social Research) is a membership-based organization that collects data from individual researchers, polling agencies, and governmental and international agencies. Data sets cover areas such as political attitudes and behavior patterns, crime and criminal justice, state and national voting records, election studies, census enumerations, economic behavior, family studies, and social attitudes. Holdings at ICPSR are available to UCB subject to IP verification. (www.icpsr.umich.edu)

Archives: Academic

Roper Center: The Roper Center archives data from thousands of surveys with national adult, state, foreign, and special subpopulation samples conducted by Gallup, NORC, CBS, ABC, Harris, the LA Times, the NY Times, and many other polling organizations. Polls are available from as far back as the mid-1930s. Holdings at the Roper Center are, effective as of this month, also available via IP screening. (www.ropercenter.uconn.edu)

Archives: Polling Data

Government: NCES

http://nces.ed.gov/

NCES: Data Access

http://nces.ed.gov/edat/

Government: NCHS

http://www.cdc.gov/nchs/surveys.htm

Government: NSF - College, Doctoral, Post-Doctoral

http://www.nsf.gov/statistics/data-tools.cfm#micro-data

Government: BEA

Government: BLS

http://www.bls.gov/data/

Government: USDA

UKDA: General Purpose Archive

http://discover.ukdataservice.ac.uk/

IEA: TIMSS

OECD: PISA

http://www.oecd.org/pisa/
http://www.asdfree.com/2013/12/analyze-program-for-international.html

Other Archives/Data Resources on the net

Office of Population Research at Princeton
http://opr.princeton.edu/archive/

This archive focuses on data of interest to demographers: data about fertility, mortality, and migration.

The Mexican Migration Project (MMP), an ongoing multidisciplinary study of migration from Mexico to the United States, has released data for 93 communities in 17 states in Mexico. The Latin American Migration Project (LAMP), which extends the MMP design to a study of migration flows originating in other Latin American countries, has now released data for the Dominican Republic, Nicaragua, Costa Rica, Haiti, Peru and Paraguay.

Demographic and Health Surveys
http://www.measuredhs.com/
Surveys from Central and South America, Africa, and Asia dealing with health, family planning, education, and household characteristics. Free, but registration required.

Archives: Distributed

http://thedata.harvard.edu/dvn/
http://thedata.harvard.edu/dvn/faces/site/BrowseDataversesPage.xhtml?initialSort=Released

Add Health

http://www.cpc.unc.edu/projects/addhealth/data


Integrated Public Use Microdata Series (IPUMS)
http://www.ipums.umn.edu/

This is THE starting place if you have any interest in using microdata from the decennial censuses of the US. The documentation provides wording/context, extractions are straightforward, multiple statistical packages are supported.

General Social Survey
http://sda.berkeley.edu/cgi-bin/hsda?harcsda+gss10

The GSS (General Social Survey) is an almost annual "omnibus" personal-interview survey of U.S. households, conducted by the National Opinion Research Center (NORC) since 1972. It covers a broad range of topics, with a strong core of items replicated each year, plus modules that have been fielded concurrently in many other countries since the mid-1980s.


Panel Study of Income Dynamics (PSID) http://psidonline.isr.umich.edu/

The PSID is a longitudinal survey of a representative sample of US individuals and their families, ongoing since 1968. The data were collected each year through 1997, and every other year starting in 1999. Topics include income and wealth, expenses, education, and health care.

The National Survey of Families and Households (NSFH)
http://www.ssc.wisc.edu/nsfh/home.htm

The NSFH fielded three waves of interviews between 1987 and 2002, covering family structure, household division of labor, employment, cohabitation, parenting, health and well-being, etc.


National Historical Geographic Information System (NHGIS)
http://www.nhgis.org/

Provides, free of charge, aggregate census data and GIS compatible boundary files for the United States between 1790 and 2000.

International Social Survey Programme (ISSP)
http://www.gesis.org/en/data_service/issp/data/list_quest_pdf.htm

The ISSP topical modules have focused on the Role of Government (1985, 1990, 1996), Family (1988, 1994, 2002), Social Inequality (1987, 1992, 1999), Social Networks (1986, 2001), Religion (1991, 1998), National Identity (1995, 2003), the Environment (1993, 2000) and Work Orientations (1989, 1997). The most recent modules are fielded in almost 40 countries.


National Bureau of Economic Research

http://nber.org/data/
Downloadable macro and microdata. Includes Consumer Expenditures data, Survey of Program Dynamics (SPD), Survey of Income and Program Participation (SIPP), natality and mortality files from NCHS, a long time series on segregation, and more.

http://nber.org/data/cps.html
A very nice description and organization of the topical supplements to the CPS, as well as the data, documentation, and (in many cases) SAS, SPSS, and Stata syntax to read in the data.


American Religion Data Archive http://www.arda.tm/

Center for International Earth Science Information Network (CIESIN)
http://sedac.ciesin.org/data.html
1980/1990/2000 Census summary files in easily usable format
Boundary files in popular GIS formats

University of Wisconsin-Madison Center for Demography and Ecology ftp site
http://www.ssc.wisc.edu/cde/library/cdeftp.htm

University of Virginia Library
http://fisher.lib.virginia.edu/

Other Data Resources/tools on the net

The DataFerrett

http://dataferrett.census.gov/TheDataWeb/index.html
A collaboration between the CDC and Census Bureau which allows you to extract and download data from:

American Community Survey (ACS)
American Housing Survey (AHS)
Behavioral Risk Factor Surveillance System (BRFSS)
Consumer Expenditure Survey (CES)
Current Population Survey (CPS)
Decennial Census of Population and Housing (Census2000)
National Ambulatory Medical Care Survey (NAMCS)
National Center for Health Statistics Mortality-Underlying Cause-of-Death (MORT)
National Health and Nutrition Examination Survey (HANES)
National Health Interview Survey (NHIS)*
National Hospital Ambulatory Medical Care Survey (NHAMCS)
National Survey of Fishing, Hunting, and Wildlife-Associated Recreation (FHWAR)
Survey of Income and Program Participation (SIPP)
Survey of Program Dynamics (SPD)

Tools to help you extract and use secondary data

www.socialexplorer.com

Local resources to help you

Selected Data Resources at Berkeley

D-Lab http://dlab.berkeley.edu/

UC DATA
http://ucdata.berkeley.edu/

California Census Research Data Center
http://www.census.gov/ces/

Library Data Lab
http://www.lib.berkeley.edu/wikis/datalab/

SDA (Survey Documentation & Analysis)
http://sda.berkeley.edu/

Geospatial Innovation Facility
http://gif.berkeley.edu/

Thank you. (Slides will be posted.)

Road Map (I)

Research Design & Implementation (& Documentation)
Data Collection (& Documentation)
Data Entry (& Documentation)

Primary or Secondary – or both?

Road Map (II)

Data Cleaning (& Documentation)
Reading Data In (& Documentation)
Labelling (& Documentation)
Edit Checks (More Cleaning) (& Documentation)
Weighting (& Documentation)

Road Map (III)

Descriptive Statistics (& Documentation)
Data Transformation (& Documentation)
Record Matching (& Documentation)
Aggregation/Collapsing (& Documentation)

First Stops

Data Cleaning
  Skip Patterns
  Missing Data
  Range Checks

Reading Data In
  Fixed format / Delimited / Hierarchical
  Variable Typing (String / Numeric)

Labelling
  Variables & Values

Edit Checks (More Cleaning)
  Consistency / Imputation

Weighting
  Sampling Probability
  Non-Response
  Population
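Several of these first stops can be sketched in a few lines of Python. Everything below is an assumption made for illustration: the column layout, the missing-value codes (999, 999999), the valid age range, and the sample records are hypothetical, and the weight is treated as a final weight already supplied with the file. With real data, all of these come from the codebook.

```python
import io

# Hypothetical fixed-format layout: (variable, start, end), 0-based, end exclusive.
LAYOUT = [("id", 0, 3), ("age", 3, 6), ("income", 6, 12), ("weight", 12, 17)]
# Assumed codebook conventions, not taken from any real survey.
MISSING_CODES = {"age": {999}, "income": {999999}}
VALID_RANGES = {"age": (0, 120)}

RAW = (
    "001 34 52000 1.50\n"
    "002999 41000 0.80\n"
    "003150999999 1.20\n"
    "004 58 67000 0.50\n"
)


def read_fixed(text):
    """Reading data in: slice each line by column position and type each variable."""
    rows = []
    for line in io.StringIO(text):
        line = line.rstrip("\n")
        if not line:
            continue
        row = {}
        for name, start, end in LAYOUT:
            field = line[start:end].strip()
            if name == "id":
                row[name] = field          # identifiers stay strings
            elif name == "weight":
                row[name] = float(field)
            else:
                row[name] = int(field)
        rows.append(row)
    return rows


def clean(rows):
    """Missing data and range checks: recode missing codes, flag impossible values."""
    problems = []
    for row in rows:
        for name, codes in MISSING_CODES.items():
            if row[name] in codes:
                row[name] = None
        for name, (low, high) in VALID_RANGES.items():
            value = row[name]
            if value is not None and not (low <= value <= high):
                problems.append((row["id"], name, value))
                row[name] = None
    return rows, problems


def weighted_mean(rows, name):
    """Weighting: mean over valid cases, each counted according to its weight."""
    pairs = [(r[name], r["weight"]) for r in rows if r[name] is not None]
    total_weight = sum(w for _, w in pairs)
    return sum(v * w for v, w in pairs) / total_weight


rows, problems = clean(read_fixed(RAW))
mean_age = weighted_mean(rows, "age")
```

The point of keeping a `problems` list rather than silently dropping values is documentation: you want a record of every edit you made to someone else's data.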

Along the way

Descriptive Statistics
  Min / Mean / Ptiles / Max / Valid N

Data Transformation
  Recoding
  Complex
  Scales & Indices

Record Matching
  Linking (1-1) / (1-Many)

Aggregation/Collapsing
  Summary Statistics
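As a rough illustration of these steps, the sketch below runs descriptive statistics, a recode, a one-to-many record match, and a collapse on a tiny made-up example. The person and household records, the age bands, and the function names are all hypothetical; the median stands in for the percentiles.

```python
import statistics

# Hypothetical person-level file (many persons per household).
persons = [
    {"person_id": 1, "hh_id": "A", "age": 34, "income": 52000},
    {"person_id": 2, "hh_id": "A", "age": 36, "income": None},  # item missing
    {"person_id": 3, "hh_id": "B", "age": 71, "income": 18000},
    {"person_id": 4, "hh_id": "C", "age": 19, "income": 9000},
]
# Hypothetical household-level file (one record per household).
households = {
    "A": {"region": "West"},
    "B": {"region": "South"},
    "C": {"region": "West"},
}


def describe(values):
    """Descriptive statistics over valid (non-missing) cases only."""
    valid = sorted(v for v in values if v is not None)
    return {
        "valid_n": len(valid),
        "min": valid[0],
        "mean": statistics.mean(valid),
        "median": statistics.median(valid),  # the 50th percentile
        "max": valid[-1],
    }


def recode_age(age):
    """Data transformation: recode a continuous variable into assumed bands."""
    if age < 25:
        return "under 25"
    if age < 65:
        return "25-64"
    return "65+"


def link(person_rows, household_lookup):
    """Record matching (1-many): attach household fields to each person."""
    linked_rows = []
    for person in person_rows:
        merged = dict(person)
        merged.update(household_lookup[person["hh_id"]])
        merged["age_group"] = recode_age(person["age"])
        linked_rows.append(merged)
    return linked_rows


def collapse(records, key, field):
    """Aggregation: total of `field` within each value of `key`, skipping missing."""
    totals = {}
    for rec in records:
        if rec[field] is not None:
            totals[rec[key]] = totals.get(rec[key], 0) + rec[field]
    return totals


income_summary = describe(p["income"] for p in persons)
linked = link(persons, households)
income_by_region = collapse(linked, "region", "income")
```

Checking `valid_n` against the number of records before and after each step is a cheap way to catch a match or recode that quietly dropped cases.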

Planning Your Trip and Getting on the Road

Research Design & Implementation
  What do you want to be able to say at the end?
  Who/what are your units of analysis?
  What is the universe of the units you want to talk about?
  How are the units you observe selected from the universe?
  What instrument(s) are used to collect data?

Data Collection
  How was the sampling strategy implemented?
  Non-response – unit-level, item-level – and followup

Data Entry
  Coding, Collapsing, Open-ended
  Validation
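The difference between unit-level and item-level non-response can be shown with a small Python sketch. The sample below is hypothetical: a sampled unit that never responded is stored as `None`, and an unanswered item within a completed interview is stored as `None` for that item. Real files will use their own conventions, so check the documentation.

```python
# Hypothetical sample: unit id -> answers, or None if the unit never responded.
sample = {
    "u1": {"age": 34, "income": 52000},
    "u2": None,                            # unit non-response
    "u3": {"age": 71, "income": None},     # item non-response on income
    "u4": {"age": None, "income": None},   # responded, skipped both items
    "u5": {"age": 19, "income": 9000},
}


def unit_response_rate(units):
    """Share of sampled units that responded at all."""
    responded = sum(1 for answers in units.values() if answers is not None)
    return responded / len(units)


def item_nonresponse_rates(units):
    """Among respondents, the share with a missing answer on each item."""
    respondents = [a for a in units.values() if a is not None]
    items = sorted({item for answers in respondents for item in answers})
    return {
        item: sum(1 for a in respondents if a.get(item) is None) / len(respondents)
        for item in items
    }


unit_rate = unit_response_rate(sample)
item_rates = item_nonresponse_rates(sample)
```

Both rates are worth looking for in the documentation of any secondary source before you commit to it, since neither can be repaired after the fact.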