Lars Lyberg Stockholm University Frimis, November 11, 2015 What’s Going on in Survey Research?


A Changing Survey Landscape

• Probability and nonprobability sampling

• Total survey error

• New technology

• Big data

• International surveys

• Hard-to-survey populations

Probability Sample

Every object in the target population has a known non-zero probability of being selected

• Very few samples in market, opinion and social research live up to this definition

• Reasons include nonresponse, frame problems, and special research goals
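
The definition above matters because known, non-zero selection probabilities are what make design-based estimation possible. As a minimal sketch (the function name and the toy numbers are assumptions for illustration, not from the slides), the Horvitz-Thompson estimator of a population total weights each sampled value by the inverse of its selection probability:

```python
def horvitz_thompson_total(values, inclusion_probs):
    """Estimate a population total by weighting each sampled value by 1/pi_i."""
    if len(values) != len(inclusion_probs):
        raise ValueError("values and inclusion_probs must have the same length")
    total = 0.0
    for y, pi in zip(values, inclusion_probs):
        # A zero or unknown probability breaks the estimator, which is
        # exactly why the definition requires known, non-zero probabilities.
        if not 0.0 < pi <= 1.0:
            raise ValueError("each inclusion probability must be in (0, 1]")
        total += y / pi
    return total


# Toy example with made-up numbers: three sampled units.
sample_values = [10.0, 20.0, 30.0]
sample_probs = [0.5, 0.25, 0.125]
estimate = horvitz_thompson_total(sample_values, sample_probs)
print(estimate)  # 340.0
```

With nonresponse or frame problems the true selection probabilities are no longer the design probabilities, so the weights in this sketch would be wrong in practice.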

The Origins of Probability Sampling

• Introduced in 1934

• Basically a financial breakthrough

• Data collection was expensive

• To be able to say something about a population based on a relatively small sample and a margin of error to go with that was almost like magic
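
The "margin of error" can be illustrated with the textbook formula for a proportion. This is a hedged sketch that assumes simple random sampling, full response, and a large population (no finite population correction); the function name and the default `z=1.96` for an approximate 95% level are choices made here, not taken from the slides:

```python
import math


def margin_of_error(p_hat, n, z=1.96):
    """Approximate margin of error for a sample proportion under simple random sampling."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p_hat <= 1.0:
        raise ValueError("p_hat must be between 0 and 1")
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)


# A sample of 1,000 with an estimated proportion of 50%.
moe = margin_of_error(0.5, 1000)
print(round(moe, 3))  # 0.031, i.e. about 3 percentage points
```

The formula covers sampling error only, so it says nothing about nonresponse, frame, or measurement problems.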

Problems

• It took a while for probability sampling to be accepted

• The sampling theory did not handle other error sources very well

• Basically the only “allowed” error source is sampling

Issues Associated with Sampling

• Ridiculously low response rates

• Increased demands for timely data

• Access to large volumes of (inexpensive) data

• Margins of error are understated

• Discussions about nonprobability sampling

• New less expensive ways of collecting data

• The advent of opt-in panels

• Proper inference not always possible

Examples of Statements

• Probability sampling is the only reasonable way to achieve representativity

• Probability samples are not representative due to nonresponse

• There is no theoretical foundation for opt-in panels

• There are theories and methods based on modeling and weighting

More Statements

• Studies show that probability sampling is more accurate than nonprobability sampling

• Some of these comparisons are flawed since weighting of the nonprobability samples has not been sufficiently ambitious

• Even though results from opt-in panels might be biased to some extent they come at a fraction of the costs for a probability sample and much quicker
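
To make the idea of weighting a nonprobability sample concrete, here is a minimal post-stratification sketch. The group labels, answers, and population shares are hypothetical, and the function names are introduced here for illustration only:

```python
from collections import Counter


def poststratification_weights(sample_groups, population_shares):
    """Return one weight per respondent so weighted group shares match the population."""
    n = len(sample_groups)
    counts = Counter(sample_groups)
    weights = []
    for group in sample_groups:
        if group not in population_shares:
            raise KeyError(f"no population share for group {group!r}")
        sample_share = counts[group] / n
        weights.append(population_shares[group] / sample_share)
    return weights


def weighted_mean(values, weights):
    """Weighted average of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical opt-in panel in which young respondents are over-represented.
groups = ["young", "young", "young", "old"]
answers = [1, 1, 0, 0]  # 1 = "yes"
pop_shares = {"young": 0.5, "old": 0.5}

weights = poststratification_weights(groups, pop_shares)
unweighted = sum(answers) / len(answers)
weighted = weighted_mean(answers, weights)
print(unweighted, round(weighted, 3))  # 0.5 0.333
```

Weighting of this kind removes bias only if, within each group, panel members answer like the rest of the population. That is an assumption, and it is the point on which the statements above disagree.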

The Current Situation

• Both probability and nonprobability sampling have problems

• Bayesian inference gaining ground

• Lots of experimentation needed

• Quality criteria need to be defined

The Recent British Election

• Whilst the Conservatives won convincingly, 18% of the campaign polls had suggested a dead heat and a further 46% had suggested Labour leads.

• Of the 36% of polls that registered Conservative leads, three out of four showed leads that were less than half the actual outcome.

• Both probability sampling and panels failed.

• The British Polling Council has initiated an investigation into why things went wrong.

Total Survey Error

• Sampling error: due to selecting a sample instead of the entire population

• Nonsampling error: errors due to mistakes or system deficiencies

Risk of Bias and Variance by Error Source

MSE component            Variance   Bias
Sampling error           High       Low
Specification error      Low        High
Nonresponse error        Low        High
Frame error              Low        High
Measurement error        High       High
Data processing error    High       High
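
For reference, the mean squared error (MSE) behind this table is the standard sum of variance and squared bias. The second line is one common way of splitting it by error source. It is a sketch that assumes the error sources are uncorrelated, so that their variances add; the index k (running over the six error sources in the table) and the symbols B_k and Var_k are notation introduced here:

```latex
\mathrm{MSE}(\hat{\theta})
  = \mathrm{Var}(\hat{\theta}) + \bigl[\mathrm{Bias}(\hat{\theta})\bigr]^{2},
\qquad
\mathrm{Bias}(\hat{\theta}) = \mathrm{E}(\hat{\theta}) - \theta

\mathrm{MSE}(\hat{\theta})
  \approx \sum_{k} \mathrm{Var}_{k} + \Bigl(\sum_{k} B_{k}\Bigr)^{2}
```

Read this way, a "High" entry in the Variance column points to a large Var_k and a "High" entry in the Bias column to a large B_k for that source.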

What to do about Total Survey Error

• Minimize variances and biases through QA, QC, QM, and best practices

• Estimate the size of the total error

• Apply risk management

New Technology

• Smartphones as a data collection mode

• Social media as an information source

• GPS

Big data is a term that describes data sets so large and complex that they cannot be processed and analyzed with conventional software systems.

Sources:

• Transaction databases

• Social media

• The Internet of Things

A Black Swan

A black swan is an undirected and unpredicted event. It is rare and has an extreme impact, but in retrospect we saw it coming.

• Internet - yes

• 9/11 - yes

• The Lehman Brothers crash - yes

• The advent of Big Data - ?

The Three V’s

• Volume: Tera- to Peta- to Exabytes of data, stored and processed

• Variability: structured, unstructured, text, images, maps, multimedia; varying sources

• Velocity: streaming data, from seconds to milliseconds

• Veracity: Can we trust Big Data? Can we use it? Proxies, indicators

Big Data

Examples of Big Data with use or potential use in statistics production

• Google searches (flu trends)

• Traffic camera data

• Retail scanner data

• Credit card and transaction data

• GPS data

Hype of Big Data

Gartner’s hype curve

Source: Wikipedia

Happiness and Well-being

The common survey question: How satisfied are you with your life?

Big data (BD) alternative:

• 10 million tweets that are coded for happiness (rainbow, love, beauty, hope, wonderful, wine…) and non-happiness (damn, boo, ugly, smoke, hate, lied,…)

• Happiest states: Hawaii, Utah, Idaho, Maine, Washington

• Saddest states: Louisiana, Mississippi, Maryland, Michigan, Delaware
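
The coding step can be sketched as a simple word-list score. This is an assumption about how such coding might work, not the method of the study behind the slide: the word sets contain only the examples listed above, and the scoring rule (happy minus unhappy, divided by the number of coded words) and the example tweets are invented for illustration.

```python
import re

# Only the example words shown on the slide; the real lists are longer.
HAPPY_WORDS = {"rainbow", "love", "beauty", "hope", "wonderful", "wine"}
UNHAPPY_WORDS = {"damn", "boo", "ugly", "smoke", "hate", "lied"}


def happiness_score(text):
    """Return (happy - unhappy) / coded words, or 0.0 if no coded word occurs."""
    tokens = re.findall(r"[a-z']+", text.lower())
    happy = sum(token in HAPPY_WORDS for token in tokens)
    unhappy = sum(token in UNHAPPY_WORDS for token in tokens)
    coded = happy + unhappy
    if coded == 0:
        return 0.0
    return (happy - unhappy) / coded


# Invented example tweets.
tweets = [
    "I love this wonderful rainbow",
    "Damn, I hate this ugly weather",
    "Wine tonight, but he lied",
]
scores = [happiness_score(tweet) for tweet in tweets]
print(scores)  # [1.0, -1.0, 0.0]
```

A score like this measures word use by people who tweet, which is a proxy for the survey concept of life satisfaction and not the same thing.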

Big Data Challenges

• Data quality

• Data analytics

• Confidentiality concerns

Mono Surveys vs 3MC Surveys

• 3MC = multinational, multiregional and multicultural contexts

• One population vs more than one population

• In 3MC surveys, TSE or MSE as planning criteria must be complemented by equivalence or comparability

• 3MC surveys need to be designed with a mixture of standardization and flexibility to achieve operational equivalence

• Implementation and control are much more demanding in 3MC surveys

Examples of 3MC Surveys

• Adult literacy (IALS)

• Adult skills (PIAAC)

• Student assessment (PISA)

• European Social Survey (ESS)

• World values (WVS)

• Health, ageing and retirement (SHARE)

• Electoral systems (CSES)

• Gallup World Poll (GWP)

• European Statistical System

• Marketing surveys on customer satisfaction, brand names, attitudes, finances etc

• Pure entertainment surveys

Some Special Features in a 3MC Survey Setting

• Comparability is the main goal

• Concepts must have a uniform meaning

• Risk management differs

• Financial and methodological resources differ (3MC’s are expensive)

• National and international interests are in conflict

• Scientific challenge

• Administrative challenge

• National pride is at stake

Response Rates in PIAAC, Cycle I (%)

• Australia 71

• Austria 53

• Belgium 62

• Canada 58

• Cyprus 73

• Czech Republic 66

• Denmark 50

• Estonia 63

• Finland 66

• Germany 55

• Ireland 72

• Italy 56

• Japan 50

• Korea 75

• Netherlands 51

• Norway 62

• Poland 54

• Slovak Republic 66

• Spain 48

• Sweden 45

• UK-England 59

• UK-Northern Ireland 65

• USA 70
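
A short sketch for summarizing the rates listed above. The figures are copied from the list; the 60% cut-off and the variable names are arbitrary choices made here, and the mean is a plain unweighted average across entries, not a population-weighted figure:

```python
# PIAAC Cycle I response rates (%) as listed above.
response_rates = {
    "Australia": 71, "Austria": 53, "Belgium": 62, "Canada": 58,
    "Cyprus": 73, "Czech Republic": 66, "Denmark": 50, "Estonia": 63,
    "Finland": 66, "Germany": 55, "Ireland": 72, "Italy": 56,
    "Japan": 50, "Korea": 75, "Netherlands": 51, "Norway": 62,
    "Poland": 54, "Slovak Republic": 66, "Spain": 48, "Sweden": 45,
    "UK-England": 59, "UK-Northern Ireland": 65, "USA": 70,
}

# Entries with the lowest and highest rate, as (name, rate) pairs.
lowest = min(response_rates.items(), key=lambda item: item[1])
highest = max(response_rates.items(), key=lambda item: item[1])

# Unweighted mean across the 23 entries.
mean_rate = sum(response_rates.values()) / len(response_rates)

# Entries below an arbitrary 60% threshold.
below_60 = sorted(name for name, rate in response_rates.items() if rate < 60)

print(lowest, highest, round(mean_rate, 1), len(below_60))
# ('Sweden', 45) ('Korea', 75) 60.4 11
```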

Challenges in 3MC Surveys

• Design (what can vary, what is rigid)

• Translation

• Adaptation

• Culturally different error structures

• Data fabrication

• Quality control

• Often too many countries involved

Hard-to-survey Populations (H2S)

• Homeless

• Prostitutes

• Refugees

• Victims

• Persons with disabilities

• Minorities

• Illegal aliens

• Rare (fans, musicians, language groups, extremists)

• Mobile populations (nomads, migrants, students)

Methodological Approaches to H2S

• Innovative sampling methods

• Venue-based (red light districts, voting facilities)

• Indirect sampling

• Snowball and respondent-driven

• Qualitative studies (anthropology etc)

• Formative research

The End of Theory

Faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete. Petabytes allow us to say: ‘Correlation is enough.’ We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.

Chris Anderson 2008

The Future of Surveys is Uncertain

• Too many surveys, too many off-the-shelf tools

• Active participation going down

• Passive participation going up

• Many problems are global

• Decision makers need data fast and at low cost

• The design-based approach needs refreshment

• Decision makers need data from different sources

• The big survey institutes are worried

Endnote

• Our industry needs innovations and less fighting

• We need to merge with other research cultures

• We need to know more about combining data sources

• We need to account for all major sources of uncertainty that are associated with data collection and analysis of data

• We need to develop new theories for handling error structures, combining data sources, and reaching equivalence

Over and Out
