
WELCOME [sites.nationalacademies.org]sites.nationalacademies.org/cs/groups/dbassesite/... · S3. The AWS instances and services are terminated 7 1) Location of Results 2) Location


WELCOME

Brian Harris-Kojetin

Committee on National Statistics, Director

Brian Moyer

Bureau of Economic Analysis, Director

Big Data Day Executive Sponsor


CNSTAT Panel

Bob Groves, Georgetown University, CNSTAT Chair

Sallie Keller, Virginia Tech University

Jerry Reiter, Duke University


Scanner Data in the Economic Research Service Consumer Food Data Program

Megan Sweitzer

USDA Economic Research Service

Big Data Day

May 11, 2018


Proprietary Scanner Data

• Consumer purchase transactions

• Used for marketing research

• Household purchase survey

• Retail point-of-sale (POS) data

– Purchase transaction records collected from store POS systems of 60,000 stores

– Each year of data contains 6.5 billion records

• Used for research projects, program evaluations, regulatory impact analyses, data products


Challenges

Accommodating size of data

• About 5 TB of data over 9 years

Extensive effort to clean, organize, and document

• Checking completeness and accuracy

• Data format and organization

• Understanding components and variables

Using data appropriately

• Designing suitable studies given properties of data

• Interpreting results in appropriate context


Understanding and Evaluating Data Quality

Documentation

Statistical properties

Coverage

Representativeness


Filling Gaps in the Data

Developing weights for the retail stores

Imputing missing prices in household data

Linking stores in the household and retail data

Acquiring new data from the vendor

• SNAP and WIC variables

• Less-restricted data


Creating Linkages with Other Datasets

USDA Nutrient and Food Composition Databases

• Linking foods as purchased to foods as consumed

• Linking food prices and dietary recall data

USDA’s National Household Food Acquisition and Purchase Survey (FoodAPS)

• Product identification

• Food environment


Thank you!

Megan Sweitzer

ERS Food Economics Division

[email protected]

www.ers.usda.gov


“The Art of the Possible”

Census Enterprise Data Lake Proof-of-Concept

Big Data Day Presentation

Nitin Naik & Adley Kloth

Chief Technology Office

IT Directorate

May 11, 2018


Topics

Census Survey Operation As-Is Today

Proposed Solution

Proof-of-Concept

Census Survey Operation Future Target State


Census As-Is Today

Decentralized Data Management

• Multiple copies/instances

• Decentralized data stewardship with no Master Metadata Model at Directorate or Enterprise level

• Difficult to share or link the data or even metadata between project or research teams or Directorates

Processing and Analytic Logic Constraints

• Processing data using coding-intensive methods in SAS or Oracle, and in numerous systems with questionable documentation

• Proper curation and storage of the data processing code is limited

• Reproducibility of results is very limited

Decentralized Data Storage

• Data not stored centrally for access

• Most datasets are stored as SAS files or Oracle DBs

• Most data have file-level access control

• Current data handling process is not scalable for the administrative and 3rd-party datasets required to improve accuracy and quality of data products

Decentralized security and privacy implementation

• Creating severe system inefficiencies

• Cumbersome governance and security measures make tracking disclosure of Title 13 and Title 26 data difficult

• Limited auditing and usage monitoring

Technology Constraints

• Multiple copies stored on different servers/systems due to siloed technology deployments based on project funding

• Inability to handle large datasets with complex calculations

• Lack of access to software and tools for deeper data analysis

[Diagram: Census Data Limitation. Across the survey portfolio (Demographics, Decennial, Economics, Other Programs), each survey (Survey N, Survey N + 1, …) maintains its own Security, Governance, Infrastructure Management, Data Management, and Analytics stack. Axes: surveys S1, S2, S3, S4, …; time periods M1, M2, M3, … and Y1, Y2, …, Yn.]


Proposed Solution: Enterprise Data Lake in the Cloud

[Diagram: Enterprise Data Lake, wrapped in Security & Privacy.

Data sources: Legacy Application Data; Streaming Data; RDBMS Data; Unstructured Data (images, videos, BLOBs); Big Data Stores; Flat Files (.csv, .txt); Social Media; External Data (Address, Geo, Weather); SAS Datasets.

Data services: Machine Learning Algorithms; Hyper Search; Cognitive Learning APIs; Meta Tagging; Micro Services; Mass Parallel Processing; Text Analytics; In-Memory Analytics; Data Linkage; artificial intelligence; cognitive computing.

Lifecycle: Response Collection; Imputation & Estimation; Disclosure Avoidance; Tabulation; Publication.

Outputs: Production Applications; Analytics & Reporting; Internal User Portal; Mobile Apps; Census Website; Dashboards; Artificial Intelligence Systems; Cognitive Computing Systems.]


Enterprise Data Lake Proof-of-Concept (POC) Strategy

Team: Econ, Demo, CTO

Problem: How can we support migration of legacy datasets, data treatment, and analytics SAS code in the Cloud and Big Data platforms?

Operation Mode: 8-Week Effort with Weekly Sprints


Enterprise Data Lake POC Scope

What Does “Success” Look Like?

• Data Lake Web User Interface

• Data access controls

• Pluggable Analytic pipelines (EMR, Hortonworks, etc.)

• “App store”

• 2012 Commodity Flow Survey

• Survey of Income and Program Participation

• Legacy SAS Code

• Accuracy and Performance

• Code Handoff

• Training and Documentation

• Roadmap

Build, Run, Transfer


Enterprise Data Lake POC Execution Steps

How Did We Do It?

1. User initiates Analysis Routine based on selected data.

2. A NodeJS Lambda function launches EMR/HDX via SDK/API. Inputs: 1) Analysis Routine, 2) Data File, 3) AD Group.

3. A Hadoop cluster is launched and bootstrapped to install Accumulo and Hive.

4. Custom Java routine creates Accumulo rights and data tables and loads the data.

5. Hive tables are created based on data visible to user in Accumulo.

6. A SAS AMI is launched with Hive connection details.

7. A SAS Program is run and results are stored in S3. The AWS instances and services are terminated. Outputs: 1) Location of Results, 2) Location of Logs.

8. User views results.

Automation using the same standard images for OS, HDP, Accumulo, Hive, SAS and R was used for CFS and SIPP.


Enterprise Data Lake POC Outcomes

Accomplishments

• Detail research and recommendations for Hortonworks™ running in AWS

• Detail research and recommendations for SAS™ running in AWS

• Demonstrate a web-accessible data lake that provides:

– User authentication and authorization

– Control of access roles and rights on a survey

– In-context launch of SAS and SQL that supports column-level access control based on LDAP Group and one routine executed against one or more data files

• Replicate output of existing DEMOGRAPHIC and ECONOMIC data and procedures that:

– Matches results of existing analysis routines

– Completes faster than current capabilities
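The column-level access control listed among the outcomes can be illustrated with a small sketch. This is not the POC's Accumulo or Hive code; it only shows the idea of projecting records down to the columns a user's directory groups may read. The group names, column names, and sample rows are all hypothetical.

```python
# Hypothetical mapping from column name to the directory groups allowed to read it.
COLUMN_GROUPS = {
    "record_id": {"econ_analysts", "demo_analysts"},
    "industry_code": {"econ_analysts", "demo_analysts"},
    "shipment_value": {"econ_analysts"},
}


def visible_columns(user_groups, column_groups=COLUMN_GROUPS):
    """Return the columns a user may read, in declaration order."""
    groups = set(user_groups)
    return [col for col, allowed in column_groups.items() if allowed & groups]


def project_rows(rows, user_groups, column_groups=COLUMN_GROUPS):
    """Keep only the visible columns of each record (one dict per record)."""
    cols = visible_columns(user_groups, column_groups)
    return [{c: row[c] for c in cols if c in row} for row in rows]


# Hypothetical sample records.
ROWS = [
    {"record_id": 1, "industry_code": "4841", "shipment_value": 1200.0},
    {"record_id": 2, "industry_code": "3254", "shipment_value": 560.5},
]

if __name__ == "__main__":
    print(project_rows(ROWS, ["demo_analysts"]))
```

A user with no matching group sees no columns at all, which is the conservative default for restricted data.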


Census Future Target State

Centralized Data Management

• Only one instance of data

• Data lineage available

• Can share or link the data and/or metadata between project or research teams and Directorates

Processing and Analytic Logic Constraints

• Processing data using coding-intensive methods in SAS or Oracle in one environment

• Curation and storage of the data processing code is fully visible

• Results are fully reproducible

Centralized Data Storage

• Data stored centrally for access with access control at file and row/column/cell level

• Data handling scalable for administrative and 3rd-party datasets

Centralized security and privacy implementation

• Easier governance and security measures

• Easier tracking of disclosure of Title 13 and Title 26 data

• Expanded auditing and usage monitoring

Technology Capabilities

• Automated Provisioning of Security Certified Standardized PaaS

• Tight Integration of Storage and Compute for Faster Analysis

• Ability to handle large datasets with complex calculations

• Use of new tools (R, Python, Graph, Spark, Solr, Mahout, Palo)

• Cost Chargeback based on utilization

[Diagram: Enterprise Data Lake target architecture. Inputs: flat files, survey data, AdRecs, SAS datasets, RDBMS datasets, analytical extracts. EDL Standard Services: Standardized Cloud Services and Standardized Census Data Services, with Governance and Security, supporting Enterprise and Directorate Analytics. Service models: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Software as a Service (SaaS), Data as a Service (DaaS).]


Census Future State: To-Be Vision

Theme: Any Data, Many Tools, Easier Operation = Better Analytics, Quality Products, Faster Time-to-Publish


Feedback and Q&A


Thank You!


RAAS SENSITIVE BUT UNCLASSIFIED

WORKING DRAFT – for research purposes only

Big Data Day – Recommendation Systems at the IRS

5/11/2018


WORKING DRAFT – for research purposes only

| RESEARCH, APPLIED ANALYTICS AND STATISTICS (RAAS)

SENSITIVE BUT UNCLASSIFIED

May 2018

Challenges in form value verification

IRS processes millions of forms that require an assessment of accuracy of form values.

In an early stage of the process, experienced employees manually evaluate forms to flag potentially inaccurate values.

Challenges with Current Process

Time-consuming and costly

• > 300 employee-weeks away from regular duties

• Employees who perform this task are typically very experienced

Rushed

• High volume of forms requires reviewers to minimize time spent on each form

Inconsistent

• While standard operating procedures exist, adherence varies widely

• Furthermore, the manual process is complicated and unwieldy


Netflix Movie Rating Challenge

Recommendation algorithm research was spurred by the Netflix movie recommendation challenge

• Netflix initially had limited selection of popular movies and wanted to promote lesser-known movies

• Sponsored a $1M competition to estimate customer movie ratings

– Dataset involved >17,000 movies for almost 500,000 users

– Required estimation in a sparse high-dimensional space

• Recommendation problem can be posed as a problem in sparse matrix estimation

• Competition won by a team employing a variety of algorithms combined in an ensemble approach

Ratings matrix (movies: Avatar, Citizen Kane, Mean Girls, Jurassic Park)

User        Known       Estimated
Stephanie   2, 5        3.1, 4.3
Andy        3, 1        3.5, 2.8
Ryan        5, 4        4.8, 4.1
Caitlin     4           2.1, 3.6, 3.2
Ashley      2, 3        4.2, 2.6
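Posing recommendation as sparse matrix estimation can be sketched with a small low-rank factorization fitted only on the known cells. This is one common approach, not the winning ensemble and not the IRS method. The placement of ratings in the matrix below is hypothetical, as are the rank, learning rate, and regularization values.

```python
import random


def factorize(ratings, n_users, n_items, rank=2, steps=2000, lr=0.01, reg=0.02, seed=0):
    """Fit user factors P and item factors Q by SGD on the known cells only."""
    rng = random.Random(seed)
    P = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(steps):
        for (u, i), r in ratings.items():
            err = r - sum(P[u][k] * Q[i][k] for k in range(rank))
            for k in range(rank):
                pu, qi = P[u][k], Q[i][k]
                # Gradient step on squared error with L2 regularization.
                P[u][k] += lr * (err * qi - reg * pu)
                Q[i][k] += lr * (err * pu - reg * qi)
    return P, Q


def predict(P, Q, u, i):
    """Estimated rating for user u and item i, known or not."""
    return sum(a * b for a, b in zip(P[u], Q[i]))


def rmse(ratings, P, Q):
    """Root-mean-square error over the known cells."""
    se = sum((r - predict(P, Q, u, i)) ** 2 for (u, i), r in ratings.items())
    return (se / len(ratings)) ** 0.5


# Hypothetical sparse matrix: (user index, movie index) -> known rating.
RATINGS = {
    (0, 0): 2, (0, 2): 5,
    (1, 1): 3, (1, 3): 1,
    (2, 0): 5, (2, 3): 4,
    (3, 1): 4,
    (4, 2): 2, (4, 3): 3,
}

P, Q = factorize(RATINGS, n_users=5, n_items=4)

if __name__ == "__main__":
    # Estimate one cell that has no known rating.
    print(round(predict(P, Q, 3, 0), 2))
```

The empty cells are never touched during fitting; their estimates come entirely from the factors learned on the observed cells.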


Recommendation system at the IRS

Identifying non-compliant returns and their component issues

Sparse data algorithms forecast all line items on a form

Flexible approach allows for scoring attachments – even rarely used forms or line items

The differences between line item estimates and actuals can be aggregated across a single form to derive a form-level score

Line-item forecasts are obtained without using any training data

Table: actual value (forecast value) by form line item for Form 1, Form 2, and Form 3; some cells are marked as possible inaccurate values

Line item 1        125,000 (120,000)    976,789 (890,341)
Line item 2        -98,761 (-94,081)
Line item 3        10,000 (9,000)       200,000 (189,591)
Line item 4        95,657 (10,000)
Line item 5        67,932 (65,329)
Line item 6        9,657 (8,492)        3,400 (4,021)
Line item 7        45,000 (4,254)
Line item 8        23,000 (24,912)
Line item 9        89,000 (29,301)
Line item 10       34,000 (75,025)      25,341 (27,124)
Line item 11       9,521 (9,964)
Line item 12       34,567 (39,231)
Form-level score   0.76                 0.39                 0.44
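Aggregating line-item differences into a form-level score can be sketched as follows. The slide does not give the aggregation formula, so the relative-deviation measure, the averaging, and the flagging threshold here are assumptions chosen for illustration; the sample values are in the style of the slide's table.

```python
def line_item_deviation(actual, forecast):
    """Relative gap between actual and forecast (assumed measure, not the IRS formula)."""
    scale = max(abs(actual), abs(forecast))
    return 0.0 if scale == 0 else abs(actual - forecast) / scale


def form_score(items):
    """Average deviation over a form's filled line items.

    items maps line-item name -> (actual, forecast).
    """
    if not items:
        return 0.0
    return sum(line_item_deviation(a, f) for a, f in items.values()) / len(items)


def flagged_items(items, threshold):
    """Line items whose deviation exceeds a hypothetical review threshold."""
    return [name for name, (a, f) in items.items() if line_item_deviation(a, f) > threshold]


# Illustrative (actual, forecast) pairs.
ITEMS = {
    "Line item 1": (125_000, 120_000),
    "Line item 4": (95_657, 10_000),
    "Line item 5": (67_932, 65_329),
}

if __name__ == "__main__":
    print(round(form_score(ITEMS), 2), flagged_items(ITEMS, 0.25))
```

Dividing by the larger magnitude keeps each deviation comparable across line items of very different dollar sizes, so one large line item does not dominate the form score.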


An individual return’s line items are estimated using a baseline model trained on millions of forms

Training: Estimate baseline model (μ, Σ) using millions of concatenated forms (4.2MM forms; Main Form plus Attachments 1, 2, and 3; > 500 variables)

Testing: Given an individual form, use model to forecast that form’s line items, and compare

Form line item   Actuals ($)   Forecast ($)   Corrections
Line item 1      125,000       324,720        313,378
Line item 2
Line item 3      95,657        94,154         93,234
Line item 4      10,000        10,175         10,456
Line item 5
Line item 6      45,000        42,500         42,100
Line item 7
Line item 8
Line item 9      34,676        39,231         38,123

Performance is measured by comparison to actual corrections
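The slide labels the baseline model with μ and Σ, which suggests a mean vector and covariance matrix over the concatenated line items. The slides do not state how forecasts are derived from them. One standard possibility, shown here only as an assumption, is the conditional mean of a multivariate normal, where the subscript o marks the line items observed on a form and m marks the line item being forecast:

```latex
\hat{x}_m \;=\; \mu_m + \Sigma_{mo}\,\Sigma_{oo}^{-1}\,(x_o - \mu_o)
```

Under that assumption, a line item is flagged when its reported value lies far from \hat{x}_m relative to the conditional variance \Sigma_{mm} - \Sigma_{mo}\Sigma_{oo}^{-1}\Sigma_{om}.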


On classifier-selected returns, the recommendation system outperforms classifiers in issue identification

The top model improves upon classifier performance on several fronts

[Chart: percentage change relative to classifiers. On-Target: 63%; Capture: 21%; Misdirection: -43%]

Metrics

• 63% increase in on-target predictions – the portion of anomalies flagged that resulted in a correction

• 21% increase in capture rate – the portion of all corrections made that were also identified in classification

• 43% decrease in misdirection – the number of flagged anomalies that did not result in a correction for every flagged anomaly that did result in a correction

Source: XRDB2, 1120S MeF 1M Randomly Sampled Returns from TY13-15 subsetted to returns in CDE
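The three metrics defined on this slide can be computed directly from the set of flagged line items and the set of line items that were actually corrected. The sketch below follows the slide's definitions; the function name and the sample identifiers are hypothetical.

```python
def evaluate(flagged, corrected):
    """Compute on-target rate, capture rate, and misdirection.

    flagged:   identifiers of anomalies the model flagged
    corrected: identifiers of items that received a correction
    """
    flagged, corrected = set(flagged), set(corrected)
    hits = flagged & corrected      # flagged and corrected
    misses = flagged - corrected    # flagged but not corrected
    return {
        # Portion of flagged anomalies that resulted in a correction.
        "on_target": len(hits) / len(flagged) if flagged else 0.0,
        # Portion of all corrections that were also flagged.
        "capture": len(hits) / len(corrected) if corrected else 0.0,
        # Uncorrected flags per corrected flag.
        "misdirection": len(misses) / len(hits) if hits else float("inf"),
    }


if __name__ == "__main__":
    print(evaluate(flagged={1, 2, 3, 4}, corrected={2, 3, 5}))
```

On-target corresponds to what is usually called precision and capture to recall; misdirection is the ratio of false positives to true positives.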


— U.S. BUREAU OF LABOR STATISTICS • bls.gov

A Big Data Concept by the Bureau of Labor Statistics

MEASURING FOREIGN DIRECT INVESTMENT’S

IMPACT ON THE LABOR MARKET

Erik Friesenhahn

Economist, Bureau of Labor Statistics

May 11, 2018


Overview

Collaborative effort between the Bureau of Economic Analysis (BEA) and the Bureau of Labor Statistics (BLS)

Add value to existing data

• BEA employment data

– national and state

– some industry detail

• BLS can publish:

– employment and wage data with greater industry and geographic detail

– occupational data

BLS last published limited FDI data in 1994


What data sources did we use?

BEA Foreign Direct Investment in the United States

BLS Quarterly Census of Employment and Wages (QCEW)

BLS Occupational Employment Statistics (OES)


Bureau of Economic Analysis (BEA)

2012 benchmark Survey of Foreign Direct Investment in the United States

• 10% or greater foreign ownership

• data collected based upon a company’s fiscal year

Affiliate level data

• often composed of many establishments

Publish data on:

• balance sheets, plant and equipment, income, value added, goods and services provided, employment

• no wage data

• no occupational data


Quarterly Census of Employment and Wages (QCEW)

Nearly complete quarterly census of businesses

• 98% of all nonfarm employment

• both private and public

• unique in frequency and timeliness

Establishment level data

Over 200 variables:

• monthly employment

• quarterly wages

• industry classification

• company name

• address and phone number


Occupational Employment Statistics (OES)

Sample units drawn from the QCEW

• 1.2 million establishments

Classification system

• 23 major occupational groups

• over 800 detailed occupations

Publish occupational employment and wage data by:

• industry classification

• geographic detail


How are we creating our “big data” product?

1. Auto-match BEA data to the QCEW

• matching variable: Employer Identification Number (EIN)

• initial match done with computer algorithm

2. Analyst review of matches

• internet search

– verify auto-matched information

– locate additional subsidiaries

• longitudinal database

– company name

– address

– telephone number
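The auto-match step above can be sketched as a join on EIN, with affiliates that find no establishment routed to analyst review. This is a minimal illustration, not the BLS algorithm; the record layouts, field names, and EIN values are hypothetical.

```python
from collections import defaultdict


def auto_match(bea_affiliates, qcew_establishments):
    """Match BEA affiliates to QCEW establishments on EIN.

    Returns (matched, needs_review): matched maps affiliate id to its
    establishments; needs_review lists affiliates with no EIN match.
    """
    by_ein = defaultdict(list)
    for est in qcew_establishments:
        by_ein[est["ein"]].append(est)

    matched, needs_review = {}, []
    for aff in bea_affiliates:
        ests = [e for ein in aff["eins"] for e in by_ein.get(ein, [])]
        if ests:
            matched[aff["affiliate_id"]] = ests
        else:
            needs_review.append(aff["affiliate_id"])
    return matched, needs_review


def employment_gap(affiliate, establishments):
    """BEA-reported employment minus the sum over matched QCEW establishments."""
    return affiliate["employment"] - sum(e["employment"] for e in establishments)


# Hypothetical records.
BEA = [
    {"affiliate_id": "A1", "eins": ["11-1111111"], "employment": 500},
    {"affiliate_id": "A2", "eins": ["22-2222222"], "employment": 80},
]
QCEW = [
    {"ein": "11-1111111", "establishment_id": "E1", "employment": 300},
    {"ein": "11-1111111", "establishment_id": "E2", "employment": 150},
    {"ein": "33-3333333", "establishment_id": "E3", "employment": 40},
]

if __name__ == "__main__":
    matched, needs_review = auto_match(BEA, QCEW)
    print(employment_gap(BEA[0], matched["A1"]), needs_review)
```

A nonzero employment gap after matching is one signal that an affiliate reports under additional EINs, which is the kind of case the analyst review is meant to resolve.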


A closer look at EINs

Firms can have more than one EIN

Firms may use different EINs for different purposes

Neither BEA nor BLS collects full list of EINs

Incomplete matching of EINs will lead to employment differences


What are some other factors that will cause differences between BEA and BLS?

Different people filling out the forms

Timing issues

• reference period

• seasonal fluctuations

• rapidly growing/contracting companies

All in/all out

• foreign ownership may be for only a specific establishment

Geographical scope


What data will BLS publish?

QCEW

• industry detail

– total private down to NAICS 4-digit

• geographic detail

– national, state, MSA and county

• country of ownership

OES

• major occupation groups (national and state)

• detailed occupations (national)

• ownership by world region


Contact Information

Erik Friesenhahn

Economist

Business Employment Dynamics

www.bls.gov/bdm

[email protected]


www.eia.gov

U.S. Energy Information Administration

Independent Statistics & Analysis

Innovative Uses of Administrative and Third Party Data

For

ICSP Big Data Day

May 11, 2018 | Washington, D.C.

By

Nanda Srinivasan and Kevin Cooksey (Bureau of Labor Statistics)


Overview

• Description of EIA’s motor gasoline survey and petroleum marketing frame and framing the context

• Description of BLS QCEW – framing the context, use of the data – 1.5 minute

• Challenges faced in drafting the MOU – description of safeguards on both ends, including CIPSEA

• Alignment of management and drafting of MOU – Actions taken at both ends

• Results for EIA and BLS

• Closing thoughts

Nanda Srinivasan, Reston, VA

May 17, 2018

Nanda Srinivasan

May 11, 2018


Motivation

• Responsibility of federal statistical agencies to investigate alternative sources of data to:

– Increase time and cost efficiencies

– Reduce respondent burden

• Larger context of research priorities for federal statistical agencies

– CNSTAT reports on Multiple Data Sources

– Commission on Evidence Based Policy Making

• EIA – Internal Statistical Methods Improvement Plan

• BLS – Provides good statistical use case for QCEW


Data Source: Motor Gasoline Price Survey (EIA-878)

• Weekly mandatory CIPSEA survey of approximately 800 retail gasoline stations across the country.

• “Gasoline price” definition: Cash price per gallon (including taxes) as of 8:00 a.m. local time each Monday

– Regular, midgrade, and premium gasoline.

• Mode: Mostly CATI; however, other modes also available

• Same day data collection, processing, and dissemination

• Estimates are produced for 276 publication cells

– Nation, regions, 10 cities, 9 states

– Regular, midgrade, and premium

– Conventional and reformulated gasoline


EIA-863 – Petroleum Product Sales Identification Survey

• A triennial census of petroleum product volumes sold annually within the 50 states and DC.

– Products: No. 2 Distillate Fuel Oil/Diesel; No. 5 & 6 Residual Fuel; Motor Gasoline and Gasohol; Propane.

– Used to benchmark and weight other EIA surveys.

– Has not been conducted in a number of years.

• Challenges

– Need to identify all unique businesses in a diversified product market. No longer able to use SIC codes to identify firms by product sold.

– Need to identify the geography of a firm’s market area. Who’s selling what across what state lines?


EIA-863 Multiple Data Source Solution

• U.S. Bureau of Labor Statistics Quarterly Census of Employment and Wages (QCEW)

– Frame covers 40 states with firms by NAICS.

– Most comprehensive source but has holes.

• State Energy Office Lists

– Available for a few states. Limited geographic coverage.

• Web Crawling/Scraping Data Grab

– EIA has engaged Idaho National Lab’s Big Data Team to develop a web crawling and scraping process for the 10 states not in QCEW.

– Process uses a trained classifier with set key words to identify websites that are then scraped for firm names and contact information.
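The screen-then-scrape idea can be sketched with a plain keyword count standing in for the trained classifier, followed by a regular expression that pulls phone numbers from pages that pass the screen. Everything here is an assumption for illustration: the keyword list, the hit threshold, the phone pattern, and the sample page text are hypothetical and are not the Idaho National Lab process.

```python
import re

# Hypothetical key words for petroleum product sellers.
KEYWORDS = ("propane", "heating oil", "diesel", "fuel delivery", "gasoline")

# Simple US phone pattern, e.g. "(208) 555-0100" or "208-555-0100".
PHONE_RE = re.compile(r"\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}")


def keyword_score(text):
    """Count key-word occurrences in the page text (case-insensitive)."""
    lower = text.lower()
    return sum(lower.count(k) for k in KEYWORDS)


def is_candidate(text, min_hits=2):
    """Keyword screen standing in for the trained classifier."""
    return keyword_score(text) >= min_hits


def extract_phones(text):
    """Pull phone numbers from a page that passed the screen."""
    return PHONE_RE.findall(text)


# Hypothetical page text.
PAGE = "Smith Propane & Heating Oil. Call (208) 555-0100 for fuel delivery."

if __name__ == "__main__":
    if is_candidate(PAGE):
        print(extract_phones(PAGE))
```

A real classifier would weigh terms rather than count them, but the two-stage shape is the same: cheap screening first, extraction only on pages that look relevant.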


Data Sharing Takes Patience and Time; Results are “Big”

• Cost savings

• Institutional issues

– State buy-in for QCEW

– CIPSEA requirements

– Getting data to flow across two agencies

• Support from management

• BLS staff on Employment and Unemployment Statistics

– Dave Talon and Kevin Cooksey

• EIA staff

– Jeramiah Yeksavich and Maura Bardos


“Big Data” for Health Care Through Use of Electronic Health Records (EHRs)

Carol DeFrances, Ph.D.

Chief, Ambulatory and Hospital Care Statistics Branch

Division of Health Care Statistics

National Center for Health Statistics

Big Data Day

May 11, 2018


Why Move the National Health Care Surveys to EHR Data?

• Less burden on the provider--no need for on-site medical record abstraction.

• More clinical detail and depth--all diagnoses, medications, and lab results are collected.

• Greater volume of data--all visits are included.

• Richer data available--allergies to medications, problem lists, family history, social history, and use of alcohol, tobacco, and substances.

• Better security--direct transmission of data with no need for laptops.


What Steps Were Taken to Move to EHR Data Collection?

Research

• Conducted several pilot studies sponsored by the Assistant Secretary for Planning and Evaluation, DHHS.

Data Standards

• Developed HL7 CDA Implementation Guide (IG) for the National Health Care Surveys, which provides a standardized format for data submission.

Survey Incentives

• Participation fulfills requirements of Medicare and Medicaid EHR Incentive Programs, a.k.a. Meaningful Use (MU).

• IG named in 2015 edition of Health IT Certification Criteria.


What are the Opportunities with EHR Data?

Greater Clinical Depth and Richness

• Collect clinical information objectively without need for medical record abstraction.

• Include all diagnoses, procedures, active problems, medications, laboratory tests, imaging, and results.

More Volume

• Include all inpatient and ambulatory visits, not just a sample.

• Collect rare conditions and experimental procedures.

Ability to Link Across Hospital Settings and to Other Data Sources

• Follow patients as they receive care throughout the sampled hospital: in the ED, as an inpatient, including ICU care, as well as any follow-up care received in outpatient clinics at the hospital.

• Link to the National Death Index (30-, 60-, and 90-day mortality).

• Link to Medicare and Medicaid Claims.
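
As a rough illustration of what the death-index linkage makes possible, the sketch below computes 30-, 60-, and 90-day mortality flags from already-linked records. The record layout, the choice to count days from discharge, and the sample dates are assumptions for illustration; the slides do not describe how NCHS defines or computes these measures.

```python
from datetime import date

WINDOWS = (30, 60, 90)  # days, as listed on the slide


def mortality_flags(discharge, death):
    """Return {window: bool} for death within each window after discharge.

    `death` is None when no matching death record was found.
    Counting from the discharge date is an assumption of this sketch.
    """
    if death is None:
        return {w: False for w in WINDOWS}
    days = (death - discharge).days
    return {w: 0 <= days <= w for w in WINDOWS}


# Hypothetical linked records: (patient id, discharge date, death date or None).
linked = [
    ("p1", date(2016, 3, 1), date(2016, 3, 20)),
    ("p2", date(2016, 3, 1), date(2016, 5, 15)),
    ("p3", date(2016, 3, 1), None),
]

flags = {pid: mortality_flags(dis, dth) for pid, dis, dth in linked}

# Share of patients who died within each window.
rates = {w: sum(f[w] for f in flags.values()) / len(flags) for w in WINDOWS}
print(rates)
```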


What are the Challenges with EHR Data?

Technological

• How and where to store the large volume of data collected.

• Dealing with interoperability issues.

Assessing Quality

• Need to conduct comparability studies.

Analytical

• How to integrate and harmonize: EHR data and abstracted data, and EHR data and administrative claims data.

• Should EHR data be edited?

Disclosure Concerns

• Are public use files still possible?


What is the Impact of EHR Data for the National Health Care Surveys?

“Big data” for health care

Innovative Research to Improve the Nation’s Health Care


A Dive into U.S. Expenditures on Treatment by Disease, 2000-2014

Office of the Chief Economist Health Team, May 2018


Trend in Health and Non-health Personal Consumption Expenditures 1970-2016

8/17/2018

Health: 21% of Consumption in 2016

[Figure: health and non-health personal consumption expenditures by year, 1970-2016; vertical axis in billions of dollars, 0 to 12,000.]


Contribution of our work

Redefines output for the health sector to be the treatment of a condition

For example

– Output = number of patients treated for heart attacks

– Expenditures = spending on the treatment of heart attacks

– Price = average spending per treated patient for heart attacks
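
The three definitions above fit together as Expenditures = Price × Output. A minimal sketch of that arithmetic follows; the figures are made up for illustration and are not BEA estimates, and the function name is hypothetical.

```python
def condition_price(expenditures, patients_treated):
    """Price = average spending per treated patient for one condition."""
    if patients_treated <= 0:
        raise ValueError("patients_treated must be positive")
    return expenditures / patients_treated


# Made-up numbers for illustration only.
heart_attack = {
    "patients_treated": 800_000,       # output
    "expenditures": 20_000_000_000,    # spending on treatment, in dollars
}

price = condition_price(
    heart_attack["expenditures"], heart_attack["patients_treated"]
)
print(price)  # 25000.0

# The identity expenditures = price * output holds by construction.
recovered = price * heart_attack["patients_treated"]
```

Defining price this way means that a change in spending can be split into a change in the number of patients treated and a change in the cost of treating each one.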


Health Care Satellite Account


18 Aggregate Chapters


Health Care Satellite Account - Limitation

• Broad disease chapters limit the applicability of the account

• “Nervous system” chapter includes:

– Migraines

– Multiple Sclerosis


18 Aggregate Chapters


Health Care Satellite Account - Next release planned for this summer includes additional detail


18 Aggregate Chapters

63 Component Categories

261 Detail Conditions


Additional detail at two levels of aggregation


Aggregate Chapter

Component Categories

Detail Conditions
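
One way to picture the three levels is as a roll-up from detail conditions to component categories to aggregate chapters. The sketch below assumes a flat list of rows; the "Nervous system" chapter and its two conditions come from the earlier slide, while the category names and spending figures are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows:
# (aggregate chapter, component category, detail condition, spending).
rows = [
    ("Nervous system", "Headache disorders", "Migraines", 10.0),
    ("Nervous system", "Demyelinating diseases", "Multiple sclerosis", 25.0),
    ("Nervous system", "Headache disorders", "Other headaches", 5.0),
]

by_category = defaultdict(float)
by_chapter = defaultdict(float)

# Sum detail-condition spending up to each higher level.
for chapter, category, _condition, spending in rows:
    by_category[(chapter, category)] += spending
    by_chapter[chapter] += spending

print(dict(by_category))
print(dict(by_chapter))
```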


Construction of Blended Account

• Use population weights from Medical Expenditure Panel Survey to fold in data from different sources

Millions of Privately Insured: MarketScan®

Millions of Medicare Enrollees: Medicare FFS 5% Sample

Other (e.g. Uninsured, Medicaid): MEPS
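
The blending step can be thought of as a population-weighted average across the sources. The sketch below shows only that arithmetic; the population weights and per-capita figures are invented, and the field and function names are assumptions rather than BEA's actual method.

```python
def blended_per_capita(sources):
    """Population-weighted average of per-capita spending across sources."""
    total_pop = sum(s["population"] for s in sources)
    if total_pop <= 0:
        raise ValueError("total population weight must be positive")
    weighted = sum(s["population"] * s["spending_per_capita"] for s in sources)
    return weighted / total_pop


# Invented weights (millions of people) and per-capita spending (dollars).
sources = [
    {"name": "Privately insured (MarketScan)", "population": 180.0,
     "spending_per_capita": 100.0},
    {"name": "Medicare enrollees (FFS 5% sample)", "population": 50.0,
     "spending_per_capita": 300.0},
    {"name": "Other, e.g. uninsured, Medicaid (MEPS)", "population": 70.0,
     "spending_per_capita": 80.0},
]

blended = blended_per_capita(sources)
print(round(blended, 2))
```

Weighting by population keeps a large claims database from dominating the estimate simply because it has more records.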


Data comparison of spending per capita by condition


Growth Rates for Top-30 Conditions vs. All Others


Expenditures driven by technologies


Sovaldi (sofosbuvir), 2013

Expensive biologics


Conclusion

• Big data used to produce detailed condition-level estimates

• Our analysis of this data shows innovative treatments driving expenditures higher

• We hope the availability of more detailed condition-level estimates will lead to other new insights, questions, and future improvements in BEA estimates


Carol A. Robbins

ICSP Big Data Day

May 11, 2018

National Science Foundation

National Center for Science and Engineering Statistics

www.nsf.gov/statistics/

New Opportunities to Observe and Measure

Innovation, Modeling, Infrastructure, and Standards: What Can Big Data Tell Us About Open Source Software?


NCSES Collaboration with Virginia Tech

• NCSES

– Surveys on R&D, Innovation, S&E education and S&E workforce

– Congressionally Mandated Reports and other statistical products

• VT Social and Decision Analytics Lab: Data Science

• Collaboration with:

– Gizem Korkmaz, Stephanie Shipp, and Sallie Keller of VT SDAL

– Claire Kelling, Penn State University

• Improve Indicators of Research Outputs and Innovation Activities

• Open Source Software

– Public investment, limited output measures, potentially large impact

– Intangible investment and an innovation activity


Software is Everywhere: Some of it is both Free and Customizable

• Server Software

• Operating Systems

• Statistical Software

Proprietary | Open Source

Opportunity to Harvest Data


Harvesting Open Source Software Data


Open Source Software Data Inventory

Variables | Sources (SourceForge, GitHub, depsy, Open Hub) | Potential Uses
Downloads | X X | Measure potential impact
Ratings | X | Measure impact and sentiment
Release and update date | X X | Identify completed projects and current activity
Citations | X | Measure impact
Reuse and dependencies | X | Measure impact across projects
Type of software | X | Identify product
Lines of code | X X | Estimate effort
Contributor characteristics | X X X X | Contributor network, team, experience, and sectors
Contribution level | X X X | Estimate effort


Open Source Software: Measuring Value and Impact with R

[Figure: R packages grouped by color: statistical analysis packages (green); data wrangling, exploration, and visualization (pink); web-based data/API processing (turquoise); packages for matrix operations (blue); spatial analysis (orange).]

Downloads and Citations

Packages | Average for R | ggplot2
Match depsy to CRAN R list | n = 9,801 |
Downloads per package | 58,000 | 6,255,500
Citations per package | 6.83 | 1,307


Next Step: Estimate Cost for R Packages

• Cost Components for packages

– Effort in person months: contributor experience and contribution level

– Wage equivalent: computer programmers, software developers (Occupational Employment Statistics survey, Bureau of Labor Statistics)

– Package cost = sum(total_person_year * wage_year)

• Industry Model: Constructive Cost Estimation

– Effort is a function of complexity and lines of code

– KLOC = kilo (thousands) of lines of code

• Compare with aggregate software investment measures

• Extend to other packages
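
The cost calculation outlined above might look something like the sketch below. It assumes the "industry model" refers to the basic Constructive Cost Model (COCOMO) with its published "organic" coefficients, which the slide does not confirm; the wage figures, contributor data, and function names are placeholders, not OES values or NCSES estimates.

```python
# Basic COCOMO coefficients for "organic" projects (Boehm, 1981).
# Treating R packages as organic projects is an assumption of this sketch.
A, B = 2.4, 1.05


def cocomo_effort_person_months(kloc):
    """Effort in person-months as a function of thousands of lines of code."""
    return A * kloc ** B


def package_cost_from_kloc(kloc, annual_wage):
    """Convert modeled effort to person-years and price it at one wage."""
    person_years = cocomo_effort_person_months(kloc) / 12
    return person_years * annual_wage


def package_cost_from_contributors(contributions):
    """Package cost = sum(total_person_year * wage_year) over contributors."""
    return sum(person_years * wage for person_years, wage in contributions)


# Placeholder inputs for illustration only.
effort = cocomo_effort_person_months(10)            # a 10 KLOC package
modeled_cost = package_cost_from_kloc(10, 100_000)  # hypothetical annual wage
observed_cost = package_cost_from_contributors(
    [(0.5, 100_000), (1.0, 80_000)]                 # (person-years, wage)
)
print(effort, modeled_cost, observed_cost)
```

Comparing the model-based figure with the contributor-based figure is one way to check whether lines of code are a reasonable proxy for effort.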


Questions?

[email protected]

Visit us at the poster session

Thank You!


Share your feedback! https://www.surveymonkey.com/r/ICSPBDD
