ESS event: Big Data in Official Statistics
Antonino Virgillito, Istat
About me
• Head of Unit “Web and BI Technologies”, IT Directorate of Istat
• Project manager and technical coordinator of
  – Web technologies, mobile applications, BI & ETL
  – Highlights: online Census questionnaires, Consumer Price Survey on-field data collection architecture
• PhD in Computer Engineering
  – Field of research: large-scale distributed systems
• Trainer on Hadoop-MapReduce
Abstract
• The hype surrounding Big Data technologies hides the complexity of their adoption in NSIs
  – Strong IT know-how required for configuration and management
  – Should be accessible through common statistical software
• A reasoned overview of the most popular Big Data technologies, with a focus on their usage in NSIs
Outline
• Motivations for Big Data technologies
• Overview of Big Data tools
• Adopting Big Data technologies in NSIs
MOTIVATIONS FOR BIG DATA TECHNOLOGIES
What does Big mean?
Background
Big Size
Tera-, Peta- … and growing

Big Processing
Running a complex statistical method can become intractable even with data sets of “reasonable” size

Big Quality
Big Data is often loosely structured and highly noisy
Big Size
• Real Big Data begins where your usual tools fail…
Distributed file systems
• Clusters of commodity hardware that can scale to indefinite size simply by adding new nodes at runtime
• Overcome physical limitations
• Should be managed by a middleware platform (Hadoop HDFS)
Tools and Techniques
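The scaling idea above can be sketched as a toy model in plain Python (no Hadoop involved; the block size and node names are invented for illustration): a file is cut into fixed-size blocks, and the blocks are scattered round-robin over the cluster, so capacity grows simply by adding nodes.

```python
# Toy sketch of distributed file storage: not HDFS, just the idea.
# Block size and node names are illustrative assumptions.

def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a file into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def scatter(blocks: list[bytes], nodes: list[str]) -> dict[str, list[bytes]]:
    """Place blocks on nodes round-robin; adding a node adds capacity."""
    placement: dict[str, list[bytes]] = {node: [] for node in nodes}
    for i, block in enumerate(blocks):
        placement[nodes[i % len(nodes)]].append(block)
    return placement

file_bytes = b"x" * 1000                          # a 1000-byte "file"
blocks = split_into_blocks(file_bytes, 128)       # 8 blocks of at most 128 bytes
cluster = scatter(blocks, ["node1", "node2", "node3"])
```

Growing the cluster is just passing a longer node list to `scatter`; no block needs to know the total size of the system.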
Big Processing
MapReduce
• Programming paradigm that enables programs to be executed in parallel on a cluster
• Not tied to a programming language; interfaces exist for all common languages and tools
Big Quality
• Pre-processing for cleaning and organizing data
• Big Data are often unstructured, but the reverse is not true
Technical Challenges
Handling Big Data necessarily requires relying on complex distributed technologies.
If you want to get value out of really big data, you have to deal with this complexity.
Perspectives
Colleen, the Statistical Analyst, and Moss, the IT Guy
Colleen: “Ok, but I want to use my tools and methods. I don’t want to touch this distributed stuff.”
Moss: “I can set up the infrastructure and the data and help you with the tools.”
Colleen: “Deal.”
Moss (to himself): “I don’t want to write programs for every analysis she makes.”
BIG DATA TOOLS OVERVIEW
Big Data IT Tools Proliferation
Our focus: IT Tools for Statistical Analysis of Big Data
• What are the basic tools?
• What is the best tool for the job?
• How do these tools integrate with common elements of an IT architecture?
Big Data IT Tools: the Common Denominator
Distributed Storage and Processing: Hadoop
• Distributed storage platform
• De-facto standard for Big Data processing
• Open source project supported and/or adopted by most major vendors
• Virtually unlimited scalability
  – storage, memory, processing power
Hadoop
Hadoop Principle
“I’m one big data set”
Hadoop is basically a middleware platform that manages a cluster of machines.
The core component is a distributed file system (HDFS).
HDFS
Files in HDFS are split into blocks that are scattered over the cluster.
The cluster can grow indefinitely simply by adding new nodes.
The MapReduce Paradigm
Parallel processing paradigm: the programmer is unaware of parallelism. Programs are structured into a two-phase execution.
Map: data elements are classified into categories.
Reduce: an algorithm is applied to all the elements of the same category.
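The two phases can be sketched in plain Python with the classic word-count example (a single-process simulation, not actual Hadoop code): map classifies each element under a category key, a grouping step collects elements by key, and reduce applies an algorithm, here a sum, to each group.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: classify each data element into a category (emit key/value pairs)."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    """Group pairs by category key, then Reduce: apply an algorithm per group."""
    groups = defaultdict(list)
    for key, value in pairs:          # the grouping ("shuffle") step
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(map_phase(["big data big tools", "big cluster"]))
# counts["big"] == 3
```

On a real cluster, many nodes run `map_phase` in parallel on their local blocks and many nodes run `reduce_phase`, one per subset of categories; the structure of the program does not change.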
MapReduce and Hadoop
[Diagram: the Hadoop stack, with MapReduce logically placed on top of HDFS]
MapReduce and Hadoop
[Diagram: a Hadoop cluster, each node running HDFS and MR]
MR works on (big) files loaded on HDFS.
Each node in the cluster executes the MR program in parallel, applying the map and reduce phases to the blocks it stores.
Output is written to HDFS.
Scalability principle: perform the computation where the data is.
MapReduce Applications
• Naturally targeted at counts and aggregations
  – 1-line aggregation algorithm
• Collecting & combining
  – It all began there… inverted index computation in Google
• Machine learning, cross-correlation
• Graph analysis
  – “People you may know”
  – Geographical data: in Google Maps, finding the nearest feature to a given address or location
• Pre-processing of unstructured data
• Can also handle binary files
  – The NYT converted 4 TB of scanned articles into 1.5 TB of PDFs
Data Analysis with Hadoop
Colleen, the Statistical Analyst, and Moss, the IT Guy
Moss: “I finally loaded those elephant-size data sets into Hadoop!”
Colleen: “Cool! Now how can I analyze them?”
Moss: “It’s simple! Write a MapReduce program in Java!”
Colleen: “No.”
Moss: “Ok, I’ll do that for you.”
No: MapReduce programs can be written in various programming languages, and several tools are available that translate high-level analysis languages into MapReduce programs.
Tools for Data Analysis with Hadoop
High-level languages for data manipulation
[Diagram: Pig, Hive and statistical software sit on top of MapReduce, which runs on HDFS]
Using Hadoop from Statistical Software
• R
  – packages rhdfs, rmr
  – Issue HDFS commands and write MapReduce jobs
• SAS
  – SAS In-Memory Statistics
  – SAS/ACCESS
    • Makes data stored in Hadoop appear as native SAS data sets
    • Uses the Hive interface
• SPSS
  – Transparent integration with Hadoop data
Apache Pig
• Tool for querying data on Hadoop clusters
• Widely used in the Hadoop world
  – Yahoo! estimates that 50% of the Hadoop workload on their 100,000-CPU clusters is generated by Pig scripts
• Lets you write data manipulation scripts in a high-level language called Pig Latin
  – Interpreted language: scripts are translated into MapReduce jobs
• Mainly targeted at joins and aggregations
Pig Example
[Real example of a Pig script used at Twitter, shown alongside its Java equivalent…]
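Since the script itself survives only as an image, here is the kind of join-free group-and-aggregate a short Pig Latin script typically expresses (a GROUP … FOREACH … GENERATE with AVG), sketched in plain Python; the records and field names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical input records (region, turnover), as Pig might LOAD from a file
records = [("north", 10.0), ("south", 4.0), ("north", 6.0), ("south", 2.0)]

# Rough equivalent of: grouped = GROUP records BY region;
#                      result  = FOREACH grouped GENERATE group, AVG(records.turnover);
sums = defaultdict(lambda: [0.0, 0])      # region -> [running total, count]
for region, turnover in records:
    sums[region][0] += turnover
    sums[region][1] += 1
averages = {region: total / n for region, (total, n) in sums.items()}
# averages == {"north": 8.0, "south": 3.0}
```

In Pig this is three lines of script that the interpreter translates into a MapReduce job; the Java equivalent shown on the slide runs to dozens of lines.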
Hadoop-MapReduce Limitations
• Not usable in transactional applications
• Not suited to real-time analysis
  – MapReduce jobs run in batch mode: you cannot expect low response latency
  – Not suited for interactive operations and/or random-access reads/writes
• HDFS is an append-only file system
  – Can insert and delete, but cannot update
NoSQL databases
• Distributed storage platforms that allow for lower-latency processing
• NoSQL: Not Only SQL
• Non-relational data models that trade transactional consistency for query efficiency and support semi-structured data
  – No joins, no transactions, no indexes
NoSQL Databases
Popular choices: HBase and Cassandra. Both use a column-oriented model:
– Data organized in families of key: value pairs
– Variable schema where each row is slightly different
– Optimized for sparse data
Both can be accessed from R.
HBase is based on Hadoop (it sits on top of HDFS and MapReduce); Cassandra is a fully distributed platform not based on Hadoop.
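A minimal sketch of the column-oriented model in plain Python (nested dictionaries, not a real HBase or Cassandra client; the table contents are invented): each row carries its own set of column: value pairs, so the schema varies per row and absent cells cost nothing.

```python
# Toy column-family store: row key -> {column: value}
table: dict[str, dict[str, str]] = {}

def put(row: str, column: str, value: str) -> None:
    """Write one cell; rows and columns come into existence on demand."""
    table.setdefault(row, {})[column] = value

def get(row: str, column: str):
    """Read one cell; absent cells return None and take no storage."""
    return table.get(row, {}).get(column)

put("enterprise:001", "name", "Acme")
put("enterprise:001", "turnover", "1.2M")
put("enterprise:002", "name", "Foo Ltd")   # no turnover column: variable schema
```

This is the sense in which such stores have “no joins, no transactions, no indexes”: lookups are single key-based reads, and anything relational has to be done by the application.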
Big Data Tools in the IT Architecture
• Hadoop is not a DB/DW replacement; it sits beside traditional data technologies in a modern IT architecture
• The outcome of Big Data processing can be stored in a traditional DB/DW
• Modern (visual) analytics tools can integrate both kinds of data sources
Augmented IT Architecture
[Diagram: multi-structured big data enters Hadoop and a NoSQL DB for initial processing and cleaning; the NoSQL DB keeps multi-structured historical data online and accessible; analysis results flow into a traditional DW, consumed by BI, statistical software and visual analytics tools]
ADOPTING BIG DATA TECHNOLOGIES
Hadoop Deployment Options
• In-house: maximum control of configuration and costs; high complexity
• Cloud: pay-per-use billing model; cuts hardware and software costs and eliminates the management burden; privacy issues!
• Appliance: easy, but costly
IT Skills for Big Data Tools
• System manager: sets up and manages the physical infrastructure (Linux)
• Data Integrator: develops ETL procedures to move data to/from HDFS and NoSQL DBs (ETL, SQL)
• Data Engineer: designs the IT architecture for collecting and processing data; designs, develops and writes MR jobs or Pig scripts (MapReduce, Pig, Java)
• Data scientist: derives new insights by applying statistical analysis methods to different, heterogeneous, possibly big, data sources; has strong IT foundations and can develop her algorithms using both statistical tools and Hadoop (R, SAS, SPSS)
• Data analyst: uses statistical tools and visual analytics (Excel, BI and Visual Analytics tools)
Suggestions for ESS
• Training on data science for statisticians and Big Data engineering for IT staff
• Implementation of standard methods and tools in a Hadoop-compliant version
• Eurostat establishing repositories of Big Data and allowing NSIs to access them
• Setup of a “statistical cloud”: a Hadoop cluster shared by NSIs
• Possible agreements with providers of IT solutions (Google, etc.)
Conclusions
• Big Data tools make sense when you really have serious size issues to deal with
  – Not much use for a 2-node Hadoop cluster
• No value in jumping on the Big Data bandwagon for its own sake
  – High costs
  – You can still be a “data scientist”...
• However, Big Data engineering provides new opportunities
  – Collect more data
  – “Ask bigger questions”
Questions