ESS event: Big Data in Official Statistics
Antonino Virgillito, Istat
About me
• Head of Unit “Web and BI Technologies”, IT Directorate of Istat
• Project manager and technical coordinator of
  – Web technologies, mobile applications, BI & ETL
  – Highlights: online Census questionnaires, Consumer Price Survey on-field data collection architecture
• PhD in Computer Engineering
  – Field of research: large-scale distributed systems
• Trainer on Hadoop-MapReduce
Abstract
• The hype surrounding Big Data technologies hides the complexity of their adoption in NSIs
  – Strong IT know-how required for configuration and management
  – Should be accessible through common statistical software
• A reasoned overview of the most popular Big Data technologies, with a focus on their usage in NSIs
Outline
• Motivations for Big Data technologies
• Overview of Big Data tools
• Adopting Big Data technologies in NSIs
MOTIVATIONS FOR BIG DATA TECHNOLOGIES
What does Big mean?
Background
Big Size
Tera-, Peta- … and growing

Big Processing
Running a complex statistical method can become intractable even with data sets of “reasonable” size

Big Quality
Big Data is often loosely structured and highly noisy
Big Size
• Real Big Data begins where your usual tools fail…
Distributed file systems
• Clusters of commodity hardware that can scale to indefinite size simply by adding new nodes at runtime
• Overcome physical limitations
• Should be managed by a middleware platform (Hadoop HDFS)
Tools and Techniques
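The scaling idea above can be sketched as a toy model in plain Python (no Hadoop involved; the block size and node names are invented for illustration): a file is cut into fixed-size blocks, and the blocks are scattered round-robin over the cluster, so capacity grows simply by adding nodes.

```python
# Toy sketch of distributed file storage: not HDFS, just the idea.
# Block size and node names are illustrative assumptions.

def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a file into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def scatter(blocks: list[bytes], nodes: list[str]) -> dict[str, list[bytes]]:
    """Place blocks on nodes round-robin; adding a node adds capacity."""
    placement: dict[str, list[bytes]] = {node: [] for node in nodes}
    for i, block in enumerate(blocks):
        placement[nodes[i % len(nodes)]].append(block)
    return placement

file_bytes = b"x" * 1000                          # a 1000-byte "file"
blocks = split_into_blocks(file_bytes, 128)       # 8 blocks of at most 128 bytes
cluster = scatter(blocks, ["node1", "node2", "node3"])
```

Growing the cluster is just passing a longer node list to `scatter`; no block needs to know the total size of the system.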
Big Processing
MapReduce
• Programming paradigm that enables programs to be executed in parallel on a cluster
• Not tied to a programming language; interfaces exist for all common languages and tools
Big Quality
• Pre-processing for cleaning and organizing data
• Big Data are often unstructured, but the reverse is not true
Technical Challenges
Handling Big Data necessarily requires relying on complex distributed technologies.
If you want to get value out of really big data, you have to deal with this complexity.
Perspectives
Colleen, the Statistical Analyst, and Moss, the IT Guy
Colleen: “Ok, but I want to use my tools and methods. I don’t want to touch this distributed stuff.”
Moss: “I can set up the infrastructure and the data and help you with the tools.”
Colleen: “Deal.”
Moss (to himself): “I don’t want to write programs for every analysis she makes.”
BIG DATA TOOLS OVERVIEW
Big Data IT Tools Proliferation
Our focus: IT Tools for Statistical Analysis of Big Data
• What are the basic tools?
• What is the best tool for the job?
• How do these tools integrate with common elements of an IT architecture?
Big Data IT Tools: the Common Denominator
Distributed Storage and Processing: Hadoop
• Distributed storage platform
• De-facto standard for Big Data processing
• Open source project supported and/or adopted by most major vendors
• Virtually unlimited scalability
  – storage, memory, processing power
Hadoop
Hadoop Principle
“I’m one big data set”
Hadoop is basically a middleware platform that manages a cluster of machines.
The core component is a distributed file system (HDFS).
HDFS
Files in HDFS are split into blocks that are scattered over the cluster.
The cluster can grow indefinitely simply by adding new nodes.
The MapReduce Paradigm
Parallel processing paradigm: the programmer is unaware of parallelism. Programs are structured into a two-phase execution.
Map: data elements are classified into categories.
Reduce: an algorithm is applied to all the elements of the same category.
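The two phases can be sketched in plain Python with the classic word-count example (a single-process simulation, not actual Hadoop code): map classifies each element under a category key, a grouping step collects elements by key, and reduce applies an algorithm, here a sum, to each group.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: classify each data element into a category (emit key/value pairs)."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    """Group pairs by category key, then Reduce: apply an algorithm per group."""
    groups = defaultdict(list)
    for key, value in pairs:          # the grouping ("shuffle") step
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(map_phase(["big data big tools", "big cluster"]))
# counts["big"] == 3
```

On a real cluster, many nodes run `map_phase` in parallel on their local blocks and many nodes run `reduce_phase`, one per subset of categories; the structure of the program does not change.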
MapReduce and Hadoop
[Diagram: the Hadoop stack, with MapReduce logically placed on top of HDFS]
MapReduce and Hadoop
[Diagram: a Hadoop cluster, each node running HDFS and MR]
MR works on (big) files loaded on HDFS.
Each node in the cluster executes the MR program in parallel, applying the map and reduce phases to the blocks it stores.
Output is written to HDFS.
Scalability principle: perform the computation where the data is.
MapReduce Applications
• Naturally targeted at counts and aggregations
  – 1-line aggregation algorithm
• Collecting & combining
  – It all began there… inverted index computation in Google
• Machine learning, cross-correlation
• Graph analysis
  – “People you may know”
  – Geographical data: in Google Maps, finding the nearest feature to a given address or location
• Pre-processing of unstructured data
• Can also handle binary files
  – The NYT converted 4 TB of scanned articles into 1.5 TB of PDFs
Data Analysis with Hadoop
Colleen, the Statistical Analyst, and Moss, the IT Guy
Moss: “I finally loaded those elephant-size data sets into Hadoop!”
Colleen: “Cool! Now how can I analyze them?”
Moss: “It’s simple! Write a MapReduce program in Java!”
Colleen: “No.”
Moss: “Ok, I’ll do that for you.”
No: MapReduce programs can be written in various programming languages, and several tools are available that translate high-level analysis languages into MapReduce programs.
Tools for Data Analysis with Hadoop
High-level languages for data manipulation
[Diagram: Pig, Hive and statistical software sit on top of MapReduce, which runs on HDFS]
Using Hadoop from Statistical Software
• R
  – packages rhdfs, rmr
  – Issue HDFS commands and write MapReduce jobs
• SAS
  – SAS In-Memory Statistics
  – SAS/ACCESS
    • Makes data stored in Hadoop appear as native SAS data sets
    • Uses the Hive interface
• SPSS
  – Transparent integration with Hadoop data
Apache Pig
• Tool for querying data on Hadoop clusters
• Widely used in the Hadoop world
  – Yahoo! estimates that 50% of the Hadoop workload on their 100,000-CPU clusters is generated by Pig scripts
• Lets you write data manipulation scripts in a high-level language called Pig Latin
  – Interpreted language: scripts are translated into MapReduce jobs
• Mainly targeted at joins and aggregations
Pig Example
[Real example of a Pig script used at Twitter, shown alongside its Java equivalent…]
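Since the script itself survives only as an image, here is the kind of join-free group-and-aggregate a short Pig Latin script typically expresses (a GROUP … FOREACH … GENERATE with AVG), sketched in plain Python; the records and field names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical input records (region, turnover), as Pig might LOAD from a file
records = [("north", 10.0), ("south", 4.0), ("north", 6.0), ("south", 2.0)]

# Rough equivalent of: grouped = GROUP records BY region;
#                      result  = FOREACH grouped GENERATE group, AVG(records.turnover);
sums = defaultdict(lambda: [0.0, 0])      # region -> [running total, count]
for region, turnover in records:
    sums[region][0] += turnover
    sums[region][1] += 1
averages = {region: total / n for region, (total, n) in sums.items()}
# averages == {"north": 8.0, "south": 3.0}
```

In Pig this is three lines of script that the interpreter translates into a MapReduce job; the Java equivalent shown on the slide runs to dozens of lines.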
Hadoop-MapReduce Limitations
• Not usable in transactional applications
• Not suited to real-time analysis
  – MapReduce jobs run in batch mode: you cannot expect low response latency
  – Not suited for interactive operations and/or random-access reads/writes
• HDFS is an append-only file system
  – Can insert and delete, but cannot update
NoSQL databases
• Distributed storage platforms that allow for lower-latency processing
• NoSQL: Not Only SQL
• Non-relational data models that trade transactional consistency for query efficiency and support semi-structured data
  – No joins, no transactions, no indexes
NoSQL Databases
Popular choices: HBase and Cassandra. Both use a column-oriented model:
– Data organized in families of key: value pairs
– Variable schema where each row is slightly different
– Optimized for sparse data
Both can be accessed from R.
HBase is based on Hadoop (it sits on top of HDFS and MapReduce); Cassandra is a fully distributed platform not based on Hadoop.
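A minimal sketch of the column-oriented model in plain Python (nested dictionaries, not a real HBase or Cassandra client; the table contents are invented): each row carries its own set of column: value pairs, so the schema varies per row and absent cells cost nothing.

```python
# Toy column-family store: row key -> {column: value}
table: dict[str, dict[str, str]] = {}

def put(row: str, column: str, value: str) -> None:
    """Write one cell; rows and columns come into existence on demand."""
    table.setdefault(row, {})[column] = value

def get(row: str, column: str):
    """Read one cell; absent cells return None and take no storage."""
    return table.get(row, {}).get(column)

put("enterprise:001", "name", "Acme")
put("enterprise:001", "turnover", "1.2M")
put("enterprise:002", "name", "Foo Ltd")   # no turnover column: variable schema
```

This is the sense in which such stores have “no joins, no transactions, no indexes”: lookups are single key-based reads, and anything relational has to be done by the application.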
Big Data Tools in the IT Architecture
• Hadoop is not a DB/DW replacement; it sits beside traditional data technologies in a modern IT architecture
• The outcome of Big Data processing can be stored in a traditional DB/DW
• Modern (visual) analytics tools can integrate both kinds of data sources
Augmented IT Architecture
[Diagram: multi-structured big data enters Hadoop and a NoSQL DB for initial processing and cleaning; the NoSQL DB keeps multi-structured historical data online and accessible; analysis results flow into a traditional DW, consumed by BI, statistical software and visual analytics tools]
ADOPTING BIG DATA TECHNOLOGIES
Hadoop Deployment Options
• In-house: maximum control of configuration and costs; high complexity
• Cloud: pay-per-use billing model; cuts hardware and software costs and eliminates the management burden; privacy issues!
• Appliance: easy, but costly
IT Skills for Big Data Tools
• System manager: sets up and manages the physical infrastructure (Linux)
• Data Integrator: develops ETL procedures to move data to/from HDFS and NoSQL DBs (ETL, SQL)
• Data Engineer: designs the IT architecture for collecting and processing data; designs, develops and writes MR jobs or Pig scripts (MapReduce, Pig, Java)
• Data scientist: derives new insights by applying statistical analysis methods to different, heterogeneous, possibly big, data sources; has strong IT foundations and can develop her algorithms using both statistical tools and Hadoop (R, SAS, SPSS)
• Data analyst: uses statistical tools and visual analytics (Excel, BI and Visual Analytics tools)
Suggestions for ESS
• Training on data science for statisticians and Big Data engineering for IT staff
• Implementation of standard methods and tools in a Hadoop-compliant version
• Eurostat establishing repositories of Big Data and allowing NSIs to access them
• Setup of a “statistical cloud”: a Hadoop cluster shared by NSIs
• Possible agreements with providers of IT solutions (Google, etc.)
Conclusions
• Big Data tools make sense when you really have serious size issues to deal with
  – Not much use for a 2-node Hadoop cluster
• No value in jumping on the Big Data bandwagon for its own sake
  – High costs
  – You can still be a “data scientist”...
• However, Big Data engineering provides new opportunities
  – Collect more data
  – “Ask bigger questions”
Questions