Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
1© 2014 The MathWorks, Inc.
Data Analytics with MATLAB
Tackling the Challenges of Big Data
Francesca Perino
Application Engineering Team
How big is big? What characterises “big” data?
“Any collection of data sets so large and complex that it becomes difficult to
process using … traditional data processing applications.”
Wikipedia
2
MATLAB Application Development Landscape
Prototyping Programming Deployment
3
MATLAB Application Development Landscape
Prototyping Programming Deployment
4© 2014 The MathWorks, Inc.
Data Analytics with MATLAB
Tackling the Challenges of Big Data
Francesca Perino
Application Engineering Team
How big is big? What characterises “big” data?
“Any collection of data sets so large and complex that it becomes difficult to
process using … traditional data processing applications.”
Wikipedia
5
Data-Driven Decisions and Data-Driven Design
Measurement Devices Big Data
Compute Power
6
Need is Across Many Application Areas
Signal
Processing
Image
Processing
Model-Based
Design
System
Design
Advanced driver
assistance
system
Hybrid and
electric
vehicles
Sound quality
analysis
Engine
Calibration
Data
Analysis
Portfolio risk
optimization
7
Source: Information Warfare
Edward Waltz – 1998
Physical
Sensors
Data
Information
Knowledge
Action
Data Analytics in MATLABMoving up the Information Hierarchy
8
Data Analytics in MATLABMoving up the Information Hierarchy
Data acquisition
Instruments
Imaging devices
Flat files, Excel, Web
Databases
Data warehouses
HDFS (Hadoop)
Physical
Sensors
Data
Information
Knowledge
Action
• Sensing
• Collecting
• Measurement
• Data Acquisition
OBSERVATION
9
Data Analytics in MATLABMoving up the Information Hierarchy
Data Processing
Exploratory Analysis
Filtering
Physical
Sensors
Data
Information
Knowledge
Action
• Sensing
• Collecting
• Measurement
• Data Acquisition
• Preprocessing
• Calibration
• Filtering
• Data Reduction
ORGANIZATION
10
Data Analytics in MATLABMoving up the Information Hierarchy
Machine Learning Decision Tree
Ensemble
Method
Neural Network
Support Vector
Machine
Classification
Linear
Non-linear
Non-parametric
Regression
Prediction0 20 40 60 80 100 120 140 160 180 200
0.5
0.6
0.7
0.8
0.9
1
time secs
active p
ow
er
per-
unit
NN
measured
0 5 10 15 20 25 30 35 40 450
0.2
0.4
0.6
0.8
1
1.2x 10
-4
turbine number
MS
E
Physical
Sensors
Data
Information
Knowledge
Action
• Sensing
• Collecting
• Measurement
• Data Acquisition
• Preprocessing
• Calibration
• Filtering
• Data Reduction
• Analysis
• Visualization
• Modeling
• Prediction
UNDERSTANDING
Visualization
11
Physical
Sensors
Data
Information
Knowledge
Action
Data Analytics in MATLABMoving up the Information Hierarchy
• Sensing
• Collecting
• Measurement
• Data Acquisition
• Preprocessing
• Calibration
• Filtering
• Data Reduction
• Analysis
• Visualization
• Modeling
• Prediction
• Reporting
• Apps
• Scalable Deployment
• Integration
Reports
MATLAB
Applications
Integration into
Existing Systems
Excel
Feedback for Design
and Operations
APPLICATION
12
Large Data Analytics
Explore
Prototype
Data Share/Deploy
Work on the desktop
Scale capacity as needed
Scale
13
Large Data Analytics on the Desktop
Explore
Prototype
Access Share/Deploy
Collections of Text Files datastore
Databases Database Toolbox
Binary Files memmapfileAccess big data from
your desktop
Scale
14
Example: Airline Flight Distance
Data
– BTS/RITA Airline On-Time Statistics
– 123.5M records, 29 fields
Task
– Find the maximum distance travelled by
commercial airlines based upon flight
operations performance data
CSV Data
22 files
12GB
15
Standard Workflow (up to R2014a)
files = {'1987.csv', '1988.csv', '1989.csv', '1990.csv', ...
'1991.csv', '1992.csv', '1993.csv', '1994.csv', '1995.csv', ...
'1996.csv', '1997.csv', '1998.csv', '1999.csv', '2000.csv', ...
'2001.csv', '2002.csv', '2003.csv', '2004.csv', '2005.csv', ...
'2006.csv', '2007.csv', '2008.csv'};
fmtspec = ['%*q%*q%*q%*q%*q%*q%*q%*q%*q%*q' ...
'%*q%*q%*q%*q%*q%*q%*q%*q%f%*q' ...
'%*q%*q %*q%*q%*q%*q%*q%*q%*q'];
maxdist = -Inf;
for i = 1 : numfiles
filei = fopen(files{i});
data = textscan(filei, fmtspec, ...);
fclose(filei);
maxi = max(data{:});
maxdist = max(maxdist,maxi);
end
Location
Compute
Combine
Format
Read data
16
New Workflow with datastore (in R2014b)
airdata = datastore('*.csv');
airdata.SelectedVariableNames = {'Distance'};
airdata.SelectedFormats = {'%f'};
maxdist = -Inf;
while hasdata(airdata)
data = read(airdata);
maxi = max(data.Distance);
maxdist = max(maxdist, maxi);
end
Compute
Combine
Format
Location
Read data
17
datastore
Easily specify data set
– Single text file (or collection of text files)
– Database (using Database Toolbox)
Preview data structure and format
Select data to import
using column names
Incrementally read
subsets of the data
airdata = datastore('*.csv');
airdata.SelectedVariables = {'Distance', 'ArrDelay‘};
data = read(airdata);
18
Large Data Analytics on the Desktop
Expand workspace 64 bit processor support – increased in-memory data set handling
Access portions of data too big to fit into memory Memory mapped variables – huge binary file
Datastore – huge text file or collections of text files
Database – query portion of a big database table
Variety of programming constructs System Objects – analyze streaming data
MapReduce – process text files that won’t fit into memory
Increase analysis speed Parallel for loops – use with multicore/multi-process machines
GPU Arrays
19
Scaled Large Data Analytics
Explore
Prototype
Access Share/Deploy
ComplexityEmbarrassingly
Parallel
Non-
Partitionable
Distributed Memory
SPMD
Load,
Analyze,
Discard
datastore, parfor
Scale
out-of-memory
in-memory
20
Example: Airline Delay Analysis
Data
– BTS/RITA Airline On-Time Statistics
– 123.5M records, 29 fields
Tasks
– Calculate delay patterns
– Visualize summaries
– Estimate & evaluate predictive models
Resources
– Amazon S3 data store
– Amazon EC2 cluster
21
Airline Delay Analysis: Framework
Cluster/Grid/Cloud environment
1987 1988 1989 1990 1991 1992
Instr
uctio
ns
Reduced D
ata
Client
22
Scaling Big Data Capacity
MATLAB supports a number of programming
constructs for use with clusters
General compute clusters Parallel for loops – embarrassingly parallel algorithms
SPMD – distributed processing
Hadoop clusters MapReduce – analyze data stored in the Hadoop Distributed File System
23
Scaled Large Data Analytics
Explore
Prototype
Access Share/Deploy
ComplexityEmbarrassingly
Parallel
Non-
Partitionable
MapReduce
Distributed Memory
SPMD
Load,
Analyze,
Discard
datastore, parfor
Scale
out-of-memory
in-memory
24
1503 UA LAX -5 -10 2356
540 PS BUR 13 5 186
1920 DL BOS 10 32 1876
1840 DL SFO 0 13 568
272 US BWI 4 -2 359
784 PS SEA 7 3 176
796 PS LAX -2 2 237
1525 UA SFO 3 -5 1867
632 PS SJC 2 -4 245
1610 UA MIA 60 34 1365
2032 DL EWR 10 16 789
2134 DL DFW -2 6 914
1503 UA LAX -5 -10 2356
540 PS BUR 13 5 186
1920 DL BOS 10 32 1876
1840 DL SFO 0 13 568
272 US BWI 4 -2 359
784 PS SEA 7 3 176
796 PS LAX -2 2 237
1525 UA SFO 3 -5 1867
632 US SJC 2 -4 245
1610 UA MIA 60 34 1365
2032 DL EWR 10 16 789
2134 DL DFW -2 6 914
UA
PS
DL
DL
2356
186
1876
568
US
PS
PS
UA
US
UA
DL
DL
245
1365
789
914
359
176
237
1867
UA 2356
PS 186
PS 237
UA 1867
UA 1365
DL 1876
DL 914
US 359
US 245
Data Store Map Reduce
mapreduce (in R2014b)
25
Workflow with mapreduce% Specify and format the data
indata = datastore('*.csv');
indata.SelectedVariables = 'Distance';
indata.SelectedFormats = '%f';
function mapfun(data,~,intermed)
% Compute and save intermediate result
maxi = max(data.Distance);
add(intermed,'maxi',maxi);
function reducefun(~,intermed,output)
maxdist = -Inf;
while hasnext(intermed)
maxi = getnext(intermed);
% Combine intermediate results
maxdist = max(maxdist,maxi);
end
add(output,'maxdist',maxdist);
outdata = mapreduce(indata,@mapfun,@reducefun)
Data Store
Map
Reduce
26
mapreduce
Use the powerful MapReduce programming
technique to analyze big data
– Multiple items (keys) to organize and process
– Intermediate results do not fit in memory
On the desktop
– Analyze big database tables (Database Toolbox)
– Increase compute capacity (Parallel Computing Toolbox)
– Access data on HDFS to develop algorithms for use on Hadoop
With Hadoop
– Run on Hadoop using MATLAB Distributed Computing Server
– Deploy applications and libraries for Hadoop using MATLAB Compiler
********************************
* MAPREDUCE PROGRESS *
********************************
Map 0% Reduce 0%
Map 20% Reduce 0%
Map 40% Reduce 0%
Map 60% Reduce 0%
Map 80% Reduce 0%
Map 100% Reduce 25%
Map 100% Reduce 50%
Map 100% Reduce 75%
Map 100% Reduce 100%
27
Data Analytics Landscape
easily
partitioned;
independent
tasks
iterative
all data needed in
memory at once
SMALL Increasing Data Size
SIMPLE
COMPLEX
Algorithm
complexity
More programming
effort required
Built-in
numerical & statistical
algorithms
spmddistributed
arrays
gpuarray
parfor
vectorisationmapreduce
28
Strengths of MATLAB for Large Data Analytics
Challenge MATLAB Solution
Getting started Easy access to data from your desktopTools for accessing typical big data sets consisting of text or binary
files, contained in database tables or stored on Hadoop
Rapid data exploration All the tools to explore and visualize data• Easy to try different methods
• Ideal environment for developing your own methods
Development of scalable
algorithms
Work on the desktop and scale to clustersTools for use in analyzing big data on your desktop, which scale for use
on clusters, including Hadoop, if needed
Use within business
systems
Ease of deployment and leveraging enterprisePush-button deployment into production including support for Hadoop
29
Strengths of MATLAB for Large Data Analytics
Challenge MATLAB Solution
Getting started Easy access to data from your desktop• Tools for accessing typical data sets consisting of text or binary
files, Excel files, contained in database tables.
• Data import from instruments
Rapid data exploration All the tools to explore and visualize data• Easy to try different methods
• Ideal environment for developing your own methods
Development of scalable
algorithms
Work on the desktop and scale to clustersTools for use in analyzing big data on your desktop, which scale for use
on clusters, including cloud, if needed
Use within business
systems
Ease of deployment and leveraging enterprisePush-button deployment into production framework
30
MATLAB Application Development Landscape
Prototyping Programming Deployment
31
MATLAB and Simulink are registered trademarks of The MathWorks, Inc. See
www.mathworks.com/trademarks for a list of additional trademarks. Other
product or brand names may be trademarks or registered trademarks of their
respective holders. © 2014 The MathWorks, Inc.