Copyright © 2015 by TeraSoft, Inc. © 2011 The MathWorks, Inc.
MATLAB Big Data Course Day 2:
MATLAB integrated with Hadoop
Judy Yang, TeraSoft, Inc.
Big Data Capabilities in MATLAB
Memory and Data Access
• 64-bit processors
• Memory-Mapped Variables
• Disk Variables
• Databases
• Datastores

Platforms
• Desktop (Multicore, GPU)
• Clusters
• Cloud Computing (MDCS on EC2)
• Hadoop

Programming Constructs
• Streaming
• Block Processing
• Parallel for-loops
• GPU Arrays
• SPMD and Distributed Arrays
• MapReduce

This course focuses on: Datastores, MapReduce, and Hadoop.
Access Big Data: datastore
▪ Easily specify data set
• Single text file (or collection of text files)
• Database (using Database Toolbox)
▪ Preview data structure and format
▪ Select data to import using column names
▪ Incrementally read subsets of the data
airdata = datastore('*.csv');
airdata.SelectedVariables = {'Distance', 'ArrDelay'};
data = read(airdata);
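Because read returns one block at a time, large collections can be processed in a loop. A minimal sketch, assuming the same airline CSV files, accumulates a running maximum without loading the full dataset:

```matlab
% Hedged sketch: process the airline files block by block,
% keeping only a running maximum in memory.
airdata = datastore('*.csv');
airdata.SelectedVariables = {'Distance', 'ArrDelay'};
maxDistance = 0;
while hasdata(airdata)
    data = read(airdata);                        % next block as a table
    maxDistance = max(maxDistance, max(data.Distance));
end
```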
Analyze Big Data: mapreduce
▪ Use the powerful MapReduce programming technique to analyze big data
• Multiple items (keys) to organize and process
• Intermediate results do not fit in memory
▪ On the desktop
• Analyze big database tables (Database Toolbox)
• Increase compute capacity (Parallel Computing Toolbox)
• Access data on HDFS to develop algorithms for use on Hadoop
▪ With Hadoop
• Run on Hadoop using MATLAB Distributed Computing Server
• Deploy applications and libraries for Hadoop using MATLAB Compiler
********************************
* MAPREDUCE PROGRESS *
********************************
Map 0% Reduce 0%
Map 20% Reduce 0%
Map 40% Reduce 0%
Map 60% Reduce 0%
Map 80% Reduce 0%
Map 100% Reduce 25%
Map 100% Reduce 50%
Map 100% Reduce 75%
Map 100% Reduce 100%
datastore
Outline
▪ What is a datastore?
▪ Types of data that can be handled using a datastore
▪ Examples illustrating how to create a datastore
• Example # 1: Airline Data
• Example # 2: NCDC Weather Data
datastore
A datastore is an object useful for reading collections of
data that are too large to fit in memory.
Files: 2,545
File Size: 100+ KB to 2+ GB
Types of Supported Data
Type of Files                                               Datastore
Collections of text files                                   TabularTextDatastore
Files containing key-value pair data as part of mapreduce   KeyValueDatastore
Data in a relational database                               DatabaseDatastore
Syntax
>> ds = datastore(location)
>> ds = datastore(location, Name, Value)
Properties
Methods
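As an illustration of the Name-Value form (the file name here is a placeholder, not from the slides), options can be set at creation time and the resulting object exposes properties and methods:

```matlab
% Illustrative sketch; 'airline.csv' is a hypothetical file name.
ds = datastore('airline.csv', ...
               'TreatAsMissing', 'NA', ...   % treat 'NA' tokens as missing
               'ReadSize', 20000);           % rows returned per read call
disp(ds.VariableNames)    % a property: column names detected in the file
data = preview(ds);       % a method: peek at the first few rows
```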
Example # 1: Airline Data
http://stat-computing.org/dataexpo/2009/the-data.html
Example # 2: NCDC Weather Data
Format of NCDC Weather Data
Format
Y -> 4 characters, T -> 5 characters
Learning Outcomes
▪ Describe what a datastore object is and decide when to use it
▪ Create a datastore object by reading in data from a tabular text file
▪ Manipulate the datastore object using various methods and properties
▪ Perform calculations using the datastore object
▪ Determine the limitations of the datastore object
MATLAB MapReduce on the Desktop
Outline
▪ MapReduce programming model
▪ Steps involved in implementing MapReduce
▪ Concept of keys and values
▪ Executing MapReduce in a “serial” MATLAB environment
• Example # 1: Airline Data
• Example # 2: NCDC Weather Data
▪ Executing MapReduce in a “parallel” environment using the Parallel Computing Toolbox
Find the Longest Flight Distance for Each Airline
Find the longest flight distance for each commercial airline in the U.S. (Airline Data)
mapreduce
A programmatic framework for analyzing data sets that do not fit into memory.

MapReduce is a good fit for problems that need to analyze the whole dataset in a batch fashion.

MapReduce programs are inherently parallel.

MapReduce addresses the scalability problem when working with large datasets, not necessarily the performance problem.
Map Phase in MapReduce
Reduce Phase in MapReduce
Code Files Needed to Get Started
▪ main script
▪ mapper function
▪ reducer function
Example # 1: Airline Data
Example # 2: NCDC Weather Data
i. Find the maximum temperature for a particular year using a collection of weather station files (Simple).
ii. Create a single file containing all the weather data for a particular year from multiple weather station files (Advanced).
iii. Find the maximum temperature for a particular year using a single weather data file (Simple).
Mapper Function Signature
function fooMapper(data, info, intermKVStore)
Inputs          Description
data, info      The result of an automatic call to the read function by mapreduce.
intermKVStore   Name of the intermediate KeyValueStore object to which the mapper adds key-value pairs using the add or addmulti functions.
Requirements for Key-Value Pairs*:
1. Keys must be numeric scalars or strings.
2. All keys added by the mapper function must have the same class.
3. Values can be any valid MATLAB object or data type.
* Some requirements may differ when using other products with mapreduce
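As a concrete, hypothetical instance of this signature, a mapper for the airline problem might emit each carrier's maximum distance within the current block. The variable names UniqueCarrier and Distance are assumptions based on the public airline on-time data set, not taken from these slides:

```matlab
function maxDistMapper(data, info, intermKVStore)
    % data is the table that mapreduce read from the datastore;
    % group the rows of this block by carrier code.
    [carriers, ~, idx] = unique(data.UniqueCarrier);
    for k = 1:numel(carriers)
        % key = carrier name, value = max distance seen in this block
        add(intermKVStore, carriers{k}, max(data.Distance(idx == k)));
    end
end
```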
Reducer Function Signature
function fooReducer(intermKey, intermValIter, outKVStore)
Inputs          Description
intermKey       One of the unique keys added by the mapper function. Each call to the reducer by mapreduce specifies a new key from the keys in the intermediate KeyValueStore object.
intermValIter   The ValueIterator object associated with the active key, intermKey. It contains all of the values associated with the active key. Scroll through them using the hasnext and getnext functions.
outKVStore      The final KeyValueStore object to which the reducer adds key-value pairs using the add or addmulti functions.

Requirements for Key-Value Pairs*:
1. Keys must be numeric scalars or strings.
2. All keys added by the reducer function must have the same class.
3. If the OutputType argument of mapreduce is 'Binary', values added by the reducer can be any valid MATLAB object or data type.
4. If the OutputType argument of mapreduce is 'TabularText', values added by the reducer must be numeric scalars or strings.
* Some requirements may differ when using other products with mapreduce
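A matching, hypothetical reducer for the airline example scrolls through all partial maxima collected for one carrier and emits the overall maximum:

```matlab
function maxDistReducer(intermKey, intermValIter, outKVStore)
    maxDist = -inf;
    while hasnext(intermValIter)          % iterate over this key's values
        maxDist = max(maxDist, getnext(intermValIter));
    end
    add(outKVStore, intermKey, maxDist);  % one final pair per carrier
end
```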
mapreduce Execution Syntax
outds = mapreduce(ds, @fooMapper, @fooReducer)
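This execution call slots into a short main script. A sketch, using the slides' placeholder function names fooMapper and fooReducer and the airline CSV pattern shown earlier:

```matlab
% Sketch of a typical main script; fooMapper and fooReducer stand in
% for your own mapper and reducer functions.
ds = datastore('*.csv');                         % point at the input files
outds = mapreduce(ds, @fooMapper, @fooReducer);  % run map, then reduce
result = readall(outds);  % bring the final key-value pairs into memory
```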
MapReduce using Parallel Computing Toolbox
1. Open a parallel pool
>> p = parpool('local', 4);
2. Define execution environment for MapReduce
>> mr = mapreducer(p)
>> outds = mapreduce(ds, @fooMapper, @fooReducer, mr)
To always run mapreduce in serial MATLAB:
>> mapreducer(0)
Learning Outcomes
After watching this module you should now be able to:
▪ Diagram the MapReduce programming model using an example and list the key steps involved
▪ Choose appropriate key-value pairs for computation
▪ Prepare data for a MapReduce implementation using the datastore object
▪ Write mapper and reducer functions as part of the MapReduce model
▪ Link the mapper and reducer functions using the mapreduce API to perform computations
▪ Execute MapReduce in “serial” and “parallel” environments
Appendix
datastore
Syntax
>> ds = datastore(location)
>> ds = datastore(location, Name, Value)
>> ds = datastore('hdfs://esa-cluster-ns/user/matlab/xxx.csv');
>> data = read(ds);
Note: esa-cluster-ns is the FQDN of the Hadoop master node.
Deploy MATLAB Code Against Hadoop
Outline
▪ Supported Platform for Deployment
▪ Options to deploy MATLAB code against Hadoop
• Deployable Archive
• Standalone Application
▪ Examples
Supported Platform for Deployment Against Hadoop
▪ Linux Only
Options to Deploy MATLAB Code Against Hadoop

Option # 1: Deployable Archive
• Use the Hadoop scheduling framework
• Integrate MATLAB code

Option # 2: Standalone Application
• Create a standalone application that runs against Hadoop
Option 1: Deployable Archive
A deployable archive includes a datastore, a map function, and a
reduce function.
▪ Write Mapper and Reducer functions in MATLAB
• Mapper function
• Reducer function
• MAT file containing a datastore
▪ Setup Hadoop environment
▪ Install MATLAB runtime
▪ Copy input data to HDFS
▪ Use the Hadoop Compiler app to create the deployable archive
A Hadoop-specific script is also created along with the deployable archive (CTF) that provides all the components needed to directly run your archive on Hadoop.
Hadoop Compiler app
Copying Files from Local Folder to Hadoop File
System (HDFS)
scp /home/matlab/work/for_redistributing/*
matlab@master2:/home/matlab/work/projectname/
Reference:
http://linux.vbird.org/linux_server/0310telnetssh.php#scp
$ scp <local file location> [username@]remotehost:/some/remote/directory
Executing a Deployable Archive
$ ./run_deployArchive.sh <mcr_directory> [hadoop properties] <input_files> <output_folder>

<mcr_directory> is the directory where the MATLAB runtime is installed, or the directory where MATLAB is installed on the machine.
[hadoop properties] are optional properties, each given as -D property_name=value.
<input_files> are all the data files that need to be processed by the application.
<output_folder> is the location where Hadoop writes the results into a new output folder. The application will fail if the output directory already exists.
Retrieving Results
/part-r-00000
/part-r-00001
/..
>> ds = datastore('hdfs://esa-cluster-ns/user/matlab/output/deployableArchiveProject-output/part*', 'DatastoreType', 'keyvalue');
>> result = readall(ds);

$ hadoop fs -cat hdfs://esa-cluster-ns/user/matlab/output/deployableArchiveProject-output/part*
Q&A