MATLAB Big Data Course Day 2: MATLAB integrated with Hadoop

Judy Yang, TeraSoft, Inc.

Copyright © 2015 by TeraSoft, Inc. © 2011 The MathWorks, Inc.


Big Data Capabilities in MATLAB

▪ Memory and Data Access
• 64-bit processors
• Memory Mapped Variables
• Disk Variables
• Databases
• Datastores

▪ Platforms
• Desktop (Multicore, GPU)
• Clusters
• Cloud Computing (MDCS on EC2)
• Hadoop

▪ Programming Constructs
• Streaming
• Block Processing
• Parallel-for loops
• GPU Arrays
• SPMD and Distributed Arrays
• MapReduce

Datastores

MapReduce

Hadoop


Access Big Data: datastore

▪ Easily specify data set
• Single text file (or collection of text files)
• Database (using Database Toolbox)
▪ Preview data structure and format
▪ Select data to import using column names
▪ Incrementally read subsets of the data

airdata = datastore('*.csv');
airdata.SelectedVariableNames = {'Distance', 'ArrDelay'};
data = read(airdata);


Analyze Big Data: mapreduce

▪ Use the powerful MapReduce programming technique to analyze big data

• Multiple items (keys) to organize and process

• Intermediate results do not fit in memory

▪ On the desktop

• Analyze big database tables (Database Toolbox)

• Increase compute capacity (Parallel Computing Toolbox)

• Access data on HDFS to develop algorithms for use on Hadoop

▪ With Hadoop

• Run on Hadoop using MATLAB Distributed Computing Server

• Deploy applications and libraries for Hadoop using MATLAB Compiler

********************************

* MAPREDUCE PROGRESS *

********************************

Map 0% Reduce 0%

Map 20% Reduce 0%

Map 40% Reduce 0%

Map 60% Reduce 0%

Map 80% Reduce 0%

Map 100% Reduce 25%

Map 100% Reduce 50%

Map 100% Reduce 75%

Map 100% Reduce 100%


datastore


Outline

▪ What is a datastore?

▪ Types of data that can be handled using a datastore

▪ Examples illustrating how to create a datastore

• Example # 1: Airline Data

• Example # 2: NCDC Weather Data


datastore

A datastore is an object useful for reading collections of data that are too large to fit in memory.

Files: 2,545

File Size: 100+ KB to 2+ GB


Types of Supported Data

Type of Files                                               Datastore
Collections of text files                                   TabularTextDatastore
Files containing key-value pair data as part of mapreduce   KeyValueDatastore
Data in a relational database                               DatabaseDatastore


Syntax

>> ds = datastore(location)

>> ds = datastore(location, Name, Value)

Properties

Methods
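The properties and methods named on this slide can be tried on a small file. The following is a minimal sketch, not the course's own demo: the CSV contents, the folder, and the variable names `folder`, `T`, `maxDistance`, and `allRows` are made up for illustration. It uses the `SelectedVariableNames` and `ReadSize` properties and the `preview`, `hasdata`, `read`, `reset`, and `readall` methods of a tabular text datastore.

```matlab
% Write a tiny stand-in CSV (hypothetical data, not the real airline data set).
folder = tempname;
mkdir(folder);
T = table({'AA'; 'UA'; 'AA'; 'DL'}, [1200; 800; 2475; 950], [5; -3; 12; 0], ...
    'VariableNames', {'UniqueCarrier', 'Distance', 'ArrDelay'});
writetable(T, fullfile(folder, 'flights.csv'));

% Create the datastore; .csv files yield a TabularTextDatastore.
ds = datastore(fullfile(folder, '*.csv'));

% Properties: choose the columns to import and the rows returned per read.
ds.SelectedVariableNames = {'Distance', 'ArrDelay'};
ds.ReadSize = 2;

% Methods: preview shows the first rows without advancing the read position.
preview(ds)

% Incremental calculation: only one chunk is held in memory at a time.
maxDistance = -Inf;
while hasdata(ds)
    chunk = read(ds);                      % table with up to ReadSize rows
    maxDistance = max(maxDistance, max(chunk.Distance));
end

% Rewind, then read everything at once (only sensible if it fits in memory).
reset(ds);
allRows = readall(ds);
```

The `while hasdata(ds)` loop is the pattern to reach for when the whole collection does not fit in memory; `readall` is shown only because the stand-in file is tiny.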


Example # 1: Airline Data

http://stat-computing.org/dataexpo/2009/the-data.html


Example # 2: NCDC Weather Data


Format of NCDC Weather Data


Format

Y -> 4 characters T -> 5 characters


Learning Outcomes

▪ Describe what a datastore object is and decide when to use it
▪ Create a datastore object by reading in data from a tabular text file
▪ Manipulate the datastore object using various methods and properties
▪ Perform calculations using the datastore object
▪ Determine the limitations of the datastore object


MATLAB MapReduce on the Desktop


Outline

▪ MapReduce programming model

▪ Steps involved in implementing MapReduce

▪ Concept of keys and values

▪ Executing MapReduce in a “serial” MATLAB environment
• Example # 1: Airline Data
• Example # 2: NCDC Weather Data
▪ Executing MapReduce in a “parallel” environment using the Parallel Computing Toolbox


Find the Longest Flight Distance for Each Airline

Find the longest flight distance for each commercial airline in the U.S.

Airline Data


mapreduce

A programmatic framework for analyzing data sets that do not fit into memory.

MapReduce is a good fit for problems that need to analyze the whole dataset in a batch fashion.

MapReduce programs are inherently parallel.

MapReduce addresses the scalability problem when working with large datasets, and not necessarily the performance problem.



Map Phase in MapReduce


Reduce Phase in MapReduce

From Map Phase


Code Files Needed to Get Started

▪ main script

▪ mapper function

▪ reducer function


Example # 1: Airline Data


Example # 2: NCDC Weather Data

i. Find the maximum temperature for a particular year using a collection of weather station files (Simple).

ii. Create a single file containing all the weather data for a particular year from multiple weather station files (Advanced).

iii. Find the maximum temperature for a particular year using a single weather data file (Simple).


Mapper Function Signature

function fooMapper(data, info, intermKVStore)

Inputs          Description
data & info     Result from an automatic call to the read function by mapreduce.
intermKVStore   Name of the intermediate KeyValueStore object to which the mapper adds key-value pairs using the add or addmulti functions.

Requirements for Key-Value Pairs*:

1. Keys must be numeric scalars or strings.

2. All keys added by the mapper function must have the same class.

3. Values can be any valid MATLAB object or data type.

* Some requirements may differ when using other products with mapreduce
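As an illustration of this signature, here is a minimal sketch of a mapper for the longest-flight-distance problem. It is not the course's own code: the function name `maxDistanceMapper` is hypothetical, and it assumes each chunk is a table with a text column `UniqueCarrier` (read as a cell array of character vectors) and a numeric column `Distance`.

```matlab
function maxDistanceMapper(data, ~, intermKVStore)
% data:          table chunk produced by mapreduce's automatic call to read
% (second input) info from read, unused in this sketch
% intermKVStore: intermediate KeyValueStore that receives the key-value pairs

% One key per carrier found in this chunk (all keys are character vectors).
[carriers, ~, group] = unique(data.UniqueCarrier);

% Longest distance per carrier within this chunk only.
chunkMax = accumarray(group, data.Distance, [], @max);

% addmulti adds several pairs at once: a cell array of keys
% and a cell array of values of the same length.
addmulti(intermKVStore, carriers, num2cell(chunkMax));
end
```

Emitting one pre-aggregated pair per carrier per chunk, rather than one pair per row, keeps the intermediate store small; the reducer then only has to combine the per-chunk maxima.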


Reducer Function Signature

function fooReducer(intermKey, intermValIter, outKVStore)

Inputs Description

intermKey       One of the unique keys added by the mapper function. Each call to the reducer by mapreduce specifies a new key from the keys in the intermediate KeyValueStore object.
intermValIter   The ValueIterator object associated with the active key, intermKey. The ValueIterator object contains all of the values associated with the active key. Scroll through them using the hasnext and getnext functions.
outKVStore      The final KeyValueStore object to which the reducer adds key-value pairs using the add or addmulti functions.

Requirements for Key-Value Pairs*:

1. Keys must be numeric scalars or strings.

2. All keys added by the reducer function must have the same class.

3. If the OutputType argument of mapreduce is 'Binary', values added by the reducer can be any valid MATLAB object or data type.

4. If the OutputType argument of mapreduce is 'TabularText', values added by the reducer can be a numeric scalar or a string.

* Some requirements may differ when using other products with mapreduce
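A reducer following this signature can be sketched as below. The name `maxDistanceReducer` is hypothetical, and the sketch assumes the mapper emitted a carrier code as the key and numeric flight distances as the values; it is an illustration, not the course's own code.

```matlab
function maxDistanceReducer(intermKey, intermValIter, outKVStore)
% intermKey:     the active key, here assumed to be a carrier code
% intermValIter: ValueIterator over every value the mapper stored for that key
% outKVStore:    final KeyValueStore that receives the result

longest = -Inf;

% Walk through the values one at a time; they are never all in memory at once.
while hasnext(intermValIter)
    longest = max(longest, getnext(intermValIter));
end

% One output pair per key: the carrier and its longest distance.
add(outKVStore, intermKey, longest);
end
```

mapreduce calls this function once per unique key, so the running maximum is reset for each carrier.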


mapreduce Execution Syntax

outds = mapreduce(ds, @fooMapper, @fooReducer)
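To show the three pieces (main script, mapper, reducer) working together, here is a minimal end-to-end sketch in the spirit of the maximum-temperature example. Everything specific in it is an assumption for illustration: the simplified CSV with `Year` and `Temperature` columns stands in for the real NCDC format, and `maxTempMapper` and `maxTempReducer` are hypothetical names. Local functions in a script require a MATLAB release that supports them (R2016b or later); otherwise save the two functions in their own files.

```matlab
% Main script: build a tiny stand-in data set (made-up values).
dataFolder = tempname;
mkdir(dataFolder);
obs = table([1901; 1901; 1902; 1902; 1902], [31.7; 28.9; 24.4; 33.3; 30.0], ...
    'VariableNames', {'Year', 'Temperature'});
writetable(obs, fullfile(dataFolder, 'weather.csv'));

ds = datastore(fullfile(dataFolder, 'weather.csv'));
ds.ReadSize = 2;              % small chunks so the mapper runs several times

outFolder = tempname;         % folder for the output files
mkdir(outFolder);

mapreducer(0);                % run in the serial MATLAB session
outds = mapreduce(ds, @maxTempMapper, @maxTempReducer, ...
    'OutputFolder', outFolder, 'Display', 'off');

result = readall(outds);      % table with variables Key and Value
disp(result)

function maxTempMapper(data, ~, intermKVStore)
    % Key: year (numeric scalar). Value: hottest reading in this chunk.
    [years, ~, group] = unique(data.Year);
    chunkMax = accumarray(group, data.Temperature, [], @max);
    addmulti(intermKVStore, years, num2cell(chunkMax));
end

function maxTempReducer(year, tempIter, outKVStore)
    % Combine the per-chunk maxima for one year into a single value.
    hottest = -Inf;
    while hasnext(tempIter)
        hottest = max(hottest, getnext(tempIter));
    end
    add(outKVStore, year, hottest);
end
```

The output of mapreduce is itself a datastore, so results are read back with the same `read` and `readall` methods used for the input.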


MapReduce using Parallel Computing Toolbox

1. Open a parallel pool

>> p = parpool('local', 4);

2. Define execution environment for MapReduce

>> mr = mapreducer(p)

>> outds = mapreduce(ds, @fooMapper, @fooReducer, mr)

How to always run mapreduce in serial MATLAB?

>> mapreducer(0)


Learning Outcomes

After watching this module you should now be able to:

▪ Diagram the MapReduce programming model using an example and list the key steps involved
▪ Choose appropriate key/value pairs for computation
▪ Prepare data for MapReduce implementation using the datastore object
▪ Write mapper and reducer functions as part of the MapReduce model
▪ Link the mapper and reducer functions using the mapreduce API to perform computations
▪ Execute MapReduce in a “serial” and “parallel” environment


Appendix


datastore


Syntax

>> ds = datastore(location)

>> ds = datastore(location, Name, Value)

>> ds = datastore('hdfs://esa-cluster-ns/user/matlab/xxx.csv');

>> data = read(ds);

Note: esa-cluster-ns is the FQDN of the Hadoop master.


Deploy MATLAB Code Against Hadoop


Outline

▪ Supported Platform for Deployment

▪ Options to deploy MATLAB code against Hadoop

• Deployable Archive

• Standalone Application

▪ Examples


Supported Platform for Deployment against Hadoop

▪ Linux Only


Options to Deploy MATLAB code Against Hadoop

▪ Deployment Option # 1: Deployable Archive
• Use Hadoop Scheduling framework
• Integrate MATLAB code

▪ Deployment Option # 2: Standalone Application
• Create a standalone application that runs against Hadoop


Option 1: Deployable Archive

A deployable archive includes a datastore, a map function, and a reduce function.

▪ Write Mapper and Reducer functions in MATLAB
• Mapper function
• Reducer function
• MAT file containing a datastore
▪ Setup Hadoop environment
▪ Install MATLAB runtime
▪ Copy input data to HDFS
▪ Use Hadoop Compiler app to create Deployable Archive

A Hadoop-specific script is also created along with the deployable archive (CTF); it provides all the components needed to directly run your archive on Hadoop.


Hadoop Compiler app


Copying Files from Local Folder to Hadoop File System (HDFS)

$ scp /home/matlab/work/for_redistributing/* matlab@master2:/home/matlab/work/projectname/

Reference:

http://linux.vbird.org/linux_server/0310telnetssh.php#scp

$ scp <local file location> [username@]remotehost:/some/remote/directory


Executing a Deployable Archive

$ ./run_deployArchive.sh <mcr_directory> [hadoop properties] <input_files> <output_folder>

<mcr_directory>: the directory where MATLAB runtime is installed, or the directory where MATLAB is installed on the machine.

[hadoop properties]: optional properties, given as -D property_name=value.

<input_files>: all the data files that need to be processed by the application.

<output_folder>: the location where Hadoop writes the results, in a new output folder. The application will fail if the output directory already exists.


Retrieving Results

/part-r-00000

/part-r-00001

/..

>> ds = datastore('hdfs://esa-cluster-ns/user/matlab/output/deployableArchiveProject-output/part*','DatastoreType','keyvalue');

>> result = readall(ds);

$ hadoop fs -cat hdfs://esa-cluster-ns/user/matlab/output/deployableArchiveProject-output/part*


Q&A