2^48 - No Fear of Large Data Sets in MATLAB

Rainer Mümmler, Application Engineering Group

9 July 2014

© 2014 The MathWorks, Inc.

Challenges with Large Data Sets

“Out of memory”

– Running out of address space

Slow processing

– Data too large to be efficiently managed between RAM and virtual memory

– Lots of data to process

Gaining insight

– Large data visualization

– Modeling with no equation and lots of predictors

Agenda

Available system memory

Memory usage in MATLAB

Techniques for processing large data sets

System Memory

System Memory = RAM + Swap/Page on Disk

Virtual Memory

– Process sees contiguous block of memory

– Memory actually divided between RAM and disk (swap/page file)

– OS maps virtual address to physical address

General guidelines:

– Add RAM, possibly swap space

– If thrashing, consider alternative approaches

[Figure: virtual memory (per process) mapped onto RAM and disk]

Memory and Your Operating System

32-bit operating systems

– 4GB of addressable memory per process

– Part of it is reserved by the OS, leaving the application < 4GB

64-bit operating systems

– In theory, can address about 18 exabytes (2^64 bytes) of memory

– Determined by OS and processor

– Essentially limited by the amount of RAM and disk available on the computer

Use 64-bit OS, if possible

[Figure: memory per process, split into the part available for the process and the part reserved by the operating system]
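As a rough worked example of the numbers on this slide, the sketch below computes the size of the address space for a given number of address bits. The 48-bit row is an assumption added for illustration: many x86-64 processors implement 48-bit virtual addresses, which is presumably what the 2^48 in the title refers to.

```matlab
% Address-space size for a given number of address bits.
bits   = [32 48 64];     % 48 is an assumption: a common x86-64 virtual address width
nBytes = 2 .^ bits;      % powers of two are represented exactly in double precision

for k = 1:numel(bits)
    fprintf('%2d-bit: %.4g bytes = %.4g GiB\n', ...
            bits(k), nBytes(k), nBytes(k) / 2^30);
end
```

The 32-bit row gives the 4 GB per process quoted above, and the 64-bit row gives the theoretical limit of roughly 18 exabytes.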

Memory Management in MATLAB

Preallocate arrays

– Large matrices first

Clear variables when no longer needed

Check memory available (Windows only)

>> memory

Control contiguous memory with startup switch (Windows only)

C:\> matlab -shield medium

[Figure: MATLAB process address space with already allocated regions and the arrays x1 = zeros(25,1), x2 = zeros(50,1), x3 = zeros(25,1), x4 = zeros(100,1)]
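A minimal sketch of the first two recommendations, preallocating and clearing. The array size and the loop body are illustrative choices, not taken from the slides.

```matlab
n = 1e5;                 % illustrative size
y = zeros(n, 1);         % preallocate once: one contiguous block of n doubles

for k = 1:n
    y(k) = k^2;          % fills existing memory; the array never has to grow
end

tmp = rand(n, 1);        % a large temporary that is only needed briefly
total = sum(tmp);
clear tmp                % release the temporary as soon as it is no longer needed
```

Without the call to zeros, y would grow on every iteration, and each growth step may require a new, larger contiguous block.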

Data Copies: Function Calls

Data is “copy-on-write” (lazy copy)

Passed by reference into the function

If not modified, no copy is made (a is not copied):

function y = foo(x,a,b)
    y = a * x + b;
end

If modified, a temporary copy is made (a is copied):

function y = foo(x,a,b)
    a(1) = a(1) + 12;
    y = a * x + b;
end

Data Copies: In-Place Optimizations

MATLAB performs calculations “in-place” when:

– Output variable name is the same as input variable name

– Performing element-wise computation

y = 2*x + 3;    % not in-place

x = 2*x + 3;    % in-place
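The two statements can be compared side by side as in the sketch below. Whether MATLAB actually reuses the memory of x depends on the release and on where the code runs, so this only shows the pattern; it does not measure memory.

```matlab
x  = rand(1000, 1);
x0 = x;              % keep the original values (lazy copy) for comparison only

y = 2*x + 3;         % not in-place: a new array y is created, x stays as it was
x = 2*x + 3;         % in-place candidate: output name equals input name,
                     % element-wise operation

sameResult = isequal(x, y);   % both forms compute the same values
```

Note that keeping x0 around means the old contents of x are still referenced, which itself prevents reuse of that memory; in real code the copy would be left out.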

Techniques for Minimizing Data Copies

In-place operations, if possible

Nested functions

– Share the workspace of all outer functions

– Avoids making temporary copies of input arguments

For objects, consider handle classes

– Copy of a handle object refers to the same object as the original handle
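The handle behaviour can be tried without writing a class file. The sketch below uses containers.Map, a built-in handle class, as a stand-in for a user-defined handle class; the key name 'data' is an illustrative choice.

```matlab
m1 = containers.Map('KeyType', 'char', 'ValueType', 'any');
m1('data') = zeros(1, 3);

m2 = m1;                     % copies the handle only; both names refer to one object
m2('data') = ones(1, 3);     % change made through the second handle

viaFirst = m1('data');       % the change is visible through the first handle too
isHandle = isa(m1, 'handle');
```

With a value class, the assignment m2 = m1 followed by a modification would instead trigger a copy of the stored data.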

Using Appropriate Data Storage

Numerical data types

– Floating point for math (e.g. linear algebra)

– Integers where appropriate (e.g. images)

Cells and structures

Sparse arrays

Categorical arrays
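To illustrate the effect of the numeric type, the sketch below stores the same number of elements as double, single and uint8 and asks whos for the size in bytes. The array size is an arbitrary choice.

```matlab
a = zeros(1000, 1000);              % double: 8 bytes per element
b = zeros(1000, 1000, 'single');    % single: 4 bytes per element
c = zeros(1000, 1000, 'uint8');     % uint8:  1 byte per element (typical for images)

sa = whos('a');
sb = whos('b');
sc = whos('c');
fprintf('double: %d, single: %d, uint8: %d bytes\n', sa.bytes, sb.bytes, sc.bytes);
```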

How does MATLAB store data? Container overhead*

d = [1 2]

– d: Header (112), Data

dcell = {[1 2]}

– dcell: Header (112), Cell Header (112), Data

dstruct.d = [1 2]

– dstruct: Header (112), Fieldname (64), Element Header (112), Data

* Sizes in bytes, using values for 64-bit MATLAB
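The overhead can be checked with whos, as in the sketch below. The exact byte counts depend on the MATLAB release, so the code only compares the containers with the plain array and makes no claim about specific header sizes.

```matlab
d         = [1 2];       % plain double array: 2 x 8 bytes of data
dcell     = {[1 2]};     % the same data wrapped in a cell
dstruct.d = [1 2];       % the same data wrapped in a struct field

sd = whos('d');
sc = whos('dcell');
ss = whos('dstruct');
fprintf('array: %d, cell: %d, struct: %d bytes\n', sd.bytes, sc.bytes, ss.bytes);
```

For many small elements the per-element overhead dominates, which is why one large numeric array is usually cheaper than a cell array or struct array holding the same values.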

Sparse Matrices

Require less memory and are faster to compute with when most elements are zero

When to use sparse?

– Density below 1/2 on 64-bit (double precision)

– Density below 2/3 on 32-bit (double precision)

Functions that support sparse matrices

>> help sparfun

Blog Post: Creating Sparse Finite Element Matrices http://blogs.mathworks.com/loren/2007/03/01/creating-sparse-finite-element-matrices-in-matlab/
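The thresholds follow from the storage cost: on 64-bit MATLAB a sparse double matrix needs about 16 bytes per nonzero (8 for the value, 8 for the row index) against 8 bytes per element for a full matrix, so sparse storage wins below a density of about 1/2; with 4-byte indices on 32-bit the cost is 12 bytes per nonzero, giving 2/3. A minimal sketch comparing both storage forms, with the size and density chosen only for illustration:

```matlab
n       = 2000;
density = 0.01;                  % illustrative value, far below the 1/2 threshold

S = sprand(n, n, density);       % sparse random matrix
F = full(S);                     % the same values in dense storage

sS = whos('S');
sF = whos('F');
fprintf('sparse: %d bytes, full: %d bytes\n', sS.bytes, sF.bytes);
```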

Processing Large Data Sets

Break your large data into separate pieces and process independently

– Partial reading and writing of files

– Built-in functionality for block-processing

– System Objects for stream processing (signals, videos)

Use the whole dataset at once

– Single array across memory of multiple machines

Reading in Part of a Dataset from Files

ASCII file

– Import Tool, textscan

MAT file

– Load and save part of a variable using matfile

Binary file

– Read and write directly to/from file using memmapfile

– Maps address space to file

Databases (with Database Toolbox)

– ODBC and JDBC-compliant (e.g. Oracle, MySQL, Microsoft SQL Server)

– Database Explorer App
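A minimal sketch of partial loading from a MAT-file with matfile. The temporary file and the variable name A are made up for the example; the file is saved in -v7.3 format, which is the format that supports efficient partial access.

```matlab
fname = [tempname '.mat'];       % hypothetical scratch file for this example
A = magic(100);
save(fname, 'A', '-v7.3');       % v7.3 MAT-files support partial reads

m = matfile(fname);              % creates a reference to the file; no data loaded yet
[nRows, nCols] = size(m, 'A');   % query the size without loading the variable
block = m.A(1:10, 1:5);          % read only a 10-by-5 block from disk

clear m
delete(fname);                   % tidy up the scratch file
```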

Summary Examples: Reading in Part of a Dataset from Files

ASCII file

– Import Tool, textscan

MAT file

– Load and save part of a variable using matfile

Binary file

– Read and write directly to/from file using memmapfile

– Maps address space to file

Only read/write parts of datasets, and not the whole file
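For binary files, a sketch along the same lines with memmapfile. The scratch file and its contents (the numbers 1 to 1000 stored as doubles) are made up for the example.

```matlab
fname = [tempname '.bin'];       % hypothetical scratch file for this example
fid = fopen(fname, 'w');
fwrite(fid, 1:1000, 'double');   % write 1000 doubles
fclose(fid);

mm = memmapfile(fname, 'Format', 'double');   % map the file into the address space
chunk = mm.Data(101:110);                     % only the touched part is read from disk

clear mm                         % release the mapping before deleting the file
delete(fname);
```

The mapped data is indexed like an ordinary array, so existing code that works on a vector can often be pointed at mm.Data with few changes.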

Block Processing Images

blockproc automatically divides an image into blocks for processing

Reduces memory usage

– Read and write block directly from image file

Processes arbitrarily large images

Available from Image Processing Toolbox
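A small sketch of blockproc on an in-memory array, assuming the Image Processing Toolbox is installed. The block size and the per-block function (replace each block by its mean) are illustrative choices; for very large images the first argument would typically be a file name, so that blocks are read from the file instead of from memory.

```matlab
I = rand(256);                                   % stand-in for an image
blockSize = [64 64];                             % illustrative block size

% The function receives a block struct; its .data field holds the pixels.
fun = @(blk) mean(blk.data(:)) * ones(size(blk.data));

J = blockproc(I, blockSize, fun);                % processed block by block
firstBlockMean = mean(reshape(I(1:64, 1:64), [], 1));
```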

Batch processing

Load the entire file and process it all at once

Stream processing

Load a frame and process it before moving on to the next frame

[Figure: batch processing (source, memory, algorithm) compared with stream processing (stream source, stream processing) in MATLAB memory]
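The stream idea can be sketched in plain MATLAB with fread, without System objects: read one frame, update a running result, move on. The scratch file, the frame length and the running mean are all illustrative assumptions.

```matlab
fname = [tempname '.bin'];       % hypothetical scratch file for this example
fid = fopen(fname, 'w');
fwrite(fid, 1:10000, 'double');
fclose(fid);

frameLen = 1024;                 % illustrative frame length
total = 0;                       % state carried from frame to frame
count = 0;

fid = fopen(fname, 'r');
while true
    frame = fread(fid, frameLen, 'double');   % load one frame only
    if isempty(frame)
        break;                                % end of file reached
    end
    total = total + sum(frame);               % process the frame, update the state
    count = count + numel(frame);
end
fclose(fid);
delete(fname);

streamMean = total / count;      % same result as the mean of the whole file
```

At no point is more than one frame of data held in memory, which is the property that System objects provide in a packaged form.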

System Objects

A class of MATLAB objects that support streaming workflows

Simplifies data access for streaming applications

– Manages flow of data from files or network

– Handles data indexing and buffering

Contain algorithms to work with streaming data

– Manages algorithm state

– Available for Signal Processing, Communications, Video Processing, and Phased Array Applications

Available from DSP System Toolbox, Communications System Toolbox, Computer Vision System Toolbox, Phased Array System Toolbox

Processing Large Data Sets

Break your large data into separate pieces and process independently

– Partial reading and writing of files

– Built-in functionality for block-processing

– System Objects for stream processing

Use the whole dataset at once

– Single array across memory of multiple machines

Distributing Large Data

Distributed Array

– Lives on the workers

– Remotely manipulate array from client

[Figure: MATLAB desktop (client) connected to four workers; an array with columns 11 to 22, 26 to 37 and 41 to 52 is split across the workers]

Available from Parallel Computing Toolbox, MATLAB Distributed Computing Server

Using Distributed Arrays: Regular MATLAB code
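The code shown on the original slide did not survive extraction. As a stand-in, the sketch below solves a linear system with distributed arrays, assuming Parallel Computing Toolbox and an open parallel pool; the problem size and the right-hand side are made up for the example.

```matlab
n = 500;                         % illustrative size; real use cases are far larger

A = rand(n, 'distributed');      % created directly on the workers
b = sum(A, 2);                   % chosen so that the exact solution is all ones

x = A \ b;                       % same syntax as for ordinary arrays
xClient = gather(x);             % bring the (small) result back to the client

maxErr = norm(xClient - 1, inf);
```

Apart from the 'distributed' option and the call to gather, the code is what one would write for arrays held in client memory.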

Investigation: Distributed Calculations

Effect of number of computers on execution time

Processor: Intel Xeon E5-2670, 16 cores, 60 GB RAM per compute node, 10 Gigabit Ethernet

Time (s); a dash means no time was reported

N        1 node, multi-threaded    Distributed, 2 nodes, 32W    Distributed, 4 nodes, 64W
4000     2                         3                            3
8000     16                        14                           12
16000    126                       102                          67
20000    244                       187                          118
32000    -                         664                          394
40000    -                         -                            710

Sample of Other Technical Resources

MATLAB documentation, User’s Guide

– Programming Fundamentals > Software Development > Memory Usage

The Art of MATLAB, Loren Shure’s blog

– blogs.mathworks.com/loren/

Memory Management Guides

– www.mathworks.com/support/tech-notes/1100/1106.html

– www.mathworks.com/support/tech-notes/1100/1107.html

MATLAB Answers

– http://www.mathworks.com/matlabcentral/answers/
