Large Data Analysis with PyTables

Page 1: PyTables

Large Data Analysis with PyTables

Page 2: PyTables

Personal Profile:

● Ali Hallaji

● Parallel Processing and Large Data Analysis

● Senior Python Developer at innfinision Cloud Solutions

[email protected]

● Innfinision.net

Page 3: PyTables

innfinision Cloud Solutions:

● Providing Cloud, Virtualization and Data Center Solutions

● Developing Software for Cloud Environments

● Providing Services to Telecom, Education, Broadcasting & Health Fields

● Supporting OpenStack Foundation as the First Iranian Company

● First Supporter of IRAN OpenStack Community

Page 4: PyTables

Large Data Analysis with PyTables innfinision.net

Agenda:

● Outline

● What is PyTables?

● Numexpr & NumPy

● Compressing Data

● What is HDF5?

● Querying your data in many different ways, fast

● Design goals

Page 6: PyTables

Outline

The Starving CPU Problem

● Getting the Most Out of Computers

● Caches and Data Locality

● Techniques For Fighting Data Starvation

High Performance Libraries

● Why Should You Use Them?

● In-Core High Performance Libraries

● Out-of-Core High Performance Libraries

Page 8: PyTables

Getting the Most Out of Computers

Computers nowadays are very powerful:

● Extremely fast CPUs (multicore)

● Large amounts of RAM

● Huge disk capacities

But they face a pervasive problem: an ever-increasing mismatch between CPU, memory and disk speeds (the so-called "Starving CPU" problem). This makes it tremendously difficult to get the most out of computers.

Page 9: PyTables

CPU vs Memory Cycle Trend

Cycle time is the time, usually measured in nanoseconds, between the start of one random access memory (RAM) access and the time when the next access can begin.

History

● In the 1970s and 1980s the memory subsystem was able to deliver all the data that processors required in time.

● In the good old days, the processor was the key bottleneck.

● But in the 1990s things started to change...

Page 10: PyTables

CPU vs Memory Cycle Trend

Page 11: PyTables

The CPU Starvation Problem

Known facts (in 2010):

● Memory latency is much higher (around 250x) than processor cycle time, and it has been an essential bottleneck for the past twenty years.

● Memory throughput is improving at a better rate than memory latency, but it is also much slower than processors (about 25x).

The result is that CPUs in our current computers suffer from a serious data starvation problem: they could consume (much!) more data than the system can possibly deliver.

Page 12: PyTables

What Is the Industry Doing to Alleviate CPU Starvation?

● They are improving memory throughput: cheap to implement (more data is transmitted on each clock cycle).

● They are adding big caches to the CPU dies.

Page 13: PyTables

Why Is a Cache Useful?

● Caches are closer to the processor (normally on the same die), so both latency and throughput are improved.

● However: the faster they run, the smaller they must be.

● They are effective mainly in a couple of scenarios:

● Time locality: when the dataset is reused.

● Spatial locality: when the dataset is accessed sequentially.
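The spatial-locality case can be sketched with NumPy (array sizes here are arbitrary): reducing a C-ordered array along its contiguous axis walks memory sequentially, while the same reduction on a Fortran-ordered copy of the same values strides through memory. The results are identical; only the access pattern (and typically the speed) differs.

```python
import numpy as np
from timeit import timeit

# C-ordered array: each row is contiguous in memory
a = np.random.rand(2000, 2000)
b = np.asfortranarray(a)  # same values, column-major layout

# Summing along axis=1 reads `a` sequentially (spatial locality),
# but strides through `b`.
t_c = timeit(lambda: a.sum(axis=1), number=10)
t_f = timeit(lambda: b.sum(axis=1), number=10)

assert np.allclose(a.sum(axis=1), b.sum(axis=1))
```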

Page 14: PyTables

Time Locality

Parts of the dataset are reused

Page 15: PyTables

Spatial Locality

Dataset is accessed sequentially

Page 16: PyTables

Why High Performance Libraries?

● High performance libraries are made by people who know the different optimization techniques very well.

● You may be tempted to create original algorithms that can be faster than these, but in general it is very difficult to beat them.

● In some cases it may take some time to get used to them, but the effort pays off in the long run.

Page 17: PyTables

Some In-Core High Performance Libraries

● ATLAS/MKL (Intel's Math Kernel Library): Uses memory-efficient algorithms as well as SIMD and multi-core implementations for linear algebra operations.

● VML (Intel's Vector Math Library): Uses SIMD and multi-core to compute basic math functions (sin, cos, exp, log...) on vectors.

● Numexpr: Performs potentially complex operations with NumPy arrays without the overhead of temporaries. Can make use of multiple cores.

● Blosc: A multi-threaded compressor that can transmit data from caches to memory, and back, at speeds that can be higher than an OS memcpy().

Page 18: PyTables

What is PyTables?

Page 19: PyTables

PyTables

PyTables is a package for managing hierarchical datasets, designed to cope efficiently and easily with extremely large amounts of data. You can download PyTables and use it for free. You can access documentation, examples of use and presentations in the HowToUse section.

PyTables is built on top of the HDF5 library, using the Python language and the NumPy package. It features an object-oriented interface that, combined with C extensions for the performance-critical parts of the code (generated using Cython), makes it a fast, yet extremely easy to use tool for interactively browsing, processing and searching very large amounts of data. One important feature of PyTables is that it optimizes memory and disk resources so that data takes much less space (especially if on-the-fly compression is used) than other solutions such as relational or object-oriented databases.
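As a minimal sketch of that workflow (the file name and array contents below are arbitrary), the following stores a NumPy array in an HDF5 file with PyTables and reads it back:

```python
import os
import tempfile

import numpy as np
import tables  # the PyTables package

fname = os.path.join(tempfile.mkdtemp(), "demo.h5")

# Write: create an HDF5 file and store an array node in its root group
with tables.open_file(fname, mode="w", title="Demo file") as h5:
    h5.create_array("/", "measurements", np.arange(10), title="Sample data")

# Read: reopen the file and pull the array back into memory
with tables.open_file(fname, mode="r") as h5:
    arr = h5.root.measurements.read()

assert (arr == np.arange(10)).all()
```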

Page 20: PyTables

Numexpr & NumPy

Page 21: PyTables

Numexpr: Dealing with Complex Expressions

● Uses a specialized virtual machine for evaluating expressions.

● It accelerates computations by using blocking and by avoiding temporaries.

● Multi-threaded: can use several cores automatically.

● It has support for Intel's VML (Vector Math Library), so you can accelerate the evaluation of transcendental functions (sin, cos, atanh, sqrt...) too.

Page 22: PyTables

NumPy: A Powerful Data Container for Python

NumPy provides a very powerful, object-oriented, multidimensional data container:

● array[index]: retrieves a portion of the data container

● (array1**3 / array2) - sin(array3): evaluates potentially complex expressions

● numpy.dot(array1, array2): access to optimized BLAS (*GEMM) functions
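The three bullet points can be exercised directly (the array contents here are arbitrary):

```python
import numpy as np

array1 = np.array([1.0, 2.0, 3.0, 4.0])
array2 = np.array([2.0, 2.0, 2.0, 2.0])
array3 = np.zeros(4)

# array[index]: retrieve a portion of the container
part = array1[1:3]                          # elements 2.0 and 3.0

# Evaluate a potentially complex elementwise expression
result = (array1**3 / array2) - np.sin(array3)

# numpy.dot: optimized BLAS (*GEMM) matrix product
prod = np.dot(np.eye(2), np.array([[1.0, 2.0], [3.0, 4.0]]))
```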

Page 23: PyTables

NumPy And Temporaries

Computing "a*b+c" with NumPy. Temporaries go to memory.

Page 24: PyTables

Numexpr Avoids (Big) Temporaries

Computing "a*b+c" with numexpr. Temporaries in memory are avoided.
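The difference can be checked directly. Assuming the numexpr package is installed, both lines below compute the same values, but numexpr evaluates the expression block by block across threads instead of materializing a full-size temporary for a*b:

```python
import numpy as np
import numexpr as ne

n = 1_000_000
a, b, c = np.random.rand(n), np.random.rand(n), np.random.rand(n)

expected = a * b + c             # NumPy: the intermediate a*b is a full temporary
result = ne.evaluate("a*b + c")  # numexpr: blockwise, no big temporary

assert np.allclose(result, expected)
```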

Page 25: PyTables

Numexpr Performance (Using Multiple Threads)

Time to evaluate the polynomial: ((.25*x + .75)*x - 1.5)*x - 2

Page 26: PyTables

Compression

Page 27: PyTables

Why Compression?

● Lets you store more data in the same space

● Uses more CPU, but CPU time is cheap compared with disk access

● Different compressors for different uses: bzip2, zlib, LZO, Blosc
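A quick illustration of the trade-off, using zlib from the standard library (the data here is deliberately repetitive, so it compresses very well; in PyTables itself the compressor is chosen per dataset, via tables.Filters):

```python
import zlib

import numpy as np

# Highly repetitive data: one million 64-bit zeros
raw = np.zeros(1_000_000, dtype=np.int64).tobytes()

packed = zlib.compress(raw, level=1)  # spend a little CPU...
restored = zlib.decompress(packed)    # ...to store and transfer far fewer bytes

assert restored == raw
print(f"compression ratio: {len(raw) / len(packed):.0f}x")
```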

Page 28: PyTables

Why Compression?

3X more data

Page 29: PyTables

Why Compression?

Less data needs to be transmitted to the CPU

Transmission + decompression faster than direct transfer?

Page 30: PyTables

Blosc: Compressing Faster than Memory Speed

Page 31: PyTables

What is HDF5?

Page 32: PyTables

HDF5

HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high-volume, complex data. HDF5 is portable and extensible, allowing applications to evolve in their use of HDF5. The HDF5 technology suite includes tools and applications for managing, manipulating, viewing, and analyzing data in the HDF5 format.

Page 33: PyTables

The HDF5 technology suite includes:

● A versatile data model that can represent very complex data objects and a wide variety of metadata.

● A completely portable file format with no limit on the number or size of data objects in the collection.

● A software library that runs on a range of computational platforms, from laptops to massively parallel systems, and implements a high-level API with C, C++, Fortran 90, and Java interfaces.

● A rich set of integrated performance features that allow for access-time and storage-space optimizations.

● Tools and applications for managing, manipulating, viewing, and analyzing the data in the collection.

Page 34: PyTables

Data structures

High level of flexibility for structuring your data:

● Datatypes: scalars (numerical & strings), records, enumerated, time...

● Tables support multidimensional cells and nested records

● Multidimensional arrays

● Variable-length arrays
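Several of those structures can appear in a single PyTables table description. The sketch below (all field names are illustrative, not from the presentation) combines a time datatype, a fixed-length string, a multidimensional cell, and a nested record; nested fields are addressed with '/'-separated paths when filling rows:

```python
import os
import tempfile

import tables

# Illustrative table description combining the listed datatypes
class Reading(tables.IsDescription):
    timestamp = tables.Time64Col()             # time datatype
    label     = tables.StringCol(16)           # fixed-length string
    samples   = tables.Float64Col(shape=(4,))  # multidimensional cell

    class sensor(tables.IsDescription):        # nested record
        id       = tables.Int32Col()
        location = tables.StringCol(32)

fname = os.path.join(tempfile.mkdtemp(), "readings.h5")
with tables.open_file(fname, "w") as h5:
    table = h5.create_table("/", "readings", Reading)
    row = table.row
    row["timestamp"] = 0.0
    row["samples"] = [1.0, 2.0, 3.0, 4.0]
    row["sensor/id"] = 7               # nested fields use '/'-paths
    row["sensor/location"] = b"lab"
    row.append()
    table.flush()
    first = table[0]

assert first["sensor"]["id"] == 7
```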

Page 35: PyTables

Attributes:

Metadata about data
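In PyTables, such metadata hangs off a node's attribute set. A small sketch (the attribute names and values are made up for illustration):

```python
import os
import tempfile

import tables

fname = os.path.join(tempfile.mkdtemp(), "attrs.h5")

# Attach metadata directly to the dataset via its attribute set
with tables.open_file(fname, "w") as h5:
    arr = h5.create_array("/", "x", [1, 2, 3])
    arr.attrs.units = "meters"          # illustrative attribute
    arr.attrs.instrument = "sensor-42"  # illustrative attribute

# Attributes travel with the file and can be read back later
with tables.open_file(fname, "r") as h5:
    units = h5.root.x.attrs.units

assert units == "meters"
```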

Page 36: Py tables

● Dataset hierarchy

innfinision.netLarge Data Analyze with PyTables

Page 37: PyTables

Querying your data in many different ways, fast

Page 38: PyTables

PyTables Query

One characteristic that sets PyTables apart from similar tools is its capability to perform extremely fast queries on your tables, in order to facilitate as much as possible your main goal: getting important information *out* of your datasets.

PyTables achieves this via a very flexible and efficient query iterator, named Table.where(). This, in combination with OPSI, the powerful indexing engine that comes with PyTables, and the efficiency of underlying tools like NumPy, HDF5, Numexpr and Blosc, makes PyTables one of the fastest and most powerful query engines available.

Page 39: PyTables

Different query modes

Regular query:

● [ r['c1'] for r in table if r['c2'] > 2.1 and r['c3'] == True ]

In-kernel query:

● [ r['c1'] for r in table.where('(c2>2.1)&(c3==True)') ]

Indexed query:

● table.cols.c2.createIndex()

● table.cols.c3.createIndex()

● [ r['c1'] for r in table.where('(c2>2.1)&(c3==True)') ]
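Assuming a table with columns c1, c2 and c3 like the slide's (built here with made-up data), the regular and in-kernel modes can be compared end to end; the indexed mode only adds index-creation calls on the queried columns:

```python
import os
import tempfile

import tables

# Hypothetical layout matching the slide's columns c1, c2, c3
class Row(tables.IsDescription):
    c1 = tables.Int32Col()
    c2 = tables.Float64Col()
    c3 = tables.BoolCol()

fname = os.path.join(tempfile.mkdtemp(), "query.h5")
with tables.open_file(fname, "w") as h5:
    table = h5.create_table("/", "t", Row)
    table.append([(i, i * 0.5, i % 2 == 0) for i in range(10)])
    table.flush()

    # Regular query: rows are iterated and filtered in pure Python
    regular = [r["c1"] for r in table if r["c2"] > 2.1 and r["c3"]]

    # In-kernel query: the condition string is compiled by Numexpr
    inkernel = [r["c1"] for r in table.where("(c2 > 2.1) & (c3 == True)")]

assert regular == inkernel == [6, 8]
```

In a current PyTables release the indexed variant is spelled table.cols.c2.create_index(); the slide shows the older createIndex() spelling.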

Page 40: PyTables

This presentation has been collected from several other presentations (PyTables presentations). For more presentations refer to this link (http://pytables.org/moin/HowToUse#Presentations).

Page 41: PyTables

Ali Hallaji

[email protected]

innfinision.net

Thank you