Information Retrieval, Applied Statistics and Mathematics on BigData - German Society of Physicists Conference 13

IBM Data Warehousing Portfolio Positioning

Information Retrieval, Applied Statistics and Mathematics on BigData

Romeo Kienzler
Data Scientist and Architect
IBM Innovation Center Zurich

Fault Tolerance / Commodity Hardware

AMD Turion II Neo N40L (2 x 1.5 GHz / 2 MB / 15 W), 8 GB RAM, 3 TB Seagate Barracuda 7200.14: < 500 EUR

100 K => 200 x (2, 4, 3) => 400 cores, 1.6 TB RAM, 200 TB HD

MTBF ~ 365 d > 1.5 d
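The MTBF line can be read as a back-of-the-envelope estimate: if a single node fails about once a year, a cluster of many such nodes sees a failure somewhere every few days. A minimal sketch, assuming independent nodes with exponentially distributed failures and the 200 nodes from the sizing line above; the function name `cluster_mtbf_days` is hypothetical and not taken from the slides.

```python
def cluster_mtbf_days(node_mtbf_days: float, nodes: int) -> float:
    """Expected time between failures anywhere in the cluster.

    Assumption: nodes fail independently with exponentially distributed
    lifetimes, so the failure rates simply add up.
    """
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return node_mtbf_days / nodes


if __name__ == "__main__":
    # 200 commodity nodes, each with an MTBF of roughly one year.
    print(cluster_mtbf_days(365.0, 200), "days between failures")
```

Under these assumptions the result is somewhat under two days, the same order of magnitude as the 1.5 d on the slide. Either way, a node failure becomes a routine event that the software layer has to tolerate.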

Supercomputer in a Rack

Supercomputer before

Weather

Atom Bombs

Science

Crash Tests

Supercomputer in a Rack

18 TB Main Memory, 1008 CPU Cores, 113 TFLOPS (TOP500 no. 1 in 2013: 17590 TFLOPS; in 2004: 71 TFLOPS)

Hadoop / BigInsights

Hadoop Distributed File System

Hadoop Job Scheduling

Aggregated Bandwidth between CPU, Main Memory and Hard Drive

1 TB (at 10 GByte/s)
- 1 Node - 100 sec
- 10 Nodes - 10 sec
- 100 Nodes - 1 sec
- 1000 Nodes - 100 msec
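The node counts and times above follow from dividing the data volume by the aggregated bandwidth. A minimal sketch of that arithmetic, assuming every node contributes the same 10 GByte/s, the scan parallelizes perfectly with no coordination overhead, and 1 TB = 1000 GByte; the helper name `scan_seconds` is hypothetical.

```python
def scan_seconds(data_gbyte: float, gbyte_per_sec_per_node: float, nodes: int) -> float:
    """Time to read the data set once when each node scans an equal share."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    aggregated_bandwidth = gbyte_per_sec_per_node * nodes
    return data_gbyte / aggregated_bandwidth


if __name__ == "__main__":
    for nodes in (1, 10, 100, 1000):
        print(nodes, "node(s):", scan_seconds(1000.0, 10.0, nodes), "sec")
```

The same function applied to the Watson figures (45.5 GByte/s per core) gives roughly 22 sec for a single core.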

Watson

1 TB (at 45.5 GByte/s)
- 1 Core - 22 sec
- 10 Cores - 2.2 sec
- 100 Cores - 220 msec
- 1000 Cores - 22 msec
- 10000 Cores - 2.2 msec

Data Streaming

[Figure: System S stack. Labels: hardware (X86 Box, X86 Blades, Cell Blade, FPGA Blade), Operating System, Transport, System S Data Fabric, Processing Element Containers]

Massively Parallel Data Warehousing

Why do we need to process so much data?

Data Growth

Data AVAILABLE to an organization

data an organization can PROCESS

Missed opportunity

100 million tweets are posted every day, 35 hours of video are uploaded every minute, 6.1 x 10^12 text messages were sent in 2011, and 247 x 10^9 e-mails passed through the net, 80 % of them spam and viruses. => Filtering is becoming more and more important.

As much data has been produced between 2003 and now as in all the time up to 2003

Separate the Signal From the Noise

http://www.ibmsystemsmag.com/power/businessstrategy/BI-and-Analytics/signal_noise/

The Unreasonable Effectiveness of Data

"sometimes it's not who has the best algorithm that wins; it's who has the most data."

(C) Google Inc.

http://www.csee.wvu.edu/~gidoretto/courses/2011-fall-cp/reading/TheUnreasonable%20EffectivenessofData_IEEE_IS2009.pdf

Statistical Modeling of Physical Systems

From Unstructured Data to Structured Data - Feature Extraction

"Feature extraction involves simplifying the amount of resources required to describe a large set of data accurately"

Source: Wikipedia
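As one illustration of turning unstructured text into structured data, a bag-of-words count maps each document to a fixed-length numeric vector. This is a generic sketch and not a method shown on the slides; the function names and the sample documents are made up for the example.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(docs: list[str]) -> tuple[list[str], list[list[int]]]:
    """Return a sorted vocabulary and one term-count vector per document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocabulary = sorted(set().union(*counts)) if counts else []
    vectors = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, vectors


if __name__ == "__main__":
    vocab, vectors = bag_of_words(["Big data, big clusters", "Data streams"])
    print(vocab)
    print(vectors)
```

Each document ends up as a row of a matrix whose columns are the vocabulary terms, which is the kind of structured input that dimension reduction and regression methods expect.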

Dimension Reduction

Principal Component Analysis / Singular Value Decomposition
Linear Discriminant Analysis

Source: coursera.org

Data Parallelism

Calculate the empirical mean along each dimension m = 1, ..., M (step in Principal Component Analysis)
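The empirical mean is data-parallel because each partition only needs to report its column sums and its row count; the partial results can then be added in any order. A minimal sketch in plain Python, assuming non-empty partitions whose rows all have the same number of dimensions; the function names are hypothetical, and on a cluster the map step would run on the nodes that hold the data.

```python
from functools import reduce


def partial_sums(rows):
    """Map step: per-partition column sums and row count."""
    rows = list(rows)
    dims = len(rows[0])
    sums = [sum(row[m] for row in rows) for m in range(dims)]
    return sums, len(rows)


def merge(a, b):
    """Reduce step: add two partial results element-wise."""
    (sums_a, n_a), (sums_b, n_b) = a, b
    return [x + y for x, y in zip(sums_a, sums_b)], n_a + n_b


def empirical_mean(partitions):
    """Mean along each dimension m = 1, ..., M over all partitions."""
    sums, n = reduce(merge, map(partial_sums, partitions))
    return [s / n for s in sums]


if __name__ == "__main__":
    partitions = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]
    print(empirical_mean(partitions))
```

Only M sums and one count leave each partition, so the amount of data exchanged does not grow with the number of rows.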

N-gram Models (NLP)
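N-gram counting follows the same pattern: each partition of the corpus is counted on its own and the count tables are merged afterwards. A minimal sketch, assuming whitespace tokenization and sentences that never span partitions; the function names and sample sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous windows of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_partition(sentences, n):
    """Map step: n-gram counts for one partition of the corpus."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.lower().split(), n))
    return counts


def count_all(partitions, n):
    """Reduce step: merge the per-partition count tables."""
    total = Counter()
    for partition in partitions:
        # In a cluster each count_partition call would run on a different node.
        total.update(count_partition(partition, n))
    return total


if __name__ == "__main__":
    corpus = [["big data is big"], ["big data wins"]]
    print(count_all(corpus, 2).most_common(1))
```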

Ordinary Least-Squares Parameter Estimator for Linear Regression
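The least-squares estimator is also data-parallel, because it depends on the data only through a few sums. A minimal sketch for the single-feature case y = a + b x, where each partition contributes its sufficient statistics; the function names are hypothetical. With several features, the same idea applies to the per-partition sums of X^T X and X^T y.

```python
def partial_stats(points):
    """Map step: sufficient statistics (n, Sx, Sy, Sxx, Sxy) of one partition."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in points:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return n, sx, sy, sxx, sxy


def fit_line(partitions):
    """Reduce step: add the statistics, then solve for intercept and slope."""
    stats = [partial_stats(p) for p in partitions]
    n, sx, sy, sxx, sxy = (sum(column) for column in zip(*stats))
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("need at least two distinct x values")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return intercept, slope


if __name__ == "__main__":
    # Points on the line y = 1 + 2x, split across two partitions.
    print(fit_line([[(0, 1), (1, 3)], [(2, 5), (3, 7)]]))
```

Five numbers per partition are enough here, regardless of how many data points each partition holds.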

BUT: Do I want to care about algorithm parallelization?

High-Level Languages

Source: Hadoopsphere.com

High-Level Languages (IBM SystemML)

Extensible Library
- Classification: Linear SVMs, Logistic Reg
- Clustering: K-means
- Regression: Linear Regression
- Matrix Factorizations: SGD solver, NMF
- Ranking: PageRank, HITS

[Figure: SystemML stack. Labels: DML Scripts, Parser, High-Level Ops, Low-Level Ops, Runtime Ops, Optimizations, Hadoop]

Open Source Variant: Apache Mahout
- fewer algorithms
- no optimizer

High-Level Languages (RHadoop)

Source: http://www.revolutionanalytics.com

High-Level Languages (R on IBM PureData)

Source: http://www.revolutionanalytics.com

[Figure: Push Back. Labels: Application, Algorithm, Compile, Engine Execution Language, Engine, Push Back]

Source: coursera.org

Linear Discriminant Analysis

Outlook

Theory: With BigData the machines are thinking for us

Reality: Existing algorithms are now beginning to be applied on a large scale

Present: Every company thinks it urgently has to participate in BigData, but does not know how

Future: Every company will have access to BigData technologies and will use them

Hype: The whole world is doing BigData

Vision: BigData Analytics is usable for everybody at their fingertips

Questions?

Links

www.ibm.com/developerworks

www.ibm.com/ibm/university/academic

U6K8qm_HFas

Jqq66INlQ0U

2012 IBM Corporation

August 28, 2012