Data in Action Natalino Busa


Page 1: Data in Action

Data in Action
Natalino Busa

Page 2: Data in Action

Pleased to meet you!

Italian by birth (43)
In the Netherlands since 1998

Married, 3 kids (Ryan, Nemo, Maya)

Likes: Movies, Music, Photography, Design, Fashion, Windsurfing, Architecture, Traveling, Salsa, Sailing.

On the web: https://www.google.nl/search?q=natalino+busa

Page 3: Data in Action

1/3 slide history of me:

Researcher:
Compilers, VLIW processors, Multimedia
MPEG4, Assisted driving, Audio compression

Page 4: Data in Action

2/3 slide history of me:

Product Development:
VLSI layout optimization, DRC rule checking & correction

Lots of math:
Simplex algorithm, Computational geometry

Lots of traveling:
Japan, South Korea, US, Germany

Page 5: Data in Action

3/3 slide history of me:

Big Data and Data Science:

Two verticals:

- Music, Video industry (broadcasters)
- Banking and Finance

Distributed computing
Microservices

Anomaly detection, Time series predictions

Page 6: Data in Action

about: how to grok data with machines

and generate business value

Page 7: Data in Action

data is the fabric of our lives

all our relationships are based on data

Page 8: Data in Action

Data is the new ...

Page 9: Data in Action

Data as a Relationship

Page 10: Data in Action

Data as a Relationship

Trust

Transparency of Use

Customer First

Regulations and Laws

Respect and Protect

Providing a Service

Page 11: Data in Action

Help the customer:
Propose, Advise, Select, Filter, Connect, Simplify

Protect the customer:
Detect, Prevent, Alert, Block, Defend, Identify, Authorize

Actionable Data, Ethical approach

Page 12: Data in Action

Techs

Page 13: Data in Action

● Massive multi-rack systems
● 100s of computing cores
● 100s of terabytes of storage

● Distributed computing
● Shared-nothing architecture

● Advanced query plans
● Columnar data models

● Re-programmable hardware

The Advent of MPP OLAPs (Early 00s)

Page 14: Data in Action

● Simpler programming paradigm
● Distributed, replicated file system

Map-Reduce and Hadoop (Early 00s)
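
To make the "simpler programming paradigm" concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps, using word count as the task. It is an illustration only: the function names are invented for this example, and it does not use Hadoop or any distributed file system.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)

def word_count(documents):
    # Each document could be mapped on a different machine;
    # here everything runs in one process.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

documents = ["the RAM is the new disk", "Spark reads the disk once"]
counts = word_count(documents)
print(counts["the"], counts["disk"], counts["ram"])
```

In a real cluster the map and reduce functions stay this small; the framework takes care of distributing the input, shuffling keys between nodes and re-running failed tasks.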

Page 15: Data in Action

Streaming and Real-Time Analytics

Page 16: Data in Action


Streaming Data: Real Time APIs

Batched Data: Data Sources, Files, DB extracts

Training, Scoring and Exposing models
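
A minimal sketch of the split between the two paths: a model is trained once on batched historical data, then used to score events as they arrive one by one. The model here (a z-score anomaly detector), the threshold and the sample values are assumptions chosen for brevity, not something taken from the talk.

```python
from statistics import mean, stdev

def train(batch):
    """Batch path: fit a trivial model (mean and standard deviation)."""
    return {"mean": mean(batch), "std": stdev(batch)}

def score(model, value):
    """Score one event: absolute z-score against the trained model."""
    return abs(value - model["mean"]) / model["std"]

def score_stream(model, events, threshold=3.0):
    """Streaming path: consume events one at a time, flag anomalies."""
    for value in events:
        z = score(model, value)
        yield value, z, z > threshold

# Hypothetical historical measurements (the batched data).
history = [10.0, 12.0, 11.0, 9.0, 10.0, 11.0, 12.0, 9.0]
model = train(history)

# Hypothetical live events; any iterator works, e.g. a message consumer.
results = list(score_stream(model, iter([10.5, 11.0, 25.0])))
flags = [is_anomaly for _, _, is_anomaly in results]
print(flags)
```

Exposing the model then amounts to wrapping `score` in an API endpoint, while `train` is re-run periodically on fresh batches.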

Page 17: Data in Action

in-memory computing is winning!

Spark is emerging as an improved, faster, better, “new” Hadoop.

https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html

The RAM is the new Disk

Page 18: Data in Action

Unified Distributed Computing paradigm:

SQL, Statistics, Machine Learning, Graph Analytics

Polyglot Programming:

R, Python, Scala, Java

Integrated Data Science

Page 19: Data in Action

https://spark.apache.org/

[Diagram: the Spark stack]
Spark libraries: Streaming, SQL, MLlib, GraphX
Uses: Analytics, Statistics, Data Science, Model Training
Data sources: HDFS, NoSQL, SQL, Kafka
Classic Hadoop: Map-Reduce, HDFS

Spark: Hadoop evolved

Page 20: Data in Action

Kafka + Spark + Cassandra + Akka (noSQL stack, Fast Data)

MPP + HDFS + Spark (“new” Hadoop / Data Lake)

Popular Operational Analytics Stacks (10s)

Page 21: Data in Action

Science

Page 22: Data in Action

Exploratory Data Analysis

In 1977, Tukey published Exploratory Data Analysis, arguing that more emphasis needed to be placed on using data to suggest hypotheses to test and that Exploratory Data Analysis and Confirmatory Data Analysis “can—and should—proceed side by side.”

Analytics goes mainstream (70s, 80s)

Page 23: Data in Action

Knowledge Discovery in Databases (1996)

Page 24: Data in Action

http://flowingdata.com/2009/06/04/rise-of-the-data-scientist/
http://medriscoll.com/post/4740157098/the-three-sexy-skills-of-data-geeks

The Rise of the Data Scientist (00s)

Page 26: Data in Action

Applications: Kickstarter

you’re not selling analytics, you’re selling biz development!!!

Advice on how to set up a better Kickstarter page
Projections for investors and B-series angels

Page 27: Data in Action

Prediction based on Big Data

PredPol: Los Angeles police department and University of California

13 Million recorded crimes

Prediction based on historical data, geo location and recency
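
One simple way to combine history, geolocation and recency is to bin events into grid cells and give recent events more weight. The sketch below does this with an exponential decay. It is not PredPol's actual model; the grid size, the half-life and the sample events are all assumptions made up for illustration.

```python
import math
from collections import defaultdict

def cell_of(lat, lon, cell_size=0.01):
    """Snap a coordinate to a square grid cell (cell_size in degrees)."""
    return (math.floor(lat / cell_size), math.floor(lon / cell_size))

def hotspot_scores(events, now, half_life_days=30.0, cell_size=0.01):
    """Sum exponentially decayed weights per grid cell.

    events: iterable of (lat, lon, day), with day <= now.
    An event loses half of its weight every half_life_days.
    """
    scores = defaultdict(float)
    for lat, lon, day in events:
        age = now - day
        scores[cell_of(lat, lon, cell_size)] += 0.5 ** (age / half_life_days)
    return dict(scores)

# Hypothetical events: (latitude, longitude, day number).
events = [
    (34.052, -118.243, 100),  # today
    (34.053, -118.244, 70),   # same cell, 30 days ago
    (34.101, -118.301, 10),   # another cell, 90 days ago
]
scores = hotspot_scores(events, now=100)
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
print(ranked[0])
```

Cells with many recent events rise to the top of `ranked`, while old events fade out without ever being deleted.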

Page 28: Data in Action

Prediction based on geolocated events

Clustering geolocated data using Spark and DBSCAN
How to group users’ events using machine learning and distributed computing

By Natalino Busa, January 28, 2016
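
For readers unfamiliar with DBSCAN, here is a small single-machine sketch of the algorithm applied to latitude/longitude points. It is not the Spark implementation from the article above: the sample points, the 0.5 km radius and the minimum of 3 points are assumptions, and the neighbour search is a plain quadratic scan.

```python
import math

NOISE = -1

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def dbscan(points, eps_km, min_pts):
    """Return one cluster label per point; NOISE (-1) marks outliers."""
    def neighbours(i):
        # Includes point i itself, as in the usual DBSCAN definition.
        return [j for j in range(len(points))
                if haversine_km(points[i], points[j]) <= eps_km]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE            # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbours(j)
            if len(more) >= min_pts:     # j is a core point: keep expanding
                queue.extend(more)
    return labels

# Hypothetical user events: two dense groups and one isolated point.
points = [
    (52.370, 4.895), (52.371, 4.896), (52.369, 4.894),
    (51.924, 4.478), (51.925, 4.479), (51.923, 4.477),
    (53.219, 6.567),
]
labels = dbscan(points, eps_km=0.5, min_pts=3)
print(labels)
```

Unlike k-means, the number of clusters is not fixed in advance, and isolated events are reported as noise rather than forced into a group.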

Page 29: Data in Action

Data Science: Interpretability vs Accuracy

Traditional:
Regression, ARIMA, ANOVA, Naive Bayes, Decision Trees, Splines

Modern, interpretability prevails:
Random Forests
Cons: feature engineering

Modern, accuracy prevails:
Neural Networks
Cons: hard to explain why

Modern:
Clustering, SVMs, Gaussian Processes, Bagging, Boosting

Page 30: Data in Action

- Deep Learning
ConvNets and the rebirth of neural networks

Page 31: Data in Action

Deep Learning to assist doctors treating and classifying cancer
http://www.enlitic.com/

Cancer Classification and Treatment

Page 32: Data in Action

DL4J
http://deeplearning4j.org/

Theano
http://deeplearning.net/software/theano/

TensorFlow
http://tensorflow.org/

Data Science: Deep Learning

Page 33: Data in Action

t-SNE and dimensionality reduction

https://www.oreilly.com/learning/an-illustrated-introduction-to-the-t-sne-algorithm

Page 34: Data in Action

- Topological Data Analysis

Analyze high-dimensional data, visually
http://datarefiner.com/

Analysis of the Netflix Prize dataset. Data set statistics:

● 100,480,507 ratings

● 480,189 users

● 17,770 movies

● 2.8 GB CSV file size

Topological Data Analysis
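
The statistics above already show why this data set is hard to inspect directly. A quick back-of-the-envelope calculation, using only the figures listed on the slide:

```python
ratings = 100_480_507
users = 480_189
movies = 17_770

cells = users * movies                 # size of the full user x movie matrix
density = ratings / cells              # fraction of cells that hold a rating
ratings_per_user = ratings / users
ratings_per_movie = ratings / movies

print(f"matrix cells:      {cells:,}")
print(f"density:           {density:.2%}")
print(f"ratings per user:  {ratings_per_user:.1f}")
print(f"ratings per movie: {ratings_per_movie:.1f}")
```

Only about 1.2% of the user-movie matrix is filled, so every user is a point in a 17,770-dimensional space with almost all coordinates missing.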

Page 35: Data in Action

Fraud Detection and Graph Analysis

Page 36: Data in Action

Fraud Detection and Graph Analysis

“Find those cases where the doctor or the examiner is also a participant in another case, and phone numbers have been reused”

Subgraph isomorphism problem
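
The quoted query can be sketched as a tiny pattern match over case records. Everything here is hypothetical: the record layout, the names and the reading of "participant" as any non-professional role are assumptions for illustration. A graph database would express the same pattern as a subgraph query.

```python
from collections import defaultdict

# Hypothetical records: (case_id, person, role, phone)
records = [
    ("case-1", "Dr. A", "doctor", "555-0101"),
    ("case-1", "P. One", "claimant", "555-0102"),
    ("case-2", "Dr. A", "claimant", "555-0103"),
    ("case-2", "E. Two", "examiner", "555-0102"),
    ("case-3", "Dr. C", "doctor", "555-0104"),
    ("case-3", "P. Three", "claimant", "555-0105"),
]

PROFESSIONAL = {"doctor", "examiner"}

def suspicious(records):
    """Find (person, case_a, case_b) where the person acts as doctor or
    examiner in case_a, is a participant in case_b, and the two cases
    share at least one phone number."""
    roles = defaultdict(set)    # person -> {(case, role)}
    phones = defaultdict(set)   # phone  -> {case}
    for case, person, role, phone in records:
        roles[person].add((case, role))
        phones[phone].add(case)

    hits = set()
    for person, entries in roles.items():
        for case_a, role_a in entries:
            if role_a not in PROFESSIONAL:
                continue
            for case_b, role_b in entries:
                if case_b == case_a or role_b in PROFESSIONAL:
                    continue
                # Phone reuse: some number appears in both cases.
                if any({case_a, case_b} <= cases for cases in phones.values()):
                    hits.add((person, case_a, case_b))
    return hits

hits = suspicious(records)
print(hits)
```

On real data this brute-force matching grows quickly with the number of cases, which is why the slide frames it as a subgraph isomorphism problem and points to graph analysis.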

Page 41: Data in Action

Takeaways

Think Business and ROI

Better Interaction:
Streaming Computing: Data in Action!

Better predictions and models:
Machine Learning + SQL

Better summarization and Feature Extraction:
Deep Learning

Better Analysis of Relations:
Graph Databases and Algorithms