Data in Action
Natalino Busa
Pleased to meet you!
Italian by birth (43)
In the Netherlands since 1998
Married, 3 kids (Ryan, Nemo, Maya)
Likes: Movies, Music, Photography, Design, Fashion, Windsurfing, Architecture, Traveling, Salsa, Sailing.
On the web: https://www.google.nl/search?q=natalino+busa
1/3 slide history of me:
Researcher: Compilers, VLIW processors, Multimedia, MPEG4, Assisted driving, Audio compression
2/3 slide history of me:
Product Development:
VLSI layout optimization, DRC rule checking & correction
Lots of math: Simplex algorithm, Computational geometry
Lots of traveling: Japan, South Korea, US, Germany
3/3 slide history of me:
Big Data and Data Science:
Two verticals:
- Music, Video industry (broadcasters)
- Banking and Finance
Distributed computing, Microservices
Anomaly detection, Time series predictions
about: how to grok data with machines
and generate business value
data is the fabric of our lives
all our relationships are based on data
Data is the new ...
Data as a Relationship
Trust
Transparency of Use
Customer First
Regulations and Laws
Respect and Protect
Providing a Service
Help the customer: Propose, Advise, Select, Filter, Connect, Simplify
Protect the customer: Detect, Prevent, Alert, Block, Defend, Identify, Authorize
Actionable Data, Ethical approach
Techs
● Massive multi-rack systems
● 100s of Computing Cores
● 100s of Terabytes of Storage
● Distributed computing
● Shared-nothing Architecture
● Advanced Query Plans
● Columnar Data Models
● Re-programmable hardware
The Advent of MPP OLAPs (Early 00s)
● Simpler programming paradigm
● Distributed, Replicated File System
Map-Reduce and Hadoop (Early 00s)
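The Map-Reduce paradigm named above can be sketched in a few lines of plain Python. This is a single-process toy, not Hadoop code: the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are made up for illustration, and a real cluster would run the map and reduce steps on different nodes over a distributed file system.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Run map, shuffle and reduce in sequence on a list of strings."""
    pairs = []
    for doc in documents:  # on a cluster, each document would be mapped in parallel
        pairs.extend(map_phase(doc))
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    print(word_count(["data in action", "Data is data"]))
```

The programmer only writes the map and reduce functions; grouping by key is the part the framework takes care of, which is what makes the paradigm simpler than hand-written distributed code.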
Streaming and Real-Time Analytics
Real Time APIs
Streaming Data
Data Sources, Files, DB extracts
Batched Data
Training, Scoring and Exposing models
in-memory computing is winning!
Spark is emerging as an improved, faster, better, “new” Hadoop.
https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
The RAM is the new Disk
Unified Distributed Computing paradigm:
SQL, Statistics, Machine Learning, Graph Analytics
Polyglot Programming:
R, Python, Scala, Java
Integrated Data Science
https://spark.apache.org/
Spark: Streaming, SQL, MLlib, GraphX
Analytics, Statistics, Data Science, Model Training
Data Sources: HDFS, NoSQL, SQL
Map-Reduce, HDFS, Kafka
Spark: Hadoop evolved
Kafka + Spark + Cassandra + Akka (NoSQL stack, Fast Data)
MPP + HDFS + Spark (“new” Hadoop / Data Lake)
Popular Operational Analytics Stacks (10s)
Science
Exploratory Data Analysis
In 1977, Tukey published Exploratory Data Analysis, arguing that more emphasis needed to be placed on using data to suggest hypotheses to test and that Exploratory Data Analysis and Confirmatory Data Analysis “can—and should—proceed side by side.”
Analytics goes mainstream (70s, 80s)
Knowledge Discovery in Databases (1996)
http://flowingdata.com/2009/06/04/rise-of-the-data-scientist/
http://medriscoll.com/post/4740157098/the-three-sexy-skills-of-data-geeks
The Rise of the Data Scientist (00s)
Applications: Kickstarter
Little demo.
You’re not selling analytics, you’re selling business development!
Advice on how to set up a better Kickstarter page
Projections for investors and B-series angels
Prediction based on Big Data
PredPol: Los Angeles police department and University of California
13 Million recorded crimes
Prediction based on historical data, geo location and recency
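One simple way to combine historical data, geolocation and recency is to bin events into grid cells and weight each event by an exponential decay on its age. The sketch below is not PredPol's actual model, which these slides do not describe; the function names, the cell size, the half-life and the event format are all illustrative assumptions.

```python
import math
from collections import defaultdict


def hotspot_scores(events, now, cell_size=0.01, half_life_days=30.0):
    """Score grid cells by recency-weighted event counts.

    events: iterable of (lat, lon, day) tuples, with day measured in days.
    An event that is half_life_days old counts half as much as one from today.
    cell_size is in degrees (an assumed, arbitrary grid resolution).
    """
    scores = defaultdict(float)
    for lat, lon, day in events:
        cell = (math.floor(lat / cell_size), math.floor(lon / cell_size))
        age = now - day
        scores[cell] += 0.5 ** (age / half_life_days)
    return dict(scores)


def top_cells(scores, k=1):
    """Return the k highest-scoring grid cells."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

With this weighting, two events from today outrank three events from three months ago in another cell, which is the intuition behind using recency next to raw historical counts.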
Prediction based on geolocated events
Clustering geolocated data using Spark and DBSCAN
How to group users’ events using machine learning and distributed computing
By Natalino Busa, January 28, 2016
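To make the DBSCAN idea concrete, here is a minimal single-machine sketch in pure Python. It is not the Spark implementation from the referenced article: it uses a brute-force O(n²) neighbour search, and the names `haversine_km`, `dbscan`, `eps_km` and `min_points` are assumptions chosen for this example.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def dbscan(points, eps_km=0.5, min_points=3):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        # Brute-force scan of all points; the result includes point i itself.
        return [j for j, q in enumerate(points)
                if haversine_km(points[i], q) <= eps_km]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_points:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_points:  # j is a core point, keep expanding
                queue.extend(nbrs)
    return labels
```

DBSCAN suits geolocated events because it needs no preset number of clusters and labels isolated events as noise instead of forcing them into a group.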
Data Science: Interpretability vs Accuracy
Traditional:
Regression, ARIMA, ANOVA, Naive Bayes, Decision Trees, Splines
Modern, interpretability prevails:
Random Forests. Cons: feature engineering
Modern, accuracy prevails:
Neural Networks. Cons: hard to explain why
Modern:
Clustering, SVMs, Gaussian Processes, Bagging, Boosting
- Deep Learning
ConvNets and the rebirth of neural networks
Deep Learning to assist doctors treating and classifying cancer
http://www.enlitic.com/
Cancer Classification and Treatment
DL4J: http://deeplearning4j.org/
Theano: http://deeplearning.net/software/theano/
TensorFlow: http://tensorflow.org/
Data Science: Deep Learning
t-SNE and dimensionality reduction
https://www.oreilly.com/learning/an-illustrated-introduction-to-the-t-sne-algorithm
- Topological Data Analysis
Analyze high-dimensional data, visually
http://datarefiner.com/
Analysis of Netflix Prize Dataset. Data set statistics:
● 100,480,507 ratings
● 480,189 users
● 17,770 movies
● 2.8 GB CSV file size
Topological Data Analysis
Fraud Detection and Graph Analysis
“Find those cases where the doctor or the examiner is also a participant in another case, and phone numbers have been reused”
Subgraph isomorphism problem
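The quoted fraud pattern can be sketched as a brute-force search over pairs of cases. The record layout below (keys `id`, `doctor`, `examiner`, `participants`) and the function name `suspicious_case_pairs` are hypothetical, and "phone numbers have been reused" is read here as "at least one phone number occurs in both cases", which is only one possible interpretation. A graph database would express the same thing as a pattern match rather than a nested loop.

```python
from itertools import permutations


def suspicious_case_pairs(cases):
    """Find ordered pairs (case_a, case_b) matching the fraud pattern.

    A pair matches when the doctor or examiner of case_a is listed as a
    participant in case_b AND at least one phone number appears in both cases.
    Returns a list of (id_a, id_b, shared_phones) tuples.
    """
    hits = []
    for a, b in permutations(cases, 2):
        professionals = {a["doctor"], a["examiner"]}
        participants_b = {name for name, _ in b["participants"]}
        phones_a = {phone for _, phone in a["participants"]}
        phones_b = {phone for _, phone in b["participants"]}
        shared = phones_a & phones_b
        if professionals & participants_b and shared:
            hits.append((a["id"], b["id"], sorted(shared)))
    return hits
```

The pairwise scan is quadratic in the number of cases and only covers this one fixed pattern; general subgraph isomorphism is NP-complete, which is why dedicated graph engines and indexes matter at scale.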
Takeaways: Think Business and ROI
Better Interaction: Streaming Computing, Data in Action!
Better predictions and models: Machine Learning + SQL
Better summarization and Feature Extraction: Deep Learning
Better Analysis of Relations: Graph Databases and Algorithms