WHAT IS DATA SCIENCE?
Timo Aho // Data Scientist, PhD // [email protected] // Twitter: @ahotimom
Data Science Tampere Meetup 29.9.2015
THIS IS SOLITA
› Turnover 2014: 38.6 million euros
› Over 340 professionals
› Over 18 years
› Working in 3 offices
› Over 1000 projects
› Over 97 % customer satisfaction
› Ranked 6th in Great Place to Work in Finland
› Ranked 43rd in European Best Workplaces
Strategic planning
Pre-studies
Road maps
Service concepts
Service design
Visual design
User experience design
Usability design
Architecture design
Solution implementation
Continuous services
Hosting services
Understand & concept, Pilot & implement, Maintain & develop
DIGITAL BUSINESS SOLUTIONS
› ONLINE AND ECOMMERCE
› INFORMATION MANAGEMENT AND BIG DATA
› UTILIZATION AND VISUALIZATION OF INFORMATION
› SOFTWARE DEVELOPMENT
› PREDICTIVE ANALYTICS
› DIGITAL STRATEGY AND TRANSFORMATION
› PUBLIC SECTOR ONLINE SERVICES
› BUSINESS PLANNING AND MANAGEMENT
› INTEGRATION SERVICES
OUR CUSTOMERS
RETAIL, SERVICES, PUBLIC, INSURANCE, MEDIA & TELECOM, MANUFACTURING
OPEN FINLAND CHALLENGE
› An open data contest where you can win prizes!
› Solita is offering a challenge on predictive traffic analytics
› See more: http://openfinlandchallenge.fi/
› The site is unfortunately mostly in Finnish
AGENDA
1. Data science vs Big data
2. Use case examples
3. Data science process
4. Data science methods
BIG DATA CONCEPTS
Big data
Data analysis
Data science
Knowledge discovery in databases (KDD)
Data mining
Machine learning
Predictive analytics
High Volume
High Velocity
High Variety
Cloud computation
Cloud storage
NoSQL
Hadoop MapReduce
Spark
Batch vs. Real time
Structured
Unstructured
Semi-structured
Internet of things
Sensory data
Business analytics
Business intelligence
DEFINITION FOR BIG DATA?
› Narrow:
• Infrastructure for processing exceptionally large or rapidly produced data
› Broad:
• All storing, processing, and analysis of data
• (The data does not necessarily fit into computer memory)
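To illustrate the point about data that does not fit into memory, here is a minimal Python sketch that computes a mean in a single pass over a stream. The `event_stream` generator is a hypothetical stand-in for a real data source and is not part of the talk; only two numbers are held in memory regardless of the stream length.

```python
from typing import Iterable, Iterator


def event_stream(n: int) -> Iterator[float]:
    """Hypothetical data source: yields one value at a time instead of a full list."""
    for i in range(n):
        yield float(i % 10)


def running_mean(values: Iterable[float]) -> float:
    """Compute the mean in one pass, holding only a counter and a sum in memory."""
    count = 0
    total = 0.0
    for value in values:
        count += 1
        total += value
    if count == 0:
        raise ValueError("cannot take the mean of an empty stream")
    return total / count


mean_value = running_mean(event_stream(1_000_000))
print(mean_value)
```

Frameworks such as Hadoop MapReduce or Spark distribute this kind of aggregation over many machines; the sketch only shows the single-machine idea.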
CASE
SANOMA OYJ
A personalized user experience on the most popular web services by analyzing 200 million new events daily
CASE: DIGITAL SERVICE PROVIDER
› Predicting:
• Customer churn
• Cross-selling
› The information is available in all customer contacts
• When the customer contacts support
• When marketing contacts the customer
• When meeting in shops, on the phone, on the web
CASE: RETAIL / SERVICE PROVIDER
› Customers act in waves: service demand is high for a couple of weeks at a time
› Analysis
• Segment customers according to behavior
• Predict customer action timing and high demand times
• Influence the customers to make the demand level steadier, with no peaks
DATA ANALYSIS PROCESS
Source: CRISP-DM, Image: Wikipedia
[Figure: CRISP-DM process diagram with percentage labels on its phases: 50-70%, 10-20%, 20-30%, 10-20%, 10-20%, 5-10%]
WHAT DOES A DATA SCIENTIST DO?
› Business goals
• Ex. 1: Reduce churn
• Ex. 2: Increase manufacturing quality
› Discovering available data
• Ex. 1: Leaving customers, billing, contracts, contacts, service quality, demography, failures
• Ex. 2: Raw materials, machine parameters, manufacturing sensors, end-product quality measurements, failures
› Data preprocessing
• Database connections
• Abnormal data forms
• Bringing to matrix form
• Cleaning or highlighting outliers and exceptions
• Handling missing information
› Analytics modeling
• Ex. 1: Training: 80 variables per leaving customer, three times more current customers
• Ex. 2: Training: tens of starting, intermediate and ending variables
› Analytics result
• Ex. 1: Churn prediction for each customer
• Ex. 2: Prediction for the optimal parameter values for quality
› Information exploitation
• Ex. 1: Getting the predictions to databases in source systems
• Ex. 2: Hint for good parameter values, indication if suboptimal ones are selected
› Service layer
• Ex. 1: Optimizing communication to prevent churn. Customer service sees the churn prediction for the current customer.
• Ex. 2: Process controller either uses the recommended parameter values or tunes them. Also creates real-time reports on process quality.
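The preprocessing steps named above (bringing to matrix form, handling missing information, highlighting outliers) could be sketched in Python roughly as follows. The customer records, field names, median imputation and the median-absolute-deviation rule with `factor=5.0` are all illustrative assumptions, not the method used in the actual projects.

```python
from statistics import median

# Hypothetical customer records; field names are illustrative assumptions.
records = [
    {"monthly_bill": 30.0, "support_contacts": 1, "contract_months": 24},
    {"monthly_bill": None, "support_contacts": 0, "contract_months": 12},
    {"monthly_bill": 35.0, "support_contacts": 2, "contract_months": None},
    {"monthly_bill": 400.0, "support_contacts": 1, "contract_months": 24},
    {"monthly_bill": 28.0, "support_contacts": 9, "contract_months": 6},
]
features = ["monthly_bill", "support_contacts", "contract_months"]


def to_matrix(rows, columns):
    """Bring records to matrix form: one row per data point, one column per feature."""
    return [[row.get(col) for col in columns] for row in rows]


def impute_missing(matrix):
    """Replace each missing value (None) with the median of its column."""
    medians = [
        median(row[j] for row in matrix if row[j] is not None)
        for j in range(len(matrix[0]))
    ]
    return [
        [medians[j] if value is None else value for j, value in enumerate(row)]
        for row in matrix
    ]


def flag_outliers(matrix, factor=5.0):
    """Return (row, column) indices of values far from the column median.

    "Far" means more than `factor` median absolute deviations (an assumed rule).
    """
    flagged = []
    for j in range(len(matrix[0])):
        column = [row[j] for row in matrix]
        center = median(column)
        mad = median(abs(value - center) for value in column)
        if mad == 0:
            continue  # no spread to compare against in this column
        for i, value in enumerate(column):
            if abs(value - center) > factor * mad:
                flagged.append((i, j))
    return flagged


matrix = impute_missing(to_matrix(records, features))
outliers = flag_outliers(matrix)
print(matrix)
print(outliers)
```

Flagged cells are only highlighted here, not removed: whether an unusual value is an error or a genuine exception is a decision for someone who knows the business.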
NATURE OF THE DATA
› Structured
› Semi-structured
› Unstructured
{"cod":"200","message":0.0032,"city": {
"id":1851632,"name":"Shuzenji","coord":{"lon":138.933334,"lat":34.966671},"country":"JP"
},"cnt":10,"list": [{
"dt":1406080800,"temp": {
"day":297.77,"min":293.52,"max":297.77,"night":293.52,"eve":297.77,"morn":297.77
},"pressure":925.04,"humidity":76
}]}
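A semi-structured document like the one above usually has to be flattened before analysis. The following Python sketch parses the same JSON and turns it into one flat row per entry in `list`; the dotted key naming (for example `city.coord.lat`) and the `flatten` helper are assumptions made for illustration.

```python
import json

# The JSON document from the slide, reproduced as a string.
raw = '''{"cod":"200","message":0.0032,"city":{"id":1851632,"name":"Shuzenji",
"coord":{"lon":138.933334,"lat":34.966671},"country":"JP"},"cnt":10,"list":[{
"dt":1406080800,"temp":{"day":297.77,"min":293.52,"max":297.77,"night":293.52,
"eve":297.77,"morn":297.77},"pressure":925.04,"humidity":76}]}'''


def flatten(obj, prefix=""):
    """Flatten nested dicts into a single dict with dotted keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat


doc = json.loads(raw)
city = flatten(doc["city"], prefix="city.")
# One flat row per entry in "list", with the city fields repeated on each row.
rows = [{**city, **flatten(entry)} for entry in doc["list"]]
print(rows[0]["city.name"], rows[0]["temp.day"], rows[0]["humidity"])
```

Each resulting row has the same fixed set of columns, which is the structured, matrix-like form that most analysis tools expect.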
HISTORY OR FUTURE?
› Descriptive
• What happened?
• What is happening?
• Tools: reporting, data warehouses, master data
› Predictive
• What will probably happen?
• Tools: statistical modeling, machine learning
› Prescriptive
• What should be done for optimal outcome?
• Tools: optimizing, machine learning, simulation, real-time analytics
Most organizations are here
WHAT DO ALGORITHMS EAT?
Feature 1 Feature 2 Feature 3 Feature 4
Data point 1
Data point 2
Data point 3
Data point 4
Data point 5
DESCRIPTIVE MODELING METHODS
› Visualizations
• High dimension?
› Statistical values, dependencies
› Clustering
Source: Wikipedia
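As an illustration of clustering on a data matrix with one row per data point, here is a minimal Python sketch of k-means. The talk does not name a specific clustering algorithm, so k-means is an assumed example, and the points and the naive "first k points" initialization are made up for the demonstration.

```python
from math import dist


def k_means(points, k, iterations=20):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    centroids = [points[i] for i in range(k)]  # naive init: the first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: label each point with the index of its nearest centroid.
        labels = [
            min(range(k), key=lambda c: dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


# Hypothetical data: two features per data point, two visibly separate groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (0.9, 1.1), (5.2, 4.9), (4.8, 5.1)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

Note that there is no target feature here: the algorithm only describes structure already present in the data.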
WHAT DO ALGORITHMS EAT?
Feature 1 Feature 2 Feature 3 Target feature
Data point 1
Data point 2
Data point 3
Data point 4
Data point 5
PREDICTIVE MODELING METHODS
› Regression
› Classification
[Figure: example items labeled with prices (1 €, 2 €, 4,5 €, 1,5 €, 1,3 €, 2 €) and with class labels (AAAA, OP)]
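The difference between regression (numeric target) and classification (categorical target) can be sketched in Python as below. Ordinary least squares and 1-nearest-neighbor are assumed here as the simplest representatives of each family, and all data values and labels are invented for illustration.

```python
from math import dist


def fit_line(xs, ys):
    """Regression: ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def nearest_neighbor(train_points, train_labels, query):
    """Classification: copy the label of the closest training point (1-NN)."""
    closest = min(range(len(train_points)), key=lambda i: dist(train_points[i], query))
    return train_labels[closest]


# Regression: one feature, numeric target feature (hypothetical values).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
predicted = slope * 5.0 + intercept
print(slope, intercept, predicted)

# Classification: two features, categorical target feature (hypothetical labels).
train_points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (4.8, 5.1)]
train_labels = ["stays", "stays", "leaves", "leaves"]
predicted_label = nearest_neighbor(train_points, train_labels, (4.5, 4.7))
print(predicted_label)
```

In both cases the model is trained on rows where the target feature is known and then applied to rows where it is not.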
WHY IS DATA SCIENCE RELEVANT?
› More data available
› A lot of software tools available
• R, Python, Weka, RapidMiner, Tableau, SPSS, SAS
• Hadoop, Spark, NoSQL databases
• Cloud tools
› Business understanding of how to apply it?
Twitter @SolitaOy
www.solita.fi
THANK YOU!
TIMO AHO
Data Scientist, PhD