Big Data & Data Science: A Practical View
Mona Soliman Habib
Principal Data Scientist
Microsoft Azure Machine Learning
1854 London data map
Data gives you the WHAT, so people can uncover the WHY
• Data is everywhere
• Small data, medium data, large data, huge data
• Data of all shapes and forms
• Internal, external, public, crowdsourced
• Numeric, free text, spatial, temporal, audio/video, ...
• Structured, semi-structured, unstructured
• Encrypted and decrypted
• Private, personal, sensitive, etc.
The Data Deluge
• Large, complex, challenging, difficult to process, …
• Big Data is not limited to size
• Five main concerns: The 5 V’s
  • Volume: the amount of data is growing from terabytes to petabytes and beyond
  • Velocity: data is being collected at a very fast pace
  • Variety: data can be of any type, regardless of its structure
  • Veracity: lack of trust in information extracted from Big Data
  • Value: is the data collection and curation worth it?
• Like it or not, data will continue to grow in all aspects.
• Data → Insight → Prediction → Impactful Action
What is Big Data?
Anyone can benefit from data
Transformational trends
cloud computing
5x increase, 2011 to 2016
emerging data science talent
Universities filling 300,000 US talent gap
90% of the data in the world today has been created in the last two years alone
data explosion
connected customers
1B+, 200M, 10.4M, 160M
Cultural, technological, and scholarly phenomenon
Assumptions, biases, uncertainties
Is Big Data part of a mythology? "large data sets offer a higher form of intelligence and knowledge [...], with the aura of truth, objectivity, and accuracy" [1].
Food for serious thought [1]:
Big Data changes the definition of knowledge
Claims to objectivity and accuracy are misleading
Bigger data are not always better data
Taken out of context, Big Data loses its meaning
Just because it’s accessible doesn’t make it ethical
Limited access to Big Data creates new digital divides
Critical questions for Big Data
[1] Boyd, D.; Crawford, K. (2012). "Critical Questions for Big Data". Information, Communication & Society 15 (5): 662.
Data Scientist: The Sexiest Job of the 21st Century
HBR, October 2012
Data science
• is the study of the generalizable extraction of knowledge from data (Wikipedia)
• is getting predictive and/or actionable insight from data (Neil Raden)
• involves extracting, creating, and processing data to turn it into business value (Vincent Granville, Developing Analytic Talent: Becoming a Data Scientist)
What is Data Science?
[Diagram: Real World, Machine Learning, Data Science]
Data Science is the practice of deriving information and insight from real-world data to create business value.
Data Science: Practical Definition
Problem Requirements
Available data
• Related to the decision
• Historical
• Outcomes
Valuable business problem involving a decision
• Existing process
• Metrics
The Data Science Process
Where AA sits: Transform & Analyze
[Diagram labels: Internal & external data; Relational; Non-relational; Streaming; Information management; Orchestration; Extract, transform, load; Analytical; Prediction; Apps; Dashboards; Reports; Ask; Mobile]
Source: http://www.edureka.in/blog/core-data-scientist-skills/
Data Scientist: Essential Skills
• 80% of the work is janitorial
  • Data movement, consolidation, curation, wrangling, etc.
• Working with large data
  • How much data is really needed for modeling?
  • Data analysis, visualization, exploration
  • Smart, representative sampling vs. large scale learning
• Customers don’t trust black-box models
  • How to interpret and react to model predictions?
  • Selecting and understanding appropriate metrics
  • Proper model integration in business workflows
• Post-deployment monitoring and updates
  • Monitoring online performance, detecting drifts, model updates, A/B testing, etc.
• Managing data science projects
  • How to manage projects involving data, machine learning, software apps, services, etc.?
Big Data & Data Science Challenges
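One way to read the "smart, representative sampling" point is stratified sampling: draw the same fraction from every class so that rare outcomes survive the cut. The sketch below is only an illustration under assumed names; `stratified_sample`, the toy `rows`, and the `label` field are hypothetical and not from the slides.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Draw roughly `fraction` of the rows from each stratum given by key(row)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for members in strata.values():
        # Keep at least one row per stratum so rare classes are not lost.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical toy data: 50 "fraud" rows and 950 "ok" rows.
rows = [{"id": i, "label": "fraud" if i < 50 else "ok"} for i in range(1000)]
sample = stratified_sample(rows, key=lambda r: r["label"], fraction=0.1)

counts = defaultdict(int)
for r in sample:
    counts[r["label"]] += 1
print(dict(counts))
```

A plain random 10% sample would keep about five fraud rows only on average; the stratified version fixes the class proportions by construction.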
Data Science Research
Source: DSRC http://www.slideshare.net/dsrc/data-science-research-center-overview-and-mission
Data Science Research
• Exploration and visualization
  • Auto-summarization
  • Visualizing big data
  • Exploring unstructured data
• Auto-processing of data
  • Data quality assessment
  • Data curation
  • Smart data consolidation
• Auto-modeling
  • Smart model selection
  • Metrics selection
  • Model transformation
Many open areas for research
• Model interpretability
  • Feature importance
  • Per instance interpretation
  • What-if analysis
• Auto-maintenance
  • Active learning / Machine teaching
  • Smart monitoring and testing
• Data/Software Engineering
  • Data model/schema management
  • Data version control
  • Model version control
  • ML project management
  • Security and privacy
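As an illustration of the "feature importance" item, permutation importance is one common, model-agnostic technique: shuffle one feature column and measure how much accuracy drops. This is a minimal sketch, not a method taken from the slides; the `model` lambda is a hypothetical stand-in for a black-box model, and the data is synthetic.

```python
import random

def accuracy(model, X, y):
    """Fraction of rows where the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Mean accuracy drop per feature when that feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column permuted.
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Synthetic data: feature 0 determines the label, feature 1 is pure noise.
data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(500)]
y = [int(row[0] > 0.5) for row in X]

# Hypothetical black-box model that happens to use only feature 0.
model = lambda row: int(row[0] > 0.5)

importances = permutation_importance(model, X, y)
print(importances)
```

The noise feature scores exactly zero because the model never reads it, while shuffling the informative feature pushes accuracy toward chance.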
• Understand the decision process
• Establish performance metrics early
• Keep the human in the loop
• Consider availability (and timing) of data
• Bad data happens
• User interface is important
• Adopt a software version control system
• Implementing solutions takes longer
• Ongoing support is not negligible
• The devil is in the details
10 practical lessons learned
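A small sketch of why "establish performance metrics early" matters: on imbalanced data, accuracy alone can look good while the model misses most of the cases that matter. The function name `classification_metrics` and the toy labels below are assumptions made for illustration, not material from the talk.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when nothing is predicted or present.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical imbalanced case: 10 positives out of 100 rows.
y_true = [1] * 10 + [0] * 90
# The model catches 2 of the 10 positives and raises 1 false alarm.
y_pred = [1] * 2 + [0] * 8 + [1] * 1 + [0] * 89

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here accuracy is 91% even though recall is only 20%, which is the kind of gap that is easier to catch when the metric is agreed on before modeling starts.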