“What is Data Science?” talk for General Assembly by Daniel D. Gutierrez, February 5, 2014
What is Data Science?
Daniel D. Gutierrez, Data Scientist
AMULET Analytics
February 2014
A Life in Data Science
AMULET Analytics
My personal consultancy doing work in data science – computational marketing
Doing data analysis, machine learning and visualization for enterprises
Wide breadth of industries: startups, manufacturing, non-profit, fashion, e-commerce,
market research, etc.
Big Data Journalist
Managing Editor – insideBIGDATA.com
Blogger – Big Data Republic (bigdatarepublic.com)
Blogger – All Analytics (allanalytics.com)
Teaching
Community TA – Coursera
UCLA Extension
Writing a book: “Introduction to Machine Learning with R”
Facilitates a cascade of technologies
Big Data is facilitated by data science
Data science is facilitated by machine learning
Machine learning is a confluence of technologies and disciplines
– Computer science, mathematical statistics, probability theory, visualization
Data science is nothing new!
Components have been around for decades
“Data science” is just a new name for something old and proven (I do love it!)
“Machine learning” used to be “data mining” or KDD.
Much hype recently
Harvard Business Review proclaimed data scientist the “sexiest job of the 21st century.” I’ll take it!
Now with “big data” it’s a force barely contained
Data Science in Perspective
Controversy in hiring data scientists
Some companies post job ads for
unicorns, mythical creatures having
no basis in reality
Hire a data science TEAM!
Don’t expect a single individual to be
both a “theorist” and an
“experimentalist”
Consultant vs. full-time hire
Who Does Data Science? Unicorns!
Big Data
– “large data sets so big that commonly-used software tools are unable to capture,
curate, manage, and process the data within a tolerable elapsed time.”
Hadoop dominates the Big Data market
– Used widely by some of the world's largest websites,
such as Facebook, eBay, Amazon and Yahoo
– Moving into the enterprise
– Invented by developers at Yahoo!
What is Big Data?
Apache Hadoop
Applications for Big Data
Smarter Healthcare
Multi-channel sales
Finance
Log Analysis
Homeland Security
Traffic Control
Telecom
Search Quality
Manufacturing
Trading Analytics
Fraud and Risk
Retail: Churn
“Big Data is the definitive source of
competitive advantage across all
industries. For those organizations
that understand and embrace the new
reality of Big Data, the possibilities
for new innovation, improved agility,
and increased profitability are nearly
endless.”
Source: Wikibon 2012
Father and daughter walk into a Target store to speak with the manager:
– The father wants to know why the store is bombarding his teenage daughter with ads for baby
strollers, diapers and other baby goods. “Are you trying to encourage her to get
pregnant?”
– The befuddled manager apologizes and responds he has no idea why the company is
sending her such items
Father later phones the store to apologize - turns out his daughter was expecting
How?
– Target used Big Data to predict pregnancy. When a woman begins buying vitamins,
increases her purchases of lotion, and buys an oversized purse or bag, the odds are
very high she is expecting
– Target knew the daughter was pregnant before the family did
The Minnesota Dad
What is Machine Learning?
Supervised learning
Prediction and classification
Linear regression, logistic regression, classification trees, SVM, neural nets
Train the algorithm on known labelled data to be able to predict new data
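To make "train on known labelled data, then predict new data" concrete, here is a minimal sketch in Python using simple linear regression fit by ordinary least squares. It is not from the talk: the function names `fit_line` and `predict` and the toy data are assumptions chosen for illustration, and real projects would typically use an established library rather than hand-rolled code.

```python
# Minimal supervised-learning sketch: simple linear regression fit by
# ordinary least squares. Function names and data are hypothetical.

def fit_line(xs, ys):
    """Learn (slope, intercept) from labelled examples (x, y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Apply the learned line to a new, unlabelled input."""
    slope, intercept = model
    return slope * x + intercept

# Training set: inputs with known labels (made-up values on y = 2x + 1).
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]

model = fit_line(train_x, train_y)   # training step
print(predict(model, 5.0))           # prediction for unseen input
```

The same train-then-predict shape applies to the other supervised methods listed above; only the model being fitted changes.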
Unsupervised learning
Hierarchical clustering
K-means clustering
Principal component analysis (PCA)
Dimensionality reduction to address “the curse of dimensionality”
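As an illustration of unsupervised learning, the sketch below implements a bare-bones k-means clustering loop in Python. It is not from the talk: the function name `kmeans`, the fixed iteration count, and the toy points are assumptions, and seeding the centroids with the first k points is a simplification (practical implementations usually use random or k-means++ seeding).

```python
import math

def kmeans(points, k, iterations=10):
    """Group unlabelled points into k clusters (hypothetical helper)."""
    # Simplification: seed centroids with the first k points.
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

# Made-up 2-D data with two visible groups; no labels are supplied.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, k=2)
print(clusters)
```

Unlike the supervised case, no labels are given; the algorithm discovers the grouping from the distances between points alone.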
Machine Learning Overview
Sentiment Analysis
R
– Very good for data acquisition, cleaning, munging, exploratory analysis, model
selection, machine learning algorithm development and training, model performance
evaluation
– One of the best visualization tools bar none
– Has over 4,000 packages
Python
– Good choice for production deployment
– Rapidly catching up with R in terms of data science capabilities
R vs. Python Wars
Visualization is Critical
Doing Data Science
Cathy O’Neil & Rachel Schutt
O’Reilly Media
Learning More About Data Science
Data Science in Action
Integral part of Big Data
– Data science and machine learning fuel big data ✔
The shortage of data scientists is real
– Big data is expected to be a $53.4 billion industry by 2016 ✔
– Job postings for “data scientist” increased 15,000% between 2011 and 2012 ✔
– Job market currently 140,000 – 190,000 open positions ✔
– Projected growth of 18.7% between 2010 and 2020 ✔
Companies of all sizes need to plan out their data science strategy
– Increase value of enterprise data assets ✔
2014 should be a wild year!
– Conference circuit is exploding ✔
– New books, news sources, press coverage abound ✔
Summary – Data Science is Here to Stay