NYC Data Science Academy, NYC Open Data Meetup, Big Data, Chief Data Scientist, Thoughtworks
Data Science Consulting
or
Science meets business, again. Third time a charm?
David Johnston, ThoughtWorks
March 17, 2014
Young scientists become…
• Professors
• Programmers
• Traders
• Data scientists
Talk Overview
• Agile Analytics group at ThoughtWorks
• What is data science anyway? Origins and future. Good or evil?
• Guide to technologies and limits to technology
• Process and methodology for successful data science consulting
ThoughtWorks
• Global software consulting company
• HQ in Chicago. Major offices in NY, San Fran, Dallas, India, Brazil, Australia, China - over 30 worldwide.
• Privately owned by Roy Singham
• Flat hierarchy of passionate people
The three pillars
Agile Analytics at TW
• Practice started in 2011
• Led by Ken Collier and John Spens
• About a dozen people involved
Key Themes
• BI, data warehousing and analytics have largely missed the revolution in agile methodologies.
• We can do analytics in an agile, fast, light-footprint way.
What do we do?
• Probabilistic modeling
• Predictive analytics / machine learning
• Advanced BI, prescriptive analysis
• Big Data technologies
• Advanced algorithms and data structures, streaming
Our main goals
• Use data analysis to give companies an edge in their marketplace
• Use data analysis to improve the world at large
Some typical projects
• Recommender systems
• Customer behavior analysis
• Optimization
• Efficient algorithms/tech for massive data sets
• Company specific analytics challenges
Case Study 1: HealthCare Group Purchasing Organization
• One of the largest GPOs. 1000s of client hospitals
• Hospitals sign up, pay a fee and get group-purchasing discounts
• The GPO has to give hospitals estimates of their likely savings.
• Hospital’s data is usually in a non-standard spreadsheet. No SKUs in healthcare (yet).
• A data matching mess
GPO: Johnson & Johnson Sterile Scalpel #F8-505
Hospital: J&J scalpel, steel item f8505 size 3’’
• Their in-place solution – Oracle, lots of ETL tools, using SQL with lots of rigid rules for how to match.
• Database of matching rules was difficult to maintain
• Accuracy of matching ~60%. Rest was done by hand. Took 1 day for processing and weeks for lines done by hand.
What we did
• First convinced them that their solution was highly inefficient.
• Wrote a Python program using a tree data structure and machine learning to do the matching.
• Ran on my laptop in a few minutes. Match rates > 80%
• This was done in 3 weeks. Later settled on a solution using Elasticsearch.
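To give a feel for the matching problem, here is a minimal sketch that pairs the hospital line from the example with a catalog entry. It is not the program described above (which used a tree data structure and machine learning); it swaps in plain token-set (Jaccard) similarity as the simplest possible stand-in. The function names and the second catalog item are hypothetical.

```python
import re

def tokens(description):
    """Normalize a free-text item description into a set of comparable tokens."""
    text = description.lower()
    # Drop separators that catalogs use inconsistently, so "#F8-505" becomes "f8505".
    text = re.sub(r"[-#&]", "", text)
    return set(re.findall(r"[a-z0-9]+", text))

def similarity(a, b):
    """Jaccard similarity between the token sets of two descriptions."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def best_match(hospital_line, catalog):
    """Return (catalog_item, score) for the most similar catalog entry."""
    scored = ((item, similarity(hospital_line, item)) for item in catalog)
    return max(scored, key=lambda pair: pair[1])

catalog = [
    "Johnson & Johnson Sterile Scalpel #F8-505",
    "Johnson & Johnson Sterile Gauze #G2-110",  # hypothetical second item
]
hospital_line = "J&J scalpel, steel item f8505 size 3''"

match, score = best_match(hospital_line, catalog)
print(match, round(score, 2))
```

The two descriptions share only the tokens "scalpel" and "f8505", yet that is enough to pick the right item. A real matcher would have to weight part numbers more heavily and handle abbreviations such as "J&J".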
Case Study 2: Retail Rec Systems
• Client provides coupons to retailers’ customers
• Needed a better recommendation system
• We’re using a simple logistic regression model
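For readers who have not seen one, a logistic regression model can be sketched in a few lines. The example below is illustrative only: the two features (bought in the category before, opened the last email), the toy data and the training settings are all assumptions, not details of the client project.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """Probability of redemption; w[0] is the bias, w[1:] the feature weights."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

def train_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit weights by batch gradient descent on the log loss."""
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            err = predict(w, x) - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad)]
    return w

# Hypothetical features: [bought_in_category_before, opened_last_email]
rows = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0], [1, 0], [0, 1]]
labels = [1, 1, 0, 0, 1, 0, 0, 0]  # 1 = coupon redeemed

w = train_logistic(rows, labels)
print("P(redeem | bought before, opened email):", round(predict(w, [1, 1]), 2))
print("P(redeem | neither):", round(predict(w, [0, 0]), 2))
```

Ranking coupons by the predicted probability is then enough to produce a recommendation list.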
What exactly is data science?
• Is this really new?
• Does the term “data science” make any sense?
• Is it just a fad? Over-hyped?
• Why did this term just become popular a few years back?
• Where is this going?
• Should scientists/engineers/math-types really go and make a career doing this?
What exactly is data science?
• Is this really new? - Not really
• Does the term “data science” make any sense? - Not really but so what?
• Is it just a fad? Over-hyped? - No, sometimes.
• Why did this term just become popular a few years back? - Productivity
• Where is this going?
• Should scientists/engineers/math-types really go and make a career doing this? - Yes, for most
Is it new?
Of course not
Combination of many subjects:
• Mathematics and statistics - probability theory
• Machine learning
• Computer science - algorithms, data structures, databases
• Operations research - process optimization
• Business consulting
• Software development
Where have we seen this before?
Business: Finance, Insurance, Sports, Government accounting, Retail, Google
Science: Physics, Astronomy, Biology
Isn’t there anything new?
Of course
• Analytics finally becoming ubiquitous in business (as it always should have been)
• Much more communication between disparate fields
• It’s finally work that’s fun
Ok, but why now?
It’s a big movement, so let’s give it a new name: Data Science
Why now? - Productivity
• There has always been plenty of data science in science
• Job prospects in academia are slim
• Productivity has been rising much faster than postdoc salaries and scientist job creation
Data scientist productivity growth
• Salary increase over postdoc requires ~2.5 x
• Salaries in Industry are set by productivity and supply/demand
• Crossing the threshold in productivity leads to new job creation
• Slowing productivity growth and/or changes in supply/demand will eventually end this burst in job creation
• Nothing magical happened in 2005!
Productivity Drivers for Data-science
Long time scale
• Compute, Moore’s law
• The internet (duh!)
• HD and RAM price drop
• Science learns to deal with Big Data
• Growing importance of statistics
More recent
• Git, code sharing
• Machine learning libraries
• Python/R, open source
• Hadoop and ecosystem
• The Cloud, AWS
• NoSQL databases, in-memory
• Growing “data science” community: cohesion, feedback effects of popularity
Then and now
1990s data science
• Writing code in C/C++
• Working with flat files
• Even relational/SQL is new
• Using proprietary software: Matlab, IDL
• Writing all algorithms from scratch. Slow. Buggy.
Data science today
• Working in high level open-source languages Python, R
• We’re good at SQL and have lots of other options (NoSQL)
• Git, thousands of libraries available. Easy to install.
• Can concentrate more on what we’re good at.
So what is data science now
Data Science:
An interdisciplinary field utilizing statistics, computer science and the methods of scientific research in areas outside of science.
Where is it going?
• Big Data technology is separated from data science
• Software developers take over much of Big Data roles
• Businesses begin to understand data science terminology, as they now understand software terminology, and realize that they are not Twitter.
• Data scientists and businesses find a methodology that works, as industrial-scale software development has
Where is it going?
Specialization
• Most experienced data scientists move into consulting or management of teams
• Universities graduate many “data scientist-lite” students from new more specialized BS or MA programs
• Fewer generalists
• PhD students need to learn additional skills. Not instant hires (http://bit.ly/1m3krq6)
Why won’t we have 100x more data scientists in N years?
• Pool of disgruntled postdocs will dry up or “I am not even supposed to be here!”
• Many data science problems don’t need the most cutting edge tools. (Some do).
• People rarely get much experience working with real data in academic settings. Requires real-world experience, takes time.
Are we there yet?
Overhyped, underhyped, mis-hyped?
• No, probably not
• Productivity growth is real
• We are solving important problems. Plenty left.
• Big Data will probably peak in the hype cycle before data science
• Just watched my first analytics commercial. IBM.
Why Big Data enthusiasm might peak soon
Big Data defined: performing calculations on data when
• The work cannot possibly be done on a single machine
• Sampling and streaming are not effective
• Data reduction is not possible
• Storage and compute are closely balanced
• Parallelizing is absolutely unavoidable
Most tasks are not like this
• Sampling is usually good enough for training machine learning
• Need for rapid feedback, interactive work
• CPUs are underutilized. IO limited.
• Usually a better algorithm can solve the problem better
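Sampling a data set that is too large for memory can be done in a single pass on one machine. A minimal sketch, using reservoir sampling (Algorithm R); this technique is an illustration chosen here, not something named in the talk, and the function name and parameters are hypothetical.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample

# The stream could just as well be lines read lazily from a multi-GB file.
training_rows = reservoir_sample(range(10_000_000), k=1000, seed=42)
print(len(training_rows))
```

Memory use is fixed at k items regardless of how long the stream is, which is what makes model development on a laptop practical.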
Hadoop (Spark)
Good use cases
• Large batch jobs like: restructuring and reducing data from raw files.
• Scoring with ML models
• When you have to do something on every data point.
• Raw storage in HDFS
Bad use cases
• Model development
• Visualization
• Brute-forcing an inefficient algorithm.
• Treating Hadoop like a database.
The data-sizes we typically see
Most companies have a few million customers (~10^7)
Often they store ~1000 items per customer
That’s 10^10 data points. At 5 bytes/data-point that is about 50 GB; even at 50 or more bytes/data-point it is 500 GB or a few TB. Fits on our laptops (but not in memory). Such data can be moved to the cloud, if need be, in 1-2 days.
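The back-of-envelope estimate above is easy to check. Note that 10^10 data points at 5 bytes each comes to 50 GB; the 500 GB figure corresponds to roughly 50 bytes per data point. In the sketch below the record sizes other than 5 bytes and the 50 Mbit/s uplink are assumptions made for illustration.

```python
customers = 10**7
items_per_customer = 1000
data_points = customers * items_per_customer  # 10^10

def size_gb(points, bytes_per_point):
    """Raw size in gigabytes (1 GB = 10^9 bytes)."""
    return points * bytes_per_point / 1e9

def transfer_days(size_bytes, mbit_per_s):
    """Days needed to upload size_bytes over a link of the given speed."""
    seconds = size_bytes * 8 / (mbit_per_s * 1e6)
    return seconds / 86400

for bytes_per_point in (5, 50, 200):  # 50 and 200 are assumed record sizes
    gb = size_gb(data_points, bytes_per_point)
    days = transfer_days(gb * 1e9, mbit_per_s=50)  # assumed uplink speed
    print(f"{bytes_per_point:>3} bytes/point: {gb:>6.0f} GB, {days:.1f} days to upload")
```

At the assumed link speed, 500 GB takes a little under a day to upload and 2 TB a few days, which is in line with the 1-2 day figure.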
Often we can be productive with either a sample or an aggregation.
True when • Customer specific items are things like purchases, manually entered text, logins etc.
Not true when • Things are web-events, pair-wise interactions (i.e. graphs, social)
Sources of really big data
Sensor data
• Pictures
• Video
• Health monitoring devices
• Internal device monitors
• Results of combinatorial complexity
However
• Is it really economic to store and process these huge data sets to begin with?
• We will learn to utilize streaming algorithms
• We will learn to focus on information, not noise
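As one concrete instance of a streaming algorithm, the sketch below keeps a running mean and variance of sensor readings without storing them, using Welford's online method. The class name and the sample readings are hypothetical.

```python
class RunningStats:
    """Mean and sample variance of a stream in O(1) memory (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for reading in [2, 4, 4, 4, 5, 5, 7, 9]:  # stand-in for an unbounded sensor feed
    stats.update(reading)
print(stats.n, stats.mean, stats.variance)
```

Each reading is seen once and then discarded, so only the summary (the information) is kept rather than the raw data.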
Case study: Particle Physics
Data reduction par excellence
• 600 million collisions per second
• Most are boring events and are not saved
• Save ~100 petabytes per year
Determine existence of the Higgs boson: 1 bit
Measure its mass to 1%: ~1 byte
Data = Exabytes
Information = 9 bits
Compression 10^18
Goal
$9 billion per byte!
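The arithmetic behind these numbers can be reproduced roughly as follows. The sketch assumes one exabyte of raw data and a total project cost of about $9 billion; the cost is an assumption inferred from the punchline, not a figure stated on the slide.

```python
raw_data_bits = 8 * 10**18   # 1 exabyte = 10^18 bytes (assumed raw data volume)
information_bits = 1 + 8     # existence (1 bit) + mass to ~1% (~1 byte)

compression_ratio = raw_data_bits / information_bits
print(f"compression ratio ~ {compression_ratio:.1e}")   # order of 10^18

total_cost_usd = 9e9         # assumed total cost
information_bytes = information_bits / 8
cost_per_byte = total_cost_usd / information_bytes
print(f"cost per byte of information ~ ${cost_per_byte:.1e}")
```

With 9 bits counted as slightly more than one byte the result is about $8 billion per byte; rounding 9 bits to one byte gives the $9 billion headline figure.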
Data science consulting
The good
• Always something new, always learning.
• Exposed to many different people.
• Get to see how everything works on the inside.
• See the world!
• Low career risk but still fun.
The bad
• Your clients choose you
• People problems often more important than math problems
• Travel can be extreme
• Your great ideas will rarely be credited to you.
Challenges in data science consulting
• Businesses don’t yet understand the terminology, process or techniques. Much teaching involved
• A visionary CEO sends you into a not-so-visionary environment
• Problems can be vague
• Communication with business stakeholders takes much of your time
• We are still developing an effective model. More than just agile techniques
Red flags to avoid
• “Build us a platform for analytics so we can become a data-driven company.” A non sequitur.
• Wanting prediction of the unpredictable
• Attempting to use ML on noisy data
• When incentives and opinions are all over the map
• Convinced that the problem was solved 20 years ago. E.g. linear regression, segmentation model, SAS.
Keep offering up bold ideas
• Look for ways for major productivity enhancement
• Keep up on cutting-edge literature in stats/ML
• All my best ideas for web-apps are now successful companies.
• Everybody laughed at them!
Data science is NOT going to be productized.
FIN