Big Data: Predictive Analytics VS Causal Inference
Wharton Data Camp Sessions 7
Agenda
1) What’s the deal here?
2) Why should you be aware?
3) What kind of development is going on right now?
4) “Big Data and You”
Big Data! Some videos
Joy of Stat - Hans Rosling!
http://www.youtube.com/watch?v=CiCQepmcuj8
http://www.ted.com/playlists/56/making_sense_of_too_much_data.html
http://motherboard.vice.com/blog/big-data-explained-brilliantly-in-one-short-video (broad overview; stop at 4:26)
http://www.intel.com/content/www/us/en/big-data/big-data-101-animation.html (more relevant; stop at 2:10)
“This is the caveman era of the big data”
What’s cool is cool because we are looking at this data for the first time, and sometimes even a correlation is cool! Mashing up different big data sets can also make things scary (CMU Face app)
A scientific field typically begins with correlation, then moves on to causality as it matures
The Rise of Predictive Models
Statistics & Computer Science
With the rise of data + computational power
Better prediction
Model free – no theory backing
Blackbox algorithms
Statistical algorithms
Goal: Predict well (with big enough data, it works)
Techniques: MANY
Take CIS 520: Machine Learning for a basic intro. At least audit! It will open up your eyes
Stat 9XX - Statistical Learning Theory, if offered! Also great – will be a lot of probability/stat theory
Good Old Causal Inference
Statistics & Econometrics
Explore -> Develop Theory -> Test with Statistical Inference models (Linear Models / Graphical Models / etc.)
Requirements for “X Causes Y”:
X must temporally come before Y (NOT required in a predictive model)
X must have a statistically significant relation to Y
Association between X and Y must not be due to another omitted variable (NOT required in a predictive model)
Theory is from economics/sociology/psychology etc
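The third requirement (the association must not come from an omitted variable) can be illustrated with a small simulation. This is a minimal sketch, not part of the original slides: the variable names, sample size, and data-generating process (a hidden Z that drives both X and Y, with no effect of X on Y) are assumptions chosen for illustration.

```python
import random

def pearson(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def residuals(y, z):
    """Residuals of y after a least-squares fit on z (with intercept)."""
    n = len(y)
    my, mz = sum(y) / n, sum(z) / n
    sxy = sum((zi - mz) * (yi - my) for zi, yi in zip(z, y))
    sxx = sum((zi - mz) ** 2 for zi in z)
    slope = sxy / sxx
    return [yi - my - slope * (zi - mz) for zi, yi in zip(z, y)]

random.seed(0)  # fixed seed so the run is repeatable
n = 5000
z = [random.gauss(0, 1) for _ in range(n)]  # hidden (omitted) variable
x = [zi + random.gauss(0, 1) for zi in z]   # X depends on Z only
y = [zi + random.gauss(0, 1) for zi in z]   # Y depends on Z only, not on X

raw_corr = pearson(x, y)                              # X and Y look related
adj_corr = pearson(residuals(x, z), residuals(y, z))  # after removing Z's contribution

print(f"corr(X, Y)           = {raw_corr:.2f}")
print(f"corr(X, Y) net of Z  = {adj_corr:.2f}")
```

A predictive model would happily use X to predict Y here; a causal claim that X drives Y would be wrong, because the association disappears once Z is accounted for.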
Predictive Analytics VS Causal Inference
Predictive analytics (Machine Learning, Algorithms)
Art of prediction
RMSE/Error functions
Causal Inference (Rubin Causal Model, Structural)
Theory building
Testing theory with statistical tools and robust design of experiment or techniques to deal with observational data
Statistics/Comp Sci (Algorithms and Data mining, Machine Learning)
Statistics/Econometrics (Causality – different schools of thought even within causal inference groups. For a brief fun intro, see http://leedokyun.com/obs.pdf)
Paradigm-Building – Kuhnian sense & Falsify existing beliefs – Popperian sense
Causal inference can do both. Predictive Models cannot
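The RMSE criterion listed under predictive analytics can be written down in a few lines. This is a minimal sketch under stated assumptions: the toy numbers and the one-variable least-squares model are illustrative and do not come from the slides.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical toy data: fit on one part, score on held-out points.
train_x, train_y = [1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1]
test_x, test_y = [6, 7], [12.0, 14.1]

a, b = fit_line(train_x, train_y)
predictions = [a + b * x for x in test_x]
print(f"held-out RMSE = {rmse(test_y, predictions):.3f}")
```

Scoring on held-out points rather than the training data is the design choice that matters: the predictive goal is low error on data the model has not seen, with no claim about why the relationship holds.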
Arguments About the Big data Movement
Great portion of Start-ups & Many big data firms these days
http://xkcd.com/882/
You’d be surprised how naïve some industry people are. Some are great.
Companies are trying to collect everything about everyone. Becomes unwieldy beast!
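The linked xkcd comic is about running many significance tests and reporting whichever one comes out "significant". The arithmetic behind that warning can be sketched as follows. The 0.05 threshold and the count of 20 independent tests are assumptions taken from the comic's setup, not from the slides.

```python
def prob_any_false_positive(num_tests, alpha=0.05):
    """Chance that at least one of num_tests independent tests of true
    null hypotheses comes out 'significant' at level alpha."""
    return 1 - (1 - alpha) ** num_tests

for k in (1, 5, 20, 100):
    p = prob_any_false_positive(k)
    print(f"{k:>3} tests -> P(at least one false positive) = {p:.2f}")
```

With enough variables collected about everyone, some correlation will clear the threshold by chance alone, which is one reason "collect everything" without a theory is risky.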
Arguments About the Big data Movement
Read these interesting pieces featuring the dynamic duo of marketing (Prof Eric Bradlow and Prof Peter Fader) and Prof Eric Clemons of OPIM.
http://www.sas.com/resources/asset/SAS_BigData_final.pdf
http://knowledge.wharton.upenn.edu/article.cfm?articleid=2186
http://www.datanami.com/datanami/2012-05-03/wharton_professor_pokes_hole_in_big_data_balloon.html
Some notable examples
[Figure: map with axes Not Causal / Causal and Small / Large. Labels on the map: RevolutionR; Hal Varian (Google); Susan Athey (MS); Angrist, Krueger; UCLA Stat book; Targeted learning; Economists; Machine Learning; “What are you doing here?”; Netflix; Google; Data Mining; Structural Modeling; Information Systems Management; Lab experiment; Fraud detection; Hans Rosling; Marketing; Association rule. The positions of the labels are not recoverable from the extracted text.]
As academics/economists/scientists, we need to embrace the rise of predictive models in the big data era and use them to extract information from unstructured data, but never forget that our goal is to build paradigms and contribute to the knowledge/theory base
Usage of Predictive analytics/Machine learning
Exploration (Unsupervised learning/Clustering/Anomaly detection)
Data Extraction From Unstructured Data (NLP + Supervised Learning)
Some people have started to incorporate machine learning techniques into causal inference
Machine learning in matching (PSM)
Targeted Learning, 2012 Springer Series (http://www.targetedlearningbook.com/)
Great blogs for causal inference
Andrew Gelman: Bayesian statistician at Columbia U
http://andrewgelman.com/
This is where a lot of the action is happening
That great fight of 2009 between the Pearlians and the Rubinians!
“Boy, these academic disputes are fun! Such vitriol! Such personal animosity! It's better than reality TV. Did Rubin slap Pearl's mom, or perhaps vice versa?”
“With all due respect, I think you are wrong that Judea does not understand the Rubin approach.” – Larry Wasserman
Judea Pearl “Causality”
Observational studies books by Paul Rosenbaum
Miguel Hernan and Jamie Robins “Causal Inference”, free now: http://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/
Intro to Practical Natural Language Processing
Wharton Data Camp Sessions 7
Agenda
1) Brief light-hearted Intro to NLP (What is it and why should I care?)
2) Basic ideas in NLP
3) Usage in Business Research
Quick Overview
What is Natural (Spoken) Language Processing (NLP)?
Examples
How this technology may affect:
Industry
Academics
Natural Language Processing
Natural Language Processing is an interdisciplinary field composed of techniques and ideas from computer science, statistics and linguistics that are concerned with making computers able to parse, understand (knowledge representation), store (knowledge database), and ultimately interact (convey information) in natural language (human language such as English)
Methods: machine learning, Bayesian statistics, algorithms, higher-order logic, linguistics.
Subcategories of NLP
Information Retrieval: Google. Optimizing text database search.
Information Extraction: The crude basic form is web crawling + regex. A really sophisticated form, which you’ll see later, is Thomson Reuters
Machine Translation
Sentiment Analysis and more
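The "crude basic form" of information extraction, regular expressions over raw text, might look like the sketch below. The patterns are illustrative assumptions and would miss many real-world formats. The sample sentence is abridged from the Lake Shore Bancorp release quoted later in these slides.

```python
import re

text = (
    "Lake Shore Bancorp, Inc. (NASDAQ Global Market: LSBK) announced third "
    "quarter 2012 net income of $863,000, or $0.15 per diluted share, compared "
    "to net income of $1.2 million, or $0.20 per diluted share, for third quarter 2011."
)

# Ticker: capital letters that follow an exchange name and a colon.
ticker = re.search(r"NASDAQ[^:)]*:\s*([A-Z]+)", text).group(1)

# Dollar amounts: digits with optional thousands groups and decimals,
# optionally followed by a scale word.
amounts = re.findall(
    r"\$\d+(?:,\d{3})*(?:\.\d+)?(?:\s(?:million|billion))?", text
)

print(ticker)
print(amounts)
```

Note what the regex cannot do: it finds four amounts but has no idea which is net income, which is per-share, or which quarter each belongs to. That gap is what the more sophisticated systems address.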
Cool Applications
NSA - uses NLP to detect anomalous activity in internet traffic and phone calls that may indicate terrorist activity
Lie detection via spoken language processing
Automatic plagiarism detector
ETS Testing - since 1999 “e-rater” automatic essay scoring on GMAT, GRE, TOEFL.
Shazam – song discovery (application of spoken language processing)
News aggregators based on topic
Entertainment - Cleverbot (rated human 59.3% of the time in a Turing test VS 63.3% for real humans). Really evolved from dumb predecessors such as ELIZA, SmarterChild, etc.
Business Applications
Marketing - sentiment analysis and demand analysis of products from reviews and blogs, e.g. movies, consumer products
Marketing – Opinion Mining/Subjectivity analysis/Emotion Detection/Opinion Spam Detection etc
Finance - Quantitative/qualitative high-frequency trading (Thomson Reuters, Bloomberg)
Management – Resume filtering and firm-employee matching
Legal Studies – legal document search engines
E-Commerce – help chat bots
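The sentiment analysis of reviews mentioned under Marketing can be sketched in its simplest, lexicon-based form. The word lists and sample reviews below are hypothetical; production systems typically use much larger validated lexicons or a trained classifier.

```python
import re

# Hypothetical mini-lexicon, for illustration only.
POSITIVE = {"great", "love", "excellent", "good", "fun"}
NEGATIVE = {"bad", "boring", "terrible", "hate", "slow"}

def sentiment_score(review):
    """Count of positive words minus count of negative words."""
    words = re.findall(r"[a-z']+", review.lower())
    positives = sum(w in POSITIVE for w in words)
    negatives = sum(w in NEGATIVE for w in words)
    return positives - negatives

reviews = [
    "Great cast and a fun story, I love it",
    "Slow, boring, and the ending was terrible",
    "It was a movie",
]
for review in reviews:
    print(sentiment_score(review), review)
```

A word-count approach like this ignores negation ("not good") and sarcasm, which is where the opinion mining and subjectivity analysis work listed above comes in.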
Mainstream Applications
Siri (dumb) - preprogrammed. No learning
IBM Watson/ Wolfram Alpha (smart):
semantic representation of concepts
acquisition of knowledge
logical inference machine
As of 2011, Watson had the knowledge equivalent of a second-year medical student (which isn’t saying much, but still cool given the speed at which Watson learns)
Watson gets an attitude
IBM Watson learned the Urban Dictionary in 2013…
“Watson couldn't distinguish between polite language and profanity -- which the Urban Dictionary is full of. Watson picked up some bad habits from reading Wikipedia as well. In tests it even used the word "bullshit" in an answer to a researcher's query.
Ultimately, Brown's 35-person team developed a filter to keep Watson from swearing and scraped the Urban Dictionary from its memory.”
Well no $@!# Sherlock! You mea@#$%s can bite my shiny metal !@$
Some fun facts
15,000:
Average number of words spoken by an average person per day (various sociology and linguistics studies). Approximately 15 words per minute, assuming 8 hours of sleep.
100 million ~ 300 million:
Average number of words spoken by an average person in a lifetime.
100 TRILLION:
Approximate number of words on the internet in 2007, according to Peter Norvig (AI scientist who leads Google research).
Reasons why you should at least acknowledge NLP and keep it in mind for the rest of your life
1. It will definitely be a disruptive technology and a large part of everyday life, affecting most types of business (it has already disrupted finance, marketing, management, etc.)
2. Text Data: Explosion of web, Company performance report, news, security filings etc
3. Even in business research outside of Information Systems Management and Marketing, more and more researchers are utilizing NLP
Example Focus (Finance)
Thomson Reuters (Automation Team) and Bloomberg
Business Wire: 60 stories per second
“Apple also announced that Scott Forstall will be leaving Apple next year and will serve as an advisor to CEO Tim Cook in the interim”
Lake Shore Bancorp, Inc. (the “Company”) (NASDAQ Global Market: LSBK), the holding company for Lake Shore Savings Bank (the “Bank”), announced third quarter 2012 net income of $863,000, or $0.15 per diluted share, compared to net income of $1.2 million, or $0.20 per diluted share, for third quarter 2011. The Company had net income of $2.8 million, or $0.48 per diluted share, for the nine months ended September 30, 2012, compared to net income of $3.1 million, or $0.54 per diluted share for the same period in 2011.
Extract relevant information -> computer readable format such as XML/JSON
KEY: where the information sits in the text, and how to extract it, are not preprogrammed. The NLP engine learns as new information comes in. Initially, it learns how to extract, and what is important, from humans tagging many articles (semi-supervised learning).
Lake Shore Bancorp, Inc. (the “Company”) (NASDAQ Global Market: LSBK), the holding company for Lake Shore Savings Bank (the “Bank”), announced third quarter 2012 net income of $863,000, or $0.15 per diluted share, compared to net income of $1.2 million, or $0.20 per diluted share, for third quarter 2011. The Company had net income of $2.8 million, or $0.48 per diluted share, for the nine months ended September 30, 2012, compared to net income of $3.1 million, or $0.54 per diluted share for the same period in 2011. [...]
Named Entity Recognition: Has to realize that “Lake Shore Bancorp, Inc.” is the name of a company
Coreference resolution: “the Company” is Lake Shore Bancorp, Inc.
Morphological segmentation: breaking words into their basic parts and meaning (“lexeme”), e.g. “announced” is the past tense of the lexeme “announce” with inflection rule -ed
Part of Speech Tagging and Grammar Parsing
Chunking and Breaking: e.g. “A and B of X and Y” becomes is(A,X) and is(B,Y)
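Two of the steps above, morphological segmentation and the "A and B of X and Y" pairing, can be sketched with toy rules. This is a deliberately naive illustration under stated assumptions: the mini-lexicon, the function names, and the string-splitting rules are hypothetical, and real systems learn these patterns rather than hard-coding them.

```python
# Hypothetical mini-lexicon of known lexemes; a real system would use a
# full dictionary or a learned model.
LEXEMES = {"announce", "compare", "end", "hold"}

def split_past_tense(word):
    """Return (lexeme, suffix) for a regular past-tense form,
    or (word, '') if no rule applies."""
    w = word.lower()
    if w.endswith("ed"):
        # Try dropping "ed" ("ended" -> "end"), then only "d"
        # ("announced" -> "announce").
        for stem in (w[:-2], w[:-1]):
            if stem in LEXEMES:
                return stem, "-ed"
    return w, ""

def distribute(phrase):
    """Pair up 'A and B of X and Y' as [(A, X), (B, Y)]."""
    left, right = phrase.split(" of ", 1)
    heads = left.split(" and ")
    owners = right.split(" and ")
    if len(heads) != len(owners):
        raise ValueError("cannot pair items one-to-one")
    return list(zip(heads, owners))

print(split_past_tense("announced"))
print(distribute("A and B of X and Y"))
```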
Example Focus (Finance)
<company name="Lake Shore Bancorp">
  <Alias>The Company</Alias>
  <Holds>Lake Shore Savings Bank</Holds>
  <Holds>The Bank</Holds>
  <Q year="2012" period="third">863000</Q>
  <Q year="2011" period="third">1.2 Million</Q>
  <Net year="2012" month="9">2.8 Million</Net>
  <Net year="2011" month="9">3.1 Million</Net>
  .........
</company>
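Once a press release has been turned into a structured record like the one above, downstream code can use it directly. The sketch below parses an abridged copy of that XML, written with straight quotes so it is well-formed. The `to_number` helper and the year-over-year comparison are illustrative assumptions, not part of any Thomson Reuters or Bloomberg system.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the slide's record, with straight quotes (assumed cleanup).
doc = """
<company name="Lake Shore Bancorp">
  <Alias>The Company</Alias>
  <Holds>Lake Shore Savings Bank</Holds>
  <Q year="2012" period="third">863000</Q>
  <Q year="2011" period="third">1.2 Million</Q>
</company>
"""

def to_number(text):
    """Convert '863000' or '1.2 Million' to a float.
    Only the scale word 'Million' is handled in this sketch."""
    parts = text.split()
    value = float(parts[0].replace(",", ""))
    if len(parts) > 1 and parts[1].lower() == "million":
        value *= 1_000_000
    return value

root = ET.fromstring(doc.strip())

# Map (period, year) -> quarterly net income in dollars.
income = {
    (q.get("period"), q.get("year")): to_number(q.text)
    for q in root.findall("Q")
}
change = income[("third", "2012")] - income[("third", "2011")]

print(root.get("name"))
print(f"Q3 net income change, 2011 to 2012: {change:,.0f}")
```

The need for `to_number` shows why consistent units matter in the extracted record: mixing `863000` and `1.2 Million` pushes parsing work onto every consumer.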
Bottom Line
NLP can do lots of cool stuff
Unstructured text data is huge and is growing faster than ever, and it will continue to grow as the online population increases
NLP is an important tool for anyone to be aware of
Jurafsky & Martin “Speech and Language Processing” for deep theory
Bing Liu’s two books: http://www.cs.uic.edu/~liub/
Practical NLTK books: an NLTK cookbook by Jacob Perkins and “Natural Language Processing with Python” by Steven Bird et al.