Big Data: New Tricks for Econometrics
Varian, Hal R. "Big Data: New Tricks for Econometrics." The Journal of Economic Perspectives (2014): 3-27.
Konstantina Christakopoulou, Liang Zeng
Group G21
Related to Chapter 28: Data Mining
Motivation: Machine Learning for Economic Transactions
Linear regression is not enough:
- Big data size
- Many features: variable selection is needed
- Relationships are not only linear!
Connection to the Course: Decision Trees, e.g., ID3
Challenges of ID3:
- Cannot handle continuous attributes
- Prone to outliers
1. C4.5 and Classification and Regression Trees (CART) can handle:
+ continuous and discrete attributes
+ missing attribute values
+ over-fitting, via post-pruning
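How a continuous attribute can be handled: in the C4.5 style, candidate binary thresholds are placed between consecutive sorted values, and the one with the highest information gain is kept. A minimal sketch, with hypothetical toy data (ages and survival labels):

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_threshold(values, labels):
    """Try each midpoint between consecutive sorted values as a binary
    threshold on a continuous attribute; keep the highest information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_t = 0.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        gain = base - (len(left) / len(pairs)) * entropy(left) \
                    - (len(right) / len(pairs)) * entropy(right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

# Hypothetical data: passenger ages and survival labels.
ages = [5, 10, 30, 40, 60, 70]
labels = ["yes", "yes", "no", "no", "no", "no"]
t, gain = best_threshold(ages, labels)  # splits between 10 and 30, i.e. t = 20.0
```

The split perfectly separates the two classes here, so the information gain equals the entropy of the full label set.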
2. Random Forests: an ensemble of decision trees. Randomization (sampling the data + sampling the attributes) leads to better accuracy!
ID3 Decision Tree
Classification and Regression Trees (CART)
A classification tree is used when the predicted outcome is the class to which the data belongs.
A regression tree is used when the predicted outcome can be considered a real number (e.g., the age of a house, or a patient's length of stay in a hospital).
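For the regression case, a common split criterion is variance reduction: choose the threshold that minimizes the size-weighted variance of the target values in the two child nodes. A minimal single-feature sketch with hypothetical data (house age vs. price):

```python
def variance(ys):
    """Mean squared deviation: the impurity a regression tree minimizes."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def best_regression_split(xs, ys):
    """Pick the threshold on one feature that minimizes the
    size-weighted variance of the targets in the two children."""
    pairs = sorted(zip(xs, ys))
    best_score, best_t = float("inf"), None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        score = len(left) * variance(left) + len(right) * variance(right)
        if score < best_score:
            best_score, best_t = score, t
    return best_t

# Hypothetical data: house age (years) vs. sale price (in $1000s).
ages = [1, 2, 3, 20, 25, 30]
prices = [300, 310, 305, 180, 170, 175]
split = best_regression_split(ages, prices)  # splits between 3 and 20 years
```

Each leaf of a regression tree then predicts the mean of the target values that fall into it.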
Classification and Regression Trees (CART): Predicting Titanic Survivors Using Age and Class
Classification and Regression Trees (CART): A CART for Survivors of the Titanic, Built with R
Random Forests
- Choose a bootstrap sample of the data and start to grow a tree
- At each node, choose a random subset of the predictors to make the next split
- Repeat many times to grow a forest of trees
- For prediction, have each tree make its prediction, then take a majority vote
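The steps above can be sketched in Python. To keep it short, each tree here is a depth-1 stump (a real random forest grows full trees the same way); the data is a hypothetical toy set of (age, class) pairs:

```python
import random
from collections import Counter

def majority(labels):
    """Most common label (ties broken by first encountered)."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels, n_features):
    """Grow a depth-1 tree, considering only a random subset of features
    at the split (the random-subspace step of a random forest)."""
    candidates = random.sample(range(len(rows[0])), n_features)
    best = None
    for f in candidates:
        for t in sorted({r[f] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[f] <= t]
            right = [l for r, l in zip(rows, labels) if r[f] > t]
            if not right:
                continue
            guess_l, guess_r = majority(left), majority(right)
            err = sum(l != guess_l for l in left) + sum(l != guess_r for l in right)
            if best is None or err < best[0]:
                best = (err, f, t, guess_l, guess_r)
    if best is None:  # all rows identical on the sampled features
        lab = majority(labels)
        return lambda row: lab
    _, f, t, guess_l, guess_r = best
    return lambda row: guess_l if row[f] <= t else guess_r

def random_forest(rows, labels, n_trees=25, n_features=1):
    n = len(rows)
    trees = []
    for _ in range(n_trees):
        idx = [random.randrange(n) for _ in range(n)]  # bootstrap sample
        trees.append(fit_stump([rows[i] for i in idx],
                               [labels[i] for i in idx], n_features))
    # Prediction: each tree votes, the majority wins.
    return lambda row: majority([tree(row) for tree in trees])

random.seed(0)  # for a reproducible sketch
rows = [(4, 1), (8, 2), (30, 1), (40, 3), (25, 3), (60, 3)]  # (age, class), toy values
labels = ["yes", "yes", "yes", "no", "no", "no"]
predict = random_forest(rows, labels)
```

Because each tree sees a different bootstrap sample and a different feature subset, the trees make partly independent errors, and the vote averages those errors out.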
Decision Tree Learning
+ One tree
+ Trained on all learning samples
+ Prone to distortions, e.g., outliers
Random Forest
+ Many decision trees
+ Each tree trained on a random subset of the samples
+ Reduces the effect of outliers (less overfitting)
Thank you!