Upload
buihanh
View
236
Download
5
Embed Size (px)
Citation preview
Python for applied practitioners
F Curci F Masera T Rodriacuteguez Martiacutenez
February 24th 2017
The two Fedes and Tomas Python Applied reading group 1 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 2 40
What Python is
Python is a widely used programming languare forgeneral-purpose programming It has a syntax which allowsprogrammers to express concepts in fewer lines of code thanpossible in languages such as C++ or JavaFreeUsers contribution librariesMany platforms use Python as communicating way (becominguniversal language)Very powerful in dealing with strings
The two Fedes and Tomas Python Applied reading group 3 40
Good way to learn Python
Coursera course Python for everybody specialization Universityof MichiganBook Python for informatics
The two Fedes and Tomas Python Applied reading group 4 40
Use
How to obtain it and languageDownloadLanguages currently used Python 2 and Python 3Anaconda
Two methodsNotepad+command prompt
Need to save file as pyType filepy in command prompt
Compiler Pycharm Atom Spyder
The two Fedes and Tomas Python Applied reading group 5 40
Web browsing
Python allows you to write a web browserLibrary to open resources by URL urllibExample we want to Python to go to web to obtain FIPS code ofTexas counties
Counties
The two Fedes and Tomas Python Applied reading group 6 40
Web browsing
The two Fedes and Tomas Python Applied reading group 7 40
Web browsing
The two Fedes and Tomas Python Applied reading group 7 40
Web browsing
The two Fedes and Tomas Python Applied reading group 7 40
Ways to extract information from the web
API (Application programming interface)Structured way to access data from websitesUsually limited number of queriesNot all website provide itTwitter Google maps etcSome Stata commands
ScrapingTransformation of unstructured data (HTML format) on the webinto structured data (database)Stata command copy
The two Fedes and Tomas Python Applied reading group 8 40
API
Talk to web services URLs that are designed explicitly to hand databack cleaned for your application
Need to agree on wire protocol for python and javaTwo wire formats to exchange data between applications
XMLJSON
LibrariesxmletreeElementTreejson
Google map APIPossible to retrieve address latitude and longitude of placesDocumentationLimits
2500 requests per day50 requests per second
The two Fedes and Tomas Python Applied reading group 9 40
Google map API
The two Fedes and Tomas Python Applied reading group 10 40
Google map API
The two Fedes and Tomas Python Applied reading group 10 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt
lttitlegtThis is the titlelttitlegtltheadgtltbodygt
lta href =wwwwikipediaes gt Wikipedia linkltagtltbodygt
lthtmlgt
The two Fedes and Tomas Python Applied reading group 11 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt︸ ︷︷ ︸Start tag
lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element
ltheadgt︸ ︷︷ ︸End tag
ltbodygtlta href︸︷︷︸
Attribute name
=wwwwikipediaes︸ ︷︷ ︸Attribute value
gtWikipedia link︸ ︷︷ ︸Content
ltagt
ltbodygtlthtmlgt
Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser
The two Fedes and Tomas Python Applied reading group 12 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt︸ ︷︷ ︸Start tag
lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element
ltheadgt︸ ︷︷ ︸End tag
ltbodygtlta href︸︷︷︸
Attribute name
=wwwwikipediaes︸ ︷︷ ︸Attribute value
gtWikipedia link︸ ︷︷ ︸Content
ltagt
ltbodygtlthtmlgt
Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser
The two Fedes and Tomas Python Applied reading group 12 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
PDF Scraping
Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner
The two Fedes and Tomas Python Applied reading group 17 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 19 40
Introduction
Python can be used both in QGIS and ArcGIS as a language tointeract with these programs
These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python
The two Fedes and Tomas Python Applied reading group 20 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 21 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 22 40
Learn to Write GIS Python Code
1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that
generates a Python Code)4 Write your own py file
The two Fedes and Tomas Python Applied reading group 23 40
Example 1 - Automation of Presentation of Results
You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate
You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map
If you have multiple maps you can even created videos of maps
The two Fedes and Tomas Python Applied reading group 24 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 2 40
What Python is
Python is a widely used programming languare forgeneral-purpose programming It has a syntax which allowsprogrammers to express concepts in fewer lines of code thanpossible in languages such as C++ or JavaFreeUsers contribution librariesMany platforms use Python as communicating way (becominguniversal language)Very powerful in dealing with strings
The two Fedes and Tomas Python Applied reading group 3 40
Good way to learn Python
Coursera course Python for everybody specialization Universityof MichiganBook Python for informatics
The two Fedes and Tomas Python Applied reading group 4 40
Use
How to obtain it and languageDownloadLanguages currently used Python 2 and Python 3Anaconda
Two methodsNotepad+command prompt
Need to save file as pyType filepy in command prompt
Compiler Pycharm Atom Spyder
The two Fedes and Tomas Python Applied reading group 5 40
Web browsing
Python allows you to write a web browserLibrary to open resources by URL urllibExample we want to Python to go to web to obtain FIPS code ofTexas counties
Counties
The two Fedes and Tomas Python Applied reading group 6 40
Web browsing
The two Fedes and Tomas Python Applied reading group 7 40
Web browsing
The two Fedes and Tomas Python Applied reading group 7 40
Web browsing
The two Fedes and Tomas Python Applied reading group 7 40
Ways to extract information from the web
API (Application programming interface)Structured way to access data from websitesUsually limited number of queriesNot all website provide itTwitter Google maps etcSome Stata commands
ScrapingTransformation of unstructured data (HTML format) on the webinto structured data (database)Stata command copy
The two Fedes and Tomas Python Applied reading group 8 40
API
Talk to web services URLs that are designed explicitly to hand databack cleaned for your application
Need to agree on wire protocol for python and javaTwo wire formats to exchange data between applications
XMLJSON
LibrariesxmletreeElementTreejson
Google map APIPossible to retrieve address latitude and longitude of placesDocumentationLimits
2500 requests per day50 requests per second
The two Fedes and Tomas Python Applied reading group 9 40
Google map API
The two Fedes and Tomas Python Applied reading group 10 40
Google map API
The two Fedes and Tomas Python Applied reading group 10 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt
lttitlegtThis is the titlelttitlegtltheadgtltbodygt
lta href =wwwwikipediaes gt Wikipedia linkltagtltbodygt
lthtmlgt
The two Fedes and Tomas Python Applied reading group 11 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt︸ ︷︷ ︸Start tag
lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element
ltheadgt︸ ︷︷ ︸End tag
ltbodygtlta href︸︷︷︸
Attribute name
=wwwwikipediaes︸ ︷︷ ︸Attribute value
gtWikipedia link︸ ︷︷ ︸Content
ltagt
ltbodygtlthtmlgt
Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser
The two Fedes and Tomas Python Applied reading group 12 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt︸ ︷︷ ︸Start tag
lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element
ltheadgt︸ ︷︷ ︸End tag
ltbodygtlta href︸︷︷︸
Attribute name
=wwwwikipediaes︸ ︷︷ ︸Attribute value
gtWikipedia link︸ ︷︷ ︸Content
ltagt
ltbodygtlthtmlgt
Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser
The two Fedes and Tomas Python Applied reading group 12 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
PDF Scraping
Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner
The two Fedes and Tomas Python Applied reading group 17 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 19 40
Introduction
Python can be used both in QGIS and ArcGIS as a language tointeract with these programs
These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python
The two Fedes and Tomas Python Applied reading group 20 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 21 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 22 40
Learn to Write GIS Python Code
1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that
generates a Python Code)4 Write your own py file
The two Fedes and Tomas Python Applied reading group 23 40
Example 1 - Automation of Presentation of Results
You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate
You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map
If you have multiple maps you can even created videos of maps
The two Fedes and Tomas Python Applied reading group 24 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
What Python is
Python is a widely used programming languare forgeneral-purpose programming It has a syntax which allowsprogrammers to express concepts in fewer lines of code thanpossible in languages such as C++ or JavaFreeUsers contribution librariesMany platforms use Python as communicating way (becominguniversal language)Very powerful in dealing with strings
The two Fedes and Tomas Python Applied reading group 3 40
Good way to learn Python
Coursera course Python for everybody specialization Universityof MichiganBook Python for informatics
The two Fedes and Tomas Python Applied reading group 4 40
Use
How to obtain it and languageDownloadLanguages currently used Python 2 and Python 3Anaconda
Two methodsNotepad+command prompt
Need to save file as pyType filepy in command prompt
Compiler Pycharm Atom Spyder
The two Fedes and Tomas Python Applied reading group 5 40
Web browsing
Python allows you to write a web browserLibrary to open resources by URL urllibExample we want to Python to go to web to obtain FIPS code ofTexas counties
Counties
The two Fedes and Tomas Python Applied reading group 6 40
Web browsing
The two Fedes and Tomas Python Applied reading group 7 40
Web browsing
The two Fedes and Tomas Python Applied reading group 7 40
Web browsing
The two Fedes and Tomas Python Applied reading group 7 40
Ways to extract information from the web
API (Application programming interface)Structured way to access data from websitesUsually limited number of queriesNot all website provide itTwitter Google maps etcSome Stata commands
ScrapingTransformation of unstructured data (HTML format) on the webinto structured data (database)Stata command copy
The two Fedes and Tomas Python Applied reading group 8 40
API
Talk to web services URLs that are designed explicitly to hand databack cleaned for your application
Need to agree on wire protocol for python and javaTwo wire formats to exchange data between applications
XMLJSON
LibrariesxmletreeElementTreejson
Google map APIPossible to retrieve address latitude and longitude of placesDocumentationLimits
2500 requests per day50 requests per second
The two Fedes and Tomas Python Applied reading group 9 40
Google map API
The two Fedes and Tomas Python Applied reading group 10 40
Google map API
The two Fedes and Tomas Python Applied reading group 10 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt
lttitlegtThis is the titlelttitlegtltheadgtltbodygt
lta href =wwwwikipediaes gt Wikipedia linkltagtltbodygt
lthtmlgt
The two Fedes and Tomas Python Applied reading group 11 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt︸ ︷︷ ︸Start tag
lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element
ltheadgt︸ ︷︷ ︸End tag
ltbodygtlta href︸︷︷︸
Attribute name
=wwwwikipediaes︸ ︷︷ ︸Attribute value
gtWikipedia link︸ ︷︷ ︸Content
ltagt
ltbodygtlthtmlgt
Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser
The two Fedes and Tomas Python Applied reading group 12 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt︸ ︷︷ ︸Start tag
lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element
ltheadgt︸ ︷︷ ︸End tag
ltbodygtlta href︸︷︷︸
Attribute name
=wwwwikipediaes︸ ︷︷ ︸Attribute value
gtWikipedia link︸ ︷︷ ︸Content
ltagt
ltbodygtlthtmlgt
Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser
The two Fedes and Tomas Python Applied reading group 12 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
PDF Scraping
Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner
The two Fedes and Tomas Python Applied reading group 17 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 19 40
Introduction
Python can be used both in QGIS and ArcGIS as a language tointeract with these programs
These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python
The two Fedes and Tomas Python Applied reading group 20 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 21 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 22 40
Learn to Write GIS Python Code
1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that
generates a Python Code)4 Write your own py file
The two Fedes and Tomas Python Applied reading group 23 40
Example 1 - Automation of Presentation of Results
You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate
You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map
If you have multiple maps you can even created videos of maps
The two Fedes and Tomas Python Applied reading group 24 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Good way to learn Python
Coursera course Python for everybody specialization Universityof MichiganBook Python for informatics
The two Fedes and Tomas Python Applied reading group 4 40
Use
How to obtain it and languageDownloadLanguages currently used Python 2 and Python 3Anaconda
Two methodsNotepad+command prompt
Need to save file as pyType filepy in command prompt
Compiler Pycharm Atom Spyder
The two Fedes and Tomas Python Applied reading group 5 40
Web browsing
Python allows you to write a web browserLibrary to open resources by URL urllibExample we want to Python to go to web to obtain FIPS code ofTexas counties
Counties
The two Fedes and Tomas Python Applied reading group 6 40
Web browsing
The two Fedes and Tomas Python Applied reading group 7 40
Web browsing
The two Fedes and Tomas Python Applied reading group 7 40
Web browsing
The two Fedes and Tomas Python Applied reading group 7 40
Ways to extract information from the web
API (Application programming interface)Structured way to access data from websitesUsually limited number of queriesNot all website provide itTwitter Google maps etcSome Stata commands
ScrapingTransformation of unstructured data (HTML format) on the webinto structured data (database)Stata command copy
The two Fedes and Tomas Python Applied reading group 8 40
API
Talk to web services URLs that are designed explicitly to hand databack cleaned for your application
Need to agree on wire protocol for python and javaTwo wire formats to exchange data between applications
XMLJSON
LibrariesxmletreeElementTreejson
Google map APIPossible to retrieve address latitude and longitude of placesDocumentationLimits
2500 requests per day50 requests per second
The two Fedes and Tomas Python Applied reading group 9 40
Google map API
The two Fedes and Tomas Python Applied reading group 10 40
Google map API
The two Fedes and Tomas Python Applied reading group 10 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt
lttitlegtThis is the titlelttitlegtltheadgtltbodygt
lta href =wwwwikipediaes gt Wikipedia linkltagtltbodygt
lthtmlgt
The two Fedes and Tomas Python Applied reading group 11 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt︸ ︷︷ ︸Start tag
lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element
ltheadgt︸ ︷︷ ︸End tag
ltbodygtlta href︸︷︷︸
Attribute name
=wwwwikipediaes︸ ︷︷ ︸Attribute value
gtWikipedia link︸ ︷︷ ︸Content
ltagt
ltbodygtlthtmlgt
Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser
The two Fedes and Tomas Python Applied reading group 12 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt︸ ︷︷ ︸Start tag
lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element
ltheadgt︸ ︷︷ ︸End tag
ltbodygtlta href︸︷︷︸
Attribute name
=wwwwikipediaes︸ ︷︷ ︸Attribute value
gtWikipedia link︸ ︷︷ ︸Content
ltagt
ltbodygtlthtmlgt
Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser
The two Fedes and Tomas Python Applied reading group 12 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
PDF Scraping
Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner
The two Fedes and Tomas Python Applied reading group 17 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 19 40
Introduction
Python can be used both in QGIS and ArcGIS as a language tointeract with these programs
These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python
The two Fedes and Tomas Python Applied reading group 20 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 21 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 22 40
Learn to Write GIS Python Code
1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that
generates a Python Code)4 Write your own py file
The two Fedes and Tomas Python Applied reading group 23 40
Example 1 - Automation of Presentation of Results
You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate
You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map
If you have multiple maps you can even created videos of maps
The two Fedes and Tomas Python Applied reading group 24 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Use
How to obtain it and languageDownloadLanguages currently used Python 2 and Python 3Anaconda
Two methodsNotepad+command prompt
Need to save file as pyType filepy in command prompt
Compiler Pycharm Atom Spyder
The two Fedes and Tomas Python Applied reading group 5 40
Web browsing
Python allows you to write a web browserLibrary to open resources by URL urllibExample we want to Python to go to web to obtain FIPS code ofTexas counties
Counties
The two Fedes and Tomas Python Applied reading group 6 40
Web browsing
The two Fedes and Tomas Python Applied reading group 7 40
Web browsing
The two Fedes and Tomas Python Applied reading group 7 40
Web browsing
The two Fedes and Tomas Python Applied reading group 7 40
Ways to extract information from the web
API (Application programming interface)Structured way to access data from websitesUsually limited number of queriesNot all website provide itTwitter Google maps etcSome Stata commands
ScrapingTransformation of unstructured data (HTML format) on the webinto structured data (database)Stata command copy
The two Fedes and Tomas Python Applied reading group 8 40
API
Talk to web services URLs that are designed explicitly to hand databack cleaned for your application
Need to agree on wire protocol for python and javaTwo wire formats to exchange data between applications
XMLJSON
LibrariesxmletreeElementTreejson
Google map APIPossible to retrieve address latitude and longitude of placesDocumentationLimits
2500 requests per day50 requests per second
The two Fedes and Tomas Python Applied reading group 9 40
Google map API
The two Fedes and Tomas Python Applied reading group 10 40
Google map API
The two Fedes and Tomas Python Applied reading group 10 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt
lttitlegtThis is the titlelttitlegtltheadgtltbodygt
lta href =wwwwikipediaes gt Wikipedia linkltagtltbodygt
lthtmlgt
The two Fedes and Tomas Python Applied reading group 11 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt︸ ︷︷ ︸Start tag
lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element
ltheadgt︸ ︷︷ ︸End tag
ltbodygtlta href︸︷︷︸
Attribute name
=wwwwikipediaes︸ ︷︷ ︸Attribute value
gtWikipedia link︸ ︷︷ ︸Content
ltagt
ltbodygtlthtmlgt
Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser
The two Fedes and Tomas Python Applied reading group 12 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt︸ ︷︷ ︸Start tag
lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element
ltheadgt︸ ︷︷ ︸End tag
ltbodygtlta href︸︷︷︸
Attribute name
=wwwwikipediaes︸ ︷︷ ︸Attribute value
gtWikipedia link︸ ︷︷ ︸Content
ltagt
ltbodygtlthtmlgt
Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser
The two Fedes and Tomas Python Applied reading group 12 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
PDF Scraping
Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner
The two Fedes and Tomas Python Applied reading group 17 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 19 40
Introduction
Python can be used both in QGIS and ArcGIS as a language tointeract with these programs
These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python
The two Fedes and Tomas Python Applied reading group 20 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 21 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 22 40
Learn to Write GIS Python Code
1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that
generates a Python Code)4 Write your own py file
The two Fedes and Tomas Python Applied reading group 23 40
Example 1 - Automation of Presentation of Results
You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate
You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map
If you have multiple maps you can even created videos of maps
The two Fedes and Tomas Python Applied reading group 24 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Web browsing
Python allows you to write a web browserLibrary to open resources by URL urllibExample we want to Python to go to web to obtain FIPS code ofTexas counties
Counties
The two Fedes and Tomas Python Applied reading group 6 40
Web browsing
The two Fedes and Tomas Python Applied reading group 7 40
Web browsing
The two Fedes and Tomas Python Applied reading group 7 40
Web browsing
The two Fedes and Tomas Python Applied reading group 7 40
Ways to extract information from the web
API (Application programming interface)Structured way to access data from websitesUsually limited number of queriesNot all website provide itTwitter Google maps etcSome Stata commands
ScrapingTransformation of unstructured data (HTML format) on the webinto structured data (database)Stata command copy
The two Fedes and Tomas Python Applied reading group 8 40
API
Talk to web services URLs that are designed explicitly to hand databack cleaned for your application
Need to agree on wire protocol for python and javaTwo wire formats to exchange data between applications
XMLJSON
LibrariesxmletreeElementTreejson
Google map APIPossible to retrieve address latitude and longitude of placesDocumentationLimits
2500 requests per day50 requests per second
The two Fedes and Tomas Python Applied reading group 9 40
Google map API
The two Fedes and Tomas Python Applied reading group 10 40
Google map API
The two Fedes and Tomas Python Applied reading group 10 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt
lttitlegtThis is the titlelttitlegtltheadgtltbodygt
lta href =wwwwikipediaes gt Wikipedia linkltagtltbodygt
lthtmlgt
The two Fedes and Tomas Python Applied reading group 11 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt︸ ︷︷ ︸Start tag
lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element
ltheadgt︸ ︷︷ ︸End tag
ltbodygtlta href︸︷︷︸
Attribute name
=wwwwikipediaes︸ ︷︷ ︸Attribute value
gtWikipedia link︸ ︷︷ ︸Content
ltagt
ltbodygtlthtmlgt
Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser
The two Fedes and Tomas Python Applied reading group 12 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt︸ ︷︷ ︸Start tag
lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element
ltheadgt︸ ︷︷ ︸End tag
ltbodygtlta href︸︷︷︸
Attribute name
=wwwwikipediaes︸ ︷︷ ︸Attribute value
gtWikipedia link︸ ︷︷ ︸Content
ltagt
ltbodygtlthtmlgt
Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser
The two Fedes and Tomas Python Applied reading group 12 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
PDF Scraping
Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner
The two Fedes and Tomas Python Applied reading group 17 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 19 40
Introduction
Python can be used both in QGIS and ArcGIS as a language tointeract with these programs
These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python
The two Fedes and Tomas Python Applied reading group 20 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 21 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 22 40
Learn to Write GIS Python Code
1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that
generates a Python Code)4 Write your own py file
The two Fedes and Tomas Python Applied reading group 23 40
Example 1 - Automation of Presentation of Results
You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate
You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map
If you have multiple maps you can even created videos of maps
The two Fedes and Tomas Python Applied reading group 24 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Web browsing
The two Fedes and Tomas Python Applied reading group 7 40
Web browsing
The two Fedes and Tomas Python Applied reading group 7 40
Web browsing
The two Fedes and Tomas Python Applied reading group 7 40
Ways to extract information from the web
API (Application programming interface)Structured way to access data from websitesUsually limited number of queriesNot all website provide itTwitter Google maps etcSome Stata commands
ScrapingTransformation of unstructured data (HTML format) on the webinto structured data (database)Stata command copy
The two Fedes and Tomas Python Applied reading group 8 40
API
Talk to web services URLs that are designed explicitly to hand databack cleaned for your application
Need to agree on wire protocol for python and javaTwo wire formats to exchange data between applications
XMLJSON
LibrariesxmletreeElementTreejson
Google map APIPossible to retrieve address latitude and longitude of placesDocumentationLimits
2500 requests per day50 requests per second
The two Fedes and Tomas Python Applied reading group 9 40
Google map API
The two Fedes and Tomas Python Applied reading group 10 40
Google map API
The two Fedes and Tomas Python Applied reading group 10 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt
lttitlegtThis is the titlelttitlegtltheadgtltbodygt
lta href =wwwwikipediaes gt Wikipedia linkltagtltbodygt
lthtmlgt
The two Fedes and Tomas Python Applied reading group 11 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt︸ ︷︷ ︸Start tag
lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element
ltheadgt︸ ︷︷ ︸End tag
ltbodygtlta href︸︷︷︸
Attribute name
=wwwwikipediaes︸ ︷︷ ︸Attribute value
gtWikipedia link︸ ︷︷ ︸Content
ltagt
ltbodygtlthtmlgt
Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser
The two Fedes and Tomas Python Applied reading group 12 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt︸ ︷︷ ︸Start tag
lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element
ltheadgt︸ ︷︷ ︸End tag
ltbodygtlta href︸︷︷︸
Attribute name
=wwwwikipediaes︸ ︷︷ ︸Attribute value
gtWikipedia link︸ ︷︷ ︸Content
ltagt
ltbodygtlthtmlgt
Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser
The two Fedes and Tomas Python Applied reading group 12 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
PDF Scraping
Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner
The two Fedes and Tomas Python Applied reading group 17 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 19 40
Introduction
Python can be used both in QGIS and ArcGIS as a language tointeract with these programs
These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python
The two Fedes and Tomas Python Applied reading group 20 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 21 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 22 40
Learn to Write GIS Python Code
1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that
generates a Python Code)4 Write your own py file
The two Fedes and Tomas Python Applied reading group 23 40
Example 1 - Automation of Presentation of Results
You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate
You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map
If you have multiple maps you can even created videos of maps
The two Fedes and Tomas Python Applied reading group 24 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Web browsing
The two Fedes and Tomas Python Applied reading group 7 40
Web browsing
The two Fedes and Tomas Python Applied reading group 7 40
Ways to extract information from the web
API (Application programming interface)Structured way to access data from websitesUsually limited number of queriesNot all website provide itTwitter Google maps etcSome Stata commands
ScrapingTransformation of unstructured data (HTML format) on the webinto structured data (database)Stata command copy
The two Fedes and Tomas Python Applied reading group 8 40
API
Talk to web services URLs that are designed explicitly to hand databack cleaned for your application
Need to agree on wire protocol for python and javaTwo wire formats to exchange data between applications
XMLJSON
LibrariesxmletreeElementTreejson
Google map APIPossible to retrieve address latitude and longitude of placesDocumentationLimits
2500 requests per day50 requests per second
The two Fedes and Tomas Python Applied reading group 9 40
Google map API
The two Fedes and Tomas Python Applied reading group 10 40
Google map API
The two Fedes and Tomas Python Applied reading group 10 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt
lttitlegtThis is the titlelttitlegtltheadgtltbodygt
lta href =wwwwikipediaes gt Wikipedia linkltagtltbodygt
lthtmlgt
The two Fedes and Tomas Python Applied reading group 11 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt︸ ︷︷ ︸Start tag
lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element
ltheadgt︸ ︷︷ ︸End tag
ltbodygtlta href︸︷︷︸
Attribute name
=wwwwikipediaes︸ ︷︷ ︸Attribute value
gtWikipedia link︸ ︷︷ ︸Content
ltagt
ltbodygtlthtmlgt
Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser
The two Fedes and Tomas Python Applied reading group 12 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt︸ ︷︷ ︸Start tag
lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element
ltheadgt︸ ︷︷ ︸End tag
ltbodygtlta href︸︷︷︸
Attribute name
=wwwwikipediaes︸ ︷︷ ︸Attribute value
gtWikipedia link︸ ︷︷ ︸Content
ltagt
ltbodygtlthtmlgt
Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser
The two Fedes and Tomas Python Applied reading group 12 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
PDF Scraping
Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner
The two Fedes and Tomas Python Applied reading group 17 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 19 40
Introduction
Python can be used both in QGIS and ArcGIS as a language tointeract with these programs
These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python
The two Fedes and Tomas Python Applied reading group 20 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 21 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 22 40
Learn to Write GIS Python Code
1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that
generates a Python Code)4 Write your own py file
The two Fedes and Tomas Python Applied reading group 23 40
Example 1 - Automation of Presentation of Results
You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate
You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map
If you have multiple maps you can even created videos of maps
The two Fedes and Tomas Python Applied reading group 24 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Web browsing
The two Fedes and Tomas Python Applied reading group 7 40
Ways to extract information from the web
API (Application programming interface)Structured way to access data from websitesUsually limited number of queriesNot all website provide itTwitter Google maps etcSome Stata commands
ScrapingTransformation of unstructured data (HTML format) on the webinto structured data (database)Stata command copy
The two Fedes and Tomas Python Applied reading group 8 40
API
Talk to web services URLs that are designed explicitly to hand databack cleaned for your application
Need to agree on wire protocol for python and javaTwo wire formats to exchange data between applications
XMLJSON
LibrariesxmletreeElementTreejson
Google map APIPossible to retrieve address latitude and longitude of placesDocumentationLimits
2500 requests per day50 requests per second
The two Fedes and Tomas Python Applied reading group 9 40
Google map API
The two Fedes and Tomas Python Applied reading group 10 40
Google map API
The two Fedes and Tomas Python Applied reading group 10 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt
lttitlegtThis is the titlelttitlegtltheadgtltbodygt
lta href =wwwwikipediaes gt Wikipedia linkltagtltbodygt
lthtmlgt
The two Fedes and Tomas Python Applied reading group 11 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt︸ ︷︷ ︸Start tag
lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element
ltheadgt︸ ︷︷ ︸End tag
ltbodygtlta href︸︷︷︸
Attribute name
=wwwwikipediaes︸ ︷︷ ︸Attribute value
gtWikipedia link︸ ︷︷ ︸Content
ltagt
ltbodygtlthtmlgt
Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser
The two Fedes and Tomas Python Applied reading group 12 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt︸ ︷︷ ︸Start tag
lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element
ltheadgt︸ ︷︷ ︸End tag
ltbodygtlta href︸︷︷︸
Attribute name
=wwwwikipediaes︸ ︷︷ ︸Attribute value
gtWikipedia link︸ ︷︷ ︸Content
ltagt
ltbodygtlthtmlgt
Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser
The two Fedes and Tomas Python Applied reading group 12 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
PDF Scraping
Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner
The two Fedes and Tomas Python Applied reading group 17 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 19 40
Introduction
Python can be used both in QGIS and ArcGIS as a language tointeract with these programs
These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python
The two Fedes and Tomas Python Applied reading group 20 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 21 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 22 40
Learn to Write GIS Python Code
1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that
generates a Python Code)4 Write your own py file
The two Fedes and Tomas Python Applied reading group 23 40
Example 1 - Automation of Presentation of Results
You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate
You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map
If you have multiple maps you can even created videos of maps
The two Fedes and Tomas Python Applied reading group 24 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Ways to extract information from the web
API (Application programming interface)Structured way to access data from websitesUsually limited number of queriesNot all website provide itTwitter Google maps etcSome Stata commands
ScrapingTransformation of unstructured data (HTML format) on the webinto structured data (database)Stata command copy
The two Fedes and Tomas Python Applied reading group 8 40
API
Talk to web services URLs that are designed explicitly to hand databack cleaned for your application
Need to agree on wire protocol for python and javaTwo wire formats to exchange data between applications
XMLJSON
LibrariesxmletreeElementTreejson
Google map APIPossible to retrieve address latitude and longitude of placesDocumentationLimits
2500 requests per day50 requests per second
The two Fedes and Tomas Python Applied reading group 9 40
Google map API
The two Fedes and Tomas Python Applied reading group 10 40
Google map API
The two Fedes and Tomas Python Applied reading group 10 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt
lttitlegtThis is the titlelttitlegtltheadgtltbodygt
lta href =wwwwikipediaes gt Wikipedia linkltagtltbodygt
lthtmlgt
The two Fedes and Tomas Python Applied reading group 11 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt︸ ︷︷ ︸Start tag
lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element
ltheadgt︸ ︷︷ ︸End tag
ltbodygtlta href︸︷︷︸
Attribute name
=wwwwikipediaes︸ ︷︷ ︸Attribute value
gtWikipedia link︸ ︷︷ ︸Content
ltagt
ltbodygtlthtmlgt
Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser
The two Fedes and Tomas Python Applied reading group 12 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt︸ ︷︷ ︸Start tag
lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element
ltheadgt︸ ︷︷ ︸End tag
ltbodygtlta href︸︷︷︸
Attribute name
=wwwwikipediaes︸ ︷︷ ︸Attribute value
gtWikipedia link︸ ︷︷ ︸Content
ltagt
ltbodygtlthtmlgt
Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser
The two Fedes and Tomas Python Applied reading group 12 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
PDF Scraping
Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner
The two Fedes and Tomas Python Applied reading group 17 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 19 40
Introduction
Python can be used both in QGIS and ArcGIS as a language tointeract with these programs
These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python
The two Fedes and Tomas Python Applied reading group 20 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 21 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 22 40
Learn to Write GIS Python Code
1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that
generates a Python Code)4 Write your own py file
The two Fedes and Tomas Python Applied reading group 23 40
Example 1 - Automation of Presentation of Results
You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate
You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map
If you have multiple maps you can even created videos of maps
The two Fedes and Tomas Python Applied reading group 24 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
API
Talk to web services URLs that are designed explicitly to hand databack cleaned for your application
Need to agree on wire protocol for python and javaTwo wire formats to exchange data between applications
XMLJSON
LibrariesxmletreeElementTreejson
Google map APIPossible to retrieve address latitude and longitude of placesDocumentationLimits
2500 requests per day50 requests per second
The two Fedes and Tomas Python Applied reading group 9 40
Google map API
The two Fedes and Tomas Python Applied reading group 10 40
Google map API
The two Fedes and Tomas Python Applied reading group 10 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt
lttitlegtThis is the titlelttitlegtltheadgtltbodygt
lta href =wwwwikipediaes gt Wikipedia linkltagtltbodygt
lthtmlgt
The two Fedes and Tomas Python Applied reading group 11 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt︸ ︷︷ ︸Start tag
lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element
ltheadgt︸ ︷︷ ︸End tag
ltbodygtlta href︸︷︷︸
Attribute name
=wwwwikipediaes︸ ︷︷ ︸Attribute value
gtWikipedia link︸ ︷︷ ︸Content
ltagt
ltbodygtlthtmlgt
Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser
The two Fedes and Tomas Python Applied reading group 12 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt︸ ︷︷ ︸Start tag
lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element
ltheadgt︸ ︷︷ ︸End tag
ltbodygtlta href︸︷︷︸
Attribute name
=wwwwikipediaes︸ ︷︷ ︸Attribute value
gtWikipedia link︸ ︷︷ ︸Content
ltagt
ltbodygtlthtmlgt
Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser
The two Fedes and Tomas Python Applied reading group 12 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
PDF Scraping
Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner
The two Fedes and Tomas Python Applied reading group 17 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 19 40
Introduction
Python can be used both in QGIS and ArcGIS as a language tointeract with these programs
These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python
The two Fedes and Tomas Python Applied reading group 20 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 21 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 22 40
Learn to Write GIS Python Code
1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that
generates a Python Code)4 Write your own py file
The two Fedes and Tomas Python Applied reading group 23 40
Example 1 - Automation of Presentation of Results
You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate
You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map
If you have multiple maps you can even created videos of maps
The two Fedes and Tomas Python Applied reading group 24 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Google map API
The two Fedes and Tomas Python Applied reading group 10 40
Google map API
The two Fedes and Tomas Python Applied reading group 10 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt
lttitlegtThis is the titlelttitlegtltheadgtltbodygt
lta href =wwwwikipediaes gt Wikipedia linkltagtltbodygt
lthtmlgt
The two Fedes and Tomas Python Applied reading group 11 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt︸ ︷︷ ︸Start tag
lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element
ltheadgt︸ ︷︷ ︸End tag
ltbodygtlta href︸︷︷︸
Attribute name
=wwwwikipediaes︸ ︷︷ ︸Attribute value
gtWikipedia link︸ ︷︷ ︸Content
ltagt
ltbodygtlthtmlgt
Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser
The two Fedes and Tomas Python Applied reading group 12 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt︸ ︷︷ ︸Start tag
lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element
ltheadgt︸ ︷︷ ︸End tag
ltbodygtlta href︸︷︷︸
Attribute name
=wwwwikipediaes︸ ︷︷ ︸Attribute value
gtWikipedia link︸ ︷︷ ︸Content
ltagt
ltbodygtlthtmlgt
Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser
The two Fedes and Tomas Python Applied reading group 12 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
PDF Scraping
Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner
The two Fedes and Tomas Python Applied reading group 17 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 19 40
Introduction
Python can be used both in QGIS and ArcGIS as a language tointeract with these programs
These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python
The two Fedes and Tomas Python Applied reading group 20 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 21 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 22 40
Learn to Write GIS Python Code
1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that
generates a Python Code)4 Write your own py file
The two Fedes and Tomas Python Applied reading group 23 40
Example 1 - Automation of Presentation of Results
You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate
You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map
If you have multiple maps you can even created videos of maps
The two Fedes and Tomas Python Applied reading group 24 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Google map API
The two Fedes and Tomas Python Applied reading group 10 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt
lttitlegtThis is the titlelttitlegtltheadgtltbodygt
lta href =wwwwikipediaes gt Wikipedia linkltagtltbodygt
lthtmlgt
The two Fedes and Tomas Python Applied reading group 11 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt︸ ︷︷ ︸Start tag
lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element
ltheadgt︸ ︷︷ ︸End tag
ltbodygtlta href︸︷︷︸
Attribute name
=wwwwikipediaes︸ ︷︷ ︸Attribute value
gtWikipedia link︸ ︷︷ ︸Content
ltagt
ltbodygtlthtmlgt
Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser
The two Fedes and Tomas Python Applied reading group 12 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt︸ ︷︷ ︸Start tag
lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element
ltheadgt︸ ︷︷ ︸End tag
ltbodygtlta href︸︷︷︸
Attribute name
=wwwwikipediaes︸ ︷︷ ︸Attribute value
gtWikipedia link︸ ︷︷ ︸Content
ltagt
ltbodygtlthtmlgt
Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser
The two Fedes and Tomas Python Applied reading group 12 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
PDF Scraping
Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner
The two Fedes and Tomas Python Applied reading group 17 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 19 40
Introduction
Python can be used both in QGIS and ArcGIS as a language tointeract with these programs
These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python
The two Fedes and Tomas Python Applied reading group 20 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 21 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 22 40
Learn to Write GIS Python Code
1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that
generates a Python Code)4 Write your own py file
The two Fedes and Tomas Python Applied reading group 23 40
Example 1 - Automation of Presentation of Results
You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate
You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map
If you have multiple maps you can even created videos of maps
The two Fedes and Tomas Python Applied reading group 24 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt
lttitlegtThis is the titlelttitlegtltheadgtltbodygt
lta href =wwwwikipediaes gt Wikipedia linkltagtltbodygt
lthtmlgt
The two Fedes and Tomas Python Applied reading group 11 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt︸ ︷︷ ︸Start tag
lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element
ltheadgt︸ ︷︷ ︸End tag
ltbodygtlta href︸︷︷︸
Attribute name
=wwwwikipediaes︸ ︷︷ ︸Attribute value
gtWikipedia link︸ ︷︷ ︸Content
ltagt
ltbodygtlthtmlgt
Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser
The two Fedes and Tomas Python Applied reading group 12 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt︸ ︷︷ ︸Start tag
lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element
ltheadgt︸ ︷︷ ︸End tag
ltbodygtlta href︸︷︷︸
Attribute name
=wwwwikipediaes︸ ︷︷ ︸Attribute value
gtWikipedia link︸ ︷︷ ︸Content
ltagt
ltbodygtlthtmlgt
Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser
The two Fedes and Tomas Python Applied reading group 12 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
PDF Scraping
Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner
The two Fedes and Tomas Python Applied reading group 17 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 19 40
Introduction
Python can be used both in QGIS and ArcGIS as a language tointeract with these programs
These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python
The two Fedes and Tomas Python Applied reading group 20 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 21 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 22 40
Learn to Write GIS Python Code
1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that
generates a Python Code)4 Write your own py file
The two Fedes and Tomas Python Applied reading group 23 40
Example 1 - Automation of Presentation of Results
You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate
You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map
If you have multiple maps you can even created videos of maps
The two Fedes and Tomas Python Applied reading group 24 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt︸ ︷︷ ︸Start tag
lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element
ltheadgt︸ ︷︷ ︸End tag
ltbodygtlta href︸︷︷︸
Attribute name
=wwwwikipediaes︸ ︷︷ ︸Attribute value
gtWikipedia link︸ ︷︷ ︸Content
ltagt
ltbodygtlthtmlgt
Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser
The two Fedes and Tomas Python Applied reading group 12 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt︸ ︷︷ ︸Start tag
lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element
ltheadgt︸ ︷︷ ︸End tag
ltbodygtlta href︸︷︷︸
Attribute name
=wwwwikipediaes︸ ︷︷ ︸Attribute value
gtWikipedia link︸ ︷︷ ︸Content
ltagt
ltbodygtlthtmlgt
Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser
The two Fedes and Tomas Python Applied reading group 12 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
PDF Scraping
Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner
The two Fedes and Tomas Python Applied reading group 17 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 19 40
Introduction
Python can be used both in QGIS and ArcGIS as a language tointeract with these programs
These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python
The two Fedes and Tomas Python Applied reading group 20 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 21 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 22 40
Learn to Write GIS Python Code
1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that
generates a Python Code)4 Write your own py file
The two Fedes and Tomas Python Applied reading group 23 40
Example 1 - Automation of Presentation of Results
You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate
You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map
If you have multiple maps you can even created videos of maps
The two Fedes and Tomas Python Applied reading group 24 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Web scraping
HTMLStandard markup language for creating web pages and webapplications Messy
lthtmlgtltheadgt︸ ︷︷ ︸Start tag
lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element
ltheadgt︸ ︷︷ ︸End tag
ltbodygtlta href︸︷︷︸
Attribute name
=wwwwikipediaes︸ ︷︷ ︸Attribute value
gtWikipedia link︸ ︷︷ ︸Content
ltagt
ltbodygtlthtmlgt
Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser
The two Fedes and Tomas Python Applied reading group 12 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
PDF Scraping
Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner
The two Fedes and Tomas Python Applied reading group 17 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 19 40
Introduction
Python can be used both in QGIS and ArcGIS as a language tointeract with these programs
These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python
The two Fedes and Tomas Python Applied reading group 20 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 21 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 22 40
Learn to Write GIS Python Code
1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that
generates a Python Code)4 Write your own py file
The two Fedes and Tomas Python Applied reading group 23 40
Example 1 - Automation of Presentation of Results
You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate
You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map
If you have multiple maps you can even created videos of maps
The two Fedes and Tomas Python Applied reading group 24 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
PDF Scraping
Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner
The two Fedes and Tomas Python Applied reading group 17 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 19 40
Introduction
Python can be used both in QGIS and ArcGIS as a language tointeract with these programs
These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python
The two Fedes and Tomas Python Applied reading group 20 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 21 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 22 40
Learn to Write GIS Python Code
1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that
generates a Python Code)4 Write your own py file
The two Fedes and Tomas Python Applied reading group 23 40
Example 1 - Automation of Presentation of Results
You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate
You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map
If you have multiple maps you can even created videos of maps
The two Fedes and Tomas Python Applied reading group 24 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
PDF Scraping
Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner
The two Fedes and Tomas Python Applied reading group 17 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 19 40
Introduction
Python can be used both in QGIS and ArcGIS as a language tointeract with these programs
These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python
The two Fedes and Tomas Python Applied reading group 20 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 21 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 22 40
Learn to Write GIS Python Code
1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that
generates a Python Code)4 Write your own py file
The two Fedes and Tomas Python Applied reading group 23 40
Example 1 - Automation of Presentation of Results
You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate
You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map
If you have multiple maps you can even created videos of maps
The two Fedes and Tomas Python Applied reading group 24 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
PDF Scraping
Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner
The two Fedes and Tomas Python Applied reading group 17 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 19 40
Introduction
Python can be used both in QGIS and ArcGIS as a language tointeract with these programs
These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python
The two Fedes and Tomas Python Applied reading group 20 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 21 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 22 40
Learn to Write GIS Python Code
1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that
generates a Python Code)4 Write your own py file
The two Fedes and Tomas Python Applied reading group 23 40
Example 1 - Automation of Presentation of Results
You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate
You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map
If you have multiple maps you can even created videos of maps
The two Fedes and Tomas Python Applied reading group 24 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 13 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
PDF Scraping
Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner
The two Fedes and Tomas Python Applied reading group 17 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 19 40
Introduction
Python can be used both in QGIS and ArcGIS as a language tointeract with these programs
These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python
The two Fedes and Tomas Python Applied reading group 20 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 21 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 22 40
Learn to Write GIS Python Code
1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that
generates a Python Code)4 Write your own py file
The two Fedes and Tomas Python Applied reading group 23 40
Example 1 - Automation of Presentation of Results
You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate
You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map
If you have multiple maps you can even created videos of maps
The two Fedes and Tomas Python Applied reading group 24 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
PDF Scraping
Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner
The two Fedes and Tomas Python Applied reading group 17 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 19 40
Introduction
Python can be used both in QGIS and ArcGIS as a language tointeract with these programs
These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python
The two Fedes and Tomas Python Applied reading group 20 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 21 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 22 40
Learn to Write GIS Python Code
1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that
generates a Python Code)4 Write your own py file
The two Fedes and Tomas Python Applied reading group 23 40
Example 1 - Automation of Presentation of Results
You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate
You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map
If you have multiple maps you can even created videos of maps
The two Fedes and Tomas Python Applied reading group 24 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 14 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
PDF Scraping
Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner
The two Fedes and Tomas Python Applied reading group 17 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 19 40
Introduction
Python can be used both in QGIS and ArcGIS as a language tointeract with these programs
These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python
The two Fedes and Tomas Python Applied reading group 20 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 21 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 22 40
Learn to Write GIS Python Code
1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that
generates a Python Code)4 Write your own py file
The two Fedes and Tomas Python Applied reading group 23 40
Example 1 - Automation of Presentation of Results
You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate
You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map
If you have multiple maps you can even created videos of maps
The two Fedes and Tomas Python Applied reading group 24 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
PDF Scraping
Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner
The two Fedes and Tomas Python Applied reading group 17 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 19 40
Introduction
Python can be used both in QGIS and ArcGIS as a language tointeract with these programs
These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python
The two Fedes and Tomas Python Applied reading group 20 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 21 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 22 40
Learn to Write GIS Python Code
1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that
generates a Python Code)4 Write your own py file
The two Fedes and Tomas Python Applied reading group 23 40
Example 1 - Automation of Presentation of Results
You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate
You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map
If you have multiple maps you can even created videos of maps
The two Fedes and Tomas Python Applied reading group 24 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
PDF Scraping
Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner
The two Fedes and Tomas Python Applied reading group 17 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 19 40
Introduction
Python can be used both in QGIS and ArcGIS as a language tointeract with these programs
These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python
The two Fedes and Tomas Python Applied reading group 20 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 21 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 22 40
Learn to Write GIS Python Code
1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that
generates a Python Code)4 Write your own py file
The two Fedes and Tomas Python Applied reading group 23 40
Example 1 - Automation of Presentation of Results
You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate
You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map
If you have multiple maps you can even created videos of maps
The two Fedes and Tomas Python Applied reading group 24 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 15 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
PDF Scraping
Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner
The two Fedes and Tomas Python Applied reading group 17 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 19 40
Introduction
Python can be used both in QGIS and ArcGIS as a language tointeract with these programs
These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python
The two Fedes and Tomas Python Applied reading group 20 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 21 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 22 40
Learn to Write GIS Python Code
1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that
generates a Python Code)4 Write your own py file
The two Fedes and Tomas Python Applied reading group 23 40
Example 1 - Automation of Presentation of Results
You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate
You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map
If you have multiple maps you can even created videos of maps
The two Fedes and Tomas Python Applied reading group 24 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
PDF Scraping
Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner
The two Fedes and Tomas Python Applied reading group 17 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 19 40
Introduction
Python can be used both in QGIS and ArcGIS as a language tointeract with these programs
These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python
The two Fedes and Tomas Python Applied reading group 20 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 21 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 22 40
Learn to Write GIS Python Code
1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that
generates a Python Code)4 Write your own py file
The two Fedes and Tomas Python Applied reading group 23 40
Example 1 - Automation of Presentation of Results
You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate
You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map
If you have multiple maps you can even created videos of maps
The two Fedes and Tomas Python Applied reading group 24 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Looking for house prices
The two Fedes and Tomas Python Applied reading group 16 40
PDF Scraping
Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner
The two Fedes and Tomas Python Applied reading group 17 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 19 40
Introduction
Python can be used both in QGIS and ArcGIS as a language tointeract with these programs
These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python
The two Fedes and Tomas Python Applied reading group 20 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 21 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 22 40
Learn to Write GIS Python Code
1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that
generates a Python Code)4 Write your own py file
The two Fedes and Tomas Python Applied reading group 23 40
Example 1 - Automation of Presentation of Results
You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate
You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map
If you have multiple maps you can even created videos of maps
The two Fedes and Tomas Python Applied reading group 24 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
PDF Scraping
Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner
The two Fedes and Tomas Python Applied reading group 17 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 19 40
Introduction
Python can be used both in QGIS and ArcGIS as a language tointeract with these programs
These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python
The two Fedes and Tomas Python Applied reading group 20 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 21 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 22 40
Learn to Write GIS Python Code
1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that
generates a Python Code)4 Write your own py file
The two Fedes and Tomas Python Applied reading group 23 40
Example 1 - Automation of Presentation of Results
You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate
You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map
If you have multiple maps you can even created videos of maps
The two Fedes and Tomas Python Applied reading group 24 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 19 40
Introduction
Python can be used both in QGIS and ArcGIS as a language tointeract with these programs
These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python
The two Fedes and Tomas Python Applied reading group 20 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 21 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 22 40
Learn to Write GIS Python Code
1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that
generates a Python Code)4 Write your own py file
The two Fedes and Tomas Python Applied reading group 23 40
Example 1 - Automation of Presentation of Results
You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate
You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map
If you have multiple maps you can even created videos of maps
The two Fedes and Tomas Python Applied reading group 24 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Scraping the little prince
The two Fedes and Tomas Python Applied reading group 18 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 19 40
Introduction
Python can be used both in QGIS and ArcGIS as a language tointeract with these programs
These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python
The two Fedes and Tomas Python Applied reading group 20 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 21 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 22 40
Learn to Write GIS Python Code
1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that
generates a Python Code)4 Write your own py file
The two Fedes and Tomas Python Applied reading group 23 40
Example 1 - Automation of Presentation of Results
You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate
You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map
If you have multiple maps you can even created videos of maps
The two Fedes and Tomas Python Applied reading group 24 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 19 40
Introduction
Python can be used both in QGIS and ArcGIS as a language tointeract with these programs
These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python
The two Fedes and Tomas Python Applied reading group 20 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 21 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 22 40
Learn to Write GIS Python Code
1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that
generates a Python Code)4 Write your own py file
The two Fedes and Tomas Python Applied reading group 23 40
Example 1 - Automation of Presentation of Results
You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate
You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map
If you have multiple maps you can even created videos of maps
The two Fedes and Tomas Python Applied reading group 24 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Introduction
Python can be used both in QGIS and ArcGIS as a language tointeract with these programs
These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python
The two Fedes and Tomas Python Applied reading group 20 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 21 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 22 40
Learn to Write GIS Python Code
1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that
generates a Python Code)4 Write your own py file
The two Fedes and Tomas Python Applied reading group 23 40
Example 1 - Automation of Presentation of Results
You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate
You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map
If you have multiple maps you can even created videos of maps
The two Fedes and Tomas Python Applied reading group 24 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 21 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 22 40
Learn to Write GIS Python Code
1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that
generates a Python Code)4 Write your own py file
The two Fedes and Tomas Python Applied reading group 23 40
Example 1 - Automation of Presentation of Results
You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate
You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map
If you have multiple maps you can even created videos of maps
The two Fedes and Tomas Python Applied reading group 24 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Where to Find Python (QGIS)
Plugins rArr Python Console
The two Fedes and Tomas Python Applied reading group 22 40
Learn to Write GIS Python Code
1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that
generates a Python Code)4 Write your own py file
The two Fedes and Tomas Python Applied reading group 23 40
Example 1 - Automation of Presentation of Results
You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate
You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map
If you have multiple maps you can even created videos of maps
The two Fedes and Tomas Python Applied reading group 24 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Learn to Write GIS Python Code
1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that
generates a Python Code)4 Write your own py file
The two Fedes and Tomas Python Applied reading group 23 40
Example 1 - Automation of Presentation of Results
You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate
You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map
If you have multiple maps you can even created videos of maps
The two Fedes and Tomas Python Applied reading group 24 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Example 1 - Automation of Presentation of Results
You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate
You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map
If you have multiple maps you can even created videos of maps
The two Fedes and Tomas Python Applied reading group 24 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Example 2 - Automation Getting Inputs from STATA
Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)
Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything
1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish
The two Fedes and Tomas Python Applied reading group 25 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Example 3 - Summary Statistics by Distance
Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)
QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python
1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -
intersects()4 Sum up police officers if it intersects and save in a new variable
The two Fedes and Tomas Python Applied reading group 26 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Example 3 - Summary Statistics by Distance
Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or
variables3 You can then save the data and use this newly created variable
back in STATA - writeAsVectorFormat()
The two Fedes and Tomas Python Applied reading group 27 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Example 4 - Neighbors of Polygons
All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )
Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that
Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())
Again this new variable created in QGIS can be then exported inSTATA
The two Fedes and Tomas Python Applied reading group 28 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
A Taste of More Complex Things you Can Do inPython
Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe
Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity
You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()
Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA
The two Fedes and Tomas Python Applied reading group 29 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Sketch presentation
How Python can be useful to applied practitionersFede C
What Python is and scrapingFede M
GIS (geography and maps) applicationTomaacutes
Machine learning tools
The two Fedes and Tomas Python Applied reading group 30 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Machine Learning with Scikit-Learn
Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn
Machine Learningis more about fitprediction
cares a lot about scalability of the algorithm (big data manycovariates)
is neither about causality or inference (Susan Athey disagrees)
nor the formal asymptotic properties
Focus on Out of sample performance cross-validationregularization
Why Python and Scikit-learn
The two Fedes and Tomas Python Applied reading group 31 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
How to learn
Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman
Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes
Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python
The two Fedes and Tomas Python Applied reading group 32 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Ex1 Classification + Regularization
Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets
Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review
Method Regularized logistic regression
Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso
LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α
The two Fedes and Tomas Python Applied reading group 33 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Ex1 Classification + Regularization
166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero
Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202
The two Fedes and Tomas Python Applied reading group 34 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Ex2 Regression Trees
Built in scikit learn Random Forests Bagging Boosting(Adaboost)
Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)
Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them
Scikit implementation RandomForestClassifier
forest =RandomForestClassifier(max_depth=4)fit(XY)
forestpredict_proba(X[1])
print forestfeature_importances_
The two Fedes and Tomas Python Applied reading group 35 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Ex2 Regression Trees
The two Fedes and Tomas Python Applied reading group 36 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Ex3 Clustering
Built in scikit learn K-means Gaussian Mixtures
Example Clustering Samuel L Jackson movies by quality andrevenue box office
Method K-Means
Scikit implementation KMeans
kmeans = KMeans(n_clusters=3random_state=10)fit(data)
centroids = kmeanscluster_centers_
labels = kmeanslabels_
The two Fedes and Tomas Python Applied reading group 37 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Ex3 Clustering
The two Fedes and Tomas Python Applied reading group 38 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Code Appendix
Classifier + Regularization
Word Matrix
vectorizer = CountVectorizer(token_pattern=rbw+b)
words_matrix = vectorizerfit_transform(products[review_clean])
name = vectorizerget_feature_names()
Model
sentiment_model = linear_modelLogisticRegressionCV(penalty=l1
solver=liblinear)
sentiment_modelfit(words_matrix dependent_var)
coefficients = sentiment_modelcoef_
The two Fedes and Tomas Python Applied reading group 39 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40
Appendix 1
Lasso
minβ
Nsumi
(Yi minusXiβ)2 +α||β||1 (1)
The two Fedes and Tomas Python Applied reading group 40 40