Data Science: Predict Success of Legislation with Topics Only

Natural Language Processing with Sunlight Foundation Open States API

Pauline Chow, Fall 2016

Data Science: Predict Bill Passage with Topics Only


What policies and laws relate to your well being?

When I first asked this question, I was working in transportation policy and especially interested in walking and bicycling

Why Is Analyzing Elements of Successful Legislation Important?

• Increase transparency of the various levels of decision making: federal, state, and local

• Effectively understand trends in public policy, in order to educate and influence

• Distill the legislative process into a logical system of features

• Extract and identify relationships of decision makers with communities, topics, and laws

Data Science Steps for Predicting Legislative Results

1. Collect Data from Sunlight Foundation API and other open data sources

2. Clean text from legislative bills obtained via web scraping, including removing HTML and stop words and separating out the target variable (i.e., bill passage)

3. Extract features from text in Python

4. Build topics from text using Latent Dirichlet Allocation (LDA), a probabilistic approach

5. Implement supervised learning models

6. Analyze results

1. Collect: As the initial step in building predictive models, insights reflect features from the text of California bills from 2009 to 2014

2. Clean: Good Ole Scraping
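A minimal sketch of the cleaning step, assuming the scraped bill arrives as an HTML string. The `TextExtractor` class, the `clean_bill_html` helper, the tiny stop-word list, and the sample markup are all hypothetical illustrations, not the code used for this project.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stop-word list (an assumption); a real project would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "by", "shall"}


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_bill_html(html):
    """Strip tags, lowercase, keep alphabetic tokens, and drop stop words."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in STOP_WORDS]


# Hypothetical scraped page, loosely modeled on a bill title.
sample = (
    "<html><body><h1>AB 291</h1>"
    "<p>Underground storage tanks: the petroleum charges.</p></body></html>"
)
print(clean_bill_html(sample))
```

Keeping only alphabetic tokens drops bill numbers and section numbers, which is one possible choice; whether numbers carry signal for passage is a separate question.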

3. How to Extract Features from Text?

Word:   Sick   of   Having   to   go   2   different   hut   buy   pizza   sunglass
Count:  1      1    1        2    1    1   1           1     1     1       1
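The word counts above are a bag-of-words representation. A minimal sketch of how such a count vector could be produced follows; the `bag_of_words` helper is hypothetical, and the token list is an assumption reconstructed so that "to" appears twice, as in the counts on the slide.

```python
from collections import Counter


def bag_of_words(tokens):
    """Return the vocabulary (in first-appearance order) and the count of each word."""
    counts = Counter(tokens)
    vocab = list(dict.fromkeys(tokens))  # de-duplicate while keeping order
    return vocab, [counts[word] for word in vocab]


# Assumed cleaned token sequence; the exact position of the second "to" is a guess.
tokens = "sick of having to go to 2 different hut buy pizza sunglass".split()
vocab, vector = bag_of_words(tokens)

for word, count in zip(vocab, vector):
    print(word, count)
```

Each bill becomes one such vector over a shared vocabulary, which is the input a topic model works from.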

4. Build: What is the Latent Dirichlet Allocation (LDA) Topic Model?

• Finds hidden semantic structure, aka context, where topics are clusters of similar words: P(word | context)

• Each document is a mixture of topics, and each topic is a mixture of words and phrases, each weighted by a probability

• Tune parameters: # of words in each topic, mixture within each topic, threshold for frequency and probability

• For example: Topic A (documents 1, 2, 5): breakfast 30%, pizza 10%, smoothie 5%

1 Sick of having to go to two huts for pizza and sunglasses

2 I ate a cold pizza and spinach smoothie for breakfast.

3 I wear my sunglasses at night so I can see

4 Sometimes I get really sick when I go on roller coasters

5 Coffee only for breakfast because coffee is for closers
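To make the idea of "documents as mixtures of topics" concrete, here is a toy collapsed Gibbs sampler for LDA in plain Python, run on hand-cleaned versions of the five sentences above. This is a sketch under assumptions: the `lda_gibbs` function, its hyperparameters (`alpha`, `beta`, `iterations`), and the token lists are all illustrative, and a library such as Gensim would normally be used instead.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; returns (theta, topic_word, vocab)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V, K = len(vocab), num_topics

    doc_topic = [[0] * K for _ in docs]       # tokens in document d assigned to topic k
    topic_word = [[0] * V for _ in range(K)]  # times word w is assigned to topic k
    topic_total = [0] * K                     # tokens assigned to topic k overall
    assignments = []

    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(K)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid, k = word_id[w], assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Weight of topic t is proportional to
                # (n_dt + alpha) * (n_tw + beta) / (n_t + V * beta).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + V * beta)
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    # Smoothed per-document topic mixtures; each row sums to 1.
    theta = [[(n + alpha) / (sum(row) + K * alpha) for n in row] for row in doc_topic]
    return theta, topic_word, vocab


# Hand-cleaned tokens for the five example sentences (an assumption).
docs = [
    ["sick", "hut", "pizza", "sunglass"],
    ["ate", "cold", "pizza", "spinach", "smoothie", "breakfast"],
    ["wear", "sunglass", "night", "see"],
    ["sometimes", "sick", "roller", "coaster"],
    ["coffee", "breakfast", "coffee", "closer"],
]
theta, topic_word, vocab = lda_gibbs(docs, num_topics=2)
for number, mixture in enumerate(theta, start=1):
    print(number, [round(p, 2) for p in mixture])
```

With only five short documents the topics are unstable and change with the seed; the point is the shape of the output, one probability per topic for each document.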

4. Build Word2Vec: Find Similarities

• Extract relationships in unstructured text

• Leverage context of documents and LDA’s probabilistic models

• Hierarchical structure of probabilities

• Derive meaning from cleaned vector of words and phrases
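Once words are represented as vectors, "finding similarities" usually comes down to comparing those vectors. The sketch below shows cosine similarity, the comparison typically applied to Word2Vec embeddings. It does not train Word2Vec: the three-dimensional vectors are made-up values for illustration, and `cosine_similarity` and `most_similar` are hypothetical helper names.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors):
    """Return the other word whose vector is closest to the given word's vector."""
    scores = {
        other: cosine_similarity(vectors[word], vec)
        for other, vec in vectors.items()
        if other != word
    }
    return max(scores, key=scores.get)


# Made-up vectors, not trained embeddings.
vectors = {
    "bicycle": [0.9, 0.1, 0.0],
    "bus": [0.8, 0.3, 0.1],
    "cemetery": [0.0, 0.2, 0.9],
}
print(most_similar("bicycle", vectors))
```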

5. Implement Logistic Regression: CA Bills Over Time

1. Model predicts failed legislation better than successful legislation

2. Predictive results did not differ significantly between models with 50 and 100 topics

3. Precision = TP / (TP + FP)

4. Recall = TP / (TP + FN)
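The two formulas translate directly into code. The counts below are hypothetical and chosen only to show a class that is rarely predicted: precision can look moderate while recall stays very low.

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of true positives that the model actually finds."""
    return tp / (tp + fn)


# Hypothetical counts for a "passed" class, for illustration only.
tp, fp, fn = 50, 50, 950
print(precision(tp, fp))
print(recall(tp, fn))
```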

Model Classification Report: All Topics Over Time

target        precision   recall   f1-score   support
failed        0.70        0.98     0.82       3118
passed        0.47        0.04     0.07       1360
avg / total   0.63        0.69     0.59       4478
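The f1-score and "avg / total" columns can be checked from the other numbers in the report. This sketch assumes that f1 is the harmonic mean of precision and recall and that the "avg / total" row is a support-weighted average; the rounded values in the report are consistent with both assumptions. The `report` dictionary simply restates the table.

```python
# Values copied from the classification report above.
report = {
    "failed": {"precision": 0.70, "recall": 0.98, "support": 3118},
    "passed": {"precision": 0.47, "recall": 0.04, "support": 1360},
}


def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


def weighted_average(metric):
    """Average a metric across classes, weighted by each class's support."""
    total = sum(row["support"] for row in report.values())
    return sum(row[metric] * row["support"] for row in report.values()) / total


for label, row in report.items():
    print(label, round(f1(row["precision"], row["recall"]), 2))
print("avg precision", round(weighted_average("precision"), 2))
print("avg recall", round(weighted_average("recall"), 2))
```

Because failed bills make up roughly 70% of the support, the weighted averages mostly reflect the "failed" class and hide how weak recall is for "passed".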

6. Analyze California Bills (100 Topic Models)

• Bills contain an average of 6.57 topics, ranging from 2 to 16

• Passage rate by topic ranged from 18% to 36%, averaging 28% for all bills in the database

• Most frequent topics of legislation relate to local government funding/taxes/leadership initiatives, health care, education, budget and taxes, and court system

• Highest and lowest passage rate topics are reviewed in the next few slides
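A short sketch of how summary numbers like these could be computed once each bill has a list of assigned topics. The four bills below are hypothetical stand-ins, not rows from the project database, and `topics_per_bill` and `passage_rate_by_topic` are illustrative helper names.

```python
from collections import defaultdict

# Hypothetical bills: (passed?, topic ids assigned to the bill).
bills = [
    (True, [11, 45, 48, 70]),
    (False, [11, 48, 79]),
    (True, [1, 11, 47, 48]),
    (False, [48, 76]),
]


def topics_per_bill(bills):
    """Average number of topics assigned to a bill."""
    return sum(len(topics) for _, topics in bills) / len(bills)


def passage_rate_by_topic(bills):
    """For each topic, the share of bills containing it that passed."""
    seen = defaultdict(int)
    passed = defaultdict(int)
    for was_passed, topics in bills:
        for topic in set(topics):  # count each topic once per bill
            seen[topic] += 1
            passed[topic] += int(was_passed)
    return {topic: passed[topic] / seen[topic] for topic in seen}


print(topics_per_bill(bills))
rates = passage_rate_by_topic(bills)
for topic in sorted(rates):
    print(topic, round(rates[topic], 2))
```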

6. Analyze: Distribution of Topics in California Bills

Top 10 Topics by Frequency

Topic #    Frequency
topic 48   13569
topic 11   11838
topic 51   8024
topic 73   4913
topic 6    3675
topic 63   2782
topic 1    2615
topic 22   1879
topic 64   1726
topic 45   1663

6. What Topics Support Bill Passage?

Rank | Topic # | Odds Ratio | LDA Topics
1 | 70 | 453.981 | 0.023*tank + 0.019*underground + 0.015*transferor + 0.011*lie + 0.010*decennial + 0.008*storage + 0.008*cotenant + 0.008*stanford + 0.006*petroleum + 0.006*orphan
2 | 74 | 32.797 | 0.020*contribution + 0.014*calendar + 0.010*canyon + 0.009*lincoln + 0.009*shoulder + 0.007*stenographer + 0.006*inflation + 0.005*dispatcher + 0.005*vine + 0.005*boyer
3 | 47 | 32.695 | 0.024*cemetery + 0.011*mexican + 0.010*interment + 0.007*salton + 0.006*elsinore + 0.006*tuberculosis + 0.005*burial + 0.004*bacteria + 0.004*creek + 0.004*coliform
4 | 42 | 28.312 | 0.008*hoover + 0.003*tricare + 0.002*shower + 0.002*crutch + 0.002*contractholders + 0.002*bath + 0.001*dme + 0.001*durable + 0.000*hcpcs + 0.000*wheelchair
5 | 21 | 25.041 | 0.021*wyland + 0.015*reorganization + 0.014*brown + 0.012*gordon + 0.010*presidential + 0.008*ford + 0.008*gerald + 0.008*battalion + 0.007*mitochondrial + 0.007*remembrance
6 | 71 | 9.608 | 0.064*andwhereas + 0.024*awareness + 0.020*week + 0.014*whereas + 0.013*violence + 0.012*woman + 0.010*disease + 0.010*resolution + 0.010*month + 0.009*furtherresolved
7 | 65 | 8.798 | 0.029*pipeline + 0.027*ronald + 0.022*sea + 0.013*coastal + 0.012*marine + 0.009*rise + 0.008*reagan + 0.008*thomas + 0.007*climate + 0.007*arctic
8 | 34 | 8.183 | 0.030*candidate + 0.011*teen + 0.011*precinct + 0.011*nomination + 0.011*poll + 0.009*freeway + 0.009*say + 0.009*dating + 0.008*teenager + 0.007*sca
9 | 58 | 5.086 | 0.022*autism + 0.014*nursing + 0.014*therapist + 0.013*mr + 0.013*calderon + 0.011*backpack + 0.009*credentialing + 0.008*therapy + 0.008*acupuncture + 0.008*marriage
10 | 44 | 4.899 | 0.021*scientist + 0.017*negrete + 0.016*factfinding + 0.015*hepatitis + 0.015*mcleod + 0.011*maternity + 0.009*interdistrict + 0.009*knuckle + 0.008*liver + 0.005*infected
15 | 94 | 2.103 | 0.023*bicycle + 0.017*bus + 0.013*midwife + 0.011*deployed + 0.010*roadway + 0.009*smog + 0.008*schoolbus + 0.007*safer + 0.005*polluter + 0.005*overtaking
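For readers less familiar with odds ratios: in logistic regression an odds ratio is conventionally the exponential of a feature's coefficient, so values above 1 raise the odds of passage and values below 1 lower them. The sketch below assumes that convention applies to this table, which the slides do not state explicitly. `odds_ratio` and `apply_odds_ratio` are hypothetical helpers, and using the 28% average passage rate as a baseline is a simplification for illustration.

```python
import math


def odds_ratio(coefficient):
    """Odds ratio implied by a logistic regression coefficient."""
    return math.exp(coefficient)


def apply_odds_ratio(probability, ratio):
    """Multiply the odds of a baseline probability by an odds ratio."""
    odds = probability / (1 - probability) * ratio
    return odds / (1 + odds)


# Coefficient implied by topic 70's odds ratio, under the exp(coefficient) assumption.
print(round(math.log(453.981), 2))

# Illustrative only: effect of topic 94's odds ratio on a 28% baseline passage rate.
print(round(apply_odds_ratio(0.28, 2.103), 2))
```

Very large or very small odds ratios, such as those at the extremes of these tables, often come from topics that appear in few bills, so they deserve a look at the underlying counts before being read as strong effects.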

Sample CA Bills Containing “Strong” Topics

Bill Status | Bill Session, ID (Link) | Bill Title | Topic # | All Topics
Passed | 2011-2012-0 AB291 | Underground storage tanks: petroleum: charges. | 70 | 11, 45, 48, 70
Passed | 2013-2014-0 AB1286 | Personal income tax: voluntary contributions: California Breast Cancer Research Fund | 74 | 11, 45, 74
Passed | 2009-2010-0 AB1969 | Elsinore Valley Cemetery | 47 | 1, 11, 47, 48
Passed | 2011-2012-0 AB2488 | Vehicles: buses: length limitations | 94 | 1, 11, 48, 49, 73, 94

6. What Topics Have Weak Bill Passage?

Rank | Topic # | Odds Ratio | LDA Topics
-10 | 79 | 0.000498 | 0.057*emission + 0.050*greenhouse + 0.041*gas + 0.019*warming + 0.017*global + 0.016*climate + 0.014*air + 0.013*carbon + 0.013*reduction + 0.013*solution
-9 | 92 | 0.000156 | 0.038*trafficking + 0.012*duress + 0.010*menace + 0.010*fiduciary + 0.010*ammunition + 0.009*chvez + 0.009*human + 0.008*achadjian + 0.008*wilk + 0.006*bigelow
-8 | 76 | 0.000150 | 0.071*inmate + 0.065*parole + 0.026*parolee + 0.023*prison + 0.021*correction + 0.019*rehabilitation + 0.010*released + 0.009*recidivism + 0.008*reentry + 0.007*journalist
-7 | 28 | 0.000113 | 0.036*bag + 0.024*plastic + 0.015*carryout + 0.011*positioning + 0.007*tends + 0.007*electorate + 0.006*store + 0.006*deliberately + 0.006*undetermined + 0.005*el
-6 | 50 | 0.000063 | 0.010*romero + 0.009*antipsychotic + 0.008*medication + 0.006*dementia + 0.006*detachable + 0.005*salvage + 0.004*dietary + 0.004*psychotropic + 0.004*repurchase + 0.003*diminishes
-5 | 24 | 0.000053 | 0.023*baby + 0.006*depression + 0.005*paratransit + 0.005*stewardship + 0.004*producer + 0.004*perinatal + 0.003*unwanted + 0.003*obstetrics + 0.003*sleep + 0.002*calhome
-4 | 8 | 0.000035 | 0.054*gang + 0.013*immunity + 0.010*tort + 0.008*rifle + 0.007*european + 0.007*magazine + 0.007*pervasive + 0.005*deadly + 0.005*mentally + 0.005*disordered
-3 | 81 | 0.000013 | 0.014*interpreter + 0.007*excellence + 0.006*digitized + 0.005*reelected + 0.005*easy + 0.004*fluency + 0.004*biodegradable + 0.002*willfulness + 0.002*annoyance + 0.002*disincentive
-2 | 10 | 0.000004 | 0.022*budget + 0.013*muratsuchi + 0.013*sawyer + 0.013*mullin + 0.012*bloom + 0.012*nazarian + 0.012*daly + 0.012*campos + 0.011*rodriguez + 0.010*dababneh
-1 | 62 | 0.000001 | 0.005*consummated + 0.004*nonconsenting + 0.003*nonsupervisory + 0.001*peculiar + 0.001*culminating + 0.000*overdue + 0.000*reputation + 0.000*unimpeded + 0.000*foster + 0.000*licentious

Sample CA Bills Containing “Weak” Topics

Bill Status | Bill Session, ID (Link) | Bill Title | Topic # | All Topics
Passed | 2011-2012-0 SB1219 | Recycling Plastic Bags | 28 | 28, 45, 48, 57
Passed | 2013-2014-0 AB1405 | Subversive Organization Registration Law: repeal | 92 | 6, 11, 20, 37, 48, 51, 66
Passed | 2011-2012-0 AB220 | Interstate Compact for Juveniles. | 76 | 48, 51, 63, 76
Passed | 2009-2010-0 AB863 | Public utilities: municipal districts: civil service exemptions. | 62 | 11, 30, 48, 51, 62

Next Steps for Legislative Predictions

• Add time context for bills in terms of legislative session, chamber, and major political events

• Add features about the bill, sponsors, districts, political context, duration, committees, and public comments

• Include exploratory data analysis from bill and legislator data

• Tune model to apply predictions to current bills

Relevant Citations

• Gensim: Topic Modeling for Humans by Radim Řehůřek, open source Python package

• Wallach, H. M. (n.d.). Topic Modeling: Beyond Bag-of-Words. Retrieved from poster link

• Gerrish, S. M., & Blei, D. M. (2011). Predicting legislative roll calls from text. In Proc. of ICML. Retrieved from article link

• Rong, X. (2016, June 5). Word2vec Parameter Learning Explained. arXiv:1411.2738 [cs.CL]. Retrieved from article link

• Unsplash for stock photo

Thank you!

@DataThinker WhenThereIsData.com