Experiments in data mining, entity disambiguation and how to think data-structures for designing beautiful algorithms
Ekta Grover
1st September, 2013
@ektagrover/[email protected] PyCon India, 2013 1/26
Structure of this Talk
How to think through a data mining problem
3 problems in data-mining around disambiguation, Natural language processing & Information Retrieval
Measuring your Model performance - developing a near-real-time performance aggregation platform
Putting it all together
The only algorithm you will ever need to know
1 Write down the problem
2 Think real hard
3 Write down the solution
Thinking through Data-Mining problem
What is the data and how should you represent it ?
What are the relationships you are interested in ?
How to define your Objective Function ?
How will you know that the score represents the direction you want to capture ?
How will you measure the performance after you are done ?
Problem # 1 : Entity Disambiguation
Problem
So, how much is your network really worth ?
What is Disambiguation ?
Information = Related + Unrelated
Extract significant attributes/features of an entity
Relationships between entities & features
Objective function is just a way to measure - rank/score the features for an entity w.r.t. other entities
Disambiguation is not a deterministic problem
Problem # 1 : Entity Disambiguation
Guiding question #1
What form of disambiguation to use for the entities ?
Algorithm & approach
1 Preprocessing - lower casing & removing punctuation
2 Exploratory Analysis - frequency plots (with NLTK FreqDist & PrettyTable)
3 Tokenization to find "keywords" to disambiguate
4 Handling stop words & Internationalization
5 Frequent item-set mining with support = # of buckets
6 Post processing, frequency plots & visualization
Extensions: Handling typographical errors and augmenting this data-set with scraping
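Steps 1-4 above can be sketched with the standard library (the talk uses NLTK's FreqDist, which behaves like collections.Counter here; the stop-word list and company names are illustrative):

```python
import string
from collections import Counter

STOP_WORDS = {"ltd", "inc", "pvt", "private", "limited", "corp"}  # illustrative stop list

def preprocess(name):
    """Step 1: lower-case and strip punctuation from an entity name."""
    return name.lower().translate(str.maketrans("", "", string.punctuation))

def tokenize(name):
    """Steps 3-4: split into keyword tokens, dropping stop words."""
    return [tok for tok in preprocess(name).split() if tok not in STOP_WORDS]

companies = ["Microsoft Corporation", "Microsoft Corp.", "Ebay Inc", "Ebay inc."]
# Step 2: exploratory frequency counts (NLTK's FreqDist works the same way)
freq = Counter(tok for c in companies for tok in tokenize(c))
print(freq.most_common(2))  # -> [('microsoft', 2), ('ebay', 2)]
```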
Problem # 1 : Talking Specifics & answering the ”Why”
Frequent item-set mining & concept of support
How do we know the number of significant words in an entity name ?
1 For every company name - strip the 1st word and store it as the key of a "local" dictionary, with values as a list
2 On the local dict created in step 1, do frequent item-set mining with support = # baskets - if a word is significant (and uniquely describes an entity), it should be present in all baskets
3 Treat all the values in a local dict as a set - if all elements of a local dict are the same, there is no need for item-set mining
4 For every key - find the smallest list of tokens (by length) and search for the presence of all tokens iteratively till the support appears in less than # baskets
5 Replace the words fetched post support to equal all values in that dictionary
6 Fetch all values from all keys in the dict of lists - flatten this list
7 Regular frequency counts on this flattened list of now disambiguated elements
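The local-dictionary / support idea can be sketched as follows (a minimal reading of steps 1-4; the basket data is made up):

```python
from collections import defaultdict

def disambiguate(names):
    """Group tokenized names by their 1st word (step 1), then keep only
    the tokens whose support equals the number of baskets in that group
    (steps 2-4), yielding one canonical name per key."""
    local = defaultdict(list)
    for name in names:
        tokens = name.lower().split()
        local[tokens[0]].append(tokens)  # step 1: key on the 1st word
    canonical = {}
    for key, baskets in local.items():
        # step 4: start from the shortest token list and keep only the
        # tokens that appear in every basket (support == # baskets)
        shortest = min(baskets, key=len)
        common = [t for t in shortest if all(t in b for b in baskets)]
        canonical[key] = " ".join(common)
    return canonical

names = ["microsoft corporation", "microsoft corporation india",
         "ebay inc", "ebay paypal inc"]
print(disambiguate(names))
# -> {'microsoft': 'microsoft corporation', 'ebay': 'ebay inc'}
```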
Problem # 1 : Limitations of this approach
1 Abbreviations : Boston Consulting Group & BCG, State Bank of India & SBI - build a referencing domain-specific Word Sense Disambiguation (WSD) database
2 Differently referenced entities such as "Innovation labs 247 customer private ltd" and "247 innovation labs" : the referencing key will be different for these
3 Sub-entities of a company with punctuation, "Ebay/Paypal" - the "key" depends on how we choose to treat the punctuation for {/, :, -}
4 Need a module to handle inadvertent typographical errors in entity names; names are proper nouns, yet there is no referencing from the Wordnet database - plug in Problem #2
Extensions
Cross-document and Co-reference resolution, Multi-pass learning, smoothing & annealed learning algorithms
Problem # 1 : Representation
Problem # 1 : Representation, a more extreme example & Codewalk
Codewalk from Github
https://github.com/ekta1007/Experiments-in-Data-mining
Problem # 1 : Visualization
[Frequency plot of top companies: SAP [57], [24]7 Inc [37], VMware, IBM, Google, Thoughtworks, Yahoo, Ebay Inc [8], Intuit [8], Microsoft Corporation [8], Accenture [7], LinkedIn [7], University of Cologne [7], Dell [6], Deutsche Bank [6], Flipkart.com [6], Boston Consulting Group [6], Zynga [6], Amazon [5], Anita Borg Institute for Women & Technology [5]]
Top companies by Frequency distribution, post item-set mining
*I used R (ggplot2 and igraph) to plot the nodes with a ColorBrewer palette, and Inkscape for "cosmetics"
Problem # 2 : Custom Distance metric for handling Typographical errors, optimized for a QWERTY keyboard
Usecase
Disambiguating the typographical errors with a contrasting (noun)item
Three types of misspellings [1]
1 Typographic errors - know the correct spelling, but motor coordination slips when typing
2 Cognitive errors - caused by lack of knowledge of the person
3 Phonetic errors [2] - sound correct, but are orthographically incorrect
Distance measures: Levenshtein, Manhattan, Euclidean, Cosine - inappropriate for distance in our context. Need a hybrid contextual approach
[1] Karen Kukich - Techniques for automatically correcting words in text, ACM Computing Surveys Volume 24, Issue 4, Dec. 1992 ; Creating a Spellchecker with Solr http://emmaespina.wordpress.com/2011/01/31/20/
[2] Soundex algorithm; example : Chebyshev & Tchebycheff
Problem # 2 : QWERTY distance metric
Guiding Assumption
Since typographical errors are inadvertently introduced, they should not count towards the overall distance
Algorithm & approach
1 Data structures - QWERTY keyboard as a graph with connected adjacent nodes, neighborhood nodes with appropriate distances
2 Define a typographical error & its thresholds - what can be the difference in length of words, how many possible "swaps" can occur in a typo
3 Implement the custom distance metric - fall back to Levenshtein when non-typographical - a hybrid approach
4 Output whether the words are identical, or the distance taking typographical errors into account
5 Contrast and blend in with other fuzzy logic methods, Apache Solr, spell checker & auto-suggest
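A minimal sketch of the hybrid metric, assuming a hand-built adjacency map (only part of the keyboard is shown, and the swap thresholds are simplified to "same length, all mismatches adjacent"):

```python
# Neighbouring keys on a QWERTY layout - an illustrative subset;
# the full graph would cover every key
QWERTY_NEIGHBOURS = {
    "q": "wa", "w": "qeas", "e": "wrds", "r": "etdf", "t": "ryfg",
    "a": "qwsz", "s": "awedxz", "d": "serfcx", "f": "drtgvc",
    "g": "ftyhbv", "h": "gyujnb", "n": "bhjm", "m": "njk",
}

def levenshtein(a, b):
    """Plain edit distance - the fallback for non-typographical differences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def qwerty_distance(a, b):
    """Hybrid metric: same-length words whose mismatches are all
    adjacent-key slips count as identical (distance 0); otherwise
    fall back to Levenshtein."""
    if len(a) == len(b):
        slips = [(x, y) for x, y in zip(a, b) if x != y]
        if all(y in QWERTY_NEIGHBOURS.get(x, "") for x, y in slips):
            return 0  # inadvertent typo - does not count towards distance
    return levenshtein(a, b)

print(qwerty_distance("test", "tesr"))  # 't' -> 'r' is an adjacent-key slip: 0
print(qwerty_distance("test", "team"))  # a cognitive difference: 2
```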
Problem # 2 : Representation & Codewalk
QWERTY Keyboard representation
Codewalk from Github
https://github.com/ekta1007/Custom-Distance-function-for-typos-in-hand-generated-datasets-with-QWERY-Keyboard/
Problem # 3 : Designing your custom job feeds at Linkedin
Part 1
Create filtering criteria to build your custom data-set
Linkedin REST APIs, scraping (Scrapy, Beautiful Soup), raw html with urllib & NLTK
Part 2
Build a relevance score of the filtered results
TF-IDF (with cosine similarity), Structured Relevance Models (SRM), Proximity search, classical Information Retrieval & pattern matching algorithms
Problem # 3 : Part 1
Algorithm & approach
1 Exploit the commonality in the url pattern of the Jobs page, looping constructs for the job index http://www.linkedin.com/jobs?viewJob=&jobId=job_index&trk=rj_jshp
2 Post this, only 4 options exist per Job page
  1 The job you're looking for is no longer active
  2 We can't find the job you're looking for
  3 The job passes your filtering criteria (location, skill-specific keywords etc)
  4 The job is active, but does not pass your filtering criteria
3 For urls in 2.3, build a simple csv file with all elements of interest - Company Name, Title of Position, Job Posting Url, Posted on, Functions, Industry, similar Jobs
4 Can pre-filter by company name - by heuristics, or a quick scoring function for each company name
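The page-classification step can be sketched as follows; the marker strings and filtering keywords are hypothetical (the real pages would need to be inspected for the exact phrasing):

```python
from urllib.request import urlopen

JOB_URL = "http://www.linkedin.com/jobs?viewJob=&jobId={}&trk=rj_jshp"
KEYWORDS = ("python", "data mining")   # hypothetical filtering criteria

def fetch(job_index):
    """Exploit the common URL pattern: one request per job index."""
    return urlopen(JOB_URL.format(job_index)).read().decode("utf-8", "ignore")

def classify(html):
    """Map a job page to one of the four possible outcomes.
    The marker strings are illustrative only."""
    if "no longer active" in html:
        return "inactive"
    if "can't find the job" in html:
        return "missing"
    if any(kw in html.lower() for kw in KEYWORDS):
        return "match"            # option 2.3 - this row goes into the csv
    return "active_no_match"

print(classify("Senior engineer, Python and data mining ..."))        # -> 'match'
print(classify("The job you're looking for is no longer active"))     # -> 'inactive'
```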
Problem # 3 : Part 2
Algorithm & Approach
1 A document is a bag-of-words - Tokenize, standard preprocessing (punctuation, lower case)
2 Exploratory analysis with NLTK's Dispersion plot, FreqDist, collocations, common contexts, count, concordance
3 Lemmatization & stemming with NLTK's derivationally related forms with Morphy (Wordnet), Porter, Wordnet app
4 TF-IDF - with a stop word list from NLTK & additional weightage for words of your specific interest
5 Inverted index - for each word in the vocabulary, store the doc-id with its TF-IDF, making up the whole TF-IDF vector
6 Given the TF-IDF vectors of the document corpus & the query, compute the cosine similarity of each document with the base query
7 Output the top-k relevant Job listings, with the top-k' words in each job feed
8 Augment this with a classical structured relevance model, Proximity search or pattern matching algorithm
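Steps 4-7 can be sketched with the standard library (a toy corpus; a production version would add NLTK preprocessing and stop words as in steps 1-4):

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Steps 4-5: bag-of-words TF-IDF vectors over a small corpus."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(t for toks in tokenized for t in set(toks))  # document frequency
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (tf[t] / len(toks)) * idf[t] for t in tf})
    return vectors

def cosine(u, v):
    """Step 6: cosine similarity between two sparse TF-IDF vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

jobs = ["data scientist machine learning python",
        "sales manager retail",
        "python engineer data mining"]
vecs = tf_idf_vectors(jobs + ["python data mining jobs"])  # last doc = base query
query = vecs[-1]
# Step 7: rank the job docs against the query
ranked = sorted(range(len(jobs)), key=lambda i: cosine(vecs[i], query), reverse=True)
print(ranked[0])  # -> 2, the "python engineer data mining" posting
```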
Problem # 3 : Representation & Visualization
Business Analysis Sr. Manager at Dell in Bangalore [ 0.99374331 ]
Sr. Software Development Engineer / Architect - Ad Tech (Amazon)
Technical Program Manager - Amazon Fashion Technology (Amazon)
Data Analytics/Business Intelligence Engineer - Buying Experience (Amazon)
Sr. Software Development Engineer, Collective Human Intelligence (Amazon)
Lead Analyst, Statistical Modelling (WNS Global Services)
Principal Scientist, Sr. Computer Scientist, Member of Research Staff (Adobe)
Research Scientist (Yahoo)
RPD-Biologics-Sr Scientist (Biocon)
Data Scientist (NetApp)
Relevant Job feeds ranked by Cosine Similarity & TF-IDF to the basequery
Problem # 3 : Representation & Codewalk
Top words for "Business Analysis Sr. Manager" (Dell): dell, procurement, develops, strategy, goals, reports, analyzes, internal, planning
Top words for "Sr. Software Development Engineer / Architect - Ad Tech" (Amazon): advertising, highly, amazon, traffic, advertisement, increase, architect, ad, distributed
Top words for "Sr. Energy Data Scientist" (NextOrbit Inc): nextorbit, energy, discovering, understanding, scientist, enable, space, trends
Top ranking words in the respective job-feeds3
Codewalk from Github
https://github.com/ekta1007/Creating-Custom-job-feeds-for-Linkedin
Note that some of the words are relatively uninformative, like "understanding", "Amazon" - could use any amongst Proximity search, Phrase search, Positional index, Next word index
Problem # 4 : Designing a near-real-time Performance Aggregation Platform
Concept
Once you roll your competing candidate models into production, they should self-decide based on performance
Competing Goals can be amongst
1 Minimizing entropy of the complete system (a more centered distribution)
2 Optimizing precision & recall [4] - a tradeoff between type I & type II errors
3 Accounting for trends - 1st order (velocity) and 2nd order (acceleration)
4 Minimizing bias - low mean square error, moving towards a more global representation of a problem
[4] precision - fraction of retrieved instances that are relevant; recall - fraction of relevant instances that are retrieved
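The footnote's definitions translate directly to code (the document ids are made up):

```python
def precision_recall(retrieved, relevant):
    """precision = |retrieved & relevant| / |retrieved|
       recall    = |retrieved & relevant| / |relevant|"""
    hits = len(set(retrieved) & set(relevant))
    return hits / len(retrieved), hits / len(relevant)

# 4 documents retrieved, 3 of them relevant, out of 6 relevant overall
p, r = precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6, 7})
print(p, r)  # -> 0.75 0.5
```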
Problem # 4 : Defining the aggregated proxy for Performance
Algorithm & approach
1 Apriori: Initialize the candidate population, all else equal
2 Define model equations, which generate the scores, per model
3 Define the metrics for weighted performance and what is considered a "success event"
4 Define threshold conditions and a measure of success for each metric, wherever applicable
5 Scale the aggregated performance to have the next period's candidate population
6 Roll out the new candidate population to the next period, per model (Probabilistic model)
Scores are centrality scores (Welford's Algorithm) - not to be over-enthusiastic (pessimistic) for state-outcomes, with weights for recency of performance over long learning cycles
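Welford's algorithm, mentioned above for the centrality scores, keeps a running mean and variance in a single pass - a minimal sketch (the per-period scores are made up):

```python
class WelfordScore:
    """Running mean/variance (Welford's algorithm) - a numerically stable
    way to keep a model's centrality score updated in near-real time."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance of everything seen so far."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

score = WelfordScore()
for outcome in [0.8, 0.6, 0.9, 0.7]:   # per-period success scores for one model
    score.update(outcome)
print(round(score.mean, 2))  # -> 0.75
```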
Problem # 4 : Designing the performance benchmark
Benchmark layout - one row per (time period, model):
Time period | Model # | Mean Square Error (proxy for bias, bench-marked against) | Precision Score, Recall Score (Type I & Type II errors) | Mean, Mean Variance (measure of entropy) | Velocity by Metric 1 & 2 (Velocity Metric) | Acceleration by Metric 1 & 2 (Acceleration Metric)
t1 | M1 .. M5 | (scores per metric)
t2 | M1 .. M5 | (scores per metric)
Problem # 4 : Representation & Codewalk
Aggregated performance in Period 1
Codewalk from Github
https://github.com/ekta1007/Performance-aggregation-platform—learning-in-near-real-time
Problem # 4 : Representation cont..
Aggregated performance in Period 1 & 2 respectively
Resources
Link to the problems I talked about today (I keep updating this)
Understanding data structures in Python
A Programmer’s Guide to Data Mining - Ron Zacharski
Creating a Spell Checker with Solr
An Intuitive example to TF-IDF, Iraq War logs - Jonathan Stray
An Introduction to Information Retrieval - Christopher D. Manning, Prabhakar Raghavan, Hinrich Schutze
Natural Language Processing with Python - Steven Bird, Ewan Klein, Edward Loper
Stemming & lemmatization
Programming Collective Intelligence: Building Smart Web 2.0 Applications - Toby Segaran
Mining the Social Web: Analyzing Data from Facebook, Twitter, LinkedIn - Matthew A. Russell