Word and Graph Embeddings for Machine Learning Models
Steven Skiena, Dept. of Computer Science
Which is the most useful definition of “cat”?
Distributed Word Representations
• Words are represented by low-dimensional vectors (20-500 dimensions).
• The neural network (NN) methods used to learn these representations are based on phrase fluency; they are language-agnostic and can scale to huge datasets.
[Figure: the words pine, oak, rose, daisy, reading, writing, read, write shown twice: first as vectors over |V| explicit dimensions (|V|: size of vocabulary), then as vectors over d latent dimensions, with d << |V|. Similar words share similar representations.]
The Fluency Test for Training Embeddings
● A good representation should be able to distinguish between real and randomly corrupted phrases.
● For each phrase, you sample a random word from the vocabulary and substitute it for one of the phrase's words (below, "fix" replaces "visited").
S = ("When", "I", "visited", "New", "York")
S' = ("When", "I", "fix", "New", "York")
● The model is asked to perturb the representation of each word so that the scores satisfy the following condition, where 1 is the margin:
Score(S) > Score(S') + 1
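A condition of this form is commonly enforced with a hinge (margin ranking) loss. The sketch below shows one way to corrupt a phrase and compute that loss in Python. It assumes the middle word is the one replaced, as in the example above; the helper names `corrupt` and `hinge_loss` are hypothetical, and the scores are plain numbers standing in for whatever the network outputs.

```python
import random

def corrupt(phrase, vocab, rng):
    """Return a copy of `phrase` with its middle word swapped for a random vocabulary word."""
    corrupted = list(phrase)
    mid = len(phrase) // 2
    # Only consider replacements that differ from the original word.
    choices = [w for w in vocab if w != phrase[mid]]
    corrupted[mid] = rng.choice(choices)
    return tuple(corrupted)

def hinge_loss(score_real, score_corrupt, margin=1.0):
    """Zero once Score(S) exceeds Score(S') by at least the margin, positive otherwise."""
    return max(0.0, margin - score_real + score_corrupt)

rng = random.Random(0)
S = ("When", "I", "visited", "New", "York")
S_prime = corrupt(S, ["fix", "visited", "the", "blue"], rng)
print(S_prime)
print(hinge_loss(score_real=3.0, score_corrupt=1.0))  # condition met, no penalty
print(hinge_loss(score_real=1.0, score_corrupt=1.5))  # condition violated, positive loss
```

Training would backpropagate this loss into the word vectors; that step depends on the network and is left out here.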
Neural Network (Word2Vec) [Mikolov et al. 2013]
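[Figure: network diagram. Word vectors C_Imagination, C_is, C_greater, C_than, C_detail are read from the projection layer (labels C, M, |V|), pass through hidden layer H with weights W1 and W2, and produce a Score. Arrows mark the forward pass and the backward pass.]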
Visualizing Word Embeddings: Animals, Colors, Numbers, Countries
Talk Outline
● Introduction: Word Embeddings
● Multilingual Language Processing (NLP)
● Name Embeddings
● DeepWalk: Feature Extraction from Graphs
● AI Institute
Polyglot Embeddings Demo @ https://bit.ly/embeddings
Examples of the nearest five neighbors of every word in several languages.
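A nearest-neighbor lookup of this kind can be sketched in a few lines of Python. The example assumes cosine similarity as the closeness measure and uses made-up 2-dimensional vectors; the demo's actual metric and vectors are not stated here, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest_neighbors(word, embeddings, k=5):
    """Return the k words most similar to `word`, most similar first."""
    query = embeddings[word]
    scored = [(cosine(query, vec), other)
              for other, vec in embeddings.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]

# Made-up vectors for illustration only.
toy = {
    "pine": [0.9, 0.1], "oak": [0.8, 0.2],
    "rose": [0.1, 0.9], "daisy": [0.2, 0.8],
}
print(nearest_neighbors("pine", toy, k=2))
```

A linear scan like this is fine for small vocabularies; large ones would typically use an approximate nearest-neighbor index instead.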
Applications of Word Embeddings
● Part of Speech (POS) tagging: what is a noun, verb, adjective? [CoNLL 2013]
● Entity Recognition: what are the people, places and things mentioned in the text? [SDM 2015]
● Sentiment Analysis: is a document saying positive or negative things? [ACL 2014]
● Transliteration: how can we say the same thing in different languages and recognize friends?
Polyglot-NER Demo @ https://bit.ly/polyglot-ner
R. Al-Rfou, B. Perozzi, and S. Skiena, “Polyglot-NER: Massive Multilingual Named Entity Recognition”, SIAM Conf. Data Mining (SDM 2015)
Legend: Location, Organization, Person
Your Name Tells a Lot About You
● Your gender (male, female)
● Your ethnicity (white, black, hispanic, asian/pacific islander)
● Your nationality (which country is your family from?)
● Your marriage status (X Y-Z)
● Your socio-economic status (Jethro vs. Archibald)
● Your age (Fannie vs. Caitlin)
How can we capture these nuances into a feature representation for classification and other machine learning tasks?
Homophily and Communications Patterns
Brad Pitt: Angelina Jolie, Jennifer Aniston, George Clooney, Cate Blanchett, Julia Roberts
Saddam Hussein: Tarik Aziz, Uday Hussein, Samira Shahbandar, Sajida Talfah
Donald Trump: Mike Pence, Vladimir Putin, Paul Ryan, Ivanka Trump, Mitch McConnell
Xi Jinping: Hu Jintao, Jiang Zemin, Peng Liyuan, Xi Mingze, Ke Lingling
Homophily (“love of the same”) is the tendency for people to associate with people similar to them.
Our analysis of 57 million email contact lists from a major Internet company provided us with sequences of name tokens sufficient for training distributed word embeddings.
Nationality classification using name embeddings (with Junting Ye, Shuchu Han, Yifan Hu, Baris Coskun, Meizhu Liu, Steven Skiena), CIKM 2017.
Regions in Name Space
These embeddings make natural features in any learning task where you have names:
● Nationality/ethnicity detection for biomedical/sociology research
● Demographic analysis
● Social media analysis
● Security
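To illustrate what "embeddings as features" can mean in practice, here is a minimal nearest-centroid classifier in Python. It is a generic sketch, not the method behind any system named in this talk: the 2-dimensional vectors and the class labels "A" and "B" are invented, and the function names are hypothetical.

```python
import math
from collections import defaultdict

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def fit_centroids(labeled):
    """Map each label to the mean of its training vectors.

    `labeled` is a list of (embedding, label) pairs.
    """
    groups = defaultdict(list)
    for vec, label in labeled:
        groups[label].append(vec)
    return {label: centroid(vecs) for label, vecs in groups.items()}

def predict(vec, centroids):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(vec, centroids[label]))

# Invented embeddings for four name tokens, two per class.
train = [([0.9, 0.1], "A"), ([0.8, 0.3], "A"),
         ([0.1, 0.9], "B"), ([0.2, 0.7], "B")]
centroids = fit_centroids(train)
print(predict([0.7, 0.2], centroids))
```

Any standard classifier could take the place of the centroid rule; the point is only that the embedding vector is the input feature.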
NamePrism: A nationality classifier (www.name-prism.com)
Research Project Goal | Research Group | Country
“the impact of having foreign-sounding name on job search in financial” | Nanyang Technological University | Singapore
“determine if ethnic group size impacts national cabinet diversity” | Department of Political Science, Washington University in St. Louis | U.S.
“promote the contributions of Iranian Americans to members within and outside of the Iranian community living in America.” | Iranian Americans' Contributions Project | U.S.
“determine if ethnicity plays a part/plays no part in whether a written evidence submitted to a Parliamentary Inquiry is accepted or rejected” | Parliamentary Digital Service | UK
“working on a study on the network effects for long term unemployed” | German Institute for Employment Research | Germany
“unveiling the origins of French citizens in order to study discrimination in several areas of the French society” | Laboratoire Interdisciplinaire Sciences Innovations Sociétés (LISIS) | France
“Investigate whether hosts on Airbnb get discriminated based on their ethnicity” | Stockholm School of Economics | Sweden
● WIRED Magazine;
● API used by 156 social science research projects
● 69.9 million names analyzed (roughly the population of France)
Features From Graphs
[Figure: the adjacency matrix (dimension |V|) as the input to graph learning tasks]
● Anomaly Detection
● Attribute Prediction
● Clustering
● Link Prediction
● ...

A first step in machine learning for graphs is to extract graph features:
● node degree
● pairs: # of common neighbors
● groups: cluster assignments
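The first two hand-built features listed above can be computed directly from adjacency sets. A minimal Python sketch, assuming an undirected graph stored as a dictionary of neighbor sets; the toy graph and function names are illustrative only.

```python
def degrees(adj):
    """Node feature: number of neighbors of each vertex."""
    return {node: len(neighbors) for node, neighbors in adj.items()}

def common_neighbors(adj, u, v):
    """Pair feature: number of vertices adjacent to both u and v."""
    return len(adj[u] & adj[v])

# Undirected toy graph: every edge appears in both endpoints' sets.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}
print(degrees(graph))
print(common_neighbors(graph, "a", "b"))
```

Group features such as cluster assignments need a clustering algorithm and are not shown.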
Advantages of DeepWalk
Bryan Perozzi DeepWalk: Online Learning of Social Representations
[Figure: adjacency matrix (dimension |V|) → DeepWalk → latent dimensions (d << |V|) → Anomaly Detection, Attribute Prediction, Clustering, Link Prediction, ...]
● Scalable: an online algorithm that does not use the entire graph at once
● Walks-as-sentences metaphor
● Works great!
● Implementation available: bit.ly/deepwalk
DeepWalk: The Entire Idea
Short random walks = sentences
■ We generate random walks for each vertex in the graph.
■ Each short random walk has length t.
■ Pick the next step uniformly from the vertex neighbors.
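The three steps above can be sketched in Python as follows. This is a minimal illustration, not the released implementation: it assumes t counts the vertices in a walk, and the `walks_per_vertex` parameter, the function names, and the toy graph are hypothetical.

```python
import random

def random_walk(adj, start, t, rng):
    """Walk of up to t vertices from `start`, each step uniform over the current neighbors."""
    walk = [start]
    while len(walk) < t:
        neighbors = adj[walk[-1]]
        if not neighbors:  # dead end: stop early
            break
        walk.append(rng.choice(neighbors))
    return walk

def build_corpus(adj, walks_per_vertex, t, seed=0):
    """Generate `walks_per_vertex` walks starting at every vertex; each walk is one 'sentence'."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(walks_per_vertex):
        vertices = list(adj)
        rng.shuffle(vertices)  # visit start vertices in random order on each pass
        for v in vertices:
            corpus.append(random_walk(adj, v, t, rng))
    return corpus

# Toy undirected graph as neighbor lists (lists so rng.choice can index them).
toy_graph = {"a": ["b", "c"], "b": ["a", "c", "d"], "c": ["a", "b"], "d": ["b"]}
for sentence in build_corpus(toy_graph, walks_per_vertex=1, t=4):
    print(" ".join(sentence))
```

The resulting walks would then be passed, as if they were sentences, to a word-embedding trainer; that step is omitted here.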
Everyone is doing the DeepWalk!
2177 citations since August 2014! 28th in downloads from the ACM Digital Library over the past 12 months, among all 4,188 KDD papers ever! (By comparison, the second best from KDD ‘14 ranks 104th.)
You, too, can do the DeepWalk!
Identifying Historically Similar Entities
We can construct embeddings for people, places and things, to recognize similar entities.
Y. Chen, B. Perozzi, and S. Skiena, “Vector-Based Similarity Measurements for Historical Figures”, SISAP 2015.
DeepWalk: Nearest Neighbors
Scarlett Johansson: Kirsten Dunst (0.784), Natalie Portman (0.786), Gwyneth Paltrow (0.796), Brad Pitt (0.858), Cameron Diaz (0.891)
Steven Skiena: Larry Page (1.597), Sergey Brin (1.598), Danny Hillis (1.644), Andrei Broder (1.652), Mark Weiser (1.653)
Barack Obama: George W. Bush (0.474), Hillary Clinton (0.657), Bill Clinton (0.658), Joe Biden (0.750), Al Gore (0.791)
Albert Einstein: Richard Feynman (1.049), Max Planck (1.073), Freeman Dyson (1.107), Stephen Hawking (1.153), Robert Oppenheimer (1.156)
Ludwig van Beethoven: Franz Schubert (0.489), Johannes Brahms (0.532), Wolfgang Mozart (0.567), Robert Schumann (0.576), Gustav Mahler (0.635)
Mick Jagger: John Lennon (0.687), Keith Richards (0.687), Paul McCartney (0.796), Ronnie Wood (0.822), Eric Clapton (0.833)
Institute for AI-Driven Discovery and Innovation
Professor Steven Skiena
Director, AI Institute
November 2019
Mission of the Institute
• Promote activity in AI and related areas towards attracting more federal, state, industrial and private funding.
• Stimulate research and educational activities in AI in CS and across CEAS.
• Make Stony Brook a more attractive destination for graduate students and faculty interested in AI and related areas.
• Check out our website: ai.stonybrook.edu and @AI_SBU on Twitter.
Stony Brook AI Institute
Major Accomplishments 2018-19
• Building a stronger Stony Brook AI community
• Three SUNY EIP hires (Haibin Ling, Michael Ryoo, Zhaozheng Yin)!!
• Junior hire in machine learning (Yifan Sun, starting Fall 2020)
• NSF Major Research Infrastructure (MRI) grant for large AI/ML cluster
• First substantial philanthropic gift ($375K) for postdoctoral scholars program
• Communications hire (Dan Olawski)
• Bloomberg AI Institute kickoff
Core Faculty
Research Interests
We have strong research groups in several areas of AI:
• Biomedical Informatics
• Computer Vision
• Computational Logic and Reasoning
• Data Science
• Machine Learning
• Natural Language Processing
• Social Media Analysis
Educational Initiatives
• New undergraduate courses in Data Science and Natural Language Processing
• New undergraduate concentration in Artificial Intelligence
• New graduate concentration in Data Science, approved
• DS+X initiative with College of Arts and Sciences (CAS)
Student Collaborators: Stony Brook Data Science Lab
Bryan Perozzi, Vivek Kulkarni, Rami Al-Rfou, Junting Ye, Haochen Chen, Yingtao Tian
Manual Laboring
Questions?