Word and Graph Embeddings for Machine Learning Models

Steven Skiena, Dept. of Computer Science

Which is the most useful definition of “cat”?

Distributed Word Representations

• Words are represented by low-dimensional vectors (20-500 dimensions).
• The neural network (NN) learning methods that use phrase fluency to learn these representations are language-agnostic and can scale to huge datasets.

[Figure: two matrices for the word pairs pine/oak, rose/daisy, reading/writing, and read/write. Explicit Dimensions: |V| columns, where |V| is the size of the vocabulary. Latent Dimensions: d columns, where d << |V|.]

Similar words share similar representations.
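The contrast between explicit and latent dimensions can be sketched in a few lines. This is a minimal illustration, not the talk's code: the four-word vocabulary, the choice d = 2, and every vector value are invented for the example. It shows that distinct one-hot vectors are always orthogonal, while dense vectors can place related words close together.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["pine", "oak", "rose", "daisy"]

# Explicit representation: one dimension per vocabulary word (length |V|).
one_hot = {
    w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, w in enumerate(vocab)
}

# Latent representation with d = 2; the values are invented for illustration.
dense = {
    "pine": [0.9, 0.1],
    "oak": [0.8, 0.2],
    "rose": [0.1, 0.9],
    "daisy": [0.2, 0.8],
}

# Distinct one-hot vectors share no nonzero dimension, so similarity is 0.
one_hot_sim = cosine(one_hot["pine"], one_hot["oak"])

# Dense vectors can encode that two trees are closer than a tree and a flower.
dense_sim_close = cosine(dense["pine"], dense["oak"])
dense_sim_far = cosine(dense["pine"], dense["rose"])
```

With one-hot vectors every pair of different words looks equally unrelated, which is why the latent form is the more useful feature representation.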

The Fluency Test for Training Embeddings

● A good representation should be able to distinguish between real and randomly corrupted phrases.

● For each phrase, sample a random word from the vocabulary and substitute it for one word of the phrase to produce a corrupted phrase S'.

S = ("When", "I", "visited", "New", "York")

S' = ("When", "I", "fix", "New", "York")

● The model is asked to perturb the representation of each word so that the scores satisfy the following condition, with a margin of 1:

Score(S) > Score(S') + 1
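The fluency test can be written as a hinge loss that is zero once the margin condition holds. The sketch below makes two assumptions: the corrupted word is the middle one (as in the example phrase), and the scores passed to the loss are made-up numbers rather than outputs of a trained model. The helper names `corrupt` and `hinge_loss` are hypothetical.

```python
import random

def corrupt(phrase, vocab, rng):
    """Replace the middle word (an assumption) with a different random vocabulary word."""
    middle = len(phrase) // 2
    candidates = [w for w in vocab if w != phrase[middle]]
    corrupted = list(phrase)
    corrupted[middle] = rng.choice(candidates)
    return tuple(corrupted)

def hinge_loss(score_real, score_corrupt, margin=1.0):
    """Zero once Score(S) exceeds Score(S') by at least the margin."""
    return max(0.0, margin - score_real + score_corrupt)

rng = random.Random(0)
vocab = ["When", "I", "visited", "fix", "New", "York"]
S = ("When", "I", "visited", "New", "York")
S_prime = corrupt(S, vocab, rng)

# Made-up scores: the first pair violates the margin, the second satisfies it.
loss_violated = hinge_loss(score_real=0.2, score_corrupt=0.5)
loss_satisfied = hinge_loss(score_real=2.0, score_corrupt=0.5)
```

A nonzero loss is the signal to adjust the word representations; a zero loss leaves them unchanged.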

Neural Network (Word2Vec) [Mikolov et al. 2013]

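[Figure: network diagram. The word vectors C(Imagination), C(is), C(greater), C(than), C(detail) are looked up from the embedding matrix C (|V| rows) in the Projection Layer, passed through W1 to the Hidden Layer H, and through W2 to the Score. Arrows mark the forward pass and the backward pass.]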
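One reading of the diagram's forward pass, as a minimal sketch: look up and concatenate the word vectors, apply W1 and a nonlinearity, then apply W2 to get a scalar score. The tanh nonlinearity, the absence of bias terms, the tiny layer sizes, and the random initial weights are all assumptions, since the slide gives only the layer names.

```python
import math
import random

rng = random.Random(0)
vocab = ["Imagination", "is", "greater", "than", "detail"]
d, hidden_size, window = 3, 4, 5  # toy sizes, chosen for illustration

def rand_matrix(rows, cols):
    """Matrix of small random weights in [-0.1, 0.1]."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

# C: embedding lookup table, one d-dimensional row per vocabulary word.
C = dict(zip(vocab, rand_matrix(len(vocab), d)))
W1 = rand_matrix(hidden_size, window * d)                  # projection -> hidden
W2 = [rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]  # hidden -> score

def score(phrase):
    # Projection layer: look up and concatenate each word's embedding.
    x = [value for word in phrase for value in C[word]]
    # Hidden layer: linear map followed by tanh (nonlinearity assumed).
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    # Output: a single scalar score for the phrase.
    return sum(w * hi for w, hi in zip(W2, h))

s = score(("Imagination", "is", "greater", "than", "detail"))
```

The backward pass would propagate the loss gradient through W2 and W1 and into the rows of C used by the phrase, which is how the embeddings themselves get trained.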

Visualizing Word Embeddings: Animals, Colors, Numbers, Countries

Talk Outline

● Introduction: Word Embeddings
● Multilingual Language Processing (NLP)
● Name Embeddings
● DeepWalk: Feature Extraction from Graphs
● AI Institute

Polyglot Embeddings Demo @ https://bit.ly/embeddings

Examples of the nearest five neighbors of every word in several languages.

Applications of Word Embeddings

● Part of Speech (POS) tagging: what is a noun, verb, adjective? [CoNLL 2013]

● Entity Recognition: what are the people, places and things mentioned in the text? [SDM 2015]

● Sentiment Analysis: is a document saying positive or negative things? [ACL 2014]

● Transliteration: how can we say the same thing in different languages and recognize friends?

Polyglot-NER Demo @ https://bit.ly/polyglot-ner

R. Al-Rfou, B. Perozzi, and S. Skiena, “Polyglot-NER: Massive Multilingual Named Entity Recognition”, SIAM Conf. Data Mining (SDM 2015)

Legend: Location, Organization, Person

Your Name Tells a Lot About You

● Your gender (male, female)
● Your ethnicity (white, black, hispanic, asian/pacific islander)
● Your nationality (which country is your family from?)
● Your marriage status (X Y-Z)
● Your socio-economic status (Jethro vs. Archibald)
● Your age (Fannie vs. Caitlin)

How can we capture these nuances into a feature representation for classification and other machine learning tasks?

Homophily and Communications Patterns

Brad Pitt: Angelina Jolie, Jennifer Aniston, George Clooney, Cate Blanchett, Julia Roberts
Saddam Hussein: Tarik Aziz, Uday Hussein, Samira Shahbandar, Sajida Talfah
Donald Trump: Mike Pence, Vladimir Putin, Paul Ryan, Ivanka Trump, Mitch McConnell
Xi Jinping: Hu Jintao, Jiang Zemin, Peng Liyuan, Xi Mingze, Ke Lingling

Homophily (“love of the same”) is the tendency for people to associate with people similar to them.

Our analysis of 57 million email contact lists from a major Internet company provided us with sequences of name tokens sufficient for training distributed embeddings.

Nationality classification using name embeddings (with Junting Ye, Shuchu Han, Yifan Hu, Baris Coskun, Meizhu Liu, Steven Skiena), CIKM 2017.

Regions in Name Space

These embeddings make natural features in any learning task where you have names:

● Nationality/ethnicity detection for biomedical/sociology research
● Demographic analysis
● Social media analysis
● Security
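As a rough sketch of what "embeddings as features" means in practice, the example below feeds name vectors to a nearest-centroid classifier. Everything in it is hypothetical: the name tokens, the 2-D vectors, and the labels are invented, and nearest-centroid is a stand-in chosen for brevity, not the classifier used in the work described here.

```python
import math

# Toy 2-D "name embeddings"; tokens, values, and labels are invented.
train = [
    ("name_a1", [0.9, 0.1], "group_a"),
    ("name_a2", [0.8, 0.3], "group_a"),
    ("name_b1", [0.1, 0.9], "group_b"),
    ("name_b2", [0.2, 0.7], "group_b"),
]

def centroids(examples):
    """Average the embedding vectors of each label."""
    sums, counts = {}, {}
    for _, vec, label in examples:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def classify(vec, cents):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(cents, key=lambda label: math.dist(vec, cents[label]))

cents = centroids(train)
predicted = classify([0.85, 0.2], cents)
```

Any standard classifier could take the place of the centroid rule; the point is only that the embedding vector is the input feature.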

NamePrism: A nationality classifier (www.name-prism.com)

Research Project Goal | Research Group | Country
“the impact of having foreign-sounding name on job search in financial” | Nanyang Technological University | Singapore
“determine if ethnic group size impacts national cabinet diversity” | Department of Political Science, Washington University in St. Louis | U.S.
“promote the contributions of Iranian Americans to members within and outside of the Iranian community living in America.” | Iranian Americans' Contributions Project | U.S.
“determine if ethnicity plays a part/plays no part in whether a written evidence submitted to a Parliamentary Inquiry is accepted or rejected” | Parliamentary Digital Service | UK
“working on a study on the network effects for long term unemployed” | German Institute for Employment Research | Germany
“unveiling the origins of French citizens in order to study discrimination in several areas of the French society” | Laboratoire Interdisciplinaire Sciences Innovations Sociétés (LISIS) | France
“Investigate whether hosts on Airbnb get discriminated based on their ethnicity” | Stockholm School of Economics | Sweden

● WIRED Magazine;

● API used by 156 social science research projects

● 69.9 million names analyzed (roughly the population of France)

Features From Graphs

● Anomaly Detection
● Attribute Prediction
● Clustering
● Link Prediction
● ...

[Figure: adjacency matrix with |V| columns]

A first step in machine learning for graphs is to extract graph features:
● node degree
● pairs: # of common neighbors
● groups: cluster assignments
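The first two hand-built features listed above are easy to compute from an adjacency list. This is a minimal sketch on a four-vertex undirected graph whose edges are invented for the example; the helper names are hypothetical.

```python
# Small undirected graph as an adjacency list; the edges are invented.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}

def degree(g, node):
    """Node feature: number of neighbors."""
    return len(g[node])

def common_neighbors(g, u, v):
    """Pair feature: number of vertices adjacent to both u and v."""
    return len(g[u] & g[v])

degrees = {node: degree(graph, node) for node in graph}
shared_a_b = common_neighbors(graph, "a", "b")  # both are adjacent to "c"
shared_a_d = common_neighbors(graph, "a", "d")  # both are adjacent to "b"
```

Features like these must be designed one at a time, which is the motivation for learning latent features directly from the graph.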

Advantages of DeepWalk

Bryan Perozzi DeepWalk: Online Learning of Social Representations

[Figure: DeepWalk maps the adjacency matrix (|V| columns) to latent dimensions (d << |V|), which serve as features for the tasks below.]

● Anomaly Detection
● Attribute Prediction
● Clustering
● Link Prediction
● ...

Advantages:
● Scalable: an online algorithm that does not use the entire graph at once
● Walks as sentences metaphor
● Works great!
● Implementation available: bit.ly/deepwalk

DeepWalk: The Entire Idea

Short random walks = sentences

■ We generate random walks for each vertex in the graph.
■ Each short random walk has length t.
■ Pick the next step uniformly from the vertex neighbors.
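The three steps above can be sketched as a small walk generator. The example graph, the number of walks per vertex, and the choice to count t in vertices rather than edges are assumptions made for illustration; `random_walk` and `build_corpus` are hypothetical helper names, not the released implementation's API.

```python
import random

def random_walk(graph, start, t, rng):
    """Walk of t vertices from start, choosing each next vertex uniformly among neighbors."""
    walk = [start]
    while len(walk) < t:
        neighbors = sorted(graph[walk[-1]])
        if not neighbors:  # dead end: stop early
            break
        walk.append(rng.choice(neighbors))
    return walk

def build_corpus(graph, walks_per_vertex, t, seed=0):
    """Generate walks_per_vertex walks starting at every vertex; each walk is one 'sentence'."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(walks_per_vertex):
        vertices = list(graph)
        rng.shuffle(vertices)  # visit start vertices in random order on each pass
        for v in vertices:
            corpus.append(random_walk(graph, v, t, rng))
    return corpus

# Toy undirected graph; the edges are invented.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}
corpus = build_corpus(graph, walks_per_vertex=2, t=5)
```

Each walk is then treated as a sentence whose words are vertex IDs, so the resulting corpus can be handed to a word-embedding trainer to produce one vector per vertex.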

Everyone is doing the DeepWalk!

2177 citations since August 2014! 28th in downloads from the ACM Digital Library over the past 12 months, from among all 4,188 KDD papers ever! (By comparison, the second best from KDD ’14 ranks 104th.)

You, too, can do the DeepWalk!

Identifying Historically Similar Entities

We can construct embeddings for people, places and things, to recognize similar entities.

Y. Chen, B. Perozzi, and S. Skiena “Vector-Based Similarity Measurements for Historical Figures”, SISAP 2015.

DeepWalk: Nearest Neighbors

Scarlett Johansson
● Kirsten Dunst (0.784)
● Natalie Portman (0.786)
● Gwyneth Paltrow (0.796)
● Brad Pitt (0.858)
● Cameron Diaz (0.891)

Steven Skiena
● Larry Page (1.597)
● Sergey Brin (1.598)
● Danny Hillis (1.644)
● Andrei Broder (1.652)
● Mark Weiser (1.653)

Barack Obama
● George W. Bush (0.474)
● Hillary Clinton (0.657)
● Bill Clinton (0.658)
● Joe Biden (0.750)
● Al Gore (0.791)

Albert Einstein
● Richard Feynman (1.049)
● Max Planck (1.073)
● Freeman Dyson (1.107)
● Stephen Hawking (1.153)
● Robert Oppenheimer (1.156)

Ludwig van Beethoven
● Franz Schubert (0.489)
● Johannes Brahms (0.532)
● Wolfgang Mozart (0.567)
● Robert Schumann (0.576)
● Gustav Mahler (0.635)

Mick Jagger
● John Lennon (0.687)
● Keith Richards (0.687)
● Paul McCartney (0.796)
● Ronnie Wood (0.822)
● Eric Clapton (0.833)
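Lists like these come from ranking all other entities by their distance to the query vector, smallest first. The sketch below assumes Euclidean distance, which the slide does not state, and uses invented entity names and 2-D vectors; `nearest_neighbors` is a hypothetical helper.

```python
import math

def nearest_neighbors(query, embeddings, k=5):
    """Return the k entities closest to query as (name, distance) pairs, nearest first."""
    q = embeddings[query]
    dists = [
        (name, math.dist(q, vec))
        for name, vec in embeddings.items()
        if name != query
    ]
    return sorted(dists, key=lambda pair: pair[1])[:k]

# Invented 2-D embeddings for illustration.
embeddings = {
    "entity_a": [0.0, 0.0],
    "entity_b": [0.3, 0.4],
    "entity_c": [1.0, 0.0],
    "entity_d": [3.0, 4.0],
}
result = nearest_neighbors("entity_a", embeddings, k=2)
```

For large collections, the linear scan over all entities would normally be replaced by an approximate nearest-neighbor index.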

Institute for AI-Driven Discovery and Innovation

Professor Steven Skiena
Director, AI Institute
November 2019

Mission of the Institute

• Promote activity in AI and related areas towards attracting more federal, state, industrial and private funding.
• Stimulate research and educational activities in AI in CS and across CEAS.
• Make Stony Brook a more attractive destination for graduate students and faculty interested in AI and related areas.
• Check out our website: ai.stonybrook.edu and @AI_SBU on Twitter.

Stony Brook AI Institute: Major Accomplishments 2018-19

• Building a stronger Stony Brook AI community
• Three SUNY EIP hires (Haibin Ling, Michael Ryoo, Zhaozheng Yin)!!
• Junior hire in machine learning (Yifan Sun, starting Fall 2020)
• NSF Major Research Infrastructure (MRI) grant for large AI/ML cluster
• First substantial philanthropic gift ($375K) for postdoctoral scholars program
• Communications hire (Dan Olawski)
• Bloomberg AI Institute kickoff

Core Faculty

Research Interests

We have strong research groups in several areas of AI:

• Biomedical Informatics
• Computer Vision
• Computational Logic and Reasoning
• Data Science
• Machine Learning
• Natural Language Processing
• Social Media Analysis

Educational Initiatives

• New undergraduate courses in Data Science and Natural Language Processing
• New undergraduate concentration in Artificial Intelligence
• New graduate concentration in Data Science, approved
• DS+X initiative with College of Arts and Sciences (CAS)

Student Collaborators: Stony Brook Data Science Lab

Bryan Perozzi, Vivek Kulkarni, Rami Al-Rfou, Junting Ye, Haochen Chen, Yingtao Tian

Manual Laboring

Questions?
