Characterising the Emergent Semantics in Twitter Lists


Andrés García-Silva †, Jeon-Hyung Kang *, Kristina Lerman *, Oscar Corcho †

† {hgarcia, ocorcho}@fi.upm.es

Facultad de Informática

Universidad Politécnica de Madrid, Spain

*{jeonhyuk,lerman}@isi.edu

Information Sciences Institute,

University of Southern California, USA


Introduction

Twitter Lists

Curators and List Names

Members and List Names

Subscribers and List Names
• Previous examples showed individual uses of lists

• Some list names were related to each other

• What happens if we group the lists?


Lists where the Yahoo!Finance user is a member, grouped by frequency of membership

Lists where the NASDAQ user is a member grouped by number of subscriptions


[Figure: example lists (Stocks, Personal Banking, Investment, Banks) with Curator 1, Curator 2, Subscriber 1, and the list members]

Introduction: Research questions

• Is it possible to identify related keywords from list names according to the use given by the different user roles?
  • Are two list names related if they have been used by a similar set of curators?
  • Are two list names related if a similar set of users have subscribed to the corresponding lists?
  • Are two list names related if their corresponding lists have a similar set of members?
• Which user roles generate more related keywords?
• What types of relations between keywords can we obtain?
  • Synonyms, is-a, siblings...?


Approach

1. Elicit related keywords from Twitter lists
2. Characterise the semantics of the relations

Input: Twitter Lists

Schema representation of keywords
• Based on members
• Based on subscribers
• Based on curators

Model to identify similar keywords
• Vector Space Model (VSM)
• Latent Dirichlet Allocation (LDA)

Output: pairs of related keywords per schema representation and model
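The member-based vector space model can be illustrated with a minimal sketch. It assumes each keyword is represented as a vector of raw member counts and that keywords are compared with cosine similarity; the slides do not state the weighting scheme, so that choice, the function names, and the toy lists are all hypothetical.

```python
from collections import Counter
from math import sqrt


def keyword_vectors(lists):
    """Build one sparse vector per keyword.

    Each vector maps a member id to the number of lists named with that
    keyword that contain the member (raw counts, an assumed weighting).
    """
    vectors = {}
    for keyword, members in lists:
        vectors.setdefault(keyword, Counter()).update(members)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_related(keyword, vectors, top_n=5):
    """Return the top_n keywords most similar to `keyword`, best first."""
    scores = [
        (other, cosine(vectors[keyword], vec))
        for other, vec in vectors.items()
        if other != keyword
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy data (hypothetical): (keyword taken from the list name, list members).
TOY_LISTS = [
    ("stocks", ["nasdaq", "yahoofinance", "wsj"]),
    ("investment", ["nasdaq", "yahoofinance"]),
    ("banks", ["yahoofinance", "citi"]),
    ("comedy", ["funnyordie"]),
]

if __name__ == "__main__":
    vectors = keyword_vectors(TOY_LISTS)
    for other, score in most_related("stocks", vectors):
        print(f"stocks ~ {other}: {score:.3f}")
```

The subscriber-based and curator-based representations would follow the same shape, with subscriber or curator ids in place of member ids.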


Approach

Characterise the semantics of the relations

Input: pairs of related keywords per schema representation and model

Similarity based on WordNet
• Jiang & Conrath (distributional information)
• Wu & Palmer (hierarchical information)
• Path length

SPARQL queries over general KBs published as Linked Data
• DBpedia, OpenCyc, and UMBEL

Relations characterised
• Synonyms, is-a, siblings, indirect is-a
• Specificity of relations
• Synonyms (sameAs)
• Binary relations (TypeOf, BT)
• Object properties (Occupation)
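The hierarchy-based measures can be sketched on a toy taxonomy. The example below is not WordNet: the hypernym tree is hypothetical, and the path length is counted in nodes so that identical concepts give 1, a direct is-a gives 2, and siblings give 3, which is an assumption made to mirror the path-length categories used in these slides. Wu & Palmer similarity is computed as 2 * depth(LCS) / (depth(a) + depth(b)), with the root at depth 1. Jiang & Conrath is omitted because it needs corpus-based information content.

```python
# Toy single-rooted hypernym tree (hypothetical): child -> parent.
TOY_HYPERNYMS = {
    "entity": None,
    "person": "entity",
    "writer": "person",
    "poet": "writer",
    "novelist": "writer",
    "organisation": "entity",
    "company": "organisation",
}


def ancestors(node, parent=TOY_HYPERNYMS):
    """Return the chain from `node` up to the root, node first."""
    chain = []
    while node is not None:
        chain.append(node)
        node = parent[node]
    return chain


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(ancestors(node))


def lcs(a, b):
    """Least common subsumer: the deepest node that subsumes both a and b."""
    seen = set(ancestors(a))
    for candidate in ancestors(b):  # walks upward, so the first hit is deepest
        if candidate in seen:
            return candidate
    raise ValueError("nodes do not share a root")


def path_length(a, b):
    """Number of nodes on the shortest path between a and b through the LCS."""
    common = depth(lcs(a, b))
    edges = (depth(a) - common) + (depth(b) - common)
    return edges + 1


def wu_palmer(a, b):
    """Wu & Palmer similarity in (0, 1]; higher means more similar."""
    return 2 * depth(lcs(a, b)) / (depth(a) + depth(b))


if __name__ == "__main__":
    for pair in [("poet", "writer"), ("poet", "novelist"), ("poet", "company")]:
        print(pair, path_length(*pair), round(wu_palmer(*pair), 3), lcs(*pair))
```

In this sketch a deeper least common subsumer indicates a more specific relation: "poet" and "novelist" meet at "writer", whereas "poet" and "company" meet only at the root.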


Experiment: Setup

• Data set
  • Total: 297,521 lists, 2,171,140 members, 215,599 curators, and 616,662 subscribers
• We extracted 5,932 unique keywords from list names; 55% of them were found in WordNet.
  • We use approximate matching of the list names with dictionary entries
  • The dictionary was created from Wikipedia article titles
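The keyword extraction step relies on approximate matching of list names against a dictionary. The slides do not say which matching algorithm was used, so the sketch below substitutes Python's standard `difflib.get_close_matches` as a stand-in. The normalisation rule, the cutoff value, and the toy dictionary are assumptions for illustration only.

```python
from difflib import get_close_matches


def normalise(name):
    """Lower-case a list name and treat hyphens and underscores as spaces."""
    return name.lower().replace("-", " ").replace("_", " ").strip()


def match_keyword(list_name, dictionary, cutoff=0.85):
    """Return the closest dictionary entry, or None if nothing is close enough.

    `cutoff` is an assumed threshold on difflib's similarity ratio (0 to 1).
    """
    candidates = get_close_matches(
        normalise(list_name), dictionary, n=1, cutoff=cutoff
    )
    return candidates[0] if candidates else None


# Toy dictionary (hypothetical) standing in for Wikipedia article titles.
TOY_DICTIONARY = ["investment", "stock", "bank", "personal banking", "comedy"]

if __name__ == "__main__":
    for name in ["Investments", "personal-banking", "Stocks", "xyzzy"]:
        print(name, "->", match_keyword(name, TOY_DICTIONARY))
```

A stricter cutoff reduces false matches at the cost of leaving more list names unmatched.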


Experiment: Execution

Dataset
• Schema representation: based on members, based on curators
• Model: Vector Space Model
• Pairs of related keywords per schema representation and model
• Each keyword with the 5 most related keywords
• WordNet similarity
  • Jiang & Conrath (distributional information)
  • Wu & Palmer (hierarchical information)
  • Path length


Experiment: Data Analysis

Pearson's correlation coefficient

Average J&C distance and W&P similarity

[Figure: correlation values (-1 to 1)]
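Pearson's correlation coefficient, used in the data analysis to compare each model's similarity scores with the WordNet-based measures, can be computed as in the sketch below. The score lists are invented toy values, not figures from the experiment, and the function name is hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return covariance / (spread_x * spread_y)


# Toy scores (hypothetical) for four keyword pairs.
MODEL_SIMILARITY = [0.9, 0.7, 0.4, 0.1]       # e.g. cosine similarity from a model
WORDNET_SIMILARITY = [0.8, 0.75, 0.3, 0.2]    # e.g. Wu & Palmer similarity

if __name__ == "__main__":
    print(round(pearson(MODEL_SIMILARITY, WORDNET_SIMILARITY), 3))
```

A value close to 1 would mean that the model ranks keyword pairs much like the WordNet measure. For a distance such as Jiang & Conrath, agreement shows up as a negative correlation instead.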


Path length                   Members            Subscribers        Curators
                              VSM      LDA       VSM      LDA       VSM      LDA
1 (synonyms)                  8.58%    10.87%    3.97%    3.24%     1.24%    0.50%
2 (is-a)                      3.42%    3.08%     1.93%    0.47%     0.70%    0.00%
3 (siblings, indirect is-a)   2.37%    3.77%     2.96%    2.06%     2.38%    4.03%
>3                            67.61%   65.5%     67.2%    67.5%     77.8%    75.8%


On average, 97.65% of the relations with a path length greater than 3 involve a common subsumer.

Path Length in WordNet

% of relations found by each schema representation and model


Depth (LCS) and path length as indicators of specificity

[Figure: relations in WordNet, by depth of the least common subsumer]

[Figure: relations with depth(LCS) >= 5, by length of the path setting up the relation]


Experiment: Findings

Summary
• Similarity models based on members
  • produce the results that are most correlated with those of the similarity measures based on WordNet
  • find more synonyms and direct is-a relations than the other models (path length).
• The majority of relations found by any model have a path length >= 3 and involve a common subsumer.
• Depth of LCS
  • VSM based on subscribers produces the highest number of specific relations (depth of LCS >= 5 or 6).
  • Similarity models based on curators produce a lower number of relations.


Experiment: Execution

Dataset
• Schema representation: based on members, based on curators
• Model: Vector Space Model
• Pairs of related keywords per schema representation and model
• Each keyword with the 5 most related keywords
• Ontological relations between keywords
  • SPARQL queries over general KBs published as Linked Data
  • DBpedia, OpenCyc, and UMBEL


Experiment

• We anchor 63.77% of the keywords extracted from Twitter Lists to DBpedia resources

Linked data pattern (54.73%): x -> object <- y

Relations                      %        Object             Keywords
type, type                     67.35%   company            nokia, intel
subClassOf, subClassOf         30.61%   activities         philanthropy, fundraising

Linked data pattern (43.49%): x <- object -> y

Relations                      %        Object             Keywords
genre, genre                   12.43%   Aesthetica         theater, film
occupation, genre              10.27%   Adam Maxwell       fiction, writer
occupation, occupation         8.11%    Alina Tugend       poet, writer
product, product               7.57%    ChenOne            clothes, fashion
industry, product              9.73%    UserLand Softw.    blogs, internet
known for, occupation          5.41%    Adeline Yen Mah    author, writing
known for, known for           3.78%    Rebecca Watson     skeptics, atheist
main interest, main interest   3.24%    Aristotle          politics, government

Relation type    %      Example of keywords
Broader Term     26%    life-science, biotech
subClassOf       26%    writers, authors
developer        11%    google, google_apps
genre            11%    funland, comedy
largest city     6%     houston, texas
Others           20%    -

Vector-space model based on members (direct relations)

Vector-space model based on subscribers (relations of length 3)
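The two linked data patterns, x -> object <- y and x <- object -> y, can be illustrated over a small in-memory set of triples. This is a sketch only: the experiment ran SPARQL queries against DBpedia, OpenCyc, and UMBEL, whereas the code below uses plain Python tuples, hypothetical function names, and a few toy triples modelled on the example rows of the tables.

```python
def shared_object_patterns(triples, x, y):
    """Pattern x -> object <- y: both keywords point at the same resource.

    Returns (relation from x, shared object, relation from y) tuples.
    """
    from_x = [(p, o) for s, p, o in triples if s == x]
    from_y = [(p, o) for s, p, o in triples if s == y]
    return [(px, ox, py) for px, ox in from_x for py, oy in from_y if ox == oy]


def shared_subject_patterns(triples, x, y):
    """Pattern x <- object -> y: one resource points at both keywords.

    Returns (relation to x, shared resource, relation to y) tuples.
    """
    into_x = [(s, p) for s, p, o in triples if o == x]
    into_y = [(s, p) for s, p, o in triples if o == y]
    return [(px, sx, py) for sx, px in into_x for sy, py in into_y if sx == sy]


# Toy triples (subject, predicate, object), modelled on the table examples.
TOY_TRIPLES = [
    ("nokia", "type", "company"),
    ("intel", "type", "company"),
    ("Alina Tugend", "occupation", "poet"),
    ("Alina Tugend", "occupation", "writer"),
    ("Aesthetica", "genre", "theater"),
    ("Aesthetica", "genre", "film"),
]

if __name__ == "__main__":
    print(shared_object_patterns(TOY_TRIPLES, "nokia", "intel"))
    print(shared_subject_patterns(TOY_TRIPLES, "poet", "writer"))
```

Against a real endpoint, each function would correspond to a SPARQL query with two triple patterns sharing one variable.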


Conclusions

• Different models to elicit related keywords from Twitter lists.
  • Curators, subscribers, and members; VSM and LDA
• Characterise the semantics of relations: WordNet-based similarity measures and SPARQL queries over linked data sets


• Vector-space and LDA models based on members produce the results most correlated with those of WordNet-based metrics.
  • Shortest JC distance and highest WP similarities
• According to the path length in WordNet
  • Models based on members produce more synonyms and direct is-a relations
  • Most of the relations have path length ≥ 3 and have a common subsumer
• Depth of LCS
  • Vector-space model based on subscribers finds the highest number of relations (depth LCS ≥ 5 and 4 ≤ path length ≤ 0)
• We confirm these results according to linked data sets
