Characterising the Emergent Semantics in Twitter Lists. Andrés García-Silva†, Jeon-Hyung Kang*, Kristina Lerman*, Oscar Corcho†. †{hgarcia, ocorcho}@fi.upm.es, Facultad de Informática, Universidad Politécnica de Madrid, Spain. *{jeonhyuk,lerman}@isi.edu, Information Sciences Institute, University of Southern California, USA


Page 1: Characterising the Emergent Semantics in Twitter Lists

Characterising the Emergent Semantics in Twitter Lists

Andrés García-Silva †, Jeon-Hyung Kang*, Kristina Lerman*, Oscar Corcho †

† {hgarcia, ocorcho}@fi.upm.es

Facultad de Informática

Universidad Politécnica de Madrid, Spain

*{jeonhyuk,lerman}@isi.edu

Information Sciences Institute,

University of Southern California, USA

Page 2: Characterising the Emergent Semantics in Twitter Lists


Introduction

Twitter Lists

Page 3: Characterising the Emergent Semantics in Twitter Lists

Introduction

Curators and List Names

Page 4: Characterising the Emergent Semantics in Twitter Lists

Introduction

Members and List Names

Page 5: Characterising the Emergent Semantics in Twitter Lists

Introduction

Subscribers and List Names

Page 6: Characterising the Emergent Semantics in Twitter Lists

Introduction

• Previous examples showed individual uses of lists
• Some list names were related to each other
• What if we group the lists?

Page 7: Characterising the Emergent Semantics in Twitter Lists

Introduction

Lists where the Yahoo!Finance user is a member, grouped by frequency of membership

Lists where the NASDAQ user is a member, grouped by number of subscriptions

Page 8: Characterising the Emergent Semantics in Twitter Lists

Introduction: Research questions

[Figure: lists named Stocks, PersonalBanking, Investment and Banks, linked to Curator 1, Curator 2, Subscriber 1 and the list members]

• Is it possible to identify related keywords from list names according to the use given by the different user roles?
  • Are two list names related if they have been used by a similar set of curators?
  • Are two list names related if a similar set of users have subscribed to the corresponding lists?
  • Are two list names related if their corresponding lists have a similar set of members?
• Which user roles generate more related keywords?
• What types of relations between keywords can we obtain?
  • Synonyms, is-a, siblings..?

Page 9: Characterising the Emergent Semantics in Twitter Lists

Approach

Input: Twitter Lists

Elicit related keywords from Twitter lists
• Schema representation of keywords
  • Based on members
  • Based on subscribers
  • Based on curators
• Model to identify similar keywords
  • Vector Space Model
  • Latent Dirichlet Allocation

Output: pairs of related keywords per schema representation and model

Characterise the semantics of the relations
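A minimal sketch of the vector-space idea for the members-based representation, assuming each keyword is described by the users who are members of lists named with that keyword. The toy data, the function names and the raw-count weighting are illustrative assumptions; the slides do not state the weighting scheme that was used.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy data: (keyword taken from a list name, member of that list).
memberships = [
    ("stocks", "nasdaq"), ("stocks", "yahoofinance"),
    ("investment", "nasdaq"), ("investment", "yahoofinance"),
    ("investment", "bank_a"),
    ("comedy", "comedian_a"),
]

def keyword_vectors(pairs):
    """Map each keyword to a sparse vector of member counts."""
    vectors = {}
    for keyword, member in pairs:
        vectors.setdefault(keyword, Counter())[member] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def most_related(keyword, vectors, top_n=5):
    """Rank the other keywords by cosine similarity to `keyword`."""
    scores = [(other, cosine(vectors[keyword], vec))
              for other, vec in vectors.items() if other != keyword]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:top_n]

vectors = keyword_vectors(memberships)
ranking = most_related("stocks", vectors)
```

The same functions would apply to the subscriber-based or curator-based representation by swapping the second element of each pair for a subscriber or a curator.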

Page 10: Characterising the Emergent Semantics in Twitter Lists

Approach

Elicit related keywords from Twitter lists

Pairs of related keywords per schema representation and model

Characterise the semantics of the relations
• Similarity based on WordNet
  • Jiang & Conrath (Distributional Inf.)
  • Wu & Palmer (Hierarchical Inf.)
  • Path Length
• SPARQL queries over general KBs published as Linked Data
  • DBpedia, OpenCyc, and UMBEL

Specificity of relations
• WordNet path length: synonyms, is-a, siblings, indirect is-a
• Linked Data: synonyms (sameAs), binary relations (TypeOf, BT), object properties (Occupation)
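For reference, the two WordNet measures named on this slide are usually defined as follows. The symbols are not on the slides and are introduced here only for illustration: c_1 and c_2 are WordNet concepts, lcs is their least common subsumer, p(c) is the corpus probability of a concept, IC is information content, and depth is the depth of a concept in the hierarchy.

```latex
\begin{align}
IC(c) &= -\log p(c) \\
dist_{JC}(c_1, c_2) &= IC(c_1) + IC(c_2) - 2\, IC(lcs(c_1, c_2)) \\
sim_{WP}(c_1, c_2) &= \frac{2\, depth(lcs(c_1, c_2))}{depth(c_1) + depth(c_2)}
\end{align}
```

Jiang & Conrath is a distance, so lower values mean more closely related concepts, while Wu & Palmer is a similarity, so higher values mean more closely related concepts.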

Page 11: Characterising the Emergent Semantics in Twitter Lists

Experiment: Setup

• Data set
  • Total: 297,521 lists, 2,171,140 members, 215,599 curators, and 616,662 subscribers
• We extracted 5932 unique keywords from list names; 55% of them were found in WordNet.
  • We use approximate matching of the list names with dictionary entries
  • The dictionary was created from Wikipedia article titles
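The approximate matching of list names against a dictionary could look roughly like the sketch below. The slides do not say which matching technique was used, so the use of Python's `difflib`, the 0.8 cutoff, the normalisation step and the four-entry dictionary are all assumptions made for illustration.

```python
import difflib

# Hypothetical stand-in for a dictionary built from Wikipedia article titles.
dictionary = ["Biotechnology", "Comedy", "Investment", "Personal finance"]

def normalise(text):
    """Lower-case and replace separators that are common in list names."""
    return text.lower().replace("-", " ").replace("_", " ").strip()

def match_keyword(list_name, entries, cutoff=0.8):
    """Return the closest dictionary entry, or None if nothing reaches the cutoff."""
    lookup = {normalise(entry): entry for entry in entries}
    hits = difflib.get_close_matches(
        normalise(list_name), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[hits[0]] if hits else None

matched = match_keyword("investments", dictionary)   # close to "Investment"
unmatched = match_keyword("my-friends", dictionary)  # no entry is close enough
```

A stricter cutoff reduces false matches at the cost of leaving more list names without a keyword.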

Page 12: Characterising the Emergent Semantics in Twitter Lists

Experiment: Execution

Dataset

Elicit related keywords from Twitter lists
• Schema representation of keywords
  • Based on members
  • Based on subscribers
  • Based on curators
• Model to identify similar keywords
  • Vector Space Model
  • Latent Dirichlet Allocation

Pairs of related keywords per schema representation and model: each keyword with its 5 most related keywords

Characterise the semantics of the relations
• Similarity based on WordNet
  • Jiang & Conrath (Distributional Inf.)
  • Wu & Palmer (Hierarchical Inf.)
  • Path Length

WordNet Similarity

Page 13: Characterising the Emergent Semantics in Twitter Lists

Experiment: Data Analysis

[Figure: Pearson's correlation coefficients (y-axis: correlation values, -1 to 1); average J&C distance and W&P similarity]
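Pearson's correlation coefficient between a model's scores and a WordNet-based measure can be computed as in the sketch below. The score lists are made-up values for four hypothetical keyword pairs, not data from the experiment.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)

# Hypothetical scores for the same four keyword pairs.
model_similarity = [0.9, 0.7, 0.4, 0.1]  # e.g. similarity from a list-based model
wp_similarity = [0.8, 0.6, 0.5, 0.2]     # e.g. Wu & Palmer similarity
jc_distance = [1.0, 3.0, 6.0, 9.0]       # e.g. Jiang & Conrath distance

r_wp = pearson(model_similarity, wp_similarity)
r_jc = pearson(model_similarity, jc_distance)
```

Because J&C is a distance, a model that agrees with WordNet correlates negatively with it and positively with the W&P similarity.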

Page 14: Characterising the Emergent Semantics in Twitter Lists

Experiment: Data Analysis

Path Length in WordNet: % of relations found by each schema representation and model

Path Length | Members VSM | Members LDA | Subscribers VSM | Subscribers LDA | Curators VSM | Curators LDA
1 (synonyms) | 8.58% | 10.87% | 3.97% | 3.24% | 1.24% | 0.50%
2 (is-a) | 3.42% | 3.08% | 1.93% | 0.47% | 0.70% | 0.00%
3 (siblings, indirect is-a) | 2.37% | 3.77% | 2.96% | 2.06% | 2.38% | 4.03%
>3 | 67.61% | 65.5% | 67.2% | 67.5% | 77.8% | 75.8%

On average, 97.65% of the relations with a path length greater than 3 involve a common subsumer
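The path-length bands on this slide, together with the depth of the least common subsumer (LCS), can be illustrated on a toy hierarchy. The taxonomy and the synonym map below are hypothetical stand-ins for WordNet, and the sketch assumes that path length counts nodes, so that two words in the same concept are at length 1, which is the convention that makes the bands line up.

```python
# Hypothetical is-a taxonomy (child -> parent), a stand-in for WordNet hypernyms.
parent = {
    "entity": None,
    "person": "entity",
    "writer": "person",
    "poet": "writer",
    "novelist": "writer",
    "organization": "entity",
    "company": "organization",
}
# Hypothetical synonym map: words that share a concept with another word.
concept_of = {"author": "writer"}

def ancestors(node):
    """Chain from `node` up to the root, starting with `node` itself."""
    chain = []
    while node is not None:
        chain.append(node)
        node = parent[node]
    return chain

def lcs_and_path(a, b):
    """Return (LCS, depth of LCS, path length counted in nodes)."""
    chain_a = ancestors(concept_of.get(a, a))
    chain_b = ancestors(concept_of.get(b, b))
    for steps_a, node in enumerate(chain_a):
        if node in chain_b:
            steps_b = chain_b.index(node)
            depth = len(ancestors(node)) - 1  # the root has depth 0
            return node, depth, steps_a + steps_b + 1
    return None, None, None

def relation_type(a, b):
    """Label a keyword pair with the path-length band it falls into."""
    _, _, length = lcs_and_path(a, b)
    if length is None:
        return "no common subsumer"
    if length == 1:
        return "synonyms"
    if length == 2:
        return "is-a"
    if length == 3:
        return "siblings or indirect is-a"
    return "path length > 3"
```

For example, `relation_type("poet", "novelist")` falls in the sibling band, and `lcs_and_path("poet", "company")` meets only at the root, which has depth 0 and therefore signals a very general relation.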

Page 15: Characterising the Emergent Semantics in Twitter Lists

Experiment: Data Analysis

Depth (LCS) and path length as indicators of specificity

[Figure: relations in WordNet by depth of the least common subsumer]

[Figure: relations with depth(LCS) >= 5 by length of the path setting up the relation]

Page 16: Characterising the Emergent Semantics in Twitter Lists

Experiment: Findings

Summary
• Similarity models based on members
  • produce the results that are most correlated with the results of similarity measures based on WordNet
  • find more synonyms and direct is-a relations than the other models (path length).
• The majority of relations found by any model have a path length >= 3 and involve a common subsumer.
• Depth of LCS
  • VSM based on subscribers produces the highest number of specific relations (depth of LCS >= 5 or 6).
• Similarity models based on curators produce a lower number of relations.

Page 17: Characterising the Emergent Semantics in Twitter Lists

Experiment: Execution

Dataset

Elicit related keywords from Twitter lists
• Schema representation of keywords
  • Based on members
  • Based on subscribers
  • Based on curators
• Model to identify similar keywords
  • Vector Space Model
  • Latent Dirichlet Allocation

Pairs of related keywords per schema representation and model: each keyword with its 5 most related keywords

Characterise the semantics of the relations
• SPARQL queries over general KBs published as Linked Data
  • DBpedia, OpenCyc, and UMBEL

Ontological relations between keywords

Page 18: Characterising the Emergent Semantics in Twitter Lists

Experiment

• We anchor 63.77% of the keywords extracted from Twitter Lists to DBpedia resources

Page 19: Characterising the Emergent Semantics in Twitter Lists

Experiment

Vector-space model based on members (direct relations)

Relation type | % | Example of keywords
Broader Term | 26% | life-science, biotech
subClassOf | 26% | writers, authors
developer | 11% | google, google_apps
genre | 11% | funland, comedy
largest city | 6% | houston, texas
Others | 20% | -

Vector-space model based on subscribers (relations of length 3)

Linked data pattern (54.73%): x -> object <- y

Relations | % | Object | Keywords
type, type | 67.35% | company | nokia, intel
subClassOf, subClassOf | 30.61% | activities | philanthropy, fundraising

Linked data pattern (43.49%): x <- object -> y

Relations | % | Object | Keywords
genre, genre | 12.43% | Aesthetica | theater, film
occupation, genre | 10.27% | Adam Maxwell | fiction, writer
occupation, occupation | 8.11% | Alina Tugend | poet, writer
product, product | 7.57% | ChenOne | clothes, fashion
industry, product | 9.73% | UserLand Softw. | blogs, internet
known for, occupation | 5.41% | Adeline Yen Mah | author, writing
known for, known for | 3.78% | Rebecca Watson | skeptics, atheist
main interest, main interest | 3.24% | Aristotle | politics, government
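The direct relations and the two linked data patterns on this slide can be illustrated over a small in-memory set of triples. This is only a sketch: the study ran SPARQL queries over DBpedia, OpenCyc and UMBEL, whereas the triples, their directions and the function names below are hypothetical.

```python
# Hypothetical triples (subject, property, object), a stand-in for a Linked Data source.
triples = [
    ("nokia", "type", "company"),
    ("intel", "type", "company"),
    ("Aristotle", "main interest", "politics"),
    ("Aristotle", "main interest", "government"),
    ("google_apps", "developer", "google"),
]

def direct_relations(x, y):
    """Properties that link x and y directly, in either direction."""
    return [p for s, p, o in triples if {s, o} == {x, y}]

def shared_object(x, y):
    """Pattern x -> object <- y: both keywords point to the same resource."""
    return [(px, obj, py)
            for sx, px, obj in triples if sx == x
            for sy, py, oy in triples if sy == y and oy == obj]

def shared_subject(x, y):
    """Pattern x <- object -> y: one resource points to both keywords."""
    return [(px, subj, py)
            for subj, px, ox in triples if ox == x
            for sy, py, oy in triples if sy == subj and oy == y]

direct = direct_relations("google", "google_apps")
same_type = shared_object("nokia", "intel")
same_source = shared_subject("politics", "government")
```

Each result keeps the property names on both sides, which is what allows pairs to be grouped by combinations such as type/type or occupation/genre.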

Page 20: Characterising the Emergent Semantics in Twitter Lists

Conclusions

• Different models to elicit related keywords from Twitter lists.
  • Curators, subscribers and members; VSM and LDA
• Characterise the semantics of relations: WordNet-based similarity measures and SPARQL queries over linked data sets

Page 21: Characterising the Emergent Semantics in Twitter Lists

Conclusions

• Vector-space and LDA models based on members produce the results most correlated with those of WordNet-based metrics.
  • Shortest JC distance and highest WP similarities
• According to the path length in WordNet
  • Models based on members produce more synonyms and direct is-a relations
  • Most of the relations have path length ≥ 3 and have a common subsumer
• Depth of LCS
  • Vector-space model based on subscribers finds the highest number of relations (depth LCS ≥ 5 and 4 ≤ path length ≤ 0)
• We confirm these results according to linked data sets
