Mining Interest Topics from Plurk


Ken Yi-Chien Lee 2012/11/27

Outline

• Introduction
– Why and what we do in this thesis?
• The SNSD system
– Community detection
– Interest hierarchy
• Implementation
– Preprocessing
– Celery task queue
• Experiments
• Conclusions and future works

INTRODUCTION I want to make friends with you.

Scenario

Scenario (cont.)

Plurk Timeline

Private Status

Traffic Statistics of Plurk in Taiwan

The Go!Plurk Project

Issues:
1. Unable to analyze private users
2. The pie chart is too simple and gives no detailed interest information

THE SNSD SYSTEM Find out what the plurker is interested in.

Social Networking Service Discovery

• Discover users' interest topics via
1. Posted contents (plurks) from users
2. Aggregated interest information from communities for the private users
• Have to prepare
– Relationships
– Plurks

Work-flow of SNSD System

Aggregation and Derivation


Community Detection

• Snowball sampling
• Louvain algorithm
• Filtering
– Karma
– Gender
– Privacy

Snowball Sampling
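Snowball sampling can be sketched as a breadth-first expansion from a seed user, one "wave" of friends at a time. The sketch below is only an illustration under stated assumptions: `get_friends`, `max_depth` and the toy `FRIENDS` data are hypothetical and are not taken from the SNSD implementation, which would fetch relationships through the Plurk API instead.

```python
from collections import deque


def snowball_sample(seed, get_friends, max_depth=2):
    """Breadth-first snowball sample starting from one seed user.

    get_friends is a hypothetical callable returning the friend list of a user.
    Returns the set of sampled users and the set of undirected edges seen.
    """
    visited = {seed}
    edges = set()
    queue = deque([(seed, 0)])
    while queue:
        user, depth = queue.popleft()
        if depth == max_depth:
            continue  # users on the outermost wave are not expanded further
        for friend in get_friends(user):
            edges.add(frozenset((user, friend)))
            if friend not in visited:
                visited.add(friend)
                queue.append((friend, depth + 1))
    return visited, edges


# Hypothetical toy friendship data, for illustration only.
FRIENDS = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a"],
    "d": ["b", "e"],
    "e": ["d"],
}

users, edges = snowball_sample("a", lambda u: FRIENDS.get(u, []), max_depth=2)
print(sorted(users), len(edges))
```

With `max_depth=2` the user "e" is never reached, because "d" sits on the outermost wave and is not expanded.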

Modularity

• $Q$ = (number of edges within communities) - (expected number within communities)
• Idea:
– dense internal connections between the nodes within modules
– sparse connections between different modules
• Works as a measurement of the quality of partitions and as an objective function to optimize.

Definition of Modularity

$$Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{d_i d_j}{2m} \right] \delta(C_i, C_j)$$

– $A_{ij}$ = the weight of the edge between $i$ and $j$
– $d_i$ = degree of vertex $i$
– $m = \frac{1}{2} \sum_{ij} A_{ij}$, the number of edges of the graph
– $\delta(C_i, C_j) = 1$ if $C_i = C_j$, and $0$ otherwise
– $C_i$ is the community of vertex $i$
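The definition can be checked numerically with a direct double sum over vertex pairs. This is a minimal sketch for an unweighted, undirected graph; the function name `modularity` and the toy graph (two triangles joined by one edge) are assumptions made for illustration, not data from the thesis.

```python
def modularity(edges, community_of):
    """Modularity Q of a partition of an unweighted, undirected graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community_of: dict mapping each vertex to its community label.
    """
    m = len(edges)
    degree = {}
    adjacent = set()
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        adjacent.add((u, v))
        adjacent.add((v, u))

    total = 0.0
    for i in degree:
        for j in degree:
            if community_of[i] == community_of[j]:  # delta(C_i, C_j) = 1
                a_ij = 1.0 if (i, j) in adjacent else 0.0
                total += a_ij - degree[i] * degree[j] / (2 * m)
    return total / (2 * m)


# Hypothetical toy graph: two triangles joined by the edge (2, 3).
TOY_EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "x", 1: "x", 2: "x", 3: "y", 4: "y", 5: "y"}
one_group = {v: "x" for v in range(6)}

print(modularity(TOY_EDGES, two_groups))  # 5/14 for this toy graph
print(modularity(TOY_EDGES, one_group))   # 0 when everything is one community
```

Putting every vertex in a single community always gives $Q = 0$, since the observed and expected edge counts then cancel exactly.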

Expected Number of Edges Between Two Nodes

• $E(i \to j) = d_i \times \Pr(\to j) = d_i \times \frac{d_j}{2m}$

[Figure: vertices $i$ and $j$, annotated with $d_i$ and $\frac{d_j}{2m}$]

Lei Tang, Huan Liu, Community Detection and Mining in Social Media, 2010

Louvain Algorithm

• The Louvain algorithm is a heuristic greedy method based on modularity optimization
• Each pass of the Louvain algorithm consists of two phases
1. Look for small communities by optimizing modularity locally
2. Aggregate vertices in the same community and build a new network whose vertices are the communities
• Passes are repeated until a maximum of modularity is attained

Example

• $B_{ij} = A_{ij} - \frac{d_i d_j}{2m}$
• $\Delta_i(j) = B_{ij} - B_{ii}$ (modularity gain)
• $j^*(i) = \arg\max \{ \Delta_i(j) \mid j \in g \}$

[Figure: the example network with nine vertices, labeled 1 to 9]

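• $B_{11} = A_{11} - \frac{d_1 d_1}{2m} = 0 - \frac{3 \times 3}{2 \times 14} = -0.32$
• $B_{12} = A_{12} - \frac{d_1 d_2}{2m} = 1 - \frac{3 \times 2}{2 \times 14} = 0.79$
• $B_{13} = 1 - \frac{3 \times 3}{2 \times 14} = 0.68$
• $B_{14} = 1 - \frac{3 \times 4}{2 \times 14} = 0.57$
• $B_{15} = 0 - \frac{3 \times 4}{2 \times 14} = -0.43$
• $B_{16} = 0 - \frac{3 \times 4}{2 \times 14} = -0.43$
• $B_{17} = 0 - \frac{3 \times 4}{2 \times 14} = -0.43$
• $B_{18} = 0 - \frac{3 \times 3}{2 \times 14} = -0.32$
• $B_{19} = 0 - \frac{3 \times 1}{2 \times 14} = -0.11$
• $j^*(1) = 2$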
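The row of $B_{1j}$ values can be reproduced in a few lines. The figure of the example network did not survive extraction, so the edge list `EDGES` below is an assumption: it is one 9-vertex, 14-edge graph whose degrees and adjacencies agree with every number used in the calculation above. The helper names are also hypothetical.

```python
# Assumed edge list: consistent with the degrees (3, 2, 3, 4, 4, 4, 4, 3, 1)
# and m = 14 used in the worked example, but not recovered from the figure.
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]


def modularity_matrix(edges):
    """Return the sorted vertex list and B[i, j] = A_ij - d_i * d_j / (2m)."""
    m = len(edges)
    nodes = sorted({v for edge in edges for v in edge})
    degree = {v: 0 for v in nodes}
    adjacent = set()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        adjacent.add((u, v))
        adjacent.add((v, u))
    B = {(i, j): (1 if (i, j) in adjacent else 0) - degree[i] * degree[j] / (2 * m)
         for i in nodes for j in nodes}
    return nodes, B


def best_partner(nodes, B, i):
    """Vertex j != i with the largest B[i, j]; ties go to the lowest label."""
    return max((j for j in nodes if j != i), key=lambda j: B[i, j])


nodes, B = modularity_matrix(EDGES)
row_1 = [round(B[1, j], 2) for j in nodes]
partners = [best_partner(nodes, B, i) for i in nodes]
print(row_1)
print(partners)
```

Under this assumed edge list, `row_1` reproduces the nine values above and `best_partner` returns 2 for vertex 1.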

[Figure: the example network, with vertices grouped into communities as they merge]

{1,2} {1,2,3} {1,2,3,4} {5,8} {5,6,8} {7,9}

$i$: 1 2 3 4 5 6 7 8 9
$j^*(i)$: 2 1 2 1 8 8 9 5 7

[Figure: aggregated network with vertices {1,2,3,4}, {5,6,8} and {7,9}; weight labels 10, 6, 2, 2, 3]

$i$: {1,2,3,4} {5,6,8} {7,9}
$j^*(i)$: {1,2,3,4} {5,6,8} {5,6,8}

{{5,6,8}, {7,9}}

[Figure: aggregated network with vertices {1,2,3,4} and {5,6,7,8,9}; weight labels 10, 14, 2]

$i$: {1,2,3,4} {5,6,7,8,9}
$j^*(i)$: {1,2,3,4} {5,6,7,8,9}
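The aggregation phase, which collapses each community into a single vertex, can be sketched as follows. The edge list is again an assumption (a 9-vertex, 14-edge graph consistent with the degrees used in the example, since the figure itself was lost), and the labels "A", "B", "C", "D" and the function name `aggregate` are hypothetical. Each internal edge contributes 2 to its community's self-loop, because it is counted from both endpoints.

```python
from collections import Counter

# Assumed edge list, consistent with the worked example but not taken
# from the original figure.
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]


def aggregate(edges, community_of):
    """Collapse communities into vertices and return the new edge weights.

    Keys are (label, label) pairs; (c, c) is the self-loop of community c.
    """
    weights = Counter()
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        if cu == cv:
            weights[(cu, cu)] += 2  # internal edge, counted from both ends
        else:
            weights[tuple(sorted((cu, cv)))] += 1
    return dict(weights)


# Partition after the first pass: A = {1,2,3,4}, B = {5,6,8}, C = {7,9}.
first_pass = {1: "A", 2: "A", 3: "A", 4: "A", 5: "B", 6: "B", 8: "B", 7: "C", 9: "C"}
# Partition after the second pass: A = {1,2,3,4}, D = {5,6,7,8,9}.
second_pass = {1: "A", 2: "A", 3: "A", 4: "A", 5: "D", 6: "D", 7: "D", 8: "D", 9: "D"}

print(aggregate(EDGES, first_pass))
print(aggregate(EDGES, second_pass))
```

For the assumed graph, the resulting weights are the same numbers that appear as labels in the aggregated networks: 10, 6, 2, 2, 3 after the first pass and 10, 14, 2 after the second.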

Example (cont.)

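[Figure: four panels summarizing the example]

• original: the network with vertices 1 to 9
• 1st pass, phase 1: communities {1,2,3,4}, {5,6,8}, {7,9}
• 1st pass, phase 2: aggregated network of {1,2,3,4}, {5,6,8}, {7,9} (weight labels 10, 6, 2, 2, 3)
• 2nd pass, terminate: communities {1,2,3,4}, {5,6,7,8,9} (weight labels 10, 14, 2)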

INTEREST KEYWORDS HIERARCHY

Closure Table

Interest Keywords Hierarchy

[Figure: example interest keywords hierarchy with the nodes SNSD, Taeyeon, Bo Peep Bo Peep, Twinkle, YoonA, Gee, Girls Generation, PSY and Gangnam Style]
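A closure table stores one row for every ancestor/descendant pair, so a whole subtree can be read with a single query. The sketch below uses an in-memory SQLite database. The table layout, the function names and, in particular, the parent/child links between keywords are assumptions made for illustration: the slide only preserves the node labels, not how they are connected.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE closure ("
    " ancestor TEXT, descendant TEXT, depth INTEGER,"
    " PRIMARY KEY (ancestor, descendant))"
)


def add_keyword(db, keyword, parent=None):
    """Insert a keyword under an optional parent, maintaining all closure rows."""
    # Every node is its own ancestor at depth 0.
    db.execute("INSERT INTO closure VALUES (?, ?, 0)", (keyword, keyword))
    if parent is not None:
        # Every ancestor of the parent (including the parent) becomes an ancestor.
        db.execute(
            "INSERT INTO closure (ancestor, descendant, depth) "
            "SELECT ancestor, ?, depth + 1 FROM closure WHERE descendant = ?",
            (keyword, parent),
        )


def descendants(db, keyword):
    rows = db.execute(
        "SELECT descendant FROM closure WHERE ancestor = ? AND depth > 0 "
        "ORDER BY depth, descendant", (keyword,))
    return [row[0] for row in rows]


def ancestors(db, keyword):
    rows = db.execute(
        "SELECT ancestor FROM closure WHERE descendant = ? AND depth > 0 "
        "ORDER BY depth", (keyword,))
    return [row[0] for row in rows]


# Hypothetical links between keywords, for illustration only.
add_keyword(db, "Girls Generation")
add_keyword(db, "Taeyeon", parent="Girls Generation")
add_keyword(db, "YoonA", parent="Girls Generation")
add_keyword(db, "Twinkle", parent="Taeyeon")
add_keyword(db, "PSY")
add_keyword(db, "Gangnam Style", parent="PSY")

print(descendants(db, "Girls Generation"))
print(ancestors(db, "Twinkle"))
```

The trade-off is more rows on insert in exchange for subtree and ancestor lookups that need no recursion.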

CRAWLING SYSTEM How to dump Plurk.com?

Overview of Crawling System

ZeroMQ: The Intelligent Transport Layer

Work-flow of Crawling Task Queue

Plurk API

• Plurk API 2.0 is based on the OAuth Core 1.0a standard
• Requests should be signed using HMAC-SHA1
• API returns JSON encoded data
• No request rate limit
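HMAC-SHA1 signing under OAuth 1.0a can be sketched with the Python standard library alone. This is a generic illustration of the signing step, not the code of plurk-oauth: the function names, the example URL, and the key and secret values are all placeholders.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def percent_encode(value):
    """Percent-encode everything except the RFC 3986 unreserved characters."""
    return quote(str(value), safe="-._~")


def signature_base_string(method, url, params):
    """Build the OAuth 1.0a signature base string from the request parts."""
    pairs = sorted((percent_encode(k), percent_encode(v)) for k, v in params.items())
    normalized = "&".join(f"{k}={v}" for k, v in pairs)
    return "&".join([method.upper(), percent_encode(url), percent_encode(normalized)])


def sign_hmac_sha1(method, url, params, consumer_secret, token_secret=""):
    """Return the base64-encoded HMAC-SHA1 signature for a request."""
    key = f"{percent_encode(consumer_secret)}&{percent_encode(token_secret)}"
    base = signature_base_string(method, url, params)
    digest = hmac.new(key.encode("utf-8"), base.encode("utf-8"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


# Placeholder request and credentials, for illustration only.
params = {
    "oauth_consumer_key": "placeholder-key",
    "oauth_nonce": "abc123",
    "oauth_signature_method": "HMAC-SHA1",
    "oauth_timestamp": "1354000000",
    "oauth_version": "1.0",
}
signature = sign_hmac_sha1("GET", "https://example.com/resource", params,
                           consumer_secret="placeholder-secret")
print(signature)
```

Every request repeats this encode, sort and hash sequence, which is consistent with HMAC-SHA1 showing up as a cost at high request volume.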

Plurk API Library

• Original provider
– plurk-oauth by clsung
• Performance bottlenecks
– HTTP persistent connection
– JSON decode
– HMAC-SHA1
• Enhancements
– HTTP connection pool
– C extension for JSON and HMAC-SHA1

Performance Comparison

Time in seconds at each concurrency level:

concurrency   8      16     32     64     128    256    512
Original      53.71  27.49  15.44  15.50  14.97  13.21  13.13
Enhanced      52.77  26.74  14.10  11.17  9.45   7.94   7.08

An Example of a Plurk

Plurk Attributes

• _id
– The unique plurk id, used for identification of the plurk
• owner
– The owner/poster of this plurk
• content
– The formatted and filtered content, e.g. URLs will be turned into text tags and emoticons will be filtered
• content_raw
– The raw content as the user entered it
• posted
– The date this plurk was posted, in ISODate format

Plurks Preprocessing

URL Filtering

URL Filtering (cont.)


Normalization

Tokenization

Celery Task Queue


Datastore Architecture

• Why MongoDB?
– Auto-sharding
– Replica sets
• MongoDB cluster
– mongos
– Config servers
– Shard servers
• Deploy to Delta cloud cluster

MongoDB Server Layout

Cluster Configuration

Delta Cloud Server

Delta Cloud Server (cont.)

EXPERIMENTS

Environment

Experiment

• Sample 40 public plurkers
• public: get the top-64 most frequent interest keywords
• private: regard the plurker as private, derive his interest keywords from communities, and get the top-64 most frequent interest keywords
• len(intersect(public, private))
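The matching score `len(intersect(public, private))` can be sketched as a set intersection over the two top-k keyword lists. The function names and the toy keyword lists are hypothetical; the experiment uses k = 64, while the usage example uses k = 2 to keep the data small.

```python
from collections import Counter


def top_keywords(keywords, k=64):
    """Set of the k most frequent keywords in a list of keyword occurrences."""
    return {word for word, _ in Counter(keywords).most_common(k)}


def matching_count(public_keywords, derived_keywords, k=64):
    """Number of keywords shared by the two top-k sets."""
    return len(top_keywords(public_keywords, k) & top_keywords(derived_keywords, k))


# Hypothetical keyword occurrences, for illustration only.
public = ["kpop", "kpop", "music", "music", "food"]
derived = ["kpop", "kpop", "travel", "travel", "music"]

print(matching_count(public, derived, k=2))
```

A score close to k means the keywords derived from the communities recover most of what the plurker actually posts about.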

Result

Number of plurkers by number of matching keywords (# matching):

# matching   21 ~ 25  26 ~ 30  31 ~ 35  36 ~ 40  41 ~ 45  46 ~ 50  51 ~ 55
plurkers     3        6        7        16       4        3        1

LIVE DEMO Never

CONCLUSIONS AND FUTURE WORKS

Conclusions

• Construct an online SNSD system for Plurk users to find interesting topics and relationships
• Develop a new scalable crawling framework based on ZeroMQ
• Patch the plurk-oauth library
• Build a website for visualizing interests and relationships with D3.js

Future Works

• Interest hierarchy:
– Manageable UI
– Recommend by users
• Apply the SNSD system to Twitter for Western languages and Sina Weibo for mainland China
• Employ other community detection algorithms and optimize NetworkX

Future Works (cont.)

• Consider responses to a plurk and fan relationships in interest derivation

• Serve as a Plurk full-text search engine

Q & A

Thank you for listening.

CS Workstation Architecture

Delta Cluster Architecture