Jong Y. Choi, Joshua Rosen, Siddharth Maini, Marlon E. Pierce, and Geoffrey C. Fox Community Grids Laboratory Indiana University


Delicious example

[Figure: Delicious screenshot with callouts labeled "Bookmark", "Tags", "Social Networks", and "People-generated"]


Collaborative Tagging
- Online bookmarking with annotations
- Create social networks
- Utilize power of people’s knowledge

Pros and cons
- High-quality classifier by using human intelligence
- But lack of control or authority


Collective Collaborative Tagging (CCT) System

[Figure: CCT system architecture. Labeled components: Data Importer, Data Coordinator, User Service, Repository. Labeled flows: "Distributed Tagging Data" (RDF, RSS, Atom, HTML), "Populate Bookmarks/tags", "Query with various options", "Search Result" (SOAP, REST, …)]


1st: Service and algorithm development
- Identify services and algorithms

2nd: Interface development
- Web 2.0 style interface
- REST, SOAP, …

3rd: Export/import service development
- Merging distributed data sets
- Export data to build mash-up sites

So far, we are mainly in the 1st stage and have done some experiments in the 2nd stage


Different Data Sources

Various IR algorithms

Flexible Options

Result Comparison


| Type | Service | Description | Algorithm |
|------|---------|-------------|-----------|
| I | Searching | Given input tags, return the most relevant X (X = URLs, tags, or users) | Latent Semantic Indexing (LSI), FolkRank |
| II | Recommendation | Indirect input tags, return undiscovered X | |
| III | Clustering | Community discovery: find a group or a community with similar interests | K-Means, Deterministic Annealing Clustering |
| IV | Trend detection | Analyze tagging activities as a time series and detect abnormalities | Time Series Analysis |


Vector-space model (bag-of-words model)
- Assume n URLs and q tags
- A URL can be represented by a q-dimensional vector, d_i = (t_1, t_2, …, t_q)
- The total data set can be represented by an n-by-q matrix

Pairwise Dissimilarity Matrix
- n-by-n symmetric matrix
- Distance (Euclidean, Manhattan, …)
- Angles, cosine, sine, …
- O(n^2) complexity
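The vector-space representation and the pairwise dissimilarity matrix can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names (`build_matrix`, `pairwise`, and so on), the example URLs, and the use of raw tag counts as vector entries are hypothetical and are not taken from the CCT system.

```python
import math

def build_matrix(bookmarks, tags):
    """Return an n-by-q matrix: one row per URL, one column per tag."""
    index = {tag: j for j, tag in enumerate(tags)}
    matrix = []
    for url_tags in bookmarks.values():
        row = [0.0] * len(tags)
        for tag in url_tags:
            row[index[tag]] += 1.0  # assumption: entries are raw tag counts
        matrix.append(row)
    return matrix

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_dissimilarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0

def pairwise(matrix, dist):
    """n-by-n symmetric dissimilarity matrix; needs O(n^2) distance evaluations."""
    n = len(matrix)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = dist(matrix[i], matrix[j])
    return d

# Hypothetical toy data: URL -> tags applied to it.
bookmarks = {
    "http://example.org/a": ["python", "numpy"],
    "http://example.org/b": ["python"],
    "http://example.org/c": ["cooking"],
}
tags = ["python", "numpy", "cooking"]
X = build_matrix(bookmarks, tags)        # 3-by-3 here (n = 3 URLs, q = 3 tags)
D = pairwise(X, cosine_dissimilarity)    # swap in `euclidean` for a distance-based matrix
```

Any other measure from the list above (Manhattan distance, angle, …) can be passed as `dist` in the same way; the quadratic cost comes from the double loop, not from the measure.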


Graph model
- Building a graph with nodes and edges
- Edges indicate relationships
- Becoming complex networks (tag graph)

Dissimilarity
- Related to path distance
- Finding paths is important (shortest path problem)
- Naive approach: O(n^3) complexity

(Source: MSI-CIEC)
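To make the O(n^3) cost concrete, here is a minimal sketch of all-pairs shortest paths using the Floyd-Warshall algorithm. The slides do not say which algorithm the "naive approach" refers to, so Floyd-Warshall is an assumption chosen because it is a standard cubic-time method; the tiny tag graph is hypothetical.

```python
INF = float("inf")

def all_pairs_shortest_paths(n, edges):
    """Floyd-Warshall: path distance between every pair of nodes, O(n^3)."""
    dist = [[0.0 if i == j else INF for j in range(n)] for i in range(n)]
    for u, v, w in edges:
        # Undirected graph: keep the lighter edge if duplicates occur.
        dist[u][v] = dist[v][u] = min(dist[u][v], w)
    for k in range(n):              # allow node k as an intermediate stop
        for i in range(n):
            for j in range(n):
                via_k = dist[i][k] + dist[k][j]
                if via_k < dist[i][j]:
                    dist[i][j] = via_k
    return dist

# Hypothetical tag graph: nodes 0..3 are tags, edges link co-occurring tags.
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (0, 2, 3.0)]
dist = all_pairs_shortest_paths(4, edges)
```

The resulting `dist` matrix can serve as the graph-model counterpart of the pairwise dissimilarity matrix; unreachable pairs keep the value `inf`.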


Latent Semantic Indexing
- Using the vector-space model, find the URLs most similar to the user’s query tags
- Dimension reduction from high q to low d (q >> d)
- Removing noisy terms, extracting latent concepts

[Figure: precision/recall plot. Axes: Precision, Recall. Legend: 2 terms, 4 terms, 8 terms, 20% dim. reduction, None, Ideal Line]
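The idea of querying in a reduced latent space can be illustrated with a small, dependency-free sketch. LSI is normally computed with a truncated SVD; here, as a stand-in, the top-d right singular vectors are obtained by power iteration with deflation on the Gram matrix A^T A, which is adequate only for tiny examples. All names (`lsi_scores`, `top_eigenvectors`, …) and the toy data are hypothetical.

```python
import math

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def top_eigenvectors(sym, d, iters=500):
    """Top-d eigenvectors of a symmetric PSD matrix (power iteration + deflation)."""
    q = len(sym)
    work = [row[:] for row in sym]
    vectors = []
    for _ in range(d):
        v = [1.0 / math.sqrt(q)] * q
        for _ in range(iters):
            w = matvec(work, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        lam = sum(a * b for a, b in zip(v, matvec(work, v)))
        # Deflate: remove the component that was just found.
        work = [[work[i][j] - lam * v[i] * v[j] for j in range(q)] for i in range(q)]
        vectors.append(v)
    return vectors

def project(vec, basis):
    return [sum(a * b for a, b in zip(vec, v)) for v in basis]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def lsi_scores(docs, query, d):
    """Cosine similarity of each URL to the query in a d-dimensional latent space."""
    q = len(docs[0])
    # Gram matrix A^T A (q-by-q); its top eigenvectors are A's top right singular vectors.
    gram = [[sum(row[i] * row[j] for row in docs) for j in range(q)] for i in range(q)]
    basis = top_eigenvectors(gram, d)
    q_hat = project(query, basis)
    return [cosine(project(doc, basis), q_hat) for doc in docs]

# Hypothetical data; columns are the tags python, numpy, cooking, recipe.
docs = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],   # tagged only "numpy"
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
]
query = [1.0, 0.0, 0.0, 0.0]  # the user asks for "python"
scores = lsi_scores(docs, query, d=2)
```

In this toy data the URL tagged only "numpy" shares no tag with the query, so its raw cosine similarity is zero, yet it scores high after reduction because "python" and "numpy" co-occur elsewhere and collapse into one latent concept. For real data sizes a library SVD routine would be the sensible choice.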


- Discover the group structures of URLs
- Non-parametric learning algorithm
- Non-trivial optimization problem
- Should avoid local minima/maxima solutions


- Deterministically avoid local minima
- Tracing the global solution by changing the level of energy
- Analogy to the physical annealing process (High → Low)
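The annealing idea can be sketched as soft clustering with a temperature that is lowered step by step. This is a simplified illustration, not the CCT implementation: the number of clusters is fixed in advance, the small deterministic nudge that lets coincident centers separate is an assumption of this sketch, and the function name and parameters (`da_cluster`, `t_start`, `cooling`, …) are hypothetical.

```python
import math

def da_cluster(points, k, t_start, t_min=0.01, cooling=0.8, inner=30, eps=1e-3):
    """Simplified deterministic-annealing clustering with a fixed number of centers."""
    dim = len(points[0])
    n = len(points)
    centroid = [sum(p[j] for p in points) / n for j in range(dim)]
    centers = [centroid[:] for _ in range(k)]   # high T: every center sits at the centroid
    t = t_start
    while t > t_min:
        # Deterministic nudge so coincident centers can separate once T is low enough.
        for c in range(k):
            for j in range(dim):
                centers[c][j] += eps * (c - (k - 1) / 2.0)
        for _ in range(inner):
            sums = [[0.0] * dim for _ in range(k)]
            weights = [0.0] * k
            for p in points:
                d2 = [sum((p[j] - ctr[j]) ** 2 for j in range(dim)) for ctr in centers]
                base = min(d2)  # subtract the minimum for numerical stability
                g = [math.exp(-(d - base) / t) for d in d2]
                z = sum(g)
                for c in range(k):
                    w = g[c] / z            # soft assignment of point p to center c
                    weights[c] += w
                    for j in range(dim):
                        sums[c][j] += w * p[j]
            centers = [
                [sums[c][j] / weights[c] for j in range(dim)] if weights[c] > 0 else centers[c]
                for c in range(k)
            ]
        t *= cooling                        # cool down: High -> Low
    return centers

# Hypothetical 2-D data with two well-separated groups.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centers = da_cluster(points, k=2, t_start=200.0)
```

At high temperature every point belongs almost equally to every center, so the result does not depend on a random initialization; as the temperature drops the assignments harden and the centers settle on the two groups.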


Classification
- To respond more quickly to users’ requests
- Train on data based on users’ input and answer questions based on the training results
- Artificial Neural Network, Support Vector Machine, …

Trend Detection
- Can be used for prediction/forecasting
- Time-series analysis of tagging activities
- Markov chain model, Fourier transform, …
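As one illustration of the Fourier-transform option for trend detection, the sketch below finds the dominant period in a series of daily tag counts using a plain discrete Fourier transform. The data and the function names (`dft_magnitudes`, `dominant_period`) are hypothetical, and the naive O(n^2) transform is used only to keep the example dependency-free.

```python
import cmath

def dft_magnitudes(series):
    """Magnitude of each non-zero frequency bin of the mean-centered series."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    mags = []
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(centered))
        mags.append((k, abs(coeff)))
    return mags

def dominant_period(series):
    """Period (in samples) of the strongest periodic component."""
    k, _ = max(dft_magnitudes(series), key=lambda item: item[1])
    return len(series) / k

# Hypothetical daily counts for one tag over four weeks, busier on weekends.
counts = [15 if day % 7 in (5, 6) else 10 for day in range(28)]
period = dominant_period(counts)
```

A count that departs strongly from what the dominant periodic component predicts could then be flagged as a candidate abnormality; how to set that threshold is left open here.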


The goal of our Collective Collaborative Tagging (CCT) system
- Utilize various data sets
- Provide various information retrieval (IR) algorithms
- Help to utilize people-powered knowledge

Currently, various models and algorithms are being investigated

Service interfaces and import/export functions will be added soon


| | Vector-space Model | Graph Model |
|---|---|---|
| Representation | q-dimensional vector; q-by-n matrix | G(V, E); V = {URL, tags, users} |
| Dissimilarity | Distances, cosine, …; O(N^2) complexity | Paths, hops, connectivity, …; O(N^3) complexity |
| Algorithm | Latent Semantic Indexing; dimension reduction schemes; PCA | PageRank, FolkRank, …; pairwise clustering; MDS |


Pairwise clustering
- Input from vector-based model vs. graph model
- How to avoid local minima/maxima? (e.g., K-Means)

[Figure panels: Graph model, Vector-space model]