From Search to Predictions in Tagged Information Spaces

Christoph Trattner, 30.10.2014 – Yahoo! Labs, Barcelona

Christoph Trattner, Know-Center

ctrattner@know-center.at

@Graz University of Technology, Austria

Before I start this presentation, I will talk a bit about myself and my background…

Where do I come from (Austria)?

Academic background?

Studied Computer Science at Graz University of Technology & University of Pittsburgh

Worked since 2009 as a scientific researcher at the KMI & IICM (BSc 2008, MSc 2009)

My PhD thesis was on Search & Navigation in Social Tagging Systems (defended 2012)

Since Feb. 2013 @ Know-Center: leading the Social Computing area

At TUG: Web Science, Semantic Technologies

My team

2 Post-Docs, 5 Pre-Docs (2 more to join soon)

2 MSc students, 2 BSc students

DI. Dieter Theiler

DI. Dominik Kowald

Dr. Peter Kraker

Mag. Sebastian Dennerlein

Dr. Elisabeth Lex

Mag. Matthias Rella

DI. Emanuel Lacic

DI. Ilire Hasani

Thanks to my Collaborators

What is my group doing?

… we research novel methods and tools that exploit social data to generate greater value for individuals, communities, companies and society as a whole.

Our competences:
• Network & Web Science
• Science 2.0
• Predictive Modeling
• Social Network Analysis
• Information Quality Assessment
• User Modeling
• Machine Learning and Data Mining
• Collaborative Systems

Our Services:
• Social Analytics: Hub-, Expert-, Community-, Influencer-, Information Flow-, Trend (Event) Detection, etc.
• Information Quality Assessment
• Social & Location-based Recommender Systems
• Customer Segmentation
• Social Systems Design

Some industry partners...

Current projects

BlancNoir - “Towards a Big Data recommender engine for offline and online marketplaces”

I2F - “Towards a Social Media and Online Marketing Manager Seminar”

Automation-X - “Towards a scalable Graph-based Visual search solution”

Styria - “Towards a scalable crowd-based hierarchical cluster labeling approach for willhaben.at”

TripRebel - “Towards an engaging hybrid hotel recommender solution for triprebel.com”

CDS - “Towards a scalable Entity & Graph-based Visual search solution for cds.at”

Exthex - “Towards an efficient viral social media marketing campaign in Facebook and Twitter”

The Projects

Project 1: Mendeley – UK Startup (recently acquired by Elsevier): Interested in the problem of hierarchical concept-based search in tagged information spaces.

Project 2: Tallinn University – Interested in the problem of recommending tags and items in tagged information spaces.

Ok, let’s start….

Project 1

Mendeley – UK Startup (recently acquired by Elsevier): Interested in the problem of hierarchical concept-based search.

Research Question 1:

What kind of meta-data is more useful for search in information systems - tags or keywords?

Externals involved: • Mendeley, London, UK

Helic, D., Körner, C., Granitzer, M., Strohmaier, M. and Trattner, C. 2012. Navigational Efficiency of Broad vs. Narrow Folksonomies. In Proceedings of the 23rd ACM Conference on Hypertext and Social Media (HT 2012), ACM, New York, NY, USA, pp. 63-72.

Mendeley

Keywords

Mendeley Desktop

What is the best way to extract hierarchies from tagged information spaces? What is more useful for navigation – keyword or tag hierarchies?

Different types of hierarchy induction algorithms

Helic, D., Strohmaier, M., Trattner, C., Muhr, M. and Lerman, K.: Pragmatic Evaluation of Folksonomies, In Proceedings of the 20th international conference on World Wide Web (WWW 2011), ACM, New York, NY, USA, 417-426, 2011.

Issue (!!!)

...no literature on what type of hierarchy is best suited for searching...

D. J. Watts, P. S. Dodds, and M. E. J. Newman. Identity and search in social networks. Science, 296:1302–1305, 2002.

J. M. Kleinberg. Navigation in a small world. Nature, 406(6798):845, August 2000.

Stanley Milgram

A social psychologist at Yale and Harvard University

Study on the Small World Problem, beyond well-defined communities and relations (such as actors, scientists, …)

“An Experimental Study of the Small World Problem”

1933-1984

Set Up

Target person: A Boston stockbroker

Three starting populations:
• 100 “Nebraska stockholders”
• 96 “Nebraska random”
• 100 “Boston random”

[Diagram: starting populations (Nebraska random, Nebraska stockholders, Boston random) and the target (Boston stockbroker)]

Results

How many of the starters would be able to establish contact with the target?
• 64 out of 296 reached the target

How many intermediaries would be required to link starters with the target? Well, that depends:
• Overall mean: 5.2 links
• Through hometown: 6.1 links
• Through business: 4.6 links
• Boston group faster than Nebraska groups
• Nebraska stockholders not faster than Nebraska random

What form would the distribution of chain lengths take?

Hierarchical decentralized searcher

Information Network

Hierarchy

Results

Validation

We compared simulations with human click trails of the online game The Wiki Game (http://thewikigame.com/)

Contains 1,500,000 click trails of more than 500,000 users with (start, target) information.

Hierarchy Creation (1)

Two types of hierarchies were evaluated

1.) The first type is based on our previous work. Categorial concepts:

• Tags from Delicious
• Category labels from Wikipedia

Similarity Graph

Latent Hierarchical Taxonomy

Wikipedia Category Label Dataset: 2,300,000 category labels, 4,500,000 articles, 30,000,000 category label assignments

Delicious Tag Dataset: 440,000 tags, 580,000 articles and 3,400,000 tag assignments

Hierarchy Creation (2)

2.) Second type is based on the work of [Muchnik et al. 2007]

Muchnik, L., Itzhack, R., Solomon, S. and Louzoun, Y.: Self-emergence of knowledge trees: Extraction of the Wikipedia hierarchies, Physical Review E 76, 016106 (2007)

Simple idea: The algorithm iterates through all links in the network and decides whether each link is of a hierarchical type. If it is, the link remains in the network; otherwise it is removed.
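The link-filtering idea can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the degree-ratio test used to decide whether a link is "hierarchical" is an assumed stand-in, not necessarily one of the criteria defined by Muchnik et al.

```python
from collections import defaultdict

def make_degree_criterion(links, ratio=2.0):
    """Hypothetical criterion (an assumption, not taken from the paper):
    a link src -> dst counts as hierarchical if dst has at least `ratio`
    times the in-degree of src, i.e. dst looks like the more general page."""
    in_degree = defaultdict(int)
    for _, dst in links:
        in_degree[dst] += 1

    def is_hierarchical(src, dst):
        return in_degree[dst] >= ratio * max(in_degree[src], 1)

    return is_hierarchical

def filter_hierarchical_links(links, is_hierarchical):
    """Iterate through all links; keep the hierarchical ones, drop the rest."""
    return [(src, dst) for src, dst in links if is_hierarchical(src, dst)]

# Toy link network (made-up data).
links = [
    ("Graz", "Austria"), ("Vienna", "Austria"),
    ("Linz", "Austria"), ("Salzburg", "Austria"),
    ("Graz", "Vienna"), ("Vienna", "Graz"),
]
kept = filter_hierarchical_links(links, make_degree_criterion(links))
```

In this toy network only the four links pointing to "Austria" survive; the two links between "Graz" and "Vienna" connect pages of equal in-degree and are removed.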

Directed link-network dataset of the English Wikipedia from February 2012.

All in all, the dataset includes around 10,000,000 articles and around 250,000,000 links

Validation: Human Searchers

...ok, let’s come back to the Mendeley “problem”...

Are keyword hierarchies better for search than social tag hierarchies?

Keywords

Results: Our Greedy Navigator (= Simulator) needs on average one click more with keywords to reach the target node than with tags

Results:

With simulations we find that tag-based hierarchies are more efficient for navigation than keyword-based hierarchies
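A greedy navigator of this kind can be sketched as follows. The sketch assumes that the searcher sees only the neighbours of the current node and uses a hierarchy as background knowledge, always clicking the unvisited neighbour that is closest to the target in that hierarchy. The function names, the tree-distance measure and the toy data are illustrative assumptions, not the simulator used in the study.

```python
def tree_distance(parent, a, b):
    """Number of hierarchy edges between a and b.
    `parent` maps every node to its parent; the root maps to None."""
    depth_from_a = {}
    node, steps = a, 0
    while node is not None:
        depth_from_a[node] = steps
        node, steps = parent[node], steps + 1
    node, steps = b, 0
    while node not in depth_from_a:
        node, steps = parent[node], steps + 1
    return depth_from_a[node] + steps

def greedy_navigate(network, parent, start, target, max_clicks=20):
    """Return the click path from start to target, or None if the searcher
    runs out of unvisited neighbours or exceeds max_clicks."""
    path, visited, current = [start], {start}, start
    while current != target and len(path) - 1 < max_clicks:
        candidates = [n for n in network[current] if n not in visited]
        if not candidates:
            return None
        # Greedy step: pick the neighbour closest to the target in the hierarchy.
        current = min(candidates, key=lambda n: tree_distance(parent, n, target))
        visited.add(current)
        path.append(current)
    return path if current == target else None

# Toy hierarchy and information network (made-up data).
parent = {"root": None, "science": "root", "arts": "root",
          "physics": "science", "biology": "science", "music": "arts"}
network = {
    "music": ["arts", "biology"],
    "arts": ["music", "science"],
    "biology": ["music", "physics"],
    "science": ["arts", "physics"],
    "physics": ["biology", "science"],
}
path = greedy_navigate(network, parent, "music", "physics")
clicks = len(path) - 1
```

Running the same navigator over the same network with two different hierarchies (for example one built from tags and one built from keywords) and comparing the average number of clicks gives the kind of comparison reported above.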

...ok, let’s move on to some prediction stuff

Project 2

Tallinn University – Interested in the problem of recommending items and tags to users in social tagging systems.

Research Question 2:

To what extent is human cognition theory applicable to the problem of predicting tags and items for users?

Externals involved: • PUC - Chile, UFCG – Brazil

• They help you to classify Web content better [Zubiaga 2012]
• They help people to navigate large knowledge repositories better [Helic et al. 2012]
• They help people to search for information faster [Trattner et al. 2012]

However, there is an issue with social tags…

People are typically too lazy to apply social tags (!!)

Zubiaga, A. (2012). Harnessing Folksonomies for Resource Classification. arXiv preprint arXiv:1204.6521.

Helic, D., Körner, C., Granitzer, M., Strohmaier, M., & Trattner, C. (2012, June). Navigational efficiency of broad vs. narrow folksonomies. In Proceedings of the 23rd ACM conference on Hypertext and social media (pp. 63-72). ACM.

Trattner, C., Lin, Y. L., Parra, D., Yue, Z., Real, W., & Brusilovsky, P. (2012, June). Evaluating tag-based information access in image collections. In Proceedings of the 23rd ACM conference on Hypertext and social media (pp. 113-122). ACM.

Motivation

To overcome that issue, some smart people started to invent mechanisms that help the user in applying tags, known as social tag recommender systems. They are based on:

Collaborative Filtering

User-based and item-based CF [Marinho et al. 2008]

Matrix Factorization

FM, PITF [Rendle et al. 2010, 2011, 2012]

Graph Structures

Adapted PageRank and FolkRank [Hotho et al. 2006]

Topic Models

Latent Dirichlet Allocation (LDA) [Krestel et al. 2009, 2010, 2011]

Motivation

Why do we need cognitive models?

First answer: We do not like data-driven approaches…

Me: OK

Second answer: We can understand things better… why something is happening and how…

MINERVA2

Approach

Based on a model of human cognition (derived from MINERVA2 [Kruschke et al., 1992])

Evaluation Wikipedia

p-core pruning (p = 14)

To measure the performance of our approach, we split our dataset into two subsets: 80% for training and 20% for testing

Precision, Recall, F1-score, MRR, MAP
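For reference, the listed metrics can be computed per user as sketched below. These are the common textbook definitions; the exact variants and cut-offs used in the evaluation are not stated on the slide, so treat the details (and the function names and example tags) as assumptions. MRR and MAP are then the means of the reciprocal rank and the average precision over all test users.

```python
def precision_recall_f1(recommended, relevant):
    """Set-based precision, recall and F1 for one user's recommendation list."""
    hits = len(set(recommended) & set(relevant))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

def reciprocal_rank(recommended, relevant):
    """1 / rank of the first relevant tag (0 if none was recommended)."""
    for rank, tag in enumerate(recommended, start=1):
        if tag in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(recommended, relevant):
    """Mean of the precision values at each rank where a relevant tag appears."""
    hits, total = 0, 0.0
    for rank, tag in enumerate(recommended, start=1):
        if tag in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# Made-up example: four recommended tags, three tags the user actually applied.
recommended = ["web", "search", "python", "tags"]
relevant = {"search", "tags", "folksonomy"}
precision, recall, f1 = precision_recall_f1(recommended, relevant)
rr = reciprocal_rank(recommended, relevant)
ap = average_precision(recommended, relevant)
```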

As baseline algorithm we have chosen Latent Dirichlet Allocation (LDA) [Krestel et al. 2009]

Results

Results:

3Layers reaches higher accuracy estimates than the pure LDA approach.

Interestingly, when looking into the literature on tagging systems, temporal processes are typically modeled with an exponential function...

D. Yin, L. Hong, and B. D. Davison. Exploiting session-like behaviors in tag prediction. In Proceedings of the 20th international conference companion on World wide web, pages 167–168. ACM, 2011.

L. Zhang, J. Tang, and M. Zhang. Integrating temporal usage pattern into personalized tag prediction. In Web Technologies and Applications, pages 354–365. Springer, 2012

Empirical Analysis: BibSonomy (1)

A linear distribution with log scale on the Y-axis indicates an exponential function

A linear distribution with log scale on the X- and Y-axes indicates a power function

Empirical Analysis: BibSonomy (2)

Exponential distribution: R² = 31%

Power distribution: R² = 89%

Results:

The decay factor is better modeled as a power function than as an exponential function
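The comparison behind such R² values can be sketched as two straight-line fits: an exponential decay is linear when only the Y-axis is logarithmic, while a power-law decay is linear when both axes are logarithmic. The code below is a minimal sketch under that assumption, using synthetic data rather than the BibSonomy data; the function names are hypothetical.

```python
import math

def linear_fit_r2(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns slope, intercept, R²."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

def compare_decay_models(ages, reuse_counts):
    """R² of the exponential fit (log y vs. x) and the power fit (log y vs. log x)."""
    log_counts = [math.log(c) for c in reuse_counts]
    _, _, r2_exp = linear_fit_r2(ages, log_counts)
    _, _, r2_pow = linear_fit_r2([math.log(a) for a in ages], log_counts)
    return r2_exp, r2_pow

# Synthetic data: tag re-use counts that decay as a power of the tag's age.
ages = list(range(1, 51))
reuse_counts = [1000 * a ** -1.5 for a in ages]
r2_exp, r2_pow = compare_decay_models(ages, reuse_counts)
```

On this synthetic power-law data the log-log fit is essentially perfect while the log-linear fit is visibly worse, which is the same pattern as the 89% vs. 31% reported above.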

Experiment 1: Predicting re-use of tags

Results: Predicting re-use of tags

BLL, MPU

Results: Recall / Precision

Results:

BLLAC performs fairly well in predicting the re-use of tags

Experiment 2: Recommending Tags

Results: Recall-Precision plots

The time-dependent approaches outperform the state-of-the-art

BLL+MPr reaches the highest level of accuracy

CiteULike

Results: Recall / Precision

Results:

BLL approaches outperform current state-of-the-art tag recommender approaches.

...how about runtime?

Results: Runtime

BLL+C needs only around 1 s to generate tag recommendations for 5,500 users in BibSonomy

Results: Runtime

...predicting (re-ranking) items with ACT-R

Our Approach

= CIRTT: 2 main steps

First step:
– User-based Collaborative Filtering (CF) to get candidate items of similar users

Second step:
– Item-based CF to rank these candidate items using the BLL equation to integrate tag and time information:
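In its standard ACT-R form, the base-level learning (BLL) equation is B_i = ln(Σ_j t_j^(-d)), where t_j is the time elapsed since the j-th use of item i and d is the power-law decay parameter (commonly set to 0.5). The sketch below computes this activation for tags; how CIRTT plugs it into the item-based CF step is not shown here, and the function name, time unit and toy timestamps are assumptions.

```python
import math

def base_level_activation(use_timestamps, now, d=0.5):
    """ACT-R base-level learning: B = ln(sum_j (now - t_j)^(-d)).
    Uses that are not strictly in the past are ignored; with no usable
    observation the activation is minus infinity."""
    total = sum((now - t) ** (-d) for t in use_timestamps if now > t)
    return math.log(total) if total > 0 else float("-inf")

# Made-up usage history: both tags were used three times,
# but "python" was used again very recently.
now = 100
activation = {
    "python": base_level_activation([1, 2, 99], now),
    "java": base_level_activation([1, 2, 3], now),
}
ranked_tags = sorted(activation, key=activation.get, reverse=True)
```

Because of the power-law decay, frequency and recency are combined in one score: with equal usage counts, the more recently used tag gets the higher activation.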

How does it perform?

3 freely available folksonomy datasets:
– BibSonomy (~ 340,000 tag assignments)
– CiteULike (~ 100,000 tag assignments)
– MovieLens (~ 100,000 tag assignments)

Original datasets (no p-core pruning), cf. Doerfel et al. (2013)

80/20 split (for each user 20% most recent bookmarks/posts in test-set, rest in training-set)
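Such a per-user, time-based split can be sketched as follows. The tuple layout, the function name and the rounding rule (test-set size rounded down, so users with very few bookmarks contribute nothing to the test set) are assumptions made for the illustration.

```python
from collections import defaultdict

def split_per_user(bookmarks, test_fraction=0.2):
    """bookmarks: list of (user, item, timestamp) tuples.
    For each user, the most recent `test_fraction` of bookmarks go to the
    test set and the remaining, older ones go to the training set."""
    by_user = defaultdict(list)
    for bookmark in bookmarks:
        by_user[bookmark[0]].append(bookmark)

    train, test = [], []
    for user_bookmarks in by_user.values():
        ordered = sorted(user_bookmarks, key=lambda b: b[2])  # oldest first
        n_test = int(len(ordered) * test_fraction)            # rounded down
        cut = len(ordered) - n_test
        train.extend(ordered[:cut])
        test.extend(ordered[cut:])
    return train, test

# Made-up data: u1 has five bookmarks, u2 has two.
bookmarks = [
    ("u1", "a", 1), ("u1", "b", 5), ("u1", "c", 3), ("u1", "d", 2), ("u1", "e", 4),
    ("u2", "x", 1), ("u2", "y", 2),
]
train, test = split_per_user(bookmarks)
```

Splitting by time rather than at random keeps the evaluation realistic: the recommender only sees bookmarks that are older than the ones it has to predict.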

IR metrics: nDCG@20, MAP@20, Recall@20, Diversity and User Coverage
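As an illustration of the ranking metrics, nDCG@k with binary relevance can be computed per user as sketched below. This is the common textbook definition; whether the evaluation used exactly this variant is not stated on the slide, and the function name and example items are assumptions.

```python
import math

def ndcg_at_k(recommended, relevant, k=20):
    """Binary-relevance nDCG@k: discounted gain of the hits in the top k,
    divided by the gain of an ideal ranking that lists all relevant items first."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(recommended[:k], start=1)
              if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Made-up example: two relevant items, ranked first and third.
score = ndcg_at_k(["a", "b", "c"], {"a", "c"})
perfect = ndcg_at_k(["a", "c", "b"], {"a", "c"})
```

Unlike Recall@20, nDCG@20 rewards placing relevant items near the top of the list, so the two metrics can rank recommenders differently.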

Baseline Methods

• Most Popular (MP)

• User-based Collaborative Filtering (CF)

• Two alternative approaches based on tag and time information
– Zheng et al. (2011): exponential function
– Huang et al. (2014): linear function

(remember: our CIRTT uses a power function)

Results: nDCG plots

CIRTT reaches the highest level of accuracy

Results: Recall plots

CIRTT reaches the highest level of accuracy

Results

Results:

CIRTT works quite well compared to the current state-of-the-art in tag-based item recommender systems

What are we...

...currently working on...

MINERVA2 + ACT-R

Time in Semantic vs. Lexical Memory

Topical vs. Lexical shift in time

Topics

Results:

Topical shift in time is less pronounced than lexical shift

Results: Recall / Precision

Describer vs. Categorizer

M. Strohmaier, C. Koerner, and R. Kern. Understanding why users tag: A survey of tagging motivation literature and results from an empirical study. Journal of Web Semantics, 17:1–11, 2012.

Results: Categorizer vs. Describer

... ok, that’s basically it

Code and Framework

https://github.com/learning-layers/TagRec/

Thank you!

Christoph Trattner

Email: trattner.christoph@gmail.com

Web: christophtrattner.info

Twitter: @ctrattner

Sponsors:
