Mahout becomes a researcher, by Kris Jack, PhD, Senior Data Mining Engineer

Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

DESCRIPTION

I gave this presentation as part of the Big Data Week Conferences in London on 25th April 2012. Mendeley Suggest is a research article recommendation system powered by Mahout. This presentation explores how Mahout's distributed recommender works and how well it performs when recommending research to Mendeley users. Based on experimentation, it offers tips on speeding Mahout up by tuning it to the characteristics of the training data set. It also presents a new recommendation algorithm that implements user-based collaborative filtering, complementing Mahout's existing item-based collaborative filtering algorithm. The user-based implementation will soon be contributed back to the Mahout community.

Citation preview

Page 1: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

Mahout becomes a researcher

Kris Jack, PhD

Senior Data Mining Engineer

Page 2: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

Overview

➔ What's Mendeley?

➔ Applications of Mahout's Recommender

➔ Under Mahout's Bonnet

➔ Mahout's Research Career so Far

➔ Conclusions

Page 3: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

What's Mendeley?

Page 4: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

➔ Mendeley is a data platform for researchers

➔ We're bringing together researchers and the research that they produce from all over the world

➔ We're structuring this data in a machine-readable format

➔ We're opening this data up for you to build applications on top of it using our API

➔ These applications help researchers to do even better research and become more productive

➔ How are we building our community?

Page 5: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

Mendeley provides tools to help users...

...organise their research

➔ Reference management

➔ Cite-as-you-write

➔ Full-text article search

➔ Digitalised annotations

Page 6: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

Mendeley provides tools to help users...

...organise their research

...collaborate with one another

➔ Research network

➔ Professional research groups

Page 7: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

Mendeley provides tools to help users...

...organise their research

...collaborate with one another

...discover new research

➔ Mendeley Suggest

➔ Personalised article recommendations

➔ Weekly batch of 10 recommended articles

➔ Collaborative Filtering

➔ The more data, the better

Page 8: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

1.5 million+ users; the 20 largest user bases:

University of Cambridge
Stanford University
MIT
University of Michigan
Harvard University
University of Oxford
Sao Paulo University
Imperial College London
University of Edinburgh
Cornell University
University of California at Berkeley
RWTH Aachen
Columbia University
Georgia Tech
University of Wisconsin
UC San Diego
University of California at LA
University of Florida
University of North Carolina

50m research articles

Page 9: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

Mendeley provides tools to help users...

...organise their research

...collaborate with one another

...discover new research

We need a recommender that scales up, coping with our data and future growth

Page 10: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

Applications of Mahout's Recommender

Page 11: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley
Page 12: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley
Page 13: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

http://www.slideshare.net/kryton/the-data-layer

Mahout use cases:

➔ Retrieve related items in large collections

Page 14: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

http://engineering.foursquare.com/2011/03/22/building-a-recommendation-engine-foursquare-style/

Mahout use cases:

➔ Retrieve related items in large collections

➔ Discover relevant items that you may have overlooked

Page 15: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

http://www.speeddate.com/apps/site/views/mp/technology.php

Mahout use cases:

➔ Retrieve related items in large collections

➔ Discover relevant items that you may have overlooked

➔ Find love!

➔ Mahout implements collaborative filtering, a surprisingly powerful algorithm

Page 16: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

http://krisjack.blogspot.co.uk/2012/02/your-very-own-personalised-research.html

Mahout use cases:

➔ Retrieve related items in large collections

➔ Discover relevant items that you may have overlooked

➔ Find love!

➔ Mahout implements collaborative filtering, a surprisingly powerful algorithm

➔ Mendeley Suggest

  ➔ Discover new research

  ➔ Fill in gaps in your library

  ➔ Your personal advisor

Page 17: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

Under Mahout's Bonnet

Page 18: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

Generating recommendations through matrix multiplication

Adomavicius, G., & Tuzhilin, A. (2005). Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6), 734-749. Piscataway, NJ, USA.

These are item-based recommendations, as similarity is based on items, not users

http://www.slideshare.net/srowen/collaborative-filtering-at-scale-2

http://krisjack.blogspot.co.uk/2012/04/under-bonnet-of-mahouts-item-based.html

Not convinced? Try reading these...

Page 19: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

Input (all user preferences)

[Figure: preference matrix with research articles as rows (Comp Sci 1, Comp Sci 2, Physics 1, Physics 2) and researchers as columns (Turing, Babbage, Einstein, Newton).]

Page 20: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

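Input (all user preferences)

[Figure: preference matrix with research articles as rows (Comp Sci 1, Comp Sci 2, Physics 1, Physics 2) and researchers as columns (Turing, Babbage, Einstein, Newton), annotated with Mendeley's scale: 1.5M researchers, 50M research articles, 300M prefs.]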

Page 21: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

item.RecommenderJob

1. Prep. pref. matrix (1-3)

2. Gen. sim. matrix (4-6)

3. Multiply matrices (7-10)

[Figure: All User Preferences (item x user) matrix, research articles by researchers.]

Page 22: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

item.RecommenderJob

1. Prep. pref. matrix (1-3)

2. Gen. sim. matrix (4-6)

3. Multiply matrices (7-10)

[Figure: All User Preferences (item x user) matrix, research articles by researchers, beside A User's Preferences (item x user), the single column of research articles for Turing.]

Page 23: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

item.RecommenderJob

1. Prep. pref. matrix (1-3)

2. Gen. sim. matrix (4-6)

3. Multiply matrices (7-10)

[Figure: All User Preferences (item x user) and A User's Preferences (item x user, Turing), joined by an Item Similarity (item x item) matrix of research articles by research articles. Visible similarity entries: 2, 1, 1, 1 and 2, 2, 2, 2, with 0 in the remaining cells.]

Page 24: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

Input (all user preferences)

[Figure: the input matrix (research articles by researchers) beside a research articles by research articles matrix over Comp Sci 1, Comp Sci 2, Physics 1 and Physics 2. Visible entries: 2, 1, 1, 1 and 2, 2, 2, 2, with 0 in the remaining cells.]

Page 25: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

item.RecommenderJob

1. Prep. pref. matrix (1-3)

2. Gen. sim. matrix (4-6)

3. Multiply matrices (7-10)

[Figure: Item Similarity (item x item) X A User's Preferences (item x user, Turing) = Recommendations (item x user, Turing). The All User Preferences (item x user) matrix is shown as the source of both inputs. Visible similarity entries: 2, 1, 1, 1 and 2, 2, 2, 2, with 0 in the remaining cells.]
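The three stages above can be tried on a toy data set. The following is a minimal single-machine sketch, not Mahout's distributed RecommenderJob. It assumes binary preferences and uses plain co-occurrence counts as the item similarity (Mahout supports several similarity measures). The class name `ItemBasedSketch`, its methods, and the four toy libraries are hypothetical.

```java
import java.util.Arrays;

// Toy item-based collaborative filtering by matrix multiplication.
// All names and data here are illustrative assumptions, not Mendeley's.
class ItemBasedSketch {

    // prefs[item][user] is 1 if the user holds the article, else 0.
    // Returns the item x item co-occurrence matrix (prefs times its transpose).
    static int[][] cooccurrence(int[][] prefs) {
        int items = prefs.length;
        int[][] sim = new int[items][items];
        for (int a = 0; a < items; a++) {
            for (int b = 0; b < items; b++) {
                for (int u = 0; u < prefs[a].length; u++) {
                    sim[a][b] += prefs[a][u] * prefs[b][u];
                }
            }
        }
        return sim;
    }

    // Scores for one user: similarity matrix times that user's preference column.
    // Articles the user already holds are set to -1 so they are never recommended.
    static int[] scores(int[][] sim, int[][] prefs, int user) {
        int items = sim.length;
        int[] out = new int[items];
        for (int a = 0; a < items; a++) {
            if (prefs[a][user] != 0) {
                out[a] = -1;
                continue;
            }
            for (int b = 0; b < items; b++) {
                out[a] += sim[a][b] * prefs[b][user];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Rows: Comp Sci 1, Comp Sci 2, Physics 1, Physics 2.
        // Columns: Turing, Babbage, Einstein, Newton (hypothetical libraries).
        int[][] prefs = {
            {1, 1, 0, 0},
            {0, 1, 0, 0},
            {0, 0, 1, 1},
            {0, 0, 1, 1},
        };
        int[][] sim = cooccurrence(prefs);
        System.out.println("Item similarity: " + Arrays.deepToString(sim));
        System.out.println("Scores for Turing: " + Arrays.toString(scores(sim, prefs, 0)));
    }
}
```

With these toy libraries the similarity matrix splits into a Comp Sci block and a Physics block, and Turing, who holds only Comp Sci 1, gets Comp Sci 2 as the single positive score.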

Page 26: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

Running on Amazon's Elastic MapReduce

On-demand use and easy to cost

Page 27: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

Mahout's Research Career so Far

Page 28: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

Mendeley Suggest

Page 29: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

Mahout's Performance

[Figure: normalised Amazon hours (y-axis) against number of good recommendations out of 10 (x-axis).]

Page 30: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

Mahout's Performance

[Figure: normalised Amazon hours (y-axis) against number of good recommendations out of 10 (x-axis), divided into four quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good.]

Page 31: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

Mahout's Performance

[Figure: normalised Amazon hours (y-axis) against number of good recommendations out of 10 (x-axis), divided into four quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good.]

Page 32: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

Mahout's Performance

[Figure: normalised Amazon hours (y-axis) against number of good recommendations out of 10 (x-axis), divided into four quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good.]

Page 33: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

Mahout's Performance

[Figure: normalised Amazon hours (y-axis, 0 to 7K) against number of good recommendations out of 10 (x-axis, 0 to 3), divided into four quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good.]

Page 34: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

Mahout's Performance

[Figure: normalised Amazon hours (y-axis, 0 to 7K) against number of good recommendations out of 10 (x-axis, 0 to 3), divided into four quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good.]

➔ Orig. item-based: 6.5K hours, 1.5 good recommendations

Page 35: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

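Mahout's Performance

[Figure: normalised Amazon hours (y-axis, 0 to 7K) against number of good recommendations out of 10 (x-axis, 0 to 3), divided into four quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good.]

➔ Orig. item-based: 6.5K hours, 1.5 good recommendations

➔ Cust. item-based: 2.4K hours, 1.5 good recommendations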

Page 36: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

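Mahout's Performance

[Figure: normalised Amazon hours (y-axis, 0 to 7K) against number of good recommendations out of 10 (x-axis, 0 to 3), divided into four quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good.]

➔ Orig. item-based: 6.5K hours, 1.5 good recommendations

➔ Cust. item-based: 2.4K hours, 1.5 good recommendations

➔ Change: -4.1K hours (63%)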

Page 37: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

Reducing processing time and cost

➔ Mahout's recommender is already efficient

  ➔ but your data may have unusual properties

➔ We got improvements by:

  ➔ tuning Hadoop's mapper and reducer allocation over the 10 steps in the RecommenderJob

  ➔ using an appropriate partitioner

Page 38: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

Task Allocation: 37 hours to complete

1 reducer allocated, despite having 48 available...

Page 39: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

Task Allocation

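Allocating more mappers on a per-job basis:

// A smaller maximum split size produces more input splits, and so more map tasks.
job.getConfiguration().set("mapred.max.split.size", String.valueOf(splitSize));

Allocating more reducers on a per-job basis:

// Sets the number of reduce tasks for this job (the slide names the variable numMappers).
job.getConfiguration().setInt("mapred.reduce.tasks", numMappers);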
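As a rough guide to picking a value for the split size setting, the arithmetic can be sketched in plain Java. This is a back-of-the-envelope sketch only: how many splits Hadoop actually creates also depends on the input format, the block size and the number of input files. The class `SplitSizeSketch`, its methods, and the 10 GiB input size are hypothetical.

```java
// Back-of-the-envelope split arithmetic in plain Java, with no Hadoop dependency.
// Names and numbers are illustrative assumptions.
class SplitSizeSketch {

    // Split size in bytes that cuts the input into approximately
    // desiredMappers pieces, and never more than that (ceiling division).
    static long maxSplitSizeFor(long inputBytes, int desiredMappers) {
        if (inputBytes <= 0 || desiredMappers <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return (inputBytes + desiredMappers - 1) / desiredMappers;
    }

    // Approximate number of splits for a single input of inputBytes,
    // ignoring block boundaries and per-file effects.
    static long approxSplits(long inputBytes, long splitSize) {
        if (inputBytes <= 0 || splitSize <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return (inputBytes + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long inputBytes = 10L * 1024 * 1024 * 1024; // hypothetical 10 GiB input
        long splitSize = maxSplitSizeFor(inputBytes, 40);
        System.out.println("split size: " + splitSize + " bytes");
        System.out.println("approx. splits: " + approxSplits(inputBytes, splitSize));
    }
}
```

The resulting value is the kind of number that would be passed as `splitSize` in the mapper setting above.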

Page 40: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

Task Allocation: run time reduced from 37 hours to 14 hours

From 1 → 40 reducers

Page 41: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

Partitioners: 14 hours to complete

Page 42: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

Partitioners: 14 hours to complete

[Figure: labelled sizes ~50KB and ~500MB.]

Page 43: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

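// Sample the input keys, then write range split points to the partition file.
InputSampler.Sampler<IntWritable, Text> sampler =
    new InputSampler.RandomSampler<IntWritable, Text>(...);
InputSampler.writePartitionFile(conf, sampler);

// Partition by key range using those split points.
conf.setPartitionerClass(TotalOrderPartitioner.class);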

http://www.philippeadjiman.com/blog/2009/12/20/hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning/

Page 44: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

Partitioners: run time reduced from 14 hours to 2 hours

Evenly distributed
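The idea behind sampling keys and partitioning by range can be sketched without Hadoop. This is a simplified illustration, assuming integer keys: split points are taken at evenly spaced positions in a sorted sample, and each key goes to the range it falls in. `RangePartitionSketch` and its methods are hypothetical names, and the skewed keys are made up for the example.

```java
import java.util.Arrays;

// Simplified range partitioning from a sample; not Hadoop code.
// Names and data are illustrative assumptions.
class RangePartitionSketch {

    // Picks numPartitions - 1 split points at evenly spaced positions
    // of the sorted sample.
    static int[] splitPoints(int[] sample, int numPartitions) {
        if (sample.length == 0 || numPartitions < 1) {
            throw new IllegalArgumentException("need a sample and at least one partition");
        }
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        int[] points = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            points[i - 1] = sorted[(int) ((long) i * sorted.length / numPartitions)];
        }
        return points;
    }

    // Partition index = number of split points that are <= key,
    // found with a binary search over the sorted split points.
    static int partitionFor(int key, int[] points) {
        int lo = 0;
        int hi = points.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (points[mid] <= key) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    // Counts how many keys land in each partition.
    static int[] loads(int[] keys, int[] points) {
        int[] counts = new int[points.length + 1];
        for (int key : keys) {
            counts[partitionFor(key, points)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Skewed keys: squares bunch up at the low end of the key range.
        int[] keys = new int[100];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = i * i;
        }
        int[] points = splitPoints(keys, 4); // here the sample is the full key set
        System.out.println("split points: " + Arrays.toString(points));
        System.out.println("keys per partition: " + Arrays.toString(loads(keys, points)));
    }
}
```

Because the split points follow the sampled key distribution rather than fixed-width ranges, each partition receives a similar number of keys even when the keys are skewed.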

Page 45: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

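Mahout's Performance

[Figure: normalised Amazon hours (y-axis, 0 to 7K) against number of good recommendations out of 10 (x-axis, 0 to 3), divided into four quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good.]

➔ Orig. item-based: 6.5K hours, 1.5 good recommendations

➔ Cust. item-based: 2.4K hours, 1.5 good recommendations

➔ Change: -4.1K hours (63%)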

Page 46: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

item.RecommenderJob

1. Prep. pref. matrix (1-3)

2. Gen. sim. matrix (4-6)

3. Multiply matrices (7-10)

[Figure: Item Similarity (item x item) X A User's Preferences (item x user, Turing) = Recommendations (item x user, Turing). The All User Preferences (item x user) matrix is shown as the source of both inputs. Visible similarity entries: 2, 1, 1, 1 and 2, 2, 2, 2, with 0 in the remaining cells.]

Page 47: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

item.RecommenderJob, with "item" overlaid by "user"

1. Prep. pref. matrix (1-3)

2. Gen. sim. matrix (4-6)

3. Multiply matrices (7-10)

[Figure: the item-based pipeline (Item Similarity (item x item) X A User's Preferences (item x user, Turing) = Recommendations (item x user, Turing)) with a User Similarity (user x user) matrix, researchers by researchers, added.]
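A user-based variant can be sketched on the same kind of toy data. The slides do not describe Mendeley's user-based implementation, so this is a generic single-machine sketch under stated assumptions: binary preferences, user similarity measured as the number of shared articles, and an article's score taken as the summed similarity of the other users who hold it. `UserBasedSketch`, its methods, and the toy libraries are hypothetical.

```java
import java.util.Arrays;

// Toy user-based collaborative filtering; a generic sketch,
// not the implementation described in the talk.
class UserBasedSketch {

    // prefs[item][user] is 1 if the user holds the article, else 0.
    // Returns the user x user matrix of shared-article counts.
    static int[][] userSimilarity(int[][] prefs) {
        int users = prefs[0].length;
        int[][] sim = new int[users][users];
        for (int[] itemRow : prefs) {
            for (int u = 0; u < users; u++) {
                for (int v = 0; v < users; v++) {
                    sim[u][v] += itemRow[u] * itemRow[v];
                }
            }
        }
        return sim;
    }

    // Score of each article for one user: sum over the other users of
    // (similarity to that user) * (whether they hold the article).
    // Articles the user already holds are set to -1.
    static int[] scores(int[][] prefs, int[][] userSim, int user) {
        int[] out = new int[prefs.length];
        for (int item = 0; item < prefs.length; item++) {
            if (prefs[item][user] != 0) {
                out[item] = -1;
                continue;
            }
            for (int v = 0; v < prefs[item].length; v++) {
                if (v != user) {
                    out[item] += userSim[user][v] * prefs[item][v];
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Rows: Comp Sci 1, Comp Sci 2, Physics 1, Physics 2.
        // Columns: Turing, Babbage, Einstein, Newton (hypothetical libraries).
        int[][] prefs = {
            {1, 1, 0, 0},
            {0, 1, 0, 0},
            {0, 0, 1, 1},
            {0, 0, 1, 1},
        };
        int[][] userSim = userSimilarity(prefs);
        System.out.println("User similarity: " + Arrays.deepToString(userSim));
        System.out.println("Scores for Turing: " + Arrays.toString(scores(prefs, userSim, 0)));
    }
}
```

The structure mirrors the item-based pipeline: only the similarity matrix changes, from item x item to user x user.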

Page 48: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

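Mahout's Performance

[Figure: normalised Amazon hours (y-axis, 0 to 7K) against number of good recommendations out of 10 (x-axis, 0 to 3), divided into four quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good.]

➔ Orig. item-based: 6.5K hours, 1.5 good recommendations

➔ Cust. item-based: 2.4K hours, 1.5 good recommendations

➔ Orig. user-based: 1K hours, 2.5 good recommendations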

Page 49: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

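Mahout's Performance

[Figure: normalised Amazon hours (y-axis, 0 to 7K) against number of good recommendations out of 10 (x-axis, 0 to 3), divided into four quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good.]

➔ Orig. item-based: 6.5K hours, 1.5 good recommendations

➔ Cust. item-based: 2.4K hours, 1.5 good recommendations

➔ Orig. user-based: 1K hours, 2.5 good recommendations

➔ Change from Cust. item-based to Orig. user-based: -1.4K hours (58%), +1 good recommendation (67%)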

Page 50: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

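Mahout's Performance

[Figure: normalised Amazon hours (y-axis, 0 to 7K) against number of good recommendations out of 10 (x-axis, 0 to 3), divided into four quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good.]

➔ Orig. item-based: 6.5K hours, 1.5 good recommendations

➔ Cust. item-based: 2.4K hours, 1.5 good recommendations

➔ Orig. user-based: 1K hours, 2.5 good recommendations

➔ Cust. user-based: 0.3K hours, 2.5 good recommendations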

Page 51: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

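Mahout's Performance

[Figure: normalised Amazon hours (y-axis, 0 to 7K) against number of good recommendations out of 10 (x-axis, 0 to 3), divided into four quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good.]

➔ Orig. item-based: 6.5K hours, 1.5 good recommendations

➔ Cust. item-based: 2.4K hours, 1.5 good recommendations

➔ Orig. user-based: 1K hours, 2.5 good recommendations

➔ Cust. user-based: 0.3K hours, 2.5 good recommendations

➔ Change from Orig. to Cust. item-based: -4.1K hours (63%)

➔ Change from Orig. to Cust. user-based: -0.7K hours (70%)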

Page 52: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

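Mahout's Performance

[Figure: normalised Amazon hours (y-axis, 0 to 7K) against number of good recommendations out of 10 (x-axis, 0 to 3), divided into four quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good.]

➔ Orig. item-based: 6.5K hours, 1.5 good recommendations

➔ Cust. item-based: 2.4K hours, 1.5 good recommendations

➔ Orig. user-based: 1K hours, 2.5 good recommendations

➔ Cust. user-based: 0.3K hours, 2.5 good recommendations

➔ Change from Orig. item-based to Cust. user-based: -6.2K hours (95%), +1 good recommendation (67%)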
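The percentage labels on these charts follow from the plotted values. A small sketch of the arithmetic, using the figures quoted on the slides (hours in thousands, good recommendations out of 10); the class and method names are hypothetical:

```java
// Checks the percentage changes quoted on the performance charts.
// Class and method names are illustrative.
class ChangeSketch {

    // Percentage change from before to after, rounded to the nearest whole percent.
    static long percentChange(double before, double after) {
        return Math.round((after - before) / before * 100.0);
    }

    public static void main(String[] args) {
        // Normalised Amazon hours, in thousands.
        System.out.println("Orig. item -> Cust. item: " + percentChange(6.5, 2.4) + "%");
        System.out.println("Cust. item -> Orig. user: " + percentChange(2.4, 1.0) + "%");
        System.out.println("Orig. user -> Cust. user: " + percentChange(1.0, 0.3) + "%");
        System.out.println("Orig. item -> Cust. user: " + percentChange(6.5, 0.3) + "%");
        // Good recommendations out of 10.
        System.out.println("Item-based -> user-based quality: " + percentChange(1.5, 2.5) + "%");
    }
}
```

The signs show direction: negative values are cost reductions, and the positive value is the gain in good recommendations.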

Page 53: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

Conclusions

Page 54: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

Conclusions

➔ Mahout is doing a great job of powering Mendeley Suggest

  ➔ Large scale data set

  ➔ Excellent for batch processing requirements

➔ We'll soon be feeding our user-based implementation into Mahout

  ➔ User-based can outperform item-based

  ➔ Makes Mahout's offering more rounded

➔ Save resources and money by understanding your data

  ➔ Help Hadoop with task allocation if necessary

  ➔ Partition your data appropriately

Page 55: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

We're Hiring!

➔ Hadoop Data Architect

  ➔ design a coherent data model across the company

  ➔ take ownership of our data

  ➔ hands-on Hadoop administration

➔ Marie Curie Senior Research Fellow

  ➔ ensure that Mendeley’s research catalogue is of high quality

  ➔ research and development opportunity

➔ £500 Finder's Fee if you find someone who we hire

  ➔ http://www.mendeley.com/careers/

Page 56: Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

www.mendeley.com