ENTER 2015 Research Track Slide Number 1

Analyzing User Reviews in Tourism with Topic Models

Marco Rossetti, Fabio Stella, Longbin Cao and Markus Zanker*

Alpen-Adria-Universität Klagenfurt, Austria
[email protected]
http://www.aau.at

* The presenter acknowledges the financial support of the European Union (EU), the European Regional Development Fund (ERDF), the Austrian Federal Government and the State of Carinthia in the Interreg IV Italien-Österreich programme (project acronym O-STAR).

ENTER 2015 Research Track Slide Number 2

Agenda

• Motivation
• Topic Models
• Application scenarios
• Results
• Conclusions

ENTER 2015 Research Track Slide Number 3

Motivation

• Ever-growing, vast amounts of data
– ~200 million reviews on TripAdvisor
– Valuable source of opinions

• Need for automated processing of data harvested from the Web

• Two principal (research) directions
– Machine Learning (ML): fitting general-purpose statistical models to data
– Semantic Web: moving from the traditional "unstructured" Web to a web of data (annotating data with semantic descriptors and providing efficient reasoning mechanisms)

• Topic models belong to the ML direction, but promise to detect semantic ties between words

ENTER 2015 Research Track Slide Number 4

Topic Model 1/3

• Method to organize, search and summarize electronic documents

• "... algorithms for discovering the themes that pervade a large and otherwise unstructured collection of documents." [Blei, CACM, 2012]

• Unsupervised learning strategy that builds on a basic idea:
– Start from a big corpus of documents, such as reviews
– Uncover hidden topical patterns
– Annotate documents according to those topics

ENTER 2015 Research Track Slide Number 5

Topic Model 2/3

• Topic: coherent and meaningful bag of words
• Words: can be related to several topics (homonyms)
• Documents: can be about several topics

• Example: documents can be about cats and dogs:
– Kitten, cat, meow, ...
– Dog, bone, ...

ENTER 2015 Research Track Slide Number 6

Topic Model 3/3

• Intuition: topics are probability distributions over words, and these discrete distributions generate the observations (words in documents).

• Computation task: compute the topic structure given the observations (posterior), i.e. approximate
– the distribution over words for each topic
– the topic proportions for each document
– the topic assignment of each occurrence of a word in a document
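The three quantities listed above can be approximated in several ways. As a minimal sketch, the block below uses a collapsed Gibbs sampler for Latent Dirichlet Allocation (LDA), one common member of the topic model family. The slides do not state which model or inference method was used, so the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy documents in the usage note are assumptions for illustration only.

```python
import random
from collections import Counter

def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenised documents (lists of words).

    Returns (phi, theta, z):
      phi[k][w]  - approximate probability of word w under topic k
      theta[d][k] - approximate proportion of topic k in document d
      z[d][n]    - topic assigned to the n-th word occurrence of document d
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)

    # Random initial topic assignment for every word occurrence.
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    doc_topic = [Counter(zd) for zd in z]                 # counts per (document, topic)
    topic_word = [Counter() for _ in range(num_topics)]   # counts per (topic, word)
    topic_total = [0] * num_topics                        # word occurrences per topic
    for doc, zd in zip(docs, z):
        for w, k in zip(doc, zd):
            topic_word[k][w] += 1
            topic_total[k] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]
                # Take the current occurrence out of the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic for this occurrence
                # (unnormalised; random.choices normalises the weights).
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + vocab_size * beta)
                    for j in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][n] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    phi = [
        {w: (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
         for w in vocab}
        for k in range(num_topics)
    ]
    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    return phi, theta, z
```

Calling `lda_gibbs([["cat", "kitten", "meow"], ["dog", "bone", "dog"]], num_topics=2)` returns one word distribution per topic, one topic mixture per document, and one topic label per word occurrence. On a corpus this small the topics are not meaningful; the sketch only shows the shape of the output.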

ENTER 2015 Research Track Slide Number 7

Example

Topic "Location"    Topic "Food"    Topic "Rooms"
walking_distance    breakfast       shower
station             service         bathroom
city_centre         restaurant      mattress
metro               bar             room
close               food            tv

“The hotel was right in the centre of the city, at walking distance from the city centre! Huge breakfast with nice food!”

“I stayed in this hotel with my friends, the room was cheap, but the shower was broken and the mattress was very hard!”

“The room was nice, with a flat tv, but the breakfast was so poor! I didn’t have enough food.”

[Figure labels attached to the reviews: Room, Food, Location]
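To make the idea of annotating reviews with topics concrete, here is a rough sketch that counts how many words of each review fall into the three example word lists. This is plain keyword matching, not topic model inference; it only illustrates the kind of per-review topic proportions a model would produce. The names `TOPIC_WORDS`, `PHRASES` and `topic_proportions` are hypothetical.

```python
import re

# Word lists taken from the example topics on this slide (lower-cased).
TOPIC_WORDS = {
    "Location": {"walking_distance", "station", "city_centre", "metro", "close"},
    "Food": {"breakfast", "service", "restaurant", "bar", "food"},
    "Rooms": {"shower", "bathroom", "mattress", "room", "tv"},
}

# Multi-word phrases are joined so that they match the underscore tokens.
PHRASES = {"walking distance": "walking_distance", "city centre": "city_centre"}

def topic_proportions(review):
    """Share of topic-word hits per topic in one review (all 0.0 if no hit)."""
    text = review.lower()
    for phrase, token in PHRASES.items():
        text = text.replace(phrase, token)
    tokens = re.findall(r"[a-z_]+", text)
    counts = {
        topic: sum(token in words for token in tokens)
        for topic, words in TOPIC_WORDS.items()
    }
    total = sum(counts.values())
    return {topic: (c / total if total else 0.0) for topic, c in counts.items()}
```

Under this simplification the first review splits evenly between "Location" and "Food", the second falls entirely under "Rooms", and the third splits evenly between "Rooms" and "Food".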

ENTER 2015 Research Track Slide Number 8

Goal and Contributions

1. Explore opportunities for application of the Topic Model* method in the Tourism domain.

2. Provide empirical evidence for their utility.

* Note that it is a family of many different methods.

ENTER 2015 Research Track Slide Number 9

Scenario 1: Item recommendation

• Users write reviews about topics that they care about (preference)

• Textual reviews associated with an overall rating explain which aspects of the item were assessed in particular

“The hotel was right in the centre of the city, at walking distance from the city centre! Huge breakfast with nice food!”

ENTER 2015 Research Track Slide Number 10

Topic-Criteria model 1/3

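• User profiles (UP) created from the topic distributions in the user's own reviews

$$UP(i, t) = \frac{\sum_{d_{ij} \in D_i} p(t \mid d_{ij})}{|D_i|}$$

where $D_i$ is the set of reviews written by user $i$, $d_{ij}$ is the review of user $i$ on item $j$, and $p(t \mid d_{ij})$ is the proportion of topic $t$ in that review.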

ENTER 2015 Research Track Slide Number 11

Topic-Criteria model 2/3

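• Item profiles (IP) created from reviews and ratings

$$IP(j, t) = \frac{\sum_{d_{ij} \in D_j} p(t \mid d_{ij}) \cdot r_{ij}}{\sum_{d_{ij} \in D_j} p(t \mid d_{ij})}$$

where $D_j$ is the set of reviews written about item $j$, $r_{ij}$ is the rating that accompanies review $d_{ij}$, and $p(t \mid d_{ij})$ is the proportion of topic $t$ in that review.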

ENTER 2015 Research Track Slide Number 12

Topic-Criteria model 3/3

• Prediction based on the sum of products over all topics
– Weight parameter $w_t$ fitted to data
– Assumption that not all topics are equally influential

$$\hat{r}_{ij} = \sum_{t=1}^{T} UP(i, t) \cdot IP(j, t) \cdot w_t$$
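The three Topic-Criteria steps (user profile, item profile, weighted prediction) can be sketched in a few lines. This is an illustrative reading of the slides, not the authors' implementation: the function names, the list-based data layout, the fallback to 0.0 for a topic that no review of the item mentions, and the default of equal weights are all assumptions. How the weights were fitted is not described on the slides.

```python
def user_profile(review_thetas):
    """Average topic proportions over a user's own reviews.

    review_thetas: list of topic-proportion lists, one per review.
    """
    n = len(review_thetas)
    num_topics = len(review_thetas[0])
    return [sum(theta[t] for theta in review_thetas) / n for t in range(num_topics)]

def item_profile(reviews):
    """Topic-weighted average rating per topic for one item.

    reviews: list of (topic-proportion list, rating) pairs for the item.
    """
    num_topics = len(reviews[0][0])
    profile = []
    for t in range(num_topics):
        mass = sum(theta[t] for theta, _ in reviews)
        weighted = sum(theta[t] * rating for theta, rating in reviews)
        # Assumption: a topic never mentioned for this item contributes 0.0.
        profile.append(weighted / mass if mass > 0 else 0.0)
    return profile

def predict(up, ip, weights=None):
    """Sum over topics of user profile * item profile * topic weight."""
    if weights is None:
        weights = [1.0] * len(up)  # assumption: unweighted variant
    return sum(u * v * w for u, v, w in zip(up, ip, weights))
```

For a user with reviews `[[0.8, 0.2], [0.6, 0.4]]` the profile is roughly `[0.7, 0.3]`. Combined with an item whose reviews are `([1.0, 0.0], 5)` and `([0.5, 0.5], 3)`, the unweighted prediction is about 3.93, because the user mostly cares about the topic on which the item is rated highly.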

ENTER 2015 Research Track Slide Number 13

Results for Scenario 1

Method    YELP-5-5    YELP-10-10    TA-3-3    TA-5-5
KNN-IB    1.0709      1.0249        1.0531    0.9601
KNN-UB    1.1088      1.0424        1.0715    0.9447
PMF       1.0956      1.0389        1.0373    0.9946
TC        1.0706      1.0247        1.0625    0.9719
TC-W      1.0599      0.9955        1.0916    0.9776

• Evaluation on datasets from YELP (restaurants) and Tripadvisor (hotels) with different levels of sparsity

• Accuracy (RMSE) of the Topic-Criteria model is comparable to nearest-neighbor and matrix factorization approaches, but it yields richer user profiles, and we can explain which topics were considered in real user interaction
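For reference, RMSE (root mean squared error) measures the typical deviation between predicted and actual ratings, so lower values are better. The helper below is a minimal sketch; the function name `rmse` and the input layout are assumptions, and the slides do not describe the evaluation protocol in detail.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equally long rating sequences."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

For example, `rmse([1, 2], [2, 4])` averages the squared errors 1 and 4 to 2.5 and returns its square root, roughly 1.58.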

ENTER 2015 Research Track Slide Number 14

Scenario 2: Analytics

• Anecdotal evidence on what topics might explain a good or bad rating for a service provider or a destination.

• BUT: risk of fallacies due to e.g. cherry-picking.

Cleanliness in reviews on Orlando hotels:
dirty mold bugs smelled smell filthy carpet musty stained disgusting bed_bugs black mildew moldy stains bites dust musty_smell refund

Business in reviews on New York hotels:
internet free free_internet access wireless internet_access wireless_internet business_center computers free_wireless business boarding gym center print free_internet_access printer bottled passes
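One way to guard against cherry-picking is to look at topic proportions aggregated over all reviews rather than at single examples. The sketch below compares the mean topic share in low-rated and high-rated reviews. This is an assumed analysis for illustration, not one reported on the slides; the function name, the rating threshold and the data layout are hypothetical.

```python
def mean_topic_share_by_rating(reviews, threshold=3):
    """Mean topic proportions for low-rated and high-rated reviews.

    reviews: list of (topic_proportions dict, rating) pairs.
    Ratings below `threshold` count as "low", all others as "high".
    """
    groups = {"low": [], "high": []}
    for theta, rating in reviews:
        groups["low" if rating < threshold else "high"].append(theta)

    result = {}
    for name, thetas in groups.items():
        topics = {topic for theta in thetas for topic in theta}
        # An empty group yields an empty dict, so no division by zero occurs.
        result[name] = {
            topic: sum(theta.get(topic, 0.0) for theta in thetas) / len(thetas)
            for topic in topics
        }
    return result
```

If a topic such as cleanliness has a much larger mean share in the low-rated group than in the high-rated group, that is aggregate (rather than anecdotal) support for the topic being associated with bad ratings.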

ENTER 2015 Research Track Slide Number 15

Scenario 3: Automated Interpretation of reviews

• Automatically derive different properties from a review, such as:
– Rating value: extract topics from the written text and match them with the item profile; if a user writes about the strengths of the hotel, predict a high score
– Coherence: identify reviews whose associated rating value is or is not coherent with the predicted rating, in order to detect fake reviews or rank more plausible reviews higher
– Breadth: identify reviews with more breadth / broader scope (see Daniel Leung's thesis)
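Two of these properties can be sketched with simple proxies. The slides do not define how coherence or breadth is measured, so both choices here are assumptions: coherence is taken as the gap between given and predicted rating staying within a hypothetical `tolerance`, and breadth as the Shannon entropy of the review's topic proportions.

```python
import math

def is_coherent(given_rating, predicted_rating, tolerance=1.0):
    """True if the given rating lies within `tolerance` of the predicted one."""
    return abs(given_rating - predicted_rating) <= tolerance

def topic_entropy(theta):
    """Shannon entropy (in nats) of a topic-proportion list.

    0 for a review about a single topic; log(len(theta)) when all topics
    are covered equally.
    """
    return -sum(p * math.log(p) for p in theta if p > 0)
```

A review with topic proportions `[0.5, 0.5]` has entropy `log(2)`, while `[1.0, 0.0]` has entropy 0, so ranking reviews by this value puts broader reviews first.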

ENTER 2015 Research Track Slide Number 16

Conclusions

• Several application scenarios for the Topic Model method in the tourism domain identified

• Empirical evidence that proposed Topic-Criteria model achieves comparable or better results than baseline recommendation methods

• Future work:
– Different extensions of Topic Model methods employing supervised learning
– Contrasting derived topic distributions with real user assessments

ENTER 2015 Research Track Slide Number 17

Thank you for your attention!

Questions?

Markus Zanker

Intelligent Systems and Business Informatics

Alpen-Adria-Universität Klagenfurt, Austria

M: [email protected]

P: +43 463 2700 3753

Skype: markuszanker

W: http://www.isbi.at/mzanker

Visit: http://www.recommenderbook.net

ENTER 2015 Research Track Slide Number 18

Project OSTAR

• Development of an innovative online system for recommending individual tours and trails in alpine regions
– Research partners:
  • EURAC research, Bolzano, Italy
  • Free University Bolzano-Bozen, Italy
  • Autonomous Province of Bolzano – South Tyrol (Dept. for spatial and statistical informatics)
  • Alpen-Adria-Universität Klagenfurt
– Application partners:
  • Tourism regions in Carinthia and South Tyrol
– Runtime: 2012-2014
– Programme:
  • Interreg IV Italy-Austria