Shoaib Jameel, Wai Lam and Xiaojun Qian The Chinese University of Hong Kong Ranking Text Documents Based on Conceptual Difficulty Using Term Embedding and Sequential Discourse Cohesion


Page 1: Shoaib Jameel ,  Wai  Lam and  Xiaojun Qian The Chinese University of Hong Kong

Shoaib Jameel, Wai Lam and Xiaojun Qian

The Chinese University of Hong Kong

Ranking Text Documents Based on Conceptual Difficulty Using Term

Embedding and Sequential Discourse Cohesion

Page 2: Shoaib Jameel ,  Wai  Lam and  Xiaojun Qian The Chinese University of Hong Kong

Outline

1. Introduction to Readability/Conceptual Difficulty
2. Motivation
3. Related Work
4. Our method (Sequential Term Transition Model (STTM))
5. Empirical Evaluation
6. Conclusions and Future Work

Page 3: Shoaib Jameel ,  Wai  Lam and  Xiaojun Qian The Chinese University of Hong Kong

http://scienceforkids.kidipede.com/chemistry/atoms/proton.htm

http://en.wikipedia.org/wiki/Proton

Which of the two appears simpler to you?

Page 4: Shoaib Jameel ,  Wai  Lam and  Xiaojun Qian The Chinese University of Hong Kong

Search for a keyword

Results – sometimes irrelevant and in mixed order of readability

Page 5: Shoaib Jameel ,  Wai  Lam and  Xiaojun Qian The Chinese University of Hong Kong

An attempt by Google

Page 6: Shoaib Jameel ,  Wai  Lam and  Xiaojun Qian The Chinese University of Hong Kong

Our Objective

Query

Retrieve web pages (considering relevance)

Re-rank web pages based on readability

Automatically accomplished


Page 7: Shoaib Jameel ,  Wai  Lam and  Xiaojun Qian The Chinese University of Hong Kong

What has been done so far?

Heuristic readability formulae

Unsupervised approaches

Supervised approaches

Page 8: Shoaib Jameel ,  Wai  Lam and  Xiaojun Qian The Chinese University of Hong Kong

Heuristic Readability Methods

In use since the 1940s

Semantic Component – Number of syllables per word, length of the syllables per word etc.

Syntactic Component – Length of sentences etc.

Page 9: Shoaib Jameel ,  Wai  Lam and  Xiaojun Qian The Chinese University of Hong Kong

Example – Flesch Reading Ease

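Semantic component (syllables per word) and syntactic component (words per sentence)

Manually tuned numerical parameters

water -> wa-ter, proton -> pro-ton

embryology -> em-bry-ol-o-gy, star -> star

Problem: syllable counts do not reflect conceptual difficulty; "water" and "proton" both have two syllables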
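To make the problem concrete, here is a minimal Python sketch of the standard Flesch Reading Ease formula, 206.835 − 1.015 × (words / sentences) − 84.6 × (syllables / words). The syllable counter is an assumption made for illustration: it only counts groups of consecutive vowels, so it is rough (it undercounts "embryology", for example). The function names are hypothetical and not taken from the slides.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate (assumption): count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher scores indicate easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)   # syntactic component
    syllables_per_word = syllables / len(words)        # semantic component
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


if __name__ == "__main__":
    # "water" and "proton" receive the same syllable count, so the formula
    # cannot tell an everyday word from a technical one.
    print(count_syllables("water"), count_syllables("proton"))
    print(flesch_reading_ease("The cat sat."))
```

Because both weights are fixed constants and the only word-level signal is syllable count, two sentences that differ only in swapping "water" for "proton" would receive the same score.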

Page 10: Shoaib Jameel ,  Wai  Lam and  Xiaojun Qian The Chinese University of Hong Kong

Supervised Learning Methods

Language Models
• Unigram Language Model based method

SVMs (Support Vector Machines)

Use of query logs and user profiles
• Can address the problem on an individual basis

Page 11: Shoaib Jameel ,  Wai  Lam and  Xiaojun Qian The Chinese University of Hong Kong

Smoothed Unigram Model [1]

[1] K. Collins-Thompson and J. Callan. 2005. Predicting reading difficulty with statistical language models. Journal of the American Society for Information Science and Technology 56(13), pp. 1448-1462.

• Recast the well-studied problem of readability in terms of text categorization and used straightforward techniques from statistical language modeling.
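The idea of treating readability as text categorization with unigram language models can be sketched as follows. This is an illustration only, not the method of [1]: it uses add-one (Laplace) smoothing as a stand-in, since the slides do not describe the smoothing scheme, and the function names and toy training data are hypothetical.

```python
import math
from collections import Counter


def train_unigram_models(labeled_docs):
    """Build one bag of token counts per difficulty label.

    labeled_docs: iterable of (label, list_of_tokens) pairs.
    """
    counts = {}
    for label, tokens in labeled_docs:
        counts.setdefault(label, Counter()).update(tokens)
    return counts


def predict_level(tokens, counts):
    """Return the label whose smoothed unigram model gives the highest log-likelihood."""
    vocab = set().union(*counts.values())
    v = len(vocab) + 1  # +1 reserves probability mass for unseen words
    best_label, best_ll = None, -math.inf
    for label, c in counts.items():
        total = sum(c.values())
        # Add-one smoothing (assumption): every word gets a pseudo-count of 1.
        ll = sum(math.log((c[t] + 1) / (total + v)) for t in tokens)
        if ll > best_ll:
            best_label, best_ll = label, ll
    return best_label


if __name__ == "__main__":
    models = train_unigram_models([
        ("easy", ["the", "cat", "sat", "on", "the", "mat"]),
        ("hard", ["the", "proton", "has", "positive", "electric", "charge"]),
    ])
    print(predict_level(["proton", "charge"], models))
```

The sketch also shows why such a model depends on labeled data: without documents tagged by difficulty level there is nothing to estimate the per-level word distributions from.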

Page 12: Shoaib Jameel ,  Wai  Lam and  Xiaojun Qian The Chinese University of Hong Kong

Smoothed Unigram Model

Limitation of their method: it requires training data, which may be difficult to obtain

Page 13: Shoaib Jameel ,  Wai  Lam and  Xiaojun Qian The Chinese University of Hong Kong

Domain-specific Readability

• Jin Zhao and Min-Yen Kan. 2010. Domain-specific iterative readability computation. In Proceedings of the 10th annual joint conference on Digital libraries (JCDL '10).

Based on the web link-structure algorithms HITS (Hypertext Induced Topic Search) and SALSA (Stochastic Approach for Link-Structure Analysis).

• Xin Yan, Dawei Song, and Xue Li. 2006. Concept-based document readability in domain specific information retrieval. In Proceedings of the 15th ACM international conference on Information and knowledge management (CIKM '06).

Based on an ontology. Tested only in the medical domain. I will focus on this work.

Page 14: Shoaib Jameel ,  Wai  Lam and  Xiaojun Qian The Chinese University of Hong Kong

Overview

• The authors state that Document Scope and Document Cohesion are important parameters in finding simple texts.

• The authors used a controlled vocabulary thesaurus called Medical Subject Headings (MeSH).

• The authors point out that readability formulae are not directly applicable to web pages.

Page 15: Shoaib Jameel ,  Wai  Lam and  Xiaojun Qian The Chinese University of Hong Kong

MeSH Ontology

[Figure: MeSH concept hierarchy, annotated with the directions in which concept difficulty increases and decreases]

Page 16: Shoaib Jameel ,  Wai  Lam and  Xiaojun Qian The Chinese University of Hong Kong

Overall Concept Based Readability Score

where,
• DaCw = Dale-Chall Readability Measure
• PWD = Percentage of difficult words
• AvgSL = Average sentence length in d_i

Their work focused on word-level readability, hence considered only the PWD

• len(c_i, c_j) = function to compute the shortest path between concepts c_i and c_j in the MeSH hierarchy
• N = total number of domain concepts in document d_i
• Depth(c_i) = depth of the concept c_i in the concept hierarchy
• D = maximum depth of the concept hierarchy
• Number of associations = total number of mutual associations among concepts

Page 17: Shoaib Jameel ,  Wai  Lam and  Xiaojun Qian The Chinese University of Hong Kong

Use of Query Log Data

• Studies have been conducted by search engine companies

• Requires proprietary data that is not publicly available

• Thus not very useful to the research community, because the work cannot be replicated

J. Kim, K. Collins-Thompson, P. N. Bennett, and S. Dumais. 2012. Characterizing Web Content, User Interests, and Search Behavior by Reading Level and Topic. In Proceedings of WSDM 2012. (Microsoft Research)

Chenhao Tan, Evgeniy Gabrilovich, and Bo Pang. 2012. To each his own: personalized content selection based on text comprehensibility. In Proceedings of WSDM 2012. (Yahoo! Research)

Page 18: Shoaib Jameel ,  Wai  Lam and  Xiaojun Qian The Chinese University of Hong Kong

Our approach

Sequential Term Transition Model (STTM)

A conceptual difficulty determination model which is:

Unsupervised

Does not require any knowledge base or annotated data

Page 19: Shoaib Jameel ,  Wai  Lam and  Xiaojun Qian The Chinese University of Hong Kong

Methodology

• We first build a term-document matrix W
• We then perform Singular Value Decomposition (SVD) on the matrix
• SVD: W ≈ W' = U S V^T
• U is a T x f matrix of left singular vectors
• V is a D x f matrix of right singular vectors
• S is an f x f diagonal matrix of singular values

T is the number of terms in the vocabulary, D is the number of documents in the collection, and f is the number of factors
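The decomposition can be sketched with NumPy as below. This is a minimal illustration under stated assumptions: the toy matrix, the choice f = 2, and the helper name `truncated_svd` are all hypothetical, and the slides do not say which SVD implementation or term weighting was used.

```python
import numpy as np


def truncated_svd(W: np.ndarray, f: int):
    """Return U (T x f), S (f x f) and V (D x f) such that W is approximately U S V^T."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_f = U[:, :f]        # left singular vectors: one row per term
    S_f = np.diag(s[:f])  # the f largest singular values on the diagonal
    V_f = Vt[:f, :].T     # right singular vectors: one row per document
    return U_f, S_f, V_f


# Toy term-document matrix (assumption): T = 5 terms (rows), D = 4 documents (columns).
W = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 1.0, 3.0],
    [1.0, 1.0, 1.0, 1.0],
])

U, S, V = truncated_svd(W, f=2)
W_approx = U @ S @ V.T  # rank-2 approximation W'

if __name__ == "__main__":
    print(U.shape, S.shape, V.shape)
    print(np.round(W_approx, 2))
```

Rows of U (optionally scaled by S) give term vectors and rows of V give document vectors in the same f-dimensional latent space, which is what allows terms and documents to be compared directly.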

Page 20: Shoaib Jameel ,  Wai  Lam and  Xiaojun Qian The Chinese University of Hong Kong

Observation in the SVD space

1. Terms which are central to a document come close to their document vectors

2. General terms lie far away from their document vectors

3. Semantically related terms cluster close to each other

4. Unrelated terms cluster away from each other

Page 21: Shoaib Jameel ,  Wai  Lam and  Xiaojun Qian The Chinese University of Hong Kong

Computing Term Difficulties

Normalized term vector

Normalized document vector

Matrix of normalized document vectors that contain the term

Page 22: Shoaib Jameel ,  Wai  Lam and  Xiaojun Qian The Chinese University of Hong Kong

General Idea about Linear Embedding

[Figure: linear embedding of a term t with respect to the documents D1 to D6, with weights w1 to w6]

Page 23: Shoaib Jameel ,  Wai  Lam and  Xiaojun Qian The Chinese University of Hong Kong

Cohesion

When units tend to “stick together”, the property is called cohesion

We compute cohesion between terms in sequence

The more cohesive the terms in a document are, the easier it is for a person to comprehend the discourse

Page 24: Shoaib Jameel ,  Wai  Lam and  Xiaojun Qian The Chinese University of Hong Kong

Computation of Cohesion

We know related terms cluster close to each other in the latent space obtained via SVD

We have to compute the cluster memberships of each of the terms as SVD does not directly give term memberships to clusters

We use k-means because of its simplicity and ability to handle large datasets

Page 25: Shoaib Jameel ,  Wai  Lam and  Xiaojun Qian The Chinese University of Hong Kong

How do we compute cohesion?

Term sequence: W1 W2 W3 W4 W5 W6 W7 W8 W9 W10

1. Determine the cluster memberships of the two consecutive terms W1 and W2: C1 and C1. Same cluster, so we conclude they are cohesive.

2. W2 and W3: C1 and C1. Same cluster, so we conclude they are cohesive.

3. W3 and W4: C1 and C4. Different clusters, so we compute the cosine similarity between C1 and C4.

Page 26: Shoaib Jameel ,  Wai  Lam and  Xiaojun Qian The Chinese University of Hong Kong

Cohesion using cosine similarity

If the cluster centroids are close to each other, then cosine similarity will be high

A high cosine similarity means that the two clusters are closely related
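The sequential cohesion check can be sketched in plain Python as follows. Several details are assumptions made for illustration and are not stated in the slides: cluster memberships and centroids are taken as already computed (for example by k-means), a same-cluster pair is scored as 1.0, and the document score is the average over all consecutive pairs. All names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def document_cohesion(terms, cluster_of, centroids):
    """Average cohesion over consecutive term pairs.

    terms:      the document as an ordered list of terms
    cluster_of: dict mapping each term to its cluster id (assumed precomputed)
    centroids:  dict mapping each cluster id to its centroid vector
    """
    scores = []
    for a, b in zip(terms, terms[1:]):
        ca, cb = cluster_of[a], cluster_of[b]
        if ca == cb:
            scores.append(1.0)  # same cluster: treated as fully cohesive (assumption)
        else:
            scores.append(cosine(centroids[ca], centroids[cb]))
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    cluster_of = {"w1": "C1", "w2": "C1", "w3": "C1", "w4": "C4"}
    centroids = {"C1": [1.0, 0.0], "C4": [1.0, 1.0]}
    print(document_cohesion(["w1", "w2", "w3", "w4"], cluster_of, centroids))
```

Falling back to centroid similarity means that a transition between two closely related clusters is penalized less than a jump between unrelated ones.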

Page 27: Shoaib Jameel ,  Wai  Lam and  Xiaojun Qian The Chinese University of Hong Kong

Conceptual Difficulty Score

Conceptual difficulty score for document j

Parameter controlling the relative weights, in [0,1]

Cohesion score of document j

Term difficulty score for document j

Page 28: Shoaib Jameel ,  Wai  Lam and  Xiaojun Qian The Chinese University of Hong Kong

Empirical Evaluation - Dataset

Standard test collections do not have readability judgments

We chose the Psychology domain

Crawled web pages from Wikipedia, Psychology.com, Simple English Wikipedia

Total web page count = 167,400

No term stemming

Tested both with and without stopwords

Page 29: Shoaib Jameel ,  Wai  Lam and  Xiaojun Qian The Chinese University of Hong Kong

Retrieval of web pages

Indexed the web pages using a small scale search engine. We used Zettair

Retrieved web pages for a query based on relevance

Followed INEX’s query/topic generation guidelines

Re-ranked web pages based on conceptual difficulty

Annotated some top-10 documents for each query

Page 30: Shoaib Jameel ,  Wai  Lam and  Xiaojun Qian The Chinese University of Hong Kong

Evaluation Metric

Normalized Discounted Cumulative Gain (NDCG)

Well suited for ranking evaluation because it takes into account the position of an entity in the ranked list, unlike precision, recall, or rank-order correlation
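A minimal sketch of NDCG is shown below. It assumes the common formulation in which the gain at rank r is divided by log2(r + 1); the slides do not state which variant was used, and the function names and example gains are hypothetical.

```python
import math


def dcg(gains):
    """Discounted cumulative gain; the item at 0-based index i has rank i + 1."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(gains, k=None):
    """DCG of the ranking as given, normalized by the DCG of the ideal ordering."""
    k = len(gains) if k is None else k
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Gains are the annotated judgments of the documents, in ranked order.
    print(ndcg([3, 2, 1]))  # already in ideal order
    print(ndcg([1, 2, 3]))  # best document ranked last
```

Because of the logarithmic discount, placing a highly judged document near the bottom of the list costs more than swapping two documents that are both far down, which is the position sensitivity the slide refers to.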

Page 31: Shoaib Jameel ,  Wai  Lam and  Xiaojun Qian The Chinese University of Hong Kong

Results when β=0.5

Page 32: Shoaib Jameel ,  Wai  Lam and  Xiaojun Qian The Chinese University of Hong Kong
Page 33: Shoaib Jameel ,  Wai  Lam and  Xiaojun Qian The Chinese University of Hong Kong

Conclusions and Future Work

We proposed a conceptual difficulty ranking model

Required no training data or ontology

Main novelty – use of a conceptual model

Significant improvement

In the future, we would study how the link structure of the web could aid us in conceptual difficulty ranking