Ranking Text Documents Based on Conceptual Difficulty Using Term Embedding and Sequential Discourse Cohesion
Shoaib Jameel, Wai Lam and Xiaojun Qian
The Chinese University of Hong Kong
Outline
1. Introduction to Readability/Conceptual Difficulty
2. Motivation
3. Related Work
4. Our method (Sequential Term Transition Model (STTM))
5. Empirical Evaluation
6. Conclusions and Future Work
http://scienceforkids.kidipede.com/chemistry/atoms/proton.htm
http://en.wikipedia.org/wiki/Proton
Which of the two appears simpler to you?
Search for a keyword
Results – sometimes irrelevant and in mixed order of readability
An attempt by Google
Our Objective
1. Query
2. Retrieve web pages (considering relevance)
3. Re-rank web pages based on readability
All steps accomplished automatically.
What has been done so far?
• Heuristic readability formulae
• Unsupervised approaches
• Supervised approaches
Heuristic Readability Methods
These formulae have existed since the 1940s.
• Semantic component – number of syllables per word, syllable length, etc.
• Syntactic component – sentence length, etc.
Example – Flesch Reading Ease:
Score = 206.835 − 1.015 × (total words / total sentences) − 84.6 × (total syllables / total words)
The syllables-per-word term is the semantic component and the words-per-sentence term is the syntactic component; the numerical parameters are manually tuned.
Syllable decomposition examples:
• water → wa-ter (2 syllables)
• proton → pro-ton (2 syllables)
• embryology → em-bry-ol-o-gy (5 syllables)
• star → star (1 syllable)
Problem: syllable counts do not track conceptual difficulty; "proton" has no more syllables than "water", yet it is conceptually harder (see the sketch below).
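To make the formula concrete, here is a minimal Python sketch of the Flesch Reading Ease computation with a naive vowel-group syllable counter (the counter is our simplification, not part of the original formula):

```python
import re

def count_syllables(word):
    # Naive heuristic: count runs of consecutive vowels.
    # (Our simplification; real syllabifiers are more accurate.)
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    # Flesch Reading Ease: higher scores indicate easier text.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))  # syntactic component
            - 84.6 * (syllables / len(words)))       # semantic component

# "proton" and "water" both yield two syllables, so the formula
# cannot see that one is conceptually harder than the other.
print(flesch_reading_ease("The proton is a subatomic particle."))
```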
Supervised Learning Methods
• Language models (e.g., a unigram language model based method)
• SVMs (Support Vector Machines)
• Use of query logs and user profiles – can address the problem on an individual basis
Smoothed Unigram Model [1]
[1] K. Collins-Thompson and J. Callan (2005). "Predicting reading difficulty with statistical language models". Journal of the American Society for Information Science and Technology, 56(13), pp. 1448-1462.
• They recast the well-studied problem of readability as text categorization and used straightforward techniques from statistical language modeling.
Limitation of their method: it requires training data, which may be difficult to obtain.
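In outline, their approach trains a smoothed unigram language model per grade level and assigns a document the grade whose model explains it best. A hedged Python sketch assuming simple Laplace smoothing (the paper uses a more refined smoothing scheme):

```python
import math
from collections import Counter

def train_grade_model(docs, vocab_size, alpha=1.0):
    # One smoothed unigram model per grade level. Laplace smoothing is
    # our simplification; the paper uses a refined smoothing scheme.
    counts = Counter(w for d in docs for w in d)
    total = sum(counts.values())
    return lambda w: (counts[w] + alpha) / (total + alpha * vocab_size)

def predict_grade(doc, grade_models):
    # Pick the grade whose language model gives the document the
    # highest log-likelihood. grade_models: {grade: model function}.
    return max(grade_models,
               key=lambda g: sum(math.log(grade_models[g](w)) for w in doc))
```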
Domain-specific Readability
• Jin Zhao and Min-Yen Kan (2010). "Domain-specific iterative readability computation". In Proceedings of the 10th Annual Joint Conference on Digital Libraries (JCDL '10).
Based on the web link-structure algorithms HITS (Hypertext Induced Topic Search) and SALSA (Stochastic Approach for Link-Structure Analysis).
• Xin Yan, Dawei Song, and Xue Li (2006). "Concept-based document readability in domain specific information retrieval". In Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM '06).
Based on an ontology. Tested only in the medical domain.
I will focus on this work.
Overview
• The authors state that document scope and document cohesion are important parameters for identifying simple texts.
• They use a controlled-vocabulary thesaurus called Medical Subject Headings (MeSH).
• They point out that readability formulae are not directly applicable to web pages.
MeSH Ontology
[Figure: the MeSH concept hierarchy; concept difficulty increases with depth and decreases toward the root]
Overall Concept Based Readability Score
where:
• DaCw = Dale-Chall readability measure
• PWD = percentage of difficult words
• AvgSL = average sentence length in document d_i
(Their work focused on word-level readability, hence only the PWD term is used.)
• len(c_i, c_j) = shortest path between concepts c_i and c_j in the MeSH hierarchy
• N = total number of domain concepts in document d_i
• Depth(c_i) = depth of concept c_i in the concept hierarchy
• D = maximum depth of the concept hierarchy
• Number of associations = total number of mutual associations among concepts
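For reference, DaCw above denotes the classic Dale-Chall measure; a small Python helper with the standard published constants (not taken from this slide):

```python
def dale_chall(pct_difficult_words, avg_sentence_length):
    # Classic Dale-Chall readability formula (standard constants).
    # Yan et al. keep only the difficult-word component (PWD) for
    # word-level readability.
    score = 0.1579 * pct_difficult_words + 0.0496 * avg_sentence_length
    if pct_difficult_words > 5.0:
        score += 3.6365  # standard adjustment for harder texts
    return score
```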
Use of Query Log Data
• Such studies have been conducted by search engine companies
• They require proprietary data that is not publicly available
• Thus not very useful to the research community, because the results cannot be replicated
J. Kim, K. Collins-Thompson, P. N. Bennett, and S. Dumais (2012). "Characterizing Web Content, User Interests, and Search Behavior by Reading Level and Topic". In Proceedings of WSDM 2012. (Microsoft Research)
Chenhao Tan, Evgeniy Gabrilovich, and Bo Pang (2012). "To each his own: personalized content selection based on text comprehensibility". In Proceedings of WSDM 2012. (Yahoo! Research)
Our approach
Sequential Term Transition Model (STTM)
A conceptual difficulty determination model which is:
• Unsupervised
• Does not require any knowledge base or annotated data
Methodology
• We first build a term-document matrix W
• We then perform Singular Value Decomposition (SVD) on the matrix: W ≈ W' = USV^T
• U is a T×f matrix of left singular vectors
• V is a D×f matrix of right singular vectors
• S is an f×f diagonal matrix of singular values
where T is the number of terms in the vocabulary, D is the number of documents in the collection, and f is the number of factors.
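A minimal sketch of this step in Python with SciPy's truncated SVD; the matrix sizes, density, and number of factors are hypothetical placeholders:

```python
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Hypothetical T x D term-document matrix (rows = terms, columns = documents).
T, D, f = 5000, 1000, 100
W = sparse_random(T, D, density=0.01, format="csr")

U, s, Vt = svds(W, k=f)      # truncated SVD: W ~ U diag(s) Vt

term_vectors = U * s         # terms embedded in the f-dimensional space
doc_vectors = Vt.T * s       # documents embedded in the same space
```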
Observations in the SVD space
1. Terms which are central to a document come close to their document vectors
2. General terms lie far away from their document vectors
3. Semantically related terms cluster close to each other
4. Unrelated terms cluster away from each other
Computing Term Difficulties
The score is computed from the normalized term vector, the normalized document vectors, and the matrix of normalized vectors of the documents that contain the term.
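The slide's labels suggest the score is derived from the geometry between a term's normalized vector and the normalized vectors of the documents containing it. A sketch under that assumption; the exact scoring function is defined in the paper:

```python
import numpy as np

def term_difficulty(term_vec, doc_matrix):
    # term_vec: latent vector of a term; doc_matrix: rows are latent
    # vectors of the documents containing that term.
    t = term_vec / np.linalg.norm(term_vec)
    D = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    # Assumption: terms close to their documents are central and specific
    # (harder); general terms lie far away (easier).
    return float(np.mean(D @ t))
```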
General Idea about Linear Embedding
[Figure: a term t embedded in the latent space among document vectors D1–D6 and term vectors w1–w6]
Cohesion
When units tend to “stick together”, the property is called cohesion
We compute cohesion between terms in sequence
The more cohesive the terms in a document are, the easier it is for a person to comprehend the discourse
Computation of Cohesion
• We know related terms cluster close to each other in the latent space obtained via SVD
• We must compute the cluster membership of each term, since SVD does not directly assign terms to clusters
• We use k-means because of its simplicity and its ability to handle large datasets
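A minimal sketch of this clustering step with scikit-learn; the number of clusters k is a hypothetical choice:

```python
from sklearn.cluster import KMeans

# term_vectors: T x f matrix of term embeddings from the SVD step above.
k = 50                                # hypothetical number of clusters
kmeans = KMeans(n_clusters=k, n_init=10).fit(term_vectors)

labels = kmeans.labels_               # cluster membership of each term
centroids = kmeans.cluster_centers_   # one centroid per cluster
```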
How do we compute cohesion?
Given the document's term sequence W1 W2 W3 W4 W5 W6 W7 W8 W9 W10:
• Determine the cluster memberships of the two consecutive terms W1 and W2: C1 and C1. Same cluster, so we conclude they are cohesive.
• Move to the next pair, W2 and W3: C1 and C1. Same cluster, so again cohesive.
• Next pair, W3 and W4: C1 and C4. Different clusters, so we compute the cosine similarity between the centroids of C1 and C4.
Cohesion using cosine similarity
If the cluster centroids are close to each other, the cosine similarity will be high. A high cosine similarity means that the two clusters are closely related.
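Putting the walkthrough into code: a sketch that scores each consecutive term pair, assuming same-cluster pairs score 1 and the document's cohesion is the average over pairs (details the slides leave implicit):

```python
import numpy as np

def cohesion_score(term_ids, labels, centroids):
    # term_ids: the document's terms in order, as indices into labels.
    scores = []
    for a, b in zip(term_ids, term_ids[1:]):
        ca, cb = labels[a], labels[b]
        if ca == cb:
            scores.append(1.0)  # same cluster: fully cohesive pair
        else:
            # Different clusters: cosine similarity of the two centroids.
            u, v = centroids[ca], centroids[cb]
            scores.append(float(u @ v /
                                (np.linalg.norm(u) * np.linalg.norm(v))))
    return float(np.mean(scores)) if scores else 0.0
```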
Conceptual Difficulty Score
The conceptual difficulty score for document j combines two components:
• the term difficulty score of document j
• the cohesion score of document j
with a parameter β ∈ [0, 1] controlling their relative weights.
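A sketch assuming the two components are mixed as a convex combination; the exact formula is in the paper, and flipping cohesion to (1 − cohesion) reflects that higher cohesion means easier text:

```python
def conceptual_difficulty(term_difficulty_j, cohesion_j, beta=0.5):
    # Assumption: convex combination with weight beta in [0, 1].
    return beta * term_difficulty_j + (1 - beta) * (1 - cohesion_j)
```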
Empirical Evaluation - Dataset
• Standard test collections do not have readability judgments
• We chose the Psychology domain
• Crawled web pages from Wikipedia, Psychology.com, and Simple English Wikipedia
• Total web page count: 167,400
• No term stemming
• Tested both with and without stopword removal
Retrieval of web pages
• Indexed the web pages using Zettair, a small-scale search engine
• Retrieved web pages for a query based on relevance
• Followed INEX's query/topic generation guidelines
• Re-ranked web pages based on conceptual difficulty
• Annotated top-10 documents for each query
Evaluation Metric
Normalized Discounted Cumulative Gain (NDCG)
Well suited for ranking evaluation because, unlike precision, recall, or rank-order correlation measures, it takes into account the position of an entity in the ranked list.
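A small reference implementation of NDCG with linear gains (one common variant):

```python
import numpy as np

def dcg(gains):
    # Discounted Cumulative Gain with a log2 position discount.
    return sum(g / np.log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains, k=10):
    # Normalize by the ideal (sorted) ordering so scores lie in [0, 1].
    denom = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / denom if denom > 0 else 0.0

# Example: graded judgments of the top results for one query.
print(ndcg([3, 2, 3, 0, 1, 2], k=6))
```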
Results when β=0.5
Conclusions and Future Work
• We proposed a conceptual difficulty ranking model
• It requires no training data or ontology
• Main novelty: the use of a conceptual model
• It achieves significant improvement
• In the future, we will study how the link structure of the web could aid conceptual difficulty ranking