Ranking Text Documents Based on Conceptual Difficulty Using Term Embedding and Sequential Discourse Cohesion
Shoaib Jameel, Wai Lam and Xiaojun Qian
The Chinese University of Hong Kong
Outline
1. Introduction to Readability/Conceptual Difficulty
2. Motivation
3. Related Work
4. Our method (Sequential Term Transition Model (STTM))
5. Empirical Evaluation
6. Conclusions and Future Work
http://scienceforkids.kidipede.com/chemistry/atoms/proton.htm
http://en.wikipedia.org/wiki/Proton
Which of the two appears simpler to you?
Search for a keyword
Results – sometimes irrelevant and in mixed order of readability
An attempt by Google
Our Objective
1. Query
2. Retrieve web pages (considering relevance)
3. Re-rank web pages based on readability
All steps accomplished automatically.
What has been done so far?
• Heuristic readability formulae
• Unsupervised approaches
• Supervised approaches
Heuristic Readability Methods
These formulae have existed since the 1940s.
• Semantic component – number of syllables per word, syllable length, etc.
• Syntactic component – sentence length, etc.
Example – Flesch Reading Ease:
Score = 206.835 − 1.015 × (total words / total sentences) − 84.6 × (total syllables / total words)
The syllables-per-word term is the semantic component and the words-per-sentence term is the syntactic component; the numerical parameters are manually tuned.
Syllable decomposition examples:
• water → wa-ter (2 syllables)
• proton → pro-ton (2 syllables)
• embryology → em-bry-ol-o-gy (5 syllables)
• star → star (1 syllable)
Problem: syllable counts do not track conceptual difficulty; "proton" has no more syllables than "water", yet it is conceptually harder (see the sketch below).
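To make the formula concrete, here is a minimal Python sketch of the Flesch Reading Ease computation with a naive vowel-group syllable counter (the counter is our simplification, not part of the original formula):

```python
import re

def count_syllables(word):
    # Naive heuristic: count runs of consecutive vowels.
    # (Our simplification; real syllabifiers are more accurate.)
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    # Flesch Reading Ease: higher scores indicate easier text.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))  # syntactic component
            - 84.6 * (syllables / len(words)))       # semantic component

# "proton" and "water" both yield two syllables, so the formula
# cannot see that one is conceptually harder than the other.
print(flesch_reading_ease("The proton is a subatomic particle."))
```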
Supervised Learning Methods
• Language models (e.g., a unigram language model based method)
• SVMs (Support Vector Machines)
• Use of query logs and user profiles – can address the problem on an individual basis
Smoothed Unigram Model [1]
[1] K. Collins-Thompson and J. Callan (2005). "Predicting reading difficulty with statistical language models". Journal of the American Society for Information Science and Technology, 56(13), pp. 1448-1462.
• They recast the well-studied problem of readability as text categorization and used straightforward techniques from statistical language modeling.
Limitation of their method: it requires training data, which may be difficult to obtain.
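In outline, their approach trains a smoothed unigram language model per grade level and assigns a document the grade whose model explains it best. A hedged Python sketch assuming simple Laplace smoothing (the paper uses a more refined smoothing scheme):

```python
import math
from collections import Counter

def train_grade_model(docs, vocab_size, alpha=1.0):
    # One smoothed unigram model per grade level. Laplace smoothing is
    # our simplification; the paper uses a refined smoothing scheme.
    counts = Counter(w for d in docs for w in d)
    total = sum(counts.values())
    return lambda w: (counts[w] + alpha) / (total + alpha * vocab_size)

def predict_grade(doc, grade_models):
    # Pick the grade whose language model gives the document the
    # highest log-likelihood. grade_models: {grade: model function}.
    return max(grade_models,
               key=lambda g: sum(math.log(grade_models[g](w)) for w in doc))
```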
Domain-specific Readability
• Jin Zhao and Min-Yen Kan (2010). "Domain-specific iterative readability computation". In Proceedings of the 10th Annual Joint Conference on Digital Libraries (JCDL '10).
Based on the web link-structure algorithms HITS (Hypertext Induced Topic Search) and SALSA (Stochastic Approach for Link-Structure Analysis).
• Xin Yan, Dawei Song, and Xue Li (2006). "Concept-based document readability in domain specific information retrieval". In Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM '06).
Based on an ontology. Tested only in the medical domain.
I will focus on this work.
Overview
• The authors state that document scope and document cohesion are important parameters for identifying simple texts.
• They use a controlled-vocabulary thesaurus called Medical Subject Headings (MeSH).
• They point out that readability formulae are not directly applicable to web pages.
MeSH Ontology
[Figure: the MeSH concept hierarchy; concept difficulty increases with depth and decreases toward the root]
Overall Concept Based Readability Score
where:
• DaCw = Dale-Chall readability measure
• PWD = percentage of difficult words
• AvgSL = average sentence length in document d_i
(Their work focused on word-level readability, hence only the PWD term is used.)
• len(c_i, c_j) = shortest path between concepts c_i and c_j in the MeSH hierarchy
• N = total number of domain concepts in document d_i
• Depth(c_i) = depth of concept c_i in the concept hierarchy
• D = maximum depth of the concept hierarchy
• Number of associations = total number of mutual associations among concepts
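For reference, DaCw above denotes the classic Dale-Chall measure; a small Python helper with the standard published constants (not taken from this slide):

```python
def dale_chall(pct_difficult_words, avg_sentence_length):
    # Classic Dale-Chall readability formula (standard constants).
    # Yan et al. keep only the difficult-word component (PWD) for
    # word-level readability.
    score = 0.1579 * pct_difficult_words + 0.0496 * avg_sentence_length
    if pct_difficult_words > 5.0:
        score += 3.6365  # standard adjustment for harder texts
    return score
```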
Use of Query Log Data
• Such studies have been conducted by search engine companies
• They require proprietary data that is not publicly available
• Thus not very useful to the research community, because the results cannot be replicated
J. Kim, K. Collins-Thompson, P. N. Bennett, and S. Dumais (2012). "Characterizing Web Content, User Interests, and Search Behavior by Reading Level and Topic". In Proceedings of WSDM 2012. (Microsoft Research)
Chenhao Tan, Evgeniy Gabrilovich, and Bo Pang (2012). "To each his own: personalized content selection based on text comprehensibility". In Proceedings of WSDM 2012. (Yahoo! Research)
Our approach
Sequential Term Transition Model (STTM)
A conceptual difficulty determination model which is:
• Unsupervised
• Does not require any knowledge base or annotated data
Methodology
• We first build a term-document matrix W
• We then perform Singular Value Decomposition (SVD) on the matrix: W ≈ W' = USV^T
• U is a T×f matrix of left singular vectors
• V is a D×f matrix of right singular vectors
• S is an f×f diagonal matrix of singular values
where T is the number of terms in the vocabulary, D is the number of documents in the collection, and f is the number of factors.
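A minimal sketch of this step in Python with SciPy's truncated SVD; the matrix sizes, density, and number of factors are hypothetical placeholders:

```python
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Hypothetical T x D term-document matrix (rows = terms, columns = documents).
T, D, f = 5000, 1000, 100
W = sparse_random(T, D, density=0.01, format="csr")

U, s, Vt = svds(W, k=f)      # truncated SVD: W ~ U diag(s) Vt

term_vectors = U * s         # terms embedded in the f-dimensional space
doc_vectors = Vt.T * s       # documents embedded in the same space
```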
Observations in the SVD space
1. Terms which are central to a document come close to their document vectors
2. General terms lie far away from their document vectors
3. Semantically related terms cluster close to each other
4. Unrelated terms cluster away from each other
Computing Term Difficulties
The score is computed from the normalized term vector, the normalized document vectors, and the matrix of normalized vectors of the documents that contain the term.
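The slide's labels suggest the score is derived from the geometry between a term's normalized vector and the normalized vectors of the documents containing it. A sketch under that assumption; the exact scoring function is defined in the paper:

```python
import numpy as np

def term_difficulty(term_vec, doc_matrix):
    # term_vec: latent vector of a term; doc_matrix: rows are latent
    # vectors of the documents containing that term.
    t = term_vec / np.linalg.norm(term_vec)
    D = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    # Assumption: terms close to their documents are central and specific
    # (harder); general terms lie far away (easier).
    return float(np.mean(D @ t))
```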
General Idea about Linear Embedding
[Figure: a term t embedded in the latent space among document vectors D1–D6 and term vectors w1–w6]
Cohesion
When units tend to “stick together”, the property is called cohesion
We compute cohesion between terms in sequence
The more cohesive the terms in a document are, the easier it is for a person to comprehend the discourse
Computation of Cohesion
• We know related terms cluster close to each other in the latent space obtained via SVD
• We must compute the cluster membership of each term, since SVD does not directly assign terms to clusters
• We use k-means because of its simplicity and its ability to handle large datasets
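A minimal sketch of this clustering step with scikit-learn; the number of clusters k is a hypothetical choice:

```python
from sklearn.cluster import KMeans

# term_vectors: T x f matrix of term embeddings from the SVD step above.
k = 50                                # hypothetical number of clusters
kmeans = KMeans(n_clusters=k, n_init=10).fit(term_vectors)

labels = kmeans.labels_               # cluster membership of each term
centroids = kmeans.cluster_centers_   # one centroid per cluster
```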
How do we compute cohesion?
Given the document's term sequence W1 W2 W3 W4 W5 W6 W7 W8 W9 W10:
• Determine the cluster memberships of the two consecutive terms W1 and W2: C1 and C1. Same cluster, so we conclude they are cohesive.
• Move to the next pair, W2 and W3: C1 and C1. Same cluster, so again cohesive.
• Next pair, W3 and W4: C1 and C4. Different clusters, so we compute the cosine similarity between the centroids of C1 and C4.
Cohesion using cosine similarity
If the cluster centroids are close to each other, the cosine similarity will be high. A high cosine similarity means that the two clusters are closely related.
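Putting the walkthrough into code: a sketch that scores each consecutive term pair, assuming same-cluster pairs score 1 and the document's cohesion is the average over pairs (details the slides leave implicit):

```python
import numpy as np

def cohesion_score(term_ids, labels, centroids):
    # term_ids: the document's terms in order, as indices into labels.
    scores = []
    for a, b in zip(term_ids, term_ids[1:]):
        ca, cb = labels[a], labels[b]
        if ca == cb:
            scores.append(1.0)  # same cluster: fully cohesive pair
        else:
            # Different clusters: cosine similarity of the two centroids.
            u, v = centroids[ca], centroids[cb]
            scores.append(float(u @ v /
                                (np.linalg.norm(u) * np.linalg.norm(v))))
    return float(np.mean(scores)) if scores else 0.0
```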
Conceptual Difficulty Score
The conceptual difficulty score for document j combines two components:
• the term difficulty score of document j
• the cohesion score of document j
with a parameter β ∈ [0, 1] controlling their relative weights.
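A sketch assuming the two components are mixed as a convex combination; the exact formula is in the paper, and flipping cohesion to (1 − cohesion) reflects that higher cohesion means easier text:

```python
def conceptual_difficulty(term_difficulty_j, cohesion_j, beta=0.5):
    # Assumption: convex combination with weight beta in [0, 1].
    return beta * term_difficulty_j + (1 - beta) * (1 - cohesion_j)
```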
Empirical Evaluation - Dataset
• Standard test collections do not have readability judgments
• We chose the Psychology domain
• Crawled web pages from Wikipedia, Psychology.com, and Simple English Wikipedia
• Total web page count: 167,400
• No term stemming
• Tested both with and without stopword removal
Retrieval of web pages
• Indexed the web pages using Zettair, a small-scale search engine
• Retrieved web pages for a query based on relevance
• Followed INEX's query/topic generation guidelines
• Re-ranked web pages based on conceptual difficulty
• Annotated top-10 documents for each query
Evaluation Metric
Normalized Discounted Cumulative Gain (NDCG)
Well suited for ranking evaluation because, unlike precision, recall, or rank-order correlation measures, it takes into account the position of an entity in the ranked list.
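A small reference implementation of NDCG with linear gains (one common variant):

```python
import numpy as np

def dcg(gains):
    # Discounted Cumulative Gain with a log2 position discount.
    return sum(g / np.log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains, k=10):
    # Normalize by the ideal (sorted) ordering so scores lie in [0, 1].
    denom = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / denom if denom > 0 else 0.0

# Example: graded judgments of the top results for one query.
print(ndcg([3, 2, 3, 0, 1, 2], k=6))
```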
Results when β=0.5
Conclusions and Future Work
• We proposed a conceptual difficulty ranking model
• It requires no training data or ontology
• Main novelty: the use of a conceptual model
• It achieves significant improvement
• In the future, we will study how the link structure of the web could aid conceptual difficulty ranking