Domain-Specific Iterative Readability Computation
Jin Zhao
13/05/2011
Jin Zhao and Min-Yen Kan
WING, NUS
Domain-Specific Resources
Modular arithmetic page from Wikipedia
Modular arithmetic page from Interactivate.com
Domain-specific resources target varying audiences.
Challenge for a Domain-Specific Search Engine
How to measure readability for domain-specific resources?
Literature Review
• Heuristic-based Readability Measures
– Weighted sum of text feature values
– Examples: Flesch Kincaid Reading Ease (FKRE) [Flesch48], Dale-Chall readability formula [Dale&Chall48]
Quick and indicative, but they often oversimplify
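To make "weighted sum of text feature values" concrete, here is a minimal sketch of one such measure. It assumes FKRE refers to the Flesch Reading Ease formula of [Flesch48], whose published coefficients are 206.835, 1.015 and 84.6. The sentence splitter and the vowel-group syllable counter are rough assumptions for illustration, not part of the original formula.

```python
import re

def count_syllables(word: str) -> int:
    """Rough estimate (assumption): one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text: str) -> float:
    """Weighted sum of two text features: words per sentence and syllables per word."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

score = flesch_reading_ease("The cat sat. The dog ran.")
```

Only surface features enter the score, which is why such measures cannot tell an easy concept from a hard one of similar word length.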
Literature Review
• Natural Language Processing and Machine Learning Approaches
– Extract deep text features and use supervised learning methods to generate models for readability measurement
– Text features: unigram [Collins-Thompson04], parse tree height [Schwarm05], discourse relations [Pitler08]
– Supervised learning techniques: Support Vector Machine (SVM) [Schwarm05], k-Nearest Neighbor (KNN) [Heilman07]
More accurate, but they require an annotated corpus and ignore domain-specific concepts
Literature Review
• Domain-Specific Readability Measures
– Derive information about domain-specific concepts from expert knowledge sources
– Examples: Open Access and Collaborative Consumer Health Vocabulary [Kim07], Medical Subject Headings ontology [Yan06]
– Handle domain-specific concepts, but expert knowledge sources are still expensive and not always available
Key qualities of a good readability measure: effective, portable and domain-aware.
Intuitions
A domain-specific resource is less readable if it contains more difficult concepts
A domain-specific concept is more difficult if it appears in less readable resources
• Use an iterative computation algorithm to estimate these two scores (resource readability and concept difficulty) from each other
• Example:
– Pythagorean theorem vs. ring theory
Iterative Computation (IC) Algorithm
• Graph Construction
– Construct a graph representing resources, concepts and occurrence information
• Score Computation
– Initialize and iteratively compute the readability score of domain-specific resources and the difficulty score of domain-specific concepts
– Two versions: heuristic and probabilistic
• Required Input
– A collection of domain-specific resources
– A list of domain-specific concepts
Graph Construction
Resource 1: …Pythagorean theorem can be written as a² + b² = c², where c represents the length of the hypotenuse…
Resource 2: …The sine function (sin) can be defined as the ratio of the side opposite the angle to the hypotenuse…
Concept List: … right triangle, Pythagorean theorem, hypotenuse, sine function, cosine function …
[Figure: graph with resource nodes Resource 1 and Resource 2 and concept nodes Pythagorean theorem, hypotenuse, sine function, right triangle, cosine function]
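The graph construction step can be sketched as follows, using the two resource excerpts and the concept list from this slide. The function name `build_graph` and the whole-phrase, case-insensitive matching rule are assumptions for illustration; the slides do not say how occurrences are detected.

```python
import re

def build_graph(resources: dict[str, str], concepts: list[str]) -> dict[str, set[str]]:
    """Map each resource name to the set of listed concepts that occur in its text."""
    graph = {}
    for name, text in resources.items():
        found = set()
        for concept in concepts:
            # Whole-phrase, case-insensitive match, so that "sine function"
            # is not detected inside "cosine function".
            pattern = r"\b" + re.escape(concept) + r"\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                found.add(concept)
        graph[name] = found
    return graph

resources = {
    "Resource 1": "Pythagorean theorem can be written as a^2 + b^2 = c^2, "
                  "where c represents the length of the hypotenuse",
    "Resource 2": "The sine function (sin) can be defined as the ratio of "
                  "the side opposite the angle to the hypotenuse",
}
concepts = ["right triangle", "Pythagorean theorem", "hypotenuse",
            "sine function", "cosine function"]

graph = build_graph(resources, concepts)
```

On these excerpts both resources link to "hypotenuse", while "right triangle" and "cosine function" receive no edges.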
Score Computation (Heuristic)
• Initialization
– Resource node: FKRE
– Concept node: average score of its adjacent nodes
• Iterative Computation
– Each node: original score + average of the original scores of its adjacent nodes

Initialization
  Resource nodes:  w = 1.00   x = 3.00   y = 2.00   z = 4.00
  Concept nodes:   a = 2.00   b = 2.50   c = 3.00

Iteration 1
  Resource nodes:  w = 3.00   x = 5.25   y = 4.75   z = 7.00
  Concept nodes:   a = 4.00   b = 5.00   c = 6.00
Score Computation (Heuristic)
Iteration 2
  Resource nodes:  w = 7.00   x = 9.75   y = 10.25   z = 13.00
  Concept nodes:   a = 8.13   b = 10.00  c = 11.88

Iteration 3
  Resource nodes:  w = 15.13  x = 18.82  y = 21.19   z = 24.88
  Concept nodes:   a = 16.51  b = 20.00  c = 23.51

• Termination Condition
– The rank order of the resource nodes stabilizes
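The heuristic computation can be sketched in a few lines. Two things here are assumptions rather than facts from the slides: the edge set (w-a, x-a, x-b, y-b, y-c, z-c), which is inferred because it reproduces the scores shown, and the `patience` rule, which reads "stabilizes" as "unchanged for `patience` consecutive iterations". All function names are hypothetical.

```python
def build_neighbours(edges):
    """edges maps each resource to its concepts; returns adjacency lists for all nodes."""
    neighbours = {resource: list(concepts) for resource, concepts in edges.items()}
    for resource, concepts in edges.items():
        for concept in concepts:
            neighbours.setdefault(concept, []).append(resource)
    return neighbours

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def initialize(resource_scores, neighbours):
    """Concept nodes start at the average score of their adjacent resource nodes."""
    scores = dict(resource_scores)
    for node, adjacent in neighbours.items():
        if node not in resource_scores:
            scores[node] = mean(resource_scores[r] for r in adjacent)
    return scores

def iterate_once(scores, neighbours):
    """Synchronous update: previous score + average of the neighbours' previous scores."""
    return {node: scores[node] + mean(scores[m] for m in adjacent)
            for node, adjacent in neighbours.items()}

def run_until_stable(resource_scores, edges, patience=2, max_iterations=100):
    """Iterate until the resource rank order is unchanged `patience` times in a row."""
    neighbours = build_neighbours(edges)
    resources = list(resource_scores)
    scores = initialize(resource_scores, neighbours)
    order = sorted(resources, key=scores.get)
    unchanged = 0
    for iteration in range(1, max_iterations + 1):
        scores = iterate_once(scores, neighbours)
        new_order = sorted(resources, key=scores.get)
        unchanged = unchanged + 1 if new_order == order else 0
        order = new_order
        if unchanged >= patience:
            break
    return scores, order, iteration

# Inferred edges and the initial resource scores from the slide example.
edges = {"w": ["a"], "x": ["a", "b"], "y": ["b", "c"], "z": ["c"]}
initial = {"w": 1.0, "x": 3.0, "y": 2.0, "z": 4.0}

neighbours = build_neighbours(edges)
start = initialize(initial, neighbours)
first = iterate_once(start, neighbours)
final_scores, final_order, n_iterations = run_until_stable(initial, edges)
```

The sketch keeps full precision, so a few iteration 3 values differ from the slide in the last digit (for example 18.8125 against 18.82); the slide figures appear to be computed from rounded intermediate scores. In this example the resource order changes from w, y, x, z to w, x, y, z at iteration 2, which is why a single unchanged comparison would stop too early.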
Score Computation (Heuristic)
• Single-valued score for each node
– Unable to handle concepts of varying difficulties
• Simple averaging in score computation
– Difficult to incorporate sophisticated computational mechanisms
Score Computation (Probabilistic)
• Initialization
– Resource node: sentence sampling
– Concept node: resource sampling
[Figure at initialization: resource nodes w, x, y, z; concept nodes a, b, c]
Score Computation (Probabilistic)
• Iterative Computation
– Modified Naïve Bayes classification
[Equations not preserved in this transcript: the original Naïve Bayes formula, the modified formula, and its direct adaptation to resource nodes and concept nodes]
Evaluation
• Key qualities of a good readability measure
– Effectiveness
– Portability
– Domain-awareness
Effectiveness
• Corpus of Math Webpages
• Metrics:
– Pairwise accuracy
– Spearman’s rho
• Baselines:
– Heuristic: FKRE
– Supervised learning: NB, SVM, MaxEnt (binary concept features only)
Method   Pairwise   Spearman   Iterations
FKRE     .72        .48        -
NB       .72        .52        -
SVM      .80        .70        -
MaxEnt   .82        .67        -
HIC      .87        .75        18
PIC      .85        .73        7
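For reference, the two metrics can be sketched as below. The slides do not define pairwise accuracy, so the definition used here is an assumption: the fraction of resource pairs with different gold readability levels that the predicted scores order the same way. Spearman's rho is computed as the Pearson correlation of average ranks, which is the standard definition.

```python
from itertools import combinations

def pairwise_accuracy(gold, predicted):
    """Share of pairs with distinct gold levels whose order the prediction preserves."""
    correct = total = 0
    for i, j in combinations(range(len(gold)), 2):
        if gold[i] == gold[j]:
            continue  # pairs tied in the gold standard are skipped (assumption)
        total += 1
        if (gold[i] - gold[j]) * (predicted[i] - predicted[j]) > 0:
            correct += 1
    return correct / total

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation between the rank vectors of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical gold levels and predicted scores for four resources.
gold = [1, 2, 3, 4]
predicted = [10, 30, 20, 40]
accuracy = pairwise_accuracy(gold, predicted)
rho = spearman_rho(gold, predicted)
```

Both metrics depend only on the ordering of the scores, so the unbounded growth of the heuristic scores across iterations does not affect them.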
Portability
• Different selection strategies
– Resource selection at random
– Concept selection at random
– Resource selection by quality
– Concept selection by TF.IDF
• Performance measurement at 5 levels
– 20%, 40%, 60%, 80% and 100% of the original resource collection / concept list
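The selection setup can be sketched as follows. The helper names, the fixed seed, and the choice to keep the highest-scoring items for the quality and TF.IDF strategies are all assumptions; the slides only name the strategies and the five levels.

```python
import random

def select_at_random(items, fraction, seed=0):
    """Keep a random subset containing the given fraction of the items."""
    rng = random.Random(seed)
    k = round(len(items) * fraction)
    return rng.sample(list(items), k)

def select_by_score(scores, fraction):
    """Keep the top fraction of items by score (e.g. quality or TF.IDF); assumed rule."""
    k = round(len(scores) * fraction)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]

levels = [0.2, 0.4, 0.6, 0.8, 1.0]

# Hypothetical concept list and scores, for illustration only.
concepts = [f"concept_{i}" for i in range(10)]
random_subsets = {level: select_at_random(concepts, level) for level in levels}
top_two = select_by_score({"a": 3, "b": 1, "c": 2, "d": 5, "e": 4}, 0.4)
```

Each subset would then be fed to the readability measure and scored with the same metrics as in the effectiveness experiment.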
Portability
[Figures: Resource Selection Strategies; Concept Selection Strategies]
Portability
Method   Pairwise   Spearman
FKRE     .63        .28
NB       .73        .53
SVM      .82        .70
MaxEnt   .76        .60
HIC      .74        .49
PIC      .75        .55
Domain-awareness
• Handling of domain-specific concepts
– Simple yet effective
– Concepts of multiple difficulty levels?
  Scores converge to a single value even in PIC
  Splitting? (K-Means, GMM, etc.)
  Other computational mechanisms?
Conclusion
• Iterative Computation
– Estimate the readability of domain-specific resources and the difficulty of domain-specific concepts in an iterative manner
– Effective, portable and domain-aware
• Future Work
– Handling of concepts of multiple difficulty levels