Mining the Web for Information Organization
J. H. Wang, Academia Sinica
2
Outline
• Introduction
• Web Mining
• Cross-Language Web Search
• Other Applications
3
Introduction
• Huge amount of Web data
  – Rich and dynamic resources of human knowledge
  – Multimedia
  – Scalability
How to organize Web data into useful information?
4
Number of Web Pages
The world’s largest search engine?
[Chart: Billions of Textual Documents Indexed, December 1995-September 2003. KEY: GG=Google, ATW=AllTheWeb, INK=Inktomi, TMA=Teoma, AV=AltaVista.]
Source: Search Engine Watch (Nov. 2004)
Search Engine | Reported Size | Page Depth
Google | 8.1 billion | 101K
MSN | 5.0 billion | 150K
Yahoo | 4.2 billion (estimate) | 500K
Ask Jeeves | 2.5 billion | 101K+
5
Web Users and Pages (7 years ago)
Area | Users | Web Pages | Time
World-wide | 150M | 800M | 7/99
China | 4M | 2.5~3M | 7/99
Taiwan | 4M | 3M | 7/99
Challenge of Scalability!
Total Users: 800M
Chinese Users: 110M, including 87M (CN), 4.9M (HK), 11.6M (TW), 2.9M (MY), 2.14M (SG), and others.
Source: Global Reach, 2004
6
Web Mining
• Data Mining
• Text Mining
• Web Mining Technologies
7
Data Mining
• Data Mining (Knowledge Discovery in Databases) is a process of nontrivial extraction of implicit, previously unknown, and potentially useful information (such as knowledge rules, constraints, regularities) from data in databases [G. Piatetsky-Shapiro and W. J. Frawley]
8
Text Mining
• Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources [Marti Hearst]
9
Web Mining
• Web Mining is the use of data mining techniques to automatically discover and extract information from Web documents and services [O. Etzioni]
10
Comparison
• Data mining tries to find interesting (non-trivial, implicit, previously unknown, potentially useful) patterns from large databases
• In text mining, the patterns are extracted from natural language texts rather than from structured databases of facts
• Web mining discovers and extracts information from Web documents and services
11
Web Mining Technologies
• Web content mining
• Web structure mining
• Web usage mining
12
Web Content Mining
• Unstructured documents
  – Free texts such as news articles
• Semi-structured documents
  – HTML structures and hyperlink information
  – Intra-document structure
• Applications: text categorization, text clustering, information extraction, computational linguistics, …
13
Web Structure Mining
• The structure of the hyperlinks within the Web
  – Inter-document structure
  – HITS, PageRank
• Social network and citation analysis
• Applications: to calculate the quality rank or relevancy of each Web page, Web page categorization, …
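As an illustration of link-based ranking, here is a minimal PageRank sketch using power iteration. The toy link graph, the damping factor of 0.85, the iteration count, and the function name `pagerank` are all assumptions made for this example; they do not come from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank. `links` maps a page to the pages it links to."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Dangling page (no out-links): spread its rank over all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank


# Hypothetical four-page Web: three pages link to C, and C links back to A.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
scores = pagerank(toy_web)
best_page = max(scores, key=scores.get)
```

In this toy graph the page with the most in-links (C) ends up with the highest score, which is the intuition behind using hyperlink structure as a quality signal.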
14
Web Usage Mining
• Techniques that could predict user behavior while the user interacts with the Web
  – To map the usage data of the Web server into relational tables
  – To use the log data directly
• Applications: learning a user profile (personalization) vs. learning user navigation patterns
15
Related Fields of Research
• IR (Information Retrieval)
• IE (Information Extraction)
• ML (Machine Learning)
16
LiveTrans: Cross-language Web Search
• LiveTrans: http://livetrans.iis.sinica.edu.tw/lt.html
17
Examples
18
More Examples
19
Cross Language Information Retrieval (CLIR)
• A technology enabling users to query in one language and retrieve relevant documents written or indexed in another language
20
Cross Language Web Search
• A technology enabling users to query in one language and retrieve relevant Web pages written or indexed in another language
21
Why “Cross-Language”?
• Source: Global Reach (global-reach.biz/globstats)
22
Top Ten Languages Used in the Web
Source: Internet World Stats (Sep. 20, 2006)
Top Ten Languages in the Internet
Language | % of All Internet Users | Internet Users by Language | Internet Penetration by Language | Internet Growth for Language (2000-2006) | World Population for the Language (2006 Estimate)
English | 29.7 % | 322,600,837 | 28.7 % | 135.2 % | 1,125,664,397
Chinese | 13.3 % | 144,301,513 | 10.8 % | 346.7 % | 1,340,767,863
Japanese | 7.9 % | 86,300,000 | 67.2 % | 83.3 % | 128,389,000
Spanish | 7.5 % | 81,729,671 | 18.7 % | 231.1 % | 437,502,257
German | 5.4 % | 58,854,682 | 61.3 % | 113.2 % | 95,982,043
French | 4.6 % | 49,660,498 | 13.0 % | 307.1 % | 381,193,149
Portuguese | 3.1 % | 34,064,760 | 14.8 % | 349.6 % | 230,846,275
Korean | 3.1 % | 32,372,000 | 45.8 % | 78.0 % | 73,945,860
Italian | 2.7 % | 28,870,000 | 48.8 % | 118.7 % | 59,115,261
Russian | 2.2 % | 23,700,000 | 16.5 % | 664.5 % | 143,682,757
Top Ten Languages | 79.5 % | 863,981,961 | 21.5 % | 166.7 % | 4,017,088,863
Rest of World Languages | 20.5 % | 222,268,942 | 9.0 % | 500.0 % | 2,482,608,197
World Total | 100.0 % | 1,086,250,903 | 16.7 % | 200.9 % | 6,499,697,060
[Chart: Top Ten Languages Used in the Web (Number of Internet Users by Language)]
More and more non-English users!
23
Web Content by Language
Source: http://www.netz-tipp.de/languages.html (2002)
[Chart: Web Content by Language, 2002. x-axis: Language (English, German, French, Japanese, Spanish, Chinese, Italian, Dutch, Russian, Korean, Portuguese); y-axis: Millions of Web pages, 0 to 1200]
More and more non-English pages
24
Number of Chinese Web Pages
866,000,000 pages
Scalability Problem!
25
Challenge of Cross-Language Web Search
• Existing CLIR systems mostly rely on bilingual dictionaries and dictionary lookup
• Translations for 81% of the search terms could not be found in common English-Chinese translation dictionaries
  – Examples: 中央處理器 (CPU), 電子商務 (E-commerce), 個人數位助理 (PDA), 雅虎 (Yahoo), 太空總署 (NASA), 星際大戰 (Star Wars), 非典型肺炎 (SARS), …
26
Challenge
• Existing CLIR systems mostly rely on bilingual dictionaries and dictionary lookup
• Translations for 81% of the search requests could not be found in common English-Chinese translation dictionaries
• How to find effective translations automatically for query terms not included in a dictionary ?
27
CLIR
• Conventional approach to query translation
  – Parallel documents as the corpus
  – Assume long queries
• Problems of CLIR in Web search
  – No corpus for cross-lingual training
  – Short queries
  – “Out-of-dictionary” terms, e.g., proper nouns, new terminologies, …
English Terminologies | Chinese Translation
mechanical strain | 機械應變
viscous damping | 黏滯阻尼
Richard Feynman | 費曼
Hypoplastic Left Heart Syndrome | 左心發育不全症候群
NII Japan | 國立情報學研究所
SARS | 嚴重急性呼吸道症候群
Extracorporeal Shock Wave Lithotripsy | 震波碎石
Davinci | 達文西
28
Translation Lexicon Construction for CLIR
• To use the Web as the corpus for query translation
  – Web mining techniques
    • Anchor-text-based [ACM TOIS ‘04, ACM TALIP ‘02]
    • Search-result-based [JCDL ‘04]
• To extract terms from real document collections as possible queries
  – Term extraction method [SIGIR ‘97]
29
Web Mining Approach to Term Translation Extraction
[Diagram: the source query “Academia Sinica” goes to the LiveTrans Engine, which mines anchor texts and search results from the Web and returns the target translations 中央研究院 / 中研院]
30
Search-Result Page – National Palace Museum vs. 故宮博物院
• Mixed-language characteristic in Chinese pages
• How to extract translation candidates?
• Which candidates to choose?
[Screenshot annotation: Noise]
31
Coverage Rate of Top-Ranked Search-Result Pages
• 95% of popular Web queries’ translations can be found in top 30-40 result pages
• About 70% of random queries were covered
• Many relevant translations can also be found
32
Anchor-Text Set -- Yahoo vs. 雅虎
• Anchor text (link text)
  – The descriptive text of a link on a Web page
• Anchor-text set
  – A set of anchor texts pointing to the same page (URL)
  – Multilingual translations
    • Yahoo / 雅虎 / 야후
    • America / 美国 / アメリカ
• Anchor-text-set corpus
  – A collection of anchor-text sets
[Diagram: anchor texts in several languages and regions (America, Taiwan, China, Japan, Korea) pointing to http://www.yahoo.com: Yahoo Search Engine, Yahoo!, 美国雅虎, 雅虎搜尋引擎, 야후 -USA, アメリカの Yahoo!]
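A minimal sketch of how anchor-text sets could be assembled from crawled links: anchor texts are grouped by the URL they point to. The input format (a list of `(anchor_text, target_url)` pairs), the function name `build_anchor_text_sets`, and the `example.org` link are assumptions for illustration, not part of the LiveTrans system.

```python
from collections import defaultdict


def build_anchor_text_sets(links):
    """Group anchor texts by target URL.

    `links` is an iterable of (anchor_text, target_url) pairs, e.g. as
    collected by a Web spider. Returns {url: set of anchor texts}.
    """
    anchor_sets = defaultdict(set)
    for anchor_text, url in links:
        text = anchor_text.strip()
        if text:  # skip empty anchors such as image-only links
            anchor_sets[url].add(text)
    return dict(anchor_sets)


# Hypothetical crawl output: three pages in different languages link to Yahoo.
crawled_links = [
    ("Yahoo", "http://www.yahoo.com"),
    ("雅虎", "http://www.yahoo.com"),
    ("야후", "http://www.yahoo.com"),
    ("Example", "http://example.org/"),
]
corpus = build_anchor_text_sets(crawled_links)
```

Each value in `corpus` is one anchor-text set; terms that share a set (here Yahoo, 雅虎, 야후) are translation candidates for one another.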
33
Problems
• How to extract translation candidates with correct lexical boundary?
  – Term extraction
    • From the search-result pages
    • From the document collections
  – Bilingual lexicon construction
• How to choose relevant candidates?
  – Term translation
34
Term Translation Extraction from Different Resources
[Diagram: a source query (e.g., National Palace Museum) is sent to a search engine to obtain search-result pages, and a Web spider collects the anchor-text corpus; term extraction and similarity estimation over these resources produce the target translations (國立故宮博物院, 故宮, 故宮博物院)]
35
Term Extraction
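• Problem: extracting terms with correct lexical boundaries from documents (Doc.) in the digital library (DL)
  – Correctly segmented: 國立故宮博物院, 故宮博物院, 故宮
  – Incorrect text segments: 立故宮, 故宮博, 宮博物, 宮博, …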
36
Three Approaches to Term Extraction
• NLP (Linguistic) approach
  – Named entity recognition
• Extraction pattern/template approach
  – Wrapper generation/induction
• Statistical approach
  – Class-based language model
  – PAT-tree-based
37
PAT-tree-based Term Extraction
• SCP (Symmetric Conditional Probability)
  – Cohesion holding the words together
  – Low frequency n-grams tend to be discarded
• CD (Context Dependency)
  – Dependence on the left- or right-adjacent word/character
  – Low frequency n-grams can be extracted
• SCPCD: a combination of the two
38
Association Measure
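\[ SCP(w_1 \ldots w_n) = \frac{p(w_1 \ldots w_n)^2}{\frac{1}{n-1} \sum_{i=1}^{n-1} p(w_1 \ldots w_i)\, p(w_{i+1} \ldots w_n)} = \frac{freq(w_1 \ldots w_n)^2}{\frac{1}{n-1} \sum_{i=1}^{n-1} freq(w_1 \ldots w_i)\, freq(w_{i+1} \ldots w_n)} \]

\[ CD(w_1 \ldots w_n) = \frac{LC(w_1 \ldots w_n)\, RC(w_1 \ldots w_n)}{freq(w_1 \ldots w_n)^2} \]

\[ SCPCD(w_1 \ldots w_n) = SCP(w_1 \ldots w_n)\, CD(w_1 \ldots w_n) = \frac{LC(w_1 \ldots w_n)\, RC(w_1 \ldots w_n)}{\frac{1}{n-1} \sum_{i=1}^{n-1} freq(w_1 \ldots w_i)\, freq(w_{i+1} \ldots w_n)} \]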
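A minimal sketch of the SCPCD association measure on a toy corpus: the score is LC × RC divided by the average, over all split points, of the product of the frequencies of the two parts. Several choices here are assumptions for illustration: n-grams are taken at the character level, LC and RC are computed as the number of distinct left and right neighbours, string boundaries count as a neighbour (`^` and `$`), and a plain dictionary stands in for the PAT tree.

```python
from collections import Counter, defaultdict


def ngram_stats(texts, max_n):
    """Count character n-grams (1..max_n) and collect their left/right neighbours."""
    freq = Counter()
    left = defaultdict(set)
    right = defaultdict(set)
    for text in texts:
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                gram = text[i:i + n]
                freq[gram] += 1
                # Assumption: a string boundary counts as one distinct neighbour.
                left[gram].add(text[i - 1] if i > 0 else "^")
                right[gram].add(text[i + n] if i + n < len(text) else "$")
    return freq, left, right


def scpcd(gram, freq, left, right):
    """SCPCD = LC * RC / average over split points of freq(prefix) * freq(suffix)."""
    n = len(gram)
    if n < 2:
        raise ValueError("SCPCD needs an n-gram with n >= 2")
    avg_split = sum(freq[gram[:i]] * freq[gram[i:]] for i in range(1, n)) / (n - 1)
    return len(left[gram]) * len(right[gram]) / avg_split


# Toy corpus (hypothetical snippets mentioning the National Palace Museum).
snippets = ["參觀故宮博物院", "故宮博物院展覽", "國立故宮博物院", "在故宮博物院前"]
freq, left, right = ngram_stats(snippets, max_n=5)
full_term = scpcd("故宮博物院", freq, left, right)   # complete term
fragment = scpcd("宮博物", freq, left, right)        # incorrect segment
```

On this corpus the complete term 故宮博物院 appears in varied contexts and scores higher than the fragment 宮博物, which always has the same neighbours on both sides.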
39
Term Extraction Performance
Association Measure | Precision | Recall | Avg. R-P
CD | 68.1 % | 5.9 % | 37.0 %
SCP | 62.6 % | 63.3 % | 63.0 %
SCPCD | 79.3 % | 78.2 % | 78.7 %
Table 1. The obtained extraction accuracy including precision, recall, and average recall-precision of auto-extracted translation candidates using different methods.
40
Speed Performance
Table 2. The obtained average speed performance of different term extraction methods.
Term Extraction Method | Time for Preprocessing | Time for Extraction
LocalMaxs (Web Queries) | 0.87 s | 0.99 s
PATtree+LocalMaxs (Web Queries) | 2.30 s | 0.61 s
LocalMaxs (1,367 docs) | 63.47 s | 4,851.67 s
PATtree+LocalMaxs (1,367 docs) | 840.90 s | 71.24 s
LocalMaxs (5,357 docs) | 47,247.55 s | 350,495.65 s
PATtree+LocalMaxs (5,357 docs) | 11,086.67 s | 759.32 s
41
Term Translation
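• Problem: given a source query (e.g., porcelain), estimate its similarity to each extracted translation candidate in order to pick out the relevant terms
  – Translation candidates: 故宮博物院, 故宮, 繪畫, 瓷, 書法, 陶瓷, 玉器, 瓷器, 刺繡, …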
42
Similarity Estimation
How to decide the ranking of the translation candidates T1, T2, …, Tn for a source query S?
1) S, Ti: frequently co-occur in the same pages
  – Not necessarily true for synonyms and antonyms
2) S, Ti: have similar co-occurring context terms in the search-result pages
[Diagram: query S (e.g., National Palace Museum) compared with candidates T1, T2, …, Tn (e.g., 國立故宮博物院, 故宮, 故宮博物院)]
43
Chi-Square Test
• Chi-Square Test: a statistical method for co-occurrence analysis [Gale & Church ‘91]
\[ S_{\chi^2}(s,t) = \frac{N \times (a \times d - b \times c)^2}{(a+b) \times (a+c) \times (b+d) \times (c+d)} \quad (3) \]
a: # of pages containing both terms s and t
b: # of pages containing term s but not t
c: # of pages containing term t but not s
d: # of pages containing neither term s nor t
N: the total number of pages, i.e., N = a+b+c+d
The counts are available from a search engine
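A minimal sketch of the chi-square score computed from page counts. The helper `counts_from_hits`, which derives a, b, c, d from three hit counts and a total index size, is an assumption about how search-engine results might be mapped onto the contingency table; the hit counts in the example are made up.

```python
def counts_from_hits(hits_s, hits_t, hits_both, total_pages):
    """Derive the 2x2 contingency counts from search-engine hit counts.

    hits_s / hits_t: pages containing s / t; hits_both: pages containing both.
    """
    a = hits_both
    b = hits_s - hits_both
    c = hits_t - hits_both
    d = total_pages - a - b - c
    return a, b, c, d


def chi_square_score(a, b, c, d):
    """N * (a*d - b*c)^2 / ((a+b)(a+c)(b+d)(c+d)), with N = a+b+c+d."""
    n = a + b + c + d
    denominator = (a + b) * (a + c) * (b + d) * (c + d)
    if denominator == 0:
        return 0.0  # degenerate table: one term occurs on all pages or none
    return n * (a * d - b * c) ** 2 / denominator


# Made-up hit counts for a source term s and a candidate translation t.
table = counts_from_hits(hits_s=100, hits_t=80, hits_both=60, total_pages=1000)
score = chi_square_score(*table)
```

A candidate that co-occurs with the source term far more often than chance would predict gets a large score; independent terms score close to zero.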
44
Example Boolean Query for Chi-Square Test
45
Context Vector Analysis
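• Context Vector Analysis
  – Co-occurring context terms as feature vectors
  – TF-IDF weighting
\[ w_{t_i} = \frac{f(t_i, d)}{\max_j f(t_j, d)} \times \log\left(\frac{N}{n}\right) \quad (4) \]
• Similarity measure
  – Cosine measure
\[ S_{cv}(s,t) = \frac{\sum_{i=1}^{m} w_{s_i} \times w_{t_i}}{\sqrt{\sum_{i=1}^{m} (w_{s_i})^2 \times \sum_{i=1}^{m} (w_{t_i})^2}} \quad (5) \]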
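A minimal sketch of context-vector similarity: each term is represented by TF-IDF weights over its co-occurring context terms, and two terms are compared with the cosine measure. The reading of the symbols (f as the frequency of a context term in the search-result pages, n as the number of pages containing it, N as the total number of pages) and all numbers in the example are assumptions for illustration.

```python
import math


def tfidf_weights(term_freqs, doc_freqs, total_pages):
    """Weight = (f / max f) * log(N / n) for each context term."""
    max_f = max(term_freqs.values())
    return {
        term: (f / max_f) * math.log(total_pages / doc_freqs[term])
        for term, f in term_freqs.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up context-term statistics for a source query s and a candidate t.
doc_freqs = {"museum": 10, "art": 100, "taipei": 50}
s_vec = tfidf_weights({"museum": 4, "art": 2}, doc_freqs, total_pages=1000)
t_vec = tfidf_weights({"museum": 3, "taipei": 3}, doc_freqs, total_pages=1000)
similarity = cosine(s_vec, t_vec)
```

Because the vectors are built from context terms rather than from direct co-occurrence, two terms can score well even if they rarely appear on the same page.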
46
The Combined Method
• To take advantage of both methods
  – Chi-Square Test: co-occurrence
  – Context Vector Analysis: similar context
\[ S_{all}(s,t) = \sum_{m} \frac{\alpha_m}{R_m(s,t)} \quad (6) \]
R_m(s,t): the rank of candidate t for query s under method m
α_m: the weight assigned to method m
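A minimal sketch of rank-based combination: each method ranks the candidates by its own score, and a candidate's combined score is the weighted sum of its reciprocal ranks. The per-method weights, the method names `chi` and `cv`, and the scores below are assumptions made for this example.

```python
def ranks(scores):
    """Map each candidate to its rank (1 = best) under one method's scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {candidate: position + 1 for position, candidate in enumerate(ordered)}


def combined_scores(score_tables, weights):
    """Sum weight / rank over all methods for every candidate."""
    combined = {}
    for method, scores in score_tables.items():
        for candidate, rank in ranks(scores).items():
            combined[candidate] = combined.get(candidate, 0.0) + weights[method] / rank
    return combined


# Made-up similarity scores from the two methods for one source query.
score_tables = {
    "chi": {"故宮": 30.0, "故宮博物院": 12.0, "繪畫": 5.0},
    "cv": {"故宮博物院": 0.9, "故宮": 0.6, "繪畫": 0.2},
}
weights = {"chi": 1.0, "cv": 2.0}  # hypothetical weights
combined = combined_scores(score_tables, weights)
best_candidate = max(combined, key=combined.get)
```

Using ranks instead of raw scores sidesteps the fact that chi-square values and cosine values live on very different scales.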
47
Experiments
• Web Search
  – Chinese search engine logs
    • Dreamer Log: 228,566 unique terms, during a period of 3 months in 1998
    • GAIS Log: 114,182 unique queries during a period of two weeks in 1999
• Digital Library
  – STICNET Database
    • 33,797 scientific documents in 86 categories, between 1983 and 1997
48
Experiments on Web Search
• Test sets
  – Popular-query set: 430 popular query terms
    • Type Dic: 36%
    • Type OOV: 64%
  – Random-query set
    • Randomly selected 200 Chinese queries from the top 20,000 queries in Dreamer Log
  – Proper names & technical terms
    • 50 scientists’ names & 50 disease names
  – Common terms
    • Randomly selected 100 common nouns and 100 common verbs
49
Popular Chinese Query Set
Table 3. Coverage and inclusion rates for popular Chinese queries using different methods.
Method | Query Type | Top-1 | Top-3 | Top-5 | Coverage
CV | Dic | 56.4% | 70.5% | 74.4% | 80.1%
CV | OOV | 56.2% | 66.1% | 69.3% | 85.0%
CV | All | 56.3% | 67.7% | 71.2% | 83.3%
χ2 | Dic | 40.4% | 61.5% | 67.9% | 80.1%
χ2 | OOV | 54.7% | 65.0% | 68.2% | 85.0%
χ2 | All | 49.5% | 63.7% | 68.1% | 83.3%
Combined | Dic | 57.7% | 71.2% | 75.0% | 80.1%
Combined | OOV | 56.6% | 67.9% | 70.9% | 85.0%
Combined | All | 57.2% | 68.6% | 72.8% | 83.3%
50
Popular English Query Set
Table 4. Coverage and top-n inclusion rates for popular English query set using different methods.
Method | Top-1 | Top-3 | Top-5 | Coverage
CV | 50.9% | 60.1% | 60.8% | 80.9%
χ2 | 44.6% | 56.1% | 59.2% | 80.9%
Combined | 51.8% | 60.7% | 62.2% | 80.9%
51
Random Query Set
Table 5. Coverage and top-n inclusion rates for the random-query set using different methods.
Method | Top-1 | Top-3 | Top-5 | Coverage
CV | 25.5% | 45.5% | 50.5% | 60.5%
χ2 | 26.0% | 44.5% | 50.5% | 60.5%
Combined | 29.5% | 49.5% | 56.5% | 60.5%
52
Proper Names and Technical Terms
Table 6. Top-n inclusion rates for proper names and technical terms using the combined method.
Query Type | Top-1 | Top-3 | Top-5
Scientist Name | 40.0% | 52.0% | 60.0%
Disease Name | 44.0% | 60.0% | 70.0%
53
Common Nouns and Verbs
Table 8. Top-n inclusion rates for common nouns and verbs using the combined approach.
Query Type | Top-1 | Top-3 | Top-5
100 Common Nouns | 23.0% | 33.0% | 43.0%
100 Common Verbs | 6.0% | 8.0% | 10.0%
• Our methods are less reliable for common terms
54
Summary of Different Methods
• χ2 test
  – Fast
  – More suitable for high-frequency terms
• CV
  – Slow (for feature extraction)
  – Applicable to low-frequency terms
• Combined
  – Slow
  – Both high-frequency & low-frequency terms
55
Experiments on Digital Libraries
• Cross-language search for STICNET Database
  – 33,797 scientific documents in 86 categories, between 1983 and 1997
  – 410,557 English-Chinese bilingual key terms
• Challenges:
  – Various categories in specific domains
  – Hard to find translations on the Web
56
Example Cross-Lingual Queries in STICNET Database
57
STICNET Database Search Result
58
Translation of Auto-Extracted Unknown Terms
Table 9. The top-n inclusion rates of translations for auto-extracted useful unknown terms.
Query Type | Top-1 | Top-3 | Top-5
Auto-extracted useful terms in Information Engineering | 33.3% | 37.5% | 50.0%
Auto-extracted useful terms in Medicine | 34.6% | 46.2% | 50.0%
• The feasibility of translating auto-extracted unknown terms has been shown
59
Some Examples of the Auto-Extracted Translations
English Terminologies Chinese Translation
mechanical strain 機械應變
viscous damping 黏滯阻尼
Extracorporeal Shock Wave Lithotripsy 震波碎石
Galilei, Galileo 伽利略 / 伽里略 / 加利略
Legionnaires' Disease 退伍軍人症
60
Other Applications
• Text Classification
• Query Clustering
• Search Result Clustering
• Concept Search
61
LiveClassifier
A system that creates classifiers through Web mining
[WWW 2004]
62
LiveClassifier
Users create topic hierarchies and define classes/keywords
63
LiveClassifier
• Training data auto-extracted from the Web; no manually labeled data provided
• Exploiting the inherent structure information for training
64
LiveClassifier
People
Place
Subjects
Sub-subjects
65
LiveClassifier
Classifying documents
Into classes
66
LiveClassifier
Classifying short texts
Into classes
67
LiveClassifier
68
LiveClassifier
69
70
Term Clustering
71
HAC-based Binary Clustering
72
Min-Max Partition
73
Query Clustering
[Dendrogram: query terms clustered hierarchically and cut into five clusters]
• Employment: 勞委會, 職訓局, 就業, 青輔會, 自傳, 徵才, 人力資源, 104 人力銀行, 人力銀行, 找工作, 履歷表, 求職, 求才
• Fortune-telling: 占卜, 塔羅牌, 算命, 紫微斗數, 命理, 姓名學, 心理測驗, 星座, 愛情
• Airlines: eva, 長榮航空 (EVA airline), 長榮 (EVA), 航空公司 (airline), 航空 (airway), 華航 (China airline), 中華航空 (China airline)
• Pirated software: 補帖, 大補帖, 泡麵, dbt
• Martial-arts fiction: 武俠, 金庸, 武俠小說, 黃易, 作家
74
Thesaurus Construction from Query Log
• Query logs provide representative terms for DL usage
• Taxonomy generation from query logs
  – Query clustering
  – Query categorization
  – Document categorization
[Diagram: query logs are split into high-frequency and low-frequency terms; components: Taxonomy Generation (Query Clustering), Query Term Categorization, and Document Term Categorization over relevant documents]
75
Search Result Clustering
• Why search result clustering?
• Why is SRC different from document clustering?
  – In assessment of algorithm’s quality
  – Precision, recall vs. user-oriented, subjective assessment
76
Example of Search Result Clustering
• The query “NTU” is ambiguous:
  – National Taiwan University
  – NTU Hospital
  – Nanyang Technological University, Singapore
77
Example Clustering Search Engines
• Vivisimo.com
  – Clusty.com
• KillerInfo.com
• InfoNetWare.com
• SnakeT (Snippet Aggregation for Knowledge ExTraction): http://roquefort.unipi.it/
  – A hierarchical clustering engine for snippets
• Mooter.com
• …
78
Example on Vivisimo
79
Vivisimo (cont.)
80
Clusty.com
81
InfoNetWare.com
82
Concept Search
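• Conventional search
  – Keyword search for “researcher” and “AI” and “Taiwan” matches only documents (doc) that contain those keywords
• Concept-level search
  – An interesting document may match the concepts researcher, AI, and Taiwan through related terms such as “professor”, “neural network”, and “NTU”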
83
Further Reading
• Jenq-Haur Wang, Jei-Wen Teng, Wen-Hsiang Lu, and Lee-Feng Chien, "Exploiting the Web as the Multilingual Corpus for Unknown Query Translation," Journal of the American Society for Information Science and Technology (JASIST), Vol. 57, No. 5, pp. 660-670, Special Issue on Multilingual Information Systems, Mar. 2006. (SCI, SSCI)
• Jenq-Haur Wang, Jei-Wen Teng, Pu-Jen Cheng, Wen-Hsiang Lu, and Lee-Feng Chien, "Translating Unknown Cross-Lingual Queries in Digital Libraries Using a Web-based Approach," Proceedings of the 4th ACM/IEEE Joint Conference on Digital Libraries (JCDL 2004), pp. 108-116.
• Wen-Hsiang Lu, Lee-Feng Chien, and Hsi-Jian Lee, "Anchor Text Mining for Translation of Web Queries: A Transitive Translation Approach," ACM Transactions on Information Systems, Vol. 22, No. 2, pp. 242-269, 2004. (SCI)
• Chien-Chung Huang, Shui-Lung Chuang, and Lee-Feng Chien, "LiveClassifier: Creating Hierarchical Text Classifiers through Web Corpora," Proceedings of WWW 2004, pp. 184-192.
• Wen-Hsiang Lu, Lee-Feng Chien, and Hsi-Jian Lee, "Translation of Web Queries Using Anchor Text Mining," ACM Transactions on Asian Language Information Processing, pp. 159-172, 2002.
• Lee-Feng Chien, T.-I. Huang, and M.-C. Chien, "PAT-tree-based Keyword Extraction for Chinese Information Retrieval," Proceedings of SIGIR 1997, pp. 50-58.