05/2004 L. F. Chien
Opportunities and Challenges of Web Search and Mining
Lee-Feng Chien ( 簡立峰 )
Academia Sinica & National Taiwan University
Outline
• Web SE
  -- Inside SE
  -- Google’s Business Models
  -- Google’s Impacts
  -- Recent Development
  -- Next-Generation WSE
• Web Mining
WSE = Google
Globalization!
Problems of WSE
• Inside WSE: fast, coverage, accuracy
• Business: profitable, models, competition
• Impacts: Web computing, knowledge windows, new paradigm of civilization
I. Some Must-Know Statistics
Online Language Populations
Source: Global Reach (global-reach.biz/globstats)
Top Ten Languages in the Web
TOP TEN LANGUAGES IN THE INTERNET
Language                Internet Users   Average       World Population   Language as % of
                        by Language      Penetration   Estimate for       Total Internet
                                                       Language           Users
English                 287,369,520      26.2 %        1,098,654,265      35.9 %
Chinese                 105,484,112       8.0 %        1,321,669,200      13.2 %
Japanese                 66,548,060      52.1 %          127,853,600       8.3 %
German                   54,035,201      56.3 %           95,893,300       6.8 %
Spanish                  53,670,063      13.9 %          386,413,200       6.7 %
French                   35,034,269       9.3 %          375,164,185       4.4 %
Korean                   30,670,000      41.0 %           74,730,000       3.8 %
Italian                  28,610,000      49.3 %           57,987,100       3.6 %
Portuguese               23,058,254      10.3 %          224,664,100       2.9 %
Dutch                    13,657,170      56.6 %           24,125,950       1.7 %
TOP TEN LANGUAGES       698,353,773      18.4 %        3,787,154,900      87.3 %
Rest of the Languages   101,686,725       3.9 %        2,602,992,587      12.7 %
WORLD TOTAL             800,040,498      12.5 %        6,390,147,487     100.0 %
Source: Internet World Stats
More and more non-English users!
Web Content
[Figure: Internet hosts (millions, log scale from 0.1 to 100.0) by language, estimated by domain: English, Japanese, German, French, Dutch, Finnish, Spanish, Chinese, Swedish]
Source: Network Wizards Jan 99 Internet Domain Survey
More and more non-English pages
Web Users and Pages (5 years ago)
Area         Users   Web Pages   Time
World-wide   150M    800M        7/99
China        4M      2.5~3M      7/99
Taiwan       4M      3M          7/99
Challenge of scalability!
Chinese users: 110M, including 87M (CN), 4.9M (HK), 8.8M (TW), 2.14M (SG), and others.
Source: Global Reach, 2004
Number of Chinese Web Pages
573,000,000 pages
Scalability problem!
Number of Web Pages
The world’s largest search engine?
Google: 4,285,199,774 pages (4.28 billion Web pages, 880 million images, and other documents)
[Figure: Billions of Textual Documents Indexed, as of Sept 2, 2003]
KEY: GG=Google, ATW=AllTheWeb, INK=Inktomi, TMA=Teoma, AV=AltaVista. Source: Search Engine Watch
The top 10 Internet trends 2004 predicted by eOneNet.com
1. World Internet population will continue to grow at an exponential rate, with China taking the lead in Asia having more than 100 million Internet users.
2. Broadband Internet penetration will continue to grow with China and US in the lead with an expected growth rate exceeding 30% each.
3. Online retail sales will still be led by the US with an expected revenue exceeding US$80 billion.
4. Paid search will account for the biggest online ad spending. With the successful paid search business models of Google and Overture, more search engines will offer paid search advertising.
The top 10 Internet trends 2004 predicted by eOneNet.com
5. Spam will increase at least 20% despite the new US anti-spam law. The US legislators will be forced to consider amending the anti-spam law from an opt-out law to an opt-in law.
6. Ads placed in opt-in email newsletters will increase 25% as legitimate marketers find this is the easier way to comply with the anti-spam law and a better way of targeting customers.
7. Rich media will continue to be hot. More than 25% of online ads served will contain rich media content.
The top 10 Internet trends 2004 predicted by eOneNet.com
8. 20% more small businesses will develop their own websites or use the Internet as a sales and marketing channel.
9. Entertainment online will grow at a rapid pace, with more sites offering videos and digital music download services.
10. The Internet boom will revive with more Internet companies going for IPO both in the US and in Asia, in particular kicked off by the most anticipated Google IPO in Spring.
II. Inside WSE
Components
• Crawler/Spider
• Index Server
• Query Server
• Document Delivery
Architecture
[Figure: search engine architecture with components numbered (1) to (5). Labels: Web, Spider, Archive, Indexer, Index (scalable), SE, Log, Browser. Annotations: 5B pages; 1B queries/day; quality results; spam; freshness.]
Spider
• Get all pages from the Web
• Web traversal
• Challenges: performance (e.g., #pages per PC), coverage, currency, spam filtering, hidden Web
Index Server
• Index occurrences of all words in the pages
• Data cleanness
• Challenges: space overhead, #pages/PC, incremental scalability & distributed processing, multiple languages
System Anatomy
Data Structure
• Lexicon: fits in memory; two different forms
• Hit list: accounts for most of the space; uses 2 bytes to save space
• Forward index: barrels are sorted by wordID; inside a barrel, entries are sorted by docID
• Inverted index: same content as the forward index, but sorted by wordID; each doc list is sorted by docID
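The relationship between the forward index and the inverted index can be sketched as follows. This is a toy illustration, not the actual barrel layout: the function names, the whitespace tokenizer, and the sample documents are all assumptions made for the example.

```python
from collections import defaultdict

def build_forward_index(docs):
    """Return (lexicon, forward): forward maps docID -> [(wordID, positions)]."""
    lexicon = {}   # word -> wordID; assumed small enough to fit in memory
    forward = {}
    for doc_id in sorted(docs):
        hits = defaultdict(list)
        for pos, word in enumerate(docs[doc_id].lower().split()):
            word_id = lexicon.setdefault(word, len(lexicon))
            hits[word_id].append(pos)       # toy "hit": just the word position
        forward[doc_id] = sorted(hits.items())
    return lexicon, forward

def invert(forward):
    """Re-sort the same hits by wordID; each doc list stays sorted by docID."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):
        for word_id, positions in forward[doc_id]:
            inverted[word_id].append((doc_id, positions))
    return dict(inverted)

# Hypothetical sample documents
docs = {1: "web search engine", 2: "web mining", 3: "search the web"}
lexicon, forward = build_forward_index(docs)
inverted = invert(forward)
```

Answering a one-word query then amounts to a lexicon lookup followed by reading one doc list, e.g. `inverted[lexicon["search"]]`.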
Query Server
• Search relevant URLs for queries by looking up the indices
• Challenges: speed (check #queries per second), functions supported, localization
PageRank
PageRank (Cont.)
Let u be a web page, let B_u be the set of pages that point to u, let N_u be the number of links from u, and let c be a factor used for normalization. Then a simplified version of PageRank is:
\[ R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v} \]
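In the simplified PageRank, the rank of a page u is c times the sum, over the pages v that link to u, of R(v) divided by the number of links leaving v. A minimal sketch of computing it by repeated substitution is given below. The function name, the iteration count, and the three-page graph are assumptions for illustration, and the normalization factor c is applied by rescaling the ranks to sum to 1 after each pass.

```python
def simplified_pagerank(links, iterations=50):
    """links maps each page to the list of pages it points to."""
    pages = sorted(set(links) | {u for targets in links.values() for u in targets})
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for v, targets in links.items():
            for u in targets:
                # v shares its rank equally among its out-links
                new_rank[u] += rank[v] / len(targets)
        total = sum(new_rank.values())
        # rescaling plays the role of the normalization factor c
        rank = {p: r / total for p, r in new_rank.items()}
    return rank

# Hypothetical graph: A -> B, A -> C, B -> C, C -> A
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = simplified_pagerank(links)
```

For this graph the ranks settle near A = 0.4, B = 0.2, C = 0.4. Pages without out-links and rank sinks are not treated specially in this sketch.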
Search Functions
• Phrase search, e.g. "petite galerie"
• Truncation, e.g. librar*, wom*n
• Constraining search, e.g. title:"The Wall Street Journal"
• Proximity search, e.g. gold near silver
• Boolean, e.g. +noir +film -"pinot noir"
• Parentheses and nested Boolean, e.g. silver and not (gold or platinum)
• Limit search, e.g. limit by date range
• Capitalization, e.g. turkey vs. Turkey
• Ranking fields and refine search
• LiveTopics
• Translate service
• Other
Document Delivery
• Bottleneck of bandwidth
• Presentation
• Caching: queries, search results
• Akamai model
III. Business
What is Google?
• Specialized web search engine
• Founded in 1998 by 2 graduate students at Stanford University (Larry Page and Sergey Brin)
• Provides a comprehensive, relevant, and easy-to-use web search and browsing service (free)
• Google’s features: fast, unbiased, and accurate results; access to over 4 billion web pages and over 800 million images (most important: valid web pages)
Company Facts
Employees: 1,300+
Languages spoken: 34
Worldwide Offices: 21
(Mostly in US & Europe)
Annual Revenues: $900m
Google Revenue
Revenue (an e-business):
• ½ from selling relevant text-based ads (sponsored links near search results)
• ½ from licensing its search technology to companies like Yahoo
Source: Eric Schmidt Interview, PCWorld.com (January 30, 2002)
Sources of Revenue
• AdWords (150,000 advertisers): “sponsored links” ads
  -- Cost-per-click pricing: advertisers pay only when people click on the link
  -- Advertising is extremely cheap and effective, e.g., Edmunds.com spent “$250,000 a month in advertising” because $1 spent generated $1.70
• Google Search Appliance: an integrated hardware/software solution that extends the power of Google to corporate intranets and web servers
  -- Customers include: Cisco Systems, Sony, Procter & Gamble, Sun Microsystems, etc.
Challenges (cont.)
Easy entry into the Search Engine Industry
Lack of customer lock-in (vs. Microsoft);
Google will focus on creating services to voluntarily draw in customers
Large, well-known competitors are focusing on in-house search technology (Yahoo, Microsoft, AOL, eBay, Amazon)
Customers are becoming competitors (Yahoo, AOL)
Competitors: eBay and Amazon
• eBay (www.ebay.com), e-commerce: Web-based marketplace in which a community of buyers and sellers is brought together to browse, buy, and sell various items
  -- Business revenue: charges fees on proceeds: 5% ($0.01-$25), 2.5% ($25-$1,000), 1.25% (over $1,000)
• Amazon (www.amazon.com), e-commerce: a customer-centric company that sells a range of products that it purchases from manufacturers and distributors
Competitors: Microsoft and Yahoo
• Microsoft is developing its own search engine
  -- Can “lasso” users into its search engine through its operating system
  -- Has the “brainiacs” to implement top-of-the-line search engine technology
• Yahoo was a customer of Google (may now become Google’s biggest competitor)
  -- Offers placement under sponsored links and within actual results (“unethical”)
IV. Impacts
Impacts
• Web Computing
• Knowledge Windows
• New Web OS
Web Computing
• Faster than local search
• Very-large scale of computing systems
• Realize global users’ behaviors
• Acquire global information sources
Web Computing
• Local disc or global disc?
• Personal information management?
  -- Gmail
  -- Photo search
Knowledge Windows
• Windows of information search
• Alliance with online databases
• Windows of personal knowledge management
• Knowledge windows
New Web OS
• Merged with Linux OS
• Software download from end-users
• Information service OS
V. New Gen. of WSE
Advanced Google
• Is Google good enough?
  -- “Takano”, “Takano NII”, “Takano NII Japan”
• More about Google services: http://www.google.com/options/
New Features in Google
• Google Labs: http://labs.google.com/
• Google Desktop Search
  -- Searching text, Web, Word, Excel, PowerPoint, Outlook, AOL Instant Messenger
• Google SMS
  -- Searching phone book, dictionary, product prices, …
• Google Print
  -- Searching books
Other Search Tools
• A9.com (by Amazon)
  -- Bookmark, history, discover, diary
  -- Books, movies, …
• Clusty.com (by Vivisimo)
  -- Clustering engine
• Snap.com (by Idealab)
  -- Sorting by popularity, satisfaction, Web popularity, Web satisfaction, domain, …
• Alexa.com (by Amazon)
  -- Average user review ratings, …
• Others: Yahoo, AskJeeves, AOL Search, HotBot, MSN, Netscape, Lycos, Altavista, LookSmart, Gigablast, Overture, About, FindWhat, Teoma, InformSearch, …
Clusty.com
Example on Vivisimo
Vivisimo (cont.)
New Directions
• Personalization
  -- Photo search, email search & filtering
• Information extraction
  -- Ex: Scholar search
• Information agent
• Deep Web search
VI. Web Mining
Web Search/Information Retrieval
Web Search Engine
Information Seeking
Millions of Users
Improving Search via Mining
Webtexts, images, logs …
Search Engine
Knowledge Discovery
Millions of Users
Valuable Web Resources
Weblogs, texts, images, …
Knowledge Discovery
Millions of Users
• Hyperlinks
• Anchor texts
• Search result pages
• Query logs
• Query session logs
• Clickstream logs
• Deep Web, …
Discovered Knowledge
Weblogs, texts, images, …
Knowledge Discovery
Millions of Users
• Users’ preferences/needs: topic, location, timing, …
• Authority/popularity: site, file, people, company, product
• Clusters/associations/relations: site, page, people, company, product, query
Web Mining for IR
Weblogs, texts, images, …
Knowledge Discovery
Millions of Users
• Search
• Classification
• Clustering
• Cross-language IR
• Information extraction
• Text mining
• Filtering
CS 276 / LING 239I: Information Retrieval and Web Mining
Prabhakar Raghavan and Hinrich Schütze
Course Description: Basic and advanced techniques for text-based information systems: efficient text indexing; Boolean, vector space, and probabilistic retrieval models; evaluation and interface issues; Web search including crawling, link-based algorithms, and Web metadata; text/Web clustering, classification, wrapper, information extraction, and collaborative filtering systems; text mining. Projects can be chosen from diverse topics in information retrieval.
Computational Linguistics, 29, Issue 3, September 2003.
Research at Web Knowledge Discovery Lab
Columns: Web Resources | Discovered Knowledge | IR Application | Publications
• Query Log, Query Session Log
  -- Discovered knowledge: Classified/Relevant Queries, User’s Interests Finding
  -- IR application: Query Classification, Term Suggestion, Thesaurus Construction
  -- Publications: SIGIR’00, WWW’00, WI’01, ICDM’01, JASIST’02, …
• Anchor Text, Hyper Link
  -- Discovered knowledge: Translation Pairs, Translation Lexicon
  -- IR application: Cross-language IR, Cross-Language Web Search
  -- Publications: SIGIR’01, ICDM’02, TOIS’04
• Search Result Pages
  -- Discovered knowledge: Translation Pairs, Translation Lexicon
  -- IR application: Cross-language IR, Cross-Language Web Search
  -- Publications: ACL’04, SIGIR’04
• Search Result Pages
  -- Discovered knowledge: Clustered Queries/Term Taxonomy
  -- IR application: Search Result Clustering, Taxonomy Generation
  -- Publications: ICDM’02, CIKM’04, TOIS’04
• Search Result Pages
  -- Discovered knowledge: Online Corpus, Generated Metadata
  -- IR application: Online Classifier, Class-based Web Search
  -- Publications: ICDE’04, WWW’04, TALIP’04
Research at Web Knowledge Discovery Lab
Live series
• LiveTrans
  -- SIGIR’04, ACL’04, JCDL’04
  -- ACM Trans. on Information Systems, 2004
  -- Online translation of unknown queries via the Web
• LiveClassifier
  -- WWW’04, IJCNLP’04
  -- ACM Trans. on ALIP, 2004
  -- Training classifiers and classifying short text via the Web
Research at Web Knowledge Discovery Lab
• LiveCluster
  -- CIKM’04
  -- ACM Trans. on Information Systems, 2004
  -- Generating taxonomy from terms or documents
LiveTrans: Cross-language Web Search
LiveClassifier: Classifying search results into user-defined classification tree
LiveClassifier : Paper Title Categorization
Note: no labeled training data
LiveCluster: Taxonomy Generation
Terms Clustering
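Query Clustering
[Figure: dendrogram of Chinese query terms with a cut that yields five clusters, labeled 1 to 5. The clusters shown:]
• 武俠 (martial arts fiction), 金庸 (Jin Yong), 武俠小說 (martial arts novels), 黃易 (Huang Yi), 作家 (writer)
• 補帖 (pirated-software pack), 大補帖 (big pirated-software pack), 泡麵 (instant noodles), dbt
• eva長榮航空 (EVA airline), 長榮 (EVA), 航空公司 (airline), 航空 (airway), 華航 (China airline), 中華航空 (China airline)
• 占卜 (divination), 塔羅牌 (tarot cards), 算命 (fortune-telling), 紫微斗數 (Zi Wei Dou Shu astrology), 命理 (fate reading), 姓名學 (name divination), 心理測驗 (psychological test), 星座 (star signs), 愛情 (love)
• 勞委會 (Council of Labor Affairs), 職訓局 (Bureau of Employment and Vocational Training), 就業 (employment), 青輔會 (National Youth Commission), 自傳 (autobiography), 徵才 (recruiting), 人力資源 (human resources), 104 人力銀行 (104 Job Bank), 人力銀行 (job bank), 找工作 (job hunting), 履歷表 (résumé), 求職 (job seeking), 求才 (talent seeking)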
Outline
• Translating Unknown Queries (SIGIR’04)
• Training Text Classifiers (WWW’04)
• Generating Taxonomy/Topic Hierarchies (TOIS’04)
Translating Unknown Queries
I. Anchor Text Mining
   I. Probabilistic Modeling (ACM TALIP’02)
   II. Transitive Translation (ACM TOIS’04)
II. Search-Result Page Mining
   I. Translation Extraction & Selection (JCDL’04)
   II. CLIR & Other Applications (SIGIR’04, ACL’04)
Note: First work dealing with online translation
Introduction (cont.)
• Bottleneck of CLIR service
  -- Real queries are often short
  -- Terms are often out-of-dictionary and might have local variations
     • Ex: proper nouns, new terminologies, …
• Need for a powerful query translation engine
  -- Up-to-date dictionary
English Terminologies   Chinese Translation
Digital library         數位圖書館 / 數字圖書館
Banff                   班夫 / 班芙
Ishikawa                石川県
NII Japan               国立情報学研究所
louvre museum           羅浮宮
SARS                    嚴重急性呼吸道症候群 / 非典 / 沙士
Clinton                 柯林頓 / 克林頓
Bill Gates              比爾蓋茲
Web Mining of Query Translations
• Different problems for different resources
[Diagram labels: Source Term, Term Translation, Target Translations, OOD (out-of-dictionary), Web Mining (Anchor-Text Mining, Search-Result Mining). Example: Yahoo <-> 雅虎]
Anchor Text (Yahoo <-> 雅虎)
• Applies to most languages
• Translation candidates are likely to appear in the same anchor-text set
Search Result Page (National Palace Museum vs. 故宮博物院 )
Mixed-language characteristic in Chinese pages
Problems
• Term extraction
• Translation selection & noise reduction
• Language pairs with limited corpora
• Processing speed
• Data cleanness (language identification)
• Language independence
Term Extraction: SCPCD
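Term Selection: Probabilistic Inference Model
• Integrating anchor texts and link structures into a probabilistic inference model
• Based on co-occurrence & page authority (PageRank)
\[
P(T_s \leftrightarrow T_t)
= \frac{P(T_s \cap T_t)}{P(T_s \cup T_t)}
= \frac{\sum_{i} P(T_s \cap T_t \mid U_i)\,P(U_i)}{\sum_{i} P(T_s \cup T_t \mid U_i)\,P(U_i)}
= \frac{\sum_{i} P(T_s \mid U_i)\,P(T_t \mid U_i)\,P(U_i)}{\sum_{i} \left[ P(T_s \mid U_i) + P(T_t \mid U_i) - P(T_s \mid U_i)\,P(T_t \mid U_i) \right] P(U_i)}
\]
where
\[
P(U_i) = \frac{\#\ \text{of in-links of } U_i}{\#\ \text{of total in-links}}
\]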
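The way co-occurrence and page authority combine in this model can be sketched in a few lines. This is only an illustration: the slide does not say how P(T | U_i) is estimated, so the sketch assumes it is the fraction of anchor texts pointing to page U_i that contain the term, and the function name, anchor texts, and in-link counts are made-up sample data.

```python
def translation_similarity(source, target, anchor_sets, in_links):
    """Estimate P(Ts <-> Tt) from anchor-text sets weighted by page authority."""
    total_links = sum(in_links.values())
    joint = either = 0.0
    for page, texts in anchor_sets.items():
        p_u = in_links[page] / total_links                # page authority P(U_i)
        # assumption: P(T | U_i) = share of the page's anchor texts containing T
        p_s = sum(source in text for text in texts) / len(texts)
        p_t = sum(target in text for text in texts) / len(texts)
        joint += p_s * p_t * p_u                          # numerator term
        either += (p_s + p_t - p_s * p_t) * p_u           # denominator term
    return joint / either if either else 0.0

# Hypothetical anchor-text sets and in-link counts
anchor_sets = {
    "www.yahoo.com": ["Yahoo", "雅虎", "Yahoo in USA", "雅虎 搜尋引擎"],
    "www.yahoo.com.tw": ["Yahoo Taiwan", "雅虎 台灣"],
    "example.org": ["搜尋引擎", "search engine"],
}
in_links = {"www.yahoo.com": 187, "www.yahoo.com.tw": 21, "example.org": 10}

good = translation_similarity("Yahoo", "雅虎", anchor_sets, in_links)
noisy = translation_similarity("Yahoo", "搜尋引擎", anchor_sets, in_links)
```

With this sample data the true translation pair scores higher than the pair formed with the merely co-occurring term 搜尋引擎 (search engine).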
Observation of Anchor Text
[Figure, built up over four slides: anchor texts that point to www.yahoo.com and www.yahoo.com.tw form anchor-text sets. For the source query “Yahoo” (source term Ts), the sets contain “Yahoo”, “雅虎”, and fragments such as “in USA”, “Taiwan”, “台灣” (Taiwan), and “搜尋引擎” (search engine), which makes “雅虎” a translation candidate (Tt). The pages are annotated with in-link counts (#in-link = 187 and #in-link = 21) as page authority, and candidates are related to the source term by co-occurrence.]
Search Result Mining
[Diagram: Source Query → Search Engine → Web Pages → Term Extraction → Term Selection → Target Translations]
• Term extraction: PAT-tree based term extraction method [Chien, SIGIR ’97]
Term Selection
• How to decide the ranking?
• S, Ti: frequently co-occur in the same pages
  -- Not necessarily true for synonyms and antonyms
• S, Ti: the result pages contain similar co-occurring context terms, used as feature vectors
[Diagram: query S linked to translation candidates T1, T2, …, Tn]
Chi-Square Test
• Chi-Square Test: a statistical method for co-occurrence analysis [Gale & Church ’91]
\[ S_{\chi^2}(s,t) = \frac{N \times (a \times d - b \times c)^2}{(a+b) \times (a+c) \times (b+d) \times (c+d)} \tag{3} \]
a: # of pages containing both terms s and t
b: # of pages containing term s but not t
c: # of pages containing term t but not s
d: # of pages containing neither term s nor t
N: the total number of pages, i.e., N = a+b+c+d
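As a small worked example, the chi-square score can be computed directly from the four page counts a, b, c, and d defined above. The function name and the sample counts are illustrative assumptions; in practice the counts would come from search-engine hit counts.

```python
def chi_square(a, b, c, d):
    """Chi-square co-occurrence score from a 2x2 contingency table of page counts."""
    n = a + b + c + d
    denominator = (a + b) * (a + c) * (b + d) * (c + d)
    if denominator == 0:
        return 0.0   # degenerate table: one term appears on every page or on none
    return n * (a * d - b * c) ** 2 / denominator

# Terms that always appear together score high
always_together = chi_square(10, 0, 0, 10)
# Terms that are statistically independent score zero
independent = chi_square(5, 5, 5, 5)
```

The score is symmetric in s and t and grows with the strength of association in either direction, so a candidate that systematically avoids the source term would also score high; a check that a*d > b*c would be needed to rule that case out.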
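Context Vector Analysis
• Context Vector Analysis: co-occurring context terms as feature vectors
• Similarity measure: cosine measure
\[ w_{t_i} = \frac{f(t_i, d)}{\max_j f(t_j, d)} \times \log\left(\frac{N}{n}\right) \tag{4} \]
\[ S_{cv}(s,t) = \frac{\sum_{i=1}^{m} w_{s_i} \times w_{t_i}}{\sqrt{\sum_{i=1}^{m} (w_{s_i})^2 \times \sum_{i=1}^{m} (w_{t_i})^2}} \tag{5} \]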
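A minimal sketch of the context-vector comparison follows: context terms are weighted by normalized frequency times a log inverse page frequency, and two terms are compared by the cosine of their weight vectors. The slide does not define the symbols, so this sketch assumes the usual tf-idf reading (frequency of a context term in the result pages, N total pages, n pages containing the term); the function names and the counts are hypothetical.

```python
import math

def weight_vector(term_freqs, n_pages, page_counts):
    """Weight each context term by (f / max f) * log(N / n)."""
    max_f = max(term_freqs.values())
    return {
        term: (freq / max_f) * math.log(n_pages / page_counts[term])
        for term, freq in term_freqs.items()
    }

def cosine(ws, wt):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(w * wt.get(term, 0.0) for term, w in ws.items())
    norm = math.sqrt(sum(w * w for w in ws.values()) * sum(w * w for w in wt.values()))
    return dot / norm if norm else 0.0

# Hypothetical context-term counts for a source term and a candidate translation
page_counts = {"museum": 10, "palace": 100}
ctx_source = weight_vector({"museum": 4, "palace": 2}, 1000, page_counts)
ctx_target = weight_vector({"museum": 2, "palace": 1}, 1000, page_counts)
similarity = cosine(ctx_source, ctx_target)
```

Because the frequencies are normalized by the maximum, the two vectors above are identical and the similarity is 1 even though the raw counts differ by a factor of two.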
Indirect Association Problem
[Figure: source terms s = 思科 (Cisco) and s1 = 系統 (system); target terms t = Cisco and t1 = system]
Fig. 4. An illustration showing the indirect association problem, in which the dashed arrows indicate possible indirect association errors.
Competitive Linking Algorithm
[Figure: bipartite graph between source terms 思科 (Cisco), 系統 (system), 資訊 (information), 網路 (network), 電腦 (computer) and target terms Cisco and system; node labels s, t1, t2, St1, St2]
Fig. 6. An illustration showing a bipartite graph generated by using Algorithm 2.
Combined Method
• To take advantage of both methods
  -- Anchor-text-based: higher precision
  -- Search-result-based: higher coverage
\[ S_{all}(s,t) = \sum_{m} \frac{\alpha_m}{R_m(s,t)} \tag{6} \]
where \alpha_m is the weight of method m and R_m(s,t) is the rank of the score of t for s under method m
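One reading of formula (6) is a weighted reciprocal-rank combination: each method contributes its weight divided by the rank it gives the candidate. The sketch below follows that reading. The function name, the equal weights, and the candidate lists are assumptions, as is the choice to give no contribution when a method does not return a candidate at all.

```python
def combined_scores(rankings, weights):
    """Sum weight / rank over the methods that return each candidate."""
    scores = {}
    for method, ranked_candidates in rankings.items():
        for rank, candidate in enumerate(ranked_candidates, start=1):
            scores[candidate] = scores.get(candidate, 0.0) + weights[method] / rank
    # best first; ties broken alphabetically for a deterministic order
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical ranked candidate lists from the three methods
rankings = {
    "AT": ["x", "y"],          # anchor text
    "CV": ["y", "x", "z"],     # context vector
    "CHI": ["y", "z"],         # chi-square
}
weights = {"AT": 1.0, "CV": 1.0, "CHI": 1.0}
ranked = combined_scores(rankings, weights)
```

Using ranks instead of raw scores sidesteps the fact that the three similarity measures live on different scales.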
Experiments
• Performance on query translation
• Test bed: real query terms from the Dreamer search engine log in Taiwan
  -- 228,566 unique terms, during a period of 3 months in 1998
• Random-query test set:
  -- 50 query terms in Chinese, randomly selected from the top 20,000 queries in the log
  -- 40 of them were out-of-dictionary
Random Query Test Set
Table 2. Coverage and top-1 to top-5 inclusion rates obtained with the four different methods for the random-query set (CV: context vector, χ²: chi-square, AT: anchor text).
Method     Top-1   Top-3   Top-5   Coverage
CV         40.0%   54.0%   54.0%   68%
χ²         36.0%   50.0%   52.0%   68%
AT         20.0%   32.0%   32.0%   32%
Combined   44.0%   64.0%   66.0%   72%
• Many query terms didn’t appear in anchor-text sets (coverage)
Other Experiments
430 popular Chinese queries, 67.4% top-1 inclusion rate
Common terms: randomly selected 100 common nouns and 100 common verbs from general-purpose Chinese dictionary
Transitive Translation
[Figure: s = 新力 (Traditional Chinese), m = Sony (English), t = 索尼 (Simplified Chinese); s: source term, t: target translation, m: intermediate translation]
Fig. 3. An abstract diagram showing the concepts of direct translation and indirect translation.
Top-n inclusion rates obtained with different models.
Model                    Top1    Top2    Top3    Top4    Top5
Direct Translation       35.7%   43.0%   46.9%   49.6%   51.2%
Indirect Translation     44.2%   55.1%   58.0%   59.7%   60.5%
Transitive Translation   49.2%   58.1%   60.9%   61.6%   62.0%
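Transitive Translation Model
\[ P_{direct}(s,t) = P(s \leftrightarrow t) \tag{3} \]
\[ P_{indirect}(s,t) = \sum_{m} P(s \leftrightarrow m,\ m \leftrightarrow t)\,P(m) \approx \sum_{m} P(s \leftrightarrow m)\,P(m \leftrightarrow t)\,P(m) \tag{4} \]
\[ P_{trans}(s,t) = \begin{cases} P_{direct}(s,t), & \text{if } P_{direct}(s,t) > \theta \\ P_{indirect}(s,t), & \text{otherwise} \end{cases} \tag{5} \]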
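The transitive model can be sketched as a direct score with a fallback through intermediate terms: the direct score is used when it clears a threshold, and otherwise the score is the sum over intermediate terms m of the two direct scores times P(m). The function name, the threshold value, and all probabilities below are made-up illustration values.

```python
def transitive_score(s, t, direct, p_intermediate, theta=0.1):
    """Use the direct score if it exceeds theta, else go through intermediate terms."""
    p_direct = direct.get((s, t), 0.0)
    if p_direct > theta:
        return p_direct
    # indirect estimate: sum over intermediate terms m of P(s<->m) * P(m<->t) * P(m)
    return sum(
        direct.get((s, m), 0.0) * direct.get((m, t), 0.0) * p_m
        for m, p_m in p_intermediate.items()
    )

# Hypothetical scores: the Chinese-Japanese pair is rarely linked directly,
# but both terms are strongly linked to the English term.
direct = {
    ("新力", "Sony"): 0.6,
    ("Sony", "ソニー"): 0.7,
    ("新力", "ソニー"): 0.02,
}
p_intermediate = {"Sony": 0.5}

via_english = transitive_score("新力", "ソニー", direct, p_intermediate)
direct_hit = transitive_score("新力", "Sony", direct, p_intermediate)
```

Here the weak direct score of 0.02 falls below the threshold, so the pair is scored through the English intermediate instead.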
Chinese-Japanese Translation
Table VI. Top-n inclusion rates obtained with different models for translating traditional Chinese queries into Japanese.
Model        Top1    Top2    Top3    Top4    Top5
Direct       10.5%   12.8%   14.3%   15.1%   15.1%
Indirect     40.2%   49.4%   56.6%   58.6%   59.6%
Transitive   42.9%   51.4%   58.6%   61.3%   61.9%
Table VII. Selected source query terms in traditional Chinese and their translations in English, simplified Chinese and Japanese respectively, which were extracted by our transitive translation model.
Source terms            Extracted target translations
(Traditional Chinese)   English       Simplified Chinese   Japanese
新力                    Sony          索尼                 ソニー
耐吉                    Nike          耐克                 ナイキ
史丹佛                  Stanford      斯坦福               スタンフォード
雪梨                    Sydney        悉尼                 シドニー
網際網路                internet      互联网               インターネット
網路                    network       网络                 ネットワーク
首頁                    homepage      主页                 ホームページ
電腦                    computer      计算机               コンピューター
資料庫                  database      数据库               データベース
資訊                    information   信息                 インフォメーション
Translation Lexicons with Regional Variations
[Figure panels: (a) Taiwan, (b) Mainland China, (c) Hong Kong]
Figure 1: Examples of search-result pages in different Chinese regions that were obtained via the English query words “George Bush” from Google.
Summary
• A work dealing with live translation of unknown queries
• Anchor-text-based
  -- High precision for high-frequency terms
  -- Effective for proper nouns in multiple languages
  -- Not applicable if the anchor-text set is not large enough
• Search-result-based
  -- Exploits rich Web resources
  -- High coverage for the English-Chinese language pair
LiveCluster: Generating Taxonomy from terms or documents
Taxonomy Generation from Terms
Hierarchical Query Clustering
The Steps
• Feature Extraction
  -- Use co-occurring seed terms extracted from the retrieved top pages
• Term Vector
  -- Each query term is assigned a term vector
  -- Records the co-occurring feature terms and their frequency values in the retrieved documents
• Term Similarity
  -- tf*idf-based cosine measure
• Hierarchical Term Clustering
  -- Cluster popular query terms in the log into initial categories
  -- Query terms with similar features are grouped into clusters
Feature Extraction
• Use co-occurring seed terms extracted from the retrieved top pages
Example: search-result snippets retrieved for the query term “nude”:
Creative Nude Photography Network -- Fine Art Nude and ... ... The Creative Nude and Erotic Photography Network is the number one net portal to the best in fine art nude and erotic photography! Over 100 CNPN Member Sites ...
Nude Places... to be naked. Walking in the forest, cruising the lake in open boats, swimming, picnicking and nude photography are all enjoyed in the nude. 60 minutes $39.95. ...
A Brave Nude World... A Brave Nude World! Warning: This site contains links to fine art nude & erotic photography. If you are under 18 or do not wish to view this material, You can ...
Co-occurred feature terms for “nude”:
term                 tf/df
erotic photography   3/2
naked                1/1
…                    …
art                  3/2
photography          2/2
Term Weighting
Extraction of Basic Feature Terms
• Performance of different features: randomly selected, high-frequency, and seed terms
• Popular queries not affected by ephemeral trends, e.g., “movie”, “basketball”, “mutual fund”, etc.
• More expressive and distinguishable in describing a particular category
• Two logs were compared, and 9,709 overlapping top query terms were extracted as feature terms
G-1999 / D-1998    Top 1,000 terms   Top 20,000 terms   ALL
Top 1,000 terms    583/58.30%        977/97.70%         992/99.20%
Top 20,000 terms   914/91.40%        9,709/50.71%       14,721/76.89%
Task I: Query Clustering (Cont.)
• Feature Extraction
  -- Use co-occurring seed terms extracted from the retrieved top pages
• Term Vector
  -- Each query term is assigned a term vector
  -- Records the co-occurring feature terms and their frequency values in the retrieved documents
• Term Similarity
  -- TF*IDF-based cosine measure
• Hierarchical Term Clustering
  -- Cluster popular query terms in the log into initial categories
  -- Query terms with similar features are grouped into clusters
Term Similarity
Hierarchical Term Clustering
Agglomerative hierarchical clustering (AHC):
1. Compute the similarity between all pairs of clusters
   • Estimate the similarity between all pairs of composed terms
   • Use the lowest term similarity value as the cluster similarity value
2. Merge the most similar (closest) two clusters
   • Complete linkage method
3. Update the cluster vector of the new cluster
4. Repeat steps 2 and 3 until only a single cluster remains
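The clustering loop can be sketched as below, as a simplified illustration of agglomerative clustering with complete linkage. It departs from the slide in one respect that should be noted: instead of maintaining and updating a cluster vector, it recomputes the cluster similarity from a precomputed term-to-term similarity table. The function names, the terms, and the similarity values are assumptions.

```python
def complete_linkage(sim, cluster_a, cluster_b):
    """Cluster similarity = lowest similarity over all cross-cluster term pairs."""
    return min(sim[x][y] for x in cluster_a for y in cluster_b)

def agglomerative(terms, sim):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [(term,) for term in terms]
    merges = []
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = max(pairs, key=lambda p: complete_linkage(sim, clusters[p[0]], clusters[p[1]]))
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical term similarities (symmetric; unlisted pairs default to 0.1)
terms = ["job", "resume", "tarot", "horoscope"]
known = {("job", "resume"): 0.9, ("tarot", "horoscope"): 0.8}
sim = {x: {y: known.get((x, y), known.get((y, x), 0.1)) for y in terms} for x in terms}
merges = agglomerative(terms, sim)
```

The merge history is the binary tree that a dendrogram draws; cutting it at a chosen similarity level yields flat clusters such as the job-related and fortune-telling groups.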
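Clustering Results
[Figure: dendrogram of Chinese query terms with a cut that yields five clusters, labeled 1 to 5. The clusters shown:]
• 武俠 (martial arts fiction), 金庸 (Jin Yong), 武俠小說 (martial arts novels), 黃易 (Huang Yi), 作家 (writer)
• 補帖 (pirated-software pack), 大補帖 (big pirated-software pack), 泡麵 (instant noodles), dbt
• eva長榮航空 (EVA airline), 長榮 (EVA), 航空公司 (airline), 航空 (airway), 華航 (China airline), 中華航空 (China airline)
• 占卜 (divination), 塔羅牌 (tarot cards), 算命 (fortune-telling), 紫微斗數 (Zi Wei Dou Shu astrology), 命理 (fate reading), 姓名學 (name divination), 心理測驗 (psychological test), 星座 (star signs), 愛情 (love)
• 勞委會 (Council of Labor Affairs), 職訓局 (Bureau of Employment and Vocational Training), 就業 (employment), 青輔會 (National Youth Commission), 自傳 (autobiography), 徵才 (recruiting), 人力資源 (human resources), 104 人力銀行 (104 Job Bank), 人力銀行 (job bank), 找工作 (job hunting), 履歷表 (résumé), 求職 (job seeking), 求才 (talent seeking)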
Cluster Partition
Quality Function
Quality Function (Cont.)
Quality Function (Cont.)
Preliminary Experiment
• Test queries
  -- Two sets: top 1k queries and random 1k queries
  -- Each of the test queries has been manually assigned to its corresponding classes
• Evaluation metrics
  -- F-Measure
Evaluation: F-Measure
Obtained F-Measures
Results of Hierarchical Structure Generation