A Model for Fast Web Mining Prototyping
Nivio Ziviani, UFMG – Brazil
Álvaro Pereira
Ricardo Baeza-Yates
Jesus Bisbal, UPF – Spain
2nd ACM International Conference on Web Search and Data Mining – WSDM'09
Motivation
• Our focus:
– Web mining as the process of discovering useful information in Web data by means of data mining techniques
• Web mining
– Computation-intensive task
– Iterative process
• Prototyping plays an important role
– Experimenting with different alternatives
– Incorporating the knowledge from previous iterations
• Mining software is developed ad hoc
– Time-consuming tasks
– Not scalable
– Not reusable
Main Objective: Design and Development of WIM
WIM – Web Information Mining model
• WIM goal: facilitate fast Web mining prototyping
• Main research challenges:
– Data model
– Algebra
– Software prototype
• Architecture and implementation issues
Web Mining Problems WIM Has Been Applied to So Far
• Study of genealogical trees on the Web (WWW'08)
– A study on how the Web textual content evolves
• A usage PageRank for ranking improvement
– A logical graph is created based on usage data
• Linkage Evolution for New Pages
– Hypothesis: duplicates tend to have no evolution of links (inlinks)
• A user intent study
– Identifying queries that cannot be classified as either navigational or informational
• Creation of a reference dataset for learning to rank
Outline
• Related work
• WIM data model
• WIM algebra
• Software architecture
• Conclusions and future work
First Research Line: Data Mining Tools
• Business-driven solutions
• Not specially designed for Web data
• SQL extensions
• Examples:
– Microsoft SQL Server
– Oracle Data Mining
– IBM DB2 Intelligent Miner
– BI tools:
• Angoss, Infor CRM Epiphany, Portrait Software, SAS
– Weka
Second Research Line: Query Languages for Web Data
• Not for mining
• Web data manipulation
– Acquisition, storage, management
• Examples:
– TSIMMIS, W3QL, WebLog, WebSQL, ARANEUS, StruQL, WebOQL, Whoweda, WEBMINER, WUM, Squeal, WebBase, WEBVIEW
Data Model – Design Goals
• Feasibility
• Simplicity
• Extensibility
• Data representativity
• Uniformity among operators
• Applicability to other scenarios
Relation Type
• Node relations represent nodes of a graph, such as:
– Documents of a Web dataset
– Terms of a document
– Queries of a query log
– Sessions of a query log
• Link relations represent edges of a graph, such as:
– Links between Web documents
– Word distance among terms of a document
– Similarity among queries
– Clicks of a query log
– Association between queries and sessions
• Usage data can be represented as either node or link relations
Node Relation
Relation docs:
id:   1    2    3    4    5    6
text: to   fly  or   not  to   fly
url:  w.a  w.b  w.c  w.d  w.e  w.f
Link Relation
• Main difference: link relations must represent start and end nodes of a graph
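[Figure: directed graph over nodes 1 to 6 with weighted edges]
Relation docs (node relation):
id:   1    2    3    4    5    6
text: to   fly  or   not  to   fly
url:  w.a  w.b  w.c  w.d  w.e  w.f
Relation graph (link relation):
id:     11  12  13  14  15  16  17
start:  1   1   2   2   3   4   5
end:    2   4   4   5   0   6   4
weight: 4   1   1   5   -1  3   2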
Compatibility
[Figure: the node relation docs (id, text, url) shown beside the graph, whose nodes 1 to 6 are the ids of docs]
• A link relation is compatible with a node relation if the nodes of the graph (link relation) are foreign keys in the node relation
Operation
• The act of applying an operator to a relation
• An operator is a function defined by the WIM algebra
– Unary or binary
[Figure: an operator applied to the node relation docs and the link relation graph yields the relation output, which is docs with an added attribute cl. holding 51, 51, 52, 53, 51, 53]
WIM Program
• Sequence of operations applied to relations
– Result of users' interaction through the WIM language
• The WIM language:
– Is built upon the WIM algebra
– Is declarative
– Is a dataflow programming language
• Facilitates parallelism
• Allows graphical implementation
WIM Program Example – Genealogical Tree Study
[Figure: dataflow diagram of the genealogical tree program, built up step by step over several slides. Operators shown: compare, disc. (Disconnect), search, cGr. (CalcGraph), sel. (Select), agg. (Aggregate), set. Relations shown: relOld, relDupOld, relClusterOld, relClusterNew, relSearch, relSearchUrl, relSeDifUrl, relEnd, relEndInst, relGenSt, relGenEnd.]
Two Classes of Operators
• Seven data manipulation operators
– Select, Calculate, CalcGraph, Aggregate, Set, Join, Materialize
• Eight data mining operators
– Search, Compare, CompGraph, Cluster, Disconnect, Associate, Analyze, Relink
Select
Select tuples from the input
countClick (input):
ses.Id: 1   2   3   4   5   6   7   8   9   10
q.Id:   11  12  13  11  13  12  14  11  13  13
num.C:  1   2   1   1   1   2   1   1   1   1
select →
one (output):
ses.Id: 1   3   4   5   7   8   9   10
q.Id:   11  13  11  13  14  11  13  13
num.C:  1   1   1   1   1   1   1   1
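The countClick example on this slide keeps only the sessions with exactly one click. A minimal sketch of that selection follows; the dict-of-columns encoding and the select function signature are assumptions for illustration, not the WIM language.

```python
# Values taken from the slide's countClick relation.
count_click = {
    "ses.Id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "q.Id": [11, 12, 13, 11, 13, 12, 14, 11, 13, 13],
    "num.C": [1, 2, 1, 1, 1, 2, 1, 1, 1, 1],
}


def select(relation, attribute, predicate):
    """Keep the tuples whose value for `attribute` satisfies `predicate`."""
    keep = [i for i, value in enumerate(relation[attribute]) if predicate(value)]
    return {name: [column[i] for i in keep] for name, column in relation.items()}


# Hypothetical call: sessions with a single click.
one = select(count_click, "num.C", lambda clicks: clicks == 1)
print(one["ses.Id"])
```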
Calculate
For mathematical and statistical calculations
tfidf1 (input):
start: 11                       | 12
end:   7, 3, 1, 4, 8            | 3, 5, 9, 2, 6
tf:    0.4, 0.3, 0.3, 0.2, 0.1  | 0.6, 0.3, 0.2, 0.1, 0.1
calc. →
tfidf2 (output):
start: 11                       | 12
end:   7, 3, 1, 4, 8            | 3, 5, 9, 2, 6
tf2:   1.0, 0.7, 0.7, 0.4, 0.0  | 1.0, 0.5, 0.3, 0.0, 0.0
CalcGraph
For calculations between nodes of the graph
relClusterOld:
id:   1     2     3     4     5     6
text: to    be    or    not   to    be
url:  w.to  w.be  w.or  w.no  w.ta  w.bf
clus: 11    12    13    14    11    12
relClusterNew:
id:   21    22    23    24    25    26
text: to    fly   or    not   to    fly
url:  w.tt  w.fl  w.of  w.no  w.tn  w.fy
clus: 31    32    33    34    31    32
relGenSt (link relation): 1 → 21, 3 → 23
c.g. →
relGenCalc:
start: 1   3
end:   21  23
sum:   42  46
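In the slide's example, the sum attribute of relGenCalc equals the clus value of the start node plus the clus value of the end node (11 + 31 = 42, 13 + 33 = 46). A small sketch of such a per-edge calculation, where the function name calc_graph and the data layout are assumptions:

```python
# clus attribute of relClusterOld (ids 1-6) and relClusterNew (ids 21-26).
clus = {1: 11, 2: 12, 3: 13, 4: 14, 5: 11, 6: 12,
        21: 31, 22: 32, 23: 33, 24: 34, 25: 31, 26: 32}
# Edges of relGenSt as (start, end) pairs.
rel_gen_st = [(1, 21), (3, 23)]


def calc_graph(edges, node_attr, op):
    """Apply `op` to the attribute values of each edge's start and end node."""
    return [(start, end, op(node_attr[start], node_attr[end]))
            for start, end in edges]


rel_gen_calc = calc_graph(rel_gen_st, clus, lambda a, b: a + b)
print(rel_gen_calc)
```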
Aggregate
Group tuples with the same value for one or two attributes
[Figure: aggregate applied to the graph relCocit yields the graph relAgg]
one (input):
ses.Id: 1   3   4   5   7   8   9   10
q.Id:   11  13  11  13  14  11  13  13
url.Id: 21  22  21  22  26  21  26  22
aggregate →
most (output):
ses.Id: 1   3   7   9
q.Id:   11  13  14  13
url.Id: 21  22  26  26
m.one:  3   3   1   1
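The slide's table example is consistent with grouping on the pair (q.Id, url.Id), keeping the first tuple of each group, and counting the group size in m.one. That reading is inferred from the example values; the sketch below, including the function name aggregate, is an assumption rather than the operator's actual definition.

```python
# Tuples of the slide's relation `one` as (ses.Id, q.Id, url.Id).
one = [(1, 11, 21), (3, 13, 22), (4, 11, 21), (5, 13, 22),
       (7, 14, 26), (8, 11, 21), (9, 13, 26), (10, 13, 22)]


def aggregate(rows, key):
    """Group rows by key(row); emit the first row of each group plus its size."""
    groups = {}  # dicts preserve insertion order, so groups keep first-seen order
    for row in rows:
        k = key(row)
        if k not in groups:
            groups[k] = [row, 0]
        groups[k][1] += 1
    return [first + (count,) for first, count in groups.values()]


# Group on the two attributes q.Id and url.Id.
most = aggregate(one, key=lambda row: (row[1], row[2]))
print(most)
```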
Set
For union, intersection and difference of tuples in two relations
relClusterNew:
id:   21    22    23    24    25    26
text: to    fly   or    not   to    fly
url:  w.tt  w.fl  w.of  w.no  w.tn  w.fy
clus: 31    32    33    34    31    32
relSeDifUrl:
start: 1       3   5
end:   21, 25  23  21, 25
num:   2       1   2
sim:   0, 0    0   0, 0
set, set →
relEnd:
id:   21    23    25
text: to    or    to
url:  w.tt  w.of  w.tn
clus: 31    33    31
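A minimal sketch of the three set operations on tuples, using Python's built-in set type. Representing a relation as a set of tuples is an assumption for illustration; the sample tuples follow the slide's relClusterNew and relEnd values.

```python
# Tuples as (id, text, url, clus).
rel_cluster_new = {
    (21, "to", "w.tt", 31), (22, "fly", "w.fl", 32), (23, "or", "w.of", 33),
    (24, "not", "w.no", 34), (25, "to", "w.tn", 31), (26, "fly", "w.fy", 32),
}
rel_end = {
    (21, "to", "w.tt", 31), (23, "or", "w.of", 33), (25, "to", "w.tn", 31),
}

union = rel_cluster_new | rel_end          # tuples in either relation
intersection = rel_cluster_new & rel_end   # tuples in both relations
difference = rel_cluster_new - rel_end     # tuples only in the first relation

print(sorted(t[0] for t in difference))
```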
Join
Add an external attribute into a given relation
[Figure: join takes queryData (attributes id, text, n.cli) and a relation with ses.Id 1, 3, 7, q.Id 11, 13, 14, url.Id 21, 22, 26 and m.one 3, 3, 1, and yields mostOne with q.Id 11, 12, 13, 14 and m.one 3, 0, 3, 1]
Search
Used for querying (TF-IDF, BM25, AND, OR)
dataSet (input):
id:   1    2    3    4    5    6    7    8    9    10
pr:   0.6  0.2  0.5  0.9  0.1  0.2  0.2  0.6  0.6  0.3
c.Id: 1    2    3    4    5    6    1    7    6    8
text: (document texts such as "to fly...", "to buy...", "to be...")
queryList (input):
id:   11      12
text: to fly  to buy
search →
tfidf (output):
start: 11                       | 12
end:   7, 3, 1, 4, 8            | 3, 5, 9, 2, 6
tf:    0.4, 0.3, 0.3, 0.2, 0.1  | 0.6, 0.3, 0.2, 0.1, 0.1
Compare
Compare elements of a textual attribute
relOld (input):
id:   1     2     3     4     5     6
text: to    be    or    not   to    be
url:  w.to  w.be  w.or  w.no  w.ta  w.bf
compare →
relDupOld (output):
start: 1  2  3  4  5  6
end:   5  6  (none for 3 to 6)
num:   1  1  0  0  0  0
Disconnect
Identify clusters in a graph
relDupOld (input):
start: 1  2  3  4  5  6
end:   5  6  (none for 3 to 6)
num:   1  1  0  0  0  0
disconnect →
relClusterOld (output):
id:   1     2     3     4     5     6
text: to    be    or    not   to    be
url:  w.to  w.be  w.or  w.no  w.ta  w.bf
clus: 11    12    13    14    11    12
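In the slide's example, documents linked as duplicates (1 with 5, 2 with 6) end up in the same cluster, which is what a connected-components pass would produce. The sketch below uses union-find for that; the function name disconnect, its signature, and the choice to number clusters from 11 (to mirror the slide) are assumptions, not the operator's documented behavior.

```python
def disconnect(node_ids, edges, first_label=11):
    """Label each node with the id of its connected component."""
    parent = {n: n for n in node_ids}

    def find(n):
        # Walk up to the root, halving the path along the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for start, end in edges:
        parent[find(end)] = find(start)

    labels = {}
    clus = []
    for n in node_ids:
        root = find(n)
        if root not in labels:
            labels[root] = first_label + len(labels)
        clus.append(labels[root])
    return clus


# Duplicate links from the slide's relDupOld: 1 -> 5 and 2 -> 6.
clus = disconnect([1, 2, 3, 4, 5, 6], [(1, 5), (2, 6)])
print(clus)
```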
Analyze
For link analysis (PageRank, Authority, Indegree)
[Figure: analyze applied to the graph relPruned over nodes 1 to 4 yields relUsDocs with id 1, 2, 3, 4 and u.pr 0.1, 0.2, 0.4, 0.5]
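Two of the link-analysis measures named above can be sketched in a few lines. This is a textbook power-iteration PageRank and an indegree count, not the WIM implementation; the function names, the damping factor 0.85, the iteration count, and the sample edges are all assumptions chosen for illustration.

```python
def indegree(nodes, edges):
    """Count incoming edges per node."""
    degree = {n: 0 for n in nodes}
    for _, end in edges:
        degree[end] += 1
    return degree


def pagerank(nodes, edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank; dangling nodes spread rank uniformly."""
    out_links = {n: [] for n in nodes}
    for start, end in edges:
        out_links[start].append(end)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            targets = out_links[n] or nodes
            share = damping * rank[n] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical four-node graph.
nodes = [1, 2, 3, 4]
edges = [(1, 3), (2, 3), (3, 4), (4, 3)]
ranks = pagerank(nodes, edges)
print(max(ranks, key=ranks.get), indegree(nodes, edges))
```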
Software Architecture
[Figure: WIM software architecture. Components: Web crawler, Pre-processor, Indexer, Compiler, Executor, Visualizer. Data: attributes of relations (att 1 ... att n) with index and metadata, temporary attributes (tmp1 ... tmp n) with a temporary index, the program, and the outputs (out 1 ... out n).]
Conclusions
• WIM – a model and software for fast Web mining prototyping
– Data model
– Algebra
– A software prototype
• Efficient
– Several tens of millions of tuples
– Running time is higher for the mining operations
• Ad-hoc solutions also need the mining step
• Scalable
– A future implementation could store the attributes on different servers and run different parts of a program in a distributed fashion
• Extensible
– New operators, and new options/methods for the current operators, can be added
• We have designed and implemented an extension of operator Analyze
– Calculates PageRank taking the labels of the graph into account
• Effective for a set of Web mining applications
Future Work on WIM
• Finish the implementation and make a version of the prototype available
– Users could contribute extensions
– Improve the prototype to become a tool
• Design new operators for other mining tasks
• Integrate a Web crawler and a data visualization interface
• Implement a graphical interface to the WIM language