Xyleme, 2001 1
A Dynamic Warehouse for the XML data of the Web
Grégory Cobena, INRIA & Xyleme SA ([email protected])
Serge Abiteboul, INRIA & Xyleme SA ([email protected])
http://www-rocq.inria.fr/verso/ http://www.xyleme.com/
2
Organization
• 1. The Web and XML
• 2. Xyleme
• 3. Data Acquisition and Maintenance
• XML Repository, Semantic Data Integration and Query Processing
• 4. Query Subscription
• Conclusion
3
1. The Web and XML
4
The Web today
• Terabytes of data
• A lot of public pages
– 1 billion in 06/2000
– several million servers
• Private web: not publicly available pages
• Deep web: data hidden behind forms
5
HTML = Hypertext Language
Ref    Name     Price
X23    Camera   359.99
R2D2   Robot    19350.00
Z25    PC       1299.99
Information System
HTML
The <b> X23 </b> new camera replaces the <b> X22 </b>. It comes equipped with a flash (worth by itself <i>53.99 $</i>) and provides great quality for only <i>359.99 $</i>.
Text + presentation
Where is the data?
hard
6
XML = Semistructured Data
Ref    Name     Price
X23    Camera   359.99
R2D2   Robot    19350.00
Z25    PC       1299.99
...
Information System
XML
<product-table>
  <product reference="X23">
    <designation> camera </designation>
    <price unit="Dollars"> 359.99 </price>
    <description> … </description>
  </product>
  <product reference="R2D2">
    <designation> Robot </designation>
    <price unit="Dollars"> 19350 </price>
    <description> … </description>
  </product>
  ...
</product-table>
Data + Structure
Semistructured: more flexible
easy
7
XML : Tree Types
• Semantics and structure are in paths
– product-table/product/reference
– product-table/product/price

product-table
  product
    reference
    designation
    price
    description
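The idea that semantics live in paths can be illustrated with a minimal sketch. It assumes a small sample document modeled on the product table (the description text is made up) and treats attributes such as reference as child labels, as the paths above do.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modeled on the product-table slide.
SAMPLE = """
<product-table>
  <product reference="X23">
    <designation>camera</designation>
    <price unit="Dollars">359.99</price>
    <description>compact camera</description>
  </product>
</product-table>
"""

def label_paths(element, prefix=""):
    """Yield every root-to-node label path; attributes count as child labels."""
    path = f"{prefix}/{element.tag}" if prefix else element.tag
    yield path
    for name in element.attrib:
        yield f"{path}/{name}"
    for child in element:
        yield from label_paths(child, path)

paths = sorted(set(label_paths(ET.fromstring(SAMPLE))))
for p in paths:
    print(p)
```

Two documents with different layouts can then be compared by the sets of paths they expose.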
8
2. A Dynamic Warehouse for the XML Data of the Web
Xyleme
9
Xyleme Research
Xyleme project at INRIA (1999-2000): explore XML + Web + DBMS to make the Web a knowledge database
• INRIA
– Sophie Cluet: Databases (OQL…)
– Serge Abiteboul: semi-structured data + web
– Guy Ferran: ex O2 Technology
• Mannheim University
– Guido Moerkotte
• Université d’Orsay
– Marie Christine Rousset
• CNAM
– Dan Vodislav
10
Xyleme Company
• Started September 2000 (25 employees at the end of 2001)
• Market Challenges:
– Few XML documents available on the Web (because of weak software support)
– Company is focusing on private XML:
  • Press, Editors, Financial Data, Biology…
– Technology:
  • Scalability for large amounts of data
  • Internet (+focus) / Intranet support
  • Monitoring and Version Management
  • Heterogeneous Data Integration
11
Architecture
• Cluster of PCs
• Developed with Linux and C++
• Communications
– local: CORBA
– external: HTTP
• Distribution between autonomous machines
• Now Web Services
12
Functional Architecture
[Diagram: modules of the system, connected to the Internet]
• User Interface
• Web Interface
• Xyleme Interface
• Query Processor
• Semantic Module
• Change Control
• Acquisition & Crawler
• Loader
• Repository and Index Manager
13
Architecture
[Diagram: boxes connected by Ethernet, with access to the Internet]
• Acquisition and Maintenance (×2)
• Change Control and Semantic Integration (×2)
• Loader | Query (×2)
• Index (×3)
• Repository (×4)
14
3. Data Acquisition and Maintenance,
Page Importance
15
Goals
• Discover XML pages on the web that are of interest to customers
– To do so, crawl the web (HTML + XML)
• Keep them up to date
• Do this under bounded resources:
– Memory for known URLs
– Bandwidth
16
Life Cycle of a page in Xyleme
• The URL of D is discovered as a link in another page (or published by a customer)
• The page scheduler decides to read D
– The metadata of D is read
  • type, last_date_update...
– The document D is loaded
• The document D is (re)read regularly
17
Main Issues
• Loading of pages
– we can load up to 5 million pages/day on a standard PC
– main cost is Internet connection
• Metadata management (access to disk)
• Page scheduling
– decide which page to read or refresh next
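To make the scheduling problem concrete, here is a minimal sketch of a priority-based page scheduler under a bounded set of known URLs. The class name, the priority values and the policy of dropping URLs when memory is full are assumptions for illustration; the slides do not describe the actual policy. The first line only converts the quoted loading rate into pages per second.

```python
import heapq

# 5 million pages/day expressed per second (86,400 seconds in a day).
PAGES_PER_SECOND = 5_000_000 / 86_400

class PageScheduler:
    """Toy scheduler: always read the known URL with the highest priority."""

    def __init__(self, max_known_urls):
        self.max_known_urls = max_known_urls  # bounded memory for known URLs
        self._heap = []                       # entries are (-priority, url)
        self._known = set()

    def discover(self, url, priority):
        """Register a URL; ignore it if already known or if memory is full."""
        if url in self._known or len(self._known) >= self.max_known_urls:
            return False
        self._known.add(url)
        heapq.heappush(self._heap, (-priority, url))
        return True

    def next_page(self):
        """Return the next URL to read, or None when nothing is pending."""
        if not self._heap:
            return None
        _, url = heapq.heappop(self._heap)
        return url

    def reschedule(self, url, priority):
        """Put a page back in the queue so that it is (re)read later."""
        heapq.heappush(self._heap, (-priority, url))

scheduler = PageScheduler(max_known_urls=2)
scheduler.discover("http://a.example/", 0.2)
scheduler.discover("http://b.example/", 0.9)
dropped = not scheduler.discover("http://c.example/", 0.5)  # memory is full
first = scheduler.next_page()
```

A real scheduler would also weigh refresh frequency and bandwidth; here the priority is a single number supplied by the caller.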
18
Page Importance
• Definition: Important pages are linked to by important pages
• Offline algorithm (used by Google)
• Our Online algorithm (M. Preda, S. Abiteboul, G. Cobena)
– does not require maintaining graph information
– faster convergence with focused crawling
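For reference, the recursive definition of importance can be sketched with the textbook offline computation: a power iteration over the stored link graph. This is not the online algorithm named above, which the slides do not describe; the damping factor, the handling of pages without outgoing links and the toy graph are assumptions.

```python
def page_importance(links, iterations=50, damping=0.85):
    """Power iteration over a link graph given as {page: [linked pages]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    score = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page shares its importance equally among the pages it links to.
                share = damping * score[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Page without outgoing links: spread its importance over all pages.
                for t in pages:
                    new[t] += damping * score[page] / n
        score = new
    return score

# Hypothetical four-page graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
importance = page_importance(graph)
```

Note that this version needs the whole graph in memory, which is exactly the cost the online approach is said to avoid.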
19
(XML Repository, Semantic Data Integration and Query Processing)
20
Querying Language
• Today: a mix of OQL and XQL
• We are currently moving to XQuery (which is also a mix of OQL and XQL…)

Select boss/Name, boss/Phone
From comp in BusinessDomain, boss in comp//Manager
Where comp/Product contains "Xyleme"
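The query above can be read clause by clause. The sketch below emulates it over two made-up documents; the document contents, the Company and Staff element names and the substring reading of "contains" are assumptions, not Xyleme's semantics.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents standing in for the BusinessDomain collection.
BUSINESS_DOMAIN = [
    "<Company><Product>Xyleme warehouse</Product>"
    "<Staff><Manager><Name>Alice</Name><Phone>555-0100</Phone></Manager></Staff>"
    "</Company>",
    "<Company><Product>Other product</Product>"
    "<Manager><Name>Bob</Name><Phone>555-0101</Phone></Manager>"
    "</Company>",
]

def managers_of(keyword, documents):
    """Emulate the Select/From/Where query for a given keyword."""
    rows = []
    for source in documents:                    # From comp in BusinessDomain
        comp = ET.fromstring(source)
        products = comp.findall("Product")      # comp/Product: direct children
        if not any(keyword in (p.text or "") for p in products):
            continue                            # Where comp/Product contains keyword
        for boss in comp.iter("Manager"):       # boss in comp//Manager: any depth
            rows.append((boss.findtext("Name"), boss.findtext("Phone")))
    return rows

result = managers_of("Xyleme", BUSINESS_DOMAIN)
print(result)
```

The `//` step is what lets the same query reach Manager elements nested at different depths in different documents.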
21
Web Heterogeneity
• Semantic domains, e.g., cinema
• Many possible types for data in this domain, many DTDs
• Semantic Integration
– one abstract DTD for the domain
– gives the illusion that the system maintains a homogeneous database for this domain
1 domain = 1 abstract DTD
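One way to picture an abstract DTD is as a table that maps each abstract path to the concrete paths found in real documents. The sketch below is only an illustration: every path in it is invented, and the slides do not say how the mapping is stored or computed.

```python
# Hypothetical mapping for a "cinema" domain: abstract path -> concrete paths.
ABSTRACT_TO_CONCRETE = {
    "cinema/movie/title": ["film/titre", "movie-list/movie/name"],
    "cinema/movie/director": ["film/realisateur", "movie-list/movie/directed-by"],
}

def rewrite(abstract_path):
    """Translate one abstract-DTD path into the concrete paths to look up."""
    return ABSTRACT_TO_CONCRETE.get(abstract_path, [])

concrete = rewrite("cinema/movie/title")
print(concrete)
```

A query written against the abstract DTD can then be expanded into one lookup per concrete path.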
22
Indexing
• Standard inverted index
– word → documents that contain this word
• Xyleme index
– word → elements that contain this word (document + element identifier)
• Goal: more work can be performed without accessing data
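The difference between the two kinds of index can be sketched as follows. This is a toy version: it indexes only the text directly inside each element and uses the element's position in document order as its identifier, both of which are assumptions.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def index_document(index, doc_id, source):
    """Add postings word -> (document id, element id) for one XML document."""
    root = ET.fromstring(source)
    # Element identifier: position of the element in document order (an assumption).
    for element_id, element in enumerate(root.iter()):
        for word in (element.text or "").lower().split():
            index[word].add((doc_id, element_id))

index = defaultdict(set)
index_document(
    index, "d1",
    "<product><designation>camera</designation>"
    "<description>digital camera with flash</description></product>",
)
index_document(index, "d2", "<product><designation>robot</designation></product>")

hits = sorted(index["camera"])
print(hits)
```

Because each posting names an element and not just a document, a condition on where the word occurs can be checked from the index alone.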
23
4. Change Control
24
The Web changes all the time
• Data acquisition + maintenance
– keep the warehouse up-to-date
• Version management
– representation and storage of changes
• Change monitoring
– query subscription
25
Subscription Language
• SQL-like language based on ‘atomic events’.
• Combines the use of monitoring queries and continuous queries.
• The language can be extended by adding new types of atomic events.
• Uses the XML Query Language for continuous queries.
“Querying the XML Documents of the Web”, V. Aguilera, S. Cluet, F. Boiscuvier, Tech. Report
26
Example
subscription myPaintings
% what are the new painting entries in the Musée d’Orsay site
monitoring newPainting
  select URL
  where URL extends www.musee-orsay.fr/*
  and <painter> contains "Monet"
% manage the changes in the exhibitions
continuous delta Exposition
  select ... from ... where
  when monthly
notify daily % send me a daily report

The two conditions in the where clause of the monitoring query are atomic events.
27
Step 1: Atomic Event Detection
[Diagram labels: loading, metadata manager, HTML parser, XML loader, document & alerts, complex event detection. The document is labeled d, then d/46, then d/46,67 as atomic events are detected.]
• atomic event 46: URL matches pattern www.musee-orsay.fr/*
• atomic event 67: XML document contains the tag <painter> with the value “Monet”
• 5 million pages/day
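The two atomic events of the example can be detected with a pair of small functions applied to each loaded document. This is a minimal sketch: the function names, the use of shell-style pattern matching, the substring test and the sample URL and document are assumptions.

```python
import xml.etree.ElementTree as ET
from fnmatch import fnmatchcase

def url_alerter(url, _root):
    """Atomic event 46: URL matches the pattern www.musee-orsay.fr/*."""
    return {46} if fnmatchcase(url, "www.musee-orsay.fr/*") else set()

def painter_alerter(_url, root):
    """Atomic event 67: some <painter> element contains "Monet"."""
    found = any("Monet" in (e.text or "") for e in root.iter("painter"))
    return {67} if found else set()

ALERTERS = [url_alerter, painter_alerter]

def detect_atomic_events(url, source):
    """Run every alerter over one loaded document and collect event ids."""
    root = ET.fromstring(source)
    events = set()
    for alerter in ALERTERS:
        events |= alerter(url, root)
    return events

# Hypothetical URL and document.
events = detect_atomic_events(
    "www.musee-orsay.fr/collection/42.xml",
    "<painting><painter>Claude Monet</painter></painting>",
)
print(sorted(events))
```

The document then travels on with its set of detected event ids, which matches the d/46,67 label in the diagram.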
28
Step 2: Complex Event Detection
[Diagram labels: HTML parser, XML loader, complex event detection]
• complex event 12: 67 & 46 (XML document contains the tag <painter> with value “Monet” and URL matches pattern www.musee-orsay.fr/*)
• Millions of page alerts/day
• Millions of subscriptions
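A complex event is a conjunction of atomic events, so detection amounts to finding every registered conjunction that is fully covered by the events detected on a document. The sketch below uses a simple counting scheme as a baseline; it is not the algorithm used in Xyleme, and complex event 13 and atomic event 99 are invented for the example.

```python
from collections import defaultdict

# Complex event id -> set of atomic event ids that must all be detected.
COMPLEX_EVENTS = {12: {46, 67}, 13: {46, 99}}  # 13 and 99 are hypothetical

# Inverted map: atomic event id -> complex events that mention it.
BY_ATOMIC = defaultdict(set)
for complex_id, atoms in COMPLEX_EVENTS.items():
    for atom in atoms:
        BY_ATOMIC[atom].add(complex_id)

def detect_complex_events(atomic_events):
    """Return the complex events whose atomic events are all in the given set."""
    counts = defaultdict(int)
    for atom in set(atomic_events):
        for complex_id in BY_ATOMIC.get(atom, ()):
            counts[complex_id] += 1
    # A complex event fires when every one of its atomic events was counted.
    return {c for c, n in counts.items() if n == len(COMPLEX_EVENTS[c])}

raised = detect_complex_events({46, 67})
print(raised)
```

The cost per document depends on how many complex events share its atomic events, which is why the choice of data structure matters at millions of subscriptions.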
29
Step 3: Notification Processor
[Diagram labels: complex event detection, alerts, triggers, clock, continuous queries, notification/monitoring, notification/results, Reporter]
• Millions of notifications/day
30
Architecture
[Diagram labels: Web Browser, Xyleme Subscription Manager, Subscription Manager, Xyleme Alerter, Complex Event Detection, Trigger Engine, Xyleme Query Processor, Reporter, Xyleme Reporter, SQL (twice), documents]
31
Complex Events Algorithm
• The formal problem is NP-hard
• We proposed several possible algorithms
• Experimental (simulation) results showed the effectiveness of our solutions
• The Hash-Tree based algorithm is well suited for our application:
– 10 million Complex Events
– 1 million Atomic Events
– 100 Atomic events detected per document
0.8 ms to process a document. ~2 million documents per day.
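The slides name a Hash-Tree based algorithm without describing it. As a rough illustration of the general idea only, the sketch below stores each complex event as a sorted sequence of atomic event ids in a tree whose nodes are hash tables, then walks the tree with the sorted events of a document. The structure, names and sample events are assumptions and may differ from the authors' algorithm.

```python
class Node:
    def __init__(self):
        self.children = {}      # atomic event id -> Node (one hash table per node)
        self.complex_ids = []   # complex events that end at this node

def build(complex_events):
    """Insert each complex event as its sorted list of atomic event ids."""
    root = Node()
    for complex_id, atoms in complex_events.items():
        node = root
        for atom in sorted(atoms):
            node = node.children.setdefault(atom, Node())
        node.complex_ids.append(complex_id)
    return root

def match(root, detected):
    """Return ids of complex events whose atomic events are all in `detected`."""
    atoms = sorted(detected)
    found = []

    def walk(node, start):
        found.extend(node.complex_ids)
        # Only later atoms can extend a sorted sequence.
        for i in range(start, len(atoms)):
            child = node.children.get(atoms[i])
            if child is not None:
                walk(child, i + 1)

    walk(root, 0)
    return sorted(found)

# Hypothetical complex events; 46 and 67 come from the example above.
tree = build({12: {46, 67}, 13: {46, 99}, 14: {67}})
matched = match(tree, {46, 67, 80})
print(matched)
```

Complex events that share a prefix of atomic events share a path in the tree, so common conditions are examined once per document.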
32
Alerters
• Each Alerter can be viewed as a plug-in that acts on a document flow.
• All sorts of Atomic events can be detected: URL pattern detection, Keywords, XPath expressions, Page rank…
• Can be distributed.
33
Some Advanced Alerts
• Process document flow (single pass)
• Full strings
– Context Stack
– Reversed look-up
• XML Alerts
– Reversed XPath expressions
– Dual context stack for ‘/’ and ‘//’
34
Versions
• Objectives:
– Temporal Queries (persistent identification of nodes)
– Version some documents or some sites (store a ‘delta’)
– Change Monitoring (query changes)
• We proposed a representation of changes: “Change-Centric Management of Versions” (VLDB 2001)
• We developed a Diff algorithm for XML: “Detecting Changes in XML Documents”, G. Cobena, S. Abiteboul, A. Marian, ICDE 2002 (San Jose)
35
Conclusion & Perspectives
• Focus crawling on important pages
– Refine the notion of importance
– Improve the discovery of important pages
• Improve Change control accuracy
– Semantic web
– Real-time advanced processing
36
Thank you