Towards Automating Complex Associative Access to Multiple Bioinformatics Data Sources. Ling Liu, Calton Pu, David Buttler, Wei Han, Henrique Paques, Dan Rocco (Georgia Tech)


Page 1

Towards Automating Complex Associative Access to Multiple Bioinformatics Data Sources

Ling Liu, Calton Pu, David Buttler, Wei Han, Henrique Paques, Dan Rocco

Georgia Tech

Page 2

Outline

  • State of the Art
      ◦ Users’ Perspective
      ◦ Technology Perspective
  • Why SDM Technology: XWRAP Composer
      ◦ Users’ Perspective
      ◦ Technology Perspective
  • Progress Report and Near-Term Deliverables
  • Related Long-Term Research

Page 3

Why Automating Complex Associative Access

[Figure: two panels over large, unorganized document collections. Panel "Today: Simple Query-Based Searching" shows Query 1, Query 2, Query 3, and Query 4 against the Web. Panel "Tomorrow with SDM Technology" shows a single Query against the Semantic Web.]

  • Today: Complex Associative Access requires experts
  • Tomorrow: Complex Associative Access is automated (one-stop shopping)

Page 4

Why Automating Complex Associative Access

[Figure: two panels over large, unorganized document collections. Panel "Today: Simple Query-Based Searching" shows Query 1, Query 2, Query 3, and Query 4 against the Web. Panel "Tomorrow with SDM Technology" shows the Semantic Web. Additional labels in the figure: Characterize, Sort, Partition, Filter, Summarize.]

Page 5

Automating Complex Associative Access

  • Wrapper Technology
  • Workflow Technology
  • Semantic Web Technology
      ◦ Service Discovery
      ◦ Service Selection
      ◦ Service Composition
  • Research Issues
      ◦ Semantic Data Integration, Interoperability
      ◦ Scalability, High Performance
      ◦ Trusted Computing, Dependable, Survivable

Page 6

XWRAPComposer

  • What is it?
      ◦ A wrapper generation system that can semi-automatically generate wrappers (info. extraction programs) capable of accessing multiple scientific Web pages in one shot.
  • What makes it different from other existing XWRAP tools?
      ◦ Capable of generating wrappers that extract information from multiple Web pages connected by URLs (page links) and compose them into an integrated XML document
      ◦ Extremely useful for Automating Complex Associative Access to multiple scientific data sources
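The idea of one wrapper that follows page links and composes the extracted pieces into a single XML document can be illustrated with a minimal sketch. This is not XWRAPComposer output or its scripting language; the in-memory `PAGES` table, the `example.org` URLs, the regular expressions, and the `result`/`hit`/`score` element names are all assumptions made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical in-memory "Web" (URL -> page text). A real wrapper would
# fetch these pages over HTTP; the URLs and markup here are invented.
PAGES = {
    "http://example.org/summary": (
        "<html><body>"
        '<a href="http://example.org/detail/1">hit-1</a>'
        '<a href="http://example.org/detail/2">hit-2</a>'
        "</body></html>"
    ),
    "http://example.org/detail/1": "<html><body><b>Score:</b> 98</body></html>",
    "http://example.org/detail/2": "<html><body><b>Score:</b> 75</body></html>",
}

LINK_RE = re.compile(r'<a href="([^"]+)">([^<]+)</a>')
SCORE_RE = re.compile(r"<b>Score:</b>\s*(\d+)")


def fetch(url):
    """Stand-in for a page download."""
    return PAGES[url]


def wrap_detail(url):
    """Single-page wrapper: extract the fields of one detail page."""
    match = SCORE_RE.search(fetch(url))
    return {"score": match.group(1) if match else ""}


def wrap_summary(url):
    """Composed wrapper: follow each link on the summary page and merge
    the per-page extractions into one XML tree."""
    root = ET.Element("result", source=url)
    for href, label in LINK_RE.findall(fetch(url)):
        hit = ET.SubElement(root, "hit", name=label, url=href)
        for field, value in wrap_detail(href).items():
            ET.SubElement(hit, field).text = value
    return root


document = wrap_summary("http://example.org/summary")
print(ET.tostring(document, encoding="unicode"))
```

The point of the sketch is the shape of the result: the caller asks once and receives one integrated document, rather than running a separate wrapper per page and joining the outputs by hand.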

Page 7

SDM Enabling Technology: XWRAPComposer

Existing Wrapper Technology: Extracting Data from a Single Web Document

[Figure: four queries (Query 1, Query 2, Query 3, Query 4) and four single-page wrappers: Seq. Link Wrapper, Sequence Wrapper, Blast Sum Wrapper, Blast Detail Wrapper. Sample values shown in the figure: AA045112, htgs, and the sequence below.]

CACCTGGAGAAACTTCTGCACTGGCACTGTGTTCCNAGAGCTCCTTCTATGCGTCCCTCC
CAAGTGATTTAATTTCAGCTGATTGGACTACGAATTCACAAGGCAGAAAAGTCAAGGTCA
TTTGGNATCTGGAGACAGGAGAACTCAAGGAACCNAAAGGACT

Page 8

SDM Enabling Technology: XWRAPComposer

Wrapper Composer Technology: Extracting Data from Multiple Web Documents

[Figure: two queries (Query 1, Query 2) and two composed wrappers: Full Seq Wrapper and Blast Wrapper. Sample values shown in the figure: AA045112 (with Query 1), htgs, and the sequence below.]

CACCTGGAGAAACTTCTGCACTGGCACTGTGTTCCNAGAGCTCCTTCTATGCGTCCCTCC
CAAGTGATTTAATTTCAGCTGATTGGACTACGAATTCACAAGGCAGAAAAGTCAAGGTCA
TTTGGNATCTGGAGACAGGAGAACTCAAGGAACCNAAAGGACT

Page 9

XWRAPComposer: Technical Perspective

Example task: given a sequence, list all matching DNAs.

[Figure: the Blast Wrapper accesses the NCBI Blast site on the Web. Pages shown: Blast Query Page, Blast Format Page, Blast Delay Page, Blast Summary Page, Blast Detail Page.]

  • Interface/Outerface Specification
  • Composer Script
  • Multi-page Control Flow Modeling
  • Data Extraction Workflow
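Multi-page control flow of the kind listed above can be sketched as a small driver that moves from page to page and loops on the delay page until results are ready. The sketch below does not contact NCBI and does not use any real Blast API; `FakeBlastSite`, its methods, the request id, the hit names, and the page ordering are hypothetical stand-ins chosen for illustration.

```python
import xml.etree.ElementTree as ET


class FakeBlastSite:
    """Hypothetical stand-in for a remote search site; no network access."""

    def __init__(self, polls_until_ready=2):
        self.polls_left = polls_until_ready

    def submit_query(self, sequence):
        # Query page: accept the sequence, move on to the format page.
        return {"page": "format", "request_id": "REQ-1"}

    def submit_format(self, request_id):
        # Format page: choose output options, then wait for results.
        return {"page": "delay", "request_id": request_id}

    def poll(self, request_id):
        # Delay page: keep answering "delay" until the results are ready.
        if self.polls_left > 0:
            self.polls_left -= 1
            return {"page": "delay", "request_id": request_id}
        return {"page": "summary", "hits": ["hit-A", "hit-B"]}

    def detail(self, hit_id):
        # Detail page for one hit listed on the summary page.
        return {"page": "detail", "id": hit_id,
                "description": "details for " + hit_id}


def run_blast_wrapper(site, sequence, max_polls=10):
    """Drive the page sequence and compose one XML result."""
    visited = ["query"]
    page = site.submit_query(sequence)
    visited.append("format")
    page = site.submit_format(page["request_id"])
    polls = 0
    while page["page"] == "delay":
        if polls >= max_polls:
            raise TimeoutError("results not ready after polling")
        visited.append("delay")
        page = site.poll(page["request_id"])
        polls += 1
    visited.append("summary")
    root = ET.Element("matches", query=sequence)
    for hit_id in page["hits"]:
        detail = site.detail(hit_id)
        visited.append("detail")
        ET.SubElement(root, "match", id=detail["id"]).text = detail["description"]
    return root, visited


matches, visited = run_blast_wrapper(FakeBlastSite(), "CACCTGGAGAAACTTCTG")
print(visited)
print(ET.tostring(matches, encoding="unicode"))
```

Bounding the delay loop with `max_polls` is a design choice of this sketch: a wrapper that waits on a remote page needs some way to give up instead of polling forever.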

Page 10

SDM Center Data Integration Infrastructure

[Figure: architecture diagram. Component labels recoverable from the figure, in order of appearance:]

  • User (Matt); Workflow Agent; Service registry and brokering
  • Data Integration Agent(s); Data Mediation; Wrapper-based Agents; Other Agents (e.g., VIPAR); Database Access
  • Communication Protocol Gateway; External Program; XML Wrappers; Data Sources
  • Executable Workflow Plan: “Matt’s WF”; DB
  • External Interface; Program Interfacing; Other I/O Agents
  • Extraction Rules; Human Knowledge; GUI; Code Generator
  • Parameterized Workflow Specification (PWS); Source Capabilities (SC); Binding Patterns
  • User Agent; User constraints & parameters
  • Workflow Resolution Service (WRS); Domain Map/Ontology; Workflow Instantiation Service (WIS)
  • Outcomes: WF feasible; WF infeasible: report reason
  • Data Registration; Services Registration; DB
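The diagram names source capabilities, binding patterns, and a resolution step that ends in "WF feasible" or "WF infeasible: report reason". One plausible reading, sketched below, is a check that every workflow step has its required inputs bound before it runs. The slides do not describe how WRS actually works, so the `SourceCapability` type, the `resolve_workflow` function, the step names, and the attribute names are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceCapability:
    """Hypothetical capability record for one wrapped source."""
    name: str
    requires: frozenset  # attributes that must be bound before the call
    provides: frozenset  # attributes the source returns


def resolve_workflow(steps, capabilities, user_bindings):
    """Return (True, plan) if every step can run in order,
    otherwise (False, reason)."""
    bound = set(user_bindings)
    plan = []
    for step in steps:
        cap = capabilities.get(step)
        if cap is None:
            return False, f"no registered source for step '{step}'"
        missing = cap.requires - bound
        if missing:
            return False, f"step '{step}' needs unbound inputs: {sorted(missing)}"
        plan.append(cap.name)
        bound |= cap.provides  # outputs become available to later steps
    return True, plan


# Assumed registry contents, loosely named after wrappers in the slides.
capabilities = {
    "lookup_sequence": SourceCapability(
        "SequenceWrapper", frozenset({"accession"}), frozenset({"sequence"})),
    "blast": SourceCapability(
        "BlastWrapper", frozenset({"sequence"}), frozenset({"hits"})),
}

feasible = resolve_workflow(["lookup_sequence", "blast"], capabilities, {"accession"})
infeasible = resolve_workflow(["blast"], capabilities, {"accession"})
print(feasible)
print(infeasible)
```

In this reading, the second call is infeasible because the only bound attribute is `accession` while the assumed `blast` step requires `sequence`, and the reason string plays the role of the "report reason" branch.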

Page 11

Progress Report

  • Status
      ◦ Produced three deliverables:
          ▪ Composer Interface/Outerface Specification
          ▪ Five Java Wrappers for Pilot Scenario
          ▪ Composer Script Examples for Pilot Scenario
      ◦ XWRAPComposer design and development
  • Near-Term Plan
      ◦ Finish the design of XWRAP Composer scripting language (Nov. 2002)
      ◦ Develop the first prototype of XWRAP Composer system (Jan. 2003)
      ◦ Performance Evaluation (March 2003)

Page 12

Related Long-Term Research

  • Semantic Web and Semantic Data Integration
  • Service Discovery
      ◦ Dynamic content crawler
  • Service Selection
      ◦ Adaptive query routing
  • Service Composition
      ◦ Infopipe Technology