KBB: A Knowledge-Bundle Builder for Research Studies
Knowledge Discovery and Dissemination (KDD) Program
  - IARPA-BAA-09-10
  - Question Period: 22 Dec 09 – 2 Feb 10
  - Proposal Due Date: 16 Feb 10

Issues
  - Information Extraction / Annotation / Wrapper Generation
  - Wide Variety of Data Sets
  - (Possibly) Large Data Sets
  - (Possibly) Numerous Data Sets
  - Alignment of Data Sets
  - Schema Mapping & Schema Integration
  - Data Cleaning and Integration
  - Advanced Analytic Algorithms / Query / Reasoning
  - Performance

Unifying Solution Theme
  - Knowledge Bundles (KBs) ~ discovered/extracted/annotated knowledge organized for dissemination/query/analysis
    - Either actual or virtual, or a combination
    - Queries, reasoning, algorithmic analysis, data mining
    - Queries & reasoning should always work immediately, based on a library of extraction ontologies, ontology snippets, and instance recognizers
    - Pay as you go: greater organization, more extraction, improved analysis based on just doing the KDD work
  - Knowledge-Bundle Builder (KBB)
    - Knowledge begets knowledge (KBs as extraction ontologies)
    - Fully automatic KBB tools
    - Semi-automatic KBB tools

Many Applications
  - Business planning and decision making
  - Scientific research studies
  - Purchase of large-ticket items
  - Genealogy and family history
  - Web of Knowledge
    - Interconnected KBs superimposed over a web of pages
    - Yahoo's Web of Concepts initiative [Kumar et al., PODS09]
  - And Intelligence Gathering and Analysis

Notes: Not just bio-research studies. I'll draw from some of these as I further explain KBs and KBBs. (Switching applications because 1. Cui is not here to explain the medical biology & 2. it's not implemented, but most of what I will present is implemented, although not fully integrated and not fully working as well as it should for the system to be commercialized.)

Prior Research (outline for next part of presentation)
  - Formalization of Ideas
  - Query Processing and Reasoning
    - AskOntos / SerFR
    - GenWoK
  - Extraction and Annotation (Semi- & Un-structured Sources)
    - OntoES
    - FOCIH, TANGO, TISP
    - NER
  - Reverse Engineering (Structured Sources)
    - RDB, XML, OWL
    - Nested Tables
  - Semantic Integration
    - Multifaceted mappings (including mappings based on OntoES)
    - Direct and indirect mappings
    - Semantic enrichment for integration (e.g., MOGO)

Notes: Explain formal framework and semi-automatic creation in the rest of the presentation. Tie into ACM-L.

KB Formalization
KB: a 7-tuple (O, R, C, I, D, A, L)
  - O: Object sets (one-place predicates)
  - R: Relationship sets (n-place predicates)
  - C: Constraints (closed formulas)
  - I: Interpretations (predicate calc. models for (O, R, C))
  - D: Deductive inference rules (open formulas)
  - A: Annotations (links from KB to source documents)
  - L: Linguistic groundings (data frames) to enable:
    - high-precision document filtering
    - automatic annotation
    - free-form query processing
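
As a rough illustration of the 7-tuple, the components could be held in a single record. This is only a sketch: the class name `KnowledgeBundle`, its field names, the `recognize` helper, the sample constraint text, and the date pattern are all assumptions made for illustration and are not part of the KBB tools.

```python
from dataclasses import dataclass, field
import re


@dataclass
class KnowledgeBundle:
    """Illustrative container for the 7-tuple (O, R, C, I, D, A, L)."""
    object_sets: set = field(default_factory=set)          # O: one-place predicates
    relationship_sets: dict = field(default_factory=dict)  # R: name -> arity
    constraints: list = field(default_factory=list)        # C: closed formulas (as text)
    interpretation: dict = field(default_factory=dict)     # I: predicate -> set of tuples
    rules: list = field(default_factory=list)              # D: inference rules (as text)
    annotations: dict = field(default_factory=dict)        # A: fact -> source location
    groundings: dict = field(default_factory=dict)         # L: object set -> regex

    def recognize(self, text):
        """Return (object set, matched string) pairs found via the groundings."""
        hits = []
        for name, pattern in self.groundings.items():
            hits.extend((name, m.group(0)) for m in re.finditer(pattern, text))
        return hits


# Hypothetical content, chosen only to exercise the structure.
kb = KnowledgeBundle(
    object_sets={"Person", "BirthDate"},
    relationship_sets={"Person-BirthDate": 2},
    constraints=["each Person has at most one BirthDate"],
    groundings={"BirthDate": r"\b(?:19|20)\d{2}-\d{2}-\d{2}\b"},
)
hits = kb.recognize("Hypothetical text: born 1920-05-04.")
```

The linguistic groundings are what let the same structure double as an extraction ontology: the recognizers in L find candidate instances in text, which can then populate I and be linked back to their sources through A.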

KB: (O, R, C, …)
KB: (O, R, C, …, L)
KB: (O, R, C, I, …, A, L)
KB: (O, R, C, I, D, A, L)
Age(x) :- ObituaryDate(y), BirthDate(z), AgeCalculator(x, y, z)

Notes: Another reasoning possibility to point out: "Thursday", which does not have a specific date attached to it, can be reasoned about to conclude that it must be March 12, 1998.
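
The Age rule and the weekday reasoning can be sketched in a few lines. The function names and the sample facts below are assumptions for illustration. In particular, the obituary date of March 14, 1998 and the birth date are hypothetical, and resolving "Thursday" to the most recent Thursday on or before the obituary date is one plausible reading, not necessarily how the actual system reasons.

```python
from datetime import date, timedelta


def age_calculator(obituary_date, birth_date):
    """AgeCalculator: whole years between the birth date and the obituary date."""
    years = obituary_date.year - birth_date.year
    if (obituary_date.month, obituary_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def derive_age(facts):
    """Apply Age(x) :- ObituaryDate(y), BirthDate(z), AgeCalculator(x, y, z)."""
    if "ObituaryDate" in facts and "BirthDate" in facts:
        return age_calculator(facts["ObituaryDate"], facts["BirthDate"])
    return None  # the rule does not fire without both facts


def resolve_weekday(weekday, not_after):
    """Most recent date on or before `not_after` that falls on `weekday` (Monday = 0)."""
    return not_after - timedelta(days=(not_after.weekday() - weekday) % 7)


# Hypothetical facts; these dates are assumptions, not values from the slides.
facts = {"ObituaryDate": date(1998, 3, 14), "BirthDate": date(1920, 5, 4)}
age = derive_age(facts)
thursday = resolve_weekday(3, facts["ObituaryDate"])  # 3 = Thursday
```

Under these assumed facts the rule yields an age of 77, and "Thursday" resolves to March 12, 1998.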

KB Query

KB Reasoning

Notes: Screenshots from CW's thesis.

Free-form Query Processing with Annotated Results

Notes: We are working toward KBs and KBBs.

KBB: (Semi)-Automatically Building KBs
  - OntologyEditor (manual; gives full control)
  - FOCIH (semi-automatic)
  - TANGO (semi-automatic)
  - TISP (fully automatic)
  - NER (Named-Entity Recognition research)
Ontology Editor
FOCIH: Form-based Ontology Creation and Information Harvesting

TANGO: Table ANalysis for Generating Ontologies

fleck velter     gonsity (ld/gg)     hepth (gd)
burlam           1.2                 120
falder           2.3                 230
multon           2.5                 400

repeat:
    understand table
    generate mini-ontology
    match with growing ontology
    adjust & merge
until ontology developed
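
The TANGO loop can be sketched as a fold over tables. Everything here is a toy stand-in: `generate_mini_ontology` treats column labels as concepts, `merge` matches concepts by identical name, and the second table is hypothetical. The real table understanding and ontology matching steps are far richer than this.

```python
def generate_mini_ontology(table):
    """Toy table understanding: column labels become concepts, and the first
    column is related to every other column."""
    header = table["columns"]
    concepts = set(header)
    relationships = {(header[0], other) for other in header[1:]}
    return concepts, relationships


def merge(growing, mini):
    """Toy match-and-merge: concepts with identical names are treated as the same."""
    concepts, relationships = growing
    return concepts | mini[0], relationships | mini[1]


def tango(tables):
    growing = (set(), set())
    for table in tables:                      # repeat
        mini = generate_mini_ontology(table)  # understand table, generate mini-ontology
        growing = merge(growing, mini)        # match with growing ontology, adjust & merge
    return growing                            # until ontology developed


tables = [
    {"columns": ["Location", "Population", "Latitude", "Longitude"]},
    {"columns": ["Location", "Area"]},  # hypothetical second table
]
concepts, relationships = tango(tables)
```

Because both tables share the label Location, the merged result has five concepts and four relationships rather than six and four.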
[Figure: TANGO growing ontology]

TISP: Table Interpretation by Sibling Pages
[Figure labels: Same; Different, Same]

NER: Named-Entity Recognition

Notes: Automated extraction is critical. OpenDMAP for biology.

Reverse Engineering from Structured Sources
  - Transformation from source (?) to target (O, R, C, I, …)
    - Information Preserving
    - Constraint Preserving
  - Structured source
    - Predicates and constraints formalized in some way
    - Examples: RDB, XML, OWL, Nested Forms

RDB Reverse Engineering
Theorem. Let S be a relational database with its schema restricted as follows: (1) the only declared constraints are single-attribute primary-key constraints and single-attribute foreign-key constraints, (2) every relation schema has a primary key, (3) all foreign keys reference only primary keys and have the same name as the primary key they reference, (4) except for attributes referencing foreign keys, all attribute names are unique throughout the entire database schema, (5) all relation schemas are in 3NF. Let T be an OSM-O model instance. A transformation from S to T exists that preserves information and constraints.

C-XML: Conceptual XML
XML Schema to C-XML

Notes: In general, reverse engineering from any structured schema.

OWL to OSM
  - Yihong's Converter Code

Nested Table Reverse Engineering via TISP
Theorem. Let S be a nested table with a single label path to each data item, and let T be an OSM-O model instance. A transformation from S to T exists that preserves information and constraints.

Semantic Integration
  - Schema Mapping
    - Direct & Indirect
    - Use of extraction ontologies
  - Semantic Enhancement for Integration
    - Semantics of many sources abstracted away
    - Alignment with global community knowledge
    - WordNet
    - Data-frame library

Multi-faceted Schema Mapping
  - Central Idea: Exploit All Data & Metadata
  - Matching Possibilities (Facets)
    - Attribute Names
    - Data-Value Characteristics
    - Expected Data Values (use of extraction ontologies)
    - Data-Dictionary Information
    - Structural Properties

Example
[Figure: Source Schema S with labels Car, Year, Make, Model, Cost, Style, Feature, Mileage, Phone; Target Schema T with labels Car, Make, Model, Year, Miles, Mileage]

Individual Facet Matching
  - Attribute Names
  - Data-Value Characteristics
  - Expected Data Values

Attribute Names
  - Target and Source Attributes: T : A, S : B
  - WordNet
  - C4.5 Decision Tree: feature selection, trained on schemas in DB books
    - f0: same word
    - f1: synonym
    - f2: sum of distances to a common hypernym root
    - f3: number of different common hypernym roots
    - f4: sum of the number of senses of A and B

WordNet Rule
  - The number of different common hypernym roots of A and B
  - The sum of distances of A and B to a common hypernym
  - The sum of the number of senses of A and B

Confidence Measures

Data-Value Characteristics
  - C4.5 Decision Tree Features
    - Numeric data (mean, variation, standard deviation, …)
    - Alphanumeric data (string length, numeric ratio, space ratio)

Confidence Measures
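
The data-value features named on the Data-Value Characteristics slide could be computed roughly as follows. This is a sketch under assumptions: "variation" is read here as population variance, "string length" as average length, and the function names and sample values are hypothetical. The decision-tree training itself is not shown.

```python
import statistics


def numeric_features(values):
    """Features for a column of numeric data."""
    return {
        "mean": statistics.mean(values),
        "variance": statistics.pvariance(values),  # assumed meaning of "variation"
        "std_dev": statistics.pstdev(values),
    }


def alphanumeric_features(values):
    """Features for a column of alphanumeric strings."""
    total_chars = sum(len(v) for v in values)
    return {
        "avg_length": total_chars / len(values),
        "numeric_ratio": sum(c.isdigit() for v in values for c in v) / total_chars,
        "space_ratio": sum(c == " " for v in values for c in v) / total_chars,
    }


# Hypothetical sample columns.
num = numeric_features([2.0, 4.0, 6.0])
alpha = alphanumeric_features(["AB 12", "C 3"])
```

Two attributes whose value columns produce similar feature vectors are candidates for a match, which is the signal the decision tree is trained to pick up.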

Expected Data Values
  - Target Schema T and Source Schema S
    - Regular-expression recognizer for attribute A in T
    - Data instances for attribute B in S
  - Hit Ratio = N'/N for (A, B) match
    - N': number of B data instances recognized by the regular expressions of A
    - N: number of B data instances
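
The hit ratio is simple enough to sketch directly. The recognizer pattern and the source values below are hypothetical, and treating "recognized" as a full-string match is an assumption. A real data frame may use several expressions per attribute along with context keywords.

```python
import re


def hit_ratio(recognizer, source_values):
    """N'/N: fraction of source values fully matched by the target's recognizer."""
    if not source_values:
        return 0.0
    pattern = re.compile(recognizer)
    recognized = sum(1 for v in source_values if pattern.fullmatch(v))  # N'
    return recognized / len(source_values)                              # N'/N


YEAR_RECOGNIZER = r"(?:19|20)\d{2}"  # hypothetical recognizer for a target attribute Year
source_year = ["1998", "2003", "1987", "n/a"]
source_phone = ["555-1234", "555-9876"]

year_vs_year = hit_ratio(YEAR_RECOGNIZER, source_year)
year_vs_phone = hit_ratio(YEAR_RECOGNIZER, source_phone)
```

Here the Year recognizer accepts three of the four source Year values (0.75) and none of the Phone values (0.0), so the (Year, Year) pair gets the higher confidence on this facet.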
Confidence Measures
Combined Measures
Threshold: 0.5
[Figure: matrix of 0/1 entries]

Final Confidence Measures

Direct & Indirect Schema Mappings
[Figure: Source schema with labels Car, Year, Cost, Style, Feature, Phone; Target schema with labels Car, Miles, Mileage, Model, Make, Make&Model, Color, Body Type]

Mapping Generation
  - Direct Matches as described earlier:
    - Attribute Names based on WordNet
    - Value Characteristics based on value lengths, averages, …
    - Expected Values based on regular-expression recognizers
  - Indirect Matches:
    - 1-n, n-1, or n-m based on direct matches
    - Structure Evaluation
      - Union
      - Selection
      - Decomposition
      - Composition
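
The four kinds of indirect match can be illustrated on plain dictionaries. This is a sketch of the general idea only: the function names, the sample records, and the attribute pairings (for example DayPhone and EveningPhone, or which values count as colors) are hypothetical and do not reproduce the mappings drawn in the slides' figures.

```python
def compose(record, parts, target, sep=" "):
    """Composition: several source attributes form one target attribute."""
    return {target: sep.join(record[p] for p in parts)}


def decompose(record, source, targets, sep=" "):
    """Decomposition: one source attribute is split across several target attributes."""
    return dict(zip(targets, record[source].split(sep, len(targets) - 1)))


def union(records, sources, target):
    """Union: values of several source attributes all map to one target attribute."""
    return {target: [r[s] for r in records for s in sources if s in r]}


def select(values, target, predicate):
    """Selection: only the source values satisfying a predicate map to the target."""
    return {target: [v for v in values if predicate(v)]}


# Hypothetical data.
composed = compose({"Make": "Ford", "Model": "Focus"}, ["Make", "Model"], "Make&Model")
decomposed = decompose({"Make&Model": "Ford Focus"}, "Make&Model", ["Make", "Model"])
phones = union([{"DayPhone": "555-1234"}, {"EveningPhone": "555-9876"}],
               ["DayPhone", "EveningPhone"], "Phone")
colors = select(["red", "sedan", "blue"], "Color", lambda v: v in {"red", "blue"})
```

In practice the selection predicate would come from an expected-values recognizer for the target attribute, not from a hard-coded set.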

Union and Selection
[Figure: Source schema (Car, Year, Cost, Style, Feature, Phone) and Target schema (Car, Miles, Mileage, Model, Make, Make&Model, Color, Body Type)]

Decomposition and Composition
[Figure: Source schema (Car, Year, Cost, Style, Feature, Phone) and Target schema (Car, Miles, Mileage, Model, Make, Make&Model, Color, Body Type)]

Semantic Enrichment (e.g., MOGO)

fleck velter     gonsity (ld/gg)     hepth (gd)
burlam           1.2                 120
falder           2.3                 230
multon           2.5                 400

TANGO repeatedly turns raw tables into conceptual mini-ontologies and integrates them into a growing ontology.
[Figure: growing ontology]

MOGO (Mini-Ontology GeneratOr) generates mini-ontologies from interpreted tables.

Sample Input

Region and State Information
Location         Population (2000)   Latitude   Longitude
Northeast        2,122,869
  Delaware       817,376             45         -90
  Maine          1,305,493           44         -93
Northwest        9,690,665
  Oregon         3,559,547           45         -120
  Washington     6,131,118           43         -120

Sample Output
[Figure: mini-ontology generated from the sample table]

Notes: Explain what is "mini" about the ontology. Walk the user through the mini-ontology to explain what it tells us about the concepts/relationships.

Concept/Value Recognition
  - Lexical Clues
    - Labels as data values
    - Data value assignment
  - Data Frame Clues
    - Labels as data values
    - Data value assignment
  - Default
    - Recognize concepts and values by syntax and layout

Concepts and Value Assignments
  - Location, Region, State: Northeast, Northwest, Delaware, Maine, Oregon, Washington
  - Population: 2,122,869; 817,376; 1,305,493; 9,690,665; 3,559,547; 6,131,118
  - Latitude: 45, 44, 45, 43
  - Longitude: -90, -93, -120, -120
  - [Figure labels: Year; 2002, 2003; 2000]

Relationship Discovery
  - Dimension Tree Mappings
  - Lexical Clues
    - Generalization/Specialization
    - Aggregation
  - Data Frames
  - Ontology Fragment Merge

Constraint Discovery
  - Generalization/Specialization
  - Computed Values
  - Functional Relationships
  - Optional Participation

Region and State Information
Location         Population (2000)   Latitude   Longitude
Northeast        2,122,869
  Delaware       817,376             45         -90
  Maine          1,305,493           44         -93
Northwest        9,690,665
  Oregon         3,559,547           45         -120
  Washington     6,131,118           43         -120
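
For the sample table, two of the constraint kinds (computed values and functional relationships) can be checked mechanically. The sketch below uses the table's own numbers. The row encoding and the function names are assumptions for illustration, not MOGO's actual algorithm.

```python
rows = [
    # (region, state, population, latitude, longitude); state None marks a region total
    ("Northeast", None, 2_122_869, None, None),
    ("Northeast", "Delaware", 817_376, 45, -90),
    ("Northeast", "Maine", 1_305_493, 44, -93),
    ("Northwest", None, 9_690_665, None, None),
    ("Northwest", "Oregon", 3_559_547, 45, -120),
    ("Northwest", "Washington", 6_131_118, 43, -120),
]


def computed_value_holds(rows):
    """True if every region's population equals the sum of its states' populations."""
    totals, sums = {}, {}
    for region, state, population, _, _ in rows:
        if state is None:
            totals[region] = population
        else:
            sums[region] = sums.get(region, 0) + population
    return totals == sums


def is_functional(pairs):
    """True if each left-hand value maps to exactly one right-hand value."""
    seen = {}
    return all(seen.setdefault(lhs, rhs) == rhs for lhs, rhs in pairs)


states = [r for r in rows if r[1] is not None]
region_sums_ok = computed_value_holds(rows)
state_to_latitude = is_functional((r[1], r[3]) for r in states)
latitude_to_state = is_functional((r[3], r[1]) for r in states)
```

On this data the region populations are sums of their states, State functionally determines Latitude, and Latitude does not determine State (both Delaware and Oregon are listed at 45). A small sample can only refute a functional relationship, never confirm it, so a discovered dependency is best treated as a hypothesis.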

Notes: Explain how functional dependencies are found (better: values in the original table are functionally determined).

Ontology Workbench: Prototype Development Tool

Notes: We are working toward KBs and KBBs.

Case Study: Knowledge Bundles for Bio-Research
  - Problem: locate, gather, organize data
  - Solution: semi-automatically create KBs with KBBs
  - KBs
    - Conceptualized data + reasoning and provenance links
    - Linguistically grounded & thus extraction ontologies
  - KBBs
    - KB Builder tool set
    - Actively learns to build KBs

Notes: What's my take-home message? KBs and KBBs can play a significant role in helping researchers locate, gather, and organize information for research studies. As an example, this may well be the essence of what ACM-L means: conceptual modeling (CM) is the foundation for KBs and active learning (A-L) is the foundation for KBBs. Emphasize here and later. (What must listeners understand to be able to take the message home?)

Research Study: Objective and Task
  - Objective: study the association of:
    - TP53 polymorphism and
    - Lung cancer
  - Task: locate, gather, organize data from:
    - Single Nucleotide Polymorphism database
    - Medical journal articles
    - Medical-record database

Notes: It doesn't matter whether it is the essence of ACM-L or not; it may be better to pitch it as one possible way to achieve Active Conceptual Modeling for Learning (ACM-L).

Gather SNP Information from the NCBI dbSNP Repository
  - SNP: Single Nucleotide Polymorphism
  - NCBI: National Center for Biotechnology Information

Notes: Explain how FOCIH works. Also, how it will work when we add a filtering mechanism (e.g., minor allele frequency > 1%).

Search PubMed Literature
  - PubMed: search-engine access to life sciences and biomedical scientific journal articles

Notes: Works by linguistically grounding an extraction ontology (e.g., people in the bioinformatics community may know about OpenDMAP).

Reverse-Engineer Human Subject Information from INDIVO
  - INDIVO: personally controlled health record system

Add Annotated Images
[Figure: Radiology Report (John Doe, July 19, 12:14 pm)]

Query and Analyze Data in Knowledge Bundle (KB)

Research to Accomplish
  - Build Unified Prototype
    - Integrate projects
    - Enhance/Add KBB tools
    - Create Knowledge Repository
      - Data-frame recognizers
      - Ontology snippets
      - Extraction ontologies (both developed & developing)
    - Develop user interface
  - Allow for virtual KBs
  - Add/Develop analysis tools & data mining tools
  - Resolve performance issues
    - Decidability & tractability of basic algorithms
    - Architecture for web-scale system

Issue Resolution (Summary)
  - Wide variety of data sets
    - General references: the Web? CIA World Factbook? ... (OntoES, FOCIH, TISP)
    - Free-running text: news, technical journals (WePS, [Embley09], Ancestry.com)
    - Geospatial data ([Embley89b])
    - Entity databases (RelDB [Embley97], XML [Al-Kamha07, Al-Kamha08], IMS = hierarchical [Mok06, Mok10], Network = graph = OSM, OWL [Ding-converter])
    - Reports (filled-in forms and semi-structured data [Tao09, Liddle99], TANGO)
    - And more (Attensity?)
  - Large and numerous data sets (extension to large and additional types; performance)
  - Alignment of data models (TANGO)
  - Schema mapping ([Xu03, Xu06], …)
  - Data integration ([Biskup03])
  - Semantic enrichment (MOGO)
  - Advanced analytic algorithms (Giraud-Carrier: knowledge-based semantic distance, record linkage, and hybrid social networks; best-effort, quick answers [Zitzelberger thesis])
  - Performance ([Al-Muhammed07b], IS/Liddle, Attensity?)

Vision: KBs & KBBs for Knowledge Discovery and Dissemination
  - Custom harvesting of information into KBs
  - KB creation via a KBB
    - Semi-automatic: shifts harvesting burden to machine
    - Synergistic: works without intrusive overhead
    - Actively learns as it goes & improves with experience
  - Resolve challenging research issues
    - KB/KBB prototype
    - Semantic integration
    - Analysis & data mining tools
    - Performance issues (including virtual KBs, large & diverse source repositories, quick construction & immediate usage)

www.deg.byu.edu

[Figure: dimension trees for the table titled "Region and State Information" (2000). Location tree labels: Northeast, Northwest, Washington, Maine, Oregon, Delaware. [Dimension2] tree labels: Longitude, Latitude, Population. Sample values shown: 2,122,869; -120; 817,376.]