Data and text mining: the search for unknown knowns

DESCRIPTION

Data and text mining: the search for unknown knowns. Geoffrey Bilder, UKSG, 2007. [email protected]

  • Data and text mining: the search for unknown knowns. Geoffrey Bilder, UKSG, 2007. [email protected]

  • "Reports that say that something hasn't happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns -- the ones we don't know we don't know."

  • The Mining Metaphor

  • Gold Mining

  • Diamond Mining

  • Data Mining

  • Data Mining: What it isn't

  • Information Retrieval

  • Information Extraction

  • Information Analysis

  • Information Retrieval + Information Extraction + Information Analysis

  • Data Mining: new, previously unknown information

  • And so what is text data mining?

  • Text Mining

  • Information Retrieval + Information Extraction + Information Analysis

  • The crucial question for publishers is: if hiding information in unstructured text is a problem, then shouldn't we be exploring new ways to publish?

  • So how did we get here?

  • The word tobacco originates from the Taino Indians.
    There is no I in the word Team.
    The book captured the zeitgeist of the time.
    I am sure that I turned the gas off.

  • Semantic Web Light

  • But we can do more...

  • The web as a database

  • The Relational Model

    Title        Author             ISBN-13         Publisher
    Labyrinths   Jorge Luis Borges  978-0811200127  New Directions
    Hopscotch    Julio Cortazar     978-0394752846  Pantheon
    The Aleph    Jorge Luis Borges  978-0140286809  Penguin
    ...          ...                ...             ...

  • Rows represent things

  • Columns are properties

  • The thing's property: "The book has an author Jorge Luis Borges"
    Subject: the book. Predicate: has an author. Object: Jorge Luis Borges.

  • "The book has an author Jorge Luis Borges"
    Subject: the book. Predicate: has an author. Object: Jorge Luis Borges.

  • http://www.amazon.com/isbn/978-0140286809 has an author http://www.wikipedia.com/borges
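The step from table to triples can be sketched in a few lines of Python. This is a minimal illustration, not anything shown in the talk: the rows and column names come from the table on the slides, while the `urn:isbn:` subject identifiers and the function name `rows_to_triples` are assumptions made for the example.

```python
# Rows and column names are taken from the slide's table.
books = [
    {"Title": "Labyrinths", "Author": "Jorge Luis Borges",
     "ISBN-13": "978-0811200127", "Publisher": "New Directions"},
    {"Title": "Hopscotch", "Author": "Julio Cortazar",
     "ISBN-13": "978-0394752846", "Publisher": "Pantheon"},
    {"Title": "The Aleph", "Author": "Jorge Luis Borges",
     "ISBN-13": "978-0140286809", "Publisher": "Penguin"},
]


def rows_to_triples(rows, key="ISBN-13"):
    """Turn each row (a thing) into one (subject, predicate, object)
    triple per remaining column (a property of that thing)."""
    triples = []
    for row in rows:
        # Assumption: identify each book by a urn:isbn: style identifier.
        subject = f"urn:isbn:{row[key]}"
        for column, value in row.items():
            if column != key:
                triples.append((subject, column, value))
    return triples


triples = rows_to_triples(books)
for triple in triples:
    print(triple)
```

Each row yields three triples here, one per non-key column, so the same facts survive outside the table that originally held them.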

  • Journal A, Journal B, Wiki, Blog, Personal Website, OPAC

  • PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT DISTINCT ?name
    WHERE {
      ?x rdf:type foaf:Person .
      ?x foaf:name ?name
    }
    ORDER BY ?name

    http://api.ingentaconnect.com/content/cabi/nrr/latest?format=rss
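To make the query's meaning concrete, here is a rough Python equivalent that evaluates the same two patterns over an in-memory list of triples. It is only a sketch: the sample data and the `http://example.org/` identifiers are invented, and a real deployment would use a SPARQL engine rather than hand-written matching.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

# Made-up sample data; the example.org identifiers are hypothetical.
sample_triples = [
    ("http://example.org/borges", RDF_TYPE, FOAF_PERSON),
    ("http://example.org/borges", FOAF_NAME, "Jorge Luis Borges"),
    ("http://example.org/cortazar", RDF_TYPE, FOAF_PERSON),
    ("http://example.org/cortazar", FOAF_NAME, "Julio Cortazar"),
    # Has a name but is not typed as a Person, so it must not match.
    ("http://example.org/labyrinths", FOAF_NAME, "Labyrinths"),
    # Repeated statement, to show the effect of DISTINCT.
    ("http://example.org/borges", FOAF_NAME, "Jorge Luis Borges"),
]


def person_names(triples):
    """Mimic: SELECT DISTINCT ?name WHERE { ?x rdf:type foaf:Person .
    ?x foaf:name ?name } ORDER BY ?name"""
    # First pattern: every ?x that is typed as a foaf:Person.
    people = {s for (s, p, o) in triples if p == RDF_TYPE and o == FOAF_PERSON}
    # Second pattern: names of those same ?x; the set gives DISTINCT.
    names = {o for (s, p, o) in triples if p == FOAF_NAME and s in people}
    # ORDER BY ?name
    return sorted(names)


print(person_names(sample_triples))
```

The shared variable `?x` is what joins the two patterns; in the sketch that join is the `s in people` check.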

  • The Early Modern Internet

  • Data Mining = Information retrieval + Information extraction + Information analysis...
    With the goal of discovering new, previously unknown information

  • Text Data Mining = Complex data extraction layer + data mining
    Data Mining = Information retrieval + Information extraction + Information analysis...
    With the goal of discovering new, previously unknown information
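As a toy illustration of the three-part sum above, the sketch below chains a retrieval step, an extraction step and an analysis step over a handful of sentences. Everything in it is an assumption for the sake of the example: the corpus is made up, the term list `VOCABULARY` is hypothetical, and counting term co-occurrence stands in for whatever analysis a real system would perform.

```python
import re
from collections import Counter
from itertools import combinations

# Made-up corpus of unstructured text.
documents = [
    "Aspirin reduces fever and inflammation.",
    "Fever is a common symptom of influenza.",
    "Aspirin is not recommended for children with influenza.",
    "The library acquired three new journals.",
]

# Hypothetical list of terms we know how to recognise.
VOCABULARY = {"aspirin", "fever", "inflammation", "influenza"}


def extract(doc, vocabulary):
    """Information extraction: pull known terms out of free text."""
    words = set(re.findall(r"[a-z]+", doc.lower()))
    return sorted(words & vocabulary)


def retrieve(docs, vocabulary):
    """Information retrieval: keep documents mentioning any known term."""
    return [doc for doc in docs if extract(doc, vocabulary)]


def analyse(docs, vocabulary):
    """Information analysis: count how often two terms share a document."""
    pairs = Counter()
    for doc in docs:
        pairs.update(combinations(extract(doc, vocabulary), 2))
    return pairs


relevant = retrieve(documents, VOCABULARY)
cooccurrence = analyse(relevant, VOCABULARY)
for pair, count in sorted(cooccurrence.items()):
    print(pair, count)
```

Most of the work sits in `extract`, which is the "complex data extraction layer" that text mining has to add before ordinary data mining can start.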

  • Why do we publish text?

  • Thank you. [email protected]

    Standby Slide

    Text Mining vs Data Mining
    Assumption that text and data have to be two separate things.
    ???

    The OTMI repository (on http://www.nature.com/) currently hosts 2 years (2005, 2006) worth of content for 5 journals:

    * Nature (nature)
    * Nature Genetics (ng)
    * Nature Reviews Drug Discovery (nrd)
    * Nature Structural

    SKOS = Simple Knowledge Organization System

    Early modern period: about 342 years from Gutenberg to the French Revolution
    A span of 365 years from Gutenberg to the steam press

    Elsevier (~1580)
    OUP (~1586)
    Incunabula considered to end ~1501