Data and text mining: the search for unknown knowns. Geoffrey Bilder, UKSG, 2007. gbilder@crossref.org
"Reports that say that something hasn't happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns -- the ones we don't know we don't know."
The Mining Metaphor
Gold Mining
Diamond Mining
Data Mining
Data Mining: What it isn't
Information Retrieval
Information Extraction
Information Analysis
Information Retrieval + Information Extraction + Information Analysis
Data Mining: new, previously unknown information
And so what is text data mining?
Text Mining
Information Retrieval + Information Extraction + Information Analysis
The crucial question for publishers: if hiding information in unstructured text is a problem, then shouldn't we be exploring new ways to publish?
So how did we get here?
The word tobacco originates from the Taino Indians.
There is no I in the word Team.
The book captured the zeitgeist of the time.
I am sure that I turned the gas off.
Semantic Web Light
But we can do more...
The web as a database
The Relational Model
Title       Author             ISBN-13         Publisher
Labyrinths  Jorge Luis Borges  978-0811200127  New Directions
Hopscotch   Julio Cortazar     978-0394752846  Pantheon
The Aleph   Jorge Luis Borges  978-0140286809  Penguin
...         ...                ...             ...
Rows represent things
Columns are properties
The thing's property
The book   has an author   Jorge Luis Borges
Subject    Predicate       Object
http://www.amazon.com/isbn/978-0140286809   has an author   http://www.wikipedia.com/borges
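The step from the book table to subject-predicate-object statements can be sketched in a few lines of Python. This is only an illustration of the idea on the slides, not something from the talk: the function name `rows_to_triples` and the use of `urn:isbn:` identifiers as subjects are assumptions made for the example.

```python
# Illustrative sketch: each row is a thing, each column is a property.
# The data comes from the table on the slides; the "urn:isbn:" subject
# identifiers and the function name are assumptions for this example.
BOOKS = [
    {"Title": "Labyrinths", "Author": "Jorge Luis Borges",
     "ISBN-13": "978-0811200127", "Publisher": "New Directions"},
    {"Title": "Hopscotch", "Author": "Julio Cortazar",
     "ISBN-13": "978-0394752846", "Publisher": "Pantheon"},
    {"Title": "The Aleph", "Author": "Jorge Luis Borges",
     "ISBN-13": "978-0140286809", "Publisher": "Penguin"},
]


def rows_to_triples(rows, key="ISBN-13"):
    """Emit one (subject, predicate, object) triple per non-key column."""
    triples = []
    for row in rows:
        # The key column identifies the thing the row describes.
        subject = "urn:isbn:" + row[key].replace("-", "")
        for column, value in row.items():
            if column != key:
                triples.append((subject, column, value))
    return triples


triples = rows_to_triples(BOOKS)
for triple in triples:
    print(triple)
```

Once every statement has this shape, statements from different tables, or different websites, can be pooled in one list without agreeing on a shared set of columns first.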
Journal A, Journal B, Wiki, Blog, Personal Website, OPAC
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?name
WHERE {
  ?x rdf:type foaf:Person .
  ?x foaf:name ?name
}
ORDER BY ?name

http://api.ingentaconnect.com/content/cabi/nrr/latest?format=rss
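What the query above asks for can be mimicked in plain Python over a list of triples: find every ?x typed as a foaf:Person, collect its foaf:name values, remove duplicates, and sort. This is a hand-rolled sketch of the matching logic, not a SPARQL engine, and the `example.org` data and the name `person_names` are hypothetical placeholders rather than anything from the talk.

```python
# Sketch of the query's logic in plain Python (not a SPARQL engine).
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

# Hypothetical sample data; the example.org identifiers are placeholders.
TRIPLES = [
    ("http://example.org/borges", RDF_TYPE, FOAF_PERSON),
    ("http://example.org/borges", FOAF_NAME, "Jorge Luis Borges"),
    ("http://example.org/cortazar", RDF_TYPE, FOAF_PERSON),
    ("http://example.org/cortazar", FOAF_NAME, "Julio Cortazar"),
    # Has a name but is not typed as a Person, so it must not match.
    ("http://example.org/aleph", FOAF_NAME, "The Aleph"),
]


def person_names(triples):
    """Return the distinct, sorted names of everything typed foaf:Person."""
    # Pattern 1: ?x rdf:type foaf:Person
    people = {s for s, p, o in triples if p == RDF_TYPE and o == FOAF_PERSON}
    # Pattern 2: ?x foaf:name ?name, joined on the same ?x.
    # A set gives DISTINCT; sorted() gives ORDER BY ?name.
    names = {o for s, p, o in triples if p == FOAF_NAME and s in people}
    return sorted(names)


print(person_names(TRIPLES))
```

The shared variable ?x is what joins the two patterns; in the sketch that join is the `s in people` test.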
The Early Modern Internet
Data Mining = Information retrieval + Information extraction + Information analysis ...
With the goal of discovering new, previously unknown information
Text Data Mining = Complex data extraction layer + data mining
Why do we publish text?
Thank You
gbilder@crossref.org
Standby Slide
Text Mining vs Data Mining
Assumption that text and data have to be two separate things.???
The OTMI repository (on http://www.nature.com/) currently hosts 2 years (2005, 2006) worth of content for 5 journals:
* Nature (nature)
* Nature Genetics (ng)
* Nature Reviews Drug Discovery (nrd)
* Nature Structural
SKOS = Simple Knowledge Organization System
Early modern period: about 342 years from Gutenberg to the French Revolution
A span of 365 years from Gutenberg to the steam press
Elsevier (~1580)
OUP (~1586)
Incunabula considered to end ~1501