Data and text mining: the search for unknown knowns
Geoffrey Bilder
UKSG, 2007
gbilder@crossref.org

"Reports that say that something hasn't happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns -- the ones we don't know we don't know."

The Mining Metaphor

Gold Mining

Diamond Mining

Data Mining

Data Mining: What it isn’t

≠ Information Retrieval

≠ Information Extraction

≠ Information Analysis

Information Retrieval + Information Extraction + Information Analysis

Data Mining: the discovery of new, previously unknown information

And so what is text data mining?

Text Mining

Information Retrieval + Information Extraction + Information Analysis

The crucial question for publishers is: “If ‘hiding’ information in unstructured text is a problem, then shouldn’t we be exploring new ways to ‘publish’?”

So how did we get here?

• The word tobacco originates from the Taino Indians.

• There is no I in the word Team.

• The book captured the zeitgeist of the time.

• I am sure that I turned the gas off.

The book captured the <foreign_phrase lang="DE">zeitgeist</foreign_phrase> of the time.

I am <emphasis>sure</emphasis> that I turned the gas off.
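
Once the sentences carry markup like this, a program can pull out what plain text hides. The following is a minimal sketch, assuming the two sentences are wrapped in a hypothetical `<doc>`/`<s>` structure (not on the slide) so that they form one well-formed XML document:

```python
import xml.etree.ElementTree as ET

# The two marked-up sentences from the slide. The <doc> and <s> wrappers
# are assumptions added only to make the fragment well-formed XML.
MARKED_UP = """<doc>
<s>The book captured the <foreign_phrase lang="DE">zeitgeist</foreign_phrase> of the time.</s>
<s>I am <emphasis>sure</emphasis> that I turned the gas off.</s>
</doc>"""


def extract_annotations(xml_text):
    """Return (tag, attributes, text) for every inline element in a sentence."""
    root = ET.fromstring(xml_text)
    found = []
    for sentence in root.iter("s"):
        for element in sentence:  # direct children are the inline annotations
            found.append((element.tag, dict(element.attrib), element.text))
    return found


annotations = extract_annotations(MARKED_UP)
for tag, attrs, text in annotations:
    print(tag, attrs, text)
```

The language of the foreign phrase and the emphasis on "sure" become machine-readable facts instead of something a reader has to infer.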

Semantic Web “Light”

But we can do more...

The web as a database

Title       Author             ISBN-13         Publisher
Labyrinths  Jorge Luis Borges  978-0811200127  New Directions
Hopscotch   Julio Cortazar     978-0394752846  Pantheon
The Aleph   Jorge Luis Borges  978-0140286809  Penguin
...         ...                ...             ...

The Relational Model

Rows represent things

Columns are properties

The book has an author “Jorge Luis Borges”

The thing’s property

Subject Predicate Object
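
The step from table to subject/predicate/object statements can be sketched in a few lines. This is an illustration only: it assumes the ISBN-13 column identifies each book, and the `rows_to_triples` helper is a hypothetical name, not something from the talk.

```python
# Rows of the book table from the slides; each dict is one "thing".
BOOKS = [
    {"Title": "Labyrinths", "Author": "Jorge Luis Borges",
     "ISBN-13": "978-0811200127", "Publisher": "New Directions"},
    {"Title": "Hopscotch", "Author": "Julio Cortazar",
     "ISBN-13": "978-0394752846", "Publisher": "Pantheon"},
    {"Title": "The Aleph", "Author": "Jorge Luis Borges",
     "ISBN-13": "978-0140286809", "Publisher": "Penguin"},
]


def rows_to_triples(rows, key="ISBN-13"):
    """Turn every cell into a (subject, predicate, object) triple.

    Subject: the row's key value (assumed here to be the ISBN-13).
    Predicate: the column name. Object: the cell value.
    """
    triples = []
    for row in rows:
        subject = row[key]
        for column, value in row.items():
            if column != key:
                triples.append((subject, column, value))
    return triples


triples = rows_to_triples(BOOKS)
for triple in triples:
    print(triple)
```

Each row contributes one triple per non-key column, so the three books yield nine statements.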

The book has an author “Jorge Luis Borges”

Subject Predicate Object

URI URI

http://www.amazon.com/isbn/978-0140286809

has an author

http://www.wikipedia.com/borges

RDF: Resource Description Framework
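
As a rough illustration, the statement above can be written as a single line of N-Triples, one of the standard RDF serializations. The slide gives no URI for "has an author", so the Dublin Core `creator` property is used below purely as an assumed stand-in; `to_ntriples` is a hypothetical helper.

```python
# Subject and object URIs exactly as shown on the slide.
SUBJECT = "http://www.amazon.com/isbn/978-0140286809"
OBJECT = "http://www.wikipedia.com/borges"

# Assumption: the slide only says "has an author"; Dublin Core "creator"
# stands in for the predicate URI here.
PREDICATE = "http://purl.org/dc/elements/1.1/creator"


def to_ntriples(subject, predicate, obj):
    """Serialize one all-URI triple as a line of N-Triples."""
    return "<{}> <{}> <{}> .".format(subject, predicate, obj)


line = to_ntriples(SUBJECT, PREDICATE, OBJECT)
print(line)
```

Because all three parts are URIs, statements published on different sites can refer to the same book or the same author without prior coordination.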

Journal A Journal B

Wiki

Blog

Personal Website

OPAC

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?name
WHERE {
  ?x rdf:type foaf:Person .
  ?x foaf:name ?name
}
ORDER BY ?name

SPARQL

http://api.ingentaconnect.com/content/cabi/nrr/latest?format=rss
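
To make the meaning of the SPARQL query on this slide concrete, here is a plain-Python sketch that mimics it over a small in-memory list of triples. It is not a SPARQL engine; the `example.org` URIs and the names in the data are invented for illustration.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

# Hypothetical data: subjects and names are made up for this example.
TRIPLES = [
    ("http://example.org/p1", RDF_TYPE, FOAF_PERSON),
    ("http://example.org/p1", FOAF_NAME, "Julio Cortazar"),
    ("http://example.org/p2", RDF_TYPE, FOAF_PERSON),
    ("http://example.org/p2", FOAF_NAME, "Jorge Luis Borges"),
    ("http://example.org/p3", RDF_TYPE, FOAF_PERSON),
    ("http://example.org/p3", FOAF_NAME, "Jorge Luis Borges"),
    # Has a name but is not typed as foaf:Person, so it must not match.
    ("http://example.org/b1", FOAF_NAME, "Not typed as a person"),
]


def distinct_person_names(triples):
    """Mimic the query: names of all resources typed as foaf:Person."""
    # ?x rdf:type foaf:Person
    people = {s for s, p, o in triples if p == RDF_TYPE and o == FOAF_PERSON}
    # ?x foaf:name ?name  (a set gives DISTINCT)
    names = {o for s, p, o in triples if p == FOAF_NAME and s in people}
    # ORDER BY ?name
    return sorted(names)


names = distinct_person_names(TRIPLES)
print(names)
```

The two triple patterns in the `WHERE` clause share the variable `?x`, which is what the `s in people` check reproduces.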

RSS 1.0

FRBR

Creative Commons

FOAF

Geo

SKOS

The Early Modern Internet

Data Mining = Information retrieval + Information extraction + Information analysis...

With the goal of discovering new, previously unknown information

Text Data Mining = Complex data extraction layer + data mining

Data Mining = Information retrieval + Information extraction + Information analysis...

With the goal of discovering new, previously unknown information
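
The three stages in this definition can be sketched as a toy pipeline. Everything below is an assumption made for illustration: the mini-corpus, the list of known entity names, and the function names are invented, and real systems use far more capable retrieval, extraction, and analysis than keyword matching and pair counting.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical mini-corpus; the sentences are invented for illustration.
CORPUS = [
    "Borges wrote Labyrinths and Borges admired Cervantes.",
    "Cortazar wrote Hopscotch in Paris.",
    "Critics compare Borges with Cortazar when discussing Labyrinths.",
    "The weather was mild in spring.",
]

# Hypothetical list of known entity names used by the extraction step.
KNOWN_ENTITIES = {"Borges", "Cortazar", "Labyrinths", "Hopscotch",
                  "Cervantes", "Paris"}


def retrieve(docs, keyword):
    """Information retrieval: keep documents that mention the keyword."""
    return [d for d in docs if keyword.lower() in d.lower()]


def extract(doc):
    """Information extraction: find known entity names in free text."""
    words = re.findall(r"[A-Za-z]+", doc)
    return {w for w in words if w in KNOWN_ENTITIES}


def analyze(entity_sets):
    """Information analysis: count how often two entities co-occur."""
    counts = Counter()
    for entities in entity_sets:
        counts.update(combinations(sorted(entities), 2))
    return counts


def mine(docs, keyword):
    """Chain the three stages: retrieval, then extraction, then analysis."""
    return analyze(extract(d) for d in retrieve(docs, keyword))


pair_counts = mine(CORPUS, "Borges")
for pair, count in pair_counts.most_common():
    print(pair, count)
```

The extraction step is where unstructured text costs the most effort, which is the "complex data extraction layer" that separates text data mining from data mining over already structured data.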

Why do we publish text?

Thank you

gbilder@crossref.org