Data Science at DBC in 29 slides
Christian Boesgaard
DBC, Team XP
● a few words about DBC
● a very short story about data science at DBC
● some examples of what we do
DBC provides solutions to support the goal of the libraries:
“to encourage enlightenment, education and cultural activities”. (From “lov om biblioteksvirksomhed”, the Danish library act)
National Bibliography
Registration of books, music, AV materials, Internet documents, and articles and reviews in newspapers and magazines
(metadata production, 50+ persons)
DanBib
The union catalogue of the Danish libraries and the infrastructure for interlibrary loans.
(metadata + usage)
bibliotek.dk
Access to all Danish publications and to the holdings of the Danish libraries.
(metadata usage … and user data production)
What Data?
● registration metadata
● some full text docs
● front covers
● loan data
● search data
(And much more)
How it began
(I have been at DBC 10+ years and have a background in distributed systems, applied cryptography, and philosophy)
Stanford CS221
So... AI is not that magical
… And it works
We should really use this!
Automatic metadata assignment for articles
Training set: 136K articles with subject metadata
22K subject terms (95% used 169 times or less)
“København Zoo beskyldes for at udbrede kreationisme”
+- creationisme
-+ 9 darwinisme
++ 12 evolutionsteori
+- formidling
-+ 6 intelligent design
+- kristendom
+- livets oprindelse
-+ 6 religion
++ 8 skabelsen
+- skilte
+- zoologiske haver

“Copenhagen Zoo is accused of advancing creationism”
+- creationism
-+ 9 darwinism
++ 12 evolution theory
+- dissemination
-+ 6 intelligent design
+- christianity
+- origin of life
-+ 6 religion
++ 8 creation
+- signs
+- zoo
Approach
1. bag-of-words + liblinear
2. bag-of-words + k-nearest neighbors
3. paragraph vectors + k-nn
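A rough idea of what approach 2 could look like, as a minimal sketch rather than the system DBC runs. It uses only the Python standard library; the tokeniser, the cosine measure, the voting rule, the function names, and the toy training set are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def bag_of_words(text):
    """Lowercased, whitespace-tokenised term counts (assumed tokeniser)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def suggest_subjects(text, training, k=3):
    """Let the k most similar training articles vote for their subject terms."""
    query = bag_of_words(text)
    neighbours = sorted(
        training,
        key=lambda example: cosine(query, bag_of_words(example[0])),
        reverse=True,
    )[:k]
    votes = Counter(term for _, subjects in neighbours for term in subjects)
    return votes.most_common()


# Hypothetical toy training set: (article text, subject terms).
TRAINING = [
    ("zoo signs explain evolution and darwin", {"evolution theory", "zoo"}),
    ("church debate about creation and religion", {"religion", "creation"}),
    ("darwin and evolution in school teaching", {"evolution theory", "darwinism"}),
    ("election results and democracy", {"elections", "democracy"}),
]

suggestions = suggest_subjects("zoo accused of ignoring darwin and evolution", TRAINING, k=2)
```

The result is a list of (subject term, votes) pairs that an indexer could accept or reject, which fits the assisted-indexing use mentioned on the next slide.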
Works pretty well for assisted indexing and is now an integrated part of the system used for registration.
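Metadata to Metadata
Sometimes, simple is good:
demokrati (democracy) [930]
politiske_forhold (political conditions) 897
politik (politics) 341
historie (history) 243
islam (Islam) 234
valg (elections) 155
ytringsfrihed (freedom of speech) 129
menneskerettigheder (human rights) 117
oprør (uprising) 94
udenrigspolitik (foreign policy) 93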
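One plausible reading of the list above is a plain co-occurrence count: for a given subject term, count how often every other term appears on the same record. The slide does not say how the numbers were produced, so treat the sketch below as an assumption; the function name and the toy records are invented.

```python
from collections import Counter, defaultdict
from itertools import permutations


def cooccurrence(records):
    """For every subject term, count the terms it shares a record with."""
    related = defaultdict(Counter)
    for subjects in records:
        for a, b in permutations(set(subjects), 2):
            related[a][b] += 1
    return related


# Hypothetical records, each reduced to its set of subject terms.
RECORDS = [
    {"demokrati", "politik", "valg"},
    {"demokrati", "politik"},
    {"demokrati", "historie"},
    {"islam", "historie"},
]

related = cooccurrence(RECORDS)
top_for_demokrati = related["demokrati"].most_common(1)
```

Sorting each counter by count gives a "related subjects" list for any term with no model training at all.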
Recommendations
● content-based (metadata)
● collaborative (item-item, loans)

Foucaults Pendul - Umberto Eco
Dronning Loanas mystiske flamme - Umberto Eco
Baudolino - Umberto Eco
Rosens navn - Umberto Eco
Kirkegården i Prag - Umberto Eco
Judasbrevet - Eric Frattini
Skaberens kort - Emilio Calderón
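Item-item collaborative filtering on loans can be sketched as follows: two items are similar when many of the same users have borrowed both. The cosine-style normalisation, the function names, and the loan records are assumptions for illustration; the slides do not say which similarity measure is used.

```python
from collections import defaultdict
from math import sqrt


def item_similarities(loans):
    """Cosine similarity between items, based on their sets of borrowers."""
    borrowers = defaultdict(set)
    for user, item in loans:
        borrowers[item].add(user)
    sims = defaultdict(dict)
    items = list(borrowers)
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            shared = len(borrowers[a] & borrowers[b])
            if shared:
                score = shared / sqrt(len(borrowers[a]) * len(borrowers[b]))
                sims[a][b] = sims[b][a] = score
    return sims


def recommend(item, sims, n=3):
    """The n items most similar to `item`, best first."""
    scores = sims.get(item, {})
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Invented (user, item) loan records; only the titles come from the slide.
LOANS = [
    ("u1", "Foucaults Pendul"), ("u1", "Rosens navn"), ("u1", "Baudolino"),
    ("u2", "Foucaults Pendul"), ("u2", "Rosens navn"),
    ("u3", "Rosens navn"), ("u3", "Judasbrevet"),
]

sims = item_similarities(LOANS)
recommended = recommend("Foucaults Pendul", sims)
```

The pairwise loop is quadratic in the number of items, so this only illustrates the idea; a real catalogue would need a sparse or approximate approach.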
Ranking
● popularity
● personalized (loans/likes/...)
For search results (or recommendations)
Suggestions
● popularity (loans)
● subjects, creator, etc.
E.g. for completion
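Popularity-based completion can be as simple as filtering candidate strings by prefix and ordering them by loan count. This is a minimal sketch under that assumption; the function name and the loan counts are made up.

```python
def complete(prefix, loan_counts, n=5):
    """Suggest up to n candidates starting with `prefix`, most-loaned first."""
    prefix = prefix.casefold()
    matches = [
        (text, count)
        for text, count in loan_counts.items()
        if text.casefold().startswith(prefix)
    ]
    # Most loans first; ties broken alphabetically for stable output.
    matches.sort(key=lambda pair: (-pair[1], pair[0]))
    return [text for text, _ in matches[:n]]


# Hypothetical loan counts per suggestion candidate (title, creator, ...).
LOAN_COUNTS = {"Umberto Eco": 120, "Ulysses": 45, "Min kamp": 300, "Madonna": 80}

suggestions_u = complete("u", LOAN_COUNTS)
suggestions_m = complete("M", LOAN_COUNTS, n=1)
```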
From Lady Gaga to James Joyce
“Enlightenment” ... Not guaranteed
But we can recommend “towards” a curated collection
(based on item-item similarity or P(loan(y)|loan(x)))
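P(loan(y)|loan(x)) can be estimated from loan records as the share of borrowers of x who also borrowed y. Recommending "towards" a curated collection then means only ranking candidates y that belong to that collection. The sketch below assumes this simple counting estimate; the function names, the loan data, and the curated set are hypothetical.

```python
from collections import defaultdict


def loan_probabilities(loans):
    """Estimate P(loan(y) | loan(x)) for every pair borrowed by a common user."""
    items_by_user = defaultdict(set)
    for user, item in loans:
        items_by_user[user].add(item)
    count = defaultdict(int)  # number of users who borrowed x
    pair = defaultdict(int)   # number of users who borrowed both x and y
    for items in items_by_user.values():
        for x in items:
            count[x] += 1
            for y in items:
                if x != y:
                    pair[(x, y)] += 1
    return {(x, y): both / count[x] for (x, y), both in pair.items()}


def recommend_towards(x, curated, probs, n=3):
    """Rank only curated items y by the estimated P(loan(y) | loan(x))."""
    scored = [(probs.get((x, y), 0.0), y) for y in curated if y != x]
    scored = [(p, y) for p, y in scored if p > 0]
    scored.sort(key=lambda entry: (-entry[0], entry[1]))
    return [y for _, y in scored[:n]]


# Invented loan records and curated collection, for illustration only.
LOANS = [
    ("u1", "Born this way"), ("u1", "Min kamp"),
    ("u2", "Born this way"), ("u2", "Min kamp"), ("u2", "Ulysses"),
    ("u3", "Born this way"), ("u3", "MDNA world tour"),
]
CURATED = {"Min kamp", "Ulysses"}

probs = loan_probabilities(LOANS)
towards = recommend_towards("Born this way", CURATED, probs)
```

Note that the estimate is asymmetric: here every borrower of "Ulysses" also borrowed "Born this way", but not the other way round.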
Similarity Paths
Born this way - Lady Gaga (music)
Rasmus Seebach - Rasmus Seebach (music)
In these waters - Mads Langer (music)
De urørlige (movie)
Fasandræberne - Jussi Adler-Olsen (book)
Similarity Paths
Fasandræberne - Jussi Adler-Olsen
Det syvende barn - Erik Valeur
Profeterne i Evighedsfjorden - Kim Leine
Min kamp - Karl Ove Knausgård
På sporet af den tabte tid - Marcel Proust
Fædre og sønner - Ivan Turgenev
Portræt af kunstneren [...] - James Joyce
Ulysses - James Joyce
Similarity Paths
(for the kids...)
Sheik Yerbouti - Frank Zappa
Aladdin Sane - David Bowie
The red shoes - Kate Bush
MDNA world tour - Madonna
Lotus - Christina Aguilera
Born this way - Lady Gaga
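The slides do not say how the paths are computed. One way to get something similar is to treat "most similar items" as edges in a graph and search for a shortest chain from a start item to a target item. The breadth-first search below is a sketch under that assumption; the function name and the neighbour lists are invented and only borrow titles from the slides.

```python
from collections import deque


def similarity_path(start, goal, neighbours):
    """Shortest chain of 'similar item' hops from start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbours.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Hypothetical "most similar items" lists; the edges are made up.
NEIGHBOURS = {
    "Born this way": ["Lotus", "Rasmus Seebach"],
    "Lotus": ["MDNA world tour"],
    "Rasmus Seebach": ["In these waters"],
    "In these waters": ["Ulysses"],
}

path = similarity_path("Born this way", "Ulysses", NEIGHBOURS)
```

Because the neighbour lists are directed, a path may exist in one direction and not in the other.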
The End
Christian Boesgaard
Team XP
What we use (for “data science”)
Python: SciPy stack, scikit-learn, gensim, Tornado.
Kafka, MongoDB, Solr.
(Java, R)