NoSQL technologies from an STM publishing perspective Bradley P. Allen, Elsevier Labs Presentation at NoSQL Now 2011 San Jose, CA, USA 2011-08-25


DESCRIPTION

The disruption of traditional scholarly publishing by the Internet is leading to a host of changes in the requirements, standards and workflows for content acquisition, production and management technologies in use by scientific, medical and technical (STM) publishers in the 2010s. This presentation will discuss how NoSQL technologies impact these changes by offering a range of solutions that enable the development of next-generation systems for scholarly publishing. A taxonomy of use cases in STM content management will be presented with a mapping into current NoSQL solution categories, with specific reference to three key areas of relevance to publishing: support in NoSQL stores for search, semantic enhancement of content, and text analytics. The discussion will be illustrated with work in this area currently underway at Elsevier.


Page 1: NoSQL technologies from an STM publishing perspective

NoSQL technologies from an STM publishing perspective

Bradley P. Allen, Elsevier Labs

Presentation at NoSQL Now 2011

San Jose, CA, USA

2011-08-25

Page 2: NoSQL technologies from an STM publishing perspective

Peak physical media: is it here?

• “Music Sales”, New York Times, 1 August 2009. http://www.nytimes.com/imagepages/2009/08/01/opinion/01blow.ready.html

• “Initial Circs per student”, William Denton, 31 January 2011. http://www.miskatonic.org/2011/01/31/initial-circs-student

• “Rise of e-book Readers to Result in Decline of Book Publishing Business”, Steven Mather, iSuppli, 28 April 2011. http://www.isuppli.com/Home-and-Consumer-Electronics/News/Pages/Rise-of-e-book-Readers-to-Result-in-Decline-of-Book-Publishing-Business.aspx

Page 3: NoSQL technologies from an STM publishing perspective

In any case, the challenge to STM publishers is clear

• Print revenue is softening

• Online channels are exploding

– Changing the way customers create and consume our content

– Leading to new requirements and market opportunities for online products

Page 4: NoSQL technologies from an STM publishing perspective

Additional challenges in STM publishing

• Academic context and tradition inhibit business model innovation

• Technology and business are traditionally separate concerns

• Acquisitions create content and data silos

• Global market drives lowest-common-denominator technology choices

Page 5: NoSQL technologies from an STM publishing perspective

A simple model of the evolution of STM publishing

Print era: 1600s – 1980

• Packaged as books and journals

• Physically distributed

• Access and discovery through libraries

Digital Library era: 1980 – 2010s

• Packaged as books and journals

• Digitally distributed

• Access and discovery through search engines

Platform-as-a-service era: 2010s

• Packaged as apps

• Digitally distributed

• Access and discovery through social networks


Page 6: NoSQL technologies from an STM publishing perspective

STM publishing use cases in transition

Use case: A new medical term relevant to an emerging healthcare issue (e.g. a new type of avian flu virus) needs to be incorporated into a search index immediately.
Digital Library era: Organizational governance issues about how taxonomies are updated, coupled with manually intensive workflows and ad hoc approaches to content tagging, inhibit rapid response.
Platform-as-a-service era: A single, automated and standardized taxonomy management and content enhancement workflow allows rapid and timely update of search applications.

Use case: Application developers want to mash up epidemiological data with medical journal articles to create a topic-specific Web resource.
Digital Library era: Data silos without easy means of programmatic access by developers, coupled with governance and business model questions, inhibit data reuse.
Platform-as-a-service era: A Content API and single-point-of-access repository allow data and content to be accessed, discovered and reused across multiple applications.

Use case: Digital library developers want to stage content into a single repository for unified search index generation.
Digital Library era: Duplication of core content leads to synchronization and quality control issues.
Platform-as-a-service era: Consolidation of duplicate repositories into a single point of truth across all content, accessible and discoverable through a Content API, eliminates the need for duplication and synchronization.

Use case: Third-party solutions providers want to integrate content (e.g. tagged medical journal articles, medical taxonomies) into point-of-care solutions.
Digital Library era: No standards and no APIs for point-of-care content integration across all content and data.
Platform-as-a-service era: Standards and APIs that scale across multiple partners, for all content types, for all delivery formats.

Use case: Publishers want to deliver their content to tablets and e-readers in delivery formats that take advantage of the displays and interaction modalities on those devices.
Digital Library era: No clear standard or approach for targeting emerging eReader and tablet devices; multiple and divergent approaches lead to siloed solutions and duplication of effort.
Platform-as-a-service era: Web and industry standards for eReader and tablet devices are supported as part of standard automated processing into delivery channel-specific formats, regularly updated and exposed through a Content API.

Use case: A journal publisher wants to integrate content enhancements across multiple subject matter areas to add value to products leveraging Article of the Future technology.
Digital Library era: No single point of access to content enhancements, and no standards for content enhancement suppliers and partners to deliver enhancements for integration.
Platform-as-a-service era: Easy access to multiple opportunities for content enhancements embedded in standard next-generation article formats and provided using standard content enhancement formats.
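The first use case above (a new medical term that must reach the search index quickly) can be illustrated with a minimal Python sketch. It is not the workflow described in the presentation: the taxonomy layout, the concept IDs, the sample article texts and the helper names `tag_document` and `build_index` are all assumptions made for illustration, and label matching stands in for a real content enhancement step.

```python
import re


def tag_document(text, taxonomy):
    """Return the sorted IDs of taxonomy concepts whose labels occur in text."""
    found = set()
    for concept_id, labels in taxonomy.items():
        for label in labels:
            # Whole-word, case-insensitive match on each label.
            pattern = r"(?<!\w)" + re.escape(label) + r"(?!\w)"
            if re.search(pattern, text, re.IGNORECASE):
                found.add(concept_id)
                break
    return sorted(found)


def build_index(documents, taxonomy):
    """Map each concept ID to the list of document IDs tagged with it."""
    index = {}
    for doc_id, text in documents.items():
        for concept_id in tag_document(text, taxonomy):
            index.setdefault(concept_id, []).append(doc_id)
    return index


# Hypothetical taxonomy and content, for illustration only.
taxonomy = {"influenza": ["influenza", "flu"]}
documents = {
    "article-1": "Seasonal flu vaccination outcomes in older adults.",
    "article-2": "Human infection with a novel avian influenza A(H7N9) virus.",
}
index_before = build_index(documents, taxonomy)

# A new term is added to the single taxonomy, and the index is rebuilt
# automatically with no manual re-tagging step.
taxonomy["h7n9"] = ["H7N9"]
index_after = build_index(documents, taxonomy)
```

Because tagging is a function of the taxonomy alone, adding the term and re-running the workflow is enough to make `article-2` discoverable under the new concept.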

Page 7: NoSQL technologies from an STM publishing perspective

Facets of STM publishing processes

Process Type: Acquisition, Transformation, Access and discovery, Enhancement, Composition, Delivery

Activity: submitting, crawling, syndicating, formatting, mapping, cleansing, indexing, querying, updating, storing, annotating, subject tagging, classification, entity recognition, entity extraction, fact extraction, clustering, aggregating, ordering, summarizing, filtering, analysis, rendering, design, publishing, accessing, retrieving, deleting

Entity: author, supplier, Web site, typesetter, automated process, subject matter expert, search engine, content repository, entity registry, product catalog, editor, reviewer, user, designer, developer, e-book, mobile app, mobile-enhanced Web site, API

Content Type: article, book, media object, entity record, taxonomy, ontology, user-generated content

Page 8: NoSQL technologies from an STM publishing perspective

Emerging content requirements

• Broad range of content types

– Must treat video, audio, images, datasets, metadata and knowledge organization systems as first-class objects, in addition to articles and books

• Standards-based

– Web-standard formats to support ease of integration and interoperability

• Fine-grained

– Must be decomposable into and addressable in fragments smaller than the unit of publication; e.g., down to the level of specific words, phrases, images, table cells in articles or book chapters, key frames and segments in videos

• Discoverable

– Must be easily located across all levels of granularity

• Accessible

– Must be easily accessed through content creation, retrieval, update and deletion (CRUD) services

• Flexible

– New content types and associated schemas must be easily added through configuration

• Reusable

– It must be efficient for product developers to aggregate and compose content fragments into new products

• Modifiable

– Support the enhancement and correction of content at any time following creation

• Broad range of delivery formats

– Content standards and services must support fulfillment, delivery and presentation across desktop, notebook, tablet and mobile computing devices
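Several of these requirements (fine-grained addressing, CRUD access, and new content types without schema changes) can be made concrete with a small Python sketch. It is a toy in-memory stand-in, not any particular NoSQL product: the `ContentStore` class, its method names, the path-tuple addressing scheme and the sample documents are all assumptions made for illustration.

```python
import copy
import uuid


class ContentStore:
    """Toy in-memory, schemaless store; a stand-in for a real document database."""

    def __init__(self):
        self._docs = {}

    def create(self, doc):
        """Store any dict-shaped document and return its generated ID."""
        doc_id = str(uuid.uuid4())
        self._docs[doc_id] = copy.deepcopy(doc)
        return doc_id

    def retrieve(self, doc_id, path=()):
        """Return the whole document, or the fragment addressed by `path`."""
        node = self._docs[doc_id]
        for key in path:
            node = node[key]  # works for dict keys and list indexes alike
        return copy.deepcopy(node)

    def update(self, doc_id, path, value):
        """Replace the fragment at a non-empty `path` with `value`."""
        node = self._docs[doc_id]
        for key in path[:-1]:
            node = node[key]
        node[path[-1]] = copy.deepcopy(value)

    def delete(self, doc_id):
        del self._docs[doc_id]


store = ContentStore()

# Two different content types share one store; no schema is declared up front.
article_id = store.create({
    "type": "article",
    "title": "An example article",
    "sections": [
        {"heading": "Results", "table": [["dose", "response"], ["10 mg", "0.42"]]},
    ],
})
video_id = store.create({"type": "video", "key_frames": [0.0, 12.5, 31.0]})

# Address a single table cell, below the unit of publication.
cell_path = ("sections", 0, "table", 1, 1)
cell = store.retrieve(article_id, cell_path)

# Correct that cell after publication without rewriting the article.
store.update(article_id, cell_path, "0.43")
```

The path tuple is only one possible addressing scheme; a production system would more likely use a standard fragment identifier or query language.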

Page 9: NoSQL technologies from an STM publishing perspective

Emerging content architecture

[Diagram. Stages: Acquire; Transform, Enhance, Compose; Deliver. Content stores: Document, Entity record, Media object, each labeled with relational metadata. Connecting layer: Linked data.]

Page 10: NoSQL technologies from an STM publishing perspective

Content acquisition and transformation


Page 11: NoSQL technologies from an STM publishing perspective

Content enhancement and analytics


Page 12: NoSQL technologies from an STM publishing perspective

Content composition and delivery


Page 13: NoSQL technologies from an STM publishing perspective

Why NoSQL is important to STM publishing

• NoSQL emphasizes design choices that focus on delivering robust, scalable Web applications

– Document-centric

– Schemaless

– Support for analytics

– Read/write at Web scale

– Move scale-out from development to operations

• As we shift to the platform-as-a-service era, these features become an important part of the STM publishing technology stack

Page 14: NoSQL technologies from an STM publishing perspective

How NoSQL addresses STM publishing’s needs

• Schemaless, document-centric stores

– Ease repository extension to accommodate expanding range of new, finer-grained content types

– Fit HTML5/JS/CSS content stack providing web-based alternatives to native apps

– Expedite application stack refresh in support of authoring and editorial workflow portals and tools

• Support for analytics eases innovation in scientometrics

• Read/write at Web scale accommodates solutions incorporating content at more dynamic, fine-grained scale

– Entity records

– Annotations

– Other forms of community-contributed content

– Linked data integration of heterogeneous information resources across the Web for mashups/solutions

• Moving scale-out from development to operations reduces time-to-market, cost of failure for emerging, niche publishing opportunities
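The linked data point above can be sketched in Python as a minimal subject-predicate-object store in which articles, annotations and datasets are joined through a shared entity identifier. This is an illustrative assumption, not the Elsevier implementation: the `TripleStore` class, the `http://example.org/` URIs and the predicate names are all hypothetical.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


class TripleStore:
    """Toy in-memory set of triples with wildcard pattern matching."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add(Triple(subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return sorted triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self._triples
            if (subject is None or t.subject == subject)
            and (predicate is None or t.predicate == predicate)
            and (obj is None or t.obj == obj)
        )


EX = "http://example.org/"  # hypothetical namespace
store = TripleStore()
store.add(EX + "article/1", EX + "mentions", EX + "entity/h7n9")
store.add(EX + "article/2", EX + "mentions", EX + "entity/h7n9")
store.add(EX + "annotation/7", EX + "annotates", EX + "article/1")
store.add(EX + "dataset/3", EX + "about", EX + "entity/h7n9")

# Which articles mention the entity?
mentions = [
    t.subject
    for t in store.match(predicate=EX + "mentions", obj=EX + "entity/h7n9")
]

# Which resources of any kind are linked to the entity? This is the mashup view.
linked = sorted({t.subject for t in store.match(obj=EX + "entity/h7n9")})
```

Each statement is a small, independent write, which is the kind of fine-grained, community-contributed content the slide associates with read/write at Web scale.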

Page 15: NoSQL technologies from an STM publishing perspective

Where STM publishing can drive NoSQL requirements

• Integrated support for search

– Free text retrieval

– Faceted navigation

• Query language functionality

– Nearest-neighbor matching

– Joins vs. join-free

• Primitives/support for analytics design patterns

– Clustering

– Classification

– Entity resolution

• Primitives/support for semantic enhancement

– Linked data

– Language processing

• Versioning for document stores
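To show what integrated search support would have to provide, here is a minimal Python sketch of free text retrieval combined with faceted navigation over schemaless documents. It assumes nothing about any specific NoSQL product: the functions `tokens`, `search` and `facet_counts`, the field names and the sample documents are all hypothetical, and matching is a naive all-terms-in-title test rather than a ranked index.

```python
import re
from collections import Counter


def tokens(text):
    """Lower-cased word tokens of a string."""
    return set(re.findall(r"\w+", text.lower()))


def search(docs, text=None, filters=None):
    """Return docs whose title contains every query term and matches all filters."""
    query_terms = tokens(text) if text else set()
    results = []
    for doc in docs:
        if not query_terms <= tokens(doc.get("title", "")):
            continue
        if any(doc.get(field) != value for field, value in (filters or {}).items()):
            continue
        results.append(doc)
    return results


def facet_counts(docs, field):
    """Count the values of one field across a result set."""
    return Counter(doc[field] for doc in docs if field in doc)


# Hypothetical documents of mixed content types.
docs = [
    {"id": "a1", "type": "article", "subject": "virology", "title": "Avian influenza surveillance"},
    {"id": "a2", "type": "article", "subject": "oncology", "title": "Tumor imaging methods"},
    {"id": "b1", "type": "book", "subject": "virology", "title": "Principles of influenza virology"},
    {"id": "v1", "type": "video", "subject": "virology", "title": "Influenza sampling protocol"},
]

hits = search(docs, text="influenza")
by_type = facet_counts(hits, "type")  # counts shown beside each facet value
narrowed = search(docs, text="influenza", filters={"type": "book"})
```

In a store with native support, both the text match and the facet counts would come back from a single query rather than from application code.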

Page 16: NoSQL technologies from an STM publishing perspective

Elsevier applications of NoSQL technologies

• Entity registries

• Metadata repositories

• Big data analytics

• User-built apps

Page 17: NoSQL technologies from an STM publishing perspective

Linked Data Repository


Page 18: NoSQL technologies from an STM publishing perspective

SciVal


Page 19: NoSQL technologies from an STM publishing perspective

SciVerse


Page 20: NoSQL technologies from an STM publishing perspective

Conclusions

• STM publishing is in transition

• This is driving new requirements for content

• Many of these requirements are well met by NoSQL solutions

• Some requirements point to areas of future work for NoSQL technologists and vendors