II-SDV 2014 The Road to Federated Text Mining: Are we there yet? (Guy Singh - Linguamatics, UK)

The Road to Federated Text Mining: Are we there yet?

II-SDV 2014

Guy Singh

What is federated search?

“Federated search is an information retrieval technology that allows the simultaneous search of multiple searchable resources. A user makes a single query request which is distributed to the search engines participating in the federation”

- Wikipedia
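
As a rough illustration of the definition above, the sketch below fans one query out to several search back ends and collects their answers. The back ends are hypothetical in-memory stand-ins (`make_engine`, the source names and the documents are all invented for illustration); a real federation would call remote search services instead.

```python
from concurrent.futures import ThreadPoolExecutor


def make_engine(name, documents):
    """Return a toy search function over an in-memory list of documents (hypothetical)."""
    def search(query):
        term = query.lower()
        return [(name, doc) for doc in documents if term in doc.lower()]
    return search


# Hypothetical participants in the federation.
ENGINES = [
    make_engine("literature", ["BRCA1 mutations in breast cancer", "Aspirin dosing"]),
    make_engine("patents", ["Method for treating breast cancer"]),
    make_engine("news", ["New trial results announced"]),
]


def federated_search(query, engines=ENGINES):
    """Send one query to every engine concurrently and concatenate the hits."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        # pool.map keeps the results in the same order as the engines list.
        per_engine = list(pool.map(lambda engine: engine(query), engines))
    return [hit for hits in per_engine for hit in hits]


if __name__ == "__main__":
    for source, doc in federated_search("breast cancer"):
        print(f"{source}: {doc}")
```

The user issues a single query; which engines take part and how their hits are combined is decided inside `federated_search`.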

Current Situation

• Volume of data ever increasing

• Proprietary content can reside within Enterprise

• No need for everyone to keep standard sources up-to-date

• Data from content providers can reside on their sites

Linguamatics Customer Confidential 3

Internal Content

External Content

• MEDLINE

• Clinical Trials

• Publisher Content

• FDA Drug Labels

• Patents

Increasing Range of Data Sources

Data Sources

• Scientific Literature

• Social Media

• News

• Web Pages

• Internal Documents

• Patents

• RSS

• Clinical Trials

Varying in Structure

How does text mining differ from keyword search?

Example: What genes affect breast cancer?

Why does document structure matter?

• Searching across documents using keywords is relatively trivial

– No need to be aware of where the words occur or in what context

• Text mining documents with varying structure requires a more sophisticated approach. We need to:

– Know where words matching entities/concepts occur

– Disambiguate depending on context and location

– Find terms in particular regions/parts of a document for targeted searches
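
To make the contrast concrete, here is a minimal sketch of a plain keyword search next to a region-restricted search. The document layout (a dict of named regions) and the function names are assumptions made for illustration, not how any particular text mining product represents documents.

```python
# A hypothetical document split into named regions.
REPORT = {
    "title": "Pathology report",
    "family_history": "Mother diagnosed with breast cancer.",
    "diagnosis": "No evidence of malignancy.",
}


def keyword_hit(document, term):
    """Plain keyword search: true if the term occurs anywhere in the document."""
    text = " ".join(document.values()).lower()
    return term.lower() in text


def region_hits(document, term, regions):
    """Targeted search: return only the named regions that contain the term."""
    term = term.lower()
    return [r for r in regions if term in document.get(r, "").lower()]
```

Here `keyword_hit(REPORT, "breast cancer")` matches the report even though the phrase sits in the family history, while `region_hits(REPORT, "breast cancer", ["diagnosis"])` finds nothing, which is the behaviour a targeted search needs.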

Approaches to dealing with different data sources

• Integrate the data into a data warehouse

– Extract, Transform and Load each data source into a new database

– Multiple copies of the data

– Data normalisation can be difficult

– Time-consuming and expensive process

– Most database vendors take this approach

– Allows users to perform a single search across all the content

• Leave the data where it is: federated content

– Data remains in its original form and location

– Multiple data types

– Multiple network locations

– Single search across multiple different data sources

How do we get to Federated Text Mining?

Data Normalisation

Link the Content Servers

Merge Results

Federated Text Mining

Data Normalisation – Virtual Indexes

Pathology Reports Index

Journal Abstracts Index

Virtual Index

Data Normalisation – Document Structure

Pathology Reports

Journal Abstracts
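
One way to picture normalising document structure is a mapping from each source's own region names onto a shared set, so that a single query can name one region and reach both sources. The region names and mappings below are invented for illustration; the slides do not say how I2E does this internally.

```python
# Hypothetical mappings from source-specific region names to shared ones.
REGION_MAP = {
    "pathology_reports": {"FinalDiagnosis": "findings", "ClinicalHistory": "background"},
    "journal_abstracts": {"Results": "findings", "Background": "background"},
}


def normalise_structure(source, document):
    """Rename a document's regions to the shared names; unmapped regions are kept as-is."""
    mapping = REGION_MAP[source]
    return {mapping.get(region, region): text for region, text in document.items()}
```

After this step a query restricted to `findings` would cover the final diagnosis of a pathology report and the results section of an abstract alike.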

Data Normalisation – Entities

Journal Abstracts

Pathology Reports

Combined (Normalized)
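
Entity normalisation can be sketched as mapping every surface form to one preferred term before counting, so that hits from both sources land in the same row. The synonym table and the mentions used with it are made-up placeholders, not data from the talk.

```python
from collections import Counter

# Hypothetical synonym table: surface form (lower-case) -> preferred term.
PREFERRED = {
    "breast cancer": "Breast Neoplasms",
    "breast carcinoma": "Breast Neoplasms",
    "mammary carcinoma": "Breast Neoplasms",
    "nsclc": "Non-Small-Cell Lung Carcinoma",
}


def normalise_entity(mention):
    """Return the preferred term for a mention, or the mention unchanged if unknown."""
    return PREFERRED.get(mention.strip().lower(), mention)


def combined_counts(*sources):
    """Count normalised mentions across any number of sources."""
    counts = Counter()
    for mentions in sources:
        counts.update(normalise_entity(m) for m in mentions)
    return counts
```

Without the normalisation step, "Breast carcinoma" in one source and "breast cancer" in another would be reported as two separate entities.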

Linking Content Servers

Linked Servers

• I2E 4.1 introduced a new feature: Linked Servers

• One I2E server can be linked to another I2E server

• Provides access to remote and local indexes and queries through a single I2E interface

– Indexes and queries on remote servers on the network appear the same as local indexes
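
The idea that remote indexes "appear the same as local indexes" can be pictured as a server object that lists and searches a linked server's indexes through the same methods it uses for its own. This is only a conceptual sketch: the `Server` class, its methods and the index names are hypothetical and are not the I2E API.

```python
class Server:
    """Toy content server holding named indexes (index name -> list of documents)."""

    def __init__(self, name, indexes):
        self.name = name
        self._indexes = dict(indexes)
        self._linked = []

    def link(self, other):
        """Link another server so its indexes become reachable from this one."""
        self._linked.append(other)

    def list_indexes(self):
        """Local and linked indexes in one flat listing."""
        names = list(self._indexes)
        for server in self._linked:
            names.extend(server.list_indexes())
        return names

    def search(self, index, term):
        """Search an index by name, wherever it lives."""
        if index in self._indexes:
            return [d for d in self._indexes[index] if term.lower() in d.lower()]
        for server in self._linked:
            if index in server.list_indexes():
                return server.search(index, term)
        raise KeyError(index)


# Hypothetical setup: an in-house server linked to a hosted one.
enterprise = Server("enterprise", {"internal_docs": ["Internal note on aspirin"]})
on_demand = Server("on_demand", {"medline": ["Aspirin and stroke prevention"]})
enterprise.link(on_demand)
```

A client talking only to `enterprise` can call `enterprise.search("medline", "aspirin")` without knowing that the index is held elsewhere.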

Development Status

Linguamatics – Customer confidential

I2E 4.1 Linked Servers

I2E Enterprise on Customer network

I2E OnDemand SaaS

Infrastructure

In-house Indexes

I2E OnDemand Standard Indexes

I2E Enterprise Access

Custom Indexes

Access via Linked Servers

Access via single UI

Merging Results (Part I)

Single Server, Multiple Queries

I2E 3.0 (2009) – Merging Results (part I) from one server

Profiling Individuals

• Example from news reports related to pharmaceutical industry

• Pick up properties from one document or many

© Linguamatics 2012 - Customer Confidential

© Linguamatics 2013 - Confidential

I2E 3.0 – Merging Results (part I) from one server

Document identifier

Patient information

Disease history

Patient data

Medications and dosages

Hit displayed in context

Merging Results (Part II)

Multiple Servers, Multiple Queries

Each server supplies a separate set of results

Content Server 1

Content Server 2

Content Server 3

Content Server 4

Merge into a single set of results
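
A minimal sketch of the merge step, assuming each server returns rows as `(document_id, entity, score)` tuples. That row shape, the scores and the de-duplication rule are assumptions made for this example rather than anything described in the slides.

```python
def merge_results(*result_sets):
    """Merge per-server result lists into one list, best score first.

    Rows with the same (document_id, entity) are treated as duplicates
    and only the highest-scoring copy is kept.
    """
    best = {}
    for rows in result_sets:
        for doc_id, entity, score in rows:
            key = (doc_id, entity)
            if key not in best or score > best[key]:
                best[key] = score
    merged = [(doc_id, entity, score) for (doc_id, entity), score in best.items()]
    # Sort by descending score, then by id and entity for a stable order.
    return sorted(merged, key=lambda row: (-row[2], row[0], row[1]))
```

Each content server's list is passed as a separate argument, for example `merge_results(server1_rows, server2_rows)`, and the caller receives one ranked list.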

The Road to Federated Text Mining

Linking Content Servers

I2E 4.0: Multiple Clients, Multiple Results

I2E Server 2 FDA Drug Labels

I2E Server 1 Internal Documents

external network internal network

I2E 4.1/4.2: Single Client, Multiple Results

I2E Server 2 FDA Drug Labels

I2E Server 1 Internal Documents

external network internal network

Linked server

Merging Results (Part II)

Q4 2014: Single Client, Single Result, Multiple Servers

I2E Server 2 FDA Drug Labels

I2E Server 1 Internal Documents

external network internal network

Linked server

Q4 2014: Federated Text Mining Example

• Single Query

• Differently structured data sources on different servers

– Journal Articles (PubMed Central) on Enterprise Server

– MEDLINE on I2E OnDemand

• Single set of results

The Road to Federated – Are we there yet?

I2E 4.0: Dec 2012

I2E 4.1: October 2013

Next release (in development): Q4 2014

Merging the Results (part II)

Data Normalisation

Linking Content Servers

Demo

Cambridge

VPN

Nice

Linked Server

Journal Abstracts

Pathology Reports

Thank you
