Technology for integrated access and discovery

Presented by: Marc Krellenstein
Title: VP, Search and Discovery
Advanced Technology Group
Date: February 5, 2004

Basic search is pretty good

• Modern search engines are fast and scalable
  – Having the data (usually lots) is still key
• Can interpret keyword, Boolean and pseudo-natural language queries
  – Ex: “how to make an international call with my Blackberry”
• Spell checking, thesauri and stemming to improve recall (a stemming sketch follows this slide)
• Users are more experienced
  – More multi-term searches
• Gets lots of hits, but that’s usually OK if good ones are on top
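
As a small aside on how stemming improves recall, a minimal sketch using NLTK’s PorterStemmer (the library choice is an assumption; any stemmer would do): query and document terms are reduced to a common stem so that singular and plural forms match.

    # Hypothetical sketch: stemming query and document terms to improve recall.
    # Assumes NLTK is installed; any stemmer could be substituted.
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()

    def stems(text):
        # Lowercase, tokenize on whitespace, and reduce each token to its stem.
        return {stemmer.stem(tok) for tok in text.lower().split()}

    doc = "New treatments for penicillin allergies"
    query = "penicillin allergy treatment"

    # Without stemming, 'treatment' vs. 'treatments' and 'allergy' vs. 'allergies'
    # would miss; with stemming, every query stem is found in the document.
    print(stems(query) <= stems(doc))   # True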

Basic search is pretty good

• Best-practice relevancy ranking is good (a toy TF-IDF sketch follows this slide):
  – Term frequency (TF): more hits count more
  – Inverse document frequency (IDF): hits of rarer search terms count more
    • Ex: diabetes diagnosis and treatment
  – Hits of search terms near each other count more
    • Ex: penicillin allergy vs. “penicillin allergy”
  – Hits on metadata (title, subject, etc.) count more
    • Use anchor text – referring text – as metadata
  – Items with more links/references to them count more
    • Authoritative links/referrers count yet more
  – Many other factors: length, date, etc.
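
As a rough illustration of the TF and IDF factors above (proximity, metadata, anchor-text and link factors are omitted), a minimal Python sketch; the documents, tokenizer and weighting formula are invented for the example.

    # Hypothetical sketch of TF-IDF scoring: frequent hits count more (TF),
    # hits on rarer query terms count more (IDF). Real engines add proximity,
    # metadata/anchor-text boosts, link authority, length and date factors.
    import math
    from collections import Counter

    docs = {
        "d1": "diabetes diagnosis and treatment of type 2 diabetes",
        "d2": "diagnosis and treatment of penicillin allergy",
        "d3": "treatment options treatment guidelines treatment costs",
    }

    def tokenize(text):
        return text.lower().split()

    def idf(term):
        # Rarer terms get a higher weight across the corpus.
        df = sum(1 for text in docs.values() if term in tokenize(text))
        return math.log((1 + len(docs)) / (1 + df)) + 1

    def score(query, text):
        tf = Counter(tokenize(text))
        return sum(tf[t] * idf(t) for t in tokenize(query))

    query = "diabetes diagnosis and treatment"
    for doc_id, text in sorted(docs.items(), key=lambda kv: -score(query, kv[1])):
        print(doc_id, round(score(query, text), 2))
    # d1 ranks first: it matches the rarest term ('diabetes') twice.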

Basic search is pretty good

• Using these techniques, search engines can locate specific documents, or good documents (if not the absolute best) around general or specific topics

• But challenges remain…

Current challenges

• Integrated search: content still exists in separate silos
  – Silos getting bigger, but there are still too many
  – Library patrons have dozens of choices
  – Putting even more into Google is probably not sufficient to solve the problem
• Finding the best/novel documents
  – Hard to perform complicated searches (e.g., research similar to one’s own)
    • Historians can’t define a profile…
• Discovery
  – Hard to do more than search: summarize, uncover novelty and relationships, analyze

The integration challenge

• Two approaches:
  – Build even bigger databases (well, yes…)
    • Not easy, but sometimes the easiest approach
    • Can be difficult to manage and to secure appropriate rights
  – Distribute search: search separately managed (or owned) large databases as if they are one
    • Technically more challenging, but a scalable and maintainable architecture

Distributed search

• Index multiple (maybe geographically) separate databases with a single search engine that supports distributed search (a sketch follows this slide)
  – Use a common metadata scheme (e.g., Dublin Core) and/or determine other common fields or field mappings for each database
  – Search engine provides parallel search, integrated ranking and integrated results
  – The separate databases can be maintained and updated separately
  – Elsevier is currently unifying its own sources in such a model with a ‘web service’ architecture
    • Has contributed specifications to the public domain
  – Such services can also be offered externally
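
A minimal sketch of the distributed-search idea, under stated assumptions: the database names, field mappings and backends below are invented for illustration, and a real deployment (such as the web-service architecture mentioned above) would differ. Each database keeps its own schema; a Dublin Core-style mapping normalizes fields, the databases are searched in parallel, and results are merged into one integrated ranking.

    # Hypothetical sketch: search separately maintained databases as if they were one.
    # Field names, mappings, backends and scores are invented for illustration.
    from concurrent.futures import ThreadPoolExecutor

    # Each 'database' has its own schema; map fields onto a Dublin Core-like scheme.
    FIELD_MAPS = {
        "journals_db": {"article_title": "dc:title", "topic": "dc:subject"},
        "books_db":    {"book_name": "dc:title", "keywords": "dc:subject"},
    }

    # Toy stand-ins for the separately maintained databases.
    FAKE_BACKENDS = {
        "journals_db": lambda q: [({"article_title": "Distributed search at scale",
                                    "topic": "information retrieval"}, 0.92)],
        "books_db":    lambda q: [({"book_name": "Metadata for digital libraries",
                                    "keywords": "Dublin Core"}, 0.71)],
    }

    def search_one(db_name, query):
        # Stand-in for a real per-database search call; returns (record, score) pairs
        # with the record's fields renamed to the common scheme.
        mapped = []
        for rec, score in FAKE_BACKENDS[db_name](query):
            common = {FIELD_MAPS[db_name].get(k, k): v for k, v in rec.items()}
            common["source"] = db_name
            mapped.append((common, score))
        return mapped

    def distributed_search(query):
        # Query every database in parallel, then merge into one integrated ranking.
        with ThreadPoolExecutor() as pool:
            parts = list(pool.map(lambda db: search_one(db, query), FIELD_MAPS))
        merged = [hit for part in parts for hit in part]
        return sorted(merged, key=lambda hit: hit[1], reverse=True)

    print(distributed_search("distributed search"))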

Distributed search

• Simplifies some business issues, but still requires a common technology platform
• Where a common platform is not possible, add federated search (i.e., metasearch; a query-translation sketch follows this slide)
  – Translate queries
  – Access and perform parallel search of multiple search engines (vs. multiple databases)
  – Integrate results as best as possible
  – Use standards to approximate distributed search
    • Uniform access, one query language (Z39.50, updated)
    • Add standards for relevancy ranking and results return?
    • NISO and its members are working on standards
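
A minimal sketch of the query-translation step in federated search (metasearch). The engine names and syntax rules are invented; a real metasearch layer would target actual protocols (e.g., Z39.50 or its web-era successors) and then integrate the returned result sets.

    # Hypothetical sketch: translate one user query into each engine's own syntax.
    # The engine names and syntax rules are invented; real metasearch would speak
    # standard protocols and then merge the returned results as best as possible.
    def to_engine_a(terms):
        # Engine A: explicit Boolean AND between terms.
        return " AND ".join(terms)

    def to_engine_b(terms):
        # Engine B: fielded query against a keyword index.
        return " ".join(f"kw:{t}" for t in terms)

    def translate(query):
        terms = query.lower().split()
        return {"engine_a": to_engine_a(terms), "engine_b": to_engine_b(terms)}

    print(translate("penicillin allergy treatment"))
    # {'engine_a': 'penicillin AND allergy AND treatment',
    #  'engine_b': 'kw:penicillin kw:allergy kw:treatment'}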

Finding the best: Navigation

• More data can also make finding the best or novel documents harder
  – For searches for rare items, more data is a win
  – For all other searches, it’s more likely your answer is in there… but it’s also more likely there’s lots of other stuff close but not as good
• Why? Relevancy is good, but…
• Relevancy has its limits… there may be many ‘good’ documents referring to different aspects of the search… the best?
• Underlying problems:
  – User’s needs may not be that specific
  – Even long searches are under-specified

One solution: clustering documents

• Group results around common themes: same subject, author, web site, journal, …
• Show the largest/most interesting categories
  – Depression → psychology, economics, meteorology, antiques…
  – Psychology → treatment of depression, depression symptoms, seasonal affective…
  – Psychology → Kocsis, J. (10), Berg, R. (8), …
• Themes could come from static metadata or dynamically by analysis of results text (a dynamic-clustering sketch follows this slide)
  – Static: fixed, clear categories and assignments
  – Dynamic: doesn’t require metadata (or a controlled vocabulary to draw from)
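
A minimal sketch of the dynamic approach, clustering result text rather than relying on metadata, using scikit-learn’s TfidfVectorizer and KMeans (a library choice made only for illustration; the snippets and the number of clusters are invented).

    # Hypothetical sketch: dynamically cluster result snippets around common themes.
    # Assumes scikit-learn; real systems often label clusters with their top terms.
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    results = [
        "treatment of depression with antidepressants",
        "seasonal affective disorder and winter depression",
        "economic depression and unemployment trends",
        "the great depression and monetary policy",
    ]

    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(results)

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

    # Label each cluster with its highest-weight terms as a stand-in for a theme name.
    terms = vectorizer.get_feature_names_out()
    for cluster_id in range(2):
        center = kmeans.cluster_centers_[cluster_id]
        top = [terms[i] for i in center.argsort()[::-1][:3]]
        members = [r for r, c in zip(results, kmeans.labels_) if c == cluster_id]
        print(cluster_id, top, members)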

Clustering benefits

• Disambiguates and refines search results to get to documents of interest quickly
• Can navigate long result lists hierarchically
  – Would never offer thousands of choices to choose from as input…
  – Access to the bottom of the list… maybe just less common
• Discovery – new aspects or sources
• Can narrow results *after* the search
  – Start with the broadest area search – don’t narrow by subject or other categories first
  – Easier, plus you can’t guess wrong, miss useful categories, or pick unneeded ones… results-driven
  – Ex: knee surgery → cartilage replacement, plastics, …

Finding the best: Complex search

• Main problem is still short searches/under-specification… which the keyword-based ‘enter a query’ paradigm encourages
• One solution: relevance feedback – marking good and bad results (a sketch follows this slide)
• A long-standing and proven search refinement technique
  – More information is better than less (longer queries are better)
  – Pseudo-relevance feedback is a research standard
• Commercial forms – find-similar, etc. – not widely used (or well executed)…
  – …but successful in PubMed (different users)
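
One classical way to implement relevance feedback is the Rocchio method; it is shown here only as an illustration, not necessarily the technique behind the commercial or PubMed features mentioned above. The query vector is moved toward results marked good and away from results marked bad.

    # Hypothetical Rocchio-style relevance feedback: move the query vector toward
    # documents marked relevant and away from those marked non-relevant.
    # The alpha/beta/gamma weights and the vectors are invented for illustration.
    import numpy as np

    def rocchio(query_vec, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
        q = alpha * query_vec
        if relevant:
            q = q + beta * np.mean(relevant, axis=0)
        if non_relevant:
            q = q - gamma * np.mean(non_relevant, axis=0)
        # Negative term weights are usually clipped to zero.
        return np.clip(q, 0, None)

    # Toy term space: [penicillin, allergy, rash, football]
    query = np.array([1.0, 1.0, 0.0, 0.0])
    good = [np.array([0.9, 0.8, 0.7, 0.0])]   # result the user marked relevant
    bad = [np.array([0.0, 0.2, 0.0, 0.9])]    # result the user marked off-topic

    print(rocchio(query, good, bad))
    # The refined query now also weights 'rash' and zeroes out 'football'.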

Relevance feedback

• One catch: must first find a good document to be similar to
• Solution: let the user provide the ideal document – or a long query or problem statement – as input in the first place (a term-extraction sketch follows this slide)
  – Can enter free text or specific documents describing the interest, e.g., an article, grant proposal, experiment description, etc.
  – Should provide the best possible matches
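
One simple way to turn a pasted document or problem statement into a query, shown only as an illustration, is to keep its most frequent non-stopword terms; a real system would more likely weight terms by corpus statistics such as IDF.

    # Hypothetical sketch: reduce a long problem statement to its most distinctive
    # terms and use those as the query. The stopword list is deliberately tiny.
    from collections import Counter

    STOPWORDS = {"the", "of", "and", "in", "to", "a", "for", "with", "we", "our", "its"}

    def document_as_query(text, top_n=5):
        tokens = [t.strip(".,;()").lower() for t in text.split()]
        counts = Counter(t for t in tokens if t and t not in STOPWORDS)
        return [term for term, _ in counts.most_common(top_n)]

    abstract = ("We study the p53 gene and its role in tumor suppression, "
                "comparing p53 mutations across leukemia and other cancers.")
    print(document_as_query(abstract))
    # ['p53', ...] - 'p53' ranks first; the remaining single-occurrence terms are ties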

Discovery challenge: Beyond search

• How do you summarize a corpus?
  – May want to report on what’s present, numbers of occurrences, trends, etc.
  – Ex: What diseases are studied the most?
  – Must know all diseases and look one by one
• How do you find a relationship if you don’t know what relationships exist?
  – Ex: does gene p53 relate to any disease?
  – Must check for each possible relationship
• Ad hoc analysis
  – How do all genes relate to this one disease? Over time? What organisms have the gene been studied in? Show me the document evidence…

One solution: entity extraction

• Identify entities (things) in a text corpus (a lexicon/co-occurrence sketch follows this slide)
  – Examples: authors, universities… diseases, drugs, side effects, genes… companies, lawsuits, plaintiffs, defendants…
  – Use lexicons, patterns and NLP to find any or all instances of the entity
• Identify relationships:
  – Through co-occurrence
    • Relationship presumed from proximity
    • Example: author–university affiliation
  – Through limited natural language processing
    • Semantic relations – causes, is-part-of, etc.
    • Examples: drug-causes-disease… drug-is-treatment-for-disease… A is suing B…
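
A minimal sketch of lexicon-based entity extraction with a relationship presumed from co-occurrence in the same sentence. The lexicons and abstracts are invented; real pipelines (such as the ClearForest work described next) add patterns and NLP to catch variants and to classify the semantic relation.

    # Hypothetical sketch: lexicon-based entity extraction, with a relationship
    # presumed whenever a gene and a disease co-occur in the same sentence.
    import re
    from itertools import product

    GENES = {"p53", "brca1"}
    DISEASES = {"leukemia", "alzheimer's", "breast cancer"}

    abstracts = [
        "Mutations in p53 are frequently observed in leukemia patients.",
        "BRCA1 screening is recommended for families with breast cancer history.",
    ]

    def extract(sentence, lexicon):
        # Return every lexicon term that appears as a whole word/phrase.
        return [term for term in lexicon
                if re.search(r"\b" + re.escape(term) + r"\b", sentence.lower())]

    cooccurrences = []
    for abstract in abstracts:
        for sentence in re.split(r"(?<=[.!?])\s+", abstract):
            genes = extract(sentence, GENES)
            diseases = extract(sentence, DISEASES)
            cooccurrences.extend(product(genes, diseases))

    print(cooccurrences)
    # [('p53', 'leukemia'), ('brca1', 'breast cancer')]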

ClearForest pilot, Fall 2002

• Goal: demonstrate real value to a working expert in 90 days
• Chose the biomedical domain
• Hired an expert to help define entities and relationships
• Used 25,000 abstracts from 23 Elsevier journals
• Worked with ClearForest to define and revise extraction of entities and relationships
• Have a related partnership with Stanford for text mining

Pilot scenarios

• Answered real questions using real data – not a demo or mock-up
• The user: anyone involved in genomic academic research – a primary researcher, graduate student or post-doc
• Scenario 1: Research about gene p53
  – What journals should I publish in?
  – Who’s an expert I can ask for advice?
  – What connections have been made to my gene?
  – What organisms have my gene?

What journals should I publish in?

Who’s an expert?

Connections to p53?

To organisms?

Pilot scenarios

• Scenario 2: Disease research
  – What diseases are most researched?
  – What’s the time trend in HIV research?
  – What are the centers of HIV research?
  – Who are the author teams in HIV research?
  – What gene-disease relationships are there? What were they to start, in 1996? Through 1997?
  – (Note: cannot answer the above with search alone)

What diseases are most researched?

Time trend in HIV research?

Centers of HIV research?

Author teams in HIV research?

Gene-disease relationships?

To start, in 1996?

Through 1997?

Pilot scenarios

• Scenario 3: Connections between leukemia and Alzheimer’s
  – Are there direct connections between leukemia and Alzheimer’s?
  – What enzymatic activity is associated with leukemia?
  – Are there indirect connections between leukemia and Alzheimer’s mediated by enzymatic activity?

Direct connections between leukemia and Alzheimer’s?

Enzymes associated with leukemia?

Indirect links from leukemia to Alzheimer’s via enzymes

The power of indirect links

• Almost impossible to determine manually
• Can provide completely unexpected relationships between source and target

The value of analytics

• Goes beyond search – summarizes, shows relationships, answers complex questions
• A significant value-added service
  – Value of one new drug discovery?

Summary

• Need to search more broadly, more easily
  – Larger databases
  – Distributed search
• Need to locate the best/novel documents in even larger (distributed) databases
  – Clustering to find documents of real interest
  – Find-similar, descriptive search
• Need to go beyond search for overviews, relationships and discovery
  – Text-based data mining and entity extraction
