Term Co-occurrence Analysis as an Interface to Digital Libraries

Jan W. Buzydlowski

Howard D. White

Xia Lin

College of Information Science and Technology

Drexel University, Philadelphia, Pennsylvania, USA

Digital Library Research

First Wave
– How to store it

Next Wave
– How to retrieve it (IR)
  • Text Mining
  • Visual Information Retrieval Interface (VIRI)

Term Co-occurrence Analysis (TCA)
– Co-occurrence vs. lexical associations
– Maps vs. lists

Term Definition

Unit of Analysis
– Words
– Documents
– Authors
– Journals

Section of Focus
– Abstract/Text
– Title
– Bibliography
– Keywords

Example

Words in Title
– Term
– Co-occurrence
– Analysis
– Interface
– Digital
– Library

Authors in Bibliography
– Salton-G
– Chen-C
– White-HD
– Ding-Y
– Cleveland-W
– McCain-K
– Lin-X
– Schvaneveldt-R
– Kamada-T
– Fruchterman-T

Term Co-occurrence Methodology

User determines which terms are of interest
– Via a seed term
– From a pre-defined list

The system returns the pair-wise co-occurrence counts of the terms over the collection of records

Example

Unit: Author; Section: Bibliography
User Supplied List: Plato, Aristotle, Smith, Brown

For a given data set (N = 4 unique terms)
– Article 1: Plato, Aristotle, Smith, …
– Article 2: Plato, Smith, …
– Article 3: Plato, Aristotle, Smith, Brown, …

The following co-citations (C(4,2) = 6) are found

COMBINATION            COUNT   ARTICLES
Plato and Smith        3       1, 2, 3
Plato and Aristotle    2       1, 3
Plato and Brown        1       3
Aristotle and Smith    2       1, 3
Aristotle and Brown    1       3
Smith and Brown        1       3
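The counts in the table above can be reproduced with a short Python sketch. This is only an illustration of pair-wise counting over the three example articles; the function name `cooccurrence` and the dictionary layout are assumptions made here, not part of the system described in the slides.

```python
from itertools import combinations

# Bibliographies from the example, restricted to the authors of interest.
articles = {
    1: ["Plato", "Aristotle", "Smith"],
    2: ["Plato", "Smith"],
    3: ["Plato", "Aristotle", "Smith", "Brown"],
}
terms = ["Plato", "Aristotle", "Smith", "Brown"]  # user-supplied list


def cooccurrence(records, terms):
    """Map every unordered pair of terms to the ids of records citing both."""
    hits = {pair: [] for pair in combinations(terms, 2)}
    for rec_id, cited in sorted(records.items()):
        present = set(cited)
        for a, b in hits:
            if a in present and b in present:
                hits[(a, b)].append(rec_id)
    return hits


pairs = cooccurrence(articles, terms)
for (a, b), ids in pairs.items():
    print(f"{a} and {b}: count={len(ids)} articles={ids}")
```

With N = 4 terms the dictionary holds C(4,2) = 6 pairs, and the length of each list is the co-occurrence count for that pair.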

Term Co-occurrence Significance

The frequent co-occurrence of term pairs within a set of documents indicates a strong association between those terms, whereas an infrequent co-occurrence suggests a weak one

– The association you would expect is borne out by the frequency

– The frequency you compute suggests a level of association

Pain and Management      Pain and Obtainment
Plato and Aristotle      Plato and Cher
Science and Nature       Science and National Tattler
A and B                  C and D

Term Co-occurrence Uses

Allows a user to get a “foothold” with just one term
– One seed term returns many other related terms

Allows a user to get an “overview” with user-supplied/system-supplied terms
– Co-occurrence counts with visualization

Seeding

User types in
– One term, e.g., Plato
– Boolean expression, e.g., Plato AND Brown

System supplies top n terms, in ranked order of frequency of co-occurrence with the initial term
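As a rough sketch of the seeding step, the ranking can be expressed in a few lines of Python. The function `top_related`, the treatment of a Boolean AND as "every seed term must appear in the record", and the alphabetical tie-break are all assumptions made for illustration; the slides do not say how the system resolves ties or parses expressions.

```python
from collections import Counter


def top_related(records, seeds, n):
    """Rank terms by the number of records they share with all seed terms."""
    seeds = set(seeds)
    counts = Counter()
    for cited in records:
        cited = set(cited)
        if seeds <= cited:  # Boolean AND: every seed must be present
            counts.update(cited - seeds)
    # Descending count, then alphabetical so that ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]


records = [
    ["Plato", "Aristotle", "Smith"],
    ["Plato", "Smith"],
    ["Plato", "Aristotle", "Smith", "Brown"],
]
single = top_related(records, ["Plato"], n=2)           # one seed term
boolean = top_related(records, ["Plato", "Brown"], n=3)  # Plato AND Brown
print(single)
print(boolean)
```

A single seed uses the whole collection, while the AND expression first narrows the collection to records containing both terms and then ranks what remains.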

Example

For Plato seed:

ARISTOTLE, PLUTARCH, CICERO, HOMER, BIBLE, EURIPIDES, ARISTOPHANES, XENOPHON, AUGUSTINE, HERODOTUS, KANT-I, AESCHYLUS,

SOPHOCLES, THUCYDIDES, OVID, HESIOD, DIOGENES-LAERTI, HEIDEGGER-M, DERRIDA-J, PINDAR, NIETZSCHE-F, HEGEL-GWF, VERGIL, AQUINAS-T

Need for Visualization

Given a list of user- / system-supplied terms
– Find the frequency of co-occurrence of each pair-wise combination of terms
  • Plato AND Aristotle = 1,920
  • Plato AND Plutarch = 380
  • …
– Too many numbers to take in at once
  • C(25, 2) = (25 * 24) / 2 = 300 pairs

Three major visualization techniques
– Multidimensional Scaling (MDS)
– Self-Organizing (Kohonen) Maps (SOMs)
– PathFinder Networks (PFNETs)
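To make the "too many numbers" point concrete, the sketch below computes the number of pairs for a few list sizes and arranges pair counts into a symmetric matrix. Treating a symmetric matrix as the hand-off to a mapping procedure is an assumption made here for illustration; the slides do not describe the input format the system actually uses. The counts are those of the four-author example, and the helper names are hypothetical.

```python
from math import comb


def pair_count(n_terms):
    """Number of unordered term pairs, C(n, 2) = n * (n - 1) / 2."""
    return comb(n_terms, 2)


def to_matrix(terms, counts):
    """Arrange pair counts into a symmetric matrix with a zero diagonal."""
    index = {term: i for i, term in enumerate(terms)}
    matrix = [[0] * len(terms) for _ in terms]
    for (a, b), count in counts.items():
        matrix[index[a]][index[b]] = count
        matrix[index[b]][index[a]] = count
    return matrix


for n in (4, 25, 50, 100):
    print(n, "terms ->", pair_count(n), "pairs")

terms = ["Plato", "Aristotle", "Smith", "Brown"]
counts = {
    ("Plato", "Aristotle"): 2, ("Plato", "Smith"): 3, ("Plato", "Brown"): 1,
    ("Aristotle", "Smith"): 2, ("Aristotle", "Brown"): 1, ("Smith", "Brown"): 1,
}
matrix = to_matrix(terms, counts)
for term, row in zip(terms, matrix):
    print(f"{term:<10}", row)
```

The pair count grows quadratically with the number of terms, which is why a list that is readable at 4 terms becomes unmanageable at 25.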

Figure: White’s MDS map of 15 co-cited classificationists, ca. 1990. Authors shown: RR Sokal, PHA Sneath, JC Gower, JH Ward, JD Carroll, JB Kruskal, VE McGee, RN Shepard, JA Hartigan, HA Skinner, SC Johnson, M Wish, P Arabie, RK Blashfield, PE Green.

Figure: White’s PFNet of co-cited authors in Biblical and literary hermeneutics, 1988-1997. Authors shown: SCHLEIERMACHER F, GADAMER HG, KANT I, HEGEL GWF, BARTH K, DILTHEY W, HEIDEGGER M, PLATO, BIBLE, ARISTOTLE, HABERMAS J, DERRIDA J, RICOEUR P, GOETHE JWV, BULTMANN R, FRANK M, NIETZSCHE F, TILLICH P, FICHTE JG, PANNENBERG W, TROELTSCH E, SCHELLING FWJ, SCHLEGEL FV, LUTHER M, EBELING G.

Our System

Three tiered
– User interface
– Server
– Database

Real-time and interactive

Significant data sources
– ISI AHCI
– MedLine

Live interface for retrieval

Diagram components: BRS Search Engine, Web Server, Java Servlets, Web-based Map Interface, Java Applet, Mapping Procedures, Application Server, Oracle Databases, PUBMED Search Engine.

User Interface - Seed

User Interface – SOM

Interface - PFNET

Interface - Visual Information Retrieval Interface (VIRI)

User Interface IV

Database Interface API

– String[] findRel(String, int)
– int[] findOcc(String[])

Implemented on:
– BRS
  • API via a wrapper
– Oracle
  • API via JDBC
– Noah
  • Specialized co-occurrence database
  • API via JNI
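The two-call interface lets the mapping code stay the same while the backend changes. The Python sketch below illustrates that design with an abstract class and a toy in-memory backend. Everything in it is hypothetical: the slides give only the two Java signatures, so the meaning assigned to each call (top n related terms for `findRel`, one count per term pair for `findOcc`) and the order of the returned counts are assumptions.

```python
from abc import ABC, abstractmethod
from collections import Counter
from itertools import combinations


class CooccurrenceStore(ABC):
    """Hypothetical Python analogue of the two-call database interface."""

    @abstractmethod
    def find_rel(self, seed, n):
        """Return the top n terms co-occurring with seed (assumed meaning)."""

    @abstractmethod
    def find_occ(self, terms):
        """Return one co-occurrence count per term pair (assumed meaning)."""


class InMemoryStore(CooccurrenceStore):
    """Toy backend standing in for BRS, Oracle, or Noah."""

    def __init__(self, records):
        self._records = [set(record) for record in records]

    def find_rel(self, seed, n):
        counts = Counter()
        for record in self._records:
            if seed in record:
                counts.update(record - {seed})
        # Descending count, then alphabetical to keep ties deterministic.
        return sorted(counts, key=lambda term: (-counts[term], term))[:n]

    def find_occ(self, terms):
        # Assumed layout: counts follow itertools.combinations(terms, 2) order.
        return [
            sum(1 for record in self._records if a in record and b in record)
            for a, b in combinations(terms, 2)
        ]


store = InMemoryStore([
    ["Plato", "Aristotle", "Smith"],
    ["Plato", "Smith"],
    ["Plato", "Aristotle", "Smith", "Brown"],
])
related = store.find_rel("Plato", 3)          # seed step
occ = store.find_occ(["Plato"] + related)     # counts for the map
print(related, occ)
```

A caller written against `CooccurrenceStore` would not need to know whether the counts come from a search-engine wrapper, SQL over JDBC, or a native library.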

Future Plans

User Study
– Preference
  • Type of map, etc.
– Cognitive map
  • How well does the map match experts’ mental models

Larger datasets

Additional data sources