Self-organizing maps applied to information retrieval of dissertations and theses from BDTD-UFPE



Bruno Pinheiro bfp@cin.ufpe.br

Renato Correa renato.correa@ufpe.br

Outline

• Information Retrieval Systems (IRS)
• IRS + SOM
• Related Works
• Document Collection
• System Architecture
• Methodology
• Results

Information Retrieval Systems (IRS)

• Indexing, searching, and classifying textual documents.

• Users' information needs.

• Matching users' queries against the system's vocabulary.

IRS + SOM

[Diagram labels: Self-Organizing Maps; Information Retrieval System]

IRS + SOM

• Navigation interface built through document maps
• Document maps
– Self-Organizing Map trained with document vectors

Related Works

• First works (1991-1995)
– Lin / Merkl
• Large projects (1996-2000)
– Arizona Digital Library, WEBSOM, SOMLib
• Diversification (2001-2005)
– LiGHtSOM, GHSOM, H2SOM
• Convergence (2006)

Document Collection

• UFPE Digital Library of Theses and Dissertations (BDTD-UFPE)
– Offers the full text of all theses and dissertations produced in the university's graduate programs.
– Approximately 6000 documents.
– Linked to the Brazilian BDTD and to the NDLTD (Networked Digital Library of Theses and Dissertations).

System Architecture

[Diagram: processing pipeline; each stage is listed with its output]

• Document Acquisition: documents' content
• Document Indexing: inverted index
• Document Representation: document vectors
• Dimensionality Reduction: reduced vectors
• Volume Reduction: prototype vectors
• Construction of Document Map: document map

Methodology

• Document acquisition
– Harvesting through the OAI-PMH protocol
– XML files containing the documents' metadata
– Data extraction with the Java library JColtrane
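The acquisition step can be illustrated with a minimal sketch. The slides name Java and JColtrane; the version below swaps in Python's standard `xml.etree.ElementTree` purely for illustration. The `ListRecords` verb, the `oai_dc` metadata prefix, the `resumptionToken` element, and the namespaces come from the OAI-PMH and Dublin Core specifications. The choice of fields (title, description), the function names, the base URL, and the sample record are assumptions, not details taken from the presentation.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH repository."""
    if resumption_token:
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Return (records, resumption_token) from one ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "id": rec.findtext("oai:header/oai:identifier", default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
            # Assumption: the abstract is carried in dc:description.
            "abstract": rec.findtext(".//dc:description", default="", namespaces=NS),
        })
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    return records, (token.strip() or None)


# Hypothetical response with a single toy record (not real BDTD-UFPE data).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Toy thesis title</dc:title>
          <dc:description>Toy abstract text.</dc:description>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, token = parse_records(SAMPLE)
```

A harvester would request `list_records_url(base)`, parse the response, and repeat with the returned token until no token is present.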

Methodology

• Indexing
– Java library Lucene
– Stemming, plus removal of digits and stopwords
– Inverted index built according to the vector space model
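The indexing steps above can be sketched in a few lines. This is not Lucene and does not reproduce its analyzers. The stopword list is a toy one, `crude_stem` is a hypothetical stand-in for a real stemmer, and all names are illustrative assumptions.

```python
import re
from collections import defaultdict

# Toy stopword list (assumption); a real system would use a full list
# for the language of the collection.
STOPWORDS = {"a", "an", "and", "of", "the", "to", "in", "for"}


def crude_stem(token):
    """Strip a few English suffixes. A stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, keep alphabetic tokens only (drops digits), remove stopwords, stem."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


def build_inverted_index(docs):
    """Map each term to a posting dict {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


# Toy documents for illustration only.
index = build_inverted_index({1: "Training maps of 2008", 2: "The map and the maps"})
```

Storing term frequencies in the postings is what later allows each document to be read back as a vector in the vector space model.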

Methodology

• Document representation
– Each document is represented by a vector indexed by term; each value is a function of that term's frequency of occurrence in the document.
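A minimal sketch of this representation follows. The slides do not say which function of term frequency was used, so the log-scaled weight `1 + ln(tf)` below is only an assumed default. The function names are illustrative as well.

```python
import math
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Assign each distinct term a fixed vector position (sorted for reproducibility)."""
    terms = sorted({t for doc in tokenized_docs for t in doc})
    return {term: i for i, term in enumerate(terms)}


def document_vector(tokens, vocabulary, weight=lambda tf: 1.0 + math.log(tf)):
    """Return a dense vector whose i-th value is weight(tf) for vocabulary term i."""
    vector = [0.0] * len(vocabulary)
    for term, tf in Counter(tokens).items():
        if term in vocabulary:
            vector[vocabulary[term]] = weight(tf)
    return vector


# Toy tokenized documents for illustration only.
docs = [["map", "map", "train"], ["thesis", "map"]]
vocabulary = build_vocabulary(docs)
vectors = [document_vector(d, vocabulary) for d in docs]
```

Passing `weight=lambda tf: float(tf)` gives raw term counts instead.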

Methodology

• Dimensionality reduction
– Feature selection based on word frequency
– Stopword elimination
– Final dimensionality: 13095 terms
• Volume reduction
– Not used
– Volume: 4781 documents
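Frequency-based feature selection can be sketched as a document-frequency filter. The slides give neither the thresholds nor the exact frequency measure, so `min_df`, `max_df_ratio`, and the use of document frequency (rather than total occurrence counts) are all assumptions made for illustration.

```python
from collections import Counter


def select_terms(tokenized_docs, min_df=2, max_df_ratio=0.5):
    """Keep terms whose document frequency lies in [min_df, max_df_ratio * N]."""
    n_docs = len(tokenized_docs)
    # Count each term once per document it appears in.
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    return sorted(t for t, n in df.items() if min_df <= n <= max_df_ratio * n_docs)


def reduce_docs(tokenized_docs, kept_terms):
    """Drop every token that is not among the selected terms."""
    kept = set(kept_terms)
    return [[t for t in doc if t in kept] for doc in tokenized_docs]


# Toy tokenized documents for illustration only.
docs = [
    ["som", "map", "thesis"],
    ["som", "index", "thesis"],
    ["som", "map", "query"],
    ["som", "rare"],
]
kept = select_terms(docs)
reduced = reduce_docs(docs, kept)
```

Very rare terms carry little structure for the map, and terms present in almost every document do not discriminate between documents, which is the usual motivation for cutting both ends.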

Methodology

• Document map construction
– Single stage
– somtoolbox functions for MATLAB
– Document vectors normalized before training
– SOM with a rectangular structure (10 x 12) and a hexagonal neighborhood
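The two geometric ingredients above, normalized input vectors and a 10 x 12 grid with hexagonal neighborhood, can be sketched as follows. This is not the somtoolbox code. The slides do not state which normalization was applied, so unit Euclidean length is an assumption (a common companion to dot-product matching). Treating 10 as the number of rows and 12 as the number of columns is also an assumption.

```python
import math


def unit_normalize(vector):
    """Scale a vector to Euclidean length 1; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vector))
    return list(vector) if norm == 0.0 else [v / norm for v in vector]


def hex_grid_coords(rows, cols):
    """(x, y) position of each unit on a hexagonal lattice, listed row by row."""
    coords = []
    for r in range(rows):
        for c in range(cols):
            x = c + (0.5 if r % 2 else 0.0)  # odd rows are shifted by half a unit
            y = r * math.sqrt(3) / 2         # consecutive rows are sqrt(3)/2 apart
            coords.append((x, y))
    return coords


coords = hex_grid_coords(10, 12)  # 120 map units
```

With these coordinates every interior unit has six neighbors at distance 1, which is what distinguishes the hexagonal neighborhood from a rectangular one.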

Methodology

• Document map construction
– Weights initialized linearly along the two eigenvectors with the largest eigenvalues
– Batch SOM algorithm with dot-product metric
– Gaussian neighborhood function
– Neighborhood size decreasing linearly with the number of epochs
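One epoch of a batch SOM with dot-product matching and a Gaussian neighborhood can be sketched as below. This is a simplified illustration, not the somtoolbox implementation. The linear initialization along eigenvectors is left out, the renormalization of the updated weights is an assumption, and all function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def best_matching_unit(x, weights):
    """Index of the unit whose weight vector has the largest dot product with x."""
    return max(range(len(weights)), key=lambda i: dot(x, weights[i]))


def gaussian(dist_sq, sigma):
    """Gaussian neighborhood value for a squared grid distance."""
    return math.exp(-dist_sq / (2.0 * sigma * sigma))


def batch_epoch(data, weights, coords, sigma):
    """One batch update: each unit becomes the neighborhood-weighted mean of the data."""
    # All best-matching units are found first, using the old weights.
    bmus = [best_matching_unit(x, weights) for x in data]
    dim = len(data[0])
    new_weights = []
    for i, (xi, yi) in enumerate(coords):
        num = [0.0] * dim
        den = 0.0
        for x, b in zip(data, bmus):
            xb, yb = coords[b]
            h = gaussian((xi - xb) ** 2 + (yi - yb) ** 2, sigma)
            den += h
            for k in range(dim):
                num[k] += h * x[k]
        mean = [v / den for v in num] if den > 0 else list(weights[i])
        norm = math.sqrt(dot(mean, mean))
        # Assumption: renormalize so dot products stay comparable across units.
        new_weights.append([v / norm for v in mean] if norm > 0 else mean)
    return new_weights


# Toy example: two units, two unit-length data vectors.
data = [[1.0, 0.0], [0.0, 1.0]]
weights = [[0.9, 0.1], [0.1, 0.9]]
coords = [(0.0, 0.0), (1.0, 0.0)]
weights = batch_epoch(data, weights, coords, sigma=0.3)
```

Unlike the sequential algorithm, the batch version has no learning rate: the only quantity that changes from epoch to epoch is the neighborhood radius `sigma`.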

Methodology

• Document map construction
– Parameters
• Number of epochs
– Rough phase: 10 epochs
– Fine-tuning phase: 10 epochs
• Neighborhood size
– Rough phase
» Initial: [(number of units along the largest dimension) / 2] + 1
» Final: 2
– Fine-tuning phase
» Initial: 2
» Final: 0.8
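For the 10 x 12 map the largest dimension has 12 units, so the initial rough-phase neighborhood size works out to 12 / 2 + 1 = 7. The sketch below turns the parameters above into per-epoch values. Reading the square brackets as "integer part" and including both endpoints in the linear interpolation are assumptions. The function names are illustrative.

```python
def linear_schedule(start, end, epochs):
    """Neighborhood size per epoch, moving linearly from start to end (both included)."""
    if epochs == 1:
        return [float(start)]
    return [start + (end - start) * (t / (epochs - 1)) for t in range(epochs)]


def radius_schedules(map_rows=10, map_cols=12, rough_epochs=10, fine_epochs=10):
    """Return the rough-phase and fine-tuning-phase neighborhood sizes."""
    initial = max(map_rows, map_cols) // 2 + 1  # 12 // 2 + 1 = 7 for a 10 x 12 map
    rough = linear_schedule(initial, 2, rough_epochs)
    fine = linear_schedule(2, 0.8, fine_epochs)
    return rough, fine


rough, fine = radius_schedules()
```

The rough phase starts wide enough to cover more than half of the map, so the global ordering is set first; the fine-tuning phase then adjusts units with only their nearest neighbors.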

Methodology

• User interface construction
– Each document is mapped to the node whose model vector is closest in terms of cosine distance
– Each map node is labeled according to category
• Knowledge areas (CHLA, CBS, TCEN)
• Graduate programs
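The mapping and labeling steps can be sketched as follows. The slides say only that nodes are labeled "according to the category"; labeling each node with the most frequent category among its mapped documents is an assumption, as are the function names. The vectors and category assignments below are toy data.

```python
import math
from collections import Counter, defaultdict


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def closest_node(doc_vector, model_vectors):
    """Smallest cosine distance is the same as largest cosine similarity."""
    return max(range(len(model_vectors)),
               key=lambda i: cosine_similarity(doc_vector, model_vectors[i]))


def label_nodes(doc_vectors, categories, model_vectors):
    """Label each occupied node with the most common category among its documents."""
    votes = defaultdict(Counter)
    for vec, cat in zip(doc_vectors, categories):
        votes[closest_node(vec, model_vectors)][cat] += 1
    return {node: counts.most_common(1)[0][0] for node, counts in votes.items()}


# Toy model vectors and documents, for illustration only.
model_vectors = [[1.0, 0.0], [0.0, 1.0]]
doc_vectors = [[2.0, 0.1], [3.0, 0.2], [0.1, 5.0]]
categories = ["CBS", "CBS", "TCEN"]
labels = label_nodes(doc_vectors, categories, model_vectors)
```

Nodes that receive no documents get no label in this sketch; how empty nodes were handled in the actual system is not stated in the slides.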

Results

Categories   Accuracy   F1 micro   F1 macro   Topographic error
3            0.96       0.96       0.96       0.01
61           0.66       0.66       0.44       0.01
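The relation between the three classification columns can be illustrated with a small sketch. In single-label classification every error counts once as a false positive and once as a false negative, so micro-averaged F1 equals accuracy, as in both rows of the table. Macro-averaged F1 weights every category equally and drops when small categories are classified poorly. The code below is a generic illustration on toy labels, not the authors' evaluation script. Averaging over the union of true and predicted categories is an assumption.

```python
from collections import Counter


def f1_scores(true_labels, predicted_labels):
    """Return (accuracy, micro F1, macro F1) for single-label classification."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(true_labels, predicted_labels):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # the predicted category gains a false positive
            fn[t] += 1  # the true category gains a false negative
    classes = sorted(set(true_labels) | set(predicted_labels))

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    accuracy = sum(tp.values()) / len(true_labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    return accuracy, micro, macro


# Toy labels: the rare category "B" is always misclassified as "A".
accuracy, micro, macro = f1_scores(["A", "A", "A", "B"], ["A", "A", "A", "A"])
```

In this toy case accuracy and micro F1 are both 0.75, while macro F1 is lower because category "B" contributes an F1 of 0.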

Results

[Figure panels: Knowledge Areas; Graduate Programs]

Acknowledgement

Questions?

Bruno Pinheiro bfp@cin.ufpe.br

Renato Correa renato.correa@ufpe.br