Self-organizing maps applied to information retrieval of dissertations and theses from BDTD-UFPE


Bruno Pinheiro, bfp@cin.ufpe.br

Renato Correa, renato.correa@ufpe.br

Guide

• Information Retrieval Systems (IRS)
• IRS + SOM
• Related Works
• Document Collection
• System Architecture
• Methodology
• Results

Information Retrieval Systems (IRS)

• Indexing, searching, and classifying textual documents.

• Users' information needs

• Matching users' queries against the system's vocabulary.

IRS + SOM

[Diagram: Self-Organizing Maps combined with an Information Retrieval System]

IRS + SOM

• Navigation interface built through document maps
• Document maps
– Self-Organizing Maps trained with document vectors

Related Works

• First works (1991 - 1995)
– Lin / Merkl
• Great projects (1996 - 2000)
– Arizona Digital Library, WEBSOM, SOMLib
• Diversification (2001 - 2005)
– LiGHtSOM, GHSOM, H2SOM
• Convergence (2006)

Document Collection

• UFPE Digital Library of Theses and Dissertations (BDTD-UFPE)
– Offers the full text of all theses and dissertations produced in the university's graduate programs.
– Approximately 6000 documents.
– Linked to the Brazilian BDTD and to the NDLTD (Networked Digital Library of Theses and Dissertations).

System Architecture

[Diagram: processing pipeline, each stage shown with its output]
• Document Acquisition → documents' content
• Document Indexing → inverted index
• Document Representation → document vectors
• Dimensionality Reduction → reduced vectors
• Volume Reduction → prototype vectors
• Construction of Document Map → document map

Methodology

• Document acquisition
– Harvesting through the OAI-PMH protocol
– XML files containing the documents' metadata
– Data extraction with the Java library JColtrane
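As a rough illustration of the extraction step, the sketch below pulls identifier, title, and abstract out of an OAI-PMH ListRecords response in the oai_dc format. It is not the authors' code: the slides name the Java library JColtrane, and Python's standard ElementTree is swapped in here only to keep the example short. The sample response and the `extract_records` helper are hypothetical; only the three namespace URIs come from the OAI-PMH and Dublin Core specifications.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical, heavily shortened ListRecords response (not real BDTD-UFPE data).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example thesis title</dc:title>
          <dc:description>Example abstract text.</dc:description>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def extract_records(xml_text):
    """Return one dict per record with its identifier, title and abstract."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "id": rec.findtext("oai:header/oai:identifier", default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
            "abstract": rec.findtext(".//dc:description", default="", namespaces=NS),
        })
    return records


records = extract_records(SAMPLE_RESPONSE)
```

A real harvester would also follow the protocol's resumption tokens to page through the full collection; that part is left out here.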

Methodology

• Indexing
– Performed with the Java library Lucene
– Stemming, plus elimination of digits and stopwords
– Inverted index built according to the vector space model
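The indexing steps listed above can be sketched in a few lines. This is a toy stand-in, not Lucene: the stopword list, the suffix-stripping `toy_stem` function, and the sample documents are all assumptions made for illustration, and a real system would use a proper stemmer and stopword list for the collection's language.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (assumption, not the list used by the authors).
STOPWORDS = {"the", "of", "and", "a", "to", "in", "for"}


def toy_stem(token):
    """Very crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    """Lowercase, keep alphabetic runs only (drops digits), remove stopwords, stem."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOPWORDS]


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


docs = {
    "d1": "Training of self-organizing maps in 2006",
    "d2": "Maps for indexing theses and dissertations",
}
index = build_inverted_index(docs)
```

Storing the term frequency in each posting is what lets the same index feed the vector space model later on.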

Methodology

• Document representation
– Each document is represented by a vector whose components are indexed by terms; each value is a function of the term's frequency of occurrence in the document.
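The slides do not say which function of term frequency was used. As one common possibility, the sketch below weights each term by tf-idf; the weighting choice, the `document_vectors` helper, and the sample tokens are assumptions for illustration only.

```python
import math
from collections import Counter


def document_vectors(tokenized_docs):
    """Return (vocabulary, vectors) with one tf-idf weighted vector per document."""
    vocabulary = sorted({t for tokens in tokenized_docs for t in tokens})
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for tokens in tokenized_docs for t in set(tokens))
    vectors = []
    for tokens in tokenized_docs:
        tf = Counter(tokens)
        # Weight = raw term frequency times log inverse document frequency.
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocabulary])
    return vocabulary, vectors


tokenized = [["map", "train", "map"], ["map", "index"], ["index", "thes"]]
vocab, vecs = document_vectors(tokenized)
```

Each position in a vector corresponds to one vocabulary term, so the vector length equals the vocabulary size, which is why the next step reduces the number of terms.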

Methodology

• Dimensionality reduction
– Feature selection based on word frequency
– Stopword elimination
– Final dimensionality: 13095 terms
• Volume reduction
– Not used
– Volume: 4781 documents
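Frequency-based feature selection can be sketched as a filter on document frequency: terms that appear in too few or too many documents are dropped. The thresholds `min_df` and `max_df_ratio` below are hypothetical; the slides give only the resulting dimensionality (13095 terms), not the cut-offs that produced it.

```python
from collections import Counter


def select_terms(tokenized_docs, min_df=2, max_df_ratio=0.5):
    """Keep terms whose document frequency lies within [min_df, max_df_ratio * N]."""
    n_docs = len(tokenized_docs)
    df = Counter(t for tokens in tokenized_docs for t in set(tokens))
    return sorted(t for t, c in df.items()
                  if c >= min_df and c / n_docs <= max_df_ratio)


def project(tokenized_docs, kept_terms):
    """Drop every token that is not in the reduced vocabulary."""
    kept = set(kept_terms)
    return [[t for t in tokens if t in kept] for tokens in tokenized_docs]


tokenized = [
    ["som", "map", "retrieval"],
    ["som", "map", "cluster"],
    ["som", "index", "retrieval"],
    ["som", "index", "lucene"],
]
kept = select_terms(tokenized)
reduced = project(tokenized, kept)
```

In this toy collection "som" occurs in every document and "cluster" and "lucene" in only one, so all three are removed and three terms remain.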

Methodology

• Document map construction
– Single stage
– somtoolbox functions for MATLAB
– Document vectors normalized before training
– SOM with a rectangular structure (10 x 12) and hexagonal neighborhood
– Weights initialized linearly along the eigenvectors with the two largest eigenvalues
– Batch SOM algorithm with dot product metric
– Gaussian neighborhood function
– Neighborhood size decreasing linearly with the number of epochs
– Parameters
  • Number of epochs
    – Rough phase: 10 epochs
    – Fine-tuning phase: 10 epochs
  • Neighborhood size
    – Rough phase
      » Initial: [(number of units along the largest dimension) / 2] + 1
      » Final: 2
    – Fine-tuning phase
      » Initial: 2
      » Final: 0.8
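To make the training parameters concrete, the sketch below computes the two neighborhood schedules for a 10 x 12 map (the rough phase starts at 12 / 2 + 1 = 7) and runs a minimal batch-SOM epoch with dot-product matching and a Gaussian neighborhood. It is a simplified Python stand-in, not the somtoolbox code: it assumes unit-length normalization of inputs and prototypes, uses plain rectangular grid distances instead of a hexagonal lattice, and replaces the linear eigenvector initialization with arbitrary starting weights on a tiny 2 x 3 demo grid. All function names are hypothetical.

```python
import math

ROWS, COLS = 10, 12  # map size from the slides


def radius_schedule(initial, final, epochs):
    """Linearly interpolate the neighborhood radius over the given epochs."""
    if epochs == 1:
        return [initial]
    step = (final - initial) / (epochs - 1)
    return [initial + step * e for e in range(epochs)]


# Rough phase: initial radius = (largest map dimension) / 2 + 1, final radius 2.
rough = radius_schedule(max(ROWS, COLS) // 2 + 1, 2, 10)
# Fine-tuning phase: radius shrinks from 2 to 0.8.
fine = radius_schedule(2, 0.8, 10)


def unit(v):
    """Scale a vector to unit Euclidean length (zero vectors stay unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def gaussian(dist_sq, radius):
    """Gaussian neighborhood weight for a squared grid distance."""
    return math.exp(-dist_sq / (2.0 * radius * radius))


def batch_epoch(data, weights, coords, radius):
    """One batch epoch: dot-product matching, then neighborhood-weighted means."""
    dim = len(weights[0])
    # 1. Best-matching unit = unit with the largest dot product for each input.
    bmus = [max(range(len(weights)),
                key=lambda u: sum(a * b for a, b in zip(x, weights[u])))
            for x in data]
    # 2. Each unit becomes the neighborhood-weighted sum of all inputs,
    #    renormalized to unit length (assumed here for the dot-product variant).
    new_weights = []
    for u, (ur, uc) in enumerate(coords):
        acc = [0.0] * dim
        for x, b in zip(data, bmus):
            br, bc = coords[b]
            h = gaussian((ur - br) ** 2 + (uc - bc) ** 2, radius)
            acc = [a + h * xi for a, xi in zip(acc, x)]
        new_weights.append(unit(acc))
    return new_weights


# Tiny demo on a 2 x 3 grid with made-up data (not the 10 x 12 document map).
coords = [(r, c) for r in range(2) for c in range(3)]
data = [unit(v) for v in ([1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0])]
weights = [unit([1.0 + r, 1.0 + c, 1.0]) for r, c in coords]
for radius in fine:
    weights = batch_epoch(data, weights, coords, radius)
```

Because the batch update replaces every prototype at once from the whole data set, no learning rate is needed; the shrinking radius alone controls how strongly neighboring units are pulled together.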

Methodology

• User interface construction
– Documents are mapped to the node with the closest model vector in terms of cosine distance
– Each map node is labeled according to category
  • Knowledge areas (CHLA, CBS, TCEN)
  • Graduate programs
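The mapping and labeling steps might look like the sketch below. Mapping by highest cosine similarity follows the slide; labeling each node with the most frequent category among its documents is an assumption, since the slides do not state the exact labeling rule. The model vectors, document vectors, and helper names are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def best_matching_unit(doc_vector, model_vectors):
    """Index of the node whose model vector has the highest cosine similarity."""
    return max(range(len(model_vectors)),
               key=lambda u: cosine_similarity(doc_vector, model_vectors[u]))


def label_nodes(doc_vectors, categories, model_vectors):
    """Label each node with the most frequent category among its mapped documents."""
    votes = defaultdict(Counter)
    for vec, cat in zip(doc_vectors, categories):
        votes[best_matching_unit(vec, model_vectors)][cat] += 1
    return {node: counts.most_common(1)[0][0] for node, counts in votes.items()}


# Made-up two-node map and three documents, using category codes from the slides.
model_vectors = [[1.0, 0.0], [0.0, 1.0]]
doc_vectors = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.7]]
categories = ["TCEN", "TCEN", "CBS"]
labels = label_nodes(doc_vectors, categories, model_vectors)
```

Nodes that receive no documents get no entry in `labels`, so an interface built on this would need to decide how to display empty nodes.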

Results

Categories                Accuracy  F1 micro  F1 macro  Topographic error
3 (knowledge areas)       0.96      0.96      0.96      0.01
61 (graduate programs)    0.66      0.66      0.44      0.01
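The relation between the three classification scores can be checked with a small sketch. When every document has exactly one true and one predicted category, micro-averaged F1 equals accuracy, which is consistent with the identical Accuracy and F1 micro columns above; macro F1 averages per-class F1 and so drops when some classes are predicted poorly. The labels below are made up, and the slides do not describe the evaluation procedure, so this only illustrates the metric definitions.

```python
def f1(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def evaluate(y_true, y_pred):
    """Return (accuracy, micro F1, macro F1) for single-label predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    counts = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        counts[c] = (tp, fp, fn)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    # Micro: pool the counts of all classes, then compute a single F1.
    micro = f1(*(sum(col) for col in zip(*counts.values())))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(*c) for c in counts.values()) / len(classes)
    return accuracy, micro, macro


# Made-up labels: class "B" is rare and never predicted correctly.
y_true = ["A", "A", "A", "A", "B", "C"]
y_pred = ["A", "A", "A", "A", "A", "C"]
accuracy, micro_f1, macro_f1 = evaluate(y_true, y_pred)
```

In this toy case accuracy and micro F1 are both 5/6 while macro F1 is lower, the same pattern as the 61-category row.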

Results

[Figure panels: Knowledge Areas; Graduate Programs]

Acknowledgement

Questions?

Bruno Pinheiro, bfp@cin.ufpe.br
Renato Correa, renato.correa@ufpe.br
