Automatic Metadata Generation using Associative Networks


DESCRIPTION

In spite of its tremendous value, metadata is generally sparse and incomplete, thereby hampering the effectiveness of digital information services. Many of the existing mechanisms for the automated creation of metadata rely primarily on content analysis which can be costly and inefficient. The automatic metadata generation system proposed in this article leverages resource relationships generated from existing metadata as a medium for propagation from metadata-rich to metadata-poor resources. Because of its independence from content analysis, it can be applied to a wide variety of resource media types and is shown to be computationally inexpensive. The proposed method operates through two distinct phases. Occurrence and co-occurrence algorithms first generate an associative network of repository resources leveraging existing repository metadata. Second, using the associative network as a substrate, metadata associated with metadata-rich resources is propagated to metadata-poor resources by means of a discrete-form spreading activation algorithm. This article discusses the general framework for building associative networks, an algorithm for disseminating metadata through such networks, and the results of an experiment and validation of the proposed method using a standard bibliographic dataset.


Automatic Metadata Generation Using Associative-Networks

Marko A. Rodriguez

CCS-3 ‘Tech Talk’

December 7, 2005

http://www.soe.ucsc.edu/~okram

Resources and Metadata

• A resource is any digital-object (e.g. manuscripts, images, video, audio, etc.).

• A resource’s metadata record is a list of attributes describing the resource

[ EXAMPLE MANUSCRIPT METADATA ] Authors, Institutions, Keywords, Subject Categories, Citations, Year, Publishing Journal, Usage Data

Metadata Record

<?xml version="1.0" encoding="UTF-8" ?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <responseDate>2005-09-07T15:25:04Z</responseDate>
  <request verb="GetRecord" identifier="oai:arXiv.org:cs/0412047" metadataPrefix="oai_dc">http://arXiv.org/oai2</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:arXiv.org:cs/0412047</identifier>
        <datestamp>2004-12-14</datestamp>
        <setSpec>cs</setSpec>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A Social Network for Societal-Scale Decision-Making Systems</dc:title>
          <dc:creator>Rodriguez, Marko</dc:creator>
          <dc:creator>Steinbock, Daniel</dc:creator>
          <dc:subject>Computers and Society</dc:subject>
          <dc:subject>Data Structures and Algorithms</dc:subject>
          <dc:subject>Human-Computer Interaction</dc:subject>
          <dc:subject>H.4.2</dc:subject>
          <dc:subject>J.7</dc:subject>
          <dc:subject>K.4.m</dc:subject>
          <dc:description>In societal-scale decision-making systems the collective is faced ...</dc:description>
          <dc:description>Comment: Dynamically Distributed Democracy algorithm</dc:description>
          <dc:date>2004-12-10</dc:date>
          <dc:type>text</dc:type>
          <dc:identifier>http://arxiv.org/abs/cs/0412047</dc:identifier>
          <dc:identifier>North American Association for Computational Social and Organizational Science Conference Proceedings 2004</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
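As a rough illustration of how such a record turns into a list of attributes, the sketch below parses an abbreviated copy of the Dublin Core block with Python's standard library. The function name `dc_attributes` and the shortened record are assumptions made for this example, and the `xmlns:dc` declaration is included so that the `dc:` prefix resolves in a standalone parse.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Abbreviated copy of the record above. The xmlns:dc declaration is an
# assumption added here so the dc: prefix resolves in a standalone parse.
RECORD = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                       xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>A Social Network for Societal-Scale Decision-Making Systems</dc:title>
  <dc:creator>Rodriguez, Marko</dc:creator>
  <dc:creator>Steinbock, Daniel</dc:creator>
  <dc:subject>Computers and Society</dc:subject>
  <dc:date>2004-12-10</dc:date>
</oai_dc:dc>"""

DC_NS = "{http://purl.org/dc/elements/1.1/}"


def dc_attributes(xml_text):
    """Return {attribute name: [values]} for every dc:* child element."""
    attributes = defaultdict(list)
    for element in ET.fromstring(xml_text):
        if element.tag.startswith(DC_NS):
            # Strip the namespace so "creator", "subject", ... remain.
            attributes[element.tag[len(DC_NS):]].append(element.text)
    return dict(attributes)


record = dc_attributes(RECORD)
```

Repeated elements such as `dc:creator` are kept as lists, which is the form the co-occurrence comparisons later in the talk need.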

Problem Statement

• Metadata is costly to generate by hand

• Metadata is hard to extract from raw resource (e.g. audio, video)

• How can we automatically generate metadata for atrophied resource records?

General System Overview

• Generate resource relations with existing metadata in the repository.
  – occurrence and/or co-occurrence networks

• Propagate metadata from metadata-rich resources to metadata-limited resources.
  – encapsulate metadata in discrete particles and disseminate them over the generated associative network

HEP-TH 2003 Semantic Network

[Figure: semantic network diagram. Node labels: A1–A4, P1–P5, O1–O2, J1–J2, K1–K2, T1–T2. Edge labels: "Author of", "Organization of", "Published journal", "Published time", "Has keyword", "cites".]

Transforming the Semantic Network

Convert the multi-node network into a collection of manuscripts with their associated attributes (metadata record).

– manuscript (resource), with its metadata record:
  • Authors
  • Citations
  • Publication Date
  • Keywords
  • Organizations
  • Journal

Occurrence/Co-Occurrence

• Citation: two manuscripts are connected if one manuscript cites the other.

• Co-Author: two manuscripts are connected if they share the same authors.

• Co-Citation: two manuscripts are connected if they share the same citations.

• Co-Keyword: two manuscripts are connected if they share the same keywords

• Co-Organization: two manuscripts are connected if they share the same organizations

• Co-Date: two manuscripts are connected if they share the same publication date

• Co-Journal: two manuscripts are connected if they share the same journal
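The definitions above can be sketched in a few lines of Python. Everything named here is hypothetical: the record ids, the attribute values, and the two function names are made up for illustration. Weighting a co-occurrence edge by the number of shared values is an assumption, since the slides only say that two manuscripts are connected.

```python
from itertools import combinations

# Hypothetical records: ids and attribute values are made up for illustration.
records = {
    "P1": {"authors": {"A1", "A2"}, "keywords": {"K1"}, "cites": {"P2"}},
    "P2": {"authors": {"A2"}, "keywords": {"K1", "K2"}, "cites": set()},
    "P3": {"authors": {"A3"}, "keywords": {"K2"}, "cites": {"P2"}},
}


def occurrence_network(records, attribute="cites"):
    """Directed edges: a resource points at each resource it references."""
    return {(source, target)
            for source, record in records.items()
            for target in record[attribute]
            if target in records}


def co_occurrence_network(records, attribute):
    """Undirected edges; weight = number of shared values (an assumption)."""
    edges = {}
    for (a, rec_a), (b, rec_b) in combinations(sorted(records.items()), 2):
        shared = rec_a[attribute] & rec_b[attribute]
        if shared:
            edges[(a, b)] = len(shared)
    return edges


citation = occurrence_network(records)
co_author = co_occurrence_network(records, "authors")
co_keyword = co_occurrence_network(records, "keywords")
```

The same `co_occurrence_network` call covers the co-organization, co-date, and co-journal cases by passing a different attribute name.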

Network Generation Running Times

• Occurrence: O(N)
  – Each resource’s metadata record must be checked once and only once for a direct reference to another resource.

• Co-occurrence: O([N^2 – N] / 2)
  – Each resource’s metadata record must be checked against every other resource’s (N^2), except itself (-N), once and only once (1/2).
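To make the co-occurrence count concrete, the short check below compares the closed form (N^2 - N) / 2 with an explicit enumeration of unordered pairs. The function name is hypothetical. Asymptotically the expression is O(N^2); the constant factor only reflects that each pair is visited once.

```python
from itertools import combinations


def co_occurrence_comparisons(n):
    """Number of unordered resource pairs: (n^2 - n) / 2."""
    return (n * n - n) // 2


# Cross-check the closed form against explicit pair enumeration.
for n in (1, 2, 10, 100):
    enumerated = sum(1 for _ in combinations(range(n), 2))
    assert co_occurrence_comparisons(n) == enumerated

pairs_for_1000 = co_occurrence_comparisons(1000)  # 499500 comparisons
```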

[Figure: two small diagrams, one with nodes A and B, the other with nodes A, B, and C.]

Particle Propagation

• Every resource is given one particle, p_i. This particle contains all the metadata associated with its resource.

• A particle also has an energy value, e_i. The further the particle travels (edge steps), the more its energy value decays.

e_i(t+1) = e_i(t) * (1-\delta)
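Unrolling the decay rule gives e_i(t) = e_i(0) * (1-\delta)^t, so energy falls off geometrically with the number of edge steps. The sketch below checks the step-by-step update against that closed form. The initial energy of 1.0 and the default \delta of 0.15 are assumptions chosen for illustration, not values from the talk.

```python
def energy_after(steps, initial_energy=1.0, delta=0.15):
    """Apply e(t+1) = e(t) * (1 - delta) once per edge step."""
    energy = initial_energy
    for _ in range(steps):
        energy *= (1.0 - delta)
    return energy


def closed_form(steps, initial_energy=1.0, delta=0.15):
    """Equivalent closed form: e(t) = e(0) * (1 - delta) ** t."""
    return initial_energy * (1.0 - delta) ** steps


# With delta = 0.5 the energy halves at every step: 1.0, 0.5, 0.25, 0.125.
halving = [energy_after(t, delta=0.5) for t in range(4)]
```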

Particle Propagation

• At each step, the particle takes an outgoing edge of its current node, chosen according to the probability distribution over that node’s outgoing edge set. If the resource it reaches lacks metadata of a particular type, the particle recommends its own metadata of that type to that resource, weighted by the particle’s current energy value.
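A single particle's walk might look like the sketch below. It is a minimal reading of the two slides above, not the authors' implementation. The function name, the dictionary layout of `graph` and `metadata`, the starting energy of 1.0, and the choice to decay energy before issuing a recommendation are all assumptions.

```python
import random
from collections import defaultdict


def propagate(graph, metadata, start, attribute, steps, delta, rng):
    """Walk one particle from `start`; return {node: {value: strength}}."""
    recommendations = defaultdict(lambda: defaultdict(float))
    carried = metadata[start].get(attribute, set())  # the particle's payload
    node, energy = start, 1.0                        # assumed initial energy
    for _ in range(steps):
        neighbours = graph.get(node)
        if not neighbours or not carried:
            break
        # Choose an outgoing edge in proportion to its weight.
        targets = list(neighbours)
        weights = [neighbours[t] for t in targets]
        node = rng.choices(targets, weights=weights, k=1)[0]
        energy *= (1.0 - delta)                      # decay per edge step
        # Recommend only to resources lacking this metadata type.
        if not metadata[node].get(attribute):
            for value in carried:
                recommendations[node][value] += energy
    return recommendations


# Hypothetical two-node network: P1 has a journal, P2 does not.
graph = {"P1": {"P2": 1.0}, "P2": {"P1": 1.0}}
metadata = {"P1": {"journal": {"Journal of Complexity"}}, "P2": {}}
result = propagate(graph, metadata, "P1", "journal",
                   steps=3, delta=0.5, rng=random.Random(0))
```

In this toy network the particle reaches P2 on steps 1 and 3 with energies 0.5 and 0.125, so P2 accumulates a strength of 0.625 for "Journal of Complexity". A full run would launch one particle per resource and sum the recommendations per target.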

Metadata Recommendations

• Manuscript A
  – Journal (recommendation strength in brackets)
    • Journal of Complexity [0.2457]
    • Journal of Information Science [0.1]
    • Information Processing and Management [0.001]

Mini-Break

Terrorist Alert

System Parameters

• Metadata Density: to validate the algorithm, we remove a percentage of the metadata in the system and see whether the algorithm can reconstruct it (d \in [0,1])

• Metadata Percentile: only those metadata tags in the pth percentile are accepted as valid metadata (p \in [0,1])

** Validation is based on precision and recall values
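One possible reading of the percentile parameter and the validation step is sketched below, using the recommendation strengths from the Manuscript A example. The cutoff rule, both function names, and the ground-truth set are assumptions for illustration. The talk does not spell out how the percentile is computed.

```python
def accept_by_percentile(strengths, p):
    """Keep values whose strength is at or above the p-th percentile
    of all recommended strengths (an assumed reading of the parameter)."""
    if not strengths:
        return set()
    ranked = sorted(strengths.values())
    cutoff = ranked[min(int(p * len(ranked)), len(ranked) - 1)]
    return {value for value, s in strengths.items() if s >= cutoff}


def precision_recall(accepted, truth):
    """Precision = hits / accepted, recall = hits / truth."""
    hits = len(accepted & truth)
    precision = hits / len(accepted) if accepted else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


# Strengths from the Manuscript A slide; the ground truth is hypothetical.
strengths = {
    "Journal of Complexity": 0.2457,
    "Journal of Information Science": 0.1,
    "Information Processing and Management": 0.001,
}
truth = {"Journal of Complexity"}

loose = precision_recall(accept_by_percentile(strengths, 0.5), truth)
strict = precision_recall(accept_by_percentile(strengths, 0.9), truth)
```

Under these assumptions, raising p from 0.5 to 0.9 drops the weaker journal recommendation and lifts precision from 0.5 to 1.0 while recall stays at 1.0.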

Results for Co-Author Network (Citation Metadata)

Results for Co-Author Network (Organization Metadata)

Results for Co-Author Network (Keyword Metadata)

Results for Co-Keyword Network (Citation Metadata)

Results for Co-Keyword Network (Journal Metadata)

Results for Citation Network (Author Metadata)

Results for Citation Network (Keyword Metadata)

Results for Citation Network (Journal Metadata)

Take Home Points

• Different edge types are better at propagating different metadata types.

• Can work for any resource type as long as there exists some preliminary vetted metadata and a way to create resource relations. (If there is pre-existing metadata, then resource relations can be created automatically.)

Future Work (part 1)

• What about path types? e.g. take a co-author edge, then a citation edge, etc. Better precision and recall?

• Explore usage metadata. It is applicable to any resource type and allows for cross-resource relations (e.g. manuscripts connected to audio). The weight between two resources is a function of the interval between their downloads from the same IP (Bollen et al., 2004).

Future Work (part 2)

• Application to social networks? Given an unknown individual, infer their attributes from their social relationships.

How does ‘work_with’ differ from ‘married_to’? People linked by each tend to share the same income metadata and religious-belief metadata, respectively.

Conclusion

• Good life…

Rodriguez, M.A., Bollen, J., Van de Sompel, H., “Automatic Metadata Generation using Associative Networks”, [unpublished], 2005.

Know of a good journal venue?
