Online Autonomous Citation Management for CiteSeer CSE598B Course Project By Huajing Li



Introduction to CiteSeer

Software package developed at NEC-Labs

Domain-independent software for Automatic Citation Indexing (ACI)

Focus is on scholarly publications in electronic format (PS / PDF and variants)

Performs:
– Document Discovery / Retrieval / Parsing
– Automatic Citation Extraction
– Document & Citation Indexing / Search


[Figure: CiteSeer system architecture. Component labels on the diagram: Crawler; Retrieval; Conversion; Parsing & Meta-Data Extraction; Meta-Data Database; PDBM_File & Chunk Tables; Indexing; Indexes; Web Server; CD; Document Database; File System.]

[Figure, data items shown: Document URL; Document (PDF/PS); Document (Plain Text); Document Body Text; Document Meta-data Set (DID, Title, Authors, etc.); N Citation Texts; N Citation Meta-data Sets (CID, GID, Title, Authors, etc.).]


Submitting Documents

Output of Crawl / User Submission is the URL of the page linking to the document.

These URLs are dumped in the Paper Table.

The Paper Table maintains status for each document:
– Downloaded/undownloaded
– Processed/unprocessed
– Other processing errors (tooshort/noreference/etc.)

CiteSeer regularly scans this table to start download of new documents.

Only documents meeting the typical pattern of scholarly publications are eventually added to the collection.
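
The status tracking described on this slide can be sketched as a small table that is scanned for pending work. This is only an illustration, not CiteSeer's actual schema: the table name `paper`, its columns, and the helper functions are all assumptions, and an in-memory SQLite database stands in for whatever store the real system uses.

```python
import sqlite3

def open_paper_table():
    """Create a hypothetical Paper Table (schema assumed, not from CiteSeer)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """CREATE TABLE paper (
               url        TEXT PRIMARY KEY,
               downloaded INTEGER NOT NULL DEFAULT 0,
               processed  INTEGER NOT NULL DEFAULT 0,
               error      TEXT  -- e.g. 'tooshort' or 'noreference'
           )"""
    )
    return db

def submit(db, url):
    """Record a URL found by the crawler or submitted by a user."""
    db.execute("INSERT OR IGNORE INTO paper (url) VALUES (?)", (url,))

def mark_downloaded(db, url):
    db.execute("UPDATE paper SET downloaded = 1 WHERE url = ?", (url,))

def mark_error(db, url, error):
    db.execute("UPDATE paper SET error = ? WHERE url = ?", (error, url))

def pending_downloads(db):
    """The periodic scan: URLs not yet downloaded and not flagged with an error."""
    rows = db.execute(
        "SELECT url FROM paper WHERE downloaded = 0 AND error IS NULL ORDER BY url"
    )
    return [row[0] for row in rows]
```

A scheduler would call `pending_downloads` at regular intervals and hand each URL to the retrieval step; submitting the same URL twice is a no-op because `url` is the primary key.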


Document Structure Identification

– Title
– Subject (keywords)
– Description (abstract)
– Author names
– Author affiliations
– Author address, email, phone, Homepage URL
– Publication date, Publication number
– Archive date
– Contributor
– Type
– Format
– Identifier
– Source
– Publisher
– Journal/Conference
– Pages
– Relation
• References
• Is Referenced By

Field sources labeled on the slide: From document header; System info; From citation graph


Citations Grouping

Citations to the same document have a common Group ID
– Each Group ID has a set of keys associated with it, based on citation information
– {authorkey1-titlekey; … authorkey2-titlekey}
• For every single word in the authors information there is an authorkey
• For a given citation, the titlekey is unique and is the concatenation of all title words


Citations Grouping

For a newly discovered citation
– Extract
• Authors: C. Lee Giles, S. Lawrence
• Title: “Good Paper Title”
– Generate keys {giles-goodpapertitle; lee-goodpapertitle; lawrence-goodpapertitle}
– Try to match at least one of them with an existing Group ID key
• If there is a match, add this citation (Citation ID) to the group
• Otherwise create a new Group ID for this citation
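
The key generation and group assignment steps above can be sketched as follows. This is a minimal illustration, not CiteSeer's code: the names `citation_keys` and `GroupIndex` are hypothetical, and dropping single-letter initials is an assumption inferred from the example keys, which contain no key for "C." or "S.".

```python
import re
from itertools import count

def title_key(title):
    """Concatenate all title words, lowercased, with punctuation removed."""
    return "".join(re.findall(r"[a-z0-9]+", title.lower()))

def citation_keys(authors, title):
    """One authorkey-titlekey per author word (initials skipped: an assumption)."""
    tkey = title_key(title)
    words = re.findall(r"[a-z]+", " ".join(authors).lower())
    return {f"{word}-{tkey}" for word in words if len(word) > 1}

class GroupIndex:
    """Hypothetical in-memory map from keys to Group IDs."""

    def __init__(self):
        self.key_to_gid = {}
        self.members = {}          # Group ID -> list of Citation IDs
        self._next_gid = count(1)

    def assign(self, cid, authors, title):
        keys = citation_keys(authors, title)
        # Reuse an existing group if at least one key matches.
        gid = next(
            (self.key_to_gid[k] for k in sorted(keys) if k in self.key_to_gid),
            None,
        )
        if gid is None:
            gid = next(self._next_gid)   # otherwise create a new Group ID
        for k in keys:
            self.key_to_gid.setdefault(k, gid)
        self.members.setdefault(gid, []).append(cid)
        return gid
```

With the slide's example, `citation_keys(["C. Lee Giles", "S. Lawrence"], "Good Paper Title")` produces the three keys listed above, and a later citation that lists only Lawrence with the same title lands in the same group through the shared `lawrence-goodpapertitle` key.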


Linking Citations to Documents

Citation ID -> Group ID
– We just saw that …

Document ID -> Group ID
– Based on the document’s metadata, generate authorkey-titlekey in the same way and try to match a Group ID key generated from the citations
– Document metadata can be erroneous, so successful mapping often happens AFTER correction by users
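
To illustrate why the mapping often succeeds only after a user correction, here is a small sketch of linking a document to a group by exact key match. The function names, the dictionary layout of the metadata, and the Group ID value are assumptions made for this example; as before, skipping single-letter initials is inferred from the example keys rather than stated by the source.

```python
import re

def doc_keys(authors, title):
    """Build authorkey-titlekey strings the same way as for citations."""
    tkey = "".join(re.findall(r"[a-z0-9]+", title.lower()))
    words = re.findall(r"[a-z]+", " ".join(authors).lower())
    return {f"{word}-{tkey}" for word in words if len(word) > 1}

def link_document(doc_meta, key_to_gid):
    """Return the Group ID whose key matches the document, or None."""
    for key in sorted(doc_keys(doc_meta["authors"], doc_meta["title"])):
        if key in key_to_gid:
            return key_to_gid[key]
    return None

# Keys previously generated from citations (hypothetical Group ID 7).
key_to_gid = {
    "giles-goodpapertitle": 7,
    "lee-goodpapertitle": 7,
    "lawrence-goodpapertitle": 7,
}

extracted = {"authors": ["C. Lee Giles"], "title": "Good Papr Title"}   # parser typo
corrected = {"authors": ["C. Lee Giles"], "title": "Good Paper Title"}  # after user fix
```

Because the titlekey is a concatenation of all title words, the single typo in `extracted` changes every key, so `link_document` finds no group until the title is corrected.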


Problems of the Current Approach

There is no guarantee that the most similar citation contains the best metadata

Building citation graph is a time-intensive, offline task

Due to batch clustering, the addition of a single citation requires rebuilding the entire citation graph to include the new instance

The so-called canonical metadata is fixed to the document record


Goals of the New Citation Management System

Provide better document metadata

Reduce the cost of maintenance

Use on-line citation matching such that the citation graph environment can be adjusted immediately based on a single new citation

Provide a fluid framework for building canonical metadata in which all evidence is always considered

Allow the development of flexible APIs into the CiteSeer citation graph system

Maintain data security despite an open, wiki-like approach to user-contributed metadata changes

Provide better citation matching compared to the current system


Prototype Overview

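[Figure: prototype architecture. Component labels on the diagram: Document Metadata Index; Citation Metadata Index; Citation Resolver; Citation Metadata (XML); Document Metadata (XML); Query Handler; Edge DB (SQL); Query. Annotation on the diagram: "May ultimately be located in separate service".]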


Edge DB

One simple table containing one edge per row:
– Id: citation handle (equivalent to CID)
– citingDoc: citing document handle
– citedDoc: cited document handle

Row-level locking
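
A rough sketch of such an edge table is shown below. The column names follow the slide; everything else (the table name `edge`, the index, the helper functions, the handle strings) is assumed for illustration. SQLite is used only to keep the example self-contained, and it does not provide row-level locking, so that aspect of the design is not demonstrated here.

```python
import sqlite3

def open_edge_db():
    """Create a hypothetical edge table with one citation edge per row."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """CREATE TABLE edge (
               id        TEXT PRIMARY KEY,   -- citation handle (like a CID)
               citingDoc TEXT NOT NULL,      -- citing document handle
               citedDoc  TEXT NOT NULL       -- cited document handle
           )"""
    )
    db.execute("CREATE INDEX edge_cited ON edge (citedDoc)")
    return db

def add_edge(db, cid, citing_doc, cited_doc):
    """Insert or update a single edge without touching the rest of the graph."""
    db.execute(
        "INSERT OR REPLACE INTO edge (id, citingDoc, citedDoc) VALUES (?, ?, ?)",
        (cid, citing_doc, cited_doc),
    )

def cited_by(db, doc):
    """Handles of all documents that cite the given document."""
    rows = db.execute(
        "SELECT citingDoc FROM edge WHERE citedDoc = ? ORDER BY citingDoc", (doc,)
    )
    return [row[0] for row in rows]
```

Because each citation is a single row, adding or re-resolving one citation is a one-row write, which fits the stated goal of adjusting the citation graph immediately based on a single new citation.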


Matching citations and docs

Exact string match across disparate metadata fields is way too optimistic - need better matching criteria

Lucene provides two methods out of the box:
– Match based on Levenshtein distance
• Specify arbitrary distance cut-off per field
• Choose most similar match out of returned set
– Cut out the middleman - similarity-based matching
• Specify arbitrary similarity threshold
• Choose most similar match out of returned set over threshold

Criteria to be determined through empirical tests using prototype system.
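
The two selection strategies can be illustrated with a plain-Python stand-in. This is not Lucene's API or its scoring: the edit distance is a textbook dynamic-programming implementation, the similarity is simply one minus the length-normalized distance, and the function names and the threshold values are assumptions chosen for the example.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Length-normalized similarity in [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def match_by_distance(query, candidates, max_distance):
    """Closest candidate within a per-field distance cut-off, else None."""
    scored = [(levenshtein(query.lower(), c.lower()), c) for c in candidates]
    scored = [s for s in scored if s[0] <= max_distance]
    return min(scored)[1] if scored else None

def match_by_similarity(query, candidates, threshold):
    """Most similar candidate at or above the threshold, else None."""
    scored = [(similarity(query.lower(), c.lower()), c) for c in candidates]
    scored = [s for s in scored if s[0] >= threshold]
    return max(scored)[1] if scored else None
```

In a per-field setup, one would call these separately for title and author strings with different cut-offs; which values work well is exactly what the empirical tests with the prototype are meant to determine.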