Naive Algorithms for Key-phrase Extraction and Text Summarization from a Single Document inspired by the Protein Biosynthesis Process

Daniel Gayo Avello (University of Oviedo)

What’s the problem?

• Document reading is a time consuming task…

• Many common documents (e.g., e-mail, newsgroup posts, web pages) lack an abstract or keywords…

• But they are “electronic”, so we can work on them in some way…


What’s the problem? (cont.)

• Many techniques exist to perform useful Natural Language Processing (NLP) tasks:

– Language identification.

– Document categorization and clustering.

– Keyword extraction.

– Text summarization.

• They are quite different:

– With/Without human supervision.

– With/Without training.

– With/Without complex linguistic data.

– With/Without document corpora.


Any suggestion?

• It would be great to use only one technique to carry out several of those tasks.

• Desirable goals:
– Simple (only free text, no linguistic data).
– Fully automatic (neither supervision nor ad hoc heuristics).
– Scalable (from one web page to several web sites).

• Could it be a bio-inspired solution?


Our (bio-inspired) hypothesis

• Living beings are defined by their genome.
• Document from a corpus ≈ individual from a population.
• So…?
• Let’s imagine a “document genome”…
– Similar documents (similar language/topic) → similar genomes.
– More interestingly: translation from “document genome” to “significance proteins” (i.e., keyphrases and summaries).


Our biological inspiration

• The protein biosynthesis process…

[Figure: DNA is copied into a single-stranded mRNA molecule (transcription), e.g. mRNA AUGAUGCCGGGUUACUAAUAAUAC. Through initiation, elongation and termination, amino acids are assembled into a polypeptide chain, and a folding process turns the chain into a protein folded into a 3D structure.]

Could we mimic this to distill keyphrases and summaries from a single document?!

The “ingredients”…

Biological element    Computational “counterpart”
tRNA                  Spliced document “genome”
mRNA                  Document’s plain text
Ribosome              Algorithm
Polypeptide chain     Document chunks with significance weights
Protein               Keyphrases


A “DNA” for Natural Language?

• n-grams (slices of n adjoining characters).

• Frequency is not the most relevant weight for an n-gram.

• Several measures exist to quantify the relation between the two elements of a bigram:
– Mutual information.
– Dice coefficient.
– Loglike.
– …

• They cannot be applied straightforwardly to n-grams…

• …but they can be generalized (Ferreira and Pereira, 1999).
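As an illustration of the bigram measures listed above, here is a minimal Python sketch that scores every pair of adjoining characters with pointwise mutual information and the Dice coefficient. The function name `bigram_association` and the choice of natural logarithm are assumptions made for this example; the slides do not say how the measures were computed.

```python
import math
from collections import Counter
from typing import Dict, Tuple


def bigram_association(text: str) -> Dict[str, Tuple[float, float]]:
    """Return {bigram: (mutual_information, dice)} for character bigrams.

    Hypothetical helper: textbook definitions, not the author's code.
    """
    chars = Counter(text)                                        # f(x)
    pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))  # f(xy)
    n_chars = sum(chars.values())
    n_pairs = sum(pairs.values())

    scores = {}
    for pair, f_xy in pairs.items():
        x, y = pair
        p_xy = f_xy / n_pairs
        p_x = chars[x] / n_chars
        p_y = chars[y] / n_chars
        # Pointwise mutual information: log of observed vs. expected probability.
        mi = math.log(p_xy / (p_x * p_y))
        # Dice coefficient: pair frequency relative to the element frequencies.
        dice = 2 * f_xy / (chars[x] + chars[y])
        scores[pair] = (mi, dice)
    return scores


if __name__ == "__main__":
    for pair, (mi, dice) in sorted(bigram_association("the rain in spain").items()):
        print(repr(pair), round(mi, 3), round(dice, 3))
```

Both measures compare how often two characters appear together with how often they appear at all, which is why they cannot be used on longer n-grams without some generalization.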


A “DNA” for Natural Language? (cont.)

[Figure: assigning weights to n-grams.]

Original document: The rain in Spain stays mainly in the plain.

n-grams: < in > < mai> < pla> < rai> < Spa> < sta> < the> <ain > <ainl> <ays > <e pl> <e ra>…

n-gram    Relative frequency    Fair Specific Mutual Information
<inly>    0.025                 1.975
<Spai>    0.025                 2.013
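The weighting step can be sketched in Python as follows. The n-gram extraction and relative frequencies follow directly from the slide. The weight formula is an assumption: it uses one common form of the "fair" generalization of specific mutual information, SI_f(g) = log(p(g) / Avp), where Avp averages p(prefix) · p(suffix) over every split point of the n-gram. The function names, the natural logarithm and the treatment of whitespace and case are hypothetical choices, so the values need not match those on the slide.

```python
import math
from collections import Counter
from typing import Dict


def char_ngrams(text: str, n: int) -> Counter:
    """Count every slice of n adjoining characters."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def relative_frequencies(text: str, n: int) -> Dict[str, float]:
    """Relative frequency of each n-gram among all n-grams of the text."""
    counts = char_ngrams(text, n)
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def fair_specific_mutual_information(text: str, n: int) -> Dict[str, float]:
    """Assumed formula: SI_f(g) = log(p(g) / Avp), requires n >= 2."""
    # Probabilities of all substrings of length 1..n.
    probs = {k: relative_frequencies(text, k) for k in range(1, n + 1)}
    weights = {}
    for gram, p in probs[n].items():
        # Average, over all n-1 split points, of p(prefix) * p(suffix).
        avp = sum(
            probs[i][gram[:i]] * probs[n - i][gram[i:]] for i in range(1, n)
        ) / (n - 1)
        weights[gram] = math.log(p / avp)
    return weights


if __name__ == "__main__":
    sentence = "The rain in Spain stays mainly in the plain."
    freqs = relative_frequencies(sentence, 4)
    weights = fair_specific_mutual_information(sentence, 4)
    for gram in ("inly", "Spai"):
        print(repr(gram), round(freqs[gram], 3), round(weights[gram], 3))
```

With n = 4 the example sentence yields 41 n-grams, so an n-gram that occurs once has a relative frequency of 1/41 ≈ 0.024. Frequency alone therefore cannot tell <inly> and <Spai> apart, whereas an association-based weight can.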


Document genome translation

The rain in Spain stays mainly in the plain.

[Figure: pseudo-tRNA n-grams with their weights: The- (20), he-r (29), e-ra (24), etc. Attached one after another to the pseudo-mRNA, they accumulate significance: “The ” (20), “The r” (49), “The ra” (73), etc.]

• So…
– “Document genome” spliced into “pseudo-tRNA”.
– Document used as “pseudo-mRNA”.
– We “attach” pseudo-tRNA “molecules” (with max. weight) to the document while the average significance per character keeps growing.

• Result: document spliced into “chunks” with maximum average significance.

TheraininSpainstays mainly inthe plain
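The translation step described above might be sketched like this in Python. It is one possible reading of the slide, not the author's implementation: the text is scanned left to right, each newly attached overlapping n-gram extends the chunk by one character, and the chunk is closed as soon as the average significance per character stops growing. The function name `split_into_chunks`, the left-to-right greedy strategy, the reading of "-" as a space, and the weights of the last two n-grams in the usage example are all assumptions.

```python
from typing import Dict, List


def split_into_chunks(text: str, weights: Dict[str, float], n: int = 4) -> List[str]:
    """Greedy chunking (hypothetical sketch).

    Extend the current chunk with the next overlapping n-gram while
    (accumulated weight / chunk length) keeps increasing.
    """
    chunks = []
    start = 0
    while start < len(text):
        if start + n > len(text):
            # Fewer than n characters left: emit them as a final chunk.
            chunks.append(text[start:])
            break
        total = weights.get(text[start:start + n], 0.0)
        length = n
        while start + length < len(text):
            # The next overlapping n-gram ends one character further right.
            next_gram = text[start + length - n + 1:start + length + 1]
            new_total = total + weights.get(next_gram, 0.0)
            if new_total / (length + 1) <= total / length:
                break  # average significance per character stopped growing
            total, length = new_total, length + 1
        chunks.append(text[start:start + length])
        start += length
    return chunks


if __name__ == "__main__":
    # The first three weights come from the slide; the last two are invented.
    demo_weights = {"The ": 20, "he r": 29, "e ra": 24, " rai": 1, "rain": 1}
    print(split_into_chunks("The rain", demo_weights))
```

With the slide's weights the accumulated significance goes 20, 49, 73 while the chunk grows from 4 to 6 characters, so the average per character rises from 5.0 to 9.8 to about 12.2; with the invented low weight for the next n-gram, the average would drop and the chunk is closed.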


Folding the “protein” / summarization (work at an early stage)

• To obtain keyphrases, the “protein” (text chunks) must be folded…

• We are currently studying different alternatives:
– Mutual reinforcement?
– Chunks ≈ documents → apply classical IR techniques?
– Others?

• Automatic text summarization:
– Simple but useful approach.
– Use the shortest paragraphs with the most significant keyphrases.
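The summarization heuristic (shortest paragraphs with the most significant keyphrases) could be sketched as below. The slides do not define how length and significance are combined, so the density score used here (summed keyphrase weight divided by paragraph length) is an assumption, as are the function name `summarize`, the case-insensitive substring matching and the `max_paragraphs` parameter.

```python
from typing import Dict, List


def summarize(
    paragraphs: List[str],
    keyphrases: Dict[str, float],
    max_paragraphs: int = 2,
) -> List[str]:
    """Pick the paragraphs with the highest keyphrase weight per character.

    Hypothetical scoring: favors short paragraphs that contain
    highly weighted keyphrases. Selected paragraphs keep document order.
    """
    def score(paragraph: str) -> float:
        if not paragraph:
            return 0.0
        lowered = paragraph.lower()
        significance = sum(
            weight for phrase, weight in keyphrases.items()
            if phrase.lower() in lowered
        )
        return significance / len(paragraph)

    ranked = sorted(
        range(len(paragraphs)),
        key=lambda i: score(paragraphs[i]),
        reverse=True,
    )
    chosen = sorted(ranked[:max_paragraphs])  # restore original order
    return [paragraphs[i] for i in chosen]


if __name__ == "__main__":
    docs = [
        "Rain falls in Spain.",
        "A very long paragraph that mentions rain only once among many other unrelated words and topics.",
        "Nothing relevant here.",
    ]
    print(summarize(docs, {"rain": 2.0, "Spain": 1.5}, max_paragraphs=1))
```

Dividing by length is only one way to express a preference for short paragraphs; a hard length cutoff or a two-key sort would be equally consistent with the slide.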


• To test the feasibility of these ideas, a prototype was developed.

• blindLight – http://www.purl.org/NET/blindLight

• It receives a user-provided URL and produces:

– A “blindlighted” version of the original URL.

– A list of keyphrases.

– An automatic summary.


Conclusions

• Proof-of-concept tests have been performed:
– Details in the paper…
– Results can be improved.
– Thorough study and analysis is needed.
– Really promising!

• Summary of the proposal:
1. Free text from just one document.
2. Language independent (currently only western languages).
3. Bio-inspired.
4. Extremely simple to implement.


Merci beaucoup! ¡Muchas gracias! Thank you!