Naive Algorithms for Key-phrase Extraction and Text Summarization from a Single Document inspired by the Protein Biosynthesis Process
Daniel Gayo Avello (University of Oviedo)
What’s the problem?
• Document reading is a time-consuming task…
• Many common documents (e.g., e-mail, newsgroup posts, web pages) lack an abstract or keywords…
• But, they are “electronic” so we can work on them in some way…
What’s the problem? (cont.)
• Many techniques exist to perform useful Natural Language Processing (NLP) tasks:
– Language identification.
– Document categorization and clustering.
– Keyword extraction.
– Text summarization.
• They are quite different:
– With/without human supervision.
– With/without training.
– With/without complex linguistic data.
– With/without document corpora.
Any suggestions?
• It would be great to use only one technique to carry out several of those tasks.
• Desirable goals:
– Simple (only free text, no linguistic data).
– Fully automatic (neither supervision nor ad hoc heuristics).
– Scalable (from one web page to several web sites).
• Could it be a bio-inspired solution?
Our (bio-inspired) hypothesis
• Living beings are defined by their genome.
• A document from a corpus ≈ an individual from a population.
• So…?
• Let’s imagine a “document genome”…
– Similar documents (similar language/topic) → similar genomes.
– More interesting: translation from “document genome” to “significance proteins” (i.e., keyphrases and summaries).
Our biological inspiration
• The protein biosynthesis process…
[Figure: protein biosynthesis. Transcription: DNA is copied into a single-stranded mRNA molecule (mRNA: AUGAUGCCGGGUUACUAAUAAUAC). Translation (initiation, elongation, termination): amino acids are assembled into a polypeptide chain. Folding process: the protein is folded into a 3D structure.]
Could we mimic this to distill keyphrases and summaries from a single document?
The “ingredients”…
Biological element    Computational “counterpart”
tRNA                  Spliced document “genome”
mRNA                  Document’s plain text
Ribosome              Algorithm
Polypeptide chain     Document chunks with significance weights
Protein               Keyphrases
A “DNA” for Natural Language?
• n-grams (slices of n adjoining characters).
• Frequency is not the most relevant weight for each n-gram.
• Several measures quantify the association between the two elements of a bigram:
– Mutual information.
– Dice coefficient.
– Log-likelihood.
– …
• They cannot be applied directly to n-grams…
• …but they can be generalized (Ferreira and Pereira, 1999).
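As a minimal sketch of the first ingredient, the following Python extracts character n-grams and their relative frequencies. The function names are hypothetical, and dropping the final period from the example sentence is an assumption made here, not something the slides state.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> list[str]:
    """Return every slice of n adjoining characters, in document order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def relative_frequencies(text: str, n: int) -> dict[str, float]:
    """Map each distinct n-gram to its count divided by the total number of n-grams."""
    grams = char_ngrams(text, n)
    total = len(grams)
    return {gram: count / total for gram, count in Counter(grams).items()}


# Assumption: the trailing period is dropped before slicing.
sentence = "The rain in Spain stays mainly in the plain"
freqs = relative_frequencies(sentence, 4)
print(len(char_ngrams(sentence, 4)))  # 40 four-character n-grams
print(freqs["Spai"])                  # 1 / 40 = 0.025
```

Under that assumption an n-gram that occurs once among the 40 slices gets a relative frequency of 0.025, while " in " (two occurrences) gets 0.05.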
A “DNA” for Natural Language? (cont.)
Original document: The rain in Spain stays mainly in the plain.
n-grams: < in > < mai> < pla> < rai> < Spa> < sta> < the> <ain > <ainl> <ays > <e pl> <e ra>…
Assigning weights to n-grams:
n-gram    Relative frequency    Fair Specific Mutual Information
<inly>    0.025                 1.975
<Spai>    0.025                 2.013
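The slides do not give the formula behind these weights. The sketch below shows one plausible reading of a mutual-information measure generalized to n-grams: the product of the two bigram marginals is replaced by the average, over every split point, of the product of the probabilities of the left and right parts. That averaging rule, the natural logarithm, and the function names are all assumptions, so the code is not expected to reproduce the exact values shown above.

```python
import math
from collections import Counter


def ngram_probabilities(text: str, max_n: int) -> dict[str, float]:
    """Relative frequency of every character n-gram of length 1..max_n."""
    probs: dict[str, float] = {}
    for n in range(1, max_n + 1):
        grams = [text[i:i + n] for i in range(len(text) - n + 1)]
        total = len(grams)
        for gram, count in Counter(grams).items():
            probs[gram] = count / total
    return probs


def generalized_mutual_information(gram: str, probs: dict[str, float]) -> float:
    """Hypothetical generalization: log of p(gram) over the average split product."""
    n = len(gram)
    # Average p(left) * p(right) over the n - 1 ways of cutting the n-gram in two.
    avg_split = sum(probs[gram[:i]] * probs[gram[i:]] for i in range(1, n)) / (n - 1)
    return math.log(probs[gram] / avg_split)


probs = ngram_probabilities("The rain in Spain stays mainly in the plain", 4)
weight = generalized_mutual_information("Spai", probs)
```

For a bigram the average has a single term, so the measure reduces to ordinary pointwise mutual information.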
Document genome translation
Original document: The rain in Spain stays mainly in the plain.
pseudo-tRNA (n-gram, weight; “-” marks a space): <The-> 20, <he-r> 29, <e-ra> 24
pseudo-mRNA (text covered so far, accumulated weight): “The ” 20, “The r” 49, “The ra” 73, etc.
• So…
– “Document genome” spliced into “pseudo-tRNA”.
– Document used as “pseudo-mRNA”.
– We “attach” pseudo-tRNA “molecules” (with max. weight) to the document while the average significance per character keeps growing.
• Result: document spliced into “chunks” with maximum average significance.
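[Figure: the example sentence split into chunks; the chunk boundaries are not recoverable from the extracted text: “TheraininSpainstays mainly inthe plain”]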
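The translation step can be sketched as a greedy scan. This is a simplified reading of the slide, not the author's implementation: it walks the text left to right, extends the current chunk by one overlapping n-gram at a time, and closes the chunk as soon as the average weight per character would stop growing. The function name, the left-to-right order, and the zero weight for unknown n-grams are assumptions; the three weights come from the slide's example.

```python
def split_into_chunks(text: str, weights: dict[str, float], n: int = 4) -> list[str]:
    """Greedily cut text into chunks whose average weight per character keeps growing."""
    chunks = []
    start = 0
    while start < len(text):
        # A chunk starts with the n-gram at the current position.
        end = min(start + n, len(text))
        total = weights.get(text[start:end], 0.0)
        while end < len(text):
            # The next overlapping n-gram adds exactly one character to the chunk.
            gain = weights.get(text[end - n + 1:end + 1], 0.0)
            if (total + gain) / (end + 1 - start) <= total / (end - start):
                break  # the average significance would stop growing
            total += gain
            end += 1
        chunks.append(text[start:end])
        start = end
    return chunks


weights = {"The ": 20, "he r": 29, "e ra": 24}
print(split_into_chunks("The ra", weights))  # → ['The ra']
```

With these weights the accumulated totals are 20, 49 and 73, so the average per character grows from 5 to 9.8 to about 12.2 and the three n-grams end up in one chunk.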
Folding the “protein” / summarization
Work at an early stage
• To obtain keyphrases, the “protein” (text chunks) must be folded…
• At this moment we are studying different alternatives:
– Mutual reinforcement?
– Chunks ≈ documents → apply classical IR techniques?
– Others?
• Automatic text summarization:
– Simple but useful approach.
– Use the shortest paragraphs with the most significant keyphrases.
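The summarization idea (short paragraphs with the most significant keyphrases) can be illustrated with a small sketch. The slides give no scoring formula, so dividing the summed keyphrase weights by paragraph length is an assumption, as are the function name, the case-insensitive matching, and the sample weights.

```python
def summarize(paragraphs: list[str], keyphrases: dict[str, float],
              max_paragraphs: int = 2) -> list[str]:
    """Keep the paragraphs with the highest keyphrase weight per character."""

    def score(paragraph: str) -> float:
        if not paragraph:
            return 0.0
        lowered = paragraph.lower()
        hits = sum(w for phrase, w in keyphrases.items() if phrase.lower() in lowered)
        # Dividing by length favours short paragraphs, as the slide suggests.
        return hits / len(paragraph)

    ranked = sorted(range(len(paragraphs)),
                    key=lambda i: score(paragraphs[i]), reverse=True)
    # Restore document order for the selected paragraphs.
    return [paragraphs[i] for i in sorted(ranked[:max_paragraphs])]


paragraphs = [
    "The rain in Spain stays mainly in the plain.",
    "A much longer paragraph that mentions rain only once among many unrelated details.",
    "Nothing relevant here.",
]
summary = summarize(paragraphs, {"rain": 2.0, "Spain": 3.0}, max_paragraphs=1)
```

Here the first paragraph wins because it contains both keyphrases and is short.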
• To test the feasibility of these ideas, a prototype was developed.
• blindLight – http://www.purl.org/NET/blindLight
• It receives a user-provided URL and produces:
– A “blindlighted” version of the original URL.
– A list of keyphrases.
– An automatic summary.
Conclusions
• Proof-of-concept tests have been performed:
– Details in the paper…
– Results can be improved.
– Thorough study and analysis is needed.
– Really promising!
• Summary of the proposal:
1. Free text from just one document.
2. Language independent (currently only western languages).
3. Bio-inspired.
4. Extremely simple to implement.
Merci beaucoup! ¡Muchas gracias! Thank you!