NLP and Text Mining: an Introduction
Matteo Romanello (DAI/KCL)
Histore Workshop – IHR – June 21, 2012
Introduction
Basic Concepts
Section 1
Introduction
me
- BA Classics (Greek Literature and Philology)
- MA Digital Humanities (Univ. of Venice)
- e-journals in Classics
- Currently:
  - PhD in Digital Humanities, King's College London
    - information extraction from secondary sources
  - Research Associate at the German Archaeological Institute (Berlin)
  - Digital Infrastructure for Research in the Arts and Humanities (DARIAH)
What and Why?
NLP Methods
< 1990s
- rely heavily on hand-coded rules
- extract named entities with regexps
- grammars, parsing, etc.
- top-down
- hardly scalable

>= 1990s
- emphasis on statistically based approaches
- machine learning
- bottom-up
- scalable
NLP in DH
I increasing need for mediation of NLP knowledgeI adoption and appropriation of technology need
I understanding of technologyI familiarising with
I JargonI to code or not to code?I basic concepts
I understanding a fieldI evolving quicklyI with a growing body of literatureI highly specialised
(some of the main) NLP Tasks

Speech Processing
- Machine Translation
- Speech Synthesis

Information Extraction
- Named Entity Extraction
- Named Entity [Classification | Resolution]
- Relationship Extraction
- Co-reference Resolution

Text Classification
- Sentiment Analysis
- Topic Modelling
My playlist of NLP frameworks
- Voyeur/Voyant Tools [web-based]
  - reading, text analysis
  - text visualisation
- Natural Language Toolkit (NLTK) [Python]
- General Architecture for Text Engineering (GATE, Uni Sheffield) [Java]
- LingPipe [Java]
- OpenNLP (Apache Foundation) [Java]
Challenges for NLP in DH
- tools do not always work straight out of the box
- issues with:
  - character encoding (despite Unicode)
  - output of OCR on historical documents
  - normalisation and pre-processing
- lack of ad-hoc resources:
  - datasets for training, testing and evaluation
  - dictionaries and gazetteers
  - previous results for comparison
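The "character encoding (despite Unicode)" point can be made concrete with a small normalisation sketch using only the standard library: the "same" accented word can be encoded as one precomposed code point or as a base letter plus a combining accent, and OCR output and legacy corpora often mix the two. The example word is an illustrative assumption.

```python
import unicodedata

# Two Unicode encodings of the same visible string 'café':
precomposed = "caf\u00e9"   # é as a single code point (U+00E9)
decomposed = "cafe\u0301"   # e + combining acute accent (U+0301)

print(precomposed == decomposed)  # False: the raw strings differ

def normalise(text):
    """Normalise to NFC so equivalent strings compare equal."""
    return unicodedata.normalize("NFC", text)

print(normalise(precomposed) == normalise(decomposed))  # True after NFC
```

Normalising every input to one form (here NFC) before tokenisation is a typical first pre-processing step, precisely because downstream tools silently treat the two encodings as different words.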
Section 2
Basic Concepts
Machine Learning

Supervised
- a model is learned from training data

Models
- Hidden Markov Model
- Support Vector Machine
- Conditional Random Fields

Applications
- sequence labelling

Unsupervised
- data are fitted to a model

Models
- Clustering
- Latent Dirichlet Allocation
- Latent Semantic Indexing

Applications
- document clustering
- topic modelling
Machine Learning Cycle (Sequence Labelling)
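The supervised cycle for sequence labelling (train on labelled data, predict on unseen data, evaluate) can be sketched with a deliberately trivial model: a most-frequent-tag baseline rather than an HMM or CRF. The tiny corpus and tag set are illustrative assumptions.

```python
from collections import Counter, defaultdict

# Toy labelled corpus: each sentence is a list of (word, tag) pairs.
training_data = [
    [("Paris", "PLACE"), ("is", "O"), ("lovely", "O")],
    [("Caesar", "PERSON"), ("entered", "O"), ("Rome", "PLACE")],
    [("Paris", "PLACE"), ("fell", "O")],
]

def train(sentences):
    """Learn each word's most frequent tag from the training data."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def predict(model, words):
    """Tag each word with its most frequent training tag; unknown words get 'O'."""
    return [model.get(word, "O") for word in words]

model = train(training_data)
test_sentence = [("Caesar", "PERSON"), ("left", "O"), ("Paris", "PLACE")]
predicted = predict(model, [w for w, _ in test_sentence])
gold = [t for _, t in test_sentence]
accuracy = sum(p == g for p, g in zip(predicted, gold)) / len(gold)
print(predicted, accuracy)
```

Real sequence labellers (HMMs, CRFs) improve on this baseline by using context, but the cycle — annotate, train, predict, evaluate — is exactly the same.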
Evaluation
- TP, FP, TN, FN are defined in relation to a specific task
- applicable to tasks where what we are looking for is known (quantifiable)
- Information Retrieval: retrieving information relevant to a given search query
  - TP (True Positives): relevant docs we expected to show up and that did show up (relevant, present)
  - FP (False Positives): docs we did not expect to show up but that showed up (not relevant, present)
  - TN (True Negatives): non-relevant docs we did not expect to show up and that did not show up (not relevant, missing)
  - FN (False Negatives): relevant docs we expected to show up but that did not show up (relevant, missing)
Evaluation Metrics
- precision = TP / (TP + FP)
- recall = TP / (TP + FN)
- accuracy = (TP + TN) / (TP + TN + FP + FN)
- f-score = 2 · (precision · recall) / (precision + recall)
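The four metrics can be computed directly from the raw counts. The counts below are an illustrative assumption: a search that returned 8 documents, 6 of them relevant, out of 10 relevant documents in a 100-document collection.

```python
# Confusion-matrix counts for the hypothetical search above.
tp, fp = 6, 2   # returned: relevant / not relevant
fn = 4          # relevant but not returned
tn = 88         # not relevant and not returned

precision = tp / (tp + fp)                    # 0.75
recall = tp / (tp + fn)                       # 0.6
accuracy = (tp + tn) / (tp + tn + fp + fn)    # 0.94
f_score = 2 * precision * recall / (precision + recall)

print(precision, recall, accuracy, f_score)
```

Note how accuracy (0.94) is flattered by the many true negatives, while the f-score (about 0.67) reflects the balance of precision and recall — which is why IR evaluation usually reports precision/recall/f rather than accuracy.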
Topic Modelling
M. Jockers, "The LDA Buffet is Now Open; or, Latent Dirichlet Allocation for English Majors"
Key concepts
- the algorithm extracts topics and their representative words
- the human interpreter eventually assigns a name/label to each topic
- the number of topics is decided a priori
- each doc has a different percentage of each of the topics
- diachronic/synchronic exploration of topics
TM frameworks
I Mallet (Java)
I Gensim (Python)
I Stanford Topic Modelling Toolbox
Topic Modelling (cont'd)
https://dhs.stanford.edu/algorithmic-literacy/my-definition-of-topic-modeling/

Martha Ballard's Diary
http://historying.org/2010/04/01/topic-modeling-martha-ballards-diary/

Thematic Index of Classics in JSTOR
http://catalog.perseus.tufts.edu/jstor/

Comprehending the Digital Humanities
https://dhs.stanford.edu/comprehending-the-digital-humanities/
Recommended