Slide 1: Incorporating N-gram Statistics in the Normalization of Clinical Notes
By Bridget Thomson McInnes
Slide 2: Overview
• Ngrams
• Ngram Statistics for Spelling Correction
• Spelling Correction
• Ngram Statistics for Multi Term Identification
• Multi Term Identification
Slide 3: Ngram

Example sentence:
Her dobutamine stress echo showed mild aortic stenosis with a subaortic gradient.

Bigrams:
her dobutamine, dobutamine stress, stress echo, echo showed, showed mild, mild aortic, aortic stenosis, stenosis with, with a, a subaortic, subaortic gradient

Trigrams:
her dobutamine stress, dobutamine stress echo, stress echo showed, echo showed mild, showed mild aortic, mild aortic stenosis, aortic stenosis with, stenosis with a, with a subaortic, a subaortic gradient
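The extraction above can be sketched in a few lines of Python (a minimal illustration, not the author's actual code):

```python
# Extract word-level ngrams from the example sentence.
sentence = "her dobutamine stress echo showed mild aortic stenosis with a subaortic gradient"
words = sentence.split()

def ngrams(tokens, n):
    """Return the list of n-word ngrams as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams(words, 2)   # a 12-word sentence yields 11 bigrams
trigrams = ngrams(words, 3)  # and 10 trigrams
```

In general a sentence of N words yields N-1 bigrams and N-2 trigrams, which is why the lists above have 11 and 10 entries.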
Slide 4: Contingency Tables

                word 2     !word 2
    word 1       n11        n12        n1p
    !word 1      n21        n22        n2p
                 np1        np2        npp

• n11 = the joint frequency of word 1 and word 2
• n12 = the frequency word 1 occurs and word 2 does not
• n21 = the frequency word 2 occurs and word 1 does not
• n22 = the frequency neither word 1 nor word 2 occurs
• npp = the total number of ngrams
• n1p, np1, np2, n2p are the marginal counts
Slide 5: Contingency Tables

                echo       !echo
    stress        1          0          1
    !stress       0         10         10
                  1         10         11

Bigram frequencies from the example sentence:
her dobutamine 1, dobutamine stress 1, stress echo 1, echo showed 1, showed mild 1, mild aortic 1, aortic stenosis 1, stenosis with 1, with a 1, a subaortic 1, subaortic gradient 1
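Filling the 2x2 table for "stress echo" from these bigram frequencies could look like the following sketch (variable names are illustrative):

```python
# Bigram frequencies from the example sentence (each occurs once).
counts = {("her", "dobutamine"): 1, ("dobutamine", "stress"): 1,
          ("stress", "echo"): 1, ("echo", "showed"): 1, ("showed", "mild"): 1,
          ("mild", "aortic"): 1, ("aortic", "stenosis"): 1,
          ("stenosis", "with"): 1, ("with", "a"): 1,
          ("a", "subaortic"): 1, ("subaortic", "gradient"): 1}

w1, w2 = "stress", "echo"
npp = sum(counts.values())                               # total bigrams: 11
n11 = counts.get((w1, w2), 0)                            # joint frequency
n1p = sum(c for (a, _), c in counts.items() if a == w1)  # w1 as first word
np1 = sum(c for (_, b), c in counts.items() if b == w2)  # w2 as second word
n12, n21 = n1p - n11, np1 - n11                          # one word without the other
n22 = npp - n1p - np1 + n11                              # neither word occurs
```

The four cell values recover exactly the table on this slide: 1, 0, 0 and 10, with npp = 11.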
Slide 6: Contingency Tables – Expected Values

• Expected values:
• m11 = (np1 * n1p) / npp
• m12 = (np2 * n1p) / npp
• m21 = (np1 * n2p) / npp
• m22 = (np2 * n2p) / npp
Slide 7: Contingency Tables

• Expected values for the stress/echo table:
• m11 = ( 1 *  1) / 11 = 0.09
• m12 = ( 1 * 10) / 11 = 0.91
• m21 = ( 1 * 10) / 11 = 0.91
• m22 = (10 * 10) / 11 = 9.09

What is this telling you? 'stress echo' occurs once in our example, but its expected frequency, if the two words were independent, is only 0.09 (m11).
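The expected values on this slide can be checked directly (a minimal sketch using the stress/echo counts from the earlier slide):

```python
# Observed counts for "stress echo".
n11, n12, n21, n22 = 1, 0, 0, 10
n1p, n2p = n11 + n12, n21 + n22   # row totals: 1, 10
np1, np2 = n11 + n21, n12 + n22   # column totals: 1, 10
npp = n1p + n2p                   # 11

# Expected values under independence.
m11 = np1 * n1p / npp   # 0.09
m12 = np2 * n1p / npp   # 0.91
m21 = np1 * n2p / npp   # 0.91
m22 = np2 * n2p / npp   # 9.09
```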
Slide 8: Ngram Statistics – Measures of Association
• Log Likelihood Ratio
• Chi Squared Test
• Odds Ratio
• Phi Coefficient
• T-Score
• Dice Coefficient
• True Mutual Information
Slide 9: Log Likelihood Ratio

Log Likelihood = 2 * ∑ ( nij * log( nij / mij ) )

The log likelihood ratio measures the difference between the observed values and the expected values. It is twice the sum, over the cells of the contingency table, of each observed value times the log of the ratio of the observed to the expected value.
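A minimal sketch of the formula for the stress/echo table; natural log is assumed (the slide does not specify a base), and cells with nij = 0 contribute zero under the usual 0 * log 0 := 0 convention:

```python
import math

# Observed and expected values for "stress echo" (from the earlier slides).
observed = [1, 0, 0, 10]                 # n11, n12, n21, n22
expected = [1/11, 10/11, 10/11, 100/11]  # m11, m12, m21, m22

# Log likelihood ratio; zero cells are skipped (0 * log 0 := 0).
ll = 2 * sum(n * math.log(n / m) for n, m in zip(observed, expected) if n > 0)
```

For this table the value comes out to about 6.70.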
Slide 10: Chi Squared Test

x2 = ∑ ( (nij – mij)^2 / mij )

The chi squared test also measures the difference between the observed values and the expected values. It is the sum, over the cells, of the squared difference between the observed and expected values divided by the expected value.
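The chi squared statistic for the same stress/echo table, as a short sketch:

```python
# Chi squared: sum of squared (observed - expected) over expected, per cell.
observed = [1, 0, 0, 10]                 # n11, n12, n21, n22
expected = [1/11, 10/11, 10/11, 100/11]  # m11, m12, m21, m22
chi2 = sum((n - m) ** 2 / m for n, m in zip(observed, expected))
```

For this table chi squared works out to exactly 11 (equal to npp, since the table shows perfect association).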
Slide 11: Odds Ratio

Odds Ratio = (n11 * n22) / (n21 * n12)

The odds ratio is the ratio of the number of times an event takes place to the number of times it does not. It is the cross-product ratio of the 2x2 contingency table and measures the magnitude of association between two words.
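For the stress/echo table both off-diagonal cells are zero, which makes the odds ratio undefined; the sketch below therefore uses the larger follow/hip counts that appear later in the talk ("Example of the Problem"):

```python
# Odds ratio: cross-product ratio of the 2x2 contingency table.
# Counts are from the follow/hip table shown later in the talk; note that a
# zero off-diagonal cell (as in the stress/echo table) makes the ratio undefined.
n11, n12, n21, n22 = 11, 88951, 65729, 69783140
odds = (n11 * n22) / (n21 * n12)
```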
Slide 12: Phi Coefficient

Phi = ( (n11 * n22) - (n21 * n12) ) / sqrt( np1 * n1p * n2p * np2 )

The bigram is considered positively associated if most of the data lies along the diagonal (meaning n11 and n22 are larger than n12 and n21) and negatively associated if most of the data falls off the diagonal.
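A short sketch for the stress/echo table, where all the data lies on the diagonal:

```python
import math

# Phi coefficient for "stress echo"; +1 indicates perfect positive association.
n11, n12, n21, n22 = 1, 0, 0, 10
n1p, n2p = n11 + n12, n21 + n22
np1, np2 = n11 + n21, n12 + n22
phi = ((n11 * n22) - (n21 * n12)) / math.sqrt(np1 * n1p * n2p * np2)
```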
Slide 13: T-Score

T Score = ( n11 – m11 ) / sqrt( n11 )

The t-score tests whether there is a non-random association between two words. It is the difference between the observed and expected joint frequencies divided by the square root of the observed joint frequency.
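The same stress/echo values plugged into the t-score, as a sketch:

```python
import math

# T-score for "stress echo": (observed - expected) joint frequency over
# the square root of the observed joint frequency.
n11, m11 = 1, 1/11
t = (n11 - m11) / math.sqrt(n11)   # (1 - 0.09) / 1
```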
Slide 14: Dice Coefficient

Dice coefficient = 2 * n11 / (np1 + n1p)

The dice coefficient depends on the frequency of the events occurring together and on their individual frequencies.
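For the stress/echo table, where each word occurs only in the bigram itself, the coefficient reaches its maximum:

```python
# Dice coefficient for "stress echo": both words always occur together here,
# so the coefficient reaches its maximum value of 1.
n11, n1p, np1 = 1, 1, 1
dice = 2 * n11 / (np1 + n1p)
```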
Slide 15: True Mutual Information

TMI = ∑ ( (nij / npp) * log( nij / mij ) )

True Mutual Information measures to what extent the observed frequencies differ from the expected ones.
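A sketch for the stress/echo table; as with the log likelihood ratio, natural log is assumed and zero cells contribute nothing:

```python
import math

# True mutual information for "stress echo"; zero cells are skipped.
observed = [1, 0, 0, 10]                 # n11, n12, n21, n22
expected = [1/11, 10/11, 10/11, 100/11]  # m11, m12, m21, m22
npp = 11
tmi = sum((n / npp) * math.log(n / m)
          for n, m in zip(observed, expected) if n > 0)
```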
Slide 16: Spelling Correction
Use context sensitive information, through the bigrams, to determine the ranking of a given set of possible spelling corrections for a misspelled word.

Given:
• First content word prior to the misspelled word
• First content word after the misspelled word
• List of possible spelling corrections
Slide 17: Spelling Correction Example

Example sentence:
Her dobutamine stress echo showed mild aurtic stenosis with a subaortic gradient.

List of possible corrections: artic, aortic

Statistical analysis, basic idea (POS marks the position of the misspelled word):
her dobutamine stress echo showed mild POS stenosis with subaortic gradient
Slide 18: Spelling Correction Statistics

Possible 1:                        Possible 2:
mild artic        0.40             mild aortic        0.66
artic stenosis    0.03             aortic stenosis    0.30
Weighted average  0.215            Weighted average   0.46

• This allows us to take into consideration finding a bigram with the word prior to the misspelling and with the word after it
• Each possible word is then returned with its score
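The ranking step can be sketched as follows. Equal weights are assumed here because the slide does not state its weighting scheme (its 0.46 for "aortic" suggests the two bigram scores are not weighted equally):

```python
# Rank candidate corrections by averaging the score of the bigram formed with
# the word before the misspelling and the bigram formed with the word after it.
# Equal weights are assumed; the slide's exact weighting is not given.
scores = {
    "artic":  {"mild artic": 0.40, "artic stenosis": 0.03},
    "aortic": {"mild aortic": 0.66, "aortic stenosis": 0.30},
}
averages = {word: sum(s.values()) / len(s) for word, s in scores.items()}
best = max(averages, key=averages.get)   # "aortic" outranks "artic"
```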
Slide 19: Types of Results
• Gspell only
• Context sensitive only
• Hybrid of both Gspell and context
  • Taking the average of the Gspell and context sensitive scores
  • Note: this turns into a backoff method when no statistical data is found for any of the possibilities
• Backoff method
  • Use only the context sensitive score; if it does not exist, revert to the Gspell score
Slide 20: Preliminary Test Set
• Test set: partially scrubbed clinical notes
• Size: 854 words
• Number of misspellings: 82 (includes abbreviations)
Slide 21: Preliminary Results

GSPELL results:
                         Precision   Recall    F-measure
GSPELL                   0.5357      0.7317    0.6186

Context sensitive results:
Measure of association   Precision   Recall    F-measure
PHI                      0.6161      0.8415    0.7113
LL                       0.6071      0.8293    0.7010
TMI                      0.6071      0.8293    0.7010
ODDS                     0.6071      0.8293    0.7010
X2                       0.6161      0.8415    0.7113
TSCORE                   0.5625      0.7683    0.6495
DICE                     0.6339      0.8659    0.7320
Slide 22: Preliminary Results

Hybrid method results:
Measure of association   Precision   Recall    F-measure
PHI                      0.6607      0.9024    0.7629
LL                       0.6339      0.8659    0.7320
TMI                      0.6607      0.9024    0.7629
ODDS                     0.6250      0.8537    0.7216
X2                       0.6339      0.8659    0.7320
TSCORE                   0.6071      0.8293    0.7010
DICE                     0.6696      0.9146    0.7732
Slide 23: Notes on Log Likelihood
• Log likelihood is used quite often for context sensitive spelling correction
• Problem with large sample sizes:
  • The marginal values are very large due to the sample size
  • This inflates the expected values, so the observed values are commonly much lower than the expected values
  • Very independent and very dependent ngrams end up with the same value
• Similar characteristics were noticed with true mutual information
Slide 24: Example of the Problem

               hip         !hip
    follow     n11         88951         88962
    !follow    65729       69783140      69848869
               65740       69872091      69937831

    n11     Log Likelihood
    11      145.3647
    190     143.4268
    86      0.09864
Slide 25: Conclusions from Preliminary Results
• The dice coefficient returns the best results; the phi coefficient returns the second best
• Log likelihood and true mutual information should not be used
• The program now needs to be tested with a more extensive test bed, which is in the process of being created
Slide 26: Ngram Statistics for Multi Term Identification
• Cannot use the previous statistics package
  • Memory constraints due to the amount of data
  • Would like to look for longer ngrams
• Alternative: suffix arrays (Church and Yamamoto)
  • Reduces the amount of memory
  • Two arrays: one contains the corpus, the other contains identifiers of the ngrams in the corpus
  • Two stacks: one contains the longest common prefix, the other the document frequency
  • Allows ngrams up to the size of the corpus to be found
Slide 27: Suffix Arrays

to be or not to be

Suffixes:
to be or not to be
be or not to be
or not to be
not to be
to be
be

• Each array element is considered a suffix
• An ngram runs from a suffix position to the end of the array
Slide 28: Suffix Arrays

Sorted suffixes:
[0] = 5 => be
[1] = 1 => be or not to be
[2] = 3 => not to be
[3] = 2 => or not to be
[4] = 4 => to be
[5] = 0 => to be or not to be

Actual suffix array: 5 1 3 2 4 0
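The sorted suffix array above can be reproduced with a short word-level sketch (real suffix array construction avoids materializing the suffixes, but for six words this direct version suffices):

```python
# Build a word-level suffix array for "to be or not to be": the list of
# suffix start positions, ordered by the suffix each position begins.
words = "to be or not to be".split()
suffix_array = sorted(range(len(words)), key=lambda i: words[i:])

for rank, start in enumerate(suffix_array):
    print(f"[{rank}] = {start} => {' '.join(words[start:])}")
```

Running this prints exactly the sorted listing on the slide, with the array 5 1 3 2 4 0.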
Slide 29: Term Frequency
• Term frequency (tf) is the number of times an ngram occurs in the corpus
• To determine the tf of an ngram:
  • Sort the suffix array
  • tf = j – i + 1, where i is the index of the first occurrence and j the index of the last occurrence in the sorted array

[0] = 5 => be
[1] = 1 => be or not to be
[2] = 3 => not to be
[3] = 2 => or not to be
[4] = 4 => to be
[5] = 0 => to be or not to be
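Since the suffix array is sorted, all suffixes beginning with a given ngram form one contiguous interval [i, j], and tf = j - i + 1. A sketch:

```python
# Term frequency of an ngram via the sorted suffix array: find the first (i)
# and last (j) ranks whose suffix starts with the ngram; tf = j - i + 1.
words = "to be or not to be".split()
suffix_array = sorted(range(len(words)), key=lambda i: words[i:])

def term_frequency(ngram):
    hits = [rank for rank, start in enumerate(suffix_array)
            if words[start:start + len(ngram)] == list(ngram)]
    return hits[-1] - hits[0] + 1 if hits else 0   # j - i + 1

tf_to_be = term_frequency(("to", "be"))   # "to be" occurs at ranks 4 and 5
```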
Slide 30: Measures of Association
• Residual Inverse Document Frequency (RIDF)

  RIDF = - log( df / D ) + log( 1 – exp( -tf / D ) )

  Compares the distribution of a term over documents to what would be expected for a random term

• Mutual Information (MI)

  MI(xYz) = log( ( tf(xYz) * tf(Y) ) / ( tf(xY) * tf(Yz) ) )

  Compares the frequency of the whole to the frequency of the parts
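Both measures can be sketched directly from the formulas. The counts below are hypothetical (D documents, df documents containing the term, tf total occurrences), and natural log is used; Church and Yamamoto's presentation uses base 2, which only rescales the values:

```python
import math

def ridf(tf, df, D):
    # Residual IDF: observed IDF minus the IDF expected under a Poisson model.
    return -math.log(df / D) + math.log(1 - math.exp(-tf / D))

def mi(tf_xYz, tf_Y, tf_xY, tf_Yz):
    # Compares the frequency of the whole trigram xYz to that of its parts.
    return math.log((tf_xYz * tf_Y) / (tf_xY * tf_Yz))

# Hypothetical counts: a term clumped into few documents gets a positive RIDF.
r = ridf(tf=20, df=5, D=100)
m = mi(tf_xYz=5, tf_Y=100, tf_xY=20, tf_Yz=10)
```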
Slide 31: Present Work
• Calculated the MI and RIDF for the clinical notes for each of the possible sections: CC, CM, IP, HPI, PSH, SH and DX
  • Retrieved the respective text for each heading
  • Calculated the RIDF and MI for each possible ngram with a term frequency greater than 10 in the data under each section
• Noticed that different multi terms appear for each of the different sections
Slide 32: Conclusions
• Ngram statistics can be applied directly and indirectly to various problems
• Directly:
  • Spelling correction
  • Compound word identification
  • Term extraction
  • Name identification
• Indirectly:
  • Part of speech tagging
  • Information retrieval
  • Data mining
Slide 33: Packages
• Two statistical packages
• Contingency table approach
  • Measures for bigrams: Log Likelihood, True Mutual Information, Chi Squared Test, Odds Ratio, Phi Coefficient, T-Score, and Dice Coefficient
  • Measures for trigrams: Log Likelihood and True Mutual Information
• Suffix array approach
  • Measures for all lengths of ngrams: Residual Inverse Document Frequency and Mutual Information