Slide 1: Incorporating N-gram Statistics in the Normalization of Clinical Notes
By Bridget Thomson McInnes
Slide 2: Overview
• Ngrams
• Ngram Statistics for Spelling Correction
• Spelling Correction
• Ngram Statistics for Multi Term Identification
• Multi Term Identification
Slide 3: Ngram

Example sentence:
Her dobutamine stress echo showed mild aortic stenosis with a subaortic gradient.

Bigrams:
her dobutamine, dobutamine stress, stress echo, echo showed, showed mild, mild aortic, aortic stenosis, stenosis with, with a, a subaortic, subaortic gradient

Trigrams:
her dobutamine stress, dobutamine stress echo, stress echo showed, echo showed mild, showed mild aortic, mild aortic stenosis, aortic stenosis with, stenosis with a, with a subaortic, a subaortic gradient
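The extraction above can be sketched in a few lines of Python (a minimal illustration, not the author's actual code):

```python
# Extract word-level ngrams from the example sentence.
sentence = "her dobutamine stress echo showed mild aortic stenosis with a subaortic gradient"
words = sentence.split()

def ngrams(tokens, n):
    """Return the list of n-word ngrams as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams(words, 2)   # a 12-word sentence yields 11 bigrams
trigrams = ngrams(words, 3)  # and 10 trigrams
```

In general a sentence of N words yields N-1 bigrams and N-2 trigrams, which is why the lists above have 11 and 10 entries.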
Slide 4: Contingency Tables

                word 2     !word 2
    word 1       n11        n12        n1p
    !word 1      n21        n22        n2p
                 np1        np2        npp

• n11 = the joint frequency of word 1 and word 2
• n12 = the frequency word 1 occurs and word 2 does not
• n21 = the frequency word 2 occurs and word 1 does not
• n22 = the frequency neither word 1 nor word 2 occurs
• npp = the total number of ngrams
• n1p, np1, np2, n2p are the marginal counts
Slide 5: Contingency Tables

                echo       !echo
    stress        1          0          1
    !stress       0         10         10
                  1         10         11

Bigram frequencies from the example sentence:
her dobutamine 1, dobutamine stress 1, stress echo 1, echo showed 1, showed mild 1, mild aortic 1, aortic stenosis 1, stenosis with 1, with a 1, a subaortic 1, subaortic gradient 1
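Filling the 2x2 table for "stress echo" from these bigram frequencies could look like the following sketch (variable names are illustrative):

```python
# Bigram frequencies from the example sentence (each occurs once).
counts = {("her", "dobutamine"): 1, ("dobutamine", "stress"): 1,
          ("stress", "echo"): 1, ("echo", "showed"): 1, ("showed", "mild"): 1,
          ("mild", "aortic"): 1, ("aortic", "stenosis"): 1,
          ("stenosis", "with"): 1, ("with", "a"): 1,
          ("a", "subaortic"): 1, ("subaortic", "gradient"): 1}

w1, w2 = "stress", "echo"
npp = sum(counts.values())                               # total bigrams: 11
n11 = counts.get((w1, w2), 0)                            # joint frequency
n1p = sum(c for (a, _), c in counts.items() if a == w1)  # w1 as first word
np1 = sum(c for (_, b), c in counts.items() if b == w2)  # w2 as second word
n12, n21 = n1p - n11, np1 - n11                          # one word without the other
n22 = npp - n1p - np1 + n11                              # neither word occurs
```

The four cell values recover exactly the table on this slide: 1, 0, 0 and 10, with npp = 11.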
Slide 6: Contingency Tables – Expected Values

• Expected values:
• m11 = (np1 * n1p) / npp
• m12 = (np2 * n1p) / npp
• m21 = (np1 * n2p) / npp
• m22 = (np2 * n2p) / npp
Slide 7: Contingency Tables

• Expected values for the stress/echo table:
• m11 = ( 1 *  1) / 11 = 0.09
• m12 = ( 1 * 10) / 11 = 0.91
• m21 = ( 1 * 10) / 11 = 0.91
• m22 = (10 * 10) / 11 = 9.09

What is this telling you? 'stress echo' occurs once in our example, but its expected frequency, if the two words were independent, is only 0.09 (m11).
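The expected values on this slide can be checked directly (a minimal sketch using the stress/echo counts from the earlier slide):

```python
# Observed counts for "stress echo".
n11, n12, n21, n22 = 1, 0, 0, 10
n1p, n2p = n11 + n12, n21 + n22   # row totals: 1, 10
np1, np2 = n11 + n21, n12 + n22   # column totals: 1, 10
npp = n1p + n2p                   # 11

# Expected values under independence.
m11 = np1 * n1p / npp   # 0.09
m12 = np2 * n1p / npp   # 0.91
m21 = np1 * n2p / npp   # 0.91
m22 = np2 * n2p / npp   # 9.09
```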
Slide 8: Ngram Statistics – Measures of Association
• Log Likelihood Ratio
• Chi Squared Test
• Odds Ratio
• Phi Coefficient
• T-Score
• Dice Coefficient
• True Mutual Information
Slide 9: Log Likelihood Ratio

Log Likelihood = 2 * ∑ ( nij * log( nij / mij ) )

The log likelihood ratio measures the difference between the observed values and the expected values. It is twice the sum, over the cells of the contingency table, of each observed value times the log of the ratio of the observed to the expected value.
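A minimal sketch of the formula for the stress/echo table; natural log is assumed (the slide does not specify a base), and cells with nij = 0 contribute zero under the usual 0 * log 0 := 0 convention:

```python
import math

# Observed and expected values for "stress echo" (from the earlier slides).
observed = [1, 0, 0, 10]                 # n11, n12, n21, n22
expected = [1/11, 10/11, 10/11, 100/11]  # m11, m12, m21, m22

# Log likelihood ratio; zero cells are skipped (0 * log 0 := 0).
ll = 2 * sum(n * math.log(n / m) for n, m in zip(observed, expected) if n > 0)
```

For this table the value comes out to about 6.70.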
Slide 10: Chi Squared Test

x2 = ∑ ( (nij – mij)^2 / mij )

The chi squared test also measures the difference between the observed values and the expected values. It is the sum, over the cells, of the squared difference between the observed and expected values divided by the expected value.
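The chi squared statistic for the same stress/echo table, as a short sketch:

```python
# Chi squared: sum of squared (observed - expected) over expected, per cell.
observed = [1, 0, 0, 10]                 # n11, n12, n21, n22
expected = [1/11, 10/11, 10/11, 100/11]  # m11, m12, m21, m22
chi2 = sum((n - m) ** 2 / m for n, m in zip(observed, expected))
```

For this table chi squared works out to exactly 11 (equal to npp, since the table shows perfect association).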
Slide 11: Odds Ratio

Odds Ratio = (n11 * n22) / (n21 * n12)

The odds ratio is the ratio of the number of times an event takes place to the number of times it does not. It is the cross-product ratio of the 2x2 contingency table and measures the magnitude of association between two words.
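For the stress/echo table both off-diagonal cells are zero, which makes the odds ratio undefined; the sketch below therefore uses the larger follow/hip counts that appear later in the talk ("Example of the Problem"):

```python
# Odds ratio: cross-product ratio of the 2x2 contingency table.
# Counts are from the follow/hip table shown later in the talk; note that a
# zero off-diagonal cell (as in the stress/echo table) makes the ratio undefined.
n11, n12, n21, n22 = 11, 88951, 65729, 69783140
odds = (n11 * n22) / (n21 * n12)
```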
Slide 12: Phi Coefficient

Phi = ( (n11 * n22) - (n21 * n12) ) / sqrt( np1 * n1p * n2p * np2 )

The bigram is considered positively associated if most of the data lies along the diagonal (meaning n11 and n22 are larger than n12 and n21) and negatively associated if most of the data falls off the diagonal.
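A short sketch for the stress/echo table, where all the data lies on the diagonal:

```python
import math

# Phi coefficient for "stress echo"; +1 indicates perfect positive association.
n11, n12, n21, n22 = 1, 0, 0, 10
n1p, n2p = n11 + n12, n21 + n22
np1, np2 = n11 + n21, n12 + n22
phi = ((n11 * n22) - (n21 * n12)) / math.sqrt(np1 * n1p * n2p * np2)
```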
Slide 13: T-Score

T Score = ( n11 – m11 ) / sqrt( n11 )

The t-score tests whether there is a non-random association between two words. It is the difference between the observed and expected joint frequencies divided by the square root of the observed joint frequency.
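The same stress/echo values plugged into the t-score, as a sketch:

```python
import math

# T-score for "stress echo": (observed - expected) joint frequency over
# the square root of the observed joint frequency.
n11, m11 = 1, 1/11
t = (n11 - m11) / math.sqrt(n11)   # (1 - 0.09) / 1
```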
Slide 14: Dice Coefficient

Dice coefficient = 2 * n11 / (np1 + n1p)

The dice coefficient depends on the frequency of the events occurring together and on their individual frequencies.
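For the stress/echo table, where each word occurs only in the bigram itself, the coefficient reaches its maximum:

```python
# Dice coefficient for "stress echo": both words always occur together here,
# so the coefficient reaches its maximum value of 1.
n11, n1p, np1 = 1, 1, 1
dice = 2 * n11 / (np1 + n1p)
```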
Slide 15: True Mutual Information

TMI = ∑ ( (nij / npp) * log( nij / mij ) )

True Mutual Information measures to what extent the observed frequencies differ from the expected ones.
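A sketch for the stress/echo table; as with the log likelihood ratio, natural log is assumed and zero cells contribute nothing:

```python
import math

# True mutual information for "stress echo"; zero cells are skipped.
observed = [1, 0, 0, 10]                 # n11, n12, n21, n22
expected = [1/11, 10/11, 10/11, 100/11]  # m11, m12, m21, m22
npp = 11
tmi = sum((n / npp) * math.log(n / m)
          for n, m in zip(observed, expected) if n > 0)
```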
Slide 16: Spelling Correction
Use context sensitive information, through the bigrams, to determine the ranking of a given set of possible spelling corrections for a misspelled word.

Given:
• First content word prior to the misspelled word
• First content word after the misspelled word
• List of possible spelling corrections
Slide 17: Spelling Correction Example

Example sentence:
Her dobutamine stress echo showed mild aurtic stenosis with a subaortic gradient.

List of possible corrections: artic, aortic

Statistical analysis, basic idea (POS marks the position of the misspelled word):
her dobutamine stress echo showed mild POS stenosis with subaortic gradient
Slide 18: Spelling Correction Statistics

Possible 1:                        Possible 2:
mild artic        0.40             mild aortic        0.66
artic stenosis    0.03             aortic stenosis    0.30
Weighted average  0.215            Weighted average   0.46

• This allows us to take into consideration finding a bigram with the word prior to the misspelling and with the word after it
• Each possible word is then returned with its score
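The ranking step can be sketched as follows. Equal weights are assumed here because the slide does not state its weighting scheme (its 0.46 for "aortic" suggests the two bigram scores are not weighted equally):

```python
# Rank candidate corrections by averaging the score of the bigram formed with
# the word before the misspelling and the bigram formed with the word after it.
# Equal weights are assumed; the slide's exact weighting is not given.
scores = {
    "artic":  {"mild artic": 0.40, "artic stenosis": 0.03},
    "aortic": {"mild aortic": 0.66, "aortic stenosis": 0.30},
}
averages = {word: sum(s.values()) / len(s) for word, s in scores.items()}
best = max(averages, key=averages.get)   # "aortic" outranks "artic"
```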
Slide 19: Types of Results
• Gspell only
• Context sensitive only
• Hybrid of both Gspell and context
  • Taking the average of the Gspell and context sensitive scores
  • Note: this turns into a backoff method when no statistical data is found for any of the possibilities
• Backoff method
  • Use only the context sensitive score; if it does not exist, revert to the Gspell score
Slide 20: Preliminary Test Set
• Test set: partially scrubbed clinical notes
• Size: 854 words
• Number of misspellings: 82 (includes abbreviations)
Slide 21: Preliminary Results

GSPELL results:
                         Precision   Recall    F-measure
GSPELL                   0.5357      0.7317    0.6186

Context sensitive results:
Measure of association   Precision   Recall    F-measure
PHI                      0.6161      0.8415    0.7113
LL                       0.6071      0.8293    0.7010
TMI                      0.6071      0.8293    0.7010
ODDS                     0.6071      0.8293    0.7010
X2                       0.6161      0.8415    0.7113
TSCORE                   0.5625      0.7683    0.6495
DICE                     0.6339      0.8659    0.7320
Slide 22: Preliminary Results

Hybrid method results:
Measure of association   Precision   Recall    F-measure
PHI                      0.6607      0.9024    0.7629
LL                       0.6339      0.8659    0.7320
TMI                      0.6607      0.9024    0.7629
ODDS                     0.6250      0.8537    0.7216
X2                       0.6339      0.8659    0.7320
TSCORE                   0.6071      0.8293    0.7010
DICE                     0.6696      0.9146    0.7732
Slide 23: Notes on Log Likelihood
• Log likelihood is used quite often for context sensitive spelling correction
• Problem with large sample sizes:
  • The marginal values are very large due to the sample size
  • This inflates the expected values, so the observed values are commonly much lower than the expected values
  • Very independent and very dependent ngrams end up with the same value
• Similar characteristics were noticed with true mutual information
Slide 24: Example of the Problem

               hip         !hip
    follow     n11         88951         88962
    !follow    65729       69783140      69848869
               65740       69872091      69937831

    n11     Log Likelihood
    11      145.3647
    190     143.4268
    86      0.09864
Slide 25: Conclusions from Preliminary Results
• The dice coefficient returns the best results; the phi coefficient returns the second best
• Log likelihood and true mutual information should not be used
• The program now needs to be tested with a more extensive test bed, which is in the process of being created
Slide 26: Ngram Statistics for Multi Term Identification
• Cannot use the previous statistics package
  • Memory constraints due to the amount of data
  • Would like to look for longer ngrams
• Alternative: suffix arrays (Church and Yamamoto)
  • Reduces the amount of memory
  • Two arrays: one contains the corpus, the other contains identifiers of the ngrams in the corpus
  • Two stacks: one contains the longest common prefix, the other the document frequency
  • Allows ngrams up to the size of the corpus to be found
Slide 27: Suffix Arrays

to be or not to be

Suffixes:
to be or not to be
be or not to be
or not to be
not to be
to be
be

• Each array element is considered a suffix
• An ngram runs from a suffix position to the end of the array
Slide 28: Suffix Arrays

Sorted suffixes:
[0] = 5 => be
[1] = 1 => be or not to be
[2] = 3 => not to be
[3] = 2 => or not to be
[4] = 4 => to be
[5] = 0 => to be or not to be

Actual suffix array: 5 1 3 2 4 0
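The sorted suffix array above can be reproduced with a short word-level sketch (real suffix array construction avoids materializing the suffixes, but for six words this direct version suffices):

```python
# Build a word-level suffix array for "to be or not to be": the list of
# suffix start positions, ordered by the suffix each position begins.
words = "to be or not to be".split()
suffix_array = sorted(range(len(words)), key=lambda i: words[i:])

for rank, start in enumerate(suffix_array):
    print(f"[{rank}] = {start} => {' '.join(words[start:])}")
```

Running this prints exactly the sorted listing on the slide, with the array 5 1 3 2 4 0.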
Slide 29: Term Frequency
• Term frequency (tf) is the number of times an ngram occurs in the corpus
• To determine the tf of an ngram:
  • Sort the suffix array
  • tf = j – i + 1, where i is the index of the first occurrence and j the index of the last occurrence in the sorted array

[0] = 5 => be
[1] = 1 => be or not to be
[2] = 3 => not to be
[3] = 2 => or not to be
[4] = 4 => to be
[5] = 0 => to be or not to be
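Since the suffix array is sorted, all suffixes beginning with a given ngram form one contiguous interval [i, j], and tf = j - i + 1. A sketch:

```python
# Term frequency of an ngram via the sorted suffix array: find the first (i)
# and last (j) ranks whose suffix starts with the ngram; tf = j - i + 1.
words = "to be or not to be".split()
suffix_array = sorted(range(len(words)), key=lambda i: words[i:])

def term_frequency(ngram):
    hits = [rank for rank, start in enumerate(suffix_array)
            if words[start:start + len(ngram)] == list(ngram)]
    return hits[-1] - hits[0] + 1 if hits else 0   # j - i + 1

tf_to_be = term_frequency(("to", "be"))   # "to be" occurs at ranks 4 and 5
```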
Slide 30: Measures of Association
• Residual Inverse Document Frequency (RIDF)

  RIDF = - log( df / D ) + log( 1 – exp( -tf / D ) )

  Compares the distribution of a term over documents to what would be expected for a random term

• Mutual Information (MI)

  MI(xYz) = log( ( tf(xYz) * tf(Y) ) / ( tf(xY) * tf(Yz) ) )

  Compares the frequency of the whole to the frequency of the parts
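Both measures can be sketched directly from the formulas. The counts below are hypothetical (D documents, df documents containing the term, tf total occurrences), and natural log is used; Church and Yamamoto's presentation uses base 2, which only rescales the values:

```python
import math

def ridf(tf, df, D):
    # Residual IDF: observed IDF minus the IDF expected under a Poisson model.
    return -math.log(df / D) + math.log(1 - math.exp(-tf / D))

def mi(tf_xYz, tf_Y, tf_xY, tf_Yz):
    # Compares the frequency of the whole trigram xYz to that of its parts.
    return math.log((tf_xYz * tf_Y) / (tf_xY * tf_Yz))

# Hypothetical counts: a term clumped into few documents gets a positive RIDF.
r = ridf(tf=20, df=5, D=100)
m = mi(tf_xYz=5, tf_Y=100, tf_xY=20, tf_Yz=10)
```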
Slide 31: Present Work
• Calculated the MI and RIDF for the clinical notes for each of the possible sections: CC, CM, IP, HPI, PSH, SH and DX
  • Retrieved the respective text for each heading
  • Calculated the RIDF and MI for each possible ngram with a term frequency greater than 10 in the data under each section
• Noticed that different multi terms appear for each of the different sections
Slide 32: Conclusions
• Ngram statistics can be applied directly and indirectly to various problems
• Directly:
  • Spelling correction
  • Compound word identification
  • Term extraction
  • Name identification
• Indirectly:
  • Part of speech tagging
  • Information retrieval
  • Data mining
Slide 33: Packages
• Two statistical packages
• Contingency table approach
  • Measures for bigrams: Log Likelihood, True Mutual Information, Chi Squared Test, Odds Ratio, Phi Coefficient, T-Score, and Dice Coefficient
  • Measures for trigrams: Log Likelihood and True Mutual Information
• Suffix array approach
  • Measures for all lengths of ngrams: Residual Inverse Document Frequency and Mutual Information