Remove Parenthesis Plural Forms of (s), (es), and (ies)

By: Chris Lu, Guy Divita, Allen Browne
Date: 12.13.2004

Table of Contents

• Background
• Problems
• Objective
• Methods
• Results
• Future work

Background

Norm:
• is the most commonly used program in Lvg
• is used to create the normalized string and word indexes to the UMLS Metathesaurus
• is used to access those indexes in the UMLS Metathesaurus
• includes 10 lvg flows (2004)

Background – Cont.

Norm:
1. Remove genitives
2. Replace punctuation with spaces
3. Remove stop words
4. Strip diacritics
5. Split ligatures
6. Lowercase
7. Uninflect each word
8. Retrieve citation
9. Word sort
10. Retrieve Unicode symbol

Background – Cont.

Plural forms with parenthesis
• (s):
  • Accessory finger(s)
  • Addiction, drug(s)
  • Burn of wrist(s) and hand(s)
• (es):
  • Abdomen CT Adrenal Mass(es) Bilateral
  • Provide picture of fetus(es), as appropriate
  • sequelae of; injury, nerve, roots and plexus(es), spinal
• (ies):
  • Donor pneumonectomy(ies) with preparation and maintenance of allograft (cadaver)
  • Orthotic(s) fitting and training, upper extremity(ies), lower extremity(ies), and/or trunk, each 15 minutes

Problems

• No flow in lvg handles this issue
• Can we simply remove (s), (es), (ies) to get the uninflected form without changing the word?
  • (es), (ies): no problem
  • (s): ?

Challenge

How about:
• 1-N-(s)-4-amino-2-hydroxybutyryl-3'4'-deoxyneamine
• 9(s)-erythromycylamine
• anatoxin-b(s)
• Ap(s)pCHClpp(s)A
• Bacillus phage rho11(s)
• Cbz-AAPhepsi((s)-CH(OH)CH2)GlyVV-OMe
• EAV G(s) glycoprotein
• G(s), alpha Subunit
• Histone H1(s)
• J(s)(b) ANTIBODY
• N(alpha)-benzoylarginineamide monohydrochloride, (s)-isomer
• natoxin-a(s)
• Salmonella II 6,7:(g),m,(s),t:1,5
• (s)-(+)-citreofuran
• su(s) protein, Drosophila
• XLalpha(s) protein
• [X]O spontn disrptn/lig(s)knee
• O spontn disrptn/lig(s)knee

Challenge – Cont.

• Do not remove (s) in chemical, protein, gene, mathematics terms, etc.
• Sometimes (s) should be replaced by a space instead of being removed
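To make the challenge concrete, here is a minimal sketch of the naive approach: delete every (s), (es), or (ies) with one regular expression. The function name naive_remove is hypothetical and is not part of lvg. It handles the plain plural, damages the chemical name, and glues two words together.

```python
import re

# Naive approach: delete every "(s)", "(es)", or "(ies)" regardless of context.
NAIVE_PLURAL = re.compile(r"\((?:s|es|ies)\)")


def naive_remove(term: str) -> str:
    """Hypothetical helper: strip all parenthetic plural markers blindly."""
    return NAIVE_PLURAL.sub("", term)


# A true plural: the result is what we want.
print(naive_remove("Accessory finger(s)"))          # Accessory finger
# A chemical name: (s) is not a plural, so the term is changed incorrectly.
print(naive_remove("9(s)-erythromycylamine"))       # 9-erythromycylamine
# (s) followed by a word: the two words are fused instead of separated.
print(naive_remove("O spontn disrptn/lig(s)knee"))  # O spontn disrptn/ligknee
```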

Objective

• Remove parenthesis plural forms of (s), (es), (ies)
• Do not remove (s) in chemical, protein, gene, etc.
• Replace (s) with a space where appropriate
• Fast performance
• High precision

Scope

• UMLS Metathesaurus: 2.8 M terms
• Lexicon: 0.8 M inflected terms
• Total: 3.6 M terms
• Terms with (s), (es), (ies) patterns: ~ 2800

Methods - Pattern Observation

• XLalpha(s) protein

• su(s) protein, Drosophila

• (s)-(+)-citreofuran

• Salmonella II 6,7:(g),m,(s),t:1,5

• natoxin-a(s)

• N(alpha)-benzoylarginineamide monohydrochloride, (s)-isomer

• J(s)(b) ANTIBODY

• Histone H1(s)

• G(s), alpha Subunit

• EAV G(s) glycoprotein

• Cbz-AAPhepsi((s)-CH(OH)CH2)GlyVV-OMe

• Bacillus phage rho11(s)

• Ap(s)pCHClpp(s)A

• anatoxin-b(s)

• 9(s)-erythromycylamine

• 1-N-(s)-4-amino-2-hydroxybutyryl-3'4'-deoxyneamine

Pattern Observation – (1)

Sample Term | Word Size | Distance
9(s)-erythromycylamine | 1 | 1
Ap(s)pCHClpp(s)A | 2 | 1
EAV G(s) glycoprotein | 1 | 1
G(s), alpha Subunit | 1 | 1
Histone H1(s) | 2 | 1
J(s)(b) ANTIBODY | 1 | 1
N(alpha)-benzoylarginineamide monohydrochloride, (s)-isomer | 0 | 1
(s)-(+)-citreofuran | 0 | 1
su(s) protein, Drosophila | 2 | 1

• The size of the word in front of (s) is less than or equal to 2 (in these terms (s) is not a plural marker)
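The "word size" column can be reproduced with a small sketch. It assumes that a word is the run of non-space characters immediately before the first (s); the function name word_size_before_s is hypothetical.

```python
def word_size_before_s(term):
    """Length of the space-delimited word directly in front of the first "(s)".

    Returns None when the term contains no "(s)".
    Assumption: words are separated by single spaces only.
    """
    idx = term.find("(s)")
    if idx < 0:
        return None
    prefix = term[:idx]
    # rfind returns -1 when there is no space, so the whole prefix is the word.
    return len(prefix) - (prefix.rfind(" ") + 1)


for sample in ("Histone H1(s)", "(s)-(+)-citreofuran", "Accessory finger(s)"):
    print(sample, "->", word_size_before_s(sample))
```

With the sample terms from the table, the short words (size 0 to 2) are the ones where (s) must be kept, while an ordinary plural such as "Accessory finger(s)" has a longer word in front of (s).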

Pattern Observation – (2)

Sample Term | Character | Distance
9(s)-erythromycylamine | Arabic number 9 | 1
Bacillus phage rho11(s) | Arabic number 1 | 1
Histone H1(s) | Arabic number 1 | 1

• The character in front of (s) is an Arabic number

Pattern Observation – (3)

Sample Term | Character | Distance
1-N-(s)-4-amino-2-hydroxybutyryl-3'4'-deoxyneamine | Punctuation - | 1
anatoxin-b(s) | Punctuation - | 2
Cbz-AAPhepsi((s)-CH(OH)CH2)GlyVV-OMe | Punctuation ( | 1
natoxin-a(s) | Punctuation - | 2
Salmonella II 6,7:(g),m,(s),t:1,5 | Punctuation , | 1

• Punctuation is in front of (s) within distance 1 or 2

Pattern Observation – (4)

Sample Term | Pattern | Distance
Ap(s)pCHClpp(s)A | pp | 1
XLalpha(s) protein | alpha | 1

• The word in front of (s) ends with: pp, alpha

Pattern Observation – (5)

Sample Term | Pattern | Distance
[X]O spontn disrptn/lig(s)knee | Followed by a word | 1
O spontn disrptn/lig(s)knee | Followed by a word | 1

• (s) is followed by an English word
• An English word begins with a letter
• If (s) is followed by a letter, replace (s) with a space
• Exceptions: Ap(s)pCHClpp(s)A, G(s)alpha

Implementation – Wild Cards

Wild Card Definition:
• ^: start, starting mark of the term
• $: end, ending mark of the term right before (s)
• C: any character
• D: any digit, [0-9]
• L: any letter, [a-z]
• P: punctuation: [- ( ,]
• S: space: [ ]
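The character-level wild cards can be read as overlapping character classes. The sketch below lists which of them a single character satisfies. It assumes that L is case-insensitive and that the punctuation set is exactly the three characters "-", "(" and ","; the function name wildcards_for is hypothetical.

```python
def wildcards_for(ch):
    """Return the wild card symbols that one character satisfies.

    ^ and $ are position marks rather than character classes,
    so they are not reported here.
    """
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    symbols = ["C"]                      # C: any character
    if ch in "0123456789":
        symbols.append("D")              # D: any digit, [0-9]
    if ch.isascii() and ch.isalpha():
        symbols.append("L")              # L: any letter (case-insensitive here)
    if ch in "-(,":
        symbols.append("P")              # P: punctuation, assumed to be - ( ,
    if ch == " ":
        symbols.append("S")              # S: space
    return symbols


for ch in "9b- ":
    print(repr(ch), wildcards_for(ch))
```

Because C matches everything, a character always belongs to at least one class, which is why rule lookup has to allow more than one candidate branch.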

Implementation – Rule Representations

Pattern | Sample Term | Rule
1 | (s)-(+)-citreofuran | ^$
1 | J(s)(b) ANTIBODY | ^C$
1 | EAV G(s) glycoprotein | SC$
1 | su(s) protein, Drosophila | ^CC$
1 | Histone H1(s) | SCC$
2 | 9(s)-erythromycylamine | D$
3 | Salmonella II 6,7:(g),m,(s),t:1,5 | P$
3 | natoxin-a(s) | PC$
4 | Ap(s)pCHClpp(s)A | pp$
4 | XLalpha(s) protein | alpha$
… | … | …
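One way to read a rule is right to left, starting at $ (the position right before (s)) and consuming one character of the term per symbol. The sketch below checks the listed rules one by one, without the trie. It assumes uppercase C, D, L, P, S are wild cards, lowercase letters are literals, and only the ten rules shown on the slide exist; all function names are hypothetical.

```python
RULES = ["^$", "^C$", "SC$", "^CC$", "SCC$", "D$", "P$", "PC$", "pp$", "alpha$"]


def symbol_matches(sym, ch):
    """True if one rule symbol accepts one character of the term."""
    if sym == "C":
        return True
    if sym == "D":
        return ch in "0123456789"
    if sym == "L":
        return ch.isascii() and ch.isalpha()
    if sym == "P":
        return ch in "-(,"
    if sym == "S":
        return ch == " "
    return sym == ch  # anything else is a literal character


def rule_matches(rule, prefix):
    """Match a rule against the text in front of (s), reading right to left."""
    pos = len(prefix)
    for sym in reversed(rule[:-1]):  # drop the trailing "$"
        if sym == "^":
            return pos == 0          # "^" requires the start of the term
        pos -= 1
        if pos < 0 or not symbol_matches(sym, prefix[pos]):
            return False
    return True


def first_matching_rule(term):
    """Return the first listed rule matching the text before the first (s)."""
    idx = term.find("(s)")
    if idx < 0:
        return None
    prefix = term[:idx]
    return next((r for r in RULES if rule_matches(r, prefix)), None)


print(first_matching_rule("Histone H1(s)"))        # SCC$
print(first_matching_rule("Accessory finger(s)"))  # None
```

A term may satisfy more than one rule (for example, "9" satisfies both ^C$ and D$), so this sketch simply reports the first one in list order.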

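Implementation – Reversed Trie Tree

[Figure: reversed trie built from the rules. The root is $ (the position right before (s)); each branch reads a rule from right to left through the wild card nodes C, D, P, S, and ^ and through literal letter nodes (p-p, a-h-p-l-a, etc.).]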

Implementation – Reversed Trie Tree

• Example: anatoxin-b(s)
• The term is matched backward from (s), one character at a time: (s), then b(s), then -b(s), which follows the path $, C, P (rule PC$)

[Figure: the reversed trie, repeated for each step of this example.]
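The reversed trie can be sketched as follows. Rules are inserted right to left (the $ end is the root), and lookup walks the term backward from the character in front of (s). This is a minimal illustration that assumes only the ten rules listed earlier; because wild cards overlap (C accepts any character), the lookup tries every matching branch. The class and function names are hypothetical and are not the lvg implementation.

```python
RULES = ["^$", "^C$", "SC$", "^CC$", "SCC$", "D$", "P$", "PC$", "pp$", "alpha$"]


class TrieNode:
    def __init__(self):
        self.children = {}        # rule symbol -> TrieNode
        self.is_rule_end = False  # True when a complete rule ends here


def build_reversed_trie(rules):
    """Insert each rule right to left, so the root corresponds to "$"."""
    root = TrieNode()
    for rule in rules:
        node = root
        for sym in reversed(rule.rstrip("$")):
            node = node.children.setdefault(sym, TrieNode())
        node.is_rule_end = True
    return root


def symbol_matches(sym, ch):
    """Uppercase C, D, L, P, S are wild cards; anything else is a literal."""
    if sym == "C":
        return True
    if sym == "D":
        return ch in "0123456789"
    if sym == "L":
        return ch.isascii() and ch.isalpha()
    if sym == "P":
        return ch in "-(,"
    if sym == "S":
        return ch == " "
    return sym == ch


def trie_matches(node, prefix, pos=None):
    """True if some rule matches the text in front of (s), read backward."""
    if pos is None:
        pos = len(prefix)
    if node.is_rule_end:
        return True
    for sym, child in node.children.items():
        if sym == "^":
            # "^" consumes no character; it requires the start of the term.
            if pos == 0 and trie_matches(child, prefix, pos):
                return True
        elif pos > 0 and symbol_matches(sym, prefix[pos - 1]):
            if trie_matches(child, prefix, pos - 1):
                return True
    return False


trie = build_reversed_trie(RULES)
print(trie_matches(trie, "anatoxin-b"))        # True: b is C, - is P (rule PC$)
print(trie_matches(trie, "Accessory finger"))  # False: (s) is a plain plural
```

Sharing prefixes of the reversed rules means a lookup inspects at most a few characters in front of (s), regardless of how many rules are loaded.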

Implementation – Algorithm Flow

[Flowchart]
1. Start: find (s), (es), and (ies)
2. If (s)?
   • No: remove (es) and (ies), then End
   • Yes: go through the reversed trie
3. If pattern match?
   • Yes: End ((s) is kept)
   • No: check the following character
4. If the following character is a letter?
   • Yes: replace (s) with a space, then End
   • No: remove (s), then End
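The whole flow can be sketched end to end. To keep the example short, the reversed trie is swapped for a single regular expression (KEEP_S) that encodes the ten rules listed earlier; this is a stand-in for illustration, not the trie lookup itself. The function name remove_parenthetic_plurals is hypothetical, and the rule set is only the partial one shown on the slides.

```python
import re

PLURAL = re.compile(r"\((s|es|ies)\)")

# Stand-in for the reversed trie: the listed rules, anchored right before (s).
# ^$  ^C$  ^CC$  SC$  SCC$  D$  P$  PC$  pp$  alpha$
KEEP_S = re.compile(r"(?:^|^.|^..| .| ..|[0-9]|[-(,]|[-(,].|pp|alpha)$")


def remove_parenthetic_plurals(term):
    """Apply the flow: remove (es)/(ies); keep, remove, or space out (s)."""
    out = []
    last = 0
    for m in PLURAL.finditer(term):
        out.append(term[last:m.start()])
        last = m.end()
        if m.group(1) != "s":
            continue                      # (es), (ies): always removed
        prefix = term[:m.start()]
        following = term[m.end():m.end() + 1]
        if KEEP_S.search(prefix):
            out.append(m.group(0))        # pattern match: (s) is kept
        elif following.isascii() and following.isalpha():
            out.append(" ")               # (s) followed by a letter: space
        # otherwise (s) is simply removed
    out.append(term[last:])
    return "".join(out)


print(remove_parenthetic_plurals("Burn of wrist(s) and hand(s)"))
print(remove_parenthetic_plurals("9(s)-erythromycylamine"))
print(remove_parenthetic_plurals("O spontn disrptn/lig(s)knee"))
```

The exceptions from Pattern Observation (5), Ap(s)pCHClpp(s)A and G(s)alpha, are left unchanged because the pattern check runs before the following-letter check.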

Results

• Remove (s) properly
• Remove (es) properly
• Remove (ies) properly
• Replace (s) with space properly

• A fast, precise, and expandable system

Future Work

• More test cases, more rules
• Implement this feature in both Norm and LuiNorm
• Apply to (ing), (ed), (en)

Thank you!

• [email protected]
• http://umlslex.nlm.nih.gov/lvg/2005