INSTITUTE OF INFORMATION AND COMMUNICATION TECHNOLOGIES
BULGARIAN ACADEMY OF SCIENCES

Processing Clinical Texts in a Less-Resourced Language: the Challenge to Start with Bulgarian

Galia Angelova
Institute of Information and Communication Technologies (IICT), Bulgarian Academy of Sciences

ACL 2013 BioNLP Workshop, 8 August 2013, Sofia

AComIn: Advanced Computing for Innovation
http://www.iict.bas.bg/acomin



Thanks for the invitation!

Summary of joint work with:
• Dr Dimitar Tcharaktchiev, University Specialized Hospital for Endocrinology (USHATE), Medical University Sofia
• Dr Svetla Boytcheva, American University in Bulgaria
• Ivelina Nikolova, PhD student, IICT-BAS
• Hristo Dimitrov, PhD student, MU-Sofia
• Dr Zhivko Angelov, Adiss Lab Ltd.
• Dr Nadia Dimitrova, National Cancer Register


Outline

• Context
• Specific features of Bulgarian clinical texts, esp. hospital discharge letters and outpatient (ambulatory) records
• Achievements
• Current work
• Conclusion


Bulgarian context

• Medical informatics is underdeveloped and is not viewed as a useful technology for cost optimization or transparency in decision making, either by hospitals or by the government
• Only a few academic experts think about Big Data in eHealth; doctors are impressed only by practical results affecting their own field
• BioNLP is practically unknown and resources are missing: ICD is supported in Bulgarian, but ATC, for example, is not


Bulgarian clinical texts

A medical sublanguage:

• Phrases instead of complete sentences
• Specific types of negation
• A lot of implicit or tacit knowledge needed for proper understanding
• Relevant facts are documented, e.g. in a specialized diabetic hospital
• There is much text: diagnoses, drugs, results of clinical tests done outside the hospital …


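Latin in Bulgarian clinical texts

Latin in hospital discharge letters
• Mixture of phrases in Latin and Bulgarian in one paragraph. Example: … феохромоцитом (silеnt рhеосhrомосytома) …, i.e. "pheochromocytoma (silent pheochromocytoma)"
• Latin terms transliterated with Cyrillic letters:
  – пиелонефритис хроника (pyelonephritis chronica)
  – херниа умбиликалис ет ингвиналис (hernia umbilicalis et inguinalis)
  – катаракта сенилис (cataracta senilis)
  – хипертония артериалис (hypertonia arterialis)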
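A first step towards separating the Latin and Cyrillic material is to classify tokens by script. The following is a minimal sketch, not the method used in the reported work; the function name `script_of` is hypothetical. It relies only on Unicode character names and also flags tokens that mix both alphabets, as can happen when look-alike letters are typed from the wrong keyboard layout.

```python
import unicodedata


def script_of(token: str) -> str:
    """Classify a token as 'cyrillic', 'latin', 'mixed' or 'other' by its letters."""
    scripts = set()
    for ch in token:
        if not ch.isalpha():
            continue  # digits and punctuation carry no script information
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            scripts.add("cyrillic")
        elif name.startswith("LATIN"):
            scripts.add("latin")
        else:
            scripts.add("other")
    if not scripts:
        return "other"
    if len(scripts) > 1:
        return "mixed"
    return scripts.pop()


# "c\u0430taracta" is Latin text in which one letter is the Cyrillic "а" (U+0430).
for tok in ["катаракта", "senilis", "c\u0430taracta"]:
    print(tok, script_of(tok))
```

Script detection alone cannot tell a transliterated Latin term from a native Bulgarian word, since both are written in Cyrillic; that distinction would need a lexicon or a trained model.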


Latin in Bulgarian clinical texts

Corpus of 6200 anonymised hospital discharge letters of diabetic patients

Terms                  Wordforms   Lemmas          Abbreviations
Bulgarian (Cyrillic)   601 233     12 009 (63%)    1 471
Latin                   18 926        560 (3%)     1 189
Transliterations       179 589      6 465 (34%)      982
Total                  799 748     19 034          3 642

Boytcheva, S. Multilingual Aspects of Information Extraction from Medical Texts in Bulgarian. In: C. Vertan and W. v. Hahn (Eds.), Multilingual Processing in Eastern and Southern EU Languages: Less-resourced Technologies and Translation, Cambridge Scholars Publishing, 2012, pp. 308-329.


Sections in hospital discharge letters

• 2-3 pages, structured into sections (by law)


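Bulgarian hospital discharge letters

• 77г. - ж
  (77 years old, female)
• гр. София
  (city of Sofia)
• Диагноза: Захарен диабет тип 2, с вторична резистентност към СУП. Полиневропатия диабетика. Нефропатия диабетика инципиенс. Тиреоидитис Хашимото – хипотиреоиден стадии. Анемия пернициоза. Двустранна глухота.
  (Diagnosis: Diabetes mellitus type 2 with secondary resistance to sulfonylureas. Polyneuropathia diabetica. Nephropathia diabetica incipiens. Hashimoto's thyroiditis, hypothyroid stage. Anaemia perniciosa. Bilateral deafness.)
• Анамнеза: Постъпва за пореден път в клиниката за контрол на състоянието. Зах. диабет тип ІІ с 20г. давност, открит случайно при изследвания по друг повод. От 11г. е на лечение с инсулин, …. Оплаквания при постъпването изцяло от страна на крайниците, изброени по – горе.
  (Case history: Admitted to the clinic once again for a check-up of her condition. Type II diabetes mellitus of 20 years' duration, discovered by chance during tests done for another reason. On insulin treatment for 11 years, …. Complaints at admission concern only the limbs, as listed above.)
• Минали заболявания: Нефролитиазис билатералис.
  (Past diseases: Nephrolithiasis bilateralis.)
• Фамилна обремененост: отрича.
  (Family history: denies.)
• Рискови фактори – алергия към пеницилини и Аналгин.
  (Risk factors: allergy to penicillins and Analgin.)
• Статус: Жена на видима възраст около действителната, в задоволително общо състояние, ориентирана, …
  (Status: Woman whose apparent age is close to her real age, in satisfactory general condition, oriented, …)
• Изследвания: СУЕ – 22 , Хб - 133 , Ер – 4,6, Хт – 0,42 , Левк – 4,8 , МСV – 91,4; Тр - 258, HDL-chol – 1.28, общ хол. – 4,8, 3-гл – 1,07.; …..
  (Lab tests: ESR 22, Hb 133, RBC 4.6, Hct 0.42, WBC 4.8, MCV 91.4; PLT 258, HDL-chol 1.28, total chol. 4.8, triglycerides 1.07; …)
• Обсъждане: …..
  (Discussion: …)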


Outpatient (ambulatory) records

• Reimbursement requests to the National Health Insurance Fund with obligatory XML structure:
• …..
• <Patient> <EGN> pseudonymized ID </EGN> <age>60</age> <gender>1</gender> </Patient>
• <MainDiag> <ICD>E11.8</ICD> </MainDiag>
• <Anamnesa> . Text . diseases .drugs .. </Anamnesa>
• <HState> … Text … </HState>
• <Examine> … Text .. tests, lab data… </Examine>
• <Therapy> <Nonreimburce> … </Nonreimburce> <Reimburce> Text …drugs </Reimburce> </Therapy>
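A record of this shape can be split into its coded fields and its free-text zones before any NLP is applied. The sketch below uses Python's standard `xml.etree.ElementTree`. The `<Record>` root element, the sample text contents and the function name `split_record` are assumptions made for illustration; only the child tag names are taken from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the <Record> root and all text contents are invented.
SAMPLE = """<Record>
  <Patient><EGN>pseudonymized-id</EGN><age>60</age><gender>1</gender></Patient>
  <MainDiag><ICD>E11.8</ICD></MainDiag>
  <Anamnesa>free text: diseases, drugs</Anamnesa>
  <HState>free text: status</HState>
  <Examine>free text: tests, lab data</Examine>
  <Therapy>
    <Nonreimburce>free text</Nonreimburce>
    <Reimburce>free text: drugs</Reimburce>
  </Therapy>
</Record>"""


def split_record(xml_text: str) -> dict:
    """Separate the coded fields from the free-text zones of one record."""
    root = ET.fromstring(xml_text)
    structured = {
        "patient_id": root.findtext("Patient/EGN"),
        "age": int(root.findtext("Patient/age")),
        "gender": root.findtext("Patient/gender"),
        "main_icd": root.findtext("MainDiag/ICD"),
    }
    # Zones that hold narrative text and would be passed on to the extractors.
    free_text = {
        tag: (root.findtext(tag) or "").strip()
        for tag in ("Anamnesa", "HState", "Examine")
    }
    free_text["Reimburce"] = (root.findtext("Therapy/Reimburce") or "").strip()
    return {"structured": structured, "free_text": free_text}


record = split_record(SAMPLE)
print(record["structured"]["main_icd"], sorted(record["free_text"]))
```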


Research projects


PSIP (Patient Safety through Intelligent Procedures in medication), FP7 ICT eHealth

• Extension of a running project with 14 core partners
• Integration of HIS (hospital information system) data and events delivered by IE (information extraction)
• Development of 3 extractors: drugs, diagnoses, lab data


Extraction of drug names

• 1500 drug names manually translated to Bulgarian, to “localize” the ATC classification
• The extractor assigns ATC codes to medication events
• Discovered 355 drugs outside the HIS


Assignment of ICD-10 codes

• Matching of ICD-10 labels to text phrases in the Diagnosis section
• ML algorithm trained on 1300 discharge letters
• Difficulties: Latin terms and their transliterations, paraphrases, abbreviations
• Errors are often due to descriptions of states that are hard to associate with ICD-10 codes, and to various types of ambiguity


Evaluation

• Training corpus 1300 discharge letters, test corpus 6200 discharge letters
• All occurrences were delivered to the PSIP repository

                     Precision   Recall    F-score   PSIP entities
Diagnoses            97.30%      74.69%    84.50%    26 826
Drugs                97.28%      99.59%    98.42%    160 892 (with dosage)
Dosage                                     93.85%
Values of lab data   99.04%      100%      99.52%    114 441
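The slides do not state which F-measure is reported. Assuming the balanced F1 (the harmonic mean of precision and recall), the reported F-scores for drugs and for lab data can be recomputed from their precision and recall. The helper name `f_score` is ours.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (harmonic mean); result is in the same units as the inputs."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall in percent, as given in the table.
print(round(f_score(97.28, 99.59), 2))  # drugs
print(round(f_score(99.04, 100.0), 2))  # values of lab data
```

Under this assumption the two calls give 98.42 and 99.52, matching the table.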


Drugs at hospitalization day 0

• Contextualization: timing of drug events
• Using Anamnesis (Case History) section only
• 355 “external” drugs in 6200 discharge letters
• Careful training on suitable phrases:
  – “at the admission” (при постъпването)
  – “at the moment” (в момента ...)
• Precision: 88%
• Recall: 92.45%
• F-score: 90.17%
• Award for best paper on EHR at EFMI 2011
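The role of such cue phrases can be illustrated with a simple rule-based check. This is only a sketch and not the trained component described above: it merely tests whether a sentence contains one of the two phrases quoted on the slide. The names `ADMISSION_CUES` and `mentions_day0` are hypothetical.

```python
import re

# The two cue phrases quoted on the slide: "at the admission", "at the moment".
ADMISSION_CUES = re.compile(r"при постъпването|в момента", re.IGNORECASE)


def mentions_day0(sentence: str) -> bool:
    """True if the sentence contains one of the 'day 0' cue phrases."""
    return ADMISSION_CUES.search(sentence) is not None


# Both sentences are taken from the sample discharge letter shown earlier.
print(mentions_day0("Оплаквания при постъпването изцяло от страна на крайниците"))
print(mentions_day0("От 11г. е на лечение с инсулин"))
```

A real extractor would also have to link the cue to the right drug mention within the sentence, which this check does not attempt.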


Contextualization in PSIP


ADEs in the particular hospital

• The following ADEs have been encountered:
  – hypo- and hyperkalemia,
  – risks of renal failure,
  – hemorrhages,
  – changes in some enzyme constellations

• ADE scorecards for USHATE have been developed

• Doctors received only contextualized alerts


EVTIMA: Conceptual Structuring

Representation of
• Patient status as attribute-value pairs (section Status)
• Patient history as a set of temporally-related past episodes (section Anamnesis)

• The same corpus of discharge letters of diabetic patients


Learning attributes and values

• Typical phrasal units that are not included in terminological dictionaries:
  – apparent age (visible human age) – corresponding to the real one
    ... видима възраст отговаряща на действителната ... (attribute - value)
  – general state .. satisfactory / good …
  – visible mucous membranes
• numerous domain-specific phrases for skin (colors) and other organs
• conditions – e.g. complaints
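One simple way to surface recurring phrasal units of this kind is to count word n-grams and keep the frequent ones. The sketch below assumes a plain frequency threshold; the filtering criteria actually used in the reported experiments are not given on the slides. The toy status lines are invented English stand-ins for the Bulgarian text, and `frequent_ngrams` is a hypothetical name.

```python
from collections import Counter


def frequent_ngrams(sentences, n, min_count):
    """Count word n-grams over all sentences and keep those seen at least min_count times."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Invented toy lines imitating the Status section.
status_lines = [
    "apparent age corresponding to the real one",
    "apparent age above the real one",
    "general state satisfactory",
    "general state good",
]
bigrams = frequent_ngrams(status_lines, n=2, min_count=2)
print(sorted(bigrams))
```

Frequent n-grams such as "apparent age" or "general state" are candidates for attribute names, with the words that follow them as candidate values.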


Results

                     Status section   Status section   All sections
                     training set     test set         test set
Wordforms            3,159            6,178            29,469
  Occurrences        169,959          729,893          917,985
Filtered 2-grams     67               117              279
  occurrences        22,573           93,025           46,181
Filtered 3-grams     5                8                8
  occurrences        2,146            13,586           2,177

Boytcheva, S. Structured Information Extraction from Medical Texts in Bulgarian. In: Journal Cybernetics and Information Technologies, 12(4), 2012, pp. 52-65


Results for English*

                     Training set   Test set
                     95 PRs         121 PRs
Wordforms            5,394          7,573
  Occurrences        40,817         73,801
Filtered 2-grams     8              9
  occurrences        342            409
Filtered 3-grams     1              1
  occurrences        23             56

* System presented at the 6th i2b2 Shared Task and Workshop Challenges in Natural Language Processing for Clinical Data: Temporal Relations. Ivelina Nikolova, Svetla Boytcheva, Galia Angelova, K. Bretonnel Cohen. Temporal expressions in clinical text: Event recognition and time expressions


Evaluation of status extraction

            Skin    Neck    Thyroid gland   Limbs   Age
Precision   95.65   95.65   94.94           93.41   88.89
Recall      73.82   88.00   90.36           85.00   90
F-score     83.33   91.67   92.59           89.01   89.44

Boytcheva S., I. Nikolova, E. Paskaleva, G. Angelova, D. Tcharaktchiev and N. Dimitrova. Obtaining Status Descriptions via Automatic Analysis of Hospital Patient Records. In: V. Fomichov (Ed.), Special Issue on Semantic Technologies, Informatica (Slovenia), Issue 4, December 2010, pp. 269-278.


Temporal model of case history

• A Primitive Event (in our context) is:
  (1) a diagnosis,
  (2) a drug,
  (3) a condition: can be a complaint, a symptom, a change in the status that signals abnormality
    – high BP
    – decompensation of diabetes mellitus
    – increased levels of serum creatinine
• Complex event – aggregation of e.g. all drugs


Sample case history

Diabetes Mellitus diagnosed 5-6 years ago, manifested by most symptoms. At the beginning started treatment with Maninil only, afterwards in combination with Siofor. After few months the Maninil was replaced by Diaprel. Since October 2005 treated with Insulin Novomix 30 – 32E in the morning, 26E in the evening with diagnosed diabetic retinopathy. Complains of strong pains in the feet.


Temporal expressions

• Dates day/month/year
• Year or month only
• Prepositional phrases containing temporal information
• Classified into
  – Absolute
  – Relative according to hospitalisation date, birthdate, events like e.g. previous moment: “after the operation”, “since then”, “since puberty”
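The absolute/relative distinction can be illustrated with a few regular expressions. This is a sketch over English text, covering only a subset of the listed expression types; the system described here processes Bulgarian, and its actual patterns are not shown on the slides. All pattern and function names are hypothetical.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

PATTERNS = [
    # day/month/year, e.g. 12.10.2005 or 12/10/2005
    ("absolute", re.compile(r"\b\d{1,2}[./]\d{1,2}[./]\d{4}\b")),
    # month name followed by a year, e.g. October 2005
    ("absolute", re.compile(rf"\b(?:{MONTHS})\s+\d{{4}}\b")),
    # offsets counted back from the hospitalisation date, e.g. 5-6 years ago
    ("relative", re.compile(r"\b\d+(?:-\d+)?\s+(?:years?|months?)\s+ago\b")),
    # anchors on other events, using the phrases quoted on the slide
    ("relative", re.compile(r"\b(?:since then|since puberty|after the operation)\b")),
]


def find_temporal(text):
    """Return (expression, kind) pairs in order of appearance in the text."""
    hits = []
    for kind, pattern in PATTERNS:
        for m in pattern.finditer(text):
            hits.append((m.start(), m.group(0), kind))
    return [(expr, kind) for _, expr, kind in sorted(hits)]


sample = ("Diabetes Mellitus diagnosed 5-6 years ago. "
          "Since October 2005 treated with Insulin.")
print(find_temporal(sample))
```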


Ordering events on time lines

• Directed multi-graph representation
• Time markers are nodes (states)
• The edges represent primitive events incident with the beginning and end time nodes
• Two graphs are generated – one for relative and one for absolute time scales
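The representation described above can be sketched as a small data structure: time markers are nodes, and each primitive event is an edge from its start marker to its end marker, with parallel edges allowed. The class and field names, the node labels and the `"admission"` end marker are assumptions for illustration; the events are taken from the sample case history.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PrimitiveEvent:
    kind: str   # "diagnosis", "drug" or "condition"
    label: str


class TimeGraph:
    """Directed multigraph: nodes are time markers, edges carry primitive events."""

    def __init__(self):
        # (start marker, end marker) -> list of events, so parallel edges are kept
        self.edges = defaultdict(list)

    def add_event(self, start, end, event):
        self.edges[(start, end)].append(event)

    def events_between(self, start, end):
        return list(self.edges.get((start, end), []))

    def nodes(self):
        return {marker for pair in self.edges for marker in pair}


# One graph per time scale, as on the slide.
absolute = TimeGraph()
absolute.add_event("2005-10", "admission", PrimitiveEvent("drug", "Insulin Novomix 30"))
absolute.add_event("2005-10", "admission", PrimitiveEvent("diagnosis", "diabetic retinopathy"))

relative = TimeGraph()
relative.add_event("5-6 years ago", "admission", PrimitiveEvent("diagnosis", "Diabetes Mellitus"))

print(absolute.events_between("2005-10", "admission"))
```

Keeping a list per node pair is what makes this a multigraph: a drug and a diagnosis that share the same start and end markers remain two separate edges.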


Evaluation: Training/test sets 1300/6200

• Average Primitive events per Discharge Letter: 20.69
• In the training/test set: 371/565 different diagnoses (patients have similar diagnoses and treatments)
• In the test set:
  – 1,349 dates (day/month/year),
  – 2,698 markers (year and/or month only),
  – 2,362 markers for relative time periods
  – 2,351 concerning the admission date
• Distribution of temporal markers:
  – 38% to events presenting diagnoses
  – 47% to events expressing drug admission / change
  – 15% to complaints and conditions


Accuracy              Precision %   Recall %   F %
Event   Drugs         97.28         99.59      98.42
        Diagnoses     97.30         74.68      84.50
        Complaints    97.98         96.82      97.40
Time    Dates         98.86         98.21      98.53
        Duration      99.14         98.26      98.70
        Frequency     92.25         95.51      93.85


Current work

<Pay>1</Pay>
<Patient>
  <EGN>29d53d021a8ea04f8a58b0b7b17ca901d471c111</EGN> <!-- pseudonym -->
  <RZOK>22</RZOK>
  <ZdrRajon>01</ZdrRajon>
  ……
  <age>68</age>
  <gender>2</gender>
</Patient>
<MainDiag>
  <imeMD>Неинсулинозависим захарен диабет с неврологични усложнения</imeMD> <!-- Non-insulin-dependent diabetes mellitus with neurological complications -->
  <MKB>E11.4</MKB>
</MainDiag>
<Diag>
  <imeD>Диабетна полиневропатия (Е10-Е14 с общ четвърти знак .4)</imeD> <!-- Diabetic polyneuropathy (E10-E14 with common fourth character .4) -->
  <MKB>G63.2</MKB>
</Diag>
<Diag>
  <imeD>Тиреоидит, неуточнен</imeD> <!-- Thyroiditis, unspecified -->
  <MKB>E06.9</MKB>
</Diag>
<Diag>
  <imeD>Хипертонична болест на сърцето</imeD> <!-- Hypertensive heart disease -->
  <MKB>I11</MKB>
</Diag>
<Diag>
  <imeD>Стенокардия</imeD> <!-- Angina pectoris -->
  <MKB>I20</MKB>
</Diag>


Data Mining of multiple visits


Possible findings

• When diabetic patients come a second time for control examinations, what is the reason for the worsened lab test results?
• Does it depend on the drugs? (Giving more expensive drugs does not always mean better compensation.)
• Grouping patients by: gender, age, region, drugs, accompanying diseases …

NLP delivers entities to feed data mining experiments
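Once entities have been extracted per visit, grouping is a plain aggregation step. The sketch below uses invented visit records with hypothetical field names and drug values, purely to show the shape of the operation; it does not reproduce any of the reported experiments.

```python
from collections import defaultdict

# Hypothetical visit records, shaped like entities an extractor might deliver.
visits = [
    {"patient": "p1", "gender": 2, "age": 68, "drugs": ["insulin"]},
    {"patient": "p1", "gender": 2, "age": 68, "drugs": ["insulin", "metformin"]},
    {"patient": "p2", "gender": 1, "age": 60, "drugs": ["metformin"]},
]


def group_visits(records, key):
    """Group visit records by the value of one field."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    return dict(groups)


by_patient = group_visits(visits, "patient")
# Patients with more than one visit are the ones whose results can be compared over time.
repeat_patients = sorted(p for p, v in by_patient.items() if len(v) > 1)
print(repeat_patients)
```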


Conclusion

• Diabetes is a relatively compact subdomain
• The structure of clinical texts helps substantially
• It is possible to learn concepts (attribute-value pairs) from raw texts
• NLP accuracy >90% enables meaningful big data experiments


Thanks a lot for your attention

Тенкс а лот фор юър атеншън (English transliterated to Bulgarian)


Acknowledgements

• AComIn (Advanced Computing for Innovation), FP7-REGPOT-2012-2013-1 grant 316087

• PSIP (Patient Safety through Intelligent Procedures in medication), FP7 ICT eHealth grant 216130

• EVTIMA (Effective search of conceptual information with applications in medical informatics), Bulgarian National Science Fund DO 02-292