TASS2018: Medical knowledge discovery by combining ...ceur-ws.org/Vol-2172/p10_upf_tass2018.pdf · TASS2018: Medical knowledge discovery by combining terminology extraction techniques

TASS2018: Medical knowledge discovery bycombining terminology extraction techniques with

machine learning classification

TASS2018: Obtencion de conocimiento medico mediantecombinacion de tecnicas de extraccion de terminologıas y

clasificacion basada en aprendizaje automatico

Jorge Vivaldi Palatresi1, Horacio Rodrıguez Hontoria2

1Universitat Pompeu Fabra, Barcelona, Spain2Universidad Politecnica de Catalunya, Barcelona, Spain

[email protected] [email protected]

Resumen: En este artıculo presentamos la aproximacion seguida por el equipoUPF-UPC en la tarea TASS 2018 Task 3 challenge. Nuestra aproximacion puedecalificarse, de acuerdo a los codigos propuestos por la organizacion, como H-KB-S, ya que utiliza metodos basados en conocimiento y aprendizaje supervisado. Elpipeline utilizado incluye: i) Un pre-proceso standard de los documentos usandoFreeling (etiquetado morfosintactico y analisis de dependencias); ii) El uso de unaherramienta de etiquetado sequencial basada en CRF para completar las subtareasA (identificacion de frases) y B (clasificacion de frases), y iii) El abordaje de lasubtarea C (extraccion de relaciones semanticas) usando una aproximacion hıbridaque integra dos classificadores basados en Regresion Logıstica, y dos extractoreslexicos para pares entity/entity y relaciones is-a y same-as.Palabras clave: obtencion de conocimiento medico, terminologıa medica, identifi-cacion de relations semanticas

Abstract: In this paper we present the procedure followed to complete the runsubmitted by the UPF-UPC team to the TASS 2018 Task 3 challenge. Such pro-cedure may be classified, according the organization’s codes, as H-KB-S as it takesprofit from a knowledge based methodology as well as some supervised methods.Our pipeline includes: i) A standard pre-process of the documents using Freelingtool suite (POS tagging and dependency parsing); ii) Use of a CRF sequence la-belling tool for completing both subtasks A (key phrase identification) and B (keyphrase classification), and iii) Facing the subtask C (setting semantic relationships)by using a hybrid approach that uses two Logistic Regression classifiers, followed bylexical shallow relation extractors for entity/entity pairs related by is-a and same-asrelations.Keywords: health knowledge discovery, terminology extraction, identification ofsemantic relations

1 Introduction

Text mining and natural language processing(NLP) techniques have been applied to thebiomedical domain for a long time. Auto-matic identification of relevant terms in med-ical texts (research and educational materialas well as medical reports) and how they re-late each other represent a major improve-ment for indexing and for search tools. Itsresults are useful for research as well as clin-ical and educational purposes.

In this paper we present two tools for fac-ing the tasks proposed in the TASS-2018-

Task 3 challenge: eHealth Knowledge Dis-covery. The first one is a term extractiontool for finding terminologically relevant sub-strings (key phrases) and classify them in oneof the two classes proposed by the organiza-tion: Concept or Action. The second oneis dedicated to recognize those semantic re-lationships chosen by the Organization be-tween the recognized entities.

TASS 2018: Workshop on Semantic Analysis at SEPLN, septiembre 2018, págs. 89-95

ISSN 1613-0073 Copyright © 2018 by the paper's authors. Copying permitted for private and academic purposes.

2 TASS 2018 Task 3

2.1 Description of TASS 2018Task 3

The corpus for this competition was com-piled by sampling XML files producedby the National Library of Medicine,the world’s largest medical library. Itbrings information about diseases, condi-tions, and wellness issues in understandablelanguage. The full collection is available athttps://medlineplus.gov/xml.html.

Given a collection of eHealth documentswritten in Spanish, TASS 2018 Task 3 hasbeen conceived as a three task for three dif-ferent scenarios. Each task may be describedas follows:

A) To identify all the key phrases per docu-ment;

B) To assign a label (Concept or Action) toeach of the key phrases;

C) to link the entities detected and labelledin each document through the followingsemantic relationships:

(a) Concept-Concept: is-a, part-of,property-of and same-as;

(b) Action-Concept: subject and target.

The output of each task is the input of thenext one. Proceeding in this way the orga-nization has considered the following threeevaluation scenarios:

1. Only plain text is given (Subtasks A, Band C must be completed);

2. Plain text and manually annotated keyphrase boundaries are given (Subtasks Band C must be completed);

3. Plain text with manually annotated keyphrases and their types are given (onlySubtask C must be completed).

More details about the tasks and scenariosmay be obtained through the web site of theTASS-2018 competition1 and the overviewpaper (Martınez-Camara et al., 2018).

2.2 Our approach to TASS 2018Task 3

After downloading the documents we haveprocessed them using Freeling tool suite2,

1http://www.sepln.org/workshops/tass/2018/task-3/

2http://nlp.lsi.upc.edu/freeling/

(Padro and Stanilovsky, 2012). We have usedbasically tokenization, and POS tagging forsubtasks A and B and EWN3 tagging and de-pendency and constituency parsing for sub-task C. In this section we present some de-tails about the full system designed for thesetasks. Figure 1 shows the overall scheme. Itpresents the main modules and its intercon-nection to complete the full task proposed inthis competition.

Figure 1: Full system architecture.

2.3 Subtasks A & B: key phraseidentification and classification

For subtasks A and B we proceeded jointlyusing a single tool that is able to select theset of nominal term candidates, TC, includedin the text under analysis. This approachis based in YATE, an in house term extrac-tor that has been tuned for treating medicaltext. See (Vivaldi, 2001) and (Vivaldi andRodrıguez, 2010) for a full description.

Term extraction can be seen as semanticannotation task because it provides machine-readable information based on meaning. Theway to attack the problem varies accord-ing the available resources for each language.Some languages (mainly English) disposes oflexical resources (like ontologies and/or termrepositories) that can be used for referencewhile other languages have to identify termwithin text using other procedures that in-clude linguistic/statistical strategies.

3Freeling can provide the set of possible synsetsfor each token when a WordNet is available. In thecase of Spanish EuroWordNet is used.

Jorge Vivaldi Palatresi y Horacio Rodríguez Hontoria

90

YATE is a hybrid system whose key-points are: i) the combination of heteroge-neous detection strategies and ii) the use se-mantic knowledge in such strategies. Ini-tially, it used the lexical ontology EuroWord-Net (Vossen, 2004) (EWN) and since recentlyits evolution: MCR 3.0 (Gonzalez-Agirre, La-parra, and Rigau, 2012). In order to obtaindomain terms, we mark on this resource somedomain borders4.

After applying standard linguistic analysisprocedures, it starts by extracting a basic listof TCs. Such candidates are then analysedusing a collection of heterogeneous methods.

A first method to evaluate the termhoodof any candidate is its domain coefficient. Itis calculated using the above mentioned do-main borders and indicates in which degreea given TC belong to the domain of interest(medicine in this case).

Other methods included in YATE arebased on statistical information, context in-formation and the result to decompose a TCin its graeco-latin components. In the origi-nal tool all these informations were combinedusing a voting scheme or a boosting algo-rithm. For this competition, we decided toadd information from Snomed-CT 5 and also,due to its specific requirements (analysing ev-ery mention of each nominal or verbal TC),we modify the original term extraction tooland chose to use a CRF classifier6 for com-bining all the available information and topredict the BIO tag7 for each token. In or-der to select which pieces of information aregiven as an input to the model a template hasto be built for each TC. It allows to take intoaccount both the features associated to thetarget token as well as the ones of its neigh-bours. The set of features chosen for eachword in this task was the following:

• The lemmas of the target token and ofthe tokens appearing in a size-2 windowaround it;

4A domain border is defined as an EWN synsetlikely belonging itself and its descendants to such do-main (eg. disease, bodypart and medical procedureamong many others).

5A comprehensive and well known clinical healthterminology available for several languages includingSpanish, https://www.snomed.org/

6CRF++, https://taku910.github.io/crfpp/7BIO is a popular way of tagging tokens for de-

tecting useful sequences, B stands for beginning of asequence, I for inside it and O for out of it.

• The reduced POS tag8 of the target to-ken and the tokens appearing in a size-2window around it;

• The domain border to each of the tokensof the string detected as a TC ;

• The main class for those TC includedin Snomed-CT9. This information is ap-plied to all the tokens of the string de-tected as a TC ;

• The first/last three letters of each token.

2.4 Subtask C: Setting semanticrelationships

For facing the subtask C we learned twomulti-class classifiers, one for action/conceptrelations and the other for concept/conceptrelations. We used a simple LR Logistic Re-gression model10 for both tasks with the sameset of features, detailed below, but for learn-ing two different classifiers. For defining thefeature set we performed an initial learningprocess from the training documents consist-ing on the following steps:

• Collecting the whole set of correct en-tities. Computing the tf *idf and sort-ing the collection by descending tf *idfweight.

• Decomposing the multi-word terms inthe previous collection into atomic com-ponents. Computing also their tf *idfweight and sorting them accordingly.

• Extracting the shapes of all the cor-rect entities. We consider two types ofshapes, long and short, the long shapesimply maps the characters occurring inthe term into a set of tags11 while theshort shape groups together sequences ofidentical tags. For instance, for the term”DM-2” the long shape is ”AA*0” andthe short one is ”A*0”. We also obtainedan histogram of the length of entities intokens, in order to constraint the lengthof the generated candidates.

8The first character of the label.9Snomed-ct is organized as a tangled taxonomy

with 19 top classes. We have used these top classesas tagset.

10using Scikit-learn package.11”A” stands for upper case letter, ”a” for lower

case, ”0” for number, ” ” for the space, and ”*” forother characters.

TASS2018: Medical Knowledge Discovery by Combining Terminology Extraction Techniques with Machine Learning Classification

91

• Collecting all the labels occurring in thedependency trees of all the sentences inthe training documents.

• Collecting histograms of the POS ap-pearing as initial, middle, and endingtokens in valid multi-word terms, andthose appearing in single-word terms.From these collections those POS occur-ring under a threshold were removed.For instance, in the initial set (re-sulting in 26 POS) the most frequentPOS was ”NCMS000”12, that appears90 times, in the middle set (18 POS)”DA0FS0”13 occurred 20 times, in thesingle set (65 POS) ”NCFS000” oc-curred 316 times, and in the endingset (10 POS), ”NCFS000”14 occurred 77times.

It is worth noting that all these collectionshave been extracted independently for thetwo settings, Concept-Concept and Action-Concept. Once obtained these collections weperformed a feature selection process usingthe development corpus, looking for differentset of features (which we name ”configura-tions”). The most reliable configuration wasthe following:

• A vector of the 1,000 most relevant lem-mas (using tf *idf weigth) for both enti-ties in the relation15.

• A vector of the 200 most relevant atomicwords (using tf *idf weigth) for both en-tities in the relation.

• A vector of the labels of the path be-tween the two entities (is existing) inthe dependency tree of the involved sen-tence. 32 labels occurred in the trainingset in the Concept-Concept case.

• A vector of the long word shapes for bothentities. 51 different long word shapeswere detected in the training set in theConcept-Concept case.

• A vector of the short word shapes forboth entities. 13 different short wordshapes were detected in the training setin the Concept-Concept case.

12Common noun, masculine, singular13Determiner article, feminine, singular14Common noun, feminine, singular15All the feature vectors are used as Boolean indi-

cators.

• The distance in tokens between the twoentities.

• The distance in characters between thetwo entities.

• The length in tokens of both entities upto a maximum of 5 tokens.

• Whether the entities are simple or com-plex.

As samples for learning and testing weconsider as positive examples all the pairs ofentities from subtask B occurring in the samesentence and the text between them. For thefirst classifier one of the entities has to be anaction and the other a concept, for the secondone both entities have to be concepts. Fornegative examples we used pairs of entities sothat only one element of the pair occurred asan entity in subtask B and the other shouldsatisfy the POS constraints obtained above.

Besides these two classifiers we have ap-plied two other simpler lexical relation ex-tractors for some specific relations:

• We looked in each sentence for the co-occurrence of two entities being one ofthem an acronym and the other its longform. If it was the case the two enti-ties are tagged as related by a same asrelation, (Montalvo et al., 2017).

• We looked for pairs of compatible enti-ties where different degrees of compati-bility were considered: equal word form,equal lemma, EWN synsets overlapping,approximate string matching, etc. In thecase of compatibility same as or is a re-lations are set. The latter is set for thecase of being one concept an hyponym ofthe other in EWN or when one conceptis a prefix of the other.

3 Experimentation

In order to adjust the behaviour of our sys-tem to the specificities of this competition weuse the package develop also provided by theorganization. In this context we adjust someparameters to the characteristics of the textto be processed. In particular, for subtasks Aand B, we reviewed the set of domain bordersalready defined adding three new ones.

For the task C a summary of the featuresused by the Concept-Concept logistic classi-fier is shown in Table 1. Each candidate isrepresented as a vector of float in a 2,578


92

dimensional space. Note that some of thefeatures, as the dependency labels or the dis-tances are applied to the pair, while othersto both entities in the pair.

In our setting we generated 1,700 exam-ples, from which 994 positives. Figures forthe Action-Concept classifier are similar.

Table 1: Features used for the logistic classi-fier Concept-Concept

Feature Quantity

lemmas 2,000atomic 400labels DT 36long shapes 102short shapes 26length 10distance tk 1distance ch 1is simple 2

Total 2,578

We also performed some experimentationin obtaining some is-a relation for multi-word terms. Consider for example theTC sındrome de Marfan, if the nucleus ofthis term (sındrome) has been validated forYATE it is reasonable to consider that thefollowing semantic relation exists: sındromede Marfan →is-a →sındrome. This proce-dure reach a precision of 66 %.

4 Results

The evaluation results obtained by the abovedescribed system and delivered to the orga-nization for the evaluation are quantified inTable 2. This table shows both of our resultsas well as the baseline and best scores16 17.As can be observed in Table 2 our results forsub-tasks A and B are at least acceptable asthey are above the baseline. Overall we wereranged on the third position, with a globalscore of 0.446 (0.464 and 0.461 for the firstand second teams). Unfortunately such re-sults for sub-task C in two of the scenarios arebellow such baseline and this fact does not

16Note that the best score for each scenario cor-respond to the team with the best global result foreach scenario. It means that it considers all subtasksfor such scenario. This explains, for instance, thatfor the task C in scenario 1 the best scores are set tozero.

17Detailed results may be checked at the web siteof the TASS-2018 competition. See footnote 1.

satisfy our expectation. For such reason wedecided to revise in depth our procedures andalgorithm for such task and we found a mal-function for this task. After solving such bugand performing some minor improvement forsub-tasks A and B, we run again all our sys-tem and evaluate the new results but using inthis case the script provided by the Organi-zation. The results obtained after correctingour procedures are shown in Table 3. Suchtable shows a clear improvement in the re-sults obtained for Task C.

5 Discussion

In examining the text to be analysed wefound some sentences whose inclusion in ahealth related corpus is not clear as theyseems to be quite out of the domain. Ex-ample 1 shows a clear example of this kind ofsentences.

(1) El CO se encuentra en el humo de lacombustion, como lo es el expulsadopor automoviles y camiones, cande-labros, estufas, fogones de gas y sis-temas de calefaccion.

In this example, the annotator tags Con-cepts like ”CO”, ”humo” and ”fogones”among others. Also it was tagged an is-arelation involving ”humo de la combustion”and ”humo”. It is not clear the reason thatsuch Concepts and Relations are consideredrelevant in a Spanish health document. Theconsequence is that our term extractor iden-tify such units but does not validated themas relevant in the domain. As a matter offact, most of the evaluated as missing havesimilar characteristics (such as: ”proveedor”,”insecticidas”, ”pintura”, ...).

Training phase has been completed usingonly those files provided by the organization.As YATE only obtain nominal terms. Werely in the training phase for obtaining Ac-tion terms.

In order to show the behaviour of the twomain parts of system (YATE and semanticrelationship detector) we present some addi-tional information about such modules.

Table 4 shows the amount of different termcandidates that has been validated and dis-carded for the text provided for scenario 1.It also shows the details for each term candi-date analyser. It should be noted that a sin-gle candidate may be successfully evaluatedby more than one analyser (ex. ”botulismo”,”diureticos” and ”psoriasis” among others).


93

Table 2: Official evaluation results (ours plus baseline plus the best for each scenario)

Task MetricScenario 1 Scenario 2 Scenario 3

Base. Best Ours Base. Best Ours Base. Best Ours

APrecision 0.673 0.862 0.862 – – – – – –

Recall 0.536 0.882 0.755 – – – – – –

F1 0.597 0.872 0.806 – – – – – –

B Accuracy 0.932 0.959 0.945 0.774 0.931 0.954 – – –

CPrecision 0.262 0 0.18 0.676 0.487 0.431 0.714 0.506 0.263

Recall 0.022 0 0.063 0.093 0.431 0.062 0.058 0.402 0.019

F1 0.041 0 0.093 0.163 0.458 0.109 0.107 0.448 0.036

Table 3: Unofficial evaluation of our resultsfor all scenarios and tasks

Task InformationScenario

1 2 3

APrecision 0.91 – –

Recall 0.79 – –

F1 0.85 – –

BPrecision 0.86 0.85 –

Recall 0.76 0.76 –

F1 0.81 0.80 –

CPrecision 0.42 0.43 0.39

Recall 0.20 0.21 0.18

F1 0.27 0.28 0.25

Table 4: Behaviour of YATE in Scenario1

Validated by Quantity

... domain coefficient 55

... MCR database 137

... graeco-latins formants 19

... Snomed CT 181

Total accepted 212Total discarded 59

Table 5 shows the amount of different re-lation candidates. We include the candi-dates generated by our LR classifiers (boththe action-concept and the concept-conceptones) and the candidates generated by ourlexical extractors. It is worth noting thatno acronym/expansion pair was found in thetest dataset, and, so, only one lexical extrac-tor is included.

Table 5: Labels assigned by the relation ex-tractor

Classifier Label Quant.

LR action-concept subject 49LR action-concept target 97LR concept-concept same-as 1LR concept-concept is-a 24LR concept-concept part-of 11LR concept-concept property-of 22lexical classifier same-as 105lexical classifier is-a 181

Total LR - 204

Total Lexical - 286

6 Conclusions

We have presented the approach followed bythe UPF-UPC team to the TASS 2018 Task3 challenge. Our official results were not badfor the tasks A and B but were deceiving fortask C. The results were consistent amongthe three scenarios with no clear improve-ment, as could be expected, from the firstto the third. Anyway this problem seems tobe general.

In our post-challenge analysis we detecteda serious bug on the generation of our runfor task C. Correcting it we improved the F1for task C in the first scenario from 0.093 to0.27. This is a clear improvement but thescore continues to be small.

Using the YATE based approach seems tobe a valuable way of facing tasks A and B.

The results of task C are clearly deceiv-ing. Specially the recall score is very low,precision is closer to the winner.

Some issues detected merit to perform amore in depth analysis that we propose as


94

future work:

• Improve YATE accuracy when dealingwith Actions.

• Improve the integration of SNOMEDCT in YATE architecture

• Analyse why there is no clear improve-ment when moving from scenario 1 to 3.

• Analyse the contribution of the differentfeatures to the tasks A, B, and C.

• Analyse the poor recall of the two LRclassifiers for task C

• In the case of task C we have used twolinear multi-class classifiers LR. The setof features used was rather big and as thenumber of training samples was smallthe learning process was prone to over-fitting. We plan to test other non linearclassifiers and to reduce the number offeatures, specially the lexical ones, lem-mas and atomic forms, probably usingembeddings learned on the medical do-main.

• Analyse the performance of the two LRclassifiers separately. Only the overallresults have been considered. The factthat basically the same feature set wasused for learning the two classifiers wasnot the better solution.

• Taking into account that the number ofclasses in task C is rather small, mov-ing from multi-class to binary classifica-tion, i.e learning one classifier per class,should be considered.

Acknowledgements

This work was partially supported bythe projects TERMMED (FFI2017-88100-P, MINECO) and GRAPHMED (TIN2016-77820-C3-3R).

References

Gonzalez-Agirre, A., E. Laparra, andG. Rigau. 2012. Multilingual centralrepository version 3.0: upgrading a verylarge lexical knowledge base. In Pro-ceedings of the Sixth International GlobalWordNet Conference (GWC’12)., Mat-sue, Japan.

Martınez-Camara, E., Y. Almeida-Cruz,M. C. Dıaz-Galiano, S. Estevez-Velarde,

M. A. Garcıa-Cumbreras, M. Garcıa-Vega, Y. Gutierrez, A. Montejo-Raez,A. Montoyo, R. Munoz, A. Piad-Morffis, and J. Villena-Roman. 2018.Overview of TASS 2018: Opinions,health and emotions. In E. Martınez-Camara, Y. Almeida Cruz, M. C.Dıaz-Galiano, S. Estevez Velarde, M. A.Garcıa-Cumbreras, M. Garcıa-Vega,Y. Gutierrez Vazquez, A. Montejo Raez,A. Montoyo Guijarro, R. Munoz Guillena,A. Piad Morffis, and J. Villena-Roman,editors, Proceedings of TASS 2018: Work-shop on Semantic Analysis at SEPLN(TASS 2018), volume 2172 of CEURWorkshop Proceedings, Sevilla, Spain,September. CEUR-WS.

Montalvo, S., M. Oronoz, H. Rodrıguez, andR. Martınez. 2017. Biomedical abbrevia-tion recognition and resolution by prosa-med. In Proceedings of the Workshop onEvaluation of Human Language Technolo-gies for Iberian Languages, pages 247–254,Murcia, Spain, September. SEPLN.

Padro, L. and E. Stanilovsky. 2012. Freel-ing 3.0: Towards wider multilinguality.In Proceedings of the Language Resourcesand Evaluation Conference (LREC 2012),Istanbul, Turkey, May. ELRA.

Vivaldi, J. 2001. Extraccion de Candidatosa Termino mediante combinacion de es-trategias heterogeneas. Ph.D. thesis, De-partment of Computer and InformationScience, Politechnical University of Cat-alonia, Barcelona, Spain, 06.

Vivaldi, J. and H. Rodrıguez. 2010. Us-ing Wikipedia for term extraction in thebiomedical domain: first experience. InProcesamiento del Lenguaje Natural, vol-ume 45, pages 251–254.

Vossen, P. 2004. EuroWordNet: a multi-lingual database of autonomous and lan-guage specific wordnets connected via anInter-Lingual-Index. International. vol-ume 17, pages 161–173.


95

Documents

TASS2018: Medical knowledge discovery by combining ...ceur-ws.org/Vol-2172/p10_upf_tass2018.pdf · TASS2018: Medical knowledge discovery by combining terminology extraction techniques