Presentation on finding relation among medical terms


Finding Semantic Relationships Among Associated Medical Terms

Submitted by: Manisha Singh (111497), Sneha Bairagi (111717), Abhinav Rai (511004)

Introduction

With the continuing digitisation of medical knowledge, information extraction tools are becoming more and more important for practitioners in the medical domain. This project tackles the extraction of semantic relationships from medical texts, focusing on disease-medicine co-occurrence relationships in the literature. A large-scale, accurate list of drug-disease treatment pairs derived from published biomedical literature can be used for drug repurposing.

Proposed System

Information extraction is the identification of specific information in unstructured data sources, such as natural language text. The system performs two tasks. The first identifies and extracts informative sentences on disease and treatment topics. The second performs a finer-grained classification of these sentences according to the semantic relation that exists between diseases and treatments.

Implementation

The steps involved are:

1. Obtain documents containing medical data from the web.
2. Perform tokenization.
3. Perform stemming.
4. Perform POS tagging.
5. Perform annotation.
6. Find disease-treatment pairs using pattern matching.

Tokenization

Tokenization is the process of breaking the given text into units called tokens. A token may be a word, a number or a punctuation mark. Tokenization works by locating word boundaries, that is, the points where one word ends and the next begins. It is also known as word segmentation.
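A minimal sketch of such a tokenizer is shown below. The regular expression and the `tokenize` name are illustrative assumptions, not the project's actual implementation, and the pattern handles ASCII text only.

```python
import re

# Assumption: a token is a word (optionally with internal hyphens or
# apostrophes), a number (optionally with a decimal part), or a single
# punctuation mark. ASCII letters only.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)*|\d+(?:\.\d+)?|[^\w\s]")


def tokenize(text):
    """Split text into word, number and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("Aspirin 81 mg is used in the treatment of heart disease."))
```

Whitespace is used only to locate boundaries and is not returned as a token.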

Stemming

Stemming is the term used in linguistic morphology and information retrieval to describe the process of reducing inflected words to their word stem, base or root form, generally a written word form.
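To make the idea concrete, the sketch below shows two very simple stemmers: truncation to the first n characters, and naive suffix stripping. The suffix list and function names are assumptions chosen for illustration. This is not Porter's algorithm, which applies a much richer set of ordered rewrite rules.

```python
def truncate_stem(word, n=5):
    """Truncate(n): keep only the first n characters of the word."""
    return word[:n]


# Assumption: a short illustrative suffix list, longest first.
SUFFIXES = ("ments", "ment", "ings", "ing", "ed", "es", "s")


def strip_suffix(word, min_stem=3):
    """Remove the first matching suffix if at least min_stem characters remain."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem:
            return lowered[: -len(suffix)]
    return lowered


for w in ["treatments", "treated", "treating", "treats"]:
    print(w, truncate_stem(w), strip_suffix(w))
```

Both functions map the four inflected forms in the loop to the same stem, which is what lets later steps count them as one term.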

Existing stemming algorithms include Truncate(n), the Lovins stemmer, the Dawson stemmer and the Porter stemmer. We use the Porter stemmer.

POS Tagging

Part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context, i.e. its relationship with adjacent and related words in a phrase, sentence or paragraph. In short, it is the process of assigning a part of speech to each word in a sentence.
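The toy tagger below illustrates why context matters: the same word can receive different tags depending on its neighbour. The lexicon, the tag names and the single context rule are all assumptions made for this sketch. It is a hand-written lookup tagger, not a statistical tagger.

```python
# Assumption: a toy lexicon mapping each word to its candidate tags,
# most frequent tag first. A real tagger learns this from a tagged corpus.
LEXICON = {
    "the": ["DET"],
    "a": ["DET"],
    "drug": ["NOUN"],
    "can": ["AUX", "NOUN"],
    "treat": ["VERB", "NOUN"],
    "patients": ["NOUN"],
}


def tag(tokens):
    """Tag tokens using the lexicon plus one contextual rule."""
    tagged = []
    for token in tokens:
        # Unknown words default to NOUN.
        candidates = LEXICON.get(token.lower(), ["NOUN"])
        choice = candidates[0]
        # Context rule: directly after a determiner, prefer the NOUN reading.
        if tagged and tagged[-1][1] == "DET" and "NOUN" in candidates:
            choice = "NOUN"
        tagged.append((token, choice))
    return tagged


print(tag(["The", "drug", "can", "treat", "patients"]))
print(tag(["a", "treat"]))
```

In the first sentence "treat" is tagged as a verb. In the second it follows a determiner and is tagged as a noun.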

Viterbi Algorithm

Given:
a) start state: s1
b) alphabet: A = {a1, a2, ..., an}
c) set of states: S = {s1, s2, ..., sn}
d) transition probabilities

Data structures

1. An N*T array called SEQSCORE, which always holds the score of the winning sequence (N = number of states, T = length of the output sequence).
2. Another N*T array called BACKPTR, used to recover the path.

Steps

1. Initialization

SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
for i = 2 to N do
    SEQSCORE(i,1) = 0.0

(This expresses the fact that the first state is s1.)

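2. Iteration

for t = 2 to T do
    for i = 1 to N do
        SEQSCORE(i,t) = max over j = 1..N of [SEQSCORE(j,t-1) * P(sj --ak--> si)]
        BACKPTR(i,t) = the index j that gives the max above

Here P(sj --ak--> si) is the probability of moving from state sj to state si while producing ak, the output symbol at step t.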

3. Sequence identification

C(T) = the i that maximizes SEQSCORE(i,T)
for i from (T-1) to 1 do
    C(i) = BACKPTR[C(i+1), (i+1)]
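The three steps above can be sketched in Python as follows. The function signature and the dictionary layout for the transition probabilities are assumptions made for this sketch. The probability values in `TRANS` are an assumed assignment of arc labels to a two-state machine, chosen to be consistent with the probability table in the example that follows. Indices are 0-based here, whereas the slides count states from 1.

```python
def viterbi(states, start, trans, outputs):
    """Return (best state sequence, SEQSCORE, BACKPTR) for an output sequence.

    trans[(src, symbol, dst)] is the probability of moving from src to dst
    while producing symbol. Column 0 holds the situation before any output.
    """
    n, t_len = len(states), len(outputs) + 1
    seqscore = [[0.0] * t_len for _ in range(n)]
    backptr = [[None] * t_len for _ in range(n)]
    seqscore[states.index(start)][0] = 1.0  # initialization: start state is certain

    # Iteration: extend the best-scoring path into every state at every step.
    for t in range(1, t_len):
        symbol = outputs[t - 1]
        for i, dst in enumerate(states):
            scores = [seqscore[j][t - 1] * trans.get((src, symbol, dst), 0.0)
                      for j, src in enumerate(states)]
            best_j = max(range(n), key=lambda j: scores[j])
            seqscore[i][t] = scores[best_j]
            backptr[i][t] = best_j

    # Sequence identification: best final state, then follow the back-pointers.
    path = [max(range(n), key=lambda i: seqscore[i][t_len - 1])]
    for t in range(t_len - 1, 0, -1):
        path.append(backptr[path[-1]][t])
    path.reverse()
    return [states[i] for i in path], seqscore, backptr


# Assumed transition probabilities (source state, output symbol, target state).
TRANS = {
    ("s1", "a1", "s1"): 0.1, ("s1", "a2", "s1"): 0.2,
    ("s1", "a1", "s2"): 0.3, ("s1", "a2", "s2"): 0.4,
    ("s2", "a1", "s1"): 0.2, ("s2", "a2", "s1"): 0.3,
    ("s2", "a1", "s2"): 0.3, ("s2", "a2", "s2"): 0.2,
}

path, seqscore, backptr = viterbi(["s1", "s2"], "s1", TRANS, ["a1", "a2", "a1", "a2"])
print(path)
for row in seqscore:
    print([round(v, 4) for v in row])
```

Under these assumed probabilities the most likely state sequence for the output a1 a2 a1 a2 is s1, s2, s1, s2, s1.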

Example

State diagram: two states, s1 and s2, with arcs labelled [output symbol, probability]. The arc labels are [a1, 0.3], [a1, 0.3], [a2, 0.4], [a1, 0.2], [a1, 0.1], [a2, 0.2], [a2, 0.2] and [a2, 0.3].

Tabular representation

Probability table:

      ε     a1    a2    a1     a2
s1    1.0   0.1   0.09  0.012  0.0081
s2    0.0   0.3   0.06  0.027  0.0054

BACKPTR table:

      ε   a1  a2  a1  a2
s1    0   1   2   2   2
s2    -   1   2   1   2

Annotating Corpora and Searching Patterns

Sentences are tagged with disease entities from the clean disease lexicon and drug entities from the drug list. The following patterns, among others, are then searched for between the disease and the drug:

- in
- in the treatment of
- for
- in patients with
- for the treatment of
- treatment of
- therapy for
- therapy in

Algorithm

Input: disease, rules.
Output: medicine, semantic relationship.

1. For the given disease, extract papers from MEDLINE.
2. Tokenize the document.
3. Remove all stopwords.
4. Perform stemming.
5. Perform POS tagging to separate out the required parts of speech.
6. Convert the corpus to an annotated corpus.
7. From the annotated sentences, extract those containing at least one medicine and one disease.
8. Search for a pattern between the disease and the medicine.
9. Associate the medicines with the disease and rank them by frequency and superiority.
10. Present the semantic relationships to the user.
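Steps 7 to 9 of the algorithm above can be sketched as follows. The lexicons, the sample sentences and the function names are hypothetical and serve only as illustration. The sketch assumes the drug appears first, followed directly by the connecting phrase and then the disease. It ranks by frequency only, since the slides do not define how superiority is measured.

```python
import re
from collections import Counter

# Assumptions: tiny hypothetical lexicons, for illustration only.
DISEASES = ["hypertension", "asthma"]
DRUGS = ["lisinopril", "salbutamol", "aspirin"]
# Connecting phrases from the slide, longest first so specific ones are tried first.
PATTERNS = sorted(
    ["in", "in the treatment of", "for", "in patients with",
     "for the treatment of", "treatment of", "therapy for", "therapy in"],
    key=len, reverse=True)


def alternation(terms):
    """Build a regex alternation that matches any of the given literal terms."""
    return "|".join(re.escape(term) for term in terms)


# Assumed shape of a match: DRUG <pattern> DISEASE, separated by whitespace.
PAIR_RE = re.compile(
    rf"\b(?P<drug>{alternation(DRUGS)})\s+(?P<pattern>{alternation(PATTERNS)})"
    rf"\s+(?P<disease>{alternation(DISEASES)})\b",
    re.IGNORECASE)


def extract_pairs(sentences):
    """Count (disease, drug) pairs joined by one of the connecting phrases."""
    counts = Counter()
    for sentence in sentences:
        for match in PAIR_RE.finditer(sentence):
            counts[(match["disease"].lower(), match["drug"].lower())] += 1
    return counts


def rank_drugs(counts, disease):
    """Return (drug, frequency) pairs for a disease, most frequent first."""
    per_drug = Counter({drug: c for (d, drug), c in counts.items() if d == disease})
    return per_drug.most_common()


# Made-up sentences standing in for text retrieved from the literature.
SENTENCES = [
    "Lisinopril in the treatment of hypertension was well tolerated.",
    "We evaluated lisinopril for hypertension in adults.",
    "Aspirin in patients with hypertension was also studied.",
    "Salbutamol therapy for asthma is widely used.",
    "Hypertension is common.",
]

pair_counts = extract_pairs(SENTENCES)
print(rank_drugs(pair_counts, "hypertension"))
```

Sentences that lack either a drug or a disease, such as the last one, contribute nothing to the counts.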

Hardware Requirements

Processor: Pentium IV
RAM: 256 MB
Hard disk: 40 GB

Software Requirements

Front end: Java Swing
Operating system: Windows XP/7
Tool: Eclipse

Deliverables

- Rapid access to information regarding potential immunizations.
- Medicines ranked on the basis of their frequency.
- Can be used in medicine repurposing.
- Can provide doctors with knowledge about new drugs available for a disease by processing biomedical literature and clinical trial studies.

Extension Possibility

The system can be extended to extract information regarding the cure, symptoms and prevention of a disease. It could help in finding the root cause of a disease and then, by taking the patient's history or condition into account, in recommending a suitable dose. This would rely on examining the composition of a medicine and checking it against the patient's report to determine whether it suits the patient.

References

- Rong Xu and QuanQiu Wang, Large-scale extraction of accurate drug-disease treatment pairs from biomedical literature for drug repurposing, 2013.
- Fadi Yamout, Further Enhancement to the Porter's Stemming Algorithm, 2006.
- S. Ray and M. Craven, Representing sentence structure in Hidden Markov Models for information extraction, Proceedings of IJCAI-2001.
- M. S. Ryan and G. R. Nudd, The Viterbi Algorithm, Department of Computer Science, University of Warwick, Coventry, England, 1993.
- Jesse Davis and Mark Goadrich, The Relationship Between Precision-Recall and ROC Curves, Department of Computer Sciences and Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, USA.

Thank you
