
Text Mined Knowledge Graphs · 2020-08-08


TEXT-MINED KNOWLEDGE GRAPHS

INDUSTRY WHITE PAPER

Text is the medium used to store the tremendous wealth of scientific knowledge regarding the world we live in. However, with its ever-increasing magnitude and throughput, analysing this unstructured data has become an impossibly tedious task. This has led to the rise of Text Mining and Natural Language Processing (NLP) techniques and tools as the go-to for examining and processing large amounts of natural text data.

Text mining is the automatic extraction of structured semantic information from unstructured machine-readable text. The identification and further analysis of these explicit concepts and relationships help in discovering multiple insights contained in text in a scalable and efficient way.

Text-mining/NLP techniques include structure extraction, tokenisation, acronym normalisation, lemmatisation, de-compounding, and identifying language, sentences, entities, relations, phrases and paragraphs.

But once we have this so-called “structured semantic information”, what do we do then? Do these text mining techniques simply produce the insights we are trying to uncover?

Well, it is not that simple. Even after extracting some information from text, there is a long way to go to convert that into, first, knowledge, and then into a valuable insight. That insight may be a new discovery, or the confirmation of a previous hypothesis, in the form of new links in the existing knowledge we have about our domain. Let us look at why going beyond text mining is not an easy task.

What are the Challenges of Going Beyond Text Mining?

While working with a number of NLP tools, I found several challenges between the output of a text-mining/NLP tool and the insights I was looking for. These can be summarised as follows:

1. Difficult to ingest and integrate complex networks of text-mined outputs
Text mining a single instance of text is easy. We could also just read the text and extract the knowledge contained in it ourselves. However, what do we do when we have hundreds, thousands or millions of independent text instances in our corpus? In this case it becomes extremely hard to uncover and understand how the knowledge extracted (the text-mined output) from each text relates to the knowledge extracted from the others. To be able to do such analyses, we need to integrate all the outputs in one place, which is not as easy as it sounds.

2. Difficult to contextualise knowledge extracted from text with existing knowledge
Second, not only do we want to analyse knowledge extracted from text, but we want to go beyond that, to see how the information extracted relates to all the other data we have. These data would have their own format or structure, making it difficult to compare them directly with our original NLP output. This leads to the difficulty of contextualising relationships between various groups of disparate and heterogeneous data.

3. Difficult to investigate insights in a scalable and efficient way
Finally, due to the magnitude of text that can be extracted, it becomes extremely tedious to generate or investigate insights in a scalable way. Of course, valuable insights can be discovered manually for single instances of text, but such an approach is impossible to scale across millions of text instances. In most cases, doing this manually is simply impractical. What do we do then?


How do we Address These Challenges?

With this in mind, we can think of potential solutions that address these challenges. Based on my research, I suggest this methodology:

1. Integrate and ingest complex networks of text-mined output into one collection database
To solve the first challenge, we need a method to easily accumulate text-mined output into one collection; in other words, a text-mined knowledge graph.

2. Impose an explicit structure to normalise all data
To enable the intelligent analysis and integration of data, while also maintaining data integrity, we need to impose an explicit structure on all the data we want to analyse. Not only will this help to contextualise the concepts themselves, but also the relationships between them. This translates into having a higher-level data model to encompass the various types of data and consolidate their presence in the knowledge graph. This will also allow us to validate the data at the time of ingestion. The data model will act as an umbrella over all data types, allowing us to contextualise all relationships within and between them.

3. Discovering new insights using automated reasoning
In order to extract or infer as much information as possible from our knowledge graph, we need some sort of automated reasoning tool to propagate our domain expertise throughout the entirety of the data. This will enable us to ask questions of our knowledge graph and get the right answers with their explanations, where other traditional methods would fail.

Having identified solutions to the previously listed challenges, I wondered whether there was any one technology out there that encompassed all three points.

Well, to our luck, Grakn solves all of these.

If you’re unfamiliar with this technology, Grakn is an intelligent database in the form of a knowledge graph to organise complex networks of data. It contains a knowledge representation system based on hyper-graphs, enabling the modelling of every complex biological relationship. This knowledge representation system is then interpreted by an automated reasoning engine, which performs reasoning in real time. This software is exposed to the user in the form of a flexible and easily understood query language, Graql.

How do we Build a Text-Mined Knowledge Graph?

Step 1: Identify text
To get started, we first need to know the text we are mining and want to store. In this article, I want to focus specifically on biomedical data. Therefore, the types of unstructured text data that I looked at included:

1. Medical Literature
2. Diagnostic Test Reports
3. Electronic Health Records
4. Patient Medical History
5. Clinical Reports

Of the text corpora listed above, I chose to look specifically into medical literature, namely PubMed article abstracts.

Step 2: Identify Text Mining/NLP tool
While looking into this space, I found it easy to integrate Stanford’s CoreNLP API to allow me to train and mine text. The script to mine text can be found on our GitHub repo. Other tools that may be used instead include NLTK, TextBlob, gensim, spaCy, IBM Watson NLU, PubTator, LitVar, NegBio, OpenNLP, and BioCreative.

Step 3: Model your data
Thanks to CoreNLP, we now have raw text-mined data, and we can move on to data modelling. To this end, Grakn utilises the entity-relationship model to group each concept into either an entity, attribute, or relationship. This means that all we have to do is map each concept to a schema concept type and recognise the relations between them. Let us look at an example to demonstrate how we would go about doing this.

In order to start modelling our CoreNLP output, we first need to know what it actually looks like. The most basic mining extracts sentences from a body of text. Those sentences have a sentiment, tokens (which make up the sentence), and relations between certain tokens. We also get a confidence measure for each type that the tool identifies. A visual representation of this is shown below:
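As a rough illustration of the output shape just described (a body of text, its sentences, their sentiment and tokens, relations between tokens, and a confidence per identified item), here is a minimal Python sketch. The class and field names, the labels such as "GENE", and the confidence values are all assumptions made for illustration; they are not CoreNLP's actual output format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    text: str
    lemma: str
    type: str            # entity label, e.g. "GENE" (illustrative label set)
    confidence: float    # confidence reported for the label (made-up values below)


@dataclass
class MinedRelation:
    type: str            # e.g. "inhibition" or "treatment"
    subject_token: Token
    object_token: Token
    confidence: float


@dataclass
class Sentence:
    text: str
    sentiment: str
    tokens: List[Token] = field(default_factory=list)
    relations: List[MinedRelation] = field(default_factory=list)


@dataclass
class MinedText:
    text: str
    sentences: List[Sentence] = field(default_factory=list)


def example_mined_text() -> MinedText:
    """Build one hand-written example; nothing here is real tool output."""
    drug = Token("Dabrafenib", "dabrafenib", "DRUG", 0.9)
    gene = Token("BRAF", "braf", "GENE", 0.9)
    sentence = Sentence(
        text="Dabrafenib inhibits BRAF.",
        sentiment="neutral",
        tokens=[drug, gene],
        relations=[MinedRelation("inhibition", drug, gene, 0.8)],
    )
    return MinedText(text=sentence.text, sentences=[sentence])


mined = example_mined_text()
print(len(mined.sentences), len(mined.sentences[0].tokens))
```

Keeping relations as objects that point at tokens, rather than flattening everything into rows, preserves the graph-like shape that the rest of the article relies on.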


As we can see, the structure of our output is already graph-like. Traditional methods strive to encapsulate this rich information in tables; however, that strips away a whole dimension of information and is counter-productive. It feels more natural to keep this graph-like structure to build an integrated, complex knowledge graph of everything we have extracted from the text corpus. We can easily map the output to Graql.

We start by recognising mined-text, sentence and token to be entities, having attributes such as text, sentiment, lemma and type. We can also see the roles they play in the relations we will define next.

We can identify two relations. One is a containing relation, which is between something that is contained and something that is the container. We can look at the entities to see that the mined-text entity plays the container for a sentence, and a sentence plays the contained with respect to the mined-text and the container with respect to a token. Lastly, the token plays only the contained with respect to the sentence.

The second relation is the mined-relation, which relates the tokens as objects or subjects and also takes part in the containing relation as the contained, to show which sentence that mined-relation was extracted from. The relation mined-relation also has an attribute type, to represent the type of relation.

Finally, we also need to specify the datatypes of each attribute we have defined, which is done in the last four lines of the schema above.

Step 4: Migrate data into Grakn
Now that we have the data, and a structure imposed on this data, the next step is to migrate it into Grakn to convert it into knowledge. Please note that there are many different ways to do migration, but here I would like to touch specifically on how we would go about it using Java, NodeJS and Python.


For this, we can easily use any of these languages to insert instances of extracted information compliant with the schema we modelled. The image below depicts how to insert a single instance of a token with a lemma and type into Grakn using any of these three languages:
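To make the single-token insert more concrete, here is a small Python sketch that only builds the text of a Graql insert statement for a token with a lemma and a type. The entity and attribute names follow the schema described in the text; the helper names and the string-escaping approach are assumptions, and the statement has not been run against a Grakn server. Sending it through a Grakn client inside a write transaction is deliberately left out.

```python
import json


def graql_string(value: str) -> str:
    """Render a Python string as a double-quoted literal.

    Assumption: JSON-style escaping of quotes and backslashes is
    acceptable for a Graql string literal.
    """
    return json.dumps(value, ensure_ascii=False)


def token_insert_query(lemma: str, token_type: str) -> str:
    """Build a Graql insert statement for one token instance."""
    return (
        "insert $t isa token, "
        f"has lemma {graql_string(lemma)}, "
        f"has type {graql_string(token_type)};"
    )


print(token_insert_query("melanoma", "DISEASE"))
# insert $t isa token, has lemma "melanoma", has type "DISEASE";
```

Building the statement in one place keeps quoting consistent when the same function is called for every token produced by the mining step.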

To learn more about migrating data into Grakn, make sure to read this article: Modelling & Migrating Big Biological Data with Grakn.

Step 5: Discover and interpret new insights
After migration, we can start to discover new insights. Discovering insights refers to finding new data that may be valuable to what we are trying to accomplish. To do that, we first need to look or ask for something. In other words, we start with a question. These questions can range from ones asked within the text to more complex ones regarding other data that is augmented by the text-mined output.

Let us look at some examples, and see how our text-mined knowledge graph may provide answers to them:

Question 1: What knowledge is extracted from a PubMed article?

Answer 1:


The answer we get back is the set of concepts that were mined from the abstract of the PubMed article we are interested in. The figure above shows an example of this, where the entities extracted were:

1. Gene: BRAF
2. Drug: Trametinib
3. Drug: Dabrafenib
4. Disease: Melanoma
5. Protein: MEK

We can also see some mined-relations that were extracted:

1. Inhibition: between BRAF and Dabrafenib
2. Inhibition: between MEK and Trametinib
3. Treatment: between Dabrafenib and Melanoma
4. Treatment: between Trametinib and Melanoma

These entities and relations were extracted due to the training performed on our CoreNLP tool. More information regarding this can be found here.

Let us now try asking a harder question:

Question 2: Which PubMed articles mention the disease Melanoma and the gene BRAF?

Answer 2:
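The logic behind Question 2 (find every article whose mined tokens include both a given disease and a given gene) can be sketched in plain Python over in-memory data. This is a stand-in for illustration, not the Graql query used in the article; the article identifiers and lemma sets below are hypothetical.

```python
from typing import Dict, List, Set


def articles_mentioning(lemmas_by_article: Dict[str, Set[str]], *terms: str) -> List[str]:
    """Return the ids of articles whose mined lemmas include every term."""
    wanted = {term.lower() for term in terms}
    return sorted(
        article_id
        for article_id, lemmas in lemmas_by_article.items()
        if wanted <= {lemma.lower() for lemma in lemmas}
    )


# Hypothetical mined lemmas per article (not real PubMed data).
mined_lemmas = {
    "article-1": {"braf", "melanoma", "dabrafenib"},
    "article-2": {"melanoma", "trametinib"},
    "article-3": {"braf", "mek"},
}

print(articles_mentioning(mined_lemmas, "Melanoma", "BRAF"))
```

In a knowledge graph the same question is a pattern match over articles, sentences and tokens, so it does not require loading every article's lemmas into memory as this sketch does.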


The answer we receive from Grakn is a list of all PubMed articles that match the condition provided. A great application of such a query can be found in the Precision Medicine domain, where we want to link individual patients to medical literature relevant to their personal biological and medical case. If you’re interested in precision medicine and knowledge graphs, check out this article.

These queries are useful; however, this raises the question: how can we leverage our text-mined knowledge graph to augment our existing knowledge? How can we use what we mined to expand our insights and apply them to other domains or fields in life sciences? The next question demonstrates this.


Question 3: Which drugs are related to the disease Melanoma?

Answer 3:

Even though Grakn gives us a correct answer to our question, this data was actually never ingested into Grakn: no connections exist between diseases and drugs. So, how did we get this relevant answer?

In short, Grakn’s automated reasoner created this answer for us through automated reasoning.

As this type of reasoning is fully explainable, we can interpret any inferred concept to understand how it was inferred/created. Below you can see what this explanation looks like. In the next section, I will dive deeper into how we created the logic and rules that allowed Grakn to infer these relationships.

The relation we saw above was the product of first-order logic. This can be modelled as a when and then scenario.

The when in this case refers to a sub-graph of our text-mined knowledge graph, i.e., whenever any sub-graph like the one on the left is found, the relation on the right is created.

This logic propagates throughout the knowledge graph, creating new valuable connections, and is expressed as rules written in Graql. The rule that I used for the inference above is as follows:

The above rule states that when:
1. There exists a PubMed article $p with an abstract $a
2. There exists a sentence $s contained in abstract $a
3. There exists a mined-relation extracted from sentence $s which has the type “treatment”
4. The tokens taking part in that mined-relation have lemmas
5. Those lemmas have the same value as a drug and a disease
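The when/then logic of the rule can be imitated in a few lines of Python. This is a simplified stand-in for Grakn's reasoner, not the Graql rule itself: it assumes the mined relations passed in were already extracted from sentences of PubMed abstracts (conditions 1 and 2), and the type and function names are hypothetical.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class MinedRelation(NamedTuple):
    relation_type: str
    lemmas: Tuple[str, str]  # lemmas of the two tokens taking part


def infer_treatments(
    relations: Iterable[MinedRelation],
    drugs: Set[str],
    diseases: Set[str],
) -> Set[Tuple[str, str]]:
    """Return inferred (drug, disease) pairs.

    when: a mined-relation has type "treatment" and its two token lemmas
          match a known drug name and a known disease name
    then: a treatment relation between that drug and disease is produced
    """
    inferred = set()
    for relation in relations:
        if relation.relation_type != "treatment":
            continue
        # Try both orderings, since either token may be the drug.
        for first, second in (relation.lemmas, relation.lemmas[::-1]):
            if first in drugs and second in diseases:
                inferred.add((first, second))
    return inferred


mined = [
    MinedRelation("treatment", ("dabrafenib", "melanoma")),
    MinedRelation("treatment", ("melanoma", "trametinib")),
    MinedRelation("inhibition", ("braf", "dabrafenib")),
]
print(sorted(infer_treatments(mined, {"dabrafenib", "trametinib"}, {"melanoma"})))
```

Unlike this sketch, which computes the pairs eagerly over a list, the article describes the reasoner creating such relations at query time and being able to explain each inferred relation.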

If this is true, then create a treatment relation between the drug and the disease. We can also see that the rule above is agnostic to the disease or drug, hence creating treatment relations between every drug and disease that match the given conditions.

It should be noted that the rule above is a demonstration of how automated reasoning can be used to create a high-level abstraction over complex insights which scale through the data. It by no means compares to the full implementation of an end-user application.

How do all these pieces fit together in one architecture?

Now, let us take a step back, and look at how all of the components of building a text-mined knowledge graph piece together.

We start with the text, which can come from multiple sources and in various formats. An NLP tool is used to mine the text and produce some sort of output with a structure. That structure is used to create a schema (high-level data model) to enforce a structure on the raw NLP output. Once that is done, we use one of Grakn’s clients to migrate the instances of NLP output into Grakn, making sure every insertion adheres to the schema.

Grakn stores this in its knowledge representation system, which can be queried for insights to discover complex concepts or even test out hypotheses. These insights can already be in the graph or be created at query time by the reasoning engine; examples include gene cluster identification, protein interactions, gene-disease associations, protein-disease associations, drug-disease associations, patient matching, and clinical decision support.
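The point above about every insertion adhering to the schema can be illustrated with a tiny in-memory store that rejects instances the schema does not allow. This is a conceptual sketch only; it does not use or mirror Grakn's client API, and the assignment of attributes to each entity type is an assumption based on the attributes named in the text.

```python
from typing import Dict, List, Set, Tuple

# Assumed mapping of entity types to the attributes they may own.
SCHEMA: Dict[str, Set[str]] = {
    "mined-text": {"text"},
    "sentence": {"text", "sentiment"},
    "token": {"lemma", "type"},
}


class SchemaError(ValueError):
    """Raised when an instance does not conform to the schema."""


class ValidatingStore:
    """Keeps only instances whose type and attributes the schema allows."""

    def __init__(self, schema: Dict[str, Set[str]]) -> None:
        self.schema = schema
        self.instances: List[Tuple[str, Dict[str, str]]] = []

    def insert(self, concept_type: str, attributes: Dict[str, str]) -> None:
        if concept_type not in self.schema:
            raise SchemaError(f"unknown type: {concept_type}")
        unknown = set(attributes) - self.schema[concept_type]
        if unknown:
            raise SchemaError(f"{concept_type} cannot have: {sorted(unknown)}")
        self.instances.append((concept_type, dict(attributes)))


store = ValidatingStore(SCHEMA)
store.insert("token", {"lemma": "braf", "type": "GENE"})
try:
    store.insert("token", {"sentiment": "neutral"})  # not allowed on a token
except SchemaError as error:
    print("rejected:", error)
```

Validating at insert time means malformed mining output is caught before it can affect later queries or inferences.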


Conclusions
So, we know that text mining is extremely promising in several domains, and in this article we have looked specifically at the biomedical domain. We understand that there is a lot of work required between extracting information from text and actually discovering something useful. I hope to have shown that Grakn can help to bridge that gap. In summary, Grakn helps to solve the three key challenges of going beyond text mining.

In this article, we've only scratched the surface of what you can do with Grakn. If you'd like to learn more, please contact us.

Grakn is a distributed knowledge graph: a logical database to organise large and complex networks of data as one body of knowledge. Grakn provides the knowledge engineering tools for developers to easily leverage the power of Knowledge Representation and Reasoning when building complex systems. Our enterprise product, Grakn Cluster, is available on any cloud provider and on premise.

Grakn is used in numerous applications from tax automation bots to complex use cases in drug discovery via protein pathways, a knowledge network of drones and robots, cybersecurity and financial services. Users include organisations such as AstraZeneca, Cisco, the French Intelligent Services, Bayer and Nestlé.