
Providing Consultancy &

Research in Health Economics


Do simple text mining tools

have anything to offer

Embase users?

Sunrise seminar

May 2016

Julie Glanville



I acknowledge funding from Elsevier which has covered my

attendance at the MOSAIC conference

I work for YHEC, a consultancy company that does contract

research for a range of public and private sector organisations

I offer training courses in advanced searching and text mining



What are simple text mining tools?

How might TM tools help us with searching?

How can we use TM tools with Embase?

Learning more


What is text mining?

“Text mining is the process of discovering and extracting

knowledge from unstructured data. This comprises three

main activities:

– Information retrieval (IR) to gather relevant texts.

– Information extraction (IE) to identify and extract entities, facts

and relationships between them.

– Data mining to find associations among the pieces of information

extracted from many different texts.

…[TM] can help make the implicit information in your

documents more explicit…”

Source: National Centre for Text Mining (NaCTeM).


TM is not a single thing

TM software comes in many forms and can do many

different things

Simple things – word frequency analysis

counting the numbers of times words appear in the text

More complex things – word co-occurrence

looking at patterns of words occurring together to identify concepts and

relationships between words

Semantic analysis – analysing text according to the meaning of

words not just their presence or absence

“89% of the group achieved smoking cessation”

“Five different smoking cessation interventions were explored”
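The two simplest analyses above can be sketched in a few lines of Python. This is only an illustration, not the method used by any particular TM package: the tokenizer (lowercase alphabetic words only) and the function names are assumptions made for the example. It also shows why frequency and co-occurrence alone cannot separate the two example sentences, since both contain "smoking" and "cessation".

```python
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(texts):
    """Count how many times each word appears across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def cooccurrences(texts):
    """Count the texts in which each unordered pair of distinct words appears."""
    pairs = Counter()
    for text in texts:
        unique_words = sorted(set(tokenize(text)))
        pairs.update(combinations(unique_words, 2))
    return pairs


sentences = [
    "89% of the group achieved smoking cessation",
    "Five different smoking cessation interventions were explored",
]
frequencies = word_frequencies(sentences)
pairs = cooccurrences(sentences)
print(frequencies.most_common(5))
print(pairs[("cessation", "smoking")])
```

Pairs are stored in alphabetical order, so the pair is looked up as ("cessation", "smoking").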


What is behind TM software?

TM software works according to algorithms

Packages use algorithms to achieve results

Algorithms make use of features in the texts such as

frequency of terms, co-occurrence of terms, presence/absence of terms

Different packages use different algorithms

TM software may make use of dictionaries and stop

word lists

It is likely that these will differ across software packages

TM software may also make use of lists, vocabularies

and term relationships (ontologies/taxonomies)

E.g. lists of diseases, geographic areas, proteins


What can TM do?

TM can be used to extract information from texts and identify

patterns in that information

This might lead to identifying themes in the text of which we were unaware


Particularly useful for helping to see information more clearly that

might otherwise be “concealed” within large volumes of text.

It is “objective” in its analysis of text

Many TM packages can cope with information in very different

formats and explore them in a single corpus (body of text)

Database records, papers, web pages, tweets


How easy is it to access TM software?

Many different TM software packages are available free of charge on the

internet or to download

Some do single tasks

e.g. simple word frequency analysis such as PubMed PubReMiner

Others offer bundles of tools to give different ways to explore a set of


e.g. Voyant

There is also free software for more sophisticated tasks within TM, such as

machine learning

e.g. GATE

To get the best out of this software will require some investment of time

to fully learn the options within the software and their implications


What are simple text mining tools?


The selected tools we will look at today achieve the following:

Analyse the frequency of terms appearing in database records



Subject headings

Other fields

Analyse phrases within records

Analyse the collocation of terms within records

Show us the content/themes within a set of records

There are many more tools…


How can we use TM tools with

Embase records?

Many TM packages are built as interfaces to PubMed

PubMed PubReMiner



MeSH on Demand

These are helpful for analysing records from PubMed and building

MEDLINE strategies

What services can we use to help us with exploring Embase records?


To develop strategies

To explore the content of a set of records


Identifying search terms

Our topic for today is

How can we treat biofilms that have

formed in infected wounds

I have found a set of 902 Embase records

using a first very basic scoping search

How can frequency analysis tools help us

see the words in the records



Example Embase

(OvidSP) search

1. Biofilm$1.ti,ab.

2. Wound$1.ti,ab.

3. 1 and 2
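As a rough illustration of what this scoping search does, the sketch below emulates limited truncation ($1 allows the root plus at most one further character) and the title/abstract restriction in Python. It is not how OvidSP works internally; the function names and the sample records are hypothetical, and `\w` is used as an approximation of "any character".

```python
import re


def ovid_term_to_regex(term):
    """Translate 'root$n' into a regex matching the root plus 0..n characters."""
    match = re.fullmatch(r"(\w+)\$(\d+)", term)
    if match:
        root, limit = match.group(1), int(match.group(2))
        return re.compile(rf"\b{re.escape(root)}\w{{0,{limit}}}\b", re.IGNORECASE)
    return re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)


def matches_all(record, terms):
    """True when every term is found in the record's title or abstract."""
    text = f"{record.get('title', '')} {record.get('abstract', '')}"
    return all(ovid_term_to_regex(term).search(text) for term in terms)


# Hypothetical records, for illustration only.
records = [
    {"title": "Biofilms in chronic wounds", "abstract": "A review of treatment options."},
    {"title": "Catheter coatings", "abstract": "Biofilm formation on urinary catheters."},
    {"title": "Wound dressings", "abstract": "Silver dressings reduced biofilm burden."},
]
hits = [r["title"] for r in records if matches_all(r, ["biofilm$1", "wound$1"])]
print(hits)
```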


EndNote offers simple frequency analysis:

You can see which terms might be useful for strategy development


All records are indexed as they are loaded into EndNote:

The Keywords field is indexed automatically to create a Term List


Term Lists can be used to create an index

Additional Term Lists can be defined and populated

e.g. title, title/abstract

Ideal for analysing Embase records

Frequency of terms in the title, title/abstract, EMTREE

EndNote, 1


Before loading records into EndNote decide how to treat the

information coming in to Keywords field:

You can break up subject index terms by changing the term delimiters


E.g. Pseudomonas infection/dt [Drug Therapy] can be parsed as:


Separate words - Pseudomonas Infection dt drug therapy

Two phrases - Pseudomonas Infection dt [drug therapy]

If you want to do frequency analysis of other fields or combinations of

fields you can do this once the records are loaded
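The effect of using "/" as the delimiter can be sketched outside EndNote. The Python below is an illustration only, with a hypothetical helper name and hypothetical Keywords content; it splits each entry into heading and subheading code, drops the bracketed label, and counts headings in descending order of frequency.

```python
import re
from collections import Counter


def split_heading(entry, delimiter="/"):
    """Return (heading, subheading code) for one subject index entry.

    The bracketed label, e.g. "[Drug Therapy]", is dropped. The subheading
    code is None when the entry has no subheading.
    """
    without_label = re.sub(r"\s*\[[^\]]*\]", "", entry).strip()
    heading, _, code = without_label.partition(delimiter)
    return heading.strip(), (code.strip() or None)


# Hypothetical Keywords field content for three records.
keywords = [
    "Pseudomonas infection/dt [Drug Therapy]",
    "biofilm",
    "Pseudomonas infection/di [Diagnosis]",
]
heading_counts = Counter(split_heading(k)[0] for k in keywords)
for heading, count in heading_counts.most_common():
    print(count, heading)
```

Counting on the heading alone groups the same EMTREE term regardless of which subheading was attached.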

EndNote, 2


Keywords field

To set the term delimiters you can use the following sequence

(EndNote X7.5) in a new empty EndNote library:

Tools, Define term lists, Keywords

(Change) Delimiters – select the ‘/’ symbol to cut the

subheading from the EMTREE heading

Update list


Load your Embase records

EndNote, 3


Create the frequency analysis of the EMTREE terms:

Tools, Subject bibliography

Keywords, OK

Select all, OK

Choose display format by selecting Layout

Terms, Subject terms only

Change number of lines between entries e.g. remove suffix ^p^p


Change display order to frequency by selecting ‘By term

count – descending’

Select OK

To print listing, select Print

To save the listing select Save

EndNote, 4


EndNote, 5

To see a title and abstract frequency analysis, first define a Term List

Select Tools, Define Term Lists

Select Create List

Give the list a helpful name e.g. Titleab

Check the custom delimiters and make sure that a space is added so

that words will be processed individually

Select Update list

Then select the title field and link it to the Titleab term list

Then select the abstract field and link it to the Titleab term list

The term list is now ready and any subject bibliography involving those

fields will be able to use single terms


EndNote, 6

To save or print out the title and abstract frequency analysis

Tools, Subject bibliography

Select Title as well as Abstract (using the control key), OK

Choose Select all, OK

Choose display format by selecting Layout

Terms, Subject terms only

Change number of lines between entries e.g. remove suffix ^p^p

Change display order to frequency by selecting ‘By term count – descending’


Select OK

To print listing, select Print

To save the listing select Save


EndNote pros and cons


Easy and quick to create frequency counts of the fields in Embase

records, particularly of EMTREE

This facility is built into a package you might be using to manage your

search results


Not very visual

Cannot do phrase analysis of title and abstract

Cannot do more sophisticated analyses such as word collocation

Cannot implement stopwords
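Where stopword filtering is needed, a title/abstract frequency count can be done outside EndNote. The following is a minimal Python sketch under stated assumptions: the stopword list is a small hypothetical one (real packages ship their own, and they differ), and records are assumed to be available as dictionaries with "title" and "abstract" keys.

```python
import re
from collections import Counter

# A small illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "of", "and", "in", "a", "to", "for", "with", "on", "is", "were"}


def term_frequencies(records, stopwords=STOPWORDS):
    """Count title/abstract words across records, skipping stopwords."""
    counts = Counter()
    for record in records:
        text = f"{record['title']} {record['abstract']}".lower()
        words = re.findall(r"[a-z][a-z\-]*", text)
        counts.update(word for word in words if word not in stopwords)
    return counts


# Hypothetical record, for illustration only.
records = [
    {
        "title": "Treatment of biofilms in wounds",
        "abstract": "The biofilms were treated with silver.",
    }
]
counts = term_frequencies(records)
for word, count in counts.most_common():
    print(count, word)
```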


Format of Embase records

Often need the Embase records in plain text format

Best to select only the fields you want to analyse so that

The files process more quickly

The output is not cluttered with unwanted words e.g. from the address field


To get ‘clean data’

Download just the selected fields from Embase (e.g. in OvidSP select

text file output and then selected fields e.g. ti, ab, sh)

Or, download records to EndNote and export only fields of interest from

EndNote into a file

Sometimes you may want to have one file of title/abstract fields as

well as a separate file of the EMTREE only
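If the records have been exported in RIS format, the "only the fields you want" step can also be scripted. The sketch below is an assumption-laden illustration: it relies on the usual RIS line layout (two-character tag, two spaces, hyphen), ignores untagged continuation lines, and the tag names for title and abstract (TI, AB) may differ between exporters. The sample record is hypothetical.

```python
import re

# RIS lines look like "TI  - Some title"; "ER  - " ends a record.
TAG_LINE = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$")


def parse_ris(text):
    """Parse RIS text into a list of {tag: [values]} dictionaries."""
    records, current = [], {}
    for line in text.splitlines():
        match = TAG_LINE.match(line)
        if not match:
            continue  # untagged continuation lines are skipped in this sketch
        tag, value = match.groups()
        if tag == "ER":
            if current:
                records.append(current)
            current = {}
        else:
            current.setdefault(tag, []).append(value.strip())
    if current:
        records.append(current)
    return records


def select_fields(records, tags):
    """Return the values of the chosen tags, one value per line."""
    lines = []
    for record in records:
        for tag in tags:
            lines.extend(record.get(tag, []))
    return "\n".join(lines)


sample = "\n".join([
    "TY  - JOUR",
    "TI  - Biofilms in chronic wounds",
    "AB  - Biofilm was detected in most wounds.",
    "AD  - Department of Surgery, Example University",
    "KW  - biofilm",
    "ER  - ",
])
clean = select_fields(parse_ris(sample), ["TI", "AB"])
print(clean)
```

Calling `select_fields` a second time with `["KW"]` would give the separate subject-heading file mentioned above.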


Voyant Tools

Can upload a text file or

cut and paste the

contents of a text file into


It provides various

views on the text

Example (title/abstract)

904 biofilms and wounds records



Voyant pros and cons


Offers a simple terms display and a word cloud

Much more visual presentation

We can explore a set of records in a non-linear way

Can save the data visualisations for the future

Can manipulate the stopword list e.g. to remove words such as ‘title’ or



May be a little slow to respond?


Identifying phrases


Voyant Tools

Running the phrase analysis over the biofilms and

wounds records helps us to identify phrases

Can choose a seed word e.g. ‘wound’ and inspect

phrases that contain it
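The idea of inspecting phrases around a seed word can be sketched as a simple n-gram count. This is not Voyant's own algorithm; the function name, the exact-match rule (so "wounds" does not match the seed "wound") and the sample texts are assumptions made for illustration.

```python
import re
from collections import Counter


def phrases_with_seed(texts, seed, n=2):
    """Count n-word phrases that contain the seed word exactly."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        for start in range(len(words) - n + 1):
            gram = words[start:start + n]
            if seed in gram:
                counts[" ".join(gram)] += 1
    return counts


# Hypothetical title text, for illustration only.
texts = [
    "Chronic wound biofilms delay wound healing",
    "Wound healing is impaired in chronic wound infection",
]
phrases = phrases_with_seed(texts, "wound", n=2)
for phrase, count in phrases.most_common():
    print(count, phrase)
```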



2 MB file size

Can paste in records, document text e.g. protocol, or

parts of records

Analysis of EMTREE headings from a batch of 498

Embase records

Blue items are phrases – any frequency

Red items are phrases with higher frequency – threshold set at 4



Table representation of phrases by C

score weight

NOTE: additional pre-processing

(tidying) could be undertaken e.g.

taking out the * for the focused



Text Analyzer

Paste in text e.g. set of records

Click ‘Process text’

Groups results that appear as most frequent


Choose phrase lengths

402 Embase records: title abstract


Text Analyzer, 2


Identifying options for

proximity operator use

Text Analyzer

– The display with longer phrases can help with

deciding on proximity operators


– The keyword in context might help with deciding on

proximity operators

– The phrase option can be set to phrases of specific lengths



Voyant Collocates Tool

The table view shows:

– Term: this is the keyword (or keywords) being


– Collocate: these are the words found in proximity of

each keyword

– Count (context): this is the frequency of the collocate

occurring in proximity to the keyword
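The three columns can be reproduced in outline with a short Python sketch. It is an illustration of the general idea of collocation counting, not Voyant's implementation; the window size, function name and sample texts are assumptions.

```python
import re
from collections import Counter


def collocates(texts, keyword, window=5):
    """Count words found within `window` words either side of `keyword`."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        for i, word in enumerate(words):
            if word != keyword:
                continue
            context = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
            counts.update(w for w in context if w != keyword)
    return counts


# Hypothetical title text, for illustration only.
texts = [
    "Biofilm infection in chronic wound care",
    "Chronic wound biofilm",
]
near_wound = collocates(texts, "wound", window=1)
for collocate, count in near_wound.most_common():
    print("wound", collocate, count)
```

Widening `window` brings in more, and usually less frequent, collocates, which is also a rough guide to how wide a proximity operator might need to be.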





Voyant settings

Keywords are in blue/green and collocates (words in proximity to the

keywords) are showing in orange/red.

Default: most frequent collocates are shown for the 10 most

frequent keywords in the corpus

Probably best for title/abstract terms?

Might want to add some terms to stopwords e.g. Title

If you change the context slider at the bottom more terms are

included (with lower frequency)

Hovering over collocates shows their frequency in proximity (not

their total frequency)

Exploring the terms can highlight words for consideration for

proximity searches


Identifying options for /freq

command use in OvidSP

OvidSP offers the option to implement frequency

selection in searches

– biofilm.ab. /freq=2

TM exploration using Voyant (using the

collocator option) might suggest combinations of

words on which to focus

– These could be tested with the frequency operator to

see if they do improve the precision of the search
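Before testing in OvidSP, the likely effect of a frequency threshold can be explored on a downloaded sample. The Python below is a sketch under assumptions: it treats the threshold as "at least this many occurrences of the whole word in the abstract", and the abstracts and relevance judgements are hypothetical.

```python
import re


def meets_frequency(abstract, term, minimum=2):
    """True when the whole word `term` occurs at least `minimum` times."""
    pattern = rf"\b{re.escape(term)}\b"
    return len(re.findall(pattern, abstract, flags=re.IGNORECASE)) >= minimum


def precision(retrieved, relevant_ids):
    """Share of retrieved record ids that were judged relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for rid in retrieved if rid in relevant_ids) / len(retrieved)


# Hypothetical abstracts and relevance judgements, for illustration only.
abstracts = {
    1: "Biofilm was found in all wounds. Biofilm disruption improved healing.",
    2: "Biofilm is mentioned once in this abstract about catheters.",
    3: "Wound biofilm resisted antibiotics; biofilm removal required debridement.",
}
relevant = {1, 3}
all_hits = [rid for rid, text in abstracts.items() if meets_frequency(text, "biofilm", 1)]
freq_hits = [rid for rid, text in abstracts.items() if meets_frequency(text, "biofilm", 2)]
print(precision(all_hits, relevant), precision(freq_hits, relevant))
```

Any gain in precision should be weighed against relevant records lost by the threshold.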


Identifying concepts?

TM software can help us to see themes in a batch of search results


Can be run online (needs Java) but best to download software

It creates maps of themes within documents

– Network visualisation

– Density visualisation

Possible to zoom into areas of interest

Scroll over the map and zoom in



Carry out an Embase search and download results as RIS

Open VOSviewer


Select Create, map based on text data

Select RIS option and load the RIS file

Select Next and choose the specific fields and the term score base


Select Next and choose binary (presence/absence) or full

(frequency) counting

Choose the minimum number of occurrences of a term (e.g. 5)

Choose Relevance score for each
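The difference between binary and full counting, and the effect of the minimum-occurrences threshold, can be seen in a small Python sketch. It only illustrates the counting choice on single words; it does not reproduce how VOSviewer identifies terms, and the function names and sample documents are hypothetical.

```python
import re
from collections import Counter


def count_terms(documents, method="binary"):
    """Count terms per corpus.

    binary: each document contributes at most 1 per term (presence/absence).
    full:   every occurrence in every document is counted (frequency).
    """
    counts = Counter()
    for document in documents:
        words = re.findall(r"[a-z]+", document.lower())
        counts.update(set(words) if method == "binary" else words)
    return counts


def apply_threshold(counts, minimum):
    """Keep only terms that reach the minimum number of occurrences."""
    return {term: count for term, count in counts.items() if count >= minimum}


# Hypothetical documents, for illustration only.
documents = ["biofilm biofilm wound", "biofilm dressing"]
binary_counts = count_terms(documents, "binary")
full_counts = count_terms(documents, "full")
print(apply_threshold(binary_counts, 2))
print(apply_threshold(full_counts, 2))
```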


VOSviewer: network


Network visualisation

The font size and the size of the circle depend on the weight of

an item

Weight is determined by total strength of all links to the item

The colour of the circle of an item is determined by the cluster to

which the item belongs


VOSviewer: Density


Item density visualisation

Each point has a colour that depends on density of items at that

point (between red and blue)

The larger the number of items in the neighbourhood of a point and

the higher the weights of the neighbouring items, the closer the colour of

the point is to red

The smaller the number of items around a point and the lower the

weights of the neighbouring items, the closer the colour of the

point is to blue.


Identify where research is



– Creating a term list of the address field – important to

use the comma as a delimiter

– Analysing words in title and abstract


– Create a file with only address fields and see what

the pictures show


Explore trends within research


– If a set of records are loaded in date order the

frequency of terms over time can be shown
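A trend of this kind can also be tabulated directly when the publication year is available for each record. The sketch below is illustrative only; the record structure (a "year" and a "text" key) and the sample data are assumptions.

```python
import re
from collections import defaultdict


def term_trend(records, term):
    """Return {year: occurrences of term in that year's records}, sorted by year."""
    per_year = defaultdict(int)
    for record in records:
        words = re.findall(r"[a-z]+", record["text"].lower())
        per_year[record["year"]] += words.count(term.lower())
    return dict(sorted(per_year.items()))


# Hypothetical records, for illustration only.
records = [
    {"year": 2010, "text": "Biofilm in wounds"},
    {"year": 2015, "text": "Biofilm and biofilm matrix"},
    {"year": 2015, "text": "Wound biofilm"},
    {"year": 2012, "text": "Dressings"},
]
trend = term_trend(records, "biofilm")
for year, count in trend.items():
    print(year, count)
```

Raw counts rise with the number of records per year, so dividing by the yearly record count may give a fairer picture.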


Learning more about TM and

systematic reviews


IQWiG approach, 1

Hausner E, Waffenschmidt S, Kaiser T, Simon M. Routine development of objectively derived search strategies. Syst Rev 2012;1:19. DOI: 10.1186/2046-4053-1-19.

Identify a test set and split randomly

Load into EndNote

Text mining package “tm” (Text Mining Infrastructure in R):


On the basis of information derived from the titles and abstracts of the downloaded references, terms are ranked by frequency.

Terms present in at least 20% of the references in the development set are selected for further examination.

Most frequent terms with a low sensitivity of 2% or less are used in strategy
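The random split and the 20% rule can be sketched as follows. Note that the published approach uses the R package "tm"; the Python below is a swapped-in illustration, not the authors' code. The split proportion, function names and sample references are assumptions, and the later sensitivity check against a wider set of references is not shown.

```python
import random
import re
from collections import Counter


def split_references(references, development_fraction=0.5, seed=0):
    """Randomly split references into development and validation sets."""
    shuffled = list(references)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * development_fraction)
    return shuffled[:cut], shuffled[cut:]


def document_share(references):
    """Fraction of references whose title/abstract text contains each term."""
    counts = Counter()
    for text in references:
        counts.update(set(re.findall(r"[a-z]+", text.lower())))
    total = len(references)
    return {term: count / total for term, count in counts.items()}


def candidate_terms(references, threshold=0.20):
    """Terms present in at least `threshold` of the references, most common first."""
    shares = document_share(references)
    selected = [term for term, share in shares.items() if share >= threshold]
    return sorted(selected, key=lambda term: (-shares[term], term))


# Hypothetical title text for six references, for illustration only.
references = [
    "Biofilm in chronic wound",
    "Biofilm and silver dressing",
    "Biofilm and silver",
    "Wound biofilm",
    "Biofilm matrix",
    "Biofilm debridement",
]
development, validation = split_references(references, seed=1)
print(candidate_terms(references))
```

In practice `candidate_terms` would be run on the development set only, with the validation set held back for testing the finished strategy.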


IQWiG approach, 2

[They use PubReMiner for selected MeSH]

Terms are then divided up into Condition, Intervention and Study design.

Then iterative trial and error approach to find strategy that works

Then strategy is tested against validation set

They also use AntConc software to identify phrases or adjacent terms

Has to be downloaded


AHRQ report

Paynter RA, Bañez LL, Berliner E, Erinoff E, Lege-Matsuura J, Potter S, Uhl

S. EPC Methods: An Exploration of the Use of Text-Mining Software in

Systematic Reviews. Research White Paper. (Prepared by the Scientific

Resource Center and the Vanderbilt and ECRI Evidence-based Practice

Centers under Contract Nos. 290-2012-00004-C [SRC], 290-2012-00009-I

[Vanderbilt], and 290-2012-00011-I [ECRI].) AHRQ Publication 16-EHC023-EF.

Rockville, MD: Agency for Healthcare Research and Quality; April



Overview of TM tools used in searching

Evaluation of TM tools recommended in literature and by interviewees

Also summarises research in using TM for other parts of systematic review



Flinders University Library

Text mining resource

Lists of various tools


What are the challenges of

using TM software?

TM has many different options – need to identify what aspect of TM

will be used at what stage of the search process and how

Lack of standardisation – no agreed single approach

Different systems use different algorithms – how might these impact

on the resulting searches we develop?

TM can help with volume processing but we still need to make

decisions based on the results, and these may still be subjective

unless we can define benchmarks or decision rules a priori

Some software is complex to learn

The process of using TM can be challenging to document



There are powerful free tools which can help us with term and

phrase identification to assist with developing Embase searches

Data visualisation such as VOSviewer might help with developing

strategies for complex topics by showing the concepts to

de-emphasise and topics on which to focus

Possibly most useful for more complex searches?

There are many different tools to explore

Many of these can be downloaded to a PC and downloaded

versions may offer more flexibility and reliability than the web-based versions



Experiences, questions and



Thank you

[email protected]

Telephone: +44 1904 324832
