
Appendix A

Probability, Distribution Theory, and Statistical Inference

A.1 Definitions

We define an event as any set of possible outcomes of a random experiment. In turn, an outcome is the result of a single trial of a random experiment. A random experiment is an experiment whose outcome is not completely predictable; that is, we can repeat the experiment under the same conditions and potentially get a different result. The sample space (sometimes called the universe) is the set of all possible outcomes for one trial of the random experiment and is often denoted by $\Omega$.1 The symbol $\emptyset$ denotes the set containing no outcomes, or the empty set.

Using the above definitions, we can now define the union, intersection, and

complement of events. We use two events A and B:

The union of two events A and B is the set of outcomes that are in either A or B or both and is denoted by $A \cup B$.

1 The letter U is also sometimes used for this and we may use either interchangeably.


The intersection of two events A and B is the set of outcomes that are in both A and B simultaneously and is denoted by $A \cap B$.

The complement of an event A is the set of outcomes that are not in A and is denoted by $\bar{A}$ or $A'$.

An event B is a subset of an event A if all of the outcomes of B are also outcomes of A.

Two events A and B which have no outcomes in common are mutually exclusive, or disjoint. In this case the occurrence of A, without loss of generality, precludes the occurrence of event B.


A.2 Axioms of Probability

A probability is a real number between 0 and 1 that is assigned to each of the events

that an experiment generates, with higher values indicating that the event is more

likely to occur. Assigning a value of 1 to an event means that the event is certain

to occur while assigning a value of 0 means that the event can never occur. The

following axioms must be satisfied for any consistent probability assignment:

1. $0 \le P(A) \le 1$ for any event A.
2. $P(\Omega) = 1$.
3. If A and B are mutually exclusive events, that is, $A \cap B = \emptyset$, then $P(A \cup B) = P(A) + P(B)$. In the more general case, $P(A \cup B \cup C \cup \cdots) = P(A) + P(B) + P(C) + \cdots$.

We can now use these axioms to prove the following rules:

1. $P(\emptyset) = 0$. Since $\Omega \cup \emptyset = \Omega$ and $\Omega \cap \emptyset = \emptyset$, by axiom 3, $1 = 1 + P(\emptyset)$.

2. $P(\bar{A}) = 1 - P(A)$. Since $\Omega = A \cup \bar{A}$ and $A \cap \bar{A} = \emptyset$, by axiom 3, $1 = P(A) + P(\bar{A})$.

3. $P(A \cup B) = P(A) + P(B) - P(A \cap B)$. We can write $A \cup B = A \cup (\bar{A} \cap B)$, which are disjoint, so by axiom 3, $P(A \cup B) = P(A) + P(\bar{A} \cap B)$. Similarly, $B = (A \cap B) \cup (\bar{A} \cap B)$, which are disjoint, so by axiom 3, $P(B) = P(A \cap B) + P(\bar{A} \cap B)$. Substituting for $P(\bar{A} \cap B)$ in the first expression gives $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.

A visualization of this rule on the Venn diagram shows us that $P(A) + P(B)$ would actually include $P(A \cap B)$ twice, thus we subtract one copy from the sum.
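We can sanity-check this rule numerically. The following is a small illustrative sketch in R (the events chosen, A = "the roll is even" and B = "the roll is at least 4" on a fair die, are our own invention for the example):

  # Verify P(A u B) = P(A) + P(B) - P(A n B) by enumerating a fair die
  omega <- 1:6
  A <- c(2, 4, 6)                     # the roll is even
  B <- c(4, 5, 6)                     # the roll is at least 4
  p <- function(event) length(event) / length(omega)
  p(union(A, B))                      # 0.667
  p(A) + p(B) - p(intersect(A, B))    # 0.667, the same value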

A.3 Joint Probability

Consider a random experiment in which $\Omega$ includes two events A and B. We define the joint probability as the probability of both events occurring simultaneously.2 The implication of this is that the set of outcomes occurs in both event A and event B, which, on the Venn diagram, is the quantity $A \cap B$. That is, the joint probability of the events A and B is $P(A \cap B)$. But what happens if the events are independent of each other? In this case, $P(A \cap B) = P(A) \cdot P(B)$.

We note here that independent events and disjoint events are not identical. In probability, two events are independent if they have no effect on each other: the occurrence of one has no impact on the occurrence, or lack thereof, of the other. Thus, independence is a property that arises from the probabilities assigned to the events and their intersection. Disjoint, or mutually exclusive, events are events which have no outcomes in common: such events with non-zero probabilities cannot be independent, since their intersection is the empty set, with $P(\emptyset) = 0$, which can never equal the product of the probabilities of the two events.

We call the probability of an event A, in the joint probability setting, its marginal probability, and this is calculated as $P(A \cap B) + P(A \cap \bar{B})$ using the axioms of probability. (The confirmation of this is left as an exercise for the reader.)

2 Actually, we really mean simultaneously on the same iteration or repetition of the random experiment.
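To make the distinction concrete, the following illustrative R sketch (our own example, using two tosses of a fair coin) contrasts independent events with disjoint events and confirms the marginal probability calculation:

  # Sample space of two fair coin tosses, all four outcomes equally likely
  omega <- expand.grid(first = c("H", "T"), second = c("H", "T"))
  p <- function(event) sum(event) / nrow(omega)

  A <- omega$first == "H"    # first toss is heads
  B <- omega$second == "H"   # second toss is heads
  p(A & B)                   # 0.25
  p(A) * p(B)                # 0.25: independent, P(A n B) = P(A)P(B)

  D <- omega$first == "T"    # disjoint from A, with non-zero probability
  p(A & D)                   # 0: the intersection is the empty set
  p(A) * p(D)                # 0.25: so A and D cannot be independent

  p(A & B) + p(A & !B)       # 0.5: the marginal probability P(A)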

A.4 Conditional Probability

If we are told that event A has occurred, then we know that everything outside of A (that is, $A'$) is no longer possible. We can consider the reduced universe $\Omega_r = A$ since we only need to consider outcomes within A. Consider a second event B. If we know that event A has occurred, will this affect the probability of our second event B occurring? That is, what is the conditional probability of B occurring given that A has occurred? The only part of the event B which will be relevant is that part which is also a part of A, that is, $A \cap B$. Given that we know that A has occurred, $P(\Omega_r) = 1$, by the second axiom of probability. Therefore, the probability of B given A, written $P(B|A)$, is the probability of $A \cap B$ multiplied by the scaling factor $\frac{1}{P(A)}$ due to our reduced universe $\Omega_r$:

$$P(B|A) = \frac{P(A \cap B)}{P(A)} \qquad \text{(A.1)}$$

Note that this equation tells us that $P(B|A) \propto P(A \cap B)$.

In the event that A and B are independent events, $P(B|A) = \frac{P(A \cap B)}{P(A)} = \frac{P(A) \cdot P(B)}{P(A)} = P(B)$, which states that knowledge about the event A does not affect the probability of the event B occurring when A and B are independent events.

Although it is tempting to consider $P(A|B)$ in the analogous way and write $P(A|B) = \frac{P(A \cap B)}{P(B)}$, we need to remember that B is an unobservable event. That is, when we calculate the conditional probability – the probability of B occurring given that A occurred – we are not observing either the occurrence or the nonoccurrence of B. Thus, the probability of A occurring will be dependent on which one of B or $\bar{B}$ has occurred: P(A) is conditional on the occurrence or nonoccurrence of event B. Rewriting the conditional probability formula as

$$P(A \cap B) = P(B) \cdot P(A|B) \qquad \text{(A.2)}$$

we get the rule known as the multiplication rule for probability. Although this may

seem trivial, it provides us with a conditional probability rule for any observable


event given an unobservable event and allows us to find the joint probability. Equation A.2 shows the relationship when B occurs and, similarly,

$$P(A \cap \bar{B}) = P(\bar{B}) \cdot P(A|\bar{B})$$

covers the case when B does not occur.
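As a small worked example (ours, not the book's), consider B = "the roll is a 6" given A = "the roll is even" on a fair die; Eq. A.1 and the multiplication rule can be checked directly in R:

  omega <- 1:6
  p <- function(event) length(event) / length(omega)
  A <- c(2, 4, 6)                            # the reduced universe once A has occurred
  B <- 6
  p_B_given_A <- p(intersect(A, B)) / p(A)   # Eq. A.1: 1/3
  p(A) * p_B_given_A                         # Eq. A.2: 1/6 = P(A n B)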

A.5 Bayes’ Theorem

Although we discuss Bayes’ theorem at various points in the book, we include it

here for completeness.

From its definition, the marginal probability of an event A is calculated by summing the probabilities of its disjoint parts: $P(A) = P(A \cap B) + P(A \cap \bar{B})$, and substituting this into Eq. A.1 gives

$$P(B|A) = \frac{P(A \cap B)}{P(A \cap B) + P(A \cap \bar{B})}$$

Using the multiplication rule, we can find the joint probabilities and rewrite the equation to give Bayes' theorem for an event A:

$$P(B|A) = \frac{P(A|B) \cdot P(B)}{P(A|B) \cdot P(B) + P(A|\bar{B}) \cdot P(\bar{B})} \qquad \text{(A.3)}$$

In A.3 and our definition for P(A), we used B and $\bar{B}$, which comprise the universe: $\Omega = B \cup \bar{B}$, and B and $\bar{B}$ are disjoint. We say that B and $\bar{B}$ partition the universe. This is the simplest case. Often we will have more than two events which partition the universe. Consider the case where we have n events such that $B_1 \cup B_2 \cup \cdots \cup B_n = \Omega$ and $B_i \cap B_j = \emptyset \; \forall i, j$, where $i = 1, \ldots, n$, $j = 1, \ldots, n$, $i \ne j$.

[Venn diagram: events $B_1$ through $B_4$ partition $\Omega$, and an observable event A overlaps each $B_i$.]

An observable event A will therefore be partitioned into n parts: $A = (A \cap B_1) \cup (A \cap B_2) \cup \cdots \cup (A \cap B_n)$, where $A \cap B_i$ and $A \cap B_j$ are disjoint because $B_i$ and $B_j$ are disjoint. Thus,


$$P(A) = \sum_{i=1}^{n} P(A \cap B_i) \qquad \text{(Law of Total Probability)}$$

Using the multiplication rule gives

$$P(A) = \sum_{i=1}^{n} P(A|B_i) \cdot P(B_i)$$

where each conditional probability is $P(B_i|A) = \frac{P(A \cap B_i)}{P(A)}$ and the multiplication rule is used to determine the joint probability in the numerator, $P(A \cap B_i) = P(B_i) \cdot P(A|B_i)$, to give

$$P(B_i|A) = \frac{P(A|B_i) \cdot P(B_i)}{\sum_{j=1}^{n} P(A|B_j) \cdot P(B_j)}$$
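As an illustration (the numbers are hypothetical and not from the text), suppose three sources $B_1$, $B_2$, $B_3$ partition the universe with prior probabilities 0.5, 0.3, and 0.2, and an observable event A occurs with conditional probabilities 0.01, 0.02, and 0.03 under each. The general form of Bayes' theorem can then be evaluated directly in R:

  prior      <- c(0.5, 0.3, 0.2)     # P(B_i), summing to 1
  likelihood <- c(0.01, 0.02, 0.03)  # P(A | B_i)

  p_A <- sum(likelihood * prior)     # Law of Total Probability: P(A)
  posterior <- likelihood * prior / p_A
  posterior                          # P(B_i | A): 0.294, 0.353, 0.353
  sum(posterior)                     # 1, since the B_i partition the universe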

A.6 Estimators

Consider a random sample from a distribution where we are tossing a thumb tack (or drawing pin) and where the random variable X takes a value of 1 if the tack lands on its side and 0 otherwise. Then in tossing the tack we are sampling from a distribution of X which has the probability distribution $p_0 = P(X_i = 0) = 1 - p$, $p_1 = P(X_i = 1) = p$, where p is the probability that the tack lands on its side.

We will not typically know the value of p, and so a problem we often face is to estimate the value of p using a random sample. In such cases, we know the distribution type, but it depends upon the values of one or more unknown parameters. In our example, the unknown parameter is p.

We therefore use a random sample to obtain estimates for the values of these parameters. In the general case, we have a random variable X whose distribution is of a known type, but which depends upon the value of some unknown parameter $\theta$. We wish to obtain a numerical value for $\theta$, and so we use the procedure of obtaining a random sample from the distribution of X and then use the value of an appropriate statistic to estimate $\theta$.


Thus, formally, an estimator for a parameter $\theta$ is a statistic (random variable) whose value is used to estimate $\theta$. We typically use the notation $\hat{\theta}$ to denote an estimator for $\theta$.

On occasion, it will be obvious what an estimator is for the unknown parameter. In our thumb tack example above, an obvious procedure would be to toss the tack n times, counting the number of times the tack lands on its side. If it lands on its side a total of a times then, using the proportion of tosses which result in the tack landing on its side, we could use

$$\hat{p} = \frac{a}{n}$$

as our estimator for p.

However, the choice is not always so clear. In fact, there may be several choices for an estimator. For example, for a random variable with probability distribution

$$f(x) = \begin{cases} \frac{1}{\theta} e^{-x/\theta} & \text{if } x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

we could use either the sample mean or the sample standard deviation of a random sample of size n to estimate $\theta$. An important question is which is the best estimator for $\theta$.

If $\hat{\theta}$ is an estimator for $\theta$, we obviously wish to know how good an estimator it is.

An intuitive question to help us answer this is: how close is the estimated value to the real value? We use the mean to help us with this. The mean of a random variable X with probability function $p_x$ is defined as

$$E(X) = \sum_{x} x \, p_x$$

where the summation is over all possible values of X. The notation E(X) is read as the expected value, or expectation.

If $E(\hat{\theta}) = \theta$, we say that $\hat{\theta}$ is an unbiased estimator for $\theta$. If $E(\hat{\theta}) \ne \theta$, then we define the bias as

$$\text{bias}(\hat{\theta}) = E(\hat{\theta}) - \theta$$

and say that $\hat{\theta}$ is a biased estimator for $\theta$.

If X is a random variable with mean $\mu$ and variance $\sigma^2$, then the sample mean $\bar{X}$ and the sample variance $S^2$ are unbiased estimators for $\mu$ and $\sigma^2$, respectively. If we consider $S^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2$ as an estimator for $\sigma^2$, then $E(S^2) = \frac{(n-1)\sigma^2}{n}$ and so $\text{bias}(S^2) = -\frac{\sigma^2}{n}$, which explains why $S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$ is usually used as the estimator of the population variance.
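A quick simulation (ours, for illustration) makes the bias visible: with samples of size n = 10 from a distribution with $\sigma^2 = 4$, the $\frac{1}{n}$ estimator averages close to $\frac{n-1}{n}\sigma^2 = 3.6$, while R's var(), which divides by n - 1, averages close to 4:

  set.seed(42)
  n <- 10; reps <- 100000
  samples <- matrix(rnorm(n * reps, mean = 0, sd = 2), nrow = reps)

  s2_biased   <- apply(samples, 1, function(x) sum((x - mean(x))^2) / n)
  s2_unbiased <- apply(samples, 1, var)   # divides by n - 1

  mean(s2_biased)     # about 3.6, i.e. bias(S^2) = -sigma^2/n
  mean(s2_unbiased)   # about 4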

The variance of a random variable X is defined by $\text{Var}(X) = E[(X - \mu)^2]$, where $\mu = E(X)$; the symbol $\sigma^2$ is reserved for the variance, with $S^2$ being used for the variance of a set of data. This is a measure of the dispersion of the probability distribution.

We say that an unbiased estimator $\hat{\theta}_1$ is a more efficient (better) estimator for $\theta$ than a second unbiased estimator $\hat{\theta}_2$ if $\text{Var}(\hat{\theta}_1) < \text{Var}(\hat{\theta}_2)$, and the relative efficiency of $\hat{\theta}_2$ with respect to $\hat{\theta}_1$ is $\text{Var}(\hat{\theta}_1)/\text{Var}(\hat{\theta}_2)$. This fraction is often expressed as a percentage, $[\text{Var}(\hat{\theta}_1)/\text{Var}(\hat{\theta}_2)] \times 100$. Thus, the relative efficiency is greater than 1 (or 100%) if $\hat{\theta}_2$ is more efficient than $\hat{\theta}_1$, and less than 1 if $\hat{\theta}_1$ is more efficient than $\hat{\theta}_2$.

The criterion outlined above for comparing estimators can only be used when all the estimators under consideration are unbiased. However, under certain conditions, we may prefer to use a biased estimator. We illustrate this by returning to our thumb tack example above, where we selected $\hat{p} = \frac{a}{n}$ as our estimator, which we will now denote by $\hat{p}_1$. Let us define a second estimator as follows:

$$\hat{p}_2 = \frac{a + 1}{n + 2}$$

We won't explain the origin of this estimator since we're using it for illustration purposes. Instead, we note that this does in fact occur as an estimator in certain more complex models.

$$E(\hat{p}_2) = \frac{E(a) + 1}{n + 2} = \frac{np + 1}{n + 2}$$

In order for $\hat{p}_2$ to be unbiased, we require $p = \frac{1}{2}$; otherwise it is biased. However, if n is large, the bias is small. Further, the value of $\text{Var}(\hat{p}_2)$ is also small when n is large. Thus, the estimate is close to the true value of p. So how can we compare two such estimators? One common method is to use the mean square error.

The mean square error (MSE) of an estimator $\hat{\theta}$ for a parameter $\theta$ is defined as

$$\text{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2]$$

As can be seen below, it is related to other quantities we have introduced above:

$$
\begin{aligned}
\text{MSE}(\hat{\theta}) &= E[(\hat{\theta} - \theta)^2] \\
&= E[\hat{\theta}^2 - 2\theta\hat{\theta} + \theta^2] \\
&= E(\hat{\theta}^2) - 2\theta E(\hat{\theta}) + \theta^2 \\
&= E(\hat{\theta}^2) - [E(\hat{\theta})]^2 + [E(\hat{\theta})]^2 - 2\theta E(\hat{\theta}) + \theta^2 \\
&= \text{Var}(\hat{\theta}) + [E(\hat{\theta}) - \theta]^2 \\
&= \text{Var}(\hat{\theta}) + [\text{bias}(\hat{\theta})]^2
\end{aligned}
$$


Thus, $\text{MSE}(\hat{\theta}) = \text{Var}(\hat{\theta}) + [\text{bias}(\hat{\theta})]^2$. Note also that if $\hat{\theta}$ is unbiased, then $\text{MSE}(\hat{\theta}) = \text{Var}(\hat{\theta})$. Therefore, we can think of the MSE as being a generalization of the variance that allows for bias.
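Returning to the thumb tack example, a short simulation (our own sketch; p = 0.4 and n = 10 are arbitrary illustrative values) shows that the biased estimator $\hat{p}_2$ can nonetheless have the smaller mean square error:

  set.seed(1)
  p <- 0.4; n <- 10; reps <- 100000
  a <- rbinom(reps, size = n, prob = p)   # number of "side" landings per sample

  p1 <- a / n               # unbiased estimator
  p2 <- (a + 1) / (n + 2)   # biased estimator

  mean((p1 - p)^2)   # MSE of p1: about 0.024 = p(1 - p)/n
  mean((p2 - p)^2)   # MSE of p2: about 0.017, smaller despite the bias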

An estimator $\hat{\theta}$, based on a sample of size n, for a parameter $\theta$ is said to be a consistent estimator for $\theta$ if

$$E(\hat{\theta}) \to \theta \text{ and } \text{Var}(\hat{\theta}) \to 0 \text{ as } n \to \infty; \text{ equivalently, } \text{MSE}(\hat{\theta}) \to 0.$$

This is an important consideration because when an estimator is consistent, it becomes increasingly likely that the estimate is close to the parameter's actual value as the sample size increases.

A.7 The Law of Large Numbers

The Law of Large Numbers states that for repeated, independent trials with the same probability p of success in each trial, the percentage of successes is increasingly likely to be close to the chance of success as the number of trials increases.

Formally, for every fixed positive amount $\epsilon > 0$, the chance that the percentage of successes differs from the probability p of success by more than $\epsilon$ tends to zero as the number of trials n tends to infinity.

Another way of thinking about this is to consider an experiment where the

outcome is a random variable, and where the experiment is conducted repeatedly.

The outcomes of different repetitions are independent of one another. The law of

large numbers says that as the number of independent repetitions increases, the

average of the observed outcomes approaches the average of all possible outcomes.

This can be easily illustrated by using the example of a die being rolled. The average of all possible outcomes is the average value of the six numbers: $(1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5$. Obviously, 3.5 is not a possible, observable outcome for any individual experiment since it is not on the face of a die. If the observed result we see for the ith experiment is denoted $x_i$, the average of the observed outcomes is

$$\frac{\sum_{i=1}^{n} x_i}{n}$$

If $x_1 = 3$ and $x_2 = 5$, then the average of the observed outcomes is $(3 + 5)/2 = 4$; if $x_3 = 1$, the average of the observed outcomes is $(3 + 5 + 1)/3 = 3$, and so on. We can make the value as close as we like to 3.5 by increasing the number of repetitions of the experiment.
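The same experiment is easy to simulate; the following illustrative R sketch (our own) tracks the running average of simulated die rolls and shows it settling toward 3.5:

  set.seed(7)
  rolls <- sample(1:6, size = 10000, replace = TRUE)
  running_avg <- cumsum(rolls) / seq_along(rolls)
  running_avg[c(10, 100, 1000, 10000)]   # drifts toward 3.5 as n grows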


Appendix B

Databases and Software Projects

A plethora of software products exist to support data mining efforts. These products

include core technical infrastructure products, such as the underlying database management systems and the facilities within them to support mining models, through general-purpose products that help with the various stages of the mining effort, on to those products which offer very specific and complete support for particular subject areas such as microarray analysis. To include an exhaustive list of products

is simply not feasible and would certainly be completely out of date by the time this

book is published.

Instead, we have included a number of products and databases that we have

found useful in our experiences and which, we believe, will continue to be avail-

able, developed, and supported. We include these software products and databases

with the kind agreement of the respective authors of those products and databases, to illustrate the various mining techniques we describe.

All of the products or databases we use are listed here for easy reference by the

reader. We have kept the information to a minimum, preferring to point the reader

to the official web site or reference since the versions, licensing details, and scope of

these products or databases may change drastically between the time of writing and

when the reader accesses the software or data.

Wherever we have been granted permission, the actual version of the software/

database used in this book can be located at www.lsdm.org/thirdparty.

The descriptions of the products and databases contained in this appendix are

from the various vendors and not from the authors.

The nature of the Internet is such that URLs accurate at the time of writing may

no longer be accessible. We have tried to provide enough context for a search in the

event that the URL does not work.


B.1 Software Projects

B.1.1 Alyuda NeuroIntelligence

http://www.alyuda.com/neural-networks-software.htm

This is a neural networks software application designed to assist experts in solving

real-world problems. NeuroIntelligence features only proven algorithms and

techniques. It is fast and easy to use.

B.1.2 AmiGO

www.godatabase.org

This tool provides a mechanism for searching the gene ontology database, which

comprises a controlled vocabulary of terms for biological concepts and for genes

and gene products which have been annotated using the controlled vocabulary. For

example, an inquiry using “protein kinase activity” returned 93 matches in the

database.3

3 A sample run performed on May 5, 2007.

See also “The Gene Ontology.”

B.1.3 AutoClass@IJM

http://ytat2.ijm.univ-paris-diderot.fr/AutoclassAtIJM.html

AutoClass@IJM is a freely available (to academia) computational resource with a web interface to AutoClass, a powerful unsupervised Bayesian classification system developed by the Ames Research Center at NASA. End users upload their datasets through a web interface; computations are then queued into the cluster server. When the clustering is completed, a URL to the results is sent back to the user by e-mail.

B.1.4 Babelomics

http://babelomics.bioinfo.cipf.es/

An integrative platform for the analysis of transcriptomics, proteomics, and genomic data with advanced functional profiling. It integrates primary (normalization, calls, etc.) and secondary (signatures, predictors, associations, TDTs, clustering, etc.) analysis tools within an environment that allows relating genomic data and/or interpreting them by means of different functional enrichment or gene set methods.

B.1.5 BioCarta

www.biocarta.com

This is an interactive, web-based resource for life scientists that includes informa-

tion on gene function, proteomic pathways, ePosters, and research reagents.

B.1.6 Bioconductor

www.bioconductor.org

The Bioconductor project was started in 2001 with the overall objective of devel-

oping a set of tools for analyzing and understanding genomic data. Gentleman et al.

(2001 #90) is the official FAQ for this project at www.bioconductor.org/docs/faq

and the reader is referred to this page for more details on the project’s goals and

objectives.

The version of Bioconductor used in this book is version 1.9.

B.1.7 Biomedical Informatics Research Network (BIRN)

The Biomedical Informatics Research Network (BIRN) is a national initiative to

advance biomedical research through data sharing and online collaboration. Funded

by the National Center for Research Resources (NCRR), a component of the US

National Institutes of Health (NIH), BIRN provides data-sharing infrastructure,

software tools, strategies, and advisory services – all from a single source.

http://www.birncommunity.org/

B.1.8 BioPAX

www.biopax.org

This is a data exchange format for biological pathway data.


B.1.9 BioTapestry

http://www.biotapestry.org/

BioTapestry is an interactive tool for building, visualizing, and simulating genetic

regulatory networks.

B.1.10 BLAST

http://www.ncbi.nlm.nih.gov/BLAST

B.1.11 BLAT

http://genome.ucsc.edu/cgi-bin/hgBlat?command=start

B.1.12 BrainMaker

http://www.calsci.com/BrainIndex.html

This is neural network software that lets you use your computer for business and marketing forecasting; stock, bond, commodity, and futures prediction; pattern recognition; medical diagnosis; sports handicapping... almost any activity where you need special insight.

B.1.13 Cambridge Structure Database (CSD)

The Cambridge Structural Database (CSD) records bibliographic, chemical, and

crystallographic information for organic molecules and metal-organic compounds

whose 3D structures have been determined using X-ray diffraction and neutron

diffraction. Results are recorded for single crystal studies and powder diffraction

studies, which yield 3D atomic coordinate data for at least all non-hydrogen atoms.

In some cases, the Cambridge Crystallographic Data Centre (CCDC) is unable to

obtain coordinates, and incomplete entries are archived to the CSD. Crystal struc-

ture data is included that arises from publications in the open literature or private

deposits.

http://www.ccdc.cam.ac.uk/products/csd/

B.1.14 CDC National Center for Health Statistics

http://www.cdc.gov/nchs/datawh.htm


B.1.15 Clone|Gene ID Converter

http://idconverter.bioinfo.cnio.es/

This is a tool for converting gene, clone, or protein IDs to other IDs, which can be

used for small queries or for tens of thousands of IDs (typically from a microarray

experiment). Most of the conversions are pre-generated every time the databases

are updated in order to get a fast answer for each query.

B.1.16 Clustal

http://www.clustal.org

B.1.17 Cognos

http://www-01.ibm.com/software/data/cognos/

B.1.18 COPASI: Biological Network Simulator

http://www.copasi.org/tiki-view_articles.php

COPASI is a software application for simulation and analysis of biochemical

networks and their dynamics. COPASI is a stand-alone program that supports

models in the SBML standard and can simulate their behavior using ODEs or

Gillespie’s stochastic simulation algorithm; arbitrary discrete events can be

included in such simulations.

COPASI carries out several analyses of the network and its dynamics and has

extensive support for parameter estimation and optimization. COPASI provides

means to visualize data in customizable plots, histograms, and animations of

network diagrams.

B.1.19 cMAP

cmap.nci.nih.gov

B.1.20 CompuCell3D

CompuCell is an open-source software modeling environment and PDE solver. It is

largely used for cellular modeling (foams, tissues, etc.); however, efforts are being

made to include fluid simulation capabilities. Created in collaboration between


groups at IU and Notre Dame, CompuCell provides an easy user interface for

complex cellular modeling.

https://simtk.org/home/compucell3d

http://www.compucell3d.org/

B.1.21 Cytoscape

http://www.cytoscape.org/

Cytoscape is an open-source bioinformatics software platform for visualizing molecular interaction networks and biological pathways and for integrating these networks with annotations, gene expression profiles, and other data. Cytoscape was originally designed for biological research; now it is a general platform supporting complex network analysis and visualization. Its use of plug-ins to provide additional features supporting new file formats, profiling analysis, layouts, scripting, or database connectivity, to name but a few domains, makes it an even more general-purpose tool.

B.1.22 Database for Annotation, Visualization, and Integrated Discovery (DAVID)

http://david.abcc.ncifcrf.gov/home.jsp

Database for Annotation, Visualization, and Integrated Discovery (DAVID) provides a comprehensive set of functional annotation tools for investigators to understand the biological meaning behind large lists of genes. Its functionality includes the ability to identify enriched biological themes, particularly GO terms; discover enriched functional-related gene groups; cluster redundant annotation terms; visualize genes on BioCarta and KEGG pathway maps; display related many-genes-to-many-terms on 2-D views; search for other functionally related genes not in the user-provided list; list interacting proteins; explore gene names in batch; link gene-disease associations; highlight protein functional domains and motifs; redirect the user to related literature; and convert gene identifiers from one type to another.

B.1.23 EasyNN Plus

http://www.easynn.com/

This is a low-cost, intuitive software product for developing neural networks.


B.1.24 EMBOSS

http://www.hgmp.mrc.ac.uk/Software/EMBOSS

B.1.25 Entrez

www.ncbi.nlm.nih.gov/entrez

B.1.26 Evoker

Evoker is a graphical tool for visualizing genotype intensity data in order to assess

genotype calls as part of quality control procedures for genome-wide association

studies. It provides a solution to the computational and storage problems related to

being able to work with the huge volumes of data generated by such projects by

implementing a compact, binary format that allows rapid access to data, even with

hundreds of thousands of observations.

http://www.sanger.ac.uk/resources/software/evoker/

B.1.27 Flapjack

New software tools for graphical genotyping and haplotype visualization are

required that can routinely handle the large data volumes generated by high

throughput SNP and comparable genotyping technologies. Flapjack is a new

visualization tool to facilitate analysis of these data types. Its visualizations are

rendered in real time allowing for rapid navigation and comparisons between lines,

markers, and chromosomes.

Based on the input of map, genotype, and trait data, Flapjack is able to provide a

number of alternative graphical genotype views with individual alleles colored by

state, frequency, or similarity to a given standard line. Flapjack supports a range of

interactions with the data, including graphically moving lines or markers around the

display, insertions or deletions of data, and sorting or clustering of lines by either

genotype similarity to other lines or by trait scores. Any map-based information

such as QTL positions can be aligned against graphical genotypes to identify

associated haplotypes.

All results are saved in an XML-based project format and can also be exported as

raw data or graphically as image files. We have devised efficient data storage

structures that provide high-speed access to any subset of the data, resulting in

fast visualization regardless of the size of the data.

http://bioinf.scri.ac.uk/flapjack/


B.1.28 GEO

www.ncbi.nlm.nih.gov/geo

B.1.29 Gene Ontology (GO)

The Gene Ontology (GO) is a bioinformatics initiative aimed at standardizing the

representation of gene and gene product attributes (RNA or proteins resulting from

the expression of a gene) across all species by developing a controlled vocabulary,

annotating genes and gene products, assimilating and disseminating the

annotations, and providing tools to access the data, such as the AmiGO browser.

www.geneontology.org

B.1.30 g:Profiler

http://biit.cs.ut.ee/gprofiler/gconvert.cgi

This is a gene identifier tool that allows conversion of genes, proteins, microarray

probes, standard names, and various database identifiers.

B.1.31 Graphviz

www.graphviz.org

This is open-source graph visualization software.

B.1.32 HMMER

http://hmmer.wustl.edu

B.1.33 Java Data Mining (JDM) Toolkit

http://www.jcp.org/en/jsr/detail?id=247

B.1.34 ID Converter

http://biodb.jp/#ids

This is a tool for converting data IDs for biological molecules that are used in a

database into other, corresponding data IDs that are used in other databases.


B.1.35 iModel

http://www.biocompsystems.com/products/imodel/

B.1.36 KnowledgeMiner

http://www.knowledgeminer.com/

KnowledgeMiner (yX) for Excel is a knowledge mining tool that works with data

stored in Microsoft Excel for building predictive and descriptive models from this

data autonomously and easily.

B.1.37 Libcurl

curl.haxx.se

B.1.38 MatchMiner

http://discover.nci.nih.gov/matchminer/MatchMinerLookup.jsp

This is a set of tools that enables the user to translate between disparate ids for the

same gene using data from the UCSC, LocusLink, Unigene, OMIM, Affymetrix,

and Jackson data sources to determine how different ids relate. Supported id types

include gene symbols and names, IMAGE and FISH clones, GenBank accession

numbers, and UniGene cluster ids.

B.1.39 MATLAB

http://www.mathworks.com/products/matlab/

MATLAB® probably needs no introduction to the majority of readers. It is a high-

level technical computing language and interactive environment for algorithm

development, data visualization, data analysis, and numeric computation. Using

the MATLAB product, you can solve technical computing problems faster than

with traditional programming languages, such as C, C++, and Fortran.

You can use MATLAB in a wide range of applications, including signal and

image processing, communications, control design, test and measurement, financial

modeling and analysis, and computational biology. Add-on toolboxes (collections

of special-purpose MATLAB functions, available separately) extend the MATLAB

environment to solve particular classes of problems in these application areas.


MATLAB provides a number of features for documenting and sharing your

work. You can integrate your MATLAB code with other languages and

applications, and distribute your MATLAB algorithms and applications.

Of more specific interest is the range of toolkits that are available within the

MATLAB architecture to support specific needs.

Image Processing Toolkit

http://www.mathworks.com/products/image/#thd1

Image Processing Toolbox™ provides a comprehensive set of reference-standard

algorithms and graphical tools for image processing, analysis, visualization, and

algorithm development. You can perform image enhancement, image deblurring,

feature detection, noise reduction, image segmentation, spatial transformations, and

image registration.

B.1.40 MemBrain

http://www.membrain-nn.de/main_en.htm

MemBrain is a powerful graphical neural network editor and simulator for

Microsoft Windows, supporting neural networks of arbitrary size and architecture.

B.1.41 Microarray Databases

Database                           URL
Gene Expression Omnibus – NCBI     http://www.ncbi.nlm.nih.gov/geo/
Stanford Microarray database       http://smd.stanford.edu/
GeneNetwork system                 http://www.genenetwork.org/
ArrayExpress at EBI                http://www.ebi.ac.uk/arrayexpress/
UNC Microarray database            https://genome.unc.edu/
Genevestigator database            https://www.genevestigator.com/
caArray at NCI                     http://array.nci.nih.gov/caarray/
UPenn RAD database                 http://www.cbil.upenn.edu/RAD
UNC modENCODE Microarray database  https://genome.unc.edu:8443/nimblegen
ArrayTrack                         http://www.fda.gov/ScienceResearch/BioinformaticsTools/Arraytrack/
MUSC database                      http://proteogenomics.musc.edu/ma/musc_madb.php?page=home&act=manage
UPSC-BASE                          http://www.upscbase.db.umu.se/


B.1.42 Molecular Signatures Database (MSigDB)

http://www.broadinstitute.org/gsea/msigdb/index.jsp

This is a collection of annotated gene sets for use with GSEA software.

B.1.43 NAR

www3.oup.co.uk/nar/database/c

B.1.44 Neural Network Toolbox (MATLAB)

http://www.mathworks.com/products/neuralnet/

Neural Network Toolbox™ extends MATLAB® with tools for designing,

implementing, visualizing, and simulating neural networks. Neural networks are

invaluable for applications where formal analysis would be difficult or impossible,

such as pattern recognition and nonlinear system identification and control. Neural

Network Toolbox software provides comprehensive support for many proven

network paradigms, as well as graphical user interfaces (GUIs) that enable you to

design and manage your networks. The modular, open, and extensible design of the

toolbox simplifies the creation of customized functions and networks.

B.1.45 NeuralWorks

http://www.neuralware.com/index.jsp

NeuralWorks Predict is an integrated, state-of-the-art tool for rapidly creating and

deploying prediction and classification applications. Predict combines neural net-

work technology with genetic algorithms, statistics, and fuzzy logic to automati-

cally find optimal or near-optimal solutions for a wide range of problems. Predict

incorporates years of modeling and analysis experience gained from working with

customers faced with a wide variety of analysis and interpretation problems.

B.1.46 NeuroSolutions

http://www.nd.com/

This leading edge neural network development software combines a modular, icon-

based network design interface with an implementation of advanced learning

procedures, such as conjugate gradients and backpropagation through time.


B.1.47 NeuroXL

http://www.neuroxl.com/

NeuroXL Classifier is a fast, powerful, and easy-to-use neural network software

tool for classifying data in Microsoft Excel. Designed to aid experts in real-world

data mining and pattern recognition tasks, it hides the underlying complexity of

neural network processes while providing graphs and statistics for the user to easily

understand results. NeuroXL Classifier uses only proven algorithms and

techniques, and integrates seamlessly with Microsoft Excel. OLSOFT Neural

Network Library is the class to create, learn, and use Back Propagation neural

networks and SOFM (Self-Organizing Feature Map). The library makes integration

of neural networks’ functionality into your own applications easy and seamless. It

enables your programs to handle data analysis, classification, and forecasting needs.

B.1.48 Open Biological and Biomedical Ontologies (OBO Foundry)

This is a collaborative experiment involving developers of science-based

ontologies who are establishing a set of principles for ontology development with

the goal of creating a suite of orthogonal interoperable reference ontologies in the

biomedical domain.

http://www.obofoundry.org/

B.1.49 Omegahat

www.omegahat.org

B.1.50 OMIM

www.ncbi.nlm.nih.gov/entrez

B.1.51 OpenDiseaseModels.Org

OpenDiseaseModels.org is an open-source disease/systems model development

project. Analogous to open-source software development projects, the goal of this

effort is to develop better, more useful models in a transparent and public collabo-

rative forum.

http://www.opendiseasemodels.org/


B.1.52 Perl

http://strawberryperl.com/

http://www.activestate.com/activeperl/

Perl is a programming language that has seen a particularly widespread adoption in

the bioinformatics space. It is available on all the usual platforms.

B.1.53 PIPE

http://sourceforge.net/projects/pipe2/

Create, model, and analyze Petri nets with a standards-compliant Petri net tool. This is the active fork of the Platform Independent Petri net Editor project, which originated at Imperial College London.

B.1.54 PNK

http://www2.informatik.hu-berlin.de/top/pnk/index.html

PNK provides an infrastructure for bringing ideas for analyzing, simulating, and

verifying Petri Nets.

B.1.55 PNK2e

http://page.mi.fu-berlin.de/trieglaf/PNK2e/index.html

This is a software environment for the modeling and simulation of biological

processes that uses Stochastic Petri Nets (SPNs), a graphical representation of

Markov Jump Processes.

B.1.56 Proper

Most databases employ the relational model for data storage. To use this data in a

propositional learner, a propositionalization step has to take place. Similarly, the

data has to be transformed to be amenable to a multi-instance learner. The Proper

Toolbox contains an extended version of RELAGGS, the Multi-Instance Learning

Kit MILK, and can also combine the multi-instance data with aggregated data from

RELAGGS. RELAGGS was extended to handle arbitrarily nested relations and to

work with both primary keys and indices. For MILK, the relational model is

flattened into a single table and this data is fed into a multi-instance learner.


REMILK finally combines the aggregated data produced by RELAGGS and the

multi-instance data, flattened for MILK, into a single table that is once again the

input for a multi-instance learner. Several well-known datasets are used for

experiments which highlight the strengths and weaknesses of the different

approaches. (Abstract from Reutemann et al. (2004 #848))

http://www.cs.waikato.ac.nz/ml/proper/

B.1.57 PseAAC

http://www.csbio.sjtu.edu.cn/bioinf/PseAAC/

This site allows you to generate pseudo amino acid compositions, according to

Chou (2001 #520).

B.1.58 PubMed

www.ncbi.nlm.nih.gov/entrez

B.1.59 R

www.r-project.org

R is an open-source analytical environment that is not only comprehensive in its

own right but is widely enhanced and extended through packages that have been

developed. At the time of writing, several hundred add-on packages are available

for R that cover a wide array of subjects and disciplines.

B.1.60 R Packages

We have used many R packages throughout this book and gratefully appreciate

the skill and time that the various authors have spent developing these packages.

In particular, we have used several packages extensively and highlight these herein.

These packages can all be accessed using the Comprehensive R Archive Network (CRAN) at http://cran.r-project.org.

B.1.60.1 odfWeave

The Sweave function combines R code and LaTeX so that the output of the code is embedded in the processed document. The odfWeave package was created so that the functionality of Sweave can be used to generate documents that the end user can easily edit. The markup language used is the Open Document Format (ODF), which is an open, non-proprietary format that encompasses text documents, presentations, and spreadsheets.

B.1.61 RapidMiner

http://rapid-i.com/content/view/181/190/

RapidMiner provides comprehensive data mining capabilities, including

• Data Integration, Analytical ETL, Data Analysis, and Reporting in one single

suite

• Powerful but intuitive graphical user interface for the design of analysis

processes

• Repositories for process, data, and metadata handling

• Only solution with metadata transformation: forget trial and error and inspect

results already during design time

• Only solution which supports on-the-fly error recognition and quick fixes

• Complete and flexible: hundreds of data loading, data transformation, data

modeling, and data visualization methods

RapidMiner Image Processing Extension

http://spl.utko.feec.vutbr.cz/en/component/content/article/21-zpracovani-medicinskych-signalu/46-image-processing-extension-for-rapidminer-5

This add-on package provides capabilities including

• Local image features extraction

• Segment feature extraction

• Global image feature extraction

• Extracting features from single/multiple image(s)

• Detect a template in image (rotation invariant)

• Point of interest detection

• Image comparison

• Image transforms

• Color mode transforms

• Noise reduction

• Image segmentation

• Object detection and object detector training (Haar-like features)


B.1.62 Reactome

www.reactome.org

B.1.63 Resourcerer

pga.tigr.org/tigr-scripts/magic/r1.pl

B.1.64 RStudio

RStudio™ is a new integrated development environment (IDE) for R. RStudio

combines an intuitive user interface with powerful coding tools to help you get the

most out of R.

http://www.rstudio.org/

B.1.65 Systems Biology Graphical Notation (SBGN)

The Systems Biology Graphical Notation (SBGN) project is an effort to standardize

the graphical notation used in maps of biochemical and cellular processes studied in

systems biology.

Standardizing the visual representation is crucial for more efficient and accurate

transmission of biological knowledge between different communities in research,

education, publishing, and more. When biologists are as familiar with the notation

as electronics engineers are familiar with the notation of circuit schematics, they

can save the time and effort required to familiarize themselves with different

notations, and instead spend more time thinking about the biology being depicted.

http://sbgn.org/

B.1.66 Systems Biology Markup Language (SBML)

http://sbml.org/

A free and open interchange format for computer models of biological processes.

SBML is useful for models of metabolism, cell signaling, and more. It has been in

development by an international community since the year 2000.


B.1.67 Systems Biology Workbench (SBW)

http://sys-bio.org/sbwWiki/doku.php?id=sysbio:sbw

This is an open-source framework connecting heterogeneous software applications.

SBW is made up of two kinds of components:

• Modules: These are the applications that a user would use. We have a wide

collection of model editing, model simulation, and model analysis tools.

• Framework: The software framework that allows developers to cross program-

ming language boundaries and connect application modules to form new

applications.

B.1.68 W3C

www.w3.org

B.1.69 Weka

This is a machine learning toolkit that includes an implementation of an SVM

classifier. Weka can be used both interactively through a graphical interface and as

a software library. (The SVM implementation is called “SMO.” It can be found in

the Weka Explorer GUI, under the “functions” category.)

http://www.cs.waikato.ac.nz/ml/weka/

B.2 Data Sources

B.2.1 BioModels Database

This is a repository of peer-reviewed, published, computational models. These

mathematical models are primarily from the field of systems biology, but more

generally are of biological interest. This resource allows biologists to store, search,

and retrieve published mathematical models. In addition, models in the database

can be used to generate sub-models, can be simulated online, and can be converted

between different representational formats. This resource also features program-

matic access via web services.

http://www.ebi.ac.uk/biomodels-main/

B.2.2 DDBJ

http://www.ddbj.nig.ac.jp


B.2.3 EMBL

http://www.ebi.ac.uk/embl/index.html

B.2.4 GenBank

http://www.ncbi.nlm.nih.gov/GenBank

B.2.5 Pfam

http://pfam.wustl.edu

B.2.6 PROSITE

http://us.expasy.org/prosite

B.2.7 SWISS-PROT

http://us.expasy.org/sprot

B.3 Support Vector Machines

Support vector machines (SVMs) have seen a particular growth in the tools and

software available to support this valuable supervised learning paradigm. We list a

number of products available at the time of writing.4

The Kernel-Machine Library (GNU) – C++ template library for support vector

machines.

Lush – a Lisp-like interpreted/compiled language with C/C++/Fortran interfaces

that has packages to interface to a number of different SVM implementations.

Interfaces to LASVM, LIBSVM, mySVM, SVQP, SVQP2 (SVQP3 in future) are

available. Leverage these against Lush’s other interfaces to machine learning,

hidden Markov models, numerical libraries (LAPACK, BLAS, GSL), and built-in

vector/matrix/tensor engine.

4 As available from the Wikipedia entry http://en.wikipedia.org/wiki/Support_vector_machine


SVMlight – a popular implementation of the SVM algorithm by Thorsten Joachims;

it can be used to solve classification, regression, and ranking problems.

SVMProt – Protein Functional Family Prediction.

LIBSVM – A library for support vector machines, Chih-Chung Chang and Chih-Jen

Lin.

YALE – a powerful machine learning toolbox containing wrappers for SVMLight,

LibSVM, and MySVM in addition to many evaluation and preprocessing methods.

LS-SVMLab – MATLAB/C SVM toolbox; well-documented, many features.

Gist – implementation of the SVM algorithm with feature selection.

Weka – a machine learning toolkit that includes an implementation of an SVM

classifier; Weka can be used both interactively through a graphical interface and as

a software library. (The SVM implementation is called “SMO.” It can be found in

the Weka Explorer GUI, under the “functions” category.)

OSU SVM – MATLAB implementation based on LIBSVM.

Torch – C++ machine learning library with SVM.

Shogun – Large-Scale Machine Learning Toolbox that provides several SVM

implementations (like libSVM, SVMlight) under a common framework and

interfaces to Octave, MATLAB, Python, and R.

Spider – machine learning library for MATLAB.

kernlab – Kernel-based machine learning library for R.

e1071 – machine learning library for R.

SimpleSVM – SimpleSVM toolbox for MATLAB.

SVM and Kernel Methods MATLAB Toolbox

PCP – C program for supervised pattern classification; includes LIBSVM wrapper.

TinySVM – a small SVM implementation, written in C++.

pcSVM is an object-oriented SVM framework written in C++ and provides

wrapping to Python classes. The site provides a stand-alone demo tool for

experimenting with SVMs.

PyML – a Python machine learning package; includes: SVM, nearest neighbor

classifiers, ridge regression, multi-class methods (one-against-one and one-against-

rest), Feature selection (filter methods, RFE, multiplicative update), Model selec-

tion, Classifier testing (cross-validation, error rates, ROC curves, statistical test for

comparing classifiers).

Algorithm::SVM – Perl bindings for the libsvm support vector machine library.

SVM Classification Applet – performs classification on any given dataset and gives

10-fold cross-validation error rate.


Appendix C

The Patient Dataset

C.1 Introduction

At various points in the book we refer to a patient dataset. We define this dataset in

this appendix rather than in the main body of the book so as not to lose too much

momentum in the book itself.

The patient dataset definition, together with some artificially created data, is

available from the companion website along with some example code for various

data mining techniques.

The datasets defined below are not intended to be viewed as complete, or even as

necessarily valuable, but simply serve to provide us with a meaningful data

environment with which to illustrate some of the data mining concepts we discuss

within the book.

For the purposes of illustration of the data mining techniques, we have

denormalized the patient dataset into two datasets – one for the patient, and one

for the tests performed on specimens/isolates from those patients. A normalized

version (to mimic a transactional environment) and a star-schema (to mimic a data

warehouse environment) are available on the companion website.

C.2 Supporting Datasets

In addition to the dataset generated from our work, we would typically use a

number of datasets that come from other sources. For example, in chapter two we

use a body mass index dataset to allow us to calculate weight-related categories.

Where so used, we indicate these in the main text.


C.3 Patient Dataset Definition

Key | Attribute       | Format    | Description
----|-----------------|-----------|------------------------------------------------------------
PK  | Id              | Integer   | Unique key value for this record in the dataset. Every record will have a unique value.
    | Patient Label   | String    | Some identification for the patient. This may be their name, initials, or other "human-readable" attribute. Note that this is not an identifier in the pure sense, but simply a text label.
    | MRN             | String    | Medical record number.
    | Patient Id      | Integer   | A unique identifier assigned to each patient. This is an integral value and may simply be used as a sequence, where the next available number is assigned to the next unique patient without any particular ordering.
    | Context Id      | Integer   | A unique identifier assigned to a particular context associated with this data. Contexts will typically be some project or study, or other high-level grouping for the data. In this book, the context is synonymous with a project.
    | Encounter Id    | Integer   | A unique identifier assigned to each encounter between a patient and a healthcare professional within a specific context. This last constraint is important since it is likely that this value will be a sequence within a project/study.
    | Age             | Integer   | The patient's age.
    | Weight          | Double    | The patient's weight, measured in kilograms.
    | Height          | Double    | The patient's height, measured in meters.
    | Gender          | Character | The patient's gender. Allowed values are M(ale), F(emale), N(ot provided), and ' ' or NULL.
    | Smoker          | Boolean   | TRUE/FALSE value indicating whether the patient is a smoker.
    | Activity Factor | Integer   | A scalar value from 0 to 10 indicating the level of activity/exercise in the patient's daily life. 0 indicates immobility, whereas 10 indicates an extreme level of activity such as for professional athletes.
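As a minimal sketch of how this definition might be realized as a relational table — the table and column names are our own illustrative choices, not those used on the companion website, and SQLite has no Boolean or Character types, so Smoker becomes 0/1 and Gender a one-character string:

    import sqlite3

    conn = sqlite3.connect(":memory:")    # throwaway database for illustration
    conn.execute("""
        CREATE TABLE patient (
            id              INTEGER PRIMARY KEY,  -- PK: unique per record
            patient_label   TEXT,                 -- human-readable label, not an identifier
            mrn             TEXT,                 -- medical record number
            patient_id      INTEGER,              -- unique per patient
            context_id      INTEGER,              -- project/study grouping
            encounter_id    INTEGER,              -- unique per encounter within a context
            age             INTEGER,
            weight          REAL,                 -- kilograms
            height          REAL,                 -- meters
            gender          TEXT CHECK (gender IN ('M', 'F', 'N', ' ') OR gender IS NULL),
            smoker          INTEGER CHECK (smoker IN (0, 1)),  -- 0/1 for FALSE/TRUE
            activity_factor INTEGER CHECK (activity_factor BETWEEN 0 AND 10)
        )""")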

C.4 Test Dataset Definition

Key | Attribute                | Format    | Description
----|--------------------------|-----------|---------------------------------------------------------
PK  | Id                       | Integer   | Unique key value for this record in the dataset. Every record will have a unique value.
FK  | Patient Id               | Integer   | The patient id value from the patient dataset for the patient associated with this test result.
    | Test Id                  | String    | Some unique identifier for the test that is consistent within the data environment. For example, LDL-C might be a test id for measurement of LDL cholesterol. This can then be compared across instances.
    | Test Description         | String    | A description of the test.
    | Test Result Value String | String    | The actual test result. This attribute is populated when the test result domain is a string or character value.
    | Test Result Value        | Double    | The actual test result. This attribute is populated when the test result domain is an integer or numeric value.
    | Test Result Domain       | Character | A value indicating the actual result value's domain. Values are S(tring), I(nteger), N(umeric), and C(haracter). This can be used to convert the values to their appropriate types, as sketched below.
    | Interpretation           | String    | The interpretive result, if any, associated with this test.
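Since only one of the two result attributes is populated for any given record, the Test Result Domain code tells a consumer which attribute to read and how to convert it. A minimal sketch of such a conversion — the domain codes follow the table above, but the function itself is our own illustration:

    def convert_result(domain, string_value, numeric_value):
        """Coerce a test result to the type declared by its domain code."""
        if domain == "S":    # string result, from Test Result Value String
            return string_value
        if domain == "C":    # character result: first character of the string value
            return string_value[0] if string_value else None
        if domain == "I":    # integer result, from Test Result Value
            return None if numeric_value is None else int(numeric_value)
        if domain == "N":    # numeric result, from Test Result Value
            return numeric_value
        raise ValueError("unknown test result domain: %r" % domain)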


Appendix D

The Clinical Trial and Data Mining

D.1 Introduction

Throughout the book we have referred to the clinical trial as an archetype for data mining. Some of the reasons for this are intuitive; others may be less so.

It is certainly true that much of the data generated in the healthcare and life sciences arenas relates to the pharmaceutical product and medical device industries. By focusing on these industries, we do not intend to dismiss any of the other vitally important research and development disciplines that also form the foundation of the life sciences today. We use the pharmaceutical development process because many people outside these disciplines have some knowledge of it: advertising, public information on the safety of drugs and devices, and other media coverage have placed it front and center in the public psyche.

D.2 The Clinical Trial Process

D.2.1 Statistics of a Phase III Clinical Trial

We now present some of the basic statistics of a phase III clinical trial.

Number of patients (start of trial)              2,500
Number of visits per patient                     12
Data collected per patient per visit (bytes)     4,000
Total                                            114.44 MB

The total above is the maximum amount of data that might be captured; it relies on several assumptions that typically do not hold in reality.


The most significant of these is the assumption that every patient remains in the trial from the beginning to the very end. This does not typically happen: patients drop out, are excluded, or, unfortunately, die before the trial ends. Since the largest driver of the data volume is the number of patients involved, suppose that only 25% of the patients initially enrolled see the trial through. This would result in 28.6 MB of data.
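For reference, the arithmetic behind both figures is simply the product of the three quantities above, expressed in megabytes of 2^20 bytes (our assumption, since it is the convention that reproduces the totals given):

    patients, visits, bytes_per_visit = 2500, 12, 4000

    total_bytes = patients * visits * bytes_per_visit   # 120,000,000 bytes
    print(total_bytes / 2 ** 20)                        # -> 114.44 MB, every patient completes
    print(0.25 * total_bytes / 2 ** 20)                 # -> 28.61 MB, only 25% complete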



Index

A

Accuracy, 5, 10, 12, 25, 42, 70, 80, 120,

133–135, 142, 192, 211, 217, 233, 240,

260, 278, 327, 332, 352, 364, 365,

370–373, 396, 407, 429, 458, 460, 467,

472, 474, 484, 489–497, 514, 515, 585

Aggregation, 21, 41, 110–113, 202, 209–211,

225, 379, 497, 544

Analytical entity, 193, 194

Apriori algorithm, 146, 148–150, 372, 447, 587

ARFF, 86, 333

Artificial neural networks, 396, 399–425, 577

Association, 4, 6, 8, 11, 14, 18, 25, 38, 42, 46,

69–73, 82, 86, 87, 105, 119, 121,

145–152, 161, 170–172, 183, 283–285,

289, 297, 339, 364, 365, 372, 397, 447,

548, 561, 562, 566, 568, 576, 580

Association analysis, 42, 46

Association rules, 69–73, 82, 145–152, 183,

364, 372, 447

Asymmetry, 75, 238, 272–273

Attribute, 4, 36, 87, 125, 192, 236, 304, 365,

459, 514, 564

B

Back-propagation, 396, 412–416, 423–425,

469, 477

Bayes factor, 312–315

Bayesian, 14, 29, 183, 186, 237, 240, 241,

303–360, 456, 457, 469, 471–477, 480,

549, 577

Bayesian belief networks, 337–340

Bayesian classification, 327–337, 469

Bayesian modeling, 471–477

Bayesian reasoning, 322–327

Bayes’ theorem, 293, 304, 309–312, 317, 320,

327, 331, 339, 471

Bias, 10, 43–45, 47, 55, 129, 196, 222, 229,

239, 260, 269, 341, 351, 404, 411,

459, 495

Bias-variance decomposition, 55

Binary tree algorithm, 436–437

Binning, 211–215

Boolean logic, 440–448

Box plot, 154, 159–161, 272

Branching, 133–142, 144, 145, 532

Branching factor, 128

C

C4.5, 79, 138, 587

Causality, 8, 11

Characteristic, 24–26, 34, 36, 38, 42, 45, 48,

52, 57, 65, 70, 77, 81, 95, 102, 113, 133,

144, 145, 170, 216, 243, 261, 270, 286,

293, 294, 298, 325, 327, 351, 371, 375,

383, 386, 396, 399, 405, 424, 425, 431,

460, 472, 478, 482, 492, 512, 513, 515,

531, 540, 546, 549, 550, 564, 590

Characterization, 45, 351, 363

Classification, 4, 35, 87, 127, 241, 308, 364,

455–498, 513, 544, 587

Classification and regression trees, 129–145

Classification rules, 66–69, 72, 127–129, 371,

464, 465

Cluster analysis, 18, 47–48, 266, 364, 365, 396,

397, 446, 447

Clustering, 4, 14, 25, 73–74, 76, 77, 81, 119,

168, 169, 177, 187, 194, 211, 215,

371–373, 383, 396, 397, 426–431, 436,

446, 448, 456, 458–460, 473, 476, 477,


481, 483, 485, 486, 498, 514, 532, 533,

535, 544, 546, 576, 577

Concept, 2, 33–82, 85, 132, 194, 236, 304, 364,

456, 502, 553, 585

Concept hierarchy, 38, 52–55, 205, 206,

215, 227

Conditional random field (CRF), 351–355

Confidence, 13, 26, 42, 70, 75, 114, 145–148,

176, 238, 250, 259, 260, 278–282, 293,

341, 372, 394, 492

Confidence intervals (CI), 259, 260, 278–282,

293

Contingency tables, 127, 277, 278

Correlation, 2, 3, 6, 8, 11, 18, 24, 33, 35, 44, 45,

55, 75, 145, 161, 169, 186, 194, 219,

221, 224, 238, 241, 242, 259–265, 283,

284, 289, 295, 296, 314, 321, 322, 329,

396, 460, 480, 486–488, 531, 563

Co-training, 373, 374, 377–381, 387

Coverage, 4, 42, 70, 145, 208, 209, 260, 372,

467, 511

Cross-validation, 132, 143–144, 382, 493, 494

D

Data architecture, 2, 4, 12, 13, 29, 82,

85–122, 193

Database for Annotation, Visualization, and

Integrated Discovery (DAVID), 208,

209, 297

Databases, 2–4, 11, 12, 14, 16, 24, 26, 29,

30, 34, 53, 85, 86, 94, 96, 98, 99,

103–105, 110, 111, 116–118, 148,

171, 176, 181, 197, 202–204,

206–208, 218, 222, 228, 266, 290,

294, 295, 297, 315, 327, 328, 349,

352, 378–380, 433, 447, 458,

502–504, 507–512, 520, 538, 545,

546, 553, 563–565, 567, 570, 576

Data curation, 111, 115, 504

Data format, 13, 22–23, 172, 198, 504

Data integration, 102–106, 504–508, 545

Data mart, 29, 110–112, 228

Data mining

ethics, 10–12

process, 15, 19, 24, 43

results, 23, 24, 125–189

Data modeling, 85–122, 203

Data plot, 154–158

Data preparation, 19, 21–23, 93, 194–195,

459–460

Data reduction, 194, 212, 214, 215, 224–227

Data warehouse, 3, 4, 17, 29, 37, 53,

110–113, 115–116, 126, 191, 192,

207, 228, 230, 508

DAVID. See Database for Annotation, Visualization, and Integrated Discovery (DAVID)

Decision tables, 125, 127–129

Decision trees, 23, 38, 44, 63–68, 79–81,

127–129, 215, 327, 332, 371, 448, 457,

463–466, 469, 588

Denormalization, 40, 201–206

Dependent entity, 87, 88

Diffusion map, 426–431

Dimensionality, 49–51, 74, 80, 114, 211, 217,

394, 426, 480, 481, 497

Discretization, 62, 193, 194, 212, 215, 227

Discrimination, 45

Distance, 48, 76–79, 154, 162, 198, 199, 216,

272, 340, 369, 382, 383, 385, 389–391,

416, 417, 422, 423, 426, 446, 447, 459,

470, 480, 484, 486–488, 498, 531–535,

537, 554

Dynamical systems, 202, 546–548,

567, 580

E

Enrichment analysis, 294–297

Entity, 46, 86–92, 95, 102–103, 105–109, 120,

180, 181, 184, 193, 194, 196, 218, 374,

379, 380, 569, 570, 576

Entrez, 99, 176, 208, 508–510, 566, 569

Entropy, 75, 134, 137–138, 140, 145, 238,

269, 270, 351, 467

Error, 10, 36, 86, 138, 211, 239, 333, 367, 461,

512, 546

Expectation-maximization (EM) algorithm, 119, 223, 346, 373–377, 382, 448, 587

External consistency, 195, 197, 206

Extract, 5, 17, 22, 91, 100, 120, 126, 132,

137, 204, 206–209, 217, 264, 363,

365, 378, 384, 466, 479, 505, 553,

567, 568, 576, 579

F

False discovery rate (FDR), 296–300

Feature selection, 51, 79–81, 364, 515

Feed-forward network, 400, 411–413, 477

Frequency polygon, 154

Functional networks, 404, 411, 568

Fuzzy logic, 394, 440–448


G

Gaussian mixture, 375–377

Gene expression analysis, 394–398, 430

Gene identifier cross-referencing,

208–209

Gene ontology (GO), 95–96, 116–118,

152, 176, 294, 509, 512, 544,

563–565, 567, 577, 578

Generalization, 4, 37, 43–45, 52, 81, 82,

87–89, 226–227, 390, 393, 423, 469,

493, 501, 565

Gene selection, 397, 514, 515

Genome-wide association studies (GWAS),

171, 172, 297, 580

Genomic data, 169, 170, 503, 563

Genotype visualization, 171–172

Gini, 75, 134–140, 142, 145, 238,

269–272

Graph theory, 130, 339, 554–563

H

Heat map, 166–170, 510

Heterogeneity, 75, 86, 104, 105, 238,

269–270, 363

Hidden Markov models (HMM), 49,

342–356, 358, 359, 374, 377, 387,

480, 538

Histogram, 153–154, 268, 277, 296,

298, 354

Hypothesis testing, 239–265, 300

I

Independent entity, 87, 88

Information gain, 134, 137–139

Input representation, 39–40

Instance, 10, 37, 87, 126, 193, 246, 304, 364,

457, 514, 544, 587

Integrating annotations, 505

Interestingness, 26–28

Internal consistency, 192, 195–197

Interpretive data, 93

K

Kernels, 50, 368, 383–386, 388, 391–394,

417, 426, 428, 431, 549

K-means, 49, 74, 375, 448, 477, 481–486,

498, 587

K-nearest neighbor (k-NN), 198, 446, 457,

470, 587

Kurtosis, 75, 238, 273–274

L

Learning algorithm, 49, 50, 73, 183, 382, 383,

385, 388, 391, 399, 405, 407, 408, 425,

448, 481, 588

Least squares, 74, 76, 286–290, 461

Linear model, 55, 74–75, 81, 130, 157, 161,

282, 286, 463, 549

Linear regression, 74, 212, 283–290, 460–463,

477, 488

M

Many-to-many relationship, 87, 88

Marginal probability, 309, 310, 471, 477

Maximum a posteriori (MAP) estimation, 235,

293–294, 374

Maximum likelihood estimation (MLE), 75,

76, 290–294, 374, 376

Measures of central tendency, 266–276

Metadata, 105, 215–217

MIAME. See Minimal information about a microarray experiment (MIAME)

Microarray databases, 563–565

Microarrays, 8, 9, 33, 121, 122, 171, 393,

395–397, 447, 457, 502, 505, 513–515,

537, 538, 544, 546, 548, 567

Minimal information about a microarray

experiment (MIAME), 505, 565

Minimum confidence threshold, 146

Minimum support threshold, 146, 148

Mining by navigation, 561–562

Missing data, 21, 40–41, 55, 58, 62, 82, 94,

104, 194, 199, 217–219, 222, 223,

228, 231, 232, 366, 367, 460, 474

Missing values, 22, 40, 41, 62, 66, 128, 129,

199, 217–223, 267, 374, 467

Model, 2, 34, 85, 126, 192, 236, 305, 364,

456, 501, 543, 585, 586

Modeling, 13, 19, 22, 23, 29, 51, 56–58, 66,

85–122, 126, 203, 216, 217, 230, 233,

237, 241, 282, 314, 322, 326, 347, 352,

355, 377, 438–440, 457, 461, 471–477,

502, 546–549, 552, 578, 580, 586

Monte Carlo strategies, 56–60

Motifs, 57, 94, 295, 297, 373, 387, 431–436,

480, 516–520, 528, 539, 577

Multiple sequence alignment, 99, 343

N

Naïve Bayes, 80, 304, 587

Naïve Bayes classifier, 183, 304, 309, 311, 312, 332–337


Nearest neighbor, 81, 382, 383, 469–470, 588

Neighbors, 76–79, 99, 188, 434, 470

Network analysis, 176, 513, 577

Network (graph)-based analysis, 431

Network motif, 431–436, 577

Network visualization, 172–175

Neural networks, 3, 46, 49, 241, 327, 332,

364, 370, 391, 392, 394, 396, 404,

411, 424, 436, 456, 457, 460, 474,

477–481, 492, 498, 549, 577, 588

Noisy data, 12, 211–215

Normal form (NF), 90, 91, 120, 201, 208

Normalization, 6, 49, 50, 90, 91, 121, 194, 196,

198, 200–201, 204, 406, 426, 510, 545

Null hypothesis, 75, 164, 238, 241–245, 250,

254, 256, 259, 263

O

Odds ratio, 312–315

One-to-many relationship, 87, 88, 203, 207

Ontologies, 95, 96, 105, 116, 152, 512, 564,

565, 567, 577

Operational data store (ODS), 110–113

Outlier, 4, 21, 22, 38, 48–49, 154, 163,

211, 239, 258, 268, 462, 488, 493,

510, 544, 576

Over-fitting, 12, 26, 27, 51–52, 81, 134, 142,

218, 290, 371, 393, 394, 458, 466,

469, 474, 476, 502

Over-training, 50, 368

P

Pairwise sequence alignment, 521–527

Parameter estimation, 340–341

Partial ordering, 53

Patterns, 2, 34, 102, 126, 194, 236, 324, 364,

457, 502, 543, 586

Perceptron, 391, 407–409, 411–412, 423,

424, 477

Petri nets (PNs), 437–440, 503, 549

Phylogenetic trees, 174, 188, 531, 537

Phylogenetic tree visualization, 177–178

Position map, 170–171

Posterior distribution, 304, 340

Prediction, 3, 4, 18, 27, 28, 34, 46–47, 52, 57,

127, 132, 144, 170, 240, 322, 335,

352–354, 356, 365, 372, 386–389,

395–397, 416, 422, 455–498, 501, 509,

514, 515, 543, 546, 553, 576

Principal component analysis, 51, 56, 80, 159,

169, 170, 194, 224, 225, 426, 477, 546

Prior distribution, 304, 324, 475

Prior probability, 305, 306, 308, 311, 312, 319, 324–326, 340, 471, 472, 477

Probabilistic Boolean networks (PBN), 548–550

Protein, 28, 33, 93, 172, 207, 290, 313, 386,

478, 502, 544

Protein identifier cross-referencing, 208–209

Proteomic data, 8, 503–504, 514

P-value, 200, 243–245, 255, 261–265, 296–299

Q

Qualitative and quantitative data, 90–93

R

1R, 16, 41, 60–63, 82, 237, 304, 465–469

Radial basis function (RBFs), 50, 393, 416, 417

Random walk, 426–431

Receiver operator characteristic (ROC), 386,

486, 492

Reducing (the) dimensionality, 80, 481

Regression, 27, 35, 74, 129–145, 156, 157, 161,

166, 183, 186, 211, 212, 216, 238, 239,

241, 259, 283–287, 289, 290, 364, 389,

393, 407, 447, 457, 460, 577, 587

Regression analysis, 27, 46, 141, 282–290,

456–458

Regular expressions, 480, 516–520, 589

Regulatory networks, 431, 547–549, 552,

576, 580

Reinforcement learning, 425

Relational data model, 85, 94

Relevance, 80–82, 216, 242, 300, 473

Relevance analysis, 47, 194, 459

S

Sampling error, 239

Scatterplot, 154, 157, 161–162, 283–289

Self-organizing map (SOM), 49, 418, 419,

422, 423

Semi-supervised learning (SSL), 373–383

Sequence data, 28, 34, 93–95, 207, 215, 319,

386, 430, 492, 503–505, 508, 509, 520,

521, 528, 531, 536, 589

Shortest common substring (SCS), 574, 575

Similarity, 14, 47, 48, 73, 76, 77, 94, 99, 173,

368–369, 381–388, 392, 393, 426, 428,

459, 464, 470, 478, 479, 486–488, 498,

510, 527, 539

Singular value decomposition (SVD), 51


Skewness, 75, 238, 258, 274

Spectral clustering, 426–431

Standardization, 6, 194, 198, 222, 512, 545

Star schema, 86, 109, 193

Statistical inference, 10, 237, 239–265,

269, 471

Statistical significance, 44, 242, 260, 300, 515

Structural risk minimization (SRM), 393–395

Subtype, 87–89, 354, 373, 397, 458, 509

Supertype, 88, 89

Supervised learning, 37, 216, 363–366, 368,

370–372, 389, 395, 399, 407–423, 425,

448, 473

Support, 2, 12, 13, 17, 19–21, 25, 29, 30, 38,

42, 70, 82, 86, 103, 110, 121, 126,

145–147, 162, 172, 175–177, 195, 196,

207, 228, 239, 251, 262, 268, 297, 320,

335, 351, 372, 378, 388, 390, 426, 429,

430, 434, 443, 444, 456, 471, 477, 479,

502, 508, 511, 512, 548, 552, 566

Support vector machines (SVMs), 46, 369,

384, 387–398, 456, 457, 469, 492, 587

Systems biology, 7, 97, 152, 153, 184, 498,

501, 502, 543–581

Systems biology graphical notation (SBGN),

183–185

Systems biology markup language (SBML),

176, 178–182, 186, 548

T

Temporal models, 544, 578

Text mining, 4, 97, 119, 120, 217, 377, 544,

546, 565–571, 575

Time dimension, 112–113, 230

Tools, 2, 56, 95, 126, 207, 236, 323, 374, 456,

504, 546, 586

Total ordering, 53, 54

Training, 23, 35, 91, 129, 260, 312, 364, 457,

502, 544

Training data, 35–37, 39, 43, 52, 282, 329,

333–335, 337, 345, 346, 353, 371,

373, 393, 413, 457, 467, 469, 480,

494, 502, 515

Transactional data model, 110–112

Transform and load, 17, 206–209

Transformation, 22, 82, 103, 104, 110, 126,

181, 191, 194, 198–200, 207, 274,

380, 412

Tree pruning, 133, 142

Twoing rule, 134, 137

U

Unsupervised learning, 37, 49, 50, 73,

363–366, 368, 372–373, 396, 399,

406, 407, 418, 423, 458, 459, 473,

477, 481, 544

V

Validation, 14, 23, 24, 29, 52, 69, 111,

132, 143, 216, 327, 371, 458, 460,

465, 468, 469, 478, 490, 494–496,

498, 534

Variation, 55, 75, 116, 170–172, 231, 238,

289, 296, 355, 396, 493, 509, 515,

528, 544, 580
