Graduado en Ingeniería Informática Universidad Politécnica de Madrid Escuela Técnica Superior de Ingenieros Informáticos TRABAJO FIN DE GRADO A Graph Mining technique for identifying individuals at risk of genetic diseases in pedigrees Autor: Luciano García Giordano Director: Sergio Paraíso Medina MADRID, JUNIO DE 2019

TRABAJO FIN DE GRADO A Graph Mining technique for ...oa.upm.es/55699/1/TFG_LUCIANO_GARCIA_GIORDANO.pdfAbstract Sincethe1970s,manystatistics-basedmodelsforperforminggeneticpre-diction


Graduado en Ingeniería Informática

Universidad Politécnica de Madrid

Escuela Técnica Superior de Ingenieros Informáticos

TRABAJO FIN DE GRADO

A Graph Mining technique for identifying individuals at risk of genetic diseases in pedigrees

Autor: Luciano García Giordano

Director: Sergio Paraíso Medina

MADRID, JUNIO DE 2019


CONTENTS

1 Introduction
    1.1 Genetic diseases and Genetic Risk Assessment
    1.2 Objective
    1.3 Work Plan
2 State of the art
    2.1 Pedigree diagrams
    2.2 Risk assessment of genetic diseases
    2.3 Pedigree diagram drawing systems
    2.4 Biomedical vocabularies
    2.5 Knowledge extraction from family data
    2.6 Likelihood estimation in pedigree data
    2.7 Numerical Computing and Machine Learning approaches for information extraction from graph-based data
3 State of development
    3.1 Phenomizer: Exploring the Symptoms-Disease relationship
    3.2 Previous Pedigree Drawing Systems
        3.2.1 Madeline 2.0
        3.2.2 CRA Health - Risk Assessment Software
    3.3 genoDraw
        3.3.1 Representation of family-related information
4 Methods
    4.1 Genotype probability distribution propagation
        4.1.1 Downwards propagation
        4.1.2 Upwards propagation
        4.1.3 Upwards constraint propagation
    4.2 Masked genotypes
    4.3 Expectation Maximization for genotype expectation propagation
    4.4 Mode of inheritance-specific factors
    4.5 Information contribution prediction
5 Implementation and Evaluation procedure
    5.1 Implementation
        5.1.1 Implementation of the genetic information propagation algorithm
        5.1.2 Implementation of the supportive elements
        5.1.3 User Interface
    5.2 Evaluation procedure
6 Results
    6.1 Atomic family cases
    6.2 More complex cases
7 Discussion
8 Conclusions and Future lines of work
9 References
A genoDraw: A tool to create pedigree diagrams based on biomedical terminologies and standards

LIST OF FIGURES

1 An early example of pedigree
2 Pedigree that complies with the 2008 update of the Standardized Human Pedigree Nomenclature but not with the 1995 version
3 A UMLS concept is a way to group many equivalent terms from different vocabularies. There can also be relations between two such concepts.
4 A visualization of the importance of pedigrees even in scenarios with latent data
5 Graphical User Interface of Phenomizer
6 Image output of Madeline 2.0
7 Risk Assessment Software of CRA Health
8 Interface of genoDraw
9 Interface of genoDraw
10 Comparison between an undirected hypergraph and a pedigree diagram that follows the Standardized Human Pedigree Nomenclature in its updated version [10].
11 Comparison between a complex pedigree diagram that follows the Standardized Human Pedigree Nomenclature in its updated version [10] and its underlying semantic network.
12 Example of phenotype-genotype distribution conversion with penetrance of 60% for being affected.
13 Example of small family for which the application of upwards constraint can improve the prediction.
14 Architecture of genoDraw.
15 Screenshot of the risk assessment mode of genoDraw. As we can see, individuals are annotated, after the prediction, with two possible distributions (methods A and B). On the right, the sidebar is shown, from which the settings for the prediction are set.
16 Screenshot of the risk assessment mode of genoDraw. Context menus can be used to assign and unassign individuals phenotypes. In this specific case, the user is on the brink of assigning individual D the status of carrier for the disease being analyzed.
17 Visualization of the progressive removal of information in case 1.
18 Initial situation of complex case 2.
19 Complex case 2 after the omission of some information.
20 Initial situation of complex case 3.
21 Complex case 3 after the omission that E is carrier.
22 Complex case 3 after the omission that K is carrier.
23 Complex case 3 after the omission that H is carrier.
24 Complex case 3 after the omission that O is affected.

LIST OF TABLES

1 Punnett square for two heterozygous parents
2 Profile tensor for monogenic biallelic autosomal diseases
3 Inheritance by descent tensor
4 Examples of executions for a monogenic biallelic autosomal recessive disease with full penetrance.
5 Examples of executions for a monogenic biallelic autosomal recessive disease with penetrance 60%.
6 Examples of executions for a monogenic biallelic autosomal dominant disease with full penetrance.
7 Examples of executions for a monogenic biallelic autosomal dominant disease with penetrance 60%.

Abstract

Since the 1970s, many statistics-based models for performing genetic prediction on individuals have been described, formalized and even implemented. However, their adoption in clinical practice has not been significant. Geneticists continue to use the traditional technique based on Punnett squares, and calculations are still mostly done by hand. With the current integration of genetic information into clinical practice, there is a need for tools that assist the exploitation of family-related data as part of genetic risk assessment. Such a tool would decrease the chance of errors in mathematical operations by geneticists, enable fast predictions and simulations, and could facilitate the visualization of the process, all of which tend to be cumbersome and tedious without the support of a computerized environment. In this work, I propose a technique intended to improve this situation by automatically providing predictions for the genotypes and phenotypes of individuals based on their inheritance through the use of Graph Mining techniques. In order to evaluate its results, I implement the method as a module for genoDraw, a Pedigree Drawing System currently under development at the Biomedical Informatics Group of the Technical University of Madrid in collaboration with the Genetics and Inheritance Research Group of the 12 de Octubre Hospital, Madrid. The results show that my technique produces adequate predictions and is capable of providing insight into the genetic dynamics of families, and may therefore be useful in future clinical practice.

Keywords: statistical genetics, genetic risk assessment, graph mining, human genetics

Resumen

Desde la década de 1970, se describieron, formalizaron e incluso implementaron muchos modelos basados en estadísticas para realizar predicciones genéticas en individuos. Sin embargo, su adopción en la práctica clínica no es significativa. Hoy en día, los genetistas continúan utilizando la técnica tradicional basada en las tablas de Punnett, y los cálculos todavía se realizan principalmente a mano. Con la integración actual de la información genética en la práctica clínica, hay una necesidad de herramientas para ayudar a la explotación de datos relacionados con la familia como parte de la evaluación del riesgo genético. Una herramienta con este propósito disminuiría la posibilidad de errores en las operaciones matemáticas de los genetistas, permitiría predicciones y simulaciones rápidas, y podría facilitar la visualización del proceso, que son todas operaciones que tienden a ser engorrosas y tediosas sin el soporte de un entorno informático. En este trabajo, propongo una técnica que pretende causar una mejora en dicho contexto al proporcionar automáticamente predicciones para los genotipos y fenotipos de los individuos en función de su herencia mediante el uso de técnicas de Graph Mining. Para evaluar sus resultados, implemento el método como un módulo para genoDraw, un Sistema de Dibujo de Pedigree actualmente en desarrollo en el Grupo de Informática Biomédica de la Universidad Politécnica de Madrid en colaboración con el Grupo de Investigación en Genética y Herencia del Hospital 12 de Octubre, Madrid. Los resultados muestran que mi técnica es adecuada en términos de predicciones y es capaz de ofrecer una visión de la dinámica genética de las familias, por lo que es de una utilidad prometedora para la práctica clínica futura.

Palabras clave: estadística genética, evaluación del riesgo genético, minería de grafos, genética humana


1 INTRODUCTION

In this section, I first briefly describe what genetic diseases are and how they are transmitted. Then, I comment on the use of pedigrees as a tool to enable Genetic Risk Assessment and on the computing tools that ease their representation. In this context, the objective of this work is then laid out: to devise a method capable of estimating the risk that individuals have a genetic disease, given mostly latent genetic information about their family and the specific mode of inheritance of the disease in question. Lastly, the work plan is summarized.


1.1 Genetic diseases and Genetic Risk Assessment

A genetic disease is a disease caused by abnormalities in the genome of a person. Every individual typically inherits half of their genome from their mother and the other half from their father in the form of chromosomes organized in pairs. Humans typically have 22 pairs of chromosomes plus two sex chromosomes, each of which can be of type X or Y. Gametes, the cells that combine with cells from the opposite sex to generate offspring, are ideally formed by one chromosome from each pair and one of the two sex chromosomes. The offspring thus have 23 chromosomes from the father and 23 chromosomes from the mother, adding up to 22 pairs and two sex chromosomes, as in each of their parents. New mutations, chromosomal crossovers, parental imprinting, and uniparental disomy are examples of factors that might cause this pattern not to hold exactly. Although these factors are too important to be dismissed, information about the family context remains of tremendous importance when analyzing the risk that a patient is affected by a certain disease.

A field of medicine that takes special advantage of this relationship among individuals is Precision Medicine, which targets the identification of the best approaches to treat or prevent a disease based on the patient’s characteristics. Since the patient’s family heavily influences whether they have a genetic disease or are able to transmit it to their offspring, being able to aggregate genetic information is highly relevant to precision medicine.

For this aggregation of information, pedigrees are a widely adopted graphical language used by medical specialists to collect information about the family of the patient. A pedigree can record several types of information, such as who the individuals of the family are and what characteristics they present, how and to whom they are related, and by which diseases each of them is affected. A Pedigree Drawing System (PDS) is a software tool developed especially to help medical practitioners collect, in a computerized environment, the information necessary for a diagnosis or analysis to be made. Ideally, a PDS facilitates the tasks required to collect, process and visualize family-related information, thus enabling a broad and encompassing analysis of a family and contributing to the advancement of precision medicine. Current PDSs range from little more than a canvas on which the symbols of a pedigree can be positioned manually to complex integrated environments that operate on structured family data, generate pedigrees automatically and link the information included with external resources, such as biomedical vocabularies for the annotation of diseases.
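To make the kinds of information held in a pedigree more concrete, the following Python sketch models individuals, their recorded parents and the conditions annotated on them. It is only an illustration: the class, field and function names are hypothetical and are not taken from genoDraw or from any other PDS.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Individual:
    """One person in a pedigree (hypothetical structure, for illustration only)."""
    ident: str
    sex: str                               # "F", "M" or "U" (unknown)
    mother: Optional[str] = None           # identifier of the mother, if recorded
    father: Optional[str] = None           # identifier of the father, if recorded
    conditions: List[str] = field(default_factory=list)  # annotated diseases


def children_of(pedigree: Dict[str, Individual], parent_id: str) -> List[str]:
    """Return the sorted identifiers of all recorded children of `parent_id`."""
    return sorted(
        person.ident
        for person in pedigree.values()
        if parent_id in (person.mother, person.father)
    )


def affected_by(pedigree: Dict[str, Individual], condition: str) -> List[str]:
    """Return the sorted identifiers of individuals annotated with `condition`."""
    return sorted(p.ident for p in pedigree.values() if condition in p.conditions)


# A small, made-up family: two parents (A, B) and two children (C, D).
family = {
    p.ident: p
    for p in [
        Individual("A", "F"),
        Individual("B", "M", conditions=["condition X"]),
        Individual("C", "F", mother="A", father="B"),
        Individual("D", "M", mother="A", father="B", conditions=["condition X"]),
    ]
}
```

With this representation, `children_of(family, "A")` lists the children of A and `affected_by(family, "condition X")` lists the individuals annotated with the made-up condition.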


1.2 Objective

In the context of Pedigree Drawing Systems, genoDraw is an in-house development of the Biomedical Informatics Group, created in collaboration with the Genetics and Inheritance Research Group of the 12 de Octubre Hospital, Madrid, and designed and implemented by me. It is an integrated environment that helps automate the creation, management, and visualization of pedigree diagrams. genoDraw presents some characteristics that are promising for medical specialists in the area of genetics [1], which make it a tool that can be expected to be adopted by medical practitioners in the near future.

With the major needs of a complex PDS addressed (e.g. automation and generation of pedigrees from structured data), the next step for augmenting its potential impact in the area of precision medicine is to include tools that facilitate the risk assessment of patients. In current clinical practice, calculations are performed by hand to reach conclusions such as that a certain person is at risk of being affected by a certain genetic disease [2]. These calculations are most of the time very complex (and thus prone to errors) and based on certainties. In many complex family scenarios, no exact calculations can be made, and only approximate probabilities of an individual being affected or not can be obtained. With simple techniques, risk assessment in these situations becomes tedious, very difficult, error-prone and even outright incorrect, especially since such techniques rely on statistical intuitions and calculations, which are especially susceptible to human error [3, 4].

In this work, I propose a new method for facilitating the prediction of risks related to genetic diseases of individuals based on (a) the underlying genotypes associated with the risks of having a genetic disease, (b) the propagation of information in the family graph, and (c) the integration with biomedical vocabularies for obtaining information about specific genetic diseases automatically. The method is based on the prediction of genotype distributions for individuals in two different ways. The combination of these two predictions shows not only a prediction for each individual, but also the uncertainty in that prediction. This technique will help prioritize certain key individuals in the family. Therefore, not only can genetics specialists dedicate fewer resources to performing error-prone mathematical operations, but they can also better decide on which individuals to perform further genetic analysis, potentially reducing diagnosis time and the resources needed.

In this work, I also implement the proposed method as a module for genoDraw and

analyze its results.

1.3 Work Plan

List of Objectives

• Study the existing data sources based on biomedical terminologies that can provide information on the inheritance mode of diseases caused by genetic factors. Define, from these data sources, a method for obtaining a link between a genetic trait and its mode(s) of inheritance.

• Establish adequate methods for modelling the propagation of information in family graphs, with the objective of estimating the risk that a certain person is affected by, a carrier of, or not affected by a certain genetic trait given the phenotypical information of other individuals to whom they are related, directly or indirectly, as well as of pointing out to the user the persons from whom obtaining information tends to be most advantageous.

• Implement the methods as a module in genoDraw, thus enabling healthcare professionals to observe such risks as an insight into the current risk condition of a family, while attending to important characteristics such as ease of use, adequacy to clinical practice and precision of the estimations given. Design, Implementation, Testing and Deployment are included as parts of this objective.

List of Tasks

1. Search for biomedical terminology-based solutions through which a link between genetic traits and their modes of inheritance can be established.

2. Study the literature on algorithmic prediction of genetic traits, searching for existing methods to predict the risk that an individual has a certain genetic disease given its mode of inheritance and the individual’s family graph.

3. Model the propagation of information in the family graph to define a graph-based algorithm for the prediction of genetic traits given phenotypical or genotypical information of related individuals.

4. Design an implementation of the method as a genoDraw module.

5. Evaluate the resulting module as part of genoDraw, focusing on ease of use, adequacy to the activity of healthcare professionals and accuracy of the predicted risks.

6. Deploy the module as part of genoDraw.

7. Write the final report.


Gantt diagram The obtained Gantt diagram is as follows:


2 STATE OF THE ART

This work presents a method for genotype and phenotype prediction based on the genetic information of related individuals. It is implemented as a module for genoDraw, a web-based platform for drawing and maintaining pedigree diagrams in a standards-compliant and integrated way. In this section, I describe the foundations on which my work is based.

First, the motivation behind this work is introduced through the concept and purpose of pedigree diagrams in clinical practice. Then, I include an overview of the task of assessing the risk that an individual is affected by or a carrier of a genetic disease.

The next subsection discusses the existing computing approaches to the problem of representing pedigree diagrams. Then, an overview of the concept and usefulness of biomedical vocabularies in the context of this work is included. Lastly, the current computing approaches to knowledge extraction and risk assessment on individuals in a computerized environment are discussed.


2.1 Pedigree diagrams

Pedigree diagrams are a graphical language for representing families by depicting each of the individuals of interest and the relations existing among them. The essence of pedigrees is to make it possible to distinguish the key characteristics observable in a small population in order to interpret a family. That is, pedigrees are a method of information visualization.

Historically, according to [5], the drawing of pedigrees as an attempt to represent a family dates back at least to the 19th century. Pedigrees have been an indispensable tool in the medical practice of genetics for more than a century [5]. An early pedigree is shown in Figure 1. It was drawn by the infamous Swiss geneticist and eugenicist Ernst Rüdin, one of the theorists and evangelists of social Darwinism. He was largely funded by Nazi Germany and advocated mass sterilization and outright killing of individuals as a mechanism to achieve a “better population” [6].

Although they were once used as a means of (incorrectly) justifying the practice of racial hygiene, pedigree diagrams have a much nobler use in emphasizing important relationships among individuals who suffer from certain conditions or are at risk of doing so. As pedigree drawing spread as a tool in clinical practice, the need for uniform ways of representing pedigrees that cover the necessary situations and represent precisely the complex relationships among individuals became increasingly evident. Unsurprisingly, standardization efforts arose from this need [7] and culminated in the Pedigree Standardization Task Force (PSTF) of the National Society of Genetic Counselors at the end of the last century. The Recommendations for the Standardized Pedigree Nomenclature [8], developed by the PSTF and initially published in 1995, establish uniform and unambiguous directives for the drawing of pedigrees.

More recently, the availability of alternative reproductive scenarios (e.g. ovum donations and surrogate gestations) created a need to represent more complex relationships among parents and children. In 2008, an update to the Standardized Human Pedigree Nomenclature was published [10]. This update introduces some extensions to the previous version [8], enabling most major reproductive scenarios to be clearly represented in pedigree diagrams. Figure 2 shows a situation that could not be represented in the previous version, drawn following the updated version.

Currently, pedigree diagrams are widely used as a tool to collect and interpret genetic characteristics in families. Medical practitioners, especially those within the genetics area, make use of pedigree diagrams in their day-to-day clinical activities. A rather comprehensive list of functions enabled or facilitated by the use of pedigrees during medical encounters and other activities is found in [2]; it ranges from “Making a medical diagnosis” and “Calculating disease risks” to seemingly unimportant but actually essential activities, such as “Educating the patient” and “Exploring the patient’s understanding”.

Figure 1: A Sippschaftstafel drawn by Ernst Rüdin in 1910 (from Mazumdar, 1992 [9] and Resta, 1993 [5]).

In conclusion, pedigree diagrams are a relevant tool for medical practitioners in many of their activities, from collecting information from the patient and their family to reaching a conclusion and helping the patient understand the risks and facts observed.

2.2 Risk assessment of genetic diseases

It is widely known that the phenotype of a person, that is, whether or not they present certain characteristics, is affected mainly by two factors: environment and genotype. Studying the environment in which a person lives is sometimes enough to explain their condition. For example, a mine worker is much more likely to develop lung diseases such as emphysema or pneumoconiosis than a control urban population [11]. In contrast, many other disorders are determined almost exclusively by the contents of the genome of the individual. For example, having a certain mutation in the APC gene is almost certain to cause the individual in question to develop familial adenomatous polyposis by the age of 40 [2].


Figure 2: This pedigree complies with the 2008 update of the Standardized Human Pedigree Nomenclature, but not with the 1995 version.

A reader unfamiliar with the field might assume that diseases with a clear genetic component are very simple to understand. However, not only do different diseases follow potentially different inheritance patterns, but some of them have almost unknown or unpredictable patterns. One example is Alzheimer’s disease. According to [12], not only are many genes with multiple possible mutations involved, but there is also more than one possible inheritance pattern for the same disease.

In terms of inheritance patterns often observed in genetic diseases, the most common

ones are the following [2]:

• Autosomal Dominant (AD): a mutation in a single copy of the gene is enough to cause the disorder.

• Autosomal Recessive (AR): the disorder is observed only if the gene is mutated at the same locus in both chromosomes of the pair.

• X-linked Recessive (XLR): women possess two X chromosomes, while men carry only one. In X-linked recessive disorders, a man who has a mutation in a certain gene of his X chromosome will be affected by the disorder, while a woman with one X containing the mutation will most likely present mild symptoms of the disease or none at all; both of her X chromosomes must carry the mutation for her to be affected by the disorder.

• X-linked Dominant (XLD): contrary to X-linked recessive disorders, in XLD disorders both women and men are affected if they carry a mutated gene in an X chromosome (only one of the two for women). However, due to lyonization, also called X-inactivation, women tend to present milder symptoms, while the effects in men are usually more severe and can be lethal.

• Y-linked (YL): men carry one Y chromosome, while women carry none. For this reason, only men can be affected by or carriers of YL diseases. Additionally, having the mutated gene in this chromosome is usually enough to show the phenotype corresponding to the related disease.

• Chromosomal inheritance (C): a chromosomal disease is one caused by the presence of a disruptive added segment or by the absence of an important segment of a chromosome, which can range from a few bases to a whole chromosome.

• Multifactorial and Polygenic (MP): some disorders are caused by alterations at many loci at once. Since their patterns are very difficult to observe in families, they are usually grouped into one category.

• Mitochondrial (M): the mitochondrion is an organelle present in the cells of most eukaryotic organisms that processes some compounds to generate, among others, ATP, a molecule that stores energy and is used as an energy source in many biological processes inside the cell. Given the nature of fertilization in humans (and many other species), mitochondria are inherited only from biological mother to child. Furthermore, mitochondria possess their own DNA, the mitochondrial DNA. For that reason, mutations in this DNA that cause defects in the mitochondria of a woman are passed on to her offspring. If the offspring is male, the defect is not transmitted to his children, but if the offspring is female, transmission to her children is assured.

Although the previous list is by no means complete (nor does it intend to be), we can easily observe that the inheritance patterns of genetic diseases can range from a simple “inherit one mutated gene and be affected” to patterns whose underlying mechanisms are still not well understood in the literature.

Additionally, sometimes an individual that has a genotype that corresponds to having the disease does not show the corresponding phenotype. This is known as incomplete penetrance. The penetrance is the probability that, having the necessary genomic conditions to have a disease, the individual actually presents the phenotype corresponding to it.
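As a minimal illustration of this definition, penetrance can be treated as the conditional probability of showing the phenotype given the genotype. The sketch below assumes a hypothetical autosomal dominant trait with a penetrance of 0.8; the function name, the genotype encoding and the numbers are illustrative assumptions only.

```python
# Illustrative only: the penetrance value and genotype encoding are assumptions.

def probability_affected(genotype: str, penetrance: float) -> float:
    """P(affected | genotype) for a hypothetical autosomal dominant trait.

    'A' is the disease-causing allele and 'a' the normal one, so any
    genotype containing 'A' is a disease genotype.
    """
    has_disease_genotype = "A" in genotype
    return penetrance if has_disease_genotype else 0.0


# With incomplete penetrance (0.8), 20% of the individuals that carry the
# disease genotype are expected not to show the phenotype.
for genotype in ("AA", "Aa", "aa"):
    print(genotype, probability_affected(genotype, penetrance=0.8))
```

A penetrance of 1.0 would correspond to complete penetrance, in which every individual with the disease genotype shows the phenotype.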

In clinical practice, genetics specialists must not only collect information obtained

from what is presented by the patient and their family, but also predict, given the specific

inheritance modes in which each of the diseases may be expressed, who are the carrier

individuals, who may be affected or come to be affected, and who are the individuals on

whom more specific genetic tests must be performed. According to [2], it is important not


only to register who are the affected individuals, but also to take note of the ones who are

unaffected.

Given the multiple scenarios and the complexity of the inheritance schemes for each

of them, a visualization method that helps the user to observe which of them is taking

place for a certain disease in a family is necessary. One of the methods widely used for

this task is pedigree diagrams [2].

2.3 Pedigree diagram drawing systems

A Pedigree Drawing System (PDS) is an informatics tool that enables a user to create,

manage and visualize pedigrees. Many PDSs exist currently. However, none of them

is widely adopted in clinical practice. Some characteristics identified in [1] can be key to making a PDS useful for genetics specialists. The first of them is the ability to interactively generate pedigree diagrams during medical encounters and to make dynamic changes to the pedigree. For that, the medical practitioner must be deeply familiar with the tool. In this respect, the easier the tool is for an expert user to master, the better. The second characteristic identified is the ability of the system to generate

pedigrees from structured data. This is important because a pedigree can be generated di-

rectly from the information of individuals and their relations, which can either be inserted

by the user or retrieved from a medical information system. The third characteristic is the

automation of the generation process. That is, the components that compose a pedigree

are automatically generated and placed on the screen, with minimal interaction from the

user. This helps the user focus on the interaction with the patient, rather than on the in-

teraction with the system. The fourth characteristic identified as very important for the

adoption of any PDS in daily clinical practice is the generation of standard-complying

pedigrees. In the case of this work, we consider only the Nomenclature presented by

the National Society of Genetic Counselors since it is the only reported Recommenda-

tion and it is of wide adoption. The fifth characteristic is that the information inserted

into the pedigree is meaningful as concepts of current standard biomedical vocabularies.

This could enable the integration of pedigrees with current medical information systems,

allowing the PDS to receive information following standard formats and using standard

concepts to refer to each entity and use it internally as a means of obtaining information or

making changes to a pedigree. Ideally, since a pedigree is a visualization technique, any

changes made to it (except those exclusively related to the placements of entities) should

be directly reflected on the medical history of the involved individuals.

Most of the currently available PDSs have similar characteristics. The most relevant

are Madeline 2.0 [13], My Family Health Portrait (MFHP) [14], Progeny [15], GenoPro

[16] and CRA Health [17]. Although Madeline 2.0 is a self-hosted server and MFHP is

an online tool, both are only useful for exporting pedigrees as images from a description.


Neither of them enables the user to make changes to the exported pedigrees, since they are images. Also, both receive the data to draw the pedigree as structured data and export a pedigree without any kind of interaction with the user. The whole pedigree is produced at once, with no prior possibility to visualize the insertion process or to make changes and corrections to a possibly badly positioned pedigree. In Madeline, the input is a text file with descriptions of the individuals. In MFHP, the user must enter all the data by hand using forms. The output of both is a picture of a pedigree generated with relative positioning rules.

Progeny is a commercial tool that includes a PDS. As a platform, it aims at the inclusion of every possible aspect related to the diagnosis of genetic diseases in one tool. However, as a PDS, it offers only a limited set of features. GenoPro is very similar to Progeny as a pedigree drawing system. First, they do not generate pedigree diagrams automatically from structured data. The user must insert the data by hand using menus and text input. Second, although the user can interact with the layout of the pedigree and make changes and corrections, the representation is easily violated. That is, a modification of the layout made by the user can render the representation wrong. As in the previous cases of Madeline and MFHP, Progeny and GenoPro comply partly with the previous version of the Pedigree Nomenclature.

From the presented selection of PDSs, CRA Health contains one of the most advanced

engines. Reported in [18], it is the first to allow automatic creation of diagrams directly

from structured data while considering usability and interactivity aspects. It makes use of

optimization techniques to position the nodes in a canvas, and the user can make changes

to the graph as desired. However, the data model is in itself limited. In Section 3.3.1, I

discuss these limitations to emphasize the necessity of a broader data model for the correct

representation of pedigrees.

2.4 Biomedical vocabularies

In the field of biomedical informatics, one important concept is that of a vocabulary. In

this context, as well as in other fields, a vocabulary is essentially a data structure containing concepts that can be referred to in a homogeneous manner. In computer vocabularies, the usual naming strategy is to assign each entry a code, together with a textual explanation or its usual names.

In terms of utility, vocabularies tend to be used as a way to establish a common reference to each of their entries. For example, multiple people might refer to the same idea using different words or expressions. When a vocabulary is used, the different ways to express the same idea can be unified in the same concept. That way, every reference made to the same object is done through the same code.

Vocabularies can range from simple lists of concepts and their meanings, also called


glossaries, to complex structures usually called ontologies. The subtypes of vocabularies differ in their expressive power about the field of knowledge they are intended to represent. A simple list with names for reference and a sufficient explanation for each concept is a glossary. Concepts can be subtypes of other concepts. A vocabulary that contains such hierarchical relationships is usually called a taxonomy. Additionally, non-hierarchical relations can be added. That might also include the addition of properties to concepts. When both are done, referring to this vocabulary as a thesaurus is adequate [19].

On the far end of the aforementioned spectrum are the ontologies. In addition to concepts, their meanings, relationships and properties, ontologies also contain vast information about the behavior of the concepts included. This behavior is expressed through the use of axioms, restrictions of many types and descriptions of the relations among concepts.

In biomedical informatics, a very important vocabulary is the Unified Medical Lan-

guage System (UMLS) [20]. UMLS is a metathesaurus, in the sense that it is a thesaurus

that contains many different vocabularies and unifies the common concepts present.

In UMLS, the information is contained in sections (easily translatable to database

tables). One of the most relevant of them for this work is the one that contains the concepts

themselves, MRCONSO. Since in the medical literature many expressions might refer to

the same concept, in MRCONSO, as well as in the vocabularies on which it is based, each

unit of information is a text entry of a concept from a vocabulary. One or more text entries

(strings) are grouped into a term. A term is a concept in the vocabulary of origin. One

or more terms are then grouped into concepts. That way, UMLS contains information

regarding which terms of the vocabularies are the same concept in a global sense: if a concept 1 from vocabulary A is attached to the same UMLS concept as a concept 2 from vocabulary B, we can say that they are equivalent, that is, that they refer to the same idea in medicine.

Another section of UMLS which is relevant for this work is MRREL. It contains the

relations among concepts. Each entry of this file is a link of a certain type between two

concepts. Although these relations are extracted from the vocabularies used, the relations

are always between two UMLS concepts. As depicted in Figure 3, Abetalipoproteinemia, an autosomal recessive disorder, can be referred to as “ABETALIPOPROTEINEMIA”, “BASSEN-KORNZWEIG SYNDROME” or “Abetalipoproteinaemia”. These

are some of the names used as STR (string) in OMIM for the term with code 200100 and

in SNOMED CT with code 8312300. The three terms are linked in the UMLS concept

code C0000744. This code has a relationship in terms of the type of inheritance provided

by OMIM with the UMLS code C0441748, which corresponds to being an autosomal re-

cessive disorder. In OMIM, the code MTHU000016 corresponds to this concept. In HPO

and in SNOMED CT, the codes HP:0000006 and 258211005 are used.
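The grouping of strings into terms, of terms into concepts, and the links between concepts can be sketched with a small data structure. The example below is only an illustration using the codes mentioned above: the class names, the assignment of each string to a source vocabulary, and the relation label "inheritance_type" are assumptions, and the sketch does not reproduce the actual column layout of MRCONSO or MRREL.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Term:
    source: str     # vocabulary of origin, e.g. "OMIM"
    code: str       # code of the term in that vocabulary
    strings: tuple  # text entries (STR) grouped under this term


@dataclass
class Concept:
    cui: str                                   # UMLS concept code
    terms: list = field(default_factory=list)  # terms unified by the concept


# Which string belongs to which source is assumed for illustration.
abetalipoproteinemia = Concept("C0000744", [
    Term("OMIM", "200100", ("ABETALIPOPROTEINEMIA", "BASSEN-KORNZWEIG SYNDROME")),
    Term("SNOMEDCT", "8312300", ("Abetalipoproteinaemia",)),
])
autosomal_recessive = Concept("C0441748", [
    Term("OMIM", "MTHU000016", ()),
    Term("HPO", "HP:0000006", ()),
    Term("SNOMEDCT", "258211005", ()),
])

# MRREL-like entries: (source concept, relation label, target concept).
# The label "inheritance_type" is a hypothetical name.
relations = [("C0000744", "inheritance_type", "C0441748")]


def equivalent(concept, source_a, code_a, source_b, code_b):
    """Two source terms are equivalent if the same UMLS concept holds both."""
    keys = {(term.source, term.code) for term in concept.terms}
    return (source_a, code_a) in keys and (source_b, code_b) in keys


print(equivalent(abetalipoproteinemia, "OMIM", "200100", "SNOMEDCT", "8312300"))
```

In this representation, the equivalence between terms of different vocabularies is simply membership in the same concept, while the type of inheritance is obtained by following a relation from one concept to another.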

Many vocabularies are integrated into UMLS. Some of them are SNOMED CT [21], the Online Mendelian Inheritance in Man (OMIM) [22], and the Human Phenotype Ontology (HPO) [23].

Figure 3: In UMLS, many different terms from different vocabularies can correspond to the same “idea”. If that is the case, one UMLS concept encompasses all of them. Meanwhile, UMLS concepts are interrelated. In the case of this figure, the concept C0441748 (autosomal recessive inheritance) is the type of inheritance of the concept C0000744 (a disorder called “Abetalipoproteinaemia”).

Although currently-existing biomedical vocabularies present many characteristics be-

yond the principles here discussed, what is mentioned in this section is sufficient for the

scope of this work.

2.5 Knowledge extraction from family data

The extraction of knowledge from family data is a field that might come to be of great help

to medical professionals, in contexts that can range from individual healthcare to public

health. As in many other types of data, the nature of family-related genetic information is

largely characterized by its uncertainty and by the fact that most individuals do not have

any of their genetic characteristics registered anywhere, and thus have to be considered as

latent data.

Nonetheless, the analysis of such data is enabled because of the connectivity among

individuals. Since the vast majority of humans are “created” from one father and one

mother via fertilization, each individual’s genetic information is deeply connected with

two individuals. We might not know anything about the three of them, but they are certain

to share some genetic characteristics that can reveal important features about the rest of

the individuals in their family (since they are carriers of chromosomes, which contain

genetic information). In Figure 4, for example, although no information is known for the individuals A, B, and C, they are an important part of the explanation of why individual F is affected by the disease represented (affected individuals contain a colored symbol), assuming an Autosomal Recessive disease. Since the individuals E and D are also affected, F most likely received the disease-causing allele from both parents. That is, C is at least a carrier of such allele, which was most likely inherited from A or B. As we can see, the mere information that C has a brother who is affected by the disease is an explanation for why her daughter is affected, even though we do not have any additional information about C’s parents.

Figure 4: The lack of information on A, B and C about being at least carriers of an autosomal recessive trait is important to understand why F is affected. However, the structure of the family makes this information evident.

According to [2], genetic data based on the information of the family is inevitably uncertain. Many factors that come into play are sometimes not registered or not even discovered by the genetic counselor. When no information about a certain disease is given for an individual, for example, they can be not affected, not well observed while being affected, or not yet affected. They might also never come to be affected while carrying at least one mutated copy of a disease-causing gene. Observations can thus be imprecise, out-of-date and even outright false.

Additionally, meiosis, the process that generates the genome of every individual, is perfectly random in a simplistic approach, but reveals itself to be much more complex when more of the factors that come into play are considered. Similarly, mutations cause changes in the genome and are also seemingly random processes. How a system that extracts information from family data considers the randomness of the generation of the genome is, therefore, a critical factor in its precision. However, there is a severe limitation on this precision, since many biological processes that influence the creation and expression of the genome, such as parental imprinting, are still poorly understood, as is some information regarding certain genetic diseases [2].

However, as discussed previously, the inheritance among individuals is not to be dis-

missed as an important source of information to help detect the root causes of most genetic

diseases and to help increase the precision of the diagnosis of patients.


2.6 Likelihood estimation in pedigree data

One objective of the extraction of information from family data is the prediction of the

affection statuses of individuals in a family. This is rendered useful in cases such as when

a rare genetic disease is observed in both families of a couple. If the objective is that their children neither carry nor are affected by such a disease, an analysis may be useful to determine whether they are at risk of giving birth to an affected child.

Currently, genetics specialists more often than not do the corresponding risk calcu-

lations by hand. As mentioned in the Introduction of this work, this process is prone to

errors and usually outright incorrect. Their calculations correspond to simple methods

that resemble the calculation of the Maximum Likelihood Estimation (MLE) of the geno-

types of individuals. Therefore, an adequate calculation of the MLE for the individuals in

a family should be of utility to specialists.

In past works, approximately from 1970 to 1995, this approach is extensively explored, reaching complex algorithms. In 1970, [24] proposes an algorithm to calculate likelihoods in pedigrees. To calculate a likelihood means to reach the statistical functions that describe the information contained in the pedigree. The likelihood formula proposed by that work is exact and is based on existing data. Thus, no extrapolation is made. Eight years later, [25] extends the calculation to complex pedigrees (those which do not assume statistical independence among parents of children). Furthermore, the authors also propose that individuals be represented as triples (σ(A), e(A), φ(A)). For an individual A, σ corresponds to their unique ouriotype, which is a set of fixed characteristics that includes their genotype; e is a set of personal characteristics, such as age, gender, and whether A smokes; and φ corresponds to A’s phenotype. The work also models the penetrance of a characteristic as the probability that an individual has a certain phenotype given their genotype. One aspect to notice is that such penetrance is calculated among all the possibilities of phenotypes and genotypes. The calculations presented in [25] are also exact, and are based on recursive calculations of transmission and penetrance.
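For reference, the ingredients mentioned above are commonly combined into a likelihood of the following general form. This is the usual textbook factorization, written with assumed notation ($g_i$ for the genotype of individual $i$, $\varphi_i$ for the phenotype, and $f(i)$ and $m(i)$ for the father and mother of $i$); it is not necessarily the exact formulation used in [24] or [25].

```latex
L = \sum_{g_1} \cdots \sum_{g_n}
    \prod_{i \in \text{founders}} P(g_i)
    \prod_{i \notin \text{founders}} P\left(g_i \mid g_{f(i)}, g_{m(i)}\right)
    \prod_{i=1}^{n} P\left(\varphi_i \mid g_i\right)
```

The first product is the prior over the genotypes of the founders, the second contains the transmission probabilities from parents to children, and the third links each phenotype to the underlying genotype, which is the penetrance in the sense defined in Section 2.2. The nested sums over all genotype assignments are what make the exact calculation expensive in large pedigrees.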

In 1979, the concept of Maximum Likelihood Estimation is brought to the analysis of pedigree data [26]. This is done via mixed models and gene counting. In gene counting, the probabilities that a child has a certain genotype are calculated directly from their parents, in an exact manner. A later work [27] extends the approaches of the then state of the art by proposing an algorithm for the exclusion of genotypes. Although the method is notably overconfident by nature, it introduces the possibility of omitting the whole analysis for some genotypes or phenotypes of people given their own status and their surrounding family.

During the 1980s, few advancements are made. Nonetheless, some unorthodox approaches are developed during the decade. From one perspective, the analysis of identity by descent is first approached [28]. Simply put, it is the highly detailed analysis of the segregation probabilities that occur during the meiosis process (gamete generation), with or without recombinations and uniparental disomies. These probabilities identify which individuals are likely to have inherited two copies of the exact same chromosome from both father and mother (that is, a direct measure of consanguinity). Another unconventional approach is the one presented by [29], in which image processing ideas are applied to the calculation of likelihoods in pedigrees.

From the 1990s onwards, less innovative statistical approaches are proposed. A project proposal from 1990 [30] establishes some basis for later works [31, 32, 33, 34, 35]. After a sequence of similar works, [33] formalizes a simple Monte Carlo estimation method. It uses the Expectation Maximization algorithm and a Gibbs sampler in order to reach a model that handles random and fixed components of heritable and nonheritable characteristics to calculate predictions for the Maximum Likelihood of genotypes. Random and fixed components, in this sense, refer to elements that come into play when deciding the composition of the genome of the individual in question. The fixed components are, for example, the factors that underlie the segregation of chromosomes during meiosis, while random components include mutations and chromosomal recombinations, as well as uniparental disomies.

More recently, despite parallel advancements in genome sequencing and other -omics fields, works such as [36] and [37] do not present any actual novelty in the field. In the

era of Big Data and almost unlimited computing power, possibilities such as machine

learning approaches have not been explored yet for genotype distribution prediction in

pedigree data.

2.7 Numerical Computing and Machine Learning approaches for information extraction from graph-based data

In this section, a brief collection of numerical computing and machine learning approaches

is presented. Such approaches are some of the most promising for the characteristics of

the data included in pedigrees, which is structured in a graph-like structure and contains

mostly categorical data, although continuous and discrete numerical data also exists.

Also regarding the structure of data, in the family graph, the relationship between a parent and a child exists in only one direction. No child is a parent of their parent, and this is guaranteed by temporal reasons. That means that a family can be understood as a directed acyclic graph (more specifically as a semantic network, as will be discussed in Section 3.3.1). Therefore, nodes of this graph can be considered as individuals that have their own genotypes. Directed edges are parenthood relationships (also interpretable as is child of). From every node, a maximum of two parenthood edges can originate.
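The two structural properties just described (at most two parenthood edges per node and the absence of cycles) can be verified with a short check. The sketch below stores the "is child of" edges in a dictionary; the individuals and the function name are hypothetical and serve only as an example.

```python
# Hypothetical family: each individual maps to the list of their parents,
# that is, to the targets of their "is child of" edges.
parents = {
    "A": [], "B": [], "E": [],
    "C": ["A", "B"],
    "D": ["A", "B"],
    "F": ["C", "E"],
}


def is_valid_pedigree(parents):
    """Check that every node has at most two parents and that no cycle exists."""
    if any(len(node_parents) > 2 for node_parents in parents.values()):
        return False

    state = {}  # node -> "visiting" while on the current path, "done" afterwards

    def visit(node):
        if state.get(node) == "done":
            return True
        if state.get(node) == "visiting":
            return False  # cycle: an individual would be their own ancestor
        state[node] = "visiting"
        ok = all(visit(parent) for parent in parents.get(node, []))
        state[node] = "done"
        return ok

    return all(visit(node) for node in parents)


print(is_valid_pedigree(parents))
```

A depth-first traversal is enough here because a cycle can only be found by reaching a node that is still on the current path of ancestors.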

Given this graph-like nature and the randomness that characterizes the meiosis and

fertilization from which each individual originates, some algorithms can be considered as


candidates for the extraction of information from a pedigree.

Expectation Maximization algorithm The first algorithm that can be considered is the

previously-mentioned and extensively explored EM algorithm. Since it is a statistical al-

gorithm, many formulations derived from its essence can be made. It is based on two

steps (hence its name): the first is the Expectation step, which is the calculation of expected values for latent data from the data we have, while the second is the Maximization step, in which the model parameters are re-estimated to maximize the expected likelihood, so as to better estimate the Expectation in the next iteration of the algorithm. Sometimes, however, the M step can be reduced to the reestablishment of some parameters, in the sense that no actual optimization is made.

The convergence in those cases is a consequence of navigating the parameters without

any driving factors. A characteristic feature of the EM algorithm is that it has a greedy

behavior, in which no alternative options for maximization are considered, only one so-

lution is calculated from the input, and the maximization is, in the best-case scenario,

monotonically upwards towards a local maximum.
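A classical, very small instance of the two steps is the estimation of the frequency of a recessive allele from phenotype counts, where the genotype of the unaffected individuals is the latent data. The sketch below assumes unrelated individuals, Hardy-Weinberg proportions and full penetrance; the function and parameter names are hypothetical, and it is not the algorithm of any of the cited works.

```python
def em_allele_frequency(n_affected, n_unaffected, q=0.5, tol=1e-12, max_iter=10_000):
    """Estimate the frequency q of a recessive allele 'a' from phenotype counts.

    Affected individuals are 'aa'. Unaffected individuals are either 'AA'
    or 'Aa', and which of the two is the latent information.
    """
    n = n_affected + n_unaffected
    for _ in range(max_iter):
        # E step: expected fraction of carriers ('Aa') among the unaffected,
        # under Hardy-Weinberg proportions: 2pq / (p^2 + 2pq) = 2q / (1 + q).
        carrier_fraction = 2 * q / (1 + q)
        expected_a_alleles = 2 * n_affected + n_unaffected * carrier_fraction
        # M step: the new q is the expected share of 'a' among all 2n alleles.
        q_new = expected_a_alleles / (2 * n)
        if abs(q_new - q) < tol:
            return q_new
        q = q_new
    return q


# One affected individual among one hundred.
print(em_allele_frequency(n_affected=1, n_unaffected=99))
```

In this simple case the iteration approaches the closed-form estimate, the square root of the proportion of affected individuals, which makes it easy to see that each pass only refines a single solution rather than exploring alternatives.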

Hidden Markov Models The second formulation can be that the problem is a Hidden

Markov Model (HMM). It is a simplification of other Markov process-based models. In an HMM, previous information is considered a causal factor of subsequent information. However, only one causal factor can be utilized for this calculation. Therefore, the

prediction is limited to a one-way propagation across many generations. Latent data can

have a tremendous impact on prediction and many lines of descent must be analyzed

before a prediction can be made.

Rule-based models A third formulation is to use rule-based models to make predictions

of the genotypes of individuals given their parents and their children. A rule-based model is one that either searches for an explanation for a phenomenon or extrapolates

a given scenario in search of a prediction for an unknown scenario. For the searches,

rule-based models possess sets of rules that are activated according to some situations.

A rule-based model could, for example, activate a rule that concludes that the individual

A from Figure 4 does not have a genotype that only contains the dominant allele for the

disease in question because D, his son, is affected by it.

However, once again, the existence of mostly latent data in pedigrees renders the widespread use of this kind of model impractical. Pure rule-based approaches can only define constraints on the data, and constraints alone are not enough to enable prediction on pedigrees to work.
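The limitation can be seen in a minimal sketch of two such rules for an autosomal recessive disease. The family structure is the one assumed for Figure 4 (A and B are the parents of C and D; C and E are the parents of F), and the rules assume full penetrance and no new mutations; all names are hypothetical.

```python
GENOTYPES = {"AA", "Aa", "aa"}  # 'a' is the recessive disease allele

# Assumed structure of the family: individual -> list of parents.
parents = {
    "A": [], "B": [], "E": [],
    "C": ["A", "B"],
    "D": ["A", "B"],
    "F": ["C", "E"],
}


def apply_rules(parents, affected):
    """Return the set of genotypes that remain possible for each individual."""
    possible = {person: set(GENOTYPES) for person in parents}
    # Rule 1: an affected individual must have the genotype 'aa'.
    for person in affected:
        possible[person] = {"aa"}
    # Rule 2: each parent of an 'aa' child carries at least one 'a' allele.
    for child, child_parents in parents.items():
        if possible[child] == {"aa"}:
            for parent in child_parents:
                possible[parent].discard("AA")
    return possible


possible = apply_rules(parents, affected={"D", "E", "F"})
for person in sorted(possible):
    print(person, sorted(possible[person]))
```

The rules exclude the genotype "AA" for A, B and C, but they leave two options open for each of them and give no indication of which is more likely, which is why constraints by themselves do not amount to a prediction.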

Graphical models Graphical models are models devised under the formalism of a structured probabilistic model, such as HMMs and Markov chain-based methods. These models can


facilitate the prediction with latent data when the situation to be predicted can be mod-

eled as a graph. As listed in [38], two groups of models can be identified: the directed

models, also known as Bayesian networks or belief networks, and the undirected models,

also known as Markov random fields or Markov networks.

In the case of Bayesian networks, each node of the graph is a variable in the model,

while a directed edge means that the variable at the destination of the edge is conditioned

on the variable at its origin. Currently, BNs with binary variables are being used in fields in which a graph-like network of relationships among variables can be obtained.

From another perspective, undirected models do not consider the edges of the graph

as one-way conditionals. Instead, each edge has the meaning of interdependence between

two variables. One practical algorithm derived from this formalization is the Boltzmann

Machine. A Boltzmann Machine is characterized by an undirected graph whose nodes are

binary variables. It is trained as an unsupervised algorithm and can be used for supervised,

unsupervised and generative tasks. Both groups of models can be used interchangeably

[38].
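In standard notation, the two families can be contrasted as follows. The symbols are assumptions introduced here and are not taken from [38]: $\mathrm{pa}(x_i)$ denotes the parents of node $x_i$ in the directed graph, $w_{ij}$ the symmetric weight of the undirected edge between nodes $i$ and $j$, $b_i$ a bias term, and $Z$ the normalization constant.

```latex
% Directed model (Bayesian network): one conditional distribution per node
P(x_1, \dots, x_n) = \prod_{i=1}^{n} P\left(x_i \mid \mathrm{pa}(x_i)\right)

% Undirected model (Boltzmann Machine over binary variables): energy-based
P(x) = \frac{e^{-E(x)}}{Z}, \qquad
E(x) = -\sum_{i<j} w_{ij}\, x_i x_j - \sum_{i} b_i x_i
```

In the directed case each edge contributes to a conditional distribution with a fixed direction, whereas in the undirected case each edge contributes a symmetric term to the energy, which is the formal counterpart of the interdependence between two variables described above.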

However, although graphical models have the potential to be extremely useful in the

analysis of pedigrees, current implementations lack the support for categorical data types,

which are essential for this work.

Graph embedding-based models One of the obstacles posed by information organized

in graph structures is the interpretation of the connections themselves. Traditional and

modern machine learning tools are capable of learning from information that can be organized in tensors. As is widely known, the connectivity matrices of graphs can also be expressed as tensors. However, graphs tend to be of an arbitrary number of nodes and

these connectivity matrices are usually not only of variable size, but they also tend to be

very sparse, since the nodes in these graphs are usually very loosely connected (a node

connects to very few other nodes). In the specific case of pedigrees, each person is only

connected to their parents and children.
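The sparsity can be made concrete with a small example. The sketch below builds the connectivity matrix of a hypothetical six-person family (the same structure assumed for Figure 4) and measures the fraction of non-zero entries; the function names are illustrative.

```python
# Hypothetical family: individual -> list of parents.
parents = {
    "A": [], "B": [], "E": [],
    "C": ["A", "B"],
    "D": ["A", "B"],
    "F": ["C", "E"],
}


def adjacency_matrix(parents):
    """Dense 0/1 matrix in which entry [i][j] is 1 if j is a parent of i."""
    people = sorted(parents)
    index = {person: i for i, person in enumerate(people)}
    matrix = [[0] * len(people) for _ in people]
    for child, child_parents in parents.items():
        for parent in child_parents:
            matrix[index[child]][index[parent]] = 1
    return people, matrix


def density(matrix):
    """Fraction of entries of the matrix that are non-zero."""
    cells = len(matrix) ** 2
    return sum(map(sum, matrix)) / cells if cells else 0.0


people, matrix = adjacency_matrix(parents)
for person, row in zip(people, matrix):
    print(person, row)
print("density:", density(matrix))
```

Since each individual has at most two parents, a family of n individuals has at most 2n non-zero entries out of n² cells, so the matrix becomes sparser as the pedigree grows, and its size changes from one family to another.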

Largely used in social sciences and bioinformatics research in part due to the char-

acteristics of the underlying graphs, graph embedding techniques are ways of not only

standardizing the size of the feature vector of each node but also providing more information at once about its vicinity. That way, traditional machine learning techniques can

be used to make predictions on nodes, subgraphs and even on the whole graph [39].

In the case of genetic prediction, the predictions can be about an individual given

their family’s pedigree. To my knowledge, the technique of graph embedding has never been used for making predictions about individuals given their family. The reason might be that, given the current state of graph embedding techniques, gargantuan amounts of currently poorly collected data would be needed to train an algorithm capable of “summarizing” the context of a node from the genetic information of its vicinity (parents and


children).

In the graph embedding literature, there are various techniques that enable the em-

bedding of graphs. The two which are currently most relevant are graph2vec [40] and

node2vec [41].


3 STATE OF DEVELOPMENT

This work is related to other research and commercial efforts that culminated in tools that

are currently adopted by medical specialists.

One such tool is Phenomizer [42]. Although the existence of a relationship between genotype and phenotype is widely known in the field of genetics, another relationship of interest in the area is the one between diseases and their corresponding symptoms (phenotypical expression).

Another class of tools that precedes this work and is relevant to what is presented here is that of Pedigree Drawing Systems. As commented in the State of the Art of this work (Section 2), there are simpler and more complex Pedigree Drawing Systems. An example of a simple PDS is Madeline 2.0. A more complex example is the Risk Assessment module of CRA Health. Furthermore, an especially relevant tool for this work is genoDraw, into which the module implemented in this work is integrated.

In the next subsections, the most relevant platforms that either influence or precede

this work are introduced and their main characteristics are outlined.


3.1 Phenomizer: Exploring the Symptoms-Disease relationship

As commented previously, some relationships are of special interest in Risk Assessment. One example is the relationship between symptoms and diseases.

Diseases express themselves through symptoms, which are typically used in their di-

agnosis. However, many diseases share similar symptoms.

In this context, Phenomizer is a relevant tool. It estimates the likelihood that a person who presents some specific symptoms (phenotype) has each of the possible diseases, assigning a score to each of the associations. The highest-scoring diseases are given to the user as highly likely possibilities. The exploration follows a Semantic Similarity search, through which the semantic network that expresses the ontology (biomedical vocabulary) is explored in a top-down fashion, searching for similarities between the disease in question and the symptoms selected by the user.

Currently, Phenomizer is freely available at http://compbio.charite.de/phenomizer/. In its implementation, the user selects symptoms, which are terms from the HPO [23] vocabulary, from a list, and also selects the mode of inheritance with which the disease manifests itself in the family. From these two inputs, a list of possible genetic diseases from OMIM [22] is obtained. The graphical interface of this software can be seen in Figure 5.

Figure 5: Graphical User Interface of Phenomizer. Source: http://compbio.charite.de/phenomizer/


3.2 Previous Pedigree Drawing Systems

3.2.1 Madeline 2.0

Madeline 2.0 is an open-source Pedigree Drawing Engine first reported in 2007 [13]. It is a web service that generates pedigree diagrams from family descriptions automatically, without user assistance. In terms of characteristics, Madeline 2.0 produces pedigrees in compliance with the Standardized Human Pedigree Nomenclature from 1995.

From the perspective of the characteristics a PDS can offer, however, Madeline 2.0 fits best in the simple PDS category. Although the creation of pedigrees is automatic, the user is not able to adapt the result to their liking: a pedigree, once generated, is a static image that cannot be personalized. Furthermore, the user is not able to select specific diseases for the affected individuals in the family, let alone register such diseases as terms from biomedical vocabularies.

Madeline 2.0 also relies on a data model which is far from ideal. This data model is based on a list of individuals to whom characteristics are given. The data below corresponds to a simple example that can be found at https://madeline.med.umich.edu/madeline/testdata/input/si_001.data. As we can see, the first element is the sequence of variables used in the block below (the columns). Then, each row corresponds to an individual.

FamilyId  IndividualId  Gender  Father  Mother  Deceased  Proband  DOB         MZTwin  DZTwin  Sampled  Affected
si_001    S00102        M       S00100  S00101  .         .        1937.07     .       .       Y        U
si_001    S00103        F       S00100  S00101  .         Y        1939.02.25  .       .       Y        A
si_001    S00104        F       S00100  S00101  .         .        1942.04.15  .       .       Y        U
si_001    S00105        F       S00100  S00101  .         .        1943        .       .       Y        A
si_001    U00106        .       S00100  S00101  .         .        1944.06.07  .       .       .        U
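To make the "list of individuals" data model concrete, the rows above can be read into one record per individual. The following is a minimal sketch and not part of Madeline 2.0: the function name parse_rows is hypothetical, and treating "." as a missing value is an assumption made for illustration.

```python
# Hypothetical helper (not Madeline's own code): read the whitespace-separated
# rows shown above into one dictionary per individual.
COLUMNS = ["FamilyId", "IndividualId", "Gender", "Father", "Mother", "Deceased",
           "Proband", "DOB", "MZTwin", "DZTwin", "Sampled", "Affected"]

ROWS = """\
si_001 S00102 M S00100 S00101 . . 1937.07 . . Y U
si_001 S00103 F S00100 S00101 . Y 1939.02.25 . . Y A
si_001 S00104 F S00100 S00101 . . 1942.04.15 . . Y U
si_001 S00105 F S00100 S00101 . . 1943 . . Y A
si_001 U00106 . S00100 S00101 . . 1944.06.07 . . . U
"""


def parse_rows(text, columns=COLUMNS):
    """Return one dict per individual; '.' is assumed to mean 'missing'."""
    individuals = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip empty lines
        if len(fields) != len(columns):
            raise ValueError(f"expected {len(columns)} fields, got {len(fields)}")
        individuals.append(
            {name: (None if value == "." else value)
             for name, value in zip(columns, fields)}
        )
    return individuals


individuals = parse_rows(ROWS)
affected = [p["IndividualId"] for p in individuals if p["Affected"] == "A"]
print(affected)  # ['S00103', 'S00105']
```

Note that every fact about an individual is a flat column value: there is no place in this layout for attributes of a relationship or of a gestation.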


The previous example, when processed with Madeline 2.0, outputs the following im-

age, in SVG format:

Figure 6: Image output of Madeline 2.0. Source: https://madeline.med.umich.edu/madeline/testdata/

In terms of limitations, the data model used in Madeline 2.0 cannot express any state of the relationships among individuals. The annotation of diseases is limited to expressing whether an individual is affected, not affected, a carrier or unknown; the disease itself is not recorded. Moreover, gestations contain important information that is completely ignored in Madeline 2.0.

In conclusion, Madeline 2.0 is too complex to be used by genetics specialists during medical encounters and does not allow users to insert and remove data dynamically.

3.2.2 CRA Health - Risk Assessment Software

CRA Health (https://www.crahealth.com/) is a company that develops a risk assessment software suite. CRA Health includes a plugin for pedigree drawing first reported in 2011 [18]. This plugin is better described as a complex PDS, due to the characteristics it presents. First, it is an interactive and automatic system that allows the user to insert individuals in the diagram by using high-level directives, without having to manually place each entity. Second, the generation is done from structured data.

The plugin also complies partly with the 1995 Standardized Pedigree Nomenclature,

due to its data structure (as will be discussed later). Diseases are also not registered as

associated with terms from medical vocabularies.

In terms of usability, however, the plugin is agile to operate and thus seems adequate for clinical practice. A screenshot of its use, extracted from the webpage of the company, is shown in Figure 7.


Figure 7: Risk Assessment Software of CRA Health. Source: https://www.crahealth.com/screensamples

3.3 genoDraw

genoDraw is an in-house development of the Biomedical Informatics Group of the Tech-

nical University of Madrid. It is strongly based on the characteristics identified as promis-

ingly useful for medical specialists in the area of genetics [1]. Figures 8 and 9 are screen-

shots of genoDraw.

In terms of compliance with standard visual nomenclatures for the drawing of indi-

viduals and their relations, the 2008 Standardized Human Pedigree Nomenclature of the

National Society of Genetic Counselors is fully implemented.

Furthermore, genoDraw works with a data model that is comprehensive enough to

represent a wide spectrum of possibilities which can be observed in society. It also gen-

erates pedigree diagrams directly from such data model automatically and supports the

interactive manipulation of entities, either to reposition them or to make changes to the

underlying data.

Another important characteristic of genoDraw is the capability to handle diseases as

terms from biomedical vocabularies. This not only enables genoDraw to offer as wide

a selection as possible, it also enables a common reference to disorders. Additionally,

future exploitation of pedigree data is facilitated.


Figure 8: Interface of genoDraw. The user is editing a simple pedigree using the normal

mode, intended for visualizations and minor structural changes. The individual F is a

male adopted by A and B. In the sidebar, as we can see, changes to individual F can be

made.

In the next subsection, an aspect of utmost importance to the understanding of how

pedigrees and their data models differ is explained. This is an issue of special relevance

both for the drawing process, as described in [43], and for the exploitation of such data, be

it for the propagation of genetic information as in this work, or for gathering other types

of information.

3.3.1 Representation of family-related information

From a mathematical perspective, a (modern) pedigree is a visual representation of an

undirected hypergraph. A hypergraph is a generalization of a graph. A graph can be

defined as a tuple G = (V,E), in which V is a set of vertices v and E is a set of edges

defined between two vertices. When we refer to directed graphs, we can consider E as

E ⊂ V × V . Similarly, undirected graphs are graphs in which E is composed of the

unordered pairs of elements of V . Meanwhile, a hypergraph is a graph whose edges

are not defined between two nodes, but among an arbitrary number of vertices. The

unordered pairs of vertices that define edges in graphs are now sets of vertices (which can


Figure 9: Interface of genoDraw. The user is editing a simple pedigree using the editing

mode, intended for major structural changes and overall pedigree creation. Its interface is

optimized for fast interaction.

also be thought of as unordered lists of vertices). The definition of an undirected hypergraph considered in this work is as follows [44]:

Definition 3.1 (Undirected hypergraph). An undirected hypergraph is a pair $H = (V, E)$, where $V = \{v_1, v_2, \ldots, v_n\}$ is the set of vertices and $E = \{E_1, E_2, \ldots, E_m\}$ is the set of hyperedges. A hyperedge $E_k$ is a set of vertices with unrestricted cardinality.
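As a small illustration of Definition 3.1, a hypergraph can be stored as a set of vertices plus a set of vertex sets. This is only a sketch: the function names and the vertex labels v1 to v5 are made up for the example and do not come from genoDraw.

```python
def make_hypergraph(vertices, hyperedges):
    """Build H = (V, E): V is a set of vertices, E a set of hyperedges.

    Each hyperedge is a frozenset of vertices of any cardinality.
    """
    V = frozenset(vertices)
    E = frozenset(frozenset(edge) for edge in hyperedges)
    for edge in E:
        if not edge <= V:
            raise ValueError(f"hyperedge {set(edge)} uses unknown vertices")
    return V, E


def is_ordinary_graph(hypergraph):
    """An undirected graph is the special case where every edge joins two vertices."""
    _, E = hypergraph
    return all(len(edge) == 2 for edge in E)


def neighbours(hypergraph, vertex):
    """All vertices that share at least one hyperedge with `vertex`."""
    _, E = hypergraph
    result = set()
    for edge in E:
        if vertex in edge:
            result |= edge
    result.discard(vertex)
    return result


# One hyperedge connects four vertices at once, another connects two.
H = make_hypergraph(
    ["v1", "v2", "v3", "v4", "v5"],
    [{"v1", "v2", "v3", "v4"}, {"v4", "v5"}],
)
```

Here neighbours(H, "v4") contains v1, v2, v3 and v5, which a single ordinary edge could not express.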

As an illustration of the previous definition, Figure 10a depicts an undirected hypergraph. As we can see, not constraining an edge to be between two nodes allows us to connect multiple nodes at once. When compared to the pedigree diagram shown in Figure 10b, we can see that a pedigree is an undirected hypergraph in which nodes are drawn according to their characteristics and edges are connections among several nodes, usually drawn as compositions of straight lines. Additionally, positions are defined for each node, so that, for example, parents are always drawn above their children.

However, the relations in families are best represented as a semantic network. A se-

mantic network is a directed graph whose vertices and edges are assigned a type. Usually,

they are used to describe ontologies, but as mathematical instruments, their usage must


(a) An undirected hypergraph (b) A simple pedigree diagram

Figure 10: Comparison between an undirected hypergraph and a pedigree diagram that

follows the Standardized Human Pedigree Nomenclature in its updated version [10].

not be restricted to ontologies. In this work, I define a family graph as a semantic network with node types Individual, Gestation and Relationship. The possible types of directed edges are biological father, biological mother, birth mother, gestation, partner and non-biological parent. I extracted these types through a formalization performed directly on the Standardized Human Pedigree Nomenclature [10].

Given that families are composed solely of persons, and that the relations among individuals can be of multiple types, a first approach to the problem could suggest that a directed graph with labeled edges is enough to model a family. However, relationships have characteristics of their own, such as dates of engagement and divorce, which rules out treating a relationship as an ordinary relation modeled as two "is partner of" edges. Instead, an entity that carries information on its own is required. Gestations also carry information on their own. Initially, one may conclude that a gestation is part of the individual. However, the existence of multiple gestations requires a common entity shared by two or more twins and their parents. Therefore, given that both entities and edges must be of certain types, a plain directed graph is too restrictive for modeling the information required. For this task, a semantic network is ideal.

Considering the formal aspects of pedigree diagrams is of utmost importance to understand the limitations of most past software dedicated to the creation and management of such graphical representations. (Kelleher, 2011) [18], for example, defines multiple layers of directed graphs to represent families. However, many limitations arise from this decision. One of them is that the chosen set of layers and the content of each of them, as a whole, limit the information that can be inserted. Although the model proposed by [18] complies partially with the first version of the Standardized Human Pedigree Nomenclature, the set of layers imposes that each child has two parents and that these two parents have a relationship. This is due to one of the layers being the "couples' graph": individuals have their parents defined directly through the couples' graph, which means that two parents must form a couple for their children to be represented. This is not only a violation of the old Nomenclature from 1995, which allowed parents not to be married, but also a factor that blocks the


compliance of the proposed system with the Updated Nomenclature, because modeling an individual as a child of more than two parents at once is impossible, and this is exactly what happens with ovum donations, for example.

In an ovum donation, the father provides the sperm, one mother provides an ovum, while the other mother gestates the child (in this work, I refer to them as biological father, biological mother and birth mother, respectively). That way, three people must somehow be annotated as parents of the child. In this case, failing to indicate that the mother who gestates the child is a parent is a loss of important information, since her gestation of the child may have had an effect on the fetus, for example by transmitting teratogens.

For a more flexible formal representation of a family to be possible, we must be able

to: (a) represent vertices as entities of multiple types, given that individuals have different

connectivity patterns and a different semantic meaning than gestations or relationships,

and (b) represent edges as directed and labeled links between two entities. For this sce-

nario, one of the most adequate structures is a semantic network.

A semantic network is a graph-like structure based on the existence of entities (vertices) and links among them (directed edges). Both entities and links are of a certain type, which carries information about the kind of entity or connection we are dealing with. Semantic networks are widely used in the field of the semantic web, given both their modeling power and their simplicity (i.e. any information to be captured in the network is expressed as a triplet that carries a source entity, a relation, and a destination entity).
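The triplet idea can be sketched in a few lines. The node and edge type names below are the ones listed for the family graph in this work; everything else is an assumption made for illustration: the class FamilyGraph, the identifiers, and in particular the edge directions (parent to gestation, gestation to child), which are not taken from genoDraw.

```python
NODE_TYPES = {"Individual", "Gestation", "Relationship"}
EDGE_TYPES = {"biological father", "biological mother", "birth mother",
              "gestation", "partner", "non-biological parent"}
# Roles that, in this sketch, link a parent to a Gestation node.
GESTATION_ROLES = {"biological father", "biological mother", "birth mother"}


class FamilyGraph:
    """Typed nodes plus (source, relation, destination) triples."""

    def __init__(self):
        self.node_type = {}   # node id -> node type
        self.triples = set()  # {(source, relation, destination)}

    def add_node(self, node_id, node_type):
        if node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {node_type}")
        self.node_type[node_id] = node_type

    def add_edge(self, source, relation, destination):
        if relation not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {relation}")
        if source not in self.node_type or destination not in self.node_type:
            raise KeyError("both endpoints must be added as nodes first")
        self.triples.add((source, relation, destination))

    def parents_of(self, child):
        """Map each parental role to the individual who fulfils it for `child`."""
        gestations = {s for s, r, d in self.triples
                      if r == "gestation" and d == child}
        return {r: s for s, r, d in self.triples
                if d in gestations and r in GESTATION_ROLES}


# Ovum donation: three people take part in a single gestation.
family = FamilyGraph()
for person in ("father", "donor", "carrier", "child"):
    family.add_node(person, "Individual")
family.add_node("g1", "Gestation")
family.add_node("r1", "Relationship")
family.add_edge("father", "partner", "r1")
family.add_edge("carrier", "partner", "r1")
family.add_edge("father", "biological father", "g1")
family.add_edge("donor", "biological mother", "g1")
family.add_edge("carrier", "birth mother", "g1")
family.add_edge("g1", "gestation", "child")
```

Because the gestation is a node of its own, the three parental roles coexist without requiring the donor to be in a relationship with anyone.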

Thus, while the representation of a family as a pedigree is an undirected hypergraph, the interpersonal data underlying it is best described as a semantic network. In the case of genoDraw, the semantic network is composed of nodes of classes Individual, Gestation and Relationship, while the relations among such entities are of types biological father, biological mother, birth mother, gestation, partner and non-biological parent.

As an illustrative example, Figure 11 depicts, at the top, a correct pedigree with multiple less-common relations among the individuals, as well as common relations. At the bottom, the underlying semantic network is shown.


(a) A correct pedigree diagram

(b) Its underlying semantic network

Figure 11: Comparison between a complex pedigree diagram that follows the Standard-

ized Human Pedigree Nomenclature in its updated version [10] and its underlying seman-

tic network.


4 METHODS

In this section, the elements that comprise the information propagation in the family graph, as well as the means by which this information is treated, are laid out.

First, genotypes are described as distributions over the possibilities for each individual. Then, three techniques for genetic information propagation in the family graph are presented. After that, a brief comment is made on the technique used to hold known information in the genotype distribution of individuals, the masking of genotype distributions. Next, the proposed structuring of the techniques in an Expectation Maximization algorithm is described, as well as how the approach presented can be adapted to various modes of inheritance. Lastly, the strategy used to estimate the contribution that the information of each individual can provide to the genetic information of the whole family is proposed.


In this work, I propose a method that aims to identify which individuals can contribute the most information to the genetic information of the whole family. For this purpose, phenotype and genotype distributions must be estimated from the genetic information available. In my formalization of the problem, I propose a model for genetic information propagation in the family graph that admits an interpretation in terms of the Expectation-Maximization (EM) algorithm. Three information propagation schemes are proposed. The final model is composed of two parallel EM algorithms, and the difference between their results provides a predicted measure of how decisive the individual in question is. These predictions are calculated across the whole family graph, and the results are obtained at the level of the individual. Each person can have a phenotype, or a genotype, associated with them for a certain genetic trait. As mentioned in Section 2.4, genetic traits are associated with specific modes of inheritance, if known. A specific mode of inheritance means that a pattern can be observed in the way a genetic disease is inherited from parents to child. This method requires the structure of the family in question, from which we know who is the child of whom with high reliability. With this information, a model is devised to simulate the inheritance pattern, so that the risk of having a certain disease can be predicted exclusively from the data available for a family.

As in many other cases, genetic data in families are generally very sparse and mostly latent. Additionally, the nature of genetics is such that many phenomena are largely unpredictable. That means that if a person is known to have or not to have a certain disease, little can be said about their relatives in a decisive manner. For example, in the case of a couple in which both partners suffer from an autosomal dominant genetic disease, it is wrong to assume that their children will also be or become affected. These parents might both be heterozygous, in which case each of their children has a probability of about 25% of being neither affected nor a carrier of the mutation.

Since no such assumptions should be made (e.g. automatically labeling a person as affected), an approach based on the genotype distribution of each individual is more appropriate than a decisive model, such as rule-based systems or other naïve algorithms (those which stick to the most likely prediction). This technique has been used in many studies (e.g. [26, 31, 32, 33, 34]). From the phenotypical information of a person, a genotypical distribution is derived, as depicted in Figure 12 for a generic case. When genotypical information is given, this step is skipped. As we know, each person has two chromosomes of each type, one from the mother and one from the father (sex chromosomes are not necessarily the same from father and from mother, but autosomes are ideally very similar). Each gene is therefore present as a pair of alleles. Since the copies from father and mother cannot easily be told apart [35], the pair is considered in genetics as an unordered pair. The alleles that correspond to a locus (gene) can be of various types, and each type can be the result of a mutation or recombination from the other types. The most common alleles are generally known and are typically assigned a letter. When the allele is dominant over the others,


Figure 12: Example of phenotype-genotype distribution conversion with penetrance of

60% for being affected.

it is generally written as an uppercase letter. If it is recessive, it is generally written as

a lowercase letter. In this work, ‘A’ and ‘a’ are used for each of the cases, respectively.

The different combinations of alleles can cause different phenotypes with different prob-

abilities (depending on many factors, such as the age of the patient or penetrance of the

disease). Therefore, each phenotype can indicate with certain confidence which is the

underlying genotype. Although no certain assertions can be reached (except in isolated

cases), information can be gathered as a variation in the original probabilities of each

genotype (pair of alleles) for a person.

With the previous strategy, a genetic counselor can estimate whether a person is affected by a disease (phenotype) given their genotype, and vice versa. Knowing some parameters of the disease in question makes it possible to estimate the probability of each combination of alleles in the genome of each patient (genotype). For the people whose status for the disease is unknown (latent), probabilities for each combination of alleles can be estimated through an analysis of the surrounding family. Considered together, these probabilities form a distribution for each individual. Thus, after the transformation to genotype distributions, every person is assigned a probability for each genotype. After the operations to be defined are performed on the genotype distributions of the individuals, new phenotypes can be estimated from the resulting genotype distribution.
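One way such a phenotype-to-genotype conversion could be computed is with Bayes' rule. The sketch below is an assumption made for illustration and is not claimed to be the exact procedure behind Figure 12: the prior distribution, the per-genotype penetrance values (60% for AA and Aa, 0% for aa, as for a dominant trait) and the function name are all hypothetical.

```python
from fractions import Fraction

GENOTYPES = ("AA", "Aa", "aa")


def genotype_given_phenotype(prior, penetrance, affected):
    """Posterior P(genotype | phenotype) via Bayes' rule.

    prior[i]      -- assumed P(genotype i) before observing the phenotype
    penetrance[i] -- assumed P(affected | genotype i)
    affected      -- observed phenotype (True = affected)
    """
    likelihood = [p if affected else 1 - p for p in penetrance]
    joint = [l * q for l, q in zip(likelihood, prior)]
    total = sum(joint)
    if total == 0:
        raise ValueError("the observed phenotype is impossible under this model")
    return [j / total for j in joint]


# Hypothetical inputs, chosen only for the example.
prior = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]
penetrance = [Fraction(3, 5), Fraction(3, 5), Fraction(0)]

posterior_affected = genotype_given_phenotype(prior, penetrance, affected=True)
posterior_unaffected = genotype_given_phenotype(prior, penetrance, affected=False)
```

With these inputs an affected person cannot be aa, while an unaffected person keeps a non-zero probability for every genotype because penetrance is incomplete.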

When performed manually, as genetic counselors currently do, these operations are rather complex and extremely prone to errors, as are many statistics-based calculations and intuitions when performed by humans [3, 4]. In a computerized environment, on the other hand, each of the steps can be automated, decreasing the workload of the genetic counselor and increasing the correctness of the calculations. The genotype


can either be given by the user or predicted from a known phenotype. If neither option is viable, the patient is considered latent. Then, a predictive algorithm can analyze the family graph, propagating information that helps estimate the likelihood of each genotype in each person. From these likelihood estimations (genotype distributions) for each individual, phenotype distributions can be calculated using parameters of the disease being analyzed, such as average age of onset, the gender of the patient, penetrance, etc. Overall, this process is decidedly faster and less prone to errors than what is currently done.

In this work, I consider the genotype and the phenotype distributions of the individuals as feature vectors in a similar fashion to one-hot encoding, where the probability of each of the possibilities is calculated. The vectors always add up to 1, being thus normalized. An example of such a vector is the genotype of a fictitious individual A for a monogenic biallelic disease. The individual A has a probability of 0.6 of being homozygous for the dominant allele (AA), 0.28 of being heterozygous (Aa) and 0.12 of being homozygous for the recessive allele (aa). The vector is considered in this work as follows:

$$G_A = \begin{bmatrix} 0.6 \\ 0.28 \\ 0.12 \end{bmatrix} \tag{1}$$
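The vector in Equation 1 can be written down directly, next to the one-hot vector that a fully known genotype would produce. This is a minimal sketch using plain Python lists; the helper names one_hot and is_distribution are hypothetical.

```python
GENOTYPES = ("AA", "Aa", "aa")  # order of the positions in a genotype vector


def one_hot(genotype):
    """Genotype vector of an individual whose genotype is known with certainty."""
    if genotype not in GENOTYPES:
        raise ValueError(f"unknown genotype: {genotype}")
    return [1.0 if g == genotype else 0.0 for g in GENOTYPES]


def is_distribution(vector, tolerance=1e-9):
    """Check that all entries are probabilities and that they add up to 1."""
    return (all(0.0 <= p <= 1.0 for p in vector)
            and abs(sum(vector) - 1.0) <= tolerance)


# Fictitious individual A from Equation 1: uncertain genotype.
G_A = [0.6, 0.28, 0.12]

# A hypothetical individual known to be heterozygous.
G_known = one_hot("Aa")
```

Both vectors share the same layout, so the same operations apply whether a genotype is known or only estimated.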

In order to extract genetic information from the family graph, a propagation-based predictive algorithm is proposed. It is based on three components for the propagation of information that can be used and combined as desired, namely a) downwards propagation, b) upwards propagation and c) upwards constraint. In the next subsections, each of these modules is introduced. These three components operate on the genotype distributions of the represented individuals, and can, in conjunction, be used as an EM algorithm, which is explained in subsequent subsections.

4.1 Genotype probability distribution propagation

In current clinical practice, genetics specialists usually perform manual calculations during their analysis of a family in order to calculate the risks associated with genetic diseases. They make extensive use of a method that will be called the biological model in this work.

The biological model is based on a 50% chance of each parent contributing each of their alleles to a child. That means that an individual with Aa genotype will transmit either A or a with a 50% chance each. This is due to the segregation that occurs during the process of meiosis, which splits the pairs of chromosomes in two, so that each resulting gamete has only half the genetic information of the original cell. Combining both parents, we can estimate the joint probability of their contributions. A practical way to visualize this splitting is the Punnett square. The entries of a Punnett square represent the probabilities of a child having each combination of alleles. Imagining the pairs of alleles of the parents as ordered lists, a child can inherit the first allele of the father and the second of the mother, for example, which ideally corresponds to a 25% inheritance probability in the Punnett square. As shown in Table 1, parents who are heterozygous for a certain trait have a 25% chance of having a child that inherits the conditions to be affected. At the same time, the probability that the child is a carrier of the disease (heterozygous, in this case) is 50%, since, in biological terms, Aa and aA (first from mother, second from father, for example) are equivalent and their probabilities are summed.

        A     a
  A     AA    Aa
  a     aA    aa

Table 1: Punnett square of two heterozygous parents. Alleles are A and a. The first row and the first column are the possibilities of inheritance (alleles) from each parent. The rest of the cells are the possible genotypes of the child, at a 25% probability each.
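The Punnett square calculation for fully known parents can be sketched as follows. The function name punnett is hypothetical; genotypes are written as two-letter strings, and exact fractions are used so that the 25% and 50% figures appear without rounding.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def punnett(parent1, parent2):
    """Genotype distribution of a child when both parental genotypes are known.

    Each parent transmits either of its two alleles with probability 1/2,
    so each of the four cells of the square has probability 1/4.
    """
    distribution = Counter()
    for allele1, allele2 in product(parent1, parent2):
        # 'Aa' and 'aA' are the same genotype: sorting puts uppercase first.
        genotype = "".join(sorted((allele1, allele2)))
        distribution[genotype] += Fraction(1, 4)
    return dict(distribution)


heterozygous_parents = punnett("Aa", "Aa")
mixed_parents = punnett("AA", "Aa")
```

For two heterozygous parents this yields 1/4 for AA, 1/2 for Aa and 1/4 for aa, matching the summation of the Aa and aA cells described above.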

By chaining the biological model, some invaluable information can be obtained. For example, relationships between cousins can be shown to pose a high risk to the child, since identical mutated alleles can be inherited at the same time from father and mother. This model is especially relevant for the risk assessment of autosomal recessive traits.

The biological model is widely used in part because of its simplicity. It can help in the prediction from known parents to children. However, when non-latent family members are more distant, the biological model is of virtually no utility.

4.1.1 Downwards propagation

Punnett squares are very useful for immediate visualization when the full information of the parents is given. However, when dealing with pedigrees, since many people are represented and the presence of latent data is highly likely, the probabilities must be passed on without having to assume that someone has a certain genotype simply because it is the most likely (an assumption that chaining the biological model requires). That is, all scenarios must be considered concurrently.

For that reason, one proposition is to consider a structure (Profile) whose cells are the probabilities that a child has a certain genotype (as in the Punnett square). However, instead of being two-dimensional, it is a tensor of four dimensions, indexed by 1) the combination of alleles (genotype) of the father, 2) the genotype of the mother, 3) the allele inherited from the father and 4) the allele inherited from the mother. Therefore, each cell now carries more conditions.


                 AA              Aa              aa
               A      a        A      a        A      a
  AA    A      1      0        0.5    0.5      0      1
        a      0      0        0      0        0      0
  Aa    A      0.5    0        0.25   0.25     0      0.5
        a      0.5    0        0.25   0.25     0      0.5
  aa    A      0      0        0      0        0      0
        a      1      0        0.5    0.5      0      1

Table 2: Profile tensor for monogenic biallelic autosomal diseases.

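Two examples are given:

$$\mathrm{Profile}_{AA,AA,A,A} = P(\text{child inherits } A \text{ from father} \mid \text{father is } AA) \times P(\text{child inherits } A \text{ from mother} \mid \text{mother is } AA) = 1 \tag{2}$$

$$\mathrm{Profile}_{Aa,Aa,A,A} = P(\text{child inherits } A \text{ from father} \mid \text{father is } Aa) \times P(\text{child inherits } A \text{ from mother} \mid \text{mother is } Aa) = 0.25 \tag{3}$$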

Table 2 contains the Profile for the case of monogenic biallelic autosomal traits, that is, traits in which only one gene is involved and two alleles are possible.
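The Profile tensor of Table 2 can be generated from the 50% segregation rule instead of being typed in by hand. The sketch below uses nested Python lists indexed as profile[father genotype][mother genotype][allele from father][allele from mother]; the function names are hypothetical.

```python
GENOTYPES = ("AA", "Aa", "aa")
ALLELES = ("A", "a")


def transmission(genotype, allele):
    """P(a parent with `genotype` transmits `allele`): 0, 0.5 or 1."""
    return genotype.count(allele) / 2


def build_profile():
    """Profile[f][m][k][l] = P(allele k from father | father genotype f)
                           * P(allele l from mother | mother genotype m)."""
    return [
        [
            [
                [transmission(father, k) * transmission(mother, l)
                 for l in ALLELES]
                for k in ALLELES
            ]
            for mother in GENOTYPES
        ]
        for father in GENOTYPES
    ]


profile = build_profile()

AA, Aa, aa = 0, 1, 2  # positions of the genotypes
A, a = 0, 1           # positions of the alleles
```

With this indexing, profile[AA][AA][A][A] and profile[Aa][Aa][A][A] correspond to Equations 2 and 3, respectively.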

The Profile tensor holds the probabilities that can be observed in a Punnett square. For the actual calculation to be performed (prediction of the child given the parents), prior information is needed. In this case, the genotype distributions of the parents are the prior information. If the genotype distributions of the father and the mother are vectors in which each position is the probability of one genotype, one step for calculating the probabilities in the downwards propagation is to generate a matrix from the outer product of both vectors, which is then multiplied element by element along the first two axes of the Profile tensor. The result is a map of the conditional probabilities for each possible genotype of the parents (Inheritance By Case - IBC), weighted by the probabilities of each of the cases. For instance, if a father has probability 0.25 of being heterozygous for a trait (Aa) and a mother has probability 0.1 of having such genotype, the Profile position that corresponds to the child inheriting the genotype AA given that both the mother and the father are Aa is 0.25. This is a conditional probability. In the IBC tensor, it is weighted by the probability of that case, giving 0.25 × 0.1 × 0.25 = 0.00625. To predict the probability of the child being AA, all the other cases must be considered as well, including the probabilities for every other combination of genotypes for the parents. Table 3 contains an IBC tensor for one possible example of the aforementioned case.


                      AA (0.9)             Aa (0.1)              aa (0.0)
                    A         a          A          a          A      a
  AA (0.5)    A     0.45      0          0.025      0.025      0      0
              a     0         0          0          0          0      0
  Aa (0.25)   A     0.1125    0          0.00625    0.00625    0      0
              a     0.1125    0          0.00625    0.00625    0      0
  aa (0.25)   A     0         0          0          0          0      0
              a     0.225     0          0.0125     0.0125     0      0

Table 3: Inheritance By Case tensor for the case of 0.25 probability that the father is heterozygous and 0.1 probability that the mother is heterozygous too. The rows axis corresponds to the father. The columns axis corresponds to the mother.

When the genotypes of the parents are known, this table is non-zero only in the position of their genotypes, where it is mathematically identical to a Punnett square. When all we can know are the probabilities of each genotype, a single Punnett square does not easily lead to a conclusion, whereas the Profile table handles the father and mother genotype probabilities separately.
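The construction of the IBC tensor can be sketched as below, using the parental distributions of Table 3 (father: 0.5, 0.25, 0.25; mother: 0.9, 0.1, 0.0 for AA, Aa, aa). The function names are hypothetical and plain nested lists stand in for a tensor library.

```python
GENOTYPES = ("AA", "Aa", "aa")
ALLELES = ("A", "a")


def transmission(genotype, allele):
    """P(a parent with `genotype` transmits `allele`), with 50/50 segregation."""
    return genotype.count(allele) / 2


def inheritance_by_case(g_father, g_mother):
    """IBC[i][j][k][l] = P(father is i) * P(mother is j) * Profile[i][j][k][l]."""
    return [
        [
            [
                [g_father[i] * g_mother[j]
                 * transmission(father, k) * transmission(mother, l)
                 for l in ALLELES]
                for k in ALLELES
            ]
            for j, mother in enumerate(GENOTYPES)
        ]
        for i, father in enumerate(GENOTYPES)
    ]


g_father = [0.5, 0.25, 0.25]  # P(AA), P(Aa), P(aa) for the father
g_mother = [0.9, 0.1, 0.0]    # P(AA), P(Aa), P(aa) for the mother
ibc = inheritance_by_case(g_father, g_mother)

# All cases together cover every possibility, so the entries add up to 1.
total = sum(ibc[i][j][k][l]
            for i in range(3) for j in range(3)
            for k in range(2) for l in range(2))
```

For example, ibc[1][1][0][0] (both parents Aa, allele A from each) is 0.25 × 0.1 × 0.25 = 0.00625, the value shown in Table 3.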

Having obtained the IBC tensor for the child from the parents, two steps are required to take it from a 4-dimensional tensor to a genotype vector (prediction of the genotype of the child). First, a sum over the first two axes of the IBC tensor results in a matrix (C) whose entries are the probabilities of each pair of inherited alleles for the child (for example, C_{A,a} is the probability that the child inherits A from the father and a from the mother). Nonetheless, each axis of the matrix is composed of the possibilities for each allele, and the matrix is not triangular. That is, for two alleles in one gene, C_{A,a} and C_{a,A} have the same biological meaning but are two independent numbers. The second step for reaching the genotype vector is, therefore, to remap the matrix into a vector with the same dimensions as the genotype vector (in the case of two alleles and one gene, 3: AA, Aa, aa). For monogenic biallelic diseases, this calculation is done as follows:

\[
G_{child} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \times \sum_{i} \sum_{j} IBC_{i,j,k,l} \tag{4}
\]
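As a minimal sketch of Equation 4, the two steps (summing over the parental axes, then remapping the allele matrix) can be written as follows. The function name `child_genotype` is hypothetical, and the flattening order of the matrix, (A,A), (A,a), (a,A), (a,a), is an assumption chosen to match the remap matrix above.

```python
# Remap matrix of Equation 4: rows are the genotypes AA, Aa, aa;
# columns are the allele pairs (A,A), (A,a), (a,A), (a,a).
REMAP = [
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
]


def child_genotype(ibc):
    """Collapse a 3x3x2x2 IBC tensor into the child's genotype vector."""
    n_alleles = 2
    # Step 1: sum over the two parental-genotype axes (i, j).
    c = [[sum(ibc[i][j][k][l]
              for i in range(len(ibc))
              for j in range(len(ibc[i])))
          for l in range(n_alleles)]
         for k in range(n_alleles)]
    # Step 2: flatten the allele matrix row by row and apply the remap.
    flat = [c[k][l] for k in range(n_alleles) for l in range(n_alleles)]
    return [sum(w * v for w, v in zip(row, flat)) for row in REMAP]


# Example: both parents are known to be Aa, so the only non-zero slice
# of the IBC tensor is ibc[1][1] (a plain Punnett square).
zero = [[0.0, 0.0], [0.0, 0.0]]
ibc = [[zero for _ in range(3)] for _ in range(3)]
ibc[1][1] = [[0.25, 0.25], [0.25, 0.25]]
g_child = child_genotype(ibc)
print(g_child)  # [0.25, 0.5, 0.25]
```

Applied to the values of Table 3, the same two steps give a child genotype vector of approximately (0.59375, 0.3875, 0.01875).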

Since the IBC-based model keeps the full parental probability distributions when calculating the probabilities of the child, it avoids the loss of information that would result from considering only the most likely genotype of each parent. Furthermore, it is statistically correct for both known and unknown parental genotypes. Therefore, it remains consistent across multiple executions. Additionally, given the characteristics described, the model is also capable of being chained across many generations. Since the amount of latent data is expected to be large, the results are not expected to be perfectly precise. Instead, the model is expected to give some insight into which individuals are most likely to be carriers, unaffected or affected. However, be it precise or not, chaining the downwards model is possible and does not lose information in each iteration, carrying all that can be inferred to the next generation without unnecessary losses. These characteristics render the downwards propagation model adequate for the EM algorithm.

4.1.2 Upwards propagation

Similarly to the downwards propagation model, it is possible to model the inverse effect. In the upwards propagation, we can tune the probabilities of the parents of a child when information about this child is given. For example, if a child is known to be affected by an autosomal recessive trait, it is very likely that their parents are either carriers of or affected by the same trait. This is not guaranteed, since a de novo mutation may have caused the child to be affected. However, it is a reasonable approach to set a higher probability that the parents are carriers or affected when this is the case.

Furthermore, combining the upwards propagation with the downwards propagation can help us model the effect that a cousin whose phenotype is known has on the patient of interest. Since this cousin and the patient share grandparents, some of the genes of these two individuals might carry the same mutations. Therefore, the genetic relatedness among individuals who have ancestry in common can be expressed by combining the downwards propagation model presented above and the upwards model introduced in this section.

Since the downwards model is a numerical method on the probabilities of each possible genotype of the child conditioned on the probabilities of the possible genotypes of the parents ($P(G_c \mid G_f, G_m)$), it is reasonable to formulate the upwards model as the probabilities of the parents conditioned on the probability of the child ($P(G_f, G_m \mid G_c)$).

According to Bayes’ rule,

\[
P(G_c \mid G_f, G_m) = \frac{P(G_f, G_m \mid G_c) \times P(G_c)}{P(G_f, G_m)} \tag{5}
\]

Therefore,

\[
P(G_f, G_m \mid G_c) = \frac{P(G_c \mid G_f, G_m) \times P(G_f, G_m)}{P(G_c)} \tag{6}
\]

Thus, it is possible to devise a method for having an insight about the genotypes of

the parents given the genotype of the child.

In this work, I formalize the previous statement in an algorithm. First, the parents' prior matrix is calculated ($P(G_f, G_m)$). Then, an $n_a \times n_a \times n_a$ tensor ($B$) is created, where $n_a$ is the possible number of alleles for the disease in question. Each of the three


dimensions corresponds to one individual. In this work, I consider the first two as father

and mother, respectively. The third corresponds to the child.

For each position of the $B$ tensor, $B_{i,j,k}$, an operation is performed. If, for the corresponding child's genotype distribution, the probability at position $k$ is zero, then $B_{i,j,k}$ is zero. Otherwise, the conditional probability from the Profile is calculated for the $i, j$ position. Then, a calculation following Bayes' rule above is done. The result is stored in the $B$ tensor.

At the end of this calculation, the normalized sum on the first axis gives the most

likely genotype probabilities of the father for that child and the sum on the second axis

has the same meaning for the mother.
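The following sketch illustrates Equation 6 for the simplest situation, in which the genotype of the child is known with certainty. It is not the B-tensor algorithm of this work: the function names are hypothetical, each axis is assumed to index the three genotypes (AA, Aa, aa), the parental priors are assumed independent, and the marginal of each parent is obtained by summing the posterior over the other parent.

```python
# Genotype order: AA, Aa, aa. Allele order: A, a.
TRANSMIT = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
# Child genotype index -> (father allele, mother allele) pairs that produce it.
CHILD_ALLELES = {0: [(0, 0)], 1: [(0, 1), (1, 0)], 2: [(1, 1)]}


def child_given_parents(i, j, k):
    """P(Gc = k | Gf = i, Gm = j) under Mendelian segregation."""
    return sum(TRANSMIT[i][a] * TRANSMIT[j][b] for a, b in CHILD_ALLELES[k])


def parents_given_child(father, mother, k):
    """Posterior P(Gf = i, Gm = j | Gc = k), following Equation 6."""
    joint = [[child_given_parents(i, j, k) * father[i] * mother[j]
              for j in range(3)]
             for i in range(3)]
    evidence = sum(map(sum, joint))  # P(Gc = k)
    if evidence == 0:
        raise ValueError("child genotype is impossible under these priors")
    return [[value / evidence for value in row] for row in joint]


prior = [0.96, 0.03, 0.01]
posterior = parents_given_child(prior, prior, k=2)  # child is aa
father_post = [sum(row) for row in posterior]
mother_post = [sum(posterior[i][j] for i in range(3)) for j in range(3)]
print(father_post)  # approximately [0.0, 0.6, 0.4]
```

With these priors, the posterior removes all probability from AA and splits the rest between Aa and aa in proportion to how likely each genotype is to transmit the recessive allele.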

One aspect to be noticed when performing the upwards propagation is that any characteristic present in the child is propagated equally to both parents (in autosomal diseases). Therefore, a mutation that comes from the father, for example, will also be assigned a high likelihood of having come from the mother. This is, in a sense, incorrect. However, it is only incorrect when we already know that this is not what happened. Any individual, however expert, will suspect that both parents are equally likely to have transmitted a mutation to the child unless more information proves this assumption wrong.

4.1.3 Upwards constraint propagation

In some cases, a direct constraint on the possible genotypes of the parents given the pos-

sible genotypes of the child can be made. Since this behavior is much more precise in

effect than the upwards propagation, being able to propagate a constraint instead of an ex-

pectation can be a very useful characteristic of a model for predicting genotypes of latent

or semi-latent individuals (the case of being semi-latent will be discussed in Section 4.2).

Propagating a constraint means that, if some genotype is necessarily present in or absent from the parents of the individual in question, the genotype distributions of these parents are transformed so that this genotype becomes either the only possible one or an impossible one. For example, in a monogenic biallelic autosomal recessive disease, to affirm that a child is affected by the disease means that the only possible genotype for this child is homozygous for the recessive allele (that is, aa in the naming scheme of this work). Since each of these alleles comes from one parent, both parents must possess the recessive allele in their genomes. Therefore, neither parent can be homozygous for the dominant allele. That is, neither can be AA.

As an example, consider the individuals in Figure 13. The genotype distribution shown in Equation 7 for the individual C can improve the estimation of the genotype distributions of her parents. C is affected by a monogenic biallelic autosomal recessive disease (her genotype is certainly aa). Therefore, neither of her parents should have a distribution that allows for the AA genotype. Equation 9 is the new prediction for the family after applying one step of upwards constraint from C to A and B. The operation is performed following Equation 8 for individual A. The same calculation is performed for individual B.

Figure 13: Example of a small family for which the application of upwards constraint can improve the prediction.

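\[
G_A = \begin{bmatrix} 0.96 \\ 0.03 \\ 0.01 \end{bmatrix}, \quad G_B = \begin{bmatrix} 0.96 \\ 0.03 \\ 0.01 \end{bmatrix}, \quad G_C = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \tag{7}
\]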

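\[
G_A = \begin{bmatrix} 0.96 \\ 0.03 \\ 0.01 \end{bmatrix} \xrightarrow{\text{constraint}} G_A = \begin{bmatrix} 0 \\ 0.03 \\ 0.01 \end{bmatrix} \xrightarrow{\text{normalization}} G_A = \begin{bmatrix} 0 \\ 0.75 \\ 0.25 \end{bmatrix} \tag{8}
\]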

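\[
G_A = \begin{bmatrix} 0 \\ 0.75 \\ 0.25 \end{bmatrix}, \quad G_B = \begin{bmatrix} 0 \\ 0.75 \\ 0.25 \end{bmatrix}, \quad G_C = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \tag{9}
\]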
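The constraint and normalization steps of Equation 8 can be sketched as a small function. The name `apply_constraint` and the representation of the constraint as a set of forbidden genotype indices are assumptions made for illustration only.

```python
def apply_constraint(genotypes, forbidden):
    """Zero out the forbidden genotype positions and renormalize the rest.

    genotypes: probabilities in the order AA, Aa, aa.
    forbidden: set of indices that the constraint rules out.
    """
    constrained = [0.0 if index in forbidden else p
                   for index, p in enumerate(genotypes)]
    total = sum(constrained)
    if total == 0:
        raise ValueError("the constraint removes all probability mass")
    return [p / total for p in constrained]


# Individual A from Equation 7; the affected child C rules out AA (index 0).
g_a = apply_constraint([0.96, 0.03, 0.01], forbidden={0})
print(g_a)  # approximately [0.0, 0.75, 0.25], as in Equation 8
```

Raising an error when nothing is left to normalize is a design choice of this sketch; it signals a contradiction between the constraint and the current distribution instead of silently dividing by zero.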

The attentive reader may notice that propagating a constraint such as in the example given may result in a wrong parental expectation, since there is the possibility that, for example, one parent is a carrier of the mutated allele and the other allele is the result of either a uniparental disomy (both alleles come from a single parent) or a mutation. However, for some diseases, this is very unlikely. In such cases, the assumption that one allele is inherited from each parent is highly credible. Therefore, the propagation of constraints can provide useful insight into the risk of transmission from the parents. Additionally, in this case, a false positive constraint is more useful than ignoring that the parents are much more likely to have at least one recessive allele than none. If a parent has none and this can be proven in any other manner, it is registered by the user and the constraint is ignored.

4.2 Masked genotypes

Given the nature of the algorithm used for genotype distribution prediction, partial predictions must be kept unchanged where the available information is sufficient and updated where better expected probabilities can be found. From this necessity emerges the concept of masked genotypes. A masked genotype is a genotype whose probability is fixed and unchangeable. When the new genotype distribution is estimated from the stack of possible genotype distributions for a specific patient, only those positions which are free to be changed are actually changed (all except the masked genotypes). The rest is left as before. Furthermore, the new free probabilities only add up to what is left by the fixed ones, thus leaving the genotype probabilities array normalized.
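A possible reading of the masked update is sketched below. The function name `update_with_mask` and the convention that `True` marks a masked (fixed) position are assumptions; the proposed values of the free positions are rescaled so that they add up to the mass left by the fixed ones.

```python
def update_with_mask(previous, proposed, mask):
    """Replace only the free positions of `previous` with rescaled values
    from `proposed`; positions where mask is True stay untouched."""
    fixed_mass = sum(p for p, fixed in zip(previous, mask) if fixed)
    free_mass = 1.0 - fixed_mass
    proposed_free = sum(p for p, fixed in zip(proposed, mask) if not fixed)
    result = []
    for prev, new, fixed in zip(previous, proposed, mask):
        if fixed or proposed_free == 0:
            result.append(prev)  # masked, or nothing to redistribute
        else:
            result.append(new * free_mass / proposed_free)
    return result


# AA is masked at probability 0 (for example after a constraint);
# the proposal still puts mass on AA, which must be ignored.
updated = update_with_mask(
    previous=[0.0, 0.75, 0.25],
    proposed=[0.5, 0.3, 0.2],
    mask=[True, False, False],
)
print(updated)  # approximately [0.0, 0.6, 0.4]
```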

4.3 Expectation Maximization for genotype expectation propagation

While families are networks of multiple individuals, the previous subsections of this section (Methods) only mention families formed by one father, one mother and one child. The reason is that the basis for the model proposed here is the triplet, formed by exactly one father, one mother and one child. If one of the parents does not exist, they are created and considered a latent individual. That way, it is always possible to propagate genetic information among individuals. The existence of siblings of a given person, for example, provides information to the family (if there is some to be provided), but not directly. Each gestation is distinct, and each meiosis differs from any other, even from the same parents and even though the genetic material is the same. Thus, combining siblings in the same unit of information does not seem adequate.

Therefore, in the model here proposed, triplets are the basic units. Each individual in

a triplet has a clear purpose and different triplets may share individuals (the alterations

in their genotype distributions are shared with other triplets). Families are decomposed

into triplets by biological roles: biological father (provider of sperm), biological mother

(provider of ovum) and child.
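The decomposition into triplets can be sketched as follows. The data layout (a mapping from each child to its recorded parents), the `Triplet` class and the `latent-N` identifiers are all hypothetical. Two simplifications are assumed: founders with no recorded parent do not generate a triplet, and every missing parent receives its own latent individual, so siblings with the same unrecorded parent are not linked through it.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Triplet:
    father: str
    mother: str
    child: str


def build_triplets(parents_of):
    """parents_of maps a child id to (father id or None, mother id or None)."""
    latent_ids = count(1)
    triplets = []
    for child, (father, mother) in parents_of.items():
        if father is None and mother is None:
            continue  # founder: assumed not to form a triplet as a child
        if father is None:
            father = f"latent-{next(latent_ids)}"
        if mother is None:
            mother = f"latent-{next(latent_ids)}"
        triplets.append(Triplet(father, mother, child))
    return triplets


family = {
    "A": (None, None),  # founder
    "B": (None, None),  # founder
    "C": ("A", "B"),
    "D": (None, "B"),   # father not recorded: a latent father is created
}
triplets = build_triplets(family)
print(triplets)
```

Individual B appears in both triplets, which is how changes to her genotype distribution would be shared between them.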

The Expectation Maximization process is composed of two steps, the first being the calculation of the expectation (E step), and the second being the calculation of the new parameters, which seeks to maximize a function (M step). In this work, I do not define an explicit function to be maximized. This can be seen as a simplification of the algorithm, but since there is no need to formulate such a complex function, it is left without further formalization.

E step. The E step is responsible for calculating an intermediate decision that will enable

the M step to modify the actual parameters, searching for maximization of expectation.

In this work, the E step finishes leaving the individuals with a stack of expected genotype

distributions. These individuals are the ones for whom a prediction can be made (to

be discussed later). The stack contains one genotype distribution for each operation in

the triplets that involves the individual in question. Therefore, different individuals may

have different numbers of genotype distributions, depending on their connectivity and the

activation criteria for each triplet of which they are members.

The criteria for triplet activation (calculation) depend on the status of the key individual(s) present and their genotype masks. The list below contains the operations and the criteria that enable the activation of a triplet. A triplet is activated if at least one of the operations is possible, and it performs only those operations which are possible. The criteria for each operation are as follows:

• Downwards propagation: If the mask of the child allows for adjustment (it is not

fully determined), the triplet can perform downwards propagation.

• Upwards propagation: If the mask of at least one parent allows for adjustment,

the triplet can perform upwards propagation.

• Upwards constraint propagation: If either the genotype or the phenotype of the child is known and the penetrance provides enough confidence in propagating the constraint to the parents, the constraint is propagated (provided that there is a constraint to be propagated).
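The activation criteria listed above can be summarized in a short sketch. The function name, the boolean mask representation (`True` marks a fixed genotype position) and the two flags `child_known` and `constraint_confident` are hypothetical simplifications of the conditions described in the list.

```python
def possible_operations(child_mask, father_mask, mother_mask,
                        child_known, constraint_confident):
    """Return the operations a triplet can perform, constraints first.

    A mask is a list of booleans, one per genotype; True means the
    position is fixed. A fully True mask leaves nothing to adjust.
    """
    operations = []
    if child_known and constraint_confident:
        operations.append("upwards_constraint")
    if not all(father_mask) or not all(mother_mask):
        operations.append("upwards")
    if not all(child_mask):
        operations.append("downwards")
    return operations


# Child with a fully determined genotype and two free parents.
ops = possible_operations(
    child_mask=[True, True, True],
    father_mask=[False, False, False],
    mother_mask=[False, False, False],
    child_known=True,
    constraint_confident=True,
)
print(ops)  # ['upwards_constraint', 'upwards']
```

A triplet for which the returned list is empty would simply not be activated.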

The E step starts by calculating the propagations of constraints. Then, the possible

upwards and downwards propagations are executed, each adding a new genotype distribu-

tion to the stack of some individuals (parents in upwards propagation, child in downwards

propagation). The E step finishes when all the possible operations are performed, and the

individuals’ stacks of genotype distributions contain all the necessary entries.

M step. The M step is the one in charge of calculating the new parameters for the next iteration of the EM algorithm (or of finishing the iterative process). In this interpretation of the algorithm, the M step calculates the expected genotype distributions for each individual given the previous distribution and the stack calculated by the E step.

For each individual, we calculate the average genotype distribution from the stack. The process of calculating the average distribution can take weights into account, depending on the confidence of each estimation. In this work, I do not consider weights. A gradient is then calculated between the average distribution and the previous distribution. Lastly, the previous genotype distribution is updated by a percentage of the calculated gradient (the percentage is a learning parameter, commonly referred to as the learning rate α in the Machine Learning literature).
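As a minimal sketch of this update, assuming an unweighted average and an illustrative learning rate of 0.5 (the value used in this work is not stated here), the M step for one individual could look like this. The function name `m_step` is hypothetical.

```python
def m_step(previous, stack, alpha=0.5):
    """Move `previous` towards the average of the stacked distributions.

    previous: current genotype distribution of the individual.
    stack: list of candidate distributions produced by the E step.
    alpha: learning rate (illustrative value).
    """
    if not stack:
        return list(previous)  # no operation touched this individual
    average = [sum(dist[g] for dist in stack) / len(stack)
               for g in range(len(previous))]
    gradient = [avg - prev for avg, prev in zip(average, previous)]
    return [prev + alpha * grad for prev, grad in zip(previous, gradient)]


new_distribution = m_step(
    previous=[0.96, 0.03, 0.01],
    stack=[[0.0, 0.75, 0.25], [0.0, 0.6, 0.4]],
)
print(new_distribution)  # approximately [0.48, 0.3525, 0.1675]
```

Because both the previous distribution and the average sum to one, the updated distribution stays normalized for any value of alpha.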

Iterative process. Since the EM algorithm is an iterative process, it is constructed around

a loop. In the case of the EM algorithm specifically, each iteration consists of performing

one E step before one M step. Then, if convergence is found, the process is terminated.

In this work, since one objective is to implement the iterative process in a PDS, a mechanism for forcing the termination of the algorithm must be devised. Thus, a maximum number of iterations is set. If convergence is found earlier, the process is terminated before running out of iterations. This is done for usability reasons.

One aspect to be noticed is that, in each M step, the genotype distribution of an indi-

vidual is changed following only the gradient, without any stochastic term or any other

type of maximization mechanism. For that reason, the EM algorithm, in this work, has

a greedy maximization. Therefore, it ends up searching for local maxima of the search

function (which is not formalized in this work but exists implicitly nonetheless).
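The loop described in this subsection can be sketched as below. The function `run_em`, its parameters and the convergence test (largest absolute change in any probability) are assumptions; the toy E and M steps only stand in for the real ones so that the loop can be executed.

```python
def run_em(state, e_step, m_step, max_iterations=50, tolerance=1e-6):
    """Alternate E and M steps until the largest change falls below
    `tolerance` or `max_iterations` is reached (illustrative values)."""
    iterations = 0
    for iterations in range(1, max_iterations + 1):
        stacks = e_step(state)
        new_state = m_step(state, stacks)
        change = max(
            abs(new - old)
            for person in state
            for new, old in zip(new_state[person], state[person])
        )
        state = new_state
        if change < tolerance:
            break  # converged before running out of iterations
    return state, iterations


# Toy stand-ins: the E step always proposes the same target and the
# M step moves each individual halfway towards it.
TARGET = [0.0, 0.75, 0.25]


def toy_e_step(state):
    return {person: [TARGET] for person in state}


def toy_m_step(state, stacks):
    return {
        person: [p + 0.5 * (stacks[person][0][g] - p)
                 for g, p in enumerate(dist)]
        for person, dist in state.items()
    }


final_state, iterations = run_em({"A": [1.0, 0.0, 0.0]}, toy_e_step, toy_m_step)
print(iterations, final_state["A"])
```

Since each update follows the gradient with no stochastic term, the loop is deterministic: the same input always stops after the same number of iterations.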

4.4 Mode of inheritance-specific factors

In this work, only monogenic biallelic autosomal genetic diseases are used in examples

and explanations. However, the model here proposed is also compatible with other sce-

narios.

For example, in the same context of autosomal diseases, both more possible alleles and more genes are supported. In the first case, adapting the Profile tensor, the genotype probability arrays, etc. to the larger number of possibilities is enough to make predictions for diseases in which more than two classes of alleles are significant for the analysis of risk. In the second case, if more than one gene is involved, running one prediction per gene in parallel is enough for multiple genes to be considered. In this case, the phenotype-genotype calculation must be adapted, so that more than one genotype is analyzed to determine the phenotype probabilities.
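To illustrate how the genotype vector and the remap matrix of Equation 4 grow with the number of alleles, the sketch below enumerates unordered genotypes and builds the corresponding remap matrix. The function names and the allele labels are hypothetical.

```python
from itertools import combinations_with_replacement


def genotypes(alleles):
    """Unordered genotypes, e.g. AA, Aa, aa for the alleles A and a."""
    return list(combinations_with_replacement(alleles, 2))


def remap_matrix(alleles):
    """Rows: unordered genotypes. Columns: ordered (father, mother) allele
    pairs. An entry is 1 when the ordered pair yields that genotype."""
    ordered_pairs = [(a, b) for a in alleles for b in alleles]
    return [
        [1 if tuple(sorted(pair, key=alleles.index)) == genotype else 0
         for pair in ordered_pairs]
        for genotype in genotypes(alleles)
    ]


biallelic = remap_matrix(["A", "a"])
print(biallelic)  # [[1, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 1]]
print(len(genotypes(["A1", "A2", "A3"])))  # 6
```

For n alleles there are n(n + 1)/2 unordered genotypes, so three alleles already require genotype vectors of length 6 and a 6 × 9 remap matrix.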

Changing from autosomal diseases to sex-linked ones, those caused by mutations in the sex chromosomes, more substantial changes have to be made. In the case of Y-linked diseases, the method explained here is better than a rule-based system only for those diseases with incomplete penetrance. The Profile tensor then has only two dimensions (only the father's dimensions). Some adaptations in the rest of the operations are needed. In the case of X-linked diseases, more variables have to be analyzed. Since X-linked diseases behave differently in men and women [2], the treatment must be different in each of those cases. In the case of men, the X chromosome is only inherited from the mother. The Profile tensor is thus bidimensional, only containing the axis corresponding to the mother. In the case of women, one of the chromosomes is inherited from the father exactly as in men. However, the other is inherited like an autosome from the mother. For the prediction, therefore, what is needed is only an adapted Profile tensor and a modified phenotype-genotype transformation.

When two or more genes are on the same chromosome, the use of dynamic Profile tensors for each of the alleles can be key to finding the most likely genotypes for each individual. Genetic linkage can be treated the same way. When genetic linkage occurs, the randomness of meiosis is somewhat reduced because of the tendency to inherit certain chromosomes always from the same grandparents.

In the case of mitochondrial diseases, although few diseases result in viable human beings when the percentage of mitochondria that carry the mutation is high [2], mitochondria are always inherited only from the mother. The model proposed in this work should theoretically work well for mitochondrial diseases when the mother's mitochondria are very likely to be mutated, or when the percentage is at least known. In terms of the model, the changes in the Profile tensor are similar to those for Y-linked diseases, but from the mother.

One type of genetic disease which is not supported by the model proposed here is chromosomal diseases. This is because they are based on the existence of multiple copies of the same alleles or the absence of any of them.

4.5 Information contribution prediction

Some previous approaches to the prediction of genetic traits given family-related genetic information apply models that take into consideration more parameters than the ones described here. Guo et al. (1994) [33], for example, formalize a mixed-models method based on fixed and random parameters, as well as both genetic and nongenetic factors. In terms of prediction, it is very useful. However, in a clinical scenario, fine-tuning such parameters is complex and, more likely than not, uncertain. This renders such a method inappropriate.

In order to actually assist the medical professional, an algorithm capable of effortlessly predicting who is most likely to provide information that helps better specify the likelihoods for every other individual is more adequate than a precise but excessively complex Maximum Likelihood Estimation. Therefore, in this work, a difference-based perspective is applied. Instead of making one precise but excessively complex operation, the algorithm performs two different operations, one which is underconfident (method A), and one which is overconfident (method B). Both represent local optima of the search space. The difference between the predictions of both methods expresses how uncertain we can be about a person. The individuals who are most uncertain are assumed to be the most likely to provide useful information to the genetic screening process. The medical professional can thus visualize this information and, based on the suggestions given and the prediction intervals obtained, make a decision.

Method B is the EM algorithm described above with the three modes of information propagation. Method A is very similar, but uses only the downwards propagation and upwards constraint propagation modes. As a result, method A is composed of those modes which do not calculate expectations in an inverse manner. Therefore, information only goes upwards in the family graph by constraint propagation, which has a fainter but more decisive effect in comparison to the upwards propagation. Both results are intended to be shown to the user for each individual so that they can observe the intervals between the predictions. Also, a distance measure can be computed so that a list of individuals sorted by lack of information can be obtained.
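The ranking by lack of information can be sketched as follows. The text does not specify which distance measure is used, so the total variation distance below is only an assumption, as are the function names and the example predictions.

```python
def total_variation(p, q):
    """Half the sum of absolute differences between two distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def rank_by_uncertainty(method_a, method_b):
    """Sort individuals by decreasing distance between both predictions."""
    scores = {
        person: total_variation(method_a[person], method_b[person])
        for person in method_a
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical predictions for two individuals (order: AA, Aa, aa).
method_a = {"A": [0.0, 0.75, 0.25], "D": [0.9, 0.09, 0.01]}
method_b = {"A": [0.0, 0.60, 0.40], "D": [0.5, 0.40, 0.10]}
ranking = rank_by_uncertainty(method_a, method_b)
print(ranking)  # D first (distance about 0.4), then A (about 0.15)
```

The individuals at the top of the list are those for whom the two methods disagree most, and therefore those for whom a test would be expected to be most informative.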


5 IMPLEMENTATION AND EVALUATION PROCEDURE

In order to complete one of the objectives of this work, as well as to evaluate the useful-

ness of the method here introduced in actual clinical practice, I implemented the genetic

information propagation as a module of genoDraw.

In this implementation, so as to evaluate the method with utmost clarity, two modes

of inheritance are considered: monogenic biallelic autosomal recessive and monogenic

biallelic autosomal dominant. These types were chosen due to their conceptual simplicity

and direct compatibility with the method. Although the selection of modes of inheritance

chosen is small, the implementation of more modes of inheritance is trivial, as explained

in the Methods section.

In regard to the implementation itself, the next subsection is dedicated to its details

and intricacies. Next, the testing strategy is explained.


5.1 Implementation

The method for genetic information propagation in the family graph, whose objective is to predict genotype and phenotype distributions for individuals, was implemented for two types of inheritance as a module for genoDraw. Since genoDraw is a web-based PDS, technologies of the web stack were chosen for this implementation.

The description of the implementation is made in three parts, each covered in one of the next subsections. The first part corresponds to the essential calculations, that is, the parts that perform the propagation of information itself. The second part covers the supportive elements, which deal with the existing components of genoDraw and retrieve data from databases. The retrieval of data from databases is done in order to provide genoDraw with the capability to be integrated with biomedical vocabularies and to add the functionality of suggesting inheritance modes depending on the specific disease being analyzed. The third part is a brief description of the resulting user interface of the genoDraw module.

5.1.1 Implementation of the genetic information propagation algorithm

The genetic information propagation algorithm was implemented inside the module dedicated to Risk Assessment in genoDraw using JavaScript as the core language and TensorFlow.js as a library. JavaScript is a programming language in widespread use. As of now, it is by far the most popular programming language for web development and is one of a handful compatible with modern web browsers.

TensorFlow.js is also a unique asset in its class. It is a JavaScript library which brings access to GPU processing in browsers via WebGL. TensorFlow.js is conceived as a means of enabling Machine Learning applications to run in web-based environments locally on the device of the user. In the specific case of this work, TensorFlow's capabilities for accelerating machine learning algorithms are of no importance. However, the ability to manage numerical tensors with already-implemented functions and top-notch manipulation capabilities is a feature that cannot be found in other JavaScript libraries. Hence the use of TensorFlow.js in the implementation of this non-machine learning method.

The method was implemented as a loop with a limit of iterations and convergence

detection (so as to terminate the execution earlier in the case of convergence). Inside

the loop, a selection of propagation methods can be run. They are executed whenever

possible for each triplet of parents and one child. Then, a collapse of the stack of geno-

type probability distribution candidates is done for each individual, finding new estimated

distributions for every person (Likelihood estimation).

At the end of the prediction, the predicted genotype distributions are assigned to the

profile of each individual. Then, a prediction of the distribution of phenotypes is calcu-

lated.


The whole process described here is done twice, once for each mode of calculation (underconfident and overconfident).

5.1.2 Implementation of the supportive elements

As supportive elements, genoDraw makes use of some internal and external resources.

From one perspective, the drawing of the pedigree is fundamental for displaying the cal-

culations’ results to the user. From another perspective, data must be fetched from exter-

nal databases in order to provide information about diseases (i.e. codes to which they are

attached and modes of inheritance).

As reported in [43], the drawing of the pedigree is made semi-automatically in the

user’s device, following a unique three-step process. The pedigree drawing module is

implemented in JavaScript and makes use of WebCola [45] and D3.js [46] as libraries.

The present module then draws the results on the nodes already represented on the screen.

External resources are retrieved from a relational database containing the information of UMLS [20] for three biomedical vocabularies indicated by the genetics team of the 12 de Octubre Hospital: SNOMED-CT [21], OMIM [22] and HPO [23]. As also mentioned in [1] and [43], this database holds information on many diseases, whose codes can be linked to patients as a means of registering their status for each of the diseases of interest in a pedigree. As discussed in Section 2.4, the code is enough for this attachment. However, the registered code is also used to retrieve, from the same database, the modes of inheritance of the diseases linked to it. Nongenetic diseases, and even some diseases known to have genetic components, do not have any mode of inheritance attached. For this reason, the modes of inheritance are retrieved as a hint to the user, who may choose any other mode of inheritance to analyze families.

The communication between the client application and the database is made using

a server implemented in Node.js. This server works under a REST API, is enabled to

operate on HTTPS, and handles user authentication via username and password. User data

is stored in a MongoDB database, and passwords are hashed before being stored. From

this server, access to the UMLS database, which is a MySQL database, is performed.

The internal graph files which correspond to the pedigrees are JSON (JavaScript Object Notation) files managed by the user locally. An internal graph can be loaded into the platform but is never uploaded to the server for confidentiality reasons. After making the desired changes or visualizing the pedigree, the user can choose to download the internal graph as a file, which is the same file that can be loaded again into the platform in the future. If desired, the server can also be configured to store family graphs, a feature which is deactivated in the current deployment on www.genodraw.com.


Figure 14: Architecture of genoDraw.

5.1.3 User Interface

As thoroughly described in [1] and [43], genoDraw is a PDS which automatically draws

pedigrees in a canvas with minimum necessary interaction from the user for adjustments.

In this implementation, a new mode was developed which allows for risk assessment. In

this module, the pedigree is drawn automatically in a canvas, as in all the other modes

(normal and editing). Then, by using a sidebar, the user is able to select specific inheri-

tance modes, with some suggestions retrieved from UMLS. The user is also able to select

the specific penetrance and allele frequency in a submenu. A button for launching the

prediction is also present in the sidebar. When the prediction is calculated, its results are

shown below the nodes of the pedigree, in their corresponding individuals (Figure 15).

A list of the individuals for whom not enough genetic information is known is also composed, sorted by lack of information, that is, by the difference between methods A and B. From this list, the user can assign phenotypes to the individuals, but this task can also be done directly in the canvas through the use of context menus (Figure 16).


Figure 15: Screenshot of the risk assessment mode of genoDraw. As we can see, individ-

uals are annotated, after the prediction, with two possible distributions (methods A and

B). On the right, the sidebar is shown, from which the settings for the prediction are set.


Figure 16: Screenshot of the risk assessment mode of genoDraw. Context menus can be used to assign phenotypes to individuals and to unassign them. In this specific case, the user is about to assign individual D the status of carrier for the disease being analyzed.

5.2 Evaluation procedure

To evaluate the method as a whole and the module implemented in genoDraw, an analysis of results was performed for some examples. The examples range from simple situations, in which there are only two parents and one of their children, to complex situations. The simple examples always contain latent individuals and comprise cases in which information is expected to flow upwards in the family graph and cases in which a downwards flow is expected.

The analysis of simple examples consists of comparing the estimated results with the aforementioned biological model, that is, the model based on Punnett squares, which is widely adopted in current clinical practice and is also referred to as part of the gene counting method by other authors [26, 37]. Since autosomal dominant and autosomal recessive models are implemented, the tests were performed for both cases. For each of the cases, a solution for full penetrance and a solution for penetrance at 60% were given.
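As a reference for the comparisons that follow, the Punnett-square model can be sketched in a few lines. This is a minimal illustration and not the implementation used in genoDraw: it assumes a biallelic locus, independent parental genotypes, and distributions given in the order (AA, Aa, aa); the function names are hypothetical.

```python
from itertools import product

GENOTYPES = ("AA", "Aa", "aa")


def gamete_probs(genotype_dist):
    """Probability of transmitting allele A or a, given an (AA, Aa, aa) distribution."""
    p_AA, p_Aa, p_aa = genotype_dist
    return {"A": p_AA + 0.5 * p_Aa, "a": p_aa + 0.5 * p_Aa}


def child_distribution(father, mother):
    """Combine the parents' gamete probabilities as in a Punnett square."""
    gf, gm = gamete_probs(father), gamete_probs(mother)
    child = {g: 0.0 for g in GENOTYPES}
    for (af, pf), (am, pm) in product(gf.items(), gm.items()):
        key = "".join(sorted(af + am))  # "aA" and "Aa" are the same unordered pair
        child[key] += pf * pm
    return tuple(child[g] for g in GENOTYPES)


# Father known to be AA, mother known to be Aa (case 5 of Table 4).
print(child_distribution((1, 0, 0), (0, 1, 0)))  # (0.5, 0.5, 0.0)

# Father known to be AA, mother latent with the initial distribution of Table 4.
aa_free = child_distribution((1, 0, 0), (0.969, 0.03, 0.001))
print([round(p, 2) for p in aa_free])  # [0.98, 0.02, 0.0]
```

Both outputs agree with the method A values listed for the child in cases 5 and 2 of Table 4.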

For more complex cases, which consist of large and complex pedigrees, a different approach was taken. Even though the cases were again separated by inheritance mode, an analysis using the biological model is not feasible, since it does not, by itself, present acceptable results. Since one of the main aims of the method described here is to assist the detection of individuals of high influence in the genetic distribution of the family, the evaluation was performed by starting from a partially latent family, removing the information of some individuals and evaluating the changes in the predictions for the family, as well as discussing possible issues and other pertinent details.

genoDraw is an online tool, and it is intended to be used by medical professionals as part of their routine. Therefore, good usability is required for its success in the intended scenario. This usability was established in previous publications for the creation and visualization parts [1, 43]. Since the addition of the mode reported in this work causes only slight changes in the tool itself, and the method presented here is the most essential part of this thesis, no usability evaluations will be presented in the next section.


6 RESULTS

In this section, the results obtained by following the testing procedures described in Section 5.2 are presented. It contains the calculations for a set of cases for atomic pedigrees (triples of a father, a mother and a child), as well as more complex cases, both with full penetrance and partial penetrance.

The cases associated with atomic families are presented in the next subsection. Then, more complex cases are separated into another subsection.


6.1 Atomic family cases

The results for full penetrance in biallelic monogenic autosomal recessive traits are included in Table 4. Every three rows represent an atomic family. The columns contain a number to identify each family, the individual to which the row refers, their phenotype, their initial predicted genotype distribution, their predicted genotype distribution at the end of the calculations and their predicted phenotype. In each of the predicted columns, results are divided in two: calculated using only downwards probability propagation and upwards constraint propagation (method A), and calculated using the three proposed propagation methods (method B). In this table, since penetrance is complete, the results of the last two sets of three columns are the same.

In the first family included, all individuals are latent. Thus, no information should be obtained from the processing of their data. As we can see, method A causes almost no change to the calculations. Method B, in contrast, transforms the distributions of all the individuals by calculating the most expected case for the child. Since there is no constraint, the most expected case is that no information is known for the child: if there are 4 possible combinations of ordered pairs of alleles, each is assigned 25% probability, and since in terms of unordered pairs Aa is the same as aA, their probabilities are added, and the result is 50% probability for Aa. A reasonable approach could consider both situations as correct, since no information is known for any individual. Therefore, methods A and B are both correct for cases in which no information is given.
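The 25%/50%/25% figure for an unconstrained child can be checked by direct enumeration. The following sketch only reproduces the counting argument given above; it is not part of the propagation algorithms.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# The four equally likely ordered allele pairs: AA, Aa, aA, aa.
ordered_pairs = list(product("Aa", repeat=2))
weight = Fraction(1, len(ordered_pairs))

# Merge ordered pairs that are the same unordered genotype (aA == Aa).
unordered = Counter()
for pair in ordered_pairs:
    unordered["".join(sorted(pair))] += weight

print(dict(unordered))
```

Exact fractions are used so that the merged probability of Aa is exactly one half rather than a rounded float.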

The second family includes a constraint. The father is defined as not affected by the disease in question. That is, the father is known to be neither affected nor a carrier of the trait. Therefore, for the case of full penetrance, the only possible genotype he can have is AA. As we can see in the results, method A again uses the prior information of the mother and the constraint on the father to calculate the distribution of the child, reaching a correct result in which the child is known not to be affected: the father cannot transmit the recessive allele to the child, because his genome does not contain it. Method B, on the other hand, predicts what is most likely from the perspective of the child. That is, in a situation with no information about one of the parents but in which the father is known to be AA, the child is most expected to be either AA or Aa, and each is predicted at 50%.

Cases 3 and 4 represent similar situations and are not commented on. Suffice it to say that both methods A and B, each from its own perspective, are correct.

Case 5 shows that the prediction for the latent child is equal for both methods when constraints are presented for both parents, that is, when their genotype distributions are fully defined. In the biological model for monogenic biallelic autosomal recessive diseases, when one parent is a carrier and the other parent is known to be homozygous and not affected (not a carrier), their child will always be either a carrier or also homozygous, with equal probability. Since methods A and B reach the same result, both are correct in case 5.

Cases 6, 7 and 8 contain situations in which information is expected to flow upwards in the family graph. That is, information is given for a child and we want it to affect the genetic distributions of the parents. These situations are what justify the existence of an upwards propagation model, as well as separate methods A and B. In case 6, the effect of constraint propagation is noticeable. As we can see, since penetrance is complete and the child is affected, the parents cannot be free of the mutated allele; otherwise, it would not have reached the child. Therefore, according to method A, they are most likely to be carriers. Method B, on the other hand, estimates that the most likely scenario, given no information about the parents, is that both are very likely to also be affected. We know that this is an exaggerated estimation. Nonetheless, it is indeed the most likely scenario if a child is affected and we know nothing about the parents. Case 8 yields a similar result. Case 7, in contrast, estimates nearly a 25%/50%/25% distribution, since the child is asserted to be a carrier of the disease. That is, their genotype is Aa.
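The effect of an upwards constraint in case 6 can be illustrated with a simplified filter: parental genotypes that could not have produced the child's known genotype are set to zero and the remainder is renormalized. This is only a sketch under my own simplifying assumptions (a single child, a fully known child genotype, hypothetical function names); it is not claimed to be the constraint propagation algorithm of this work, which is described in the Methods section and may treat other cases differently.

```python
GENOTYPES = ("AA", "Aa", "aa")


def constrain_parent(prior, child_genotype):
    """Drop parental genotypes incompatible with a known child genotype.

    `prior` is an (AA, Aa, aa) distribution for one parent.
    A homozygous child needs the same allele from both parents, so a parent
    lacking that allele is excluded. A heterozygous child excludes no single
    parental genotype, because either parent may have supplied either allele.
    """
    alleles = set(child_genotype)
    if len(alleles) == 1:
        needed = next(iter(alleles))
        kept = [p if needed in g else 0.0 for g, p in zip(GENOTYPES, prior)]
    else:
        kept = list(prior)
    total = sum(kept)
    return tuple(k / total for k in kept)


# Case 6 of Table 4: latent parent, child known to be aa.
parent = constrain_parent((0.969, 0.03, 0.001), "aa")
print([round(p, 2) for p in parent])  # [0.0, 0.97, 0.03]
```

For case 6, the filtered prior matches the method A values listed for the parents in Table 4: the AA genotype is excluded and the carrier genotype dominates.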

In Table 5, calculations for 60% penetrance in monogenic biallelic autosomal recessive diseases are included. Its columns have the same meaning as in the previously described full penetrance table. However, predicted genotypes and predicted phenotypes are no longer equal.

In the first case, latent individuals have the same genotype probabilities as in any other case, since no phenotypes constrain the distributions of the persons in the family. The predicted phenotypes are derived from the predicted genotype distributions and the penetrance factor.
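The relation between predicted genotypes and predicted phenotypes can be sketched as follows. The mapping below is my reading of the tabulated values rather than a formula quoted from the Methods section: an individual who carries a disease-causing genotype but does not express it is counted as a carrier. It assumes that a is the disease-causing allele in the recessive tables and A in the dominant tables; the function name is hypothetical.

```python
def phenotype_distribution(genotype_dist, penetrance, mode):
    """Map (AA, Aa, aa) probabilities to (not affected, carrier, affected)."""
    p_AA, p_Aa, p_aa = genotype_dist
    if mode == "recessive":
        # Only aa can express the trait, and does so with probability `penetrance`.
        affected = penetrance * p_aa
        carrier = p_Aa + (1 - penetrance) * p_aa
        not_affected = p_AA
    elif mode == "dominant":
        # AA and Aa can express the trait; non-expressing ones count as carriers.
        at_risk = p_AA + p_Aa
        affected = penetrance * at_risk
        carrier = (1 - penetrance) * at_risk
        not_affected = p_aa
    else:
        raise ValueError(f"unknown inheritance mode: {mode!r}")
    return (not_affected, carrier, affected)


# Method B genotypes of the latent father in case 1 of Table 5 (recessive, 60%).
rec = phenotype_distribution((0.21, 0.59, 0.20), 0.6, "recessive")
print([round(p, 2) for p in rec])  # [0.21, 0.67, 0.12]

# Method B genotypes of the latent father in case 1 of Table 7 (dominant, 60%).
dom = phenotype_distribution((0.20, 0.59, 0.21), 0.6, "dominant")
print([round(p, 2) for p in dom])  # [0.21, 0.32, 0.47]
```

With penetrance set to 1, the same function returns the genotype distribution unchanged in the recessive case and an all-zero carrier column in the dominant case, as in Tables 4 and 6.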

Case 2 contains a phenotype which was not shown in the full penetrance scenario, although it is no less important for full penetrance disease cases. Essentially, if a person is annotated as possibly carrier, it means that the person is either not affected (or not known to be so, as discussed previously) or a carrier. In this work, I consider this pair of possibilities as a 50%/50% ratio of probabilities. In the case of 60% penetrance, since there is a probability that the person in question is aa and still does not present the characteristic, the distribution is shifted accordingly to represent the scenario correctly. As in the full penetrance results, method A predicts the child given the parents, which shifts the child towards a higher probability of being a carrier, although not affected, since the mother is assumed latent. Also similarly to the full penetrance case, method B predicts the parents based on the children so as to present all the cases with as uniform a distribution as possible. The phenotype distributions are calculated from the obtained results, considering, as described earlier, the penetrance factor and the genotype distribution. As we can see, the probabilities of actually being affected are slightly decreased, as expected in incomplete penetrance cases. A similar scenario is observed in case 3.
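The shift of the possibly carrier distribution under incomplete penetrance can be made concrete for the recessive case. The weighting below is an assumption on my part: it splits the carrier half between Aa and non-expressing aa in the ratio 1 : (1 - penetrance). It is shown because it reproduces the initial values 0.5/0.36/0.14 listed in Table 5, not because it is stated to be the rule used by the implementation.

```python
def possibly_carrier_prior(penetrance):
    """Initial (AA, Aa, aa) distribution for a 'possibly carrier' annotation.

    Assumed weighting for a recessive trait: half of the mass goes to
    'not affected' (AA); the other half is shared between Aa and the
    aa individuals who do not express the trait.
    """
    not_affected_share = carrier_share = 0.5
    weight_Aa = 1.0
    weight_aa = 1.0 - penetrance  # aa without symptoms
    total = weight_Aa + weight_aa
    return (
        not_affected_share,
        carrier_share * weight_Aa / total,
        carrier_share * weight_aa / total,
    )


prior = possibly_carrier_prior(0.6)
print([round(p, 2) for p in prior])  # [0.5, 0.36, 0.14]
```

At full penetrance the same function returns an even split between AA and Aa, which is the 50%/50% ratio described above.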


Case 4, on the other hand, is composed of two latent parents and a child who is asserted to be possibly carrier. As in the previous cases, method A does not make any changes to the initial genotypes of the parents, therefore not providing any new information about the latent parents, even though it is known that their child is possibly carrier. Method B, however, searches for the most likely scenario, with the most uniform genotype distribution for the child. Of course, a child being possibly carrier is still not sufficient information to pinpoint the genotype distributions of the parents. Nonetheless, methods A and B provide results that give a better view of the possible scenarios in a family.

In Tables 6 and 7, results for full penetrance and incomplete penetrance (60%) are included for monogenic biallelic autosomal dominant diseases. As discussed before, dominant diseases are the ones observable in the phenotype of an individual when at least one allele is disease-causing (“mutated”). If both are disease-causing, the symptoms tend to be more pronounced [2].

As far as full penetrance is concerned, cases 1, 2 and 3 of Table 6 are examples. As before, methods A and B perform differently but in a complementary manner. A peculiarity of autosomal dominant diseases with full penetrance is that no individual can be a carrier without being affected (hence the carrier column, which is composed entirely of zeros). In case 1, all individuals are latent. Method A predicts no changes, and method B predicts as uniform a distribution as possible. Case 2 is similar, except that the father is known not to be affected. Hence, since the mother is very unlikely to be affected, method A predicts that the child is 98% likely to also be unaffected and 2% likely to be affected. Method B, on the other hand, finds its balance in the child being 50% likely to be Aa and 50% likely to be aa, since, beyond eliminating the possibility of the child being AA, no better approximation can be made.

For the more general cases of incomplete penetrance, case 1 of Table 7 is a fully latent example. As we can see, the calculated genotype distribution is the same as for complete penetrance. The predicted phenotype, however, is quite different. Method A, as usual, estimates genotypes no different from the initial ones. Method B again searches for the most uniform distribution (considering the Aa/aA imbalance previously mentioned). Case 2 is very similar to other previously discussed cases. Case 3, however, is an interesting example of how the balance within the probabilities can be important for the downwards prediction. In case 3, the father is affected, and the mother is latent. However, the father is much more likely to be Aa than AA, as observable in the population (allele frequency) [2]. The consequence is that the child is slightly more likely to be heterozygous than homozygous for the non-disease-causing allele, and only 1% likely to be homozygous for the disease-causing allele.


Case  Individual  Phenotype         Initial genotype      Predicted genotype (A/B)              Predicted phenotype (A/B)
                                    AA     Aa    aa       AA          Aa          aa            not affected  carrier     affected
1     father      latent            0.969  0.03  0.001    0.97/0.21   0.03/0.59   0.001/0.20    0.97/0.21     0.03/0.59   0.001/0.20
      mother      latent            0.969  0.03  0.001    0.97/0.21   0.03/0.59   0.001/0.20    0.97/0.21     0.03/0.59   0.001/0.20
      child       latent            0.969  0.03  0.001    0.97/0.25   0.03/0.50   0.000/0.25    0.97/0.25     0.03/0.50   0.000/0.25
2     father      not affected      1      0     0        1/1         0/0         0/0           1/1           0/0         0/0
      mother      latent            0.969  0.03  0.001    0.97/0.20   0.03/0.59   0.001/0.2     0.97/0.20     0.03/0.59   0.001/0.2
      child       latent            0.969  0.03  0.001    0.98/0.50   0.02/0.50   0/0           0.98/0.50     0.02/0.50   0/0
3     father      affected          0      0     1        0           0           1             0             0           1
      mother      latent            0.969  0.03  0.001    0.97/0.20   0.03/0.60   0.001/0.20    0.97/0.20     0.03/0.60   0.001/0.20
      child       latent            0.969  0.03  0.001    0/0         0.98/0.50   0.02/0.50     0/0           0.98/0.50   0.02/0.50
4     father      not affected      1      0     0        1/1         0/0         0/0           1/1           0/0         0/0
      mother      affected          0      0     1        0/0         0/0         1/1           0/0           0/0         1/1
      child       latent            0.969  0.03  0.01     0/0         1/1         0/0           0/0           1/1         0/0
5     father      not affected      1      0     0        1/1         0/0         0/0           1/1           0/0         0/0
      mother      carrier           0      1     0        0/0         1/1         0/0           0/0           1/1         0/0
      child       latent            0.969  0.03  0.001    0.50/0.50   0.50/0.50   0/0           0.50/0.50     0.50/0.50   0/0
6     father      latent            0.969  0.03  0.001    0.00/0.00   0.97/0.01   0.03/0.99     0.00/0.00     0.97/0.01   0.03/0.99
      mother      latent            0.969  0.03  0.001    0.00/0.00   0.97/0.01   0.03/0.99     0.00/0.00     0.97/0.01   0.03/0.99
      child       affected          0      0     1        0/0         0/0         1/1           0/0           0/0         1/1
7     father      latent            0.969  0.03  0.001    0.97/0.23   0.03/0.55   0.001/0.22    0.97/0.23     0.03/0.55   0.001/0.22
      mother      latent            0.969  0.03  0.001    0.97/0.23   0.03/0.55   0.001/0.22    0.97/0.23     0.03/0.55   0.001/0.22
      child       carrier           0      1     0        0/0         1/1         0/0           0/0           1/1         0/0
8     father      latent            0.969  0.03  0.001    0.97/1.00   0.03/0.00   0.001/0.00    0.97/1.00     0.03/0.00   0.001/0.00
      mother      latent            0.969  0.03  0.001    0.97/1.00   0.03/0.00   0.001/0.00    0.97/1.00     0.03/0.00   0.001/0.00
      child       not affected      1      0     0        1/1         0/0         0/0           1/1           0/0         0/0

Table 4: Examples of executions for a monogenic biallelic autosomal recessive disease with full penetrance.


Case  Individual  Phenotype         Initial genotype      Predicted genotype (A/B)              Predicted phenotype (A/B)
                                    AA     Aa    aa       AA          Aa          aa            not affected  carrier     affected
1     father      latent            0.969  0.03  0.01     0.97/0.21   0.03/0.59   0.001/0.20    0.97/0.21     0.03/0.67   0.001/0.12
      mother      latent            0.969  0.03  0.01     0.97/0.21   0.03/0.59   0.001/0.20    0.97/0.21     0.03/0.67   0.001/0.12
      child       latent            0.969  0.03  0.01     0.97/0.26   0.03/0.50   0.00/0.24     0.97/0.26     0.03/0.60   0.00/0.15
2     father      possibly carrier  0.5    0.36  0.14     0.50/0.25   0.36/0.45   0.14/0.30     0.50/0.25     0.41/0.57   0.09/0.18
      mother      latent            0.969  0.03  0.01     0.97/0.23   0.03/0.58   0.00/0.18     0.97/0.23     0.03/0.65   0.00/0.11
      child       latent            0.969  0.03  0.01     0.67/0.26   0.32/0.50   0.01/0.24     0.67/0.26     0.33/0.60   0.00/0.14
3     father      affected          0      0     1        0/0         0/0         1/1           0/0           0/0         1/1
      mother      latent            0.969  0.03  0.01     0.97/0.21   0.03/0.59   0.001/0.20    0.97/0.21     0.03/0.67   0.00/0.12
      child       latent            0.969  0.03  0.01     0.00/0.00   0.98/0.50   0.02/0.50     0.00/0.00     0.99/0.70   0.01/0.30
4     father      latent            0.969  0.03  0.01     0.97/0.28   0.03/0.47   0.00/0.25     0.97/0.28     0.03/0.57   0.00/0.15
      mother      latent            0.969  0.03  0.01     0.97/0.28   0.03/0.47   0.00/0.25     0.97/0.28     0.03/0.57   0.00/0.15
      child       possibly carrier  0.5    0.36  0.14     0.97/0.27   0.03/0.50   0.00/0.23     0.97/0.27     0.03/0.59   0.00/0.14

Table 5: Examples of executions for a monogenic biallelic autosomal recessive disease with penetrance 60%.


Case  Individual  Phenotype         Initial genotype      Predicted genotype (A/B)              Predicted phenotype (A/B)
                                    AA     Aa    aa       AA          Aa          aa            not affected  carrier     affected
1     father      latent            0.001  0.03  0.969    0.001/0.20  0.03/0.59   0.97/0.21     0.97/0.21     0.00/0.00   0.03/0.79
      mother      latent            0.001  0.03  0.969    0.001/0.20  0.03/0.59   0.97/0.21     0.97/0.21     0.00/0.00   0.03/0.79
      child       latent            0.001  0.03  0.969    0.00/0.24   0.03/0.50   0.97/0.26     0.97/0.26     0.00/0.00   0.03/0.74
2     father      not affected      0      0     1        0/0         0/0         1/1           1/1           0/0         0/0
      mother      latent            0.001  0.03  0.969    0.001/0.20  0.03/0.59   0.97/0.21     0.97/0.21     0.00/0.00   0.03/0.79
      child       latent            0.001  0.03  0.969    0.00/0.00   0.02/0.50   0.98/0.50     0.98/0.50     0.00/0.00   0.02/0.50
3     father      affected          0.03   0.97  0        0.03/0.08   0.97/0.92   0/0           0.00/0.00     0.00/0.00   1.00/1.00
      mother      latent            0.001  0.03  0.969    0.001/0.18  0.03/0.57   0.97/0.24     0.97/0.23     0.00/0.00   0.03/0.77
      child       latent            0.001  0.03  0.969    0.01/0.26   0.51/0.50   0.47/0.24     0.48/0.24     0.00/0.00   0.52/0.76

Table 6: Examples of executions for a monogenic biallelic autosomal dominant disease with full penetrance.

Case  Individual  Phenotype         Initial genotype      Predicted genotype (A/B)              Predicted phenotype (A/B)
                                    AA     Aa    aa       AA          Aa          aa            not affected  carrier     affected
1     father      latent            0.001  0.03  0.969    0.00/0.20   0.03/0.59   0.97/0.21     0.97/0.21     0.01/0.32   0.02/0.47
      mother      latent            0.001  0.03  0.969    0.00/0.20   0.03/0.59   0.97/0.21     0.97/0.21     0.01/0.32   0.02/0.47
      child       latent            0.001  0.03  0.969    0.00/0.24   0.03/0.50   0.97/0.26     0.97/0.26     0.01/0.30   0.02/0.44
2     father      possibly carrier  0.22   0.22  0.56     0.22/0.39   0.22/0.28   0.56/0.33     0.56/0.33     0.18/0.27   0.27/0.40
      mother      latent            0.001  0.03  0.969    0.00/0.19   0.03/0.58   0.97/0.22     0.97/0.22     0.01/0.31   0.02/0.47
      child       latent            0.001  0.03  0.969    0.01/0.25   0.34/0.50   0.66/0.25     0.66/0.25     0.14/0.30   0.21/0.45
3     father      affected          0.03   0.97  0        0.03/0.08   0.97/0.92   0/0           0/0           0/0         1/1
      mother      latent            0.001  0.03  0.969    0.00/0.19   0.03/0.58   0.97/0.23     0.97/0.23     0.01/0.31   0.02/0.46
      child       latent            0.001  0.03  0.969    0.01/0.26   0.51/0.50   0.47/0.24     0.48/0.24     0.21/0.30   0.31/0.46

Table 7: Examples of executions for a monogenic biallelic autosomal dominant disease with penetrance 60%.


6.2 More complex cases

The objectives of this work include devising a method which can help identify the individuals who may add most information to the whole family. In this work, this is done by analyzing the difference between two predictive methods for each individual. As a side effect, eliminating the information of some individuals indicates the efficacy of the method. In this section, I present the results obtained in three hypothetical pedigrees by making use of such a technique in more complex scenarios. Some authors call a pedigree complex when it contains loops. In this work, I use this term for non-triad families.

Case 1 The first of the more complex cases is that of a family composed of two parents and two children. It is very simple in terms of structure. However, for the sake of this work, demonstrating that the triads interact is of interest. As depicted in Figure 17a, individual C is affected by an autosomal recessive disease represented in grey, of which every other related individual is a carrier. In Figure 17b, I omit that D is a carrier. The calculations predict that her genotype probability distribution is, in fact, as uniform as possible, as shown earlier. Next, I remove the information present for individual A (Figure 17c). Since individual C is affected, A cannot be free of the disease-causing allele, that is, he must be either a carrier or affected, which is observed in his prediction. The probability that A is affected thus escalates to 56%, which also raises the probability that D is affected to almost 40%, according to method B. Figure 17d is the result of eliminating the information that the mother (B) is a carrier. As we can see, A is now less likely to be affected, while B is more likely to be so. This imbalance affects the whole family, indicating that latent information may have exceeded an acceptable level.

Case 2 The second case is composed of a large but simple pedigree. The simplicity here refers to the lack of loops in the family.

This case is represented in Figure 18. As we can see, individuals C and K are carriers of an autosomal recessive trait with full penetrance, G and R are affected by this trait, and D is not affected, although we do not know whether she is a carrier.

According to the predictions, however, individual D is sure to be a carrier, since G is affected and propagates a constraint that limits her to being a carrier or affected.

After omitting that C is a carrier and that D is not affected, we reach the situation depicted in Figure 19. As we can see, B is much more likely to be a carrier according to method B, which causes F to be more likely to be affected, and E to be less likely to be so.

Case 3 Lastly, the third case is a pedigree that can be considered complex in the sense most authors in the literature use. It comprises a family in which a relationship between two individuals of the same family exists. This family is represented in Figure 20.

Figure 17: Visualization of the progressive removal of information in case 1, in panels (a) to (d).

Figure 18: Initial situation of complex case 2.

Figure 19: Complex case 2 after the omission of some information.

Figure 20: Initial situation of complex case 3.

Automatically finding carriers is a capability that can be of interest in clinical practice. Finding them in many cases requires a genome analysis, which is expensive and sometimes of no practical use. In this case, I omit, one by one, all the carriers of this hypothetical family, so as to evaluate the behavior of the method.

In the first omission (Figure 21), since A is latent and B is affected, method A predicts that E is almost sure to be a carrier. Method B is more aggressive, estimating a 50% chance that E is in fact affected.

After the second omission (Figure 22), N is less defined, since her probability of being affected increases. A is also more likely to be affected.

The third omission (Figure 23) is that H is no longer registered as a carrier. Therefore, almost the whole family has their genotype distributions free, except B and O. However, not much changes in terms of phenotype distributions. Q, who is in gestation, was initially at a risk between 25% and 48% of being affected and is now at a risk between 26% and 52%, which is not a big change. This is due to O being an older sibling who is affected.

Figure 21: Complex case 3 after the omission that E is carrier.

Figure 22: Complex case 3 after the omission that K is carrier.

Figure 23: Complex case 3 after the omission that H is carrier.

In a fourth omission (Figure 24), we can simulate that O is not a sibling of Q just by eliminating the information that he is affected. A shift in the whole family is observed, and the likelihood that Q is affected decreases. However, with the amount of information available, although there is a risk of 2% that Q is affected, nothing can be asserted, since their distribution, according to method B, is too uniform, although leaning towards the affected side.


Figure 24: Complex case 3 after the omission that O is affected.


7 DISCUSSION

In this section, the context surrounding this work is considered. From one perspective, other methods aim at better precision in intricate and diverse inheritance scenarios. From another perspective, the lack of methods that aim at assisting the risk assessment of genetic diseases indicates a lack not of extremely precise tools, but of adequate tools in clinical practice.


In this work, I describe three algorithms that can be useful for propagating genetic information in family graphs, combining these algorithms in two groups (methods A and B). This provides an insight into the balance between available information and latent information. These groups were implemented as a module in genoDraw, and their results were analyzed in the previous section.

The basis of this work, as described in the Methods section, is the propagation of information in all directions, as well as the consideration that genotypes can have their characteristics fixed by constraints. Each of the three algorithms presented is independent from the rest, which enables flexibility in their combinations. Each combination is run in a separate Expectation Maximization algorithm, in which each individual can be influenced both by their parents and by their children by means of gradients influencing their genotype probability distributions. In a pedigree with more than one generation, a step-by-step propagation can thus be observed.

In terms of related works, there are, to my knowledge, no other contributions that share the specific purpose of this work, which is to provide medical professionals with a tool that facilitates their calculation tasks without being overly complex or cumbersome, that is, one which assists them in what is most tedious and error-prone in their daily routine. The method presented here is of little use unless the user knows what every estimation means, and what the intervals between estimations from methods A and B indicate in terms of the precision and assertiveness the professional is seeking.

Most of the works I refer to in Section 2.6 intend to be as precise as possible. In fact, they grew more and more complex throughout the 1980s and 1990s, seeking higher and higher precision via complex statistical tools. However, many of them lack the sensibility to include only parameters which can actually be managed in a clinical scenario. This renders an otherwise useful tool a hindrance in the daily routine of geneticists, and may be the reason for the remarkable lack of tools and platforms with purposes similar to what I propose and implement here. Works such as [31, 32, 33, 34, 35] explore the nature of genetics with extreme precision. The work in [37] does not even treat alleles as unique blocks of genetic information, but refers to shared sections of the genome with multiple purposes. Consanguinity is not analyzed in a global sense but as an analysis of identity of genes by descent. Of course, the precision achieved can be excellent given enough data. However, better still is sequencing, for every individual, the section of the genome responsible for a trait. It is definitely a more expensive procedure, but undoubtedly more precise. If precision is the intention, there are precise methods. However, for a tool to be successful in clinical practice, one of the many characteristics it must have is a balance between usefulness and precision.

In this work, usefulness is given the most weight. The fact that intervals between predicted genetic probability distributions (methods A and B) are presented instead of a pinpoint estimation is the clearest statement that precision is not the most valued characteristic of the method. However, the segregation process during meiosis is random enough to afford us the luxury of presenting the user with results that are certain not to be exact. This is why this work is developed using simple, easy-to-calculate propagation models. As the reader may observe, no mixed models or otherwise complex, parameter-heavy methods are used here. Instead, I apply as simple a process as possible, while extending it to allow for the propagation of probability distributions instead of most-likely situations. The only parameters of interest to the user are the mode of inheritance of the disease being analyzed, its penetrance, and the frequency of the mutated allele in the population. All of these parameters are easily obtained from public databases and/or simple statistical observations. If a disease does not present a clear mode of inheritance or is quantitative, the method presented here is clearly not adequate. However, having to deal with quantitative parameters in simple X-linked recessive diseases, for example, is not an advantage over current clinical practice. In this sense and in this specific scenario, a simpler tool is expected to be more effective and better accepted within its own scope than one that can be adapted to even the most complex situations but is excessively complex.
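To make the role of these three parameters concrete, the following Python sketch shows one textbook way in which they can enter a calculation for a single individual with no family information: the allele frequency gives a prior over genotypes, and the mode of inheritance and penetrance turn an observed phenotype into a posterior through Bayes' rule. This is an illustration under stated assumptions (an autosomal locus, Hardy-Weinberg proportions, no phenocopies), not the propagation method of this work; all function names are hypothetical.

```python
def hardy_weinberg_prior(q):
    """Genotype prior from the mutated-allele frequency q (Hardy-Weinberg assumed)."""
    p = 1.0 - q
    return {"AA": p * p, "Aa": 2.0 * p * q, "aa": q * q}


def affected_probability(genotype, mode, penetrance):
    """P(affected | genotype) for an autosomal trait; "a" is the mutated allele."""
    if mode == "dominant":
        at_risk = genotype in ("Aa", "aa")
    elif mode == "recessive":
        at_risk = genotype == "aa"
    else:
        raise ValueError(f"unsupported mode of inheritance: {mode}")
    # No phenocopies: genotypes that are not at risk are never affected (assumption).
    return penetrance if at_risk else 0.0


def genotype_posterior(prior, affected, mode, penetrance):
    """Bayes update of a genotype distribution given an observed phenotype."""
    unnormalized = {}
    for genotype, prior_prob in prior.items():
        p_aff = affected_probability(genotype, mode, penetrance)
        likelihood = p_aff if affected else 1.0 - p_aff
        unnormalized[genotype] = prior_prob * likelihood
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observed phenotype is impossible under these parameters")
    return {g: v / total for g, v in unnormalized.items()}
```

For instance, `genotype_posterior(hardy_weinberg_prior(0.01), affected=False, mode="recessive", penetrance=1.0)` assigns zero probability to `aa` and renormalizes the remaining mass over `AA` and `Aa`.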


8 CONCLUSIONS AND FUTURE LINES OF WORK

In this section, I bring this work to a close, while contemplating new possibilities and conceivable future efforts.


In this work, I present a method for genetic information propagation using graph mining techniques. It is based on three modes of propagation that operate in complementary manners. Each combination of these modes serves a different purpose, as observed in the Results section, and is thus useful in different family scenarios. Together, they are observed to indicate plausible genotype distributions and to draw attention to gaps in the information available about the family. The method I developed is centered on its applicability in clinical practice rather than solely on unmatched precision scores. It takes advantage of widely known and used parameters, such as modes of inheritance, as well as the slightly more complex penetrance and allele frequency variables. In all, what is presented here is a method that is novel in its central aim, which is to help genetic counselors in their daily routines. It does so by automatically executing mathematical operations and statistical interpretations, an activity that is currently performed by hand and is thus extremely prone to errors, and that is usually performed only partially, without considering all possibilities from all individuals.

The proposed method is now implemented in genoDraw. genoDraw is a complex PDS developed by me at the Biomedical Informatics Group, Technical University of Madrid, in collaboration with the Genetics and Inheritance Research Group of the 12 de Octubre Hospital, Madrid. The platform has already been reported in a conference paper presented in March of this year (Inforsalud, 2019) [1], of which I am the main author. The article is attached to this work as Appendix A. As commented in the introduction of this work, the method proposed in that paper addresses some of the needs of complex PDSs, such as the composition and layout of pedigrees and the model of interaction with them. In the present work, I build on that past work by implementing the proposed method as a risk assessment module in genoDraw.

Therefore, by being implemented in an actual PDS, the method can come to have a major impact on clinical practice by helping professionals more easily visualize the dynamics of genetic diseases in families. At the 12 de Octubre Hospital, this could mean savings in diagnosis time and professional effort, as well as in genome sequencing and other more advanced but sometimes unnecessary analyses, be it in the context of precision medicine or preventive medicine. In other medical centers, however, this could have much deeper and more far-reaching consequences. We currently live in a society decidedly marked by a harsh contrast between rich and poor regions. In many areas of the world, despite the existence of medical professionals (in many cases with top-notch academic training), resources are scarce. In such cases, genome analysis and the collection of detailed genotypic information tend not to be available to the general public. In these scenarios, the method described here can be most useful. Not only can it perform estimations based only on family structures and on the phenotypes of some individuals, it is also based on widely known information, such as the mode of inheritance of diseases. In this sense, it is compatible with the objective, purpose and reality of clinical procedures and resources.


The genetic information propagation techniques described here are not the last word in the field of statistical genetics in terms of modes of inheritance and related diseases, nor do they intend to be. Quite the contrary, they are no more than some of many possibilities. In the search for adequate ways of supporting the daily practice of genetic risk assessment, some tradeoffs are required. Not all information is always known, and some pieces may be less decisive than others. In the future, I intend to (a) explore more deeply the possibilities that Graph Mining offers for extracting information from highly latent data sources in genetics, in which the structure relating the entities is significantly decisive. In terms of genetics, some possibilities lie in expanding the methods described here to incorporate factors such as how likely a de novo mutation is to happen, or even to allow uniparental disomies to be predicted. These possibilities are vastly more complex to consider without turning the whole process into a hindrance for the genetics practitioner. More immediately feasible future efforts related to this work include (b) the deployment of the implementation described here, so that the Genetics and Inheritance team at the 12 de Octubre Hospital, in Madrid, is able not only to use the tool in their daily clinical practice but also to contribute valuable feedback. Another idea aligned with the deployment is (c) the design of a predictive algorithm capable of finding the most likely mode of inheritance for a genetic disease given a family and some phenotypes. Lastly, one more ambitious idea for future work is (d) to evaluate the association between modes of inheritance. In essence, this line of work would test the modes of inheritance currently assigned to genetic diseases against actual pedigrees.


9 REFERENCES

[1] L. Garcia-Giordano, S. Paraiso-Medina, R. Alonso-Calvo, F. J. Fernández-Martínez, and V. Maojo, “Genodraw: A tool to create pedigree diagrams based on biomedical terminologies and standards”, 2019.

[2] R. L. Bennett, The Practical Guide to the Genetic Family History. John Wiley and Sons, 2011, ISBN: 978-1-118-20981-3.

[3] S. Lee, “Why do we read many articles with bad statistics?: What does the new American Statistical Association’s statement on p-values mean?”, Korean Journal of Anesthesiology, vol. 69, no. 2, pp. 109–110, Apr. 2016, ISSN: 2005-6419. DOI: 10.4097/kjae.2016.69.2.109.

[4] S. J. Maglio and E. Polman, “Revising probability estimates: Why increasing likelihood means increasing impact.”, Journal of Personality and Social Psychology, vol. 111, no. 2, pp. 141–158, 2016, ISSN: 1939-1315, 0022-3514. DOI: 10.1037/pspa0000058.

[5] R. G. Resta, “The crane’s foot: The rise of the pedigree in human genetics”, Journal of Genetic Counseling, vol. 2, no. 4, pp. 235–260, 1993, ISSN: 1059-7700, 1573-3599. DOI: 10.1007/BF00961574.

[6] M. M. Weber, “Ernst Rüdin, 1874-1952: A German psychiatrist and geneticist”, American Journal of Medical Genetics, vol. 67, no. 4, pp. 323–331, 1996, ISSN: 0148-7299. DOI: 10.1002/(SICI)1096-8628(19960726)67:4<323::AID-AJMG2>3.0.CO;2-N.

[7] R. L. Bennett, K. A. Steinhaus, S. B. Uhrich, and C. O’Sullivan, “The need for developing standardized family pedigree nomenclature”, Journal of Genetic Counseling, vol. 2, no. 4, pp. 261–273, 1993, ISSN: 1573-3599. DOI: 10.1007/BF00961575.

[8] R. L. Bennett, K. A. Steinhaus, S. B. Uhrich, C. K. O’Sullivan, R. G. Resta, D. Lochner-Doyle, D. S. Markel, V. Vincent, and J. Hamanishi, “Recommendations for standardized human pedigree nomenclature. Pedigree Standardization Task Force of the National Society of Genetic Counselors.”, American Journal of Human Genetics, vol. 56, no. 3, pp. 745–752, 1995, ISSN: 0002-9297.

[9] S. Mazumdar and K. M. Das, “Immunocytochemical localization of vasoactive intestinal peptide and substance P in the colon from normal subjects and patients with inflammatory bowel disease.”, American Journal of Gastroenterology, vol. 87, no. 2, 1992.


[10] R. L. Bennett, K. S. French, R. G. Resta, and D. L. Doyle, “Standardized human pedigree nomenclature: Update and assessment of the recommendations of the National Society of Genetic Counselors”, Journal of Genetic Counseling, vol. 17, no. 5, pp. 424–433, 2008, ISSN: 1059-7700, 1573-3599. DOI: 10.1007/s10897-008-9169-9.

[11] E. L. Petsonk, C. Rose, and R. Cohen, “Coal mine dust lung disease. New lessons from an old exposure”, American Journal of Respiratory and Critical Care Medicine, vol. 187, no. 11, pp. 1178–1185, 2013, ISSN: 1073-449X. DOI: 10.1164/rccm.201301-0042CI.

[12] A. G. Efthymiou and A. M. Goate, “Late onset Alzheimer’s disease genetics implicates microglial pathways in disease risk”, Molecular Neurodegeneration, vol. 12, no. 1, p. 43, 2017, ISSN: 1750-1326. DOI: 10.1186/s13024-017-0184-x.

[13] E. H. Trager, R. Khanna, A. Marrs, L. Siden, K. E. H. Branham, A. Swaroop, and J. E. Richards, “Madeline 2.0 PDE: A new program for local and web-based pedigree drawing”, Bioinformatics, vol. 23, no. 14, pp. 1854–1856, 2007, ISSN: 1367-4803. DOI: 10.1093/bioinformatics/btm242.

[14] My Family Health Portrait - Centers for Disease Control and Prevention. [Online]. Available: https://phgkb.cdc.gov/FHH/html/index.html.

[15] Progeny Genetics. [Online]. Available: https://www.progenygenetics.com/.

[16] GenoPro. [Online]. Available: https://www.genopro.com/.

[17] CRA Health. [Online]. Available: https://www.crahealth.com/.

[18] C. Kelleher, B. Drohan, K. Hughes, and G. Grinstein, “Self organizing interactive pedigree diagrams”, 2011.

[19] “ANSI/NISO Z39.19-2005 (R2010) guidelines for the construction, format, and management of monolingual controlled vocabularies”, [Online]. Available: https://www.niso.org/publications/ansiniso-z3919-2005-r2010.

[20] O. Bodenreider, “The Unified Medical Language System (UMLS): Integrating biomedical terminology”, Nucleic Acids Research, vol. 32, no. Database issue, pp. D267–270, 2004, ISSN: 1362-4962. DOI: 10.1093/nar/gkh061.

[21] K. Donnelly, “SNOMED-CT: The advanced terminology and coding system for eHealth.”, Studies in Health Technology and Informatics, vol. 121, pp. 279–290, 2006, ISSN: 0926-9630.


[22] A. Hamosh, A. F. Scott, J. S. Amberger, C. A. Bocchini, and V. A. McKusick, “Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders”, Nucleic Acids Research, vol. 33, pp. D514–D517, 2005, ISSN: 0305-1048. DOI: 10.1093/nar/gki033.

[23] S. Köhler, L. Carmody, N. Vasilevsky, J. O. B. Jacobsen, D. Danis, J.-P. Gourdine, M. Gargano, N. L. Harris, N. Matentzoglu, J. A. McMurry, et al., “Expansion of the Human Phenotype Ontology (HPO) knowledge base and resources”, Nucleic Acids Research, vol. 47, no. D1, pp. D1018–D1027, 2019, ISSN: 0305-1048. DOI: 10.1093/nar/gky1105.

[24] R. C. Elston and J. Stewart, “A new test of association for continuous variables”, Biometrics, vol. 26, no. 2, pp. 305–314, 1970, ISSN: 0006-341X. DOI: 10.2307/2529077.

[25] C. Cannings, E. A. Thompson, and M. H. Skolnick, “Probability functions on complex pedigrees”, Advances in Applied Probability, vol. 10, no. 01, pp. 26–61, 1978, ISSN: 0001-8678, 1475-6064. DOI: 10.2307/1426718.

[26] J. Ott, “Maximum likelihood estimation by counting methods under polygenic and mixed models in human pedigrees.”, American Journal of Human Genetics, vol. 31, no. 2, pp. 161–175, 1979, ISSN: 0002-9297.

[27] K. Lange and M. Boehnke, “Extensions to pedigree analysis”, 1983.

[28] K. P. Donnelly, “The probability that related individuals share some section of genome identical by descent”, Theoretical Population Biology, vol. 23, no. 1, pp. 34–63, 1983, ISSN: 0040-5809. DOI: 10.1016/0040-5809(83)90004-7.

[29] N. A. Sheehan, “Image processing procedures applied to the estimation of genotypes on pedigrees”, p. 56.

[30] E. A. Thompson and E. M. Wijsman, “The Gibbs sampler on extended pedigrees: Monte Carlo methods for the genetic analysis of complex traits”, p. 31.

[31] E. A. Thompson and S. W. Guo, “Evaluation of likelihood ratios for complex genetic models”, Mathematical Medicine and Biology: A Journal of the IMA, vol. 8, no. 3, pp. 149–169, 1991, ISSN: 1477-8599. DOI: 10.1093/imammb/8.3.149.

[32] S. W. Guo and E. A. Thompson, “A Monte Carlo method for combined segregation and linkage analysis.”, American Journal of Human Genetics, vol. 51, no. 5, pp. 1111–1126, 1992, ISSN: 0002-9297.

[33] S. W. Guo and E. A. Thompson, “Monte Carlo estimation of mixed models for large complex pedigrees”, Biometrics, vol. 50, no. 2, p. 417, 1994, ISSN: 0006341X. DOI: 10.2307/2533385.


[34] C. Stricker, R. L. Fernando, and R. C. Elston, “An algorithm to approximate the likelihood for pedigree data with loops by cutting”, Theoretical and Applied Genetics, vol. 91–91, no. 6–7, pp. 1054–1063, 1995, ISSN: 0040-5752, 1432-2242. DOI: 10.1007/BF00223919.

[35] E. A. Thompson, “Statistical inference from genetic data on pedigrees”, NSF-CBMS Regional Conference Series in Probability and Statistics, 2000. [Online]. Available: http://www.jstor.org/stable/4153187.

[36] X. Li, “Haplotype inference from pedigree data and population data”, PhD thesis, Case Western Reserve University, 2010. [Online]. Available: https://etd.ohiolink.edu/pg_10?::NO:10:P10_ETD_SUBID:52101.

[37] D. E. A. Thompson, “Identity by descent in pedigrees and populations; methods for genome-wide linkage and association. une short course: Feb 14-18, 201”, p. 99, 2011.

[38] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, Google-Books-ID: omivDQAAQBAJ, ISBN: 978-0-262-33737-3.

[39] P. Goyal and E. Ferrara, “Graph embedding techniques, applications, and performance: A survey”, Knowledge-Based Systems, vol. 151, pp. 78–94, 2018, ISSN: 0950-7051. DOI: 10.1016/j.knosys.2018.03.022.

[40] A. Narayanan, M. Chandramohan, R. Venkatesan, L. Chen, Y. Liu, and S. Jaiswal, “Graph2vec: Learning distributed representations of graphs”, arXiv:1707.05005 [cs], 2017, arXiv: 1707.05005. [Online]. Available: http://arxiv.org/abs/1707.05005.

[41] A. Grover and J. Leskovec, “Node2vec: Scalable feature learning for networks”, in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’16, event-place: San Francisco, California, USA, ACM, 2016, pp. 855–864, ISBN: 978-1-4503-4232-2. DOI: 10.1145/2939672.2939754. [Online]. Available: http://doi.acm.org/10.1145/2939672.2939754.

[42] S. Köhler, M. H. Schulz, P. Krawitz, S. Bauer, S. Dölken, C. E. Ott, C. Mundlos, D. Horn, S. Mundlos, and P. N. Robinson, “Clinical diagnostics in human genetics with semantic similarity searches in ontologies”, American Journal of Human Genetics, vol. 85, no. 4, pp. 457–464, 2009, ISSN: 1537-6605. DOI: 10.1016/j.ajhg.2009.09.003.

[43] L. Garcia-Giordano, S. Paraiso-Medina, R. Alonso-Calvo, F. J. Fernández-Martínez, and V. Maojo, “Genodraw: A web tool for developing pedigree diagrams using the standardized human pedigree nomenclature integrated with biomedical vocabularies”, Submitted, N.D.

[44] G. Gallo, G. Longo, S. Nguyen, and S. Pallottino, Directed Hypergraphs And Applications. 1992.

[45] T. Dwyer, Y. Koren, and K. Marriott, “IPSep-CoLa: An incremental procedure for separation constraint layout of graphs”, IEEE Transactions on Visualization and Computer Graphics, vol. 12, no. 5, pp. 821–828, 2006, ISSN: 1077-2626. DOI: 10.1109/TVCG.2006.156.

[46] M. Bostock, V. Ogievetsky, and J. Heer, “D3 data-driven documents”, IEEE Transactions on Visualization and Computer Graphics, vol. 17, no. 12, pp. 2301–2309, 2011, ISSN: 1077-2626. DOI: 10.1109/TVCG.2011.185.


A GENODRAW: A TOOL TO CREATE PEDIGREE DIAGRAMS BASED ON BIOMEDICAL TERMINOLOGIES AND STANDARDS


Madrid, March 21, 22 and 23 Infors@lud2017 Infors@lud2019

GENODRAW: A TOOL TO CREATE PEDIGREE DIAGRAMS BASED ON BIOMEDICAL TERMINOLOGIES AND STANDARDS

L. GARCÍA-GIORDANO, S. PARAISO-MEDINA, R. ALONSO-CALVO, F. J. FERNÁNDEZ MARTÍNEZ, V. MAOJO

Abstract

The need for integrating genomic data into daily clinical practice raises the demand for tools capable of representing individuals’ data and also their biological relationships with other individuals. In this work, we introduce genoDraw, a new platform for creating and managing pedigree diagrams following biomedical standards. The proposed work focuses on five critical aspects for the adoption of this platform in clinical practice, namely: data-drivenness, automation, interactivity, comprehensiveness and compatibility with widely adopted biomedical vocabularies for the annotation of traits and characteristics. We present a novel process for generating pedigree diagrams from individual data. This process generates pedigree diagrams that comply with the pedigree nomenclature used as a de facto standard in the area. We implemented the system as a web platform to ensure complete compatibility. We also performed an evaluation process, which included usability tests, and the results show a promising adequacy for use in clinical practice.

Keywords: Data visualization, Biomedical vocabularies, Genetics

Introduction

The current usage of genomic data in clinical practice indicates a demand for tools to represent individuals’ data and their relations. This demand also indicates a need for such tools to have characteristics such as data-drivenness, automation, interactivity, comprehensiveness, or compatibility with standard biomedical vocabularies. To our knowledge, there are currently no informatics tools addressing all of these aspects. In this work, we present genoDraw, a new interactive and user-friendly system that aims to address all the characteristics previously mentioned. It is capable of following the guidelines established by the newest revised version of the pedigree nomenclature [1]. The system is (a) comprehensive enough to represent all the major scenarios, (b) capable of automating the creation of the genogram, (c) interactive, (d) data-driven and (e) compatible with the annotation of characteristics of each individual as terms from standard biomedical vocabularies, such as the Human Phenotype Ontology (HPO) [2].

Methods

GenoDraw aims to address the required characteristics enumerated in the previous section from (a) to (e). Firstly, to provide comprehensiveness (a), we adopted a widely used visual nomenclature [1][3]. This nomenclature is capable of clearly representing all kinds of clinical heritage scenarios, not only the traditional family relations but also major non-traditional reproductive scenarios, such as ovum donations, adoptions, sperm donations, surrogate gestations and planned adoptions, among others. We represent the pedigree diagram as a graph. The nodes of the graph are the entities of a genogram, which are positioned according to an optimization engine. This form of representation allows us to achieve characteristics (b-e), as discussed in the pedigree graph creation process explained below.

The graph and its positioning constraints are generated by an automatic process. Initially, the system contains information about the individuals, their relations and characteristics, including their traits as terms from biomedical vocabularies. The process generates a correct genogram from the data of individuals and their relations (data-drivenness) in a wide variety of scenarios (comprehensiveness). As a visual example of this process, consider a family comprising a man (A) and a woman (B) who have a relationship and a female child still being gestated (C), in which both the mother and the child have a certain trait (grey). Figure 1 depicts the result of the genogram generation process for this set of data. Notice that all the traits and characteristics are represented (each individual as the corresponding shape, the ‘P’ symbol that indicates that the individual is still being gestated, and the annotated characteristics, in this case the grey marks).

While the nodes and links of the genogram are determined following pre-made rules, its structure is determined by an optimization process, which defines the positions of the nodes on the canvas. This optimization process takes advantage of the alignments required by the nomenclature to define linear restrictions. An example of the result of this process is shown in Figure 2.

The interactivity is achieved by facilitating the manipulation of the positions of the nodes in the represented graph and the input of new information by allowing direct interaction with the drawn entities. The manipulation of positions is achieved by moving any node of the representation to a new desired position. However, this could cause the diagram to be incorrect according to the features of the nomenclature. To solve this possible issue, the graph as a whole is then repositioned by the convergence process of the optimization engine, which finds a new layout that complies with both the desire of the user and the nomenclature.
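A minimal sketch of how such a graph could be assembled from individual data is given below, in Python for brevity. It is an illustration only and not the genoDraw implementation: the `Genogram` container, the `union:` node naming and the `same_y` / `below` constraint labels are hypothetical, and the actual constraint format expected by the optimization engine is not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Genogram:
    nodes: list = field(default_factory=list)        # person and relationship node ids
    links: list = field(default_factory=list)        # (source, target) pairs
    constraints: list = field(default_factory=list)  # (kind, node_a, node_b) triples


def build_genogram(people, partnerships, children):
    """Build nodes, links and alignment constraints from plain family data.

    people: iterable of person ids
    partnerships: list of (partner_1, partner_2) tuples
    children: dict mapping a child id to its (parent_1, parent_2) tuple
    """
    genogram = Genogram(nodes=list(people))
    union_of = {}
    for a, b in partnerships:
        union = f"union:{a}+{b}"  # relationship node placed between the partners
        union_of[frozenset((a, b))] = union
        genogram.nodes.append(union)
        genogram.links += [(a, union), (b, union)]
        # Partners are drawn on the same horizontal line.
        genogram.constraints.append(("same_y", a, b))
    for child, parents in children.items():
        union = union_of[frozenset(parents)]
        genogram.links.append((union, child))
        # Children are drawn below the relationship of their parents.
        genogram.constraints.append(("below", union, child))
    return genogram
```

For the family used as an example above, `build_genogram(["A", "B", "C"], [("A", "B")], {"C": ("A", "B")})` produces one relationship node, three links and two alignment constraints, which a layout engine could then translate into linear restrictions on node coordinates.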


XX Congreso Nacional de Informática de la Salud Infors@lud2019

The input of new information is achieved by the use of context menus in each of the represented nodes, as well as by other input methods. For example, for a child to be added to a certain person, the user might click on the relationship between the two biological parents and then choose the item on the context menu that corresponds to adding a child. Further information about this child, such as name, gender or traits as terms from biomedical vocabularies, can then be inserted using a sidebar menu that is shown by clicking the corresponding node.

Results

The presented pedigree diagram drawing system was implemented as a web platform that allows for intuitive creation and display of genograms. The implementation of the platform was done using common web technologies, such as JavaScript, and the representation engine was based on WebCola (https://ialab.it.monash.edu/webcola/), a graph visualization tool derived from the Force module of D3.js (https://d3js.org/). This was achieved by adapting the rules deduced from the nomenclature to the engine chosen for displaying the graph using linear constraints. From an interaction perspective, the user is able to build a genogram step by step only by inserting people and/or creating relations between them. As shown in Figure 3, starting from a blank representation, a user may: (i) Add the parents: for this purpose, add a person called A, a man who has albinism (shown in grey), which is a term from HPO. Then, the user is able to set a partner for person A, which creates another person B and the relationship between them. (ii) Add biological children: thus, another person C may be added, having persons A and B as biological father and mother, respectively. Following the same steps, a child D can be added. Child C is also observed to have albinism, which is added through a sidebar menu for the affected individual, using the same term as for person A to describe this characteristic.

Ultimately, the user is able to construct the family step by step, and the only information inserted is that of each individual (name and conditions) and of the relations (biological father, biological mother, and partner are the ones used in the example).

Discussion

In this work, we describe the main elements of genoDraw, a new system that enables the user to create and edit genograms not only in a highly interactive manner, but also in a way that encourages them to follow the chosen nomenclature, which is widely accepted as a visual guideline for drawing pedigree diagrams. The first of these characteristics, interactiveness, has been continuously gaining relevance since the advent of touchscreen-equipped devices and powerful graphics processors, especially in mobile devices. The most recent pedigree drawing tools date back to the beginning of the massive rise of such devices, which explains their noticeable lack of interactiveness, a feature of tremendous utility nowadays. To test the use of our system, we devised and carried out a usability test. The test consisted of generating, without prior use of the platform, two pedigree diagrams, each from a real-world situation described as text. The first diagram was a simple but large representation, intended to assess the familiarization of the users at a low complexity level.

The second diagram consisted of a family in which many children were born from surrogate gestations or ovum donors, or were adopted. In both cases, participants were asked to record traits and characteristics as specific terms from standard vocabularies. The test was carried out with a group of individuals with expertise in the biomedical domain who had previously been trained in the standard nomenclature. The results show that the platform offers an adequate set of functionalities and characteristics for its purpose, and is, therefore, suitable for use in clinical practice. The second characteristic, which relates to the correctness of the representations, is a common feature among current and old systems. None of them follows the revised version of the nomenclature that we chose, since they instead follow their own specifications of the pedigree diagram. In terms of comprehensiveness, our system complies with all the specific scenarios proposed by the nomenclature. Since the nomenclature is very comprehensive by itself, genoDraw is thus very comprehensive in terms of the diversity of situations it can represent. We decided not to include some features that can become useful when analyzing psychological elements of a family, such as affinity among individuals. Therefore, regarding this specific issue, other tools are more complete than ours.

In terms of data-drivenness, some reported systems implement some kind of generation of diagrams from data, but the data tends to be translated into the diagram upon insertion by the user, and the information itself is then lost. Using our approach, we managed to collect all the inserted data and store it as the information of the genogram itself.
The genera-tion of the pedigree diagram is then performed from the data stored and the output is a correct genogram.Finally, regarding the storage of traits and characteristics as terms of standard biomedical vocabularies, to our knowledge, no other pedigree drawing system reported in the literature discusses the implementation of this feature. This is undou-bt-edly a step towards the integration of this system with the electronic health records of each individual that might be re-presented in a genogram.Some limitations for creating a visual representation of fre-quent situations have been revealed. In fact, these limita-tions were previously commented in the pedigree diagram drawing literature [4]. One of them is the impossibility of drawing more than two relationships simultaneously for the same person. This can be addressed by allowing the user to hide relation-ships at their will without necessarily removing them from the data of the genogram. Another limitation is the impossibility of representing relationships between people from different generations in the same family. This is solved by creating new structure constraints which enable more fle-xibility of the rep-resented pedigree diagram.ConclusionsThis work explores the capabilities of the representation of ge-nograms by an automated interactive tool. By following the process mentioned in the methods section, we developed a system that complies with the proposed characteristics: com-prehensiveness, automation, interactiveness, data-drivenness and compatibility with biomedical vocabularies for traits.Regarding the limitations of the system, in terms of the com-prehensiveness of the representation, they are already defined


Madrid, 21, 22 y 23 de marzo Infors@lud2017 Infors@lud2019

by the limitations of the nomenclature itself, which are due to planarity and alignment issues, unavoidable in any two-dimensional representation. As far as automation and data-drivenness are concerned, any graph generated from the information about individuals and their relations is always composed of the entities required by the nomenclature, and the resulting graph is isomorphic to that derived from a correct pedigree diagram according to the directives of the nomenclature. Nonetheless, the structure, being calculated only by an optimization engine, might not, without corrections, represent the most adequate genogram. However, from an interactivity standpoint, the support for corrections of the nodes’ positions by the user makes the system capable of representing, in a correct manner, all the situations included in the chosen nomenclature.

Although our system is currently capable of storing the annotation of diseases and traits as terms from standard biomedical vocabularies, a current limitation is the limited access to information from electronic health records. We intend to address this limitation in the near future, so that this system is not an isolated part of the diagnosis of genetic diseases but a tool capable of retrieving, updating and using relevant information stored in electronic health records, thereby enhancing and facilitating the diagnosis and risk evaluation of genetic diseases.
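The data-driven approach summarized above can be illustrated with a minimal sketch. It is not the genoDraw implementation: the class names, fields and the `visible_edges` helper are hypothetical and chosen only for illustration, and the HPO term is an arbitrary example. The sketch assumes that individuals and relationships are stored as plain data, that traits are stored as terms of a standard vocabulary, and that the elements to draw are derived from that data, so hiding a relationship does not remove it from the genogram.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Term:
    """A trait stored as a term of a standard vocabulary (e.g. HPO or SNOMED CT)."""
    vocabulary: str
    code: str
    label: str


@dataclass
class Individual:
    id: str
    sex: str
    traits: list[Term] = field(default_factory=list)


@dataclass
class Relationship:
    id: str
    partner_a: str
    partner_b: str
    children: list[str] = field(default_factory=list)
    hidden: bool = False  # hidden in the drawing, but still kept in the data


@dataclass
class Genogram:
    individuals: dict[str, Individual] = field(default_factory=dict)
    relationships: list[Relationship] = field(default_factory=list)

    def visible_edges(self) -> list[tuple[str, str]]:
        """Derive the edges to draw from the stored data, skipping hidden relationships."""
        edges = []
        for rel in self.relationships:
            if rel.hidden:
                continue
            # Partners connect to the relationship node; children hang from it.
            edges.append((rel.partner_a, rel.id))
            edges.append((rel.partner_b, rel.id))
            for child in rel.children:
                edges.append((rel.id, child))
        return edges


# Hypothetical family: one person (I-1) with three relationships, one of them hidden.
g = Genogram()
g.individuals["I-1"] = Individual("I-1", "F", [Term("HPO", "HP:0001250", "Seizure")])
for pid, sex in [("I-2", "M"), ("I-3", "M"), ("I-4", "M"), ("II-1", "F")]:
    g.individuals[pid] = Individual(pid, sex)
g.relationships.append(Relationship("r1", "I-1", "I-2", children=["II-1"]))
g.relationships.append(Relationship("r2", "I-1", "I-3"))
g.relationships.append(Relationship("r3", "I-1", "I-4", hidden=True))

print(g.visible_edges())
```

In this sketch the hidden relationship `r3` is absent from the derived edges but remains in `g.relationships`, which mirrors the idea of hiding a relationship in the drawing without deleting it from the data of the genogram.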


Acknowledgements

This work is supported by “Proyecto colaborativo de integración de datos genómicos (CICLOGEN)” PI17/01561, funded by the Carlos III Health Institute from the Spanish National Plan for Scientific and Technical Research and Innovation 2017-2020 and the European Regional Development Funds (FEDER).

References

[1] R.L. Bennett, K.S. French, R.G. Resta, and D.L. Doyle, Standardized Human Pedigree Nomenclature: Update and Assessment of the Recommendations of the National Society of Genetic Counselors, J Genet Counsel. 17 (2008) 424–433. doi:10.1007/s10897-008-9169-9.
[2] S. Köhler, L. Carmody, N. Vasilevsky, et al., Expansion of the Human Phenotype Ontology (HPO) knowledge base and resources, Nucleic Acids Res. 47 (2019) D1018–D1027. doi:10.1093/nar/gky1105.


XX Congreso Nacional de Informática de la Salud Infors@lud2019

[3] R.L. Bennett, K.A. Steinhaus, S.B. Uhrich, C.K. O’Sullivan, R.G. Resta, D. Lochner-Doyle, D.S. Markel, V. Vincent, and J. Hamanishi, Recommendations for standardized human pedigree nomenclature. Pedigree Standardization Task Force of the National Society of Genetic Counselors, Am J Hum Genet. 56 (1995) 745–752.
[4] C. Kelleher, B. Drohan, K. Hughes, and G. Grinstein, Self Organizing Interactive Pedigree Diagrams, (2011).
[5] K. Donnelly, SNOMED-CT: The advanced terminology and coding system for eHealth, Stud Health Technol Inform. 121 (2006) 279–290.


This document is signed by:
Signer: CN=tfgm.fi.upm.es, OU=CCFI, O=Facultad de Informatica - UPM, C=ES
Date/Time: Mon Jun 03 13:04:12 CEST 2019
Certificate issuer: [email protected], CN=CA Facultad de Informatica, O=Facultad de Informatica - UPM, C=ES
Serial number: 630
Method: urn:adobe.com:Adobe.PPKLite:adbe.pkcs7.sha1 (Adobe Signature)