Institut für Informatik TOWARDS PREDICTION OF EPIGENETICS-RELATED PROTEIN TYPES Master’s thesis to obtain the academic degree Master of Science submitted to the Faculty Physics, Mathematics and Computer Science of the Johannes Gutenberg-University Mainz on 04. September 2014 by Thomas Kemmer


Institut für Informatik

TOWARDS PREDICTION OF EPIGENETICS-RELATED PROTEIN TYPES

Master’s thesis

to obtain the academic degree Master of Science submitted to the Faculty Physics, Mathematics and Computer Science

of the Johannes Gutenberg-University Mainz

on 04. September 2014 by

Thomas Kemmer

Submission date: 04. September 2014

First reviewer: Prof. Dr. Andreas Hildebrandt, Software-Technik und Bioinformatik

Second reviewer: Prof. Dr. Stefan Kramer, Informationssysteme

Supervisor: Dr. Katerina Taškova, Software-Technik und Bioinformatik

Declaration

I hereby declare that I have written the present thesis independently and without use of other than the indicated means. I also declare that to the best of my knowledge all passages taken from published and unpublished sources have been referenced. The paper has not been submitted for evaluation to any other examining authority nor has it been published in any form whatsoever.

Mainz, 04. September 2014

Thomas Kemmer

ABSTRACT

Epigenetic modifications regulate gene expression by controlling the DNA organization within the cell nucleus, and thus play a key role in the understanding of mammalian development, biological pathways, and diseases. Although it is known that different types of proteins are involved in epigenetic mechanisms, there is currently no common definition of these specific types. In previous studies, epigenetics-related proteins have been predicted using machine learning techniques. More specifically, the studies focused on the more general task of identifying whether or not a protein (domain) is involved in epigenetic mechanisms. In this thesis, we assess the possibility of building accurate prediction models for a more specific categorization of the epigenetics-related proteins in the mouse genome. This involves five types: erasers, mediators, modifiers, readers, and remodelers. For this purpose, we create a local database of known epigenetics-related proteins and perform a correlation analysis of their domains. While we can find most of the frequently co-occurring domains in the literature, there are a few exceptions that might indicate currently unknown epigenetic associations. Overall, we are able to successfully predict most of the epigenetics-related protein types in our data. Finally, we find indications of a possible overlap between some of the types, which should be inspected further by domain experts in order to be confirmed.

GERMAN ABSTRACT

Epigenetic modifications alter the DNA structure in the cell nucleus and thereby regulate the transcription of particular genes. They therefore play an important role in biological metabolic pathways, various diseases, and the development of mammals. Although it is known that different types of proteins take part in epigenetic processes, no unambiguous definitions exist for them so far. Using methods from the field of machine learning, previous studies have repeatedly identified proteins that are associated with epigenetic modifications. However, they did so only from a general standpoint that does not additionally distinguish between different types. In the present thesis, we therefore assess the possibility of including these types in the predictions. We distinguish the following five types: eraser, mediator, modifier, reader, and remodeler. For this purpose, we first create a local database of proteins involved in epigenetic processes and then compare the combinations of protein domains present with respect to the individual types. While most of these co-occurring domains have already been linked to epigenetic processes in the past, we also find several exceptions that point to a previously unknown connection. Overall, our work enables us to correctly predict the majority of the epigenetic protein types. For the remaining ones, however, we find indications that cast doubt on their being free of overlap with the other types and that should therefore be reviewed by experts.

ACKNOWLEDGEMENTS

First and foremost, I want to thank Professor Dr. Andreas Hildebrandt for the topic of this thesis, his support, and for introducing me to the fascinating field of bioinformatics. I also want to thank him and Dr. Anna Katharina Hildebrandt for starting the RemoDB project as a database of potential chromatin remodelers and for inviting me to contribute a part to it. Although it ended up as a prototype, the project aroused my interest in epigenetics and encouraged me to face the challenge of this thesis.

I would like to express my sincerest gratitude to my supervisor, Dr. Katerina Taškova, who has guided me throughout my research with her patience and knowledge, even when we were separated by a large geographical distance. I am grateful for her encouragement and invaluable feedback with regard to the research for the thesis and the writing of it.

Furthermore, I would like to acknowledge Professor Dr. Stefan Kramer for his useful comments and remarks during the master’s seminar. I also owe my gratitude to Dr. Jörg Wicker for his continuous support with regard to data mining and machine learning, and for letting me ransack his bookshelf. For his feedback in the early stages of my research and the guidance through the jungle of cross-referenced biological databases, I would like to thank Dr. Markus Krupp. Moreover, I am indebted to Christian Hundt for introducing me to scikit-learn.

I want to thank Marc-André Vef for repeatedly proofreading my thesis and for a lot of constructive criticism. Likewise, I thank Tim Seifert for struggling with my inclination to overly complicated sentences. I am also thankful to Dr. Marco Carnini for his support.

Although it was not possible to follow every path that emerged during research, I would also like to acknowledge Dr. Andreas Karwath, Michael Geilke, and Andrey Tyukin for taking the time to suggest and discuss alternative approaches when encountering a problem.

Last but not least, I would like to thank my parents for always supporting me throughout all my studies at the university.

CONTENTS

1 INTRODUCTION

2 THEORETICAL BACKGROUND
  2.1 Epigenetics
    2.1.1 DNA organization
    2.1.2 Histone modification
    2.1.3 DNA methylation
    2.1.4 Binding of non-histone proteins
    2.1.5 The emerging role of epigenetics
  2.2 Expert-generated chromatin remodeler types
  2.3 Data mining and machine learning
    2.3.1 Data mining
    2.3.2 Machine learning
  2.4 Related work
    2.4.1 Epigenetics-related databases
    2.4.2 Web tools for epigenetics-related protein analysis
    2.4.3 Prediction of epigenetics-related proteins

3 MATERIAL AND METHODS
  3.1 Data acquisition and organization
    3.1.1 Extracting information from the original input data
    3.1.2 Mapping gene names to gene database entries
    3.1.3 Mapping genes to proteins and their domains
    3.1.4 Creating a local database
  3.2 Data preprocessing
    3.2.1 Unreviewed proteins and predicted annotations
    3.2.2 Redundant protein information
  3.3 Data representation
  3.4 Statistical learning methods
    3.4.1 Classifiers
    3.4.2 Model evaluation and validation
  3.5 Post hoc data/model analysis
    3.5.1 Feature importance
    3.5.2 Domain co-occurrences

4 RESULTS AND DISCUSSION
  4.1 Data acquisition and preprocessing
    4.1.1 Data loss during the creation of the database
    4.1.2 Cluster representatives
    4.1.3 GO terms for chromatin modification
  4.2 Initial prediction models and correlation analysis
    4.2.1 Validation results
    4.2.2 Label domains
    4.2.3 Label clans and their members
    4.2.4 Domain co-occurrences
  4.3 Final prediction models

5 CONCLUSION AND OUTLOOK

A APPENDIX

BIBLIOGRAPHY

LIST OF FIGURES

Figure 2.1 Nucleosome organization
Figure 3.1 Workflow for creating and improving prediction models
Figure 3.2 Compact model of the local database
Figure 3.3 Chromatin remodeler type distribution across the data sets
Figure 3.4 Object diagram of the parameter-tuned classifiers
Figure 3.5 Workflow for validating a prediction model
Figure 4.1 Data loss summary
Figure 4.2 Number of annotations for different data sets
Figure 4.3 Validation results of the initial models
Figure 4.4 Feature importance in the data
Figure 4.5 Domain co-occurrences in the clustered data set
Figure 4.6 F-measures of the final models
Figure A.1 Model of the local database
Figure A.2 Chromatin remodeler type distribution across the data sets
Figure A.3 Configuration of the classifiers
Figure A.4 Validation results of the initial models (second set)

LIST OF TABLES

Table 3.1 Example content from the initial data file
Table 3.2 Identification of mouse genes by human-readable names
Table 3.3 Column names used with UniProt’s REST interface
Table 3.4 Relevant proteins for different grades of sequence identity
Table 3.5 Naming scheme for the data sets
Table 3.6 Data set summary statistics
Table 3.7 Co-occurring chromatin remodeler types in the set
Table 4.1 Frequent patterns among the gene name mismatches
Table 4.2 Comparison of expected and actual protein domains
Table 4.3 Proteins with GO annotations for chromatin modification
Table 4.4 Abundant label domains for the chromatin remodeler types
Table 4.5 Overall coverage of labels and label domains
Table 4.6 Most important label domain relatives
Table 4.7 Distribution of domain assignments per protein
Table 4.8 Feature overview of the final data sets
Table 4.9 Excerpt from the final validation results
Table A.1 Source files used to build the local database
Table A.2 Label domain relatives in the data set
Table A.3 Final SVM model validation results (GO sets)
Table A.4 Final RF model validation results (GO sets)
Table A.5 Final SVM model validation results (SL sets)
Table A.6 Final RF model validation results (SL sets)

LIST OF ABBREVIATIONS

ATP Adenosine triphosphate

DNA Deoxyribonucleic acid

FTP File Transfer Protocol

GO Gene Ontology

KDD Knowledge Discovery from Data

NCBI National Center for Biotechnology Information

ORF Open Reading Frame

REST Representational State Transfer

UCSC University of California, Santa Cruz

URL Uniform Resource Locator


1 Introduction

One classic definition of epigenetics is “the study of mitotically and/or meiotically heritable changes¹ in gene function that cannot be explained by changes in DNA sequence” [2]. During the past decades, numerous experiments have examined the influence of epigenetic mechanisms on mammalian development, biological pathways, and diseases. A popular example is research on monozygotic (identical) twins, as they are known to show an increasing number of differences over the years despite sharing the same genome [3]. Although current knowledge is still limited, technological advances allow epigenetic studies on whole species genomes (rather than single genes), commonly known as epigenomics [4].

The basic principle behind epigenetic mechanisms is the regulation of gene expression via changes in the DNA organization (chromatin) within the cell nucleus [5]. This process requires a variety of chemical reactions induced by different proteins. Regardless of their diversity, these proteins can be roughly categorized into several types according to their epigenetics-related function, including the recognition of epigenetic signals or the modification of DNA and molecules in the nucleus [1]. However, to the best of our knowledge, there exists no widely accepted definition and consequently no unified representation of these types, in the following referred to as chromatin remodeler types.

In order to understand the full spectrum of epigenetic mechanisms, the involved proteins and their domains have to be analyzed. Unfortunately, only a limited number of them is currently known and wet lab experiments are usually both expensive and time-consuming. As a result, computational approaches are necessary to reduce the search space by automated identification of the most promising candidate proteins and domains within a given proteome. On a general level, many software pipelines for automated functional annotation of proteins have been developed over the past decade, producing vast amounts of (often publicly available) data. Furthermore, recent studies have aimed to identify new epigenetics-related proteins and domains [6]. The next step is now to combine these approaches in order to identify the distinct chromatin remodeler types among the candidate proteins.

In this thesis, we aim to assess the possibility of building accurate prediction models for differentiating between the given chromatin remodeler types. Our study is based on five types, as extracted from an expert-generated list of known epigenetics-related mouse genes, which are linked to the respective types via protein domains. Since we assume this list to be created using both manual and automated methods, it is necessary to evaluate how much of its data can be actually used for building reasonable models and to provide reliable data sets.

¹ In other words, changes can occur between generations of cells (mitotic inheritance) or species (meiotic inheritance) [1].

For this purpose, we first extract the available information from the input list and identify the corresponding gene and protein entries in the well-known UCSC Genome Browser [7] and UniProtKB [8] databases. In the second step, we enrich the proteins with Pfam [9] and GO [10] annotations. This enables us to use a broad variety of protein characteristics (including their biological functions and cellular localization) in the form of unified machine-readable descriptors rather than natural language. Special attention is paid to the quality of the created data sets in order to reduce the noise in the final prediction models. More specifically, the ambiguity of the given input data as well as unreviewed proteins and predicted annotations are the main issues of concern.

In the second half of this thesis, we predominantly focus on a descriptive analysis of the created data sets. In other words, we summarize noticeable characteristics of the proteins, for example, co-occurrences of domains and their relations to the five chromatin remodeler types. This analysis reveals how the credibility of our prediction models is influenced by domains sharing a common biological origin as well as those cooperating in an epigenetic context. In addition, we present domain co-occurrences that might indicate novel epigenetic associations.

Overall, we are able to successfully predict a subset of the suggested chromatin remodeler types in our data sets. Due to problems with the original input data, we excluded a considerable number of proteins and annotations from prediction model building, partly resulting in training sets that are too small for reliable predictions. Moreover, we find indications of a possible overlap between the suggested types, such that their credibility cannot be fully confirmed right now. However, our data sets are not limited to the expert-generated input data used in this thesis. Instead, they can be easily utilized with any other set of UniProtKB proteins labeled with chromatin remodeler types via Pfam domains.

The subsequent chapters are structured as follows: In Chapter 2, we introduce the theoretical background of our study. More specifically, we give more detailed information about epigenetics, including its mechanisms and the environment epigenetic gene regulation takes place in. This is followed by a short introduction of the principles of data mining, an overview of the related work, and problem-specific remarks. Chapter 3 describes the materials and methods used in this thesis. The first part covers the work that is done before the initial prediction model training, namely the extraction of information from the input file and the public databases as well as the initial data quality review. The second part of the chapter then focuses on descriptions of the methods for the actual prediction model training and the subsequent feature correlation analysis. In Chapter 4, we show and discuss the results of our study. First, we state the problems and observations made during the creation and preprocessing of the data sets. Afterwards, we focus on the performance of the initial prediction model and the correlation analysis of the data, including the domain co-occurrences among the known epigenetics-related proteins. Finally, we show the performance of the final prediction models, refined on the basis of the preceding steps. In Chapter 5, we conclude this thesis with a short summary of the addressed problem as well as the results with suggestions for further work.

2 Theoretical Background

This chapter focuses on the theoretical background of our study, both from a biological and a computational point of view. In the first section, we have a closer look at epigenetics and the principles behind its mechanisms. In the second section, we show the new expert-generated chromatin remodeler types we use for building prediction models and their allocation in the epigenetic context. In the third section, we introduce the field of data mining and the machine learning methods it uses. In the fourth section, we present previous studies and projects that are related to our study.

2.1 Epigenetics

In this section, we first give a short introduction to DNA organization within eukaryotic cells, which is crucial for the understanding of epigenetic mechanisms. Thereafter, the different types of mechanisms are described: histone modification, DNA methylation, and the binding of non-histone proteins [1, 11]. Finally, we focus on a selection of biological processes they are involved in.

2.1.1 DNA organization

In eukaryotic cells, most of the DNA is located in the cell nucleus in a more or less densely compacted form, known as chromatin. Its basic structure can be described as a chain of similarly constructed units, the so-called nucleosomes (Figure 2.1). Each nucleosome consists of a small section of DNA wrapped around a complex of eight specific proteins, the core histones. The complex is usually referred to as histone core. Another type of histones (linker histones) stabilizes the construction and the DNA connecting the nucleosomes (linker DNA) [12].

The chromatin of a eukaryotic genome can be present in two states: euchromatin and heterochromatin. In the heterochromatin, the DNA is very densely packed and thus transcriptionally inactive. The euchromatin, on the other hand, is more loosely structured, making the DNA available for biological tasks, such as repair, replication, and gene transcription. It should be noted that the euchromatin offers only the opportunity for active gene transcription rather than implying it. Additional requirements have to be met before the actual transcription can take place, e.g., the presence of particular proteins. In general, a major part of the eukaryotic genome can be found in the euchromatin [13].

[Figure 2.1: Nucleosome organization. The chromatin is built of several nucleosomes connected by the linker DNA. Labeled parts: histone core, core DNA, linker histone, linker DNA.]

The chromatin structure (that is, the DNA organization) in the cell nucleus is controlled by epigenetic mechanisms [5]. More specifically, this is done via binding of non-histone proteins, or covalent modifications of the DNA and the histones, resulting in the regulation of gene expression [1].

2.1.2 Histone modification

The first type of epigenetic mechanism comprises post-translational modifications of the histone cores that cause the chromatin structure to open (euchromatin) or close (heterochromatin). Each core histone consists of a mostly globular part as well as several unstructured N-terminal tails, which protrude from the protein’s surface. These tails are subject to epigenetic modifications, for instance, acetylation and methylation [13].

Histone modifications and a possible connection to gene regulation were first described in 1964 [14]. However, it took more than 30 years before the first histone acetyltransferases (HAT) and histone deacetylases (HDAC) were discovered and confirmed to be linked to gene regulation [15, 16]. Today, at least six distinct types of histone modifications are known, apart from acetylation and methylation. In addition, these types can appear in different forms (e.g., mono-, di-, or trimethylation¹) and at more than 60 different sites of the histone tail, leading to a large number of possible combinations of modifications [13].

This variety of histone modifications inspired the idea of a histone code that can be read by other proteins in order to induce particular biological events, rather than having the modifications directly alter the chromatin structure [17]. This suggestion has been controversially debated over the past years and there is still only little agreement on the meaning of the code. Nevertheless, it has been found that the presence or absence of particular histone modifications serves as a means of communication for histone modifying proteins and other epigenetic regulators. Thus, it is assumed that the modifications are both directly involved in the reorganization (remodeling) of the chromatin structure and signals for other proteins [18].

¹ As the name suggests, these forms involve one, two, or three methyl groups, respectively.

Most of the histone modifying proteins appear as a part of protein complexes, often in combination with DNA binding domains or those able to identify particular histone modifications. Even so, these complexes are not always exclusively related to gene regulation [18].

2.1.3 DNA methylation

DNA methylation is the second type of epigenetic mechanisms and currently the only one known to directly affect the DNA [1]. In contrast to changes in the DNA sequence (for example, by a substitution of a nucleobase²), this procedure only replaces a hydrogen atom of an adenine or cytosine molecule with a methyl group. This leads to a locally more densely compacted chromatin structure and therefore to a repression of gene transcription. In mammals, DNA methylation appears exclusively in CpG dinucleotides³. These molecules tend to build clusters near the promoter regions of genes. Interestingly, while about 70-80% of the mammalian CpG dinucleotides are methylated, this is almost never the case when present in the cluster conformation (the so-called CpG islands) [19, 20].
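To make the notion of a CpG dinucleotide concrete, the following minimal sketch counts how often a cytosine is directly followed by a guanine on one strand of a sequence. It is an illustration only and not part of the thesis workflow; the function names `count_cpg` and `cpg_density` and the example sequence are hypothetical.

```python
def count_cpg(sequence):
    """Count CpG dinucleotides (a C directly followed by a G) on one strand."""
    seq = sequence.upper()
    return sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "CG")


def cpg_density(sequence):
    """Fraction of dinucleotide positions that are CpG (0.0 for short input)."""
    positions = len(sequence) - 1
    return count_cpg(sequence) / positions if positions > 0 else 0.0


# Made-up sequence for illustration, not real genomic data.
example = "ACGCGTTACGGA"
cpg_count = count_cpg(example)
density = cpg_density(example)
```

A region with a locally high density of such positions would be a candidate for the cluster conformation described above; the actual criteria for CpG islands are not specified here and are left out on purpose.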

Although histone modification and DNA methylation are two completely different mechanisms, with their own sets of proteins involved and various chemical reactions, they have been found to cooperate in various epigenetics-related processes. For instance, recent studies suggest that DNA methylation patterns are essential for reconstructing histone modifications after cell division [21].

2.1.4 Binding of non-histone proteins

The third type of epigenetic mechanisms is the binding of non-histone proteins. Its main representatives are the ATP-dependent chromatin remodelers, which can directly displace nucleosomes along the DNA and promote the exchange of histones [22]. They commonly appear as multi-protein complexes, rather than single proteins, and often have a specialized and non-redundant function in the process of mammalian development [11]. Another important group of non-histone protein binders is sometimes referred to as the “readers and writers of the epigenome”. They catalyze the other epigenetic mechanisms after binding to the DNA or histones. Transcription factors, which are also known to affect the chromatin structure and would, as a result, fit into this group of mechanisms, are excluded by convention [1].

² The DNA’s building blocks: Adenine, guanine, thymine, and cytosine.
³ These molecules consist of cytosine and guanine, connected through a phosphate bond.

2.1.5 The emerging role of epigenetics

Epigenetic modifications of histones or DNA are considered to play an important role in the understanding of particular biological processes, such as mammalian development. In general, the DNA sequence is known to be mostly invariant across different tissues and cells. The epigenome, on the other hand, shows highly specialized, parent-specific variations [4]. The combination of both genomic and epigenomic information is crucial for cell fate decisions⁴ during development [23]. Furthermore, cell aging leads to a heterogeneous accumulation of epigenetic modifications among cell populations, which has been found to support tumorigenesis, i.e., the development of cancer [24]. At the same time, the analysis of the epigenome allows early detection and offers advanced strategies for fighting cancer [25]. Apart from that, epigenetic mechanisms are also involved in the alteration of metabolic pathways [26] and most likely in several other fields.

2.2 Expert-generated chromatin remodeler types

The starting point of our study is a set of five presumably distinct chromatin remodeler types extracted from a list of known epigenetics-related mouse genes linked to the respective types via protein domains, kindly provided by Sudhir Thakurela⁵. The types are named as follows: eraser, mediator, modifier, reader, and remodeler. Erasers and modifiers are able to remove or add epigenetic modifications to histones/DNA, respectively. Remodelers, on the other hand, open and close the chromatin structure without removing/adding epigenetic modifications. Readers are able to recognize particular epigenetic modifications on histones or DNA. Mediators catalyze protein-protein interaction in an epigenetic context [27].

While these type definitions are very similar to the different types of epigenetics-related proteins described in the previous section, the five-type chromatin remodeler classification does not differentiate between the three epigenetic mechanisms (histone modification, DNA methylation, and non-histone binding). In other words, proteins that add modifications to histones or DNA are both located among the modifiers, etc. Furthermore, the remodeler type and the ATP-dependent chromatin remodelers (Section 2.1.4) are not equivalent, as there exist remodelers which do not require the presence of ATP [27].

2.3 Data mining and machine learning

In this section, we give a short introduction to the field of data mining and the general classes of machine learning techniques it utilizes (based on [28–30]).

⁴ The specialization of cells, e.g., from a stem cell to a tissue-specific cell.
⁵ Institute of Molecular Biology, Working group “Epigenetic Regulation of Development and Disease”, 55128 Mainz, Germany

2.3.1 Data mining

Data mining is usually considered the process of searching for interesting patterns in data, as part of the general approach of knowledge discovery from data (KDD). Nowadays, vast amounts of data are generated on a daily basis, including network traffic, digital transactions, and medical data. Nevertheless, not every piece of information stored is necessarily useful. Consequently, computational methods are required in order to extract new knowledge from large data collections. Application examples include risk assessment for diseases (e.g., cancer types) [31], evaluation of gene expression analyses (e.g., DNA microarray data) [32], or text recognition [33].

The process of knowledge discovery can be generalized as follows: First, a database (often referred to as data warehouse) is created based on input from several sources, such as flat files or external databases. The main challenge here is the mostly heterogeneous structure and consistency of the input data, requiring a preprocessing step (to remove noise, unify, or summarize the data) prior to the integration into the warehouse. Second, the information subset of interest is extracted and preprocessed in order to be used with the selected data mining technique. Normally, the data is represented as a set of entities (instances) described with a common set of characteristics (features), for example, proteins and their annotations, respectively. Finally, the selected data mining technique is applied and the resulting patterns are evaluated and visualized.
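The instance/feature representation mentioned above can be sketched as follows, assuming that each protein is described by the set of annotations it carries. This is an illustrative sketch only: the protein identifiers, annotation names, and the helper `build_feature_matrix` are hypothetical placeholders and do not reflect the data sets or code of this thesis.

```python
# Hypothetical toy data: instances (proteins) mapped to their annotations.
annotations = {
    "protein_a": {"domain_x", "domain_y"},
    "protein_b": {"domain_y"},
    "protein_c": {"domain_x", "domain_z"},
}


def build_feature_matrix(instances):
    """Return (feature_names, rows) with one binary row per instance.

    Every instance is described by the same ordered list of features;
    an entry is 1 if the instance carries the feature and 0 otherwise.
    """
    feature_names = sorted(set().union(*instances.values()))
    rows = {
        name: [1 if feature in feats else 0 for feature in feature_names]
        for name, feats in instances.items()
    }
    return feature_names, rows


feature_names, rows = build_feature_matrix(annotations)
```

The binary encoding is one possible choice; counts or weights per annotation would fit the same instance/feature scheme.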

2.3.2 Machine learning

In general, there are several different ways to search for interesting patterns in data. The most fundamental approach is to represent the data set using basic statistical descriptions, for instance, in terms of feature frequencies and distributions. A second approach is the utilization of machine learning techniques. The process of learning with regard to machines (computer programs) can be defined as follows: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” [34]. In this thesis, we use machine learning techniques in order to predict chromatin remodeler types based on protein and gene annotations.

Perhaps the most prominent class of machine learning techniques is supervised learning. Here, one or more features are designated as target. The corresponding value classifies each instance with respect to the target, for example, a numerical value representing the risk of a patient for having a liver tumor. The goal of supervised learning is to create a prediction model that is able to differentiate between the classes, given a labeled training set of instances as a reference. In addition, a more realistic measure for the model’s performance can be obtained by applying the model to an unseen data set with known labels, referred to as test set. When predicting categorical (i.e., discrete and unordered) class values, supervised learning is usually called classification, otherwise (continuous-valued classes) regression.
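The interplay of training set, test set, and performance measure can be illustrated with a minimal sketch. It is not the method used in this thesis: the nearest-neighbor rule, the binary feature vectors, and the class labels below are assumptions chosen only to keep the example self-contained.

```python
def hamming(a, b):
    """Number of positions at which two equal-length feature vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def predict_1nn(training_set, features):
    """Return the label of the training instance closest to `features`.

    Ties are resolved in favor of the instance listed first.
    """
    nearest = min(training_set, key=lambda inst: hamming(inst[0], features))
    return nearest[1]


def accuracy(training_set, test_set):
    """Fraction of test instances whose predicted label equals the known label."""
    hits = sum(predict_1nn(training_set, f) == label for f, label in test_set)
    return hits / len(test_set)


# Toy instances (invented): a binary feature vector and a categorical label.
training_set = [
    ((1, 0, 0, 1), "reader"),
    ((1, 0, 0, 0), "reader"),
    ((0, 1, 1, 0), "eraser"),
    ((0, 1, 0, 0), "eraser"),
]
# Unseen instances with known labels, used only for evaluation.
test_set = [
    ((1, 0, 1, 1), "reader"),
    ((0, 1, 1, 1), "eraser"),
    ((1, 1, 0, 0), "eraser"),
]

print(accuracy(training_set, test_set))
```

Because the labels are categorical, this is a classification task; the accuracy on the held-out test set plays the role of the performance measure P.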

In contrast, unsupervised learning works on unlabeled data, that is, instances without designated targets. For example, unsupervised learning methods can be used to cluster instances in terms of similarity or to find frequently occurring feature sets. The third class of machine learning techniques, semi-supervised learning, combines the utilization of labeled and unlabeled data when building prediction models. In this thesis we focus primarily on basic statistical descriptions and classification, extensively described in Section 3.4.

2.4 Related work

In this section, we present recent epigenetics-related projects categorized into three different groups: databases, web tools, and prediction approaches.

2.4.1 Epigenetics-related databases

Over the past decade, several databases of epigenetics-related proteins have been created. In the following, we present two of them: ChromDB [35] and DAnCER [36].

The ChromDB (Chromatin Database) public database offers a collection of proteins that are, in the broadest sense, involved in chromatin remodeling (including the histones themselves). The proteins are classified into a hierarchy of over 90 groups, describing their relation to chromatin remodeling. While initially built solely from manually curated proteins of the plants Arabidopsis thaliana and Zea mays (maize), ChromDB now offers several fungal and animal proteins from the NCBI RefSeq [37] database. The main goal of the database is to provide both data and methods (in the form of links to external tools) for comparative analyses of chromatin-related proteins among different organisms.

The DAnCER (Disease-Annotated Chromatin Epigenetics Resource) database links known epigenetics-related genes to human disease annotations as well as Pfam domains and GO terms⁶. Its main goal is to explore epigenetics-related genes in terms of gene expression profiles, protein-protein interactions, cellular pathways, and patterns of evolutionary conservation. DAnCER’s core is manually curated from literature, external databases, and analyses of protein complexes. In addition, it offers predicted epigenetics-related genes based on protein homology relationship analyses and prediction models trained on Pfam domain compositions and co-occurrences.

6 Although the latter two refer to protein annotations, the database focuses on genes.


2.4.2 Web tools for epigenetics-related protein analysis

Databases such as ChromDB and DAnCER are valuable sources of epigenetics-related information. However, they neither offer an option to automatically analyze the available knowledge in combination with additional (user-generated) data, nor do they provide a visualization of the knowledge in its genomic context. In the following, we present two projects that aim to bridge this gap: EpiGRAPH [38] and EpiExplorer [39].

EpiGRAPH is a user-friendly web application for statistical analysis and prediction of epigenomic data. It allows the identification of novel relations between genomic regions (rather than genes or proteins) with regard to a specific biological role. The starting point is a set of genome assemblies (human, mouse, chimp, and chicken), extended by various annotations from external databases, including DNA composition, repetitive regions, and CpG islands. The user can specify a set of genomic regions of interest and add further annotations. Moreover, the regions have to be designated as positive or negative examples in terms of the selected biological role. The software then tries to find (and visualize) correlations between the annotations and evaluates whether or not there are significant differences between the positive and negative regions.

EpiExplorer is a very similar approach, which also works with annotated genomic regions, accepts user-generated content, and offers statistical analyses of the data. In contrast to EpiGRAPH, this project focuses on the visualization of results and allows exploration of genomic regions via an intuitive web interface.

2.4.3 Prediction of epigenetics-related proteins

Over the past years, data mining techniques have been used extensively to predict the biological functions of genes and proteins (e.g., [40–42]). More specifically, several approaches have focused on the prediction of epigenetics-related functions. In this section, we present the work of Pu et al. [6], which is, to the best of our knowledge, the first full-scale study on the prediction of epigenetics-related proteins across different model organisms.

The goal of the study was to identify proteins and protein domains that are involved in epigenetic mechanisms. The study was based on protein-coding genes (yeast, worm, fly, mouse, and human), extracted from the Ensembl database [43], as well as protein annotations from Pfam, GO, CORUM [44] (human protein complexes), and CYC2008 [45] (yeast protein complexes). A gene was considered epigenetics-related if it met specific requirements, including having a corresponding protein annotated with a particular GO term. The authors performed a correlation analysis of the protein domains in order to investigate their level of conservation between the model organisms as well as to identify epigenetics-related domains, resulting in 47 candidates. Furthermore, they built supervised prediction models based on yeast and human genes to predict epigenetics-related human genes. With this approach, the authors predicted 379 candidates. In contrast to this thesis, their study did not distinguish between different chromatin remodeler types.

3 Material and Methods

Since the given set of known epigenetics-related proteins cannot be used directly for predicting chromatin remodeler types, a suitable data set has to be created and prepared first. Once the prediction model is built, the results have to be evaluated. This usually leads to the need for additional data or further processing, before the whole process can be repeated, eventually resulting in a refined prediction model.

The general workflow used in this thesis is shown in Figure 3.1. Based on this scheme, the chapter is organized as follows: First, we describe how to create a database that contains current knowledge about the mouse proteome, including several annotations from different sources and the chromatin remodeler type assignments given in the expert-generated list of known epigenetically active genes. Second, the protein data is filtered to remove information that could distort the prediction results. Moreover, the data is transformed into a format that can be utilized by data mining tools. Finally, we explain how these tools are used to learn from the known epigenetically active proteins and how to obtain more reasonable results when repeating the process.

[Figure 3.1, workflow diagram with the nodes: Input file; External databases; Data acquisition; Preprocessing; Learning of prediction model; Analysis; Useful model]

Figure 3.1: Workflow for creating and improving prediction models of chromatin remodeler types.

3.1 Data acquisition and organization

In this section, we concentrate on the data acquisition and organization. For this purpose, we first analyze what information can be extracted from the input data and how to build a database upon it. Since we are given gene information and want to operate mainly on the protein level, we need to identify the corresponding gene entries in public databases, including their proteins and the corresponding annotations that might be interesting for prediction.


#  Gene name  Domain names               Domain IDs               Types
1  100039000  Krab                       PF01352                  Remodeler
2  Abtb1      Ankyrin;Btb                PF00023;PF00651          Reader;Mediator
3  Asxl1      Asxh;Hare-Hth;Phd          PF13919;PF05066;PF00628  Mediator;Reader
4  Ldb1       NA                         NA                       NA
5  Mdb1       Cxxc;Mbd;Zf-Cxxc;MBD3701   PF02008;PF01429;NA       Reader;NA
6  Pc2        Chromo                     PF00385                  Reader

Table 3.1: Example content from the initial data file. Multiple values per column are separated by semicolons. NA represents missing values.

3.1.1 Extracting information from the original input data

The initial set of known chromatin remodelers is obtained from a flat file containing three major pieces of information: gene names, protein domains and remodeler types. It should be noted that we use the term gene name to describe a mixture of gene symbols and different kinds of identifiers (see Section 4.1.1). Domain information is available both as human-readable names and as alphanumerical identifiers from the well-known Pfam database [9]¹. The three kinds of information are interconnected and form unique triplets in the form of <Gene, Domain, Type>, i.e., each gene is assigned a chromatin remodeler type and the domain that is considered responsible for the gene product being of this particular type. There might be several triplets for the same gene, as it can be associated with multiple types or with multiple domains independently leading to the same chromatin remodeler type². Due to the condensed format of the file, there is only one line of tab-delimited values per gene with the domains and types being semicolon-separated lists. Table 3.1 gives an example of how data is organized in the input file. The triplet structure can be extracted either directly (cf. first gene) or by combining every n-th domain ID/name with the corresponding n-th chromatin remodeler type (cf. second gene).

The file inspection reveals several lines where the number of domains and types differ (cf. third and fifth gene), so that the triplets cannot be reconstructed for a certain number of genes. For the purpose of representing the whole connection between those three entities in our local database, we remove the affected parts from the data set. We also exclude all genes which are known to be involved in chromatin remodeling although their exact types are yet to be determined (cf. fourth and fifth gene), as they cannot be used for building a prediction model that distinguishes between the different chromatin remodeler types. See Section 4.1.1 for more details on genes with missing type assignments.
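The extraction and filtering rules described above can be sketched as follows. This is a minimal illustration rather than the actual import script: the column order is taken from Table 3.1, the function name is hypothetical, and dropping a gene entirely when its list lengths differ is an assumption about how "the affected parts" are removed.

```python
def extract_triplets(line):
    """Split one tab-delimited input line into <gene, domain ID, type> triplets.

    Assumed column order (cf. Table 3.1): gene name, domain names,
    domain IDs, types. Returns an empty list if the n-th domain cannot be
    paired unambiguously with the n-th type.
    """
    gene, _domain_names, ids, types = line.rstrip("\n").split("\t")
    domain_ids = ids.split(";")
    rtypes = types.split(";")
    if len(domain_ids) != len(rtypes):
        return []  # lists of different length: triplets cannot be reconstructed
    # Pair the n-th domain ID with the n-th type and skip missing values (NA).
    return [(gene, d, t) for d, t in zip(domain_ids, rtypes) if "NA" not in (d, t)]


# Rows 2 to 4 of Table 3.1, written as tab-delimited lines.
rows = [
    "Abtb1\tAnkyrin;Btb\tPF00023;PF00651\tReader;Mediator",
    "Asxl1\tAsxh;Hare-Hth;Phd\tPF13919;PF05066;PF00628\tMediator;Reader",
    "Ldb1\tNA\tNA\tNA",
]
for row in rows:
    print(extract_triplets(row))
```

Only the first of the three rows yields triplets; the second is rejected because three domains face two types, and the third carries no type assignment.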

1 Here, domain actually refers to protein family in the sense of Pfam nomenclature [46].
2 Types originating from domain combinations (including multiple copies of the same domain) are not covered by the data set.


NCBI Gene          Ensembl                       UCSC genes (mm10)
Name     ID        Name     ID                   Name     ID
Cbx4     12418     Cbx4     ENSMUSG00000039989   Cbx4     uc007mpw.2
Pkd2     18764     Pkd2     ENSMUSG00000023036   Scg5     uc008lpq.1
Pcsk2    18549     Pcsk2    ENSMUSG00000027419   Klf3     uc008xms.1
Ms6hm3   111469    Pcdhga9  ENSMUSG00000023036   Ccnc     uc008scw.2
-        -         G6pc2    ENSMUSG00000005232   30 more

Table 3.2: Identification of mouse genes by human-readable names across different databases. For each database, the table lists the gene names and corresponding IDs of the records obtained when searching for Pc2. For UCSC genes only one ID per name is displayed.

From an original set of 4148 gene names only 3717 remain after removing all unclear data as described above. The remaining gene names can now be used to find corresponding entries in external gene annotation databases. Once the relevant genes have been identified, we can do the same for the corresponding proteins.

3.1.2 Mapping gene names to gene database entries

One major problem with gene names is that they are used ambiguously among different databases. Often, one has to know exactly from which database a particular name was retrieved in order to find the gene originally associated with that name. Knowing the database, however, is not a guarantee: not only may names have changed over time, but the same name can also serve as a synonym for different genes. Moreover, gene entries might become discontinued or get merged with others on account of continuous changes in current knowledge. Table 3.2 shows the resulting genes when searching for Pc2, one of the names included in our set of known chromatin remodelers, using three popular databases: NCBI Gene [47], Ensembl [43] and the UCSC Genome Browser [48]. Evidently, the number of results highly depends on the database selected. Some search terms lead to nearly identical results, for instance Cbx4, which can be found with the same name in all three databases. For other terms, like Pc2, the results can differ greatly. In this example, the provided name does not even appear directly among the results. Especially for the UCSC Genome Browser one can expect a relatively high number of results, since it combines information from various sources.

Due to the lack of reliable gene identifiers in our data set, we have to decide which database to choose in order to build our own local copy. NCBI Gene, for instance, uses a well-documented scheme for assigning gene names and follows recommendations of species-specific nomenclature committees, if available [49]. However, the original data set has been derived mainly by using information from the mouse-specific mm9 assembly (2007), which is accessible through the UCSC Genome Browser and its database [7] as part of the UCSC Mouse Genome Project [50]. Thus, we use the same database but with the considerably newer mm10 assembly (2011), in order to reduce the number of dead cross-references to other sources in the next step. One advantage of preferring the UCSC data over NCBI Gene is the presence of DNA-related annotations that can easily be included in our local database.

All UCSC data is available via a public MySQL interface³ or FTP server⁴ and the database structure can be visualized using the UCSC Table Browser [51]. Unfortunately, the naming scheme in the database is somewhat inconsistent when it comes to differentiating between genes and their transcript variants. The gene IDs, as shown in Table 3.2, actually refer to the latter, i.e., transcripts. Genes as single entities (transcript clusters), on the other hand, are only indicated by integer values in the table mm10.knownIsoforms. In other words, all gene annotations describe the particular transcript variant, with the gene being just a numeric property of the transcript. This relationship also holds for any kind of gene name, hence our task here is to connect the provided names to transcript IDs rather than genes.

The majority of the gene names we are interested in can be found in the tables mm10.kgAlias (known gene alias) and hgFixed.transMapGeneUcscGenes. Because of their foreign key relations to mm10.knownGene, which represents the transcripts, names can directly be associated with the corresponding transcript ID. The third most useful table is mm10.geneName, although mm10.refLink and mm10.knownToRefSeq are needed in addition to link names and transcripts, as these names are primarily used for gene information derived from the RefSeq database [37]. The file inspection further shows that there are impurities among the gene names in our expert-generated list of chromatin remodelers, mostly identifiers from different sources such as NCBI Gene (cf. first gene in Table 3.1) and Ensembl. While mm10.kgAlias already contains a vast number of IDs, a combination of mm10.ensemblToGeneName and mm10.knownToEnsembl is necessary to cover Ensembl identifiers as well.
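The lookup from a gene name via transcript IDs to transcript cluster numbers can be sketched with an in-memory SQLite stand-in for the two UCSC tables involved. The column names (kgID, alias, clusterId, transcript) are assumptions about the UCSC table layout, and the cluster numbers and alias rows are invented for illustration; only the transcript IDs are taken from Table 3.2.

```python
import sqlite3

# In-memory stand-in for mm10.kgAlias and mm10.knownIsoforms (toy rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE kgAlias (kgID TEXT, alias TEXT);
    CREATE TABLE knownIsoforms (clusterId INTEGER, transcript TEXT);
    INSERT INTO kgAlias VALUES ('uc007mpw.2', 'Cbx4'),
                               ('uc007mpw.2', 'Pc2'),
                               ('uc008lpq.1', 'Pc2');
    INSERT INTO knownIsoforms VALUES (101, 'uc007mpw.2'),
                                     (202, 'uc008lpq.1');
""")


def clusters_for_name(connection, name):
    """Return the sorted transcript cluster numbers reachable from a gene name."""
    rows = connection.execute(
        """SELECT DISTINCT i.clusterId
           FROM kgAlias AS a
           JOIN knownIsoforms AS i ON i.transcript = a.kgID
           WHERE a.alias = ?
           ORDER BY i.clusterId""",
        (name,),
    )
    return [row[0] for row in rows]


print(clusters_for_name(conn, "Cbx4"))
print(clusters_for_name(conn, "Pc2"))
```

A name that resolves to more than one cluster number, as Pc2 does in this toy data, is exactly the ambiguous case that cannot be settled with transcript and gene data alone.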

Using the method described above, a total of 1909 UCSC genes (i.e., unique transcript cluster numbers) can be identified for the 3717 gene names; 497 names do not seem to be present at all, indicating a high ratio of synonyms in the set. One problem that cannot be handled at this early stage is the presence of gene names which relate to different genes at the same time. With transcript/gene data only, we cannot decide which of these genes should be truly considered, indicating difficulties in finding proteins or confirming the triplet information for all of the genes in the following steps.

3 https://genome.ucsc.edu/goldenPath/help/mysql.html
4 http://hgdownload.cse.ucsc.edu/downloads.html


3.1.3 Mapping genes to proteins and their domains

For protein information, we use the well established UniProt Knowledgebase (UniProtKB) [8]. This database consists of two main sections: UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. The former section, henceforward referred to as Swiss-Prot, covers manually curated proteins and annotations with information being derived from literature and expert-evaluated computational analysis. The second section, henceforth called TrEMBL, hosts automatically annotated protein entries. Since UniProtKB maintains cross-references to several external databases (including the UCSC Genome Browser), it is a good starting point for gathering protein-related information.

UCSC genes and UniProtKB protein entries can be connected by using cross-references from either of the two databases. In this approach, we use the references provided by the table mm10.kgXref (UCSC Genome Browser database) as they seem to be more abundant: Using UniProtKB cross-references, only 27,123 out of 59,121 UCSC mouse transcripts could be linked to protein entries. Those references can be fetched through UniProtKB’s REST interface⁵ as described in the corresponding documentation [52], for instance. With mm10.kgXref, on the other hand, 40,176 transcripts can be linked to proteins, with all of the aforementioned references still being included and after replacing or removing outdated identifiers. For the latter task, the UniProt ID Mapping tool [53] comes in handy. In the case of converting from UniProtKB AC/ID to UniProtKB ID, each given accession number or ID gets mapped to the most current identifier, if the corresponding protein has not been deleted in the meantime. The mapping tool can also be accessed programmatically through the REST interface [52].

Regarding our list of known chromatin remodelers, 1827 out of 1909 UCSC genes can be connected to a protein, 1311 of which to a reviewed one (Swiss-Prot). The loss of 82 entries supports the assumption that a certain number of genes has been included unintentionally in the previous step, due to ambiguous names.

Although we have successfully identified a considerable number of relevant genes and proteins, we cannot yet assign the provided chromatin remodeler types, as our data set still lacks an essential piece of information: the protein domains. Since domain references are given as Pfam identifiers in the initial input file, we use the Pfam information to represent the triplet structure in our local database as precisely as possible. Domains and proteins can be connected using both Pfam’s and UniProtKB’s cross-references (see Section 3.1.4 for a more detailed description of the sources). Finally, the chromatin remodeler types can be assigned.

As assumed, the main problem in assigning remodeler types is that for several proteins the given domains cannot be confirmed by UniProtKB. This is most likely because of the ambiguity of gene names (as described in the previous section) and the inclusion of proteins where the domain of interest is yet to be annotated or simply missing (maybe due to protein isoforms or revised annotations). Finally, from the original set of 4148 gene names only 1216 UCSC genes remain, corresponding to 1574 proteins in the whole UniProtKB and 800 in Swiss-Prot (also reducing the number of genes to 800), respectively. The further reasons behind this data loss are discussed in more detail in Section 4.1.1.

5 http://uniprot.org/uniprot/?query=organism:10090&columns=id,database(ucsc)&format=tab

3.1.4 Creating a local database

With the relevant proteins and domains being identified, we can focus on creating a local database that keeps all the information we might want to use for building a prediction model for different chromatin remodeler types. We have already described how to gather gene and transcript data from the UCSC Genome Browser database. In the same manner, we can extend the data set by additional annotations present in the UCSC database, such as the actual DNA sequence parts related to the respective transcripts (including exon positions, coding regions and details about CpG islands⁶). In the previous section, we have also described how to connect UCSC genes to UniProtKB proteins and Pfam domains. Similarly, we can also include additional annotations present in UniProtKB, e.g. GO terms [10], which describe the molecular function and the expected cellular location of the protein within the related organism as well as the biological processes it is involved in.

For our local database we use PostgreSQL 9.2 [54]. A compact version of the table structure is shown in Figure 3.2; the full model is available in the Appendix (Figure A.1). Our database consists of five main sections (represented as table schemas), corresponding to the sources we use for the respective tables: UCSC (mm10), UniProt, Pfam, GO and RemoDB. Unless otherwise stated, data is extracted from flat files downloaded from the respective FTP servers. The full list of source files is given in the Appendix (Table A.1).

The mm10 schema covers all gene and transcript-related information gathered from the mm10 assembly in the UCSC Genome Browser database. The most important table here is mm10.transcripts as it serves as the central connection point to almost all of the other UCSC tables. The mm10.genes table contains all transcript cluster IDs and mm10.aliases holds all gene names (connected to transcripts) we have collected in Section 3.1.2. In addition, we store chromosome data in the mm10.chromosomes table (including the sequences) and the corresponding CpG islands in the mm10.cpgislands table. We also include the same DNA composition features that are used in EpiExplorer [39, 55] (mostly nucleotide and dinucleotide frequencies), computed directly from the sequence.
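As an illustration of such composition features, the following sketch computes relative nucleotide and dinucleotide frequencies for a DNA sequence. The function name and the counting of overlapping dinucleotides are assumptions; the exact feature definitions used by EpiExplorer are not reproduced here.

```python
from collections import Counter


def composition_features(sequence):
    """Relative nucleotide and (overlapping) dinucleotide frequencies."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    features = {}
    # Single-nucleotide frequencies relative to the sequence length.
    mono = Counter(seq)
    for base in "ACGT":
        features[base] = mono[base] / len(seq)
    # Dinucleotide frequencies relative to the number of adjacent pairs.
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_pairs = max(len(seq) - 1, 1)
    for first in "ACGT":
        for second in "ACGT":
            features[first + second] = pairs[first + second] / n_pairs
    return features


features = composition_features("ACGCG")
print(features["C"], features["CG"])
```

Each sequence is thus mapped to a fixed-length vector of 4 + 16 values, which can be stored alongside the transcript and used as numeric features.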

6 See Section 2.1.3 for a description of CpG islands.


Figure 3.2: Compact model of the local database.

Gene ontology (GO) data is kept in the go schema and represents just the available set of GO terms (go.ontologies) along with their relations among each other (go.ontologyrelations), independent of any protein. The terms are grouped into categories (biological process, cellular component or molecular function) and further organized in a hierarchy of different relation types, such as is a (i.e., is a subtype of), part of and regulates [56]. For the sake of simplicity, both categories and relation types are stored as integer numbers in our tables.

The pfam schema mainly concentrates on general information on protein domains (pfam.domains) and their clans (pfam.clans), e.g. names, descriptions and Pfam identifiers. In the Pfam database, domains belong to the same clan if they share a common biological origin [57]. Since a membership is not mandatory, it is represented as an optional foreign key relation in our database. The pfam.domainlocations table contains domain positional data on UniProtKB protein sequences, as provided by Pfam itself (that is, start and end indexes, resulting implicitly in the number of domain occurrences per protein). However, this data is only available for a few protein domains.

Most protein information can be found in the uniprot schema. Its main table uniprot.proteins represents UniProtKB protein entries (sometimes unifying multiple isoforms in a single entry), their sequences and their curation status (manually curated or automatically annotated). In addition, we store identifiers of the UniRef clusters [58] each protein belongs to, for 100%, 90% and 50% sequence identity, respectively. In other words, entries with a sequence identity level of at least 50% belong to the same UniRef50 cluster, and so on. This data is utilized in a later step to filter proteins with a high sequence similarity. Analogously, we add cluster information for 30% and 20% sequence identity that we computed ourselves. The exact methods used for clustering are explained in Section 3.2.2. The remaining tables of the uniprot schema represent cross-references to GO (uniprot.proteinontologies) and Pfam (uniprot.proteindomains). The latter table contains the more general relation between proteins and domains in comparison with pfam.domainlocations, as neither positional information nor numbers of occurring domains are included. GO annotations are provided by the UniProt GO annotation project [59]. In contrast to the usual data acquisition from flat files, we gather most UniProtKB and UniRef information, including Pfam cross-references, via UniProt’s REST interface. Table 3.3 shows the column names used. Nevertheless, this data can also be downloaded and processed like the other flat files by simply appending &compress=yes to the URL.

UniProtKB
Column          Description
id              Protein ID
reviewed        Curation status
protein names   Protein names
sequence        Protein sequence
database(pfam)  Cross-references to Pfam

UniRef
Column          Description
id              Cluster ID
members         Protein IDs of cluster members
identity        Identity level (100/90/50%)

Base URLs for mouse proteins:
UniProtKB: http://uniprot.org/uniprot/?query=organism:10090&format=tab&columns=
UniRef: http://uniprot.org/uniref/?query=organism:10090&format=tab&columns=

Table 3.3: Column names used with UniProt’s REST interface.
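How the base URLs and column names from Table 3.3 combine into a request URL can be sketched as follows. The helper name is hypothetical, and the URL layout is the one given in the text; it may no longer match the interface currently offered by UniProt.

```python
from urllib.parse import urlencode


def uniprot_query_url(columns, dataset="uniprot", compress=False):
    """Build a query URL for mouse proteins (organism 10090).

    `dataset` is 'uniprot' or 'uniref'; `columns` are names as in Table 3.3.
    """
    params = [
        ("query", "organism:10090"),
        ("format", "tab"),
        ("columns", ",".join(columns)),
    ]
    if compress:
        params.append(("compress", "yes"))
    # Keep ':', ',', '(' and ')' readable, as in the URLs shown in the text.
    return "http://uniprot.org/" + dataset + "/?" + urlencode(params, safe=":,()")


print(uniprot_query_url(["id", "reviewed", "database(pfam)"]))
print(uniprot_query_url(["id", "members", "identity"], dataset="uniref", compress=True))
```

Spaces in column names such as "protein names" are encoded as "+" by urlencode, which is a valid representation inside a query string.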

The last schema, remodb, contains data extracted from the input file. All available chromatin remodeler types are contained in the remodb.rtypes table, their protein associations in the remodb.proteinrtypes table. We preserve the triplet structure by connecting chromatin remodeler types to uniprot.proteindomains entries, that is, tuples of proteins and domains. While in this case we utilize a single input set, an additional table remodb.inputsets is used in order to be able to represent different input sets at the same time.


3.2 Data preprocessing

In this section, we focus on the preprocessing of the data sets used for building a reasonable prediction model. Our key piece of information is the triplet association retrieved from the list of chromatin remodelers. Therefore, switching from UCSC genes to protein information seems to be a reasonable starting point. However, not all of the proteins in our database should be used for learning a prediction model, as many entries include uncertain information, such as automatically generated annotations. While unclear and missing chromatin remodeler type assignments have already been removed during data acquisition, redundant and very similar protein entries still pose a problem for learning a reasonable prediction model.

3.2.1 Unreviewed proteins and predicted annotations

The first set of problems originates in the presence of unreviewed proteins and predicted annotations in our database. This applies to the proteins and their annotations, as extracted from external sources, as well as to the expert-generated chromatin remodeler type assignments.

The protein annotations in our database were extracted from UniProtKB. As described in Section 3.1.3, the credibility of the proteins in UniProtKB is defined by the section they belong to (Swiss-Prot or TrEMBL). The chromatin remodelers we have identified in the data acquisition step belong to either section, that is, our current data set is made of both reviewed and unreviewed proteins. This fact might be problematic with regard to building an accurate prediction model, as we have to deal not only with missing annotations but also with erroneously assigned ones. As no protein database can be considered complete in terms of an “ultimate truth”, we always have to expect missing annotations, no matter the curation level. For Swiss-Prot, however, we assume erroneous assignments to be considerably less likely. For this reason, we exclusively select proteins from this section as training data for the task of building prediction models.

Automated data extraction might occur on several levels of the curation process. Aside from the proteins themselves, their annotations can also be obtained by automated methods. Therefore, GO annotations in UniProtKB carry a so-called evidence code, i.e., a two- or three-letter word reflecting the kind of evidence used for justifying the assignment. The main categories are experimental, computational analysis, author statements, curatorial statements and inferred from electronic annotation [60]. In contrast to Pu et al. [6], where only GO terms with the evidence codes IDA (inferred from direct assay), IPI (inferred from physical interaction), IGI (inferred from genetic interaction) and IMP (inferred from mutant phenotype) are considered, we exclude only automatically assigned terms (IEA) and those marked with ND (no data)⁷ in the first step.

             UniProtKB              Swiss-Prot
Clustering   all    non-predicted   all    non-predicted
None         1574   1486            800    739
UniRef100    1442   1462            800    739
UniRef90     1297   1315            799    738
UniRef50     1143   1160            764    704*
Clus30       696    648             574    529*
Clus20       560    522             480    443*

Table 3.4: Summary of input data – derived proteins. The numbers in the table represent the number of proteins for different levels of protein similarity (sequence identity) and chromatin remodeler type certainty (all vs. non-predicted types). The numerical suffix of each cluster name corresponds to the sequence identity level used with UniRef (UniRef*) or PSI-CD-HIT (Clus*). Numbers marked with an asterisk (underlined in the original layout) denote the data sets selected for prediction.

Similarly, we only take Pfam-A domains into account, rather than the automatically generated Pfam-B ones (these are not included in our database), as well as non-predicted chromatin remodeler types from the input file.
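The two evidence-code policies discussed above, the exclusion list used here and the stricter inclusion list of Pu et al. [6], can be contrasted in a short sketch. The tuple layout and the protein identifier are assumptions made for illustration; the GO terms are real, but their assignment to the toy protein is invented.

```python
EXCLUDED_CODES = frozenset({"IEA", "ND"})               # policy of this thesis
STRICT_CODES = frozenset({"IDA", "IPI", "IGI", "IMP"})  # policy of Pu et al. [6]


def filter_annotations(annotations, allowed=None, excluded=EXCLUDED_CODES):
    """Keep (protein, GO term, evidence code) tuples that pass the policy.

    If `allowed` is given, only these codes are kept (inclusion list);
    otherwise every code except those in `excluded` is kept.
    """
    if allowed is not None:
        return [a for a in annotations if a[2] in allowed]
    return [a for a in annotations if a[2] not in excluded]


# Toy annotations for a hypothetical protein "PROT_A".
annotations = [
    ("PROT_A", "GO:0005634", "IDA"),  # nucleus, experimental evidence
    ("PROT_A", "GO:0003677", "IEA"),  # DNA binding, electronic annotation
    ("PROT_A", "GO:0008150", "ND"),   # root term biological process, no data
    ("PROT_A", "GO:0005634", "ISS"),  # nucleus, sequence similarity
]

print(filter_annotations(annotations))
print(filter_annotations(annotations, allowed=STRICT_CODES))
```

The exclusion list retains computationally inferred but curator-reviewed annotations such as ISS, which the stricter inclusion list would discard.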

3.2.2 Redundant protein information

Besides uncertain data, we have to take redundant information in our data set into account. UniProtKB contains a lot of redundant protein data, such as multiple entries for the same or very similar proteins and for sub-fragments. Including these entries could bias our prediction models and increase the time necessary to build them. Moreover, it has been shown that even a reduced protein space can provide the same amount of biological information when only one representative per cluster of proteins with at least 50% sequence identity is considered [61].

In order to identify similar entries, UniProt offers its UniRef database [58] which consists of three parts: UniRef100, UniRef90, and UniRef50. These parts combine proteins with 100%, 90% or 50% sequence identity, respectively. They are generated using the popular CD-HIT algorithm [62, 63], which allows protein-protein comparison in a greedy incremental fashion (based on simple short word filtering) and thus achieves feasible runtimes on large data sets such as UniProtKB or PDB [64].

7 The ND evidence code is exclusively used for the root terms biological process, cellular component and molecular function, present in almost all protein records.


             Representatives chosen by major
Clustering   Sequence length   Number of GO annotations
UniRef50     UniRef50SL        UniRef50GO
Clus30       Clus30SL          Clus30GO
Clus20       Clus20SL          Clus20GO

Table 3.5: Naming scheme for the data sets.

As CD-HIT cannot handle identity levels below 40%, a variant called PSI-CD-HIT [65] can be applied instead, which uses BLAST [66] rather than word filtering for the similarity calculation.

For each protein in our local database, we store its membership in all three UniRef reference clusters as well as two additional groups for mutual sequence identity of 30% (Clus30) and 20% (Clus20), built with PSI-CD-HIT. Table 3.4 shows the total number of epigenetics-related proteins for different similarity levels. For the initial iteration of prediction model building, we select proteins that belong to the UniRef50, Clus30, and Clus20 clusters. Hence, in combination with the considerations regarding predicted protein annotations, as explained in the previous section, only 443 to 704 proteins remain.

One general problem with aggregating protein entries is that we need to choose a representative for each cluster. (PSI-)CD-HIT automatically selects the representative protein by sequence length, i.e., the one with the longest sequence. However, carrying the most amino acids does not necessarily correspond to a rich informational content in terms of curated annotations. Therefore, UniRef follows a slightly different set of rules for selecting representatives, with sequence length being the least important criterion after curation status (reviewed or unreviewed), a meaningful protein name⁸ and a model species annotation [58]. While a UniRef cluster representative might belong to any available species, we solely consider mouse proteins. Consequently, we cannot simply apply all of the aforementioned rules to our own database. For the sake of comparison, we build two independent sets of representatives chosen by modified UniRef rules. Our main criterion for both sets is membership in Swiss-Prot⁹, followed by the largest number of GO annotations for the first set and the longest sequence for the second set. Table 3.5 gives an overview of the created data sets, along with the names used henceforth.

8 The UniRef documentation does not offer an exact definition of the word meaningful.

9 Swiss-Prot proteins should be expected to have a meaningful name, anyway.
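The two modified selection rules can be written as a simple ordering over the cluster members. The following is a minimal sketch; the record layout and field names are hypothetical and do not reflect the schema of our database.

```python
from collections import namedtuple

# Hypothetical record layout, for illustration only.
Protein = namedtuple("Protein", "accession in_swiss_prot go_count seq_length")


def pick_representative(cluster, criterion="go"):
    """Prefer Swiss-Prot members, then apply the secondary criterion."""
    if criterion == "go":
        key = lambda p: (p.in_swiss_prot, p.go_count)
    elif criterion == "length":
        key = lambda p: (p.in_swiss_prot, p.seq_length)
    else:
        raise ValueError("criterion must be 'go' or 'length'")
    return max(cluster, key=key)


cluster = [
    Protein("A", True, 12, 300),
    Protein("B", True, 5, 450),
    Protein("C", False, 40, 900),
]
rep_go = pick_representative(cluster, "go")          # *GO data sets
rep_length = pick_representative(cluster, "length")  # *SL data sets
```

In this toy cluster, the unreviewed entry C is never chosen, although it has both the most GO annotations and the longest sequence, because Swiss-Prot membership takes precedence.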


Category                     Value type    Features
DNA composition              Numeric             10
Pfam domains                 Binary             306
Pfam clans                   Binary              64
GO terms                     Binary            3548
Chromatin remodeler types    Binary               5
Total:                                         3933

Table 3.6: Data set summary statistics. The table shows the number of features per category as well as their value types.

3.3 Data representation

Model building tools require the data to be available in some data format (cf. Section 2.3). Many of these tools work with simple, table-like flat files in which the rows represent the instances (in this case proteins) and the columns represent their features, here the corresponding annotations and the five expert-generated chromatin remodeler types. Table 3.6 gives an overview of the different features, as explained in the following paragraphs.

An easy way to include Pfam information is to list each domain as a binary feature and assign true in case of presence in the respective protein or false otherwise. In order to reduce the feature space, only domains that appear in at least one of the proteins are taken into account. Instead of boolean values, the number of occurrences per protein could be used. Since this kind of information is available for a small number of instances only, we rely on the binary features. Similarly, Pfam clans are represented by assigning true whenever at least one clan member is part of the respective protein, otherwise false.

The third and most prominent set of features are the GO terms, which are assigned to the proteins in UniProtKB. Owing to the hierarchical structure of the Gene Ontology, mapping the GO annotations into boolean values (in the same way as Pfam data) would lead to a huge information loss. For example, if we consider a protein annotated with protein complex (GO:0043234) and another one with catenin complex (GO:0016342), which is a subtype of the former, then both proteins share a common characteristic, i.e., both are known to build complexes. However, this fact is not directly evident from the transformed data set, since the second protein is not positive for the feature protein complex (it is not annotated with this term). Therefore, we include all GO terms the proteins are directly associated with, including all of their ancestors in terms of a relation of type is a or part of. The regulation-specific relationships are currently not part of the data set. Adding them would require an additional set of features, clearly separated from the above-mentioned terms, and a special treatment of their hierarchy.
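The ancestor propagation can be sketched as a transitive closure over the parent links. The example below uses only the single is a relation from the text; the helper names and the dictionary layout are assumptions for illustration.

```python
def ancestors(term, parents):
    """All ancestors of a term, following the parent links transitively."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(annotations, parents):
    """Extend each protein's directly annotated terms by all their ancestors."""
    expanded = {}
    for protein, terms in annotations.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, parents)
        expanded[protein] = full
    return expanded


# Toy hierarchy: catenin complex (GO:0016342) is_a protein complex (GO:0043234).
# In practice, `parents` would hold all is_a and part_of links.
parents = {"GO:0016342": ["GO:0043234"]}
annotations = {
    "protein_1": {"GO:0043234"},
    "protein_2": {"GO:0016342"},
}
features = propagate(annotations, parents)
```

After propagation, both toy proteins are positive for the feature protein complex, which makes their shared characteristic visible in the flat representation.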


[Bar chart; y-axis: number of instances; x-axis: Eraser, Mediator, Modifier, Reader, Remodeler. Bar labels in this order: No clustering 20, 230, 110, 357, 55; UniRef50GO 20, 221, 107, 336, 53; Clus30GO 14, 140, 86, 273, 41; Clus20GO 9, 117, 73, 230, 37.]

Figure 3.3: Chromatin remodeler type distribution across different Swiss-Prot data sets.

Furthermore, an important thing to consider is that the GO terms cover a large spectrum of biological descriptions, including numerous terms for epigenetic activity: above all, Chromatin modification (GO:0016568) and its 135 more specific child terms as well as Chromatin silencing (GO:0006342) and its nine children. However, we do not use them in order to generally determine whether or not a protein is a chromatin remodeler (in contrast to Pu et al. [6]). Also, we cannot expect currently unknown epigenetics-related proteins, i.e., the ones we aim to predict, to be annotated with these GO terms, so it would be problematic to include them in our feature space¹⁰. We therefore exclude these GO terms in advance.

With UniProt, Pfam and GO being covered, only DNA level information extracted from UCSC is yet to be transformed. Unfortunately, a single protein can be the translation product of several transcripts. Furthermore, some information is only available for genomic regions rather than actual genes or transcripts, e.g., CpG islands. The problem here is how to represent this kind of data. One way to handle it is using Inductive Logic Programming [67] for the prediction task, as it is able to work directly with relational data. The whole GO hierarchy could then be included without the need for the flat representation explained above. Another way is the so-called multi-instance classification, where each instance is an aggregation of multiple instances. In this respect, each protein is replaced by all of its transcripts. Unfortunately, special tools are required to build a prediction model on this data and, moreover, the combination of transcript and protein features can lead to a huge overhead in the data set, since the same protein information is assigned to each transcript. A third way is to aggregate transcript-level features only, e.g., by computing the transcripts' average guanine content and using it as a feature on protein level. The latter allows us to consider the DNA composition features without the need to limit ourselves to specific prediction tools. While the DNA composition is straightforward to include, considering the CpG island information is more difficult. More specifically, it requires a suitable definition to determine whether a given genomic region belongs to a transcript (e.g., complete inclusion, overlap or presence in some kind of neighborhood). Therefore, we decided to neglect the CpG island information.

10 See Section 4.1.3 for a correlation analysis.

                         Chromatin remodeler type
Proteins      Eraser   Mediator   Modifier   Reader   Remodeler
Erasers           20          0          0        3           0
Mediators          0        230          1        8           7
Modifiers          0          1        110        8           1
Readers            3          8          8      357           5
Remodelers         0          7          1        5          55

Table 3.7: Co-occurring chromatin remodeler types in the unclustered Swiss-Prot data set.
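The third option discussed above, aggregating transcript-level features to the protein level, can be sketched for the guanine content as follows. The function names are assumptions, and the plain arithmetic mean over transcripts is one possible aggregation among several.

```python
def guanine_content(sequence):
    """Fraction of guanine (G) in a nucleotide sequence."""
    seq = sequence.upper()
    return seq.count("G") / len(seq) if seq else 0.0


def mean_guanine_content(transcripts):
    """Average guanine content over all transcripts of one protein."""
    if not transcripts:
        return None  # no transcript known: leave the feature undefined
    return sum(guanine_content(t) for t in transcripts) / len(transcripts)


# Two toy transcripts encoding the same (hypothetical) protein.
protein_feature = mean_guanine_content(["GGAA", "GATC"])
```

Each protein thus contributes exactly one row to the data set, regardless of its number of transcripts.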

Finally, the five chromatin remodeler types are represented as separate binary features, which can be predicted independently¹¹. Figure 3.3 summarizes the chromatin remodeler type distribution for the whole Swiss-Prot data set and the three representative sets UniRef50GO, Clus30GO, and Clus20GO. Due to the fact that each protein can belong to multiple types, the distribution of the alternative representative sets (constructed from longest protein sequence) differs slightly (see Figure A.2 in the Appendix). In addition, the chromatin remodeler type co-occurrences are shown in Table 3.7.

3.4 Statistical learning methods

In this section, we focus on the statistical learning methods we use for building the prediction models (cf. Section 2.3). With the data sets being prepared, we have to choose the target feature we want to predict. As described in the previous section, the chromatin remodeler types are represented as five separate binary features. In general, it is possible to designate multiple features as target (in this context referred to as labels) and predict them at the same time. There are different ways to handle this so-called multi-label scenario¹²: The most basic one is the binary relevance method (BR), a problem transformation approach where a separate model is trained for each binary label [68, 69], assuming label independence. Its advantage is that it can be used with any binary classification method without suffering from large runtimes. However, since we are interested in the performance of the prediction models for each type individually, we do not perform actual multi-label classification but rather five separately evaluated binary classification tasks. In each task, we select one of the chromatin remodeler type features as target and remove the remaining ones from the feature space. Nevertheless, we will henceforth refer to the five types as labels.

11 A single feature covering all of the five possible types would not be accurate, as each protein can be assigned to more than one chromatin remodeler type.

12 Not to be confused with multi-class classification, which describes a single target variable with more than two possible values.

As shown in Figure 3.3, the data sets cover only proteins that are known to be related to epigenetics. In other words, our data set is composed of only positive examples. Due to the nature of the proteins, negative examples, in our case proteins which are not epigenetically relevant, are difficult to confirm experimentally [70]. However, when learning to predict one of the available labels (according to BR), all proteins of the corresponding chromatin remodeler type serve as positive examples, while the remaining ones serve as negative examples.
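The construction of the five binary tasks can be sketched as follows. The row layout (one dictionary per protein) and the lower-case label names are assumptions for illustration.

```python
LABELS = ["eraser", "mediator", "modifier", "reader", "remodeler"]


def binary_task(instances, target):
    """Keep `target` as the class and drop the other four labels from the features."""
    task = []
    for row in instances:
        features = {k: v for k, v in row.items() if k not in LABELS}
        task.append((features, bool(row[target])))
    return task


rows = [
    {"PF00023": True, "eraser": False, "mediator": False,
     "modifier": False, "reader": True, "remodeler": False},
    {"PF00023": False, "eraser": True, "mediator": False,
     "modifier": False, "reader": True, "remodeler": False},
]
reader_task = binary_task(rows, "reader")
eraser_task = binary_task(rows, "eraser")
```

The second toy protein is a positive example in both tasks, which a single multi-valued target feature could not express.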

For building the prediction models, we use version 3.7.11 of the well-known Weka library [30]. Among others, it offers a large number of different classification, optimization and evaluation methods. Data is read from files in Weka's own ARFF format, which uses a table-like representation as described in Section 3.3 and hence is almost trivial to parse and write out.
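To illustrate how simple the format is, the following sketch renders a small table as ARFF text. It covers only numeric and binary attributes with plain names (no quoting, no missing values), and the function and attribute names are assumptions.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    attributes: list of (name, kind), kind being "numeric" or "binary".
    rows: list of value lists in attribute order.
    """
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        arff_type = "numeric" if kind == "numeric" else "{false,true}"
        lines.append("@attribute %s %s" % (name, arff_type))
    lines += ["", "@data"]
    for row in rows:
        cells = []
        for value in row:
            if isinstance(value, bool):
                cells.append("true" if value else "false")
            else:
                cells.append(str(value))
        lines.append(",".join(cells))
    return "\n".join(lines) + "\n"


text = to_arff(
    "proteins",
    [("guanine_content", "numeric"), ("PF00023", "binary"), ("reader", "binary")],
    [[0.25, True, False]],
)
```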

3.4.1 Classifiers

For building prediction models, we use two different classification methods: support vector machines and random forests. A summary of the classifiers' object structure is given in Figure 3.4 and will be further explained in the following paragraphs. A complete version of the figure, including the full configuration of the parameters, can be found in the Appendix (Figure A.3).

Our first classification method uses a support vector machine (SVM) [71] as its base classifier, trained with the LibSVM library [72], which is available in Weka through a wrapper class [73]. The basic idea behind support vector machines is to linearly separate the instances of the training data into two classes. For this purpose, the SVM constructs a hyperplane, such that there are only instances of the same class on each side of the plane (with as few exceptions as possible) and that there is the largest possible margin around the linear boundary, where no instance is located. Since most training sets contain data that is not linearly separable, the instances can be mapped into a very high-dimensional feature space using a kernel function, where such a hyperplane is usually easier to construct [71].

With some adjustments, the method's setup is adapted from Pu et al. [6], hence we also focus on the classic regularized support vector classification (C-SVC) with a Gaussian radial basis function (RBF) kernel. Feature selection is done by using BestFirst, a greedy hill climbing algorithm augmented with a backtracking facility [30, p. 492], in combination with the CfsSubsetEvaluator, which evaluates the predictive ability of each feature separately along with the grade of redundancy within the subset [74].

[Object diagram. (a) :GridSearch with classifier :LibSVM and filter :AttributeSelection, the latter with search :BestFirst and evaluator :CfsSubsetEvaluator. (b) :GridSearch with classifier :RandomForest and filter :AllFilter.]

Figure 3.4: Object diagram of the parameter-tuned classifiers: (a) Support vector machine with feature selection (b) Random forest. The names correspond to actual Weka classes.

The second classification method is a random forest (RF) approach [75], which implements bagging of several decision trees. Compared to the SVMs, random forests offer a considerably smaller number of parameters that have to be specified by the user. Due to the nature of the method, separate feature selection is not necessary. It has been shown that bagging is quite effective on decision trees, which tend to be noisy when used alone. On the other hand, they tend to have a relatively low bias when grown sufficiently deep [28, pp. 587–588].

For the parameter tuning of both approaches, we utilize Weka's own GridSearch implementation, which finds the best pair of classifier or filter parameters according to a specified performance measure, such as accuracy [30, p. 478] (see Section 3.4.2). In the SVM case, we optimize the gamma value of the Gaussian kernel and the cost parameter of the SVM. The feature selection is applied as a filter during the grid search. In the RF case, only the number of randomly selected features is subject to optimization. However, as the grid search requires exactly two parameters to be optimized, we also consider the number of trees here. In contrast to the number of randomly selected features, we specify only two possible values for the trees: the desired value (100) and a considerably smaller one (5). In our experiments, a higher number of trees was usually preferred, hence the grid search can still be applied in this case. Since the creation of a random forest on our data for a small number of trees takes only a few seconds, the optimization is affordable. AllFilter (Figure 3.4b) just lets all instances pass through and thus serves as a filter in the RF approach, as the grid search implementation requires both a classifier and a filter to be specified.
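The principle of a two-parameter grid search can be sketched independently of Weka. In the example below, `toy_score` is a hypothetical stand-in for a cross-validated performance estimate, and the parameter ranges are chosen for illustration only; they are not the ranges used in our experiments.

```python
from itertools import product


def grid_search(values_a, values_b, evaluate):
    """Return (best_score, a, b) over all combinations of two parameter lists."""
    best = None
    for a, b in product(values_a, values_b):
        score = evaluate(a, b)
        if best is None or score > best[0]:
            best = (score, a, b)
    return best


def toy_score(gamma, cost):
    """Hypothetical performance surface with its optimum at gamma=0.01, cost=10."""
    return -abs(gamma - 0.01) - abs(cost - 10)


gammas = [10 ** e for e in range(-4, 1)]  # 0.0001 ... 1
costs = [10 ** e for e in range(-1, 3)]   # 0.1 ... 100
best = grid_search(gammas, costs, toy_score)
```

For the random forest, the same scheme applies with the number of randomly selected features as the first list and the two tree counts (5 and 100) as the second.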


3.4.2 Model evaluation and validation

Weka offers numerous performance measures for prediction models. In this thesis, we mainly focus on the accuracy, recall, precision, and F-measure, in order to compare our results with those by Pu et al. [6]. The four performance measures are defined by

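\[
\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}, \qquad \text{Recall} = \frac{TP}{TP + FN},
\]
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{F-measure} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}},
\]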

with TP and TN being the number of true positives/negatives, and FP and FN being the number of false positives/negatives, respectively. The accuracy represents the overall ratio of correctly predicted instances. Recall and precision focus on the true positives. The harmonic mean between those two values is called F-measure [76]. In the following, we describe the three different approaches to calculate the performance measures.
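As a worked example, the four measures can be computed directly from the confusion counts. Returning 0.0 for an undefined ratio (zero denominator) is a convention assumed here, not one taken from Weka.

```python
def performance(tp, fp, tn, fn):
    """Accuracy, recall, precision and F-measure from the confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f_measure": f_measure,
    }


# Imbalanced toy case: 10 positives among 100 instances.
scores = performance(tp=8, fp=2, tn=88, fn=2)
```

In this toy case the accuracy (0.96) is dominated by the many true negatives, while recall, precision and F-measure (0.8 each) reflect the performance on the positive class.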

The most basic approach for this purpose, the computation of the resubstitution error, is applied right after building the prediction model on the full training set, by predicting the very same instances and comparing the result to the true label. While not being overly accurate with regard to the true error rate on unseen data [30, pp. 148–149], it is a good indicator for implementation errors as well as problems with the training data. Thus, we use the resubstitution error for this purpose only, not to assess the models or the methods presented in this thesis.

The second set of performance measures is computed during a k-fold cross-validation on the full training set [28, pp. 241–245]¹³. Five- or tenfold cross-validation presents a good compromise between prediction bias and variance [77, 78]. In order to reduce runtime, we opt for five folds. Sample selection bias can be further reduced by splitting training and test set in a stratified way [78], that is, maintaining roughly the same distribution of labels among the two sets. Fortunately, Weka's Evaluation class applies stratification automatically when encountering nominal/binary class values. We use cross-validation to assess the whole classification method rather than the final model, as the latter is not involved in the actual evaluation step [79].
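Stratification can be sketched as dealing the instances of each class evenly over the folds. This is a minimal illustration of the idea and not Weka's implementation; the function name and the round-robin scheme are assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign instance indices to k folds with roughly equal label proportions."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_label.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of one class round-robin over the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


labels = [True] * 10 + [False] * 40
folds = stratified_folds(labels, k=5)
```

Each of the five folds then contains two positive and eight negative instances, mirroring the 1:4 ratio of the full set. With k = 5, a single fold holds about 20% of the instances.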

In order to estimate its predictive accuracy [79], the final model is subject to an external test set validation. For this purpose, the labeled data is split into two disjoint sets: 80% are used for training, the remaining 20% for evaluation. Again, stratification is used to reduce the sample selection bias.

The process of building and validating a prediction model is summarized in Figure 3.5. We utilize the UniRef50, Clus30, and Clus20 protein sets as labeled data, while the parameter-tuned SVMs and random forests serve as classifiers (cf. Section 3.4.1). As evident from the figure, the final prediction model is built on the full training set.

13 This is done in addition to the cross-validation performed as part of the grid search.

[Workflow diagram with the elements: Labeled data (80% / 20%), Classifier, Prediction Model, and the validation steps k-fold cross-validation, External test set validation, and Resubstitution.]

Figure 3.5: Workflow for validating a prediction model. Validation steps are represented by rectangles with rounded corners. Thick lines denote data being used for training, dashed ones for test. Lines with both styles represent data directly involved in cross-validation, thus partly serving as both training and test set.

3.5 Post hoc data/model analysis

In this section, we focus on the analysis methods we apply in order to confirm the credibility of our trained models. More specifically, correlation analysis can give us an answer to whether an exceptionally good model performance is the result of high correlations between the labels and particular subsets of the features in our data.

3.5.1 Feature importance

As the name suggests, feature importance allows us to draw conclusions about the relations between specific features and the labels. While the feature selection during the training process tries to find the most promising feature subset, we are now interested in features that are extraordinarily important when taken in isolation. For this purpose, the RandomForestClassifier from the Python module scikit-learn [80] is used, as Weka's implementation does not grant access to the importance values. We let the classifier compute 100 trees (n_estimators=100) with entropy serving as criterion for the node splitting in the trees (criterion='entropy'). Furthermore, we set the number of randomly selected features (max_features) to the optimal value determined by the grid search during the training of the respective model. The remaining parameters are used with default values. After building the model, the feature importances can be accessed through one of the classifier's public attributes (feature_importances_). The values are represented as normalized floating point numbers in the range [0, 1] that sum to one, with high values representing more important features.

3.5.2 Domain co-occurrences

Domain co-occurrences play an important role in our experiments as we are interested in finding new domain combinations that serve an epigenetic function. Furthermore, they might affect the performance of the prediction models when observed in (almost) exclusive correlation with the labels. Although we extract the existing domain co-occurrences directly from the data sets, we use a slightly modified version of our prediction model building process to analyze the influence of a single domain feature on the instances with a particular label. This process is performed in three steps, as follows. First, one label is selected and all corresponding negative instances are removed from the data set. In the second step, the domain feature of interest is selected and all affected instances are excluded in order to serve as external test set. After removing the selected domain from the feature space, in the third step we build the prediction model on the remaining instances and validate it accordingly.
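Extracting the existing domain co-occurrences directly from a data set amounts to counting unordered domain pairs per protein. The following sketch assumes a mapping from protein identifiers to domain names; the toy proteins and their domain compositions are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def domain_cooccurrences(protein_domains):
    """Count in how many proteins each unordered pair of domains occurs together."""
    counts = Counter()
    for domains in protein_domains.values():
        # Use a sorted set so each pair is counted once per protein.
        for pair in combinations(sorted(set(domains)), 2):
            counts[pair] += 1
    return counts


# Invented toy data.
protein_domains = {
    "p1": ["PHD", "Chromo"],
    "p2": ["Chromo", "PHD", "SET"],
    "p3": ["SET"],
}
pairs = domain_cooccurrences(protein_domains)
```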

4 Results and Discussion

With a local database being built and data sets being prepared, we can now focus on the results of our study. In the first section, we discuss our observations from the data acquisition and preprocessing steps. The second section presents the validation results on the initial prediction models and the subsequent feature correlation analysis. Finally, the third section presents the validation results on the final prediction models.

4.1 Data acquisition and preprocessing

In this section we show and discuss particular results of the first two workflow steps (cf. Figure 3.1): Data acquisition and organization and Data preprocessing. First, we have a closer look at the parts of the original input data file that could not be used for creating our local database. Subsequently, the options for selecting representatives for the clustered protein sets are analyzed. We finally focus on the epigenetics-related GO terms which were excluded from the data during the preprocessing phase.

4.1.1 Data loss during the creation of the database

The data loss during the creation of the database, as already briefly described in the Material and Methods section, is summarized in Figure 4.1. In the first data acquisition step, we excluded 431 entries (gene names) from the input file: 250 due to the file's inconsistent structure and 181 because of missing chromatin remodeler type assignments. The latter entries were originally extracted from the databases DAnCER [36] and ChromDB [35] (see Section 2.4.1). In contrast to UniProtKB, both of them are specialized on epigenetics-related proteins and thus relevant sources for our study. Unfortunately, we cannot use the corresponding proteins for building a prediction model, as the databases do not provide information on the different types. For the 250 incorrectly formatted gene names in the input file, 278 UniProtKB proteins (128 reviewed ones) can be found, 223 (88 reviewed ones) of which have not yet been considered. However, without deducible domain and chromatin remodeler type assignments, these proteins cannot be used for training either. Nevertheless, we can predict their types once a suitable model has been created.

During the process of mapping gene names to gene entries in the UCSC database, we have found that more than a third of the gene names are synonyms for already covered genes. Among the 497 mismatches (i.e., gene names that could not be associated with a UCSC gene), there are several frequently occurring patterns, as listed in Table 4.1. Many of the names refer to (predicted) genes from different sources, such as the Mouse Genome Database [81] (gm*), NCBI Gene [47] (LOC* [49]) and Havana/Vega [82] (OTTMUSG*). However, the majority can be linked to open reading frames (mcg_*, rp*-*) and cDNA ((m)kiaa* [83] and *rik).

[Flow diagram]
4148 gene names in input file
  431 unclear entries
  3717 clear gene names
    1311 synonyms
    497 mismatches
    1909 UCSC genes
      82 mismatches
      1827 UCSC genes with a known protein
        611 mismatches
        1216 epigenetics-related UCSC genes

Figure 4.1: Data loss summary.

Gene name    Example               #
mcg_*        mcg_12939           130
gm*          gm1758               67
mkiaa*       mkiaa0172            52
*rik         1110033m05rik        26
kiaa*        kiaa1074             15
rp*-*        rp23-376n23.3-002    13
LOC*         LOC100045488          5
OTTMUSG*     OTTMUSG00000016219    5

Table 4.1: Frequent patterns among the gene name mismatches.
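Grouping the mismatched gene names by these patterns can be sketched with shell-style wildcards. The pattern list mirrors Table 4.1 and the example names are taken from it; the order of the checks and the helper name are assumptions.

```python
from collections import Counter
from fnmatch import fnmatchcase

# Patterns as listed in Table 4.1; the first matching pattern wins.
PATTERNS = ["mcg_*", "gm*", "mkiaa*", "*rik", "kiaa*", "rp*-*", "LOC*", "OTTMUSG*"]


def classify(name):
    """Return the first pattern a gene name matches, or None."""
    for pattern in PATTERNS:
        if fnmatchcase(name, pattern):  # case-sensitive matching
            return pattern
    return None


names = [
    "mcg_12939", "gm1758", "mkiaa0172", "1110033m05rik",
    "kiaa1074", "rp23-376n23.3-002", "LOC100045488", "OTTMUSG00000016219",
]
pattern_counts = Counter(classify(name) for name in names)
```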

The loss of 82 UCSC genes during the gene-to-protein mapping can be explained by the ambiguous use of names when referring to actual genes. While this also holds for the subsequent step, when trying to confirm the expected protein domains using UniProtKB and Pfam, the mismatch analysis indicates a problem with the domain assignments. For 2803 protein domain associations, extracted from the input file, only 1652 can be confirmed by the databases. Interestingly, we would find about half (578) of the remaining 1151 associations in UniProtKB and Pfam, if we replaced the expected domain with another member of the same Pfam clan. In other words, instead of the given domains we can find highly similar ones, known to be present in the respective proteins. The comparison of expected and actually present protein domains is shown in Table 4.2. A closer look at their names often shows only a slight difference, for example, observing Ank_2 (PF12796) instead of Ank (PF00023) almost 200 times. In this special case, both entries represent the ankyrin repeat with Ank_2 signaling the presence of three copies in the same protein. Similarly, there exist other Pfam entries for different numbers of copies. For additional examples, expert knowledge is required, as descriptions in Pfam are missing or not leading to a clear solution. Overall, manual curation is required in order to solve the problem with incorrectly or imprecisely assigned domains.
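The distinction between confirmed assignments and same-clan replacements can be sketched as a three-way check. The clan label used below is a hypothetical placeholder rather than an actual Pfam clan accession, and the function name is an assumption.

```python
def check_assignment(expected, actual_domains, clan_of):
    """Classify an expected domain as 'confirmed', 'same clan', or 'unconfirmed'."""
    if expected in actual_domains:
        return "confirmed"
    clan = clan_of.get(expected)
    if clan is not None and any(clan_of.get(d) == clan for d in actual_domains):
        return "same clan"
    return "unconfirmed"


# Hypothetical clan label shared by Ank (PF00023) and Ank_2 (PF12796).
clan_of = {"PF00023": "clan_ankyrin", "PF12796": "clan_ankyrin"}

status_direct = check_assignment("PF00023", {"PF00023"}, clan_of)
status_clan = check_assignment("PF00023", {"PF12796"}, clan_of)
status_none = check_assignment("PF00023", {"PF00400"}, clan_of)
```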


Expected assignment        Actual assignment
Pfam ID  Name              Pfam ID  Name              Occurrences
PF00023  Ank               PF12796  Ank_2             195
PF13639  zf-RING_2         PF00097  zf-C3HC4           74
PF13639  zf-RING_2         PF15227  zf-C3HC4_4         42
PF13639  zf-RING_2         PF13920  zf-C3HC4_3         39
PF13639  zf-RING_2         PF13923  zf-C3HC4_2         24
PF00651  BTB               PF02214  BTB_2              24
PF00023  Ank               PF13857  Ank_5              20
PF08433  KTI12             PF01591  6PF2K              15
PF00023  Ank               PF13637  Ank_4              14
PF00628  PHD               PF13771  zf-HC5HC2H          9
PF13639  zf-RING_2         PF14634  zf-RING_5           7
PF02373  JmjC              PF13621  Cupin_8             7
PF00628  PHD               PF13832  zf-HC5HC2H_2        7
PF05175  MTS               PF13847  Methyltransf_31     6
PF00628  PHD               PF13831  PHD_2               6
PF00567  TUDOR             PF05641  Agenet              5
PF00145  DNA_methylase     PF13659  Methyltransf_26     5
PF00583  Acetyltransf_1    PF13508  Acetyltransf_7      5
PF05175  MTS               PF13659  Methyltransf_26     5
PF00567  TUDOR             PF06003  SMN                 4
PF00176  SNF2_N            PF13307  Helicase_C_2        3
PF13639  zf-RING_2         PF12678  zf-rbx1             3
PF05175  MTS               PF06325  PrmA                3
PF13639  zf-RING_2         PF13445  zf-RING_UBOX        3
PF02373  JmjC              PF08007  Cupin_4             3
PF00400  WD40              PF08662  eIF2A               3
PF00023  Ank               PF13606  Ank_3               3
PF00249  Myb_DNA-binding   PF13921  Myb_DNA-bind_6      3
PF05175  MTS               PF05971  Methyltransf_10     2
PF13639  zf-RING_2         PF14570  zf-RING_4           2
PF00145  DNA_methylase     PF05175  MTS                 2
PF00176  SNF2_N            PF06733  DEAD_2              2
PF00400  WD40              PF11768  DUF3312             2
PF00145  DNA_methylase     PF05063  MT-A70              2
PF08123  DOT1              PF01135  PCMT                2
PF00145  DNA_methylase     PF01170  UPF0020             2
PF00533  BRCT              PF12738  PTCB-BRCT           2
PF13639  zf-RING_2         PF11793  FANCL_C             2
PF00385  Chromo            PF11717  Tudor-knot          2
PF05175  MTS               PF05185  PRMT5               2

Table 4.2: Comparison of expected vs. actual protein domains, as given by the original input file and UniProt/Pfam, respectively. None of the expected assignments can be confirmed by the two databases. Actual protein domains belong to the same Pfam clans as the expected ones. Domain pairs with only one occurrence in our local data are omitted.


[Two-panel plot; x-axis: number of domains and GO terms (0 to 60); logarithmic y-axis (10^0 to 10^2). Panel (a) UniRef50 compares UniRef50SL and UniRef50GO; panel (b) Clus30 compares Clus30SL and Clus30GO.]

Figure 4.2: Comparison of the number of annotations (domains and GO terms) per protein for different data sets.

4.1.2 Cluster representatives

In Section 3.2.2, we have explained how redundant protein information is handled by sequence identity clustering using UniRef [58] and CD-HIT [62]. For the training of prediction models, we have prepared three data sets of representatives, chosen from protein clusters with a mutual sequence identity of 50 (UniRef50), 30 (Clus30), or 20 (Clus20) percent. Each of those sets exists in two versions: In the first case, the representative is selected by the longest sequence (e.g., UniRef50SL), in the second case by the largest number of linked GO terms (e.g., UniRef50GO). Figure 4.2 illustrates the total number of domains and GO annotations for both options, with the same 12 proteins (66 to 243 features) being neglected in both cases.

In UniRef, a long protein sequence is preferred over the number of annotations when designating representatives that provide the most biological information [58]. Nevertheless, this criterion does not always lead to the most richly annotated proteins, as evident from Figure 4.2. In our data sets, we observe steady annotation divergence with decreasing sequence identity thresholds.

It is not surprising that the GO sets are the better choice when opting for a rich feature space, since GO annotations are by far more abundant than domains and contribute a substantial amount to it. On the other hand, this does not mean that the corresponding proteins are generally more valuable than the ones with the longest sequence. Hence, we train our prediction models with both sets. In order to reduce the representative selection bias, resampling could be performed. Due to the associated reiterations, however, this can lead to considerably longer runtimes than with a single representative set.


                       GO annotated proteins [%]
Label       Number     CM     CS   CM or CS   CM and CS
All            739   13.9    0.7       14.1         0.5
Eraser          20   80.0    5.0       80.0         5.0
Mediator       230   11.7    0.0       11.7         0.0
Modifier       110   15.5    1.8       16.4         0.9
Reader         357   12.6    0.6       12.6         0.6
Remodeler       55   36.4    1.8       36.4         1.8

Table 4.3: Ratio of unclustered Swiss-Prot proteins annotated with the GO terms for Chromatin modification (CM; GO:0016568) and Chromatin silencing (CS; GO:0006342), or their respective child terms (is a or part of relations only).

4.1.3 GO terms for chromatin modification

There exist two main GO terms for describing epigenetics-related proteins: Chromatin modification (CM; GO:0016568) and Chromatin silencing (CS; GO:0006342). The CM term indicates the alteration of DNA, protein, or RNA, resulting in a change of the chromatin structure. The CS term indicates the inhibition of transcription caused by a structural change of the chromatin. Moreover, instead of those two main terms, a total of 144 more specific child terms can be used to describe a protein's particular epigenetic function.

In Section 3.3, we have removed CM and CS as well as their children from our data set, as the new proteins we want to predict most likely will not have these terms. Table 4.3 displays the ratio of proteins annotated with the aforementioned GO terms in the unclustered set. Actually, it appears that a high correlation is only present for a small part of the data, although only Swiss-Prot entries are considered. Nearly 80% of the proteins are readers or mediators and as a result only remotely associated with actual chromatin modifications, strictly speaking. Therefore, it is more reasonable to focus on the erasers, modifiers, and remodelers. As evident from the table, only the erasers share a strong connection with the CM and CS GO terms. Modifiers, on the other hand, are on par with the readers and mediators, despite being the main representatives of chromatin modifying proteins. Analogously, only a third of the remodelers seem to be linked to an epigenetic function.

It seems that the CM and CS GO annotations in Swiss-Prot are not currently synchronized with the literature knowledge, or that the definitions of whether a protein is related to epigenetics are too different to find a common denominator.
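The percentages in Table 4.3 count a protein as annotated if it carries the CM or CS term itself or any of their child terms. The following Python sketch illustrates one way such ratios could be computed. It assumes that the GO hierarchy is available as a parent-to-children mapping restricted to is a and part of relations. All child term IDs, protein names, and function names are hypothetical placeholders and are not taken from the pipeline used in this thesis.

```python
def descendants(root, children):
    """Return the root term plus every term reachable through child links."""
    seen = {root}
    stack = [root]
    while stack:
        term = stack.pop()
        for child in children.get(term, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def annotated_ratio(protein_terms, term_set):
    """Percentage of proteins with at least one annotation in term_set."""
    if not protein_terms:
        return 0.0
    hits = sum(1 for terms in protein_terms.values() if terms & term_set)
    return 100.0 * hits / len(protein_terms)


# Toy GO fragment (parent -> children); the child IDs are made up.
children = {
    "GO:0016568": {"GO:CHILD_A", "GO:CHILD_B"},
    "GO:CHILD_A": {"GO:CHILD_C"},
    "GO:0006342": {"GO:CHILD_D"},
}
cm_terms = descendants("GO:0016568", children)
cs_terms = descendants("GO:0006342", children)

# Toy annotation table (protein -> set of GO terms); hypothetical data.
proteins = {
    "P1": {"GO:CHILD_C"},
    "P2": {"GO:CHILD_D", "GO:CHILD_A"},
    "P3": {"GO:OTHER"},
    "P4": set(),
}

ratios = {
    "CM": annotated_ratio(proteins, cm_terms),
    "CS": annotated_ratio(proteins, cs_terms),
    "CM or CS": annotated_ratio(proteins, cm_terms | cs_terms),
}
# "CM and CS" requires a hit in both term sets for the same protein.
both = sum(1 for t in proteins.values() if t & cm_terms and t & cs_terms)
ratios["CM and CS"] = 100.0 * both / len(proteins)
```

Collecting the descendants once per root term keeps the per-protein check a plain set intersection, which is convenient when the same term sets are reused for every label.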


4.2 Initial prediction models and correlation analysis

This section demonstrates the problems that have emerged during the validation of the initial prediction models. Furthermore, the results of the subsequent analysis of the data sets are discussed, with focus on the identification of exceptionally high correlations between some of the features and the labels.

4.2.1 Validation results

With the data and the techniques described in the Material and Methods section, an initial prediction model for each label is created and validated. As evident from Figure 4.3a, the F-measures (and thus the precision and recall values) of these models are remarkably high, sometimes even perfect. This holds true for both SVM and RF models, except for the mediators. The eraser label’s negative peak on the external test set validation can be explained by a test set size of only three instances.

The excellent model performance indicates a possible inclusion of the labels in the feature space or an extraordinarily high correlation between some of the features and the labels. When excluding the former case and other implementation errors, a correlation analysis can provide further valuable information. For this purpose, we retrain the models on subsets of the feature space (Figures 4.3b-d). Apparently, the main reason behind the good performance is hidden within the protein domains and clans. In comparison with the clans, the protein domains lead to consistently remarkable F-measure values with only slight deviations. The less stable performance on the clans is most likely due to the fact that not all proteins in our test sets belong to a clan. The latter is especially true for the remodelers. Moreover, they are also affected by small test set sizes (fewer than ten instances). The GO features, on the other hand, perform moderately, at least when taken alone. Interestingly, mediators and remodelers do not appear to benefit from the inclusion of their GO annotations. In the case of the mediators, this might explain the noticeable differences between the SVM and the RF performance on all features. The random feature selection during the creation of the RF models naturally favors the more abundant GO annotations, while our SVM classifier specifically builds the feature set based on their predictive value (cf. Section 3.4.1).

The validation results on the clustered sets, with protein representatives being selected by the major sequence length, differ only slightly from the ones presented (see Figure A.4 in the Appendix). In the following sections, we further analyze the protein domains and clans in order to clarify their apparently high correlations with the labels.


[Figure 4.3, panels (a) to (d): F-measure per label (Eraser, Mediator, Modifier, Reader, Remodeler) and data set (u50, c30, c20), shown for cross-validation and external test set validation of the SVM and RF models. (a) Models trained on all features. (b) Models trained on domains only. (c) Models trained on clans only. (d) Models trained on GO annotations only.]

Figure 4.3: Validation results of the initial SVM and RF models on the clustered protein sets UniRef50GO (u50), Clus30GO (c30), and Clus20GO (c20).


Label      Domain ID  Domain name      Clan ID  Clan name       Total  COV [%]
Eraser     PF02373    JmjC             CL0029   Cupin               9       50
           PF02146    SIR2             CL0085   FAD_DHS             7      100
Mediator   PF00651    BTB              CL0033   POZ                97       93
           PF01352    KRAB             -        -                  49       91
Modifier   PF13639    zf-RING_2        CL0229   RING               64       94
           PF00583    Acetyltransf_1   CL0257   Acetyltrans        20       91
Reader     PF00400    WD40             CL0186   Beta_propeller    199       93
           PF00023    Ank              CL0465   Ank                65       87
Remodeler  PF00249    Myb_DNA-binding  CL0123   HTH                16       89
           PF00176    SNF2_N           CL0023   P-loop_NTPase      13       59

Table 4.4: Most abundant label domains for the chromatin remodeler types. The coverage value (COV) indicates how many of all Swiss-Prot proteins with the corresponding domain are associated with the respective type.

4.2.2 Label domains

Our goal is to find domains with a nearly exclusive occurrence in proteins of a particular label, and those originally used in the input file triplets are potential candidates. Henceforward, they will be referred to as label domains. Our analysis, however, shows that the same label domain is never directly associated with more than one distinct chromatin remodeler type. On the other hand, the corresponding proteins may be unlabeled or of different types (due to other domains). Table 4.4 lists the most prominent domain assignments for each of the chromatin remodeler types as well as the ratio of proteins where the label domain is responsible for the respective type. A list of all 69 label domains is available on request from the author¹.

Within the set of 739 epigenetics-related Swiss-Prot proteins, there are only three exceptions where the label domain is not associated with a chromatin remodeler type². In one case, the Q8BJL0 protein is connected to the two modifier-linked label domains HARP (PF07443) and SNF2_N (PF00176), while only one of them actually appears in the input file and can thus be assumed responsible for the type. The remaining exceptions are two non-modifier proteins with a modifier-linked label domain: Q91WC0 (mediator) and P13864 (mediator and reader). The first protein carries the SET domain (PF00856), known from 14 modifiers in our data. The second one is linked to the DNA_methylase domain (PF00145), known from a single modifier.

¹ [email protected]

² The proteins are obviously labeled, otherwise they would not be part of the data set.


                   Eraser   Mediator   Modifier   Reader   Remodeler
Overall coverage   52.6%    88.5%      78.0%      78.1%    78.6%

Table 4.5: Overall coverage of labels and label domains. The values indicate how many of the Swiss-Prot proteins with a label domain are also associated with the respective label.

Apart from these exceptions, the union of the label domains corresponds to the labels. In other words, although the chromatin remodeler types are not directly included in our feature space, they can easily be reconstructed by using the label domains instead. Knowing this, the extraordinarily good validation results of the initial prediction models are not surprising at all.

A second remarkable result of the data analysis is that the majority of proteins with a label domain are also known to be of the corresponding label. This holds not only for most of the domains individually but also for the labels in general (see Table 4.5), despite having lost a considerable amount of information during the creation of our local database (see Section 4.1.1). The values in Table 4.5 can thus be considered a lower bound of the real ones. The results seem to imply that single domains play a crucial role in the determination of a protein’s epigenetic function, nearly independent of the actual protein, and that the chromatin remodeler types were originally assigned to domains, not proteins or genes. If this assumption held, it would be easy to extend the data set. Unfortunately, this does not seem to be the case. For example, the readers’ two most abundant label domains, WD40 (PF00400) and Ank (PF00023), are known to be found in proteins which are assumed to be not related to epigenetics [6].
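The coverage values in Tables 4.4 and 4.5 can be read as a simple ratio: among all proteins that carry a given label domain (or any domain from a set of label domains), the share that is annotated with the corresponding label. The sketch below illustrates this reading in Python. It assumes two dictionaries that map proteins to their Pfam domains and to their labels; the protein names, the toy assignments, and the function name are hypothetical and only serve as an illustration.

```python
def coverage(domains, label, protein_domains, protein_labels):
    """Percentage of proteins carrying any of `domains` that have `label`."""
    carriers = [p for p, doms in protein_domains.items() if doms & domains]
    if not carriers:
        return 0.0
    positives = sum(1 for p in carriers if label in protein_labels.get(p, set()))
    return 100.0 * positives / len(carriers)


# Hypothetical toy data: protein -> Pfam domains.
protein_domains = {
    "A": {"PF02146"},             # SIR2
    "B": {"PF02146", "PF00400"},  # SIR2 and WD40
    "C": {"PF00400"},             # WD40
    "D": {"PF00023"},             # Ank
}
# Protein -> labels; "D" stands for a protein without any known label.
protein_labels = {
    "A": {"Eraser"},
    "B": {"Eraser", "Reader"},
    "C": {"Reader"},
}

# Coverage of a single label domain (as in Table 4.4).
sir2_cov = coverage({"PF02146"}, "Eraser", protein_domains, protein_labels)
# Coverage over a set of label domains of one label (as in Table 4.5).
reader_cov = coverage({"PF00400", "PF00023"}, "Reader",
                      protein_domains, protein_labels)
```

In this toy example, the unlabeled protein "D" lowers the reader coverage in the same way that proteins outside the epigenetics set would.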

In order to create more reasonable prediction models, it is necessary to remove all label domains from our data sets. Even so, we still have to expect extremely high correlations between the features and the labels, as the Pfam clans of the label domains are still included. In addition, other features co-occurring with the label domains might pose the same problem.

4.2.3 Label clans and their members

Removing only the label domains is not enough to minimize the high correlations between single features and the labels in our data. Their clans, henceforth called label clans, also have to be taken into account. Since only 20 out of 69 label domains belong to a clan, their influence is not as high as for the domains, but still present (see Figure 4.3c). Similar to the domains, the label clans point to a single chromatin remodeler type, with only one exception: The P-loop_NTPase clan (CL0023) is associated with two mediators (via KTI12; PF08433) and 13 remodelers (via SNF2_N; PF00176). In total, there are 19 distinct label clans in our data.

[Figure 4.4, panels (a) Eraser, (b) Mediator, (c) Modifier, (d) Reader, (e) Remodeler: feature importance on a logarithmic axis from 10^-5 to 10^0, shown separately for the UniRef50GO and UniRef50GOD data sets.]

Figure 4.4: Feature importance in the UniRef50GO (original) and UniRef50GOD (all label domains excluded) data sets. The vast majority of the features (90-99%) has an importance value of zero (not shown here). Red bars denote label domains, label clans and their members. A dot above a red bar indicates a label clan member which is not a label domain.

Label      Domain ID  Domain name  Clan ID  Clan name      Total  Positives
Mediator   PF07707    BACK         CL0033   POZ               41         41
           PF01344    Kelch_1      CL0033   POZ               34         34
Reader     PF12796    Ank_2        CL0465   Ank               56         52
           PF13637    Ank_4        CL0465   Ank               16         15
           PF13857    Ank_5        CL0465   Ank               11         11
Remodeler  PF00271    Helicase_C   CL0023   P-loop_NTPase     15         14

Table 4.6: Most important label domain relatives in descending order per label (in terms of importance score). The table shows the total number of instances in the UniRef50GO data set having the respective domain, including the number of proteins being positive for the corresponding label.

Aside from the label clans themselves, their members (i.e., protein domains) can potentially lead to extremely high correlations between features and labels, without even being a label domain. This is due to the fact that members of the same clan share the same evolutionary origin and thus usually have a very similar structure or function [57]. Therefore, we have to expect that our epigenetics-related proteins are also affected by this. Label clan members which are not label domains will henceforth be referred to as label domain relatives. Figure 4.4 shows the feature importance in the UniRef50GO (original) and UniRef50GOD (all label domains excluded) data sets. The displayed values are generated using scikit-learn (see Section 3.5.1). More than 90% of the features have an importance value of zero, which is largely because of the flat representation of the GO hierarchy in our data, leading to a sparse feature space (see Section 3.3). Nevertheless, there are several label domain relatives (indicated by black dots above the red bars in the figure) with a considerable importance, at least among the mediators, readers and modifiers. Also, the major importance of the label clans is evident (indicated by red bars without a dot in the UniRef50GOD rows³).

Table 4.6 lists the most important label domain relatives for the mediator, reader, and remodeler labels. The BACK domain (PF07707) is known to co-appear with the mediators’ most abundant label domain, BTB (PF00651), in the majority of proteins that also have the Kelch_1 motif (PF01344) [84]. Hence, it is not surprising that both of them appear as the most important label domain relatives of the mediators. Among the readers, the ankyrin repeat (PF00023), that is, the second most abundant label domain introduced in the previous section, comes into play again. Most of the Ank clan (CL0465) members describe multiple copies of the repeat in the same protein, present in the form of Ank_2 (PF12796), Ank_4 (PF13637), and Ank_5 (PF13857) as the three most important label domain relatives for the readers. The exceptional importance of the Helicase_C (PF00271) domain among the remodelers can be explained by its co-occurrence with the second most abundant label domain SNF2_N (PF00176), which is present in particular members of the epigenetics-related SMARC (SWI/SNF-related, matrix associated, actin-dependent regulator of chromatin) protein family, such as SMARCA2 or SMARCAL1 [85, 86].

³ In the UniRef50GO rows, the label domains are still included and also indicated by a red bar.

               Distinct domains per protein
Data set       0    1    2    3   4   5  6  7  8  9
UniRef50GO     0  275  225  135  45  17  5  0  1  1
UniRef50GOD  296  212  132   41  17   4  0  1  1
UniRef50GOR  394  215   62   23   8   0  0  2

Table 4.7: Distribution of distinct domain assignments per protein in the UniRef50GO (original), UniRef50GOD (all label domains excluded), and UniRef50GOR (all label domains, clans, and relatives excluded) data sets.

The label clans are very heterogeneous in size: They range between 2 and 202 members in the unclustered data set, corresponding to a total number of 885 distinct label domain relatives. However, the observed high correlation between some label domain relatives and the labels is not sufficient to generalize this to all relatives, especially since only 49 of them actually appear at least once in our data. Even though the majority of the present label domain relatives have a small number of occurrences, they usually point to the same label (cf. Table A.2 in the Appendix), which seems to reflect the high similarity to the respective label domains.

For the purpose of building a more realistic prediction model, we remove all label clans from our sets. Furthermore, it appears reasonable to remove the label domain relatives as well. The corresponding set is henceforward referred to as UniRef50GOR. Nevertheless, we cannot guarantee that all label clan members share a similar epigenetic function with the corresponding label domains. Hence, expert knowledge is required to verify the currently unknown domain co-occurrences in the results of the present correlation analysis.
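The filtering steps that lead from the original set to the D and R variants can be sketched as follows. This Python sketch assumes a domain-to-clan mapping and per-protein feature sets that contain Pfam domain and clan IDs. The domain and clan IDs are the ones named in this chapter, whereas the protein names, their feature assignments, and the function names are hypothetical.

```python
def derive_exclusions(label_domains, domain_to_clan):
    """Return (label clans, label domain relatives) for the label domains."""
    label_clans = {domain_to_clan[d] for d in label_domains if d in domain_to_clan}
    relatives = {
        d for d, clan in domain_to_clan.items()
        if clan in label_clans and d not in label_domains
    }
    return label_clans, relatives


def strip_features(protein_features, excluded):
    """Remove the excluded feature IDs from every protein's feature set."""
    return {p: feats - excluded for p, feats in protein_features.items()}


label_domains = {"PF00651", "PF00023"}  # BTB, Ank
domain_to_clan = {
    "PF00651": "CL0033",  # BTB     -> POZ
    "PF07707": "CL0033",  # BACK    -> POZ
    "PF00023": "CL0465",  # Ank     -> Ank
    "PF12796": "CL0465",  # Ank_2   -> Ank
    "PF00096": "CL0361",  # zf-C2H2 -> C2H2 zinc finger clan
}
# Hypothetical proteins with their domain and clan features.
protein_features = {
    "prot1": {"PF00651", "PF07707", "CL0033", "PF00096", "CL0361"},
    "prot2": {"PF00023", "PF12796", "CL0465"},
}

label_clans, relatives = derive_exclusions(label_domains, domain_to_clan)

# D variant: only the label domains are removed.
god = strip_features(protein_features, label_domains)
# R variant: label domains, label clans, and label domain relatives are removed.
gor = strip_features(protein_features, label_domains | label_clans | relatives)
```

In the toy data, the second protein ends up without any domain or clan feature in the R variant, which mirrors the situation of proteins that are associated with label clan members only.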

The distribution of domain assignments per protein for the different data sets is shown in Table 4.7. Evidently, more than half of the proteins are associated with label clan members only, i.e., their feature space is limited to the DNA composition and the moderately performing GO annotations. On the other hand, the remaining domain assignments in the UniRef50GOR data set provide a good starting point for finding new epigenetics-related domain co-occurrences.


4.2.4 Domain co-occurrences

Even without the label domains, clans, and relatives in our data sets (red bars in Figure 4.4), extraordinarily important features remain, signaling that there are either more label substitutes (like the label domains/clans) present or, on the contrary, only a few noticeable feature correlations to the labels left. In the following, we focus on domain co-occurrences.

Figure 4.5 shows domains co-occurring at least five times in the UniRef50GO data set. Evidently, our observations from Section 4.2.2 can be confirmed: The BACK (PF07707) and Kelch_1 (PF01344) domains do not only predominantly co-occur with BTB (PF00651) but also with each other. Furthermore, the Pfam entries representing different numbers of copies of the Ank repeat (PF00023, PF12796, PF13857, and PF13637) appear to be present in the same proteins most of the time.
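The counting behind such a co-occurrence graph can be sketched in a few lines of Python. The sketch assumes a mapping from proteins to their Pfam domains and interprets an exclusive co-occurrence as a pair in which at least one of the two domains never appears without the other; this interpretation, the toy proteins, and the function name are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrences(protein_domains, min_count=5):
    """Count per-domain and pairwise occurrences over distinct domains."""
    domain_counts = Counter()
    pair_counts = Counter()
    for domains in protein_domains.values():
        distinct = sorted(set(domains))  # each domain counts once per protein
        domain_counts.update(distinct)
        pair_counts.update(combinations(distinct, 2))

    edges = {}
    for (a, b), n in pair_counts.items():
        if n >= min_count:
            # Exclusive: one of the two domains only occurs together with the other.
            exclusive = n == domain_counts[a] or n == domain_counts[b]
            edges[(a, b)] = (n, exclusive)
    return domain_counts, edges


# Hypothetical proteins: 8 with SCAN and KRAB, 3 with KRAB only,
# and 2 with KRAB and BTB (below the threshold of five).
protein_domains = {f"scan_krab_{i}": ["PF02023", "PF01352"] for i in range(8)}
protein_domains.update({f"krab_{i}": ["PF01352"] for i in range(3)})
protein_domains.update({f"krab_btb_{i}": ["PF01352", "PF00651"] for i in range(2)})

domain_counts, edges = cooccurrences(protein_domains)
```

Sorting the distinct domains before forming pairs ensures that each unordered pair is stored under a single key, regardless of the order of the annotations.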

In our experiments, many mediators have proven to be reliably predictable even when using the UniRef50GOR data set. The reason for this is the presence of numerous C2H2 zinc finger motifs (PF00096, PF13465, PF13894, PF13912) in proteins also having the highly abundant label domains BTB or KRAB (PF01352). Similarly to the Ank repeats, these motifs share the same Pfam clan (CL0361) and tend to co-occur with each other. It is known from the literature that the C2H2 zinc fingers often appear in combination with other protein domains that regulate sub-cellular localization and gene expression, such as BTB, KRAB, and SCAN (PF02023) [87]. All of these combinations are assumed to be involved in protein-protein interactions [88–90]. It should further be noted that the SCAN domain is always co-occurring with KRAB in our data set.

Among the readers, four domains seem to be relevant in terms of co-occurrences: SOCS_box (PF07525), SAM_1 (PF00536), HELP (PF03451), and Death (PF00531). The SOCS_box appears in correlation with either the Ank or WD40 (PF00400)⁴ repeat, which can be confirmed by the literature [91]. The function of the corresponding proteins is still largely unexplored, but recent studies indicate a link to the ubiquitination of particular proteins [92, 93]. In our data set, the combination of the MBT (PF02820) repeat with the SAM_1 domain is present mainly in the form of so-called Lethal(3)malignant brain tumor-like (L3MBTL) proteins. At least one L3MBTL kind is able to recognize mono- and dimethylated histones [94, 95]. While all of the findings described above support the understanding of the respective labels, the epigenetic role of the following two domain co-occurrences remains unknown to us: First, the apoptosis-related Death domain in proteins with Ank repeats [96]. Second, the co-occurrences of the HELP domain and the WD40 repeat, which lead to Echinoderm microtubule-associated protein-like (EPL) proteins and the capability of binding to microtubules [97].

⁴ Not shown in Figure 4.5 because of only three occurrences in the data set.


[Figure 4.5: graph of domain co-occurrences. The nodes are Pfam domains annotated with their protein counts, the edges are weighted by the number of co-occurrences.]

Figure 4.5: Domain co-occurrences in the UniRef50GO data set. The nodes represent label domains (blue) and their relatives (red) as well as domains being neither of them (green). The number of proteins annotated with the corresponding domain is given in the nodes. The edge weights represent the number of co-occurrences (at least 5). An exclusive co-occurrence for at least one of the domains is indicated by a thick edge.


                                                 C∗ sets         A∗ sets
                                                 (UniRef50*C,    (UniRef50*A,
                                                 Clus30*C,       Clus30*A,
Category         Features                        Clus20*C)       Clus20*A)
DNA composition  All                             X               X
Pfam domains     Label domains                   -               -
                 Label domain relatives          X               -
                 Co-occurring domains (≥ 5x)     X               -
                 Others                          X               X
Pfam clans       Label clans                     -               -
                 Clans of co-occurring domains   X               -
                 Others                          X               X
GO annotations   CM/CS terms                     -               -
                 Others                          X               X

Table 4.8: Feature overview of the final data sets. An asterisk in the data set name indicates that the feature list holds for both representative selection methods (GO/SL).

The remaining proteins affected by the domain co-occurrences in Figure 4.5 are modifiers. The PA (PF02225) domains [98] belong mostly to E3 ubiquitin-protein ligases, which are involved in the ubiquitination of lysine [99]. The combination of SET (PF00856) and Pre-SET (PF05033) domains is unique to Suv39/Clr4 H3 histone methyltransferases [100]. As the name suggests, they regulate the chromatin structure by methylation of the histone H3 [101, 102]. The described modifiers appear to comply with our understanding of the chromatin remodeler type.

Unfortunately, most of the domains listed in this section imply the respective label via a label domain. Thus, the removal of the affected features seems to be required for building more reasonable prediction models.

4.3 Final prediction models

So far, we have described the reasons behind the extremely good performance results of the initial prediction models, leading to several possible modifications of the feature space. The removal of the label domains and clans is absolutely necessary in order to build reasonable prediction models, since they represent the labels almost invariably. This also holds for some domain co-occurrences, including the label domain relatives. On the other hand, excluding all of them might lead to a noticeable information loss, as we do not know whether or not the observed co-occurrences are independent of the labels. Therefore, we have decided to build the final prediction models on two separate groups of data sets (Table 4.8), representing a lower (C∗ sets) and an upper (A∗ sets) bound of feature filtering stringency.

[Figure 4.6, panels (a) to (d): F-measure per label (Eraser, Mediator, Modifier, Reader, Remodeler) and data set (u50, c30, c20), shown for cross-validation and external test set validation of the SVM and RF models. (a) UniRef50GOC, Clus30GOC, and Clus20GOC. (b) UniRef50SLC, Clus30SLC, and Clus20SLC. (c) UniRef50GOA, Clus30GOA, and Clus20GOA. (d) UniRef50SLA, Clus30SLA, and Clus20SLA.]

Figure 4.6: F-measures of the final SVM and RF models on the clustered protein sets UniRef50 (u50), Clus30 (c30), and Clus20 (c20). Missing external test set validation results are due to unreasonably small test set sizes (less than ten instances).

                          Cross-validation             External test set validation
Label      Data set       ACC    PRE    REC    FM      ACC    PRE    REC    FM
Eraser     UniRef50GOC    0.985  0.917  0.550  0.687   –      –      –      –
           UniRef50GOA    0.983  0.846  0.550  0.667   –      –      –      –
Mediator   UniRef50GOC    0.855  0.858  0.653  0.741   0.852  0.848  0.651  0.737
           UniRef50GOA    0.791  0.707  0.550  0.618   0.811  0.711  0.659  0.684
Modifier   UniRef50GOC    0.900  0.747  0.557  0.638   0.889  0.818  0.409  0.545
           UniRef50GOA    0.896  0.698  0.632  0.663   0.879  0.667  0.476  0.556
Reader     UniRef50GOC    0.773  0.755  0.760  0.757   0.800  0.810  0.746  0.777
           UniRef50GOA    0.697  0.682  0.667  0.674   0.735  0.714  0.726  0.720
Remodeler  UniRef50GOC    0.934  0.636  0.396  0.488   –      –      –      –
           UniRef50GOA    0.922  0.556  0.189  0.282   –      –      –      –

Table 4.9: Excerpt from the validation results of the final prediction models. The table shows the accuracy (ACC), precision (PRE), recall (REC), and F-measure (FM) values for the SVM models trained on the two UniRef50GO data sets.

The F-measures of the final SVM and RF prediction models are shown in Figure 4.6. External test set validation has not been performed for the erasers and remodelers, because of an unreasonably small number of instances in the corresponding test sets. According to Figure 4.6, the models trained on UniRef50 data sets outperform their Clus30/Clus20 counterparts in most of the cases. We assume this to be most likely due to their considerable differences in terms of instance numbers (cf. Figure A.2 in the Appendix). Furthermore, the GO sets usually lead to slightly better F-measures than the SL sets, possibly reflecting the more sparsely populated feature space in the latter ones. With only a few exceptions, the SVM models perform slightly better than those trained with RF classifiers. Table 4.9 compares the validation results of the SVM-trained UniRef50GO prediction models with regard to different performance measures. A full version, covering all data sets and methods, can be found in the Appendix (Tables A.3 to A.6). The generally high accuracy values can be explained by the unbalanced distribution of positive and negative instances for all labels except the readers.
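The effect of an unbalanced class distribution on the accuracy can be illustrated with a small Python sketch that derives the four measures from confusion matrix counts. The counts below are hypothetical (700 instances, 50 of them positive) and do not correspond to any of the data sets in this thesis; the function name is likewise an assumption.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F-measure from confusion counts."""
    total = tp + fp + fn + tn
    acc = (tp + tn) / total
    pre = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    fm = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    return {"ACC": acc, "PRE": pre, "REC": rec, "FM": fm}


# A classifier that rejects every instance still reaches a high accuracy.
trivial = metrics(tp=0, fp=0, fn=50, tn=650)

# A classifier that finds 20 of the 50 positives at the cost of 15 false alarms.
modest = metrics(tp=20, fp=15, fn=30, tn=635)
```

In this example, both classifiers reach an accuracy above 0.92, while only the F-measure separates them. This is why the discussion in this section relies on the F-measure rather than on the accuracy.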

The erasers represent the least abundant proteins in our data sets. They are almost unaffected by the removal of numerous domains and clans, which is not surprising, as they do not have label domain relatives or noteworthy domain co-occurrences. Nevertheless, the erasers are too few in number to be reliably predicted, even though the validation results are moderate.

By far the worst performance can be observed among the remodelers. Here we can confirm the assumption that the remarkably good validation results of the initial prediction models are almost exclusively due to domains and clans correlating with the labels. The performance values drop severely each time a more stringent data filtering is used. As a result, it might be assumed that the remodeler label overlaps heavily with at least one of the others, or that the current features are not sufficient to clearly identify the corresponding proteins.

The mediators show the biggest divergence between the UniRef50 and Clus20 data sets. Similarly to the remodelers, they largely depend on the Pfam domains and clans. This can be explained by the presence of numerous C2H2 zinc finger motifs and other co-occurring domains (BACK, Kelch_1, SCAN) in the less stringent C∗ data sets. Consequently, we expect the performance results from the A∗ sets to be more realistic for this label.

The readers’ validation results again appear extremely stable across the different data sets. Although they, too, suffer noticeably from the exclusion of several domains and clans, the clustering level of the data sets plays a rather subordinate role. Still, the readers are the most abundant type of epigenetics-related proteins in our data sets, e.g., there are as many readers in the Clus20 data set as there are mediators (the second most abundant type) in the unclustered one.

Surprisingly and in contrast to the other labels, the modifiers benefit from the reduction of the feature space (recall and F-measure), although there are at least a few domain co-occurrences present.

Overall, the model performances on the A∗ sets closely resemble the ones of the initial models trained only on GO annotations, indicating that neither Pfam nor DNA composition features are capable of influencing the results substantially. On the other hand, some highly co-occurring domains exclusively related to a label prevent us from including all Pfam features. Hence, domain/clan information does not seem to be suitable for building a reasonable prediction model as long as there are exclusive relations between single domains and labels.

In conclusion, we are able to identify between 67% and 76% of the readers in our data set, as well as about 55% to 65% of the mediators and modifiers. For the erasers and remodelers, on the other hand, we cannot provide reasonable prediction models.

5 Conclusion and Outlook

The research field of epigenetics is a promising step towards better understanding of the mammalian development, biological pathways and several diseases. Epigenetic mechanisms influence these processes by regulating the organism’s gene transcription through changes in the DNA organization (chromatin) in the cell nucleus. It is widely accepted that there are distinct kinds of mechanisms, such as covalent histone modification and DNA methylation. They are carried out by cooperating proteins and multi-protein complexes with different capabilities, including the recognition of epigenetic signals, binding to DNA, and actual histone modification. However, so far there exists no common definition for the specific types of these proteins, here referred to as chromatin remodeler types.

In this thesis, we have analyzed an expert-generated list of protein domains and mouse genes divided into five groups of chromatin remodeler types: Erasers, mediators, modifiers, readers, and remodelers. Our goal has been to assess the credibility of the suggested type assignments and to find currently unknown epigenetics-related domains and proteins in the mouse genome. For this purpose, we have first identified the genes from the given list in the popular UCSC Genome Browser database and, subsequently, the corresponding proteins in UniProtKB. In a second step, we have created a local database on top of this data, which has then been extended by Pfam and GO annotations. Based on the collected information, we have built different data sets for the purpose of creating prediction models for the five chromatin remodeler types. Special attention has been paid to the exclusion of unreviewed proteins and predicted annotations as well as redundant protein information. Furthermore, we have repeatedly checked the data sets for exceedingly high feature correlations.

Using the well-known Weka data mining library, we have built prediction models for the chromatin remodeler types with parameter-tuned support vector machines and random forest classifiers. The resulting models correctly predict about 67% to 76% of the readers, and about 55% to 65% of the mediators and modifiers in our data sets. Unfortunately, we have not been able to create a reliable predictor for the erasers, because of the very limited amount of available data for this type. In addition, there is strong evidence that indicates a possible overlap between the remodelers and the remaining chromatin remodeler types, which is why we cannot provide a predictor for this type either.

In order to create more suitable prediction models in the future, we strongly suggest a revision of the initial input list. More specifically, several hundred entries had to be ignored because of inconsistencies in the structure of the data, including 76 eraser-associated gene names, which are notoriously underrepresented in our database. We deem it absolutely necessary to provide unique identifiers instead of mere gene names or symbols. Moreover, since chromatin remodeler types are linked to genes via protein domains, we do not see the necessity of providing genes in the first place and recommend referring directly to proteins, preferably via UniProtKB IDs. This would address several problems at once: First and foremost, we could benefit from a drastic reduction of the protein diversity when searching for a particular gene name, avoiding the unintended inclusion of epigenetics-unrelated proteins in our data sets. This diversity is due to splice variants, protein isoforms, and the generally ambiguous utilization of gene names across the databases. Second, the data acquisition would become faster and less error-prone owing to the removal of the unnecessary automated mapping from genes to proteins. Finally, it would simplify the manual curation of presumably erroneous protein domain assignments, which has been the second most common reason for data loss during the creation of our local database.

The correlation analysis of the protein annotations has shown an extremely close relation between Pfam domains and the chromatin modification types, which also affects the corresponding clans as well as their members. Learning prediction models on these features would reduce the actual prediction task to simply answering the question of whether or not a particular domain/clan is present in the given protein. Therefore, we do not consider them to be reasonable features for this purpose. Instead, we suggest analyzing the domains separately in order to find new epigenetics-related co-occurrences. In the literature, we have found evidence for all of the most frequently co-occurring domains among the proteins in our data sets, with a few exceptions that might indicate novel epigenetic associations (see Section 4.2.4). Nevertheless, the latter should be experimentally verified by domain experts.

Furthermore, it may be reasonable to assume that the five expert-generated chromatin remodeler types (labels) are not distinct. Especially the remodelers seem to be hard to distinguish from the other epigenetics-related proteins. Even though we do not know for certain whether this observation is based on overlapping types or an insufficient feature space, the results imply that this is a reasonable assumption. For example, the proteins that have both the SET (PF00856) and Pre-SET (PF05033) domains are labeled as modifiers in our data sets (and most of the time as readers, too). Literature, however, indicates that this particular domain combination exclusively belongs to proteins that regulate the chromatin structure [100, 102], which matches our understanding of the remodeler type. Future work should address this issue, clarifying the relations between the different chromatin remodeler types, e.g., by hierarchical clustering [28, chap. 14.3.12]. Alternatively, the use of classification methods that take label correlations into account, such as classifier chains [68], could lead to more reasonable prediction results.
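As a first, inexpensive check of whether two types overlap, one could measure how often their labels co-occur on the same protein before applying a full hierarchical clustering. The following is a minimal sketch under stated assumptions: the protein identifiers and label assignments are hypothetical toy data, not taken from our data sets, and the Jaccard index is only one of several possible similarity measures.

```python
from itertools import combinations

# Hypothetical toy assignment of proteins to types (multi-label);
# the identifiers and labels are illustrative only.
labels_by_protein = {
    "P1": {"modifier", "reader"},
    "P2": {"modifier", "reader"},
    "P3": {"reader"},
    "P4": {"remodeler", "reader"},
    "P5": {"mediator"},
}


def proteins_per_label(assignment):
    """Invert the mapping: label -> set of proteins carrying that label."""
    inverted = {}
    for protein, labels in assignment.items():
        for label in labels:
            inverted.setdefault(label, set()).add(protein)
    return inverted


def jaccard(a, b):
    """Jaccard index of two sets: shared members divided by all members."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def label_overlap(assignment):
    """Pairwise Jaccard index between the protein sets of all labels."""
    members = proteins_per_label(assignment)
    return {
        (x, y): jaccard(members[x], members[y])
        for x, y in combinations(sorted(members), 2)
    }


overlap = label_overlap(labels_by_protein)
```

Label pairs with a high index would be candidates for a closer inspection by domain experts, and one minus the index could serve as the distance in a subsequent hierarchical clustering.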


Finally, while performing an analysis based on extending the feature space was out of the scope of this thesis, we believe that our prediction models would benefit from features describing the structure of proteins or their tendency towards building protein complexes, as present in databases such as SCOP2 [103] or CORUM [44]. Once a set of chromatin remodeler types has proven to be reasonable, the models can be further utilized to classify proteins in a genome-wide approach. Better results can be expected in combination with a prior step of deciding whether or not the targeted protein is in general epigenetics-related, since our models are not designed for this more general task.

A Appendix

The appendix predominantly contains more detailed versions of the figures and tables presented in the previous chapters, including a full model of the local database and the performance results of the built prediction models. A summary of the appendix's content is shown below.

Figures:

Figure A.1: Model of the local database

Figure A.2: Chromatin remodeler type distribution across the data sets

Figure A.3: Configuration of the classifiers

Figure A.4: Validation results of the initial models (second set)

Tables:

Table A.1: Source files used to build the local database

Table A.2: Label domain relatives in the data set

Table A.3: Final SVM model validation results (GO sets)

Table A.4: Final RF model validation results (GO sets)

Table A.5: Final SVM model validation results (SL sets)

Table A.6: Final RF model validation results (SL sets)


Figure A.1: Model of the local database.


[Figure A.2: two grouped bar charts; x-axis: chromatin remodeler type, y-axis: number of instances. The bar values are listed below.]

(a) Representatives chosen by major number of GO annotations

Data set Eraser Mediator Modifier Reader Remodeler

No clustering 20 230 110 357 55

UniRef50GO 20 221 107 336 53

Clus30GO 14 140 86 273 41

Clus20GO 9 117 73 230 37

(b) Representatives chosen by major sequence length

Data set Eraser Mediator Modifier Reader Remodeler

No clustering 20 230 110 357 55

UniRef50SL 20 222 107 335 52

Clus30SL 14 141 85 271 40

Clus20SL 9 118 71 228 36

Figure A.2: Chromatin remodeler type distribution across different Swiss-Prot data sets.
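The two selection criteria named in the panel captions of Figure A.2 can be illustrated as follows. This is a minimal sketch, assuming that each cluster is given as a list of member records; the field names (`id`, `go_count`, `length`), the example clusters, and the tie-breaking by identifier are hypothetical and are not taken from the thesis pipeline.

```python
def pick_representative(cluster, criterion):
    """Return the member of `cluster` with the largest value of `criterion`.

    Ties are broken by the lexicographically smallest id (an assumption).
    """
    return min(cluster, key=lambda member: (-member[criterion], member["id"]))


# Hypothetical clusters of similar protein sequences.
clusters = [
    [
        {"id": "A1", "go_count": 12, "length": 410},
        {"id": "A2", "go_count": 3, "length": 980},
    ],
    [
        {"id": "B1", "go_count": 5, "length": 300},
        {"id": "B2", "go_count": 5, "length": 250},
    ],
]

# "GO" variant: representative with the largest number of GO annotations.
go_representatives = [pick_representative(c, "go_count")["id"] for c in clusters]

# "SL" variant: representative with the longest sequence.
sl_representatives = [pick_representative(c, "length")["id"] for c in clusters]
```

In this toy example the two criteria disagree on the first cluster, which mirrors why the GO and SL data sets in Figure A.2 differ slightly in their label counts.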


(a) Support vector classifier

:GridSearch
    evaluation = "ACC"
    gridIsExtendable = False
    seed = 1
    xBase = 10
    xExpression = "pow(BASE, I)"
    xMax = 4
    xMin = 0
    xProperty = "classifier.cost"
    xStep = 1
    yBase = 10
    yExpression = "pow(BASE, I)"
    yMax = -1
    yMin = -4
    yProperty = "classifier.gamma"
    yStep = 1
    classifier -> :LibSVM
    filter -> :AttributeSelection

:LibSVM
    eps = 0.001
    kernelType = "KERNELTYPE_RBF"
    normalize = False
    probabilityEstimates = True
    shrinking = False
    svmType = "SVMTYPE_C_SVC"

:AttributeSelection
    search -> :BestFirst
    evaluator -> :CfsSubsetEvaluator

(b) Random forest classifier

:GridSearch
    evaluation = "ACC"
    gridIsExtendable = False
    seed = 1
    xBase = 2
    xExpression = "pow(BASE, I)"
    xMax = 11
    xMin = 4
    xProperty = "classifier.numFeatures"
    xStep = 1
    yExpression = "I"
    yMax = 100
    yMin = 5
    yProperty = "classifier.numTrees"
    yStep = 95
    classifier -> :RandomForest
    filter -> :AllFilter

:RandomForest
    maxDepth = 0
    seed = 1

Figure A.3: Structure and configuration of the two classifiers used for prediction. Object names represent actual Weka classes. Omitted attributes are used with default values.
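To make the search ranges in Figure A.3 concrete, the parameter grids can be enumerated by hand. The sketch below is an illustration in plain Python, not Weka code. It assumes that each axis runs from its minimum to its maximum (inclusive) in increments of the step and that the expression is applied to every index, which is our reading of the attributes shown in the figure. The helper name `axis_values` is hypothetical.

```python
from itertools import product


def axis_values(minimum, maximum, step, base=None):
    """Expand one grid axis.

    With a base, the expression pow(BASE, I) is applied to each index I;
    without one, the index itself is used (expression "I").
    """
    indices = range(minimum, maximum + 1, step)
    return [base ** i if base is not None else i for i in indices]


# SVM grid: cost = 10^0 .. 10^4, gamma = 10^-4 .. 10^-1
svm_grid = list(product(axis_values(0, 4, 1, base=10),
                        axis_values(-4, -1, 1, base=10)))

# RF grid: numFeatures = 2^4 .. 2^11, numTrees in {5, 100}
rf_grid = list(product(axis_values(4, 11, 1, base=2),
                       axis_values(5, 100, 95)))
```

Under these assumptions, the support vector classifier is evaluated on 20 parameter combinations and the random forest classifier on 16.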


[Figure A.4: four plots, one per panel. x-axis: data sets u50, c30, and c20, grouped by label (Eraser, Mediator, Modifier, Reader, Remodeler); y-axis: F-measure. Legend: Cross-validation (SVM), External test set validation (SVM), Cross-validation (RF), External test set validation (RF).]

(a) Models trained on all features

(b) Models trained on domains only

(c) Models trained on clans only

(d) Models trained on GO annotations only

Figure A.4: Validation results of the initial SVM and RF models on the clustered protein sets UniRef50SL (u50), Clus30SL (c30) and Clus20SL (c20). Both eraser and remodeler labels suffer from small test set sizes, as evident from the positive and negative peaks in the figure. The remodelers' bad performance in (c) is because of missing clan assignments for most of the positive examples.


Table Source file Version

go.ontologies http://geneontology.org/ontology/go-basic.obo 2014-02-17

go.ontologyrelations http://geneontology.org/ontology/go-basic.obo 2014-02-17

mm10.aliases ftp://hgdownload.soe.ucsc.edu/goldenPath/hgFixed/database/transMapGeneUcscGenes.txt.gz 2010-02-25

mm10.aliases ftp://hgdownload.cse.ucsc.edu/goldenPath/mm10/database/ensemblToGeneName.txt.gz 2014-02-19

mm10.aliases ftp://hgdownload.cse.ucsc.edu/goldenPath/mm10/database/geneName.txt.gz 2014-02-19

mm10.aliases ftp://hgdownload.cse.ucsc.edu/goldenPath/mm10/database/kgAlias.txt.gz 2014-02-18

mm10.aliases ftp://hgdownload.cse.ucsc.edu/goldenPath/mm10/database/knownToEnsembl.txt.gz 2014-02-19

mm10.aliases ftp://hgdownload.cse.ucsc.edu/goldenPath/mm10/database/knownToLocusLink.txt.gz 2012-12-03

mm10.aliases ftp://hgdownload.cse.ucsc.edu/goldenPath/mm10/database/refLink.txt.gz 2014-02-19

mm10.chromosomes ftp://hgdownload.cse.ucsc.edu/goldenPath/mm10/chromosomes/chr*.fa.gz 2012-02-09

mm10.cpgislands ftp://hgdownload.cse.ucsc.edu/goldenPath/mm10/database/cpgIslandExt.txt.gz 2012-03-07

mm10.exons ftp://hgdownload.cse.ucsc.edu/goldenPath/mm10/database/knownGene.txt.gz 2012-12-03

mm10.genes ftp://hgdownload.cse.ucsc.edu/goldenPath/mm10/database/knownIsoforms.txt.gz 2012-12-03

mm10.transcripts ftp://hgdownload.cse.ucsc.edu/goldenPath/mm10/chromosomes/chr*.fa.gz 2012-02-09

mm10.transcripts ftp://hgdownload.cse.ucsc.edu/goldenPath/mm10/database/kgXref.txt.gz 2014-02-18

mm10.transcripts ftp://hgdownload.cse.ucsc.edu/goldenPath/mm10/database/knownGene.txt.gz 2012-12-03

pfam.clans ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam27.0/database_files/clans.txt.gz 2013-02-28

pfam.domainlocations ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam27.0/proteomes/10090.tsv.gz 2012-12-05

pfam.domains ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam27.0/database_files/pfamA.txt.gz 2013-03-14

pfam.domains ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam27.0/database_files/clan_membership.txt.gz 2013-02-28

uniprot.proteindomains ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam27.0/proteomes/10090.tsv.gz 2012-12-05

uniprot.proteindomains http://uniprot.org/uniprot/?query=organism:10090&format=tab&columns=id,database(pfam) 2014-02-14

uniprot.proteinontologies ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/MOUSE/gene_association.goa_mouse.gz 2014-02-17

uniprot.proteins http://uniprot.org/uniprot/?query=organism:10090&format=tab&columns=id,protein%20names 2014-02-13

uniprot.proteins http://uniprot.org/uniprot/?query=organism:10090&format=tab&columns=id,reviewed,sequence 2014-02-13

uniprot.proteins http://uniprot.org/uniref/?query=organism:10090&format=tab&columns=id,members,identity 2014-02-04

Table A.1: Source files used to build the local database.


Domain ID Clan ID Total Eraser Mediator Modifier Reader Remodeler

PF12796 CL0465 56 0 5 2 52 0

PF07707 CL0033 41 0 41 0 0 0

PF01344 CL0186 34 0 34 0 0 0

PF13637 CL0465 16 0 0 2 15 0

PF00271 CL0023 15 0 0 1 3 14

PF13857 CL0465 11 0 0 1 11 0

PF08954 CL0186 4 0 0 0 4 0

PF12738 CL0459 4 0 0 0 4 0

PF13964 CL0186 4 0 4 0 0 0

PF13415 CL0186 3 0 3 0 0 0

PF00004 CL0023 3 0 2 0 1 0

PF00415 CL0186 3 0 3 0 0 0

PF01363 CL0390 3 0 1 0 3 0

PF11717 CL0049 3 0 0 2 0 1

PF13606 CL0465 3 0 1 0 2 0

PF14844 CL0266 2 0 0 0 2 0

PF12872 CL0123 2 0 0 0 2 0

PF00097 CL0229 2 0 0 0 2 0

PF00071 CL0023 2 0 2 0 0 0

PF13921 CL0123 2 0 0 0 0 2

PF05729 CL0023 2 0 0 0 2 0

PF04564 CL0229 2 0 0 0 2 0

PF00270 CL0023 2 0 0 0 2 0

PF01393 CL0049 2 0 0 0 2 0

PF04053 CL0186 2 0 0 0 2 0

PF08596 CL0186 2 0 0 0 2 0

Table A.2: Label domain relatives in the UniRef50GO data set. The table shows the number of instances having the respective domain for the whole set, as well as the ones being positive for the corresponding label. Domains with fewer than two occurrences in the whole data set are neglected.


Cross-validation External test set validation

Label Data set ACC PRE REC FM ACC PRE REC FM

Eraser UniRef50GOC 0.985 0.917 0.550 0.687 – – – –

UniRef50GOA 0.983 0.846 0.550 0.667 – – – –

Clus30GOC 0.984 0.875 0.500 0.636 – – – –

Clus30GOA 0.984 0.875 0.500 0.636 – – – –

Clus20GOC 0.986 1.000 0.333 0.500 – – – –

Clus20GOA 0.985 0.636 0.778 0.700 – – – –

Mediator UniRef50GOC 0.855 0.858 0.653 0.741 0.852 0.848 0.651 0.737

UniRef50GOA 0.791 0.707 0.550 0.618 0.811 0.711 0.659 0.684

Clus30GOC 0.835 0.745 0.576 0.650 0.860 0.760 0.704 0.731

Clus30GOA 0.767 0.565 0.413 0.477 0.745 0.500 0.320 0.390

Clus20GOC 0.819 0.743 0.477 0.581 0.771 0.583 0.333 0.424

Clus20GOA 0.734 0.479 0.330 0.391 0.732 0.500 0.364 0.421

Modifier UniRef50GOC 0.900 0.747 0.557 0.638 0.889 0.818 0.409 0.545

UniRef50GOA 0.896 0.698 0.632 0.663 0.879 0.667 0.476 0.556

Clus30GOC 0.890 0.721 0.576 0.641 0.880 0.647 0.647 0.647

Clus30GOA 0.849 0.577 0.482 0.526 0.888 0.750 0.529 0.621

Clus20GOC 0.880 0.689 0.575 0.627 0.855 0.625 0.357 0.455

Clus20GOA 0.861 0.633 0.521 0.571 0.829 0.500 0.571 0.533

Reader UniRef50GOC 0.773 0.755 0.760 0.757 0.800 0.810 0.746 0.777

UniRef50GOA 0.697 0.682 0.667 0.674 0.735 0.714 0.726 0.720

Clus30GOC 0.733 0.748 0.709 0.728 0.730 0.709 0.780 0.743

Clus30GOA 0.718 0.727 0.715 0.721 0.755 0.724 0.840 0.778

Clus20GOC 0.672 0.680 0.667 0.673 0.723 0.702 0.786 0.742

Clus20GOA 0.710 0.733 0.673 0.702 0.732 0.788 0.634 0.703

Remodeler UniRef50GOC 0.934 0.636 0.396 0.488 – – – –

UniRef50GOA 0.922 0.556 0.189 0.282 – – – –

Clus30GOC 0.930 0.636 0.341 0.444 – – – –

Clus30GOA 0.912 0.444 0.195 0.271 – – – –

Clus20GOC 0.913 0.524 0.297 0.379 – – – –

Clus20GOA 0.888 0.263 0.135 0.179 – – – –

Table A.3: Validation results of the final SVM prediction models. The table shows the accuracy (ACC), precision (PRE), recall (REC), and F-measure (FM) values for the models trained on the data sets with representatives being chosen by the major number of GO annotations.
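For reference, the four measures reported in the validation tables follow the usual definitions based on the counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). The sketch below is a hedged illustration: the counts TP = 11, FP = 1, and FN = 9 are our reconstruction from the reported precision and recall of the first eraser row together with the 20 erasers in the UniRef50GO set (Figure A.2), and TN = 656 is an assumed value chosen only to be consistent with the reported accuracy. None of these counts is stated in the table itself.

```python
def validation_measures(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F-measure from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return accuracy, precision, recall, f_measure


# Assumed counts (see the paragraph above); not reported in the tables.
acc, pre, rec, fm = validation_measures(tp=11, fp=1, tn=656, fn=9)
```

With these counts, recall is 0.550, precision rounds to 0.917, and accuracy rounds to 0.985, while the F-measure evaluates to 0.6875, which the table lists as 0.687. The example also shows why the accuracy of the rare labels stays high even when almost half of the positives are missed: the large number of true negatives dominates the ratio.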


Cross-validation External test set validation

Label Data set ACC PRE REC FM ACC PRE REC FM

Eraser UniRef50GOC 0.985 0.917 0.550 0.687 – – – –

UniRef50GOA 0.983 0.846 0.550 0.667 – – – –

Clus30GOC 0.984 1.000 0.429 0.600 – – – –

Clus30GOA 0.980 0.833 0.357 0.500 – – – –

Clus20GOC 0.986 1.000 0.333 0.500 – – – –

Clus20GOA 0.990 0.857 0.667 0.750 – – – –

Mediator UniRef50GOC 0.855 0.926 0.592 0.722 0.844 0.923 0.558 0.696

UniRef50GOA 0.781 0.759 0.421 0.541 0.773 0.739 0.415 0.531

Clus30GOC 0.829 0.813 0.462 0.589 0.860 0.760 0.704 0.731

Clus30GOA 0.769 0.614 0.278 0.383 0.755 0.556 0.200 0.294

Clus20GOC 0.778 0.623 0.394 0.483 0.759 0.556 0.238 0.333

Clus20GOA 0.766 0.593 0.302 0.400 0.793 0.727 0.364 0.485

Modifier UniRef50GOC 0.924 0.867 0.613 0.718 0.896 0.750 0.545 0.632

UniRef50GOA 0.912 0.843 0.557 0.670 0.909 0.800 0.571 0.667

Clus30GOC 0.884 0.737 0.494 0.592 0.900 0.818 0.529 0.643

Clus30GOA 0.876 0.700 0.494 0.579 0.888 0.750 0.529 0.621

Clus20GOC 0.875 0.706 0.493 0.581 0.831 0.500 0.357 0.417

Clus20GOA 0.863 0.698 0.411 0.517 0.866 0.600 0.643 0.621

Reader UniRef50GOC 0.778 0.767 0.750 0.759 0.815 0.865 0.714 0.783

UniRef50GOA 0.720 0.716 0.670 0.692 0.697 0.662 0.726 0.692

Clus30GOC 0.737 0.742 0.733 0.737 0.740 0.750 0.720 0.735

Clus30GOA 0.706 0.716 0.699 0.707 0.694 0.667 0.800 0.727

Clus20GOC 0.706 0.702 0.729 0.715 0.675 0.653 0.762 0.703

Clus20GOA 0.700 0.725 0.659 0.690 0.634 0.628 0.659 0.643

Remodeler UniRef50GOC 0.937 0.867 0.245 0.382 – – – –

UniRef50GOA 0.919 0.500 0.151 0.232 – – – –

Clus30GOC 0.930 0.750 0.220 0.340 – – – –

Clus30GOA 0.918 0.545 0.146 0.231 – – – –

Clus20GOC 0.925 0.800 0.216 0.340 – – – –

Clus20GOA 0.902 0.286 0.054 0.091 – – – –

Table A.4: Validation results of the final RF prediction models. The table shows the accuracy (ACC), precision (PRE), recall (REC), and F-measure (FM) values for the models trained on the data sets with representatives being chosen by the major number of GO annotations.


Cross-validation External test set validation

Label Data set ACC PRE REC FM ACC PRE REC FM

Eraser UniRef50SLC 0.984 0.909 0.500 0.645 – – – –

UniRef50SLA 0.982 0.833 0.500 0.625 – – – –

Clus30SLC 0.982 0.857 0.429 0.571 – – – –

Clus30SLA 0.982 0.857 0.429 0.571 – – – –

Clus20SLC 0.986 0.714 0.556 0.625 – – – –

Clus20SLA 0.983 0.625 0.556 0.588 – – – –

Mediator UniRef50SLC 0.866 0.868 0.681 0.763 0.881 0.861 0.738 0.795

UniRef50SLA 0.796 0.721 0.550 0.624 0.773 0.789 0.366 0.500

Clus30SLC 0.833 0.791 0.511 0.621 0.860 1.000 0.481 0.650

Clus30SLA 0.761 0.557 0.386 0.456 0.816 0.733 0.440 0.550

Clus20SLC 0.795 0.681 0.427 0.525 0.747 0.545 0.273 0.364

Clus20SLA 0.748 0.520 0.368 0.431 0.780 0.750 0.273 0.400

Modifier UniRef50SLC 0.879 0.651 0.509 0.571 0.910 0.765 0.619 0.684

UniRef50SLA 0.892 0.673 0.642 0.657 0.894 0.722 0.591 0.650

Clus30SLC 0.875 0.683 0.488 0.569 0.900 1.000 0.412 0.583

Clus30SLA 0.888 0.723 0.560 0.631 0.888 0.714 0.588 0.645

Clus20SLC 0.870 0.707 0.408 0.518 0.819 0.444 0.286 0.348

Clus20SLA 0.890 0.703 0.634 0.667 0.902 0.875 0.500 0.636

Reader UniRef50SLC 0.745 0.743 0.688 0.715 0.724 0.810 0.540 0.648

UniRef50SLA 0.738 0.717 0.731 0.723 0.795 0.746 0.855 0.797

Clus30SLC 0.726 0.748 0.681 0.713 0.760 0.882 0.600 0.714

Clus30SLA 0.683 0.684 0.687 0.686 0.714 0.762 0.640 0.696

Clus20SLC 0.684 0.726 0.589 0.651 0.723 0.771 0.643 0.701

Clus20SLA 0.674 0.691 0.634 0.662 0.732 0.788 0.634 0.703

Remodeler UniRef50SLC 0.940 0.731 0.365 0.487 – – – –

UniRef50SLA 0.918 0.450 0.173 0.250 – – – –

Clus30SLC 0.934 0.621 0.450 0.522 – – – –

Clus30SLA 0.908 0.308 0.100 0.151 – – – –

Clus20SLC 0.928 0.650 0.361 0.464 – – – –

Clus20SLA 0.900 0.391 0.250 0.305 – – – –

Table A.5: Validation results of the final SVM prediction models. The table shows the accuracy (ACC), precision (PRE), recall (REC), and F-measure (FM) values for the models trained on the data sets with representatives being chosen by the major protein sequence length.


Cross-validation External test set validation

Label Data set ACC PRE REC FM ACC PRE REC FM

Eraser UniRef50SLC 0.982 0.833 0.500 0.625 – – – –

UniRef50SLA 0.983 0.846 0.550 0.667 – – – –

Clus30SLC 0.982 0.778 0.500 0.609 – – – –

Clus30SLA 0.986 1.000 0.500 0.667 – – – –

Clus20SLC 0.988 0.833 0.556 0.667 – – – –

Clus20SLA 0.988 0.833 0.556 0.667 – – – –

Mediator UniRef50SLC 0.861 0.935 0.606 0.735 0.888 0.966 0.667 0.789

UniRef50SLA 0.790 0.781 0.441 0.563 0.758 0.846 0.268 0.407

Clus30SLC 0.843 0.867 0.489 0.625 0.830 1.000 0.370 0.541

Clus30SLA 0.753 0.548 0.268 0.360 0.755 0.556 0.200 0.294

Clus20SLC 0.785 0.652 0.409 0.503 0.747 0.556 0.227 0.323

Clus20SLA 0.777 0.642 0.321 0.428 0.768 0.714 0.227 0.345

Modifier UniRef50SLC 0.916 0.813 0.613 0.699 0.918 0.813 0.619 0.703

UniRef50SLA 0.912 0.773 0.642 0.701 0.909 0.813 0.591 0.684

Clus30SLC 0.875 0.677 0.500 0.575 0.890 1.000 0.353 0.522

Clus30SLA 0.877 0.707 0.488 0.577 0.888 0.750 0.529 0.621

Clus20SLC 0.870 0.718 0.394 0.509 0.880 0.833 0.357 0.500

Clus20SLA 0.877 0.706 0.507 0.590 0.890 0.857 0.429 0.571

Reader UniRef50SLC 0.778 0.755 0.772 0.763 0.746 0.730 0.730 0.730

UniRef50SLA 0.707 0.703 0.653 0.677 0.705 0.672 0.726 0.698

Clus30SLC 0.751 0.754 0.742 0.748 0.770 0.829 0.680 0.747

Clus30SLA 0.689 0.696 0.679 0.687 0.694 0.738 0.620 0.674

Clus20SLC 0.684 0.684 0.681 0.683 0.699 0.707 0.690 0.699

Clus20SLA 0.662 0.665 0.659 0.662 0.671 0.694 0.610 0.649

Remodeler UniRef50SLC 0.943 0.792 0.365 0.500 – – – –

UniRef50SLA 0.925 0.667 0.115 0.197 – – – –

Clus30SLC 0.928 0.643 0.225 0.333 – – – –

Clus30SLA 0.920 0.600 0.075 0.133 – – – –

Clus20SLC 0.930 0.818 0.250 0.383 – – – –

Clus20SLA 0.902 0.250 0.056 0.091 – – – –

Table A.6: Validation results of the final RF prediction models. The table shows the accuracy (ACC), precision (PRE), recall (REC), and F-measure (FM) values for the models trained on the data sets with representatives being chosen by the major protein sequence length.

B I B L I O G R A P H Y

[1] C. Bock and T. Lengauer. Computational Epigenetics. Bioinformatics, 24(1):1–10, 2008.

[2] A. D. Riggs, R. A. Martienssen, and V. E. Russo. Epigenetic Mechanisms of Gene Regulation. In Cold Spring Harbor Monograph, volume 32, chapter Introduction. Cold Spring Harbor Laboratory Press, 1996.

[3] M. F. Fraga, E. Ballestar, M. F. Paz, S. Ropero, F. Setien, M. L. Ballestar, D. Heine-Suñer, J. C. Cigudosa, M. Urioste, J. Benitez, M. Boix-Chornet, A. Sanchez-Aguilera, C. Ling, E. Carlsson, P. Poulsen, A. Vaag, Z. Stephan, T. D. Spector, Y.-Z. Wu, C. Plass, and M. Esteller. Epigenetic differences arise during the lifetime of monozygotic twins. Proceedings of the National Academy of Sciences of the United States of America, 102(30):10604–10609, 2005.

[4] P. A. Callinan and A. P. Feinberg. The emerging science of epigenomics. Human Molecular Genetics, 15(suppl. 1):R95–R101, 2006.

[5] A. Brero, H. P. Easwaran, D. Nowak, I. Grunewald, T. Cremer, H. Leonhardt, and M. C. Cardoso. Methyl CpG–binding proteins induce large-scale chromatin reorganization during terminal differentiation. The Journal of Cell Biology, 169(5):733–743, 2005.

[6] S. Pu, A. L. Turinsky, J. Vlasblom, T. On, X. Xiong, A. Emili, Z. Zhang, J. Greenblatt, J. Parkinson, and S. J. Wodak. Expanding the Landscape of Chromatin Modification (CM)-Related Functional Domains and Genes in Human. PLoS ONE, 5(11):e14122, 2010.

[7] D. Karolchik, G. P. Barber, J. Casper, H. Clawson, M. S. Cline, M. Diekhans, T. R. Dreszer, P. A. Fujita, L. Guruvadoo, M. Haeussler, R. A. Harte, S. Heitner, A. S. Hinrichs, K. Learned, B. T. Lee, C. H. Li, B. J. Raney, B. Rhead, K. R. Rosenbloom, C. A. Sloan, M. L. Speir, A. S. Zweig, D. Haussler, R. M. Kuhn, and W. J. Kent. The UCSC Genome Browser database: 2014 update. Nucleic Acids Research, 42(D1):D764–D770, 2014.

[8] The UniProt Consortium. Activities at the Universal Protein Resource (UniProt). Nucleic Acids Research, 42(D1):D191–D198, 2014.

[9] R. D. Finn, A. Bateman, J. Clements, P. Coggill, R. Y. Eberhardt, S. R. Eddy, A. Heger, K. Hetherington, L. Holm, J. Mistry, E. L. L. Sonnhammer, J. Tate, and M. Punta. The Pfam protein families database. Nucleic Acids Research, 42(D1):D222–D230, 2014.


[10] The Gene Ontology Consortium. Gene ontology: tool for the unification of biology. Nature Genetics, 25(1):25–29, 2000.

[11] L. Ho and G. R. Crabtree. Chromatin remodelling during development. Nature, 463:474–484, 2010.

[12] J. D. McGhee and G. Felsenfeld. Nucleosome Structure. Annual Review of Biochemistry, 49:1115–1156, 1980.

[13] T. Kouzarides. Chromatin Modifications and Their Function. Cell, 128(4):693–705, 2007.

[14] V. G. Allfrey, R. Faulkner, and A. E. Mirsky. Acetylation and Methylation of Histones and Their Possible Role in the regulation of RNA Synthesis. Proceedings of the National Academy of Sciences of the United States of America, 51(5):786–794, 1964.

[15] J. E. Brownell, J. Zhou, T. Ranalli, R. Kobayashi, D. G. Edmondson, S. Y. Roth, and C. D. Allis. Tetrahymena Histone Acetyltransferase A: A Homolog to Yeast Gcn5p Linking Histone Acetylation to Gene Activation. Cell, 84(6):843–851, 1996.

[16] J. Taunton, C. A. Hassig, and S. Schreiber. A mammalian histone deacetylase related to the yeast transcriptional regulator Rpd3p. Science, 272(5260):408–411, 1996.

[17] B. D. Strahl and C. D. Allis. The language of covalent histone modifications. Nature, 403:41–45, 2000.

[18] T. Suganuma and J. L. Workman. Signals and Combinatorial Functions of Histone Modifications. Annual Review of Biochemistry, 80:473–499, 2011.

[19] A. Bird. DNA methylation patterns and epigenetic memory. Genes & Development, 16:6–21, 2002.

[20] M. Weber and D. Schübeler. Genomic patterns of DNA methylation: targets and function of an epigenetic mark. Current Opinion in Cell Biology, 19(3):273–280, 2007.

[21] H. Cedar and Y. Bergman. Linking DNA methylation and histone modification: patterns and paradigms. Nature Reviews Genetics, 10:295–304, 2009.

[22] V. K. Gangaraju and B. Bartholomew. Mechanisms of ATP Dependent Chromatin Remodeling. Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis, 618(1–2):3–17, 2008.

[23] M. A. Surani, K. Hayashi, and P. Hajkova. Genetic and Epigenetic Regulators of Pluripotency. Cell, 128(4):747–762, 2007.

[24] M. F. Fraga and M. Esteller. Epigenetics and aging: the targets and the marks. Trends in Genetics, 23(8):413–418, 2007.


[25] P. A. Jones and S. B. Baylin. The Epigenomics of Cancer. Cell, 128(4):683–692, 2007.

[26] F. Chiacchiera, A. Piunti, and D. Pasini. Epigenetic methylations and their connections with metabolism. Cellular and Molecular Life Sciences, 70(9):1495–1508, 2013.

[27] S. Thakurela. Personal communication. Institute of Molecular Biology, Working group “Epigenetic Regulation of Development and Disease”, 55128 Mainz, Germany.

[28] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York (NY, USA), second edition, 2009.

[29] J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann, Waltham (MA, USA), third edition, 2011.

[30] I. H. Witten, E. Frank, and M. A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Burlington (MA, USA), third edition, 2011.

[31] M. H. Gail, L. A. Brinton, D. P. Byar, D. K. Corle, S. B. Green, C. Schairer, and J. J. Mulvihill. Projecting Individualized Probabilities of Developing Breast Cancer for White Females Who Are Being Examined Annually. Journal of the National Cancer Institute, 81(24):1879–1886, 1989.

[32] P. Pavlidis, J. Qin, V. Arango, J. J. Mann, and E. Sibille. Using the Gene Ontology for Microarray Data Mining: A Comparison of Methods and Application to Age Effects in Human Prefrontal Cortex. Neurochemical Research, 29(6):1213–1222, 2004.

[33] B. Al-Badr and S. A. Mahmoud. Survey and bibliography of Arabic optical text recognition. Signal Processing, 41(1):49–77, 1995.

[34] T. M. Mitchell. Machine Learning. McGraw-Hill, international edition, 1997.

[35] K. Gendler, T. Paulsen, and C. Napoli. ChromDB: The Chromatin Database. Nucleic Acids Research, 36(suppl. 1):D298–D302, 2008.

[36] A. L. Turinsky, B. Turner, R. C. Borja, J. A. Gleeson, M. Heath, S. Pu, T. Switzer, D. Dong, Y. Gong, T. On, X. Xiong, A. Emili, J. Greenblatt, J. Parkinson, Z. Zhang, and S. J. Wodak. DAnCER: Disease-Annotated Chromatin Epigenetics Resource. Nucleic Acids Research, 39(suppl. 1):D889–D894, 2011.

[37] T. Tatusova, S. Ciufo, B. Fedorov, K. O’Neill, and I. Tolstoy. RefSeq microbial genomes database: new representation and annotation strategy. Nucleic Acids Research, 42(D1):D553–D559, 2014.


[38] C. Bock, K. Halachev, J. Büch, and T. Lengauer. EpiGRAPH: user-friendly software for statistical analysis and prediction of (epi)genomic data. Genome Biology, 10(2):R14, 2009.

[39] K. Halachev, H. Bast, F. Albrecht, T. Lengauer, and C. Bock. EpiExplorer: live exploration and global analysis of large epigenomic datasets. Genome Biology, 13(10):R96, 2012.

[40] A. Clare, A. Karwath, H. Ougham, and R. D. King. Functional bioinformatics for Arabidopsis thaliana. Bioinformatics, 22(9):1130–1136, 2006.

[41] A. Zimek, F. Buchwald, E. Frank, and S. Kramer. A Study of Hierarchical and Flat Classification of Proteins. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 7(3):563–571, 2010.

[42] L. Schietgat, C. Vens, J. Struyf, H. Blockeel, D. Kocev, and S. Džeroski. Predicting gene function using hierarchical multi-label decision tree ensembles. BMC Bioinformatics, 11(2), 2010.

[43] P. Flicek, M. R. Amode, D. Barrell, K. Beal, K. Billis, S. Brent, D. Carvalho-Silva, P. Clapham, G. Coates, S. Fitzgerald, L. Gil, C. G. Girón, L. Gordon, T. Hourlier, S. Hunt, N. Johnson, T. Juettemann, A. K. Kähäri, S. Keenan, E. Kulesha, F. J. Martin, T. Maurel, W. M. McLaren, D. N. Murphy, R. Nag, B. Overduin, M. Pignatelli, B. Pritchard, E. Pritchard, H. S. Riat, M. Ruffier, D. Sheppard, K. Taylor, A. Thormann, S. J. Trevanion, A. Vullo, S. P. Wilder, M. Wilson, A. Zadissa, B. L. Aken, E. Birney, F. Cunningham, J. Harrow, J. Herrero, T. J. Hubbard, R. Kinsella, M. Muffato, A. Parker, G. Spudich, A. Yates, D. R. Zerbino, and S. M. Searle. Ensembl 2014. Nucleic Acids Research, 42(D1):D749–D755, 2014.

[44] A. Ruepp, B. Waegele, M. Lechner, B. Brauner, I. Dunger-Kaltenbach, G. Fobo, G. Frishman, C. Montrone, and H.-W. Mewes. CORUM: the comprehensive resource of mammalian protein complexes--2009. Nucleic Acids Research, 38(suppl. 1):D497–D501, 2010.

[45] S. Pu, J. Wong, B. Turner, E. Cho, and S. J. Wodak. Up-to-date catalogues of yeast protein complexes. Nucleic Acids Research, 37(3):825–831, 2009.

[46] A. Bateman, E. Birney, L. Cerruti, R. Durbin, L. Etwiller, S. R. Eddy, S. Griffiths-Jones, K. L. Howe, M. Marshall, and E. L. L. Sonnhammer. The Pfam Protein Families Database. Nucleic Acids Research, 30(1):276–280, 2002.

[47] D. Maglott, J. Ostell, K. D. Pruitt, and T. Tatusova. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Research, 35(suppl. 1):D26–D31, 2007.

[48] W. J. Kent, C. W. Sugnet, T. S. Furey, K. M. Roskin, T. H. Pringle, A. M. Zahler, and D. Haussler. The Human Genome Browser at UCSC. Genome Research, 12:996–1006, 2002.


[49] Gene Help [Internet], chapter Gene Frequently Asked Questions. National Center for Biotechnology Information (US), 2008, 2014.

[50] Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature, 420(6915):520–562, 2002.

[51] D. Karolchik, A. S. Hinrichs, T. S. Furey, K. M. Roskin, C. W. Sugnet, D. Haussler, and W. J. Kent. The UCSC Table Browser data retrieval tool. Nucleic Acids Research, 32(suppl. 1):D493–D496, 2004.

[52] The UniProt Consortium. How can I access resources on this web site programmatically? http://uniprot.org/faq/28.

[53] H. Huang, P. B. McGarvey, B. E. Suzek, R. Mazumder, J. Zhang, Y. Chen, and C. H. Wu. A comprehensive protein-centric ID mapping service for molecular data integration. Bioinformatics, 27(8):1190–1191, 2011.

[54] PostgreSQL Global Development Group. PostgreSQL 9.2. http://postgresql.org, 2014.

[55] EpiExplorer: supplementary information. http://epiexplorer.mpi-inf.mpg.de/supplementary/.

[56] The Gene Ontology Consortium. The Gene Ontology in 2010: extensions and refinements. Nucleic Acids Research, 38(suppl. 1):D331–D335, 2010.

[57] R. D. Finn, J. Mistry, B. Schuster-Böckler, S. Griffiths-Jones, V. Hollich, T. Lassmann, S. Moxon, M. Marshall, A. Khanna, R. Durbin, S. R. Eddy, E. L. L. Sonnhammer, and A. Bateman. Pfam: clans, web tools and services. Nucleic Acids Research, 34(suppl. 1):D247–D251, 2006.

[58] B. E. Suzek, H. Huang, P. McGarvey, R. Mazumder, and C. H. Wu. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics, 23(10):1282–1288, 2007.

[59] E. C. Dimmer, R. P. Huntley, Y. Alam-Faruque, T. Sawford, C. O’Donovan, M. J. Martin, B. Bely, P. Browne, W. M. Chan, R. Eberhardt, M. Gardner, K. Laiho, D. Legge, M. Magrane, K. Pichler, D. Poggioli, H. Sehra, A. Auchincloss, K. Axelsen, M.-C. Blatter, E. Boutet, S. Braconi-Quintaje, L. Breuza, A. Bridge, E. Coudert, A. Estreicher, L. Famiglietti, S. Ferro-Rojas, M. Feuermann, A. Gos, N. Gruaz-Gumowski, U. Hinz, C. Hulo, J. James, S. Jimenez, F. Jungo, G. Keller, P. Lemercier, D. Lieberherr, P. Masson, M. Moinat, I. Pedruzzi, S. Poux, C. Rivoire, B. Roechert, M. Schneider, A. Stutz, S. Sundaram, M. Tognolli, L. Bougueleret, G. Argoud-Puy, I. Cusin, P. Duek-Roggli, I. Xenarios, and R. Apweiler. The UniProt-GO Annotation database in 2011. Nucleic Acids Research, 40(D1):D565–D570, 2012.

[60] The Gene Ontology Consortium. Guide to GO Evidence Codes. http://geneontology.org/page/guide-go-evidence-codes.


[61] J. Park, L. Holm, A. Heger, and C. Chothia. RSDB: representative proteinsequence databases have high information content. Bioinformatics, 16(5):458–464, 2000.

[62] L. Fu, B. Niu, Z. Zhu, S. Wu, and W. Li. CD-HIT: accelerated for clusteringthe next-generation sequencing data. Bioinformatics, 28(23):3150–3152, 2012.

[63] W. Li and A. Godzik. Cd-hit: a fast program for clustering and comparinglarge sets of protein or nucleotide sequences. Bioinformatics, 22(13):1658–1659,2006.

[64] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N.Shindyalov, and P. E. Bourne. The Protein Data Bank. Nucleic Acids Research,28(1):235–242, 2000.

[65] Y. Huang, B. Niu, Y. Gao, L. Fu, and W. Li. CD-HIT Suite: a web server forclustering and comparing biological sequences. Bioinformatics, 26(5):680–682,2010.

[66] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic localalignment search tool, 1990.

[67] L. D. Raedt. The Logic Programming Paradigm, chapter A Perspective on Induc-tive Logic Programming, pages 335–346. Springer Berlin Heidelberg, 1999.

[68] J. Read, B. Pfahringer, G. Holmes, and E. Frank. Classifier Chains for Multi-label Classification. Machine Learning, 85(3):333–359, 2011.

[69] G. Tsoumakas and I. Katakis. Multi-Label Classification: An Overview. Inter-national Journal of Data Warehousing and Mining, 3(3):1–13, 2007.

[70] C. Elkan and K. Noto. Learning Classifiers from Only Positive and UnlabeledData. In KDD ’08 Proceedings of the 14th ACM SIGKDD international conferenceon Knowledge discovery and data mining, pages 213–220. ACM, New York (NY,USA), 2008.

[71] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

[72] C.-C. Chang and C.-J. Lin. LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1–27:27, 2011.

[73] Y. EL-Manzalawy. WLSVM. http://www.cs.iastate.edu/~yasser/wlsvm/, 2005.

[74] M. A. Hall. Correlation-based Feature Subset Selection for Machine Learning. PhD thesis, University of Waikato, Hamilton, New Zealand, 1998.

[75] L. Breiman. Random Forests. Machine Learning, 45(1):5–32, 2001.

[76] C. J. Van Rijsbergen. Information Retrieval. Butterworth-Heinemann, Newton (MA, USA), second edition, 1979.

[77] L. Breiman and P. Spector. Submodel Selection and Evaluation in Regression. The X-Random Case. International Statistical Review, 60(3):291–319, 1992.

[78] R. Kohavi. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In International Joint Conference on Artificial Intelligence (IJCAI), pages 1137–1143. Morgan Kaufmann, 1995.

[79] M. Gütlein, C. Helma, A. Karwath, and S. Kramer. A Large-Scale Empirical Evaluation of Cross-Validation and External Test Set Validation in (Q)SAR. Molecular Informatics, 32(5–6):516–528, 2013.

[80] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[81] J. A. Blake, C. J. Bult, J. T. Eppig, J. A. Kadin, J. E. Richardson, and The Mouse Genome Database Group. The Mouse Genome Database: integration of and access to knowledge about the laboratory mouse. Nucleic Acids Research, 42(D1):D810–D817, 2014.

[82] L. G. Wilming, J. G. R. Gilbert, K. Howe, S. Trevanion, T. Hubbard, and J. L. Harrow. The vertebrate genome annotation (Vega) database. Nucleic Acids Research, 36(suppl. 1):D753–760, 2008.

[83] R. Kikuno, T. Nagase, M. Nakayama, H. Koga, N. Okazaki, D. Nakajima, and O. Ohara. HUGE: a database for human KIAA proteins, a 2004 update integrating HUGEppi and ROUGE. Nucleic Acids Research, 32(suppl. 1):D502–D504, 2004.

[84] P. J. Stogios and G. G. Privé. The BACK domain in BTB-kelch proteins. Trends in Biochemical Sciences, 29(12):634–637, 2004.

[85] A. S. Ho, K. Kannan, D. M. Roy, L. G. T. Morris, I. Ganly, N. Katabi, D. Ramaswami, L. A. Walsh, S. Eng, J. T. Huse, J. Zhang, I. Dolgalev, K. Huberman, A. Heguy, A. Viale, M. Drobnjak, M. A. Leversha, C. E. Rice, B. Singh, N. G. Iyer, C. R. Leemans, E. Bloemena, R. L. Ferris, R. R. Seethala, B. E. Gross, Y. Liang, R. Sinha, L. Peng, B. J. Raphael, S. Turcan, Y. Gong, N. Schultz, S. Kim, S. Chiosea, J. P. Shah, C. Sander, W. Lee, and T. A. Chan. The mutational landscape of adenoid cystic carcinoma. Nature Genetics, 45:791–798, 2013.

[86] A. Flaus, D. M. A. Martin, G. J. Barton, and T. Owen-Hughes. Identification of multiple distinct Snf2 subfamilies with conserved structural motifs. Nucleic Acids Research, 34(10):2887–2905, 2006.

[87] A. J. Williams, L. M. Khachigian, T. Shows, and T. Collins. Isolation and Characterization of a Novel Zinc-finger Protein with Transcriptional Repressor Activity. The Journal of Biological Chemistry, 270:22143–22152, 1995.

[88] V. J. Bardwell and R. Treisman. The POZ domain: a conserved protein-protein interaction motif. Genes & Development, 8:1664–1677, 1994.

[89] E. J. Bellefroid, D. A. Poncelet, P. J. Lecocq, O. Revelant, and J. A. Martial. The evolutionarily conserved Krüppel-associated box domain defines a subfamily of eukaryotic multifingered proteins. Proceedings of the National Academy of Sciences of the United States of America, 88(9):3608–3612, 1991.

[90] L. C. Edelstein and T. Collins. The SCAN domain family of zinc finger transcription factors. Gene, 359:1–17, 2005.

[91] B. T. Kile, B. A. Schulman, W. S. Alexander, N. A. Nicola, H. M. E. Martin, and D. J. Hilton. The SOCS box: a tale of destruction and degradation. Trends in Biochemical Sciences, 27(5):235–241, 2002.

[92] C. A. Andresen, S. Smedegaard, K. B. Sylvestersen, C. Svensson, D. Iglesias-Gato, G. Cazzamali, T. K. Nielsen, M. L. Nielsen, and A. Flores-Morales. Protein Interaction Screening for the Ankyrin Repeats and Suppressor of Cytokine Signaling (SOCS) Box (ASB) Family Identify Asb11 as a Novel Endoplasmic Reticulum Resident Ubiquitin Ligase. The Journal of Biological Chemistry, 289:2043–2054, 2014.

[93] D. W. Choi, Y.-M. Seo, E.-A. Kim, K. S. Sung, J. W. Ahn, S.-J. Park, S.-R. Lee, and C. Y. Choi. Ubiquitination and Degradation of Homeodomain-interacting Protein Kinase 2 by WD40 Repeat/SOCS Box Protein WSB-1. The Journal of Biological Chemistry, 283:4682–4689, 2008.

[94] P. Trojer, G. Li, R. J. Sims III, A. Vaquero, N. Kalakonda, P. Boccuni, D. Lee, H. Erdjument-Bromage, P. Tempst, S. D. Nimer, Y.-H. Wang, and D. Reinberg. L3MBTL1, a Histone-Methylation-Dependent Chromatin Lock. Cell, 129(5):915–928, 2007.

[95] J. Min, A. Allali-Hassani, N. Nady, C. Qi, H. Ouyang, Y. Liu, F. MacKenzie, M. Vedadi, and C. H. Arrowsmith. L3MBTL1 recognition of mono- and dimethylated histones. Nature Structural & Molecular Biology, 14:1229–1230, 2007.

[96] K. Hofmann and J. Tschopp. The death domain motif found in Fas (Apo-1) and TNF receptor is present in proteins involved in apoptosis and axonal guidance. FEBS Letters, 371(3):321–323, 1995.

[97] B. Eichenmüller, P. Everley, J. Palange, D. Lepley, and K. A. Suprenant. The Human EMAP-like Protein-70 (ELP70) Is a Microtubule Destabilizer That Localizes to the Mitotic Apparatus. The Journal of Biological Chemistry, 277:1301–1309, 2002.

[98] X. Luo and K. Hofmann. The protease-associated domain: a homology domain associated with multiple classes of proteases. Trends in Biochemical Sciences, 26(3):147–148, 2001.

[99] C. E. Berndsen and C. Wolberger. New insights into ubiquitin E3 ligase mechanism. Nature Structural & Molecular Biology, 21:301–307, 2014.

[100] J. Min, X. Zhang, X. Cheng, S. I. Grewal, and R.-M. Xu. Structure of the SET domain histone lysine methyltransferase Clr4. Nature Structural Biology, 9:828–832, 2002.

[101] B. Al-Sady, H. D. Madhani, and G. J. Narlikar. Division of Labor between the Chromodomains of HP1 and Suv39 Methylase Enables Coordination of Heterochromatin Spread. Molecular Cell, 51(1):80–91, 2013.

[102] S. Rea, F. Eisenhaber, D. O’Carroll, B. D. Strahl, Z.-W. Sun, M. Schmid, S. Opravil, K. Mechtler, C. P. Ponting, C. D. Allis, and T. Jenuwein. Regulation of chromatin structure by site-specific histone H3 methyltransferases. Nature, 406:593–599, 2000.

[103] A. Andreeva, D. Howorth, C. Chothia, E. Kulesha, and A. G. Murzin. SCOP2 prototype: a new approach to protein structure mining. Nucleic Acids Research, 42(D1):D310–D314, 2014.