
DECIPHERING COMPLEX BIOLOGICAL PROCESSES USING GENE COEXPRESSION NETWORKS

By

LIHE LIU

A THESIS PRESENTED TO THE GRADUATE SCHOOL

OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

UNIVERSITY OF FLORIDA

2020

© 2020 Lihe Liu

To my family and friends


ACKNOWLEDGMENTS

In the first place, I would like to express my most profound appreciation to my

advisor Dr. Francisco Peñagaricano. This work would not have been so novel and

fascinating without his mentorship. He proposed great ideas in the beginning, worked

closely with me during my exploration and gave me valuable advice to accomplish my

research projects. During my master’s program, I have learned not only how to solve

problems but also how to think creatively. I would also like to extend my deepest

gratitude to my committee members Dr. Samantha Brooks and Dr. Matias Kirst for their

advice, suggestions and support.

I am incredibly thankful to the colleagues I have worked with in the

Department of Animal Sciences at the University of Florida, especially my labmates:

Dr. Rostam Abdollahi-Arpanahi, Dr. Juan Pablo Nani, Dr. Fernanda Rezende, Dr.

Helder Louvandini, Dr. Rodrigo De Almeida Teixeira, Dr. Laila Talarico Dias, Anil Sigdel,

Rocio Amorin, Hendyel Pacheco, Mariana Nehme Marinho, Simone Da Silva, Eugenio

Jara and Gastón Rijo. Many thanks to the Animal Sciences Graduate Student

Association in the Department of Animal Sciences.

I am deeply indebted to my parents, who have always supported me no matter

which city in the world I am in.


TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF FIGURES

LIST OF ABBREVIATIONS

ABSTRACT

CHAPTER

1 INTRODUCTION

2 LITERATURE REVIEW

High Throughput Technologies and Omics Data
    Overview and History
    Gene Expression Profiling and Analysis Methods
    Epigenetic Modifications and Analysis Methods
    Other Types of Omics Data
    Multi-Omics Data Integration and Systems Biology Methods

Biological Networks
    Biological Networks Overview
    Gene Coexpression Network Analysis
    Unweighted and Weighted Gene Correlation Network Analysis
    Differential Network Analysis
    Limitations of Correlation Networks and Conditional Independencies

Over-representation Analysis (ORA)
    Fisher’s Exact Test and Hypergeometric Distribution
    ORA Example
    EnrichKit R software

3 DIFFERENTIAL NETWORK ANALYSIS OF BOVINE MUSCLE REVEALS CHANGES IN GENE COEXPRESSION PATTERNS IN RESPONSE TO CHANGES IN MATERNAL NUTRITION

Introduction

Materials and Methods
    Ethics Statement
    Animals and Experimental Design
    RNA Extraction, Library Preparation and Sequencing
    RNA-Seq Quality Control and Mapping
    Read Counting, Processing and Normalization
    Gene Coexpression Network Construction
    Module Preservation
    Module Characterization
    DNA Extraction, Library Preparation and Sequencing
    Bisulfite-Seq Quality Control and Mapping
    Differentially Methylated Cytosines and Genes

Results
    RNA-sequencing Analysis
    Network Construction and Module Identification
    Module Preservation
    Module Characterization
    DNA Methylation Analysis
    DNA Methylation and Module Preservation
    DNA Methylation and Network Properties

Discussion

4 CONCLUSIONS

LIST OF REFERENCES

BIOGRAPHICAL SKETCH


LIST OF FIGURES

Figure

2-1 Spermatogenesis coexpression network analysis: distributions of node connectivity and node clustering coefficient between maternal diets.

2-2 Functional terms and pathways significantly enriched with genes associated with dairy cow conception under thermal-stress conditions. Four gene annotation databases were analyzed.

3-1 Soft-thresholding powers and module detection.

3-2 Network visualization.

3-3 Permutation-based composite preservation statistics. (A) Summary statistics Zsummary (y-axis) as function of module size. (B) Summary statistics medianRank (y-axis) as function of module size.

3-4 Functional characterization of non-preserved modules. Six gene annotation databases were analyzed: Gene Ontology (GO), KEGG, Medical Subject Headings (MeSH), InterPro, Reactome and Molecular Signatures Database.

3-5 Comparison of methylation levels between preserved and unpreserved modules. Methylation level was calculated as differentially methylated cytosines divided by all the cytosines evaluated.

3-6 Relationship between methylation level and intramodular connectivity. Methylation level was calculated as differentially methylated cytosines divided by all the cytosines evaluated.


LIST OF ABBREVIATIONS

ATP         Adenosine Triphosphate
DAGs        Directed and Acyclic Graphs
DEGs        Differentially Expressed Genes
DMCs        Differentially Methylated Cytosines
GN          Gene Network
GO          Gene Ontology
HTO         High Throughput Omics
KEGG        Kyoto Encyclopedia of Genes and Genomes
MeSH        Medical Subject Headings
MSigDB      Molecular Signatures Database
NAD(P)H     Nicotinamide Adenine Dinucleotide Phosphate
NGS         Next Generation Sequencing
ODE         Ordinary Differential Equation
ORA         Over-Representation Analysis
TOM         Topological Overlap Matrix
WGCNA       Weighted Gene Coexpression Network Analysis


Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science

DECIPHERING COMPLEX BIOLOGICAL PROCESSES USING GENE

COEXPRESSION NETWORKS

By

Lihe Liu

August 2020

Chair: Francisco Peñagaricano

Major: Animal Sciences

Coexpression network analysis is a powerful tool to reveal transcriptional

regulatory mechanisms, identify transcription factors, and discover gene functions. It

can also be used to investigate changes in coexpression patterns in response to

environmental insults or changes in experimental conditions. Maternal nutrition is

considered a major intrauterine regulator of fetal developmental programming. The

objective of this study was to investigate structural changes in gene coexpression

networks in the muscle of bull beef calves gestated under diets with or without

methionine supplementation. Both muscle transcriptome and methylome were

evaluated using next generation sequencing. Maternal methionine supplementation

significantly perturbed coexpression patterns in the offspring’s muscle. Indeed, we

found that neither the connection strength nor the connectivity pattern of six modules

(subnetworks) detected in the control diet was preserved in the methionine-rich diet.

Functional characterization revealed that some of the unpreserved modules are

implicated in myogenesis, adipogenesis, fibrogenesis, canonical Wnt/β-catenin

pathway, ribosome structure, rRNA binding and processing, mitochondrial activities,

ATP synthesis and NAD(P)H oxidoreductases, among other functions. The bisulfite


sequencing analysis showed that nearly 2% of all evaluated cytosines were differentially

methylated between maternal diets. Interestingly, there were significant differences in

the levels of gene body DNA methylation between preserved and unpreserved modules.

Overall, our findings provide evidence that maternal nutrition can significantly alter gene

coexpression patterns in the offspring, and changes in DNA methylation mediate some

of these perturbations.


CHAPTER 1 INTRODUCTION

Modern biological research approaches have experienced unprecedented

changes since the advent of high-throughput technologies (“omics” technologies),

such as genomics, epigenomics, transcriptomics, proteomics and metabolomics,

enabling the concurrent identification and quantification of a large number of

molecular entities within the organism. “Omics” spans an increasingly broad scope of

studies. This considerable amount of data provides innovative insights only if one is

armed with the essential analytical techniques and computing resources to integrate

and disentangle multiple layers of information. Systems biology aims to understand a

whole cellular system as a working unit, as opposed to investigating its elements as

isolated individuals; this perspective emphasizes holism over reductionism.

Systems biology approaches, e.g., multi-omics data integration, usually incorporate

mathematics, statistical modeling and computational techniques to mine vast amounts

of data. In biological research, perturbation experiments play a vital role in investigating

gene functions and regulatory pathways. The traditional differential expression (DE)

analysis tests genes individually; however, genes never work in isolation and rarely

have a solitary function. Coexpression network analysis is a ground-breaking tool to

reveal transcriptional regulatory mechanisms, identify transcription factors, and discover

gene functions, as well as characterize changes in global coexpression patterns in

response to external perturbations. Over-Representation Analysis (ORA) is a commonly

used test to determine whether known functional pathways or protein families are

over-represented (enriched) in a gene list obtained from an experiment. The combination of

coexpression network modeling and over-representation analysis greatly facilitates the

identification and functional interpretation of genes disturbed by treatments.
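
As a minimal sketch of the over-representation test described above, the following Python snippet computes a one-sided hypergeometric (Fisher's exact) p-value for a single functional term. The gene counts are invented for illustration and the function name `ora_p_value` is hypothetical; this is not the implementation used in this study.

```python
from math import comb

def ora_p_value(n_background, n_term, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn without replacement
    from n_background genes, of which n_term are annotated to the term."""
    upper = min(n_term, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_term, k) * comb(n_background - n_term, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Illustrative numbers only: 20 background genes, 5 annotated to the term,
# 4 genes in the list of interest, 3 of which carry the annotation.
p = ora_p_value(20, 5, 4, 3)
print(f"p-value = {p:.4f}")
```

A small p-value suggests that the term appears in the gene list more often than expected by chance; in practice, the test is repeated for many terms and the p-values are corrected for multiple testing.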

In this study, we evaluated the impact of maternal methionine supplementation

on gene coexpression patterns in the offspring’s muscle tissue. Both muscle

transcriptome and methylome were examined using next-generation sequencing

technologies. Changes in gene coexpression patterns between maternal control and

methionine-rich diets were evaluated using differential gene coexpression network

analysis. The disturbed (sub)networks were functionally characterized using over-

representation analysis. Additionally, we investigated the link between network

perturbation and DNA methylation. Our results provide evidence that maternal nutrition

can significantly perturb gene coexpression patterns in the offspring, and some of these

changes might be mediated by alterations in DNA methylation.


CHAPTER 2 LITERATURE REVIEW

High Throughput Technologies and Omics Data

Overview and History

In 1977, the “first-generation” DNA sequencing technology, the chain-termination

method, was developed by Frederick Sanger and colleagues, and it was

subsequently adopted as the predominant methodology in both laboratory applications and

commercial sequencing platforms. In 2001, this method was used to produce the initial

sequence of the human genome (Lander et al. 2001; Venter et al. 2001),

and the completed human genome sequence was released in 2004 (International Human

Genome Sequencing 2004). At the same time, the world witnessed the release of

genome sequences of several other model organisms (Mouse Genome Sequencing

Consortium 2002; Gibbs et al. 2004; Chimpanzee Sequencing and Analysis Consortium

2005). Despite this unprecedented progress, “first-generation” sequencing technology

had limitations in throughput and cost. Consequently, the development of more powerful

sequencing approaches was stimulated in order to improve speed and precision while

concurrently reducing cost and labor. For example, the 454 platform was launched by 454 Life

Sciences in 2005, and in 2006 Solexa released the Genome Analyzer and Agencourt developed

SOLiD (Sequencing by Oligonucleotide Ligation and Detection). These alternatives offered

improved precision and throughput; consequently, they became the

three most sought-after massively parallel sequencing systems of the NGS era. Afterward,

larger companies acquired these originators: Agencourt was acquired by

Applied Biosystems in 2006, 454 was purchased by Roche in 2007, and Solexa

was acquired by Illumina. Driven by the demand for cost-efficient and high-throughput

analysis approaches, technological advances have also been made in “omics” fields

other than genomics. For instance, the “expression array” was introduced in the late

1990s; soon after, refined array technologies proved capable of

measuring the expression levels of all protein-coding transcripts in a particular tissue.

In the early 2000s, the mapping of expression quantitative trait loci (eQTL, loci

controlling gene expression) was first accomplished using array technologies. Also, Nobel

Prize winners John Fenn and Koichi Tanaka developed two ionization techniques,

electrospray (Whitehouse et al. 1985) and MALDI (matrix-assisted laser desorption

ionization) mass spectrometry (Karas et al. 1989), which made it possible to

vaporize, ionize and analyze relatively large, non-volatile

biomolecules such as proteins and peptides. Gradually, numerous other “omics”

technologies were developed that are capable of interrogating entire pools of

transcripts, proteins, metabolites and the genome.

Empowered by such new technologies, “omics” studies are gaining ever-

increasing popularity. Here the term “omics” is a suffix suggesting the assessment of

the complete complement of a given level of biological information flow. In these

scenarios, high-throughput (HT) screening techniques are used to produce large

volumes of “omics” data and to build a system-level understanding of associations and

dependencies between molecular entities. “Omics” spans an increasingly broad

scope of studies, particularly genomics (the quantitative investigation of a complete set

of genetic sequences), transcriptomics (the measurement of all the messenger RNA

produced in a particular cell), proteomics (the study of the entire set of proteins,


focusing on protein structures, functions and abundance) and metabolomics

(metabolites and metabolic networks).

Gene Expression Profiling and Analysis Methods

Gene expression is the spatial and temporal pattern of a gene product. The

transcriptome is the complete collection of readouts of mRNA, or “transcripts”, present

in one or a group of cells. Transcriptomics, also known as gene expression profiling,

measures the expression level of transcripts in a specific cell and in a particular state.

Changes in gene regulation can induce phenotypic changes in animals by altering the

relationships among proteins and metabolites. DNA microarrays are a reliable

approach to assessing transcript abundance: nucleic acid probes are

covalently bound to microarray slides, and fluorescently labeled target sequences

are then hybridized and scanned. The signal intensities in the scanned images

then serve as a measure of expression (Mantione et al. 2014). As an

alternative to microarray technology, RNA-sequencing (RNA-Seq) does not require

species- or transcript-specific probes and can detect novel transcripts and genetic

variants, as well as quantify isoform-, exon-, allele- and

haplotype-specific expression.

Differential expression (DE) analysis (Kadarmideen et al. 2007) is a well-known

approach that compares the expression levels of individual genes

across two conditions and identifies genes showing statistically significant

differences, e.g., normal versus diseased or high-performance versus low-performance

animals. Transcriptomic studies can measure the effect of external perturbations and

compare different tissues at the gene expression level, thus providing novel insights into

the health, production traits and behaviors of animals. In addition, transcriptomic studies

also investigate interactions among genes by building global coexpression networks

that focus on the identification of coexpressed gene modules, where coexpression

suggests co-regulation (functional similarities) of genes involved in the same biological

processes.

Epigenetic Modifications and Analysis Methods

Originally defined as “the branch of biology which studies the causal interactions

between genes and their products, which bring the phenotype into being” (Waddington

1942), the term epigenetics now refers to changes in gene

expression or cellular phenotype that are stable and heritable but do not stem from

changes in the Watson-Crick base-pairing of DNA (Goldberg et al. 2007). Epigenetic

variation contributes to phenotypic variation through chromatin marks at the gene level,

which can affect local transcriptional potential and cause changes in the expression of

genes underlying the trait. So far, the most comprehensively characterized

chemical modification of chromatin is DNA methylation. This modification

is the covalent addition of a methyl group to the C5 position of the cytosine

pyrimidine ring, which converts cytosine to 5-methylcytosine (5meC)

and is catalyzed by DNA methyltransferases (DNMTs). This is a

common epigenetic mark found in eukaryotes and it often occurs in the CpG or CpHpG

(H = A, T, C) context. DNA methylation at gene regulatory regions, e.g.,

promoters, is generally considered to have a repressive effect on gene expression (Laird

2010).

A wide range of strategies is employed in various DNA methylation profiling

approaches. DNA methylation status can be determined locally or globally. In genome-wide

DNA methylation mapping, NGS technologies have greatly influenced the

strategies used, for example, 5mC-specific immunoprecipitation (IP), methylation-sensitive

restriction enzyme (MSRE) digestion, sodium bisulfite modification (SBM)

and third-generation sequencing (TGS) technologies (Marzese et al. 2015). Regardless of the

starting method, either microarray or NGS technologies can subsequently be applied to

evaluate DNA methylation status. Among these methods, the most effective approach

to assess methylation status is DNA bisulfite conversion combined with NGS, which

reaches single-base resolution and thus yields a comprehensive measurement of the

methylation of almost every cytosine in the genome (Marzese et al.

2015). Whole-genome bisulfite sequencing (WGBS) generates the most comprehensive

DNA methylation maps; however, it is very costly, and the subsequent data

analysis remains challenging. As such, different adaptations of WGBS have been

applied, for example, reduced representation bisulfite sequencing (RRBS) (Meissner et

al. 2005) and bisulfite sequencing of padlock probes (BSPP) (Diep et al. 2012).
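
To illustrate how bisulfite sequencing yields single-base methylation estimates: bisulfite treatment converts unmethylated cytosines so that they are read as thymine, whereas methylated cytosines are still read as cytosine. The methylation level at a site can therefore be estimated as the fraction of reads that still report a C. The sketch below uses hypothetical per-site read counts and an assumed minimum-coverage cutoff; it is not tied to any particular pipeline.

```python
def methylation_level(c_reads, t_reads, min_coverage=10):
    """Fraction of reads reporting C (methylated) at one cytosine.
    Returns None when coverage is below min_coverage (assumed cutoff)."""
    coverage = c_reads + t_reads
    if coverage < min_coverage:
        return None
    return c_reads / coverage

# Hypothetical read counts: (chromosome, position) -> (C reads, T reads)
sites = {
    ("chr1", 1050): (18, 2),
    ("chr1", 1077): (3, 27),
    ("chr1", 1102): (2, 1),   # too few reads to call
}

levels = {site: methylation_level(c, t) for site, (c, t) in sites.items()}
for site, level in levels.items():
    print(site, "low coverage" if level is None else f"{level:.2f}")
```

Comparing such per-site levels between treatment groups is the basis for calling differentially methylated cytosines.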

Other Types of Omics Data

Proteomics aims to characterize the entire collection of proteins in a cell, tissue

or living organism (Haoudi et al. 2006). Proteomics studies can be grouped into

three primary areas: (1) large-scale identification of proteins and their post-translational

modifications; (2) comparative study of protein levels and identification of

differentially expressed proteins across conditions, e.g., health and disease; and (3)

identification of protein-protein interactions using techniques such as the yeast two-hybrid

system or mass spectrometry (MS) (Pandey et al. 2000). At the center of

functional proteomics analysis is the identification of the components of a protein complex.

Note that it is often challenging to predict the function of a protein when only

homology information or even the three-dimensional structure is available (Pandey et al.

2000).

Metabolomics studies focus on characterizing the global metabolite profiles in a

cell. Typically, metabolites of all chemical classes, e.g., amino acids,

antibodies, aptamers, small biomolecules, etc., are examined (Rochfort 2005).

Metabolomics can significantly facilitate in vivo metabolic network modeling because

it provides not only a descriptive but also a quantitative data source of

intracellular metabolites (Suravajhala et al. 2016).

“Omics” spans an increasingly broad scope of studies, which includes general

topics, for example, genomics, transcriptomics, proteomics, and metabolomics, as well

as more specific topics in post-genomic biological research and modern medical

science. Such studies include, but are not limited to, pharmacogenomics (quantitative

study of the relationship between genetic factors and the host response to drugs),

physiomics (dynamics and functions of cell physiology), nutrigenomics (quantitative

study of the relationship between genetic factors and the organism’s response to diet),

phylogenomics (combining evolution and genomic information) and interactomics

(networks of molecular interactions) (Schneider et al. 2011).

Multi-Omics Data Integration and Systems Biology Methods

The advent of “omics” technologies has empowered biological sciences to

identify and quantify the molecular components of a cell at different layers with ever-increasing

speed and precision. “Omics” technologies and the associated analysis

techniques have significantly increased the speed and volume of information on cellular

components that we can acquire, enabling a comprehensive multi-level characterization

of molecular entities. However, it is not appropriate to assume that the trait we observe

is governed by a linear combination of the biological properties of cellular components.

What underlies the “trait system” is the collaboration of numerous interacting biological

entities at different levels, for example, genomics, transcriptomics, metabolomics and

proteomics, and even beyond. This calls for new predictive models and

computational approaches for investigating how the genome itself and various external

conditions may impact these functional associations and the observable behaviors of

the “trait system”. A holistic multi-omics integration approach could provide valuable

insights.

The term “systems biology” refers to a set of strategies and techniques that

endeavor to decipher complex biological processes by integrating cellular information

from different levels. Such an approach relies on observations of large

numbers of cell components and the appropriate modeling of the interactions among

them (Baitaluk et al. 2006; Bruggeman et al. 2007). This approach usually incorporates

mathematics, statistical modeling and computational techniques to integrate vast

amounts of data so as to disentangle complex biological processes and predict the

behaviors of cellular systems. From a multi-omics integration point of view, this

approach takes into consideration data from different sources and tries to link isolated

“omics” levels with functional annotation of genes and pathways; thus, all “omics” levels can be

related to cell phenotypes. For example, Suravajhala et al. (2016) proposed that “a

typical data integration process goes from genome → epigenome → transcriptome →

metabolome → proteome → phenotype or disease.”


Biological Networks

Biological Networks Overview

Systematic investigations of large groups of molecules, e.g., genes, proteins,

metabolites and their modifications, have elucidated the interactions and

regulation within complex molecular networks, thus helping to explain the functioning of complex

cellular processes. With prominent topological and hierarchical properties, these

complex biological networks are markedly different from random networks and can

potentially provide innovative insights into the fundamental mechanisms

underlying biological processes and disease pathologies. To date, several

kinds of biological networks have been extensively studied and applied, for

example, genetic interaction networks, protein-protein interaction networks,

transcription factor binding networks, metabolic interaction networks and protein

phosphorylation networks (Zhu et al. 2007). Network objects (variables/molecules) are referred to

as “nodes” and interactions as “edges”. Network edges are usually represented as

binary measurements (unweighted networks); however, different strengths can also be

assigned to edges (weighted networks). The directions of edges depend

on the logic of the network and usually imply causality. Undirected networks are often

used to characterize protein-protein and genetic interactions, while directed

networks are usually used to depict transcription factor binding networks,

metabolic interaction networks and protein phosphorylation.
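
The distinctions above (weighted versus unweighted, directed versus undirected) can be made concrete with a small sketch. The gene labels, edge strengths and cutoff below are invented for illustration only.

```python
genes = ["A", "B", "C", "D"]

# Weighted, undirected network: symmetric matrix of edge strengths in [0, 1].
weighted = [
    [0.0, 0.9, 0.2, 0.0],
    [0.9, 0.0, 0.7, 0.1],
    [0.2, 0.7, 0.0, 0.0],
    [0.0, 0.1, 0.0, 0.0],
]

# Unweighted network: keep an edge only if its strength passes a cutoff.
cutoff = 0.5  # arbitrary illustrative cutoff
unweighted = [[1 if w >= cutoff else 0 for w in row] for row in weighted]

# Directed network: adjacency list of source -> targets, e.g. a
# transcription factor A regulating B and C.
directed = {"A": ["B", "C"], "B": ["C"], "C": [], "D": []}

def is_symmetric(m):
    """An undirected network has a symmetric adjacency matrix."""
    n = len(m)
    return all(m[i][j] == m[j][i] for i in range(n) for j in range(n))

print(unweighted)
print(is_symmetric(weighted))
```

Note that thresholding discards the weak A-C and B-D edges entirely, which is one reason weighted networks are often preferred when edge strength carries information.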

Network topology is essential to understanding network architecture, and different

statistics can be used to characterize the topological properties of a node, such as

connectivity, betweenness, clustering coefficient, shortest path length, etc. Biological

networks are found to be “scale-free” rather than randomly organized, meaning that

the connectivity distribution, i.e., the distribution of the number of directly connected nodes,

follows a power law (Barabasi et al. 2003). As a result, the distribution of connectivity

is “fat-tailed”, and the nodes with high connectivity are termed

“hubs”. Specifically, a biological network contains a small number of hubs with many

connections and a large number of nodes that have relatively few connections. This

type of organization is highly robust; for example, the random loss of a single

non-hub node in a scale-free network causes less disruption than it would in a random

network (Zhu et al. 2007). Presumably, a hub component in a scale-free network

is relatively more involved in the process and therefore usually plays a vital role in the

interacting system, for example, a transcription factor in a gene-gene interaction

network.
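
As a small illustration of node connectivity and hubs, the sketch below counts the connections of each node in a hypothetical undirected network and tabulates the connectivity distribution. The edge list and the rule of calling the single most connected node a hub are assumptions made for this example.

```python
from collections import Counter

# Hypothetical undirected edge list; gene names are placeholders.
edges = [
    ("TF1", "g1"), ("TF1", "g2"), ("TF1", "g3"), ("TF1", "g4"),
    ("TF1", "g5"), ("g1", "g2"), ("g5", "g6"),
]

# Connectivity (degree): number of directly connected nodes.
connectivity = Counter()
for u, v in edges:
    connectivity[u] += 1
    connectivity[v] += 1

# Connectivity distribution: how many nodes have each degree.
distribution = Counter(connectivity.values())

# Call the most connected node a hub (a simple, assumed definition).
hub, hub_degree = connectivity.most_common(1)[0]

print(dict(connectivity))
print(sorted(distribution.items()))
print(hub, hub_degree)
```

Even in this toy network, most nodes have one or two connections while a single node has many, which is the qualitative pattern expected of a fat-tailed connectivity distribution.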

Gene Coexpression Network Analysis

With a large volume of gene expression data becoming available, gene networks

(GNs) have become a ground-breaking method for processing, interpreting and

predicting interactions between genes. Since genes never work in isolation and rarely

have a single role, network theory can greatly facilitate the modeling of complicated

cellular processes. A typical gene network analysis starts with gene measurements as

input, usually messenger RNA expression levels, which can either be obtained

from the NCBI Gene Expression Omnibus (GEO) (Barrett et al. 2013) or generated in a

given experiment. After data acquisition, the data may be pre-processed using different

computational (quality control) methods to improve the quality of the input data.

Then, the preferred computational algorithms (machine learning approaches) are

applied to the pre-processed datasets in order to infer one or multiple networks. Finally,

the obtained network is optimized and validated using a single network approach (focus

on mechanisms); alternatively, networks from different conditions can be compared

using a differential network approach (focus on changes), thus revealing network

topological differences found between tissues or induced by external

perturbations, e.g., disease or nutrition. Gene network construction using gene expression

levels as variables is intuitive and preferred since transcription is recognized as the

primary mechanism underlying gene expression regulation (Lappalainen et al. 2013).

The choice of a gene network reconstruction algorithm depends on the

logic underlying the gene network, and different algorithms yield networks

with different behavior. The main gene network inference approaches include, but are not

limited to: (i) information theory, (ii) Boolean networks, (iii) ordinary differential equations

(ODE), (iv) Bayesian and (v) neural models (Delgado et al. 2019).

Information theory-based networks, also known as coexpression networks, use

pairwise correlation coefficients, for example, Pearson, Spearman or Kendall coefficients,

as the primary information for inferring dependencies between genes. Other

types of pairwise similarity measures can also be applied in gene network

reconstruction, such as Euclidean distances or mutual information. These models have

clear advantages: they are intuitive, less computationally demanding and require

relatively few samples. As an example, 14 hub genes were successfully identified in the

regulation of postmenopausal osteoporosis (Wang et al. 2020).

In Boolean networks, different variables represent genes and Boolean binary

values are used to discretize expression levels, where silenced genes are annotated as

0 and activated genes are annotated as 1 (Melkman et al. 2018). Although the Boolean

network approach was proven useful (Orlando et al. 2008; Moignard et al. 2015; Simak

et al. 2017), its main limitation lies in the discretization procedure: gene expression is

essentially a continuous variable, and genes are rarely either fully activated or

fully silenced, so such discretization can result in substantial information loss. Also, Boolean

networks are not robust to noisy data (Maheshri et al. 2007).

Ordinary differential equation (ODE) methods use continuous instead of

binary variables to represent gene expression. They model the change in the

expression level of a gene as a function of the expression levels of other genes,

and they can also take external factors into account. This approach yields models with

higher precision and facilitates the quantitative dynamic modeling of gene regulation

relationships (Chai et al. 2014). ODE models utilize continuous-time variables with

constraints and prohibit negative values (Chai et al. 2014). One of the disadvantages of

the ODE approach is that only linear relationships or a few specific kinds of non-linear

relationships are taken into consideration (Voit 2000; Hecker et al. 2009), while

biological regulation often involves complex non-linear dynamics. Moreover, the

computational complexity of ODE models limits their ability to decipher large gene

networks and the value estimation for model parameters is challenging (Delgado et al.

2019).

Bayesian GNs combine Bayes' theorem with graph theory, making it feasible

to qualitatively characterize the properties of GNs

(Chai et al. 2014). Generally, Bayesian networks are directed acyclic graphs

(DAGs) that represent the joint distribution of the nodes as a set of local node

probability distributions (Kaderali et al. 2008). The main advantage of

Bayesian networks is the robustness of GN reconstruction, since they offer great

adaptability to combine various sources of data and prior knowledge. The Bayesian

network approach has been applied in numerous areas extending from evolutionary

development to medicine (Acerbi et al. 2014; Chekouo et al. 2015; Chudasama et al.

2018).

There are mainly two types of neural models: Artificial

Neural Networks (ANN) and Recurrent Neural Networks (RNN). The first model is

completely neural while the second model involves fuzzy logic (Biswas et al. 2016).

RNNs are capable of modeling non-linear and dynamic pairwise gene relationships

(Kordmahalleh et al. 2017). Neural models can accommodate continuous variables, and their

structure resembles the neural connections observed in

natural processes. For a particular gene, the model can consider it as the source of

regulatory impact and compute a weighted sum of all potential regulating genes. Also, a

scoring function is utilized to evaluate the results, for example, network performance

optimization and error minimization.

Unweighted and Weighted Gene Correlation Network Analysis

Correlation network approaches have been widely used in transcriptomics

studies. Typically, gene expression is quantified as the total amount of mRNA corresponding

to each gene. In most scenarios, gene coexpression networks are reconstructed at the gene

level; however, expression can also be measured at the probe level in microarray

applications. Gene expression data are usually pre-processed/normalized using one of

many available methods. As a result, the normalized gene expression measurements

are stored in an m × n dimensional matrix exprDatX whose ith

column xi corresponds to the expression profile of the ith gene (or probe) across the m samples.

The uth component of xi represents the measurement in the uth sample (u = 1, ..., m). The correlations

between gene expression profiles across samples indicate gene expression similarities.

Inspired by the “guilt-by-association” heuristic, the highly correlated expression profiles

of two genes may indicate their functional associations, e.g., corresponding to the same

protein complexes, involved in the same pathways or participating in the same

regulatory and signaling system (Eisen et al. 1998).

An unweighted coexpression network is established by applying cutoff values to

the absolute values of the correlation matrix. To construct an unweighted network, an n

× n dimensional matrix R = (𝑟𝑖𝑗) with pairwise correlations as entries is computed

using a correlation coefficient, e.g., Pearson’s correlation. By applying a threshold on

|𝑟𝑖𝑗|, the correlation matrix is converted to an adjacency matrix A = (𝑎𝑖𝑗), whose

entries are either 0 or 1. Finally, an unweighted network is established using the

pairwise connection information obtained from the adjacency matrix where two

genes i and j are either connected (𝑎𝑖𝑗 = 1) or disconnected (𝑎𝑖𝑗 = 0). Such a dichotomous

approach (hard thresholding) can result in information loss; hence,

approaches that preserve the continuous nature of gene expression

are preferred in some cases.
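The hard-thresholding procedure described above can be illustrated with a minimal Python sketch. The toy expression profiles, the helper names (pearson, unweighted_adjacency) and the cutoff of 0.5 are assumptions made purely for illustration; no significance test is applied to the correlations here.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def unweighted_adjacency(profiles, cutoff=0.5):
    """Hard thresholding: a_ij = 1 if |r_ij| >= cutoff, otherwise 0."""
    n = len(profiles)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if abs(pearson(profiles[i], profiles[j])) >= cutoff:
                adj[i][j] = adj[j][i] = 1  # the network is undirected
    return adj


# Hypothetical profiles: one list per gene, one value per sample.
profiles = [
    [1.0, 2.0, 3.0, 4.0, 5.0],  # gene 0
    [2.1, 3.9, 6.2, 8.0, 9.8],  # gene 1, rises with gene 0
    [5.0, 4.0, 3.0, 2.0, 1.0],  # gene 2, mirrors gene 0
    [3.0, 1.0, 4.0, 1.0, 3.0],  # gene 3, unrelated to the others
]

for row in unweighted_adjacency(profiles, cutoff=0.5):
    print(row)
```

In this toy example, genes 0, 1 and 2 end up fully interconnected (negative correlations count because the absolute value is thresholded), whereas gene 3 remains isolated.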

Weighted Gene Coexpression Network Analysis (WGCNA) is a commonly

applied unsupervised learning approach in transcriptomics studies, especially for

correlation-based biological networks (Zhang et al. 2005; Langfelder et al. 2008). A

weighted coexpression network is fully specified by a weighted adjacency matrix,

and it can adequately characterize pairwise gene relationships while preserving

the continuous nature of expression. Besides this, the weighted network approach

has other advantages: (i) weighted networks focus on clusters of genes (modules)

instead of single genes; this approach is more intuitive than testing genes individually

because genes rarely work in isolation, and it dramatically alleviates the

problem of multiple testing; (ii) the network is highly robust to the choice of soft threshold;

(iii) such networks enable a geometric interpretation of the relationships detected; (iv)

network statistics can enhance standard data-mining approaches and intuitively

characterize the network properties; and (v) a parsimonious parametrization is feasible

because weighted networks can be approximated by "factorizable"

networks, while sparse networks (unweighted networks) cannot. Additionally,

the WGCNA algorithm provides a set of intuitive and persuasive module preservation

statistics that can be used to statistically quantify the cross-condition preservation status

of a given module or subnetwork. Module preservation is an essential step of the differential

network analysis approach, which relies on module preservation statistics

because they facilitate statistical inference and the biological

interpretation of network topological changes.

Differential Network Analysis

In the scenario where samples are collected from two or more different

groups/conditions, a differential network analysis is applied and comparisons of global

gene coexpression profiles can be made. This approach compares the set of

interconnections among genes across conditions, thus revealing the changes in network

topology due to external perturbations. In our previous work (Louvandini et al. 2020), we

performed a differential network (unweighted) analysis in order to compare the process

of spermatogenesis in lambs exposed to either a maternal control diet or a maternal

gossypol-rich diet. A total of 145 genes that showed expression in the RNA-Seq data

and belong to the Gene Ontology (GO) term spermatogenesis (GO:0007283) were

evaluated. For each maternal diet, a correlation matrix R = (𝑟𝑖𝑗) with dimensions

145 × 145 was constructed using Pearson correlation coefficients. Each correlation

matrix was then translated into an adjacency matrix A = (𝑎𝑖𝑗), a 145 × 145 matrix with

entries either 0 or 1. Here, if |𝑟𝑖𝑗| ≥ 0.5 and P-value ≤ 0.05, then 𝑎𝑖𝑗 = 1; otherwise, 𝑎𝑖𝑗 = 0.

Finally, for each maternal diet, an unweighted network was constructed based on

the adjacency matrix, where two genes i and j were either connected (𝑎𝑖𝑗 = 1) or

disconnected (𝑎𝑖𝑗 = 0). The structure and topology of each network were evaluated

using node connectivity and node cluster coefficient (Langfelder et al. 2011). The

connectivity 𝑘𝑖 of gene 𝑖, defined as \(k_i = \sum_{j \neq i} a_{ij}\), measures the connection strength of

gene 𝑖 with all the other genes. The cluster coefficient 𝑐𝑖 of gene 𝑖 is defined as

\[
c_i = \frac{\sum_{j \neq i} \sum_{k \neq i} a_{ij} a_{jk} a_{ki}}{\left( \sum_{j \neq i} a_{ij} \right)^2 - \sum_{j \neq i} (a_{ij})^2}
\]

which represents the local density of a network with 0 ≤ 𝑐𝑖 ≤ 1.
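As a worked illustration of these two statistics, the following Python sketch computes the connectivity and the cluster coefficient of each node in a small unweighted network. The four-node adjacency matrix and the function names are hypothetical, and the sketch assumes a symmetric adjacency matrix with zeros on the diagonal; a node with fewer than two neighbors is assigned a coefficient of 0 here, which is a convention chosen for this example.

```python
def connectivity(adj, i):
    """k_i: sum of the adjacencies between node i and all other nodes."""
    return sum(adj[i][j] for j in range(len(adj)) if j != i)


def cluster_coefficient(adj, i):
    """c_i: closed paths through node i divided by the possible ones."""
    others = [j for j in range(len(adj)) if j != i]
    numerator = sum(
        adj[i][j] * adj[j][k] * adj[k][i] for j in others for k in others
    )
    k_i = sum(adj[i][j] for j in others)
    denominator = k_i ** 2 - sum(adj[i][j] ** 2 for j in others)
    # Fewer than two neighbors: no pair of neighbors can be connected.
    return numerator / denominator if denominator > 0 else 0.0


# Hypothetical network: nodes 0, 1 and 2 form a triangle; node 3 hangs off node 0.
TOY_ADJ = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]

for node in range(len(TOY_ADJ)):
    print(node, connectivity(TOY_ADJ, node), cluster_coefficient(TOY_ADJ, node))
```

Node 0 has three neighbors of which only one pair (nodes 1 and 2) is connected, so its coefficient is 1/3, while nodes 1 and 2 each have two mutually connected neighbors and reach the maximum of 1.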

Figure 2-1 shows the distribution of the node connectivity and node cluster

coefficient, two classical network preservation statistics, in each maternal diet.

Interestingly, the exposure to gossypol in utero and throughout lactation caused

noticeable changes in the coexpression patterns of genes directly implicated in

spermatogenesis. Indeed, the maternal gossypol-rich diet altered the spermatogenesis

network decreasing both gene connectivity and gene clustering. The gene coexpression

network analysis revealed that the maternal exposure to gossypol significantly altered

the coexpression patterns of spermatogenesis-related genes, suggesting that gossypol

might disrupt gene coregulation mechanisms.

Limitations of Correlation Networks and Conditional Independencies

Conventionally, gene correlation networks are obtained by learning the pairwise

correlations of the gene expression profiles. However, it should be noted that this

approach fails to distinguish direct associations (conditional dependence) from indirect

ones (marginal dependence only) based on the observed expression similarities. The resulting networks

lack regulatory relationships because the observed undirected relationships

can be explained by multiple regulatory mechanisms. We have no way to choose

between them from expression data alone using correlation coefficients (Markowetz et

al. 2007). However, some statistical methods based on the concept of conditional

independencies can be applied to reveal the most likely possibilities (Dawid 1979). Let

𝑋, 𝑌, 𝑍 be random variables with joint distribution 𝑃. If and only if

P(X = x, Y = y | Z = z) = P(X = x | Z = z) · P(Y = y | Z = z)

for all values x, y and z, we can conclude that ( 𝑋 ⊥ 𝑌 | 𝑍), which means that 𝑋 is conditionally independent of 𝑌

given 𝑍. Note that this can equivalently be written as

P(X = x | Y = y, Z = z) = P(X = x | Z = z)

which is a direct generalization of the independence condition for 𝑋 and 𝑌,

P(X = x, Y = y) = P(X = x) · P(Y = y).

The same conclusions hold if Z is a set of variables instead of a single variable.

This gives us a chance to reveal the true regulatory relationships

between two genes while controlling for the effect of other genes. Using this approach, we

can attempt to retrieve causal relationships among genes instead of obtaining sets of undirected

coexpression connections.
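For continuous expression data, one common way to put conditional independence into practice is the first-order partial correlation, which under the assumption of joint normality equals zero exactly when 𝑋 ⊥ 𝑌 | 𝑍. The Python sketch below is offered only as an illustration under that assumption; the toy profiles and function names are hypothetical. Genes X and Y are both driven by a common regulator Z, so they are marginally correlated, yet their partial correlation given Z vanishes.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    r_xy = pearson(x, y)
    r_xz = pearson(x, z)
    r_yz = pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical centered profiles across 8 samples.
z = [2, 2, 2, 2, -2, -2, -2, -2]    # shared regulator
x = [3, 3, 1, 1, -1, -1, -3, -3]    # z plus gene-specific variation
y = [3, 1, 3, 1, -1, -3, -1, -3]    # z plus independent gene-specific variation

print("marginal correlation:", pearson(x, y))
print("partial correlation given z:", partial_correlation(x, y, z))
```

A correlation network would connect X and Y here, whereas a method based on conditional independence would attribute their similarity to the shared regulator Z.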

Over-representation Analysis (ORA)

Fisher’s Exact Test and Hypergeometric Distribution

Named after its inventor Sir Ronald Fisher, Fisher's exact test is one of a class of

exact tests commonly used in the analysis of contingency tables, in which categorical data

are generated by classifying objects in two different ways. By applying such a test, the

significance level (e.g., P-value) of the association (contingency) between two types of

classifications can be calculated exactly instead of relying on an approximation.

In the scenario of functional characterization, Over-Representation Analysis

(ORA) (based on Fisher's exact test) is a broadly utilized analysis to decide whether

known cellular functions or biological processes are over-represented (enriched) in an

experimentally derived gene list, e.g., differentially expressed genes (DEGs), genes

located in an unpreserved coexpression module or genes flagged by significant SNPs

from whole-genome scans. Given a specific biological pathway, the P-value of

observing at least 𝑘 significant genes in the pathway can be calculated from the hypergeometric

distribution

\[
P\text{-value} = 1 - \sum_{i=0}^{k-1} \frac{\binom{S}{i} \binom{N-S}{m-i}}{\binom{N}{m}}
\]

where 𝑆 refers to the number of significant genes, N refers to the total number of genes

tested and m refers to the number of genes in the pathway/gene-set of interest.
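The hypergeometric P-value can be computed directly from its definition, as in the following Python sketch. The function name ora_p_value and the toy numbers are hypothetical and are not taken from EnrichKit or from any analysis reported here; exact integer arithmetic is used before the final division to limit rounding error for small P-values.

```python
from math import comb


def ora_p_value(k, S, N, m):
    """P(at least k of the m pathway genes are among the S significant genes).

    k: significant genes observed in the pathway
    S: total number of significant genes
    N: total number of genes tested
    m: number of genes in the pathway or gene-set
    """
    total = comb(N, m)
    # Lower tail: probability mass of observing 0, 1, ..., k-1 significant genes.
    lower_tail = sum(
        comb(S, i) * comb(N - S, m - i) for i in range(min(k, m + 1))
    )
    # Equivalent to 1 - lower_tail / total, evaluated on exact integers.
    return (total - lower_tail) / total


# Toy example: 10 genes tested, 4 significant, pathway of 3 genes, 2 of them significant.
print(ora_p_value(k=2, S=4, N=10, m=3))
```

In the toy example, 40 of the 120 possible pathways of size 3 contain at least two significant genes, which gives a P-value of 1/3.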

ORA Example

In our recent study (Sigdel et al. 2020), we applied ORA to a list of candidate

genes associated with cows’ ability to conceive under thermal stress conditions. In this

analysis, four different annotation databases were explored, including GO (Ashburner et

al. 2000), Medical Subject Headings (Nelson et al. 2004), InterPro (Mitchell et al. 2019)

and Reactome (Jassal et al. 2020). As a result, at least five different groups of gene‐sets

were identified; these functional terms are related to fertilization, development, heat

shock proteins, cellular response to oxidative stress, and calcium ion homeostasis.

Figure 2-2 shows the most relevant functional terms and biological pathways enriched

with thermotolerant genes.

EnrichKit R software

In order to facilitate ORA analyses and improve collaborations, an R software

package for ORA named EnrichKit (0.1.0) was built. This package serves as an

interface that enables the user to perform ORA using six annotation databases, namely,

GO (Ashburner et al. 2000), KEGG (Kanehisa et al. 2000), Medical Subject Headings

(Nelson et al. 2004), InterPro (Mitchell et al. 2019), Reactome (Jassal et al. 2020)

and Molecular Signatures Database (MSigDB) (Liberzon et al. 2011). Currently, only

Bos taurus (cow) is supported, but other organisms may be included in future releases.

This R package is hosted on GitHub (https://github.com/liulihe954/EnrichKit).

Figure 2-1. Spermatogenesis coexpression network analysis: distributions of node connectivity and node clustering coefficient between maternal diets. (a) Node connectivity; (b) Node clustering coefficient. Control diet is represented in blue while gossypol-rich cottonseed diet is represented in red. Treatment means are marked by the dashed vertical lines: the blue line is for the control diet and the red line is for the gossypol-rich cottonseed diet. Adapted from Louvandini et al. (2020).

Figure 2-2. Functional terms and pathways significantly enriched with genes associated with dairy cow conception under thermal‐stress conditions. Four gene annotation databases were analyzed: GO, Medical Subject Headings, InterPro and Reactome. The y‐axis displays the names and the total number of genes of each gene‐set. The black dots represent the significance of

enrichment (−log10 P‐value, Fisher’s exact test, top x‐axis) and the bars represent the number of significant genes in each functional term (bottom x‐axis). Adapted from Sigdel et al. (2020)

CHAPTER 3 DIFFERENTIAL NETWORK ANALYSIS OF BOVINE MUSCLE REVEALS CHANGES

IN GENE COEXPRESSION PATTERNS IN RESPONSE TO CHANGES IN MATERNAL NUTRITION

Introduction

Transcriptome analysis is an essential tool to uncover the molecular basis of

phenotypic variation. The advent of RNA sequencing has dramatically improved the

characterization and quantification of transcriptomes (Ozsolak et al. 2011). The most

common use of RNA sequencing is the identification of differentially expressed genes,

that is, genes that show differences in expression between conditions. However, genes

and gene products do not usually work in isolation, but they are connected in complex

networks. There is increasing interest in moving beyond differential expression and

examining transcriptional profiles in the context of molecular networks (Han 2008). It is

well-accepted that genes that are controlled by the same set of transcription factors or

are involved in the same biological processes tend to have similar expression profiles

(Allocco et al. 2004). This principle is known as guilt-by-association and represents the

basis for the reconstruction of gene networks using RNA sequencing data. These

networks, commonly called gene coexpression networks, are undirected graphs where

nodes correspond to genes and edges represent pairwise expression similarities.

Gene coexpression networks can be used for different purposes. One popular

application consists in the characterization of the topology of the reconstructed network

and examination of interesting nodes and coexpression structures. This single network

analysis focuses on the mechanisms allowing the identification of transcription factors

(hub genes), the functional annotation of unknown genes, i.e., the association of genes

of unknown function with well-described biological processes, and the detection of

transcriptional regulatory programs (van Dam et al. 2018). Another application consists

of evaluating gene coexpression networks across conditions. Here, the term

condition is very broad and can refer to different tissues, different developmental

stages, or even different treatments. This application, commonly known as differential

network analysis, focuses on determining changes in the topology of the networks

across conditions. For instance, it is possible to examine whether connections or

subnetworks defined under normal conditions (control group) are reproducible and

preserved in the testing group (treatment group) (Langfelder et al. 2011). In this

scenario, differences in the topology of these two networks would indicate that

coexpression patterns were significantly perturbed by the treatment. Note that

expression similarities (coexpression) hint at common regulatory mechanisms

(coregulation), and hence, changes in the network might indicate that the treatment has

disrupted coregulation mechanisms, functional links and biological processes.

Undoubtedly, this shift in focus from differentially expressed genes to differentially

connected genes provides more holistic insights about gene regulation.

It is well-documented that different intrauterine insults can induce permanent

changes to the structure, physiology, and metabolism of the offspring. This

phenomenon has been termed fetal programming and may have lasting or lifelong

consequences (Godfrey et al. 2001). Maternal nutrition is considered a major

intrauterine environmental factor and it is now known that maternal nutritional status

during pregnancy can induce remarkable effects on fetal development (Wu et al. 2004).

There is growing evidence that maternal nutrition can alter epigenetic marks of the fetal

genome, such as DNA methylation. Indeed, this link between maternal nutrition and

subsequent modification of the fetal epigenome, including changes in gene expression,

is one of the molecular mechanisms proposed to explain the phenomenon of fetal

programming (Burdge et al. 2010; Chmurzynska 2010).

The main objective of this study was to assess whether maternal nutrition can

disrupt gene coexpression patterns in the offspring. Maternal nutritional treatments

consisted of a control or a methionine-rich diet offered during the periconceptional and

early gestation periods. Both muscle transcriptome and methylome were evaluated in

the offspring using next generation sequencing. Note that DNA methylation depends on

the availability of methyl donors, such as methionine, and hence, we hypothesized that

maternal methionine supplementation could alter the fetal epigenome, which in turn

could induce significant changes in the topology of gene networks.

Materials and Methods

Ethics Statement

The University of Florida approved the procedures used in this study (IACUC

#2014408583).

Animals and Experimental Design

Beef cows from the University of Florida Range Cattle Research and Education

Center (Ona, Florida, US) were assigned to one of two nutritional treatments from days

-30 to +90 relative to the beginning of the breeding season. These treatments consisted

of a control diet based on limpograss hay supplemented with molasses and urea and a

methionine-rich diet equal to the control diet but supplemented with 10 g/head/day of

MetaSmart Liquid (Adisseo, Alpharetta, GA) providing 3.7 g/head/day of rumen-

protected methionine. Longissimus dorsi muscle samples were collected from 20 bull

calves, 10 per maternal diet, at one month of age. Maternal diets did not affect birth or

weaning weight but altered post weaning calf growth performance. Calves derived from

the methionine-rich diet had a greater average daily gain and feed conversion rate post

weaning.

RNA Extraction, Library Preparation and Sequencing

Total RNA was extracted using the Qiagen RNeasy Mini kit. RNA yield and

quality were evaluated using the Agilent 2100 Bioanalyzer (Agilent Technologies, Inc.).

RNA-sequencing libraries were prepared from 50 ng RNA samples using a poly(A)

capture method and then sequenced using Illumina’s HiSeq 3000 at the University of

Florida. A total of 19 muscle samples from 19 bull calves derived from 9 control and 10

methionine-rich maternal diets were successfully processed and sequenced, and hence

used for subsequent RNA-sequencing analyses. RNA-sequencing data can be

accessed through NCBI GEO under the accession number GSE116974.

RNA-Seq Quality Control and Mapping

The quality of the sequencing reads was evaluated using the software FastQC

(v0.11.7, Babraham Bioinformatics, UK). Adaptor removal and trimming were conducted

with Trim Galore (version 0.4.4, Babraham Bioinformatics, UK) using the following

parameters: --paired, --clip_R1 10, --clip_R2 10, --three_prime_clip_R1 10,

--three_prime_clip_R2 10, and --length 20. The resulting paired-end sequencing reads

were mapped to the latest bovine reference genome (ARS-UCD1.2) using the software

Hisat2 (v2.1.0) (Kim et al. 2015).

Read Counting, Processing and Normalization

The number of reads that mapped to each annotated gene in the bovine GTF file

(ARS-UCD1.2) was obtained using the Python script htseq-count (v0.6.1p1) with the

option intersection-nonempty (Anders et al. 2015). Both highly abundant genes

(myosins, n = 25 genes) and lowly expressed genes (read counts ≤ 5 in at least 9

biological replicates) were removed from the raw expression data and not included in

subsequent analyses. After data processing, read counts were normalized using the

trimmed mean of M-values (TMM) normalization method available in the R package

edgeR (v3.14) (Robinson et al. 2010).
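The gene filtering rule described above can be sketched in Python as follows. This is only an illustration of the stated criterion (read counts ≤ 5 in at least 9 biological replicates); the gene identifiers, the count table and the function names are hypothetical, and the subsequent TMM normalization performed with edgeR is not reproduced here.

```python
def is_lowly_expressed(counts, max_count=5, min_samples=9):
    """True if the gene has read counts <= max_count in at least min_samples samples."""
    n_low = sum(1 for c in counts if c <= max_count)
    return n_low >= min_samples


def filter_genes(count_table, excluded=()):
    """Drop genes listed in `excluded` and lowly expressed genes."""
    return {
        gene: counts
        for gene, counts in count_table.items()
        if gene not in excluded and not is_lowly_expressed(counts)
    }


# Hypothetical raw counts for 19 samples per gene.
count_table = {
    "geneA": [120] * 19,            # well expressed in every sample
    "geneB": [3] * 9 + [50] * 10,   # low in 9 samples, so removed
    "geneC": [3] * 8 + [50] * 11,   # low in only 8 samples, so kept
    "MYH1": [100000] * 19,          # stands in for a highly abundant myosin gene
}

kept = filter_genes(count_table, excluded={"MYH1"})
print(sorted(kept))
```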

Gene Coexpression Network Construction

Genes with high expression variance across samples (top 50%) were used for

network modeling. The R package WGCNA (v1.69) was used for network construction

(Zhang et al. 2005; Langfelder et al. 2008). First, an unsigned adjacency matrix was

constructed based on pairwise Pearson correlation coefficients using the function

adjacency. This adjacency matrix was then remodeled as a Topological Overlap Matrix

(TOM) using the function TOMsimilarity. The TOM-based dissimilarity matrix, simply

defined as 𝑑𝑖𝑠𝑠𝑇𝑂𝑀 = (1 – 𝑇𝑂𝑀), was used as a pairwise distance matrix for

hierarchical clustering. Note that genes in the same cluster share strong

interconnections, and they might define modules or subnetworks. Module detection was

performed by cutting the branches of the clustering dendrogram using the function

cutreeDynamic. Genes that could not be assigned to any module were considered as

background genes (grey module) and were not included in subsequent module

preservation analyses. For each module, the variable module eigengene (ME) was

calculated as the first principal component of the module expression data. Modules

detected in the dynamic cutting process were merged if their eigengene values were

highly correlated (Pearson correlation ≥ 0.8). The module membership was calculated

using the function signedKME, where the expression profile of each gene was

correlated with the module eigengene in order to quantify how connected a gene was to

a given module.
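To make the first steps of this workflow concrete, the following Python sketch computes an unsigned soft-threshold adjacency, the topological overlap and the TOM-based dissimilarity for a toy correlation matrix. It follows the published definitions (Zhang et al. 2005), namely 𝑎𝑖𝑗 = |𝑟𝑖𝑗|^β and TOM𝑖𝑗 = (𝑙𝑖𝑗 + 𝑎𝑖𝑗) / (min(𝑘𝑖, 𝑘𝑗) + 1 − 𝑎𝑖𝑗) with 𝑙𝑖𝑗 = ∑ 𝑎𝑖𝑢𝑎𝑢𝑗 over 𝑢 ≠ 𝑖, 𝑗, rather than the WGCNA source code. The function names, the toy matrix and the power β = 2 are assumptions chosen for illustration.

```python
def soft_adjacency(cor, beta):
    """Unsigned soft thresholding: a_ij = |r_ij| ** beta (diagonal set to 1)."""
    n = len(cor)
    return [
        [1.0 if i == j else abs(cor[i][j]) ** beta for j in range(n)]
        for i in range(n)
    ]


def tom_similarity(adj):
    """Topological overlap between every pair of nodes (unsigned network)."""
    n = len(adj)
    # Connectivity of each node, excluding its own diagonal entry.
    k = [sum(adj[i][u] for u in range(n) if u != i) for i in range(n)]
    tom = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Strength of the connections that i and j share through other nodes.
            shared = sum(
                adj[i][u] * adj[u][j] for u in range(n) if u != i and u != j
            )
            tom[i][j] = (shared + adj[i][j]) / (min(k[i], k[j]) + 1.0 - adj[i][j])
    return tom


# Hypothetical correlation matrix for three genes.
cor = [
    [1.0, 0.9, 0.8],
    [0.9, 1.0, 0.7],
    [0.8, 0.7, 1.0],
]

adj = soft_adjacency(cor, beta=2)
tom = tom_similarity(adj)
diss_tom = [[1.0 - value for value in row] for row in tom]  # clustering distance
print(tom[0][1], diss_tom[0][1])
```

Raising the correlations to a power shrinks weak correlations much more than strong ones, which is why a soft threshold retains the continuous information that hard thresholding discards.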

Module Preservation

The function modulePreservation implemented in the R package WGCNA was

used to evaluate the preservation of each module (subnetwork) across conditions, i.e.,

between maternal control and maternal methionine-rich diets. A total of 12 different

module preservation statistics were investigated using a permutation test with 2000

resamples. These 12 different preservation statistics were combined in two composite

preservation significance scores, namely 𝑍𝑑𝑒𝑛𝑠𝑖𝑡𝑦 and 𝑍𝑐𝑜𝑛𝑛𝑒𝑐𝑡𝑖𝑣𝑖𝑡𝑦. Finally, 𝑍𝑠𝑢𝑚𝑚𝑎𝑟𝑦 was

computed as the mean of 𝑍𝑑𝑒𝑛𝑠𝑖𝑡𝑦 and 𝑍𝑐𝑜𝑛𝑛𝑒𝑐𝑡𝑖𝑣𝑖𝑡𝑦, representing the general

preservation status of a given module across the two experimental conditions

(Langfelder et al. 2011). In addition, the alternative module-size-independent statistic

𝑚𝑒𝑑𝑖𝑎𝑛𝑅𝑎𝑛𝑘 was also calculated to assess module preservation. This rank-based

statistic is defined as the mean of observed median ranks for individual connectivity and

density preservation statistics in each module; thus, modules with low median rank

values are considered as preserved between conditions/treatments. Finally, following

Langfelder and colleagues, an unpreserved module was defined as a module with

𝑍𝑠𝑢𝑚𝑚𝑎𝑟𝑦 ≤ 2 and 𝑚𝑒𝑑𝑖𝑎𝑛𝑅𝑎𝑛𝑘 greater than half of the total number of the modules

detected (Langfelder et al. 2011).
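The decision rule stated above can be expressed compactly, as in the following Python sketch. The function names and the example values are hypothetical; in practice the composite Z scores and median ranks are taken from the output of modulePreservation, and the number of modules used for the rank cutoff is treated here as a plain input.

```python
def z_summary(z_density, z_connectivity):
    """Composite preservation score: mean of the two composite Z statistics."""
    return (z_density + z_connectivity) / 2.0


def is_unpreserved(z_sum, median_rank, n_modules):
    """Unpreserved: Zsummary <= 2 and medianRank above half the number of modules."""
    return z_sum <= 2 and median_rank > n_modules / 2


# Hypothetical modules: (name, Zdensity, Zconnectivity, medianRank).
modules = [
    ("moduleA", 1.0, 2.0, 10),
    ("moduleB", 1.0, 2.0, 3),
    ("moduleC", 14.0, 10.0, 10),
]

for name, z_dens, z_conn, rank in modules:
    z_sum = z_summary(z_dens, z_conn)
    print(name, z_sum, is_unpreserved(z_sum, rank, n_modules=13))
```

Only the first hypothetical module satisfies both criteria; the second has a low Zsummary but a favorable rank, and the third shows a high Zsummary.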

Module Characterization

The functional characterization of the unpreserved modules was performed using

a Fisher’s exact test, a test of proportions based on the cumulative hypergeometric

distribution. Genes within each unpreserved module were scrutinized using six different

databases, including Gene Ontology (GO) (Ashburner et al. 2000), KEGG (Kanehisa et

al. 2000), InterPro (Mitchell et al. 2019), Reactome (Jassal et al. 2020), Medical Subject

Headings (MeSH) (Nelson et al. 2004) and Molecular Signatures Database (MSigDB)

(Liberzon et al. 2011). The P-values of significant overrepresentation/enrichment of

biological pathways, molecular functions and functional terms were calculated using

Fisher’s exact test.

DNA Extraction, Library Preparation and Sequencing

Total DNA was extracted from muscle samples for whole-genome bisulfite

sequencing analysis. Extraction, library construction, bisulfite treatment and sequencing

were performed by Novogene Bioinformatics Technology Co., Ltd (Beijing, China).

Libraries were sequenced with Illumina’s HiSeq 3000 using 150-bp paired-end reads. A

total of 16 muscle samples from 16 bull calves derived from 7 control and 9 methionine-

rich maternal diets were successfully processed and sequenced, and hence used for

subsequent bisulfite-sequencing analyses. Whole-genome bisulfite sequencing data

can be accessed through NCBI GEO under the accession number GSE117194.

Bisulfite-Seq Quality Control and Mapping

The quality of the sequencing reads was evaluated using the software FastQC

(v0.11.7, Babraham Bioinformatics, UK). Adaptor removal and trimming were performed

when needed using the software Trim Galore (v0.4.4, Babraham Bioinformatics, UK).

After quality control and processing, the resulting paired-end sequencing reads were

aligned to the ARS-UCD1.2 bovine reference genome using the software Bismark

(v0.17.0, Babraham Bioinformatics, UK) (Krueger et al. 2011). The tool

deduplicate_bismark was used to remove duplicate read alignments. Methylation calls

were performed using Bismark methylation extractor (v0.17.0, Babraham

Bioinformatics) using the following parameters: --paired-end, --comprehensive,

--bedGraph, and --cytosine_report (Krueger et al. 2011).

Differentially Methylated Cytosines and Genes

Differential methylation between maternal diets was analyzed using a logistic

regression implemented in the R package methylKit (v1.0.0) (Akalin et al. 2012). Only

cytosines with read coverage greater than 8 in a CpG context were evaluated.

Differentially methylated cytosines were defined as those having methylation

percentage changes between treatments greater than 20% and q-values ≤ 0.10. The

software Rgmatch was used to match the cytosines to different gene features, such as

transcription start site, exons, introns, and upstream regions (Furio-Tari et al. 2016). In

addition, we defined methylation level as the ratio of differentially methylated cytosines

to all the cytosines evaluated in a given region. For each gene, we calculated the

methylation level for the regulatory region (transcription start site, promoter and

upstream region) and also inside the gene body (exons and introns).
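The methylation level defined above is a simple ratio, as the following Python sketch shows. The function names and the toy values are hypothetical; the differential methylation test itself (the logistic regression in the R package) is not reproduced, and its outputs (methylation difference and q-value per cytosine) are assumed to be available. The treatment of values exactly at the 20% boundary is also an assumption.

```python
def is_dmc(meth_diff_pct, q_value, min_diff=20.0, max_q=0.10):
    """Differentially methylated cytosine: |change| > min_diff (%) and q <= max_q."""
    return abs(meth_diff_pct) > min_diff and q_value <= max_q


def methylation_level(cytosines):
    """Ratio of differentially methylated cytosines to all evaluated cytosines.

    `cytosines` holds one (methylation difference in %, q-value) pair per
    evaluated cytosine in a region (regulatory region or gene body).
    Returns None when no cytosine was evaluated in the region.
    """
    if not cytosines:
        return None
    n_dmc = sum(1 for diff, q in cytosines if is_dmc(diff, q))
    return n_dmc / len(cytosines)


# Hypothetical gene body with four evaluated cytosines.
gene_body = [
    (25.0, 0.01),    # hypermethylated, significant
    (-30.0, 0.05),   # hypomethylated, significant
    (10.0, 0.001),   # significant but change too small
    (22.0, 0.50),    # large change but not significant
]

print(methylation_level(gene_body))
```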

Results

RNA-sequencing Analysis

The RNA-sequencing of the muscle transcriptome yielded about 50 million

paired-end reads per sample. Roughly 87% of the reads were mapped to the ARS-

UCD1.2 bovine genome assembly using the software Hisat2. After removing the

myosins (highly abundant genes in muscle) and lowly expressed genes (genes with 5 or

fewer read counts in at least 9 biological replicates), a total of 12,786 genes were

retained for the network analysis.

Network Construction and Module Identification

We first characterized the muscle transcriptome under normal conditions, i.e., no

nutritional treatment. As such, the inference of the gene coexpression network and

subsequent module identification was performed using only samples derived from the

maternal control diet. A total of 7,034 genes with high across-sample expression

variance were included in this network analysis. A soft-threshold power equal to 24 was

selected based on the scale-free topology fit index (R² ≥ 0.8; Figure 3-1A), resulting in a mean

connectivity equal to 31.45 (Figure 3-1B). A total of 147 preliminary modules were

detected using a dynamic tree cut process, and after merging highly correlated modules

(Pearson’s correlation ≥ 0.8; Figure 3-1C), a total of 14 modules, including the

background set (grey module) were retained for subsequent analysis (Figure 3-2A).

Module Preservation

After we characterized the muscle transcriptome in the control diet, we

investigated the impact of the nutritional treatment (maternal methionine

supplementation) on gene coexpression patterns. As such, we compared the structure

of coexpression networks between control and methionine diets to identify changes in

the topology. These changes were evaluated using a permutation test with 2000

iterations. Twelve module preservation statistics were calculated for each of the 14

modules previously identified in the control diet. A total of six modules were considered

as unpreserved (𝑍𝑠𝑢𝑚𝑚𝑎𝑟𝑦 ≤ 2), five modules were considered as preserved with weak

to moderate evidence of preservation (2 < 𝑍𝑠𝑢𝑚𝑚𝑎𝑟𝑦 ≤ 10), and finally two modules

showed strong evidence of preservation (𝑍𝑠𝑢𝑚𝑚𝑎𝑟𝑦 > 10) (Figure 3-3). Overall, the six

unpreserved modules were considered as gene coexpression modules or subnetworks

that were significantly perturbed by the maternal methionine diet.

Module Characterization

The six unpreserved modules were further investigated to reveal their functional

roles and gain insights about the biological processes that were impacted by maternal

methionine supplementation. This functional characterization was performed using a

Fisher’s exact test, a hypergeometric-based overrepresentation test commonly used to

evaluate 2×2 contingency tables. Six different biological databases were evaluated,

including Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG),

Reactome, InterPro, Medical Subject Headings (MeSH), and Molecular Signatures

Database (MSigDB). Figure 3-4 shows the functional characterization for

“antiquewhite2”, the most perturbed module. Interestingly, our analysis revealed that

genes in this unpreserved module are closely related to (i) myogenesis, adipogenesis,

and fibrogenesis, (ii) ribosome structure, (iii) rRNA binding and processing, (iv)

mitochondrial activities, (v) ATP synthesis, and (vi) NAD(P)H oxidoreductases.

Moreover, genes in the module “violet” are implicated in the regulation of the canonical

Wnt signaling pathway, a signal transduction pathway that is involved in different

embryonic processes, such as cell fate specification, cell proliferation, and cell

migration. The full list of significant functional terms, including term ID, term name, total

number of genes in the module, and Fisher’s P-value, is reported.
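
As an illustration of the overrepresentation test, the sketch below computes the one-sided Fisher's exact P-value as the upper tail of a hypergeometric distribution. The one-sided form, the function name, and the counts in the example are assumptions for illustration; the background gene set used in this study is not restated here.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """One-sided P-value P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: genes in the background set
    K: background genes annotated to the functional term
    n: genes in the module
    k: module genes annotated to the functional term
    """
    upper = min(n, K)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)


# Hypothetical counts: 10 background genes, 4 annotated, module of 3 genes,
# 2 of which carry the annotation.
p = enrichment_p_value(k=2, n=3, K=4, N=10)
```

This tail probability corresponds to the "greater" alternative of Fisher's exact test on the 2×2 table formed by module membership and term annotation.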

DNA Methylation Analysis

Whole-genome bisulfite sequencing produced roughly 350M paired-end reads

per sample. The software Bismark was used to map the reads to the ARS-UCD1.2

bovine genome assembly, yielding a 70% mapping rate. A total of 5,136,556 cytosines

(CpG context) were evaluated (read coverage ≥ 8), and 101,094 were identified as

differentially methylated between maternal diets (methylation change ≥ 20%, q-value ≤

0.10). Based on the ARS-UCD1.2 annotation file, cytosines were classified as (i) within

a gene (gene body: exons and introns), (ii) within the regulatory region (5.5 kb upstream of

the gene), or (iii) located in an intergenic region. As a result, we targeted a total of

25,491 genes annotated in the cow genome that had at least one evaluated cytosine

(either gene body or regulatory region), and 10,247 of the 25,491 had at least one

differentially methylated cytosine. Of interest, a total of 6,735 of the 7,034 genes used in

the network analysis had methylation data.
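
The filtering and classification rules above can be written compactly as follows. This is a minimal sketch under stated assumptions: the methylation change is treated as an absolute difference in either direction, the upstream window is taken relative to gene strand, only a single gene is considered, and all function names and coordinates are hypothetical.

```python
def passes_coverage(read_coverage, min_coverage=8):
    """Keep cytosines with read coverage of at least 8."""
    return read_coverage >= min_coverage


def is_differentially_methylated(meth_diff_pct, q_value, min_diff=20.0, max_q=0.10):
    """Methylation change >= 20% (assumed absolute) and q-value <= 0.10."""
    return abs(meth_diff_pct) >= min_diff and q_value <= max_q


def classify_cytosine(pos, gene_start, gene_end, strand, upstream=5500):
    """Classify a cytosine relative to one gene (strand handling is assumed)."""
    if gene_start <= pos <= gene_end:
        return "gene body"
    if strand == "+" and gene_start - upstream <= pos < gene_start:
        return "regulatory region"
    if strand == "-" and gene_end < pos <= gene_end + upstream:
        return "regulatory region"
    return "intergenic"


# Share of evaluated cytosines that were differentially methylated (from the text).
dmc_percent = 100 * 101_094 / 5_136_556
```

In a real annotation a cytosine must be checked against every gene, so a position labeled intergenic here is intergenic only with respect to the single gene supplied.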

DNA Methylation and Module Preservation

We investigated whether there were significant differences in DNA methylation between

genes in preserved and unpreserved modules. For each gene, we calculated the

methylation level as differentially methylated cytosines divided by all the cytosines

evaluated. We calculated the methylation level either for the gene body or the regulatory

region. Notably, the distribution of methylation level in the gene body was significantly

different in genes located in unpreserved modules (n = 1146) compared to genes in

preserved modules (n = 5589) (Kolmogorov-Smirnov test, P-value ≤ 0.01, Figure 3-5).

We observed the same result when only transcription factors were considered. On the other

hand, there was no difference in the distribution of methylation level in the regulatory

region between preserved and unpreserved modules, either for all genes or for

transcription factors alone.
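
The per-gene methylation level and the comparison of its distribution between two groups of genes can be sketched as follows. Only the two-sample Kolmogorov-Smirnov statistic is computed (the largest gap between the two empirical distribution functions); the P-value calculation is omitted. The function names and the toy values are hypothetical.

```python
from bisect import bisect_right


def methylation_level(n_differentially_methylated, n_evaluated):
    """Differentially methylated cytosines divided by all evaluated cytosines."""
    if n_evaluated == 0:
        raise ValueError("gene has no evaluated cytosines")
    return n_differentially_methylated / n_evaluated


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: maximum distance between the empirical CDFs."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical gene-body methylation levels for two groups of genes.
preserved = [methylation_level(d, n) for d, n in [(0, 10), (1, 10), (1, 20)]]
unpreserved = [methylation_level(d, n) for d, n in [(3, 10), (4, 10), (5, 10)]]
d_stat = ks_statistic(preserved, unpreserved)
```

The statistic ranges from zero (identical empirical distributions) to one (no overlap between the two samples).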

DNA Methylation and Network Properties

We also investigated the relationship between DNA methylation and three

different gene network properties, namely differential coexpression score, module

membership, and intramodular connectivity. Interestingly, for those genes located in

preserved modules, we found a negative relationship between methylation level and

intramodular connectivity, considering either the gene body (regression coefficient 𝛽 =

−0.23, P-value = 0.009, Figure 3-6) or the regulatory region (regression coefficient 𝛽 =

−0.10, P-value = 0.013, Figure 3-6). On the contrary, there was no relationship between

methylation level and intramodular connectivity for genes in unpreserved modules.

Moreover, a significant negative relationship was found between methylation level in the

regulatory region and module membership, but again only for genes in preserved

modules.
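
A simple linear regression of this kind can be sketched as below. Which variable served as the response is not restated here, so regressing the network property on methylation level is an assumption; the function name and the toy data are hypothetical, and no significance test is included.

```python
def ols_slope_intercept(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data: connectivity decreases as methylation level increases.
methylation = [0.0, 1.0, 2.0, 3.0]
connectivity_values = [3.0, 2.0, 1.0, 0.0]
slope, intercept = ols_slope_intercept(methylation, connectivity_values)
```

A negative slope, as in this toy example, corresponds to the direction of the relationship reported for genes in preserved modules.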

Discussion

Global coexpression network analysis provides a powerful approach to uncover

the molecular basis of phenotypic variation. Gene coexpression networks are typically

used to infer and annotate gene functions, prioritize candidate regulatory genes, and

reveal transcriptional regulatory mechanisms. Lately, there has been a greater

emphasis on the use of network analysis to elucidate the changes in gene expression

patterns in response to changes in experimental conditions or environmental insults.

The present study was specifically designed to reveal structural changes in gene

coexpression networks due to a maternal methionine-rich diet. Maternal nutrition

represents a major intrauterine environmental insult that can induce permanent changes

in the offspring. Here, we evaluated gene coexpression networks in the muscle of bull

beef calves gestated under a control or methionine-rich diet, we functionally

characterized the subnetworks altered by maternal methionine supplementation, and we

investigated the link between network perturbation and DNA methylation. Our results

provide evidence that maternal nutrition can significantly perturb gene coexpression

patterns in the offspring, and some of these changes might be mediated by alterations

in DNA methylation.

Maternal methionine supplementation significantly disturbed gene coexpression

patterns in the offspring’s muscle. In fact, following the methodology proposed by

Langfelder et al. (2011), we identified six modules or subnetworks that significantly

changed between experimental conditions. This permutation-based method basically

evaluates if the two most important aspects of module topology, namely density and

connectivity, are preserved between a reference condition and a test condition. Here,

we found that neither the connection strength nor the connectivity pattern of these six

subnetworks detected in the control diet was preserved in the methionine-rich diet.

There is growing evidence that certain intrauterine insults impact gene coexpression

patterns, which in turn may alter fetal developmental programming. For instance,

Deyssenroth et al. (2017) reported that alterations in gene coexpression networks in the

human placenta are associated with abnormal fetal growth and development. Lombardo

et al. (2018) showed that maternal immune activation via infection during pregnancy

disrupts fetal brain gene coexpression networks, and this disruption is associated with

an increased risk for autism spectrum disorder. Recently, we reported that exposure to

gossypol in utero and during lactation altered the development and gene expression of

the testicles, including a significant perturbation of coexpression patterns among

spermatogenesis-related genes (Louvandini et al. 2020). Overall, our findings provide

further evidence that intrauterine insults, such as diet, not only can change gene

expression but also alter coexpression patterns, which in turn suggests alterations in

coregulation mechanisms.

The functional characterization revealed that some of the unpreserved modules

are directly implicated in myogenesis, adipogenesis, and fibrogenesis. Notably, it is

well-documented that maternal nutrition alters fetal skeletal muscle development by

interfering with these three important processes (Du et al. 2010). For instance, Zhu and

collaborators have shown that nutrient deficiency in ruminants from early to mid-

gestation negatively impacts myogenesis, reducing muscle fiber number and also

muscle mass (Zhu et al. 2004; Zhu et al. 2006). In addition, Tong and collaborators

have reported that maternal overnutrition enhances adipogenesis in fetal skeletal

muscle (Tong et al. 2008; Tong et al. 2009). Similarly, Du et al. (2010) reported that

maternal undernutrition with supplementation of ruminal bypass protein from day 60 to

day 180 of gestation significantly affects adipogenesis, changing marbling scores in the

steer progeny. Moreover, pigs with reduced birth weight due to malnutrition in utero

have a greater content of collagen in their skeletal muscle (Karunaratne et al. 2005).

Note that myogenesis, adipogenesis and fibrogenesis are vital aspects of muscle

physiology, directly impacting lean muscle mass, marbling and also collagen content,

and hence, any disturbances in these processes may have long‐term consequences,

impacting muscle growth and meat quality.

We also found unpreserved subnetworks related to the regulation of the

canonical Wnt/β-catenin pathway. By acting through autocrine and/or paracrine

mechanisms, the Wnt family of secreted glycoproteins affects different aspects of cell

physiology, such as cell proliferation, cell differentiation or maintenance of precursor

cells (Novakofski 2004; Johnson & Rajamannan 2006). Interestingly, in skeletal muscle, β-catenin

regulates the expression of two transcription factors, namely PAX3 and GLI, which are

essential for skeletal myogenesis (Capdevila et al. 1998; Borycki et al. 2000). Indeed,

some studies have shown that blocking the β-catenin pathway reduces the total number

of myocytes (Pan et al. 2005; Yamanouchi et al. 2007). Shang and colleagues showed

that the upregulation of Wnt signaling promotes myogenesis, whereas downregulation

of this pathway promotes adipogenesis (Shang et al. 2007). Moreover, we identified

terms related to basic cell structures/functions, including ribosome structure, rRNA

binding and processing, mitochondrial activities, ATP synthesis and NAD(P)H

oxidoreductases. Notably, previous studies have shown that these cell activities can be

affected by maternal nutrient restrictions. For instance, Peñagaricano et al. (2014)

reported that maternal diets significantly impact functional terms closely related to the

ribosome in fetal muscle. Moreover, Mayeur et al. (2013) found that maternal

undernutrition induces placental mitochondrial abnormalities and reduced ATP levels in

rats. Similarly, Zhu et al. (2006) found that maternal nutrient restriction can

induce downregulation of key enzymes involved in mitochondrial function in the

offspring’s muscle.

The exact mechanisms by which maternal diet can affect gene coexpression

patterns in the offspring are not yet known. Here, we examined the hypothesis that

changes in DNA methylation cause changes in the topology of gene networks. Nearly

2% of all the evaluated cytosines in a CpG context were found to be differentially

methylated between maternal diets. Notably, we found significant differences in the

level of gene body methylation between genes in preserved modules versus genes in

unpreserved modules. The same trend was observed when only transcription factors

were considered. Although the function of DNA methylation in regulatory regions is well-

known, the role of DNA methylation within the gene is not yet well understood. Some

studies have suggested that DNA methylation in the gene body might be involved in the

regulation of alternative splicing (Shukla et al. 2011; Maunakea et al. 2013). Of special

interest, Saha et al. (2017) reported that the regulation of alternative splicing is

coordinated across functionally related genes. Therefore, changes in DNA methylation

might cause changes in isoform expression, which in turn can alter gene coexpression

patterns. Moreover, changes in DNA methylation were also associated with altered subnetwork properties.

Indeed, in preserved modules, there was a clear negative relationship between

methylation level and intramodular connectivity, i.e., higher methylation was associated with

lower connectivity and, by implication, lower gene activity. Notably, this functional relationship completely disappeared in

unpreserved modules. Overall, our findings suggest that maternal methionine

supplementation may induce changes in the offspring epigenome, such as changes in

DNA methylation, which in turn alter coexpression patterns and gene network

properties.

Overall, our study has shown that maternal nutrition levels during preconception

and early pregnancy can significantly impact gene coexpression patterns in the

offspring. Some of the perturbed gene functions are directly implicated in the

development of skeletal muscle, such as myogenesis, adipogenesis, and the Wnt/β-catenin

pathway. Notably, some of the changes in gene coexpression patterns are associated

with changes in DNA methylation. To the best of our knowledge, this is the first study

that investigates the link between maternal nutrition, DNA methylation and gene

coexpression networks. Our findings suggest that maternal nutrition perturbs gene

coexpression patterns, and these alterations are in part mediated by changes in the

epigenome.

Figure 3-1. Soft-thresholding powers and module detection. (A) Scale-free topology fitting index (y-axis) as a function of the soft-thresholding power (x-axis). (B) Mean connectivity (y-axis) as a function of the soft-thresholding power (x-axis). (C) Tree plot of initial module eigengenes.

Figure 3-2. Network visualization. (A) Gene cluster dendrogram in the maternal control diet. (B) Gene cluster dendrogram in the maternal methionine diet.

Figure 3-3. Permutation-based composite preservation statistics. (A) Summary statistic 𝑍𝑠𝑢𝑚𝑚𝑎𝑟𝑦 (y-axis) as a function of module size. Non-preserved modules were identified using 𝑍𝑠𝑢𝑚𝑚𝑎𝑟𝑦 ≤ 2. (B) Summary statistic 𝑚𝑒𝑑𝑖𝑎𝑛𝑅𝑎𝑛𝑘 (y-axis) as a function of module size. High median rank values suggest strong evidence of perturbation.

Figure 3-4. Functional characterization of non-preserved modules. Six gene annotation databases were analyzed: Gene Ontology (GO), KEGG, Medical Subject Headings (MeSH), InterPro, Reactome and Molecular Signatures Database. The y-axis displays the term ID and the total number of genes in each term. The black dots represent the significance of enrichment (-log10 P-value, Fisher’s exact test, top x-axis) and the bars represent the percentage of significant genes in each functional term (bottom x-axis).

Figure 3-5. Comparison of methylation levels between preserved and unpreserved modules. Methylation level was calculated as differentially methylated cytosines divided by all the cytosines evaluated.

Figure 3-6. Relationship between methylation level and intramodular connectivity. Methylation level was calculated as differentially methylated cytosines divided by all the cytosines evaluated. Regressions in preserved modules (red) were significant (P-value < 0.05).

CHAPTER 4 CONCLUSIONS

The era of systems biology is inspired by the fact that genes and proteins do not

work in isolation but are connected in intricate networks. Therefore, deciphering gene

coexpression patterns is crucial to understand the response of the system to external

perturbations. Indeed, differential network analysis allows us to determine changes in

the topology of the gene networks across conditions. Expression similarities hint at

common regulatory mechanisms, and hence, changes in gene networks might indicate

disruptions in coregulation mechanisms.

Our findings suggest that maternal nutrition status during preconception and

early pregnancy can significantly influence gene coexpression patterns in the offspring,

and some of these perturbations are mediated by changes in DNA methylation. Indeed,

our module preservation analysis revealed several changes in gene coexpression

patterns and network topology in response to maternal methionine supplementation.

Functional characterizations of the perturbed subnetworks revealed that these gene

modules are involved in processes directly implicated in the development and

physiology of skeletal muscle, such as myogenesis, adipogenesis, and the Wnt/β-catenin

pathway, as well as other basic cell structures/functions, including

ribosome structure, rRNA binding and processing, mitochondrial activities, ATP

synthesis and NAD(P)H oxidoreductases. Disturbances in these processes may have

long‐term consequences, impacting muscle growth and meat quality.

The exact mechanisms by which external perturbations affect gene coexpression

patterns are not yet known. Here, we found significant differences in the level of DNA

methylation between genes in preserved modules versus genes in unpreserved

modules. In addition, we revealed some linear relationships between network properties

and gene methylation levels. Overall, our results suggest that some of the changes in

gene coexpression patterns are mediated by changes in DNA methylation.

LIST OF REFERENCES

Acerbi E., Zelante T., Narang V. & Stella F. (2014) Gene network inference using continuous time Bayesian networks: a comparative study and application to Th17 cell differentiation. BMC Bioinformatics 15, 387.

Akalin A., Kormaksson M., Li S., Garrett-Bakelman F.E., Figueroa M.E., Melnick A., et al. (2012) methylKit: a comprehensive R package for the analysis of genome-wide DNA methylation profiles. Genome Biol 13, R87.

Allocco D.J., Kohane I.S. & Butte A.J. (2004) Quantifying the relationship between co-expression, co-regulation and gene function. BMC Bioinformatics 5, 18.

Anders S., Pyl P.T. & Huber W. (2015) HTSeq--a Python framework to work with high-throughput sequencing data. Bioinformatics 31, 166-9.

Ashburner M., Ball C.A., Blake J.A., Botstein D., Butler H., Cherry J.M., et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25, 25-9.

Baitaluk M., Qian X., Godbole S., Raval A., Ray A. & Gupta A. (2006) PathSys: integrating molecular interaction graphs for systems biology. BMC Bioinformatics 7, 55.

Barabasi A.L. & Bonabeau E. (2003) Scale-free networks. Sci Am 288, 60-9.

Barrett T., Wilhite S.E., Ledoux P., Evangelista C., Kim I.F., Tomashevsky M., et al. (2013) NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res 41, D991-5.

Biswas S. & Acharyya S. (2016) Neural model of gene regulatory network: a survey on supportive meta-heuristics. Theory Biosci 135, 1-19.

Borycki A.G., Brown A.M.C. & Emerson C.P. (2000) Shh and Wnt signaling pathways converge to control Gli gene activation in avian somites. Development 127, 2075-87.

Bruggeman F.J. & Westerhoff H.V. (2007) The nature of systems biology. Trends Microbiol 15, 45-50.

Burdge G.C. & Lillycrop K.A. (2010) Nutrition, epigenetics, and developmental plasticity: implications for understanding human disease. Annu Rev Nutr 30, 315-39.

Capdevila J., Tabin C. & Johnson R.L. (1998) Control of dorsoventral somite patterning by Wnt-1 and β-catenin. Developmental biology 193, 182-94.

Chai L.E., Loh S.K., Low S.T., Mohamad M.S., Deris S. & Zakaria Z. (2014) A review on the computational approaches for gene regulatory network construction. Comput Biol Med 48, 55-65.

Chekouo T., Stingo F.C., Doecke J.D. & Do K.A. (2015) miRNA-target gene regulatory networks: A Bayesian integrative approach to biomarker selection with application to kidney cancer. Biometrics 71, 428-38.

Chimpanzee Sequencing and Analysis Consortium (2005) Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437, 69-87.

Chmurzynska A. (2010) Fetal programming: link between early nutrition, DNA methylation, and complex diseases. Nutr Rev 68, 87-98.

Chudasama D., Bo V., Hall M., Anikin V., Jeyaneethi J., Gregory J., et al. (2018) Identification of cancer biomarkers of prognostic value using specific gene regulatory networks (GRN): a novel role of RAD51AP1 for ovarian and lung cancers. Carcinogenesis 39, 407-17.

Dawid A.P. (1979) Conditional independence in statistical theory. Journal of the Royal Statistical Society: Series B (Methodological) 41, 1-15.

Delgado F.M. & Gomez-Vela F. (2019) Computational methods for Gene Regulatory Networks reconstruction and analysis: A review. Artif Intell Med 95, 133-45.

Deyssenroth M.A., Peng S.E., Hao K., Lambertini L., Marsit C.J. & Chen J. (2017) Whole-transcriptome analysis delineates the human placenta gene network and its associations with fetal growth. BMC Genomics 18.

Diep D., Plongthongkum N., Gore A., Fung H.L., Shoemaker R. & Zhang K. (2012) Library-free methylation sequencing with bisulfite padlock probes. Nat Methods 9, 270-2.

Du M., Tong J., Zhao J., Underwood K.R., Zhu M., Ford S.P., et al. (2010) Fetal programming of skeletal muscle development in ruminant animals. Journal of Animal Science 88, E51-E60.

Eisen M.B., Spellman P.T., Brown P.O. & Botstein D. (1998) Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences 95, 14863-68.

Furio-Tari P., Conesa A. & Tarazona S. (2016) RGmatch: matching genomic regions to proximal genes in omics data integration. BMC Bioinformatics 17, 427.

Gibbs R.A., Weinstock G.M., Metzker M.L., Muzny D.M., Sodergren E.J., Scherer S., et al. (2004) Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 428, 493-520.

Godfrey K.M. & Barker D.J. (2001) Fetal programming and adult health. Public health nutrition 4, 611-24.

Goldberg A.D., Allis C.D. & Bernstein E. (2007) Epigenetics: a landscape takes shape. Cell 128, 635-8.

Han J.D. (2008) Understanding biological functions through molecular networks. Cell Research 18, 224-37.

Haoudi A. & Bensmail H. (2006) Bioinformatics and data mining in proteomics. Expert Rev Proteomics 3, 333-43.

Hecker M., Lambeck S., Toepfer S., van Someren E. & Guthke R. (2009) Gene regulatory network inference: data integration in dynamic models-a review. Biosystems 96, 86-103.

International Human Genome Sequencing Consortium (2004) Finishing the euchromatic sequence of the human genome. Nature 431, 931-45.

Jassal B., Matthews L., Viteri G., Gong C., Lorente P., Fabregat A., et al. (2020) The reactome pathway knowledgebase. Nucleic Acids Res 48, D498-D503.

Johnson M.L. & Rajamannan N. (2006) Diseases of Wnt signaling. Reviews in Endocrine & Metabolic Disorders 7, 41-49.

Kadarmideen H.N. & Reverter A. (2007) Combined genetic, genomic and transcriptomic methods in the analysis of animal traits. CABI review: perspectives in agriculture, veterinary science, nutrition and natural resources 2, 16.

Kaderali L. & Radde N. (2008) Inferring gene regulatory networks from expression data. In: Computational Intelligence in Bioinformatics (pp. 33-74). Springer.

Kanehisa M. & Goto S. (2000) KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28, 27-30.

Karas M., Ingendoh A., Bahr U. & Hillenkamp F. (1989) Ultraviolet–laser desorption/ionization mass spectrometry of femtomolar amounts of large proteins. Biomedical & environmental mass spectrometry 18, 841-43.

Karunaratne J.F., Ashton C.J. & Stickland N.C. (2005) Fetal programming of fat and collagen in porcine skeletal muscles. Journal of Anatomy 207, 763-68.

Kim D., Langmead B. & Salzberg S.L. (2015) HISAT: a fast spliced aligner with low memory requirements. Nat Methods 12, 357-60.

Kordmahalleh M.M., Sefidmazgi M.G., Harrison S.H. & Homaifar A. (2017) Identifying time-delayed gene regulatory networks via an evolvable hierarchical recurrent neural network. BioData mining 10, 29.

Krueger F. & Andrews S.R. (2011) Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics 27, 1571-2.

Laird P.W. (2010) Principles and challenges of genome-wide DNA methylation analysis. Nature Reviews Genetics 11, 191-203.

Lander E.S., Linton L.M., Birren B., Nusbaum C., Zody M.C., Baldwin J., et al. (2001) Initial sequencing and analysis of the human genome. Nature 409, 860-921.

Langfelder P. & Horvath S. (2008) WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9, 559.

Langfelder P., Luo R., Oldham M.C. & Horvath S. (2011) Is my network module preserved and reproducible? PLoS Comput Biol 7, e1001057.

Lappalainen T., Sammeth M., Friedlander M.R., t Hoen P.A., Monlong J., Rivas M.A., et al. (2013) Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501, 506-11.

Liberzon A., Subramanian A., Pinchback R., Thorvaldsdottir H., Tamayo P. & Mesirov J.P. (2011) Molecular signatures database (MSigDB) 3.0. Bioinformatics 27, 1739-40.

Lombardo M.V., Moon H.M., Su J., Palmer T.D., Courchesne E. & Pramparo T. (2018) Maternal immune activation dysregulation of the fetal brain transcriptome and relevance to the pathophysiology of autism spectrum disorder. Molecular Psychiatry 23, 1001-13.

Louvandini H., Correa P.S., Amorin R., Liu L., Ieda E.H., Jimenez C.R., et al. (2020) Gestational and lactational exposure to gossypol alters the testis transcriptome. BMC Genomics 21, 59.

Maheshri N. & O’Shea E.K. (2007) Living with noisy genes: how cells function reliably with inherent variability in gene expression. Annual review of biophysics and biomolecular structure 36.

Mantione K.J., Kream R.M., Kuzelova H., Ptacek R., Raboch J., Samuel J.M., et al. (2014) Comparing bioinformatic gene expression profiling methods: microarray and RNA-Seq. Med Sci Monit Basic Res 20, 138-42.

Markowetz F. & Spang R. (2007) Inferring cellular networks--a review. BMC Bioinformatics 8 Suppl 6, S5.

Marzese D.M. & Hoon D.S. (2015) Emerging technologies for studying DNA methylation for the molecular diagnosis of cancer. Expert Rev Mol Diagn 15, 647-64.

Maunakea A.K., Chepelev I., Cui K.R. & Zhao K.J. (2013) Intragenic DNA methylation modulates alternative splicing by recruiting MeCP2 to promote exon recognition. Cell Research 23, 1256-69.

Mayeur S., Lancel S., Theys N., Lukaszewski M.A., Duban-Deweer S., Bastide B., et al. (2013) Maternal calorie restriction modulates placental mitochondrial biogenesis and bioenergetic efficiency: putative involvement in fetoplacental growth defects in rats. American Journal of Physiology-Endocrinology and Metabolism 304, E14-E22.

Meissner A., Gnirke A., Bell G.W., Ramsahoye B., Lander E.S. & Jaenisch R. (2005) Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis. Nucleic Acids Res 33, 5868-77.

Melkman A.A., Cheng X., Ching W.K. & Akutsu T. (2018) Identifying a Probabilistic Boolean Threshold Network From Samples. IEEE Trans Neural Netw Learn Syst 29, 869-81.

Mitchell A.L., Attwood T.K., Babbitt P.C., Blum M., Bork P., Bridge A., et al. (2019) InterPro in 2019: improving coverage, classification and access to protein sequence annotations. Nucleic Acids Research 47, D351-D60.

Moignard V., Woodhouse S., Haghverdi L., Lilly A.J., Tanaka Y., Wilkinson A.C., et al. (2015) Decoding the regulatory network of early blood development from single-cell gene expression measurements. Nat Biotechnol 33, 269-76.

Mouse Genome Sequencing Consortium (2002) Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520-62.

Nelson S.J., Schopen M., Savage A.G., Schulman J.L. & Arluk N. (2004) The MeSH translation maintenance system: structure, interface design, and implementation. Stud Health Technol Inform 107, 67-9.

Novakofski J. (2004) Adipogenesis: Usefulness of in vitro and in vivo experimental models. Journal of Animal Science 82, 905-15.

Orlando D.A., Lin C.Y., Bernard A., Wang J.Y., Socolar J.E., Iversen E.S., et al. (2008) Global control of cell-cycle transcription by coupled CDK and network oscillators. Nature 453, 944-7.

Ozsolak F. & Milos P.M. (2011) RNA sequencing: advances, challenges and opportunities. Nat Rev Genet 12, 87-98.

Pan W.J., Jia Y.Y., Wang J.Y., Tao D.L., Gan X.Q., Tsiokas L., et al. (2005) beta-Catenin regulates myogenesis by relieving I-mfa-mediated suppression of myogenic regulatory factors in P19 cells. Proceedings of the National Academy of Sciences of the United States of America 102, 17378-83.

Pandey A. & Mann M. (2000) Proteomics to study genes and genomes. Nature 405, 837-46.

Peñagaricano F., Wang X., Rosa G.J.M., Radunz A.E. & Khatib H. (2014) Maternal nutrition induces gene expression changes in fetal muscle and adipose tissues in sheep. BMC Genomics 15, 1034.

Robinson M.D., McCarthy D.J. & Smyth G.K. (2010) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139-40.

Rochfort S. (2005) Metabolomics reviewed: a new "omics" platform technology for systems biology and implications for natural products research. J Nat Prod 68, 1813-20.

Saha A., Kim Y., Gewirtz A.D.H., Jo B., Gao C., McDowell I.C., et al. (2017) Co-expression networks reveal the tissue-specific regulation of transcription and splicing. Genome Research 27, 1843-58.

Schneider M.V. & Orchard S. (2011) Omics technologies, data and bioinformatics principles. In: Bioinformatics for Omics Data (pp. 3-30). Springer.

Shang Y., Zhang C., Wang S., Xiong F., Zhao C., Peng F., et al. (2007) Activated β-catenin induces myogenesis and inhibits adipogenesis in BM-derived mesenchymal stromal cells. Cytotherapy 9, 667-81.

Shukla S., Kavak E., Gregory M., Imashimizu M., Shutinoski B., Kashlev M., et al. (2011) CTCF-promoted RNA polymerase II pausing links DNA methylation to splicing. Nature 479, 74-9.

Sigdel A., Liu L., Abdollahi-Arpanahi R., Aguilar I. & Penagaricano F. (2020) Genetic dissection of reproductive performance of dairy cows under heat stress. Anim Genet.

Simak M., Yeang C.H. & Lu H.H. (2017) Exploring candidate biological functions by Boolean Function Networks for Saccharomyces cerevisiae. PLoS One 12, e0185475.

Suravajhala P., Kogelman L.J. & Kadarmideen H.N. (2016) Multi-omic data integration and analysis using systems genomics approaches: methods and applications in animal production, health and welfare. Genetics Selection Evolution 48, 38.

Tong J., Zhu M.J., Underwood K.R., Hess B.W., Ford S.P. & Du M. (2008) AMP-activated protein kinase and adipogenesis in sheep fetal skeletal muscle and 3T3-L1 cells. Journal of Animal Science 86, 1296-305.

Tong J.F., Yan X., Zhu M.J., Ford S.P., Nathanielsz P.W. & Du M. (2009) Maternal obesity downregulates myogenesis and β-catenin signaling in fetal skeletal muscle. American Journal of Physiology-Endocrinology and Metabolism 296, E917-E24.

van Dam S., Vosa U., van der Graaf A., Franke L. & de Magalhaes J.P. (2018) Gene co-expression analysis for functional classification and gene-disease predictions. Briefings in Bioinformatics 19, 575-92.

Venter J.C., Adams M.D., Myers E.W., Li P.W., Mural R.J., Sutton G.G., et al. (2001) The sequence of the human genome. Science 291, 1304-51.

Voit E.O. (2000) Computational analysis of biochemical systems: a practical guide for biochemists and molecular biologists. Cambridge University Press.

Waddington C.H. (1942) The epigenotype. Endeavour 1, 18-20.

Wang X.-L., Liu Y.-M., Zhang Z.-D., Wang S.-S., Du Y.-B. & Yin Z.-S. (2020) Utilizing benchmarked dataset and gene regulatory network to investigate hub genes in postmenopausal osteoporosis.

Whitehouse C.M., Dreyer R.N., Yamashita M. & Fenn J.B. (1985) Electrospray interface for liquid chromatographs and mass spectrometers. Anal Chem 57, 675-9.

Wu G., Bazer F.W., Cudd T.A., Meininger C.J. & Spencer T.E. (2004) Maternal nutrition and fetal development. The Journal of nutrition 134, 2169-72.

Yamanouchi K., Hosoyama T., Murakami Y. & Nishihara M. (2007) Myogenic and adipogenic properties of goat skeletal muscle stem cells. Journal of Reproduction and Development 53, 51-58.

Zhang B. & Horvath S. (2005) A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol 4, Article17.

Zhu M.J., Ford S.P., Means W.J., Hess B.W., Nathanielsz P.W. & Du M. (2006) Maternal nutrient restriction affects properties of skeletal muscle in offspring. Journal of Physiology-London 575, 241-50.

Zhu M.J., Ford S.P., Nathanielsz P.W. & Du M. (2004) Effect of maternal nutrient restriction in sheep on the development of fetal skeletal muscle. Biology of Reproduction 71, 1968-73.

Zhu X., Gerstein M. & Snyder M. (2007) Getting connected: analysis and principles of biological networks. Genes Dev 21, 1010-24.

BIOGRAPHICAL SKETCH

Lihe Liu was born and raised in Zibo, Shandong province, People’s Republic of

China. After he completed high school, he was admitted to Huazhong Agricultural

University (HZAU), where he earned his bachelor’s degree in animal sciences. In the fall

of 2016, he joined China Agricultural University (CAU) as an exchange student. After

one year of study, he came to the United States and joined the University of Florida

(UF) as an international exchange student in the fall of 2017. In the spring of 2018, he

joined Dr. Francisco Peñagaricano’s lab as an intern and, soon afterward, joined the lab

officially as a master’s student. At UF, he has been conducting research in the

animal science field using computational biology approaches. He has experience in

statistical computing and bioinformatic analysis.
