
TITLE OF THE PROJECT

“Development of NGS data analysis pipeline using

R-statistical package”

A PROJECT SUBMITTED TO DR. D.Y. PATIL VIDYAPEETH (DEEMED UNIVERSITY)

IN PARTIAL FULFILLMENT OF FOUR YEARS FULL-TIME DEGREE PROGRAMME

B. TECH BIOINFORMATICS

SUBMITTED BY

Ashish Singh Tomar

UNDER THE GUIDANCE OF

Dr. R. Srivatsan

Institute of Bioinformatics and Applied Biotechnology

Biotech Park, Electronics City Phase I,

Bangalore 560 100.

DR. D.Y. PATIL BIOTECHNOLOGY & BIOINFORMATICS INSTITUTE, TATHAWADE, PUNE – 33

(MAY 2012)


CERTIFICATE

This is to certify that Mr. Ashish Singh Tomar has prepared this project titled “Development of NGS data analysis pipeline using R-statistical package” under my guidance and to my satisfaction, in fulfillment of the requirement for the Bachelor's Degree in Bioinformatics.

Signature & Seal of Guide

Guided By

Dr. R. Srivatsan

Institute of Bioinformatics and Applied Biotechnology

Biotech Park, Electronics City Phase I,

Bangalore 560 100.

Director

Dr. D. Y. Patil Biotechnology & Bioinformatics Institute,

Tathawade, Pune - 33


ACKNOWLEDGMENT

First of all, I owe great thanks to my colleagues, who helped me understand different viewpoints and contributed their efforts in tackling problems. I would especially like to thank my guide, Dr. R. Srivatsan, for his impeccable support and guidance, without which this project would have been a daunting task. He took pains to go through our progress and made corrections and suggestions whenever needed.

I would like to thank IBAB for providing a sound working environment with a high-end server facility and for giving me the opportunity to be involved in a good project.

Finally, I am also thankful to the Dr. D. Y. Patil Biotechnology & Bioinformatics Institute, which sent me for the project work.


TABLE OF CONTENTS

CHAPTER I

1.1 INTRODUCTION
1.2 NEXT-GENERATION SEQUENCING TECHNOLOGIES
1.3 TYPES OF SEQUENCING METHODS
1.4 TYPES OF NEXT-GENERATION SEQUENCING TECHNOLOGIES
1.5 APPLICATIONS OF HIGH-THROUGHPUT SEQUENCING
1.6 ANALYSIS OF RNA SEQ DATA
1.7 R AND BIOCONDUCTOR

CHAPTER II

2.1 BACKGROUND
2.2 FILE FORMATS
2.3 ASSEMBLY
2.4 ASSEMBLY ALGORITHMS
2.5 MAPPING
2.3 DEFINITION OF TERMS

CHAPTER III

3.1 AIMS AND OBJECTIVES
3.2 METHODOLOGY

CHAPTER IV

4. RESULTS

CHAPTER V

5. CONCLUSIONS
6. REFERENCES


TABLE OF FIGURES

Fig. 1 Cost graph of different sequencing techniques
Fig. 2 Base calling
Fig. 3 Paired end sequencing
Fig. 4 Mate pair sequencing
Fig. 5 Overlap graph and de Bruijn graph
Fig. 6 Flow chart of pipeline
Fig. 7 Overall read quality
Fig. 8 Per-cycle quality score
Fig. 9 Read distribution
Fig. 10 Cycle-specific base calls and read quality
Fig. 11 Per-cycle read quality
Fig. 12 Histogram and weighted histogram of contig coverage
Fig. 13 Dinucleotide frequency


CHAPTER I

1.1 INTRODUCTION

NEXT-GENERATION SEQUENCING

Next-generation sequencing technologies are revolutionizing genomics, and their effects are increasingly widespread. Genome-wide sequencing has enabled biomedical research to discover a growing number of biomarkers in healthy as well as disease-affected cells and tissues. The high demand for low-cost sequencing has driven the development of high-throughput technologies that parallelize the sequencing process and produce thousands or millions of sequences at once, an approach known as massively parallel DNA sequencing.

Next-generation high-throughput DNA sequencing techniques are opening new opportunities in the life sciences. Novel fields and applications in biology and medicine are becoming a reality, far beyond the original goal of genome sequencing. Examples include personal genomics, with detailed analysis of individual genome stretches; precise analysis of RNA transcripts for gene expression, which surpasses and in several respects replaces analysis by microarray platforms; and precise analysis of DNA regions that interact with regulatory proteins in the functional regulation of gene expression (ChIP-seq). Next-generation sequencing technologies offer novel and rapid ways for genome-wide characterization and profiling of mRNAs, small RNAs, transcription factor binding regions, chromatin structure and DNA methylation patterns. In gene-expression studies, microarrays are now being replaced by sequencing-based methods, which can identify and quantify rare transcripts without prior knowledge of a particular gene and can provide information about alternative splicing and sequence variation in the identified genes.

The ability to sequence the whole genomes of many related organisms has allowed large-scale comparative and evolutionary studies that were unimaginable just a few years ago, for example metagenomics [1] and the HapMap project [2].

The broadest application of NGS is the resequencing of human genomes to enhance our understanding of how genetic differences affect health and disease and to characterize differences between individuals at the genomic level. Understanding how small changes in genomes give rise to different phenotypes will lead to the development of personalized and preventive medicine. The power of next-generation sequencing is increasingly exploited to re-sequence strains and individuals for which reference genome sequences are available, in order to understand genomic diversity. Such studies have identified mutations in bacterial strains, polymorphisms in worms, structural variation in the human genome and specific alleles involved in cancer. In addition to the analysis of genome sequences, NGS has paved the way for new assays and applications such as ChIP-seq, Tn-seq and RNA-seq, which will greatly advance our understanding of various phenomena at the genomic level. The principle behind these alternative applications, which have been termed ‘sequence census’ methods, is simple: complex DNA or RNA samples are sequenced directly to determine their content, without bacterial cloning as a prerequisite.

Given the vast and constantly increasing amount of data produced (currently more than a gigabase per run), developing a sound data storage and management solution and creating informatics tools to analyze the data effectively are essential to the successful application of the technology.

Next-generation sequencing technologies allow genomes to be sequenced more quickly and less expensively than previous techniques (Fig. 1) [3]. Next-generation sequencing has proven to be an extremely effective technology for molecular counting applications, where the number of sequence reads provides a digital readout for RNA-seq, ChIP-seq, Tn-seq and other applications. Biological pathways consist of complex networks of interacting genes that are responsible for the expression and regulation of other genes. It is therefore essential to determine quantitative genetic interactions on a genome-wide scale to reveal the hidden mechanisms of gene regulation in various diseases. In genome-wide annotation and analysis, the main challenge of genome assembly lies in the repetitive regions present in most mammalian genomes, which make it difficult to identify exons and regulatory regions. When a reference genome is available, short sequence reads are sufficient to map their locations (except in repeated regions), and once mapped, millions of sequence hits are simply counted to determine their genomic distribution.
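The counting step for mapped reads can be illustrated with a minimal Python sketch. It assumes the reads have already been mapped and are represented only by their 0-based start positions on one chromosome; the function name `count_reads_per_bin`, the bin size and the example positions are hypothetical and are not part of the pipeline described in this report.

```python
from collections import Counter


def count_reads_per_bin(positions, bin_size):
    """Count mapped read start positions in fixed-width genomic bins.

    positions: iterable of 0-based start coordinates (hypothetical input).
    bin_size:  width of each bin in bases.
    Returns a dict mapping bin index -> number of reads starting in that bin.
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    counts = Counter()
    for pos in positions:
        counts[pos // bin_size] += 1
    return dict(counts)


# Hypothetical start positions of reads mapped to a single chromosome.
mapped_starts = [5, 12, 98, 105, 110, 199, 250]
per_bin = count_reads_per_bin(mapped_starts, bin_size=100)
print(per_bin)  # {0: 3, 1: 3, 2: 1}
```

In a real analysis the bins would typically be replaced by annotated features such as genes or exons, but the principle of turning mapped positions into counts is the same.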


Fig. 1 Cost graph of different sequencing techniques.

Next-generation sequencing technologies are now being exploited to analyze not only static genomes but also dynamic transcriptomes, in an approach termed RNA-seq. With the advances in next-generation sequencing, RNA-seq has emerged as a powerful tool for transcriptome analysis. It has been widely applied to both well-studied model organisms and non-model organisms in order to determine their transcript profiles and to give insights into biological processes. For organisms with an unavailable or incomplete genome, a reference-based mapping strategy is not suitable. Thus, for organisms with unsequenced genomes, or for cancer cells with widespread chimeric RNAs, de novo assembly is essential for transcriptome analysis.

Next-generation sequencing has made it possible to generate massively parallel, high-resolution DNA sequence data. Its usefulness in genomic applications such as genome-wide detection of SNPs, DNA methylation profiling, mRNA expression profiling and whole-genome re-sequencing is now well recognized. SNPs and single-nucleotide insertions and deletions (INDELs) are detected by scanning the assembled contigs for positions where the underlying reads disagree significantly with the consensus base.
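The idea of scanning for positions where reads disagree with the consensus can be sketched as follows. This is an illustrative Python example only: it assumes gap-free alignments (so it finds substitutions, not INDELs), and it replaces a proper statistical test with two hypothetical thresholds, `min_depth` and `min_fraction`. The function name and the toy data are likewise assumptions.

```python
from collections import Counter


def candidate_variants(consensus, aligned_reads, min_depth=4, min_fraction=0.3):
    """Report positions where aligned reads disagree with the consensus base.

    aligned_reads: list of (start, sequence) pairs; start is a 0-based offset
    into the consensus and reads are assumed to align without gaps.
    Returns tuples (position, consensus_base, most_common_alt, mismatches, depth).
    """
    # Build a per-position tally of the bases observed in the reads.
    pileup = [Counter() for _ in consensus]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(consensus):
                pileup[pos][base] += 1

    variants = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to judge this position
        mismatches = depth - counts[consensus[pos]]
        if mismatches and mismatches / depth >= min_fraction:
            alt = max((b for b in counts if b != consensus[pos]), key=counts.get)
            variants.append((pos, consensus[pos], alt, mismatches, depth))
    return variants


# Toy consensus and reads (hypothetical data).
consensus = "ACGTACGT"
reads = [(0, "ACGTAC"), (1, "CGAACG"), (2, "GAACGT"), (0, "ACGAAC"), (3, "TACGT")]
print(candidate_variants(consensus, reads))  # [(3, 'T', 'A', 3, 5)]
```

Real variant callers also weigh base qualities and mapping qualities, which this sketch deliberately ignores.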


1.2 NEXT-GENERATION SEQUENCING TECHNOLOGIES:

Sequencing technologies include a number of steps that are broadly identified as template

preparation, sequencing, imaging and data analysis. The unique combination of specific

protocols distinguishes one technology from another and determines the type of data

produced from each platform.

Template preparation: In the first step, the DNA is sheared into small pieces, and the pieces are amplified by PCR. The amplified pieces are immobilized on a solid surface to form templates, so that millions of template DNA molecules can be sequenced rapidly at the same time. NGS technologies differ in their template preparation, using either clonally amplified or single-molecule templates. [1]

Sequencing and imaging: Templates are mostly either clonally amplified or single-molecule templates. Templates from either method are then processed for sequencing and imaging using Cyclic Reversible Termination (CRT), Sequencing By Ligation (SBL), Single Nucleotide Addition (SNA, also called pyrosequencing) or Real Time Sequencing (RTS). [1]

Widely Used Platforms:

1. Pyrosequencing by Roche Diagnostics

2. Sequencing By Ligation (SBL) or SOLiD sequencing by Applied Biosystems

3. Real Time Sequencing by Pacific Biosciences

BASE CALLING:

Base-calling usually refers to the conversion of intensity data into sequences and quality

scores. Intensity information is extracted from images by the image analysis.

Base-calling has two aspects: Identifying the base-call and assigning a confidence

estimate to the call.

1. Identifying the base-call: A base-call is usually made on the basis of the intensity estimates. Signal processing needs to correct for confounding factors:

Frequency cross-talk (optical detection mechanism)

Phasing effects (imperfect chemistry)

Signal decay

2. Assignment of a confidence estimate: Assigning a confidence estimate, or quality score, is vital for downstream analysis. The Phred method can be extended to next-generation technologies [4].

The figure below shows how base calls are made.

Fig. 2 Base calling

Although the data produced by different platforms are similar, large differences in accuracy and quality arise, which depend on the base-calling error probability expressed by the Phred score. These differences in data output should be carefully considered when comparing platforms on the basis of data quality, depth of sequencing, number of reads produced and cost.

Phred quality scores were originally developed by the program Phred to help in the

automation of DNA sequencing in the Human Genome Project. Phred quality scores are

assigned to each base call in automated sequencer traces. Phred quality scores have become

widely accepted to characterize the quality of DNA sequences, and can be used to compare

the efficacy of different sequencing methods. Perhaps the most important use of Phred

quality scores is the automatic determination of accurate, quality-based consensus

sequences.

Fig. 2 annotations: a base for which no Phred score could be calculated; and an example of a base that has been given a Phred score of 10, indicating a 90% probability that this base is correctly assigned.

PHRED QUALITY SCORES

A numeric Phred score represents the error probability of a given base call. When a nucleotide sequence is produced by sequencing, random error results in the possibility that any given base call may be incorrect. Thus, a quality score is provided for each base. The Phred score can be calculated from the error probability of a given base call:

Phred score = -10 × log10(error probability)

Error Probability    Phred Score
1                    0
0.1                  10
0.01                 20
0.001                30
0.0001               40

Phred Quality Table
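The relationship in the table can be checked with a short Python sketch. The function names `phred_from_error` and `error_from_phred` are illustrative choices and do not belong to any package used in this project.

```python
import math


def phred_from_error(error_probability):
    """Convert a base-call error probability (0 < p <= 1) to a Phred score."""
    if not 0 < error_probability <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(error_probability)


def error_from_phred(phred_score):
    """Convert a Phred score back to the corresponding error probability."""
    return 10 ** (-phred_score / 10)


# Reproduce the rows of the Phred quality table.
for p in (1, 0.1, 0.01, 0.001, 0.0001):
    print(p, round(phred_from_error(p)))
```

A Phred score of 10 therefore corresponds to an error probability of 0.1, that is, a 90% chance that the base call is correct.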

When quality scores are used to represent a long sequence (such as in a FASTQ file), they are usually encoded as ASCII characters by adding an offset to each score: 33 for standard Phred (Sanger) scores and 64 for Illumina scores (the Illumina pipeline produces Phred scores but uses a different ASCII offset). For example, a Phred score of 40 is represented by the ASCII character "I" (40 + 33 = ASCII 73), and an Illumina score of 40 by "h" (40 + 64 = ASCII 104) [12].
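The ASCII encoding described above can be sketched in a few lines of Python. The function names and the default offset argument are illustrative assumptions; the offsets 33 and 64 are the ones given in the text.

```python
def encode_quality(scores, offset=33):
    """Encode a list of Phred scores as an ASCII quality string."""
    return "".join(chr(score + offset) for score in scores)


def decode_quality(quality_string, offset=33):
    """Decode an ASCII quality string back into a list of Phred scores."""
    return [ord(char) - offset for char in quality_string]


print(encode_quality([40]))             # I  (40 + 33 = 73)
print(encode_quality([40], offset=64))  # h  (40 + 64 = 104)
print(decode_quality("I5+"))            # [40, 20, 10]
```

Decoding a quality string with the wrong offset shifts every score by 31, so the encoding must be known before any quality filtering is applied.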

PAIRED-END SEQUENCING

Paired-end sequencing is emerging as a key technique for assessing genome rearrangements

and structural variation on a genome-wide scale. Paired end sequencing is a simple

modification to the standard single-read DNA library preparation which facilitates reading

both the forward and reverse template strands. In addition to sequence information, both

reads contain long range positional information, allowing for highly precise alignment of

reads. This technique is particularly useful for detecting copy-neutral rearrangements, such

as inversions and translocations, which are common in cancer and can produce novel fusion

genes. Paired-end sequencing approach allows for a genome-wide survey of all potential

fusion genes and other rearrangements in a tumor.


Paired reads are invaluable for short-read data analysis: a large fraction of short reads are difficult to map uniquely to the genome, and the second read of a pair can be used to find the correct location (the first read is said to be ‘rescued’ by the second). [6]
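The 'rescue' of an ambiguously mapped read can be illustrated with a simplified Python sketch. It assumes that the first read has several candidate positions, that its mate maps uniquely, and that a valid pair must lie within a maximum insert size; the function name `rescue_read`, the 500-base limit and the coordinates are all hypothetical.

```python
def rescue_read(candidate_positions, mate_position, max_insert=500):
    """Pick the candidate position consistent with a uniquely mapped mate.

    Returns the single candidate lying within max_insert bases of the mate,
    or None when no candidate or more than one candidate is consistent.
    """
    consistent = [
        pos for pos in candidate_positions
        if abs(pos - mate_position) <= max_insert
    ]
    return consistent[0] if len(consistent) == 1 else None


# The first read maps equally well to three loci; its mate maps uniquely.
print(rescue_read([1000, 52000, 910000], mate_position=52300))  # 52000
```

Real aligners also check read orientation and score the observed insert size against the library's distribution, which this sketch leaves out.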

MATE PAIR SEQUENCING

Mate Pair Library Sequencing makes it possible to create libraries with inserts from 2 to 5

kb in size. DNA is fragmented into 2-5kb segments that are end-repaired with biotin labeled

dNTPs. The labeled fragments are circularized and then fragmented again into 400-600bp

pieces. Fragments with the biotin labels are enriched, end-repaired, and ligated with adapters

used for downstream processes. The final mate pair library consists of fragments made up of

two DNA segments that were originally separated by 2-5kb. The mate pair library is

hybridized and amplified onto a flow cell followed by paired-end sequencing.

These long-insert paired-end libraries are useful for a number of applications, including de novo sequencing, genome finishing and structural variant detection. Combining data from mate pair library sequencing with data from short-insert paired-end reads provides a powerful combination of read lengths for maximal sequencing coverage across the genome.

Mate pairs are also typically used to discover structural variants (SVs), regions of the genome that have undergone large-scale mutations such as inversions and large insertions and deletions (INDELs). Mate pairs are particularly relevant in genome assembly, especially for spanning repetitive sequences [5].

The figures below illustrate the steps in paired-end and mate pair sequencing. The difference between the two methods is that mate pair sequencing uses a specific type of library (biotin-labeled) and then follows the same steps as paired-end sequencing. Mate pair sequencing allows the reads of a pair to be much farther apart, which can be more informative than the standard paired-end protocol.


Fig. 3 Paired end sequencing

Fig. 4 Mate pair sequencing

1.3 TYPES OF SEQUENCING METHODS:

1.3.1. Pyrosequencing: This method of DNA sequencing is based on the “sequencing by synthesis” principle. Bases are identified from the intensity of light emitted when a complementary nucleotide is incorporated opposite the template. The reaction takes a single strand of the DNA to be sequenced and synthesizes its complementary strand enzymatically. DNA polymerase activity is detected through a chemiluminescent enzyme (luciferase), while apyrase degrades unincorporated nucleotides. Nucleotides are added one type at a time, and the method detects which base was actually incorporated. The template DNA is immobilized, and solutions of A, C, G and T nucleotides are sequentially added to and removed from the reaction. Light is produced only when the added nucleotide complements the first unpaired base of the template. The previous nucleotide is degraded before the next nucleotide is added, so that the next base can be identified from the resulting light intensity.


1.3.2. Sequencing by ligation:

This DNA sequencing method uses the enzyme DNA ligase to identify the nucleotide present at a given position in an unknown DNA sequence. The method relies on the sensitivity of DNA ligase to base-pairing mismatches. The DNA molecule to be sequenced is single-stranded DNA flanked by a known sequence, to which an oligonucleotide anchor hybridizes. Short oligonucleotide probes labeled with fluorescent dyes are then added, and DNA ligase joins a probe to the anchor only when its bases match the unknown DNA sequence. From the fluorescence produced by the ligated probe, one can infer the identity of the nucleotide at this position in the unknown sequence. The complex is then cleaved and the process is repeated.

1.3.3. Single molecule sequencing: Single molecule sequencing is a parallelized single-molecule DNA sequencing-by-synthesis technique. It uses a zero-mode waveguide (ZMW), an optical waveguide that guides light energy into a volume that is small in all dimensions compared to the wavelength of the light. A single DNA polymerase, with single-stranded DNA as its template, is immobilized at the bottom of each ZMW. The ZMW creates an illuminated observation volume small enough to observe a single nucleotide (base) being incorporated by the DNA polymerase. Each of the four DNA bases is attached to one of four different fluorescent dyes. When a nucleotide is incorporated by the DNA polymerase, the fluorescent tag is cleaved off and diffuses out of the observation volume of the ZMW, where its fluorescence is no longer observable. A detector records the fluorescent signal of the nucleotide incorporation, and the base call is made according to the fluorescence of the corresponding dye.

1.3.4. Nanopore DNA sequencing: A Nanopore is simply a small hole, of the order of 1

nanometer in internal diameter. Certain porous transmembrane cellular proteins act as

nanopores, and nanopores have also been made by etching a somewhat larger hole (several

tens of nanometers) in a piece of silicon, and then gradually filling it in using ion-beam

sculpting methods which results in a much smaller diameter hole: the nanopore. The theory

behind nanopore sequencing is that when a nanopore is immersed in a conducting fluid and

a potential (voltage) is applied across it, an electric current due to conduction of ions

through the nanopore can be observed. The amount of current is very sensitive to the size


and shape of the nanopore. If single nucleotides (bases), strands of DNA or other molecules

pass through or near the nanopore, this can create a characteristic change in the magnitude

of the current through the nanopore.

1.3.5. Semiconductor Sequencing: Ion Semiconductor Sequencing is a method of DNA

sequencing based on the detection of hydrogen ions that are released during the

polymerization of DNA. This is a method of "sequencing by synthesis", during which a

complementary strand is built based on the sequence of a template strand. A microwell

containing a template DNA strand to be sequenced is flooded with a single species of

deoxyribonucleotide triphosphate (dNTP). If the introduced dNTP is complementary to the

leading template nucleotide, it is incorporated into the growing complementary strand. This

causes the release of a hydrogen ion that triggers an ISFET (ion-sensitive field-effect

transistor) ion sensor, which indicates that a reaction has occurred. If homopolymer repeats

are present in the template sequence, multiple dNTP molecules will be incorporated in a

single cycle. This leads to a corresponding number of released hydrogen ions and a

proportionally higher electronic signal.
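The flow-based logic described above, in which one nucleotide species is supplied per flow and a homopolymer run gives a proportionally larger signal, can be simulated with an idealized Python sketch. It is noise-free and assumes the template is given in the order in which the polymerase reads it; the flow order "TACG" and the function names are illustrative assumptions, not instrument settings.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def simulate_flows(template, flow_order="TACG", max_flows=100):
    """Simulate idealized sequencing-by-synthesis flows.

    Each flow supplies one nucleotide species. The signal for a flow is the
    number of consecutive template bases that the nucleotide complements,
    so a homopolymer run of length n yields a signal of n in a single flow.
    Returns a list of (nucleotide, signal) pairs.
    """
    signals = []
    pos = 0
    flow = 0
    while pos < len(template) and flow < max_flows:
        nucleotide = flow_order[flow % len(flow_order)]
        incorporated = 0
        while pos < len(template) and COMPLEMENT[template[pos]] == nucleotide:
            incorporated += 1
            pos += 1
        signals.append((nucleotide, incorporated))
        flow += 1
    return signals


def call_bases(signals):
    """Reconstruct the synthesized strand from the per-flow signals."""
    return "".join(nucleotide * count for nucleotide, count in signals)


flows = simulate_flows("AAGTC")
print(flows)              # first flow is ('T', 2): two A's in a row on the template
print(call_bases(flows))  # TTCAG
```

In real data the signal is not an exact integer, which is why long homopolymer runs are a typical source of errors for flow-based platforms.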

1.4 TYPES OF NEXT-GENERATION SEQUENCING TECHNOLOGIES

1.4.1 RNA-seq: RNA-seq, also called “Whole Transcriptome Shotgun Sequencing”, is a revolutionary tool for transcriptomics. It refers to the use of high-throughput sequencing technologies to sequence cDNA in order to obtain information about a cell's RNA content. RNA sequencing has emerged as a powerful and cost-effective approach to transcriptome studies. De novo assembly of transcripts provides an important solution for transcriptome analysis in organisms with no reference genome. RNA-seq provides efficient ways to measure transcriptome data experimentally, allowing researchers to determine how different alleles of a gene are expressed, detect post-transcriptional mutations and identify gene fusions.

1.4.2 ChIP-seq: ChIP-seq, also known as ChIP-sequencing, is used to analyze protein interactions with DNA. It combines chromatin immunoprecipitation (ChIP) with massively parallel DNA sequencing to identify the binding sites of DNA-associated proteins. It can be used to map global binding sites precisely for any protein of interest, and to determine how transcription factors and other chromatin-associated proteins influence phenotype-affecting mechanisms. Determining how proteins interact with DNA to regulate gene expression is essential for understanding the mechanisms of biological processes and disease states. [10]

1.4.3 Bisulphite-seq: Bisulphite sequencing is the use of bisulphite treatment of DNA to determine its pattern of methylation. DNA methylation was the first epigenetic mark to be discovered and remains the most studied. In animals it predominantly involves the addition of a methyl group to the carbon-5 position of cytosine residues in the dinucleotide CpG, and it is implicated in the repression of transcriptional activity. [8]
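Bisulphite treatment converts unmethylated cytosine to uracil, which is read as thymine after amplification, while methylated cytosine is protected and is still read as cytosine. The Python sketch below illustrates this principle on a single strand; the function names, the example sequence and the choice of methylated position are hypothetical, and the sketch ignores the reverse strand and incomplete conversion.

```python
def bisulfite_convert(sequence, methylated_positions=frozenset()):
    """Simulate bisulphite conversion of one DNA strand.

    Unmethylated C is read as T after conversion; C at a position listed in
    methylated_positions (0-based) is protected and stays C.
    """
    return "".join(
        "T" if base == "C" and index not in methylated_positions else base
        for index, base in enumerate(sequence)
    )


def call_methylation(reference, converted_read):
    """Return reference positions whose C is still read as C (methylated)."""
    return [
        index
        for index, (ref_base, read_base) in enumerate(zip(reference, converted_read))
        if ref_base == "C" and read_base == "C"
    ]


reference = "ACGTCCGA"
read = bisulfite_convert(reference, methylated_positions={1})
print(read)                               # ACGTTTGA
print(call_methylation(reference, read))  # [1]
```

Comparing converted reads with the untreated reference in this way is what allows methylation to be read out at single-base resolution.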

1.4.4 Tn-seq: Tn-seq is used for accurately determining quantitative genetic interactions on

a genome-wide scale in microorganisms. Tn-seq is based on the assembly of a saturated

Mariner transposon insertion library. After library selection, changes in frequency of each

insertion mutant are determined by sequencing of the flanking regions. These changes are

used to calculate each mutant’s fitness. Due to the wide activity of the Mariner transposon,

Tn-seq has the potential to contribute to the exploration of complex pathways across many

different species [1].

1.5 APPLICATIONS OF HIGH-THROUGHPUT SEQUENCING

1.5.1. The 1000 Genomes Project: More genomes need to be sequenced to learn how

genotype correlates with phenotype. A project to sequence 1000 human genomes has been

prepared, which will allow creation of a reference standard for the analysis of human

genomic variations that is expected to contribute to studies of disease and how genotype

correlates with phenotype. [7]

1.5.2. Targeted sequencing: Sampling the whole genome is wasteful when only a particular genomic region is of interest. Targeted sequencing allows only those portions of the genome that are of interest to be sequenced. [17]

1.5.3. Human Microbiome Project: Also called the Second Human Genome Project, this project focuses on analyzing the collection of microbes in and on the human body, which will contribute to the understanding of human health and disease. Changes in microbial communities in the body have been linked to immune system function, obesity and cancer. In the future, each individual's microbiome could become a medical biometric. [18]

1.5.4. Metagenomics Project: The novel sequencing technologies will also be useful in microbial genomics, for example in metagenomics, which measures the genetic diversity encoded by microbial life inhabiting a common environment.

An important application is planned by the US DOE Joint Genome Institute (JGI), which will focus its sequencing efforts on new plant and microbial targets that may be of use in the development of alternative energies. [19]

The JGI plans to sequence the genome of a marine red alga, which may play an important environmental role in removing carbon dioxide from the atmosphere.

1.5.5. HapMap Project: This project aims to develop a haplotype map (HapMap) of the human genome, which will describe the common patterns of human genetic variation. It will serve as a resource for researchers seeking genetic variants that affect health, disease and responses to drugs and environmental factors. [20]

1.6 ANALYSIS OF RNA Seq DATA

An RNA-seq experiment produces very large data files. The data analysis involves a series of complex steps, from FASTQ quality inspection to GO annotation (described later), which together form a pipeline.

Analyzing high-throughput RNA-seq data requires high-end servers (CentOS) with large amounts of RAM and fast computation.

Many tools, both open source and commercial, exist for NGS data analysis. Commercial tools for next-generation sequencing include Avadis NGS by Strand [16], CLC bio Genomics Workbench [13], DNAnexus [14] and GenomeQuest [15]. Many universities and consortia worldwide have created online as well as downloadable open source tools for NGS data analysis.

Among the open source tools, R/Bioconductor based tools are very popular. As explained

below, R/Bioconductor provides a comprehensive framework consisting of thousands of


libraries and tools covering the entire spectrum of bioinformatics data analysis methods.

Detailed description and performance of these algorithms and libraries have been published

in open source journals. We developed a pipeline using R/Bioconductor libraries to analyze

data from RNA seq experiments.

1.7 R AND BIOCONDUCTOR

R is an open source scripting language and environment for statistical computing and graphics. R provides libraries for a wide variety of statistical and mathematical computations, such as linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, neural networks and many more. With its built-in graphics libraries, it is highly versatile and extensible. R is an open source environment supported by a large number of communities, which provide applications in fields such as mathematics, engineering, business mathematics, education and biology.

One of R's strengths is the ease with which well-designed publication-quality plots can be

produced, including mathematical symbols and formulae where needed. Great care has been

taken over the defaults for the minor design choices in graphics, but the user retains full

control.

R is an integrated suite of software facilities for data manipulation, calculation and graphical

display. It includes

an effective data handling and storage facility,

a suite of operators for calculations on arrays, in particular matrices,

a large, coherent, integrated collection of intermediate tools for data analysis,

graphical facilities for data analysis and display either on-screen or on hardcopy, and

a well-developed, simple and effective programming language which includes

conditionals, loops, user-defined recursive functions and input and output facilities.

The term "environment" is intended to characterize it as a fully planned and coherent

system, rather than an incremental accretion of very specific and inflexible tools, as

is frequently the case with other data analysis software.


R is designed around a true computer language, and it allows users to add additional

functionality by defining new functions. For computationally-intensive tasks, C, C++ and

FORTRAN code can be linked and called at run time. Advanced users can write C code to

manipulate R objects directly.

1.7.1 BIOCONDUCTOR:

Bioconductor is an open development project to which the global scientific community contributes. Within the R framework, developers create and add packages for specific applications, following package guidelines that make it easier for others to use and extend the software. Bioconductor [26,27] is an organized effort by the global biology community to provide libraries and tools within the R framework for the comprehensive analysis of data from bioinformatics experiments. Bioconductor uses the R statistical programming language and is open source and open development.

Bioconductor can import diverse sequence-related file types, including fasta, fastq, BAM,

gff, bed, and wig files, among others. Packages support common and advanced sequence

manipulation operations such as trimming, transformation, and alignment. Domain-specific

analyses include quality assessment, ChIP-seq, differential expression, RNA-seq, and other

approaches.

Bioconductor has extensive facilities for mapping between microarray probe, gene, pathway, gene ontology, homology and other annotations. Bioconductor has built-in representations of GO, KEGG, vendor and other annotations, and can easily access NCBI, BioMart, UCSC and other sources. Bioconductor libraries make extensive use of R graphics facilities to create the sophisticated plots required for NGS data display. The R/Bioconductor framework is therefore the natural choice as the development platform for our pipeline.

1.7.2 BIOCONDUCTOR PACKAGES USED IN THIS PIPELINE

Biostrings: The Biostrings package from Bioconductor provides an advanced environment for efficient sequence management and analysis in R. It contains many speed- and memory-efficient string containers, string matching algorithms and other utilities for fast manipulation of large sets of biological sequences. The objects and functions provided by Biostrings form the basis for many other sequence analysis packages [21].

ShortRead: The ShortRead package provides input, quality control, filtering, parsing, and

manipulation functionality for short read sequences produced by high throughput

sequencing technologies. While support is provided for many sequencing technologies, this

package is primarily focused on Solexa/Illumina reads [22].

GoSeq: GoSeq (Bioconductor package goseq) performs Gene Ontology (GO) analysis on RNA-seq data. It detects GO or other user-defined categories that are over- or under-represented in the data. All GO categories associated with a set of genes can be obtained using the relevant organism package. GO analysis is widely used to reduce complexity and highlight biological processes in genome-wide expression studies, but standard methods give biased results on RNA-seq data due to over-detection of differential expression for long and highly expressed transcripts. Applied to a prostate cancer data set, GoSeq dramatically changes the results, highlighting categories more consistent with the known biology [23].

SRAdb: High-throughput sequencing technologies have very rapidly become standard tools in biology. The data that these machines generate are large and extremely rich. As such, the Sequence Read Archive (SRA) has been set up to store these data in public repositories, in much the same spirit as microarray databases like NCBI GEO and EBI ArrayExpress. Accessing data in SRA requires finding it first, and this R package provides a convenient and powerful framework to do that. In addition, SRAdb features functionality to determine the availability of sequence files and to download files of interest [24].

biomaRt package: In recent years a huge number of biological databases have become available in public repositories. Easy access to these valuable data resources and firm integration with data analysis are needed for comprehensive bioinformatics data analysis. This package provides an interface to a growing collection of databases implementing the BioMart software suite. The package enables retrieval of large amounts of data in a uniform way without the need to know the underlying database schemas or write complex SQL queries. Examples of BioMart databases are Ensembl, UniProt and HapMap. These major databases give biomaRt users direct access to a diverse set of data and enable a wide range of powerful online queries from R. BioMart databases can contain several datasets; for Ensembl, every species is a different dataset [25].

CHAPTER II

2.1 BACKGROUND

Illumina NGS data from prostate cancer were analyzed using the R statistical package. Short reads from normal and cancerous prostate cells were retrieved from the NCBI SRA under accession numbers SRX022060, SRX022061, SRX022063, SRX022080, SRX022081 and SRX022083 [28]. These reads are in fastq format, in which every base call carries an assigned error probability (Phred score). The fastq files are aligned with Bowtie to produce SAM files, from which a counts file is generated. The counts file is then used as the input for differential expression analysis. This chapter reviews the file formats, assembly methods, assembly algorithms and mapping algorithms involved.

2.2. FILE FORMATS

2.2.1 FASTQ: FASTQ has emerged as a common file format for sharing sequencing read data, combining both the sequence and an associated per-base quality score. It is a text-based format for storing biological sequences obtained from NGS. Each nucleotide and each quality score is encoded with a single ASCII character. It has become the de facto standard format for storing the output of high-throughput sequencing instruments such as the Illumina Genome Analyzer.

A FASTQ file normally uses four lines per sequence. Line 1 begins with a '@' character and

is followed by a sequence identifier and an optional description (like a FASTA title line).

Line 2 is the raw sequence letters. Line 3 begins with a '+' character and is optionally

followed by the same sequence identifier (and any description) again. Line 4 encodes the

quality values for the sequence in Line 2, and must contain the same number of symbols as

letters in the sequence.

@HWUSI-EAS582_157:6:1:1:1501/1
NCACAGACACACACGAACACACAAAGACATGCCCATATGAAGAT
+
%.7786867:778556858746575058873/347777476035
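The four-line layout described above can be read with a few lines of code. The following is a minimal Python sketch, not part of the pipeline itself; it assumes the Phred+33 encoding (quality = ASCII code minus 33), which is an assumption about this file, and the function name `parse_fastq` is illustrative.

```python
def parse_fastq(text):
    """Yield (identifier, sequence, qualities) for each four-line FASTQ record."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # Assumed Phred+33 encoding: quality score = ASCII code - 33
        yield header[1:], seq, [ord(c) - 33 for c in qual]


record = """@HWUSI-EAS582_157:6:1:1:1501/1
NCACAGACACACACGAACACACAAAGACATGCCCATATGAAGAT
+
%.7786867:778556858746575058873/347777476035
"""
name, seq, quals = next(parse_fastq(record))
```

Under this assumption the first base (N) of the example read has a quality score of 4, which is consistent with an uncalled base.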

2.2.2 SAM: SAM (Sequence Alignment/Map) is a TAB-delimited text format consisting of an optional header section and an alignment section. If present, the header must precede the alignments. Header lines start with '@', while alignment lines do not. Each alignment line has 11 mandatory fields for essential alignment information such as mapping position, and a variable number of optional fields for flexible or aligner-specific information.
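As an illustration of the 11 mandatory fields, the sketch below splits one alignment line into named fields. It is a Python sketch under stated assumptions: the field names follow the SAM specification, while the example line (read name, reference name, values) is hypothetical.

```python
from collections import namedtuple

SAM_FIELDS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual"]
SamRecord = namedtuple("SamRecord", SAM_FIELDS + ["optional"])


def parse_sam_line(line):
    """Return a SamRecord for an alignment line, or None for a header line."""
    if line.startswith("@"):
        return None
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 11:
        raise ValueError("alignment line has fewer than 11 fields")
    mandatory = parts[:11]
    for idx in (1, 3, 4, 7, 8):  # flag, pos, mapq, pnext, tlen are integers
        mandatory[idx] = int(mandatory[idx])
    return SamRecord(*mandatory, optional=parts[11:])


# Hypothetical alignment line, for illustration only
example = "\t".join(["read1", "0", "ENST_EXAMPLE", "101", "255", "4M",
                     "*", "0", "0", "ACGT", "IIII", "NM:i:1"])
rec = parse_sam_line(example)
```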

2.2.3 BAM: BAM is a compressed binary version of the SAM format, a compact and indexable representation of nucleotide sequence alignments. For faster random access, a BAM file can be indexed, which produces a companion BAI (.bai) index file.

2.3. ASSEMBLY:

Once sequencing reads have been produced, it is necessary to align them in a coherent manner. The assembler detects reads that consistently align with each other, thus forming contiguous sequences known as contigs. The assembler then attempts to arrange all the contigs by their overlapping ends. Sets of contigs that can all be placed together in the same region are sometimes called supercontigs or scaffolds.

2.3.1. De novo ASSEMBLY:

De novo assembly means assembling short reads without any reference genome, using only the information contained in the reads themselves, i.e. the details of their overlaps. Algorithms use these overlaps to form contiguous sequences, which can then be mapped or aligned to a genome of interest to annotate the contigs. Various algorithms have been developed to link such overlapping reads.

2.3.2 Reference-based assembly:

A reference genome (also known as a reference assembly) is a digital nucleic acid sequence database, assembled by scientists as a representative example of a species' set of genes. As they are often assembled from the sequencing of DNA from a number of donors, reference genomes do not accurately represent the set of genes of any single individual. Instead, a reference provides a haploid mosaic of different DNA sequences from each donor.

Usually, a genome is chosen as the reference only if the similarity between it and the target genome is close to 100%. This restriction limits the applicability of comparative assembly. In our study we chose NCBI36 (hg18) as the reference; most of the reads aligned to it, but some were rejected.

2.4. ASSEMBLY ALGORITHMS:

There are two basic approaches in algorithms for short-read assemblers: overlap graphs and de Bruijn graphs.

2.4.1 OVERLAP GRAPH: Most assemblers that were developed for Sanger reads follow the overlap-layout-consensus paradigm. They compute all pairwise overlaps between reads and store this information as a graph. Each node in the graph corresponds to a read, and an edge denotes an overlap between two reads. The overlap graph is used to compute a layout of reads and the consensus sequence of contigs. This method works best when there is a limited number of reads with significant overlap. Some NGS assemblers use this technique, but it is computationally expensive because the large number of reads makes the overlap graph very large [Fig. 5] [11].
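The all-against-all comparison that makes overlap graphs expensive can be seen in a toy example. The following Python sketch is illustrative only: it uses exact suffix-prefix matches and a hypothetical minimum overlap of 3 bases, whereas real assemblers tolerate sequencing errors and use indexed data structures.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b (>= min_len), else 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def overlap_graph(reads, min_len=3):
    """Edges (read_a, read_b) -> overlap length, found by comparing every pair."""
    graph = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                n = overlap(a, b, min_len)
                if n:
                    graph[(a, b)] = n
    return graph


reads = ["ACGTAC", "GTACGG", "ACGGTT"]
graph = overlap_graph(reads)
```

The nested loop performs a comparison for every ordered pair of reads, so the work grows quadratically with the number of reads.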

2.4.2 de Bruijn GRAPH: As overlap graphs do not scale with an increasing number of reads, most NGS assemblers use de Bruijn graphs. De Bruijn graphs reduce the computational effort by breaking reads into smaller sequences of DNA, called k-mers, where k denotes the length in bases of these sequences. The de Bruijn graph captures overlaps of length k-1 between these k-mers and not between the actual reads. The maximum efficient k-mer size for a particular assembly is determined by the read length as well as the error rate. The value of the parameter k has a significant influence on the quality of the assembly. Good values can be estimated before assembly, but the optimal value is often best found by testing a small range of values. Another property of de Bruijn graphs is that repeats in the genome are collapsed in the graph and do not lead to many overlaps, although this does not mean that they can be more easily bridged or resolved [Fig. 5] [26].
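A toy construction makes the k-mer decomposition concrete. In this Python sketch (illustrative only, with hypothetical reads and k = 3), each k-mer contributes an edge from its first k-1 bases to its last k-1 bases; it omits the error correction and graph simplification that a real assembler performs.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # edge: prefix of length k-1 -> suffix of length k-1
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


graph = de_bruijn(["ACGTC", "CGTCA"], k=3)
```

Repeated edges (here CG to GT and GT to TC occur twice) record how often a k-mer was seen, which is the k-mer coverage.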

Fig. 5. Overlap graph and de Bruijn graph

2.5. Mapping:

Genome mapping is the assignment of a specific gene to a particular region of a chromosome and the determination of the location of, and relative distances between, genes on the chromosome. One of the most basic tasks in NGS analysis is the alignment of reads to either a reference genome or a transcriptome.

There are two major algorithmic approaches to mapping RNA-seq reads to a reference transcriptome. The first, which we collectively refer to as ‘unspliced read aligners’, aligns reads to a reference without allowing any large gaps. The unspliced read aligners fall into two main categories, ‘seed methods’ and ‘Burrows-Wheeler transform methods’.

2.5.1. Seed methods such as Mapping and Assembly with Quality (MAQ) and Stampy find matches for short subsequences, termed ‘seeds’, assuming that at least one seed in a read will perfectly match the reference. Each seed is used to narrow down candidate regions where more sensitive methods (such as Smith-Waterman) can be applied to extend seeds to full alignments [1].
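The seed idea can be sketched as follows. This Python example is a simplified illustration and not how MAQ or Stampy are implemented: the reference and read are hypothetical, seeds are non-overlapping pieces of the read, and the extension step uses a plain mismatch count in place of Smith-Waterman.

```python
from collections import defaultdict


def build_seed_index(reference, seed_len):
    """Index every substring of length seed_len in the reference by position."""
    index = defaultdict(list)
    for i in range(len(reference) - seed_len + 1):
        index[reference[i:i + seed_len]].append(i)
    return index


def map_read(read, reference, index, seed_len, max_mismatches=2):
    """Return sorted (position, mismatches) candidates confirmed by full comparison."""
    hits = {}
    for offset in range(0, len(read) - seed_len + 1, seed_len):
        seed = read[offset:offset + seed_len]
        for ref_pos in index.get(seed, []):
            start = ref_pos - offset  # where the whole read would begin
            if start < 0 or start + len(read) > len(reference) or start in hits:
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(1 for a, b in zip(read, window) if a != b)
            if mismatches <= max_mismatches:
                hits[start] = mismatches
    return sorted(hits.items())


reference = "TTGACCGTAAGCTAGGCTA"
read = "CCGTAAGGTAGG"  # differs from reference[4:16] at one base
index = build_seed_index(reference, seed_len=4)
result = map_read(read, reference, index, seed_len=4)
```

The read is still placed even though its middle seed contains the mismatch, because the first and last seeds match the reference exactly.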

2.5.2. In contrast, Burrows-Wheeler transform methods such as Burrows-Wheeler Aligner (BWA) and Bowtie compact the genome into a data structure that is very efficient when searching for perfect matches. When mismatches are allowed, the performance of Burrows-Wheeler transform methods decreases exponentially with the number of mismatches, as they iteratively perform perfect searches.

Unspliced read aligners are ideal for mapping reads against a reference cDNA database for quantification purposes. If the exact reference transcriptome is available, Burrows-Wheeler methods are faster than seed-based methods. In contrast, when only the reference transcriptome of a distant species is available, ‘seed methods’ can result in a large increase in sensitivity [1].
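The exact-match search on the transformed text can be illustrated on a toy string. The Python sketch below builds the Burrows-Wheeler transform by sorting rotations and counts exact occurrences by backward search; it is a teaching sketch with a hypothetical input, and it omits the compressed rank structures and mismatch handling that tools such as Bowtie rely on.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (toy inputs only)."""
    text += "$"  # end marker, sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)


def count_exact(pattern, bwt_text):
    """Count exact occurrences of pattern using backward search on the BWT."""
    first = {}  # first[c]: number of characters in the text smaller than c
    for i, c in enumerate(sorted(bwt_text)):
        first.setdefault(c, i)

    def occ(c, i):
        # occurrences of c in bwt_text[:i]; a real index precomputes this
        return bwt_text[:i].count(c)

    lo, hi = 0, len(bwt_text)
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + occ(c, lo)
        hi = first[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


transformed = bwt("ACGACGT")
```

Each pattern character narrows the interval [lo, hi) of matching rotations, so an exact search needs only as many steps as the pattern has bases.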

2.6 DEFINITION OF TERMS:

MPSS: Massively parallel sequencing encompasses several high-throughput approaches to DNA sequencing; it is also called next-generation sequencing (NGS) or second-generation sequencing.

Deep sequencing: Depth in DNA sequencing refers to the number of times a nucleotide is read during the sequencing process. Deep sequencing indicates that the total number of bases read is many times larger than the length of the sequence under study. The term "deep" has been used for a wide range of depths (>7x), and the newer term "ultra-deep" has appeared in the scientific literature to refer to even higher coverage (>100x).

Coverage: Coverage is the average number of reads representing a given nucleotide in the

reconstructed sequence.

Contigs: A contig is a contiguous sequence resulting from the reassembly of the small, overlapping DNA fragments generated by sequencing. The term can also refer to the overlapping clones that form a physical map of the genome that is used to guide sequencing and assembly. Contigs can thus refer both to overlapping DNA sequence and to overlapping physical segments (fragments) contained in clones, depending on the context.

Supercontigs: A supercontig, also known as a super or a scaffold, is the largest type of object in an assembly. A supercontig consists of one or more contigs bound together. The supercontig object includes information about the reads and contigs used to generate it, as well as quality scores for each base.

Scaffolding: The process of creating supercontigs from contigs is called scaffolding.

N50 Value: The N50 statistic is a measure of the average length of a set of sequences, with greater weight given to longer sequences. It is used widely in genome assembly, especially in reference to contig lengths within a draft assembly. Given a set of sequences of varying lengths, the N50 length is defined as the largest length N such that sequences of length L ≥ N together contain at least half of all bases in the set.
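A short worked example may help with the N50 definition. The Python sketch below assumes the common convention that N50 is the length of the sequence at which the running total, taken from the longest sequence downwards, first reaches half of all bases; the contig lengths are hypothetical.

```python
def n50(lengths):
    """Length at which the cumulative sum, longest first, reaches half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no sequences given")


# Hypothetical contig lengths: total 54 bases, and 10 + 9 + 8 = 27 reaches half
example_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```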

CHAPTER III

3.1 AIM AND OBJECTIVES

Next Generation Sequencing can generate huge amounts of DNA sequence reads, and the major challenge is to handle such large data sets efficiently. In this work we aim to develop a method that exploits all available information to accurately align as many spliced sequence reads as possible to the genome.

The data contain not only the DNA sequence of the read and the genome, but also quality information associated with the read and predictions about potential splice sites within the genome. The pipeline produces plots of read and contig statistics. We extend the analysis method to also benefit from the reads' quality scores: bad-quality base calls were removed by trimming the fastq files, which gave better alignment with genomic regions. The quality information helps to decide at which positions mismatches can be expected and thereby contributes to identifying the correct alignment.

We used R packages to apply powerful statistical methods for data processing in differential expression analysis, isoform analysis and small RNA profiling. We also analyzed the short reads to determine whether de novo assembly can be performed using RNA data. We designed a fully functional automated pipeline that uses Bioconductor libraries to analyze HTS data. The analysis can be carried out with various statistical methods such as the negative binomial model, Bayesian methods and the exact test. We assembled reads both de novo and by mapping to the genome. After de novo assembly we analyzed the contigs for biological mechanisms such as intron retention and alternative splicing. In the second method we mapped reads using Bowtie and aggregated the counts of reads that mapped uniquely to the genome in order to find differentially expressed genes.

The pipeline also annotates reads and provides information on which biological pathway they belong to and with which part of it they interact. The biomaRt package is used for annotation and for describing KEGG pathways. The flowchart of the pipeline is given in Fig. 6. We will now describe each component in detail.

FLOW CHART OF PIPELINE

Fig. 6. Flow chart of pipeline

Common steps: Fastq files; Q.S. analysis on short reads; Trimming low quality reads; Generating HTML report for reads.

De novo branch: De novo assembly: Velvet; Analyzing statistics of contigs file; Blastn using standalone blast; Comparing blast results of cancer & normal.

Reference-based branch: Reference based alignment using Bowtie; Generating expression file using SAM file; Performing DGE analysis; Analyzing GO & KEGG pathway.

3.2 METHODOLOGY

3.2.1. RETRIEVING FASTQ SEQUENCES

High-throughput sequencing technologies have very rapidly become standard tools in biology. The data that these machines generate are large and extremely rich. As such, the Sequence Read Archive (SRA) has been set up at NCBI to store these data. NCBI GEO also often contains aligned reads for sequencing experiments, and the SRAdb package can help to provide links to these data.

Command in R to get SRA files:

library(SRAdb)
# path to the SRA metadata SQLite file
sqlfile <- "SRAmetadb.sqlite"
sra_con <- dbConnect(SQLite(), sqlfile)
getFastq(in_acc = c("SRR000648", "SRR000657"), sra_con = sra_con, destdir = getwd())

3.2.2. FASTQ QUALITY INSPECTION USING ShortRead PACKAGE

Analysis of the short reads is necessary to decide the further strategy, i.e. whether to perform de novo assembly or mapping to a reference genome.

Analysis using the ShortRead package gives quality information; bases with a quality score below 20 are removed by trimming. It also reports read yield, base composition and the most common bases, and plots per-cycle quality.

3.2.3. DE-NOVO ASSEMBLY USING VELVET

The Velvet assembler is used for de novo assembly of sequenced DNA, but it can also be used for de novo assembly of transcriptomic sequence. De novo assembly of short sequence reads into transcripts makes it possible to reconstruct the sequences of the full transcriptome, identify and list all expressed genes, separate isoforms, and capture the expression levels of transcripts. Velvet was developed for de novo assembly of short reads and is applied here to short-read RNA-Seq data. It is generally used for assembly of bacterial genomes but is also capable of de novo assembly of mammalian genomes. Velvet constructs a de Bruijn graph from large amounts of short-read sequences, then uses an enumeration algorithm to score all possible paths and branches, and retains the plausible ones as transcripts/isoforms. Velvet is designed to recover paths supported by actual reads and to remove ambiguous/erroneous edges, which supports correct transcript reconstruction.

Command:

./velveth output_directory hash_length [[-file_format][-read_type] filename]
./velvetg output_directory [-cov_cutoff coverage_cutoff]

Hash length: 31

Input file: .fastq Output: contigs.fa

3.2.4. ANALYZING STATISTICS OF CONTIGS FILE:

Statistical analysis of the contigs file is necessary to assess the quality of the contigs produced by the de novo assembler, i.e. whether the contigs are of good length and good quality. It is an important step in de novo assembly because it indicates whether the contigs can be used for further analysis or whether the reads should instead be mapped to a reference genome.

The following plots were obtained: histogram, weighted histogram and dinucleotide frequency.

3.2.5. PERFORMING STANDALONE BLAST

After running Velvet and analyzing its output, we mapped the assembled segments (i.e. contigs) to the RefSeq database using standalone BLAST. We first downloaded the RefSeq fasta file from NCBI and formatted it for use as a database.

Command:

For building database:
makeblastdb -in <fasta_file> -dbtype nucl -out <output_db_filename>
Input file: .fasta Output file: index file (.ewt)

For performing BLAST:
blastn -query <fasta_file> -db <database_name> -out <output_file>
Input file: .fasta Output file: text file

We performed blastn using NCBI RefSeq as the database, with parameters set to report top hits, query name, matched sequence length, e-value and percent identity. To enforce strict matching we set the minimum percent identity to 80%. The input was the contigs file (contigs.fa) obtained from Velvet.

After performing blastn for both normal and cancer contigs, we matched the two output text files with each other on the basis of the mapped segment id.

3.2.6. COMPARING BLAST RESULTS TO FIND INTRON RETENTION

After obtaining the mapped files for normal and cancer prostate, we extracted the segments with identical mapped RefSeq ids. We manually compared the normal and cancer contigs against the sequence of the shared RefSeq id and found mapping differences between normal and cancer prostate contigs.
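The matching of normal and cancer hits by RefSeq id can be sketched in a few lines. This Python example assumes that the BLAST results were written in the 12-column tabular layout (query id, subject id, percent identity, ..., e-value, bit score); that output format, the function names and all ids below are assumptions made for illustration, not the pipeline's actual script.

```python
def best_hits(tabular_text, min_identity=80.0):
    """Map each query (contig) to its best subject id among hits >= min_identity."""
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        query, subject = cols[0], cols[1]
        identity, bitscore = float(cols[2]), float(cols[11])
        if identity < min_identity:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {q: s for q, (s, _) in best.items()}


def shared_subjects(normal_hits, cancer_hits):
    """Group contigs from both samples by the subject id they have in common."""
    shared = {}
    for subject in set(normal_hits.values()) & set(cancer_hits.values()):
        shared[subject] = (
            sorted(q for q, s in normal_hits.items() if s == subject),
            sorted(q for q, s in cancer_hits.items() if s == subject),
        )
    return shared


# Hypothetical tabular BLAST rows, for illustration only
normal = "\n".join("\t".join(row) for row in [
    ["n_contig1", "REF_A", "98.5", "200", "3", "0", "1", "200", "501", "700", "1e-90", "350"],
    ["n_contig2", "REF_B", "75.0", "150", "37", "0", "1", "150", "10", "159", "1e-10", "60"],
])
cancer = "\t".join(
    ["c_contig7", "REF_A", "97.0", "260", "8", "0", "1", "260", "480", "739", "1e-110", "420"])
shared = shared_subjects(best_hits(normal), best_hits(cancer))
```

Contig pairs that share a subject id but cover different stretches of it are the candidates that were then inspected manually.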

3.2.7. PERFORMING MAPPING USING BOWTIE

In a second strategy we mapped the fastq files of normal and cancer prostate using Bowtie, i.e. we aligned the short reads against NCBI36 cDNA as the reference. Only 2 mismatches were allowed when mapping the short reads to the reference. The output of Bowtie is a SAM file, which contains the read information, the portion of the reference to which each read aligned, the alignment position, and the number of times it aligned. The Bowtie output is used to generate a count file containing each sequence id and the number of reads mapped to it.

Command:

bowtie -q -v 2 --sam <database_file_name> <fastq_file_name> <sam_output_filename>

Input file: fastq file Output file: SAM file

3.2.8. GENERATING EXPRESSION FILE

After running Bowtie we obtained a SAM file containing information such as mapping position, the number of times reads mapped to the genome, and transcript/gene id. We used an R script to extract the count of reads for each transcript id and sorted the counts by transcript id. The resulting count file is used as input by various Bioconductor packages for differential expression analysis.
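The counting step can be illustrated as follows. The pipeline itself uses an R script that is not shown in this report; the Python sketch below only demonstrates the idea, counting mapped reads per reference name (SAM field 3) and skipping reads whose flag has the unmapped bit (0x4) set. The SAM lines are hypothetical.

```python
from collections import Counter


def count_mapped_reads(sam_lines):
    """Return sorted (reference id, read count) pairs from SAM alignment lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag, rname = int(fields[1]), fields[2]
        if flag & 4 or rname == "*":  # 0x4 means the read is unmapped
            continue
        counts[rname] += 1
    return sorted(counts.items())


# Hypothetical SAM content, for illustration only
sam_lines = [
    "@SQ\tSN:T1\tLN:1000",
    "r1\t0\tT1\t10\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tT1\t50\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r4\t0\tT2\t5\t255\t4M\t*\t0\t0\tACGT\tIIII",
]
counts = count_mapped_reads(sam_lines)
```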

3.2.9. USING BIOMART

We used biomaRt to build a lookup file of Ensembl gene ids and their matching transcript ids, in order to replace the mapped transcript ids obtained from the SAM file with gene ids.

3.2.10. GENE ONTOLOGY USING goSeq:

This package provides methods for performing Gene Ontology analysis of RNA-seq data,

taking length bias into account. In order to perform a GO analysis of RNA-seq data, goSeq

only requires a simple named vector, which contains two pieces of information.

1. Measured genes: all genes for which RNA-seq data was gathered for your experiment.

Each element of your vector should be named by a unique gene identifier.

2. Differentially expressed genes: each element of your vector should be either a 1 or a 0, where 1 indicates that the gene is differentially expressed and 0 that it is not.

If the organism, gene identifier or category test is not natively supported by goSeq, it is also necessary to supply additional information on gene lengths and/or the association between categories and genes, such as gene id or gene symbol.

Using this package we annotated the gene ids that were selected as differentially expressed according to their p-value. The package also provides the pathway information of genes.
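The sketch below illustrates the two ingredients described above: the named 0/1 vector, and the kind of over-representation test that a standard, length-unaware GO analysis performs. It is written in Python for illustration and is not goSeq's method; goSeq replaces the plain hypergeometric null shown here with one that accounts for gene length, and that correction is not reproduced. The third gene id and the toy category sizes are hypothetical.

```python
from math import comb


def de_indicator(p_values, alpha=0.05):
    """Named 0/1 vector: 1 if the gene's p-value is below alpha, else 0."""
    return {gene: int(p < alpha) for gene, p in p_values.items()}


def hypergeom_enrichment_p(n_genes, n_de, n_in_category, n_de_in_category):
    """P(at least n_de_in_category DE genes in the category) under random sampling."""
    upper = min(n_de, n_in_category)
    tail = sum(
        comb(n_in_category, k) * comb(n_genes - n_in_category, n_de - k)
        for k in range(n_de_in_category, upper + 1)
    )
    return tail / comb(n_genes, n_de)


p_values = {
    "ENSG00000100285": 0.00036888,
    "ENSG00000101439": 0.0143609705,
    "HYPOTHETICAL_GENE": 0.40,  # made-up entry for a gene that is not DE
}
indicator = de_indicator(p_values)

# Toy numbers: 10 genes, 3 DE, a category of 4 genes containing 2 of the DE genes
p_toy = hypergeom_enrichment_p(10, 3, 4, 2)
```

For the toy numbers the tail probability is (6*6 + 4*1) / 120 = 1/3, so two DE genes in that category would not be a surprising result.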

CHAPTER IV

RESULTS OF A REPRESENTATIVE ANALYSIS

We tested our pipeline on RNA-seq prostate cancer data with SRA accession numbers SRX022060, SRX022061, SRX022063, SRX022080, SRX022081 and SRX022083 [28]. The plots, expression profiling results and GO terms obtained as output of the pipeline are given below.

4.1. FASTQ QUALITY INSPECTION

4.1.1 OVERALL READ QUALITY:

Fig.7 Overall Read Quality

Lanes with consistently good quality reads have strong peaks at the right of the panel. Most reads are above a quality score (QS) of 20, so they can be considered good-quality reads. We analyzed the QS distribution for every fastq file and found a strong peak above QS 20. When we trimmed reads with QS below 20 we obtained fewer contigs, because some of the eliminated reads were needed to fill gaps; a cutoff below 10 is therefore preferable for trimming low-quality reads.

4.1.2 PER-CYCLE QUALITY SCORE:

Fig. 8 Per-Cycle Quality Score

Reported quality scores are ‘calibrated’ i.e. incorporating phred-like adjustments following

sequence alignment. These typically decline with cycle, in an accelerating manner. Abrupt

transitions in quality between cycles toward the end of the read might result when only some

of the cycles are used for alignment: the cycles included in the alignment are calibrated more

effectively than the reads excluded from the alignment. Thus as number of cycles increases

the quality score falls.

The reddish lines are quartiles (solid: median, dotted: 25, 75), the green line is the mean.

Shading is proportional to number of reads.

4.1.3 READ DISTRIBUTION

Fig. 9 Read distribution

These curves show how coverage is distributed amongst reads. Ideally, the cumulative proportion of reads will transition sharply from low to high. Portions to the left of the transition might correspond roughly to sequencing or sample processing errors, and correspond to reads that are represented relatively infrequently; 10-15% of reads fall into this category. To the right of the transition, reads are over-represented relative to expectation, which may be due to sequenced primer or adapter sequences, sequencing or base calling artifacts (e.g., poly-A reads), or features of the sample DNA (highly repeated regions) not adequately removed during sample preparation. About 5% of reads fall into this category.

Broad transitions from low to high cumulative proportion of reads may reflect sequencing

bias or (perhaps intentional) features of sample preparation resulting in non-uniform

coverage.

Common duplicate reads might provide clues to the source of over-represented sequences.

Some of these reads are filtered by the alignment algorithms; other duplicate reads might

point to sample preparation issues.

4.1.4 CYCLE-SPECIFIC BASE CALLS AND READ QUALITY

Fig. 10 Cycle-Specific Base Calls And Read Quality

Per-cycle base calls should usually be approximately uniform across cycles. The quality of A increases as the number of cycles increases, and the quality of T decreases. After 10 cycles the quality remains uniform and the base calls for all four bases are stable, so the base calls can be relied upon at later cycles.

4.1.5 PER CYCLE READ QUALITY

Fig.11 Per Cycle Read Quality

The per-cycle read quality plot shows that the Phred quality of reads decreases as the number of cycles increases. The top line of each box represents the upper quartile, which is uniform across the sequencing run, indicating that these reads can be used for de novo assembly. The quartile is an important factor in deciding whether to perform de novo assembly. Based on this plot, bad reads can be removed from the fastq file by trimming short reads with a Phred score below a desired cutoff. In practice this cutoff should not exceed 20, because a higher cutoff would remove reads that act as bridges between other short reads.
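The effect of the cutoff can be seen with a simple trimming rule. The Python sketch below removes trailing bases whose Phred score is below the cutoff; it is one common way to trim and is shown only as an illustration, not as the exact procedure used in the pipeline. The read and its scores are hypothetical.

```python
def trim_3prime(seq, quals, cutoff=20):
    """Drop trailing bases whose Phred score is below cutoff."""
    end = len(seq)
    while end > 0 and quals[end - 1] < cutoff:
        end -= 1
    return seq[:end], quals[:end]


# Hypothetical read with declining quality towards the 3' end
seq = "ACGTACGT"
quals = [30, 32, 25, 18, 22, 19, 10, 5]
strict = trim_3prime(seq, quals, cutoff=20)
lenient = trim_3prime(seq, quals, cutoff=10)
```

With a cutoff of 20 the read keeps 5 bases, while a cutoff of 10 keeps 7, which is the trade-off between read quality and the read length available for bridging overlaps.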

4.2. ANALYZING STATISTICS OF CONTIGS FILE

4.2.1 Histogram and weighted histogram of contigs coverage

Fig. 12 Histogram and weighted histogram of contigs coverage

The histograms above show the contig coverage for RNA-seq data from 3 normal and 3 cancer samples taken from NCBI SRA [28]. On the left (low-coverage) side of the weighted histogram no contigs are observed, so all contigs have good coverage.

4.2.2 Dinucleotide frequency:

Fig. 13 Dinucleotide frequency

This plot describes the dinucleotide frequency in the samples.
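The quantity behind this plot is a count of every pair of adjacent bases. The following Python sketch shows how such counts can be obtained; the contig sequences are hypothetical, and pairs containing characters other than A, C, G and T are skipped, which is an assumption of this example.

```python
from collections import Counter
from itertools import product


def dinucleotide_frequency(sequences):
    """Count overlapping dinucleotides over all sequences (16 keys, zeros included)."""
    counts = Counter({a + b: 0 for a, b in product("ACGT", repeat=2)})
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - 1):
            pair = seq[i:i + 2]
            if pair in counts:  # skips pairs with N or other symbols
                counts[pair] += 1
    return dict(counts)


freq = dinucleotide_frequency(["ACGTAC", "NNAC"])
```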

4.2.3 CONTIGS N50 AND MAXIMUM LENGTH:

Sample Name    N50    Max_contig_length
SRX022060      153    2895
SRX022061      152    3116
SRX022063      143    3088
SRX022080      163    3426
SRX022081      171    3512
SRX022083      158    3468

N50 is the contig length such that contigs of equal or greater length together contain half the bases of the assembly. Max length is the longest contig obtained by the Velvet assembler.

4.3. ANALYZING DIFFERENTIAL GENES EXPRESSION

4.3.1 Top Tags From DGE analysis:

Comparison of groups: normal-cancer

Gene id            logFC        logCPM      PValue
ENSG00000100285    -14.657624   11.619148   0.0003688800
ENSG00000044574    -14.203956   10.415779   0.0008230058
ENSG00000211896    -11.635115   8.851908    0.0023336360
ENSG00000126709    -10.389798   7.864545    0.0045019755
ENSG00000187244    -11.650946   7.725334    0.0049487220
ENSG00000215034    9.636823     6.693072    0.0097990435
ENSG00000211893    -9.214604    6.685765    0.0098388364
ENSG00000211677    -9.223101    6.519094    0.0110031817
ENSG00000211892    -9.013760    6.319759    0.0125532529
ENSG00000101439    -9.689559    6.124585    0.0143609705

Top tags are the differentially expressed genes for which the null hypothesis was rejected with a p-value below 0.05, i.e. at the 5% significance level these genes are differentially expressed between cancer and normal samples.

4.3.2 DIFFERENTIALLY EXPRESSED GENES:

    0      1
19646     20

0 denotes genes that are not differentially expressed and 1 denotes genes that are differentially expressed between the groups Normal-Cancer.

4.3.3 GO TERMS RETRIEVED BY GOSEQ PACKAGE

GOID: GO:0010466

Term: negative regulation of peptidase activity

Ontology: BP

Definition: Any process that stops or reduces the rate of peptidase

activity, the hydrolysis of peptide bonds within proteins.

--------------------------------------

GOID: GO:0051346

Term: negative regulation of hydrolase activity

Ontology: BP

Definition: Any process that stops or reduces the rate of hydrolase

activity, the catalysis of the hydrolysis of various bonds.

Synonym: down regulation of hydrolase activity

Synonym: down-regulation of hydrolase activity

Synonym: downregulation of hydrolase activity

Synonym: hydrolase inhibitor

Synonym: inhibition of hydrolase activity

--------------------------------------

GOID: GO:0004866

Term: endopeptidase inhibitor activity

Ontology: MF

Definition: Stops, prevents or reduces the activity of an

endopeptidase, any enzyme that hydrolyzes nonterminal peptide bonds

in polypeptides.

Synonym: alpha-2 macroglobulin

Synonym: endoproteinase inhibitor

Synonym: proteinase inhibitor

--------------------------------------

GOID: GO:0030414

Term: peptidase inhibitor activity

Ontology: MF

Definition: Stops, prevents or reduces the activity of a peptidase, any

enzyme that catalyzes the hydrolysis peptide bonds.

Synonym: protease inhibitor activity

--------------------------------------

GOID: GO:0052547

Term: regulation of peptidase activity

Ontology: BP

Definition: Any process that modulates the frequency, rate or extent of

peptidase activity, the hydrolysis of peptide bonds within

proteins.

Synonym: peptidase regulator activity

--------------------------------------

GOID: GO:0043086

Term: negative regulation of catalytic activity

Ontology: BP

Definition: Any process that stops or reduces the activity of an

enzyme.

Synonym: down regulation of enzyme activity

Synonym: down-regulation of enzyme activity

Synonym: downregulation of enzyme activity

Synonym: inhibition of enzyme activity

Synonym: negative regulation of enzyme activity

--------------------------------------

GOID: GO:0051336

Term: regulation of hydrolase activity

Ontology: BP

Definition: Any process that modulates the frequency, rate or extent of

hydrolase activity, the catalysis of the hydrolysis of various

bonds, e.g. C-O, C-N, C-C, phosphoric anhydride bonds, etc.

Hydrolase is the systematic name for any enzyme of EC class 3.

Synonym: hydrolase regulator

--------------------------------------

GOID: GO:0006952

Term: defense response

Ontology: BP

Definition: Reactions, triggered in response to the presence of a

foreign body or the occurrence of an injury, which result in

restriction of damage to the organism attacked or

prevention/recovery from the infection caused by the attack.

Synonym: antimicrobial peptide activity

Synonym: defence response

Synonym: defense/immunity protein activity

Synonym: physiological defense response

Synonym: GO:0002217

Synonym: GO:0042829

Secondary: GO:0002217

Secondary: GO:0042829

--------------------------------------

GOID: GO:0061134

Term: peptidase regulator activity

Ontology: MF

Definition: Modulates the activity of a peptidase, any enzyme that

catalyzes the hydrolysis peptide bonds.

--------------------------------------

GOID: GO:0061135

Term: endopeptidase regulator activity

Ontology: MF

Definition: Modulates the activity of a peptidase, any enzyme that

hydrolyzes nonterminal peptide bonds in polypeptides.

CHAPTER V

CONCLUSIONS:

This pipeline performs initial statistical analysis that helps us understand the short reads and prepares them for further analysis such as quality trimming, de novo assembly and mapping. FASTQ quality inspection lets us examine the reads and remove bad base calls; it also indicates contamination, if present. If the quartiles in the “per cycle read quality” plots are not uniform across cycles, the reads should not be used for de novo assembly. Analysis of the stats file produced by Velvet gives a histogram and a weighted histogram of coverage, which reveal low-coverage regions if present. Any such regions can be removed by setting a coverage cutoff slightly above the mean of the weighted histogram.
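The weighted coverage histogram described above can be sketched in base R as follows. This is a minimal illustration and not the pipeline's own code: the data frame is an invented stand-in for Velvet's stats file, the column names `lgth` and `short1_cov` are assumed to match that file's header, and `cov_cutoff` is a hypothetical hand-picked value that would in practice be chosen after inspecting the histogram.

```r
# Toy stand-in for Velvet's stats file: one row per node, with node length
# (lgth) and short-read coverage (short1_cov). All values are invented.
stats <- data.frame(
  lgth       = c(120, 80, 2500, 3100, 60, 1800),
  short1_cov = c(1.5, 2.0, 21.0, 19.5, 1.0, 22.5)
)

# Length-weighted mean coverage: long nodes contribute more than short ones.
weighted_mean_cov <- weighted.mean(stats$short1_cov, stats$lgth)

# Weighted histogram: total node length falling into each coverage bin.
bins <- cut(stats$short1_cov, breaks = seq(0, 25, by = 5))
weighted_hist <- tapply(stats$lgth, bins, sum)
weighted_hist[is.na(weighted_hist)] <- 0  # empty bins carry zero length

# Hypothetical cutoff (assumption): nodes below it are treated as low coverage.
cov_cutoff <- 5
low_cov_nodes <- stats[stats$short1_cov < cov_cutoff, ]

print(weighted_mean_cov)
print(weighted_hist)
print(low_cov_nodes)
```

With real data, the toy data frame would be replaced by the table read from the stats file, and a bar plot of `weighted_hist` would show whether a separate low-coverage peak is present.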

We analyzed prostate cancer data against normal data to test the performance of the pipeline. FASTQ quality inspection showed that the reads are of good quality, with some adapter contamination, which may interfere with Velvet assembly. From this analysis we concluded that the reads are suitable for de novo assembly.

We analyzed the BLAST results and found intron retention in Homo sapiens kallikrein-related peptidase 3 (gi|225543369).

In further analysis, the pipeline maps the short reads using Bowtie; on average, 70% of the short reads mapped to the human reference genome NCBI36/hg18.

The pipeline performs differential gene expression (DGE) analysis and reports the 10 most differentially expressed genes with a p-value below 0.05, i.e. genes for which the null hypothesis of no differential expression is rejected at the 5% significance level. After the DGE step, the pipeline performs Gene Ontology analysis on the differentially expressed genes to obtain the associated GO terms.
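The selection step described above (keep genes with a p-value below 0.05, then report the 10 strongest) can be sketched in base R. The table is invented for illustration; in the pipeline the p-values would come from the DGE step, whose output format is not shown here, so the column names `gene` and `p_value` are assumptions.

```r
# Toy DGE result table; gene IDs and p-values are invented for illustration.
dge <- data.frame(
  gene    = sprintf("gene%02d", 1:15),
  p_value = c(0.20, 0.001, 0.04, 0.60, 0.0005, 0.03, 0.049, 0.051,
              0.010, 0.002, 0.015, 0.90, 0.008, 0.025, 0.0001),
  stringsAsFactors = FALSE
)

# Significance threshold used in the text.
alpha <- 0.05

# Keep only genes below the threshold, then sort by ascending p-value.
significant <- dge[dge$p_value < alpha, ]
significant <- significant[order(significant$p_value), ]

# Report at most the 10 genes with the smallest p-values.
top10 <- head(significant, 10)
print(top10)
```

The gene identifiers in `top10$gene` are what would then be passed on to the Gene Ontology step.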
