
Page 1: BMS 617

Marshall University Genomics Core Facility

Marshall University School of Medicine, Department of Biochemistry and Microbiology

BMS 617

Lecture 15: High-throughput sequencing and bioinformatics

Page 2: BMS 617

High-throughput sequencing

• High-throughput sequencing (a.k.a. next-generation sequencing, or NGS) is an experimental technique that can sequence large numbers of DNA fragments at one time
• Basic idea:
– Take a DNA sample, denature it, and fragment it into segments up to a few hundred base pairs long
– The DNA is attached to a substrate (flow cell)
– Special nucleotide bases are allowed to anneal to the single-stranded DNA samples
• adapted so that only one base attaches at a time
• a different fluorescent dye is attached to each of A, C, G, and T
– The flow cell is scanned with an optical scanner to determine the base added to each fragment
– The “stops” and dyes are removed from the attached base, so another base can be attached
– Repeat up to 100–150 times
– Can then “turn around” and sequence the same fragments from the other end

Page 3: BMS 617

Applications of NGS

• There are many different applications of NGS:
– Genome sequencing
• Sequence DNA from samples, identify variants
• Potentially identify causal variants for disease
– Exome sequencing
• Sequence only the coding portions of the genome
– RNA sequencing (RNA-Seq)
• Collect RNA samples, build complementary DNA, sequence the DNA
• Can count the number of reads mapping to each known gene to measure gene expression
• Sequences can show transcripts
– Identify different splice variants for genes
– Many others

Page 4: BMS 617

Output from NGS

• A single run on an Illumina sequencer can read up to 3 billion DNA fragments
– 2 reads of 100 bases per fragment, so up to 6×10^11 (600,000,000,000) bases per experimental run
• This is a lot of data
• How to process it?
• Some standard pipelines exist for common types of experiments
• Most start with aligning the reads to a known (“reference”) genome

Page 5: BMS 617

Reference Genomes

• The Human Genome Project was completed in 2003
– Technically, the first complete draft
• Still an ongoing project
• Basically a sequence of “consensus” bases for each chromosome
• Raw sequences like this are stored in a fasta file
– Can be a single file, or one file per chromosome
– Very simple text file format
• A line containing the name of a sequence starts with ‘>’
• Other lines contain bases, a maximum of 80 per line
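The fasta format described above is simple enough to parse in a few lines. A minimal Python sketch (the two-sequence example is hypothetical):

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a dict of name -> sequence."""
    sequences = {}
    name = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:]                   # sequence name follows '>'
            sequences[name] = []
        elif name is not None and line:
            sequences[name].append(line)      # bases, up to 80 per line
    return {n: "".join(parts) for n, parts in sequences.items()}

# hypothetical fasta content, as a list of lines
example = [">chr1", "ACGTACGT", "TTGGCC", ">chr2", "GGGG"]
genome = read_fasta(example)
```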

Page 6: BMS 617

Sequencing Reads

• Output from a sequencer consists of a collection of reads
– Typically 100 bases per read
– Millions (sometimes even billions) of reads per sample
• Each base also has a quality score associated with it
– An estimate of how confident the sequencing software is in making the call
• Reads are stored in fastq files
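The quality scores in fastq files are Phred scores, stored one ASCII character per base. A minimal sketch of decoding them, assuming the Phred+33 encoding used by modern Illumina machines (the record is hypothetical):

```python
def phred_quality(qual_string, offset=33):
    """Decode a FASTQ quality string into per-base Phred scores.
    Assumes the common Phred+33 ASCII encoding."""
    return [ord(char) - offset for char in qual_string]

# A fastq record is four lines: @name, sequence, '+', quality string
record = ["@read1", "ACGT", "+", "IIII"]   # hypothetical record
scores = phred_quality(record[3])          # 'I' (ASCII 73) encodes Phred 40
```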

Page 7: BMS 617

Alignment

• Alignment involves finding the location in the genome of a read
– Which chromosome, which base number?
• In theory, one could scan the entire genome and look for a match of the read to the sequence
– However, must account for variation
• Natural biological variation
• Sequencing error
• Alignment needs to find the “best match” in the entire genome
– Remember, mammalian genomes are around 3 billion bases long
– And we have up to 3 billion reads per run of the sequencer
– So potentially up to 9×10^18 (9,000,000,000,000,000,000) things to check
– Very sophisticated algorithms are needed to make this a viable task
• These typically give “best approximation” results
– Most commonly used alignment software: BWA and Bowtie
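To see why sophisticated algorithms are needed, consider the naive scan alluded to above. A toy Python sketch (hypothetical sequences; real aligners such as BWA and Bowtie instead build an index of the genome, and they tolerate mismatches):

```python
def naive_align(read, genome):
    """Exhaustive exact-match scan: check every position in the genome.
    Only a toy: it allows no mismatches, so it ignores biological
    variation and sequencing error, and is far too slow at genome scale."""
    hits = []
    for i in range(len(genome) - len(read) + 1):
        if genome[i:i + len(read)] == read:
            hits.append(i)
    return hits

genome = "ACGTACGTTTACGT"   # hypothetical reference sequence
hits = naive_align("ACGT", genome)
```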

Page 8: BMS 617

Alignment output

• The result of alignment is a sam (“Sequence Alignment/Map”) file, or its binary version, a bam file
– For each read, it describes the sequence to which the read mapped (i.e. the chromosome), how it mapped (were there “gaps”?), the quality of the mapping (how certain?), and other data
• These files (and others) can be viewed in a genome viewer
– The Integrative Genomics Viewer (IGV) from the Broad Institute and UCSC’s Genome Browser are the most commonly used
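Each alignment line in a sam file is tab-separated, with the mandatory fields in a fixed order. A minimal sketch of pulling out the fields mentioned above (the alignment line itself is hypothetical):

```python
def parse_sam_line(line):
    """Extract some of the mandatory tab-separated fields of one
    sam alignment line."""
    fields = line.rstrip("\n").split("\t")
    return {
        "qname": fields[0],        # read name
        "flag": int(fields[1]),    # bitwise flags
        "rname": fields[2],        # reference sequence (chromosome)
        "pos": int(fields[3]),     # 1-based leftmost mapping position
        "mapq": int(fields[4]),    # mapping quality (how certain?)
        "cigar": fields[5],        # how it mapped (matches, gaps, ...)
    }

# hypothetical alignment line
line = "read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII"
aln = parse_sam_line(line)
```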

Page 9: BMS 617

RNA-Seq alignment

• RNA-Seq presents special problems for alignment
• In most eukaryotes, transcripts are spliced
• Remember we start with RNA transcripts
– Use those to construct cDNA
– And sequence the cDNA
• Need to map this back to the reference sequence from where it came
• Introns appear as huge deletions to the aligner

Page 10: BMS 617

RNA-Seq alignment problem

[Slide diagram: a gene is transcribed and spliced (Transcription), the transcript is fragmented (Fragmentation), then cDNA is made and sequenced (Make cDNA, sequence), leaving the question: how to map the reads back to the genome?]

Page 11: BMS 617

RNA-Seq alignment solutions

• One option is to ignore the problem
– Will only align reads that map to a single exon
• Or that have a splice junction close to one end of the read
• Longer exons will be well mapped
– Maybe works well enough for simple differential expression experiments
• Another option is to build a reference transcriptome and align to it
– Basically, take all the known genes and their transcripts, and build a “genome” out of the known transcripts
– Restricts analysis to known transcripts only
– Splice variation can cause problems

Page 12: BMS 617

TopHat aligner

• TopHat is a specialized aligner for aligning RNA-Seq reads
– Like Bowtie, a collaboration between UMD and MIT
• Works by (computationally) chopping reads into small chunks (~25 bp) and aligning those
– Much less likely those will span splice junctions, so many will align directly to the genome
– Use the aligned chunks to guess where the splice junctions are
– Then use the splice junctions to align the chunks that failed to align the first time
• Very computationally intensive
– 50 million reads usually take about 16 hours to align on a 12-core processor
– A typical experiment may have a dozen or more such samples
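The chopping step itself is trivial; a toy sketch (chunk size and read are illustrative, and this is of course not TopHat's actual implementation):

```python
def chop_read(read, chunk_size=25):
    """Chop a read into ~25 bp chunks, mirroring TopHat's first step."""
    return [read[i:i + chunk_size] for i in range(0, len(read), chunk_size)]

chunks = chop_read("A" * 100)   # a hypothetical 100-base read -> 4 chunks
```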

Page 13: BMS 617

Downstream Analysis

• After aligning, analysis varies depending on the type of application
• For genomic applications, we usually want to know where the variants are compared to the reference genome
– Which are “interesting”?
• For RNA-Seq, we often want to compare expression of genes between samples
– Similar to microarray studies

Page 14: BMS 617

Variant Calling

• Variant calling is performed by examining the alignment file (bam file) and comparing the sequence from each read to the sequence from the reference genome
– Use the quality scores for the bases to determine the probability that this is a real variant and not a sequencing error
– Standard software for this step is samtools

Page 15: BMS 617

Filtering variants

• In humans, a subject will typically have hundreds of thousands of variants relative to the reference genome
– About 1 base in every 10,000
• Which of these are of interest?
– Need to filter them
• Typically, for disease studies, look for variants present in the diseased individuals and not present in the normal individuals
– In family studies this can remove many variants

Page 16: BMS 617

Filtering steps for variants

• Still have many variants in consideration
– Typically remove “common variants”
• If the variant is known, and has a relatively large frequency in the population (say > 5%), then it probably is not causal for a major disease
• Simply by evolutionary considerations, or by considering the disease incidence rate
– Often focus only on variants that occur in coding regions of the genome
– Can also look at the effect those variants have on the generated protein sequence
• Synonymous changes (ones which change the DNA sequence but result in the same amino acid) are probably not going to have an effect on phenotype
– Sophisticated tools (PolyPhen-2 and SIFT) will examine the relevant protein sequence across many species
• If the sequence is highly conserved across many species, changes to it are likely to be deleterious
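The filtering steps above amount to a cascade of boolean checks per variant. A hedged sketch, with hypothetical variant records and field names (real pipelines pull these annotations from databases such as dbSNP and from tools like PolyPhen-2):

```python
def keep_variant(variant, max_pop_freq=0.05):
    """Cascade of filters: drop common variants, non-coding variants,
    and synonymous changes."""
    if variant["pop_freq"] > max_pop_freq:
        return False     # common in the population: unlikely causal
    if not variant["coding"]:
        return False     # outside coding regions
    if variant["synonymous"]:
        return False     # same amino acid: unlikely to affect phenotype
    return True

variants = [                                         # hypothetical records
    {"id": "v1", "pop_freq": 0.10,  "coding": True,  "synonymous": False},
    {"id": "v2", "pop_freq": 0.001, "coding": True,  "synonymous": False},
    {"id": "v3", "pop_freq": 0.001, "coding": False, "synonymous": False},
]
kept = [v["id"] for v in variants if keep_variant(v)]
```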

Page 17: BMS 617

Differential Expression analysis by RNA-Seq

• A typical use of RNA-Seq is to determine the genes differentially expressed between two sets of samples
– Similar to microarray studies, but with several advantages
– Not restricted to genes spotted on an array
• Not even restricted to known genes
– Can distinguish between different splice variants if needed
– Less statistical noise: get an actual count of reads

Page 18: BMS 617

Typical pipeline for differential expression analysis

• Align reads to the reference genome (preferably using TopHat)
• For each known gene, count the number of reads for each sample intersecting that gene
– Can also do this for each exon instead of each gene if you want differential transcript analysis
• Normalize the counts by the total number of counts for the sample
• For each gene, compare the normalized counts for one set of samples to the normalized counts for another set of samples
• Generate a p-value
• Correct for multiple hypothesis testing
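The normalization step is often done as counts per million (CPM). A minimal sketch with hypothetical counts (dedicated tools such as DESeq2 and edgeR normalize more carefully, accounting for library composition):

```python
def normalize_counts(counts):
    """Scale raw per-gene read counts to counts per million (CPM),
    so samples with different sequencing depths are comparable."""
    total = sum(counts.values())
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}

sample = {"geneA": 500, "geneB": 1500}   # hypothetical raw counts
cpm = normalize_counts(sample)
```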

Page 19: BMS 617

Annotation and Variation databases

• The pipelines for variant filtering and RNA-Seq analysis relied not just on the reference genome but on knowing where the genes are in that genome
– For RNA-Seq, we had to count reads intersecting each gene
– For variant filtering, we wanted to know which variants were in coding exons
– Variant filtering also relied on “known variants”
• These steps rely on additional databases
• Two main sources:
– NCBI: the National Center for Biotechnology Information
• Part of the National Institutes of Health (NIH)
– EBI: the European Bioinformatics Institute
• Part of the European Molecular Biology Laboratory (EMBL)
– UCSC maintains a database linking to both of these
• A kind of meta-database

Page 20: BMS 617

Strategies for using annotation and variation databases

• There are online tools, and some standalone tools, for interacting with these databases
• Because of the scale of the data, however, at some point using these requires writing computer code
• For most use-cases, download a text file and write code to read and process it
– Usually downloaded from genome.ucsc.edu

Page 21: BMS 617

Custom analyses

• Some analyses don’t have “off-the-shelf” data analysis pipelines
– These have to be created ad hoc
• Current project: reduced representation bisulphite sequencing (RRBS)
– A technique that uses NGS to identify sites in the genome which are methylated
• Methylation affects transcription
• It is a known mechanism for turning gene expression on and off

Page 22: BMS 617

RRBS

• DNA is first cleaved by the MspI enzyme
– Cleaves DNA at sites matching CCGG
• Then size-selected by gel to select only fragments between 30 and 350 bases in length
• Treat with bisulphite
– Converts unmethylated cytosines (C) to uracil
• Reads as T in sequencing
– But leaves methylated cytosines alone

Page 23: BMS 617

Bismark

• Some tools exist for RRBS analysis, but they are fairly primitive
– Bismark is an aligner/methylation caller developed by the Babraham Institute
– Basic idea:
• Convert all Cs in the reference genome to Ts
• Temporarily convert all Cs in the reads to Ts
• Align the converted reads to the converted genome
• Un-convert the genome and reads
• Ts in reads that mapped to Cs in the reference are likely to be unmethylated Cs
• Cs in reads that mapped to Cs in the reference are likely to be methylated Cs
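The conversion-and-compare idea can be sketched in a few lines of Python. This is a toy illustration of the principle only, not Bismark's implementation (the read and reference fragment are hypothetical, and a real caller works from the alignment, handling strands and mismatches):

```python
def bisulphite_convert(seq):
    """In silico C->T conversion (the trick applied to both the
    reference genome and the reads before alignment)."""
    return seq.replace("C", "T")

def call_methylation(read, ref):
    """After alignment, compare the original read to the reference
    at each cytosine position in the reference."""
    calls = []
    for read_base, ref_base in zip(read, ref):
        if ref_base == "C":
            # C over reference C -> likely methylated (survived bisulphite);
            # T over reference C -> likely unmethylated (converted to U, read as T)
            calls.append("methylated" if read_base == "C" else "unmethylated")
    return calls

ref = "CCGGATGG"    # hypothetical reference fragment
read = "TCGGATGG"   # hypothetical read: first C converted, second not
calls = call_methylation(read, ref)
```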

Page 24: BMS 617

RRBS Pipeline

• Before we align and call methylation, we want to do some quality control analysis
– Did the experimental protocol produce the results we expected?
• All reads should begin CGG
– The cleavage step drops the first base of the CCGG cleavage site
• Should be able to identify potential targets of alignment
– Find all occurrences of CCGG in the reference genome
– Find all fragments of the genome between these sites with length between 30 and 350 bases
– All reads should align to the beginning or end of these fragments

Page 25: BMS 617

A note on computational power

• Suppose we attempted the task “find all occurrences of CCGG in the reference genome” by hand
– Optimistically assume you can scan 30 bases per second looking for these
– At 3 billion bases, this would take 100,000,000 seconds
– That is 1,666,667 minutes, or 27,778 hours, or 1,157 days
– i.e. 3.17 years of continuous work
– or about 14 years full time at 40 hours/week
• My laptop can complete this task in a couple of minutes
• Remember that aligning a dozen RNA-Seq samples takes a weekend on our 22×12 CPU computer cluster

Page 26: BMS 617

Next pipeline steps

• Having confirmed the reads are located as expected, look at the methylation calls
• Compare methylation status between different cell lines and identify locations where methylation differs
– Use Fisher’s exact test to compare the numbers of methylated and unmethylated calls between cell lines
– Correct the p-values for multiple hypothesis testing
• Look for “clusters” of differences
• Filter
– Which are in genes, or perhaps just upstream of genes (in the promoter region)?
– These are likely to affect expression
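The per-site test above compares a 2×2 table of methylated/unmethylated counts in the two cell lines. In practice you would call a statistics library (e.g. scipy.stats.fisher_exact), but the test is small enough to sketch from scratch with the hypergeometric distribution; the counts below are hypothetical:

```python
from math import comb

def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table
    [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def table_prob(x):
        # hypergeometric probability of x in the top-left cell,
        # with the table's margins held fixed
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = table_prob(a)
    p = 0.0
    # sum over all tables at least as improbable as the observed one
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        px = table_prob(x)
        if px <= observed + 1e-12:
            p += px
    return p

# hypothetical counts: (methylated, unmethylated) in cell line 1 vs cell line 2
p = fisher_exact_p(8, 2, 1, 9)
```

Each genomic site tested this way yields one p-value, which is why the multiple-testing correction in the pipeline is essential.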

Page 27: BMS 617


Summary

• NGS data analysis involves managing and manipulating large amounts of data

• Eventually, some programming skills are necessary

• Statistical analysis is usually involved at the end of the pipeline

• Potential for very powerful analyses and discoveries