Bioinformatica e analisi dei genomi Anno 2016/2017 Pierpaolo Maisano Delser mail: [email protected]

Bioinformatica e analisi dei genomi (Bioinformatics and genome analysis)

Academic year 2016/2017

Pierpaolo Maisano Delser
mail: [email protected]

Background

Cusco, March 2009

• Bachelor's degree (Laurea Triennale): Biological Sciences, Università degli Studi di Ferrara, Dr. Silvia Fuselli;

• Master's degree (Laurea Specialistica): Biomolecular and Cellular Sciences, Università degli Studi di Ferrara, Dr. Silvia Fuselli;

• PhD in Genetics, University of Leicester, Prof. Mark A. Jobling;

• Post-doctoral fellow, EPHE-MNHN, Paris, Dr. Stefano Mona;

• Research Fellow, Trinity College Dublin, Prof. Daniel Bradley.

Muséum national d'Histoire naturelle - Paris

Trinity College Dublin - Ireland

Practical information

• Theory + practice;

• Software and tools;

• Files;

• Slides on the website;

• New topics / topics already covered.

Practical information

Working directory (fastq files): /home/bioinfo_file/

Reference files (reference, intervals for the coverage, genome for IGV): /home/bioinfo_file/reference/

Remember the file paths!

pwd: shows your current location

Programme

• next-generation sequencing (NGS): how, when, why?

• an example of NGS data management and analysis:

  • data type;
  • files and formats;
  • programs;
  • interpretation of the results;
  • error estimation;
  • when to stop?

• Applications and/or projects on different organisms.

NGS: how, when, why?

[Workflow diagram. Labels: DNA-seq; RNA-seq; whole genome; capture: exome/custom/cancer; amplicon sequencing; whole transcriptome; amplicon sequencing: fixed/custom; sequencing; unaligned reads QC; reads trimming; mapping to a reference genome; de novo assembly; mapping refinement; mapping QC; assembly QC; filtering; validation]

NGS: how, when, why?

Question: when?

Answer: when it makes sense!

• A 400 bp amplicon in 100 individuals? → Sanger sequencing

• 50 amplicons in 100 individuals? → NGS + target capture

• Gene conversion, repeated elements, recombination breakpoints? → NGS + Sanger sequencing

Question: why?

Answer: your idea for a project!

An example of NGS data management and analysis

Sequencing platforms:

Oxford Nanopore MinION/GridION

Pacific Biosciences (PacBio)

Ion Torrent PGM/Proton

Roche 454

Illumina MiSeq/HiSeq

An example of NGS data management and analysis

• project

• project: application (whole genomes? exomes? target capture? amplicon sequencing?)

• project: application: aim (SNPs, indels, repeated elements, CNVs…)

• project: application: aim: coverage (SNPs, indels, repeated elements, CNVs…)

Project:

• Carcharodon carcharias - the great white shark;

Project:

• Carcharodon carcharias;

• Diploid organism;

• 82 chromosomes (41 pairs);

• Genome size ~5.2 Gb – not fully sequenced yet;

• Target capture experiment;

• Paired-end reads (250bp).

Project:

• Target capture experiment;

Meyerson M et al., 2010, Nat Rev Genet

Single-end (SE) or paired-end (PE) sequencing.

fragment                 ========================================
fragment + adaptors   ~~~========================================~~~
SE read                  --------->
PE reads               R1--------->                    <---------R2
unknown gap                        ....................

Workflow overview

Input: raw reads (.fastq)

1. Fastq quality control + trimming
   • Adapters? Low quality bases?

2. Alignment to a reference genome
   • close reference? time limited? → bwa
   • distant reference? → stampy
   • output: aligned reads (.sam/.bam)

3. bam refinement
   • duplicate removal (picard)
   • local realignment (GATK)
   • base recalibration (GATK)
   • output: aligned reads (.sam/.bam)

4. bam check and visualization
   • duplicate metrics (picard)
   • flagstat (samtools)
   • coverage distribution (GATK)
   • visualization: samtools, IGV/tablet

5. variant calling
   • SNPs/indels
   • single/multi-sample
   • samtools
   • output: raw variants (.vcf)

6. variant filtering and validation
   • in silico vs in vitro validation
   • vcftools
   • variant score recalibration (big datasets, known SNPs/indels)
   • output: ready-to-use variants (.vcf)

File extension    Content
.fa/.fasta        sequences
.fastq            read data
.sam (.sai)       mapped reads
.bam (.bai)       mapped reads (binary)
.vcf              variant information

@M00725:28:000000000-AJ72K:1:1101:11561:1002 1:N:0:1
AGTCAACAACGGGAACAAAATCCTGAAGGTCATGGTATGTGTANNNNTNTTNNNNNCCNNNNNNATGTGTCNNNNNNNTNNNNNNTCTGAGTNNNNNNNCTCTCTTNNNNNNNAGTGGGTNNNNNNNGCATCCANNNAGCACGATTTTNNNNNNNTATTCAGGAGACAANNNNNNNGTGGGCANNNNNNNGTGTTGGNNNNNNNNNNNNNNGGAGAGANAAAAAANNNNNNNTGAAGTCNNNNNNNNNNNNAGCGNNANNNNNNNTCNNNNNNNNNNNNNNATCANNNNNNNNNNGGTG
+
8ACCFGFGGGCDGGGGCFGGGGGGGFGGGGGFEFFGGGFFEGG####9#::#####:9######::CD@FG#######:######,:99CF?#######::DBFDE#######4::DFG>#######+9A=D@F###88=+<FFFFGG#######++8@8;EEFG8>DG#######+6@DEFF#######*44D=,:##############*/**2:*#212/8C#######*.*2:/9############)-))##0#######,(##############0((,##########-((-

raw reads (.fastq)

raw reads (.fastq)

Terminal: more cc_gn2_R1_trimmed.fastq OR head cc_gn2_R1_trimmed.fastq
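A FASTQ record spans four lines: header, sequence, a "+" separator, and the quality string. As a minimal sketch, assuming an uncompressed file with strictly four-line records, the same first look that `head` gives can be taken in Python. The function name `read_fastq` and the in-memory record are illustrative only; they are not part of the course material.

```python
import io
from itertools import islice

def read_fastq(handle):
    """Yield (header, sequence, quality) tuples from an open FASTQ handle.

    Assumes every record is exactly four lines long.
    """
    while True:
        lines = list(islice(handle, 4))
        if len(lines) < 4:
            return  # end of file (or a truncated final record)
        header, seq, plus, qual = (line.rstrip("\n") for line in lines)
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual

# Made-up record used only to exercise the function (not course data).
example = io.StringIO("@read1 1:N:0:1\nACGTN\n+\nGGGF#\n")
records = list(read_fastq(example))
print(records)
```

For a real file, the same generator can be fed with `open("cc_gn2_R1_trimmed.fastq")` and sliced with `islice` to inspect only the first few records.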

raw reads (.fastq)

@M00725:28:000000000-AJ72K:1:1101:11561:1002 1:N:0:1

Fields of the header line:

M00725             Instrument ID
28                 Run ID
000000000-AJ72K    flowcell ID
1                  lane
1101               tile
11561:1002         coordinates of the cluster
1                  First mate in the pair (paired-end reads)
N                  Is the read filtered? No (N) or Yes (Y)
0                  Control included? 0=No
1                  index

The second line of the record is the read; the fourth line holds the quality values for each nucleotide.
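The header fields described above can be split programmatically. The following is a sketch, assuming the colon-separated layout of the example header on this slide; the class and function names (`IlluminaHeader`, `parse_header`) are hypothetical.

```python
from typing import NamedTuple

class IlluminaHeader(NamedTuple):
    instrument: str
    run_id: str
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int
    mate: int
    filtered: bool
    control: int
    index: str

def parse_header(line: str) -> IlluminaHeader:
    """Split a FASTQ header line of the form shown on the slide."""
    name, description = line.lstrip("@").split(" ", 1)
    instrument, run_id, flowcell, lane, tile, x, y = name.split(":")
    mate, is_filtered, control, index = description.split(":")
    return IlluminaHeader(
        instrument, run_id, flowcell,
        int(lane), int(tile), int(x), int(y),
        int(mate), is_filtered == "Y", int(control), index,
    )

hdr = parse_header("@M00725:28:000000000-AJ72K:1:1101:11561:1002 1:N:0:1")
print(hdr)
```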

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
0.2......................26...31........41

ASCII codes: 33 (lowest, "!") to 126 (highest, "~")

Illumina 1.8+: Phred+33, raw reads typically (0, 41)

raw reads (.fastq)

Quality values for each nucleotide: the fourth line of each record, one character per base.
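With the Phred+33 encoding named above, each quality character maps to a score by subtracting 33 from its ASCII code. A minimal sketch (the function name `phred33_scores` is illustrative), applied to the first characters of the example quality string:

```python
def phred33_scores(quality):
    """Convert a Phred+33 quality string into a list of integer scores."""
    return [ord(ch) - 33 for ch in quality]

# First six characters of the example quality line, followed by a "#".
scores = phred33_scores("8ACCFG#")
print(scores)
```

In this encoding "!" decodes to 0 and "J" to 41, which is the typical range quoted for raw Illumina 1.8+ reads; the "#" under every N base decodes to 2.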

Phred-scale value:

$Q = -10 \log_{10} P \;\rightarrow\; P = 10^{-Q/10}$

Phred Quality Score (Q)    Probability of incorrect base call (P)    Base call accuracy
10                         1 in 10                                   90%
20                         1 in 100                                  99%
30                         1 in 1000                                 99.9%
40                         1 in 10000                                99.99%
50                         1 in 100000                               99.999%
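The table follows directly from the Phred formula. A short sketch that converts in both directions and prints the table rows (function names are illustrative):

```python
import math

def phred_to_error_prob(q):
    """P = 10^(-Q/10): probability that the base call is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40, 50):
    p = phred_to_error_prob(q)
    print(f"Q{q}: P = {p:g}, accuracy = {100 * (1 - p):g}%")
```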

raw reads (.fastq)

raw reads (.fastq)

• Open cc_gn2_R2_trimmed.fastq

Terminal: more cc_gn2_R2_trimmed.fastq OR head cc_gn2_R2_trimmed.fastq

• Are cc_gn2_R1_trimmed.fastq and cc_gn2_R2_trimmed.fastq coming from two different lanes?

• What’s the difference between the two fastq files?

raw reads (.fastq)

cc_gn2_R1_trimmed.fastq

@M00725:28:000000000-AJ72K:1:1101:11561:1002 1:N:0:1
AGTCAACAACGGGAACAAAATCCTGAAGGTCATGGTATGTGTANNNNTNTTNNNNNCCNNNNNNATGTGTCNNNNNNNTNNNNNNTCTGAGTNNNNNNNCTCTCTTNNNNNNNAGTGGGTNNNNNNNGCATCCANNNAGCACGATTTTNNNNNNNTATTCAGGAGACAANNNNNNNGTGGGCANNNNNNNGTGTTGGNNNNNNNNNNNNNNGGAGAGANAAAAAANNNNNNNTGAAGTCNNNNNNNNNNNNAGCGNNANNNNNNNTCNNNNNNNNNNNNNNATCANNNNNNNNNNGGTG

cc_gn2_R2_trimmed.fastq

@M00725:28:000000000-AJ72K:1:1101:11561:1002 2:N:0:1
CCATTTCTNNNNNNNAGGACCTNNNNNNNAGCCCTNNNNNNNNNNNNNAGNATATGANNNNNNNTCTTATTNANCCANNNTCTAGNNNNNNNCTTTCCTNNNNNNNTCTCTGANNNNNNNNNNNNNNCCCTTCCNNTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTNTCTCNTNNNNNNNNNNNNAAAATCCNNNNNNNNNNNNNNCCACTAANNNNNNNNNNNNNNAAGAAATAACACACNNNNNNNACAAAAANNNNNNNACAACACNNNNNNNGCATAAANNNA

Same lane, different read mate in the pair!
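The two headers differ only in the mate number after the space. A small sketch, assuming headers laid out like the two shown here, that checks whether an R1 and an R2 record belong to the same pair (the helper names are hypothetical):

```python
def split_header(header):
    """Return (read name, mate number) from a FASTQ header line."""
    name, description = header.lstrip("@").split(" ", 1)
    mate = int(description.split(":")[0])
    return name, mate

def are_mates(header_r1, header_r2):
    """True when both headers share the read name and are mate 1 and mate 2."""
    name1, mate1 = split_header(header_r1)
    name2, mate2 = split_header(header_r2)
    return name1 == name2 and (mate1, mate2) == (1, 2)

r1 = "@M00725:28:000000000-AJ72K:1:1101:11561:1002 1:N:0:1"
r2 = "@M00725:28:000000000-AJ72K:1:1101:11561:1002 2:N:0:1"
print(are_mates(r1, r2))
```

Paired-end aligners generally expect the two files to list the mates in the same order, so a check like this on the first few records is a cheap sanity test before mapping.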

1- Fastq quality control + trimming

FastQC: quality control of the raw data coming out of the sequencer

• Evaluation of the quality of the generated data;

• Basic summary statistics of the raw data;

• Several modules to evaluate different features (e.g. adapters, base quality, etc.);

• Feedback (green, orange, red): do not rely on it blindly, think about what it means!

1- Fastq quality control + trimming

Per base sequence quality: warning

1- Fastq quality control + trimming

[Plot annotations: positions 90-94 bp and 95-99 bp]

What can we do to improve the quality at the end of the reads?

Read trimming: removal of the low-quality bases at the 3' ends of the reads
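The idea of 3' trimming can be sketched in a few lines. This is a deliberately simple illustration, assuming Phred+33 qualities and an arbitrary threshold of Q20; it is not the algorithm of any particular trimming program, which typically use sliding windows or more elaborate criteria.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Drop trailing bases whose Phred score is below min_q.

    Trimming stops at the first base (scanning from the 3' end)
    that reaches the threshold.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

# Made-up read: five good bases (G = Q38) followed by three poor ones.
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGT", "GGGGG#,#")
print(trimmed_seq, trimmed_qual)
```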

1- Fastq quality control + trimming

Per sequence quality score: pass

1- Fastq quality control + trimming

Sequence length: pass

1- Fastq quality control + trimming

Adapter removal

Failed

Warning

1- Fastq quality control + trimming

Adapter removal

Pass

Overrepresented sequences

1- Fastq quality control + trimming

Removal of overrepresented sequences (PCR primers).

FastQC references:

• Software website: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

• Manual: https://insidedna.me/tool_page_assets/pdf_manual/fastqc.pdf

Alignment: the process of determining the most likely location within the genome for the observed DNA read

raw reads reference genome

2- Alignment to a reference genome

2- Alignment to a reference genome

trade-off: speed vs sensitivity. The higher the accuracy, the longer the alignment run

two classes of methods:

Burrows-Wheeler
• fast
• less robust at high divergence with the reference genome
• e.g. bwa

Hashing
• slow (needs more memory)
• robust at high divergence with the reference genome
• e.g. stampy

the shorter the read, the harder it is to find its location in the genome

large amounts of data: computationally challenging for memory and speed

BW: https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform
Hashing: https://en.wikipedia.org/wiki/Hash_table
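To make the hashing idea concrete, here is a toy sketch: every k-mer of the reference is stored in a hash table with its positions, and the first k-mer of a read is used as a seed to look up candidate locations, which are then verified. This is only an illustration of the general principle under simplifying assumptions (exact matches, tiny made-up sequences); it is not how stampy or any production aligner is implemented.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of its 0-based positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def seed_positions(read, index, k):
    """Candidate start positions, looked up from the first k-mer of the read."""
    return index.get(read[:k], [])

def exact_hits(read, reference, index, k):
    """Keep only the candidates where the whole read matches exactly."""
    return [p for p in seed_positions(read, index, k)
            if reference[p:p + len(read)] == read]

reference = "ACGTTGCAACGTAGGC"  # made-up toy reference
index = build_kmer_index(reference, k=4)
print(seed_positions("ACGTAG", index, k=4))
print(exact_hits("ACGTAG", reference, index, k=4))
```

The index has to hold every k-mer of the genome, which gives an intuition for the memory cost noted on the slide.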

2- Alignment to a reference genome

[Figure labels: raw reads, reference genome]

What if there are several possible places to align your sequencing read?

This may be due to:
- Repeated elements in the genome
- Low complexity sequences
- Reference errors and gaps

MQ is a phred-score of the quality of the alignment

high MQ: a single match

low MQ: the probability of mapping to different locations is high, but no perfect multiple matches

MQ0: a perfect multiple match
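The difference between a single match and a perfect multiple match can be illustrated on a toy reference containing a repeated element. This sketch only counts exact occurrences; real aligners compute MQ probabilistically and also consider imperfect matches. The sequences and function names are made up for illustration.

```python
def match_positions(read, reference):
    """Return every 0-based position where the read matches the reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions

def describe(read, reference):
    n = len(match_positions(read, reference))
    if n == 0:
        return "no perfect match"
    if n == 1:
        return "single perfect match (high MQ expected)"
    return "perfect multiple match (MQ0 expected)"

# Toy reference: the element GACCAGT occurs twice.
reference = "TTGACCAGTAAGACCAGTCC"
print(describe("GACCAGT", reference))  # read falls entirely inside the repeat
print(describe("AGTAAG", reference))   # read spans unique flanking sequence
```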

2- Alignment to a reference genome

[Figure: Reference sequence with Element 1 and Element 2, and Sample_1; copy-number labels: 1 copy, 1 copy, 1 copy, 1 copy]

2- Alignment to a reference genome

[Figure: Reference sequence with Element 1 and Element 2; Sample_1 with Element 1]

Perfect multiple matches → MQ0
Not a perfect match → Low MQ

[Figure: Reference sequence and Sample_1; copy-number labels: 2 copies, 1 copy, 1 copy, 1 copy]

False heterozygous call
Cluster of heterozygotes

[Figure: Reference sequence with Element 1 and Element 2, and Sample_1; copy-number labels: 1 copy, 2 copies, 1 copy, 1 copy]

[Figure: AluSg7]

2- Alignment to a reference genome: mapping with bwa-mem

Three different algorithms:

1. BWA-backtrack: for Illumina reads up to 100 bp;

2. BWA-SW: long read support, split alignment;

3. BWA-MEM: long read support, split alignment, faster, more accurate.

Fastq files are already trimmed → adapters removed

2- Alignment to a reference genome: mapping with bwa-mem

Split read:

Karacok E et al., 2012

• paired-end alignment;

• it uses the reference genome (.fa) and the reads (.fastq) to create a SAM file;

• Option to mark shorter split hits as secondary (not supplementary).

bwa mem [options] [RefSeq] [fastq1] [fastq2] > cc_gn2_R12.sam

2- Alignment to a reference genome: mapping with bwa-mem

Type: bwa mem to check the options

2- Alignment to a reference genome: mapping with bwa-mem

bwa mem -M cc_ref.fa cc_gn2_R1_trimmed.fastq cc_gn2_R2_trimmed.fastq > cc_gn2_R12.sam

-M                         Mark shorter split hits as secondary
cc_ref.fa                  Reference genome
cc_gn2_R1_trimmed.fastq    Fastq 1
cc_gn2_R2_trimmed.fastq    Fastq 2

2- Alignment to a reference genome: from sam to bam

Convert sam-to-bam:

samtools view .. .. .. input_sam .. output_bam

• Option to define that the input is a sam file;

• Option to have output in bam format;

• Option to define the output;

2- Alignment to a reference genome: from sam to bam

samtools view -Sb cc_gn2_R12.sam -o cc_gn2_R12.bam

samtools view         sam-to-bam conversion
-S                    Input is a sam file
-b                    Output in bam format
cc_gn2_R12.sam        Input file (sam)
-o cc_gn2_R12.bam     Output file (bam)
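The mapping and conversion commands from these slides can also be driven from a script. The sketch below builds exactly the two command lines shown in the slides and, if asked, runs them with `subprocess`. It assumes that `bwa` and `samtools` are installed and that the reference has already been indexed; the function names are hypothetical, and nothing is executed unless `run_mapping` is called explicitly.

```python
import shlex
import subprocess

def bwa_mem_command(reference, fastq1, fastq2):
    """bwa mem with -M (mark shorter split hits as secondary)."""
    return ["bwa", "mem", "-M", reference, fastq1, fastq2]

def sam_to_bam_command(sam_path, bam_path):
    """samtools view with -S (input is sam) and -b (output in bam)."""
    return ["samtools", "view", "-Sb", sam_path, "-o", bam_path]

def run_mapping(reference, fastq1, fastq2, sam_path, bam_path):
    """Run both steps; bwa writes the alignment to stdout, captured into sam_path."""
    with open(sam_path, "w") as sam:
        subprocess.run(bwa_mem_command(reference, fastq1, fastq2),
                       stdout=sam, check=True)
    subprocess.run(sam_to_bam_command(sam_path, bam_path), check=True)

# Only print the commands here; call run_mapping(...) to execute them.
map_cmd = bwa_mem_command("cc_ref.fa", "cc_gn2_R1_trimmed.fastq",
                          "cc_gn2_R2_trimmed.fastq")
bam_cmd = sam_to_bam_command("cc_gn2_R12.sam", "cc_gn2_R12.bam")
print(shlex.join(map_cmd), "> cc_gn2_R12.sam")
print(shlex.join(bam_cmd))
```

Passing the command as a list rather than a single string avoids shell quoting problems with file names.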

SAM/BAM format

SAM: sequence alignment map
BAM: binary alignment map

Standard formats for alignments.
BAM is the binary version of SAM: smaller, easier to store and to access, but not directly readable by eye.

aligned reads (.sam/.bam)

more cc_gn2_R12.bam

aligned reads (.sam/.bam)

BAM format

aligned reads (.sam/.bam)

more cc_gn2_R12.sam

aligned reads (.sam/.bam)

SAM format

aligned reads (.sam/.bam)

aligned reads (.sam/.bam)

SAM and BAM files consist of two parts:

1. Header: contains information about the sample

2. Alignment: contains location and qualities for all the reads

You can find a detailed explanation in the sam/bam format specification (http://samtools.sourceforge.net/SAMv1.pdf).

SAM format

aligned reads (.sam/.bam)

Header

Alignment

1. Header: contains information about the sample

Header contains:
@HD: header line
@SQ: Reference sequence dictionary, one per chromosome, with SN (reference sequence name) and LN (reference sequence length)
@RG: Read group
@PG: Program, ID (identifier)
@CO: comment

aligned reads (.sam/.bam)

[Figure: @SQ header lines, showing the Reference Sequence Name (SN) and the Reference Sequence Length (LN)]
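The `@SQ` lines can be read into a dictionary of reference names and lengths. A sketch assuming tab-separated header lines; the three header lines in the example are made up (including the length), and only the name `cc_ref` comes from the slides.

```python
def reference_lengths(header_lines):
    """Return {SN: LN} for every @SQ line in a SAM header."""
    lengths = {}
    for line in header_lines:
        if not line.startswith("@SQ"):
            continue
        fields = line.rstrip("\n").split("\t")[1:]
        tags = dict(field.split(":", 1) for field in fields)
        lengths[tags["SN"]] = int(tags["LN"])
    return lengths

# Hypothetical header lines for illustration only.
header = [
    "@HD\tVN:1.6\tSO:unsorted",
    "@SQ\tSN:cc_ref\tLN:2000000",
    "@PG\tID:bwa\tPN:bwa",
]
lengths = reference_lengths(header)
print(lengths)
```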

2. Alignment: contains location and qualities for all the reads

The alignment section contains one line per read; each line has 11 mandatory columns, optionally followed by extra tag fields:

aligned reads (.sam/.bam)

aligned reads (.sam/.bam)

M00725:28:000000000-AJ72K:1:1101:18215:1102 99 cc_ref 1754677 60 300M = 1754780 309

CTCCTTCACCAGATGGATTCTCGCCTTACAGTCCTGAGGAAACTAACCGCAGAGTCAACAAAGTAATGCGAGNNNNNNNGTACTTGCTACAGCNANNNGGTCCAAATNNNTNNNTTATTGGNNNAGATGTT

CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG#######::CFGGGGGGGGGG#:###:9BFGGGGG###:###::DFGDG###4+

2. Alignment: contains location and qualities for all the reads
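An alignment line can be split into its named columns. The eleven mandatory field names below are those of the SAM specification; the example line reuses the values shown on the slide, but the SEQ and QUAL strings are shortened placeholders and the trailing tag is made up, so this is a parsing sketch and not a valid record.

```python
SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")
INT_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")

def parse_alignment(line):
    """Split one tab-separated SAM alignment line into a dict of fields."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    record["TAGS"] = columns[11:]  # optional fields, if any
    return record

line = "\t".join([
    "M00725:28:000000000-AJ72K:1:1101:18215:1102", "99", "cc_ref",
    "1754677", "60", "300M", "=", "1754780", "309",
    "CTCC",    # placeholder, the real SEQ is much longer
    "CCCC",    # placeholder, the real QUAL is much longer
    "NM:i:0",  # hypothetical optional tag
])
rec = parse_alignment(line)
print(rec["RNAME"], rec["POS"], rec["MAPQ"], rec["CIGAR"])
```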

aligned reads (.sam/.bam)

2. Alignment: contains location and qualities for all the reads

First two columns of the example record:

QNAME    M00725:28:000000000-AJ72K:1:1101:18215:1102
FLAG     99

Page 68: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/copy_of_bioinfo... · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser

bitwise FLAG

QNAME: M00725:28:000000000-AJ72K:1:1101:18215:1102
FLAG: 99

It is an integer, but it represents the sum of different values, each one describing one property of the read.

There is a tool online which provides a quick “translation” (https://broadinstitute.github.io/picard/explain-flags.html). To find it: open Firefox > google.co.uk > type “bitwise flag broad”.
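The "sum of different values" can also be unpacked by hand. The sketch below is only an illustration in Python (the course itself uses the online tool); the bit values and their meanings are taken from the SAM specification, while the function name `decode_flag` is made up for this example.

```python
# Bit values and meanings as defined in the SAM specification.
SAM_FLAGS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails platform/vendor quality checks"),
    (0x400, "read is PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_flag(flag):
    """Return the (value, meaning) pairs whose bit is set in a SAM FLAG."""
    return [(value, name) for value, name in SAM_FLAGS if flag & value]


if __name__ == "__main__":
    # FLAG of the example read on the slide.
    for value, name in decode_flag(99):
        print(value, name)
```

For the example read, 99 = 1 + 2 + 32 + 64: the read is paired, mapped in a proper pair, its mate is on the reverse strand, and it is the first read of the pair.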


CIGAR string

It is a compact representation of the sequence alignment. It includes:
• M – match or mismatch
• I – insertion (bases present in the read but not in the reference)
• D – deletion (bases present in the reference but not in the read)

So, what is the CIGAR string of…?

read:  ACTCA-TGCAGT
ref:   ACTCAGTG--GT
cigar: 5M1D2M2I2M

read:  ACGTCATG---CAGT
ref:   ACG-CATGCGGCAGT
cigar: 3M1I4M3D4M
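To check an answer to the CIGAR exercise, the rule can be written as a few lines of Python. This is a minimal sketch, assuming the read and the reference are given as two gapped strings of equal length with "-" marking a gap; the function name `cigar_from_alignment` is hypothetical, and only the M, I and D operations from the slide are handled.

```python
from itertools import groupby


def cigar_from_alignment(read, ref, gap="-"):
    """Build a CIGAR string (M/I/D only) from a gapped read/reference pair."""
    if len(read) != len(ref):
        raise ValueError("gapped read and reference must have the same length")
    ops = []
    for read_base, ref_base in zip(read, ref):
        if read_base == gap and ref_base == gap:
            raise ValueError("a column cannot be a gap in both sequences")
        if read_base == gap:
            ops.append("D")  # base in the reference, missing from the read
        elif ref_base == gap:
            ops.append("I")  # base in the read, missing from the reference
        else:
            ops.append("M")  # aligned column: match or mismatch
    # Collapse runs of the same operation into <length><operation>.
    return "".join(f"{len(list(run))}{op}" for op, run in groupby(ops))


if __name__ == "__main__":
    print(cigar_from_alignment("ACTCA-TGCAGT", "ACTCAGTG--GT"))
    print(cigar_from_alignment("ACGTCATG---CAGT", "ACG-CATGCGGCAGT"))
```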


Example of an alignment line, with the name of each column:

QNAME   M00725:28:000000000-AJ72K:1:1101:18215:1102
FLAG    99
RNAME   cc_ref
POS     1754677
MAPQ    60
CIGAR   300M
MRNM    = (the mate maps to the same reference sequence as the read)
MPOS    1754780
ISIZE   309
SEQ     CTCCTTCACCAGATGGATTCTCGCCTTACAGTCCTGAGGAAACTAACCGCAGAGTCAACAAAGTAATGCGAGNNNNNNNGTACTTGCTACAGCNANNNGGTCCAAATNNNTNNNTTATTGGNNNAGATGTT
QUAL    CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG#######::CFGGGGGGGGGG#:###:9BFGGGGG###:###::DFGDG###4+

MRNM, MPOS and ISIZE are called RNEXT, PNEXT and TLEN in recent versions of the SAM specification.
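As an illustration of the column layout, the sketch below splits one alignment line into its 11 mandatory columns. It assumes the columns are tab-separated, as in a real SAM file; the names `SamRecord` and `parse_sam_line` are invented for this example, and SEQ and QUAL are replaced by "*" (which the SAM format allows when they are not stored) to keep the example short.

```python
from collections import namedtuple

# Column names as used on the slides (older SAM naming).
SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar mrnm mpos isize seq qual",
)


def parse_sam_line(line):
    """Split one SAM alignment line into its 11 mandatory columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 columns")
    qname, flag, rname, pos, mapq, cigar, mrnm, mpos, isize, seq, qual = fields[:11]
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq), cigar,
                     mrnm, int(mpos), int(isize), seq, qual)


# Example read from the slide; SEQ and QUAL shortened to "*" for this sketch.
EXAMPLE = "\t".join([
    "M00725:28:000000000-AJ72K:1:1101:18215:1102", "99", "cc_ref",
    "1754677", "60", "300M", "=", "1754780", "309", "*", "*",
])

if __name__ == "__main__":
    record = parse_sam_line(EXAMPLE)
    print(record.rname, record.pos, record.cigar)
```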

raw reads (.fastq)

1. Fastq quality control + trimming
   • Adapters? Low quality bases?

2. Alignment to a reference genome
   • close reference? time limited? → bwa
   • distant reference? → stampy

aligned reads (.sam/.bam)

3. BAM refinement
   • duplicate removal (picard)
   • local realignment (GATK)
   • base recalibration (GATK)

aligned reads (.sam/.bam)

4. BAM check and visualisation
   • duplicate metrics (picard)
   • flagstat (samtools)
   • coverage distribution (GATK)
   • visualisation: samtools, IGV/Tablet

5. Variant calling
   • SNPs/indels
   • single/multi-sample
   • samtools

raw variants (.vcf)

6. Variant filtering and validation
   • in silico vs in vitro validation
   • vcftools
   • variant score recalibration (big datasets, known SNPs/indels)

ready-to-use variants (.vcf)

3- Bam refinement – before starting…

The BAM is missing an @RG line… what is an @RG (read group) line?

You can find a detailed explanation in the SAM/BAM format specification (http://samtools.sourceforge.net/SAMv1.pdf).

Add the read group with Picard:

picard-tools AddOrReplaceReadGroups INPUT=cc_gn2_R12.bam OUTPUT=cc_gn2_R12_rg.bam RGLB=cc_gn2 RGPL=Illumina RGPU=01 RGSM=shark

Sort the BAM: this sorts the alignments by coordinate (and adds the .bam extension to the output name automatically!)

samtools sort cc_gn2_R12_rg.bam cc_gn2_R12_rg_sorted

This is the syntax of older samtools versions; from samtools 1.3 onwards the output file is given with -o (samtools sort -o cc_gn2_R12_rg_sorted.bam cc_gn2_R12_rg.bam).

Index the sorted BAM file:

samtools index cc_gn2_R12_rg_sorted.bam

BAM format

We can “read” the header of the BAM file… Use samtools to check the header of the BAM:

samtools view -H cc_gn2_R12_rg_sorted.bam

1. How many chromosomes are present in your header?
2. Which version of the BAM is it?
3. Is it sorted?

@HD VN:1.4 SO:coordinate
@SQ SN:cc_ref LN:1784076
@RG ID:1 PU:01 LB:cc_gn2 SM:shark PL:Illumina

1. 1 chromosome (“artificial”): there is a single @SQ line
2. Format version 1.4 (VN:1.4)
3. Yes, by coordinate (SO:coordinate)
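The three questions can also be answered programmatically. The sketch below is a hedged illustration, not part of the course material: it assumes the header text has been saved from `samtools view -H`, it uses a hypothetical helper `parse_sam_header`, and it splits fields on tabs (falling back to whitespace, which would break a @PG CL: field containing spaces).

```python
def parse_sam_header(text):
    """Group SAM header lines by record type (@HD, @SQ, @RG, ...)."""
    header = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("@"):
            continue
        # Real SAM headers are tab-separated; fall back to whitespace otherwise.
        fields = line.split("\t") if "\t" in line else line.split()
        record_type, tags = fields[0], fields[1:]
        entry = {}
        for tag in tags:
            key, _, value = tag.partition(":")
            entry[key] = value
        header.setdefault(record_type, []).append(entry)
    return header


# Header shown on the slide.
HEADER = """@HD\tVN:1.4\tSO:coordinate
@SQ\tSN:cc_ref\tLN:1784076
@RG\tID:1\tPU:01\tLB:cc_gn2\tSM:shark\tPL:Illumina"""

if __name__ == "__main__":
    header = parse_sam_header(HEADER)
    print("reference sequences:", len(header["@SQ"]))
    print("format version:", header["@HD"][0]["VN"])
    print("sort order:", header["@HD"][0]["SO"])
```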

@HD VN:1.3 SO:coordinate
@SQ SN:I LN:230218
@SQ SN:II LN:813184
@SQ SN:III LN:316620
@SQ SN:IV LN:1531933
@SQ SN:IX LN:439888
@SQ SN:Mito LN:85779
@SQ SN:V LN:576874
@SQ SN:VI LN:270161
@SQ SN:VII LN:1090940
@SQ SN:VIII LN:562643
@SQ SN:X LN:745751
@SQ SN:XI LN:666816
@SQ SN:XII LN:1078177
@SQ SN:XIII LN:924431
@SQ SN:XIV LN:784333
@SQ SN:XV LN:1091291
@SQ SN:XVI LN:948066
@PG ID:bwa PN:bwa VN:0.7.10-r789 CL:bwa mem -M

3- Bam refinement

Input: BAM

Three main steps:

1. Local realignment

2. Base quality recalibration

3. Duplicate removal

Output: BAM

3- Bam refinement – Local realignment

Problem: Short indels in the sample relative to the reference sequence can pose difficulties for alignment programs. Indels occurring towards the ends of the reads are often not aligned correctly, introducing an excess of SNPs.

Ref: ACTTTCGGATGCTGATCGGGATGCTTTAGCTGATGCTGATGGGCTTTCGATCGATTTAAAAGCT
     ACTTTCGGATGCTGATCGGGATGCTTTAGCTGA
         TCGGATGCTGATCGGGATGCTTTAGCTGATGCT
                CTGATCGGGATGCTTTAGCTGATGCTGATGG

Ref: ACTTTCGGATGCTGATCGGGATGCTTTAGCTGATGCTGATGGGCTTTCGATCGATTTAAAAGCT
     ACTTTCGGATGCTGATC____ATGCTTTAGCTGA
         TCGGATGCTGATC____T_GCTTTAGCTGATGCT
                CTGATC____ATGCTTTAGCTGATGCTGATGG
                    TC_____T_GCTTTAGCTGATGCTGATGGGCTT

Ref: ACTTTCGGATGCTGATCGGGATGCTTTAGCTGATGCTGATGGGCTTTCGATCGATTTAAAAGCT
     ACTTTCGGATGCTGATC____ATGCTTTAGCTGA
         TCGGATGCTGATC____ATGCTTTAGCTGATGCT
                CTGATC____ATGCTTTAGCTGATGCTGATGG
                    TC____ATGCTTTAGCTGATGCTGATGGGCTT
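The problem can be reproduced with the reference and the last read of the example. The sketch below is a toy illustration only (it is not how bwa or GATK work internally): it assumes the sample lacks the three bases GGG of the reference, takes a read that starts just two bases before that deletion, and counts mismatches for two placements of the read. The function names are invented for the example.

```python
# Reference from the slide; the sample is assumed to lack "GGG" at
# 0-based reference positions 17-19.
REF = "ACTTTCGGATGCTGATCGGGATGCTTTAGCTGATGCTGATGGGCTTTCGATCGATTTAAAAGCT"
# Read from the slide that starts two bases before the deletion.
READ = "TCATGCTTTAGCTGATGCTGATGGGCTT"


def mismatches_ungapped(ref, read, offset):
    """Count mismatches when the read is placed on the reference without gaps."""
    window = ref[offset:offset + len(read)]
    return sum(1 for ref_base, read_base in zip(window, read) if ref_base != read_base)


def mismatches_with_deletion(ref, read, offset, del_start, del_len):
    """Count mismatches when del_len reference bases from del_start are skipped."""
    ref_without_deleted_bases = ref[:del_start] + ref[del_start + del_len:]
    return mismatches_ungapped(ref_without_deleted_bases, read, offset)


if __name__ == "__main__":
    # Placement without a gap: the first two bases of the read mismatch,
    # which a variant caller could report as two false SNPs.
    print(mismatches_ungapped(REF, READ, 18))
    # Placement with the 3-base deletion: no mismatch at all.
    print(mismatches_with_deletion(REF, READ, 15, 17, 3))
```

An aligner that looks at this read alone may prefer the gap-free placement, because two mismatches at the very start of the read are cheaper than opening a gap. This is the kind of read that local realignment revisits.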

Local realignment uses the full alignment context to determine whether the indel exists.

Two-step process:

1. RealignerTargetCreator: it determines the small suspicious intervals which are likely in need of realignment

2. IndelRealigner: it runs the realignment on those intervals

Notes:
- having a list of known indels helps


3- Bam refinement – Base Quality Recalibration

Each base call has an associated base call quality (Phred scale). Rule of thumb: anything less than Q20 is not useful data.

The quality of a call depends on multiple factors (e.g. position in the read, sequence context).

In addition, the alignment can provide useful information. Mismatches to the reference are considered errors (unless they are described polymorphisms).

It requires a catalogue of variable sites!
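To make the Q20 rule of thumb concrete: on the Phred scale a quality Q corresponds to an error probability of 10^(-Q/10), so Q20 means 1 wrong call in 100. The sketch below shows the conversion in Python; the function names are made up for this example, and the character decoding assumes the common Phred+33 encoding of FASTQ/SAM quality strings.

```python
import math


def phred_to_error_prob(quality):
    """Error probability of a base call with the given Phred quality."""
    return 10 ** (-quality / 10)


def error_prob_to_phred(probability):
    """Phred quality corresponding to an error probability."""
    return -10 * math.log10(probability)


def quality_char_to_phred(char, offset=33):
    """Decode one character of a quality string (assumes Phred+33 encoding)."""
    return ord(char) - offset


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(f"Q{q}: error probability {phred_to_error_prob(q):.4f}")
    # Two characters that occur in the QUAL string of the example read.
    print(quality_char_to_phred("G"), quality_char_to_phred("#"))
```

Under that encoding, the "G" characters in the example read are Q38 and the "#" characters are Q2, which is why the stretches of "#" line up with the N bases.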

How does it work?

BAM files with variants:

Position  Alleles  BQ  Covariates
123       C/A      BQ  cov1 cov2 cov3…
145       G/A      BQ  cov1 cov2 cov3…
1298      G/T      BQ  cov1 cov2 cov3…
1345      C/T      BQ  cov1 cov2 cov3…
1789      C/G      BQ  cov1 cov2 cov3…

List of known variants:

123   C/A
145   G/A
1345  C/T

BQ: base quality
Covariates: position in the read, sequencing cycle, dinucleotide, …

The sites in the list of known variants (123, 145, 1345) are considered as real variants. For the other sites (1298, 1789) the mismatches are treated as errors, and the BQ is recalibrated using the different covariates.
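The idea can be sketched in a few lines of Python. This is a deliberately simplified toy, not the GATK BaseRecalibrator model: it uses a single covariate (the sequencing cycle), skips positions found in the list of known variants, counts the remaining mismatches as errors, and turns the observed error rate into an empirical Phred quality. The function name, the input layout and the +1/+2 smoothing are assumptions made for this example.

```python
import math
from collections import defaultdict


def empirical_quality_by_cycle(observations, known_sites):
    """Empirical Phred quality per sequencing cycle.

    observations: iterable of (position, cycle, is_mismatch) tuples, one per
    sequenced base. Bases at known variant positions are ignored, because a
    mismatch there may be a real variant rather than a sequencing error.
    """
    counts = defaultdict(lambda: [0, 0])  # cycle -> [bases seen, mismatches]
    for position, cycle, is_mismatch in observations:
        if position in known_sites:
            continue
        counts[cycle][0] += 1
        counts[cycle][1] += int(is_mismatch)
    # Smoothed error rate (mismatches + 1) / (bases + 2), as a Phred score.
    return {
        cycle: -10 * math.log10((mismatches + 1) / (bases + 2))
        for cycle, (bases, mismatches) in counts.items()
    }


if __name__ == "__main__":
    known = {123, 145, 1345}
    observations = (
        [(10, 1, False)] * 998       # first cycle: no mismatch
        + [(1298, 250, True)] * 9    # late cycle, unknown site: errors
        + [(1298, 250, False)] * 89
        + [(123, 250, True)] * 50    # known variant: not counted
    )
    for cycle, quality in sorted(empirical_quality_by_cycle(observations, known).items()):
        print(f"cycle {cycle}: empirical Q{quality:.1f}")
```

In this made-up data set the first cycle comes out around Q30 and cycle 250 around Q10, while the 50 mismatches at the known site 123 do not lower any quality.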

After recalibration, the sites that are not in the list of known variants (1298 and 1789) carry a recalibrated base quality (BQ_1):

Position  Alleles  BQ    Covariates
123       C/A      BQ    cov1 cov2 cov3…
145       G/A      BQ    cov1 cov2 cov3…
1298      G/T      BQ_1  cov1 cov2 cov3…
1345      C/T      BQ    cov1 cov2 cov3…
1789      C/G      BQ_1  cov1 cov2 cov3…

BQ: base quality
Covariates: position in the read (the end of the read has worse calls!)

[Figure: the sites 1298, 1789, 123, 145 and 1345 shown on a scale from high quality to low quality; 123, 145 and 1345 are considered as real variants, 1298 and 1789 get a BQ recalibrated using different covariates.]

Covariate: cycle number

First cycles: higher quality
Last cycles: lower quality

We will not run the Base Quality Recalibration because of time constraints and the need for a list of known variants.

A few more details:

It supports several platforms: Illumina, SOLiD, 454, Complete Genomics, Pacific Biosciences (stated on the website) and IonTorrent (stated in the GATK forum).

You can find how to do it at:

https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_bqsr_BaseRecalibrator.php


3- Bam refinement – Duplicate removal

PCR is used during library preparation. This can result in duplicate DNA fragments in the final library prep.

What we want: information from independent fragments.
What we do not want: copies of the same information coming from one fragment.

[Figure: reads aligned along the reference sequence.]

[Figure: pile-up of reads at one site (bases C, C, C, C and A), contrasting two outcomes: “Possible heterozygote, SNP call” and “Ref call”.]

Why is it important to do it?

• Duplicates can result in false SNP calls.

• Duplicates may fake a high coverage, thus giving high support to some variants.

• PCR-free protocols exist but require a large amount of DNA.

The number of duplicates varies according to the complexity of the library:

• whole genome experiments (<5%)

• custom enrichment ones (<30%)

It must be done after alignment and at the library level.

How does it work?

It identifies read-pairs where the outer ends map to the same position on the genome and flags (or, optionally, removes) all but one copy.
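The principle can be illustrated with a toy example. The sketch below is not Picard's implementation: it assumes each read pair is summarised by a name, a reference sequence, the two outer coordinates and a score, it ignores strand and library, and it keeps the highest-scoring pair of each group (the choice of which copy to keep is an assumption of this sketch).

```python
from collections import defaultdict


def find_duplicates(pairs):
    """Return the names of read pairs considered duplicates.

    pairs: iterable of (name, chrom, outer_start, outer_end, score).
    Pairs sharing chrom and both outer ends form one group; the pair with
    the highest score is kept and all the others are reported.
    """
    groups = defaultdict(list)
    for name, chrom, outer_start, outer_end, score in pairs:
        groups[(chrom, outer_start, outer_end)].append((score, name))
    duplicates = set()
    for members in groups.values():
        members.sort(reverse=True)  # best score first
        duplicates.update(name for _, name in members[1:])
    return duplicates


if __name__ == "__main__":
    pairs = [
        ("pair1", "cc_ref", 100, 400, 9000),
        ("pair2", "cc_ref", 100, 400, 9500),  # same outer ends as pair1
        ("pair3", "cc_ref", 100, 420, 8000),  # different end: independent
        ("pair4", "cc_ref", 100, 400, 7000),  # same outer ends as pair1
    ]
    print(sorted(find_duplicates(pairs)))
```

Here pair1, pair2 and pair4 share both outer ends, so only the best-scoring one (pair2) is kept as the representative of that fragment.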

Duplicate removal

picard-tools MarkDuplicates INPUT=cc_gn2_R12_rg_sorted.bam OUTPUT=cc_gn2_R12_rg_sorted_rmdup.bam METRICS_FILE=dupl_metrics.txt

• MarkDuplicates: module used
• INPUT: input file
• OUTPUT: output file
• METRICS_FILE: metrics file

What do we have to do after each step???

Sort and index the newly generated BAM file

Sort the BAM:

samtools sort cc_gn2_R12_rg_sorted_rmdup.bam cc_gn2_R12_rg_sorted_rmdup_sorted

Index the sorted BAM file:

samtools index cc_gn2_R12_rg_sorted_rmdup_sorted.bam


3- Bam refinement – BAM check

BAM check gives us answers to several questions:

• Duplicate removal: How many duplicates do I have? Is that reasonable for my experiment?

• Alignment stats: How many of my reads mapped back to the reference? How many of these are paired in mapping? How many pairs are mapped to different chromosomes?

• Coverage: How much average coverage do I have? Is the coverage evenly distributed along my region?

3- Bam refinement – BAM check: duplicate removal

picard-tools MarkDuplicates INPUT=cc_gn2_R12_rg_sorted.bam OUTPUT=cc_gn2_R12_rg_sorted_rmdup.bam METRICS_FILE=dupl_metrics.txt

Open the file dupl_metrics.txt (more/cat/gedit)

1) Check the % of duplicates;

## net.sf.picard.metrics.StringHeader
# net.sf.picard.sam.MarkDuplicates INPUT=[/media/pier/pierWD/backup/freecom/work/teaching/Unife_bioinfo_122016/material/white_shark_fastq_gn2/white_shark/DD10/final_files/cc_gn2_R12_rg_sorted.bam] OUTPUT=/media/pier/pierWD/backup/freecom/work/teaching/Unife_bioinfo_122016/material/white_shark_fastq_gn2/white_shark/DD10/final_files/cc_gn2_R12_rg_sorted_rmdup.bam METRICS_FILE=/media/pier/pierWD/backup/freecom/work/teaching/Unife_bioinfo_122016/material/white_shark_fastq_gn2/white_shark/DD10/final_files/dupl_metrics.txt PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates REMOVE_DUPLICATES=false ASSUME_SORTED=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
## net.sf.picard.metrics.StringHeader
# Started on: Thu Nov 10 18:37:26 GMT 2016

## METRICS CLASS net.sf.picard.sam.DuplicationMetrics
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
cc_gn2 12953 53727 379573 7026 8687 8687 0.202646

0.202646 → ~20.3%

We have 20.3% PCR duplicates in our experiment.


3- Bam refinement – BAM check: alignment stats

Run flagstat on the BAM file before and after BAM refinement. Can you see any difference?

BEFORE:

samtools flagstat cc_gn2_R12_rg_sorted.bam

AFTER:

samtools flagstat cc_gn2_R12_rg_sorted_rmdup.bam

BEFORE (no read is flagged as duplicate yet):

509594 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
130021 + 0 mapped (25.51%:-nan%)
509594 + 0 paired in sequencing
256086 + 0 read1
253508 + 0 read2
100420 + 0 properly paired (19.71%:-nan%)
115180 + 0 with itself and mate mapped
14841 + 0 singletons (2.91%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

AFTER (MarkDuplicates was run with REMOVE_DUPLICATES=false, so the duplicates are flagged but still counted in the total):

509594 + 0 in total (QC-passed reads + QC-failed reads)
24400 + 0 duplicates
130021 + 0 mapped (25.51%:-nan%)
509594 + 0 paired in sequencing
256086 + 0 read1
253508 + 0 read2
100420 + 0 properly paired (19.71%:-nan%)
115180 + 0 with itself and mate mapped
14841 + 0 singletons (2.91%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

The number of duplicates reported by flagstat matches the Picard metrics file:

24400 + 0 duplicates

LIBRARY  UNPAIRED_READ_DUPLICATES  READ_PAIR_DUPLICATES  PERCENT_DUPLICATION
cc_gn2   7026                      8687                  0.202646

(8687*2) + 7026 = 24400
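The same arithmetic also reproduces PERCENT_DUPLICATION. The sketch below assumes that this value is the fraction of examined reads that are duplicates, with every read pair counting as two reads; that assumption is consistent with the numbers in the metrics file above. The function name is made up for the example.

```python
def duplication_summary(unpaired_examined, pairs_examined,
                        unpaired_duplicates, pair_duplicates):
    """Return (duplicate reads, fraction of examined reads that are duplicates)."""
    # A read pair contributes two reads, an unpaired read contributes one.
    duplicate_reads = unpaired_duplicates + 2 * pair_duplicates
    examined_reads = unpaired_examined + 2 * pairs_examined
    return duplicate_reads, duplicate_reads / examined_reads


if __name__ == "__main__":
    # Values from dupl_metrics.txt for the library cc_gn2.
    reads, fraction = duplication_summary(
        unpaired_examined=12953,
        pairs_examined=53727,
        unpaired_duplicates=7026,
        pair_duplicates=8687,
    )
    print(reads, round(fraction, 6))
```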


3- Bam refinement – BAM check: coverage estimation

Depth of Coverage – The number of reads that span a given DNA sequence of interest. This is commonly expressed as “Yx”, where “Y” is the number of reads and “x” is the unit of the depth of coverage metric (e.g. 5x, 10x, 20x, 100x).

[Figure: reads piled up on a reference, with example depths of 7x, 9x and 11x.]
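As a small illustration of the definition, the sketch below counts, for every position of a region, how many reads span it. It is a toy example in Python with invented read coordinates (1-based, inclusive), not a replacement for the GATK tool used in the course; the function name is hypothetical.

```python
def depth_per_position(reads, region_start, region_end):
    """Depth of coverage at each position of a region.

    reads: iterable of (start, end) alignment coordinates, 1-based inclusive.
    Returns a list whose first element is the depth at region_start.
    """
    depth = [0] * (region_end - region_start + 1)
    for start, end in reads:
        # Only the part of the read that overlaps the region is counted.
        first = max(start, region_start)
        last = min(end, region_end)
        for position in range(first, last + 1):
            depth[position - region_start] += 1
    return depth


if __name__ == "__main__":
    reads = [(1, 5), (3, 8), (4, 6)]  # made-up alignments
    depth = depth_per_position(reads, 1, 8)
    print(depth)
    print("mean coverage:", sum(depth) / len(depth))
```

With these three made-up reads, positions 4 and 5 are covered by all of them (3x), while the ends of the region are only at 1x.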

GATK, coverage per base:

java -jar /opt/GATK-3.5-0/GenomeAnalysisTK.jar -I cc_gn2_R12_rg_sorted_rmdup_sorted.bam -R cc_ref.fa -T DepthOfCoverage -o cc_gn2_R12_rg_sorted_rmdup_coverage -L coverage.intervals

• -I: input file
• -R: reference sequence
• -T: module used
• -o: output
• -L: list of intervals

The list of intervals (coverage.intervals):

cc_ref:198200-213200

gedit cc_gn2_R12_rg_sorted_rmdup_coverage.sample_summary

OR

more/cat/head

sample_id  total  mean  granular_third_quartile  granular_median  granular_first_quartile  %_bases_above_15
shark      64770  5.70  11                       4                1                        10.2
Total      64770  5.70  N/A                      N/A              N/A

Per base coverage

gedit cc_gn2_R12_rg_sorted_rmdup_coverage

OR

more/cat/head

Locus          Total_Depth  Average_Depth_sample  Depth_for_shark
cc_ref:198243  0            0.00                  0
cc_ref:198244  0            0.00                  0
cc_ref:198245  0            0.00                  0
cc_ref:198246  0            0.00                  0
cc_ref:198247  0            0.00                  0
cc_ref:198248  0            0.00                  0
cc_ref:198249  0            0.00                  0
cc_ref:198250  0            0.00                  0
cc_ref:198251  0            0.00                  0
cc_ref:198252  0            0.00                  0
cc_ref:198253  1            1.00                  1

Open R:

• Open a terminal;
• Type R;

# Read the per-base coverage table written by GATK DepthOfCoverage
data <- read.table("cc_gn2_R12_rg_sorted_rmdup_coverage", sep="\t", header=TRUE)

# Check the column names
names(data)

# Remove the "cc_ref:" prefix from the Locus column and store the position as a number
old_col <- data$Locus
new_col <- as.numeric(gsub("cc_ref:", "", as.character(old_col)))
data["pos"] <- new_col

# Plot the coverage along the region
plot(data$pos, data$Depth_for_shark, type="l")

[Figure: coverage (y axis) plotted against position in bp (x axis).]

Why is it important to check the coverage?

• To check how your experiment performed (one of the ways to assess the quality of your experiment);

• To understand how confident you can be in your data;

• To decide on filtering after variant calling;

• To look for structural variation.

Page 123: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/copy_of_bioinfo... · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser

raw reads (.fastq) 2. alignment to a reference genomeclose reference?

time limited?

bwa

distant reference?

stampy

aligned reads (.sam/.bam)

3. bam refinementduplicate removal

local realignment

base recalibration

picardGATK GATK

aligned reads (.sam/.bam)

5. variant callingSNPs/indels

single/multi-sample

samtools

raw variants (.vcf)

ready-to-use variants (.vcf)

4. bam check visualizationduplicate metrics (picard)flagstat (samtools)coverage distribution (GATK)

6. variant filtering and validation

in silico vs in vitro validation

vcftools

variant score recalibration

big datasets

known SNPs/indels

1. Fastq quality control + trimming

Adapters ?Low quality bases?

samtoolsIGV/tablet


4- BAM check: visualisation

Different tools to visualise aligned NGS data:

• IGV: https://www.broadinstitute.org/igv/;

• Tablet: https://ics.hutton.ac.uk/tablet/;

• …

• Samtools tview


Samtools tview:

• Basic visualization tool;

• Terminal based;

• No need for extra RAM, Java or additional packages;

samtools tview -p cc_ref:1215261 cc_gn2_R12_rg_sorted_rmdup_sorted.bam cc_ref.fa

• Input: bam file (cc_gn2_R12_rg_sorted_rmdup_sorted.bam), followed by the reference sequence (cc_ref.fa);

• -p: position to display (Chr:bp).


[Figure: samtools tview screen, with the position, the reference sequence, the consensus and the reads labelled.]


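• . : base matching the reference, read on the positive strand;

• , : base matching the reference, read on the negative strand;

• Underlined: secondary or orphan read;

• Uppercase letters: base differing from the reference, read on the positive strand;

• Lowercase letters: base differing from the reference, read on the negative strand;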
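
The legend above can be summarised as a small lookup. The Python sketch below classifies the characters seen at one position across reads, assuming the usual tview/pileup convention in which dots and commas are matches to the reference and letters are mismatches; the function name and the example column are hypothetical.

```python
from collections import Counter


def classify_symbol(symbol):
    """Classify a single character from a read as drawn by samtools tview."""
    if symbol == ".":
        return ("match", "positive")
    if symbol == ",":
        return ("match", "negative")
    if len(symbol) == 1 and symbol.upper() in "ACGTN":
        strand = "positive" if symbol.isupper() else "negative"
        return ("mismatch", strand)
    return ("other", None)  # gaps, padding and anything else


# Hypothetical characters observed at one position, one per read.
column = ".,T.t,."
counts = Counter(classify_symbol(ch) for ch in column)
for (kind, strand), n in sorted(counts.items()):
    print(f"{kind:8s} {strand:8s} {n}")
```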

Press “ . ” to switch the dots on/off.

[Figure: tview screen with the dots switched off; reads on the positive strand and on the reverse strand are labelled.]


?: menu

q: exit

m: mapping quality

n: nucleotide

b: base quality

.: on/off dots

The menu also explains the underlining (secondary or orphan) and the colours used for quality.


Press “ ? ” to open the menu; press “ q ” to leave it, then “ m ” to colour the reads by mapping quality (MQ).

[Figure: tview screen coloured by mapping quality; labelled classes: 0 ≤ MQ ≤ 9 and MQ ≥ 30.]


Press “ b ” to colour the bases by base quality (BQ); press “ ? ” to see the colour key.

[Figure: tview screen coloured by base quality; labelled classes: BQ ≥ 30, 20 ≤ BQ ≤ 29, 10 ≤ BQ ≤ 19, 0 ≤ BQ ≤ 9.]


Press “ n ”

What happens?

The four nucleotides are highlighted in four different colours; mapping quality and base quality are no longer shown.


1. Is there any polymorphic site between positions 1,174,361 and 1,174,391?

2. Are there any polymorphic sites between positions 1,233,201 and 1,233,221?

If so, state the reference and alternative alleles, the average base quality (BQ), the average mapping quality of the reads (MQ) and how you would call that site (i.e. homozygous reference, heterozygous, homozygous alternative).


Answers:

1. Is there any polymorphic site between positions 1,174,361 and 1,174,391?

No polymorphic sites between positions 1,174,361 and 1,174,391.

2. Are there any polymorphic sites between positions 1,233,201 and 1,233,221?

• Yes, there is one polymorphic site;

• Reference allele: G

• Alternative allele: T

• Call: possible homozygous alternative (T/T)

• Average BQ: BQ ≥ 30 (white)

• Average MQ: MQ ≥ 30 (white)


1. Is there any polymorphic site between positions 4761 and 4771?

No polymorphic sites between positions 4761 and 4771.

2. Are there any polymorphic sites between positions 1781 and 1791?

If so, state the reference and alternative alleles, the average base quality (BQ), the average mapping quality of the reads (MQ) and how you would call that site (i.e. homozygous reference, heterozygous, homozygous alternative).

• Yes, there is one polymorphic site;

• Reference allele: G

• Alternative allele: T

• Call: possible homozygous alternative (T/T)

• Average BQ: BQ ≥ 30

• Average MQ: MQ ≥ 30


IGV


5- Variant calling

SNPs: Single Nucleotide Polymorphisms

Indels: insertions/deletions

SV: Structural Variation

Tools by variant type:

• SNPs: samtools; GATK (1. Unified Genotyper, 2. Haplotype caller);

• Indels: samtools; GATK (1. Unified Genotyper, 2. Haplotype caller); Dindel;

• SV: SVMerge – pipeline combining many different tools.

Variant calling: examine the bases aligned to each position and look for differences.

What we are looking for: polymorphic sites / monomorphic sites.

Factors to consider:

- Base call qualities of each supporting base;
- Proximity to indels and homopolymer runs;
- Mapping qualities of the reads supporting the SNP (increased read length or paired-end reads help MQ scores);
- Sequencing depth;
- Individual vs multi-sample calling.



About the variant file:

Raw variants (.vcf) and filtered variants (.vcf) use the same file format: variants (.vcf).


variants (.vcf)

• Standardised format for storing DNA polymorphism data:
  - SNPs, indels, SV;
  - Rich annotations;

• Can be indexed for fast retrieval of variants from a range of positions;

• Can store variant information over many samples;

• Records meta-data about the site:
  - dbSNP accession, filter status;

• Very flexible:
  - Tags can be introduced to describe new types of variants;
  - Different VCF files may contain different information/annotations.


[Figure: example VCF file, with the header lines and the data lines labelled.]

Header

Is there something specific to the header lines?


Lines starting with ##: arbitrary number of meta-information lines

##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">

##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">

##INFO=<ID=MQ,Number=1,Type=Integer,Description="Average mapping quality">
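
As an illustration of how regular these meta-information lines are, the Python sketch below pulls the ID, Number, Type and Description out of the three ##INFO lines shown above. It is a simplified sketch, not a full VCF parser: the regular expression and the function name `parse_info_meta` are assumptions, and real files may carry extra keys that this pattern does not handle.

```python
import re

# Pattern for the simple form of an ##INFO line shown on the slide.
META = re.compile(
    r'##INFO=<ID=(?P<id>[^,]+),Number=(?P<number>[^,]+),'
    r'Type=(?P<type>[^,]+),Description="(?P<description>[^"]*)">'
)

EXAMPLES = [
    '##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">',
    '##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">',
    '##INFO=<ID=MQ,Number=1,Type=Integer,Description="Average mapping quality">',
]


def parse_info_meta(line):
    """Return a dict describing one ##INFO line, or None if it does not match."""
    match = META.fullmatch(line.strip())
    return match.groupdict() if match else None


definitions = {}
for line in EXAMPLES:
    meta = parse_info_meta(line)
    if meta is not None:
        definitions[meta["id"]] = meta

for tag, meta in definitions.items():
    print(tag, meta["number"], meta["type"], meta["description"], sep=" | ")
```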


Line starting with #: column definition. Mandatory columns include:

CHROM    chromosome
POS      position of the start of the variant
ID       unique identifier of the variant (e.g. rs number for SNPs)
REF      reference allele
ALT      comma separated list of alternate non-reference alleles
QUAL     phred-scaled quality score
FILTER   site filtering information
INFO     user extensible annotation (e.g. samtools and GATK may differ in this)
FORMAT   how the information for each sample is presented (e.g. GT:DP:DV:SP:DPR)

The sample columns follow.



variants (.vcf)

Data

one line per site (all columns described above per line);

useful information per site and per sample;

Variant calling command: ?????

You have to build the command yourself, depending on what you want to have in the final vcf file.

Software and tools: samtools and bcftools

Input: cc_gn2_R12_rg_sorted_rmdup_sorted.bam

samtools mpileup .. .. .. .. cc_gn2_R12_rg_sorted_rmdup_sorted.bam | bcftools view .. .. - > cc_gn2_bq20_mq40.vcf

samtools mpileup:

• No indel calling;

• Parameter for adjusting mapQ (use 50 as value);

• Base quality minimum 20;

• Mapping quality minimum 40;

• Output per-sample strand bias P-value (SP);

• Output per-sample Depth of Coverage (DP);

• Generate BCF output;

• Specify reference sequence file.

bcftools view:

• Call genotypes at variant sites;

• Variant sites only.

How do you find the options corresponding to these requirements?

• Type “samtools mpileup” in the terminal;

• Type “bcftools view” in the terminal.



samtools mpileup:

1. No indel calling: -I
2. Parameter for adjusting mapQ (use 50 as value): -C 50
3. Base quality minimum 20: -Q 20
4. Mapping quality minimum 40: -q 40
5. Output per-sample strand bias P-value (SP): -S
6. Output per-sample Depth of Coverage (DP): -D
7. Generate BCF output: -g
8. Specify reference sequence file: -f ref_seq

bcftools view:

1. Call genotypes at variant sites: -g
2. Variant sites only: -v

Variant calling command:

samtools mpileup -I -C 50 -Q 20 -q 40 -S -D -g -f cc_ref.fa cc_gn2_R12_rg_sorted_rmdup_sorted.bam | bcftools view -gv - > cc_gn2_bq20_mq40.vcf

Note: these options follow the samtools/bcftools 0.1.x syntax; in bcftools 1.x, genotype calling is done with bcftools call and the options differ.
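
One way to keep track of which option answers which requirement is to assemble the command from an annotated list. The Python sketch below only builds and prints the command string, it does not run samtools; the helper name `build_command` is hypothetical, and the flags are the ones listed above for the legacy samtools/bcftools syntax.

```python
import shlex

# (flag, value, requirement) for samtools mpileup, as listed on the slide.
MPILEUP_OPTIONS = [
    ("-I", None, "no indel calling"),
    ("-C", "50", "parameter for adjusting mapQ"),
    ("-Q", "20", "minimum base quality"),
    ("-q", "40", "minimum mapping quality"),
    ("-S", None, "per-sample strand bias P-value (SP)"),
    ("-D", None, "per-sample depth of coverage (DP)"),
    ("-g", None, "generate BCF output"),
]


def build_command(reference, bam, out_vcf):
    """Return the variant calling pipeline as a single shell command string."""
    mpileup = ["samtools", "mpileup"]
    for flag, value, _requirement in MPILEUP_OPTIONS:
        mpileup.append(flag)
        if value is not None:
            mpileup.append(value)
    mpileup += ["-f", reference, bam]
    # -g: call genotypes at variant sites; -v: variant sites only.
    view = ["bcftools", "view", "-gv", "-"]
    return f"{shlex.join(mpileup)} | {shlex.join(view)} > {shlex.quote(out_vcf)}"


command = build_command(
    "cc_ref.fa",
    "cc_gn2_R12_rg_sorted_rmdup_sorted.bam",
    "cc_gn2_bq20_mq40.vcf",
)
print(command)
```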

gedit cc_gn2_bq20_mq40.vcf

OR

more / cat / head

#CHROM  POS      ID  REF  ALT  QUAL  FILTER  INFO                                                          FORMAT          shark
cc_ref  1610669  .   G    A    140   .       DP=10;VDB=6.833620e-02;AF1=1;AC1=2;DP4=0,0,6,1;MQ=47;FQ=-48  GT:PL:DP:SP:GQ  1/1:173,21,0:7:0:39
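
To make the column layout concrete, here is a small Python sketch that splits the record above into its site-level fields, its INFO tags and its per-sample values. It is a sketch for this single example, not a general VCF reader: the function name `parse_record` is hypothetical, and the lines are split on any whitespace because that is how the slide shows them (real VCF files are tab-separated).

```python
HEADER = "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT shark"
RECORD = (
    "cc_ref 1610669 . G A 140 . "
    "DP=10;VDB=6.833620e-02;AF1=1;AC1=2;DP4=0,0,6,1;MQ=47;FQ=-48 "
    "GT:PL:DP:SP:GQ 1/1:173,21,0:7:0:39"
)


def parse_record(header_line, record_line):
    """Split one VCF data line into (site fields, per-sample fields)."""
    columns = header_line.lstrip("#").split()
    values = record_line.split()

    # First eight columns describe the site.
    site = dict(zip(columns[:8], values[:8]))

    # INFO is a ';'-separated list of KEY=VALUE pairs (or bare flags).
    info = {}
    for item in values[7].split(";"):
        key, _, value = item.partition("=")
        info[key] = value if value else True
    site["INFO"] = info

    # FORMAT gives the order of the ':'-separated values in each sample column.
    keys = values[8].split(":")
    samples = {
        name: dict(zip(keys, sample.split(":")))
        for name, sample in zip(columns[9:], values[9:])
    }
    return site, samples


site, samples = parse_record(HEADER, RECORD)
print(site["CHROM"], site["POS"], site["REF"], ">", site["ALT"])
print("INFO DP4:", site["INFO"]["DP4"])
print("shark:", samples["shark"])
```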


INFO: user extensible annotation (e.g. samtools and GATK may differ in this)

INFO: DP=10;VDB=6.833620e-02;AF1=1;AC1=2;DP4=0,0,6,1;MQ=47;FQ=-48

DP: "Raw read depth"

DPR: "Number of high-quality bases observed for each allele"

DP4: "Number of high-quality ref-forward, ref-reverse, alt-forward and alt-reverse bases"

MQ: "Average mapping quality"
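
As a worked example for the DP4 tag in this record (DP4=0,0,6,1), the Python sketch below computes the fraction of high-quality bases that support the alternative allele. The function name `alt_fraction` is hypothetical, and the calculation is only an illustration of how the four counts can be read, not something the caller itself reports.

```python
def alt_fraction(dp4):
    """Fraction of high-quality bases supporting the alternative allele.

    dp4 is the DP4 value as text: ref-forward, ref-reverse, alt-forward, alt-reverse.
    Returns None when there are no high-quality bases at all.
    """
    ref_fwd, ref_rev, alt_fwd, alt_rev = (int(x) for x in dp4.split(","))
    total = ref_fwd + ref_rev + alt_fwd + alt_rev
    if total == 0:
        return None
    return (alt_fwd + alt_rev) / total


fraction = alt_fraction("0,0,6,1")
print(f"alternative allele fraction = {fraction:.2f}")
```

For this site all 7 high-quality bases (6 forward, 1 reverse) carry the alternative allele, while the raw read depth is DP=10.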

The INFO field describes the SITE, NOT the individual samples in multi-sample calling!


FORMAT: how the information for each sample is presented (e.g. GT:PL:DP:SP:GQ)

In multi-sample calling, each sample has its own column, laid out as described by FORMAT; this information is sample-specific.


FORMAT: GT:PL:DP:SP:GQ

Sample: 1/1:173,21,0:7:0:39

GT: genotype

• 0: reference allele; 1: alternative allele;

• 0/0: homozygote reference;

• 0/1: heterozygote;

• 1/1: homozygote alternative.

Here GT = 1/1 → homozygote alternative.
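
The GT coding can be written as a short Python function. This is a sketch for diploid genotypes: the function name `describe_genotype` is hypothetical, and the handling of "|" (phased genotypes) and "." (missing allele) follows the VCF specification rather than this slide.

```python
def describe_genotype(gt):
    """Translate a diploid GT string such as '0/1' into words."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "missing"
    if len(set(alleles)) > 1:
        return "heterozygote"
    if alleles[0] == "0":
        return "homozygote reference"
    return "homozygote alternative"


for gt in ("0/0", "0/1", "1/1"):
    print(gt, "->", describe_genotype(gt))
```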

PL: list of Phred-scaled genotype likelihoods, in the order 0/0, 0/1, 1/1.

Approximate likelihood = 10^(-PL/10)

Here PL = 173,21,0:

P(0/0) = 10^(-173/10) = 10^(-17.3) ≈ 5.011872 × 10^(-18)

P(0/1) = 10^(-21/10) = 10^(-2.1) ≈ 0.0079433

P(1/1) = 10^(-0/10) = 10^(0) = 1
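
The conversion from PL to approximate likelihoods is easy to check numerically. A minimal Python sketch, assuming the PL value is given as the comma-separated text found in the sample column (the function name `pl_to_likelihoods` is hypothetical):

```python
def pl_to_likelihoods(pl_field):
    """Convert a PL string such as '173,21,0' into approximate likelihoods."""
    return [10 ** (-int(pl) / 10) for pl in pl_field.split(",")]


likelihoods = pl_to_likelihoods("173,21,0")
for genotype, value in zip(("0/0", "0/1", "1/1"), likelihoods):
    print(f"P({genotype}) = {value:.6g}")
```

The genotype with PL = 0 is the most likely one, which is why the sample is called 1/1.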

DP: number of high-quality bases.

Here DP = 7.


SP: Phred-scaled strand bias P-value.

Strand bias p value = 10^(-SP/10)

0.05 = 10^(-SP/10) → SP = -10*log10(0.05) → SP = 13.01

• ↑ SP → ↓ p value → there is a statistically significant difference in the number of alleles coming from the two strands;

• ↓ SP → ↑ p value → there is NOT a statistically significant difference in the number of alleles coming from the two strands.

Would you keep sites with SP ≥ 13 or sites with SP ≤ 13?


5- Variant calling

FORMAT: how the information for each sample is presented (i.e. GT:PL:DP:SP:GQ)

SP: Phred-scaled strand bias P-value;

Strand bias p-value = 10^(-SP/10)

0.05 = 10^(-SP/10) → SP = -10 × log10(0.05) → SP = 13.01

SP = 0 → p-value = 10^(0) = 1

If we assume a p-value threshold of 0.05, we keep all sites with SP ≤ 13 (no statistically significant strand bias).

GT:PL:DP:SP:GQ

1/1:173,21,0:7:0:39
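
The same reasoning can be written as a small Python sketch. The function names and the default `alpha=0.05` are illustrative assumptions, not part of any tool; the conversion itself follows the Phred formula on the slide.

```python
import math


def sp_to_pvalue(sp):
    """Phred-scaled strand-bias score -> p-value."""
    return 10 ** (-sp / 10)


def pvalue_to_sp(p):
    """p-value -> Phred-scaled strand-bias score."""
    return -10 * math.log10(p)


def passes_strand_bias(sp, alpha=0.05):
    """Keep a site when its strand-bias p-value is at least alpha."""
    return sp_to_pvalue(sp) >= alpha


threshold_sp = pvalue_to_sp(0.05)
print(f"SP threshold for p = 0.05: {threshold_sp:.2f}")
print("SP = 0 kept?", passes_strand_bias(0))
print("SP = 20 kept?", passes_strand_bias(20))
```

Because SP is reported as an integer, keeping p-value ≥ 0.05 is the same as keeping SP ≤ 13.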

5- Variant calling

FORMAT: how the information for each sample is presented (i.e. GT:PL:DP:SP:GQ)

GQ: conditional genotype quality, encoded as a phred quality;

GQ = −10 × log10 P(genotype call is wrong)

P(genotype call is wrong) = 10^(-GQ/10)

P(genotype call is wrong) = 10^(-39/10) = 10^(-3.9) = 0.00012589

GT:PL:DP:SP:GQ

1/1:173,21,0:7:0:39
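
As a quick check of the GQ arithmetic, here is a minimal Python sketch. The two function names are hypothetical helpers that simply apply the formula in both directions.

```python
import math


def gq_to_error_prob(gq):
    """Phred-scaled genotype quality -> probability that the call is wrong."""
    return 10 ** (-gq / 10)


def error_prob_to_gq(p_wrong):
    """Probability that the call is wrong -> Phred-scaled genotype quality."""
    return -10 * math.log10(p_wrong)


print(f"GQ = 39 -> P(wrong) = {gq_to_error_prob(39):.8f}")
print(f"P(wrong) = 0.001 -> GQ = {error_prob_to_gq(0.001):.0f}")
```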

5- Variant calling

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT shark

cc_ref 1610669 . G A 140 . DP=10;VDB=6.833620e-02;AF1=1;AC1=2;DP4=0,0,6,1;MQ=47;FQ=-48 GT:PL:DP:SP:GQ 1/1:173,21,0:7:0:39

FORMAT: how the information for each sample is presented (i.e. GT:PL:DP:SP:GQ)

GT: Genotype;

PL: List of Phred-scaled genotype likelihoods;

DP: Number of high-quality bases;

SP: Phred-scaled strand bias P-value;

GQ: Genotype quality.

GT:PL:DP:SP:GQ
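
To make the link between the FORMAT column and the sample column explicit, the following Python sketch pairs each key with its value. `parse_sample` is a hypothetical helper, and the record is the one from the slide with a shortened INFO column and tab separators added.

```python
def parse_sample(format_field, sample_field):
    """Pair each FORMAT key with the matching value of one sample column."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    return dict(zip(keys, values))


# Record from the slide (INFO shortened), written as tab-separated columns
record = (
    "cc_ref\t1610669\t.\tG\tA\t140\t.\t"
    "DP=10;AF1=1;AC1=2;DP4=0,0,6,1;MQ=47;FQ=-48\t"
    "GT:PL:DP:SP:GQ\t1/1:173,21,0:7:0:39"
)
columns = record.split("\t")
sample = parse_sample(columns[8], columns[9])   # FORMAT is column 9, first sample is column 10

for key, value in sample.items():
    print(f"{key} = {value}")
```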

5- Variant calling

Useful references and resources:

https://samtools.github.io/hts-specs/VCFv4.2.pdf

https://gist.github.com/inutano/f0a2f5c219ab4920c5b5

https://faculty.washington.edu/browning/beagle/intro-to-vcf.html

http://gatkforums.broadinstitute.org/gatk/discussion/1268/what-is-a-vcf-and-how-should-i-interpret-it

5- Variant calling

We have our raw variants, but now we need to refine our dataset. How?

• Variant score recalibration;

• Variant filtering;

• Validation.

Workflow overview

raw reads (.fastq)

1. Fastq quality control + trimming: adapters? low quality bases?

2. Alignment to a reference genome: close reference? time limited? → bwa; distant reference? → stampy

→ aligned reads (.sam/.bam)

3. BAM refinement: duplicate removal (picard), local realignment (GATK), base recalibration (GATK)

→ aligned reads (.sam/.bam)

4. BAM check and visualization: duplicate metrics (picard), flagstat (samtools), coverage distribution (GATK); visualization with samtools, IGV/tablet

5. Variant calling: SNPs/indels, single/multi-sample (samtools)

→ raw variants (.vcf)

6. Variant filtering and validation: vcftools; variant score recalibration (big datasets, known SNPs/indels); in silico vs in vitro validation

→ ready-to-use variants (.vcf)

5- Variant calling: Variant quality score recalibration

• Available in GATK.

• It aims at producing well-calibrated probabilities for the variants called.

• It develops a continuous, covarying estimate of the relationship between SNP call annotations (e.g. MQ, QD…) and the probability that a SNP is a true genetic variant rather than a sequencing or data processing artifact.

• It needs a set of “true sites” to be trained on (e.g. HapMap Phase 3 data, OMNI 2.5 M, etc.).

• We are not going to use it, because it needs big datasets (either many samples, or whole genome data) to work properly.

• You can find more information at http://gatkforums.broadinstitute.org/discussion/39/variant-quality-score-recalibration-vqsr

5- Variant filtering and validation

WHY filtering?

• Each caller tends to call as many sites as it can, but it provides useful information on several parameters;

• Many calls are “guesses”;

• Artifacts;

• Presence of sequencing errors;

• To create a high quality dataset.

WHY validation?

• An error-free dataset is unrealistic;

• To estimate the % of error within our high quality dataset.

5- Variant filtering and validation: FILTERING

Filtering is all about finding the right balance:

Stringent filters ⇄ Enough information (sites)

• Less stringent filters → lots of sites, but with many possible errors;

• Very stringent filters → not many sites left.

5- Variant filtering and validation: FILTERING

General approach:

• Try different thresholds for each parameter and draw a distribution;

• If possible, look for sudden changes in the distribution and set a threshold;

• No fixed rule on which thresholds you should use;

• It is data and project specific;

• It is a crucial step to create a high quality dataset.
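
One way to try different thresholds is to count how many sites survive each of them and look for sudden drops. The Python sketch below does this for mapping quality; the function name `sites_retained` is hypothetical and the MQ values are invented for illustration only, not taken from the shark dataset.

```python
def sites_retained(values, thresholds):
    """For each threshold, count how many sites have a value >= threshold."""
    return {t: sum(v >= t for v in values) for t in thresholds}


# Hypothetical per-site mapping qualities (illustrative numbers only)
mq_values = [12, 25, 31, 38, 40, 44, 46, 47, 48, 48, 50, 60]

retained = sites_retained(mq_values, thresholds=[20, 30, 40, 50])
for threshold, count in retained.items():
    print(f"MQ >= {threshold}: {count} of {len(mq_values)} sites kept")
```

In this toy example most sites survive up to MQ 40 and very few survive MQ 50, which would argue for a threshold around 40 rather than 50.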

5- Variant filtering and validation: FILTERING

Common cautions:

- Base quality: BQ ≥ 20;

- Depth (min and max): very dependent on your average depth;

- Mapping quality: MQ ≥ 40/50 (minimum MQ 30);

- Strand bias: p-value > 0.05;

- SNP density: dependent on the genome [e.g. no more than 1 SNP/4bp];

- Indel proximity: not closer than 10bp to an indel;

- Missing data?

- Some filters may be applied during the variant calling while others are applied afterwards.
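
As an illustration of one of these filters, the Python sketch below removes SNPs that lie closer than 10bp to an indel. It is a simplification under stated assumptions: `far_from_indels` is a hypothetical helper, each indel is treated as a single position (its length is ignored), all positions are on the same chromosome, and the coordinates are invented.

```python
def far_from_indels(snp_positions, indel_positions, min_distance=10):
    """Return the SNP positions that are at least min_distance bp from every indel."""
    return [
        snp
        for snp in snp_positions
        if all(abs(snp - indel) >= min_distance for indel in indel_positions)
    ]


# Hypothetical coordinates on one chromosome
snps = [100, 205, 209, 300]
indels = [200, 310]

kept = far_from_indels(snps, indels)
print("SNPs kept:", kept)
```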

5- Variant filtering and validation: FILTERING

[Figure: MQ distribution]

5- Variant filtering and validation: FILTERING

[Figure: coverage distribution per run]

5- Variant filtering and validation: FILTERING

[Figure: coverage distribution per population]

5- Variant filtering and validation: FILTERING

VCFtools (http://vcftools.sourceforge.net/)

--min-meanDP <float> --max-meanDP <float>

Includes only sites with mean depth values (over all included individuals) greater than or equal to the "--min-meanDP" value and less than or equal to the "--max-meanDP" value.

One of these options may be used without the other. These options require that the "DP" FORMAT tag is included for each site.

GT:PL:DP:SP:GQ
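
A minimal Python sketch of what a mean-depth site filter does, assuming the sample columns follow the GT:PL:DP:SP:GQ layout shown on the slides. The function names are hypothetical and this is not VCFtools' own implementation (for instance, missing values are not handled).

```python
def mean_depth(sample_fields, format_field="GT:PL:DP:SP:GQ"):
    """Mean of the per-sample DP values at one site."""
    dp_index = format_field.split(":").index("DP")
    depths = [int(sample.split(":")[dp_index]) for sample in sample_fields]
    return sum(depths) / len(depths)


def passes_mean_depth(sample_fields, min_mean_dp=None, max_mean_dp=None):
    """Keep a site whose mean depth lies within the optional bounds (inclusive)."""
    mean_dp = mean_depth(sample_fields)
    if min_mean_dp is not None and mean_dp < min_mean_dp:
        return False
    if max_mean_dp is not None and mean_dp > max_mean_dp:
        return False
    return True


low_depth_site = ["1/1:173,21,0:7:0:39"]
two_sample_site = ["0/1:212,0,217:21:0:99", "1/1:255,69,0:23:0:99"]

print(mean_depth(low_depth_site), passes_mean_depth(low_depth_site, min_mean_dp=20))
print(mean_depth(two_sample_site), passes_mean_depth(two_sample_site, min_mean_dp=20))
```

Either bound can be left out, mirroring the fact that one option may be used without the other.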

5- Variant filtering and validation: FILTERING

VCFtools (http://vcftools.sourceforge.net/)

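vcftools --vcf cc_gn2_bq20_mq40.vcf --min-meanDP 20 --recode --recode-INFO-all --out cc_gn2_bq20_mq40_dp20.vcf

• --vcf cc_gn2_bq20_mq40.vcf: input file (vcf);

• --min-meanDP 20: DP threshold;

• --recode: create a new vcf file (“recode” is added to the file name);

• --recode-INFO-all: keep all the INFO;

• --out cc_gn2_bq20_mq40_dp20.vcf: output file.

INFO: DP=10;VDB=6.833620e-02;AF1=1;AC1=2;DP4=0,0,6,1;MQ=47;FQ=-48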

5- Variant filtering and validation: FILTERING

gedit cc_gn2_bq20_mq40_dp20.vcf.recode.vcf

OR

cat cc_gn2_bq20_mq40_dp20.vcf.recode.vcf (or: more cc_gn2_bq20_mq40_dp20.vcf.recode.vcf)

5- Variant filtering and validation: FILTERING

cc_ref 39327 . T G 182 . DP=26;VDB=7.943657e-02;RPB=2.288581e+00;AF1=0.5;AC1=1;DP4=5,6,5,5;MQ=48;FQ=184;PV4=1,1,1,1 GT:PL:DP:SP:GQ 0/1:212,0,217:21:0:99

cc_ref 381078 . N C 222 . DP=24;VDB=1.903497e-01;AF1=1;AC1=2;DP4=0,0,13,10;MQ=48;FQ=-96 GT:PL:DP:SP:GQ 1/1:255,69,0:23:0:99

cc_ref 1233213 . G T 222 . DP=24;VDB=1.958289e-01;AF1=1;AC1=2;DP4=0,0,11,12;MQ=46;FQ=-96 GT:PL:DP:SP:GQ 1/1:255,69,0:23:0:99

5- Variant filtering and validation: FILTERING

cc_ref 39327 . T G 182 . GT:PL:DP:SP:GQ 0/1:212,0,217:21:0:99

cc_ref 381078 . N C 222 . GT:PL:DP:SP:GQ 1/1:255,69,0:23:0:99

cc_ref 1233213 . G T 222 . GT:PL:DP:SP:GQ 1/1:255,69,0:23:0:99

Does 1,233,213 G/T look familiar? You discovered it in tview!

5- Variant filtering and validation: FILTERING

Visualisation of the filtered variant sites in IGV

IGV needs:

• A genome: reference sequence;

• BAM files: sorted and indexed.

5- Variant filtering and validation: FILTERING

Visualisation of the filtered variant sites in IGV

./igv.sh

Genomes > Load Genome > white_shark.genome

Files > Load > cc_gn2_R12_rg_sorted_rmdup_sorted.bam

Filtered variant sites shown in IGV (screenshots not reproduced):

• cc_ref:39327

• cc_ref:381078

• cc_ref:1233213

5- Variant filtering and validation: VALIDATION

• Why?

  • An error-free dataset is unrealistic;

  • To estimate the % of error within our high quality dataset;

• How to do it:

  • Sanger sequencing vs NGS (our) dataset;

  • Public dataset vs NGS (our) dataset;

  • SNP chip vs NGS (our) dataset;

“Gold” dataset = high quality dataset; our dataset = test dataset.

SAME SAMPLE(S)!!!

5- Variant filtering and validation: VALIDATION

Compare our sample to a “gold” dataset (same sample!!!):

• All sites, site by site;

• TRUE: concordant site;

• FALSE: discordant site;

• POSITIVE: polymorphic site (the sample carries a variant at that position compared to the reference sequence);

• NEGATIVE: reference site (the sample matches the reference sequence at that position);

5- Variant filtering and validation: VALIDATION

Our sequenced sample vs “Gold” dataset (same sample). SAME SAMPLE(S)!!!

• TP: true positive;

• FP: false positive;

• TN: true negative;

• FN: false negative.

True positive, false positive, true negative and false negative: each expressed as a % (0-100).

5- Variant filtering and validation: VALIDATION

• True positive + true negative = correct calls;

• False positive + false negative = false calls / sequencing error.
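
The site-by-site comparison can be sketched in Python as follows. This is an illustration under assumptions: genotypes are compared as plain strings, a call is POSITIVE whenever our genotype differs from 0/0, and the function names and the five example genotypes are invented. A real comparison would also need to match sites by chromosome and position.

```python
from collections import Counter

REFERENCE = "0/0"


def classify_site(our_gt, gold_gt):
    """Label one site: TRUE/FALSE = concordant/discordant, POSITIVE/NEGATIVE = our call."""
    concordant = our_gt == gold_gt
    polymorphic = our_gt != REFERENCE
    if polymorphic:
        return "TP" if concordant else "FP"
    return "TN" if concordant else "FN"


def summarise(our_calls, gold_calls):
    """Count each category and return (counts, % correct calls, % false calls)."""
    counts = Counter(classify_site(o, g) for o, g in zip(our_calls, gold_calls))
    total = sum(counts.values())
    correct = 100 * (counts["TP"] + counts["TN"]) / total
    return counts, correct, 100 - correct


# Hypothetical genotypes for the same sample at five sites
ours = ["0/0", "0/1", "1/1", "0/0", "0/1"]
gold = ["0/0", "0/1", "1/1", "0/1", "0/0"]

counts, correct_pct, false_pct = summarise(ours, gold)
print(dict(counts))
print(f"Correct calls: {correct_pct:.1f}%  False calls: {false_pct:.1f}%")
```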

5- Variant filtering and validation: VALIDATION

Human X chr                                             True calls (%)          False calls (%)
Ion Torrent vs Complete Genomics (427,000 genotypes)    99.9995                 0.0005
Ion Torrent vs Illumina (3.65*10^6 genotypes)           99.99997                0.00003

Shark autosomal data                                    Concordant calls (%)    Sequencing error (%)
2 samples (independent runs), ~741,000 genotypes        99.986                  0.014
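
To get a feeling for what these percentages mean, they can be turned back into approximate numbers of discordant genotypes. The Python sketch below uses the figures from the table; since the percentages are rounded and the shark genotype count is approximate, the results are only rough estimates, and `expected_false_calls` is a hypothetical helper.

```python
def expected_false_calls(n_genotypes, false_percent):
    """Approximate number of discordant genotypes implied by an error percentage."""
    return n_genotypes * false_percent / 100


comparisons = [
    ("Ion Torrent vs Complete Genomics", 427_000, 0.0005),
    ("Ion Torrent vs Illumina", 3.65e6, 0.00003),
    ("Shark, 2 samples (independent runs)", 741_000, 0.014),
]

for label, n_genotypes, false_percent in comparisons:
    n_false = expected_false_calls(n_genotypes, false_percent)
    print(f"{label}: about {n_false:.1f} discordant genotypes")
```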

5- Variant filtering and validation: FILTERING and VALIDATION

After filtering and validation we have a high quality vcf file ready for downstream analyses (population genetics, medical genetics, association studies, SNP discovery, etc.).