Bioinformatics and genome analysis (Bioinformatica e analisi dei genomi), academic year 2015/2016. Pierpaolo Maisano Delser, mail: [email protected]

Bioinformatica e analisi dei genomidocente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 1. 20. · base recalibration GATK GATK picard aligned reads (.sam/.bam) 5. variant


Background

• Bachelor's degree (Laurea Triennale): Biological Sciences, Università degli Studi di Ferrara, Dr. Silvia Fuselli;
• Master's degree (Laurea Specialistica): Biomolecular and Cellular Sciences, Università degli Studi di Ferrara, Dr. Silvia Fuselli;
• PhD in Genetics, University of Leicester, Prof. Mark A. Jobling;
• Post‐doctoral fellow, EPHE‐MNHN, Paris, Dr. Stefano Mona.

Cusco, March 2009


Muséum national d'Histoire naturelle ‐ Paris

Practical information

• Theory + practice;
• Software and tools;
• Files;
• Slides on the website;
• New topics / topics already covered.

Programme

• Next‐generation sequencing (NGS): how, when, why?
• An example of NGS data management and analysis:
  • type of data;
  • files and formats;
  • programs;
  • interpretation of the results;
  • error estimation;
  • when to stop?
• Applications and/or projects on different organisms.

NGS: how, when, why?

[Figure: overview of NGS workflows. Box labels: DNA‐seq, RNA‐seq; whole genome, whole transcriptome, capture: exome/custom/cancer, amplicon sequencing (fixed/custom); sequencing; unaligned reads QC; reads trimming; mapping to a reference genome; de novo assembly; mapping refinement; mapping QC; assembly QC; filtering; validation.]


NGS: how, when, why?

Question: when?

Answer: when it makes sense!

• A 400 bp amplicon in 100 individuals? → Sanger sequencing
• 50 amplicons in 100 individuals? → NGS + target capture
• Gene conversion, repeated elements, recombination breakpoints? → NGS + Sanger sequencing

Question: why?

Answer: your idea for a project!

An example of NGS data management and analysis

• Nanopore MinION/GridION
• Pacific Biosciences (PacBio)
• Ion Torrent PGM/Proton
• Roche 454
• Illumina MiSeq/HiSeq

An example of NGS data management and analysis

[Figure: overview of NGS workflows, from the DNA‐seq and RNA‐seq library types (whole genome, whole transcriptome, capture, amplicon sequencing) through sequencing, unaligned reads QC, reads trimming, mapping to a reference genome or de novo assembly, mapping refinement and mapping/assembly QC, to filtering and validation.]


An example of NGS data management and analysis

• project
• project: application (whole genomes? exomes? target capture? amplicon sequencing?)
• project: application: aim (SNPs, indels, repeated elements, CNVs…)
• project: application: aim: coverage (SNPs, indels, repeated elements, CNVs…)


Project:

• Saccharomyces cerevisiae;

• Genome: 16 chromosomes, ~12.5Mb, ~6200 genes;

• Whole genome sequencing;

• Illumina platform; 

• Paired‐end reads, 1 library, 2 lanes.
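The link between project, aim and coverage can be made concrete with the standard relation C = N × L / G (N reads of length L over a genome of size G). The sketch below applies it to the ~12.5 Mb yeast genome of this project; the 100 bp read length and the 30x target are illustrative assumptions, not values taken from the course data.

```python
import math

GENOME_SIZE_BP = 12_500_000  # ~12.5 Mb, S. cerevisiae (from the project slide)


def expected_coverage(n_reads: int, read_length: int,
                      genome_size: int = GENOME_SIZE_BP) -> float:
    """Average depth of coverage: C = N * L / G."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int,
                 genome_size: int = GENOME_SIZE_BP) -> int:
    """Number of reads required to reach a target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


# Assumed values for illustration only: 100 bp reads, 30x target coverage.
n_reads = reads_needed(30, 100)
print(n_reads, expected_coverage(n_reads, 100))
```

For paired‐end data each fragment contributes two reads, so the number of fragments to sequence is half the number of reads computed here.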


Single‐end (SE) or paired‐end (PE) sequencing.

    fragment                  ========================================
    fragment + adaptors    ~~~========================================~~~
    SE read                   --------->
    PE reads                R1--------->                    <---------R2
    unknown gap                         ....................

Workflow of the analysis (files and tools at each step), starting from raw reads (.fastq):

1. Fastq quality control + trimming: adapters? low‐quality bases?
2. Alignment to a reference genome:
   • close reference? time limited? → bwa
   • distant reference? → stampy
   Output: aligned reads (.sam/.bam)
3. BAM refinement:
   • duplicate removal (picard)
   • local realignment (GATK)
   • base recalibration (GATK)
   Output: aligned reads (.sam/.bam)
4. BAM check and visualization: duplicate metrics (picard), flagstat (samtools), coverage distribution (GATK)
5. Variant calling (samtools): SNPs/indels, single/multi‐sample
   Output: raw variants (.vcf)
6. Variant filtering and validation:
   • in silico vs in vitro validation
   • vcftools
   • variant score recalibration: big datasets, known SNPs/indels
   Output: ready‐to‐use variants (.vcf)

File extension    Content
.fa/.fasta        sequences
.fastq            read data
.sam (.sai)       mapped reads
.bam (.bai)       mapped reads (binary)
.vcf              variant information

raw reads (.fastq)

@IL29_4505:7:24:8932:6562#2/1
TAACGGTGGGTGAGTGGTAGTAAGTAGAGGGATGGATGGTGGTTCGGAGTGGTATGGTTGAATGGGACAGGGTAACGAGTGGAGAGTAGGGTAATGGAGGGTAAGTTC
+
CDDCDDABBBABABABB@BCACBDABCBBAB@BBCABBBABB?CBCCABABBABBBBABA?ACBAAAAA?BB;BCAABA7AA?B?A??AAA>?A:AA?AA?%?AA@=9

raw reads (.fastq)

gedit s-6-1.fastq

OR

Terminal: more s-6-1.fastq OR head s-6-1.fastq


raw reads (.fastq)

Header line: @IL29_4505:7:24:8932:6562#2/1

• IL29_4505: instrument ID
• 7: lane
• 24: tile
• 8932:6562: coordinates of the cluster
• #2: index number
• /1: first mate in the pair (paired‐end reads)

The second line of the record is the read; the fourth line holds the quality values for each nucleotide (base quality scores), one ASCII character per base.

Quality encoding: printable ASCII characters from 33 (lowest, '!') to 126 (highest, '~'):

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

Illumina 1.8+ uses Phred+33; raw reads typically span quality scores 0 to 41 (characters '!' to 'J').
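The header fields described above can be extracted programmatically. The following is a minimal sketch for this older Illumina read-name layout (instrument:lane:tile:x:y#index/mate); the class and function names are hypothetical, and newer Illumina headers use a different layout that this pattern does not cover.

```python
import re
from typing import NamedTuple


class IlluminaReadName(NamedTuple):
    instrument: str
    lane: int
    tile: int
    x: int
    y: int
    index: str
    mate: int


# Pattern for names such as @IL29_4505:7:24:8932:6562#2/1
_HEADER_RE = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+)"
    r":(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)/(?P<mate>[12])$"
)


def parse_header(line: str) -> IlluminaReadName:
    """Split a FASTQ header line into its named fields."""
    m = _HEADER_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognised read name: {line!r}")
    return IlluminaReadName(
        instrument=m["instrument"], lane=int(m["lane"]), tile=int(m["tile"]),
        x=int(m["x"]), y=int(m["y"]), index=m["index"], mate=int(m["mate"]),
    )


name = parse_header("@IL29_4505:7:24:8932:6562#2/1")
print(name.lane, name.tile, name.mate)
```

Comparing the `lane` field of the first header in two FASTQ files is one way to check whether they come from different lanes.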

raw reads (.fastq)

Quality values are encoded as ASCII characters from 33 (lowest, '!') to 126 (highest, '~'). Illumina 1.8+ uses Phred+33; raw reads typically fall in the range (0, 41).

Phred‐scale value:

$Q = -10 \log_{10} P \quad\rightarrow\quad P = 10^{-Q/10}$

Phred quality score (Q)    Probability of incorrect base call (P)    Base call accuracy
10                         1 in 10                                   90%
20                         1 in 100                                  99%
30                         1 in 1000                                 99.9%
40                         1 in 10000                                99.99%
50                         1 in 100000                               99.999%
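The Phred relation can be checked on real quality characters. This small sketch assumes the Phred+33 encoding described in these slides; the function names are illustrative.

```python
from typing import List


def phred33_to_scores(quality: str) -> List[int]:
    """Decode a Phred+33 quality string into integer quality scores."""
    return [ord(char) - 33 for char in quality]


def error_probability(q: float) -> float:
    """P = 10^(-Q/10): probability that the base call is wrong."""
    return 10 ** (-q / 10)


# 'C', 'D' and '%' all occur in the quality line of the example read.
scores = phred33_to_scores("CD%")
for q in scores:
    print(q, error_probability(q))
```

In the example read the character '%' therefore marks a base with a much higher error probability than its neighbours.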

raw reads (.fastq)

• Move into folder lane2;
• Open s-7-1.fastq:
  gedit s-7-1.fastq
  OR
  Terminal: more s-7-1.fastq OR head s-7-1.fastq
• Do s-6-1.fastq and s-7-1.fastq come from two different lanes?


1‐ Fastq quality control + trimming

FastQC: quality control of the raw data coming out of the sequencer

• Evaluation of the quality of the generated data;
• Basic summary statistics of the raw data;
• Several modules to evaluate different features (e.g. adapters, base quality, etc.);
• Feedback (green, orange, red): do not rely on it blindly, think about what it means!


1‐ Fastq quality control + trimming

Per base sequence quality: warning


1‐ Fastq quality control + trimming

What can we do to improve the quality at the end of the reads?

Read trimming: removal of 3' ends with low quality scores.

[Figure: FastQC per base sequence quality plots; position bins labelled on the slide: 95‐99 bp and 90‐94 bp]
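The idea of 3' quality trimming can be sketched in a few lines. This is a deliberately simple version that removes bases from the 3' end until it meets one at or above a threshold; dedicated trimming tools generally use more robust criteria (for example sliding windows), and the threshold of 20 and the function name are assumptions made for illustration.

```python
from typing import Tuple


def trim_3prime(seq: str, qual: str, min_q: int = 20,
                offset: int = 33) -> Tuple[str, str]:
    """Remove trailing bases whose Phred score is below min_q."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    end = len(seq)
    # Walk back from the 3' end while the last kept base is low quality.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Toy read: the last two bases have Phred scores 2 ('#') and 0 ('!').
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGT", "IIIII5#!")
print(trimmed_seq, trimmed_qual)
```

Trimmed reads become shorter and of unequal length, so the sequence length distribution reported by the quality control step changes after trimming.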


1‐ Fastq quality control + trimming

Per sequence quality score: pass


Sequence length: pass

1‐ Fastq quality control + trimming

Adapter removal

FastQC status shown on the slides: Failed and Warning on the first slide, Pass on the second.

1‐ Fastq quality control + trimming

Overrepresented sequences

Removal of overrepresented sequences (PCR primers).

FastQC references:

• Software website: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
• Manual: https://insidedna.me/tool_page_assets/pdf_manual/fastqc.pdf


2‐ Alignment to a reference genome

Alignment: the process of determining the most likely location within the genome for the observed DNA read.

[Figure: raw reads placed along the reference genome]

2‐ Alignment to a reference genome

Trade‐off: speed vs sensitivity. The higher the accuracy, the longer the alignment run.

Two classes of methods:

Burrows‐Wheeler
• fast
• less robust at high divergence from the reference genome
• e.g. bwa

Hashing
• slow (needs more memory)
• robust at high divergence from the reference genome
• e.g. stampy

The shorter the read, the harder it is to find its location in the genome.

Large amounts of data are computationally challenging in terms of memory and speed.
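To give an intuition for the hashing class of methods, here is a toy seed-and-verify mapper: every k-mer of the reference is stored in a hash table, the first k-mer of the read is looked up, and each candidate position is checked by counting mismatches. It is only a sketch of the general principle, not the algorithm implemented by stampy or bwa; the sequences, names and parameters are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_kmer_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Hash table mapping every k-mer of the reference to its positions."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read: str, reference: str, index: Dict[str, List[int]],
             k: int, max_mismatches: int = 1) -> List[Tuple[int, int]]:
    """Return (position, mismatches) for every acceptable placement."""
    hits = []
    for pos in index.get(read[:k], []):  # seed lookup
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:  # verification
            hits.append((pos, mismatches))
    return hits


# Toy reference containing the element ACGGATCC twice.
reference = "TT" + "ACGGATCC" + "AAGT" + "ACGGATCC" + "TTGCA"
index = build_kmer_index(reference, k=4)
repeat_hits = map_read("ACGGATCC", reference, index, k=4)
unique_hits = map_read("CCAAGTAC", reference, index, k=4)
print(repeat_hits, unique_hits)
```

The read taken from the repeated element has two equally good placements, while the read spanning the unique junction has only one. The memory cost of the method is also visible: the index holds an entry for every position of the reference.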


2‐ Alignment to a reference genome

What if there are several possible places to align your sequencing read?

This may be due to:
‐ Repeated elements in the genome
‐ Low complexity sequences
‐ Reference errors and gaps

MQ (mapping quality) is a Phred score of the quality of the alignment:
• high MQ: a single match
• low MQ: the probability of mapping to different locations is high, but there are no perfect multiple matches
• MQ0: a perfect multiple match
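In SAM files the mapping quality is the fifth tab-separated field (MAPQ) of each alignment line, and the value 255 is reserved for "not available". The sketch below sorts alignments into the three classes listed above. The SAM lines are invented, and the cut-off of 20 between low and high MQ is an arbitrary assumption, since aligners scale MQ differently.

```python
def mapq_of(sam_line: str) -> int:
    """Return MAPQ, the 5th tab-separated field of a SAM alignment line."""
    return int(sam_line.rstrip("\n").split("\t")[4])


def classify_mapq(mapq: int, low_threshold: int = 20) -> str:
    """Label a mapping quality value; the threshold is an assumption."""
    if mapq == 255:
        return "unavailable"  # reserved value in the SAM format
    if mapq == 0:
        return "MQ0"
    return "low MQ" if mapq < low_threshold else "high MQ"


# Invented alignment lines (QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL).
toy_lines = [
    "read1\t0\tchrI\t100\t60\t8M\t*\t0\t0\tACGGATCC\tIIIIIIII",
    "read2\t0\tchrI\t250\t3\t8M\t*\t0\t0\tACGGTTCC\tIIIIIIII",
    "read3\t0\tchrI\t400\t0\t8M\t*\t0\t0\tACGGATCC\tIIIIIIII",
]
labels = [classify_mapq(mapq_of(line)) for line in toy_lines]
print(labels)
```

Because MQ is Phred-scaled, a value of 20 corresponds to an estimated 1 in 100 chance that the read is placed at the wrong location.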


2‐ Alignment to a reference genome

[Figure: reads of Sample_1 aligned to a reference sequence carrying Element 1 and Element 2]

                      Element 1    Element 2
Reference sequence    1 copy       1 copy
Sample_1              1 copy       1 copy

Page 67: Bioinformatica e analisi dei genomidocente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 1. 20. · base recalibration GATK GATK picard aligned reads (.sam/.bam) 5. variant

This may be due to:‐ Repeated elements in the genome‐ Low complexity sequences‐ Reference errors and gaps

[Figure: reference sequence with Element 1, Element 2 and a second copy of Element 1; a Sample_1 read from Element 1 is aligned against it.]

Perfect multiple matches → mapping quality (MQ) of 0
Not a perfect match → low MQ
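To make the MQ idea concrete, here is a toy Python sketch. It assumes the Phred-scaled definition used in the SAM format (MQ = -10 · log10 of the probability that the reported position is wrong) and a deliberately naive model in which a read with n equally good placements is assigned to one of them at random. The function names and the cap of 60 are assumptions for illustration; real aligners such as bwa use their own scoring and report MQ 0 for reads with several perfect matches.

```python
import math

def phred_mapq(p_wrong, cap=60):
    """Phred-scale the probability that the reported position is wrong.

    `cap` is an illustrative upper bound, not a property of the formula.
    """
    if p_wrong <= 0:
        return cap
    return min(cap, max(0, round(-10 * math.log10(p_wrong))))

def p_wrong_for_equal_hits(n_hits):
    """Naive model: with n equally good placements, a random pick is
    wrong with probability 1 - 1/n."""
    return 1 - 1 / n_hits

# A unique hit keeps a high score; the more equally good hits, the lower it gets.
for n in (1, 2, 10):
    print(n, phred_mapq(p_wrong_for_equal_hits(n)))
```

Under this toy model the score drops sharply as soon as a second placement is equally good, which is the behaviour summarised on the slide.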

[Figure: copy-number labels for reference sequence and Sample_1: 2 copies, 1 copy, 1 copy, 1 copy.]

False heterozygous call
Cluster of heterozygotes

[Figure: copy-number labels for reference sequence and Sample_1: 1 copy, 2 copies, 1 copy, 1 copy.]
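One way such artefacts can be spotted downstream is by looking for clusters of heterozygous calls. The Python sketch below is an illustrative heuristic, not part of the course pipeline: the function name, the window size and the minimum number of calls are assumptions chosen for the example.

```python
def het_clusters(positions, window=100, min_calls=3):
    """Group heterozygous call positions into runs where each call lies
    within `window` bp of the previous one; report runs with at least
    `min_calls` calls as (start, end, n_calls)."""
    clusters = []
    run = []
    for pos in sorted(positions):
        # A gap larger than the window closes the current run.
        if run and pos - run[-1] > window:
            if len(run) >= min_calls:
                clusters.append((run[0], run[-1], len(run)))
            run = []
        run.append(pos)
    if len(run) >= min_calls:
        clusters.append((run[0], run[-1], len(run)))
    return clusters

# Hypothetical positions of heterozygous calls on one contig.
print(het_clusters([100, 5000, 5020, 5050, 5090, 9000]))
```

Regions flagged this way are candidates for closer inspection, for example by checking read depth and mapping quality in the same interval.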

[Figure: example of a repeated element, AluSg7.]

Create the index of the reference genome (for bwa, samtools and picard)

bwa index: builds an FM-index, which is specific to the algorithm behind this aligner

bwa index -a is Saccharomyces_cerevisiae.EF4.68.dna.toplevel.fa

2‐ Alignment to a reference genome: reference sequence
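To give an idea of what an FM-index does, here is a minimal Python sketch of exact pattern search on a Burrows-Wheeler transformed text. It is a teaching toy under stated assumptions: the suffix array is built by naive sorting, the occurrence table is stored in full, and the function names are invented for this example. It is not how bwa builds or stores its index.

```python
def build_fm_index(text):
    """Build a toy FM-index: suffix array, C table and occurrence table."""
    text += "$"  # sentinel, sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # character preceding each suffix
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    # c_table[ch]: number of characters in the text strictly smaller than ch
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[ch][i]: number of occurrences of ch in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (b == ch)
    return sa, c_table, occ

def find(pattern, sa, c_table, occ):
    """Backward search: return the sorted start positions of `pattern`."""
    lo, hi = 0, len(sa)
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])

index = build_fm_index("ACGTACGA")
print(find("ACG", *index))
```

A pattern that occurs at several positions comes back with several hits, which is the situation that leads to ambiguous read placement in repeated regions.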

index .fai

samtools faidx Saccharomyces_cerevisiae.EF4.68.dna.toplevel.fa

For each sequence, the index file stores a record with the sequence identifier, its length, the offset of the first sequence character in the file, the number of characters per line, and the number of bytes per line.
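The five fields listed above can be reproduced with a few lines of Python. This is a simplified sketch, not the samtools implementation: `build_fai` is a hypothetical helper, it assumes every line of a sequence (except possibly the last) has the same length, and the toy FASTA content is made up for the example.

```python
import os
import tempfile

def build_fai(fasta_path):
    """Return one (name, length, offset, line_bases, line_bytes) tuple
    per sequence, mirroring the five columns of a .fai record."""
    records = []
    name = None
    pos = 0  # byte position of the start of the current line
    with open(fasta_path, "rb") as handle:
        for raw in handle:
            if raw.startswith(b">"):
                if name is not None:
                    records.append((name, length, offset, line_bases, line_bytes))
                name = raw[1:].split()[0].decode()
                length = 0
                offset = pos + len(raw)  # first base follows the header line
                line_bases = line_bytes = 0
            elif name is not None:
                bases = len(raw.rstrip(b"\r\n"))
                if line_bases == 0:
                    line_bases, line_bytes = bases, len(raw)
                length += bases
            pos += len(raw)
    if name is not None:
        records.append((name, length, offset, line_bases, line_bytes))
    return records

# Toy FASTA with two made-up contigs.
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False, newline="") as tmp:
    tmp.write(">chrA toy contig\nACGTACGTAC\nACGTA\n>chrB\nGGCC\n")
records = build_fai(tmp.name)
os.remove(tmp.name)
for rec in records:
    print("\t".join(str(x) for x in rec))
```

With the offset and the two per-line values, a tool can jump directly to any base of any sequence without reading the whole file, which is why the index matters for large genomes.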

[Figure: example .fai file with two annotated values: 50 characters and 60 characters.]

Create the dictionary of the reference genome (for samtools, gatk and picard)

dictionary .dict: list of the contigs included in the FASTA file of the reference genome

java -jar picard.jar CreateSequenceDictionary REFERENCE=Saccharomyces_cerevisiae.EF4.68.dna.toplevel.fa OUTPUT=Saccharomyces_cerevisiae.EF4.68.dna.toplevel.dict

Keep the index and dictionary files in the same directory as the reference file!

2‐ Alignment to a reference genome: reference sequence

[Figure: example .dict file with annotated fields: SequenceName, SequenceLength, Path, MD5 checksum.]
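A .dict file is a SAM-style header in which each contig is an `@SQ` line with tab-separated `TAG:value` fields; the annotated fields correspond to the tags SN (name), LN (length), UR (path/URI) and M5 (MD5 checksum). The Python sketch below reads such lines. The helper name `parse_dict` and all values in the example text are made up for illustration.

```python
def parse_dict(text):
    """Extract contig records from the @SQ lines of a sequence dictionary."""
    contigs = []
    for line in text.splitlines():
        if not line.startswith("@SQ"):
            continue  # skip @HD and any other header lines
        # Split each TAG:value field once, so values may contain ':' (e.g. URIs).
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:])
        contigs.append({
            "name": fields["SN"],
            "length": int(fields["LN"]),
            "md5": fields.get("M5"),
            "uri": fields.get("UR"),
        })
    return contigs

# Made-up dictionary content, for illustration only.
example = (
    "@HD\tVN:1.4\tSO:unsorted\n"
    "@SQ\tSN:chrA\tLN:1000\tM5:0123456789abcdef0123456789abcdef\tUR:file:/data/toy.fa\n"
    "@SQ\tSN:chrB\tLN:250\tM5:fedcba9876543210fedcba9876543210\tUR:file:/data/toy.fa\n"
)
for contig in parse_dict(example):
    print(contig["name"], contig["length"])
```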

2‐ Alignment to a reference genome: mapping with bwa‐mem

Three different algorithms:

1. BWA‐backtrack: for Illumina reads up to 100 bp;

2. BWA‐SW: long read support, split alignment;

3. BWA‐MEM: long read support, split alignment, faster, more accurate.

• Paired‐end alignment (lane 1);

• bwa mem uses the reference genome (.fa) and the reads (.fastq) to create a SAM file;

• Option -M: mark shorter split hits as secondary (not supplementary).
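In the SAM file, the difference between secondary and supplementary alignments is carried by two bits of the FLAG field: 0x100 (secondary) and 0x800 (supplementary). A small Python sketch to classify a record from its FLAG value follows; the function name is an assumption for this example.

```python
SECONDARY = 0x100      # SAM FLAG bit: secondary alignment
SUPPLEMENTARY = 0x800  # SAM FLAG bit: supplementary alignment

def alignment_kind(flag):
    """Classify a SAM record from its integer FLAG field."""
    if flag & SUPPLEMENTARY:
        return "supplementary"
    if flag & SECONDARY:
        return "secondary"
    return "primary"

# 99: a typical primary record of a properly paired read;
# 272 = 256 + 16: secondary, reverse strand; 2064 = 2048 + 16: supplementary.
for flag in (99, 272, 2064):
    print(flag, alignment_kind(flag))
```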

2‐ Alignment to a reference genome: mapping with bwa‐mem

Split read: [Figure from Karacok E et al., 2012]

bwa mem [options] [RefSeq] [lane1_fastq1] [lane1_fastq2] > lane1.sam
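When the same command has to be run for several lanes, it can help to assemble it in a script. The Python sketch below only builds and prints the command line; the helper name and the FASTQ file names are hypothetical, while `-t` (threads) and `-M` (mark shorter split hits as secondary) are existing bwa mem options.

```python
import shlex

def bwa_mem_command(reference, fastq1, fastq2, threads=1, mark_secondary=True):
    """Build the argument list for a paired-end bwa mem run."""
    cmd = ["bwa", "mem", "-t", str(threads)]
    if mark_secondary:
        cmd.append("-M")
    cmd += [reference, fastq1, fastq2]
    return cmd

# Hypothetical FASTQ names for lane 1.
cmd = bwa_mem_command(
    "Saccharomyces_cerevisiae.EF4.68.dna.toplevel.fa",
    "lane1_1.fastq",
    "lane1_2.fastq",
    threads=4,
)
print(" ".join(shlex.quote(arg) for arg in cmd) + " > lane1.sam")
```

To actually run the alignment, the list can be passed to `subprocess.run` with standard output redirected to `lane1.sam`, provided bwa is installed and the reference has been indexed.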

2‐ Alignment to a reference genome: mapping with bwa‐mem