GBS Bioinformatics Pipeline(s) Overview

Getting from sequence files to genotypes.

Pipeline Coding: Ed Buckler, Jeff Glaubitz, James Harriman

Presentation: Rob Elshire, with supporting information from the coders.

Source: GBS Bioinformatics Pipeline Overview - Cornell University (cbsu.tc.cornell.edu/lab/doc/GBS_Bioinformatics...)



Three Pipelines

• Discovery Pipeline

– Requires a reference genome

– Multiple steps to get to genotypes

– Hands-on tutorial is based on this pipeline

• Production Pipeline

– Uses information from the Discovery Pipeline

– One step from sequence to genotypes

• UNEAK Pipeline

– For species without a reference genome

– Fei Lu will present this tomorrow at 9:30


Vocabulary

• Sequence File – Text file containing DNA sequence and supplemental information from the Illumina platform.

• Key File – Text file used to assign a GBS Bar Code to a Taxa.

• GBS Tag – DNA sequence consisting of a cut site remnant and additional sequence.

• GBS Bar Code – A short known sequence of DNA used to assign a GBS Tag to its original Taxa.

• Taxa – An individual sample.

GBS Discovery Pipeline

[Figure: flow diagram of the Discovery pipeline. Boxes: Sequence, Tag Counts, TOPM, Tags by Taxa, SNP Caller, Genotypes.]

Raw Sequence (Qseq)

HWI-ST397 0 3 68 15896 200039 0 1 GTCGATTCTGCTGACTTCATGGCTTCTGTTGACGACGATGTGGAACGAGCTGTTGTTGAAACTGATGAGGTTGCTGAGATCGGAAGAGCGGTTCAGCAGG
HWI-ST397 0 3 68 15960 200043 0 1 GAGAATCAGCTTTTCCAACACCTTGAGTTTGAGTATGCGATGACAGTTACTCTTACTGTCCATTGTCAGCATTGCCAGAGCTTGACCAGCTGAGATCGGA
HWI-ST397 0 3 68 15831 200053 0 1 ATGTACTGCACCGTTGCAAGCGAGCACCACCAAGCGGCGGTATGCACTTTGCAATATGTAGCTAGAATAGGATTTTCAGGTGATTAGGAGCGTAAAAAAG
HWI-ST397 0 3 68 15867 200049 0 1 CCAGCTCAGCCTGCATTCTTTCAAAAACTTCCAATGCCTCTCTTGGCCTAGCATTTTGGGCATACCCTGTGACCATTGCTGTCCATGCCACCATATCCTT
HWI-ST397 0 3 68 15943 200048 0 1 GATTTTACTGCACATCGGTCTTGTCACACCAGCTATACCTGTAGAGTTGCCTTCCACAGTTGTAGAGATCGGAAGAGCGGTTCAGCGGGACTGCCGAGAA
HWI-ST397 0 3 68 15812 200062 0 1 TCACCCAGCATCACGCCCCTTCACATCCAGTAAAACCCCTGAATGATGTGCTGTCACTGTTTGATATACAGTTGTTAACGTGAGGACGGGCTTTGAAGGA
HWI-ST397 0 3 68 15888 200067 0 1 CTTGACTGCCACCATGAATATGTGTTCCAAGTGCCACAAGGACTTGGCCCTGAAGCAAGAACAAGCCAAACTTGCAGAGATCGGAAGAGCGGTTCAGCAG
HWI-ST397 0 3 68 15969 200067 0 1 CCACAACTGCTCCATCTTTTCCATGAGACATTGCTCCCGCCATTGCACCCTTGGCATCAGCAGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
HWI-ST397 0 3 68 15786 200078 0 1 GTATTCTGCACACGAATCAGCTGAGACACCAATTGGGCATGAATCAAATGGCGCCATTGCCGGGGATCGAACCCCGAATCAAATGGTGCCATTGCCACTG
HWI-ST397 0 3 68 15830 200072 0 1 AATATGCCAGCAGTTAAGAGAGTTCAAGATCCAGGGCTCATATTCAGTCACCTATATCAATTTCGAAATGGATTTCCAGGGTTTTAAGAGCCTAACAAAG
HWI-ST397 0 3 68 15863 200073 0 1 CTCCCTGCGGGTGCGCGCGACCCATCTTCAGTTGGAGCGTCTATCGGCGTTGCTGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTA
HWI-ST397 0 3 68 15762 200088 0 1 TGGTACGTCTGCGGAATGGCGTTTTTTATGCCTTAGTGGTTCGCAGAGCATTTGGCAGCTGAGATGGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGAT
HWI-ST397 0 3 68 15903 200085 0 1 GGACCTACTGCCCAAGAACGGCTCACCCATCATCCGCTTTCTTCACCTTCCGTCTTCTTTGGCTGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGAC
HWI-ST397 0 3 68 15921 200082 0 1 GAGAATCAGCGTGTACGGGGCACGGGGTGACTGCTGTTGCGTGCGAGGGCTGAGATCGGAAGAGCGGTTCAGCAGGAGTGCCGAGACCGATCTCGTATGC
HWI-ST397 0 3 68 15984 200085 0 1 TTCTCCAGCCGCATGGGCCGGAGACCAGAGAGGCCTCCCCAGGATTTGCACGATAGACCACGACTTATGGACGATTGGGAAGCCCTTGTTGGAAGGAAAT
HWI-ST397 0 3 68 15788 200096 0 1 GCGTCAGCAAATGCCCCAACAGCCAAGTCAGCAATTGCCTCAGCAACTTGGGCCACAAACACCACAGCTGAGATCGGAAGAGCGGTTCAGCAGGAATGCC
HWI-ST397 0 3 68 15842 200099 0 1 TAGGCCATCAGCTGACTTCCCGGGTGTGGAGAAAAGAGGGCCCCTCACTTCTCTCAAGTGCTGAGATCGGAAGAGCGGTTCAGCAGGAATGCGGAGACCG
HWI-ST397 0 3 68 15876 200105 0 1 GGACCTACTGCCGGCGGGACGAAAGCGGTTGTTGAATGATGGGGGTCACTAGGCCTTCCAGGGCCTTTAAGCGCGCGCTGAGATCGGAAGAGGGGTTCAG
HWI-ST397 0 3 68 15937 200097 0 1 CTCCCTGTTGAAGCATGTGCAAAAGAGCTTGTTCTCGGCCTTCTTCAAGCCATTCTCTTGGCAGACGGCTTTGCCTAGAAGTTTCGCCCCATCACCCTTG
HWI-ST397 0 3 68 15958 200102 0 1 CGCCTTATCTGCCCTCGCCGGTCATGGGGAGTGGTGCCCCTACCTCGGACAAGACAGATGCAGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG


Key File

Flowcell   Lane  Barcode  DNASample  LibraryPlate  Row  Column  LibraryPrepID  PlateName
81PVTABXX  2     CTCC     Sample_1   1             A    1       1              Plate_A
81PVTABXX  2     TGCA     Sample_2   1             A    2       2              Plate_A
81PVTABXX  2     ACTA     Sample_3   1             A    3       3              Plate_A
81PVTABXX  2     CAGA     Sample_4   1             A    4       4              Plate_A
81PVTABXX  2     AACT     Sample_5   1             A    5       5              Plate_A
81PVTABXX  2     GCGT     Sample_6   1             A    6       6              Plate_A
81PVTABXX  2     TGCGA    Sample_7   1             A    7       7              Plate_A
81PVTABXX  2     CGAT     Sample_8   1             A    8       8              Plate_A
81PVTABXX  2     CGCTT    Sample_9   1             A    9       9              Plate_A
81PVTABXX  2     TCACC    Sample_10  1             A    10      10             Plate_A
81PVTABXX  2     CTAGC    Sample_11  1             A    11      11             Plate_A
81PVTABXX  2     ACAAA    Sample_12  1             A    12      12             Plate_A
81PVTABXX  2     TTCTC    Sample_13  1             B    1       13             Plate_A
81PVTABXX  2     AGCCC    Sample_14  1             B    2       14             Plate_A
81PVTABXX  2     GTATT    Sample_15  1             B    3       15             Plate_A
81PVTABXX  2     CTGTA    Sample_16  1             B    4       16             Plate_A
81PVTABXX  2     ACCGT    Sample_17  1             B    5       17             Plate_A
81PVTABXX  2     GTAA     Sample_18  1             B    6       18             Plate_A
81PVTABXX  2     GGTTGT   Sample_19  1             B    7       19             Plate_A
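As a rough illustration of how a key file like the one above could be turned into a barcode lookup, here is a minimal sketch. It assumes whitespace-separated columns with the header names shown on the slide and no spaces inside fields; `parse_key_file` is a hypothetical helper, not part of the TASSEL pipeline.

```python
def parse_key_file(text):
    """Map (flowcell, lane, barcode) to the DNASample listed in the key file.

    Assumption: columns are whitespace-separated and named as on the slide.
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header, records = rows[0], rows[1:]
    col = {name: i for i, name in enumerate(header)}
    return {
        (r[col["Flowcell"]], r[col["Lane"]], r[col["Barcode"]]): r[col["DNASample"]]
        for r in records
    }


key_text = """Flowcell Lane Barcode DNASample LibraryPlate Row Column LibraryPrepID PlateName
81PVTABXX 2 CTCC Sample_1 1 A 1 1 Plate_A
81PVTABXX 2 TGCGA Sample_7 1 A 7 7 Plate_A
"""
barcode_to_sample = parse_key_file(key_text)
```

Keying on flowcell and lane as well as barcode reflects that the same barcode can be reused on different lanes.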

[Figure: structure of sequenced reads and how they are handled. Read components shown: Barcode, Barcode adapter, Cut site, Insert, 2nd Insert, Common adapter. Panel labels: ‘Good’ read; GBS Tags; Trimmed reads; Short sequence; Chimeric (?) sequence; Adapter dimer; Rejected reads; No Cut site; No Barcode.]


Tag Counts

• Using information from the key file, each sequence file is processed and its tags are identified and counted.

• If a tag is shorter than 64 bases, it is padded.

• The tags and counts are put into a tag count file for each sequence file.

QseqToTagCountsPlugin / FastqToTagCountsPlugin
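The tag counting step can be sketched roughly as follows. This is an illustration under stated assumptions, not the plugin's code: the cut site remnants (`CAGC`/`CTGC`) are inferred from the example reads, the padding base `A` is assumed, and trimming at a second cut site or at the common adapter is left out. `read_to_tag` and `count_tags` are hypothetical names.

```python
from collections import Counter

TAG_LENGTH = 64                        # tag length stated on the slide
CUT_SITE_REMNANTS = ("CAGC", "CTGC")   # assumption: remnants seen in the example reads
PAD_BASE = "A"                         # assumption: base used for padding


def read_to_tag(read, barcodes):
    """Return (barcode, tag) if the read starts with barcode + cut site, else None."""
    # Try longer barcodes first so a short barcode cannot shadow a longer one.
    for barcode in sorted(barcodes, key=len, reverse=True):
        if not read.startswith(barcode):
            continue
        rest = read[len(barcode):]
        if rest.startswith(CUT_SITE_REMNANTS):
            tag = rest[:TAG_LENGTH]                      # truncate long reads
            return barcode, tag.ljust(TAG_LENGTH, PAD_BASE)  # pad short ones
    return None


def count_tags(reads, barcodes):
    """Count identical tags across all reads that carry a known barcode."""
    counts = Counter()
    for read in reads:
        hit = read_to_tag(read, barcodes)
        if hit is not None:
            counts[hit[1]] += 1
    return counts
```

Reads with no recognised barcode or no cut site directly after it are simply skipped, which corresponds to the rejected reads in the figure.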


Master Tag Counts

• The individual tag count files are merged into a master tag count file.

• A minimum count is specified at the merge stage to exclude tags with low counts (likely sequencing errors).

MergeMultipleTagCountsPlugin
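A minimal sketch of the merge, assuming each tag count file has already been loaded as a mapping from tag to count. `merge_tag_counts` and `min_count` are illustrative names, not the plugin's options.

```python
from collections import Counter


def merge_tag_counts(per_file_counts, min_count):
    """Sum counts per tag across files, then drop tags seen fewer than min_count times."""
    merged = Counter()
    for counts in per_file_counts:
        merged.update(counts)  # Counter.update adds counts rather than replacing them
    return {tag: n for tag, n in merged.items() if n >= min_count}
```

Applying the threshold after summing means a tag that is rare in every single file can still be kept if it is common overall.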


Conversion of Tags to Fastq

• Sequence aligners do not work with the tag count file format.

• In preparation for the alignment step, the tag count file is converted to fastq format.

TagCountsToFastqPlugin
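The conversion could look something like the sketch below. The record names (`@tag0_count=3`) and the constant quality character are assumptions made for illustration; the actual plugin's naming scheme is not shown in these slides.

```python
def tags_to_fastq(tag_counts, quality_char="I"):
    """Render each tag as a four-line fastq record with a placeholder quality string."""
    lines = []
    for i, (tag, count) in enumerate(sorted(tag_counts.items())):
        lines.append(f"@tag{i}_count={count}")   # hypothetical record name
        lines.append(tag)
        lines.append("+")
        lines.append(quality_char * len(tag))    # tags carry no per-base quality
    return "\n".join(lines) + "\n"
```

The quality line has to match the sequence length for the aligner to accept the record, which is why a constant character is repeated.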


Tag Alignment / TOPM

• The GBS pipeline uses an external aligner to do the initial alignment.

• The current version uses bowtie2, which produces the alignment in SAM format.

• We convert the SAM file into our Tags On Physical Map (TOPM) format.

bowtie2

SAMConverterPlugin
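To give a feel for what the conversion has to extract, here is a small sketch that pulls chromosome, position, and strand out of one SAM alignment line. The field positions and flag bits (4 = unmapped, 16 = reverse strand) come from the SAM format; the returned dictionary is a simplified, hypothetical stand-in for a TOPM entry.

```python
def parse_sam_line(line):
    """Extract the tag and its physical position from one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    seq = fields[9]  # note: SAM stores reverse-strand hits reverse complemented
    if flag & 4:     # bit 4 set: the tag did not align
        return {"tag": seq, "chrom": None, "pos": None, "strand": None}
    return {
        "tag": seq,
        "chrom": fields[2],
        "pos": int(fields[3]),              # 1-based leftmost position
        "strand": "-" if flag & 16 else "+",
    }
```

Header lines (starting with `@`) would need to be skipped before calling this function.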


TOPM


So Far We Have

• Identified and counted GBS tags.

• Converted tag counts file to fastq.

• Aligned the tags to a reference.

• Converted the alignment to TOPM.


Tags by Taxa

• In this step we identify which tags are present in which taxa, using:

– Original Sequence Files

– Key File

– Master Tag Count File

• Recently migrated to HDF5 file format.

– Efficient storage

– Large data sets

SeqToTBTHDF5Plugin
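Conceptually, a tags-by-taxa (TBT) table records how often each master tag was seen in each taxon. The sketch below assumes reads have already been assigned to taxa via the key file and reduced to tags; `build_tbt` is a hypothetical name and plain dictionaries stand in for the HDF5 storage.

```python
from collections import defaultdict


def build_tbt(taxon_tag_pairs, master_tags):
    """Count, per taxon, how often each tag from the master tag list was observed."""
    master = set(master_tags)
    tbt = defaultdict(lambda: defaultdict(int))
    for taxon, tag in taxon_tag_pairs:
        if tag in master:          # tags missing from the master list are ignored
            tbt[taxon][tag] += 1
    return {taxon: dict(tags) for taxon, tags in tbt.items()}
```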


Tags By Taxa Additional Operations

• If multiple TBTs have been created, they are merged into a single TBT.

• Taxa that were sequenced multiple times are merged.

• The TBT table is pivoted in preparation for SNP calling.

ModifyTBTHDF5Plugin
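The merging and pivoting operations can be pictured with nested dictionaries, as in this sketch. It assumes the TBT is held as taxon -> tag -> count and that the caller supplies a function mapping a full taxon name to its sample name; the direction of the pivot (taxon-major to tag-major) is also an assumption. Both function names are hypothetical.

```python
def merge_taxa(tbt, sample_of):
    """Add up tag counts for taxa that sample_of() maps to the same sample."""
    merged = {}
    for taxon, tags in tbt.items():
        target = merged.setdefault(sample_of(taxon), {})
        for tag, n in tags.items():
            target[tag] = target.get(tag, 0) + n
    return merged


def pivot(tbt):
    """Turn a taxon -> tag -> count table into tag -> taxon -> count."""
    pivoted = {}
    for taxon, tags in tbt.items():
        for tag, n in tags.items():
            pivoted.setdefault(tag, {})[taxon] = n
    return pivoted
```

For example, if taxon names follow a `sample:flowcell:lane:well` pattern, `sample_of` could be `lambda name: name.split(":")[0]`.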


SNP Calling

• Files used in SNP Calling

– TOPM

– TBT

– Pedigree File (optional)

• Some Key Settings

– mnF: Minimum F (inbreeding coefficient)

– mnMAF: Minimum Minor Allele Frequency

– mnMAC: Minimum Minor Allele Count

– mnLCov: Minimum Locus Coverage

TagsToSNPByAlignmentPlugin
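The quantities behind these settings can be computed per site roughly as follows. This is a sketch using textbook definitions (F = 1 - observed/expected heterozygosity; locus coverage taken as the fraction of taxa with a call), which may differ in detail from what the plugin does; how the plugin combines the thresholds is not covered here. `site_stats` is a hypothetical name and the genotype list is assumed to be non-empty.

```python
from collections import Counter


def site_stats(genotypes):
    """Summarise one site; genotypes are (allele, allele) tuples or None if missing."""
    calls = [g for g in genotypes if g is not None]
    coverage = len(calls) / len(genotypes)
    allele_counts = Counter(a for g in calls for a in g)
    total = sum(allele_counts.values())
    ranked = allele_counts.most_common()
    mac = ranked[1][1] if len(ranked) > 1 else 0      # second most common allele
    maf = mac / total if total else 0.0
    h_obs = sum(1 for a, b in calls if a != b) / len(calls) if calls else 0.0
    h_exp = 1.0 - sum((n / total) ** 2 for n in allele_counts.values()) if total else 0.0
    f = 1.0 - h_obs / h_exp if h_exp > 0 else None    # undefined at monomorphic sites
    return {"coverage": coverage, "mac": mac, "maf": maf, "f": f}
```

A site would then be kept or dropped by comparing these values with mnLCov, mnMAC, mnMAF, and mnF.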


HapMap

rs# alleles chrom pos strand SgSBRIL067:633Y5AAXX:2:C9 SgSBRIL019:633Y5AAXX:2:C3
S1_2100 A/G 1 2100 + N N N N N N N R N A N
S1_2163 T/C 1 2163 + N N N N N N T C T T N
S1_13837 T/G 1 13837 + N N N N N N N G N N T
S1_14606 C/T 1 14606 + N N C N N N T T T T C
S1_2061 T/A 1 20601 + T N N N N N N A N N N
S1_68332 C/T 1 68332 + N N N N N N N N N N N
S1_68596 A/T 1 68596 + A N N N N N N N N A N
S1_69309 G/A 1 69309 + N G N N N N N A N N N
S1_79955 T/G 1 79955 + N T G T T N T T N N N
S1_79961 T/G 1 79961 + N T T T T N T T N N N
S1_80584 G 1 80584 + N N N N N N N N N N G
S1_80647 C/T 1 80647 + N N N N N N N C N N C
S1_81274 T/G 1 81274 + N N N N N N T G N N N
S1_108834 G/A 1 108834 + N N N N N N N N N N N
S1_112345 T/G 1 112345 + N N N N N N K T N N N
S1_115359 C/T 1 115359 + N N N N N N T C N T
S1_115362 T/C 1 115362 + N N N N N N N C N N N
S1_115405 G/A 1 115405 + G G A N N G G G G N
S1_115516 T/G 1 115516 + N N T N N N T T N N T
S1_116694 A/G 1 116694 + N A G N N N G A N N N
S1_119016 C/T 1 119016 + N N N N C N N C N N N
S1_155366 T/C 1 155366 + N T N N N N
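The single-letter calls in this excerpt follow the IUPAC nucleotide codes, so `R` is an A/G heterozygote and `K` is G/T, with `N` used for a missing call. The sketch below decodes one row; it assumes the abridged layout on the slide (five site columns, then one call per taxon), and `parse_hapmap_row` is a hypothetical helper.

```python
IUPAC = {
    "A": ("A", "A"), "C": ("C", "C"), "G": ("G", "G"), "T": ("T", "T"),
    "R": ("A", "G"), "Y": ("C", "T"), "S": ("C", "G"),
    "W": ("A", "T"), "K": ("G", "T"), "M": ("A", "C"),
}


def decode_call(code):
    """Return the two alleles behind a one-letter call, or None for N (missing)."""
    return None if code == "N" else IUPAC[code]


def parse_hapmap_row(line, n_site_columns=5):
    """Split a row into site information and decoded genotype calls."""
    fields = line.split()
    site_id, alleles, chrom, pos, strand = fields[:n_site_columns]
    calls = [decode_call(c) for c in fields[n_site_columns:]]
    return {"id": site_id, "alleles": alleles.split("/"), "chrom": chrom,
            "pos": int(pos), "strand": strand, "calls": calls}
```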


Production Pipeline


Why another pipeline?

• The last maize build (30,000 taxa) with the discovery pipeline took over 3 months.

• Most common alleles have been identified after the first few discovery builds.

• Use the information from the discovery pipeline to call SNPs in new runs quickly.

• Improve efficiency and automate.

GBS Bioinformatics Pipelines

[Figure, built up over several slides: the Discovery pipeline (Fastq, Tag Counts, TOPM, Tags by Taxa, SNP Caller, Genotypes, Filtered Genotypes) shown alongside the Production pipeline (Fastq, TOPM, Genotypes). One build labels the TOPM as TagsOnPhysicalMap (TOPM).]

Running the Production Pipeline

• Required Files:

– Sequence file (fastq or qseq)

– Key file

– Production TOPM

• TASSEL 3 Standalone & RawReadsToHapMapPlugin

• Running the Pipeline:

– One lane processed at a time

– HapMap files by chromosome

• ~7 minutes
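The idea of going from reads to genotypes in one step can be sketched as a lookup: each tag is looked up in a production TOPM that already records where the tag maps and which allele it carries. Everything here is a simplifying assumption for illustration (the dictionary layout, the names `call_genotypes` and `production_topm`); it is not the RawReadsToHapMapPlugin implementation.

```python
from collections import Counter, defaultdict


def call_genotypes(taxon_tag_pairs, production_topm):
    """Tally alleles per chromosome, position, and taxon from known tags.

    production_topm maps tag -> (chrom, pos, allele); unknown tags are skipped.
    """
    calls = defaultdict(lambda: defaultdict(lambda: defaultdict(Counter)))
    for taxon, tag in taxon_tag_pairs:
        hit = production_topm.get(tag)
        if hit is None:
            continue
        chrom, pos, allele = hit
        calls[chrom][pos][taxon][allele] += 1
    return calls
```

Because the result is grouped by chromosome first, writing one output file per chromosome is a matter of iterating over the top-level keys.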


Testing Production Pipeline

• Compared HapMap files produced by Discovery Pipeline and Production Pipeline

• Site Comparison:

– Discovery: 48,139

– Production: 47,676

– Difference due to maximum 8 alleles

• 99.98% correlation of genetic distance matrices
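One way to compare two genetic distance matrices is to correlate their off-diagonal entries, as sketched below. The slides do not say which correlation measure was used, so Pearson correlation over the upper triangle is an assumption here, and the function names are hypothetical. The matrices are assumed to be square, symmetric, and to list taxa in the same order.

```python
from math import sqrt


def upper_triangle(matrix):
    """Return the entries above the diagonal, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def distance_matrix_correlation(a, b):
    """Correlate the pairwise distances of two matrices over the same taxa."""
    return pearson(upper_triangle(a), upper_triangle(b))
```

For scale, the site counts on the slide differ by 48,139 - 47,676 = 463 sites, a little under 1% of the Discovery total.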


Next Steps In Pipeline Development

• Hierarchical Data Format – supports very large data sets and complex data structures.

• Working to fuse TOPM, TBT, Keyfile, and Pedigree File into one HDF5 repository.

• Continued improvements to SNP caller.

• Ability to use tags not present in the reference.