Upload
shavonne-malone
View
216
Download
1
Embed Size (px)
Citation preview
Bioinformatics Applications
Vivek Krishnakumar & Haibao TangJ. Craig Venter Institute
Plant Informatics WorkshopJuly 15, 2013
FASTA format - Example
A single FASTA sequence record:• Definition Line or Header begins with ‘>’
• Width of sequence rows usually 60 letters.The Multi-FASTA format is composed of FASTA records concatenated together.
FASTA - Definition Line
• Minimum requirement for definition line is '>' symbol followed by an identifier
>Medtr5g073340• Extra comments may be added after the
identifier separated by a whitespace• Several pre-defined conventions already
exist, that are followed by the sequence databases
faSize
htang@htang-lx $ faSize contigs.fasta344315953 bases (621962 N's 343693991 real 343693991 upper 0 lower) in 177125 sequences in 1 filesTotal size: mean 1943.9 sd 3181.2 min 200 (contig_177102) max 139442 (contig_26116) median 624N count: mean 3.5 sd 10.5U count: mean 1940.4 sd 3180.1L count: mean 0.0 sd 0.0%0.00 masked total, %0.00 masked real
htang@htang-lx $ faSize contigs.fasta -detailed | sort -k2,2nr | headcontig_26116 139442contig_26133 126994contig_26222 58825contig_28230 54642contig_25859 50265contig_38555 47349contig_26213 47226contig_41638 45958contig_6779 43393contig_26783 42041
faOneRecord, faSomeRecords
faOneRecord - Extract a single record from a .FA fileusage: faOneRecord in.fa recordName
faSomeRecords - Extract multiple fa recordsusage: faSomeRecords in.fa listFile out.faoptions: -exclude - output sequences not in the list file.
• Does not create index like cdbindex/cdbyank (does not create extra files)
• Sufficiently fast
faSplit
faSplit - Split an fa file into several files.usage: faSplit how input.fa count outRootwhere how is either 'base' 'sequence' or 'size'. Filessplit by sequence will be broken at the nearestfa record boundary, while those split by base willbe broken at any base. Files broken by size willbe broken every count bases.Examples: faSplit sequence estAll.fa 100 estThis will break up estAll.fa into 100 files(numbered est001.fa est002.fa, ... est100.faFiles will only be broken at fa record boundaries faSplit base chr1.fa 10 1_This will break up chr1.fa into 10 files faSplit size input.fa 2000 outRootThis breaks up input.fa into 2000 base chunks faSplit about est.fa 20000 outRootThis will break up est.fa into files of about 20000 bytes each by record. faSplit byname scaffolds.fa outRootThis breaks up scaffolds.fa using sequence names as file names. faSplit gap chrN.fa 20000 outRootThis breaks up chrN.fa into files of at most 20000 bases each,at gap boundaries if possible.
faGapSizes, faGapLocs
htang@htang-lx (~) $ faGapSizes medicago.fagapCount=7870, totalN=51926655, minGap=1, maxGap=250000, avgGap=6598.00
0 < size < 10 : 2402: *********************** size = 10 : 2: 10 < size < 50 : 50: size = 50 : 26: 50 < size < 100 : 38: size = 100 : 4654: ********************************************* 100 < size < 500 : 7: size = 500 : 0: 500 < size < 1000 : 0: size = 1000 : 0: 1000 < size < 5000 : 1: size = 5000 : 305: ** 5000 < size < 10000 : 1: size = 10000 : 1: 10000 < size < 50000 : 0: size = 50000 : 101: size > 50000 : 282: **
htang@htang-lx (~) $ faGapLocs CU914135 stdout0 CU914135.1_seq_0 70702 CU914135.1 27620170702 CU914135.1_gap_0 50 CU914135.1 27620170752 CU914135.1_seq_1 51172 CU914135.1 276201121924 CU914135.1_gap_1 50 CU914135.1 276201121974 CU914135.1_seq_2 154227 CU914135.1 276201
Format for HTS Data
• High Throughput Sequencing (HTS) instruments produce large quantities of sequence data
• Requirement to store sequence quality along with raw sequence arose
• Thus, logical extension to FASTA was developed, called FASTQo Minimal representation of sequencing reado Ability to store numeric quality score
FASTQ format
• Line 1 begins with the @ character followed by sequence identifier
• Line 2 consists of the raw sequence• Line 3 begins with a + symbol followed by an
optional description or repeat of Line 1• Line 4 corresponds to the encoded quality
values (one character each for every nucleotide in the sequenced read)
• Common file extensions:.fastq, .fq, or .txt are used
FASTQ formatIllumina Quality Encoding
• Phred Quality Scores Q: logarithmically related to base calling error probabilities P
• Phred score:10 = 10% error20 = 1% error30 = 0.1% error40 = 0.01% error
FASTQ formatIllumina Quality Encoding
• Illumina format encodes a quality score between 0 and 62 using ASCII 64 to 126 (as compared to Sanger which encodes 0 to 93 using ASCII 33 to 126)
• The phred score + some offset = ASCII code, example:
• Two types of offsets (phred +33, and phred +64). Most of the FASTQ files are phred +33.
• How do I know which offset it is? There is a quick tip
Generic Feature FormatGFF = Generic Feature Format
Tab delimited, easy for data parsing and processing Many annotation viewers accept this format, isn't very strictFields:
1. Reference Sequence: base seq to which the coordinates are anchored2. Source: source of the annotation3. Type: Type of feature4. Start coordinate (1-based)5. End coordinate6. Score: Used for holding numerical scores (similarity, etc)7. Strand: "+","-", or "." if unstranded8. Frame: Signifies codon phase for coding sequence (CDS) features9. Other attributes or/and comments
GFF3 – Generic Feature Format v3http://www.sequenceontology.org/gff3.shtml
Extension of GFF by the Sequence Ontology (SO) and Generic Model Organism Database (GMOD) Projects - Allows hierarchies more than one level deep - Constrains the feature type field to be taken from controlled vocabulary - Feature can belong to more than one group - Attributes take the form of “Key=Value” pairs separated by a ";" - Uppercase 'Keys' are reserved. Lower case 'keys' are user-defined
BED Formathttp://genome.ucsc.edu/FAQ/FAQformat.html#format1
• Developed primarily for the UCSC genome browser
• BED lines have 3 required fields and 9 optional fields
• With just the required fields and a few additional optional fields (bed6), individual features (such as gene/mRNA boundaries) can be represented
• In order to depict hierarchical features, all 12 columns are necessary (also called bed12 format)
Three required BED fields are:1. chrom - The name of the chromosome/reference (e.g. chr3, scaffold_1)2. chromStart - The starting position of the feature (0-based)3. chromEnd - The ending position of the feature
The 9 additional optional BED fields are:4. name - Defines the name of the BED line5. score - A score between 0 and 10006. strand - Defines the strand - either '+' or '-'.7. thickStart - The starting position at which the feature is drawn thickly
(for example, the start codon in gene displays).8. thickEnd - The ending position at which the feature is drawn thickly (for
example, the stop codon in gene displays)9. itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0)10.blockCount - The number of blocks (exons) in the BED line11.blockSizes - A comma-separated list of the block sizes, corresponding
to blockCount12.blockStarts - A comma-separated list of block starts, relative to
chromStart, corresponding to blockCount
BED Format - Description
BEDchr2 0 673342chr5 0 619924
BED6chr2 136141 136779 Medtr2g005340 0 +chr2 140131 140463 Medtr2g005360 0 +
BED12 chr2 140131 140463 Medtr2g005360 0 + \
140131 140463 255,0,0 2 100,58 0,100
BED Format - Examples
BEDTOOLS
• Author: Aaron Quinlan, UVA• BED format (first three columns `chr`, `start`, `end` are required), 0-based
• BEDTOOLS operates on genomic coordinates and uses “Bin indexing” to speed up range queries
#chromosome start end name score strand ...Mt_chr01 150050 154771 Medtr1g005000 1000 -Mt_chr01 154858 155160 Medtr1g005010 1000 +Mt_chr01 155200 162479 Medtr1g005020 1000 +Mt_chr01 162535 164388 Medtr1g005030 1000 -Mt_chr01 164428 166156 Medtr1g005040 1000 -Mt_chr01 167402 169400 Medtr1g005050 1000 -Mt_chr01 169440 171646 Medtr1g005060 1000 -Mt_chr01 171722 172427 Medtr1g005070 1000 -
BEDTOOLS functionalities
• Find out what’s overlapping between sets of features. (intersectBed)
• Find closest genomic features. (closestBed, windowBed)
• Merge overlapping features. (mergeBed)
• Computing coverage for alignments based on genome features. (coverageBam, bamToBed)
• Calculating the depth and breadth of sequence coverage across defined "windows" in a genome. (coverageBed)
• Sequence output. (fastaFromBed, maskFastaFromBed)
intersectBed: What’s in common?
• Report the base-pair overlap between sequence alignments and genes.
• Report those alignments that overlap NO genes. Like "grep -v"
• Report the number of genes that each alignment overlaps.
• Report the entire, original alignment and gene entries for each overlap, and number of overlapping bases.
• Only report an overlap with a repeat if it spans at least 50% of the exon.
• Read BED A from stdin. For example, find genes that overlap LINEs but not SINEs.
• Retain only single-end BAM alignments that do not overlap simple sequence repeats.
$ intersectBed -a reads.bed -b genes.bed
$ intersectBed -a reads.bed -b genes.bed -v
$ intersectBed -a reads.bed -b genes.bed -c
$ intersectBed -a reads.bed -b genes.bed –wo
$ intersectBed -a exons.bed -b repeatMasker.bed –f 0.50
$ intersectBed -a genes.bed -b LINES.bed | intersectBed -a stdin -b SINEs.bed –v
$ intersectBed -abam reads.bam -b SSRs.bed -v > reads.noSSRs.bam
windowBed, closestBed: What’s in there?
• windowBed
• Report all genes that are within 10000 bp upstream or downstream of CNVs.
• Report all genes that are within 10000 bp upstream or 5000 bp downstream of CNVs.
• Report all SNPs that are within 5000 bp upstream or 1000 bp downstream of genes. Define upstream and downstream based on strand.
• closestBed
• Note: By default, if there is a tie for closest, all ties will be reported. closestBed allows overlapping features to be the closest.
• Find the closest ALU to each gene.
• Find the closest ALU to each gene, choosing the first ALU in the file if there is a tie.
$ windowBed -a CNVs.bed -b genes.bed -w 10000
$ windowBed -a CNVs.bed -b genes.bed –l 10000 –r 5000
$ windowBed -a genes.bed –b snps.bed –l 5000 –r 1000 -sw
$ closestBed -a genes.bed -b ALUs.bed
$ closestBed -a genes.bed -b ALUs.bed –t first
mergeBed, coverageBed
• mergeBed
• Merge overlapping repetitive elements into a single entry.
• Merge overlapping repetitive elements into a single entry, returning the number of entries merged.
• Merge nearby (within 1000 bp) repetitive elements into a single entry.
• coverageBed
• Compute the coverage of aligned sequences on 10 kilobase “windows” spanning the genome.
$ mergeBed -i repeatMasker.bed
$ mergeBed -i repeatMasker.bed -n
$ mergeBed -i repeatMasker.bed –d 1000
$ coverageBed -a reads.bed -b windows10kb.bed
Default Output: After each entry in B, reports: 1) The number of features in A that overlapped the B interval. 2) The number of bases in B that had non-zero coverage. 3) The length of the entry in B. 4) The fraction of bases in B that had non-zero coverage.
fastaBed, maskFastaFromBed: messing with sequences
• fastaBed
• maskFastaFromBed
Program: fastaFromBed (v2.6.1)Summary: Extract DNA sequences into a fasta file based on BED coordinates.
Usage: fastaFromBed [OPTIONS] -fi -bed -fo
Options: -fi Input FASTA file -bed BED file of ranges to extract from -fi -fo Output file (can be FASTA or TAB-delimited) -name Use the BED name field (#4) for the FASTA header -tab Write output in TAB delimited format. - Default is FASTA format.
Program: maskFastaFromBed (v2.6.1)Summary: Mask a fasta file based on BED coordinates.
Usage: maskFastaFromBed [OPTIONS] -fi -out -bed
Options: -fi Input FASTA file -bed BED file of ranges to mask in -fi -fo Output FASTA file -soft Enforce "soft" masking. That is, instead of masking with Ns, mask with lower-case bases.
Sequence Alignment Format
• With the advent of HTS technologies, several requirements arose:o need for a generic alignment format to store
read alignmentso support for short and long reads generated
by different sequencing platformso compact file sizeo random access abilityo ability to store various types of alignments
(clipped, spliced, multi-mapped, padded) • As a result, SAM (Sequence Alignment/Map)
format evolved
SAM Format - Examplehttp://samtools.sourceforge.net/SAM1.pdf
@HD VN:1.3 SO:coordinate@SQ SN:ref LN:45r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *r003 0 ref 9 30 5H6M * 0 0 AGCTAA * NM:i:1r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *r003 16 ref 29 30 6H5M * 0 0 TAGGC * NM:i:0r001 83 ref 37 30 9M = 7 -39 CAGCGCCAT *
SAM Format - Header Lines
• Header lines start with @• @ is followed by TAG• Known TAGS:@HD - Header@SQ - Reference Sequence dictionary@RG - Read Group
• Header fields are TYPE:VALUE pairs @SQ SN:ref LN:45
TAG TYPE:VALUE TYPE:VALUE• Example:@RG ID:L2 PU:SC_2_12 LB:SC_2
SM:NA12891
SAM Format - Alignment Section
• 11 mandatory fields• Tab-delimited• Optional fields (variable)
1. QNAME: Query name of the read or the read pair2. FLAG: Bitwise flag (multiple segments, unmapped, etc.)3. RNAME: Reference sequence name4. POS: 1-Based leftmost position of clipped alignment5. MAPQ: Mapping quality (Phred-scaled)6. CIGAR: Extended CIGAR string (operations: MIDNSHP)7. MRNM: Mate reference name (‘=’ if same as RNAME)8. MPOS: 1-based leftmost mate position9. ISIZE: Inferred insert size10.SEQ: Sequence on the same strand as the reference11.QUAL: Query quality (ASCII-33 = Phred base quality)
M: match/mismatchI: insertionD: deletionP: paddingN: skipS: soft-clipH: hard-clip
SAM Format - CIGAR string
Ref: GCATTCAGATGCAGTACGC
Read: ccTCAG--GCAGTAgtg
CIGAR 2S4M2D6M3S
POS 5
• Known as BAM• Binary, indexed representation of SAM• Uses BGZF (Blocked GZip Format)
compression• Storage space requirements ~27% of
original SAM• SAM/BAM can be manipulated using
SAMTOOLS package
SAM FormatCompressed Binary Version
SAMTOOLS
• SAM, and BAM format is popular for reporting read mappings onto the reference genome, SAM is human readable
• SAMTOOLS, author: Heng Li, Broadgood for manipulating SAM files
@HD VN:1.0 SO:unsorted@SQ SN:contig_27167 LN:26169GFNQG3Z01EMD0D 65 contig_27167 25164 60 3S24M = 23568 1620
CATTCTCTCTTTCTTCTTTGTGCTTCA *GFNQG3Z01EMD0D 129 contig_27167 23568 60 18M1D8M = 25164 1620
GTCAAACAACCCTCGTAGAAATATGAT *
SAMTOOLS
• Samtools manipulate SAM/BAM• Once you have the bam indexed, you can quickly access any range in the reference (remember bin indexing?)
bowtie ref -1 SRR067323_1.fastq -2 SRR067323_2.fastq --best --maxins 1000 -S | \samtools view -h -S -F 4 - > SRR067323.aligned.sam
samtools view -bS SRR067323.aligned.sam –o SRR067323.aligned.bamsamtools sort SRR067323.aligned.bam SRR067323.sortedsamtools index SRR067323.sorted.bam
SAM BAM Sorted BAM Indexed BAM
samtools view samtools sort samtools index