FUNPOPGEN Altrans Manual Version 1.1.02 Halit Ongen 12/9/2012 Contains the Altrans installation, method, usage examples, options, and files.

Contents

I. Installation
    Windows
    Linux and Mac OS
II. Introduction
III. Usage Examples
IV. Worked Example
V. Options
    Important Options:
        -a|--annotation
        -A|--anchor-length
        -b|--bam-file
        -c|--check-proper-pairing
        -e|--min-exon-length
        -E|--single-end
        -i|--max-intron-length
        -m|--mapping-quality
        -n|--no-clipping
        -o|--output-dir
        -p|--output-prefix
        -r|--read-length
        -s|--split-reads
        -u|--use-unpaired-split
        -x|--max-clip
    Other Options:
        -B|--bin-size
        -C|--use-inter-chr
        -D|--use-diff-strand
        -d|--distribution-file
        -f|--norm-file
        -F|--merge-files
        -g|--gene-types
        -h|-?|--help
        -H|--has-header
        -I|--ignore-prob-groups
        -j|--print-bed
        -J|--probability-file
        -k|--print-combined
        -l|--convert
        -M|--use-multi-level
        -N|--no-norm-file
        -P|--no-fractions
        -R|--no-raw-counts
        -S|--silent
        -t|--trim
        -v|--verbose
        -V|--version
        -U|--use-slots
        -w|--working-dir
        -W|--force-skip
        -X|--regex
        -y|--skip-fusion
        -Y|--normalize
        -z|--max-reads-dist
        -Z|--covar
VI. Output and other files
    Fragment size distribution file
    Master fragment sizes file
    Forward file
    Reverse file
    Combined file
    Raw forward file
    Raw reverse file
    2norm file
    Norm file
    Extended BED file
    Log file
    Converted file
    Probability file

I. Installation

Windows

You can use the binary at http://sourceforge.net/projects/altrans/files, which has been compiled under Cygwin. To compile it yourself, download the latest Cygwin at http://www.cygwin.com/ and follow the instructions provided for Linux and Mac OS. Note that after you have compiled the source under Cygwin, running it as a native Windows executable requires the relevant Cygwin DLLs (for Cygwin 1.7.16-1: cyggcc_s-1.dll, cygstdc++-6.dll, cygwin1.dll, cygz.dll) to be in your PATH or in the same directory as the Altrans binary.

Linux and Mac OS

This has been tested and works with gcc 4.2 or newer. For other compilers you will have to edit the makefiles.

Unzip and untar:

tar -xzvf altrans.vX.X.XX.tar.gz

Compile:

cd altrans
make

This should compile, with warnings that can be ignored, and create the altrans binary under the bin/ directory. Precompiled binary distributions will also be available at http://sourceforge.net/projects/altrans/files for certain flavours of these OSs; however, there is no guarantee that these will run everywhere.

Mac binary was tested and runs on Mac OS X 10.6 & 10.7.

Both linux binaries were tested and run on Ubuntu 8.04 & 10.04 & 12.04, Fedora 14 & 17, SuSE 9.3, and CentOS 5.8.

The linux binary should run on any modern Linux distribution that has the standard C & C++ and zlib libraries.

To clean:

make clean


II. Introduction

Altrans is a method for the relative quantification of splicing events. It requires a BAM alignment file from an RNA-seq experiment and an annotation file in GTF format detailing the location of the exons in the genome. To count “links” between two exons, it uses paired end reads where one mate maps to one exon and the other mate to a different exon, and/or split reads spanning exon-exon junctions. Overlapping exons are grouped into “exon groups”, and the unique portions of each exon in an exon group are identified and used when assigning reads to an exon. The link counts ascertained from unique regions are normalized by the probability of observing such a link given the insert size distribution; the normalized value is referred to as link coverage. Finally, the quantitative metric produced is the fraction of one link’s coverage over the sum of the coverages of all the links that the first exon makes. The algorithm is as follows:

1. Group overlapping exons from annotation into exon groups. Transcript level information is ignored and exons with exactly the same coordinates belonging to multiple transcripts are treated as one unique exon.

2. Identify unique portion(s) of each exon in an exon group. Exons with immediate unique portions are called “level 1 exons”. In order to assign reads to exons with no unique positions, remove the level 1 exons from the exon group, determine pseudo-unique positions for the remaining exons, and increment the level of these exons. Iterate through this process until all exons in a group have unique or pseudo-unique portions. Use these portions to assign mate pairs to a link. In the following figure the dark boxes are constitutive parts of an exon, light boxes are unique portions of an exon depicted with subscript u followed by the level of the exon, and the empty boxes are pseudo-unique portions of an exon again depicted with subscript u followed by the level of the exon.

[Figure: exons of Transcript 1 (T1-E1, T1-E2, T1-E3) and Transcript 2 (T2-E1, T2-E2, T2-E3), and the resulting exon groups (Group 1, Group 2, Group 3) with exons E1, E2, E3 and E4.]

3. In exon groups where step 2 fails to identify unique or pseudo-unique portions for all the exons, remove “unifying exons” from the analysis and repeat step 2.

4. For exon groups where there are non-overlapping exons, use the insert size distribution to assign pseudo-counts to certain links between non-overlapping exons.

5. Normalize the link counts determined from the unique portions to calculate a “link coverage”. There are two normalization types implemented. The default is to divide the link counts with the probability of observing such a link given the insert size distribution. The second method involves calculating the number of slots an exon link has given the insert size distribution’s mode, i.e. the most frequent insert size.

[Figure for step 2: exon groups 1 to 3 with exons E1 to E4; labelled portions E1u,2, E2u,1, E3u,1 and E4u,1; a mate pair linking E2 to E3, a mate pair linking E1 to E3, and a split read linking E3 to E4.]

[Figure for step 3: exons E1, E2 and E3.] E3 shares its start position with E1 and its end position with E2 and therefore is a unifying exon. These types of exons are removed from the exon groups, thus they are not part of the analysis.

[Figure for step 4: exons E1, E2 and E3 with a mate pair shown in red.] A mate pair like the one shown in red here can be linking E1 to E2, or it can be originating from E3 only. In order to resolve this, a pseudo count is assigned to the E1-E2 link which is the probability of observing the insert size when E1 and E2 are linked over the sum of this probability and the probability of observing the insert size when the mate pair is originating from E3 only.

• Default method:

[Figure: exons E2 and E3 with their unique portions E2u and E3u.]

Minimum insert size linking these exons = 10

Maximum insert size linking these exons = 20

Link Count = 15

$C_{E2-E3}$ = 15 [Link Count] / 0.8 [Probability of observing insert sizes from 10 to 20] = 18.75 [Link Coverage]

• Slots method:

[Figure: two linked exons drawn over positions numbered 1 to 12.] With a read length of 3 and an insert size of 4, there are 3 slots (shown in black) that link the exons above.

$C_{E2-E3}$ = 15 [Link Count] / 3 [Number of Slots] = 5 [Link Coverage]

6. In a given window size, consider all pairings of each exon in an exon group with all other exon group exons. Links between level 1 exons can be calculated directly whereas links between higher level exons are calculated by subtracting coverage of all the other lower level links from the pseudo-coverage of these exons.

[Figure: Group 1 with exons E1 (E1u,1), E2 (E2u,2) and E3 (E3u,1); Group 2 with exons E4 (E4u,1) and E5 (E5u,1).] In this figure the darker boxes are constitutive parts of an exon, lighter boxes are unique portions of an exon depicted with subscript u followed by the level of the exon, and the empty boxes are pseudo-unique portions of an exon again depicted with subscript u followed by the level of the exon. Given these two exon groups the link coverages are calculated in the following way:

$$
\begin{aligned}
C_{E1 \to E4} &\approx C_{E1_{u,1} \to E4_{u,1}} \\
C_{E1 \to E5} &\approx C_{E1_{u,1} \to E5_{u,1}} \\
C_{E3 \to E4} &\approx C_{E3_{u,1} \to E4_{u,1}} \\
C_{E3 \to E5} &\approx C_{E3_{u,1} \to E5_{u,1}} \\
C_{E2 \to E4} &\approx C_{E2_{u,2} \to E4_{u,1}} - C_{E1 \to E4} - C_{E3 \to E4} \\
C_{E2 \to E5} &\approx C_{E2_{u,2} \to E5_{u,1}} - C_{E1 \to E5} - C_{E3 \to E5}
\end{aligned}
$$

7. Calculate the fraction of one exon link as the coverage of the link over the sum of the coverages of all the links that the first exon makes.

$$
F_{E_i E_j} = \frac{C_{E_i E_j}}{\sum_{n=i+1}^{L} C_{E_i E_n}}
$$

where $L$ is the last exon index and $\mathrm{Group}_{E_j} \neq \mathrm{Group}_{E_i}$.

[Figure: exons E1, E2 and E3.]

$F_{E1-E2}$ = 5 [coverage of E1-E2] / ( 5 [coverage of E1-E2] + 3 [coverage of E1-E3] ) = 0.625

$F_{E1-E3}$ = 3 [coverage of E1-E3] / ( 5 [coverage of E1-E2] + 3 [coverage of E1-E3] ) = 0.375

8. Repeat step 7 in both 5’-to-3’ (forward) and 3’-to-5’ (reverse) directions to capture splice acceptor and donor effects respectively.
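The arithmetic in steps 5 and 7 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Altrans implementation: the function names and the toy distribution are hypothetical, insert size frequencies are passed in directly rather than read from a fragment size distribution file, and the number of slots is taken as an input instead of being derived from exon coordinates.

```python
def link_coverage_default(link_count, insert_size_freq, min_insert, max_insert):
    """Default method: divide the link count by the probability of observing
    an insert size between min_insert and max_insert (inclusive).

    insert_size_freq maps insert size -> observed frequency (hypothetical input).
    """
    total = sum(insert_size_freq.values())
    in_range = sum(freq for size, freq in insert_size_freq.items()
                   if min_insert <= size <= max_insert)
    if total == 0 or in_range == 0:
        raise ValueError("no probability mass for this link")
    return link_count / (in_range / total)


def link_coverage_slots(link_count, number_of_slots):
    """Slots method: divide the link count by the number of slots."""
    return link_count / number_of_slots


def link_fractions(coverages):
    """Step 7: coverage of each link over the summed coverage of all links
    that the same first exon makes.

    coverages maps link name -> link coverage for one first exon.
    """
    total = sum(coverages.values())
    return {link: coverage / total for link, coverage in coverages.items()}


# Toy distribution in which 80 of 100 observed inserts lie between 10 and 20,
# matching the 0.8 probability used in the default method example.
toy_insert_freq = {5: 20, 10: 40, 20: 40}

print(link_coverage_default(15, toy_insert_freq, 10, 20))
print(link_coverage_slots(15, 3))
print(link_fractions({"E1-E2": 5, "E1-E3": 3}))
```

With the numbers from the examples above, the three calls reproduce the link coverages of about 18.75 and 5 and the fractions 0.625 and 0.375.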


III. Usage Examples

Before you run altrans, please check all the option defaults and make sure they make sense for your specific needs. Options like --single-end, --split-reads, and --read-length are not detected automatically, so you need to specify them yourself.

The default options are for a paired end experiment with a read length of 49 bp that contains no split read mapping. We include all mate pairs with a mapping quality ≥10 which are correctly oriented on the same chromosome separated by a maximum distance of 1,000,000 bp. The aligner used soft clips reads and uses a minimum alignment length of 20 bp. The fragment length distribution of mate pairs is determined from exons ≥300bp in length:

altrans --bam-file yourBamFile.bam --annotation gencode.v6.gtf.sorted --mapping-quality 10 --max-intron-length 1000000 --read-length 49 --min-exon-length 300 --max-clip 29

The most basic usage involves supplying just a BAM and an annotation file:

altrans --bam-file yourBamFile.bam --annotation gencode.v6.gtf.sorted

If you would like to only include genes that are protein coding or lincRNAs:

altrans --bam-file yourBamFile.bam --annotation gencode.v6.gtf.sorted --gene-types protein_coding lincRNA

If you would like to use all the links, print all the files there are to print, and already have a fragment size distribution file:

altrans --bam-file yourBamFile.bam --annotation gencode.v6.gtf.sorted --use-diff-strand --use-inter-chr --print-bed --print-combined --verbose --distribution-file fragmentSizes.fragment_sizes

If you want to write to local drives on each node and then copy the result files to shared storage and would like to use a prefix for your output files:

altrans --bam-file yourBamFile.bam --annotation gencode.v6.gtf.sorted --output-dir /sharedFolder/altrans --working-dir /scratch/local/weekly --output-prefix myResults

If you have samples with mixed read length, for example 75 and 76, and you would like to analyse everything with the shortest read length, and also if you have split read alignments, e.g. GEM or TopHat, and would like to check for proper pairing:

altrans --bam-file yourBamFile.bam --annotation gencode.v6.gtf.sorted --mapping-quality 150 --check-proper-pairing --read-length 75 --split-reads --anchor-length 1 --trim

If you want to normalize read counts before calculating the fractions:

altrans --bam-file yourBamFile.bam --annotation gencode.v6.gtf.sorted --no-fractions

##Merge all the 2norm files produced from the individual runs. For example if you have file names like sample1_sorted.bam.2norm:

altrans --merge-files *.2norm --output-prefix allSamples --regex "(.+)_.+"

##At this point normalize with the method of your choice or the integrated normalization method, and produce a norm file with positive counts.


altrans --normalize allSamples.2norm --covar yourCovariatesFile.txt --output-prefix yourNormFile

##Create a master fragment sizes file. For example if you have file names like sample1_sorted.bam.fragment_sizes:

ls -1 *.fragment_sizes | awk '{t=$1;sub(".*\\/", "", t); sub("_.*","",t);print t,$1}' | tr ' ' '\t' > masterDistFile

##Run altrans again to calculate fractions from normalized counts

altrans --norm-file yourNormFile.norm --annotation gencode.v6.gtf.sorted --distribution-file masterDistFile --probability-file yourProbFile.probability
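For readers who prefer not to use the ls/awk one-liner, the master fragment sizes file can also be assembled with a short Python script. The sketch below is an assumption-laden equivalent, not part of Altrans: the function names are hypothetical, the sample name is taken to be everything before the first underscore of the file name (as in the awk command), and paths are made absolute because the master file expects the full path of each fragment size distribution file.

```python
import glob
import os


def master_fragment_size_lines(paths):
    """Return 'sample<TAB>absolute path' lines for the given
    *.fragment_sizes files."""
    lines = []
    for path in sorted(paths):
        # Sample name = file name up to the first underscore,
        # e.g. sample1_sorted.bam.fragment_sizes -> sample1
        sample = os.path.basename(path).split("_", 1)[0]
        lines.append(f"{sample}\t{os.path.abspath(path)}")
    return lines


def write_master_file(directory, out_path):
    """Collect every *.fragment_sizes file in directory into out_path."""
    paths = glob.glob(os.path.join(directory, "*.fragment_sizes"))
    with open(out_path, "w") as handle:
        for line in master_fragment_size_lines(paths):
            handle.write(line + "\n")


# Example call (writes a file, so it is left commented out):
# write_master_file(".", "masterDistFile")
```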

IV. Worked Example

##Assumes a unix like environment with gcc 4.2 or later.

#get the latest altrans
wget http://sourceforge.net/projects/altrans/files/altrans.vX.X.XX/altrans.vX.X.XX.tar.gz
tar zxvf altrans.vX.X.XX.tar.gz

#compile
cd altrans
make

#get the sample dataset and extract
wget http://sourceforge.net/projects/altrans/files/sampleDataset.tar.gz
tar zxvf sampleDataset.tar.gz
cd sampleDataset

#run altrans for both BAM files

../bin/altrans --bam-file Sample1Tissue1_chr22_sorted.bam --annotation gencode.v10.annotation.gtf.chr22.sorted --read-length 75 --output-prefix Sample1Tissue1_chr22 --mapping-quality 150 --check-proper-pairing --split-reads

../bin/altrans --bam-file Sample1Tissue2_chr22_sorted.bam --annotation gencode.v10.annotation.gtf.chr22.sorted --read-length 75 --output-prefix Sample1Tissue2_chr22 --mapping-quality 150 --check-proper-pairing --split-reads

#merge the forward and reverse files

../bin/altrans --merge-files *.forward --regex "(.+)_.+" --output-prefix allSamples.forward

../bin/altrans --merge-files *.reverse --regex "(.+)_.+" --output-prefix allSamples.reverse

V. Options

Defaults for all the options are given in parentheses.

Important Options:

-a|--annotation Annotation file containing the exons in GTF format (http://genome.ucsc.edu/FAQ/FAQformat.html#format4). This file MUST be sorted first by chromosome then by start position. If the file is unsorted or you are unsure, sort it by

sort -k1,1 -k4,4g filename > filename.sorted

in *nix systems. If you want to use the -g|--gene-types option to include a subset of gene types then the "gene_type" and “transcript_type” attributes have to be set for all the exons in the file. (Required unless -F|--merge-files or -Y|--normalize)
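If you are unsure whether an annotation file is already sorted, a quick check along the following lines may help before running the sort command. This is a rough sketch with a hypothetical function name; it only verifies that records of the same chromosome are contiguous and that start positions (GTF column 4) never decrease within a chromosome, and it does not check that the chromosome order itself matches what sort would produce.

```python
def gtf_is_sorted(lines):
    """Return True if GTF records are grouped by chromosome and the start
    position never decreases within a chromosome."""
    seen_chromosomes = set()
    previous_chrom, previous_start = None, None
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        chrom, start = fields[0], int(fields[3])
        if chrom != previous_chrom:
            if chrom in seen_chromosomes:
                return False  # chromosome block is split in two
            seen_chromosomes.add(chrom)
        elif start < previous_start:
            return False  # start position went backwards
        previous_chrom, previous_start = chrom, start
    return True


# Example use on a file (file name taken from the usage examples):
# with open("gencode.v6.gtf.sorted") as handle:
#     print(gtf_is_sorted(handle))
```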

-A|--anchor-length Minimum number of bases required on either side of a splice junction for split reads. Only used when the -s|--split-reads option is provided. (1)
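To make the anchor length concrete, the sketch below counts the aligned bases on each side of every junction (N operation) in a CIGAR string and compares them with the anchor length. It is an illustration only: the function names are hypothetical, only M, = and X operations are counted as aligned bases, and how Altrans itself treats insertions, deletions or clipping near a junction is not stated in this manual.

```python
import re

CIGAR_OPERATION = re.compile(r"(\d+)([MIDNSHP=X])")


def junction_anchors(cigar):
    """Return a (left, right) pair of aligned base counts for each N
    operation in the CIGAR string."""
    blocks = [0]  # aligned bases in each block between junctions
    for length, operation in CIGAR_OPERATION.findall(cigar):
        if operation == "N":
            blocks.append(0)
        elif operation in ("M", "=", "X"):
            blocks[-1] += int(length)
    return list(zip(blocks[:-1], blocks[1:]))


def passes_anchor(cigar, anchor_length=1):
    """True if every junction has at least anchor_length aligned bases on
    both sides. Reads without a junction trivially pass."""
    return all(left >= anchor_length and right >= anchor_length
               for left, right in junction_anchors(cigar))


print(junction_anchors("30M500N19M"))
print(passes_anchor("48M1000N1M", anchor_length=2))
```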

-b|--bam-file Alignments in BAM format (http://samtools.sourceforge.net/SAM1.pdf). This is required unless you are inputting a normalized link counts file with the -f|--norm-file option, in which case it is ignored. (Required unless -f|--norm-file or -F|--merge-files or -Y|--normalize)

-c|--check-proper-pairing Require the mate pairs to be properly paired according to the aligner, as determined from the bitwise flag of the BAM file. This may or may not be a good idea depending on the aligner used. The default behaviour is to ignore this flag and use pairs if both mates are mapped with a mapping quality greater than or equal to -m|--mapping-quality, are in the correct orientation, and are separated by less than or equal to -i|--max-intron-length bases. (false)

-e|--min-exon-length While determining fragment size distribution, only mate pairs where both mates map inside the same exon with a size greater than or equal to this, are included. In order not to bias the distribution in favour of small fragment sizes, a value at least twice that of the expected fragment size is suggested. (300)

-E|--single-end The alignment contains single end reads. These reads have to be split mapped for altrans to work. If given then fragment length distribution calculation is skipped and -s|--split-reads and -u|--use-unpaired-split options are automatically set. (false)

-i|--max-intron-length Maximum distance (bp) between the mate pairs. This is only used when the -c|--check-proper-pairing option is not given; otherwise it is ignored. (1000000)

-m|--mapping-quality Any read with a mapping quality below this threshold is not included in the analysis. (10)

-n|--no-clipping Soft clipping of reads is not allowed by the aligner, hence only reads where the complete read length is aligned are mapped. (false)


-o|--output-dir Output directory, MUST exist. (./)

-p|--output-prefix Prefix to use for output files. If provided with the -f|--norm-file option then this gets appended to the sample name. (-b|--bam-file or -f|--norm-file or the first file in the merge list)

-r|--read-length Read length. (49)

-s|--split-reads Alignment contains split reads. (false)

-u|--use-unpaired-split CURRENTLY NOT USED. Use valid mapped split reads where one mate is mapped but the other is not. Although these would normally fail the pairing criterion, they may still be used since they contain information even as an unpaired read. (false)

-x|--max-clip Maximum clipping length (bp), this is ignored if -n|--no-clipping is given. (29)

Other Options:

-B|--bin-size The length (bp) of the bins that the genome is divided into for matching a position to an exon. Higher numbers decrease memory usage but increase running time. Memory gained from adjusting this is minimal so don’t modify unless memory is in real short supply. (1000)

-C|--use-inter-chr Include links generated when both of the mates are properly mapped but align to exons on different chromosomes. (false)

-D|--use-diff-strand Include links generated when both of the mates are properly mapped but align to exons on different strands. (false)

-d|--distribution-file You can provide 2 types of files with this option, both must be tab separated. (Required if -f|--norm-file)

If you are using the -b|--bam-file option then provide a Fragment size distribution file. Since the fragment size distribution is required before reads can be assigned to exons, if this file is not provided the BAM file is read twice, once to determine the fragment size distribution and once to assign reads to exons.

If you are using the -f|--norm-file option then you MUST provide a Master fragment sizes file. All the samples MUST be in this file.

-f|--norm-file Provide a Norm file in which case instead of reading a BAM file and assigning counts to links, the program will calculate link fractions using these normalized counts. When this option is given you are required to provide a Master fragment sizes file with the -d|--distribution-file option and a Probability file with -J|--probability-file.


-F|--merge-files Merges the provided files, which can be of type Forward file, Reverse file, Combined file, Raw forward file, Raw reverse file, 2norm file, Norm file, or Converted file. See also: -H|--has-header and -X|--regex.

-g|--gene-types A space separated list of gene types that are allowed in the analysis. In order to use this option the "gene_type" and “transcript_type” attributes have to be set in your annotation GTF. If given, then BOTH the "gene_type" and “transcript_type” attributes for a particular exon must match the provided types in order for it to be included in the analysis. (include all types)

-h|-?|--help Print the help message and exit.

-H|--has-header Only used when -F|--merge-files is given. The files to be merged have header lines. Disables -X|--regex. (false)

-I|--ignore-prob-groups Do not include groups from which certain exon(s) were removed since they were unifying exon(s), i.e. an exon that overlaps at least 2 other exons and has no unique portions. (false)

-j|--print-bed Print an Extended BED file for the paired reads which pass the -m|--mapping-quality threshold. (false)

-J|--probability-file Probability file. (Required when -f|--norm-file)

-k|--print-combined Print a Combined file where instead of dividing the links of a primary exon into forward and reverse directions, the fractions are computed using all the links a primary exon makes. (false)

-l|--convert Convert a 2norm file or a Norm file, which contains exon IDs rather than exon names, into the long format and exit. (false)

-M|--use-multi-level Use reads that map to unique portions of multiple exons and default to the longest covered exon. If you believe these reads disagree with the annotation then they should be ignored, otherwise they are mapping errors and should be included. Generally there are so few of these that they can be ignored. (false)

-N|--no-norm-file Do not print out a 2norm file. (false)

-P|--no-fractions Do not print out the Forward file, the Reverse file, and the Combined file.

-R|--no-raw-counts Do not print out the Raw forward file and the Raw reverse file.

-S|--silent Do not print file processing progress. (false)


-t|--trim Auto trim reads longer than -r|--read-length to read length. This is useful if you have sequenced samples with multiple read lengths and would like to treat them as a different read length on the fly. Trimming is accomplished by editing the CIGAR string. (false)

-v|--verbose Print extra information about the exon groups and the corresponding exons to the Log file. Each information line starts with a specific string where lines starting with G describe a particular group and lines starting with E list the member exon(s) details, and the format of the lines is as follows:

GB GroupChromosome GroupStart(0-based) GroupEnd(1-based) GroupID

GI NoExonsInGroup GroupLength LengthOfTheLongestExon N(normal)|WG(problematic group) N(normal)|NO(non-overlapping exons found)

GE SpaceSeparatedListOfExonIDs

E ExonChromosome ExonStart(0-based) ExonEnd(1-based) ExonID UsedExonName RealExonName(s) N(normal)|UE(unifying exon) N(normal)|DS(same coordinates but different strands) strand length UniqueRegionStart(relative to group start, 0-based):UniqueRegionEnd(relative to group start, 0-based)

-V|--version Print the version and exit. (false)

-U|--use-slots Use slots when calculating coverage for each exon, as opposed to using the probability of observing the link. (false)

-w|--working-dir First write the output files to this directory and move the files to the -o|--output-dir directory when finished. This is useful if you are using a cluster and would like to write to local storage in each node and move the files to shared storage, potentially improving performance. This directory MUST exist. (-o|--output-dir)

-W|--force-skip Skip all reads that are not in -r|--read-length length. Use carefully since you may end up with no reads. (false)

-X|--regex Used only when -F|--merge-files is in effect and the files do not have headers.

Regular expression used to extract sample names from file names. The sample name is the part of the file name matched by the first () group of the regular expression. For example, with the default setting and a BAM file called UC93T_120311_7.sorted.bam, the sample name extracted is UC93T. If there is no match for the regex then the whole file name is used. When specifying this option please enclose it in double quotes. ((.+)_\\d{6}_\\d.+)
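The sample name extraction can be tried out in Python before running a merge. This is only an approximation of the behaviour described above: the function name is hypothetical, Python's re module stands in for whatever regular expression engine Altrans uses, and the doubled backslashes in the default are assumed to be escaping, so the pattern is written here as (.+)_\d{6}_\d.+.

```python
import re


def sample_name(file_name, pattern=r"(.+)_\d{6}_\d.+"):
    """Return the first () group of pattern matched against file_name,
    or the whole file name when there is no match."""
    match = re.search(pattern, file_name)
    return match.group(1) if match else file_name


print(sample_name("UC93T_120311_7.sorted.bam"))            # UC93T
print(sample_name("sample1_sorted.bam.2norm", r"(.+)_.+"))  # sample1
```

Because (.+) is greedy, a pattern such as (.+)_.+ keeps everything up to the last underscore, which is worth checking when file names contain several underscores.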

-y|--skip-fusion Fusion reads generated by tophat are not currently supported. You can skip these reads by giving this option. The tophat version you are using needs to add the “XF” tag for fusion reads. (false)


-Y|--normalize Normalize a merged 2norm file (see -F|--merge-files) with all the covariates given in -Z|--covar. The samples in this file and in -Z|--covar must match exactly. The method used is multiple linear regression in log space, $\log(y_i + 0.1) = \beta_0 + \beta_1 x_{i,1} + \dots + \beta_k x_{i,k} + \epsilon_i$, which guarantees positive counts. The residuals from this regression are transformed into counts and added to a link’s estimated mean to produce the final counts, $e^{e_i + \beta_0}$, where $e_i$ is the residual for sample $i$. (Required when -Z|--covar)
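The regression step can be illustrated for a single link. This is a minimal sketch under stated assumptions, not the Altrans implementation: it fits ordinary least squares through the normal equations, handles numeric covariates only (no factors), and returns the exponential of residual plus intercept, following the printed formula. The function names are hypothetical.

```python
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def normalize_link(counts, covariates):
    """Normalize the raw counts of one link across samples.

    counts: one raw count per sample.
    covariates: one list of numeric covariate values per sample.
    """
    design = [[1.0] + list(row) for row in covariates]  # intercept column first
    log_y = [math.log(y + 0.1) for y in counts]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, log_y)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * x for b, x in zip(beta, r)) for r in design]
    residuals = [y - f for y, f in zip(log_y, fitted)]
    # Residual plus intercept, back-transformed; always positive.
    return [math.exp(e + beta[0]) for e in residuals]
```

Because the result is an exponential, every normalized count is positive regardless of the covariate values, which is the property the option description relies on.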

-z|--max-reads-dist When reading a BAM file to determine fragment size distribution, stop when this many mate pairs are counted. (use all)

-Z|--covar A tab delimited file containing the covariates to be used in normalization. This file must contain a header. Each row is a sample; the first column is the sample name, followed by the covariate(s). If a covariate column contains a non-numeric value then the column is treated as a factor. (Required when -Y|--normalize)
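A covariate file can be checked before use with a short script. This is a minimal sketch, assuming the header has one label per column (including the sample name column) and that "numeric" means anything Python's float() accepts; Altrans may decide differently. The column names and values in the example are made up.

```python
import csv
import io


def _is_number(text):
    """True if the text can be read as a floating point number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def read_covariates(handle):
    """Return the sample names and, per covariate column, 'numeric' or 'factor'."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    rows = [row for row in reader if row]
    samples = [row[0] for row in rows]
    column_kinds = {}
    for j, name in enumerate(header[1:], start=1):
        numeric = all(_is_number(row[j]) for row in rows)
        column_kinds[name] = "numeric" if numeric else "factor"
    return samples, column_kinds


# Hypothetical file contents, for illustration only.
example = io.StringIO("sample\tage\tbatch\nS1\t34\tA\nS2\t51\tB\n")
samples, kinds = read_covariates(example)
```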

VI. Output and other files

Fragment size distribution file The first line of this file contains the running options used. The rest of the file is a tab separated text file with two columns, where the first column is a fragment size (fragment size = insert size + 2 * read length) and the second column is the frequency of this fragment size. Each line is a different fragment size; the sizes MUST start from 0 and MUST be sorted and continuous.
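The constraints on this file (sizes starting at 0, sorted, with no gaps) are easy to check for a hand-made distribution. This is a minimal sketch and not part of Altrans; the function name is hypothetical, and the frequency column is not interpreted because the manual does not say whether it holds counts or proportions.

```python
def check_fragment_distribution(lines):
    """Return a list of problems found; an empty list means the checks pass.

    lines: all lines of the file, the first being the running options line.
    """
    problems = []
    expected = 0  # sizes must start at 0 and increase by 1 per line
    for number, line in enumerate(lines[1:], start=2):
        if not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 2:
            problems.append(f"line {number}: expected 2 columns, found {len(columns)}")
            continue
        try:
            size = int(columns[0])
        except ValueError:
            problems.append(f"line {number}: fragment size is not an integer")
            continue
        if size != expected:
            problems.append(f"line {number}: expected fragment size {expected}, found {size}")
        expected = size + 1
    return problems
```

For example, a file whose data lines list sizes 0, 1, 2 passes, while one that jumps from 0 to 2 is reported once for the missing size 1.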

Master fragment sizes file A tab separated text file with two columns where the first column is a sample name and the second column is the FULL path of the Fragment size distribution file for that sample. All samples in the analysis, each as a separate line, MUST be present in this file.

Forward file A tab separated text file with five columns: link name, link gene, chromosome, TSS, fraction of this link in the forward direction. Each line is a different link. This is a main output file.

Reverse file A tab separated text file with five columns: link name, link gene, chromosome, TSS, fraction of this link in the reverse direction. Each line is a different link. This is a main output file.

Combined file A tab separated text file with five columns: link name, link gene, chromosome, TSS, fraction of this link in both directions. Each line is a different link.

Raw forward file A tab separated text file with five columns: link name, exon group ID, chromosome, TSS, raw count of this link in forward direction. Each line is a different link. There may be more links in this file than the corresponding forward file since this file lists all the links observed rather than the links with positive normalized coverages.

Raw reverse file A tab separated text file with five columns: link name, exon group ID, chromosome, TSS, raw count of this link in reverse direction. Each line is a different link. There may be more links in this file than the corresponding reverse file since this file lists all the links observed rather than the links with positive normalized coverages.

2norm file A tab separated file with the following columns: exon1 ID, exon2 ID, chromosome, TSS, number of links between exon1 and exon2. Each line is a different link. You can merge the individual 2norm files together and normalize these raw counts with a method of your choice producing a Norm file. This file can also be used as a raw counts file for the Combined file after conversion.

Norm file A tab separated file with the following columns: exon1 ID, exon2 ID, chromosome, TSS, followed by normalized positive counts for each sample where each sample is a different column. Each line is a different link. This file must contain a header.

Extended BED file A tab separated file with at least 15 columns:

Column 1: a 15 character long state string, where a 1 in each of the positions below signifies:

1: A pair that links two exons
2: A pair that passes QC and aligns to known exons
3: A pair that aligns to non-exonic parts of the genome
4: A pair that partially aligns to known exons
5: A pair that does not agree with annotation although it is exonic
6: A pair that is in a single exon group but does not agree with any of the group’s exons
7: A pair that aligns to unique regions of multiple exons
8: An exon cannot be found for this pair although it is in an exon group
9: A pair that fails mapping
10: A pair linking exons on different chromosomes
11: A junction pair
12: An unknown junction
13: A split read within the same exon
14: A split within the same exon group
15: A pair which is not properly paired

Columns 2-13: Standard columns of a BED file.

Column 14: Insert start position (1-based)

Column 15: Insert end position (1-based)

Column 16 (optional): Exon ID of this pair’s assignment.
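The state string in column 1 can be decoded into readable labels. This is a minimal sketch, assuming position 1 is the leftmost character of the string (the manual does not say which end the numbering starts from); the names STATE_MEANINGS and decode_state are hypothetical.

```python
# Meanings of positions 1 to 15, in the order listed in the manual.
STATE_MEANINGS = [
    "links two exons",
    "passes QC and aligns to known exons",
    "aligns to non-exonic parts of the genome",
    "partially aligns to known exons",
    "exonic but does not agree with annotation",
    "in a single exon group but agrees with none of its exons",
    "aligns to unique regions of multiple exons",
    "no exon found although in an exon group",
    "fails mapping",
    "links exons on different chromosomes",
    "junction pair",
    "unknown junction",
    "split read within the same exon",
    "split within the same exon group",
    "not properly paired",
]


def decode_state(state):
    """Return the meanings of all positions set to 1 in a 15 character state string."""
    if len(state) != len(STATE_MEANINGS):
        raise ValueError(f"expected {len(STATE_MEANINGS)} characters, got {len(state)}")
    return [meaning for flag, meaning in zip(state, STATE_MEANINGS) if flag == "1"]


# Hypothetical state string with positions 1 and 2 set.
flags = decode_state("110000000000000")
```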

Log file Contains the same information that is printed to the screen.

Converted file A tab separated text file with the following columns: link name, comma separated exon group IDs, comma separated exon strands, comma separated exon chromosomes, followed by counts for each sample where each sample is a different column. Each line is a different link.

Probability file A tab separated file with the following columns: exon1 ID, exon2 ID, chromosome, TSS, number of slots and probability of observing this link. Each line is a different link.
