
Supplementary Figure 1

Neoepitope presentation pathway illustrations.

Somatic DNA mutations (1) are transcribed (2) and spliced (3); missense mutations are translated (4) and undergo processing into 9-10mer peptides (5), which are presented on the cell surface through the MHC-I pathway (6). RI neoepitopes are produced from intact DNA (1) that is transcribed (2) and undergoes defective splicing, resulting in intron retention (3). RI transcripts are translated, resulting in abnormal peptides and early termination (4). Abnormal proteins are degraded through the NMD pathway, processed into 9-10mer peptides (5), and presented on the cell surface through the MHC-I pathway (6).

Nature Biotechnology: doi:10.1038/nbt.4239


Retained intron neoepitope load is not associated with somatic neoepitope load in patient cohorts.

Scatterplots illustrate the correlation between somatic neoepitope and RI neoepitope loads, with cohort indicated by color (n = 48 patient samples). Two outliers, Hugo_Mel_PD1_Pt8 and Hugo_Mel_PD1_Pt32, are indicated on the upper plot with asterisks and are excluded from the lower plot.




Mass spectra show RI neoepitopes bound to MHC class I molecules in human cell lines.

Corresponding mass spectrometry plots for RI neoepitopes identified experimentally in complex with MHC-I for each of the cell lines shown in Fig. 2B. Experiments were repeated four times with independent measurements for cell line SK-MEL-5. Neoepitope shown had five peptide-to-spectrum matches (PSMs) and was identified in all four replicates within 1% false discovery rate (FDR). Experiments were repeated four times with independent measurements for CA46. Neoepitope shown had two PSMs and was identified in two replicates within 1% FDR. Experiments were repeated three times with independent measurements for DOHH-2. Neoepitope shown had one PSM and was identified in one replicate within 1% FDR. Experiments were repeated four times with independent measurements for HL-60. Neoepitope shown had one PSM and was identified in one replicate within 1% FDR. Experiments were repeated three times with independent measurements for THP-1. Neoepitope shown had five PSMs and was identified in all three replicates within 1% FDR.



RI neoepitope load is not significantly associated with clinical benefit from immunotherapy.

Association of RI load, neoepitope-yielding RI load, and RI neoepitope load with clinical benefit from immunotherapy in Hugo (n = 14 clinical benefit, n = 13 no clinical benefit) and Snyder (n = 8 clinical benefit, n = 13 no clinical benefit) patient cohorts. Boxplots show the median, first, and third quartiles; whiskers extend to 1.5 × the interquartile range, and outlying points are plotted individually. Two-sided Mann-Whitney U p-values were > 0.05 for all comparisons.
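The boxplot convention described in this legend (quartiles, whiskers at 1.5 × the interquartile range, outliers plotted individually) can be sketched as follows. This is a minimal illustration, not the authors' plotting code: the function names and the toy values are hypothetical, and the quartiles use linear interpolation, which is only one of several common conventions.

```python
import math


def quantile(sorted_values, q):
    """Quantile by linear interpolation between neighbouring order statistics."""
    pos = q * (len(sorted_values) - 1)
    lower = int(math.floor(pos))
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] + frac * (sorted_values[upper] - sorted_values[lower])


def boxplot_summary(values):
    """Return median, quartiles, whisker ends, and outliers for one group."""
    data = sorted(values)
    q1, median, q3 = (quantile(data, q) for q in (0.25, 0.5, 0.75))
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme observations still inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
    }


# Hypothetical toy values, not patient data.
toy_loads = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
summary = boxplot_summary(toy_loads)
print(summary)
```

In the toy data the value 100 lies beyond the upper fence, so it would be drawn as an individual point rather than covered by the whisker.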



Correlation between RI neoepitope load and markers of immune cytolytic activity.

Scatterplots illustrate expression, measured in transcripts per million (TPM), of immune cytolytic activity markers CD8A (top), GZMA (middle), and PRF1 (bottom) vs. RI neoepitope load for both patient cohorts (n = 48 patient samples). Linear trendlines with error margins (grey shaded regions) are shown; Pearson’s correlation coefficients (denoted as rho) and accompanying p-values are indicated on the plots.
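For readers who want to reproduce the kind of coefficient reported on these plots, Pearson's correlation coefficient can be computed as in the sketch below. The function name and the toy vectors are hypothetical, and the accompanying p-value (which requires a t-distribution) is deliberately left out.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / float(len(xs))
    mean_y = sum(ys) / float(len(ys))
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy values standing in for RI neoepitope load and marker TPM.
toy_load = [1, 2, 3, 4]
toy_tpm = [2, 4, 6, 8]
r_positive = pearson_r(toy_load, toy_tpm)
r_negative = pearson_r(toy_load, list(reversed(toy_tpm)))
print(r_positive, r_negative)
```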



Association between RI neoepitope load and patient clinical characteristics.

Top: Age vs. RI neoepitope load for Snyder cohort (n = 21 patient samples) and Hugo cohort (n = 27 patient samples). Linear trendlines with error margins (grey shaded regions) are shown; Pearson’s correlation coefficients (denoted as rho) and accompanying p-values are indicated on the plots. Center left: Disease status vs. RI neoepitope load for both cohorts (n = 48 patient samples). Two-sided Mann-Whitney U p-values shown. Center right: Prior MAP kinase inhibitor therapy vs. RI neoepitope load for Hugo cohort (n = 27 patient samples); data not available for Snyder cohort. Two-sided Mann-Whitney U p-values shown. Bottom left: Sex vs. RI neoepitope load for both cohorts (n = 48 patient samples). Two-sided Mann-Whitney U p-values shown. Bottom right: Time of biopsy vs. RI neoepitope load for Snyder cohort (n = 21 patient samples). Two-sided Mann-Whitney U p-values shown. All boxplots show the median, first, and third quartiles; whiskers extend to 1.5 × the interquartile range, and outlying points are plotted individually.



Patients with high RI neoepitope loads and immunotherapy nonresponders show enrichment of similar transcriptional programs.

Gene Set Enrichment Analysis (GSEA) was performed comparing top (n = 12) vs. bottom (n = 11) quartile RI neoepitope load patients and immunotherapy nonresponders (n = 10) vs. responders (n = 13). Only half of the top-quartile RI neoepitope load patients were also immunotherapy nonresponders. Enrichment of cell cycle- and DNA repair-related gene sets was seen in both high RI neoepitope load patients and immunotherapy nonresponders. Representative GSEA enrichment plots for the G2M checkpoint and Downregulation of TLX targets gene sets are shown for both comparisons (top vs. bottom quartile RI neoepitope load patients, and immunotherapy nonresponders vs. responders). FDR q-values are indicated on plots.



Human Protein Atlas samples were used to create a ‘panel of normals’ for filtering.

A ‘panel of normals’ was created using six Human Protein Atlas (HPA) skin samples (two samples each from three distinct individuals) in order to filter intron retention events likely to occur in normal tissue which would not produce RI neoantigens due to immune tolerance. A, Histogram illustrating the number of unique retained introns shared across samples. The majority of introns are retained by all six normal samples. B, UpSet visualization of set intersections of unique retained introns in each unique grouping of one sample per individual (8 total groupings). The set of 7,050 retained introns shared by all 8 groups of normal samples was denoted the final normal retained intron set and filtered from the RI neoepitope analysis of tumors.
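One way to read the grouping scheme in panel B is sketched below: with two samples from each of three individuals, choosing one sample per individual gives 2 × 2 × 2 = 8 groupings, and the final normal set is the set of introns shared by every grouping. This is an illustrative interpretation under that assumption, not the authors' code; the function name, individual labels, and intron identifiers are hypothetical.

```python
from itertools import product


def panel_of_normals(samples_by_individual):
    """samples_by_individual maps an individual to a list of sets of intron IDs.

    Returns the list of one-sample-per-individual groupings and the set of
    introns shared by every grouping.
    """
    groupings = list(product(*samples_by_individual.values()))
    shared_per_grouping = [set.intersection(*group) for group in groupings]
    final_normal_set = set.intersection(*shared_per_grouping)
    return groupings, final_normal_set


# Hypothetical toy data: three individuals, two samples each.
toy_samples = {
    "individual_1": [{"intronA", "intronB", "intronC"}, {"intronA", "intronB"}],
    "individual_2": [{"intronA", "intronC"}, {"intronA", "intronB", "intronC"}],
    "individual_3": [{"intronA", "intronB"}, {"intronA"}],
}
groupings, final_normal_set = panel_of_normals(toy_samples)
print(len(groupings), sorted(final_normal_set))
```

Under this reading, the introns shared by all eight groupings are the same as the introns present in all six samples, because every sample takes part in at least one grouping.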



Illustrative examples of false positive retained intron events detected upon manual review.

False positive retained intron events were discovered upon manual review of retained introns expressed at aberrantly high levels relative to all intronic expression (> 50 TPM in multiple samples). Likely artifactual introns were filtered from final analysis. IGV screenshots are shown illustrating representative examples. A, Read depth in intron is much higher and more uniform than in neighboring annotated exon; likely a result of transcript annotation error. B, Annotated intron-exon boundary is inconsistent with exon-intron boundary supported by manual review of raw sequencing reads and results in RI neoantigen predicted after an in-frame stop codon. C, Intron expression profile matches surrounding exons and sharply contrasts with other introns in similar region; this intron is likely included in the canonical form of the transcript but not reflected in the annotation. D, Exonic expression of one flanking exon is negligible and does not match with expression profile of other flanking exon, and read depth is low throughout most of the region; first exonic region may be mis-annotated.


Supplementary Materials for

Intron Retention is a Source of Neoepitopes in Cancer

Alicia C. Smart†, Claire A. Margolis†, Harold Pimentel, Meng Xiao He, Diana Miao, Dennis Adeegbe, Tim Fugmann, Kwok-Kin Wong, Eliezer M. Van Allen

†These authors contributed equally to this work

correspondence to: [email protected]

This PDF file includes:

Supplementary Table 4
Supplementary Table Titles and Legends, Tables 1-6
Supplementary Software File 1

Other Supplementary Materials for this manuscript include the following:

Supplementary Tables 1-3, 5-6


Supplementary Table S1. Clinical and molecular summary features from Hugo (n = 27) and Snyder (n = 21) patient cohorts. Clinical characteristics included for each patient: cohort, immunotherapy response status, type of immunotherapy. These characteristics were obtained directly from original publications for each cohort. Molecular characteristics included for each patient: total retained intron (RI) load, neoepitope-yielding RI load, RI neoepitope load, mean number of RI neoepitopes yielded by each RI, somatic neoepitope load.

Supplementary Table S2. All RI neoepitopes predicted for each patient in Hugo (n = 27) and Snyder (n = 21) cohorts. Table contains one patient neoepitope (unique peptide, HLA allele combination) per row. Fields included: Pos (position in original retained intron peptide sequence), Peptide, Intron_ID (genomic coordinates of RI yielding neoepitope), Allele (HLA Class I allele), 1-log50k (NetMHCpan prediction score), nM (NetMHCpan predicted binding affinity, measured in nM), Rank (NetMHCpan rank of predicted affinity compared to a set of random natural peptides), TPM (neoepitope expression level, measured in transcripts per million), SampleID, Gene, Strand (positive or negative genomic strand).

Supplementary Table S3. Cancer cell line RI neoepitopes that were both predicted computationally and discovered experimentally bound to MHC Class I molecules via mass spectrometry. Table contains one cell line neoepitope (unique peptide, HLA allele combination) per row. Rows colored by cell line. Fields included: Cell line, Peptide, Intron ID (genomic coordinates of RI yielding neoepitope), Gene, Strand (positive or negative genomic strand), Allele (HLA Class I allele), 1-log50k (NetMHCpan prediction score), nM (NetMHCpan predicted binding affinity, measured in nM), rank (NetMHCpan rank of predicted affinity compared to a set of random natural peptides), Expression (neoepitope expression level, measured in transcripts per million).

Supplementary Table S4. Predicted RI neoepitopes versus somatic neoepitopes with mass spectrometric evidence supporting presence on MHC-I molecules. For each cancer cell line analyzed, comparison shown between the number of computationally-predicted RI-derived neoepitopes versus computationally-predicted somatic mutation-derived neoepitopes that were experimentally proven to be bound to the cell surface in complex with MHC Class I molecules.

| Cell line | Number of computationally-predicted RI neoepitopes detected via mass spec. | Total number of computationally-predicted RI neoepitopes | Number of computationally-predicted somatic neoepitopes detected via mass spec. | Total number of computationally-predicted somatic neoepitopes |
|---|---|---|---|---|
| MeWo | 2 | 10,812 | 0 | 10,486 |
| SK-MEL-5 | 2 | 5,607 | 0 | 1,166 |
| CA46 | 1 | 10,890 | 0 | 992 |
| DOHH-2 | 2 | 9,641 | 0 | 504 |
| HL-60 | 1 | 7,413 | 0 | 416 |
| THP-1 | 1 | 10,890 | 2 | 786 |
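The counts in Supplementary Table S4 can be tallied with a short script such as the sketch below. The numbers are copied from the table; the variable names and the per-cell-line fraction printed here are illustrative additions, not quantities reported by the authors.

```python
# (RI detected, RI predicted, somatic detected, somatic predicted) per cell line,
# copied from Supplementary Table S4.
table_s4 = {
    "MeWo": (2, 10812, 0, 10486),
    "SK-MEL-5": (2, 5607, 0, 1166),
    "CA46": (1, 10890, 0, 992),
    "DOHH-2": (2, 9641, 0, 504),
    "HL-60": (1, 7413, 0, 416),
    "THP-1": (1, 10890, 2, 786),
}

for cell_line, (ri_hit, ri_all, som_hit, som_all) in table_s4.items():
    # Fraction of predicted neoepitopes with mass spectrometric support.
    print("%-9s RI %d/%d (%.4f%%)  somatic %d/%d (%.4f%%)" % (
        cell_line, ri_hit, ri_all, 100.0 * ri_hit / ri_all,
        som_hit, som_all, 100.0 * som_hit / som_all))

total_ri_detected = sum(row[0] for row in table_s4.values())
total_ri_predicted = sum(row[1] for row in table_s4.values())
total_somatic_detected = sum(row[2] for row in table_s4.values())
total_somatic_predicted = sum(row[3] for row in table_s4.values())
print(total_ri_detected, total_ri_predicted,
      total_somatic_detected, total_somatic_predicted)
```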


Supplementary Table S5. Gene Set Enrichment Analysis results for Hallmark and corresponding Founders gene sets comparing both top quartile vs. bottom quartile RI neoepitope load patients and immunotherapy responders vs. nonresponders. File contains raw Gene Set Enrichment Analysis (GSEA) results, with four tabs corresponding to Tables S5A-D. A: Hallmark gene sets, top quartile vs. bottom quartile RI neoepitope load. B: Hallmark gene sets, immunotherapy responders vs. nonresponders. C: Founders gene sets, top quartile vs. bottom quartile RI neoepitope load. D: Founders gene sets, immunotherapy responders vs. nonresponders. Founders results reported for all significantly enriched Hallmark gene sets.

Supplementary Table S6. Retained introns filtered from RI neoepitope analysis due to either (a) presence in normal skin tissue yielding likely immune tolerance, or (b) determination of false-positive nature upon manual review. File contains two tabs corresponding to Tables S6A-B. A: Introns retained in Human Protein Atlas (HPA) normal skin tissue that were filtered from RI neoepitope analysis of patient tumors due to likely host immune competence (n = 7,050). B: Introns filtered from analysis of patient tumors after manual review (n = 63).


Supplementary Software File 1. Retained intron neoepitope pipeline code, also publicly available on GitHub at https://github.com/vanallenlab/retained-intron-neoantigen-pipeline. Code below, separated by individual script (script names in bold).

README.md

# retained-intron-neoantigen-pipeline

This pipeline calls RNA-based neoantigens from intron retention events derived from RNA-Seq data and identified through the KMA package (see run instructions below for further detail on this).

To run:
- Download NetMHCPan-3.0 (http://www.cbs.dtu.dk/cgi-bin/nph-sw_request?netMHCpan) and change paths in runNetMHCpan.py file (line 62).
- Download twoBitToFa utility from UCSC genome browser (https://genome.ucsc.edu/goldenpath/help/twoBit.html) and change paths in kmaToPeptideSeqs.py file (line 173).
- Download MySQL (you will use it to query the UCSC table browser via public servers).
- Download and run KMA package (https://github.com/pachterlab/kma). The output from the KMA package will be the direct input to this pipeline.
- Change paths in shell script getNeoantigenBinders.sh (notes in file comments).
- Run getNeoantigenBinders.sh from command line as an SGE Array Job. This script is a wrapper and will call all other relevant Python scripts.

Additional notes:
- Detailed execution instructions and functionality descriptions can be found in each script header, as well as for each individual function.
- Feel free to create an Issue if errors arise.

getNeoantigenBinders_KMAKallisto.sh

# ----------------------------------------------------------------------------------------------- #
# Claire Margolis
# getNeoantigenBinders_KMAKallisto.sh
# Summary: Shell script to run a series of python scripts to go through the pipeline from
# KMA to netMHC for all patients in sample. This version is built specifically for kma-kallisto.
# *NOTE*: If you want to run this script, go through and verify that the paths to relevant files
# are in the correct format for your cohort. You will need to change out_dir.txt, among other
# things, to make the script specific to your cohort.
# ----------------------------------------------------------------------------------------------- #

# ----------------------------------------------------------------------------------------------- #
# Specify shell / UGER preferences


#!/bin/bash
#$ -cwd
#$ -m e
#$ -l h_vmem=30g
#$ -l h_rt=120:00:00
#$ -t 1-39
# ----------------------------------------------------------------------------------------------- #

# Use statements
source /broad/software/scripts/useuse
reuse Python-2.7
use MySQL-5.6
use EMBOSS
# ----------------------------------------------------------------------------------------------- #

# ----------------------------------------------------------------------------------------------- #
# Set directory paths
patient_dir=/xchip/cga_home/margolis/retainedIntron/VA_Mel_ipi/out_dir.txt
PAT_DIR=$(cat $patient_dir | head -n $SGE_TASK_ID | tail -n 1)
# ----------------------------------------------------------------------------------------------- #

# ----------------------------------------------------------------------------------------------- #
# Run splitKMA.py on each patient
# ( Splits aggregate KMA output into patient-specific KMA output )
echo 'Running splitKMA.py.'
python /xchip/cga_home/margolis/retainedIntron/goldStandard/splitKMA_KMAKallisto.py /xchip/cga_home/asmart/KMA/VA_Mel_IPI/160822_VA_Mel_IPI_KMA_IR_flat_filtered_ex41_v1.csv $PAT_DIR $PAT_DIR/kma_results.txt
# ----------------------------------------------------------------------------------------------- #

# ----------------------------------------------------------------------------------------------- #
# Run kmaToPeptideSeqs.py on each patient
# ( Converts kma output into FASTA peptide files for each retained intron )
echo 'Running kmaToPeptideSeqs.py.'
python /xchip/cga_home/margolis/retainedIntron/goldStandard/kmaToPeptideSeqs.py $PAT_DIR/kma_results.txt 9 $PAT_DIR
# ----------------------------------------------------------------------------------------------- #


# ----------------------------------------------------------------------------------------------- #
# Run runNetMHCpan.py on each patient
# ( Runs netMHCpan with retained-intron peptides and HLA-alleles specific to each patient )
echo 'Running runNetMHCpan.py.'
python /xchip/cga_home/margolis/retainedIntron/goldStandard/runNetMHCpan.py $PAT_DIR/peptideSeqsFASTA.txt ../$PAT_DIR/hla_alleles.txt $PAT_DIR
# ----------------------------------------------------------------------------------------------- #

# ----------------------------------------------------------------------------------------------- #
# Run postprocessOutput.py on each patient
# ( Processes netMHCpan output file to a more user-friendly, relevant format )
echo 'Running postprocessOutput.py.'
python /xchip/cga_home/margolis/retainedIntron/goldStandard/postprocessOutput.py $PAT_DIR/NETMHCpan_out.xls $PAT_DIR/headermap.txt $PAT_DIR $PAT_DIR
# ----------------------------------------------------------------------------------------------- #

# ----------------------------------------------------------------------------------------------- #
# Run aggregateSampleInfo.py once to aggregate all patient results
# ( Takes previous step output for all patients and aggregates into one document for whole cohort )
#echo 'Running aggregateSampleInfo.py.'
#python /xchip/cga_home/margolis/retainedIntron/goldStandard/aggregateSampleInfo.py patientnamesandresults.txt /xchip/cga_home/____/patientDirs CohortName /xchip/cga_home/____/outfilepath
# ----------------------------------------------------------------------------------------------- #

splitKMA_KMAKallisto.py

# ----------------------------------------------------------------------------------------------- #
# Claire Margolis
# splitKMA_KMAKallisto.py
# Summary: Takes in KMA file, patient ID, and outfile path, and outputs a modified
# KMA file containing only the retained intron locations that correspond to the patient of
# interest. This is meant to be run in a shell script that will batch run this for each patient
# in the cohort. Specifically designed for kma-kallisto version.


# Input format: python splitKMA.py kmaOutfile.csv patientID patientIDoutfile.csv
# Output format: patientIDoutfile.csv
# ----------------------------------------------------------------------------------------------- #

# ----------------------------------------------------------------------------------------------- #
# Import necessary packages
#!/usr/bin/python
import sys
import numpy as np
import subprocess
# ----------------------------------------------------------------------------------------------- #

# ----------------------------------------------------------------------------------------------- #
# Function: splitFile
# Inputs: original kma file, patient ID, outfile
# Returns: none (writes to file)
# Summary: Takes in original KMA file and includes only rows with patient ID matching our
# patient of choice and intron counts > 0 AND TPM > 1.
def splitFile(kmafile, patient, outfile):
    # Read in KMA file
    with open(kmafile) as f:
        lines = f.read().splitlines()
    # Open output file for writing
    out = open(outfile, 'w')
    # Write header to file
    out.write(lines[0]+'\n')
    # Loop through to get only lines that belong to our patient, write those to outfile
    for i in range(1, len(lines)):
        line = lines[i]
        currpatient = line.split(',')[2]
        currTPM = line.split(',')[3]
        currcounts = line.split(',')[7]
        currTPMfilter = line.split(',')[8]
        currPSIfilter = line.split(',')[9]
        currcountsfilter = line.split(',')[10]
        if currpatient == '"'+patient+'"' and float(currcounts) > 0 and float(currTPM) > 1:
            if str(currTPMfilter) in "TRUE" and str(currPSIfilter) in "TRUE" and str(currcountsfilter) in "TRUE":
                out.write(line+'\n')
    # Close outfile
    out.close()
    return


# ----------------------------------------------------------------------------------------------- #

# ----------------------------------------------------------------------------------------------- #
# Main function
def main():
    # Check to make sure we have the right number of inputs
    if len(sys.argv) != 4:
        print 'Error: incorrect number of inputs.'
        print 'Please input a KMA .csv file, valid patient ID, and outfile path'
        sys.exit()
    # Read in inputs
    kmafile = sys.argv[1]
    patient = sys.argv[2]
    outfile = sys.argv[3]
    # Split kma file
    splitFile(kmafile, patient, outfile)
    return

if __name__ == '__main__':
    main()
# ----------------------------------------------------------------------------------------------- #
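The per-patient filtering step performed by splitKMA_KMAKallisto.py can also be sketched in Python 3 with the standard csv module, as below. This is an illustration rather than the authors' script. The function name, the toy header, and the toy rows are hypothetical. The column positions (2 = sample, 3 = TPM, 7 = counts, 8-10 = filter flags) are taken from the listing above. Unlike the original, the csv reader strips the surrounding quotes, and the filter flags are compared by exact equality.

```python
import csv
import io


def filter_patient_rows(csv_text, patient):
    """Keep rows for one patient with counts > 0, TPM > 1, and all filters TRUE."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    kept = []
    for row in reader:
        passes_filters = all(row[i] == "TRUE" for i in (8, 9, 10))
        if (row[2] == patient and float(row[7]) > 0
                and float(row[3]) > 1 and passes_filters):
            kept.append(row)
    return header, kept


# Hypothetical toy input; real KMA output has its own column names.
toy_csv = "\n".join([
    "row,intron,sample,tpm,c4,c5,c6,counts,f_tpm,f_psi,f_counts",
    '1,"chr1:100-200","Pt1",5.0,x,x,x,12,TRUE,TRUE,TRUE',
    '2,"chr1:300-400","Pt1",0.5,x,x,x,3,TRUE,TRUE,TRUE',
    '3,"chr2:100-200","Pt2",9.0,x,x,x,8,TRUE,TRUE,TRUE',
    '4,"chr3:100-200","Pt1",4.0,x,x,x,6,TRUE,FALSE,TRUE',
])
header, kept = filter_patient_rows(toy_csv, "Pt1")
print(kept)
```

Only the first toy row survives: the second fails the TPM threshold, the third belongs to another patient, and the fourth fails one of the filter flags.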

kmaToPeptideSeqs.py

# ----------------------------------------------------------------------------------------------- #
# Claire Margolis
# kmaToPeptideSeqs.py
# Summary: Reads in KMA output .flat_filtered.csv file, extracts chromosome locations for each
# unique intron, then focuses on the most biologically relevant scenario: A nucleotide sequence
# that starts in preceding exon and has at least 1 AA in intron (we already know where these ORFs
# start and their orientation wrt the intron start site). Translates to protein until hitting a
# stop codon and then stores the sequence output in a file, as well as the list of unique introns.
# Input format: python kmaToFasta.py ____.flat_filtered.csv 9 outdirpath
#     "9" is default for netMHCI (9 AA window = 27 bases before intron start)
#     "15" is default for netMHCII (15 AA window = 45 bases before intron start)
# Output format:
#     uniqueIntronList.txt (list of unique introns in .flat_filtered.csv file)
#     peptideSeqsFASTA.txt (list of peptide sequences in FASTA format)
# ----------------------------------------------------------------------------------------------- #

# ----------------------------------------------------------------------------------------------- #
# Import necessary packages
#!/usr/bin/python
import sys
import numpy as np
import subprocess
from Bio.Seq import Seq
import bisect
# ----------------------------------------------------------------------------------------------- #

# ----------------------------------------------------------------------------------------------- #
# Function: createUniqueIntronList()
# Inputs: kma .flat_filtered.csv file, outfile path
# Returns: List of unique intron chromosomal locations, list of corresponding TPM values
# Summary: Reads in intron chromosome locations from KMA output .csv file, extracts unique
# sequences and their TPM values, saves them in an array for use in subsequent functions,
# writes them to .txt file.
def createUniqueIntronList(csvfile, outpath):
    # Read in chromosome locations and TPM values (columns 1 and 4) from KMA output file
    chromlocs = np.loadtxt(csvfile, dtype=str, delimiter=',', skiprows=1, usecols=[1,3])
    # Only extract unique chromosomal locations
    _, indices = np.unique(chromlocs[:,0], return_index=True)
    uniquelocs = chromlocs[indices,:]
    uniquelocs = np.core.defchararray.strip(uniquelocs, '"') # Strip "s from locations
    # Write contents of uniquelocs array to file (for reference later)
    filepath = outpath+'/'+'uniqueIntronList.txt'
    np.savetxt(filepath, uniquelocs, fmt='%s', delimiter='\t')
    # Return unique list of intron locations and corresponding TPM values
    return list(uniquelocs[:,0]),list(uniquelocs[:,1])
# ----------------------------------------------------------------------------------------------- #

# ----------------------------------------------------------------------------------------------- #
# Function: manualTranslate
# Inputs: FASTA sequence to translate to protein
# Returns: List of protein sequences corresponding to specific FASTA sequence


# Summary: Uses manual codon-->AA dictionary to walk through and translate FASTA sequence.
def manualTranslate(fastasequence):
    # Initialize codon table and list of peptides
    codontable = {'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
        'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
        'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
        'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
        'TAC':'Y', 'TAT':'Y', 'TAA':'*', 'TAG':'*',
        'TGC':'C', 'TGT':'C', 'TGA':'*', 'TGG':'W'}
    # Translate full length fasta sequence all the way through
    fulllengthprotein = ''
    for i in xrange(0, len(fastasequence), 3):
        codon = fastasequence[i:i+3]
        # Account for bizarre edge cases that should really never happen
        if len(codon) != 3 or codon not in codontable:
            break
        if codontable[codon] == '*': # Stop translating when we hit a stop codon
            break
        AA = codontable[codon]
        fulllengthprotein += AA
    return fulllengthprotein
# ----------------------------------------------------------------------------------------------- #

# ----------------------------------------------------------------------------------------------- #
# Function: getSeqs()
# Inputs: list of unique intron chromosome locations and their expression values, window size (
# number of AAs around start site)
# Returns: None
# Summary: Parses out chromosome and intron start location, then expands window size according to
# cases (1) and (2) requirements, calls UCSC table browser twoBitToFa() function to get
# corresponding FASTA sequences, writes to output file. Also writes corresponding file mapping
# headers
def getSeqs(intronlocs, tpms, nAAs, outpath):
    outfile = open(outpath + '/peptideSeqsFASTA.txt','a')
    headermapfile = open(outpath + '/headermap.txt','a')
    for l in range(0, len(intronlocs)):
        loc = intronlocs[l]
        tpmval = tpms[l]
        chrom = loc.split(':')[0]
        intronstart = loc.split(':')[1].split('-')[0]
        intronend = loc.split(':')[1].split('-')[1]
        # Query UCSC table browser to find whether sequence is on minus or plus strand and get ORF orientation
        sqlcommand = "SELECT strand,exonStarts,exonEnds,exonFrames FROM wgEncodeGencodeBasicV19 WHERE chrom='"+chrom+"' AND txStart<"+intronstart+" AND txEnd>"+intronend
        fullcommand = 'mysql -h genome-mysql.cse.ucsc.edu -u genome -D hg19 -A --connect_timeout=60 -e "'+sqlcommand+'"'
        tablebrowserout = subprocess.check_output(fullcommand, shell=True)
        # Catch instance where table browser doesn't think there's actually a gene in this region
        if len(tablebrowserout) < 1:
            continue
        strand = ''
        exonstarts = []
        exonends = []
        exonframes = []
        tablebrowserlist = filter(None,tablebrowserout.split('\n'))
        infostring = ''
        lengthholder = 0
        for i in range(1,len(tablebrowserlist)):
            if (len(filter(None,tablebrowserlist[i].split(','))) > lengthholder):
                infostring = tablebrowserlist[i]
                lengthholder = len(filter(None,infostring.split(',')))
        strand = infostring.split('\t')[0]
        exonstarts = filter(None,infostring.split('\t')[1].split(','))
        exonstarts = map(int, exonstarts)
        exonends = filter(None,infostring.split('\t')[2].split(','))
        exonends = map(int, exonends)
        exonframes = filter(None,infostring.split('\t')[3].split(','))
        exonframes = map(int, exonframes)
        # Check to make sure exon starts, ends, and frames list are the same length, and if there is an error, skip this intron
        if not len(exonstarts) == len(exonends) == len(exonframes):
            continue
        # Get ORF orientation at the start of the intron
        # Handle + and - strand cases separately (need to look for different positions)
        ORForientation = 0
        frame = 0
        if strand == ('+'):
            index = bisect.bisect_left(exonstarts, int(intronstart)) - 1
            frame = exonframes[index]
            if frame == 0:
                ORForientation = (int(intronstart)-exonstarts[index]) % 3
            elif frame == 1:
                ORForientation = (int(intronstart)-2-exonstarts[index]) % 3
            elif frame == 2:
                ORForientation = (int(intronstart)-1-exonstarts[index]) % 3
            else: # If frame == -1 (meaning no translation takes place according to table browser)
                continue
        else:
            index = bisect.bisect_left(exonstarts, int(intronend))
            frame = exonframes[index]
            if frame == 0:
                ORForientation = (exonends[index]-int(intronend)) % 3
            elif frame == 1:
                ORForientation = (exonends[index]-2-int(intronend)) % 3
            elif frame == 2:
                ORForientation = (exonends[index]-1-int(intronend)) % 3
            else:
                continue
        # Determine nucleotide window around which to get sequence
        wholeseqstart = 0
        wholeseqend = 0
        if strand == ('+'):
            wholeseqstart = int(intronstart) - ORForientation - (nAAs*3-3)
            wholeseqend = int(intronend) + (nAAs*3-3)
        else:
            wholeseqstart = int(intronstart) - (nAAs*3-3)
            wholeseqend = int(intronend) + ORForientation + (nAAs*3-3)
        # Get genomic sequence
        loc = chrom + ':' + str(wholeseqstart) + '-' + str(wholeseqend)
        loc = '-seq='+loc
        twobitoutput = (subprocess.check_output(['/xchip/cga_home/margolis/Packages/tableBrowser/twoBitToFa', loc,'/xchip/cga_home/margolis/General/hg19.2bit','stdout']))
        # Parse output
        seqlist = twobitoutput.split('\n')
        headerline = seqlist[0]+"|"+tpmval
        seqlist = seqlist[1:len(seqlist)-1]
        sequence = ''.join(str(elem) for elem in seqlist)
        sequence = sequence.upper()
        # Reverse complement if it's on the negative strand
        if strand == ('-'):
            sequence = str(Seq(sequence).reverse_complement())
        # Manually translate sequence
        peptide = manualTranslate(sequence)
        # Check to make sure peptide is at least "length" AAs long, and if so, write to output file
        if len(peptide) < nAAs:
            continue
        else:
            newheaderline = '>seq'+str(l)
            outfile.write(newheaderline+'\n')
            outfile.write(peptide+'\n')
            headermapfile.write(newheaderline+'\t'+headerline+'\n')
    return
# ----------------------------------------------------------------------------------------------- #

# ----------------------------------------------------------------------------------------------- #
# Main function that processes command line input and calls other functions
def main():
    # Check to make sure we have the right number of inputs
    if len(sys.argv) != 4:
        print 'Error: incorrect number of inputs.'
        print 'Please input a KMA .csv file, the AA window you want, and an outfile path.'
        sys.exit()
    # Store inputs
    kmafile = sys.argv[1]
    window = int(sys.argv[2])
    outpath = sys.argv[3]
    # Create unique intron output file
    uniqueIntrons,TPMvals = createUniqueIntronList(kmafile, outpath)
    # Create nucleotide sequences file
    getSeqs(uniqueIntrons, TPMvals, window, outpath)

if __name__ == '__main__':
    main()
# ----------------------------------------------------------------------------------------------- #
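The translate-until-stop behaviour of manualTranslate can be illustrated with the compact Python 3 sketch below. It is not the authors' implementation: the codon table is generated from the standard genetic code in T, C, A, G order rather than written out by hand, and the function name and toy sequences are hypothetical. Like the original, it stops at the first stop codon, at an incomplete codon, or at a codon containing an unrecognised base.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_until_stop(sequence):
    """Translate in frame from position 0 until a stop or unknown codon."""
    peptide = []
    for i in range(0, len(sequence) - 2, 3):
        amino_acid = CODON_TABLE.get(sequence[i:i + 3].upper())
        if amino_acid is None or amino_acid == "*":
            break
        peptide.append(amino_acid)
    return "".join(peptide)


# Hypothetical toy sequences.
print(translate_until_stop("ATGGCCTAAGGG"))  # stops at the TAA codon
print(translate_until_stop("ATGNNNGCC"))     # stops at the unknown codon NNN
```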

runNetMHCpan.py

# ----------------------------------------------------------------------------------------------- #
# Claire Margolis
# runNetMHCpan.py
# Summary: Takes in a fasta file containing all peptide sequences upon which netMHCpan is to be run,
# runs netMHCpan, writes output to a tab-delimited text file.
# Input format: python runNetMHCpan.py FASTAproteinsequences.txt HLAalleles.txt outpath
# *RELEVANT*: HLA allele input file can be in one of two formats:
#     1. Polysolver winners_hla.txt output file
#         example line from file: HLA-A hla_a_02_01_01_01 hla_a_32_01_01
#     2. Already processed, one allele per line in netMHC compatible format
#         example line from file: HLA-A02:01


# Output: netMHCpan output .xls file(s) # ----------------------------------------------------------------------------------------------- # # ----------------------------------------------------------------------------------------------- # # Import necessary packages #!/usr/bin/python import sys import numpy as np import subprocess # ----------------------------------------------------------------------------------------------- # # ----------------------------------------------------------------------------------------------- # # Function: runNetMHCIpan # Inputs: FASTA file of peptide sequences, patient HLA alleles (in a format specified above), outpath # Returns: None (netMHCIpan will automatically write output to a .xls file # Summary: Pre-processes patient HLA alleles, runs netMHCIpan. def runNetMHCIpan(pepfile, hlafile, outpath): # Read in HLA alleles file and process with open(hlafile) as f: hlalines = f.read().splitlines() hlaalleles = [] # Determine which input format the hla allele file is in if len(hlalines[0].split('\t')) <= 1: # In already pre-processed format hlaalleles = hlalines else: # Polysolver output file for line in hlalines: split = line.split('\t') # Reformat each allele (2 for each type of HLA A, B, and C) for i in range(1, 3): currallele = 'HLA-' allele = split[i] components = allele.split('_') currallele += components[1].upper() + components[2] + ':' + components[3] hlaalleles.append(currallele) hlaalleles = list(set(hlaalleles)) # Remove duplicate alleles if there are any hlastring = ','.join(hlaalleles) # Run netMHCI pan command = 'export NHOME=/xchip/cga_home/margolis/Packages/netMHCPan/netMHCpan-3.0; export NETMHCpan=/xchip/cga_home/margolis/Packages/netMHCPan/netMHCpan-3.0/Linux_x86_64; /xchip/cga_home/margolis/Packages/netMHCPan/netMHCpan-3.0/Linux_x86_64/bin/netMHCpan -a '+hlastring+' -f '+pepfile+' -inptype 0 -l 9,10 -s -xls -xlsfile '+outpath+'/NETMHCpan_out.xls -allname


/xchip/cga_home/margolis/Packages/netMHCPan/netMHCpan-3.0/Linux_x86_64/data/allelenames -hlapseudo /xchip/cga_home/margolis/Packages/netMHCPan/netMHCpan-3.0/Linux_x86_64/data/MHC_pseudo.dat -t 500 -version /xchip/cga_home/margolis/Packages/netMHCPan/netMHCpan-3.0/data/version -tdir /xchip/cga_home/margolis/Packages/netMHCPan/netMHCpan-3.0/scratch/XXXXXX -rdir /xchip/cga_home/margolis/Packages/netMHCPan/netMHCpan-3.0/Linux_x86_64/ > '+outpath+'/netMHCpanout.txt' subprocess.call(command, shell=True) return # ----------------------------------------------------------------------------------------------- # # ----------------------------------------------------------------------------------------------- # # Main function def main(): # Check to make sure we have the right number of inputs if len(sys.argv) != 4: print 'Error: incorrect number of inputs.' print 'Please input a FASTA file, a HLAalleles.txt file, and an outpath.' sys.exit() # Parse inputs fasta = sys.argv[1] alleles = sys.argv[2] outpath = sys.argv[3] runNetMHCIpan(fasta, alleles, outpath) return if __name__ == '__main__': main() # ----------------------------------------------------------------------------------------------- #
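The allele pre-processing in runNetMHCIpan can be exercised separately from the netMHCpan call. The following is a minimal sketch rather than the authors' code: the function name parse_hla_lines is hypothetical, and it assumes that Polysolver lines are tab-delimited with the locus label in the first field, as the header comment of runNetMHCpan.py describes. Unlike the script, it sorts the de-duplicated alleles so that the result is deterministic, and it runs under Python 2 or 3.

```python
def parse_hla_lines(lines):
    """Return netMHCpan-style allele names (e.g. 'HLA-A02:01') from either input format."""
    lines = [ln.strip() for ln in lines if ln.strip()]
    if not lines:
        return []
    # One field per line: already in netMHCpan-compatible form.
    if len(lines[0].split('\t')) <= 1:
        return sorted(set(lines))
    # Otherwise assume Polysolver output: locus label, then two alleles per line.
    alleles = set()
    for line in lines:
        fields = line.split('\t')
        for raw in fields[1:3]:
            parts = raw.split('_')  # e.g. ['hla', 'a', '02', '01', '01', '01']
            alleles.add('HLA-' + parts[1].upper() + parts[2] + ':' + parts[3])
    return sorted(alleles)


if __name__ == '__main__':
    # Hypothetical Polysolver-style lines, for illustration only.
    polysolver = ['HLA-A\thla_a_02_01_01_01\thla_a_32_01_01',
                  'HLA-B\thla_b_07_02_01\thla_b_07_02_01']
    print(','.join(parse_hla_lines(polysolver)))
```

The comma-joined string printed at the end has the same form as the value the script passes to the -a option.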

postprocessOutput.py

# ----------------------------------------------------------------------------------------------- #
# Claire Margolis
# postprocessOutput.py
#
# Summary: Takes in NETMHCpan_out.xls (tab-delimited text file) and processes to create a more
#     user-friendly output format.
# Input format: python postprocessOutput.py NETMHCpan_out.xls headermapfile patientID outpath
# Output format: processedNETMHCpan_out.txt
# ----------------------------------------------------------------------------------------------- #

# ----------------------------------------------------------------------------------------------- #
# Import necessary packages
#!/usr/bin/python
import sys
import numpy as np
import subprocess
# ----------------------------------------------------------------------------------------------- #

# ----------------------------------------------------------------------------------------------- #
# Function: processFile
# Inputs: netMHCpan output file, header map file, patient ID, outpath
# Returns: None (writes to file)
# Summary: Postprocesses the netMHCpan output to eliminate useless rows, change data format from
#     wide to long, add allele name columns. Writes new data to a file in outpath.
def processFile(filename, headerfile, patID, outpath):
    # Read in header map file and make into dictionary
    headerdict = {}
    with open(headerfile) as f:
        for line in f:
            (key, val) = line.split('\t')
            headerdict[key[1:len(key)]] = val.strip()[1:len(val.strip())]
    # Read in first line of file to get number and names of alleles
    with open(filename, 'r') as f:
        alleles = f.readline().strip().split('\t')
    alleles = filter(None, alleles)
    data = np.loadtxt(filename, dtype='S100', delimiter='\t', skiprows=2)
    nrow = data.shape[0]
    ncol = data.shape[1]
    # Remove all rows with last column (NB) == 0
    for i in range(0, nrow):
        data[i,-1] = data[i,-1].strip()
    data = data[data[:,-1] != '0']
    # Move columns so data is in long form
    nrow = data.shape[0]
    listofarrays = [] # Will store all allele-specific arrays
    initcols = data[:,0:3] # Initial three columns that are common to all HLA alleles
    for i in range(0, len(alleles)):
        currstartcol = (3*(i+1))+i
        currendcol = currstartcol+4
        currarray = data[:,currstartcol:currendcol]
        listofarrays.append(currarray)
    datalong = np.vstack(tuple(listofarrays))
    # Add initial columns and allele column into data frame
    # Allele column
    allelevec = []
    for i in range(0, len(alleles)):
        currnewcol = [alleles[i]]*nrow
        allelevec.extend(currnewcol)
    datalong = np.insert(datalong, 1, allelevec, axis=1) # Add allele column to datalong
    # Initial columns
    initcollist = []
    for i in range(0, len(listofarrays)):
        initcollist.append(initcols)
    initcolstoappend = np.vstack(tuple(initcollist))
    updateddata = np.concatenate((initcolstoappend, datalong), axis=1)
    # Eliminate any rows that have a rank above 2
    toremove = []
    updatednrows = updateddata.shape[0]
    for i in range(0, updatednrows):
        if float(updateddata[i,7]) > 2:
            toremove.append(i)
    updateddata = np.delete(updateddata, toremove, 0)
    # Re-map IDs to headers and add TPM column
    tpmvec = []
    for i in range(0, updateddata.shape[0]):
        currval = headerdict[updateddata[i,2]]
        updateddata[i,2] = currval.split('|')[0]
        tpmvec.append(currval.split('|')[1])
    finaldata = np.column_stack(tuple([updateddata,tpmvec]))
    # Write updated data to new file
    outfilepath = outpath+'/'+patID+'processedNETMHCpan_out.txt'
    np.savetxt(outfilepath, finaldata, fmt='%s', delimiter='\t', comments='', header='Pos\tPeptide\tID\tcore\tallele\t1-log50k\tnM\tRank\tTPM')
    return
# ----------------------------------------------------------------------------------------------- #

# ----------------------------------------------------------------------------------------------- #
# Main function
def main():
    # Check to make sure we have the right number of inputs
    if len(sys.argv) != 5:
        print 'Error: incorrect number of inputs.'
        print 'Please input a netMHCpan output file, a header map file, patient ID, and valid outfile path'
        sys.exit()
    # Read in inputs
    netmhcfile = sys.argv[1]
    headermapfile = sys.argv[2]
    patientID = sys.argv[3]
    outfilepath = sys.argv[4]
    # Process netMHCpan file
    processFile(netmhcfile, headermapfile, patientID, outfilepath)
    return

if __name__ == '__main__':
    main()
# ----------------------------------------------------------------------------------------------- #
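The wide-to-long step in processFile depends on the column layout that the script indexes into: three shared columns (Pos, Peptide, ID) followed by four columns per allele (core, 1-log50k, nM, Rank). A minimal pure-Python sketch of the same reshaping and rank filter is given below. It is illustrative only and not the authors' implementation: the function name wide_to_long and the toy row are hypothetical, any trailing summary columns are assumed to follow the per-allele blocks, and the initial removal of rows with NB == 0 and the header re-mapping are omitted.

```python
def wide_to_long(rows, alleles, max_rank=2.0):
    """Reshape wide rows to one row per (peptide, allele), keeping Rank <= max_rank.

    Each input row is assumed to be: Pos, Peptide, ID, then for every allele
    core, 1-log50k, nM, Rank, then any trailing summary columns.
    """
    long_rows = []
    for i, allele in enumerate(alleles):
        start = 3 + 4 * i  # same offset as (3*(i+1))+i in processFile
        for row in rows:
            core, score, nm, rank = row[start:start + 4]
            if float(rank) <= max_rank:
                # Output order matches the script's header:
                # Pos, Peptide, ID, core, allele, 1-log50k, nM, Rank
                long_rows.append(list(row[:3]) + [core, allele, score, nm, rank])
    return long_rows


if __name__ == '__main__':
    # Made-up values for illustration; not real netMHCpan output.
    alleles = ['HLA-A02:01', 'HLA-B07:02']
    rows = [['0', 'AAAAAAAAA', 'seq1',
             'AAAAAAAAA', '0.5', '200.0', '1.5',
             'AAAAAAAAA', '0.1', '20000.0', '30.0',
             '0.3', '1']]
    for r in wide_to_long(rows, alleles):
        print('\t'.join(r))
```

Filtering inside the loop avoids building the full stacked array before discarding rows, at the cost of diverging from the script's stack-then-delete order of operations; the retained rows are the same under the stated layout assumptions.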

