Pipelines


Program: input (keyboard, file, or pipe) -> output (screen, file, or pipe)

The “echo” program writes the text given as its arguments to the output; it does not read from the input

echo: arguments -> output (screen, file, or pipe)

The “cat” program reads text from the input and writes this to the output

cat: input (keyboard, file, or pipe) -> output (screen, file, or pipe)

echo uniprot_sprot_plants.fasta

uniprot_sprot_plants.fasta

cat uniprot_sprot_plants.fasta

>sp|Q43495|108_SOLLC Protein 108 OS=Solanum lycopersicum PE=2 SV=1
MASVKSSSSSSSSSFISLLLLILLVIVLQSQVIECQPQQSCTASLTGLNVCAPFLVPGSPTASTECCNAVQSINHDCMCNTMRIAAQIPAQCNLPPLSCSAN
>sp|Q9XHP0|11S2_SESIN 11S globulin seed storage protein 2 OS=Sesamum indicum PE=2 SV=1
MVAFKFLLALSLSLLVSAAIAQTREPRLTQGQQCRFQRISGAQPSLRIQSEGGTTELWDERQEQFQCAGIVAMRSTIRPNGLSLPNYHPSPRLVYIERGQGLISIMVPGCAETYQVHRSQRTMERTEASEQQDRGSVRDLHQKVHRLRQGDIVAIPSGAAHWCYNDGSEDLVAVSINDVNHLSNQLDQKFRAFYLAGGVPRSGEQEQQARQTFHNIFRAFDAELLSEAFNVPQETIRRMQSEEEERGLIVMARERMTFVRPDEEEGEQEHRGRQLDNGLEETFCTMKFRTNVESRREADIFSRQAGRVHVVDRNKLPILKYMDLSAEKGNLYSNALVSPDWSMTGHTIVYVTRGDAQVQVVDHNGQALMNDRVNQGEMFVVPQYYTSTARAGNNGFEWVAFKTTGSPMRSPLAGYTSVIRAMPLQVITNSYQISPNQAQALKMNRGSQSFLLSPGGRRS
>sp|P19084|11S3_HELAN 11S globulin seed storage protein G3 OS=Helianthus annuus GN=HAG3 PE=3 SV=1
MASKATLLLAFTLLFATCIARHQQRQQQQNQCQLQNIEALEPIEVIQAEAGVTEIWDAYDQQFQCAWSILFDTGFNLVAFSCLPTSTPLFWPSSREGVILPGCRRTYEYSQEQQFSGEGGRRGGGEGTFRTVIRKLENLKEGDVVAIPTGTAHWLHNDGNTELVVVFLDTQNHENQLDENQRRFFLAGNPQAQAQSQQQQQRQPRQQSPQRQRQRQRQGQGQNAGNIFNGFTPELIAQSFNVDQETAQKLQGQNDQRGHIVNVGQDLQIVRPPQDRRSPRQQQEQATSPRQQQEQQQGRRGGWSNGVEETICSMKFKVNIDNPSQADFVNPQAGSIANLNSFKFPILEHLRLSVERGELRPNAIQSPHWTINAHNLLYVTEGALRVQIVDNQGNSVFDNELREGQVVVIPQNFAVIKRAN
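
The difference between the two commands can be tried on a small stand-in file. This is a minimal sketch, assuming a POSIX shell; the file name demo.fasta and its two records are made up for illustration and are not taken from uniprot_sprot_plants.fasta.

```shell
# Create a tiny two-record FASTA file in the current directory
# (hypothetical demo data; ">" sends the output of printf to the file).
printf '%s\n' \
  '>sp|X00001|DEMO1_ARATH Demo protein 1 OS=Arabidopsis thaliana' \
  'MASVK' \
  '>sp|X00002|DEMO2_SOLLC Demo protein 2 OS=Solanum lycopersicum' \
  'MVAFKFL' > demo.fasta

# echo writes its argument to the output: only the file name appears.
echo demo.fasta

# cat writes the content of the file to the output: all four lines appear.
cat demo.fasta
```

echo never opens the file; it only repeats the text it was given on the command line.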

The “grep” program filters the input for given terms and writes the filtered text to the output

grep: input (keyboard, file, or pipe) -> output (screen, file, or pipe)

grep --help

Usage: grep [OPTION]... PATTERN [FILE] ...
Search for PATTERN in each FILE or standard input.
Example: grep -i 'hello world' menu.h main.c

Regexp selection and interpretation:
  -E, --extended-regexp     PATTERN is an extended regular expression
  -F, --fixed-strings       PATTERN is a set of newline-separated strings
  -G, --basic-regexp        PATTERN is a basic regular expression
  -P, --perl-regexp         PATTERN is a Perl regular expression
  -e, --regexp=PATTERN      use PATTERN as a regular expression
  -f, --file=FILE           obtain PATTERN from FILE
  -i, --ignore-case         ignore case distinctions
  -w, --word-regexp         force PATTERN to match only whole words
  -x, --line-regexp         force PATTERN to match only whole lines
  -z, --null-data           a data line ends in 0 byte, not newline

grep sp uniprot_sprot_plants.fasta

>sp|Q43495|108_SOLLC Protein 108 OS=Solanum lycopersicum PE=2 SV=1
>sp|Q9XHP0|11S2_SESIN 11S globulin seed storage protein 2 OS=Sesamum indicum PE=2 SV=1
>sp|P19084|11S3_HELAN 11S globulin seed storage protein G3 OS=Helianthus annuus GN=HAG3 PE=3 SV=1
>sp|P13744|11SB_CUCMA 11S globulin subunit beta OS=Cucurbita maxima PE=1 SV=1
>sp|Q05349|12KD_FRAAN Auxin-repressed 12.5 kDa protein OS=Fragaria ananassa PE=2 SV=1
>sp|O23878|13S1_FAGES 13S globulin seed storage protein 1 OS=Fagopyrum esculentum GN=FA02 PE=2 SV=1
>sp|O23880|13S2_FAGES 13S globulin seed storage protein 2 OS=Fagopyrum esculentum GN=FA18 PE=2 SV=1
>sp|Q9XFM4|13S3_FAGES 13S globulin seed storage protein 3 OS=Fagopyrum esculentum GN=FAGAG1 PE=1 SV=1
>sp|P83004|13SB_FAGES 13S globulin basic chain OS=Fagopyrum esculentum PE=1 SV=1
>sp|P48347|14310_ARATH 14-3-3-like protein GF14 epsilon OS=Arabidopsis thaliana GN=GRF10 PE=2 SV=1
>sp|P93207|14310_SOLLC 14-3-3 protein 10 OS=Solanum lycopersicum GN=TFT10 PE=2 SV=2
>sp|Q9S9Z8|14311_ARATH 14-3-3-like protein GF14 omicron OS=Arabidopsis thaliana GN=GRF11 PE=2 SV=1
>sp|Q9C5W6|14312_ARATH 14-3-3-like protein GF14 iota OS=Arabidopsis thaliana GN=GRF12 PE=2 SV=1
>sp|P42643|14331_ARATH 14-3-3-like protein GF14 chi OS=Arabidopsis thaliana GN=GRF1 PE=1 SV=3
>sp|P49106|14331_MAIZE 14-3-3-like protein GF14-6 OS=Zea mays GN=GRF1 PE=1 SV=1
>sp|Q84J55|14331_ORYSJ 14-3-3-like protein GF14-A OS=Oryza sativa subsp. japonica GN=GF14A PE=2 SV=1
>sp|P85938|14331_PSEMZ 14-3-3-like protein 1 (Fragments) OS=Pseudotsuga menziesii PE=1 SV=1
>sp|P93206|14331_SOLLC 14-3-3 protein 1 OS=Solanum lycopersicum GN=TFT1 PE=3 SV=2
>sp|Q41418|14331_SOLTU 14-3-3-like protein OS=Solanum tuberosum PE=2 SV=1
>sp|Q01525|14332_ARATH 14-3-3-like protein GF14 omega OS=Arabidopsis thaliana GN=
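
Some of the options listed in the help text can be tried on a small stand-in file. This is a minimal sketch, assuming a grep that supports the -i and -w options shown above; demo.fasta and its records are made up for illustration. The last command anchors the pattern with ^, which is one way to select FASTA header lines explicitly instead of relying on the text "sp".

```shell
# Tiny two-record FASTA file (hypothetical demo data).
printf '%s\n' \
  '>sp|X00001|DEMO1_ARATH Demo protein 1 OS=Arabidopsis thaliana' \
  'MASVK' \
  '>sp|X00002|DEMO2_SOLLC Demo protein 2 OS=Solanum lycopersicum' \
  'MVAFKFL' > demo.fasta

# Plain search: case-sensitive, matches anywhere in a line.
grep thaliana demo.fasta

# -i ignores case, so ARABIDOPSIS also matches "Arabidopsis".
grep -i ARABIDOPSIS demo.fasta

# -w matches whole words only: "Sol" is just a part of "Solanum",
# so grep prints nothing and the fallback message is shown instead.
grep -w Sol demo.fasta || echo "no whole-word match for Sol"

# ^ anchors the pattern to the start of the line: only header lines match.
grep '^>' demo.fasta
```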

Redirection

By placing a “>” with a file name at the end of the command line the output can be redirected to a file.

grep sp uniprot_sprot_plants.fasta > out.txt

The “wc” program counts lines or characters in the input and writes the count to the output

wc: input (keyboard, file, or pipe) -> output (screen, file, or pipe)

wc -l uniprot_sprot_plants.fasta

250177 uniprot_sprot_plants.fasta

wc -l out.txt

33851 out.txt
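
The same two steps, redirecting and then counting, can be tried on a small stand-in file. This is a minimal sketch; demo.fasta and demo_out.txt are made-up names, and the records are illustrative only.

```shell
# Tiny two-record FASTA file (hypothetical demo data).
printf '%s\n' \
  '>sp|X00001|DEMO1_ARATH Demo protein 1 OS=Arabidopsis thaliana' \
  'MASVK' \
  '>sp|X00002|DEMO2_SOLLC Demo protein 2 OS=Solanum lycopersicum' \
  'MVAFKFL' > demo.fasta

# Send the matching lines to a file instead of the screen.
# By default ">" replaces an existing demo_out.txt without asking.
grep sp demo.fasta > demo_out.txt

# Count the lines in the original file (4) and in the filtered file (2).
wc -l demo.fasta
wc -l demo_out.txt
```

The line count of the filtered file equals the number of header lines, which is how the 33851 above should be read: the number of lines in uniprot_sprot_plants.fasta that contain "sp".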

Creating a pipeline

With the “|” character the output of one program can be linked to the input of another program

Pipeline: input -> grep -> output/input -> wc -> output

grep sp uniprot_sprot_plants.fasta | wc -l

33851

grep sp uniprot_sprot_plants.fasta | grep thaliana

>sp|P48347|14310_ARATH 14-3-3-like protein GF14 epsilon OS=Arabidopsis thaliana GN=GRF10 PE=2 SV=1
>sp|Q9S9Z8|14311_ARATH 14-3-3-like protein GF14 omicron OS=Arabidopsis thaliana GN=GRF11 PE=2 SV=1
>sp|Q9C5W6|14312_ARATH 14-3-3-like protein GF14 iota OS=Arabidopsis thaliana GN=GRF12 PE=2 SV=1
>sp|P42643|14331_ARATH 14-3-3-like protein GF14 chi OS=Arabidopsis thaliana GN=GRF1 PE=1 SV=3
>sp|Q01525|14332_ARATH 14-3-3-like protein GF14 omega OS=Arabidopsis thaliana GN=GRF2 PE=1 SV=2
>sp|P42644|14333_ARATH 14-3-3-like protein GF14 psi OS=Arabidopsis thaliana GN=GRF3 PE=1 SV=2
>sp|P46077|14334_ARATH 14-3-3-like protein GF14 phi OS=Arabidopsis thaliana GN=GRF4 PE=1 SV=2
>sp|P42645|14335_ARATH 14-3-3-like protein GF14 upsilon OS=Arabidopsis thaliana GN=GRF5 PE=1 SV=2
>sp|P48349|14336_ARATH 14-3-3-like protein GF14 lambda OS=Arabidopsis thaliana GN=GRF6 PE=1 SV=1
>sp|Q96300|14337_ARATH 14-3-3-like protein GF14 nu OS=Arabidopsis thaliana GN=GRF7 PE=1 SV=1
>sp|P48348|14338_ARATH 14-3-3-like protein GF14 kappa OS=Arabidopsis thaliana GN=GRF8 PE=2 SV=2
>sp|Q96299|14339_ARATH 14-3-3-like protein GF14 mu OS=Arabidopsis thaliana GN=GRF9 PE=1 SV=2
>sp|Q9LQ10|1A110_ARATH Probable aminotransferase ACS10 OS=Arabidopsis thaliana GN=ACS10 PE=2 SV=1
>sp|Q9S9U6|1A111_ARATH 1-aminocyclopropane-1-carboxylate synthase 11 OS=Arabidopsis thaliana GN=ACS11 PE=1 SV=1
>sp|Q8GYY0|1A112_ARATH Probable aminotransferase ACS12 OS=Arabidopsis thaliana GN=ACS12 PE=2 SV=2
>sp|Q06429|1A11_ARATH 1-aminocyclopropane-1-carboxylate synthase-like protein 1 OS=Arabidopsis thaliana GN=ACS1 PE=1 SV=2
>sp|Q06402|1A12_ARATH 1-aminocyclopropane-1-carboxylate synthase 2 OS=Arabidopsis thaliana GN=ACS2 PE=1 SV=1
>sp|Q43309|1A14_ARATH 1-aminocyclopropane-1-carboxylate synthase 4 OS=Arabidopsis thaliana GN=ACS4 PE=1 SV=1
>sp|Q37001|1A15_ARATH 1-aminocyclopropane-1-carboxylate synthase 5 OS=Arabidopsis thaliana GN=ACS5 PE=1 SV=1
>sp|Q9SAR0|1A16_ARATH 1-aminocyclopropane-1-carboxylate synthase 6 OS=Arabidopsis thaliana GN=ACS6 PE=1 SV=2
>sp|Q9STR4|1A17_ARATH 1-aminocyclopropane-1-carboxylate synthase 7 OS=Arabidopsis thaliana GN=ACS7 PE=1 SV=1
>sp|Q9T065|1A18_ARATH 1-aminocyclopropane-1-carboxylate synthase 8 OS=Arabidopsis thaliana GN=ACS8 PE=1 SV=1
>sp|Q9M2Y8|1A19_ARATH 1-aminocyclopropane-1-carboxylate synthase 9 OS=Arabidopsis thaliana GN=ACS9 PE=1 SV=1
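
Pipelines are not limited to two programs. This is a minimal sketch of a two-stage and a three-stage pipeline on a small stand-in file; demo.fasta and its three records are made up for illustration.

```shell
# Tiny three-record FASTA file (hypothetical demo data).
printf '%s\n' \
  '>sp|X00001|DEMO1_ARATH Demo protein 1 OS=Arabidopsis thaliana' \
  'MASVK' \
  '>sp|X00002|DEMO2_SOLLC Demo protein 2 OS=Solanum lycopersicum' \
  'MVAFKFL' \
  '>sp|X00003|DEMO3_ARATH Demo protein 3 OS=Arabidopsis thaliana' \
  'MASKATLL' > demo.fasta

# Two stages: keep the header lines, then count them (prints 3).
grep sp demo.fasta | wc -l

# Three stages: header lines -> only those mentioning thaliana -> count (prints 2).
grep sp demo.fasta | grep thaliana | wc -l
```

Compared with redirecting to out.txt and then counting that file, the pipeline gives the same number without leaving an intermediate file behind.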

Program: stdin (pipe or keyboard) -> stdout (pipe or screen)

Special output channel for error messages

Program: stdin (pipe or keyboard) -> stdout (pipe or screen); stderr (error messages)

grep sp uniprot_sprot_plants.fas > out.txt

grep: uniprot_sprot_plants.fas: No such file or directory
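
The error message still reaches the screen because ">" redirects only stdout. This is a minimal sketch of that behaviour, assuming a POSIX shell; the file names are made up, and the "2>" operator, which redirects stderr, is an addition not shown on the slides.

```shell
# The input file name is deliberately wrong, so grep fails.
# ">" redirects stdout only: the error message is still shown on the screen.
# "|| true" just lets a script continue although grep reports a failure.
grep sp no_such_file.fasta > demo_out.txt || true

# "2>" redirects stderr (channel number 2) to a file of its own.
grep sp no_such_file.fasta > demo_out.txt 2> demo_err.txt || true

# Nothing was written to stdout, so demo_out.txt is empty (0 bytes).
wc -c demo_out.txt

# The error message ended up in demo_err.txt.
cat demo_err.txt
```

Keeping the two channels apart means an error message can never end up mixed into the data that a pipeline or an output file receives.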

EMBOSS

"European Molecular Biology Open Software Suite"

http://emboss.sourceforge.net/

Toolbox with bioinformatics applications

http://emboss.bioinformatics.nl/

wossname "open reading frame"

Finds programs by keywords in their short description
SEARCH FOR 'OPEN READING FRAME'
getorf   Finds and extracts open reading frames (ORFs)
plotorf  Plot potential open reading frames in a nucleotide sequence

wossname documentation

Finds programs by keywords in their short description
SEARCH FOR 'DOCUMENTATION'
tfm  Displays full documentation for an application

tfm getorf

getorf

Function

Finds and extracts open reading frames (ORFs)

Description

This program finds and outputs the sequences of open reading frames (ORFs) in one or more nucleotide sequences. An ORF may be defined as a region of a specified minimum size between two STOP codons, or between a START and a STOP codon. The ORFs can be output as the nucleotide sequence or as the protein translation. Optionally, the program will output the region around the START codon, the first STOP codon, or the final STOP codon of an ORF. The START and STOP codons are defined in a Genetic Code table; a suitable table can be selected for the organism you are investigating. The output is a sequence file containing predicted open reading frames longer than the minimum size, which defaults to 30 bases (i.e. 10 amino acids).

Command line options

All EMBOSS programs have a number of command line options. To get started:

-help    Get help
-stdout  Write to standard output
-filter  Read stdin, write stdout

getorf -help

Standard (Mandatory) qualifiers:
  [-sequence]  seqall     Nucleotide sequence(s) filename and optional format, or reference (input USA)
  [-outseq]    seqoutall  [<sequence>.<format>] Protein sequence set(s) filename and optional format (output USA)

Additional (Optional) qualifiers:
  -table    menu     [0] Code to use (Values: 0 (Standard); 1 (Standard (with alternative initiation codons)); 2 (Vertebrate Mitochondrial); 3 (Yeast Mitochondrial); 4 (Mold, Protozoan, Coelenterate Mitochondrial and Mycoplasma/Spiroplasma); 5 (Invertebrate Mitochondrial); 6 (Ciliate Macronuclear and Dasycladacean); 9 (Echinoderm Mitochondrial); 10 (Euplotid Nuclear); 11 (Bacterial); 12 (Alternative Yeast Nuclear); 13 (Ascidian Mitochondrial); 14 (Flatworm Mitochondrial); 15 (Blepharisma Macronuclear); 16 (Chlorophycean Mitochondrial); 21 (Trematode Mitochondrial); 22 (Scenedesmus obliquus); 23 (Thraustochytrium Mitochondrial))
  -minsize  integer  [30] Minimum nucleotide size of ORF to report (Any integer value)

cat example1.fasta | getorf -filter -find 1

>BTBSCRYR_1 [72 - 110] Bovine mRNA for lens beta-s-crystallin...
MTAIATVQISTCT
>BTBSCRYR_2 [11 - 544] Bovine mRNA for lens beta-s-crystallin...
MSKAGTKITFFEDKNFQGRHYDSDCDCADFHMYLSRCNSIRVEGGTWAVYERPNFAGYMYILPRGEYPEYQHWMGLNDRLSSCRAVHLSSGGQYKLQIFEKGDFNGQMHETTEDCPSIMEQFHMREVHSCKVLEGAWIFYELPNYRGRQYLLDKKEYRKPVDWGAASPAVQSFRRIVE
>BTBSCRYR_3 [159 - 590] Bovine mRNA for lens beta-s-crystallin...
MKGPILLGTCTSYPGASILSTSTGWASTTASAPAGLFTCLVEASISFRSLRKGILMVRCMRPRKTALPSWSSSTCGRSTPVRCWRAPGSSMSCPTTEAGSTCWTRRSTGSPSTGVQLPQLSSLSAALWSDDTDAAKRWLALSSK
>BTBSCRYR_4 [547 - 603] Bovine mRNA for lens beta-s-crystallin...
MIQMRPNAGWPCHPNKHYK
>BTBSCRYR_5 [618 - 445] (REVERSE SENSE) Bovine mRNA for lens beta-s-crystallin...
MPIVLFIMLIWMTRPASVWPHLYHHSTMRRKDWTAGEAAPQSTGFRYSFLSSRYCLPR
>BTBSCRYR_6 [381 - 331] (REVERSE SENSE) Bovine mRNA for lens beta-s-crystallin...
MWNCSMMEGQSSVVSCI
>BTBSCRYR_7 [337 - 197] (REVERSE SENSE) Bovine mRNA for lens beta-s-crystallin...
MHLTIKIPFLKDLKLILASTRQVNSPAGAEAVVEAHPVLVLRILAPG
>BTBSCRYR_8 [192 - 73] (REVERSE SENSE) Bovine mRNA for lens beta-s-crystallin...
MYMYPAKLGLSYTAQVPPSTLMELQRLRYMWKSAQSQSLS

Exercise

Make a pipeline that reports (only) the size in residues of the longest protein in this file:

uniprot_sprot_plants.fasta

It can be done using these applications as building blocks: sizeseq, nthseq, pepstats, grep, cut

http://main.g2.bx.psu.edu/