
Bioinformatics: Theory and Practice (生物信息学理论和实践)
Jijun Tang (唐继军) jtang@cse.sc 13928761660


www.cse.sc.edu/~jtang/BJFU

Homework

• GTTGCAGCAATGGTAGACTCAACGGTAGCAATAACTGCAGGACCTAGAGGAAAAACAGTAGGGATTAATAAGCCCTATGGAGCACCAGAAATTACAAAAGATGGTTATAAGGTGATGAAGGGTATCAAGCCTGAA
• Why does a BLAST search with default settings return no results for this sequence? Which options need to be chosen instead?
• What are the most recent PubMed articles on the related species?


DNA Sequencing capability has grown exponentially

Doubling time = 18 months

DNA sequences in GenBank


BLAST Algorithm


Sample Multiple Alignment

Bioinformatics Paradigm

• Find the data
• Download the data
• Reformat the data

• Collect the samples
• Run molecular analysis
• Filter the data

• Run analysis software
• Collect and sort results
• Publish / Data sharing

Multi-Sequence FASTA file

>FBpp0074027 type=protein; loc=X:complement(16159413..16159860,16160061..16160497); ID=FBpp0074027; name=CG12507-PA; parent=FBgn0030729,FBtr0074248; dbxref=FlyBase:FBpp0074027,FlyBase_Annotation_IDs:CG12507-PA,GB_protein:AAF48569.1,GB_protein:AAF48569; MD5=123b97d79d04a06c66e12fa665e6d801; release=r5.1; species=Dmel; length=294;
MRCLMPLLLANCIAANPSFEDPDRSLDMEAKDSSVVDTMGMGMGVLDPTQPKQMNYQKPPLGYKDYDYYLGSRRMADPYGADNDLSASSAIKIHGEGNLASLNRPVSGVAHKPLPWYGDYSGKLLASAPPMYPSRSYDPYIRRYDRYDEQYHRNYPQYFEDMYMHRQRFDPYDSYSPRIPQYPEPYVMYPDRYPDAPPLRDYPKLRRGYIGEPMAPIDSYSSSKYVSSKQSDLSFPVRNERIVYYAHLPEIVRTPYDSGSPEDRNSAPYKLNKKKIKNIQRPLANNSTTYKMTL
>FBpp0082232 type=protein; loc=3R:complement(9207109..9207225,9207285..9207431); ID=FBpp0082232; name=mRpS21-PA; parent=FBgn0044511,FBtr0082764; dbxref=FlyBase:FBpp0082232,FlyBase_Annotation_IDs:CG32854-PA,GB_protein:AAN13563.1,GB_protein:AAN13563; MD5=dcf91821f75ffab320491d124a0d816c; release=r5.1; species=Dmel; length=87;
MRHVQFLARTVLVQNNNVEEACRLLNRVLGKEELLDQFRRTRFYEKPYQVRRRINFEKCKAIYNEDMNRKIQFVLRKNRAEPFPGCS
>FBpp0091159 type=protein; loc=2R:complement(2511337..2511531,2511594..2511767,2511824..2511979,2512032..2512082); ID=FBpp0091159; name=CG33919-PA; parent=FBgn0053919,FBtr0091923; dbxref=FlyBase:FBpp0091159,FlyBase_Annotation_IDs:CG33919-PA,GB_protein:AAZ52801.1,GB_protein:AAZ52801; MD5=c91d880b654cd612d7292676f95038c5; release=r5.1; species=Dmel; length=191;
MKLVLVVLLGCCFIGQLTNTQLVYKLKKIECLVNRTRVSNVSCHVKAINWNLAVVNMDCFMIVPLHNPIIRMQVFTKDYSNQYKPFLVDVKIRICEVIERRNFIPYGVIMWKLFKRYTNVNHSCPFSGHLIARDGFLDTSLLPPFPQGFYQVSLVVTDTNSTSTDYVGTMKFFLQAMEHIKSKKTHNLVHN
>FBpp0070770 type=protein; loc=X:join(5584802..5585021,5585925..5586137,5586198..5586342,5586410..5586605); ID=FBpp0070770; name=cv-PA; parent=FBgn0000394,FBtr0070804; dbxref=FlyBase:FBpp0070770,FlyBase_Annotation_IDs:CG12410-PA,GB_protein:AAF46063.1,GB_protein:AAF46063; MD5=0626ee34a518f248bbdda11a211f9b14; release=r5.1; species=Dmel; length=257;
MEIWRSLTVGTIVLLAIVCFYGTVESCNEVVCASIVSKCMLTQSCKCELKNCSCCKECLKCLGKNYEECCSCVELCPKPNDTRNSLSKKSHVEDFDGVPELFNAVATPDEGDSFGYNWNVFTFQVDFDKYLKGPKLEKDGHYFLRTNDKNLDEAIQERDNIVTVNCTVIYLDQCVSWNKCRTSCQTTGASSTRWFHDGCCECVGSTCINYGVNESRCRKCPESKGELGDELDDPMEEEMQDFGESMGPFDGPVNNNY…
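
A multi-sequence FASTA file like the one above can be read with a few lines of code. The following is a minimal sketch in Python, not a replacement for an established library: the function names `parse_fasta` and `header_fields` are made up for this illustration, and the example record is the mRpS21-PA entry from the slide with its header shortened to a few fields.

```python
def parse_fasta(text):
    """Split multi-sequence FASTA text into (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def header_fields(header):
    """Parse a FlyBase-style header: an ID followed by 'key=value;' pairs."""
    seq_id, _, rest = header.partition(" ")
    fields = {}
    for item in rest.split(";"):
        key, sep, value = item.strip().partition("=")
        if sep:
            fields[key] = value
    return seq_id, fields


# Header abbreviated for the example; the sequence is wrapped over two lines.
example = """>FBpp0082232 type=protein; name=mRpS21-PA; species=Dmel; length=87;
MRHVQFLARTVLVQNNNVEEACRLLNRVLGKEELLDQFRRTRFYEKPYQVRRRINFEKCK
AIYNEDMNRKIQFVLRKNRAEPFPGCS
"""

records = parse_fasta(example)
for header, sequence in records:
    seq_id, fields = header_fields(header)
    print(seq_id, fields["name"], len(sequence))
```

Comparing `len(sequence)` with the `length=` field in the header is a cheap sanity check that a record was not truncated during download or reformatting.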


Fields


ENTREZ is the GenBank web query tool

Advanced query interface:


Expasy.org

Other Important Databases

• Genomes
• Proteins
• Biochemical & Regulatory Pathways
• Gene Expression
• Genetic Variation (mutants, SNPs)
• Protein-Protein Interactions
• Gene Ontology (Biological Function)


http://genome.ucsc.edu/

UCSC Genome Browser

Search by gene name:

or by sequence:

Lots of additional data can be added as optional "tracks": anything that can be mapped to locations on the genome.


ensembl.org

KEGG: Kyoto Encyclopedia of Genes and Genomes

• Enzymatic and regulatory pathways
• Mapped out by EC number and cross-referenced to genes in all known organisms (wherever sequence information exists)
• Parallel maps of regulatory pathways

Gene Ontology

• Genetics is a messy science
• Scientists have been working in isolation on individual species for many years, naming genes, mutants, and odd phenotypes
  • "sonic hedgehog"
• Now that we have complete genome sequences, how do we reconcile the names across all species?
• Gene Ontology uses a single three-part system
  • Molecular function (specific tasks)
  • Biological process (broad biological goals, e.g. cell division)
  • Cellular component (location)


Unix/Linux


Filename Extensions

• Most Linux filenames start with a lowercase letter and end with a dot followed by one, two, or three letters: myfile.txt
• However, this is just a common convention and is not required.
• It is also possible to have additional dots in the filename.
• The part of the name following the last dot is called the “extension.”
• The extension is often used to designate the type of file.
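
As a small illustration of how a program can pick out the extension, here is a sketch using Python's standard `pathlib` module. The helper name `extension` and the sample filenames other than myfile.txt are invented for the example.

```python
from pathlib import PurePosixPath


def extension(filename):
    """Return the final extension including its dot, or '' when there is none.

    A leading dot, as in a hidden file name, does not count as an extension.
    """
    return PurePosixPath(filename).suffix


for name in ["myfile.txt", "reads.fasta.gz", "README", ".bashrc"]:
    print(f"{name}: extension={extension(name)!r}")
```

For a name with several dots only the last part counts as the extension, which matches the convention described above; `PurePosixPath(name).suffixes` returns all of them if that is needed.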

Some Common Extensions

• By convention:
  • files that end in .txt are text files
  • files that end in .c are source code in the "C" language
  • files that end in .html are HTML files for the Web
  • compressed files have the .zip or .gz extension
• Linux does not require these extensions (unlike Windows), but following the convention is sensible and you should do so

Working with Directories

• Directories are a means of organizing your files on a Linux computer.
  • They are equivalent to folders on Windows and Macintosh computers
  • Directories contain files, executable programs, and sub-directories
• Understanding how to use directories is crucial to manipulating your files on a Linux system.

Your Home Directory

• When you log in to the server, you always start in your Home directory.
• Create sub-directories to store specific projects or groups of information, just as you would place folders in a filing cabinet.
• Do not accumulate thousands of files with cryptic names in your Home directory.

File & Directory Commands

• This is a minimal list of Linux commands that you must know for file management:

  ls (list)                 mkdir (make directory)
  cd (change directory)     pwd (print working directory)
  cp (copy)                 rm (remove)
  mv (move)                 more (view by page)
  cat (view entire file)    man (help)

• All of these commands can be modified with many options. Learn to use the Linux ‘man’ pages for more information.

Navigation

• pwd (print working directory) shows the name and location of the directory where you are currently working:
    > pwd
    /home/jtang
  • This is a “pathname”; the slashes indicate sub-directories
  • The initial slash is the “root” of the whole filesystem
• ls (list) gives you a list of the files in the current directory:
    > ls
    assembin4.fasta  Misc  test2.txt
    bin              temp  testfile
• Use the ls -l (long) option to get more information about each file:
    > ls -l
    total 1768
    drwxr-x--- 2 browns02 users   8192 Aug 28 18:26 Opioid
    -rw-r----- 1 browns02 users   6205 May 30  2000 af124329.gb_in2
    -rw-r----- 1 browns02 users 131944 May 31  2000 af151074.fasta

Sub-directories

• cd (change directory) moves you to another directory
    > cd Misc
    > pwd
    /u/browns02/Misc
• mkdir (make directory) creates a new sub-directory inside of the current directory
    > ls
    assembler  phrap  space
    > mkdir subdir
    > ls
    assembler  phrap  space  subdir
• rmdir (remove directory) deletes a sub-directory, but the sub-directory must be empty
    > rmdir subdir
    > ls
    assembler  phrap  space
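
The same ls / mkdir / rmdir sequence can be scripted, which is useful once directory housekeeping becomes part of an analysis pipeline. The sketch below uses Python's standard `pathlib` and `tempfile` modules and works in a throwaway temporary directory; the function name `make_and_remove_subdir` is hypothetical, and the three starting entries simply mirror the listing on the slide.

```python
import tempfile
from pathlib import Path


def listing(directory):
    """Return the sorted entry names of a directory, roughly what ls shows."""
    return sorted(entry.name for entry in Path(directory).iterdir())


def make_and_remove_subdir(directory):
    """Replay the slide: ls, mkdir subdir, ls, rmdir subdir, ls."""
    directory = Path(directory)
    before = listing(directory)          # ls
    (directory / "subdir").mkdir()       # mkdir subdir
    during = listing(directory)          # ls
    (directory / "subdir").rmdir()       # rmdir subdir (fails unless empty)
    after = listing(directory)           # ls
    return before, during, after


with tempfile.TemporaryDirectory() as workdir:
    # Start with the three entries shown on the slide.
    for name in ("assembler", "phrap", "space"):
        (Path(workdir) / name).mkdir()
    before, during, after = make_and_remove_subdir(workdir)

print(before)
print(during)
print(after)
```

Like the rmdir command, `Path.rmdir()` refuses to delete a directory that still contains files.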

Shortcuts

• There are some important shortcuts in Linux for specifying directories
  • . (dot) means "the current directory"
  • .. means "the parent directory", the directory one level above the current directory, so cd .. will move you up one level
  • ~ (tilde) means your Home directory, so cd ~ will move you back to your Home.
• Just typing a plain cd will also bring you back to your home directory
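
To make the shortcuts concrete, the following sketch works out where a cd command would land. It is only a model of the rules above, written with Python's standard `posixpath` module: the function name `resolve` is invented, the home directory /home/jtang is taken from the pwd example, and the starting directory /home/jtang/Misc is an assumption for illustration.

```python
import posixpath


def resolve(cwd, target="~", home="/home/jtang"):
    """Return the directory that 'cd target' would reach from cwd.

    A plain 'cd' is modelled by the default target '~'.
    """
    if target == "~" or target.startswith("~/"):
        # The tilde is replaced by the home directory.
        target = home + target[1:]
    # join() keeps absolute targets as they are; normpath() folds '.' and '..'.
    return posixpath.normpath(posixpath.join(cwd, target))


cwd = "/home/jtang/Misc"
for target in [".", "..", "../bin", "~", "~/temp"]:
    print(f"cd {target:<7} -> {resolve(cwd, target)}")
```

On a real system the shell and `os.path.expanduser` read the home directory from the environment; it is passed in explicitly here so the example behaves the same everywhere.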

Create new files

• pico
• nano
• vi/vim
• emacs

Programming

• perl
• python
• c/c++
• R
• Java

Linux File Protections

• File protection (also known as permissions) enables the user to set up a file so that only specific people can read (r), write/delete (w), and execute (x) it.
• Write and delete privilege are treated as the same here, since write privilege allows someone to overwrite a file with a different one. (Strictly speaking, removing a file's name is governed by write permission on the directory that contains it.)

File Owners and Groups

• Linux file permissions are defined according to ownership. The person who creates a file is its owner.
• You are the owner of files in your Home directory and all its sub-directories
• In addition, there is a concept known as a Group.
• Members of a group can be given privileges to see each other's files.
• We create groups as the members of a single lab: the students, technicians, postdocs, visitors, etc. who work for a given PI.

View File Permissions

• Use the ls -l command to see the permissions for all files in a directory:
    $ ls -l
    total 2
    -rw-r--r-- 1 jtang None 56 Feb 29 11:21 data.txt
    -rwxr-xr-x 1 jtang None 33 Feb 29 11:21 test.pl
• The username of the owner is shown in the third column. (The owner of the files listed above is jtang)
• The group is shown in the fourth column. (Here the owner belongs to the group “None”)
• The access rights for these files are shown in the first column. This column consists of 10 characters known as the attributes of the file: r, w, x, and -
    r indicates read permission
    w indicates write (and delete) permission
    x indicates execute (run) permission
    - indicates no permission for that operation

• The first character in the attribute string indicates if a file is a directory (d) or a regular file (-).
• The next 3 characters (rwx) give the file permissions for the owner of the file.
• The middle 3 characters give the permissions for other members of the owner's group.
• The last 3 characters give the permissions for everyone else (others)
• The default protections assigned to new files on our system are: -rw-r----- (owner = read and write, group = read, others = nothing)
    $ ls -l
    total 2
    -rw-r--r-- 1 jtang None 56 Feb 29 11:21 data.txt
    -rwxr-xr-x 1 jtang None 33 Feb 29 11:21 test.pl
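
The way the 10-character attribute string breaks down can be sketched in a few lines of Python. The function name `describe_mode` is made up for this example, and the code handles only the two file types mentioned above (directory and regular file); other type characters that ls can print are reported as regular files here, which is a simplification.

```python
def describe_mode(attributes):
    """Break a 10-character ls -l attribute string into its four parts."""
    if len(attributes) != 10:
        raise ValueError("expected 10 characters, e.g. '-rw-r--r--'")
    names = {"r": "read", "w": "write", "x": "execute"}
    # Character 1: file type. Simplified to directory vs. regular file.
    result = {"type": "directory" if attributes[0] == "d" else "regular file"}
    # Characters 2-4, 5-7 and 8-10: owner, group and others.
    for who, start in (("owner", 1), ("group", 4), ("others", 7)):
        triplet = attributes[start:start + 3]
        result[who] = [names[char] for char in triplet if char in names]
    return result


for attributes in ("-rw-r--r--", "-rwxr-xr-x", "drwxr-x---"):
    print(attributes, describe_mode(attributes))
```

Applied to test.pl from the listing (-rwxr-xr-x), the owner gets read, write and execute, while the group and everyone else get read and execute only.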

Change Protections

• Only the owner of a file can change its protections
• To change the protections on a file use the chmod (change mode) command. [Beware, this is a confusing command.]
• Taken all together, it looks like this:
    > chmod 644 data.txt
  This gives the owner read and write permission, and gives the group and the world permission to read.
• Other modes: 600, 755, 700
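
The numbers given to chmod are less confusing once each digit is read as a sum: 4 for read, 2 for write, 1 for execute, with one digit each for owner, group and others. The sketch below, with an invented helper name `octal_to_rwx`, converts the modes mentioned above into the rwx notation that ls -l prints.

```python
def octal_to_rwx(mode):
    """Convert a three-digit chmod mode such as '644' into 'rw-r--r--'."""
    parts = []
    for digit in mode:
        value = int(digit, 8)  # rejects anything that is not an octal digit
        parts.append(
            ("r" if value & 4 else "-")
            + ("w" if value & 2 else "-")
            + ("x" if value & 1 else "-")
        )
    return "".join(parts)


for mode in ("644", "600", "755", "700"):
    print(mode, octal_to_rwx(mode))
```

So 644 is 6 (4+2, read and write) for the owner and 4 (read) for group and others. From Python the same change can be made with the standard call `os.chmod("data.txt", 0o644)`.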

Commands for Files

• Files are used to store information, for example, data or the results of some analysis.
  • You will mostly deal with text files
  • Files on the RCR Alpha are automatically backed up to tape every night.
• cat dumps the entire contents of a file onto the screen.
  • For a long file this can be annoying, but it can also be helpful if you want to copy and paste (use the buffer of your telnet program)

more

• Use the command more to view the contents of a file one screen at a time:
    > more t27054_cel.pep
    !!AA_SEQUENCE 1.0
    P1;T27054 - hypothetical protein Y49E10.20 - Caenorhabditis elegans
    Length: 534  May 30, 2000 13:49  Type: P  Check: 1278 ..
      1 MLKKAPCLFG SAIILGLLLA AAGVLLLIGI PIDRIVNRQV IDQDFLGYTR
     51 DENGTEVPNA MTKSWLKPLY AMQLNIWMFN VTNVDGILKR HEKPNLHEIG
    101 PFVFDEVQEK VYHRFADNDT RVFYKNQKLY HFNKNASCPT CHLDMKVTIP
    t27054_cel.pep (87%)
• Hit the spacebar to page down through the file
• Ctrl-U moves back up a page
• At the bottom of the screen, more shows how much of the file has been displayed
• Similar command: less

Copy & Move

• cp lets you copy a file from any directory to any other directory, or create a copy of a file with a new name in one directory
  • cp filename.ext newfilename.ext
  • cp filename.ext subdir/newname.ext
  • cp /u/jdoe01/filename.ext ./subdir/newfilename.ext
• mv allows you to move files to other directories, but it is also used to rename files.
  • Filename and directory syntax for mv is exactly the same as for the cp command.
  • mv filename.ext subdir/newfilename.ext
• NOTE: When you use mv to move a file into another directory, the file no longer exists in its original location.
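
The difference between cp and mv can be checked in a script. This sketch uses Python's standard `shutil` module inside a temporary directory; the file names are the placeholder names from the slide, and the file content is arbitrary.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    original = workdir / "filename.ext"
    original.write_text("ACGT\n")          # arbitrary content for the demo
    (workdir / "subdir").mkdir()

    # cp filename.ext newfilename.ext  (the original stays in place)
    shutil.copy(original, workdir / "newfilename.ext")
    exists_after_copy = original.exists()

    # mv filename.ext subdir/newfilename.ext  (the original is gone afterwards)
    shutil.move(str(original), str(workdir / "subdir" / "newfilename.ext"))
    exists_after_move = original.exists()

    # Collect everything below workdir, as relative POSIX-style paths.
    listing = sorted(p.relative_to(workdir).as_posix() for p in workdir.rglob("*"))

print(exists_after_copy, exists_after_move)
print(listing)
```

After the copy both files exist; after the move only the copy and the moved file remain, which is the behaviour the NOTE above warns about.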

Delete

• Use the command rm (remove) to delete files
• There is no way to undo this command!!!
  • We have set the server to ask if you really want to remove each file before it is deleted.
  • You must answer “Y” or else the file is not deleted.
• You can skip the confirmation with the -f (force) option
  • rm -rf removes a directory and everything inside it without asking, so use it with great care

Moving Files between Computers

• You will often need to move files between computers: desktop to server and back
• There are several options
  • Sneaker net (floppy, zip, writeable CD) (not an option for the mendel machine)
  • E-mail
  • Network filesharing
  • FTP
  • scp

FTP/SCP is Simple

• File Transfer Protocol is standard for all computers on any network.
• The best way to move lots of data to and from remote machines:
  • put raw data onto the server for analysis
  • get results back to the desktop for use in papers and grants
• Graphical FTP applications for desktop PCs
  • On a Mac, use Fetch, CyberDuck (!)
  • On a Windows PC, use WS_FTP, FileZilla
  • WinSCP

FTP Login

• When you open an FTP program, you connect to sanger just as you would with a terminal
• We now use sFTP (secure FTP)
• Your username and password are the same.
• You will automatically end up in your home directory.
• Put files from your PC to the server; Get files from the server to your desktop machine.

Some More Advanced Linux Commands

• grep: searches a file for a specific text pattern
• cut: copies one or more columns from a tab-delimited text file
• wc: word count
• | : the pipe, sends output of one command as input to the next
• > : redirect output to a file
• sed: stream editor, change text inside a file
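
To show what chaining these commands achieves, here is a rough Python model of a grep | cut | wc pipeline. The functions `grep`, `cut` and `wc_lines` are simplified stand-ins written for this example, not the real tools, and the small tab-delimited table (protein ID, chromosome arm, length) is assembled from the FASTA headers shown earlier in these slides.

```python
import re


def grep(pattern, lines):
    """Keep the lines that match a regular expression, like grep."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


def cut(lines, fields, delimiter="\t"):
    """Keep the given 1-based columns of each line, like cut -f."""
    selected = []
    for line in lines:
        columns = line.split(delimiter)
        kept = [columns[i - 1] for i in fields if i <= len(columns)]
        selected.append(delimiter.join(kept))
    return selected


def wc_lines(lines):
    """Count lines, like wc -l."""
    return len(lines)


table = [
    "FBpp0074027\tX\t294",
    "FBpp0082232\t3R\t87",
    "FBpp0091159\t2R\t191",
    "FBpp0070770\tX\t257",
]

# Each step feeds the next, which is what the pipe does on the command line.
on_x = grep(r"\tX\t", table)   # rows whose second column is X
ids = cut(on_x, [1])           # first column only
count = wc_lines(ids)

print(ids)
print(count)
```

On the command line the three steps would be joined with | and the final result could be written to a file with >, without storing the intermediate lists.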