Use of Computers in Molecular Biology Meena K Sakharkar Training Manager, BioInformatics Centre National University of Singapore


Use of Computers in Molecular Biology

Meena K Sakharkar

Training Manager, BioInformatics Centre

National University of Singapore


What is BioInformatics?

• Many related terms and buzzwords

• A multiplicity of names:

– bioinformatics

– biocomputing

– biological computing

– computational biology

– computational genomics

– biological data mining


Overview of the challenges of Molecular Biology Computing

• The huge dataset problem

– automated DNA sequencers

– the Human Genome Project

– bulk sequencing of cDNAs (ESTs)


GenBank Growth Chart

[Figure: growth of GenBank from Dec 1982 to Apr 1998. Horizontal axis: Year; vertical axis: Bases, 0 to 1,600,000,000.]

As of Oct. 1999, GenBank contains over 3.8 billion bases of DNA and protein sequence, which requires about 18 gigabytes of computer disk storage space.


Human Genome Project

• What is the Human Genome Project?

– 15-year effort formally begun in October 1990, coordinated by the U.S. Department of Energy and the National Institutes of Health.

– identify all the estimated 80,000 genes in human DNA,

– determine the sequences of the 3 billion chemical bases that make up human DNA,

– store this information in databases,

– develop tools for data analysis, and

– address the ethical, legal, and social issues (ELSI) that may arise from the project.


• Who is head of the U.S. Human Genome Project?

– The DOE Human Genome Program is directed by Ari Patrinos, and Francis Collins directs the NIH Human Genome Program.

– Ari Patrinos also heads the Department of Energy Office of Biological and Environmental Research.


Related fields

• molecular evolution

• origin of life

• genomics and proteomics

• the Human Genome Project

• theoretical biology

• complexity and information theory

• biotechnology

• lead drug discovery

• computing with biomolecules


Our (working) definition

• Bioinformatics: the body of tools, algorithms and know-how needed to handle complex biological information (the technological aspect)

• Computational biology: the application of bioinformatics tools to perform biological studies (the scientific aspect); a very broad and diverse field


• Bioinformatics is clearly a multidisciplinary field including:

– computer systems management

– networking, database design

– computer programming

– molecular biology


Integrating bioinformatics and computational biology:

• A biologist can use existing tools but might misinterpret results

The black-box effect - the 'software kit'

• A biologist might refrain from doing some interesting analysis if the existing software doesn't offer it as an option

The ability to program is important

• A computer scientist or a programmer can produce interesting and/or efficient algorithms and tools, but these might lack biological relevance.

A biological training/background is important

• Beware of the 'just a tool maker' stigma

• Best results are achieved by integrating the development of tools with their usage in interesting biological systems


How to handle all the information?

• Producing

• Processing

• Storing

• Sharing

• Querying

• Retrieving

• Visualising

• Annotating

• Curating


Use of Computers in Molecular Biology

• Powerful tools to organise the data itself.

– Exponential growth.

– A new release is made every two months.

• Data Analysis.

– Retrieval.

– Homology Search.

– Modelling purposes - Drug Design

• Data Integration

• Data Visualisation


Paradigmatic Shift:

• Getting new sequences is now easy.

• Having a new sequence, we can start by analysing it using the computer, or we can start by doing experimental work.

• "A month in the lab can often save an hour in the library." - Westheimer ... or searching the Internet, or doing computerised analyses.

• From 'wet lab' to 'soft lab'.

• in vivo, in vitro, and in silico


Information is being collected, organized, and made available:

• GenBank is the central sequence information database in the United States

• Data is shared between GenBank, the European Molecular Biology Laboratory (EMBL), and the DNA Data Bank of Japan (DDBJ)

• All sequence data submitted to any of these databases is automatically integrated into the others.

• Sequence data is also incorporated from the Genome Sequence Data Base (GSDB) and from patent applications.


Similarity Searching in the databanks

• "Are there any sequences in the databanks similar to my sequence?"

• Directly searching the databanks by comparing sequences uses too much computer time

• The Biologist uses timesaving tools: FASTA and BLAST

• Relies on statistics and the informed judgement of the Biologist.
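
The idea behind such timesaving tools can be sketched roughly as follows. Word-based search tools avoid a full comparison against every database entry by first looking for short words shared with the query. The Python sketch below only counts shared words of length k as a crude pre-filter; it is not the FASTA or BLAST algorithm, and the function names, the toy database, and k=3 are assumptions made for illustration.

```python
def kmers(seq, k=3):
    """Return the set of all overlapping words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def rank_by_shared_words(query, database, k=3):
    """Rank database entries by the number of k-letter words shared with query.

    database maps a (hypothetical) entry name to its sequence.
    Entries sharing no word with the query are dropped.
    """
    query_words = kmers(query, k)
    hits = []
    for name, seq in database.items():
        shared = len(query_words & kmers(seq, k))
        if shared:
            hits.append((name, shared))
    # Highest number of shared words first; ties broken by name.
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


# Made-up sequences, used only to exercise the functions.
toy_db = {
    "seqA": "ATGGCGTACGTT",
    "seqB": "TTTTTTTTTTTT",
    "seqC": "GGCGTACG",
}
print(rank_by_shared_words("GCGTAC", toy_db))
```

Only the entries that survive such a filter would then be aligned in full, and the biologist still has to judge whether a reported hit is meaningful.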


Pairwise and Multiple Alignments

• Multiple Alignment is the basis for the study of protein families and functional domains.

• When pairwise alignment is expanded to multiple sequences, it becomes a computationally huge problem.

• To reduce the nearly infinite permutations, a simplified heuristic (approximate) algorithm known as progressive pairwise alignment is used
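
The pairwise step that progressive methods build on can be sketched as a small dynamic program. The Python sketch below computes only the score of the best global alignment of two sequences, in the style of Needleman-Wunsch; the scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration, not the parameters of any particular program.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score the best global alignment of a and b by dynamic programming."""
    # prev[j] is the best score for aligning the first i-1 letters of a
    # with the first j letters of b (row i-1 of the scoring table).
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against nothing costs i gaps
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + pair,   # align a[i-1] with b[j-1]
                            prev[j] + gap,        # gap in b
                            curr[j - 1] + gap))   # gap in a
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATACA"))
```

The table has one cell per pair of positions, so the work grows with the product of the two sequence lengths. Extending the same table to many sequences at once multiplies the dimensions, which is why a progressive heuristic is used instead.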


Structure-function relationships:

Sequence patterns that predict function

• One of the challenging areas of computational molecular biology is the prediction of the function of protein molecules from their sequence.

• Sequence determines 3-D structure, structure determines function

• Identify conserved regions (domains or motifs)

• Domain databases can be used to scan any unknown protein sequence
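
Scanning a sequence for a conserved motif can be sketched with a regular expression. The Python example below uses the pattern N-{P}-[ST]-{P}, commonly cited as the PROSITE N-glycosylation site pattern; the protein sequence and the function name are made up for illustration, and a pattern match by itself does not establish biological function.

```python
import re

# N-{P}-[ST]-{P} written as a regular expression: N, any residue except P,
# then S or T, then any residue except P. The lookahead lets overlapping
# occurrences be reported.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(protein, pattern=N_GLYCOSYLATION):
    """Return (1-based start position, matched residues) for each motif hit."""
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(protein)]


# Hypothetical sequence, not taken from any database.
print(find_motif("MKNVTAANPSGNNTSQ"))
```

Domain databases apply the same idea on a larger scale, with many curated patterns and profiles scanned against each unknown sequence.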


Searching Literature using PubMed at NCBI


PubMed

• Project by NIH and NLM.

• Search Tool for accessing literature citations.

• PubMed Search system - MedLine and Pre Medline Database and Molecular Biology Databases indexed under Entrez.


MedLine

• MedLine - MEDLARS onLINE database - NLM's premier bibliographic database.

• Covers medicine, nursing, dentistry, veterinary medicine, the health care sciences and pre-clinical sciences.

• Has over 3900 current biomedical journals published in the US and other countries.


MedLine

• 9 million records.

• Since 1966.


PreMedLine

• Introduced in August 1996.

• Basic Citation and abstracts before the full records are prepared and added to Medline.


MEDLINE SAMPLE RECORD

UI 98408838

AU Tao X, Dafu D

TI Relationship between synonymous codon usage and protein structure.

MH Codon*

MH Protein Folding*

MH Protein Structure, Secondary*

MH Proteins / genetics ……

AB The hypothesis that synonymous codon usage is related to protein three-dimensional structure is examined by

PT Journal article

SO FEBS Lett 1998 Aug 28: 434 (1-2): 93-6
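
A tagged record like the one above can be read into a simple data structure. The Python sketch below assumes that each field starts with a two-letter uppercase tag followed by a space, and that any other non-blank line continues the previous field; this layout is inferred from the sample only, and the function name is hypothetical.

```python
def parse_record(text):
    """Parse a tagged MEDLINE-style record into a dict of tag -> list of values.

    A line starting with a two-letter uppercase tag begins a new field;
    any other non-blank line is appended to the previous field.
    Caveat: a wrapped line that happens to begin with a two-letter
    uppercase word would be misread as a new field.
    """
    fields = {}
    last_tag = None
    for line in text.splitlines():
        if not line.strip():
            continue
        tag, _, value = line.partition(" ")
        if len(tag) == 2 and tag.isupper():
            fields.setdefault(tag, []).append(value.strip())
            last_tag = tag
        elif last_tag is not None:
            fields[last_tag][-1] += " " + line.strip()
    return fields


record = """UI 98408838
AU Tao X, Dafu D
TI Relationship between synonymous codon usage and
protein structure.
MH Codon*
MH Protein Folding*
PT Journal article
"""
parsed = parse_record(record)
print(parsed["TI"][0])
print(parsed["MH"])
```

Repeated tags such as MH are kept as a list, so every subject heading assigned to the article stays available for filtering.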


MEDLINE Indexing

• MeSH Terms to LIMIT Retrieval

– human, animal, male, female,

– age groups, organism, etc.

• Publication Types (Another way to LIMIT)

– review, clinical trial, letter, journal article, etc.


MEDLINE Subject Headings

Advantages of MeSH Terms

• Represent a subject concept & no term synonyms needed

• Find relevant articles on a search topic that may not be explicitly mentioned in a title or abstract

• Focus search & be specific to eliminate irrelevant records

• Increase search efficiency to save time … Get reliable results


Searching MEDLINE Subject Headings

• Disadvantages of MeSH

• Thesaurus terms may not cover all concepts, esp. jargon

• Not every concept in abstract or article can get thesaurus terms


MEDLINE Searching

Search terms are combined with the Boolean operators "OR" (either term may appear) and "AND" (both terms must appear).
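
As a rough illustration of combining terms, the Python sketch below joins synonyms for one concept with OR and joins separate concepts with AND. The function name and the example terms are assumptions, and the output is generic Boolean notation rather than the exact syntax of any particular search interface.

```python
def build_query(concepts):
    """Combine search terms: OR within each concept, AND between concepts.

    concepts is a list of lists; each inner list holds synonyms for one concept.
    """
    groups = []
    for synonyms in concepts:
        quoted = ['"{}"'.format(term) for term in synonyms]
        groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)


print(build_query([["codon usage", "codon bias"], ["protein folding"]]))
# prints ("codon usage" OR "codon bias") AND ("protein folding")
```

Adding a synonym inside a group widens the search, while adding another group narrows it.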


Modifying Retrieval --- NOT ENOUGH Found

• Reduce number of concepts to combine

• Add synonyms or related terms

– Use both free-text words & MeSH terms

– Truncate free-text words as appropriate

– Explode subject term, if it has narrower terms

• Do NOT use limits (e.g., major point, review)

• Consult a professional searcher … Librarian


Modifying Retrieval --- TOO Many Found

• Use MeSH terms only … Use no free-text words

• Use "MeSH Power" to Focus Your Search

– Try a more specific MeSH term

– Limit MeSH terms to MAJOR point of article

– Use a Subheading with your MeSH term

• Reduce number of synonyms, if free-text searching

• Add additional concepts to your search

• Use Limits … English language, reviews

• Restrict to human, animal, or organism


Internet Tools and Searches


Network Utilities


What is the Internet?

• A world wide collection of networks of computers

• A network of computer networks

• A network based on the TCP/IP protocol


Standalone Computer

A typical setup at home: a PC with speakers and a printer


LAN

A small Local Area Network of two computers and one printer in your office


Inter-Departmental Network


Campus Wide Network


Campus Network

Wide Area Network

National Network

InterCountry Network

Global Network

The INTERNET


What can you do with the Internet?

INTERNET APPLICATIONS

• Electronic Mail (Email)

• Internet Talk/Chat (IRC)

• File Transfer (FTP)

• Remote Login (Telnet)

• Internet News (Usenet)

• Info retrieval (Gopher, World Wide Web)

• Audio/Video Conferencing (CU-SeeMe, Mbone)

• Internet Phone


FTP: File Transfer Protocol

ftp ncbi.nlm.nih.gov

login: anonymous

passwd: email address

If you want to ftp from a server on which you have an account, use your own login and passwd


FTP commands

• cd - change directory

• ls - list the files in the current directory

• pwd - print working directory

• bin - transfer in binary mode

• asc - transfer in ASCII mode

• hash - print a hash mark (#) for each block transferred, to show progress

• lcd - change the local directory


FTP commands continued..

• prompt - toggle interactive prompting during multiple file transfers

• mget - get multiple files from the server (use get for a single file)

• mput - put multiple files onto the server (use put for a single file)
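
The same session can also be scripted. The Python sketch below uses the standard ftplib module to log in anonymously, change directory, and fetch several files in binary mode, roughly mirroring cd, bin, and mget. The function names are hypothetical, and any host, directory, or file names passed in are assumptions; the host name on the earlier slide may no longer accept connections.

```python
import os
from ftplib import FTP


def anonymous_session(host, email):
    """Open an anonymous FTP session; the email address serves as the password."""
    ftp = FTP(host)
    ftp.login("anonymous", email)
    return ftp


def fetch_binary(ftp, remote_dir, filenames, local_dir):
    """Download several files in binary mode from an open FTP session."""
    ftp.cwd(remote_dir)                                    # like cd
    saved = []
    for name in filenames:                                 # like mget without prompting
        local_path = os.path.join(local_dir, name)
        with open(local_path, "wb") as handle:
            ftp.retrbinary("RETR " + name, handle.write)   # like bin followed by get
        saved.append(local_path)
    return saved
```

A call might look like `fetch_binary(anonymous_session(host, "you@example.org"), remote_dir, names, ".")`, followed by `ftp.quit()` on the session once the transfers finish.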


Telnet

• Work on another machine by remote login.

• telnet intron.bic.nus.edu.sg

login:

passwd:

• Must have an account on the machine for doing telnet

• Must have internet connection

• Space allocated to you on the machine


HTML- an Introduction


What is Hypertext?

• Non-Linear Text

• Links embedded in the text

• Jumps to other locations in the document/db

[Figure: in the sentence "the quick brown fox jumps over the fence", the word "fence" links to a separate entry headed "Fence".]


Creating a Web Page

• Terms to Know

• WWW/Web: World Wide Web

• HTML: Hyper Text Mark-up Language

• URL: Uniform Resource Locator

• I assume that you:

– know how to use Netscape or some other Web browser

– have access to a Web server (or that you want to produce HTML documents for personal use in local-viewing mode)


Creating a Web Page

What an HTML Document Is

• Collection of styles

• HTML documents are plain-text files

• Can be created using any text editor

• You can also use word-processing software if you remember to save your document as "text only with line breaks."

• HTML is not case sensitive.

• TAGS are used to mark the elements of the file for your browser.


Creating a Web Page

TAGS Explained

• Every HTML document should contain certain standard HTML tags.

• Each document consists of head and body tags.

• The head contains the title, and the body contains the actual text that is made up of paragraphs, lists, and other elements.


<html>

<head>

<TITLE>A Simple HTML Example</TITLE>

</head>

<body>

<H1>HTML is Easy To Learn</H1>

<P>Welcome to the world of HTML.

This is the first paragraph. While short it is

still a paragraph!</P>

<P>And this is the second paragraph.</P>

</body>

</html>

• The required elements are the <html>, <head>, <title>, and <body> tags (and their corresponding end tags).

• Note: Because you should include these tags in each file, you might want to create a template file with them.
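
One way to keep such a template is as a small script that fills in the required tags. The Python sketch below is only an illustration: the function name is hypothetical, and reusing the title as the first-level heading is an assumption rather than a requirement of HTML.

```python
from html import escape


def make_page(title, paragraphs):
    """Return a minimal HTML document with the required tags filled in."""
    # escape() replaces &, < and > so that plain text cannot be read as tags.
    body = "\n".join("<P>{}</P>".format(escape(text)) for text in paragraphs)
    return (
        "<html>\n"
        "<head>\n"
        "<TITLE>{}</TITLE>\n"
        "</head>\n"
        "<body>\n"
        "<H1>{}</H1>\n"
        "{}\n"
        "</body>\n"
        "</html>\n"
    ).format(escape(title), escape(title), body)


print(make_page("A Simple HTML Example", ["Welcome to the world of HTML."]))
```

Writing the returned string to a file with the .html extension gives a page that can be opened in a browser in local-viewing mode.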


TAGS Explained

• HTML:

– This element tells your browser that the file contains HTML-coded information.

– The file extension .html also indicates that this is an HTML document and must be used.

• HEAD:

– The head element identifies the first part of your HTML-coded document that contains the title.


TITLE

The title element contains your document title and identifies its content in a global context.

BODY

Contains the content of your document.

HEADINGS

HTML has six levels of headings, numbered 1 through 6, with 1 being the most prominent.

Headings are displayed in larger and/or bolder fonts than normal body text.

The syntax of the heading element is:

<Hy>Text of heading </Hy>

where y is a number between 1 and 6 specifying the level of the heading.

PARAGRAPHS

Carriage returns in HTML files aren't significant.

Word wrapping can occur at any point in your source file, and multiple spaces are collapsed into a single space by your browser.

The </P> closing tag can be omitted. This is because browsers understand that when they encounter a <P> tag, it implies that there is an end to the previous paragraph.


Using the <P> and </P> as a paragraph container means that you can center a paragraph by including the ALIGN=alignment attribute in your source file.

<P ALIGN=CENTER>

This is a centered paragraph.

[See the formatted version below.]

</P>

This is a centered paragraph.


Lists

HTML supports unnumbered, numbered, and definition lists. You can nest lists too, but use this feature sparingly because too many nested items can get difficult to follow.

Unnumbered Lists

To make an unnumbered, bulleted list,

1. start with an opening list <UL> (for unnumbered list) tag

2. enter the <LI> (list item) tag followed by the individual item; no closing </LI> tag is needed

3. end the entire list with a closing list </UL> tag

Below is a sample three-item list:

<UL>

<LI> apples

<LI> bananas

<LI> grapefruit

</UL>

The output is:

• apples

• bananas

• grapefruit


Numbered Lists

A numbered list (also called an ordered list, from which the tag name derives) is identical to an unnumbered list, except it uses <OL> instead of <UL>. The items are tagged using the same <LI> tag. The following HTML code:

<OL>

<LI> oranges

<LI> peaches

<LI> grapes

</OL>

produces this formatted output:

1. oranges

2. peaches

3. grapes


A definition list (coded as <DL>) usually consists of alternating a definition term (coded as <DT>) and a definition description (coded as <DD>). Web browsers generally format the definition on a new line.

The following is an example of a definition list:

<DL>

<DT> NCSA

<DD> NCSA, the National Center for Supercomputing

Applications, is located on the campus of the

University of Illinois at Urbana-Champaign.

<DT> Cornell Theory Center

<DD> CTC is located on the campus of Cornell

University in Ithaca, New York.

</DL>

The output looks like:

NCSA

NCSA, the National Center for Supercomputing Applications, is located on the campus of the University of Illinois at Urbana-Champaign.

Cornell Theory Center

CTC is located on the campus of Cornell University in Ithaca, New York.


Nested Lists

Lists can be nested. You can also have a number of paragraphs, each containing a nested list, in a single list item. Here is a sample nested list:

<UL>

<LI> A few New England states:

<UL>

<LI> Vermont

<LI> New Hampshire

<LI> Maine

</UL>

<LI> Two Midwestern states:

<UL>

<LI> Michigan

<LI> Indiana

</UL>

</UL>

The nested list is displayed as

• A few New England states:

– Vermont

– New Hampshire

– Maine

• Two Midwestern states:

– Michigan

– Indiana


Forced Line Breaks/Postal Addresses

The <BR> tag forces a line break with no extra (white) space between lines. Using <P> elements for short lines of text such as postal addresses results in unwanted additional white space. For example, with <BR>:

National Center for Supercomputing Applications<BR>

605 East Springfield Avenue<BR>

Champaign, Illinois 61820-5518<BR>

The output is:

National Center for Supercomputing Applications

605 East Springfield Avenue

Champaign, Illinois 61820-5518

Horizontal Rules

The <HR> tag produces a horizontal line the width of the browser window. A horizontal rule is useful to separate sections of your document. For example, many people add a rule at the end of their text and before the <address> information.

You can vary a rule's size (thickness) and width (the percentage of the window covered by the rule). Experiment with the settings until you are satisfied with the presentation. For example:

<HR SIZE=4 WIDTH="50%">

displays as: [Figure: a horizontal rule 4 pixels thick, spanning half the width of the window.]


• Physical Styles

<B> bold text

<I> italic text

<TT> typewriter text, i.e. fixed-width font.


• Linking Power - link text and/or image.

Browser highlights the identified text or image with color and/or underlines to indicate that it is a hypertext link.

HTML's single hypertext-related tag is <A>, which stands for anchor. To include an anchor in your document:

1. start the anchor with <A (include a space after the A)

2. specify the document you're linking to by entering the parameter HREF="filename" followed by a closing right angle bracket (>)

3. enter the text that will serve as the hypertext link in the current document

4. enter the ending anchor tag: </A> (no space is needed before the end anchor tag)

Here is a sample hypertext reference in a file called US.html:

<A HREF="http://www.bic.nus.edu.sg">BIC HomePage</A>

This entry makes the words BIC HomePage the hyperlink to the document http://www.bic.nus.edu.sg/index.html.


You can make it easy for a reader to send electronic mail to a specific person or mail alias by including the mailto attribute in a hyperlink. The format is:

<A HREF="mailto:emailinfo@host">Name</a>

For example, enter:

<A HREF="mailto:[email protected]"> Meena KS</a>

to create a link that opens a mail window already addressed to Meena KS. (You, of course, will enter another mail address!)


To include an inline image, enter:

<IMG SRC=ImageName>

where ImageName is the URL of the image file.

The syntax for <IMG SRC> URLs is identical to that used in an anchor HREF. If the image file is a GIF file, then the filename part of ImageName must end with .gif. Filenames of X Bitmap images must end with .xbm; JPEG image files must end with .jpg or .jpeg; and Portable Network Graphic files must end with .png.

Image Size Attributes

<IMG SRC=SelfPortrait.gif HEIGHT=100 WIDTH=65>


Demo:

http://www.ncbi.nlm.nih.gov
