Dr Richard White Basic Computing Concepts for Bioinformatics


Basic computing concepts

• “Basic Computing concepts” sounds a bit scary. I hope you’ll find it isn’t really.

• Actually some of the “Basic Computing Concepts” you’ll be familiar with already.


What does the average biologist use computers for?

• Browsing the Web, searching with Google, etc.

• Email

• Word-processing for reports, etc. (e.g. MS Word)

• Data handling and simple statistics (e.g. Excel)

• Playing CDs, games, etc. (no, of course not, only joking…)

• So you’re probably quite experienced in computer use already

• Databases? The use of bioinformatics databases figures prominently in this course.


What should biologists use computers for?

• Access to biological databases, especially those containing bioinformatics information

• Visualisation: ways to understand data better by visual exploration

• Analysing data, especially to test hypotheses (to understand biology better)


Computer use during this course

• Mostly we’ll be concerned with access to biological databases

• Also some visualisation sessions and maybe some data analysis and hypothesis testing


Using predefined tools

• You’ll be doing a lot of this work using tools available on the web.
– This makes life easy, because the hard work of setting these tools up for use has already been done by someone else.

• However, sometimes it’s useful to get your hands dirty and mess about with the data and ways to process it yourself,
– especially if you want to do something that zillions of other people haven’t already thought of.


Use of databases

• I’ll be running a session on the use of databases in week 4, but at the moment I want to think about this in order to discover some Basic Computing Concepts.

• First, let’s consider the characteristics of databases for a moment.


Simple database concepts

Computers allow the analysis of large data sets. These are frequently arranged as two-dimensional data tables, based on the convention that
– each row holds information on a separate object (or abstract entity such as a species),
– each column holds information on a particular property or characteristic of the objects,
– in general there will be a single value in each cell of the table, representing the value of a specific characteristic for one particular object.


Spreadsheets

• Data in the form of two-dimensional tables is frequently analysed using computer spreadsheet programs such as Microsoft Excel, especially where the purpose is
– relatively simple data reorganisation,
– summarisation,
– statistical testing,
– report generation.


Databases

• It is becoming harder to distinguish between spreadsheet and database programs.

• Most databases require more than one table: for example, one table may store data about proteins and another table stores data about the species these proteins are found in.

• For more about database systems, see the PowerPoint presentation (DatabaseIntroduction.ppt) on my web site (see handout for details).


Methods for using databases

• What methods exist to use databases?

• Basically there are several approaches to the use of databases:


Database use 1: direct access to database tables

• Run your own database on your own computer (e.g. MS Access)

• Use a program on your PC which gives you direct access to the tables in the remote database (client-server database access)

In both cases, you need to know what the tables are and what they contain, and you need a way to give instructions to the database, such as SQL.


SQL statements

• SQL (“Structured Query Language”) is a language for specifying the creation of databases and the updating and retrieval of information in them. It is general and “portable” – so that it can be used with a variety of different database systems without having to learn a new language for each one.

• The language goes far beyond the scope of this course. Briefly, it can be used to:
– Specify the tables in the database and the fields (columns) they contain
– Make additions and updates to the data in those tables
– Retrieve information from one or more of the tables


SQL for data retrieval

• A typical SQL statement for data retrieval would look something like this:

SELECT <some fields> FROM <table> WHERE <condition>;

• The condition effectively selects certain rows from the table.

• Thus the result is often a smaller table than the one being queried.

• Tables can be “joined” together to combine information from more than one table, for example when extracting a molecular sequence from one table and the bibliographic details of the reference to where it was published from another table.
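The way a WHERE condition picks out rows can be sketched in plain Perl, without any database. This is only an illustration of the idea: the table contents, the field names (id, species, length) and the helper select_rows are all made up for the example, and a real database system would do this work for you in response to an SQL statement.

```perl
use strict;
use warnings;

# A tiny made-up "table": each row is a hash of field => value.
my @proteins = (
    { id => 'P001', species => 'human', length => 110 },
    { id => 'P002', species => 'whale', length => 154 },
    { id => 'P003', species => 'human', length => 142 },
);

# Hypothetical helper imitating
#   SELECT <some fields> FROM <table> WHERE <condition>;
# $condition is a code reference that returns true for the rows to keep.
sub select_rows {
    my ($table, $fields, $condition) = @_;
    my @result;
    foreach my $row (@$table) {
        next unless $condition->($row);             # WHERE: skip rows that do not match
        my %picked;
        $picked{$_} = $row->{$_} foreach @$fields;  # SELECT: keep only the named fields
        push @result, \%picked;
    }
    return @result;
}

# Roughly: SELECT id, length FROM proteins WHERE species = 'human';
my @human = select_rows(\@proteins, ['id', 'length'],
                        sub { $_[0]{species} eq 'human' });

print "$_->{id}\t$_->{length}\n" foreach @human;
```

The result has fewer rows and fewer columns than the original table, which is the sense in which a query returns a smaller table.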


Database use 2: predefined operations

Alternatively, you might have forms and queries already set up for you, which you can just run in order to perform predefined kinds of searches. These predefined operations can be made directly available to you by:

• Browsing a web page, typically containing a form, which gives you access [NPI] to a database somewhere else. You’ve done this if you’ve ever bought anything on the Internet.

• Using or even writing a small program (sometimes called a script to make it seem less scary) to fetch the data for you. This allows you to process the data in useful ways:
– to search for features you’re interested in,
– to summarise the data in the way you want, or
– to extract data for statistical analysis to test hypotheses.


Database use 3: using predefined operations

The predefined operations may be packaged as CGI programs or Web Services or in a variety of other ways, but basically you just send a request to the service, optionally with some ‘parameters’ to specify what you want, and wait for the reply.

The reply usually comes back

• in HTML (as a web page containing the data requested) or

• as some other sort of file to be downloaded (i.e. stored on your PC), either
– in one of a number of formats invented by the data providers, or
– in XML, a standard but flexible (and verbose) way to structure a data file, so that other programs (rather than humans) can process it easily.
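Sending a request with parameters often comes down to building a URL. Here is a minimal sketch, assuming a made-up service address (https://example.org/cgi-bin/search) and made-up parameter names (db, term, format); the helpers url_encode and build_request_url are hypothetical too. The program only constructs the URL and does not contact any server.

```perl
use strict;
use warnings;

# Percent-encode a value so it is safe inside a URL query string.
# (Handles plain ASCII text only, which is enough for this sketch.)
sub url_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-_.~])/sprintf("%%%02X", ord($1))/ge;
    return $text;
}

# Hypothetical helper: join a base address and a set of parameters
# into one request URL of the form base?name=value&name=value
sub build_request_url {
    my ($base, %params) = @_;
    my @pairs = map { url_encode($_) . '=' . url_encode($params{$_}) }
                sort keys %params;     # sorted so the result is predictable
    return $base . '?' . join('&', @pairs);
}

my $url = build_request_url(
    'https://example.org/cgi-bin/search',   # made-up service address
    db     => 'protein',
    term   => 'human insulin',
    format => 'xml',
);

print "$url\n";
# prints https://example.org/cgi-bin/search?db=protein&format=xml&term=human%20insulin
```

A real service documents its own address and the parameters it understands, so those details must always be taken from the provider's documentation.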


Overview of NCBI Entrez

In a later session, you’ll be introduced to a number of bioinformatics databases, but it’s worth spending a moment looking at a popular way to make use of some of them, because you will explore this in Practical 2 in week 4 of this course.

• NCBI web site

• Entrez utilities


Brief introduction to Perl programming

(What? In ten minutes??)

This will help you prepare for Practical 2 (the practical part of the 4th week of the course), in which we shall use simple Perl programs to request data from a bioinformatics information provider such as NCBI, by connecting with their Entrez utilities. (Additional Perl tutorial material may be made available.)

• What is a Perl program? (or “script”)

• How to run one

• How to write one

• What do you need? – See the handout


A computer program

A program is a set of instructions to the computer, such as

• Get input from user
• Perform calculation
• Display window
• React to mouse click

These are instructions at a very high level. They need to be broken down into smaller details. A program consists of combinations of:

• Sequences of instructions (statements)
• Repetitions (to execute statements repeatedly)
• Selections (to choose which statements to execute)
• Functions (subroutines or methods: groups of instructions)


A simple program

• Here is a simple Perl program.

#!/usr/local/bin/perl
# Program to do the obvious
print 'Hello world.';

• The first line: Perl programs on Unix-like systems usually start with a line like this as their very first line, although the path may vary from system to system and the line may be left out altogether. It tells the machine what to do with the file when it is executed (it tells it to run the file through the Perl software).

• Everything which is not a comment is a Perl statement which must end with a semicolon, like the last line above.

• So the next thing to do is to run it.


Running the program

• Type in the example program using a text editor, and save it in a file called something.pl.

• Now to run the program just type the following at the Command Prompt.

perl something.pl

• If something goes wrong then you may get error messages, or you may get nothing at all.


Perl programming concepts: variables

Variables can hold both strings and numbers. For example, the statement

$priority = 9;

sets the scalar variable $priority to 9, but you can also assign a string to exactly the same variable:

$priority = 'high';

• In general, variable names consist of numbers, letters and underscores, but they should not start with a number. Perl is case sensitive, so $a and $A are different variables.
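These points can be tried out in a short program. This is only a sketch: the names $count and $Count are made up for illustration, and the lines use strict, use warnings and the keyword my are common Perl practice rather than something these slides require.

```perl
use strict;
use warnings;

my $priority = 9;             # the scalar holds a number...
print "Priority is $priority\n";

$priority = 'high';           # ...and now the same scalar holds a string
print "Priority is $priority\n";

# Names are case sensitive: these are two separate variables.
my $count = 1;
my $Count = 100;
print "count=$count Count=$Count\n";
```

Note that variables are replaced by their values inside double quotes, but not inside single quotes.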


Operations and Assignment

Perl uses all the usual arithmetic operators:

$a = 1 + 2; # Add 1 and 2 and store in $a
$a = 3 - 4; # Subtract 4 from 3 and store in $a
$a = 5 * 6; # Multiply 5 and 6
$a = 7 / 8; # Divide 7 by 8 to give 0.875

etc.

and for strings Perl has the following among others:

$a = $b . $c; # Concatenate $b and $c


Array variables

A slightly more interesting kind of variable is the array variable which is a list of scalars (single values, i.e. numbers and strings). Array variables have the same format as scalar variables except that they are prefixed by an @ symbol. The statement

@food = ("apples", "pears", "eels");

assigns a three element list to the array variable @food.

The array is accessed by using indices starting from 0, and square brackets are used to specify the index. The expression

$food[2]

returns eels. Notice that the @ has changed to a $ because $food[2] and eels are scalars, not arrays.
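A few more things that can be done with the same array are sketched below. The extra element ("eggs") is made up for illustration; push, scalar and foreach are standard Perl, but this is a small sample and not a full account of arrays.

```perl
use strict;
use warnings;

my @food = ("apples", "pears", "eels");

print "First item: $food[0]\n";                  # indices start at 0
print "Last item: $food[-1]\n";                  # negative indices count from the end
print "Number of items: ", scalar(@food), "\n";  # an array in scalar context gives its size

push @food, "eggs";                              # add an element to the end

# Visit each element in turn.
foreach my $item (@food) {
    print "I like $item\n";
}
```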


File handling

Here is a basic Perl program which does the same as the UNIX cat or DOS/Windows type command on a certain file.

#!/usr/local/bin/perl
# Program to open the password file, read it in,
# print it, and close it again.
use strict;
use warnings;

my $file = '/etc/passwd';           # Name the file
open(my $info, '<', $file)          # Open the file for reading
    or die "Cannot open $file: $!"; # Stop with a message if that fails
my @lines = <$info>;                # Read it into an array
close($info);                       # Close the file
print @lines;                       # Print the array


Control structures

Perl supports lots of different kinds of control structures. Have a look at the Perl resources listed on the handout. Most Perl programs use these features.

• Programs can choose between alternative branches

• Programs can repeat statements until something happens

• Frequently used statements to carry out some common task can be made into a “subroutine” or “function” and called from other parts of the program
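The three ideas above can be seen together in one short program. This is a sketch only: the subroutine name describe_length, the thresholds and the example values are all made up, and Perl has other control structures besides the ones shown.

```perl
use strict;
use warnings;

# A subroutine: a named group of statements that can be called
# from other parts of the program.
sub describe_length {
    my ($length) = @_;
    if ($length < 100) {           # selection: choose between branches
        return 'short';
    }
    elsif ($length < 1000) {
        return 'medium';
    }
    else {
        return 'long';
    }
}

my @lengths = (42, 350, 12000);    # made-up example values

# Repetition with foreach: once for each element of the array.
foreach my $length (@lengths) {
    print "$length is ", describe_length($length), "\n";
}

# Repetition with while: keep going until the condition becomes false.
my $n = 1;
while ($n < 100) {
    $n = $n * 2;
}
print "First power of two that is at least 100: $n\n";
```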


End