Intro to Bioinformatics Lab

Learning Perl

The following two labs are intended to let biology students know that they can

learn to program, and that there are unique victories in the process of

programming and debugging a program.

After the first year of teaching the course, the biology students suggested more

basic training in programming early in the course. We wanted to do this without

thoroughly boring the computer science students so we decided that having the

computer science students teach the biologists the rudiments of programming

would be the best means of accomplishing that task.

These labs were created by Chris Marks (our TA). He had absolutely no

programming experience, so they represent the expectation of a neophyte

programmer. This helped us keep the projects reasonably close to the level that

might be expected out of the students. Some of the students struggled with the

assignments, but they could all independently write a simple program after two

weeks in the lab.

I once assigned a group of high school students the task of writing a program to

calculate linear regressions and fit a line to a set of points. When they finally

managed to get their programs to work they were changed individuals. The

second assignment of these two takes the student more than halfway to a

regression line in a single lab session of 1 1/2 hours!

Lab 1: Introduction to Perl

The purpose of this lab is to familiarize you with the programming language Perl.

Objectives: After completing this lab, you will be able to:
o Write and save a functional Perl program.
o Execute a saved program.
o Explain some basic Perl commands.

Overview: Your goal for this lab is to write and execute two programs using Perl. You will work in groups (biology majors and computer science majors will be teamed up). During the writing of the first program, the computer science majors will be responsible for explaining the basics of writing a functional program to the biology majors in their group. It is essential that you help the biology majors gain the fundamentals of writing a program, saving it in the proper format, and then executing it. After you have completed the first program, the computer science majors will step aside and the biology majors will be prompted to write and execute a second program. This program will be very similar to the first one.

Assessment: Grading for each group will be based on the following:

The biology majors’ ability to write a program unassisted.

Part one: Teaching biologists to write a program

As a group, you are to write a program using the language Perl with the following features:

o The program should request the user to enter two numbers.
o The program should then add these numbers together.
o The program should then present the sum to the user.
o The program should include comments that describe what each line is doing.

To do this you must familiarize yourself and your group members with the basics of programming in Perl. You may use any resources available to you to do this, including the Internet. After you have successfully completed the program, grab the TA and show him what you have done.

TA checklist:

________ The program properly asks for two numbers

________ The program properly sums the two numbers

________ The program properly displays the sum

________ The program contains comments explaining the program steps

________ The program was written properly and well formatted
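One possible shape for a program that meets the checklist above is sketched below. The helper names ask_number and add_numbers are illustrative choices, not part of the assignment, and treating end of input as 0 is an assumption made to keep the sketch short. A passing program could just as well be written as straight-line code without subroutines.

```perl
#!/usr/bin/perl
use strict;      # insist that every variable is declared with "my"
use warnings;    # have Perl point out likely mistakes

# Show a prompt, read one line from the keyboard, and return it
# without the trailing newline. End of input is treated as 0.
sub ask_number {
    my ($prompt) = @_;
    print $prompt;
    my $answer = <STDIN>;
    $answer = 0 unless defined $answer;
    chomp $answer;
    return $answer;
}

# Return the sum of two numbers.
sub add_numbers {
    my ($first, $second) = @_;
    return $first + $second;
}

# Ask the user for the two numbers.
my $first  = ask_number("Enter the first number: ");
my $second = ask_number("Enter the second number: ");

# Add the numbers together and present the sum to the user.
my $sum = add_numbers($first, $second);
print "The sum of $first and $second is $sum\n";
```

Saved as a file such as add.pl (a hypothetical name), it would be run from a command prompt with `perl add.pl`.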

Part two: Writing a program without help

Without the help of the computer science majors, the biology majors are to write a program using the language Perl with the following features:

o The program should request the user to enter two numbers.
o The program should then present the sum of squares of the two numbers.
o The program should then say “goodbye”.
o The program should include comments that describe what each line does.

The biology majors must complete this task without the aid of their other group members. Computer science majors can offer no help other than offering encouraging expressions. After successfully completing the program, grab the TA and show him what you have done.

TA checklist:

________ The program properly asks for two numbers (2 pts)

________ The program properly computes the sum of squares of the two numbers (2 pts)

________ The program contains comments explaining the program (2 pts)

________ The program says “goodbye” at the end (2 pts)

________ The program was written properly and well formatted (2 pts)

________/10

Lab 2: Computing basic stats with Perl

The purpose of this lab is to familiarize you with some basic computations in Perl as well as some more advanced commands.

Objectives: After completing this lab, you will be able to:
o Write a program in Perl that computes some basic stats.
o Write a program that asks for and manipulates continual input.

Overview: Your goal for this lab is to write and execute two programs using Perl that compute some basic statistics. You will work in groups (biology majors and computer science majors will be teamed up). During the writing of the first program, the computer science majors will be responsible for assisting the biology majors in their group. It is essential that you help the biology majors understand what is happening. After you have completed the first program, the computer science majors will step aside and the biology majors will be prompted to write and execute a second program. This program will be very similar to the first one.

Assessment: Grading for each group will be based on the following:

The biology majors’ ability to write a program unassisted.

Part one: Teaching biologists to write a statistics program

As a group, you are to write a program using the language Perl with the following features:

o The program should request the user to enter two numbers (x and y).
o The program should present the sum of all x and y values.
o The program should present the mean of all x and y values.
o The program should continue to provide cumulative data for x and y until the user decides to stop.

Example:

Please enter 2 numbers, x and y, separated by a space (press “S” to stop): <user inputs 1 2>
The sum of all x values is 1 and the sum of all y values is 2.
The mean of all x values is 1 and the mean of all y values is 2.
Please enter 2 numbers, x and y, separated by a space (press “S” to stop): <user inputs 2 3>
The sum of all x values is 3 and the sum of all y values is 5.
The mean of all x values is 1.5 and the mean of all y values is 2.5.
Please enter 2 numbers, x and y, separated by a space (press “S” to stop): <user inputs S>
The final sum of all x values is 3 and the final sum of all y values is 5.
The final mean of all x values is 1.5 and the final mean of all y values is 2.5.
Goodbye!

You may use any resources available to you to do this, including the internet. After you have successfully completed the program, grab the TA and show him what you have done.

TA checklist:

________ The program properly asks for two numbers

________ The program properly computes and displays the sum of all x and y values

________ The program properly computes and displays the mean of all x and y values

________ The program properly repeats the process until the user stops

________ The program was written properly and well formatted
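The cumulative behaviour in the example above can be sketched as a loop that keeps running totals. This is only one way to meet the checklist: the subroutine names mean and report are illustrative, input checking is deliberately minimal (a line without two values is simply skipped), and returning 0 for the mean of no values is an assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Running totals and the number of (x, y) pairs entered so far.
my ($sum_x, $sum_y, $count) = (0, 0, 0);

# Return the mean of $n values whose total is $sum (0 if $n is 0).
sub mean {
    my ($sum, $n) = @_;
    return $n ? $sum / $n : 0;
}

# Print the cumulative sums and means. $label is "" or "final ".
sub report {
    my ($label) = @_;
    my ($mean_x, $mean_y) = (mean($sum_x, $count), mean($sum_y, $count));
    print "The ${label}sum of all x values is $sum_x ",
          "and the ${label}sum of all y values is $sum_y.\n";
    print "The ${label}mean of all x values is $mean_x ",
          "and the ${label}mean of all y values is $mean_y.\n";
}

my $prompt = 'Please enter 2 numbers, x and y, separated by a space '
           . '(press "S" to stop): ';

print $prompt;
while (defined(my $line = <STDIN>)) {
    last if $line =~ /^\s*S\s*$/i;        # "S" or "s" ends the loop
    my ($x, $y) = split ' ', $line;       # split the line on whitespace
    if (defined $x && defined $y) {
        $sum_x += $x;                     # add to the running totals
        $sum_y += $y;
        $count++;
        report("");
    }
    print $prompt;
}

report("final ");
print "Goodbye!\n";
```

Keeping only the totals and the count, rather than every value entered, is enough for sums and means because both can be updated one pair at a time.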

Part two: Writing a program without help

Without the help of the computer science majors, the biology majors are to write a program using the language Perl with the following features:

o The program should request the user to enter two numbers (x and y).
o The program should present the sum of all x and y values.
o The program should present the mean of all x and y values.
o The program should present the variance of all x and y values.
o The program should continue to provide cumulative data for x and y until the user decides to stop.

The biology majors must complete this task without the aid of their other group members. Computer science majors can offer no help other than offering encouraging expressions. After successfully completing the program, grab the TA and show him what you have done.

Hint: the equation for variance is: , where M is the sample mean, x is each individual data point, and N is the sample number.
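For reference, one common form of the variance that uses the symbols named in the hint is written out below. The choice of the N − 1 denominator (sample variance) rather than N (population variance) is an assumption here, as is the index i. The second form follows from expanding the square and is convenient for cumulative input, since it needs only the running sum of x, the running sum of x squared, and N.

```latex
s^2 = \frac{\sum_{i=1}^{N} (x_i - M)^2}{N - 1}
    = \frac{\sum_{i=1}^{N} x_i^2 - N M^2}{N - 1},
\qquad M = \frac{1}{N} \sum_{i=1}^{N} x_i
```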

TA checklist:

________ The program properly asks for two numbers (1 pt)

________ The program properly computes and displays the sum of all x and y values (2 pts)

________ The program properly computes and displays the mean of all x and y values (2 pts)

________ The program properly computes and displays the variance of all x and y values (2 pts)

________ The program properly repeats the process until the user stops (2 pts)

________ The program was written properly and well formatted (1 pt)

________/10

Intro to Bioinformatics Lab

Amplification, Sequencing, Editing, and Analysis of Genetic Data

The following labs are intended to demonstrate to the students how genetic data is generated. We use numerous strains of E. coli for genetic samples. I keep raw data from successful sequencing runs on hand in case the students' amplification or sequencing does not work (this is more common than not). Bacterial samples are convenient since no DNA purification is necessary before amplification. We have not included a lab on sequencing since such a lab is highly specific to the protocols in your lab or facility. This year we did not sequence strains but only amplified them. The students still seem to relish the experience. There are 3 or 4 separate labs in the series:

PCR Lab: This lab will show them what it really takes to amplify DNA.

Sequencing Lab: This would be specific to your facilities if you choose to run it.

Bioedit Lab: This lab allows students to see how noisy the raw output from sequencing is. They are generally amazed to find out that it really is not a digital signal. In addition, the view of chromatograms allows them to see that the computational task of identifying sequences from chromatograms is not trivial.

Analysis Lab: Using PAUP*, the students see how you can compare genetic data from different strains and look at relatedness using several different methods. It gives the students a good look at how divergent the sequences are for a couple of genes and also gives them some idea of how genes differ in their level of polymorphism.

Lab 3: PCR

Part 1: Pipetting practice

Pipetting is a very important aspect of doing PCR reactions. Poor pipetting can easily result in an unsuccessful PCR reaction. This part of the lab is meant to familiarize you with using a micropipette.

Procedure: As a group, obtain a micropipette, a box of tips, a tube of water, and an empty small tube. Take turns practicing pipetting by:

o Loading a tip onto the micropipette
o Drawing up various amounts of water
o Pipetting the water into the tube
o Disposing of the tip
o Loading a new tip
o Drawing the water back out of the tube and placing it into the original tube of water

Part 2: Programming the thermocycler

Follow the attached sheet for directions on programming the thermocycler. Program it to meet the following settings:

o 2 minutes at 95°C
o (1 minute at 95°C, 1 minute at 48°C, 1.5 minutes at 72°C) for 31 cycles
o 7 minutes at 72°C
o 4°C HOLD
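As a quick check on the settings above, the following sketch adds up the programmed hold times and shows the ideal copy number if every cycle doubled the target. It ignores the time the block spends ramping between temperatures, and perfect doubling is an upper bound that real reactions do not reach. The subroutine names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Thermocycler program from the settings above (times in minutes).
my $initial_denaturation = 2;
my @cycle_steps          = (1, 1, 1.5);   # 95 C, 48 C, 72 C
my $cycles               = 31;
my $final_extension      = 7;

# Minutes of programmed hold time for the whole run, ignoring ramping.
sub total_minutes {
    my ($initial, $steps, $n_cycles, $final) = @_;
    my $per_cycle = 0;
    $per_cycle += $_ for @$steps;
    return $initial + $n_cycles * $per_cycle + $final;
}

# Ideal copies per starting template if every cycle doubles the target.
sub ideal_copies {
    my ($n_cycles) = @_;
    return 2 ** $n_cycles;
}

printf "Programmed time: %.1f minutes\n",
    total_minutes($initial_denaturation, \@cycle_steps, $cycles, $final_extension);
printf "Ideal copies per template after %d cycles: %.0f\n",
    $cycles, ideal_copies($cycles);
```

With these settings the programmed time is 2 + 31 x 3.5 + 7 = 117.5 minutes, and 31 ideal doublings give 2147483648 copies per template.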

Part 3: PCR reaction

This part of the lab will be done in your normal groups. Most of the items you need will be closely guarded by your TA in a bucket of ice.

Procedure:

1. Obtain a rack for your tubes, 2 large tubes for your master mixes (1 for each type of primer), and one small tube per group member. Each group member should label their small tube with their initials and sample number. The sample number is found on your protocol worksheet.

2. Fill out your protocol worksheet by calculating the master mix totals.

3. All material must be kept on ice when not in use, especially the Taq and dNTPs!

4. Add the ingredients on the protocol sheet in order to the large master mix tube. Between each step, mix the solution by pipetting up and down.

5. Add primers to master mix.

6. Each group member should then add 25 ul from the master mix to their small tube.

7. Each person should then add 0.5 ul of sample to their tube.

8. Centrifuge the tubes for a few seconds before placing them into the thermocycler.

9. Run the program on the thermocycler.

10. The thermocycler does the rest!

Questions:

1. Describe briefly what is happening during the following 3 steps of a PCR reaction. Include in your description the significance of the temperature at each step.

Denaturation (95°C) –

Annealing (55°C) –

Extension (72°C) -

PCR worksheet

Name ________________________ Date __________________

Primer _______________________ Sample # ______________

_______________________

Master mix recipe

Ingredients # Reactions__________ Total

21.2 ul ultrapure water X_________________X 1.1 = ____________

2.5 ul 10X buffer X_________________X 1.1 = ____________

0.2 ul dNTPs X_________________X 1.1 = ____________

0.2 ul Taq X_________________X 1.1 = ____________

0.2 ul of forward primer X_________________X 1.1 = ____________

0.2 ul of reverse primer X_________________X 1.1 = ____________

Add 0.5 ul of sample

Group number:

Group members:
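The master mix totals on the worksheet are each a per-reaction volume multiplied by the number of reactions and by 1.1. A small sketch of that arithmetic follows. The group size of 4 is a hypothetical value, the subroutine name is illustrative, and reading the 1.1 factor as a 10% allowance for pipetting loss is an assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Per-reaction volumes in microliters, copied from the worksheet.
my @recipe = (
    [ 'ultrapure water', 21.2 ],
    [ '10X buffer',       2.5 ],
    [ 'dNTPs',            0.2 ],
    [ 'Taq',              0.2 ],
    [ 'forward primer',   0.2 ],
    [ 'reverse primer',   0.2 ],
);

# Total volume of one ingredient: per-reaction volume x number of
# reactions x 1.1 (the worksheet's multiplier).
sub master_mix_total {
    my ($per_reaction_ul, $reactions) = @_;
    return $per_reaction_ul * $reactions * 1.1;
}

my $reactions = 4;   # hypothetical group size; use your own count
for my $item (@recipe) {
    my ($name, $ul) = @$item;
    printf "%-16s %5.1f ul x %d x 1.1 = %6.2f ul\n",
        $name, $ul, $reactions, master_mix_total($ul, $reactions);
}
```

For four reactions, for example, the water line works out to 21.2 x 4 x 1.1 = 93.28 ul.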

Bioedit Lab

Introduction: Once we have run a gel and we know we have product, we can use a DNA sequencer to tell us the actual bases in the gene we have amplified. The machine, however, makes a few mistakes, and certain portions of the gene can simply be hard to read. Therefore, once the machine saves a sequence for us, we can import this file into a program to manually edit it. There are a number of freeware programs available to do this. We will be using Bioedit.

Procedure:

1. Drag the Bioedit folder from the CD to your desktop.

2. Open Bioedit by clicking the icon that says “bioedit”.

3. Click “file” → “open” and select your first sequence. Make sure “all files” is selected under file types. Two windows will open. Do not close either one of them. You will need the chromatogram later.

4. Now click on the window with the header “DNA sequence from…” and click “file” → “import” → “sequence from alignment file” and select your second sequence. Again, make sure “all files” is selected under file types. You should now have two sequences in your window.

5. These sequences are from the same gene, yet they look very different. Think back to how a PCR is done. Why do these sequences look so different? (Hint: what are two key things we added to our PCR reaction?)

6. Now we want to figure out what organism these genes are from and what proteins they code for. Using your mouse, highlight the entire first sequence. Copy and paste it into BLAST (http://www.ncbi.nlm.nih.gov/blast). What organism is this from and what gene does it code for?

7. Now that you know what organism and gene your sequences are from, you can change the names accordingly. Do this by clicking the name of the sequence once, pausing, and clicking it again. You can now change the name.

8. Now we want to start aligning these 2 sequences. Since they are from the same gene and the same organism, they should basically be the same. If you answered #5 correctly, the next step should be apparent. Running a BLAST search of both sequences should also help you figure this out. For the next step, click on the name of the second sequence, then click “sequence” → “nucleic acid” → _________________ (fill in the blank). What did you select and what did this do to the second sequence?

9. Now we want to align the sequences. In some programs you have to do this manually. Fortunately, Bioedit makes it very easy. Holding down your Ctrl key, click on the names of both sequences. Click “sequence” → “pairwise alignment” → “align two sequences (allow ends to slide)”.

10. Your sequences should now be lined up. Click the button to the left of the anchor. It has a bunch of small dots and letters. By clicking this button, all base pairs that are the same between the two sequences will now show up as dots. Any letter you see is a difference between the two sequences. These are areas of primary interest and will require some editing on your part. N’s are bases that the sequencer could not read properly. You will have to edit these as well. First, let’s take a closer look at the two sequences.

11. Scroll to the far left and right of the aligned sequences. There should be dashes on the left and right ends of at least one of the sequences. The dashes represent areas where there are no bases read. How could there be no base pairs read on the ends of one or both of the sequences? (Hint: Think about how PCR works).

12. Now you can begin editing your sequences. To edit, click “sequence” → “edit mode” → “edit residues”. You can now click anywhere in the sequences and edit them. Scroll left and right using your arrows or click wherever you want with your mouse. If you want to change a base, position the flashing cursor to the left of it and type your change. This will overwrite the old base. If you make a mistake, use “edit” → “undo”. You definitely want to replace all the N’s. To do this, look at the chromatogram (the window with the sequence and all the waves). Use the waves to help you figure out what the N actually is. You can adjust the horizontal and vertical scales to help you better resolve the bases.

13. When you are done editing, save your edited sequences to the desktop and your TA will come load them onto a fancy storage device.

Assessment: You will be graded on your responses to the above questions as well as the quality of your editing/alignment.

Group members:

Lab 5: PAUP is your friend

After completing this lab, you will be able to:
1. Construct phylogenetic trees using PAUP
2. Interpret these trees
3. Complete a bootstrap analysis using PAUP*

PAUP

Last week we aligned 8 sequences for the PAB and OMP genes from different E. coli specimens. We also created 8 “combined” sequences, which included both PAB and OMP gene sequences. This week we will use those multiple alignments to create some phylogenetic trees using PAUP.

1. Open PAUP. Click “file” → “open” and select “OMP”. Make sure “all file types” is selected.

2. Click “analysis” and select “parsimony”.

3. Click “trees” and select “generate trees”, then click “ok”.

4. Click “trees” and select “show trees”, select the first tree, and click “ok”.

5. Repeat steps 1-4 for the “PAB” and “combined” sequences on your desktop.

6. Look at all 3 trees (they should appear in the same window) and answer the following questions.

a. Are they the same? Different? If they are different, why do you think so?

b. Which sequence (PAB, OMP, or combined) do you think generates the most reliable tree? Why?

c. Describe in your own words what it means to be the most “parsimonious” tree.

7. Close everything and start over. Open PAUP. Click “file” → “open” and select “OMP”. Make sure “all file types” is selected.

8. Click “analysis” and select “likelihood”.

9. Click “trees” and select “generate trees”, then click “ok”.

10. Click “trees” and select “show trees”, select the first tree, and click “ok”.

11. Repeat steps 7-10 for the “PAB” and “Combined” sequences on your desktop.

Now let’s use bootstrap analysis to analyze the “combined” tree. Since the combined tree includes both PAB and OMP genes, it is the most phylogenetically useful. In a nutshell, bootstrapping generates new trees by randomly selecting characters (base pairs) over and over. The numbers it computes and places on the tree are an indication of how confident you can be about a taxon’s location on the tree. For example, if the number 100 appears on a tree, then 100% of the time those taxa appeared together during the bootstrapping procedure.

12. Close everything and start over. Open PAUP. Click “file” → “open” and select “combined”. Make sure “all file types” is selected.

13. Click “analysis” and select “bootstrap/jackknife”, click “continue”, then click “search”.

14. A window will appear which updates you on the search status. When it is done searching (it should only take a few seconds), click “close” and the consensus tree will appear.

15. Look at the generated tree and answer the following questions:

a. What can we say about the relationship between samples B21 and B22?

b. What can we say about the relationship between samples D11, C12, B32, and B31?
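The random selection of characters that bootstrapping relies on can be sketched in a few lines. The toy alignment and the subroutine name below are hypothetical, and this is not PAUP's own code. It only shows how one replicate data set is drawn: alignment columns are sampled at random, with replacement, until the replicate is as long as the original. A tree is then built from each replicate, and the number on a branch is the percentage of replicate trees in which that grouping appears.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A toy alignment: taxon name => aligned sequence (hypothetical data).
my %alignment = (
    A => 'ACGTACGT',
    B => 'ACGTACGA',
    C => 'ACGAACGA',
);

# Build one bootstrap replicate by drawing alignment columns at random,
# with replacement, until the replicate is as long as the original.
sub bootstrap_replicate {
    my ($aln) = @_;
    my ($any_taxon) = sort keys %$aln;
    my $length  = length $aln->{$any_taxon};
    my @columns = map { int(rand($length)) } 1 .. $length;

    my %replicate;
    for my $taxon (keys %$aln) {
        # The same column picks are used for every taxon, so each
        # resampled column is a real column of the original alignment.
        $replicate{$taxon} =
            join '', map { substr($aln->{$taxon}, $_, 1) } @columns;
    }
    return \%replicate;
}

my $replicate = bootstrap_replicate(\%alignment);
print "$_ $replicate->{$_}\n" for sort keys %$replicate;
```

Because columns are drawn with replacement, some appear several times in a replicate and others not at all, which is what lets the procedure show how much a grouping depends on a few characters.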

1) Biological and Computational Organization

Undergraduate students in biology rarely have a conceptual grasp of how

computers perform tasks. They may be able to utilize a few applications, but they

may not understand file management, and very few will understand how code in

a high-level language becomes an application that they use. Likewise, many

undergraduate computer science majors have little or no understanding of how

organisms perform tasks. They will know that DNA is the information molecule for

an organism, but they may not understand how enzymes function and very few

will understand how DNA information is used to produce phenotypes.

We have found that the following exercise engages students in connecting

the knowledge that they have in their own field to knowledge of the other

discipline by allowing them to draw parallels between biological and

computational organization. We use this exercise early and then reinforce the

concepts with very brief lectures on molecular biology and programming

exercises.

The students should be divided into several groups, and each group will

be asked a question. Within a group there will probably be enough knowledge to

get a rough answer to the question and they will present the answer. Discussion

by the whole class of the answers will then allow the students to reinforce or

rethink their conceptual understanding of the topic. This also gives you a very

good opportunity to assess the knowledge base of the students.

Group composition can be varied depending on the course goals and the

student makeup.

Biological Organization

Your DNA is useful for more than placing you at a crime scene or for

identifying you in the case of amnesia. DNA is an instruction set for the

functioning of any organism. In collaboration with your group, outline how you

think DNA information, in general, is converted into a functioning organism. Be

prepared to give a short presentation (no more than a few minutes) on how you

think DNA codes for a functioning organism. A diagram is often helpful.

Try to give details that explain the process. Keep in mind that a really good

description should allow for changes in the organism over time and differences

between individuals. For example, if you know two individuals who are in the

same condition, age, height and weight and they each drink a shot of tequila, can

you explain why one may take longer to eliminate the alcohol from their

bloodstream than the other? Can you explain why an individual who has not had a

drink in 2 months may not be able to clear the alcohol as fast as he could if he

had a drink several hours before? It is OK if you cannot answer the questions

but it is important to see how far your understanding of the process goes.

Computational Organization

If you own a computer you know how to run many applications. The

applications you run, however, had to be written by someone, and they most likely

were written (coded) in a high-level language. In collaboration with your group,

outline how you think, in general, code in a high-level language is converted into

a functioning program. Be prepared to give a short presentation (no more than a

few minutes) on how you think code becomes a functioning application. A

diagram is often helpful.

Try to give details that explain the process. Keep in mind that a really good

description should allow for changes in the application over time and flexibility in

the function of the application. For example, if you have an application that

controls a pH meter, can your description help explain how the same program

might also be used for measuring temperature, or how it might be modified to

measure pressure? It is OK if you cannot answer the questions but it is

important to see how far your understanding of the process goes.

2) Amplification of DNA fragments

Undergraduate students in biology generally are not used to putting together

algorithms. This can lead to a somewhat disjointed understanding of how systems

work. Computer science students are generally very adept at thinking in terms of

general algorithmic approaches, but when it comes to the specifics of biological

systems they, not surprisingly, will not have knowledge of biological processes.

We have found that the following exercise engages biology students by taxing

their knowledge of how a technology that is basic to much of molecular biology

works and organizing that knowledge into a more complete understanding with

the help of computer science students. The exercise also allows computer

science students to learn the basic biology while organizing the knowledge base

provided by the biology students.

The students should be broken down into several groups each with some

computer science students and some biology students. Within a group there will

probably be enough knowledge to get an incomplete answer to the question and

they will present the answer. Discussion by the whole class of the answers will

then allow the students to reinforce or rethink their conceptual understanding of

the topic and clarify how Polymerase Chain Reaction Amplification really works.

Essentially the same idea can be used to teach the students how DNA

sequencing works. Once they know how PCR works, sequencing is easy to

understand, but difficult to figure out. Most likely it is only once different groups

put together their ideas on how sequencing might work that they come up with

the complete answer.

Polymerase Chain Reactions (PCR)

You may have, at some time, seen one of the 20 or more crime scene

investigation television shows that currently run 24 hours a day. In those shows

they are always taking small DNA samples (sometimes a single molecule) and

amplifying them into thousands or millions of copies. We want you to describe

how this amplification works.

Write an algorithm that describes how a SPECIFIC section of DNA can be

amplified into thousands of copies. Your algorithm should describe what

‘ingredients’ or conditions are needed at each step and why they are needed.

Be prepared to give a short presentation (no more than a few minutes) on your

algorithm, and make sure that everyone in the group can explain the process.

DNA Sequencing

The crime scene investigation television shows that currently run 24 hours a

day are always taking DNA samples and reading the sequence of DNA from a

specific region of a chromosome. We want you to describe how you think this

process works (It is very unrealistic and often magical on the TV shows!).

Write an algorithm that describes how a SPECIFIC section of DNA can be

sequenced. Your algorithm should describe what ‘ingredients’ or conditions are

needed at each step and why they are needed. You should know that the

common technique is much like PCR, however some special items are involved.

In addition to the regular nucleotides you will also need to use nucleotides

that are labeled with markers (often a different color for each nucleotide). These

markers can be detected in DNA fragments that have incorporated them. You

should also know that when a molecule of Taq polymerase hits one of these large

markers, it is knocked off of the DNA and the copied fragment is terminated

at that point.

This is a difficult process to figure out (the people who first figured it out now

have Nobel Prizes!), but try to see how far you can get. The groups will each

present their idea of how this might work and we can always put together the

best ideas of each group. Hint: remember that by running DNA through an electrically

charged gel you can separate DNA strands of different lengths.

3) Learning to use the NCBI Website

Everyone who works in bioinformatics will sooner or later come to rely on the

NCBI website for information. Of course, every researcher will need different

things from the website. Teaching each student how to find anything they need

on the website is not, therefore, a good use of anyone’s time. Instead we give a

brief overview of the site and then give the following assignment to the students.

We then check to see which group got the best result and what methods they

used.

There are an infinite number of possible permutations of this type of question

but at the core a question should be found that can be tackled a number of

different ways. The best answer to the following question is most likely to be

found if a group uses the Taxonomy (phylogenetic) capabilities of the site,

something that most students will not think of. In searching around the site, most

students will learn to try new options; unlike being told about options, this way

they will remember how they found an answer.

Finding your way on NCBI

As a modern human you are a member of a group of organisms named Homo

sapiens sapiens. The NCBI Database (http://www.ncbi.nih.gov/) lists DNA

sequence data from hundreds of other species and subspecies.

Find a section of DNA sequence from an organism that is not Homo sapiens

sapiens, but is as closely related to us as you can find.

Then try to find a complete protein sequence from an organism as close to us

as possible.

Quick Questions and Problems for the Class on Molecular Biology

Many questions seem too easy or simple to bother asking in class, but they

each represent a potential victory for the student and they get students used to

interacting in class. Furthermore, they do not always know the answer offhand,

and this will get them used to asking their own questions. Here are a few

examples of the type of question we like to ask a couple of times per hour. Each

of these questions can generate some important class discussion.

How many bases per “word” are needed in the genetic code and why

are that many bases needed? Biologists will know this but some will not know why.

With a little luck someone will ask how many amino acids there are and the fun will begin.

About what percent of Americans have the phenylketonuria gene? This is

one of many questions that you can ask to help students to understand what a gene is

versus what an allele or a gene copy is. Studying variation is what most of bioinformatics

is about and many students do not fully understand the nature of genetic variation.

Is the DNA in your liver cells different from the DNA in your brain cells?

This can be followed up with: How is it that your liver cells and your brain cells

act so differently if they have the same instruction set?

Why is DNA transcribed into RNA before being translated into Proteins?

This is an interesting one because there are so many possible ways in which the

students might approach it. If you don’t ask it, it may well come up from a student in any

case.
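For the first question above, the counting argument can be checked with a short sketch. It assumes the usual figures of 4 bases and 20 standard amino acids plus a stop signal (21 meanings in all). The subroutine name is illustrative, and it expects an alphabet of at least 2 letters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Smallest word length whose number of possible words
# (alphabet size ** length) is at least the number of meanings needed.
# Assumes $alphabet_size is 2 or more.
sub min_word_length {
    my ($alphabet_size, $meanings) = @_;
    my $length = 1;
    $length++ while $alphabet_size ** $length < $meanings;
    return $length;
}

# 4 bases; 20 standard amino acids plus a stop signal = 21 meanings.
for my $length (1 .. 3) {
    printf "%d base(s) per word: %d possible words\n", $length, 4 ** $length;
}
print "Minimum word length: ", min_word_length(4, 21), "\n";
```

One base gives 4 words and two give 16, both too few, while three give 64, so three is the smallest word length that works. The surplus of 64 words over 21 meanings is a natural lead-in to a discussion of redundancy in the code.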

Phylogenetics and PAUP

Phylogenetic (tree building) problems provide a nice method for teaching

students about tractability and statistical issues, and also allow the discussion of

some optimization algorithms. Understanding the issues involved, however,

requires that students learn to think about trees and what they mean.

Since a tree based on genetic distance, for instance, may not be exactly the

same as a tree based on parsimony (i.e. a cladistic tree) it is good for the

students to get an idea of the diversity of methods upon which a taxonomy might

be based. In this exercise we ask the students to taxonomically group a set of

pencils which are actually related in terms of their manufacture. There is a

correct grouping in terms of cladistics which will be revealed if a parsimony

method is applied, but of course we leave it open to the students to come up with

any rational method of categorization.

Once the students present their groupings and the rationale behind them I tell

them how the pencils are actually related and launch into a discussion of

phylogenetic methods.

Pencil Taxonomy

Biology was revolutionized by the advent of taxonomic methods. In the past

50 years those methods have become much more sophisticated than their

predecessors, but the fundamental concepts remain the same.

Develop a scheme to group the 10 pencils that you have been given. Please

note that if you just describe how they differ you have not grouped them (except

to put each one in a separate group!). Try to form sets of groups that make sense

to you. Be prepared to discuss what your groupings are and how you arrived at

them.

More on Phylogenetics.

We follow the pencil exercise up with a lecture on distance and clustering

algorithms, parsimony and maximum likelihood methods. This is then

followed by another lecture on tree construction and search algorithms (nearest

neighbor interchange, genetic algorithms, branch and bound, star decomposition

and sequential addition) for finding the most parsimonious or the most likely tree.

In the process of teaching these methods we walk through step by step

examples of exhaustive parsimony reconstruction (for a very small tree!), branch

and bound searches for parsimony, nearest neighbor interchange and star

decomposition. Along the way the students are asked what step is next or

what the outcome of the algorithm will be. This is difficult material to cover in a

short course and we have probably tried to cover too much. One thing I have

noticed is that most students come away with a feel for the problem and an

understanding of a few of the methods.
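A step by step parsimony example can also be handed to the students as code. The following is a minimal sketch of Fitch's counting method for scoring a fixed tree, not the method PAUP itself uses internally. The tree representation (nested two-element arrays), the function names fitch and tree_length, and the four short sequences are all invented for illustration.

```perl
use strict;
use warnings;

# Fitch's pass for one site on a rooted binary tree.
# A tree is either a leaf name (a string) or an array ref [left, right].
# Returns the set of possible states at the node and the changes so far.
sub fitch {
    my ($node, $state_of) = @_;
    if (!ref $node) {
        my %leaf_set = ($state_of->{$node} => 1);
        return (\%leaf_set, 0);
    }
    my ($left_set,  $left_cost)  = fitch($node->[0], $state_of);
    my ($right_set, $right_cost) = fitch($node->[1], $state_of);
    my $cost   = $left_cost + $right_cost;
    my %common = map { $_ => 1 } grep { $right_set->{$_} } keys %$left_set;
    return (\%common, $cost) if %common;
    # No shared state: keep the union and count one change.
    my %union = (%$left_set, %$right_set);
    return (\%union, $cost + 1);
}

# Sum the per-site change counts over an alignment (taxon => sequence).
sub tree_length {
    my ($tree, $seqs) = @_;
    my ($first) = values %$seqs;
    my $total = 0;
    for my $site (0 .. length($first) - 1) {
        my %state_of = map { $_ => substr($seqs->{$_}, $site, 1) } keys %$seqs;
        my (undef, $cost) = fitch($tree, \%state_of);
        $total += $cost;
    }
    return $total;
}

# Made-up example data: four taxa, four sites.
my %seqs  = (A => 'AAGT', B => 'AAGC', C => 'CTGT', D => 'CTGC');
my $tree1 = [ ['A', 'B'], ['C', 'D'] ];
my $tree2 = [ ['A', 'C'], ['B', 'D'] ];
print "((A,B),(C,D)) length: ", tree_length($tree1, \%seqs), "\n";   # 4
print "((A,C),(B,D)) length: ", tree_length($tree2, \%seqs), "\n";   # 5
```

With these made-up sequences the first tree needs 4 changes and the second needs 5, so the first is the more parsimonious of the two. Asking the students to work one site by hand before running the code fits the "what step is next" style of questioning.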

We finally spend some time on bootstrapping and jackknife methods. This

prepares them for the following major project that they complete in the lab. We

give a major project on this topic because it

1) Forces them to integrate a number of tools

2) Pushes the limit of the programming skills of the biologists

3) Yields a result that has been published, showing them that they can

actually produce a publishable product.

4) Teaches them what randomization methods really can provide

to a study.
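The resampling step behind these two methods is small enough to show in Perl. This is a sketch under stated assumptions: the alignment is held as a hash of taxon name to aligned sequence, the three sequences are invented, and the function names are hypothetical. It only builds the resampled data; the tree search on each replicate is left to PAUP or other software.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Bootstrap: draw as many column indices as there are sites,
# sampling with replacement (so some columns repeat and some are lost).
sub bootstrap_columns {
    my ($n_sites) = @_;
    return map { int(rand($n_sites)) } 1 .. $n_sites;
}

# Jackknife: delete a fraction of the columns at random, without
# replacement, and return the indices that are kept.
sub jackknife_columns {
    my ($n_sites, $fraction_deleted) = @_;
    my $n_keep = $n_sites - int($n_sites * $fraction_deleted);
    my @order  = shuffle(0 .. $n_sites - 1);
    return sort { $a <=> $b } @order[0 .. $n_keep - 1];
}

# Build a new alignment from the chosen columns.
sub resample_alignment {
    my ($alignment, @columns) = @_;
    my %replicate;
    for my $taxon (keys %$alignment) {
        my $seq = $alignment->{$taxon};
        $replicate{$taxon} = join '', map { substr($seq, $_, 1) } @columns;
    }
    return \%replicate;
}

# Made-up alignment for illustration only.
my %alignment = (
    taxon_A => 'ACGTACGTAC',
    taxon_B => 'ACGTTCGTAC',
    taxon_C => 'ACCTTCGAAC',
);
my @columns   = bootstrap_columns(10);
my $replicate = resample_alignment(\%alignment, @columns);
print "columns drawn: @columns\n";
print "$_  $replicate->{$_}\n" for sort keys %$replicate;
```

Running it a few times shows the students that every replicate is the same size as the original data but weights the sites differently.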

Hornwort PAUP project.

Here is the problem:

There are two types of sites in the attached sequence alignment. Some of the sites follow the normal “central-dogma” of biology. The DNA is transcribed to RNA then the RNA is translated into proteins.

DNA → RNA → PROTEIN

There are, however, some sites that are transcribed into RNA, changed to a different base, and then translated. We call this RNA editing, and the sites are referred to as edited sites. Note that this is not cutting and splicing but single-base changes.

DNA → RNA → EDITED RNA → PROTEIN

We have identified the edited sites, and a large proportion of the parsimony-informative sites in the hornworts are edited sites. This is not surprising, because selection prevents the buildup of changes at most sites. Edited sites are protected from selection because those sites are edited back to the correct code.

It turns out that one genus of hornwort (Leiosporoceros) does not edit this gene. Outside of the hornworts, most plants don’t edit much, if at all. When we do a phylogenetic analysis of the hornworts and related groups, Leiosporoceros ends up outside of the hornwort clade (group)! We suspect this may be due to a similar pattern of change in the edited sites across the hornworts. They are changing at sites that never change in Leiosporoceros because selection eliminates the changes. Leiosporoceros may fall outside of the hornworts in the analysis even though it is a hornwort!

We need a bootstrapped phylogenetic analysis with all of the sites included and one with the edited sites removed. This will tell us if edited sites might be biasing the interpretation.

There will be differences in the confidence with which we separate Leiosporoceros from the hornworts. From those two analyses alone we won’t be able to tell whether that is because we are losing signal through the loss of characters in general or because edited sites are different from the other sites. You will solve this problem.

Your mission is to

1) Generate a bootstrap (n=1000 bootstraps) consensus tree based on a parsimony or maximum likelihood analysis of all unedited informative sites.

2) Generate a bootstrap (n=1000) consensus tree (same method as in 1) but as an analysis of edited sites only.

3) Generate a bootstrap (n=1000) consensus tree based on an analysis of all informative sites.

4) Generate 1000 analyses with a random set of informative sites removed (the number of sites removed being equal to the number of edited sites). Save the best trees from each analysis and combine them into a majority-rule consensus tree. There is no simple way to do this in PAUP. You will need to devise a method of repeating a relatively mundane analysis a large number of times.

In all of these cases I want you to randomize amongst sites that are informative within the hornworts rather than those that are informative only among the other taxa. Those other sites should be left in all analyses.
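As a starting point for item 4, the random choice of sites can be sketched in Perl. This is only one possible approach and deliberately stops short of a full solution: the site positions, the count of edited sites, and the function name random_sites_to_remove are all placeholders, and turning each list into an analysis that PAUP can run is left to you.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Pick $n_remove distinct sites at random from the candidate list
# and return them in ascending order.
# random_sites_to_remove is a hypothetical helper name.
sub random_sites_to_remove {
    my ($candidates, $n_remove) = @_;
    die "cannot remove more sites than are available\n"
        if $n_remove > @$candidates;
    my @shuffled = shuffle(@$candidates);
    return sort { $a <=> $b } @shuffled[0 .. $n_remove - 1];
}

# Placeholder values: replace with the real positions of the sites that
# are informative within the hornworts and the real number of edited sites.
my @informative_sites = (3, 8, 15, 21, 40, 41, 77, 102, 130, 151);
my $n_edited          = 4;

# Three replicates are shown here; the assignment asks for 1000.
for my $replicate (1 .. 3) {
    my @removed = random_sites_to_remove(\@informative_sites, $n_edited);
    print "replicate $replicate: remove sites @removed\n";
}
```

Each replicate removes the same number of sites as there are edited sites, so any difference from the edited-sites-removed tree cannot be blamed on the amount of data alone.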

The file HWORT_INFORM_NOEDIT.DAT is a file that has all of the sites that are informative within the hornworts and not edited.

The file HWORTs_INFORM.DAT is a file that has all of the sites that are informative within the hornworts (both edited and unedited).

The file HWORTs.Paup is a file that has all of the sites that are informative within the hornworts (both edited and unedited).

You can use any programming and analysis tools that are easily available. There are no turnkey solutions to this problem.

You will work in your lab groups and turn in a group analysis on disk. That analysis will include your results, including the treefiles you produced and any code you used to generate those files (can be PAUP files, Perl code etc.). If you have intermediate data files that help show how you completed the job, those should be included as well. In addition, each group member should submit a written description (one or two pages) of how problem 4 was solved and their opinion on whether the removal of edited sites from the analysis changes the tree more than the removal of a random set of sites.
