GBIO009-1 - Bioinformatics
____________________________________________________________________________________________________________________
Kirill Bessonov 1
Homework 1 and 2 review session
Presented by
Kirill Bessonov
November 2012
HW1: classical Q & A (GenomeGraphs) (1)
• The first two questions were on Bioconductor libraries. There are 608 BioC packages.
• To get the citation for a particular library, use citation("library_name").
• You were asked to get genomic data on a specific gene:
library(GenomeGraphs)
#connect to the Ensembl human genes dataset (BioMart)
ensembl_Human_Genes = useMart("ensembl", dataset="hsapiens_gene_ensembl")
#get info on the gene from the database, based on its Ensembl ID
gene <- makeGene(id = "ENSG00000115145", type="ensembl_gene_id", biomart = ensembl_Human_Genes)
#get info on the transcripts
transcript <- makeTranscript(id = "ENSG00000115145", type="ensembl_gene_id", biomart = ensembl_Human_Genes)
gdPlot(list("gene"=gene, "transcripts"=transcript))
#retrieve info from the database, displaying the first 25 entries
getBM(c("ensembl_gene_id", "hgnc_symbol", "description"),
  filter=c("with_exon_transcript", "with_protein_id", "with_transcript_variation"),
  values=list(TRUE, TRUE, TRUE), ensembl_Human_Genes)[1:25,]
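The citation() call mentioned on this slide can be tried without installing anything. The sketch below uses the stats package only because it ships with R; it is a stand-in for whichever installed library you actually want to cite.

```r
# citation() works for any installed package; "stats" is used here only as a stand-in
cit <- citation("stats")
print(cit)               # human-readable citation text

# The same citation as a BibTeX entry, ready to paste into a .bib file
bib <- toBibtex(cit)
print(bib)
```

For a Bioconductor library such as GenomeGraphs the call has the same form, provided the package is installed.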
HW1: classical Q & A (GenomeGraphs) (2)
• What is the gene name (i.e. hgnc_symbol) and function represented by the Ensembl ID - ENSG00000115145?
geneInfo=getBM(c("ensembl_gene_id", "hgnc_symbol", "description"), filter=c("with_exon_transcript", "with_protein_id", "with_transcript_variation"),values=list(TRUE, TRUE, TRUE), ensembl_Human_Genes )
> geneInfo[geneInfo$ensembl_gene_id == "ENSG00000115145", ]
ensembl_gene_id hgnc_symbol description
4829 ENSG00000115145 STAM2 signal transducing adaptor molecule (SH3 domain and ITAM motif) 2
• How many exons does the gene with Ensembl ID ENSG00000115145 have? 51 exons
> attr(gene, "ens")
   ensembl_gene_id ensembl_transcript_id ensembl_exon_id exon_chrom_start exon_chrom_end rank strand biotype
1  ENSG00000115145 ENST00000263904 ENSE00001351655 153032117 153032506 1 -1 protein_coding
2  ENSG00000115145 ENST00000263904 ENSE00002888710 153006659 153006743 2 -1 protein_coding
……
48 ENSG00000115145 ENST00000494589 ENSE00002785037 153004538 153004636 3 -1 protein_coding
49 ENSG00000115145 ENST00000494589 ENSE00002808134 153003676 153003822 4 -1 protein_coding
50 ENSG00000115145 ENST00000494589 ENSE00002929781 153001402 153001471 5 -1 protein_coding
51 ENSG00000115145 ENST00000494589 ENSE00001828491 153000503 153000527 6 -1 protein_coding
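The count of 51 is the number of rows in this table, and each row is an exon of a particular transcript. A minimal sketch of how one might check whether rows and distinct exon IDs agree, using only the six rows visible on the slide (rows 3 to 47 are elided there and are not reproduced; the data frame name ens is an assumption made for this example):

```r
# Only the rows shown on the slide; the elided rows are NOT filled in
ens <- data.frame(
  ensembl_transcript_id = c("ENST00000263904", "ENST00000263904",
                            "ENST00000494589", "ENST00000494589",
                            "ENST00000494589", "ENST00000494589"),
  ensembl_exon_id = c("ENSE00001351655", "ENSE00002888710",
                      "ENSE00002785037", "ENSE00002808134",
                      "ENSE00002929781", "ENSE00001828491"),
  rank = c(1, 2, 3, 4, 5, 6),
  stringsAsFactors = FALSE
)

n_rows <- nrow(ens)                                     # one row per exon of a transcript
n_unique_exons <- length(unique(ens$ensembl_exon_id))   # distinct exon IDs
exons_per_transcript <- table(ens$ensembl_transcript_id)
```

On the real table, replacing ens with attr(gene, "ens") would show whether any exon ID is listed under more than one transcript.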
HW1: classical Q & A (GenomeGraphs) (3)
• Execute the following command. How many chromosomes do you see? 25 chromosome names: 22 autosomes, the 2 sex chromosomes (X and Y) and the mitochondrial chromosome (MT).
• Why is the number of chromosomes in this Ensembl dataset greater than 23, the number of chromosome pairs? What do “MT”, “X” and “Y” refer to? X and Y are the sex chromosomes; they are listed separately although together they count as a single pair. MT is the mitochondrial chromosome, which is listed in addition to the 23 pairs.
> getBM("chromosome_name","","", ensembl_Human_Genes)[c(1:22,433:435),1]
[1] "1" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "2" "20" "21" "22" "3" "4" "5" "6" "7" "8" "9" "MT" "X" "Y"
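The tally behind the answer can be reproduced from the output shown on the slide. This is a small sketch in base R; the variable names are chosen here for illustration.

```r
# Chromosome names exactly as listed in the output on the slide
chrom <- c("1", "10", "11", "12", "13", "14", "15", "16", "17", "18",
           "19", "2", "20", "21", "22", "3", "4", "5", "6", "7", "8", "9",
           "MT", "X", "Y")

is_autosome <- grepl("^[0-9]+$", chrom)   # purely numeric names are autosomes
n_total     <- length(chrom)              # all chromosome names
n_autosomes <- sum(is_autosome)           # autosomes only
others      <- chrom[!is_autosome]        # sex and mitochondrial chromosomes

# The names are character strings, so they sort as text ("10" before "2");
# converting to integers gives the natural order of the autosomes
autosomes_sorted <- sort(as.integer(chrom[is_autosome]))
```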
HW2: Pairwise alignments (classical Q&A)
HW2: Pairwise alignments (classical Q&A) Q1
• Please align the following DNA sequences globally using the Needleman–Wunsch algorithm.
• Use the following scoring rules: a) gap -5; b) match between two bases +5; c) mismatch between two bases +3.
HW2: Pairwise alignments (classical Q&A) Q3
• Do a local protein alignment of the sequences HEAGAWGHEE and PAWHAE using the BLOSUM62 matrix. The scoring rules are: a) gap -8; b) matches and mismatches are scored according to the BLOSUM62 matrix.
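The fill step of a local (Smith-Waterman) alignment under these rules can be sketched in base R. The block below is not the course solution: the function name local_scores is made up for this example, and the substitution scores are a hand-copied subset of BLOSUM62 covering only the six residues that occur in the two sequences.

```r
# Subset of BLOSUM62 for the residues present in HEAGAWGHEE and PAWHAE (hand-copied)
aa <- c("A", "E", "G", "H", "P", "W")
blosum62_sub <- matrix(c(
   4, -1,  0, -2, -1, -3,
  -1,  5, -2,  0, -1, -3,
   0, -2,  6, -2, -2, -2,
  -2,  0, -2,  8, -2, -2,
  -1, -1, -2, -2,  7, -4,
  -3, -3, -2, -2, -4, 11),
  nrow = 6, byrow = TRUE, dimnames = list(aa, aa))

# Smith-Waterman fill step with a linear gap penalty d; returns the score matrix
local_scores <- function(x, y, s, d = -8) {
  xs <- strsplit(x, "")[[1]]
  ys <- strsplit(y, "")[[1]]
  H <- matrix(0, nrow = length(xs) + 1, ncol = length(ys) + 1,
              dimnames = list(c("", xs), c("", ys)))
  for (i in seq_along(xs)) {
    for (j in seq_along(ys)) {
      H[i + 1, j + 1] <- max(0,                           # start a new local alignment
                             H[i, j] + s[xs[i], ys[j]],   # match or mismatch
                             H[i, j + 1] + d,             # gap in y
                             H[i + 1, j] + d)             # gap in x
    }
  }
  H
}

H <- local_scores("HEAGAWGHEE", "PAWHAE", blosum62_sub)
best <- max(H)                                  # best local alignment score
end_cell <- which(H == best, arr.ind = TRUE)    # cell(s) where the trace-back would start
```

The difference from the global algorithm is the extra 0 in max() and the fact that the trace-back starts at the highest-scoring cell rather than the bottom-right corner.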
HW2: Pairwise alignments (classical Q&A) Q5
Produce a dot plot of the human and mouse p53 proteins from the previous question and paste the plot below. Complete the lines of R code to get the dot plot.
• Are both proteins similar? Yes, very similar, since we see a clear diagonal corresponding to >90% of the sequence length.
• Where do the region(s) of greatest variation occur? Between positions 50 and 100.
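The idea behind a dot plot can be shown in a few lines of base R. The p53 sequences are not part of this transcript, so the sketch reuses the two short sequences from Q3 as toy input; it is not the code the homework asked to complete.

```r
# Toy input: the short sequences from Q3, NOT the p53 proteins
a <- strsplit("HEAGAWGHEE", "")[[1]]
b <- strsplit("PAWHAE", "")[[1]]

# dots[i, j] is TRUE when residue i of the first sequence equals residue j of the second
dots <- outer(a, b, "==")
dimnames(dots) <- list(a, b)

# Coordinates of all matching pairs; similar regions show up as diagonal runs
hits <- which(dots, arr.ind = TRUE)

# Draw the plot only in an interactive session
if (interactive()) {
  plot(hits[, "row"], hits[, "col"], pch = 15,
       xlab = "position in sequence 1", ylab = "position in sequence 2")
}
```

For long proteins a sliding window with a match threshold is normally used to suppress isolated dots; this sketch compares single residues only.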
HW2: Pairwise alignments (classical Q&A) Q7
• What global alignment score do you get for the two p53 proteins when you use the BLOSUM62 alignment matrix, a gap opening penalty of -10 and a gap extension penalty of -0.5? Answer: a score of 1556
library(seqinr) #query() and getSequence()
library(Biostrings) #pairwiseAlignment()
query("p53_HUMAN", "AC=P04637")
p53_HUMAN_seq = getSequence(p53_HUMAN)
query("p53_MOUSE", "AC=P02340")
p53_MOUSE_seq = getSequence(p53_MOUSE)
globalAlign <- pairwiseAlignment(p53_HUMAN_seq, p53_MOUSE_seq, substitutionMatrix = "BLOSUM62", gapOpening = -10, gapExtension = -0.5)
• Common errors: the R code was not included, and the protein IDs (such as the UniProt ID P04637) were not given
HW2: Computer style
Implementation of the NW algorithm in R
HW2: Computer style (NW algorithm) [1]
• Given the pseudo-code, implement the NW algorithm in R
– The algorithm has two parts
• Calculation of the alignment F-matrix
• Finding the optimal path(s) through the matrix
for i=0 to length(A)
  F(i,0) ← d*i
for j=0 to length(B)
  F(0,j) ← d*j
for i=1 to length(A) {
  for j=1 to length(B) {
    Match ← F(i-1,j-1) + S(Ai, Bj)
    Delete ← F(i-1, j) + d
    Insert ← F(i, j-1) + d
    F(i,j) ← max(Match, Insert, Delete)
  }
}
d = gap penalty score
i and j = positions in the A and B sequences
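Written as a recurrence, the pseudo-code reads as follows. The last line works out a single cell, assuming the scoring used later in this review (match +2, mismatch -1, d = -8) and the sequences A = ATCG and B = TG.

```latex
\begin{aligned}
F(i,0) &= d \cdot i, \qquad F(0,j) = d \cdot j,\\
F(i,j) &= \max\bigl\{\, F(i-1,j-1) + S(A_i,B_j),\; F(i-1,j) + d,\; F(i,j-1) + d \,\bigr\},\\
F(1,1) &= \max\bigl\{\, 0 + S(\mathrm{A},\mathrm{T}),\; -8 + (-8),\; -8 + (-8) \,\bigr\}
        = \max\{-1,\,-16,\,-16\} = -1 .
\end{aligned}
```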
HW2: Computer style (NW algorithm) [2]
Fmatrix = function(A,B){
  fmatrix = matrix(0, nrow = nchar(A)+1, ncol = nchar(B)+1)
  d = -8 #this is the gap penalty
  for(i in 0 : nchar(A)){
    fmatrix[i+1,1] = d * i #populates the first column with gap penalties
  }
  for(j in 0 : nchar(B)){
    fmatrix[1,j+1] = d * j #populates the first row with gap penalties
  }
  for(i in 1 : nchar(A)){
    for(j in 1 : nchar(B)) {
      score = rules(substr(A,i,i), substr(B,j,j)) #get the score for the current pair of aa or nt
      match = fmatrix[i,j] + score
      delete = fmatrix[i,j+1] + d
      insert = fmatrix[i+1,j] + d
      fmatrix[i+1,j+1] = max(match,delete,insert)
    }
  }
  colnames(fmatrix) = strsplit( paste(" " , B, sep=""), "")[[1]]
  rownames(fmatrix) = strsplit( paste(" " , A, sep=""), "")[[1]]
  return(fmatrix)
}
HW2: Computer style (NW algorithm) [3]
#substitution matrix: match = 2, mismatch = -1
s.matrix <- matrix(rep(0,16), nrow = 4, ncol = 4, byrow = TRUE,
  dimnames = list(c("A","C","G","T"), c("A","C","G","T")))
s.matrix["A",] = c(2,-1,-1,-1)
s.matrix["C",] = c(-1,2,-1,-1)
s.matrix["G",] = c(-1,-1,2,-1)
s.matrix["T",] = c(-1,-1,-1,2)
#look up the score for one pair of nucleotides a and b
rules = function(a,b){
  return(s.matrix[a,b])
}
> s.matrix
   A  C  G  T
A  2 -1 -1 -1
C -1  2 -1 -1
G -1 -1  2 -1
T -1 -1 -1  2
HW2: Computer style (NW algorithm) [4]
• Check the F-matrix
> fmatrix = Fmatrix("ATCG", "TG")
> fmatrix
        T   G
    0  -8 -16
A  -8  -1  -9
T -16  -6  -2
C -24 -14  -7
G -32 -22 -12
• Start finding the optimal path(s) through the matrix
A = "ATCG"
B = "TG"
d = -8 #same gap penalty as inside Fmatrix
AlignmentA = ""
AlignmentB = ""
i = nchar(A) + 1 #start at the bottom-right cell of the F-matrix
j = nchar(B) + 1
#what is around that F-matrix cell?
CurrentScore = fmatrix[i,j] #score at the current position of the F-matrix
ScoreDiag = fmatrix[i - 1, j - 1] #diagonal neighbour
ScoreUp = fmatrix[i, j - 1] #same row, previous column
ScoreLeft = fmatrix[i - 1, j] #previous row, same column
HW2: Computer style (NW algorithm) [5]
• Selecting the bottom-right cell and starting to trace back the path of the optimal alignment
AlignmentA = ""
AlignmentB = ""
while(i > 1 && j > 1){
  CurrentScore = fmatrix[i,j] #which cell of the F-matrix am I in now?
  ScoreDiag = fmatrix[i - 1, j - 1]
  ScoreUp = fmatrix[i, j - 1] #same row, previous column
  ScoreLeft = fmatrix[i - 1, j] #previous row, same column
  #considering if the score came from the diagonal:
  #on a diagonal path, current score = diagonal cell + score of the current pair
  if (CurrentScore == ScoreDiag + s.matrix[substr(A,i-1,i-1), substr(B,j-1,j-1)]){
    AlignmentA = paste(substr(A,i-1,i-1),AlignmentA, sep = "")
    AlignmentB = paste(substr(B,j-1,j-1),AlignmentB, sep = "")
    i = i - 1
    j = j - 1
  }
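HW2: Computer style (NW algorithm) [6]
  #considering if the score comes from fmatrix[i-1, j] (introducing a gap in B)
  else if(CurrentScore == ScoreLeft + d){
    AlignmentA = paste(substr(A,i-1,i-1),AlignmentA, sep = "")
    AlignmentB = paste( "-", AlignmentB, sep = "")
    i = i - 1
  }
  #considering if the score comes from fmatrix[i, j-1] (introducing a gap in A)
  else if(CurrentScore == ScoreUp + d) {
    AlignmentA = paste( "-", AlignmentA, sep = "")
    AlignmentB = paste(substr(B,j-1,j-1), AlignmentB, sep = "")
    j = j - 1
  }
} #end of the while loop
#the loop stops at the first row or column: align whatever is left to gaps
while(i > 1){ AlignmentA = paste(substr(A,i-1,i-1),AlignmentA, sep = ""); AlignmentB = paste("-", AlignmentB, sep = ""); i = i - 1 }
while(j > 1){ AlignmentA = paste("-", AlignmentA, sep = ""); AlignmentB = paste(substr(B,j-1,j-1), AlignmentB, sep = ""); j = j - 1 }
print(AlignmentA)
print(AlignmentB)
finalScore = fmatrix[(nchar(A)+1),(nchar(B)+1)] #the bottom-right cell holds the alignment score
cat("Final score :", finalScore, "\n")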
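The fragments from slides [2] to [6] can be combined into one self-contained function. The block below is a sketch rather than the reference solution: the function name needleman_wunsch and the vectorised initialisation are choices made here, and the scoring (match 2, mismatch -1, gap -8) is the one used on the slides.

```r
# Substitution matrix indexed by nucleotide names: match = 2, mismatch = -1
nt <- c("A", "C", "G", "T")
s.matrix <- matrix(-1, nrow = 4, ncol = 4, dimnames = list(nt, nt))
diag(s.matrix) <- 2

needleman_wunsch <- function(A, B, s = s.matrix, d = -8) {
  a <- strsplit(A, "")[[1]]
  b <- strsplit(B, "")[[1]]
  n <- length(a)
  m <- length(b)
  # F-matrix: first column and first row hold cumulative gap penalties
  Fm <- matrix(0, nrow = n + 1, ncol = m + 1)
  Fm[, 1] <- d * (0:n)
  Fm[1, ] <- d * (0:m)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      Fm[i + 1, j + 1] <- max(Fm[i, j] + s[a[i], b[j]],  # match or mismatch
                              Fm[i, j + 1] + d,          # a[i] against a gap
                              Fm[i + 1, j] + d)          # b[j] against a gap
    }
  }
  # Trace-back from the bottom-right cell until the top-left cell is reached
  alnA <- character(0)
  alnB <- character(0)
  i <- n + 1
  j <- m + 1
  while (i > 1 || j > 1) {
    if (i > 1 && j > 1 && Fm[i, j] == Fm[i - 1, j - 1] + s[a[i - 1], b[j - 1]]) {
      alnA <- c(a[i - 1], alnA); alnB <- c(b[j - 1], alnB)
      i <- i - 1; j <- j - 1
    } else if (i > 1 && Fm[i, j] == Fm[i - 1, j] + d) {
      alnA <- c(a[i - 1], alnA); alnB <- c("-", alnB)
      i <- i - 1
    } else {
      alnA <- c("-", alnA); alnB <- c(b[j - 1], alnB)
      j <- j - 1
    }
  }
  list(score = Fm[n + 1, m + 1],
       A = paste(alnA, collapse = ""),
       B = paste(alnB, collapse = ""),
       F = Fm)
}

res <- needleman_wunsch("ATCG", "TG")
print(res$A)
print(res$B)
cat("Final score :", res$score, "\n")
```

Running the loop until both indices reach 1 makes sure that leading residues left over when one sequence is exhausted are still written out against gaps.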
HW2: Computer style (NW algorithm) [7]
• The scoring matrices could have been accessed through character indices, which requires no conversion and makes the code faster
• How would one output more than one best possible alignment?
• Please use more comments in your R code
• It would be nice to see the trace-backs visually
• Also, the scoring rules were not stated clearly
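One possible answer to the question about several best alignments is a recursive trace-back that follows every predecessor cell that explains the current score, instead of stopping at the first one. The sketch below also indexes the scoring matrix directly by character, as suggested in the first remark. The function names all_alignments and walk are invented for this example, and the scoring (match 2, mismatch -1, gap -8) is the one from the slides.

```r
nt <- c("A", "C", "G", "T")
s <- matrix(-1, nrow = 4, ncol = 4, dimnames = list(nt, nt))
diag(s) <- 2     # scoring matrix indexed by character, e.g. s["A", "T"]
d <- -8          # linear gap penalty

all_alignments <- function(A, B) {
  a <- strsplit(A, "")[[1]]
  b <- strsplit(B, "")[[1]]
  n <- length(a)
  m <- length(b)
  Fm <- matrix(0, nrow = n + 1, ncol = m + 1)
  Fm[, 1] <- d * (0:n)
  Fm[1, ] <- d * (0:m)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      Fm[i + 1, j + 1] <- max(Fm[i, j] + s[a[i], b[j]],
                              Fm[i, j + 1] + d,
                              Fm[i + 1, j] + d)
    }
  }
  # Recursive trace-back: branch into EVERY neighbour that reproduces the score
  walk <- function(i, j, alnA, alnB) {
    if (i == 1 && j == 1) return(list(c(A = alnA, B = alnB)))
    out <- list()
    if (i > 1 && j > 1 && Fm[i, j] == Fm[i - 1, j - 1] + s[a[i - 1], b[j - 1]])
      out <- c(out, walk(i - 1, j - 1, paste0(a[i - 1], alnA), paste0(b[j - 1], alnB)))
    if (i > 1 && Fm[i, j] == Fm[i - 1, j] + d)
      out <- c(out, walk(i - 1, j, paste0(a[i - 1], alnA), paste0("-", alnB)))
    if (j > 1 && Fm[i, j] == Fm[i, j - 1] + d)
      out <- c(out, walk(i, j - 1, paste0("-", alnA), paste0(b[j - 1], alnB)))
    out
  }
  walk(n + 1, m + 1, "", "")
}

alns <- all_alignments("AAT", "AT")   # toy input with tied optimal alignments
for (x in alns) cat(x[["A"]], x[["B"]], "", sep = "\n")
```

The number of co-optimal alignments can grow quickly with sequence length, so for long sequences one would usually cap the number of paths that are followed.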