Golden Rules of Bioinformatics

  • Published on
    23-Aug-2014

DESCRIPTION

Golden Rules of Bioinformatics. Presented as part of a full-day introductory bioinformatics course - the example data and source for the slides can be found at https://github.com/widdowquinn/Teaching-Intro-to-Bioinf

Transcript

  • An Introduction to Bioinformatics Tools, Part 1: Golden Rules of Bioinformatics. Leighton Pritchard and Peter Cock
  • On Confidence: "Ignorance more frequently begets confidence than does knowledge: it is those who know little, not those who know much, who so positively assert..." - Charles Darwin
  • Table of Contents: Rule 0, Rule 1, Rule 2, Rule 3, Conclusions
  • Zeroth Golden Rule of Bioinformatics
      - No-one knows everything about everything - talk to people! (local bioinformaticians, mailing lists, forums, Twitter, etc.)
      - Keep learning - there are lots of resources
      - There is no free lunch - no method works best on all data
      - The worst errors are silent - share worries, problems, etc.
      - Share expertise (see first item)
  • Subgroups
      - You are in group A, B, C or D - this decides your dataset: expnA.tab, expnB.tab, expnC.tab, expnD.tab
      - You will use R at the command-line to analyse your data
  • The biological question
      - Your dataset expn?.tab describes (log) expression data for two genes: gene1 and gene2
      - Expression measured at eleven time points (including control)
      - Q: Are gene1 and gene2 coregulated?
      - How do we answer this question?
  • Reformulating the biological question
      - Q: Are gene1 and gene2 coregulated?
      - A: We cannot determine this from expression data alone
      - Reformulate the question. NewQ: Is there evidence that gene1 and gene2 expression profiles are correlated? (is expression gene1 ∝ gene2?)
      - How do we answer this new question?
  • Starting the analysis: change directory to where the Exercise 1 data is located, and start R.
        $ cd ../../data/ex1_expression/
        $ R
  • Load and inspect data in R
        > data = read.table("expnA.tab", sep="\t", header=TRUE)
        > head(data)
          gene1 gene2
        1    10  8.04
        2     8  6.95
        3    13  7.58
        4     9  8.81
        5    11  8.33
        6    14  9.96
  • Load and inspect data in R
        > mean(data$gene1)
        [1] 9
        > mean(data$gene2)
        [1] 7.500909
        > sd(data$gene1)
        [1] 3.316625
        > sd(data$gene2)
        [1] 2.031568
        > cor(data)
                  gene1     gene2
        gene1 1.0000000 0.8164205
        gene2 0.8164205 1.0000000
  • Results
        measure      expnA  expnB  expnC  expnD
        mean(gene1)  9      9      9      9
        mean(gene2)  7.5    7.5    7.5    7.5
        sd(gene1)    3.3    3.3    3.3    3.3
        sd(gene2)    2.0    2.0    2.0    2.0
        cor(data)    0.816  0.816  0.816  0.816
    r = 0.816 (P < 0.005) in every experiment. Can we conclude that gene1 and gene2 are coexpressed in each experiment?
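The identical summary statistics can be reproduced without the course files. The sketch below assumes, as the later `data = anscombe` slide suggests, that the four expn?.tab datasets correspond to the x/y pairs of R's built-in `anscombe` data set. The helper name `summarise_pair` and the column labels are illustrative only.

```r
# Assumption: expnA..expnD correspond to the pairs (x1,y1)..(x4,y4) of
# R's built-in `anscombe` data set; the course files are not read here.
summarise_pair <- function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  test <- cor.test(x, y)  # Pearson correlation test (the default method)
  c(mean_x = mean(x), mean_y = mean(y),
    sd_x = sd(x), sd_y = sd(y),
    r = unname(test$estimate), p = test$p.value)
}

# One column per dataset, one row per summary measure
results <- sapply(1:4, summarise_pair)
colnames(results) <- paste0("pair", 1:4)
print(round(results, 3))
```

Every column should show essentially the same means, standard deviations, correlation and P-value, so these numbers alone cannot tell the four datasets apart.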
  • Plot the data in R
        > plot(data)
  • Always plot the data: which gene pairs are coexpressed?
  • Always plot the data: is the matrix of (Pearson) correlation values potentially misleading?
        > data = anscombe
        > cor(data)[1:4, 5:8]
                   y1         y2         y3         y4
        x1  0.8164205  0.8162365  0.8162867 -0.3140467
        x2  0.8164205  0.8162365  0.8162867 -0.3140467
        x3  0.8164205  0.8162365  0.8162867 -0.3140467
        x4 -0.5290927 -0.7184365 -0.3446610  0.8165214
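To see why the matching correlations mislead, each x/y pair can be drawn with its least-squares line. This is a minimal sketch using the built-in `anscombe` data. The function name `plot_quartet`, the 2x2 layout and the axis limits are illustrative choices, not taken from the slides.

```r
# Draw each x/y pair of the built-in `anscombe` data with its least-squares line.
# The 2x2 layout and axis limits are illustrative choices, not from the slides.
plot_quartet <- function() {
  old_par <- par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
  on.exit(par(old_par))  # restore the graphics settings afterwards
  fits <- lapply(1:4, function(i) {
    x <- anscombe[[paste0("x", i)]]
    y <- anscombe[[paste0("y", i)]]
    fit <- lm(y ~ x)
    plot(x, y, xlim = c(3, 20), ylim = c(2, 14),
         xlab = paste0("x", i), ylab = paste0("y", i), main = paste("Pair", i))
    abline(fit, col = "red")
    coef(fit)
  })
  do.call(rbind, fits)  # one row of (intercept, slope) per pair
}

coefs <- plot_quartet()
print(round(coefs, 2))
```

The fitted lines are nearly identical, yet the point clouds look quite different. A linear trend, a curve, and single outliers all produce the same fit.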
  • Sometimes real correlation doesn't mean anything
  • First Golden Rule of Bioinformatics
      - Always inspect the raw data (trends, outliers, clustering)
      - What is the question? Can the data answer it?
      - Communicate with data collectors! (don't be afraid of pedantry) Who? When? How?
      - You need to understand the experiment to analyse it (easier if you helped design it)
      - Be wary of block effects (experimenter, time, batch, etc.)
  • Exercise 2
      - You are in group A, B, C or D - this decides your database: dbA, dbB, dbC, dbD
      - You will use BLAST at the command-line to analyse your data
      - You will use script at the command-line to record your work
  • Exercise 2: start recording your actions by entering script at the command line
        $ script
        Script started, output file is typescript
  • Exercise 2
      - Change directory to the ex2_blast directory
      - Run BLAST with the appropriate database
      - Exit script
        $ cd ../ex2_blast
        $ blastp -num_alignments 1 -num_descriptions 1 -query query.fasta -db dbA
        $ exit
        exit
        Script done, output file is typescript
  • Exercise 2: you can view the typescript file with cat
        $ cat typescript
        Script started on Fri May 9 10:45:12 2014
        lpritc@lpmacpro:$ cd ../ex2_blast
        [...]
  • Exercise 2
        Query= query protein sequence
        Length=400
                                                                              Score
        Sequences producing significant alignments:                          (Bits)
        PITG_08491T0 Phytophthora infestans T30-4 choline transporter-l...    34.3

        > PITG_08491T0 Phytophthora infestans T30-4 choline transporter-like protein (441 aa)
        Length=486
         Score = 34.3 bits (77), Method: Compositional matrix adjust.
         Identities = 22/69 (32%), Positives = 38/69 (55%), Gaps = 4/69 (6%)

        Query  106  EVILPMMYQFALKPSFADVINDYKPYSKHTAGVSDQELKGEATTWMLADKNSRMKAFLSQ  165
                    E+++PM+Y   L   F   ++ Y P   HTA ++  EL+G   T ++A+  S +  F ++
        Sbjct  40   ELMVPMLYSLYLVVLFHLPVSAYYP---HTASMTAHELQGAVITILVAETPSIIIQF-AK  95

        Query  166  IKTKSNSSE  174
                      T SN S+
        Sbjct  96   CHTSSNISQ  104
  • Exercise 2: what is a reasonable E-value threshold to call a match? 1e-05, 0.001, 0.1, 10?
                   dbA    dbB    dbC    dbD
        E-value    0.45   0.002  4e-06  0.019
    Five orders of magnitude difference in E-value, depending on database choice - why?
  • Exercise 2
      - E-values depend on database size
      - Bit score and alignment do not depend on database size
                   dbA         dbB      dbC    dbD
        E-value    0.45        0.002    4e-06  0.019
        Bit score  34.3        34.3     34.3   34.3
        Sequences  100,001     501      1      5,001
        Letters    48,650,486  210,866  486    2,066,510
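The dependence on database size follows from the standard Karlin-Altschul approximation E ≈ m · n · 2^(-S'), where m is the query length, n the total database length and S' the bit score. The sketch below plugs in the raw lengths from the table. BLAST itself uses edge-corrected "effective" lengths, so the absolute values are not expected to match its output (here they come out roughly twice as large). The function name `evalue_from_bits` is illustrative.

```r
# Approximate E-value from a bit score: E = m * n * 2^(-S').
# Assumption: raw (uncorrected) query and database lengths are used; BLAST
# itself uses effective lengths, so its reported E-values are somewhat smaller.
evalue_from_bits <- function(bit_score, query_len, db_letters) {
  query_len * db_letters * 2^(-bit_score)
}

# Database sizes (total letters) and reported E-values as listed on the slide
db_letters <- c(dbA = 48650486, dbB = 210866, dbC = 486, dbD = 2066510)
reported   <- c(dbA = 0.45, dbB = 0.002, dbC = 4e-06, dbD = 0.019)

approx <- evalue_from_bits(bit_score = 34.3, query_len = 400,
                           db_letters = db_letters)
print(signif(rbind(approx, reported), 2))
```

With the bit score and query held fixed, the approximate E-value scales linearly with the number of letters in the database. This matches the pattern in the table: the same alignment looks insignificant against dbA and significant against dbC.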
  • Exercise 2
      - E-values differ, but the query matches a choline transporter-like protein quite well... Doesn't it?
      - After all, a biological match is a biological match... Isn't it?
  • Exercise 2
        Query= query protein sequence
        Length=400
                                                                              Score     E
        Sequences producing significant alignments:                          (Bits)  Value
        PITG_08491T0 Phytophthora infestans T30-4 choline transporter-l...    34.3    4e-06

        > PITG_08491T0 Phytophthora infestans T30-4 choline transporter-like protein (441 aa)
        Length=486
         Score = 34.3 bits (77), Expect = 4e-06, Method: Compositional matrix adjust.
         Identities = 22/69 (32%), Positives = 38/69 (55%), Gaps = 4/69 (6%)

        Query  106  EVILPMMYQFALKPSFADVINDYKPYSKHTAGVSDQELKGEATTWMLADKNSRMKAFLSQ  165
                    E+++PM+Y   L   F   ++ Y P   HTA ++  EL+G   T ++A+  S +  F ++
        Sbjct  40   ELMVPMLYSLYLVVLFHLPVSAYYP---HTASMTAHELQGAVITILVAETPSIIIQF-AK  95

        Query  166  IKTKSNSSE  174
                      T SN S+
        Sbjct  96   CHTSSNISQ  104
  • Exercise 2
      - Sequence accessions (PITG_?????T0) are correct in the databases
      - Sequence functional descriptions are randomly shuffled: lengths do not match in BLAST output
      - dbA contains only three different sequences: two are repeated 50,000 times
      - query.fasta is random sequence, not a real protein: shuffled from all P. infestans proteins, with no nr or PFam matches
  • Second Golden Rule of Bioinformatics
      - Do not trust the software: it is not an authority
      - Software does not distinguish meaningful from meaningless data
      - Software has bugs
      - Algorithms have assumptions, conditions, and applicable domains
      - Some problems are inherently hard, or even insoluble
      - You must understand the analysis/algorithm
      - Always sanity test
      - Test output for robustness to parameter (including data) choice
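One common sanity test is a negative control: a shuffled sequence keeps the residue composition of a real one but has no biological meaning, so any "match" to it is a chance match. The sketch below is one simple way to build such a control in R. It is not the procedure the authors used to generate query.fasta, which the slides do not describe, and the helper name `shuffle_sequence` is illustrative.

```r
# Shuffle a sequence to build a negative control with identical composition.
shuffle_sequence <- function(sequence) {
  residues <- strsplit(sequence, "")[[1]]
  paste(sample(residues), collapse = "")
}

set.seed(42)  # fixed seed so the illustration is repeatable
# Illustrative fragment (the first query row of the alignment in the slides)
fragment <- "EVILPMMYQFALKPSFADVINDYKPYSKHTAGVSDQELKGEATTWMLADKNSRMKAFLSQ"
control  <- shuffle_sequence(fragment)

# Same residues, different order: any "hit" to `control` is a chance match
same_composition <- identical(sort(strsplit(fragment, "")[[1]]),
                              sort(strsplit(control, "")[[1]]))
print(control)
print(same_composition)
```

Running the same search with such a control gives a baseline for the scores and E-values that meaningless input can produce against a given database.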
  • Exercise 3
      - Rule: If there is a vowel on one side of the card, there must be an even number on the other side.
      - Which cards must be turned over to determine if this rule holds true? (The four cards show E, K, 4 and 7.)
  • Exercise 3: this is the Wason Selection Task
      - If you chose E and 4: you are in the typical majority group, but you are not correct. You have been a victim of confirmation bias (System 1 thinking).
      - If you chose E and 7: congratulations! Your choice was capable of falsifying the rule.
  • Exercise 3
    Rule: If there is a vowel on one side of the card, there must be an even number on the other side.
        Card  Outcome    Rule
        E     Even       Can be true even if rule false
        E     Odd        violated
        K     Even       n/a
        K     Odd        n/a
        4     Vowel      Can be true even if rule false
        4     Consonant  n/a
        7     Vowel      violated
        7     Consonant  n/a
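The table can be checked mechanically by asking, for each visible face, whether any possible hidden face would violate the rule. This is a minimal sketch. It assumes every card has a letter on one face and a single-digit number on the other, and all function names are illustrative.

```r
# Rule: if a card shows a vowel on one face, the opposite face is even.
is_vowel <- function(face) face %in% c("A", "E", "I", "O", "U")
is_odd   <- function(face) grepl("^[0-9]+$", face) && as.integer(face) %% 2 == 1

# A card breaks the rule only if it pairs a vowel with an odd number.
violates <- function(letter, number) is_vowel(letter) && is_odd(number)

# Can turning over `visible` possibly reveal a violation?
# Assumption: each card has a letter on one face and a number on the other.
can_falsify <- function(visible) {
  if (grepl("^[0-9]+$", visible)) {
    any(sapply(LETTERS, function(hidden) violates(hidden, visible)))
  } else {
    any(sapply(as.character(0:9), function(hidden) violates(visible, hidden)))
  }
}

cards <- c("E", "K", "4", "7")
informative <- sapply(cards, can_falsify)
print(informative)
```

Only E and 7 come out as informative. Whatever is behind K or 4, the rule cannot be broken by that card, so turning it over is an uninformative experiment.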
  • Exercise 3: this is equivalent to functional classification, e.g. Rule: If there is a CRN/RxLR/T3SS domain, the protein must be an effector.
  • Exercise 3
      - Confirmation bias (Wason Selection Task): an uninformative experiment is performed. http://en.wikipedia.org/wiki/Wason_selection_task
      - Affirming the consequent (a related formal fallacy): 1. If P, then Q; 2. Q; 3. Therefore, P. Experimental results are misinterpreted. http://en.wikipedia.org/wiki/Affirming_the_consequent
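That affirming the consequent is invalid can be shown by listing every truth assignment and looking for a case where both premises hold but the conclusion fails. A minimal sketch, with illustrative variable names:

```r
# Enumerate all truth assignments for P and Q
truth <- expand.grid(P = c(TRUE, FALSE), Q = c(TRUE, FALSE))
truth$P_implies_Q <- !truth$P | truth$Q

# Rows where both premises ("if P then Q" and "Q") hold
premises_hold <- truth[truth$P_implies_Q & truth$Q, ]
print(premises_hold)

# The argument is valid only if P is true in every such row
argument_valid <- all(premises_hold$P)
print(argument_valid)
```

The row with P false and Q true satisfies both premises, so the conclusion does not follow. In the functional classification example, this corresponds to a protein that is an effector without carrying the domain.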
  • Third Golden Rule of Bioinformatics
      - Everyone has expectations of their data/experiment
      - Beware cognitive errors, such as confirmation bias! System 1 vs. System 2: intuition vs. reason
      - Think statistically! Large datasets can be counterintuitive and appear to confirm a large number of contradictory hypotheses
      - Always account for multiple tests
      - Avoid data dredging: intensive computation is not an adequate substitute for expertise
      - Use test-driven development of analyses and code: use examples that pass and fail
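The need to account for multiple tests can be illustrated with simulated data in which no real effect exists. The sketch below runs many t-tests on pure noise and counts "significant" results before and after correction with R's `p.adjust`. The number of tests, the sample sizes and the 0.05 threshold are arbitrary illustrative choices.

```r
set.seed(1)  # fixed seed so the illustration is repeatable

# 1000 tests in which the null hypothesis is true by construction:
# both groups are drawn from the same normal distribution.
n_tests <- 1000
p_values <- replicate(n_tests, t.test(rnorm(10), rnorm(10))$p.value)

raw_hits        <- sum(p_values < 0.05)
bonferroni_hits <- sum(p.adjust(p_values, method = "bonferroni") < 0.05)
bh_hits         <- sum(p.adjust(p_values, method = "BH") < 0.05)

# Expect roughly 5% false positives before correction, few or none after
print(c(raw = raw_hits, bonferroni = bonferroni_hits, BH = bh_hits))
```

Every uncorrected hit here is a false positive by construction. This is how a large dataset can appear to confirm many hypotheses at once.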
  • In Conclusion
      - Always communicate! The worst errors are silent
      - Don't trust the data: formatting/validation/category errors - check! Is it suitable for the scientific question?
      - Don't trust the software: software is not an authority; always benchmark, always validate
      - Don't trust yourself: beware cognitive errors; think statistically; biological stories can be constructed from nonsense