audra-octavia-pitts
EMBL-EBI
Visualization & Data mining
Visualisation
The process of representing abstract data to aid in understanding the meaning of the data.
Not to be confused with rendering data (drawing pictures)
Typically, though, we render data in such a way as to visualize the information within that data.
Introduction
Biological data comes from, and is of interest to:
  Chemists: reaction mechanism, drug design
  Biologists: sequence, expression, homology, function
  Structural biologists: atomic structure, fold, classification, function
  Medicine: clinical effect
  Education
  Media
Presentation of diverse information to a diverse audience. Each has their own point of view (context).
  Expert = scientist working within their own field of expertise
  Non-expert = scientist using data/information outside their field
  Novice = non-scientist
Web pages
These are notoriously badly designed, often resulting in the information on the site being unusable.
  The front page should load quickly
  The main point should appear on the first full screen
  Clutter – not logically laid out
  Too busy – cannot find the salient point
  8% of men & 0.5% of women are colour blind
  Bad text/fonts
Too often it doesn't work
  User will go somewhere else
  The latest whizz-bang stuff only works on the latest browsers
  Only works in one browser – they only tested on one
    – Does not conform to standard HTML
Not just presentation of results
Google is a good design
Asking questions
Biological data is very complex
  Chemistry, Biology, Physics, Statistics, Medicine...
  Most users will be from a different field
Asking the right question is difficult
  The user cannot use the correct terminology
  Too many things to query (2000 attributes in MSD)
  SQL: not suitable for most users
Interface too complex
  Too many check boxes, widgets etc.
  Trying to be too clever
  The “Go” button is buried somewhere
Result presentation
Results
Biological data is complex
  Chemistry, physics, biology, statistics, medicine…
Expert users want all the detail
  i.e. they want to use a specific method
  They want all the details
  They want (I hope) the statistical validity of the results
The non-expert wants the best-practice answer returned within their own context
  They want comparative analysis with other fields
  They want to know the results are valid
Query design
Suitable for text queries
Only one logic
  AND or OR
Predefined
Easy to use
Limited scope
  2000 attributes -> 2000 check-boxes!
The simple text box design is very common
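As a rough illustration of the "only one logic" limitation, the sketch below applies a single AND or OR across all search terms. The function name `text_query` and the sample records are hypothetical and are not taken from any EBI service.

```python
def text_query(records, terms, logic="AND"):
    """Return the records that match the terms under ONE logic only.

    Hypothetical sketch: every term is combined with the same operator,
    which is exactly the limitation of a simple text box.
    """
    if logic not in ("AND", "OR"):
        raise ValueError("logic must be 'AND' or 'OR'")
    combine = all if logic == "AND" else any
    wanted = [t.lower() for t in terms]
    return [r for r in records if combine(t in r.lower() for t in wanted)]


# Invented example records, for illustration only.
records = ["lysozyme hydrolase", "serine protease hydrolase", "kinase"]
both = text_query(records, ["hydrolase", "serine"], logic="AND")
either = text_query(records, ["hydrolase", "serine"], logic="OR")
```

The user cannot mix operators or negate a term here, which is what the graphical designs on the following slides try to address.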
Query design
Graphical interface
Multiple logic
  AND/OR/NOT
Under user's control
Slower
Steep learning curve
  Some users just cannot get it
Intuitive once mastered
Pretty
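Behind a graphical builder, a multiple-logic query can be held as a small tree of AND/OR/NOT nodes. The following is a minimal sketch under that assumption; the tuple representation and the `matches` function are illustrative choices, not the design of any particular tool.

```python
def matches(query, text):
    """Evaluate a nested AND/OR/NOT query against one text record.

    A query is either a plain search term (str) or a tuple of the form
    ("AND", q1, q2, ...), ("OR", q1, q2, ...) or ("NOT", q).
    This representation is an assumption made for illustration.
    """
    if isinstance(query, str):
        return query.lower() in text.lower()
    op, *args = query
    if op == "AND":
        return all(matches(a, text) for a in args)
    if op == "OR":
        return any(matches(a, text) for a in args)
    if op == "NOT":
        return not matches(args[0], text)
    raise ValueError(f"unknown operator: {op!r}")


# hydrolase AND NOT serine
query = ("AND", "hydrolase", ("NOT", "serine"))
hit = matches(query, "lysozyme hydrolase")
miss = matches(query, "serine protease hydrolase")
```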
Query design
HIS|SER:S/H>C2.0 HIS.ne2:S/S>C2.0 HIS.
[n]/T>C2.0
Figurative 2D sketch for 3D query (active sites)
Informative – presents meaning for the question
Slower
Less error prone
YAMGP (yet another molecular graphics program)
Many different programs are available
AstexViewer@MSD-EBI
Quanta
Rasmol
MolMol
Chime
O
Spock
Swiss-PDBviewer
Molscript
iMol
Pymol
Chimera
XtalView
Frodo
Bobscript
InsightII
Raster3D
WebLab-viewer
POVRay
Yasara
LigPlot
WebMol
Pymol
Grasp
Mage
Whatif
VMD
Frodo
Result visualisation
Multiple types of biological data
  Textual data
  3D structure
  2D chemical sketches
  1D sequence
  Node linked
  General/derived data
  Web pages
  Time
  Errors/Variance
Patented !
Visualisation : AstexViewer@MSD-EBI
Visualisation
  Lensing
  Linked views
  Brushing
  Picking
  Flying views
  Hyperbolic distortion
  Animation
  Solid rendering
  Depth cues
  Colour, lighting
  Highlighting
  Etc…
Structure/sequence/data
Visualisation : comparative analysis
Similarity/Difference
  Data superposition
  Attribute display
    Colour, size…
Correlation
  Attribute mapping
    Sequence colour by structure alignment
Analysis Example
Animation
Animation
  Time-dependent display
    Reaction chemistry
    Visual clues
    Expression data
  Shown as…
    Rotation
    Flash
    On/off
    Object
    Synchronization
    Size, colour…
Sound
  NO: incredibly annoying
Animation Example
Multidimensional analysis
Comparative analysis on multiple data
  E.g. Phi, Psi, B-value, Omega
1D & 2D easy
3D graphs are difficult to see
4D requires 3D + iso-surfaces
Higher – too busy
Use 2D + multiple properties
  SPOTFIRE is the most well known
  Use: X/Y/colour/size/shape…
  Interactive bracketing
Example
Visualization - Summary
Rendering data is not visualization
Not just the display of results
Huge array of non-specific techniques – an entire scientific field!
Data mining
“Analysis of data in a database using tools which look for trends or anomalies without knowledge of the meaning of the data.” (Hyperdictionary)
“True data mining software does not just change the presentation, but discovers previously unknown relationships among the data.” (IBM)
Data mining & Data analysis
Traditional analysis is via “verification-driven analysis”
  Requires hypothesis of the desired information (target)
  Requires correct interpretation of proposed query
Discovery-driven data mining
  Finds data with common characteristics
  Results are ideal solutions to discovery
  Finds results without previous hypothesis
  Results have unbiased mean and variance
So what is Hypothesis driven data analysis ?
Define a target = hypothesis
Search for target
There are/are not “hits”
  Verify/negate hypothesis
Distribution is centred on target
“catalytic triad”: text string matching
Atomic coordinates: coordinate superposition
Mathematical graph: graph matching
HIS, ASP, SER: data hierarchy knowledge
Four types of data mining
Creation of predictive models : future data expectation
Link analysis : connections between data objects
Database segmentation : classification
Deviation detection : finding outliers
(Source: IBM white papers)
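Of the four types, deviation detection is the easiest to sketch. The example below flags values that lie far from the mean in units of the sample standard deviation. The z-score rule, the threshold and the sample values are assumptions chosen for illustration; the slides do not say which method was used.

```python
from statistics import mean, stdev


def find_outliers(values, threshold=2.0):
    """Return values whose z-score exceeds the threshold.

    Assumed rule: |value - mean| / sample standard deviation > threshold.
    """
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Invented measurements with one obvious deviation.
values = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 9.0]
outliers = find_outliers(values, threshold=2.0)
```

With small samples a single extreme value inflates the standard deviation, so a robust measure such as the median absolute deviation may be a better choice in practice.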
So what is this data mining ?
Given multiple sets of primary data (dependent variables)
  Characters, numbers, Function(numbers), …
Find anomalies
  Too many: numerical occurrence
  Data variation: derivatives
  Singularities…
Correlations and clusters
  Within primary data
  With other data (independent variables)
Finds new things! But not what it means!
E.g.
Retail and financial industries are heavily into DM.
A well-known US food supermarket chain found a correlation:
  Babies' nappies
  Beer
  5pm on Friday
Wife rings husband: “get some nappies for the weekend”
Husband takes the opportunity to buy some beer!
You won’t grant funding to test this hypothesis!
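A correlation of this kind is typically found by counting how often items appear together in the same basket and comparing that with what independence would predict (the "lift"). The sketch below shows the arithmetic only. The baskets are invented for illustration and are not the supermarket's data; `pair_lift` is a hypothetical name.

```python
from collections import Counter
from itertools import combinations


def pair_lift(baskets):
    """Lift of every item pair: P(a and b) / (P(a) * P(b)).

    A lift above 1 means the pair co-occurs more often than expected
    if the two items were bought independently.
    """
    n = len(baskets)
    item_counts = Counter(item for b in baskets for item in set(b))
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(set(b)), 2)
    )
    return {
        (a, b): (count / n) / ((item_counts[a] / n) * (item_counts[b] / n))
        for (a, b), count in pair_counts.items()
    }


# Invented baskets, for illustration only.
baskets = [{"nappies", "beer"}, {"nappies", "beer"}, {"milk"}, {"bread"}]
lift = pair_lift(baskets)
```

The number says the two items go together; it does not say why, which is the point made on the slide.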
Self/Cross data mining
Most mining software looks for correlations between dependent variables.
  Rainfall, temperature, cloud-cover
  It rains when it is cloudy
  Free: http://www.cs.waikato.ac.nz/~ml/
Bioinformatics usually involves anomalies within data objects
  Sequence clusters (sequence fingerprints)
  Local coordinate clusters (active sites)
  Global coordinate clusters (folds)
Data mining – not idiot proof
Date of birth and age will give 100% correlation
Authors for structure submission will be correlated to authors on primary citation
“Lysozyme” is the most common fold pattern
36 spellings of E. coli will mask results
Requires representative sets
  Statistically valid ones too!
Signal/Noise ratio is a problem
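The spelling problem can be reduced by mapping free-text names to a canonical key before counting. The sketch below handles only a few obvious variants and is an assumption about how such cleaning might be done. The variant list is invented and is not the 36 spellings referred to above.

```python
import re
from collections import Counter


def normalise_organism(name):
    """Collapse case, punctuation and spacing into one comparison key.

    Hypothetical rule: keep letters only, and abbreviate a leading
    'escherichia' to 'e' so full and short forms share a key.
    """
    key = re.sub(r"[^a-z]", "", name.lower())
    if key.startswith("escherichia"):
        key = "e" + key[len("escherichia"):]
    return key


# Invented variants, for illustration only.
variants = ["E.Coli", "E. coli", "Escherichia coli", "ESCHERICHIA COLI", "e coli"]
raw_counts = Counter(variants)
clean_counts = Counter(normalise_organism(v) for v in variants)
```

Before cleaning, each spelling counts once and the organism looks rare; after cleaning, all five fall under a single key.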
Discovery driven data mining of the PDB
Analysis of 3-dimensional coordinates
Defined common patterns of atomic interactions locally
  DB segmentation - active sites & common packing features
  Link analysis - similarity between different functional groups
Defined globally
  DB segmentation - common patterns of super-secondary structure
  Link analysis - common folds in diverse protein families
  Outlier detection - unique folds
Issues
Systematic “error” propagates as solution
  300 lysozyme structures return as a strong solution
Results cannot be found below the noise level
  Need to characterise the noise level
  Need to improve signal/noise ratio (S/N) to see information
Target is not biologically defined
  It does not give you the biological answer
  Results should reproduce known biology
  Can give you new results not previously observed
Data selection
Cannot leave in 300 lysozyme structures!
  Select by sequence similarity at 70% exact alignment
  Different “phase space” to select data
Remove structures with resolution worse than 2.5 Å
Remove NMR (different statistics)
Remove pre-1982 etc.
Geometrical analysis criteria to check for outliers
  Using properties NOT target parameters of structure solution
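The simpler selection rules above can be sketched as a filter over entry metadata. The `Entry` fields, the identifiers and the reading of the resolution rule (drop entries numerically above 2.5 Å) are assumptions for illustration. The sequence-similarity and geometry checks are not covered here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    entry_id: str                # hypothetical identifier
    method: str                  # e.g. "X-RAY" or "NMR"
    resolution: Optional[float]  # in Angstrom; None when not applicable
    year: int


def select_entries(entries, max_resolution=2.5, min_year=1982):
    """Keep entries that pass the method, resolution and date filters."""
    kept = []
    for e in entries:
        if e.method == "NMR":
            continue  # different statistics
        if e.resolution is None or e.resolution > max_resolution:
            continue  # assumed meaning of the resolution rule
        if e.year < min_year:
            continue  # remove pre-1982 entries
        kept.append(e)
    return kept


# Invented entries, for illustration only.
entries = [
    Entry("ex01", "X-RAY", 1.8, 1995),
    Entry("ex02", "NMR", None, 1999),
    Entry("ex03", "X-RAY", 3.1, 2001),
    Entry("ex04", "X-RAY", 2.0, 1979),
]
selected = [e.entry_id for e in select_entries(entries)]
```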
Local atomic interactions
Data
  Function(3D coordinates) = distance
  Atom names (independent variable)
  Residue names (independent variable)
Create 3D hash table of triplets of distances(*) between “points”
  This is the dependent variable
  Order = 3
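One way to picture the hash table is to bin the three pairwise distances of every point triplet and use the binned values as a key, so that triplets with the same geometry land in the same cell. The bin width, distance cutoff and key ordering below are assumptions for illustration; the slides do not give the actual parameters.

```python
from collections import defaultdict
from itertools import combinations
from math import dist


def triplet_hash(points, bin_width=0.5, cutoff=6.0):
    """Map binned distance triplets to the point indices that form them.

    points: list of (x, y, z) coordinates.
    Key: the three pairwise distances, binned and sorted, so the key does
    not depend on the order in which the points are listed.
    """
    table = defaultdict(list)
    for (i, a), (j, b), (k, c) in combinations(enumerate(points), 3):
        distances = (dist(a, b), dist(b, c), dist(a, c))
        if max(distances) > cutoff:
            continue  # points too far apart to count as a local interaction
        key = tuple(sorted(int(d / bin_width) for d in distances))
        table[key].append((i, j, k))
    return table


# Two congruent 3-4-5 triangles far apart from each other.
points = [
    (0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0),
    (100.0, 100.0, 100.0), (103.0, 100.0, 100.0), (100.0, 104.0, 100.0),
]
table = triplet_hash(points)
```

Cells holding many triplets are candidate recurring patterns; the atom and residue names can then be attached to interpret them.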
Local atomic interactions
Merge triplets
  Any pair of N-fold interactions is an (N+1)-fold interaction if they have (N-1) equivalences
  Order = N
Just keep going until no more (N+1)-fold interactions are found.
Time = 8 seconds to find ~2000 interactions (Digital Alpha ES40)
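The merge step can be sketched by treating each interaction as a set of point identifiers and joining any two N-member sets that share N-1 members. This follows the rule as stated on the slide and is a simplification: a fuller version would presumably also check the geometry between the two unshared points. The function name and seed data are hypothetical.

```python
from itertools import combinations


def grow_interactions(seeds):
    """Repeatedly merge N-fold interactions into (N+1)-fold ones.

    seeds: iterable of equally sized collections of point identifiers.
    Returns one set of frozensets per order, starting with the seeds.
    """
    current = {frozenset(s) for s in seeds}
    if not current:
        return []
    levels = [current]
    while True:
        n = len(next(iter(current)))
        merged = {
            a | b
            for a, b in combinations(current, 2)
            if len(a & b) == n - 1  # the two sets differ in exactly one member
        }
        if not merged:
            return levels  # no more (N+1)-fold interactions found
        levels.append(merged)
        current = merged


# Four triplets over points 1..4, plus one unrelated triplet.
seeds = [(1, 2, 3), (2, 3, 4), (1, 2, 4), (1, 3, 4), (7, 8, 9)]
levels = grow_interactions(seeds)
```

Here the four overlapping triplets collapse into a single 4-fold interaction, and the loop stops because nothing of order 5 can be formed.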
Catalytic quartet
Electrostatic interaction
Ligands are found close by rather than associated with the residues
Iron binding site
Double disulphide
N-linked glycosylation binding site +
Spot the non-sugar
This glycosylation site is the same as the active site found in “1a53” – indole-3-glycerol phosphate synthase
Summary
Nearly all bioinformatics is based on hypothesis-driven data analysis
  Data mining has lost its meaning within bioinformatics
Discovery-driven data analysis (true data mining):
  Can find unknown dependencies, clusters, outliers
  Is based on statistical probability
  Returns distributions unbiased by previous ideas
Information technology may be better for genomes (1D)
  “A numerical measure of the uncertainty of an outcome”
  Information content of gene sequences can be defined by the normalized probability of finding “words” within that sequence
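One common reading of this last point is the Shannon entropy of the word (k-mer) frequencies in a sequence. The sketch below assumes that reading; the word length `k` and the function name are illustrative choices, not taken from the slides.

```python
from collections import Counter
from math import log2


def word_entropy(sequence, k=3):
    """Shannon entropy (bits per word) of overlapping k-letter words.

    Each word's probability is its count divided by the total number
    of words, i.e. the normalized probability of finding that word.
    """
    words = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    total = len(words)
    if total == 0:
        return 0.0  # sequence shorter than one word
    counts = Counter(words)
    return -sum((c / total) * log2(c / total) for c in counts.values())


uniform = word_entropy("ACGT", k=1)          # four equally likely letters
repetitive = word_entropy("AAAAAAAA", k=3)   # a single repeated word
```

A repetitive sequence carries no uncertainty and scores zero, while a sequence using all words equally often scores the maximum for that word length.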