47
1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Embed Size (px)

Citation preview

Page 1: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

1

Thomas Hughes

Data Science – ITEC/CSCI/ERTH-4350/6350

Week 7, October 20, 2015

Data Analysis II and Project Definitions (Teams)

Page 2: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Contents

• Errors and some uncertainty…

• Visualization as an information tool and analysis tool

• New visualization methods (new types of data)

• Use, citation, attribution and reproducability

• Projects!2

Page 3: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Types of data

3

Page 4: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Errors• Personal errors are mistakes on the part of

the experimenter. It is your responsibility to make sure that there are no errors in recording data or performing calculations

• Systematic errors tend to decrease or increase all measurements of a quantity, (for instance all of the measurements are too large). E.g. calibration

• Random errors are also known as statistical uncertainties, and are a series of small, unknown, and uncontrollable events 4

Page 5: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Errors• Statistical uncertainties are much easier to

assign, because there are rules for estimating the size

• E.g. If you are reading a ruler, the statistical uncertainty is half of the smallest division on the ruler. Even if you are recording a digital readout, the uncertainty is half of the smallest place given. This type of error should always be recorded for any measurement

5

Page 6: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Standard measures of error• Absolute deviation

– is simply the difference between

an experimentally determined value

and the accepted value

• Relative deviation– is a more meaningful value than the absolute

deviation because it accounts for the relative size of the error. The relative percentage deviation is given by the absolute deviation divided by the accepted value and multiplied by 100%

• Standard deviation6

Page 7: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Spatial analysis of continuous fields

• Possibly more important than our answer is our confidence in the answer.

• Our confidence is quantified by uncertainties as discussed earlier.

• Once we combine numbers, we need to be able to assess how the uncertainties change for the combination.

• This is called propagation of errors or more correctly the propagation of our understanding/ estimate of errors in the result we are looking at…

7

Page 8: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Bathymetry

8

Page 9: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Cause of errors?

9

Page 10: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Resolution

10

Page 11: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Reliability• Changes in data over time• Non-uniform coverage• Map scales• Observation density• Sampling theorem (aliasing)• Surrogate data and their relevance• Round-off errors in

computers

11

Page 12: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Propagating errors• This is an unfortunate term – it means making

sure that the result of the analysis carries with it a calculation (rather than an estimate) of the error

• E.g. if C=A+B (your analysis), then ∂C=∂A+∂B

• E.g. if C=A-B (your analysis), then ∂C=∂A+∂B!

• Exercise – it’s not as simple for other calcs.

• When the function is not merely addition, subtraction, multiplication, or division, the error propagation must be defined by the total derivative of the function.

12

Page 13: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Error propagation• Errors arise from data quality, model quality

and data/model interaction.

• We need to know the sources of the errors and how they propagate through our model.

• Simplest representation of errors is to treat observations/attributes as statistical data – use mean and standard deviation.

13

Page 14: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Analytic approaches

14

Addition and subtraction

Page 15: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Multiply, divide, exponent, log

15

Page 16: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Parametric statistical ‘tests’• F-test: test if two distributions with the same

mean are the same or different based on their variances and degrees of freedom.

• T-test: test if two distributions with different means are the same or different based on their variances and degrees of freedom

16

Page 17: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

F-test

17

F = S12 / S2

2

where S1 and S2 are the

sample variances.

The more this ratio deviates from 1, the stronger the evidence for unequal population variances.

Page 18: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

T-test

18

Page 19: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Variability

19

Page 20: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Dealing with errors• In analyses:

– report on the statistical properties– does it pass tests at some confidence level?

• On maps:– exclude data that are not reliable (map only

subset of data)– show additional map of some measure of

confidence

20

Page 21: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Elevation map

21

meters

Page 22: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Larger errors ‘whited out’

22

m

Page 23: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Elevation errors

23

meters

Page 24: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Types of analysis• Preliminary

• Detailed

• Summary

• Reporting the results and propagating uncertainty

• Qualitative v. quantitative, e.g. see http://hsc.uwe.ac.uk/dataanalysis/index.asp

24

Page 25: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

What is preliminary analysis?• Self-explanatory…?

• We’ve discussed the sampling issue

• The more measurements that can be made of a quantity, the better the result – Reproducibility is an axiom of science

• When time is involved, e.g. a signal – the ‘sampling theorem’ – having an idea of the hypothesis is useful, e.g. periodic versus aperiodic or other…

• http://en.wikipedia.org/wiki/Nyquist–Shannon_sampling_theorem

25

Page 26: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Detailed analysis• Most important distinction between initial and

the main analysis is that during initial data analysis it refrains from any analysis.

• Basic statistics of important variables– Scatter plots– Correlations– Cross-tabulations

• Dealing with quality, bias, uncertainty, accuracy, precision limitations - assessing

• Dealing with under- or over-sampling

• Filtering, cleaning26

Page 27: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Summary analysis• Collecting the results and accompanying

documentation

• Repeating the analysis (yes, it’s obvious)

• Repeating with a subset

• Assessing significance, e.g. the confusion matrix we used in the supervised classification example for data mining, p-values (null hypothesis probability)

27

Page 28: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Reporting results/ uncertainty• Consider the number of significant digits in

the result which is indicative of the certainty of the result

• Number of significant digits depends on the measuring equipment you use and the precision of the measuring process - do not report digits beyond what was recorded

• The number of significant digits in a value infers the precision of that value

28

Page 29: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Reporting results…• In calculations, it is important to keep enough

digits to avoid round off error.

• In general, keep at least one more digit than is significant in calculations to avoid round off error

• It is not necessary to round every intermediate result in a series of calculations, but it is very important to round your final result to the correct number of significant digits.  

29

Page 30: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Uncertainty• Results are usually reported as result ±

uncertainty (or error)

• The uncertainty is given to one significant digit, and the result is rounded to that place

• For example, a result might be reported as 12.7 ± 0.4 m/s2. A more precise result would be reported as 12.745 ± 0.004 m/s2. A result should not be reported as 12.70361 ± 0.2 m/s2

• Units are very important to any result30

Page 31: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Secondary analysis• Depending on where you are in the data

analysis pipeline (i.e. do you know?)

• Having a clear enough awareness of what has been done to the data (either by you or others) prior to the next analysis step is very important – it is very similar to sampling bias

• Read the metadata (or create it) and documentation 31

Page 32: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Tools• 4GL

– Matlab– IDL– Ferret– NCL– Many others

• Statistics– SPSS– Gnu R

• Excel

• What have you used?32

Page 33: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Considerations for viz. as analysis

• What is the improvement in the understanding of the data as compared to the situation without visualization?

• Which visualization techniques are suitable for one's data? – E.g. Are direct volume rendering techniques to be

preferred over surface rendering techniques?

33

Page 34: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Why visualization?• Reducing amount of data, quantization

• Patterns

• Features

• Events

• Trends

• Irregularities

• Leading to presentation of data, i.e. information products

• Exit points for analysis34

Page 35: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Types of visualization• Color coding (including false color)

• Classification of techniques is based on– Dimensionality– Information being sought, i.e. purpose

• Line plots

• Contours

• Surface rendering techniques

• Volume rendering techniques

• Animation techniques

• Non-realistic, including ‘cartoon/ artist’ style35

Page 36: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Compression (any format)• Lossless compression methods are methods for

which the original, uncompressed data can be recovered exactly. Examples of this category are the Run Length Encoding, and the Lempel-Ziv Welch algorithm.

• Lossy methods - in contrast to lossless compression, the original data cannot be recovered exactly after a lossy compression of the data. An example of this category is the Color Cell Compression method.

• Lossy compression techniques can reach reduction rates of 0.9, whereas lossless compression techniques normally have a maximum reduction rate of 0.5.

36

Page 37: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Remember - metadata• Many of these formats already contain

metadata or fields for metadata, use them!

37

Page 38: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Tools• Conversion

– Imtools– GraphicConverter– Gnu convert– Many more

• Combination/Visualization– IDV– Matlab– Gnuplot– http://disc.sci.gsfc.nasa.gov/giovanni

38

Page 39: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

New modes• http://www.actoncopenhagen.decc.gov.uk/

content/en/embeds/flash/4-degrees-large-map-final

• http://www.smashingmagazine.com/2007/08/02/data-visualization-modern-approaches/

• Many modes: – http://www.siggraph.org/education/materials/

HyperVis/domik/folien.html

39

Page 40: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Periodic table

40

Page 41: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Managing visualization products

• The importance of a ‘self-describing’ product

• Visualization products are not just consumed by people

• How many images, graphics files do you have on your computer for which the origin, purpose, use is still known?

• How are these logically organized?

41

Page 42: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

(Class 2) Management• Creation of logical collections

• Physical data handling

• Interoperability support

• Security support

• Data ownership

• Metadata collection, management and access.

• Persistence

• Knowledge and information discovery

• Data dissemination and publication 42

Page 43: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Use, citation, attribution• Think about and implement a way for others

(including you) to easily use, cite, attribute any analysis or visualization you develop

• This must include suitable connections to the underlying (aka backbone) data – and note this may not just be the full data set!

• Naming, logical organization, etc. are key

• Make them a resource, e.g. URI/ URL

• See http://commons.esipfed.org/node/308 43

Page 44: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Producability/ reproducability• The documentation around procedures used

in the analysis and visualization are very often neglected – DO NOT make this mistake

• Treat this just like a data collection (or generation) exercise

• Follow your management plan

• Despite the lack or minimal metadata/ metainformation standards, capture and record it

• Get someone else to verify that it works 44

Page 45: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Summary• Purpose of analysis should drive the type that

is conducted

• Many constraints due to prior management of the data

• Become proficient in a variety of methods, tools

• Many considerations around visualization, similar to analysis, many new modes of viz.

• Management of the products is a significant task 45

Page 46: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Reading• Note reading for week 7 – data sources for

project definitions– There is a lot of material to review

• Assignment 3 and 4!

• Note – for week 8 (Oct. 27)– Brief Introduction to Data Mining– Longer Introduction to Data Mining and slide sets– Software resources list– Example: Data Mining 46

Page 47: 1 Thomas Hughes Data Science – ITEC/CSCI/ERTH-4350/6350 Week 7, October 20, 2015 Data Analysis II and Project Definitions (Teams)

Project Teams1: Anubha, Alexandra, Ridwan, Taha, William, Yufei

2: Yushi, Gina, Nikhil, Jessica, John A.

3: Devon, Chetan, Rini, Feifei, Chris P.

4: Yuying, Wissal, Xuan, Mark, Coulter

5: Vince, John M., Sowrabbi, Sixiang, Binghui

6: Rahul, Zhuoyl, Patrick, Katie, Fangyan

7: Weihang, Sisira, Ying, Jocelyn, Nick

8: Yue, Chris H., Damian, Uttam, Tom

Anyone missing?

47