MATLAB Data Analysis

Greg Reese, Ph.D
Research Computing Support Group
Academic Technology Services
Miami University

© 2010-2013 Greg Reese. All rights reserved

Data analysis

MATLAB has functions for the basic statistical analysis of numbers stored in a vector. The table that follows shows some of them. For more details, type

help datafun

at the command line.

Basic statistics

    max      Largest component.
    min      Smallest component.
    mean     Average or mean value.
    median   Median value.
    std      Standard deviation.
    var      Variance.
    sum      Sum of elements.
    prod     Product of elements.
    hist     Histogram.

Example
Class’s quiz grades:

2, 9, 8, 5, 4, 5, 8, 10, 8, 7

Store the grades in a vector and compute the average quiz score:
>> grades = [2 9 8 5 4 5 8 10 8 7];
>> mean(grades)
ans =
    6.6000

Try It
Make a vector of a class’s quiz grades: 2, 9, 8, 5, 4, 5, 8, 10, 8, 7

• Compute the mean, minimum, maximum, median, and mode
• Show the number of grades

>> grades = [ 2 9 8 5 4 5 8 10 8 7 ];
>> mean(grades)
ans =
    6.6000
>> min(grades)
ans =
     2
>> max(grades)
ans =
    10
>> median(grades)
ans =
    7.5000
>> mode(grades)
ans =
     8
>> length( grades )   % number of grades
ans =
    10

Statistics can also be computed on matrices. For a two-dimensional matrix, MATLAB operates on each column separately and returns a row vector whose length is the number of columns in the original matrix.

Example
Class’s quiz grades on one Friday (column 1) and on the following Friday (column 2):
>> grades = [2 9 8 5 4 8 10; 4 9 9 2 1 4 6]'
grades =
     2     4
     9     9
     8     9
     5     2
     4     1
     8     4
    10     6
>> mean( grades )
ans =
    6.5714    5.0000
>> min( grades )
ans =
     2     1
>> max( grades )
ans =
    10     9

There are two ways to compute a statistic on all of the data. The first is to apply the function twice.

Example
>> mean( mean( grades ) )
ans =
    5.7857
>> min( min( grades ) )
ans =
     1
>> max( max( grades ) )
ans =
    10

The second way is to convert the matrix to 1D, then compute the statistic

If M is a matrix (of any dimension), M(:) produces a one-dimensional column vector

– Both have the same number of elements
– M(:) is made by stacking the columns, i.e., concatenating the second column under the first, the third column under the second, etc.

Example
>> m1 = [ 1 2 3; 4 5 6]
m1 =
     1     2     3
     4     5     6
>> m2 = m1(:)
m2 =
     1
     4
     2
     5
     3
     6

Example
>> mean( m1 )
ans =
    2.5000    3.5000    4.5000
>> mean( m1(:) )
ans =
    3.5000
>> mean( mean(m1) )
ans =
    3.5000

Try It
>> m1 = [ 1 2 3; 4 5 6];

Compute the minimum and maximum of all elements in m1 using the m1(:) notation
>> min( m1(:) )
ans =
     1
>> max( m1(:) )
ans =
     6

Caution - depending on the statistic, the two methods may not give the same result

Try It
>> m1 = [ 1 2 3; 4 5 6];

Compute the standard deviation both ways using std()
>> std( std( m1 ) )
ans =
     0
>> std( m1(:) )
ans =
    1.8708

What's up?
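One way to see why the two answers differ: std() on a matrix works column by column, and every column of m1 has the same spread, so the outer std() is taken over three equal numbers. The sketch below uses only std() and the m1 from the slide; the variable names colStd, nested, and overall are made up for illustration.

```matlab
m1 = [ 1 2 3; 4 5 6 ];

% Column-by-column standard deviations. Each column holds two values
% that are 3 apart, so all three columns have the same spread.
colStd = std( m1 );        % 1-by-3 row vector with equal entries

% The standard deviation of (nearly) identical numbers is zero.
nested = std( colStd );

% Flattening first measures the spread of all six values together.
overall = std( m1(:) );
```

The nested call measures how much the column spreads differ from one another, not the spread of the data. Nesting is safe for mean, min, and max on a full matrix; for std and median it generally is not.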

Can also take statistics of each row. For functions such as mean() and median(), pass the dimension "2" as the second parameter, e.g.

Example
>> m1
m1 =
     1     2     3
     4     5     6
>> mean( m1, 2 )
ans =
     2
     5

Try It
Compute the median of each row of
>> m = [ 3:3:15; 1:5 ]
m =
     3     6     9    12    15
     1     2     3     4     5

using the function median()
>> median( m, 2 )
ans =
     9
     3
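Not every statistics function takes the dimension as its second argument. The sketch below, using the same m, shows the argument position for max(), min(), and std(); the variable names are made up for illustration.

```matlab
m = [ 3:3:15; 1:5 ];

% mean() and median() take the dimension as the second argument.
rowMedian = median( m, 2 );

% max() and min() use the second argument for a second array to compare
% against, so pass [] there and give the dimension third.
rowMax = max( m, [], 2 );
rowMin = min( m, [], 2 );

% std() uses the second argument for the normalization flag
% (0 selects the default N-1 normalization).
rowStd = std( m, 0, 2 );

disp( [ rowMin rowMedian rowMax rowStd ] )
```

Writing max( m, 2 ) by mistake does not raise an error. It compares every element of m with the number 2, which is easy to overlook.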

Basic statistics on vectors and matrices

Questions?

Sorting

To get more than one of the smallest or largest numbers, you must sort first.

sort( data, direction )
• data is a one-dimensional vector
• direction is 'ascend' or 'descend' (if omitted, uses 'ascend')

Example
Find the 2 lowest and 2 highest grades
>> grades = [ 2 9 8 5 4 5 8 10 8 7 ];
>> sortedGrades = sort( grades )
sortedGrades =
     2     4     5     5     7     8     8     8     9    10
>> sortedGrades(1:2)
ans =
     2     4
>> sortedGrades(end-1:end)
ans =
     9    10
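Sometimes it also matters where the extreme values came from. A minimal sketch, assuming the same quiz grades: the second output of sort() gives the original position of each sorted value. The names order, topTwo, and topTwoPositions are made up for illustration.

```matlab
grades = [ 2 9 8 5 4 5 8 10 8 7 ];

% The second output holds, for each sorted value, its index in the
% original vector, so grades(order) equals sortedGrades.
[ sortedGrades, order ] = sort( grades, 'descend' );

topTwo = sortedGrades(1:2);      % the two highest grades
topTwoPositions = order(1:2);    % which quizzes they came from
```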

Often have rows of data, each row representing one object. Want to sort the rows based on the values in one column (called the key). Use sortrows()

aSorted = sortrows( a, col )
• a is a matrix
• col is the column to base the sort on
  – col > 0 sorts column |col| in ascending order
  – col < 0 sorts column |col| in descending order

Try It
% col 1=ID, col 2=graduation year
% col 3=money donated
>> alumni = [ 7885 2008 5;...
              1202 1972 22900;...
              4580 2000 350000 ];

Row 1: recent grad, can only afford $5
Row 2: old guy, gives a little each year
Row 3: hit it rich in less than 10 years

Try It
% sort by ID, lowest first
>> sortrows( alumni, 1 )
ans =
        1202        1972       22900
        4580        2000      350000
        7885        2008           5

Example
% sort by $, highest first
>> sortrows( alumni, -3 )
ans =
        4580        2000      350000
        1202        1972       22900
        7885        2008           5
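Once the rows are sorted, individual fields can be read off the result. The sketch below reuses the alumni matrix; it also uses the second output of sortrows(), which reports the original row order. The names byDonation, topDonorID, and order are made up for illustration.

```matlab
% col 1 = ID, col 2 = graduation year, col 3 = money donated
alumni = [ 7885 2008 5; ...
           1202 1972 22900; ...
           4580 2000 350000 ];

% Largest donation first, then read the ID from the first row.
byDonation = sortrows( alumni, -3 );
topDonorID = byDonation(1,1);

% The second output lists which original row landed in each position
% when sorting by graduation year, oldest first.
[ byYear, order ] = sortrows( alumni, 2 );
```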

Coefficient of correlation

Correlation quantifies the strength of a linear relationship between two variables. When there is no correlation between the two quantities, then there is no tendency for the values of one quantity to increase or decrease with the values of the second quantity.

Coefficient of correlation (r)
– Common way to quantify correlation
– -1 ≤ r ≤ 1
– r close to 1 means two signals trend the same way
  • As one increases the other also increases
  • As one decreases the other also decreases
– r close to -1 means two signals are anticorrelated (trend the opposite way), i.e., as one increases the other decreases and vice versa
– r close to zero means there's little correlation
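To make the definition concrete, the sketch below computes Pearson's r directly from deviations about the mean and compares it with corrcoef(). The data vectors x, y, and z are invented for illustration and are not from the slides.

```matlab
% Hypothetical data: y rises with x, z falls with x.
x = [ 1 2 3 4 5 ];
y = [ 2.1 3.9 6.2 8.1 9.8 ];
z = [ 9.0 7.2 4.9 3.1 1.0 ];

% Pearson's r: sum of products of the deviations from the mean,
% normalized by the size of each set of deviations.
dx = x - mean(x);
dy = y - mean(y);
rByHand = sum( dx .* dy ) / sqrt( sum( dx.^2 ) * sum( dy.^2 ) );

% For two vectors corrcoef() returns a 2-by-2 matrix; the off-diagonal
% entry is the coefficient between them.
R = corrcoef( x, y );
rBuiltIn = R(1,2);

Rz = corrcoef( x, z );
rAnti = Rz(1,2);    % close to -1 because z trends opposite to x
```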

Try It
Real data collected by a Miami graduate

>> data = xlsread( 'fuelcons.xls' );
>> size( data )
ans =
     8     4

Column 1 – Week
Column 2 – Average weekly temperature
Column 3 – Average weekly wind chill
Column 4 – Millions of cubic feet of natural gas per week required to heat the homes and businesses in a small city

Try It
How should the amount of gas used be correlated with wind chill, and why?
– Positive correlation, because a higher wind chill means more heat is wanted, which means more gas used

How should the amount of gas used be correlated with temperature, and why?
– Negative correlation, because a lower temperature means more heat is wanted, which means more gas used

Try It
Graph columns 2-4 vs. column 1 on the same plot and determine whether the data matches your previous determination of correlation

>> plot( data(:,1), data(:,2:4) )
>> legend( 'Temperature', 'Wind chill', 'Gas' )

[Figure: Temperature, Wind chill, and Gas plotted against week (1 to 8); vertical axis from 0 to 70]

R = corrcoef(X)
• R is the matrix of correlation coefficients
  – R contains coefficients for all pairs of columns
• X is a matrix whose rows are observations (data points) and whose columns are variables

Try It
Compute all correlation coefficients using the matrix of the last three columns as input
>> corrcoef( data(:,2:4) )
ans =
    1.0000   -0.7182   -0.9484
   -0.7182    1.0000    0.8706
   -0.9484    0.8706    1.0000

Rows and columns are in the order temperature (column 2), wind chill (column 3), gas (column 4):
– ans(1,1) = 1.0000 is the correlation of temperature with itself
– ans(1,2) = -0.7182 is the correlation of temperature with wind chill
– ans(1,3) = -0.9484 is the correlation of temperature with gas
– ans(2,3) = 0.8706 is the correlation of wind chill with gas

Term "highly correlated" is common but not precisely defined. A conventional meaning* is something like this, using the magnitude |r| so that negative correlations are covered too:

– no or negligible correlation: 0.0 ≤ |r| < 0.2
– low correlation: 0.2 ≤ |r| < 0.4
– moderate correlation: 0.4 ≤ |r| < 0.6
– marked correlation: 0.6 ≤ |r| < 0.8
– high correlation: 0.8 ≤ |r| ≤ 1.0

* Franzblau, Abraham Norman. A Primer of Statistics for Non-Statisticians. New York: Harcourt, Brace & World, 1958.

To refine the correlation analysis, often compute the p-value:
– The probability of getting a correlation as large as the observed value by chance, when the true correlation is zero

If p is small, we say the correlation is significant
– Conventionally say that correlations with p-value less than 0.05 are significant
– Often see results written as "r value of 0.34 with p < 0.05"
– In words, "p < 0.05" means "there is less than a 5% chance that a truly uncorrelated data set would produce the given correlation coefficient"

To compute p-values, use

[ r p ] = corrcoef( X )
• X is matrix of data (as before)
• r is matrix of correlation coefficients (as before)
• p is matrix of corresponding p-values, i.e., p(i,j) is the p-value for r(i,j)

Try It
Compute correlation coefficients and p-values for fuelcons.xls data
>> [ r p ] = corrcoef( data(:,2:4) )
r =
    1.0000   -0.7182   -0.9484
   -0.7182    1.0000    0.8706
   -0.9484    0.8706    1.0000
p =
    1.0000    0.0448    0.0003
    0.0448    1.0000    0.0049
    0.0003    0.0049    1.0000

Follow Along
Which of the three coefficients of correlation are significant?
1. Values below main diagonal are same as those above
2. Values on main diagonal always one§
3. Therefore, want to look at only values above main diagonal

§ The definition of the p-value is "the probability of getting a correlation as large as the observed value by chance, when the true correlation is zero". p(1,1) comes from data that is correlated with itself and so its true correlation is 1, not zero. p(1,1) is always 1 and in fact, the main diagonal of the p-value matrix is always all ones

Follow Along
Which of the three coefficients of correlation are significant and what are their values?
>> aboveDiagonal = triu( ones(size(p)), 1 )
aboveDiagonal =
     0     1     1
     0     0     1
     0     0     0
>> significantLocations = p<0.05 & aboveDiagonal == 1
significantLocations =
     0     1     1
     0     0     1
     0     0     0
>> p(significantLocations)
ans =
    0.0448
    0.0003
    0.0049

MATLAB function triu(M,k) returns M with values below kth diagonal set to zero, those on kth diagonal and above kept as is

– k = 0 is main diagonal
>> triu( p, 0 )
ans =
    1.0000    0.0448    0.0003
         0    1.0000    0.0049
         0         0    1.0000
>> triu( p, 1 )
ans =
         0    0.0448    0.0003
         0         0    0.0049
         0         0         0

Want to test nonzero values of triu(p,1)

Can't just see which elements of the triu() matrix are < 0.05, because triu() substitutes zeros, which also pass the test. The zeros can't simply be excluded either, because a genuine p-value can be zero
>> upperTriangle = triu( p, 1 )
upperTriangle =
         0    0.0448    0.0003
         0         0    0.0049
         0         0         0
>> upperTriangle < 0.05
ans =
     1     1     1
     1     1     1
     1     1     1

Need to find p-values that are < 0.05 and above main diagonal

1. Mark elements above diagonal
>> aboveDiagonal = triu( ones(size(p)), 1 )
aboveDiagonal =
     0     1     1
     0     0     1
     0     0     0

2. Mark elements also < 0.05
>> significant = p<0.05 & aboveDiagonal == 1
significant =
     0     1     1
     0     0     1
     0     0     0
>> p(significant)
ans =
    0.0448
    0.0003
    0.0049
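A slightly more compact variant of the same idea is sketched below. It builds the mask from true() instead of ones(), so no "== 1" comparison is needed, and uses find() to report which pairs of variables are significant. The p-values are copied from the example; the names row, col, and pairs are made up for illustration.

```matlab
% p-values copied from the fuelcons.xls example
p = [ 1.0000 0.0448 0.0003; ...
      0.0448 1.0000 0.0049; ...
      0.0003 0.0049 1.0000 ];

% true(size(p)) is a logical matrix, so triu() returns a logical mask.
aboveDiagonal = triu( true(size(p)), 1 );
significant = ( p < 0.05 ) & aboveDiagonal;

% find() gives the row and column of every marked element, in the same
% column-by-column order that p(significant) uses.
[ row, col ] = find( significant );
pairs = [ row col p(significant) ];
```

Each row of pairs reads "variable i, variable j, p-value", which keeps the p-values tied to the variables they describe.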

Data analysis

MATLAB has other analysis functions
• Histograms
• Cumulative products
• Finite differences
• Fourier transforms

For more information, type help datafun

Sorting and correlation

Questions?

Polynomial fitting

Find polynomial of specified degree that matches data with least error

Example
Make 100 data points of a constant drawn from a uniform random distribution on [0,5), with additive Gaussian noise of zero mean and standard deviation 4 (variance 16)

>> yIntercept = 5 * rand(1,1)
yIntercept =
    3.2980
>> data1 = yIntercept * ones(1,100);
>> data1 = data1 + 4*randn(1,100);
>> plot( data1 )

[Figure: plot of data1 versus sample index (0 to 100); vertical axis from -10 to 15]

P = POLYFIT(X,Y,N) finds the coefficients of a polynomial P(X) of degree N that fits the data Y best in a least-squares sense. P is a row vector of length N+1 containing the polynomial coefficients in descending powers,

P(1)*X^N + P(2)*X^(N-1) +...+ P(N)*X + P(N+1).

Example
Find polynomial of degree 0 that matches data with least error
>> x = 1:100;
>> poly1 = polyfit( x, data1, 0 );
>> y1 = poly1(1) + zeros(1,100);
>> plot( x, data1, 'r-', x, y1, 'b-' );
>> poly1
poly1 =
    2.9677

(The data was created with the constant 3.2980)

Example
Result (blue line) does look like it's right in the middle, i.e., it's the mean

[Figure: data1 (red) with the degree-0 fit (blue line); horizontal axis 0 to 100, vertical axis -10 to 15]
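The claim that the degree-0 fit is the mean can be checked numerically. The sketch below generates its own noisy constant (the value 3 is an arbitrary stand-in, not the value from the slides) and compares polyfit() with mean().

```matlab
% Hypothetical noisy constant, built the same way as data1
x = 1:100;
data1 = 3 + 4*randn(1,100);

% A degree-0 polynomial is a single constant c. Least squares picks the
% c that minimizes sum( (data1 - c).^2 ), and that c is the mean.
c = polyfit( x, data1, 0 );

difference = abs( c - mean(data1) );   % expect only round-off error
```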

Example
Make 100 data points of a line with y-intercept drawn from a uniform random distribution on [0,5), slope drawn from a uniform random distribution on (-1,0], and additive Gaussian noise of zero mean, variance 1

>> yIntercept = 5 * rand(1,1)
yIntercept =
    2.0381
>> slope = -rand(1,1)
slope =
   -0.8200
>> data2 = slope*(1:100)+yIntercept;
>> data2 = data2 + randn(1,100);
>> plot( data2 );

[Figure: plot of data2 versus sample index (0 to 100); vertical axis from -90 to 10]

Example
Find polynomial of degree 1 that matches data with least error
>> x = 1:100;
>> poly2 = polyfit( x, data2, 1 );
>> y2 = poly2(1)*(1:100)+poly2(2);
>> plot( x, data2, 'r-', x, y2, 'b-' );
>> poly2
poly2 =
   -0.8212    1.9858

(The data was created with slope -0.8200 and y-intercept 2.0381)

[Figure: data2 (red) with the degree-1 fit (blue line); horizontal axis 0 to 100, vertical axis -90 to 10]

Example
Suppose you want to compare what remains of the data after removing the output of the model

If the model is constant or linear, can get rid of its effect by detrending

Y = detrend( X, trend_type )
– X is data vector
– trend_type is 'constant' or 'linear'

removes the best fit straight-line or constant vector from X and returns the residual in vector Y
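A minimal sketch of detrending both kinds of data follows. The two data sets are hypothetical stand-ins built the same way as data1 and data2 (the constants 3, -0.8, and 2 are invented), and the names resid1 and resid2 are made up for illustration.

```matlab
% Hypothetical stand-ins for the two data sets
n = 1:100;
data1 = 3 + 4*randn(1,100);          % constant plus noise
data2 = -0.8*n + 2 + randn(1,100);   % line plus noise

resid1 = detrend( data1, 'constant' );   % subtracts the mean
resid2 = detrend( data2, 'linear' );     % subtracts the best-fit line

% With the models removed, the residuals can share one plot.
plot( n, resid1, 'r-', n, resid2, 'b-' )
legend( 'data1 residual', 'data2 residual' )
```

For a linear trend the result should match subtracting polyval( polyfit( n, data2, 1 ), n ) from data2, up to round-off.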

Example
Now can see that the noise in the first data set is much larger than the noise in the second

[Figure: detrended data; horizontal axis 0 to 100, vertical axis -15 to 10]

Previously, evaluated best-fit polynomial "by hand", i.e., explicitly multiplying the polynomial coefficients by powers of the independent variable
>> x = 1:100;
>> poly2 = polyfit( x, data2, 1 );
>> y2 = poly2(1)*(1:100)+poly2(2);

For higher powers, this is clumsy to code and inefficient to run

A more general way to evaluate a polynomial at various points is to use

y = polyval(p,x)
– returns value of polynomial of degree n evaluated at x
– p is a vector of length n+1 whose elements are the coefficients in descending powers of the polynomial to be evaluated
  • NOTE - p is the same as the output from polyfit()!
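The sketch below compares the two approaches on a small cubic. The coefficients and evaluation points are invented for illustration so that the arithmetic is easy to check.

```matlab
% Hypothetical cubic: 2x^3 - x^2 + 0x + 5, coefficients in descending powers
p = [ 2 -1 0 5 ];
x = [ -1 0 1 2 ];

% "By hand": one explicit term per coefficient
yByHand = p(1)*x.^3 + p(2)*x.^2 + p(3)*x + p(4);

% polyval() uses the same call for a polynomial of any degree
yPolyval = polyval( p, x );
```

Changing the degree only changes the length of p; the polyval() call stays the same, while the by-hand expression would have to be rewritten.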

Try It
Make an input of ½ cycle of a sinusoid and a noisy input equal to the previous input with additive Gaussian noise of 0 mean, ½ standard deviation. On one graph, plot input, noisy input, and best fit cubic

>> x = 0:pi/100:pi;
>> input = sin(x);
>> noisyInput = input + 0.5 * randn( size(input) );
>> p = polyfit( x, noisyInput, 3 );
>> bestFit = polyval( p, x );
>> plot( x, [ input; noisyInput;...
             bestFit ] )
>> legend( 'Input', 'Noisy input',...
           'Best fit cubic' );

[Figure: Input, Noisy input, and Best fit cubic versus x (0 to 3.5); vertical axis from -1 to 2]

Be careful of extrapolating your model too far! The further past the range of your input independent values you go, the more likely the model is to be wrong.

Example
Repeat previous example but 1) let input go from 0 to 2π; 2) fit model only from 0 to π; 3) plot all three from 0 to 2π
>> x = 0:pi/100:2*pi;
>> input = sin(x);
>> noisyInput = input + ...
       0.5 * randn( size(input) );
>> midIx = round( length(x)/2 );
>> p = polyfit( x(1:midIx), ...
       noisyInput(1:midIx), 3 );
>> bestFit = polyval( p, x );
>> plot( x, [ input; noisyInput; bestFit ] )
>> legend( 'Input', 'Noisy input', 'Best fit cubic' );

[Figure: Input, Noisy input, and Best fit cubic versus x (0 to 7); vertical axis from -10 to 2. Callout: bad fit in extrapolation]

Sometimes may want to know what value of independent variable produces a specified value of the model, i.e., if y(x) is the polynomial model, for what x is y(x) = y* ?

– If h(x) is the altitude of a rocket x miles from its launch site, where does the rocket land, i.e., for what x is h(x) = 0?
– If v(t) is the voltage across a capacitor at time t, when does the voltage reach 10V ?

polyfit( x, y, n ) returns a vector p of n+1 coefficients such that the best fit polynomial is

$$y(x) = p(1)x^n + p(2)x^{n-1} + \cdots + p(n)x + p(n+1)$$

We're given the constant y* and want to know what x (or x's) produces y(x) = y*

$$y^* = p(1)x^n + p(2)x^{n-1} + \cdots + p(n)x + p(n+1)$$

$$p(1)x^n + p(2)x^{n-1} + \cdots + p(n)x + p(n+1) - y^* = 0$$

So the x's we're looking for are the roots of the polynomial y(x) - y*, i.e., the polynomial whose constant term is p(n+1) - y*

To find the roots of a polynomial use the MATLAB function

r = roots( p )
• p is a polynomial defined as before
• r is a column vector

Note that roots can be complex. To find roots that have non-zero imaginary parts, i.e., that have complex values, use the expression imag(r) ~= 0

(See help on isreal() for details of why this works)
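Putting the pieces together, the sketch below solves p(x) = y* for a small polynomial whose answer is easy to verify by hand. The polynomial x^3 + x and the target value 10 are invented for illustration; q and realRoots are made-up names.

```matlab
p = [ 1 0 1 0 ];     % hypothetical model p(x) = x^3 + x
yStar = 10;          % want the x for which p(x) = 10

% Subtract y* from the constant term to get the coefficients of
% p(x) - y*, working on a copy so that p itself is left unchanged.
q = p;
q(end) = q(end) - yStar;
r = roots( q );      % one real root and a complex-conjugate pair

% imag(r) == 0 is an element-by-element test, so it can be used as a
% logical index to keep only the real roots.
realRoots = r( imag(r) == 0 );
```

Here x = 2 is the real solution, since 2^3 + 2 = 10. Working on a copy is a design choice: the original coefficients stay available for polyval().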

Example
The amount of material left after radioactive decay is given by

$$A(t) = A_0 \, e^{-\frac{t \ln 2}{T_{1/2}}}$$

– A(t) is amount remaining at time t
– A0 is amount at t = 0
– T1/2 (called the half-life) is time for amount to be cut in half from amount at start
  • At t = T1/2 the amount of material is A0 / 2
  • At t = 2T1/2 the amount of material is A0 / 4
  • At t = 3T1/2 the amount of material is A0 / 8
  • etc.

Example
Assume that there are initially 50 grams of a substance whose half-life is 5 hours. Simulate measurements of the decay by using the radioactive decay equation and adding zero-mean Gaussian noise with a standard deviation of 1. Make data points for every quarter hour of one day, starting at time zero.
1. Fit a cubic polynomial to the data and plot the fit and the noisy data
2. Determine when the fit predicts there will be 6.25 grams left
3. Compare this to the correct answer

Follow Along
For A0 = 50 and T1/2 = 5 the equation becomes

$$A(t) = 50 \, e^{-\frac{t \ln 2}{5}}$$

• Simulate the measurements
>> t = 0:0.25:23.75;
>> decay = 50*exp( -t*log(2)/5 ) + randn( size(t) );

• Find the best-fit cubic
>> p = polyfit( t, decay, 3 );
>> bestFit = polyval( p, t );

• Plot both
>> plot( t, [ decay; bestFit ] )

[Figure: noisy decay data and best-fit cubic versus t (0 to 25 hours); vertical axis from 0 to 50]

Follow Along
• Find the roots of p(t) = 6.25
>> p(end) = p(end) - 6.25;
>> r = roots( p )
r =
  22.5719 + 9.8810i
  22.5719 - 9.8810i
  14.8484
• Eliminate all complex roots (answer must be real)
>> r( imag(r) ~= 0 ) = [];
>> r
r =
   14.8484

Follow Along
To compare answers, note that 6.25 g = 50/8 g, and the decay equation says this happens in three half-lives, i.e., in 15 hours. The best fit predicts 14.85 hours, which is quite close
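The exact time can also be computed by solving the decay equation for t, which gives a number to compare against the fitted root. In the sketch below the fitted value 14.8484 is copied from the output above; the names tExact, tFit, and relativeError are made up for illustration.

```matlab
A0 = 50;          % initial amount, grams
halfLife = 5;     % hours
target = 6.25;    % grams

% Solve A0*exp( -t*log(2)/halfLife ) = target for t
tExact = halfLife * log( A0/target ) / log(2);

% Relative error of the prediction from the cubic fit
tFit = 14.8484;
relativeError = abs( tFit - tExact ) / tExact;
```

The relative error comes out at about 1%. A different random noise sample would give a slightly different fitted root, so the exact figure will vary from run to run.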


Polynomial fitting

Questions?


The End
