MATLAB
Data Analysis
Greg Reese, Ph.D
Research Computing Support Group
Academic Technology Services
Miami University
© 2010-2013 Greg Reese. All rights reserved
Data analysis
MATLAB has functions for the basic statistical analysis of numbers stored in a vector. The table that follows shows some of them. For more details, type
help datafun
at the command line.
Basic statistics
max Largest component.
min Smallest component.
mean Average or mean value.
median Median value.
std Standard deviation.
var Variance.
sum Sum of elements.
prod Product of elements.
hist Histogram.
Example
Class's quiz grades:
2, 9, 8, 5, 4, 5, 8, 10, 8, 7
Store the grades in a vector and compute the average quiz score:
>> grades = [2 9 8 5 4 5 8 10 8 7];
>> mean(grades)
ans = 6.6000
Try It
Make a vector of a class's quiz grades: 2, 9, 8, 5, 4, 5, 8, 10, 8, 7
• Compute the mean, minimum, maximum, median, and mode
• Show the number of grades
Try It
>> grades = [ 2 9 8 5 4 5 8 10 8 7 ];
>> mean(grades)
ans = 6.6000
>> min(grades)
ans = 2
>> max(grades)
ans = 10
>> median(grades)
ans = 7.5000
>> mode(grades)
ans = 8
>> length( grades )   % number of grades
ans = 10
Statistics can also be computed on matrices. For two-dimensional matrices, MATLAB operates on each column separately. This produces a row vector whose length is the number of columns in the original matrix.
Example
Class's quiz grades on one Friday (column 1) and on the following Friday (column 2):
>> grades = [2 9 8 5 4 8 10; 4 9 9 2 1 4 6]'
grades =
     2     4
     9     9
     8     9
     5     2
     4     1
     8     4
    10     6
Example
>> mean( grades )
ans =
    6.5714    5.0000
>> min( grades )
ans =
     2     1
>> max( grades )
ans =
    10     9
There are two ways to compute a statistic on all of the data. In the first, apply the function twice.
Example
>> mean( mean( grades ) )
ans = 5.7857
>> min( min( grades ) )
ans = 1
>> max( max( grades ) )
ans = 10
The second way is to convert the matrix to 1D, then compute the statistic.
If M is a matrix (of any dimension), M(:) produces a one-dimensional column vector
– Both have the same number of elements
– M(:) is made by stacking the columns, i.e., concatenating the second column under the first, the third column under the second, etc.
Example
>> m1 = [ 1 2 3; 4 5 6]
m1 =
     1     2     3
     4     5     6
>> m2 = m1(:)
m2 =
     1
     4
     2
     5
     3
     6
Example
>> m1 = [ 1 2 3; 4 5 6]
m1 =
     1     2     3
     4     5     6
>> mean( m1 )
ans =
    2.5000    3.5000    4.5000
>> mean( m1(:) )
ans = 3.5000
>> mean( mean(m1) )
ans = 3.5000
Try It
>> m1 = [ 1 2 3; 4 5 6];
Compute the minimum and maximum of all elements in m1 using the m1(:) notation
>> min( m1(:) )
ans = 1
>> max( m1(:) )
ans = 6
Caution: depending on the statistic, the two methods may not give the same result
Try It
>> m1 = [ 1 2 3; 4 5 6];
Compute the standard deviation both ways using std()
>> std( std( m1 ) )
ans = 0
>> std( m1(:) )
ans = 1.8708
What's going on?
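One way to see why the two results differ is to look at the intermediate value that the nested call produces. The sketch below uses the same m1; the variable names colStd, nested, and overall are introduced here for illustration only.

```matlab
m1 = [ 1 2 3; 4 5 6 ];

% std() works on each column separately, so this is a 1x3 row vector
colStd = std( m1 );

% each column holds two values that are 3 apart, so all three column
% results are identical, and the spread of identical numbers is zero
nested = std( colStd );

% treating all six elements as one data set gives the overall spread
overall = std( m1(:) );
```

Nesting gives the same answer as the M(:) form for min, max, sum, and (on a full matrix) mean, but not for std, var, or median.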
Statistics of each row can be computed by passing a second parameter of 2, e.g.
Example
>> m1
m1 =
     1     2     3
     4     5     6
>> mean( m1, 2 )
ans =
     2
     5
Try It
Compute the median of each row of
>> m = [ 3:3:15; 1:5 ]
m =
     3     6     9    12    15
     1     2     3     4     5
using the function median()
>> median( m, 2 )
ans =
     9
     3
Basic statistics on vectors and matrices
Questions?
Sorting
To get more than one of the smallest or largest numbers, sort the data first.
sort( data, direction )
• data is a one-dimensional vector
• direction is 'ascend' or 'descend' (if omitted, uses 'ascend')
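As a small illustration of the direction argument, the sketch below sorts the quiz grades from highest to lowest. It also uses the optional second output of sort(), which is not covered on the slides; it reports where each sorted element sat in the original vector. The variable names are made up for this example.

```matlab
grades = [ 2 9 8 5 4 5 8 10 8 7 ];

% sort from highest to lowest
sortedDown = sort( grades, 'descend' );

% optional second output: original position of each sorted element,
% so grades(ix) reproduces sortedUp
[ sortedUp, ix ] = sort( grades );

% positions in the original vector of the two lowest grades
lowestTwoPositions = ix(1:2);
```

The index output is useful when the grades belong to, say, a parallel list of student names that has to be reordered the same way.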
Example
Find the 2 lowest and 2 highest grades
>> sortedGrades = sort( grades )
sortedGrades =
     2     4     5     5     7     8     8     8     9    10
>> sortedGrades(1:2)
ans =
     2     4
>> sortedGrades(end-1:end)
ans =
     9    10
Data is often stored in rows, each row representing one object, and the rows need to be sorted by the values in one column (called the key). Use sortrows()
aSorted = sortrows( a, col )
• a is a matrix
• col is the column to base the sort on
– col > 0 sorts by column |col| in ascending order
– col < 0 sorts by column |col| in descending order
Try It
% col 1=ID, col 2=graduation year
% col 3=money donated
>> alumni = [ 7885 2008      5;...   % recent grad, can only afford $5
              1202 1972  22900;...   % old guy, gives a little each year
              4580 2000 350000 ];    % hit it rich in less than 10 years
Try It
% sort by ID, lowest first
>> sortrows( alumni, 1 )
ans =
        1202        1972       22900
        4580        2000      350000
        7885        2008           5
Example
% sort by $, highest first
>> sortrows( alumni, -3 )
ans =
        4580        2000      350000
        1202        1972       22900
        7885        2008           5
Coefficient of correlation
Correlation quantifies the strength of a linear relationship between two variables. When there is no correlation between the two quantities, then there is no tendency for the values of one quantity to increase or decrease with the values of the second quantity.
Coefficient of correlation (r)
– Common way to quantify correlation
– -1 ≤ r ≤ 1
– r close to 1 means the two signals trend the same way
• As one increases the other also increases
• As one decreases the other also decreases
– r close to -1 means the two signals are anticorrelated (trend the opposite way), i.e., as one increases the other decreases and vice versa
– r close to zero means there's little correlation
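To make the meaning of r concrete, the sketch below computes it directly from the usual sample-correlation formula (sum of products of deviations from the mean, divided by the square root of the product of the summed squared deviations) and compares the result with corrcoef(). The data values and variable names are made up for illustration and are not from the slides.

```matlab
% made-up sample data that trends upward together
x = [ 1 2 3 4 5 ];
y = [ 2.1 3.9 6.2 7.8 10.1 ];

% deviations of each variable from its own mean
dx = x - mean( x );
dy = y - mean( y );

% correlation coefficient from the definition
rByHand = sum( dx .* dy ) / sqrt( sum( dx.^2 ) * sum( dy.^2 ) );

% same quantity from corrcoef(); columns are variables
R = corrcoef( [ x' y' ] );
rBuiltIn = R(1,2);

% flipping the sign of one variable flips the sign of r
Ranti = corrcoef( [ x' -y' ] );
rAnti = Ranti(1,2);
```

Here rByHand is close to 1 because y rises almost linearly with x, and rAnti is the same magnitude with the opposite sign.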
Try It
Real data collected by a Miami graduate
>> data = xlsread( 'fuelcons.xls' );
>> size( data )
ans =
     8     4
Column 1 – Week
Column 2 – Average weekly temperature
Column 3 – Average weekly wind chill
Column 4 – Millions of cubic feet of natural gas per week required to heat the homes and businesses in a small city
Try It
How should the amount of gas used be correlated with wind chill, and why?
– Positive correlation, because higher wind chill means more heat is desired, which means more gas is used
How should the amount of gas used be correlated with temperature, and why?
– Negative correlation, because lower temperature means more heat is desired, which means more gas is used
Try It
Graph columns 2-4 vs. column 1 on the same plot and determine whether the data matches your previous determination of correlation
>> plot( data(:,1), data(:,2:4) )
>> legend( 'Temperature', 'Wind chill', 'Gas' )
[Figure: temperature, wind chill, and gas plotted against week (1 to 8)]
Try It
R = corrcoef(X)
• R is the matrix of correlation coefficients
– R contains coefficients for all pairs of columns
• X is a matrix whose rows are observations (data points) and whose columns are variables
Compute all correlation coefficients using the matrix of the last three columns as input
Try It
>> corrcoef( data(:,2:4) )
ans =
    1.0000   -0.7182   -0.9484
   -0.7182    1.0000    0.8706
   -0.9484    0.8706    1.0000
Rows and columns are in the order temperature, wind chill, gas (columns 2, 3, and 4 of data):
– ans(1,1) = 1.0000: correlation of temperature with itself
– ans(1,2) = -0.7182: correlation of temperature with wind chill
– ans(1,3) = -0.9484: correlation of temperature with gas
– ans(2,3) = 0.8706: correlation of wind chill with gas
The term "highly correlated" is common but not precisely defined. A conventional meaning*, applied to the magnitude |r|, is something like this:
– no or negligible correlation: 0.0 ≤ |r| < 0.2
– low correlation: 0.2 ≤ |r| < 0.4
– moderate correlation: 0.4 ≤ |r| < 0.6
– marked correlation: 0.6 ≤ |r| < 0.8
– high correlation: 0.8 ≤ |r| ≤ 1.0
* Franzblau, Abraham Norman. A Primer of Statistics for Non-Statisticians. New York: Harcourt, Brace & World, 1958.
To refine the correlation analysis, we often compute the p-value:
– The probability of getting a correlation as large as the observed value by chance, when the true correlation is zero
If p is small, we say the correlation is significant
– Conventionally, correlations with a p-value less than 0.05 are called significant
– Results are often written as "r value of 0.34 with p < 0.05"
– In words, "p < 0.05" means "there is less than a 5% chance that a truly uncorrelated data set would produce a correlation coefficient at least as large as the one observed"
To compute p-values, use
[ r p ] = corrcoef( X )
• X is the matrix of data (as before)
• r is the matrix of correlation coefficients (as before)
• p is the matrix of corresponding p-values, i.e., p(i,j) is the p-value for r(i,j)
Try It
Compute correlation coefficients and p-values for the fuelcons.xls data
>> [ r p ] = corrcoef( data(:,2:4) )
r =
    1.0000   -0.7182   -0.9484
   -0.7182    1.0000    0.8706
   -0.9484    0.8706    1.0000
p =
    1.0000    0.0448    0.0003
    0.0448    1.0000    0.0049
    0.0003    0.0049    1.0000
Follow Along
Which of the three coefficients of correlation are significant?
p =
    1.0000    0.0448    0.0003
    0.0448    1.0000    0.0049
    0.0003    0.0049    1.0000
1. Values below the main diagonal are the same as those above
2. Values on the main diagonal are always one§
3. Therefore, we want to look only at values above the main diagonal
§ The definition of the p-value is "the probability of getting a correlation as large as the observed value by chance, when the true correlation is zero". p(1,1) comes from data that is correlated with itself, so its true correlation is 1, not zero. p(1,1) is always 1 and, in fact, the main diagonal of the p-value matrix is always all ones
Follow Along
Which of the three coefficients of correlation are significant, and what are their values?
p =
    1.0000    0.0448    0.0003
    0.0448    1.0000    0.0049
    0.0003    0.0049    1.0000
>> aboveDiagonal = triu( ones(size(p)), 1 )
aboveDiagonal =
     0     1     1
     0     0     1
     0     0     0
>> significantLocations = p<0.05 & aboveDiagonal == 1
significantLocations =
     0     1     1
     0     0     1
     0     0     0
>> p(significantLocations)
ans =
    0.0448
    0.0003
    0.0049
The MATLAB function triu(M,k) returns M with the values below the kth diagonal set to zero and those on the kth diagonal and above kept as is
– k = 0 is the main diagonal
>> triu( p, 0 )
ans =
    1.0000    0.0448    0.0003
         0    1.0000    0.0049
         0         0    1.0000
>> triu( p, 1 )
ans =
         0    0.0448    0.0003
         0         0    0.0049
         0         0         0
Want to test the nonzero values of triu(p,1)
We can't just test which elements of the triu() matrix are < 0.05, because triu() substitutes zeros below the diagonal and zero is also a possible p-value, so every substituted zero passes the test
>> upperTriangle = triu( p, 1 )
upperTriangle =
         0    0.0448    0.0003
         0         0    0.0049
         0         0         0
>> upperTriangle < 0.05
ans =
     1     1     1
     1     1     1
     1     1     1
Need to find p-values that are < 0.05 and above the main diagonal
1. Mark elements above the diagonal
>> aboveDiagonal = triu( ones(size(p)), 1 )
aboveDiagonal =
     0     1     1
     0     0     1
     0     0     0
2. Mark those elements that are also < 0.05
>> significant = p<0.05 & aboveDiagonal == 1
significant =
     0     1     1
     0     0     1
     0     0     0
>> p(significant)
ans =
    0.0448
    0.0003
    0.0049
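The logical mask gives the significant p-values but not which pair of variables each one belongs to. One possible extension, sketched below, uses find() with two outputs to recover the row and column of each significant entry. The p matrix is typed in from the slide output; the names row, col, and pairs are introduced for this example.

```matlab
% p-values copied from the corrcoef() output shown on the slides
p = [ 1.0000 0.0448 0.0003;
      0.0448 1.0000 0.0049;
      0.0003 0.0049 1.0000 ];

% mark elements above the main diagonal
aboveDiagonal = triu( ones( size(p) ), 1 ) == 1;

% row and column subscripts of significant entries above the diagonal
[ row, col ] = find( p < 0.05 & aboveDiagonal );

% one line per significant pair: [ row, column, p-value ]
pairs = [ row, col, p( sub2ind( size(p), row, col ) ) ];
```

Each row of pairs reads as "variable row versus variable col", so with the column order temperature, wind chill, gas, the row [1 3 0.0003] is temperature versus gas.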
Data analysis
MATLAB has other analysis functions
• Histograms
• Cumulative products
• Finite differences
• Fourier transforms
For more information, type help datafun
Sorting and correlation
Questions?
Polynomial fitting
Find the polynomial of a specified degree that matches the data with least error
Example
Make 100 data points equal to a constant drawn uniformly at random from [0,5), with additive Gaussian noise of zero mean and standard deviation 4 (variance 16)
>> yIntercept = 5 * rand(1,1)
yIntercept = 3.2980
>> data1 = yIntercept * ones(1,100);
>> data1 = data1 + 4*randn(1,100);
>> plot( data1 )
Example
[Figure: plot of data1, 100 noisy points versus sample number]
P = POLYFIT(X,Y,N) finds the coefficients of a polynomial P(X) of degree N that fits the data Y best in a least-squares sense. P is a row vector of length N+1 containing the polynomial coefficients in descending powers,
P(1)*X^N + P(2)*X^(N-1) +...+ P(N)*X + P(N+1).
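A quick way to get a feel for the coefficient ordering is to fit noise-free data generated from a known polynomial and check that polyfit() returns its coefficients. This is a minimal sketch with made-up data; the quadratic 2x^2 - 3x + 1 is an arbitrary choice.

```matlab
% exact samples of 2x^2 - 3x + 1, no noise added
x = 0:5;
y = 2*x.^2 - 3*x + 1;

% degree-2 fit returns N+1 = 3 coefficients, highest power first
P = polyfit( x, y, 2 );

% largest deviation from the known coefficients [ 2 -3 1 ]
coefficientError = max( abs( P - [ 2 -3 1 ] ) );
```

With noise-free data the fit reproduces the generating coefficients to within rounding error, with P(1) multiplying x^2 and P(3) being the constant term.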
Example
Find the polynomial of degree 0 that matches the data with least error
>> x = 1:100;
>> poly1 = polyfit( x, data1, 0 );
>> y1 = poly1(1) + zeros(1,100);
>> plot( x, data1, 'r-', x, y1, 'b-' );
>> poly1
poly1 = 2.9677
(The data was created with yIntercept = 3.2980)
Example
The result (blue line) does look like it's right in the middle, i.e., it's the mean
[Figure: data1 (red) with the degree-0 best fit (blue horizontal line)]
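The observation that a degree-0 least-squares fit is the mean can be checked numerically. The sketch below reuses the quiz grades from earlier as sample data; any vector would do, and the variable names are made up for this example.

```matlab
y = [ 2 9 8 5 4 5 8 10 8 7 ];
x = 1:numel( y );

% degree-0 fit: a single coefficient, the best constant in the
% least-squares sense
c = polyfit( x, y, 0 );

% compare with the mean of the data
difference = abs( c - mean( y ) );
```

The difference is zero to within rounding error, because the constant that minimizes the sum of squared errors is the average of the data.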
Example
Make 100 data points of a line with y-intercept drawn from a uniform random distribution on [0,5) and slope drawn from a uniform random distribution on (-1,0], with additive Gaussian noise of zero mean, variance 1
Example
>> yIntercept = 5 * rand(1,1)
yIntercept = 2.0381
>> slope = -rand(1,1)
slope = -0.8200
>> data2 = slope*(1:100)+yIntercept;
>> data2 = data2 + randn(1,100);
>> plot( data2 );
Example
[Figure: plot of data2, a noisy line falling from about 0 to about -80 over 100 samples]
Example
Find the polynomial of degree 1 that matches the data with least error
>> x = 1:100;
>> poly2 = polyfit( x, data2, 1 );
>> y2 = poly2(1)*(1:100)+poly2(2);
>> plot( x, data2, 'r-', x, y2, 'b-' );
>> poly2
poly2 =
   -0.8212    1.9858
(The data was created with slope = -0.8200 and yIntercept = 2.0381)
Example
[Figure: data2 (red) with the degree-1 best-fit line (blue)]
Example
Suppose we want to compare what remains of each data set after removing the output of its model
If the model is constant or linear, its effect can be removed by detrending
Example
Y = detrend( X, trend_type )
– X is the data vector
– trend_type is 'constant' or 'linear'
Removes the best-fit constant or straight line from X and returns the residual in vector Y
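To connect detrend() with the fitting shown earlier, the sketch below checks that 'linear' detrending matches subtracting a degree-1 polyfit() line, and that 'constant' detrending matches subtracting the mean. The data is made up (a line plus a sine term standing in for noise so the result is repeatable), and the variable names are for illustration only.

```matlab
x = 1:100;
data = -0.8*x + 2 + sin( x );   % made-up line plus a repeatable "noise" term

% remove the best-fit straight line
residual = detrend( data, 'linear' );

% the same thing by hand: fit a line, evaluate it, subtract it
coeffs = polyfit( x, data, 1 );
byHand = data - polyval( coeffs, x );
linearDifference = max( abs( residual(:) - byHand(:) ) );

% remove the best-fit constant, i.e., the mean
centered = detrend( data, 'constant' );
constantDifference = max( abs( centered(:) - ( data(:) - mean( data ) ) ) );
```

Both differences come out at rounding-error size, which is why detrending is a convenient shortcut when the model is constant or linear.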
Example
After detrending, we can see that the noise in the first data set is much larger than the noise in the second
[Figure: the two detrended data sets plotted against sample number]
Previously, we evaluated the best-fit polynomial "by hand", i.e., by explicitly multiplying the polynomial coefficients by powers of the independent variable
>> x = 1:100;
>> poly2 = polyfit( x, data2, 1 );
>> y2 = poly2(1)*(1:100)+poly2(2);
For higher powers, this is clumsy to code and inefficient to run
A more general way to evaluate a polynomial at various points is to use
y = polyval(p,x)
– returns the value of a polynomial of degree n evaluated at x
– p is a vector of length n+1 whose elements are the coefficients, in descending powers, of the polynomial to be evaluated
• NOTE - p is the same as the output from polyfit()!
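As a small check of the coefficient convention, the sketch below evaluates one polynomial both by hand and with polyval(). The polynomial 2x^2 - 3x + 1 and the sample points are arbitrary choices for illustration.

```matlab
p = [ 2 -3 1 ];     % 2x^2 - 3x + 1, coefficients in descending powers
x = [ 0 1 2 ];

% explicit evaluation, one term per power of x
byHand = p(1)*x.^2 + p(2)*x + p(3);

% same evaluation without writing out the powers
usingPolyval = polyval( p, x );
```

Both give 1, 0, and 3 at x = 0, 1, and 2. The polyval() call stays the same no matter what the degree of p is.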
Try It
Make an input of ½ cycle of a sinusoid, and a noisy input equal to that input plus additive Gaussian noise of zero mean and standard deviation ½. On one graph, plot the input, the noisy input, and the best-fit cubic
Try It
>> x = 0:pi/100:pi;
>> input = sin(x);
>> noisyInput = input + 0.5 * randn( size(input) );
>> p = polyfit( x, noisyInput, 3 );
>> bestFit = polyval( p, x );
>> plot( x, [ input; noisyInput;...
   bestFit ] )
>> legend( 'Input', 'Noisy input',...
   'Best fit cubic' );
[Figure: input, noisy input, and best-fit cubic for x from 0 to π]
Be careful of extrapolating your model too far! The further you go past the range of the independent values used for the fit, the more likely the model is to be wrong.
Example
Repeat the previous example but 1) let the input go from 0 to 2π; 2) fit the model only from 0 to π; 3) plot all three from 0 to 2π
>> x = 0:pi/100:2*pi;
>> input = sin(x);
>> noisyInput = input +...
   0.5 * randn( size(input) );
>> midIx = round( length(x)/2 );
>> p = polyfit( x(1:midIx),...
   noisyInput(1:midIx), 3 );
>> bestFit = polyval( p, x );
>> plot( x, [ input; noisyInput; bestFit ] )
>> legend( 'Input', 'Noisy input', 'Best fit cubic' );
[Figure: input, noisy input, and best-fit cubic for x from 0 to 2π; the cubic fits well from 0 to π but diverges badly in the extrapolated region beyond π]
Sometimes we want to know what value of the independent variable produces a specified value of the model, i.e., if p(x) is the polynomial model, for what x is p(x) = p* ?
– If h(x) is the altitude of a rocket x miles from its launch site, where does the rocket land, i.e., for what x is h(x) = 0?
– If v(t) is the voltage across a capacitor at time t, when does the voltage reach 10V ?
polyfit( x, y, n ) returns a vector p of n+1 coefficients such that the best-fit polynomial is
$$y(x) = p(1)\,x^{n} + p(2)\,x^{n-1} + \cdots + p(n)\,x + p(n+1)$$
We're given the constant y* and want to know what x (or x's) produces p(x) = y*
$$y^{*} = p(1)\,x^{n} + p(2)\,x^{n-1} + \cdots + p(n)\,x + p(n+1)$$
$$p(1)\,x^{n} + p(2)\,x^{n-1} + \cdots + p(n)\,x + p(n+1) - y^{*} = 0$$
So the x's we're looking for are the roots of the polynomial p(x) - y*
To find the roots of a polynomial, use the MATLAB function
r = roots( p )
• p is a polynomial defined as before
• r is a column vector
Note that roots can be complex. To find roots that have non-zero imaginary parts, i.e., that have complex values, use the expression imag(r) ~= 0
(See help on isreal() for details of why this works)
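The recipe of subtracting the target value from the constant coefficient and then calling roots() can be tried on a polynomial whose answers are known in advance. The sketch below is a made-up illustration: it solves x^2 = 9 and then shows a case, x^2 = -4, where every root is complex. The names yStar, q, and realOnly are introduced for this example.

```matlab
% solve p(x) = yStar for p(x) = x^2 and yStar = 9
p = [ 1 0 0 ];
yStar = 9;
q = p;
q(end) = q(end) - yStar;      % q(x) = x^2 - 9
r = roots( q );               % the two solutions, +3 and -3

% x^2 + 4 = 0 has no real solutions, only a complex pair
c = roots( [ 1 0 4 ] );
isComplex = imag( c ) ~= 0;   % true for both roots
realOnly = c( ~isComplex );   % nothing left after filtering
```

When the model describes a physical quantity such as time or distance, only the real roots (and usually only those inside the fitted range) are meaningful answers.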
Example
The amount of material left after radioactive decay is given by
$$A(t) = A_0\, e^{-\frac{t \ln 2}{T_{1/2}}}$$
– A(t) is the amount remaining at time t
– A0 is the amount at t = 0
– T1/2 (called the half-life) is the time for the amount to be cut in half from the amount at the start
• At t = T1/2 the amount of material is A0 / 2
• At t = 2T1/2 the amount of material is A0 / 4
• At t = 3T1/2 the amount of material is A0 / 8
• etc.
Example
Assume that there are initially 50 grams of a substance whose half-life is 5 hours. Simulate measurements of the decay by using the radioactive decay equation and adding zero-mean Gaussian noise with a standard deviation of 1. Make data points for every quarter hour of one day, starting at time zero.
1. Fit a cubic polynomial to the data and plot the fit and the noisy data
2. Determine when the fit predicts there will be 6.25 grams left
3. Compare this to the correct answer
Follow Along
For A0 = 50 and T1/2 = 5 the equation becomes
$$A(t) = 50\, e^{-\frac{t \ln 2}{5}}$$
• Simulate the measurements
>> t = 0:0.25:23.75;
>> decay = 50*exp( -t*log(2)/5 ) + randn( size(t) );
• Find the best-fit cubic
>> p = polyfit( t, decay, 3 );
>> bestFit = polyval( p, t );
• Plot both
>> plot( t, [ decay; bestFit ] )
[Figure: noisy decay data and best-fit cubic versus time in hours, falling from about 50 grams at t = 0]
Follow Along
• Solve p(t) = 6.25, i.e., find the roots of p(t) - 6.25
>> p(end) = p(end) - 6.25;
>> r = roots( p )
r =
  22.5719 + 9.8810i
  22.5719 - 9.8810i
  14.8484
• Eliminate all complex roots (the answer must be real)
>> r( imag(r) ~= 0 ) = [];
>> r
r = 14.8484
Follow Along
To compare answers, note that 6.25 g = 50/8 g, and the decay equation says this happens after three half-lives, i.e., at 15 hours. The best fit predicts 14.85 hours, which is quite close
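The exact answer can also be obtained by solving the decay equation for t, which gives t = -T1/2 ln(A/A0) / ln 2. The sketch below does this and measures how far off the fitted answer is. The value 14.8484 is copied from the roots() output above and depends on that particular random noise; the variable names are made up for this example.

```matlab
A0 = 50;          % initial amount, grams
halfLife = 5;     % hours
target = 6.25;    % grams remaining

% decay equation solved for time
tExact = -halfLife * log( target / A0 ) / log( 2 );

% prediction from the best-fit cubic for the noise realization on the slides
tFit = 14.8484;

% relative error of the fitted prediction, in percent
percentError = 100 * abs( tFit - tExact ) / tExact;
```

The exact time is 15 hours and the cubic's prediction is off by about 1%, which is small here because 15 hours lies well inside the fitted range of 0 to 23.75 hours.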
Polynomial fitting
Questions?
The End