Statistical Data Mining, Lecture 2
Edward J. Wegman, George Mason University


Page 1: Statistical Data Mining

Statistical Data Mining, Lecture 2

Edward J. Wegman, George Mason University

Page 2: Statistical Data Mining

Data Preparation

Page 3: Statistical Data Mining

Data Preparation

[Figure: bar chart of effort (%) by project phase: Objectives Determination, Data Preparation, Data Mining, Analysis & Assimilation; y-axis runs 0 to 60%]

Page 4: Statistical Data Mining

Data Preparation

• Data Cleaning and Quality
• Types of Data
• Categorical versus Continuous Data
• Problem of Missing Data
  – Imputation
  – Missing Data Plots
• Problem of Outliers
• Dimension Reduction, Quantization, Sampling

Page 5: Statistical Data Mining

Data Preparation

• Quality
  – Data may not have any statistically significant patterns or relationships
  – Results may be inconsistent with other data sets
  – Data often of uneven quality, e.g. made up by the respondent
  – Opportunistically collected data may have biases or errors
  – Discovered patterns may be too specific or too general to be useful

Page 6: Statistical Data Mining

Data Preparation

• Noise - Incorrect Values
  – Faulty data collection instruments, e.g. sensors
  – Transmission errors, e.g. intermittent errors from satellite or Internet transmissions
  – Data entry problems
  – Technology limitations
  – Naming conventions misused

Page 7: Statistical Data Mining

Data Preparation

• Noise - Incorrect Classification
  – Human judgment
  – Time varying
  – Uncertainty/probabilistic nature of data

Page 8: Statistical Data Mining

Data Preparation

• Redundant/Stale data
  – Variables have different names in different databases
  – A raw variable in one database is a derived variable in another
  – Irrelevant variables destroy speed (dimension reduction needed)
  – Changes in a variable over time not reflected in the database

Page 9: Statistical Data Mining

Data Preparation

• Data cleaning
• Selecting an appropriate data set and/or sampling strategy
• Transformations

Page 10: Statistical Data Mining

Data Preparation

• Data Cleaning (a code sketch of these steps follows below)
  – Duplicate removal (tool based)
  – Missing value imputation (manual, statistical)
  – Identify and remove data inconsistencies
  – Identify and refresh stale data
  – Create unique record (case) ID
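A minimal sketch of these cleaning steps in Python with pandas; the file name and the median-imputation choice are illustrative assumptions, not prescriptions from the lecture:

    import pandas as pd

    df = pd.read_csv("records.csv")        # hypothetical input file

    # Duplicate removal (tool based)
    df = df.drop_duplicates()

    # Missing value imputation (statistical): here, column medians
    num_cols = df.select_dtypes("number").columns
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())

    # Create a unique record (case) ID
    df = df.reset_index(drop=True)
    df["case_id"] = df.index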

Page 11: Statistical Data Mining

Data Preparation

• Categorical versus Continuous Data
  – Most statistical theory and many graphics tools were developed for continuous data
  – Much of the data, if not most data, in databases is categorical
  – The computer science view often converts continuous data to categorical, e.g. salaries categorized as low, medium, high, because categories are better suited to Boolean operations

Page 12: Statistical Data Mining

Data Preparation

• Problem of Missing Values
  – Missing values in massive data sets may or may not be a problem
    • Missing data may be irrelevant to the desired result, e.g. cases with missing demographic data may not help if I am trying to create a selection mechanism for good customers based on demographics
    • Massive data sets, if acquired by instrumentation, may have few missing values anyway
    • Imputation has model assumptions
  – Suggest making a Missing Value Plot

Page 13: Statistical Data Mining

Data Preparation

• Missing Value Plot (a code sketch follows below)
  – A plot of variables by cases
  – Missing values colored red
  – Special case of the “color histogram” with binary data
  – The “color histogram” is also known as the “data image”
  – This example is 67 dimensions by 1000 cases
  – This example is also fake
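A sketch of such a plot with numpy and matplotlib, using fake data like the slide's (1000 cases by 67 dimensions, with entries deleted at random):

    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.colors import ListedColormap

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 67))            # 1000 cases, 67 dimensions
    X[rng.random(X.shape) < 0.05] = np.nan     # ~5% missing at random

    # Binary "data image": cases by variables, missing cells in red
    plt.imshow(np.isnan(X), aspect="auto",
               cmap=ListedColormap(["white", "red"]))
    plt.xlabel("variable")
    plt.ylabel("case")
    plt.show()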

Page 14: Statistical Data Mining

Data Preparation

• Problem of Outliers
  – Outliers are easy to detect in low dimensions
  – A high dimensional outlier may not show up in low dimensional projections (see the demonstration below)
  – MVE or MCD algorithms are exponentially computationally complex
    • Visualization tools may help
  – The Fisher Info Matrix and Convex Hull Peeling are more feasible, but still too complex for massive datasets
    • Some angle based methods are promising
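A small numpy demonstration of the projection problem, with assumed correlation values: the point (2, −2) looks ordinary in each coordinate separately but is extreme in the joint (Mahalanobis) sense:

    import numpy as np

    rng = np.random.default_rng(1)
    cov = [[1.0, 0.95], [0.95, 1.0]]            # strongly correlated pair
    X = rng.multivariate_normal([0.0, 0.0], cov, size=1000)

    outlier = np.array([2.0, -2.0])   # within ~2 sd on each axis marginally

    mean = X.mean(axis=0)
    prec = np.linalg.inv(np.cov(X.T))           # inverse covariance
    d = outlier - mean
    print(d @ prec @ d)   # squared Mahalanobis distance: roughly 160,
                          # far beyond the chi-square(2) 95% cutoff of ~5.99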

Page 15: Statistical Data Mining

Data Preparation

• Database Sampling
  – Exhaustive search may not be practically feasible because of the size of the databases
  – KDD systems must be able to assist in the selection of appropriate parts of the databases to be examined
  – For sampling to work, the data must satisfy certain conditions (not ordered, no systematic biases)
  – Sampling can be a very expensive operation, especially when the sample is taken from data stored in a DBMS; sampling 5% of the database can be more expensive than a sequential full scan of the data (a one-pass alternative is sketched below)
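One standard one-pass workaround (my addition, not the lecture's prescription) is reservoir sampling, which draws a uniform random sample during a single sequential scan instead of issuing random accesses against the DBMS:

    import random

    def reservoir_sample(stream, k, seed=0):
        """Vitter's Algorithm R: uniform sample of k records in one pass."""
        rng = random.Random(seed)
        sample = []
        for i, row in enumerate(stream):
            if i < k:
                sample.append(row)          # fill the reservoir first
            else:
                j = rng.randrange(i + 1)    # keep row with probability k/(i+1)
                if j < k:
                    sample[j] = row
        return sample

    # e.g. sample = reservoir_sample(open("table_dump.csv"), k=50_000)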

Page 16: Statistical Data Mining

Data Compression

• Often data preparation involves data compression
  – Sampling
  – Quantization

Page 17: Statistical Data Mining

Data Quantization

Thinning vs Binning

• People’s first thought about massive data is usually statistical subsampling
• Quantization is engineering’s success story
• Binning is the statistician’s quantization

Page 18: Statistical Data Mining

Data Quantization

• Images are quantized in 8 to 24 bits, i.e. 256 to 16 million levels.
• Signals (audio on CDs) are quantized in 16 bits, i.e. 65,536 levels.
• Ask a statistician how many bins to use and the likely response is a few hundred; ask a CS data miner and the likely response is 3.
• For a terabyte data set, 10^6 bins.

Page 19: Statistical Data Mining

Data Quantization

• Binning, but at microresolution
• Conventions
  – d = dimension
  – k = # of bins
  – n = sample size
  – Typically k << n

Page 20: Statistical Data Mining

Data Quantization

• Choose E[W | Q = y_j] = mean of the observations in the j-th bin = y_j
• In other words, E[W | Q] = Q
• The quantizer is self-consistent

Page 21: Statistical Data Mining

Data Quantization

• E[W] = E[Q]
• If θ is a linear unbiased estimator, then so is E[θ | Q]
• If h is a convex function, then E[h(Q)] ≤ E[h(W)]
  – In particular, E[Q²] ≤ E[W²] and var(Q) ≤ var(W)
• E[Q(Q − W)] = 0
• cov(W − Q) = cov(W) − cov(Q)
• E[(W − P)²] ≥ E[(W − Q)²], where P is any other quantizer
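These identities are easy to check numerically. A sketch with numpy, using k equal-width bins with the bin mean as representative (so E[W | Q] = Q by construction):

    import numpy as np

    rng = np.random.default_rng(2)
    W = rng.normal(size=100_000)

    k = 100
    edges = np.linspace(W.min(), W.max(), k + 1)
    idx = np.clip(np.digitize(W, edges) - 1, 0, k - 1)
    counts = np.maximum(np.bincount(idx, minlength=k), 1)
    bin_means = np.bincount(idx, weights=W, minlength=k) / counts
    Q = bin_means[idx]                     # self-consistent quantizer

    print(W.mean(), Q.mean())              # E[W] = E[Q]
    print(Q.var() <= W.var())              # var(Q) <= var(W): True
    print(np.mean(Q * (Q - W)))            # E[Q(Q - W)] = 0 up to roundoff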

Page 22: Statistical Data Mining

Data Quantization

Page 23: Statistical Data Mining

Distortion due to Quantization

• Distortion is the error due to quantization.
• In simple terms, E[(W − Q)²].
• Distortion is minimized when the quantization regions, S_j, are most like a (hyper-)sphere.

Page 24: Statistical Data Mining

Geometry-based Quantization

• Need space-filling tessellations
• Need congruent tiles
• Need tiles as spherical as possible

Page 25: Statistical Data Mining

Geometry-based Quantization

• In one dimension
  – The only polytope is a straight line segment (also bounded by a one-dimensional sphere).
• In two dimensions
  – The only polytopes are equilateral triangles, squares, and hexagons

Page 26: Statistical Data Mining

Geometry-based Quantization

• In 3 dimensions
  – Tetrahedron (3-simplex), cube, hexagonal prism, rhombic dodecahedron, truncated octahedron.
• In 4 dimensions
  – 4-simplex, hypercube, 24-cell

[Figure: truncated octahedron tessellation]

Page 27: Statistical Data Mining

Geometry-based Quantization

Dimensionless Second Moment for 3-D Polytopes

  Tetrahedron*             .1040042…
  Cube*                    .0833333…
  Octahedron               .0825482…
  Hexagonal Prism*         .0812227…
  Rhombic Dodecahedron*    .0787451…
  Truncated Octahedron*    .0785433…
  Dodecahedron             .0781285…
  Icosahedron              .0778185…
  Sphere                   .0769670

  (* apparently marks the space-filling polytopes listed on the previous slide)

Page 28: Statistical Data Mining

Geometry-based Quantization

[Figures: Tetrahedron, Cube, Octahedron, Icosahedron, Dodecahedron, Truncated Octahedron]

Page 29: Statistical Data Mining

Geometry-based Quantization

Rhombic Dodecahedron

http://www.jcrystal.com/steffenweber/POLYHEDRA/p_07.html

Page 30: Statistical Data Mining

Geometry-based Quantization

Hexagonal Prism

24 Cell with Cuboctahedron Envelope

Page 31: Statistical Data Mining

Geometry-based Quantization

• Using 10^6 bins is computationally and visually feasible.
• Fast binning: for data in the range [a, b] and for k bins,

      j = floor[k·(x_i − a)/(b − a)]

  gives the index of the bin for x_i in one dimension (a vectorized sketch follows below).
• Computational complexity is 4n + 1 = O(n).
• Memory requirements drop to 3k: location of bin + # of items in bin + representative of bin, i.e. storage complexity is 3k.
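A vectorized numpy sketch of this binning rule, keeping only the 3k summaries (bin location, count, and representative):

    import numpy as np

    def fast_bin(x, k):
        """1-D binning: j = floor(k (x - a)/(b - a)); O(n) time, O(k) memory."""
        a, b = x.min(), x.max()
        j = np.floor(k * (x - a) / (b - a)).astype(int)
        j = np.clip(j, 0, k - 1)          # the point x == b joins the last bin
        counts = np.bincount(j, minlength=k)                # # items in bin
        sums = np.bincount(j, weights=x, minlength=k)
        reps = sums / np.maximum(counts, 1)                 # representative (bin mean)
        centers = a + (np.arange(k) + 0.5) * (b - a) / k    # location of bin
        return centers, counts, reps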

Page 32: Statistical Data Mining

Geometry-based Quantization

• In two dimensions
  – Each hexagon is indexed by 3 parameters (one concrete indexing scheme is sketched below).
  – Computational complexity is 3 times the 1-D complexity, i.e. 12n + 3 = O(n).
  – Complexity for squares is 2 times the 1-D complexity.
  – The ratio is 3/2.
  – Storage complexity is still 3k.
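One concrete realization of a 3-parameter hexagon index (an assumption on my part; the lecture does not spell out its scheme) is the standard cube-coordinate indexing of a hexagonal lattice, where the three integer coordinates satisfy q + r + s = 0:

    import numpy as np

    def hex_index(x, y, size):
        """Cube coordinates (q, r, s), with q + r + s = 0, of the pointy-top
        hexagon of circumradius `size` containing the point (x, y)."""
        q = (np.sqrt(3) / 3 * x - y / 3) / size
        r = (2 / 3) * y / size
        s = -q - r
        rq, rr, rs = round(q), round(r), round(s)
        dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
        # Re-impose q + r + s = 0 by fixing the worst-rounded coordinate
        if dq > dr and dq > ds:
            rq = -rr - rs
        elif dr > ds:
            rr = -rq - rs
        else:
            rs = -rq - rr
        return rq, rr, rs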

Page 33: Statistical Data Mining

Geometry-based Quantization

• In 3 dimensions
  – For the truncated octahedron, there are 3 pairs of square sides and 4 pairs of hexagonal sides.
  – Computational complexity is 28n + 7 = O(n).
  – Computational complexity for a cube is 12n + 3.
  – The ratio is 7/3.
  – Storage complexity is still 3k.

Page 34: Statistical Data Mining

Quantization Strategies

• Optimally, for purposes of minimizing distortion, use the roundest polytope in d dimensions.
  – Complexity is always O(n).
  – Storage complexity is 3k.
  – # of tiles grows exponentially with dimension, the so-called curse of dimensionality.
  – Higher dimensional geometry is poorly known.
  – Computational complexity grows faster than for the hypercube.

Page 35: Statistical Data Mining

Quantization Strategies

• For purposes of simplicity, always use the hypercube or d-dimensional simplices
  – Computational complexity is always O(n).
  – Methods for data adaptive tiling are available.
  – Storage complexity is 3k.
  – # of tiles grows exponentially with dimension.
  – Both polytopes depart from spherical shape rapidly as d increases.
  – The hypercube approach is known as the datacube in the computer science literature and is closely related to multivariate histograms in the statistical literature.

Page 36: Statistical Data Mining

Quantization Strategies

• Conclusions on Geometric Quantization
  – The geometric approach is good to 4 or 5 dimensions.
  – Adaptive tilings may improve the rate at which the # of tiles grows, but probably destroy the spherical structure.
  – Good for large n, but weaker for large d.

Page 37: Statistical Data Mining

Quantization Strategies

• Alternate Strategy
  – Form bins via clustering (a k-means sketch follows below)
    • Known in the electrical engineering literature as vector quantization.
    • Distance based clustering is O(n²), which implies poor performance for large n.
    • Not terribly dependent on dimension, d.
    • Clusters may be very out of round, not even convex.
  – Conclusion
    • The cluster approach may work for large d, but fails for large n.
    • Not particularly applicable to “massive” data mining.
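A minimal vector quantization sketch using k-means (scikit-learn's KMeans; Lloyd's algorithm costs roughly O(nk) per iteration, so it is cheaper than O(n²) pairwise clustering but still heavy for very large n):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(3)
    X = rng.normal(size=(10_000, 8))       # moderate n, larger d

    k = 256                                # bins = clusters
    vq = KMeans(n_clusters=k, n_init=1, random_state=0).fit(X)
    Q = vq.cluster_centers_[vq.labels_]    # representative = cluster centroid

    distortion = np.mean(np.sum((X - Q) ** 2, axis=1))   # E[|W - Q|^2]
    print(distortion)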

Page 38: Statistical Data Mining

Quantization Strategies

• Third strategy
  – Density-based clustering
    • Density estimation with kernel estimators is O(n).
    • Uses the modes m_α to form clusters.
    • Put x_i in cluster α if it is closest to mode m_α.
    • This procedure is distance based, but with complexity O(kn), not O(n²).
    • Normal mixture densities may be an alternative approach.
    • Roundness may be a problem.
  – But quantization based on density-based clustering offers promise for both large d and large n (a mode-based sketch follows below).
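A sketch of the mode-based recipe using mean shift, which climbs a kernel density estimate to its modes m_α and assigns each x_i to the nearest mode; scikit-learn's MeanShift is one off-the-shelf choice, and the bandwidth here is an arbitrary assumption:

    import numpy as np
    from sklearn.cluster import MeanShift

    rng = np.random.default_rng(4)
    X = np.vstack([rng.normal(0.0, 1.0, size=(500, 2)),    # a normal mixture
                   rng.normal(5.0, 1.0, size=(500, 2))])   # with two modes

    ms = MeanShift(bandwidth=1.5).fit(X)
    modes = ms.cluster_centers_     # the modes m_alpha
    labels = ms.labels_             # cluster of each x_i (nearest mode)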

Page 39: Statistical Data Mining

Data Quantization

• Binning does not lose fine structure in the tails as sampling might.
• Roundoff analysis applies.
• At this scale of binning, discretization is not likely to be much less accurate than the accuracy of the recorded data.
• Discretization, i.e. a finite number of bins, implies discrete variables more compatible with categorical data.

Page 40: Statistical Data Mining

Data Quantization

• Analysis on a finite subset of the integers has theoretical advantages
  – Analysis is less delicate
    • different forms of convergence are equivalent
  – Analysis is often more natural, since the data is already quantized or categorical
  – Graphical analysis of numerical data is not much changed, since 10^6 pixels is at the limit of the human visual system (HVS)