Statistical Data Mining, Lecture 2
Edward J. Wegman, George Mason University
Data Preparation

[Figure: bar chart of Effort (%) by stage: Objectives Determination, Data Preparation, Data Mining, Analysis & Assimilation; vertical axis from 0 to 60, with data preparation taking the largest share of the effort.]
Data Preparation
• Data Cleaning and Quality
• Types of Data
• Categorical versus Continuous Data
• Problem of Missing Data
  – Imputation
  – Missing Data Plots
• Problem of Outliers
• Dimension Reduction, Quantization, Sampling
Data Preparation
• Quality
  – Data may not have any statistically significant patterns or relationships
  – Results may be inconsistent with other data sets
  – Data are often of uneven quality, e.g. made up by the respondent
  – Opportunistically collected data may have biases or errors
  – Discovered patterns may be too specific or too general to be useful
Data Preparation
• Noise - Incorrect Values
  – Faulty data collection instruments, e.g. sensors
  – Transmission errors, e.g. intermittent errors from satellite or Internet transmissions
  – Data entry problems
  – Technology limitations
  – Naming conventions misused
Data Preparation
• Noise - Incorrect Classification
  – Human judgment
  – Time-varying data
  – Uncertainty/probabilistic nature of data
Data Preparation
• Redundant/Stale Data
  – Variables have different names in different databases
  – A raw variable in one database is a derived variable in another
  – Irrelevant variables destroy speed (dimension reduction needed)
  – Changes in a variable over time are not reflected in the database
Data Preparation
• Data cleaning
• Selecting an appropriate data set and/or sampling strategy
• Transformations
Data Preparation
• Data Cleaning (see the dataframe sketch below)
  – Duplicate removal (tool based)
  – Missing value imputation (manual, statistical)
  – Identify and remove data inconsistencies
  – Identify and refresh stale data
  – Create a unique record (case) ID
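These cleaning steps map naturally onto a few dataframe operations. A minimal sketch in Python with pandas; the file name customers.csv and the columns income, region, and age are hypothetical, and the median/constant imputations are illustrative choices, not the lecture's prescription:

```python
import pandas as pd

# Load an illustrative data set (file and column names are hypothetical).
df = pd.read_csv("customers.csv")

# Duplicate removal: drop exact duplicate records.
df = df.drop_duplicates()

# Missing value imputation (statistical): a numeric column gets its
# median; a categorical column gets an explicit "unknown" level.
df["income"] = df["income"].fillna(df["income"].median())
df["region"] = df["region"].fillna("unknown")

# Identify and remove data inconsistencies, e.g. impossible ages.
df = df[df["age"] >= 0]

# Create a unique record (case) ID.
df = df.reset_index(drop=True)
df["case_id"] = df.index
```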
Data Preparation
• Categorical versus Continuous Data
  – Most statistical theory and many graphics tools were developed for continuous data
  – Much, if not most, of the data in databases is categorical
  – The computer science view often converts continuous data to categorical, e.g. salaries categorized as low, medium, high, because categories are better suited to Boolean operations (see the sketch below)
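A one-line illustration of such a conversion; the salary values and cut points here are invented for the example:

```python
import pandas as pd

salaries = pd.Series([23_000, 48_000, 95_000, 150_000, 61_000])
# Continuous -> categorical, as in the low/medium/high example;
# the cut points are illustrative.
levels = pd.cut(salaries, bins=[0, 40_000, 80_000, float("inf")],
                labels=["low", "medium", "high"])
print(levels.tolist())   # ['low', 'medium', 'high', 'high', 'medium']
```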
Data Preparation
• Problem of Missing Values
  – Missing values in massive data sets may or may not be a problem
    • Missing data may be irrelevant to the desired result, e.g. cases with missing demographic data may not help if I am trying to create a selection mechanism for good customers based on demographics
    • Massive data sets, if acquired by instrumentation, may have few missing values anyway
    • Imputation has model assumptions
  – Suggest making a Missing Value Plot
Data Preparation
• Missing Value Plot
  – A plot of variables by cases
  – Missing values colored red
  – A special case of the "color histogram" with binary data
  – The "color histogram" is also known as the "data image"
  – This example is 67 dimensions by 1000 cases
  – This example is also fake (a construction sketch follows)
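Such a plot is easy to build. A minimal sketch with NumPy and Matplotlib, simulating a cases-by-variables array with NaN marking missing entries, in the spirit of the slide's fake 67-by-1000 example:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# Simulated data: 1000 cases by 67 variables, ~5% missing at random.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 67))
X[rng.random(X.shape) < 0.05] = np.nan

# Binary missingness matrix: 1 where a value is missing.
miss = np.isnan(X).astype(int)

# Data image: variables by cases, missing cells in red.
plt.imshow(miss.T, aspect="auto", cmap=ListedColormap(["white", "red"]))
plt.xlabel("case")
plt.ylabel("variable")
plt.title("Missing value plot (data image)")
plt.show()
```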
Data Preparation
• Problem of Outliers
  – Outliers are easy to detect in low dimensions
  – A high-dimensional outlier may not show up in low-dimensional projections
  – MVE (minimum volume ellipsoid) and MCD (minimum covariance determinant) algorithms are exponentially computationally complex
    • Visualization tools may help
  – Fisher Information Matrix and Convex Hull Peeling are more feasible, but still too complex for massive data sets
    • Some angle-based methods are promising
Data Preparation

• Database Sampling
  – Exhaustive search may not be practically feasible because of the size of the database
  – KDD systems must be able to assist in the selection of the appropriate parts of the database to be examined
  – For sampling to work, the data must satisfy certain conditions (not ordered, no systematic biases)
  – Sampling can be a very expensive operation, especially when the sample is taken from data stored in a DBMS; sampling 5% of the database can be more expensive than a sequential full scan of the data (a one-pass sampling sketch follows)
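One standard way to draw a uniform random sample during a single sequential scan, avoiding the random access that makes DBMS sampling costly, is reservoir sampling; a minimal sketch (an assumed technique for this setting, not one named in the lecture):

```python
import random

def reservoir_sample(stream, m, seed=0):
    """Uniform random sample of m items from a stream in one pass."""
    rng = random.Random(seed)
    sample = []
    for t, item in enumerate(stream):
        if t < m:
            sample.append(item)           # fill the reservoir first
        else:
            j = rng.randint(0, t)         # inclusive on both ends
            if j < m:
                sample[j] = item          # keep with probability m/(t+1)
    return sample

print(reservoir_sample(range(1_000_000), 5))
```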
Data Compression
• Often data preparation involves data compression
  – Sampling
  – Quantization
Data Quantization
Thinning vs Binning
• People's first thought about massive data is usually statistical subsampling
• Quantization is engineering's success story
• Binning is the statistician's quantization
Data Quantization
• Images are quantized in 8 to 24 bits, i.e. 256 to 16 million levels
• Signals (audio on CDs) are quantized in 16 bits, i.e. 65,536 levels
• Ask a statistician how many bins to use and the likely response is a few hundred; ask a CS data miner and the likely response is 3
• For a terabyte data set, use 10^6 bins
Data Quantization
• Binning, but at microresolution
• Conventions
  – d = dimension
  – k = # of bins
  – n = sample size
  – Typically k << n
Data Quantization
• Choose the bin representative y_j to be the mean of the observations in the j-th bin, so that E[W | Q = y_j] = y_j
• In other words, E[W | Q] = Q
• Such a quantizer is self-consistent
Data Quantization
• E[W] = E[Q]
• If θ is a linear unbiased estimator, then so is E[θ | Q]
• If h is a convex function, then E[h(Q)] ≤ E[h(W)]
  – In particular, E[Q²] ≤ E[W²] and var(Q) ≤ var(W)
• E[Q(Q - W)] = 0
• cov(W - Q) = cov(W) - cov(Q)
• E[(W - P)²] ≥ E[(W - Q)²], where P is any other quantizer (the identities above are checked numerically in the sketch below)
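These identities are easy to verify on data. A minimal sketch in Python with NumPy, using equal-width bins with the bin mean as representative (the self-consistent choice); the sample size and bin count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=100_000)    # observations W
k = 100                         # number of bins

# Equal-width binning over the observed range.
edges = np.linspace(w.min(), w.max(), k + 1)
idx = np.clip(np.digitize(w, edges) - 1, 0, k - 1)

# Self-consistent quantizer: each bin represented by its mean.
means = np.array([w[idx == j].mean() if (idx == j).any() else 0.0
                  for j in range(k)])
q = means[idx]                  # quantized values Q

print(np.isclose(w.mean(), q.mean()))        # E[W] = E[Q]
print(q.var() <= w.var())                    # var(Q) <= var(W)
print(np.isclose(np.mean(q * (q - w)), 0))   # E[Q(Q - W)] = 0
```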
Data Quantization
Distortion due to Quantization
• Distortion is the error due to quantization
• In simple terms, E[(W - Q)²]
• Distortion is minimized when the quantization regions S_j are most like a (hyper)sphere
Geometry-based Quantization
• Need a space-filling tessellation
• Need congruent tiles
• Need tiles as spherical as possible
Geometry-based Quantization
• In one dimension
  – The only polytope is the straight line segment (also bounded by a one-dimensional sphere)
• In two dimensions
  – The only space-filling polytopes are equilateral triangles, squares, and hexagons
Geometry-based Quantization
• In 3 dimensions
  – Tetrahedron (3-simplex), cube, hexagonal prism, rhombic dodecahedron, truncated octahedron
• In 4 dimensions
  – 4-simplex, hypercube, 24-cell
[Figure: truncated octahedron tessellation]
Geometry-based Quantization
Dimensionless Second Moment for 3-D Polytopes

  Polytope                   Dimensionless second moment
  Tetrahedron*               0.1040042…
  Cube*                      0.0833333…
  Octahedron                 0.0825482…
  Hexagonal Prism*           0.0812227…
  Rhombic Dodecahedron*      0.0787451…
  Truncated Octahedron*      0.0785433…
  Dodecahedron               0.0781285…
  Icosahedron                0.0778185…
  Sphere                     0.0769670

  (* marks the space-filling polytopes listed on the previous slide)
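For reference, the quantity tabulated appears to be the normalized (dimensionless) second moment of a tile S about its centroid, in Conway and Sloane's sense; in three dimensions this definition reproduces the listed sphere value of about 0.0769670:

```latex
G(S) = \frac{1}{d}\,
       \frac{\int_{S} \lVert x - \bar{x} \rVert^{2} \, dx}
            {\left( \int_{S} dx \right)^{1 + 2/d}},
\qquad \bar{x} = \text{the centroid of } S
```

Dividing by volume to the power 1 + 2/d makes G scale-invariant, so smaller values mean rounder tiles and hence lower distortion per unit volume.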
Geometry-based Quantization
[Figures: tetrahedron, cube, octahedron, icosahedron, dodecahedron, truncated octahedron]
Geometry-based Quantization
[Figure: rhombic dodecahedron (http://www.jcrystal.com/steffenweber/POLYHEDRA/p_07.html)]
Geometry-based Quantization
[Figure: hexagonal prism]

[Figure: 24-cell with cuboctahedron envelope]
Geometry-based Quantization
• Using 10^6 bins is computationally and visually feasible.
• Fast binning: for data in the range [a, b] and k bins,
    j = floor[k(x_i - a)/(b - a)]
  gives the index of the bin for x_i in one dimension (with x_i = b assigned to the last bin), as sketched below.
• Computational complexity is 4n + 1 = O(n).
• Memory requirements drop to 3k: the location of each bin, the # of items in each bin, and the representative of each bin; i.e. storage complexity is 3k.
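A minimal sketch of this one-dimensional fast binning in Python with NumPy; the data, range, and bin count are illustrative, and the bin mean serves as the representative:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(10.0, 20.0, size=1_000_000)   # illustrative data
a, b, k = 10.0, 20.0, 1000                    # range [a, b] and k bins

# One subtract/multiply/divide/floor per point: O(n).
j = np.floor(k * (x - a) / (b - a)).astype(int)
j = np.minimum(j, k - 1)                      # x == b falls in the last bin

# Storage is 3k numbers: bin location, count, and representative (mean).
counts = np.bincount(j, minlength=k)
sums = np.bincount(j, weights=x, minlength=k)
centers = a + (np.arange(k) + 0.5) * (b - a) / k
means = np.where(counts > 0, sums / np.maximum(counts, 1), centers)
```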
Geometry-based Quantization
• In two dimensions
  – Each hexagon is indexed by 3 parameters (see the sketch below)
  – Computational complexity is 3 times the 1-D complexity, i.e. 12n + 3 = O(n)
  – Complexity for squares is 2 times the 1-D complexity
  – The ratio is 3/2
  – Storage complexity is still 3k
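One concrete way to realize a three-parameter hexagon index is cube-coordinate rounding on a hexagonal grid; a minimal sketch, assuming pointy-top hexagons of a given size and the standard axial/cube conventions (this construction is an assumption, not necessarily the lecture's):

```python
import math

def hex_bin_index(x, y, size=1.0):
    """Cube coordinates (q, r, s), q + r + s = 0, of the hexagon
    (pointy-top, circumradius `size`) containing the point (x, y)."""
    # Fractional axial coordinates of the point.
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    s = -q - r
    # Cube rounding: round all three, then recompute the coordinate
    # with the largest rounding error so the sum stays zero.
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    if dq > dr and dq > ds:
        rq = -rr - rs
    elif dr > ds:
        rr = -rq - rs
    else:
        rs = -rq - rr
    return rq, rr, rs

print(hex_bin_index(0.2, 0.1))   # -> (0, 0, 0), the hexagon at the origin
```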
Geometry-based Quantization
• In 3 dimensions
  – For the truncated octahedron there are 3 pairs of square sides and 4 pairs of hexagonal sides (see the sketch below)
  – Computational complexity is 28n + 7 = O(n)
  – Computational complexity for a cube is 12n + 3
  – The ratio is 7/3
  – Storage complexity is still 3k
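The truncated octahedron tessellation is the Voronoi decomposition of the body-centered cubic (BCC) lattice, so binning a point into a truncated octahedron reduces to finding its nearest BCC lattice point: round to the nearest integer triple and to the nearest half-integer-offset triple, and keep the closer. A minimal sketch (an assumed implementation, not taken from the lecture):

```python
import numpy as np

def bcc_bin(p):
    """Nearest point of the BCC lattice Z^3 U (Z^3 + 1/2) to p;
    its Voronoi cell is the truncated octahedron."""
    p = np.asarray(p, dtype=float)
    c1 = np.round(p)                 # nearest point of Z^3
    c2 = np.round(p - 0.5) + 0.5     # nearest point of Z^3 + (1/2, 1/2, 1/2)
    d1 = np.sum((p - c1) ** 2)
    d2 = np.sum((p - c2) ** 2)
    return c1 if d1 <= d2 else c2

print(bcc_bin([0.3, 0.4, 0.45]))     # -> [0.5 0.5 0.5]
```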
Quantization Strategies
• Optimally, for purposes of minimizing distortion, use the roundest polytope in d dimensions
  – Complexity is always O(n)
  – Storage complexity is 3k
  – The # of tiles grows exponentially with dimension, the so-called curse of dimensionality
  – Higher-dimensional geometry is poorly known
  – Computational complexity grows faster than for the hypercube
Quantization Strategies
• For purposes of simplicity, always use the hypercube or d-dimensional simplices
  – Computational complexity is always O(n)
  – Methods for data-adaptive tiling are available
  – Storage complexity is 3k
  – The # of tiles grows exponentially with dimension
  – Both polytopes depart from spherical shape rapidly as d increases
  – The hypercube approach is known as the datacube in the computer science literature and is closely related to multivariate histograms in the statistical literature
Quantization Strategies
• Conclusions on Geometric Quantization
  – The geometric approach is good to 4 or 5 dimensions
  – Adaptive tilings may improve the rate at which the # of tiles grows, but probably destroy the spherical structure
  – Good for large n, but weaker for large d
Quantization Strategies
• Alternate Strategy
  – Form bins via clustering (see the sketch after this list)
    • Known in the electrical engineering literature as vector quantization
    • Distance-based clustering is O(n²), which implies poor performance for large n
    • Not terribly dependent on the dimension d
    • Clusters may be very out of round, not even convex
  – Conclusion
    • The cluster approach may work for large d, but fails for large n
    • Not particularly applicable to "massive" data mining
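Vector quantization is commonly implemented with the Lloyd/LBG iteration, which costs O(nk) distance evaluations per pass rather than the O(n²) of pairwise distance-based clustering; a minimal sketch (an illustrative implementation, not the lecture's):

```python
import numpy as np

def lbg_quantize(X, k, iters=20, seed=0):
    """Lloyd/LBG vector quantization: k codewords for the rows of X."""
    rng = np.random.default_rng(seed)
    codebook = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each point to its nearest codeword: O(nk) per pass.
        d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each codeword to the mean of its cell (self-consistency).
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                codebook[j] = members.mean(axis=0)
    return codebook, labels

X = np.random.default_rng(3).normal(size=(2000, 5))
codebook, labels = lbg_quantize(X, k=16)
```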
Quantization Strategies
• Third strategy
  – Density-based clustering
    • Density estimation with kernel estimators is O(n)
    • Uses the modes m_α to form clusters
    • Put x_i in cluster α if it is closest to mode m_α
    • This procedure is distance based, but with complexity O(kn), not O(n²)
    • Normal mixture densities may be an alternative approach
    • Roundness may be a problem
  – But quantization based on density-based clustering offers promise for both large d and large n (see the sketch below)
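A minimal one-dimensional sketch of the idea in Python with NumPy: estimate a kernel density on a grid, take its local maxima as the modes m_α, and assign each x_i to the nearest mode; the data, grid size, bandwidth, and mode threshold are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(-2, 0.5, 2000), rng.normal(3, 1.0, 2000)])

# Gaussian kernel density estimate evaluated on a grid.
grid = np.linspace(x.min(), x.max(), 512)
h = 0.3                                      # bandwidth (illustrative)
dens = np.exp(-0.5 * ((grid[None, :] - x[:, None]) / h) ** 2).mean(axis=0)

# Modes: interior local maxima, ignoring negligible bumps in the tails.
interior = (dens[1:-1] > dens[:-2]) & (dens[1:-1] > dens[2:])
modes = grid[1:-1][interior & (dens[1:-1] > 0.1 * dens.max())]

# Cluster alpha = points closest to mode m_alpha: O(kn) distances.
labels = np.abs(x[:, None] - modes[None, :]).argmin(axis=1)
print(modes, np.bincount(labels))
```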
Data Quantization
• Binning does not lose fine structure in the tails, as sampling might.
• Roundoff analysis applies.
• At this scale of binning, discretization is not likely to be much less accurate than the accuracy of the recorded data.
• Discretization, i.e. a finite number of bins, yields discrete variables that are more compatible with categorical data.
Data Quantization
• Analysis on a finite subset of the integers has theoretical advantages
  – Analysis is less delicate
    • Different forms of convergence are equivalent
  – Analysis is often more natural, since the data are already quantized or categorical
  – Graphical analysis of numerical data is not much changed, since 10^6 pixels is at the limit of the human visual system (HVS)