Statistical Data Mining, Lecture 2
Edward J. Wegman, George Mason University
Data Preparation

[Figure: bar chart of Effort (%) by stage: Objectives Determination, Data Preparation, Data Mining, Analysis & Assimilation; vertical axis from 0 to 60, with data preparation taking the largest share of the effort.]
Data Preparation
• Data Cleaning and Quality
• Types of Data
• Categorical versus Continuous Data
• Problem of Missing Data
  – Imputation
  – Missing Data Plots
• Problem of Outliers
• Dimension Reduction, Quantization, Sampling
Data Preparation
• Quality
  – Data may not have any statistically significant patterns or relationships
  – Results may be inconsistent with other data sets
  – Data are often of uneven quality, e.g. made up by the respondent
  – Opportunistically collected data may have biases or errors
  – Discovered patterns may be too specific or too general to be useful
Data Preparation
• Noise - Incorrect Values
  – Faulty data collection instruments, e.g. sensors
  – Transmission errors, e.g. intermittent errors from satellite or Internet transmissions
  – Data entry problems
  – Technology limitations
  – Naming conventions misused
Data Preparation
• Noise - Incorrect Classification
  – Human judgment
  – Time-varying data
  – Uncertainty/probabilistic nature of data
Data Preparation
• Redundant/Stale Data
  – Variables have different names in different databases
  – A raw variable in one database is a derived variable in another
  – Irrelevant variables destroy speed (dimension reduction needed)
  – Changes in a variable over time are not reflected in the database
Data Preparation
• Data cleaning
• Selecting an appropriate data set and/or sampling strategy
• Transformations
Data Preparation
• Data Cleaning (see the dataframe sketch below)
  – Duplicate removal (tool based)
  – Missing value imputation (manual, statistical)
  – Identify and remove data inconsistencies
  – Identify and refresh stale data
  – Create a unique record (case) ID
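These cleaning steps map naturally onto a few dataframe operations. A minimal sketch in Python with pandas; the file name customers.csv and the columns income, region, and age are hypothetical, and the median/constant imputations are illustrative choices, not the lecture's prescription:

```python
import pandas as pd

# Load an illustrative data set (file and column names are hypothetical).
df = pd.read_csv("customers.csv")

# Duplicate removal: drop exact duplicate records.
df = df.drop_duplicates()

# Missing value imputation (statistical): a numeric column gets its
# median; a categorical column gets an explicit "unknown" level.
df["income"] = df["income"].fillna(df["income"].median())
df["region"] = df["region"].fillna("unknown")

# Identify and remove data inconsistencies, e.g. impossible ages.
df = df[df["age"] >= 0]

# Create a unique record (case) ID.
df = df.reset_index(drop=True)
df["case_id"] = df.index
```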
Data Preparation
• Categorical versus Continuous Data
  – Most statistical theory and many graphics tools were developed for continuous data
  – Much, if not most, of the data in databases is categorical
  – The computer science view often converts continuous data to categorical, e.g. salaries categorized as low, medium, high, because categories are better suited to Boolean operations (see the sketch below)
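A one-line illustration of such a conversion; the salary values and cut points here are invented for the example:

```python
import pandas as pd

salaries = pd.Series([23_000, 48_000, 95_000, 150_000, 61_000])
# Continuous -> categorical, as in the low/medium/high example;
# the cut points are illustrative.
levels = pd.cut(salaries, bins=[0, 40_000, 80_000, float("inf")],
                labels=["low", "medium", "high"])
print(levels.tolist())   # ['low', 'medium', 'high', 'high', 'medium']
```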
Data Preparation
• Problem of Missing Values
  – Missing values in massive data sets may or may not be a problem
    • Missing data may be irrelevant to the desired result, e.g. cases with missing demographic data may not help if I am trying to create a selection mechanism for good customers based on demographics
    • Massive data sets, if acquired by instrumentation, may have few missing values anyway
    • Imputation has model assumptions
  – Suggest making a Missing Value Plot
Data Preparation
• Missing Value Plot
  – A plot of variables by cases
  – Missing values colored red
  – A special case of the "color histogram" with binary data
  – The "color histogram" is also known as the "data image"
  – This example is 67 dimensions by 1000 cases
  – This example is also fake (a construction sketch follows)
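Such a plot is easy to build. A minimal sketch with NumPy and Matplotlib, simulating a cases-by-variables array with NaN marking missing entries, in the spirit of the slide's fake 67-by-1000 example:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# Simulated data: 1000 cases by 67 variables, ~5% missing at random.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 67))
X[rng.random(X.shape) < 0.05] = np.nan

# Binary missingness matrix: 1 where a value is missing.
miss = np.isnan(X).astype(int)

# Data image: variables by cases, missing cells in red.
plt.imshow(miss.T, aspect="auto", cmap=ListedColormap(["white", "red"]))
plt.xlabel("case")
plt.ylabel("variable")
plt.title("Missing value plot (data image)")
plt.show()
```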
Data Preparation
• Problem of Outliers
  – Outliers are easy to detect in low dimensions
  – A high-dimensional outlier may not show up in low-dimensional projections
  – MVE (minimum volume ellipsoid) and MCD (minimum covariance determinant) algorithms are exponentially computationally complex
    • Visualization tools may help
  – Fisher Information Matrix and Convex Hull Peeling are more feasible, but still too complex for massive data sets
    • Some angle-based methods are promising
Data Preparation

• Database Sampling
  – Exhaustive search may not be practically feasible because of the size of the database
  – KDD systems must be able to assist in the selection of the appropriate parts of the database to be examined
  – For sampling to work, the data must satisfy certain conditions (not ordered, no systematic biases)
  – Sampling can be a very expensive operation, especially when the sample is taken from data stored in a DBMS; sampling 5% of the database can be more expensive than a sequential full scan of the data (a one-pass sampling sketch follows)
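One standard way to draw a uniform random sample during a single sequential scan, avoiding the random access that makes DBMS sampling costly, is reservoir sampling; a minimal sketch (an assumed technique for this setting, not one named in the lecture):

```python
import random

def reservoir_sample(stream, m, seed=0):
    """Uniform random sample of m items from a stream in one pass."""
    rng = random.Random(seed)
    sample = []
    for t, item in enumerate(stream):
        if t < m:
            sample.append(item)           # fill the reservoir first
        else:
            j = rng.randint(0, t)         # inclusive on both ends
            if j < m:
                sample[j] = item          # keep with probability m/(t+1)
    return sample

print(reservoir_sample(range(1_000_000), 5))
```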
Data Compression
• Often data preparation involves data compression
  – Sampling
  – Quantization
Data Quantization
Thinning vs Binning
• People's first thought about massive data is usually statistical subsampling
• Quantization is engineering's success story
• Binning is the statistician's quantization
Data Quantization
• Images are quantized in 8 to 24 bits, i.e. 256 to 16 million levels
• Signals (audio on CDs) are quantized in 16 bits, i.e. 65,536 levels
• Ask a statistician how many bins to use and the likely response is a few hundred; ask a CS data miner and the likely response is 3
• For a terabyte data set, use 10^6 bins
Data Quantization
• Binning, but at microresolution
• Conventions
  – d = dimension
  – k = # of bins
  – n = sample size
  – Typically k << n
Data Quantization
• Choose the bin representative y_j to be the mean of the observations in the j-th bin, so that E[W | Q = y_j] = y_j
• In other words, E[W | Q] = Q
• Such a quantizer is self-consistent
Data Quantization
• E[W] = E[Q]
• If θ is a linear unbiased estimator, then so is E[θ | Q]
• If h is a convex function, then E[h(Q)] ≤ E[h(W)]
  – In particular, E[Q²] ≤ E[W²] and var(Q) ≤ var(W)
• E[Q(Q - W)] = 0
• cov(W - Q) = cov(W) - cov(Q)
• E[(W - P)²] ≥ E[(W - Q)²], where P is any other quantizer (the identities above are checked numerically in the sketch below)
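These identities are easy to verify on data. A minimal sketch in Python with NumPy, using equal-width bins with the bin mean as representative (the self-consistent choice); the sample size and bin count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=100_000)    # observations W
k = 100                         # number of bins

# Equal-width binning over the observed range.
edges = np.linspace(w.min(), w.max(), k + 1)
idx = np.clip(np.digitize(w, edges) - 1, 0, k - 1)

# Self-consistent quantizer: each bin represented by its mean.
means = np.array([w[idx == j].mean() if (idx == j).any() else 0.0
                  for j in range(k)])
q = means[idx]                  # quantized values Q

print(np.isclose(w.mean(), q.mean()))        # E[W] = E[Q]
print(q.var() <= w.var())                    # var(Q) <= var(W)
print(np.isclose(np.mean(q * (q - w)), 0))   # E[Q(Q - W)] = 0
```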
Data Quantization
Distortion due to Quantization
• Distortion is the error due to quantization
• In simple terms, E[(W - Q)²]
• Distortion is minimized when the quantization regions S_j are most like a (hyper)sphere
Geometry-based Quantization
• Need a space-filling tessellation
• Need congruent tiles
• Need tiles as spherical as possible
Geometry-based Quantization
• In one dimension
  – The only polytope is the straight line segment (also bounded by a one-dimensional sphere)
• In two dimensions
  – The only space-filling polytopes are equilateral triangles, squares, and hexagons
Geometry-based Quantization
• In 3 dimensions
  – Tetrahedron (3-simplex), cube, hexagonal prism, rhombic dodecahedron, truncated octahedron
• In 4 dimensions
  – 4-simplex, hypercube, 24-cell
[Figure: truncated octahedron tessellation]
Geometry-based Quantization
Dimensionless Second Moment for 3-D Polytopes

  Polytope                   Dimensionless second moment
  Tetrahedron*               0.1040042…
  Cube*                      0.0833333…
  Octahedron                 0.0825482…
  Hexagonal Prism*           0.0812227…
  Rhombic Dodecahedron*      0.0787451…
  Truncated Octahedron*      0.0785433…
  Dodecahedron               0.0781285…
  Icosahedron                0.0778185…
  Sphere                     0.0769670

  (* marks the space-filling polytopes listed on the previous slide)
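For reference, the quantity tabulated appears to be the normalized (dimensionless) second moment of a tile S about its centroid, in Conway and Sloane's sense; in three dimensions this definition reproduces the listed sphere value of about 0.0769670:

```latex
G(S) = \frac{1}{d}\,
       \frac{\int_{S} \lVert x - \bar{x} \rVert^{2} \, dx}
            {\left( \int_{S} dx \right)^{1 + 2/d}},
\qquad \bar{x} = \text{the centroid of } S
```

Dividing by volume to the power 1 + 2/d makes G scale-invariant, so smaller values mean rounder tiles and hence lower distortion per unit volume.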
Geometry-based Quantization
[Figures: tetrahedron, cube, octahedron, icosahedron, dodecahedron, truncated octahedron]
Geometry-based Quantization
[Figure: rhombic dodecahedron (http://www.jcrystal.com/steffenweber/POLYHEDRA/p_07.html)]
Geometry-based Quantization
[Figure: hexagonal prism]

[Figure: 24-cell with cuboctahedron envelope]
Geometry-based Quantization
• Using 10^6 bins is computationally and visually feasible.
• Fast binning: for data in the range [a, b] and k bins,
    j = floor[k(x_i - a)/(b - a)]
  gives the index of the bin for x_i in one dimension (with x_i = b assigned to the last bin), as sketched below.
• Computational complexity is 4n + 1 = O(n).
• Memory requirements drop to 3k: the location of each bin, the # of items in each bin, and the representative of each bin; i.e. storage complexity is 3k.
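A minimal sketch of this one-dimensional fast binning in Python with NumPy; the data, range, and bin count are illustrative, and the bin mean serves as the representative:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(10.0, 20.0, size=1_000_000)   # illustrative data
a, b, k = 10.0, 20.0, 1000                    # range [a, b] and k bins

# One subtract/multiply/divide/floor per point: O(n).
j = np.floor(k * (x - a) / (b - a)).astype(int)
j = np.minimum(j, k - 1)                      # x == b falls in the last bin

# Storage is 3k numbers: bin location, count, and representative (mean).
counts = np.bincount(j, minlength=k)
sums = np.bincount(j, weights=x, minlength=k)
centers = a + (np.arange(k) + 0.5) * (b - a) / k
means = np.where(counts > 0, sums / np.maximum(counts, 1), centers)
```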
Geometry-based Quantization
• In two dimensions
  – Each hexagon is indexed by 3 parameters (see the sketch below)
  – Computational complexity is 3 times the 1-D complexity, i.e. 12n + 3 = O(n)
  – Complexity for squares is 2 times the 1-D complexity
  – The ratio is 3/2
  – Storage complexity is still 3k
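One concrete way to realize a three-parameter hexagon index is cube-coordinate rounding on a hexagonal grid; a minimal sketch, assuming pointy-top hexagons of a given size and the standard axial/cube conventions (this construction is an assumption, not necessarily the lecture's):

```python
import math

def hex_bin_index(x, y, size=1.0):
    """Cube coordinates (q, r, s), q + r + s = 0, of the hexagon
    (pointy-top, circumradius `size`) containing the point (x, y)."""
    # Fractional axial coordinates of the point.
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    s = -q - r
    # Cube rounding: round all three, then recompute the coordinate
    # with the largest rounding error so the sum stays zero.
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    if dq > dr and dq > ds:
        rq = -rr - rs
    elif dr > ds:
        rr = -rq - rs
    else:
        rs = -rq - rr
    return rq, rr, rs

print(hex_bin_index(0.2, 0.1))   # -> (0, 0, 0), the hexagon at the origin
```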
Geometry-based Quantization
• In 3 dimensions
  – For the truncated octahedron there are 3 pairs of square sides and 4 pairs of hexagonal sides (see the sketch below)
  – Computational complexity is 28n + 7 = O(n)
  – Computational complexity for a cube is 12n + 3
  – The ratio is 7/3
  – Storage complexity is still 3k
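The truncated octahedron tessellation is the Voronoi decomposition of the body-centered cubic (BCC) lattice, so binning a point into a truncated octahedron reduces to finding its nearest BCC lattice point: round to the nearest integer triple and to the nearest half-integer-offset triple, and keep the closer. A minimal sketch (an assumed implementation, not taken from the lecture):

```python
import numpy as np

def bcc_bin(p):
    """Nearest point of the BCC lattice Z^3 U (Z^3 + 1/2) to p;
    its Voronoi cell is the truncated octahedron."""
    p = np.asarray(p, dtype=float)
    c1 = np.round(p)                 # nearest point of Z^3
    c2 = np.round(p - 0.5) + 0.5     # nearest point of Z^3 + (1/2, 1/2, 1/2)
    d1 = np.sum((p - c1) ** 2)
    d2 = np.sum((p - c2) ** 2)
    return c1 if d1 <= d2 else c2

print(bcc_bin([0.3, 0.4, 0.45]))     # -> [0.5 0.5 0.5]
```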
Quantization Strategies
• Optimally, for purposes of minimizing distortion, use the roundest polytope in d dimensions
  – Complexity is always O(n)
  – Storage complexity is 3k
  – The # of tiles grows exponentially with dimension, the so-called curse of dimensionality
  – Higher-dimensional geometry is poorly known
  – Computational complexity grows faster than for the hypercube
Quantization Strategies
• For purposes of simplicity, always use the hypercube or d-dimensional simplices
  – Computational complexity is always O(n)
  – Methods for data-adaptive tiling are available
  – Storage complexity is 3k
  – The # of tiles grows exponentially with dimension
  – Both polytopes depart from spherical shape rapidly as d increases
  – The hypercube approach is known as the datacube in the computer science literature and is closely related to multivariate histograms in the statistical literature
Quantization Strategies
• Conclusions on Geometric Quantization
  – The geometric approach is good to 4 or 5 dimensions
  – Adaptive tilings may improve the rate at which the # of tiles grows, but probably destroy the spherical structure
  – Good for large n, but weaker for large d
Quantization Strategies
• Alternate Strategy
  – Form bins via clustering (see the sketch after this list)
    • Known in the electrical engineering literature as vector quantization
    • Distance-based clustering is O(n²), which implies poor performance for large n
    • Not terribly dependent on the dimension d
    • Clusters may be very out of round, not even convex
  – Conclusion
    • The cluster approach may work for large d, but fails for large n
    • Not particularly applicable to "massive" data mining
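Vector quantization is commonly implemented with the Lloyd/LBG iteration, which costs O(nk) distance evaluations per pass rather than the O(n²) of pairwise distance-based clustering; a minimal sketch (an illustrative implementation, not the lecture's):

```python
import numpy as np

def lbg_quantize(X, k, iters=20, seed=0):
    """Lloyd/LBG vector quantization: k codewords for the rows of X."""
    rng = np.random.default_rng(seed)
    codebook = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each point to its nearest codeword: O(nk) per pass.
        d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each codeword to the mean of its cell (self-consistency).
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                codebook[j] = members.mean(axis=0)
    return codebook, labels

X = np.random.default_rng(3).normal(size=(2000, 5))
codebook, labels = lbg_quantize(X, k=16)
```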
Quantization Strategies
• Third strategy
  – Density-based clustering
    • Density estimation with kernel estimators is O(n)
    • Uses the modes m_α to form clusters
    • Put x_i in cluster α if it is closest to mode m_α
    • This procedure is distance based, but with complexity O(kn), not O(n²)
    • Normal mixture densities may be an alternative approach
    • Roundness may be a problem
  – But quantization based on density-based clustering offers promise for both large d and large n (see the sketch below)
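A minimal one-dimensional sketch of the idea in Python with NumPy: estimate a kernel density on a grid, take its local maxima as the modes m_α, and assign each x_i to the nearest mode; the data, grid size, bandwidth, and mode threshold are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(-2, 0.5, 2000), rng.normal(3, 1.0, 2000)])

# Gaussian kernel density estimate evaluated on a grid.
grid = np.linspace(x.min(), x.max(), 512)
h = 0.3                                      # bandwidth (illustrative)
dens = np.exp(-0.5 * ((grid[None, :] - x[:, None]) / h) ** 2).mean(axis=0)

# Modes: interior local maxima, ignoring negligible bumps in the tails.
interior = (dens[1:-1] > dens[:-2]) & (dens[1:-1] > dens[2:])
modes = grid[1:-1][interior & (dens[1:-1] > 0.1 * dens.max())]

# Cluster alpha = points closest to mode m_alpha: O(kn) distances.
labels = np.abs(x[:, None] - modes[None, :]).argmin(axis=1)
print(modes, np.bincount(labels))
```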
Data Quantization
• Binning does not lose fine structure in the tails, as sampling might.
• Roundoff analysis applies.
• At this scale of binning, discretization is not likely to be much less accurate than the accuracy of the recorded data.
• Discretization, i.e. a finite number of bins, yields discrete variables that are more compatible with categorical data.
Data Quantization
• Analysis on a finite subset of the integers has theoretical advantages
  – Analysis is less delicate
    • Different forms of convergence are equivalent
  – Analysis is often more natural, since the data are already quantized or categorical
  – Graphical analysis of numerical data is not much changed, since 10^6 pixels is at the limit of the human visual system (HVS)