Partitioning – A Uniform Model for Data Mining Anne Denton, Qin Ding, William Jockheck, Qiang Ding and William Perrizo

Partitioning – A Uniform Model for Data Mining

Anne Denton, Qin Ding, William Jockheck, Qiang Ding

and William Perrizo

Motivation Databases and data warehouses

are currently separate systemsWhy? Standard answer:

Details, details, details … Our answer:

Fundamental issue of representation

Relations Revisited R(A1, A2, …, AN) Set of tuples Any choices at a fundamental level?Yes! Duality between

Element-based representation Space-based representation

Duality

Element-based representation:

Standard representation of tuples with all their attributes

Space-based representation:

The existence (count?) of a tuple is represented in its attribute space

Similar Dualities in Physics Particles can be

represented by their position

More fundamental level:

Particle

Particles can be 1 values in a grid of locations

Field

Space-Based Representation Consider standard tuples as vectors

in the space of attribute domains Represent all possible attribute

combinations as one bit: 1 if data item is present 0 if it isn’t

Allowing counts could be useful for projections (?)

Space-Based Representation as a Partition Partitions are mutually exclusive

and collectively exhaustive sets of elements

The Space-Based Representation partitions attribute space into two sets: Data item present in database (1) Data item not present (0)

Usefulness of Space-Based Representation No indexes needed: instant value-based

access Index locking becomes dimensional

locking Aggregation very easy due to value-

based ordering Selections become “and”sWhat experience do we have with space-

based representations?

Data Cube Representation One value (e.g., sales) given in the

space of the key attributes Space-based with respect to key

attributes Element-based with respect to

non-key attributes

Properties of the Domain Space Ideally space should have

distance, norm, etc. Especially important for data mining

Does that make sense for all domains? Can any domain be mapped to

integer?

Can all Domains be Mapped to Integer? Simplistic answer: yes!

All information in a computer is saved as bits Any sequence of bits can be interpreted as an

integer Problems

Order may be irrelevant, e.g., hair-color Order may be wrong, e.g., sign bit for int Even if order is correct, spacing may vary,

e.g., float (solution in paper: intervalization) Domains may be very large, e.g., movies

Categorical attributes (irrelevant order)

We need more than one attribute for an appropriate representation

Data mining solution: 1 attribute per domain value

Our solution: 1 attribute per bit slice Values are corners of a Hypercube in

log(Domain Size) dimensions Distances are given trough MAX metric

Fundamental Partition(Space-Based Representation) # of dimensions = Number of

attributes # of represented points = product

of all domain sizes Exponential in number of

dimensions! We badly need compression!

How Do We Handle Size? Problem exponential in #of attributes

How can we reduce #of attributes?

Review normalization: We can decompose a relation into a set

of relations each of which contains the entire key and one other attribute

This decomposition is loss less dependency preserving (BCNF relations

only)

Compression for Non-Key AttributesFundamental partition contains one non-zero

data-point in any non-key dimension only Represent number by bit-slicesNote: This works for numerical and categorical

attributesOriginal values can be regained by anding Example 5 (binary 101) is bit 0 & bit 1’ &

bit 2

Concept Hierarchies

Bit sliced representation have significant benefits beyond compression:

Bit slices can be combined into concept hierarchies: Highest level: bit 0 Next level: bit 0 & bit 1 Next level: bit 0 & bit 1 & bit 2

Compression for Key Attributes Database state-independent

compression could lead to information loss (counts > 1)

Database state-dependent compression: Tree structure that eliminates pure

subtrees => P-trees

Other Ideas

Compression is better if attribute values are dense within their domain

We could use extent domain Compression good Problems with insertion

Reorganization of storage Index locking has to be reintroduced …

How Good is Compression so far? If all domains are “dense”, i.e. all values

occur Size can easily be smaller than original

relation If non-key attributes are “sparse”

Not usually a problem: good compression Problems only in extreme cases

E.g., movies as attribute values! If key-attributes are “sparse”

Larger potential for problems, but also large potential for benefit (see data cubes)

Are Key-Attributes Usually Sparse? Many key attributes are dense (“structure”

attributes as keys) Automatically generated IDs are usually

sequential x and y in spatial data mining Time in data streams

Keys in tables that represent relationships tend to be sparse (feature attributes as keys) Student / course offering / grade Data cubes!

What Have We Gained?(Database Aspects) Data simultaneously acts as index No separate index locking

(unless extent domain is used) All information saved as bit

patterns Easy “select” Other database operations discussed

in class

What Have We Gained?(Feature Attribute Keys) Direct mining possible on relations with

feature attributes keys E.g., student / course offering / grade

Rollup can be defined, etc. Clustering, classification, ARM can make

use of proximity inherent in representation Bit-wise representation provides concept

hierarchy for non-key attribute Tree structure provides concept hierarchy

for key attributes

What Have We Gained?(Structure Attribute Keys) For relations with structure attribute

keys mining requires “and”ing produces counts for feature attributes

Bit-wise representation provides concept hierarchy for non-key attribute

Duality: Concept hierarchies in this

representation map exactly to tree structure when the attribute is a key

Mapping Concept HierarchiesBit Slices <-> Tree

P-tree: Take key attributes, e.g. x and y, and bit

interleave them: x = 1 0 0 1 y = 1 1 0 1 1 1 0 1 0 0 1 1

Any two of these digits form a level in the P-tree – or a level in a concept hierarchy

How Could We Use That Duality? Join with other relations and project off

key attributes (Meta P-trees) Can we do that?

We lose uniqueness We can use 1 to represent 1 or more tuples

(equivalent to relational algebra) Or we can introduce counts

Can be useful for data mining Need for non-duplicate eliminating counts exists

also in other applications

How Do Hierarchies Benefit us in Databases? Multi-granularity Locking Subtrees form suitable units for

storage in a block Fast access!

Proportional to # of levels in tree # of bits for bit slices

Summary Space-based representation has

many benefits Value-based access and storage No separate index needed Rollups easy

P-Trees Follow from systematic compression Benefits from concept hierarchies

Documents

Partitioning – A Uniform Model for Data Mining Anne Denton, Qin Ding, William Jockheck, Qiang Ding and William Perrizo