Data Mining and Data Warehousing, many-to-many Relationships, applications William Perrizo Dept of Computer Science North Dakota State Univ


Why Mining Data?

Parkinson’s Law (for data)

Data expands to fill available storage (and then some)

Disk-storage version of Moore’s law

Capacity ∝ 2^(t / 9 months)

Available storage doubles every 9 months!

Another More’s Law: More is Less

The more volume, the less information. (AKA: Shannon’s Canon)

A simple illustration: Which phone book is more helpful?

BOOK-1                 BOOK-2
Name    Number         Name    Number
Smith   234-9816       Smith   234-9816
Jones   231-7237       Smith   231-7237
                       Jones   234-9816
                       Jones   231-7237

Awash with data!

US EROS Data Center archives Earth Observing System (EOS) remotely sensed images (RSI), satellite and aerial photo data (10 petabytes by 2005 ~ 10^16 B).

National Virtual Observatory (aggregated astronomical data) (10 exabytes by 2010 ~ 10^19 B?).

Sensor networks (micro- and nano-sensor networks) (10 zettabytes by 2015 ~ 10^22 B?).

WWW will continue to grow (and other text collections) (10 yottabytes by 2020 ~ 10^25 B?).

Micro-arrays, gene-chips and genome sequence data (10 gazillobytes by 20?0 ~ 10^28 B?).

Useful information must be teased out of these large volumes of data. That’s data mining.

Correct Name?

EOS Data Mining example

TIFF image Yield Map

This dataset is a 320-row by 320-column (102,400 pixels) spatial file with 5 feature attributes (B, G, R, NIR, Y). The (B, G, R, NIR) features are in the TIFF image and the Y (crop yield) feature is color coded in the Yield Map (blue=low; red=high).

What is the relationship between the color intensities and yield? We can hypothesize: hi_green and low_red → hi_yield, which, while not a simple SQL query result, is not surprising. Data Mining is more than just confirming hypotheses.

The stronger rule, hi_NIR and low_red → hi_yield, is not an SQL result and is surprising. Data Mining includes suggesting new hypotheses.
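The strength of a rule such as hi_NIR and low_red → hi_yield is usually judged by its support and confidence. A minimal Python sketch follows, assuming each pixel has already been discretized into a set of labels. The six toy pixels and the helper names `support` and `confidence` are invented for illustration and are not taken from the EOS dataset.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    n_a = sum(1 for t in transactions if a <= t)
    n_both = sum(1 for t in transactions if both <= t)
    return n_both / n_a if n_a else 0.0


# Hypothetical discretized pixels (toy data, not real measurements).
pixels = [
    frozenset({"hi_NIR", "low_red", "hi_yield"}),
    frozenset({"hi_NIR", "low_red", "hi_yield"}),
    frozenset({"hi_green", "low_red", "hi_yield"}),
    frozenset({"hi_green", "low_red"}),
    frozenset({"hi_NIR", "hi_green", "low_red", "hi_yield"}),
    frozenset({"hi_green"}),
]

nir_rule = confidence(pixels, {"hi_NIR", "low_red"}, {"hi_yield"})
green_rule = confidence(pixels, {"hi_green", "low_red"}, {"hi_yield"})
nir_support = support(pixels, {"hi_NIR", "low_red", "hi_yield"})
```

On this toy data the NIR rule has higher confidence than the green rule, which is the sense in which one rule can be called stronger than another.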

Another Precision Agriculture Example: Grasshopper Infestation Prediction

• Grasshoppers cause significant economic loss each year.

• Early infestation prediction is key to damage control.

Association rule mining on remotely sensed imagery holds significant promise for achieving early detection.

Can initial infestation be determined from RGB bands?

Gene Regulation Pathway Discovery

Results of clustering may indicate, for instance, that nine genes are involved in a pathway. High-confidence rule mining on that cluster may discover the relationships among the genes in which the expression of one gene (e.g., Gene2) is regulated by others. Other genes (e.g., Gene4 and Gene7) may not be directly involved in regulating Gene2 and can therefore be excluded (more later).

[Figure: a Clustering step followed by an ARM step on Gene1 through Gene9; after ARM, Gene4 and Gene7 are set apart from the genes around Gene2.]

Sensor Network Data Mining

Micro and even nano sensor blocks are being developed for sensing
• Bio agents
• Chemical agents
• Movements
• Coatings deterioration
• etc.

There will be billions, even trillions, of individual sensors creating mountains of data.

The data must be mined for its meaning.

Other data requiring mining:
• Shopping market basket analysis (Walmart)
• Keywords in text (e.g., WWW)
• Properties of proteins
• Stock market prediction
• Astronomical data

UNIFIED BASES OF ALL THIS DATA??

Data Mining?

Querying asks specific questions and expects specific answers.

Data Mining goes into the MOUNTAIN of DATA,

and returns with information gems (rules?)

But also, some fool’s gold?

Relevance and interestingness analysis serves as an assay (it helps pick out the valuable information gems).

Data Mining versus Querying

There is a whole spectrum of techniques to get information from data:

On the Query Processing end, much work is yet to be done (D. DeWitt, ACM SIGMOD’02).

On the Data Mining end, the surface has barely been scratched.

But even those scratches have had a great impact: they can be the difference between becoming the biggest corporation in the world and filing for bankruptcy.

Standard querying
• SQL: SELECT ... FROM ... WHERE
• Complex queries (nested, EXISTS, ...)

Searching and Aggregating
• Fuzzy queries, search engines, BLAST searches
• OLAP (rollup, drilldown, slice/dice, ...)

Machine Learning
• Supervised Learning: Classification, Regression
• Unsupervised Learning: Clustering

Data Mining
• Association Rule Mining
• Data Prospecting
• Fractals, …

Walmart vs. KMart

Data Mining

Data mining: the core of the knowledge discovery process.

Raw Data
→ Data Cleaning/Integration: missing data, outliers, noise, errors
→ Data Warehouse: cleaned, integrated, read-only, periodic, historical raw database
→ Selection: feature extraction, tuple selection
→ Task-relevant Data
→ Data Mining: OLAP, Classification, Clustering, ARM
→ Pattern Evaluation

Our Approach

Compressed, data-mining-ready data structure, the Peano-tree (Ptree)1
• Processes vertical data horizontally, whereas standard DBMSs process horizontal data vertically
• Facilitates data mining
• Addresses the curses of scalability and dimensionality

Compressed, OLAP-ready data warehouse structure, the Peano Data Cube (PDcube)1
• Facilitates OLAP operations and query processing
• Fast logical operations on Ptrees are used

1 Technology is patent pending by North Dakota State University

A table, R(A1..An), is a horizontal structure (a set of horizontal records) processed vertically (vertical scans).

Ptrees are a vertical structure processed horizontally (ANDs).

Ptrees: vertically partition the table; compress each vertical bit file into a basic Ptree; horizontally process these Ptrees using a multi-operand logical AND.

R(A1 A2 A3 A4): horizontal structure, processed vertically (scans)

A1  A2  A3  A4
010 111 110 001
011 111 110 000
010 110 101 001
010 111 101 111
101 010 001 100
010 010 001 101
111 000 101 100
111 000 001 100

R(A1 A2 A3 A4) --> R[A1] R[A2] R[A3] R[A4], then one vertical bit file per bit position:

R11 R12 R13  R21 R22 R23  R31 R32 R33  R41 R42 R43
 0   1   0    1   1   1    1   1   0    0   0   1
 0   1   1    1   1   1    1   1   0    0   0   0
 0   1   0    1   1   0    1   0   1    0   0   1
 0   1   0    1   1   1    1   0   1    1   1   1
 1   0   1    0   1   0    0   0   1    1   0   0
 0   1   0    0   1   0    0   0   1    1   0   1
 1   1   1    0   0   0    1   0   1    1   0   0
 1   1   1    0   0   0    0   0   1    1   0   0

[Figure: the basic Ptrees P11, P12, P13, P21, P22, P23, P31, P32, P33, P41, P42, P43, one built from each bit file; e.g., P11 is built from R11 = 00001011.]

1-Dimensional Ptrees are built by recording the truth of the predicate “pure 1” recursively on halves, until there is purity. For P11 (bit file 00001011):

1. Whole file is not pure1: 0
2. 1st half is not pure1: 0
3. 2nd half is not pure1: 0
4. 1st half of 2nd half is not pure1: 0
5. 2nd half of 2nd half is pure1: 1
6. 1st half of 1st half of 2nd half is pure1: 1
7. 2nd half of 1st half of 2nd half is not pure1: 0
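The recursive halving used to build P11 can be sketched in a few lines of Python. This is an illustrative reading of the construction, not the patented NDSU implementation: the nested `(bit, children)` tuples and the function names `p1tree` and `levels` are assumptions made for the example, and the input length is assumed to be a power of 2.

```python
def p1tree(bits):
    """Build a 1-D Pure1-tree as nested (bit, children) tuples.

    A node is 1 iff its segment is purely 1-bits; recursion stops
    as soon as a segment is pure (all 1s or all 0s).
    """
    if not bits:
        raise ValueError("bit file must not be empty")
    if set(bits) == {"1"}:
        return (1, [])
    if set(bits) == {"0"}:
        return (0, [])
    half = len(bits) // 2
    return (0, [p1tree(bits[:half]), p1tree(bits[half:])])


def levels(tree):
    """List the node bits level by level (breadth-first)."""
    out, frontier = [], [tree]
    while frontier:
        out.append([bit for bit, _ in frontier])
        frontier = [kid for _, kids in frontier for kid in kids]
    return out


p11 = levels(p1tree("00001011"))
# p11 == [[0], [0, 0], [0, 1], [1, 0]], the same seven bits as steps 1 to 7
```

Read breadth-first, the seven node bits match the seven steps listed for P11. The pure-0 first half has no children, which is where the compression comes from.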

Ptrees

Ptrees are fixed-size-run-length-compressed, lossless, vertical structures representing the data that facilitate fast logical operations on vertical data.

The most useful form of a Ptree is the predicate-Ptree, e.g. (from the previous slide) the Pure1-tree or P1tree (1-bit at a node iff the corresponding half is pure1) or the NonPure0-tree or NP0tree (1 iff the half is not pure0).

So far, Ptrees have all been 1-dimensional (recursively halving the bit file). Ptrees for spatial data are usually 2-dimensional (recursively quartering, in Peano order). Ptrees can also be 3-, 4-, etc. dimensional.

A 2-Dimensional Pure1tree

A 2-D P1tree node is 1 iff that quadrant is purely 1-bits.

A bit-file (from, e.g., a 2-D image):
1111110011111000111111001111111011110000111100001111000001110000

The corresponding raster-ordered spatial matrix:

1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
0 1 1 1 0 0 0 0

The P1tree, level by level (quadrants in Peano order):

Root:     0
Level 1:  1 0 0 0
Level 2:  0 0 1 0   1 1 0 1
Level 3:  1 1 1 0   0 0 1 0   1 1 0 1
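The 2-D case replaces halving with quartering in Peano (Z) order: upper-left, upper-right, lower-left, lower-right. A hedged Python sketch on the 8×8 matrix of this slide follows. The tuple representation and the names `p1tree_2d` and `levels` are assumptions for illustration, and the image side is assumed to be a power of 2.

```python
IMAGE = [
    "11111100",
    "11111000",
    "11111100",
    "11111110",
    "11110000",
    "11110000",
    "11110000",
    "01110000",
]


def p1tree_2d(image, r0=0, c0=0, size=None):
    """Build a 2-D Pure1-tree as nested (bit, children) tuples."""
    if size is None:
        size = len(image)
    cells = [image[r][c] == "1"
             for r in range(r0, r0 + size)
             for c in range(c0, c0 + size)]
    if all(cells):
        return (1, [])          # pure-1 quadrant: stop
    if not any(cells):
        return (0, [])          # pure-0 quadrant: stop
    h = size // 2
    # Peano (Z) order: upper-left, upper-right, lower-left, lower-right.
    offsets = ((0, 0), (0, h), (h, 0), (h, h))
    return (0, [p1tree_2d(image, r0 + dr, c0 + dc, h) for dr, dc in offsets])


def levels(tree):
    """List the node bits level by level (breadth-first)."""
    out, frontier = [], [tree]
    while frontier:
        out.append([bit for bit, _ in frontier])
        frontier = [kid for _, kids in frontier for kid in kids]
    return out


tree_levels = levels(p1tree_2d(IMAGE))
```

The four lists in `tree_levels` are the root bit, then the 4, 8, and 12 node bits of the levels below it.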

A Count Ptree

Counts are the ultimate goal, but P1trees are more compressed and produce the needed counts quite quickly.

Terms: Peano or Z-ordering; pure (pure-1/pure-0) quadrant; root count; level; fan-out; QID (quadrant ID).

The image:

1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1

The count Ptree, level by level (quadrants numbered 0, 1, 2, 3 in Peano order):

Root count:   55
Next level:   16 8 15 16
Next level:   3 0 4 1   4 4 3 4
Leaf level:   1 1 1 0   0 0 1 0   1 1 0 1

Example QID: the pixel at (7, 1) = (111, 001) has QID 10.10.11 = 2.2.3.
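The QID of a pixel and the quadrant counts can both be computed directly from the image. The sketch below is an illustration under two assumptions that the slide does not state explicitly: the pixel (7, 1) is read as (row, column), and each QID digit is 2 × row bit + column bit. The function names are hypothetical.

```python
IMAGE = [
    "11111100",
    "11111000",
    "11111100",
    "11111110",
    "11111111",
    "11111111",
    "11111111",
    "01111111",
]


def qid(row, col, n_levels):
    """Quadrant ID digits of a pixel, from the coarsest level to the finest."""
    digits = []
    for shift in range(n_levels - 1, -1, -1):
        r = (row >> shift) & 1
        c = (col >> shift) & 1
        digits.append(2 * r + c)   # interleave one row bit and one column bit
    return digits


def ones(image, r0, c0, size):
    """Number of 1-bits in the size x size quadrant whose corner is (r0, c0)."""
    return sum(image[r][c] == "1"
               for r in range(r0, r0 + size)
               for c in range(c0, c0 + size))


def child_counts(image, r0, c0, size):
    """Counts of the four sub-quadrants in Peano (Z) order."""
    h = size // 2
    return [ones(image, r0 + dr, c0 + dc, h)
            for dr, dc in ((0, 0), (0, h), (h, 0), (h, h))]


root_count = ones(IMAGE, 0, 0, 8)
level1 = child_counts(IMAGE, 0, 0, 8)
pixel_qid = qid(7, 1, 3)
```

Under these assumptions `pixel_qid` is `[2, 2, 3]`, that is 10.10.11 in binary, and `root_count` and `level1` reproduce the 55 and 16 8 15 16 of the count Ptree.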

NP0tree

NP0tree: node = 1 iff that sub-quadrant is not purely 0s. NP0 and P1 are examples of <predicate>trees: node = 1 iff the sub-quadrant satisfies <predicate>.

1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
0 1 1 1 0 0 0 0

The NP0tree, level by level:

Root:     1
Level 1:  1 1 1 0
Level 2:  1 0 1 1   1 1 1 1
Level 3:  1 1 1 0   0 0 1 0   1 1 0 1

Logical Operations on P-trees

Operations are performed level by level. Consecutive 0s (holes) can be filtered out.

E.g., we only need to load the quadrant with QID 2 for ANDing NP0-tree1 and NP0-tree2.

Ptree dimension (1-D, 2-D, …)

The dimension of the Ptree structure is a user-chosen parameter.

It can be chosen to fit the data:
• Relations in general are 1-D (fanout=2 trees)
• Images are 2-D (fanout=4 trees)
• Solids are 3-D (fanout=8 trees)

Or it can be chosen to optimize compression or increase processing speed.

Ordering of Triangle Mesh

[Figure: The Half Sphere up to 3 Levels. Triangle 1 is subdivided into 1,0; 1,1; 1,2; 1,3, and triangle 1,3 into 1,3,0; 1,3,1; 1,3,2; 1,3,3.]

Traverse the southern hemisphere in the reverse direction (just the identical pattern pushed down instead of pulled up), arriving at the southern neighbor of the start point.

Challenge 1: Many Records

Typical question
• How many records satisfy given conditions on the attributes?

Typical answer in record-oriented database systems
• Database scan: O(N)
• Sorting / indexes? Unsuitable for many problems

P-Trees
• Compressed, vertical, bit-column storage
• Bit-wise AND replaces database scan
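To make "bit-wise AND replaces database scan" concrete, the sketch below counts the records of the small table R(A1 A2 A3 A4) from the earlier slide that satisfy conditions on A1 and A2. It works on uncompressed bit columns held as Python integers, so it illustrates only the vertical AND and not Ptree compression. The helper names `bit_slices` and `select` are assumptions for the example.

```python
# Table R(A1 A2 A3 A4), 3 bits per attribute, one string per record.
R = [
    "010 111 110 001",
    "011 111 110 000",
    "010 110 101 001",
    "010 111 101 111",
    "101 010 001 100",
    "010 010 001 101",
    "111 000 101 100",
    "111 000 001 100",
]


def bit_slices(rows):
    """One integer per bit column; the first record is the most significant bit."""
    flat = [row.replace(" ", "") for row in rows]
    width = len(flat[0])
    return [int("".join(row[j] for row in flat), 2) for j in range(width)]


def select(slices, n_rows, first_bit, pattern):
    """AND slices (or their complements) so a 1 marks each record matching pattern."""
    full = (1 << n_rows) - 1
    acc = full
    for offset, bit in enumerate(pattern):
        s = slices[first_bit + offset]
        acc &= s if bit == "1" else (~s & full)
    return acc


slices = bit_slices(R)
a1_is_010 = select(slices, len(R), 0, "010")   # bit columns R11..R13
a2_is_111 = select(slices, len(R), 3, "111")   # bit columns R21..R23

count_a1 = bin(a1_is_010).count("1")
count_both = bin(a1_is_010 & a2_is_111).count("1")
```

Four records have A1 = 010 and two of them also have A2 = 111. The counts come from a handful of word-wide ANDs rather than from visiting each record.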

1-D Ptrees: Compression Aspect

P-Trees: Ordering Aspect

Compression relies on long sequences of 0 or 1.

Images
• Neighboring pixels are more likely to be similar using Peano-ordering (a space-filling curve)

Other data?
• Peano-ordering can be generalized
• Peano-order sorting of attributes to maximize compression
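One plausible reading of generalized Peano-order sorting is to sort records by a key that interleaves the bits of their attributes, most significant bits first, so that records that agree on the high-order bits of every attribute end up adjacent. The sketch below follows that reading. It is an assumption about the technique, not the method as defined in the thesis, and the names `peano_key` and `peano_sort` are hypothetical.

```python
def peano_key(values, n_bits):
    """Interleave the bits of the attribute values, highest-order bits first."""
    key = 0
    for shift in range(n_bits - 1, -1, -1):
        for v in values:
            key = (key << 1) | ((v >> shift) & 1)
    return key


def peano_sort(records, n_bits):
    """Sort records (tuples of non-negative ints) by their interleaved key."""
    return sorted(records, key=lambda rec: peano_key(rec, n_bits))


key_7_1 = peano_key((7, 1), 3)
ordered = peano_sort([(1, 1), (0, 1), (1, 0), (0, 0)], 1)
```

For two attributes this reduces to the usual Z-order of pixels: `peano_key((7, 1), 3)` is `0b101011`, the same bits as a QID of 10.10.11.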

Peano-Order Sorting

Impact of Peano-Order Sorting

[Figure: Impact of Sorting on Execution Speed. Bar chart of time in seconds (axis 0 to 120) for the data sets adult, spam, mushroom, function, and crop, comparing Unsorted, Simple Sorting, and Generalized Peano Sorting.]

[Figure: Time per test sample in milliseconds (axis 0 to 80) against number of training points (axis 0 to 30000).]

Speed improvement especially for large data sets

Less than O(N) scaling for all algorithms

So Far

Answer to challenge 1: Many records
• P-Tree concept allows scaling better than O(N) for AND (equivalent to database scan)
• Introduced effective generalization to non-spatial data (thesis)

Challenge 2: Many attributes

Focus: Classification
• Curse of dimensionality
• Some algorithms suffer more than others

Curse of Dimensionality

Many standard classification algorithms (e.g., decision trees, rule-based classification)
• For each attribute, 2 halves: relevant and irrelevant
• How often can we divide by 2 before the small size of the “relevant” part makes results insignificant?
• Inverse of doubling the number of rice grains for each square of the chess board

Many domains have hundreds of attributes
• Occurrence of terms in text mining
• Properties of genes

Possible Solution

Additive models
• Each attribute contributes to a sum
• Techniques exist (statistics)
• Computationally intensive

Simplest: Naïve Bayes

$$P(\mathbf{x} \mid C = c_i) = \prod_{k=1}^{M} P\left(x^{(k)} \mid C = c_i\right)$$

• x^{(k)} is the value of the kth attribute
• Considered an additive model: the logarithm of the probability is additive
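Taking logarithms of the naïve Bayes product makes the additive structure explicit. In the slide's notation (M attributes, x^{(k)} the value of the kth attribute, c_i a class label), each attribute contributes one term to a sum:

```latex
\log P(\mathbf{x} \mid C = c_i)
  = \log \prod_{k=1}^{M} P\left(x^{(k)} \mid C = c_i\right)
  = \sum_{k=1}^{M} \log P\left(x^{(k)} \mid C = c_i\right)
```

This is why adding an attribute adds a term instead of halving the data available to each branch, as happens in algorithms that divide the domain space.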

Semi-Naïve Bayes Classifier

Correlated attributes are joined
• Has been done for categorical data (Kononenko ’91, Pazzani ’96)
• Previously: continuous data discretized

New (thesis)
• Kernel-based evaluation of correlation

[Figure: kernel density estimate and distribution function plotted over the data points (vertical axis 0 to 0.1).]

$$\mathrm{Corr}(a,b) = \frac{\sum_{t=1}^{N} \prod_{k=a,b} K^{(k)}\left(x^{(k)}, x_t^{(k)}\right)}{\prod_{k=a,b} \sum_{t=1}^{N} K^{(k)}\left(x^{(k)}, x_t^{(k)}\right)} - 1$$

Results

Error decrease in units of standard deviation for different parameter sets. Improvement for a wide range of correlation thresholds: 0.05 (white) to 1 (blue).

[Figure: Semi-Naive Classifier Compared with P-Tree Naive Bayes. Decrease in error rate (axis -5 to 25) for the data sets spam, crop, adult, sick-euthyroid, mushroom, gene-function, and splice, with series Parameters (a), Parameters (b), and Parameters (c).]

So Far

Answer to challenge 1: More records
• Generalized P-tree structure

Answer to challenge 2: More attributes
• Additive algorithms
• Example: Kernel-based semi-naïve Bayes

Challenge 3: New subject domains
• Data on a graph
• Outlook: Data with time dependence

Standard Approach to Data Mining

Conversion to a relation (table)
• Domain knowledge goes into table creation
• Standard table can be mined with standard tools

Does that solve the problem?
• To some degree, yes
• But we can do better

“Everything should be made as simple as possible, but not simpler” (Albert Einstein)

Claim: Representation as a single relation is not rich enough

Example: Contribution of a graph structure to standard mining problems
• Genomics: protein-protein interactions
• WWW: link structure
• Scientific publications: citations

Scientific American 05/03

Data on a Graph: Old Hat?

Common topics
• Analyze edge structure: Google, biological networks
• Sub-graph matching: chemistry
• Visualization: focus on graph structure

Our work
• Focus on mining node data
• Graph structure provides connectivity

Protein-Protein Interactions

Protein data
• From Munich Information Center for Protein Sequences (also KDD-cup 02)
• Hierarchical attributes: function, localization, pathways
• Gene-related properties

Interactions
• From experiments
• Undirected graph

Questions
• Prediction of a property (KDD-cup 02: AHR*)
• Which properties in neighbors are relevant?
• How should we integrate neighbor knowledge?
• What are interesting patterns?
• Which properties say more about neighboring nodes than about the node itself?

But not:

*AHR: Aryl Hydrocarbon Receptor Signaling Pathway

AHR

Possible Representations

OR-based
• At least one neighbor has the property
• Example: Neighbor essential true

AND-based
• All neighbors have the property
• Example: Neighbor essential false

Path-based (depends on maximum hops)
• One record for each path
• Classification: weighting?
• Association Rule Mining: record base changes

[Figure: example interaction graph with nodes labeled essential, AHR essential, and AHR not essential.]
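The OR-based and AND-based representations each summarize a node's neighbors with one bit per property. A minimal Python sketch on a hypothetical four-protein interaction graph follows. The protein names, edges, and "essential" flags are invented for illustration and are not taken from the MIPS or KDD-cup data.

```python
def adjacency(edges):
    """Undirected adjacency sets built from a list of (u, v) interaction pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def neighbor_features(adj, has_property):
    """For each node, summarize a boolean property over its neighbors.

    'or'  is True iff at least one neighbor has the property.
    'and' is True iff the node has neighbors and all of them have the property.
    """
    features = {}
    for node, neighbors in adj.items():
        flags = [has_property[n] for n in neighbors]
        features[node] = {"or": any(flags), "and": bool(flags) and all(flags)}
    return features


# Hypothetical toy graph: p1-p2, p1-p3, p3-p4.
edges = [("p1", "p2"), ("p1", "p3"), ("p3", "p4")]
essential = {"p1": True, "p2": True, "p3": False, "p4": False}

features = neighbor_features(adjacency(edges), essential)
```

Here p1 has one essential and one non-essential neighbor, so its OR-based "neighbor essential" value is true while its AND-based value is false. Both representations keep one record per node, unlike the path-based one.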

Association Rule Mining

OR-based representation

Conditions
• Association rule involves AHR
• Support across a link greater than within a node
• Conditions on minimum confidence and support

Top 3 with respect to support (results by Christopher Besemann, project CSci 366):
• AHR → essential
• AHR → nucleus (localization)
• AHR → transcription (function)

Classification Results

Problem (especially path-based representation)
• Varying amount of information per record
• Many algorithms unsuitable in principle, e.g., algorithms that divide domain space

KDD-cup 02
• Very simple additive model
• Based on visually identifying relationship
• Number of interacting essential genes adds to probability of predicting protein as AHR

KDD-Cup 02: Honorable Mention, NDSU Team

e0

e1

e2

e3

1 0 1 1

0 1 1 1

1 1 0 1

1 0 1 0

0

1 01

1

0 01

0

1 01

1

0 00

1 1 1 1

1 0 0 1

0 1 0 01 0 1 1

o1

o2

o3

o0

Gene-Experiment-Organism Cube 3-D gene expression cube

Organism Dimension Table

Organism  Species                    Vert  Genome Size (million bp)
human     Homo sapiens               1     3000
fly       Drosophila melanogaster    0     185
yeast     Saccharomyces cerevisiae   0     12.1
mouse     Mus musculus               1     3000

Gene Dimension Table (one column per gene)

SubCell-Location   Myta  Ribo  Nucl  Ribo
Function           apop  meio  mito  apop
StopCodonDensity   .1    .1    .1    .9
PolyA-Tail         1     1     0     0

Experiment Dimension Table (MIAME)

1asa42

1aca42

0hsb22

1hca23

NMHSAD

ED

STZ

CTY

STR

UNV

PI

LAB

g0 g1 g2 g3

e0

e1

e2

e3

17, 78 12, 60 Mi, 40 1, 48

10, 75 0 0 7, 40

0 14, 65 0 0

16, 76 0 9, 45 Pl, 43

Gene-Org Dim Table

chromosome,length

0 1 0 0

0 1 0 1

0 1 1 0

1 0 0 1

1 0 0

0 1 1

0 1 1

0 1

0 10

g0 g1 g2 g3

g0

g1

g2

g3

g0

g1

g2

g3

Protein Interaction Pyramid (2-hop interactions)

Gene Dimension Table (one column per gene)

SubCell-Location   Myta  Ribo  Nucl  Ribo
Function           apop  meio  mito  apop
StopCodonDensity   .1    .1    .1    .9
PolyA-Tail         1     1     0     0

Gene Dimension Table (Binary)

GENE  Poly-A  SCD1 SCD2 SCD3 SCD4  Mito Meio apop  Nucl Ribo Myta
g1    1       1    0    0    0     1    0    1     0    0    1
g2    1       1    0    0    0     0    1    0     0    1    0
g3    0       1    0    0    0     1    0    0     1    0    0
g4    0       1    0    0    1     0    0    1     0    1    0
