Advanced Data Profiling: Introduction
Prof. Dr. Felix Naumann, Thorsten Papenbrock, Tobias Bleifuß, Hazar Harmouch and Lan Jiang
WS2017/18
Image credit: NASA/Ames/JPL-Caltech
■ Data profiling is the process of examining the data available in an existing data source [...] and collecting statistics and information about that data.
■ Wikipedia 09/2013
■ Data profiling refers to the activity of creating small but informative summaries of a database.
■ Ted Johnson, Encyclopedia of Database Systems
■ A fixed set of data profiling tasks / results
Advanced Data Profiling Introduction WS2017/18
Definition Data Profiling
Slide 2
Data Profiling: The meaning of “knowing data”
Name    | Type        | Equatorial diameter | Mass   | Orbital radius | Orbital period | Rotation period | Confirmed moons | Rings | Atmosphere
Mercury | Terrestrial | 0.382               | 0.06   | 0.47           | 0.24           | 58.64           | 0               | no    | minimal
Venus   | Terrestrial | 0.949               | 0.82   | 0.72           | 0.62           | −243.02         | 0               | no    | CO2, N2
Earth   | Terrestrial | 1.000               | 1.00   | 1.00           | 1.00           | 1.00            | 1               | no    | N2, O2, Ar
Mars    | Terrestrial | 0.532               | 0.11   | 1.52           | 1.88           | 1.03            | 2               | no    | CO2, N2, Ar
Jupiter | Giant       | 11.209              | 317.80 | 5.20           | 11.86          | 0.41            | 67              | yes   | H2, He
Saturn  | Giant       | 9.449               | 95.2   | 9.54           | 29.46          | 0.43            | 62              | yes   | H2, He
Uranus  | Giant       | 4.007               | 14.6   | 19.22          | 84.01          | −0.72           | 27              | yes   | H2, He
Neptune | Giant       | 3.883               | 17.2   | 30.06          | 164.8          | 0.67            | 14              | yes   | H2, He
■ format
■ size
■ data types: CHAR(32), CHAR(16), FLOAT, BOOLEAN, VARCHAR, FLOAT, FLOAT, FLOAT, FLOAT, INTEGER
■ range (Equatorial diameter): min = 0.382, max = 11.209
■ aggregation (Confirmed moons): sum = 173, avg = 21.625
■ distribution [figure: value distribution plot]
Slide 3
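The range and aggregation values annotated on this slide can be reproduced with a few lines of code. The following is a minimal sketch, assuming the two columns are available as plain Python lists; the function name `profile_column` and the extra `distinct` count are choices made for this illustration and do not come from the slides.

```python
# Illustrative sketch: basic single-column statistics for the planet table.
# The values are copied from the table above; all names are hypothetical.

equatorial_diameter = [0.382, 0.949, 1.000, 0.532, 11.209, 9.449, 4.007, 3.883]
confirmed_moons = [0, 0, 1, 2, 67, 62, 27, 14]


def profile_column(values):
    """Return simple range and aggregation statistics for one column."""
    return {
        "min": min(values),
        "max": max(values),
        "sum": sum(values),
        "avg": sum(values) / len(values),
        "distinct": len(set(values)),
    }


diameter_profile = profile_column(equatorial_diameter)
moons_profile = profile_column(confirmed_moons)

# Range of the diameter column, aggregation of the moons column.
print(diameter_profile["min"], diameter_profile["max"])
print(moons_profile["sum"], moons_profile["avg"])
```

A real profiler would additionally have to infer data types and handle missing values, which this sketch leaves out.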
Data Profiling: The meaning of “knowing data”
(Planet table as above.)
■ keys, unique
■ relationships: Mass ~ Confirmed moons
■ rules: Equatorial diameter × Mass > 0
■ intra-table dependencies: Atmosphere → Rings
■ inter-table dependencies: Moon.Planet ⊆ Planet.Name
Slide 4
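Two of these multi-column properties can be checked directly on the planet table. The sketch below assumes that "Atmosphere → Rings" is meant as a functional dependency and that Name is the key candidate; the helper names `is_unique` and `holds_fd` are hypothetical.

```python
# Illustrative sketch: checking a key candidate and a functional dependency
# on the planet table. Only the three columns needed here are reproduced.

planets = [
    {"Name": "Mercury", "Rings": "no", "Atmosphere": "minimal"},
    {"Name": "Venus", "Rings": "no", "Atmosphere": "CO2, N2"},
    {"Name": "Earth", "Rings": "no", "Atmosphere": "N2, O2, Ar"},
    {"Name": "Mars", "Rings": "no", "Atmosphere": "CO2, N2, Ar"},
    {"Name": "Jupiter", "Rings": "yes", "Atmosphere": "H2, He"},
    {"Name": "Saturn", "Rings": "yes", "Atmosphere": "H2, He"},
    {"Name": "Uranus", "Rings": "yes", "Atmosphere": "H2, He"},
    {"Name": "Neptune", "Rings": "yes", "Atmosphere": "H2, He"},
]


def is_unique(rows, column):
    """A column is a key candidate if no value occurs twice."""
    values = [row[column] for row in rows]
    return len(values) == len(set(values))


def holds_fd(rows, lhs, rhs):
    """Check lhs -> rhs: equal lhs values must imply equal rhs values."""
    seen = {}
    for row in rows:
        # Remember the first rhs value per lhs value and compare later ones.
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True


name_is_unique = is_unique(planets, "Name")
atmosphere_determines_rings = holds_fd(planets, "Atmosphere", "Rings")
rings_determines_atmosphere = holds_fd(planets, "Rings", "Atmosphere")
```

On this instance the dependency holds only in one direction: every atmosphere value maps to a single Rings value, but "no" rings occurs with several different atmospheres.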
Classification of Traditional Profiling Tasks
Data profiling
    Single column
        Cardinalities
        Patterns and data types
        Value distributions
    Multiple columns
        Uniqueness
            Key discovery
            Conditional
            Partial
        Inclusion dependencies
            Foreign key discovery
            Conditional
            Partial
        Functional dependencies
            Conditional
            Partial
Slide 5
■ Data profiling gathers technical metadata to support data management
■ Data mining and data analytics discover non-obvious results to support business management
■ Data profiling results: information about columns and column sets
■ Data mining results: information about rows or row sets
□ clustering, summarization, association rules, …
■ Rahm and Do, 2000
□ Profiling: Individual attributes
□ Mining: Multiple attributes
Data profiling vs. data mining
Slide 6
■ INDs (typically) involve more than one relation.
■ Let D be a relational schema and let I be an instance of D.
■ R[A1, …, An] denotes the projection of I on the attributes A1, …, An of relation R: R[A1, …, An] = π_{A1, …, An}(R)
■ IND R[A1, …, An] ⊆ S[B1, …, Bn], where R, S are (possibly identical) relations of D.
□ Projections on R and S must have the same number of attributes.
□ An instance I of D satisfies an IND if I(R)[A1, …, An] ⊆ I(S)[B1, …, Bn]
□ An IND R[X] ⊆ S[Y] is maximal if R[XA] ⊆ S[YB] is invalid for any A ∈ R, B ∈ S
□ Values of R: “dependent values”
□ Values of S: “referenced values”
■ Task: Find all maximal, non-trivial INDs
□ Typical assumptions: No repeating attributes, disjoint LHS and RHS
Inclusion Dependencies: Definition
Slide 7
■ Each Title in Showings should appear as a Title in Movies
□ Showings[Title] ⊆ Movies[Title]
■ Also known as “referential integrity”
□ Referenced attributes need not be a key (or unique)
□ Foreign key: helps prune candidates
Example
Slide 8
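The definition translates almost literally into a set-containment test on projections. The following sketch assumes relations are given as lists of dictionaries; the sample rows (titles, theaters, years) and the function names `project` and `satisfies_ind` are invented for illustration.

```python
# Illustrative sketch of the IND definition: R[A1..An] ⊆ S[B1..Bn] holds if
# every projected tuple of R also occurs in the projection of S.
# The rows below are hypothetical sample data.

movies = [
    {"Title": "Alien", "Year": 1979},
    {"Title": "Brazil", "Year": 1985},
    {"Title": "Casablanca", "Year": 1942},
]
showings = [
    {"Theater": "Odeon", "Title": "Alien"},
    {"Theater": "Rex", "Title": "Brazil"},
    {"Theater": "Rex", "Title": "Alien"},
]


def project(rows, attributes):
    """Projection as a set of value tuples (duplicates are removed)."""
    return {tuple(row[a] for a in attributes) for row in rows}


def satisfies_ind(dependent, dep_attrs, referenced, ref_attrs):
    """Check dependent[dep_attrs] ⊆ referenced[ref_attrs]."""
    if len(dep_attrs) != len(ref_attrs):
        raise ValueError("both sides must have the same number of attributes")
    return project(dependent, dep_attrs) <= project(referenced, ref_attrs)


showings_in_movies = satisfies_ind(showings, ["Title"], movies, ["Title"])
movies_in_showings = satisfies_ind(movies, ["Title"], showings, ["Title"])
```

With this data, Showings[Title] ⊆ Movies[Title] holds, while the reverse direction fails because one movie has no showing. This also illustrates that an IND is directional.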
■ Reflexivity: R[X] ⊆ R[X]
■ Projection:
□ R[A1, …, An] ⊆ S[B1, …, Bn] ⇒ R[Ai1, …, Aim] ⊆ S[Bi1, …, Bim] for each sequence i1, …, im of distinct integers in {1, …, n}
■ Transitivity:
□ R[X] ⊆ S[Y] and S[Y] ⊆ T[Z] ⇒ R[X] ⊆ T[Z]
□ Example: “transitive foreign keys” for 1:1 relationships
Inference rules for INDs
Slide 9
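The projection and transitivity rules can be applied mechanically to derive further INDs from known ones. The sketch below represents an IND as a pair of attribute tuples; this representation, the function names, and the attribute `T.K` are assumptions made for this example.

```python
# Illustrative sketch: deriving INDs with the projection and transitivity rules.
# An IND X ⊆ Y is represented as a pair (X, Y) of attribute-name tuples.
from itertools import permutations


def implied_by_projection(lhs, rhs):
    """All INDs obtained by selecting and reordering matching positions."""
    positions = range(len(lhs))
    return [
        (tuple(lhs[i] for i in seq), tuple(rhs[i] for i in seq))
        for m in range(1, len(lhs) + 1)
        for seq in permutations(positions, m)
    ]


def compose(first, second):
    """Transitivity: from X ⊆ Y and Y ⊆ Z derive X ⊆ Z (None if not applicable)."""
    (x, y1), (y2, z) = first, second
    return (x, z) if y1 == y2 else None


# R[B, C] ⊆ S[G, F] implies both unary INDs and the reordered binary IND.
derived = implied_by_projection(("R.B", "R.C"), ("S.G", "S.F"))

# R[C] ⊆ S[F] and S[F] ⊆ T[K] imply R[C] ⊆ T[K]; T.K is a hypothetical attribute.
chained = compose((("R.C",), ("S.F",)), (("S.F",), ("T.K",)))
```

The projection rule is what makes bottom-up discovery possible: an n-ary candidate can only be valid if all of its projections are valid.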
■ Unary INDs
□ INDs on single attributes: R[A] ⊆ S[B]
■ n-ary INDs
□ INDs on multiple attributes: R[X] ⊆ S[Y]
□ |X| = |Y|
■ Partial INDs
□ IND R[A] ⊆ S[B] is satisfied for x% of all tuples in R
□ IND R[A] ⊆ S[B] is satisfied for all but x tuples in R
■ Approximate INDs
□ IND R[A] ⊆ S[B] is satisfied with probability p.
□ Based on sampling or other heuristics
IND types
Slide 10
■ Unary: R[C] ⊆ S[F]
■ N-ary: R[B,C] ⊆ S[G,F]
■ Partial: R[A] ⊆ S[F] holds for 75% of the tuples in R
■ Approximate: R[B,A] ⊆ S[G,H]
Examples
R:
A | B | C
1 | x | 1
2 | x | 1
3 | y | 2
5 | z | 4
S:
F | G | H
1 | x | 1
2 | y | 3
3 | z | 4
4 | z | 4
Slide 11
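The unary, n-ary, and partial examples can be verified by measuring which fraction of the tuples in R finds a match in S. The following sketch encodes the two example relations as lists of dictionaries; the function name `ind_coverage` is an assumption for this illustration.

```python
# Illustrative sketch: exact and partial INDs on the example relations R and S.

R = [
    {"A": 1, "B": "x", "C": 1},
    {"A": 2, "B": "x", "C": 1},
    {"A": 3, "B": "y", "C": 2},
    {"A": 5, "B": "z", "C": 4},
]
S = [
    {"F": 1, "G": "x", "H": 1},
    {"F": 2, "G": "y", "H": 3},
    {"F": 3, "G": "z", "H": 4},
    {"F": 4, "G": "z", "H": 4},
]


def ind_coverage(dependent, dep_attrs, referenced, ref_attrs):
    """Fraction of dependent tuples whose projection occurs on the referenced side."""
    referenced_values = {tuple(row[a] for a in ref_attrs) for row in referenced}
    hits = sum(
        1
        for row in dependent
        if tuple(row[a] for a in dep_attrs) in referenced_values
    )
    return hits / len(dependent)


# A coverage of 1.0 means the IND holds exactly.
unary = ind_coverage(R, ["C"], S, ["F"])
nary = ind_coverage(R, ["B", "C"], S, ["G", "F"])
partial = ind_coverage(R, ["A"], S, ["F"])
```

Here the unary and n-ary INDs are fully satisfied, while R[A] ⊆ S[F] is violated only by the value 5, which gives the 75% stated on the slide.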
■ General insight into data
■ Detect unknown foreign keys
■ Example
□ PDB: Protein Data Bank
□ OpenMMS provides relational schema
– Parses protein and nucleic acid macromolecular structure data from the standard mmCIF format.
□ 175 tables with primary key constraints
□ 2705 attributes
□ But: Not a single foreign key constraint!
Motivation for IND discovery
Slide 12
■ Ensembl – genome database
□ Shipped as MySQL dump files
□ More than 200 tables
□ Not a single foreign key constraint!
■ Web tables: No schema, no constraints, but many connections
■ Why are FKs missing?
□ Lack of support for checking foreign key constraints in the host system
– Example: Oracle did not support FKs up to v6
□ Fear that checking such constraints would impede database performance
□ Lack of database knowledge within the development team
□ Dirty data prevents setting the constraint
Motivation for IND discovery
Slide 13
Unary IND detection complexity
(Example relation: the planet table shown above.)
■ Name ⊆ Type ?
■ Name ⊆ Equatorial_diameter ?
■ Name ⊆ Mass ?
■ Name ⊆ Orbital_radius ?
■ Name ⊆ Orbital_period ?
■ Name ⊆ Rotation_period ?
■ Name ⊆ Confirmed_moons ?
■ Name ⊆ Rings ?
■ Name ⊆ Atmosphere ?
■ Type ⊆ Name ?
■ Type ⊆ Equatorial_diameter ?
■ Type ⊆ Mass ?
■ Type ⊆ Orbital_radius ?
■ Type ⊆ Orbital_period ?
■ Type ⊆ Rotation_period ?
■ Type ⊆ Confirmed_moons ?
■ Type ⊆ Rings ?
■ Type ⊆ Atmosphere ?
■ Mass ⊆ Name ?
■ Mass ⊆ Type ?
■ Mass ⊆ Equatorial_diameter ?
■ …
Complexity: O(n² − n) for n attributes
Example: 10 attributes ~ 90 checks; 1,000 attributes ~ 999,000 checks
Slide 14
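The quadratic number of checks follows from testing every ordered pair of distinct attributes. The sketch below shows such a naive all-pairs test; it is not one of the algorithms discussed in this seminar, and for brevity it uses the six columns of the small example relations R and S rather than the planet table. The function names are hypothetical.

```python
# Illustrative sketch: naive unary IND discovery by testing all ordered pairs
# of distinct attributes, which yields n^2 - n candidate checks.
from itertools import permutations


def candidate_count(n):
    """Number of unary IND candidates among n attributes."""
    return n * n - n


def naive_unary_inds(columns):
    """Return all pairs (dep, ref) with values(dep) ⊆ values(ref)."""
    value_sets = {name: set(values) for name, values in columns.items()}
    return [
        (dep, ref)
        for dep, ref in permutations(value_sets, 2)
        if value_sets[dep] <= value_sets[ref]
    ]


# Columns of the example relations R (A, B, C) and S (F, G, H).
columns = {
    "A": [1, 2, 3, 5],
    "B": ["x", "x", "y", "z"],
    "C": [1, 1, 2, 4],
    "F": [1, 2, 3, 4],
    "G": ["x", "y", "z", "z"],
    "H": [1, 3, 4, 4],
}
found = naive_unary_inds(columns)
```

Each check in this sketch holds both value sets in memory at once, which is one reason why dedicated algorithms are needed for large datasets.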
N-ary IND detection complexity
[Figure: lattice of attribute combinations over A, B, C, D, E, with the sides labeled X and Y]
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE
■ Test each combination with all other combinations of the same size!
■ No n-ary INDs here (region marked in the figure)! Why? X ⊆ Y requires X ∩ Y = ∅
■ IND candidates in level k: $\binom{n}{k} \cdot \binom{n-k}{k} \cdot k!$
□ $\binom{n}{k}$: nodes
□ $\binom{n-k}{k}$: other, non-overlapping nodes
□ $k!$: all permutations
Slide 15
N-ary IND detection complexity
Number of IND candidates by number of attributes n (rows) and level k (columns). Levels k = 8 to 15 contain 0 candidates for all n ≤ 15 and are omitted; "-" marks levels with k > n.
 n | k=1 |   k=2 |    k=3 |      k=4 |      k=5 |       k=6 |       k=7 |     total
 1 |   0 |     - |      - |        - |        - |         - |         - |         0
 2 |   2 |     0 |      - |        - |        - |         - |         - |         2
 3 |   6 |     0 |      0 |        - |        - |         - |         - |         6
 4 |  12 |    12 |      0 |        0 |        - |         - |         - |        24
 5 |  20 |    60 |      0 |        0 |        0 |         - |         - |        80
 6 |  30 |   180 |    120 |        0 |        0 |         0 |         - |       330
 7 |  42 |   420 |    840 |        0 |        0 |         0 |         0 |      1302
 8 |  56 |   840 |   3360 |     1680 |        0 |         0 |         0 |      5936
 9 |  72 |  1512 |  10080 |    15120 |        0 |         0 |         0 |     26784
10 |  90 |  2520 |  25200 |    75600 |    30240 |         0 |         0 |    133650
11 | 110 |  3960 |  55440 |   277200 |   332640 |         0 |         0 |    669350
12 | 132 |  5940 | 110880 |   831600 |  1995840 |    665280 |         0 |   3609672
13 | 156 |  8580 | 205920 |  2162160 |  8648640 |   8648640 |         0 |  19674096
14 | 182 | 12012 | 360360 |  5045040 | 30270240 |  60540480 |  17297280 | 113525594
15 | 210 | 16380 | 600600 | 10810800 | 90810720 | 302702400 | 259459200 | 664400310
Slide 16
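The numbers in this table follow from the candidate formula for level k, namely the number of nodes times the number of other, non-overlapping nodes times the number of permutations. A minimal sketch that recomputes them, with function names chosen for this illustration:

```python
# Illustrative sketch: number of n-ary IND candidates per lattice level,
# computed as C(n, k) * C(n - k, k) * k!.
from math import comb, factorial


def ind_candidates(n, k):
    """Candidates in level k for n attributes."""
    if n - k < k:
        # Fewer than k attributes remain for a disjoint right-hand side.
        return 0
    return comb(n, k) * comb(n - k, k) * factorial(k)


def total_candidates(n):
    """Sum of the candidates over all levels 1..n."""
    return sum(ind_candidates(n, k) for k in range(1, n + 1))


for n in (5, 10, 15):
    print(n, total_candidates(n))
```

The explicit guard makes the reason for the zero entries visible: once k exceeds n/2, no disjoint attribute combination of the same size exists.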
■ Bell&Brockhausen: Siegfried Bell and Peter Brockhausen. “Discovery of Data Dependencies in Relational Databases.” Statistics, Machine Learning and Knowledge Discovery in Databases, ML–Net Familiarization Workshop, 53–58, 1995.
■ Zigzag: Fabien De Marchi and Jean-Marc Petit. “Zigzag: A New Algorithm for Mining Large Inclusion Dependencies in Databases.” In Proceedings of the International Conference on Data Mining (ICDM), 27–34, 2003.
■ FIND2: Andreas Koeller and Elke A. Rundensteiner. “Discovery of High-Dimensional Inclusion Dependencies.” In Proceedings of the International Conference on Data Engineering (ICDE), 683–685, 2003.
Algorithms
Slide 17
■ SPIDER: Jana Bauckmann, Ulf Leser, Felix Naumann, and Veronique Tietz. “Efficiently Detecting Inclusion Dependencies.” In Proceedings of the International Conference on Data Engineering (ICDE), 1448–1450, 2007.
■ deMarchi/MIND: Fabien De Marchi, Stéphane Lopes, and Jean-Marc Petit. “Unary and N-Ary Inclusion Dependency Discovery in Relational Databases.” Journal of Intelligent Information Systems 32 (1): 53–73, 2009.
■ BINDER: Thorsten Papenbrock, Sebastian Kruse, Jorge-Arnulfo Quiané-Ruiz, and Felix Naumann. “Divide & Conquer-Based Inclusion Dependency Discovery.” In Proceedings of the VLDB Endowment, 8:774–785, 2015.
Algorithms
Slide 18
■ S-INDD: Nuhad Shaabani and Christoph Meinel. “Scalable Inclusion Dependency Discovery.” In Proceedings of the International Conference on Database Systems for Advanced Applications (DASFAA), 425–440, 2015.
■ SINDY: Sebastian Kruse, Thorsten Papenbrock, and Felix Naumann. “Scaling Out the Discovery of Inclusion Dependencies.” In Proceedings of the Conference on Database Systems for Business, Technology and Web (BTW), 445–454, 2015.
■ MIND2: Nuhad Shaabani and Christoph Meinel. “Detecting Maximum Inclusion Dependencies without Candidate Generation.” In Proceedings of the International Conference on Database and Expert Systems Applications (DEXA), 118–133, 2016.
Algorithms
Slide 19
Metanome: An extensible architecture
[Figure: Metanome architecture]
■ Algorithm execution
■ Result & resource management
■ Algorithm configuration
■ Result & resource presentation
■ Configuration, resource links, results
■ Algorithms (jar): SPIDER, DUCC, BINDER, DFD, SWAN
■ Data sources: txt, tsv, xml, csv, DB2, MySQL
Slide 20
Organisation
■ Group allocation
■ Study your algorithm(s)
■ Present your algorithms
■ Implement your algorithms
■ Present your implementations
■ 6 participants, 3 teams of 2 students
■ Prepare experiments to evaluate and compare your algorithms
■ Swap algorithms (around Christmas)
■ Run your experiments for all algorithms
■ Improve implementations of algorithms
■ Present improvements
■ Implementation freeze
■ Final paper writing
■ Write algorithm descriptions for paper
■ Describe your results
Slide 21
■ Active participation in meetings and discussions: 10%
■ Initial presentation of your algorithm(s): 15%
■ Implementation of your algorithm(s) using the Metanome interface: 20%
■ Presentation of your implementation: 15%
■ Implementation of improvements to another team’s algorithm(s): 10%
■ Presentation of your improvements: 10%
■ Final paper-style submission: 20%
Grading
Slide 22
■ To apply for this seminar (applications are binding):
□ Send an email to [email protected]
□ Deadline: 24.10.2017 23:59
□ If there are too many applications, we will choose randomly
■ Meeting next week: Data Profiling / Efficient Java Code
■ 30.10.17: Group allocation
Further Procedure
Slide 23