Data quality - The True Big Data Challenge

Data Quality: The True Big Data Challenge

Dr. Stefan Kühn, Lead Data Scientist

data2day 2016 - Karlsruhe

A short motivation

• Some "famous" quotes
  • "Data are becoming the new raw material of business."
  • "The data fabric is the next middleware."
  • "Data matures like wine, applications like fish."
  • "There were 5 Exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days."
  • "Information is the oil of the 21st century, and analytics is the combustion engine."

Data matures like wine?

More like grapes…

• Some "critical" quotes
  • "Big Data is not the new oil."
  • "Data is not information, information is not knowledge, knowledge is not understanding, understanding is not wisdom."
  • "It’s easy to lie with statistics. It’s hard to tell the truth without statistics."
  • "Anything that can be measured can be improved."

Data Quality Fundamentals

Twofold Approach to Data Quality

• Does Data represent the real-world objects / events / concepts it is supposed to?
• Does Data meet the expectations of the Data consumers and the requirements of intended usage?
• Warning: Data is not facts!

Data does not exist independently of its creation.

Data as representation

[Figure: Semiotic Triangle with the corners Idea, Word, and World]

Where is the Data?

[Figure: Semiotic Triangle with the corners Metadata, Data, and World]

Here is the Data

Data and Metadata

• Data implies a context -> Metadata
• Metadata provides explicit knowledge about Data
• Metadata enables a common understanding of Data inside an organization
• Metadata serves as documentation and dictionary, as context for Data Understanding

Metadata is absolutely necessary for the effective use of Data.
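
As a minimal sketch of Metadata acting as documentation and dictionary, the following Python snippet keeps a small data dictionary and looks up the documented meaning of a field. The class `FieldMetadata`, the dictionary `DATA_DICTIONARY`, the field `net_revenue` and its description are hypothetical illustrations, not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldMetadata:
    """Explicit knowledge about one data field (attributes are illustrative)."""
    name: str
    description: str
    unit: str
    source: str


# Hypothetical entry; a real organization would maintain these centrally.
DATA_DICTIONARY = {
    "net_revenue": FieldMetadata(
        name="net_revenue",
        description="Invoiced amount minus returns, excluding VAT",
        unit="EUR",
        source="billing system",
    ),
}


def describe(field_name):
    """Return the documented meaning of a field, or flag it as undocumented."""
    meta = DATA_DICTIONARY.get(field_name)
    if meta is None:
        return f"{field_name}: undocumented"
    return f"{meta.name} [{meta.unit}] from {meta.source}: {meta.description}"
```

A consumer who calls `describe("net_revenue")` gets the context along with the field name, while an unknown field is reported as undocumented instead of being silently interpreted.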

Responsibility for Data

Common Misunderstanding
• Data and Data-related systems are typically managed and hosted by IT, therefore most people (from business and IT) tend to think that Data is part of IT and not of Business
• BUT: Data is not the by-product of Business processes. Data is THE product of Business Processes
• Data Quality Improvement as Business Strategy

Shared Responsibility

Data Creation as Observation

• Data is created under specific Conditions and for specific Purposes
• Creation process involves
  • Observed Object
  • Observer
  • Instrument
• Example - Customer Self-Registration Form
  • Customer Information as Observed Object
  • Customer as Observer
  • Registration Form as Instrument

Instrument is not built / known by Observer.

Data as Product

• Analogy between manufacturing of products and creation / production of data
• Data as core product of a business process
• Transfer quality concepts from Software Development to "Data Development"
  • Testing
  • Staging
  • Versioning
  • Continuous Delivery / Improvement
• Product Management
• Standardization

Data Quality as Manufacturing Quality
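
One way to carry "Versioning" over from software to data, sketched under the assumption that a dataset is a JSON-serializable list of records: derive a version id from the content, so that any change to the data yields a new id. The function name `dataset_version` and the sample records are illustrative only.

```python
import hashlib
import json


def dataset_version(records):
    """Return a short content-based version id for a list of records."""
    # Canonical form: sorted keys and fixed separators, so that dict key
    # order does not influence the id. The order of the records still does.
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


v1 = dataset_version([{"id": 1, "city": "Karlsruhe"}])
v1_reordered = dataset_version([{"city": "Karlsruhe", "id": 1}])
v2 = dataset_version([{"id": 1, "city": "karlsruhe"}])
```

Here `v1` and `v1_reordered` are identical, while `v2` differs because one value changed. A consumer can record the id it was tested against, much like pinning a software release.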

Expectations and Requirements

• Implicit assumptions for usage of Data
• Creation of Data is a business process
• Expectations and requirements have to be explicitly known when defining the process
• Data Quality is Business Process Quality
• Constantly changing expectations and requirements make Data age like grapes…

Make all assumptions explicit.

Data Producers

• People or systems that create Data
• Producers have control over what they create (given the functionality of the instrument)
• Producers don’t have control over possible uses of data
• Most Data is produced for a dedicated purpose but used for several purposes
• Data Quality is fixed at the moment of creation

Data Quality starts with enabling producers to produce high-quality Data -> useable Data

Data Consumers

• People or systems that use Data within its lifecycle
• Multiple systems and people can consume data
• Often, Consumers are Producers at the same time
• Consumers do not control the production of Data but have implicit assumptions and expectations about it

Data Quality Processes are Consumers of Data of an unknown Quality and Producers of Data of a defined Quality
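
The statement above can be sketched as a small quality gate: it consumes records of unknown quality and produces two outputs whose quality is defined by explicit rules. This is only an illustration; the function `quality_gate`, the rule names in `RULES` and the sample records are assumptions, not part of the talk.

```python
def quality_gate(records, rules):
    """Split records into accepted ones and rejected ones with failure reasons."""
    accepted, rejected = [], []
    for record in records:
        failed = [name for name, check in rules.items() if not check(record)]
        if failed:
            rejected.append((record, failed))
        else:
            accepted.append(record)
    return accepted, rejected


# Hypothetical rules; real ones come from the consumers' explicit requirements.
RULES = {
    "email_present": lambda r: "@" in str(r.get("email", "")),
    "age_plausible": lambda r: isinstance(r.get("age"), int) and 0 <= r["age"] <= 120,
}

incoming = [
    {"email": "a@example.org", "age": 34},
    {"email": "", "age": 34},
    {"email": "b@example.org", "age": 240},
]
accepted, rejected = quality_gate(incoming, RULES)
```

Every accepted record is known to satisfy all rules, and every rejected record carries the names of the rules it failed, which supports root cause analysis at the producer.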

Data Quality Problems

Problematic Aspects of Data Management

• Data crosses Organizational Boundaries
• Technical (IT) and non-technical (Business) roles have to communicate
• Shared Responsibility instead of "Ownership"
• No common definitions
• Twelve Barriers to Effective Management of Data and Information Assets (Th. Redman)

Holistic Approach to Data Quality required

Big Data Quality Big Problems

Summary

Big Data

• "Big data is what happened when the cost of storing information became less than the cost of making the decision to throw it away." (George Dyson)

What is Big Data?

• Different Data sources
  • External Data
  • No control over data production
  • No sufficient documentation (Metadata)
  • No quality definitions available
• Incompatible schema
  • Example: Car callbacks
• Even more implicit assumptions
• Big Data implies less information per unit of data
  • Lots of data points are redundant
  • Example: Measure a constant quantity once per day or once per second
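
The constant-quantity example can be made tangible with a rough sketch that uses compressed size as a crude proxy for information content. The reading `21.5`, the text encoding and the helper `sizes` are assumptions chosen for illustration.

```python
import zlib


def sizes(values):
    """Return (raw_bytes, compressed_bytes) for a series of readings."""
    raw = ",".join(f"{v:.1f}" for v in values).encode("ascii")
    return len(raw), len(zlib.compress(raw, 9))


# Hypothetical constant quantity, e.g. a reading that stays at 21.5.
once_per_day = [21.5]
once_per_second = [21.5] * 86_400  # one day of per-second readings

raw_day, packed_day = sizes(once_per_day)
raw_second, packed_second = sizes(once_per_second)
```

The per-second series is about five orders of magnitude larger in raw form, yet it compresses to a small fraction of its raw size, because every additional sample repeats what the first one already said.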

What is Big Data in the Media?

• "new oil"
• "gold"
• "revolution"
• "raw material"
• "the future"
• "bigger, better, faster, more"
• "more data beats better algorithms"
• …

Three major problems

• Redundancy
  • Big Data by Copy/Paste
• Resolution
  • Every problem has an inherent time scale of change
  • Every problem has an inherent level of uncertainty
  • Increasing the resolution beyond these levels only resolves noise
• Noise
  • Adding noisy features decreases the signal-to-noise ratio
  • Adding good but irrelevant features increases complexity and can look like noise
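
The Resolution point can be illustrated with a back-of-the-envelope sketch. It assumes a quantity that drifts linearly and is measured with independent noise of a fixed standard deviation; the drift of 1 unit per day, the noise level of 0.1 and the function name `step_signal_to_noise` are hypothetical.

```python
import math

SECONDS_PER_DAY = 86_400


def step_signal_to_noise(rate_per_day, noise_sd, interval_seconds):
    """Ratio of the true change between two consecutive samples to the
    noise in their difference."""
    true_change = rate_per_day * interval_seconds / SECONDS_PER_DAY
    # The difference of two independent noisy readings has a standard
    # deviation of sqrt(2) * noise_sd.
    noise_in_difference = math.sqrt(2) * noise_sd
    return true_change / noise_in_difference


daily = step_signal_to_noise(1.0, 0.1, SECONDS_PER_DAY)
per_second = step_signal_to_noise(1.0, 0.1, 1)
```

Under these assumptions, the change between two daily samples is several times larger than the noise, whereas the change between two per-second samples is far below it: the extra resolution mostly resolves noise.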

[Figure slides without recoverable content: Redundancy, Resolution (2 slides), Noise (2 slides)]

Example from Kaggle

349 variables - basically rank 1
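
What "basically rank 1" means can be shown on a toy table: when every column is a multiple of one underlying quantity, many variables carry the information of a single one. The sketch below uses plain Gaussian elimination with an absolute tolerance; the data is made up and is not the Kaggle dataset, and `matrix_rank` is an illustrative helper suitable only for small dense matrices.

```python
def matrix_rank(rows, tol=1e-9):
    """Rank of a small dense matrix via Gaussian elimination with pivoting."""
    m = [[float(x) for x in row] for row in rows]
    n_rows = len(m)
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Pick the remaining row with the largest entry in this column.
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) <= tol:
            continue  # column adds nothing beyond the previous pivots
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank


# Toy data: four variables that are all multiples of one base quantity.
base = [1, 2, 3, 4, 5]
weights = [1.0, 2.0, -3.0, 0.5]
redundant_table = [[b * w for w in weights] for b in base]
independent_table = [[1, 0], [0, 1], [1, 1]]

rank_redundant = matrix_rank(redundant_table)
rank_independent = matrix_rank(independent_table)
```

The table with four variables has rank 1, while the small table with two genuinely different columns has rank 2. For real data, a numerical library with a singular value decomposition would be the more robust choice.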

Moore’s Law

Moore’s Law: By Wgsimon - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=15193542

What’s the point?

• Moore’s Law
  • Number of transistors per area doubles every two years
• Real-world Problem sizes
  • Grow at approximately the same speed
• Algorithmic requirements
  • For answering the same questions in the same time, we need algorithms with linear complexity
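
The arithmetic behind this argument can be sketched as follows. The sketch makes the simplifying assumption that hardware speed doubles whenever transistor density doubles, and that problem size doubles over the same period; the function `relative_runtime` and the starting size of one million items are illustrative.

```python
import math


def relative_runtime(cost, n=1_000_000, growth=2.0):
    """Runtime after one period relative to today.

    Assumes problem size and hardware speed both grow by `growth`.
    """
    return cost(growth * n) / (growth * cost(n))


linear = relative_runtime(lambda n: n)
n_log_n = relative_runtime(lambda n: n * math.log2(n))
quadratic = relative_runtime(lambda n: n * n)
```

Under these assumptions the linear algorithm takes exactly as long as before (factor 1.0), the n log n algorithm takes about 5 % longer, and the quadratic algorithm takes twice as long after every period.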

Solutions?

Overall Goals

• Implement Data Quality Standards
• Detect Data Quality Problems
• Manage Data Quality Problems
• Root Cause Analysis of Data Quality Problems
• Measure Costs of "poor" Data Quality
• Measure Value of Data / "high" Data Quality
• Measure Effects of Data Quality Improvements

Typical Approaches

• Force Data Quality (via order)
  • Fill rate: Make certain fields mandatory
  • Range: Prescribe list of valid options
• Buy tool
• Hire expert
• Fire expert
• Collect more bad Data
• Relabel "bad" Data Pool as Data Lake
• …

Summary of the problem

Big Data

• "Big data is what happened when the cost of storing information became less than the cost of making the decision to throw it away." (George Dyson)

Useful Approaches

• Hire expert ;-)
• Shared Responsibility
• Common Understanding of and access to Metadata
  • This does not imply that the terminology has to change
  • Typically, the same term has a different meaning in different departments
  • Bounded contexts! (DDD)
• Invest in creating better Data instead of fixing old and broken Data

Treat Data as Product, not as Fact
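
The bounded-context idea for Metadata can be sketched as a glossary that is keyed by department and term, so that each department keeps its terminology while the differing meanings become explicit. The departments, definitions and function names below are invented for illustration.

```python
# Hypothetical glossary: the same term is defined separately per department.
GLOSSARY = {
    ("sales", "customer"): "Any party that has signed a contract",
    ("support", "customer"): "Any person who has opened a ticket",
}


def define(context, term):
    """Look up a term inside one bounded context, never across contexts."""
    try:
        return GLOSSARY[(context, term)]
    except KeyError:
        raise KeyError(f"'{term}' is not defined in context '{context}'") from None


def contexts_for(term):
    """List all contexts that define the term, to make the ambiguity visible."""
    return sorted(ctx for ctx, t in GLOSSARY if t == term)
```

Asking for a term without a context is not possible by design, and `contexts_for("customer")` shows at a glance that two departments use the same word for different things.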

Thanks a lot!

www.codecentric.de
blog.codecentric.de

stefan.kuehn@codecentric.de
datascience@codecentric.de
