‘Bad Data’ Is Polluting Big Data
Enterprises Struggle with Real-Time Control of Data Flows

A Global Survey of Big Data Professionals

June 2016

‘Bad Data’ Is Polluting Big Data · 2016-10-18



Executive Summary

The big data market is still maturing, especially as it relates to data in motion, as evidenced by the lack of best practices or consistent processes to clean data and manage data quality. For companies that use big data to optimize current business operations or to make strategic decisions, it is critical to ensure that their big data teams have real-time visibility and control over the data at all times.

This report finds that companies leveraging big data are rarely capable of controlling their data flows. Almost 9 out of 10 companies report ‘bad data’ polluting their data stores and, shockingly, nearly 3/4 indicate there is ‘bad data’ in their stores currently. The findings also reveal a chasm between the problem detection capabilities data experts have today and what they desire. This translates into a lack of real-time visibility and control of data flows, operations, quality and security.


Key Findings

• 87% state ‘bad data’ pollutes their data stores, while 74% state ‘bad data’ is currently in their data stores

• Ensuring data quality was the most common challenge, cited by 68% of respondents, and only 34% claimed to be good at detecting divergent data

• 77% responded that they hand code their data flows, while 53% claimed they have to change each pipeline at least several times a month

• Tremendous gaps exist between today’s big data flow management tools’ capabilities and what is needed

• Only 12% of respondents rated their performance as good or excellent across all 5 key data flow operational performance areas

• 72% desire a single pane of glass solution to manage all data flows

• 81% state there is a significant operational impact when they upgrade big data components


METHODOLOGY AND PARTICIPANTS


Goals and Methodology

Research Goal

The primary research goal was to capture how companies manage the flow of big data. The research also investigated and documented current tools’ capabilities, data quality and efforts to maintain big data pipelines and infrastructure.

Methodology

Big data professionals worldwide were invited to participate in a survey on the topic of big data and ensuring data flow operations and data quality. The survey was administered electronically and participants were offered a token compensation for their participation.

Participants

A total of 314 participants who manage big data operations completed the survey.


Companies Represented

Industry

• Technology: 18%
• Financial Services: 18%
• Manufacturing: 12%
• Healthcare: 10%
• Education: 6%
• Services: 6%
• Government: 6%
• Telecommunications: 5%
• Energy and Utilities: 5%
• Transportation: 5%
• Retail: 4%
• Non-Profit: 1%
• Media and Advertising: 1%
• Hospitality and Entertainment: 1%
• Food and Beverage: 1%
• Other: 2%

Size

• 500 - 1,000: 25%
• 1,000 - 5,000: 29%
• 5,000 - 10,000: 16%
• More than 10,000: 30%


Participant Demographics

Role

• IT staff responsible for implementing and operating data infrastructure (e.g. database …: 56%
• IT manager responsible for delivering data initiatives: 52%
• IT executive with data initiatives in my portfolio: 34%
• BI or Analytics Technology Owner (e.g. data architect, head of data platform): 17%
• Business stakeholder who uses data to make decisions: 8%
• Business analyst: 6%

Location

• United States or Canada: 75%
• Europe: 14%
• Mexico, Central America, or South America: 4%
• Australia or New Zealand: 3%
• Middle East or Africa: 2%
• Asia: 2%


DETAILED FINDINGS


Top 3 Challenges for Big Data Flows are Quality, Security and Reliable Operation

What challenges does your company face when managing your big data flows?

• Ensuring the quality of the data (accuracy, completeness, consistency): 68%
• Complying with security and data privacy policies: 60%
• Keeping data flow pipelines operating effectively: 52%
• Building pipelines for getting data into the data store: 47%
• Upgrading big data infrastructure components (Kafka, Hadoop, etc.): 40%
• Adapting pipelines to meet new requirements: 32%
• We have no challenges: 1%


87% State ‘Bad Data’ Pollutes Their Data Stores

Does ‘bad data’ occasionally get into your data stores?

• Yes: 87%
• No: 13%


74% State ‘Bad Data’ is Currently in Their Data Stores

Do you believe there is any ‘bad data’ in your data stores currently?

• Yes: 74%
• No: 26%


77% of Companies Still Use Hand Coding to Build Big Data Flows

How does your company build big data flow pipelines today?

• Coding with Python, Java, etc. or low-level frameworks such as Sqoop, Flume or Kafka: 77%
• Using ETL or data integration tools: 63%
• Using big data ingestion tools such as StreamSets, NiFi, etc.: 27%


53% Change Data Flow Pipelines At Least Several Times a Month

On average, how often are changes or fixes made to a typical data flow pipeline?

• Several times a day: 3%
• Several times a week: 19%
• Several times a month: 31%
• Several times a quarter: 26%
• Several times a year: 12%
• Less often than several times a year: 8%
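The 53% headline is the cumulative share of the three most frequent answers (3% + 19% + 31%). As a minimal sketch of that arithmetic, the snippet below sums the reported shares from the most frequent bucket down to a chosen cutoff. The dictionary name `change_frequency` and the helper `at_least` are hypothetical names introduced here for illustration; the percentages are the ones shown on the chart.

```python
# Reported response shares (percent) for how often changes or fixes are
# made to a typical data flow pipeline, ordered most to least frequent.
change_frequency = {
    "Several times a day": 3,
    "Several times a week": 19,
    "Several times a month": 31,
    "Several times a quarter": 26,
    "Several times a year": 12,
    "Less often than several times a year": 8,
}


def at_least(cutoff, shares):
    """Sum shares from the first (most frequent) bucket through `cutoff`."""
    total = 0
    for label, pct in shares.items():
        total += pct
        if label == cutoff:
            return total
    raise KeyError(cutoff)


monthly_or_more = at_least("Several times a month", change_frequency)
print(monthly_or_more)  # 53
```

The six reported shares add up to 99 rather than 100, which is consistent with each value having been rounded to a whole percent.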


85% State Unexpected Structure and Semantic Changes Have Substantial Impact on Dataflow Operations

When data structure or semantics unexpectedly change, how big is the impact on the operation of your big data flows (failures, slowdowns, data corruption, etc.)?

• Significant impact: 31%
• Moderate impact: 54%
• Minor impact: 11%
• Structure and semantic changes have no effect on our big data flows: 2%
• Data structure and semantic changes never occur: 2%


More Than Half of Companies Lack Real-Time Information About Data Flow Quality

How would you assess your ability to detect each of the following issues in real-time?

• Personally identifiable information (credit card numbers, social security numbers) is being inappropriately placed in a data store: Excellent 18%, Good 33%, Average 30%, Poor 13%, None 6%
• The values of incoming data are diverging from historical norms: Excellent 5%, Good 29%, Average 43%, Poor 20%, None 3%
• Error rates are increasing: Excellent 7%, Good 37%, Average 38%, Poor 16%, None 1%
• Data flow throughput is degrading or latency is growing: Excellent 7%, Good 37%, Average 37%, Poor 17%, None 1%
• A specific data flow pipeline has stopped operating: Excellent 16%, Good 46%, Average 29%, Poor 9%, None 1%


Only 12% Rated Their Performance as ‘Good’ or ‘Excellent’ Across All Five Key Data Flow Metrics

Five Key Data Flow Metrics

1. A specific data flow pipeline has stopped operating
2. Data flow throughput is degrading or latency is growing
3. Error rates are increasing
4. The values of incoming data are diverging from historical norms
5. Identify personal information within the data flows

Number of Key Data Flow Metrics Participants Represented as ‘Good’ or ‘Excellent’

[Chart with one segment each for 0, 1, 2, 3, 4 and all 5 metrics. Extracted segment values, whose pairing with the segments is not recoverable: 19%, 17%, 19%, 20%, 12%, 12%]


Substantial Value In Real-Time Data Flow Detection Capabilities

In your opinion, how valuable would it be to be able to detect each of these issues in real-time?

• Identify personal information within the data flows: Very valuable 40%, Valuable 35%, Average value 18%, Limited value 6%
• The values of incoming data are diverging from historical norms: Very valuable 23%, Valuable 46%, Average value 26%, Limited value 4%
• Error rates are increasing: Very valuable 33%, Valuable 46%, Average value 17%, Limited value 4%
• Data flow throughput is degrading or latency is growing: Very valuable 28%, Valuable 49%, Average value 20%, Limited value 3%
• A specific data flow pipeline has stopped operating: Very valuable 42%, Valuable 42%, Average value 14%, Limited value 3%

(The chart legend also includes ‘Not valuable’, for which no values are labeled.)


Gap Between Current Pipeline Real-Time Visibility Capabilities and Stated Value

A specific data flow pipeline has stopped operating

• Assessed value: Very valuable 42%, Valuable 42%, Average value 14%, Limited value 3% (84% Very valuable or Valuable)
• Real-time ability: Excellent 16%, Good 46%, Average 29%, Poor 9% (62% Excellent or Good)


Chasm Between Today’s Data Flow Throughput Metrics and What is Needed

Data flow throughput is degrading or latency is growing

• Assessed value: Very valuable 28%, Valuable 49%, Average value 20%, Limited value 3%, Not valuable 1% (77% Very valuable or Valuable)
• Real-time ability: Excellent 7%, Good 37%, Average 37%, Poor 17%, None 1% (44% Excellent or Good)


Significant Gap Between Error Rate Visibility Value and Current Capabilities

Error rates are increasing

• Assessed value: Very valuable 33%, Valuable 46%, Average value 17%, Limited value 4% (79% Very valuable or Valuable)
• Real-time ability: Excellent 7%, Good 37%, Average 38%, Poor 16% (44% Excellent or Good)


Chasm Between Value of Detecting Divergent Data and Current Capabilities

The values of incoming data are diverging from historical norms

• Assessed value: Very valuable 23%, Valuable 46%, Average value 26%, Limited value 4%, Not valuable 1% (69% Very valuable or Valuable)
• Real-time ability: Excellent 5%, Good 29%, Average 43%, Poor 20%, None 3% (34% Excellent or Good)
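The survey does not say how respondents detect values that diverge from historical norms. Purely as an illustration of what such a check could look like, the sketch below flags a value that lies more than a chosen number of sample standard deviations from the mean of past values (a z-score test). The function name `diverges`, the threshold of 3.0 and the sample readings are all assumptions made for this example, not anything taken from the report.

```python
from statistics import mean, stdev


def diverges(history, value, threshold=3.0):
    """Return True if `value` is more than `threshold` sample standard
    deviations away from the mean of `history` (a simple z-score test)."""
    if len(history) < 2:
        raise ValueError("need at least two historical values")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All historical values are identical: any different value diverges.
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical historical readings for one field in a data flow.
history = [102.0, 98.5, 101.2, 99.8, 100.4, 97.9, 100.9]

print(diverges(history, 100.7))  # False
print(diverges(history, 140.0))  # True
```

A fixed z-score threshold is only one possible choice; it assumes the historical values are roughly stable and would need adjusting for data with trends or seasonality.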


Large Gap Between Data Privacy Value and Current Capabilities

Identify personal information within the data flows

• Assessed value: Very valuable 40%, Valuable 35%, Average value 18%, Limited value 6%, Not valuable 2% (75% Very valuable or Valuable)
• Real-time ability: Excellent 18%, Good 33%, Average 30%, Poor 13%, None 6% (51% Excellent or Good)
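The gap described across the five per-issue slides can be expressed as the difference between the share rating detection as ‘Very valuable’ or ‘Valuable’ and the share rating their current ability as ‘Excellent’ or ‘Good’. The snippet below is a small sketch of that subtraction using the combined percentages printed on those slides. The shortened issue labels and the names `top_two` and `gaps` are introduced here for illustration only.

```python
# (stated value, current ability): percent answering in the top two
# ratings, as printed on the per-issue slides of the report.
top_two = {
    "Pipeline has stopped operating": (84, 62),
    "Throughput degrading or latency growing": (77, 44),
    "Error rates increasing": (79, 44),
    "Values diverging from historical norms": (69, 34),
    "Personal information in data flows": (75, 51),
}

# Gap in percentage points between stated value and current ability.
gaps = {issue: value - ability for issue, (value, ability) in top_two.items()}

for issue, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{issue}: {gap} points")
```

On these figures the gap ranges from 22 points (pipeline stopped) to 35 points (error rates and divergent values).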


72% Desire A Single Pane of Glass Solution To Manage All Data Flows

How valuable is it to have a single control panel for comprehensive visibility and management across all of your data flows?

• Very valuable: 24%
• Valuable: 48%
• Average value: 24%
• Limited value: 4%


50% State that Data Cleansing at the Source is the Most Effective Quality Practice

Which of the following do you consider to be the most effective approach to ensuring data quality?

• Cleanse data as it flows in from the source: 50%
• Cleanse and update data once it is in the store: 27%
• Data scientists or business analysts cleanse data before using it: 23%


81% State There is Significant Operational Impact to Upgrading Big Data Components

What is the operational impact of upgrading big data components (ingest technologies, message queues, data stores, search stores, etc.)?

• Heavy impact: 17%
• Moderate impact: 64%
• Minor impact: 17%
• No impact: 2%


For more information…

About Dimensional Research

Dimensional Research provides practical marketing research to help technology companies make smarter business decisions. Our researchers are experts in technology and understand how corporate IT organizations operate. Our qualitative research services deliver a clear understanding of customer and market dynamics.

For more information, visit www.dimensionalresearch.com.

About StreamSets

Place holder

For more information, visit www.streamsets.com.


APPENDIX


Tremendous Gaps Exist Between Current Big Data Flow Management Tool Capabilities and What is Needed

Ability to Detect Area in Real-Time Compared Against Stated Value To Detect in Real-Time

Values are listed in the order Excellent/ Very valuable, Good/ Valuable, Average/ Average value, Poor/ Limited value, None/ Not valuable.

• Personally identifiable information (credit card numbers, social security numbers) is being inappropriately placed in a data store: Stated Value 40%, 35%, 18%, 6%, 2%; Current Ability 18%, 33%, 30%, 13%, 6%
• The values of incoming data are diverging from historical norms: Stated Value 23%, 46%, 26%, 4%, 1%; Current Ability 5%, 29%, 43%, 20%, 3%
• Error rates are increasing: Stated Value 33%, 46%, 17%, 4%, 0%; Current Ability 7%, 37%, 38%, 16%, 1%
• Data flow throughput is degrading or latency is growing: Stated Value 28%, 49%, 20%, 3%, 1%; Current Ability 7%, 37%, 37%, 17%, 1%
• A specific data flow pipeline has stopped operating: Stated Value 42%, 42%, 14%, 3% (no ‘Not valuable’ value shown); Current Ability 16%, 46%, 29%, 9%, 1%


Various Approaches To Managing Data Quality Indicate a Lack of Best Practice

Which of the following approaches for ensuring data quality does your company utilize?

• Cleanse and update data once it is in the store: 55%
• Cleanse data as it flows in from the source: 54%
• Data scientists or business analysts cleanse data before using it: 43%


Many Must Perform Maintenance and Troubleshooting on Data Flows Routinely

Approximately, what percentage of data flow changes and fixes are made for day-to-day maintenance and troubleshooting purposes?

• More than 80%: 3%
• 60% - 80%: 10%
• 40% - 60%: 24%
• 20% - 40%: 27%
• Less than 20%: 36%