A White Paper
Breaking Big: When Big Data Goes Bad
The Importance of Data Quality Management in Big Data Environments
WebFOCUS iWay Software Omni
Table of Contents

Big Data, Big Problems
  Volume and Data Quality
  Variety and Data Quality
  Velocity and Data Quality
  Veracity and Data Quality
Addressing Big Data Quality Problems
  Data Quality for Big Data Is Different
  Big Data Can Be Unstructured and Have No Schema
  Use Real-Time Capabilities
  Take Advantage of Big Data Computation for Cleansing
Best Practices in Big Data Quality Management
  Don't Hoard Big Data Just Because You Can
  Establish Baseline Metrics With Data Profiling and Cleansing
  Consider Master Data Management
Big Data Governance
Breaking Bad Data With Information Builders Solutions
  Assess Your Data Using the Data Profiler Dashboard
  A Comprehensive Data Quality Management Solution
The big data boom has largely been fueled by a simple calculation:
Big Data + Analytics = Actionable Insights
The reality, of course, is far different. While big data technology has improved the ability to store large volumes of disparate information in real time, getting the analytics right is not as straightforward. Sometimes, big data can go bad.
A classic example of bad analytics in action comes from Harvard University Professor Gary King, as told in a recent CSO article.1 A big data project designed to predict the U.S. unemployment rate was using Twitter feeds and social media posts to monitor key words like "unemployment," "jobs," and "classifieds." Leveraging these key words with sentiment analysis, the group collected tweets and other social media content to see if there was a correlation between an increase or decrease in the use of these words and the monthly unemployment rate.
A correlation did emerge, but the researchers also noticed a significant spike in the number of tweets containing one of those key words. As Professor King discovered, this spike had nothing to do with unemployment. Tweets containing the word "jobs" increased for a completely different reason. "What they had not noticed was Steve Jobs had passed," he said. Beyond the tragedy of Jobs' untimely passing, this kind of story demonstrates the challenge of relying on big data to guide decisions.
A recent post in The New York Times, based on IDC data, predicts that a whopping 40 trillion gigabytes of data will be produced globally by 2020.2 To remain competitive, businesses need to learn how to exploit this information and use it strategically.
1 Armerding, Taylor. "Big Data Without Good Analytics Can Lead to Bad Decisions," CSO, August 2013.
2 "Big Data Will Get Bigger," The New York Times, June 2013.
But with all the hype around big data analytics, not enough attention is being given to data quality. Accurate, complete, and consistent information must form the foundation of the models built on that data. The reality is, algorithms are only as good as the data with which their modelers work. When the data is suspect, the negative impact will spread far and wide.
In a recent survey by Oracle, business executives were asked to grade themselves on their ability to cope with and take advantage of the data deluge within their industry.3 Approximately 40 percent of respondents in healthcare gave themselves a D or an F, while large numbers of respondents in utilities (39 percent), airlines (31 percent), retail (30 percent), and financial services (25 percent) also gave themselves failing grades.
More organizations are collecting big data: that combination of high-volume, high-velocity, high-variety, structured and unstructured information. As data flows in from mobile devices, social networks, and other new and emerging sources, the ability to truly leverage it to boost business performance will require a different calculation:
Big Data + Data Quality + Analytics = Actionable Insights
In this paper, we'll discuss the differences between big data and conventional data, and how those differences put big data environments at greater risk of data quality problems. We'll also highlight the importance of implementing effective and broad-reaching data quality management in big data scenarios, and share solutions and best practices for doing so.
3 Evans, Bob. "The Deadly Cost of Ignoring Big Data: $71.2 Million per Year," The Innovation Advantage, July 2012.
Big Data, Big Problems

Big data, and its benefits, are often defined by three primary characteristics: variety, velocity, and volume. More than a decade after the introduction of these concepts, many businesses still struggle to take advantage of them. This is because there is a fourth V, veracity, which must serve as the foundation of the other three. Veracity is the hardest to achieve.
Volume and Data Quality
Cleansing and scrubbing data sets that reach petabytes in size, to the same degree as smaller data sets, is an unrealistic and unattainable goal. Consider also that many data quality issues in smaller, structured data sets are man-made, while most information in big data scenarios is machine-generated, such as log files, GPS data, or click-through data.
It would seem that because big data is not subject to human errors or mistakes, it is therefore cleaner. But the reality is that large volumes of big data are being aggregated across many industries, and this has serious data quality ramifications.
In retail, for example, more companies have begun to aggregate large amounts of big data related to consumer shopping preferences: what types of products consumers buy, how much they are looking to spend, which sales channels they use, and so on. In healthcare, organizations use aggregated big data to improve care and increase the likelihood of finding cures for deadly diseases through research, pharmacological enhancements, and wellness programs. Government agencies use aggregated big data for law enforcement purposes, and financial services firms use it to spot market trends and identify key indicators for growth among investment opportunities.
Variety and Data Quality
Much of today's big data is acquired through data harvesting. This involves such activities as collecting competitive pricing information from multiple travel or shopping websites, gathering customer sentiment from blogs or social media sites, or grabbing part descriptions from manufacturer or vendor web pages.
But no true analytics can be performed until the acquired unstructured and/or semi-structured data has been identified, parsed, and cleansed.
For example, when applying sentiment analysis to product reviews, carefully tuned language processing is needed to ensure accuracy. Sentiment is not defined solely by the definition of words and phrases, but also by the context in which those words and phrases are used. The word "sick" might have negative connotations for a restaurant chain ("The food made me sick!"), but may indicate positive feelings for a concert, opera, or other performance ("That soprano's voice was sick!"). Some flavor of data quality filtering must be applied to these raw feeds in order to turn the variety of big data into actionable intelligence.
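The effect of context on polarity can be sketched with a toy, rule-based scorer. The lexicons, scores, and domain names below are entirely hypothetical and stand in for the carefully tuned language processing a real solution would use:

```python
# Illustrative sketch only: a base polarity lexicon with per-domain
# overrides, showing how the same word can flip sentiment by context.
# All words, scores, and domains here are invented for this example.

BASE_LEXICON = {"great": 1, "terrible": -1, "sick": -1}

# Hypothetical domain overrides: "sick" reads as slang praise in
# entertainment reviews, but stays negative for restaurants.
DOMAIN_OVERRIDES = {
    "restaurant": {},
    "concert": {"sick": 1},
}

def sentiment(text: str, domain: str) -> int:
    """Sum word polarities, applying any domain-specific overrides."""
    lexicon = {**BASE_LEXICON, **DOMAIN_OVERRIDES.get(domain, {})}
    words = [w.strip("!.,?'\"").lower() for w in text.split()]
    return sum(lexicon.get(w, 0) for w in words)

print(sentiment("The food made me sick!", "restaurant"))  # negative
print(sentiment("That soprano's voice was sick!", "concert"))  # positive
```

A production system would, of course, learn these context effects from labeled data rather than hand-code them, but the principle is the same: polarity is a function of both the word and its domain.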
Velocity and Data Quality
Businesses need to harness data velocity to stay competitive.
In the retail sector, for example, companies must respond in seconds if they hope to connect with customers who visit their sites. They must immediately use all available customer data (past purchase history, social networking activity, recent customer support interactions, and so on) to generate customized and compelling messages for their consumers on the fly. But for this scenario to work, the data must be accurate. Amazon's recommendation engine is so powerful because it makes relevant suggestions at the precise moment a customer is ready to purchase a product. But that requires complete, accurate data.
Veracity and Data Quality
Veracity is the most important V; it serves as the foundation of data quality in big data scenarios. Veracity refers to the data being stored and its relevance to the problem being analyzed. Big data veracity is based on:
■ Data validity – the accuracy of the data for its intended use. Clearly, valid data is critical to effective decision-making
■ Data volatility – how long is the data valid, and how long should it be stored? In a real-time data world, organizations must determine when data is no longer relevant to an analysis initiative
The former Three Vs of Big Data must now make room for the Fourth V: Veracity.
Addressing Big Data Quality Problems

One of the key goals of big data management is to ensure a high level of quality and accessibility, so trusted and timely data can be leveraged strategically through business intelligence (BI) and big data analytics applications and tools. Everything and everyone who uses big data is directly impacted by its quality. Consequently, the value of big data lies not only in its ability to be harnessed by analytics, but also in its trustworthiness.
There are countless data quality solutions on the market today but, unfortunately, few of them are equipped to promote information integrity in big data environments. To truly maintain optimum data quality across all big data sources, organizations must consider the following:
Data Quality for Big Data Is Different
Many data quality management tools can only process relational data. Big data is distinctly different from traditional relational data in terms of data types, data access, and data queries. Data quality tools built to natively support big data understand these differences and are optimized to cleanse the information accordingly. To ensure information integrity in big data scenarios, the tools in use must also be able to access, correlate, cleanse, standardize, and enrich sources like IBM Netezza, Oracle Exadata, SAP HANA, Teradata and Teradata's Aster Data, EMC Greenplum, HP Vertica, 1010data, ParAccel, and Kognitio, as well as MapReduce-based platforms such as Hadoop, MongoDB, Cloudera, and MapR.
Big Data Can Be Unstructured and Have No Schema
Big data environments are highly diverse. In addition to structured data from enterprise resource planning (ERP), customer relationship management (CRM), legacy, and other systems, they can also include blog posts, social media streams, cloud-based information, or sensor data, such as RFID or UPC scans and utility gauge readings. In many cases, data without a schema is present. Because of this lack of structure and schema, a number of distinct steps are required to identify and validate the data elements. First, the data quality management tool must be able to parse that data, which automatically breaks down large, unstructured fields into identifiable elements.
For instance, an address written in a single line may have its constituent parts identified, such as house number, street name, apartment, city, state, zip code, and country. From there, relevant data will be standardized, validated, and enriched where necessary. Errors in the address can be corrected to conform to the appropriate country's address standards, while missing information, such as a zip code, is filled in.
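As a rough illustration of the parse-then-standardize step, the sketch below breaks a single-line U.S.-style address into its parts. Real address standardization depends on postal reference data (for example, national address files); the patterns, abbreviation map, and sample address here are invented for the example:

```python
import re

# Illustrative sketch: parse "house street, city, ST zip" into components,
# then standardize each part. The regexes and abbreviation map below are
# hypothetical; production tools use official postal reference data.
STREET_ABBREVIATIONS = {"st": "Street", "ave": "Avenue", "rd": "Road", "ter": "Terrace"}

def parse_and_standardize(raw: str) -> dict:
    """Break a one-line address into parts and standardize them."""
    pieces = [p.strip() for p in raw.split(",")]
    if len(pieces) != 3:
        return {}
    house_street, city, state_zip = pieces
    m = re.match(r"(\d+)\s+(.+)", house_street)
    m2 = re.match(r"([A-Za-z]{2})\s+(\d{5})", state_zip)
    if not (m and m2):
        return {}
    street_words = [
        STREET_ABBREVIATIONS.get(w.lower().rstrip("."), w.title())
        for w in m.group(2).split()
    ]
    return {
        "house": m.group(1),
        "street": " ".join(street_words),
        "city": city.title(),
        "state": m2.group(1).upper(),
        "zip": m2.group(2),
    }

print(parse_and_standardize("742 evergreen ter., springfield, il 62704"))
```

Once the parts are isolated like this, each one can be validated and enriched independently, which is exactly why parsing must come first.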
Text analytics capabilities can also be very useful when the data schema is unknown. They provide the ability to determine context, such as sentiment, when examining free-form text. This ability across large swaths of data provides insight into how customers feel about a company or its products.
When unstructured Big Data is present, it must go through several unique steps to identify and validate its elements.
Use Real-Time Capabilities
A modern data quality platform (for example, one with 64-bit processing and a large memory pool) can ensure maximum speed when processing very high volumes of varied data. A state-of-the-art data quality management solution will also be optimized for inline, real-time transactions, whereas older data quality technologies can only process data in batch, with a predisposition toward relational data alone.
Take Advantage of Big Data Computation for Cleansing
Big data platforms such as Hadoop offer many benefits, particularly when it comes to processing efficiency. Big data is not just about the storage of data; it is also about the vast distributed power of multiple computing units working in concert to solve large problems. Effective quality management for big data must leverage these advantages. Running data quality procedures as distributed instances across a big data platform yields a significant gain in speed, which is essential when such large quantities of data are being cleansed.
[Figure: unstructured text is parsed into components, which are then standardized, validated, and enriched; sentiment analysis and natural language processing are applied to free-form text.]
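The idea of running one cleansing routine as distributed instances can be sketched in miniature. The toy below uses a local thread pool to mimic what a platform like Hadoop does across many nodes; the records and the cleansing rule are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch: apply the same cleansing function to each data
# partition in parallel, the way a distributed platform spreads cleansing
# across nodes. Records and the rule here are invented for the example.

def cleanse(record: str) -> str:
    """A toy cleansing rule: trim whitespace and normalize case."""
    return record.strip().lower()

def cleanse_partition(partition: list) -> list:
    return [cleanse(r) for r in partition]

records = ["  Alice@Example.COM ", "BOB@example.com", " carol@EXAMPLE.com  "]
partitions = [records[i::2] for i in range(2)]  # split into 2 partitions

with ThreadPoolExecutor(max_workers=2) as pool:
    cleansed = [r for part in pool.map(cleanse_partition, partitions) for r in part]

print(cleansed)
```

On a real cluster the partitions live on different machines and the "map" step ships the cleansing code to the data, which is where the speed gain at petabyte scale comes from.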
Best Practices in Big Data Quality Management

To successfully manage quality in big data environments, organizations need more than just a solution. They need a solid strategy. For example, they must consider where the data is coming from, how it will be used, who will be consuming it, and what decisions it will support before deciding on a plan of action.
Some best practices for ensuring the integrity of big data include:
Don't Hoard Big Data Just Because You Can
It isn't practical to attempt to manage quality across an entire big data store. Users will consume only a fraction of the data collected, so running quality checks and processes on all of it will just waste a tremendous amount of time and resources.
Predictive analytics can help here, allowing organizations to determine which data sets are most likely to be used, and therefore should be targeted. Smart extract, transform, and load (ETL) processes can then be used to collect that data (for example, customers most likely to buy a certain product) and move it to a repository for cleansing and analysis.
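This "cleanse only what will be used" pattern can be sketched as follows. The customer records, the propensity score, and the cleansing rule are all hypothetical stand-ins for a real predictive model and ETL flow:

```python
# Illustrative sketch: score records with a stand-in propensity model and
# cleanse only the high-scoring slice, instead of the whole data store.
# All data, thresholds, and rules here are invented for the example.

customers = [
    {"name": " ANA ", "recent_purchases": 9},
    {"name": "bo ",   "recent_purchases": 0},
    {"name": " Cy",   "recent_purchases": 5},
]

def propensity(customer: dict) -> float:
    """Stand-in for a real model's likelihood-to-buy score (0.0 to 1.0)."""
    return min(customer["recent_purchases"] / 10, 1.0)

def cleanse(customer: dict) -> dict:
    """Toy cleansing rule: trim and normalize the name field."""
    return {**customer, "name": customer["name"].strip().title()}

# "Smart ETL": extract only high-propensity records, cleanse, then load.
targeted = [cleanse(c) for c in customers if propensity(c) >= 0.5]
print(targeted)
```

The low-propensity record is never cleansed at all, which is the whole point: quality effort is spent only where analytics will actually consume the data.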
Establish Baseline Metrics With Data Profiling and Cleansing
Data profiling tools should be employed to understand the current state of the data, as well as where quality problems exist and what types they are. Data quality management can then be applied to dynamically cleanse any inaccurate or invalid information uncovered during the profiling process.
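The kind of baseline metrics a profiler produces can be sketched for a single column. The sample values and the pattern-generalization rule below are hypothetical, purely to show the shape of the output:

```python
from collections import Counter

# Illustrative profiling sketch: compute baseline metrics for one column,
# i.e. null rate, distinct count, and value-pattern frequencies. The
# sample data and pattern rule are invented for the example.

def pattern(value: str) -> str:
    """Generalize a value: digits become 9, letters become A."""
    return "".join("9" if c.isdigit() else "A" if c.isalpha() else c for c in value)

def profile(column: list) -> dict:
    non_null = [v for v in column if v not in (None, "")]
    return {
        "null_rate": 1 - len(non_null) / len(column),
        "distinct": len(set(non_null)),
        "patterns": Counter(pattern(v) for v in non_null),
    }

zips = ["62704", "6270", "", "62704", "ABCDE"]
print(profile(zips))
```

A profile like this immediately surfaces the suspect values (the four-digit zip and the alphabetic one) as minority patterns, giving cleansing rules a concrete target and a baseline to measure improvement against.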
Another important component of effective big data cleansing is proactive, real-time detection of bad information. Once invalid or corrupt information has entered a big data environment, it may be too late to prevent widespread damage. A data quality management solu...