
IMPACT OF BIG DATA IN TESTING

Mounika Dandem (Associate Consultant)Kavusalya Komirishetty (Associate Consultant)

Capgemini India Pvt Ltd


ABSTRACT

With the fast advance of big data technology and analytics solutions, building high-quality big data computing

services in different application domains is becoming a very hot research and application topic among academic

and industry communities, and government agencies. Big data based applications such as recommendation,

prediction, and decision systems are therefore now in wide use. Big Data refers not just to the explosive

growth in data that almost all organizations are experiencing, but also the emergence of data technologies that

allow that data to be leveraged. Big Data is a term used to describe the ability of any company, in any industry, to

find advantage in the ever-increasing amount of data that now flows continuously into the enterprise, as

well as the semi-structured and unstructured data that was previously either ignored or too costly to deal with. The

big impact of Big Data in the post-modern world is unquestionable, un-ignorable and unstoppable today. While

there is some debate around whether Big Data is really that big, or here to stay, the facts shared in the

following sections of this whitepaper validate one thing: there is no knowing the limits and dimensions

that data in the digital world can assume. This paper examines how to derive real value from big data by

looking at the data lifecycle, its dimensions and challenges, best practices, and application benefits.


1. INTRODUCTION

Growing data volumes and interconnected systems have created a need for the next generation of

analytics and data management solutions. Hadoop, NoSQL and the related ecosystem provide the

framework enabling your company to analyze and manage growing volumes of structured and

unstructured data.

Organizations are realizing the importance of utilizing the intelligence inherently present in already

available data as well as correlating various data sources to glean entirely new actionable insights. This

enables organizations to make better decisions faster; this paper will help readers understand the best

possible ways of testing Big Data applications.

2. WHAT IS BIG DATA TESTING?

Big data testing is, in essence, the process of verifying data and processing integrity, so that

organizations can verify their big data. Big data presents big computing challenges, thanks to massive

dataset sizes and a wide variety of formats. Investing in big data analytics creates business intelligence –

if organizations can trust that intelligence. Hence the importance of big data testing.

The level of difficulty varies widely between testing structured or unstructured big data. Much big data

testing is based on the ETL process: Extract, Transform, and Load. The Extraction phase extracts a set of

test data from structured data applications, usually relational database management systems (RDBMS).

The transformation phase can be extensive depending on the ETL goal, and includes data verification and

process verification for testing purposes. Once the data is successfully transformed, testers can either

move it into a data warehouse or delete the test data.
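
For illustration, the following minimal sketch (a Python example with made-up table names, using the standard sqlite3 module as a stand-in for the source and target systems) shows the kind of record-count and checksum comparison that such ETL verification relies on:

import hashlib
import sqlite3

def table_fingerprint(conn, table):
    """Return (row_count, checksum) for a table, ordering rows so the
    checksum does not depend on physical storage order."""
    rows = conn.execute("SELECT * FROM {} ORDER BY 1".format(table)).fetchall()
    digest = hashlib.md5(repr(rows).encode("utf-8")).hexdigest()
    return len(rows), digest

# In-memory SQLite stands in for the source RDBMS and the transformed target.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src_policies (policy_id INTEGER, premium REAL)")
conn.execute("CREATE TABLE tgt_policies (policy_id INTEGER, premium REAL)")
sample = [(1, 100.0), (2, 250.5), (3, 99.9)]
conn.executemany("INSERT INTO src_policies VALUES (?, ?)", sample)
conn.executemany("INSERT INTO tgt_policies VALUES (?, ?)", sample)

src_count, src_sum = table_fingerprint(conn, "src_policies")
tgt_count, tgt_sum = table_fingerprint(conn, "tgt_policies")

assert src_count == tgt_count, "Row counts differ between source and target"
assert src_sum == tgt_sum, "Checksums differ: data was altered or lost in transit"
print("ETL validation passed:", src_count, "rows,", src_sum)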

The big data testing process for unstructured data is a considerably bigger challenge. To see the

differences more closely, let’s look at the divide between traditional database testing and unstructured

application testing in Hadoop:


Fig 1: Difference between structured and unstructured data testing

3. CHARACTERISTICS OF BIG DATA

Big Data is often characterized as involving the so-called “three Vs”: Volume, Velocity and Variety.


Fig 2: Three V’s of Big Data

3.1 VOLUME: THE QUANTITY OF DATA

With the rise of the Web, then mobile computing, the volume of data generated daily around the world

has exploded. For example, organizations such as Facebook generate terabytes of data daily that must be

stored and managed.


As the number of communications and sensing devices being deployed annually accelerates to create the

encompassing “Internet of Things,” the volumes of data continue to rise exponentially. By recording the

raw, detailed data streaming from these devices, organizations are beginning to develop high-resolution

models based on all available data rather than just a sample. Important details that would otherwise have

been washed out by the sampling process can now be identified and exploited. The bottleneck in modern

systems is increasingly the limited speed at which data can be read from a disk drive. Sequential

processing simply takes too long when such data volumes are involved. New database technologies that

are resistant to failure and enable massive parallelism are a necessity.
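
As a toy illustration of that parallelism (pure Python, with small in-memory lists standing in for data blocks spread across a cluster), the sketch below scans several partitions concurrently instead of sequentially:

from multiprocessing import Pool

def count_errors(partition):
    """Scan one partition (a list of log lines) and count error records."""
    return sum(1 for line in partition if "ERROR" in line)

if __name__ == "__main__":
    # Synthetic partitions stand in for data blocks spread across cluster nodes.
    partitions = [
        ["INFO start", "ERROR disk failure", "INFO done"],
        ["ERROR timeout", "ERROR retry", "INFO ok"],
        ["INFO heartbeat", "INFO heartbeat"],
    ]
    with Pool() as pool:
        # Each partition is scanned by a separate worker in parallel,
        # instead of one process reading everything sequentially.
        partial_counts = pool.map(count_errors, partitions)
    print("Total errors found:", sum(partial_counts))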

3.2 VELOCITY: STREAMING DATA

It is estimated that 2.3 trillion gigabytes of data are created each day. In our highly connected world,

trends of interest may last only a few days, hours or even just minutes. Some important events, such as

online fraud or hacking attempts, may last only seconds, and need to be responded to immediately. The

need for near-real time sensing, processing and response is driving the development of new technologies

for identifying patterns in data within very small and immediate time windows.
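
A minimal sliding-window sketch in Python (the threshold, window size, and account names are invented for illustration) shows the style of immediate, per-event pattern detection that such near-real-time systems and their tests revolve around:

from collections import defaultdict, deque

# Hypothetical rule: flag an account that generates more than 3 events
# within a 10-second window (values chosen purely for illustration).
WINDOW_SECONDS = 10
MAX_EVENTS = 3

recent = defaultdict(deque)  # account_id -> event timestamps inside the window

def observe(account_id, timestamp):
    """Record one event and return True if the account looks suspicious."""
    window = recent[account_id]
    window.append(timestamp)
    # Drop timestamps that have fallen out of the sliding window.
    while window and timestamp - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) > MAX_EVENTS

events = [("acct42", 1), ("acct42", 2), ("acct42", 3), ("acct42", 4), ("acct7", 5)]
for account, ts in events:
    if observe(account, ts):
        print("Possible fraud:", account, "at t =", ts)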

3.3 VARIETY: DIFFERENT TYPES OF DATA

A common theme in Big Data systems is that the source data is increasingly diverse, involving types of

data and structures that are more complex and/or less structured than the traditional strings of text and

numbers that are the mainstay of relational databases. Increasingly, organizations must be able to deal

with text from social networks, image data, voice recordings, video, spreadsheet data, and raw feeds

directly from sensor sources.

Even on the Web, where computer-to-computer communication ought to be straightforward, the reality is

that data is messy. Different browsers send differently formatted data, users withhold information, and

they may be using different software versions or vendors to communicate with you. Traditional relational

database management systems are well-suited for storing transactional data but do not perform well with

mixed data types.
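
As a small illustration of handling such variety (a Python sketch; the formats and field names are assumptions, not a prescribed schema), the function below coerces JSON, key=value, and CSV records into one common dictionary form before further checks are applied:

import csv
import json
from io import StringIO

def normalize(record):
    """Coerce a raw record of unknown format into a plain dictionary."""
    text = record.strip()
    if text.startswith("{"):                    # JSON from an API or web client
        return json.loads(text)
    if "=" in text and "," not in text:         # key=value pairs from a sensor feed
        return dict(pair.split("=", 1) for pair in text.split())
    reader = csv.DictReader(StringIO(text), fieldnames=["user", "action", "ts"])
    return dict(next(reader))                   # fall back to an assumed CSV layout

raw_inputs = [
    '{"user": "a1", "action": "click", "ts": "2017-09-01"}',
    "user=a2 action=view ts=2017-09-02",
    "a3,purchase,2017-09-03",
]
for raw in raw_inputs:
    print(normalize(raw))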

4. BIG DATA TESTING STEPS

Testing a Big Data application is more a verification of its data processing than testing the individual

features of the software product. When it comes to big data testing, performance and functional

testing are key.

In big data testing, QA engineers verify the successful processing of terabytes of data using commodity

clusters and other supporting components. It demands a high level of testing skill, as the processing is

very fast. Processing may be of three types:


Batch

Real Time

Interactive

Along with this, data quality is also an important factor in big data testing. Before testing the application,

it is necessary to check the quality of the data; this check should be considered a part of database testing. It

involves checking various characteristics such as conformity, accuracy, duplication, consistency, validity,

and data completeness.
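
The sketch below (Python, with illustrative field names and rules that are assumptions rather than a standard) shows how such quality characteristics can be checked programmatically on a batch of records before functional testing begins:

import re

# Illustrative rules; the field names and thresholds are assumptions.
REQUIRED_FIELDS = ("policy_id", "email", "premium")
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def quality_report(records):
    """Count basic data-quality violations in a batch of records."""
    report = {"incomplete": 0, "invalid_email": 0, "negative_premium": 0, "duplicates": 0}
    seen_ids = set()
    for rec in records:
        if any(rec.get(f) in (None, "") for f in REQUIRED_FIELDS):
            report["incomplete"] += 1          # completeness check
        elif not EMAIL_PATTERN.match(rec["email"]):
            report["invalid_email"] += 1       # conformity/validity check
        elif rec["premium"] < 0:
            report["negative_premium"] += 1    # validity check
        if rec.get("policy_id") in seen_ids:
            report["duplicates"] += 1          # duplication check
        seen_ids.add(rec.get("policy_id"))
    return report

batch = [
    {"policy_id": 1, "email": "a@x.com", "premium": 120.0},
    {"policy_id": 1, "email": "a@x.com", "premium": 120.0},   # duplicate
    {"policy_id": 2, "email": "not-an-email", "premium": 80.0},
    {"policy_id": 3, "email": "", "premium": -5.0},           # incomplete
]
print(quality_report(batch))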

Big Data Testing can be broadly divided into three steps:

STEP 1: DATA STAGING VALIDATION

The first step of big data testing, also referred to as the Pre-Hadoop stage, involves process validation (a minimal sketch follows the checklist below).

I. Data from various sources like RDBMS, weblogs, etc. should be validated to make sure

that the correct data is pulled into the system.

II. Comparing source data with the data pushed into the Hadoop system to make sure they

match.

III. Verify that the right data is extracted and loaded into the correct HDFS location.
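
A minimal sketch of this staging check (Python; the records are synthetic, and in practice the staged content would be read back from HDFS, for example with "hdfs dfs -cat", rather than from an in-memory string):

import csv
from io import StringIO

# A source extract (e.g., from an RDBMS) and the content that landed in staging.
source_rows = [("1", "Alice"), ("2", "Bob"), ("3", "Cara")]
staged_file = "1,Alice\n2,Bob\n3,Cara\n"

staged_rows = [tuple(row) for row in csv.reader(StringIO(staged_file))]

missing = set(source_rows) - set(staged_rows)
extra = set(staged_rows) - set(source_rows)

assert len(source_rows) == len(staged_rows), "Row count mismatch after ingestion"
assert not missing, "Records lost during ingestion: {}".format(missing)
assert not extra, "Unexpected records appeared in staging: {}".format(extra)
print("Staging validation passed for", len(staged_rows), "records")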

STEP 2: "MAP REDUCE" VALIDATION

The second step is validation of the "MapReduce" process. In this stage, the tester verifies the business

logic on every node and then validates it after running against multiple nodes, ensuring that (a minimal

sketch follows this list):

I. The MapReduce process works correctly

II. Data aggregation or segregation rules are implemented on the data

III. Key-value pairs are generated

IV. The data is validated after the MapReduce process
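
The sketch below (a plain-Python simulation with invented sales data, not an actual Hadoop job) illustrates the idea: run the map and reduce logic, then compare the aggregated output against an independently computed expectation that encodes the business rule:

from collections import defaultdict

def mapper(record):
    """Emit (key, value) pairs: here, one (product, amount) pair per sale."""
    product, amount = record
    yield product, amount

def reducer(key, values):
    """Aggregate all values for one key: here, total sales per product."""
    return key, sum(values)

def run_mapreduce(records):
    grouped = defaultdict(list)
    for record in records:                 # map phase
        for key, value in mapper(record):
            grouped[key].append(value)     # shuffle/sort phase (group by key)
    return dict(reducer(k, v) for k, v in grouped.items())  # reduce phase

sales = [("book", 10), ("pen", 2), ("book", 5), ("pen", 3)]
result = run_mapreduce(sales)

# An independent recomputation of the business rule acts as the test oracle.
expected = {"book": 15, "pen": 5}
assert result == expected, "Aggregation rule violated: {} != {}".format(result, expected)
print("MapReduce validation passed:", result)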

STEP 3: OUTPUT VALIDATION PHASE

The final or third stage of Big Data testing is the output validation process. The output data files are

generated and ready to be moved to an EDW (Enterprise Data Warehouse) or any other system, based on

the requirement. Activities in the third stage include (a minimal sketch follows this list):

I. Checking that the transformation rules are correctly applied

II. Checking data integrity and successful data loading into the target system

III. Checking that there is no data corruption, by comparing the target data with the HDFS file system data
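
A minimal sketch of this output check (Python, using sqlite3 as a stand-in for the EDW; the table name, columns, and transformation rule are assumptions for illustration):

import sqlite3

# Rows parsed from the MapReduce output files on HDFS, and the rows found in
# the target warehouse table after the load; names are purely illustrative.
hdfs_output = [("book", 15.0), ("pen", 5.0)]

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales_summary (product TEXT, total REAL)")
warehouse.executemany("INSERT INTO sales_summary VALUES (?, ?)", hdfs_output)

loaded = warehouse.execute(
    "SELECT product, total FROM sales_summary ORDER BY product"
).fetchall()

# 1. No corruption: target rows must match the HDFS output exactly.
assert sorted(hdfs_output) == sorted(loaded), "Target data differs from HDFS output"
# 2. Transformation rule: totals must be non-negative and rounded to 2 decimals.
assert all(t >= 0 and round(t, 2) == t for _, t in loaded), "Transformation rule violated"
print("Output validation passed for", len(loaded), "rows")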


5. WHY DO WE NEED BIG DATA TESTING IN THE INSURANCE INDUSTRY?

Several accounts of people leaving lucrative jobs in order to work for themselves have popped up

repeatedly over recent times. Marketplace aggregators such as Uber have ensured that these dreams could

be achieved by the average person. Concepts such as ‘ride sharing’ enable consumers to commute to their

destination by paying per trip through the Uber application, which connects them to car owners who are

willing to drive to the requested destination and who are paid for their services through Uber.

This implies an income of nearly USD 150,000 per year after putting in about 70 hours of driving every

week.

The overriding concern of most of these cab drivers, however, is the safety of their vehicles on

uncertain roads and with the strangers who continuously hop in. Marketplace aggregators have therefore

recognized the need to partner with insurance companies in order to provide auto insurance and motor

insurance policies to their drivers.

Leading insurance providers offer an excellent array of insurance coverage, including life and property, along

with automobile (auto) insurance. The insurance industry can clearly be said to be data-dependent. Data

capture in the insurance sector remains crucial owing primarily to the following reasons:

Facilitating and better managing the various businesses (such as Uber)

Improving overall efficiency

Getting to know the customer better

Information management and data analytics play a crucial role for insurance companies in ensuring that

strategies that aim to expand portfolios, and thereby reduce the risk of not sustaining the business, are

implemented effectively. Data storage and analysis, especially over a prolonged period of time, help

business analysts arrive faster at more assured conclusions. However, traditional computing techniques

cannot be utilized while testing big data datasets; such testing relies heavily upon structured frameworks, a

variety of techniques, and a range of tools. The key to successful big data testing lies in effectively

understanding the three V's of big data: Volume, Variety and Velocity.

Internal insurance processes consisting of insurance activities and their supervision have evolved over the

course of the past decade, often adhering to more applicable principles and standards. Internal control is a

chief area of concern, focusing on benefits for both policyholders and shareholders through higher

security standards. The insurance industry continues to adopt big data analytics at a slower rate than

other industries, such as marketing and finance. The chief reason for this is the lack of skilled personnel

who can make use of internal and external sources of data, with adequate knowledge of both business

analytics and the insurance sector. Competition ensures that only the companies that make

use of such data possess an edge over those that still do not. Overall, end-to-end internal systems of

insurance companies need to be monitored using modeling platforms, as they include copies of enterprise

data.


6. CHALLENGES OF BIG DATA TESTING

This process requires a high level of automation given massive data volumes, and the speed of

unstructured data creation. However, even with automated toolsets, big data testing isn't easy.

Good source data and reliable data insertion: “Garbage in, garbage out” applies. You need

good source data to test, and a reliable method of moving the data from the source into the

testing environment.  

Test tools require training and skill: Automated testing for unstructured data is highly

complex with many steps. In addition, there will always be problems that pop up during a big

data test phase. Testers will need to know how to problem-solve despite unstructured data

complexity.

Setting up the testing environment takes time and money: Hadoop eases the pain because it

was created as a commodity-based big data analytics platform. However, IT still needs to buy,

deploy, maintain, and configure Hadoop clusters as needed for testing phases. Even with a

Hadoop cloud provider, provisioning the cluster requires resources, consultation, and service

level agreements.

Virtualization challenges: Most business application vendors now develop for virtual

environments, so virtualized testing is a necessity. Virtualized images can introduce latency into

big data tests, and managing virtual images in a big data environment is not a straightforward

process.

No end-to-end big unstructured data testing tools: No vendor toolset can run big data tests on

all unstructured data types. Testers need to invest in and learn multiple tools depending on the

data types they need to test.

No matter how challenging the big data testing process is, it must be done – developers can hardly release

untested applications. There are certain features to look for that make the job easier for both structured

and unstructured data testing. Look for high levels of automation and repeatability, so testers do not have

to reinvent the data wheel every time, or pause the testing process to research and take manual steps.

And although Hadoop is very popular for structured and unstructured big data, it’s not the only game in

town. If testing data resides on different platforms like application servers, the cloud, or NoSQL

databases, then look for tools that expand to include them. Also consider testing speeds and data

coverage verification, a smooth training process, and centralized management consoles. 


7. PRINCIPLES OF BIG DATA TESTING

7.1 BENEFICIAL

"The first principle for ethical data use is that it should be done with an expectation of tangible benefit

“Ideally, it should deliver value to all concerned parties—the individuals who generated the data as well as

the organization that collects and analyses it."

7.2 PROGRESSIVE

The value of progressiveness is reliant on the following:

The expectation of continuous improvement and innovation: In other words, what organizations learn

from applying big data should help deliver better and more valuable results.

Minimizing data usage: Businesses should use the least amount of data necessary to meet the desired

objective, with the understanding that minimizing data usage promotes more sustainable and less risky

analysis.

The above principles should also help guard against hidden insights or correlations that disenfranchise

individuals based on race or demographics.

7.3 SUSTAINABLE

Sustainability is broken down into three categories: data, algorithmic, and device- and/or manufacturer-based.

Data sustainability: Sustaining value is closely related to what access organizations have to different social

data sets.

Algorithmic sustainability: A critical element of sustainability is an algorithm's longevity. The Altimeter

report suggests longevity is affected by how the data is collected and analysed.

Device- and/or manufacturer-based sustainability: A third consideration is the lifespan of the data being

collected. "For example, if a company develops a wearable or other networked devices that collect and

transmit data, what happens if that product is discontinued, or the company is sold, and the data is auctioned

off to a third party?"


8. BUSINESS BENEFITS OF BIG DATA TESTING

When data is in the hands of the right people in an organization, it becomes an asset. With the emergence

of social media, mobile, and cloud platforms, organizations face an onslaught of voluminous data.

According to McKinsey & Company's Business Technology Office and the McKinsey Global Institute,

data generation, storage, and data mining are becoming economically relevant for businesses. Businesses

thus undertake big data testing to harness the competitive advantage of big data generated through their

business applications. The following points substantiate the business benefits of big data testing.

ACCURATE DATA: According to Gartner, data volume will expand by 800% within the next five

years and 80% of it will be contributed by unstructured data. The quality analysis of unstructured data

will offer intelligent data insights that are usually difficult to determine with data warehousing facilities

and other traditional business intelligence tools. As unstructured data is typically large and unusable in

its raw form, it must be validated before it can be mined for business benefits. Accurate data will help

businesses analyze their competition and focus on their weak areas to strengthen their position.

IMPROVED BUSINESS DECISIONS: Big data can support a manager in better decision making or

automate the decision-making process. Various surveys suggest that big data is used for better decision

making 58% of the time, and 29% of the time it helps in automating business decisions. According to

Michael Knorr, who heads integration and data services at a leading financial organization, "it all

depends on the manager, whether he wishes to use data to support his business decisions or to automate

his decisions by considering the level of the risk involved." The QA of unstructured business data will

help businesses arrive at better business decisions.

IMPROVED MARKET TARGETING AND STRATEGIZING: Today, businesses are keen to

exploit the benefits of big data in planning their digital marketing strategies. With advancements in

web technology, it becomes easier for businesses to collect vast amounts of data based on user

behavior and history. They can convert this data into a compelling, personalized experience for each

customer who comes to the website. Big data testing will help businesses employ optimization and

predictive behavioral targeting to arrive at better decisions.

MINIMIZES LOSSES AND INCREASES REVENUES: According to Gartner, every year businesses

lose $8.2 million to $100 million due to the poor quality of big data. Today, most businesses have put

quality strategies in place to separate bad data from good data, yet the losses remain high. Big data testing

helps minimize such losses by differentiating valuable data from the heap of semi-structured and

unstructured data. It will help businesses to vastly improve their customer service, make better business

decisions and increase their revenues.


9. CONCLUSION

In conclusion, if an organization applies the right test strategies and follows best practices, it will

improve Big Data testing quality, which will help to identify defects in early stages and reduce overall

cost. Big Data is the biggest opportunity in the modern world. There is a whole set of advantages that

businesses can derive from big data to evolve as smart and connected enterprises. It has now

become evident that Big Data is permeating all aspects of life, making it an imperative for businesses,

whether they are ready or not. However, there are also unique and new challenges thrown up by the data

revolution that enterprises need to be careful about. With proper caution and preparedness, a business can

thrive with big data without being exposed to undue risk.


REFERENCES & APPENDIX

1.  Delort P., Big data in Biosciences, Big Data Paris, 20122. "Next-generation genomics: an integrative approach" (PDF). Nature. July 2010. Retrieved 18

October 2016.3. "BIG DATA IN BIOSCIENCES". ResearchGate. October 2015. Retrieved 18 October2016.4. "Big data: are we making a big mistake?” Financial Times. 28 March 2014. Retrieved 20 October 2016.5.  Ohm, Paul. "Don't Build a Database of Ruin". Harvard Business Review.6. Darwin Bond-Graham, Iron Cage book – The Logical End of Facebook's Patents,Counterpunch.org,

2013.12.037.  Darwin Bond-Graham, Inside the Tech industry’s Start-up Conference,Counterpunch.org, 2013.09.118.  Al-Rodhan, Nayef (2014-09-16). "The Social Contract 2.0: Big Data and the Need to Guarantee Privacy

and Civil Liberties - Harvard International Review". Harvard International Review. Retrieved 2017-04-03.

9. Danah Boyd (29 April 2010). "Privacy and Publicity in the Context of Big Data". WWW 2010 conference. Retrieved 2011-04-18.

10. Jones, MB; Schildhauer, MP; Reichman, OJ; Bowers, S (2006). "The New Bioinformatics: Integrating Ecological Data from the Gene to the Biosphere" (PDF). Annual Review of Ecology, Evolution, and Systematics. 37 (1): 519–544. doi:10.1146/annurev.ecolsys.37.091305.110031


AUTHOR BIOGRAPHY

Mounika Dandem: is an Associate Consultant with the Insurance business unit of Capgemini and can be reached

at [email protected]

Kavusalya Komirishetty: is an Associate Consultant with the Insurance business unit of Capgemini and can be

reached at [email protected]


THANK YOU!