



EXCLUSIVELY FOR TDWI PREMIUM MEMBERS

VOLUME 19 • NUMBER 3

THE LEADING PUBLICATION FOR BI, DATA WAREHOUSING, AND ANALYTICS PROFESSIONALS

The Case for Hiring MIS Graduates 4 Hugh J. Watson

The Modern Data Warehouse— How Big Data Impacts Analytics Architecture 8 Karen Lopez and Joseph D’Antoni

Cloud Computing for BI: The Economic Perspective 16 Paul G. Johnson

Meeting the Fundamental Challenges of Data Warehouse Testing 28 Wayne Yaddow

BI Case Study: Bagel Chain Serves Up Happy Users with Move to Mobile Dashboards 35 Linda Briggs

BI Experts’ Perspective: BI in Manufacturing 38 Bhargav Mantha, Keith Manthey, Brian Valeyko, and Coy Yonce

Achieving Faster Analytics with In-Chip Technology 45 Elad Israeli

Winners: TDWI Best Practices Awards 2014 51


BI Training Solutions: As Close as Your Conference Room

tdwi.org/onsite

TDWI ONSITE EDUCATION

TDWI Onsite Education brings our vendor-neutral BI, DW, and analytics training to companies worldwide, tailored to meet the specific needs of your organization. From fundamental courses to advanced techniques, plus prep courses and exams for the Certified Business Intelligence Professional (CBIP) designation—we can bring the training you need directly to your team in your own conference room.

YOUR TEAM, OUR INSTRUCTORS, YOUR LOCATION.

Contact Yvonne Baho at 978.582.7105 or [email protected] for more information.



VOLUME 19 • NUMBER 3

3 From the Editor

4 The Case for Hiring MIS Graduates Hugh J. Watson

7 Instructions for Authors

8 The Modern Data Warehouse—How Big Data Impacts Analytics Architecture Karen Lopez and Joseph D’Antoni

16 Cloud Computing for BI: The Economic Perspective Paul G. Johnson

28 Meeting the Fundamental Challenges of Data Warehouse Testing Wayne Yaddow

35 BI Case Study: Bagel Chain Serves Up Happy Users with Move to Mobile Dashboards Linda Briggs

38 BI Experts’ Perspective: BI in Manufacturing Bhargav Mantha, Keith Manthey, Brian Valeyko, and Coy Yonce

45 Achieving Faster Analytics with In-Chip Technology Elad Israeli

51 Winners: TDWI Best Practices Awards 2014

56 BI StatShots


VOLUME 19 • NUMBER 3

EDITORIAL BOARD

Editorial Director James E. Powell, TDWI

Managing Editor Jennifer Agee, TDWI

Production Editor Marie Gipson, TDWI

Senior Editor Hugh J. Watson, TDWI Fellow, University of Georgia

Director, TDWI Research Philip Russom, TDWI

Director, TDWI Research David Stodder, TDWI

Director, TDWI Research Fern Halper, TDWI

Associate Editors

Barry Devlin, 9sight Consulting

Mark Frolick, Xavier University

Troy Hiltbrand, Idaho National Laboratory

Claudia Imhoff, TDWI Fellow, Intelligent Solutions, Inc.

Barbara Haley Wixom, TDWI Fellow, University of Virginia

Advertising Sales: Scott Geissler, [email protected], 248.658.6365.

List Rentals: 1105 Media, Inc., offers numerous e-mail, postal, and telemarketing lists targeting business intelligence and data warehousing professionals, as well as other high-tech markets. For more information, please contact our list manager, Merit Direct, at 914.368.1000 or www.meritdirect.com.

Reprints: For single article reprints (in minimum quantities of 250–500), e-prints, plaques, and posters, contact: PARS International, phone: 212.221.9595, e-mail: [email protected], www.magreprints.com/QuickQuote.asp.

© Copyright 2014 by 1105 Media, Inc. All rights reserved. Reproductions in whole or in part are prohibited except by written permission. Mail requests to “Permissions Editor,” c/o Business Intelligence Journal, 555 S Renton Village Place, Ste. 700, Renton, WA 98057-3295. The information in this journal has not undergone any formal testing by 1105 Media, Inc., and is distributed without any warranty expressed or implied. Implementation or use of any information contained herein is the reader’s sole responsibility. While the information has been reviewed for accuracy, there is no guarantee that the same or similar results may be achieved in all environments. Technical inaccuracies may result from printing errors, new developments in the industry, and/or changes or enhancements to either hardware or software components. Printed in the USA. [ISSN 1547-2825]

Product and company names mentioned herein may be trademarks and/or registered trademarks of their respective companies.

President Rich Zbylut

Director, Online Products & Marketing Melissa Reeve

Graphic Designer Rod Gosser

President & Chief Executive Officer Neal Vitale

Senior Vice President & Chief Financial Officer Richard Vitale

Executive Vice President Michael J. Valenti

Vice President, Finance & Administration Christopher M. Coates

Vice President, Information Technology & Application Development Erik A. Lindgren

Vice President, Event Operations David F. Myers

Chairman of the Board Jeffrey S. Klein

Reaching the Staff
Staff may be reached via e-mail, telephone, fax, or mail.

E-mail: To e-mail any member of the staff, please use the following form: [email protected]

Renton office (weekdays, 8:30 a.m.–5:00 p.m. PT) Telephone 425.277.9126; Fax 425.687.2842 555 S Renton Village Place, Ste. 700 Renton, WA 98057-3295

Corporate office (weekdays, 8:30 a.m.–5:30 p.m. PT) Telephone 818.814.5200; Fax 818.734.1522 9201 Oakdale Avenue, Suite 101, Chatsworth, CA 91311

Business Intelligence Journal (article submission inquiries)

Jennifer Agee E-mail: [email protected] tdwi.org/journalsubmissions

TDWI Premium Membership (inquiries & changes of address)

E-mail: [email protected] tdwi.org/PremiumMembership 425.226.3053 Fax: 425.687.2842

tdwi.org


From the Editor

Projects require resources, and today you’re likely to hear the term “allocation of scarce resources.” In this issue, we examine resources—people, capital, hardware, and software—whatever it takes to get the job done.

Your project will certainly need people, but as senior editor Hugh J. Watson asks, where can you find good help these days? After examining the skills a BI analyst needs, Watson makes the case for hiring MIS graduates, who “tend to have the combination of technical, business, soft, and system development skills that fit many BI positions.”

Resources also include hardware and software solutions. Karen Lopez and Joseph D’Antoni explore how big data impacts your analytics architecture. They describe the Hadoop framework components most relevant to the data warehouse architect and developer. Elad Israeli shows how to efficiently use resources such as hard disks, RAM, and CPU to enable large storage capacity and strong analytics performance.

With enterprise budgets stretched to the limit, it’s no wonder Hadoop is frequently touted as a low-cost approach to storing and analyzing big data. Our BI Experts’ Perspective column asked Bhargav Mantha, Keith Manthey, Brian Valeyko, and Coy Yonce to share their knowledge and recommendations about big data projects.

If you don’t have the budget for all the BI hardware and software you’d like, BI in the cloud might be a viable alternative. Paul G. Johnson examines the “compelling economic advantages” of the cloud that benefit both enterprises and society as a whole. He offers a framework to help you evaluate the costs and benefits of transitioning to the cloud.

Of course, before you put any project into production, you need to test it thoroughly. That can eat up resources. Wayne Yaddow presents three challenges enterprises face in testing their data warehouse projects and suggests best practices and test methodologies to maximize your efficiency (and thus conserve resources).

You may also be able to save money by moving to mobile computing. Our case study takes a closer look at the success Einstein Noah Restaurant Group realized when it moved managers from Excel to mobile dashboards.

With the right resources, enterprises can do great things, as evidenced by the 10 winners of this year’s TDWI Best Practices Awards. The full list of outstanding enterprises and their projects begins on page 51.

As always, we welcome your comments. Please send them to [email protected].



The Case for Hiring MIS Graduates Hugh J. Watson

Introduction One of the companies on our MIS (management information systems) Advisory Board recently posted a position for a BI business analyst.1 The company is mature in its use of BI and analytics and employs a rich variety of traditional and big data platforms and data analysis tools. The position is entry-level and the announcement is targeted to new graduates of the University of Georgia’s MIS program. The announcement caused me to think about the skills needed for BI and where to find people with those skills.

The announcement states that candidates must have a good understanding of databases, database design principles, and SQL skills. The person will be part of the finance business intelligence team and will help users in finance, accounting, marketing, and sales operations access information for analytical reporting. Specific technical skills include SQL (with Oracle DB 11), Oracle BI, and Excel. The person will manage a network of manual and automated reporting procedures; work with users on ad hoc reporting needs by providing training, guidance, and report design; help users access information; assist with the development of dashboards; help with user training for new applications; ensure that the data infrastructure meets new application needs; and administer user accounts. The candidate should be familiar with the systems development life cycle (SDLC). This sounds like a typical BI analyst position.

Analyzing the Skills The position calls for a variety of skills.

Technical skills. All BI positions require technical skills. In the announcement, database skills are a “must have.” I would argue that database is the most common technical skill required across a large number of BI positions.


Hugh J. Watson is a Professor of MIS and holds a C. Herman and Mary Virginia Terry Chair of Business Administration in the Terry College of Business at the University of Georgia. He is senior editor of the Business Intelligence Journal. [email protected]

1 A description of the Board is available at http://www.terry.uga.edu/about/boards-councils/mis-advisory-board.


The announcement also calls for experience with Oracle BI and Excel. These are data access and analysis tools—the “getting data out” part of BI. I imagine that if an applicant had experience with MicroStrategy, Business Objects, Cognos, or some other reporting/OLAP tool, the person would be seriously considered. Like database fluency, the ability to access and analyze data is a technical skill.

Business skills. The position also requires the person to work with people in finance, accounting, marketing, and sales operations to provide training, guidance, report design, assistance in accessing information, dash-board design, and user training on new applications. To be effective in performing these tasks, the person should have business skills. It is important for analysts to be familiar with business concepts and terminology such as ROI, NPV, P&L, incremental costs, sales channels, customer affinity to buy, and customer lifetime value in order to be effective.

I had a conversation several years ago with a BI director about the supposed difficulty of getting face time with senior executives. The conversation illustrates the importance of business knowledge. The director thought the face-time problem was perhaps exaggerated but that it was critical for analysts to understand the business. He expanded by saying, “It takes about five minutes for an executive to decide if the analyst is going to be of any help, and if the answer is ‘no,’ the analyst will never get on the executive’s calendar again.”

One of the problems with new hires, and especially new graduates, even if they come from business schools, is that they don’t understand the organization-specific terminology. This is why it is difficult to hire new graduates into BI, and if they are hired, they often start out in technical positions.

Soft skills. Soft skills are critical for projects that require interactions with internal and external stakeholders, and isn’t that the case with nearly all BI projects? Although there are opportunities for some “back room” technical gurus, most BI projects require people who can lead projects; work and collaborate well on teams; are well organized; have an ability to prioritize, learn, and adapt on the fly; possess good communication and presentation skills; and have strong interpersonal skills.

The need for soft skills is seen in the announcement in that the position requires candidates to be able to work with users in various ways, work on project teams, and be well organized.

System development skills. The announcement indicates that the applicant should be familiar with the SDLC. It might better say agile development methodologies, but giving the benefit of the doubt, some people use the SDLC generically to refer to all development methodologies.

It has been my experience that being formally trained in various development methodologies is important. Sometimes I come across IT people who have very specific skills, such as building Web pages or administering networks, but do not understand or appreciate the importance of using a methodology. Often these people fail to see system projects holistically as socio-technical undertakings and how their part of the project relates to the overall project and business needs.

The MIS and Computer Science (CS) Options One of the things that frustrates me (as an MIS professor) is how often business and technical writers equate IT with computer science and fail to recognize the existence, size, and nature of MIS programs. For example, you often read statements such as “businesses need more CS gradu-ates” or “IT professionals need business skills.” Although these statements are generally true, they fail to recognize that on most campuses, MIS programs (sometimes called IS or CIS) have as many majors as CS and require a full complement of business courses.

My belief is that although CS graduates are a good fit for technical BI positions, the majority of BI work is better performed by MIS graduates. Table 1 shows the coursework taken by the University of Georgia’s undergraduate MIS and CS students. The courses are typical of MIS and CS programs across the country.


An important difference between the programs is that MIS requires a combination of business (MIS programs are housed in business schools) and technical courses while all CS courses (housed in the arts and sciences departments) are technical. Students of CS also take more technology courses in total. If your staffing needs are for someone who is highly technical, a CS graduate is a good hire. However, some MIS students are also very technical and take their elective courses in CS.

MIS students take all of the courses required of business majors. Some CS students take a few business courses to broaden their knowledge. In general, however, MIS students have a better understanding of business concepts, terminology, and processes.

The MIS program helps students develop their soft skills. There is a required course in project management. Nearly every course requires a group project, such as building a database application for a client. The teams make end-of-project presentations, often to the client, and receive feedback on their work and presentations.

Table 1: Comparing MIS and CS programs.

Management Information Systems Courses:
■ Common body of business knowledge courses (economics, accounting, finance, marketing, management, operations, etc.)
■ Introduction to Information Systems
■ Computer Programming (Java)
■ Business Process Management
■ Data Management
■ Network-based Application Development/Advanced Java Programming
■ Project Management
■ Systems Analysis and Design
■ Electives: Business Intelligence, Energy Informatics, Enterprise Systems (ERP), IS Leadership, etc.

Computer Science Courses:
■ Introduction to Programming
■ Software Development
■ Systems Programming
■ Theory of Computing
■ Data Structures
■ Computer Architecture
■ Software Engineering
■ Operating Systems
■ Computer Networks
■ Web Programming
■ Database Management
■ Electives: Computer Modeling, Introduction to Game Programming, Numerical Methods and Computing, Simulation, Artificial Intelligence, Algorithms, Introduction to Robotics, Real-time Systems, Distributed Computing, Computer Graphics, etc.


Both MIS and CS programs teach various development methodologies and have projects that require their use.

Conclusion McKinsey & Company recently surveyed more than 800 executives about their organizations’ technology-talent needs next year. Heading the needs list was people with analytics and data science skills. Number two on the list was people with joint business and IT skills. These are exactly the skills that many BI positions require.

Driving this need are the growth of analytics in organizations and Baby Boomers who are retiring at a high rate and taking their years of knowledge and experience with them. These forces are causing a shortage of technical talent in general, and specifically, the skills needed for BI and analytics work.

There are several options for securing talent for your BI team—hiring them away from other companies, converting power users into BI professionals, and using consultants. Another option is to hire new college graduates. Although computer science grads are great for the most technical BI positions, and marketing, finance, and accounting grads are appropriate for some positions, MIS grads tend to have the combination of technical, business, soft, and system development skills that fit many BI positions.

Don’t reject applicants if they lack specific technical skills. New graduates are unlikely to have experience with all of your technologies. If candidates demonstrate an ability to learn and execute (e.g., class projects), they can quickly pick up new technologies. By way of contrast, business and especially soft skills are more difficult to learn.

Hiring these graduates can be challenging, however. Demand for them is rivaling that of the dot-com boom of the late 1990s and they are receiving multiple job offers at top salaries.


If you want to hire MIS grads, I encourage you to go beyond relying on HR to find potential candidates. My experience is that college recruiters are often not granular enough in their understanding of BI to find what you need. If there is a major university near you, it is likely that the MIS department offers a course in BI. I recommend you let the instructor know of your hiring interest and offer to be a guest speaker. This approach is likely to give you a pipeline for hiring the best BI talent coming out of school. ■

Instructions for Authors

The Business Intelligence Journal is a quarterly journal that focuses on all aspects of business intelligence, data warehousing, and analytics. It serves the needs of researchers and practitioners in this important field by publishing surveys of current practices, opinion pieces, conceptual frameworks, case studies that describe innovative practices or provide important insights, tutorials, technology discussions, and annotated bibliographies. The Journal publishes educational articles that do not market, advertise, or promote one particular product or company.

Visit tdwi.org/journalsubmissions for the Business Intelligence Journal’s complete submissions guidelines, including writing requirements and editorial topics.

Submissions: tdwi.org/journalsubmissions

Materials should be submitted to: Jennifer Agee, Managing Editor E-mail: [email protected]

Upcoming Deadlines Volume 20, Number 2 Submission Deadline: November 21, 2014 Distribution: June 2015

Volume 20, Number 3 Submission Deadline: February 20, 2015 Distribution: September 2015



Joseph D’Antoni is a senior architect with over 10 years of experience. He is a solutions architect for SQL Server and big data at Anexinet. [email protected]

Karen Lopez is senior project manager and architect at InfoAdvisors. She specializes in practical application of data architecture and data evangelism. [email protected]

The Modern Data Warehouse—How Big Data Impacts Analytics Architecture
Karen Lopez and Joseph D’Antoni

Abstract
The advent of big data technologies—and associated hype—can leave data warehouse professionals and business users doubtful but hopeful about leveraging new sources and types of data. This confusion can impact a project’s ability to meet expectations. It can also polarize teams into “which one will we use” thinking.

Good architectures address the cost, benefits, and risks of every design decision. Good architectures draw upon existing skills and tools where they make sense and add new ones where needed. We architects always use the right tool for the job.

In this article, we describe the parts of the Hadoop framework that are most relevant to the data warehouse architect and developer. We sort through the reasons an organization should consider big data solutions such as Hadoop and why it’s not a battle of which (classic data warehouse or big data) is best. Both can—and should—exist together in the modern data architecture.

Introduction
The concept of data warehousing has been with us for at least 30 years and has reached maturity within IT organizations and among data analysts. In the 1990s, online analytical processing (OLAP) systems allowed analysts to perform operations that might not have been possible in other solutions during that period. However, newer, disruptive technologies have been introduced that change overall system architecture and approaches to large-scale data analysis.


There has been a good deal of us-versus-them controversy in the relational and non-relational database world, mostly due to the mistaken belief that an organization must choose one over the other. As we have seen with many technologies over the decades, finding the right tool for the job is paramount to support business needs. Platform wars rarely benefit our organizations.

Big data technologies have moved beyond the “only for Web start-ups” or “only for scientific use” phase and are now ready to answer real-world business questions.

A Data Story
Many stories used to explain big data and Hadoop use social media and scientific sensor data—all wonderful examples of the divergence from traditional data. However, these examples sometimes leave traditional enterprise users feeling as if there are no applications in their world for these technologies.

Big data isn’t just about using new tools; it’s about solving problems that could be too expensive to solve in traditional architectures. Let’s look at how a retailer with a mature data warehouse might make use of big data solutions.

A typical retailer might support the following types of data analytics in the data warehouse:

■ Product sales
■ Promotion effectiveness
■ Store sales
■ Shopping basket mixes and trends
■ Customer preferences and purchasing histories
■ External customer demographic data
■ External daily weather data

In addition, a retailer might want to include analysis of the following:

■ Customer traffic and shopping patterns within a store via mobile tracking, shopping cart tracking, or customer interactions with kiosks

■ Customer shopping behavior via in-store video analytics and other sensor tracking

■ Customer shopping patterns on a website, complete with browsing behavior, ad tracking, and other Web-based logging

■ Municipal traffic data and road closure data to identify anomalies in sales patterns

■ Consumer tax credits by income and postal code

■ Hourly weather data by store

■ Sentiment analysis from social media

■ Influencer analysis from social media

The latter examples could be technically implemented in traditional data warehouse architectures, but the volume and performance load of all this data would likely require significant hardware upgrades and put performance pressure on existing loads, some to the point of being economically infeasible. This retailer would want to offload all that data into big data clusters that are optimized for processing large data volumes, then load the resulting smaller, post-processed, smarter data into their enterprise data warehouse and marts.

Big data opportunities for more insight abound for all kinds of organizations, not just technology or start-ups.

Hadoop and Its Ecosystem
Hadoop is the technology with the most disruptive potential in the big data space—it started simply as a project at Yahoo! to build a better search engine and process all that data, but has evolved into the centerpiece of a modern data analytics architecture, with a large group of open source components surrounding it.

When Hadoop was introduced, implementation and interaction were a challenge, especially to enterprise IT organizations. Management tools were extremely limited and an installation required managing versions of Java libraries, compiling software, and writing custom code to interact with data—which required a new paradigm for developers to learn and understand. Despite these early limitations, Hadoop’s power quickly brought it to the fore for large-scale data processing. At its core, Hadoop is two things—a framework for data processing called MapReduce and a distributed file system known as the Hadoop Distributed File System (HDFS). These technologies combine to allow massive parallelism and fault tolerance while running on commodity hardware.

A common refrain in modern computing is that storage is cheap—this is far from the case with large enterprises utilizing storage area network (SAN) storage. According to the Gartner Group, the average cost for enterprise SAN storage was $4,876 per terabyte in 2011 (Gartner, 2011). Even allowing for some reduction in cost over time, storage is a major part of IT’s ongoing operating expense. An analytic architecture optimized for processing larger data volumes lets us allocate storage and processing budgets where they deliver the most benefit.
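To put the storage numbers in perspective, here is a back-of-the-envelope comparison in Python. Only the $4,876-per-terabyte SAN figure comes from the Gartner citation above; the commodity-disk price and the 100 TB data volume are assumptions for illustration, and the factor of three reflects HDFS’s default block replication.

```python
# Rough storage-cost comparison: enterprise SAN vs. commodity local disk
# as used by HDFS. Only the $4,876/TB SAN figure comes from the article;
# the other inputs are illustrative assumptions.
raw_data_tb = 100                  # assumed size of the data set
san_cost_per_tb = 4876             # enterprise SAN, $/TB (Gartner, 2011)
local_disk_cost_per_tb = 300       # assumed commodity local disk, $/TB
hdfs_replication_factor = 3        # HDFS default: each block is stored 3 times

san_cost = raw_data_tb * san_cost_per_tb
hdfs_cost = raw_data_tb * hdfs_replication_factor * local_disk_cost_per_tb

print("SAN storage:  $" + format(san_cost, ","))    # $487,600
print("HDFS storage: $" + format(hdfs_cost, ","))   # $90,000
```

Even after paying for three copies of every block, the commodity-storage bill is a fraction of the SAN bill, which is the economic intuition behind the scale-out approach described next.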

The Hadoop ecosystem performance approach is different from traditional systems tuning in the following ways:

■ Scale out instead of up. In the relational data warehouse environment, performance is often improved by using larger and faster hardware (which tends to be exponentially more expensive as it grows in scale) or by purchasing an appliance from a software vendor. In the Hadoop world, we add more nodes (servers) and do the work in parallel.

■ Commodity hardware. Hadoop is designed around dense, local storage and large sequential reads. It leverages horizontal scale to provide a great deal of aggregate memory (RAM) and I/O operations per second by combining all available resources in a given cluster of nodes.

■ Parallel processing. Hadoop is architected to manage and support massively parallel processing (MPP), which is optimized for processing very large data sets.

Although MapReduce is a powerful and robust framework, writing Java code at massive scale would have required retraining data analysts and other IT personnel, who are used to working with structured query language (SQL) and scripting. This skills and tools mismatch meant that enterprises were unlikely to adopt Hadoop solutions. The open source community realized these limits and brought together several projects—Hive, Pig, and later Impala—to provide a more user-familiar interface to HDFS.

Hive
Apache Hive refers to itself as a “data warehouse which facilitates querying and manages large data sets residing in distributed storage.” Hive functions as a SQL metastore on top of HDFS—users can impose schemas (which look like tables to the user) onto files and then query them using a language called Hive Query Language (HiveQL). This language is based on SQL, so developers and analysts can more easily query HDFS data. When a user runs a query in HiveQL, a MapReduce job is generated and launched to return the data. No Java coding is required.
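As a minimal sketch of that workflow, the Python snippet below imposes a schema on files already sitting in HDFS and then runs an ordinary-looking aggregate query. It assumes a reachable HiveServer2 endpoint and the third-party PyHive client; the host, table, and column names are hypothetical, not taken from the article.

```python
# Minimal HiveQL-from-Python sketch (assumes: pip install "pyhive[hive]" and a
# running HiveServer2; host, table, and column names are illustrative only).
from pyhive import hive

conn = hive.Connection(host="hadoop-edge-node", port=10000, username="analyst")
cur = conn.cursor()

# Impose a schema on files already in HDFS; no data is moved or converted.
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS web_clicks (
        click_time  STRING,
        customer_id STRING,
        page_url    STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION '/data/raw/web_clicks'
""")

# A familiar SQL-style aggregate; Hive compiles it into MapReduce jobs.
cur.execute("""
    SELECT page_url, COUNT(*) AS views
    FROM web_clicks
    GROUP BY page_url
    ORDER BY views DESC
    LIMIT 10
""")
for page_url, views in cur.fetchall():
    print(page_url, views)
```

The point is not the client library but the division of labor: the analyst writes SQL-like statements, and the cluster turns them into distributed jobs.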

Pig
Apache Pig also provides a high-level procedural language that acts as an interface to HDFS. Pig is more frequently utilized in extract, transform, and load (ETL) scenarios than for just returning data results. Pig uses a text-based language called Pig Latin, which focuses on ease of use and extensibility.

Impala
Apache Impala is part of a number of second-generation Hadoop solutions (along with Spark and Shark) that leverage memory-based processing to perform analytics. Impala has access to the same data in the HDFS cluster (and typically relies on the Hive metastore for table structures) but it doesn’t translate the SQL queries it’s processing into MapReduce. Instead, Impala uses a specialized distributed query engine similar to those found in commercial parallel relational database management systems (RDBMS).


YARN
Hadoop has evolved. In the past, the entire operations of the cluster were run using MapReduce. Now, YARN (Yet Another Resource Negotiator) allows for a more distributed, faster architecture. One of the implications of these changes is the need to build HDFS clusters with more memory than was common in the past. It used to be commonplace to use 64–96 GB of RAM in a given cluster data node; today, 256–512 GB nodes are becoming common.

These components make up only a small subset of the entire Hadoop framework, but they are the most relevant pieces for a data warehouse architect to understand. The Hadoop ecosystem is sometimes referred to as the “zoo” in keeping with its elephant-based name. Figure 1 shows these relevant components and how they fit together.

Figure 1: Hadoop data warehouse components. The Hadoop “zoo” layers scripting (Pig) and SQL-like query engines (HiveQL, Impala) over distributed processing (MapReduce), resource scheduling (YARN), and distributed storage (HDFS).

Analytics and Data Warehousing
Traditional data warehousing is focused on operational metrics such as inventory, supply chain, and operational goals. These metrics tend to look at historical and current data, and although they may allow for some forward-looking forecasting, they usually look at internal data only, with limited use of outside data sources. With years of evolution and ever more powerful hardware, data warehouses have become repositories allowing for large-scale reporting and analysis. Pundits have speculated that big data platforms could be the death of the traditional data warehouse; however, there are many regulatory, operational, and financial reporting requirements that will ensure that the data warehouse remains a component of the IT landscape in the future.

Although data warehousing asks questions about past business events and does attempt to perform predictive analysis, the RDBMSs at the center of the warehouse were not specifically designed for analytical queries. Online analytical processing and multidimensional capabilities added more power to the analysis. However, at larger scales the needs of these systems could only be met by expensive, converged solutions. This was driven by several trends; for example, data volumes increased dramatically and now terabytes are normal and petabytes are becoming more common.

Predictive Analytics
Predictive analytics is an area of data mining that specializes in extracting patterns from past data and applying statistical models to forecast future behavior. These types of analysis have become more widely available to organizations as computing power has become cheaper and their data volumes have increased. In the past, such analysis was limited to credit agencies, financial services, and insurance firms. Now, these types of analyses have become widely available and are used in a variety of industries as diverse as professional sports and medical decision-support software.

Other trends have changed an organization’s data landscape. The proliferation of mobile devices, sensor data, and Web logs has led to new forms of data. Frequently called “unstructured” data, this data is most commonly presented in the form of JavaScript Object Notation (JSON) or Extensible Markup Language (XML). These data types are not easily ingested by traditional platforms due to their variable structures within the same data set, but they are easily loaded into HDFS, and several parsers are available to transform that data into a format that can be easily analyzed. Truly unstructured data is also being analyzed with pattern matching in video and audio files.
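To make the parsing step concrete, the short Python sketch below flattens a nested JSON event into one row per click, a shape that loads cleanly into a Hive table or a relational staging table. The event itself and its field names are hypothetical.

```python
import json

# A hypothetical clickstream event as it might land in HDFS: nested and
# variably structured, which is awkward for a fixed relational schema.
raw_event = '''{
    "customer_id": "C1027",
    "session": {"id": "S44", "device": "mobile"},
    "clicks": [
        {"ts": "2014-06-01T10:02:11", "url": "/home"},
        {"ts": "2014-06-01T10:02:45", "url": "/product/123"}
    ]
}'''

event = json.loads(raw_event)

# Flatten the nested structure into one row per click.
rows = [
    {
        "customer_id": event["customer_id"],
        "session_id": event["session"]["id"],
        "device": event["session"]["device"],
        "click_ts": click["ts"],
        "url": click["url"],
    }
    for click in event["clicks"]
]

for row in rows:
    print(row)
```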

Analytics Marts
At the same time these trends have converged, analysts have begun taking advantage of larger volumes of data in order to perform “advanced” analytics. A frequent use case is to build a model for predictive analytics and run it against real-time or near-real-time data. These models will be built over and over again and run many times in an effort to perfect the models, so service times for these solutions must be very good. Hadoop HDFS has not been the platform for these real-time analytics; a more common scenario is to extract data from HDFS and load it into a memory-optimized columnar platform that allows for a high degree of data compression. Many columnar databases still support SQL and offer scale-out MPP on a similar hardware platform to HDFS.

External Data
Another trend in this area is the widespread use of external data sources. The most publicized use cases for this data involve social media data for sentiment analysis and even outage reporting, but external data use cases go far beyond that. Many firms have begun to incorporate weather data, purchased data about their competitors, and income tax and census data. Market research data tends to be very expensive, so the firms using it need to respect it like any key business asset. Many cities have begun open government initiatives in an effort to maintain transparency—this data can also be used for competitive advantage.

Bringing It All Together
The modern enterprise data warehouse (EDW) needs to bring together the technologies and data required to support traditional business needs and stronger predictive analytics, leveraging large data sets. The classic data warehouse architecture features transactional databases, some external data, ETL systems, and business intelligence systems, as shown in Figure 2.

The enterprise data warehouse would typically be imple-mented in a relational database management system, as would the OLTP and data mart data stores.

Figure 2: Classic data warehouse architecture. On-premises OLTP and other source databases feed a staging/ETL database, which loads the EDW and downstream data marts.


The modern big-data-enabled warehouse adds to those component systems to support parallel processing, scale out, and analytic marts as shown in Figure 3.

Data Modeling in a Classic Data Warehouse
In a traditional data warehouse development project, dimensional models are prepared for the EDW and data marts, usually derived or inspired by OLTP and external data models and specifications. Data architects optimize these models for data loading and consumption. Data cleansing, denormalization, datatype transformations, and indexing strategies tend to be the focus of data modeling efforts.

Both of these architecture diagrams are highly symbolic. A tailored architecture might contain other components or leave some components out of the solution. Components might also be derived from other components, depending on the business needs and models being addressed. In fact, it is common to see Hadoop components used to process data throughout the architecture: using data from the EDW, analytics mart, or OLTP systems.

Data Modeling in the Modern Data Warehouse
In addition to the efforts described in the classic data warehouse project, data architects can provide value to the Hadoop tasks. Data models for OLTP systems will still be required where that data is used in Hadoop. Data models should be prepared for external data sources. Data architects can assist in the design of HiveQL “tables.” Data models of the physical file store in Hadoop (HDFS) aren’t required, but logical data models of the data that is managed there for any length of time would be.

Many modern data modeling tools have begun to support Hive schemas, at least for import. These tables can then be documented along with all the other enterprise data assets.

Figure 3: Modern data warehouse architecture. OLTP databases and external data feed both the classic ETL/staging/EDW pipeline and a Hadoop tier (MapReduce over HDFS and blob storage), which in turn populates data marts and analytics marts, deployed in the cloud and/or on-premises.

Challenges in Big Data Implementation
Changing hardware and software paradigms has never been easy or inexpensive for IT organizations, as evidenced by the large number of firms still using mainframe platforms. In some aspects, new platforms make some IT problems easier—as noted, the hardware is distributed, which eliminates single points of failure. High availability is inherent in the system design; however, when talking about massive amounts of data, backups are always a challenge.

Given the highly available nature of HDFS and the challenges of backing up massive data volumes, many firms choose to forego performing backups of these data volumes, which could leave them vulnerable in a disaster. There are options from some Hadoop vendors for disaster recovery if your organization needs it for its analytic platform.

From a skills perspective, your organization needs the following key abilities:

■ Linux system administration
■ Automation engineering
■ Java development
■ Data analysis

Compared to a traditional model, where the database administrator (DBA) manages the data warehouse database, the DBA role does not apply in HDFS. Linux system administration skills are very important, and although there are distributions of Hadoop running on the Windows operating system that are popular with enterprise organizations, the overwhelming majority of implementations are running on Linux platforms, where community support is more available. When dealing with tens to thousands of cluster nodes, automation becomes very important. Software and firmware updates are also candidates for automated processing.

Leveraging Cloud Computing for Big Data
Big data makes for an interesting cloud computing solution—particularly if workloads are highly variable. Like most other cloud computing offerings, there are two types of solutions: platform-as-a-service (PaaS)—basically Hadoop-as-a-service—and infrastructure-as-a-service (IaaS).

Most major cloud vendors have Hadoop-as-a-service offerings—these can be a fantastic way to get up and running with Hadoop and the toolkit within an afternoon. This means that the vendor manages all the underlying infrastructure and you manage the configuration of Hadoop.

The IaaS offerings simply involve spinning up a number of virtual machines (VMs) and building a Hadoop cluster on them. This places more of the onus of configuration onto your staff but provides more flexibility with the tools installed alongside Hadoop.

One major challenge to both of these solutions is getting large, existing data volumes into the cloud. As a result, many vendors provide services allowing you to ship data tapes or hard drives to get them loaded onto their storage. The good news here is that most cloud providers do not charge a fee to upload data.

Like most other cloud computing solutions, the benefits involve flexibility and low initial capital investment. For example, if a firm wants to run a large-scale fraud detection solution that monitors personal behavior and browsing history across thousands of nodes, the cloud is a viable option if the workload is over a short period of time. Even for much smaller configurations, getting up to speed quickly without the hassle of installing software can be incentive enough to use a cloud solution.

Cloud Trade-Offs
The trade-offs with cloud solutions are the ongoing expense, slower performance, and security concerns.

The cloud limits an enterprise’s initial capital investment, but for long-term, larger implementations the costs may creep up. Most cloud vendors also charge for outbound data flows, so if your reporting solution is on-premises, that is another expense to be considered.

Performance in a cloud will always be limited by the multitenant nature of the environment. Shared infrastructure is required to offer the cost savings and scale of cloud computing. To meet its financial goals, the provider needs to maximize its hardware usage while meeting its performance service-level agreements (SLAs). This is not to say cloud performance is bad—it simply will not match levels achieved with dedicated hardware in an on-premises installation.

Firms are concerned about security when moving to a cloud computing model; however, cloud providers are going out of their way to address these concerns. Consult your cloud provider for specifics about any privacy or regulatory concerns that apply to your industry; providers update these certifications regularly.

Economics of Big Data Solutions
One of the key drivers of big data in enterprise IT organizations has been the high cost of RDBMS licensing and the infrastructure to support it. Data warehouses tend to require features that only the more expensive “enterprise” editions of RDBMS offer and in some cases require the purchase of additional options. Most major RDBMS packages are licensed by the CPU core, which means as workload increases, so does the license expense. The nature of the RDBMS also limits horizontal scaling, so to address performance concerns larger, more expensive server hardware or faster storage is required. Another expense (though much smaller) is the cost of operating system licensing required to support the RDBMS.

Big data platforms are not totally free, but there are some clear cost advantages. Because performance is achieved through horizontal scaling and aggregate resources, individual nodes do not need to be as powerful as a monolithic server. As addressed earlier, Hadoop (and most other big data and NoSQL platforms) leverage dense, local storage that comes at a much lower cost than enterprise SAN storage. All of these software platforms run nearly exclusively on Linux and most implementations take place on completely free distributions of the operating system.

Hadoop itself is available as a free open source project, but most organizations will choose to go with a commercial distribution for ease of management. The annual cost of support and licensing for the commercial solutions is about $4,000/node/year (Bantleman, 2012), which is not insignificant but is far lower than the cost of a commercial RDBMS. Although RDBMS pricing varies per vendor and individual agreement, costs can be as high as $50,000 per CPU core.
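A quick calculation, sketched in Python below, shows why those two price points matter. The per-node and per-core figures come from the article; the cluster size and core count are assumptions chosen only for illustration.

```python
# Rough annual-cost comparison using the figures cited above. Only the
# $4,000/node/year and $50,000/core figures come from the article; the
# cluster size and core count are illustrative assumptions.
hadoop_nodes = 20
hadoop_support_per_node = 4000        # $/node/year (Bantleman, 2012)

rdbms_cores = 32                      # assumed cores in a warehouse server
rdbms_license_per_core = 50000        # upper-end $/core cited above

hadoop_annual = hadoop_nodes * hadoop_support_per_node
rdbms_license = rdbms_cores * rdbms_license_per_core

print("Hadoop subscription: $" + format(hadoop_annual, ",") + " per year")   # $80,000
print("RDBMS licensing:     $" + format(rdbms_license, ",") +
      " (typically a one-time license; annual support is extra)")            # $1,600,000
```

The comparison is deliberately rough (license and subscription models are not identical), but the order-of-magnitude gap is the point the cost figures above are making.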

In most scenarios, it makes the most sense to use big data technologies to process and aggregate big data into classic data solutions using the right tool for the job.

A Final Thought
The most important thing a data warehouse professional needs to understand is that Hadoop and other big data technologies are not an either/or decision. Every design decision comes down to cost, benefit, and risk. Those factors change over time, as we have seen since the first release of Hadoop. Right now, we have the opportunity to use these special-use technologies within an existing data warehouse architecture to leverage a greater variety of data sources than ever before. ■

References
Bantleman, John [2012]. “The Big Cost of Big Data,” Forbes, http://www.forbes.com/sites/ciocentral/2012/04/16/the-big-cost-of-big-data/ (accessed on May 14, 2014).

Gartner [2011]. “IT Key Metrics Data 2012: Key Infrastructure Measures: Storage Analysis: Current Year,” Jamie K. Guevara, Linda Hall, Eric Steggman.


Paul G. Johnson is a CPA and business intelligence practitioner with over 35 years of industry and consulting experience. [email protected]

Cloud Computing for BI: The Economic Perspective
Paul G. Johnson

Abstract
Cloud computing continues to gain acceptance as a viable alternative to on-premises computing infrastructure. Why? In this article, we examine the compelling economic advantages of the cloud that can benefit individual enterprises and society as a whole. We also present a simple framework for evaluating the costs and benefits of transitioning to the cloud.

Introduction
Economics, often called “the dismal science,” may be defined as the study of the use of scarce resources that have alternative uses. I learned recently that computing power is, indeed, a scarce resource when our project team tried to schedule history loads for a new data warehouse. We were restricted to a six-hour daily window, which forced the history loads to take several days. There were alternative uses for the computing resources that were deemed more important than ours.

The idea of cloud computing is not new, but it has gained considerable momentum and acceptance in the marketplace in the past few years. I strongly believe that the economics are extremely compelling and are driving the shift to “pay-as-you-go” computing. In this article I will explore why economic forces are moving companies to consider this approach, and I will provide a framework for testing the economics in your environment.

Cloud Definition and Optimal Workloads
I have found no single definition that covers all aspects of cloud computing, but the following definition quoted in John Rhoton’s excellent book, Cloud Computing Explained, encapsulates the main features that make cloud computing highly attractive from an economic perspective:


A large pool of easily usable and accessible virtualized resources (such as hardware, development platforms and/or services). These resources can be dynamically reconfigured to adjust to a variable load (scale), allowing also for an optimum resource utilization. This pool of resources is typically exploited by a pay-per-use model in which guarantees are offered by the Infrastructure Provider by means of customized SLAs.1

The Utility Model
I spent the first 20 years of my career as a financial analyst in a large electric utility company. There are several great similarities between the electric utility industry and cloud computing that illustrate the concepts we’ll discuss in this article.

The IT manager in a firm faces many of the same challenges as the electric company’s management. One of the primary concerns in both worlds is to ensure that sufficient capacity is available to meet demand. In an electric utility, the peak demand throughout the year looks similar to Figure 1.

There are three key points in this graph:

■ We see two prominent peak loads. One in February represents the winter peak and a higher peak in August represents the summer peak. At a minimum, the utility must provide generating capacity sufficient to meet each of these peaks.

■ The top section above the line represents the reserve margin. This is capacity held in reserve that can be quickly brought online in the event of unexpectedly hot or cold weather. It also provides a safety margin for unexpected power plant failures. A typical reserve margin is 15 to 20 percent.

■ The light area below the line represents idle capacity. In effect, this represents power plants that cost hundreds of millions of dollars but sit idle and do not produce revenue for the utility investors. To make matters worse, it takes many years to build a large power plant, and utilities must plan for expected loads 10 to 20 years into the future.

One strategy to minimize idle capacity is to offer cheaper rates during the winter months, thus encouraging customers to install electric heaters rather than gas furnaces. From the utility’s perspective, selling the power at a lower price is better than letting expensive power plants sit idle. From the customers’ perspective, there is an opportunity to heat their homes in winter at a lower cost. Thus, the effort to minimize idle capacity benefits all parties.

Another challenge in the electric utility business is managing peak loads throughout the day. A typical load profile for a summer day appears in Figure 2.

In this case, we see hourly loads that decrease during the wee hours of the morning when most people are sleeping and the outside temperature has fallen. The demand ramps up rapidly during the day as the temperature rises and factories are running at full capacity. Then demand trails off again in the evening hours as factories close for the day and the area cools. The utility must provide capacity (plus reserve) to meet the 5 p.m. peak. If this peak could somehow be reduced, one less power plant might be built. Utilities reduce this peak requirement by offering “time-of-day” rate plans where customers pay higher rates during peak load times and lower rates during off-peak times. This variable pricing drives behaviors so customers will move their consumption of electricity to off-peak times, which benefits them with lower overall electricity cost. This is known as load shifting and is shown in Figure 3.

1 The original source of the definition is cited in Rhoton’s book as: Vaquero, Rodero-Moreno, Cáceres, and Lindner, “A Break in the Clouds: Towards a Cloud Definition,” 2009.

Figure 1: Peak megawatt demand for an electric utility throughout the year, showing monthly peak load with the reserve margin above it and idle capacity below it.

The total load doesn’t change for the utility (it has merely been shifted), but the 5 p.m. peak is reduced from 10,000 megawatts to 9,250, meaning the utility need not provide as much capacity. If the company can build one fewer power plant, hundreds of millions of dollars in capital costs will be saved. These economic decisions benefit all parties because scarce resources are being utilized more efficiently.
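The arithmetic behind that paragraph is simple enough to sketch in a few lines of Python. The 10,000 MW peak and the 9,250 MW result come from the example above; the rest of the hourly profile is invented purely for illustration.

```python
# Load-shifting sketch: move 750 MW out of the 5 p.m. peak into off-peak
# hours. Only the 10,000 MW and 9,250 MW figures come from the example;
# the hourly profile is an illustrative assumption.
demand_mw = {hour: 6000 for hour in range(24)}                # assumed off-peak baseline
demand_mw.update({15: 8800, 16: 9200, 17: 10000, 18: 9000})   # assumed afternoon ramp

def shift_load(profile, peak_hour, shifted_mw, target_hours):
    """Move shifted_mw out of peak_hour and spread it evenly over target_hours."""
    shifted = dict(profile)
    shifted[peak_hour] -= shifted_mw
    for hour in target_hours:
        shifted[hour] += shifted_mw / len(target_hours)
    return shifted

after = shift_load(demand_mw, peak_hour=17, shifted_mw=750, target_hours=[1, 2, 3, 4])

# Total consumption is unchanged; only the peak the utility must build for drops.
assert round(sum(demand_mw.values())) == round(sum(after.values()))
print("Peak before shifting:", max(demand_mw.values()), "MW")   # 10000 MW
print("Peak after shifting: ", max(after.values()), "MW")       # 9250 MW
```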

In the early days of industrialization, factories generated their own power. Later on, it became possible to generate electricity at high voltages, thus enabling transmission over long distances. Electric utility providers were then able to build large, centralized generation facilities. Often these could be located near coal mines (also owned by the utility) for ready access to fuel, or near lakes for large cooling water capacity. These and other factors enabled significant economies of scale for utility providers, and it became more economical for factories to take their power from the grid.

We could certainly power our own homes using a combination of solar panels and gas-powered generators, but due to economies of scale, we cannot match the prices offered by utilities. In turn, utilities can offer low prices and still earn a reasonable profit for investors. We will see these same benefits occurring in the world of cloud computing.

The IT Manager’s ViewIT managers may be thought of as micro-utility providers of computing services for their firms. They must plan in advance for expected demand on comput-ing resources, acquire and install the hardware, and ensure that an adequate reserve margin is in place to meet unexpected peak demand. In a sense, these peaks are much less predictable than those faced by electric utilities. The data center is simultaneously supporting development, integration testing, stress testing, migra-tion, and production computing activities. If jobs fail and have to be re-run off schedule, the peaks would occur at times other than planned, or multiple peak demand points could coincide and damage performance for all consumers.

Figure 2: Summertime megawatt demand by hour (megawatts by hour of day, from 12 midnight through 11 p.m., showing demand, idle capacity, and the reserve margin).

Figure 3: Summertime megawatt demand with load shifting (megawatts by hour of day, from 12 midnight through 11 p.m., showing demand, idle capacity, and the reserve margin).


Let’s look at the world of business intelligence.

At a minimum, a data warehouse or business intelligence (BI) program should utilize a development environment, an integration test environment, and a production environment. Within each environment are a number of servers: database servers, ETL servers, Web servers, and application servers. The production servers will likely be multi-node with failover capability. Figure 4 depicts a hypothetical utilization pattern for these servers over the life span of a brand new data warehouse effort.

We might use the development environment for prototyping during the design phase, so we have 5 percent utilization of the development environment early on and heavier utilization during the development phase. We could perhaps improve the utilization by using offshore resources, thus making development a 24/7 activity.

As we move into integration testing, we will still be using the development environment for bug fixes, but that use will be winding down. Utilization of the integration testing environment will spike as we load history and execute stress testing to verify the application's scalability.

When we deploy, we will undoubtedly be performing some pre-deployment migration activities in the integration test environment. We will have to load history into the production environment, which will drive extraordinarily high utilization for a brief period. After that, utilization will fall into a normal daily pattern.

The production servers are utilized only at the end of the project cycle. This could be six months to a year after the design phase begins. If these servers were ordered at the project’s inception, they will have been completely idle while the project life cycle was unfolding.

Of course, we recommend that data warehouses be developed and deployed incrementally, and the cycle will iterate back to the development environment for the next subject area in the data warehouse to improve the utilization patterns in each environment over time. Still, each environment will undergo wide fluctuations in utilization throughout the data warehouse project life cycle. It is a huge challenge to line up just the right amount of computing power in advance.

Next, let’s turn to a time-of-day analysis, staying focused on the production environment. Figure 5 uses a stacked area chart to illustrate all servers working together.

The ETL servers run at high utilization from midnight to 7 a.m. The database servers run throughout the 24-hour period—they are inserting and updating data during the night and responding to query requests during the workday. The application servers run at low utilization during the overnight period, but higher during the day as the applications are performing real-time calculations. Finally, the Web servers come into play only when users are actively accessing their Web portals.

The drop at noon represents lunchtime, when both people and machines get a break. Notice that there is very little overlap in utilization of the ETL servers and the Web servers. These servers could be totally shut down for a large part of the 24-hour processing cycle.

Figure 4: Utilization of server environments by project phase (percent utilization of the development, integration, and production environments across the design, develop, test, and deploy phases).


Based on Figure 5, the average utilization of each type of server over the 24-hour cycle is as follows (a short sketch of this calculation appears after the list):

■ ETL: 19%
■ Database: 31%
■ Application: 19%
■ Web: 11%
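
To make the arithmetic behind these percentages concrete, the following minimal Python sketch averages 24 hourly utilization samples per server type. The hourly values are hypothetical placeholders chosen only to land near the averages cited above; they are not read from Figure 5.

from statistics import mean

# Hypothetical fraction of total capacity each server type consumes, sampled once per hour.
# These 24 values per server are illustrative only; they are not taken from Figure 5.
hourly_utilization = {
    "ETL":         [0.55] * 7 + [0.05] * 17,               # heavy overnight ETL window
    "Database":    [0.45] * 7 + [0.25] * 10 + [0.15] * 7,  # active around the clock
    "Application": [0.05] * 7 + [0.35] * 10 + [0.10] * 7,  # busiest during the workday
    "Web":         [0.00] * 7 + [0.25] * 10 + [0.02] * 7,  # only while users hit their portals
}

for server, samples in hourly_utilization.items():
    assert len(samples) == 24, "one sample per hour of the day"
    print(f"{server}: average utilization over the 24-hour cycle = {mean(samples):.0%}")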

Virtualization of servers is one way to help mitigate idle computing capacity. A certain amount of physical hardware is procured and divided into virtual machine images, each of which may be allocated disk capacity, RAM, and CPU cycles. This allocation may be changed quickly to accommodate expected loads on individual virtual machines. However, the total capacity is fixed, and if exceeded, service will degrade.

In the scenario depicted in Figure 5, what would be the effect if the ETL processes had to start four hours late due to a failure in one of the source systems? The graph might appear as in Figure 6.

In this example, when the ETL processes were delayed by four hours, an effort was made to double up on certain ETL processes that could run in parallel. This approach ended up overloading the database servers and resulted in severe performance degradation. The performance bottleneck then caused the ETL processes to run an additional two hours, which started to interfere with workday query activities.

The point is that even in a virtualized in-house computing environment, significant excess capacity must be maintained to cope with unexpected events. Similarly, in an integration test environment, you might have three different project teams trying to run stress tests at the same time. This can be planned for in advance, but each team will have to wait its turn because of the capacity limitations. With a cloud-based solution, you can shift this problem to the cloud provider. You can easily start up additional servers to meet unexpected loads, and the cloud provider is contractually obligated to deliver.

Procurement delays are another significant problem the IT manager faces. In-house computer hardware is generally budgeted as a capital expenditure (CAPEX). This budget competes with other initiatives within the firm and is typically allocated only once per year. The IT manager must successfully anticipate computing needs well over a year in advance! If the budget is too high, there will be excess idle capacity, and a worthy competing initiative elsewhere in the company may not receive funding. If the budget is too low, service may be degraded, and useful work will simply not be done on time. In either case, the consequences are serious.

Figure 5: Total daily computing load by hour (computing resources by hour of day, stacked for ETL, database, application, and Web servers, plus idle capacity).


Procurement lag includes items such as budget approvals, sizing exercises, obtaining competitive bids, delivery time, obtaining space (and cooling) in the data center, installation, hardware configuration, software installation and configuration, testing, and certification for usage. This process can take weeks, if not months, and must be planned carefully. Compare this with a cloud-based solution, where you can ramp up servers in minutes. Consider the demand/capacity graph in Figure 7.

The bottom line indicates the growth in computing demand, which is relatively linear over time. Due to approval and procurement lag, the IT manager needs to acquire computing capacity in relatively large blocks. The area between the top and bottom lines indicates idle capacity. Notice that we have large amounts of idle capacity immediately after a hardware purchase, but minimal idle capacity just before the next purchase. When capacity is at a minimum, performance degradation is likely to be more common until the next purchase.

The elastic nature of the cloud provides for much closer matching of computing supply and demand within the firm. Cloud providers offer a wide variety of server and capacity options, and they may be implemented extremely quickly. The cloud also allows the IT manager to switch funds allocated from CAPEX to the operating budget. If extra capacity is needed temporarily, it will become a relatively minor operating budget variance, as opposed to an emergency CAPEX budget request requiring high-level approvals.

From an economic perspective, cloud computing represents a variable cost that can be closely matched with variable demand. In-house computing behaves more like a fixed cost; if a firm experiences a temporary downturn in business, an IT manager might have to sell idled servers for 30 cents on the dollar and buy new ones later at full price when the business recovers. For this reason, variable costs are much preferred to fixed costs.

The Cloud Provider’s ViewAlthough early forms of cloud computing existed as long ago as the 1960s, a case can be made that cloud comput-ing in its current form was pioneered by Amazon.com, Inc., in the early 2000s. Amazon started as an online bookstore in 1995, but eventually became a general retailer. Given that Amazon is not a brick-and-mortar operation, its IT shop is a major component of its overall cost structure. To minimize costs, Amazon learned how to effectively leverage low-cost, commoditized hardware, and with its growing purchasing power, it could procure large amounts of this hardware at discount. The company still faced a problem common to most retailers: a disproportionate share of sales, and hence server load, occurred between late November and late December.

Figure 6: Computing load by hour with performance degradation at peak (computing resources by hour of day, stacked for ETL, database, application, and Web servers, plus idle capacity, with the performance degradation at the peak marked).


Amazon had to provide sufficient capacity to meet the peak load that occurs in just one month of the year. The result was massive idle computing capability during the remaining 11 months.

Amazon tested the idea of selling its unused computing capacity to other companies that had higher processing needs between January and November. In this way, Amazon recovered some of its fixed investment in computing hardware, and its customers were able to acquire capacity on a temporary, low-cost basis.

Other providers now include Rackspace, Google, Microsoft, and IBM. Each provider has its own business model and cost recovery mechanisms, but the goal remains to most efficiently utilize computing assets for provider and customer alike.

The cloud provider will seek to have a steady but growing load profile throughout the year. Because a provider serves hundreds if not thousands of customers, peak loads will be much more diversified, resulting in a much flatter load profile. The provider will still have to maintain a reserve margin for unexpected peak computing loads, but the need should be very predictable through statistical analysis. Various pricing models also help to flatten out any peaks and valleys. Examples include:

■ Standard pricing: A basic rate for “pay-as-you-go” services. When a customer shuts down a server, there are no fees except for persistent storage.

■ Committed pricing: A customer commits in advance to using a certain amount of capacity and in return receives a discounted rate. The customer will likely know what their base computing load is expected to be, so they can use an optimized combination of committed and standard pricing.

■ Spot market pricing: If a customer can be flexible about when they want to use cloud capacity, they can set their own price. When the provider has excess capacity, they will offer it on a “spot market.” If the spot market price drops below the customer’s offering price, they get the capacity. If the spot price rises above the offering price, the capacity is lost. These spot prices can be extremely low. To use an airline analogy, the cloud provider views its excess capacity as “seats leaving the gate empty” and will do whatever it can to monetize this capacity. For a customer, this pricing model could be advantageous for activities such as prototyping, sizing exercises, and stress testing.

Cloud providers have mastered the art of effective virtualization along with the construction and operation of large, secure data centers. They purchase commoditized hardware in volume at great discounts, and they know how to make it perform at optimal levels. They have significant legal exposure if they violate their service-level agreements (SLAs) or allow a security breach, so they pay close attention to these details. All of these factors allow the cloud provider to offer very economical and secure services to its customers while still earning a profit.

Challenges and Risks of the Cloud

Whenever I suggest a cloud-based solution, the first objection raised is about security. Understandably, IT managers are concerned about allowing data outside the corporate firewall. This has given rise to the concept of "private clouds," which are separated from the public cloud and often located behind the company's firewall.


Figure 7: Procurement lead/lag (computing resources over time, comparing in-house capacity, cloud capacity, and computing demand).


The problem here is that the economic advantages may not be as powerful, because private cloud capacity typically has to be determined in advance.

Cloud providers are highly motivated to keep their data centers secure from both physical and network hacker attack. A successful and highly publicized attack will likely put them out of business, particularly if the exercise of due diligence comes into question. From the IT manager’s perspective, security must be planned out in detail and will likely include techniques such as restricting port access to a list of preauthorized IP addresses and/or device IDs, as well as full encryption of all data moving into and out of the cloud. I believe that with careful planning and monitoring, it is possible to achieve very high levels of security in the cloud. When you analyze the costs and benefits of a potential move to the cloud, be sure to include the cost of software tools needed to handle data encryption and decryption.

A larger challenge concerns data integration. A well-architected data warehouse will eventually cover many subject areas (such as finance, HR, operations, sales, and marketing). These subject areas should be tied together with a library of conformed fact and dimension tables. This supports “drill-across” queries between fact tables from different subject areas. You want the drill-across queries to operate seamlessly, so the optimal solution may be to put the entire data warehouse into the cloud. Of course, this is a major architectural decision, and may not always be practical.

Another risk is service outages, which can be mitigated by the service-level agreement you have with your cloud provider. The major providers offer redundancy across multiple data centers, which maximizes uptime. Various SLAs are offered that provide uptime guarantees, with monetary consideration offered for unexpected outages. The costs will vary depending on the SLA you select, and you get to determine the cost/benefit trade-off that best fits your situation.

Calculating Your Own Cloud Economics

If you are considering a move to the cloud, you should be able to quantify the financial impact. If you find the financial benefits are marginal, you may not want to take on the risks and challenges discussed above. In this section I provide a basic framework and example you can follow to help you evaluate your situation.

This analysis can become overly complex if you try to follow typical accounting concepts such as CAPEX, depreciation, operating expenses (OPEX), and return on investment (ROI). A simpler and more reliable approach is to use a technique known as discounted cash flow analysis. With this approach, you simply determine the net cash flow over a period of years that will result from your decision to move to the cloud. You apply a discount rate to account for the time value of money. Although the calculations are simple, the underlying assumptions are not, and I highly recommend that you partner with your company’s finance department as well as your corporate tax department to get the most accurate analysis possible.

Step 1: Determine the time frame

You will need to forecast each cost and benefit over a carefully selected time frame. It may be tempting to say a data warehouse has an indefinite lifetime and choose a 10-year (or longer) time frame. However, the world of information technology changes rapidly, and it is notoriously difficult to accurately forecast costs and benefits so far out. You may want to match the time frame with your "in-house" procurement cycle for IT infrastructure. For example, if you purchase new servers every three years, you can consider a three-year time frame, or you might double it to six years, which will compare your cloud costs with two procurement cycles of on-premises hardware. The idea is to select a time period over which cloud and on-premises infrastructure can be compared fairly.

Another important rule is that the time frame starts on the date you will start transitioning to the cloud. You do not consider historic time periods in the analysis; “sunk costs” for existing infrastructure are not appropriate here. That money has already been spent.

Step 2: Determine the costs and benefits

This is the heart of the analysis, and must be thought through very carefully. Here are a few ground rules.


A “benefit” generates a net cash inflow for the firm. Examples include:

■ Avoided hardware purchases

■ Avoided hire of on-premises support staff (salary plus benefits)

■ Avoided lease costs on existing or new hardware

■ Avoided cooling costs and electricity to run hardware in your data center

■ Total cost reductions from eliminating the data center altogether

■ Avoided project delays waiting on the infrastructure team to move data or configure servers

■ Sale of existing hardware to a party outside the firm

■ Avoided software licensing costs (if the cloud solution already embeds these costs in its rates)

A “cost” generates a net cash outflow for the firm, such as:

■ Metered computing time on cloud-based servers

■ Persistent storage costs on cloud-based storage media

■ Reserved capacity charges for cloud-based servers

■ Costs of on-demand support from the cloud provider

■ Costs of a dedicated high-speed pipeline to upload your data to the cloud

■ Costs of preparing your data for migration to the cloud

■ Training costs to acquire skills needed to use cloud infrastructure

■ Costs to move data within the cloud

■ Costs to retrieve your data from the cloud

■ Costs to tune or redesign your solution for optimal performance in the cloud

■ Lease termination fees for existing hardware being retired early

When you are considering a scenario where an existing server is being replaced with cloud capacity, think about what is actually going to happen to that server. If it is going to continue to sit idle in the data center, you cannot claim a benefit. If it is going to be utilized by another department, and thereby eliminates a new hardware purchase, you can claim a benefit. If the existing server is to be sold, you can count that as a benefit. Always think about the cash impact on the organization as a whole.

Another important consideration is the effect of corporate income taxes on the analysis. Because every tax situation is different, I will not dive into the details here. In general, you should quantify the tax impact of each cost/benefit and determine if that impact will occur in the same year as the cost/benefit or if the tax impact is spread over multiple years. For example, if one of the benefits involves the avoided purchase of a server, there will also be a lost tax deduction, which may be spread over a five-year period. I recommend that you work closely with your corporate tax department to determine the tax impacts of your plan.

Step 3: Obtain the discount rate

The discount rate, also known as the hurdle rate, factors in the time value of money. The general principle is that a dollar paid or received in the future is worth less than a dollar paid or received today. Factors such as inflation, risk, and required shareholder return on equity are key determinants in setting the discount rate. The value will vary by industry due to the shareholder equity component.

Do not try to arrive at the discount rate on your own; instead, ask your corporate finance department for the rate, which will be used across the enterprise for all capital budgeting decisions, not just those involving migration to the cloud.

Once you have the discount rate, you can use a simple mathematical formula to calculate a present value (PV) factor for each year: PV factor = 1/(1+i)^(n-1), where i is the discount rate and n is the year. This formula assumes the cash flows occur at the beginning of the year, so in year one, the PV factor will be 1. In year two, assuming the discount rate is 13 percent, the formula resolves to 1/(1+0.13)^1 = 0.8850. In year three it will be 1/(1+0.13)^2 = 0.7831. You will multiply the annual cash flow for each year by this factor to get the discounted cash flow for that year. You can readily see the value of deferring costs wherever possible. The examples in the next step will further illustrate the concept.
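
As a worked illustration, here is a minimal Python sketch of the PV factor formula and the discounting step, assuming the beginning-of-year cash flow convention described above; the cash flow stream is a hypothetical placeholder.

def pv_factor(discount_rate: float, year: int) -> float:
    # PV factor = 1 / (1 + i)^(n - 1); with beginning-of-year cash flows, year 1 is undiscounted.
    return 1.0 / (1.0 + discount_rate) ** (year - 1)

i = 0.13  # discount rate supplied by the corporate finance department
for year in range(1, 6):
    print(f"Year {year}: PV factor = {pv_factor(i, year):.4f}")  # 1.0000, 0.8850, 0.7831, ...

# Discounting a hypothetical stream of annual costs of $100,000 paid at the start of each year:
cash_flows = [100_000] * 5
present_value = sum(cf * pv_factor(i, year) for year, cf in enumerate(cash_flows, start=1))
print(f"Present value of the stream: {present_value:,.0f}")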

Step 4: Calculate the net present value

The final step is to summarize the costs and benefits to determine the net present value over the time period you choose. I will present an example comparing an on-premises Hadoop installation with two different cloud offerings based on different pricing models. These are not meant to be actual costs but are estimates that illustrate the concepts as explained. I have also ignored income taxes for simplicity.

General assumptions

■ 5 TB capacity, growing at 25 percent per year

■ Hadoop storage requirements at 50 percent of raw data

■ Close to 24/7 utilization

■ Five-year time frame

■ Discount rate is 13% (provided by the corporate finance department)

On-premises Hadoop assumptions (Scenario 1, Table 1)

■ Hardware: $1,000/TB, with $5,000 minimum purchase

■ Maintenance: $100/TB/year

■ Floor space/power/cooling: $300/TB/year

■ System management software: $3,500/year

■ FTE support

● System administrator: 1 FTE at $150,000/year salary and benefits

● Hadoop programmer: 1 FTE at $150,000/year salary and benefits

Table 1: Costs/benefits for an on-premises Hadoop solution.

On-Premises Hadoop
Year   | Data (TB) | Hardware | Maintenance | Power/Cooling/Floor Space | System Mgmt Software | Support FTEs | Total Spend (a) | PV Factor (b) | Annual PV (a*b)
Year 1 | 5         | 5,000    | 500         | 1,500                     | 3,500                | 300,000      | 310,500         | 1.0000        | 310,500
Year 2 | 6         | 625      | 625         | 1,875                     | 3,500                | 300,000      | 306,625         | 0.8850        | 271,350
Year 3 | 8         | 781      | 781         | 2,344                     | 3,500                | 300,000      | 307,406         | 0.7831        | 240,744
Year 4 | 10        | 977      | 977         | 2,930                     | 3,500                | 300,000      | 308,384         | 0.6931        | 213,726
Year 5 | 12        | 1,221    | 1,221       | 3,662                     | 3,500                | 300,000      | 309,604         | 0.6133        | 189,886
Totals |           | 8,604    | 4,104       | 12,311                    | 17,500               | 1,500,000    | 1,542,519       |               | 1,226,206


Cloud provider 1 assumptions (Scenario 2, Table 2)

■ $7,500/month flat rate service fee

■ 3–15 TB range of Hadoop storage at about $7,500/month

■ FTE support: 1/3 FTE at $150,000/year salary and benefits = $50,000/year

Cloud provider 2 assumptions (Scenario 3, Table 3)

■ $1,000/TB/year reserved instance fees

■ FTE support: 2/3 FTE at $150,000/year salary and benefits = $100,000/year

In the first scenario (on-premises Hadoop), the hardware is relatively inexpensive, but the ongoing support is labor-intensive and costly. To keep this system running at high utilization requires a Hadoop programmer and a system administrator. Considering that the system must be supported 24/7, I have assumed two full-time equivalents (FTEs) to cover these roles. This becomes a huge cost component in the on-premises Hadoop solution. We also have to consider costs for cooling, floor space, maintenance, and system management software licenses.

In the second scenario, cloud provider 1 offers a "big data service" at a flat rate of $90,000 per year based on a range from 3 TB to 15 TB of total storage. They provide 24/7 uptime and handle all engineering and administrative tasks to keep a big data solution running. For this scenario, I have assumed one-third of an FTE to support administration (paying the monthly bill) and minor tuning efforts involved with this type of service.

In the third scenario, cloud provider 2 offers a solution based on a per-TB pricing model. However, the DBA functions to tune the solution are not included, so I have increased the FTE allocation here to two-thirds of an FTE. This is a bit more of a “pay-as-you-go” model based on the actual storage you wish to purchase.

Table 4 shows a summary of the total costs over five years for all three scenarios.

The figures to compare are the present value dollars because they take into account the time value of money. Based on the assumptions provided, Cloud Provider 2 is the best option from a purely economic perspective.
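
For readers who want to check the arithmetic, here is a minimal Python sketch that rebuilds the Table 4 comparison from the annual Total Spend columns of Tables 1 through 3; it uses the same 13 percent discount rate and beginning-of-year convention, so the results match the published totals to within a few dollars of rounding.

def pv_factor(rate: float, year: int) -> float:
    # Beginning-of-year convention: year 1 is undiscounted.
    return 1.0 / (1.0 + rate) ** (year - 1)

# Annual Total Spend streams taken from Tables 1, 2, and 3.
scenarios = {
    "On-Premises Hadoop":                [310_500, 306_625, 307_406, 308_384, 309_604],
    "Cloud Provider 1 (flat rate)":      [140_000] * 5,
    "Cloud Provider 2 (pricing per TB)": [105_000, 106_250, 107_813, 109_766, 112_207],
}

rate = 0.13
for name, spend in scenarios.items():
    nominal = sum(spend)
    present_value = sum(s * pv_factor(rate, y) for y, s in enumerate(spend, start=1))
    print(f"{name}: nominal = {nominal:,.0f}, present value = {present_value:,.0f}")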

One clear conclusion from these comparisons is that the cost of human labor is a key consideration that can swing the analysis dramatically. Before you make a decision, you must understand the expected level of support required to keep a solution running, whether it is on-premises or in the cloud. Cloud providers are masters at automation, and many of the administrative tasks you might otherwise be doing manually have likely been fully automated.

Summary

Every business will strive to be the low-cost producer in its marketplace for a given level of quality. Information technology plays a vital role, but is still considered an overhead cost that should be minimized. Excess capacity is particularly wasteful and cannot easily be shed in an economic downturn. Finally, human labor is one of the most expensive costs, and it's more efficient to use human talent for generating creative business ideas rather than for managing computing infrastructure.

Cloud computing solves many of these problems by pooling scarce resources, exploiting economies of scale, and automating menial administrative tasks to an extent that most IT shops cannot match. By optimizing the allocation of scarce resources, cloud computing reduces costs to individual companies and benefits society as a whole. It is not practical to expect that all computing will move to the cloud, but free market forces will continue to spur adoption of cloud-based technologies. Practitioners of BI should anticipate this trend and help their companies as they transition more computing resources to the cloud. ■

Reference

Rhoton, John [2009]. Cloud Computing Explained, Recursive Press.


Table 2: Costs/benefits for cloud provider 1.

Cloud Provider 1: Flat Annual Fee
Year   | Data (TB) | Service Fees | Support FTEs | Total Spend (a) | PV Factor (b) | Annual PV (a*b)
Year 1 | 5         | 90,000       | 50,000       | 140,000         | 1.0000        | 140,000
Year 2 | 6         | 90,000       | 50,000       | 140,000         | 0.8850        | 123,894
Year 3 | 8         | 90,000       | 50,000       | 140,000         | 0.7831        | 109,641
Year 4 | 10        | 90,000       | 50,000       | 140,000         | 0.6931        | 97,027
Year 5 | 12        | 90,000       | 50,000       | 140,000         | 0.6133        | 85,865
Totals |           | 450,000      | 250,000      | 700,000         |               | 556,427


Table 3: Costs/benefits for cloud provider 2.

Cloud Provider 2: Pricing per TB
Year   | Data (TB) | License Fees | Support FTEs | Total Spend (a) | PV Factor (b) | Annual PV (a*b)
Year 1 | 5         | 5,000        | 100,000      | 105,000         | 1.0000        | 105,000
Year 2 | 6         | 6,250        | 100,000      | 106,250         | 0.8850        | 94,027
Year 3 | 8         | 7,813        | 100,000      | 107,813         | 0.7831        | 84,433
Year 4 | 10        | 9,766        | 100,000      | 109,766         | 0.6931        | 76,073
Year 5 | 12        | 12,207       | 100,000      | 112,207         | 0.6133        | 68,819
Totals |           | 41,036       | 500,000      | 541,036         |               | 428,352

Table 4: Overall cost comparison.

Overall Cost Comparison
Scenario                             | Nominal Dollars | Present Value Dollars
On-Premises Hadoop                   | 1,542,519       | 1,226,206
Cloud Provider 1 (flat rate pricing) | 700,000         | 556,427
Cloud Provider 2 (pricing per TB)    | 541,036         | 428,352


Wayne Yaddow is a senior data warehouse, ETL, and BI report QA analyst working as a consultant in the NYC financial industry. He has spent 20 years helping organizations implement data quality and data integration strategies. [email protected]

Meeting the Fundamental Challenges of Data Warehouse Testing

Wayne Yaddow

Abstract

Decisions in today's organizations are happening increasingly in real time, and the systems that support business decisions must be high quality. People sometimes confuse testing data warehouses that produce business intelligence (BI) reports with back-end or database testing or with testing the BI reports themselves. Data warehouse testing is much more complex and diverse than that. Nearly everything in BI applications involves data, and data is the most important component of intelligent decision making.

This article identifies three primary data warehouse testing challenges and offers approaches and guidelines to help you address them. The best practices and the test methodology presented here are based on practical experiences testing BI/DW applications.

Knowing the Challenges

Not all data warehousing challenges are complex, but they are diverse. Unlike other software projects, data warehousing projects are not developed with a front-end application in mind. Instead, they focus on the back-end infrastructure that supports the front-end client reporting.

Knowing these challenges early, and using them as an agenda, provides a good technique for solving them. They fall into three major categories:

1. Identifying each necessary data warehouse test focus and associated project document needed for test planning


2. Confirming that all stages of the extract, transform, and load (ETL) process meet quality and functional requirements

3. Identifying qualified testing staff and skills

What does this mean for testing your data warehouse? Over time, a frequently changing and competitive market will raise new functional requirements for the data warehouse. Because enterprises must comply with new and changing compulsory legal and regulatory requirements, they need frequent releases and upgrades, so data and related applications must be tested several times throughout the year.

Before describing the challenges of testing data warehouses, we review some of the key characteristics that define most data warehouse projects:

■ Diverse sources of data feed systems

■ Complex environments containing multiple databases, file systems, application servers, programming tools, and database programming languages

■ Wide variety of rules to process and load data

■ Dynamic requirements for presenting or working with data

■ Complex data set types such as files, tables, queues, and streams

■ Data may be structured, semi-structured, or unstructured

■ High dependency on the interfacing systems and error/exception handling

Test planning for your BI/DW project should be designed to overcome the most significant challenges for data warehouse testers. We look at three such challenges in this article.

Figure 1: Testing the data warehouse—test focal points for quality assurance (graphic courtesy of Virtusa Corp.). The graphic maps detailed validation points to four focal areas: database testing, data validation testing, GUI and business rule testing, and performance testing.



Challenge 1: Identifying Test Focus and Documentation for Data Warehouse Test Planning

Because data warehouse testing is different from most software testing, a best practice is to break the testing and validation process into several well-defined, high-level focal areas for data warehouse projects. Doing so allows targeted planning for each focus area, such as integration and data validation.

■ Data validation includes reviewing the ETL mapping encoded in the ETL tool as well as reviewing samples of the data loaded into the test environment.

■ Integration testing tasks include reviewing and accepting the logical data model captured with your data modeling tool (such as ERwin or your tool of choice), converting the models to actual physical database tables in the test environment, creating the proper indexes, and testing the ETL programs created by your ETL tool or procedures.

■ System testing involves increasing the volume of the test data to be loaded, estimating and measuring load times, and placing data into either a high-volume test area or in the user acceptance testing (UAT) and, later, production environments.

■ Regression testing ensures that existing functionality remains intact each time a new release of ETL code and data is completed.

■ Performance and scalability tests assure that data loads and queries perform within expected time frames and that the technical architecture is scalable.

■ Acceptance testing includes verification of data model completeness to meet the reporting needs of the specific project, reviewing summary table designs, validation of data actually loaded in the production data warehouse environment, a review of the daily upload procedures, and finally application reports.

Few organizations discard the databases on which new or changed applications are based, so it is important to have reliable database models and data mappings when your data warehouse is first developed, then keep them current when changes occur. Consider developing the following documents, which most data warehouse testers need:

Source-to-target mapping. The backbone of a successful BI solution is an accurate and well-defined source-to-target mapping of each metric and the dimensions used. Source-to-target data mapping helps designers, developers, and testers understand where each data source is and how it transitioned to its final displayed form. Source-to-target mappings should identify the original source column names for each source table and file, any filter conditions or transformation rules used in the ETL processes, the destination column names in the data warehouse or data mart, and the definitions used in the repository (RPD file) for the metric or dimension. This helps you derive a testing strategy focused more on the customized elements of the solution.

Data models. Data warehouse models are crucial to the success of your data warehouse. If they are incorrect or nonexistent, your warehouse effort will likely lose credibility. All project leaders should take the necessary time to develop the data warehouse data model. For most data warehouses, a multi-month building effort with a highly experienced data warehouse modeler may be needed after the detailed business requirements are defined. Again, only a very experienced data warehouse modeler should build the model. It may be the most important skill on your data warehouse team.

The data architecture and model is the blueprint of any data warehouse and understanding it helps you grasp the bigger picture of a data warehouse. The model helps stakeholders understand the key relationships between the major and critical data sources.

We stress the importance of getting your data model right because fixing it might require a great deal of effort, in addition to stalling your project. We’ve seen projects with models so corrupted that it almost makes sense to start from scratch.


Change management. Several factors contribute to information quality problems. Changes in source systems often require code changes in the ETL process. For example, in a particular financial institution, the ETL process corresponding to the credit risk data warehouse has approximately 25 releases each quarter. Even with appropriate quality assurance methods, there is always room for error. The following types of potential defects can occur when ETL processes change:

■ Extraction logic that excludes certain types of data that were not tested.

■ Transformation logic may aggregate two different types of data (e.g., car loan and boat loan) into a single category (e.g., car loan). In some cases, transformation logic may exclude certain types of data, resulting in incomplete records in the data warehouse.

■ Current processes may fail due to system errors or transformation errors, resulting in incomplete data loading. System errors may arise when source systems or extracts are not available or when source data has the incorrect format. Transformation errors may also result from incorrect formats.

■ Outdated, incomplete, or incorrect reference and lookup data leads to errors in the data warehouse. For example, errors in the sales commission rate table may result in erroneous commission calculations.

■ Data quality issues including source system data that is incomplete or inconsistent. For example, a customer record in the source system may be missing a ZIP code. Similarly, a source system related to sales may use an abbreviation of the product names in its database. Incompleteness and inconsistency in source system data will lead to quality issues in the data warehouse.

Challenge 2: Identifying the Precise Focus of ETL Testing

Well-planned extraction, transformation, and load testing should be a high priority and appropriately focused. Reconciliation of warehouse data with source system data (data that feeds the data warehouse) is critical to ensure business users have confidence that all reported information is accurate according to business requirements and data mapping rules.

Why is ETL testing so complex? Most of the action occurs behind the scenes; output displayed in a report or as a message to an interfacing system is just the tip of the iceberg. There is always more underlying data behind such reports and messages. The combinations can be virtually endless depending on the type of application and business/functional logic. Furthermore, enterprises are dependent on various business rules as well as different types of data (such as transactional, master, and reference data) that must be considered.

The environments to be tested are complex and heterogeneous. Multiple programming languages, databases, data sources, data targets, and reporting environments are often all integral parts of the solution. Information for test planning can come from functional specifications, requirements, use cases, user stories, legacy applications, and test models—but are they complete and accurate? What about conflicts in input specifications? Many data types must be managed, including files, tables, queues, streams, views, and structured and unstructured data sets.

How do you represent test data? How do the testers prepare test data? How do they place it where it belongs? How much time will this take, and how much time should it take?

You can perform testing on many different levels, and you should define them as part of your ETL testing strategy. Examples include the following:

Constraint testing. The objective of constraint testing is to validate unique constraints, primary keys, foreign keys, indexes, and relationships. Test scripts should include these validation points. ETL test cases can be developed to validate constraints during the loading of the warehouse. If you decide to add constraint validation to the ETL process, the ETL code must validate all business rules and relational data requirements.
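
As one way to put this into practice, the sketch below runs two typical constraint checks with plain SQL: duplicate primary key values and orphaned foreign keys. The dim_customer and fact_sales tables, their columns, and the in-memory SQLite database are hypothetical stand-ins; in a real test the queries would run against your warehouse.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_key INTEGER, customer_name TEXT);
    CREATE TABLE fact_sales   (sale_id INTEGER, customer_key INTEGER, amount REAL);
    INSERT INTO dim_customer VALUES (1, 'Acme'), (2, 'Globex'), (2, 'Globex');  -- duplicate key
    INSERT INTO fact_sales   VALUES (10, 1, 99.0), (11, 3, 45.0);               -- orphaned key 3
""")

# Multiple rows sharing a primary key value violate uniqueness.
dup_keys = conn.execute("""
    SELECT customer_key, COUNT(*) AS row_count
    FROM dim_customer
    GROUP BY customer_key
    HAVING COUNT(*) > 1
""").fetchall()

# Fact rows whose foreign key has no matching dimension row are orphans.
orphans = conn.execute("""
    SELECT f.sale_id, f.customer_key
    FROM fact_sales f
    LEFT JOIN dim_customer d ON d.customer_key = f.customer_key
    WHERE d.customer_key IS NULL
""").fetchall()

print("Duplicate primary keys:", dup_keys)  # [(2, 2)]
print("Orphaned foreign keys:", orphans)    # [(11, 3)]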


If you automate testing in any way, ensure that the setup is done correctly and maintained throughout the ever-changing requirements process. An alternative to automation is to use manual queries. For example, you can create SQL queries to cover all test scenarios and execute these tests manually.

Source-to-target count comparisons. The objective of "count" test scripts is to determine if the record counts in corresponding source data match the expected record counts in the data warehouse target. Counts will not always match; for example, duplicates may be dropped or data may be cleansed to remove unwanted records or field data. Some ETL processes are capable of capturing record count information (such as records read, records written, or records in error). When the ETL process can capture that level of detail and create a list of the counts, allow it to do so. It is always a good practice to use SQL queries to double-check the source-to-target counts.
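
A minimal sketch of such a count comparison follows. The staging and warehouse tables are hypothetical and live in one in-memory SQLite database for the sake of a runnable example; in practice the two counts typically come from separate source and warehouse connections, and the expected difference comes from the mapping documentation.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_orders (order_id INTEGER, status TEXT);
    CREATE TABLE dw_orders  (order_id INTEGER, status TEXT);
    INSERT INTO src_orders VALUES (1, 'OK'), (2, 'OK'), (2, 'OK'), (3, 'REJECT');
    INSERT INTO dw_orders  VALUES (1, 'OK'), (2, 'OK');
""")

source_count = conn.execute("SELECT COUNT(*) FROM src_orders").fetchone()[0]
target_count = conn.execute("SELECT COUNT(*) FROM dw_orders").fetchone()[0]

# Counts need not match exactly: here one duplicate and one rejected record were dropped,
# so the documented expectation is a difference of 2. Flag anything outside that expectation.
expected_dropped = 2
if source_count - target_count != expected_dropped:
    print(f"FAIL: source={source_count}, target={target_count}, "
          f"expected to drop {expected_dropped}")
else:
    print(f"PASS: source={source_count}, target={target_count} (dropped {expected_dropped})")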

Source-to-target data validation. No ETL process is smart enough to perform source-to-target field-to-field validation. This piece of the testing cycle is the most labor-intensive and requires the most thorough analysis of the data. There are a variety of tests you can perform during source-to-target validation; a minimal query sketch follows the list below. For example, verify that:

■ Primary and foreign keys were correctly generated

■ Not-null fields were populated properly

■ There was no improper data truncation in each target field

■ Target table data types and formats are as specified
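
Here is a minimal sketch of two of these field-level checks, not-null population and possible truncation, using a hypothetical dw_customer table in an in-memory SQLite database; the column names and declared width are assumptions, not taken from the article.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dw_customer (customer_key INTEGER, zip_code TEXT, customer_name TEXT);
    INSERT INTO dw_customer VALUES
        (1, '10001', 'Acme Corporation'),
        (2, NULL,    'Globex'),
        (3, '60601', 'A very long customer');  -- exactly 20 characters
""")

# Not-null check: the mapping document declares zip_code as mandatory.
null_zips = conn.execute(
    "SELECT customer_key FROM dw_customer WHERE zip_code IS NULL"
).fetchall()
print("Rows with missing zip_code:", null_zips)  # [(2,)]

# Truncation check: values that fill the declared column width are suspects worth inspecting.
declared_width = 20
suspects = conn.execute(
    "SELECT customer_key, customer_name FROM dw_customer WHERE LENGTH(customer_name) >= ?",
    (declared_width,),
).fetchall()
print("Possible truncation at declared width:", suspects)  # [(3, 'A very long customer')]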

Transformations of source data and application of business rules. You must test to verify all possible outcomes of the transformation rules, default values, and straight moves as specified in the business requirements and technical specification document.

Users who access information from the data warehouse must be assured that the data has been properly collected and integrated from various sources, after which it has been transformed to remove inconsistencies, then stored in formats according to business rules. Examples of transformation testing include the following:

■ Table look-ups. When a code is found in a source field, does the system access the proper table and return the correct data to populate the target table?

■ Arithmetic calculations. Were all arithmetic calculations and aggregations performed correctly? When the numeric value of a field is recalculated, the recalculation should be tested. When a field is calculated and automatically updated by the system, the calculation must be confirmed. As a special mention, you must ensure that when you apply business rules, no data field exceeds its boundaries (value limits). (A sketch of this kind of reconciliation appears after this list.)
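
As an illustration of this kind of reconciliation, the sketch below recomputes an aggregate from hypothetical source rows and compares it with the value loaded into a hypothetical target table; the table names, columns, and rounding rule are assumptions.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_sales    (region TEXT, amount REAL);
    CREATE TABLE dw_sales_agg (region TEXT, total_amount REAL);
    INSERT INTO src_sales    VALUES ('EAST', 100.25), ('EAST', 49.75), ('WEST', 10.00);
    INSERT INTO dw_sales_agg VALUES ('EAST', 150.00), ('WEST', 11.00);  -- WEST is wrong
""")

# Recompute the aggregation from the source and flag target rows that do not reconcile.
mismatches = conn.execute("""
    SELECT d.region, d.total_amount AS loaded, ROUND(SUM(s.amount), 2) AS recomputed
    FROM dw_sales_agg d
    JOIN src_sales s ON s.region = d.region
    GROUP BY d.region, d.total_amount
    HAVING d.total_amount <> ROUND(SUM(s.amount), 2)
""").fetchall()

print("Aggregates that do not reconcile:", mismatches)  # [('WEST', 11.0, 10.0)]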

Batch sequence dependency testing. A data warehouse ETL run is essentially a set of processes that execute in a defined sequence, and dependencies often exist among the various processes. Therefore, it is critical to maintain the integrity of your data: executing the sequences in the wrong order might result in inaccurate data in the warehouse. The testing process must include multiple iterations of the end-to-end execution of the entire batch sequence. Data must be continually verified and checked for integrity during this testing.

Job restart testing. In a production environment, ETL jobs and processes fail for a variety of reasons (for example, database-related or connectivity failures). Jobs can fail when only partially executed. A good design allows for the ability to restart any job from its failure point. Although this is more of a design suggestion, every ETL job should be developed and tested for restart capability.

Error handling. During process validation, your testing team will identify additional data cleansing needs and identify consistent error patterns that might be averted by modifying the ETL code. It is the responsibility of your validation team to identify any and all suspect records. Once a record has been both data and process validated and the script has passed, the ETL process is functioning correctly.

Views. You should test views created with table values to ensure the attributes specified in the views are correct and the data loaded in the target table matches what is displayed in the views.

Sampling. These tests involve selecting a representative portion of the data to be loaded into target tables. Predicted results, based on mapping documents or related specifications, will be matched to the actual results obtained from the data loaded. Comparison will be verified to ensure that the predictions match the data loaded into the target table.

Duplicate testing. You must test for duplicates at each stage of the ETL process and in the final target table. This testing involves checks for duplicate rows and checks for multiple rows with the same primary key, neither of which can be allowed.

Performance. This is the most important aspect after data validation. Performance testing should check that the ETL process completes within the load window specified in the business requirements.

Volume. Verify that the system can process the maximum expected data volume for a given cycle in the time expected. This testing is sometimes overlooked or conducted late in the process; it should not be delayed.

Connectivity. As its name suggests, this involves testing the upstream and downstream interfaces and the intra-data-warehouse connectivity. We suggest that the testing examines the exact transactions between these interfaces. For example, if the design approach is to extract the files from a source system, test extracting a file from the system, not just the connectivity.

Negative testing. Check whether the application fails where it should when given invalid inputs and out-of-boundary scenarios.

Operational readiness testing (ORT). This is the final phase of testing; it focuses on verifying the deployment of software and the operational readiness of the application. During this phase you will test:

■ The solution deployment

■ The overall technical deployment “checklist” and time frames

■ The security of the system, including user authentication and authorization and user access levels

Evolving needs of your business and changes in source systems will drive continuous change in the data warehouse schema and the data being loaded. Hence, development and testing processes must be clearly defined, followed by impact analysis and strong alignment between development, operations, and the business.

Challenge 3: Choosing Qualified Testers for Data Warehouse QA Efforts

The impulse to cut costs is often strong, especially in the final delivery phase. A common mistake is to delegate testing responsibilities to resources with limited business and data testing experience.

Our best-practice recommendations to help you choose the best testers follow.

Identify crucial tester skills. The data warehouse testing lead and other hands-on testers are expected to demonstrate extensive experience in their ability to design, plan, and execute database and data warehouse testing strategies and tactics to ensure data warehouse quality throughout all stages of the ETL life cycle.

Recent years have seen a trend toward business analysts, ETL developers, and even business users planning and conducting data warehouse testing. This may be risky. Among the required skills for data warehouse testers are the following:


■ A firm understanding of data warehouse and database concepts

■ The ability to develop strategies, test plans, and test cases specific to data warehousing and the enterprise’s business

■ Advanced skill with SQL queries and stored procedures

■ In-depth understanding of the organization’s ETL tool (for example, Informatica, SSIS, DataStage, etc.)

■ An understanding of project data and metadata (data sources, data tables, data dictionary, business terminology)

■ Experience with data profiling using associated methods and tools

■ The ability to create effective ETL test cases and scenarios based on ETL database loading technology and business requirements

■ Understanding of data models, data mapping documents, ETL design, and ETL coding

■ The ability to communicate effectively with data warehouse designers and developers

■ Experience with multiple DB systems, such as Oracle, SQL Server, Sybase, or DB2

■ Troubleshooting of the ETL (e.g., Informatica/ DataStage) sessions and workflows

■ Skills for deployment of DB code to databases

■ Unix/Linux scripting, Autosys, Anthill, etc.

■ Use of Microsoft Excel and Access for data analysis

■ Implementation of automated testing for ETL processes

■ Defect management and tools

Conclusions

We have highlighted several approaches and solutions for key data warehouse testing challenges following a concept-centered approach that expands on successful methods from multiple respected resources. The testing challenges and proposed solutions described here combine an understanding of the business rules applied to the data with the ability to develop and use QA procedures that check the accuracy of entire data domains—i.e., both source and data warehouse targets.

Suggested levels of testing rigor frequently require additional effort and skilled resources. However, by employing these methods, BI/DW teams can be assured of data quality from day one of their data warehouse implementation. This will build end users' confidence in data warehouses and will ultimately lead to more effective BI/DW implementations.

Testing the data warehouse and BI applications requires good testing skills as well as active participation in requirements gathering and design phases. Additionally, in-depth knowledge of BI/DW concepts and technology is crucial so that one may comprehend the end-user requirements and therefore contribute to a reliable, efficient, and scalable design.

We have presented the numerous flavors of testing involved in assuring the quality of BI/DW applications. Our emphasis is on the early adoption of a standardized testing approach, customized as required for your specific projects, to ensure a high-quality product with minimal rework. ■


Linda Briggs writes about technology in corporate, education, and government markets. She is based in San Diego. [email protected]

Bagel Chain Serves Up Happy Users with Move to Mobile Dashboards

Linda Briggs

For pleasing users, perhaps nothing beats a move from serving up traditional grid reports on laptops to rolling out iPads with dashboards and visual analytics. Just ask Dan Cunningham, senior vice president of information technology at Einstein Noah Restaurant Group. After he and his team moved managers from Microsoft Excel reports to brightly colored dashboards displaying store performance data earlier this year, “we’ve been getting fantastic feedback.” Cunningham, an IT veteran, says the project has been as well received as anything he’s ever rolled out: “I’ve never had an implementation go as well as this one from an end-user viewpoint.”

Einstein Noah, with headquarters in Lakewood, Colorado, is the nation's largest operator of bagel bakeries; it's a leader in the "fast casual" segment of the restaurant industry. Retail brands include the familiar names Einstein Bros. Bagels, Noah's New York Bagels, and Manhattan Bagel. The company employs nearly 7,000 people in company-owned stores and in its support center. Across the three brands, there are approximately 860 locations, all in the U.S.

For its mobile enterprise analytics project, the company selected MicroStrategy; it’s also building and populating a new enterprise data warehouse as part of the project. Adding another twist, agile development methods were employed on the BI project—a first for Einstein Noah.

It’s all part of a multi-year BI initiative launched in 2013. The overall plan: to migrate from Excel-based reporting to a more visual mobile application that could display operational performance at the store, region, and brand level graphically. MicroStrategy was selected

Page 38: THE LEADING PUBLICATION FOR BI, DATA WAREHOUSING, AND …/media/6397D94CDAA04B4BB464390F54ED... · 2014. 9. 15. · the Hadoop framework components most relevant to the data warehouse

36 BUSINESS INTELLIGENCE JOURNAL • VOL. 19, NO. 3

for the project, Cunningham says, based on “completeness of solution,” along with the fact that Einstein Noah has been a MicroStrategy user for over 10 years (the company originally licensed the product through a back-office reporting solution, then upgraded to full versions of MicroStrategy). The company also uses QlikView and some Microsoft SQL, but “MicroStrategy will be the center-piece and the focus point of our BI strategy,” Cunningham says.

Currently, the company has a data warehouse containing point-of-sale data such as sales totals and transactions. Moving forward, that data will be migrated into the new data warehouse, and a wider variety of data sources added. Part of the project includes better data governance, including master data management. "We're implementing a stricter data governance approach," Cunningham says, "as we pick up pieces of data from different sources." Those sources include Web-based applications, data from existing Microsoft Access and SQL databases, supply chain data, flat files, and more. "You name it, we have it as a data source right now," Cunningham says, "and we're in the process of migrating them all into the new data warehouse."

Because all reporting on mobile devices using the MicroStrategy app will draw from the new enterprise data warehouse, Cunningham is using that fact to encourage users to be patient as enhanced data governance rules kick in and other changes are gradually implemented.

The rollout involves deploying and extending MicroStrategy's mobile app, which provides a variety of store-level information to the management team. The information, previously delivered in traditional Excel reports, has been shifted to dashboards and additional data added from other sources, all in a highly visual and interactive interface.

Currently, users of the new app are Einstein Noah's senior operations management group, including area business managers, directors of operations, regional VPs, senior leadership including the company's CFO and CEO, and Cunningham's enterprise solutions team. In total, the app has been rolled out to about 60 users so far—all of whom are mobile users at least some of the time.

The next step is to extend the application to store managers, who can run the software in stores via a Web-based solution on their laptops. Eventually, Cunningham hopes to phase in smartphones, including both Android and iOS devices.

For now, iPad users especially are delighted with the new system and are sending plenty of ideas to Cunningham's team. "We have a good year's worth of requests for things people want us to deliver on the iPad," he says. "Eventually we'll [add] smartphones, but right now, people are ecstatic about not having to pull out their laptops."

The Importance of Design

Training has been minimal because most users were already comfortable with the basic functions of an iPad. Good dashboard and app design also played a role in minimizing training. "My team did a very good job making the information easily visible," Cunningham says, so users needn't waste time hunting for the data they wanted. "People were already comfortable with the iPad," he says, "and MicroStrategy's app takes advantage of that. ... No one has come back to us and said, 'Hey, we're not using it because we can't figure it out.'"

To design the dashboard, the Einstein Noah team worked with a consultant from the enterprise consulting firm SmartBridge who "had a terrific eye for design," Cunningham says. As a general design rule, the team tried to avoid basing dashboards on prior reporting solutions that managers might have used. "Mobile is a new environment and a new way to look at things," Cunningham points out. "Users open the dashboard and they're not comparing it to a reporting solution; they're comparing it to other apps."

How quickly the mobile app responds in the field is also critical, he says. As with design, user expectations for speed aren't set by prior reporting solutions but by other mobile apps. Happily, the dashboards, which are pulling data directly from the new data warehouse, render within seconds. "We may have to do something differently in the future," Cunningham says, "but we've had no issues with performance to date."

One example of use in the field is a customer feedback survey in which a customer can use a sales receipt to log on to a specific site and give immediate feedback after a transaction. Customer responses are loaded daily into the enterprise data warehouse, where area store managers can see them immediately. Those customer responses are added to other data—all of it available to regional managers, who can then visit a store at any time and view current information on the spot, such as performance over time and customer comments. "We weren't providing this kind of information at all in the past via Excel-based reports," Cunningham says.

As a big fan of visual analytics, Cunningham likes that the dashboards use colorful, easy-to-read heat maps, diagrams in various colors to clearly spotlight sales growth and problem areas, and other visually appealing displays. Those images, he says, make it far easier for managers to discern patterns than a spreadsheet would; a manager who sees an issue can click on a visual image to drill down into store details.

“It’s just a much more powerful tool than grid-based reports for looking at things like large trends across the organization,” Cunningham says. “You get a much more rapid return of information. ... From just a spreadsheet file, you wouldn’t know what you need to focus on.”

Leveraging Agile Development

Rather than a more traditional "waterfall" development method, the project used an agile development methodology—a first for the company, Cunningham says, and a big contributor to the overall success of the project. More traditional development was used early on—for example, in reviewing hardware options. Once the basics were behind them and the core environment was built, the team moved to agile development with great success.

"We were very strict about [following agile development dictates]," he says, from collecting input from end users, conducting stand-up morning meetings, building quick iterations of the software within a few weeks, then repeating the cycle. "If something doesn't work exactly as end users anticipated, we can modify it in the next sprint, or if a new priority comes up, we can quickly add that in." The process has greatly shortened development cycles and enabled users to quickly see what they've requested.

Future plans include incorporating information such as store audit data in the data warehouse, to be made available through the MicroStrategy app. An area business manager will be able to plug in information from a store site, such as temperatures in the refrigerators. Previously, storing that information would have meant turning on a laptop, booting it up, opening a spreadsheet through Microsoft SharePoint, typing in the information, and saving it. Now, it can be entered into a dashboard on the iPad with a few strokes. That information can also be compiled across stores, whereas it was specific to each location previously.

The data also will be entered into the data warehouse, where it can be correlated specifically with store performance. "If we see a strong correlation between positive store audits and positive transactions," Cunningham says, "that will be apparent. In the past, we couldn't correlate that data to store performance. ... Overall, it's just another way we're helping [users] to be able to focus on the details they really need." ■


BI Experts' Perspective: BI in Manufacturing

Bhargav Mantha, Keith Manthey, Brian Valeyko, and Coy Yonce

Bhargav Mantha is a manager at ZS Associates ([email protected]). Keith Manthey is vice president for Equifax, Inc. ([email protected]). Brian Valeyko is director of EDW, BI, and big analytics for NCR Corporation ([email protected]). Coy Yonce is the product owner for software solutions from EV Technologies, a services and software partner with SAP ([email protected]).

Nicole Mercer is the BI director for Everything Rugs, which manufactures and sells indoor and outdoor rugs to big box and specialty stores. Most of her team's efforts are on descriptive analytics—running queries, reports, dashboards, and special analyses. The team also supports an enterprise BI tool that allows users to create their own reports and dashboards. The data infrastructure is provided by an enterprise data warehouse and dependent data marts for manufacturing, finance, and sales and marketing. All in all, it's a fairly vanilla BI environment.

Nicole is sensing an interest in big data. Senior management has mentioned it. Marketing is talking about sentiment analysis and viral marketing. Although Nicole has been following big data (it is impossible to miss) and her data warehousing vendor has big data platforms, she has many questions. Can you help her with some of them?

1. Her intuition is to start with a small project. Is this the best approach? What are the characteristics of a good starting project? Is there a specific project you would recommend?

2. Any new project will have to go through Everything Rugs' usual corporate approval process. In fact, the platforms her data warehousing vendor offers would require special funding. Is there anything unique about big data projects that Nicole should be aware of to get funding for the project? How should she work with management and the vendor to secure approval?

3. Nicole has developed a well-controlled, centralized data infrastructure. When she was hired, she was able to eliminate most of the independent data marts. She senses that with big data and more platforms, her infrastructure is going to get messier. She is concerned that she may not fully control the new platforms unless she moves fast or in concert with the business units. Is this a common and reasonable concern? What advice do you have for Nicole?

4. Nicole has been reading about Hadoop as a low-cost approach to storing and analyzing big data. Although the Hadoop software (and the Hadoop ecosystem) is free, she knows that making it work together would require considerable effort and possibly outside help. What advice do you have to help Nicole evaluate this alternative?

5. Recently, Nicole heard a speaker mention that more firms are using Hadoop as the platform for processing data from all sources, even structured data currently stored in the data warehouse. The structured data would be processed in Hadoop and then stored in the warehouse. Does this make sense for structured data? What are the benefits and drawbacks of this approach?

BHARGAV MANTHA

1. Nicole should aim for little victories through a small project with defined objectives and clear expected business value. Some key characteristics of a good starting project include:

■ Recruit an executive sponsor from marketing or sales who is eager to make decision making more data-driven.

■ Ensure the business problem is well defined with clear success metrics.

■ Pick a project that can deliver high impact with relatively few resources in a short time frame of 8 to 12 weeks. I recommend analyzing customer churn.

■ Leverage existing technology to create faster value. Nicole shouldn't make huge investments up front in a big data–specific platform. Many big data applications can be programmed using traditional approaches such as SQL.

■ Tap into existing team members’ expertise in data, computation, and business and train them only on specific skills (e.g., clickstream analysis) required to solve the problem.

2. Nicole should spend time with key stakeholders and create a strong business case that identifies the problem, cost, and value. She must share short-term and long-term goals and success factors, and also highlight information such as data sources, technology, and phases—along with associated deliverables.

In addition, Nicole must set appropriate expectations to ensure the project won't be subject to the same rigors and controls as a traditional data warehouse project. She should tell management to expect a period of exploration, hypothesis testing, and refinement.

While working with the vendor, Nicole should ask for case studies and complete due diligence in mapping the business case to the capabilities and limitations of the vendor's big data products. She should negotiate, invest appropriately in the right technology, and ensure the vendor provides adequate support throughout.

Overall, Nicole should leverage her existing resources—technology and people—and not make a huge investment up front.

3. Nicole must modify best practices for enterprise data warehouse (EDW) data integration, data quality, and data modeling by changing the existing infrastructure, tools, and processes.

She also should decide where the big data ecosystem will reside in this centralized data infrastructure. I recommend she create a hybrid model in which the EDW controls the highly structured optimized operational data while the Hadoop-based infrastructure controls highly distributed and volatile data. This model will accomplish business goals faster and help execute specific cases within the more flexible, decentralized big data infrastructure.

4. Many companies view Hadoop as an open source alternative to the DW platform because of its scalability, high fault tolerance, flexibility, ability to handle various data types, and low cost (it’s usually less than $1,000 per terabyte). When weighing it as an alternative, Nicole should:

■ Carefully identify and evaluate Hadoop’s analytic and data management requirements.

■ Create a solution-cost framework to include costs related to deploying Hadoop, people, and training, as well as developing applications, queries, and analytics.

■ Ensure a win by optimizing total cost, risk, and time to value. It is important Nicole neither underestimates nor ignores the real, long-term costs of a big data solution.


■ Employ a flexible architecture that leverages both existing DW technology and Hadoop.

■ Evaluate technologies that have commercialized Hadoop and provide it as a cloud-based service. Consider solutions from companies such as Cloudera, Hortonworks, and MapR.

5. The first step to a successful Hadoop deployment is to determine where it fits in Nicole's data warehouse architecture.

Hadoop seems most compelling as a platform for capturing and storing big data within an extended DW environment, in addition to processing that data for analytics on other platforms. This approach allows firms to protect their investment in their respective EDW infrastructure and also extend it to accommodate the big data environment.

It is beneficial to have Hadoop process structured data—if data volume is very large and the existing hardware is not scalable or is extremely expensive. Other benefits may include:

■ Deploying a scalable and economical ETL environment. By shifting the "T" (transform) to Hadoop, Nicole can dramatically reduce costs and release database capacity and resources for faster query performance (see the sketch after this list).

■ Enabling Nicole to forever keep all data in a readily accessible online environment.
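To picture what "shifting the T to Hadoop" looks like, the following is a minimal, self-contained Python sketch of the map/shuffle/reduce pattern a Hadoop job applies. It only simulates the pattern in a single process, and the record layout (store ID and sale amount) is hypothetical rather than taken from the scenario.

# Conceptual sketch only: simulates the map/shuffle/reduce pattern that a
# Hadoop job would use to do the heavy "T" (transform/aggregation) outside
# the warehouse, so that only summarized rows are loaded into the EDW.
# The record layout (store_id, amount) is hypothetical.
from collections import defaultdict

raw_sales = [
    ("store-001", 12.50), ("store-002", 7.25),
    ("store-001", 3.00),  ("store-003", 22.10),
]

def map_phase(records):
    # Emit (key, value) pairs, as a Hadoop mapper would.
    for store_id, amount in records:
        yield store_id, amount

def shuffle(pairs):
    # Group values by key; Hadoop performs this step between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Aggregate per key, as a Hadoop reducer would.
    return {store: round(sum(amounts), 2) for store, amounts in grouped.items()}

summary = reduce_phase(shuffle(map_phase(raw_sales)))
print(summary)  # only these summary rows would be loaded into the warehouse

Only the small summary produced by the reduce step would be loaded into the EDW, which is where the database capacity savings come from.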

As for drawbacks, it may be overkill to use Hadoop for processing structured data alone if volume or velocity are not high enough.

Hadoop can quickly ingest any data format and is a natural framework for manipulating non-traditional data types. However, it is unwise to make a huge investment and modify your existing data architecture with Hadoop just for traditional structured data processing.

It is best to follow a hybrid model where Nicole can:

■ Store summary structured data from online transaction processes and back office systems into the EDW.

■ Store unstructured data in Hadoop/NoSQL, e.g., all communication with customers from phone logs, Web logs, customer feedback, tweets, and e-mail.

■ Correlate data in the EDW with data in the Hadoop cluster to get better insight about customers, products, and equipment.

KEITH MANTHEY

Nicole finds herself in a typical situation these days. The data warehouse has been optimized and automated, and runs on autopilot. This is great news for the business and the insights it can enjoy. The bad news is that the data warehouse is slow to change and often will struggle to "keep up" with new uses such as social media ingestion. What is Nicole to do?

My advice is to stop thinking that different use cases need to be rationalized with the same tool set or environment. Hadoop is certainly in all the trade magazines and conference brochures these days, and is part of the "shiny new toy" syndrome affecting enterprises. However, the problem Nicole faces isn't merely that the business wants to move to Hadoop. The problem is that there is a gap. Nicole's well-structured and valiantly performing data warehouse can answer all known queries, but there is no "lab" environment that can be populated with new data to allow investigation and experimentation with new use cases (not only social).

In essence, Everything Rugs became complacent once they could answer the questions they had when they built the data warehouse. Mildly accretive data sources are added with ease, but data sources that would require a pivot are left out.

Where should Nicole begin? The idea of starting with a small initiative and exploring the process, tools, and technologies is strongly encouraged. The goal would be to start diving into the use case at a fairly low cost. The expected benefits may turn out to be unrealistic. A great starting point would be to look at sentiment analysis around the Everything Rugs brand or line of brands. This is a fairly straightforward analysis and can be scaled as desired. By sizing the study, Nicole can conduct it on an inexpensive set of servers without requiring significant capital investment. Nicole’s team might need some training, consulting, and learning, though.

One word of caution for Nicole is that if she decides not to build the lab environment to explore new options, she shouldn't think the questions will go away. With the advent of cloud-based offerings for data mining, Hadoop, and every other possible use case, Nicole's business users could build their own environment. By meeting with her business clients and showing them how to use a lab to meet their marketing desires for customer segmentation, binning, sentiment analysis, and other predictive marketing components, she can get ahead of them and potentially prevent them from creating their own environment in a place where she can't regulate it (such as in the cloud).

Hadoop by its very nature is extremely insecure, which has created a large market for add-on products, services, and other components. Nicole’s regular data warehouse tool vendor certainly offers capabilities to secure the environment. If Nicole were to engage them or build a “roll your own” environment by other means, she should focus on security. By securing the environment, she can also keep any sprawl centralized. Nicole’s clients will be able to load and manipulate their own data sources in Hadoop and create their own reports, which would be the goal of the lab. Nicole would be able to staff and control the environment so that it is centralized and manageable.

If Nicole proceeds with her data warehouse vendor, she may be in luck. Most of the bigger data warehouse vendors are working on solutions where the end user can write SQL and do table joins between relational data in the data warehouse and flat files in the Hadoop environment. The technology is starting to meet the need to marry two disparate data points for a specific series of reports.

If her vendor doesn’t offer this feature today, it will most likely be available soon. In the not-too-distant future, the data warehouse will join with Hadoop flat files and enable enterprise users to find new and interesting insights. Extract-ing the structured data from the warehouse to create those reports isn’t a great concern anymore.

Finally, if the lab environment and ideation with the business reveal the value in Hadoop and ingesting new and changing sources to her business, Nicole will need to request capital. My advice would be to lead an effort with her business partners to create business value. As the environment creates real revenue or real paths to revenue, it will also create an opportunity to ask for funding for a full Hadoop environment.

BRIAN VALEYKO

Much of the media hype surrounding big data and big analytics seems to be the selective amplification of some small components of a larger system that has existed for quite a while: the data warehouse concept. Some newer technologies have come along that allow for faster processing of larger data sets. In addition, hardware has become cheaper. Combining these two scenarios seems natural, so many people make the leap that this means we can use Hadoop file storage on commodity hardware and use MapReduce to interrogate a lot of data very cheaply—and believe they have completed their analysis.

That hurdle jumped, they make plans to do away with the data warehouse: "We'll just put everything on Hadoop and store it forever because the storage is so cheap." That might be appropriate in some circumstances (see Question 5), but I don't believe it applies to most businesses. Rather, I think there's a better case for mixed modes of usage/analysis and storage types (HDFS and RDBMS) that should be designed to meet the cost, performance, maintenance, accessibility, usability, and security requirements of the many business use cases.

In some cases, it makes sense to buy in bulk. Non-perishable items that don't take up a lot of space and don't require a lot of maintenance might make sense to stockpile if you get a better deal. Is data non-perishable? Sometimes. Does data take up much space? It depends on the kind(s) being collected. What about the maintenance question? Maintainability, accessibility, and usability are the really tricky issues.

Using storage cost as the only factor in determining how to set up your data infrastructure is overly simplistic. If that were a rational way to look at the issue, everyone would have in their refrigerator a four-gallon jar of mustard that they purchased at a wholesale club (because it costs $0.03 per ounce if you buy four gallons at a time, compared to the $0.22 per ounce in a 20-oz. jar). Of course, the same is true for milk, ketchup, pickles, and many other warehouse store items. Unfortunately, everyone would need multiple refrigerators and freezers, all taking up space and using electricity. More condiments would be wasted because some of the products would spoil before the family could use them. This just wouldn't make sense. The same argument holds true for data.

Another buzz phrase going around is the data lake. I've seen different definitions, but they all seem to point to a sort of large-scale storage of every bit of information that a business generates or can get its hands on without regard to order or maintenance or rules! Rules seem too "old school" for the wake boarders on the lake. It sounds a bit like Lake Havasu at Spring Break, which might be fun for some but is probably not a great place to try fishing. Will your users accept slow reporting? Probably not. Will you be able to re-train everyone to use MapReduce, Python, R, or dozens of other tools and techniques to interrogate a free-for-all data lake with no guidelines or data catalogs? Doubtful. You probably wouldn't want to even if you could.

How can Nicole deal with these issues? The noise and demands won't stop. She must create a plan for how to accomplish some of the goals: reducing storage costs, allowing queries of larger data sets, doing predictive analytics. We have to do these things. However, we can't do them at the cost of the rest of the enterprise. Financial and regulatory reporting must continue. Security concerns must be addressed. New advanced analytics needs must be explored as well.

One way I would suggest for Nicole to start the process (Question 1) is to survey the business for needs and test the tools available to figure out the best way to meet those needs. Ask for use cases—business issues that have a clear return (Question 2) based on the availability of some answer that can't be accessed today or at least can't be accessed quickly or easily without pain and manual calculation. I contend that business users in the organization know the information they'd love to have if they only knew how to get it or had time to do the calculations, so use that survey to build those demand-driven use cases.

Create a sandbox, test some tools and techniques on the use cases, and see where it leads (Question 4). Many of the tools and even storage can be accessed for free (or very inexpensively) at least in a trial mode. The results of the tests can yield input for a road map that may turn into a data lake, but it will likely also contain some safe harbors and may even have a boatel on the shore that some might call a data warehouse (Question 3).

Many on-premises and cloud-based tools allow connection to multiple data sets and/or databases, and there are tools that can help to coordinate the movement of data between databases and database types. Finally, great strides are being made in the areas of data lineage, data management, data cataloging, and security that can provide navigational markers, charts, guidelines, and even regulations for lake, marina, and boatel usage.

In the end, I would imagine that Everything Rugs will meet its needs with a combination of databases, maintenance tools, reporting/BI tools, and analytic engines. Maybe it will still be called a logical data warehouse by then, or it might have a better name such as "data barnyard" so it can include all the hives, pigs, warm puppies, and whatnot. We'll see.

By the way, the only reason to buy mustard in bulk is that you need to buy 80 hot dogs at a time to avoid having either buns or dogs left over. Which should be plenty to hand out to friends at the data lake, I guess. See you out there once I figure out which fridge I left the mustard in.

COY YONCE

Nicole faces a scenario common to BI directors: How does she take advantage of these big data technologies that could boost decision-making productivity significantly within her organization? At the same time, Nicole needs to understand what these technologies offer so she can have well-informed conversations with the business people who are listening to the siren's call of big data. Attempting to move forward without fully understanding all the pros and cons will result in failure, user dissatisfaction, and weakened support for future initiatives.

Behind the blanket term "big data" hides the concept and technologies to process massive amounts of structured and unstructured data for building analytics and analytic applications. The goal is not to replace every BI deployment that exists; it is to augment existing deployments. This is an important point for Nicole to understand because it will help to focus her efforts.

She is right to think that a small project is the way to begin. By focusing on improving the usefulness of analytics in her organization within a well-defined project that has a well-defined ROI, Nicole has a better chance to succeed in introducing these new technologies. She should consider the following factors when selecting a starting project:

■ Define clear start and end points

■ Monitor via metrics to determine success based on a goal of increasing productivity for a given problem or workflow

■ Focus on a subset of organizational data (e.g., a region, specific people within a department, manufacturing of a single product, etc.)

■ Include all touchpoints that will be required when moved into production (e.g., loading data, cleaning data, integrating data, designing user interaction, and implementing analytics via BI tools)

■ Leverage the power of big data technologies by combining disparate data sets made up of historical and real-time data

■ Augment existing BI deployments by focusing on analytics, not ad hoc querying or reporting

■ Include project members with the right skills

■ Rely on agile project management principles (early feedback is important)

These factors will also affect the funding required to ensure that the pilot succeeds. The individuals chosen to participate in the project should be focused on its success, which means working on it full time. The skills of the project team are an important budget consideration. Big-data-focused projects should be staffed with people who understand data, programming, networking, security, governance, metadata, and the existing BI implementation.

These people should also feel free to utilize the tools they need to prove that these new technologies are right for the organization—another factor that impacts cost. For Nicole, this means budgeting for the fact that her project team will write applications that may not be used in production but will be replaced by BI tools that are more generally accepted.

Choosing the right platform and how to deploy it is another important part of the funding discussion. If Nicole chooses an on-premises solution such as SAP HANA, Cloudera, or IBM Watson Foundations, then there will be associated hardware and software costs, not to mention consulting, licensing, and support costs. There is the potential to offset costs by leveraging cloud-based platforms such as Amazon Elastic MapReduce, Google BigQuery, or the SAP HANA Cloud Platform; however, security and privacy should be taken into consideration when evaluating these options.

Some of the aforementioned platforms leverage Hadoop in the underlying architecture (e.g., Cloudera). Hadoop stores data in the Hadoop File System and then uses MapReduce technology to process all of the data in parallel chunks. This allows data processing to be spread throughout a cluster of machines running Hadoop. The Hadoop software is available for free along with the MapReduce functionality. This is, therefore, an option that could potentially allow Nicole to control costs.

However, with free software comes the need to train staff to implement, maintain, and support it. With a project of this magnitude, it is best to associate people to the project who already have the required skills rather than attempting to teach a breadth of new skills. The costs of training on Hadoop should be weighed against the potential benefits it provides in terms of cost savings over a commercial platform.

Nicole is also right to be concerned about decentralization of her landscape; however, only temporarily. During the life of the pilot project, the analysts should feel free to use any tools they deem necessary to take full advantage of this new platform. This will mean using lots of new tools that are not controlled by IT; however, IT should provide the framework by which the new technologies are implemented. This will once again bring order to the use of the new tools.

In reality, it's no different from the life cycle of BI in most organizations. New tools come along and, if useful, are eventually incorporated into the standard tool set. Nicole needs to plan for a bit of disarray and disorganization during her adventure.

The implementation of big data provides technologies that are useful for making the most of analytics for creating a more efficient organization. Nicole is in a position to ensure that Everything Rugs' experience with these new technologies is positive and brings long-term benefits. She should ensure that she gets proper executive buy-in, builds the right project team, and sets proper expectations about what success means for a pilot project. ■

Editor's note: Philip Russom, research director for data management at TDWI, explains what steps to take first in "Getting Started with Big Data." See http://tdwi.org/articles/2014/08/26/Getting-Started-Big-Data.aspx.


Elad Israeli is co-founder and CPO of SiSense, a provider of high-performance business intelligence software. He is a veteran of the Israeli army's 8200 Elite Intelligence Unit and Ness Technologies (NASDAQ: NSTC). [email protected]

Achieving Faster Analytics with In-Chip Technology

Elad Israeli

Abstract

In-memory technology accelerates the performance of relational database management systems and online analytical processing, but cost and scalability are formidable challenges to its adoption. This article looks at how a recent innovation—in-chip technology—takes the best features and characteristics of in-memory technology and overcomes the drawbacks by efficiently using hard disks, RAM, and CPU to enable large storage capacity and strong performance.

Introduction

One of the biggest issues with developing and deploying business intelligence (BI) solutions is poor query performance resulting from large data sets or extensive concurrent querying. Business users want immediate gratification, and they need results from fast and responsive applications. What they really want is Google search for their specialized databases, but query performance is challenging because data sets are growing exponentially. Queries are executed frequently and by multiple people simultaneously, which takes a toll on the modest resources of commodity hardware. The challenge is compounded by business requirements that constantly and unpredictably change.

Typically, challenges from large data sets are solved by parallelizing resource-intensive operations across multiple machines (a database cluster). However, setting up a computer cluster is often considered overkill for BI and would also put a significant burden on IT as well as perpetuate the IT bottleneck for which BI solutions are notorious. This is why most technological innovations in BI focus on accomplishing more with a simpler architecture using one commodity server rather than a computer cluster.


With the introduction of 64-bit computing early in this century, in-memory technology became the first major innovation since the relational database. In-memory technology is today considered a significant performance advance over traditional relational database management system (RDBMS) and online analytical processing (OLAP) technology, but in-memory analytics still has significant cost and scale limitations.

The most recent innovation, called in-chip technology, harnesses the positive characteristics of in-memory while handling its cost and scalability drawbacks. It achieves this by efficiently utilizing (automatically and in real time) the best qualities of modern hard disks, RAM, and CPU to enable the highest storage capacity while ensuring performance equal to or better than in-memory technology.

To understand how in-chip technology works, it’s useful to review some basic computer architecture.

In general, computers have two types of data storage: disk (or hard disk) and random access memory (RAM). Typically, modern computers have 15–100 times more available disk storage than RAM; however, reading data from disk is much slower than reading from RAM, which is one reason 1 GB of RAM costs approximately 320 times that of 1 GB of disk space.

Another key distinction is that data stored on disk is unaffected by powering down the computer; data residing in RAM is instantly lost. For example, although Microsoft Word documents stored on disk don't have to be recreated after a reboot, you must still re-load the operating system, re-launch the word processor, and reload any documents you need to work on because applications and their internal data are partly, if not entirely, stored in RAM while they run.

Disk-based Databases vs. In-Memory Databases

This understanding of the basic differences between disk and RAM storage provides insight into the difference between disk-based and in-memory databases.

Disk-based databases are engineered to efficiently query data that resides on the hard drive. At a very basic level, these databases assume that the entire amount of data cannot fit inside the relatively small amount of available RAM and therefore must have very efficient disk reads for queries to complete within a reasonable time. The engineers of such databases have the benefit of unlimited storage but must face the challenges of relying on relatively slow disk operations.

By contrast, with in-memory databases, the data set to be queried is first loaded into memory under the assumption that the entire data set will fit inside the available RAM. The advantage of this approach is that it maximizes use of the fastest available storage system. The disadvantage is that far less RAM is available than disk space.

This is the fundamental trade-off between in-memory and disk-based technologies: faster reads but limited amounts of data with the in-memory approach compared to slower reads but practically unlimited data volume with the disk-based approach. This trade-off puts the engineers of BI applications in a quandary because today’s BI users want to have both fast query response times and access to as much data as possible.

To understand how in-chip technology solves this dilemma by taking advantage of both in-memory and disk-based technology, let’s briefly review the evolution of BI solutions.

First-Generation BI: RDBMS/OLAP

The first generation of BI technology was based on the RDBMS, such as SQL Server, Oracle, and MySQL. These databases were originally designed for transactional processing—that is, inserting, updating, and deleting records that were stored in rows. Developed in the 1980s, the RDBMS was designed to work on the hardware of the day, which featured very little RAM, relatively weak CPUs, and limited disk space. RDBMS handles record (row) processing extremely well, even today.

Using the RDBMS for high-performance BI on large data sets, however, has proved to be challenging. The row-based design of tables means they take up more RAM, which in turn means that reading them often requires slow disk reads and makes it harder to efficiently utilize the available RAM and CPU. In addition, although the RDBMS needs to support high-performance transaction insertions and updates, BI solutions need to support high-performance queries that require aggregating, grouping, and joining. A single architectural approach simply can't achieve both goals.

Further, the standard query language used to extract transactions from relational databases, SQL, was designed for efficiently fetching rows. However, it is rare that a BI query requires scanning or retrieving an entire row of data. In fact, it is difficult to formulate a BI query that row-oriented SQL processing can execute efficiently.

Although relational databases work well as the backbone of operational applications (such as CRM, ERP, and websites) where transactions are frequently and simultaneously inserted, they are a poor choice for supporting analytic applications that involve the simultaneous retrieval of partial rows along with heavy calculations.

Second-Generation BI: In-Memory

As noted, in-memory databases approach the query problem by loading the entire data set into RAM. In so doing, they don't need to access the disk to run queries, thus gaining an immediate and substantial performance advantage because scanning data in RAM is orders of magnitude faster than reading it from disk. Some of these databases introduce additional optimizations that further improve performance. For example, most employ compression techniques to squeeze even more data into the same amount of RAM.

Although highly beneficial in theory, storing the entire data set in RAM has serious implications for BI applications. The amount of data that can be queried is limited by the amount of available RAM. Limited memory space reduces the quality and effectiveness of a BI application because it limits how much historical data can be included and how many fields can be queried. It is certainly possible to keep adding more RAM, but hardware costs will increase exponentially.

In addition, the fact that 64-bit computers are cheap and can theoretically support unlimited amounts of RAM does not mean they actually do so in practice. A standard low-cost, desktop-class computer with standard hardware physically supports up to 12 GB of RAM today. To move to systems allowing up to 64 GB of RAM costs about twice as much. Moving beyond 64 GB requires a full-blown server, which is very expensive.

The amount of RAM a BI application requires is affected by the size of the data set as well as by the number of people simultaneously querying it. Having five to 10 people using the same in-memory BI application can easily double the amount of RAM required for intermediate calculations that must be performed to generate the query results. Because most companies consider the ability to support many users a key success factor for their BI solution, they will find that in-memory technology quickly becomes far too expensive, especially when future needs are considered.

Even without cost considerations, there are critical implications to having the entire data set stored in memory. Re-loading the data set into RAM every time the computer reboots requires significant time, and with large data sets, copying data from RAM to the CPU can actually be slower than reading partial data from disk. If this problem is not addressed through efficient resource use, in-memory becomes as slow for large data sets as disk is for smaller ones.

In-Chip Technology

Invented in 2008, in-chip technology is based on the concept of optimizing the technology that already exists on today's commodity computers, in particular the CPU. In other words, optimal hardware utilization.

In-chip technology is based on two ideas: (1) the low performance thresholds of traditional technologies can be attributed almost entirely to antiquated software, not incapable hardware, and (2) there will always be more raw data than can fit in RAM, regardless of how well it is compressed. Let’s explore each of these ideas in detail.


Performance and CPU Utilization

In-chip technology does not rely on the operating system to optimize communication with the CPU. Instead, it uses its own code to optimally utilize RAM and the CPU to avoid cache misses and significantly reduce the number of times the same piece of data is copied between RAM and the CPU. This introduces an out-of-the-box boost to performance.

Today’s CPUs are extremely powerful and have relatively large caches: an x86 CPU has three layers of in-chip memory, where data is stored prior to being processed: the L1, L2, and L3 caches (see Figure 1). Each indi-vidual CPU core has a devoted L1 and L2 cache, and the L3 cache is shared among all cores. The L1 cache has a capacity of 32 KB, the L2 cache holds 256 KB, and the L3 cache holds 8 to 20 MB.

It takes three times longer for a CPU core to fetch data from the L1 cache than if the data already resides in the core. If the data is in the L2 cache, it takes an additional 3.3 times longer for the data to get to the core, so it takes about 10 times longer to pull data from the L2 cache. If the data is in the L3 cache, it takes an additional 3.5 times longer to move the data, now totaling 35 times as long. Reading data from RAM is another 10 times slower, and reading from a disk is thousands of times slower.

In-chip technology considers the specifications of the CPU and applies its unique code to organize the query data and communicate it to the CPU in such a way that if the CPU needs that piece of data again, it will exist in cache.

The technology learns to fetch the associated compressed result sets in advance, with sub-query results pre-loaded into L1 cache as compressed data, making economical use of this very fast but limited resource. Later, decompressed images of that same data can be moved to the larger (but slower) L2 and L3 caches. In this way, decompression operations read from and write to cache and are therefore extremely fast.

In-chip technology also focuses on getting more out of multicore CPUs. When processing queries or doing analytical calculations, vector algebra is applied to the data, allowing for the full exploitation of x86 in-chip single instruction multiple data (SIMD) instructions, also called vector instructions. These CPU instructions enable short arrays (that is, the columns of data) to be acted upon by a single instruction. Because the cores process many data values in parallel, they process data much faster.
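As a rough illustration of the column-at-a-time, data-parallel processing described above, the short Python sketch below contrasts a value-at-a-time loop with a vectorized operation over an entire column. It is only an analogy: NumPy is not the technology discussed in this article, and the column names and sizes are invented, but the vectorized calls run as tight native loops in the same spirit as applying one SIMD instruction to a whole array of column values.

# Illustrative sketch (not SiSense code): operating on a column as a single
# array lets one vectorized operation process many values at once, which is
# the same idea the article describes for x86 SIMD instructions on columns.
import numpy as np

n = 1_000_000
order_total = np.random.default_rng(0).uniform(1.0, 100.0, size=n)  # one column
region_id = np.random.default_rng(1).integers(0, 4, size=n)         # another column

# Row-at-a-time style: a Python loop touches each value individually.
total_loop = 0.0
for value in order_total:
    total_loop += value

# Column-at-a-time style: one vectorized call over the whole column.
total_vector = order_total.sum()

# Vectorized filtering plus aggregation, again expressed over whole columns.
region_3_sales = order_total[region_id == 3].sum()
print(round(total_loop, 2), round(total_vector, 2), round(region_3_sales, 2))

The point is not the specific library but the shape of the computation: one operation applied across an entire column of values rather than to one value at a time.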

RAM and Storage Utilization

Like all databases, in-chip utilizes disk for persistent storage. However, unlike in-memory databases that avoid disk reads by loading the entire data set into RAM, in-chip achieves speedier performance by loading only those parts of the data that are required by a particular query in real time. Data is unloaded from RAM, thus freeing RAM based on usage across all users running queries.


Figure 1: In-memory technology can optimize use of CPU and RAM, as in this sample architecture. [The figure shows four CPU cores, each with its own L1 and L2 cache, sharing an L3 cache, above main memory (RAM) and disk. Capacities grow from megabytes (caches) to gigabytes (RAM) to terabytes (disk), while relative access times slow by roughly 3x, 3.3x, 3.5x, 8.6x, and 3,333x at each successive level.]


This ability starts with a columnar database, which is in itself not a new concept but has already been widely accepted as ideal for analytics. A columnar database stores information in columns, not rows. This fundamental capability allows you to scan a field in a table on disk without scanning the entire table. As an example, scanning one field in a table made up of 10 fields and 10,000 rows would require 10,000 disk-reads from a columnar database but 100,000 reads from a tabular database such as RDBMS or most in-memory technologies.
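The read-count arithmetic above can be checked with a toy model. This Python sketch is illustrative only—the field names and table shape are invented—but it shows why aggregating one field touches ten times fewer values when the data is laid out by column:

# Toy model of the read-count difference between a row store and a column store.
# Field names and table shape are illustrative, not from the article.
NUM_FIELDS, NUM_ROWS = 10, 10_000

# Row store: each row is a tuple of all 10 fields.
row_store = [tuple(range(NUM_FIELDS)) for _ in range(NUM_ROWS)]

# Column store: each field is its own contiguous list.
column_store = {f"field_{i}": [i] * NUM_ROWS for i in range(NUM_FIELDS)}

# Aggregating one field from the row store touches every value of every row.
values_read_row_store = sum(len(row) for row in row_store)        # 100,000

# Aggregating one field from the column store touches only that column.
values_read_column_store = len(column_store["field_3"])           # 10,000

print(values_read_row_store, values_read_column_store)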

Columnar databases also make it easier to compress data because it's more likely that a single field has similar and repetitive values (eye color, gender) than an entire table would. Obviously, this saves storage space, but in-chip also has the unique ability to perform calculations on data sets while they are still compressed, further reducing the amount of RAM a query consumes in real time.
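The article does not disclose the specific compression scheme in-chip technology uses, but run-length encoding is one common columnar encoding that makes the idea of computing on compressed data concrete; the sketch below counts values directly from the (value, run length) pairs without ever rebuilding the original column.

# Illustrative only: run-length encoding (RLE) of a repetitive column, with a
# count aggregate computed directly on the compressed representation.
# The actual in-chip compression scheme is not described in the article.
from itertools import groupby

gender_column = ["F", "F", "F", "M", "M", "F", "M", "M", "M", "M"]

# Compress: store (value, run_length) pairs instead of every individual value.
rle = [(value, len(list(run))) for value, run in groupby(gender_column)]
# e.g., [('F', 3), ('M', 2), ('F', 1), ('M', 4)]

# Aggregate without decompressing: count occurrences of 'M' from run lengths.
m_count = sum(length for value, length in rle if value == "M")

assert m_count == gender_column.count("M")
print(rle, m_count)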

Being able to quickly load only parts of the data into RAM and keeping columns compressed in-memory both deliver real-time performance on data that lies beyond the size of physical RAM.

A smart query engine can reuse query particles to avoid re-calculating them. This enables the execution of one query to improve the performance of a completely different query, and is very different from standard result caching, which reuses only the stored results of entire queries.

Concurrency Handling

Under the hood, in-chip does not use a SQL-based query engine but rather a query engine that speaks columnar algebra. A query is broken down into thousands of columnar algebra instructions that the in-chip query engine can reuse without re-calculating across different queries with similar execution plans. This enables the execution of one query to improve the performance of a completely different query, as mentioned earlier.

This ability means that an additional concurrent user adds very little overhead to RAM and the CPU, making them available to more concurrent users.
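One way to picture this particle reuse—greatly simplified, and not how the engine is actually implemented—is a shared cache keyed by sub-expression, so a second query that needs the same intermediate result skips the recalculation entirely:

# Simplified sketch of reusing shared sub-expression ("query particle") results
# across different queries. The real engine operates on columnar algebra plans;
# the cache keys and data here are illustrative only.
particle_cache = {}

sales = {"amount": [10.0, 20.0, 5.0, 40.0], "region": ["N", "S", "N", "S"]}

def particle(key, compute):
    """Return a cached particle result, computing it only on first use."""
    if key not in particle_cache:
        particle_cache[key] = compute()
    return particle_cache[key]

def query_total_sales():
    return particle("sum(amount)", lambda: sum(sales["amount"]))

def query_average_sale():
    # A different query that reuses the same particle instead of rescanning.
    total = particle("sum(amount)", lambda: sum(sales["amount"]))
    count = particle("count(amount)", lambda: len(sales["amount"]))
    return total / count

print(query_total_sales(), query_average_sale(), sorted(particle_cache))

Because the cached particles are shared, each additional concurrent query that overlaps with earlier ones adds little new work, which is the effect the article describes.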

Understanding the Business Benefits of In-Chip Technology

Today, most BI solutions can handle between one and two terabytes of data. This is great for databases within this range, but data volumes are continuing to grow rapidly, and the desire to exploit big data—all of it—is driving the need for larger database capacity. In-chip technology offers the potential to easily handle 5 TB and beyond, scaling up by optimizing the use of the available resources, balancing disk against memory and CPU. It also automatically adjusts to data volumes and to the number of users and workloads.

Because in-chip technology simply optimizes the use of x86 CPUs, it does not require the purchase of any special hardware. It goes faster when given a more powerful CPU with more cores, and even faster when provided with more memory. If it runs out of resources, a new commodity server can easily be added and workloads split between servers.

In-chip technology is currently being used by companies around the world to accelerate their analytics. For example, Magellan Vacations, a luxury hotel booking company, relies on telephone-based agents to provide its clients with personalized recommendations and book hotel rooms. To support its agents and measure performance, the company needs to track huge amounts of sales metrics, such as closing rates, commissions, and bookings by destination. The company tested an in-memory technology, but the performance was sub-par and the solution required IT specialists to work with the tool's proprietary scripts and IT consultants to work with and modify the application. Instead, Magellan Vacations selected an in-chip technology solution that provided agents with near-real-time feedback on sales closings, destination performance, and other metrics that would help them better serve customers. IT implemented the solution without a major infrastructure upgrade because the solution was sufficiently scalable to maintain acceptable performance on the existing infrastructure while reports were being generated. The solution also did not require an investment in expensive IT resources.


Wix, a popular Flash-based website builder, has helped users build more than 22 million websites. At this scale, the company needed a powerful analytics and reporting solution that could help it quickly track a variety of metrics, including conversions, marketing campaign efficacy, and user behavior. Prior to implementing a technology solution, data was largely managed via scripting, and reports were difficult to explore and change. However, thanks to the performance of in-chip technology, the company can quickly gain insight based on behavioral data combined from numerous sources while generating reports that validate the success of marketing campaigns and changes in user behavior. The senior BI analyst managed the implementation without outside help or massive hardware purchases. Users found it easy to upload and connect multiple forms of data, including MS-SQL, Oracle, and MySQL databases, Excel and CSV files, and direct API access to Google Adwords and Google Analytics.

By delivering execution speeds between 10 and 100 times faster than in-memory-based solutions, in-chip technology can potentially provide the scale and performance required to enable the type of data self-service environment that most companies today can only fantasize about: the ability to quickly and easily glean more insight for more people asking a wider range of questions at any level of detail. With in-chip technology, even as data sets swell to hundreds of terabytes and eventually even petabytes, it will be possible to produce reports that go further back in time, offer a more granular level of detail, and consolidate reports across multiple data sources and business entities. ■



Winners: TDWI Best Practices Awards 2014

Government and Non-Profit Denver Public Schools Solution Sponsor: RevGen Partners

Denver Public Schools (DPS) is committed to meeting the educational needs of every student in the city and county of Denver with great schools in every neighborhood. The district’s 185 schools include traditional, magnet, charter, and pathways schools.

Under the leadership of Superintendent Tom Boasberg and guided by The Denver Plan, DPS is the fastest-growing urban school district in the country in terms of enrollment, and Colorado’s fastest-growing large district in terms of academic growth. DPS is committed to establishing Denver as a national leader in achievement, high school graduation, and college and career readiness.

DPS chose RevGen Partners to develop the application portion of a customized school performance management system (SPMS) to track district, school, educator, and student achievement. DPS transitioned to a culture of data-driven decisions aimed at student success, seeking to increase graduation rates, raise student academic performance, and replicate best practices in top-performing schools.

The joint team from DPS and RevGen Partners created an enterprise-level SPMS designed to:

■ Provide a comprehensive view of school performance and success, including attendance and reading proficiency.

■ Inform teacher evaluation and development. Student performance is a key part of evaluating teacher and principal effectiveness.




■ Create tools to support continuous improvement, addressing long-term and immediate goals for eight operational departments.

■ Develop school and individual performance plans. The system shows individual and group progress toward stated goals.

■ Identify top-performing schools and replicate best practices. Robust data on each school allows district administrators and principals to review specifics of each rating, ranging from attendance rates to individual subject proficiency gains (a rough scoring sketch follows this list).
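As a rough illustration of the kind of roll-up such ratings involve (the weights and metrics below are hypothetical, not the DPS formula), a school score might blend attendance and proficiency figures like this:

# Hypothetical weights and metrics, each expressed on a 0-1 scale.
WEIGHTS = {"attendance": 0.3, "proficiency": 0.5, "gain": 0.2}

def school_score(attendance_rate, reading_proficiency, proficiency_gain):
    """Blend the metrics into a single 0-100 performance score."""
    blended = (WEIGHTS["attendance"] * attendance_rate
               + WEIGHTS["proficiency"] * reading_proficiency
               + WEIGHTS["gain"] * proficiency_gain)
    return round(100 * blended, 1)

print(school_score(attendance_rate=0.94, reading_proficiency=0.71, proficiency_gain=0.05))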

Enterprise Data Management Strategies
Teachers Insurance and Annuity Association-College Retirement Equities Fund (TIAA-CREF)

Founded in 1918, TIAA-CREF is a national financial services organization and the leading provider of retirement services in the academic, research, medical, and cultural fields.

TIAA-CREF’s technology strategy and architecture team created advanced data quality and data management capabilities used by business initiatives and project teams to identify and improve the quality of critical data. These enterprise data services are delivered via an innovative method called data quality as a service (DQaaS). Using concepts from cloud computing, business technology solutions leverage these enterprise data services to improve data quality. Users need not build their own redundant solutions.

The team created key framework components, ranging from data quality and profiling services to a data steward playbook methodology, to deliver the data quality services in an agile way. The organization’s data transformation program offered DQaaS to lower operational risk and cost, enable strategic business initiatives, and advance the data management practices of the enterprise.

Business initiatives have used the data quality services to quickly pinpoint and correct root causes of data quality gaps that create inefficiency in business processes. Business data owners and stewards leverage the enterprise data quality dashboard to systemically improve the quality of critical data; it gives visibility into data quality levels, shows trends, and identifies defective records. Project teams have performed data analysis faster using these data services.
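As a minimal sketch of the kind of measures such a dashboard trends (the field, pattern, and records below are hypothetical and not TIAA-CREF’s actual service), completeness and validity checks might look like this in Python:

import re

# Hypothetical records and rule: a critical identifier must be present and
# match the expected pattern to count as valid.
records = [
    {"id": 1, "member_id": "123-45-6789"},
    {"id": 2, "member_id": ""},
    {"id": 3, "member_id": "987654321"},
]
PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")

complete = [r for r in records if r["member_id"]]
valid = [r for r in complete if PATTERN.match(r["member_id"])]
defective = [r["id"] for r in records if r not in valid]

print(f"completeness: {len(complete) / len(records):.0%}")
print(f"validity:     {len(valid) / len(records):.0%}")
print("defective record ids:", defective)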

Enterprise Data Warehousing
Caisse de dépôt et placement du Québec

La Caisse de dépôt et placement du Québec is a financial institution that manages funds primarily for public and parapublic pension and insurance plans. As of December 31, 2013, it held $200.1 billion in net assets. As one of Canada’s leading institutional fund managers, La Caisse invests globally in major financial markets, private equity, infrastructure, and real estate.

An enterprise data warehouse for financial data is used by all lines of business for analytical and downstream systems alike. Enterprise context and SOA allow for a simple and effective design.

With a combination of winning enterprise conditions and a will to design a data warehouse capable of standing the test of time, La Caisse’s team of experienced BI/DW architects and developers has created what they believe to be a very simple and effective data warehouse to satisfy business needs, including the capacity to evolve in response to demands of constantly changing business conditions.

Performance Management
Dell

For more than 28 years, Dell has empowered countries, communities, customers, and individuals to use technology to realize their dreams. Customers trust Dell to deliver technology solutions that help them do and achieve more, whether at home, work, school, or anywhere in their world.

Dell built a global BI performance management solution for all supply chain and financial executive leadership and extended operational teams to measure and assist Dell’s strategic initiatives. It also measures and monitors the current state of operations in the company on a daily basis. The project likewise provides robust data analytics-based decision-making capability for executives to chart new strategies.
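As a rough sketch of a daily roll-up of this kind (the KPIs, targets, and thresholds below are hypothetical, not Dell’s actual metrics), a monitoring pass might compare actuals with targets like this:

# Hypothetical KPIs, daily actuals, and targets.
daily_metrics = {"on_time_delivery": 0.962, "order_cycle_days": 4.8, "backlog_units": 1250}
targets = {"on_time_delivery": 0.95, "order_cycle_days": 5.0, "backlog_units": 1500}
higher_is_better = {"on_time_delivery": True, "order_cycle_days": False, "backlog_units": False}

# Compare each KPI with its target and flag anything off track.
for kpi, actual in daily_metrics.items():
    target = targets[kpi]
    on_track = actual >= target if higher_is_better[kpi] else actual <= target
    print(f"{kpi}: actual={actual} target={target} {'OK' if on_track else 'ALERT'}")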

The company’s CSR performance management BI solution has become the most reliable system to measure the implementation effectiveness of Dell’s strategic initiatives around supply chain and operations. The product has evolved consistently, keeping pace with the changes in the strategy. CSR has also evolved into a supply chain newspaper highlighting Dell supply chain operations KPIs and has become a trusted BI framework for executives making data-driven decisions for company supply chain operations.

CSR data and metric platforms have contributed to predictive initiatives in the supply chain and have provided an analytics-based framework for driving the strategy behind future initiatives in supply chain performance management. Significant cost savings have also been achieved by establishing the single information source in CSR, enabling different IT and business stakeholders to communicate using the same data and metrics. Overall, Dell’s performance management solution has created a well-adopted and one-stop environment for BI and analytics-based supply chain performance management for executives and business analysts.

Emerging Technologies and Methods
Aircel Limited, India

Aircel is one of India’s leading innovative mobile services providers, serving more than 68 million subscribers. The company has 3G spectrum in 13 circles and BWA spectrum in 8 circles, and accomplished the fastest 3G rollout ever in the Indian telecom space. In 2006, Aircel was acquired by Malaysia’s biggest integrated communications service provider, Maxis Communications Berhad, in a joint venture with Sindya Securities & Investments Pvt. Ltd. Through its innovative and affordable offerings, strategic partnerships, and best-in-class solutions, Aircel caters to the growing mobile telephony demands of customers and enterprises alike.

The prepaid sector continues to dominate Indian telecom. Therefore, the ability to understand a subscriber’s interests, needs, preferences, and consumption patterns in real time, and to use these insights to provide customized products and services across all customer touchpoints, offers a unique strategic advantage. Aircel initiated its “Unified Marketing Platform” project to provide its marketing and sales teams a 360-degree customer view for designing and providing relevant offerings across all touchpoints.

Earlier, Aircel had heterogeneous systems capturing day-to-day customer interactions and transactions, which hampered the generation of an integrated view for business users to analyze customer demographics, usage patterns, social behavior, and so on. Keeping market demands, future growth, and telecom dynamics in perspective, Aircel implemented an enterprise data warehouse (EDW) solution using Teradata 2x/5x/6x series to bring together multiple lines of business and more than 21 disparate data streams.
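As a minimal sketch of how such streams come together into a single customer view (the stream names and columns below are hypothetical, not Aircel’s Teradata schema), the integration might look like this:

import pandas as pd

# Hypothetical per-subscriber extracts from three of the many source streams.
usage = pd.DataFrame({"msisdn": ["9100", "9101"], "data_mb_30d": [820, 95]})
recharges = pd.DataFrame({"msisdn": ["9100", "9101"], "recharge_amt_30d": [199, 49]})
care = pd.DataFrame({"msisdn": ["9101"], "open_tickets": [2]})

# Outer-join the streams on the subscriber number to build one unified view.
customer_360 = (usage
                .merge(recharges, on="msisdn", how="outer")
                .merge(care, on="msisdn", how="outer")
                .fillna({"open_tickets": 0}))
print(customer_360)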

Aircel’s objective was to provide a platform to enable personalized promotions across touchpoints and a seamless customer experience—customers should feel they are well understood by Aircel and that services offered to them are tailor-made. The EDW implementation has indeed provided Aircel a competitive edge in the Indian telecom industry.

Right-Time BI and Analytics
Comdata
Solution Sponsor: Credera

Comdata is a business-to-business provider of innovative electronic payment solutions. As an issuer and processor, Comdata provides fleet, corporate payment, healthcare, virtual card, and prepaid solutions to more than 30,000 customers. The company’s solutions include financial transactions of all kinds and are changing the way companies manage data, pay employees, process transactions, and control spending on key business purchases. Founded in 1969 and headquartered in Brentwood, Tennessee, with more than 1,200 employees globally, Comdata enables over $54 billion in payment volume annually.

FleetAdvance is an analytics solution that helps commercial transportation fleets make smarter fueling decisions by analyzing the quality of each fueling transaction against available alternatives. FleetAdvance presents real-time dashboards and metrics, instant notifications, analytical views of performance, and route-planning tools.
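As a rough illustration of scoring a transaction against alternatives (the formula and figures below are hypothetical, not Comdata’s algorithm), a simple quality score might compare the price paid with nearby prices:

def transaction_score(paid_price, nearby_prices):
    """Return 0-100, where 100 means the lowest available price was paid."""
    prices = nearby_prices + [paid_price]
    best, worst = min(prices), max(prices)
    if worst == best:
        return 100.0
    return round(100 * (worst - paid_price) / (worst - best), 1)

# Hypothetical transaction: $3.89 per gallon paid, with three nearby alternatives.
print(transaction_score(paid_price=3.89, nearby_prices=[3.79, 3.85, 3.99]))  # 50.0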

In the recession-wary world, FleetAdvance fills a void for value-savvy fleet customers. FleetAdvance is a strategic component of Comdata’s comprehensive fleet solutions, which enable customers to greatly improve their fuel purchasing efficiency. FleetAdvance includes features that allow customers to plan routes, analyze fuel spending, and realize actual savings on fuel purchases.

FleetAdvance differentiates Comdata in the marketplace. The methodology and technologies enable Comdata to react to market conditions quickly and continue to provide additional value to customers.

Overall, the FleetAdvance product showcases the innovative culture at Comdata both from a client-service perspective and by its ability to use technology as a differentiator. FleetAdvance serves as a real-world example of turning data into information and, in turn, driving real-time and predictive analytics based on this information.

Enterprise BI
USAA

USAA provides insurance, banking, investments, retirement products, and advice to 10.1 million members of the U.S. military and their families. Legendary for its commitment to its members, USAA is consistently recognized for outstanding service, employee well-being, and financial strength. USAA membership is open to all who are serving or have honorably served our nation in the U.S. military—and their eligible family members.

To improve understanding of call data from a member perspective, including why members call and what occurs during the call, this project consolidated multiple disparate data sources into a single data mart. This data mart and the automated reports built upon it enabled enhanced analytics about member call experience and call transfers, and optimization of sales and service.
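As a minimal sketch of the kind of questions such a consolidated data mart answers (the fields and records below are hypothetical, not USAA’s MCCD schema), summarizing call reasons and transfer rates might look like this:

from collections import Counter

# Hypothetical call records after consolidation into the data mart.
calls = [
    {"reason": "claims", "transferred": True},
    {"reason": "claims", "transferred": False},
    {"reason": "banking", "transferred": False},
    {"reason": "claims", "transferred": True},
]

calls_by_reason = Counter(c["reason"] for c in calls)
transfer_rate = sum(c["transferred"] for c in calls) / len(calls)
print("calls by reason:", dict(calls_by_reason))
print(f"transfer rate: {transfer_rate:.0%}")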

The development of Member-Centric Call Data (MCCD) was an evolving, multi-year journey that enabled USAA to approach new frontiers of conceptual and strategic design of its contact centers. The team was able to experiment with new development methodologies, technical approaches, training, and change management practices to set the example for more BI development across the organization.

The co-location and collaboration of business and technical resources resulted in a high-functioning team able to quickly deliver quality results. The extended project team delivered varying levels of data, tools, and analysis to different organizations across the company, resulting in new insights that were either unavailable or difficult to achieve in the past. In so doing, the MCCD solution has directly supported USAA’s corporate mission.

BI on a Limited Budget
HDFC Standard Life Insurance Company Limited

HDFC Life, one of India’s leading private life insurance companies, offers a range of individual and group insurance solutions. It is a joint venture between Housing Development Finance Corporation Limited (HDFC), India’s leading housing finance institution, and Standard Life plc, the leading provider of financial services in the United Kingdom. The company’s goals are to enable a strong performance management platform (including action-based dashboards), improve operational efficiency to enhance the customer experience, and provide an intuitive interface that enables users to explore data with little training.

HDFC Life’s implementation methodology and innovative use of its BI tool helped the company develop industry-leading modules and reap business benefits, including:

■ Optimization of over $600,000 in human resources and development costs

■ Customer retention increased by 18 percent

■ Customer satisfaction score rose by 7 basis points


■ Repeat purchases were up by 27 percent

■ Enhanced channel partner/distributor service levels with best-in-class service on data/information management and support

HDFC Life has a listening team that updates and releases new versions of its modules on Qlik Insights almost every week. The company monitors authenticated entries and module views, such as the number of clicks in a module and which modules are visited or used least.
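As a rough sketch of that kind of usage monitoring (the module names and event log below are hypothetical), counting clicks per module to surface the least-used modules might look like this:

from collections import Counter

# Hypothetical click log: one entry per authenticated module view.
click_events = ["persistency", "renewals", "persistency", "claims",
                "renewals", "persistency"]
clicks = Counter(click_events)

# Sort ascending to surface the least-used modules for follow-up.
least_used = sorted(clicks.items(), key=lambda item: item[1])[:2]
print("clicks per module:", dict(clicks))
print("least-used modules:", least_used)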

In a competitive environment, it is vital that HDFC Life stays nimble and responsive to the information needs of its distribution and channel partners and enhances their ability to serve customers. Qlik Insights was extended to these partners, with customized views, dashboards, and analytic models hosted on the system, or with the necessary information pushed to them for action through a centralized publisher. HDFC Life benefited by using QlikView for more than just reporting.

Analytics
Scotiabank

Scotiabank is a leading financial services provider in over 55 countries and Canada’s most international bank. Through its team of more than 86,000 employees, Scotiabank and its affiliates offer a broad range of products and services, including personal and commercial banking, wealth management, and corporate and investment banking, to more than 21 million customers.

The project created a new method to price customers applying for unsecured lines of credit. It required the development of multiple predictive models, a new data mart, and significant integration between the enterprise data warehouse (EDW) and operational systems to deliver pricing recommendations.

The new approach to selling lines of credit to retail customers, launched in 2011, leveraged analytics and modeling to create more precise pricing recommendations and an optimization solution to align prices to business results. All modeling and analytical work was conducted within the EDW environment, which expanded the customer information captured to include application and lost quote details. The project improved average prices on new lines of credit by 35 basis points over a control group using the former pricing approach. Its success has encouraged the bank to launch similar pricing initiatives for mortgages and GICs.
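As a quick illustration of how such an improvement is expressed in basis points (the rates below are illustrative only, chosen so the difference works out to the 35-point figure), the arithmetic is simply:

# Illustrative rates only, not Scotiabank's actual figures.
pilot_avg_rate = 0.0689    # average rate on new lines of credit, pilot group
control_avg_rate = 0.0654  # average rate, control group on the former approach

uplift_bps = (pilot_avg_rate - control_avg_rate) * 10_000  # 1 bp = 0.01 percentage point
print(f"uplift: {uplift_bps:.0f} basis points")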

The project showcases how analytics solutions deployed within an EDW environment can be integrated into operational systems, and the value that can be generated by consolidating customer, account, application, and sales interaction data into a single environment.

Big Data Technologies
adMarketplace
Solution Sponsor: HP Vertica

adMarketplace is the first and only programmatic marketplace for search and the largest search advertising solution of its kind outside of Google and Yahoo!/Bing.

Originally designed as the search advertising engine for eBay, adMarketplace became available to all advertisers in 2007. Now, the adMarketplace Advertiser 3D platform contributes to the more than 50 percent of search advertising activity that occurs outside of major search engines. adMarketplace’s proprietary search solution utilizes keywords and user demographic information—traffic source, device type, location, and more—to match consumers with relevant ads in real time.

Fortune 500 brands and ad agencies count on Advertiser 3D, the analytics platform, and BidSmart, the algorithmic bidding system, to provide data transparency, granular control, and algorithmic real-time account optimization. Both Advertiser 3D and BidSmart use HP Vertica for data storage and processing.

Flexibility is the core of adMarketplace’s big data model. Every minute, data enters a clustered, columnar data warehouse and then runs through a series of BidSmart’s predictive algorithms. This system processes terabytes of data to make predictions about relevancy and competitive bid landscape to determine click value for advertisers. ■
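As a minimal sketch of how a click’s value might be estimated (the model and numbers below are hypothetical, not BidSmart’s algorithm), one common formulation multiplies predicted response rates by conversion value:

# Hypothetical model: value of a click is the conversion rate times the value
# of a conversion; expected value of an impression also factors in the CTR.
def value_per_click(conversion_rate, value_per_conversion):
    return conversion_rate * value_per_conversion

def expected_value_per_impression(predicted_ctr, conversion_rate, value_per_conversion):
    return predicted_ctr * value_per_click(conversion_rate, value_per_conversion)

print(f"value per click: ${value_per_click(0.031, 60.0):.2f}")
print(f"expected value per impression: ${expected_value_per_impression(0.042, 0.031, 60.0):.4f}")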


StatShots
TDWI Technology Survey: Advanced Analytics

The Chicago 2014 Technology Survey asked conference attendees about advanced analytics, which provides algorithms for complex analysis of structured or unstructured data and includes sophisticated statistical models, machine learning, neural networks, text analytics, advanced visualization, and other advanced data mining techniques. About 100 people responded to the survey, so this should be considered a quick snapshot study.

Advanced analytics is gaining steam. About 41% of respondents were utilizing advanced analytics and 37% were currently exploring it (Figure 1). The remainder (22%) either didn’t know or had no plans for the technology. The most common kinds of advanced analytics being used are predictive analytics for customer behavior, fraud, or risk management.

Companies view advanced analytics as an opportunity. The vast majority (close to 90%) of respondents see advanced analytics as an opportunity. Less than 2% answered that it was a problem.

Enterprises are still primarily analyzing structured data stored in their data warehouses. The vast majority of respondents were using structured data from records or tables. Some were supplementing this with demographic data. This is a typical scenario for predictive analytics. About 46% were using time series data, which is also not surprising. What’s interesting is that 30% of respondents claimed to be using geospatial data as part of their advanced analytics. That’s good news because it shows some movement forward in analytics.

Although companies are making the move to advanced analytics, the infrastructure currently supporting it is primarily the enterprise data warehouse/mart (70%), desktop applications (61%), or flat files (60%). (See Figure 2.) Interestingly, the data warehouse appliance is currently in use by 47% of respondents, and the analytic platform is poised for growth. On the flip side, less than 10% of respondents were currently using Hadoop or a public cloud as a platform for advanced analytics. This may change, however, if respondents stick to their plans: an additional 37% plan to use Hadoop and 25% the public cloud within the next three years.

Want to know where you stand relative to your peers in terms of big data analytics maturity? Take the big data maturity assessment at tdwi.org/BDMM.

—Fern Halper, TDWI Research Director

Which statement best describes the state of advanced analytics in your organization or company?

We are exploring advanced analytics solutions now 37%

We have been utilizing advanced analytics for at least three years in our organization 19%

We are neither utilizing nor exploring advanced analytics 14%

We have deployed advanced analytics and it has been used for less than one year 13%

We have been utilizing advanced analytics for at least one year but not more than three years 9%

Don't know 8%

Figure 1: Based on 101 respondents in May 2014.

Which components are part of the infrastructure to support advanced analytics? Now? Three years from now?

Component                                           Using now   Three years from now   Won't use   Don't know
Enterprise data warehouse/marts                        70%              12%                7%          11%
Desktop application                                    61%               8%               13%          18%
Flat files on servers                                  60%               9%               13%          18%
Data warehouse appliance                               47%              15%               13%          25%
Analytic platform                                      39%              37%                9%          15%
Enterprise content management system                   29%              23%               10%          38%
Analytics sandboxes                                    28%              31%               11%          30%
Public cloud as a platform for some DW components       9%              25%               23%          43%
Hadoop Distributed File System (HDFS)                   9%              37%               17%          38%

Figure 2: Based on 101 respondents in May 2014.


tdwi.org/cbip

CERTIFIED BUSINESS INTELLIGENCE PROFESSIONAL

Get Recognized as an Industry Leader
Advance your career with CBIP

“Professionals holding a TDWI CBIP certification command an average salary of $113,500—more than $8,200 greater than the average for non-certified professionals.”
—2013 TDWI Salary, Roles, and Responsibilities Report

TDWI CERTIFICATION

Distinguishing yourself in your career can be a difficult yet rewarding task. Let your résumé show that you have the powerful combination of experience and education that comes from the BI, DW, and analytics industry’s most meaningful and credible certification program.

Become a Certified Business Intelligence Professional today! Find out how to advance your career with a BI certification credential from TDWI. Take the first step: visit tdwi.org/cbip.


TDWI Partners

These solution providers have joined TDWI as special Partners and share TDWI’s strong commitment to quality and content in education and knowledge transfer for business intelligence, data warehousing, and analytics.