Multi-terabyte Data Warehouses on MYSQL? Absolutely!


Agenda

Data Warehousing Today

Traditional Data Warehouse Solutions

A New Approach to Multi-Terabyte Data Warehouses


A Look at the Market

The worldwide database market was $18.8 billion in 2006

$11.1 billion for OLTP; $7.7 billion for data warehousing

41% of the database budget is spent on data warehousing

Data warehousing is the number one area of CIO spend in North America in 2007 and 2008

The data warehousing market is growing twice as fast as the OLTP database market
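
As a quick sanity check on the figures above, the two segments can be summed and the data warehousing share computed. This is a minimal sketch that assumes the 41% figure is derived from the same market numbers; the variable names are illustrative only.

```python
# Figures as stated on the slide (billions of US dollars, 2006).
oltp = 11.1
data_warehousing = 7.7
total_market = 18.8

# The two segments should add up to the stated market total.
segment_sum = round(oltp + data_warehousing, 1)

# Data warehousing's share of the total, as a whole percentage.
dw_share_pct = round(data_warehousing / total_market * 100)

print(segment_sum)
print(dw_share_pct)
```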


BI is Not Just for the Boardroom

BI started as a strategic decision-support tool: canned reports that executives and analysts used to guide the ship

Today, BI is mission critical and serves users across the enterprise, supporting not only traditional analytics but also daily operational decision making

These changes in use have brought changes in infrastructure requirements

Traditional RDBMSs have trouble making the grade


How is the Data Warehouse different?

OLTP

• Many simple transactions of exactly the same type

• A lot of tuning (data model, indexes, partitions) in order for the specific transaction to perform well

• Focus on current data

• Scaling through massive hardware and multiple copies of the database (e.g., online gaming systems have 1 database instance for every customer)

• Interface to an application, often custom

Data Warehouse

• Many queries - all very different, unpredictable, and always changing

• Queries are very complex - lots of joins, group bys, where clauses

• Focus on history over many days, months, years

• Interface to user and many different Business Intelligence tools


Industry Research

The Data Warehousing Challenge

Gartner

“Collecting and analyzing information that enables your organization to better lead, decide, measure, manage and optimize its overall efficiency is a major financial and competitive differentiator. The faster an enterprise can gather and use relevant information, the faster it will be able to reduce costs and increase profits.”

“Volume of the world’s data doubles every three years. Ninety-two percent of new information is stored in magnetic media...organizations face a simple problem: what to do with all the data.”


The Problem: Data Warehouses are Strained

Data Volume + Complexity = Trouble

Data Volume: Data is growing exponentially

Complexity: Users are asking more complex questions

Trouble:

• Data is aggregated and deleted

• Data is archived and not usable

• Complex queries are blocked

• Complex queries don’t perform


What do the current limitations mean for Stakeholders?

Users: Do not get access to the data they need; queries run too slowly; are not allowed to think creatively (ask new and different questions); are told to wait months for what they want in minutes

IT: Besieged with requests for new data sources; feature creep and changing requirements strain resources; analytic system maintenance and tuning affect support for operational systems

Executives: CIOs face service-level complaints and rising IT costs; business unit leaders without analytic data fail to achieve objectives


About Infobright

Founded: 2005

Headquarters: Toronto, Canada; offices in Boston, MA and Warsaw, Poland

Brighthouse: A highly scalable, analytic data warehouse built on MySQL, designed to deliver fast response for ad hoc, complex queries without burdening IT

Major Benefits

• Simplicity: “Load and Go”, no new schemas, no indices, no data partitioning, easy to maintain

• Scalability: Ideal for databases of 500 GB to 50 TB

• Low TCO: Industry-leading compression, less storage, industry-standard servers, low software costs, minimal ongoing operational expenses

Key Partner

MySQL/Sun

• Leverages MySQL connectivity to ETL and BI

• Provides MySQL customers with a scalable, enterprise-ready data warehouse


Data Warehousing: Part of the Problem

More Data & Data Sources

• Clickstream and log files

• Existing data warehouse

• External Sources

More Kinds of Output Needed by More Business Users

Traditional Data Warehousing

• I/O intensive, write centric

• Labor intensive, heavy indexing and partitioning

• Hardware intensive: massive storage; big servers


Real Life Example

Background: A large, performance-driven internet marketing company used data from 155 million online consumers worldwide, with sophisticated analytics and advanced targeting technologies, to create value for both marketers and publishers. Its operational systems could no longer handle both operations and reporting, and the company was unable to execute queries against large data volumes.

IT Challenge: The volume of data (32 million visitors/day and 140 million actions) exceeded the capabilities of the production system. In addition, staff could not keep pace with the needs of the users: there was no ad hoc query ability and therefore no ability to compete using analytics.

“Data is the difference. The difference between a campaign that meets your objectives vs. one that blows them away. Between paying $9 vs. $60 for a new customer. Between predicting what sites to advertise on vs. knowing you’ve put the right message in front of the right person no matter what site they are on.”


Desired Queries

#1 - Campaign Effectiveness:

Goal was to determine the optimum number of times to show an ad to get best results

Actual query example: analyzed 2 billion rows of campaign frequency by date, looking at 5 campaigns to determine how many times a user saw each campaign.

#2 - User Demographics by Campaign:

Counts users by different demographic categories

Very wide range of possible results across a varying range of rows

Two actual query examples:

• User entered an incorrect campaign number. The search was performed against 1.3 billion rows in the user campaign aggregate table and the result was a null set

• Largest campaign (highest results returned), where 89 million rows (11% of entire table) in user campaign were selected and joined to 57 million rows in the user dimension table
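
To make the shape of query #1 concrete, the sketch below runs a comparable aggregation against a tiny in-memory SQLite table. The table name `campaign_frequency`, its columns, and the sample rows are all hypothetical, since the customer's actual schema and SQL are not shown in the deck. The second call mirrors the "incorrect campaign number" case, which returns an empty result.

```python
import sqlite3

# Hypothetical schema: one row per user, campaign, and day with an impression count.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE campaign_frequency ("
    " user_id INTEGER, campaign_id INTEGER, view_date TEXT, impressions INTEGER)"
)
rows = [
    (1, 101, "2008-01-01", 2),
    (1, 101, "2008-01-02", 1),
    (2, 101, "2008-01-01", 1),
    (2, 102, "2008-01-03", 4),
    (3, 999, "2008-01-01", 5),  # campaign outside the analyzed set
]
conn.executemany("INSERT INTO campaign_frequency VALUES (?, ?, ?, ?)", rows)


def views_per_user(conn, campaign_ids):
    """Return (campaign_id, user_id, times_seen) for the given campaigns."""
    placeholders = ",".join("?" for _ in campaign_ids)
    sql = (
        "SELECT campaign_id, user_id, SUM(impressions) AS times_seen "
        "FROM campaign_frequency "
        f"WHERE campaign_id IN ({placeholders}) "
        "GROUP BY campaign_id, user_id "
        "ORDER BY campaign_id, user_id"
    )
    return conn.execute(sql, list(campaign_ids)).fetchall()


result = views_per_user(conn, [101, 102])
empty_result = views_per_user(conn, [12345])  # campaign number that does not exist
print(result)
print(empty_result)
```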


Traditional Data Warehouse Approach

Identify the reporting requirements

Determine the data needed

Design the data warehouse:

• Extract-Transform-Load

• Data Model (Logical and Physical)

• Canned reports and BI tools

…then

Revise the model as reporting requirements change and data grows:

• Add indexes

• Partition data to improve performance

• Restrict users!


Traditional Data Warehouse Approach

Results:

Software costs well known and predictable but...

Management and support costs spiral:

• Partitioning strategies

• Indexing strategies

• Additional data marts

• More hardware

Business user satisfaction declines as restrictions are placed on:

• Ad hoc query capabilities

• Volume of historical data that can be queried

• Time lag between business requirement and system delivery

With this particular client, their systems were unable to handle this volume of data so they couldn’t run these queries at all!


Market Evolution

[Chart labels: Innovation; Working Smarter; Work Harder; Work Smarter]

Traditional: All-purpose RDBMS. Resource intensive, lots of DBA time

Hardware Advances: Divide and conquer on lots of hardware (MPP). Nothing to address underlying issues

Database Advances: Extending Database Concepts. Incremental improvements, still inflexible

Data Warehouse Innovator: Radical New Approach


What to Look for in a New Approach

[Figure labels: The Analytic Data Warehouse; Clickstream and log files; Existing data warehouse; External Sources]

New Approach

• Leverages column approach

• Automatically creates structures that:
  - find needed data
  - respond to all queries
  - are always ready

• Has small footprint

• Uses existing infrastructure

• Is easy to set up and maintain


A New Approach: Introducing Brighthouse

The Infobright Analytic Data Warehouse

Working Smarter, Not Harder: Scalable solution without scaling IT

[Figure labels: Clickstream and log files; Existing data warehouse; External Sources]

• Better Analytics

• Faster Response

• Decreased IT Burden

• Smaller Footprint


How Brighthouse Works Smarter

Brighthouse

Smarter architecture: Load data and go

No indices or partitions to build / maintain

Knowledge Grid created automatically as data loaded

Up to 40:1 compression reduces storage

Open architecture leverages off-the-shelf hardware

Data Packs—Data stored in manageably sized, highly compressed data packs

Data compressed using algorithms tailored to data type

Knowledge Grid—statistics and metadata “describing” the super-compressed data
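
The relationship between data packs and the Knowledge Grid described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the pack size of 4 values, the use of `zlib` over JSON, and the `PackStats` fields are hypothetical choices for illustration, and do not represent Brighthouse's actual storage format or compression algorithms.

```python
import json
import zlib
from dataclasses import dataclass


@dataclass
class PackStats:
    """Hypothetical per-pack metadata entry: statistics describing one pack."""
    minimum: int
    maximum: int
    total: int
    count: int


def build_packs(values, pack_size=4):
    """Split one column into compressed packs plus a list of per-pack statistics."""
    packs, grid = [], []
    for start in range(0, len(values), pack_size):
        chunk = values[start:start + pack_size]
        # Stand-in compression; a real engine would pick an algorithm per data type.
        packs.append(zlib.compress(json.dumps(chunk).encode("utf-8")))
        grid.append(PackStats(min(chunk), max(chunk), sum(chunk), len(chunk)))
    return packs, grid


def read_pack(pack):
    """Decompress a single pack back into its list of values."""
    return json.loads(zlib.decompress(pack).decode("utf-8"))


column = [3, 5, 4, 4, 90, 95, 91, 99, 7, 8]
packs, grid = build_packs(column)
print(len(packs))
print(grid[1])
```

The statistics are gathered while the data is being loaded, which is the sense in which no separate index-building step is needed in this toy model.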


How Brighthouse Works Smarter

Brighthouse

Optimizer iterates over the Knowledge Grid

Only the data packs needed to resolve the query are decompressed

Often query results can be determined from the Knowledge Grid alone

Data Packs

Knowledge Grid

Query received by Brighthouse
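
The query flow on this slide can be sketched as a simple pruning loop: consult per-pack metadata first, and decompress a pack only when the metadata cannot decide the answer. This is a simplified illustration, not Infobright's optimizer; the helper names, the min/max/count metadata, and the `zlib`-over-JSON packs are assumptions made for the example.

```python
import json
import zlib


def make_pack(values):
    """Return (compressed values, metadata) for one hypothetical data pack."""
    blob = zlib.compress(json.dumps(values).encode("utf-8"))
    return blob, {"min": min(values), "max": max(values), "count": len(values)}


def count_greater_than(packs, threshold):
    """Count values > threshold, decompressing only packs the metadata cannot decide."""
    matches = 0
    opened = 0
    for blob, meta in packs:
        if meta["max"] <= threshold:
            continue  # no value can match: skip the pack entirely
        if meta["min"] > threshold:
            matches += meta["count"]  # every value matches: metadata alone answers
            continue
        opened += 1  # only some values may match: the pack must be read
        values = json.loads(zlib.decompress(blob).decode("utf-8"))
        matches += sum(1 for v in values if v > threshold)
    return matches, opened


packs = [make_pack(chunk) for chunk in ([3, 5, 4, 4], [90, 95, 91, 99], [7, 60, 8, 55])]
print(count_greater_than(packs, 50))   # one pack has to be opened
print(count_greater_than(packs, 100))  # answered from metadata alone
```

In the second call no pack is decompressed at all, which corresponds to the case where the result is determined from the metadata alone.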


Brighthouse is Easy on IT

Existing Data Warehouses

Clickstream/Logfiles

External Sources

No strain on IT:

• No need for physical data modeling

• Run on standard hardware

• Works with existing BI and ETL platforms

MySQL “wrapper”:

• No need to learn new database system

• Leverage mature tools

[Figure labels: ETL Platform Connector; BI Connectors]


Brighthouse Architecture and MySQL

MySQL selected due to:

• mature connectors, tools, resources

• interconnectivity and certification with BI tools

• commercial OEM license protects our IP

• most broadly used open-source DB (12 million users)

Benefits

• Greatly improved time to market

• Development focused on competitive differentiators

• Sell to MySQL customers experiencing scalability problems


Real Life Example: Results with Brighthouse

What type of queries did they want to run? How did Brighthouse perform?

#1 Users by Campaign: Analyzed 2 billion rows of campaign frequency by date to look at 5 campaigns

#2 Demographics by Campaign:

• Against 1.3 billion rows in the user campaign aggregate table, where the result is a null set

• Largest campaign (highest results returned), where 89 million rows (11% of entire table) in user campaign were selected and joined to 57 million rows in the user dimension table

Overall benefits of Infobright solution

• The ability to track impressions/clicks/actions by user and thereby more intelligently provide their clients with reliable data. They can now compare costs to clickthroughs to optimize banner purchasing and improve cost performance.

• The ability to optimize advertising dollars for their clients.

• The ability to make data accessible to end users via SQL and BI tools (Pentaho)

• The ability to lower the cost of queries: the ability to eliminate DBA involvement and thereby reduce the personnel costs associated with ad hoc queries

• Brighthouse supported a 10 TB data warehouse with a single, $20K industry-standard server and much less storage than alternatives


Customer Query Response Time Results vs. Oracle

Oracle Time: 136 sec

Brighthouse 3.0 Time: 16.8 sec
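
For reference, the two timings above work out to roughly an eightfold speedup. The short calculation below is a sketch using only the figures on the slide; the variable names are illustrative.

```python
# Timings as stated on the slide, in seconds.
oracle_secs = 136.0
brighthouse_secs = 16.8

# How many times faster the Brighthouse run was.
speedup = round(oracle_secs / brighthouse_secs, 1)

# Reduction in response time, as a percentage of the Oracle time.
reduction_pct = round((1 - brighthouse_secs / oracle_secs) * 100, 1)

print(speedup)
print(reduction_pct)
```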


Query Speed as Volume of Data Grows

• Queries were moderately complex, with at least two table joins and two or more where clauses

• Tables were indexed

• Response time represents the average response time of queries

[Chart: Impact of Additional Data on Query Times. Y-axis: Average Response Time (in Secs). Annotation: Brighthouse Performance Advantage increases as data volume grows]


Brighthouse Load Time Remains Constant

• Comparison of load to a single table. Data was loaded in 10 million row chunks

• Table had a single index

[Chart: Data Load Times as Volume Increases. X-axis: Volume (Rows in Millions). Y-axis: Load Time (in Secs). Annotations: Savings in processing time during load over conventional databases; Brighthouse load time stays constant]


Brighthouse is Fast

Brighthouse is designed specifically to quickly run complex queries on large data sets

• The Knowledge Grid’s small chunks of highly compressed data are fast and easy to manipulate

• Knowledge Grid optimizer iteratively optimizes query execution plan

• Only data packs needed to answer query are opened; often query results can be determined from the Knowledge Grid alone

• Users enjoy fast response times no matter how complex or spontaneous their query

“Each month we process and analyze data generated by 20 billion online transactions... We are pleased by Brighthouse’s performance and the fact that we now can get answers to questions we want to ask.” -- Ola Udén, CTO of TradeDoubler


Brighthouse is Flexible

Brighthouse ensures changing and complex analytic requirements are supported with fast response times

• Knowledge Grid is built on-the-fly, creating a layer of statistics and metadata across all columns and rows

• Knowledge Grid obviates the need for indexing, data partitioning, or other physical data structures

• Data no longer needs to be off-loaded or archived

Users can ask any question of all the data

“Brighthouse allows us to do very complex analyses on over 30 terabytes of data” -- Jay Webster, General Manager, BlueLithium


Brighthouse is Simple

Brighthouse eliminates the complications, cost and disruption IT teams must endure to support complex queries

• No DBA resources required to build indices and data partitions in response to user requirements

• No complicated performance variables to tune; “No Knobs”

• Leverages MySQL ease of use, connectivity, and supported BI tools

• Runs on off-the-shelf hardware

Reduced complexity frees IT resources and significantly lowers lifetime TCO


How does Brighthouse impact TCO?

• Hardware footprint 20 to 50 times smaller

• Fewer DBA resources required
  - 40 – 60% reduction in one-time build
  - Up to 90% reduction in ongoing support

• Support for existing infrastructure

• Load and Go

• Improved SLAs: immediate response vs. weeks or months

[Chart labels: ETL and Data Changes; Physical Modeling; Tuning; Hardware; Software]


Prove it!

RAPID START:

1. Call us – we’ll walk you through a few questions to mutually determine if our technology is a good fit.

2. Agree on process – e.g. your place or ours?

3. Load and go – Load your data, run your queries

4. Summarize results – performance, compression, load times

5. Next steps – did we prove it?

Contact Us [email protected] 416.596.2483, x. 225

Download Claudia Imhoff paper:

http://www.infobright.com