Best Practices in Data Loading
for an Oracle Data Warehouse
Jean-Pierre Dijcks
Oracle Corporation
United States
Keywords:
Data Integration, ETL, Data Warehouse, Oracle Database Machine, Best Practices
Introduction
Perhaps the most significant trend in data warehousing over the past few years has been the
growth in data volumes. Whether you are a school district, a financial institution or a
manufacturing organization, you are storing more and more data.
If we look at the Winter Survey¹ and project the trend over the next couple of years, we see growth rates in data volumes that were unimaginable until recently.
Figure 1. Growth in data volumes
There are many reasons for this data growth: new business processes continue to be automated, and more detailed information is collected at every level. Regulatory compliance increases data storage requirements, and the desire to analyze more historical data adds to the growth in data volumes. In this paper we do not focus on why data volumes are growing, but on what that growth means for data loading and data integration.
¹ Source: Winter TopTen Survey, Winter Corporation, Waltham, MA, 2008
Increasing Data Volumes and ETL
As data warehouses grow, so does the requirement for well-performing ETL jobs.
Another strain on the ETL subsystem is the need to load data at ever shorter intervals.
Five years ago, loading a data warehouse on a nightly basis was state of the art and the goal to achieve. Today that modus operandi is outdated, and most data warehouse systems are moving to multi-batch or even micro-batch loading. In these condensed batch operations multiple loads are run in a single day; micro-batching runs a batch load every few minutes.
Both of these forces put strain on the ETL subsystem. This paper intends to show how to deal with this strain and how to optimize the use of Oracle’s ETL capabilities to satisfy your ETL Service Level Agreements.
Optimized ETL Requires Balanced Hardware
Optimizing data loading or ETL starts with a platform that – at least in theory – can handle
your required throughputs. When looking at the scalability and performance of a system it is
crucial to look at both software and hardware characteristics and understand potential
bottlenecks.
Balanced Systems
A system is balanced when the storage array is capable of reading and moving – through the
storage area network and the Host Bus Adapters (HBA) – enough data to the database servers
to have the CPUs adequately loaded. In other words, neither the IO capacity, nor the
bandwidth within the system, nor the CPU should be a constraint on the system.
Figure 2. Balance between database and storage bandwidth?
Consider the simplified example shown in Figure 2. The storage subsystem can deliver a maximum throughput of 2 GB/sec, whereas the path up to the compute platform can deliver 4 GB/sec. If we now assume that the database servers have sufficient CPU capacity and other resources to handle a 4 GB/sec input, they will run at half capacity at best, because the bottleneck limits the storage to delivering no more than 2 GB/sec.
Figure 3. A balanced system
When balancing a system it is crucial to balance I/O capacity, CPU capacity, available memory and interconnect capacity. This balance is shown in Figure 3.
Balancing the system in Figure 2 will allow the storage arrays to deliver the full 4 GB/sec to the compute side and utilize the available CPU capacity. When all components are balanced, the system should double its performance.
Old Hardware and Incorrect System Sizing
If you were trying to break the track record of this year’s Formula 1 world champion, would you go to the track with a Formula 1 car from the late eighties?
In reality, however, many of the systems running today are on yesterday’s hardware and yesterday’s networking and storage systems. Sure, you added a bunch of disks to accommodate more storage capacity, but you did not, or could not, upgrade the compute servers or the SAN.
Now your CIO has challenged you to go and break the track record with the old F1 car. You
can hire a better racecar driver, you can adjust the carburetor, and you can buy better tires, but
the result of your attempts is simple: failure. There is no way you can break that track record.
If we go back to Figure 2 and imagine the pipe from the storage array carries 4 GB/sec, we have a balanced system. But what if the storage array, due to the number or type of disks used, can only deliver a 1 GB/sec data stream to the system?
You face the same problem in all of these cases: you need more speed than you can get from the system. No matter how smart your software is, it cannot go faster than the infrastructure that carries it.
How to Utilize Oracle for Fast Bulk Loading
This paper mainly focuses on bulk data movement. While we see a trend towards more real-time deployments, the large majority of data movement is still done in bulk.
Access Methods
When talking about speed of movement, the crucial component is the way Oracle can access the actual data. With newer releases of Oracle Database, new and interesting access methods become available. The following is a ranking and a brief discussion of some of these methods, from slowest to fastest:
• Web Services – In their original incarnation (using SOAP-style communication), web services allow a system to connect to an externally hosted system via a simple API call. In general, a web service is a slow method of communication for bulk data loads.
• Database Links – This method incorporates various access protocols depending on the source system. ODBC and Oracle Database Gateways utilize a database link to connect to a non-Oracle system, and database links are also used to connect disparate Oracle systems. Database links should never be used to connect to a schema within the same Oracle database. A database link is a convenient way of connecting to a remote database, but this convenience comes at the cost of performance: a database link does not allow parallel loading and therefore often becomes the bottleneck in the data movement process.
• Data Pump – Since its introduction, Data Pump has made some interesting things possible in ETL. Many of us categorize Data Pump as an export/import utility, but the fact that you can choose specific columns and tables allows for a very fast way of moving data between two or more Oracle instances. Since Oracle Database 11g, External Tables can also read Data Pump files directly, allowing ETL-style SQL access on the actual export file without first staging or importing the data (see the sketch after this list).
• Flat Files – are still one of the best-performing means of moving large data volumes between databases. Especially in a heterogeneous environment where data is loaded from non-Oracle to Oracle, flat files outperform almost every other means. While Data Pump works for Oracle-to-Oracle situations, a SQL Server unload to file, FTP transfer and External Table load into Oracle is orders of magnitude faster than a database link scenario.
• Transportable Tablespaces – are arguably the fastest way of moving data within an Oracle environment. While Data Pump allows for much more granular data movement, moving the entire datafile without any additional steps makes this method fast, direct and simple.
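To illustrate the Data Pump access method described above, the following is a minimal sketch of an External Table that reads a Data Pump export file directly; the directory object, file name and columns are hypothetical:

    CREATE DIRECTORY dmp_dir AS '/u01/app/exports';

    -- External table over a Data Pump export file; no import step needed
    CREATE TABLE sales_dp_ext (
      sale_id   NUMBER,
      sale_date DATE,
      amount    NUMBER
    )
    ORGANIZATION EXTERNAL (
      TYPE ORACLE_DATAPUMP
      DEFAULT DIRECTORY dmp_dir
      LOCATION ('sales_export.dmp')
    );

    -- ETL-style SQL directly against the export file
    SELECT COUNT(*) FROM sales_dp_ext WHERE amount > 1000;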
Using the Right Access Method
With all the access methods available (and this is probably not the full list), the task for the
ETL developer is to choose the right method.
Most ETL tools – as a default mechanism – leverage database links to move data around when Oracle comes into play. However, that default is most likely not the right way to move data around in large volumes. When it comes to Oracle-to-Oracle movement, the preferred choice is to use Data Pump when supported and flat files when Data Pump is not available (due to release restrictions, for example).
In a heterogeneous environment, the fastest way to move data is to unload into flat files, compress the data and use some FTP mechanism to move the files. On the Oracle side, the ETL strategy should then leverage External Tables (and NOT SQL*Loader) with their pre-processing capabilities, as sketched below.
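A minimal sketch of such an External Table using the PREPROCESSOR clause (available since Oracle Database 11g) to decompress a flat file on the fly; the directory objects, zcat location and file layout are assumptions:

    CREATE DIRECTORY data_dir AS '/u01/etl/incoming';
    CREATE DIRECTORY exec_dir AS '/bin';

    CREATE TABLE stg_orders_ext (
      order_id   NUMBER,
      order_date DATE,
      amount     NUMBER
    )
    ORGANIZATION EXTERNAL (
      TYPE ORACLE_LOADER
      DEFAULT DIRECTORY data_dir
      ACCESS PARAMETERS (
        RECORDS DELIMITED BY NEWLINE
        PREPROCESSOR exec_dir:'zcat'  -- decompress before parsing
        FIELDS TERMINATED BY ','
        (
          order_id,
          order_date CHAR(10) DATE_FORMAT DATE MASK "YYYY-MM-DD",
          amount
        )
      )
      LOCATION ('orders.csv.gz')
    )
    PARALLEL
    REJECT LIMIT UNLIMITED;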
The graph in Figure 4 plots the relative performance of the access methods discussed above against their ability to handle heterogeneous source types.
Figure 4. Determine access method based on speed and heterogeneity
Reference data sets – for example small dimensions in a star schema, or recoding tables – can leverage database link mechanisms; even when going to a non-Oracle system, that may be a good enough method. Small data sets typically move fast enough over a database link not to worry about them. Optimizing this via unload and reload mechanisms is not going to gain enough to warrant spending the extra time on the more complex processes that go with unloading; a sketch of the simple approach follows.
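A plain pull over a database link is usually sufficient for such reference sets; a minimal sketch, assuming a link named src_dw and an existing local table:

    -- Refresh a small dimension over a database link; fine for reference
    -- data, but not a pattern to follow for large fact tables
    DELETE FROM dim_region;

    INSERT INTO dim_region (region_id, region_name)
    SELECT region_id, region_name FROM dim_region@src_dw;

    COMMIT;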
Changed Data
For large data volumes, the detection of changed data should be decoupled from the access method. The process that detects changes should do just that: deliver a set of changed data to a transportation mechanism.
Detecting changes can itself be a complex process, and the goal is to do the change detection as quickly as possible. Various methods are available and should be considered, but care should be taken that the change detection does not force a particular transportation mechanism.
In other words, once the changes are identified, you ideally still choose how to extract and move them. Timestamp-based change detection is the simplest case: once the window is known, the data – if in Oracle – can be moved using Data Pump rather than a simple SELECT with a timestamp WHERE clause.
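A sketch of that pattern using the DBMS_DATAPUMP API, assuming an ORDERS table with a LAST_MODIFIED column and a DMP_DIR directory object; in practice the window timestamp would come from the change detection step:

    DECLARE
      h  NUMBER;
      st VARCHAR2(30);
    BEGIN
      h := DBMS_DATAPUMP.OPEN(operation => 'EXPORT', job_mode => 'TABLE');
      DBMS_DATAPUMP.ADD_FILE(h, 'orders_delta.dmp', 'DMP_DIR');
      DBMS_DATAPUMP.METADATA_FILTER(h, 'NAME_EXPR', 'IN (''ORDERS'')');
      -- Export only the changed window, not the full table
      DBMS_DATAPUMP.DATA_FILTER(h, 'SUBQUERY',
        'WHERE last_modified > TO_DATE(''2009-06-01'', ''YYYY-MM-DD'')',
        'ORDERS');
      DBMS_DATAPUMP.START_JOB(h);
      DBMS_DATAPUMP.WAIT_FOR_JOB(h, st);
    END;
    /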
Parallel Loading
Parallel loading and join techniques based on parallel processing are an absolute must when loading large data sets into Oracle. Parallel loading is also the big driver for using External Tables rather than SQL*Loader: with External Tables you can specify parallelism right on the table creation statement, and Oracle will then spawn and manage the parallel processes that actually load the data. SQL*Loader requires the ETL developer to manage parallelism manually. A minimal parallel load pattern is sketched below.
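A sketch of a parallel direct-path load from an External Table declared with a PARALLEL clause (as in the earlier example); table names are illustrative:

    ALTER SESSION ENABLE PARALLEL DML;

    -- Direct-path, parallel insert; Oracle spawns and manages the slaves
    INSERT /*+ APPEND PARALLEL(sales_stage, 8) */ INTO sales_stage
    SELECT /*+ PARALLEL(stg_orders_ext, 8) */ * FROM stg_orders_ext;

    COMMIT;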
Partitioning and ETL
When joining data elements in queries, but also in ETL, a good method of getting great performance is to design the schema to leverage partition-wise joins. By partitioning both tables on their join column, Oracle can hand a matching pair of partitions to a single parallel process. That strategy turns one large join into many small joins divided over the parallel processes, making the entire process run in parallel. This join method is the most efficient from a processing perspective, and the goal should be to use partition-wise joins as much as possible for large data sets; a schema sketch follows.
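A minimal schema sketch enabling a full partition-wise join; both tables are hash partitioned on the join column, and all names and partition counts are illustrative:

    CREATE TABLE customers (
      cust_id   NUMBER,
      cust_name VARCHAR2(100)
    )
    PARTITION BY HASH (cust_id) PARTITIONS 16;

    CREATE TABLE orders (
      order_id NUMBER,
      cust_id  NUMBER,
      amount   NUMBER
    )
    PARTITION BY HASH (cust_id) PARTITIONS 16;

    -- Each parallel process joins one pair of matching partitions
    SELECT /*+ PARALLEL(o, 8) PARALLEL(c, 8) */
           c.cust_name, SUM(o.amount)
    FROM   orders o JOIN customers c ON o.cust_id = c.cust_id
    GROUP  BY c.cust_name;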
Figure 5. Five Steps for Partition Exchange Loading
Another ETL method based on partitioning large tables is Partition Exchange Loading. The theory is that, rather than inserting into a large table with indexes, it is faster to insert into a smaller table and build the indexes afterwards. Once the smaller table has its indexes (and statistics) created, Oracle allows you to swap this table into a large partitioned table via an exchange: the table object becomes a specific partition in the large table, and the partition becomes the – now empty – table. This action is only a single dictionary operation and takes virtually no time. The process is shown in Figure 5, and the exchange step is sketched below.
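A sketch of the exchange step itself, assuming a staging table stage_sales with the same column layout and equivalent local indexes as the target partition:

    -- Single dictionary operation: stage_sales becomes partition sales_q2_2009
    ALTER TABLE sales
      EXCHANGE PARTITION sales_q2_2009
      WITH TABLE stage_sales
      INCLUDING INDEXES
      WITHOUT VALIDATION;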
The process of publishing partitions can also be used to create a single “publish all data” moment for a data warehouse. Figure 6 and the surrounding text show another method to achieve this singular moment in time for data publication.
Publish and Subscribe Model for ETL
Using newer technologies such as flashback queries allows you to create an entire publishing subsystem alongside your ETL utilities. To avoid end users querying data that is being updated, you can use “AS OF” queries to regulate which data is visible.
The scenario goes a little like this: you have a reporting environment on top of the data warehouse and want to make sure that newly loaded data only gets published after it is checked within the context of the entire system.
That means you need to do your loads, update the entire system, but shield the end users from
that data until it is verified and certified. Once verified, you want to publish the data and
update all data sources for the reports.
Figure 6. Using AS OF views for publishing the latest data
The diagram above shows these steps. To make this work, the schema (on the left, receiving the ETL loads) is covered with a layer of views that control the exact timestamp of the data visible to the end users; a small sketch of such a view follows the steps below.
1. Update the view layer to set the timestamp to a moment before the ETL starts
2. Run the regular jobs
3. Ensure the data is correct and all present
4. Publish the data by updating the view layer to a point in time after the ETL load
5. The end users now query the updated data in the warehouse
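A minimal sketch of such a view using a flashback AS OF query; the table, view and timestamp are illustrative. Step 4 publishes simply by recreating the view with the post-load timestamp:

    -- End users query the view, pinned to the last certified point in time
    CREATE OR REPLACE VIEW sales_published AS
    SELECT *
    FROM   sales AS OF TIMESTAMP
           TO_TIMESTAMP('2009-06-01 06:00:00', 'YYYY-MM-DD HH24:MI:SS');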
Now, none of the above is required to achieve read consistency for queries! Oracle does that all by itself, without any effort from the ETL developers (unlike some other databases in the DW space), so do not confuse these two things.
Summary
More data means a lot more strain on the ETL infrastructure. It is important to understand that a well-performing ETL infrastructure in many cases depends on the existing hardware and software platform in use, and it is crucial that the system is sized to perform at the required level. Once the hardware and software pieces are in place to satisfy the throughput required from the ETL subsystem, it is the task of the ETL team to utilize the tools of the trade to create a well-performing ETL architecture.
To run ETL fast it is important to leverage the latest Oracle software features. Instead of running SQL*Loader jobs, Oracle recommends the use of External Tables. For large data sets, database links should be avoided and alternative means such as Data Pump, Transportable Tablespaces and flat files should be utilized.
As data warehouses grow, traditional solutions run out of steam, and it pays to look at features in Oracle that assist in making ETL faster, simpler or just easier to handle. It is important for data warehouse and ETL users to move along and sometimes think outside the box!
Contact address:
Jean-Pierre Dijcks
500 Oracle Parkway
Redwood Shores, CA 94065
USA
Phone: +1 650 607 5394
E-mail: [email protected]
Internet: www.oracle.com