Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Real-World Performance Training: Loading Data
Real-World Performance Team
Agenda
1. The DW/BI Death Spiral
2. Parallel Execution
3. Loading Data
4. Exadata and Database In-Memory
5. Dimensional Queries
The Schema
Oracle Retail Data Warehouse
Retail Demonstration
Table Sizes

Table          Size of Source Data (GB)   Number of Rows (Millions)
Transactions   51.8                       463.7
Payments       54.2                       463.7
Line Items     940.8                      6,980.6
Total          1,046.9                    7,908.0
Retail Demonstration
Table Sizes – Default Compression

Table          Size of Table (GB)   Compression Ratio
Transactions   29.1                 1.78 : 1
Payments       29.2                 1.86 : 1
Line Items     257.1                3.66 : 1
Total          315.5                3.32 : 1
Retail Demonstration
Table Sizes – Hybrid Columnar Compression

Table          Size of Table (GB)   Compression Ratio
Transactions   4.8                  10.82 : 1
Payments       4.9                  10.99 : 1
Line Items     55.0                 17.11 : 1
Total          64.7                 16.18 : 1
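The Hybrid Columnar Compression figures above come from storing the data in columnar-compressed form at load time. A minimal sketch of creating such a table via a parallel CTAS; the table names and the DOP are illustrative, and HCC requires supporting storage such as Exadata:

```sql
-- Direct-path CTAS into an HCC-compressed table
-- (QUERY HIGH is one of several available HCC levels).
create table LINE_ITEMS_HCC
  compress for query high
  parallel 16
  as select * from LINE_ITEMS_STAGE;
```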
Loading a Data Warehouse
• Two broad approaches
– ETL: Extract Transform Load
– ELT: Extract Load Transform
Oracle Confidential – Internal/Restricted/Highly Restricted 33
Loading a Data Warehouse
ETL – Extract Transform Load
• Extract the data from the source system. In many cases, this is the Data Warehouse itself
• Perform transformation and validation, usually on a middle-tier server
• Load the data into the Data Warehouse
– Often the data is written to the Data Warehouse using DML operations: inserts, updates and deletes. In turn, this may require indexes in order to perform
• A whole business has been developed around "data integration" products and services, such as
– Informatica
– Ab Initio
Loading a Data Warehouse
ELT – Extract Load Transform
• Extract the data from the source system
• Load the data as-is into "staging" tables on the Data Warehouse system
• Validation and transformation are performed via SQL and set-based processing techniques
• Final data is added to the target fact or dimension table
– Partition Exchange is an effective technique for this step
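The Partition Exchange step swaps a fully loaded staging table into the target fact table as a data-dictionary operation rather than a data copy. A minimal sketch, assuming a range-partitioned fact table SALES with a partition P_20140531 and a staging table STAGE_3 with a matching column layout (all names are illustrative):

```sql
-- Metadata-only swap: the staging segment becomes the partition,
-- so the "load" into the fact table is near-instant.
alter table SALES
  exchange partition P_20140531
  with table STAGE_3
  including indexes
  without validation;
```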
Loading a Data Warehouse
Extract
• Extracting data from a source system is often the most challenging step
– The tools available depend on the data source
– For Oracle, there is no "data unload" product
  • Home-grown tools
  • FastReader from WisdomForce (now Informatica)
  • Data Pump Export, Transportable Tablespaces
– Compression benefits
  • Reduced time to copy data over the network
  • Increased load performance
– Where will the data be staged?
  • DBFS, NFS, ZFS
  • USB drive!
Loading a Data Warehouse
Loading
• Data loading is a CPU/memory-constrained operation
– Data loads scale well over multiple CPUs, cores and hosts (assuming no other form of contention)
– Memory usage for metadata associated with highly partitioned objects can become significant at high DOP
• Use external tables with a parallel SQL statement (e.g. CTAS or IAS) to minimize on-disk and in-memory metadata. Do NOT use multiple threads of SQL*Loader
– Using external tables is also much simpler than having to manage multiple threads of SQL*Loader
Loading a Data Warehouse
Loading
• Direct Path Load
– Enabled using the APPEND hint
– Default for CTAS and for parallel inserts
• Why Direct Path Load?
– Allows a single parallel insert operation to efficiently load data from multiple parallel server processes
  • Significant performance improvement for parallel DML/DDL operations
– Required for basic/default and HCC compression
– No undo; no redo when NOLOGGING is used
– Bypasses the buffer cache
• Possible issues
– Only one direct path load into a table/partition at a time
– No logging for Data Guard
Anatomy of an External Table
Loading a Data Warehouse
create table FAST_LOAD
(
  column definition list ...
)
organization external
(
  type oracle_loader
  default directory SPEEDY_FILESYSTEM
  access parameters
  (
    preprocessor exec_file_dir:'zcat.sh'
    characterset 'ZHS16GBK'
    badfile ERROR_DUMP:'FAST_LOAD.bad'
    logfile ERROR_DUMP:'FAST_LOAD.log'
    fields
    (
      file column mapping list ...
    )
  )
  location
  ( 'file_1.gz', 'file_2.gz', 'file_3.gz', 'file_4.gz' )
)
reject limit 1000
parallel 4
/

External Table Definition
• The default directory references the mount point
• The preprocessor uncompresses the data using a secure wrapper script (note the compressed files)
• The character set must match the character set of the files
• The number of files should match, or be a multiple of, the DoP
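With a definition like the one above in place, the load itself is a single set-based statement. A sketch, assuming a target staging table STAGE_1 (an illustrative name) with a matching column list:

```sql
alter session enable parallel dml;

-- Direct-path, parallel load straight from the flat files
-- via the external table; no SQL*Loader threads to manage.
insert /*+ append */ into STAGE_1
select * from FAST_LOAD;

commit;
```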
Loading a Data Warehouse
Validation and Transformation
• Elimination of duplicates
– Outer join back to the table
– Window function
– Aggregate with HAVING clause
• Foreign key references
– Outer joins between tables
• The choice of technique depends on the following
– Good/bad validation of the data
– The desire to identify and locate bad rows, e.g. find ROWIDs
– The desire to programmatically eliminate bad rows
Data Validation SQL
Duplicate Rows

Simply check the data:

select pk, count(*)
from DIRTY_DATA
group by pk
having count(*) > 1;

Obtain one of the ROWIDs of the duplicates to investigate:

select pk, count(*), max(rowid)
from DIRTY_DATA
group by pk
having count(*) > 1;

Query the rows you wish to keep, eliminating duplicates based on the load time:

select column_list
from
(
  select a.*,
         row_number() over
         ( partition by pk
           order by load_time desc ) rowno
  from DIRTY_DATA a
)
where rowno = 1;
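When bad rows must be removed in place rather than filtered out during a transformation, the max(rowid) query extends naturally to a delete. A sketch; note that the deck's general advice is to prefer transformation over in-place DML on large tables:

```sql
-- Remove all but one arbitrary copy of each duplicated key;
-- rows with a unique pk keep their single (max) rowid.
delete from DIRTY_DATA d
where d.rowid not in
( select max(rowid)
  from   DIRTY_DATA
  group  by pk );
```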
Data Validation SQL
Orphaned Row Check

Look for orphans:

select C.rowid
from PARENT P
right outer join CHILD C
  on P.pk = C.fk
where P.pk is null;

Look for parents with no children:

select P.rowid
from PARENT P
left outer join CHILD C
  on P.pk = C.fk
where C.fk is null;
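In the ELT style, the same join logic can be used not just to find orphans but to carry only valid rows forward into the next staging table. A sketch, assuming an illustrative target table CHILD_CLEAN:

```sql
alter session enable parallel dml;

-- Keep only children whose parent exists; orphans can be
-- routed to an error table in a separate pass.
insert /*+ append */ into CHILD_CLEAN
select C.*
from   CHILD C
join   PARENT P on P.pk = C.fk;

commit;
```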
Loading a Data Warehouse
Data Transformation vs Data Modification
• Data Transformation
– Change data by performing transformations into a new table
– Consistent and predictable performance
– Supports direct path loads and compression
• Data Modification
– Change data in place
– Update, delete, insert
– Overhead and performance impact of changing existing blocks
– Does not work well with compression
Loading a Data Warehouse
Data Transformation SQL
• Use either
– INSERT /*+ APPEND */ INTO … SELECT
– CREATE TABLE … AS SELECT
• Using an INSERT
– Constraints such as NOT NULL can be correctly applied and enforced
– Data types, column lengths and precision can be defined and preserved
• Using a CTAS
– DDL (not DML)
– Some optimizations are available that are currently disabled for DML. This may change over time.
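The trade-off between the two forms can be sketched side by side; all table and column names below are illustrative:

```sql
-- Option 1: CTAS (DDL) – column types are inferred from the query.
create table SALES_CLEAN
  parallel 8 nologging
  as select * from SALES_STAGE;

-- Option 2: pre-created table with enforced constraints and
-- exact data types, loaded with a direct-path insert.
create table SALES_CLEAN2
( sale_id   number        not null
, sale_date date          not null
, amount    number(12,2)
);

alter session enable parallel dml;
insert /*+ append */ into SALES_CLEAN2
select sale_id, sale_date, amount from SALES_STAGE;
commit;
```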
Rewriting DML as Transformation
Delete

Original DML:

alter session enable parallel dml;

delete from tx_log
where symbol = 'JAVA';

commit;

Rewritten as a transformation:

alter session enable parallel dml;

insert /*+ append */ into tx_log_new
select * from tx_log
where symbol != 'JAVA';

alter table tx_log
  rename to tx_log_old;
alter table tx_log_new
  rename to tx_log;

The predicate is the complement of the DELETE: it selects the rows to keep.
Rewriting DML as Transformation
Update

Original DML:

alter session enable parallel dml;

update sales_ledger
set tax_rate = 9.9
where tax_rate = 9.3
and sales_date > '01-Jan-09';

commit;

Rewritten as a transformation:

alter session enable parallel dml;

insert /*+ append */ into sales_ledger_new
select
  <column list>,
  case
    when sales_date > '01-Jan-09'
     and tax_rate = 9.3
    then 9.9
    else tax_rate
  end,
  <column list>
from sales_ledger;

The UPDATE predicates are moved into a CASE expression in the SELECT list to transform the rows.
Loading a Data Warehouse
Example Workflow
• An example workflow may be:
– Load data into the first staging table
  • Basic data integrity: nulls, data types, etc.
– Check the data, writing "good" data to a second staging table
  • Uniqueness, foreign keys, business rules, etc.
  • Apply constraints with RELY DISABLE NOVALIDATE
– Transform the data into a third staging table
  • Tax corrections, time zone corrections, consolidated codes, etc.
  • Gather statistics on the final staging table, including synopses
– Use Partition Exchange to "swap in" the staging table to the fact table
  • Gather global statistics, which will be rolled up using partition synopses
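Two of the steps above have a compact SQL form. A sketch, with illustrative schema, table and column names:

```sql
-- Declare the key for the optimizer without paying the cost of
-- validating data that the staging pipeline has already checked.
alter table STAGE_3
  add constraint stage_3_pk primary key (tx_id)
  rely disable novalidate;

-- Enable incremental (synopsis-based) global statistics on the
-- partitioned fact table, then gather on the staging table so
-- global stats can be rolled up after the partition exchange.
exec dbms_stats.set_table_prefs('RETAIL', 'SALES', 'INCREMENTAL', 'TRUE');
exec dbms_stats.gather_table_stats('RETAIL', 'STAGE_3');
```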
Example Workflow – Load Data into Staging Table
[Diagram: data is loaded from the external table into stage_1; alongside sit staging tables stage_2 and stage_3, error tables stage_1_err and stage_2_err, and a daily partitioned fact table with partitions 5-25 through 5-31 plus partition synopses]
Data Load – Validation
[Diagram: valid data is transformed from stage_1 into stage_2; invalid data is written to stage_1_err]
Data Load – Transformation
[Diagram: data is transformed from stage_2 into stage_3 (VAT codes, time zones, etc.); invalid data is written to stage_2_err]
Data Load – Gather Statistics
[Diagram: statistics, including partition synopses, are gathered on stage_3]
Data Load – Partition Exchange
[Diagram: stage_3 is exchanged with the 5-31 partition of the daily partitioned fact table]
Data Transformation SQL
Transformation vs. Modification

Driver           | Transformation                                       | Modification
Compression      | No impact                                            | Compression may be lost, severely impacting performance
Fragmentation    | None                                                 | Fragmentation, row chaining and holes will almost certainly occur
Logging and UNDO | Possible to eliminate                                | Will occur; may impact performance and administration requirements
Indexes          | Indexes need to be rebuilt if used                   | Indexes are maintained in place; this may be a performance overhead, and bitmap indexes may become fragmented and require rebuilding
Metadata         | Grants etc. will require redefinition                | No impact
Space            | Overhead of maintaining multiple copies of the data  | Overhead of UNDO and logging
Coding           | New code must be written and new techniques taught   | Old code runs, with performance challenges
3rd Party        | May not be supported by tool vendors                 |
Loading a Data Warehouse
Summary
• Data validation and modification
– Best executed in the database via SQL
– This presents big challenges to users who have committed to classic ETL tools such as Informatica
– Changes to data are best made via transformation and redefinition rather than via classic OLTP DML statements (delete, update, merge)
  • Allows exploitation of hardware and parallelism
  • Minimizes fragmentation and maximizes compression
  • Minimizes logging and minimizes recovery
– Set-based techniques use efficient CPU and I/O techniques