The Night is Too Short: 10 Tips to Improve ETL Performance

Dani Schnider, Trivadis AG

danischnider.wordpress.com | @dani_schnider

BASEL | BERN | BRUGG | BUCHAREST | COPENHAGEN | DÜSSELDORF | FRANKFURT A.M. | FREIBURG I. BR. | GENEVA | HAMBURG | LAUSANNE | MANNHEIM | MUNICH | STUTTGART | VIENNA | ZURICH



Dani Schnider

• Senior Principal Consultant at Trivadis AG in Glattbrugg/Zurich

• Trainer of several Trivadis courses

• Co-Author of Books “Data Warehousing mit Oracle” and “Data Warehouse Blueprints”

• Oracle ACE

@dani_schnider danischnider.wordpress.com

The Night is Too Short to Load all Data

Sunset/sunrise in Alta, Norway, 12 July 2016, 00:13

Blog Post: 10 Tips to Improve ETL Performance

https://danischnider.wordpress.com/2017/07/23/10-tips-to-improve-etl-performance/

Tip 1: Use Set-based Operations

Row-based:

    DECLARE
      CURSOR cur_source IS
        SELECT * FROM source;
    BEGIN
      FOR c IN cur_source LOOP
        INSERT INTO target VALUES c;
      END LOOP;
    END;

Set-based:

    INSERT INTO target
    SELECT * FROM source

Demo

Tip 2: Avoid Nested Loops

Tip 3: Drop Unnecessary Indexes

Full Table Scan or Index Scan?

• Full table scans are good
    • For queries with weak selectivity
    • High percentage of data is read

• Index scans are good
    • For queries with strong selectivity
    • Small percentage of data is read (< 1-2 %)

• Typically for DWH and ETL
    • High percentage (often 100%) in ETL
    • Queries with aggregations on large data sets

Tip 3: Drop Unnecessary Indexes

Each additional index needs maintenance effort during ETL jobs
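
As a rough sketch of how such indexes might be found and removed: the query below lists the indexes of one table from the data dictionary, and the DROP INDEX statement removes one of them. The table name SALES and the index name SALES_CUST_IDX are hypothetical; whether an index is really unnecessary has to be checked against the queries that run on the table.

```sql
-- List all indexes of a table (table name SALES is only an example)
SELECT index_name, index_type, uniqueness
  FROM user_indexes
 WHERE table_name = 'SALES';

-- Drop an index that no query needs (hypothetical index name)
DROP INDEX sales_cust_idx;
```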

Tip 4: Avoid Functions in WHERE Conditions

    SELECT * FROM addresses
     WHERE ctr_code = 'CH' AND city = 'Basel';

------------------------------------------------------------------
| Id  | Operation         | Name      | Starts | E-Rows | A-Rows |
------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |      1 |        |   1035 |
|*  1 |  TABLE ACCESS FULL| ADDRESSES |      1 |   1043 |   1035 |
------------------------------------------------------------------

   1 - filter(("CITY"='Basel' AND "CTR_CODE"='CH'))

SQL or PL/SQL functions in WHERE conditions are hard to estimate for the optimizer

Tip 4: Avoid Functions in WHERE Conditions

    SELECT * FROM addresses
     WHERE UPPER(ctr_code) = 'CH' AND UPPER(city) = 'BASEL';

------------------------------------------------------------------
| Id  | Operation         | Name      | Starts | E-Rows | A-Rows |
------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |      1 |        |   1035 |
|*  1 |  TABLE ACCESS FULL| ADDRESSES |      1 |     26 |   1035 |
------------------------------------------------------------------

   1 - filter((UPPER("CITY")='BASEL' AND UPPER("CTR_CODE")='CH'))

SQL or PL/SQL functions in WHERE conditions are hard to estimate for the optimizer: with UPPER() on both columns, the estimate drops to 26 rows (E-Rows) although 1035 rows are returned (A-Rows).

Demo

Tip 5: Take Care of OR in WHERE Condition

Example: Delta detection between two tables

    SELECT CASE
             WHEN t.empno IS NULL THEN 'INS'
             WHEN s.empno IS NULL THEN 'DEL'
             ELSE 'UPD'
           END dml_flag
         , NVL(s.empno, t.empno) empno
         , s.ename, s.job, s.mgr, s.sal, s.comm, s.deptno
      FROM emp_source s
      FULL JOIN emp_target t ON (s.empno = t.empno)
     WHERE (NVL(s.ename, '(null)') != NVL(t.ename, '(null)'))
        OR (NVL(s.job, '(null)') != NVL(t.job, '(null)'))
        OR (NVL(s.mgr, -999999) != NVL(t.mgr, -999999))
        OR (NVL(s.sal, -999999) != NVL(t.sal, -999999))
        OR (NVL(s.comm, -999999) != NVL(t.comm, -999999))
        OR (NVL(s.deptno, -999999) != NVL(t.deptno, -999999))

Tip 5: Take Care of OR in WHERE Condition

Example: Delta detection between two tables

    SELECT CASE
             WHEN t.empno IS NULL THEN 'INS'
             WHEN s.empno IS NULL THEN 'DEL'
             ELSE 'UPD'
           END dml_flag
         , NVL(s.empno, t.empno) empno
         , s.ename, s.job, s.mgr, s.sal, s.comm, s.deptno
      FROM emp_source s
      FULL JOIN emp_target t ON (s.empno = t.empno)
     WHERE DECODE(s.ename, t.ename, 0, 1)
         + DECODE(s.job, t.job, 0, 1)
         + DECODE(s.mgr, t.mgr, 0, 1)
         + DECODE(s.sal, t.sal, 0, 1)
         + DECODE(s.comm, t.comm, 0, 1)
         + DECODE(s.deptno, t.deptno, 0, 1) > 0
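
The DECODE variant needs no NVL because DECODE treats two NULL values as equal. The query below is a minimal sketch of that behavior on literal values only; the column aliases are made up for illustration and are not part of the original example.

```sql
-- DECODE(a, b, 0, 1) returns 0 if a and b are equal (including NULL = NULL), else 1
SELECT DECODE(NULL, NULL, 0, 1) AS both_null    -- 0: two NULLs count as equal
     , DECODE('A',  NULL, 0, 1) AS one_null     -- 1: value differs from NULL
     , DECODE('A',  'A',  0, 1) AS same_value   -- 0: identical values
     , DECODE('A',  'B',  0, 1) AS other_value  -- 1: different values
  FROM dual;
```

Summing the DECODE results and comparing the sum with 0 therefore replaces the chain of OR conditions with a single expression.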

Tip 6: Reduce Data as Early as Possible

[Diagram: ETL flow with the steps Transformation, Lookup, Join, Transformation and Filter]
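
One way to read this tip in SQL terms is sketched below: the filter is applied to the staging table first, so that the lookup join and the transformation only see the rows of the current load. All table and column names (stg_orders, dim_customer, dwh_orders, load_date and so on) are hypothetical and only serve the illustration.

```sql
-- Hypothetical tables: stg_orders (staging), dim_customer (lookup), dwh_orders (target)
INSERT INTO dwh_orders (order_id, customer_key, order_date, amount_chf)
WITH filtered AS (
  -- Filter first: only the rows of the current load are passed on
  SELECT order_id, customer_no, order_date, amount, exchange_rate
    FROM stg_orders
   WHERE load_date = TRUNC(SYSDATE)
)
SELECT f.order_id
     , c.customer_key                        -- lookup of the dimension key
     , f.order_date
     , ROUND(f.amount * f.exchange_rate, 2)  -- transformation on the reduced row set
  FROM filtered f
  JOIN dim_customer c ON (c.customer_no = f.customer_no);
```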


Tip 7: Use WITH to Split Complex Queries

    WITH a AS (SELECT ...
                 FROM t3
                 JOIN t4 ON ...
                WHERE ...)
       , b AS (SELECT ...
                 FROM t5
                WHERE ...)
       , c AS (SELECT ...
                 FROM a
                 JOIN b ON ...)
       , d AS (SELECT ...
                 FROM t1
                 JOIN t2 ON ...
                 JOIN c ON ...)
    SELECT ...
      FROM d
     WHERE ...

    SELECT ...
      FROM (SELECT ...
              FROM t1
              JOIN t2 ON ...
              JOIN (SELECT ...
                      FROM (SELECT ...
                              FROM t3
                              JOIN t4 ON ...
                             WHERE ...) a
                      JOIN (SELECT ...
                              FROM t5
                             WHERE ...) b
                        ON ...
                     WHERE ...) c
           ) d
     WHERE ...

    WITH a AS (SELECT /*+ materialize */ ...
                 FROM t3
                 JOIN t4 ON ...
                WHERE ...)
       , b AS (SELECT /*+ materialize */ ...
                 FROM t5
                WHERE ...)
       , c AS (SELECT /*+ materialize */ ...
                 FROM a
                 JOIN b ON ...)
       , d AS (SELECT /*+ materialize */ ...
                 FROM t1
                 JOIN t2 ON ...
                 JOIN c ON ...)
    SELECT ...
      FROM d
     WHERE ...

Demo

Original Statement: ORA-01652 after 2 minutes

Rewritten Statement: Result after 3 seconds

Tip 8: Run Statements in Parallel

[Diagram: SOURCE (Parallel 8) → Transformation → TARGET (Parallel 8)]

Tip 8: Run Statements in Parallel

    ALTER SESSION ENABLE PARALLEL DML;

    INSERT /*+ PARALLEL (target, 8) */ INTO target
    SELECT /*+ PARALLEL (source, 8) */ * FROM source;

------------------------------------------------------------------------------------
| Id  | Operation                          | Name     |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT                   |          |        |      |            |
|   1 |  PX COORDINATOR                    |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)              | :TQ10000 |  Q1,00 | P->S | QC (RAND)  |
|   3 |    LOAD AS SELECT (HYBRID TSM/HWMB)| TARGET   |  Q1,00 | PCWP |            |
|   4 |     OPTIMIZER STATISTICS GATHERING |          |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR             |          |  Q1,00 | PCWC |            |
|   6 |       TABLE ACCESS FULL            | SOURCE   |  Q1,00 | PCWP |            |
------------------------------------------------------------------------------------

   - Degree of Parallelism is 8 because of table property

Tip 8: Run Statements in Parallel

[Figure: 4 CPUs vs. 8 CPUs]

Tip 9: Perform Direct-Path INSERT

    INSERT /*+ APPEND */ INTO sales
    SELECT * FROM stage_sales;

Appending new table blocks above the high water mark is faster than a conventional INSERT

Direct-path INSERT is also used for Parallel DML statements

Tip 9: Perform Direct-Path INSERT

10 ETL Performance Tips - DOAG 2018, 21.11.2018

Restrictions must be considered:

• If FK constraints are defined, PDML / direct-path is disabled

• Conventional load is used

----------------------------------------------------
| Id  | Operation                      | Name      |
----------------------------------------------------
|   0 | INSERT STATEMENT               |           |
|   1 |  LOAD TABLE CONVENTIONAL       | SALES     |
|   2 |   PX COORDINATOR               |           |
|   3 |    PX SEND QC (RANDOM)         | :TQ10000  |
|   4 |     PX BLOCK ITERATOR          |           |
|   5 |      TABLE ACCESS STORAGE FULL | STG_SALES |
----------------------------------------------------

Note
-----
   - automatic DOP: Computed Degree of Parallelism is 8
   - PDML disabled because parent referential constraints are present

Recommendation:

• Define reliable constraints

    ALTER TABLE sales
      ADD FOREIGN KEY (cust_id) REFERENCES countries
      RELY DISABLE NOVALIDATE
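
Whether a statement is actually executed as direct-path INSERT can be checked in its execution plan, as the plans on these slides suggest: LOAD AS SELECT indicates direct-path, LOAD TABLE CONVENTIONAL indicates a conventional load. The following is a minimal sketch using EXPLAIN PLAN and DBMS_XPLAN.DISPLAY with the table names from the slide; the exact plan output depends on the database version and configuration.

```sql
-- Generate the plan without executing the INSERT
EXPLAIN PLAN FOR
INSERT /*+ APPEND */ INTO sales
SELECT * FROM stage_sales;

-- Show the plan: look for LOAD AS SELECT (direct-path)
-- or LOAD TABLE CONVENTIONAL (conventional load)
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```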

Tip 10: Gather Statistics after Loading each Table


[Diagram: ETL job that joins T1 (2000 rows) and T2 (3000 rows) and loads T3 (1500 rows); statistics are gathered after loading T1 and after loading T2]

    DBMS_STATS.gather_table_stats
      (ownname => 'DWH',
       tabname => 'T1',
       no_invalidate => FALSE);

    DBMS_STATS.gather_table_stats
      (ownname => 'DWH',
       tabname => 'T2',
       no_invalidate => FALSE);

|  1 | INSERT STATEMENT    |    | 1500 |
|  2 |  INSERT             | T3 | 1500 |
|  3 |   HASH JOIN         |    | 1500 |
|  4 |    TABLE ACCESS FULL| T1 | 2000 |
|  5 |    TABLE ACCESS FULL| T2 | 3000 |

Tip 10: Gather Statistics after Loading each Table


Since Oracle 12c, Online Statistics Gathering is used for the following cases:

• CREATE TABLE AS SELECT

• Direct-Load INSERT into empty table (after TRUNCATE)

• Direct-Load INSERT into non-empty table (ADW only)

-------------------------------------------------------
| Id  | Operation                           | Name     |
-------------------------------------------------------
|   0 | INSERT STATEMENT                    |          |
|   1 |  LOAD AS SELECT                     | TARGET   |
|   2 |   PX COORDINATOR                    |          |
|   3 |    PX SEND QC (RANDOM)              | :TQ10000 |
|   4 |     OPTIMIZER STATISTICS GATHERING  |          |
|   5 |      PX BLOCK ITERATOR              |          |
|   6 |       TABLE ACCESS STORAGE FULL     | SOURCE   |
-------------------------------------------------------
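
Besides the OPTIMIZER STATISTICS GATHERING step in the plan, the data dictionary can show whether statistics were gathered online. The query below is a sketch under the assumption that the NOTES column of USER_TAB_COL_STATISTICS reports STATS_ON_LOAD for such columns, as described in the Oracle documentation on online statistics gathering; the table name TARGET is taken from the plan above.

```sql
-- Column statistics of the loaded table; for online statistics gathering,
-- NOTES is expected to contain STATS_ON_LOAD (assumption, see lead-in)
SELECT column_name, num_distinct, last_analyzed, notes
  FROM user_tab_col_statistics
 WHERE table_name = 'TARGET';
```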

Demo

10 Tips to Improve ETL Performance

1. Use Set-based Operations
2. Avoid Nested Loops
3. Drop Unnecessary Indexes
4. Avoid Functions in WHERE Condition
5. Take Care of OR in WHERE Condition
6. Reduce Data as Early as Possible
7. Use WITH to Split Complex Queries
8. Run Statements in Parallel
9. Perform Direct-Path INSERT
10. Gather Statistics after Loading each Table

10 Tips to Improve ETL Performance – Revised for ADW

https://danischnider.wordpress.com/2018/07/20/10-tips-to-improve-etl-performance-revised-for-adwc/

10 Tips to Improve ETL Performance (ADW)

1. Use Set-based Operations
2. Avoid Nested Loops
3. Drop Unnecessary Indexes
4. Avoid Functions in WHERE Condition
5. Take Care of OR in WHERE Condition
6. Reduce Data as Early as Possible
7. Use WITH to Split Complex Queries
8. Run Statements in Parallel
9. Perform Direct-Path INSERT
10. Gather Statistics after Loading each Table