The EDW Ecosystem: Leveraging Hadoop within an existing Data Warehouse environment
LUKE KAY, STEVE O’NEILL


Page 1: The EDW Ecosystem

The EDW Ecosystem
Leveraging Hadoop within an existing Data Warehouse environment

LUKE KAY, STEVE O’NEILL

Page 2: The EDW Ecosystem

Department of Immigration and Border Protection

Question: What is this?

Answer: A data lake…

Image source: LinkedIn, 2016

Page 3: The EDW Ecosystem


Agenda

1. Overview of DIBP business
2. Why Hadoop in DIBP?
3. Overview of EDW environment
4. Technical implementation of Hadoop
5. Next steps
6. Questions

Page 4: The EDW Ecosystem


The Department

Each week on average…

• Previously two separate organisations:
  • ACBPS (Australian Customs and Border Protection Service)
  • DIAC (Department of Immigration and Citizenship)
• DIBP (Department of Immigration and Border Protection) and ABF (Australian Border Force) formed on 1 July 2015

Sources: DIBP Annual Report 2014-2015; ACBPS Annual Report 2014-2015

Page 5: The EDW Ecosystem


Hadoop drivers in DIBP

Data Archival / ETL Offload
- Significant legacy ‘application’ debt
- Historical data sources kept for ‘one-off’ querying or audit purposes
- The right platform for the right workload

Business Requirements
- Load Big Data, but make it easily accessible
- Store it for long periods of time

Analytics
- Advanced analytics (Spark)
- Text mining
- Unstructured data

New Data Types & Functionality
- ‘Free text’ searching (Solr)
- Key-value store (HBase)

Page 6: The EDW Ecosystem


EDW / Hadoop Strategy

The right data in the right place for the right outcome:
• Teradata EDW for the structured
• Hortonworks Hadoop for the unstructured
• Teradata Aster for discovery

All connected over a high-speed InfiniBand network for quick data transfer and seamless connectivity.

Our Strategy:
• Walk before you run. Start small and expect rework
• Ability to query and access the two environments efficiently and effectively (more on this)
• Free text / unstructured searches (Solr) compared to full table scans in EDW
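The contrast between a free-text search engine and a full table scan can be illustrated with a minimal sketch. It is not Solr or Teradata code: the sample rows and the function names `full_scan`, `build_inverted_index` and `index_lookup` are hypothetical, and the tokenisation (lower-casing and splitting on whitespace) is an assumption made for brevity. The idea it shows is that a scan reads every row on every query, while an inverted index pays the cost once at load time and then answers a term query with a single lookup.

```python
from collections import defaultdict

def full_scan(rows, term):
    """Read every row and test it for the term, as a table scan would."""
    term = term.lower()
    return [row_id for row_id, text in rows.items()
            if term in text.lower().split()]

def build_inverted_index(rows):
    """Map each token to the set of row ids that contain it (built once)."""
    index = defaultdict(set)
    for row_id, text in rows.items():
        for token in text.lower().split():
            index[token].add(row_id)
    return index

def index_lookup(index, term):
    """Answer a term query with one dictionary lookup, no row reads."""
    return sorted(index.get(term.lower(), set()))

# Hypothetical sample data, for illustration only.
rows = {
    1: "passport renewal request",
    2: "cargo inspection report",
    3: "passport lost at airport",
}
index = build_inverted_index(rows)
```

Both `full_scan(rows, "passport")` and `index_lookup(index, "passport")` return the same row ids; the difference is how much data each has to touch per query.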

Page 7: The EDW Ecosystem


The Enterprise Data Warehouse

• Embedded operationally in the Department
• Get everything / integrate data
• As close to the event as possible
• Used everywhere
• Intelligence, operational, analytics and reporting capabilities

Page 8: The EDW Ecosystem


EDW ENVIRONMENT

Page 9: The EDW Ecosystem


Our Requirements

Unstructured Repository
• Tens of millions of BLOBs
• Years of backlog / historical data
• New projects
• Multiple source database types
• Fast and convenient retrieval

Logging Repository
• Billions of rows of infrequently accessed logging data

Timeframe
• Now

Page 10: The EDW Ecosystem


Our Solution

[Architecture diagram. Components: Source, Sqoop, Avro, HDFS, Pig, HiveQL, Hive, HBase, Archive. Zones: Landing, Staging.]
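The Landing and Staging zones named on this slide can be sketched in a few lines. This is an illustration of the general concept only, under stated assumptions: the slide does not describe what each zone does in the Department's implementation, the tools on the slide (Sqoop, Avro, Pig, HiveQL) are not used here, and the function names, the `batch_id` field and the two-column comma-separated record layout are all hypothetical. The assumed split is that landing keeps source records untouched with load metadata, and staging parses and cleans them.

```python
def land(raw_lines, batch_id):
    """Landing (assumed): keep each source line untouched, add load metadata."""
    return [{"batch_id": batch_id, "raw": line} for line in raw_lines]

def stage(landed_rows):
    """Staging (assumed): parse and clean landed rows, skipping unparseable ones."""
    staged = []
    for row in landed_rows:
        parts = [p.strip() for p in row["raw"].split(",")]
        # Hypothetical layout: "<numeric id>,<free-text note>"
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        staged.append({
            "batch_id": row["batch_id"],
            "id": int(parts[0]),
            "note": parts[1],
        })
    return staged

# Hypothetical input, including one malformed line.
landed = land(["1, first note ", "bad line", "2,second note"], batch_id="b1")
staged = stage(landed)
```

Keeping the landed copy unchanged means a staging rule can be corrected and re-run without going back to the source system, which fits the deck's advice to expect rework.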

Page 11: The EDW Ecosystem


The Challenges

Learning Curve
• Once the basics are covered, the curve gets steep.

Toolsets
• Vast number of tools
• Behaviour of tools

Robustness
• e.g. HBase regions crashing under load

The Business
• Fear of the new

Page 12: The EDW Ecosystem


Advice and Lessons

Keep it Simple
• Start with your existing skillsets
• Start with a defined need

Expect Rework
• Partly from bugs
• Mostly from experience

Help is Out There
• Not always easy to find, but generally good quality
• Be careful of old advice

Page 13: The EDW Ecosystem


Advice and Lessons

Run Your Own Race
• Don’t be afraid to not like tool X or approach Y
• Don’t feel the need to always jump to the latest and greatest

Use Your Existing Standards
• e.g. our Landing and Staging concept

Enjoy!

Page 14: The EDW Ecosystem


What’s next

- Departmental EDW consolidation (leverage lessons learned)
- Hadoop 2.4 just completed (now we will take new functionality, bug fixes, security)
- Advanced Analytics work on the Hadoop platform
- Hadoop Roadmap (next slide)

Page 15: The EDW Ecosystem

[Hadoop Roadmap slide: image only, content not recovered]

Page 16: The EDW Ecosystem


Questions