Oracle – Big Data THE INTELLIGENCE LIFE-CYCLE and Schema-Last Approach Dr Neil Brittliff PhD

Oracle openworld-presentation


A little about myself…

• Awarded a PhD at the University of Canberra in March this year for my work in the Big Data space
• Currently employed as a Data Scientist within the Australian Government
• Have been employed by 5 law enforcement agencies
• Developed cryptographic software to support the Australian Medicare system
• First used Oracle products back in 1986
• Worked in the IT industry since 1982
• Resides in Canberra (capital of Australia)
• Canberra is the only capital city in Australia that is not named after a person

Interests

• Tennis (play) / Cricket (watch)
• Bushwalking and camping
• Piano playing (very bad)
• Making stuff out of wood
• Enjoys the art of programming (prefers the ‘C’ language)
• Pushing the limits of the Raspberry Pi


University of Canberra - 2015

Talk Structure

• Motivation
• Principles and Constraints
• Intelligence Life-Cycle
    • Collect & Collate
    • Analyse & Produce
    • Report & Disseminate
• Motivation
• Research
• What is a Schema
• The Problem with ETL
• Data Cleansing versus Data Triage
• A New Architecture
• Oracle Big Data
• The Schema-Last Approach
• Indexing Technologies and Exploitation
• User Reaction
• Observations and Opportunities

National Criminal Intelligence

The law enforcement community is also in the business of collecting and analysing criminal intelligence and data and, where possible, sharing the resulting information…

To do this, they need rich, contemporary, and comprehensive criminal intelligence…

The National Criminal Intelligence Fusion Capability brings together subject matter experts, analysts, technology and big data to identify previously unknown criminal entities, criminal methodologies, and patterns of crime.

The Fusion capability identifies threats and vulnerabilities through the use of data.

It brings together, monitors and analyses data and information from Customs, other law enforcement, Government agencies and industry to build an intelligence picture of serious and organised crime in Australia.

Australian Institute of Criminology

• While many of the challenges posed by the volume of data are addressed in part by new developments in technology, the underlying issue has not been adequately resolved.

• Over many years, there have been a variety of different ideas put forward in relation to addressing the increasing volume of data, such as data mining.

Darren Quick and Kim-Kwang Raymond Choo, Australian Institute of Criminology, September 2014

Objectives

• Support the Australian Criminal Intelligence Model
• Simple interface to exploit the data
• Data ingestion must be simple to do and minimise transformation
• Support the large variety of data sources
• Fast ingestion and retrieval times
• Enable exact and fuzzy searching
• Support ‘Identity Resolution’
• Support metadata
• Maintain the data’s integrity
• Preserve Data-Lineage/Provenance
• Reproduce the ingested data source exactly!

We don’t want this!
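One way to read the objectives on provenance and exact reproduction is to keep the ingested bytes untouched, record where they came from, and fingerprint them so that a later reproduction can be checked. The sketch below assumes that reading only; `IngestedSource`, `ingest`, `reproduce`, and the file name are hypothetical and are not taken from the system described in the talk.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class IngestedSource:
    """Hypothetical record: raw bytes plus minimal provenance metadata."""
    source_name: str  # where the data came from (lineage)
    raw: bytes        # the source exactly as received, never transformed
    sha256: str       # digest taken at ingestion time


def ingest(source_name: str, raw: bytes) -> IngestedSource:
    """Store the source as-is and fingerprint it."""
    return IngestedSource(source_name, raw, hashlib.sha256(raw).hexdigest())


def reproduce(record: IngestedSource) -> bytes:
    """Return the original bytes, refusing if they no longer match the digest."""
    if hashlib.sha256(record.raw).hexdigest() != record.sha256:
        raise ValueError(f"{record.source_name}: stored data does not match its digest")
    return record.raw


original = b"name,dob\nJ Smith,20-07-14\n"
record = ingest("example-feed.csv", original)
assert reproduce(record) == original  # byte-for-byte identical
```

Because nothing is rewritten at ingestion, any later interpretation of the data can always be traced back to, and re-derived from, the source as it arrived.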

The Intelligence Life-Cycle

Stages of the cycle, in order:

• Plan, prioritise & direct
• Collect & collate
• Analyse & produce
• Report & disseminate
• Evaluate & review

Intelligence – Data Source Classification

[Figure: data source classification. Low: 95%; High: 5%. Dimensions rated from Low to High: Velocity, Variety, Volume, Veracity, Value.]

Collect & collate / Analyse & produce

Some Definitions:

A major problem for the data scientist is to flatten the bumps that result from the heterogeneity of data. (Jimmy Lin and Dmitriy Ryaboy, Scaling big data mining infrastructure: the Twitter experience.)

Schema is from the Greek word meaning ‘form’ or ‘figure’ and is a formal representation of a data model, with integrity constraints controlling permissible data values.

Data munging, sometimes referred to as data wrangling, means taking data that’s stored in one format and changing it into another format.

Collect & Collate

Schema Application

[Diagram: two pipelines, with the stage order inferred from the labels.
Schema First: Raw Data, Cleanse, Schema, Storage, Analyse.
Schema Last: Raw Data, Triage, Storage, Schema, Analyse.]

Data Cleansing …

Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve the quality of data.

“Data cleansing is the process of analysing the quality of data in a data source, manually approving/rejecting the suggestions by the system, and thereby making changes to the data. Data cleansing in Data Quality Services (DQS) includes a computer-assisted process that analyses how data conforms to the knowledge in a knowledge base, and an interactive process that enables the data steward to review and modify computer-assisted process results to ensure that the data cleansing is exactly as they want to be done.” Microsoft: 2012

Collect & Collate

Data Sources – Always Increasing

[Figure not recoverable; the only surviving label is ‘Gap’.]

Collect & Collate

Data Cleansing - Doesn’t WORK

“Data cleansing can be time-consuming and tedious, but robust estimators are not a substitute for careful examination of the data for clerical errors and other problems.” David Ruppert. Inconsistency of resampling algorithms for high-breakdown regression estimators and a new algorithm. Journal of the American Statistical Association, 97: 148–149, 2002.

“Formal data cleansing can easily overwhelm any human or perhaps the computing capacity of an organization.” N. Brierley, T. Tippetts, and P. Cawley. Data fusion for automated non-destructive inspection. Proceedings of the RSPA, 2014.

“that the data volume may overwhelm the Extract Transform Load process and that data cleansing may introduce unintentional errors.” Vincent McBurney, 17 mistakes that ETL designers make with very large data, 2007.

Collect & Collate

Data Cleansing – Loss of Format

Input Date      Cleansed Date   Comment
20 July 2014    20-07-2014      Australian Date
July-20-2014    20-07-2014      American Format (mmm-dd-yyyy)
2014-20-07      20-07-2014      Arabic Format (right to left)
20-07-14        20-07-2014      Data Ambiguity
July 2014       01-07-2014      Imputed Value

"If you torture the data long enough, it will confess." Clifton R. Musser
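The loss described in the table can be reproduced in a few lines. The sketch below is an illustration only: `FORMATS` and `cleanse_date` are hypothetical, and the patterns are chosen simply to match the five input rows. Once every value has been normalised to dd-mm-yyyy, nothing in the cleansed output says which inputs were ambiguous or imputed.

```python
from datetime import datetime

# Hypothetical cleansing rules: (input pattern, what the rule silently assumes)
FORMATS = [
    ("%d %B %Y", "day, month name, year"),
    ("%B-%d-%Y", "month name first"),
    ("%Y-%d-%m", "year, day, month"),
    ("%d-%m-%y", "two-digit year, century guessed"),
    ("%B %Y", "no day given, so day 01 is imputed"),
]


def cleanse_date(text: str) -> str:
    """Normalise to dd-mm-yyyy using the first pattern that parses."""
    for pattern, _assumption in FORMATS:
        try:
            return datetime.strptime(text, pattern).strftime("%d-%m-%Y")
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {text!r}")


inputs = ["20 July 2014", "July-20-2014", "2014-20-07", "20-07-14", "July 2014"]
cleansed = {text: cleanse_date(text) for text in inputs}

# Five distinct inputs collapse into two cleansed strings; the original
# format, the guessed century and the imputed day are no longer visible.
assert len(set(cleansed.values())) == 2
```

Keeping the input string alongside any derived value, rather than replacing it, avoids this loss.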

Collect & Collate

ETL vs Triage

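[Flowcharts; box labels listed in the order they were extracted. ‘n’ marks the ‘no’ branch of a decision.
ETL: Initiate, Extract, Determine, Suitability?, Transform, Assessment?, Load, Report, Complete.
Triage: Initiate, Triage, Load, Suitability?, Application, Verify?, Fuse, Resolve, Complete.]

Collect & Collate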

We did our research …

Oracle’s BDA (Big Data Appliance)

Collect & Collate

Data Storage/Collation

• Store the data semantically
• Built on a defined taxonomy/ontology
• Perfect to capture metadata
• Searched for the perfect Triple-Store

Triple: Subject, Predicate, Object

[Diagram labels: Graph, List]
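As a rough illustration of the subject, predicate, object form, the sketch below turns rows of tabular data into triples and queries them with a wildcard pattern. It uses plain Python tuples rather than any real triple store, and `Triple`, `row_to_triples`, `match`, and the `source#rowN` identifier scheme are all hypothetical; it is not the implementation described in the talk.

```python
from typing import Iterable, NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def row_to_triples(source: str, row_number: int, row: dict[str, str]) -> list[Triple]:
    """Emit one triple per non-empty cell; the subject identifies the row."""
    subject = f"{source}#row{row_number}"  # hypothetical identifier scheme
    return [Triple(subject, column, value) for column, value in row.items() if value]


def match(triples: Iterable[Triple], subject=None, predicate=None, obj=None) -> list[Triple]:
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


store = row_to_triples("people.csv", 1, {"First Name": "Jane", "Last Name": "Citizen", "Suburb": ""})
store += row_to_triples("people.csv", 2, {"First Name": "John", "Last Name": "Citizen"})

# Every row whose "Last Name" cell is "Citizen", whatever other columns it has.
citizens = match(store, predicate="Last Name", obj="Citizen")
```

Rows with different columns sit side by side in the same store, and because the subject carries the source name, each value keeps a pointer back to where it came from.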

Collect & Collate

The Architecture

[Architecture diagram. Phases across the top: Collect & Collate, Analyse & Produce, Disseminate. Other labels that survive extraction: Feeds; New Data; Historical Data; RDF / Modelling; Set Store; HBase; Semantic Store; Index (IIR); Index (SOLR); BDA; Data Exploration; Data Exploitation; Palantir; Search Assistant; Data Flow; SPARQL; R Language; Apache Pig.]

Schema Last …

‘Triaged’ Data                                           Schema
First Name, Middle Name, Last Name                       Full-Name
Street Number, Street Name, Suburb, State, Postcode      Full-Address

[Diagram label: Models]

Collect & Collate
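The mapping on this slide can be sketched as a schema that is applied only when the data is read. The example below is a minimal illustration under that assumption: the triaged record stays exactly as it arrived, and `SCHEMA` and `apply_schema` (both hypothetical names) derive Full-Name and Full-Address on demand. The sample values are invented.

```python
# Triaged record: fields stored as they arrived, nothing merged or dropped.
triaged = {
    "First Name": "Jane",
    "Middle Name": "",
    "Last Name": "Citizen",
    "Street Number": "1",
    "Street Name": "Example St",
    "Suburb": "Canberra",
    "State": "ACT",
    "Postcode": "2600",
}

# Hypothetical schema, applied only at read time.
SCHEMA = {
    "Full-Name": ["First Name", "Middle Name", "Last Name"],
    "Full-Address": ["Street Number", "Street Name", "Suburb", "State", "Postcode"],
}


def apply_schema(record: dict[str, str], schema: dict[str, list[str]]) -> dict[str, str]:
    """Build each schema attribute by joining its non-empty source fields."""
    return {
        attribute: " ".join(record[field] for field in fields if record.get(field))
        for attribute, fields in schema.items()
    }


view = apply_schema(triaged, SCHEMA)
```

Changing the schema later means changing `SCHEMA` and reading again; the stored data does not need to be reloaded or re-cleansed.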

ACC Search Engines – ‘Smackdown’

Feature                                 SOLR                          IIR
License                                 Apache License                Commercial
Storage                                 Inverted List                 Third-party Database
Support Google-like search                                            Next Release
Score Model                             Inverse Document Frequency    Normalized Score
Result Pagination
Homophone Support                       Can use synonym support
Phoneme Search
Spread indexes across multiple nodes
Schema-less Support
Programming Interface                   REST                          SOAP API
Geo-spatial

[Empty cells held tick or cross marks on the original slide that did not survive extraction.]
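The table names inverse document frequency as SOLR's score model. As a reminder of the idea, the sketch below computes the textbook form, idf(t) = ln(N / df(t)), over a toy collection. It is an illustration only: the scoring SOLR actually applies differs in detail, and the function name and sample documents are made up.

```python
import math
from collections import Counter


def inverse_document_frequency(documents: list[list[str]]) -> dict[str, float]:
    """idf(t) = ln(N / df(t)), where df(t) counts the documents containing t."""
    n = len(documents)
    df = Counter(term for doc in documents for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}


docs = [
    ["john", "citizen", "canberra"],
    ["jane", "citizen", "sydney"],
    ["john", "smith", "canberra"],
    ["mary", "jones", "perth"],
]
idf = inverse_document_frequency(docs)

# A term found in one document ("perth") outweighs one found in two ("john").
assert idf["perth"] > idf["john"]
```

Rare terms score higher, so a match on an unusual surname counts for more than a match on a common one.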

Collect & Collate

Collect & Collation Tool

Collect & Collate

Bongo – Exploration

Analyse & Produce

Palantir – Semantic Interface

Report & Disseminate

User Reaction

[Chart: Time to Triage. Categories: < 1 hour; 1 to 24 hours; > 24 hours.]

[Chart: General Size % (megabytes). Categories: < 1; 1 to 100; 100 to 1000; > 1000.]

• Developed a Palantir Plugin to search the Fusion Data Holding

• Bulk Matching was a great success

• In general, user reaction has been positive

• Time to Triage was usually under an hour, whereas cleansing could take weeks!

Australian Crime Commission 2015

Ingestion Rate – The Improvement

Collect & Collate

Observations…

• The Bulk Matcher
• Performance and Reliability
• Interaction with Palantir
• Configuration over Customisation
• Search for the ‘Single Source of Truth’
• Golden Record
• Acceptance of the Schema Last Approach
• Overwhelmed by Search Results

Further Reading and Contacts

Strategic Thinking in Criminal Intelligence. Jerry H Ratcliffe. The Federation Press, 2009. ISBN 978 186287 734-4

Intelligence-Led Policing. Jerry Ratcliffe. Routledge, 2008. ISBN 978-1-843292-339-8

Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Peter Christen. Springer, 2012. ISBN 978-3-642-31163-5

Foundations of Semantic Web Technologies. Pascal Hitzler, Markus Krötzsch, Sebastian Rudolph. CRC Press, 2010. ISBN 978-1-4200-9050-5

Big Data: A Revolution That Will Transform How We Live, Work, and Think. Viktor Mayer-Schönberger and Kenneth Cukier. HMH, 2013. ISBN 978-0-544-00269-2

The Schema Last Approach to Data Fusion. Neil Brittliff and Dharmendra Sharma. AusDM 2014

A Triple Store Implementation to support Tabular Data. Neil Brittliff and Dharmendra Sharma. AusDM 2014

Australian Institute of Criminology: http://www.aic.gov.au

University of Canberra: http://www.Canberra.edu.au