© 2016 Pivotal Software, Inc. All rights reserved.
Large Scale Fraud Analytics
GemFire Greenplum Connector (G2C)
Background
• Government fraud revenue retention program
• Detecting & retaining ~$5B annually
  – Primary focus on identity theft
  – Processes up to 8 million cases per day
  – Current & historic data size ~60 TB (compressed)
• Modifying architecture to integrate GemFire for scalable Java-based business logic, web service integration, and event-driven design
Fraud Systems Simplified
Prepare
• Ingest • Restructure (ETL)
Score
• Model Evaluation
Disposition
• Business Logic • Prioritization
Respond
• Investigation • Stop Payments
Business Logic Engine
ETL
Reporting
In-db Analytics
Application Services
Case Study Architecture – Scaling Up
GemFire
Greenplum
Spring Boot App Services
Informatica w/ PWX (ETL)
Business Objects (Reporting)
Legacy Logic Implementation
Logic Engine
In-db Analytics
Greenplum
Prepare
• Ingest • Restructure (ETL)
Score
• Model Evaluation
Disposition
• Business Logic • Prioritization
Respond
• Investigation • Stop Payments
Pivotal Greenplum (GPDB)
• Postgres Community OSS
  – Original fork of 8.2.15
  – Massively parallel processing database
• Master coordinates queries across segment databases
• Supports in-database model evaluation
  – MADlib, PL/R, SAS
GPDB Logical
GPDB Physical
GPDB Software
Master
Segments
Initial Implementation
• Fraud model results evaluated by business logic engine
• Flat file data extraction
  – Significant custom code to construct required object model
  – Table → CSV → POJO
• Shared element in an otherwise distributed system
  – Performance considerations
GPDB
Legacy Logic Implementation
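To make the Table → CSV → POJO step concrete, here is a minimal sketch of the kind of hand-written mapping code the flat-file approach tends to require. The `ScoredCase` class, its fields, and the three-column layout are assumptions for illustration only; they are not taken from the case study, and real extracts would also need quoting and null handling.

```java
import java.math.BigDecimal;

/** Hypothetical domain object; field names are illustrative only. */
class ScoredCase {
    final long caseId;
    final String name;
    final BigDecimal score;

    ScoredCase(long caseId, String name, BigDecimal score) {
        this.caseId = caseId;
        this.name = name;
        this.score = score;
    }
}

class CsvToPojoSketch {
    /** Parses one unquoted, comma-separated line laid out as: caseId,name,score. */
    static ScoredCase parse(String line) {
        String[] cols = line.split(",", -1); // -1 keeps trailing empty columns
        if (cols.length != 3) {
            throw new IllegalArgumentException("expected 3 columns, got " + cols.length);
        }
        return new ScoredCase(
                Long.parseLong(cols[0].trim()),
                cols[1].trim(),
                new BigDecimal(cols[2].trim()));
    }

    public static void main(String[] args) {
        ScoredCase c = parse("42,Jane Doe,0.97");
        System.out.println(c.caseId + " " + c.name + " " + c.score);
    }
}
```

Every additional column means another line of positional parsing, which is one way wide tables turn into significant custom code.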
Architecture Adjustments
• New requirements introduced external integrations
  – Drives desire for web services
• Desire to improve performance & simplify codebase
• Expanding business logic
  – Logic engine run as a GemFire function
GemFire
GPDB
Legacy Logic Implementation
Spring Boot (App Services)
GemFire Greenplum Connector
Context
[Diagram: Greenplum (analytical; ANSI SQL; data science, analytics & ML) alongside GemFire (transactional; custom apps "App 1" and "App 2" via native API or REST / HTTP). Labels on the link between them: "Parallel Configurable Data Load" and "Transactional data write behind".]
GemFire Greenplum Connector (G2C)
• Extension package for GemFire
• Provides simple import and export of data between GemFire regions & Greenplum tables
  – Parallel data motion leveraging Greenplum's external table interface
• Simple mapping between table rows and PdxInstance
  – Flat object-relational mapping
  – Set of predefined type conversions
  – Configurable GemFire data collocation
G2C Data Interfaces
[Diagram labels: Greenplum (Master, Segments); GemFire (Data Node, Data Node); JDBC / ODBC; Control Logic]
GpdbService is the primary entry point for explicitly invoked data motion
1. Import - loads the full table contents from Greenplum
2. Export - sends region contents to Greenplum
Sample Data Import / Export
    Cache cache = CacheFactory.getAnyInstance();
    GpdbService gpdb = GpdbService.getInstance(cache);
    long count;
    count = gpdb.importRegion(region); // import: Greenplum table -> region
    count = gpdb.exportRegion(region); // export: region -> Greenplum table
Basic Cache Configuration
Configured via GemFire extension framework
• 1) Each region maps to a JNDI data source backed by Greenplum
• 2) Link an entity type and table
• 3) Declare a field to be used as the key
  • Compound keys supported
• 4) Define a mapping between the table columns and object fields
  • Default auto-configuration
  • Optional name and column attributes for naming convention changes
  • Class used to control type conversion
  • Set of built-in types

    <region name="Parent">
      <region-attributes refid="PARTITION">
        <partition-attributes/>
      </region-attributes>
      <gpdb:store datasource="datasource">
        <gpdb:types>
          <gpdb:pdx name="io.pivotal...entity.Parent" table="parent">
            <gpdb:id field="id" />
            <gpdb:fields>
              <gpdb:field name="name" />
              <gpdb:field name="id" column="id" />
              <gpdb:field name="income" class="java.math.BigDecimal" />
            </gpdb:fields>
          </gpdb:pdx>
        </gpdb:types>
      </gpdb:store>
    </region>
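As a rough illustration of what a flat field mapping like the one above has to do, the following sketch copies selected columns of one row into a field map, defaulting the column name to the field name and converting a value when a target class is given. It is plain Java and does not use the connector or PdxInstance; `FieldMapping`, `toFields`, and the single `BigDecimal` conversion are hypothetical names and behaviour chosen for the example, not the connector's implementation.

```java
import java.math.BigDecimal;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

class FieldMappingSketch {
    /** One mapping entry: object field name, source column, optional target type. */
    static final class FieldMapping {
        final String name;
        final String column;
        final Class<?> type; // null means "keep the value as delivered"

        FieldMapping(String name, String column, Class<?> type) {
            this.name = name;
            this.column = (column != null) ? column : name; // default: column == field name
            this.type = type;
        }
    }

    /** Copies the mapped columns of one row into a flat field map. */
    static Map<String, Object> toFields(Map<String, Object> row, FieldMapping... mappings) {
        Map<String, Object> fields = new LinkedHashMap<>();
        for (FieldMapping m : mappings) {
            fields.put(m.name, convert(row.get(m.column), m.type));
        }
        return fields;
    }

    /** Only one conversion is sketched here; a real mapper would offer a set of them. */
    private static Object convert(Object value, Class<?> type) {
        if (value == null || type == null || type.isInstance(value)) {
            return value;
        }
        if (type == BigDecimal.class) {
            return new BigDecimal(value.toString());
        }
        throw new IllegalArgumentException("no conversion to " + type.getName());
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("id", 1L);
        row.put("name", "Ann");
        row.put("income", "52000.50");
        System.out.println(toFields(row,
                new FieldMapping("name", null, null),
                new FieldMapping("id", "id", null),
                new FieldMapping("income", null, BigDecimal.class)));
    }
}
```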
Configuring Collocation
Parent-child foreign key relationships supported through collocation
1. Compound key configurations result in a HashMap-based key in GemFire
2. Provided partition resolver works with compound keys

    <region name="Child">
      <...>
        <partition-resolver>
          <class-name>io.pivotal.gemfire.gpdb.IdPartitionResolver</class-name>
          <parameter name="field">
            <string>parentId</string>
          </parameter>
      </...>
      <gpdb:id>
        <gpdb:field ref="parentId" />
        <gpdb:field ref="id" />
      </gpdb:id>
      <gpdb:fields>
        <gpdb:field name="parentId"/>
        <gpdb:field name="id" />
    </...>
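The idea behind collocating children with their parent can be sketched without GemFire: build a map-based compound key, then choose the bucket from a single routing field rather than from the whole key. The code below is an illustration under assumptions, not the behaviour of `IdPartitionResolver`; the class name, the `bucketFor` helper, and the bucket count are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class CollocationSketch {
    static final int BUCKETS = 113; // illustrative bucket count

    /** Parent key: a single "id" field. */
    static Map<String, Object> parentKey(long id) {
        Map<String, Object> key = new HashMap<>();
        key.put("id", id);
        return key;
    }

    /** Child key: compound of "parentId" and "id". */
    static Map<String, Object> childKey(long parentId, long id) {
        Map<String, Object> key = new HashMap<>();
        key.put("parentId", parentId);
        key.put("id", id);
        return key;
    }

    /** Picks a bucket from one named field of the key, ignoring the other fields. */
    static int bucketFor(Map<String, Object> key, String routingField) {
        Object routing = key.get(routingField);
        if (routing == null) {
            throw new IllegalArgumentException("key has no field " + routingField);
        }
        return Math.floorMod(routing.hashCode(), BUCKETS);
    }

    public static void main(String[] args) {
        int parent = bucketFor(parentKey(7L), "id");
        int childA = bucketFor(childKey(7L, 1001L), "parentId");
        int childB = bucketFor(childKey(7L, 1002L), "parentId");
        System.out.println(parent + " " + childA + " " + childB);
    }
}
```

Because the parent routes on `id` and each child routes on `parentId`, rows that share the same parent value resolve to the same bucket, whatever the child's own `id` is.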
Configuring Automatic Synchronization
● Data exported to Greenplum via asynchronous eventing
  ○ Time and batch size triggers available
● Causes each GemFire member to independently interact with Greenplum
  ○ Configure GPDB resource queues accordingly

    <region name="Child">
      <...>
      <gpdb:store datasource="datasource">
        <gpdb:synchronize mode="automatic" time-interval="3000" persistent="false" />
        <gpdb:types>
          <...>
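The time and batch size triggers mentioned for automatic synchronization can be pictured with a small batching sketch: events queue up and are flushed as one batch when either the batch is full or the oldest queued event has waited long enough. This is a plain-Java illustration, not the connector's eventing code; `BatchTriggerSketch` and its methods are hypothetical, and treating the interval as milliseconds is an assumption. The clock is passed in explicitly so the behaviour is easy to follow.

```java
import java.util.ArrayList;
import java.util.List;

class BatchTriggerSketch<T> {
    private final int batchSize;
    private final long intervalMillis;
    private final List<T> pending = new ArrayList<>();
    private final List<List<T>> flushed = new ArrayList<>();
    private long oldestPendingMillis;

    BatchTriggerSketch(int batchSize, long intervalMillis) {
        this.batchSize = batchSize;
        this.intervalMillis = intervalMillis;
    }

    /** Queues an event; flushes immediately if a trigger is already due. */
    void offer(T event, long nowMillis) {
        if (pending.isEmpty()) {
            oldestPendingMillis = nowMillis; // start the wait clock for this batch
        }
        pending.add(event);
        flushIfDue(nowMillis);
    }

    /** Called periodically so the time trigger can fire without new events. */
    void tick(long nowMillis) {
        flushIfDue(nowMillis);
    }

    private void flushIfDue(long nowMillis) {
        if (pending.isEmpty()) {
            return;
        }
        boolean sizeTrigger = pending.size() >= batchSize;
        boolean timeTrigger = nowMillis - oldestPendingMillis >= intervalMillis;
        if (sizeTrigger || timeTrigger) {
            flushed.add(new ArrayList<>(pending)); // stands in for one export call
            pending.clear();
        }
    }

    List<List<T>> flushedBatches() {
        return flushed;
    }

    public static void main(String[] args) {
        BatchTriggerSketch<String> s = new BatchTriggerSketch<>(3, 3000L);
        s.offer("a", 0L);
        s.offer("b", 10L);
        s.tick(3000L);      // time trigger flushes [a, b]
        s.offer("c", 3100L);
        s.offer("d", 3100L);
        s.offer("e", 3100L); // size trigger flushes [c, d, e]
        System.out.println(s.flushedBatches());
    }
}
```

Since each member would run its own copy of such a loop, the number of concurrent export sessions grows with the cluster, which is why the slide points at resource queue configuration.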
Case Study G2C Configuration Details
• Existing required domain objects
  – Multiple many-to-one groupings
• Wide tables / objects (500+ fields)
• Data collocation configured on caseId
• Source tables wrapped in views
[Class diagram: CaseWrapper (caseId, …) groups many (*) ModelScores, Documents, PriorHistory, and OtherData… objects, each carrying caseId, and is associated with one (1) LogicResults (caseId, …)]
Simple Loading – Single Table per Object
[Sequence diagram. Participants: :LoadTrigger, :GPDBService, :Region, :AsyncEventListener, :LogicEngine, results:Region. Messages shown: Import(), put(), processEvents(), process(), put()]
Complex Loading – Multiple Tables per Object
[Sequence diagram. Participants: :LoadTrigger, :MergeLoader, :GPDBService, :Region, :LogicEngine, results:Region. Messages shown: executeFunction(), Import() and put() inside a "par" (parallel) fragment, assemble(), process(), put()]
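The assemble() step for objects built from several tables can be sketched as a group-by on the shared case identifier. The example below is plain Java with hypothetical names (`MergeAssembleSketch`, a two-list `CaseWrapper`, string rows); it only illustrates merging already-loaded rows per caseId and says nothing about how the actual MergeLoader is written.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MergeAssembleSketch {
    /** Hypothetical aggregate holding the rows that share one caseId. */
    static final class CaseWrapper {
        final long caseId;
        final List<String> modelScores = new ArrayList<>();
        final List<String> documents = new ArrayList<>();

        CaseWrapper(long caseId) {
            this.caseId = caseId;
        }
    }

    /** Merges rows from two sources, already grouped by caseId, into one wrapper per case. */
    static Map<Long, CaseWrapper> assemble(Map<Long, List<String>> scoresByCase,
                                           Map<Long, List<String>> documentsByCase) {
        Map<Long, CaseWrapper> cases = new TreeMap<>();
        scoresByCase.forEach((caseId, rows) ->
                cases.computeIfAbsent(caseId, id -> new CaseWrapper(id)).modelScores.addAll(rows));
        documentsByCase.forEach((caseId, rows) ->
                cases.computeIfAbsent(caseId, id -> new CaseWrapper(id)).documents.addAll(rows));
        return cases;
    }

    public static void main(String[] args) {
        Map<Long, List<String>> scores = new HashMap<>();
        scores.put(1L, Arrays.asList("score-a", "score-b"));
        Map<Long, List<String>> docs = new HashMap<>();
        docs.put(1L, Arrays.asList("doc-x"));
        docs.put(2L, Arrays.asList("doc-y"));
        assemble(scores, docs).values().forEach(c ->
                System.out.println(c.caseId + " scores=" + c.modelScores + " documents=" + c.documents));
    }
}
```

With all source regions collocated on caseId, a merge like this could run locally on each member against only the cases that member holds.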
Impacts & Results
• Simplified implementation & code reduction
• Maintained or improved data motion rates
  – Case study CPU bound
  – Additional improvements in the backlog
• Improved system throughput
Questions?
Join the Apache Geode Community!
• Check out: http://geode.incubator.apache.org
• Subscribe: [email protected]
• Download: http://geode.incubator.apache.org/releases/