Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case

Traversing our way through Apache Spark GraphFrames and GraphX

Mo Patel
Data Day Texas 2017

A bit about me

• Currently Deep Learning Practice Director at Teradata
  – Road Object Detection & Scene Labeling
  – Visual Product Search
  – Chatbots
• Previously
  – Analytics @ Social Sharing Startup
  – Analytics @ Intelligence Community
  – Distributed Systems @ Satellite Operations Company
  – Software Engineering @ Defense Communications Program
• Research Interests: Distributed Systems for Analytics
• Love snowboarding and outdoor sports in general, and working out to keep doing those things

mopatel

What is this talk about?

• What are Graphs and what are some interesting things about Graphs?
• What are some Graph Analytics Examples?
• What are GraphFrames?
• What is GraphX?
• How can Graph Analytics help financial companies fight Synthetic Identity Fraud?

What is a Graph?

[Figure: a natural graph and an artificial graph. Images: Wikipedia]

Power of Graphs

Graphic Source: http://a16z.com/2016/03/07/all-about-network-effects/ slide 14

Power of Graphs

• Good: Facebook, Twitter, WhatsApp… most popular social networks
• Bad: MySpace, Friendster, Orkut… “Nobody goes there anymore. It's too crowded” – Yogi Berra
• Data Growth: Recall Metcalfe’s Law (n^2) and Reed’s Law (2^n)
• Memory Intensive
• Processing Intensive
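To make the data growth point concrete, here is a small sketch in plain Scala (no Spark needed). The object and method names are made up for illustration; the formulas are simply n^2 and 2^n as quoted on the slide.

```scala
object NetworkGrowth {
  // Metcalfe's Law: value grows on the order of n^2 (possible pairwise links).
  def metcalfe(n: Int): BigInt = BigInt(n).pow(2)

  // Reed's Law: value grows on the order of 2^n (possible subgroups).
  def reed(n: Int): BigInt = BigInt(2).pow(n)

  def main(args: Array[String]): Unit =
    for (n <- Seq(10, 20, 40))
      println(s"n=$n  metcalfe=${metcalfe(n)}  reed=${reed(n)}")
}
```

Even at n = 40 the 2^n term is already in the trillions, which is why graph workloads become memory and processing intensive so quickly.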

Graph Databases cost money, Graph Analytics make money!

• PageRank, EigenCentrality
• Modularity, Clustering Coefficient, Betweenness, Closeness
• Loopy Belief Propagation, SALSA

Node Score in a Graph

• Use case: Find out how important an entity is in a graph
  – Entity Fraud Detection
  – Influencers
  – Crime Bosses
• Methods: PageRank, EigenCentrality

PageRank: http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm (Implemented: Spark, Aster, iGraph)
EigenCentrality: http://www.stat.washington.edu/~pdhoff/courses/567/Notes/l6_centrality.pdf (Implemented: Spark, iGraph)
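As a rough illustration of how a node score like PageRank is computed, here is a minimal sketch in plain Scala. It is not the Spark implementation: the update rule (reset probability plus the damped sum of rank / out-degree from incoming neighbours), the fixed iteration count, and all names are assumptions chosen for illustration.

```scala
object PageRankSketch {
  // edges are (src, dst) pairs; every endpoint is assumed to appear in `nodes`.
  def pageRank(nodes: Seq[String],
               edges: Seq[(String, String)],
               resetProb: Double = 0.15,
               iterations: Int = 20): Map[String, Double] = {
    // Number of outgoing edges per source node.
    val outDegree = edges.groupBy(_._1).map { case (src, es) => src -> es.size }
    // Every node starts with a rank of 1.0.
    var ranks: Map[String, Double] = nodes.map(_ -> 1.0).toMap
    for (_ <- 1 to iterations) {
      // Each node sends rank / outDegree along every outgoing edge.
      val received = edges
        .map { case (src, dst) => dst -> ranks(src) / outDegree(src) }
        .groupBy(_._1)
        .map { case (dst, contributions) => dst -> contributions.map(_._2).sum }
      // New rank = reset probability + damped sum of what was received.
      ranks = nodes
        .map(n => n -> (resetProb + (1 - resetProb) * received.getOrElse(n, 0.0)))
        .toMap
    }
    ranks
  }
}
```

With this rule, a node that receives nothing from any neighbour stays at the reset probability (0.15 by default), so a low score is a hint that the node is poorly connected.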

Communities in a Graph

• Use case: Detect similar nodes
  – Behavioral Segmentation
  – Crime Rings
  – Product Strength & Weakness
• Methods: Modularity, Clustering Coefficient, Betweenness, Closeness

Modularity: https://github.com/gephi/gephi/wiki/Modularity (Implemented: Aster, Gephi)
Clustering Coefficient, Betweenness, Closeness: http://www.stat.washington.edu/~pdhoff/courses/567/Notes/l6_centrality.pdf (Implemented: Spark, iGraph)
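Of the methods listed, the local clustering coefficient is the simplest to show in a few lines: for a node, it is the fraction of pairs of its neighbours that are themselves connected. The sketch below is plain Scala on an undirected edge list; the names are illustrative and it is not taken from any of the libraries mentioned above.

```scala
object ClusteringSketch {
  // Build an undirected adjacency map from an edge list.
  def adjacency(edges: Seq[(String, String)]): Map[String, Set[String]] =
    edges
      .flatMap { case (a, b) => Seq(a -> b, b -> a) }
      .groupBy(_._1)
      .map { case (node, pairs) => node -> pairs.map(_._2).toSet }

  // Fraction of neighbour pairs of `node` that are directly connected.
  def localClustering(adj: Map[String, Set[String]], node: String): Double = {
    val nbrs = adj.getOrElse(node, Set.empty[String]).toSeq
    val k = nbrs.size
    if (k < 2) 0.0 // fewer than two neighbours: no pairs to check
    else {
      val links = nbrs.combinations(2).count { case Seq(u, v) => adj(u).contains(v) }
      links.toDouble / (k * (k - 1) / 2.0)
    }
  }
}
```

A node inside a tight group (for example a ring of accounts that all link to each other) scores close to 1, while a node that merely bridges otherwise unrelated neighbours scores close to 0.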

Growth in Graph

• Use case: Predict where the graph will grow, or suggest new edges
  – Event Prediction
  – Product Recommendation
• Methods: Loopy Belief Propagation, Belief Networks, SALSA

Loopy Belief Propagation: https://people.csail.mit.edu/fisher/publications/papers/ihler05b.pdf (Implemented: Aster, Markovian)
SALSA: http://www9.org/w9cdrom/175/175.html (Implemented: Aster, Github PageRanking)

GraphX

• Apache Spark library for conducting Graph Analytics
• Graph Operations: num[Edges, Vertices], degrees, collectNeighbors
• Graph Analytics:
  – PageRank
  – Connected Components
  – Triangle Count

http://spark.apache.org/graphx/
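The operations listed above can be sketched on a tiny graph as follows. This assumes Spark 2.x or later with the spark-graphx module on the classpath and a local master. The object name, the `productGraph` helper, and the sample product data are illustrative and not from the talk.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, EdgeDirection, Graph}
import org.apache.spark.sql.SparkSession

object GraphXSketch {
  // Build a small property graph; GraphX vertex ids must be Long values.
  def productGraph(sc: SparkContext): Graph[String, Int] = {
    val vertices = sc.parallelize(Seq(
      (1L, "Wine"), (2L, "Beer"), (3L, "Pretzel"), (4L, "Cheese")))
    val edges = sc.parallelize(Seq(
      Edge(1L, 4L, 15455), Edge(2L, 3L, 4849), Edge(1L, 3L, 40), Edge(2L, 4L, 134)))
    Graph(vertices, edges)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("graphx-sketch").master("local[*]").getOrCreate()
    val graph = productGraph(spark.sparkContext)

    // Graph operations
    println(s"vertices=${graph.numVertices} edges=${graph.numEdges}")
    graph.degrees.collect().foreach(println)
    graph.collectNeighborIds(EdgeDirection.Either).collect()
      .foreach { case (id, nbrs) => println(s"$id -> ${nbrs.mkString(",")}") }

    // Graph analytics
    graph.pageRank(0.01).vertices.collect().foreach(println)       // run until tolerance 0.01
    graph.connectedComponents().vertices.collect().foreach(println) // (vertex id, component id)
    graph.triangleCount().vertices.collect().foreach(println)       // (vertex id, triangle count)

    spark.stop()
  }
}
```

Each analytics call returns a new graph whose vertex attribute holds the result, so the output is read from `.vertices`.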

Property Graph

GraphFrame

• SQL-like context is very popular
• Lots of ways to work with Graphs: Cypher, SPARQL, Gremlin…
• Spark introduced DataFrame in February 2015
• Goal: Make it easy for DataFrame users to work with Graphs
• GraphFrame: GraphX & DataFrame Operations

https://graphframes.github.io/index.html

GraphFrame

Vertices DataFrame

val vertices = sqlContext.createDataFrame(List(
  ("a1", "Wine", "Beverage"),
  ("b2", "Beer", "Beverage"),
  ("c3", "Pretzel", "Snack"),
  ("d4", "Cheese", "Snack")
)).toDF("id", "name", "type")

Edges DataFrame

// GraphFrames requires the edge columns to be named "src" and "dst"
val edges = sqlContext.createDataFrame(List(
  ("a1", "d4", 15455),
  ("b2", "c3", 4849),
  ("a1", "c3", 40),
  ("b2", "d4", 134)
)).toDF("src", "dst", "count")

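import org.graphframes.GraphFrame

val productsGraphFrame = GraphFrame(vertices, edges)

// String literals inside a SQL filter expression need single quotes
productsGraphFrame.vertices.filter("type = 'Snack'")

// Number of edges, counted on the edges DataFrame
productsGraphFrame.edges.count()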

What is Synthetic Identity Fraud?

http://security.frontline.online/article/2014/2/2379-Synthetic-Identity-Fraud

Why has Synthetic Identity Fraud emerged as a big problem?

Verafin

How are Synthetic IDs created?

Verafin

How are Financial Companies exploited?

Verafin

What is the impact of Synthetic Identity Fraud?

Verafin

How can Graph Analytics help solve the Synthetic Identity Problem?

Customer Address DataFrame

val customerAddresses = sqlContext.createDataFrame(List(
  ("a1", "123 Main Street", "123abc456efg"),
  ("b2", "345 High Street", "123abc456efg"),
  ("c3", "789 Park Ave", "123abc456efg")
)).toDF("id", "address", "customerid")

Add Fake Address

val fakeAddress = sqlContext.createDataFrame(List(
  ("d4", "999 Ocean Ave", "123abc456efg")
)).toDF("id", "address", "customerid")

val tempCustomerAddresses = customerAddresses.union(fakeAddress)

DataBricks Cloud Notebook: http://tiny.cc/ddtx17graphx

How can Graph Analytics help solve the Synthetic Identity Problem?

Master Address Connection Edges DataFrame

val masterAddressConnections = sqlContext.createDataFrame(List(
  ("b2", "a1"), ("e5", "c3"), ("c3", "b2"), ("a1", "c3"), ("e5", "d4") …
)).toDF("src", "dst")

// Keep the edges whose destination is one of the customer's known address ids
val toEdgeMatches = masterAddressConnections.join(customerAddresses, masterAddressConnections("dst") === customerAddresses("id")).select("src", "dst")

// Keep the edges whose source is one of the customer's known address ids
val fromEdgeMatches = masterAddressConnections.join(customerAddresses, masterAddressConnections("src") === customerAddresses("id")).select("src", "dst")

val checkEdges = fromEdgeMatches.union(toEdgeMatches)

Detection GraphFrame PageRank

val detectionGraphFrame = GraphFrame(tempCustomerAddresses, checkEdges)

// PageRank
val resultRanks = detectionGraphFrame.pageRank.resetProbability(0.15).tol(0.01).run()

// Personalized PageRank
val d4Ranks = detectionGraphFrame.pageRank.resetProbability(0.15).maxIter(10).sourceId("d4").run()

resultRanks.vertices.select("id", "pagerank").show()

How do we decide if this address is fraud or not?

PageRank

id  pagerank
a1  0.9463535901944437
b2  0.9463535901944437
c3  0.9463535901944437
d4  0.15

Personalized PageRank

Source: a1
id  pagerank
a1  0.33343371928623045
c3  0.28341866139329586
b2  0.21580437563085933
d4  0.0

Source: b2
id  pagerank
b2  0.33343371928623045
a1  0.28341866139329586
c3  0.21580437563085933
d4  0.0

Source: c3
id  pagerank
c3  0.33343371928623045
b2  0.28341866139329586
a1  0.21580437563085933
d4  0.0

Source: d4
id  pagerank
d4  0.15
a1  0.0
b2  0.0
c3  0.0
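The slides leave the decision rule implicit. One possible reading, which is an assumption and not something stated in the talk, is to flag an address when none of the customer's known addresses gives it a personalized PageRank score above some threshold. The sketch below is plain Scala; the scores are rounded from the tables above, and the object name, `suspicious` method and threshold are hypothetical.

```scala
object FraudFlagSketch {
  // Personalized PageRank scores: source address -> (address -> score),
  // rounded to three decimals from the slide tables.
  val personalized: Map[String, Map[String, Double]] = Map(
    "a1" -> Map("a1" -> 0.333, "c3" -> 0.283, "b2" -> 0.216, "d4" -> 0.0),
    "b2" -> Map("b2" -> 0.333, "a1" -> 0.283, "c3" -> 0.216, "d4" -> 0.0),
    "c3" -> Map("c3" -> 0.333, "b2" -> 0.283, "a1" -> 0.216, "d4" -> 0.0)
  )

  // Hypothetical rule: an address is suspicious when no known source
  // gives it a score above the threshold.
  def suspicious(scores: Map[String, Map[String, Double]],
                 threshold: Double = 0.0): Set[String] = {
    val candidates = scores.values.flatMap(_.keys).toSet
    candidates.filter(addr => scores.values.forall(_.getOrElse(addr, 0.0) <= threshold))
  }
}
```

On the slide data this rule singles out d4, the address that was added as fake. In practice the threshold and the choice of trusted sources would need tuning against real cases.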

Future Directions and Thoughts

• Focus on delivering value over tools and technologies
• Will we settle on a language for Graph Analytics?
• More algorithms in GraphX?
• Large scale Graph Analytics is still not scalable

Apache Spark GraphX: http://spark.apache.org/graphx/
Follow me on Twitter (@mopatel) for interesting Deep Learning and Analytics tweets
