Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case

Traversing our way through Apache Spark GraphFrames and GraphX

Mo Patel
Data Day Texas 2017

A bit about me
• Currently Deep Learning Practice Director at Teradata
– Road Object Detection & Scene Labeling
– Visual Product Search
– Chatbots
• Previously
– Analytics @ Social Sharing Startup
– Analytics @ Intelligence Community
– Distributed Systems @ Satellite Operations Company
– Software Engineering @ Defense Communications Program
• Research Interests: Distributed Systems for Analytics
• Love snowboarding and outdoor sports in general, and working out to keep doing those things

mopatel

What is this talk about?
• What are Graphs and what are some interesting things about Graphs?
• What are some Graph Analytics Examples?
• What are GraphFrames?
• What is GraphX?
• How can Graph Analytics help financial companies fight Synthetic Identity Fraud?

What is a Graph?
[Figures: a natural graph and an artificial graph. Image source: Wikipedia]

Power of Graphs
• Good: Facebook, Twitter, WhatsApp… most popular social networks
• Bad: MySpace, Friendster, Orkut… “Nobody goes there anymore. It's too crowded” – Yogi Berra
• Data Growth: Recall Metcalfe's Law (n^2) and Reed's Law (2^n)
• Memory Intensive
• Processing Intensive
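As a rough illustration of the Data Growth bullet, the sketch below compares the two growth laws for a network of n members. The object and method names are made up for this example, and the formulas are the plain n^2 and 2^n forms from the slide, without constant factors.

```scala
object NetworkValue {
  // Metcalfe's Law: value grows with the number of possible pairwise links, on the order of n^2.
  def metcalfe(n: Int): BigInt = BigInt(n).pow(2)

  // Reed's Law: value grows with the number of possible subgroups, on the order of 2^n.
  def reed(n: Int): BigInt = BigInt(2).pow(n)
}
```

For example, `NetworkValue.metcalfe(10)` is 100 while `NetworkValue.reed(10)` is 1024, and the gap widens quickly as n grows, which is one way to see why graph workloads become memory and processing intensive.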

Graph Databases cost money, Graph Analytics make money!
• PageRank, EigenCentrality
• Modularity, Clustering Coefficient, Betweenness, Closeness
• Loopy Belief Propagation, SALSA

Node Score in a Graph
• Use case: Find out how important an entity is in a graph
– Entity Fraud Detection
– Influencers
– Crime Bosses
• Methods: PageRank, EigenCentrality
PageRank: http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm (Implemented: Spark, Aster, iGraph)
EigenCentrality: http://www.stat.washington.edu/~pdhoff/courses/567/Notes/l6_centrality.pdf (Implemented: Spark, iGraph)
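To make the node-score idea concrete, here is a minimal sketch of eigenvector centrality computed by power iteration in plain Scala, with no Spark dependency. The object name, the edge-list input, and the choice to iterate on A + I (so the iteration does not oscillate on bipartite graphs) are assumptions for illustration, not the Spark or iGraph implementation.

```scala
object EigenCentrality {
  /** Power iteration on (A + I) for an undirected graph given as an edge list. */
  def scores(edges: Seq[(String, String)], iterations: Int = 100): Map[String, Double] = {
    val nodes = edges.flatMap { case (a, b) => Seq(a, b) }.distinct

    // Undirected adjacency lists: every edge is visible from both endpoints.
    val neighbors: Map[String, Seq[String]] =
      (edges ++ edges.map(_.swap)).groupBy(_._1).map { case (k, v) => k -> v.map(_._2) }

    var x: Map[String, Double] = nodes.map(n => n -> 1.0).toMap
    for (_ <- 1 to iterations) {
      // New score = own score + sum of neighbor scores, i.e. one multiplication by (A + I).
      val next = nodes.map { n =>
        n -> (x(n) + neighbors.getOrElse(n, Seq.empty[String]).map(m => x(m)).sum)
      }.toMap
      // Rescale to unit length so the values stay bounded.
      val norm = math.sqrt(next.values.map(v => v * v).sum)
      x = next.map { case (k, v) => k -> (v / norm) }
    }
    x
  }
}
```

On a small star graph such as `Seq(("boss", "a"), ("boss", "b"), ("boss", "c"))`, the hub node receives a higher score than any leaf, which is the behavior the "Crime Bosses" use case relies on.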

Communities in a Graph
• Use case: Detect similar nodes
– Behavioral Segmentation
– Crime Rings
– Product Strength & Weakness
• Methods: Modularity, Clustering Coefficient, Betweenness, Closeness
Modularity: https://github.com/gephi/gephi/wiki/Modularity (Implemented: Aster, Gephi)
Clustering Coefficient, Betweenness, Closeness: http://www.stat.washington.edu/~pdhoff/courses/567/Notes/l6_centrality.pdf (Implemented: Spark, iGraph)
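Of the community methods listed, the local clustering coefficient is the simplest to show in a few lines. The sketch below is plain Scala under assumed names (`Clustering`, `localCoefficient`) and treats the graph as undirected; it is meant as an illustration of the measure, not as any library's implementation.

```scala
object Clustering {
  /** Fraction of a node's neighbor pairs that are themselves connected (undirected graph). */
  def localCoefficient(edges: Seq[(String, String)], node: String): Double = {
    val adj: Map[String, Set[String]] =
      (edges ++ edges.map(_.swap)).groupBy(_._1).map { case (k, v) => k -> v.map(_._2).toSet }

    val nbrs = adj.getOrElse(node, Set.empty[String]) - node
    val k = nbrs.size
    if (k < 2) 0.0 // fewer than two neighbors: no pairs to check
    else {
      // Count the neighbor pairs that share an edge, each unordered pair once.
      val links = nbrs.toSeq.combinations(2).count { pair => adj(pair(0)).contains(pair(1)) }
      links.toDouble / (k * (k - 1) / 2.0)
    }
  }
}
```

A node inside a tight ring scores close to 1.0, while a node whose neighbors do not know each other scores close to 0.0, which is why the measure is useful for spotting crime rings.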

Growth in Graph
• Use case: Predict where the graph will grow, or suggest new edges
– Event Prediction
– Product Recommendation
• Methods: Loopy Belief Propagation, Belief Networks, SALSA
Loopy Belief Propagation: https://people.csail.mit.edu/fisher/publications/papers/ihler05b.pdf (Implemented: Aster, Markovian)
SALSA: http://www9.org/w9cdrom/175/175.html (Implemented: Aster, Github PageRanking)

GraphX
• Apache Spark library for conducting Graph Analytics
• Graph Operations: numEdges, numVertices, degrees, collectNeighbors
• Graph Analytics:
– PageRank
– Connected Components
– Triangle Count
http://spark.apache.org/graphx/
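To give a feel for what Connected Components computes, here is a small plain-Scala sketch that labels every vertex with the lowest vertex id in its component, the same labeling convention GraphX documents for its connected components result. The object name and the simple repeat-until-stable loop are assumptions for illustration; GraphX itself runs a distributed computation.

```scala
object Components {
  /** Label each vertex with the smallest vertex id reachable from it (edges treated as undirected). */
  def connectedComponents(vertices: Seq[Long], edges: Seq[(Long, Long)]): Map[Long, Long] = {
    // Start with every vertex in its own component.
    var label: Map[Long, Long] = vertices.map(v => v -> v).toMap
    var changed = true
    while (changed) {
      changed = false
      edges.foreach { case (a, b) =>
        // Both endpoints of an edge adopt the smaller of their two labels.
        val m = math.min(label(a), label(b))
        if (label(a) != m || label(b) != m) {
          label = label.updated(a, m).updated(b, m)
          changed = true
        }
      }
    }
    label
  }
}
```

With vertices 1 to 5 and edges (1,2), (2,3), (4,5), vertices 1, 2 and 3 end up labeled 1 and vertices 4 and 5 end up labeled 4.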

Property Graph

GraphFrame
• SQL-like context is very popular
• Lots of ways to work with Graphs: Cypher, SPARQL, Gremlin…
• Spark introduced DataFrame in February 2015
• Goal: Make it easy for DataFrame users to work with Graphs
• GraphFrame: GraphX & DataFrame Operations
https://graphframes.github.io/index.html

GraphFrame: Vertices DataFrame
// Each vertex row needs an "id" column.
val vertices = sqlContext.createDataFrame(List(
  ("a1", "Wine", "Beverage"), ("b2", "Beer", "Beverage"), ("c3", "Pretzel", "Snack"), ("d4", "Cheese", "Snack")
)).toDF("id", "name", "type")

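GraphFrame: Edges DataFrame
import org.graphframes.GraphFrame

// GraphFrame expects the edge endpoint columns to be named "src" and "dst".
val edges = sqlContext.createDataFrame(List(
  ("a1", "d4", 15455), ("b2", "c3", 4849), ("a1", "c3", 40), ("b2", "d4", 134)
)).toDF("src", "dst", "count")

val productsGraphFrame = GraphFrame(vertices, edges)

// String literals inside a SQL filter expression need single quotes.
productsGraphFrame.vertices.filter("type = 'Snack'")

// Edges are exposed as a DataFrame, so count them there.
productsGraphFrame.edges.count()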

What is Synthetic Identity Fraud?

http://security.frontline.online/article/2014/2/2379-Synthetic-Identity-Fraud

Why has Synthetic Identity Fraud emerged as a big problem?

(Figure source: Verafin)

How are Synthetic IDs created?

(Figure source: Verafin)

How are Financial Companies exploited?

(Figure source: Verafin)

What is the impact of Synthetic Identity Fraud?

(Figure source: Verafin)

How can Graph Analytics help solve the Synthetic Identity problem?

Customer Address DataFrame
val customerAddresses = sqlContext.createDataFrame(List(
  ("a1", "123 Main Street", "123abc456efg"), ("b2", "345 High Street", "123abc456efg"), ("c3", "789 Park Ave", "123abc456efg")
)).toDF("id", "address", "customerid")

Add Fake Address
val fakeAddress = sqlContext.createDataFrame(List(
  ("d4", "999 Ocean Ave", "123abc456efg")
)).toDF("id", "address", "customerid")

// Append the fake address to the customer's known addresses.
val tempCustomerAddresses = customerAddresses.union(fakeAddress)

Databricks Cloud Notebook: http://tiny.cc/ddtx17graphx

How can Graph Analytics help solve the Synthetic Identity problem? (continued)

Master Address Connection Edges DataFrame
val masterAddressConnections = sqlContext.createDataFrame(List(
  ("b2", "a1"), ("e5", "c3"), ("c3", "b2"), ("a1", "c3"), ("e5", "d4") …
)).toDF("src", "dst")

// The edges hold address ids, so match them against the "id" column.
// Keep edges whose destination is one of the customer's known addresses.
val toEdgeMatches = masterAddressConnections.join(customerAddresses, masterAddressConnections("dst") === customerAddresses("id")).select("src", "dst")

// Keep edges whose source is one of the customer's known addresses.
val fromEdgeMatches = masterAddressConnections.join(customerAddresses, masterAddressConnections("src") === customerAddresses("id")).select("src", "dst")

val checkEdges = fromEdgeMatches.union(toEdgeMatches)

Detection GraphFrame: PageRank
val detectionGraphFrame = GraphFrame(tempCustomerAddresses, checkEdges)

// PageRank
val resultRanks = detectionGraphFrame.pageRank.resetProbability(0.15).tol(0.01).run()

// Personalized PageRank
val d4Ranks = detectionGraphFrame.pageRank.resetProbability(0.15).maxIter(10).sourceId("d4").run()

resultRanks.vertices.select("id", "pagerank").show()


How do we decide if this address is fraud or not?

PageRank
id  pagerank
a1  0.9463535901944437
b2  0.9463535901944437
c3  0.9463535901944437
d4  0.15
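One way to read the d4 row: a vertex with no incoming edges receives only the reset probability. The plain-Scala sketch below shows this with a simplified update rule, rank(v) = resetProb + (1 - resetProb) * sum of rank(u) / outDegree(u) over incoming edges. That rule, the object name, and the three-edge toy graph are assumptions for illustration. The library implementation and the notebook's full edge list may differ, so the values for a1, b2 and c3 will not match the table above.

```scala
object ToyPageRank {
  /** Simplified, unnormalized PageRank with every vertex starting at rank 1.0. */
  def run(vertices: Seq[String], edges: Seq[(String, String)],
          resetProb: Double = 0.15, iterations: Int = 20): Map[String, Double] = {
    val outDegree: Map[String, Int] = edges.groupBy(_._1).map { case (src, es) => src -> es.size }
    var rank: Map[String, Double] = vertices.map(v => v -> 1.0).toMap

    for (_ <- 1 to iterations) {
      // Each edge carries rank(src) / outDegree(src) to its destination.
      val incoming: Map[String, Double] = edges
        .map { case (src, dst) => dst -> (rank(src) / outDegree(src)) }
        .groupBy(_._1)
        .map { case (dst, contribs) => dst -> contribs.map(_._2).sum }

      rank = vertices.map { v =>
        v -> (resetProb + (1 - resetProb) * incoming.getOrElse(v, 0.0))
      }.toMap
    }
    rank
  }
}
```

Calling `ToyPageRank.run(Seq("a1", "b2", "c3", "d4"), Seq(("b2", "a1"), ("c3", "b2"), ("a1", "c3")))` leaves d4 at 0.15, because nothing links to it, while the three connected addresses keep a much higher rank.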

Personalized PageRank

Source: a1
id  pagerank
a1  0.33343371928623045
c3  0.28341866139329586
b2  0.21580437563085933
d4  0.0

Source: b2
id  pagerank
b2  0.33343371928623045
a1  0.28341866139329586
c3  0.21580437563085933
d4  0.0

Source: c3
id  pagerank
c3  0.33343371928623045
b2  0.28341866139329586
a1  0.21580437563085933
d4  0.0

Source: d4
id  pagerank
d4  0.15
a1  0.0
b2  0.0
c3  0.0
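The pattern in these results suggests a simple decision rule: an address that gets zero personalized rank from every known-good address is not connected to the customer's address history. The sketch below is a plain-Scala illustration of that idea. The update rule (reset mass goes only to the source vertex), the object and method names, and the "zero from every trusted source" rule are assumptions for this example, not the notebook's code or the library's exact algorithm.

```scala
object PersonalizedRank {
  /** Simplified personalized PageRank: only the source vertex receives the reset mass. */
  def run(vertices: Seq[String], edges: Seq[(String, String)], source: String,
          resetProb: Double = 0.15, iterations: Int = 10): Map[String, Double] = {
    val outDegree: Map[String, Int] = edges.groupBy(_._1).map { case (s, es) => s -> es.size }
    var rank: Map[String, Double] = vertices.map(v => v -> (if (v == source) 1.0 else 0.0)).toMap

    for (_ <- 1 to iterations) {
      val incoming: Map[String, Double] = edges
        .map { case (s, d) => d -> (rank(s) / outDegree(s)) }
        .groupBy(_._1)
        .map { case (d, cs) => d -> cs.map(_._2).sum }

      rank = vertices.map { v =>
        val reset = if (v == source) resetProb else 0.0
        v -> (reset + (1 - resetProb) * incoming.getOrElse(v, 0.0))
      }.toMap
    }
    rank
  }

  /** Hypothetical rule: flag vertices that receive zero rank from every trusted source. */
  def suspicious(vertices: Seq[String], edges: Seq[(String, String)],
                 trusted: Seq[String]): Set[String] = {
    val ranks = trusted.map(t => run(vertices, edges, t))
    vertices.filter(v => ranks.forall(r => r(v) == 0.0)).toSet
  }
}
```

On the toy graph with edges (b2,a1), (c3,b2), (a1,c3) and trusted addresses a1, b2 and c3, `suspicious` returns only d4. In practice such a flag would be one signal to review alongside others rather than a verdict on its own.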

Future Directions and Thoughts
• Focus on delivering value over tools and technologies
• Will we settle on a language for Graph Analytics?
• More algorithms in GraphX?
• Large scale Graph Analytics is still not scalable

Apache Spark GraphX: http://spark.apache.org/graphx/
Follow me on Twitter (@mopatel) for interesting Deep Learning and Analytics tweets
