Social Media Brand Protection and Compliance
Harold Nguyen, 12 September 2014
Cassandra Summit 2014, September 10-11 | #CassandraSummit

Cassandra Summit 2014: Social Media Security Company Nexgate Relies on Cassandra for Success in Fraud Detection

DESCRIPTION

Presenter: Harold Nguyen, Senior Data Scientist at Nexgate

In this talk, we focus on one use case: how Cassandra can be used to detect spam and spammers on social media. We also show how we use Cassandra to train our 100+ social-media-security classifiers. The accuracy of any security product is directly tied to the breadth of the corpus of data upon which it is built. For Nexgate, this means that the success of our products is inextricably tied to our ability to save everything we've ever scanned, but in a way that is still readily accessible. In the days before NoSQL, this was hard. This talk is about how Datastax and Cassandra make it easy.

What is Nexgate?

Ø  Nexgate helps automate the discovery, monitoring, and protection of your brand’s Social Media accounts (Let us show you: nexgate.com/demo)
Ø  One thing we do is offer automated classification of content for 100+ categories
–  Including malware, spam, hate speech, etc.
–  And flagging for violation of HIPAA, FFIEC, SEC and FINRA compliance standards

About Us

Company – Security & Compliance for Social

Ø  Launched April 2013 - Series A from Sierra & WindForce Ventures

–  18 employees, 7 in Engineering (2 Data Scientists)

Ø  Security people from:

Ø  Customers:

Scale of Data

Ø  Over 350 million pieces of social media content in total, spread across Facebook, Twitter, YouTube, Google+, LinkedIn
Ø  Currently about 1.5 million new pieces of content per day
–  All classified in real time as it comes in
Ø  Over 65 million total social media content authors
Ø  About 250,000 new social media content authors per day
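As a rough sanity check on the figures above, the daily volumes can be converted into average per-second rates. The sketch below assumes the load is spread evenly over the day, which real traffic is not, and the helper name `per_second` is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(per_day: float) -> float:
    """Convert a daily count into an average per-second rate (assumes uniform load)."""
    return per_day / SECONDS_PER_DAY


new_content_per_day = 1_500_000  # figure from the slide
new_authors_per_day = 250_000    # figure from the slide

content_rate = per_second(new_content_per_day)
author_rate = per_second(new_authors_per_day)

print(f"{content_rate:.1f} new items/sec")    # 17.4 new items/sec
print(f"{author_rate:.1f} new authors/sec")   # 2.9 new authors/sec
```

These are averages only; peak rates during busy hours would be higher.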

Content Classification

“The completeness of any classification system is predicated on the breadth of the corpus of data upon which it is built” – Rich Sutton, CTO

Ø  In order to have an accurate classification system, we need to have A LOT of data
Ø  In order to have a lot of data, we need a strong and capable infrastructure to store all the data we collect

Relational Database

Ø  In the beginning, we threw everything into MySQL – why not? It’s:
–  Easy to use
–  Many people already know how to use it
–  Secure
–  Inexpensive (free)
–  Manages memory well
–  Fast (up to 50 million rows)
–  Supports several development interfaces

Not to say MySQL is a dumpster – we heavily rely on MySQL!

Not Only Relational Database

Ø  But after several months, realized we needed a NoSQL solution

Social Media Data

Ø  Social media data size is on average about 1 KB per item, including content and metadata
–  Content includes the actual text and links from the social media message
–  Metadata includes time, social ID, parent, account, etc.
–  Metadata can vary depending on the social media platform (likes, followers, subscribers, etc.)

Social media data are pretty rough and jagged – store some of it in a NoSQL solution

Storing Social Media Data

Ø  Store social media data across both SQL and NoSQL

SQL: Fixed length, non-null, heavily indexed, group access

NoSQL: Variable length, commonly null, softly indexed, single access, text search
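One way to picture this split: the fields every item always has go to the relational side, and whatever else a platform happens to supply goes to the NoSQL side. The following is a minimal sketch only. The field names (`item_id`, `social_id`, `account`, `posted_at`) and the `split_item` helper are hypothetical and are not taken from Nexgate's actual schema.

```python
from typing import Any, Dict, Tuple

# Hypothetical fixed fields: always present and fixed size, so they suit SQL indexes.
FIXED_FIELDS = ("item_id", "social_id", "account", "posted_at")


def split_item(item: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split one social media item into a fixed (SQL) part and a variable (NoSQL) part."""
    missing = [f for f in FIXED_FIELDS if item.get(f) is None]
    if missing:
        raise ValueError(f"fixed fields must be non-null: {missing}")
    sql_row = {f: item[f] for f in FIXED_FIELDS}
    # Everything else (text, links, likes, followers, ...) varies by platform and may be absent.
    nosql_doc = {
        k: v for k, v in item.items() if k not in FIXED_FIELDS and v is not None
    }
    # Keep the shared key so the two halves can be joined back together later.
    nosql_doc["item_id"] = item["item_id"]
    return sql_row, nosql_doc


tweet = {
    "item_id": 1, "social_id": "tw-123", "account": "brand", "posted_at": 1410480000,
    "text": "Check this out http://example.com", "followers": 42, "likes": None,
}
sql_row, nosql_doc = split_item(tweet)
```

Null metadata is dropped rather than stored, which mirrors the "commonly null" nature of the NoSQL side.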

NoSQL Requirements

Ø  Our requirements when searching for a NoSQL solution:
–  Easy to use
–  Simple and proven horizontal scalability
–  Integrated tools for research (Solr): search and analysis
–  Operational simplicity: all nodes the same
–  Fantastic Enterprise support (Thanks!)
–  Simple to deploy and maintain
–  Integration with other “big data” tools (Hadoop, Spark!)

Deployment

•  Multi-region AWS EC2
•  M1 Large instances
•  Instance attached storage
•  About to scale again
•  Separate dev, test, prod clusters

Datastax:
•  Start-up pricing, per-core pricing
•  On site experts, responsive support

Data throughput

Average reads = 70 / sec
Average writes = 25 / sec

Fighting Spam with Cassandra

Ø  Among the many security and compliance classifications that Nexgate provides, we also have powerful spam detection
Ø  Spam can be a single link directing to a fraudulent site (screenshots of a Facebook comment):

Ø  Or it can be less obvious, and more personal. This is extremely common. Here, the same user has posted the same message across different social media accounts (screenshot taken from Nexgate product):

Social media spam has grown 687% since the start of 2013.

Get the report at http://nx.gt/SocialSpamReport

Cassandra and Social Media Spam

Ø  Can create Spam signatures to catch this type of content
Ø  ...but it would be too slow to catch Spam in real time.
Ø  Cassandra

Define Your Data Model

Ø  Even though Cassandra is a NoSQL schema-less database, it is worth carefully defining the data model
Ø  Can’t just “throw data at it” – can make for some really inefficient queries
Ø  Define the data model based on how you will query the data
Ø  For us, we want to determine spam content that has been posted multiple times
–  Spammers tend to post same-content messages

Spam Multiplicity Data Model

Ø  Typical table in Cassandra
–  Wide “unconstrained” rows are a nice feature compared with SQL

Ø  Row key -> hash of content
Ø  Column Key -> Unique ID (strictly increasing with time)
Ø  Column Value -> Item_id and time of post
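The three-level layout above can be mimicked with nested dictionaries to make it concrete. This is an in-memory stand-in, not Cassandra code. The `SpamMultiplicity` class and the choice of SHA-1 for the content hash are assumptions made for illustration, since the slides do not name a hash function.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Tuple


def content_hash(text: str) -> str:
    """Row key: a hash of the content (SHA-1 is an assumption, not from the slides)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


class SpamMultiplicity:
    """In-memory stand-in for the wide-row table: row key -> {column key -> column value}."""

    def __init__(self) -> None:
        self.rows: Dict[str, Dict[int, Tuple[int, int]]] = defaultdict(dict)

    def record(self, text: str, unique_id: int, item_id: int, posted_at: int) -> None:
        # Column value holds the item_id and the time of the post.
        self.rows[content_hash(text)][unique_id] = (item_id, posted_at)


table = SpamMultiplicity()
spam = "WIN A FREE PHONE http://example.com"
table.record(spam, unique_id=1001, item_id=1, posted_at=1410480000)
table.record(spam, unique_id=1002, item_id=2, posted_at=1410480060)
table.record("Nice photo!", unique_id=1003, item_id=3, posted_at=1410480120)
```

Both copies of the spam message land in the same row because they share a content hash, while the unrelated comment gets a row of its own.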

Why this Data Model?

Ø  Spammers typically post the same content over and over
Ø  Easy to determine how many times a same-content post is made: check the number of columns
Ø  Will never double count, because a repeated write to the same column key simply updates the existing column instead of adding one
Ø  Indexed by the content, so quick reads and writes
Ø  By reading the column value, can extract the time series information of duplicated posts
–  Can also map back to the original value – we store actual content indexed by the item_id in another Cassandra table
Ø  Cassandra not a magic bullet
–  Still need a relational database to glue all the pieces of data together
–  Batch processing may need other tools like Hadoop
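The counting, no-double-count, and time-series properties listed on this slide can be illustrated with a single wide row modeled as a Python dictionary. This is a sketch under that assumption, not driver code, and the function names (`upsert`, `multiplicity`, `time_series`) are hypothetical.

```python
from typing import Dict, List, Tuple

# One wide row for a single content hash: column key (unique ID) -> (item_id, posted_at).
Row = Dict[int, Tuple[int, int]]


def upsert(row: Row, unique_id: int, item_id: int, posted_at: int) -> None:
    """Write a column; an existing column key is overwritten rather than duplicated."""
    row[unique_id] = (item_id, posted_at)


def multiplicity(row: Row) -> int:
    """Number of times this content was posted = number of columns in the row."""
    return len(row)


def time_series(row: Row) -> List[int]:
    """Post times in column-key order (column keys strictly increase with time)."""
    return [posted_at for _, (_, posted_at) in sorted(row.items())]


row: Row = {}
upsert(row, 1001, item_id=1, posted_at=1410480000)
upsert(row, 1002, item_id=2, posted_at=1410480060)
upsert(row, 1002, item_id=2, posted_at=1410480060)  # same item seen again: no double count

print(multiplicity(row))  # 2
print(time_series(row))   # [1410480000, 1410480060]
```

The third write reuses column key 1002, so the column count stays at two.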

Real-world spam multiplicity

Ø  This has become invaluable to us for catching spam content in real time – the following “rant” comment was posted 38 times…
–  Brand can more easily moderate given automated tools
Ø  In another example, a customer received 25,000 inappropriate messages, and this tool helped us automate content removal

Importance of Keeping All Data

Ø  Another way to tackle real-time spam is by identifying spammy users
–  Since Cassandra effortlessly keeps all the content we observed, our algorithm takes into account all the posts contributed by an author to determine if they are a spammer
Ø  Additionally, it is important to keep all data to train our 100+ classifiers
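The slides do not describe the author-level algorithm. Purely to illustrate why an author's full history matters, the sketch below scores an author by the share of their posts that repeat content they already posted. The `duplicate_ratio` and `looks_spammy` functions and the 0.5 threshold are invented for this example.

```python
from collections import Counter
from typing import Iterable


def duplicate_ratio(posts: Iterable[str]) -> float:
    """Fraction of an author's posts that repeat content they have already posted."""
    counts = Counter(posts)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    repeats = total - len(counts)  # every post beyond the first copy of each message
    return repeats / total


def looks_spammy(posts: Iterable[str], threshold: float = 0.5) -> bool:
    # The threshold is arbitrary; a real classifier would combine many more signals.
    return duplicate_ratio(posts) >= threshold


history = ["buy now http://example.com"] * 4 + ["hello world"]
print(duplicate_ratio(history))  # 0.6
print(looks_spammy(history))     # True
```

With only the most recent post available, this author would look harmless; the signal only appears when the whole posting history is kept.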

Tuning Cassandra

Ø  Cassandra actually has been humming along quite nicely!
–  Barely any tweaking needed from default values
–  No deletes (just the nature of our dataset) => repairs rarely performed (repair is done to resolve inconsistencies across all replicas of data due to deletes)
•  Fine for us, because repair requires intensive disk I/O
Ø  Only times we observed performance issues:
–  When the rates of our reads and writes reached a certain threshold
–  When the size of the data being inserted was too large
–  Heap memory issue with Cassandra 1.1.x
Ø  In all cases, Datastax provided a quick and simple solution, mostly just toggling a few parameters in config files and restarting the nodes

Cassandra Community

Ø  Community is wonderful - it's really easy to jump on the Cassandra IRC channel and talk to fellow users and developers to get real-time feedback.
–  With IRC and mailing list help, implemented composite columns to detect malware sites on the second day of using Cassandra 3 years ago
Ø  In fact, when we tested a migration to the latest version of Cassandra, and one of our Ruby wrappers didn't play nice with CQL3, I was able to speak directly with the Ruby wrapper author on IRC and got an explanation of why it didn't work.
–  On the same day, I committed a fix and made a pull request to the Ruby wrapper on GitHub, and the author looked at it the next morning
Ø  Datastax support has been invaluable for providing fast feedback and simple solutions

Datastax Additional Tools

Ø  OpsCenter helpful in debugging performance issues
Ø  Solr – used to obtain training data for classifiers by phrase matching
Ø  Looking forward:
–  Datastax Spark support to look into training labeled data with MapReduce

Thank you!

Let us show you: nexgate.com/demo
Follow us: @NXGate facebook.com/NXGate
