Cassandra Summit 2014: Social Media Security Company Nexgate Relies on Cassandra for Success in Fraud Detection

Social Media Brand Protection and Compliance

Harold Nguyen 12 September 2014

Cassandra Summit 2014, September 10-11 | #CassandraSummit

Ø  Nexgate helps automate the discovery, monitoring, and protection of your brand’s Social Media accounts (Let us show you: nexgate.com/demo)

Ø  One thing we do is offer automated classification of content for 100+ categories

–  Including malware, spam, hate speech, etc…

–  And flagging for violation of HIPAA, FFIEC, SEC and FINRA compliance standards

What is Nexgate?

About Us

Company – Security & Compliance for Social

Ø  Launched April 2013 - Series A from Sierra & WindForce Ventures

–  18 employees, 7 in Engineering (2 Data Scientists)

Ø  Security people from:

Ø  Customers:

Ø  Over 350 million total pieces of social media content across Facebook, Twitter, YouTube, Google+, LinkedIn

Ø  Currently about 1.5 million new pieces of content per day

–  All classified in real time as they come in

Ø  Over 65 million total social media content authors

Ø  About 250,000 new social media content authors per day

Scale of Data

“The completeness of any classification system is predicated on the breadth of the corpus of data upon which it is built” – Rich Sutton, CTO

Content Classification

Ø  In order to have an accurate classification system, we need to have A LOT of data

Ø  In order to have a lot of data, we need a strong and capable infrastructure to store all the data we collect

Not to say MySQL is a dumpster – we heavily rely on MySQL!

Relational Database

Ø  In the beginning, we threw everything into MySQL – why not? It’s:

–  Easy to use

–  Many people already know how to use it

–  Secure

–  Inexpensive (free)

–  Manages memory well

–  Fast (up to 50 million rows)

–  Supports several development interfaces

Not Only Relational Database

Ø  But after several months, we realized we needed a NoSQL solution

Social Media Data

Ø  Social media data size is on average about 1k, including content and metadata

–  Content includes the actual text and links from the social media message

–  Metadata includes time, social ID, parent, account, etc…

–  Metadata can vary depending on the social media platform (likes, followers, subscribers, etc…)
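As a rough cross-check, the per-item size above can be combined with the volume figures quoted earlier in the deck (over 350 million items in total, about 1.5 million new items per day). This is a minimal sketch, assuming "1k" means about 1,000 bytes per item and ignoring replication, indexes, and compression, so real disk usage would differ:

```python
# Back-of-envelope sizing from the figures quoted in the slides.
# Assumption: "1k" is read as ~1,000 bytes per item (content + metadata),
# before replication, indexes, or compression.
AVG_ITEM_BYTES = 1_000
TOTAL_ITEMS = 350_000_000        # "over 350 million pieces" of content
NEW_ITEMS_PER_DAY = 1_500_000    # "about 1.5 million new" items per day
SECONDS_PER_DAY = 24 * 60 * 60

total_gb = TOTAL_ITEMS * AVG_ITEM_BYTES / 1e9
daily_gb = NEW_ITEMS_PER_DAY * AVG_ITEM_BYTES / 1e9
items_per_sec = NEW_ITEMS_PER_DAY / SECONDS_PER_DAY

print(f"raw corpus: ~{total_gb:.0f} GB")
print(f"daily growth: ~{daily_gb:.1f} GB/day")
print(f"average arrival rate: ~{items_per_sec:.1f} items/sec")
```

Under these assumptions the raw corpus is on the order of a few hundred gigabytes and new content arrives at well under 20 items per second on average, which is an average only and says nothing about peak load.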

Social media data are pretty rough and jagged – store some of it in a NoSQL solution

Storing Social Media Data

Ø  Store social media data across both SQL and NoSQL

SQL: Fixed length, non-null, heavily indexed, group access

NoSQL: Variable length, commonly null, softly indexed, single access, text search
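One way to picture the split is a small routine that sends the fixed, always-present fields to a relational row and everything else to a sparse document. This is only an illustrative sketch: the field names (`item_id`, `social_id`, `account`, `posted_at`) and the choice of which fields count as "fixed" are hypothetical, since the slides do not give Nexgate's actual schema.

```python
from typing import Any

# Hypothetical field names: the slides mention "time, social ID, parent,
# account" as metadata but do not show a schema.
FIXED_FIELDS = ("item_id", "social_id", "account", "posted_at")


def split_record(record: dict[str, Any]) -> tuple[dict[str, Any], dict[str, Any]]:
    """Split one item into a fixed-shape SQL row and a sparse NoSQL document."""
    missing = [f for f in FIXED_FIELDS if record.get(f) is None]
    if missing:
        # The SQL side is described as non-null, so reject incomplete rows.
        raise ValueError(f"fixed fields must be non-null: {missing}")

    sql_row = {f: record[f] for f in FIXED_FIELDS}

    # Everything else is variable-length or platform-specific; nulls are
    # dropped instead of stored.
    nosql_doc = {
        k: v for k, v in record.items()
        if k not in FIXED_FIELDS and v is not None
    }
    nosql_doc["item_id"] = record["item_id"]  # shared key to rejoin both halves
    return sql_row, nosql_doc


post = {
    "item_id": 1, "social_id": "fb:123", "account": "brand",
    "posted_at": 1410000000, "text": "hello", "likes": 3, "parent": None,
}
row, doc = split_record(post)
print(row)
print(doc)
```

Keeping `item_id` on both sides is a design choice made for this sketch so that a row in the relational store can be mapped back to its free-form content.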

NoSQL Requirements

Ø  Our requirements when searching for a NoSQL solution

–  Easy to use

–  Simple and proven horizontal scalability

–  Integrated tools for research (Solr): search and analysis

–  Operational simplicity: all nodes the same

–  Fantastic Enterprise support (Thanks!)

–  Simple to deploy and maintain

–  Integration with other “big data” tools (Hadoop, Spark!)

Deployment

•  Multi-region AWS EC2

•  M1 Large instances

•  Instance attached storage

•  About to scale again

•  Separate dev, test, prod clusters

Datastax:

•  Start-up pricing, per-core pricing

•  On site experts, responsive support

Data throughput

Average reads = 70 / sec

Average writes = 25 / sec

Ø  Among the many security and compliance classifications that Nexgate provides, we also have powerful spam detection

Ø  Spam can be a single link directing to a fraudulent site (screenshots of a Facebook comment):

Fighting Spam with Cassandra

Ø  Or it can be less obvious, and more personal. This is extremely common. Here, the same user has posted the same message across different social media accounts (screenshot taken from Nexgate product):

Social media spam has grown 687% since the start of 2013.

Get the report at http://nx.gt/SocialSpamReport

Ø Can create Spam signatures to catch this type of content

Ø  ...but it would be too slow to catch Spam in real time.

Ø Cassandra

Cassandra and Social Media Spam

Ø  Even though Cassandra is a NoSQL schema-less database, it is worth carefully defining the data model

Ø  Can’t just “throw data at it” – can make for some really inefficient queries

Ø  Define the data model based on how you will query the data

Ø  For us, we want to detect spam content that has been posted multiple times

–  Spammers tend to post same-content messages

Define Your Data Model

Ø  Typical table in Cassandra

–  Wide “unconstrained” rows are a nice feature compared with SQL

Spam Multiplicity Data Model

Ø  Row key -> hash of content

Ø  Column key -> unique ID (strictly increasing with time)

Ø  Column value -> item_id and time of post

Ø  Spammers typically post the same content over and over

Ø  Easy to determine how many times a same-content post is made: check the number of columns

Ø  Will never double count because the column key will simply be updated instead of added

Ø  Indexed by the content, so quick reads and writes

Ø  By reading the column value, can extract the time series information of duplicated posts

–  Can also map back to the original value – we store actual content indexed by the item_id in another Cassandra table

Ø  Cassandra is not a magic bullet

–  Still need a relational database to glue all the pieces of data together

–  Batch processing may need other tools like Hadoop
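The row-key / column-key layout described above can be mimicked in a few lines of Python to show why counting and de-duplication fall out of the model. This is an in-memory stand-in, not Cassandra client code: the class and method names are hypothetical, and hashing with SHA-1 over lowercased, whitespace-normalized text is an assumption (the slides only say "hash of content").

```python
import hashlib
from collections import defaultdict


class SpamMultiplicityStore:
    """In-memory stand-in for one wide-row table: row key -> {column key: value}."""

    def __init__(self) -> None:
        self._rows: dict[str, dict[int, tuple[str, int]]] = defaultdict(dict)

    @staticmethod
    def row_key(content: str) -> str:
        # Assumption: SHA-1 of lowercased, whitespace-normalized text.
        normalized = " ".join(content.lower().split())
        return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

    def record_post(self, content: str, unique_id: int, item_id: str, posted_at: int) -> None:
        # Writing an existing column key overwrites it, so a replayed
        # write never double counts.
        self._rows[self.row_key(content)][unique_id] = (item_id, posted_at)

    def multiplicity(self, content: str) -> int:
        # Number of columns in the row = number of same-content posts.
        return len(self._rows.get(self.row_key(content), {}))

    def timeline(self, content: str) -> list[tuple[str, int]]:
        # Column keys increase with time, so sorting them yields the time series.
        columns = self._rows.get(self.row_key(content), {})
        return [columns[key] for key in sorted(columns)]


store = SpamMultiplicityStore()
store.record_post("Buy now!!", 1, "item-a", 100)
store.record_post("buy   NOW!!", 2, "item-b", 200)
store.record_post("Buy now!!", 2, "item-b", 200)  # replayed write, same column key
print(store.multiplicity("Buy now!!"))
print(store.timeline("buy now!!"))
```

In this sketch a lookup touches a single row, which mirrors the "indexed by the content" point: the question "how many times was this exact message posted, and when?" never requires scanning other rows.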

Why this Data Model?

Ø  This has become invaluable to us for catching spam content in real time – the following “rant” comment was posted 38 times…

–  Brand can more easily moderate given automated tools

Real-world spam multiplicity

Ø  In another example, a customer received 25,000 inappropriate messages, and this tool helped us automate content removal

Ø  Another way to tackle real-time spam is by identifying spammy users

–  Since Cassandra effortlessly keeps all the content we observed, our algorithm takes into account all the posts contributed by an author to determine if they are a spammer

Ø  Additionally, it is important to keep all data to train our 100+ classifiers

Importance of Keeping All Data

Ø  Cassandra actually has been humming along quite nicely!

–  Barely any tweaking needed from default values

–  No deletes (just the nature of our dataset) => not a lot of frequent repairs performed (repair is done to resolve inconsistencies across all replicas of data due to deletes)

•  Fine for us, because repair requires intensive disk I/O

Ø  Only times we observed performance issues:

–  When the rates of our reads and writes reached a certain threshold

–  When the size of the data being inserted was too large

–  Heap memory issue with Cassandra 1.1.x

Ø  In all cases, Datastax provided a quick and simple solution, mostly just toggling a few parameters in config files and restarting the nodes

Tuning Cassandra

Ø  Community is wonderful - it's really easy to jump on the Cassandra IRC channel and talk to fellow users and developers to get real-time feedback.

–  With IRC and mailing list help, implemented composite columns to detect malware sites on the second day of using Cassandra 3 years ago

Ø  In fact, when we tested a migration to the latest version of Cassandra and one of our Ruby wrappers didn't play nice with CQL3, I was able to speak directly with the Ruby wrapper author on IRC and got an explanation of why it didn't work.

–  That same day, I committed a fix and opened a pull request against the Ruby wrapper on GitHub, and the author looked at it the next morning

Ø  Datastax support has been invaluable for providing fast feedback and simple solutions

Cassandra Community

Ø  OpsCenter helpful in debugging performance issues

Ø  Solr – used to obtain training data for classifiers by phrase matching

Ø  Looking forward:

–  Datastax Spark support to look into training labeled data with MapReduce

Datastax Additional Tools

Thank you!

Let us show you: nexgate.com/demo

Follow us: @NXGate facebook.com/NXGate

