Monetize the Noise: How Naming Data Junk Became a Security Data Treasure
Paul Sitowitz & Scott King Walker
September 28th, 2015

Data Junk VTS Prez - 20150925-3

Page 1: Data Junk VTS Prez - 20150925-3

Monetize the Noise: How Naming Data Junk Became a Security Data Treasure
Paul Sitowitz & Scott King Walker
September 28th, 2015

Page 2: Data Junk VTS Prez - 20150925-3

Verisign Confidential and Proprietary 2

Reduce, Reuse, Recycle

• Restore

• Repurpose

• Remake

• Reinvent

• Reimagine

Image by Jakub Jankiewicz (jcubic / Kuba) (Open Clip Art Library, detail page) CC0, via Wikimedia Commons

Page 3: Data Junk VTS Prez - 20150925-3

Noise, Noise, Noise, and Pigeon Droppings

• Excerpt is from: Wired Magazine, “Accept Defeat: The Neuroscience of Screwing Up” by Jonah Lehrer, 12.21.09.

• http://www.wired.com/2009/12/fail_accept_defeat/

Page 4: Data Junk VTS Prez - 20150925-3

Junk & Noise

• These are the unwanted things that we usually discard or else try to block out

• Junk
  • Trash
  • Unused items
  • Not needed items
  • Useless items
  • Not liked items

• Noise
  • Loud sounds
  • Interference
  • Malicious signals
  • Harmful irritants
  • Bad smells

Page 5: Data Junk VTS Prez - 20150925-3

Noise in our Data

• "Photon-noise" by Mdf - Photon-noise.jpg. Licensed under CC BY-SA 3.0 via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:Photon-noise.jpg#/media/File:Photon-noise.jpg

Page 6: Data Junk VTS Prez - 20150925-3

Data Analyzer, and Signal from Noise

• YXD
• NXD
• Resolution Success = Signal
• Resolution Failure = Noise (or is it?)

• The Data Analyzer product is based on finding signal in this noise.

Page 7: Data Junk VTS Prez - 20150925-3

Looking at NXDs

• When a name server cannot resolve a domain, an NXD (NXDOMAIN, non-existent domain) response is returned

• This data is typically discarded as “junk”

• Data Analyzer analyzes this data to identify domains
  • with sufficient traffic
  • requested during business hours
  • requested from specific locations around the world
  • … and many other desirable characteristics (like clickable traffic)

• We rate and score these NXDs (from 1 to 10) and allow our customers to query them

Page 8: Data Junk VTS Prez - 20150925-3

NXD Domains

• Sample of NXDs with sufficient traffic (according to DA):
  • GENTLEMANMILLION.NET
  • CDSYHD.NET
  • SILVERHORSETRADER
  • A2VERISIGNDNS.COM
  • SARAH.COM
  • 3RDBILLION.NET
  • XN--JJEEP-3F5FW08B.COM
  • PAULHUNTHOMES.COM
  • CAT-HUSE
  • SCOTTSTORAGE2.COM
  • MANNYSGOLFWORLD.COM

Page 9: Data Junk VTS Prez - 20150925-3

EDAS Record Format

• The NXD DNS traffic data available to Verisign is stored in the EDAS record format

• A single EDAS formatted record contains:
  • The Requesting IP (recursive name server)
  • The Requested Domain including TLD (up to 3rd level)
  • The time of day that the request was made
  • The Site name where the request was received
  • The DNS Record Type for the request (typically A and AAAA)
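
The fields listed above could be modeled as a simple record type. The Python sketch below is illustrative only: the slides do not describe the actual EDAS encoding, so the class name, field names, types, and the uppercase normalization are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class EdasRecord:
    """Hypothetical in-memory view of one EDAS record (all names assumed)."""
    requesting_ip: str      # recursive name server that sent the query
    requested_domain: str   # domain including TLD, up to the 3rd level
    request_time: datetime  # when the request was made
    site_name: str          # site where the request was received
    record_type: str        # DNS record type, typically "A" or "AAAA"


def truncate_to_third_level(name: str) -> str:
    """Keep at most the last three labels of a name, uppercased (assumed convention)."""
    labels = name.rstrip(".").split(".")
    return ".".join(labels[-3:]).upper()


# Example values are made up; 192.0.2.0/24 is a documentation address range.
example = EdasRecord(
    requesting_ip="192.0.2.53",
    requested_domain=truncate_to_third_level("a.b.www.example.com"),
    request_time=datetime(2015, 9, 28, 14, 30, tzinfo=timezone.utc),
    site_name="SITE-A",
    record_type="A",
)
```

Keeping the record immutable (`frozen=True`) is a design choice for the sketch: captured traffic is read-only input to later analysis.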

Page 10: Data Junk VTS Prez - 20150925-3

Big Data

• NXD request data is captured by the Traffic Monitoring team from our Edge sites for COM/NET/CC/TV

• Comprises 90% of NXD traffic

• The data is then ingested into the VSCC
  • an average of 300 Gigabytes each day

• Data Analyzer allows customers to query up to 26 weeks of raw NXD data

• That’s 42.2 Terabytes of data needed by a single customer query

• If we factor in a 3X replication model used by the VSCC at both the BRN and ILG sites, that adds up to about 250 Terabytes of raw storage!!!
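
The storage figures above can be checked with a few lines of arithmetic. This sketch takes the 42.2 Terabyte figure as given and assumes decimal units (1 TB = 1000 GB). Note that the stated average of 300 Gigabytes per day would come to roughly 54.6 Terabytes over 26 weeks; the slides do not explain the difference, so it is only reported here, not resolved.

```python
DAYS = 26 * 7                 # 26 weeks = 182 days
raw_window_tb = 42.2          # figure quoted on the slide
replicas = 3                  # 3X replication model
sites = 2                     # BRN and ILG

# Replicated raw storage across both sites.
total_tb = raw_window_tb * replicas * sites

# Daily volume implied by the 42.2 TB figure (decimal units assumed).
implied_daily_gb = raw_window_tb * 1000 / DAYS

# Window size if the stated 300 GB/day average held for all 182 days.
at_stated_rate_tb = 300 * DAYS / 1000

print(f"replicated storage: {total_tb:.1f} TB")            # 253.2 TB, i.e. about 250
print(f"implied daily volume: {implied_daily_gb:.0f} GB")  # 232 GB
print(f"window at 300 GB/day: {at_stated_rate_tb:.1f} TB") # 54.6 TB
```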

Page 11: Data Junk VTS Prez - 20150925-3

Query Processing Time

• A 26 week query on raw NXD data can take more than 8 hours to complete

• And that’s running across more than a hundred powerful data node machines in the VSCC

• With this in mind, the Data Analyzer product also stores 60 days of aggregated data for our Complete Index in order to add more value with less time needed to produce results

• Our indexes take a few hours every night to build

• This index-based data supports very flexible filter-based queries
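
The trade-off described above, aggregating once so that queries avoid scanning raw records, can be sketched in a few lines. This is a toy illustration, not the Data Analyzer index: the function names, the (day, domain) input shape, and the sample data are all assumptions.

```python
from collections import Counter
from datetime import date


def build_daily_index(records):
    """Aggregate raw (day, domain) request pairs into per-day, per-domain counts."""
    index = Counter()
    for day, domain in records:
        index[(day, domain)] += 1
    return index


def query_index(index, min_requests):
    """Return domains whose total requests across all indexed days meet a threshold."""
    totals = Counter()
    for (_day, domain), count in index.items():
        totals[domain] += count
    return sorted(d for d, n in totals.items() if n >= min_requests)


# Made-up sample data for illustration.
raw = [
    (date(2015, 9, 27), "EXAMPLE-ONE.COM"),
    (date(2015, 9, 27), "EXAMPLE-ONE.COM"),
    (date(2015, 9, 28), "EXAMPLE-ONE.COM"),
    (date(2015, 9, 28), "EXAMPLE-TWO.NET"),
]
index = build_daily_index(raw)
popular = query_index(index, min_requests=3)
```

The query touches one entry per (day, domain) pair instead of one entry per request, which is where the time saving would come from.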

Page 12: Data Junk VTS Prez - 20150925-3

Noisy Data

• With so much data comes the potential for a lot of noise:
  • But what kinds of noise?
  • How much noise?
  • How can we find this noise?

Page 13: Data Junk VTS Prez - 20150925-3

Finding Noise in the NXD Data - Sample 100K

[Chart: NXD request counts across the 100K sample; horizontal axis ticks from 100 to 95,300, vertical axis from 0 to 80,000]

Classic hockey stick pattern, very dramatic, but nothing to see here. Right?

Page 14: Data Junk VTS Prez - 20150925-3

Top 1K NXD domains from 100K Sample

[Chart: NXD request counts for the top 1K domains of the 100K sample; horizontal axis ticks from 1 to 993, vertical axis from 0 to 80,000]

“Gap” – No domains with request frequencies between 2820 and 9918 in sample.
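
A gap like this can be located automatically by sorting the request frequencies and looking for the largest ratio between neighbors. The sketch below is one possible approach, not the method used by Data Analyzer. Only the values 9918 and 2820 come from the slide; every other number in the sample is made up.

```python
def widest_relative_gap(frequencies):
    """Return (upper, lower) for the adjacent pair of sorted counts with the largest ratio."""
    ordered = sorted({f for f in frequencies if f > 0}, reverse=True)
    if len(ordered) < 2:
        raise ValueError("need at least two distinct positive frequencies")
    return max(zip(ordered, ordered[1:]), key=lambda pair: pair[0] / pair[1])


# Hypothetical request counts; only 9918 and 2820 appear in the slides.
sample = [78000, 61000, 40500, 22000, 14000, 9918, 2820, 2600, 2300, 1800]
upper, lower = widest_relative_gap(sample)
```

A ratio is used instead of an absolute difference because the counts inside the spike are far apart in absolute terms even though they belong to the same group.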

Page 15: Data Junk VTS Prez - 20150925-3

The Spike

• The large “spike” in the previous graphs shows an unusually large number of NXD requests

• Do we give these NXDs really high scores since they get lots of traffic?

• Or is this just plain noise in our data that we should discard?

• It turns out that these high-volume requests also exhibit similar request traffic patterns

• We believe that these requests are from “Botnets”

Page 16: Data Junk VTS Prez - 20150925-3

The Botnet Problem

Traffic from Botnets is:
• Automatic, behind-the-scenes traffic
• From infected computers
• Algorithmically generated based on time/date

Traffic from Botnets can be detected by:
• High traffic levels from consistent sets of recursive name servers
• Lack of traffic from other name servers

[Chart: NXD request counts; horizontal axis ticks from 1 to 951, vertical axis from 0 to 80,000]
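
The two detection signals above, heavy traffic from a consistent set of recursive name servers and little traffic from anywhere else, can be approximated by measuring how concentrated a domain's requests are. This is a minimal sketch under stated assumptions: the function names, the thresholds, and the sample traffic are hypothetical, and this is not the BDS algorithm.

```python
from collections import Counter, defaultdict


def resolver_concentration(requests, top_n=3):
    """Map each domain to (total requests, share of traffic from its top_n resolvers)."""
    per_domain = defaultdict(Counter)
    for resolver_ip, domain in requests:
        per_domain[domain][resolver_ip] += 1
    stats = {}
    for domain, counts in per_domain.items():
        total = sum(counts.values())
        top = sum(n for _ip, n in counts.most_common(top_n))
        stats[domain] = (total, top / total)
    return stats


def flag_suspicious(stats, min_requests=100, min_share=0.9):
    """Flag domains with high volume that comes almost entirely from a few resolvers."""
    return sorted(
        domain
        for domain, (total, share) in stats.items()
        if total >= min_requests and share >= min_share
    )


# Made-up traffic: one domain hit by two resolvers, one spread across forty.
requests = (
    [("198.51.100.1", "BOTLIKE.EXAMPLE")] * 100
    + [("198.51.100.2", "BOTLIKE.EXAMPLE")] * 50
    + [(f"203.0.113.{i}", "HUMANLIKE.EXAMPLE") for i in range(40) for _ in range(3)]
)
stats = resolver_concentration(requests)
flagged = flag_suspicious(stats)
```

Both thresholds would need tuning against real traffic; a popular but legitimate name requested through a small number of large resolvers could otherwise be flagged.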

Page 17: Data Junk VTS Prez - 20150925-3

What Are Botnets

• Botnets enable the most sophisticated and popular types of cybercrime today

• They allow hackers to take control of many computers at a time, which operate as part of a powerful “botnet”

• Many of these computers are infected without their owners’ knowledge

• Bots often spread themselves across the Internet by infecting unprotected computers

• Their goal is to stay hidden until they are instructed to carry out a task by a Command and Control server

Page 18: Data Junk VTS Prez - 20150925-3

About Botnets

• Botnets use an algorithm for generating domain names to make them difficult to identify. While many may be NXDs, some are not

• These Botnet domains, if registered, would connect a Botnet to a Command and Control server that issues instructions to commit attacks

• Botnets are just “zombie” machines without C&C servers to tell them what to do

Page 19: Data Junk VTS Prez - 20150925-3

Botnet Detection

• NXDs with very large amounts of traffic and that exhibit similar traffic patterns are most likely NOT requested by humans

• These domains are classified by the Botnet Detection Service (BDS) as “suspicious” and the requests are considered to be from “botnets”:
  • CDSYHD.NET
  • A2VERISIGNDNS.COM
  • 3RDBILLION.NET
  • PAULHUNTHOMES.COM
  • SCOTTSTORAGE2.COM

Sample NXDs (as on the NXD Domains slide): GENTLEMANMILLION.NET, CDSYHD.NET, SILVERHORSETRADER, A2VERISIGNDNS.COM, SARAH.COM, 3RDBILLION.NET, XN--JJEEP-3F5FW08B.COM, PAULHUNTHOMES.COM, CAT-HUSE, SCOTTSTORAGE2.COM, MANNYSGOLFWORLD.COM
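
One simple way to express “similar traffic patterns” is to compare the sets of recursive name servers that request each domain. The sketch below uses Jaccard similarity for that comparison. The slides do not say how BDS measures similarity, so the technique, the threshold, and the sample data here are assumptions chosen for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_pairs(resolvers_by_domain, threshold=0.8):
    """Return (domain, domain, score) for pairs whose resolver sets overlap heavily."""
    domains = sorted(resolvers_by_domain)
    pairs = []
    for i, first in enumerate(domains):
        for second in domains[i + 1:]:
            score = jaccard(resolvers_by_domain[first], resolvers_by_domain[second])
            if score >= threshold:
                pairs.append((first, second, score))
    return pairs


# Made-up resolver sets; "r1" and so on stand in for resolver IP addresses.
resolvers_by_domain = {
    "AAA.EXAMPLE": {"r1", "r2", "r3", "r4", "r5"},
    "BBB.EXAMPLE": {"r1", "r2", "r3", "r4", "r5"},
    "CCC.EXAMPLE": {"r1", "r2", "r3", "r4", "r6"},
    "DDD.EXAMPLE": {"r7", "r8"},
}
pairs = similar_pairs(resolvers_by_domain)
```

The pairwise loop is quadratic in the number of domains, which is fine for a sketch but would need a clustering or hashing approach at the data volumes described earlier.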

Page 20: Data Junk VTS Prez - 20150925-3

About BDS

• Implemented using Hadoop streaming and the Mahout machine learning library

• Identifies similar NXD traffic patterns across many different name servers

• Runs once a day at 4:30pm EST

• Analyzes 1 day of NXD data for COM/NET/CC/TV and produces a “suspicious” domains list

• Collects the past 60 days of suspicious domains and publishes the unique collection to an HDFS folder in the VSCC

• Exposed to other products via a DAG data retrieval API
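
The daily run and the 60-day unique collection described above amount to a rolling union of daily lists. The sketch below shows that step in plain Python; the real service runs on Hadoop streaming and publishes to HDFS, so the function name, the in-memory dictionary, and the sample data are assumptions.

```python
from datetime import date, timedelta


def rolling_unique(daily_lists, as_of, window_days=60):
    """Union of the daily suspicious lists in the window ending on `as_of` (inclusive)."""
    start = as_of - timedelta(days=window_days - 1)
    merged = set()
    for day, domains in daily_lists.items():
        if start <= day <= as_of:
            merged.update(domains)
    return sorted(merged)


# Made-up daily output keyed by run date.
today = date(2015, 9, 28)
daily_lists = {
    today: {"AAA.EXAMPLE", "BBB.EXAMPLE"},
    today - timedelta(days=10): {"BBB.EXAMPLE", "CCC.EXAMPLE"},
    today - timedelta(days=75): {"OLD.EXAMPLE"},  # outside the 60-day window
}
published = rolling_unique(daily_lists, as_of=today)
```

A domain that stops appearing in the daily lists ages out of the published collection once its last sighting leaves the window.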

Page 21: Data Junk VTS Prez - 20150925-3

A Patented Technology

• https://www.google.com/patents/US8745737

Page 22: Data Junk VTS Prez - 20150925-3

DA Use Case of BDS

• Prevent promotion of these suspicious domains to our customers

• Provides two major benefits:
  • Customers benefit by not registering high-traffic domains that won’t see human traffic
  • The system benefits from having fewer domains to query

Page 23: Data Junk VTS Prez - 20150925-3

Monetize the Noise

• Remember that “One engineer’s noise is another engineer’s signal.”

• The effort to make use of the BDS data earlier this year started with a joke of an idea. It was something like: “If we know what domains the infected computers are looking for, we could register those domains, take over their botnets and use them for ourselves!”.

• (Probably not really, because they tend to use encrypted instructions to prevent this, but maybe.)

Page 24: Data Junk VTS Prez - 20150925-3

Monetize the Noise

This silly starting point led down a list of other options:

• Prevent the registration of these domains to clean up .COM and .NET

• Sell the data to a security company so they could pay to block traffic to these domains.

• Use the data itself to target the companies that most desperately need the blocking service.

Page 25: Data Junk VTS Prez - 20150925-3

How the connection happened

• Eventually, we found a security company interested. You may have heard of them… Verisign.

• Paul had a discussion about BDS with Jim Gould, who asked him to present it at a PESAB meeting. That led to an engineering-to-engineering discussion about its usefulness to the security side of the business.

• Once the engineering feasibility was in place, we had their product people talk to our product people, and the security use of the data was quickly approved.

• The takeaway: “Don’t let the organizational structure stand in the way of a good use for your data.”

Page 26: Data Junk VTS Prez - 20150925-3

Current Usage

• Data Analyzer uses these domains as a “black list”, filtering them out of our indexes so that they are never returned to our customers, which helps prevent potential registration

• Recursive DNS uses these to ensure that resolution requests for them are ignored to prevent potential “botnet” transmissions

• How else might we use the suspicious Botnet domain list?
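
The “black list” use in Data Analyzer comes down to a set-membership filter. The following sketch shows the idea with made-up domain names; the function name and the case-insensitive comparison are assumptions, not details from the slides.

```python
def filter_index(candidates, suspicious):
    """Drop any candidate domain that appears on the suspicious list (case-insensitive)."""
    blocked = {d.upper() for d in suspicious}
    return [d for d in candidates if d.upper() not in blocked]


# Hypothetical index contents and suspicious list.
candidates = ["GOOD-ONE.EXAMPLE", "BOT-ONE.EXAMPLE", "GOOD-TWO.EXAMPLE", "bot-two.example"]
suspicious = ["BOT-ONE.EXAMPLE", "BOT-TWO.EXAMPLE"]
visible = filter_index(candidates, suspicious)
```

The same membership test could back the Recursive DNS use, where a match would cause the resolution request to be ignored.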

Page 27: Data Junk VTS Prez - 20150925-3

Future work

• BDS data from DA could be used in several ways within the company to improve security products. Blocking traffic within the Recursive service is just one use.

• How about:
  • Selling the BDS data feed as a standalone or add-on security product.
  • Using traffic to BDS domains to prioritize Recursive sales leads.
  • Using BDS domains within a Recursive appliance to identify infected computers on a network. (Don’t just block, disinfect!)

Page 28: Data Junk VTS Prez - 20150925-3

Going Further

• Block the registration of these suspected domains in Core

• Use the registration attempts to identify criminals

• While our Botnet domain list only comprises COM/NET/CC/TV domains, we can use BDS for other TLDs

• Maybe a service we could provide to other registries

Page 29: Data Junk VTS Prez - 20150925-3

Botnet Domains And Request Information

Page 30: Data Junk VTS Prez - 20150925-3

Digging Deeper

Page 31: Data Junk VTS Prez - 20150925-3

Infection Detection

• With help from a Recursive Server appliance that captures the IPs of the original requests, we can track back from the Recursive server to the actual “Bots!”
  • If we can find these Bots then we can help to shut them down

• Another possibility might be to include the IP of the actual requesting machines inside the DNS messages using EDNS0 (Extension Mechanisms for DNS)
  • Allows for storing more information in DNS messages
  • Is currently used in about 10% of DNS messages to enable things like GEO location
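
Tracing back to the actual bots, as described above, could start from an appliance query log joined against the suspicious domain list. The sketch below assumes a log of (client IP, query name) pairs; that log format, the function name, and the `min_hits` threshold are hypothetical.

```python
from collections import defaultdict


def infected_clients(query_log, suspicious, min_hits=2):
    """Map client IP to the suspicious domains it queried, if at least min_hits distinct."""
    blocked = {d.upper() for d in suspicious}
    hits = defaultdict(set)
    for client_ip, qname in query_log:
        name = qname.rstrip(".").upper()  # normalize trailing dot and case
        if name in blocked:
            hits[client_ip].add(name)
    return {ip: sorted(names) for ip, names in hits.items() if len(names) >= min_hits}


# Made-up appliance log entries and suspicious list.
query_log = [
    ("10.0.0.5", "aaa.example."),
    ("10.0.0.5", "bbb.example."),
    ("10.0.0.9", "www.example.org."),
    ("10.0.0.7", "aaa.example."),
]
suspicious = ["AAA.EXAMPLE", "BBB.EXAMPLE"]
likely_bots = infected_clients(query_log, suspicious)
```

Requiring more than one distinct suspicious domain per client is a design choice to reduce the chance of flagging a machine for a single stray lookup.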

Page 32: Data Junk VTS Prez - 20150925-3

Information Gathering

• So far, since the suspected domains have all been NXDs, the domains for the intended C&C servers have not yet been registered

• We can use BDS to identify suspected domains based on YXD traffic data that point to real, live C&C servers

• While the BDS algorithm would definitely work on YXD data, we might have some challenges:
  • TTL based caching by resolvers
  • Frequent IP switching for C&C server domains to avoid detection

Page 33: Data Junk VTS Prez - 20150925-3

Taking Down C&C servers

• If we can identify the domains for suspected live C&C servers, perhaps we can:

  • Block DNS resolution on EDGE and Electra servers
  • Use “Core” to suspend the registration for these domains so they appear “out of zone”
  • Fine registrars
  • Go after domain owners

• While this would be a service to the entire Internet, there would most likely be legal implications for any of the above

Page 34: Data Junk VTS Prez - 20150925-3

Room For Improvement

• While a great service, BDS does not help with Botnet transmissions that are NOT DNS based

• Add support for IPv6 traffic
• Add monitoring to track the rate of false positives
• Use for analyzing traffic data for other TLDs
• Use for analyzing YXD traffic data
• Potentially look at additional data points in the AVRO summary feed currently used by the Real-Time cluster (RTC) and soon to be used as a replacement for the existing Traffic Monitor feed (end Q2 next year)
  • Will also include traffic data from our Electra sites
  • And the missing 10% of NXD traffic data!

Page 35: Data Junk VTS Prez - 20150925-3

Eureka!

• Excerpt is from: Wired Magazine, “Accept Defeat: The Neuroscience of Screwing Up” by Jonah Lehrer, 12.21.09.

• http://www.wired.com/2009/12/fail_accept_defeat/

Page 36: Data Junk VTS Prez - 20150925-3

Takeaways

• Your noise could be MY signal
• Reimagine and reuse, and find reasons to keep more of your data
• Find more value and throw away more effectively
• Don’t let the organization stand in the way

Page 37: Data Junk VTS Prez - 20150925-3

© 2015 VeriSign, Inc. All rights reserved. VERISIGN and other trademarks, service marks, and designs are registered or unregistered trademarks of VeriSign, Inc. and its subsidiaries in the United States and in foreign countries. All other trademarks are property of their respective owners.