MACHINE LEARNING FOR CYBER SECURITY AT
NETWORK SPEED & SCALE
AN INVITATION TO COLLABORATE ON THE USE OF ARTIFICIAL
INTELLIGENCE AGAINST ADAPTIVE ADVERSARIES
1ST PUBLIC EDITION: OCTOBER 11, 2011
Written by: Olin Hyde
[email protected] www.ai-one.com
COPYRIGHT 2011, AI-ONE INC.
ALL RIGHTS RESERVED
QUAD CHART SUMMARY
SUMMARY: MACHINES CAN LEARN LIKE HUMANS
AI-ONE WORKS LIKE AN "EMPTY BRAIN" – LEARNING MEANING BY DETECTING PATTERNS AND ASSOCIATIONS
New technology enables machines to learn like humans by understanding the inherent structure of data. Unlike other forms of artificial intelligence, ai-one's technology detects every relationship between every byte without any human intervention at the moment of data ingestion. The biologically inspired system is autonomic – spawning computational and data cells within a neural network as it responds to external sensors.
POTENTIAL APPLICATIONS
CYBER WARFARE
Defensive: Recognition of threat patterns
Offensive: Recognition of vulnerabilities
COMPLIANCE
Risk assessment & mitigation
Behavior monitoring & management
INSIDER THREAT MITIGATION
Conspiracy detection
Anomalous usage detection
RED TEAM ATTACK SIMULATIONS
Identification of API component weakness
Intelligent malware
Incremental insertion attacks
BLUE TEAM DEFENSE SIMULATIONS
Deep packet pattern recognition
Softkill counter measures
OVERVIEW
ai-one offers a software development kit that enables programmers to build machine learning into applications. This tool generates an associative network (called a lightweight ontology) that reveals every relationship between each byte in the system. Like a human brain, the technology learns the contextual meaning of data by detecting patterns and relationships – including subtle signals within complex data. The technology has broad applicability to solve problems that traditionally relied upon human cognition; such as finding high-order term co-occurrences, isolating anomalous patterns and identifying latent relationships. A modified version of ai-one's core technology has the potential to transform cyber warfare by enabling highly adaptive attacks and defenses. Moreover, it has the potential to provide situational awareness of every packet's content, destination and intended purpose in near real-time across exabyte-scale networks.
BENEFITS
AUTONOMIC
Learns without any human intervention
Finds the unexpected
OBJECTIVE (UNBIASED)
Detects intrinsic & hidden patterns
No cognitive bias from humans
FAST DEPLOYMENT
Works with existing technologies
SDK enables plug-n-play architecture
SCALABLE/FLEXIBLE
Many deployment options
Product roadmap to 2EB/instance capacity
PROVEN
In use by Swiss BKA, SwissPort, others
COTS version available for evaluation now
CONTENTS
Foreword
About ai-one inc.
Abstract
The Current State of Cyber Security is Fundamentally Flawed
  Sources & Types of Cyber Attacks
  Exploiting API Weaknesses (Application Hijacking)
  Machine Learning Measures and Counter-Measures to API Exploits
  Exploiting Impersonations
  Machine Learning Measures and Counter-Measures to Impersonation
Threat Evolution: Exploiting Complexity
Lightweight Ontologies (LWO): A New Computational Approach
  Summary of the Benefits of ai-one's Technology
ai-one Technology Roadmap
  Current Commercial-Off-The-Shelf
  64-Bit Multi-thread COTS
  64-Bit Chipsets
Next Steps – Proofs of Concept
  Immediate COTS-Based Approach
  Intermediate COTS Approach
  Matrix Chipset Approach
Appendix – A Worst Case Scenario: MHOTCO Attacks
  Unnoticeable Attrition
  The Game Changer: Machine Learning
FOREWORD
How will artificial intelligence impact global cyber security?
Or put another way: How to attack and defend cyber assets with a new generation of machine
learning technologies?
This paper provides actionable technical insights for business, government and military
executives seeking technologies that will provide a competitive advantage in cyber security. We
believe the requirements for both military and civilian cyber defenses are similar enough to use
published (public) military specifications as a common denominator for protecting cyber assets.
Based on more than 50 people-years of research and development, we believe that machine
learning is transformational to the “cyber battlespace” – where computers and/or networks are
intentionally disrupted to cause harm or further criminal, political, ideological, social, or similar
objectives.
Our goal is to inspire innovation: ai-one does not provide a solution or services. We only provide
core machine learning technologies. We believe that a complete artificial intelligence solution to
combat cyber security threats requires combining multiple tiers of technology – possibly
including natural language processing (NLP), machine learning, signal processing, Bayesian
decisioning tools, packet profile ontologies, etc.
Cyber warfare spans both military and civilian concerns. The US Department of Defense has
defined cyber warfare as the “Fifth Battlespace” (after land, sea, air and submarine domains).
As a result, all branches of the US military now have cyber-specific commands.
Similarly, the civilian world is justifiably obsessed with the protection of cyber assets. Groups
such as Anonymous and WikiLeaks have wreaked havoc on financial institutions, markets and
governments by disrupting mission critical networks and disseminating proprietary information.
Cyber security is essential for civil freedoms, economic opportunities and national security.
ai-one is one of a handful of firms that provide off-the-shelf machine learning and artificial
intelligence application program interfaces (API). Our breakthrough is creating a system that
enables any programmer to build machine learning into almost any program. The core value is
simple:
We detect patterns.
If you know the patterns… you know the relationships between data elements.
If you know the relationships… then you know the context of any element.
If you know context… you understand meaning.
ai-one's APIs enable machines to learn. Any data. Any format. Faster and more accurately than
a human. The implications of this technology span the entirety of computing.
First, let’s define the terms artificial intelligence and machine learning as they are used within
this paper – as there are many interpretations of both.
• Artificial intelligence (AI) is the simulation of human intelligence in machines.
Its critical feature is the ability to make decisions. Thus, there is a vast range of
capabilities within artificial intelligence. A simple manifestation would be a search
engine – such as Google. More sophisticated AI systems would include agents
that make autonomous decisions – such as Apple’s SIRI.
• Machine learning (ML) is a branch of AI that is specifically concerned with
enabling machines to understand information, intent and context. Its critical
feature is to derive the meaning of data by evaluating data from sensors and/or
data storage devices. Examples include: latent Dirichlet allocation (LDA) and ai-
one’s adaptive holosemantic dataspace (HSDS). Both are self-organizing maps
(SOMs) that detect patterns. However, there are significant differences: LDA's reliance on
Bayesian statistics makes it computationally less efficient than HSDS, which is a new form of
neural network that is transparent, autonomous and at least as accurate as Bayesian methods.
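For readers who want a concrete reference point, the following is a minimal sketch of the LDA baseline mentioned above, using the scikit-learn library. It is illustrative only; the sample documents and topic count are arbitrary assumptions, and it is not ai-one's HSDS technology.

```python
# Minimal LDA topic-modeling sketch (the Bayesian baseline mentioned above).
# Illustrative only: it shows the kind of recalculation-heavy, statistics-driven
# approach the paper contrasts against, not ai-one's HSDS.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "firewall blocked suspicious packet payload",
    "intrusion detection flagged anomalous traffic",
    "video frame buffer overflow exploit",
    "malware signature matched known virus",
]

# Bag-of-words term counts (LDA needs a fixed, precomputed vocabulary).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Fit a 2-topic model; adding new documents later means refitting the model.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)

for doc, dist in zip(docs, doc_topics):
    print(f"{doc!r} -> topic distribution {dist.round(2)}")
```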
ABOUT AI-ONE INC.
ai-one inc. is a Delaware C-corporation with headquarters in La Jolla, California and offices in
Zurich Switzerland (ai-one AG) and Berlin Germany (ai-one GmbH). The company was
originally named Semantic Systems when it was founded in Zurich Switzerland in 2003 by
Walter Diggelmann, Manfred Hoffleisch and Thomas Diggelmann. The company
commercializes the mathematical discoveries by Manfred Hoffleisch and the invention of the
Hoffleisch Neuronal Network (HNN). This technology has now evolved dramatically over the
past eight years to a point where it is now commercially available as a software development kit
(SDK) and application programming interface (API).
Our mission is to embed “Biologically Inspired Intelligence” in every computing device and
application, empowering developers to help people to use the global information explosion to
improve the quality of human life. More information on ai-one can be found at www.ai-one.com.
“America’s prosperity in the 21st century will depend on cyber security.”
PRESIDENT BARACK OBAMA, MAY 29, 2009
ABSTRACT:
Machines can learn like humans by understanding the inherent complexity of
patterns and associations in data.
The goal of this paper is to inspire new ideas and invite collaboration to innovate new
ways to protect large-scale cyber assets. Our central questions are:
1. How will real-time, deep pattern recognition change cyber warfare?
2. How will machine learning of byte-patterns impact the evolution of cyber
attacks?
3. How can machine learning systems protect large-scale networks?
4. Can machine learning reduce the human capital and expenditures required to
defend large scale networks?
Cyber defenses of the US military, government and critical civilian infrastructure are
inadequate. The US Department of Homeland Security’s “Cyberstorm III” drill in September 2010
demonstrated that private industry and government resources are unable to protect
critical infrastructure from destruction from a well-orchestrated cyber attack.1 “American
cyber defense has fallen far behind the technological capabilities of our adversaries
[such]…that the number of cyber attacks is now so large and their sophistication so
great that many organizations are having trouble determining which new threats and
vulnerabilities pose the greatest risk.”2
This paper outlines a framework to improve US cyber defenses in a matter of months at
very minimal cost with virtually no technological risk.
A new form of machine learning discovered by ai-one inc. has the potential to transform
cyber warfare. This technology was made
commercially available in June 2011. It is in
use by Swiss law enforcement, a major
European mobile network and under
evaluation by more than 40 organizations
worldwide.3
1 US GAO report, “CYBERSECURITY: Continued Attention Needed to Protect Our Nation’s Critical Infrastructure.” Statement of Gregory C. Wilshusen, Director, Information Security Issues, July 26, 2011.
2 The Lipman Report, “Threats to the Information Highway: Cyber Warfare, Cyber Terrorism and Cyber Crime.” October 15, 2010, p.1.
3 Bundeskriminalamt (German equivalent to the US FBI) built a shoe print recognition system that is in
use at three major Swiss CSI labs. ai-one is restricted from advertising or using the name of customers as part of licensing and non-disclosure agreements.
“All warfare is based on deception.”
THE ART OF WAR BY SUN TZU, 600 BC
“All war presupposes human
weakness and seeks to exploit it.”
CARL VON CLAUSEWITZ IN VOM KRIEGE
Large scale government and corporate networks are irresistible targets for cyber attacks
– from hackers, hostile government agencies and malicious NGOs. These networks are
fantastically complex. Each user, application, data source, sensor and control
mechanism add value. Yet each of these components increases the threat surface for
cyber attacks. Defending a network by simplifying network complexity is not an option.
Taking functionality away from a network would be self-defeating. Moreover, the best
networks use a blend of custom, commercial and open-source technologies – each
presenting a new opportunity for attack. Thus, cyber security depends on understanding
complexity – not simplifying it.
Current technologies using computer programming – such as anti-malware software,
firewalls and network appliances (such as IDPS) – are unable to detect the most
catastrophic forms of zero-day attacks:
incremental delivery of viruses, application
hijacking, impersonation, insider
conspiracies and cloaked DDoS.4
Why? Computer programming is reductionist and prone to cognitive biases. First,
programmers and analysts simplify threat profiles by categorizing them so they can be
processed mathematically and logically using structured data. For example, they look
for viruses and potential variations using fuzzy matching techniques. Simplifying the
complexity of suspicious byte-patterns into mathematical models provides ample
opportunities for attackers to “hide in the noise.” Secondly, programmers and analysts
are human. They make mistakes. Moreover, they tend to repeat mistakes – so if you
find one security hole, you can search for patterns that will lead you to others.
Cyber attackers know these weaknesses and exploit them by hiding within the noise of
network complexity and discovering patterns of weaknesses. Deception and exploitation
of predictable defensive patterns are the pillars of successful offensive cyber attacks.
Thus, current defenses are destined to fail against the next generation of zero-day
cyber attacks (such as incremental viral insertion, MHOTCO and genetic algorithm
intrusions).5
New artificial intelligence technology that
learns through detecting data heterarchies
enables unprecedented levels of cyber
security and countermeasures. Knowing the
4 Zero-day attacks refer to threats to networks that exploit vulnerabilities that are unknown to
administrators and/or cyber security applications and appliances. Zero-day exploits include detection of security holes that are used or shared by attackers before the network detects the vulnerability.
5 See Appendix for “Worst Case Scenario” that describes a possible MHOTCO attack.
structure of data is the key to understanding its meaning. Machine learning using
heterarchical pattern recognition reveals the relationships and associations between all
bytes across an entire system (or network) – including overlaps, multiplicities, mixed
ascendancies, and divergent-but-coexistent
patterns. This approach is similar to how
humans learn: We associate stimuli with
patterns. For example, a child learns that the
sound “dog” refers to the 65-pound, four-legged
creature with soft fuzzy white hair. A computer
would need to be programmed with a series of
commands to know that dog refers to a specific
creature – and is thus unable to recognize
similarities that are not part of the
predetermined definition of “dog” – such as a
black 5-pound miniature poodle.
In June 2011, ai-one released a new machine
learning application programming interface
(API) that is a radical departure from traditional
forms of artificial intelligence. The technology is
a neural network that detects heterarchical
byte-patterns and creates a dynamic
descriptive associative network – called a lightweight ontology. This technology
determines the meaning of data by evaluating the relationships between each byte,
cluster of bytes, words, documents, and so on. Unlike other forms of artificial
intelligence, ai-one’s approach:
• Detects how each byte relates to another – including multiple paths,
asynchronous relationships and multiple high-order co-occurrences.
• Automatically generates an associative network (lightweight ontology) revealing
all patterns and relationships – detecting anomalies within any portion of the data
set.
• Enables machine learning without human intervention.
• Unbiased. Does not rely upon external ontologies or standards.
• Learns associations upon data ingestion – so it is much faster than techniques
that require recalculations, such as COStf-idf.6, 7
6 COStf-idf is an approach to determine the relevance of a term in any given corpus.
7 For a more extensive comparison see: Reimer, U., Maier, E., Streit, S., Diggelmann, T., Hoffleisch, M.,
Learning a Lightweight Ontology for Semantic Retrieval in Patient-Centered Information Systems. In International Journal of Knowledge Management, 7(3), 11-26, (July-September 2011)
A REPRESENTATION OF HETERARCHY DATA STRUCTURE
“UNDERSTANDING AI-ONE REQUIRES
AN OPEN MIND – ONE THAT IGNORES
WHAT HAS BEEN AND EMBRACES WHAT
IS POSSIBLE.”
ALLAN TERRY, PHD, FORMER DARPA AI
SCIENTIST (PRIME CONTRACTOR)
• Non-redundant. Each byte pattern is stored only once. This has the effect of
compressing data while increasing pattern recognition speed.
• Spawning cells. The underlying cell structure in the neural network is autonomic;
generating cells as needed when they are stimulated by sensors (during data input).
• Neural cells can be theoretically shared across other instances of the network.8
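To make the associative-network idea in the bullets above more concrete, here is a minimal sketch in plain Python that counts byte co-occurrences within a small window and builds a simple association graph at ingestion time. The window size and sample data are invented for illustration; this is a generic construction, not ai-one's lightweight ontology or HSDS.

```python
# Minimal byte-level association sketch: count how often byte values appear
# near each other, building a simple co-occurrence graph at ingestion time.
# Illustrative only; not ai-one's lightweight ontology or HSDS.
from collections import defaultdict

WINDOW = 4  # hypothetical association window, in bytes

def ingest(assoc, data: bytes):
    """Update the association graph with one chunk of raw bytes."""
    for i, b in enumerate(data):
        for other in data[i + 1 : i + 1 + WINDOW]:
            assoc[b][other] += 1
            assoc[other][b] += 1  # keep the graph bidirectional

assoc = defaultdict(lambda: defaultdict(int))
ingest(assoc, b"GET /index.html HTTP/1.1")
ingest(assoc, b"GET /login.php HTTP/1.1")

# Bytes most strongly associated with 'G' (0x47) after ingestion.
top = sorted(assoc[ord("G")].items(), key=lambda kv: -kv[1])[:5]
print([(chr(b), n) for b, n in top])
```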
This technology has the potential to enable
cyber security systems to detect, evaluate and
counter threats by assessing anomalies within
packets, byte-patterns, data traffic and user
behaviors across the entire network. When
placed into a matrix chipset, this technology
can theoretically evaluate every byte across
the entire network in real time with exabytes (10¹⁸ bytes) of capacity using a combination of
sliding windows, high performance computing (HPC) and hardware accelerators.
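A minimal sketch of the sliding-window idea mentioned above: score each window of a packet stream against byte-pair frequencies learned from baseline traffic and flag windows that look unfamiliar. The window size, threshold and baseline data are arbitrary assumptions, and this is a toy stand-in for the hardware-accelerated approach the paper describes.

```python
# Sliding-window anomaly scoring sketch: learn byte-pair frequencies from
# baseline traffic, then flag windows whose pairs were rarely (or never) seen.
# Toy illustration of the sliding-window concept only.
from collections import Counter

WINDOW = 8          # hypothetical window size in bytes
THRESHOLD = 0.5     # hypothetical fraction of unseen byte pairs that triggers a flag

def pairs(data: bytes):
    return zip(data, data[1:])

baseline = Counter(pairs(b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n" * 10))

def anomaly_score(window: bytes) -> float:
    """Fraction of adjacent byte pairs in the window never seen in baseline traffic."""
    ps = list(pairs(window))
    if not ps:
        return 0.0
    unseen = sum(1 for p in ps if baseline[p] == 0)
    return unseen / len(ps)

stream = b"GET /index.html HTTP/1.1\r\n\x90\x90\x90\x90\xcc\xcc\xcc\xcc"
for i in range(0, len(stream) - WINDOW + 1):
    window = stream[i : i + WINDOW]
    score = anomaly_score(window)
    if score > THRESHOLD:
        print(f"offset {i}: suspicious window {window!r} (score {score:.2f})")
```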
As such, we will present how this technology has the potential to revolutionize cyber
security by supporting each of the “Five Pillars” framework defined by the US Military for
cyberwarfare:9, 10
Cyberwarfare Pillar / Potential Roles for Machine Learning

1. Cyber domain is similar to other elements in battlespace.
• Transparency to command & control of emerging threats
• Unbiased detection & analysis of threats by detecting anomalies
• Empower human analysts with actionable intelligence

2. Proactive defenses.
• Constant real-time monitoring of every packet across the network
• Near instant recognition of anomalies within packet payload or communication frames

3. Protection of critical infrastructure.
• Enhance intrusion detection and protection systems (IDPS) with real-time libraries & heuristic approximations of potential threats

4. Collective defense.
• Early detection & instant response across the entire network
• Enable counter-counter-measures, trapping, etc.

5. Maintain advantage of technological change.
• Early adoption of technology with accelerating rate of returns (1st mover advantage).
8 ai-one internal research project scheduled for mid-2012.
9 http://www.defense.gov/news/newsarticle.aspx?id=60869
10 For purposes of this paper, the requirements of large multi-national corporations (such as Goldman-
Sachs, Google, Exxon, etc.) are substantially similar to those of government agencies (such as DoD, DHS, NSA, etc.).
The next generation of cyber security attacks will be deadly in their subtlety: They can
remain undetected until it is too late to prevent catastrophic loss of data, connectivity
and/or malicious manipulation of sensitive information. Such attacks can collapse key
infrastructure systems such as power grids, communications networks, financial
systems and national security assets.
The advantages of machine learning as a first line of defense against zero-day attacks
include:
• Force multiplication – enabling fewer human analysts to identify, thwart and
counter far greater numbers of attacks than programmatic approaches.
• Evolutionary advantage – enabling cyber defenses to preempt threat
adaptations by detecting any change within byte patterns.
• Battlespace awareness – providing network security analysts with situational
awareness by identifying and classifying byte pattern mutations.
• Proactive defenses – Constant monitoring of the entire threat surface to detect
any patterns of vulnerability before they can be exploited by the enemy.
THE CURRENT STATE OF CYBER SECURITY IS FUNDAMENTALLY FLAWED
Our research indicates that cyber security is far worse than is commonly reported in
news outlets. We estimate there is an extreme
shortage of human capital with the skills necessary to
thwart attacks from rapidly evolving, highly adaptive
adversaries.11, 12 Research for this paper includes
publicly available sources of information found on the
Internet, interviews with network and software security
experts and experts in artificial intelligence. In
particular, we speculate on how machine learning
might impact the security of large-scale (enterprise)
networks from both offensive and defensive perspectives. In particular, we seek to find
ways that machine learning might create and thwart zero-day attacks in networks
deploying the most current security technologies, such as neural network enabled
intrusion detection and protection system (IDPS), heuristic and fuzzy matching anti-
malware software systems, distributed firewalls, and packet encryption technologies.
Furthermore, we evaluate ways that adaptive adversaries might bypass application level
security measures such as:
• address space layout randomization (ASLR)
• heap hardening
• data execution prevention (DEP)
We conclude that machine learning provides first-mover advantages to both attackers
and defenders. However, we find that the nature of machine learning’s ability to
understand complexity provides the greater advantage to network defenses when
deployed as part of a multi-layer defensive framework.
As networks grow in value they become exponentially more at risk to cyber attacks.
Metcalfe’s Law states that the value of any network is proportional to the number of
users.13 From a practical standpoint, usability is proportional to functionality. That is, the
use of a network is proportional to its functionality: The more it can do, the more people
will use it. From a cyber security standpoint, each additional function (or application)
running on a network increases the threat surface. Vulnerabilities grow super-linearly
11 The shortage in cyber warriors in the US Government is widely reported. For example, see http://www.npr.org/templates/story/story.php?storyId=128574055
12 Threats to the Information Highway: Cyber Warfare, Cyber Terrorism and Cyber Crime
13 V ∝ n², where value (V) is proportional to the square of the number of connected users of a network (n).
TWITTER CALL FOR DDOS ATTACK
because attacks can happen at both the application surface (through an API) and in the
connections between applications (through malicious packets).14
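As a rough illustration of the scaling in footnotes 13 and 14 (V ∝ n² for value and T ∝ n²p² for threat surface), the toy calculation below shows how doubling the number of exposed APIs quadruples the relative threat surface while leaving network value unchanged. The constants and counts are arbitrary assumptions.

```python
# Toy illustration of the value/vulnerability scaling in footnotes 13 and 14:
# value V ~ n^2, threat surface T ~ n^2 * p^2 (n users, p APIs/applications).
def value(n: int) -> int:
    return n ** 2

def threat_surface(n: int, p: int) -> int:
    return n ** 2 * p ** 2

n = 10_000              # hypothetical number of network users
for p in (10, 20, 40):  # hypothetical number of exposed APIs
    print(f"p={p:>3}: V={value(n):.2e}  T={threat_surface(n, p):.2e}")
```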
Coordinated cyber attacks using more than one method are the most effective means to
find zero-day vulnerabilities. The December 2009 attack on Google reportedly relied
upon exploiting previously discovered pigeonholes to extract information while human
analysts were concurrently distracted by what appeared to be an unrelated attack.
SOURCES & TYPES OF CYBER ATTACKS
Threats: Internal (employees, contractors, etc.) & External (hostile nations, terrorist organizations, criminals, etc.)
Attack Types:
• Malicious code (viruses, Trojans, etc.)
• Incremental payloads (MHOTCO, API hijacking, etc.)
• Brute Force (DDoS, hash collisions, etc.)
• Impersonation (ID hack, etc.)
• Camouflage (cloaking, masking, etc.)
• Conspiracy (distributed leaks, espionage, etc.)
Cyber attacks are usually derivatives of previously successful tactics.15 Attackers know
that software programmers are human – they make mistakes. Moreover, they tend to
repeat the same mistakes – making it relatively easy to exploit vulnerabilities once they
are detected.16 Thus, if a hacker finds that a particular part of a network has been
breached with a particular byte-pattern (such as a birthday attack) they will often create
numerous variations of this pattern to be used in the future to secure an entry into the
network (such as a pigeonhole).
Let’s evaluate a few of these types of attacks to compare and contrast computer
programming and machine learning approaches to exploit and defend cyber
vulnerabilities.
14 Threat vulnerability is a corollary to Metcalfe’s Law whereby each additional network connection provides an additional point of security exposure. T ∝ n²p², where vulnerability (T) is proportional to the square of the number of connected users of a network (n) times the square of the number of APIs (p).
15 Interview with former anonymous hacker.
16 Yamaguchi, Fabian. “Automated Extraction of API Usage Patterns from Source Code for Vulnerability Identification.” Diploma Thesis, TU Berlin, January 2011.
“Should we fear hackers? Intent is
at the heart of this question.”
KEVIN MITNICK, HACKER, AFTER HIS RELEASE
FROM FEDERAL PRISON 2000.
EXPLOITING API WEAKNESSES (APPLICATION HIJACKING)
Detecting flaws in application program interfaces (APIs) is a rapidly evolving form of
cyber attack where vulnerabilities in the underlying application are exploited. For
example, an attacker may use video files to embed code that will cause a video player
to erase files. This approach often involves incrementally inserting malicious code,
frame-by-frame, to corrupt the file buffer and/or hijack the application. This incremental
approach depends upon finding flaws within the code base. This is easily done if the
attacker has access to the application outside the network – such as a commercial or
open-source copy of the software.
PROGRAMMING MEASURES AND COUNTER-MEASURES TO API EXPLOITS
Traditional approaches to thwart derivative attacks to an API are relatively
straightforward and human resource intensive: First, the attack is analyzed to identify
markers (such as identifiers within packet payload). Next, the markers are categorized,
classified and recorded – usually into a master library (e.g., McAfee Global Threat
Intelligence). Finally, anti-malware software (such as McAfee) and IDPS network
appliances (such as ForeScout CounterACT) scan packets to detect threats from known
sources (malware, IPs, DNS, etc.). Threats that are close derivatives of known threats
are easily thwarted using look up tables, algorithms and heuristics while concurrently
detecting and isolating anomalous network behavior for further human review.
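The following is a minimal sketch of the signature-lookup step described above. It is a toy stand-in for commercial anti-malware and IDPS scanning (the actual interfaces of products such as McAfee or ForeScout CounterACT are not shown); the signatures below are invented for illustration.

```python
# Minimal signature-scanning sketch: check packet payloads against a library
# of known-bad byte sequences. A toy stand-in for commercial anti-malware and
# IDPS scanning; the signatures below are made up for illustration.
KNOWN_BAD_SIGNATURES = [
    b"\x90\x90\x90\x90\xeb\xfe",   # hypothetical NOP-sled fragment
    b"cmd.exe /c del",             # hypothetical malicious command string
]

def scan(payload: bytes):
    """Return the signatures found in a payload (empty list if clean)."""
    return [sig for sig in KNOWN_BAD_SIGNATURES if sig in payload]

packets = [
    b"GET /index.html HTTP/1.1",
    b"POST /upload \x90\x90\x90\x90\xeb\xfe shellcode",
]
for p in packets:
    hits = scan(p)
    print("ALERT" if hits else "ok", p[:30])
```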
PROBLEMS WITH THE COMPUTER PROGRAMMING APPROACH
There are many problems with defenses that know only what they are programmed to
know. First, it is almost impossible for a person to predict and program a computer to
handle every possible attack. Even if you could, it is practically impossible to scale
human resources to meet the demands of addressing each potential threat as network
complexities grow exponentially. A single adaptive adversary can keep many security
analysts very busy. Next, cyber threats are far easier to produce than they are to detect
– it takes 10 times more effort to isolate and develop counter measures to a virus than it
does to create it. 17 Finally, the sheer scale of external intelligence and human
resources far outstrips the defensive
resources available within the firewall. For
example, the US Army’s estimated 21,000
security analysts must counter the collective
learning capacity and computational
17 Estimate based on evaluation of virus source codes available at
http://www.tropicalpcsolutions.com/html/security/example-malicous-vb-javascript-code.html. Also see: Stepan, Adrian. “Defeating Polymorphism: Beyond Emulation” Microsoft Corporation, 2005.
resources of all hackers seeking to disrupt ARCYBER – potentially facing a 100:1
disadvantage worldwide.18
Moreover, new approaches to malware involve incremental loading of fragments of
malware into a network where they are later assembled and executed by a native
application. Often the malicious code fragments are placed over many disparate
channels and inputs thereby disguising themselves as noise or erroneous packets.19
MACHINE LEARNING MEASURES AND COUNTER-MEASURES TO API EXPLOITS
Machine learning is an ideal technology for both attacking and defending against API
source code vulnerabilities. Knowing that programmers tend to repeat mistakes, an
attacker can find similarities across the code base to identify vulnerabilities. A
sophisticated attacker might use genetic algorithms and/or statistical techniques (such
as naïve Bayes) to find new vulnerabilities that are similar to others that have been
found previously. Machine learning provides defenders with an advantage over
attackers because it detects these flaws before the attack. This enables the defender to
entrap, deceive or use other counter-measures against the attacker.
Machine learning provides a first-mover advantage to both defender and attacker – but
the advantage is far stronger for the defender because it can detect any anomaly within
the byte-pattern of the network – even after malicious code has bypassed cyber
defenses, as in a sleeper attack.20 Thus, the attacker would need to camouflage byte-
patterns in addition to finding and exploiting vulnerabilities – thus requiring the attacker
to add tremendous complexity to his tactics to bypass defenses. Since machine learning
becomes more intelligent with use, the defender’s systems will harden with each attack
– becoming exponentially more secure over time.
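To make the naïve Bayes idea above concrete, here is a minimal sketch using scikit-learn that scores code snippets by similarity to previously found vulnerable patterns. The snippets, features and labels are invented for illustration; this is not a production vulnerability scanner and not ai-one's method.

```python
# Minimal naive-Bayes sketch: learn token patterns from snippets previously
# labeled vulnerable/safe, then score new snippets by similarity.
# The snippets and labels are invented for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

snippets = [
    "strcpy(buf, user_input)",                      # known-vulnerable pattern
    'sprintf(cmd, "%s", request)',                  # known-vulnerable pattern
    "strncpy(buf, user_input, sizeof(buf) - 1)",    # safe counterpart
    'snprintf(cmd, sizeof(cmd), "%s", request)',    # safe counterpart
]
labels = [1, 1, 0, 0]  # 1 = vulnerable, 0 = safe

vec = CountVectorizer(token_pattern=r"[A-Za-z_]+")
X = vec.fit_transform(snippets)
clf = MultinomialNB().fit(X, labels)

new = ["strcpy(dest, packet_payload)"]
prob = clf.predict_proba(vec.transform(new))[0][1]
print(f"probability the new snippet resembles known vulnerabilities: {prob:.2f}")
```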
EXPLOITING IMPERSONATIONS
Counterfeiting network authentication to gain illicit access to network assets is one of the oldest tricks in the hacker’s book. This can be done as easily as leaving a thumb drive infected with malware in a parking lot for a curious insider to insert into a network computer. It can also involve sophisticated social engineering to crack passwords, find use patterns and points of entry for a hacker to impersonate a legitimate user.21
18 Force size estimates from http://www.arcyber.army.mil/org-arcyber.html
19 Examples of this technique were discussed at the BlackHat Security Conference in early August 2011.
20 For a discussion on sleeper attacks see: Borg, Scott. “Securing the Supply Chain for Electronic
Equipment: A Strategy and Framework.” The Internet Security Alliance report to the White House. (available on http://www.whitehouse.gov/files/documents/cyber/) and also The US Cyber Consequences Unit (http://www.usccu.us/)
IDENTITY THEFT (PHOTO CREDIT: NEW YORK TIMES)
PROGRAMMING MEASURES AND COUNTER-MEASURES TO IMPERSONATIONS
Traditional approaches to impersonation attacks depend upon user authentication and controlling access to network assets using predetermined permissions. Once an attacker is inside the network with a false identity, he can run freely so long as he does not trigger any alarms by violating his permissions. This defense is entirely programmatic as it assumes that if the attacker gets past the firewall he will behave differently than a legitimate user. This is irrelevant to defense since the attacker can use his presence to learn about network assets to attack them in different ways. For example, the attacker can identify APIs, network appliances and determine other security protocols to identify further vulnerabilities that might be compromised with an external attack.

PROBLEMS WITH THE COMPUTER PROGRAMMING APPROACH TO PREVENT IMPERSONATIONS
Rules-based permissions are only as good as the rules can model human behavior. Attackers familiar with these rules and the standard practices of network security easily stay within acceptable boundaries of use.

21 Interview with former forensic network security agent at major investment bank.

AI-ONE’S TECHNOLOGY WORKS LIKE AN “EMPTY BRAIN” – LEARNING FROM ASSOCIATIONS.

MACHINE LEARNING MEASURES AND COUNTER-MEASURES TO IMPERSONATION
In the case of insider threats, machine learning provides the defender more advantages than the attacker. Although attackers can use machine learning of byte-patterns to “hack” an identity, they are limited to behaving exactly as that identity would – to the extent that they must know how that person has behaved in the past and how the system will perceive their every movement. The defender’s advantage is that machine learning creates an “entology” – an ontology of the entity – for every authenticated user. This is a heterarchical representation of all past behavior at the byte- or packet-level. This enables network security to evaluate use patterns to find anomalies that would be difficult (if not impossible) to predict using a set of computer programming commands. Machine learning does not depend on rules – rather just observation to find associations and patterns. This can be done at every point within the network – routers, network appliances, APIs, databases, access points, etc.
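A minimal sketch of the behavior-profiling idea described above (not ai-one's actual "entology"): build a frequency profile of each authenticated user's past actions and flag actions that fall outside what has been observed for that identity. The actions and threshold are invented for illustration.

```python
# Minimal behavioral-profile sketch: learn what each identity normally does,
# then flag actions that identity has rarely or never performed before.
# A toy illustration of per-user anomaly detection, not ai-one's entology.
from collections import Counter, defaultdict

profiles = defaultdict(Counter)  # user -> Counter of observed actions

def observe(user: str, action: str):
    profiles[user][action] += 1

def is_anomalous(user: str, action: str, min_seen: int = 1) -> bool:
    """Flag actions this identity has performed fewer than min_seen times."""
    return profiles[user][action] < min_seen

# Baseline behavior for a legitimate analyst account.
for _ in range(50):
    observe("analyst7", "read:incident_reports")
    observe("analyst7", "query:threat_feed")

# An impersonator using the same credentials tries something new.
for action in ("read:incident_reports", "dump:credential_store"):
    print(action, "-> ANOMALY" if is_anomalous("analyst7", action) else "-> normal")
```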
THREAT EVOLUTION: EXPLOITING COMPLEXITY
Detecting cyber threats is much like finding signals within noise. The greater the noise,
the more difficult it is to detect faint signals.
Traditional computer programming technologies
require data to be structured into a known format
before it can be transformed using mathematical
and logical operations. Machines are only as
smart as what they are told. A programmer must
command every step of the process for the
machine to complete a task – such as recognize
a pattern.
For more than 50 years, the field of artificial intelligence evolved techniques to enable
machines to learn. A complete discussion of this vast body of work is beyond the scope
of this paper. However, it is important to note that developments in neural networks
enable machines to learn complex patterns without human supervision – such as
Hopfield and Kohonen neural networks. In these technologies, it is necessary to provide
structure and parameters for what will be learned. For example, traditional neural
networks rely upon training sets and/or neighborhood functions. Even then, these
approaches run the risks of “over-learning” and learning the wrong things.22, 23 Learning
is biased because the networks depend on human assumptions.24
LIGHTWEIGHT ONTOLOGIES (LWO): A NEW COMPUTATIONAL
APPROACH
An easy way to understand ai-one’s technology is to think of it as an “empty brain” (like an
infant) that learns the meaning of data through associations. Similar to a small child that
learns language by associating individual sounds with physical objects, ai-one’s neural
network learns the meaning of bytes by associating them with other bytes. The
network builds an associative network that defines the relationship of every byte within
the entire corpus of data. This relationship can be symmetric, asymmetric or
heterogeneous. This corpus can be as small as a single character of a word or as large
22 http://www.ncbi.nlm.nih.gov/pubmed/17650068
23 Reimer, U., op. cit.
24 Ibid.
as the entire Internet. The limitation to how much data the system can process is a function of hardware and system architecture.

BRAINVIEW VISUALIZATION TOOL SHOWS THE LWO DATA RELATIONSHIPS INSIDE AI-ONE
ai-one’s technology is radically different
from other forms of neural nets or artificial
intelligence. First, ai-one’s nets do not have
any neural structures pre-defined by the
user. Rather, they resemble neurological
structures where connections between the
nodes are autonomic – forming without
conscious control. These connections form
an n-dimensional graph that describes all
relationships between every byte that has
been fed into the system. The system
learns at the time of data ingestion –
automatically adjusting relationships to
account for new data.
Second, the system creates a lightweight
ontology (LWO) that automatically classifies each byte into a hierarchy by topic –
starting with the most general then progressively moving to the most specific. An
unlimited number of hierarchies can form in any direction – thereby forming a
heterarchy. Hierarchical classifications are arranged by hyponymy.25, 26 ai-one’s
lightweight ontology differs from a full-fledged ontology because it detects only the
inherent semantic meaning of each byte as it relates to another – there is no human
bias or over-learning. Rather, the LWO enables the machine to learn high-order
relationships between any data element. For example, it can detect the conceptual
meaning of words and isolate when a word is used in an unexpected or unique way.
Another feature of ai-one’s technology is that it provides humans with the option to
teach the system thereby giving the machine an intentional point-of-view. Queries can
be introduced to the LWO that dynamically adjust the topography of the data to
influence the importance of data elements to specific relationships. This enables it to
25 Also known as hypernym-hyponym relationships. A hyponym is a word (or data element) that is
included in the meaning of another word (or data element) with broader classifications. For example, ‘scarlet’ is the hyponym to the hypernym ‘red.’ 26
Hyponymy is usually associated with computational linguistics and natural language processing. ai-one applies these classification and extraction techniques to include other forms of data. For a discussion on hyponymy see: Navigli, Roberto and Velardi, Palo “Learning Word-Class Lattices for Definition and Hypernym Extraction” Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1318–1327, Uppsala, Sweden, 11-16 July 2010. © 2010 Association for Computational Linguistics
“learn” the optimal path to answer a question. If that question is repeated, the system
tightens the associations among the relevant data elements that form the answer. This
process can be thought of as similar to the way muscle memory works in humans.
Complex patterns, such as a tennis serve, are learned through repetition.
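A minimal sketch of the reinforcement idea described above, with invented weights and update rule (not ai-one's implementation): repeating a query strengthens the associations along the path that answered it, so the same answer is found faster next time.

```python
# Query-reinforcement sketch: each time a query is answered through a set of
# associations, strengthen those association weights ("muscle memory").
# Invented weights and update rule, for illustration only.
weights = {("virus", "signature"): 1.0, ("signature", "library"): 1.0}

def reinforce(path, boost=0.5):
    """Strengthen every association edge used to answer a query."""
    for edge in path:
        weights[edge] = weights.get(edge, 0.0) + boost

answer_path = [("virus", "signature"), ("signature", "library")]
for _ in range(3):            # the same question asked three times
    reinforce(answer_path)

print(weights)  # edges on the repeated path now carry higher weights
```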
Unlike traditional neural nets, ai-one’s technology reveals all relationships that comprise
the answer to a query. It is semi-transparent. It is also teachable – commands within the
SDK enable humans to instruct the system to make specific associations and ignore
others. However, the best practice is to teach the system by directing it to use external
resources that are verified as truthful (such as malware libraries) to learn patterns
faster.
Some of the most useful instructions focus the system on finding anomalies – so you
find answers to the questions you didn’t know you needed to ask.
Finally, the system is both language and data agnostic because it learns at the byte-
level.
SUMMARY OF THE BENEFITS OF AI-ONE’S TECHNOLOGY
• Works with existing technologies – The CORE API/Library works with other programming languages such as C, Java, Microsoft, etc.
• Autonomic learning – it learns as it is stimulated by external sensors, without any human intervention or training sets.
• Machine generated lightweight ontologies (LWO) – reveals all relationships with simple commands.
• Dynamic topologies – Finds the best answers by automatically reshaping data surfaces to fit queries.
• Byte-level processing – Data and language agnostic. It works equally well with structured and unstructured data. It can work with or without external references (e.g., human-curated ontologies, libraries, databases, etc.).
• Fast – 10,000x faster than COStf-idf in a benchmark comparison.
• Efficient – No need to re-index the entire corpus as new information is learned or inserted.
• Transparency – It reports on the pathway that is used to determine associations.
• Asymmetric, bidirectional pathways – enables machine detection of high-order co-occurrences where concepts (or words or packets or bytes) can be closely associated although they never occur in the same place at the same time. (E.g., the words “rust” and “corrosion” mean the same thing although they never occur together; see the sketch after this list.)
SCREEN SHOT OF AI-ONE’S BRAINBOARD PROTOTYPING TOOL FOR MACHINE LEARNING INSTRUCTION SETS
• Low cost & fast deployment – far less expensive and faster to implement than competing technologies.
• Future flexibility – extensible architecture ensures solutions built with the SDK will port into future firmware (chipsets) in the near future. Our intent is to provide customers with an easy plug-and-play upgrade to further improve the speed and performance of solutions built using ai-one software.
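The bullet on asymmetric, bidirectional pathways above deserves a concrete illustration. The sketch below shows plain second-order co-occurrence: "rust" and "corrosion" never appear in the same sentence, yet they end up strongly associated because they share contexts. This is a generic textbook construction, not ai-one's algorithm; the sentences are invented.

```python
# Second-order co-occurrence sketch: two terms that never appear together
# ("rust", "corrosion") become associated because they share context words.
# Generic illustration of high-order co-occurrence, not ai-one's method.
from collections import Counter
from math import sqrt

sentences = [
    "rust damaged the metal pipe",
    "corrosion damaged the metal tank",
    "rust weakened the steel bridge",
    "corrosion weakened the steel hull",
]

def context_vector(term: str) -> Counter:
    """Count words that co-occur with `term` in the same sentence."""
    vec = Counter()
    for s in sentences:
        words = s.split()
        if term in words:
            vec.update(w for w in words if w != term)
    return vec

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print("first-order: do they ever co-occur?",
      any("rust" in s and "corrosion" in s for s in sentences))
print("second-order similarity:",
      round(cosine(context_vector("rust"), context_vector("corrosion")), 2))
```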
AI-ONE TECHNOLOGY ROADMAP
CURRENT COMMERCIAL-OFF-THE-SHELF
ai-one’s technology is currently offered as a core API/Library, with a small software
development kit (SDK) that enables programmers to build artificial intelligence into any
application. The system mimics neurophysiology – generating neural nets with
massively connected, asymmetrical graphs that are stimulated by binary spikes from
external sensors (e.g., IDPS, firewalls, etc.).
The system grows as it learns through exposure to data (e.g., from sensors such as routers, firewalls, etc.). The current manifestation of the technology is appropriate for a small-scale proof-of-concept to demonstrate how ai-one’s technology can be used to harden network security against MHOTCO and other, unanticipated forms of cyber attacks.
64-BIT MULTI-THREAD COTS
ai-one is currently porting the 32-bit, single thread dll into a 64-bit multithread, platform agnostic system. Theoretically, this version will have the capacity of storing and processing up to 2 exabytes of data (2EB or 2×10¹⁸ bytes) per instance. Processing speed for this size of data will depend upon a number of factors beyond ai-one’s control – such as memory access times, processor speeds, operating system overhead, etc.
64-BIT CHIPSETS
Future plans include porting the core technology onto FPGA and ASICS chipsets – where it will run approximately 10,000x faster than traditional neural nets. ai-one is at an advanced stage of research after spending more than eight years developing the technology in this direction.

We anticipate that the most likely candidate for an “artificial brain” that can learn the relationships of both structured and unstructured data will be a matrix ASICS chipset where the holosemantic network using ai-one’s neural mathematics will operate in unison with traditional chipsets running linear equations (e.g., Intel, IBM, etc.).

This matrix architecture enables unlimited scaling across multiple computational clusters.

Since 2006, ai-one has been building experimental prototypes using this matrix architecture. Our research indicates these chipsets will be 10⁴ to 10⁶ times faster than neural networks running on the current generation of Intel i7 900 chips. Commercial production requires additional research and testing.

CONCEPT DIAGRAM FOR HETERARCHICAL DATA PROCESSING CHIP ARCHITECTURE FOR AI-ONE ARTIFICIAL BRAIN CHIPSET

NEXT STEPS – PROOFS OF CONCEPT

IMMEDIATE COTS-BASED APPROACH27
27 COTS = commercial-off-the-shelf.

AI-ONE’S FIRST WORKING PROTOTYPE OF AN ARTIFICIAL BRAIN, 2006

The current version of ai-one’s technology (32-bit, single thread) is appropriate for a small-scale proof of concept (POC) where a total of 250MB of network traffic would be
monitored offline using a moving windows approach. The objective would be to prove
ai-one’s machine learning approach can detect threats posed by an oppositional (red)
attack team. This system could be built within three months at a cost of less than
$600,000 using all off-the-shelf hardware and software integrated into a custom built
application using ai-one’s Topic-Mapper API.
INTERMEDIATE COTS APPROACH
The second evolution would be a POC using a 64-bit multithread instance of ai-one’s technology. This is due for commercial release in 2012. Research indicates this configuration can potentially process up to 2EB (exabytes, or 2×10¹⁸ bytes) of data per instance and can be deployed as software-as-a-service (SaaS) over a commercial host such as Amazon Web Services or Google App Engine. However, more research is necessary before determining how this COTS approach would be most effective against cyber threats. For example, it might be most effective to use a sliding window approach combined with clustering multiple instances of the 64-bit dll across an HPC cluster.28 This POC will demonstrate the ability of the ai-one solution to scale to accommodate security for most enterprise networks – likely in excess of 1 petabyte (10¹⁵ bytes) of traffic per day. We estimate this will take approximately 1 year to develop at a cost of approximately $3-5 million excluding hardware manufacturing costs.
MATRIX CHIPSET APPROACH
The third evolution will involve developing and testing in two stages. The first stage is a POC using the 64-bit multithread deployed as a field programmable gate array (FPGA) that will be configured to run up to 1 exabyte (10¹⁸ bytes) to demonstrate that the ai-one
solution can operate at network speed for a 1 petabyte/day traffic load. We estimate this
will take approximately 1 year to develop at a cost of not more than $10 million for the
first chipset excluding the costs of manufacturing hardware.
28 High performance computing (HPC) clusters may require changes to bus architectures to
accommodate neural cell traffic. More research is necessary in this area.
Once the FPGA is proven, the second stage is to deploy the FPGA solution using application specific chip sets (ASICS) that will operate in clusters where each chip performs at 10,000x current COTS speeds. We estimate the ai-one ASICS solution will operate at very low energy levels (<10% of current Intel i7 chips) and be able to process at least 1PB/second. We estimate this solution will require an additional year of development at a cost of approximately $50 million excluding the costs of manufacturing hardware.

CLUSTERING OF AI-ONE MATRIX CHIPS FOR HPC
APPENDIX - A WORST CASE SCENARIO: MULTIPLE HIGH ORDER TERM CO-OCCURRENCE ATTACKS
Our research indicates that current network security technologies are unable to thwart
multiple high order term co-occurrence (MHOTCO) attacks. The essence of MHOTCO
is to use packets that look and behave differently when viewed as individuals but
assimilate into malware once a critical mass touches a vulnerable network asset,
appliance or application. Each packet can be thought of as a word or part of a word.
MHOTCO attacks use different “words” that mean the same thing – usually at several
extended levels.29 The diagram below illustrates how MHOTCO escapes detection from
computer programming approaches.
DIAGRAM 1: RELATING PACKETS USING A SINGLE HIGH ORDER CO-OCCURRENCE REFERENCE PACKET
MHOTCO attacks introduce malware into a network using multiple packets that are
seemingly unrelated because they come from different sources, use different control
bits and have payloads that do not have any similarity. Thus, traditional lookup tables
and heuristic testing of packets will not detect a threat from any single or small group of
packets. An effective tactic to attack structured network defenses using MHOTCO is to
29 Hyponyms can extend to many levels called orders. For example, ‘scarlet’ is a first-order hyponym to
‘red’ and a second-order hyponym to ‘color,’ etc.
intentionally stimulate and monitor network defense counter-measures to deduce
vulnerabilities in unstructured (human analyst) defenses.
UNNOTICEABLE ATTRITION
A possible scenario might start with a birthday attack on a military network whereby
hash collisions identify pigeonhole opportunities to bypass an intrusion detection and
protection system (IDPS) and firewall security systems. Once the pigeonhole error is
identified, the hacker reveals further vulnerabilities in the network appliances’ approach
for threat identification by intentionally stimulating signature-based, statistical anomaly-
based and stateful protocol analysis detection measures. The attacker uses a series of
split A/B tests to compare successful and rejected intrusions to refine malware packet
wrappers and payloads. These results develop a topology of network defenses –
including detailed analysis of vulnerabilities in IDPS, network access and control
systems, and malware counter-measures.
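As background for the birthday-attack step above, the sketch below uses the standard birthday-bound approximation, p ≈ 1 − exp(−k(k−1)/2N), to show how quickly collision probability grows with the number of hashes tried. The hash sizes and attempt counts are chosen only for illustration.

```python
# Birthday-bound sketch: approximate probability of at least one hash collision
# after k random hashes drawn from a space of N = 2**bits possible values.
# Standard approximation p ~ 1 - exp(-k*(k-1) / (2*N)); sizes are illustrative.
from math import exp

def collision_probability(k: float, bits: int) -> float:
    n = 2 ** bits
    return 1.0 - exp(-k * (k - 1) / (2.0 * n))

for bits in (32, 64):
    for k in (2**16, 2**20, 2**32):
        p = collision_probability(k, bits)
        print(f"{bits}-bit hash, {k:,} attempts -> collision probability ~ {p:.4f}")
```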
Successful MHOTCO attacks are unrecognized by the network defenses for several
reasons:
• They fail to recognize that the attack is from a common source.
• They have no physical or signature similarity.
• They do not spike network activity (unlike DDoS).
The MHOTCO attacker camouflages his identity by using rotating or masking IP
addresses, aliases, and impersonations, etc. He places malicious packets that do not
conform to the known patterns within the libraries of the anti-malware software. These
lie dormant and undetected until the attacker decides to use them. The MHOTCO
attacker is patient – he might take years to slowly test security measures to ensure that
network activities never increase above a threshold. He hides in a sea of cyber noise.
The human analyst never knows that network security has been compromised – all he
can do is refine the malware detection algorithms using his imperfect knowledge of the
situation.
At this point, the attacker has complete control of the cyber battlespace – he knows
where the vulnerabilities lie and has implanted malicious code that can be deployed at
his command at a time of his choosing. He has used counter-counter measures to
intentionally misguide his adversaries into building ever more convoluted defenses that
depend on the questionable accuracy of algorithms and statistics at a cost of thousands
of human analysts – all while increasing complexity and potential areas to hide.
Meanwhile, the network administrators and security analysts feel confident they are
safe. They believe they have successfully thwarted numerous unrelated attacks – and
can report that network activities are within acceptable variances with detailed analysis
of the threats that have been stopped. They have developed mountains of code to analyze and process threats. And they have thousands of “highly trained” cyber-defense troops to step in if needed.

Cyber warriors have no idea that they have already lost the battle – until it is too late to do anything about it.

THE GAME CHANGER: MACHINE LEARNING
Now imagine the same situation where a MHOTCO attacker uses the same tricks – only the network administrators use artificial intelligence so network appliances (machines) can learn the intent, purpose and expected behavior for every packet on the network. The artificial intelligence system learns the relationship of every packet on the network to every other. The system develops an infinitely scalable, n-dimensional graph – a holosemantic dataspace – that enables the machines (firewalls, IDPS, etc.) to understand how all packets are related at the byte-level.30 Camouflaging payloads and control wrappers are useless because the machines using ai-one’s intelligence understand the inherent latent semantic meaning of each packet by detecting hyponymy relationships autonomously.

Now the MHOTCO attacker unintentionally reveals his intent with every packet he introduces into the system. The network administrators are in control. They can deploy CN measures to manage attacks at their own discretion to support strategic military and political objectives.

Machine learning transforms network vulnerabilities into strategic military assets.

AI-ONE’S API CONNECTS NETWORK SENSORS TO THE HOLOSEMANTIC DATASPACE.

30 Holosemantic means ‘whole meaning.’ ai-one uses this term to describe that each cell within the neural network contains enough information to infer the shape of the data space. Another way to understand this is to think of it as an emergent system. For more information on emergence see: Corning, Peter A. “The Re-Emergence of “EMERGENCE”: A Venerable Concept in Search of a Theory” in Complexity (2002) 7(6): pages 18-30.