SRS Architecture Study
Partha Pal
Franklin Webber
Outline
• Study goals
• SRS Technologies
• Top down
• Bottom up
• Strawman
• Issues, challenges
[Figure: level of service over time, from the start of a focused attack, for an undefended system, a survivable system (OASIS Dem/Val) and a regenerative system, compared with the level of service without attack]
Self-Regenerative Survivable System:
Self: Organic decision making
Regenerative: better than graceful degradation or simple recovery, reversing the trend
Study Plan
Balances the pros and cons of both approaches
If the high watermark is implemented, it provides a concrete context, but “grandfathering” may affect the choice and integration of new capabilities
This is a study in the abstract, leading to an abstract architecture that will need a concrete context to realize
3rd generation assumptions are still valid: absolute prevention, and accurate and on-time detection, are impossible to achieve
Understand how to incorporate the new technologies in a distributed information system that not only tolerates the effects of cyber-attacks, but also attempts to stop and reverse the loss of resources and capabilities
Start with the new (SRS) capabilities, build a partial architectural framework, and then see what other capabilities, mechanisms and services are needed to complete the architecture–
• offers a high level of resistance to attacks (protection),
• improves visibility of attacker activity/attack effects (detection), and
• is able to adapt to changes caused by the attacker (react)
Start with a high watermark survivability architecture, identify where SRS capabilities could benefit, re-organize the architecture to integrate the selected capabilities, mechanisms and services
Combine & contrast the abstract architecture with the more concrete case to create a Strawman Self-regenerative Survivable System Architecture
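[Slide labels: starting from the SRS capabilities is the “Bottom up” path; starting from the high watermark architecture is the “Top down” path]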
Summary of SRS Technology Study
Process
• Sent a questionnaire to each original SRS project (i.e., all except Asbestos)
• General outline:
  – Claims
  – Key Capabilities
  – Benefits and Other Distinguishing Factors
  – Assumptions
  – Use Cases and Interface Issues
• Customized for issues we thought especially important or were confused about
• All responded, some very quickly, some after gentle prodding. Thank you!
General Observations
• Varying degrees of maturity
  – Some projects started with existing technology
• At least half of the projects offer multiple technologies that could be used independently
• Less overlap than we expected: many technologies seem complementary
• Unsurprisingly, not a lot of support for integration
Biologically-Inspired Diversity Projects
• Genesis
  – A toolkit offering a variety of transformations
  – Based on Strata and is portable
• DAWSON
  – A toolkit offering a variety of transformations
  – Based on Windows DLLs
• Comparisons
  – Some overlap in randomization techniques
  – Genesis also offers highly attack-resistant runtime transformations that incur Strata’s overhead
  – DAWSON also offers Windows-specific transformations
  – May be combined but value and difficulty are unclear
Cognitive Immunity and Self-Healing Projects
• Learning and Repair
  – Daikon: learns program constraints from a set of traces
  – Kvasir: monitors program to create traces for Daikon
  – Archie: checks program constraints at runtime
  – Repair Tool: repairs damage to conform to constraints
  – Tools existed before SRS but are being improved
• RMPL (Concurrent Model-Based Execution)
  – A language expressing temporal properties without fully specifying an order of execution, and probabilistic assumptions and choices
  – An executive that plans, dispatches methods and replans when necessary
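The Learning and Repair pipeline (learn constraints from traces, check them at runtime, repair violations) can be illustrated with a toy sketch. This is not Daikon, Archie or the Repair Tool: the function names `learn_ranges`, `check` and `repair` are hypothetical, and the only constraint form assumed here is a per-variable min/max range.

```python
from typing import Dict, List, Tuple

Trace = Dict[str, float]                # one observation: variable name -> value
Ranges = Dict[str, Tuple[float, float]]  # learned constraint: name -> (min, max)


def learn_ranges(traces: List[Trace]) -> Ranges:
    """Learn a [min, max] range per variable from known-good traces."""
    ranges: Ranges = {}
    for trace in traces:
        for name, value in trace.items():
            lo, hi = ranges.get(name, (value, value))
            ranges[name] = (min(lo, value), max(hi, value))
    return ranges


def check(state: Trace, ranges: Ranges) -> List[str]:
    """Return the variables whose current value violates a learned range."""
    return [name for name, value in state.items()
            if name in ranges
            and not (ranges[name][0] <= value <= ranges[name][1])]


def repair(state: Trace, ranges: Ranges) -> Trace:
    """Clamp each violating variable back into its learned range."""
    fixed = dict(state)
    for name in check(state, ranges):
        lo, hi = ranges[name]
        fixed[name] = min(max(state[name], lo), hi)
    return fixed


if __name__ == "__main__":
    good_runs = [{"queue_len": 3, "msg_size": 120},
                 {"queue_len": 9, "msg_size": 800}]
    learned = learn_ranges(good_runs)
    damaged = {"queue_len": 5, "msg_size": 10_000_000}
    print(check(damaged, learned))
    print(repair(damaged, learned))
```

Clamping is only one possible repair action; a real repair step would need to respect relationships between variables, which this sketch ignores.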
Cognitive Immunity and Self-Healing, cont’d
• AWDRAT
  – Language to specify behavior (Architectural Model)
  – Language to describe Method Selection Metadata
  – Tools to instrument Java to monitor and control behavior
  – An executive that
    • Detects anomalies by Architectural Differencing
    • Combines other observations to update a Trust Model
    • Selects methods to maximize utility and/or minimize costs
• Cortex
  – A “taste-tester” framework for redundant components
  – Scyllarus: situation assessment
  – CIRCA: generates controllers from models
Cognitive Immunity and Self-Healing, cont’d
• Comparisons
  – Learning and Repair tools are complementary to others
  – Cortex learning by taste testing is also complementary
  – AWDRAT and RMPL address some of the same issues but:
    • AWDRAT is middleware to defend an existing application; RMPL is a language and environment for building new applications
    • Geared to different application domains:
      – RMPL: embedded/autonomous vehicle systems
      – AWDRAT: information processing systems
  – AWDRAT’s Trust Modeling is complementary to others
Granular Scalable Redundancy Projects
• Steward
  – Scalable support for Byzantine fault-tolerant state-machine replication
    • BFT-like protocol for LANs
    • Paxos-like protocol for WANs
    • Library for threshold crypto
• CMU
  – Byzantine fault-tolerant data storage using scalable asynchronous protocols
    • read/write (R/W)
    • query/update (Q/U)
• QuickSilver
  – Tempest (time-critical; probabilistic; SlingShot protocol)
  – QuickSilver (scale to many groups; virtual synchronous protocol)
  – Cayuga (efficient automata for searching publication histories)
  – ChunkySpread (dynamic IP multicast)
Granular Scalable Redundancy, cont’d
• Comparisons among protocols
  – Significantly different attack (fault) models
  – Significantly different assumptions about applications
  – CMU’s Q/U protocol makes the weakest assumptions about the attacker but places more restrictions on the application than Steward, SlingShot or QuickSilver
Reasoning about Insider Threat Projects
• PMOP
  – Framework for monitoring operator behavior, recognizing and blocking bad actions
• HDSM (High-Dimensional Search and Modeling)
  – Insider Modeler and Analyzer, currently used offline
  – Search engine for high-dimensional space of sensor data
  – Response Engine
• Asbestos
  – New x86 OS with efficient support for trustworthy isolation in hosts and processes running untrusted code
Reasoning about Insider Threat, cont’d
• Comparisons
  – All are complementary to each other
  – PMOP seems to be AWDRAT’s Architectural Differencing applied to operators rather than components
  – HDSM’s search engine is complementary to other SRS technologies but the Response Engine overlaps in scope with AWDRAT executives
Top Down Approach
What can we learn about the architecture of SRS systems by trying to transform a high watermark survivable system into an SRS system?
DPASA Architecture applied to the JBI Exemplar used in OASIS Dem/Val, and its limitations and shortcomings, as identified by:
• Developers’ experiences
• Testing and validation
• Out of lab deployment
• Multiple red team exercises
SRS Technologies: understanding of their
• Capabilities
• Assumptions
• Limitations
• Maturity
Our study found that there is a sizable intersection that pushes the high watermark more towards an SRS system!
• Much better than finding that the technologies do not address the identified problems or that, even if they do, the “self” and “regenerative” aspects yield no gain
These changes are incremental improvements over the current DPASA architecture. Changing the architecture substantially (e.g., implementing JBI CAPI using QuickSilver) without appropriate forethought is not likely to lead to a more survivable system, because the system will lose the well-tested interaction of existing protection, detection and adaptive response mechanisms.
Limitations and Shortcomings of the DPASA Architecture
• Recovery supported only for some key components
• Availability seems to be the most attractive target for the adversary
• Interpretation of observations, deduction and decision making require expertise
• Need more options for adaptive response
• Lack of support for improving the system on the fly
The last three are more tightly interrelated and more SRS oriented, but SRS technologies may help in all but the last one
Improving Recovery
• State:
  – Partially implemented: some clients and some PSQ (those committed to MySQL)
• Connection:
  – Reasonably handled
• Group view:
  – PSQ:
    • View among servers: handled well
    • View of servers from clients: takes a long time
  – SM:
    • Dependent on Spread: could be broken in a bad way
• Improvement possibilities
  – Need “safe” state transfer or carry over
• Can SRS technologies help?
  – Replace Spread transmitter?
  – Implement the (in-memory) data structures maintained by PSQ servers as Q/U objects using CMU protocol?
  – Clients and DC: use Asbestos for protecting check-pointed state?
SM and PSQ are redundant and maintain some replicated state. SRS technologies provide supporting infrastructure.
Self: who makes the decision to recover (or not to recover), and when?
Regenerative: recovering to “operational” without any other “changes” is still in the realm of “delaying the eventual degradation”
Full recovery
Restart with state loss
Some Details
[Figure: four SMs (q1sm, q2sm, q3sm, q4sm), each with signature verification (Sig Vrfy) and voting, each attached to an SXMTR over the SPREAD GCS]
q1sm wants to multicast message M: q1sm signs M and hands it to its XMTR, which returns success only if all XMTRs in the group acknowledge receiving M
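The sign-then-multicast rule described above, where success requires an acknowledgement from every XMTR, can be sketched as follows. This is an illustrative model only: the class `Xmtr` and function `multicast` are hypothetical names, and an HMAC over a shared key stands in for whatever signature scheme the real system uses, which the slides do not specify.

```python
import hashlib
import hmac
from typing import List


class Xmtr:
    """Hypothetical transmitter endpoint that verifies and acknowledges messages."""

    def __init__(self, name: str, key: bytes, alive: bool = True):
        self.name = name
        self.key = key
        self.alive = alive
        self.delivered: List[bytes] = []

    def receive(self, msg: bytes, tag: bytes) -> bool:
        """Acknowledge only if this endpoint is reachable and the tag verifies."""
        if not self.alive:
            return False
        expected = hmac.new(self.key, msg, hashlib.sha256).digest()
        if not hmac.compare_digest(expected, tag):
            return False
        self.delivered.append(msg)
        return True


def multicast(msg: bytes, key: bytes, group: List[Xmtr]) -> bool:
    """Sign msg, hand it to every XMTR, and succeed only if all acknowledge."""
    tag = hmac.new(key, msg, hashlib.sha256).digest()
    # Build the full list first so every endpoint is tried, even after a failure.
    acks = [xmtr.receive(msg, tag) for xmtr in group]
    return all(acks)


if __name__ == "__main__":
    shared = b"demo-key"
    quads = [Xmtr(f"q{i}sm", shared) for i in range(1, 5)]
    print(multicast(b"M", shared, quads))
    quads[2].alive = False
    print(multicast(b"M2", shared, quads))
```

In this model a single silent XMTR is enough to make `multicast` report failure, even though the other endpoints have already delivered the message.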
[Figure: per-quad hosts q1sm to q4sm, q1psq to q4psq and q1dc to q4dc]
A combination of managed switches and ADF policies defines who can talk to whom and over which port and protocol
[Figure: the four SMs with SsXMTRs over Steward or QuickSilver instead of the SPREAD GCS]
It is not clear whether the unavailability observed is purely an implementation problem, but switching over to a Steward or QuickSilver transport may still be advantageous:
• Maintaining the state machine replication abstraction is advantageous for state recovery
• Simpler XMTR
• Can handle more quads
The way a client’s PSQ messages are handled by our PSQ servers is similar to using CMU’s Q/U protocols: imagine the subscription info as a Q/U object, replicated at each PSQ server, part of which is maintained in memory. One difference is that, instead of the client, one PSQ server acts as its proxy.
[Figure: four PSQ servers (q1psq, q2psq, q3psq, q4psq), each with signature verification (Sig Vrfy) and voting, interconnected by sockets; labels: Q/U Objects, Q/U client, Client’s PSQ Req, Q/U protocol and Object Synching]
Using the Q/U object abstraction and associated protocol will help state recovery of a restarted PSQ server: different clients may have interacted with different quads while the recovering quad was down.
Making Availability Compromises More Difficult
Unavailability triggered by corruption (not brute force DoS!):
• Non-redundant and homogeneous perimeter (PIX FW routers)
• Corrupt references
• Attacks on Java: serialization bombs, garbage collection/lease
• SQL injection
Attack origin labels: need privileged access on inside host(s); from outside
[Figure: attacker network reaching, through PIX firewalls, the Wing Ops LAN (WNIDS, CombOPS), the AMC CONUS LAN (ANIDS, MAF) and the quad hosts (sm, ps, cor, psq, dc, NIDS, ap)]
Redundant and fail over; monitor all legs
[Figure: two PIX firewalls and a hub in front of MAF and ANIDS on the AMC CONUS LAN]
Diversity (costly)
Taste tester?
Dynamic diversity using Genesis? May not be a memory exploit?
[Figure: PIX with a taste tester (T tester), hub and “spl hw” in front of MAF and ANIDS on the AMC CONUS LAN]
Availability, cont’d
Corrupt references:
[Figure: q1sm, q2sm, q3sm and q4sm, with advertised addresses Q1’s IP=127.0.0.1, Q2’s IP=127.0.0.1, Q3’s IP=127.0.0.1, Q4’s IP=127.0.0.1]
Registering client gets 127.0.0.1 for all quads
Flaw: Q1SM’s unsolicited statement about the other quads’ IP addresses is believed by everybody
Attacks on Java: serialization bombs, garbage collection/lease mechanisms
• Send a serialized packet with a huge number in the size field; the JVM tries to allocate the amount of memory given in the size field and throws an OOME
• Variations: send a number of such packets to consume available memory
• Packets may or may not be well formed, but to keep the memory allocated “serialization” must succeed
• Tell an RMI server that a client reference has been released
• Need to guess the client-side refs
• Create variants of JVM or other libraries using Genesis or DAWSON tools
Enforce a size rule?
• Use AOP to implement a check before allocation?
• Use Daikon toolset to learn the max size of serialized packets, enforce it as an invariant and fix when violated?
Code/implementation problem
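The “enforce a size rule” idea, checking the declared size before allocating anything, can be sketched on a simple length-prefixed frame. This is not Java serialization and not the DPASA code: the frame layout (4-byte big-endian length, then payload), the name `read_frame` and the limit `MAX_PAYLOAD` are assumptions made for illustration. In the slide's proposal the limit would be learned (for example with the Daikon toolset) rather than hard-coded.

```python
import io
import struct
from typing import BinaryIO

MAX_PAYLOAD = 64 * 1024  # assumed limit in bytes; hypothetical value


class OversizedFrame(ValueError):
    """Raised when the declared payload size exceeds the allowed maximum."""


def read_frame(stream: BinaryIO, max_payload: int = MAX_PAYLOAD) -> bytes:
    """Read one frame, rejecting it before any payload buffer is allocated."""
    header = stream.read(4)
    if len(header) != 4:
        raise EOFError("truncated length header")
    (declared,) = struct.unpack(">I", header)
    # The size rule: refuse to trust the size field beyond the known maximum.
    if declared > max_payload:
        raise OversizedFrame(f"declared {declared} bytes, limit {max_payload}")
    payload = stream.read(declared)
    if len(payload) != declared:
        raise EOFError("truncated payload")
    return payload


if __name__ == "__main__":
    good = io.BytesIO(struct.pack(">I", 5) + b"hello")
    print(read_frame(good))
    bomb = io.BytesIO(struct.pack(">I", 2**31) + b"x")
    try:
        read_frame(bomb)
    except OversizedFrame as err:
        print("rejected:", err)
```

The check costs one comparison per frame; the open question raised on the slide is where to place it (AOP advice or a learned invariant), not how expensive it is.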
Availability, cont’d
SQL Injection
[Figure: message path Client, PSQ Proxy, PSQ Server, RDB, with DC; labels: encrypted message containing data D, session check, roles and other checks, query created from D by PSQ]
DPASA has the notion of escrow
MySQL in all quads, but on different OS
Improving Detection and Response
• Add taste tester
• Two possibilities: at the PSQ level or at the RDB level
[Figure: the same message path with a taste tester at the PSQ (T taster PSQ) or at the RDB (T taster RDB); X marks on the path]
Improving Prevention (& detection)
Strictly control what is executed on the RDB
• Vet D
• Create a white list
Use diverse DBs (hoping some will behave differently)
• Can SRS diversity techniques help?
• Genesis tainting?
Cost…
Applicability, Extendibility …
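The prevention options above (vet D, keep a white list, strictly control what is executed on the RDB) can be sketched with a parameterized query. The sketch uses Python's built-in `sqlite3` rather than MySQL, and the table layout, the white list `ALLOWED_TOPICS` and the vetting pattern `SAFE_VALUE` are hypothetical choices, not taken from DPASA.

```python
import re
import sqlite3
from typing import List

ALLOWED_TOPICS = {"weather", "targets"}            # hypothetical white list
SAFE_VALUE = re.compile(r"[A-Za-z0-9_ -]{1,64}")   # hypothetical vetting rule


def vet(topic: str, author: str) -> None:
    """Reject data D that is not on the white list or fails the vetting rule."""
    if topic not in ALLOWED_TOPICS:
        raise ValueError(f"topic not allowed: {topic!r}")
    if not SAFE_VALUE.fullmatch(author):
        raise ValueError("author contains disallowed characters")


def query_messages(db: sqlite3.Connection, topic: str, author: str) -> List[str]:
    """Run one fixed statement; D is passed as parameters, never spliced in."""
    vet(topic, author)
    cursor = db.execute(
        "SELECT body FROM messages WHERE topic = ? AND author = ?",
        (topic, author),
    )
    return [row[0] for row in cursor.fetchall()]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE messages (topic TEXT, author TEXT, body TEXT)")
    conn.execute("INSERT INTO messages VALUES ('weather', 'q1', 'clear skies')")
    print(query_messages(conn, "weather", "q1"))
    try:
        query_messages(conn, "weather", "x' OR '1'='1")
    except ValueError as err:
        print("rejected:", err)
```

Vetting and parameter binding are independent layers here: the second still holds if the vetting rule turns out to be too permissive.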
More Organic Decision Making
At what granularity does the cost outweigh the benefits?
Most DPASA implemented components have some of these in “code”. Should they be made explicit?
Should we add these as architectural elements at the key components SM, PS, PSQ and LC?
[Figure: DPASA deployment: four quads (q1 to q4), each with sm, ps, cor, psq, dc, NIDS and ap hosts, and client hosts (ENIDS, WxHaz, ChemHaz, EDC, JEES, SCRBT, PNIDS, AODB, Target, CAF, ANIDS, MAF, WNIDS, CombOPS and others) on the ENV LAN, PLANNING LAN, Wing Ops LAN and AMC CONUS LAN]
Q1SM invites Combat Ops, but does not see all heartbeats
Q2SM sees heartbeats from 4 out of 5 Combat Ops components
Q3SM shows some missing heartbeats from Combat Ops
Q4SM same as Q3SM
GUI up, but cannot subscribe
No significant alerts in Emerald
Combat Ops got bad references for Q1, Q3 and Q4?
• Most likely not all at once
Try to push the right references
• Try refreshing these first
• If that fails, try refreshing with q3 blocked?
(DPASA Operators)
Organic Decision Making: within the system, by the system
Issues to be addressed by the architecture
• Detection
  – Arch differencing
  – Deviation from spec
• Interpretation
  – Models, JHU A-DAGs
• Deductive analysis, hypothesis testing
  – HDSM? Cortex
• Response selection
  – RMPL? Cortex
More Maneuvering Room for Defense
• Beyond restart process, reboot, and graceful degradation (block or isolate, reduce quorum size, etc.)
  – More spares, distributed widely
    • (Scalable redundancy)
  – Restart a variant
    • (Genesis, DAWSON)
  – Reboot a new system
    • (Asbestos?)
  – Change transport
    • (from QuickSilver to SlingShot, accept the weaker guarantees)
SRS technologies provide the infrastructure or mechanisms, but what about the management?
• Policies, decision making: when to restart a variant, when to reboot with what restrictions, which transport?
• SRS cognitive capabilities (reasoning about the system) will likely fall short in reasoning about SRS technologies
Carrying over state and keys?
Improving the System on the Fly
• Even if improvement-causing changes are identified along with the right time to apply them, the system must be “architected” to take the changes
  – Authorized vs unauthorized changes
  – Risk of automation: a new attack avenue
  – Different kinds of “change”
    • Code changes
      – Restart: state and key issues
    • Policy or configuration changes
      – IP Tables, ADF, rate limiting, size checking
        » Hooks exist, can be done manually
    • Protocol/transport changes
This is an architecture and implementation issue; the solution will likely be dependent on the technologies being used
A Futuristic DPASA++ System
Taste testers: at key service providers such as PSQ (using existing redundancy) and maybe even at the perimeter router.
[Figure: the DPASA deployment (quads with sm, ps, cor, psq, dc, NIDS and ap hosts; client hosts on the ENV LAN, PLANNING LAN, Wing Ops LAN and AMC CONUS LAN) extended with additional quads, with labels up to q7]
More quads (PSQ/SM: Scalable Redundancy)
Enhanced SMs: eliminate advisors, more decision support interfaces
Emerald Auto-action Arch Difference HD Search
Diverse variants of JVM and libraries
OS support of isolation: keys, check-pointed data, etc.
LCs enhanced with Arch Diff and Cognitive Executive
Use Genesis, DAWSON, Asbestos, RMPL/AWDRAT technologies
[Figure label: LC, repeated ten times]
Color code: removal of existing component/feature; enhancement of existing component/feature; addition of new component/feature
Bottom Up Approach: Self-Regeneration Feedback Loop
[Figure: feedback loop between Controller and Application, with arrows labeled service, service specification, service deviation and resource allocation]
“Service” may include the app’s
• functional correctness and/or
• quality of service delivery
Feedback Loop Including Resources
[Figure: feedback loop among Controller, Application and Resources; labels: service, service specification, service measurement, service deviation, knowledge, analysis, strategy, resource configuration and resource allocation]
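The loop in the figure, where the controller compares a service measurement against the service specification and turns the deviation into a resource allocation, can be sketched with a toy model. Everything concrete here is an assumption for illustration: the service is a request rate, resources are replicas serving 10 requests per second each, and the attack simply removes capacity. The names `Controller`, `application` and `run` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


def application(replicas: int, attack_load: float) -> float:
    """Toy application: service delivered (req/s) given resources and attack."""
    return max(0.0, replicas * 10.0 - attack_load)


@dataclass
class Controller:
    target_rate: float      # service specification
    replicas: int           # current resource allocation
    max_replicas: int = 8   # resources available to the controller

    def step(self, measured_rate: float) -> float:
        """One loop iteration: compute the deviation and adjust the allocation."""
        deviation = self.target_rate - measured_rate
        if deviation > 0 and self.replicas < self.max_replicas:
            self.replicas += 1  # regenerate: add a resource while service is short
        return deviation


def run(controller: Controller, attack_load: float, steps: int) -> List[float]:
    """Drive the loop and return the service measurement seen at each step."""
    history = []
    for _ in range(steps):
        measured = application(controller.replicas, attack_load)
        controller.step(measured)
        history.append(measured)
    return history


if __name__ == "__main__":
    ctrl = Controller(target_rate=30.0, replicas=3)
    print(run(ctrl, attack_load=25.0, steps=5))
    print(ctrl.replicas)
```

The strategy here is deliberately minimal (add one replica per step); the knowledge and analysis parts of the controller in the figure have no counterpart in this sketch.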
Using SRS Technologies in Feedback Loop
• Service specification:
  – RMPL, AWDRAT, Daikon
• Service measurement:
  – Archie, RMPL, Architectural Differencing, PMOP
• Resource configuration:
  – Genesis, DAWSON, Repair Tool, Cortex, HDSM
• Resource allocation:
  – RMPL, AWDRAT
• Controller:
  – Knowledge: Cortex
  – Analysis: Trust Modeling, HDSM
  – Strategy: RMPL, AWDRAT
Using SRS Technologies for Distributivity
• Self-Regenerative System will likely distribute
  – Application and/or
  – Resources and/or
  – Controller
• For coordinating distributed redundant application services and resources
  – Steward, Q/U, R/W, QuickSilver (virtual synchrony)
• For coordinating distributed redundant controllers
  – SlingShot (probabilistic time-critical)
Design Choices for Feedback Loops
• Hierarchy
  – Loops may be placed within application components, resources, and/or controllers of larger loops
  – Loops may share resources and/or controllers
  – Controllers often share data:
    • Synthesized from lower layers
    • Inherited from higher layers
  – Trade speed for smarts:
    • small loops are fast and dumb; large loops slow and aware
• Coordination
  – Replicated controllers allow easier analysis of defensive properties
  – Autonomous, decentralized controllers reduce the cost of coordination
Example: Multiple Components, Nested and Distributed Controllers, Shared Resources
[Figure: three Controllers, two Components and five Resources]
Design Rules-Of-Thumb
• Use purely local reaction only when accurate self-accusation is possible
  – “Organic” decision-making
  – Examples: if uncaught exception, restart thread; if seg fault, start new variant
• Controller scope should follow some boundary defined by access controls.
  – Example: a LAN bounded by firewalls
• For every resource, some controller scope should monitor all its uses.
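The first rule of thumb (restart on an uncaught exception, start a new variant on a fatal fault) can be sketched as a small supervisor. Python cannot portably catch a seg fault, so the exception class `Crash` is a hypothetical stand-in for one; `supervise` and the `max_restarts` limit are likewise assumptions made for illustration.

```python
from typing import Callable, Sequence


class Crash(Exception):
    """Stand-in for a fatal fault such as a seg fault (hypothetical)."""


def supervise(variants: Sequence[Callable[[], str]], max_restarts: int = 3) -> str:
    """Run the first variant; restart it on ordinary errors, switch on Crash."""
    index = 0
    restarts = 0
    while index < len(variants):
        try:
            return variants[index]()
        except Crash:
            # Fatal fault: the same code would likely fail again, so move on.
            index += 1
            restarts = 0
        except Exception:
            # Ordinary uncaught exception: restart the same variant.
            restarts += 1
            if restarts > max_restarts:
                index += 1
                restarts = 0
    raise RuntimeError("all variants exhausted")


if __name__ == "__main__":
    attempts = {"count": 0}

    def fragile() -> str:
        raise Crash("simulated seg fault")

    def flaky() -> str:
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise ValueError("transient failure")
        return "served by variant 2"

    print(supervise([fragile, flaky]))
```

The decision is purely local: the supervisor acts only on faults it observes in its own component, which is the “accurate self-accusation” condition the rule depends on.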
Natural Architectural Fragments
• Use AWDRAT, RMPL, or Cortex as Controller framework for
  • entire system or a significant subsystem and/or
  • one object or process
• Use Genesis or DAWSON to create alternate method implementations used in AWDRAT or RMPL
• Use Asbestos to compartmentalize data for multiple clients in Q/U protocol or multiple groups in QuickSilver protocol
• Construct a Unified Communication Service from multicast protocols
  • Runtime selection of alternate communication protocol with different properties
• Apply Learning and Repair technology to other SRS components
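Runtime selection of a communication protocol by its properties, as proposed for the Unified Communication Service, can be sketched as a lookup over a protocol catalogue. The property labels below paraphrase the project summaries earlier in the deck; the `cost` values, the `Protocol` record and the `select` function are invented for illustration and do not describe any of the real protocols' interfaces.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Protocol:
    name: str
    properties: FrozenSet[str]
    cost: int  # hypothetical relative cost; lower is cheaper


CATALOGUE: List[Protocol] = [
    Protocol("Steward", frozenset({"byzantine-fault-tolerant", "ordered"}), 5),
    Protocol("QuickSilver", frozenset({"virtual-synchrony", "ordered"}), 3),
    Protocol("SlingShot", frozenset({"time-critical", "probabilistic"}), 1),
]


def select(required: FrozenSet[str], catalogue: List[Protocol] = CATALOGUE) -> Protocol:
    """Pick the cheapest protocol that offers every required property."""
    candidates = [p for p in catalogue if required <= p.properties]
    if not candidates:
        raise LookupError(f"no protocol offers {sorted(required)}")
    return min(candidates, key=lambda p: p.cost)


if __name__ == "__main__":
    print(select(frozenset({"ordered"})).name)
    print(select(frozenset({"time-critical"})).name)
```

Switching from one selection to another at runtime is where the earlier questions about carrying over state and keys would apply; the sketch covers only the choice itself.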
Conclusion
• Various SRS technologies would have allowed improvement to our DPASA system defenses.
• Taken collectively, SRS technologies address most parts of the problem of self-regenerative control.
• Underlying SRS ideas seem sound but many implementations are immature.
• SRS technologies do not show how to distribute and scale self-regenerative control loops.
Backup
Placeholder for Strawman
• Componentization of defense
  – Protection, detection and adaptation
  – Organic decision making
• Unified Communication Service
• Architecture:
  – Organizing defense-enabled components over the UCS substrate
    • Layered vs monolithic
    • Loose confederation vs logical centralization
    • (DPASA is layered and logically centralized)
  – Deliberative inter-component adaptations