Autonomic Runtime System: Design and Evaluation for SAMR Applications *
Salim Hariri
High Performance Distributed Computing Laboratory
The University of Arizona
http://www.ece.arizona.edu/~hpdc
Supported by: NSF, DOE, DARPA, Intel, Raytheon and AOL grants
Slide 2
Outline
- Motivation and objectives
- Autonomia: An Autonomic Control and Management Environment
- Self-Optimization
- Self-Protection
- Concluding Remarks
Slide 3
Information Technology and Biology Convergence
- Our system design methods and management tools seem inadequate for handling the complexity, size, and heterogeneity of today's and future information systems.
- Biological systems have evolved strategies to cope with dynamic, complex, highly uncertain constraints.
Slide 4
Current Design and Development of Computing Systems
- Different fields evolved separately and targeted few domains/applications.
Slide 5
New System Construction: Part to The Whole Approach Adds
Complexity High-Cost Interoperability Issues
Slide 6
Autonomic Computing System: Holistic Approach
[Figure: autonomic building blocks (self-healing, self-optimizing, self-configuring, and self-protecting components) composed into autonomic computing systems, such as a secure, fault-tolerant system and a high-performance, fault-tolerant system.]
Slide 7
Autonomia: An Autonomic Control and Management Environment
- Provide dynamically programmable control and management services to support the development and deployment of autonomic applications.
- Provide Autonomic Runtime Services (self-healing, self-configuring, self-protecting, self-optimizing).
- Provide automated deployment, registration, and discovery of autonomic components.
- Provide automated configuration of autonomic applications and system resources.
Current Implementations Intractable for Large Problems
Slide 13
Wildfire Autonomic Runtime Manager (WARM)
[Figure: WARM architecture. Data sources and model inputs for the Dynamic Data Driven Wildfire Model: georeferenced distributed DB, sensors, survey flights, GPS, satellite, regional weather, local weather (temperature, humidity, wind speed, wind direction, clouds, precipitation, lightning), terrain characteristics, fuel conditions, fire behavior (location, intensity, geometry, propagation), smoke locations and concentration, and firefighting activities. WARM components: analysis objectives, monitor of resource state and application state, natural region characterization (NR1 burned, NR2 burning, NR3 unburned), system capability module (CPU, memory, bandwidth, availability, access policy), resource history module, active performance model (actual vs. predicted), planning engine, knowledge repository, virtual computation units (VCU), virtual resource units, autonomic scheduling, and execution on a heterogeneous, dynamic computational environment (IBM SP2, Beowulf cluster, Linux, MPP). The wildfire model development environment is also shown.]
Slide 14
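Forest Fire Cell Space: Dynamic Repartitioning
[Figure: initial partitioning of the cell space, followed by repartitioning in which the burning zone receives finer gridding and the burned zone receives coarser gridding. Region labels: NR2, NR3, NR5.]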
Slide 15
Wild Fire Simulation Physics
- The entire area is represented as a 2-D cell space. The weather and vegetation conditions are assumed to be uniform within a cell, but may vary across the cell space.
- When a cell is ignited, its state changes from unburned to burning. During its burning phase, the fire propagates to its eight neighbors along the eight directions, as shown in the slide figure.
- As the simulation time advances, the fire propagates from the first ignition cell to other cells.
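The cell-space update described above can be sketched as a simple cellular automaton. This is a minimal illustration, not the authors' fire model: it assumes deterministic spread, ignores the per-cell weather and vegetation conditions, and assumes a cell burns for exactly one time step. The names `step`, `NEIGHBORS`, and the state constants are hypothetical.

```python
# Cell states (hypothetical encoding).
UNBURNED, BURNING, BURNED = 0, 1, 2

# Row/column offsets to the eight neighbors of a cell.
NEIGHBORS = [(-1, 0), (-1, 1), (0, 1), (1, 1),
             (1, 0), (1, -1), (0, -1), (-1, -1)]


def step(grid):
    """Advance the 2-D cell space by one time step and return a new grid."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != BURNING:
                continue
            # Assumption: a burning cell is burned out after one step.
            new[r][c] = BURNED
            # Fire propagates to every unburned neighbor in the eight directions.
            for dr, dc in NEIGHBORS:
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == UNBURNED:
                    new[nr][nc] = BURNING
    return new
```

Because only burning cells require computation, the work per region changes as the fire front moves, which is what makes a static partition of the grid become unbalanced over time.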
Slide 16
Parallel Wild Fire Simulation Analysis
[Figure: composition of the execution time at time step t for 4 processors.]
- To decrease T(t), make the computation time on each processor as even as possible, which minimizes the synchronization time.
- The Imbalance Ratio (IR) characterizes the degree of imbalance.
Slide 17
Fire Simulation Example
[Figure panels: t = 1, t = N, t = 2N]
The example shows the imbalance ratio at different time steps. As the simulation advances, the imbalance gets worse.
Slide 18
Self-Optimization
- Monitor the state of the fire simulation to obtain the computation load at any time step.
- Monitor the state of the underlying system to obtain the computation capacity.
- Monitor the imbalance ratio at every time step. If the imbalance ratio is larger than a given threshold, dynamically adjust the workload among processors at run time.
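The threshold check above can be sketched as follows. The slide text does not preserve the exact definition of the imbalance ratio, so this sketch assumes one common form, (max - min) / mean of the per-processor computation times; both that definition and the function names are assumptions for illustration.

```python
def imbalance_ratio(times):
    """Assumed definition: spread of per-processor times relative to their mean."""
    mean = sum(times) / len(times)
    if mean == 0:
        return 0.0  # no work anywhere, so nothing is imbalanced
    return (max(times) - min(times)) / mean


def needs_rebalance(times, threshold):
    """True when the monitored imbalance ratio exceeds the given threshold."""
    return imbalance_ratio(times) > threshold
```

For example, `needs_rebalance([1.0, 2.0, 3.0, 2.0], threshold=0.5)` reports that repartitioning is due, while four equal times never trigger it.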
Slide 19
Self-Optimization Algorithm
- Obtain the total workload at time t.
- Estimate the computation time of one burning cell on processor p, taking the system load into account, where L(p,t) is the length of the CPU queue on processor p at time t.
- Calculate the average execution time of one burning cell.
Slide 20
Self-Optimization Algorithm (contd)
- To balance the load on each processor, the processor allocation factor (PAF) is defined as inversely proportional to the processor execution time with respect to the average execution time.
- Calculate the Processor Load Ratio (PLR), which characterizes the capacities of the processors. Note that:
- Calculate the workload assigned to processor p at time step t, workload(p,t).
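The steps of the algorithm can be sketched as below. The equations themselves are not preserved in the slide text, so the concrete forms used here are assumptions that only follow the verbal description: PAF(p) is taken as the average per-cell time divided by the per-cell time on p, PLR(p) as PAF(p) normalized to sum to 1, and workload(p,t) as PLR(p) times the total workload. The load model in `estimate_cell_time` is purely hypothetical.

```python
def estimate_cell_time(base_time, queue_length):
    """Hypothetical load model: per-cell time grows with the CPU queue length L(p,t)."""
    return base_time * (1 + queue_length)


def partition_workload(total_workload, cell_times):
    """Split total_workload across processors given their per-cell times.

    Assumed forms (not taken from the slides):
      PAF(p)      = average_time / cell_times[p]
      PLR(p)      = PAF(p) / sum of all PAF
      workload(p) = PLR(p) * total_workload
    """
    average_time = sum(cell_times) / len(cell_times)
    paf = [average_time / t for t in cell_times]
    paf_total = sum(paf)
    plr = [f / paf_total for f in paf]
    return [total_workload * ratio for ratio in plr]
```

With per-cell times of 1.0 and 2.0 and 300 burning cells, the faster processor receives twice as many cells, so both processors finish the step in the same time.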
Slide 21
Fire Simulation Example with Self-Optimization Algorithm
[Figure panels: t = 1, t = N, t = 2N]
With the self-optimization algorithm, the imbalance is dramatically reduced.
Slide 22
Wildfire Autonomic Runtime Manager (WARM)
[Figure: WARM workflow with numbered steps 1 to 8. Online monitoring and analysis: monitor of resource state and application state, resource history module. Online planning: active performance model, planning engine, knowledge repository, system capability module (CPU, memory, bandwidth, availability, access policy). Autonomic scheduling: scheduler, virtual computation units (VCU) for natural regions NR1 (burned), NR2 (burning), and NR3 (unburned), and virtual resource units (VRU). Execution: DDWM run on a heterogeneous, dynamic computational environment (IBM SP2, Beowulf cluster, Linux, MPP).]
Slide 23
Experimental results
- The problem size is 64K and the number of processors is 8.
- With self-optimization, the imbalance ratio is kept close to the threshold.
- Without self-optimization, the imbalance ratio grows as the simulation advances.
Slide 24
Experimental results (contd)
- The problem size is 64K and the number of processors is 8.
- Without self-optimization, the per-processor execution times for one time step become increasingly uneven as the simulation advances.
- With self-optimization, the per-processor execution times for one time step stay almost evenly distributed as the simulation advances.
Slide 25
Experimental results (contd)

Problem size (256*256 = 64K)
Number of Processors | Execution Time with Static Partition (s) | Execution Time with Dynamic Partition (s) | Percentage Improvement
8  | 2441.88 | 1540.58 | 36.91%
16 | 1824.43 | 1132.79 | 37.91%

Problem size (512*512 = 256K)
Number of Processors | Execution Time with Static Partition (s) | Execution Time with Dynamic Partition (s) | Percentage Improvement
8  | 16868.04 | 11244.40 | 33.34%
16 | 11121.66 | 7859.89  | 29.33%
32 | 9093.39  | 6092.23  | 33%
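The percentage improvement figures are consistent with the usual relative reduction in execution time, (static - dynamic) / static. The slides do not state the formula, so this is an inferred check rather than the authors' definition; the function name is hypothetical.

```python
def percentage_improvement(static_time, dynamic_time):
    """Relative reduction in execution time, expressed as a percentage."""
    return (static_time - dynamic_time) / static_time * 100.0


# 64K problem on 8 processors: (2441.88 - 1540.58) / 2441.88
print(round(percentage_improvement(2441.88, 1540.58), 2))
```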
Slide 26
Memory-based Proactive Runtime Partitioning
- Optimize performance using a memory-based approach: minimize the number of page faults and balance work among processors.
- Memory function model for RM3D: W is the application workload, and a_i are PF-based heuristics.
- Memory-based processor grouping and workload partitioning:
  - Processors fall into lightly (X-), moderately (X), or heavily (X+) loaded groups based on a 2-level threshold, with N-, N, and N+ processors respectively.
  - Work in group X- transferred to X+ with unit of work being
  - Sort processors in X+ in ascending order of available memory.
  - Checks are made for processors with the corresponding least available memory.
  - Threshold conditions for work transfers must be met.
- After work transfers, new memory-based work partitioning ratios are computed as
Slide 27
Memory-based Proactive Runtime Partitioning
- Better performance in moderately and heavily loaded scenarios:
  - Most processors have less available memory.
  - Frequent page faults result in long application delays.
  - The memory-based algorithm yields better performance.

Memory-based proactive adaptation performance gain for RM3D application with base grid size 128*32*32 on 8 processors
Evaluation Scenario | Execution time without memory adaptation (seconds) | Execution time with memory adaptation (seconds) | Percentage improvement
Lightly loaded    | 6922.14  | 5210.87 | 24.72%
Moderately loaded | 15890.47 | 7401.61 | 53.42%
Heavily loaded    | 16962.1  | 8284.84 | 51.16%
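The processor grouping step of the memory-based scheme can be sketched as follows. The slides do not preserve the threshold values or the quantity the thresholds are applied to, so this sketch assumes the fraction of memory in use is compared against two hypothetical thresholds `low` and `high`. It covers only the grouping and the ascending sort of the heavily loaded group, not the work-transfer rules.

```python
def group_by_memory(available, total, low, high):
    """Group processor indices into light / moderate / heavy by memory use.

    available[p] and total[p] are in the same unit (for example MB).
    low and high are the two assumed threshold levels on the used fraction.
    """
    groups = {"light": [], "moderate": [], "heavy": []}
    for p, (avail, tot) in enumerate(zip(available, total)):
        used_fraction = 1.0 - avail / tot
        if used_fraction < low:
            groups["light"].append(p)
        elif used_fraction <= high:
            groups["moderate"].append(p)
        else:
            groups["heavy"].append(p)
    # Heavily loaded processors in ascending order of available memory,
    # so the processor with the least available memory is examined first.
    groups["heavy"].sort(key=lambda p: available[p])
    return groups
```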
Slide 28
CPU-based Proactive Runtime Partitioning
- The adaptive system-sensitive partitioner uses system capacities and the obtained performance function to compute the relative computational capacities of each processor.
- System capacity calculation: for N processors, the total work to be assigned is L.
- The runtime monitors application and system state.
  - Application state: level of refinement; number, shape, and aspect ratio of refined patches.
  - System state: computational load, memory availability, link bandwidth.
- The performance engine selects the appropriate performance function (PF) to predict the execution time of the application on each processor k for the next time step.
- The PF of RM3D on processor k for a given load X1 and AMR level X2 is empirically defined as:
Slide 29
CPU Based Proactive System Sensitive Runtime Partitioning

CPU-based proactive partitioning performance gain on 16 processors (base grid size: 64*16*16)
Scenarios | Execution time w/o CPU adaptation (seconds) | Execution time with CPU adaptation (seconds) | Percentage Improvement
Lightly loaded    | 2126.06 | 727.17  | 65.8%
Moderately loaded | 2301.15 | 1641.73 | 28.66%
Heavily loaded    | 2378.25 | 1624.15 | 31.71%
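The idea of dividing the total work L among N processors in proportion to their relative capacities can be sketched as below. The capacity formula is not preserved in the slides; this sketch assumes a weighted sum of normalized CPU, memory, and bandwidth availability, with hypothetical weights, which is one common way system-sensitive partitioners define capacity. It does not use the empirical RM3D performance function.

```python
def relative_capacities(cpu, mem, bw, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Relative capacity of each processor; the result sums to 1.

    cpu, mem, bw: per-processor availability of each resource (any unit).
    weights: assumed importance of CPU, memory, and bandwidth; must sum to 1.
    """
    def normalize(values):
        total = sum(values)
        return [v / total for v in values]

    w_cpu, w_mem, w_bw = weights
    return [w_cpu * c + w_mem * m + w_bw * b
            for c, m, b in zip(normalize(cpu), normalize(mem), normalize(bw))]


def assign_work(total_work, capacities):
    """Give each processor a share of the total work proportional to its capacity."""
    return [total_work * c for c in capacities]
```

Setting the weights to `(1, 0, 0)` reduces this to a purely CPU-based split, which is the closest analogue of the CPU adaptation evaluated in the table above.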
Slide 30
Autonomia Self-Healing
[Figure: self-healing architecture. Labels: user application, Application Management Editor, AIK, Autonomic Runtime System, Application Runtime Manager, Application Fault Manager, Component Fault Manager, autonomic middleware services (self-healing service, event server, mobile agent system), heterogeneous environment. The self-healing service contains a monitoring and analyzing engine and a planning and execution engine built around shared knowledge.]
Slide 31
Self-Healing Engine
Slide 32
Self-Protection Methodology
[Figure: online monitoring of the real network running environment feeds adaptive analysis (data mining, statistic engine), which drives the self-healing engine.]
Slide 33
Measurement Attributes for Different Protocols
Inside a network element, the measurement attributes can be monitored at different protocol layers. During an attack (DoS attack, SQL Slammer worm, email worm, etc.), significant behaviors will be observed.

Impacted protocols and measurement attributes
- App layer (HTTP, DNS, SMTP, POP3): IF, the invocation frequency of emails; NIP/NOP, the number of incoming/outgoing PDUs.
- Transport layer (TCP/UDP): NIP/NOP, the number of incoming/outgoing PDUs.
- Network layer (IP/ICMP/ARP): NIP/NOP, the number of incoming/outgoing PDUs; AR, the ARP request rate.

Observed behaviors
- IF increases by 2 to 3 orders of magnitude.
- NIP/NOP increases by 1 to 2 orders of magnitude compared with the normal scenario.
- AR increases by 1 to 2 orders of magnitude.
Slide 34
Illustrative Network Example
- Router-to-router links are 100 Mbps. Router-to-client-node links are 30 Mbps and 10 Mbps.
- Client networks: 150 clients, 30 routers.
- Server networks: 12 routers and 30 servers.

Traffic Configuration
- Legitimate client traffic through the same interface as attack traffic, to other servers.
- Legitimate client traffic through a different interface, to the attacked server.
- Legitimate client traffic through the same interface, to the attacked server and towards attack targets.
- Legitimate server traffic (heavy) through a different interface and towards other clients.
- Attack traffic.

[Figure labels: Client Net 0, Client Net 1, Client Net 2, Client Net 3, Server Net 1, Server Net 2]
Slide 35
Abnormality Distance (AD)
- The Abnormality Distance of a measurement attribute is used as an abnormality metric for profile modeling of the component behavior. It is defined from the online measurement of attribute k together with the mean and variance of that attribute under normal operating conditions.
- The figure on the right of the slide shows AD tcp_out, based on a single measurement attribute; a larger magnitude of AD tcp_out indicates abnormal behavior that might be due to an attack.
[Figure: AD TCP-out plotted against packet number.]
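A minimal sketch of such a metric follows. The AD equation is not preserved in the slide text, so this assumes one natural form built from the quantities the slide names: the squared deviation of the online measurement from the normal-operation mean, divided by the normal-operation variance. The function name and the choice of population variance are assumptions.

```python
from statistics import mean, pvariance


def abnormality_distance(value, normal_samples):
    """Assumed form: (value - mean)^2 / variance, using normal-operation statistics."""
    mu = mean(normal_samples)
    var = pvariance(normal_samples, mu)
    if var == 0:
        raise ValueError("normal-operation samples have zero variance")
    return (value - mu) ** 2 / var
```

A measurement equal to the normal mean gives a distance of 0, and the distance grows quadratically as the measurement moves away from normal behavior, which matches the slide's reading that a larger magnitude signals a possible attack.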
Slide 36
Multivariate Analysis Techniques on Network Attack Detection
Measurement attributes
- tcpOut: legitimate outgoing TCP segments rate.
- tcpTotal: legitimate outgoing and spoofed outgoing TCP segments rate.
- NRC: Normal Region Center, which is the baseline profile for the normal state.
- AD: Abnormality Distance.
[Figure: the normal region in the tcpOut/tcpTotal plane, bounded by UCL tcpout, LCL tcpout, UCL tcptotal, and LCL tcptotal, with the NRC at its center and the AD measured from the NRC to an observed point A.]
Slide 37
Validation on Attacker Side: Spoofed TCP SYN Attack
- Attack intensity and duration are adjustable.
- TCP SYN attack traffic is spoofed.
- The number of incoming/outgoing packets alone will not reveal the existence of the attack.
- Analyzing it jointly with the total TCP network activity can reveal the attack.
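The joint check on tcpOut and tcpTotal can be illustrated with per-attribute control limits. The slides name UCL and LCL but do not say how they are computed; this sketch assumes Shewhart-style limits of mean plus or minus k standard deviations over normal-operation samples, with k = 3 as a hypothetical default. Function names are illustrative.

```python
from statistics import mean, pstdev


def control_limits(normal_samples, k=3.0):
    """Assumed (LCL, UCL) = mean -/+ k standard deviations of normal samples."""
    mu, sigma = mean(normal_samples), pstdev(normal_samples)
    return (mu - k * sigma, mu + k * sigma)


def abnormal_attributes(observation, limits):
    """Names of the attributes whose observed value falls outside its limits."""
    return [name for name, value in observation.items()
            if not (limits[name][0] <= value <= limits[name][1])]


# Normal-operation samples (made-up numbers for illustration only).
baseline = {"tcpOut": [100, 102, 98, 100], "tcpTotal": [100, 102, 98, 100]}
limits = {name: control_limits(samples) for name, samples in baseline.items()}

# During a spoofed SYN attack the legitimate rate looks normal,
# but the total rate (legitimate plus spoofed) leaves the normal region.
print(abnormal_attributes({"tcpOut": 101, "tcpTotal": 500}, limits))
```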
Slide 38
Autonomia Self-Protection Architecture
[Figure: online monitoring collects raw traffic w.r.t. metric 1, metric 2, ..., metric n. The analysis engine applies information theory and an abnormality function w.r.t. metrics 1..m to produce a normal/abnormal characterization. The autonomic runtime engine and policy translator then change the network topology or the network configuration parameters.]
Slide 39
Working Flow of the Analysis Engine
1. Information theory is used to identify the most important features that can be extracted from network data.
2. A genetic algorithm is run on the training data to obtain the threshold and coefficients used by the linear rule for detection.
3. The threshold and coefficients are used to detect a wide range of attacks during the testing period.
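Step 1 ranks features by the mutual information I(X;Y) between a feature X and the class label Y. A minimal sketch for discrete features follows, estimating the probabilities from sample counts. The logarithm base is not stated in the slides; base 2 (bits) is assumed here, and the function name is illustrative.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical I(X;Y) in bits for two equal-length sequences of discrete values."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts.
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi
```

A feature that never varies carries no information about the label and scores 0, while a feature that determines a balanced binary label exactly scores 1 bit. Features would then be ranked by this score, highest first.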
Slide 40
Network Attack Feature Extraction

I(X;Y) of discrete features, by dataset
Feature(X) | Total Dataset | DoS + Normal | U2R + Normal | R2L + Normal | Probe + Normal
Is_hot_login   | 0     | 0      | 0      | 0      | 0
Land           | 0     | 0      | 0      | 0      | 0
Root_shell     | 0     | 6e-06  | 0.003  | 0.0002 | 1.4e-5
Su_attempt     | 0     | 5.3e-6 | 0      | 2.8e-5 | 7e-06
Is_guest_login | 0.006 | 0.0018 | 0      | 0.0144 | 0.0022
Flag           | 0.062 | 0.0629 | 0.0001 | 0.0033 | 0.0807
Protocol_type  | 0.304 | 0.3116 | 7.3e-5 | 0.0039 | 0.0386
Logged_in      | 0.381 | 0.3931 | 5.2e-5 | 0.0021 | 0.0701
service        | 0.571 | 0.5927 | 0.003  | 0.0505 | 0.1243

Discrete Features
- The base dataset has a larger sample size.
- Discrete features provide little semantic information.
Slide 41
Network Attack Feature Extraction (Cont.)

Discrete Features on Total Dataset
Feature(X) | I(X;Y)
service        | 0.571
logged_in      | 0.381
protocol_type  | 0.304
flag           | 0.062
Is_guest_login | 0.006
su_attempt     | 0
root_shell     | 0
Land           | 0
Is_hot_login   | 0

Continuous Features on Total Dataset
Feature(X) | I(X;Y)
count              | 0.613353
dst_bytes          | 0.504773
srv_count          | 0.326754
src_bytes          | 0.282306
same_srv_rate      | 0.079569
srv_serror_rate    | 0.066003
serror_rate        | 0.061391
dst_host_count     | 0.053339
Duration           | 0.050635
dst_host_srv_count | 0.024559
num_root           | 0.002558
rerror_rate        | 0.001

Continuous Features
- Compared with the discrete features, some continuous features provide more information for the final detection.
- The information provided by the continuous features is much more meaningful.
- A partition strategy is used to discretize the continuous features.
- Heuristic algorithms (e.g. a genetic algorithm) are used to determine the optimal partition.
- Combining both discrete and continuous features provides a better detection rate.
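The partition strategy for continuous features can be illustrated by mapping each value to the index of the interval it falls in, given a sorted list of cut points. How the cut points are chosen (for example by a genetic algorithm, as the slide suggests) is outside this sketch; the function name and the sample cut points are hypothetical.

```python
from bisect import bisect_right


def discretize(values, cut_points):
    """Map each continuous value to the index of its partition interval.

    With cut points c0 < c1 < ..., interval 0 is (-inf, c0), interval 1 is
    [c0, c1), and so on; the last interval is [c_last, +inf).
    """
    cuts = sorted(cut_points)
    return [bisect_right(cuts, v) for v in values]


# Example: bucket a byte-count style feature into three bins.
print(discretize([5, 10, 50, 1000], cut_points=[10, 100]))
```

Once discretized this way, a continuous feature can be scored with the same discrete mutual-information measure used for features such as service or flag.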
Slide 42
Experimental Results
We compare our approach, which is based on discrete features, with a fuzzy classifier evolved using Ctree and with the winning entry of the KDDCup99 contest.

Class | Our Approach | Ctree | Winner Entry
Normal | 98.34% | 92.78% | 99.5%
DoS    | 99.33% | 98.91% | 97.1%
U2R    | 63.64% | 88.13% | 13.2%
R2L    | 5.86%  | 7.41%  | 8.4%
PROBE  | 93.95% | 50.35% | 83.3%
Slide 43
Results: Discrete vs. Continuous and Combined
We compare the results of using discrete and continuous features respectively.

Class | Results using Discrete Features | Results using Continuous Features | Results using Combined Features
Normal | 98.34% | 98.45% | 99.98%
DoS    | 99.33% | 99.93% | 99.98%
U2R    | 63.64% | 75.34% | 98%
R2L    | 5.86%  | 41.34% | 80%
PROBE  | 93.95% | 99.91% |
Slide 44
Summary and Concluding Remarks
- Increased complexity, heterogeneity, uncertainty, and scale require new paradigms to design, control, and manage systems and applications.
- Systems and applications need to operate reliably, securely, efficiently, and cost-effectively.
- A holistic approach is needed that can dynamically integrate and address all these issues simultaneously at the layers of the system and application hierarchy.
- Autonomic computing provides an interesting, pragmatic approach to address these issues.
- Many challenges are ahead, including composing and analyzing in real time the operations and states of systems and applications.
- New bio-inspired metrics are needed that accurately characterize and quantify the normal and abnormal states of systems and applications.