A TECHNIQUE FOR IMPROVING THE SCHEDULING OF NETWORK COMMUNICATING PROCESSES IN MOSIX
By
RENGAKRISHNAN SUBRAMANIAN B.E., South Gujarat University, 1998
A REPORT
Submitted in partial fulfillment of the
requirements for the degree
MASTER OF SCIENCE
Department of Computing and Information Sciences College of Engineering
KANSAS STATE UNIVERSITY
Manhattan, Kansas December, 2002
Approved by:
Major Professor
Dr. Daniel Andresen
ABSTRACT
MOSIX is a software tool for supporting cluster computing. The core of the MOSIX technology is the capability of multiple workstations and servers to work cooperatively as if part of a single system. The primary job of MOSIX is to distribute (and redistribute) the processes you create, transparently, among various MOSIX-enabled workstations and servers, to obtain the best possible performance. When two processes that communicate over the network are created and then distributed over the network, MOSIX still binds them to the base workstation where they were created for system calls. This means that the communicating processes talk to each other through their base node and not directly. This report examines this communication method and the reasons behind it, and proposes a technique that improves on it without changing the basic architecture. The proposed technique uses the firewalling system called IPTables, available in the Linux operating system, to let the processes keep communicating through their base node while improving their performance to nearly what it would be if they communicated directly.
TABLE OF CONTENTS
LIST OF FIGURES ......................................................................................................... III
LIST OF TABLES...........................................................................................................IV
LIST OF EQUATIONS ....................................................................................................V
ACKNOWLEDGEMENTS.............................................................................................VI
DEDICATION..................................................................................................................VII
1. INTRODUCTION AND BACKGROUND..............................................................1
1.1 MOSIX................................................................................................................... 1
1.2 Structure of MOSIX network-communicating processes ............................ 3
1.3 The Solution........................................................................................................... 6
2 APPROACH TOWARDS THE SOLUTION.........................................................7
2.1 Introduction – The “triangle routing” ................................................................ 7
2.2 Reasoning ............................................................................................................... 8
2.3 Timing Analysis................................................................................................... 11
2.4 Architecture ......................................................................................................... 15
3 IMPLEMENTATION...............................................................................................17
3.1 Environment information .................................................................................. 17
3.2 IPTables ............................................................... 18
3.2.1 About ................................................................ 18
3.2.2 Netfilter Architecture ............................................... 21
3.2.3 NAT background ....................................................... 22
3.2.4 NAT Architecture in IPTables ......................................... 24
3.2.5 NAT example usage .................................................... 25
3.2.6 Performance Evaluation of IPTABLES ................................... 28
3.2.7 Importance of performance evaluation ................................. 30
3.3 IPTables for the problem at hand ..................................................................... 32
3.3.1 How .................................................................. 32
3.3.2 Actual Rules ......................................................... 32
3.3.3 How do these rules work? ............................................. 33
4 TESTING.................................................................................................................37
4.1 Purpose................................................................................................................. 37
4.2 Environment ........................................................................................................ 38
4.3 Test Procedures ........................................................ 38
4.3.1 General .............................................................. 38
4.3.2 MOSIX ................................................................ 40
4.3.3 IPTables ............................................................. 41
4.3.4 Direct communication ................................................. 42
5 RESULTS................................................................................................................43
5.1 MOSIX................................................................................................................. 43
5.2 IPTables ............................................................................................................... 45
5.3 Direct Communication ....................................................................................... 46
5.4 Summary.............................................................................................................. 46
6 CONCLUSION........................................................................................................51
6.1 Observations ........................................................................................................ 51
6.2 Inferences............................................................................................................. 51
6.3 Future Work ........................................................................................................ 52
7 RELATED RESEARCH........................................................................................53
8 REFERENCES .......................................................................................................55
LIST OF FIGURES
FIGURE 1-1: ORIGIN OF PROCESSES – I ........................................ 4
FIGURE 1-2: ORIGIN OF PROCESSES – II ....................................... 4
FIGURE 1-3: COMMUNICATION OF PROCESSES AFTER MIGRATION BY MOSIX ............ 5
FIGURE 1-4: BEFORE MIGRATING PROCESS B ..................................... 6
FIGURE 1-5: AFTER MIGRATING PROCESS B ...................................... 6
FIGURE 2-1: PROCESSES A AND B .............................................. 7
FIGURE 2-2: PROCESS B IS MIGRATED TO NODE C ................................ 8
FIGURE 2-3: MICROSCOPIC VIEW ............................................... 10
FIGURE 2-4: TIME ANALYSIS .................................................. 14
FIGURE 2-5: FLOWCHART ...................................................... 16
FIGURE 3-1: IPTABLES WORKING FLOWCHART (FROM [AMERICO02PERFORMANCE]) ....... 20
FIGURE 3-2: PACKET TRAVERSING IN NETFILTER (FROM [RUSTY02LINUXNAT]) ........ 21
FIGURE 3-3: NAT ARCHITECTURE, IPTABLES (FROM [RUSTY02LINUXNAT]) ............ 25
FIGURE 3-4: HOW DO THESE RULES WORK? STEP 1 ................................ 34
FIGURE 3-5: HOW DO THESE RULES WORK? STEP 2 ................................ 35
FIGURE 3-6: HOW DO THESE RULES WORK? STEP 3 ................................ 36
FIGURE 3-7: HOW DO THESE RULES WORK? STEP 4 ................................ 37
FIGURE 4-1: GENERAL TEST PROCEDURE ......................................... 38
FIGURE 4-2: MOSIX TEST PROCEDURE: STEP 1 ................................... 40
FIGURE 4-3: MOSIX TEST PROCEDURE: STEP 2 ................................... 41
FIGURE 4-4: IPTABLES TEST PROCEDURE ........................................ 42
FIGURE 4-5: DIRECT COMMUNICATION TEST PROCEDURE ............................ 43
FIGURE 5-1: EXECUTION TIME COMPARISON CHART ................................ 49
FIGURE 5-2: BANDWIDTH COMPARISON CHART ..................................... 49
FIGURE 5-3: %CPU UTILIZATION COMPARISON CHART .............................. 50
FIGURE 5-4: LOAD AVERAGE COMPARISON CHART .................................. 50
LIST OF TABLES

TABLE 3-1: PERFORMANCE EVALUATION PARAMETERS OF [AMERICO02PERFORMANCE] ..... 29
TABLE 5-1: MOSIX TEST RESULT ............................................... 44
TABLE 5-2: IPTABLES TEST RESULT ............................................ 45
TABLE 5-3: DIRECT COMMUNICATION TEST RESULT ................................ 46
TABLE 5-4: COMPARISON OF LATENCY ........................................... 47
TABLE 5-5: COMPARISON OF BANDWIDTH ......................................... 47
TABLE 5-6: COMPARISON OF CPU UTILIZATION ................................... 48
TABLE 5-7: COMPARISON OF LOAD AVERAGE ...................................... 48
LIST OF EQUATIONS
EQUATION 2-1: DISSECTION OF TIME TAKEN BY A PACKET IN ITS PROCESS'S UHN .... 13
EQUATION 2-2: TIME SAVED BY THE PACKET ..................................... 14
EQUATION 3-1: TIME TAKEN FOR PROCESSING A TCP PACKET, 1400 BYTES AND 10 RULES .. 30
EQUATION 3-2: TIME SAVED IF PACKETS ARE REDIRECTED AT THE FIREWALL ......... 31
EQUATION 3-3: RECALCULATED TIME SAVED FOR PACKETS REDIRECTED AT FIREWALL ... 31
ACKNOWLEDGEMENTS

I sincerely thank Prof. Daniel Andresen, my major professor, for giving me
encouragement, timely advice, guidance and facilities to complete this project. I also
thank him for being flexible, adjusting and patient during the course of this project.
I would like to thank Prof. Masaaki Mizuno and Prof. William H. Hsu for serving on my committee. I would like to thank Prof. Mitchell L. Neilsen for agreeing to proxy during my final examination.
I would like to thank Ms. Delores Winfough for patiently helping me out in
understanding the policies of the graduate school.
I thank Mr. Jesse R. Greenwald and Mr. Daniel R. Lang for helping me solve my day-to-day problems with my experiments. I thank Mr. Thomas J. Rothwell for partnering with me during the initial period of the project.
I would like to thank Mr. Ashish Sharma for help with the benchmark programs. I thank Mr. Sadanand Kota and Mr. Madhusudhan Tera for help with using MOSIX.
DEDICATION
To my parents
1. Introduction and Background

This report discusses the scheduling technique used by MOSIX on processes that communicate over the network, explores the reasons behind that technique, and proposes a new technique that improves the performance of processes communicating over the network.
The first section introduces MOSIX and the architecture of network-communicating processes in MOSIX. The second section discusses the approach and architecture of the solution. The third section discusses the implementation in detail. The fourth section describes the tests done to evaluate the solution and why those tests were conducted. The fifth section discusses the results of these experiments and, with the help of graphs, the performance improvement of this solution over the native MOSIX scheduling technique. The final section draws conclusions and comments on future work that could build on this solution.
1.1 MOSIX

MOSIX is a software tool for supporting cluster computing. It consists of kernel-level, adaptive resource-sharing algorithms that are geared for high performance, overhead-free scalability and ease of use of a scalable computing cluster. The core of the MOSIX technology is the capability of multiple workstations and servers (nodes) to work cooperatively as if part of a single system. (MOSIX stands for Multi-computer Operating System for UnIX; a MOSIX-enabled workstation or server is called a “node” hereafter.) The algorithms of MOSIX are designed to
respond to variations in the resource usage among the nodes by migrating processes from
one node to another, preemptively and transparently, for load-balancing and to prevent
memory depletion at any node. MOSIX is scalable and it attempts to improve the overall
performance by dynamic distribution and redistribution of the workload and the resources
among the nodes of a computing-cluster of any size. MOSIX conveniently supports a
multi-user time-sharing environment for the execution of both sequential and parallel
tasks. [barak99scalable ]
MOSIX can make a Linux cluster of x86-based workstations and servers run almost like an SMP. The main purpose of MOSIX is that when you create (one or more) processes on your login node, MOSIX distributes (and redistributes) your processes, transparently, among the nodes to obtain the best possible performance. The core of MOSIX is a set of adaptive management algorithms that continuously monitor the activities of the processes against the available resources, in order to respond to uneven resource distribution and to take advantage of the best available resources. [mosix02web]
The algorithms of MOSIX use preemptive process migration to provide:
• Automatic work distribution - for parallel processing or to migrate processes from
slower to faster nodes.
• Load balancing - for even work distribution.
• Migration of processes from a node that runs out of main memory, to avoid swapping or thrashing.
• Migration of an intensive I/O process to a file server.
• Migration of parallel I/O processes from a client node to file servers.
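Because migration is preemptive and transparent, ordinary Unix processes need no special API to benefit from the distribution above. As a minimal sketch (plain shell; nothing in the script itself is MOSIX-specific), a user on the login node could simply fork several CPU-bound workers and let MOSIX spread them across the cluster:

```shell
#!/bin/sh
# Fork four CPU-bound workers; under MOSIX these ordinary processes
# become candidates for automatic migration to less-loaded nodes.
for i in 1 2 3 4; do
    sh -c 'n=0; while [ $n -lt 100000 ]; do n=$((n+1)); done' &
done
wait                      # block until every worker has finished
echo "all workers finished"
```

On a plain Linux host the workers all run locally; on a MOSIX cluster the same unmodified script would see its workers migrated transparently.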
1.2 Structure of MOSIX network-communicating processes

MOSIX supports preemptive (completely transparent) process migration (PPM). After a
migration, a process continues to interact with its environment regardless of its location.
To implement the PPM, the migrating process is divided into two contexts: the user context, which can be migrated, and the system context, which is UHN-dependent3 and may not be migrated. [barak99scalable]
The user context, called the remote, contains the program code, stack, data, memory maps and registers of the process. The remote encapsulates the process when it is running at the user level. The system context, called the deputy, contains a description of the resources to which the process is attached, and a kernel stack for the execution of system code on behalf of the process. The deputy encapsulates the process when it is running in the kernel. It holds the site-dependent part of the system context of the process; hence it must remain in the UHN of the process. While the process can migrate many times between different nodes, the deputy is never migrated. [barak99scalable]
The processes that MOSIX migrates for automatic work distribution and load balancing also include processes that communicate over the network, as well as processes that reside in the same workstation and communicate with each other. The structure explained above can be seen pictorially in the specific scenarios below.
There are at least two scenarios that arise while exploring the origin of processes that
communicate over the network.
3 Unique Home Node: the node where the process is created.
Scenario a) The communicating processes originated in the same node; each may or may not have since been migrated to a different node.

Scenario b) The communicating processes originated in different nodes; each may or may not have since been migrated to a different node.
Figure 1-1: Origin of processes – I
Figure 1-2: Origin of processes – II
In Figure 1-1, the communicating processes (A and B) originate from Node a and can be migrated by MOSIX to any of the nodes in the cluster. Let us say process B has been migrated to Node b. In such a case, MOSIX still binds process B to its node of origin
(which is Node a) and routes all the communicating packets from B to A through Node a itself, which is expected, since A resides there.
In Figure 1-2, process A can originate from Node a and its counterpart, process B, from Node b; either can be migrated to any of the nodes in the cluster. Let us say process B is now migrated to Node a by MOSIX. But now, since Node b is the node of origin for B, all communicating packets from B to A traverse Node b, instead of B communicating directly with A (which is in the same node). Hence, the communicating processes look like Figure 1-3:
Figure 1-3: Communication of processes after migration by MOSIX
Let us discuss another variation of Scenario b. In this case, processes A and B have their UHNs at Nodes a and b respectively. Process B is later moved to Node c. In such a case, the new communication picture looks like the following set of figures.
Figure 1-4: Before migrating process B
Figure 1-5: After migrating process B
From the figures above, it is seen that process B, though moved to Node c, communicates with its counterpart process A through its UHN (which is still Node b). Process B should communicate with its counterpart directly instead of going through its UHN. The underlying problem is now obvious: this redirection by MOSIX increases latency and, in many cases, causes inefficiency in the whole system. It needs to be rectified so that B contacts A directly, instead of its packets traveling through its UHN.
1.3 The Solution
The technique discussed in this report achieves better performance for the communicating processes in terms of decreased latency, increased bandwidth, and better load averages on the nodes for various configurations.
2 Approach towards the solution
2.1 Introduction – The “triangle routing”

As discussed in the previous section, the problem with these processes is their binding to their UHN. Let us take the following example to discuss the approach.
Figure 2-1: Processes A and B
Processes A and B are two processes that are communicating over the network. Now,
MOSIX migrates process B to a different Node c.
Figure 2-2: Process B is migrated to Node c
Now, the communications between processes A and B happen through Node b. As discussed in previous sections, the deputy (system context) of process B still resides in Node b. This means that the MOSIX code in Node b identifies the packets from Node a that are meant for process B and notes that process B now resides in Node c. So, after identifying a packet from Node a, MOSIX in Node b redirects it to process B in Node c.
2.2 Reasoning
Let us now break the communication into parts and work through the layers that each packet from process A goes through while communicating with process B. Process A resides in user space. It asks the kernel for a socket. (Let us assume that process A is a client seeking service from process B; this example will be used in the rest of the paper.) The kernel provides a socket to the user-space process A. The process then identifies that it has to go through the TCP/IP stack of
the kernel. So, it goes down through the TCP/IP stack, then through the firewalling system, and on down to the lower layers such as the device drivers, the MAC layer and the physical layer. At the receiving end, process B has already opened a port and is waiting for a connection through a socket from any process that needs its service. When A requests a service from B, A's packet goes out of the socket, through the kernel, through the TCP/IP stack (adding a header at each layer), through the firewalling system, and through the lower layers to the lower layers of Node b (the UHN of process B). The lower layers of Node b identify the packet from Node a and take it up through the firewalling system and the TCP/IP stack (stripping the headers one by one), up to the kernel / user-space border
to MOSIX. Now, MOSIX looks at the packet, determines that process B no longer resides in Node b, and redirects the packet to its new destination, Node c (the node where process B now resides). The packet again takes the same path as it took in Node a, goes down to the physical layer, and contacts the lower layers of Node c. The same process happens in Node c, and the packet from process A on Node a finally reaches its destination, process B on Node c.
The above communication can be redrawn with a microscopic view as shown below.
Figure 2-3: Microscopic View
This extra path that the packet takes in Node b increases the latency of the packet, because the packet takes a roundabout route to its destination. Moreover, this roundabout route consumes a chunk of the bandwidth available between Nodes a and b and between Nodes b and c. It also consumes time in MOSIX, which must decide what to do with every such packet arriving from A for B at Node b. On top of that, this could happen for many such processes from
Node a communicating with many other processes in Node c whose UHN is Node b. Since MOSIX is installed on all nodes of the cluster, the MOSIX scheduler in every node spends considerable time on every such redirection that reaches it. This naturally decreases the performance of the whole system.
Now, suppose there were a method to redirect these packets, which arrive at Node b from Node a, towards Node c at a lower layer: a layer that can filter such packets, that does not spend much time deciding the fate of a packet, and at which a network address translation can be performed on the incoming packet before redirecting it towards Node c. Such a method would drastically reduce the load that Node b takes to handle redirection. It would also decrease the latency of the packet and eventually increase the bandwidth of the whole system.
How can all this be achieved? Which of the lower layers can do all this with ease? Naturally, the answer lies in the firewalling system. The firewall can intercept packets and filter them; it can identify the headers of a packet and its destination; it can perform a network address translation and redirect the packet to a different destination. It resides at a very low level in the networking stack and can therefore be used effectively without removing and adding headers all the way up and down the network layers.
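To make the idea concrete, a destination NAT rule of the kind IPTables supports could, in principle, perform this redirection at the firewall layer. The addresses and port below are purely hypothetical, and these are not the actual rules used later in this report:

```shell
# Hypothetical sketch: Node b is 192.168.1.2 and process B now runs on
# Node c at 192.168.1.3, listening on an assumed port 4000. Rewrite the
# destination of matching packets as they arrive, before they climb the
# stack towards MOSIX.
iptables -t nat -A PREROUTING -p tcp -d 192.168.1.2 --dport 4000 \
         -j DNAT --to-destination 192.168.1.3:4000
```

Because the PREROUTING chain is evaluated before routing, a matching packet never has to travel up to MOSIX on Node b; it is re-sent towards Node c at the lowest practical point.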
2.3 Timing Analysis

The basic purpose of the analysis given above was to dissect the problem into smaller segments and understand where the delay is and where the architecture could be improved. In this section, the time spent by a packet in its process's UHN is dissected in detail; the architecture for solving the problem with the help of the firewalling system is then explained in the next section.
The time taken by each packet to travel from process A in Node a to process B in Node c in the discussion above can be divided into the following parts:
• Time taken by the packet to travel from Node a to the lower layers of Node b
• Time taken by the packet to travel from the lower layers of Node b up to MOSIX in Node b
• Time taken by the packet to travel from MOSIX in Node b back down to the lower layers of Node b
• Time taken by the packet to travel from the lower layers of Node b to process B
In the above dissection, we are most concerned with the time taken by the packet to travel from the lower layers of Node b up to MOSIX and back from MOSIX to the lower layers. It is in this particular interval that the firewalling technique will do the redirection. So, let us break this interval down further. It looks like the following (refer to Figure 2-4):
Total time taken by a packet to travel from the physical (lower) layer of Node b to MOSIX and back to the physical layer =

    (time from the physical layer to the firewall, say T1) +
    (time from the firewall to the TCP/IP layer, say T2) +
    (time from the TCP/IP layer to the socket layer, say T3) +
    (time from the socket layer to MOSIX, say T4) +
    (time taken by MOSIX to decide on the fate of the packet, say T5) +
    (time from MOSIX back to the kernel / socket layer, say T6) +
    (time from the socket layer to the TCP/IP layer, say T7) +
    (time from the TCP/IP layer to the firewall, say T8) +
    (time from the firewall to the physical layer, say T9)
Equation 2-1: Dissection of time taken by a packet in its processes’ UHN
Figure 2-4: Time Analysis
As proposed earlier, the aim is to keep the packet from traveling all the way up to MOSIX by intercepting it at the firewall layer. In that case, the amount of time saved by intercepting the packet at the firewall layer will be:
Time saved =
    (T2 + T3 + T4 + T5 + T6 + T7 + T8) –
    (time taken by the firewall layer to match and redirect the packet)
Equation 2-2: Time saved by the packet
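Writing T_fw for the time the firewall layer itself needs to match and retarget a packet (a quantity the text does not measure separately), the dissection above can be restated compactly, under the assumption that interception at the firewall bypasses T2 through T8:

```latex
T_{\mathrm{Node\,b}} = \sum_{i=1}^{9} T_i , \qquad
T_{\mathrm{saved}} = \sum_{i=2}^{8} T_i \, - \, T_{fw}
```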
It is not necessary to calculate T3 or T4 or any individual timing above the firewall layer for our purpose, because the tests described later in the report yield the summation of these times, (T3 + T4 + T5 + T6 + T7).
Hence, the rest of this paper concentrates on the firewalling techniques: how to intercept the desired packets, how to perform a network address translation and how to redirect
them to the right node (Node c), where they should eventually go, rather than on the detailed timing of packets above the firewall layer.
2.4 Architecture

The primary aim of the solution is to intercept the packet and do a corresponding network address translation. The step-by-step procedure can be perceived with the help of the following flowchart.
Figure 2-5: Flowchart
The architecture is pretty straightforward. A set of rules is first written to intercept the
necessary packets and do the corresponding network address translation on them. The
firewall waits for a packet to arrive. As soon as a packet arrives, the firewall intercepts
the packet. The firewall checks every arriving packet against the set of rules already written for it. If a packet matches a rule or rule set, the action written in the rule is applied to the packet: here, the packet is stopped from going up the network layers and is instead directed down to the physical layer towards its new destination. If the packet does not match the rule set, it is sent up the network stack. The firewall then waits for the next packet to arrive so that it can intercept it.
3 Implementation
3.1 Environment information

As mentioned earlier in this report, the working environment for MOSIX is Linux on x86
platforms. MOSIX is available in two parts. The first part is the MOSIX core itself, which is applied as a patch to the Linux kernel; the kernel must then be compiled with MOSIX enabled as a configuration option, and the Linux box rebooted for use with MOSIX. The second part is a set of system administrator tools for MOSIX, such as manual migration of processes using their process IDs (PIDs), enabling or disabling auto-migration, and a MOSIX process monitor tool; these can be downloaded separately and installed.
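As an illustration of these administrator tools, the commands below are a hedged sketch: the tool names (`migrate`, `mosctl`, `mon`) follow the MOSIX/openMosix user-space packages of that era, and the exact names and arguments may differ between versions.

```shell
migrate 1234 5    # manually migrate the process with PID 1234 to node 5
mosctl stay       # forbid automatic migration away from this node
mosctl nostay     # re-enable automatic migration
mon               # interactive monitor showing the load of each node
```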
The latest version of MOSIX that was available while running the tests for this report was
MOSIX 1.8.0 for Linux kernel version 2.4.19. Hence, the choice of implementation
environment in this report is restricted to Linux. Red Hat and Debian Linux distributions
were used. Care was taken regarding the version of gcc.
The MOSIX distribution page specifies that “Note: for now do not use distributions that
use gcc-3.2, such as RedHat-8 or Slackware-9. gcc-3.2 is unsuitable for compiling the
kernel”. [mosix02web]
The Linux OS has firewall support built into its kernel, and this firewall system has evolved continuously. As mentioned above, the working environment is Linux kernel version 2.4.19. The firewalling system built into Linux for 2.4.x kernels is called IPTables. It is the redesigned and heavily improved successor of the previous IPChains (for 2.2.x kernels) and IPFwadm (for 2.0.x kernels) systems. In this section, IPTables is explained in detail, along with how IPTables has been used for our purposes. IPTables is also called netfilter.
3.2 IPTables
3.2.1 About5
IPTables is a generic table structure for the definition of rule sets. Each rule within an IP table consists of a number of classifiers (matches) and one connected action (target).
5 This section describes IPTables in general. Most of the information in this section is taken from the documentation of IPTables, as is, or modified slightly for the purpose of this paper. The source for this information is the documentation section of www.netfilter.org, whose author is Rusty Russell [rusty02linuxnetfilter], [rusty02linuxnat]. One more source of information is [americo02performance]. I do not claim credit for the information in the following sections about IPTables, except section 3.2.7.
19
Netfilter is a set of hooks inside the Linux 2.4.x kernel's network stack that allows kernel modules to register callback functions, which are called every time a network packet traverses one of those hooks.
The main features of the netfilter system are:
• Stateful packet filtering (connection tracking)
• All kinds of network address translation
• Flexible and extensible infrastructure
Netfilter, IPTables and the connection tracking as well as the Network Address
Translation subsystems together build the whole framework.
Basically, rules are instructions with pre-defined characteristics to match against a packet. When a match is found, the firewall decides how to handle that packet. Rules are checked in order until a match is found. A rule can be set like this:
iptables [table] <command> <match> <target/jump>
There are three built-in chains, each with a default policy: INPUT – to check the headers of incoming packets, OUTPUT – for outgoing packets/connections, and FORWARD – for packets routed through the machine (e.g. when it is used as a Network Address Translator). Each chain has its own set of rules.
Let us take the following examples:
#iptables -P INPUT ACCEPT
#iptables -A INPUT -p tcp --dport 23 -j DROP

(-P: policy; -A: append; -p: protocol; --dport: destination port; -j: jump)
The first rule sets the default policy so that the firewall system allows any packet from any network to come in. The second rule, appended to the INPUT chain, matches all TCP packets with destination port 23 and drops them.
Figure 3-1: IPTables working flowchart (from [americo02performance])
As seen in the figure above, IPTables has a set of chains, namely the INPUT, OUTPUT and FORWARD chains, which are meant for incoming packets, outgoing packets and packets destined for a third machine, respectively. Under each of these chains, a set of rules can be created. These rules can match packets on protocols, IP addresses, input/output interfaces, MAC addresses, etc. After a packet is matched, its fate can be decided: it can be accepted, dropped, rejected, queued or returned.
3.2.2 Netfilter Architecture
In more detail, Netfilter is a series of hooks at various points in a protocol stack (at this stage, IPv4, IPv6 and DECnet). The (idealized) IPv4 traversal diagram looks like the following:
Figure 3-2: Packet traversing in Netfilter (from [rusty02linuxnat])
On the left is where packets come in: having passed the simple sanity checks (i.e., not
truncated, IP checksum OK, not a promiscuous receive), they are passed to the Netfilter
framework's NF_IP_PRE_ROUTING [1] hook.
Next they enter the routing code, which decides whether the packet is destined for
another interface, or a local process. The routing code may drop packets that are
unroutable.
If it's destined for the box itself, the Netfilter framework is called again for the
NF_IP_LOCAL_IN [2] hook, before being passed to the process (if any).
If it's destined to pass to another interface instead, the Netfilter framework is called for
the NF_IP_FORWARD [3] hook.
The packet then passes a final Netfilter hook, the NF_IP_POST_ROUTING [4] hook,
before being put on the wire again.
The NF_IP_LOCAL_OUT [5] hook is called for packets that are created locally. Here
you can see that routing occurs after this hook is called: in fact, the routing code is called
first (to figure out the source IP address and some IP options).
3.2.3 NAT background
There is more to IPTables than just accepting and dropping packets. This section discusses Network Address Translation in IPTables.
Normally, packets on a network travel from their source (such as your home computer) to
their destination (such as www.gnumonks.org) through many different links. None of
these links really alter the packet: they just send it onward.
If one of these links were to do NAT, it would alter the source or destination of the packet as it passes through, which is not how the system was designed to work. Usually the link doing NAT will remember how it mangled a packet, and when a reply packet passes through the other way, it will do the reverse mangling on that reply packet, so everything works.
Some of the most common uses of NAT can be divided into three categories.
• Most ISPs give you a single IP address when you dial up to them. You can send
out packets with any source address you want, but only replies to packets with
this source IP address will return to you. If you want to use multiple different
machines (such as a home network) to connect to the Internet through this one
link, you'll need NAT. This is commonly known as ‘masquerading’ in the Linux
world.
• Sometimes you want to change where packets heading into your network will go.
Frequently this is because you have only one IP address, but you want people to
be able to get into the boxes behind the one with the ‘real’ IP address. If you
rewrite the destination of incoming packets, you can manage this. This type of
NAT is called port-forwarding.
• Sometimes you want to pretend that each packet which passes through your Linux
box is destined for a program on the Linux box itself. This is used to make
transparent proxies: a proxy is a program which stands between your network and
the outside world, shuffling communication between the two. The transparent part
is because your network won't even know it's talking to a proxy, unless of course,
the proxy doesn't work.
Section 3.3 of this report looks at a method combining the various uses of IPTables mentioned above.
3.2.4 NAT Architecture in IPTables
In IPTables, NAT is divided into two different types: Source NAT (SNAT) and Destination NAT (DNAT).
Source NAT is when you alter the source address of the first packet: i.e. you are changing
where the connection is coming from. Source NAT is always done post-routing, just
before the packet goes out onto the wire. Masquerading is a specialized form of SNAT.
Destination NAT is when you alter the destination address of the first packet: i.e. you are
changing where the connection is going to. Destination NAT is always done before
routing, when the packet first comes off the wire. Port forwarding, load sharing, and
transparent proxying are all forms of DNAT.
In IPTables, we need to create NAT rules which tell the kernel what connections to change, and how to change them. To do this, we use the IPTables tool to alter the NAT table by specifying the ‘-t nat’ option. The ‘-t’ option in IPTables specifies the table that should be used. In section 3.2.1, we used the default table of IPTables, called filter. For doing NAT, we will use the ‘nat’ table.
The table of NAT rules contains three lists called ‘chains’; each rule is examined in order until one matches. Two of the chains are called PREROUTING (for Destination NAT, as packets first come in) and POSTROUTING (for Source NAT, as packets leave). The third is called OUTPUT and will be discussed later.
Figure 3-3: NAT Architecture, IPTables (from [rusty02linuxnat])
IPTables NAT is best described with the help of the diagram above. When a packet passes each of these points, we look up which connection it is associated with. If it is a new connection, we look up the corresponding chain in the NAT table to see what to do with it. The answer applies to all future packets on that connection.
3.2.5 NAT example usage
IPTables takes a number of standard options as listed below. All the double-dash options
can be abbreviated, as long as IPTables can still tell them apart from the other possible
options.
The most important option here is the table selection option, ‘-t’. For all NAT operations,
we will want to use ‘-t nat’ for the NAT table. The second most important option to use is
‘-A’ to append a new rule at the end of the chain (e.g. ‘-A POSTROUTING’), or ‘-I’ to
insert one at the beginning (e.g. ‘-I PREROUTING’).
We can specify the source (‘-s’ or ‘--source’) and destination (‘-d’ or ‘--destination’) of the packets we want to NAT. These options can be followed by a single IP address (e.g. 192.168.1.1), a name (e.g. www.gnumonks.org), or a network address (e.g. 192.168.1.0/24 or 192.168.1.0/255.255.255.0). If we omit the source address option, then any source address will do. If we omit the destination address option, then any destination address will do.
We can specify the incoming (‘-i’ or ‘--in-interface’) or outgoing (‘-o’ or ‘--out-interface’) interface to match, but which one we can specify depends on which chain we are putting the rule into: at PREROUTING we can only select the incoming interface, and at POSTROUTING we can only select the outgoing interface. If we use the wrong one, IPTables will give an error.
We can also indicate a specific protocol (‘-p’ or ‘--protocol’), such as TCP or UDP; only packets of this protocol will match the rule. The main reason for doing this is that specifying a protocol of TCP or UDP then allows extra options: specifically the ‘--source-port’ and ‘--destination-port’ options (abbreviated as ‘--sport’ and ‘--dport’). These options allow us to specify that only packets with a certain source and destination port will match the rule. This is useful for redirecting web requests (TCP port 80 or 8080) and leaving other packets alone.
These options must follow the ‘-p’ option (which has a side-effect of loading the shared
library extension for that protocol). We can use port numbers, or a name from the
/etc/services file.
We want to do Source NAT; change the source address of connections to something
different. This is done in the POSTROUTING chain, just before it is finally sent out; this
is an important detail, since it means that anything else on the Linux box itself (routing,
packet filtering) will see the packet unchanged. It also means that the ‘-o’ (outgoing
interface) option can be used.
Source NAT is specified using ‘-j SNAT’, and the ‘--to-source’ option specifies an IP
address, a range of IP addresses, and an optional port or range of ports (for UDP and TCP
protocols only).
## Change source addresses to 1.2.3.4.
# iptables -t nat -A POSTROUTING -o eth0 -j SNAT --to 1.2.3.4
## Change source addresses to 1.2.3.4, 1.2.3.5 or 1.2.3.6
# iptables -t nat -A POSTROUTING -o eth0 -j SNAT --to 1.2.3.4-1.2.3.6
## Change source addresses to 1.2.3.4, ports 1-1023
# iptables -t nat -A POSTROUTING -p tcp -o eth0 -j SNAT --to 1.2.3.4:1-1023
Destination NAT is done in the PREROUTING chain, just as the packet comes in; this
means that anything else on the Linux box itself (routing, packet filtering) will see the
packet going to its ‘real’ destination. It also means that the ‘-i’ (incoming interface)
option can be used.
Destination NAT is specified using ‘-j DNAT’, and the ‘--to-destination’ option specifies
an IP address, a range of IP addresses, and an optional port or range of ports (for UDP
and TCP protocols only).
## Change destination addresses to 5.6.7.8
# iptables -t nat -A PREROUTING -i eth0 -j DNAT --to 5.6.7.8
## Change destination addresses to 5.6.7.8, 5.6.7.9 or 5.6.7.10.
# iptables -t nat -A PREROUTING -i eth0 -j DNAT --to 5.6.7.8-5.6.7.10
## Change destination addresses of web traffic to 5.6.7.8, port 8080.
# iptables -t nat -A PREROUTING -p tcp --dport 80 -i eth0 -j DNAT --to 5.6.7.8:8080
Though there is much more to NAT in IPTables, the explanation above is sufficient for the rest of this report, in particular section 3.3.
3.2.6 Performance Evaluation of IPTABLES6
In this section, we will discuss the performance evaluation done on IPTables by Américo
J. Melara of California Polytechnic State University, San Luis Obispo
[americo02performance].
This thesis tests the firewall’s performance with the help of the following parameters:
6 All material in this section is taken from [americo02performance]
Parameter | TCP tests | UDP tests
Transmission protocol | TCP | UDP
Type of filtering/matching | TCP, IP, MAC | UDP, IP, MAC
INPUT policy | ACCEPT & DROP | DROP
Connection speed | 100 Mbps (both)
Payload size | 64 & 1400 bytes (both)
Number of rules | No firewall, 10, 40, 100 (both)

Table 3-1: Performance evaluation parameters of [americo02performance]
In short, the performance test runs combinations of the parameters specified above and reports the following results:
(a) The payload size impacts the performance before and after the firewall, but not the firewall itself.
(b) The INPUT policy does not affect the performance of the firewall.
(c) The firewall is affected only by the type of filtering/matching and the number of rules.
(d) The time to process a packet from the start time to the socket layer (refer to section 2.2) is affected by the parameters in (c) and also by the payload size.
The test is done by recording timestamps at various points in the network processing layers of the system (refer to section 2.2):
• Start time = T2 – T1
• Firewall = (T3 – T1) – (T2 – T1) = T3 – T2
• TCP layer = (T4 – T1) – (T3 – T1) = T4 – T3
• Socket layer = (T5 – T1) – (T4 – T1) = T5 – T4
• Total processing time = T5 - T1
These processing times are calculated for various combinations of the parameters specified above. The results of these performance tests are explained and plotted in graphs in that paper. The combination of interest for this report is the test on TCP packets with 10, 40 & 100 firewall rules.
3.2.7 Importance of performance evaluation
Sections 2.2 and 2.3 discussed the reasoning behind the proposed solution and the timing analysis. The timing analysis splits the traversal of a network packet all the way from the lower layers to user space and shows how the packet’s time is distributed across the various layers. The layers of interest to us are the lower layers, the firewall layer, the TCP layer and the socket layer. If the timings for these layers are known, the amount of time spent by the firewall layer on a set of rules can be found. As an example, take the following parameters from the performance test [americo02performance].
A TCP packet of 1400 bytes, with 10 firewall rules. According to the test results, the
following is true:
Total time taken for processing this packet =
(Time taken by packet to travel from the lower layers to
the firewall layer, 11.94 µsec) +
(Time taken at the firewall layer, 8.59 µsec) +
(Time taken at the TCP layer, 24.22 µsec) +
(Time taken at the socket layer, 2.9 µsec)
Equation 3-1: Time taken for processing a TCP packet, 1400 bytes and 10 rules
Referring back to sections 2.2 and 2.3, we can add to the above the amount of time spent by MOSIX on the packet to decide its fate.
Hence,
The amount of time that can be saved if the packets are redirected at the
firewall layer =
(Time taken at the TCP layer, 24.22 µsec) +
(Time taken at the socket layer, 2.9 µsec) +
(Time taken by MOSIX to decide on the fate of the packet,
M µsec).
Equation 3-2: Time saved if packets are redirected at the firewall
However, in this case, the packet has to travel up and then back down the network stack. Hence, the packet spends about the same amount of time having its headers processed while coming back down. So, the above calculated time can be re-calculated as:
Recalculated Time =
2 × { (Time taken at the TCP layer, 24.22 µsec) +
(Time taken at the socket layer, 2.9 µsec)
} +
(Time taken by MOSIX to decide on the fate of the packet,
M µsec).
Equation 3-3: Recalculated time saved for packets redirected at firewall
This gives us a clear idea of how much time can be saved by redirecting packets at the firewall layer.
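Plugging the measured numbers into the equations makes the comparison concrete. The following is a small sketch (not from the original report) combining the values quoted above; M, the MOSIX decision time, is left out since it was not measured here:

```shell
# Worked check of Equations 3-1 and 3-3, using the measured values for a
# 1400-byte TCP packet with 10 firewall rules [americo02performance].
LOWER=11.94   # lower layers up to the firewall layer (usec)
FW=8.59       # firewall layer (usec)
TCPL=24.22    # TCP layer (usec)
SOCK=2.90     # socket layer (usec)

# Equation 3-1: total per-packet processing time.
TOTAL=$(awk -v a=$LOWER -v b=$FW -v c=$TCPL -v d=$SOCK 'BEGIN{printf "%.2f", a+b+c+d}')
# Equation 3-3: time saved by redirecting at the firewall, excluding M.
SAVED=$(awk -v c=$TCPL -v d=$SOCK 'BEGIN{printf "%.2f", 2*(c+d)}')

echo "Equation 3-1 total:           $TOTAL usec"   # 47.65
echo "Equation 3-3 saving (plus M): $SAVED usec"   # 54.24
```

Even excluding M, the per-packet saving of 54.24 µsec exceeds the entire per-packet processing cost of 47.65 µsec.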
3.3 IPTables for the problem at hand
3.3.1 How
The problem at hand requires a firewall that can identify packets, filter them using a matching technique, and then redirect them to another machine at the firewall layer itself. From our discussion of IPTables and of its performance test, it is clear that IPTables can do what is required.
IPTables can be used to filter incoming packets, to change the destination address of an intercepted packet, and to send the packet back down towards its new destination.
3.3.2 Actual Rules
Referring back to the NAT architecture [section 3.2.4], we can set up the following set of rules for identifying and redirecting a packet.
Rule 1: The first rule will catch the incoming packet (by matching its IP address and port number) at the PREROUTING chain of the ‘nat’ table. After filtering out such a packet, it needs to be redirected to its new destination using ‘-j DNAT’. The rule will look like the following:
# iptables -t nat -A PREROUTING -s $CLIENT -d $FIREWALL_SYSTEM -p tcp \
--dport $SERVER_PORT -i eth0 -j DNAT --to-destination $NEW_DESTINATION

(where -s: source, -d: destination, --dport: destination port)
Rule 2: The packet now goes through the POSTROUTING chain of the firewall. At this point, the packet has to go to its new destination with its source address changed to that of this system (where the firewall resides). Only if the source address is changed will the new destination reply back to this system; otherwise, it would contact the original source system directly. This rule uses the POSTROUTING chain and the SNAT target of IPTables.
# iptables -t nat -A POSTROUTING -s $CLIENT -p tcp \
--dport $SERVER_PORT -o eth0 -j SNAT --to-source $FIREWALL_SYSTEM
Rule 3: When the new destination replies back to the firewall system, the packet has to be redirected to the original source. This is another DNAT, which completes the cycle.
# iptables -t nat -A PREROUTING -s $SERVER -d $FIREWALL_SYSTEM -p tcp \
--sport $SERVER_PORT -i eth0 -j DNAT --to-destination $CLIENT
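The three rules can be collected into a small script. This is a sketch, not part of the original setup: the addresses and the port number are placeholder assumptions, and the rules are printed as a dry run so they can be reviewed; piping the output to sh as root on Node b would actually install them.

```shell
#!/bin/sh
# Generate the three IPTables rules of section 3.3.2 for one communicating
# pair. All values below are placeholder assumptions for illustration.
CLIENT=192.168.1.3          # Node c, where the client runs
FIREWALL_SYSTEM=192.168.1.2 # Node b, the base (firewall) node
NEW_DESTINATION=192.168.1.1 # Node a, where the server really is
SERVER_PORT=5000            # port on which the server listens

# Rule 1: redirect the client's packets to the server's real location.
RULE1="iptables -t nat -A PREROUTING -s $CLIENT -d $FIREWALL_SYSTEM -p tcp --dport $SERVER_PORT -i eth0 -j DNAT --to-destination $NEW_DESTINATION"
# Rule 2: make the forwarded packets appear to come from Node b.
RULE2="iptables -t nat -A POSTROUTING -s $CLIENT -p tcp --dport $SERVER_PORT -o eth0 -j SNAT --to-source $FIREWALL_SYSTEM"
# Rule 3: redirect the server's replies back to the original client.
RULE3="iptables -t nat -A PREROUTING -s $NEW_DESTINATION -d $FIREWALL_SYSTEM -p tcp --sport $SERVER_PORT -i eth0 -j DNAT --to-destination $CLIENT"

echo "$RULE1"
echo "$RULE2"
echo "$RULE3"
```

For every additional communicating pair, the same three rules would be emitted again with that pair's addresses and port.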
3.3.3 How do these rules work?
These rules can be represented pictorially as shown in the figures below.

Step 1

Figure 3-4: How do these rules work? Step 1

• Node a, Process A: thinks Process B is in Node b; does not know the location of Process B (actually Node c).
• Node b, IPTables rule set: gets packets from Process A for Process B; knows Process B is in Node c; does DNAT on packets from A and SNAT before sending them to Node c.
Step 2

Figure 3-5: How do these rules work? Step 2

• Node a, Process A: awaits a response from Node b; thinks Process B is in Node b.
• Node b, IPTables rule set: awaits a response from Node c.
• Node c, Process B: thinks Process A is in Node b; replies back to Node b; does not know the correct location of Process A.
Step 3

Figure 3-6: How do these rules work? Step 3

• Node a, Process A: awaits a response from Node b; thinks Process B is in Node b.
• Node b, IPTables rule set: gets the response from Process B; does DNAT on the packet from B, rewriting its destination to Node a; sends the packet to Process A on Node a.
• Node c, Process B: awaits the next packet from Node b; thinks Process A is in Node b.
Step 4

Figure 3-7: How do these rules work? Step 4

• Node a, Process A: gets the response from Node b; thinks it is from Process B on Node b; begins sending the next packet to Node b.
• Node b, IPTables rule set: waits for the next packet from Node a.
• Node c, Process B: awaits the next packet from Node b; thinks Process A is in Node b.

4 Testing

4.1 Purpose
The primary purpose of the test is to compare the effect of using the IPTables rules on Node b (refer to section 3.3.3) against the MOSIX network communication technique and against direct communication of processes between Node a and Node c. The test measures the total execution time / latency, the bandwidth taken by the processes, the load average, and the percentage CPU utilization of the respective systems for:
a) MOSIX communication
b) IPTables communication
c) Direct communication
4.2 Environment
The nodes used for the testing environment had the following configuration:
• Pentium 4 CPU
• 1.6 GHz processor speed
• Intel EtherExpress network card
• 100 Mbps LAN
• Two Red Hat 7.2 Linux boxes with kernel 2.4.19
• One Debian Linux box with kernel 2.4.18
• All nodes connected to the same LAN switch
4.3 Test Procedures

4.3.1 General
The architecture maintained during the tests is exactly the architecture explained in section 3.3.3, which is as follows:

Figure 4-1: General Test Procedure (Node a, Node b, Node c)
A server-client communicating pair was created in order to satisfy the test purpose. The two communicate with each other using variable parameters. Some of the parameters used in creating this server-client pair were:
• Buffer size for each send / receive.
• The total amount of data to transmit; in other words, the total number of iterations for which data would be sent. This parameter was used instead of specifying the time for which the data should be sent because the purpose of the test is to measure the time of execution, not to specify it.
• The number of such communicating pairs.
• The port number on which the communication service would run.
The server starts first and waits on a port number. The client contacts the server on this port number on the server’s machine, and a connection is established between them. The server then starts pumping data to the client according to the parameters specified above. At the end of the data transfer, the server sends an end-signal to close the connection. The client prints out the time taken for execution in seconds and microseconds.
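The actual pair was a custom TCP program; the following is only a local stand-in sketch (a pipe replaces the network socket, and the buffer size and iteration count are invented values) showing how the parameters and the timing fit together:

```shell
# Local sketch of the measurement: transfer BUFSIZE * ITERATIONS bytes
# through a pipe (a stand-in for the TCP connection) and time it.
BUFSIZE=65536    # buffer size for each send / receive (assumed value)
ITERATIONS=160   # number of iterations; total data = 10 MB here

START=$(date +%s)
dd if=/dev/zero bs=$BUFSIZE count=$ITERATIONS 2>/dev/null | cat >/dev/null
END=$(date +%s)

TOTAL_BYTES=$((BUFSIZE * ITERATIONS))
echo "transferred $TOTAL_BYTES bytes in $((END - START)) seconds"
```

In the real test the total size was fixed at 400 MB (see section 5.1) and the elapsed time was reported with microsecond resolution.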
Tests were first designed for different sizes of data transfer as well as different numbers of communicating pairs. However, as seen in section 3.2.6, the size of the data does not really affect the performance of the system, so the tests were made to vary only the number of communicating pairs.
4.3.2 MOSIX
For testing under MOSIX, a scenario has to be created in which a process is migrated from its UHN to another node, so that the “triangular” route of communication happens at the UHN. The following steps were taken to create this scenario.

Step 1

Figure 4-2: MOSIX Test Procedure: Step 1

• Node b: the server is created here; it waits on a port number.
• Node a: idle at this step.
• Node c: the client will reside here; it is not yet created.
Step 2

Figure 4-3: MOSIX Test Procedure: Step 2

• Node b: the server is migrated manually to Node a using the MOSIX admin tools; all processes think the server is still in Node b.
• Node a: the server is now here, but it goes to Node b for system calls; Node b is its UHN.
• Node c: the client is created here; it contacts the server in Node b and is unaware that the server is in Node a.

When more than one communicating pair was created, all of the migrated processes were moved to Node a.

4.3.3 IPTables
For testing the IPTables procedure, the server was created in Node a and the client was created in Node c. Node b is where all the IPTables rules mentioned in section 3.3.2 reside. When the client contacts Node b, the request is forwarded to Node a, and the server thinks that Node b is requesting service. When the server replies, Node b forwards the reply to the client in Node c. Thus, the connection cycle is established and the transfer of data occurs through Node b.
Figure 4-4: IPTables Test Procedure

• Node a: the server is created here; it gets requests from Node b (which in reality come from Node c) and replies back to Node b.
• Node b: the IPTables rules are written here; it forwards packets from Node a to Node c and vice-versa.
• Node c: the client is created here; it contacts Node b requesting service.

When more than one communicating pair was created, more rules were added in Node b to cater to each pair. As we saw in section 3.3.2, three rules are required for one communicating pair, so for every additional pair an extra set of three rules needs to be written.

4.3.4 Direct communication
Ideally, if the MOSIX network communicating processes were migrated, they should have contacted each other directly, instead of using the communication technique discussed in section 1.2. This test was conducted to find the actual performance (latency, bandwidth, and load average of the two systems on which the communicating processes reside) so that it can be compared with the MOSIX method and the IPTables method.
Figure 4-5: Direct Communication Test Procedure

• Node a: the server is created here; it gets requests from Node c and replies back to Node c.
• Node b: not involved in this test.
• Node c: the client is created here; it contacts Node a directly, requesting service.

5 Results

5.1 MOSIX
As mentioned in the previous section, tests were conducted for an increasing number of communicating pairs. The results noted were: the total execution time for the communicating processes to finish the data transfer7, the bandwidth occupied, the load average on the MOSIX UHN while the processes were communicating (in the test above, the UHN is Node b), and the percentage of system CPU utilization8 on the UHN.

7 The amount of data transferred is a parameter given to the test. In these tests, it was 400 MB.
8 System CPU percentage is the amount of CPU used by the kernel. Since MOSIX runs in the kernel, system CPU is noted.
No. of communicating pairs | Time to complete data transfer* (seconds) | Bandwidth (Mbps) | % system CPU utilization | Load average (1.00 = full)
3 | 203.73 | 15.71 | 58.8 | 0.52
6 | 275.56 | 11.61 | 85.7 | 1.34
9 | 390.28 | 8.19 | 85.0 | 1.73
12 | 513.64 | 6.23 | 87.3 | 1.83
15 | 640.91 | 4.99 | 87.5 | 1.80
25 | 1063.21 | 3.01 | 90.0 | 3.55
50 | 2130.70 | 1.50 | 90.0 | 4.42

Table 5-1: MOSIX Test Result

*The data in this table is an average over the number of connections. Please refer to the appendix for complete data.
The MOSIX test results show an increasing load average and %CPU utilization as the number of communicating pairs grows. A more detailed comparison can be made after reading the results of the other two tests.
5.2 IPTables
Similar test results are shown here for the IPTables rule set. The %CPU utilization and load average are calculated for the node that holds the rule set, which, according to the previous section, is Node b.

Total number of connections | Time to complete data transfer (seconds) | Bandwidth (Mbps) | % system CPU utilization | Load average (1.00 = full)
3 | 109.79 | 29.15 | 27.1 | 0.02
6 | 219.58 | 14.57 | 27.1 | 0.01
9 | 328.89 | 9.73 | 25.5 | 0.01
12 | 437.77 | 7.31 | 27.5 | 0.02
15 | 552.14 | 5.79 | 27.9 | 0.01
25 | 913.89 | 3.50 | 24.5 | 0.02
50 | 1840.99 | 1.74 | 28.0 | 0.02

Table 5-2: IPTables Test Result
5.3 Direct Communication
For the direct communication test, there is no need to measure the load average and %CPU utilization, because there is no middle system. However, the total execution time and the bandwidth were noted and are shown in the table below.

Total number of connections | Time to complete data transfer (seconds) | Bandwidth (Mbps)
3 | 103.51 | 30.92
6 | 212.61 | 15.05
9 | 316.60 | 10.11
12 | 424.55 | 7.54
15 | 529.11 | 6.05
25 | 882.72 | 3.63
50 | 1746.64 | 1.83

Table 5-3: Direct Communication Test Result
5.4 Summary
The tables shown above can be summarized for comparison on the basis of latency, bandwidth, %CPU utilization and load average.
LATENCY (sec)

Number of end-to-end connections | MOSIX | IPTABLES | NORMAL
3 | 203.73 | 109.79 | 103.51
6 | 275.56 | 219.58 | 212.61
9 | 390.28 | 328.89 | 316.60
12 | 513.64 | 437.77 | 424.55
15 | 640.92 | 552.14 | 529.11
25 | 1063.21 | 913.89 | 882.72
50 | 2130.70 | 1840.99 | 1746.64

Table 5-4: Comparison of Latency
BANDWIDTH (Mbps)

Number of end-to-end connections | MOSIX | IPTABLES | NORMAL
3 | 15.71 | 29.15 | 30.92
6 | 11.61 | 14.57 | 15.05
9 | 8.19 | 9.73 | 10.11
12 | 6.23 | 7.31 | 7.54
15 | 4.99 | 5.79 | 6.05
25 | 3.01 | 3.50 | 3.63
50 | 1.50 | 1.74 | 1.83

Table 5-5: Comparison of Bandwidth
% CPU utilization

Number of end-to-end connections | MOSIX | IPTABLES
3 | 58.8 | 27.1
6 | 85.7 | 27.1
9 | 85.0 | 25.5
12 | 87.3 | 27.5
15 | 87.5 | 27.9
25 | 90.0 | 24.5
50 | 90.0 | 28.0

Table 5-6: Comparison of CPU Utilization
Load average

Number of end-to-end connections | MOSIX | IPTABLES
3 | 0.52 | 0.02
6 | 1.34 | 0.01
9 | 1.73 | 0.01
12 | 1.83 | 0.02
15 | 1.80 | 0.01
25 | 3.55 | 0.02
50 | 4.42 | 0.02

Table 5-7: Comparison of Load Average
Figure 5-1: Execution Time Comparison Chart (time in seconds vs. number of connections; series: mosix, iptables, direct)
Figure 5-2: Bandwidth Comparison Chart (bandwidth in Mbps vs. number of connections; series: mosix, iptables, direct)
Figure 5-3: %CPU Utilization Comparison Chart (%CPU utilization vs. number of connections; series: mosix, iptables)
Figure 5-4: Load Average Comparison Chart (load average vs. number of connections; series: mosix, iptables)
6 Conclusion
6.1 Observations
• From the graph and table comparing latency, it is clear that the total execution time for IPTables is very close to the total execution time of direct communication, while MOSIX shows a large difference in total execution time. On average, MOSIX takes 33% more execution time than direct communication, while IPTables takes only 4% more. On average, MOSIX takes 28% more execution time than IPTables.
• The bandwidth comparison chart and table show that the bandwidth achieved by MOSIX is considerably lower than that of IPTables and direct communication; on average, it is 20% less than IPTables. However, as the number of end-to-end connections increases, the bandwidth difference between the three methods narrows.
• However, while the bandwidth graphs converge, the load average and CPU utilization show a drastic difference. The CPU utilization and load average on the IPTables system are considerably lower than on the MOSIX system, which is almost completely saturated. On average, MOSIX incurs 212% more CPU utilization and at least 138 times the load average of IPTables.
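The averaged latency percentages in the first observation can be re-derived from the data in Table 5-4. The sketch below (not from the report; the numbers are copied from that table) averages the per-row overheads:

```shell
# Columns: connections, MOSIX, IPTables, direct (values from Table 5-4).
TABLE='3 203.73 109.79 103.51
6 275.56 219.58 212.61
9 390.28 328.89 316.60
12 513.64 437.77 424.55
15 640.92 552.14 529.11
25 1063.21 913.89 882.72
50 2130.70 1840.99 1746.64'

# Average percentage overhead of column $1 relative to column $2.
overhead() {
  echo "$TABLE" | awk -v a=$1 -v b=$2 '{s += $a/$b - 1} END {printf "%.0f", 100*s/NR}'
}

MOSIX_VS_DIRECT=$(overhead 2 4)   # 33
IPT_VS_DIRECT=$(overhead 3 4)     # 4
MOSIX_VS_IPT=$(overhead 2 3)      # 28

echo "MOSIX vs direct:    +${MOSIX_VS_DIRECT}%"
echo "IPTables vs direct: +${IPT_VS_DIRECT}%"
echo "MOSIX vs IPTables:  +${MOSIX_VS_IPT}%"
```

The same averaging over Table 5-5 reproduces the 20% bandwidth figure quoted above.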
6.2 Inferences
The observations made in the previous section show that using MOSIX to manually schedule the network communicating processes actually slowed down their execution. An interesting point here is that even if MOSIX had auto-migrated these network communicating processes, the MOSIX system would still have carried a huge CPU load for them.
The IPTables test, on the other hand, has shown that the two communicating processes take very little extra execution time, and that the redirection adds hardly any CPU utilization or load to the system that handles the rules.
The observations and inferences make it very clear that the MOSIX methodology of structuring the network communicating processes is time-consuming and resource-hungry. However, if the same structure is used for a pair of communicating processes with the IPTables rule set defined, the cost effectiveness and resource efficiency are greatly increased; it is almost as if the communicating processes were connected directly and not routed through a middle system.
6.3 Future Work
Naturally, the integration of the IPTables methodology into MOSIX would enable MOSIX to be more efficient. In such a case, the basic structure of MOSIX is not changed, i.e. MOSIX maintains its UHN, remote & deputy concept, but performance still improves. This integration could be made possible by a step-wise approach. In a broad sense, these steps could be:
a) Identify the double-redirection that was created by migrating the process from its
UHN.
b) Create IPTables rule set on the fly using an API / library.
c) The library sits on every MOSIX workstation and manages the creation of new rule sets.
d) Remove rules after the processes are done with communication.
There are many limitations associated with NAT itself; these can be found in more detail in [hain00architectural]. There could be another approach to the whole situation using an IPTables rule set: if there were a way to redirect packets that are generated locally on a system by doing a local DNAT on them, instead of doing it on a middle system, this problem could be solved. However, from the IPTables documentation [rusty02linuxnat]:
“The NAT code allows you to insert DNAT rules in the OUTPUT chain, but this is not
fully supported in 2.4 (it can be, but it requires a new configuration option, some testing,
and a fair bit of coding, so unless someone contracts Rusty to write it, I wouldn't expect it
soon).
The current limitation is that you can only change the destination to the local machine
(e.g. `j DNAT --to 127.0.0.1'), not to any other machine, otherwise the replies won't be
translated correctly.”
Enabling DNAT on locally generated packets would thus be a possible piece of future
work on IPTables that could prove an efficient solution to this problem.
On the downside, the NAT system has some inherent drawbacks, which are discussed in
detail in [hain00architectural, holdrege01protocol, and sebue02network].
Since MOSIX runs on Linux on x86 platforms, these NAT problems do not come into the
picture.
7 Related Research

A variety of approaches have been taken to resolve the problem discussed in
section 1.2 of this report. These approaches can be classified into two categories: one
addresses the problem at its source using NAT, the other addresses the problem through
socket migration. We discuss research related to each approach below.
Mobile communication with Virtual Network Address Translation (VNAT) [gong02mobile] is an
architecture that allows transparent migration of end-to-end live network connections
associated with various computation units. Such a computation unit can be a single
process, a group of processes, or an entire host. VNAT virtualizes network connections
perceived by transport protocols so that identification of network connections is
decoupled from stationary hosts. Such virtual connections are then remapped into
physical connections to be carried on the physical network using network address
translation. However, VNAT is tailored specifically for the ZAP project
[stevem02design].
MIGSOCK [bryan02migsock] is a project at the Carnegie Mellon University
Information Networking Institute that implements migration of TCP sockets in the
Linux operating system. MIGSOCK provides a kernel module that re-implements TCP to
make migration possible. The implementation requires modifications (patches) to the
kernel files and exposes a migration option to user applications. The remainder of the
functionality resides in the kernel module, which can be loaded on demand by the kernel.
This looks like a promising patch for MOSIX that could eliminate the problem discussed
in section 1.2. However, the source code for this software was available only on request
to the authors, and e-mail requests went unanswered. The software has also not yet been
integrated with MOSIX.
[alex00end] presents an architecture that allows suspending and resuming TCP
connections. However, it does not support migration of TCP connections where both the
end points move simultaneously.
MSOCKS [david98msocks] presents an architecture called Transport Layer Mobility that
allows mobile nodes not only to change their point of attachment to the Internet, but also
to control which network interfaces are used for the different kinds of data leaving from
and arriving at the mobile node. MSOCKS implements its transport layer mobility scheme
using a split-connection proxy architecture and a new technique called TCP Splice that
gives split-connection proxy systems the same end-to-end semantics as
normal TCP connections. However, MSOCKS handles a mobile client and a stationary
server, so it does not match the problem in section 1.2 well.
There is a mention of socket migration in the MOSIX web page [mosix02web] as an
ongoing project.
8 References

[alex00end] Alex C. Snoeren and Hari Balakrishnan, An End-to-End Approach
to Host Mobility, Proceedings of the 6th International Conference
on Mobile Computing and Networking (MobiCom ’00), Boston,
MA, August 2000.
[americo02performance] Américo J. Melara, Performance analysis of the Linux firewall in a
host, Master's Thesis, California Polytechnic State University, San
Luis Obispo, June 2002.
[barak98mosix] Barak A. and La'adan O., The MOSIX Multicomputer Operating
System for High Performance Cluster Computing, Journal of
Future Generation Computer Systems, Vol. 13, No. 4-5, pp. 361-
372, March 1998.
[barak99scalable] Barak A., La'adan O. and Shiloh A., Scalable Cluster Computing
with MOSIX for LINUX, Proc. Linux Expo '99, pp. 95-100,
Raleigh, N.C., May 1999.
[bryan02migsock] Bryan Kuntz and Karthik Rajan, MIGSOCK: Migratable TCP
socket in Linux, Master’s Thesis, Carnegie Mellon University,
Information Networking Institute, February 2002.
[david98msocks] David A. Maltz and Pravin Bhagwat, MSOCKS: An Architecture
for Transport Layer Mobility, Proceedings of the IEEE INFOCOM
’98, San Francisco, CA, 1998.
[gong02mobile] Gong Su and Jason Nieh, Mobile Communication with Virtual
Network Address Translation, Technical Report CUCS-003-02,
Department of Computer Science, Columbia University, February
2002.
[hain00architectural] T. Hain, Architectural Implications of NAT, RFC 2993, IETF,
November 2000.
[holdrege01protocol] M. Holdrege and P. Srisuresh, Protocol Complications with the IP
Network Address Translator, RFC 3027, IETF, January 2001.
[mosix02web] http://www.mosix.org
[rusty02linuxnat] Rusty Russell, Linux 2.4 NAT HOWTO, Linux Netfilter core Team,
http://www.netfilter.org/documentation/HOWTO/NAT-
HOWTO.html, January 2002.
[rusty02linuxnetfilter] Rusty Russell and Harald Welte, Linux netfilter Hacking HOWTO,
Linux Netfilter core Team,
http://www.netfilter.org/documentation/HOWTO//netfilter-
hacking-HOWTO.html, July 2002.
[sebue02network] D. Senie, Network Address Translator (NAT)-friendly Application
Design Guidelines, RFC 3235, IETF, January 2002.
[stevem02design] Steven Osman, Dinesh Subhraveti, Gong Su, and Jason Nieh, The
Design and Implementation of Zap: A System for Migrating
Computing Environments, Proceedings of the Fifth Symposium
on Operating Systems Design and Implementation (OSDI 2002),
Boston, MA, December 9-11, 2002.