A TECHNIQUE FOR IMPROVING THE SCHEDULING OF NETWORK COMMUNICATING PROCESSES IN MOSIX
By
RENGAKRISHNAN SUBRAMANIAN B.E., South Gujarat University, 1998
A REPORT
Submitted in partial fulfillment of the
requirements for the degree
MASTER OF SCIENCE
Department of Computing and Information Sciences College of Engineering
KANSAS STATE UNIVERSITY
Manhattan, Kansas December, 2002
Approved by:
Major Professor
Dr. Daniel Andresen
ABSTRACT
MOSIX is a software tool for supporting cluster computing. The core of the MOSIX technology is the capability of multiple workstations and servers to work cooperatively as if part of a single system. The primary job of MOSIX is to distribute (and redistribute) the processes you create, transparently, among various MOSIX-enabled workstations and servers, to obtain the best possible performance. When two processes that communicate over the network are created and then distributed over the network, MOSIX still binds them to the base workstation where they were created for system calls. This means that the communicating processes talk to each other through their base node and not directly. This report examines this communication method and the reasons behind it, and proposes a technique that improves on it without changing the basic architecture. The proposed technique uses the firewalling system called IPTables, available in the Linux operating system, to let the processes keep communicating through their base node while improving their performance to nearly what it would be if they communicated directly.
TABLE OF CONTENTS
LIST OF FIGURES ......................................................................................................... III
LIST OF TABLES...........................................................................................................IV
LIST OF EQUATIONS ....................................................................................................V
ACKNOWLEDGEMENTS.............................................................................................VI
DEDICATION..................................................................................................................VII
1. INTRODUCTION AND BACKGROUND..............................................................1
1.1 MOSIX................................................................................................................... 1
1.2 Structure of MOSIX network-communicating processes ............................ 3
1.3 The Solution........................................................................................................... 6
2 APPROACH TOWARDS THE SOLUTION.........................................................7
2.1 Introduction – The “triangle routing” ................................................................ 7
2.2 Reasoning ............................................................................................................... 8
2.3 Timing Analysis................................................................................................... 11
2.4 Architecture ......................................................................................................... 15
3 IMPLEMENTATION...............................................................................................17
3.1 Environment information .................................................................................. 17
3.2 IPTables ............................................................... 18
3.2.1 About ................................................................ 18
3.2.2 Netfilter Architecture ............................................... 21
3.2.3 NAT background ....................................................... 22
3.2.4 NAT Architecture in IPTables ......................................... 24
3.2.5 NAT example usage .................................................... 25
3.2.6 Performance Evaluation of IPTABLES ................................... 28
3.2.7 Importance of performance evaluation ................................. 30
3.3 IPTables for the problem at hand ..................................................................... 32
3.3.1 How .................................................................. 32
3.3.2 Actual Rules ......................................................... 32
3.3.3 How do these rules work? ............................................. 33
4 TESTING.................................................................................................................37
4.1 Purpose................................................................................................................. 37
4.2 Environment ........................................................................................................ 38
4.3 Test Procedures ........................................................ 38
4.3.1 General .............................................................. 38
4.3.2 MOSIX ................................................................ 40
4.3.3 IPTables ............................................................. 41
4.3.4 Direct communication ................................................. 42
5 RESULTS................................................................................................................43
5.1 MOSIX................................................................................................................. 43
5.2 IPTables ............................................................................................................... 45
5.3 Direct Communication ....................................................................................... 46
5.4 Summary.............................................................................................................. 46
6 CONCLUSION........................................................................................................51
6.1 Observations ........................................................................................................ 51
6.2 Inferences............................................................................................................. 51
6.3 Future Work ........................................................................................................ 52
7 RELATED RESEARCH........................................................................................53
8 REFERENCES .......................................................................................................55
LIST OF FIGURES
FIGURE 1-1: ORIGIN OF PROCESSES – I ........................................ 4
FIGURE 1-2: ORIGIN OF PROCESSES – II ....................................... 4
FIGURE 1-3: COMMUNICATION OF PROCESSES AFTER MIGRATION BY MOSIX ............ 5
FIGURE 1-4: BEFORE MIGRATING PROCESS B ..................................... 6
FIGURE 1-5: AFTER MIGRATING PROCESS B ...................................... 6
FIGURE 2-1: PROCESSES A AND B .............................................. 7
FIGURE 2-2: PROCESS B IS MIGRATED TO NODE C ................................ 8
FIGURE 2-3: MICROSCOPIC VIEW ............................................... 10
FIGURE 2-4: TIME ANALYSIS .................................................. 14
FIGURE 2-5: FLOWCHART ...................................................... 16
FIGURE 3-1: IPTABLES WORKING FLOWCHART (FROM [AMERICO02PERFORMANCE]) ....... 20
FIGURE 3-2: PACKET TRAVERSING IN NETFILTER (FROM [RUSTY02LINUXNAT]) ........ 21
FIGURE 3-3: NAT ARCHITECTURE, IPTABLES (FROM [RUSTY02LINUXNAT]) ............ 25
FIGURE 3-4: HOW DO THESE RULES WORK? STEP 1 ................................ 34
FIGURE 3-5: HOW DO THESE RULES WORK? STEP 2 ................................ 35
FIGURE 3-6: HOW DO THESE RULES WORK? STEP 3 ................................ 36
FIGURE 3-7: HOW DO THESE RULES WORK? STEP 4 ................................ 37
FIGURE 4-1: GENERAL TEST PROCEDURE ......................................... 38
FIGURE 4-2: MOSIX TEST PROCEDURE: STEP 1 ................................... 40
FIGURE 4-3: MOSIX TEST PROCEDURE: STEP 2 ................................... 41
FIGURE 4-4: IPTABLES TEST PROCEDURE ........................................ 42
FIGURE 4-5: DIRECT COMMUNICATION TEST PROCEDURE ............................ 43
FIGURE 5-1: EXECUTION TIME COMPARISON CHART ................................ 49
FIGURE 5-2: BANDWIDTH COMPARISON CHART ..................................... 49
FIGURE 5-3: %CPU UTILIZATION COMPARISON CHART .............................. 50
FIGURE 5-4: LOAD AVERAGE COMPARISON CHART .................................. 50
LIST OF TABLES

TABLE 3-1: PERFORMANCE EVALUATION PARAMETERS OF [AMERICO02PERFORMANCE] ..... 29
TABLE 5-1: MOSIX TEST RESULT ............................................... 44
TABLE 5-2: IPTABLES TEST RESULT ............................................ 45
TABLE 5-3: DIRECT COMMUNICATION TEST RESULT ................................ 46
TABLE 5-4: COMPARISON OF LATENCY ........................................... 47
TABLE 5-5: COMPARISON OF BANDWIDTH ......................................... 47
TABLE 5-6: COMPARISON OF CPU UTILIZATION ................................... 48
TABLE 5-7: COMPARISON OF LOAD AVERAGE ...................................... 48
LIST OF EQUATIONS
EQUATION 2-1: DISSECTION OF TIME TAKEN BY A PACKET IN ITS PROCESS'S UHN .... 13
EQUATION 2-2: TIME SAVED BY THE PACKET ..................................... 14
EQUATION 3-1: TIME TAKEN FOR PROCESSING A TCP PACKET, 1400 BYTES AND 10 RULES .. 30
EQUATION 3-2: TIME SAVED IF PACKETS ARE REDIRECTED AT THE FIREWALL ......... 31
EQUATION 3-3: RECALCULATED TIME SAVED FOR PACKETS REDIRECTED AT FIREWALL ... 31
ACKNOWLEDGEMENTS

I sincerely thank Prof. Daniel Andresen, my major professor, for giving me
encouragement, timely advice, guidance and facilities to complete this project. I also
thank him for being flexible, adjusting and patient during the course of this project.
I would like to thank Prof. Masaaki Mizuno and Prof. William H. Hsu for serving on my committee. I would like to thank Prof. Mitchell L. Neilsen for agreeing to proxy during my final examination.
I would like to thank Ms. Delores Winfough for patiently helping me out in
understanding the policies of the graduate school.
I thank Mr. Jesse R. Greenwald and Mr. Daniel R. Lang for helping me solve my day-to-day problems with my experiments. I thank Mr. Thomas J. Rothwell for partnering with me during the initial period of the project.
I would like to thank Mr. Ashish Sharma for help with the benchmark programs. I thank Mr. Sadanand Kota and Mr. Madhusudhan Tera for help with using MOSIX.
DEDICATION
To my parents
1. Introduction and Background

This report discusses the scheduling technique used by MOSIX on processes that communicate over the network, explores the reasons behind that technique, and proposes a new technique that improves the performance of processes communicating over the network.
The first section introduces MOSIX and the architecture of network-communicating processes in MOSIX. The second section discusses the approach and architecture of the solution. The third section discusses the implementation in detail. The fourth section describes the tests done to evaluate the solution and why those tests were conducted. The fifth section discusses the results of these experiments and, with the help of graphs, the performance improvement of this solution over the native MOSIX scheduling technique. The final section draws conclusions and comments on future work that could build on this solution.
1.1 MOSIX

MOSIX is a software tool for supporting cluster computing. It consists of kernel-level, adaptive resource-sharing algorithms that are geared for high performance, overhead-free scalability and ease of use of a scalable computing cluster. The core of the MOSIX technology is the capability of multiple workstations and servers (nodes) to work cooperatively as if part of a single system. (MOSIX stands for Multi-computer Operating System for UnIX; a MOSIX-enabled workstation or server is called a “node” hereafter.) The algorithms of MOSIX are designed to
respond to variations in the resource usage among the nodes by migrating processes from
one node to another, preemptively and transparently, for load-balancing and to prevent
memory depletion at any node. MOSIX is scalable and it attempts to improve the overall
performance by dynamic distribution and redistribution of the workload and the resources
among the nodes of a computing-cluster of any size. MOSIX conveniently supports a
multi-user time-sharing environment for the execution of both sequential and parallel
tasks. [barak99scalable ]
MOSIX can make a Linux cluster of x86-based workstations and servers run almost like an SMP. The main purpose of MOSIX is that when you create (one or more) processes on your login node, MOSIX distributes (and redistributes) your processes, transparently, among the nodes to obtain the best possible performance. The core of MOSIX is a set of adaptive management algorithms that continuously monitor the activities of the processes against the available resources, in order to respond to uneven resource distribution and to take advantage of the best available resources. [mosix02web]
The algorithms of MOSIX use preemptive process migration to provide:
• Automatic work distribution - for parallel processing or to migrate processes from
slower to faster nodes.
• Load balancing - for even work distribution.
• Migration of processes from a node that runs out of main memory, to avoid swapping or thrashing.
• Migration of an intensive I/O process to a file server.
• Migration of parallel I/O processes from a client node to file servers.
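Because migration is preemptive and transparent, ordinary Unix processes need no special API to benefit from the distribution above. As a minimal sketch (plain shell; nothing in the script itself is MOSIX-specific), a user on the login node could simply fork several CPU-bound workers and let MOSIX spread them across the cluster:

```shell
#!/bin/sh
# Fork four CPU-bound workers; under MOSIX these ordinary processes
# become candidates for automatic migration to less-loaded nodes.
for i in 1 2 3 4; do
    sh -c 'n=0; while [ $n -lt 100000 ]; do n=$((n+1)); done' &
done
wait                      # block until every worker has finished
echo "all workers finished"
```

On a plain Linux host the workers all run locally; on a MOSIX cluster the same unmodified script would see its workers migrated transparently.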
1.2 Structure of MOSIX network-communicating processes

MOSIX supports preemptive (completely transparent) process migration (PPM). After a
migration, a process continues to interact with its environment regardless of its location.
To implement the PPM, the migrating process is divided into two contexts: the user context, which can be migrated, and the system context, which is UHN-dependent3 and may not be migrated. [barak99scalable]
The user context, called the remote, contains the program code, stack, data, memory maps and registers of the process. The remote encapsulates the process when it is running at the user level. The system context, called the deputy, contains a description of the resources to which the process is attached, and a kernel stack for the execution of system code on behalf of the process. The deputy encapsulates the process when it is running in the kernel. It holds the site-dependent part of the system context of the process; hence it must remain in the UHN of the process. While the process can migrate many times between different nodes, the deputy is never migrated. [barak99scalable]
The processes that MOSIX migrates for automatic work distribution and load balancing also include processes that communicate over the network, as well as processes that reside in the same workstation and communicate with each other. The structure explained above can be seen pictorially in the specific scenarios below.
There are at least two scenarios that arise while exploring the origin of processes that
communicate over the network.
3 Unique Home Node: the node where the process is created.
Scenario a) The communicating processes originated in the same node; each may or may not have since been migrated to a different node.

Scenario b) The communicating processes originated in different nodes; each may or may not have since been migrated to a different node.
Figure 1-1: Origin of processes – I
Figure 1-2: Origin of processes – II
In Figure 1-1, the communicating processes (A and B) originate from Node a and can be migrated by MOSIX to any of the nodes in the cluster. Let us say process B has been migrated to Node b. In such a case, MOSIX still binds process B to its node of origin
(which is Node a) and routes all the communicating packets from B to A through Node a itself, which is expected, since A resides there.
In Figure 1-2, process A can originate from Node a and its counterpart, process B, from Node b; either can be migrated to any of the nodes in the cluster. Let us say process B is now migrated to Node a by MOSIX. But now, since Node b is the node of origin for B, all communicating packets from B to A traverse Node b, instead of B communicating directly with A (which is in the same node). Hence, the communicating processes look like Figure 1-3:
Figure 1-3: Communication of processes after migration by MOSIX
Let us discuss another variation of Scenario b. In this case, processes A and B have their UHNs at Nodes a and b respectively. Process B is later moved to Node c. In such a case, the new communication picture looks like the following set of figures.
Figure 1-4: Before migrating process B
Figure 1-5: After migrating process B
From the figures above, it is seen that process B, though moved to Node c, communicates with its counterpart process A through its UHN (which is still Node b). Process B should communicate with its counterpart directly instead of going through its UHN. The underlying problem is now obvious: this redirection by MOSIX increases latency and, in many cases, causes inefficiency in the whole system. It needs to be rectified so that B contacts A directly, instead of its packets traveling through its UHN.
1.3 The Solution
The technique discussed in this report achieves better performance for the communicating processes in terms of decreased latency, increased bandwidth, and better load averages on the nodes for various configurations.
2 Approach towards the solution
2.1 Introduction – The “triangle routing”

As discussed in the previous section, the problem with these processes is their binding to their UHN. Let us take the following example to discuss the approach.
Figure 2-1: Processes A and B
Processes A and B are two processes that are communicating over the network. Now,
MOSIX migrates process B to a different Node c.
Figure 2-2: Process B is migrated to Node c
Now, the communications between processes A and B happen through Node b. As discussed in previous sections, the deputy (system context) of process B still resides in Node b. This means that the MOSIX code in Node b identifies the packets from Node a that are meant for process B and notes that process B now resides in Node c. So, after identifying a packet from Node a, MOSIX in Node b redirects it to process B in Node c.
2.2 Reasoning
Let us now break the communication into parts and work through the layers that each packet from process A goes through while communicating with process B. Process A resides in user space. It asks the kernel for a socket. (Let us assume that process A is a client seeking service from process B; this example will be used in the rest of the paper.) The kernel provides a socket to the user-space process A. The process then identifies that it has to go through the TCP/IP stack of
the kernel. So, it goes down through the TCP/IP stack, then through the firewalling system, and on down to the lower layers such as the device drivers, the MAC layer and the physical layer. At the receiving end, process B has already opened a port and is waiting for a connection through a socket from any process that needs its service. When A requests a service from B, A's packet goes out of the socket, through the kernel, through the TCP/IP stack (adding a header at each layer), through the firewalling system, and through the lower layers to the lower layers of Node b (the UHN of process B). The lower layers of Node b identify the packet from Node a and take it up through the firewalling system and the TCP/IP stack (stripping the headers one by one), up to the kernel / user-space border
to MOSIX. Now, MOSIX looks at the packet, determines that process B no longer resides in Node b, and redirects the packet to its new destination, Node c (the node where process B now resides). The packet again takes the same path as it took in Node a, goes down to the physical layer, and contacts the lower layers of Node c. The same process happens in Node c, and the packet from process A on Node a finally reaches its destination, process B on Node c.
The above communication can be redrawn with a microscopic view as shown below.
Figure 2-3: Microscopic View
This extra path that the packet takes in Node b increases the latency of the packet, because the packet takes a roundabout route to its destination. Moreover, this roundabout route consumes a chunk of the bandwidth available between Nodes a and b and between Nodes b and c. It also consumes time in MOSIX, which must decide what to do with every such packet arriving from A for B at Node b. On top of that, this could happen for many such processes from
Node a communicating with many other processes in Node c whose UHN is Node b. Since MOSIX is installed on all nodes of the cluster, the MOSIX scheduler in every node spends considerable time on every such redirection that reaches it. This naturally decreases the performance of the whole system.
Now, suppose there were a method to redirect these packets, which arrive at Node b from Node a, towards Node c at a lower layer: a layer that can filter such packets, that does not spend much time deciding the fate of a packet, and at which a network address translation can be performed on the incoming packet before redirecting it towards Node c. Such a method would drastically reduce the load that Node b takes to handle redirection. It would also decrease the latency of the packet and eventually increase the bandwidth of the whole system.
How can all this be achieved? Which of the lower layers can do all this with ease? Naturally, the answer lies in the firewalling system. The firewall can intercept packets and filter them; it can identify the headers of a packet and its destination; it can perform a network address translation and redirect the packet to a different destination. It resides at a very low level in the networking stack and can therefore be used effectively without removing and adding headers all the way up and down the network layers.
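To make the idea concrete, a destination NAT rule of the kind IPTables supports could, in principle, perform this redirection at the firewall layer. The addresses and port below are purely hypothetical, and these are not the actual rules used later in this report:

```shell
# Hypothetical sketch: Node b is 192.168.1.2 and process B now runs on
# Node c at 192.168.1.3, listening on an assumed port 4000. Rewrite the
# destination of matching packets as they arrive, before they climb the
# stack towards MOSIX.
iptables -t nat -A PREROUTING -p tcp -d 192.168.1.2 --dport 4000 \
         -j DNAT --to-destination 192.168.1.3:4000
```

Because the PREROUTING chain is evaluated before routing, a matching packet never has to travel up to MOSIX on Node b; it is re-sent towards Node c at the lowest practical point.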
2.3 Timing Analysis

The basic purpose of the analysis given above was to dissect the problem into smaller segments and understand where the delay is and where the architecture could be improved. In this section, the time spent by a packet in its process's UHN is dissected in detail; the architecture for solving the problem with the help of the firewalling system is then explained in the next section.
The time taken by each packet to travel from process A in Node a to process B in Node c in the discussion above can be divided into the following parts:
• Time taken by the packet to travel from Node a to the lower layers of Node b
• Time taken by the packet to travel from the lower layers of Node b up to MOSIX in Node b
• Time taken by the packet to travel from MOSIX in Node b back down to the lower layers of Node b
• Time taken by the packet to travel from the lower layers of Node b to process B
In the above dissection, we are most concerned with the time taken by the packet to travel from the lower layers of Node b up to MOSIX and back from MOSIX to the lower layers. It is in this particular interval that the firewalling technique will do the redirection. So, let us break this interval down further. It looks like the following (refer to Figure 2-4):
Total time taken by a packet to travel from the physical (lower) layer of Node b to MOSIX and back to the physical layer =

    (time from the physical layer to the firewall, say T1) +
    (time from the firewall to the TCP/IP layer, say T2) +
    (time from the TCP/IP layer to the socket layer, say T3) +
    (time from the socket layer to MOSIX, say T4) +
    (time taken by MOSIX to decide on the fate of the packet, say T5) +
    (time from MOSIX back to the kernel / socket layer, say T6) +
    (time from the socket layer to the TCP/IP layer, say T7) +
    (time from the TCP/IP layer to the firewall, say T8) +
    (time from the firewall to the physical layer, say T9)
Equation 2-1: Dissection of time taken by a packet in its processes’ UHN
Figure 2-4: Time Analysis
As proposed earlier, the aim is to keep the packet from traveling all the way up to MOSIX by intercepting it at the firewall layer. In that case, the amount of time saved by intercepting the packet at the firewall layer will be:
Time saved =
    (T2 + T3 + T4 + T5 + T6 + T7 + T8) –
    (time taken by the firewall layer to match and redirect the packet)
Equation 2-2: Time saved by the packet
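Writing T_fw for the time the firewall layer itself needs to match and retarget a packet (a quantity the text does not measure separately), the dissection above can be restated compactly, under the assumption that interception at the firewall bypasses T2 through T8:

```latex
T_{\mathrm{Node\,b}} = \sum_{i=1}^{9} T_i , \qquad
T_{\mathrm{saved}} = \sum_{i=2}^{8} T_i \, - \, T_{fw}
```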
It is not necessary to calculate T3 or T4 or any individual timing above the firewall layer for our purpose, because the tests described later in the report yield the summation of these times, (T3 + T4 + T5 + T6 + T7).
Hence, the rest of this paper concentrates on the firewalling techniques: how to intercept the desired packets, how to perform a network address translation and how to redirect
them to the right node (Node c), where they should eventually go, rather than on the detailed timing of packets above the firewall layer.
2.4 Architecture

The primary aim of the solution is to intercept the packet and do a corresponding network address translation. The step-by-step procedure can be perceived with the help of the following flowchart.
Figure 2-5: Flowchart
The architecture is pretty straightforward. A set of rules is first written to intercept the
necessary packets and do the corresponding network address translation on them. The
firewall waits for a packet to arrive. As soon as a packet arrives, the firewall intercepts
the packet. The firewall checks every arriving packet against the set of rules already written for it. If a packet matches a rule or rule set, the action written in the rule is applied to the packet: here, the packet is stopped from going up the network layers and is instead directed down to the physical layer towards its new destination. If the packet does not match the rule set, it is sent up the network stack. The firewall then waits for the next packet to arrive so that it can intercept it.
3 Implementation
3.1 Environment information

As mentioned earlier in this report, the working environment for MOSIX is Linux on x86
platforms. MOSIX is available in two parts. The first part is the MOSIX core itself, which is applied as a patch to the Linux kernel; the kernel must then be compiled with MOSIX enabled as a configuration option, and the Linux box rebooted for use with MOSIX. The second part is a set of system administrator tools for MOSIX, such as manual migration of processes using their process IDs (PIDs), enabling or disabling auto-migration, and a MOSIX process monitor tool; these can be downloaded separately and installed.
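As an illustration of these administrator tools, the commands below are a hedged sketch: the tool names (`migrate`, `mosctl`, `mon`) follow the MOSIX/openMosix user-space packages of that era, and the exact names and arguments may differ between versions.

```shell
migrate 1234 5    # manually migrate the process with PID 1234 to node 5
mosctl stay       # forbid automatic migration away from this node
mosctl nostay     # re-enable automatic migration
mon               # interactive monitor showing the load of each node
```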
The latest version of MOSIX that was available while running the tests for this report was
MOSIX 1.8.0 for Linux kernel version 2.4.19. Hence, the choice of implementation
environment in this report is restricted to Linux. Red Hat and Debian Linux distributions
were used. Care was taken regarding the version of gcc.
The MOSIX distribution page specifies that “Note: for now do not use distributions that
use gcc-3.2, such as RedHat-8 or Slackware-9. gcc-3.2 is unsuitable for compiling the
kernel”. [mosix02web]
The Linux OS has firewall support built into its kernel, and this firewall system has evolved continuously. As mentioned above, the working environment is Linux kernel version 2.4.19. The firewalling system built into Linux for 2.4.x kernels is called IPTables. It is the redesigned and heavily improved successor of the previous IPChains (for 2.2.x kernels) and IPFwadm (for 2.0.x kernels) systems. In this section, IPTables is explained in detail, along with how IPTables has been used for our purposes. IPTables is also called netfilter.
3.2 IPTables
3.2.1 About5
IPTables is a generic table structure for the definition of rule sets. Each rule within an IP table consists of a number of classifiers (matches) and one connected action (target).
5 This section describes IPTables in general. Most of the information in this section is taken from the documentation of IPTables, as is, or modified slightly for the purpose of this paper. The source for this information is the documentation section of www.netfilter.org, whose author is Rusty Russell [rusty02linuxnetfilter], [rusty02linuxnat]. One more source of information is [americo02performance]. I do not claim credit for the information in the following sections about IPTables, except section 3.2.7.
19
Netfilter is a set of hooks inside the Linux 2.4.x kernel's network stack that allows kernel modules to register callback functions, which are called every time a network packet traverses one of those hooks.
The main features of the netfilter system are:
• Stateful packet filtering (connection tracking)
• All kinds of network address translation
• Flexible and extensible infrastructure
Netfilter, IPTables and the connection tracking as well as the Network Address
Translation subsystems together build the whole framework.
Basically, rules are instructions with pre-defined characteristics to match against a packet. When a match is found, the firewall decides how to handle that packet. Rules are checked in order until a match is found. A rule can be set like this:
iptables [table] <command> <match> <target/jump>
There are three built-in chains, each with a default policy: INPUT – to check the headers of incoming packets, OUTPUT – for outgoing packets/connections, and FORWARD – for packets routed through the machine (e.g. when it is used as a Network Address Translator). Each chain has its own set of rules.
Let us take the following examples:
#iptables -P INPUT ACCEPT
#iptables -A INPUT -p tcp --dport 23 -j DROP

(-P: policy; -A: append; -p: protocol; --dport: destination port; -j: jump)
The first rule sets the default policy so that the firewall system allows any packet from any network to come in. The second rule, appended to the INPUT chain, matches all TCP packets with destination port 23 and drops them.
Figure 3-1: IPTables working flowchart (from [americo02performance])
As seen in the figure above, IPTables has a set of chains, namely the INPUT, OUTPUT and FORWARD chains, which are meant for incoming packets, outgoing packets and packets destined for a third machine, respectively. Under each of these chains, a set of rules can be created. These rules can match packets on protocols, IP addresses, input/output interfaces, MAC addresses, etc. After a packet is matched, its fate can be decided: it can be accepted, dropped, rejected, queued or returned.
3.2.2 Netfilter Architecture
In more detail, Netfilter is a series of hooks at various points in a protocol stack (at this stage, IPv4, IPv6 and DECnet). The (idealized) IPv4 traversal diagram looks like the following:
Figure 3-2: Packet traversing in Netfilter (from [rusty02linuxnat])
On the left is where packets come in: having passed the simple sanity checks (i.e., not
truncated, IP checksum OK, not a promiscuous receive), they are passed to the Netfilter
framework's NF_IP_PRE_ROUTING [1] hook.
Next they enter the routing code, which decides whether the packet is destined for
another interface, or a local process. The routing code may drop packets that are
unroutable.
If it's destined for the box itself, the Netfilter framework is called again for the
NF_IP_LOCAL_IN [2] hook, before being passed to the process (if any).
If it's destined to pass to another interface instead, the Netfilter framework is called for
the NF_IP_FORWARD [3] hook.
The packet then passes a final Netfilter hook, the NF_IP_POST_ROUTING [4] hook,
before being put on the wire again.
The NF_IP_LOCAL_OUT [5] hook is called for packets that are created locally. Here
you can see that routing occurs after this hook is called: in fact, the routing code is called
first (to figure out the source IP address and some IP options).
3.2.3 NAT background
There is more to IPTables than just accepting and dropping packets. This section discusses Network Address Translation in IPTables.
Normally, packets on a network travel from their source (such as your home computer) to
their destination (such as www.gnumonks.org) through many different links. None of
these links really alter the packet: they just send it onward.
If one of these links were to do NAT, it would alter the source or destination of the packet as it passes through, which is not how the system was designed to work. Usually the link doing NAT will remember how it mangled a packet, and when a reply packet passes through the other way, it will do the reverse mangling on that reply packet, so everything works.
Some of the most common uses of NAT can be divided into three categories.
• Most ISPs give you a single IP address when you dial up to them. You can send
out packets with any source address you want, but only replies to packets with
this source IP address will return to you. If you want to use multiple different
machines (such as a home network) to connect to the Internet through this one
link, you'll need NAT. This is commonly known as ‘masquerading’ in the Linux
world.
• Sometimes you want to change where packets heading into your network will go.
Frequently this is because you have only one IP address, but you want people to
be able to get into the boxes behind the one with the ‘real’ IP address. If you
rewrite the destination of incoming packets, you can manage this. This type of
NAT is called port-forwarding.
• Sometimes you want to pretend that each packet which passes through your Linux
box is destined for a program on the Linux box itself. This is used to make
transparent proxies: a proxy is a program which stands between your network and
the outside world, shuffling communication between the two. The transparent part
is because your network won't even know it's talking to a proxy, unless of course,
the proxy doesn't work.
Section 3.3 of this report looks at a method combining the various uses of IPTables mentioned above.
3.2.4 NAT Architecture in IPTables
In IPTables, NAT is divided into two different types: Source NAT (SNAT) and Destination NAT (DNAT).
Source NAT is when you alter the source address of the first packet: i.e. you are changing
where the connection is coming from. Source NAT is always done post-routing, just
before the packet goes out onto the wire. Masquerading is a specialized form of SNAT.
Destination NAT is when you alter the destination address of the first packet: i.e. you are
changing where the connection is going to. Destination NAT is always done before
routing, when the packet first comes off the wire. Port forwarding, load sharing, and
transparent proxying are all forms of DNAT.
In IPTables, we need to create NAT rules which tell the kernel what connections to change, and how to change them. To do this, we use the IPTables tool to alter the NAT table by specifying the ‘-t nat’ option. The ‘-t’ option in IPTables specifies the table that should be used. In section 3.2.1, we used the default table of IPTables, called filter. For doing NAT, we will use the ‘nat’ table.
The table of NAT rules contains three lists called ‘chains’; each rule is examined in order until one matches. Two of the chains are called PREROUTING (for Destination NAT, as packets first come in) and POSTROUTING (for Source NAT, as packets leave). The third is called OUTPUT and will be discussed later.
Figure 3-3: NAT Architecture, IPTables (from [rusty02linuxnat])
IPTables NAT is best described with the help of the diagram above. When a packet passes each of these points, we look up which connection it is associated with. If it is a new connection, we look up the corresponding chain in the NAT table to see what to do with it. The answer applies to all future packets on that connection.
3.2.5 NAT example usage
IPTables takes a number of standard options as listed below. All the double-dash options
can be abbreviated, as long as IPTables can still tell them apart from the other possible
options.
The most important option here is the table selection option, ‘-t’. For all NAT operations,
we will want to use ‘-t nat’ for the NAT table. The second most important option to use is
‘-A’ to append a new rule at the end of the chain (e.g. ‘-A POSTROUTING’), or ‘-I’ to
insert one at the beginning (e.g. ‘-I PREROUTING’).
We can specify the source (‘-s’ or ‘--source’) and destination (‘-d’ or ‘--destination’) of the packets we want to NAT. These options can be followed by a single IP address (e.g. 192.168.1.1), a name (e.g. www.gnumonks.org), or a network address (e.g. 192.168.1.0/24 or 192.168.1.0/255.255.255.0). If we omit the source address option, then any source address will do. If we omit the destination address option, then any destination address will do.
We can specify the incoming (‘-i’ or ‘--in-interface’) or outgoing (‘-o’ or ‘--out-interface’) interface to match, but which one we can specify depends on which chain we are putting the rule into: at PREROUTING we can only select the incoming interface, and at POSTROUTING we can only select the outgoing interface. If we use the wrong one, IPTables will give an error.
We can also indicate a specific protocol (‘-p’ or ‘--protocol’), such as TCP or UDP; only packets of this protocol will match the rule. The main reason for doing this is that specifying a protocol of TCP or UDP then allows extra options: specifically the ‘--source-port’ and ‘--destination-port’ options (abbreviated as ‘--sport’ and ‘--dport’). These options allow us to specify that only packets with a certain source and destination port will match the rule. This is useful for redirecting web requests (TCP port 80 or 8080) and leaving other packets alone.
These options must follow the ‘-p’ option (which has a side-effect of loading the shared
library extension for that protocol). We can use port numbers, or a name from the
/etc/services file.
We want to do Source NAT; change the source address of connections to something
different. This is done in the POSTROUTING chain, just before it is finally sent out; this
is an important detail, since it means that anything else on the Linux box itself (routing,
packet filtering) will see the packet unchanged. It also means that the ‘-o’ (outgoing
interface) option can be used.
Source NAT is specified using ‘-j SNAT’, and the ‘--to-source’ option specifies an IP
address, a range of IP addresses, and an optional port or range of ports (for UDP and TCP
protocols only).
## Change source addresses to 1.2.3.4.
# iptables -t nat -A POSTROUTING -o eth0 -j SNAT --to 1.2.3.4
## Change source addresses to 1.2.3.4, 1.2.3.5 or 1.2.3.6
# iptables -t nat -A POSTROUTING -o eth0 -j SNAT --to 1.2.3.4-1.2.3.6
## Change source addresses to 1.2.3.4, ports 1-1023
# iptables -t nat -A POSTROUTING -p tcp -o eth0 -j SNAT --to 1.2.3.4:1-1023
Destination NAT is done in the PREROUTING chain, just as the packet comes in; this
means that anything else on the Linux box itself (routing, packet filtering) will see the
packet going to its ‘real’ destination. It also means that the ‘-i’ (incoming interface)
option can be used.
Destination NAT is specified using ‘-j DNAT’, and the ‘--to-destination’ option specifies
an IP address, a range of IP addresses, and an optional port or range of ports (for UDP
and TCP protocols only).
## Change destination addresses to 5.6.7.8
# iptables -t nat -A PREROUTING -i eth0 -j DNAT --to 5.6.7.8
## Change destination addresses to 5.6.7.8, 5.6.7.9 or 5.6.7.10.
# iptables -t nat -A PREROUTING -i eth0 -j DNAT --to 5.6.7.8-5.6.7.10
## Change destination addresses of web traffic to 5.6.7.8, port 8080.
# iptables -t nat -A PREROUTING -p tcp --dport 80 -i eth0 -j DNAT --to 5.6.7.8:8080
Though there is much more to NAT in IPTables, the explanation above is sufficient for the rest of this report, in particular section 3.3.
3.2.6 Performance Evaluation of IPTABLES6
In this section, we will discuss the performance evaluation done on IPTables by Américo
J. Melara of California Polytechnic State University, San Luis Obispo
[americo02performance].
This thesis tests the firewall’s performance with the help of the following parameters:
6 All material in this section is taken from [americo02performance]
Parameter | TCP tests | UDP tests
Transmission protocol | TCP | UDP
Type of filtering/matching | TCP, IP, MAC | UDP, IP, MAC
INPUT policy | ACCEPT & DROP | DROP
Connection speed | 100 Mbps (both)
Payload size | 64 & 1400 bytes (both)
Number of rules | No firewall, 10, 40, 100 (both)

Table 3-1: Performance evaluation parameters of [americo02performance]
In short, the performance test runs combinations of the parameters specified above and reports the following results:
(a) The payload size impacts the performance before and after the firewall, but not the firewall itself.
(b) The INPUT policy does not affect the performance of the firewall.
(c) The firewall is affected only by the type of filtering/matching and the number of rules.
(d) The time to process a packet from the start time to the socket layer (refer to section 2.2) is affected by the parameters in (c) and also by the payload size.
The test is done by recording timestamps at various points in the network processing layers of the system (refer to section 2.2):
• Start time = T2 – T1
• Firewall = (T3 – T1) – (T2 – T1) = T3 – T2
• TCP layer = (T4 – T1) – (T3 – T1) = T4 – T3
• Socket layer = (T5 – T1) – (T4 – T1) = T5 – T4
• Total processing time = T5 - T1
These processing times are calculated for various combinations of the parameters specified above. The results of these performance tests are explained and plotted in graphs in that paper. The combination of interest for this report is the test on TCP packets with 10, 40 & 100 firewall rules.
3.2.7 Importance of performance evaluation
Sections 2.2 and 2.3 discussed the reasoning behind the proposed solution and the timing analysis. The timing analysis splits the traversal of a network packet all the way from the lower layers to user space and shows how the packet’s time is distributed across the various layers. The layers of interest to us are the lower layers, the firewall layer, the TCP layer and the socket layer. If the timings for these layers are known, the amount of time spent by the firewall layer on a set of rules can be found. As an example, take the following parameters from the performance test [americo02performance].
A TCP packet of 1400 bytes, with 10 firewall rules. According to the test results, the
following is true:
Total time taken for processing this packet =
(Time taken by packet to travel from the lower layers to
the firewall layer, 11.94 µsec) +
(Time taken at the firewall layer, 8.59 µsec) +
(Time taken at the TCP layer, 24.22 µsec) +
(Time taken at the socket layer, 2.9 µsec)
Equation 3-1: Time taken for processing a TCP packet, 1400 bytes and 10 rules
Referring back to sections 2.2 and 2.3, we can add to the above the amount of time spent by MOSIX on the packet to decide its fate.
Hence,
The amount of time that can be saved if the packets are redirected at the
firewall layer =
(Time taken at the TCP layer, 24.22 µsec) +
(Time taken at the socket layer, 2.9 µsec) +
(Time taken by MOSIX to decide on the fate of the packet,
M µsec).
Equation 3-2: Time saved if packets are redirected at the firewall
However, in this case, the packet has to travel up and then back down the network stack. Hence, the packet spends about the same amount of time having its headers processed while coming back down. So, the above calculated time can be re-calculated as:
Recalculated Time =
2 × { (Time taken at the TCP layer, 24.22 µsec) +
(Time taken at the socket layer, 2.9 µsec)
} +
(Time taken by MOSIX to decide on the fate of the packet,
M µsec).
Equation 3-3: Recalculated time saved for packets redirected at firewall
This gives us a clear idea of how much time can be saved by redirecting packets at the firewall layer.
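Plugging the measured numbers into the equations makes the comparison concrete. The following is a small sketch (not from the original report) combining the values quoted above; M, the MOSIX decision time, is left out since it was not measured here:

```shell
# Worked check of Equations 3-1 and 3-3, using the measured values for a
# 1400-byte TCP packet with 10 firewall rules [americo02performance].
LOWER=11.94   # lower layers up to the firewall layer (usec)
FW=8.59       # firewall layer (usec)
TCPL=24.22    # TCP layer (usec)
SOCK=2.90     # socket layer (usec)

# Equation 3-1: total per-packet processing time.
TOTAL=$(awk -v a=$LOWER -v b=$FW -v c=$TCPL -v d=$SOCK 'BEGIN{printf "%.2f", a+b+c+d}')
# Equation 3-3: time saved by redirecting at the firewall, excluding M.
SAVED=$(awk -v c=$TCPL -v d=$SOCK 'BEGIN{printf "%.2f", 2*(c+d)}')

echo "Equation 3-1 total:           $TOTAL usec"   # 47.65
echo "Equation 3-3 saving (plus M): $SAVED usec"   # 54.24
```

Even excluding M, the per-packet saving of 54.24 µsec exceeds the entire per-packet processing cost of 47.65 µsec.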
3.3 IPTables for the problem at hand
3.3.1 How
The problem at hand requires a firewall that can identify packets, filter them using a matching technique, and then redirect them to another machine at the firewall layer itself. From our discussion of IPTables and of its performance test, it is clear that IPTables can do what is required.
IPTables can be used to filter incoming packets, to change the destination address of an intercepted packet, and to send the packet back down towards its new destination.
3.3.2 Actual Rules
Referring back to the NAT architecture [section 3.2.4], we can set up the following set of rules for identifying and redirecting a packet.
Rule 1: The first rule will catch the incoming packet (by matching its IP address and port number) at the PREROUTING chain of the ‘nat’ table. After filtering out such a packet, it needs to be redirected to its new destination using ‘-j DNAT’. The rule will look like the following:
# iptables -t nat -A PREROUTING -s $CLIENT -d $FIREWALL_SYSTEM -p tcp \
--dport $SERVER_PORT -i eth0 -j DNAT --to-destination $NEW_DESTINATION

(where -s: source, -d: destination, --dport: destination port)
Rule 2: The packet now goes through the POSTROUTING chain of the firewall. At this point, the packet has to go to its new destination with its source address changed to that of this system (where the firewall resides). Only if the source address is changed will the new destination reply back to this system; otherwise, it would contact the original source system directly. This rule uses the POSTROUTING chain and the SNAT target of IPTables.
# iptables -t nat -A POSTROUTING -s $CLIENT -p tcp \
--dport $SERVER_PORT -o eth0 -j SNAT --to-source $FIREWALL_SYSTEM
Rule 3: When the new destination replies back to the firewall system, the packet has to be redirected to the original source. This is another DNAT, which completes the cycle.
# iptables -t nat -A PREROUTING -s $SERVER -d $FIREWALL_SYSTEM -p tcp \
--sport $SERVER_PORT -i eth0 -j DNAT --to-destination $CLIENT
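The three rules can be collected into a small script. This is a sketch, not part of the original setup: the addresses and the port number are placeholder assumptions, and the rules are printed as a dry run so they can be reviewed; piping the output to sh as root on Node b would actually install them.

```shell
#!/bin/sh
# Generate the three IPTables rules of section 3.3.2 for one communicating
# pair. All values below are placeholder assumptions for illustration.
CLIENT=192.168.1.3          # Node c, where the client runs
FIREWALL_SYSTEM=192.168.1.2 # Node b, the base (firewall) node
NEW_DESTINATION=192.168.1.1 # Node a, where the server really is
SERVER_PORT=5000            # port on which the server listens

# Rule 1: redirect the client's packets to the server's real location.
RULE1="iptables -t nat -A PREROUTING -s $CLIENT -d $FIREWALL_SYSTEM -p tcp --dport $SERVER_PORT -i eth0 -j DNAT --to-destination $NEW_DESTINATION"
# Rule 2: make the forwarded packets appear to come from Node b.
RULE2="iptables -t nat -A POSTROUTING -s $CLIENT -p tcp --dport $SERVER_PORT -o eth0 -j SNAT --to-source $FIREWALL_SYSTEM"
# Rule 3: redirect the server's replies back to the original client.
RULE3="iptables -t nat -A PREROUTING -s $NEW_DESTINATION -d $FIREWALL_SYSTEM -p tcp --sport $SERVER_PORT -i eth0 -j DNAT --to-destination $CLIENT"

echo "$RULE1"
echo "$RULE2"
echo "$RULE3"
```

For every additional communicating pair, the same three rules would be emitted again with that pair's addresses and port.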
3.3.3 How do these rules work?
These rules can be represented pictorially as shown in the figures below.

Step 1

Figure 3-4: How do these rules work? Step 1

• Node a, Process A: thinks Process B is in Node b; does not know the location of Process B (actually Node c).
• Node b, IPTables rule set: gets packets from Process A for Process B; knows Process B is in Node c; does DNAT on packets from A and SNAT before sending them to Node c.
Step 2

Figure 3-5: How do these rules work? Step 2

• Node a, Process A: awaits a response from Node b; thinks Process B is in Node b.
• Node b, IPTables rule set: awaits a response from Node c.
• Node c, Process B: thinks Process A is in Node b; replies back to Node b; does not know the correct location of Process A.
Step 3

Figure 3-6: How do these rules work? Step 3

• Node a, Process A: awaits a response from Node b; thinks Process B is in Node b.
• Node b, IPTables rule set: gets the response from Process B; does DNAT on the packet from B, rewriting its destination to Node a; sends the packet to Process A on Node a.
• Node c, Process B: awaits the next packet from Node b; thinks Process A is in Node b.
Step 4

Figure 3-7: How do these rules work? Step 4

• Node a, Process A: gets the response from Node b; thinks it is from Process B on Node b; begins sending the next packet to Node b.
• Node b, IPTables rule set: waits for the next packet from Node a.
• Node c, Process B: awaits the next packet from Node b; thinks Process A is in Node b.

4 Testing

4.1 Purpose
The primary purpose of the test is to compare the effect of using the IPTables rules on Node b (refer to section 3.3.3) against the MOSIX network communication technique and against direct communication of processes between Node a and Node c. The test measures the total execution time / latency, the bandwidth taken by the processes, the load average, and the percentage CPU utilization of the respective systems for:
a) MOSIX communication
b) IPTables communication
c) Direct communication
4.2 Environment
The nodes used for the testing environment had the following configuration:
• Pentium 4 CPU
• 1.6 GHz processor speed
• Intel EtherExpress network card
• 100 Mbps LAN
• Two Red Hat 7.2 Linux boxes with kernel 2.4.19
• One Debian Linux box with kernel 2.4.18
• All nodes connected to the same LAN switch
4.3 Test Procedures

4.3.1 General
The architecture maintained during the tests is exactly the architecture explained in section 3.3.3, which is as follows:

Figure 4-1: General Test Procedure (Node a, Node b, Node c)
A server-client communicating pair was created in order to satisfy the test purpose. The two communicate with each other using variable parameters. Some of the parameters used in creating this server-client pair were:
• Buffer size for each send / receive.
• The total amount of data to transmit; in other words, the total number of iterations for which data would be sent. This parameter was used instead of specifying the time for which the data should be sent because the purpose of the test is to measure the time of execution, not to specify it.
• The number of such communicating pairs.
• The port number on which the communication service would run.
The server starts first and waits on a port number. The client contacts the server on this port number on the server’s machine, and a connection is established between them. The server then starts pumping data to the client according to the parameters specified above. At the end of the data transfer, the server sends an end-signal to close the connection. The client prints out the time taken for execution in seconds and microseconds.
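The actual pair was a custom TCP program; the following is only a local stand-in sketch (a pipe replaces the network socket, and the buffer size and iteration count are invented values) showing how the parameters and the timing fit together:

```shell
# Local sketch of the measurement: transfer BUFSIZE * ITERATIONS bytes
# through a pipe (a stand-in for the TCP connection) and time it.
BUFSIZE=65536    # buffer size for each send / receive (assumed value)
ITERATIONS=160   # number of iterations; total data = 10 MB here

START=$(date +%s)
dd if=/dev/zero bs=$BUFSIZE count=$ITERATIONS 2>/dev/null | cat >/dev/null
END=$(date +%s)

TOTAL_BYTES=$((BUFSIZE * ITERATIONS))
echo "transferred $TOTAL_BYTES bytes in $((END - START)) seconds"
```

In the real test the total size was fixed at 400 MB (see section 5.1) and the elapsed time was reported with microsecond resolution.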
Tests were first designed for different sizes of data transfer as well as different numbers of communicating pairs. However, as seen in section 3.2.6, the size of the data does not really affect the performance of the system, so the tests were made to vary only the number of communicating pairs.
4.3.2 MOSIX
For testing under MOSIX, a scenario has to be created in which a process is migrated from its UHN to another node, so that the “triangular” route of communication happens at the UHN. The following steps were taken to create this scenario.

Step 1

Figure 4-2: MOSIX Test Procedure: Step 1

• Node b: the server is created here; it waits on a port number.
• Node a: idle at this step.
• Node c: the client will reside here; it is not yet created.
Step 2

Figure 4-3: MOSIX Test Procedure: Step 2

• Node b: the server is migrated manually to Node a using the MOSIX admin tools; all processes think the server is still in Node b.
• Node a: the server is now here, but it goes to Node b for system calls; Node b is its UHN.
• Node c: the client is created here; it contacts the server in Node b and is unaware that the server is in Node a.

When more than one communicating pair was created, all of the migrated processes were moved to Node a.

4.3.3 IPTables
For testing the IPTables procedure, the server was created in Node a and the client was created in Node c. Node b is where all the IPTables rules mentioned in section 3.3.2 reside. When the client contacts Node b, the request is forwarded to Node a, and the server thinks that Node b is requesting service. When the server replies, Node b forwards the reply to the client in Node c. Thus, the connection cycle is established and the transfer of data occurs through Node b.
Figure 4-4: IPTables Test Procedure

• Node a: the server is created here; it gets requests from Node b (which in reality come from Node c) and replies back to Node b.
• Node b: the IPTables rules are written here; it forwards packets from Node a to Node c and vice-versa.
• Node c: the client is created here; it contacts Node b requesting service.

When more than one communicating pair was created, more rules were added in Node b to cater to each pair. As we saw in section 3.3.2, three rules are required for one communicating pair, so for every additional pair an extra set of three rules needs to be written.

4.3.4 Direct communication
Ideally, if the MOSIX network communicating processes were migrated, they should have contacted each other directly, instead of using the communication technique discussed in section 1.2. This test was conducted to find the actual performance (latency, bandwidth, and load average of the two systems on which the communicating processes reside) so that it can be compared with the MOSIX method and the IPTables method.
Figure 4-5: Direct Communication Test Procedure

• Node a: the server is created here; it gets requests from Node c and replies back to Node c.
• Node b: not involved in this test.
• Node c: the client is created here; it contacts Node a directly, requesting service.

5 Results

5.1 MOSIX
As mentioned in the previous section, tests were conducted for an increasing number of communicating pairs. The results noted were: the total execution time for the communicating processes to finish the data transfer7, the bandwidth occupied, the load average on the MOSIX UHN while the processes were communicating (in the test above, the UHN is Node b), and the percentage of system CPU utilization8 on the UHN.

7 The amount of data transferred is a parameter given to the test. In these tests, it was 400 MB.
8 System CPU percentage is the amount of CPU used by the kernel. Since MOSIX runs in the kernel, system CPU is noted.
No. of communicating pairs | Time to complete data transfer* (seconds) | Bandwidth (Mbps) | % system CPU utilization | Load average (1.00 = full)
3 | 203.73 | 15.71 | 58.8 | 0.52
6 | 275.56 | 11.61 | 85.7 | 1.34
9 | 390.28 | 8.19 | 85.0 | 1.73
12 | 513.64 | 6.23 | 87.3 | 1.83
15 | 640.91 | 4.99 | 87.5 | 1.80
25 | 1063.21 | 3.01 | 90.0 | 3.55
50 | 2130.70 | 1.50 | 90.0 | 4.42

Table 5-1: MOSIX Test Result

*The data in this table is an average over the number of connections. Please refer to the appendix for complete data.
The MOSIX test results show an increasing load average and %CPU utilization as the number of communicating pairs grows. A more detailed comparison can be made after reading the results of the other two tests.
5.2 IPTables
Similar test results are shown here for the IPTables rule set. The %CPU utilization and load average are calculated for the node that holds the rule set, which, according to the previous section, is Node b.

Total number of connections | Time to complete data transfer (seconds) | Bandwidth (Mbps) | % system CPU utilization | Load average (1.00 = full)
3 | 109.79 | 29.15 | 27.1 | 0.02
6 | 219.58 | 14.57 | 27.1 | 0.01
9 | 328.89 | 9.73 | 25.5 | 0.01
12 | 437.77 | 7.31 | 27.5 | 0.02
15 | 552.14 | 5.79 | 27.9 | 0.01
25 | 913.89 | 3.50 | 24.5 | 0.02
50 | 1840.99 | 1.74 | 28.0 | 0.02

Table 5-2: IPTables Test Result
5.3 Direct Communication
For the direct communication test, there is no need to measure the load average and %CPU utilization, because there is no middle system. However, the total execution time and the bandwidth were noted and are shown in the table below.

Total number of connections | Time to complete data transfer (seconds) | Bandwidth (Mbps)
3 | 103.51 | 30.92
6 | 212.61 | 15.05
9 | 316.60 | 10.11
12 | 424.55 | 7.54
15 | 529.11 | 6.05
25 | 882.72 | 3.63
50 | 1746.64 | 1.83

Table 5-3: Direct Communication Test Result
5.4 Summary
The tables shown above can be summarized for comparison on the basis of latency, bandwidth, %CPU utilization and load average.
LATENCY (sec)

Number of end-to-end connections | MOSIX | IPTABLES | NORMAL
3 | 203.73 | 109.79 | 103.51
6 | 275.56 | 219.58 | 212.61
9 | 390.28 | 328.89 | 316.60
12 | 513.64 | 437.77 | 424.55
15 | 640.92 | 552.14 | 529.11
25 | 1063.21 | 913.89 | 882.72
50 | 2130.70 | 1840.99 | 1746.64

Table 5-4: Comparison of Latency
BANDWIDTH (Mbps)

Number of end-to-end connections | MOSIX | IPTABLES | NORMAL
3 | 15.71 | 29.15 | 30.92
6 | 11.61 | 14.57 | 15.05
9 | 8.19 | 9.73 | 10.11
12 | 6.23 | 7.31 | 7.54
15 | 4.99 | 5.79 | 6.05
25 | 3.01 | 3.50 | 3.63
50 | 1.50 | 1.74 | 1.83

Table 5-5: Comparison of Bandwidth
% CPU utilization

Number of end-to-end connections | MOSIX | IPTABLES
3 | 58.8 | 27.1
6 | 85.7 | 27.1
9 | 85.0 | 25.5
12 | 87.3 | 27.5
15 | 87.5 | 27.9
25 | 90.0 | 24.5
50 | 90.0 | 28.0

Table 5-6: Comparison of CPU Utilization
Load average

Number of end-to-end connections | MOSIX | IPTABLES
3 | 0.52 | 0.02
6 | 1.34 | 0.01
9 | 1.73 | 0.01
12 | 1.83 | 0.02
15 | 1.80 | 0.01
25 | 3.55 | 0.02
50 | 4.42 | 0.02

Table 5-7: Comparison of Load Average
Figure 5-1: Execution Time Comparison Chart (time in seconds vs. number of connections; series: mosix, iptables, direct)
Figure 5-2: Bandwidth Comparison Chart (bandwidth in Mbps vs. number of connections; series: mosix, iptables, direct)
Figure 5-3: %CPU Utilization Comparison Chart (%CPU utilization vs. number of connections; series: mosix, iptables)
Figure 5-4: Load Average Comparison Chart (load average vs. number of connections; series: mosix, iptables)
6 Conclusion
6.1 Observations
• From the graph and table comparing latency, it is clear that the total execution time for IPTables is very close to the total execution time of direct communication, while MOSIX shows a large difference in total execution time. On average, MOSIX takes 33% more execution time than direct communication, while IPTables takes only 4% more. On average, MOSIX takes 28% more execution time than IPTables.
• The bandwidth comparison chart and table show that the bandwidth achieved by MOSIX is considerably lower than that of IPTables and direct communication; on average, it is 20% less than IPTables. However, as the number of end-to-end connections increases, the bandwidth difference between the three methods narrows.
• However, while the bandwidth graphs converge, the load average and CPU utilization show a drastic difference. The CPU utilization and load average on the IPTables system are considerably lower than on the MOSIX system, which is almost completely saturated. On average, MOSIX incurs 212% more CPU utilization and at least 138 times the load average of IPTables.
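The averaged latency percentages in the first observation can be re-derived from the data in Table 5-4. The sketch below (not from the report; the numbers are copied from that table) averages the per-row overheads:

```shell
# Columns: connections, MOSIX, IPTables, direct (values from Table 5-4).
TABLE='3 203.73 109.79 103.51
6 275.56 219.58 212.61
9 390.28 328.89 316.60
12 513.64 437.77 424.55
15 640.92 552.14 529.11
25 1063.21 913.89 882.72
50 2130.70 1840.99 1746.64'

# Average percentage overhead of column $1 relative to column $2.
overhead() {
  echo "$TABLE" | awk -v a=$1 -v b=$2 '{s += $a/$b - 1} END {printf "%.0f", 100*s/NR}'
}

MOSIX_VS_DIRECT=$(overhead 2 4)   # 33
IPT_VS_DIRECT=$(overhead 3 4)     # 4
MOSIX_VS_IPT=$(overhead 2 3)      # 28

echo "MOSIX vs direct:    +${MOSIX_VS_DIRECT}%"
echo "IPTables vs direct: +${IPT_VS_DIRECT}%"
echo "MOSIX vs IPTables:  +${MOSIX_VS_IPT}%"
```

The same averaging over Table 5-5 reproduces the 20% bandwidth figure quoted above.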
6.2 Inferences
The observations made in the previous section show that using MOSIX to manually schedule the network communicating processes actually slowed down their execution. An interesting point here is that even if MOSIX had auto-migrated these network communicating processes, the MOSIX system would still have carried a huge CPU load for them.
The IPTables test, on the other hand, has shown that the two communicating processes take very little extra execution time, and that the redirection adds hardly any CPU utilization or load to the system that handles the rules.
The observations and inferences make it very clear that the MOSIX methodology of structuring the network communicating processes is time-consuming and resource-hungry. However, if the same structure is used for a pair of communicating processes with the IPTables rule set defined, the cost effectiveness and resource efficiency are greatly increased; it is almost as if the communicating processes were connected directly and not routed through a middle system.
6.3 Future Work
Naturally, the integration of the IPTables methodology into MOSIX would enable MOSIX to be more efficient. In such a case, the basic structure of MOSIX is not changed, i.e. MOSIX maintains its UHN, remote & deputy concept, but performance still improves. This integration could be made possible by a step-wise approach. In a broad sense, these steps could be:
a) Identify the double-redirection that was created by migrating the process from its
UHN.
b) Create IPTables rule set on the fly using an API / library.
c) The library sits on every MOSIX workstation and manages the creation of new rule sets.
d) Remove rules after the processes are done with communication.
There are many limitations associated with NAT itself; these can be found in more detail in [hain00architectural]. There could be another approach to the whole situation using an IPTables rule set: if there were a way to redirect packets that are generated locally on a system by doing a local DNAT on them, instead of doing it on a middle system, this problem could be solved. However, from the IPTables documentation [rusty02linuxnat]:
“The NAT code allows you to insert DNAT rules in the OUTPUT chain, but this is not
fully supported in 2.4 (it can be, but it requires a new configuration option, some testing,
and a fair bit of coding, so unless someone contracts Rusty to write it, I wouldn't expect it
soon).
The current limitation is that you can only change the destination to the local machine
(e.g. `j DNAT --to 127.0.0.1'), not to any other machine, otherwise the replies won't be
translated correctly.”
Enabling DNAT on locally generated packets would thus be a possible piece of future
work on IPTables that could prove an efficient solution to this problem.
On the downside, the NAT system has some inherent drawbacks, which are discussed in
detail in [hain00architectural, holdrege01protocol, and sebue02network].
Since MOSIX runs on Linux on x86 platforms, these NAT problems do not come into the
picture.
7 Related Research

A variety of approaches have been taken to resolve the problem discussed in
section 1.2 of this report. These approaches can be classified into two categories: one
addresses the problem at its source using NAT, the other addresses the problem through
socket migration. We discuss research related to each approach below.
Mobile communication with Virtual Network Address Translation (VNAT) [gong02mobile] is an
architecture that allows transparent migration of end-to-end live network connections
associated with various computation units. Such a computation unit can be a single
process, a group of processes, or an entire host. VNAT virtualizes network connections
perceived by transport protocols so that identification of network connections is
decoupled from stationary hosts. Such virtual connections are then remapped into
physical connections to be carried on the physical network using network address
translation. However, VNAT is tailored specifically for the ZAP project
[stevem02design].
MIGSOCK [bryan02migsock] is a project at the Carnegie Mellon University
Information Networking Institute that implements migration of TCP sockets in the
Linux operating system. MIGSOCK provides a kernel module that re-implements TCP to
make migration possible. The implementation requires modifications (patches) to the
kernel files and exposes a migration option to user applications. The remainder of the
functionality resides in the kernel module, which can be loaded on demand by the kernel.
This looks like a promising patch for MOSIX that could eliminate the problem discussed
in section 1.2. However, the source code for this software was available only on request
to the authors, and e-mail requests went unanswered. The software has also not yet been
integrated with MOSIX.
[alex00end] presents an architecture that allows suspending and resuming TCP
connections. However, it does not support migration of TCP connections where both the
end points move simultaneously.
MSOCKS [david98msocks] presents an architecture called Transport Layer Mobility that
allows mobile nodes not only to change their point of attachment to the Internet, but also
to control which network interfaces are used for the different kinds of data leaving from
and arriving at the mobile node. MSOCKS implements its transport layer mobility scheme
using a split-connection proxy architecture and a new technique called TCP Splice that
gives split-connection proxy systems the same end-to-end semantics as
normal TCP connections. However, MSOCKS handles a mobile client and a stationary
server, so it does not match the problem in section 1.2 well.
There is a mention of socket migration in the MOSIX web page [mosix02web] as an
ongoing project.
8 References

[alex00end] Alex C. Snoeren and Hari Balakrishnan, An End-to-End Approach
to Host Mobility, Proceedings of the 6th International Conference
on Mobile Computing and Networking (MobiCom ’00), Boston,
MA, August 2000.
[americo02performance] Américo J. Melara, Performance analysis of the Linux firewall in a
host, Master's Thesis, California Polytechnic State University, San
Luis Obispo, June 2002.
[barak98mosix] Barak A. and La'adan O., The MOSIX Multicomputer Operating
System for High Performance Cluster Computing, Journal of
Future Generation Computer Systems, Vol. 13, No. 4-5, pp. 361-
372, March 1998.
[barak99scalable] Barak A., La'adan O. and Shiloh A., Scalable Cluster Computing
with MOSIX for LINUX, Proc. Linux Expo '99, pp. 95-100,
Raleigh, N.C., May 1999.
[bryan02migsock] Bryan Kuntz and Karthik Rajan, MIGSOCK: Migratable TCP
socket in Linux, Master’s Thesis, Carnegie Mellon University,
Information Networking Institute, February 2002.
[david98msocks] David A. Maltz and Pravin Bhagwat, MSOCKS: An Architecture
for Transport Layer Mobility, Proceedings of the IEEE INFOCOM
’98, San Francisco, CA, 1998.
[gong02mobile] Gong Su and Jason Nieh, Mobile Communication with Virtual
Network Address Translation, Technical Report CUCS-003-02,
Department of Computer Science, Columbia University, February
2002.
[hain00architectural] T. Hain, Architectural Implications of NAT, RFC 2993, IETF,
November 2000.
[holdrege01protocol] M. Holdrege and P. Srisuresh, Protocol Complications with the IP
Network Address Translator, RFC 3027, IETF, January 2001.
[mosix02web] http://www.mosix.org
[rusty02linuxnat] Rusty Russell, Linux 2.4 NAT HOWTO, Linux Netfilter core Team,
http://www.netfilter.org/documentation/HOWTO/NAT-
HOWTO.html, January 2002.
[rusty02linuxnetfilter] Rusty Russell and Harald Welte, Linux netfilter Hacking HOWTO,
Linux Netfilter core Team,
http://www.netfilter.org/documentation/HOWTO//netfilter-
hacking-HOWTO.html, July 2002.
[sebue02network] D. Senie, Network Address Translator (NAT)-friendly Application
Design Guidelines, RFC 3235, IETF, January 2002.
[stevem02design] Steven Osman, Dinesh Subhraveti, Gong Su, and Jason Nieh, The
Design and Implementation of Zap: A System for Migrating
Computing Environments, Proceedings of the Fifth Symposium
on Operating Systems Design and Implementation (OSDI 2002),
Boston, MA, December 9-11, 2002.