Net Filter

8/6/2019 Net Filter

1/22

The Net Filter Facility

The Linux net filter is a framework in the kernel that allows modules to observe and modify packets

as they pass through the protocol stack. Kernel services or modules register custom hooks/filters

by identifying both

protocol family (e.g., PF_INET) and by

the point in packet processing (e.g., NF_IP_LOCAL_IN) at which the filter is to be invoked.

The facility is currently available for IPv4, IPv6 and DECnet but could be extended to other

protocol families. Each protocol family can provide several processing points in the stack where a

packet of that protocol can be passed to a filter.

These points are referred to as hook points or hook types. Hence, when registering a custom hook,

the protocol family and the protocol specific hook type must be specified.

Only kernel components or installable modules can directly register hooks --- never application code.

1

8/6/2019 Net Filter

2/22

Hook management structures

A statically allocated array oflist headers defined in net/netfilter/core.c is the root of the hook

chains for each protocol and hook types. NF_MAX_HOOKS, the maximum types of hooks a

protocol can support has been defined as 8 in include/linux/netfilter_ipv4.h.

52/* I n this code, we can be waiting indefinitely for userspace t o 53 * service a packet if a hook returns NF_QUEUE. We could keep a count 54 * of skbuffs queued for userspace, and not deregister a hook unless 55 * this is zero, but that sucks. Now, we simply check when the 56 * packets come back: if the hook is gone, the packet is discarded. */

57 struct list_head nf_hooks[NPROTO] [NF_MAX_HOOKS] ;

47 /* Largest hook number + 1 */

48 #define NF_MAX_HOOKS 8

There are 5 hook types defined for the IPV4 protocol. It is common for a packet to be processed by

multiple hooks.

41 /* IP Hooks */ 42 /* After promisc drops, checksum checks. */ 43 #define NF_IP_PRE_ROUTING 0 44 /* If the packet is destined for this box. */ 45 #define NF_IP_LOCAL_IN 1

46 /* If the packet is destined for another interface. */

47 #define NF_IP_FORWARD 2 48 /* Packets coming from a local process. */ 49 #define NF_IP_LOCAL_OUT 3 50 /* Packets about to hit the wire. */ 51 #define NF_IP_POST_ROUTING 4

52 #define NF_IP_NUMHOOKS 5

2

8/6/2019 Net Filter

3/22

Hook function prototype

A hook handler has the following prototype.

53 typedef unsigned int n f_hookfn(unsigned int h ooknum, 54 struct sk_buff * *skb, 55 const struct net_device *in, 56 const struct net_device *out, 57 int ( *okfn) ( struct sk_buff *) ) ;

Note that the prototype is a typedef. This C construct will allow a declaration of the form:

nf_hookfn *hook;

to actually declare a pointer to a function having the parameters shown above, even though the

declaration looks more like a pointer to a structure.

Here is an example, taken from a firewall. Even though a hook is passed a pointer to the okfn a

typical hook will never invoke it, because that would prevent other hooks from inspecting the

packet. The OK function is normally invoked by the netfilter code after all hooks have processed

the packet.

423 static unsigned int fw_output ( 424 unsigned int hook, 425 st ruct sk_buff * *pskb,

426 const st ruct net_device * in, 427 const st ruct net_device * out , 428 int ( * okfn) ( st ruct sk_buff * ) )

3

8/6/2019 Net Filter

4/22

Definining and registering a netfilter hook

Each custom hook is defined using the following nf_hook_ops structure. This structure is passed to

the nf_register_hookfunction.

59 s truct nf_hook_ops 60 { 61 struct list_head list; 62 63 /* User fills in from here down. */ 64 nf_hookfn *hook; 65 struct module *owner; 66 int pf; 67 int hooknum; 68 /* Hooks are ordered in ascending priority. */

69 int priority; 70 };

Structure elements are used as follows:

list: links all hooks of a common PROTO and hook type into the hook

chain

pf: protocol family (PF_INET) of the filter.

hooknum: the protocol specific hook type (NF_IP_FORWARD) identifier.

priority: determines order of the hook in the list.

hook: A pointer to the hook function. It's prototype was shown previously

4

8/6/2019 Net Filter

5/22

Registering a firewall hook

The firewall whose handler was shown earlier registered its handler in this way. The hook element

of the structure should contain the address of the filter function.

488 static st ruct nf_hook_ops postroute_ops = { 489 . hook = fw_output, 490 #if (LINUX_VERSION_CODE >= 0x020500) 491 . owner = THIS_MODULE, 492 #endif 493 . pf = PF_INET, 494 . hooknum = NF_IP_POST_ROUTING, 495 . priority = NF_IP_PRI_FILTER, 496 };

532 rc = nf_register_hook(&postroute_ops) ; 533 if ( rc

8/6/2019 Net Filter

6/22

Hook priorities

During registration hooks are inserted on the chain in priority order. Low numbers mean higher

priority and packets are passed to high priority hooks before low priority hooks.

54 enum nf_ip_hook_priorities { 55 NF_IP_PRI_FIRST = INT_MIN, 56 NF_IP_PRI_CONNTRACK_DEFRAG = - 400, 57 NF_IP_PRI_RAW = - 300, 58 NF_IP_PRI_SELINUX_FIRST = - 225, 59 NF_IP_PRI_CONNTRACK = - 200, 60 N F_IP_PRI_BRIDGE_SABOTAGE_FORWARD = - 175, 61 NF_IP_PRI_MANGLE = - 150, 62 NF_IP_PRI_NAT_DST = - 100, 63 N F_IP_PRI_BRIDGE_SABOTAGE_LOCAL_OUT = - 50, 64 NF_IP_PRI_FILTER = 0,

65 NF_IP_PRI_NAT_SRC = 100, 66 NF_IP_PRI_SELINUX_LAST = 225, 67 NF_IP_PRI_CONNTRACK_HELPER = INT_MAX - 2, 68 NF_IP_PRI_NAT_SEQ_ADJ UST = INT_MAX - 1, 69 NF_IP_PRI_CONNTRACK_CONFIRM = I NT_MAX, 70 NF_IP_PRI_LAST = INT_MAX, 71 };

6

8/6/2019 Net Filter

7/22

Registering a hook

The nf_register_hook() function defined in net/core/netfilter.c adds the nf_hook_ops structure that

defines a custom hook to the appropriate list based on the protocol family and filter type.

Since the list is ordered by ascending priority values, invocation order is lowest numerical valuefirst.

60 61 int n f_register_hook(struct nf_hook_ops * reg) 62 { 63 struct list_head *i; 64 65 s pin_lock_bh(&nf_hook_lock) ;

66 list_for_each(i, &nf_hooks[reg->pf] [ reg->hooknum] ) { 67 if ( reg- >priority priority)

68 break; 69 } 70 list_add_rcu(&reg->list, i->prev); 71 spin_unlock_bh(&nf_hook_lock) ; 72 73 synchronize_net(); 74 return 0; 75}

7

8/6/2019 Net Filter

8/22

IP Packet transmission through the netfilter

From ip_push_pending_frames(), the IP packet is pushed to the netfilterlayer using theNF_HOOK

macro defined in include/linux/netfilter.h. Parameters passed include the output device to be used

and the the final "OK" output function to be invoked on successful verdict from all the hooks in thelist. The hook type isNF_IP_LOCAL_OUT. The input device is set to NULL, since the packet

originated on the local host.

713 err = NF_HOOK(PF_INET, NF_IP_LOCAL_OUT, skb,

NULL, rt->u.dst.dev, dst_output);

211/* Activate hook; either okfn or kfree_skb called , unless a h ook 212 returns NF_STOLEN ( in which case, it' s up to the hook to deal with 213 the consequences) . 214

215 Returns - ERRNO if packet dropped. Zero means queued, stolen or 216 accepted. about async and trust the return value to mean " packet was ok" . 222 223 AK: 224 J ust document it clearly, then you can expect some sense from kernel 225 coders : ) 226* / 227

228/* This is gross, but inline doesn' t cut it for avoiding the function 229 call in fast path: gcc doesn' t inline ( needs value tracking?) . - - RR * / 230 231/* HX: It' s slightly less gross now. * / 232

8

8/6/2019 Net Filter

9/22

Hook macros

The macro translates to a call to the nf_hook_thresh() function.

If a value of 1 is returned, the packet is passed to the OK function. The return code from the OK function is then returned to the caller of NF_HOOK.

233 #define NF_HOOK_THRESH(pf, hook, skb, indev, ou tdev, okfn, thresh) \

234 ( {int __ret; \ 235 if ( (__ret = n f_hook_thresh(pf, h ook, &(skb) , indev, outdev,

okfn, thresh, 1) ) == 1) \ 236 __ret = ( okfn) ( skb) ; 237 __ret; }) 238

The NF_HOOK_macro just invokes NF_HOOK_TRESH with the INT_MIN threshold parameter.

This makes all hooks eligible to process the packet regardless of their priority.

245 #define NF_HOOK( pf, hook, skb, indev, outdev, okfn) \ 246 NF_HOOK_THRESH( pf, hook, skb, indev, out dev, okfn, INT_MIN)

The NF_HOOK_COND macro passes a condargument through to nf_hook_thresh(), The

NF_HOOK_COND macro also unconditionally sets thresh to INT_MIN.. This macro is used in IP

multicast output.

239 #define NF_HOOK_COND(pf, h ook, skb, indev, outdev, okfn , cond)\ 240 ( {int __ret; \ 241 if ( (__ret=nf_hook_thresh(pf, h ook, &(skb) , indev, outdev, okfn, INT_MIN, cond) ) == 1) \ 242 __ret = ( okfn) ( skb) ;

243__ret; })

9

8/6/2019 Net Filter

10/22

Thenf_hook_thresh() function

This function invokes the nf_hook_slow() function if the netfilter debug option is defined or if there

are hooks/filters set for the specific protocol family and hook type. If the condargument is 0 or the

hook chain is empty a 1 is returned indicating that the OK function should be invoked.

188 static inline int nf_hook_thresh(int pf, unsigned int hook, 189 struct sk_buff **pskb, 190 struct net_device *indev, 191 struct net_device *outdev, 192 int ( *okfn) ( struct sk_buff *) ,

int thresh, 193 int cond) 194 {195 if ( ! cond)

196 return 1; 197 #ifndef CONFIG_NETFILTER_DEBUG 198 if ( list_empty(&nf_hooks[pf] [ hook] ) ) 199 return 1; 200 #endif 201 return nf_hook_slow(pf, hook, pskb , indev,

outdev, okfn, t hresh) ; 202 }

10

8/6/2019 Net Filter

11/22

Hook actions

The following actions and corresponding return codes may be take by a hook.

15 /* Responses from hook functions. */

16 #define NF_DROP 0 17 #define NF_ACCEPT 1 18 #define NF_STOLEN 2 19 #define NF_QUEUE 3 20 #define NF_REPEAT 4 21 #define NF_STOP 5 22 #define NF_MAX_VERDICT NF_STOP 23 24 /* we overload the higher bits for encoding auxiliary data such as the queue

25 * number. Not nice, but better than additional function arguments. */

26 #define NF_VERDICT_MASK 0x0000ffff 27 #define NF_VERDICT_BITS 16 28 29 #define NF_VERDICT_QMASK 0xffff0000 30 #define NF_VERDICT_QBITS 16 31 32 #define NF_QUEUE_NR(x) ( ( ( x

8/6/2019 Net Filter

12/22

Invoking the hook chain

When the condparameter is non zero and the hook list is non empty, the nf_hook_slow() function is

invoked. The nf_hook_slow() function is defined in net/core/netfilter.c, it's task is to invoke each

hook in the specified list, and based on the verdict from the hooks, the appropriate action is taken.

In summary the return valuesmean the following to the NF_HOOK macro:

Negative: packet was dropped.

Zero: packet was queued or stolen so nothing more to do.

One: the packet was not dropped, queued, stolen. The NF_HOOK macro invokes

the OK function directly.

158 /* Returns 1 if okfn( ) needs to be executed by the caller, 159 * - EPERM for NF_DROP, 0 otherwise. */

160 int nf_hook_slow(int pf, unsigned int hook, struct sk_buff **pskb, 161 struct net_device *indev, 162 struct net_device *outdev, 163 int ( *okfn) ( struct sk_buff *) , 164 in t h ook_thresh) 165 { 166 struct list_head *elem;

167 unsigned int verdict; 168 int ret = 0;

12

8/6/2019 Net Filter

13/22

Processing hook chain

The actual processing of the hook chain is done in nf_iterate which returns the final verdict. Note

that NF_STOP implies NF_ACCEPT. This presumably allows high priority hooks to impose their

will and thus disabling the normal procedure in which any hook can demand NF_DROP.

170 /* We may already have this, but read- locks nest anyway * / 171 rcu_read_lock(); 172 173 elem = &nf_hooks[ pf] [ hook] ;

174 next_hook: 175 verdict = nf_iterate(&nf_hooks[pf][hook] , pskb, hook, 176 indev, outdev, &elem, okfn, hook_thresh) ;

177 if ( verdict == NF_ACCEPT | | verdict == NF_STOP) { 178 ret = 1; 179 goto unlock; 180 } else if ( verdict == NF_DROP) { 181 kfree_skb(*pskb); 182 ret = - EPERM; 183 } else if ( ( verdict & NF_VERDICT_MASK) == NF_QUEUE) { 184 NFDEBUG( "nf_hook: Verdict = QUEUE. \n" ); 185 if ( ! nf_queue(pskb, elem, pf, hook, indev, outdev, okfn, 186 verdict >> NF_VERDICT_BITS) ) 187 goto next_hook; 188 } 189 unlock: 190 rcu_read_unlock( ) ; 191 return ret; 192 }

13

8/6/2019 Net Filter

14/22

Iterating through the hook chain

The nf_iterate() function is defined in net/core/netfilter.c

116 unsigned int nf_iterate(struct list_head *head,

117 struct sk_buff **skb, 118 int hook, 119 const struct net_device *indev, 120 const struct net_device *outdev, 121 struct list_head **i, 122 int ( *okfn) ( struct sk_buff *) , 123 int hook_thresh) 124 {

One iteration of this loop occurs for each hook called.

127 /* 128 * The caller must not block between calls to this 129 * function because of risk of continuing from deleted 130 element */

131 list_for_each_continue_rcu( *i, head) { 132 struct nf_hook_ops *elem = ( struct nf_hook_ops *) *i; 133

This would appear to skip over hooks whose values are numerically lower than the hook threshold.

The motivation for doing this is unclear. When called via NF_HOOK, the threshold is INT_MIN.

134 if ( hook_thresh > elem- >priority) 135 continue; 136

14

8/6/2019 Net Filter

15/22

Invoking an individual hook

The call to elem >hook passes the packet to a function such asfw_output() which returns its verdict

on the packet.

137 /* Optimization: we don' t need to hold module 138 reference here, since function can' t sleep. - - RR */

139 verdict = elem->hook(hook, skb, indev, outdev, okfn);

The verdict returned by the hook function determines the action taken. An immediate return,

possibly aborting the send, is made if the value returned is NF_QUEUE, NF_STOLEN, or

NF_DROP. For values of NF_REPEAT or NF_ACCEPT the list_for_each() loop continues.

140 if ( verdict ! = NF_ACCEPT) {

141 #ifdef CONFIG_NETFILTER_DEBUG 142 if ( unlikely( (verdict & NF_VERDICT_MASK) 143 > NF_MAX_VERDICT) ) { 144 NFDEBUG( "Evil return from %p( %u) . \n" , 145 elem- >hook, hook) ; 146 continue; 147 } 148 #endif 149 if ( verdict ! = NF_REPEAT) 150 return verdict; 151 *i = ( *i) ->prev;

152 } // end of verdict ! = NF_ACCEPT 153 } // end of list_for_each( ) 154 return NF_ACCEPT; 155 }

15

8/6/2019 Net Filter

16/22

Thedst_output_function

In the udpsendnotes it was seen that if the packet is accepted for transmission by nf_hook_slow, the

okfn(), dst_output(), is called. It simply passes control to the outputfunction associated with the dst

structure that is presently bound to the sk_buff.

225 static inline int dst_output ( struct sk_buff * skb) 226 { 227 return skb->dst->output(skb); 228 }

16

8/6/2019 Net Filter

17/22

The ip_output function

The pointer skb >dst refers to the route cache element associated with this packet's source and

destination. During routing , rt >u.dst >output was set to ip_output() which is defined in

net/ipv4/ip_output.c.

The ip_output() function

sets skb >dev to the device associated with the route's associated output device structure

and

sets the protocol type to ETH_P_IP.

This indicates that the value 0x8000 represents an IP packet even if the output device is not an

ethernet device. This used to be done in ip_finish_output. This function used to explicitly invokeip_do_nat().

277 int ip_output( struct sk_buff * skb) 278 { 279 struct net_device *dev = skb- >dst- >dev; 280 281 I P_INC_STATS(IPSTATS_MIB_OUTREQUESTS);

282 283 skb- >dev = dev; 284 skb- >protocol = htons( ETH_P_IP) ;

285

Note that the above two lines are the only things that occurbetween the end of the

NF_LOCAL_OUTPUT hook and the post routing hook. But its possible that significant

transformations (e.g. NAT could have occurred within the NF_LOCAL_OUTPUT hook).

286 r eturn N F_HOOK_COND(PF_INET, NF_IP_POST_ROUTING, skb, 287 NULL, dev, ip_finish_output, 288 ! ( IPCB( skb) - >flags & IPSKB_REROUTED) ) ; 289 }

17

8/6/2019 Net Filter

18/22

The ip_finish_output() function

The ip_finish_output() is responsible for invoking fragmentation if the packet is too long and gso is

not supported.

203 static inline int ip_finish_output ( struct sk_buff * skb) 204 { 205 #if defined(CONFIG_NETFILTER) && defined(CONFIG_XFRM) 206 /* Policy lookup after SNAT yielded a new policy */ 207 if ( skb- >dst- >xfrm ! = NULL) { 208 IPCB( skb) ->flags | = IPSKB_REROUTED; 209 return dst_output( skb) ; 210 } 211 #endif

212 if ( skb- >len > dst_mtu( skb- >dst) && ! skb_is_gso( skb) ) 213 return ip_fragment( skb, ip_finish_output2) ; 214 else 215 return ip_finish_output2(skb); 216 }

18

8/6/2019 Net Filter

19/22

The ip_finish_output2() function

The ip_finish_output2() function is also defined in net/ipv4/ip_output.c . It first tries to confirm

that sufficient headroom exists for the MAC header. If not, it will askskb_realloc_headroom() to

reallocate the kmalloc'dcomponent with sufficient headroom.

You should make real sure that your protocol doesn'ttrigger the call to skb_realloc_headroom()

163 s tatic inline int ip_finish_output2(struct sk_buff * skb) 164 { 165 struct dst_entry * dst = skb- >dst ; 166 struct hh_cache * hh = dst- >hh; 167 struct net_device *dev = dst- >dev; 168 int hh_len = LL_RESERVED_SPACE( dev) ; 169

170 /* Be paranoid, rather than too clever. */ 171 if ( unlikely( skb_headroom( skb) hard_header)) { 172 struct sk_buff *skb2; 173 174 skb2 = skb_realloc_headroom(skb,

LL_RESERVED_SPACE(dev)); 175 if ( skb2 == NULL) { 176 kfree_skb(skb); 177 return - ENOMEM; 178 }

The skb_realloc_headroom() succeeds skb2, originally a clone ofskb, now points to a new

kmalloc'd component. Hence it is necessary to charge the sock's write buffer quota, and then free

the original sk_buffand kmalloc'darea.

179 if ( skb- >sk) 180 skb_set_owner_w( skb2, skb- >sk) ; 181 kfree_skb(skb) ; 182 skb = skb2; 183 }

19

8/6/2019 Net Filter

20/22

Passing the packet to the dev layer

There are two mechanisms by which calls to the dev layer may be made. If thedst_entry has an

hh_cache pointer, then the hh_cache entry must contain both the hardware header itself and a

pointer to an output function at the device / link layer.

The hh_output() function is set to dev_queue_xmit()if the ARP cache element is in the

NUD_REACHABLE state, but will point to neigh_resolve_output()when the entry becomes stale.

If there is no hh pointer in the dst_entry, the neighborpointer that was established when the route

cache entry was constructed will be used. This neighbor structure has an outputfunction pointer

which was set to neigh >ops >output. For ethernet devices, this function is

neigh_resolve_output(). Otherwise (for a loopback, point to point, or virtual device) it set to

invoke dev_queue_xmit() by the arp_constructor() function that is called when each neighbourstructure was created.

185 if ( hh) { 186 int hh_alen; 187 188 read_lock_bh( &hh- >hh_lock) ; 189 hh_alen = HH_DATA_ALIGN( hh- >hh_len) ; 190 memcpy( skb- >data - hh_alen, hh- >hh_data, hh_alen) ; 191 read_unlock_bh( &hh- >hh_lock) ;

192 skb_push( skb, hh- >hh_len) ; 193 return hh- >hh_output( skb) ;

194 } else if ( dst- >neighbour) 195 return dst- >neighbour- >output( skb) ; 196

20

8/6/2019 Net Filter

21/22

Failure of routing

If there is no hardware header structure and no neighbor structure available, then there is no way to

send the packet and it must be dropped. The net_ratelimit() function is used to limit the number of

printk's generated to not more than 1 every 5 seconds to avoid flooding the syslog in casesomething is badly amiss in the network setup.

197 if ( net_ratelimit( )) 198 printk( KERN_DEBUG

" ip_finish_output2: No header cache and no neighbour! \n" ) ; 199 kfree_skb(skb); 200 return - EINVAL; 201 }

21

8/6/2019 Net Filter

22/22

Hardware header length macros

228 /* cached hardware hdr; allow for machine alignment * / 229 #define HH_DATA_MOD 16

230 #define HH_DATA_OFF(__len) \ 231 ( HH_DATA_MOD - ( ( ( __len - 1) & ( HH_DATA_MOD - 1) ) + 1) )

232 #define HH_DATA_ALIGN(__len) \ 233 ( ( ( __len) +( HH_DATA_MOD- 1) ) &~( HH_DATA_MOD - 1) ) 234 unsigned long hh_data[ HH_DATA_ALIGN( LL_MAX_HEADER) /

sizeof(long)] ; 235} ; 236 237 /* Reserve HH_DATA_MOD byte aligned hard_header_len , but at least that much.

238 * Alternative is: 239 * dev->hard_header_len ? ( d ev->hard_header_len + 240 * ( HH_DATA_MOD - 1) ) & ~( HH_DATA_MOD - 1) : 0 241 * 242 * We could use other align values, but we must maintain the 243 * relationship HH alignment hard_header_len + ( dev)- >needed_headroom)&~(HH_DATA_MOD - 1) ) + HH_DATA_MOD)

250 #define LL_RESERVED_SPACE_EXTRA(dev, extra) \ 251 ( ( ( ( dev)- >hard_header_len+(dev)- >needed_headroom +

(extra)) &~(HH_DATA_MOD - 1) ) + HH_DATA_MOD)

252 #define LL_ALLOCATED_SPACE(dev) \ 253 ( ( ( ( dev)- >hard_header_len+(dev)- >needed_headroom +

(dev)- >needed_tailroom)&~(HH_DATA_MOD - 1)) + HH_DATA_MOD) 254

22

Documents

Net Filter