
Squid Guide and Tutorial


Terminology and technology

What is Squid?

Squid is a very fast proxy-cache program. But what is a "proxy cache"? According to Project Gutenberg's online version of Webster's Unabridged Dictionary:

Proxy: An agent that has authority to act for another.

Cache: A hiding place for concealing and preserving provisions which it is inconvenient to carry.

Squid acts as an agent, accepting requests from clients (such as browsers) and passing them on to the appropriate Internet server. It stores a copy of the returned data in an on-disk cache. The real benefit of Squid emerges when the same data is requested multiple times, since a copy of the on-disk data is returned to the client, speeding up Internet access and saving bandwidth. Small amounts of disk space can have a significant impact on bandwidth usage and browsing speed.

Internet firewalls (which are used to protect company networks) often have a proxy component. What makes the Squid proxy different from a firewall proxy? Most firewall proxies do not store copies of the returned data; instead, they re-fetch requested data from the remote Internet server each time. Squid differs from firewall proxies in other ways too:

Many protocols are supported (firewalls often have a specific proxy for each protocol, since it's difficult to ensure the code security of one large program)

Hierarchies of proxies, arranged in complex relationships, are possible

When, in this book, we refer to a 'cache', we are referring to a 'caching proxy' - something that keeps copies of returned data. A 'proxy', on the other hand, is a program that does not cache replies.

The web consists of HTML pages, graphics and sound files (to name but a few!). Since only a very small portion of the web is made up of text, referring to all cached data as pages is misleading. To avoid ambiguity, caches store objects, not pages.

Many Internet servers support more than one query protocol. A web server uses the HyperText Transfer Protocol (HTTP) to serve data, and an older protocol, the File Transfer Protocol (FTP), often runs on web servers too. Muddling them up would be bad: caching an FTP response and returning the same data to a client on a subsequent HTTP request would be incorrect. Squid therefore uses the complete URL to uniquely identify everything stored in the cache.
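For example, the following two URLs name the same path on the same host, but they are fetched with different protocols, so Squid stores them as two distinct objects (the hostname is purely illustrative):

    http://ftp.example.com/pub/README
    ftp://ftp.example.com/pub/README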

To avoid returning out-of-date data to clients, objects must be expired. Squid therefore allows you to set refresh times for objects, ensuring old data is not returned to clients.
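Refresh times are set in the configuration file with refresh_pattern rules. As a sketch (the values shown here mirror common defaults, but treat them as illustrative), each rule gives a regular expression, a minimum age in minutes, a percentage of the object's age, and a maximum age in minutes:

    refresh_pattern ^ftp:      1440  20%  10080
    refresh_pattern ^gopher:   1440   0%   1440
    refresh_pattern .             0  20%   4320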

Squid is based on software developed for the Harvest project, which developed their 'cached' (pronounced 'Cache-Dee') as a side project. Squid development is funded by the National Laboratory for Applied Network Research (NLANR), who are in turn funded by the National Science Foundation (NSF). Squid is 'open source' software, and although development is done mainly with NSF funding, features are added and bugs fixed by a team of online collaborators.


Why Cache?

Regions with plenty of bandwidth

Small Internet Service Providers (ISPs) cache to reduce their line costs, since a large portion of their operating costs is infrastructural rather than staff-related.

Companies and content providers (such as AOL) have recently started caching. These organizations are not short of bandwidth (indeed, they often have as much bandwidth as a small country), but their customers occasionally see slow response. There are numerous reasons for this:

Origin Server Load

Raw bandwidth is increasing faster than overall computer performance. These days many servers act as a back-end for one site, load-balancing incoming requests. Where this is not done, the result is slow response. If you have ever received a call complaining about slow response, you will know the benefit of caching - in many cases the user's mind is already made up: it's your fault.

Quick Abort

Squid can be configured to continue fetching objects (within certain size limits) even if the person who started the download aborts it. Since there is a chance of more than one person wanting the same file, it is useful to have a copy of the object in your cache, even if the first user aborts. Where you have plenty of bandwidth, this continued fetching ensures that a local copy of the object is available, just in case someone else wants it. This can dramatically reduce latency, at the cost of higher bandwidth usage.
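This behavior is tuned in squid.conf with the quick_abort tags; the values below show the shape of such a configuration rather than a recommendation:

    # finish the fetch if less than 16 KB remains to be transferred
    quick_abort_min 16 KB
    # abort the fetch if more than 16 KB remains
    quick_abort_max 16 KB
    # always finish if 95% or more has already arrived
    quick_abort_pct 95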

Peer Congestion

As bandwidth increases, router speed needs to increase at the same rate. Many peering points (where huge volumes of Internet traffic are exchanged) do not have the router horsepower to support their ever-increasing load. You may invest vast sums of money to maintain a network that stays ahead of the growth curve, only to have all your effort wasted the moment packets move off your network onto a large peering point, or onto another service provider's network.

Traffic Spikes

Large sporting, television and political events can cause spikes in Internet traffic. Events like the Olympics, the Soccer World Cup, and the Starr report on the Clinton-Lewinsky issue create large traffic spikes.

You can plan ahead for sports events, but it's difficult to estimate the load that they will eventually cause. If you are a local ISP, and a local team reaches the finals, you are likely to get a huge peak in traffic. Companies can also be affected by traffic spikes, with bulk transfer of large databases or presentations flooding lines at random intervals. Though caching cannot completely solve this problem, it can reduce the impact.

Unreachable Sites

If Squid attempts to connect to an origin server, only to find that it is down, it will log an error and return the object from disk (even though there is a chance of sending out-of-date data to the client). This reduces the impact of a large-scale Internet outage, and can help when a backhoe digs up a major segment of your network backbone.

Regions with limited bandwidth

In many regions, bandwidth is expensive and latency due to very long haul links is high.

Costs

In many countries, or even in rural areas of industrialized countries, bandwidth may be expensive. Saving bandwidth reduces Internet infrastructural costs significantly. Since Internet connectivity is so expensive, ISPs and their customers reduce their bandwidth requirements with caches.

Latency

Although reduction of latency is not normally the major reason for introducing caching in these areas, the problems experienced in the high-bandwidth regions are exacerbated by the high latency and lower speed of the lines to those regions.

Squid acts only as an HTTP-type proxy: it will act as a proxy for browser-type applications, but generally speaking it won't act as a proxy for other kinds of application.

Squid is based on the HTTP/1.1 specification, and can only proxy for programs that use this protocol for Internet access. Browsers, for example, use this specification: their primary function is the display of web data retrieved using the HTTP protocol.

FTP clients, on the other hand, often support proxy servers, but do not communicate with them using the HTTP protocol. These clients will not be able to understand the replies that Squid sends. Note, though, that web browsers can retrieve FTP URLs through an HTTP-type proxy: the browser sends the request to Squid in HTTP format, and Squid performs the FTP transfer on the browser's behalf.

Squid is also not a generic proxy, and only handles a small subset of all possible Internet protocols: Quake, News, RealAudio and Video Conferencing will not work through Squid. Squid only supports UDP for inter-cache communication - it does NOT support any client program that uses UDP for its communication, which excludes many multimedia programs.

Supported Protocols

Supported Client Protocols

Squid supports the following incoming protocol request types (when the proxy requests are sent in HTTP format):

HyperText Transfer Protocol (HTTP), which is the specification that the WWW is based on.

File Transfer Protocol (FTP)

Gopher

Wide Area Information Services (WAIS), with the appropriate relay server

Secure Sockets Layer (SSL), which is used for secure online transactions

Inter-Cache and Management Protocols

HTTP, which is used for retrieving copies of objects from other caches.

Internet Cache Protocol (ICP). ICP is used to find out if a specific object is in another cache's store.


Cache Digests. This protocol is used to retrieve an index of objects in another cache's store. When a cache receives a request for an object it does not have, it checks this index to determine which cache does have the object.

Simple Network Management Protocol (SNMP). Common SNMP tools can be used to retrieve information about your cache.

Hyper Text Caching Protocol (HTCP). Though HTCP is not widely implemented, Squid is in the process of incorporating the protocol.

Inter-Cache Communication Protocols

Squid gives you the ability to share data between caches, but why should you?

Just as there is a benefit to connecting individual PCs to a network, and this network to the Internet, so there is an advantage to linking your cache to other people's networks of caches.

Note: The following is not a complete discussion on how the size of your user base will influence your hit rate. Installing Squid discusses this topic in more depth. In short: the larger your user base, the more objects requested, the higher the chance of an object being requested twice. To increase your hit rate, add more clients.

However, in many cases the size of your user base is finite - it's limited by the number of staff members or customers. Co-operative peering with other caches increases the size of your user base, and effectively increases your hit rate. If you peer with a large cache, you will find that a percentage of the objects your users are requesting are already available there. Many people can increase their hit rate by about 5% by peering with other caches.

If you have a large network, one cache may not handle all incoming requests. Rather than having to continuously upgrade one machine, it makes sense to split the load between multiple servers. This reduces individual server load, while increasing the overall number of queries your cache system can handle.

Squid implements inter-cache protocols in a very efficient manner, through multicast ICP queries and Cache Digests, which allow for large networks of caches (hierarchies). With these features, large networks of caches add very little latency, allowing you to scale your cache infrastructure as you grow.

Raw bandwidth is not the only issue in making your cache system efficient and fast: choosing the right hardware and software is a difficult task too.

Firewall Terminology

Firewalls are used by many companies to protect their networks. Squid is going to have to interact with your firewall to be useful. So that we are on the same wavelength, I cover some of the terminology here: it makes later descriptions easier if we get all the terms sorted out first.

The Two Types of Firewall

A proxy-based firewall does not route packets through to the end-user. All incoming data is handled by the firewall's IP stack, with programs (called proxies) that handle all the passing of data.

Proxies accept incoming data from the outside, process it (to make sure that it is in a form that is unlikely to break an internal server) and pass it on to machines on the inside. The software running on a proxy-level firewall is (hopefully!) written in a secure manner, so to crack through such a firewall you would probably have to find a hole in the firewall software itself, rather than in the software on an inside machine.

A packet-filtering firewall works on a per-packet basis, deciding whether to pass every packet to and from your inside machines on the basis of the packet's IP protocol, and source and destination IP/port pairs. If a publicly available service on an internal server is running buggy software, a cracker can probably penetrate your network.

It's not up to me to say which sort of firewall is best: it depends too much on your situation anyway. Packet-filtering firewalls are normally the easiest to get Squid working through. The 'transparent redirection' of proxy firewalls, on the other hand, can be useful: it can make redirection of client machines to the cache server easy.

Firewalled Segments

If you have a firewall, your network is generally segmented in two: hosts are either trusted or untrusted. Trusted hosts are on the inside of the firewall. Untrusted hosts (basically the rest of the Internet) sit on the outside of the firewall.

Occasionally, you may have a third class of hosts: semi-trusted. These hosts normally sit on a separate segment of the network, separate from your internal systems. They are not trusted by your secure hosts, but are still protected by the firewall. Public servers (such as web servers and ftp servers) are generally put here. This zone (or segment) is generally called a Demilitarized Zone (or DMZ).

Hand-Offs

With a proxy-level firewall, client machines are normally configured to use the firewall's internal interface as a proxy server. Some firewalls can do automatic redirection of outgoing web requests, but may not be able to do authentication in this mode. If you already have a large number of client machines set up to talk to the firewall as a proxy, the prospect of having to change all their setups can influence your decision on where to position the cache server. In many cases it's easier to re-configure the firewall to communicate with a parent than to change the proxy server settings on all client machines.

The vast majority of proxy-level firewalls are able to talk to another proxy server using HTTP. This feature is sometimes called a hand-off, and it is this which allows your firewall to talk to a higher-level firewall or cache server via HTTP. Hand-offs allow you to have a stack of firewalls, with higher-up firewalls protecting your entire company from outside attacks, and lower-down firewalls protecting your different divisions from one another. When a firewall hands off a request to another firewall or proxy server, it simply acts as a pipe between the client and the remote firewall.

The term hand-off is a little misleading, since it implies that the lower-down firewall is somehow less involved in the data transfer. In reality the proxy process handling such a request is just as involved as when conversing directly with a destination server, since it is channeling data between the client and the firewall the connection was handed to. The lower-down firewall is, in fact, treating the higher-up cache as a parent.


Installing Squid

Hardware Requirements

Caching stresses certain hardware subsystems more than others. Although the key to good cache performance is good overall system performance, the following list is arranged in order of decreasing importance:

Disk random seek time

Amount of system memory

Sustained disk throughput

CPU power

Do not drastically underpower any one subsystem, or performance will suffer.

In the case of catastrophic hardware failure you must have a ready supply of alternate parts. When your cache is critical, you should have a (working!) standby machine with operating system and Squid installed, kept ready for nearly instantaneous swap-out. This will, of course, increase your costs, something that you may want to take into account. Chapter 13 covers standby procedures in detail.

Gathering Statistics

When deciding on your cache's horsepower, many factors must be taken into account. To decide on your machine, you need an idea of the load that it will need to sustain: the peak number of requests per minute. This number indicates the number of 'objects' downloaded in a minute by clients, and can be used to get an idea of your cache load.

Computing the peak number of requests is difficult, since it depends on the browsing habits of users. This, in turn, makes deciding on the required hardware difficult. If you don't have many statistics as to your Internet usage, it is probably worth your while installing a test cache server (on any machine that you have handy) and pointing some of your staff at it. Using ratios, you can then estimate the number of requests for a larger user base.

When gathering statistics, make sure that you judge the 'peak' number of requests, rather than an average value. You shouldn't take the number of requests per day and divide, since your peak (during, for example, lunch hour) can be many times your average number of requests.

It's a very good idea to over-estimate hardware requirements. Stay ahead of the growth curve too, since an overloaded cache can spiral out of control due to a transient network problem. If a cache cannot deal with incoming requests for some reason (say a DNS outage), it still continues to accept incoming requests, in the hope that it can deal with them. If no requests can be handled, the number of concurrent connections will increase at the rate that new requests arrive.

If your cache runs close to capacity, a temporary glitch can increase the number of concurrent, waiting, requests tremendously. If your cache can't cope with this number of established connections, it may never be able to recover, with current connections never being cleared while it tries to deal with a huge backlog.


Squid 2.0 may be configured to use threads to perform asynchronous Input/Output on operating systems that support POSIX threads. Enabling async-IO can dramatically reduce your cache latency, allowing you to use a less powerful machine. Unfortunately not all systems support POSIX threads correctly, so your choice of hardware can depend on the abilities of your operating system. Your choice of operating system is discussed in the next section - check there whether your system supports threads.

Hard Disks

Disk Speed

There are numerous things to consider when buying disks. Earlier on we mentioned the importance of disks with a fast random-seek time, and with high sustained-throughput. Having the world's fastest drive is not useful, though, if it holds a tiny amount of data. To cache effectively you need disks that can hold a significant amount of downloaded data, but that are fast enough to not slow your cache to a crawl.

Seek time is one of the most important considerations if your cache is going to be loaded. If you have a look at a disk's documentation there is normally a random seek time figure. The smaller this value the better: it is the average time that the disk's heads take to move from one random track to another (in milliseconds).

Operating systems do all sorts of interesting things (which are not covered here) to attempt to speed up disk access times: waiting for disks can slow a machine down dramatically. These operating system features make it difficult to estimate how many requests per second your cache can handle before being slowed by disk access times (rather than by network speed). In the next few paragraphs we ignore operating system readahead, inode update seeks and more: it's a back of the envelope approximation for your use.

If your cache does not use asynchronous Input-Output (described in the Operating System section shortly) then your cache loses a lot of the advantage gained by multiple disks. If your cache is going to be loaded (or is running anywhere near capacity according to the formulae below) you must ensure that your operating system supports POSIX threads!

A cache with one disk has to seek at least once per request (ignoring RAM caching of the disk and inode update times). If you have only one disk, the formula for working out seeks per second (and hence requests per second) is quite simple:

    requests per second = 1000 / seek time (in milliseconds)

Squid load-balances writes between multiple cache disks, so if you have more than one data disk your seeks-per-second per disk will be lower. Almost all operating systems scale in a semi-linear fashion as you add more disks, though some may impose a small performance penalty. If you add more disks to the equation, the requests-per-second value becomes even more approximate! To simplify things in the meantime, we are going to assume that you use only disks with the same seek time. Our formula thus becomes:

    theoretical requests per second = 1000 / (seek time / number of disks)

Let's consider a less theoretical example: I have three disks, all with 12ms seek times. I can thus (theoretically, as always) handle:

    requests per second = 1000 / (12/3) = 1000/4 = 250 requests per second

While we are on this topic: many people query the use of IDE disks in caches. IDE disks these days generally have very similar seek times to SCSI disks, and (with DMA-compatible IDE controllers) can transfer data quickly without slowing the whole machine down.

Deciding how much disk space to allocate to Squid is difficult. For the pilot project you can simply allocate a few megabytes, but this is unlikely to be useful on a production cache.


Disk Space

The amount of disk space required depends on quite a few factors.

Assume that you were to run a cache just for yourself. If you were to allocate 1 gig of disk, and you browse pages at a rate of 10 megabytes per day, it will take at least 100 days for you to fill the cache. You can thus see that the rate of incoming cache queries influences the amount of disk to allocate.

If you examine the other end of the scale, where you have 10 megabytes of disk, and 10 incoming queries per second, you will realize that at this rate your disk space will not last very long. Objects are likely to be pushed out of the cache as they arrive, so getting a hit would require two people to be downloading the object at almost exactly the same time. Note that the latter is definitely not impossible, but it happens only occasionally on loaded caches.

The above certainly appears simple, but many people do not extrapolate. The same relationships govern the expulsion of objects from your cache at larger cache store sizes. When deciding on the amount of disk space to allocate, you should determine approximately how much data will pass through the cache each day. If you are unable to determine this, you could simply use the theoretical maximum transfer rate of your line as a basis. A 1Mb/s line can transfer about 125,000 bytes per second. If all clients were set up to access the cache, disk would be used at about 125 KB per second, which translates to about 450 megabytes per hour. If the bulk of your traffic is transferred during the day, you are probably transferring 3.6 gigabytes per day. If your line were 100% used you would probably have upgraded it a while ago, though, so let's assume you transfer 2 gigabytes per day. If you wanted to keep ALL data for a day, you would need 2 gigabytes of disk for Squid.
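Step by step, the calculation above is (the eight busy hours are an assumption about a daytime-heavy traffic pattern):

    1 Mb/s ≈ 125 000 bytes/s = 125 kB/s
    125 kB/s x 3600 s ≈ 450 MB per hour
    450 MB/hour x 8 busy hours ≈ 3.6 GB per day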

The feasibility of caching depends on two or more users visiting the same page while the object is still on disk. This is quite likely to happen with large sites (search engines, and the browsers' default home pages), but the chance of a user visiting the same obscure page twice is slim, simply due to the sheer volume of pages. In many cases the obscure pages are on the slowest links, frustrating users. Depending on the number of users requesting pages, you may need to keep pages for longer, so that the chance of different users accessing the same page twice is higher. Determining this value, however, is difficult, since it also depends on the average object size, which, in turn, depends on user habits.

Some people use RAID systems on their caches. This can dramatically increase availability, but a RAID-5 system can reduce disk throughput significantly. If you are really concerned with uptime, you may find a RAID system useful. Since the actual data in the cache store is not vital, though, you may prefer to fail the cache over manually, simply re-formatting or replacing drives. Sure, your cache may have a lower hit ratio for a short while, but you can easily balance this minute cost against what automatic-failover hardware would have cost you.

You should probably base your purchase on the bandwidth description above, and gather data to decide when to add more disk space.

Memory/RAM Requirements

Squid keeps an in-memory table of objects in RAM. Because of the way that Squid checks if objects are in the file store, fast access to the table is very important. Squid slows down dramatically when parts of the table are in swap.


Since Squid is one large process, swapping is particularly bad: if the operating system has to page parts of Squid in from swap, Squid must wait for the disk, and cannot service its other established connections in the meantime.

Each object stored on disk uses about 75 bytes of RAM in the index. The average size of an object on the Internet is about 13 KB, so if you have a gigabyte of disk space you will probably store around 80,000 objects.

At 75 bytes of RAM per object, 80,000 objects require about six megabytes of RAM. If you have 8 GB of disk you will need 48 MB of RAM just for the object index. It is important to note that this excludes memory for your operating system, the Squid binary, memory for in-transit objects, and spare RAM for disk caching.
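The arithmetic, step by step:

    objects stored ≈ 1 GB disk / 13 KB per object ≈ 80 000 objects
    index RAM ≈ 80 000 objects x 75 bytes ≈ 6 MB per GB of disk
    8 GB of disk → 8 x 6 MB ≈ 48 MB of index RAM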

So, what should the sustained throughput of your disks be? Squid tends to read in small blocks, so throughput is of lesser importance than random seek times. Generally, disks with fast seeks offer high throughput anyway, and most disks (even IDE disks these days) can transfer data faster than clients can download it from you. Don't blow a year's budget on really high-speed disks; go for lower seek times instead - or add more disks.

CPU Power

Squid is not generally CPU-intensive. On startup Squid can use a lot of CPU while it works out what is in the cache, and a slow CPU can slow down access to the cache for the first few minutes after startup. A Pentium 133 machine generally runs pretty idle while receiving 7 TCP requests a second.

A multiprocessor machine generally doesn't increase speed dramatically: only certain portions of the Squid code are threaded. These sections of code are not processor intensive either: they are the code paths where Squid is waiting for the operating system to complete something. A multiprocessor machine generally does not reduce these wait times: more memory (for caching of data) and more disks may help more.

Choosing an Operating System

Where I work, we run many varieties of Unix. When I first installed Squid it was on my desktop Linux machine - if I break it by mistake it's not going to cause users any hassles, so I am free to do with it what I wish.

Once I had tested Squid, we decided to allow general access to the cache. I installed Squid on the fastest unused machine we had available at the time: a (then, at least) top of the range Pentium 133 with 128Mb of RAM running FreeBSD.

I was much more familiar with Linux at that stage, and eventually installed Linux on the public cache machine. Though running Linux caused some inconveniences (specifically with low per-process filehandle limits), it was the right choice, simply because I could maintain the machine better. Many times my experience with Linux has gotten me out of potentially sticky situations.

If your choice of operating system saves you time, and runs Squid, use it! Just as I didn't use Digital Unix (Squid is developed on funded Digital Unix machines at NLANR), you don't need to use Linux just because I do.


Most modern operating systems sport both similar performance and similar feature sets. If your system is commonly used and roughly POSIX-compliant at the source level, it will almost certainly be supported by Squid.

When was the last time you had an outage due to hardware failure? Unless you are particularly unlucky, hardware failures are few and far between. While the quality of hardware has increased dramatically, software often does not keep pace, and many outages are caused by faulty operating system or application software. You must thus be able to pick up the pieces if your operating system crashes for some reason.

Experience

If you normally work on a specific operating system, you should probably not use your cache as a system to experiment with a new 'flavor' of Unix. If you have more experience in one operating system, you should use that system as the basis for your cache server. Customers rapidly turn off caching when a cache stops accepting requests (while you learn your way around some 'feature').

Your cache system will almost certainly form a core part of your network as soon as it is stable. You must be able to return the system to working order in minimal time in the event of a system failure, and this is where your existing experience becomes crucial. If the failure happens out of business hours you may not be able to get technical support from your vendor: a dialup ISP's hours of business differ dramatically from those of operating system vendors.

Features

Though most operating systems support similar features, there are often no standards for functions required by some of the less commonly used operating system features. One example is transparency: many operating systems can now support transparent redirection to a local program, but almost all of them do it in a different way, since there is no real standard for how an operating system is supposed to function in this scenario.

If you are unable to find information about Squid on your operating system, you may want to organize a trial hardware installation (assuming that you are using a commercial operating system) as a test. Only when you have the system running can you be sure that your operating system supports the required features. Squid runs on most Unix-like systems, including Linux and FreeBSD.

If you are using Squid without extensions like transparency and ARP access control lists, you should not have problems. For your convenience a table of operating system support of specific features is included. Since Squid is constantly being developed, it's likely that this list will change.

Compilers

Squid is written on Digital Unix machines running the GNU C compiler (GCC). GCC is included with free operating systems such as Linux and FreeBSD, and is easily available for many other operating systems and hardware platforms. The GNU compiler adheres as closely to the ANSI C standard as possible, so if a different compiler is included with your operating system, it may (or may not) have trouble interpreting Squid's source code, depending on its level of ANSI compliance. In practice, most compilers work fine.

Some commercial compilers choose backward compatibility with older versions over ANSI compliance. These compilers generally support an option that turns on 'ANSI compliant mode'. If you have trouble compiling Squid you may have to turn this mode on. In the worst possible scenario you may have to compile GCC with your existing compiler and then use GCC to compile Squid.

If you do not have a compiler, you may be able to find a precompiled version of GCC for your system on the Internet. Be very careful when installing software from untrusted sources. This is discussed shortly in the "precompiled binary" section.

If you cannot find versions of GCC for your platform, you may have to factor in the cost of the compiler when deciding on your operating system and hardware.

Basic System Setup

Before you even install the operating system, it's best to get an idea as to how the system will look once Squid is up and running. This will allow you to partition the disks on the machine so that their mount path will match Squid's default configuration.

Default Squid directory structure

Normally Squid's directory tree looks like this:

    /usr/local/squid/
        bin/
        cache/
        etc/
        src/squid-2.0/

Working through the directories below /usr/local/squid, the cache directory deserves the most attention: if you have more than one partition for the cached data, you can make subdirectories for each of the filesystems in the cache directory. Normally people name these directories cache1, cache2, cache3 and so forth. Your cache directories should be mounted somewhere like /usr/local/squid/cache/1/ and /usr/local/squid/cache/2/. If you have only one cache disk, you can simply name the directory /usr/local/squid/cache/.
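As a sketch of that layout (the device names and filesystem type are purely illustrative), the matching /etc/fstab entries for two dedicated cache disks might look like this:

    /dev/sdb1   /usr/local/squid/cache/1   ext2   defaults   0 2
    /dev/sdc1   /usr/local/squid/cache/2   ext2   defaults   0 2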

In Squid-1.1 cache directories had to be identical in size. This is no longer the case, so if you are upgrading to Squid 2.0 you may be able to resize your cache partitions. To do this, however, you may have to repartition disks and reformat.

When you upgrade to the latest version of Squid, it's a good idea to keep the old working compiled source tree somewhere. If you upgrade to the latest Squid and encounter problems, simply kill Squid, change to the previous source directory and reinstall the old binaries. This is a lot faster than trying to remember which source tree you were running, downloading it, compiling it, applying local patches and then reinstalling.

User and Group IDs

Squid, like most daemon processes on Unix machines, normally runs as the user nobody and with the group nogroup.


For maximum flexibility in allowing root and non-root users to manipulate the Squid configuration, you should create a new user and two new groups specifically for the Squid system, rather than using the nobody and nogroup IDs. Throughout this book we assume that you have done so, and that a group and a user have been created (both called squid), together with a second admin group called squidadm. The squid user's primary group should be squid, and the user's home directory should be /usr/local/squid (the default squid software install destination).

When you have multiple administrators of a cache machine, it is useful to have a dedicated squidadm group, with sub-administrators added to this group. This way, you don't have to change to the root user whenever you want to make changes to the Squid config. It's possible for users in the squidadm group to gain root access, so you shouldn't place people without root access in the squidadm group.

When the config file has been changed, a signal has to be sent to the Squid process to inform it that the config files are to be re-read. Sending a signal to a running process isn't possible unless the sender has the same user-id as the receiver (or is root). Other config file maintainers need permission to change their user-id (either by using the 'su' command, or by logging in with another session) to either the root user or to the user Squid is running as.

In some environments cache software maintainers aren't trusted with root access, and the user nobody isn't allowed to log in. The best solution is to give users who need to make changes to the config file access to a reload script via sudo. Sudo is available for many systems, and source code is freely available.
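A minimal sketch of that arrangement, assuming the install paths used in this book (squid -k reconfigure is the standard way to ask a running Squid to re-read its config file):

    #!/bin/sh
    # reload-squid: ask the running Squid to re-read squid.conf
    /usr/local/squid/bin/squid -k reconfigure

and a matching /etc/sudoers entry, allowing members of the squidadm group to run only this script as root:

    %squidadm ALL = (root) NOPASSWD: /usr/local/squid/bin/reload-squid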

In Chapter 4 we go through the process of changing the user-id that Squid runs as, so that files Squid creates are owned by the squid user-id, and by the group squid. Binaries are owned by root, and config files are changeable by the squidadm group.

Getting Squid

Now that your machine is ready for your Squid install, you need to download and install the Squid program. This can be done in two ways: you can download a source version and compile it, or you can download a precompiled binary version and install that, relying on someone else to do the compilation for you.

Binary versions of Squid are generally easier to install than source code versions, specifically if your operating system vendor distributes a package which you can simply install.

Installing Squid from source code is recommended. This method allows you to turn on compile-time options that may not be included in distributed binary versions (to pick one example of many: SNMP support is only included if it is specifically enabled at compile time, and most available binary versions do not include it). If your operating system has been tuned so that Squid can run better (let's say you have increased the number of open filehandles per process), a precompiled binary will not take advantage of this tuning, since your compiler header files are probably different from the ones on the machine where the binaries were compiled.

It's also a little worrying running binaries that other people distribute (unless, of course, they are officially supplied by your operating system vendor): what if they have placed a trojan into the binary version? To ensure the security of your system it is recommended that you compile from the official source tree.


Since we suggest installing from source code, we cover that first; if you have to download a Squid binary from somewhere, simply skip to the next sub-section, Getting Binary Versions of Squid.

Getting the Squid source code

Squid source is mirrored by numerous sites. For a list of mirrors, have a look at http://www.squid-cache.org/Mirrors/.

Deciding which of the available files to download can become an issue, especially if you are not familiar with the Squid version naming convention. Squid is (as of this writing) in version 2. As features are added, the minor version number is incremented (Squid 2.0 becomes Squid 2.1, then Squid 2.2, and so on). Since new features may introduce new bugs, the first version including new features is distributed as a pre-release (or beta) version. The first pre-release of Squid 2.1 is called squid-2.1.PRE1-src.tar.gz; the second is squid-2.1.PRE2-src.tar.gz. Once Squid is considered stable, a general release version is distributed: the first release version is called squid-2.0.RELEASE-src.tar.gz, and the second (which would include minor bug fixes) squid-2.0.RELEASE2-src.tar.gz.

In short, files are named as follows: squid-2.minor-version-number.stability-info.release-number-src.tar.gz. Unless you are a Squid developer, you should download the latest available RELEASE version: you are less likely to encounter bugs this way.

Squid source is normally available via FTP (the File Transfer Protocol), so you should be able to download Squid source by using the ftp program, available on almost every Unix system. If you are not familiar with ftp, you can simply select the mirror closest to you with your browser and save the Squid source to your disk by right-clicking on the filename and selecting save as (do not simply click on the filename - many browsers attempt to extract compressed files, printing the tar file to your browser window: this is definitely not what you want!). Once the download is complete, transfer the file to the cache machine.
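If you prefer the command-line ftp client, a session might look something like this (the mirror hostname, directory and version are illustrative; use a real mirror from the list above):

    $ ftp ftp.squid-cache.org
    ftp> cd pub/squid-2
    ftp> binary
    ftp> get squid-2.0.RELEASE2-src.tar.gz
    ftp> quit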

Getting Binary Versions of Squid

Finding binary versions of Squid to install is easy: deciding which binary to trust is more difficult. If you do not choose carefully, someone could undermine your system security. If you cannot compile Squid yourself, but know (and trust) someone that can do it for you, get them to help. It's better than downloading a version contributed by someone that you don't know.

Compiling Squid

Compiling Squid is quite easy: you need the right tools to do the job, though. First, let's go through getting the tools, then you can extract the source code package, include optional Squid components (using the configure command) and then actually compile the distributed code into a binary format.

A word of warning, though: this is the stage where most people run into problems. If you haven't compiled source before, try and follow the next section in order - it shouldn't be too bad. If you don't manage to get Squid running, at least you have gained experience.

Compilation Tools


The GNU utilities mentioned below are available via FTP from the official GNU ftp site or one of its mirrors. A list of mirrors is available at http://www.gnu.org/, or download them directly from ftp://ftp.gnu.org/.

The GNU compiler is only distributed as source (creating a chicken-and-egg problem if you do not have a compiler), so you may have to do an Internet search (using one of the standard search engines) to find a binary copy of the GNU compiler for your system. The Squid source is distributed in compressed form: first a standard tar file is created, and this file is then compressed with the GNU gzip program. To decompress this file you need a copy of gzip. GCC (the GNU C Compiler) is the recommended compiler: the developers wrote Squid with it, and it is available for almost all systems. You will also need the make program, of which there is also a GNU version easily available.

If possible, install a C debugger: the GNU debugger (GDB) is available for most platforms. A debugger is not necessary for installation, but it is very useful in the case of software bugs (as discussed in chapter 13).

Unpacking the Source Archive

Earlier we looked at the tree structure of the /usr/local/squid directory. I suggest extracting the Squid source to the /usr/local/squid/src directory. So, create the directory and copy the downloaded Squid tar.gz file into it.

First let's decompress the file. Some versions of tar can decompress the file in one step, but for compatibility's sake we are going to do it in two steps. Decompress the tar file by running gzip -dv squid-version.tar.gz. If all has gone well you should have a file called squid-version.tar in the current directory. To get the files out of the "tarball", run tar xvf squid-version.tar.

Tar automatically puts the files into a subdirectory: something like squid-2.1.PRE2. Change into the extracted directory, and we can start configuring the Squid source.
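Putting the whole unpacking sequence together (the version number is illustrative; substitute the file you downloaded):

    mkdir -p /usr/local/squid/src
    cd /usr/local/squid/src
    gzip -dv squid-2.0.RELEASE2-src.tar.gz
    tar xvf squid-2.0.RELEASE2-src.tar
    cd squid-2.0.RELEASE2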

Running configure

Now that you have decided which options to use, it's time to run configure. Here's an example:

    ./configure --enable-err-language=Bulgarian --prefix=/usr/local

Running ./configure with the options that you have chosen should go smoothly. In the unlikely event that configure returns with an error message, here are some suggestions that may help.

Broken compilers

The most common problem for new installers is a problem with the installed compiler (or the headers) for the system. To test this theory, simply try to run configure with no options at all. If you still get an error message, it is almost certainly a compiler or header file problem.

To make sure, try to compile a program that uses some of the less commonly used system calls and see if it compiles. If your compiler doesn't compile files correctly, check whether the header files exist and, if they do, check the permissions on the directory and on the include files themselves.

If you have installed GCC in a non-standard directory, or if you are cross-compiling, you may need configure to append options to the GCC command it uses during its tests. You can get configure to append options to the GCC command line by setting the CFLAGS shell variable prior to running configure. If, for example, your compiler only works when you modify the default include directory, you can get configure to append that option to the default command line with a (Bourne shell) command like:

    CFLAGS=-I/usr/people/staff/oskar/gcc/include
    export CFLAGS

Incompatible Options

Some configuration options exclude the use of others. This is another common cause of problems. To test this, just try to run configure without any options at all, and see if the problem disappears. If so, you can work out which option is causing the conflict by adding each option to the configure command line one by one. You may find that you have to choose between two options (for example async-IO and the DL-Malloc routines). In this case you will have to decide which of the options is more important in your setup.

Compiling the Squid Source

Now that you have configured Squid, you need to make the Squid binaries. You should simply have to run make in the extracted source directory, and a binary will be created as src/squid:

    cache:/ # cd /usr/local/squid/src/squid-2.2.RELEASE
    cache:/usr/local/squid/src/squid-2.2.RELEASE # make

If the compilation fails, it may be because of conflicting configure options as described in the configure section. Follow the same instructions described there to find the offending option. (You should run make clean between configure runs, to ensure that old binaries are removed) As a start, try running configure without any options at all and then see if make completes. If this works, try additional configure options one at a time to see which one causes the problem.
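Concretely, the debugging loop might look like this (--enable-snmp stands in for whatever option you are testing):

    make clean
    ./configure
    make                        # does a bare build complete?
    make clean
    ./configure --enable-snmp   # re-add your options one at a time
    make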

Installing the Squid binary

The make command creates the binary, but doesn't install it.

Running make install creates the /usr/local/squid/bin and /usr/local/squid/etc subdirectories, and copies the binaries and default config files into the appropriate directories. Permissions may not be set correctly, but we will work through all the created directories and set them up correctly shortly.

This command also copies the relevant config files into the default directories. The standard config file included with the source is placed in the etc subdirectory, as are the mime.types file and the default Squid MIB file (squid.mib).

If you are upgrading (or reinstalling), make install will overwrite binary files in the bin directory, but will not overwrite your painfully manipulated configuration files. If the destination configuration file exists, make install will instead create a file called filename.default. This allows you to check if useful options have been added by comparing config files. If all has gone well you should have a fully installed (but unconfigured) Squid system setup. Congratulations!
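For example, after an upgrade you can compare your working config against the freshly installed defaults (since the config file is squid.conf, the preserved default is squid.conf.default):

    cd /usr/local/squid/etc
    diff -u squid.conf squid.conf.default | less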


Squid Configuration Basics


For Squid, the default configuration file is probably right for 90% of installations - once you have Squid running, you should change the configuration file one option at a time. Don't get over-ambitious in your changes quite yet! Leave things like refresh rules until you have experimented with the basic options - what port you want to accept your requests on, what user to run as, and where to keep cached pages on your drives.

So that you can get Squid running, this chapter works through the basic Squid options, giving you background information and introducing you to some of the basic concepts. In later chapters you'll move on to more advanced topics.

The Squid config file is not arranged in the same order as this book. The config file also does not progress from basic to advanced config options in any specific order, but instead consists of related sections, with all hierarchy settings in a specific section of the file, all access controls in another and so forth.

To make changes detailed in this chapter you are going to have to skip around in the config file a bit. It's probably easiest to simply search for the options discussed in each subsection of this chapter, but if you have some time it will be best if you read through the config file, so that you have an idea of how sections fit together.

The chapter also points out options that may have to be changed on the other 10% of machines. If you have a firewall, for example, you will almost certainly have to configure Squid differently from someone who doesn't.

Version Control Systems

I recommend that you put all Squid configuration files and startup scripts under revision control. If you are like me, you love to play with new software: you change an option, get the program to re-read the configuration file, and see what difference it makes. By repeating this process, you learn what each option does, and at the same time you gain experience and discover why the program is written the way it is. Quite often configuration files make no sense until you discover the overall structure of the underlying program.


The best way for you to understand each of the options in the Squid config file (and to understand Squid itself) is to experiment with the multitude of options. At some point in this experimentation you will find that you break something. It's useful to be able to revert to a previous version (or simply to be reminded of what changes you have made).

Many readers will already have used a revision control system. The RCS system is included with many Unix systems, and source is freely available. For the few that haven't used RCS, however, it's worth including pointers to some manual pages: ci(1), co(1), rcs(1), rcsdiff(1) and rlog(1).

One of the wonders of Unix is the ability to create scripts which reduce the number of commands that you have to type to get something done. I have a short script on all the machines I maintain called rvi. Using rvi instead of vi allows me to use one command to edit files under RCS (as opposed to the customary four). Put this file somewhere in your path and make it executable with chmod +x rvi. You can then simply use a command like rvi squid.conf to edit files that are under revision control. This is a lot quicker than running each of the co, rcsdiff and ci commands:

    #!/bin/sh
    co -l $1
    $VISUAL $1
    rcsdiff -u $1
    ci -u $1

The Configuration File

All Squid configuration files are kept in the directory /usr/local/squid/etc. Though there is more than one file in this directory, only one file is important to most administrators: the squid.conf file. Though there are (as of this writing) 125 option tags in this file, you should only need to change eight options to get Squid up and running. The other 117 options give you amazing flexibility, but you can learn about them once you have Squid running, by playing with the options or by reading the descriptions in chapter 10.

Squid assumes that you wish to use the default value if there is no occurrence of a tag in the squid.conf file. Theoretically, you could even run Squid with a zero length configuration file.

The remainder of this chapter works through the options that you may need to change to get Squid to run. Most people will not need to change all of these settings. You will need to change at least one part of the configuration file though: the default squid.conf denies access to all browsers. If you don't change this, Squid will not be very useful!
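To make that concrete, here is a sketch of what a minimal squid.conf for a first test might contain (the acl names and address range are illustrative; substitute your own network). The last three lines are exactly the access-control change mentioned above, since the default file denies access to all browsers:

    http_port 3128
    cache_dir ufs /usr/local/squid/cache 100 16 256
    acl all src 0.0.0.0/0.0.0.0
    acl mynetwork src 192.168.1.0/255.255.255.0
    http_access allow mynetwork
    http_access deny all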

Setting Squid's HTTP Port

The first option in the squid.conf file sets the HTTP port(s) that Squid will listen on for incoming requests.

Network services listen on particular ports. Ports below 1024 can only be used by the system administrator, and are used by programs that provide basic Internet services: SMTP, POP, DNS and HTTP (web). Ports above 1024 are used for untrusted services (where a service does not run as administrator), and for transient connections, such as outgoing data requests.


Typically, web servers listen for incoming web requests (using the HyperText Transfer Protocol - HTTP) on port 80.

Squid's default HTTP port is 3128. Many people run their cache servers on a port which is easier to remember, something like 80 or 8080. If you choose a low-numbered port, you will have to start Squid as root (otherwise you are considered untrusted, and you will not be able to start Squid). Many ISPs use port 8080, making it an accepted pseudo-standard.

If you wish, you can use multiple ports by appending a second port number to the http_port variable. Here is an example:

    http_port 3128 8080

It is very important to refer to your cache server with a generic DNS name. Simply because you only have one server now does not mean that you should not plan for the future. It is a good idea to setup a DNS hostname for your proxy server. Do this right away! A simple DNS entry can save many hours further down the line. Configuring client machines to access the cache server by IP address is asking for a long, painful transition down the road. Generally people add a hostname like cache.mydomain.com to the DNS. Other people prefer the name proxy, and create a name like proxy.mydomain.com.
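In BIND-style zone-file syntax (the hostnames are illustrative), such an entry can be as simple as an alias pointing at the machine currently running Squid:

    cache    IN    CNAME    server1.mydomain.com.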

Using Port 80

HTTP defines the format of both the request for information and the format of the server response. The basic aspects of the protocol are quite straightforward: a client (such as your browser) connects to port 80 and asks for the file by supplying the full path and filename that it wishes to download. The client also specifies the version of the HTTP protocol it wishes to use for the retrieval.

With a proxy request the format is only a little different. The client specifies the whole URL instead of just the path to the file. The proxy server then connects to the web server specified in the URL, and sends a normal HTTP request for the page.
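For example (the hostname is illustrative), a normal request sent directly to a web server looks like this:

    GET /index.html HTTP/1.0

while the same request sent to a proxy server specifies the whole URL:

    GET http://www.example.com/index.html HTTP/1.0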

Since the format of proxy requests is so similar to a normal HTTP request, it is not especially surprising that many web servers can function as proxy servers too. Changing a web server program to function as a proxy normally involves comparatively small changes to the code, especially if the code is written in a modular manner - as is the Apache web server. In many cases the resulting server is not as fast, or as configurable, as a dedicated cache server can be.

The CERN web server httpd was the first widely available web proxy server. The whole WWW system was initially created to give people easy access to CERN data, and CERN httpd was thus the de-facto test-bed for new additions to the initial informal HTTP specification. Most (and certainly at one stage all) of the early web sites ran the CERN server. Many system administrators who wanted a proxy server simply used their standard CERN web server (listening on port 80) as their proxy server, since it could function as one. It is easy for the web server to distinguish a proxy request from a normal web page request, since it simply has to check whether a full URL is given instead of just a path name. Given the choice (even today), many system administrators would choose port 80 as their proxy server port simply because 'port 80 is the standard port for web requests'. There are, however, good reasons for you to choose a port other than 80.

Running both services on the same port meant that if the system administrator wanted to install a different web server package (for extra features available in the new software) they would be limited to software that could perform both as a web server and as a proxy. Similarly, if the same sysadmin found that their web server's low-end proxy module could not handle the load of their ever-expanding local client base, they would be restricted to a proxy server that could function as a web server. The only other alternative is to re-configure all the clients, which normally involves spending a few days apologizing to users and helping them through the steps involved in changing over.

Microsoft uses the Microsoft web server (IIS) as a basis for their proxy server component, and Microsoft Proxy thus only accepts incoming proxy requests on port 80. If you are installing a Squid system to replace either CERN, Apache or IIS running in both web-server and cache-server modes on the same port, you will have to set http_port to 80. Squid is written only as a high-performance proxy server, so there is no way for it to function as a web server; Squid has no support for reading files from a local disk, running CGI scripts and so forth. There is, however, a workaround.

If you have both services running on the same port, and you cannot change your client PCs, do not despair. Squid can accept requests in web-server format and forward them to another server. If you have only one machine, and you can get your web server software to accept incoming requests on a non-default port (for example 81), Squid can be configured to forward incoming web requests to that port. This is called accelerator mode (since its initial purpose was to speed up very slow web servers). Squid effectively does some translation on the original request, and then simply acts as if the request were a proxy request and connects to the host: the fact that it's not a remote host is irrelevant. Accelerator mode is discussed in more detail in chapter 9. Until then, get Squid installed and running on another port, and work your way through the first couple of chapters of this book, until you have a working pilot-phase system. Once Squid is stable and tested you can move on to changing web server settings. If you feel adventurous, however, you can skip there shortly!
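As a taste of what chapter 9 covers, here is a minimal sketch of such a setup using the Squid 2.x accelerator directives (the port numbers are illustrative; later Squid releases replaced these directives with an accel option on http_port plus a cache_peer ... originserver line):

http_port 80
httpd_accel_host localhost
httpd_accel_port 81
httpd_accel_with_proxy on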

Where to Store Cached Data

Cached data has to be kept somewhere. In the section on hardware sizing, we discussed the size and number of drives to use for caching. Squid cannot autodetect where to store this data, though, so you need to let Squid know which directories it can use for data storage.

The cache_dir operator in the squid.conf file is used to configure specific storage areas. If you use more than one disk for cached data, you may need more than one mount point (for example /usr/local/squid/cache1 for the first disk, /usr/local/squid/cache2 for the second). Squid allows you to have more than one cache_dir option in your config file.

Let's consider only one cache_dir entry in the meantime. Here I am using the default values from the standard squid.conf:

cache_dir ufs /usr/local/squid/var/cache/ 100 16 256

The first option to the cache_dir tag sets the directory where data will be stored. By default it is the path shown above: the install prefix with /var/cache/ tagged onto the end. This directory is also made by the make install command that we used earlier.

The next option to cache_dir is straightforward: it's a size value. Squid will store up to that amount of data in that directory. The value is in megabytes, and sets the maximum size of the cache store; the default is 100 megabytes.

The other two options are more complex: they set the number of subdirectories (first and second tier) to create in this directory. Squid makes lots of directories and stores a few files in each of them in an attempt to speed up disk access (finding the correct entry in a directory with one million files in it is not efficient: it's better to split the files up into lots of smaller sets of files... don't worry too much about this for the moment). I suggest that you use the default values for these options in the meantime: if you have a very large cache store you may want to increase these values, but this is covered in a later section.

Email for the Cache Administrator

If Squid dies, an email is sent to the address specified with the cache_mgr tag. This address is also appended to the end of error pages returned to users if, for example, the remote machine is unreachable.
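For example (the address itself is only an illustration):

cache_mgr squid-admin@mydomain.example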

Logging

Logging is fairly straightforward. The default squid.conf defines a useful selection of logformats. Using the access_log tag you specify where to log, and the format to log in. The directives for logging have been renamed between versions: cache_access_log became access_log and gained an optional logformat option.

The old useragent_log and referer_log directives are supported if Squid is compiled with the appropriate compiler flags. Some distributions (Debian, for example) ship Squid 3 with these options disabled.

The "combined" logformat, which mimics the Apache combined log file format as it includes referer and User-agent information as well as appending the Squid specific information, makes a good choice for many installations, as it is compatible with many Apache log tools. Although administrators may want to check out that their preferred squid logfile tools work with that format. Example config file entry: access_log /var/log/squid/access.log combined

Effective User and Group ID

Squid can only bind to low-numbered ports (such as port 80) if it is started as root. Squid is normally started by your system's rc scripts when the machine boots. Since these scripts run as root, Squid is started as root at bootup time.

Once Squid has been started, however, there is no need to run it as root. Good security practice is to run programs as root only when it's absolutely necessary, and for this reason Squid changes user and group ID's once it has bound to the incoming network port.

The cache_effective_user and cache_effective_group tags tell Squid what IDs to change to. The Unix security system would be useless if it allowed all users to change their IDs at will, so Squid only attempts to change IDs if the main program is started as root.

If you do not have root access to the machine, and are thus not starting Squid as root, you can simply leave this option commented out. Squid will then run with whatever user ID starts the actual Squid binary.

As discussed in chapter 2, this book assumes that you have created both a squid user and a squid group on your cache machine. The above tags should thus both be set to "squid".
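In squid.conf that looks like this:

cache_effective_user squid
cache_effective_group squid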

FTP login information


Squid can act as a proxy server for various Internet protocols. The most commonly used protocol is HTTP, but the File Transfer Protocol (FTP) is still alive and well.

FTP was written for authenticated file transfer (it requires a username and password). To provide public access, a special account is created: the anonymous user. When you log into an FTP server you use this as your username. As a password you generally use your email address. Most browsers these days automatically enter a useless email address.

It's polite to give an address that works, though. If one of your users abuses a site, it allows the site admin to get hold of you easily.

Squid allows you to set the email address that is used with the ftp_user tag. You should probably create a dedicated email address specifically for people to contact you on.

There is another reason to enter a proper address here: some servers require a real email address. For your proxy to log into these ftp servers you will have to enter a real email address here.
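For example (again, an illustrative address):

ftp_user squid-admin@mydomain.example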

Access Control Lists and Access Control Operators

Squid could not be used in an ISP environment without a sophisticated access control system. Indeed, Squid should not be used in ANY environment without at least some basic access control. It is amazing how fast other Internet users will find out that they can relay requests through your cache, and then proceed to do so.

Why? Sometimes to obfuscate their real identity, and other times since they have a fast line to you, but a slow line to the remainder of the Internet.

Simple Access Control

In many cases only the most basic level of access control is needed. If you have a small network, and do not wish to use things like user/password authentication or blocking by destination domain, you may find that this small section is sufficient for all your access control setup. If not, you should read chapter 7, where access control is discussed in detail.

The simplest way of restricting access is to only allow IPs that are on your network. If you wish to implement different access control, it's suggested that you put this in place later, after Squid is running. In the meantime, set it up, but only allow access from your PC's IP address.

Example access control entries are included in the default squid.conf. The included entries should help you avoid some of the more obscure problems, such as bandwidth-chewing loops, cache tunneling with SSL CONNECTs and other strange access problems. In chapter 7 we work through the config file's default config options, since some of them are pretty complex.

Access control is done on a per-protocol basis: when Squid accepts an HTTP request, the list of HTTP controls is checked. Similarly, when an ICP request is accepted, the ICP list is checked before a reply is sent.

Assume that you have a list of IP addresses that are to have access to your cache. If you want them to be able to access your cache with both HTTP and ICP, you would have to enter the list of IP addresses twice: you would have lines something like this:

acl localnet src 192.168.1.0/255.255.255.0
http_access allow localnet
icp_access allow localnet

Rule sets like the above are great for small organisations: they are straightforward. Note that as http_access and icp_access rules are processed in the order they appear in the file, you will need to place the http_access and icp_access entries as is appropriate.

For large organizations, though, things are more convenient if you can create classes of users. You can then allow or deny classes of users in more complex relationships. Let's look at an example like this, where we duplicate the above example with classes of users:
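Something along the following lines would do it (a sketch, reusing the localnet range from above together with the catch-all and localhost acls discussed later in this section):

acl localnet src 192.168.1.0/255.255.255.0
acl localhost src 127.0.0.1/255.255.255.255
acl all src 0.0.0.0/0.0.0.0
http_access allow localnet
http_access allow localhost
http_access deny all
icp_access allow localnet
icp_access deny all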

Sure, it's more complex for this example. The benefits only become apparent if you have large access lists, or when you want to integrate refresh-times (which control how long objects are kept) and the sources of incoming requests. I am getting quite far ahead of myself, though, so let's skip back.

We need some terminology to discuss access control lists, otherwise this could become a rather long chapter. So: lines beginning with acl are (appropriately, I believe) acl lines. The lines that use these acls (such as http_access and icp_access in the above example) are called acl-operators. An acl-operator can either allow or deny a request.

So, to recap: acls are used to define classes. When Squid accepts a request it checks the list of acl-operators specific to the type of request: an HTTP request causes the http_access lines to be checked; an ICP request checks the icp_access lists.

Acl-operators are checked in the order that they occur in the file (ie from top to bottom). The first acl-operator line that matches causes Squid to drop out of the acl list. Squid will not check through all acl-operators if the first denies the request.

In the previous example, we used a src acl: this checks that the source of the request is within the given IP range. The src acl-type accepts IP address lists in many formats, though we used the subnet/netmask form in the earlier example. CIDR (Classless Internet Domain Routing) notation can also be used here. Here is the same address range in each notation:

CIDR: 192.168.1.0/24
Subnet/Netmask (dot notation): 192.168.1.0/255.255.255.0

Access control lists inherit permissions when there is no matching acl

If all acl-operators in the file are checked, and no match is found, the last acl-operator checked determines whether the request is allowed or denied. This can be confusing, so it's normally a good idea to place a final catch-all acl-operator at the end of the list. The simplest way to create such an operator is to create an acl that matches any IP address. This is done with a src acl with a netmask of all 0's: when the netmask arithmetic is done, Squid will find that any IP matches this acl.
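Such a catch-all pair would look like this, denying anything not matched by an earlier rule:

acl all src 0.0.0.0/0.0.0.0
http_access deny all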

Your cache server may well be on a network covered by the relevant allow lists on your cache, and if you were thus to run the client on the cache machine (as opposed to another machine somewhere on your network) the above acl and http_access rules would allow you to test the cache. In many cases, however, a program running on the cache server will end up connecting to (and from) the address 127.0.0.1 (also known as localhost). Your cache should thus allow requests to come from the address 127.0.0.1/255.255.255.255. In the below example we don't allow ICP requests from the localhost address, since there is no reason to run two caches on the same machine.
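A sketch of the rules just described:

acl localhost src 127.0.0.1/255.255.255.255
http_access allow localhost
icp_access deny localhost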

The squid.conf file that comes with Squid includes acls that deny all HTTP requests. To use your cache, you need to explicitly allow incoming requests from the appropriate range. The squid.conf file includes text that reads:

## INSERT YOUR OWN RULE(S) HERE TO ALLOW ACCESS FROM YOUR CLIENTS
#

To allow your client machines access, you need to add rules similar to the below in this space. The default access-control rules stop people exploiting your cache, so it's best to leave them in.
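For example, assuming your clients fall in the localnet range used earlier:

acl localnet src 192.168.1.0/255.255.255.0
http_access allow localnet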

Ensuring Direct Access to Internal Machines

Acl-operator lines are not only used for authentication. In an earlier section we discussed communication with other cache servers. Acl lines are used to ensure that requests for specific URLs are handled by your cache, not passed on to another (further away) cache.

If you don't have a parent cache (neither a firewall proxy nor a parent ISP cache) you can probably skip this section.

Let's assume that you connect to your ISP's cache server as a parent. A client machine (on your local network) connects to your cache and requests http://www.yourdomain.example/. Your cache server will look in the local cache store. If the page is not there, Squid will connect to its configured parent (your ISP's cache: across your serial link), and request the page from there. The problem, though, is that there is no need to connect across your internet line: the web server is sitting a few feet from your cache in the machine room.

Squid cannot know that it's being very inefficient unless you give it a list of sites that are "near by". This is not the only way around this problem, though: your browser could be configured to ignore the cache for certain IPs and domains, so that the request never reaches the cache in the first place. Browser config is covered in Chapter 5, but in the meantime here is some info on how to configure Squid to communicate directly with internal machines.

The acl-operators always_direct and never_direct determine whether to pass the connection to a parent or to proceed directly.

The following set of operators is based on the final configuration created in the previous section, but adds never_direct and always_direct operators. It is assumed that all servers that you wish to connect to directly are in the address ranges specified with the my-iplist directives. In some cases you may run a web server on the same machine as the cache server, and the localhost acl is thus also considered local.
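A sketch of what such a rule set could look like (the my-iplist range is illustrative):

acl my-iplist src 192.168.1.0/255.255.255.0
acl localhost src 127.0.0.1/255.255.255.255
always_direct allow my-iplist
always_direct allow localhost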

The always_direct and never_direct tags are covered in more detail in Chapter 7, where we cover hierarchies in detail.

Squid always attempts to cache pages. If you have a large Intranet system, it's a waste of cache store disk space to cache your Intranet. Controlling which URLs and IP ranges not to cache is covered in detail in chapter 6, using the no_cache acl-operator.
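As a hedged sketch of that operator (the domain is illustrative; later Squid releases renamed no_cache to simply cache):

acl intranet dstdomain .mydomain.example
no_cache deny intranet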

Communicating with other proxy servers

Squid supports the concept of a hierarchy of proxies. If your proxy does not have an object on disk, its default action is to connect to the origin web server and retrieve the page. In a hierarchy, your proxy can communicate with other proxies (in the hope that one of these servers will have the relevant page). You will, obviously, only peer with servers that are 'close' to you, otherwise you would end up slowing down access. If access to the origin server is faster than access to neighboring cache servers, it is not a good idea to get the page from the slower link!

Having the ability to treat other caches as siblings is very useful in some interactions. For example: if you often do business with another company, and have a permanent link to their premises, you can configure your cache to communicate with their cache. This will reduce overall latency: it's almost certainly faster to get the page from them than from the other side of the country.

When querying more than one cache, Squid does not query each in turn, and wait for a reply from the first before querying the second (since this would create a linear slowdown as you add more siblings, and if the first server stops responding, you would slow down all incoming requests). Squid thus sends all ICP queries together - without waiting for replies. Squid then puts the client's request on hold until the first positive reply from a sibling cache is received, and will retrieve the object from the fastest-replying cache server. Since the earliest returning reply packet is usually on the fastest link (and from the least loaded sibling server), your server gets the page fast. Squid will always get the page from the fastest-responding cache - be it a parent or a sibling.

The cache_peer option allows you to specify proxy servers that your server is to communicate with. The first line of the following example configures Squid to query the cache machine cache.myparent.example as a parent. Squid will communicate with the parent on HTTP port 3128, and will use ICP to query the server using port 3130. Configuring Squid to query more than one server is easy: simply add another cache_peer line. The second line configures cache.sibling.example as a sibling, listening for HTTP requests on port 8080 and ICP queries on port 3130.

cache_peer cache.myparent.example parent 3128 3130
cache_peer cache.sibling.example sibling 8080 3130

If you do not wish to query any other caches, simply leave all cache_peer lines commented out: the default is to talk directly to origin servers.

Cache peering and hierarchy interactions are discussed in quite some detail in this book. In some cases hierarchy setups are the most difficult part of your cache setup process (especially in a distributed environment like a nationwide ISP). In-depth discussion of hierarchies is beyond the scope of this chapter, so much more information is given in chapter 8. There are cases, however, where you need at least one hierarchy line to get Squid to work at all. This section covers the basics, just for those setups. You only need to read this material if one of the following scenarios applies to you:

You have to use your Internet Service Provider's cache.
You have a firewall.

Your ISP's cache

If you have to use your Internet Service Provider's cache, you will have to configure Squid to query that machine as a parent. Configuring their cache as a sibling would probably return error pages for every URL that they do not already have in their cache.

Squid will attempt to contact parent caches with ICP for each request; this is essentially a ping. If there is no response to this request, Squid will attempt to go direct to the origin server. Since (in this case, at least) you cannot bypass your ISP's cache, you may want to reduce the latency added by this extra query. To do this, place the default and no-query keywords at the end of your cache_peer line:

cache_peer cache.myisp.example parent 3128 3130 default no-query

The default option essentially tells Squid "Go through this cache for all requests. If it's down, return an error message to the client: you cannot go direct".


The no-query option gets Squid to ignore the given ICP port (leaving the port number out would cause an error), and never to attempt to query the cache with ICP.

Firewall Interactions

Firewalls can make cache configuration hairy. Inter-cache protocols generally use packets which firewalls inherently distrust. Most caches (Squid included) use ICP, which is a layer on top of UDP. UDP is difficult to make secure, and firewall administrators generally disable it if at all possible.

It's suggested that you place your cache server on your DMZ (if you have one). There are a few advantages to this:

Your cache server is kept secure.
The firewall can be configured to hand off requests to the cache server, assuming it is capable.
You will be able to peer with other, outside, caches (like your ISP's), since DMZ networks generally have less rigid rule sets.

The remainder of this section should help you get Squid and your firewall to co-operate.

A few cases are covered for each type of firewall: the cache inside the firewall; the cache outside the firewall; and, finally, on the DMZ.

Proxying Firewalls

The vast majority of firewalls know nothing about ICP. If, on the other hand, your firewall does not support HTTP, it's a good time to have a serious talk to the buyer who had an all-expenses-paid weekend courtesy of the firewall supplier. Configuring the firewall to understand ICP is likely to be painful, but HTTP should be easy.

If you are using a proxy-level firewall, your client machines are probably configured to use the firewall's internal IP address as their proxy server. Your firewall could also be running in transparent mode, where it automatically picks up outgoing web requests. If you have a fair number of client machines, you may not relish the idea of reconfiguring all of them. If you fall into this category, you may wish to put Squid on the outside (or on the DMZ) and configure the firewall to pass requests to the cache, rather than reconfiguring all client machines.

Inside

The cache is considered a trusted host, and is protected by the firewall. You will configure client machines to use the cache server in their browser proxy settings, and when a request is made, the cache server will pass the outgoing request to the firewall, treating the firewall as a parent proxy server. The firewall will then connect to the destination server. If you have a large number of clients configured to use the firewall as their proxy server, you could get the firewall to hand off incoming HTTP requests back into the network, to the cache server. This is less efficient though, since the cache will then have to re-pass these requests through the firewall to get to the outside, using the parent option to cache_peer. Since the latter involves traffic passing through the firewall twice, your load is very likely to increase. You should also beware of loops, with the cache server parenting to the firewall and the firewall handing off the cache's request back to the cache!

As described in chapter 1, Squid will also send ICP queries to parents. Firewalls don't care for UDP packets, and normally log (and then discard) such packets.


When Squid does not receive a response from a configured parent, it will mark the parent as down, and proceed to go directly.

Whenever Squid is set up to use a parent that does not support ICP, the cache_peer line should include the default and no-query options. These options stop Squid from attempting to go direct when all caches are considered down, and specify that Squid is not to send ICP requests to that parent. Here is an example config entry:

cache_peer inside.fw.address.domain parent 3128 3130 default no-query

Outside

There are only two major reasons for you to put your cache outside the firewall:

One: Although Squid can be configured to do authentication, this can lead to the duplication of effort (you will encounter the "add new staff to 500 servers" syndrome). If you want to continue to authenticate users on the firewall, you will have to put your cache on the outside or on the DMZ. The firewall will thus accept requests from clients, authenticate them, and then pass them on to the cache server.

Two: Communicating with cache hierarchies is easy. The cache server can communicate with other systems using any protocol. Sibling caches, for example, are difficult to contact through a proxying firewall.

You can only place your cache outside if your firewall supports hand-offs. Browsers inside will connect to the firewall and request a URL, and the firewall will connect to the outside cache and request the page.

If you place your cache outside your firewall, you may find that your client PCs have problems connecting to internal web servers (your intranet, for example, may be unreachable). The problem is that the cache is unable to connect back through to your internal network (which is actually a good thing: don't change that). The best thing to do here is to add exclusions to your browser settings: this is described in Chapter 5 - you should specifically have a look at the section on browser autoconfig. In the meantime, let's just get Squid going, and we will configure browsers once you have a cache to talk to.

Since the cache is not protected by the firewall, it must be very carefully configured: it must only accept requests from the firewall, and must not run any strange services. If possible, you should disable telnet, and use something like SSH (Secure SHell) instead. The access control lists (which you will set up shortly) must only allow requests from the firewall, otherwise people will be able to relay their requests through your cache, using your bandwidth.

If you place the cache outside the firewall, your client PCs will be configured to use the firewall as their proxy server (this is probably the case already). The firewall must be configured to hand-off client HTTP requests to the cache server. The cache must be configured to only allow HTTP requests when from the firewall's outside IP address. If not configured this way, other Internet users could use your cache server as a relay, using your bandwidth and hardware resources for illegitimate (and possibly illegal) purposes.

With your cache server on the outside network, you should treat the machine as a completely untrusted host, lest a cracker find a hole somewhere on the system. It is recommended that you place the cache server on a dedicated firewall network card, or on a switched ethernet port. This way, if your cache server were to be cracked, the cracker would only be able to read passing HTTP data. Since the majority of sensitive information is sent via email, this would reduce the potential for sensitive data loss.


Since your cache server only accepts requests from the firewall, there is no cache_peer line needed in the squid.conf. If you have to talk to your ISP's cache you will, of course, need one: see the section on this a bit further back.

DMZ

The best place for a cache is your DMZ.

If you are concerned with the security of your cache server, and want to be able to communicate with outside cache servers (using ICP), you may want to put your cache on the DMZ.

With Squid in your DMZ, internal client PCs are setup to proxy to the firewall. The firewall is then responsible for handing-off these HTTP requests to the cache server (so the firewall in fact treats the cache server as a parent).

Since your cache server is (essentially) on the outside of the firewall, the cache doesn't need to treat the firewall as a parent or sibling: it only accepts requests from the firewall; it never passes requests to the firewall.

If your cache is outside your firewall, you will need to configure your client PCs not to use the firewall as a proxy server for internal hosts. This is quite easy, and is discussed in the chapter on browser configuration.

Since the firewall is acting as a filter between your cache and the outside world, you are going to have to open up some ports on the firewall. The cache will need to be able to connect to port 80 on any machine on the outside world. Since some valid web servers will run on ports other than 80, you should consider allowing connections to any port from the cache server. In short, allow connections to:

Port 80 (for normal HTTP requests)
Port 443 (for HTTPS requests)
Ports higher than 1024 (site search engines often use high-numbered ports)

If you are going to communicate with a cache server outside the firewall, you will need even more ports opened. If you are going to communicate with ICP, you will need to allow UDP traffic from and to your cache machine on port 3130. You may find that the cache server that you are peering with uses different ports for reply packets. It's probably a bad idea to open all UDP traffic, though.
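As a hedged sketch (using Linux iptables, which post-dates this text; the addresses are purely illustrative, with the cache at 192.0.2.10 and the peer at 203.0.113.5):

# allow ICP queries and replies between our cache and the named peer only
iptables -A INPUT  -p udp -s 203.0.113.5 -d 192.0.2.10 --dport 3130 -j ACCEPT
iptables -A OUTPUT -p udp -s 192.0.2.10 -d 203.0.113.5 --dport 3130 -j ACCEPT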

Packet Filtering Firewalls

Squid will normally live on the inside of your packet-filtering firewall. If you have a DMZ, it may be best to put your cache on this network, as you may want to allow UDP traffic to and from the cache server (to communicate with other caches).

To configure your firewall correctly, you should make the minimum number of holes in your filter set. In the remainder of this section we assume that your internal machines can connect to the cache server unimpeded. If your cache is on the DMZ (or outside the firewall altogether) you will need to allow TCP connections from your internal network (on a random source port) to the HTTP port that Squid will be accepting requests on (this is the port that you set a bit earlier, in the Setting Squid's HTTP Port section of this chapter).

First, let's consider the firewall setup when you do not query any outside caches. On accepting a request, Squid will attempt to connect to a machine on the Internet at large. Almost always, the destination port will be the default HTTP port, port 80. A few percent of the time, however, the request will be destined for a high-numbered port (any port number higher than 1023 is a high-numbered port). Squid always sources TCP requests from a high-numbered port, so you will thus need to allow TCP requests (all HTTP is TCP-based) from a random high-numbered port to both port 80 and any high-numbered port.

There is another low-numbered port that you will probably need to open. The HTTPS port (used for secure Internet transactions) is normally listening on TCP port 443, so this should also be opened.

In the second situation, let's look at cache-peering. If you are planning to interact with other caches, you will need to open a few more ports. First, let's look at ICP. As mentioned previously, ICP is UDP-based. Almost all ICP-compliant caches listen for ICP requests on UDP port 3130. Squid will always source requests from port 3130 too, though other ICP-compliant caches may source their requests from a different port.

It's probably not a good idea to allow these UDP packets no matter what source address they come from. Your filter should specify the IP addresses for each of the caches that you wish to peer with, rather than allowing UDP packets from any source address. That should be it: you should now be able to save the config file, and get ready to start the Squid program.

Starting Squid

Running Squid

Squid should now be configured, and the directories should have the correct permissions. We should now be able to start Squid, and you can try to access the cache with a web browser. Squid is normally run by starting the RunCache script. RunCache (as mentioned earlier) restarts Squid if it dies for some reason, but at this stage we are merely testing that it will run properly: we can add it to startup scripts at a later stage.

Programs which handle network requests (such as inetd and sendmail) normally run in the background. They are run at startup, and log any messages to a file (instead of printing them to a screen or terminal, as most user-level programs do). These programs are often referred to as daemon programs. Squid is such a program: when you run the squid binary, you should be immediately returned to the command line. While it looks as if the program ran and did nothing, it's actually sitting in the background waiting for incoming requests. We want to be able to see that Squid's actually doing something useful, so we increase the debug level (using -d 1) and tell it not to disappear into the background (using -N). If your machine is not connected to the Internet (you are doing a trial Squid install on your home machine, for example) you should use the -D flag too, since Squid tries to do DNS lookups for a few common domains, and dies with an error if it is not able to resolve them. The following output is that printed by a default install of Squid:

cache1:~ # /usr/local/squid/sbin/squid -N -d 1 -D

Squid reads the config file, and changes user IDs here:

1999/06/12 19:16:20| Starting Squid Cache version 2.2.DEVEL3 for i586-pc-linux-gnu...


1999/06/12 19:16:20| Process ID 4121

Each concurrent incoming request uses at least one filedescriptor. 256 filedescriptors is only enough for a small, lightly loaded cache server; see Chapter 12 for more details. Most of the following is diagnostic:

1999/06/12 19:16:20| With 256 file descriptors available
1999/06/12 19:16:20| helperOpenServers: Starting 5 'dnsserver' processes
1999/06/12 19:16:20| Unlinkd pipe opened on FD 13
1999/06/12 19:16:20| Swap maxSize 10240 KB, estimated 787 objects
1999/06/12 19:16:20| Target number of buckets: 15
1999/06/12 19:16:20| Using 8192 Store buckets, replacement runs every 10 seconds
1999/06/12 19:16:20| Max Mem size: 8192 KB
1999/06/12 19:16:20| Max Swap size: 10240 KB
1999/06/12 19:16:20| Rebuilding storage in Cache Dir #0 (DIRTY)

When you connect to an ftp server without a cache, your browser chooses icons to match the files based on their filenames. When you connect through a cache server, it assumes that the page returned will be in html form, and will include tags to load any images so that the directory listing looks normal. Squid adds these tags, and has a collection of icons that it refers clients to. These icons are stored in /usr/local/squid/etc/icons/. If Squid has permission problems here, you need to make sure that these files are owned by the appropriate users (in the previous section we set permissions on the files in this directory).

1999/06/12 19:16:20| Loaded Icons.

The next few lines are the most important. Once you see the Ready to serve requests line, you should be able to start using the cache server. The HTTP port is where Squid is waiting for browser connections, and should be the same as whatever we set it to in the previous chapter. The ICP port should be 3130, the default, and if you have included other protocols (such as HTCP) you should see them here. If you see permission denied errors here, it's possible that you are trying to bind to a low-numbered port (like 80) as a normal user. Try running the startup command as root, or (if you don't have root access on the machine) choose a high-numbered port. Another common error message at this stage is Address already in use. This occurs when another process is already listening on the given port. This could be because Squid is already started (perhaps you are upgrading from an older version which is being restarted by the RunCache script) or you have some other process listening on the same port (such as a web server).

1999/06/12 19:16:20| Accepting HTTP connections on port 3128, FD 35.
1999/06/12 19:16:20| Accepting ICP messages on port 3130, FD 36.
1999/06/12 19:16:20| Accepting HTCP messages on port 4827, FD 37.
1999/06/12 19:16:20| Ready to serve requests.

Once Squid is up and running, it reads the cache store. Since we are starting Squid for the first time, you should see only zeros for all the numbers below:

1999/06/12 19:16:20| storeRebuildFromDirectory: DIR #0 done!
1999/06/12 19:16:25| Finished rebuilding storage disk.
1999/06/12 19:16:25| 0 Entries read from previous logfile.
1999/06/12 19:16:25| 0 Entries scanned from swap files.
1999/06/12 19:16:25| 0 Invalid entries.
1999/06/12 19:16:25| 0 With invalid flags.
1999/06/12 19:16:25| 0 Objects loaded.
1999/06/12 19:16:25| 0 Objects expired.
1999/06/12 19:16:25| 0 Objects cancelled.


1999/06/12 19:16:25| 0 Duplicate URLs purged.
1999/06/12 19:16:25| 0 Swapfile clashes avoided.
1999/06/12 19:16:25| Took 5 seconds ( 0.0 objects/sec).
1999/06/12 19:16:25| Beginning Validation Procedure
1999/06/12 19:16:26| storeLateRelease: released 0 objects
1999/06/12 19:16:27| Completed Validation Procedure
1999/06/12 19:16:27| Validated 0 Entries
1999/06/12 19:16:27| store_swap_size = 21k

Testing Squid

If all has gone well, we can begin to test the cache. True browser access is only covered in the next chapter, and there is a whole chapter devoted to configuring your browser. Until then, testing is done with the client program, which is included with the Squid source, and is in the /usr/local/squid/bin directory.

The client program connects to a cache and requests a page, and prints out useful timing information. Since client is available on all systems that Squid runs on, and has the same interface on all of them, we use it for the initial testing.

At this stage Squid should be in the foreground, logging everything to your terminal. Since client is a unix program, you need access to a command prompt to run it. At this stage it's probably easiest to simply start another session (this way you can see if errors are printed in the main window).

The client program is compiled to connect to localhost on port 3128 (you can override these defaults from the command line, see the output of client -h for more details.)

If you are running client on the cache server, and are using port 3128 for incoming requests, you should be able to type a command like this, and the client program will retrieve the page through the cache server:

client http://squid.nlanr.net/

If your cache is running on a different machine you will have to use the -h and -p options. The following command will connect to the machine cache.qualica.com on port 8080 and retrieve the above web page.
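A sketch of such a command:

client -h cache.qualica.com -p 8080 http://squid.nlanr.net/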

The client program can also be used to access web sites directly. As you may remember from reading Chapter 2, the protocol that clients use to access pages through a cache is part of the HTTP specification. The client program can be used to send both "normal" and "cache" HTTP requests. To check that your cache machine can actually connect to the outside world, it's a good idea to test access to an outside web server.

The next example will retrieve the page at http://www.qualica.com/, and send the html contents of the page to your terminal.

If you have a firewall between you and the internet, the request may not work, since the firewall may require authentication (or, if it's a proxy-level firewall and is not doing transparent proxying of the data, you may explicitly have to tell client to connect to the machine.) To test requests through the firewall, look at the next section.

A note about the syntax of the next request: you are telling client to connect directly to the remote site, and request the page /. With a request through a cache server, you connect to the cache (as you would expect) and request a whole url instead of just the path to a file. In essence, both normal-HTTP and cache-HTTP requests are identical; one just happens to refer to a whole URL, the other to a file.
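So a direct fetch of the page mentioned above would look something like this (a sketch; -p 80 overrides client's default port of 3128):

client -h www.qualica.com -p 80 /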


Client can also print out timing information for the download of a page. In this mode, the contents of the page aren't printed: only the timing information is. The zero in the below example indicates that client is to retrieve the page repeatedly until interrupted (with Control-C or Break). If you want to retrieve the page a limited number of times, simply replace the zero with a number.
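A sketch of such a command, assuming the ping-mode option is -g as in later squidclient versions:

client -g 0 http://www.qualica.com/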

Testing a Cache or Proxy Server with Client

Now that you have client working, you can fetch the same page both directly and through the cache, and compare the results.

If the request through the cache returned the same page as you retrieved with direct access (you didn't receive an error message from Squid), Squid should be up and running. Congratulations! If things aren't going so well for you, you will have received an error message here. Normally, this is because of the acls described in the previous chapter. First, you should have a look at the terminal where you are running Squid (or, if you are skipping ahead and have put Squid in the background, at the /usr/local/squid/logs/cache.log file). If Squid encountered some sort of problem, there should be an error or warning in this file. If there are no messages here, you should look at the /usr/local/squid/logs/access.log file next. We haven't covered the details of this file yet, but they are covered in the next section of this chapter. First, though, let's see if your cache can process requests to internal servers. There are many cases where a request will work to internal servers but not to external machines.

Testing Intranet Access

If you have a proxy-based firewall, Squid should be configured to pass outgoing requests to the proxy running on the firewall. This quite often presents a problem when an internal client is attempting to connect to an internal (Intranet) server, as discussed in section 2.2.5.2. To ensure that the acl-operator lists created in section 2.2.5.2 are working, you should use client to attempt to connect to a machine on the local network through the cache.

If you didn't get an error message from a command like the above, access to local servers should be working. It is possible, however, that the connection is being passed from the local cache to the parent (across a serial line), and that the parent is connecting back into the local network, slowing the connection enormously. The only way to ensure that the connection is not passing through your parent is to check the access logs, and see which server the connection is being passed to.

Access.log basics

The access.log file logs all incoming requests. Chapter 11 covers the fields in the access.log in detail. The most important fields are the URL (field 7), and hierarchy access type (field 9) fields. Note that a "-" indicates that there is no data for that field.

The following example access.log entries indicate the changes in log output when connecting to another server, without a cache, with a single parent, and with multiple parents.

Though fields are separated by spaces, fields can contain sub-fields, where a "/" indicates the split.

When connecting directly to a destination server, field 9 contains two subfields - the key word "DIRECT", followed by the name of the server that it is connecting to. Access to local servers (on your network) should always be DIRECT, even if you have a firewall, as discussed in section 3.1.2. The acl operator always_direct controls this behaviour.

When you have configured only one parent cache, the hierarchy access type indicates this, and includes the name of that cache.


There are many more types that can appear in the hierarchy access information field, but these are covered in chapter 11.

Another useful field is the 'Log Tag' field, field four. In the following example this is the field "TCP_MISS/200".

A MISS indicates that the request was not already stored in the cache (or that the page contained headers indicating that the page was not to be cached). A HIT indicates that the page was already stored in the cache. In the latter case the request time for a remote page should be substantially less than for its first occurrence in the logs.

The time that Squid took to service the request is the second field. This value is in milliseconds. It should be close to the time reported by client for the same request, but given operating system buffering there is likely to be a discrepancy.

The fifth field is the size of the page returned to the client. Note that an aborted request can end up downloading more than this from the origin server if the quick_abort feature is turned on in the Squid config file. Here is an example request fetched direct from the origin server:
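A representative entry would look something like this (the timestamp, times and sizes are illustrative):

929011273.132   1783 192.168.1.10 TCP_MISS/200 2246 GET http://www.qualica.com/ - DIRECT/www.qualica.com text/html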

If we use client to fetch the page a short time later, a HIT is returned, and the time is reduced hugely.
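Again illustrative:

929011305.314     24 192.168.1.10 TCP_HIT/200 2277 GET http://www.qualica.com/ - NONE/- text/html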

Some of you will have noticed that the size of the hit has increased slightly. If you have checked the size of a request from the origin server and compared it to that of the same page through the cache, you will also note that the size of the returned data has increased very slightly. Extra headers are added to pages passing through the cache, indicating which peer the page was returned from (if applicable), age information and other information. Clients never see this information, but it can be useful for debugging.

Since Squid 1.2 has support for HTTP/1.1, extra features can be used by clients accessing a copy of a page that Squid already has. Certain extra headers are included in the HTTP headers returned with HITs, indicating support for features which are not available to clients when returning MISSes. In the above example Squid has included a header in the page indicating that range requests are supported.

If Squid is performing correctly, you should shut Squid down and add it to your startup files.

Since Squid maintains an in-memory index of all objects in the cache, a kill -9 could cause corruption, and should never be used. The correct way to shut Squid down is to use the command:
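squid -k shutdown

(In our installation the full path is /usr/local/squid/sbin/squid; run it as the same user that started Squid.)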

Squid command-line options are covered in chapter 10.

Addition to Startup Files

The location of startup files varies from system to system. The location and naming scheme of these files is beyond the scope of this book.

If you already have a local startup file, it's a pretty good idea to simply add the RunCache program to that file. Note that you should place RunCache in the background on startup, which is normally done by placing an '&' after the command.

The RunCache program attempts to restart Squid if it dies for some reason, and logs basic Squid debug output both to the file /usr/local/squid/squid.out and to syslog.
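A typical startup-file line would thus look something like this (the path assumes the install prefix used throughout this book):

/usr/local/squid/bin/RunCache &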


Windows

In NT-based Windows systems, Squid NT can be installed as a native service. Simply unzip in the root of C: and run c:\squid\sbin\squid -i. Rename and edit the files in c:\squid\etc, then run net start squid or start Squid via services.msc. Also, make sure to create c:\squid\var\cache and run squid -z to create the swap directories (or you might spend a long time trying to figure out the cryptic "abnormal program termination" message like I did! :) )

Browser Configuration

Browsers

Squid is the server half of a client-server relationship. Though you have configured Squid, your client (the browser) is still configured to talk to the menagerie of servers that make up the Internet.

You have already used the client program included with Squid to test that the cache is working. Browsers are more complicated to configure than client, especially since there are so many different types of browsers.

This chapter covers the three most common browsers. It also includes information on the proxy configuration of Unix tools, since you may wish to use these for automatic download of pages. Once your browser is configured, some of the proxy-oriented features of browsers are covered. Many browsers allow you to force your cache server to reload the page, and have other proxy-specific features.

So that you can skip sections in this chapter that you don't need to read, browsers are configured in the following order: Netscape Communicator, Microsoft Internet Explorer, Opera and finally Unix Clients.

You can configure most browsers in more than one way. The first method is the simplest for a sysadmin, the second is simplest for the user. Since this book is written for system administrators, we term the first basic configuration, the second advanced configuration.

Basic Configuration

In this mode, each browser is configured independently of the others. If you need to change something about the server (the port that it accepts requests on, for example), each browser will have to be reconfigured manually: you will have to physically walk to it and change the setup. To avoid caching intranet sites, you will have to add exclusions for each intranet site.

Advanced Configuration

In this mode, you will configure a so-called rule server. Clients connect to this server on startup, and download information on which proxy server to talk to, and which URLs to retrieve from which proxy server. Exclusion of intranet sites is handled centrally: one change will update all clients. If your organization is large, or is growing, you should use the auto-config option.


Though this method is called auto-config, it's not completely automatic, since the user still has to enter a URL indicating the location of the list of rules. Advanced configuration has some advantages:

Changes to the proxy server are easy, since you only change the rule server.
A proxy server can be chosen based on destination machine name, destination port and more. Since this list is maintained centrally, changes also only have to be made once.
Browser configuration is easy: instead of adding complicated lists of IPs, a user simply has to type in a URL.
Since it's easy to configure, users are more likely to use the cache.

When you write your list of rules (also called a proxy auto-config script), you will still need to supply the client with the same information as with the basic configuration; it's just that the list of this information is maintained centrally. Even if you decide to use only autoconfig on your network, you should probably work through the basic configuration first.

Basic Configuration

To configure any browser, you need at least two pieces of information:

The proxy server's host name
The port that the proxy server is accepting requests on

Host name

It's very important to use a proxy-specific host name. If you decide to move the cache to another machine at a later stage you will find that it's much easier to change DNS settings than to change the configuration of every browser on your network.

If your operating system supports IP aliases you should organize a dedicated IP address for the cache server, and use the tcp_incoming_address and tcp_outgoing_address squid.conf options to make Squid only accept incoming HTTP requests on that IP address.

There isn't really a naming convention for caches, but people generally use one of the following: cache, proxy, www-proxy, www-cache, or even the name of the product they are using; squid, netapp, netscape. Some people also include the location of the cache, and configure people in a region to talk to their local cache. More and more people are simply using cache, and it's the suggested name. If you wish to use regional names, you can use something along the lines of region.cache.domain.example.

Your choice of port has already been discussed. Have a look at HTTP:port in the index for more information.

Windows clients

See http://www.squid-cache.org/Doc/FAQ/FAQ-5.html for more information on configuring the different Windows browsers.

Unix clients

Most Unix client programs use a single environment variable to decide how they are to access the Internet. If you run lynx (a terminal-based browser) on any of your machines, or use the recursive web-spider wget, you can simply set a shell variable and these programs will use the correct proxy server.


Each protocol has a different environment variable, so that you can configure your client to use a different proxy for each protocol. Each variable simply has the text _proxy tagged onto the end of the protocol name, so some of the most common protocols end up as follows:

http_proxy
ftp_proxy
gopher_proxy

Since many people prefer a shell other than bash, we make an exception to our rule that "all examples are based on sh" here.
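As a sketch (the proxy name is illustrative), for sh-compatible shells:

http_proxy=http://cache.mydomain.example:3128/
export http_proxy

and for csh-style shells:

setenv http_proxy http://cache.mydomain.example:3128/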

Browser-cache Interaction

The Internet is a transient place, and at some stage a server that does not correctly handle caching will be found. You can easily add the server to the appropriate do not cache lists, but most browsers give users a way of forcing a web page reload. In most browsers, right-clicking on a page or graphic brings up a menu where you can select reload, which forces a re-fetch of the page. If you right-click in a frame, you can reload only the frame.

Testing the Cache

As you can see, pressing reload in Netscape (and some other browsers) doesn't simply re-fetch the page: it forces the cache not to serve the cached page. Many people doing tests of how the cache increases performance simply press reload, and believe that there has been no change in speed. The cache is, in fact, re-downloading the page from the origin server, so a speed increase is impossible.

To test the cache properly you need two machines set up to access the cache, and a page that does not contain do-not-cache headers. Pages that use ASP often include headers that force Squid not to cache the page, even if the authors are not aware of the implications.

So, to test the cache, choose a site that is off your local network (for a marked change, choose one in a different country) and access it from the first machine. Once it has downloaded, change to the second machine and re-download the page. Once the page has downloaded there, check that the request is marked as a 'HIT' (in the file called access.log, the basics of which are covered earlier in this book). If the second accesses were marked as misses, it is probably because the origin server is asking Squid not to cache the page. Try a different page and see what difference the cache makes to browsing speed.

Many people are looking for an increase in performance on problem pages, since this is when people believe that they are getting the short end of the stick. If you choose a site that is too close, you may only be able to see a difference in the speed in the transaction-time field of the access.log.

Since you have a completely unloaded cache, you should access a local, unloaded web server a few times, and see what kind of latency you experience with the cache. If you have time, print out some of the access log entries. If, some time in the future, you are unsure as to the cache load, you can compare the latency added now to the latency added by the same cache later on; if there is a difference you know it's time to upgrade the cache.

Cache Auto-config


Client browsers can have all options configured manually, or they can be configured to download an autoconfig file (every time they start up), which provides all of the information about your cache setup.

Each URL referenced (be it the URL that you typed, or the URL for a graphic on the page yet to be retrieved) is checked against the list of rules. You should keep the list of rules as short as possible, otherwise you could end up slowing down page loads - not at the cache level, but at the browser.

Web server config changes for autoconfig files

The original Netscape documentation suggested the filename proxy.pac for Proxy AutoConfig files. Since it's possible to have a file ending in .pac that is not used for autoconfiguration, browsers require a server returning an autoconfig file to indicate so in the mime type. Most web servers do not automatically recognize the .pac extension as a proxy-autoconfig file, and have to be reconfigured to return the correct mime type (application/x-ns-proxy-autoconfig).

Apache

On some systems Apache already defines the autoconfig mime type. The Apache config file mime.types is used to associate filename extensions with mime types. This file is normally stored in the Apache conf directory. This directory also contains the access.conf and httpd.conf files, which you may be more familiar with editing. As you can probably see, the mime.types file consists of two fields: a mime type on the left, the associated filename extension on the right. Since this file is only read at startup or reconfigure, you will need to send a HUP signal to the parent Apache process for your changes to take effect. The following line should be added to the file, assuming that it is not already included:

application/x-ns-proxy-autoconfig pac
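Sending the signal looks something like this (the pid-file path depends on your Apache installation):

kill -HUP `cat /usr/local/apache/logs/httpd.pid`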

Internet Information Server
FIX ME

Netscape
FIXME

Autoconfig Script Coding
The autoconfig file is actually a JavaScript function, put in a file and served by your standard web server program. Don't panic if you don't know JavaScript, since this section acts as a cookbook. Besides: the basic structure of the JavaScript language is quite easy to get the hang of, especially if you have previous programming experience, whether it be in C, Pascal or Perl.

The Hello World! of auto-configuration scripts
If you have learned a programming language, you probably remember one of the most basic programs simply printing the phrase Hello World!. We don't want to print anything when someone tries to go to a page, but the following example is similar to the original Hello World program in that it's the shortest piece of code that does something useful.

The following simply connects directly to the origin server for every URL, just as it would if you had no proxy-cache configured at all.
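A minimal script of this sort might look like the following (a sketch; FindProxyForURL is the standard entry point the browser calls for every URL):
function FindProxyForURL(url, host) {
    // Bypass any proxy and fetch every URL straight from the origin server.
    return "DIRECT";
}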


The next example gets the browser to connect to the cache server named cache.domain.example on port 3128. If the machine is down for some reason, an error message will be returned to the user.
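Such a script might look like this (a sketch; the cache hostname and port come from the surrounding text):
function FindProxyForURL(url, host) {
    // Send every request to the cache; if it is down, the user sees an error.
    return "PROXY cache.domain.example:3128";
}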

As you may be able to guess from the above, separating returned values with a semicolon (;) splits the answer into sub-strings. If the first cache server is unavailable, the second will be tried. This provides you with a failover mechanism: you can attempt a local proxy server first and, if it is down, try another proxy. If all are down, a direct attempt will be made. After a short period of time, the proxy will be retried. A third return type is included, for SOCKS proxies, and is in the same format as the HTTP type:
return "SOCKS socks.domain.example:3128";
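Put together, a failover return value might look like this (a sketch; the hostnames are placeholders):
function FindProxyForURL(url, host) {
    // Try the local cache first, then a second cache, then go direct.
    return "PROXY cache1.domain.example:3128; " +
           "PROXY cache2.domain.example:3128; " +
           "DIRECT";
}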

If you have no intranet, and require no exclusions, you should use the above autoconfig file. Configuring machines with the above autoconfig file allows you to add any future required exclusions very easily.

Auto-config functions

Web browsers include various built-in functions to make your autoconfig coding as simple as possible. You don't have to write the code that does a string match of the hostname, since you can use a standard function call to do the match. Not all functions are covered here, since some of them are very rarely used. You can find a complete list of autoconfig functions (with examples) at the [2].

dnsDomainIs
Checks if a host is in a domain: returns true if the first argument (normally specified as the variable host, which is defined in the autoconfig function by default) is in the domain specified in the second argument.

You can check more than one domain by using the || JavaScript operator. Since this is a JavaScript operator you can use the layout described in this example in any combination.
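For instance (a sketch; the domain names are placeholders):
function FindProxyForURL(url, host) {
    // Local domains are fetched directly; anything else goes via the cache.
    if (dnsDomainIs(host, ".mydomain.example") ||
        dnsDomainIs(host, ".sister-company.example"))
        return "DIRECT";
    return "PROXY cache.mydomain.example:3128";
}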

isInNet
Sometimes you will wish to check if a host is in your local IP address range. To do this, the browser resolves the name to find the IP address. Do not use more than one isInNet call if you can help it: each call causes the browser to resolve the hostname all over again, which takes time. A string of these calls can reduce browser performance noticeably. The isInNet function takes three arguments: the hostname, and a subnet/netmask pair.

isPlainHostname
Simply checks that there is no full-stop in the hostname (the only argument for this call). Many people refer to local machines simply by hostname, since the resolver library will automatically attempt to look up host.domain.example if you simply attempt to connect to host. For example: typing www in your browser should bring up your web site.

Many people connect to internal web servers (such as one sitting on their co-worker's desk) by typing in the hostname of the machine. These connections should not pass through the cache server, so many people use a function like the following:
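Such a function might look like this (a sketch; the cache hostname is a placeholder):
function FindProxyForURL(url, host) {
    // A name with no dot (like "intranet") is a local machine: skip the cache.
    if (isPlainHostname(host))
        return "DIRECT";
    return "PROXY cache.domain.example:3128";
}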

myIpAddress
Returns the IP address of the machine that the browser is running on; requires no arguments.

On a network with more than one cache, your script can use this information to decide which cache to communicate with. In the next subsection we look at different ways of communicating with a local proxy (with minimal manual user intervention), so the example here is comparatively basic. The below example assumes that you have at least two networks: one with a private address range (10.0.0.*), the others with real IP addresses.

If the client machine is in the private address range, it cannot connect directly to the destination server, so if the cache is down for some reason they cannot access the Internet. A machine with a real IP address, on the other hand, should attempt to connect directly to the origin server if the cache is down.

Since myIpAddress requires no arguments, we can simply place it where we would have put host in the isInNet function call.
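A sketch of this logic (the addresses and cache hostname are placeholders):
function FindProxyForURL(url, host) {
    // Private-range machines cannot reach the Internet directly,
    // so there is no DIRECT fallback for them.
    if (isInNet(myIpAddress(), "10.0.0.0", "255.255.255.0"))
        return "PROXY cache.domain.example:3128";
    // Machines with real IP addresses may go direct if the cache is down.
    return "PROXY cache.domain.example:3128; DIRECT";
}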

shExpMatch
The shExpMatch function accepts two arguments: a string and a shell expression. Shell expressions are similar to regular expressions, though they are more limited. This function is often used to check if the url or host variables have a specific word in them.

If you are configuring an ISP-wide script, this function can be quite useful. Since you do not know if a customer will call their machine "intranet" or "intra" or "admin", you can chain many shExpMatch checks together. Note that the below example uses a single "intra*" shell expression to match both "intranet" and "intra.mydomain.example".
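A sketch (the patterns and cache hostname are placeholders):
function FindProxyForURL(url, host) {
    // "intra*" matches both "intranet" and "intra.mydomain.example".
    if (shExpMatch(host, "intra*") || shExpMatch(host, "admin*"))
        return "DIRECT";
    return "PROXY cache.myisp.example:3128; DIRECT";
}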

url.substring
This function doesn't take the same form as those described above. Since Squid does not support all possible protocols, you need a way of comparing the first few characters of the destination URL with the list of possible protocols. The function has two arguments: the first is a starting position, the second the position just past the last character you want (so url.substring(0, 5) returns the first five characters). Note that (as in C) strings start at position 0, rather than at 1.

All of this is best demonstrated with an example. The following attempts to connect to the cache for the most common URL types (http, ftp and gopher), but attempts to go directly for protocols that Squid doesn't recognize.
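A sketch of such a script (the cache hostname is a placeholder):
function FindProxyForURL(url, host) {
    // Use the cache for protocols Squid understands, go direct otherwise.
    if (url.substring(0, 5) == "http:" ||
        url.substring(0, 4) == "ftp:" ||
        url.substring(0, 7) == "gopher:")
        return "PROXY cache.domain.example:3128; DIRECT";
    return "DIRECT";
}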

Example autoconfig files
The main reason that autoconfig files were invented was the sheer number of possible cache setups. It's difficult (or even impossible) to represent all of the possible combinations that an autoconfig file can provide you with.

There is no config file that will work for everyone, so a couple of config files are included here, one of which should suit your setup.

A Small Organization
A small organization is the easiest to create an autoconfig file for. Since you will have a moderately small number of IP addresses, you can use the isInNet function to discover if the destination host is local or not (a large organization, such as an ISP, would need a very long autoconfig file simply because they have many IP address ranges).

A Dialup ISP


Since dialup customers don't have intranet systems, a dialup ISP would have a very straightforward config file. If you wish your customers to connect directly to your web server (why waste the disk space of a cache when you have the origin server rack-mounted above it?), you should use the dnsDomainIs function:
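A sketch (the domain and cache hostname are placeholders):
function FindProxyForURL(url, host) {
    // Fetch the ISP's own site directly; everything else goes via the cache.
    if (dnsDomainIs(host, ".myisp.example"))
        return "DIRECT";
    return "PROXY cache.myisp.example:3128; DIRECT";
}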

Leased Line ISP
When you are providing a public service, you have no control over what your customers call their machines. You have to handle the generic names (like intranet) and hope that people name their machines according to the de-facto standards, as in the sketch below.
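A sketch (patterns and hostnames are placeholders):
function FindProxyForURL(url, host) {
    // Plain hostnames and common intranet names stay off the cache.
    if (isPlainHostname(host) ||
        shExpMatch(host, "intranet*") ||
        shExpMatch(host, "intra*"))
        return "DIRECT";
    return "PROXY cache.myisp.example:3128; DIRECT";
}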

Super Proxy Script
Many large ISPs will have more than one cache server. To avoid duplicating objects, these cache servers have to communicate with one another. Consider the following: cache1 gets a request for an object. It caches the page, and stores it on disk. An hour or so later, cache2 gets a request for the same page. To find a local copy of the object, cache2 has to query the other caches. Add more and more caches, and your number of queries goes up. If an incoming request for a specific URL only ever went to one cache, your caches would not need to communicate with one another: a client requesting the page http://www.qualica.com/ would always connect to cache1.

Let's assume that you have 5 caches. Splitting the Internet into five pieces would split the load across the caches almost evenly. How do you split it, though?

Some of you will know what a hash function is. If not, don't panic: you can still use the Super Proxy script without knowing the theoretical basis of the algorithms involved.

The Super Proxy Script allows you to split up the Internet by URL (the combination of hostname, path and filename). If you have 5 cache servers, you split up the domain of possible answers into 5 parts. (A hash function returns a number, so we are using the appropriate terms - a domain is not an Internet domain in this context). With a good hashing function, the numbers returned are going to be spread across the 5 parts evenly, which spreads your load perfectly.
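The idea can be sketched in a few lines of JavaScript (a toy hash for illustration only; the real Super Proxy Script uses a much better-distributed hash function, and the cache hostnames are placeholders):
function FindProxyForURL(url, host) {
    // Sum the character codes of the URL and use the result, modulo the
    // number of caches, to pick a server. Same URL, same cache, every time.
    var hash = 0;
    for (var i = 0; i < url.length; i++)
        hash = (hash + url.charCodeAt(i)) % 5;
    return "PROXY cache" + (hash + 1) + ".domain.example:3128; DIRECT";
}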

If you have a cache which is twice as powerful as your others, you can allocate it more of the domain, and put more load on it.

CARP (the Cache Array Routing Protocol) is used by some cache servers (most notably Microsoft Proxy and Squid) to decide which parent cache to send a request to. Browsers can also use CARP to decide which cache to talk to, using a JavaScript auto-config script. For more information (and an example script), you should look at the [3] web page.

cgi generated autoconfig files
It is possible to associate the .pac extension with a cgi program on most web servers. A program could then generate an autoconfig script depending on the source address of the request. Since the autoconfig file is only loaded on startup (or when the autoconfig refresh button is pressed), the slight delay due to a cgi program would not be noticeable to the user. Most large ISPs allocate subnets to regions, so a browser could be configured to access the nearest cache by looking at the source address of the request to the cgi program.

Future directions


There has recently been a move towards a standard for the automatic configuration of proxy-caches. New versions of Netscape and Internet Explorer are expected to use this new standard to automatically change their proxy settings, allowing you to manipulate your cache server settings without inconveniencing clients.

Roaming
Roaming customers have to remove their configured caches, since your access control lists should stop them accessing your cache from another network. Although the problem can be reduced by the cgi-generated configs (discussed above), a firewall between the browser and your cgi server would still mean that roaming users cannot access the Internet.

There are changes on the horizon that would help. As more and more protocols take roaming users into account, standards will evolve that make Internet usage plug-and-play. If you are in Tanzania today, plug in your modem and use the Internet. If you are in France in a week's time, plug in again and (without config changes) you will be ready to go.

Progress on standards for autoconfiguration of Internet applications is underway, which will allow administrators to specify config files depending on where a user connects from without something like the cgi kludge above.

Browsers
Browser support for CARP is not at the stage where it is tremendously useful: once there is a proper standard for setup, it's likely to be included in the main browsers. At some stage, expect support for ICP and cache-digests in browsers. The browser will then be able to make intelligent decisions as to which cache to talk to. Since ICP requests are efficient, a browser could send requests for each of the links on a page once it has retrieved the HTML source.

Transparency
Currently there is a major trend towards transparent caching, not only in the "Outer Internet" (where bandwidth is very expensive), but also in the USA. (Transparency is covered in detail in chapter 12.)

Transparency has one major advantage: Users do not have to configure their browsers to access the cache.

To backbone providers this means that they can cache all passing traffic. A local ISP would configure their clients to talk to their cache, and a backbone provider could then ask their ISP clients to use theirs as parents; but transparent caching has another advantage.

A backbone provider is acting as transit for requests that originate on other backbone provider's networks. With transparency, a backbone provider reduces this traffic as well as requests from their network to other backbone providers.

Assume you place a cache the hop before a major peering point. Here the cache intercepts both incoming requests (from other providers to web servers on your network) and outgoing ones (from your network to web servers on other providers' networks). This will reduce your peering-point usage (by caching outgoing requests for pages), and will also reduce the cost of carrying data out of your network on behalf of other people's customers. The latter cost may be minimal, but in times of network trouble it can reduce your latency noticeably.


As more and more backbone providers cache pages, more local ISPs will cache ("since it's cached further along the path, we may as well implement caching here - it's not going to change anything"). Though this will probably cause a drop in the hit rate of the backbone providers, their ever-increasing user-base may make up for it. Backbone providers are caching centrally: with large numbers of edge caches (local ISP caches), they are likely to see fewer hits. Certain Inter-University networks have already noticed such a hit rate decline. As more and more universities add local caches, their hit rate falls.

Since the Universities are large, it's likely that their users will surf the same web page twice. Previously the Inter-University network's cache would have returned the hit for that page; now the University's local cache does. This reduces the central cache's number of queries, and hence its hit rate.

Ready to Go
If all has gone well, you should be ready to use your cache, at least on a trial basis. People around your office or division can now be configured to use the cache, and once you are happy with its performance and stability, you can make it a proper service.

Access Control and Access Control Operators

Access control lists (acls) are often the most difficult part of the configuration of a Squid cache: the layout and concept is not immediately obvious to most people. Hang on to your hat!

Unless Chapter 4 is still fresh in your mind, you may wish to skip back and review the access control section of that chapter before you continue. This chapter assumes that you understand the difference between an acl and an acl-operator.

Uses of ACLs
The primary use of the acl system is to implement simple access control: to stop other people using your cache infrastructure. (There are other uses of acls, described later in this chapter; in the meantime we are going to discuss only the access control function of acls.) Most people implement only very basic access control, denying access to people that are not on their network. Squid's access system is incredibly flexible, but 99% of administrators only use the most basic elements. In this chapter some examples of the less common uses of acls are covered: hopefully you will discover some Squid feature which suits your organization - and which you didn't think was part of Squid before.

Access Classes and Operators
There are two elements to access control: classes and operators. Classes are defined with the acl squid.conf tag, while the names of the operators vary: the most common operator used is http_access.

Let's work through the below example line-by-line. Here, a systems administrator is in the process of installing a cache, and doesn't want other staff to access it while it's being installed, since it's likely to ping-pong up and down during the installation. Once the administrator is happy with the config, the whole network will be allowed access. The admin's PC is at the IP 10.0.0.3.
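Such a setup might look like this (a sketch reconstructed from the discussion that follows; note that the text later explains why you should not use it as-is):
acl myIP src 10.0.0.3
acl myNet src 10.0.0.0/255.255.0.0
http_access allow myIP
http_access deny myNet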

If the admin connects to the cache from the PC, Squid does the following:

Accepts the (HTTP) connection and reads the request
Checks the line that reads http_access allow myIP. Since the source IP address matches the IP defined in the myIP acl, access is allowed.

Remember that Squid drops out of the operator list on the first match. If you connect from a different PC (on the 10.0.*.* network) things are very similar:

Accepts the connection and reads the request
The source of the connection doesn't match the myIP acl, so the next http_access line is checked.
The myNet acl matches the source of the connection, so access is denied. An error page is returned to the user instead of the requested page.

If someone reaches your cache from another netblock (from, say, 192.168.*.*), the above access list will not block access. The reason for this is quite complicated. If Squid works through a set of acl-operators and finds no match, it defaults to using the opposite of the last match (if the previous operator is an allow, the default is to deny; if it's a deny, the default is to allow). This seems a bit strange at first, but let's look at an example where this behaviour is used: it's more sensible than it seems.

The following acl example is nice and simple: it's something a first-time cache admin could create.
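A sketch of such a minimal config (the address range is a placeholder):
acl myNet src 10.0.0.0/255.255.0.0
http_access allow myNet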

A config file with no access lists will allow cache access without any restrictions. An administrator using the above access list obviously wishes to allow only his network access to the cache. Given the Squid behaviour of inverting the last decision, we have an invisible line reading "http_access deny all".

Inverting the last decision is a simple (if not immediately obvious) solution to one of the most common acl mistakes: not adding a final deny all to the end of your acl list.

With this new knowledge, have a look at the first example in this chapter: you will see why I said not to use it in your configs. Given that the last operator denies the local network, local people will not be able to access the cache. The remainder of the Internet, however, will! As discussed in Chapter 4, the simplest way of creating a catch-all acl is to match requests when they come from any IP address. When programs do netmask arithmetic a subnet of all zeros will match any IP address. A corrected version of the first example dispenses with the myNet acl.
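The corrected version might read (a sketch; the all-zeros subnet matches any IP address, as described above):
acl myIP src 10.0.0.3
acl all src 0.0.0.0/0.0.0.0
http_access allow myIP
http_access deny all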

Once the cache is considered stable and is moved into production, the config would change. http_access lines do add a very small amount of overhead, but that's not the only reason to have simple access rulesets: the fewer rulesets, the easier your setup is to understand. The below example includes a deny all rule although it doesn't really need one: you may know of the automatic inversion of the last rule, but someone else working on the cache may not.
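The production config might look like this (a sketch):
acl myNet src 10.0.0.0/255.255.0.0
acl all src 0.0.0.0/0.0.0.0
http_access allow myNet
http_access deny all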

You should always end your access lists with an explicit deny. In Squid-2.1 the default config file does this for you when you insert your HTTP acl operators in the appropriate place.

Acl lines
The examples so far have given you an idea of an acl line's layout. Their layout can be symbolized as follows (? Check! ?):
acl name type (string|"filename") [string2] [string3] ["filename2"]


The acl tag consists of a minimum of three fields: a unique name; an acl type and a decision string. An acl line can have more than one decision string, hence the [string2] and [string3] in the line above.

A unique name
This is supposed to be descriptive. Use a name such as customers or mynet. You have seen this lots of times before: the word myNet in the above example is one such case. There must only be one acl with a given name; if you find that you have two or more classes with similar names, you can append a number to the name: customer1, customer2 etc. I generally avoid this, instead putting all the data for similar classes into a file, and including the whole file as one acl. Check the Decision String section for some more info on this.

Type
So far we have discussed only acls that check the source IP address of the connection. This isn't sufficient for many people: it may be useful for you to allow connections at only certain times, or to only specific domains, or by only some users (using usernames and passwords). If you really want to, you can even combine all of the above: only allow connections from users that have the right password, come from the right address and are going to the right domain. There are quite a few different acl types: the next section of this chapter discusses all of the different types in detail. In the meantime, let's finish the description of the structure of the acl line.

Decision String
The acl code uses this string to check if the acl matches a given connection. When using this field, Squid checks the type field of the acl line to decide how to use the decision string. The decision string could be an IP address range, a regular expression, a list of domains, or more. In the next section (where we discuss the types of acls available) we discuss the different forms of the Decision String.

If you have another look at the formal definition of the acl line above, you will note that you can have more than one decision string per acl line. Strings in this format are OR'd together; if you were to specify two IP address ranges on the same line, the return result of the acl would be true if either of the IP addresses matched. (If decision strings were AND'd together, then an incoming request would have to come from two IP address ranges at the same time. This is not impossible, but would almost certainly be pointless.)

Large decision lists can be stored in files, so that your squid.conf doesn't get cluttered. Some of the caches I have worked on have had in the region of 2000 lines of acl rules, which could lead to a very cluttered squid.conf file. You can include a file into the decision section of an acl list by placing the filename (with path) in double-quotes. The file simply contains the data set; one datum per line. In the next example the file /usr/local/squid/conf/data/myNets can contain any number of IP ranges, one range per line.
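A sketch of the include-from-file form (the path comes from the text above):
acl myNets src "/usr/local/squid/conf/data/myNets"
http_access allow myNets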

While on the topic of long lists of acls: it's important to note that you can end up slowing your cache response with very long lists of acls. Checking acls requires CPU time, and long lists can decrease cache performance, since instead of moving data to clients Squid is busy checking access lists. What constitutes a long list? Don't worry about lists with a few hundred entries unless you have a really slow or busy CPU. Lists thousands of lines long can, however, cause problems.

Types of acl
So far we have only spoken about acls that filter by source IP address. There are numerous other acl types:

Source/Destination IP address
Source/Destination Domain
Regular expression match of requested domain
Words in the requested URL
Words in the source or destination domain
Current day/time
Destination port
Protocol (FTP, HTTP, SSL)
Method (HTTP GET or HTTP POST)
Browser type
Name (according to the Ident protocol)
Autonomous System (AS) number
Username/Password pair
SNMP Community

Source/Destination IP address
In the examples earlier in this chapter you saw lines in the following format:
acl myNet src 10.0.0.0/255.255.0.0
http_access allow myNet

The above acl will match any connection coming from an IP address between 10.0.0.0 and 10.0.255.255. In recent years more and more people are using Classless Inter-Domain Routing (CIDR) format netmasks, like 10.0.0.0/16. Squid handles both the traditional IP/Netmask and the more recent IP/Bits notation in the src acl type. IP ranges can also be specified in a further format: one that is Squid specific. (? I need to spend some time hacking around with these: I am not sure of the layout ?)
acl myNet src addr1-addr2/netmask
http_access allow myNet

Squid can also match connections by destination IP. The layout is very similar: simply replace src with dst. Here are a couple of examples:
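For instance (sketches; the addresses are placeholders):
acl localServer dst 10.0.0.5
http_access allow localServer
acl badHost dst 192.168.3.7
http_access deny badHost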

Source/Destination Domain
Squid can also limit requests by their source domain. Though it doesn't always happen in the real world, network administrators can add reverse DNS entries for each of the hosts on their network. (These records are normally referred to as PTR records.) Squid can make decisions about the validity of incoming requests by checking their reverse DNS entries. In the below example, the acl is true if the request comes from a host with a reverse entry that is in either the qualica.com or squid-cache.org domains:
acl myDomain srcdomain .qualica.com .squid-cache.org
http_access allow myDomain

Reverse DNS matches should not be used where security is important. A determined attacker (who controlled the reverse DNS entries for the attacking host) would be able to manipulate these entries so that the request appears to come from your domain. Squid doesn't attempt to check that reverse and forward DNS entries match, so this option is not recommended.

Squid can also be configured to deny requests to specific domains. Many people implement these filter lists for pornographic sites. The legal implications of this filtering are not covered here: there are many, and the relevant law is in a constant state of flux, so advice here would likely be obsolete in a very short period of time. I suggest that you consult a good lawyer if you want to do something like this.

The dstdomain acl type allows one to match accesses by destination domain. This could be used to match urls for popular adult sites, and refuse access (perhaps during specific times).

If you want to deny access to a set of sites, you will need to find out these sites' IP addresses, and deny access to these IP addresses too. If you just put the domain name in, someone determined to access a specific site could find out the IP address associated with that hostname and access it by entering the IP address in their browser.

The above is best described with an example. Here, I assume that you want to restrict access to the site www.adomain.example. If you use either the host or nslookup commands, you will find that this server has the IP address 10.255.1.2. It's easiest to just have two acls: one for IPs and one for domains. If the lists get too large, you can simply place them in a file.
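A sketch of the two-acl approach (the address comes from the text above):
acl adomainIP dst 10.255.1.2
acl adomainName dstdomain .adomain.example
http_access deny adomainIP
http_access deny adomainName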

Words in the requested URL
Most caches can filter out URLs that contain a set of banned words. Regular expressions allow you to simply check if a word is in a given URL, but they also allow for more powerful searches of the URL. With a simple word check you would find it nearly impossible to create a rule that allows access to sites with the word sex in the URL, but at the same time denies access to all avi files on that site. With regular expressions this sort of checking becomes easy, once you understand the regex syntax.

A Quick introduction to regular expressions
We haven't encountered regular expressions in this book yet. A regular expression (regex) is an incredibly useful way of matching strings. As they are incredibly powerful, they can get a little complicated. Regexes are often used in string-oriented languages like Perl, where they make processing of large text files (such as logs) incredibly easy. Squid uses regular expressions for numerous things: refresh patterns and access control among them.

If you have not used regular expressions before, you might want to have a look at the O'Reilly book on regular expressions or the appropriate section in the O'Reilly perl book. Instead of going into detail here, I am just going to give some (hopefully) useful examples. If you have perl installed on your machine, you could have a look at the perlre manual page to get an idea as to how the various regex operators (such as .) function.

Regular expressions in Squid are case-sensitive by default. If you want to match both upper- and lower-case text, you can prefix the regular expression with -i. Have a look at the next example, where we use this to match sex, SEX (or even SeX).
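For instance (a sketch):
acl sexWord url_regex -i sex
# -i makes the match case-insensitive: sex, SEX and SeX all match
http_access deny sexWord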

Using Regular expressions to match words in the requested URL
Using regular expressions allows you to create more flexible access lists. So far you have only been able to filter sites by destination domain, where you have to match the entire domain to deny access to the site. Since regular expressions are used to match text strings, you can use them to match words, partial words or patterns in URLs or domains.


The most common use of regex filters in ACL lists is for the creation of far-reaching site filters: if the url or domain contains a set of banned words, access to the site is denied. If you wish to deny access to sites that contain the word sex in the URL, you can add one acl rule, rather than trying to find every site that has adult material on it.

The big problem with regex filters is that not all sites that contain the word sex in the URL are pornographic. By denying these sites you are likely to be infringing people's rights, and you should refer to a lawyer for advice on the legality of this.

Creating a list of sites that you don't want accessed can be tedious. There are companies that sell adult/unwanted material lists which plug into Squid, but these can be expensive. If you cannot justify the cost, you can maintain a list of your own. The url_regex acl type is used to match any word in the URL. Here is an example:
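Something along these lines (a sketch; the word is a placeholder):
acl badWords url_regex -i sex
http_access deny badWords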

In places where bandwidth is very expensive, system administrators may have no problem with people visiting pornographic sites. They may, however, want to stop people downloading huge avi files from these sites. The following example would deny downloads of avi files from sites that contain the word sex in the URL. The regular expression below matches any URL that contains the word sex AND ends with .avi.
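Such a rule might look like this (a sketch):
acl sexAvi url_regex -i sex.*\.avi$
http_access deny sexAvi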

The urlpath_regex acl type strips off the url-type and hostname, checking instead only the path and filename.

Words in the source or destination domain

Regular expressions can also be used for checking the source and destination domains of a request. The srcdom_regex tag is used to check that a request comes from a specific subdomain, while the dstdom_regex checks the domain part of the requested URL. (You could check the requested domain with a url_regex tag, but you could run into interesting problems with sites that refer to pages with urls like http://www.company.example/www.anothersite.example.)

Here is an example acl set that uses a regular expression (rather than using the srcdomain and dstdomain tags). This example allows you to deny access to .com or .net sites if the request is from the .za domain. This could be useful if you are providing a "public peering" infrastructure to other caches in your geographical region. Note that this example is only a fragment of a complete acl set: you would presumably want your customers to be able to access any site, and there is no final deny acl.
acl bad_dst_TLD dstdom_regex \.com$ \.net$
acl good_src_TLD srcdom_regex \.za$
# allow requests FROM the za domain UNLESS they want to go to .com or .net
http_access deny bad_dst_TLD
http_access allow good_src_TLD

Current day/time
Squid allows one to allow access to specific sites by time. Often businesses wish to filter out irrelevant sites during work hours. The Squid time acl type allows you to filter by the current day and time. By combining the dstdomain and time acls you can allow access to specific sites (such as the sites of suppliers or other associates) during work hours, but allow access to other sites after work hours. The layout is quite compact:
acl name time [day-list] [start_hour:minute-end_hour:minute]

Day list is a list of single characters indicating the days that the acl applies to. Using the first letter of the day would be ambiguous (since, for example, both Tuesday and Thursday start with the same letter). When the first letter is ambiguous, the second letter is used: T stands for Tuesday, H for Thursday. Here is a list of the days with their single-letter abbreviations:

S - Sunday
M - Monday
T - Tuesday
W - Wednesday
H - Thursday
F - Friday
A - Saturday

Start_hour and end_hour are times written in 24-hour ("military") format (17:00 instead of 5:00pm). End_hour must always be larger than start_hour. Unfortunately, this means that you can't simply write:
acl darkness time 17:00-6:00 # won't work

You have to specify two separate ranges:
acl night time 17:00-24:00
acl early_morning time 00:00-6:00

As you can see from the original definition of the time acl, you can specify the day of the week (with no time), the time (with no day), or both the time and day (?check!?). You can, for example, create a rule that specifies weekends without specifying that the day starts at midnight and ends at the following midnight. The following acl will match on either Saturday or Sunday:
acl weekends time SA

The following example is too basic for real-world use. Unfortunately, creating a good example requires some of the more advanced features of the http_access line; these are covered (with examples) in the next section of this chapter.

Destination Port
Because of the design of the HTTP protocol, people can connect to things like IRC servers through your cache servers, even though the two protocols are very different. The same loophole can be used to tunnel telnet connections through your cache server. The part of HTTP that allows this is the CONNECT method, mainly used for securing https connections with SSL.

Since you generally don't want to proxy anything other than the standard supported protocols, you can restrict the ports that your cache is willing to connect to. Web servers almost always listen for incoming requests on port 80. Some servers (notably site-specific search engines and unofficial sites) listen on other ports, such as 8080. Other services (such as IRC) also use high-numbered ports. The default Squid config file limits standard HTTP requests to the port ranges defined in the Safe_ports squid.conf acl. SSL CONNECT requests are even more limited, allowing connections to only ports 443 and 563. However, keep in mind that these port assignments are only a convention and nothing prevents people from hosting (on machines they control) any type of server on any port they choose.

Port ranges are limited with the port acl type. If you look in the default squid.conf, you will see lines like:
acl SSL_ports port 443 563
acl Safe_ports port 80 21 443 563 70 210 1025-65535

The format is pretty straightforward: a destination port of 443 or 563 is matched by the first acl; 80, 21, 443, etc. by the second line. The most complicated part of the examples above is the end of the second line: the text that reads "1025-65535".

The "-" character is used in squid to specify a range. The example thus matches any port from 1025 all the way up to 65535. These ranges are inclusive, so the second line matches ports 1025 and 65535 too.

The only low-numbered ports which Squid should need to connect to are 80 (the HTTP port), 21 (the FTP port), 70 (the Gopher port), 210 (wais) and the appropriate SSL ports. All other low-numbered ports (where common services like telnet run) do not fall into the 1025-65535 range, and are thus denied.


The following http_access line denies access to URLs that are not in the correct port ranges. You have not seen the ! http_access operator before: it inverts the decision. The line below would read "deny access if the request does not fall in the range specified by the Safe_ports acl" if it were written in English. If the port matches one of those specified in the Safe_ports acl line, the next http_access line is checked. More information on the format of http_access lines is given in the next section, Acl-operator lines.
http_access deny !Safe_ports

Protocol (FTP, HTTP, SSL)
Some people may wish to restrict their users to specific protocols. The proto acl type allows you to restrict access by the URL prefix: the http:// or ftp:// bit at the front. The following example will deny requests that use the FTP protocol.
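A sketch:
acl ftpRequests proto FTP
http_access deny ftpRequests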

The default squid.conf file denies access to a special type of URL, those which use the cache_object protocol. When Squid sees a request for one of these URLs it serves up information about itself: usage statistics, performance information and the like. The world at large has no need for this information, and it could be a security risk.

HTTP Method (GET, POST or CONNECT)
HTTP can be used for downloading (GETting data) or uploads (POSTing data to a site). The CONNECT method is used for SSL data transfers. When a connection is made to the proxy, the client specifies what kind of request (called a method) it is sending. A GET request looks like this:
GET http://www.qualica.com/ HTTP/1.1
(blank line)

If you were connecting using SSL, the GET word would be replaced with the word CONNECT. You can control what methods are allowed through the cache using the method acl type. The most common use is to stop CONNECT type requests to non-SSL ports. The CONNECT method allows data transfer in any direction at any time: if you telnet to a badly configured proxy, and enter something like:
CONNECT www.domain.example:23 HTTP/1.1
(blank line)

you might end up with a telnet connection to www.domain.example just as if you had telnetted there from the cache server itself. This can be used to get around packet-filters, firewall access lists and passwords, which is generally considered a bad thing! Since CONNECT requests can be quite easily exploited, the default squid.conf denies access to SSL requests to non-standard ports (as described in the section on the port acl type).

Let's assume that you want to stop your clients from POSTing to any sites (note that doing this is not a good idea, since people using some search engines, for example, would run into problems: at this stage this is just an example).
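A sketch of such a rule:
acl postRequests method POST
http_access deny postRequests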

Browser type
Companies sometimes have policies as to what browsers people can use. The browser acl type allows you to specify a regular expression, matched against the browser's User-Agent string, that can be used to allow or deny access.
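For instance, to deny a particular browser (a sketch; the pattern is a placeholder):
acl oldBrowser browser ^Mozilla/2
http_access deny oldBrowser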

Username


Logs generally show the source IP address of a connection. When this address is on a multiuser machine (let's use a Unix machine at a university as an example) you cannot pin down a request as being from a specific user. There could be hundreds of people logged into the Unix machine, and they could all be using the cache server. Trying to track down a misbehaver is very difficult in this case, since you can never be sure which user is actually doing what. To solve this problem, the ident protocol was created. When the cache server accepts a connection, it can call back to the machine the connection came from (on a low-numbered port, so the reply cannot be faked) to find out who's on the other end of the connection. This doesn't make any sense on single-user systems: people can just load their own ident servers (and become daffy duck for a day). If you run multi-user systems, then you may want only certain people on those machines to be able to use the cache. In this case you can use the ident username to allow or deny access.

One of the best things about Unix is the flexibility you get. If you wanted (for example) only students in their second year to have access to the cache servers via your Unix machines, you could create a replacement ident server. This server could find out which user has connected to the cache, but instead of returning the username it could return a string like "third_year" or "postgrad". Rather than maintaining a list of which students are in which year on both the cache server and the central Unix system, you could write simple Squid rules, and the ident server could do all the work of checking which user is which.

Autonomous System (AS) Number
Squid is often used by large ISPs. These ISPs want all of their customers to have access to their caches without having incredibly long manually-maintained ACL lists (don't forget that such long lists of IPs generally increase the CPU usage of Squid too). Large ISPs all have AS (Autonomous System) numbers, which are used by other Internet routers running the BGP (Border Gateway Protocol) routing protocol.

The whois server whois.ra.net keeps a (supposedly authoritative) list of all the IP ranges that are in each AS. Squid can query this server and get a list of all IP addresses that the ISP controls, reducing the number of rules required. The data returned is also stored in a radix tree, for more CPU-friendly retrieval.
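An acl of this sort might look like the following (a sketch; the AS number is a placeholder):
acl myCustomers src_as 1234
http_access allow myCustomers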

Sometimes the whois server is updated only sporadically. This could lead to problems with new networks being denied access incorrectly. It's probably best to automate the process of adding new IP ranges to the whois server if you are going to use this function.

If your region has some sort of local whois server that handles queries in the same way, you can use the as_whois_server Squid config file option to query a different server.

Username and Password
If you want to track Internet usage it's best to get users to log into the cache server when they want to use the net. You can then use a stats program to generate per-user reports, no matter which machine on your network a person is using. Universities and colleges often have labs with many machines, where it is difficult to tell which user is sitting in front of a machine at any specific time. By using usernames and passwords you will solve this problem.

Squid uses modules to do user authentication, rather than including code to do it directly. The default Squid source does, however, include two standard modules: the first authenticates users from a file, the other uses SMB (MS Windows) authentication. Since these modules are not compiled when you compile Squid itself, you will need to cd to the appropriate source directory (under auth_modules) and run make. If the compile goes well, a make install will place the program file in the /usr/local/squid/bin/ directory and any config files in the /usr/local/squid/etc/ directory.

NCSA authentication is the easiest to use, since it's self contained. The SMB authentication program requires that Samba (samba.org) be installed, since it effectively talks to the SMB server through Samba.

The squid.conf file uses the authenticate_program tag to decide which external program to use to authenticate users. If Squid were to only start one authentication program, a slow username/password lookup could slow the whole cache down (while all other connections waited to be authenticated). Squid thus opens more than one authentication program at a time, sending pending requests to the second when the first is busy, the third when the second is busy, and so forth. The actual number started is specified by the authenticate_children squid.conf value. The default is five, but you will probably need to increase this for a heavily loaded cache server.

Using the NCSA authentication module
To use the NCSA authentication module, you will need to add the following line to your squid.conf:
authenticate_program /usr/local/squid/bin/ncsa_auth /usr/local/squid/etc/passwd

You will also need to create the appropriate password file (/usr/local/squid/etc/passwd in the example above). This file consists of a username and password pair, one per line, where the username and password are separated by a colon (:), just as they are in a Unix /etc/passwd file. The password is encrypted with the same function as the passwords in /etc/passwd (or /etc/shadow on newer systems). Here is an example password line:
oskar:lKdpxbNzhlo.w

Since the encrypted passwords are the same, and the ncsa_auth module understands the /etc/passwd or /etc/shadow file format, you could simply copy the system password file periodically. If your users do not already have passwords in Unix crypt format somewhere, you will have to use the htpasswd program (in /usr/local/squid/bin/) to generate the appropriate user and password pairs.

Using the SMB authentication module
Very simple...
authenticate_ip_ttl 5 minutes
auth_param basic children 5
auth_param basic realm Authentication Server
auth_param basic program /usr/lib/squid/smb_auth -W work_group -I server_name

Using the RADIUS authentication module
Once you have compiled ("./configure && make && make install") squid_radius_auth (you can get a copy here: http://www.squid-cache.org/contrib/squid_radius_auth/), you must add the following lines to squid.conf (for basic auth):
acl external_traffic proxy_auth REQUIRED
http_access allow external_traffic
auth_param basic program /usr/local/squid/libexec/squid_radius_auth -f /usr/local/squid/etc/squid_radius_auth.conf
auth_param basic children 5
auth_param basic realm This is the realm
auth_param basic credentialsttl 45 minutes


After you have added these parameters, you must edit /usr/local/squid/etc/squid_radius_auth.conf, change the default RADIUS server hostname (or IP), and change the shared key. Restart Squid for the changes to take effect.

SNMP Community
If you have configured Squid to support SNMP, you can also create acls that filter by the requested SNMP community. By combining source address (with the src acl type) and community filters (using the snmp_community acl type) you can restrict sensitive SNMP queries to administrative machines while allowing safer queries from the public. SNMP setup is covered in more detail later in the chapter, where we discuss the snmp_access acl-operator.

Acl-operator lines
Acl-operators are the other half of the acl system. For each connection the appropriate acl-operators are checked (in the order that they appear in the file). You have met the http_access and icp_access operators before, but they aren't the only Squid acl-operators. All acl-operator lines have the same format; although the below format mentions http_access specifically, the layout also applies to all the other acl-operators too.
http_access allow|deny [!]aclname [& [!]aclname2 ... ]
<<note: field-testing the above on old squid 2.3 suggests "&" must be omitted>>
Let's work through the fields from left to right. The first word is http_access, the actual acl-operator.

The allow and deny words come next. If you want to deny access to a specific class of users, you can change the customary allow to deny in the acl line. We have seen where a deny line is useful before, with the final deny of all IP ranges in previous examples.

Let's say that you wanted to deny Internet access to a specific list of IP addresses during the day. Since acls can only have one type per acl, you could not create an acl line that matches an IP address during specific times. By combining more than one acl per acl-operator line, though, you get the same effect. Consider the following acls:
acl dialup src 10.0.0.0/255.255.255.0
acl work time 08:00-17:00

If you could create an acl-operator that was matched when both the dialup and work acls were true, clients in the range could only connect during the right times. This is where the aclname2 in the above acl-operator definition comes in. When you specify more than one acl per acl-operator line, both acls have to be matched for the acl-operator to be true. The acl-operator function ANDs the results from each acl check together to see if it is to return true or false.

You could thus deny the dialup range cache access during working hours with the following acl rules:
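A sketch (the final allow line assumes an all acl defined as in earlier examples):
http_access deny dialup work
http_access allow all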

You can also invert an acl's result value by using an exclamation mark (the traditional NOT operator from many programming languages) before the appropriate acl. In the following example I have reduced Example 6-4 into one http_access line, taking advantage of the implicit inversion of the last rule to allow access from the local network. Since the example is quite complicated, let's cover it in more detail:
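The single-line version might look like this (a sketch; myNet and all as defined earlier):
acl myNet src 10.0.0.0/255.255.0.0
acl all src 0.0.0.0/0.0.0.0
http_access deny all !myNet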

In the above example an IP from the outside world will match the 'all' acl, but not the 'myNet' acl; the IP will thus match the http_access line. Consider the binary logic for a request coming in from the outside world, where the IP is not defined in the myNet acl:
Deny http access if ((true) & (!false))


If you consider the relevant matching of an IP in the 10.0.0.0 range, the myNet value is true, and the binary representation is as follows:
Deny http access if ((true) & (!true))

A 10.0.0.0 range IP will thus not match the only http_access line in the squid config file. Remembering that Squid will default to the opposite of the last match in the file, accesses will be allowed from the myNet IP range.

The other Acl-operators
You have encountered only the http_access and icp_access acl-operators so far. Other acl-operators are:

no_cache
ident_lookup_access
miss_access
always_direct, never_direct
snmp_access (covered in the next section of this chapter)
delay_classes (covered in the next section of this chapter)
broken_posts

The no_cache acl-operator
The no_cache acl-operator is used to ensure freshness of objects in the cache. The default Squid config file includes an example no_cache line that ejects the results of cgi programs from the cache. If you want to ensure that cgi pages are not cached, you must un-comment the following lines in squid.conf:
acl QUERY urlpath_regex cgi-bin \?
no_cache deny QUERY

The first line uses a regular expression match to find urls that have cgi-bin or ? in the path (since we are using the urlpath_regex acl type, a site with a name like cgi-bin.qualica.com will not be matched.) The no_cache acl-operator is then used to eject matching objects from the cache.

The ident_lookup_access acl-operator
Earlier we discussed using the ident protocol to control cache access. To reduce network overhead, Squid does an ident lookup only when it needs to. If you are using ident to do access control, Squid will do an ident lookup for every request, and you don't have to worry about this acl-operator.

Many administrators would like to log the ident value for connections without actually using it for access control. Squid used to have a simple on/off switch for ident lookups, but this incurred extra overhead for the cases where the ident lookup wasn't useful (where, for example, the connection is from a desktop PC).

Let's consider some examples. Assume that you have one Unix server (at IP address 10.0.0.3), and all remaining IPs in the 10.0.0.0/255.255.255.0 range are desktop PCs. You don't want to log the ident value from the PCs, but you do want to record it when the connection is from the Unix machine. Here is an example acl set that does this:
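A sketch (the addresses come from the text above):
acl unixServer src 10.0.0.3
acl myNet src 10.0.0.0/255.255.255.0
ident_lookup_access allow unixServer
ident_lookup_access deny myNet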

If a system cracker is attempting to attack your cache, it can be useful to have their ident value logged. The following example gets Squid not to do ident lookups for machines that are allowed access, but if a request comes from a disallowed IP range, an ident lookup is done and inserted into the log.
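A sketch (addresses are placeholders):
acl myNet src 10.0.0.0/255.255.255.0
acl all src 0.0.0.0/0.0.0.0
# Skip ident for our own machines, but look up anyone else before denying them.
ident_lookup_access deny myNet
ident_lookup_access allow all
http_access allow myNet
http_access deny all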

The miss_access acl-operator
The ICP protocol is used by many caches to find out if objects are in another cache's on-disk store. If you are peering with other organizations' caches, you may wish them to treat you as a sibling, where they only get data that you already have stored on disk. If an unscrupulous cache-admin were to change their cache_peer line to read parent instead of sibling, they could get you to retrieve objects on their behalf.

To stop this from happening, you can create an acl that contains the peering caches, and use the miss_access acl-operator to ensure that only hits are served to these caches. In response to all other requests, an access-denied message is sent (so if a sibling complains that they almost always get error messages, it's likely that they think you should be their parent, while you think they should be treating you as a sibling).

When looking at the following example it is important to realise that http_access lines are checked before any miss_access lines. If the request is denied by the http_access lines, an error page is returned and the connection closed, so miss_access lines are never checked. This means that the last miss_access line in the example doesn't allow random IP ranges to access your cache; it only allows ranges that have already passed the http_access test. This is simpler than having one miss_access line for each http_access line in the file, and it will reduce CPU usage too, since only two acls are checked instead of the six we would otherwise have.
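A sketch of such a setup (addresses are placeholders):
acl myNet src 10.0.0.0/255.255.0.0
acl siblingCaches src 192.168.1.1 192.168.2.1
acl all src 0.0.0.0/0.0.0.0
http_access allow myNet
http_access allow siblingCaches
http_access deny all
# Siblings may only be served objects we already hold; everyone else is unrestricted.
miss_access deny siblingCaches
miss_access allow all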

The always_direct and never_direct acl-operators
These operators help you make controlled decisions about which servers to connect to directly, and which to connect to through a parent cache/proxy. I previously discussed this set of options briefly in Chapter 3, during the Basic Installation phase. These tags are covered in detail in the following chapter, in the Peer Selection section.

The broken_posts acl-operator
Some servers incorrectly handle POST data, requiring an extra Carriage-Return (CR) and Line-Feed (LF) after a POST request. Since obeying the HTTP specification would make Squid incompatible with these servers, there is an option to be non-compliant when talking to a specific set of servers. This option should be very rarely used. The url_regex acl type should be used for specifying the broken server.

SNMP Configuration
Before we continue: if you wish to use Squid's SNMP functions, you will need to have configured Squid with the --enable-snmp option, as discussed way back in Chapter 2. The Squid source only includes SNMP code if it is compiled with the correct options.

Normally a Unix SNMP server (also called an agent) collects data from the various services running on a machine, returning information about the number of users logged in, the number of sendmail processes running and so forth. As of this writing, there is no SNMP server which gathers Squid statistics and makes them available to SNMP management stations for interpretation. Code has thus been added to Squid to handle SNMP queries directly.


Squid normally listens for incoming SNMP requests on port 3401. The standard SNMP port is 161.

For the moment I am going to assume that your management station can collect SNMP data from a port other than 161. Squid will thus listen on port 3401, where it will not interfere with any other SNMP agents running on the machine.

No specific SNMP agent or management station software is covered by this text. A Squid-specific mib.txt file is included in the /usr/local/squid/etc/ directory. Most management station software should be able to use this file to construct Squid-specific queries.

Querying the Squid SNMP server on port 3401
All snmp_access acl-operators are checked when Squid is queried by an SNMP management station. The default squid.conf file allows SNMP queries from any machine, which is probably not what you want. Generally you will want only one machine to be able to do SNMP queries of your cache. Some SNMP information is confidential, and you don't want random people to poke around your cache settings. To restrict access, simply create a src acl for the appropriate IP address, and use snmp_access to deny access for every other IP.
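A sketch (the management station's address is a placeholder):
acl snmpManager src 10.0.0.2
acl all src 0.0.0.0/0.0.0.0
snmp_access allow snmpManager
snmp_access deny all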

Not all Squid SNMP information is confidential. If you want to split SNMP information into public and private, you can use an SNMP-specific acl type to allow or deny requests based on the community the client has requested.

Running multiple SNMP servers on a cache machine
If you are running multiple SNMP servers on your cache machine, you probably want to see all the SNMP data returned on one set of graphs or summaries. You don't want to have to query two SNMP servers on the same machine, since many SNMP analysis tools will not allow you to relate (for example) load average to number of requests per second when the SNMP data comes from more than one source.

Let's work through the steps Squid goes through when it receives an SNMP query:

The request is accepted, and access-control lists are checked.
If the request is allowed, Squid checks to see if it's a request for Squid information or a request for something it doesn't understand.
Squid handles all Squid-specific queries internally, but all other SNMP requests are simply passed to the other SNMP server; Squid essentially acts as an SNMP proxy for SNMP queries it doesn't understand.

This SNMP proxy-mode allows you to run two servers on a machine, but query them both on the same port. In this mode Squid will normally listen on port 161, and the other SNMP server is configured to listen on another port (let's use port 3456 for argument's sake). This way the client software doesn't have to be configured to query a different port, which especially helps when the client is not under your control.

Binding the SNMP server to a non-standard port

Getting your SNMP server to listen on a different port may be as easy as changing one line in a config file. In the worst case, though, you may have to trick it into listening somewhere else. This section is a bit of a guide to IP server trickery!

Server software can either listen for connections on a hard-coded port (where the port number is compiled directly into the binary), or it can use standard system calls to find the port that it should be listening on. Changing the port for programs that use the second method is easy: you edit the /etc/services file, changing the value for the appropriate port there. If this doesn't work, it probably means that your program uses hard-coded values, and your only recourse is to recompile from source (if you have it) or speak to your vendor.

You can check that your server is listening to the new port by checking the output of the netstat command. The following command should show you if some process is listening for UDP data on port 3456:

cache1:~ $ netstat -na | grep udp | grep 3456
udp        0      0 0.0.0.0:3456            0.0.0.0:*
cache1:~ $

Changing the services port does have implications: client programs (like any SNMP management station software running on the machine) will also use the services file to find out which port they should connect to when forming outgoing requests. If you are running anything other than a simple SNMP agent on the cache machine, you must not change the /etc/services file: if you do, you will encounter all sorts of strange problems!

Squid doesn't use the /etc/services file, but the port to listen to is stored in the standard Squid config file. Once the other server is listening on port 3456, we need to get Squid to listen on the standard SNMP port and proxy requests to port 3456.

First, change the snmp_port value in squid.conf to 161. Since we are forwarding requests to another SNMP server, we also need to set forward_snmpd_port to our other server's port, 3456.
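The two squid.conf lines involved would then look something like this:

snmp_port 161
forward_snmpd_port 3456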

Access Control with more than one Agent

Since Squid is actually creating all the queries that reach the second SNMP server, using an IP-based access-control system in the second server's config is useless: all requests will come from localhost. Since the second server cannot find out where the requests originally came from, Squid has to take over the access-control functions that were handled by the other server.

For the first example, let's assume that you have a single SNMP management station, and you want this machine to have access to all SNMP functions. Here we assume that the management station is at IP 10.0.0.2.
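A minimal sketch of that setup, with the management station at 10.0.0.2 as in the text:

acl snmpManager src 10.0.0.2/255.255.255.255
snmp_access allow snmpManager
snmp_access deny all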

You may have classes of SNMP stations too: you may wish some machines to be able to inspect public data, but others are to be considered completely trusted. The special snmp_community acl type is used to filter requests by destination community. In the following example all local machines are able to get data in the public SNMP community, but only the snmpManager machine is able to get other information. In this example we are using the ANDing of the publicCommunity and myNet acls to ensure that only people on the local network can get even public information.
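One way the setup just described might look (the acl names follow those used in the text; the network range is a placeholder):

acl publicCommunity snmp_community public
acl myNet src 10.0.0.0/255.255.255.0
acl snmpManager src 10.0.0.2/255.255.255.255
snmp_access allow snmpManager
snmp_access allow publicCommunity myNet
snmp_access deny all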

Delay Classes

Delay Classes are generally used in places where bandwidth is expensive. They let you slow down access to specific sites (so that other downloads can happen at a reasonable rate), and they allow you to stop a small number of users from using all your bandwidth (at the expense of those just trying to use the Internet for work).

To ensure that some bandwidth is available for work-related downloads, you can use delay-pools. By classifying downloads into segments, and then allocating these segments a certain amount of bandwidth (in kilobytes per second), your link can remain uncongested for "useful" traffic.


To use delay-pools you need to have compiled Squid with the appropriate options: you must have used the --enable-delay-pools option when running the configure program back in Chapter 2.

Slowing down access to specific URLs

An acl-operator (delay_access) is used to split requests into pools. Since we are using acls, you can split up requests by source address, destination URL and more. There is more than one type (or class) of pool. Each class of pool allows you to limit bandwidth in different ways.

The First Pool Class

Rather than cover all of the available classes immediately, let's deal with a basic example first. In this example we have only one pool, and the pool catches all URLs containing the word abracadabra:

acl magic_words url_regex -i abracadabra
delay_pool_count 1
delay_class 1 1
delay_parameters 1 16000/16000
delay_access 1 allow magic_words

The first line is a standard ACL: it returns true if the requested URL has the word abracadabra in it. The -i flag is used to make the search case-insensitive.

The delay_pool_count variable tells Squid how many delay pools there will be. Here we have only one pool, so this option is set to 1.

The third line creates a delay pool (delay pool number 1, the first option) of class 1 (the second option to delay_class).

The first delay class is the simplest: the download rates of all connections in the class are added together, and Squid keeps this aggregate value below a given maximum.

The fourth line is the most complex, as you can see. The delay_parameters option allows you to set speed limits on each pool. The first option is the pool to be manipulated: since we have only one pool in this example, this is set to 1. The second option consists of two values: the restore and max values, separated by a forward slash (/).

If you download a short file at high speed, you create a so-called burst of traffic. Generally these short bursts of traffic are not a problem: they are normally html or text files, which are not the real bandwidth consumers. Since we don't want to slow everyone's access down (just the people downloading comparatively large files), Squid allows you to configure the size at which a download starts slowing down. If you download a short file, it arrives at full speed, but once you hit a certain threshold the rest of the file arrives more slowly.

The restore value is used to set the download speed, and the max value lets you set the size at which files start being slowed down. Restore is in bytes per second; max is in bytes.

In the above example, downloads proceed at full speed until 16000 bytes have been transferred. This limit ensures that small files arrive reasonably fast. Once this much data has been transferred, however, the transfer rate is slowed to 16000 bytes per second. At 8 bits per byte, this means that connections are limited to 128 kilobits per second (16000 * 8).


The Second Pool Class

As I discussed in this section's introduction, delay pools can help you stop one user from flooding your links with downloads. You could place each user in their own pool, and then set limits on a per-user basis, but administering these lists would become painful almost immediately. By using a different pool class, you can set rate limits by IP address easily.

Let's consider another example: you have a 128kbit per second line. Since you want some bandwidth available for things like SMTP, you want to limit web access to 100kbit per second. At the same time, you don't want a single user to use more than their fair share of sustained bandwidth. Given that you have 20 staff members and 100kbit per second of remaining bandwidth, each person should not use more than 5kbit per second. Since it's unlikely that every user will be surfing at once, we can probably limit people to about four times that value (20kbit per second, or 2.5kbytes per second).

In the following example, we change the delay class for pool 1 to 2. Delay class 2 allows us to specify both an aggregate (overall) bandwidth usage and a per-user usage. In the previous example the delay_parameters tag took only one set of options: the aggregate peak and burst rates. Given that we are now using a class-two pool, we have to supply two sets of options to delay_parameters: the overall speed and the per-IP speed. The 100kbit per second value is converted to bytes per second by dividing by 8 (giving us the 12500 values), and the per-IP value of 2.5kbytes per second that we calculated works out to 2500 bytes per second (giving us the 2500 values).

EXAMPLE:
acl all src 0.0.0.0/0.0.0.0
delay_pool_count 1
delay_class 1 2
delay_parameters 1 12500/12500 2500/2500
delay_access 1 allow all

The Third Pool Class

This class is useful to very large organizations, like universities. The second pool class lets you stop individual users from flooding your links. A lab full of students all operating at their maximum download rate can, however, still flood the link. Since such a lab (or department, if you are not at a university) will have all its IP addresses in the same range, it is useful to be able to put a cap on the download rate of an entire network range. The third pool class lets you do this. Currently this option only works on class-C network ranges, so if you are using variable-length subnet masks then this will not help.

In the next example we assume that you have three IP ranges. Each range must not use more than one third of your available bandwidth. For this example I am assuming that you have a 512kbit/s line, and you want 64kbit/s available for SMTP and other protocols. This leaves you with an overall download rate cap of 448kbit/s. Each class-C IP range will have about 150kbit/s available. With 3 ranges of 256 IP addresses each, you have in the region of 500 PCs, which (if calculated exactly) gives you 0.669kbit per second per machine. Since it is unlikely that all machines will be using the net at the same time, you can probably allocate each machine (say) 4kbit per second (a mere 500 bytes per second).

In this example, we changed the delay class of the pool to 3. The delay_parameters option now takes four arguments: the pool number; the overall bandwidth rate; the per-network bandwidth rate and the per-user bandwidth rate.


The 4kbit per second limit for users seems a little low. You can increase the per-user limit, but you may find that it's a better idea to change the max value instead, so that the limit only kicks in after (say) 16 kilobytes or so. This will allow small pages to be downloaded as fast as possible, but large pages will be brought down without influencing other users.

If you want, you can set the per-user limit to something quite high, or even set it to -1, which effectively means that there is no limit. Limits work from right to left, so if a user is sitting alone in a lab they will be limited by their per-user speed. If this value is undefined, they are limited by their per-network speed, and if that is undefined then they are limited by their overall speed. This means that you can set the per-user limit higher than you would expect: if the lab is not busy, users will get good download rates (since they are only limited by the per-network limit).

EXAMPLE:
acl all src 0.0.0.0/0.0.0.0
delay_pool_count 1
delay_class 1 3
# 56000*8 sets your overall limit at 448kbit/s
# 18750*8 sets your per-network limit at 150kbit/s
# 500*8 sets your per-user limit at 4kbit/s
delay_parameters 1 56000/56000 18750/18750 500/500
delay_access 1 allow all

Using Delay Pools in Real Life

By combining multiple ACLs, you can do interesting things with delay pools. Here are some examples:

By using time-based acls, you can limit people's speed during working hours, but allow them full-speed access outside hours.

Again (with time-based acl lists), you can allocate a very small amount of bandwidth to http during working hours, discouraging people from browsing the Web during office hours.

By using acls that match specific source IP addresses, you can ensure that sibling caches have full-speed access to your cache.

You can prioritize access to a limited set of destination sites by using the dst or dstdomain acl types, inverting the rules we used to slow access to some sites down.

You can combine username/password access-lists and speed-limits. You can, for example, allow users that have not logged into the cache access to the Internet, but at a much slower speed than users who have logged in. Users that are logged in get access to dedicated bandwidth, but are charged for their downloads.

Conclusion

Once your acl system is correctly set up, your cache should essentially be ready to become a functional part of your infrastructure. If you are going to use some of the advanced Squid features (like transparent operation mode, for example), read on: these are covered in the chapters that follow.


Cache Hierarchies

Introduction

Squid is particularly good at communicating with other caches and proxies. Numerous inter-cache communication protocols are supported, including ICP (Inter-Cache Protocol), Cache Digests, HTCP (Hyper-Text Cache Protocol) and CARP (Cache Array Routing Protocol). Each of these protocols has specific strengths and weaknesses; they are more suited to some circumstances than others.

In this chapter we look at each of the protocols in detail. We also look at the different ways that you can structure your cache hierarchy, and work through the config options that affect cache hierarchies.

Why Peer

The primary function of an inter-cache protocol is to stop object duplication, increasing hit rates. If you have a large network with widely separated caches, you may wish to store objects in each cache even if one of your other caches has it: by keeping objects close to your users, you reduce their network latency (even if you end up "wasting" disk space in the process.)

Inter-branch traffic can be reduced by placing a cache at each branch. Since caches can avoid duplicating objects between them, each disk you add to a cache adds space to the overall hierarchy, increasing your hierarchy hit-rate. This is a lot better than simply having caches at branches which do not communicate with one another, since with that setup you end up with multiple copies of each cache object; one per server. Clients can also be configured to query another branch's cache if their local one goes down, adding redundancy.


If overloaded, a central cache machine can become a network bottleneck. Unlike one cache machine, caches in a hierarchy can be close to all parts of the network; they can also handle a much larger load (with a near-linear increase in performance with each added machine). Loaded caches can thus be replaced with clusters of low-load caches, without wasting disk space.

Integrating your caches into a public cache hierarchy can increase your hit rate (since you increase your effective disk space by accessing other machines' object stores.) By choosing peers carefully, you can reduce latency, or reduce costs by saving Internet bandwidth (if communicating with your peers is cheaper than going direct to the source.) On the other hand, communicating with peers via a loaded (or high-latency) line can slow down your cache. It's best to check your peer response times periodically to see if the peering arrangement is beneficial. You can use the client program to check cache response times, and the cache manager (discussed in Chapter 12) to look at Squid's view of the cache.

Peer Configuration

First, let's look at the squid.conf options available for hierarchy configuration. We will then work through the most common hierarchy structures, so that you can see the way that the options are used.

You use the cache_peer option to configure the peers that Squid will communicate with. Other options are then used to select which peer to pass a request to.

The cache_peer Option

When communicating with a peer, Squid needs some basic information about how to talk to the machine: the hostname, what ports to send queries to, and so forth. The cache_peer config line supplies this. Let's look at an example line:
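A line matching the description below would look something like this (the port numbers are the conventional Squid defaults, used here purely for illustration):

cache_peer cache.domain.example parent 3128 3130 default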

The cache_peer option is split into five fields. The first field (cache.domain.example) is the hostname or IP of the cache that is to be queried. The second field indicates the type of relationship, and must be set to parent, sibling or multicast. The third field sets the HTTP port of the destination server, while the fourth sets the ICP (UDP) query port. The fifth field can contain zero or more keywords, although we only use one in the example above; the keyword default specifies that the cache will be used as the default path to the outside world. If you compiled Squid to support HTCP, your cache will automatically attempt to connect to TCP port 4827 (there is currently no option to change this port value). Cache digests are transferred via the HTTP port specified on the cache_peer line. Here is a summary of some of the more commonly used cache_peer keywords:
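(This list is not exhaustive; check the squid.conf documentation for your version for the full set. All of the keywords below appear elsewhere in this book.)

proxy-only - don't store responses fetched via this peer locally
weight=n - weight parent selection in favor of this peer
ttl=n - the multicast ttl to use when sending ICP queries to a multicast group
no-query - never send ICP queries to this peer
default - use this peer as the last-resort path to the outside world
round-robin - rotate requests between all peers marked round-robin
multicast-responder - the peer is a member of a multicast group, and ICP replies from its real address are to be accepted
no-digest - don't request this peer's cache digest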

Peer Selection

Let's say that you have only one parent cache server: the server at your ISP. In Chapter 3, we configured Squid so that the parent cache server would not be queried for internal hosts, so queries to the internal machines went direct, instead of adding needless load to your parent cache (and the line between you). Squid can use access-control lists to decide which cache to talk to, rather than just the destination domain. With access lists, you can use different caches depending on the source IP, domain, text in the URL and more. The advantages of this flexibility are not immediately obvious (even to me), but some examples are given in the remainder of this chapter. First, however, let's cover filtering by destination domain.


Selecting by Destination Domain

The cache_peer_domain tag is used to communicate with different caches depending on the domain that the request is destined for. To ensure that you don't query another cache server for your local domain, you can use the following config line:

cache_peer_domain cache-host.example.net !.my.dom.ain

Selecting with Acls

Squid can also make peer selections based on the results of acl rules. The cache_peer_access line was discussed in the previous chapter. The following example could be used if you want all requests from a specific IP address range to go to a specific cache server (for accounting purposes, for example). Here, all requests from the 10.0.1.* range are passed to cache.domain.example, but all other requests are handled directly.
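A sketch of that setup (the acl name is arbitrary):

acl accounting_net src 10.0.1.0/255.255.255.0
cache_peer cache.domain.example parent 3128 3130
cache_peer_access cache.domain.example allow accounting_net
cache_peer_access cache.domain.example deny all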

Querying an Adult-Site Filtering-cache for Specific URLs

Let's say that you have a separate adult-site filtering cache, which filters out objectionable URLs. The company that maintains the filter list charges by the number of queries, so it's in your interest to bypass them for URLs that you know are fine. Their documentation says that you should set their machine up as your default parent; instead, you create a list of suspect words, and set your cache up to forward requests for any URL that contains one of these words to the filtering cache server. By avoiding the filtering server for other requests, you will end up missing a fairly large number of sites that should have been filtered. At the same time, however, you don't end up filtering out valid sites that merely contain suspect words in the URL.
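One possible sketch of this arrangement (the hostname and word list are placeholders; a real word list would be much longer):

acl suspect_words url_regex -i badword1 badword2
cache_peer filter.example.net parent 3128 3130
cache_peer_access filter.example.net allow suspect_words
cache_peer_access filter.example.net deny all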

Filtering with Cache Hierarchies

ISPs in outlying regions quite often peer with large hierarchies in the USA, so as to reduce the latency of requests for pages hosted there. Since it's almost certainly faster to get local data directly from the source, they configure their caches to retrieve data for their local top-level domain directly, rather than via a USA cache.
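Assuming (purely for illustration) a local top-level domain of .za and a USA parent, something like this would keep local traffic off the parent:

acl local_tld dstdomain .za
always_direct allow local_tld
cache_peer usa-cache.example.net parent 3128 3130 default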

The always_direct and never_direct tags

Squid checks all always_direct tags before it checks any never_direct tags. If a matching always_direct tag is found, Squid will not check the never_direct tags, but decides immediately which cache to talk to. This behavior is demonstrated by the following example; here, Squid will attempt to go directly to the machine intranet, even though the same host is also matched by the all acl.
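Here is such a configuration (the walk-through below refers to these lines):

acl intranet dstdomain intranet.mydomain.example
acl all src 0.0.0.0/0.0.0.0
always_direct allow intranet
never_direct allow all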

Let's work through the logic that Squid uses in the above example, so that you can work out which cache Squid is going to talk to when you construct your own rules.

First, let's consider a request destined for the web server intranet.mydomain.example. Squid first works through all the always_direct lines; the request is matched by the first (and only) line. The never_direct and always_direct tags are acl-operators, which means that the first match is considered. In this illustration, the matching line instructs Squid to go direct when the acl matches, so all neighboring peers are ignored for this request. If the line used the deny keyword instead of allow, Squid would simply have skipped on to checking the never_direct lines.

Now, the second case: a request arrives for an external host. Squid works through the always_direct lines, and finds that none of them match. The never_direct lines are then checked. The all acl matches the connection, so Squid marks the connection as never to be forwarded directly to the origin server. Squid then works through its list of peers, trying to find the cache that the request is best forwarded to (servers that have the object are more likely to get the request, as are servers that respond fast). The algorithm that Squid uses to decide which of its peers to use is discussed shortly.

prefer_direct

This tag controls whether Squid will try to fetch objects directly from the origin server in preference to passing requests to its peers, when both options are permitted.

hierarchy_stoplist

Squid can be configured to avoid cache siblings when the requested URL contains words from a specific list. The hierarchy_stoplist tag normally contains words that occur when the remote page is dynamically generated, such as cgi-bin or asp.
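A typical setting (this is the value suggested in most default squid.conf files):

hierarchy_stoplist cgi-bin ?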

neighbor_type_domain

You can blur the distinction between parents and siblings with this tag. Let's say that you work for a very large organization, with many regions, some in different countries. These organizations generally have their own network infrastructure: you will install a link to a local regional office, and they will run links to a core backbone. Let's assume that you work for the regional office, and you have an Internet line that your various divisions share. You also have a link to your head office, where they have a large cache, and their own Internet link. You peer with their cache (with them set up as a sibling), and you also peer with your local ISP's server.

When you request pages from the outside world, you treat your ISP's cache server as a parent, but when you query web servers in your own domain you want the requests to go to your head office's cache, so that any web sites within your organization are cached. By using the neighbor_type_domain option, you can specify that requests for your local domain are to be passed to your head office's cache, but other requests are to be passed directly.
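A sketch of this arrangement (the hostnames and domain are placeholders):

cache_peer cache.isp.example parent 3128 3130 default
cache_peer cache.head-office.example sibling 3128 3130
neighbor_type_domain cache.head-office.example parent .mydomain.example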

Other Peering Options

Various other options allow you to tune values that affect your cache's interaction with hierarchies. These options apply to all peering caches (rather than to individual machines).

miss_access

The miss_access tag is an acl-operator. This tag has already been covered in the acls chapter (Chapter 6), but is covered here again for completeness. The miss_access tag allows you to create a list of caches which are only allowed to retrieve hits from your cache. If such a cache requests an object that you don't have (a miss), Squid will return an error page denying it access. If the example below is not immediately clear, please refer to Chapter 6 for more information.
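A minimal sketch (the network range is a placeholder for your sibling's address space):

acl siblings src 10.0.2.0/255.255.255.0
miss_access deny siblings
miss_access allow all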

dead_peer_timeout

If a peer cache has not responded to an ICP request for dead_peer_timeout seconds, the cache will be marked as down, and the object will be retrieved from somewhere else (probably directly from the source.)
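The default is ten seconds; in squid.conf the line looks like this:

dead_peer_timeout 10 seconds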

icp_hit_stale

Turning this option on can cause problems if you peer with anyone.


Multicast Cache Communication

Cache digests are in some ways a replacement for multicast cache peering. There are some advantages to cache digests: they are handled at the Squid level (so you don't have to fiddle with kernel multicast settings and so forth), and they add significantly less latency (finding out if a cache has an object simply involves checking an in-memory bit-array, which is significantly faster than checking across the network).

First, though, let's cover some terminology. Most people are familiar with the term broadcast, where data is sent from one host to all hosts on the local network. Broadcasts are normally used to discover things, not for general inter-machine transfer: a machine will send out a broadcast ARP request to try and find the hardware address that a specific IP address belongs to. You can also send ping packets to the broadcast address, and find machines on the local network when they respond. Broadcasts only work across physical segments (or bridged/switched networks), so an ARP request doesn't go to every machine on the Internet.

A unicast packet is the complete opposite: one machine is talking to only one other machine. All TCP connections are unicast, since they can only have one destination host for each source host. UDP packets are almost always unicast too, though they can be sent to the broadcast address so that they reach every single machine in some cases.

A multicast packet travels from one machine to one or more other machines. The difference between a multicast packet and a broadcast packet is that hosts receiving multicast packets can be on different LANs, and that each multicast data-stream is only transmitted between networks once, not once per machine on the remote network. Rather than each machine connecting to a video server, the multicast data is streamed per-network, and multiple machines just listen in on the multicast data once it's on the network.

This efficient use of bandwidth is perfect for large groups of caches. If you have more than one server (for load-distribution, say), and someone wants to peer with you, they will have to configure their server to send one ICP packet to each of your caches. If Squid gets an ICP request from somewhere, it doesn't check with all of its peers to see if they have the object. This "check with my peers" behavior only happens when an HTTP request arrives. If you have 5 caches, anyone who wants to find out if your hierarchy has an object will have to send 5 ICP requests (or treat you as a parent, so that your caches check with one another). This is a real waste of bandwidth. With a multicast network, though, the remote cache would only send one ICP request, destined for your multicast address. Routers between you would only transfer one packet (instead of 5), saving the duplication of requests. Once on your network, each machine would pick up one packet, and reply with their answer.

Multicast packets are also useful on local networks, if you have the right network cards. If you have a large group of caches on the same network, you can end up with a lot of local traffic. Each request that a cache receives prompts one ICP request to all the other local caches, swamping the local network with small packets (and their replies). A multicast packet, on the other hand, is a kind of broadcast to the machines on the local network. They will each receive a copy of the packet, although only one went out onto the wire. If you have a good ethernet card, the card will handle a fair amount of the filtering (some cards may have to be put into promiscuous mode to pick up all the packets, which can cause load on the machine: make sure that the card you buy supports hardware multicast filters). This solution is still not linearly scalable, however, since the reply packets can easily become the bottleneck by themselves.


Getting your machine ready for Multicast

The kernel's IP stack (the piece of kernel code that handles IP networking) needs to look out for multicast packets, otherwise they will be discarded (either by the network card or the lower levels of the IP stack.) Your kernel may already have multicast support, or you may have to turn it on. Doing this is, unfortunately, beyond the scope of this book, and you may have to root around for a howto guide somewhere.

Once your machine is set up to receive multicast packets, you need your machines to talk to one another. You can either join the mbone (a virtual multicast backbone), or set up an internal multicast network. Joining the mbone could be a good thing anyway, since you get access to other services. You must be sure not to use a random set of multicast IP addresses, since they may belong to someone else. You can get your own IP range from the people at the mbone.

An outgoing multicast packet has a ttl (Time To Live) value, which is used to ensure that loops are not created. Each time a packet passes through a router, the router decrements this ttl value, and the value is then checked. Once the value reaches zero, the packet is dropped. If you want multicast packets to stay on your local network, you would set the ttl value to 1. The first router to see the packet would decrement the ttl, discover it was zero, and discard the packet. This value gives you a level of control over how many multicast routers will see the packet. You should set this value carefully, so that you limit packets to your local network or immediate multicast peers (larger multicast groups are seldom of any use: they generate too many responses, and when geographically dispersed, may simply add latency. You also don't want crackers picking up all your ICP requests by joining the appropriate multicast group.)

Various multicast debugging tools are available. One of the most useful is mtrace, which is effectively a traceroute program for multicast connections. This program should help you choose the right ttl value.

Querying a Multicast Cache

The cache_peer option traditionally can have two types of cache: a parent and a sibling. If you are querying a set of multicast caches, you need to use a different type: the multicast cache type. When you send a multicast request to a cache, each of the servers in the group will send you a response packet (from their real IP address.) Squid discards unexpected ICP responses by default, and since it can't automatically determine which ICP replies are valid, you will have to add lines to the Squid config file that stop it rejecting packets from hosts in the multicast group.

In the following example, the multicast group 224.0.1.20 consists of three hosts, at IP addresses 10.11.12.1, 10.11.13.1 and 10.11.14.1. These hosts are quite close to your cache, so the ttl value is set to 5.
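That setup would look something like this (addresses as in the text):

cache_peer 224.0.1.20 multicast 3128 3130 ttl=5
cache_peer 10.11.12.1 sibling 3128 3130 multicast-responder
cache_peer 10.11.13.1 sibling 3128 3130 multicast-responder
cache_peer 10.11.14.1 sibling 3128 3130 multicast-responder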

Accepting Multicast Queries: The mcast_groups option

As a multicast server, Squid needs to listen out for the right packets. Since you can have more than one multicast group on a network, you need to configure Squid to listen to the right multicast group (the IP that you have allocated to Squid.) The following (very simple) example is from the config of the server machine 10.11.12.1 in the example above.
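It is simply the multicast group address from the earlier example:

mcast_groups 224.0.1.20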

Other Multicast Cache Options


The mcast_icp_query_timeout Option

As you may recall, Squid will wait for up to dead_peer_timeout seconds after sending out an ICP request before deciding to ignore a peer. With a multicast group, peers can leave and join at will, and it should make no difference to a client. This presents a problem for Squid: it can't wait for a number of seconds each time (what if the caches are on the same network, and responses come back in milliseconds: the waiting just adds latency.) Squid gets around this problem by occasionally sending ICP probes to the multicast address. Each host in the group responds to the probe, and Squid will know how many machines are currently in the group. When sending a real request, Squid will wait until it gets at least as many responses as were returned in the last probe: if more arrive, great. If fewer arrive, though, Squid will wait until the dead_peer_timeout value is reached. If there is still no reply, Squid marks that peer as down, so that all connections are not held up by one peer.

When Squid sends out a multicast query, it will wait at most mcast_icp_query_timeout seconds (it's perfectly possible that one day a peer will be on the moon: and it would probably be a bad idea to peer with that cache seriously, unless it was a parent for the Mars top-level domain.) It's unlikely that you will want to increase this value, but you may wish to drop it, so that only reasonably speedy replies are considered.
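The value is given in milliseconds; a line like the following (the usual default of two seconds) is a reasonable starting point:

mcast_icp_query_timeout 2000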

Cache Digests

Cache digests are one of the latest peering developments. Currently they are only supported by Squid, and they have to be turned on at compile time.

Squid keeps its "list" of objects in an in-memory hash. The hash table (which is based on MD5) helps Squid find out if an object is in the cache without using huge amounts of memory or reading files on disk. Periodically Squid takes this table of objects and summarizes it into a small bitmap (suitable for transfer across a modem). If a bit in the map is on, it means that the object is in the store; if it's off, the object is not. This bitmap/summary is available to other caches, which connect on the HTTP port and request a special URL. If the client cache (the one that just collected the bitmap) wants to know if the server has an object, it simply performs the same mathematical function that generated the values in the bitmap. If the server has the object, the appropriate bit in the bitmap will be set.

There are various advantages to this idea: if you have a set of loaded caches, you will find that inter-cache communication can use significant amounts of bandwidth. Each request to one cache sparks off a series of requests to all neighboring caches. Each of these queries also causes some server load: the networking stack has to deal with these extra packets, for one thing. With cache digests, however, load is reduced. The cache digest is generated only once every 10 minutes (the exact value is tunable). The transfer of the digest thus happens fairly seldom, even if the bitmap is rather large (a few hundred kilobytes is common.) If you were to run 10 caches on the same physical network, however, with each ICP request being a few hundred bytes, the numbers add up. This network load reduction can give your cache time to breathe too, since the kernel will not have to deal with as many small packets.

ICP packets are incredibly simple: they essentially contain only the requested URL. Today, however, a lot of data is transferred in the headers of a request. The contents of a static URL may differ depending on the browser that a user uses, cookie values and more. Since the ICP packet contains only the URL, Squid can check only the URL to see if it has the object, not both the headers and the URL. This can (very occasionally) cause strange problems, with the wrong pages being served. With cache digests, however, the bitmap value depends on both the headers AND the URL, which stops these strange hits of objects that are actually generated on-the-fly (normally these pages contain cgi-bin in their path, but some don't, and cause problems.)

Cache digests can generate a small percentage of false hits: since the list of objects is updated only every 10 minutes, your cache could expire an object a second after you download the summarized index. For the next ten minutes, the client cache would believe your server has data that it doesn't. Some five percent of hits may be false, but the objects are simply retrieved directly from the origin server if this turns out to be the case.

Cache Hierarchy Structures

Deciding what hierarchy structure to use is difficult. Not only that, but it's quite often very difficult to change, since you can have thousands of clients accessing you directly, and even more through your peers.

Here, I cover the most common cache-peer architectures. We start off with the simplest setup that's considered peering: two caches talking to one another as siblings. I am going to try and cover all the issues with this simple setup, and then move on to larger cache meshes.

Two Peering Caches

We have two caches: your cache and their cache. You have a nice fast link between them, with low latency; both caches have quite a lot of disk space, and they aren't going to be running into problems with load anytime soon. Let's look at the peering options you have:

Things to Watch Out For

The most common problem with this kind of setup is the "continuous object exchange". Let's say that your cache has an object. A user of their cache wants the object, so they retrieve it from you. A few hours later you expire the object. The next day, one of your users requests the object. Your cache checks with the other cache and finds that it does, indeed, have the object (your cache doesn't realize that the copy originally came from it). It retrieves the object. Later on the whole process repeats itself, with their cache getting the object from you again. To stop this from happening, you may have to use the proxy-only option on the cache_peer line on both caches. This way, the caches simply retrieve their data from the fast sibling cache each time: if that cache expires the object, the object cannot return from the other cache, since it was never saved there. With ICP, there is a chance that an object that is hit is dynamically generated (even if the path does not say so). Cache digests fix this problem, which may make their extra bandwidth usage worthwhile.

Trees

The traditional cache hierarchy structure involves lots of small servers (each with their own disk space, holding the most common objects), which query another set of large parent servers (there can even be only one large server.) These large servers then query the outside world on the client caches' behalf. The large servers keep a copy of the object so that other internal caches requesting the page get it from them. Generally, the little servers have a small amount of disk space, and are connected to the large servers by quite small lines.

This structure generally works well, as long as you can stop the top-level servers from becoming overloaded. If these machines have problems, all performance will suffer.

Client caches generally do not talk to one another at all. The parent cache server should have any object that the lower-down cache may have (since it fetched the object on behalf of the lower-down cache). It's invariably faster to communicate with the head-office (where the core servers would be situated) than another region (where another sibling cache is kept).

In this case, the smaller servers may as well treat the core servers as default parents, even using the no-query option, to reduce cache latency. If the head-office is unreachable it's quite likely that things may be unusable altogether (if, on the other hand, your regional offices have their own Internet lines, you can configure the cache as a normal parent: this way Squid will detect that the core servers are down, and try to go direct. If you each have your own Internet link, though, there may not be a reason to use a tree structure. You might want to look at the mesh section instead, which follows shortly.)

To avoid overloading one server, you can use the round-robin option on the cache_peer lines for each core server. This way, the load on each machine should be spread evenly.

Meshes

Large hierarchies generally use either a tree structure, or they are true meshes. A true mesh considers all machines equal: there is no set of large root machines, mainly since they are almost all large machines. Multicast ICP and Cache Digests allow large meshes to scale well, but some meshes have been around for a long time, and are only using vanilla ICP.

Cache digests seem to be the best choice for large mesh setups these days: they involve bulk data transfer, but as the average mesh size increases, machines have to be more and more powerful to deal with the number of queries coming in. Instead of trying to deal with so many small packets, it is almost certainly better to do a larger transfer every 10 minutes. This way, machines only have to check their local RAM to see which machines have the object.

Pure multicast cache meshes are another alternative: unfortunately there are still many reply packets generated, but it still effectively halves the number of packets flung around the network.

Load Balancing Servers

Sometimes a single server cannot handle the load required. DNS or CARP load balancing will allow you to split the load between two (or more) machines.

DNS load balancing is the simplest option: in your DNS zone file, you simply add two A records for the cache's hostname (you did use a hostname for the cache when you configured all those thousands of browsers like I told you, right?) The order in which the DNS server returns the addresses is continuously switched, and the client requesting the lookup will connect to a random server. These server machines can be set up to communicate with one another as peers. By using the proxy-only option, you reduce duplication of objects between the machines, saving disk space (and, hopefully, increasing your hit rate.)
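In a BIND-style zone file, the two A records might look like this (the hostname and addresses are placeholders):

cache.mydomain.example.    IN  A  10.0.0.20
cache.mydomain.example.    IN  A  10.0.0.21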

There are other load-balancing options. If you have client caches accessing the overloaded server (rather than client PCs), you can configure Squid on these machines with the round-robin option on the cache_peer lines. You could also use CARP (the Cache Array Routing Protocol) to split the load unevenly (if you have one very powerful machine and two less powerful machines, you can use CARP to load the fast cache twice as much as the other machines).

The Cache Array Routing Protocol (CARP)

CARP uses a hash function to decide which cache a request is to be forwarded to. When a request is to be sent out, the code takes the requested URL and feeds it through a formula that essentially generates a large number from the text of the URL. A different URL (even if it differs by only one character) is likely to end up as a very different number (it won't, for example, differ by one). If you take 50 URLs and put them through the function, the numbers generated are going to be spread far apart from one another, and would be spread evenly across a graph. The numbers generated, however, all fit within a certain range. Because the numbers are spread evenly across the range, we can split the range in two, and the same number of URLs will end up in the first half as in the second.

Let's say that we create a rule that effectively says: "I have two caches. Whenever I receive a request, I want to pass it to one of these caches. I know that any number generated by the hash function will be less than X, and that numbers are as likely to fall above one-half X as below. By sending all requests that hash to less than one-half X to cache one, and the remaining requests to cache two, the load should be even."

Now for some terminology: the total range of numbers is split into equally large sub-ranges, called buckets.

Let's say that we have two caches, again. This time, though, cache one is able to handle twice the load of cache two. If we split the hash space into three ranges, and allocate buckets one and three to cache one (and bucket two to cache two), a URL will have twice the chance of going to cache one as it does to cache two.

Squid caches can talk to parent caches using CARP balancing if CARP was enabled when the source was configured (using the ./configure --enable-carp command.)

The load-factor values on all cache_peer lines must add up to 1.0. The below example splits 70% of the load onto the machine bigcache.mydomain.example, leaving the other 30% up to the other cache.
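Assuming (for illustration) that the second machine is called smallcache.mydomain.example, and that the carp-load-factor option from this era of Squid is available, the cache_peer lines would look something like this:

cache_peer bigcache.mydomain.example parent 3128 3130 carp-load-factor=0.7
cache_peer smallcache.mydomain.example parent 3128 3130 carp-load-factor=0.3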

Now that your cache is integrated into a hierarchy (or is a hierarchy!), we can move to the next section. Accelerator mode allows your cache to function as a front-end for a real web server, speeding up web page access on those old servers.

Transparent caches effectively accelerate web servers from a distance (the code, at least, to perform both functions is effectively the same.) If you are going to do transparent proxying, I suggest that you read the next two Chapters. If you aren't interested in either of these Squid features, your Squid installation should be up and running. The remainder of the book (Section III) covers cache maintenance and debugging.


Accelerator Mode

Some cache servers can act as web servers (or vice versa). These servers accept requests in both the standard web-request format (where only the path and filename are given) and in the proxy-specific format (where the entire URL is given).

The Squid designers have decided not to let Squid be configured in this way. This avoids various complicated issues, and reduces code complexity, making Squid more reliable. All in all, Squid is a web cache, not a web server.

By adding a translation layer into Squid, we can accept (and understand) web requests, since the format is essentially the same. The additional layer can re-write incoming web requests, changing the destination server and port. This re-written request is then treated as a normal request: the remote server is contacted, the data requested and the results cached. This lets you get Squid to pretend to be a web server, re-writing requests so that they are passed on to some other web server.

When to use Accelerator Mode

Accelerator mode should not be enabled unless you need it. There are a limited set of circumstances in which it is needed, so if one of the following setups applies to you, you should have a look at the remainder of this chapter.

Acceleration of a slow server


Squid can sit in front of a slow server, caching the server's results and passing the data on to clients. This is very useful when the origin server (the server that is actually serving the original data) is very slow, or is across a slow line. If the origin server is across a slow line, you could just move the origin server closer to the clients, but this may not be possible for administrative reasons. Don't use Squid to cache a web server on the same machine just for speed, since a modern web server (Apache httpd, for example) is faster than Squid at serving static content in most cases.

Replacing a combination cache/web server with Squid

If you are in the process of replacing a combination cache/web server, your client machines may be configured to talk to the cache on port 80. Rather than reconfiguring all of your clients, you can get Squid to listen for incoming connections on port 80 (moving the real server to another port or machine.) When Squid finds that it has received a web request, it will forward the request to the origin server, so that the machine continues to function as both a web and cache server.

Transparent Caching/Proxy

Squid can be configured to magically intercept outgoing web requests and cache them. Since the outgoing requests are in web-server format, it needs to translate them to cache-format requests. Transparent caching is covered in detail in the following section.

Security

Squid can be placed in front of an insecure web server to protect it from the outside world: not merely to stop unwanted clients from accessing the machine, but also to stop people from exploiting bugs in the server code.

Accelerator Configuration Options

Please note that as of Squid 2.6 these options have been replaced with parameters that are given in the http_port section of the squid.conf file. Their use in Squid 2.6 has been deprecated; please read below for their replacements.

The list of accelerator options is short, and setup is fairly simple. Once we have a working accelerator cache, you will have to create the appropriate access-list rules. (Since you probably want people outside your local network to be able to access your server, you cannot simply use source-IP-address rulesets anymore.)

The httpd_accel_host option

You will need to set the hostname of the accelerated server here. It's only possible to have one destination server, so you can only have one occurrence of this line. If you are going to accelerate more than one server, or transparently cache traffic (as described in the next chapter), you will have to use the word virtual instead of a hostname here.

The httpd_accel_port option

Accelerated requests can only be forwarded to one port: there is no table that associates accelerated hosts with a destination port. Squid will connect to the port that you set the httpd_accel_port value to. When acting as a front-end for a web server on the local machine, you will set up the web server to listen for connections on a different port (8000, for example), and set this squid.conf option to match the same value. If, on the other hand, you are forwarding requests to a set of slow backend servers, they will almost certainly be listening on port 80 (the default web-server port), and this option will need to be set to 80.

The httpd_accel_with_proxy option

If you use the httpd_accel_host option, Squid will stop recognizing cache requests. So that your cache can function both as an accelerator and as a web cache, you will need to set the httpd_accel_with_proxy option to on.

The httpd_accel_uses_host_header option

A normal HTTP request consists of three values: the type of transfer (normally a GET, which is used for downloads); the path and filename to be retrieved (or executed, in the case of a cgi program); and the HTTP version.

This layout is fine if you only have one web site on a machine. On systems where you have more than one site, though, it makes life difficult: the request does not contain enough information, since it doesn't include information about the destination domain. Most operating systems allow you to have IP aliases, where you have more than one IP address per network card. By allocating one IP per hosted site, you could run one web server per IP address. Once the programs were made more efficient, one running program could act as a server for many sites: the only requirement was that you had one IP address per domain. Server programs would find out which of the IP addresses clients were connected to, and would serve data from different directories for each IP.

There are a limited number of IP addresses, and they are fast running out. Some systems also have a limited number of IP aliases, which means that you cannot host more than a (fairly arbitrary) number of web sites on a machine. If the client were to pass the destination host name along with the path and filename, the web server could listen to only one IP address, and would find the right destination directories by looking in a simple hostname table.

From version 1.1 on, the HTTP standard requires a special Host header, which is passed along with every outgoing request. This header also makes transparent caching and acceleration easier: by pulling the host value out of the headers, Squid can translate a standard HTTP request to a cache-specific HTTP request, which can then be handled by the standard Squid code. Turning on the httpd_accel_uses_host_header option enables this translation. You will need to use this option when doing transparent caching.

It's important to note that acls are checked before this translation. You must combine this option with strict source-address checks, so you cannot use this option to accelerate multiple backend servers (this is certain to change in a later version of Squid).

Option replacements in Squid 2.6

In this part of the documentation, only the option replacements and their correlation with the now-deprecated options are listed; please refer to the text above for what each option does. Please note that much of this is taken directly from the Squid 2.6 Changes document. Most setups migrating from 2.5 to 2.6 should probably rewrite this part of the config from scratch: http://wiki.squid-cache.org/SquidFaq/ReverseProxy

httpd_accel_* for transparent proxy


This has all been replaced by a parameter called transparent, which achieves the same thing; please refer to the section on Transparent Proxy/Cache for more information.

httpd_accel_host - Replaced by the defaultsite http_port option and the cache_peer originserver option.

httpd_accel_port - No longer needed. The server port is defined by the cache_peer port.

httpd_accel_uses_host_header - Replaced by the vhost http_port option.

For transparent operation in Squid 2.6, you'll need to locate the proper place in squid.conf and add the following two lines; they replace the four httpd_accel lines listed earlier:

http_port 3128 transparent
always_direct allow all

Configuration for acceleration of a backend host:

http_port 80 defaultsite=<backend ip>
acl port80 port 80
http_access allow port80
always_direct allow all

Related Configuration Options

So far, we have only covered the config options that directly relate to accelerator mode.

The redirect_rewrites_host_header option

By default Squid rewrites the Host: header in requests that a redirector has rewritten; if you are running Squid as an accelerator together with a redirector, you may want to turn this option off.

Refresh patterns

Accelerating a slow web server is only useful if the cache can keep copies of returned pages (so that it can avoid contacting the back-end server.) Since you know about the backend server, you can specify refresh patterns that suit the machine exactly. Refresh patterns aren't covered here (they are covered in-depth in Chapter 11), but it's worth looking at how your site changes, and tuning your refresh patterns to match.

If, on the other hand, you are simply using accelerator mode to replace a combination cache (or to act as a secure front-end for another server), you can disable caching of that site altogether: otherwise you simply end up duplicating data (once on the origin site, once for the cached copy) with no benefit.

Access Control

Presumably you will want people from outside your network to be able to access the web server that Squid is accelerating. If you have based your access lists on the examples in this book, you will find that machines on the outside cannot access the site being accelerated. The accelerated request is treated exactly like a normal http request, so people accessing the site from the outside world will be rejected, since your acl rules deny access from IPs that are not on your network. By using the dst acl type, you can add specific exclusions to your access lists to allow requests to the accelerated host.

In the following example, we have changed the config so that the first rule matches (and allows) any request to the machine at IP 10.0.0.5, the accelerated machine. If we did not have the port acl in the below rules, someone could request a URL that explicitly specifies a non-standard port, which could let a system cracker poke around the system with requests for things like http://server.mydomain.example:25.
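A sketch of the rules being described (10.0.0.5 as in the text; myNet stands for whatever acl already covers your internal network):

acl accel_host dst 10.0.0.5/255.255.255.255
acl port80 port 80
http_access allow accel_host port80
http_access allow myNet
http_access deny all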

Example Configurations

Let's cover two example setups: one where you are simply using Squid's accelerator function so that the machine has both a web server and a cache server on port 80; and one where you are using Squid as an accelerator to speed up a slow machine.

Accelerating Requests to a Slow Server

When accelerating a slow server, you may find that communicating with peer caches is faster than communicating with the accelerated host. In the following example, we remove all the options that stop Squid from caching the server's results. We also assume that the accelerated host is listening on port 80; since it is a separate machine, there is no conflict with Squid listening on the same port.
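A minimal sketch, assuming (for illustration) that the slow origin server is called slowserver.mydomain.example:

http_port 80
httpd_accel_host slowserver.mydomain.example
httpd_accel_port 80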

Once you have tested that connecting to Squid brings up the correct pages, you will have to change the DNS entry to point to your cache server.

Replacing a Combination Web/Cache server

First, let's cover the most common use of accelerator mode: replacing a combination web/cache server with Squid. When Squid is acting as an accelerator, it will accept requests on port 80 (on any IP address) and pass them to a web server on a different machine (also on port 80). Since it's unlikely that you want to use two machines where you can use one (unless you are changing to Squid due to server overload), we will need to configure Squid to pass requests to the local machine.

Squid will need to accept incoming requests on port 80 (using the http_port option), and pass the requests on to the web server on another port (since only one process can listen for requests on port 80 at a time). I normally get web servers to listen for requests on port 8000.
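With Apache, for example, that means changing its Listen directive (a sketch; the port number is simply the one suggested above, so adjust it to taste):

# httpd.conf: move the web server off port 80 so Squid can take it
Listen 8000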

Since you want Squid to function both as an accelerator and as a cache server, you will need to use the httpd_accel_with_proxy option.

The accelerated server in this example is the local machine: there is almost certainly no reason to cache results from this server. I could have used an extremely conservative refresh_pattern in the example below, but instead I decided to use the no_cache tag: this way I can make use of my predefined acl. The always_direct tag in the example below will be very useful if you have a peer cache: you don't want the request passed on to a peer machine.
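The config example referred to here is missing from this copy; a hedged reconstruction of what it might contain (the acl name localmachine and the addresses are assumptions):

http_port 80
# pass accepted requests to the web server on the same machine
httpd_accel_host localhost
httpd_accel_port 8000
httpd_accel_with_proxy on
# the predefined acl for the local machine
acl localmachine dst 127.0.0.1
# don't keep copies of replies from our own web server...
no_cache deny localmachine
# ...and never pass requests for it to a peer cache
always_direct allow localmachine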

Transparent Caching/Proxy


Transparent caching is the art of getting HTTP requests intercepted and processed by the proxy without any form of configuration in the browser (neither manual nor automatic configuration). This involves firewalling and routing rules to have packets with destination port 80 forwarded to the proxy port, and some Squid configuration to tell Squid that it is being called in this manner, which differs slightly from normal proxied requests.
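On Linux, the interception rule is typically a single iptables REDIRECT along these lines (a sketch; the interface eth0 and Squid port 3128 are assumptions to adjust for your network):

# redirect web traffic arriving on the LAN interface to Squid's port
iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 80 -j REDIRECT --to-port 3128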

Transparent Cache/Proxy with Squid versions prior to 2.6

Prior to Squid 2.6 there was no quick and direct method of enabling Squid to act as a transparent proxy. This has since changed in the latest stable version of Squid, and it is highly recommended that the latest stable version be used in preference to any previous edition, unless there is an overriding reason to use an older release of Squid.

In older versions of Squid, transparent proxying was almost a "hack", achieved through the use of the httpd_accel options. Transparent proxying can be achieved in these versions of Squid by appending/uncommenting the following four lines in the squid.conf file:

httpd_accel_host virtual
httpd_accel_port 80
httpd_accel_with_proxy on
httpd_accel_uses_host_header on

These four lines tell Squid to run as a transparent proxy. Below is a list of what each individual line achieves:

httpd_accel_host virtual - This tells the accelerator to work for any URL that it is given (the usual usage of the accelerator is to tell it which single host it must accelerate)

httpd_accel_port 80 - Tells Squid which port the origin web servers listen on. The accelerator is a very powerful tool and much of its usage is beyond the scope of this section; the only knowledge required here is that this setting ensures the transparent proxy fetches the websites we wish to browse via the correct HTTP port, where the standard is port 80.

httpd_accel_with_proxy on - By default, when Squid has its accelerator options enabled it stops being a cache server. To reinstate this (obviously important, as the whole purpose behind this configuration is a caching server) we turn the httpd_accel_with_proxy option on.

httpd_accel_uses_host_header on - In a nutshell, with this option turned on Squid reads the HTTP Host header to find out which website you are requesting (in an intercepted request, the URL alone no longer identifies the destination).

Transparent Cache/Proxy with Squid version 2.6 and beyond

In this version of Squid, transparent proxying has been given a dedicated parameter -- the transparent parameter -- which is given as an argument to the http_port tag within the squid.conf file, as the following example demonstrates:

http_port 192.168.0.1:3128 transparent

In this example, the IP address that Squid listens on is 192.168.0.1, using port number 3128, and your firewall rules are already set up to transparently intercept port 80 traffic and forward it to this port. The transparent option then informs Squid that this IP and port pair should be treated as a transparent proxy. This completes the configuration of Squid as a transparent proxy server (yes, that's right, all done! Apart, of course, from the ACL rules and generic settings that you should have set by now after reading the sections of this guide prior to this one).

Please note that to use this you will need to compile the necessary feature into your Squid binary; please read the information on transparent proxying in the Installing Squid section for more details. Do not be alarmed by a Squid binary recompile at this stage: Squid should not overwrite your edited squid.conf file, but make sure to back it up just in case! For a full solution for Squid > 2.6, including iptables, see this article: http://www.lesismore.co.za/squid3.html

Troubleshooting squid

/usr/sbin/squid -NCd1

will give you a running tail of the log as Squid handles requests: -N keeps Squid in the foreground, -C stops it catching fatal signals, and -d1 writes level-1 debug output to stderr. Alternatively, set the following in the squid.conf file:

debug_options ALL,1 33,2 28,9

and then

tail -f /path/to/your/logs/cache.log

gives similar output.

Wishlist

Items that people would like to have completed:


What to do after editing squid.conf (restart squid or re-read squid.conf?)
Squid optimization tips and performance issues
How to have a Time CAP
How to deny certain files (above a defined weight/size) for download
Troubleshooting squid
All about logging
All about Custom Error Messages (in ACL)
Some form of reporting to analyse what sites people are using
How to clear a specific file or URL in the cache BEFORE the default expiration
SquidGuard
All about authentication in squid
User-based URL and/or domain restriction in squidguard
Real-time monitoring per client