A reverse proxy can be used to secure and load balance Web servers. In this article we'll discuss the concept of a proxy server and reverse proxies, and how they can be used to provide better Web (HTTP) services.
In the classic sense, a proxy server is a server that sits between you and the Internet. If a Web browser is configured to do so, all requests will be made through the proxy, which in turn will apply filtering rules. The proxy will then request the site the user was trying to reach on his or her behalf, or more accurately: "by proxy," as the name implies.
A "transparent proxy" refers to a proxy server configured to serve requests without the client machine knowing about it. The drawback here is that a transparent proxy generally can't handle SSL traffic, since intercepting an encrypted connection would set off certificate warnings in the browser. On the bright side, users' browsers require no configuration for plain HTTP traffic. Transparent proxying is often combined with a caching proxy, which serves images and other large files from its cache rather than using Internet bandwidth to fetch them every time.
A reverse proxy, the main topic today, is one that sits between your Web server and the world. When an HTTP connection comes in, the reverse proxy decides what to do with it, and then makes a request to the appropriate backend Web server. Because every request passes through it, a reverse proxy is frequently tasked with several roles at once.
What it Does
A reverse proxy can be an SSL terminator. This means that the SSL certificates (and their keys) are installed on the proxy server, and the IP addresses for those sites are assigned to it as well. SSL is therefore terminated at the proxy, and the requests to the backend happen (generally) in plain text. This is usually OK, but if your internal network is insecure, the proxy can re-encrypt its requests to the backend, or the traffic can be sent through an encrypted tunnel.
This is as good a time as any to bring up "virtual hosts" and SSL. A name-based virtual host relies on knowing the site name used to connect, which is carried in the HTTP Host header. When an HTTP request is made, a Web server that supports virtual hosts will serve different content based on the site requested. Essentially, this means that you can point hundreds of domain names at the same IP address. SSL is a different story: the SSL certificate must match the name of the site the user is trying to access, but SSL negotiation happens before any HTTP data is passed, so the server traditionally has only one choice of certificate to present per IP address. If, after an SSL connection is negotiated, it turns out that the request was actually for a different site, your Web browser will scream at you about the mismatched certificate. If it didn't work this way then SSL would be pointless. Ergo, without the TLS Server Name Indication (SNI) extension, which lets the client name the site during the handshake and needs support in both the browser and the server, there is no such thing as a name-based virtual host with SSL.
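To make the name-based lookup concrete, here is a minimal Python sketch of how a server might pick a site from the Host header of a plain HTTP request. The site names, document roots, and function names are hypothetical, chosen only for illustration, and the parsing ignores corner cases such as IPv6 literals.

```python
# Hypothetical mapping of site names to document roots, all on one IP address.
SITES = {
    "www.example.com": "/var/www/example",
    "blog.example.org": "/var/www/blog",
}
DEFAULT_ROOT = "/var/www/default"  # assumed fallback when no site matches


def requested_host(raw_request: bytes) -> str:
    """Return the lowercased Host header value, without any port."""
    head = raw_request.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    for line in head.split("\r\n")[1:]:  # skip the request line
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "host":
            # Drop an optional ":port" suffix (IPv6 literals are not handled).
            return value.strip().split(":")[0].lower()
    return ""


def document_root(raw_request: bytes) -> str:
    """Choose which site's content to serve for this request."""
    return SITES.get(requested_host(raw_request), DEFAULT_ROOT)
```

Note that this lookup can only happen once the request headers are readable, which is exactly why a certificate chosen before that point cannot depend on them.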
A reverse proxy can also be a load balancer. Load balancing, in basic terms, works in one of two ways: either requests are distributed round-robin to a group of servers at the IP layer, or a proxy is used, which can make smarter decisions. The simplest way to spread sites across a group of servers is DNS round-robin: a hostname is given multiple DNS address records, so that each client connects to one server out of the group. Of course, this is a pain to manage with SSL sites. A router can also load-balance requests in a similar fashion, which requires that state be kept so that subsequent requests from the same client make it to the right server. Most devices that do this are simply going to act as a proxy, though. Using a proxy to load balance makes great sense, especially considering the other features it can provide.
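As a rough illustration of the proxy approach, the following Python sketch rotates through a list of backends and skips any that have been marked as down. The class and method names are made up for this example, and real balancers add health checks, weighting, and session stickiness that are not shown here.

```python
class RoundRobinBalancer:
    """Hand out backends in rotation, skipping those marked as down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._next = 0  # index of the next backend to try

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        # Try each backend at most once per call, starting where we left off.
        for _ in range(len(self.backends)):
            backend = self.backends[self._next]
            self._next = (self._next + 1) % len(self.backends)
            if backend in self.healthy:
                return backend
        raise RuntimeError("no healthy backends available")
```

A proxy would call pick() once per incoming request and mark_down() when a backend stops answering, so a single failed server simply drops out of the rotation.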
A reverse proxy can also act as a sort of application layer firewall for your Web servers. In two regards, actually: incoming requests are subject to the rules and policies defined in your proxy server's configuration, and your Web servers can be locked off from the world, so that attackers cannot reach them directly and every request has to pass the proxy's checks first.
A reverse proxy is often tasked with acting as a content filter too. This is closely related to the firewalling aspect, but it looks at what a request contains rather than where it comes from. Most proxy servers implement a mechanism to block certain keywords or content types. This can be another layer of protection that keeps exploit attempts from reaching your real servers.
Pretty much everything that is possible with a forward proxy can be accomplished with a reverse proxy server. A caching proxy, like Squid, can be used in conjunction with the reverse proxy in a variety of configurations. If the reverse proxy doesn't support caching, many sites opt to configure access to the backend servers through a caching proxy, so that images and other static content don't have to be retrieved from the real servers every time. Many reverse proxies can also farm out specific tasks, like serving images, to a completely separate server. Proxies used this way are often referred to as "Web accelerators."
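A keyword and content-type filter of this kind might look like the Python sketch below. The blocked lists and the is_allowed name are assumptions made for this example, not the rules of any particular product, and a naive substring match like this is easy to evade with encoding tricks, so treat it as an illustration rather than a defense.

```python
# Hypothetical rule lists; a real proxy would load these from its configuration.
BLOCKED_KEYWORDS = ("<script", "../", "union select")
BLOCKED_CONTENT_TYPES = {"application/x-msdownload"}


def is_allowed(path: str, content_type: str, body: str) -> bool:
    """Return False if the request matches a blocked content type or keyword."""
    # Compare only the media type, ignoring parameters such as charset.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in BLOCKED_CONTENT_TYPES:
        return False
    # Case-insensitive substring search over the URL path and the body.
    haystack = (path + "\n" + body).lower()
    return not any(keyword in haystack for keyword in BLOCKED_KEYWORDS)
```

A request that fails the check would be rejected at the proxy and never forwarded to a backend server.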
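The caching idea can be sketched in Python as a thin wrapper around whatever function fetches content from the real servers. The suffix list, the StaticCache name, and the fetch callback are all hypothetical, and the sketch leaves out expiry, size limits, and query strings, which a real caching proxy has to deal with.

```python
# Hypothetical list of file suffixes treated as static, cacheable content.
STATIC_SUFFIXES = (".png", ".jpg", ".gif", ".css", ".js")


class StaticCache:
    """Serve static paths from memory; pass everything else to the backend."""

    def __init__(self, fetch_from_backend):
        self._fetch = fetch_from_backend  # callable taking a path, returning bytes
        self._store = {}

    def get(self, path: str) -> bytes:
        if not path.lower().endswith(STATIC_SUFFIXES):
            return self._fetch(path)  # dynamic content is never cached
        if path not in self._store:
            self._store[path] = self._fetch(path)  # first request fills the cache
        return self._store[path]
```

With this in front of the backends, repeated requests for the same image cost the real servers only one fetch, while dynamic pages still reach them every time.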
What Does It
There are many proxy server products that will operate as a reverse proxy, but we'll just focus on a few free and open source ones. Apache 2.2 now comes with mod_proxy_balancer. Apache has supported reverse proxying for a long time with mod_proxy, but with the balancer module, Apache can now be used to configure much more complex and resilient setups. Configuration isn't simple, however, and Apache itself is comparatively resource intensive and memory hungry.
Pound is a reverse proxy and load balancer that terminates SSL connections, and it is easy to configure. An advantage over Apache is that Pound is very lightweight and carefully written. Many Pound users report high throughput, and mention that it has been reliable the entire time they have run it.
Next week we'll talk about Pound in great detail. We'll also walk through an example configuration of Pound that can be used to provide faster and more fault tolerant Web services. A single Web server failure will no longer be able to take down your site!