March 23, 2015

So You Got Yourself a Loadbalancer

When you put your web application behind a load balancer, or any type of reverse proxy (perhaps a web cache), you immediately need to take some important factors into consideration.

This article will cover those considerations, as well as discuss common solutions.

The Setup

When using a load balancer, you typically point all (or most) web traffic at the load balancer. The load balancer is then responsible for distributing that traffic across one or more configured web servers.

Load balancers aren't restricted to distributing only HTTP traffic, but that is one of the most common use cases and what we'll be covering here.

Asset Management

Using a load balancer implies that you have more than one web server processing requests. In this situation, how you manage your static assets (images, JS, CSS files) becomes important.

Static assets don't change - they are the same across servers. You can likely get away with having them copied on each web server behind a load balancer. The main worry here is making sure that static assets are successfully replicated and are the same on each server.

You might also benefit from using a CDN, which can serve and cache your static assets, reducing the load on your servers and loading assets faster for users geographically far from your web servers. Something like Cloudflare, MaxCDN or CloudFront might be appropriate.

If you accept user uploads (such as with a CMS), the uploaded files can't simply live on the web server they were uploaded to. When an uploaded jpg file lives on only one web server, a request for that image will result in a 404 response whenever the load balancer sends the request to a web server which does not have the image.

In this situation, the web servers need to have a central file store they can all access.

One way this is done is via a shared network drive (a NAS, for example). This, however, gets slow when there are many files or high levels of traffic. Furthermore, if your architecture is distributed across several data centers, a shared network drive can become too slow; your web servers would be too far away from it, and the network latency too high.

Central File Storage

A common (and better) solution is to host all static assets in a separate location, such as Amazon's S3.

Within Amazon, this can be taken a step further. An S3 bucket can be integrated with CloudFront, their CDN service. Your files can then be served via a true CDN.

For your static assets, you can use a build tool such as Grunt to automate these tasks. For example, you can have Grunt watch your files for changes; minify and concatenate CSS, JS and images; generate production-ready files; and then upload them to a location of your choice.

For user-uploaded content, you'll need to do some coding around sending uploaded files to S3 via AWS's API. This is actually pretty easy.

Uploading files to S3 using the AWS PHP SDK (shown here in a Laravel route, via the AWS facade):

// Send uploaded image to S3
Route::post('/upload', function()
{
    // Get Uploaded File
    $file = Input::file('file');

    // Create name for file
    $now = new DateTime;
    $hash = md5( $file->getClientOriginalName().$now->format('Y-m-d H:i:s') );
    $key = $hash.'.'.$file->getClientOriginalExtension();

    // Upload the file to the S3 bucket
    $s3 = AWS::get('s3');
    $s3->putObject(array(
        'Bucket'      => 'user_uploads_bucket',
        'Key'         => $key,
        'SourceFile'  => $file->getRealPath(),
        'ContentType' => $file->getClientMimeType(),
    ));

    // Probably store the name of the file in a database too...

    return Redirect::to('/profile');
});

Environment-Based URLs

One thing I do on projects is change the URL of assets based on the environment. Using a helper function of some sort, I'll have code output the development machine's URL to the HTML so the files are loaded locally during development.

In production, this helper can output URLs for your file-store or CDN of choice. Combined with some automation (Grunt), this gives you a fairly seamless workflow between development and production.

Here's a Gulp plugin that lets you upload files to S3.
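
As a rough sketch of such a helper (the CDN hostname and environment check below are assumptions - adapt them to your setup):

<?php

// Hypothetical environment-aware asset URL helper.
// In production, point at your CDN or file store; in development,
// serve files from the local machine.
function asset_url($path)
{
    $path = ltrim($path, '/');

    if (getenv('APP_ENV') === 'production') {
        // cdn.example.com is a placeholder for your CDN or CloudFront URL
        return 'https://cdn.example.com/'.$path;
    }

    return '/'.$path;
}

echo asset_url('css/app.css'); // "https://cdn.example.com/css/app.css" in production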

Sessions

Similar to the issue of asset management, how you handle sessions becomes an important consideration. Session information is often saved to a temporary location on a web server.

A user may log in, creating a session on one web server. On a subsequent request, however, the load balancer may bounce that user to another web server which doesn't have that session information! To the user, it appears that they are no longer logged in.

There are three common fixes for this.

Cookie-based Sessions

In this scenario, cookies are used to store the session data itself. The session data, such as the user's ID, is not saved on the server or in any other storage, but is instead carried within the browser's cookie.

This is limited by the amount of data a cookie can store. It's also easy to get wrong - the cookie's contents need to be encrypted and signed so that they can't be read or tampered with, even if the cookie is hijacked by a malicious user.
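
Most frameworks can do this for you. In a Laravel application, for example, it's just a matter of switching the session driver - a sketch mirroring the configuration shown later in this article (Laravel encrypts and signs the cookie payload for you):

<?php

// Session configuration for a Laravel application (sketch).
// The "cookie" driver stores session data client-side, encrypted.
return [
    'driver' => 'cookie',
];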

Sticky Sessions

Another solution is to use "sticky sessions", also called "session affinity". With this approach, the load balancer tracks which server a client was routed to and sends that client's subsequent requests to the same web server.

This lets the web server keep its default behavior of saving the session locally, leaving it up to the load balancer to get a client back to that server.

This is nice if you have a legacy application where changing the way sessions are handled is too difficult. However, it can skew how the workload is shared across your web servers.

You can see how to accomplish this in HAProxy (in the load balancing algorithms section of its documentation) and in Nginx (in its list of available algorithms).

HAProxy Session Affinity:

backend nodes
    # Other options above omitted for brevity
    cookie SRV_ID prefix
    server web01 127.0.0.1:9000 cookie web01 check
    server web02 127.0.0.1:9001 cookie web02 check
    server web03 127.0.0.1:9002 cookie web03 check

Nginx Session Affinity:

upstream app_example {
    ip_hash;                 
    server 127.0.0.1:9000;
    server 127.0.0.1:9001;
    server 127.0.0.1:9002;
}

Central Session Storage

The third fix is to use a shared session storage mechanism.

Session storage is typically centralized within an in-memory store such as Redis or Memcached. Persistent stores such as a database are also commonly used.

Since session data does not necessarily need the guaranteed persistence of a database, but may be heavily used, an in-memory data store's efficiency may be preferred.

In any case, using a cache (Redis, Memcached) for session storage lets all the web servers connect to a central session store. This grows your infrastructure a bit, but lets your workload be truly distributed across all web nodes.

Sample Laravel session configuration using Memcached:

<?php

// Session configuration for a Laravel application.
// Avoid the "file" driver in a load-balanced environment.
return [
    /*
    |--------------------------------------------------------------------------
    | Default Session Driver
    |--------------------------------------------------------------------------
    |
    | This option controls the default session "driver" that will be used on
    | requests. By default, we will use the lightweight native driver but
    | you may specify any of the other wonderful drivers provided here.
    |
    | Supported: "file", "cookie", "database", "apc",
    |            "memcached", "redis", "array"
    |
    */
    'driver' => 'memcached',
];

Lost Client Information

Closely related to the session issue is detecting who the client is. If the load balancer is a proxy to your web application, every request might appear to your application as if it came from the load balancer! Your application wouldn't be able to tell one client from another, other than by the cookie sent along in a browser-based request.

This may or may not be an issue for your application. However, if logs of HTTP requests are stored on your web nodes (rather than on your load balancer), you may lose important information needed when auditing those logs.

Luckily, most load balancers provide a mechanism for giving your web servers and application this information. If you inspect the headers of a request received from a load balancer, you might see these included:

  • X-Forwarded-For
  • X-Forwarded-Host
  • X-Forwarded-Proto / X-Forwarded-Scheme
  • X-Forwarded-Port

These headers can tell you (respectively) the client's IP address, the hostname used to access the application, the scheme used (http vs https) and the port the client made the request on. If these are present, your application should sniff these headers out and use them in place of the usual client information (to avoid treating every client as the load balancer itself).

//  JSON representation of headers sent from a load balancer
{"host":"example.com",
"cache-control":"max-age=0",
"accept":"text/html",
"accept-encoding":"gzip,deflate,sdch",
"x-forwarded-port":"80",                   // An x-forwarded-port header!
"x-forwarded-for":"172.17.42.1"}     // An x-forwarded-for header!

IP Address

Having an accurate IP address of a client is important. Web applications may use a user's IP address to help identify a client as part of the authentication process. Some applications use the client's IP address to perform functions such as rate limiting or other throttling techniques. Furthermore, having a client's IP address can help identify malicious traffic patterns when inspecting access logs.

The X-Forwarded-For header, which should contain the client's original IP address, can be used whenever it's present (assuming the source of the proxied request is trusted).
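
Here's a minimal sketch of that logic in plain PHP (the trusted proxy address is a placeholder for your load balancer's IP):

<?php

// Use X-Forwarded-For only when the direct peer is a trusted proxy.
function client_ip(array $server, array $trustedProxies = array('10.0.0.1'))
{
    $remote = $server['REMOTE_ADDR'];

    if (in_array($remote, $trustedProxies) && isset($server['HTTP_X_FORWARDED_FOR'])) {
        // X-Forwarded-For can hold a comma-separated chain of proxies;
        // the left-most address is the originating client
        $ips = explode(',', $server['HTTP_X_FORWARDED_FOR']);
        return trim($ips[0]);
    }

    return $remote;
}

echo client_ip($_SERVER);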

Host

If your site is accessed in a user's browser as "example.com", but your load balancer sends requests to your web nodes as "localhost:9000", then you may need to recover the correct hostname. This may be important for multi-tenancy applications, where a site's subdomain determines which organization a user is performing actions under.

I believe this use case is more rare - the hostname is likely passed through to the web server correctly.

Protocol/Scheme and Port

Knowing the protocol (http, https) and port used by the client is also important. If the client connects over SSL (with an https URL), that encrypted connection might end at the load balancer. The load balancer would then send a plain http request to the web servers.

Many frameworks attempt to guess the site address based on the request information. If your web application receives an http request on port 80, then any URLs it generates or redirects it sends will likely use that same protocol and port. This means a user might get redirected to a page with the wrong protocol or port!

Sniffing out the X-Forwarded-Proto and X-Forwarded-Port headers then becomes important so that the web application can generate correct URLs for redirects and for printing URLs within templates (think form actions, links to other pages, and links to static assets such as your JS, CSS and images).
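
As a rough sketch (again, only trusting these headers when the request comes from a known proxy), rebuilding the client-facing base URL might look like this:

<?php

// Rebuild the client-facing base URL from X-Forwarded-* headers,
// falling back to the request's own values when they're absent.
$scheme = isset($_SERVER['HTTP_X_FORWARDED_PROTO'])
    ? $_SERVER['HTTP_X_FORWARDED_PROTO']
    : 'http';

$port = isset($_SERVER['HTTP_X_FORWARDED_PORT'])
    ? (int) $_SERVER['HTTP_X_FORWARDED_PORT']
    : (int) $_SERVER['SERVER_PORT'];

$host = $_SERVER['SERVER_NAME'];

// Omit the port when it's the default for the scheme
$default = ($scheme === 'http' && $port === 80)
        || ($scheme === 'https' && $port === 443);

$baseUrl = $scheme.'://'.$host.($default ? '' : ':'.$port);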

Trusted Proxies

Many frameworks can handle this for you. For example, Symfony and frameworks built on Symfony's HTTP components have a means to incorporate X-Forwarded-* headers.

They ask you to configure a "trusted proxy". If the request comes from a proxy whose IP address is trusted, the framework will seek out and use the X-Forwarded-* headers in place of the usual mechanisms for gathering that information.
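
With Symfony's HttpFoundation, for example, that looks something like this (the proxy IP below is a placeholder for your load balancer's address):

<?php

use Symfony\Component\HttpFoundation\Request;

// Trust X-Forwarded-* headers, but only from this proxy's IP
Request::setTrustedProxies(array('10.0.0.1'));

$request = Request::createFromGlobals();

// These now return client-facing values, taken from the
// X-Forwarded-* headers when the request came from the trusted proxy
$clientIp = $request->getClientIp();
$scheme   = $request->getScheme(); // "https" if X-Forwarded-Proto says so
$port     = $request->getPort();   // honors X-Forwarded-Port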

This provides a very nice abstraction over this HTTP mechanism, allowing you to forget about this issue and keep on coding!

SSL Traffic

As noted above, in a load-balanced environment, SSL traffic is often decrypted at the load balancer. However, there are actually a few ways to handle SSL traffic when using a load balancer.

SSL Termination

When the load balancer is responsible for decrypting SSL traffic before passing the request on, it's referred to as "SSL Termination". In this scenario, the load balancer alleviates the web servers of the extra CPU cycles needed to decrypt SSL traffic. It also gives the load balancer the opportunity to append the X-Forwarded-* headers to the request before passing it onward.

The downside of SSL Termination is that the traffic between the load balancers and the web servers is not encrypted. This leaves the application open to possible man-in-the-middle attacks.

However, this risk is usually mitigated by the fact that the load balancers are often within the same infrastructure (data center) as the web servers. Someone would have to gain access to traffic between the load balancers and web servers from within the data center's internal network (possible, but less likely).

Amazon AWS load balancers also give you the option of using a (self-signed) SSL certificate between the load balancer and the web servers, giving you a secure connection all around. This, of course, means more CPU power being used, but if you need the extra security due to the nature of your application, this is a great option.

HAProxy SSL Termination:

frontend localhost
    bind *:80
    bind *:443 ssl crt /etc/ssl/xip.io/xip.io.pem
    mode http
    default_backend nodes

backend nodes
    mode http
    balance roundrobin
    option forwardfor
    option httpchk HEAD / HTTP/1.1\r\nHost:localhost
    server web01 172.17.0.3:9000 check
    server web02 172.17.0.3:9001 check
    server web03 172.17.0.3:9002 check
    http-request set-header X-Forwarded-Port %[dst_port]
    http-request add-header X-Forwarded-Proto https if { ssl_fc }

SSL Pass-Through

Alternatively, there is "SSL Pass-Through". In this scenario, the load balancer does not decrypt the request, but instead passes it through to a web server. The web server then must decrypt it.

This solution obviously costs the web servers more CPU cycles. You also often lose some extra functionality that load-balancing proxies can provide, such as DDoS protection. However, this option is often used when security is an important concern (although SSL Termination followed by re-encryption seems to be a good compromise).

SSL Pass-Through is only supported by load balancers that can balance traffic at the TCP level rather than the HTTP level, since the traffic is not decrypted at the load balancer and therefore cannot be inspected to determine what kind of traffic it is.

That rules out Nginx for SSL Pass-Through, but HAProxy will happily accomplish this for you!

HAProxy SSL Pass-Through:

frontend localhost
    bind *:80
    bind *:443
    option tcplog
    mode tcp
    default_backend nodes

backend nodes
    mode tcp
    balance roundrobin
    option ssl-hello-chk
    server web01 172.17.0.3:443 check
    server web02 172.17.0.4:443 check

Logs

So, now you have multiple web servers, but each one generates its own log files! Going through each server's logs is tedious and slow. Centralizing your logs can be very beneficial.

You may wish to collect logs only from the load balancer, skipping the web server logs. However, this ignores any logs your application itself generates.

The simplest way I've done this is to combine Logrotate's functionality with an upload to an S3 bucket. This at least puts all the log files in one place that you can look into.
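
A sketch of that setup, assuming the AWS CLI is installed on each web node (the bucket name is a placeholder):

# /etc/logrotate.d/nginx - rotate web logs, then ship them to S3
/var/log/nginx/*.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    sharedscripts
    postrotate
        # Tell Nginx to re-open its log files after rotation
        [ -f /var/run/nginx.pid ] && kill -USR1 $(cat /var/run/nginx.pid)
        # Ship compressed logs; prefix with the hostname so each
        # web node's logs stay distinguishable in the bucket
        aws s3 sync /var/log/nginx/ "s3://example-log-bucket/$(hostname)/" --exclude "*" --include "*.gz"
    endscript
}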

However, there are plenty of centralized logging servers that you can install in your infrastructure or purchase as a service. The SaaS offerings in this arena are often easily integrated, and usually provide extra services such as alerting, search and analysis.

Some popular self-install loggers:

Some popular SaaS loggers:
