Tuesday, March 31, 2009

Lusca snapshot released

I've just put up a snapshot of the version of lusca-head which is running on the cacheboy cdn. Head to http://code.google.com/p/lusca-cache/downloads/list .

Monday, March 30, 2009

Mirroring a new project - Cyberduck!

I've just started providing mirror download services for Cyberduck - a file transfer client supporting a wide variety of protocols, including the traditional (SFTP, FTP) and the newer (Amazon S3, WebDAV.) Cacheboy is listed as the primary download site on the main page.

Woo!

Mozilla 3.0.8 release!

The CDN handled the load with oodles to spare. The aggregate client traffic peak was about 650mbit across 5 major boxes. The boxes themselves peaked at about 160mbit each, depending upon the time of day (ie, whether Europe or the US was active.) None of the nodes were anywhere near maximum CPU utilisation.

About 2 and a half TB of mozilla updates a day are being shuffled out.

I'd like to try pushing a couple of the nodes up to 600mbit -each- but I don't have enough CDN nodes to guarantee the bits will keep flowing if said node fails. I'll just have to be patient and wait for a few more sponsors to step up and provide some hardware and bandwidth to the project.

So far so good - the bits are flowing, I'm able to use this to benchmark Lusca development and fix performance bottlenecks before they become serious (in this environment, at least) and things are growing at about the right rate for me to not need to panic. :)

My next major goal will be to finish off the BGP library and lookup daemon; flesh out some BGP related redirection map logic; and start investigating reporting for "services" on the box. Hm, I may have to write some nagios plugins after all..

Saturday, March 28, 2009

shortcomings in the async io code

Profiling the busy(ish) Lusca nodes during the Mozilla 3.0.8 release cycle has shown significant CPU wastage in memset() (ie, 0'ing memory) - via the aioRead and aioCheckCallbacks code paths.

The problem stems from the disk IO interface inherited from Squid. Squid has no explicit cancel-and-wait-for-cancel step in either the network or disk IO code, so the async disk IO read code allocates its own read buffer, reads into that, and then hands said buffer to the completion callback to copy the read data out of. If the request is cancelled while the worker thread is in the middle of a read(), it reads into its own buffer and not into a potentially free()'d buffer from the owner. It's a bit inefficient, but in the grand scheme of Squid CPU use it's not that big a waste on modern hardware.

In the short term, I'm going to re-jig the async IO code to not zero buffers that are involved in the aioRead() path. In the longer term, I'm not sure. I prefer cancels which may fail - ie, if an operation is in progress, let it complete; if not, return immediately. I'd like this for the network code too, so I can use async network IO threads for lower-copy network IO (eg FreeBSD and aio_read() / aio_write()); but there's a significant amount of existing code which assumes things can be cancelled immediately and assumes temporary copies of data are made everywhere. Sigh.
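As a rough illustration of the short-term fix (this is not the actual Lusca code - the names and structure here are made up), the idea is simply to skip the memset() for buffers that a read() is about to completely overwrite:

/*
 * Hypothetical sketch, not Lusca's actual internals: only zero buffers
 * whose contents might be inspected before being fully filled.  Buffers
 * handed straight to read() get overwritten anyway, so zeroing them first
 * is pure CPU waste.
 */
#include <stdlib.h>
#include <string.h>

enum aio_buf_use { AIO_BUF_READ, AIO_BUF_OTHER };

static void *
aio_buf_alloc(size_t len, enum aio_buf_use use)
{
    void *buf = malloc(len);

    if (buf == NULL)
        return NULL;

    /* read() fills the buffer up to the returned length; other users may
     * look at uninitialised bytes, so keep zeroing those. */
    if (use != AIO_BUF_READ)
        memset(buf, 0, len);

    return buf;
}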

Anyway - grr'ing aside, fixing the pointless zero'ing of buffers should drop the CPU use for large file operations reasonably noticeably - by at least 10% to 15%. I'm sure that'll be a benefit to someone.

Googletalk: "Getting C++ threads to work right"

I've been watching a few Google dev talks on Youtube. I thought I'd write up a summary of this one:

http://www.youtube.com/watch?v=mrvAqvtWYb4

In summary:
  • Writing "correct" threaded code using pthreads and raw CPU instructions (fencing, for example) requires the code to know what's going on under the hood;
  • Gluing concurrency onto the "side" of a language that was specified without concurrency has proven to be a bit of a problem - eg, concurrent access to different variables in a structure, and how various compilers have implemented this (eg, changing a byte in a struct turning into a 32 bit load, 8 bit modify, 32 bit store - see the sketch after this list);
  • Most programmers should really use higher level constructs, like what the C++0x and Java specification groups have been doing.
If you write threaded code or you're curious about it, you should watch this talk. It provides a very good overview of the problems and should open your mind up a little to what may go wrong..
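To make the struct example above concrete, here's a tiny pthreads sketch of the hazard the talk describes. It's purely illustrative: each thread touches only its own field, but on some targets (or with some older compilers) the byte assignment may be implemented as a wider load/modify/store of the containing word, silently clobbering the neighbouring field.

/*
 * Illustrative only: two threads write adjacent char fields in the same
 * struct.  Without a memory model that guarantees independent byte stores,
 * nothing promises both fields end up set.
 */
#include <pthread.h>
#include <stdio.h>

struct flags {
    char a;     /* written by thread_a */
    char b;     /* written by thread_b */
};

static struct flags f;

static void *thread_a(void *arg) { (void)arg; f.a = 1; return NULL; }
static void *thread_b(void *arg) { (void)arg; f.b = 1; return NULL; }

int
main(void)
{
    pthread_t ta, tb;

    pthread_create(&ta, NULL, thread_a, NULL);
    pthread_create(&tb, NULL, thread_b, NULL);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);

    printf("a=%d b=%d\n", f.a, f.b);
    return 0;
}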

Friday, March 27, 2009

Another open cdn project - mirrorbrain

I've been made aware of Mirrorbrain (http://mirrorbrain.org), another project working towards an open CDN framework. Mirrorbrain uses Apache as the web server and some apache module smarts to redirect users between mirrors.

I like it - I'm going to read through their released source and papers to see what clue can be cribbed from them - but they still base the CDN on an untrusted, third-party mirror network outside their control. I still think the path forward to an "open CDN" involves complete control right out to the mirror nodes and, in some places, the network which the mirror nodes live on.

There's a couple of shortcomings - most notably, their ASN implementation currently uses snapshots of the BGP network topology table rather than a live BGP feed distributed out to each mirror and DNS node. They also store central indexes of files and attempt to maintain maps of which mirror nodes have which updated versions of files, rather than building on top of perfectly good HTTP/1.1 caching semantics. I wonder why..

Monday, March 23, 2009

Example CDN stats!

Here's a snapshot of the global aggregate traffic level:

[aggregate traffic graph]

.. and the top 10 AS stats from last Sunday (UTC):

[top 10 AS traffic stats]

Sunday, March 22, 2009

wiki.cacheboy.net

I've setup http://wiki.cacheboy.net/, a simple mediawiki install which will serve as a place for me to braindump stuff into.

Thursday, March 19, 2009

More "Content Delivery" done open

Another network-savvy guy in Europe is doing something content-delivery related: http://www.as250.net/ .

AS250 is building a BGP anycast based platform for various 'open' content delivery and other applications. I plan on doing something similar (or maybe just partner with him, I'm not sure!) but anycast is only part of my overall solution space.

He's put up some slides from a presentation he did earlier in the year:

http://www.trex.fi/2009/as250-anycast-bgp.pdf


Filesystem Specifications, or EXT4 "Losing Data"

This is a bit off-topic for this blog, but the particular issue at hand bugs the heck out of me.

EXT4 "meets" the POSIX specifications for filesystems. The specification does not make any requirements for data to be written out in any order - and for very good reason. If the application developer -requires- data to be written out in order, they should serialise their operations through use of fsync(). If they do -not- require it, then the operating system should be free to optimise away the physical IO operations.

As a clueful(!) application developer, -I- appreciate being given the opportunity to provide this kind of feedback to the operating system. I don't want to be forced into only one behaviour - ordered writes everywhere, or reordering everywhere - I'd like to be able to use both where and when I choose.

Application developers - stop being stupid. Fix your applications. Read and understand the specification and what it provides -everyone- rather than just you.

Monday, March 16, 2009

Breaking 200mbit..

The CDN broke 200mbit at peak today - roughly half mozilla and half videolan.

200mbit is still tiny in the grand scheme of things, but it proves that things are working fine.

The next goal is to handle 500mbit average traffic during the day, and keep a very close eye on the overheads in doing so (specifically - making sure that things don't blow up when the number of concurrent clients grows.)

GeoIP backend, or "reinventing the wheel"

The first incarnation of the Cacheboy CDN uses 100% GeoIP to redirect users. This is roughly how it goes:

  1. Take a GeoIP map to break up IPs into "country" regions (thanks nerd.dk!);
  2. Take the list of "up" CDN nodes;
  3. For each country in my redirection table, find the CDN node that is up with the highest weight;
  4. Generate a "geo-map" file consisting of the highest-weight "up" CDN node for each country in "3";
  5. Feed that to the PowerDNS geoip module (thanks Mark @ Wikipedia!)
This really is a good place to start - it's simple, it's tested, and it provides me with some basic abilities for distributing traffic across multiple sites, to both speed up transfer times to end-users and make better use of the available bandwidth. The trouble is that it knows very little about the current state of the "internet" at any point in time. But, as I said, as a first (coarse!) step to get the CDN delivering bits, it worked out.
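Steps 3 and 4 above boil down to a trivial per-country selection. A hypothetical sketch (the node structure and names are made up; the real generator just does this for each country and emits a geo-map entry):

/*
 * For one country: walk the candidate CDN nodes and pick the
 * highest-weight node that is currently marked "up".
 */
#include <stddef.h>

struct cdn_node {
    const char *name;   /* eg "mirror1.cacheboy.net" (made-up name) */
    int         weight; /* higher is preferred */
    int         up;     /* 1 if the node is currently healthy */
};

static const struct cdn_node *
pick_node(const struct cdn_node *nodes, size_t n)
{
    const struct cdn_node *best = NULL;
    size_t i;

    for (i = 0; i < n; i++) {
        if (!nodes[i].up)
            continue;
        if (best == NULL || nodes[i].weight > best->weight)
            best = &nodes[i];
    }
    return best;    /* NULL means "no node up", ie fall back to a default */
}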

My next step is to build a much more easily "hackable" backend which I can start adding functionality to. I've reimplemented the geoip backend in Perl and glued it to the "pipe-backend" module in PowerDNS. This simply passes DNS requests to an external process which spits back DNS replies. The trouble is that multiple backend processes will be invoked whether you want them to be or not. This means that I can't simply load large databases into the backend process, as it'll take time to load, waste RAM, and generally make things scale poorly.
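The actual backend is Perl, but purely to illustrate the pipe-backend protocol, here's a bare-bones version in C. It reads tab-separated queries on stdin and writes answers on stdout; the field layout is from my reading of the PowerDNS docs (treat it as an approximation), and the CNAME answer is a hard-coded placeholder where the real backend consults the geo maps and geoipd.

/*
 * Skeleton PowerDNS pipe-backend (ABI version 1, roughly).
 */
#include <stdio.h>
#include <string.h>

int
main(void)
{
    char line[1024];

    setvbuf(stdout, NULL, _IOLBF, 0);   /* line-buffered replies */

    while (fgets(line, sizeof(line), stdin) != NULL) {
        char qname[256], qclass[16], qtype[16], id[32], remote[64];

        if (strncmp(line, "HELO", 4) == 0) {
            printf("OK\tcacheboy geo backend\n");
            continue;
        }

        /* Q<TAB>qname<TAB>qclass<TAB>qtype<TAB>id<TAB>remote-ip */
        if (sscanf(line, "Q\t%255s\t%15s\t%15s\t%31s\t%63s",
            qname, qclass, qtype, id, remote) == 5) {
            /* Real logic: look up 'remote' via geoipd, map the country to
             * the best CDN node, answer with that node.  The target name
             * below is a placeholder. */
            printf("DATA\t%s\t%s\tCNAME\t60\t%s\tmirror1.cacheboy.net\n",
                qname, qclass, id);
        }
        printf("END\n");
    }
    return 0;
}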

So I broke out the first memory hungry bit - the geoip lookup - and stuffed it into a small C daemon. All the daemon does is take a client IP and answer with the geoip information for that IP. It will periodically check and reload the GeoIP database file in the background if it's changed - maintaining whatever request rate I'm throwing at it rather than pausing for a few seconds whilst things are loaded in.
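Something along these lines - a minimal sketch only, with a made-up port number and the database lookup and reload stubbed out; the real daemon obviously does the actual GeoIP lookup and the background reload:

/*
 * Sketch of a "geoipd"-style daemon: receive an IP address as a text
 * string over UDP, reply with a country code.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

#define GEOIPD_PORT 8053        /* made-up port number */

/* Stub: the real version consults the in-memory GeoIP table, which a
 * background check reloads when the database file changes. */
static const char *
lookup_country(const char *ip)
{
    (void)ip;
    return "XX";
}

int
main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in sin, peer;
    socklen_t peerlen;
    char req[128];
    ssize_t len;

    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_ANY);
    sin.sin_port = htons(GEOIPD_PORT);

    if (fd < 0 || bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
        perror("geoipd");
        return 1;
    }

    for (;;) {
        peerlen = sizeof(peer);
        len = recvfrom(fd, req, sizeof(req) - 1, 0,
            (struct sockaddr *)&peer, &peerlen);
        if (len <= 0)
            continue;
        req[len] = '\0';

        const char *cc = lookup_country(req);
        sendto(fd, cc, strlen(cc), 0, (struct sockaddr *)&peer, peerlen);
    }
}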

I can then use the "geoip daemon" (let's call it "geoipd") from the PowerDNS pipe-backend process I'm writing. All that process has to do at the moment is load the geo maps (which are small) and reload them as required. It sends all geoip requests to the geoipd and uses the reply. If there is a problem talking to the geoipd, the backend process will simply fall back to a weighted round robin of well-connected servers as a last resort.

The aim is to build a flexible backend framework for processing redirection requests which can be used by a variety of applications. For example, when it's time for the CDN proxy nodes to also do 302 redirections to "closer" nodes, I can simply reuse a large part of the modular libraries already written. When I integrate BGP information into the DNS infrastructure, I can reuse all of those libraries in the CDN proxy redirection logic, or the webserver URL rewriting logic, or anywhere else it's needed.

The next step? Figuring out how to load balance traffic destined to the same AS / GeoIP region across multiple CDN end nodes. This should let me scale the CDN up to a gigabit of aggregate traffic given the kind of sponsored boxes I'm currently receiving. More to come..

Friday, March 13, 2009

Downtime!

The CDN had a bit of downtime tonight. It went like this:

  • The first mirror threw a disk;
  • For some reason, gmirror became unhappy rather than carrying on with the second disk in the mirror set (I'm guessing the controller went unhappy; there wasn't anything logged to indicate the other disk in the set was failing);
  • The second mirror master started taking the load;
  • For some weird reason, the second mirror master then hung hard, without any logging to explain why.
I've replaced the disks in mirror1 and it's slowly rebuilding the content. It probably won't finish resync'ing the (new) mirror set until tomorrow. Hopefully mirror-2 will stay just as stable as it currently is.

The CDN ended up still serving content whilst the masters were down - the edge nodes just couldn't fetch uncached content. So it wasn't a -total- loss.

This just highlights that I really do require another mirror master or two located elsewhere. :)

Monday, March 9, 2009

minimising traffic to the backends..

Squid/Cacheboy/Lusca has a nifty feature where it'll "piggyback" a client connection on an existing backend connection if the backend response is cachable AND said response is valid for the client (ie, it's the right variant, doesn't require revalidation at that point, etc.)

I've been using this for the videolan and mozilla downloads. Basically, one client will suck down the whole object, and any other clients which want the same object (say, the 3.0.7 US English win32 update!) will share the same connection.

There's a few problems which have crept up.

Firstly - the "collapsed forwarding" support is not working in this instance. It was only written for small objects (delaying forwarding of the request until the forwarded response was known to be cachable), and I think the logic is broken for large objects: all of the concurrent range requests cause it to deny cachability of the response (well, to force it to be RELEASEd after it finishes transferring).

Secondly - Squid/Cacheboy/Lusca doesn't handle range request caching. It'll -serve- range responses for objects it has the data for, but it won't cache partial responses nor will it reassemble them into one chunk. I've been thinking about how to possibly fix that, but for now I'm hacking around the problems with some scripts.

Finally - the forwarding logic uses the speed of the -slowest- client to determine how quickly to download the file. This needs to be changed to use the speed of the -fastest- client to determine how quickly to download said file.
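The fix is conceptually simple: drive the upstream read-ahead off the furthest-ahead client rather than the furthest-behind one. A hypothetical sketch (none of these names are Lusca's actual internals, and the window size is made up):

/*
 * Given the byte offsets each attached client has consumed so far, decide
 * how far to read from the origin.  Tracking the minimum offset throttles
 * the download to the slowest client; tracking the maximum lets the
 * fastest client drive it.
 */
#include <stddef.h>

#define READ_AHEAD_GAP  (64 * 1024)     /* made-up window size */

static size_t
forward_read_target(const size_t *client_offsets, size_t nclients)
{
    size_t max_offset = 0;
    size_t i;

    for (i = 0; i < nclients; i++) {
        if (client_offsets[i] > max_offset)
            max_offset = client_offsets[i];
    }

    /* Read from the origin up to a window past the fastest client,
     * rather than stalling at the slowest client's offset. */
    return max_offset + READ_AHEAD_GAP;
}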

I need to get these fixed before the next mozilla release cycle if I'm to have a chance of increasing the traffic levels to a gigabit and beyond.

More to come..

Cacheboy Outage

There was a brief outage earlier tonight due to some troubles with the transit provider of one of my sponsors. They're sponsoring the (only) pair of mirror master servers at the moment.

This will be (somewhat) mitigated when I bring up another set of mirror master servers elsewhere.

Sunday, March 8, 2009

Cacheboy is pushing bits..

The Cacheboy CDN is currently pushing about 1.2 TB a day (~ 100mbit average) of mozilla and videolan downloads out of 5 separate CDN locations. A couple of the servers are running LUSCA_HEAD and they seem to handle the traffic just fine.

The server assignment is currently being done through GeoIP mapping via DNS. I've brought up BGP sessions to each of the sites to eventually use in the request forwarding process.

All in all, things are going reasonably successfully so far. There have been a few hiccups which I'll blog about over the next few days, but the bits are flowing and no one is complaining. :)