Tuesday, November 10, 2009

More issues with Lighttpd

So occasionally Lighttpd on FreeBSD-7.x+ZFS gets all upset. I -think- what's going on is that I somehow hit mbuf exhaustion when ZFS starts taking a long time to complete IO requests; all socket IO in Lighttpd then fails until it is restarted.

More investigation is required. Well, more statistics are needed so I can make better judgements. Actually, what's really needed is more functional backends, so I can take one out of production when something like this occurs, properly debug what is going on, and try to fix it.

Cacheboy Update / October/November 2009

Howdy,

Just a few updates this time around!
  • Cacheboy was pushing around 800-1200mbit during the Firefox 3.5.4 release cycle. I started to hit issues with the backend server not keeping up with revalidating requests and so I'll have to improve the edge caching logic a little more.
  • Lusca seems quite happy serving up 300-400mbit from a single node though; which is a big plus.
  • I've found some quite horrible memory leaks in Quagga on only one of the edge nodes. I'll have to find some time to login and debug this a little more.
  • The second backend server is now officially toast. I now need another 1RU server with two SATA slots to magically appear in downtown Manhattan, NY.

Thursday, October 8, 2009

Cacheboy downtime - hardware failures

Howdy,

I've had both backend servers fail today. One is throwing undervolt errors on one PSU line and is having disk issues (most likely related to an undervoltage); the other is just crashed.

I'm waiting for remote hands to prod the other box into life.

This is why I'd like some more donated equipment and hosting - I can make things much more fault tolerant. Hint hint.

Wednesday, September 30, 2009

Lusca updates - September 2009

Just a few Lusca related updates!

  • All of the Cacheboy CDN nodes are running Lusca-HEAD now and are nice and stable.
  • I've deployed Lusca at a few customer sites and again, it is nice and stable.
  • The rebuild logic changes are, for the most part, nice and stable. There seems to be some weirdness with 32 vs 64 bit compilation options which I need to suss out but everything "just works" if you compile Lusca with large file/large cache file support regardless of the platform you're using. I may make that the default option.
  • I've got a couple of small coding projects to introduce a couple of small new features to Lusca - more on those when they're done!
  • Finally, I'm going to be migrating some more of the internal code over to use the sqinet_t type in preparation for IPv4/IPv6 agnostic support.
Stay Tuned!

Monday, September 21, 2009

My current wishlist

I'm going to put this on the website at some point, but I'm currently chasing a few things for Cacheboy:

  • More US nodes. I'll take anything from 50mbit to 5gbit at this point. I need more US nodes to be able to handle enough aggregate traffic to make optimising the CDN content selection methods worthwhile.
  • Some donations to cover my upcoming APNIC membership for ASN and IPv4/IPv6 space. This will run to about AUD $3500 this year and then around AUD $2500 a year after that.
  • Some 1ru/2ru server hardware in the San Francisco area
  • Another site or two willing to run a relatively low bandwidth "master" mirror site. I have one site in New York but I'd prefer to run a couple of others spread around Europe and the United States.
I'm sure more will come to mind as I build things out a little more.

New project - sugar labs!

I've just put the finishing touches on the basic sugar labs software repository. I'll hopefully be serving part or all of their software downloads shortly.

Sugar is the software behind the OLPC environment. It works on normal Intel-based PCs as far as I can tell. More information can be found at http://www.sugarlabs.org/

Monday, August 31, 2009

Cacheboy presentation at AUSNOG

I've just presented on Cacheboy at AUSNOG in Sydney. The feedback so far has been reasonably positive.

There's more information available at http://www.creative.net.au/talks/.

Monday, August 17, 2009

Cacheboy status update

So by and large, the pushing of bits is working quite well. I have a bunch of things to tidy up and a DNS backend to rewrite in C or C++ but that won't stop the bits from being pushed.

Unfortunately what I'm now lacking is US hosts to send traffic from. I still have more Europe and Asian connectivity than North American - and North America is absolutely where I need connectivity the most. Right now I'm only able to push 350-450 megabits of content from North America - and this puts a big, big limit on how much content I can serve overall.

Please contact me as soon as possible if you're interested in hosting a node in North America. I ideally need enough nodes to push between a gigabit and ten gigabits of traffic.

I will be able to start pushing noticable amounts of content out of regional areas once I've sorted out North America. This includes places like Australia, Africa, South America and Eastern Europe. I'd love to be pushing more open source bits out of those locations to keep the transit use low but I just can't do so at the moment.

Canada node online and pushing bits!

The Canada/TORIX node is online thanks to John Nistor at prioritycolo in Toronto, Canada.

Thanks John!

Cacheboy is on WAIX!

Yesterday's traffic from mirror1.au into WAIX:
ASN      MBytes    Requests  % of overall  AS name
AS7545   17946.77  7437      29.85         TPG-INTERNET-AP TPG Internet Pty Ltd
AS4802   12973.47  4476      21.58         ASN-IINET iiNet Limited
AS4739   8497.92   2947      14.13         CIX-ADELAIDE-AS Internode Systems Pty Ltd
AS9543   2524.57   1241      4.20          WESTNET-AS-AP Westnet Internet Services
AS4854   2097.32   941       3.49          NETSPACE-AS-AP Netspace Online Systems
AS17746  1881.17   1050      3.13          ORCONINTERNET-NZ-AP Orcon Internet
AS9822   1425.44   456       2.37          AMNET-AU-AP Amnet IT Services Pty Ltd
AS17435  1161.01   411       1.93          WXC-AS-NZ WorldxChange Communications LTD
AS9443   1140.62   701       1.90          INTERNETPRIMUS-AS-AP Primus Telecommunications
AS7657   891.93    1187      1.48          VODAFONE-NZ-NGN-AS Vodafone NZ Ltd.
AS7718   740.74    272       1.23          TRANSACT-SDN-AS TransACT IP Service Provider
AS7543   732.11    423       1.22          PI-AU Pacific Internet (Australia) Pty Ltd
AS24313  527.38    252       0.88          NSW-DET-AS NSW Department of Education and Training
AS9790   436.80    389       0.73          CALLPLUS-NZ-AP CallPlus Services Limited
AS17412  365.13    228       0.61          WOOSHWIRELESSNZ Woosh Wireless
AS17486  349.27    116       0.58          SWIFTEL1-AP People Telecom Pty. Ltd.
AS17808  311.65    248       0.52          VODAFONE-NZ-AP AS number for Vodafone NZ IP Networks
AS24093  303.40    114       0.50          BIGAIR-AP BIGAIR. Multihoming ASN
AS9889   288.85    197       0.48          MAXNET-NZ-AP Auckland
AS17705  282.49    84        0.47          INSPIRENET-AS-AP InSPire Net Ltd

Query content served: 54878.07 mbytes; 23170 requests.
Total content served: 60123.25 mbytes; 28037 requests.

BGP aware DNS

I've just written up the first "test" hack of BGP aware DNS.

The basic logic is simple but evil. I'm simply mapping BGP next-hop to a set of weighted servers. A server is then randomly chosen from this pool.
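
As a rough illustration, the selection step might look something like this (a sketch with made-up structures, not the actual Cacheboy code; assume a BGP listener has already resolved the client's next-hop to this candidate pool):

#include <stdlib.h>

struct cdn_server {
    const char *name;   /* eg "mirror1.au" */
    int weight;         /* relative share of requests */
    int up;             /* current health check state */
};

/* Randomly pick an "up" server from the pool, biased by weight. */
const struct cdn_server *
pool_pick(const struct cdn_server *pool, int n)
{
    int i, total = 0, r;

    for (i = 0; i < n; i++)
        if (pool[i].up)
            total += pool[i].weight;
    if (total == 0)
        return NULL;            /* caller falls back to the default geo map */
    r = rand() % total;
    for (i = 0; i < n; i++) {
        if (!pool[i].up)
            continue;
        if (r < pool[i].weight)
            return &pool[i];
        r -= pool[i].weight;
    }
    return NULL;                /* not reached */
}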

I'm not doing this for -all- prefixes and POPs - it is only being used for two specific POPs where there is a lot of peering and almost no transit. There are a few issues regarding split horizon BGP/DNS and request routing which I'd like to fully sort out before I enable it for everything. I don't want a quirk to temporarily redirect -all- requests to -one- server cluster!

In any case, the test is working well. I'm serving ~10mbit to WAIX (Western Australia) and ~ 30mbit to TORIX (Toronto, Canada.)

All of the DNS based redirection caveats apply - most certainly that not all client requests to the caches will also be over peering. I'll have to craft some method(s) of tracking this.

Sunday, August 9, 2009

Updates - or why I've not been doing very much

G'day! Cacheboy has been running on autopilot for the last couple of months whilst I've been focusing on paid work and growing my little company. So far (mostly) so good there.

The main issue scaling traffic has been the range request handling in Squid/Lusca, so I've been working on fixing things up "just enough" to make it work in the firefox update environment. I think I've finally figured it out - and figured out the bugs in the range request handling in Squid too! - so I'll push out some updates to the network next week and throw it some more traffic.

I really am hoping to ramp traffic up past the gigabit mark once this is done. We'll just have to see!

Wednesday, July 8, 2009

VLC 1.0 released

VLC-1.0 has been released. The CDN is pushing out between 550 and 700mbit of VLC downloads. I'm sure it can do more but as I'm busy working elsewhere, I'm going to be overly conservative and leave the mirror weighting where it is.

Graphs to follow!

Monday, June 29, 2009

Current Downtime/issues

There's a current issue with content not being served correctly. It stemmed from a ZFS related panic on one of the backend servers (note to self - update to the very latest FreeBSD-7-stable code; these are all fixed!) which then came up with lighttpd but no ZFS mounts. Lighttpd then started returning 404's.

I'm now watching the backend(s) throw random connection failures and the Lusca caches then cache an error rather than the object.

I've fixed the backend giving trouble so it won't start up in that failed mode again, and I've set the negative caching TTL on the Lusca cache nodes to 30 seconds instead of the default 5 minutes. Hopefully the traffic levels will now pick back up to where they're supposed to be.

EDIT: The problem is again related to the Firefox range requests and Squid/Lusca's inability to cache range request fragments.

The backend failure(s) removed the objects from the cache. The problem now is that the objects aren't re-entering the cache because they are all range requests.

I'm going to wind down the Firefox content serving for now until I get some time to hack up Lusca "enough" to cache the range request objects. I may just do something dodgy with the URL rewriter to force a full object request to occur in the background. Hm, actually..

Saturday, June 27, 2009

New mirror node - italy

I've just turned on a new mirror node in Italy thanks to New Media Labs. They've provided some transit services and (I believe) 100mbit access to the local internet exchange.

Thanks guys!

Wednesday, June 17, 2009

And the GeoIP summary..

And the geoip summary:

From Sun Jun 7 00:00:00 2009 to Sun Jun 14 00:00:00 2009

Country  MBytes      Requests
us       5163783.09  6533162
de       1514664.22  2307222
ca       1152095.00  917777
fr       948433.27   1451105
uk       945640.71   1136455
it       818161.03   770164
br       542497.79   1426306
se       482932.15   229559
es       445444.34   647321
pl       397755.30   1021083
nl       373185.13   306023
ru       368124.64   749924
tr       293627.27   484965
mx       276775.12   463252
be       249088.62   213460
ch       201782.33   209530
ro       190059.45   274216
fi       172399.75   204630
ar       170421.77   374071
no       169351.46   155258

Tuesday, June 16, 2009

A quick snapshot of Cacheboy destinations..

The following is a snapshot of the per destination AS traffic information I'm keeping.


If you're peering with any of these ASes and are willing to sponsor a cacheboy node or two then please let me know. How well I can scale things at this point is rapidly becoming limited to where I can push traffic from, rather than anything intrinsic to the software.


From Sun Jun 7 00:00:00 2009 to Sun Jun 14 00:00:00 2009

ASN      MBytes     Requests  % of overall  AS name
AS3320   602465.01  1021975   3.26          DTAG Deutsche Telekom AG
AS7132   583164.05  778259    3.16          SBIS-AS - AT&T Internet Services
AS19262  459322.30  603127    2.49          VZGNI-TRANSIT - Verizon Internet Services Inc.
AS3215   330962.95  553299    1.79          AS3215 France Telecom - Orange
AS3269   317534.06  333114    1.72          ASN-IBSNAZ TELECOM ITALIA
AS9121   259768.32  434932    1.41          TTNET TTnet Autonomous System
AS22773  244573.65  283427    1.32          ASN-CXA-ALL-CCI-22773-RDC - Cox Communications Inc.
AS12322  224708.25  343686    1.22          PROXAD AS for Proxad/Free ISP
AS3352   206093.84  305183    1.12          TELEFONICADATA-ESPANA Internet Access Network of TDE
AS812    204120.74  166633    1.10          ROGERS-CABLE - Rogers Cable Communications Inc.
AS8151   198918.22  328632    1.08          Uninet S.A. de C.V.
AS6327   197906.53  152861    1.07          SHAW - Shaw Communications Inc.
AS3209   191429.18  303787    1.04          ARCOR-AS Arcor IP-Network
AS20115  182407.09  225151    0.99          CHARTER-NET-HKY-NC - Charter Communications
AS2119   181719.20  117656    0.98          TELENOR-NEXTEL T.net
AS577    181167.02  152383    0.98          BACOM - Bell Canada
AS12874  172973.42  108429    0.94          FASTWEB Fastweb Autonomous System
AS6389   165445.73  236133    0.90          BELLSOUTH-NET-BLK - BellSouth.net Inc.
AS6128   165183.07  210300    0.89          CABLE-NET-1 - Cablevision Systems Corp.
AS2856   164332.96  219267    0.89          BT-UK-AS BTnet UK Regional network

Query content served: 5234195.61 mbytes; 6878234 requests (ie, what was displayed in the table.)

Total content served: 18473721.25 mbytes; 26272660 requests (ie, the total amount of content served over the time period.)

Saturday, June 13, 2009

Seeking a few more US / Canada hosts

G'day everyone!

I'm now actively looking for some more Cacheboy CDN nodes in the United States and Canada. I've got around 3gbit of available bandwidth in Europe, 1gbit of available bandwidth in Japan but only 300mbit of available bandwidth in North America.

I'd really, really appreciate a couple of well-connected North American nodes so I can properly test the platform and software that I'm building. The majority of traffic is still North American in destination; I'm having to serve a fraction of it from Sweden and the United Kingdom at the moment. Erk.

Please drop me a line if you're interested. The node requirements are at http://www.cacheboy.net/node_requirements.html . Thank you!

Friday, June 12, 2009

Another day, another firefox release done..

The June Firefox 3.0.11 release rush is all but over and Cacheboy worked without much of a problem.

The changes I've made to the Lusca load shedding code (ie, being able to disable it :) work well for this workload. Migrating the backend to lighttpd (and fixing the ETag generation to be properly consistent between 32 bit and 64 bit platforms) fixed the initial issues I was seeing.

The network pushed out around 850mbit at peak. Not a lot (heck, I can do that on one CPU of a mid-range server without a problem!) but it was a good enough test to show that things are working.

I need to teach Lusca a couple of new tricks, namely:

  • It needs to be taught to download at the fastest client speed, not the slowest; and
  • Some better range request caching needs to be added.

The former isn't too difficult - that's a weekend five line patch. The latter is more difficult. I don't really want to shoehorn range request caching into the current storage layer. It would look a lot like how Vary and ETag are currently handled (ie, with "magical" store entries acting as indexes to the real backend objects.) I'd rather put in a dirtier hack that is easy to undo now, and use the opportunity to tidy up the whole storage layer properly. But the "tidying up" rant is not for this blog entry; it's for the Lusca development blog.

The hack will most likely be a little logic to start downloading the full object when the first range request for an uncached object comes in - so subsequent range requests for that object will be "glued" to the current request. It means that subsequent requests will "stall" until enough of the object has transferred to start satisfying their range request. The alternative is to pass each range request through to a backend until the full object is transferred. That would improve initial performance, but at some point the backend could be overloaded with too many range requests for highly popular objects, and that starts affecting how fast full objects are transferred.
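
For what it's worth, here's the shape of that gluing logic (hypothetical names throughout - a sketch of the idea, not Lusca code):

struct store_entry;                                     /* cached or in-flight object */
struct store_entry *store_lookup(const char *url);
struct store_entry *start_full_fetch(const char *url);  /* fetch the whole object, ignoring Range */
void attach_client_range(struct store_entry *e, void *client, long lo, long hi);

/*
 * The first range request for an uncached object kicks off ONE
 * full-object download; this and all subsequent range requests are
 * attached to it and satisfied as the data arrives.
 */
void
handle_range_request(const char *url, void *client, long lo, long hi)
{
    struct store_entry *e = store_lookup(url);

    if (e == NULL)
        e = start_full_fetch(url);
    attach_client_range(e, client, lo, hi);
}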

As a side note, I should probably do up some math on a whiteboard here and see if I can model some of the potential behaviour(s). It would certainly be a good excuse to brush up on higher math clue. Hm..!

Thursday, June 11, 2009

Migrating to Lighttpd on the backend, and why aren't my files being cached..

I migrated away from apache-1.3 to Lighttpd-1.4.19 to handle the load better. Apache-1.3 handles lots of concurrent disk IO on large files fine but it bites for lots of concurrent network connections.

In theory, once all of the caching stuff is fixed, the backends will spend most of their time revalidating objects.

But for some weird reason I'm seeing TCP_REFRESH_MISS on my Lusca edge nodes, and generally poor performance during this release. I look at the logs and find this:

[Host: mozilla.cdn.cacheboy.net\r\n
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10\r\n
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n
Accept-Language: en-us,en;q=0.5\r\n
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\n
If-Modified-Since: Wed, 03 Jun 2009 15:09:39 GMT\r\n
If-None-Match: "1721454571"\r\n
Cache-Control: max-stale=0\r\n
Connection: Keep-Alive\r\n
Pragma: no-cache\r\n
X-BlueCoat-Via: 24C3C50D45B23509\r\n]

[HTTP/1.0 200 OK\r\n
Content-Type: application/octet-stream\r\n
Accept-Ranges: bytes\r\n
ETag: "1687308715"\r\n
Last-Modified: Wed, 03 Jun 2009 15:09:39 GMT\r\n
Content-Length: 2178196\r\n
Date: Fri, 12 Jun 2009 04:25:40 GMT\r\n
Server: lighttpd/1.4.19\r\n
X-Cache: MISS from mirror1.jp.cacheboy.net\r\n
Via: 1.0 mirror1.jp.cacheboy.net:80 (Lusca/LUSCA_HEAD)\r\n
Connection: keep-alive\r\n\r]


Notice the different ETags? Hm! I wonder what's going on. On a hunch I checked the ETags from both backends. For that object, master1 gives "1721454571"; master2 gives "1687308715". They both have the same size and the same timestamp. I wonder what is different?

Time to go digging into the depths of the lighttpd code.

EDIT: the ETag generation is configurable. By default it uses the mtime, inode and file size. Disabling inode, and then inode and mtime, didn't help. I then found that earlier lighttpd versions generate different ETags on 32 bit and 64 bit platforms. I'll build a local lighttpd package and see if I can replicate the behaviour on my 32/64 bit systems. Grr.

Meanwhile, Cacheboy isn't really serving any of the mozilla updates. :(

EDIT: so it turns out the bug is in the ETag generation code. It creates an unsigned 32-bit integer hash value from the ETag contents, then shovels it into a signed long for the ETag header. Unfortunately on FreeBSD-i386, "long" is a signed 32 bit type, and thus things go awry from time to time. Grrrrrr.
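
A tiny standalone demonstration of that class of bug (just an illustration of the behaviour described above, not lighttpd's actual code):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    uint32_t hash = 3221225473u;  /* an unsigned 32-bit hash with the top bit set */
    long etag = (long) hash;      /* i386: long is 32 bits -> -1073741823
                                   * amd64: long is 64 bits -> 3221225473 */
    printf("ETag: \"%ld\"\n", etag);
    return 0;
}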

EDIT: fixed in a newly-built local lighttpd package; both backend servers are now doing the right thing. I'm going back to serving content.

Tuesday, June 2, 2009

New mirrors - mirror2.uk and mirror3.uk

I've just had two new sponsors show up with a pair of UK mirrors.

mirror2.uk is thanks to UK Broadband, who have graciously given me access to a few hundred megabits of traffic and space on an ESX server.

mirror3.uk (due to be turned up today!) is thanks to a private donor named Alex who has given me a server in his colocation space and up to a gigabit of traffic.

Shiny! Thanks to you both.

Wednesday, April 22, 2009

More traffic!

970mbit and rising...!

mirror1.jp.cacheboy.net - mozilla!

The nice folk at mozilla.org have provided me with a .jp CDN node. I'm now serving a good stack of bits from it into Australia, Malaysia, India, Japan, China, Korea and the Philippines.

Thanks guys!

Mozilla 3.0.9 release..

The mozilla release (3.0.9) is currently going on. The traffic levels are ramping up now to the release peak.

880mbit/sec and counting..

Monday, April 6, 2009

Lusca and Cacheboy improvements in the pipeline..

After profiling Lusca-HEAD rather extensively on the CDN nodes, I've discovered that the largest CPU "use" on the core 2 duo class boxes is memcpy(). On the ia64-2 node memcpy() shows up much lower down in the list. I'm sure this has to do with the differing FSB and general memory bus bandwidth available on the two architectures.

I'm planning out the changes to the store client needed to support fully copy-free async read and write. This should reduce the CPU overhead on core 2 duo class machines to the point where Lusca should break GigE throughput on this workload without too much CPU use. (I'm sure it could break GigE throughput right now on this workload though.)

I'll code this all up during the week and build a simulated testing rig at home "pretending" to be a whole lot of clients downloading partial bits of mozilla/firefox updates, complete with a random packetloss, latency and abort probability.

I also plan on finally releasing the bulk of the Cacheboy CDN software (hackish as it is!) during the week, right after I finally remove the last few bits of hard-coded configuration locations. :) I still haven't finished merging in the bits of code which do the health check, calculate the current probabilities to assign each host and then write out the geoip map files. I'll try to sort that out over the next few days and get a public subversion repository with the software online.

By the way, I plan on releasing the Cacheboy CDN software under the Affero GPL (AGPL) licence.

Tuesday, March 31, 2009

Lusca snapshot released

I've just put up a snapshot of the version of lusca-head which is running on the cacheboy cdn. Head to http://code.google.com/p/lusca-cache/downloads/list .

Monday, March 30, 2009

Mirroring a new project - Cyberduck!

I've just started providing mirror download services for Cyberduck - a file manager for a wide variety of platforms including the traditional (SFTP, FTP) and the new (Amazon/S3, WebDAV.) Cacheboy is listed as the primary download site on the main page.

Woo!

Mozilla 3.0.8 release!

The CDN handled the load with oodles to spare. The aggregate client traffic peak was about 650mbit across 5 major boxes. The boxes themselves peaked at about 160mbit each, depending upon the time of day (ie, whether Europe or the US was active.) None of the nodes were anywhere near maximum CPU utilisation.

About 2 and a half TB of mozilla updates a day are being shuffled out.

I'd like to try pushing a couple of the nodes up to 600mbit -each- but I don't have enough CDN nodes to guarantee the bits will keep flowing if said node fails. I'll just have to be patient and wait for a few more sponsors to step up and provide some hardware and bandwidth to the project.

So far so good - the bits are flowing, I'm able to use this to benchmark Lusca development and fix performance bottlenecks before they become serious (in this environment, at least) and things are growing at about the right rate for me to not need to panic. :)

My next major goal will be to finish off the BGP library and lookup daemon; flesh out some BGP related redirection map logic; and start investigating reporting for "services" on the box. Hm, I may have to write some nagios plugins after all..

Saturday, March 28, 2009

shortcomings in the async io code

Profiling the busy(ish) Lusca nodes during the Mozilla 3.0.8 release cycle has shown significant CPU wastage in memset() (ie, 0'ing memory) - via the aioRead and aioCheckCallbacks code paths.

The problem stems from the disk IO interface inherited from Squid. In Squid there's no explicit cancel-and-wait-for-cancel in either the network or disk IO code, so the async disk IO read code actually allocates its own read buffer, reads into that, and then hands the buffer to the completion callback to copy the data out of. If the request is cancelled while a worker thread is mid-read(), the thread reads into its own buffer rather than into a potentially free()'d buffer belonging to the caller. It's a bit inefficient, but in the grand scheme of Squid CPU use it's not that big a waste on modern hardware.

In the short term, I'm going to re-jig the async IO code to not zero buffers that are involved in the aioRead() path. In the longer term, I'm not sure. I prefer cancels which may fail - ie, if an operation is in progress, let it complete; if not, return immediately. I'd like this for the network code too, so I can use async network IO threads for lower-copy network IO (eg FreeBSD and aio_read() / aio_write()); but there's a significant amount of existing code which assumes things can be cancelled immediately, and which assumes temporary copies of data are made everywhere. Sigh.
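
The short-term fix boils down to something like this (hypothetical helper names; the point is simply that the buffer is about to be completely overwritten by the worker thread's read, so pre-zeroing it is wasted work):

#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

char *
aio_read_buf_alloc(size_t len)
{
    /* was effectively calloc(1, len) - the memset() that shows up in profiles */
    return malloc(len);
}

ssize_t
aio_do_read(int fd, char *buf, size_t len, off_t offset)
{
    /* the read overwrites buf[0..n-1] anyway, so zeroed memory is never seen */
    return pread(fd, buf, len, offset);
}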

Anyway - grr'ing aside, fixing the pointless zeroing of buffers should drop the CPU use for large file operations reasonably noticeably - by at least 10% to 15%. I'm sure that'll be a benefit to someone.

Googletalk: "Getting C++ threads to work right"

I've been watching a few Google dev talks on Youtube. I thought I'd write up a summary of this one:

http://www.youtube.com/watch?v=mrvAqvtWYb4

In summary:
  • Writing "correct" thread code using the pthreads and CPU instructions (fencing, for example) requires the code to know whats going on under the hood;
  • Gluing concurrency to the "side" of a language which was specified without concurrency has shown to be a bit of a problem - eg, concurrent access to different variables in a structure and how various compilers have implemented this (eg, changing a byte in a struct becoming a 32 bit load, 8 bit modify, 32 bit store);
  • Most programmers should really use higher level constructs, like what C++0x and what the Java specification groups have been doing.
If you write threaded code or you're curious about it, you should watch this talk. It provides a very good overview of the problems and should open your mind up a little to what may go wrong..

Friday, March 27, 2009

Another open cdn project - mirrorbrain

I've been made aware of Mirrorbrain (http://mirrorbrain.org), another project working towards an open CDN framework. Mirrorbrain uses Apache as the web server and some apache module smarts to redirect users between mirrors.

I like it - I'm going to read through their released source and papers to see what clue can be cribbed from them - but they still base the CDN on an untrusted, third-party mirror network out of their control. I still think the path forward to an "open CDN" involves complete control right out to the mirror nodes and, in some places, the network which the mirror nodes live on.

There's a couple of shortcomings - most notably, their ASN implementation currently uses snapshots of the BGP network topology table rather than a live BGP feed distributed out to each mirror and DNS node. They also store central indexes of files and attempt to maintain maps of which mirror nodes have which updated versions of files, rather than building on top of perfectly good HTTP/1.1 caching semantics. I wonder why..

Monday, March 23, 2009

Example CDN stats!

Here's a snapshot of the global aggregate traffic level:

[aggregate traffic graph]

.. and top 10 AS stats from last Sunday (UTC):

[top 10 AS graph]

Sunday, March 22, 2009

wiki.cacheboy.net

I've setup http://wiki.cacheboy.net/, a simple mediawiki install which will serve as a place for me to braindump stuff into.

Thursday, March 19, 2009

More "Content Delivery" done open

Another network-savvy guy in Europe is doing something content-delivery related: http://www.as250.net/ .

AS250 is building a BGP anycast based platform for various 'open' content delivery and other applications. I plan on doing something similar (or maybe just partnering with him, I'm not sure!) but anycast is only part of my overall solution space.

He's put up some slides from a presentation he did earlier in the year:

http://www.trex.fi/2009/as250-anycast-bgp.pdf


Filesystem Specifications, or EXT4 "Losing Data"

This is a bit off-topic for this blog, but the particular issue at hand bugs the heck out of me.

EXT4 "meets" the POSIX specifications for filesystems. The specification does not make any requirements for data to be written out in any order - and for very good reason. If the application developer -requires- data to be written out in order, they should serialise their operations through use of fsync(). If they do -not- require it, then the operating system should be free to optimise away the physical IO operations.

As a clueful(!) application developer, -I- appreciate being given the opportunity to provide this kind of feedback to the operating system. I don't want one or the other. I'd like to be able to use both where and when I choose.

Application developers - stop being stupid. Fix your applications. Read and understand the specification and what it provides -everyone- rather than just you.

Monday, March 16, 2009

Breaking 200mbit..

The CDN broke 200mbit at peak today - roughly half mozilla and half videolan.

200mbit is still tiny in the grand scheme of things, but it proves that things are working fine.

The next goal is to handle 500mbit average traffic during the day, and keep a very close eye on the overheads in doing so (specifically - making sure that things don't blow up when the number of concurrent clients grows.)

GeoIP backend, or "reinventing the wheel"

The first incantation of the Cacheboy CDN uses 100% GeoIP to redirect users. This is roughly how it goes:

  1. Take a GeoIP map to break up IPs into "country" regions (thanks nerd.dk!) ;
  2. Take the list of "up" CDN nodes;
  3. For each country in my redirection table, find the CDN node that is up with the highest weight;
  4. Generate a "geo-map" file consisting of the highest-weight "up" CDN node for each country in "3";
  5. Feed that to the PowerDNS geoip module (thanks Mark @ Wikipedia!)
This really is a good place to start - it's simple, it's tested, and it provides me with some basic ability to distribute traffic across multiple sites, to both speed up transfer times to end-users and better use the bandwidth available. The trouble is that it knows very little about the current state of the "internet" at any point in time. But, as I said, as a first (coarse!) step to get the CDN delivering bits, it worked out.
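
The per-country selection in step 3 is about as simple as it sounds - something like this (a sketch, not the actual generator code):

struct node {
    const char *name;
    int weight;
    int up;
};

/* Return the highest-weight "up" node from a country's candidate list. */
const struct node *
pick_for_country(const struct node *nodes, int n)
{
    const struct node *best = NULL;
    int i;

    for (i = 0; i < n; i++) {
        if (!nodes[i].up)
            continue;
        if (best == NULL || nodes[i].weight > best->weight)
            best = &nodes[i];
    }
    return best;    /* NULL means "fall back to a default well-connected node" */
}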

My next step is to build a much more easily "hackable" backend which I can start adding functionality to. I've reimplemented the geoip backend in Perl and glued it to the "pipe-backend" module in PowerDNS, which simply passes DNS requests to an external process that spits back DNS replies. The trouble is that multiple backend processes are spawned whether you want them or not. This means I can't simply load large databases into the backend process - it would take time to load, waste RAM, and generally make things scale (less) well.

So I broke out the first memory hungry bit - the "geoip" lookup - and stuffed it into a small C daemon. All the daemon does is take a client IP and answer with the geoip information for that IP. It periodically checks and reloads the GeoIP database file in the background if it has changed - maintaining whatever request rate I'm throwing at it, rather than pausing for a few seconds whilst things load.
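
The request loop in such a daemon is tiny - roughly this shape (the wire format and names here are mine, purely illustrative): one datagram in with an IP address, one datagram back with the country code.

#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

const char *geoip_lookup(const char *ip);   /* searches the current in-memory database */

void
geoipd_loop(int fd)
{
    char req[64], reply[8];
    struct sockaddr_in from;
    socklen_t fromlen;
    ssize_t n;

    for (;;) {
        fromlen = sizeof(from);
        n = recvfrom(fd, req, sizeof(req) - 1, 0,
            (struct sockaddr *) &from, &fromlen);
        if (n <= 0)
            continue;
        req[n] = '\0';                          /* eg "203.0.113.45" */
        const char *cc = geoip_lookup(req);     /* eg "au"; NULL if unknown */
        snprintf(reply, sizeof(reply), "%s", cc ? cc : "--");
        sendto(fd, reply, strlen(reply), 0,
            (struct sockaddr *) &from, fromlen);
    }
}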

I can then use the "geoip daemon" (let's call it "geoipd") from the PowerDNS pipe-backend process I'm writing. All this process has to do at the moment is load the geo maps (which are small) and reload them as required. It sends all geoip requests to geoipd and uses the reply. If there is a problem talking to geoipd, the backend process simply falls back to a weighted round robin of well-connected servers.

The aim is to build a flexible backend framework for processing redirection requests which can be used by a variety of applications. For example, when it's time for the CDN proxy nodes to also do 302 redirections to "closer" nodes, I can simply reuse a large part of the modular libraries already written. When I integrate BGP information into the DNS infrastructure, I can reuse those libraries in the CDN proxy redirection logic, the webserver URL rewriting logic, or anywhere else they're needed.

The next step? Figuring out how to load balance traffic destined to the same AS / GeoIP region across multiple CDN end nodes. This should let me scale the CDN up to a gigabit of aggregate traffic given the kind of sponsored boxes I'm currently receiving. More to come..

Friday, March 13, 2009

Downtime!

The CDN had a bit of downtime tonight. It went like this:

  • The first mirror server threw a disk;
  • For some reason gmirror became unhappy, rather than carrying on with the second disk (I'm guessing the controller went unhappy; nothing was logged to indicate the other disk in the mirror set was failing);
  • The second mirror server started taking the load;
  • For some weird reason, the second mirror server then hung hard, without any logging to explain why.
I've replaced the disks in mirror1 and it's slowly rebuilding the content. It probably won't have finished resyncing the (new) mirror set until tomorrow. Hopefully mirror2 will stay just as stable as it currently is.

The CDN ended up still serving content whilst the masters were down - they just couldn't download uncached content. So it wasn't a -total- loss.

This just highlights that I really do require another mirror master or two located elsewhere. :)

Monday, March 9, 2009

minimising traffic to the backends..

Squid/Cacheboy/Lusca has a nifty feature where it'll "piggyback" a client connection on an existing backend connection if the backend response is cachable AND said response is valid for the client (ie, it's the right variant, doesn't require revalidation at that point, etc.)

I've been using this for the videolan and mozilla downloads. Basically, one client will suck down the whole object, and any other clients which want the same object (say, the 3.0.7 US english win32 update!) will share the same connection.

There's a few problems which have crept up.

Firstly - the "collapsed forwarding" support is not working in this instance. I think the logic is broken with large objects (it was only written for small objects, delaying forwarding the request until the forwarded response was known cachable) where it denies cachability of the response (well, it forces it to be RELEASEd after it finishes transferring) because of all of the concurrent range requests going on.

Secondly - Squid/Cacheboy/Lusca doesn't handle range request caching. It'll -serve- range responses for objects it has the data for, but it won't cache partial responses nor will it reassemble them into one chunk. I've been thinking about how to possibly fix that, but for now I'm hacking around the problems with some scripts.

Finally - the forwarding logic uses the speed of the -slowest- client to determine how quickly to download the file. This needs to be changed to use the speed of the -fastest- client instead.
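
The gist of that change, sketched with made-up names (the real logic lives in the store client / server-side read scheduling):

struct client {
    long copy_offset;        /* how far this client has been sent */
    struct client *next;
};

/* How far past the object's current size the server-side read may go. */
long
read_ahead_target(const struct client *clients, long window)
{
    long max_offset = 0;
    const struct client *c;

    for (c = clients; c != NULL; c = c->next)
        if (c->copy_offset > max_offset)
            max_offset = c->copy_offset;    /* was: the minimum, ie the slowest client */
    return max_offset + window;             /* let the fastest client set the pace */
}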

I need to get these fixed before the next mozilla release cycle if I'm to have a chance of increasing the traffic levels to a gigabit and beyond.

More to come..

Cacheboy Outage

There was a brief outage earlier tonight due to some troubles with the transit provider of one of my sponsors. They're sponsoring the (only) pair of mirror master servers at the moment.

This will be (somewhat) mitigated when I bring up another set of mirror master servers elsewhere.

Sunday, March 8, 2009

Cacheboy is pushing bits..

The Cacheboy CDN is currently pushing about 1.2 TB a day (~ 100mbit average) of mozilla and videolan downloads out of 5 separate CDN locations. A couple of the servers are running LUSCA_HEAD and they seem to handle the traffic just fine.

The server assignment is currently being done through GeoIP mapping via DNS. I've brought up BGP sessions to each of the sites to eventually use in the request forwarding process.

All in all, things are going reasonably well so far. There've been a few hiccups which I'll blog about over the next few days, but the bits are flowing and no one is complaining. :)

Friday, February 27, 2009

Cacheboy CDN is online!

There's been a few changes!

* The "Cacheboy proxy" development has become Lusca; thats spun off into a little separate project of its own.
* The "Cacheboy" project is now focusing on providing an open source platform for content delivery. I've organised some donated hardware (some donated by me), some donated bandwidth (again, some donated by me) and a couple of test projects to serve content for.

More details to come!

(As a side note, I've got too many blogs; I think it's time to rationalise them down to one or two and use labels to correctly identify which is which.)

Monday, February 23, 2009

Lusca and BGP, take 2.

I've ironed out the crash kinks (the rest of the "kinks" are in the BGP FSM implementation); thus I'm left with:

1235459412.856 17063 118.92.109.x TCP_REFRESH_HIT/206 33405 GET http://videolan.cdn.cacheboy.net/vlc/0.9.8a/win32/vlc-0.9.8a-win32.exe - NONE/- application/x-msdownload AS7657
1235459417.194 1113 202.150.98.x TCP_HIT/200 45637 GET http://videolan.cdn.cacheboy.net/vlc/0.9.8a/win32/vlc-0.9.8a-win32.exe - NONE/- application/x-msdownload AS17746

Notice how the Squid logs have AS numbers in them? :)

Lusca and BGP

I've been fleshing out some very, very basic BGP support in a lusca-head branch. I'm only using the BGP information right now for logging but I'll eventually use it as part of the request and reply processing.

It *cough* mostly works. I need to figure out why there's occasional radix tree corruption (which probably means running it under valgrind to find when the radix code goes off the map..) and un-dirty some of the BGP code (ie, implement a real FSM; proper separation of the protocol handling, FSM, network and RIB code) and add in the AS path/community/attribute stuff before I commit it to LUSCA_HEAD.

It is kind of cool though having a live BGP feed in your application. :) All 280,000 odd routes of it. :)

Sunday, February 1, 2009

Lusca development, and changes to string handling

I've just renamed Cacheboy to "Lusca". I've had a few potential users comment that "Cacheboy" isn't uhm, "management compatible", so the project has been renamed to try and bring some of these users on board. I'm also hoping to make Lusca less Adrian-focused and involve more of the community. We'll see how that goes.

In terms of development, I've shifted the code to http://code.google.com/p/lusca-cache/ and I'm continuing my work in /branches/LUSCA_HEAD.

I've been working on src/http.c (the server-side HTTP code) in preparation for introducing reference counted buffer/string handling. I removed one copy (of the socket read buffer into another memory buffer, to assemble a buffer containing the HTTP reply, in preparation for parsing) and have just migrated that bit of the codebase over to use my reference counted buffer (buf_t; found in libmem/buf.[ch].) It's entirely possible that I've horribly broken the server-side code so I'm reluctant to do much else until I've finished restructuring and testing the server-side HTTP code.

I've also been tidying up a few more places where the current String API is used "incorrectly", at least incorrectly for reference counted strings/buffers. I have ~ 61 code chunks to rewrite, mostly in the logging code. I've done it twice already in other branches, so this won't be terribly difficult. It's just boring. :)

Oh, and I've also just removed the "caching" bits of the MemPools code. MemPools in LUSCA_HEAD is now just a small wrapper around malloc/calloc/free, mainly to preserve the "block allocator" style API and keep some statistics. At the end of the day, Squid uses memory very very poorly and the caching code in MemPools is purely to avoid said poor memory use. I'm going to just fix the memory use (mostly revolving around String buffers, HTTP headers and the TLV code, amazing that!) so the number of calls through the allocator is much, much reduced. I'm guessing once I've finished, the number of calls through the system allocator will be about 2 or 3% of what they are now. That should drop the CPU use quite a bit.
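
Conceptually the wrapper is now about this thin (field and function names approximate; the real code keeps more statistics):

#include <stdlib.h>

typedef struct {
    const char *label;      /* eg "HttpHeaderEntry" */
    size_t obj_size;
    long alloc_count;       /* the statistics stay; the free-object cache is gone */
    long in_use;
} MemPool;

void *
memPoolAlloc(MemPool *p)
{
    p->alloc_count++;
    p->in_use++;
    return calloc(1, p->obj_size);   /* straight through to the system allocator */
}

void
memPoolFree(MemPool *p, void *obj)
{
    p->in_use--;
    free(obj);
}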

Ah, now to find testers..

Tuesday, January 20, 2009

Where the CPU is going

Oprofile is fun.

So, let's find all of the time spent in cacheboy-head, per-symbol, with cumulative time, but only showing symbols taking 1% or more of CPU:


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -la -t 1 ./squid
CPU: PIII, speed 634.485 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples cum. samples % cum. % image name symbol name
2100394 2100394 6.9315 6.9315 libc-2.3.6.so memcpy
674036 2774430 2.2244 9.1558 libc-2.3.6.so vfprintf
657729 3432159 2.1706 11.3264 squid memPoolAlloc
463901 3896060 1.5309 12.8573 libc-2.3.6.so _int_malloc
453978 4350038 1.4982 14.3555 libc-2.3.6.so strncasecmp
442439 4792477 1.4601 15.8156 libc-2.3.6.so re_search_internal
438752 5231229 1.4479 17.2635 squid comm_select
423196 5654425 1.3966 18.6601 squid memPoolFree
418949 6073374 1.3826 20.0426 squid stackPop
412394 6485768 1.3609 21.4036 squid httpHeaderIdByName
402709 6888477 1.3290 22.7325 libc-2.3.6.so strtok
364201 7252678 1.2019 23.9344 squid httpHeaderClean
359257 7611935 1.1856 25.1200 squid statHistBin
343628 7955563 1.1340 26.2540 squid SQUID_MD5Transform
330128 8285691 1.0894 27.3434 libc-2.3.6.so memset
323962 8609653 1.0691 28.4125 libc-2.3.6.so memchr

Ok, that's sort of useful. What's unfortunate is that there's uhm, a lot more symbols than that:


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -la ./squid | wc -l
595

Ok, so that's a bit annoying. 16 symbols take ~ 28% of the CPU time, but the other 569-odd take the remaining ~ 72%. This sort of makes traditional optimisation techniques a bit pointless now. I've optimised away almost all of the "stupid" bits - double/triple copying of data, over-allocating and freeing pointlessly, multiple parsing attempts, etc.

How many samples in total?


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -l ./squid | cut -f1 -d' ' | awk '{ s+= $1; } END { print s }'
30302294

Let's look now at what memcpy() is doing, just to get an idea of what needs to be changed:


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -lc -t 1 -i memcpy ./squid
CPU: PIII, speed 634.485 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples % image name symbol name
-------------------------------------------------------------------------------
28133 1.3394 squid storeSwapOut
31515 1.5004 squid stringInit
32619 1.5530 squid httpBuildRequestPrefix
54237 2.5822 squid strListAddStr
54322 2.5863 squid storeSwapMetaBuild
80047 3.8110 squid clientKeepaliveNextRequest
171738 8.1765 squid httpHeaderEntryParseCreate
211091 10.0501 squid httpHeaderEntryPackInto
318793 15.1778 squid stringDup
1022812 48.6962 squid storeAppend
2100394 100.000 libc-2.3.6.so memcpy
2100394 100.000 libc-2.3.6.so memcpy [self]
------------------------------------------------------------------------------

So hm, half the memcpy() CPU time is spent in storeAppend(), followed by stringDup() and httpHeaderEntryPackInto(). Ok, those are what I'm going to be working on eliminating next anyway, so it's not a big deal. This means I'll eliminate ~ 73% of the memcpy() CPU time, which is 73% of ~7%, so around 5% of total CPU time. Not too shabby. There'll be some overhead introduced by how it's done (referenced buffer management) but one of the side-effects should be a drop in the number of calls to the memory allocator functions, so those should drop off a bit too.

But this stuff is still just micro-optimisation. What I need is an idea of which code -paths- are taking up precious CPU time, and thus what I should consider reimplementing first. Let's use "-t" on non-top-level symbols. To start with, let's look at the two top-level "read" functions, which generally lead to some kind of other processing.

root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -lc -t 1 -i clientReadRequest ./squid
CPU: PIII, speed 634.485 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples % symbol name
-------------------------------------------------------------------------------
87536 4.7189 clientKeepaliveNextRequest
1758418 94.7925 comm_select
88441 100.000 clientReadRequest
2121926 86.3731 clientTryParseRequest
88441 3.6000 clientReadRequest [self]
52951 2.1554 commSetSelect
-------------------------------------------------------------------------------


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -lc -t 1 -i httpReadReply ./squid
CPU: PIII, speed 634.485 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples % symbol name
-------------------------------------------------------------------------------
3962448 99.7463 comm_select
163081 100.000 httpReadReply
2781096 53.2193 httpAppendBody
1857597 35.5471 httpProcessReplyHeader
163081 3.1207 httpReadReply [self]
57084 1.0924 memBufGrow
------------------------------------------------------------------------------

Here we're not interested in who is -calling- these functions (since it's just the comm routine :) but in which functions these routines are calling. The next trick, of course, is to try and figure out which of these paths take a noticeable amount of CPU time. Obviously httpAppendBody() and httpProcessReplyHeader() do; they're doing both a lot of copying and a lot of parsing.

I'll look into things a little more in-depth in a few days; I need to get back to paid work. :)

Monday, January 19, 2009

Eliminating copies, or "god this code is horrible"

I've been (slowlyish!) unwinding some of the evil horridness that exists in the src/http.c code which handles reading data from upstream servers/caches, parsing it, and throwing it into the store.

There's two annoying memory copies, as I've said before - one was a copy of the incoming data into a MemBuf, used -just- to assemble the full response headers for parsing; the other (well, the other two) are for appending the data coming in from the network into the memory store, on its way to the client-side code to be sent back to the client.

Now, as I've said before, the src/http.c code isn't all that long and complicated (by far most of the logic actually happens in the forward and client-side routines; the http.c routines do very little besides pump data back into the memory store) but unfortunately enough various layers of logic are mashed together to make things uhm, "very difficult" to work on separately.

Anyway, back on track. I've mostly pulled apart the code which handles reading the reply and parsing the response headers, and I've eliminated the first copy. The data is now read directly into a MemBuf, which serves as both the incoming buffer (which gets appended to) for the reply status line + headers, _AND_ the incoming buffer for HTTP body data (which never gets appended to - it is written out to the memory store and then reset back to empty.)

So the good news now is the number one place for L2 loads, L2 stores and CPU cycles spent unhalted (as measured on my P3 667mhz celeron test box, nice and slow, to expose all those stupid inefficiencies modern CPUs try to cover up :) comes from the memcpy() from src/http.c -> { header parsing (12%), http body appending (84%) } -> storeAppend().

This means one main thing - if I can eliminate the copying into the store, and instead read directly into variable-sized pages (which is unfortunately the bloody tricky part) which are then handed in their entirety to the memory store, that last memcpy() will be eliminated, along with hopefully a good 10+% of CPU time on this P3.
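
In miniature, the change looks like this (hypothetical page API - the hard part is making the store accept variable-sized refcounted pages at all):

#include <sys/types.h>
#include <unistd.h>

struct store_entry;
struct page;                                /* refcounted, variable-sized buffer */
struct page *page_alloc(size_t size);
void page_release(struct page *p);
char *page_data(struct page *p);
void store_append_page(struct store_entry *e, struct page *p, size_t len);

ssize_t
http_read_body(struct store_entry *e, int fd, size_t size)
{
    struct page *p = page_alloc(size);
    ssize_t n = read(fd, page_data(p), size);

    if (n > 0)
        store_append_page(e, p, (size_t) n);   /* the store keeps the page; no memcpy() */
    else
        page_release(p);                       /* failed/empty read: drop our reference */
    return n;
}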

After that, it's fixing the various uses of *printf() functions in the critical path, which absolutely should be avoided. I've got some basic patches to begin replacing some of the really STUPID uses of those. I'll begin committing the really obviously easy ones to Cacheboy HEAD once I've verified they don't break anything (in particular, SNMP indexes of all things..)

Once the two above are done, which accounts for a good 15 - 20% of the current CPU use in Cacheboy (at least in my small objects, memory-cache-only test load on the above hardware), I'll absolutely stop adding any and all new changes, features, optimisations, etc, and go -straight- to "make everything stable" mode again.

There's still so much that needs doing (proper refcounted buffers and strings, comm library functions which properly implement readv() and writev() so I can do things like write out the entire request/reply using vector operations and avoid the other bits of copying which go on, lessening the load on the memory allocator by actually efficiently packing structures, rewriting the http request/reply handling in preparation for replacement HTTP client/server modules, oh and IPv6/threading!) but that will come later.

Sunday, January 18, 2009

Tidying up the http reply handling code..

One of the unfortunate parts of the Squid codebase is that the HTTP request and reply handling code is tangled up with the client and server code, and contains both stuff specific to a cache (eg, looking for headers to control cache behaviour) as well as connection stuff (eg transfer encoding, keepalive, etc.)

My long-term goal is to finally separate all of this mess out so there's "generic" routines to be a HTTP client and server, create requests/replies and parse responses. But for now, tidying up some of the messy code to improve performance (and thus give people motivation to migrate their busy sites to Cacheboy) is on my short-term TODO list.

I spent some time ~ 18 months ago tidying up all of the client-side code so the request line and request header parsing didn't require half a dozen copies of various things just to complete. That was quite successful. The code structure is still horrible, but it works, and that for now is absolutely the most important part.

Now I'm doing something similar to the server-side code. The HTTP server code (src/http.c) combines both reply buffer appending, parsing, 100-continue response handling (well, "handling") and the various header checks for caching and connection in one enormous puddle of code. I'm trying to tease these apart so each part is done separately and the reply data isn't double-copied - once into the reply buffer, then once via storeAppend() into the memory store.

The CPU time spent doing this copying isn't all that high on current systems, but it is definitely noticeable (~30% of all CPU time is spent in memcpy()) for slower systems talking to LAN-connected servers. So I'm going to do it - primarily to fix performance on slower hardware, but it also forces me to tidy up the existing code somewhat.

The next step is avoiding the copy into the memory store entirely, removing another 65% or so of memcpy() CPU time.

Friday, January 16, 2009

Refcounted string buffers!

Those of you who have been watching may have noticed a few String tidyups going into CACHEBOY_HEAD recently (one of which caused a bug in the first cacheboy-1.6 stable release that made it very non-stable!)

This is all in preparation for more sensible string and buffer handling. Unfortunately the Cacheboy codebase inherited a lot of dirty string handling and it needed some house cleaning before I could look towards the future.

Well, the future is here now (well, in /svn/branches/CACHEBOY_HEAD_strref ...) - I brought in my refcounted buffer routines from my previous attempts at all of this and converted String.[ch] over to use it.

For now, the refcounted string implementation doubles the malloc overhead for new strings (since it has to create a small buf_t as well as the string buffer) but stringDup() becomes essentially free. Since in a lot of cases the stringDup() occurs when copying headers and then basically leaving them alone, this saves a bunch of memory copying.
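
To illustrate the shape of it (this is just the idea, not the actual libmem/buf.[ch] API):

#include <stdlib.h>
#include <string.h>

typedef struct {
    char *data;
    size_t len;
    int nref;
} buf_t;

typedef struct {
    buf_t *buf;         /* shared backing buffer */
    size_t offset;      /* this string is a view into buf->data */
    size_t len;
} String;

buf_t *
buf_create(const char *src, size_t len)
{
    buf_t *b = malloc(sizeof(*b));      /* the extra malloc mentioned above */
    b->data = malloc(len);
    memcpy(b->data, src, len);
    b->len = len;
    b->nref = 1;
    return b;
}

void
buf_deref(buf_t *b)
{
    if (--b->nref == 0) {
        free(b->data);
        free(b);
    }
}

/* The payoff: duplicating a string is now a refcount bump, not a copy. */
String
stringDup(const String *s)
{
    String d = *s;
    d.buf->nref++;
    return d;
}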

Decent performance benefits will only come with a whole lot of work:
  • Remove all of the current assumptions in code which uses String that the actual backing buffer (accessible via strBuf()) is NUL-terminated;
  • Rewrite sections of the code which go between String and C string buffers (with copying, etc) to use String where applicable. Unfortunately a whole lot of the original client_side.c code which handles parsing the request involves a fair bit of crap - so..
  • .. writing replacement request and reply HTTP parsers is probably the next thing to do;
  • Shuffling around the client-side code and the http code to use a buf_t as a incoming socket buffer, instead of how they currently do things (in an ugly way..)
  • Propagate down the incoming socket buffer to the request/reply parsing code, so said code can simply create references to the original socket buffer, bypassing any and all requirement for copying the request/reply data seperately.
I'm reasonably excited about the future benefits this code holds, but for now I'm going to remain reasonably conservative and leave the current String improvements where they are. I don't mind if these and the next round of changes to the MemBuf code reduce performance but improve the code; I know that the medium-term goal is going to provide some pretty decent benefits and I want to keep things stable and usable in production whilst I get there.

Next on my list though; looking at removing the places where *printf() is used in critical sections..

Friday, January 9, 2009

More profiling!

The following info is for 10,000 concurrent keep-alive connections, all fetching an internal icon object from Squid. This is using my apachebench-adrian package, which can handle such traffic loads.

The below accounts for roughly 60% of total CPU time (ie, 60% of the CPU is spent in userspace) on one core.
With oprofile, it hits around 12,300 transactions a second.

I have much, much hatred for how Squid uses *printf() everywhere. Sigh.

CPU: AMD64 processors, speed 2613.4 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples cum. samples % cum. % image name symbol name
5383709 5383709 4.5316 4.5316 libc-2.6.1.so vfprintf
4025991 9409700 3.3888 7.9203 libc-2.6.1.so memcpy
3673722 13083422 3.0922 11.0126 libc-2.6.1.so _int_malloc
3428362 16511784 2.8857 13.8983 libc-2.6.1.so memset
3306571 19818355 2.7832 16.6815 libc-2.6.1.so malloc_consolidate
2847887 22666242 2.3971 19.0787 squid memPoolFree
2634120 25300362 2.2172 21.2958 libm-2.6.1.so floor
2609922 27910284 2.1968 23.4927 squid memPoolAlloc
2408836 30319120 2.0276 25.5202 libc-2.6.1.so re_search_internal
2296612 32615732 1.9331 27.4534 libc-2.6.1.so strlen
2265816 34881548 1.9072 29.3605 libc-2.6.1.so _int_free
1826493 36708041 1.5374 30.8979 libc-2.6.1.so _IO_default_xsputn
1641986 38350027 1.3821 32.2800 libc-2.6.1.so free
1601997 39952024 1.3484 33.6285 squid httpHeaderGetEntry
1575919 41527943 1.3265 34.9549 libc-2.6.1.so memchr
1466114 42994057 1.2341 36.1890 libc-2.6.1.so re_string_reconstruct
1275377 44269434 1.0735 37.2625 squid clientTryParseRequest
1214714 45484148 1.0225 38.2850 squid httpMsgFindHeadersEnd
1185932 46670080 0.9982 39.2832 squid statHistBin
1170361 47840441 0.9851 40.2683 squid urlCanonicalClean
1169694 49010135 0.9846 41.2529 libc-2.6.1.so strtok
1145933 50156068 0.9646 42.2174 squid comm_select
1128595 51284663 0.9500 43.1674 libc-2.6.1.so __GI_____strtoll_l_internal
1116573 52401236 0.9398 44.1072 squid httpHeaderIdByName
956209 53357445 0.8049 44.9121 squid SQUID_MD5Transform
915844 54273289 0.7709 45.6830 squid memBufAppend
907609 55180898 0.7640 46.4469 squid stringLimitInit
898666 56079564 0.7564 47.2034 libc-2.6.1.so strspn
883282 56962846 0.7435 47.9468 squid urlParse
852875 57815721 0.7179 48.6647 libc-2.6.1.so calloc
819613 58635334 0.6899 49.3546 squid clientWriteComplete
800196 59435530 0.6735 50.0281 squid httpMsgParseRequestLine

Thursday, January 8, 2009

FreeBSD TPROXY works!

The FreeBSD TPROXY support (with a patched FreeBSD kernel for now) works just fine in testing.

I'm going to commit the changes to FreeBSD in the next couple of days. I'll then bring in the TPROXY4 support from Squid-3, and hopefully get functioning TPROXY2, TPROXY4 and FreeBSD TPROXY support into the upcoming Cacheboy-1.6 release.

Wednesday, January 7, 2009

TPROXY support

G'day,

I've done a bit of shuffling in the communication code to include a more modular approach to IP source address spoofing.

There's (currently) untested support for some FreeBSD source IP address spoofing that I'm bringing over courtesy of Julian Elischer; and there's a Linux TPROXY2 module.
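
For reference, on the modern APIs the core of such a module is just a socket option plus a bind() to the client's address before connect()ing upstream. Linux's IP_TRANSPARENT shown here is the TPROXY4-era mechanism (TPROXY2 used an older interface), and the FreeBSD spelling is IP_BINDANY. A rough sketch, error handling trimmed:

#include <netinet/in.h>
#include <sys/socket.h>

int
comm_spoof_bind(int fd, const struct sockaddr_in *client_addr)
{
    int on = 1;

#if defined(IP_TRANSPARENT)             /* Linux */
    setsockopt(fd, IPPROTO_IP, IP_TRANSPARENT, &on, sizeof(on));
#elif defined(IP_BINDANY)               /* FreeBSD */
    setsockopt(fd, IPPROTO_IP, IP_BINDANY, &on, sizeof(on));
#endif
    /* bind to the *client's* address; outgoing packets carry it as source */
    return bind(fd, (const struct sockaddr *) client_addr, sizeof(*client_addr));
}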

I'll look at porting over the TPROXY4 support from Squid-3 in a few days.

I think this release is about as close to "stable" as Cacheboy-1.6 is going to get, so look forward to a "stable" release as soon as the FreeBSD port has been setup.

I already have a list of things to do for Cacheboy-1.7 which should prove to be interesting. Stay tuned..

Sunday, January 4, 2009

next steps..

I've been slowly fixing whatever bugs creep up in my local testing. The few people publicly testing Cacheboy-1.6 have reported that all the bugs have been fixed. I'd appreciate some further testing but I'll get what I can for now. :)

I'll be doing a few things to the CACHEBOY_HEAD branch after the 1.6 release which will hopefully lead towards a mostly thread-safe core. I'm also busy documenting various things in the core libraries which I haven't yet gotten around to documenting. I also really should sort out some changes to the Apple HeaderDoc software to support generating slightly better looking documents from C source.

I've almost finished the first round of code reorganisation. The stuff that I would have liked to have done this round includes shuffling the threaded async operations out of the AUFS code and building a proper disk library - sort of like what Squid-2.2 "had" and what Squid-3.0 almost did - but using the threads to implement general disk IO rather than specialised modules just for storage. I'd like to take advantage of threaded/non-blocking disk IO in a variety of situations, including logfile writing.