Cacheboy Development: Where the CPU is going

Oprofile is fun.

So, lets find out all of the time spent in cacheboy-head, per-symbol, with accumulative time, but only showing symbols taking 1% or more of CPU:


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -la -t 1 ./squid
CPU: PIII, speed 634.485 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples  cum. samples  %        cum. %     image name               symbol name
2100394  2100394        6.9315   6.9315    libc-2.3.6.so            memcpy
674036   2774430        2.2244   9.1558    libc-2.3.6.so            vfprintf
657729   3432159        2.1706  11.3264    squid                    memPoolAlloc
463901   3896060        1.5309  12.8573    libc-2.3.6.so            _int_malloc
453978   4350038        1.4982  14.3555    libc-2.3.6.so            strncasecmp
442439   4792477        1.4601  15.8156    libc-2.3.6.so            re_search_internal
438752   5231229        1.4479  17.2635    squid                    comm_select
423196   5654425        1.3966  18.6601    squid                    memPoolFree
418949   6073374        1.3826  20.0426    squid                    stackPop
412394   6485768        1.3609  21.4036    squid                    httpHeaderIdByName
402709   6888477        1.3290  22.7325    libc-2.3.6.so            strtok
364201   7252678        1.2019  23.9344    squid                    httpHeaderClean
359257   7611935        1.1856  25.1200    squid                    statHistBin
343628   7955563        1.1340  26.2540    squid                    SQUID_MD5Transform
330128   8285691        1.0894  27.3434    libc-2.3.6.so            memset
323962   8609653        1.0691  28.4125    libc-2.3.6.so            memchr

Ok, thats sort of useful. Whats unfortunate is that there's uhm, a lot more symbols than that:


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -la ./squid | wc -l
595

Ok, so thats a bit annoying. 16 symbols take ~ 28% of the CPU time, but the other 569 odd take the ~ 72% remaining CPU. This sort of makes traditional optimisation techniques a bit pointless now. I've optimised almost all of the "stupid" bits - double/triple copying of data, over-allocating and freeing pointlessly, multiple parsing attempts, etc.

How many samples in total?


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -l ./squid | cut -f1 -d' ' | awk '{ s+= $1; } END { print s }'
30302294

Lets look now at what memcpy() is doing, just to get an idea of what needs to be changed


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -lc -t 1 -i memcpy ./squid
CPU: PIII, speed 634.485 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples  %        image name               symbol name
-------------------------------------------------------------------------------
 28133     1.3394  squid                    storeSwapOut
 31515     1.5004  squid                    stringInit
 32619     1.5530  squid                    httpBuildRequestPrefix
 54237     2.5822  squid                    strListAddStr
 54322     2.5863  squid                    storeSwapMetaBuild
 80047     3.8110  squid                    clientKeepaliveNextRequest
 171738    8.1765  squid                    httpHeaderEntryParseCreate
 211091   10.0501  squid                    httpHeaderEntryPackInto
 318793   15.1778  squid                    stringDup
 1022812  48.6962  squid                    storeAppend
2100394  100.000  libc-2.3.6.so            memcpy
 2100394  100.000  libc-2.3.6.so            memcpy [self]
------------------------------------------------------------------------------

So hm, half the memcpy() CPU time is spent in storeAppend, followed by storeDup, and httpHeaderEntryPackInto. Ok, those are what I'm going to be working on eliminating next anyway, so its not a big deal. This means I'll eliminate ~ 73% of the memcpy() CPU time, which is 73% of 7%, so around 5% of CPU time. Not too shabby. There'll be some overheads introduced by how its done (referenced buffer management) but one of the side-effects of that should be a drop in the number of calls to the memory allocator functions, so they should drop off a bit.

But this stuff is still just micro-optimisation. What I need is an idea of what code -paths- are taking up precious CPU time and thus what I should consider first to reimplement. Lets use the "-t" on non-top-level symbols. To start with, lets look at the two top-level "read" functions, which generally lead to some kind of other processing.

root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -lc -t 1 -i clientReadRequest ./squid
CPU: PIII, speed 634.485 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples  %        symbol name
-------------------------------------------------------------------------------
  87536     4.7189  clientKeepaliveNextRequest
  1758418  94.7925  comm_select
88441    100.000  clientReadRequest
  2121926  86.3731  clientTryParseRequest
  88441     3.6000  clientReadRequest [self]
  52951     2.1554  commSetSelect
-------------------------------------------------------------------------------


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -lc -t 1 -i httpReadReply ./squid
CPU: PIII, speed 634.485 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples  %        symbol name
-------------------------------------------------------------------------------
  3962448  99.7463  comm_select
163081   100.000  httpReadReply
  2781096  53.2193  httpAppendBody
  1857597  35.5471  httpProcessReplyHeader
  163081    3.1207  httpReadReply [self]
  57084     1.0924  memBufGrow
------------------------------------------------------------------------------

Here we're not interested in who is -calling- these functions (since its just the comm routine :) but which functions this routine is calling. The next trick, of course, is to try and figure out which of these paths are taking a noticable amount of CPU time. Obviously httpAppendBody() and httpProcessReplyHeader() are; they're doing both a lot of copying and a lot of parsing.

I'll look into things a little more in-depth in a few days; I need to get back to paid work. :)

3 comments:

Unknown said...: Of course, the other thing you need to always keep in mind is what the theoretical limits of the system are, so you know how close you're getting to max throughput.... I admit it's a bit harder on Squid, but I've definitely seen examples of people spending weeks trying to optimize something that was as fast as it was ever going to be.; January 20, 2009 at 5:07 PM
Adrian said...: Hm, I think I've posted information about this in the past on this blog, but even this slow/old system is capable of slightly more throughput than what I'm currently getting.

I've written and benchmarked straight TCP proxy code on this server and I've seen the throughput jump dramatically because of the lack of CPU time spent in userland.

The problem that exists with Squid is that there's just a -lot- of code being executed, a lot of which is horribly inefficient and verbose.

I'm resisting the temptation to rewrite large chunks of the codebase until I've removed the last couple of obvious CPU and bus thrashers (namely, the memcpy() and *printf() calls) and can then try to identify which bits of code are just taking too damn long.; January 20, 2009 at 5:19 PM
Ralf H. said...: "So hm, half the memcpy() CPU time is spent in storeAppend, followed by storeDup..."

No, it's stringDup :); January 21, 2009 at 12:21 AM

Cacheboy Development

Tuesday, January 20, 2009

Where the CPU is going

3 comments:

Blog Archive

About Me