Tuesday, January 20, 2009

Where the CPU is going

Oprofile is fun.

So, lets find out all of the time spent in cacheboy-head, per-symbol, with accumulative time, but only showing symbols taking 1% or more of CPU:


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -la -t 1 ./squid
CPU: PIII, speed 634.485 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples cum. samples % cum. % image name symbol name
2100394 2100394 6.9315 6.9315 libc-2.3.6.so memcpy
674036 2774430 2.2244 9.1558 libc-2.3.6.so vfprintf
657729 3432159 2.1706 11.3264 squid memPoolAlloc
463901 3896060 1.5309 12.8573 libc-2.3.6.so _int_malloc
453978 4350038 1.4982 14.3555 libc-2.3.6.so strncasecmp
442439 4792477 1.4601 15.8156 libc-2.3.6.so re_search_internal
438752 5231229 1.4479 17.2635 squid comm_select
423196 5654425 1.3966 18.6601 squid memPoolFree
418949 6073374 1.3826 20.0426 squid stackPop
412394 6485768 1.3609 21.4036 squid httpHeaderIdByName
402709 6888477 1.3290 22.7325 libc-2.3.6.so strtok
364201 7252678 1.2019 23.9344 squid httpHeaderClean
359257 7611935 1.1856 25.1200 squid statHistBin
343628 7955563 1.1340 26.2540 squid SQUID_MD5Transform
330128 8285691 1.0894 27.3434 libc-2.3.6.so memset
323962 8609653 1.0691 28.4125 libc-2.3.6.so memchr

Ok, thats sort of useful. Whats unfortunate is that there's uhm, a lot more symbols than that:


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -la ./squid | wc -l
595

Ok, so thats a bit annoying. 16 symbols take ~ 28% of the CPU time, but the other 569 odd take the ~ 72% remaining CPU. This sort of makes traditional optimisation techniques a bit pointless now. I've optimised almost all of the "stupid" bits - double/triple copying of data, over-allocating and freeing pointlessly, multiple parsing attempts, etc.

How many samples in total?


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -l ./squid | cut -f1 -d' ' | awk '{ s+= $1; } END { print s }'
30302294

Lets look now at what memcpy() is doing, just to get an idea of what needs to be changed


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -lc -t 1 -i memcpy ./squid
CPU: PIII, speed 634.485 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples % image name symbol name
-------------------------------------------------------------------------------
28133 1.3394 squid storeSwapOut
31515 1.5004 squid stringInit
32619 1.5530 squid httpBuildRequestPrefix
54237 2.5822 squid strListAddStr
54322 2.5863 squid storeSwapMetaBuild
80047 3.8110 squid clientKeepaliveNextRequest
171738 8.1765 squid httpHeaderEntryParseCreate
211091 10.0501 squid httpHeaderEntryPackInto
318793 15.1778 squid stringDup
1022812 48.6962 squid storeAppend
2100394 100.000 libc-2.3.6.so memcpy
2100394 100.000 libc-2.3.6.so memcpy [self]
------------------------------------------------------------------------------

So hm, half the memcpy() CPU time is spent in storeAppend, followed by storeDup, and httpHeaderEntryPackInto. Ok, those are what I'm going to be working on eliminating next anyway, so its not a big deal. This means I'll eliminate ~ 73% of the memcpy() CPU time, which is 73% of 7%, so around 5% of CPU time. Not too shabby. There'll be some overheads introduced by how its done (referenced buffer management) but one of the side-effects of that should be a drop in the number of calls to the memory allocator functions, so they should drop off a bit.

But this stuff is still just micro-optimisation. What I need is an idea of what code -paths- are taking up precious CPU time and thus what I should consider first to reimplement. Lets use the "-t" on non-top-level symbols. To start with, lets look at the two top-level "read" functions, which generally lead to some kind of other processing.

root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -lc -t 1 -i clientReadRequest ./squid
CPU: PIII, speed 634.485 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples % symbol name
-------------------------------------------------------------------------------
87536 4.7189 clientKeepaliveNextRequest
1758418 94.7925 comm_select
88441 100.000 clientReadRequest
2121926 86.3731 clientTryParseRequest
88441 3.6000 clientReadRequest [self]
52951 2.1554 commSetSelect
-------------------------------------------------------------------------------


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -lc -t 1 -i httpReadReply ./squid
CPU: PIII, speed 634.485 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples % symbol name
-------------------------------------------------------------------------------
3962448 99.7463 comm_select
163081 100.000 httpReadReply
2781096 53.2193 httpAppendBody
1857597 35.5471 httpProcessReplyHeader
163081 3.1207 httpReadReply [self]
57084 1.0924 memBufGrow
------------------------------------------------------------------------------

Here we're not interested in who is -calling- these functions (since its just the comm routine :) but which functions this routine is calling. The next trick, of course, is to try and figure out which of these paths are taking a noticable amount of CPU time. Obviously httpAppendBody() and httpProcessReplyHeader() are; they're doing both a lot of copying and a lot of parsing.

I'll look into things a little more in-depth in a few days; I need to get back to paid work. :)

3 comments:

Unknown said...

Of course, the other thing you need to always keep in mind is what the theoretical limits of the system are, so you know how close you're getting to max throughput.... I admit it's a bit harder on Squid, but I've definitely seen examples of people spending weeks trying to optimize something that was as fast as it was ever going to be.

Adrian said...

Hm, I think I've posted information about this in the past on this blog, but even this slow/old system is capable of slightly more throughput than what I'm currently getting.

I've written and benchmarked straight TCP proxy code on this server and I've seen the throughput jump dramatically because of the lack of CPU time spent in userland.

The problem that exists with Squid is that there's just a -lot- of code being executed, a lot of which is horribly inefficient and verbose.

I'm resisting the temptation to rewrite large chunks of the codebase until I've removed the last couple of obvious CPU and bus thrashers (namely, the memcpy() and *printf() calls) and can then try to identify which bits of code are just taking too damn long.

Ralf H. said...

"So hm, half the memcpy() CPU time is spent in storeAppend, followed by storeDup..."

No, it's stringDup :)