Cacheboy Development: oprofile

Showing posts with label oprofile. Show all posts

Tuesday, January 20, 2009

Where the CPU is going

Oprofile is fun.

So, lets find out all of the time spent in cacheboy-head, per-symbol, with accumulative time, but only showing symbols taking 1% or more of CPU:


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -la -t 1 ./squid
CPU: PIII, speed 634.485 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples  cum. samples  %        cum. %     image name               symbol name
2100394  2100394        6.9315   6.9315    libc-2.3.6.so            memcpy
674036   2774430        2.2244   9.1558    libc-2.3.6.so            vfprintf
657729   3432159        2.1706  11.3264    squid                    memPoolAlloc
463901   3896060        1.5309  12.8573    libc-2.3.6.so            _int_malloc
453978   4350038        1.4982  14.3555    libc-2.3.6.so            strncasecmp
442439   4792477        1.4601  15.8156    libc-2.3.6.so            re_search_internal
438752   5231229        1.4479  17.2635    squid                    comm_select
423196   5654425        1.3966  18.6601    squid                    memPoolFree
418949   6073374        1.3826  20.0426    squid                    stackPop
412394   6485768        1.3609  21.4036    squid                    httpHeaderIdByName
402709   6888477        1.3290  22.7325    libc-2.3.6.so            strtok
364201   7252678        1.2019  23.9344    squid                    httpHeaderClean
359257   7611935        1.1856  25.1200    squid                    statHistBin
343628   7955563        1.1340  26.2540    squid                    SQUID_MD5Transform
330128   8285691        1.0894  27.3434    libc-2.3.6.so            memset
323962   8609653        1.0691  28.4125    libc-2.3.6.so            memchr

Ok, thats sort of useful. Whats unfortunate is that there's uhm, a lot more symbols than that:


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -la ./squid | wc -l
595

Ok, so thats a bit annoying. 16 symbols take ~ 28% of the CPU time, but the other 569 odd take the ~ 72% remaining CPU. This sort of makes traditional optimisation techniques a bit pointless now. I've optimised almost all of the "stupid" bits - double/triple copying of data, over-allocating and freeing pointlessly, multiple parsing attempts, etc.

How many samples in total?


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -l ./squid | cut -f1 -d' ' | awk '{ s+= $1; } END { print s }'
30302294

Lets look now at what memcpy() is doing, just to get an idea of what needs to be changed


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -lc -t 1 -i memcpy ./squid
CPU: PIII, speed 634.485 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples  %        image name               symbol name
-------------------------------------------------------------------------------
 28133     1.3394  squid                    storeSwapOut
 31515     1.5004  squid                    stringInit
 32619     1.5530  squid                    httpBuildRequestPrefix
 54237     2.5822  squid                    strListAddStr
 54322     2.5863  squid                    storeSwapMetaBuild
 80047     3.8110  squid                    clientKeepaliveNextRequest
 171738    8.1765  squid                    httpHeaderEntryParseCreate
 211091   10.0501  squid                    httpHeaderEntryPackInto
 318793   15.1778  squid                    stringDup
 1022812  48.6962  squid                    storeAppend
2100394  100.000  libc-2.3.6.so            memcpy
 2100394  100.000  libc-2.3.6.so            memcpy [self]
------------------------------------------------------------------------------

So hm, half the memcpy() CPU time is spent in storeAppend, followed by storeDup, and httpHeaderEntryPackInto. Ok, those are what I'm going to be working on eliminating next anyway, so its not a big deal. This means I'll eliminate ~ 73% of the memcpy() CPU time, which is 73% of 7%, so around 5% of CPU time. Not too shabby. There'll be some overheads introduced by how its done (referenced buffer management) but one of the side-effects of that should be a drop in the number of calls to the memory allocator functions, so they should drop off a bit.

But this stuff is still just micro-optimisation. What I need is an idea of what code -paths- are taking up precious CPU time and thus what I should consider first to reimplement. Lets use the "-t" on non-top-level symbols. To start with, lets look at the two top-level "read" functions, which generally lead to some kind of other processing.

root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -lc -t 1 -i clientReadRequest ./squid
CPU: PIII, speed 634.485 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples  %        symbol name
-------------------------------------------------------------------------------
  87536     4.7189  clientKeepaliveNextRequest
  1758418  94.7925  comm_select
88441    100.000  clientReadRequest
  2121926  86.3731  clientTryParseRequest
  88441     3.6000  clientReadRequest [self]
  52951     2.1554  commSetSelect
-------------------------------------------------------------------------------


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -lc -t 1 -i httpReadReply ./squid
CPU: PIII, speed 634.485 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples  %        symbol name
-------------------------------------------------------------------------------
  3962448  99.7463  comm_select
163081   100.000  httpReadReply
  2781096  53.2193  httpAppendBody
  1857597  35.5471  httpProcessReplyHeader
  163081    3.1207  httpReadReply [self]
  57084     1.0924  memBufGrow
------------------------------------------------------------------------------

Here we're not interested in who is -calling- these functions (since its just the comm routine :) but which functions this routine is calling. The next trick, of course, is to try and figure out which of these paths are taking a noticable amount of CPU time. Obviously httpAppendBody() and httpProcessReplyHeader() are; they're doing both a lot of copying and a lot of parsing.

I'll look into things a little more in-depth in a few days; I need to get back to paid work. :)

Monday, January 19, 2009

Eliminating copies, or "god this code is horrible"

I've been (slowlyish!) unwinding some of the evil horridness that exists in the src/http.c code which handles reading data from upstream servers/caches, parsing it, and throwing it into the store.

There's two annoying memory copies as I've said before - one was a copy of the incoming data into a MemBuf, used -just- to assemble the full response headers for parsing, and the other (well, other two) are for appending the data coming in from the network into the memory store, on its way to the client-side code to be sent back to the client.

Now, as I've said before, the src/http.c code isn't all that long and complicated (by far most of the logic actually happens in the forward and client-side routines; the http.c routines do very little besides pump data back into the memory store) but unfortunately enough various layers of logic are mashed together to make things uhm, "very difficult" to work on separately.

Anyway, back on track. I've mostly pulled apart the code which handles reading the reply and parsing the response headers, and I've eliminated the first copy. The data is now read directly into a MemBuf, which serves as both the incoming buffer (which gets appended to) for the reply status line + headers, _AND_ the incoming buffer for HTTP body data (which never gets appended to - it is written out to the memory store and then reset back to empty.)

So the good news now is the number one place for L2 loads, L2 stores and CPU cycles spent unhalted (as measured on my P3 667mhz celeron test box, nice and slow, to expose all those stupid inefficiencies modern CPUs try to cover up :) comes from the memcpy() from src/http.c -> { header parsing (12%), http body appending (84%) } -> storeAppend().

This means one main thing - if I can eliminate the copying from into the store, and instead read directly into variable-sized pages (which is unfortunately the bloody tricky part), which are then handed to their entirety to the memory store, that last memcpy() will be eliminated, along with hopefully a good 10 + % of CPU time on this P3.

After that, its fixing the various uses of *printf() functions in the critical path, which absolutely should be avoided. I've got some basic patches to begin replacing some of the really STUPID uses of those. I'll begin committing the really obviously easy ones to Cacheboy HEAD once I've verified they don't break anything (in particular, SNMP indexes of all things..)

Once the two above are done, which accounts for a good 15 - 20% of the current CPU use in Cacheboy (at least in my small objects, memory-cache-only test load on the above hardware), I'll absolutely stop adding any and all new changes, features, optimisations, etc, and go -straight- to "make everything stable" mode again.

There's still so much that needs doing (proper refcounted buffers and strings, comm library functions which properly implement readv() and writev() so I can do things like write out the entire request/reply using vector operations and avoid the other bits of copying which go on, lessening the load on the memory allocator by actually efficiently packing structures, rewriting the http request/reply handling in preparation for replacement HTTP client/server modules, oh and IPv6/threading!) but that will come later.

Friday, January 9, 2009

More profiling!

The following info is for a 10,000 concurrent connections, keep-alived, of just a fetch of an internal icon object from Squid. This is using my apachebench-adrian package which can handle such traffic loads.

The below accounts for roughly 60% of total CPU time (ie, 60% of the CPU is spent in userspace) on one core.
With oprofile, it hits around 12,300 transactions a second.

I have much, much hatred for how Squid uses *printf() everywhere. Sigh.



CPU: AMD64 processors, speed 2613.4 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples  cum. samples  %        cum. %     image name               symbol name
5383709  5383709        4.5316   4.5316    libc-2.6.1.so            vfprintf
4025991  9409700        3.3888   7.9203    libc-2.6.1.so            memcpy
3673722  13083422       3.0922  11.0126    libc-2.6.1.so            _int_malloc
3428362  16511784       2.8857  13.8983    libc-2.6.1.so            memset
3306571  19818355       2.7832  16.6815    libc-2.6.1.so            malloc_consolidate
2847887  22666242       2.3971  19.0787    squid                    memPoolFree
2634120  25300362       2.2172  21.2958    libm-2.6.1.so            floor
2609922  27910284       2.1968  23.4927    squid                    memPoolAlloc
2408836  30319120       2.0276  25.5202    libc-2.6.1.so            re_search_internal
2296612  32615732       1.9331  27.4534    libc-2.6.1.so            strlen
2265816  34881548       1.9072  29.3605    libc-2.6.1.so            _int_free
1826493  36708041       1.5374  30.8979    libc-2.6.1.so            _IO_default_xsputn
1641986  38350027       1.3821  32.2800    libc-2.6.1.so            free
1601997  39952024       1.3484  33.6285    squid                    httpHeaderGetEntry
1575919  41527943       1.3265  34.9549    libc-2.6.1.so            memchr
1466114  42994057       1.2341  36.1890    libc-2.6.1.so            re_string_reconstruct
1275377  44269434       1.0735  37.2625    squid                    clientTryParseRequest
1214714  45484148       1.0225  38.2850    squid                    httpMsgFindHeadersEnd
1185932  46670080       0.9982  39.2832    squid                    statHistBin
1170361  47840441       0.9851  40.2683    squid                    urlCanonicalClean
1169694  49010135       0.9846  41.2529    libc-2.6.1.so            strtok
1145933  50156068       0.9646  42.2174    squid                    comm_select
1128595  51284663       0.9500  43.1674    libc-2.6.1.so            __GI_____strtoll_l_internal
1116573  52401236       0.9398  44.1072    squid                    httpHeaderIdByName
956209   53357445       0.8049  44.9121    squid                    SQUID_MD5Transform
915844   54273289       0.7709  45.6830    squid                    memBufAppend
907609   55180898       0.7640  46.4469    squid                    stringLimitInit
898666   56079564       0.7564  47.2034    libc-2.6.1.so            strspn
883282   56962846       0.7435  47.9468    squid                    urlParse
852875   57815721       0.7179  48.6647    libc-2.6.1.so            calloc
819613   58635334       0.6899  49.3546    squid                    clientWriteComplete
800196   59435530       0.6735  50.0281    squid                    httpMsgParseRequestLine

Cacheboy Development