Showing posts with label oprofile. Show all posts
Showing posts with label oprofile. Show all posts

Tuesday, January 20, 2009

Where the CPU is going

Oprofile is fun.

So, lets find out all of the time spent in cacheboy-head, per-symbol, with accumulative time, but only showing symbols taking 1% or more of CPU:


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -la -t 1 ./squid
CPU: PIII, speed 634.485 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples cum. samples % cum. % image name symbol name
2100394 2100394 6.9315 6.9315 libc-2.3.6.so memcpy
674036 2774430 2.2244 9.1558 libc-2.3.6.so vfprintf
657729 3432159 2.1706 11.3264 squid memPoolAlloc
463901 3896060 1.5309 12.8573 libc-2.3.6.so _int_malloc
453978 4350038 1.4982 14.3555 libc-2.3.6.so strncasecmp
442439 4792477 1.4601 15.8156 libc-2.3.6.so re_search_internal
438752 5231229 1.4479 17.2635 squid comm_select
423196 5654425 1.3966 18.6601 squid memPoolFree
418949 6073374 1.3826 20.0426 squid stackPop
412394 6485768 1.3609 21.4036 squid httpHeaderIdByName
402709 6888477 1.3290 22.7325 libc-2.3.6.so strtok
364201 7252678 1.2019 23.9344 squid httpHeaderClean
359257 7611935 1.1856 25.1200 squid statHistBin
343628 7955563 1.1340 26.2540 squid SQUID_MD5Transform
330128 8285691 1.0894 27.3434 libc-2.3.6.so memset
323962 8609653 1.0691 28.4125 libc-2.3.6.so memchr

Ok, thats sort of useful. Whats unfortunate is that there's uhm, a lot more symbols than that:


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -la ./squid | wc -l
595

Ok, so thats a bit annoying. 16 symbols take ~ 28% of the CPU time, but the other 569 odd take the ~ 72% remaining CPU. This sort of makes traditional optimisation techniques a bit pointless now. I've optimised almost all of the "stupid" bits - double/triple copying of data, over-allocating and freeing pointlessly, multiple parsing attempts, etc.

How many samples in total?


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -l ./squid | cut -f1 -d' ' | awk '{ s+= $1; } END { print s }'
30302294

Lets look now at what memcpy() is doing, just to get an idea of what needs to be changed


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -lc -t 1 -i memcpy ./squid
CPU: PIII, speed 634.485 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples % image name symbol name
-------------------------------------------------------------------------------
28133 1.3394 squid storeSwapOut
31515 1.5004 squid stringInit
32619 1.5530 squid httpBuildRequestPrefix
54237 2.5822 squid strListAddStr
54322 2.5863 squid storeSwapMetaBuild
80047 3.8110 squid clientKeepaliveNextRequest
171738 8.1765 squid httpHeaderEntryParseCreate
211091 10.0501 squid httpHeaderEntryPackInto
318793 15.1778 squid stringDup
1022812 48.6962 squid storeAppend
2100394 100.000 libc-2.3.6.so memcpy
2100394 100.000 libc-2.3.6.so memcpy [self]
------------------------------------------------------------------------------

So hm, half the memcpy() CPU time is spent in storeAppend, followed by storeDup, and httpHeaderEntryPackInto. Ok, those are what I'm going to be working on eliminating next anyway, so its not a big deal. This means I'll eliminate ~ 73% of the memcpy() CPU time, which is 73% of 7%, so around 5% of CPU time. Not too shabby. There'll be some overheads introduced by how its done (referenced buffer management) but one of the side-effects of that should be a drop in the number of calls to the memory allocator functions, so they should drop off a bit.

But this stuff is still just micro-optimisation. What I need is an idea of what code -paths- are taking up precious CPU time and thus what I should consider first to reimplement. Lets use the "-t" on non-top-level symbols. To start with, lets look at the two top-level "read" functions, which generally lead to some kind of other processing.

root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -lc -t 1 -i clientReadRequest ./squid
CPU: PIII, speed 634.485 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples % symbol name
-------------------------------------------------------------------------------
87536 4.7189 clientKeepaliveNextRequest
1758418 94.7925 comm_select
88441 100.000 clientReadRequest
2121926 86.3731 clientTryParseRequest
88441 3.6000 clientReadRequest [self]
52951 2.1554 commSetSelect
-------------------------------------------------------------------------------


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -lc -t 1 -i httpReadReply ./squid
CPU: PIII, speed 634.485 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples % symbol name
-------------------------------------------------------------------------------
3962448 99.7463 comm_select
163081 100.000 httpReadReply
2781096 53.2193 httpAppendBody
1857597 35.5471 httpProcessReplyHeader
163081 3.1207 httpReadReply [self]
57084 1.0924 memBufGrow
------------------------------------------------------------------------------

Here we're not interested in who is -calling- these functions (since its just the comm routine :) but which functions this routine is calling. The next trick, of course, is to try and figure out which of these paths are taking a noticable amount of CPU time. Obviously httpAppendBody() and httpProcessReplyHeader() are; they're doing both a lot of copying and a lot of parsing.

I'll look into things a little more in-depth in a few days; I need to get back to paid work. :)

Monday, January 19, 2009

Eliminating copies, or "god this code is horrible"

I've been (slowlyish!) unwinding some of the evil horridness that exists in the src/http.c code which handles reading data from upstream servers/caches, parsing it, and throwing it into the store.

There's two annoying memory copies as I've said before - one was a copy of the incoming data into a MemBuf, used -just- to assemble the full response headers for parsing, and the other (well, other two) are for appending the data coming in from the network into the memory store, on its way to the client-side code to be sent back to the client.

Now, as I've said before, the src/http.c code isn't all that long and complicated (by far most of the logic actually happens in the forward and client-side routines; the http.c routines do very little besides pump data back into the memory store) but unfortunately enough various layers of logic are mashed together to make things uhm, "very difficult" to work on separately.

Anyway, back on track. I've mostly pulled apart the code which handles reading the reply and parsing the response headers, and I've eliminated the first copy. The data is now read directly into a MemBuf, which serves as both the incoming buffer (which gets appended to) for the reply status line + headers, _AND_ the incoming buffer for HTTP body data (which never gets appended to - it is written out to the memory store and then reset back to empty.)

So the good news now is the number one place for L2 loads, L2 stores and CPU cycles spent unhalted (as measured on my P3 667mhz celeron test box, nice and slow, to expose all those stupid inefficiencies modern CPUs try to cover up :) comes from the memcpy() from src/http.c -> { header parsing (12%), http body appending (84%) } -> storeAppend().

This means one main thing - if I can eliminate the copying from into the store, and instead read directly into variable-sized pages (which is unfortunately the bloody tricky part), which are then handed to their entirety to the memory store, that last memcpy() will be eliminated, along with hopefully a good 10 + % of CPU time on this P3.

After that, its fixing the various uses of *printf() functions in the critical path, which absolutely should be avoided. I've got some basic patches to begin replacing some of the really STUPID uses of those. I'll begin committing the really obviously easy ones to Cacheboy HEAD once I've verified they don't break anything (in particular, SNMP indexes of all things..)

Once the two above are done, which accounts for a good 15 - 20% of the current CPU use in Cacheboy (at least in my small objects, memory-cache-only test load on the above hardware), I'll absolutely stop adding any and all new changes, features, optimisations, etc, and go -straight- to "make everything stable" mode again.

There's still so much that needs doing (proper refcounted buffers and strings, comm library functions which properly implement readv() and writev() so I can do things like write out the entire request/reply using vector operations and avoid the other bits of copying which go on, lessening the load on the memory allocator by actually efficiently packing structures, rewriting the http request/reply handling in preparation for replacement HTTP client/server modules, oh and IPv6/threading!) but that will come later.

Friday, January 9, 2009

More profiling!

The following info is for a 10,000 concurrent connections, keep-alived, of just a fetch of an internal icon object from Squid. This is using my apachebench-adrian package which can handle such traffic loads.

The below accounts for roughly 60% of total CPU time (ie, 60% of the CPU is spent in userspace) on one core.
With oprofile, it hits around 12,300 transactions a second.

I have much, much hatred for how Squid uses *printf() everywhere. Sigh.




CPU: AMD64 processors, speed 2613.4 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples cum. samples % cum. % image name symbol name
5383709 5383709 4.5316 4.5316 libc-2.6.1.so vfprintf
4025991 9409700 3.3888 7.9203 libc-2.6.1.so memcpy
3673722 13083422 3.0922 11.0126 libc-2.6.1.so _int_malloc
3428362 16511784 2.8857 13.8983 libc-2.6.1.so memset
3306571 19818355 2.7832 16.6815 libc-2.6.1.so malloc_consolidate
2847887 22666242 2.3971 19.0787 squid memPoolFree
2634120 25300362 2.2172 21.2958 libm-2.6.1.so floor
2609922 27910284 2.1968 23.4927 squid memPoolAlloc
2408836 30319120 2.0276 25.5202 libc-2.6.1.so re_search_internal
2296612 32615732 1.9331 27.4534 libc-2.6.1.so strlen
2265816 34881548 1.9072 29.3605 libc-2.6.1.so _int_free
1826493 36708041 1.5374 30.8979 libc-2.6.1.so _IO_default_xsputn
1641986 38350027 1.3821 32.2800 libc-2.6.1.so free
1601997 39952024 1.3484 33.6285 squid httpHeaderGetEntry
1575919 41527943 1.3265 34.9549 libc-2.6.1.so memchr
1466114 42994057 1.2341 36.1890 libc-2.6.1.so re_string_reconstruct
1275377 44269434 1.0735 37.2625 squid clientTryParseRequest
1214714 45484148 1.0225 38.2850 squid httpMsgFindHeadersEnd
1185932 46670080 0.9982 39.2832 squid statHistBin
1170361 47840441 0.9851 40.2683 squid urlCanonicalClean
1169694 49010135 0.9846 41.2529 libc-2.6.1.so strtok
1145933 50156068 0.9646 42.2174 squid comm_select
1128595 51284663 0.9500 43.1674 libc-2.6.1.so __GI_____strtoll_l_internal
1116573 52401236 0.9398 44.1072 squid httpHeaderIdByName
956209 53357445 0.8049 44.9121 squid SQUID_MD5Transform
915844 54273289 0.7709 45.6830 squid memBufAppend
907609 55180898 0.7640 46.4469 squid stringLimitInit
898666 56079564 0.7564 47.2034 libc-2.6.1.so strspn
883282 56962846 0.7435 47.9468 squid urlParse
852875 57815721 0.7179 48.6647 libc-2.6.1.so calloc
819613 58635334 0.6899 49.3546 squid clientWriteComplete
800196 59435530 0.6735 50.0281 squid httpMsgParseRequestLine