I'm currently trying to understand the performance properties of certain loops on x86_64 (specifically, on my Intel(R) Core(TM) i3-8145U CPU @ 2.10GHz). Adding an extra memory-reading instruction inside the body of the loop almost doubles the performance, with the details of that instruction apparently not mattering much.
I've been using a test program that consists of two main parts: a testing loop, and a function under test. The testing loop runs the function under test 2^32 times, once with each signed 32-bit integer as its argument (in order from INT_MIN to INT_MAX). The function under test (named body) is a small function that checks to see whether it was called with the expected argument, and records the error in a global variable otherwise. The test program touches a sufficiently small amount of memory that everything is likely to fit in the L1 cache.
In order to eliminate any speed differences that might be caused by the behaviour of a compiler, I wrote both of the functions in question in assembly language (I'm using clang as the assembler), and forced them to start at fixed addresses (the performance of this sort of test loop often gets dominated by effects relating to alignment or caching, so using fixed addresses eliminates any alignment or cache effects unrelated to the change).
Here's the test loop, disassembled (it takes the address of the function to loop over in %rdi):
    401300: 53                push   %rbx
    401301: 55                push   %rbp
    401302: 51                push   %rcx
    401303: 48 89 fd          mov    %rdi,%rbp
    401306: bb 00 00 00 80    mov    $0x80000000,%ebx
    loop:
    40130b: 89 df             mov    %ebx,%edi
    40130d: ff d5             callq  *%rbp
    40130f: 83 c3 01          add    $0x1,%ebx
    401312: 71 f7             jno    40130b <loop>
    401314: 59                pop    %rcx
    401315: 5d                pop    %rbp
    401316: 5b                pop    %rbx
    401317: c3                retq

and here's the simplest version of body, the function under test:
    401200: 33 3d 3a 3e 00 00    xor    0x3e3a(%rip),%edi    # 405040 <next_expected>
    401206: 09 3d 30 3e 00 00    or     %edi,0x3e30(%rip)    # 40503c <any_errors>
    40120c: ff 05 2e 3e 00 00    incl   0x3e2e(%rip)         # 405040 <next_expected>
    401212: c3                   retq

(The basic idea is that body checks to see whether its argument %edi is equal to next_expected, and sets any_errors to a nonzero value if it isn't, leaving it unchanged otherwise. It then increments next_expected.)
The test loop, with this version of body as %rdi, takes approximately 11 seconds to run on my processor. However, adding in an extra read of memory causes it to run in under 6 seconds (a difference much too large to be explained by random variation):
    401200: 33 3d 3a 3e 00 00    xor    0x3e3a(%rip),%edi    # 405040 <next_expected>
    401206: 09 3d 30 3e 00 00    or     %edi,0x3e30(%rip)    # 40503c <any_errors>
    40120c: 33 3d 2e 3e 00 00    xor    0x3e2e(%rip),%edi    # 405040 <next_expected>
    401212: ff 05 28 3e 00 00    incl   0x3e28(%rip)         # 405040 <next_expected>
    401218: c3                   retq

I've tried lots of different variants of this code to see what specific properties of the additional statement (the one labelled 40120c above) give the "fast" behaviour. The common pattern seems to be that the statement needs to read from memory. Here are the various statements I've tried there (note: each of these is a single statement that's exactly 6 bytes long, so there are no alignment considerations to worry about):
These statements run quickly (~6 seconds):
- It doesn't seem to matter what operation we read the memory with:
    xor 0x3e2e(%rip),%edi    # 405040 <next_expected>
    and 0x3e2e(%rip),%edi    # 405040 <next_expected>
    mov 0x3e2e(%rip),%edi    # 405040 <next_expected>
- or what register we read into:
    and 0x3e2e(%rip),%eax    # 405040 <next_expected>
- or (in most cases) what we're reading:
    xor 0x11c7(%rip),%edi    # 4023d9 <main>
    or  -0x12(%rip),%edi     # 401200 <body>
- or whether we're writing to memory in addition to reading it:
    xor %edi,0x3e2a(%rip)    # 40503c <junk>
- Additionally, adding
    xor 0x11c7(%rip),%edi    # 4023d9 <main>
  after rather than before the incl command also gave fast performance.
These statements run slowly (~11 seconds):
- It isn't sufficient to use an instruction that's 6 bytes long but doesn't read memory:
    nopw %cs:(%rax,%rax,1)
    lea  0x3e2e(%rip),%edi   # 405040 <next_expected>
- It isn't sufficient to write memory without reading it:
    mov %edi,0x3e2a(%rip)    # 40503c <junk>
Additionally, I tried writing the read value back to next_expected rather than incrementing it in place:
    401200: 33 3d 3a 3e 00 00    xor    0x3e3a(%rip),%edi    # 405040 <next_expected>
    401206: 09 3d 30 3e 00 00    or     %edi,0x3e30(%rip)    # 40503c <any_errors>
    40120c: 8b 3d 2e 3e 00 00    mov    0x3e2e(%rip),%edi    # 405040 <next_expected>
    401212: ff c7                inc    %edi
    401214: 89 3d 26 3e 00 00    mov    %edi,0x3e26(%rip)    # 405040 <next_expected>
    40121a: c3                   retq

This had a performance very close to the original 11-second program.
One anomaly is the statement xor 0x3e2a(%rip),%edi # 40503c <any_errors>; adding that as the 40120c statement consistently gave a running time of 7.3 seconds, matching neither of the other two timings. I suspect that what's happening here is that the read of memory is sufficient to put the function onto the "fast path", but the statement itself is slow (because we just wrote any_errors on the preceding line; writing and then immediately reading the same memory address is something processors can struggle with), and thus we're getting fast-path performance plus a slowdown from the use of a slow statement. Something similar happens if I add a read of next_expected (rather than main) after rather than before the incl statement (again, we're reading a memory address that was just written, so the similar behaviour is not surprising).
Another experiment was adding xor next_expected(%rip), %eax earlier in the function (before the write to any_errors, or right at the start of the function before the read of next_expected). These gave a performance of around 8.5 seconds.
Anyway, at this point it seems fairly clear what is causing the fast behaviour (adding an additional read from memory is making the function run faster, at least when it's combined with the specific test loop shown here; it wouldn't surprise me if the details of the test loop were relevant). What I don't understand, though, is why the processor would behave like this. In particular, is there a general rule that can be used to work out when adding an extra read to a program will make it run (this much) faster?
https://stackoverflow.com/questions/65756336/why-does-this-function-run-so-much-faster-when-it-makes-an-extra-read-of-memory January 17, 2021 at 09:06AM