编译器中的 8 位布尔值.对它们的操作效率低吗?
问题描述
我正在阅读 Agner Fog 的Optimizing software in C++"(特定于 x86 处理器Intel、AMD 和 VIA),它在第 34 页上声明
I'm reading Agner Fog's "Optimizing software in C++" (specific to x86 processors for Intel, AMD and VIA) and it states on page 34
布尔变量存储为 8 位整数,值为 0 表示假,1 表示真.布尔变量是超定的,因为所有具有布尔值的运算符变量作为输入检查输入是否有任何其他值而不是 0 或 1,但运算符将布尔值作为输出只能产生 0 或 1 以外的任何值.这使得操作使用布尔变量作为输入,效率低于必要.
Boolean variables are stored as 8-bit integers with the value 0 for false and 1 for true. Boolean variables are overdetermined in the sense that all operators that have Boolean variables as input check if the inputs have any other value than 0 or 1, but operators that have Booleans as output can produce no other value than 0 or 1. This makes operations with Boolean variables as input less efficient than necessary.
今天仍然如此吗?在什么编译器上?你能举个例子吗?作者说
Is this still true today and on what compilers? Can you please give an example? The author states
布尔运算可以变得更有效率,如果它可以肯定地知道操作数除了 0 和 1 之外没有其他值.原因为什么编译器不做这样的假设是变量可能有其他如果它们未初始化或来自未知来源.
The Boolean operations can be made much more efficient if it is known with certainty that the operands have no other values than 0 and 1. The reason why the compiler doesn't make such an assumption is that the variables might have other values if they are uninitialized or come from unknown sources.
这是否意味着如果我以函数指针 bool(*)()
为例并调用它,那么对它的操作会产生低效的代码?还是当我通过取消引用指针或从引用中读取来访问布尔值然后对其进行操作时,情况是否如此?
Does this mean that if I take a function pointer bool(*)()
for example and call it, then operations on it produce inefficient code? Or is it the case when I access a boolean by dereferencing a pointer or reading from a reference and then operate on it?
推荐答案
TL:DR:当前的编译器在执行类似
<代码>(a&&b) ?x : y.但是,不是他们不假设 0/1 的原因,他们只是在这方面很烂.
TL:DR: current compilers still have bool
missed-optimizations when doing stuff like
(a&&b) ? x : y
. But the reason why is not that they don't assume 0/1, they just suck at this.
bool
的许多用途是用于本地函数或内联函数,因此布尔化为 0
/1
可以优化并分支(或cmov 或其他)在原始条件下.只有当 bool
输入/输出必须通过非内联或真正存储在内存中的东西传递/返回时,才需要担心优化.
Many uses of bool
are for locals, or inline functions, so booleanizing to a 0
/ 1
can optimize away and branch (or cmov or whatever) on the original condition. Only worry about optimizing bool
inputs / outputs when it does have to get passed/returned across something that doesn't inline, or really stored in memory.
可能的优化指南:将来自外部源(函数 args/内存)的 bool
与位运算符(如 a&b
)结合起来.MSVC 和 ICC 在这方面做得更好.IDK 如果本地 bool
的情况更糟.请注意,a&b
仅等效于 bool
的 a&&b
,而不是整数类型.2 &&1
是真的,但是 2 &1
为 0,这是错误的.按位或没有这个问题.
Possible optimization guideline: combine bool
s from external sources (function args / memory) with bitwise operators, like a&b
. MSVC and ICC do better with this. IDK if it's ever worse for local bool
s. Beware that a&b
is only equivalent to a&&b
for bool
, not integer types. 2 && 1
is true, but 2 & 1
is 0 which is false. Bitwise OR doesn't have this problem.
IDK,如果此准则对通过函数内(或内联)比较设置的本地人有害.例如.它可能会导致编译器实际生成整数布尔值,而不是尽可能直接使用比较结果.另请注意,它似乎对当前的 gcc 和 clang 没有帮助.
IDK if this guideline will ever hurt for locals that were set from a comparison within the function (or in something that inlined). E.g. it might lead the compiler to actually make integer booleans instead of just using comparison results directly when possible. Also note that it doesn't seem to help with current gcc and clang.
是的,x86 上的 C++ 实现将 bool
存储在一个始终为 0 或 1 的字节中(至少跨越函数调用边界,其中编译器必须遵守需要这样做的 ABI/调用约定.)
Yes, C++ implementations on x86 store bool
in a byte that's always 0 or 1 (at least across function-call boundaries where the compiler has to respect the ABI / calling convention which requires this.)
编译器有时会利用这一点,例如对于 bool
->int
转换,即使 gcc 4.4 也只是零扩展为 32 位 (movzx eax, dil
).Clang 和 MSVC 也这样做.C 和 C++ 规则要求这种转换产生 0 或 1,所以这种行为只有在 总是 安全地假设 bool
函数 arg 或全局变量具有 0 时才是安全的或 1 个值.
Compilers do sometimes take advantage of this, e.g. for bool
->int
conversion even gcc 4.4 simply zero-extends to 32-bit (movzx eax, dil
). Clang and MSVC do this, too. C and C++ rules require this conversion to produce 0 or 1, so this behaviour is only safe if it's always safe to assume that a bool
function arg or global variable has a 0 or 1 value.
即使是旧的编译器通常也会在 bool
->int
中利用它,但在其他情况下则不然.因此,Agner 说的原因是错误的:
Even old compilers typically did take advantage of it for bool
->int
, but not in other cases. Thus, Agner is wrong about the reason when he says:
编译器不做这种假设的原因是,如果变量未初始化或来自未知来源,它们可能具有其他值.
The reason why the compiler doesn't make such an assumption is that the variables might have other values if they are uninitialized or come from unknown sources.
MSVC CL19 确实编写了假定 bool
函数参数为 0 或 1 的代码,因此 Windows x86-64 ABI 必须保证这一点.
MSVC CL19 does make code that assumes bool
function args are 0 or 1, so the Windows x86-64 ABI must guarantee this.
在 x86-64 System V ABI(由 Windows 以外的所有设备使用),修订版 0.98 的更改日志显示指定 _Bool
(又名 bool
)在调用者处被布尔化."我认为甚至在更改之前,编译器就已经假设了它,但这只是记录了编译器已经依赖的内容.x86-64 SysV ABI 中的当前语言是:
In the x86-64 System V ABI (used by everything other than Windows), the changelog for revision 0.98 says "Specify that _Bool
(aka bool
) is booleanized at the caller." I think even before that change, compilers were assuming it, but this just documents what compilers were already relying on. The current language in the x86-64 SysV ABI is:
3.1.2 数据表示
布尔值,当存储在内存对象中时,存储为单字节对象,其值始终为 0(假)或 1(真).当存储在整数寄存器中时(除了作为参数传递),寄存器的所有 8 个字节都是有效的;任何非零值都被认为是真的.
Booleans, when stored in a memory object, are stored as single byte objects the value of which is always 0 (false) or 1 (true). When stored in integer registers (except for passing as arguments), all 8 bytes of the register are significant; any nonzero value is considered true.
第二句话是胡说八道:ABI 没有告诉编译器如何将内容存储在函数内部的寄存器中,只是在不同编译单元之间的边界(内存/函数参数和返回值).我不久前报告了这个 ABI 缺陷在维护它的 github 页面上.
The second sentence is nonsense: the ABI has no business telling compilers how to store things in registers inside a function, only at boundaries between different compilation units (memory / function args and return values). I reported this ABI defect a while ago on the github page where it's maintained.
3.2.3 参数传递:
当一个 _Bool
类型的值被返回或传入寄存器或堆栈时,位 0 包含真值,位 1 到 7 应为零16.
When a value of type _Bool
is returned or passed in a register or on the stack, bit 0 contains the truth value and bits 1 to 7 shall be zero16.
(脚注 16):其他位未指定,因此 这些值的消费者端可以依赖它在截断为 8 位时为 0 或 1.
(footnote 16): Other bits are left unspecified, hence the consumer side of those values can rely on it being 0 or 1 when truncated to 8 bit.
i386 System V ABI 中的语言是相同的,IIRC.
The language in the i386 System V ABI is the same, IIRC.
任何假设一件事为 0/1(例如转换为 int
)但在其他情况下未能利用它的编译器都有错过了优化.不幸的是,这种错过的优化仍然存在,尽管它们比 Agner 写关于编译器总是重新布尔化的那段时更少.
Any compiler that assumes 0/1 for one thing (e.g. conversion to int
) but fails to take advantage of it in other cases has a missed optimization. Unfortunately such missed-optimizations still exist, although they are rarer than when Agner wrote that paragraph about compilers always re-booleanizing.
(在 Godbolt 编译器浏览器 用于 gcc4.6/4.7 和 cl昂/MSVC.另请参阅 Matt Godbolt 的 CppCon2017 演讲 我的编译器最近为我做了什么?解开编译器的盖子)
bool logical_or(bool a, bool b) { return a||b; }
# gcc4.6.4 -O3 for the x86-64 System V ABI
test dil, dil # test a against itself (for non-zero)
mov eax, 1
cmove eax, esi # return a ? 1 : b;
ret
所以即使 gcc4.6 也没有重新布尔化 b
,但它确实错过了 gcc4.7 所做的优化:(以及 clang 和更高版本的编译器,如其他答案所示):
So even gcc4.6 didn't re-booleanize b
, but it did miss the optimization that gcc4.7 makes: (and clang and later compilers as shown in other answers):
# gcc4.7 -O3 to present: looks ideal to me.
mov eax, esi
or eax, edi
ret
(Clang 的 or dil, sil
/mov eax, edi
很傻:当读取 edi
在编写 dil
之后,由于需要 REX 前缀来使用 edi 的低 8 部分,因此代码大小更差.更好的选择可能是 或 dil,sil
/movzx eax, dil
如果您想避免读取任何 32 位寄存器,以防调用者留下一些带有脏"部分寄存器的 arg 传递寄存器.)
(Clang's or dil, sil
/ mov eax, edi
is silly: it's guaranteed to cause a partial-register stall on Nehalem or earlier Intel when reading edi
after writing dil
, and it has worse code size from needing a REX prefix to use the low-8 part of edi. A better choice might be or dil,sil
/ movzx eax, dil
if you want to avoid reading any 32-bit registers in case your caller left some arg-passing registers with "dirty" partial registers.)
MSVC 发出此代码,分别检查 a
然后 b
,完全无法利用任何东西,甚至使用 xoral,al
而不是 xor eax,eax
.所以它对大多数 CPU 上的旧值 eax
有错误的依赖(包括 Haswell/Skylake,它们不会将低 8 部分 regs 与整个寄存器分开重命名,只有 AH/BH/...).这只是愚蠢的.使用 xor al,al
的唯一原因是当您明确想要保留高位字节时.
MSVC emits this code that checks a
then b
separately, completely failing to take advantage of anything, and even using xor al,al
instead of xor eax,eax
. So it has a false dependency on the old value of eax
on most CPUs (including Haswell/Skylake, which don't rename low-8 partial regs separately from the whole register, only AH/BH/...). This is just dumb. The only reason to ever use xor al,al
is when you explicitly want to preserve the upper bytes.
logical_or PROC ; x86-64 MSVC CL19
test cl, cl ; Windows ABI passes args in ecx, edx
jne SHORT $LN3@logical_or
test dl, dl
jne SHORT $LN3@logical_or
xor al, al ; missed peephole: xor eax,eax is strictly better
ret 0
$LN3@logical_or:
mov al, 1
ret 0
logical_or ENDP
ICC18 也没有利用输入的已知 0/1 特性,它只是使用 or
指令根据两个输入的按位或设置标志,并且 setcc
产生一个 0/1.
ICC18 also doesn't take advantage of the known 0/1 nature of the inputs, it just uses an or
instruction to set flags according to the bitwise OR of the two inputs, and setcc
to produce a 0/1.
logical_or(bool, bool): # ICC18
xor eax, eax #4.42
movzx edi, dil #4.33
movzx esi, sil #4.33
or edi, esi #4.42
setne al #4.42
ret #4.42
即使对于 bool bitwise_or(bool a, bool b) { return a|b; ICC 也会发出相同的代码}代码>.它提升为
int
(使用 movzx
),并使用 or
根据位或设置标志.与 或 dil,sil
/setne al
相比,这很愚蠢.
ICC emits the same code even for bool bitwise_or(bool a, bool b) { return a|b; }
. It promotes to int
(with movzx
), and uses or
to set flags according to the bitwise OR. This is dumb compared to or dil,sil
/ setne al
.
对于 bitwise_or
,MSVC 确实只使用 or
指令(在每个输入的 movzx
之后),但无论如何都不会重新布尔化.
For bitwise_or
, MSVC does just use an or
instruction (after movzx
on each input), but anyway doesn't re-booleanize.
只有 ICC/MSVC 用上面的简单函数在做哑代码,但是这个函数仍然给 gcc 和 clang 带来麻烦:
Only ICC/MSVC were making dumb code with the simple function above, but this function still gives gcc and clang trouble:
int select(bool a, bool b, int x, int y) {
return (a&&b) ? x : y;
}
<强> 源+ ASM在Godbolt编译器资源管理器(与上次相比,源相同,选择了不同的编译器).
Source+asm on the Godbolt compiler explorer (Same source, different compilers selected vs. last time).
看起来很简单;你会希望智能编译器可以通过一个 test
/cmov
无分支地完成它.x86 的 test
指令根据位与设置标志.这是一个 AND 指令,实际上并没有写入目标.(就像 cmp
是一个不写目的地的 sub
一样).
Looks simple enough; you'd hope that a smart compiler would do it branchlessly with one test
/cmov
. x86's test
instruction sets flags according to a bitwise AND. It's an AND instruction that doesn't actually write the destination. (Just like cmp
is a sub
that doesn't write the destination).
# hand-written implementation that no compilers come close to making
select:
mov eax, edx # retval = x
test edi, esi # ZF = ((a & b) == 0)
cmovz eax, ecx # conditional move: return y if ZF is set
ret
但即使是 Godbolt 编译器资源管理器上的 gcc 和 clang 的日常构建,也会使代码变得更加更加复杂,分别检查每个布尔值.如果您返回 ab
,他们知道如何优化 bool ab = a&&b;
,但即使是这样写(使用单独的布尔变量来保存结果)没有设法让他们编写不糟糕的代码.
But even the daily builds of gcc and clang on the Godbolt compiler explorer make much more complicated code, checking each boolean separately. They know how to optimize bool ab = a&&b;
if you return ab
, but even writing it that way (with a separate boolean variable to hold the result) doesn't manage to hand-hold them into making code that doesn't suck.
请注意,test same,same
完全等同于 cmp reg, 0
,而且更小,所以它是编译器使用的.
Note that test same,same
is exactly equivalent to cmp reg, 0
, and is smaller, so it's what compilers use.
Clang 的 版本比我的手写版本差很多.(请注意,它要求调用者将 bool
参数零扩展为 32 位,就像它对窄整数类型一样,作为它和 gcc 实现的 ABI 的非官方部分,但只有 clang 依赖于).
Clang's version is strictly worse than my hand-written version. (Note that it requires that the caller zero-extended the bool
args to 32-bit, like it does for narrow integer types as an unofficial part of the ABI which it and gcc implement but only clang depends on).
select: # clang 6.0 trunk 317877 nightly build on Godbolt
test esi, esi
cmove edx, ecx # x = b ? y : x
test edi, edi
cmove edx, ecx # x = a ? y : x
mov eax, edx # return x
ret
gcc 8.0.0 20171110 nightly 为此制作分支代码,类似于旧版 gcc 所做的.
gcc 8.0.0 20171110 nightly makes branchy code for this, similar to what older gcc versions do.
select(bool, bool, int, int): # gcc 8.0.0-pre 20171110
test dil, dil
mov eax, edx ; compiling with -mtune=intel or -mtune=haswell would keep test/jcc together for macro-fusion.
je .L8
test sil, sil
je .L8
rep ret
.L8:
mov eax, ecx
ret
MSVC x86-64 CL19 生成非常相似的分支代码.它针对 Windows 调用约定,其中整数 args 位于 rcx、rdx、r8、r9 中.
MSVC x86-64 CL19 makes very similar branchy code. It's targeting the Windows calling convention, where integer args are in rcx, rdx, r8, r9.
select PROC
test cl, cl ; a
je SHORT $LN3@select
mov eax, r8d ; retval = x
test dl, dl ; b
jne SHORT $LN4@select
$LN3@select:
mov eax, r9d ; retval = y
$LN4@select:
ret 0 ; 0 means rsp += 0 after popping the return address, not C return 0.
; MSVC doesn't emit the `ret imm16` opcode here, so IDK why they put an explicit 0 as an operand.
select ENDP
ICC18 也制作分支代码,但在分支之后有两个 mov
指令.
ICC18 also makes branchy code, but with both mov
instructions after the branches.
select(bool, bool, int, int):
test dil, dil #8.13
je ..B4.4 # Prob 50% #8.13
test sil, sil #8.16
jne ..B4.5 # Prob 50% #8.16
..B4.4: # Preds ..B4.2 ..B4.1
mov edx, ecx #8.13
..B4.5: # Preds ..B4.2 ..B4.4
mov eax, edx #8.13
ret #8.13
试图帮助编译器使用
int select2(bool a, bool b, int x, int y) {
bool ab = a&&b;
return (ab) ? x : y;
}
导致 MSVC 编写非常糟糕的代码:
;; MSVC CL19 -Ox = full optimization
select2 PROC
test cl, cl
je SHORT $LN3@select2
test dl, dl
je SHORT $LN3@select2
mov al, 1 ; ab = 1
test al, al ;; and then test/cmov on an immediate constant!!!
cmovne r9d, r8d
mov eax, r9d
ret 0
$LN3@select2:
xor al, al ;; ab = 0
test al, al ;; and then test/cmov on another path with known-constant condition.
cmovne r9d, r8d
mov eax, r9d
ret 0
select2 ENDP
这仅适用于 MSVC(并且 ICC18 对刚刚设置为常量的寄存器上的 test/cmov 进行了相同的优化).
This is only with MSVC (and ICC18 has the same missed optimization of test/cmov on a register that was just set to a constant).
gcc 和 clang 像往常一样不会让代码像 MSVC 那样糟糕;他们为 select()
做同样的事情,这仍然不好,但至少尝试帮助他们不会像 MSVC 那样让事情变得更糟.
gcc and clang as usual don't make code as bad as MSVC; they make the same asm they do for select()
, which is still not good but at least trying to help them doesn't make it worse like with MSVC.
在我非常有限的测试中,|
和 &
似乎比 ||
和 &&
用于 MSVC 和 ICC.使用您的编译器 + 编译选项查看您自己代码的编译器输出,看看会发生什么.
In my very limited testing, |
and &
seem to work better than ||
and &&
for MSVC and ICC. Look at the compiler output for your own code with your compiler + compile options to see what happens.
int select_bitand(bool a, bool b, int x, int y) {
return (a&b) ? x : y;
}
Gcc 仍然在两个输入的单独 test
上单独分支,代码与其他版本的 select
相同.clang 仍然执行两个单独的 test/cmov
,与其他源版本的 asm 相同.
Gcc still branches separately on separate test
s of the two inputs, same code as the other versions of select
. clang still does two separate test/cmov
, same asm as for the other source versions.
MSVC 通过并正确优化,击败了所有其他编译器(至少在独立定义中):
MSVC comes through and optimizes correctly, beating all the other compilers (at least in the stand-alone definition):
select_bitand PROC ;; MSVC
test cl, dl ;; ZF = !(a & b)
cmovne r9d, r8d
mov eax, r9d ;; could have done the mov to eax in parallel with the test, off the critical path, but close enough.
ret 0
ICC18 浪费了两条 movzx
指令,将 bool
s 零扩展为 int
,然后生成与 MSVC 相同的代码
ICC18 wastes two movzx
instructions zero-extending the bool
s to int
, but then makes the same code as MSVC
select_bitand: ## ICC18
movzx edi, dil #16.49
movzx esi, sil #16.49
test edi, esi #17.15
cmovne ecx, edx #17.15
mov eax, ecx #17.15
ret #17.15
这篇关于编译器中的 8 位布尔值.对它们的操作效率低吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持编程学习网!