It's not that unreasonable to ask for *hardware* support for overflow checking. ...

Tuna-Fish · on Dec 18, 2014

> ARM has a sticky overflow flag "Q".

This is surprisingly expensive to implement fast and correctly in a modern OoO core. The most likely implementation would end up taking Q as a dependency for every qadd/sub, which would be terrible.

> You can somewhat do the same purely in software with the compiler, but the lack of a "sticky" flag means it'll need to access - and potentially stall - at each point where flags would be overwritten by another ALU instruction.

The jump depends on flags, but by definition has no further dependencies. So long as it's untaken and predicted as untaken the cpu can never stall on it.

anewhnaccount · on Dec 18, 2014

I'm pretty rusty so excuse me if I'm spouting nonsense, but I think MRS (read register file) has a dependency on the value of Q (so qadd/qsub/clear), and operations which OR a new value into Q (qadd/qsub, which do q |= 0 or q |= 1) have a dependency on operations which change the register file (MSR). Basically qadd and qsub commute on Q. I'm a bit fuzzy on the details, but I think this means that you just have to stall/predict when there's about to be a branch anyway. Please correct me/clarify if I'm wrong.

chrisseaton · on Dec 18, 2014

Yeah you have to predict - so just predict not taken, and treat JO as an uncommon trap. I think that works well in practice.

solarexplorer · on Dec 18, 2014

If you assume that reading (and clearing) the Q bit is infrequent, you can do something much cheaper. Because the bit is sticky, the order in which you set it does not matter. And when you need to read (or to clear) the Q bit, you just block the instruction that wants to read/clear until the OoO core is empty.
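A minimal sketch of the order-independence claim, modeled in Python rather than hardware. The function name `final_q` and the list of overflow outcomes are made up for illustration; the only point is that a sticky flag is an OR over all updates, so every ordering gives the same final value.

```python
from itertools import permutations

def final_q(overflow_bits):
    """Fold per-instruction overflow bits into a sticky flag, starting from a cleared Q."""
    q = False
    for bit in overflow_bits:
        q = q or bit  # sticky: once set, it stays set until explicitly cleared
    return q

# Hypothetical overflow outcomes of four saturating ops, in program order.
bits = [False, True, False, False]

# Collect the final Q for every possible execution order.
results = {final_q(order) for order in permutations(bits)}
```

Here `results` contains a single value, which is what makes deferring the read until the core has drained sufficient for the final Q. It says nothing about intermediate values of Q.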

Tuna-Fish · on Dec 18, 2014

An OoO core with speculation has to be able to roll back arbitrary instructions when an earlier instruction causes a visible trap after a later instruction has been executed. Because of this, the core needs to be able to reconstruct the value of Q at any point of execution, even if it's not actually needed then. So the order matters.

solarexplorer · on Dec 18, 2014

Good point. So then instead of updating Q out-of-order you could update at commit. You just need to block any instruction that reads Q until the OoO core is empty. This is _much_ cheaper to implement than adding a dependency on all instructions.

WayneS · on Dec 18, 2014

> A sticky overflow flag would be great... "if properly scheduled"

That is the rub right there. Think about what this means. Every ALU operation needs to read whatever register contains this state and write a new value. So every ALU operation becomes serially dependent on all previous operations. You have just totally killed performance on out-of-order processors.

Processor designers work really hard to avoid these kinds of operations. In fact between the Pentium and the Pentium Pro (aka P6) the behavior of condition codes on several x86 instructions was changed from "keep old value" to "always cleared". We couldn't find any code that cared and it was never previously documented.

This code doesn't have to use branches, it could also OR the condition codes to some register after every operation. This would also hurt performance, but only for code that cares instead of all code.
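A rough sketch of that branch-free variant, modeled in Python with simulated 32-bit integers. The helper names (`wrap32`, `add32_flagged`, `sum_with_sticky_flag`) are invented for this example, and the overflow bit is derived by comparing against the exact sum, where real code would read the hardware flag instead.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def wrap32(x):
    """Reduce an unbounded Python int to a 32-bit two's-complement value."""
    return (x + 2**31) % 2**32 - 2**31

def add32_flagged(a, b):
    """Return (wrapped_sum, overflowed), like an add that also sets an overflow flag."""
    exact = a + b
    return wrap32(exact), not (INT32_MIN <= exact <= INT32_MAX)

def sum_with_sticky_flag(values):
    """Accumulate values, OR-ing each overflow bit into a software 'sticky' flag."""
    total, sticky = 0, False
    for v in values:
        total, ovf = add32_flagged(total, v)
        sticky |= ovf  # no branch per operation; the flag is inspected once at the end
    return total, sticky
```

Only the caller that cares checks `sticky`, once, after the whole computation. Code that never looks at the flag pays nothing extra in this model.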

stephencanon · on Dec 18, 2014

Floating-point flags behave in exactly this "sticky" fashion on every mainstream architecture, and processor designers don't have trouble breaking the apparent serial dependency. It's annoying, but it's well-understood, and there are widely-used techniques that address it.

gsg · on Dec 18, 2014

Itanium style not-a-thing constructs might be interesting for that.

Not-a-thing is basically a tagging scheme where the contents of registers are extended with a bit indicating whether the value is valid or not. Arithmetic operations on a NaT produce a NaT: branches or stores which consume a NaT raise an error. Given arithmetic instructions that generate NaT on overflow, the wrong result will be prevented from affecting the state of the program without explicit tests or jumps.

I'm not sure what the implementation costs would be on the hardware side, though.
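To make the tagging scheme concrete, here is a small Python model. It is only a sketch: `Reg`, `add` and `store` are hypothetical names, the 32-bit width is an assumption, and the "add that generates NaT on overflow" is the hypothetical instruction from the comment above, not a claim about what Itanium's own add does.

```python
from dataclasses import dataclass

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

@dataclass(frozen=True)
class Reg:
    value: int
    nat: bool = False  # the extra "not a thing" tag bit

def add(a: Reg, b: Reg) -> Reg:
    """Add that yields a NaT on overflow, or when either input is already a NaT."""
    if a.nat or b.nat:
        return Reg(0, nat=True)  # NaT propagates through arithmetic
    exact = a.value + b.value
    if not (INT32_MIN <= exact <= INT32_MAX):
        return Reg(0, nat=True)  # overflow produces a NaT instead of a wrong value
    return Reg(exact)

def store(memory: dict, addr: int, r: Reg) -> None:
    """Consuming a NaT is the point where the error finally surfaces."""
    if r.nat:
        raise ArithmeticError("store consumed a NaT")
    memory[addr] = r.value
```

In this model a chain such as `add(add(Reg(INT32_MAX), Reg(1)), Reg(5))` stays a NaT all the way through, and nothing is written to `memory` when it reaches `store`.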

Rusky · on Dec 18, 2014

The Mill CPU also works like that: they have instructions for wrapping, overflow, and saturated add, and the overflow ones generate viral bad values that only trap on store/branch/etc.

Given that it's generating the same bits as x86, but with the flag stored with the output rather than in a separate register, the hardware cost shouldn't be more than one more level to select the actual result.
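For readers who have not met the three add flavors side by side, a Python sketch on simulated 32-bit integers follows. The function names are invented for this example and are not Mill mnemonics, and `None` is just a stand-in for the "bad value" marker carried alongside the result.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def add_wrapping(a, b):
    """Two's-complement wraparound: the high bits are silently dropped."""
    return (a + b + 2**31) % 2**32 - 2**31

def add_saturating(a, b):
    """Clamp the exact sum to the representable range."""
    return max(INT32_MIN, min(INT32_MAX, a + b))

def add_marking(a, b):
    """Return the sum, or None as a stand-in for a 'bad value' on overflow."""
    exact = a + b
    return exact if INT32_MIN <= exact <= INT32_MAX else None
```

All three compute the same exact sum first and differ only in what they select as the result, which is the "one more level to select" mentioned above.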

This is also how they handle illegal reads, which (along with vector smear and pick instructions) makes vectorizing loops possible when they might otherwise cross protection boundaries.

thesz · on Dec 18, 2014

You wouldn't believe the cost of hardware support.

The addition is usually done using a carry-lookahead scheme [1]. This scheme has a depth of O(log(N)) (N being the number of bits). For 64 bits it is k * 6, and k is about 1.5. So you are looking at a depth of ~9 logical operations.
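The arithmetic behind that estimate, as a tiny Python helper. `adder_depth` is a made-up name and k = 1.5 is the constant quoted above, not a measured value.

```python
import math

def adder_depth(bits, k=1.5):
    """Estimated logic depth of a carry-lookahead adder: k * log2(bits)."""
    return k * math.log2(bits)

depth_64 = adder_depth(64)  # 1.5 * 6
depth_32 = adder_depth(32)  # 1.5 * 5
```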

The computation of overflow uses bits from both operands and result. You also have to store that overflow bit somewhere. This means that 1) you have to add logical operations (usually two) to compute it and 2) you have to lay wires to store results of computations. Either way you waste timing resources (logical ops for wider processes, wires for thinner ones).
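For reference, the usual way signed overflow is derived from the operands and the result can be sketched in Python as below. The 32-bit width and the name `add_with_overflow` are assumptions for the example. Overflow occurs exactly when both operands have the same sign and the result's sign differs from it.

```python
BITS = 32
MASK = (1 << BITS) - 1
SIGN = 1 << (BITS - 1)

def add_with_overflow(a, b):
    """a and b are BITS-wide two's-complement bit patterns. Return (result, overflow)."""
    r = (a + b) & MASK
    # The sign bit of (a ^ r) & (b ^ r) is set only if r's sign differs from both inputs.
    overflow = bool((a ^ r) & (b ^ r) & SIGN)
    return r, overflow
```

The two XORs and the AND are the kind of extra logic being counted here. They can only start once the top bit of the sum is available.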

For superscalar execution you end up with another result dependence to resolve and mostly ignore.

In the end you add about 5-10% overhead to the clock cycle time due to constant checking for overflow.

I.e., your request will make all computers more expensive to operate.

For a software engineer, I have relatively extensive experience in designing hardware (accelerated video controller for STB, no less, from algorithmic prototype to tests). My modus operandi in that area is that you should do in hardware only what you cannot do in software.

Let's compare SPARC and MIPS. SPARC has a status register (the hardware support for overflow you asked for) and MIPS doesn't. SPARC also has a complicated register file, but it is out of the control path, which is register-register addition, you wouldn't believe. SPARC and MIPS are equivalent otherwise. We estimated the operating frequencies of the SPARC and MIPS ALUs for a 0.13um process: it was 400-450MHz for SPARC and about 500MHz for MIPS, without any low-level tweaking. We have here a 10% difference in speed. MIPS would be even faster if we ditched the ADD/ADDI/SUB/SUBI instructions (add/subtract with overflow checking).

The same is true for OpenRISC and RISC-V. OpenRISC generates exceptions for any sneeze that may happen, RISC-V continues. Guess what is easier to develop, test and will be faster in the end.

Please, do not add to hardware any functionality you really do not need. You can check for integer overflow statically and generate special code that will raise exceptions where you cannot prove their absence conclusively. This is already done for division by zero for the MIPS target (check it out, it is amazing to see the difference between -O0 and -O3), and it can be done for integer additions.

[1] http://en.wikipedia.org/wiki/Carry-lookahead_adder

scott_s · on Dec 18, 2014

How do you feel about John Regehr's suggestion (http://blog.regehr.org/archives/1154) mentioned below? His suggestion sounds reasonable to me because the instructions would optionally trap on overflow. That should avoid the cost (which I have an intuition for) for the common case.

thesz · on Dec 18, 2014

x86 has a special INTO instruction to trap on overflow. Thus I don't know what he is talking about.

http://www.electronics.dit.ie/staff/tscarff/8086_instruction...

You can do without carry and overflow flags in long arithmetic just fine.

dezgeg · on Dec 18, 2014

INTO is not supported in 64-bit mode.

solarexplorer · on Dec 18, 2014

A flag register is indeed an additional dependency and may end up in the critical path. But x86 and ARM already have a flag register. They already pay the cost.

And you don't need a flag register to check for overflows. Trap on overflow (like Alpha) will work just as well. The difference is that traps are infrequent so you don't have to make them fast, just correct. You don't have to raise the trap in the same cycle that you calculate the integer operation. You just have to do it before the commit stage. (And the last time I checked, the Alphas were quite a bit faster than MIPS.)

Of course hardware support implies some overhead and more complexity. I can see why people would oppose it. But there really is software that would greatly benefit from hardware support.

emjaygee · on Dec 18, 2014

Modern processors can generate different micro-ops depending on whether the flags are observed. In old non-pipelined/non-speculative/non-rewriting processors what you said is true but all bets are off in the world of massively funded x86 processor development.

thesz · on Dec 18, 2014

The trap on overflow will need to compute said overflow flag. Yep, it is not a register. Yep, you still have to compute it. And here you again pay for things that are 1) infrequent and 2) mostly statically inferred.

The Alphas were faster than MIPS due to manual design. Instead of using standard components they laid everything out by hand and they often used domino logic, AFAIK. Classic five stage pipeline MIPS implemented with domino logic can easily get 1GHz (twice as fast as direct synthesis).

Domino logic: http://en.wikipedia.org/wiki/Domino_logic

And software would benefit from static error checking much more than from hardware support for integer overflow.

pslam · on Dec 18, 2014

I am not describing constant checking of overflow. The sticky flag is the OR of any overflow flag between the last flag-clear operation and a speculation pivot.

It does not need evaluating between every operation. In fact in the example I gave (ARM SIMD) the entire point of its design is exactly so that you don't need to incur the latency of the flag generation logic for every operation.

I wouldn't dream of implementing anything quite as deliberately inefficient as what you're describing. We're decades beyond the point of two instructions serially depending on each other.

thesz · on Dec 18, 2014

What I am saying is that for 3% of real use you sacrifice 5%+ performance for all other operations (all 100% of them) and make your future designs much more complex (not 5%).

If you add overflow checking into hardware, you are doing a disservice to all your users. Pure and simple.

Hardware always evaluates everything. It is hard for a software engineer to grasp, but it is true. The little overflow flag and the scheme for its computation will sit there draining energy all the time.

pslam · on Dec 18, 2014

Intel/AMD already handles flag-setting in this fashion because the vast majority of ALU instructions have a flag side-effect.

There is an additional latency to get flag results. The latency is not paid unless you depend on them. Overwriting is not a dependency.

Other architectures get around the cost altogether by having a separate "set" bit in their ISA, e.g. ARM. If you don't want flags modified, then don't specify it, e.g. ADD instead of ADDS.

I think you are mistaken by multiple orders of magnitude about how much power it would cost, especially when taking into account the rest of the system.

lifthrasiir · on Dec 18, 2014

Yes! John Regehr seriously advocates that [1] and I fully agree with him.

[1] http://blog.regehr.org/archives/1154

solarexplorer · on Dec 18, 2014

Yes, AMD should have added it to x64. It isn't exactly new either. The Alpha's integer instructions had a bit in the instruction word that indicated if an overflow should cause a trap or not. IMHO that is the ideal solution.

thesz · on Dec 18, 2014

No, it is not an ideal solution. It causes a decrease in clock frequency of about 10% (or you have to make your design noticeably more complex).

The trap in Alpha ISA was done for MIPS compatibility reasons (Alpha was more or less MIPS assembly compatible).

solarexplorer · on Dec 18, 2014

The sticky overflow bit is useful when you do signal processing with integers. To detect bugs it would be better to have a real exception that indicates which instruction caused the overflow. But to do that you would have to change/extend the instruction set, which is rather unlikely. ARM and AMD missed the chance to add this when they migrated to 64-bit ISAs.