208. f64::mul_add — One Rounding, One Instruction, Better Accuracy
a * b + c rounds twice and compiles to a separate multiply and add, because Rust does not fuse the two on its own. a.mul_add(b, c) computes the product and sum exactly, rounds once, and folds into a single instruction when the build targets a CPU with FMA.
Two roundings vs. one
Floating-point math rounds after every operation. Writing a * b + c first rounds the product a * b to the nearest f64, then rounds again after adding c. That intermediate rounding throws away bits before they ever reach the sum:
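A minimal sketch of that comparison. The helper names `unfused` and `fused` are hypothetical, and the inputs are chosen (as an assumption for illustration) so that the exact product is 1 - 2^-60, which does not fit in an f64:

```rust
/// Unfused form: the product is rounded to f64, then the sum is rounded again.
fn unfused(a: f64, b: f64, c: f64) -> f64 {
    a * b + c
}

/// Fused form: a * b + c is computed exactly and rounded once.
fn fused(a: f64, b: f64, c: f64) -> f64 {
    a.mul_add(b, c)
}

fn main() {
    // Tidy values: the product is exactly representable, so both forms agree.
    assert_eq!(unfused(2.0, 3.0, 4.0), fused(2.0, 3.0, 4.0));

    // eps = 2^-30, built by exact division of powers of two.
    let eps = 1.0 / (1u64 << 30) as f64;
    let (a, b, c) = (1.0 + eps, 1.0 - eps, -1.0);

    // The exact product 1 - 2^-60 rounds to 1.0 as an f64,
    // so the unfused sum cancels to zero.
    println!("unfused: {:e}", unfused(a, b, c));

    // The fused form keeps the exact product until the final rounding
    // and returns -2^-60, which is -(eps * eps).
    println!("fused:   {:e}", fused(a, b, c));
}
```

The unfused result has lost every significant bit of the true answer, while the fused result is exact.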
For tidy values the two forms agree. The difference shows up when the exact product needs more bits than an f64 can hold and adding c cancels most of it, so the bits dropped by the first rounding are the ones that matter in the result. These are exactly the catastrophic-cancellation cases that wreck numeric code.
Where it pays off: Horner’s method
Polynomial evaluation is built out of multiply-then-add, so it’s the textbook home for mul_add. To evaluate 2x² + 3x + 4 with Horner’s method you nest the operations as (2x + 3)x + 4, and each level of nesting is a single mul_add:
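A sketch of how this might look. `poly` hard-codes the polynomial from the text, and `horner` is a hypothetical generalization that assumes coefficients are passed highest degree first:

```rust
/// Evaluates 2x² + 3x + 4 as (2x + 3)x + 4, one mul_add per nesting level.
fn poly(x: f64) -> f64 {
    2.0_f64.mul_add(x, 3.0).mul_add(x, 4.0)
}

/// Hypothetical general form: coefficients ordered from highest degree
/// to the constant term. An empty slice evaluates to 0.0.
fn horner(coeffs: &[f64], x: f64) -> f64 {
    let mut iter = coeffs.iter();
    let first = match iter.next() {
        Some(&c) => c,
        None => return 0.0,
    };
    // acc * x + c, fused, for each remaining coefficient.
    iter.fold(first, |acc, &c| acc.mul_add(x, c))
}

fn main() {
    // 2 * 4 + 3 * 2 + 4 = 18
    println!("{}", poly(2.0));
    println!("{}", horner(&[2.0, 3.0, 4.0], 2.0));
}
```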
Each mul_add rounds once per step, where a separate * and + would round twice, so less error accumulates through the chain. The same pattern carries over to dot products, which are a sum of products:
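One way to write it, as a sketch. The function name `dot` is hypothetical, and it assumes the two slices have equal length (`zip` silently stops at the shorter one):

```rust
/// Dot product as a chain of fused multiply-adds: acc = x * y + acc.
/// Assumes xs and ys have the same length; extra elements are ignored.
fn dot(xs: &[f64], ys: &[f64]) -> f64 {
    xs.iter()
        .zip(ys)
        .fold(0.0_f64, |acc, (&x, &y)| x.mul_add(y, acc))
}

fn main() {
    // 1*4 + 2*5 + 3*6 = 32
    println!("{}", dot(&[1.0, 2.0, 3.0], &[4.0, 5.0, 6.0]));
}
```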
The honest caveat
mul_add is faster only when the compiled code can use a hardware fused-multiply-add instruction. AArch64 has one in its baseline, but the default x86-64 targets do not assume FMA even though most modern x86-64 chips have it, so there you need target-feature/target-cpu flags to get the single instruction. Where the instruction isn’t available, the standard library has to emulate the exact single-rounding semantics in software, and that emulation is slower than a plain a * b + c. So as a speed tool it belongs on hot paths: reach for it in tight numeric loops on hardware you control (or behind target-feature/target-cpu flags). When you need the extra accuracy, lean on it freely regardless of speed.
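For the speed-only case, one possible approach is to pick the form at compile time. The sketch below uses a hypothetical helper named `mul_add_fast`, and it assumes you accept slightly different results between builds, since one branch rounds once and the other rounds twice:

```rust
/// Hypothetical speed-oriented helper: fuses only when the build target
/// is known to have a hardware FMA instruction.
/// Note: results can differ between targets (one rounding vs. two).
#[inline]
fn mul_add_fast(a: f64, b: f64, c: f64) -> f64 {
    // `target_feature = "fma"` is set on x86/x86-64 builds compiled with
    // e.g. `-C target-feature=+fma` or `-C target-cpu=native`.
    // AArch64 includes fused multiply-add in its baseline.
    if cfg!(any(target_feature = "fma", target_arch = "aarch64")) {
        a.mul_add(b, c)
    } else {
        a * b + c
    }
}

fn main() {
    println!("{}", mul_add_fast(2.0, 3.0, 4.0));
    println!("hardware FMA assumed: {}",
        cfg!(any(target_feature = "fma", target_arch = "aarch64")));
}
```

If accuracy is the reason you reached for mul_add, skip a wrapper like this and call mul_add directly, since the fallback branch gives up the single rounding.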
The bottom line
mul_add gives you a single rounding step and, on capable hardware, a single instruction for a * b + c. Use it in polynomial evaluation, dot products, and any multiply-accumulate loop where precision or throughput matters — and remember it can regress on FMA-less targets, so measure when speed is the goal.