#208 Jun 17, 2026

208. f64::mul_add — One Rounding, One Instruction, Better Accuracy

a * b + c rounds twice and may compile to two separate operations. a.mul_add(b, c) computes the whole thing at full precision, rounds once, and on a CPU with FMA folds into a single instruction.

Two roundings vs. one

Floating-point math rounds after every operation. Writing a * b + c first rounds the product a * b to the nearest f64, then rounds again after adding c. That intermediate rounding throws away bits before they ever reach the sum:

let a = 1.0_f64;
let b = 2.0_f64;
let c = 3.0_f64;

// Two roundings: (a*b) rounded, then (+ c) rounded
let separate = a * b + c;

// One rounding: a*b + c evaluated at full precision, rounded once
let fused = a.mul_add(b, c);

assert_eq!(separate, 5.0);
assert_eq!(fused, 5.0);

For tidy values they agree. The difference shows up when the exact product needs more bits than an f64 can hold and the addition then cancels the leading bits, so the low-order bits that the first rounding discarded are all that should have remained. These are the catastrophic-cancellation cases that wreck numeric code.
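A minimal sketch of such a case, with inputs picked for illustration (they are not from any real workload): the exact product of 1 + 2^-30 and 1 - 2^-30 is 1 - 2^-60, which does not fit in an f64 and rounds to 1.0. Adding -1.0 afterwards leaves nothing, while the fused version keeps the tiny remainder. The function name `cancellation_demo` is hypothetical.

```rust
/// Returns (separate, fused) for inputs where the two results disagree.
fn cancellation_demo() -> (f64, f64) {
    let eps = 1.0 / (1u64 << 30) as f64; // exactly 2^-30
    let a = 1.0 + eps; // exactly representable
    let b = 1.0 - eps; // exactly representable
    let c = -1.0_f64;

    // Exact product is 1 - 2^-60; as an f64 it rounds to 1.0,
    // so the subtraction cancels everything.
    let separate = a * b + c;

    // The product is kept exact until the single final rounding,
    // so the -2^-60 remainder survives.
    let fused = a.mul_add(b, c);

    (separate, fused)
}

fn main() {
    let (separate, fused) = cancellation_demo();
    println!("separate = {separate:e}");
    println!("fused    = {fused:e}");
    assert_eq!(separate, 0.0);
    assert!(fused < 0.0);
}
```

Here the unfused result is exactly zero and the fused result is exactly -2^-60, so the relative error of the two-step version is 100%.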

Where it pays off: Horner’s method

Polynomial evaluation is built out of multiply-then-add, so it’s the textbook home for mul_add. To evaluate 2x² + 3x + 4 with Horner’s method you nest the operations, and each step is one fused step:

fn poly(x: f64) -> f64 {
    // ((2*x) + 3)*x + 4
    2.0_f64.mul_add(x, 3.0).mul_add(x, 4.0)
}

assert_eq!(poly(2.0), 18.0); // 2*4 + 3*2 + 4

Each mul_add rounds once per step instead of once at the * and again at the +, so the chain performs half as many roundings. The same pattern carries over to dot products, which are a sum of products:

fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter()
        .zip(b)
        .fold(0.0, |acc, (&x, &y)| x.mul_add(y, acc))
}

assert_eq!(dot(&[1.0, 2.0, 3.0], &[4.0, 5.0, 6.0]), 32.0);
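The hard-coded Horner chain from earlier can probably be generalized with the same fold shape as the dot product. This is a sketch, not a tuned implementation: the name `horner` and the convention that coefficients come highest degree first are assumptions made for the example.

```rust
/// Evaluates a polynomial with Horner's method, one fused step per coefficient.
/// `coeffs` is ordered from the highest-degree term down to the constant term
/// (an assumed convention for this sketch). An empty slice evaluates to 0.0.
fn horner(coeffs: &[f64], x: f64) -> f64 {
    coeffs.iter().fold(0.0, |acc, &c| acc.mul_add(x, c))
}

fn main() {
    // 2x² + 3x + 4 at x = 2, the same polynomial as `poly` above
    let y = horner(&[2.0, 3.0, 4.0], 2.0);
    assert_eq!(y, 18.0);
    println!("{y}");
}
```

Starting the fold at 0.0 costs one extra fused step compared with seeding it from the leading coefficient, but it keeps the empty-slice case trivial.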

The honest caveat

mul_add is faster only when the compiler can emit a hardware fused-multiply-add instruction for the target you build for. AArch64 has one in its baseline. On x86-64 the instruction exists on most modern CPUs, but the default x86-64 target does not assume it, so you have to opt in with -C target-feature=+fma or a suitable -C target-cpu. Without the instruction, mul_add falls back to a software routine that reproduces the exact single-rounding semantics, and that routine is slower than a plain a * b + c. So treat the speed benefit as a hot-path tool: reach for it in tight numeric loops on hardware you control, built with the matching target-feature/target-cpu flags, and lean on it freely when you need the extra accuracy regardless of speed.
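When speed is the only goal, one possible approach is to choose at compile time. The helper below is hypothetical (`mul_add_fast` is not a standard-library function) and assumes that an x86 build with the `fma` target feature, or any hard-float AArch64 build, has the instruction available. Every other target takes the plain two-step path.

```rust
/// Hypothetical speed-first helper: fused where the build target is assumed
/// to have a hardware FMA instruction, plain multiply-then-add otherwise.
#[inline]
fn mul_add_fast(a: f64, b: f64, c: f64) -> f64 {
    // `cfg!` is resolved at compile time, so only one branch is ever emitted.
    if cfg!(any(target_feature = "fma", target_arch = "aarch64")) {
        a.mul_add(b, c) // one rounding, one instruction
    } else {
        a * b + c // two roundings, but no software fallback
    }
}

fn main() {
    // Tidy inputs give the same answer on either path.
    assert_eq!(mul_add_fast(1.0, 2.0, 3.0), 5.0);
    println!("{}", mul_add_fast(1.0, 2.0, 3.0));
}
```

The trade-off is that results can differ in the last bits between builds, since one path rounds once and the other twice. If reproducible output across targets matters more than speed, calling mul_add unconditionally is the safer choice.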

The bottom line

mul_add gives you a single rounding step and, on capable hardware, a single instruction for a * b + c. Use it in polynomial evaluation, dot products, and any multiply-accumulate loop where precision or throughput matters — and remember it can regress on FMA-less targets, so measure when speed is the goal.

This post is licensed under CC BY 4.0 by the author.