Fast floating-point power of 2 on x86_64
Is there a fast way to take 2.0 to some floating-point power x? I mean something faster than pow(2.0, x), and preferably one that vectorizes well with AVX2. The counterpart for integers is 1<<n, but it works for integer n only.
Solution 1:[1]
There is a standard std::exp2(double n), documented as "Computes 2 raised to the given power n". It is possible that exp2(x) would not be faster than pow(2.0, x) in a particular environment, but it is more specific than the general pow.
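A minimal drop-in replacement for the pow call (the wrapper name is mine):

```cpp
#include <cmath>

double pow2(double x) {
    return std::exp2(x);   // 2^x; declared in <cmath> since C++11
}
```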
Solution 2:[2]
For integer powers, you can use std::ldexp
:
double x = std::ldexp(1.0, k); // 2 to the power of k
This is better than computing 1<<k and casting, as it avoids intermediate integer overflow and supports negative powers as well.
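For example (both results are exact, since powers of two are representable in a double):

```cpp
#include <cmath>

double a = std::ldexp(1.0, 62);   // 2^62 -- 1 << 62 overflows a 32-bit int
double b = std::ldexp(1.0, -3);   // 0.125 -- no shift-based equivalent at all
```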
Solution 3:[3]
TL;DR: split the input into integer and fraction parts, use a Taylor expansion to calculate the 2^fraction part, add that back onto the integer part, then use the convert-to-int magic trick many have mentioned. The Taylor expansion is slow because of a long dependency chain calculating the successive powers, but the process only requires 3 registers, meaning you can do 4 vectors in parallel to allow pipelining.
I know this is an old question, but it never got a good answer (or even really any answer that satisfied the original questioner's requirements). The best answer was a link to a paper in one of the comments that explains the Taylor series and the convert-to-int trick, and details a way to use a polynomial to adjust the fraction part you already have to get the 2^fract you need, but it was complicated and didn't include the actual numbers to make it work.
The original questioner wanted to do 400,000,000 2^x's per second with 40 bits minimum precision. My requirements are a bit different, mainly in that I don't need that much precision, so I haven't tried it with doubles yet; but I am getting 1,730,000,000 2^x's per second with 16-bit minimum precision in a single thread on the ten-year-old 3rd-gen i7 mobile chip in my laptop, and I'm confident that a quarter of this rate would be doable for double precision and twice the terms in the series, on the same hardware.
The Taylor series calculates e^x with greater accuracy each step, and looks like this:
1 + x + x²/2! + x³/3! + x⁴/4! + ...
To get it to calculate any other base's power, multiply the power you want by ln(base). In other words:
N^x = e^(x*ln(N))
or for us,
2^x = e^(x*ln(2))
If you use values of x that are large, it takes a long time to converge, but since we have a trick for the integer part we will never be dealing with x >= 1.0. We get 16 bits of precision in 6 terms (up to x^6) and 40 bits in 12 terms, with more precision the closer to 0 we go.
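As a quick sanity check (my own standalone snippet, not part of the original answer), truncating the series after the x^6 term at f = 0.5 agrees with std::exp2 to within about 1e-7; the worst case, near f = 1, is around 2^-16, matching the claimed 16 bits:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    double f = 0.5;                       // fraction part, in [0, 1)
    double t = f * 0.693147180559945;     // x * ln(2)
    double approx = 1 + t + t*t/2 + t*t*t/6 + t*t*t*t/24
                  + t*t*t*t*t/120 + t*t*t*t*t*t/720;
    // Both print ~1.4142135..., differing by roughly 1e-7.
    std::printf("%.10f vs %.10f\n", approx, std::exp2(f));
}
```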
The basic process looks like this:
vector register x, y, z; /// x is our input, y will become our output
y = floor(x); /// 3 round toward -infinity
x -= y; /// 3 split integer and fraction
x *= ln(2); /// 5 ln(2) = 0.693147180559945...
y += x; /// summing the terms direct into the integer part
z = x * x; /// 5
y += z * (1.0/2); /// this is a single fma instruction but ...
z *= x; /// 5
y += z * (1.0/6); /// if your processor lacks it, the working register ...
z *= x; /// 5
y += z * (1.0/24); /// can be shared between the 4 parallel calculation chains
z *= x; /// 5
y += z * (1.0/120); /// the broadcasted constants can be shared too
z *= x; /// 5
y += z * (1.0/720); /// 8 I stopped here but you could easily carry on.
y *= 2^(mantissa bits); /// 5 8388608.0 for float,
/// 4503599627370496.0 for double
y = convert-float-to-int(y); /// 3 creates an integer where the original
/// integer part appears where the exponent bits
/// of a float value would be, and our added
/// fraction part becomes the mantissa, gaining
/// the 1.0 it was missing in the process
y = integer-addition(y, interpret-bits-as-integer(1.0)); /// 3
/// add the exponent bias to complete our result
/// this can be done as a float addition before
/// the convert-to-int, but bits of the precision
/// of the result get lost, because all of the
/// mantissa and exponent bits have to fit into the
/// mantissa before conversion
Each line can convert to one vector instruction (or two on older processors without fma), but many have to wait for the result of a nearby previous step to complete (instruction latency; the numbers after some lines indicate they belong to the chain and how many cycles must be waited for the result), and this chain limits the minimum processing time for this 6-step version to 55 cycles. But the floating-point hardware can actually work much faster than this: the 5 pipeline stages of addition used to do a 24-bit multiply can all be working on different values. With 16 registers in the AVX register file, 4 of these calculation chains can be coded to run together in a single thread, allowing 4 vectors of results to be calculated in the same time, about 55 clock cycles.
Using doubles would halve the number of results per chain, but doubling the number of terms doesn't double the time taken, because the starting and finishing steps are the same. It results in a latency chain of about 85 cycles, so in theory 560,000,000 40-bit-precise 2^double results should be possible in a single thread on a third-gen, 3 GHz Intel processor (including the read of input and write of output). Newer processors have decreased the latency of multiplication a little, but the latency of rounding and float-to-int conversion has gone up, so it remains about the same overall.
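To make the pseudocode concrete, here is a minimal single-precision AVX2/FMA sketch of one such chain (the function name and layout are mine; it omits the 4-way interleaving and the pre-FMA fallback discussed above, and does no overflow or NaN handling, so it is only valid for x roughly in (-126, 127)):

```cpp
// Compile with -mavx2 -mfma. Approximates 2^x for 8 floats at once.
#include <immintrin.h>

static inline __m256 exp2_approx(__m256 x) {
    const __m256 ln2 = _mm256_set1_ps(0.693147180559945f);
    __m256 y = _mm256_floor_ps(x);              // round toward -infinity
    __m256 f = _mm256_sub_ps(x, y);             // fraction part, in [0, 1)
    f = _mm256_mul_ps(f, ln2);                  // 2^f = e^(f*ln2)
    y = _mm256_add_ps(y, f);                    // sum terms into the integer part
    __m256 z = _mm256_mul_ps(f, f);
    y = _mm256_fmadd_ps(z, _mm256_set1_ps(1.0f/2), y);
    z = _mm256_mul_ps(z, f);
    y = _mm256_fmadd_ps(z, _mm256_set1_ps(1.0f/6), y);
    z = _mm256_mul_ps(z, f);
    y = _mm256_fmadd_ps(z, _mm256_set1_ps(1.0f/24), y);
    z = _mm256_mul_ps(z, f);
    y = _mm256_fmadd_ps(z, _mm256_set1_ps(1.0f/120), y);
    z = _mm256_mul_ps(z, f);
    y = _mm256_fmadd_ps(z, _mm256_set1_ps(1.0f/720), y);
    // Scale so the integer part lands in the exponent field: 2^23 = 8388608.
    y = _mm256_mul_ps(y, _mm256_set1_ps(8388608.0f));
    __m256i yi = _mm256_cvtps_epi32(y);         // convert-float-to-int
    // Add the exponent bias: the bit pattern of 1.0f is 127 << 23 = 0x3F800000.
    yi = _mm256_add_epi32(yi, _mm256_set1_epi32(0x3F800000));
    return _mm256_castsi256_ps(yi);             // reinterpret bits as floats
}
```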
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Simon Byrne |
| Solution 3 | Jeremy |