Division by constant unsigned integers

The code accompanying this article can be found on github.

Most modern processors have an integer divide instruction which is relatively slow compared to the other arithmetic operations. When the divisor is known at compile-time or the same divisor is used for many divisions, it is possible to transform the single division to a series of instructions which execute faster. Most compilers will optimize divisions in this way. The libdivide library implements this technique for divisors which are constant at runtime but not known at compile-time. The assembly that is generated by some popular compilers can be inspected on Matt Godbolt's compiler explorer. In this article, I give an overview of the existing techniques. I aim to concisely treat both implementation concerns and mathematical background.

First, let's define some terminology. The number that is being divided is called the dividend. The dividend is divided by , the divisor. The rounded down result is called the quotient. In this article, I'll only deal with unsigned integers.

I will be referring to the set of numbers that can be represented with bits a lot, so I will introduce some notation for this.

Definition: Let denote the set of unsigned integers which can be represented in bits with the binary encoding. That is:

I assume that we are working on a processor with -bit registers. I will give code snippets in C and a made-up instruction set architecture for the processor. The assembly language will have the registers r0, r1, and have the following instructions -- they are all we need for this article:

Instruction Description
shr a, b, c Shift register b by c bits, store result in register a
gte a, b, c Set register a to 1 if b >= c and to 0 otherwise
mul a, b, c Store the full unsigned product b * c in registers a and a + 1
umulhi a, b, c Store N high bits of unsigned product b * c in register a
add a, b, c Store b + c in register a, set carry on overflow
adc a, b, c Store b + c + carry in register a
sbb a, b, c Store b - c - carry in register a
ret Return from function

The c register can be replaced by a constant number.

Not all architectures have an instruction like umulhi, but most have an instruction that takes two -bit registers and computes the full -bit product in two separate registers. Of course, we can then just use the register in which the upper bits are stored. I will assume that the arguments are passed in registers r0, r1, r2, ..., and the return value is passed in r0.

I will give the optimized assembly for the following C function:

const uint d = ...;

uint divide(uint n) {
    return n / d;
}

Here, d will be some constant, and the code that is generated for the divide will depend on the value of d.

For some special divisors, the quotient can be found very efficiently. A division by one just returns the dividend. The divide function will become equivalent to

uint divide(uint n) {
    return n;
}

Since the dividend and the quotient are the same and both are passed in r0, the assembly for the divide function just needs to return:

divide:
    ret

When is another power of two, the divide function is trivially implemented by a bit shift. Suppose that . The C code becomes equivalent to

uint divide(uint n) {
    return n << p;
}

The assembly becomes:

divide:
    shr r0, r0, p
    ret

When the divisor is larger than half of the maximum value of the dividend, the quotient is one when and zero otherwise. For example, assuming and taking we get

uint divide(uint n) {
    return n >= 2147483649;
}

Which compiles to the following assembly code:

divide:
    gte r0, r0, 2147483649
    ret

These techniques are applicable in a limited number of cases. A more generally applicable technique uses the idea of multiplying by a constant that approximates the reciprocal of the divisor in fixed point math. This means that, when we want to divide by a divisor , we multiply the dividend by some constant with

and shift the result bits to the right. Here, is a measure of how precise the fixed-point approximation is to . Setting higher will yield a more precise approximation, but also increase the number of bits that we need to store . Before going any further, let's define the problem more formally:

Problem: Given a divisor which is not a power of two, find such that

for all .

Ideally, we would have so that we can store in a single register. Later, we will derive a condition that can be used to check if this is the case.

From this formulation we would expect that . Suppose we pick some which is not a power of two and take . If we now set and try to find the quotient by evaluating , we get

So we get the result , instead of the expected . So taking does not work: The product is too small. To fix this, we can increase , which is equivalent to taking when is not a power of two. Alternatively, we can slightly alter the problem formulation so that we are allowed to increase . As we will see later, these are both useful and even complementary techniques.

If we experiment a bit, setting and picking some high enough seems to work. If we want to use this as a compiler optimization we need to know under which conditions this works. The following lemma is very useful for the analysis.

Lemma 1: Suppose that is a positive integer. When we have

Proof: If the result obviously holds. If the expression is not an integer and we have

Since all of these expressions in the inequality are integers, it follows that .

Now, we can write with and substitute this in , write the resulting expression in the form with a function of , , , and simply check under which condition we have . This gives us the following lemma:

Lemma 2: Let be a positive integer and be an integer with . Define . If , then for all .

Proof: Suppose we have . Write with . Now we have

with . Since we have , and, by assumption, . Combining these inequalities gives

Since the right hand side just states that . So we can apply lemma 1 to see that for any .

With the two lemmas we can finally prove the following theorem:

Theorem 1: Let be positive integers and be the smallest such that . Then . Furthermore, when and we have for any .

Proof: Suppose that . Then we have

So we have . It follows from lemma 2 that for any .

From theorem 1 it follows that when we look for the lowest such that , we only need to check the values . If none of these values succeed, is the smallest that satisfies .

Now, we would like to know how many bits we need to hold . The following theorem simply states that we need bits to hold , independently of whether we round the result up or down. Feel free to skip the proof.

Theorem 2: Let with . Let . Then the binary representations of and have bits. That is,

Proof: Define . We have , so it follows that

The lower bound follows from taking the floor and ceiling from this and using that for any :

To prove the upper bound, note that we have . It follows that , so that

After adding on both sides and rearranging we get

Divide by and use that to get

Now the upper bound follows from taking the floor and ceiling from this expression:

Combining the lower and upper bound, we find that

We can now see that when we use the upper bound from theorem one, the binary representation of has bits. This means that is exactly one bit too large to fit in an -bit register. This is unfortunate but not insurmountable. In [1], a technique for multiplying by an -bit constant on a processor with -bit registers is presented. This is illustrated in algorithm A.

Algorithm A: Let be an -bit unsigned integer divisor, let , let , and let equal without the most significant bit, that is, . Then the following algorithm computes for :

uint fast_divide(uint n) {
    uint hiword = ((big_uint)n * m_prime) >> N;
    uint intermediate = (n - hiword) >> min(k - N, 1);
    return (hiword + intermediate) >> max(k - N - 1, 0)
}

The expressions m_prime, min(k - N, 1), and max(k - N - 1, 0) are constants which can be computed at compile time when the divisor is fixed. The only case where we have and is when . In all other cases we have min(k - N, 1) = 1 and max(k - N - 1, 0) = k - N - 1.*

If we compile the divide function with this strategy we get the following assembly:

divide:
    umulhi r1, r0, m_prime
    sub r0, r0, r1
    shr r0, r0, min(k - N, 1)
    add r0, r1, r0
    shr r0, r0, max(k - N - 1, 0)
    ret

The expressions m_prime, k - N, and (k - N - 1) should be replaced by constants. For example, for we get so that m_prime = 613566757, min(k - N, 1) = 1 and max(k - N - 1 = 2*.

This is a promising first method. If you just don't care about having the absolute fastest algorithm for division, I would advise you to implement this algorithm. It is relatively simple method and works for all integers. It can also be extended to the case of division of signed integers without a lot of effort.

However, it is possible to improve: For some divisors, we might be able to find a smaller , so that and fits in a single register. For architectures on which multiplication and/or bit shifts by higher constants are slower, it may be useful to find the smallest for which .

Combining theorem 1 and theorem 2, we see that if, for a given divisor , there exists a such that with , then fits in a single -bit register. Following [3], we will call such divisors cooperative:

Definition: Let . If there exists a such that with

then is called a cooperative divisor. Otherwise, is an uncooperative divisor.

Assuming that an uint is an -bit unsigned integer, the following C code can be used to find given a divisor . It tests until it finds a that satisfies the condition.

// Double 'init' modulo d until it is less than 2^(k - M), return k
int loop_k(uint d, uint init, uint M) {
    int k = N;
    for (uint mod = init, pow = k - M; mod > pow && pow < N; pow++, k++) {
        // overflow-safe way of doubling 'mod' modulo d
        if (mod < d - mod) mod += mod;
        else mod -= d - mod;
    }
    return k;
}

If we found a such that with , we can use a single -bit register to hold and we don't have to use tricks to multiply by an -bit constant. In this case it is simpler to compute the quotient:

Algorithm B: Let be a cooperative -bit unsigned integer divisor, let be such that with , and let . Then the following procedure calculates for all :

uint function divide_cooperative(uint n) {
    uint hiword = ((big_uint)n * m) >> N;
    return hiword >> (k - N);
}

Using this this strategy to compile the divide function for some cooperative divisor , we get the following assembly:

divide_cooperative:
    umulhi r0, r0, m
    shr r0, r0, k - N
    ret

The expressions m and k - N should be replaced by constants. For example, for we have so that m = 2863311531 and k - N = 1. For we see that so that m = 6700417 and k - N = 0. So in this case, we can skip the shr instruction. Indeed, for some we might have and we don't have to do the shift. In particular, this happens when is a divisor of , since in this case we have . For , the only divisors of are and .

Algorithm B is very efficient, but it can only be used for cooperative divisors. For uncooperative divisors, we can still use algorithm A, but there are more efficient ways.

For example, if is an even uncooperative divisor, we can apply a trick. Write with odd. Then we can obtain the quotient as with . Now can be computed by a right shift by positions, so that and . This gives us , which will be proved in the proof of corollary 1. Now we can apply theorem 1 to find such that for . Now the range of the dividend is smaller and we can certainly find an .

Corollary 1: Let be positive integers with , where is odd. Let be the smallest such that . Then . Furthermore, when , , and we have for any .

Proof: Since and , we have . Now use theorem 1 with instead of and instead of . It follows that for every we have for every .

Now it remains to show that . Write with such that . Then so that . Since , we have , since . So .

When we can find we want to set since this saves us a shift. In this case the division is as efficient as for most cooperative divisors. This gives us the following algorithm:

Algorithm C: Let be an -bit unsigned integer divisor with with and odd, let be the smallest for which , and let . Then the following algorithm computes for :

uint divide_even(n) {
    uint n' = n >> p;
    uint hiword = ((big_uint)n * m) >> N;
    return hiword >> (k - N);
}

If we use this strategy to compile the divide function for an even uncooperative divisor, we get the following assembly:

divide:
    shr r0, r0, p
    umulhi r0, r0, m
    shr r0, r0, k - N
    ret

The expressions p, m, and k - N should be replaced by constants. For example, when , we have so that k - N = 2. When d = 28 we have . In this case k - N = 0 and the second shr instruction can be omitted.

For odd divisors for which does not fit in a single word, one option is to resort to algorithm A, which uses a multiplication by an -bit constant. Alternatively, we can try the other strategy mentioned before: increase instead of .

Theorem 4: Let be a positive number that is not a power of two, let such that , and let . Then . If , then for all .

Proof: Write with . Now we have

with . Multiplying the inequalities and gives . Since is now a power of two, and are both positive, so we have . Combining this, we see

If we multiply by , which flips the inequalities, and add everywhere, we get

The expression in the middle is exactly , which means we have . So lemma 1 applies and we have .

Based on this, we can come up with the following algorithm:

Algorithm D: Let be an uncooperative -bit unsigned integer divisor, let be the smallest integer such that and , and let . Then the following algorithm computes for :

uint divide_uncooperative(n) {
    uint hi_word = ((big_uint)n * m + m) >> N;
    return hi_word >> (k - N);
}

The most efficient way to compute the high word of n * m + m depends on the target architecture. If the target supports instructions to calculate the -bit product and add the -bit operand to it, these can be used. When the target supports an integer fused multiply-add that produces all bits, this can be used. For example, in [2] Intel Celeron's XMA.HU instruction, is used for this purpose.

If the architecture supports instructions for doing bit additions, these can be used. For example, this is the case when 32-bit divisions are done on a 64-bit architecture. If this is not the case, we can still use 32-bit instructions to compute n * m + m. Using this strategy, we can compile the divide function to the following assembly:

divide:
    mul r0, r0, m
    add r0, r0, m
    adc r1, r1, 0
    shr r0, r0, k - N
    ret

For example, for we will have , so k - N = 1.

In [3], it is proposed to calculate as . The problem is that the expression n + 1 can overflow. As a solution, it is proposed to use a saturating increase, which increases n only if it is smaller than the maximum value that an -bit unsigned integer can hold. Most targets have equivalents of the add instruction which sets the carry flag in case of an overflow, and the sbb (subtract with borrow) instruction. A saturating increment on the value in register r0 can be implemented as:

add r0, r0, 1
sbb r0, r0, 0

The following theorem states that when is an uncooperative divisor we have . So if we apply algorithm D only to uncooperative divisors, the saturating increase produces the right result even when .

Theorem 5: Suppose is an uncooperative divisor. Then

Proof: Suppose that . Then . Take and . Using and , we see:

Here we used that . We have since and . So

So it follows that is a cooperative divisor. This is a contradiction, since we assumed that is uncooperative. So we must have .

Now, when we use this strategy to compile the divide function for some uncooperative divisor , we will get the following assembly:

divide:
    add r0, r0, 1
    sbb r0, r0, 0
    umulhi r0, r0, m
    shr r0, r0, k - N
    ret

Note that algorithm D assumes the existence of a such that and . The following theorem states that such a indeed exists when is an uncooperative divisor.

Theorem 6: Let be an uncooperative divisor. Then there exists a with such that .

Proof: Assume that is an uncooperative divisor. Set and . We now have since and . Since we assumed that is an uncooperative divisor we have so that . Using that we see

So we have .

So, if we use algorithm B for cooperative divisors and algorithm D for uncooperative divisors, we can always fit in an -bit word.

Compiler optimizations

Here, I give a sketch of how the code generation for a compiler backend could look. I assume that the compiler backend has an expression_t type, which can contain nested calls to instructions like shr, umulhi, etc. An unsigned integer v can be passed as a compile-time constant expression by writing const(v).

For the optimization of multiplication by a constant, we have the following flowchart. The right arrow should be followed if the question above the node is true, else the left arrow should be followed. ![Flowchart for optimization of a multiplication by a constant unsigned integer]( Flowchart for optimization of a multiplication by a constant unsigned integer

Roughly speaking, the algorithms are ordered from least efficient on the left, to most efficient on the right. The flowchart can be implemented in C as

expression_t div_by_const_uint(const uint d, expression_t n) {
    if (is_power_of_two(d)) {
        if (d == 1) return n;
        else return shr(n, constant(floor_log2(d)));
    }
    else {
        if (d > MAX / 2) return gte(n, constant(d));
        else return mul_based_div_by_const_uint(d, n);
    }
}

For the multiplication-based techniques, we have the following flowchart: Flowchart for multiplication-based optimization of a multiplication by a constant unsigned integer

Algorithms on the right are usually slightly more efficient.

This translates into the following C code:

expression_t mul_based_div_by_const_uint(uint d, expression_t n) {
    // compute minus_mod = -2^N mod d
    uint mod = (uint)(~d + 1) % d;
    uint minus_mod = mod ? d - mod : 0;
    uint k = loop_k(d, minus_mod, N);
    uint bits = k + 1 - ceil_log2(d);
    if (bits <= N) {
        // Algorithm B: cooperative divisors
        uint m = calc_m(d, k) + 1;
        expression_t hiword = umulhi(n, constant(m));
        if (k == N) return hiword;
        else return shr(hiword, constant(k - N));
    }
    if (d % 2 == 0) {
        // Algorithm C: even uncooperative divisors
        uint p = powers_of_two(d);
        uint q = d >> p;
        // compute minus_mod = -2^N mod q
        mod = (uint)(~q + 1) % q;
        uint minus_mod = mod ? q - mod : 0;
        k = loop_k(q, minus_mod, N - p);
        uint m = calc_m(q, k) + 1;
        expression_t n_odd = shr(n, constant(p));
        expression_t hiword = umulhi(n_odd, constant(m));
        if (k == N) return hiword;
        else return shr(hiword, constant(k - N));
    }
    else {
        // Algorithm D: odd uncooperative divisors
        k = loop_k(d, mod, N);
        uint m = calc_m(d, k);
        expression_t n_inc = sbb(add(n, constant(1)), constant(0));
        expression_t hiword = umulhi(n_inc, constant(m));
        if (k == N) return hiword;
        else return shr(hiword, constant(k - N));
    }
}

The code uses a helper function for computing :

// Double 'init' modulo d until it is less than 2^(k - M), return k
int loop_k(uint d, uint init, uint M) {
    int k = N;
    for (uint mod = init, pow = 1 << (k - M); mod > pow; pow <<= 1, k++) {
        // overflow-safe way of doubling 'mod' modulo d
        if (mod < d - mod) mod += mod;
        else mod -= d - mod;
    }
    return k;
}

Runtime optimizations

Suppose we want to perform faster division by a runtime constant. In this case we have to compute the magic value at runtime, and call a function fast_divide to perform the division. For the sake of efficiency, we want this function to be branchless (it should not contain any if statements).

Following [2], we note that both algorithm B and algorithm D follow the common pattern of computing and right shifting it by bits. So the following struct holds enough information:

struct fast_div_t {
    uint mul, add, shift;
};

The fast_divide function now looks like

inline uint fast_divide(uint n, fast_div_t d) {
    uint hiword = ((big_uint)n * d.mul + d.add) >> N;
    return (uint)(hiword >> d.shift);
}

For cooperative divisors, we use algorithm B and set . For uncooperative divisors, we use algorithm D and set . For we need a special case that sets :

fast_div_t precalc_div(uint d) {
    fast_div_t div;
    assert(d > 0);
    if (d == 1) {
        div.mul = MAX;
        div.add = MAX;
        div.shift = 0;
    }
    else {
        uint mod = (uint)(~d + 1) % d;
        uint minus_mod = mod ? d - mod : 0;
        uint k = loop_k(d, minus_mod, N);
        uint bits = k + 1 - ceil_log2(d);
        
        if (bits <= N) {
            div.mul = calc_m(d, k) + (mod > 0);
            div.add = 0;
            div.shift = k - N;
        }
        else {
            k = loop_k(d, mod, N);
            bits = k + 1 - ceil_log2(d);
            assert(bits <= N);
            div.mul = div.add = calc_m(d, k);
            div.shift = k - N;
        }
    }
    return div;
}

This is certainly not the only way to implement division by run-time constant divisors, but I think this pretty elegant. The libdivide library provides an even nicer interface -- The code here is just a reference implementation for educational purposes.

On testing

If you implement these techniques, I recommend that you manually work out some examples for 4-bit numbers. After that, you can implement the methods for 8- and 16-bit numbers. It should be possible to test these exhaustively. That is, you should be able to test if your code gets the quotient right for every with . For 32-bit numbers, it should still be possible to check, for all , the value of the quotient where is either of the form or of the form . You should also test and . For 64-bit quotients, it is impossible to check all possibilities within a reasonable time. However, you can just pick a random value for , and check the quotient for and along with some random values for . If your code works for all 32-bit integers and you have no failing cases after a long time of random testing, you have good reason to believe that there are no blatant errors in your code.

Further reading

The classic work on the subject of dividing by constants is [1], which also also covers division by signed constants and testing for divisibility. Algorithm A is based on this work. In [3] and [4] a very readable introduction to division of unsigned constants is given. Further, the idea of using the saturating increment in algorithm D is introduced. The paper [2] contains many similar ideas, but is a bit more formal. In [5], the techniques from [1] are extended to implement the modulo operation more efficiently.

References

[1]: Division by Invariant Integers using Multiplication, Torbjörn Granlund and Peter L. Montgomery, 1994.

[2]: N-Bit Unsigned Divison Via N-Bit Multiply-Add, Arch D. Robinson, 2005.

[3]: Labor of Divison (Episode I), fish, 2010.

[4]: Labor of Divison (Episode III): Fast Unsigned Division by Constants, fish, 2011.

[5][Faster Remainder by Direct Computation: Applications to Compilers and Software Libraries](https://arxiv.org/pdf/1902.01961) Daniel Lemire, Owen Kaser, Nathan Kurz, 2019.