# Division by constant unsigned integers

The code accompanying this article can be found on GitHub.

Most modern processors have an integer divide instruction which is relatively slow compared to the other arithmetic operations. When the divisor is known at compile-time or the same divisor is used for many divisions, it is possible to transform the single division to a series of instructions which execute faster. Most compilers will optimize divisions in this way. The libdivide library implements this technique for divisors which are constant at runtime but not known at compile-time. The assembly that is generated by some popular compilers can be inspected on Matt Godbolt's compiler explorer. In this article, I give an overview of the existing techniques. I aim to concisely treat both implementation concerns and mathematical background.

First, let's define some terminology. The number $n$ that is being divided is called the *dividend*. The dividend is divided by $d$, the *divisor*. The rounded-down result $\lfloor \frac{n}{d} \rfloor$ is called the *quotient*. In this article, I'll only deal with unsigned integers.

I will be referring to the set of numbers that can be represented with $N$ bits a lot, so I will introduce some notation for this.

**Definition**: *Let $U_N$ denote the set of unsigned integers which can be represented in $N$ bits with the binary encoding. That is:*

$U_N = \{ 0, 1, ..., 2^N - 2, 2^N - 1 \}$

I assume that we are working on a processor with $N$-bit registers. I will give code snippets in C and in a made-up instruction set architecture for the processor. The assembly language will have the registers `r0`, `r1`, ..., and the following instructions -- they are all we need for this article:

Instruction | Description
---|---
`shr a, b, c` | Shift register `b` right by `c` bits, store the result in register `a`
`gte a, b, c` | Set register `a` to 1 if `b >= c` and to 0 otherwise
`mul a, b, c` | Store the full unsigned product `b * c` in registers `a` and `a + 1`
`umulhi a, b, c` | Store the $N$ high bits of the unsigned product `b * c` in register `a`
`add a, b, c` | Store `b + c` in register `a`, set the carry flag on overflow
`adc a, b, c` | Store `b + c + carry` in register `a`
`sub a, b, c` | Store `b - c` in register `a`, set the carry flag on borrow
`sbb a, b, c` | Store `b - c - carry` in register `a`
`ret` | Return from the function

The `c` operand can be replaced by a constant number.

Not all architectures have an instruction like `umulhi`, but most have an instruction that takes two $N$-bit registers and computes the full $2N$-bit product in two separate registers. Of course, we can then just use the register in which the upper $N$ bits are stored. I will assume that the arguments are passed in registers `r0`, `r1`, `r2`, ..., and that the return value is passed in `r0`.

I will give the optimized assembly for the following C function:

```
const uint d = ...;
uint divide(uint n) {
    return n / d;
}
```

Here, `d` will be some constant, and the code that is generated for the `divide` function will depend on the value of `d`.

For some special divisors, the quotient can be found very efficiently. A division by one just returns the dividend. The divide function will become equivalent to

```
uint divide(uint n) {
    return n;
}
```

Since the dividend and the quotient are the same and both are passed in `r0`, the assembly for the `divide` function just needs to return:

```
divide:
    ret
```

When $d$ is another power of two, the `divide` function is trivially implemented by a bit shift. Suppose that $d = 2^p$. The C code becomes equivalent to

```
uint divide(uint n) {
    return n >> p;
}
```

The assembly becomes:

```
divide:
    shr r0, r0, p
    ret
```

When the divisor is larger than half of the maximum value of the dividend, the quotient is one when $n \geq d$ and zero otherwise. For example, assuming $N = 32$ and taking $d = 2147483649$, we get

```
uint divide(uint n) {
    return n >= 2147483649;
}
```

which compiles to the following assembly code:

```
divide:
    gte r0, r0, 2147483649
    ret
```

These techniques are applicable in a limited number of cases. A more generally applicable technique uses the idea of multiplying by a constant that approximates the reciprocal of the divisor in fixed-point math. This means that, when we want to divide by a divisor $d$, we multiply the dividend by some constant $m \approx \frac{2^k}{d}$ and shift the result $k$ bits to the right. Here, $k$ is a measure of how precise the fixed-point approximation $\frac{m}{2^k}$ is to $\frac{1}{d}$. Setting $k$ higher will yield a more precise approximation, but also increase the number of bits that we need to store $m$. Before going any further, let's define the problem more formally:

**Problem**: *Given a divisor $d \in U_N$ which is not a power of two, find $m, k \in \mathbb{N}$ such that*

$\lfloor \frac{n \cdot m}{2^k} \rfloor = \lfloor \frac{n}{d} \rfloor$

*for all $n \in U_N$.*

Ideally, we would have $m∈U_{N}$ so that we can store $m$ in a single register. Later, we will derive a condition that can be used to check if this is the case.

From this formulation we would expect that $m \approx \frac{2^k}{d}$. Suppose we pick some $d \in \mathbb{N}_+$ which is not a power of two and take $m = \lfloor \frac{2^k}{d} \rfloor$. If we now set $n = d$ and try to find the quotient $\lfloor \frac{n}{d} \rfloor$ by evaluating $\lfloor \frac{n \cdot m}{2^k} \rfloor$, we get

$\lfloor \frac{d \cdot \lfloor \frac{2^k}{d} \rfloor}{2^k} \rfloor = 0 < 1 = \lfloor \frac{d}{d} \rfloor$

So we get the result $0$ instead of the expected $1$. So taking $m = \lfloor \frac{2^k}{d} \rfloor$ does not work: the product $n \cdot m$ is too small. To fix this, we can increase $m$, which amounts to taking $m = \lceil \frac{2^k}{d} \rceil$ when $d$ is not a power of two. Alternatively, we can slightly alter the problem formulation so that we are allowed to increase $n$. As we will see later, these are both useful and even complementary techniques.
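Before formalizing this, we can check the round-up fix numerically. The following sketch (my own illustration, with $N = 8$, $d = 5$, and $k = 11$ picked by hand) verifies that $m = \lceil \frac{2^k}{d} \rceil$ gives the correct quotient for every 8-bit dividend:

```c
#include <stdint.h>

// Check floor(n * m / 2^k) == n / d for all 8-bit n,
// with d = 5, k = 11, and m = ceil(2^11 / 5) = 410.
int round_up_works(void) {
    const uint32_t d = 5, k = 11;
    const uint32_t m = ((1u << k) + d - 1) / d; // ceil(2^k / d) = 410
    for (uint32_t n = 0; n < 256; n++)
        if ((n * m) >> k != n / d) return 0;
    return 1;
}
```

Rounding down instead (so that `m = 409`) makes this check fail at `n = 5`, exactly as derived above.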

If we experiment a bit, setting $m = \lceil \frac{2^k}{d} \rceil$ and picking some high enough $k$ seems to work. If we want to use this as a compiler optimization, we need to know under which conditions this works. The following lemma is very useful for the analysis.

**Lemma 1**: *Suppose that $d \in \mathbb{N}_+$ is a positive integer and $n \in \mathbb{N}$. When $0 \leq \epsilon < \frac{1}{d}$, we have*

$\lfloor \frac{n}{d} + \epsilon \rfloor = \lfloor \frac{n}{d} \rfloor$

**Proof**: If $\epsilon = 0$ the result obviously holds. If $0 < \epsilon < \frac{1}{d}$, the expression $\frac{n}{d} + \epsilon$ is not an integer and we have

$\lfloor \frac{n}{d} + \epsilon \rfloor < \frac{n + 1}{d} \leq \lfloor \frac{n}{d} \rfloor + 1$

Since the left-hand side is an integer strictly smaller than $\lfloor \frac{n}{d} \rfloor + 1$, and since $\lfloor \frac{n}{d} + \epsilon \rfloor \geq \lfloor \frac{n}{d} \rfloor$, it follows that $\lfloor \frac{n}{d} + \epsilon \rfloor = \lfloor \frac{n}{d} \rfloor$. $\square$

Now, we can write $m = \lceil \frac{2^k}{d} \rceil = \frac{2^k + e}{d}$ with $e = \text{mod}_d(-2^k)$, substitute this in $\lfloor \frac{m \cdot n}{2^k} \rfloor$, write the resulting expression in the form $\lfloor \frac{n}{d} + \epsilon \rfloor$ with $\epsilon$ a function of $n$, $d$, $e$, and simply check under which condition we have $0 \leq \epsilon < \frac{1}{d}$. This gives us the following lemma:

**Lemma 2**: *Let $d \in \mathbb{N}_+$ be a positive integer and let $k$ be an integer with $k \geq N$. Define $m = \lceil \frac{2^k}{d} \rceil$. If $\text{mod}_d(-2^k) \leq 2^{k-N}$, then $\lfloor \frac{m \cdot n}{2^k} \rfloor = \lfloor \frac{n}{d} \rfloor$ for all $n \in U_N$.*

**Proof**: Suppose we have $\text{mod}_d(-2^k) \leq 2^{k-N}$. Write $m = \lceil \frac{2^k}{d} \rceil = \frac{2^k + e}{d}$ with $e = \text{mod}_d(-2^k)$. Now we have

$\lfloor \frac{m \cdot n}{2^k} \rfloor = \lfloor \frac{2^k + e}{2^k} \cdot \frac{n}{d} \rfloor = \lfloor \frac{n}{d} + \frac{e \cdot n}{2^k d} \rfloor = \lfloor \frac{n}{d} + \epsilon \rfloor$

with $\epsilon = \frac{e \cdot n}{2^k d}$. Since $n \in U_N$ we have $0 \leq n < 2^N$, and, by assumption, $0 \leq e \leq 2^{k-N}$. Combining these inequalities gives

$0 \leq e \cdot n < 2^k \iff 0 \leq \frac{e \cdot n}{2^k d} < \frac{1}{d}$

Since $\epsilon = \frac{e \cdot n}{2^k d}$, the right-hand side just states that $0 \leq \epsilon < \frac{1}{d}$. So we can apply lemma 1 to see that $\lfloor \frac{m \cdot n}{2^k} \rfloor = \lfloor \frac{n}{d} + \epsilon \rfloor = \lfloor \frac{n}{d} \rfloor$ for any $n \in U_N$. $\square$

With the two lemmas we can finally prove the following theorem:

**Theorem 1**: *Let $d, N \in \mathbb{N}_+$ be positive integers and let $k_{min} \in \mathbb{N}$ be the smallest $k \geq N$ such that $\text{mod}_d(-2^k) \leq 2^{k-N}$. Then $k_{min} \leq N + \lceil \log_2(d) \rceil$. Furthermore, when $k \geq k_{min}$ and $m = \lceil \frac{2^k}{d} \rceil$, we have $\lfloor \frac{m \cdot n}{2^k} \rfloor = \lfloor \frac{n}{d} \rfloor$ for any $n \in U_N$.*

**Proof**: For $k = N + \lceil \log_2(d) \rceil$ we have $\text{mod}_d(-2^k) \leq d - 1 < 2^{\lceil \log_2(d) \rceil} = 2^{k-N}$, so $k_{min} \leq N + \lceil \log_2(d) \rceil$. Now suppose that $k \geq k_{min}$. Then we have

$\text{mod}_d(-2^k) = \text{mod}_d(-2^{k - k_{min}} \cdot 2^{k_{min}}) \leq 2^{k - k_{min}} \cdot \text{mod}_d(-2^{k_{min}}) \leq 2^{k - k_{min}} \cdot 2^{k_{min} - N} = 2^{k-N}$

So we have $\text{mod}_d(-2^k) \leq 2^{k-N}$. It follows from lemma 2 that $\lfloor \frac{m \cdot n}{2^k} \rfloor = \lfloor \frac{n}{d} \rfloor$ for any $n \in U_N$. $\square$

Now, we would like to know how many bits we need to hold $m$. The following theorem states that we need $k - \lceil \log_2(d) \rceil + 1$ bits to hold $\frac{2^k}{d}$, independently of whether we round the result up or down. Feel free to skip the proof.

**Theorem 2**: *Let $d, k \in \mathbb{N}_+$ with $d \leq 2^{k/2}$. Let $x = \frac{2^k}{d}$. Then the binary representations of $\lfloor x \rfloor$ and $\lceil x \rceil$ have $b = k - \lceil \log_2(d) \rceil + 1$ bits. That is,*

$2^{b-1} \leq \lfloor x \rfloor \leq \lceil x \rceil \leq 2^b - 1$

**Proof**: Define $q = \lceil \log_2(d) \rceil - 1$, so that $b = k - q$. We have $d \leq 2^{q+1}$, so it follows that

$2^{b-1} = \frac{2^k}{2^{q+1}} \leq \frac{2^k}{d}$

The lower bound follows from taking the floor and the ceiling of this and using that $\lfloor x \rfloor \leq \lceil x \rceil$ for any $x$:

$2^{b-1} \leq \lfloor x \rfloor \leq \lceil x \rceil$

To prove the upper bound, note that we have $2^q < d \leq 2^{k/2}$. It follows that $2q < k$, so that

$2^q + 1 \leq 2^{k-q}$

After adding $2^k$ on both sides and rearranging, we get

$2^k \leq 2^k + 2^{k-q} - 2^q - 1 = (2^{k-q} - 1) \cdot (2^q + 1)$

Divide by $2^q + 1$ and use that $2^q + 1 \leq d$ to get

$\frac{2^k}{d} \leq \frac{2^k}{2^q + 1} \leq 2^{k-q} - 1 = 2^b - 1$

Now the upper bound follows from taking the floor and the ceiling of this expression:

$\lfloor x \rfloor \leq \lceil x \rceil \leq 2^b - 1$

Combining the lower and upper bounds, we find that

$2^{b-1} \leq \lfloor x \rfloor \leq \lceil x \rceil \leq 2^b - 1$

$\square$

We can now see that when we use the upper bound $k = N + \lceil \log_2(d) \rceil$ from theorem 1, the binary representation of $m = \lceil \frac{2^k}{d} \rceil$ has $N + 1$ bits. This means that $m$ is exactly one bit too large to fit in an $N$-bit register. This is unfortunate, but not insurmountable. In [1], a technique is presented for multiplying by an $(N+1)$-bit constant on a processor with $N$-bit registers. This is illustrated in algorithm A.

**Algorithm A**: *Let $d \in U_N$ be an $N$-bit unsigned integer divisor, let $k = N + \lceil \log_2(d) \rceil$, let $m = \lceil \frac{2^k}{d} \rceil$, and let $m'$ equal $m$ without the most significant bit, that is, $m' = m - 2^N$. Then the following algorithm computes $\lfloor \frac{n}{d} \rfloor$ for $n \in U_N$:*

```
uint fast_divide(uint n) {
    uint hiword = ((big_uint)n * m_prime) >> N;
    uint intermediate = (n - hiword) >> min(k - N, 1);
    return (hiword + intermediate) >> max(k - N - 1, 0);
}
```
```

The expressions `m_prime`, `min(k - N, 1)`, and `max(k - N - 1, 0)` are constants which can be computed at compile time when the divisor is fixed. The only case where we have $k - N < 1$ and $k - N - 1 < 0$ is when $d = 1$. In all other cases, we have `min(k - N, 1) = 1` and `max(k - N - 1, 0) = k - N - 1`.

If we compile the `divide` function with this strategy, we get the following assembly:

```
divide:
    umulhi r1, r0, m_prime
    sub r0, r0, r1
    shr r0, r0, min(k - N, 1)
    add r0, r1, r0
    shr r0, r0, max(k - N - 1, 0)
    ret
```

The expressions `m_prime`, `min(k - N, 1)`, and `max(k - N - 1, 0)` should be replaced by constants. For example, for $d = 7$, $N = 32$ we get $k = 35$, $m = 4908534053$, and $m' = 613566757$, so that `m_prime = 613566757`, `min(k - N, 1) = 1`, and `max(k - N - 1, 0) = 2`.
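As a concrete illustration (my own, not taken from [1]), here is algorithm A specialized to $N = 32$ and $d = 7$, using a 64-bit type to get the high word of the product:

```c
#include <stdint.h>

// Algorithm A for N = 32, d = 7: k = 35, m' = m - 2^32 = 613566757.
uint32_t divide_by_7(uint32_t n) {
    const uint32_t m_prime = 613566757;
    uint32_t hiword = (uint32_t)(((uint64_t)n * m_prime) >> 32);
    uint32_t intermediate = (n - hiword) >> 1; // min(k - N, 1) = 1
    return (hiword + intermediate) >> 2;       // max(k - N - 1, 0) = 2
}
```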

This is a promising first method. If you don't care about having the absolute fastest algorithm for division, I would advise you to implement this algorithm. It is a relatively simple method and it works for all divisors. It can also be extended to the case of division of signed integers without a lot of effort.

However, it is possible to improve: for some divisors, we might be able to find a smaller $k$, so that $k < N + \lceil \log_2(d) \rceil$ and $m$ fits in a single register. For architectures on which multiplications and/or bit shifts by larger constants are slower, it may also be useful to find the smallest $k$ for which $\text{mod}_d(-2^k) \leq 2^{k-N}$.

Combining theorem 1 and theorem 2, we see that if, for a given divisor $d$, there exists a $k$ with $N \leq k < N + \lceil \log_2(d) \rceil$ such that $\text{mod}_d(-2^k) \leq 2^{k-N}$, then $m$ fits in a single $N$-bit register. Following [3], we will call such divisors *cooperative*:

**Definition**: *Let $N \in \mathbb{N}$, $d \in U_N$. If there exists a $k \in \mathbb{N}$ with $N \leq k < N + \lceil \log_2(d) \rceil$ such that*

$\text{mod}_d(-2^k) \leq 2^{k-N}$

*then $d$ is called a cooperative divisor. Otherwise, $d$ is an uncooperative divisor.*

Assuming that `uint` is an $N$-bit unsigned integer type, the following C code can be used to find $k$ for a given divisor $d$. It tests $k = N, N + 1, ...$ until it finds a $k$ that satisfies the condition.

```
// Double 'init' modulo d until it is at most 2^(k - M), return k
int loop_k(uint d, uint init, uint M) {
    int k = N;
    for (uint mod = init, pow = 1 << (k - M); mod > pow; pow <<= 1, k++) {
        // overflow-safe way of doubling 'mod' modulo d
        if (mod < d - mod) mod += mod;
        else mod -= d - mod;
    }
    return k;
}
```
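As a usage sketch (assuming $N = 32$ and `uint` defined as `uint32_t`; the helper `minus_pow_mod`, which computes the starting value $\text{mod}_d(-2^N)$, is my own addition):

```c
#include <stdint.h>

typedef uint32_t uint;
#define N 32

// Double 'init' modulo d until it is at most 2^(k - M), return k
int loop_k(uint d, uint init, uint M) {
    int k = N;
    for (uint mod = init, pow = 1u << (k - M); mod > pow; pow <<= 1, k++) {
        // overflow-safe way of doubling 'mod' modulo d
        if (mod < d - mod) mod += mod;
        else mod -= d - mod;
    }
    return k;
}

// mod_d(-2^N): the value tested by the condition at the starting point k = N
uint minus_pow_mod(uint d) {
    uint mod = (uint)(0u - d) % d; // 2^N mod d
    return mod ? d - mod : 0;
}
```

For the cooperative divisors $d = 3$ and $d = 641$ this finds $k = 33$ and $k = 32$; for the uncooperative divisor $d = 7$ it runs all the way up to $k = 35 = N + \lceil \log_2(7) \rceil$.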

If we have found a $k$ with $N \leq k < N + \lceil \log_2(d) \rceil$ such that $\text{mod}_d(-2^k) \leq 2^{k-N}$, we can use a single $N$-bit register to hold $m$, and we don't have to use tricks to multiply by an $(N+1)$-bit constant. In this case, it is simpler to compute the quotient:

**Algorithm B**: *Let $d \in U_N$ be a cooperative $N$-bit unsigned integer divisor, let $k \in \mathbb{N}$ with $N \leq k < N + \lceil \log_2(d) \rceil$ be such that $\text{mod}_d(-2^k) \leq 2^{k-N}$, and let $m = \lceil \frac{2^k}{d} \rceil \in U_N$. Then the following procedure calculates $\lfloor \frac{n}{d} \rfloor$ for all $n \in U_N$:*

```
uint divide_cooperative(uint n) {
    uint hiword = ((big_uint)n * m) >> N;
    return hiword >> (k - N);
}
```
```

Using this strategy to compile the `divide` function for some cooperative divisor $d$, we get the following assembly:

```
divide_cooperative:
    umulhi r0, r0, m
    shr r0, r0, k - N
    ret
```

The expressions `m` and `k - N` should be replaced by constants. For example, for $d = 3$ we have $k = 33$, $m = 2863311531$, so that `m = 2863311531` and `k - N = 1`. For $d = 641$ we see that $k = 32$, $m = 6700417$, so that `m = 6700417` and `k - N = 0`. So in this case, we can skip the `shr` instruction. Indeed, for some $d \in U_N$ we might have $k = N$, so that we don't have to do the shift. In particular, this happens when $d$ is a divisor of $2^N + 1$, since in this case we have $\text{mod}_d(-2^N) = 1 \leq 2^{k-N} = 1$. For $N = 32$, the only nontrivial divisors of $2^{32} + 1$ are $641$ and $6700417$.
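These two examples can be written out directly; the following is a small self-contained check (my own) of algorithm B with the constants above, using a 64-bit type to play the role of `umulhi`:

```c
#include <stdint.h>

// Algorithm B for N = 32, d = 3: k = 33, m = 2863311531.
uint32_t divide_by_3(uint32_t n) {
    uint32_t hiword = (uint32_t)(((uint64_t)n * 2863311531u) >> 32);
    return hiword >> 1; // k - N = 1
}

// Algorithm B for N = 32, d = 641: k = 32, m = 6700417, no shift needed.
uint32_t divide_by_641(uint32_t n) {
    return (uint32_t)(((uint64_t)n * 6700417u) >> 32);
}
```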

Algorithm B is very efficient, but it can only be used for cooperative divisors. For uncooperative divisors, we can still use algorithm A, but there are more efficient ways.

For example, if $d$ is an even uncooperative divisor, we can apply a trick. Write $d = 2^p \cdot q$ with $q$ odd, so that $q = \frac{d}{2^p}$. Then we can obtain the quotient $\lfloor \frac{n}{d} \rfloor$ as $\lfloor \frac{n'}{q} \rfloor$ with $n' = \lfloor \frac{n}{2^p} \rfloor$. Here, $n'$ can be computed by a right shift by $p$ positions, so that $n' \in U_{N-p}$. This uses the identity $\lfloor \frac{\lfloor n / 2^p \rfloor}{q} \rfloor = \lfloor \frac{n}{d} \rfloor$, which will be proved in the proof of corollary 1. Now we can apply theorem 1 to find $m, k$ such that $\lfloor \frac{m \cdot n'}{2^k} \rfloor = \lfloor \frac{n'}{q} \rfloor$ for $n' \in U_{N-p}$. Since the range of the dividend is now smaller, we can certainly find an $m \in U_N$.

**Corollary 1**: *Let $d, N, p, q \in \mathbb{N}_+$ be positive integers with $d = 2^p \cdot q$, where $q$ is odd. Let $k_{min} \in \mathbb{N}$ be the smallest $k \geq N - p$ such that $\text{mod}_q(-2^k) \leq 2^{k - N + p}$. Then $k_{min} \leq N - p + \lceil \log_2(q) \rceil$. Furthermore, when $k \geq k_{min}$, $m = \lceil \frac{2^k}{q} \rceil$, and $n' = \lfloor \frac{n}{2^p} \rfloor$, we have $\lfloor \frac{m \cdot n'}{2^k} \rfloor = \lfloor \frac{n}{d} \rfloor$ for any $n \in U_N$.*

**Proof**: Since $n \in U_N$ and $n' = \lfloor \frac{n}{2^p} \rfloor$, we have $n' \in U_{N-p}$. Now use theorem 1 with $N - p$ instead of $N$ and $q$ instead of $d$. It follows that when $k \geq k_{min}$ and $m = \lceil \frac{2^k}{q} \rceil$, we have $\lfloor \frac{m \cdot n'}{2^k} \rfloor = \lfloor \frac{n'}{q} \rfloor$ for every $n \in U_N$.

It remains to show that $\lfloor \frac{n'}{q} \rfloor = \lfloor \frac{n}{d} \rfloor$. Write $n = s \cdot d + r$ with $r, s \in \mathbb{N}$ such that $0 \leq r < d$. Then $n' = \lfloor \frac{n}{2^p} \rfloor = s \cdot q + \lfloor \frac{r}{2^p} \rfloor$, so that $\lfloor \frac{n'}{q} \rfloor = s + \lfloor \frac{\lfloor r / 2^p \rfloor}{q} \rfloor$. Since $\lfloor \frac{r}{2^p} \rfloor \leq \frac{r}{2^p}$ and $r < d$, we have $\lfloor \frac{\lfloor r / 2^p \rfloor}{q} \rfloor \leq \lfloor \frac{r}{2^p \cdot q} \rfloor = \lfloor \frac{r}{d} \rfloor = 0$. So $\lfloor \frac{n'}{q} \rfloor = s = \lfloor \frac{n}{d} \rfloor$.

$\square$

When we can find a $k < N$, we want to set $k = N$ instead, since this saves us a shift. In this case, the division is as efficient as it is for most cooperative divisors. This gives us the following algorithm:

**Algorithm C**: *Let $d \in U_N$ be an $N$-bit unsigned integer divisor with $d = 2^p \cdot q$, where $p, q \in \mathbb{N}$ and $q$ is odd. Let $k$ be the smallest $k \in \{ N, N + 1, ..., N - p + \lceil \log_2(q) \rceil \}$ for which $\text{mod}_q(-2^k) \leq 2^{k - N + p}$, and let $m = \lceil \frac{2^k}{q} \rceil$. Then the following algorithm computes $\lfloor \frac{n}{d} \rfloor$ for $n \in U_N$:*

```
uint divide_even(uint n) {
    uint n_prime = n >> p;
    uint hiword = ((big_uint)n_prime * m) >> N;
    return hiword >> (k - N);
}
```
```

If we use this strategy to compile the `divide` function for an even uncooperative divisor, we get the following assembly:

```
divide:
    shr r0, r0, p
    umulhi r0, r0, m
    shr r0, r0, k - N
    ret
```

The expressions `p`, `m`, and `k - N` should be replaced by constants. For example, when $d = 14$ we have $p = 1$, $q = 7$, $k = 34$, $m = 2454267027$, so that `k - N = 2`. When $d = 28$ we have $p = 2$, $q = 7$, $k = 32$, $m = 613566757$. In this case `k - N = 0` and the second `shr` instruction can be omitted.
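For $d = 28$, algorithm C then specializes to the following sketch (mine), where the pre-shift makes the magic constant for $q = 7$ fit in a word:

```c
#include <stdint.h>

// Algorithm C for N = 32, d = 28 = 2^2 * 7: p = 2, k = 32, m = 613566757.
uint32_t divide_by_28(uint32_t n) {
    uint32_t n_prime = n >> 2; // n' = floor(n / 2^p)
    return (uint32_t)(((uint64_t)n_prime * 613566757u) >> 32); // k - N = 0
}
```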

For odd uncooperative divisors, for which $m$ does not fit in a single word, one option is to resort to algorithm A, which uses a multiplication by an $(N+1)$-bit constant. Alternatively, we can try the other strategy mentioned before: increasing $n$ instead of $m$.

**Theorem 4**: *Let $d \in U_N$ be a positive integer that is not a power of two, let $k \in \mathbb{N}$ be such that $N \leq k < N + \lceil \log_2(d) \rceil$, and let $m = \lfloor \frac{2^k}{d} \rfloor$. Then $m \in U_N$. If $\text{mod}_d(2^k) \leq 2^{k-N}$, then $\lfloor \frac{m \cdot (n + 1)}{2^k} \rfloor = \lfloor \frac{n}{d} \rfloor$ for all $n \in U_N$.*

**Proof**: Since $d > 2^{\lceil \log_2(d) \rceil - 1}$ and $k \leq N + \lceil \log_2(d) \rceil - 1$, we have $m \leq \frac{2^k}{d} < 2^N$, so $m \in U_N$. Now write $m = \lfloor \frac{2^k}{d} \rfloor = \frac{2^k - e}{d}$ with $e = \text{mod}_d(2^k)$. We have

$\lfloor \frac{m \cdot (n + 1)}{2^k} \rfloor = \lfloor \frac{2^k - e}{2^k} \cdot \frac{n + 1}{d} \rfloor = \lfloor (1 - \frac{e}{2^k}) \cdot \frac{n + 1}{d} \rfloor = \lfloor \frac{n}{d} + \frac{1}{d} - \frac{e}{2^k} \cdot \frac{n + 1}{d} \rfloor = \lfloor \frac{n}{d} + \epsilon \rfloor$

with $\epsilon = \frac{1}{d} - \frac{e}{2^k} \cdot \frac{n + 1}{d}$. Multiplying the inequalities $e \leq 2^{k-N}$ and $n + 1 \leq 2^N$ gives $e \cdot (n + 1) \leq 2^k$. Since $d$ is not a power of two, $e$ and $n + 1$ are both positive, so we have $e \cdot (n + 1) > 0$. Combining this, we see

$0 < e \cdot (n + 1) \leq 2^k$

If we multiply by $-\frac{1}{2^k d}$, which flips the inequalities, and add $\frac{1}{d}$ everywhere, we get

$0 \leq \frac{1}{d} - \frac{e}{2^k} \cdot \frac{n + 1}{d} < \frac{1}{d}$

The expression in the middle is exactly $\epsilon$, which means we have $0 \leq \epsilon < \frac{1}{d}$. So lemma 1 applies and we have $\lfloor \frac{m \cdot (n + 1)}{2^k} \rfloor = \lfloor \frac{n}{d} + \epsilon \rfloor = \lfloor \frac{n}{d} \rfloor$. $\square$

Based on this, we can come up with the following algorithm:

**Algorithm D**: *Let $d \in U_N$ be an uncooperative $N$-bit unsigned integer divisor, let $k \in \mathbb{N}$ be the smallest integer such that $N \leq k < N + \lceil \log_2(d) \rceil$ and $\text{mod}_d(2^k) \leq 2^{k-N}$, and let $m = \lfloor \frac{2^k}{d} \rfloor$. Then the following algorithm computes $\lfloor \frac{n}{d} \rfloor$ for $n \in U_N$:*

```
uint divide_uncooperative(uint n) {
    uint hiword = ((big_uint)n * m + m) >> N;
    return hiword >> (k - N);
}
```
```

The most efficient way to compute the high word of `n * m + m` depends on the target architecture. If the target supports instructions to calculate the $2N$-bit product $n \cdot m$ and add the $N$-bit operand $m$ to it, these can be used. When the target supports an integer fused multiply-add that produces all $2N$ bits, this can be used instead. For example, in [2] the Intel Itanium's `XMA.HU` instruction is used for this purpose.

If the architecture supports instructions for doing $2N$-bit additions, these can be used as well. For example, this is the case when 32-bit divisions are done on a 64-bit architecture. If this is not the case, we can still use $N$-bit instructions to compute `n * m + m`. Using this strategy, we can compile the `divide` function to the following assembly:

```
divide:
    mul r0, r0, m
    add r0, r0, m
    adc r1, r1, 0
    shr r0, r1, k - N
    ret
```

For example, for $N = 32$, $d = 7$ we have $k = 33$, $m = 1227133513$, so `k - N = 1`.
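With 64-bit arithmetic available, the multiply-add variant for $d = 7$ looks like this (my own sketch of algorithm D):

```c
#include <stdint.h>

// Algorithm D for N = 32, d = 7: k = 33, m = floor(2^33 / 7) = 1227133513.
uint32_t divide_by_7_round_down(uint32_t n) {
    uint32_t hiword = (uint32_t)(((uint64_t)n * 1227133513u + 1227133513u) >> 32);
    return hiword >> 1; // k - N = 1
}
```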

In [3], it is proposed to calculate $n \cdot m + m$ as $(n + 1) \cdot m$. The problem is that the expression `n + 1` can overflow. As a solution, it is proposed to use a *saturating increment*, which increments `n` only if it is smaller than the maximum value that an $N$-bit unsigned integer can hold. Most targets have an equivalent of the `add` instruction, which sets the carry flag in case of an overflow, and of the `sbb` (subtract with borrow) instruction. A saturating increment of the value in register `r0` can be implemented as:

```
add r0, r0, 1
sbb r0, r0, 0
```
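In C, the same saturating increment can be sketched without access to the carry flag, using the fact that the wrapped sum `n + 1` is zero exactly when `n` was the maximum value:

```c
#include <stdint.h>

// Saturating increment: returns n + 1, except that UINT32_MAX maps to itself.
uint32_t saturating_inc(uint32_t n) {
    return n + (uint32_t)(n + 1 != 0);
}
```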

The following theorem states that when $d$ is an uncooperative divisor, we have $\lfloor \frac{2^N - 1}{d} \rfloor = \lfloor \frac{2^N - 2}{d} \rfloor$. So if we apply algorithm D only to uncooperative divisors, the saturating increment produces the right result even when $n = 2^N - 1$.

**Theorem 5**: *Suppose $d \in U_N$ is an uncooperative divisor. Then*

$\lfloor \frac{2^N - 1}{d} \rfloor = \lfloor \frac{2^N - 2}{d} \rfloor$

**Proof**: Suppose that $\lfloor \frac{2^N - 1}{d} \rfloor \neq \lfloor \frac{2^N - 2}{d} \rfloor$. Then $d$ divides $2^N - 1$, so that $2^N \equiv 1 \mod d$; note that this means $d$ is odd. Take $k = N + \lfloor \log_2(d) \rfloor$ and set $e_B = \text{mod}_d(-2^k)$, $e_D = \text{mod}_d(2^k)$. Using $\text{mod}_d(2^N) = 1$ and $2^{\lfloor \log_2(d) \rfloor} < d$, we see:

$e_D = \text{mod}_d(2^k) = \text{mod}_d(2^N \cdot 2^{\lfloor \log_2(d) \rfloor}) = \text{mod}_d(2^{\lfloor \log_2(d) \rfloor}) = 2^{\lfloor \log_2(d) \rfloor}$

We have $e_B + e_D = d$, since $e_B + e_D \equiv 0 \mod d$ and $e_B, e_D \in \{ 1, 2, ..., d - 1 \}$. So

$e_B = d - e_D \leq 2^{\lceil \log_2(d) \rceil} - 2^{\lfloor \log_2(d) \rfloor} = 2^{\lfloor \log_2(d) \rfloor} = 2^{k-N}$

Since $N \leq k < N + \lceil \log_2(d) \rceil$, it follows that $d$ is a cooperative divisor. This is a contradiction, since we assumed that $d$ is uncooperative. So we must have $\lfloor \frac{2^N - 1}{d} \rfloor = \lfloor \frac{2^N - 2}{d} \rfloor$. $\square$

Now, when we use this strategy to compile the `divide` function for some uncooperative divisor $d$, we get the following assembly:

```
divide:
    add r0, r0, 1
    sbb r0, r0, 0
    umulhi r0, r0, m
    shr r0, r0, k - N
    ret
```
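Putting the pieces together for the uncooperative divisor $d = 7$ (a sketch of mine; the edge case $n = 2^{32} - 1$ is covered by theorem 5):

```c
#include <stdint.h>

// Algorithm D with a saturating increment for N = 32, d = 7:
// k = 33, m = floor(2^33 / 7) = 1227133513.
uint32_t divide_by_7_sat(uint32_t n) {
    uint32_t n_inc = n + (uint32_t)(n + 1 != 0); // saturating n + 1
    uint32_t hiword = (uint32_t)(((uint64_t)n_inc * 1227133513u) >> 32);
    return hiword >> 1; // k - N = 1
}
```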

Note that algorithm D assumes the existence of a $k \in \mathbb{N}$ such that $N \leq k < N + \lceil \log_2(d) \rceil$ and $\text{mod}_d(2^k) \leq 2^{k-N}$. The following theorem states that such a $k$ indeed exists when $d \in U_N$ is an uncooperative divisor.

**Theorem 6**: *Let $d \in U_N$ be an uncooperative divisor. Then there exists a $k \in \mathbb{N}$ with $N \leq k < N + \lceil \log_2(d) \rceil$ such that $\text{mod}_d(2^k) \leq 2^{k-N}$.*

**Proof**: Assume that $d$ is an uncooperative divisor. Set $k = N + \lceil \log_2(d) \rceil - 1$ and $e_B = \text{mod}_d(-2^k)$, $e_D = \text{mod}_d(2^k)$. We have $e_B + e_D = d$, since $e_B + e_D \equiv 0 \mod d$ and $e_B, e_D \in \{ 1, 2, ..., d - 1 \}$. Since we assumed that $d$ is an uncooperative divisor, we have $e_B > 2^{k-N} = 2^{\lceil \log_2(d) \rceil - 1}$, so that $e_D < d - 2^{\lceil \log_2(d) \rceil - 1}$. Using that $\frac{d}{2} \leq 2^{\lceil \log_2(d) \rceil - 1}$, we see

$e_D < d - 2^{\lceil \log_2(d) \rceil - 1} \leq \frac{d}{2} \leq 2^{\lceil \log_2(d) \rceil - 1} = 2^{k-N}$

So we have $e_D = \text{mod}_d(2^k) \leq 2^{k-N}$. $\square$

So, if we use algorithm B for cooperative divisors and algorithm D for uncooperative divisors, we can always fit $m$ in an $N$-bit word.

## Compiler optimizations

Here, I give a sketch of how the code generation for a compiler backend could look. I assume that the compiler backend has an `expression_t` type, which can contain nested calls to instructions like `shr`, `umulhi`, etc. An unsigned integer `v` can be passed as a compile-time constant expression by writing `constant(v)`.

For the optimization of a division by a constant, we have the following flowchart. The right arrow should be followed if the question in the node is answered positively, else the left arrow should be followed.

*(Flowchart: optimization of a division by a constant unsigned integer.)*

Roughly speaking, the algorithms are ordered from least efficient on the left, to most efficient on the right. The flowchart can be implemented in C as

```
expression_t div_by_const_uint(const uint d, expression_t n) {
    if (is_power_of_two(d)) {
        if (d == 1) return n;
        else return shr(n, constant(floor_log2(d)));
    }
    else {
        if (d > MAX / 2) return gte(n, constant(d));
        else return mul_based_div_by_const_uint(d, n);
    }
}
```

For the multiplication-based techniques, we have the following flowchart:

*(Flowchart: selection of a multiplication-based division technique.)*

Algorithms on the right are usually slightly more efficient.

This translates into the following C code:

```
expression_t mul_based_div_by_const_uint(uint d, expression_t n) {
    // compute minus_mod = mod_d(-2^N)
    uint mod = (uint)(~d + 1) % d;
    uint minus_mod = mod ? d - mod : 0;
    uint k = loop_k(d, minus_mod, N);
    uint bits = k + 1 - ceil_log2(d);
    if (bits <= N) {
        // Algorithm B: cooperative divisors
        uint m = calc_m(d, k) + 1;
        expression_t hiword = umulhi(n, constant(m));
        if (k == N) return hiword;
        else return shr(hiword, constant(k - N));
    }
    if (d % 2 == 0) {
        // Algorithm C: even uncooperative divisors
        uint p = powers_of_two(d);
        uint q = d >> p;
        // compute minus_mod = mod_q(-2^N)
        mod = (uint)(~q + 1) % q;
        minus_mod = mod ? q - mod : 0;
        k = loop_k(q, minus_mod, N - p);
        uint m = calc_m(q, k) + 1;
        expression_t n_prime = shr(n, constant(p));
        expression_t hiword = umulhi(n_prime, constant(m));
        if (k == N) return hiword;
        else return shr(hiword, constant(k - N));
    }
    else {
        // Algorithm D: odd uncooperative divisors
        k = loop_k(d, mod, N);
        uint m = calc_m(d, k);
        expression_t n_inc = sbb(add(n, constant(1)), constant(0));
        expression_t hiword = umulhi(n_inc, constant(m));
        if (k == N) return hiword;
        else return shr(hiword, constant(k - N));
    }
}
```

The code uses a helper function for computing $k$:

```
// Double 'init' modulo d until it is at most 2^(k - M), return k
int loop_k(uint d, uint init, uint M) {
    int k = N;
    for (uint mod = init, pow = 1 << (k - M); mod > pow; pow <<= 1, k++) {
        // overflow-safe way of doubling 'mod' modulo d
        if (mod < d - mod) mod += mod;
        else mod -= d - mod;
    }
    return k;
}
```

## Runtime optimizations

Suppose we want to perform faster division by a runtime constant. In this case, we have to compute the magic value at runtime and call a function `fast_divide` to perform the division. For the sake of efficiency, we want this function to be branchless (it should not contain any if-statements).

Following [2], we note that both algorithm B and algorithm D follow the common pattern of computing $a \cdot n + b$ and right shifting the result. So the following struct holds enough information:

```
struct fast_div_t {
    uint mul, add, shift;
};
```

The `fast_divide` function now looks like

```
inline uint fast_divide(uint n, fast_div_t d) {
    uint hiword = ((big_uint)n * d.mul + d.add) >> N;
    return (uint)(hiword >> d.shift);
}
```

For cooperative divisors, we use algorithm B and set $a = m$, $b = 0$. For uncooperative divisors, we use algorithm D and set $a = m$, $b = m$. For $d = 1$ we need a special case, which sets $a = b = 2^N - 1$ and uses a shift of zero:

```
fast_div_t precalc_div(uint d) {
    fast_div_t div;
    assert(d > 0);
    if (d == 1) {
        div.mul = MAX;
        div.add = MAX;
        div.shift = 0;
    }
    else {
        uint mod = (uint)(~d + 1) % d;
        uint minus_mod = mod ? d - mod : 0;
        uint k = loop_k(d, minus_mod, N);
        uint bits = k + 1 - ceil_log2(d);
        if (bits <= N) {
            div.mul = calc_m(d, k) + (mod > 0);
            div.add = 0;
            div.shift = k - N;
        }
        else {
            k = loop_k(d, mod, N);
            bits = k + 1 - ceil_log2(d);
            assert(bits <= N);
            div.mul = div.add = calc_m(d, k);
            div.shift = k - N;
        }
    }
    return div;
}
```
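To make the pattern concrete, here is a self-contained sketch with hand-computed `fast_div_t` values for $d = 3$ (cooperative, algorithm B) and $d = 7$ (uncooperative, algorithm D), matching the examples worked out earlier; the constants are filled in by hand rather than computed by `precalc_div`:

```c
#include <stdint.h>

typedef uint32_t uint;
typedef uint64_t big_uint;
#define N 32

typedef struct {
    uint mul, add, shift;
} fast_div_t;

static inline uint fast_divide(uint n, fast_div_t d) {
    uint hiword = (uint)(((big_uint)n * d.mul + d.add) >> N);
    return hiword >> d.shift;
}

// Hand-computed: d = 3 is cooperative (a = m, b = 0),
// d = 7 is uncooperative (a = b = m).
static const fast_div_t DIV3 = { 2863311531u, 0u, 1u };
static const fast_div_t DIV7 = { 1227133513u, 1227133513u, 1u };
```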

This is certainly not the only way to implement division by run-time constant divisors, but I think it is pretty elegant. The libdivide library provides an even nicer interface -- the code here is just a reference implementation for educational purposes.

## On testing

If you implement these techniques, I recommend that you manually work out some examples for 4-bit numbers. After that, you can implement the methods for 8- and 16-bit numbers. It should be possible to test these exhaustively. That is, you should be able to test if your code gets the quotient $\lfloor \frac{n}{d} \rfloor$ right for every $n, d \in U_N$ with $d \neq 0$. For 32-bit numbers, it should still be possible to check, for all $d \neq 0$, the value of the quotient $\lfloor \frac{n}{d} \rfloor$ where $n$ is either of the form $s \cdot d$ or of the form $s \cdot d - 1$. You should also test $n = 0$ and $n = 2^{32} - 1$. For 64-bit quotients, it is impossible to check all possibilities within a reasonable time. However, you can just pick a random value for $d$, and check the quotient for $n = 0$ and $n = 2^{64} - 1$ along with some random values for $n$. If your code works for all 32-bit integers and you have no failing cases after a long time of random testing, you have good reason to believe that there are no blatant errors in your code.
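As a sketch of such an exhaustive test at $N = 8$ (my own; it checks theorem 1 directly by always using the upper bound $k = 8 + \lceil \log_2(d) \rceil$ and $m = \lceil \frac{2^k}{d} \rceil$, computed in 32-bit arithmetic):

```c
#include <stdint.h>

// Verify floor(n * m / 2^k) == n / d for all 8-bit n and all divisors d,
// with k = 8 + ceil(log2(d)) and m = ceil(2^k / d).
int test_all_8bit(void) {
    for (uint32_t d = 1; d < 256; d++) {
        uint32_t ceil_log2 = 0;
        while ((1u << ceil_log2) < d) ceil_log2++;
        uint32_t k = 8 + ceil_log2;
        uint32_t m = ((1u << k) + d - 1) / d; // ceil(2^k / d)
        for (uint32_t n = 0; n < 256; n++)
            if ((n * m) >> k != n / d) return 0;
    }
    return 1;
}
```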

## Further reading

The classic work on the subject of dividing by constants is [1], which also covers division by signed constants and testing for divisibility. Algorithm A is based on this work. In [3] and [4], a very readable introduction to division by unsigned constants is given. Further, the idea of using the saturating increment in algorithm D is introduced there. The paper [2] contains many similar ideas, but is a bit more formal. In [5], the techniques from [1] are extended to implement the modulo operation more efficiently.

### References

[1]: Division by Invariant Integers using Multiplication, Torbjörn Granlund and Peter L. Montgomery, 1994.

[2]: N-Bit Unsigned Division Via N-Bit Multiply-Add, Arch D. Robison, 2005.

[3]: Labor of Division (Episode I), fish, 2010.

[4]: Labor of Division (Episode III): Faster Unsigned Division by Constants, fish, 2011.

[5]: [Faster Remainder by Direct Computation: Applications to Compilers and Software Libraries](https://arxiv.org/pdf/1902.01961), Daniel Lemire, Owen Kaser, Nathan Kurz, 2019.