# Division by constant unsigned integers

The code accompanying this article can be found in a github repository.

Most modern processors have an integer divide instruction which, for technical reasons, is very slow compared to other integer arithmetic operations. When the divisor is constant, it is possible to evaluate the result of the division without using the slow division instruction. Most optimizing compilers perform this optimization, as can be seen on Matt Godbolt’s compiler explorer. This article tries to be a self-contained reference for optimizing division by constant unsigned divisors.

There are tricks to make division by some special divisors fast: A division by one can be ignored, a division by a power of two can be replaced by a bit shift. A more general trick exists: By using fixed-point arithmetic, we can speed up division by all constant divisors. However, this is not simple to explain, and for this reason I will focus only on positive (or ‘unsigned’) divisors in this article. For a number $n$ (also called the *dividend*) and a divisor $d$, I assume that we are interested in the *quotient* $⌊dn ⌋$.

When it is necessary to repeatedly divide floating-point numbers by the same constant $c$, it is often preferred to precompute the floating-point number $c1 $ and multiply by this instead. This is usually a lot more efficient. We can do something similar for integers by using fixed point arithmetic. The basic idea is to pick a large constant $L$ that is easy to divide by. In decimal you can take some power of ten, in binary you would take a power of two so that we can divide by $L$ using bit shifts. If we would have $c=dL $ we would have $Ln⋅c =Ln⋅dL =dn $. However, $dL $ is usually not an integer, and we need to round. Still, if $c≈dL $ we still expect that $Ln⋅c ≈dn $.

In the following section, I’ll discuss the mathematical background. In the sections after that, I’ll discuss optimization of division by unsigned integers.

## Mathematical background

### Preliminaries

I will assume that we are working on an $N$-bit machine which can efficiently compute the full $2N$-bit product of two $N$-bit unsigned integers. I will denote $N_{0}={0,1,2,...}$ for the set of natural numbers including zero, $N_{+}={1,2,3,...}$ for the set of positive natural numbers, and $Z={...,−1,0,1,...}$ for the set of integers. Further, I will use the notation $U_{N}$ for the set of unsigned integers that can be represented with $N$ bits:

$U_{N}={0,1,...,2_{N}−1}$I will use the notation $mod_{d}(n)$ to denote the integer in the range ${0,1,...,d−1}$ that is equivalent to $n$ modulo $d$. That is, $mod_{d}(n)$ is the unique integer such that

$0≤mod_{d}(n)<d$ $mod_{d}(n)=n+m⋅d$for some $m∈Z$.

### Unsigned division

For a given divisor $d$, we now want to have an expression that evaluates to $⌊dn ⌋$ for all $n∈U_{N}$. We have two demands for this expression:

- It should be
*correct*, that is, the expression is equal to $⌊dn ⌋$ - It should be
*efficient*to evaluate. For our purposes, this means that we want to multiply numbers of at most $N$ bits, so that they fit in a single register on our $N$-bit machine, and that we implement the division by $2_{k}$ with a bit shift.

Guided by these demands, we can start our analysis. The following lemma will be useful:

**Lemma 1**: *Suppose that $n∈Z$, $d∈N_{+}$. If $dn ≤x<dn+1 $ then $⌊x⌋=⌊dn ⌋$.*

**Proof**: We have $dn+1 =⌊dn ⌋+dk $ for some nonnegative integer $k≤d$. So $dn+1 =⌊dn ⌋+dk ≤⌊dn ⌋+1$. It follows that $x∈[⌊dn ⌋,⌊dn ⌋+1)$, so that $⌊x⌋=⌊dn ⌋$.

$□$

At this point, we have decided that we want an expression that evaluates to $⌊dn ⌋$. In the introduction, we have established that we have $2_{k}n⋅m ≈dn $ whenever $m≈d2_{k} $. So, it is natural to take $⌊2_{k}m⋅n ⌋$ with $m≈d2_{k} $ as our starting point, since we know that $2_{k}m⋅n ≈dn $.

We now need to decide how exactly to pick $m$ and $k$. The obvious choice for $m$ are the integers that minimize the error to $d2_{k} $, which are $m_{down}=⌊d2_{k} ⌋$ and $m_{up}=⌈d2_{k} ⌉$. Note that when $d$ is not a power of two and put $n=d$, we have $⌊2_{k}m_{down}⋅d ⌋=0$, so this is not a viable method. So, let’s round up instead.

This gives us the **round-up method**, which approximates $⌊dn ⌋$ by $⌊2_{k}m_{up}⋅n ⌋$ with $m_{up}=⌈d2_{k} ⌉$.

First, we determine the conditions under which the round-up method produces the correct result. Note that from this point on we will assume that $k$ is larger than $N$. We will assume $k=N+ℓ$ and use $k$ and $N+ℓ$ interchangeably.

First, we would like to know when the round-up method is correct. The following theorem states a condition under which the round-up method is correct, that is, $⌊2_{k}m_{up}⋅n ⌋=⌊dn ⌋$ for all $n∈U_{N}$.

**Theorem 2 (correctness of the round-up method)**: *Let $d,N,ℓ∈N_{0}$ with $d>0$. If there exists an $m∈N_{+}$ with*

*then $⌊2_{N+ℓ}m⋅n ⌋=⌊dn ⌋$ for all $n∈U_{N}$*.

**Proof**: Multiplying the inequality by $d⋅2_{N+ℓ}n $ we get

For all $n∈U_{N}$ we have $n<2_{N}$, so that $2_{N}n <1$. It follows that

$∀n∈U_{N}:dn ≤2_{N+ℓ}m⋅n ≤dn +d1 $By lemma 1, it follows that $⌊2_{N+ℓ}m⋅n ⌋=⌊dn ⌋$ for all $n∈U_{N}$.

$□$

The following corollary gives us a more practical way to check if there is a multiple of $m$ between $2_{N+ℓ}$ and $2_{N+ℓ}+2_{ℓ}$.

**Corollary**: *Let $d,ℓ,N∈N_{0}$ with $d>0$ and $m_{up}=⌈d2_{N+ℓ} ⌉$. If $m_{up}⋅d≤2_{N+ℓ}+2_{ℓ}$ we have $⌊2_{N+ℓ}m_{up}⋅n ⌋=⌊dn ⌋$ for all $n∈U_{N}$. If $m_{up}⋅d>2_{N+ℓ}+2_{ℓ}$ then there exists no $m$ such that $2_{N+ℓ}≤m⋅d≤2_{N+ℓ}+2_{ℓ}$.*

**Proof**: Note that $m_{up}⋅d=⌈d2_{N+ℓ} ⌉⋅d$ is simply the first multiple of $d$ that is greater than or equal to $2_{N+ℓ}$. So the bound $2_{N+ℓ}≤m_{up}⋅d$ is always satisfied. If we have $m_{up}≤2_{N+ℓ}+2_{ℓ}$, then $m_{up}$ satisfies the condition of theorem 2 and we have $⌊2_{N+ℓ}m⋅n ⌋=⌊dn ⌋$ for all $n∈U_{N}$. If we have $m_{up}⋅d>2_{N+ℓ}+2_{ℓ}$, the first multiple that is greater than or equal to $2_{N+ℓ}$ is larger than $2_{N+ℓ}+2_{ℓ}$. This means that there is no multiple of $d$ in the range $2_{N+ℓ},2_{N+ℓ}+1,...,2_{N+ℓ}+2_{ℓ}$, and that there exists no $m$ such that $2_{N+ℓ}≤m⋅d≤2_{N+ℓ}+2_{ℓ}$.

$□$

Let’s do some examples to see how we can use the round-up method in practice.

**Example**: Let’s take $N=8$, $d=3$. First, we try $ℓ=0$ and compute $m_{up}=⌈32_{8} ⌉=86$. We have $86⋅3=258>257=2_{8}+2_{0}$. So we increase $ℓ$ to one and try again. This time we get $m_{up}=⌈32_{9} ⌉=171$. We have $171⋅3=513≤514=2_{9}+2_{1}$. So the condition of the corollary is satisfied and for any 8-bit unsigned integer $n∈U_{8}$ we have $⌊2_{9}171⋅n ⌋=⌊3n ⌋$. Note that $m_{up}=171$ fits in $N$ bits.

**Example**: Let’s take $N=8$, $d=7$. First, we try $ℓ=0$ and compute $m_{up}=⌈72_{8} ⌉=37$. We see that $37⋅7=259>257=2_{8}+2_{0}$. So we increase $ℓ$ to one and try again. This time we get $m_{up}=⌈72_{9} ⌉=74$. We see that $74⋅7=518>514=2_{9}+2_{1}$. Again, we increase $ℓ$ to two and check the bound again: $m_{up}=⌈72_{10} ⌉=147$, and $147⋅7=1029>1028=2_{1}0+2_{2}$. Increasing $ℓ$ to four, we get $m_{up}=⌈72_{11} ⌉=293$, and $293⋅7=2051≤2056=2_{11}+2_{3}$. So the condition of the corollary is satisfied and for any 8-bit unsigned integer $n∈U_{8}$ we can have $⌊2_{9}293⋅n ⌋=⌊3n ⌋$.

This last example shows that $m_{up}$ does not always fit in $N$ bits. In this case, $m_{up}$ does not fit in a single register, which makes the evaluation of $m_{up}⋅n$ inefficient. The following theorem shows that $m_{up}$ always fits in $N+1$ bits.

**Theorem 3**: *Let $N,d∈N_{0}$ with $d>0$ and define $ℓ=⌈g_{2}(d)⌉,m_{up}=⌈d2_{N+ℓ} ⌉$. Then $m_{up}∈U_{N+1}∖U_{N}$ and $⌊2_{N+ℓ}m_{up}⋅n ⌋=⌊dn ⌋$ for all $n∈U_{N}$.*

**Proof**: The range ${2_{N+ℓ},2_{N+ℓ}+1,...,2_{N+ℓ}+2_{ℓ}}$ consists of $2_{ℓ}+1$ consecutive numbers. We have $ℓ=⌈g_{2}(d)⌉$, so $2_{ℓ}+1>d$ and there must be a multiple of $d$ in this range. Since $d⋅m_{up}=d⋅⌈d2_{N+ℓ} ⌉$ is simply the first multiple greater than or equal to $2_{N+ℓ}$, this is a multiple of $d$ in this range. By theorem 2, it follows that $⌊2_{N+ℓ}m⋅n ⌋=⌊dn ⌋$ for all $n∈U_{N}$. By using lemma 5 with $N+1$ instead of $N$, we see that $2_{N}≤m_{up}<2_{N+1}$, so $m_{up}∈U_{N+1}∖U_{N}$.

$□$

One way to handle the case when $m$ does not fit in $N$ bits, is to use the following trick from [1] to multiply by an $(N+1)$-bit constant: We can evaluate $⌊2_{N+ℓ}m⋅n ⌋$ by defining $m_{′}=m−2_{N}$ and using the following code:

```
uint high_word = (((big_uint)m') * n) >> N;
uint result = (high_word + ((n - high_word) >> 1)) >> (l - 1);
```

This trick is used by some compilers to handle divisors for which $m_{up}$ does not fit in $N$ bits.

We will now focus on another method that is more efficient than using this trick (but less efficient than the round-up method). Instead of rounding up $d2_{N+ℓ} $ to get $m$, we can round down but increase $n$.

This gives us the **round-down method**, which approximates $⌊dn ⌋$ by $⌊2_{N+ℓ}m_{down}⋅(n+1) ⌋$ by $m_{down}=⌊d2_{N+ℓ} ⌋$.

We proceed as we did for the round-up method, by deriving a condition under which the method is correct.

**Theorem 4 (correctness of the round-down method)**: *Let $d,N,ℓ∈N_{0}$ with $d>0$. If there exists an $m∈N_{+}$ with*

*then*

**Proof**: Multiply the inequality by $d⋅2_{N+ℓ}n+1 $ to get

Looking at the expression on the left side, we have $1≤n+1≤2_{N}$, so that $0≤1−2_{N}n+1 <1$. It follows that $dn ≤dn +d1 ⋅(1−2_{N}n+1 )$, so

$dn ≤2_{N+ℓ}m⋅(n+1) <dn+1 $for all $n∈U_{N}$. So lemma 1 applies and we have $⌊2_{N+ℓ}m⋅(n+1) ⌋$ for all $n∈U_{N}$.

$□$

When using the round-up or round-down method, we want $m$ to be an $N$-bit number in order for the multiplication to be efficient. As we have seen, such an $m$ does not always exist. Let us call divisors for which such an $m$ exists *efficient*. The following definition makes this rigorous.

**Definition**: *We call a positive divisor $d∈U_{N}$ efficient for the $N$-bit round-up method (or round down method) if there exists an $ℓ∈N_{0}$ such that $⌊2_{N+ℓ}m_{up}⋅n ⌋=⌊dn ⌋$ for all $n∈U_{N}$. Likewise, we call a positive divisor $d∈U_{N}$ efficient for the $N$-bit round-down method if there exists an $ℓ∈N_{0}$ such that $⌊2_{N+ℓ}m_{down}⋅(n+1) ⌋=⌊dn ⌋$ for all $n∈U_{N}$.*

The following result tells us that $ℓ=⌈g_{2}(d)⌉−1$ is the biggest value for $ℓ$ that we can pick so that $m_{up}$ and $m_{down}$ still fit in $N$ bits.

**Lemma 5**: *Let $N∈N_{+}$, $d∈U_{N}$ and define $m_{′}=d2_{N+⌈log(d)⌉−1} ,m_{down}=⌊m_{′}⌋,m_{up}=⌈m_{′}⌉$. Now $m_{down},m_{up}∈U_{N}$ with*

That is, the binary representations of $m_{down}$ and $m_{up}$ have exactly $N$ bits.

**Proof**: We have $2_{N−1}=d2_{N−1}⋅d =d2_{N−1+log(d)} ≤d2_{N−1+⌈log(d)⌉} =m_{′}$. Since the floor function is a nondecreasing function and $⌊m_{′}⌋≤⌈m_{′}⌉$ we have $2_{N−1}≤m_{down}≤m_{up}$. It remains to show that $m_{up}≤2_{N}−1$.

When $N=1$, it follows that $d=1$ and the bound holds. When $N>1$, the ratio $d2_{⌈log(d)⌉} $ is maximized over $d∈U_{N}$ by $d=2_{N−1}+1$, so $d2_{⌈log(d)⌉} ≤2_{N−1}+12_{N} <2_{N−1}2_{N}−1 $ (this last inequality can be seen by multiplying both sides by $2_{N−1}⋅(2_{N−1}+1)$). So we have $d2_{⌈log(d)⌉} <2_{N−1}2_{N}−1 $; multiplying both sides by $2_{N−1}$ gives $m_{′}=d2_{N+⌈log(d)⌉−1} <2_{N}−1$. After rounding up both sides, it follows that $m_{up}≤2_{N}−1$.

$□$

The following result states that we can combine the round-up and round-down method to a method that works for any positive divisor. So this finally concludes our quest for an efficient method to calculate the quotient $⌊dn ⌋$.

**Theorem 6**: *Any positive divisor $d∈U_{N}$ is either efficient for the $N$-bit round-up method, or efficient for the $N$-bit round-down method.*

**Proof**: Set $ℓ=⌈g_{2}(d)⌉−1$. By lemma 6, we know that $m_{up},m_{down}∈U_{N}$. Now consider the range ${2_{N}−2_{ℓ},2_{N}−2_{ℓ}+1,...,2_{N}+ℓ}$. This is a range of $2ℓ+1$ numbers. Since $d<2ℓ+1$ there must be at least one multiple $m$ of $d$ in this range. When this multiple $m$ satisfies $2_{N}−2_{ℓ}≤m<2_{N+ℓ}$ the condition for the round-down method is satisfied. Since $m_{down}∈U_{N}$, $d$ is efficient for the $N$-bit round-down method. Otherwise, we have $2_{N+ℓ}≤m≤2_{N+ℓ}+2_{ℓ}$ and the condition for the round-up method is satisfied. Again, since $m_{up}∈U_{N}$ we have that $d$ is efficient for the $N$-bit round-up method.

$□$

This theorem also implies that we never need to check if a divisor is efficient for the $N$-bit round-down method. We can simply test if a divisor is efficient for the $N$-bit round-up method, and use the round-down method if it is not.

**Exercise** (solution can be found at the end of this document): *Let $N∈N$, let $d∈U_{N}$ be a positive integer that is not a power of two. Define $ℓ=⌈g_{2}(d)⌉−1$, $x=d2_{N+ℓ} $, $m_{up}=⌈x⌉$, and $m_{down}=⌊x⌋$. Show that if $x$ is closer to $m_{up}$ than to $m_{down}$, the condition of theorem 2 is satisfied with $m=m_{up}$. Then show that if $x$ is closer to $m_{down}$ than to $m_{up}$, the condition of theorem 4 is satisfied by $m=m_{down}$.*

The following result gives a more efficient condition to check if a given divisor is efficient for the round-up method.

**Lemma 7**: *Let $N∈N_{+}$ and $d∈U_{N}$ and define $ℓ=⌈g_{2}(d)⌉−1$ and $m_{up}=⌈d2_{N+ℓ} ⌉$. If $mod_{2_{N}}(m_{up}⋅d)≤2_{ℓ}$ then $d$ is efficient for the round-up method.*

**Proof**: The product $m_{up}⋅d=⌈d2_{N+ℓ} ⌉⋅d$ is the first multiple of $d$ that is equal to or larger than $2_{N+ℓ}$. This product will be of the form $2_{N+ℓ}+q$ for some $q<d≤2_{ℓ+1}$, so we have $mod_{2_{N}}(m_{up}⋅d)=q$. It follows that $2_{N+ℓ}≤m_{up}⋅d≤2_{N+ℓ}+2_{ℓ}$ if and only if $mod_{2_{N}}(m_{up}⋅d)≤2_{ℓ}$.

$□$

Armed with these results, let’s do the examples from before again:

**Example**: We set $ℓ=⌊g_{2}(3)⌋=1$ and compute $m_{up}=⌈32_{9} ⌉=171$. We have $mod_{256}(171⋅3)=1≤2=2_{ℓ}$. So $3$ is efficient for the 8-bit round-up method and we have $⌊2_{9}171⋅n ⌋=⌊3n ⌋$ for all $n∈U_{8}$.

**Example**: We set $ℓ=⌊g_{2}(7)⌋=2$ and compute $m_{up}=⌈72_{10} ⌉=147$. We have $mod_{256}(147⋅7)=5>4=2_{ℓ}$. So $7$ is not efficient for the 8-bit round-up method. According to theorem 6, it must be efficient for the 8-bit round-down method, so we have $⌊2_{1}0147⋅(n+1) ⌋=⌊7n ⌋$ for all $n∈U_{8}$.

As mentioned before, the round-up method is more efficient. For even divisors that are not efficient for the roundup method, there exists a trick. Say $d=2_{p}d_{′}$. Then we can put $n_{′}=⌊2_{p}n ⌋$ and evaluate $⌊dn ⌋$ as $⌊d_{′}n_{′} ⌋$. This has the benefit that $n_{′}∈U_{N−p}$ has at most $N−p$ bits.

**Lemma 8**: *Let $d,N$ be positive integers, and let $d=2_{p}d_{′}$ for some positive integer $p$. Define $ℓ=⌊g_{2}(d)⌋$ and $m_{up}=⌈d2_{N+ℓ} ⌉$. Now $⌊2_{N+ℓ+1−2p}m_{up}⋅n_{′} ⌋=⌊dn ⌋$ for all $n∈U_{N}$ with $n_{′}=⌊2_{p}n ⌋$ and $m_{up}=⌈2_{p−1}m_{up} ⌉$.*

**Proof**: Define $N_{′}=N−p$ and $ℓ_{′}=⌈g_{2}(d_{′})⌉=ℓ−p$. Now we have $m_{up}=⌊2_{p−1}m_{up} ⌋=⌊2_{p−1}⌈d2 ⌉ ⌋=⌊2_{p−1}⌈d⋅22 ⌉ ⌋=⌈d2_{N+ℓ+1} ⌉$. Using theorem 3, we see that $⌊2_{N+l}m_{up}⋅n_{′} ⌋=⌊d_{′}n_{′} ⌋=⌊dn ⌋$ for every $n_{′}∈U_{N−p}$. Since $n_{′}∈U_{N−p}$ whenever $n∈U_{N}$, this holds for every $n∈U_{N}$.

$□$

**Example**: Take $N=8$, $d=14$. Setting $ℓ=⌊g_{2}(d)⌋=3$ we get $m_{up}=⌈d2_{N+ℓ} ⌉=147$. We see that $mod_{256}(14⋅147)=10>8=2_{ℓ}$. So $14$ is not efficient for the roundup method. However, it is even, so using lemma 8 with $p=1$ we get $n_{′}=⌊2n ⌋$, $m_{up}=m_{up}$. We see that $⌊2_{10}147⋅n_{′} ⌋=⌊dn ⌋$ with $n_{′}=⌊2n ⌋$ for all $n∈U_{N}$.

On some architectures it is faster to shift by fewer bits. So, for the purpose of optimization, we might be interested in finding the smallest $ℓ$ such that $m_{up}$ satisfies the condition of theorem 2 (or the smallest). Surprisingly, there is an easy way to find the smallest $m_{up}$ or $m_{down}$ that satisfies the condition of theorem 2 or theorem 4.

The following theorem says that to find the smallest $m$

**Theorem 9**: *Let $d,m∈U_{N}$, where $d>0$ is not a power of two. Let $ℓ∈N_{+}$ such that $ℓ≤⌊g_{2}(d)⌋$ and let $ℓ,m$ satisfy the condition of theorem 2:*

*or the condition of theorem 4, respectively:*

*If $m$ is even, then $ℓ_{′},m_{′}$ with $ℓ_{′}=ℓ−1$ and $m_{′}=2m $ also satisfy the condition. So $2_{N+ℓ_{′}}≤m_{′}⋅d≤2_{N+ℓ_{′}}+2_{ℓ_{′}}$, respectively $2_{N+ℓ_{′}}−2_{ℓ_{′}}≤m_{′}⋅d<2_{N+ℓ_{′}}$.*

*If $m$ is odd, this is the smallest $m$ that satisfies this condition.*

**Proof**: Suppose that $m$ satisfies the condition of theorem 2. In this case, we have $2_{N+ℓ}≤m⋅d≤2_{N+ℓ}+2_{ℓ}$. It is easy to see that when $m$ is even all expressions in the inequality are even, so we can divide by two and see that $2_{N+ℓ−1}≤2m ⋅d≤2_{N+ℓ−1}+2_{ℓ−1}$. The case for the condition of theorem 4 is analogous.

Suppose that there is a smaller pair $ℓ_{′},m_{′}$ that satisfies the condition $2_{N+ℓ_{′}}≤m_{′}⋅d≤2_{N+ℓ_{′}}+2_{ℓ_{′}}$. By multiplying the whole thing by $2_{ℓ−ℓ_{′}}$, we see that $2_{N+ℓ}≤2_{ℓ−ℓ_{′}}⋅m_{′}⋅d≤2_{N+ℓ}+2_{ℓ}$. The set ${2_{N+ℓ},2_{N+ℓ}+1,...,2_{N+ℓ}+2_{ℓ}}$ has $2_{ℓ}+1$ elements. We have $2_{ℓ}+1≤2_{⌊log_{2}(d)⌋}+1≤d$ (this last inequality holds since $d$ is not a power of two), so there can only be one multiple of $d$ in this set, which is $m⋅d$. So we have $m=2_{ℓ−ℓ_{′}}⋅m_{′}$, so $m$ must be even.

$□$

**Example**: Take $N=8$, $d=36$. We compute $ℓ=⌊g_{2}(d)⌋=5$ so $m_{up}=⌈d2_{N+ℓ} ⌉=228$. We see that $mod_{256}(228⋅36)=16≤32=2_{ℓ}$. So $36$ is efficient for the round-up method and we have $⌊2_{13}228⋅n ⌋=⌊dn ⌋$ for all $n∈U_{8}$. However, we see that $m_{up}$ is even, so we can reduce $m_{up}$ by dividing it by two and decrementing $ℓ$. We get $m_{up}=114$, $ℓ=4$. Now $m_{up}$ is still even, so again we divide $m_{up}$ by two and decrement $ℓ$. We get $m_{up}=57$ and $ℓ=3$. So we have $⌊2_{11}57⋅n ⌋=⌊dn ⌋$ for all $n∈U_{8}$.

**Example**: Take $N=8$, $d=11$. We set $ℓ=⌊g_{2}(d)⌋=3$ so that $m_{up}=⌈d2_{N+ℓ} ⌉=187$. Now $mod_{2_{N}}(187⋅11)=7≤2_{9}$, so $11$ is not efficient for the $N$-bit round-up method. Since 11 is odd, we use the round-down method. Using theorem 6, we see that 11 is efficient for the $N$-bit round down method. We have $m_{down}=186$ so we can use lemma 9 to see that we have $⌊2_{10}93⋅n ⌋=⌊11n ⌋$.

**Example**: Take $N=32$, $d=641$. We set $ℓ=⌊g_{2}(d)⌋=9$ so that $m_{up}=⌈d2_{N+ℓ} ⌉=3430613504$. Now $mod_{2_{N}}(3430613504⋅641)=512≤512=2_{ℓ}$, so $641$ is efficient for the round-up method. Now write $3430613504=2_{9}⋅6700417$. Using lemma 9 we see that we have $⌊2_{N+ℓ}m_{up}⋅n ⌋=⌊641n ⌋$ with $ℓ_{′}=ℓ−9$ and $m_{up}=2_{9}m_{up} $ for all $n∈U_{32}$. So we see that $⌊2_{32}6700417⋅n ⌋=⌊dn ⌋$ for all $n∈U_{32}$. The division by $2_{32}$ can be implemented by taking the upper $32$ bits of the product $6700417⋅n$.

**Example**: Take $N=8$, $d=28$. Taking $ℓ=⌊g_{2}(28)⌋=4$ and $m_{up}=⌈282_{N+ℓ} ⌉=147$, we see that $mod_{2_{N}}(m_{up}⋅28)=20>16=2_{ℓ}$. So 28 is not efficient for the 8-bit round-up method. Applying lemma 8 gives us that $⌊2_{9}74⋅n_{′} ⌋=⌊dn ⌋$ with $n_{′}=⌊4n ⌋$ for all $n∈U_{8}$. We can now apply lemma 9 to see that $⌊2_{8}37⋅n_{′} ⌋=⌊dn ⌋$ with $n_{′}=⌊4n ⌋$ for all $n∈U_{8}$. This means that we only have to look at the upper 8 bits, and we need just one shift instruction to compute $n_{′}=⌊4n ⌋$. So this ends up as efficient as the round-up method, which also needs just one shift and a multiplication.

## Implementation

We distinguish between compile-time optimization and runtime optimization of constant unsigned integers. To illustrate, the divisor `d`

in the following code is a compile-time constant:

```
const unsigned int d = 61;
for (int i = 0; i < size; i++) {
quotient[i] = dividend[i] / d;
}
```

The divisor `d`

in the following code is a runtime constant:

```
unsigned int d = read_divisor();
for (int i = 0; i < size; i++) {
quotient[i] = dividend[i] / d;
}
```

In case of a runtime constant divisor, most compilers will not optimize the division. However, we can do it ourselves, by doing something like:

```
unsigned int d = read_divisor();
divdata_t divisor_data = precompute(divisor);
for (int i = 0; i < size; i++) {
quotient[i] = fast_divide(dividend[i], divisor_data);
}
```

In this section, we will discuss both how a compiler can generated optimized code and how to efficiently implement the `precompute`

and `fast_divide`

functions for runtime constant divisors. The libdivide library, which was started by the author of [3] and [4], is a mature implementation that can optimize division by runtime constant integers. The library is conveniently included in a single header file, contains a nice C++ interface, and also implements vector implementations of integer division, which can make division even faster.

During compile-time, there is a lot of room for optimizations. Typically, programmers are OK with waiting a fraction of a second longer if this means that their code executes faster. So for division by compile-time constants, there is time for extensive optimizations. During runtime, time is more precious, so we want to keep the precomputation reasonably efficient. Further, the `fast_divide`

function is used for all divisors; We want to keep it efficient and, if possible, branchless.

For compile-time optimization, we can spend some more effort to produce better optimized code. For runtime optimization, we need to do the precomputation in runtime, which means that we want the precomputation to be efficient as well. Simply put, for compile-time optimization it will pay off to distinguish many special cases for which we can produce more efficient code. For runtime optimization the challenge is the other way around: We want to have a single, fast code path which handles all cases with good efficiency.

### Runtime optimization

Let’s take the example from before as a starting point. I will assume that we have a constant `N`

which denotes the number of bits in the unsigned integer datatype `uint`

. I’ll also assume the existence of `big_uint`

, an unsigned integer datatype with $2N$ bits. Using these datatypes, the example from before becomes:

```
uint divisor = get_number_from_user();
divdata_t divisor_data = precompute(divisor);
for (int i = 0; i < size; i++) {
quotient[i] = fast_divide(dividend[i], divisor_data);
}
```

Both the round-up and the round-down method can be implemented using an expression of the form `(n * m + add) >> (N + shift)`

to compute $⌊dn ⌋$, so it is natural to define the struct `divdata_t`

with these fields:

```
typedef struct {
uint mul, add, shift;
} divdata_t;
```

The `fast_divide`

function is now straightforward to write:

```
uint fast_divide(uint n, divdata_t dd) {
big_uint full_product = ((big_uint)n) * dd.mul + dd.add;
return (full_product >> N) >> dd.shift;
}
```

Let’s start with the implementation of the `precompute`

function. First, we define a variable `divdata`

to hold the result. I’ll save you the trouble and note that the case $d=1$ needs to be handled separately. If we work out this case by hand we see that when we set $ℓ=⌈g_{2}(1)⌉−1$ we get $m=2_{N−1}$ and $2_{N−1}m⋅n =n$. While this is mathematically correct, we can’t use the `fast_divide`

function to evaluate this expression, because it divides by $2_{N−1}$ while the `fast_divide`

function implements this as a right shift by $N$, followed by another right shift. This last right shift would have to be a right shift by negative one. Technically, you could say that this is a right shift by one, but most architectures don’t work this way and even if it would, we would lose a bit that has been shifted out by the first shift.

Luckily, we use a trick. If we set both `divdata.mul`

and `divdata.add`

to $2_{N}−1$, things will work out since

So let’s write a skeleton for the `precompute`

function that only handles the case $d=1$:

```
divdata_t precompute(uint d) {
divdata_t divdata;
// d = 1 is a special case
if (d == 1) {
divdata.mul = max;
divdata.add = max;
divdata.shift = 0;
return divdata;
}
// normal path
...
return divdata;
}
```

For the normal path (that is, all the cases except $d=1$) we can use the test from lemma 7:

```
// normal path
uint l = ceil_log2(d) - 1;
uint m_down = (((big_uint)1) << (N + l)) / d;
uint m_up = (is_power_of_two(d)) ? m_down : m_down + 1;
uint temp = m_up * d;
bool use_round_up_method= temp <= (1 << l);
// set fields of divdata
...
```

Note that the `temp`

variable is an $N$-bit unsigned integer. It is just used to compute $m_{up}⋅d$ modulo $2_{N}$. Also note that we need a special case for when $d$ is a power of two for rounding.

Setting the `mul`

, `add`

, and `shift`

fields of `divdata`

is straightforward:

```
// set fields of divdata
if (use_round_up_method) {
divdata.mul = m_up;
divdata.add = 0;
}
else {
divdata.mul = m_down;
divdata.add = m_down;
}
divdata.shift = l;
```

When you stitch the snippets together and include `bits.h`

from the appendix you should get a working implementation of `precompute`

. However, we can make it a bit more efficient. Note that we have a special case for $d=1$ and also special cases for when $d$ is a power of two. The trick we used for $d=1$ actually can be generalized to handle all powers of two.

We can set `mul = max`

and `add = max`

so that the high word contains the dividend. Then we only need need to shift the high word right by $g_{2}(d)$ bits. Now, we can also compute $ℓ=⌊g_{2}(d)⌋$ instead of $⌈g_{2}(d)⌉−1$ (which is slightly more efficient), since there is only a difference for powers of two.

```
divdata_t precompute(uint d) {
divdata_t divdata;
uint l = floor_log2(d);
if (is_power_of_two(d)) {
divdata.mul = max;
divdata.add = max;
}
else {
uint m_down = (((big_uint)1) << (N + l)) / d;
uint m_up = m_down + 1;
uint temp = m_up * d;
bool use_round_up_method = temp <= (1 << l);
if (use_round_up_method) {
divdata.mul = m_up;
divdata.add = 0;
}
else {
divdata.mul = m_down;
divdata.add = m_down;
}
}
divdata.shift = l;
return divdata;
}
```

There are probably lots of optimizations possible, but this is a reasonably simple and efficient version. It is adapted from [2].

### Compile-time optimization

Here, I will show a sample implementation of the idea. I use a hacked-together framework which is meant to resemble the part of a compiler backend that does code generation. The `uint`

datatype is an $N$-bit unsigned integer, where $N$ is 8, 16, or 32. The datatype `expression_t`

is used to denote an expression tree, which can be used for code generation. This is all implemented in `compiler.h`

, which is included in the appendix. For simplicity, I have not implemented register allocation – every expression is stored from and to `r0`

. This means that only really simple expressions can be used, but this turns out to be enough for our purpose.

We will implement a function `div_by_const_uint`

that takes a constant divisor `const uint d`

and an expression `expression_t n`

that represents the dividend. It will return an expression that represents the instructions that will be executed. As an example, the expression $a⋅(b+5)$ can be written as `mul(a, add(b, constant(5))`

in this way.

I will use `constant(n)`

where `n`

is an `uint`

to denote a constant, `shr(a, b)`

to denote a right shift of `a`

by `b`

bits, `umulhi(a, b)`

to denote the high $N$ bits of a multiplication, `add(a, b)`

to denote the sum of `a`

and `b`

, `sbb(a, b)`

to denote a subtraction with a borrow if the carry bit is set, and `gte(a, b)`

which returns 1 if `a`

is greater than or equal to `b`

and 0 if `a`

is less than `b`

. Here, `a`

and `b`

can be expressions themselves. All nodes in an expression correspond directly to an instruction and the parameters to a function are passed in `r0`

, `r1`

, etc. So a function `evaluate`

that has a single parameter `a`

and returns $3a+5$ can be implemented by the expression tree `add(mul(a, 3), 5)`

. This will generate the following instructions:

```
evaluate:
mul r0, r0, 3
add r0, r0, 5
ret
```

Admittedly, I picked the instructions and calling conventions to be convenient for my use case. The instructions that are emitted by the compiler are typically a bit longer. Still, for most cases the instructions are similar to those that compilers emit. You can see the instructions that clang 11.0 outputs for some divisors on godbolt.org. Note that clang 11.0 still produces suboptimal instructions for the case of an odd divisor that is not efficient for the round-up method, because it uses the method mentioned after theorem 3. Benchmarks from [4] show that using the round-down method is faster.

As mentioned before, for compile-time optimization we can distinguish a lot of cases to squeeze out every last bit of performance. Some special cases that can be implemented particularly efficient are:

- Division by one, which can be handled by setting the quotient equal to the dividend.
- Division by a power of two, which can be implement by a bit shift.
- Division by an integer larger than half of the maximum value of the dividend, which can be implemented by setting the quotient to zero if it is smaller than the divisor, and to one otherwise.

```
expression_t div_by_const_uint(const uint d, expression_t n) {
if (d == 1) return n;
if (is_power_of_two(d)) return shr(n, constant(floor_log2(d)));
if (d > MAX / 2) return gte(n, constant(d));
return div_fixpoint(d, n);
}
```

For divisors that do not fall in one of these special cases, we use fixed-point arithmetic to efficiently implement the division. We first test if the divisor is efficient for the $N$-bit round-up method. For divisors that are efficient for the $N$-bit round-up method, we use the round-up method. For even divisors that are not efficient for the round-up method, we use a modified version of the round-up method. Finally, for odd divisors that are not efficient for the round-up method, we use the round-down method, which is slightly less efficient.

```
expression_t div_fixpoint(uint d, expression_t n) {
// test if d is efficient for round-up method
...
if (use_round_up_method) {
// round-up method
...
}
if ((d & 1) == 0) {
// even divisors which are not efficient for round-up method
// handled by doing a preshift and using round-up method
...
}
// round-down method
...
}
```

To test if a divisor is efficient for the round-up method we simply implement the condition from lemma 7:

```
// test if d is efficient for round-up method
uint l = floor_log2(d);
uint m_down = (((big_uint)1) << (N + l)) / d;
uint m_up = m_down + 1;
uint product_mod_2N = m_up * d;
bool use_round_up_method = product_mod_2N <= (1 << l);
```

The round-up method is pretty straightforward to implement. For efficiency, we reduce $m$ and $ℓ$ using lemma 9.

```
// round-up method
// find smallest m
while ((m_up & 1) == 0 && l > 0) {
m_up >>= 1;
l--;
}
return shr(umulhi(n, constant(m_up)), constant(l));
```

For even divisors which are not efficient for the round-up method, we can use lemma 8. First, we compute $n_{′}$ from $n$ by a right shift. The `preshift`

variable holds the number of bits of this first shift. Then, we multiply $n_{′}$ by $m_{up}$, and shift the result by `postshift`

. You should convince yourself that the following implementation computes `preshift`

and `postshift`

in accordance to lemma 8. In particular, note that this implementation can increment `preshift`

too often, so that `postshift`

becomes negative, and that a separate correction step for this is needed.

```
// even divisors which are not efficient for round-up method
// handled by doing a preshift and using round-up method
// pre-shift as much as possible; postshift might
// become negative, this is corrected later
d >>= 1;
int preshift = 1, postshift = l - 1;
while ((d & 1) == 0 && postshift > 0) {
d >>= 1;
preshift++;
postshift -= 2;
m_up = (m_up + 1) >> 1;
}
// optimize m
while ((m_up & 1) == 0 && postshift > 0) {
m_up >>= 1;
postshift--;
}
// correct if preshift is too large and
// postshift has become negative
if (postshift < 0) {
m_up = m_up << 1;
postshift++;
}
expression_t n_prime = shr(n, constant(preshift));
return shr(umulhi(n_prime, constant(m_up)), constant(postshift));
```

There is a subtle point we have to take into consideration when implementing the round-down method. When we naively implement the evaluation of the expression $⌊2_{N+ℓ}m_{down}⋅(n+1) ⌋$, we would increment $n$. However, when $n$ has the maximum value of $2_{N}−1$, the expression $n+1$ will overflow. We can overcome this in a couple of different ways.

The first way is to use that $m_{down}⋅(n+1)=m_{down}⋅n+m_{down}$. This latter expression has $2N$ bits when $n$ and $m_{down}$ have $N$ bits. Some architectures have integer fused multiply-add instructions which directly compute $a⋅b+c$. For example, in [2], the `XMA.HU`

instruction on Itanium is used, which computes the $N$ high bits of $a⋅b+c$. Other architectures may have support for doing $2N$-bit arithmetic. For example, this is the case when doing a division by a 32-bit integer on a 64-bit machine. Otherwise, most instruction sets have:

- An instruction to add two values and set a carry flag if the sum has overflown
- An instruction to add with carry, which adds two values and increases the result by one if the carry flag is set

Suppose we have an instruction set with an `add`

instruction which sets the carry flag on overflow, and an `adc`

instruction which adds with carry. Adding a constant, say `12345`

to a $2N$-bit value contained in registers `r0`

(low $N$ bits) and `r1`

(high $N$ bits) is as easy as:

```
add r0, r0, 12345
adc r1, r1, 0
```

In [4], it is suggested to do a *saturating increment* of $n$ instead of naively calculating $n+1$. This is the same as only incrementing $n$ when $n<2_{N}−1$. So effectively, this calculates the expression

When $n=2_{N}−1$ we do not increment and we are effectively computing $⌊d2_{N}−2 ⌋$ instead of $⌊d2_{N}−1 ⌋$. The only way that this difference can matter is when $d2_{N}−2 =⌊d2_{N}−2 ⌋+dd−1 $, so that $d2_{N}−1 =d2_{N}−2 +d1 =⌊d2_{N}−2 ⌋+1$. In this case $d$ is a divisor of $2_{N}−1$, and according to the following result a divisor of $2_{N}−1$ is efficient for the $N$-bit round-up method. So as long as we only use the round-down method for divisors which are not efficient for the round-up method, we will never get an incorrect result when we use a saturating increment.

**Lemma 10**: *If $d$ is a divisor of $2_{N}−1$, then $d$ is efficient for the $N$-bit round-up method.*

**Proof**: Take $ℓ=⌊g_{2}(d)⌋$ and define $e=mod_{d}(−2_{N+1})$ and $e_{′}=mod_{d}(2_{N+1})$. Using $mod_{d}(2_{N})=1$ and $2_{⌊log_{2}(d)⌋}<d$, we see:

Here, we used that $2_{⌊log_{2}(d)⌋}<d$. Since $d$ is a divisor of $2_{N}−1$, it can’t be a power of two, so we have $e,e_{′}∈{1,2,...,d−1}$. It follows that $e+e_{′}=d$ since $e+e_{′}≡0modd$. We find that

$e=d−e_{′}≤2_{⌈log_{2}(d)⌉}−2_{⌊log_{2}(d)⌋}=2_{⌊log_{2}(d)⌋}=2_{ℓ}$So the condition $e=mod_{d}(2_{N+ℓ})≤2_{ℓ}$ for corollary 5 is satisfied. Since $d$ is not a power of two, we have that $ℓ=⌊g_{2}(d)⌋=⌈g_{2}(d)⌉−1$. By using lemma 6, we now see that $m_{up}∈U_{N}$ for $ℓ=⌊g_{2}(d)⌋$. So $d$ is efficient for the $N$-bit round-up method.

$□$

A saturating increment can be implemented as an increment by one, followed by a subtraction with borrow. A subtraction with borrow works similar to an addition with carry: it subtracts a value from another, and decrements the result when the carry flag is set. I have used the `sbb`

instruction to denote a subtraction with borrow. In the following code, `n_inc`

is the result of applying a saturating increment on `n`

:

```
// round-down method
// find smallest m
while ((m_down & 1) == 0 && l > 0) {
m_down >>= 1;
l--;
}
expression_t n_inc = sbb(add(n, constant(1)), constant(0));
expression_t hiword = umulhi(n_inc, constant(m_down));
return shr(hiword, constant(l));
```

### Testing

When writing code for $N=8$ or $N=16$, I recommend that you check the result is correct for every pair $n,d∈U_{N}$ with $d>0$. This can be as simple as:

```
for (uint d = 1; true; d++) {
divdata_t dd = precompute(d);
for (uint n = 0; true; n++) {
assert(fast_divide(n, dd) == n / d);
if (n == UINT_MAX) break;
}
if (d == UINT_MAX) break;
}
```

For $N=8$, this code will run more or less instantly, for $N=16$ it will take a couple of minutes. For $N=32$ this program won’t terminate anytime soon. However, what you can you is test, for every divisor $d∈U_{N}$ with $d>0$, all numbers of the form $n⋅d,n⋅d−1∈U_{N}$:

```
for (uint d = 1; true; d++) {
divdata_t dd = precompute(d);
assert(fast_divide(0, dd) == 0);
assert(fast_divide(1, dd) == 1 / d);
assert(fast_divide(UINT_MAX, dd) == UINT_MAX / d);
uint bound = UINT_MAX / d;
for (uint k = 1, n = d; true; k++) {
assert(fast_divide(n, dd) == k);
assert(fast_divide(n - 1, dd) == k - 1);
if (k == bound) break;
n += d;
}
if (d == UINT_MAX) break;
}
```

This will still take a long time, but it is feasible; The above code took slightly over 40 minutes to run on my machine.

For $N=64$, exhaustive testing is completely infeasible. I recommend making a set $S$ of special values consisting of all numbers from 0 up to 256, all numbers of the form $2_{k}−1$, $2_{k}$, $2_{k}+1$, the divisors of these numbers, and the divisors of $2_{64}+1$. Then we can test if the result is correct for each $n∈S$, $d∈S∖{0}$. On top of that, you can run some randomized testing. I recommend that you pick $n$ and $d$ uniformly random in $U_{64}$ and then mask out each byte with some probability. Then, you can run random tests for some time. If you find a bug in the implementations, add $n$ and $d$ for which the result is not correct to $S$ and verify that your test set triggers the bug before you fix it.

## Solution to the exercise

Note that since $d$ is not a power of two $x=d2_{N+ℓ} $ is never an integer, so we have $m_{up}=m_{down}+1$.

Suppose that $x=d2_{N+ℓ−1} $ is closer to $m_{up}$ than to $m_{down}$. This means that $m_{up}−x<21 $, or, equivalently, $m_{up}<x+21 $. Since $m_{up}=⌈x⌉$ we have $x<m_{up}$. Combining these inequalities, we see that

$x<m_{up}<x+21 $Substituting $x=d2_{N+ℓ−1} $ we get

$d2_{N+ℓ−1} <m_{up}<d2_{N+ℓ−1} +21 $Multiplying by $d$ and using $2d <2_{ℓ}$ gives

$2_{N+ℓ}<m_{up}⋅d<2_{N+ℓ}+2d <2_{N+ℓ}+2_{ℓ}$which is stricter than the condition $2_{N+ℓ}≤m⋅d≤2_{N+ℓ}+2_{ℓ}$ of theorem 2, so the round-up method can be applied.

Similarly, if we assume that $x$ is closer to $m_{down}$ we find

$d2_{N+ℓ} −21 <m_{down}<d2_{N+ℓ} $And by multiplying by $d$ we find

$2_{N+ℓ}−2_{ℓ}<2_{N+ℓ}−2d <m_{down}⋅d<2_{N+ℓ}$This is stricter than the condition of theorem 4, so the round-down method can be applied.

## References and further reading

The classic reference for optimization of division by both signed and unsigned integers is [1], which states and proves theorem 2 (round-up method). In [2], the approach is extended to the round-down method. The resources [3] and [4] are by far the easiest to read, and cover essentially everything in this article in a more accessible way. They sacrifice some rigor and completeness, though. If you are interested in optimizing division, I recommend reading these articles as your starting point. Finally, in [5] it is shown that we can also use fixed point math to compute the modulo operation.

[1] Division by Invariant Integers using Multiplication, Torbjörn Granlund and Peter L. Montgomery, 1994.

[2] N-Bit Unsigned Divison Via N-Bit Multiply-Add, Arch D. Robinson, 2005.

[3] Labor of Divison (Episode I), fish, 2010.

[4] Labor of Divison (Episode III): Fast Unsigned Division by Constants, fish, 2011.

[5] Faster Remainder by Direct Computation: Applications to Compilers and Software Libraries, Daniel Lemire, Owen Kaser, Nathan Kurz, 2019.