Division by constant integers

This post is a survey of the methods to optimize integer division.

The methods in this post originate from the classic 1991 paper “Division by Invariant Integers using Multiplication” by Torbjörn Granlund and Peter L. Montgomery. In particular, the paper presents what in this post is called the “round-up method”. This method was further popularized by Henry S. Warren in his 2002 book “Hacker’s Delight”. In the 2005 paper “N-Bit Unsigned Division Via N-Bit Multiply-Add” by Arch D. Robinson presents what in this post is called the “round-down method”. In episode I and episode III of the series of articles cleverly named “The Labor of Division”, ridiculous_fish both independently discovers the round-down method, and clearly describes both methods.

What does this post add to the already existing literature? For one, it serves as a reference, combining the results in a rigorous and structured way. I am presenting the results that power the different methods (theorem 2 and 3) in a slightly simpler and more unified way that is consistent with the original presentation in the paper “Division by Invariant Integers using Multiplication”.

Overview

First, the round-up method is presented, which shows that for every NN-bit divisor d>0d > 0 there is a constant mm so that the upper bits of mnmn equal nd\lfloor \frac{n}{d} \rfloor for every NN-bit unsigned nn. For certain “uncooperative” divisors the constant mm will not fit in NN bits, in which case the method is feasible but less efficient. To improve the efficiency, the round-down method is presented, which is similar to the round-up method but more efficient for uncooperative divisors. Then, we will discuss how the method can be applied to signed integers.

Intuition

To understand why these multiplicative methods work, let’s build some intuition. The idea is to choose an integer constant, mm, that approximates 2kd\frac{2^k}{d}. This way, mm effectively represents 1d\frac{1}{d} in binary form. Multiplying by 2k2^k shifts the fractional bits to more significant positions, allowing them to be captured as an integer.

Simply put, since we have m2kdm \approx \frac{2^k}{d}, we expect that mn2knd\frac{mn}{2^k} \approx \frac{n}{d}. So, it isn’t to far-fetched to expect that mn2k=nd\lfloor \frac{mn}{2^k} \rfloor = \lfloor \frac{n}{d} \rfloor. In the next section we’ll see under which conditions this equality holds.

If this doesn’t click yet, consider an example in base 10 instead. Pick d=3d = 3 and instead of 2k2^k, we’ll pick 10410^4 and m=33341043m = 3334 \approx \frac{10^4}{3}. Picking n=492n = 492 we see that mn104=1640328104\frac{mn}{10^4} = \frac{1640328}{10^4}. This equals 164.0328164.0328, so indeed we have mn104=4923=164\lfloor \frac{mn}{10^4} \rfloor = \lfloor \frac{492}{3} \rfloor = 164.

Round-up method

We start with deriving a theorem that shows the method is correct. To prove this theorem, we’ll need the following lemma.

Lemma 1: Suppose that nn and dd are integers with d>0d > 0, and xx is a real number. If ndx<n+1d\frac{n}{d} \leq x < \frac{n + 1}{d} then x=nd\lfloor x \rfloor = \lfloor \frac{n}{d} \rfloor.

Proof: We have ndx<n+1dnd+1\frac{n}{d} \leq x < \frac{n + 1}{d} \leq \lfloor \frac{n}{d} \rfloor + 1. After taking the floor of all terms and considering that the right-hand side is an integer, we get ndx<nd+1\lfloor \frac{n}{d} \rfloor \leq \lfloor x \rfloor < \lfloor \frac{n}{d} \rfloor + 1, which implies x=nd\lfloor x \rfloor = \lfloor \frac{n}{d} \rfloor.

\square

Now, we are ready to prove the result that justifies the round-up method.

Theorem 2: Let dd, mm, \ell be nonnegative integers with d0d \neq 0 and
2N+dm2N++2(1) 2^{N + \ell} \leq d \cdot m \leq 2^{N + \ell} + 2^\ell \tag{1}

then mn2N+=nd\lfloor \frac{mn}{2^{N + \ell}} \rfloor = \lfloor \frac{n}{d} \rfloor for every integer nn with 0n<2N0 \leq n < 2^N.

Proof: Multiplying the inequality by nd2N+\frac{n}{d \cdot 2^{N + \ell}} we get ndmn2N+nd+1dn2N\frac{n}{d} \leq \frac{mn}{2^{N + \ell}} \leq \frac{n}{d} + \frac{1}{d} \cdot \frac{n}{2^N}. We have n<2Nn < 2^N, so that n2N<1\frac{n}{2^N} < 1. It follows that ndmn2N+<nd+1d\frac{n}{d} \leq \frac{mn}{2^{N + \ell}} < \frac{n}{d} + \frac{1}{d}. By lemma 1, it follows that mn2N+=nd\lfloor \frac{mn}{2^{N + \ell}} \rfloor = \lfloor \frac{n}{d} \rfloor for all integers nn with 0n<2N0 \leq n < 2^N.

\square

So if we can find \ell, mm satisfying (1)(1), we can compute nd\lfloor \frac{n}{d} \rfloor as mn2N+\lfloor \frac{mn}{2^{N + \ell}} \rfloor. An mm that satisfies (1)(1) exists only if and only if the interval [2N+,2N++2][2^{N + \ell}, 2^{N + \ell} + 2^\ell] contains a multiple of dd.

If log2(d)\ell \geq \log_2(d), the interval will contain dd or more successive integers and we can be certain that the interval contains a multiple of dd. Since d2N+dd \cdot \lceil \frac{2^{N + \ell}}{d} \rceil is just the first multiple of dd that is at least as large as 2N+2^{N + \ell}, we can always set m=2N+dm = \lceil \frac{2^{N + \ell}}{d} \rceil; if there an mm exists that satisfies (1)(1), this mm will do so too.

Corollary 3: If log2(d)\ell \geq \log_2(d) and m=2N+dm = \lceil \frac{2^{N + \ell}}{d} \rceil, then mn2N+=nd\lfloor \frac{mn}{2^{N + \ell}} \rfloor = \lfloor \frac{n}{d} \rfloor for every integer nn with 0n<2N0 \leq n < 2^N.

However, we would like to have m<2Nm < 2^N so that mm fits in an NN-bit word, but this requires <log2(d)\ell < \log_2(d). So, this we can’t guarantee this, and indeed we see that for some divisors dd there only exist mm, \ell with m>2Nm > 2^N.

So, concretely, how can we find mm and \ell? First we pick =log2(d)\ell = \lfloor \log_2(d) \rfloor and set m=2N+dm = \lceil \frac{2^{N + \ell}}{d} \rceil. For mm calculating this way we have m<2Nm < 2^N, so we have to check that it satisfies (1)(1).

We can also see that if mm is even, we can divide we can divide all terms in (1)(1) by two and get the valid equation 2N+1dm22N+1+212^{N + \ell - 1} \leq d \cdot \frac{m}{2} \leq 2^{N + \ell - 1} + 2^{\ell - 1}. So if mm is even we can replace mm by m2\frac{m}{2} and \ell by 1\ell - 1, and (1)(1) will still hold. In some cases, this allows us to find a smaller constant mm that still satisfies (1)(1).

We will call a divisor dd for which an m<2Nm < 2^N, \ell satisfying (1)(1) exists cooperative, because they allow us to easily compute the product mnmn for NN-bit unsigned integers NN. Divisors that are not cooperative are called uncooperative; for them it is more tricky and less efficient to calculate the product mnmn. We will consider both cases in the following subsections.

Cooperative divisors

To make the last section more concrete, let’s optimize the following function.

uint32_t div9(uint32_t n) {
	return n / 9;
}

Using the strategy outlined in the previous section, we set =log2(9)=3\ell = \lfloor \log_2(9) \rfloor = 3 and compute m=232+39=3817748708m = \lceil \frac{2^{32 + 3}}{9} \rceil = 3817748708. Working modulo 2322^{32} we see that mod232(dm)\text{mod}_{2^{32}}(d \cdot m) equals mod232(34359738372)=4\text{mod}_{2^{32}}(34359738372) = 4. We have 4<234 < 2^3, so the test succeeds and 33 is a cooperative divisor.

However, we see that mm is even. We can divide it by two to get 38177487083817748708, which is also even. Dividing by two once more gives 954437177954437177. Adjusting for this, we only need to shift right by 3333 bits, instead of 3535.

So the optimized version of the function is

uint32_t div9(uint32_t n) {
	return (((uint64_t)954437177) * n) >> 33;
}

Indeed, GCC compiles div9 and div9_opt to the same assembly. For example in x86_64 we get:

div9:
	mov eax, edi
	mul rax, rax, 954437177
	shr rax, 33
	ret

For 32-bit instruction sets, the shift is usually implemented by taking the high 32-bit word of the full 64-bit and performing a right shift by one on that. For example, in 32-bit RISC-V:

div9:
	li a5, 954437177
	mulhu a0, a0, a5
	srli a0, a0, 1
	ret

Uncooperative divisors

Let’s take the function from the last section but instead divide by 77:

uint32_t div7(uint32_t n) {
	return n / 7;
}

Doing the same, we compute =log2(7)=2\ell = \lfloor \log_2(7) \rfloor = 2, m=232+27=2454267027m = \lceil \frac{2^{32 + 2}}{7} \rceil = 2454267027. Now, we compute mod232(dm)\text{mod}_{2^{32}}(d \cdot m) to be mod232(17179869189)=5>4\text{mod}_{2^{32}}(17179869189) = 5 > 4. So the test fails and 77 and we see that 77 is an uncooperative divisor.

This means we can’t use the value we just calculated for mm and instead we have to set =log2(7)=3\ell = \lceil \log_2(7) \rceil = 3. We get m=232+37=4908534053m = \lceil \frac{2^{32 + 3}}{7} \rceil = 4908534053. This is a 33-bit value and we have to resort to some tricks to evaluate mn2N+\lfloor \frac{mn}{2^{N + \ell}} \rfloor without overflow.

A first idea is to set m=m2Nm' = m - 2^N and compute the product mnmn in two 32-bit words as (232+m)n=mn+232n(2^{32} + m')n = m' \cdot n + 2^{32} n. However, this doesn’t work since in general the product of a 32-bit number and a 33-bit number does not fit in 64 bits. So we need to take care that the addition of 232n2^{32} n to mnm' n does not overflow.

This can be done by evaluating mn2N+\lfloor \frac{mn}{2^{N + \ell}} \rfloor as ((((mn >> N) - n) >> 1) + n) >> (l - 1), where mn is the full 2N2N-bit product.

So the optimized version of the function is

uint32_t div7(uint32_t n) {
	// compute the high word of the product m' * n
	uint32_t mn = (((uint64_t)613566757) * n) >> 32;

	// compute (m * n) >> 1 in an overflow-safe way
	uint32_t mn2 = ((n - mn) >> 1) + mn;

	// shift right by l - 1 bits
	return mn2 >> 2;
}

Indeed, GCC gives the same assembly. In x86_64:

div7:
	mov eax, edi
	imul rax, rax, 613566757
	shr rax, 32
	sub edi, eax
	shr edi
	add eax, edi
	shr eax, 2
	ret

Or in RISC-V:

div7:
	li a5, 613566757
	mulhu a5, a0, a5
	sub a0, a0, a5
	srli a0, a0, 1
	add a0, a0, a5
	srli a0, a0, 2
	ret

Even uncooperative divisors

Suppose we have an even uncooperative divisor dd of the form d=2pcd = 2^pc, so that p1p \geq 1. We can implement a division by dd as a division by 2p2^p – which can be implemented efficiently with a bit shift – followed by a division by cc. Now, cc will be a uncooperative divisor too, but since the dividend nn will only be an (Np)(N - p)-bit number, the constant mm will fit in NN bits. This only costs us a single bit shift, which is more efficient than implementing the overflow-avoiding product from last section.

As an example, consider the following function:

uint32_t div28(uint32_t n) {
	return n / 28;
}

Following the procedure as before, we set =log2(28)=4\ell = \lfloor \log_2(28) \rfloor = 4 and test m=2N+d=2454267027m = \lceil \frac{2^{N + \ell}}{d} \rceil = 2454267027. We see that mod232(dm)=20>16=2\text{mod}_{2^{32}}(d \cdot m) = 20 > 16 = 2^\ell so the test fails and 28 is an uncooperative divisor.

So we note that we have an even uncooperative divisor and set n=n4n' = \lfloor \frac{n}{4} \rfloor. We now need to divide nn', a 3030-bit number, by 77. If 77 would be a cooperative divisor, then 2k72^k \cdot 7 would be too, so we know that 77 is an uncooperative divisor. So we set N=30N' = 30, =log2(7)=3\ell = \lceil \log_2(7) \rceil = 3 and get m=2N+d=1227133514m = \lceil \frac{2^{N' + \ell}}{d} \rceil = 1227133514.

Note that mm is even, so we can use m2\frac{m}{2} instead and decrease \ell. We get m=613566757m = 613566757, N=30N = 30, =2\ell = 2. The optimized version of the function is as follows:

uint32_t div28(uint32_t n) {
	n = n >> 2;
	return (((uint64_t)613566757) * n) >> 32;
}

TODO: add godbolt links

In x86_64 assembly:

div28:
	mov eax, edi
	shr eax, 2
	imul rax, rax, 613566757
	shr rax, 32
	ret

In RISC-V:

div28:
	li a5, 613566757
	srli a0, a0, 2
	mulhu a0, a0, a5
	ret

Round-down method

The following theorem powers the round-down method. I present the result in a form analogous to theorem 2.

Theorem 3: Let dd, mm, \ell be nonnegative integers with d0d \neq 0 and
2N+2dm2N+(2) 2^{N + \ell} -2^\ell \leq d \cdot m \leq 2^{N + \ell} \tag{2}

then m(n+1)2N+=nd\lfloor \frac{m(n + 1)}{2^{N + \ell}} \rfloor = \lfloor \frac{n}{d} \rfloor for every integer nn with 0n<2N0 \leq n < 2^N.

Proof: Multiplying the inequality by n+1d2N+\frac{n + 1}{d \cdot 2^{N + \ell}} we get nd+1d(1n+12N)m(n+1)2N+n+1d\frac{n}{d} + \frac{1}{d} \cdot (1 - \frac{n + 1}{2^N}) \leq \frac{m(n + 1)}{2^{N + \ell}} \leq \frac{n + 1}{d}. Since n<2Nn < 2^N we have ndnd+1d(1n+12N)\frac{n}{d} \leq \frac{n}{d} + \frac{1}{d} \cdot (1 - \frac{n + 1}{2^N}) and the condition of lemma 1 is satisfied. So we have m(n+1)2N+=nd\lfloor \frac{m(n + 1)}{2^{N + \ell}} \rfloor = \lfloor \frac{n}{d} \rfloor for all integers nn with 0n<2N0 \leq n < 2^N.

\square

The computation of mm is very similar to the round-up method: We take =log2(d)\ell = \lfloor \log_2(d) \rfloor and compute m=2N+dm = \lfloor \frac{2^{N + \ell}}{d} \rfloor. The difference is of course that we round mm down instead of up (which can hardly be a surprise, given the name of the method). So it’s only the increment that can overflow.

For the evaluation of the expression m(n+1)m(n + 1) we do need to be careful to avoid overflows again. It can happen that n=2N1n = 2^N - 1, and in this case naively computing n+1n + 1 will overflow. The product m(n+1)m(n + 1) needs to be computed with a “widening” multiplication as before (sometimes you can get away with a “compute high word of product” instruction, if the target architecture has one).

Possible strategies to avoid overflow include:

  1. Compute the full 2N2N-bit product mnmn (not just the high NN-bit word), and add mm to that.
  2. Use a saturating increment on nn. The implementation is architecture-dependent, but it can usually be done with an increment, followed by a single instruction that subtracts the overflow bit.

In most cases the second approach seems better. The exception is when the division is done on NN-bit integers but the target architecture is 2N2N-bit (typically this happens when dividing 32-bit integers on a 64-bit procesors). In this case, we can just use a 2N2N-bit word to store n+1n + 1 without worrying about overflow.

Correctness of the saturating-increment approach

When we use the saturating-increment approach, we calculate m(n+1)2N+=nd\lfloor \frac{m(n + 1)}{2^{N + \ell}} \rfloor = \lfloor \frac{n}{d} \rfloor for every n<2N1n < 2^N - 1. For n=2N1n = 2^N - 1, incrementing nn would lead to an overflow, and we don’t increment. This means that instead of m(n+1)2N+=2N1d\lfloor \frac{m(n+1)}{2^{N+\ell}} \rfloor = \lfloor \frac{2^N - 1}{d} \rfloor we calculate mn2N+=lfloor2N2d\lfloor \frac{mn}{2^{N+\ell}} \rfloor = lfloor \frac{2^N-2}{d} \rfloor. I claim that 2N2d=2N1d\lfloor \frac{2^N - 2}{d} \rfloor = \lfloor \frac{2^N - 1}{d} \rfloor when dd is an uncooperative divisor.

We only have 2N2d2N1d\lfloor \frac{2^N - 2}{d} \rfloor \neq \lfloor \frac{2^N - 1}{d} \rfloor when dd is a divisor of 2N12^N - 1. But in this case, we have 2N1(mod d)2^N \equiv 1 (\text{mod}\ d), which means that 2N++22^{N + \ell} + 2^\ell equals 2+2-2^\ell + 2^\ell modulo dd. So dd divides 2N++22^{N + \ell} + 2^\ell, and the condition for theorem 2 holds.

Example

Checking for overflow is a bit messy in pure C. I’ll assume we use GCC and use the __builtin_uadd_overflow builtin for this. Details can be found here. To check for overflow, we use it like this:

// add unsigned integers x and y
if (__builtin_uadd_overflow(x, y, &result)) {
	// there was an overflow, result equals the low 32 bits of x + y
} else {
	// there was no overflow, result equals x + y
}

We’ll now implement an optimized version of the following function.

uint32_t div7(uint32_t n) {
	return n / 7;
}

With N=32N = 32 we compute =log2(7)=2\ell = \lfloor \log_2(7) \rfloor = 2. We test if d=7d = 7 is a cooperative divisor by checking if mod232(dm)<2\text{mod}_{2^{32}}(d \cdot m) < 2^\ell with m=2347m = \lceil \frac{2^{34}}{7} \rceil. The left-hand side is 5 and the right-hand side is 4, so d=7d = 7 is an uncooperative divisor.

So we use the round-down method and set m=2347m = \lfloor \frac{2^{34}}{7} \rfloor so that m(n+1)234=n7\lfloor \frac{m(n + 1)}{2^{34}} \rfloor = \lfloor \frac{n}{7} \rfloor.

Now, as discussed in the previous section, when n=2N1n = 2^N - 1 we have mn234=m(n+1)234=n7\lfloor \frac{mn}{2^{34}} \rfloor = \lfloor \frac{m(n + 1)}{2^{34}} \rfloor = \lfloor \frac{n}{7} \rfloor so that we can use a saturating increment to compute mnmn when n=2N1n = 2^N - 1 and m(n+1)m(n + 1) otherwise.

So we can write an optimized version:

uint32_t div7(uint32_t n) {
	if (__builtin_uadd_overflow(n, 1, &n)) {
		n--;
	}
	return (((uint64_t)2454267026) * n) >> 34;
}

RISC-V:

div7:
	# saturating increment a0
	sltiu   a5,a0,-1
	add     a0,a5,a0

	# get high word of m * (n + 1)
	# (or m * n when n = 2^32 - 1)
	li a4, 2454267026
	mulhu   a0,a0,a4

	# shift right
	srli    a0,a0,2

	ret

In x86_64 we get

div7:
	# move first arg (n) to eax
	mov eax, edi

	# saturating increment on n
	add eax, 1
	sbb eax, 0

	# get high word of m * (n + 1)
	# (or m * n when n = 2^32 - 1)
	mov edx, 2454267026
	mul rax, rdx

	# shift right
	shr rax, 34

	ret

Since x64_64 is a 64-bit architecture, we can use a 64-bit word for n + 1 without having to worry about overflow.

uint32_t div7(uint32_t n32) {
	uint64_t n = n32;
	return (((uint64_t)2454267026) * (n + 1)) >> 34;
}

This makes the resulting assembly ever so slightly more elegant (and I suspect more efficient as well, although I haven’t benchmarked it):

div7:
	# move first arg (n) to eax/rax
	mov eax, edi

	# increment rax/n
	add rax, 1

	# load m * (n + 1) into rax
	mov edx, 2454267026
	mul rax, rdx

	# shift rax right
	shr rax, 34

	ret

Signed integers

For nonnegative signed integers, we can use the same expression as we did for the round-up method. In fact, since the maximum magnitude of signed integers is smaller than the maximum magnitude of unsigned integers, it is the case that for every divisor d>0d > 0 we can find an NN-bit magic number mm such that
mn2N1+=nd \lfloor \frac{mn}{2^{N - 1 + \ell}}\rfloor = \lfloor \frac{n}{d} \rfloor

for all integers nn with 0n2N0 \leq n \leq 2^N. Proving this is straightforward using a variation of the proof of theorem 2.

Theorem 4: Let dd, mm, \ell be nonnegative integers with d0d \neq 0 and
2N1+dm<2N1++2 2^{N - 1 + \ell} \leq d \cdot m < 2^{N - 1 + \ell} + 2^\ell

then mn2N+=nd\lfloor \frac{mn}{2^{N + \ell}} \rfloor = \lfloor \frac{n}{d} \rfloor for every integer nn with 0n2N10 \leq n \leq 2^{N-1}.

Proof: When n=0n = 0 the result is trivial. When n>0n > 0, we can multiply the inequality by nd2N+\frac{n}{d \cdot 2^{N + \ell}} to get ndmn2N+<nd+1dn2N\frac{n}{d} \leq \frac{mn}{2^{N + \ell}} < \frac{n}{d} + \frac{1}{d} \cdot \frac{n}{2^N}. We have n2N1n \leq 2^{N-1}, so that n2N1\frac{n}{2^N} \leq 1. It follows that ndmn2N+<nd+1d\frac{n}{d} \leq \frac{mn}{2^{N + \ell}} < \frac{n}{d} + \frac{1}{d}. By lemma 1, it follows that mn2N+=nd\lfloor \frac{mn}{2^{N + \ell}} \rfloor = \lfloor \frac{n}{d} \rfloor for all integers nn with 0n2N10 \leq n \leq 2^{N-1}.

\square

But how we do handle the case where the quotient is negative? In most programming languages, integer division rounds towards zero. That means that when the quotient is negative, the expression should be rounded up instead of down.

Another subtlety that we have to handle, is that mm is an unsigned number while nn is a signed number. Processors generally have instructions to multiply unsigned integers or signed integers, but not to multiply a signed integer by an unsigned one. So we have to figure out how we can evaluate this product.

For now, I will assume that the divisor dd is positive. I will address the case where dd is negative later – it will turn out to be quite simple to handle.

Rounding up instead of down

We have the identity n+d1d=nd\lfloor \frac{n + d - 1}{d} \rfloor = \lceil \frac{n}{d} \rceil when nn and dd are integers. Using this, we can evaluate mn2N1+\lceil \frac{mn}{2^{N - 1 + \ell}} \rceil as m(n+d1)2N1+\lfloor \frac{m(n + d - 1)}{2^{N - 1 + \ell}} \rfloor.

If mnmn is not a power of two we have
2N1+<md<2N1++2 2^{N-1+\ell} < md < 2^{N-1+\ell} + 2^\ell

Since 2N1+2^{N-1+\ell} and 2N1++22^{N-1+\ell} + 2^\ell are consecutive multiples of 22^\ell, mdmd cannot be one. So mm also cannot be a multiple of 22^\ell, and for an NN-bit signed integers nn, mnmn can not be a multiple of 2N1+2^{N-1+\ell}, given that n=2N1n = -2^{N-1} has the highest number of two’s in its factorization.

So mn2N1+\frac{mn}{2^{N-1+\ell}} is not an integer and we have
mn2N1+=mn2N1++1 \lceil \frac{mn}{2^{N - 1 + \ell}} \rceil = \lfloor \frac{mn}{2^{N - 1 + \ell}} \rfloor + 1

when dd is not a power of two. This last expression is potentially more efficient to evaluate.

So in practice we want to compute
mn2N1++1n<0 \lfloor \frac{mn}{2^{N - 1 + \ell}} \rfloor + 1_{n < 0}

when dd is not a power of two. An efficient way of adding 1n<01_{n < 0} is by subtracting n >> (N - 1). This expression will be 1-1 when nn is negative, and 00 otherwise. By swapping the operands of the subtraction, we can negate the result “for free”.

When dd is a power of two this will not work, and we need to explicitly negate the result. This can be done in a single instruction on most architectures.

Computing the product of a signed and an unsigned integer

If we set =log2(d)\ell = \lceil \log_2(d) \rceil, the value m=2N1+dm = \lceil \frac{2^{N-1+\ell}}{d} \rceil with is an NN-bit unsigned integer with the most significant bit set. When mm is even, we can use m=m2m' = \frac{m}{2} and use
mn2N1+=mn2N2+ \frac{mn}{2^{N-1+\ell}} = \frac{m'n}{2^{N-2+\ell}}

This is easier to evaluate since we can now simply compute mnm'n as a signed product.

When mm is odd, we do not have this option. Let msm_s be the value obtained by interpreting the bits of mm (represented as an unsigned integer) as a signed integer. That is, if m=k=02kmkm = \sum_{k=0} 2^k m_k, then
ms=2N1mN1+k=0N22kmk m_s = -2^{N-1}m_{N-1} + \sum_{k = 0}^{N-2} 2^k m_k

So we have m=ms+2km = m_s + 2^k so that
mn=(ms+2k)n=msn+2kn mn = (m_s + 2^k)n = m_s n + 2^k n

That is, we can use a signed product to evaluate msnm_s n, and add nn to the high word of the product. The result will be mnmn.

Examples

Example 1: Dividing by 7

For a simple example, consider the following function:

int32_t div7(int32_t n) {
	return n / 7;
}

With N=32N = 32 we find =log2(7)=3\ell = \lceil \log_2(7) \rceil = 3, m=2N1+d=245426702m = \lceil \frac{2^{N - 1 + \ell}}{d} \rceil = 245426702. This is an odd number, so we can’t simplify. Since mm is a 32-bit number, we’ll have to evaluate the high word of the product mn_hi as mn_signed_hi + n. Here, mn_signed_hi is the product of nn and mm', and mm' is the interpretation of mm as a signed word. When we interpret mm as a signed number, we get 245426702232=1840700269245426702 - 2^{32} = -1840700269, so this is what we use in the multiplication, instead of the unsigned value of mm.

We proceed by shifting right the rest of the bits. We have already shifted the product by 3232 bits, so we need to shift right by 22 additional bits. Finally, we correct the rounding by adding n >> 31.

All in all, we get

int32_t div7(int32_t n) {
	int32_t m = -1840700269;

	// evaluate the high word of m * n
	int32_t mn_signed_hi = ((int64_t)m * n) >> 32;
	int32_t mn_hi = mn_signed_hi + n;

	// shift remaining bits and correct rounding
	return (mn_hi >> 2) - (n >> 31);
}

This results in the following assembly:

div7:
	# move n to ecx
	mov ecx, DWORD PTR [esp+4]

	# move m to eax
	mov eax, -1840700269

	# move m*n to edx:eax
	imul ecx

	# add n and store in eax
	add eax, [edx+ecx]

	# shift right by two and add (n << 31)
	sar     ecx, 31
	sar     eax, 2
	sub     eax, ecx
	ret

Also, GCC sometimes prefers to use a lea instruction instead of a simple addition. This is the only difference between the assembly generated by the two C functions above.

To divide by -7 we can now simply swap the operands of the subtraction in the return statement. However, GCC seems to have difficulties generating efficient assembly for this, and the resulting assembly is definitely not equivalent to the assembly that results from compiling int32_t div7(int32_t n) { return n / -7; }.

Example 2: Dividing by 9

Now, consider the following function:

int32_t div9(int32_t n) {
	return n / 9;
}

Setting =log2(9)=4\ell = \lceil \log_2(9) \rceil = 4, we get m=2N1+9=3817748708m = \lceil \frac{2^{N-1+\ell}}{9} \rceil = 3817748708. This is even so we can decrease \ell to 33 and get m=1908874354m = 1908874354. This is again even so we can decrease \ell once more to get 22 and we get m=954437177m = 954437177.

This fits in a signed 32-bit integer and we can proceed as before, except that we can use a simple signed product to evaluate mnmn.

int32_t div9(int32_t n) {
	int32_t m = 954437177;

	// evaluate the high word of m * n
	int32_t mn_hi = ((int64_t)m * n) >> 32;

	// shift remaining bits and correct rounding
	return (mn_hi >> 2) - (n >> 31);
}

Example 3: Dividing by 4096

Consider the following function:

int32_t div4096(int32_t n) {
	return n / 4096;
}

If we divide by a power of two we simply need to add d - 1 to n to fix the rounding. Otherwise, the division is a simple shift:

int32_t div4096(int32_t) {
	if (n < 0) {
		n += 4095;
	}
	return n >> 12;
}

In practice, GCC compiles both versions to the same assembly which uses a cmovs instruction (conditional move when the sign bit is not set):

div4096:
	# move n to eax
	mov eax, DWORD PTR [esp+4]

	# add 4095 to eax if it is negative
	test eax, eax
	lea edx, [eax+4095]
	cmovs eax, edx

	# arithmetic shift right by 12 bits
	# (i.e. signed divide by 4096)
	sar eax, 12
	ret