This fast integer square root function realizes its efficiency by requiring no multiply or divide operations, except for divide by 2 (an unsigned binary shift right). A version written using inline IA32 assembly gains even greater efficiency by melding two binary shift operations into one.
Performance varies by compiler. For example, instead of shifting the variable "dP2" and performing two subsequent OR operations, the compiler may opt to eliminate the shift and "dP2" variable, and replace "dP2" with a constant in each OR operation. If a compiler cannot recognize this opportunity when it is most efficient on the target platform, and speed is critical, please try coding the routine in assembly or find a smarter compiler. (Double checking that optimization is *enabled* sometimes helps too. =) When compiling with GCC (2.95 or 3.0), "-O2" typically yields a balanced optimization.
IA32 processors with branch prediction may suffer a performance loss with unpredictable data. Thus, I designed an IA32 assembly version that performs better on non-sequential data at the cost of some extra instructions to avoid branching.
This code uses GNU GCC inline assembly (with source to destination register order). Versions of the GCC compiler are freely available for Linux, Windows, and other IA32 platforms.
Specifically coding for the Athlon produced a function that requires roughly 25 or 30 percent less processor time than the general IA32 assembly version, or about 61 cycles per square root. Using this code inline without function call overhead yields about 52 cycles per non-sequential integer square root.
For 32-bit floating point data (with 24 significant bits), the Athlon provides specific instructions for fast results that are accurate within 1 ULP (Unit in the Last Place).
In the year 2000, I wrote an integer square root function for a prime number generator. The function was neither succinct, nor efficient, so I dedicated a night to create a new isqrt function. The initial target was a Motorola 68000 style processor, hence the emphasis on avoiding multiply, divide, and long shift operations. The algorithm also performs respectably on an IA32 style processor.
Feel free to send e-mail if you have any comments, or if you find interesting platform specific optimizations. Messages with relevant subject lines have greater chances of gaining my attention.