Super-Fast Pseudo-Random Number Generator (PRNG) Implementation using SSE/SSE4

SpikeFun 0.71 contains several speed improvements that bring simulation-generation time down by 10-20%. One of the improvements is super-fast pseudorandom number generator (PRNG), written to use Intel's SSE4 instruction set (where possible). New PRNG is based on MWC1616 algorithm, which is a class of "Multiply-with-carry" (MWC) pseudorandom number generators. Author of this algorithm is George Marsaglia.

I implemented MWC1616 using SSE and, optionally, SSE4 intrinsics. I have deliberately chosen intrinsics instead of pure inline assembly code as with intrinsics, compiler has more freedom and can choose which SSE register to use for which variable - which allows efficient code inlining. Furthermore, x64 version of Microsoft's compilers do not support inline assembly code (Intel's compilers do, however).

Here is a quick performance comparison of the new generator:

Generating 2000000000 (2 billion) random numbers on a single core of the Sandy Bridge EP (Core i7 2687W) CPU:

  • SSE4 MWC1616 - 1069 ms
  • MSVC CRT rand() - 28632 ms
As it could be seen from the test above, my implementation of MWC1616 is almost 27 times faster than C-runtime library's rand() function, and it can generate almost 2 billion 32-bit random numbers per second (1.5 CPU clocks, or ~0.5 nanoseconds per number)! Furthermore, MWC1616 algorithm is actually much better than Microsoft's rand() implementation; it has a period greater than 10^18 and the statistical quality ("randomness") of the output is considerably higher than Microsoft's rand() implementation, which fails many statistical tests and provides only 16-bit of randomness (RAND_MAX)

SSE4 version takes advantage of pmullud instruction, which allows multiplication of two vectors of 32-bit integers. For Pre-SSE4 CPUs this has to be carried by breaking the calculation in two 16-bit parts and finally shifting and OR-ing the intermediate results into the final result.

You can download the code by clicking on this link

Below you can see the calculation part of the MWC1616 SSE4 implementation. As you can see, it is extremely simple, yet very fast and the pseudorandom output is of high quality (for non-cryptography and non-gambling use, of course). Code is licensed under BSD license, and you are free to use it in your projects as long as BSD license is respected.

// MWC1616 --> SSE4 version
// Copyright (C) 2012 Ivan Dimkovic
// Licensed under BSD license
static inline void FastRand_SSE4(fastrand *f)
    __m128i x = _mm_load_si128((const __m128i *)f->x);
    __m128i y = _mm_load_si128((const __m128i *)f->y);

const __m128i mask = _mm_load_si128((const __m128i *)f->mask);
const __m128i mul1 = _mm_load_si128((const __m128i *)f->mul1);
const __m128i mul2 = _mm_load_si128((const __m128i *)f->mul2);

__m128i xmask = _mm_and_si128(x, mask);
__m128i xshift = _mm_srli_epi32(x, 0x10);
__m128i xmul = _mm_mullo_epi32(xmask, mul1);
__m128i xnew = _mm_add_epi32(xmul, xshift);
_mm_store_si128((__m128i *)f->x, xnew);

__m128i ymask = _mm_and_si128(y, mask);
__m128i yshift = _mm_srli_epi32(y, 0x10);
__m128i ymul = _mm_mullo_epi32(ymask, mul2);
__m128i ynew = _mm_add_epi32(ymul, yshift);
_mm_store_si128((__m128i *)f->y, ynew);

__m128i ymasknew = _mm_and_si128(ynew, mask);
__m128i xshiftnew = _mm_slli_epi32(xnew, 0x10);
__m128i res = _mm_add_epi32(xshiftnew, ymasknew);
_mm_store_si128((__m128i *)f->res, res);