SpikeFun 0.71 contains several speed improvements that bring simulation-generation time down by 10-20%. One of the improvements is super-fast pseudorandom number generator (PRNG), written to use Intel's SSE4 instruction set (where possible). New PRNG is based on MWC1616 algorithm, which is a class of "Multiply-with-carry" (MWC) pseudorandom number generators. Author of this algorithm is George Marsaglia.
I implemented MWC1616 using SSE and, optionally, SSE4 intrinsics. I have deliberately chosen intrinsics instead of pure inline assembly code as with intrinsics, compiler has more freedom and can choose which SSE register to use for which variable - which allows efficient code inlining. Furthermore, x64 version of Microsoft's compilers do not support inline assembly code (Intel's compilers do, however).
Here is a quick performance comparison of the new generator:
Generating 2000000000 (2 billion) random numbers on a single core of the Sandy Bridge EP (Core i7 2687W) CPU:
Below you can see the calculation part of the MWC1616 SSE4 implementation. As you can see, it is extremely simple, yet very fast and the pseudorandom output is of high quality (for non-cryptography and non-gambling use, of course). Code is licensed under BSD license, and you are free to use it in your projects as long as BSD license is respected.
//
// MWC1616 --> SSE4 version
// Copyright (C) 2012 Ivan Dimkovic
// Licensed under BSD license
//////////////////////////////////////////////
static inline void FastRand_SSE4(fastrand *f)
{
__m128i x = _mm_load_si128((const __m128i *)f->x);
__m128i y = _mm_load_si128((const __m128i *)f->y);
const __m128i mask = _mm_load_si128((const __m128i *)f->mask);
const __m128i mul1 = _mm_load_si128((const __m128i *)f->mul1);
const __m128i mul2 = _mm_load_si128((const __m128i *)f->mul2);
__m128i xmask = _mm_and_si128(x, mask);
__m128i xshift = _mm_srli_epi32(x, 0x10);
__m128i xmul = _mm_mullo_epi32(xmask, mul1);
__m128i xnew = _mm_add_epi32(xmul, xshift);
_mm_store_si128((__m128i *)f->x, xnew);
__m128i ymask = _mm_and_si128(y, mask);
__m128i yshift = _mm_srli_epi32(y, 0x10);
__m128i ymul = _mm_mullo_epi32(ymask, mul2);
__m128i ynew = _mm_add_epi32(ymul, yshift);
_mm_store_si128((__m128i *)f->y, ynew);
__m128i ymasknew = _mm_and_si128(ynew, mask);
__m128i xshiftnew = _mm_slli_epi32(xnew, 0x10);
__m128i res = _mm_add_epi32(xshiftnew, ymasknew);
_mm_store_si128((__m128i *)f->res, res);
}