SpikeFun 0.67 - A matter of (memory) bandwidth...

Click here to download SpikeFun...

SpikeFun v0.67 comes with a built-in benchmark tool (SpikeBench) that can be used to assess the CPU and (especially) memory subsystem performance of a desktop, workstation or server PC. Simulating large-scale, biologically inspired spiking neural networks is extremely tough on system memory and CPU, which makes it a useful benchmark.

To aid benchmarking and extend the results to memory I/O, SpikeFun now also supports access to the PMU hardware registers found in newer Intel(R) CPUs, such as those with the microarchitecture codenames 'Nehalem / Nehalem EP / Nehalem EX', 'Westmere / Westmere EP / Westmere EX' and 'Sandy Bridge / Sandy Bridge EP - Jaketown'.

By using the PMU registers it is possible to directly measure the read and write memory bandwidth, as reported by the integrated memory controller (IMC) located in the so-called 'uncore' part of the CPU package. Additional useful information, such as energy consumption, IPC and instructions retired, can be obtained from the cores themselves. If your CPU has a performance monitoring unit and access to it is enabled, SpikeFun will also display and log this information during the benchmark. On NUMA systems, such as the test system I use, SpikeFun will display the information for each CPU package. In upcoming versions I will also add the ability to log QPI traffic (I have trouble testing this, as the ASUS Z9PE-D8 WS motherboard BIOS does not configure the QPI LL counter).
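To make these derived quantities concrete, here is a minimal Python sketch of the arithmetic that turns raw counter deltas into IPC and memory bandwidth. It is not SpikeFun's or PCM's code: the `CounterDelta` type, its field names and the sample values are hypothetical, and it assumes that each read or write (CAS) command counted by the memory controller moves one 64-byte cache line.

```python
from dataclasses import dataclass

# Assumption: every CAS command counted by the IMC transfers one 64-byte cache line.
CACHE_LINE_BYTES = 64
GIB = 2**30


@dataclass
class CounterDelta:
    """Hypothetical counter differences between two samples of one CPU package."""
    instructions_retired: int
    core_cycles: int
    imc_cas_reads: int
    imc_cas_writes: int
    elapsed_seconds: float


def ipc(d: CounterDelta) -> float:
    """Instructions retired per core clock cycle."""
    return d.instructions_retired / d.core_cycles


def read_bandwidth_gib_s(d: CounterDelta) -> float:
    """Read traffic seen by the memory controller, in GiB/s."""
    return d.imc_cas_reads * CACHE_LINE_BYTES / d.elapsed_seconds / GIB


def write_bandwidth_gib_s(d: CounterDelta) -> float:
    """Write traffic seen by the memory controller, in GiB/s."""
    return d.imc_cas_writes * CACHE_LINE_BYTES / d.elapsed_seconds / GIB


# Made-up sample values, chosen only to keep the arithmetic easy to follow.
sample = CounterDelta(
    instructions_retired=710_000_000,
    core_cycles=1_000_000_000,
    imc_cas_reads=2**24,
    imc_cas_writes=2**23,
    elapsed_seconds=1.0,
)

print(f"IPC:   {ipc(sample):.2f}")
print(f"Read:  {read_bandwidth_gib_s(sample):.2f} GiB/s")
print(f"Write: {write_bandwidth_gib_s(sample):.2f} GiB/s")
```

On a NUMA machine the same calculation would simply be repeated for each CPU package.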

Please note that not all recent Intel CPUs support memory bandwidth measurements. Typically, this feature is supported by workstation and server-class processors (such as those that fit in LGA 1366, LGA 1567 and LGA 2011 sockets) and is not present in desktop-class processors (such as those that fit in LGA 1155 / 1156 sockets).

Performance monitoring on Intel systems is done using Intel's Performance Counter Monitor (PCM) library, which can be downloaded in source code form from here: http://software.intel.com/en-us/articles/intel-performance-counter-monitor/ (for licensing/copyright info please refer to the benchmark.pdf document in the SpikeFun download package).

So, let's see how the small benchmark looks on the reference dual-CPU Intel Xeon E5 2687 system, with DDR3 RAM running faster than the officially supported speed (2133 MHz):


Looking at the picture above, some interesting observations can be made:

  • SpikeFun nicely saturates the memory bandwidth of the system (~64 GiB/s traffic, theoretical maximum for quad-channel DDR3 2133 MHz: 68.2 GiB/s). This is possible thanks to the use of the 256-bit AVX instruction set in the synaptic processing code.
  • Performance fluctuates with the network behavior in delta rhythm, indicating that it is directly correlated with the number of spiked neurons at any given moment. This is because spiked neurons require further processing.
  • The IPC rate of the CPUs is very low (just ~0.71, while the theoretical maximum on the Intel Sandy Bridge EP architecture is 4), indicating that the CPUs are mostly waiting for data.
  • CPU time spent in the C0 state (C0 residency) is ~90% on this system, indicating that the CPUs do not have enough work to do.
  • The average real-time factor achieved is 0.494x, which can be used to estimate the memory bandwidth needed for real-time performance in this simulation: ~130 GiB/s.
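
The two bandwidth figures in the list above can be reproduced with a few lines of arithmetic. The Python sketch below is only an illustration: the function names are made up, the peak figure assumes a 64-bit (8-byte) data bus per DDR3 channel, and the real-time estimate assumes the simulation is purely bandwidth-bound, so that the required bandwidth scales inversely with the real-time factor. Note that the peak calculation gives about 68.2 billion bytes per second (decimal units) for one four-channel memory controller, which appears to be where the 68.2 figure comes from.

```python
def peak_bandwidth_bytes_per_s(transfers_per_s: float, channels: int,
                               bus_bytes: int = 8) -> float:
    """Theoretical peak of one memory controller: transfers/s x bus width x channels."""
    return transfers_per_s * bus_bytes * channels


def required_bandwidth(measured: float, real_time_factor: float) -> float:
    """Bandwidth needed for real-time speed, assuming a purely bandwidth-bound run."""
    return measured / real_time_factor


# DDR3 at 2133 million transfers per second, quad-channel, 8 bytes per transfer.
peak_gb_s = peak_bandwidth_bytes_per_s(2133e6, channels=4) / 1e9

# Figures quoted in the list above: ~64 GiB/s measured at a real-time factor of 0.494x.
required_gib_s = required_bandwidth(64.0, 0.494)

print(f"Theoretical peak:       {peak_gb_s:.1f} GB/s")
print(f"Needed for real time:   {required_gib_s:.0f} GiB/s")
```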

So, as the figures above show, DDR3 2133 MHz in quad-channel mode is still too slow! Basically, the CPUs are too often sitting idle, waiting for data to be delivered from DRAM by the IMC. In the next few weeks I will be testing DDR3 2400 MHz, which should improve things considerably for this system.

Interestingly, this exercise also shows how demanding even basic biological simulations are: for a small network of 32768 neurons and 1.8 million synapses (with 2 receptors each - AMPA/NMDA or GABAa/GABAb) we need ~130 GiB/s of memory bandwidth in order to achieve real-time performance!

Think about it - a typical cat has approx. 300 million neurons in the cortex, with trillions of synapses! And to model those accurately we might (probably) need more complex models than the one currently implemented in SpikeFun... now this would be a very expensive hardware purchase, which is currently quite above my budget.
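
As a rough back-of-envelope illustration of that scale, the Python sketch below extrapolates the ~130 GiB/s estimate linearly with synapse count. Linear scaling is an assumption made purely for illustration and was not measured here, and the target of one trillion synapses is only a hypothetical stand-in for "trillions"; the names in the code are made up.

```python
# Figures quoted in this post for the small benchmark network.
MEASURED_SYNAPSES = 1.8e6      # synapses in the benchmark network
REQUIRED_GIB_S = 130.0         # estimated bandwidth for real-time performance
SYSTEM_GIB_S = 64.0            # traffic sustained by the reference system


def scaled_bandwidth_gib_s(target_synapses: float) -> float:
    """Naive estimate: assumes bandwidth grows linearly with the number of synapses."""
    return REQUIRED_GIB_S * target_synapses / MEASURED_SYNAPSES


# Hypothetical target: one trillion synapses (a lower bound for "trillions").
target_synapses = 1e12
needed_gib_s = scaled_bandwidth_gib_s(target_synapses)
equivalent_systems = needed_gib_s / SYSTEM_GIB_S

print(f"Bandwidth needed:   {needed_gib_s:.3g} GiB/s")
print(f"Reference systems:  {equivalent_systems:.3g}")
```

Even under these generous assumptions the result lands in the range of tens of millions of GiB/s, or on the order of a million machines like the reference system.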

I don't even want to mention the number of neurons and synapses in the human brain... hardware to simulate that in real time would probably drain the budgets of the G20 economies put together, and you'd still be off by a large margin... It is pretty obvious that tackling this challenge with the von Neumann architecture (that is, practically all computers made today) would be a rather expensive option, due to the inefficiency caused by the bus bottleneck.