04-18-2016
Mysticial
Re: Math turns benchmark: y-cruncher meets HWBOT

I was going through the submissions and I noticed a number of multi-socket systems that have seemingly terrible performance. (Especially that 4-socket Magny-Cours Opteron.)

I'll go ahead and explain why this is the case. It will probably be obvious to those of you who are familiar with the topic.

-----

Why does y-cruncher (sometimes) suck on multi-socket systems?

This is due to memory access. Specifically, Non-Uniform Memory Access (NUMA).

y-cruncher can only run efficiently when the following assumption is true:
  • Every core/processor has fast access to all the memory.
This is true for all single-socket systems as well as some of the pre-Nehalem dual-socket servers. But it is not true on modern multi-socket systems.

On multi-socket systems, each processor socket has its own set of memory banks. A processor has fast access to its own set of memory. But if it needs to access memory that's elsewhere (on a different socket), it needs to go over the interconnect to get it from the other processor. So it's a lot slower.

In other words, the assumption that is critical to y-cruncher's performance is no longer valid. Some memory is faster, and some memory is really slow - hence "Non-Uniform Memory Access". If you have two sockets, half the memory will be fast and the other half slow. If you have a lot of sockets, then the vast majority of the memory will be slow with respect to each individual processor.
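To put rough numbers on that, here is a minimal sketch. It assumes the memory is split evenly across the sockets and that a processor touches all of it equally often, which is a simplification. The function name `remote_fraction` is just made up for illustration.

```python
from fractions import Fraction


def remote_fraction(sockets: int) -> Fraction:
    """Fraction of total memory that is remote (slow) for any one socket.

    Assumption: memory is divided evenly, so each socket owns 1/sockets
    of it and everything else sits behind the interconnect.
    """
    if sockets < 1:
        raise ValueError("need at least one socket")
    return Fraction(sockets - 1, sockets)


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(f"{n} socket(s): {remote_fraction(n)} of memory is remote")
```

With two sockets this gives 1/2, and with four sockets it gives 3/4, which is the "vast majority" case described above.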

If you think that's bad, get ready for more bad news.

Operating systems are aware of NUMA, so they try to be smart about it. When a program allocates memory, the OS places it on the node of the core that asked for it. This maximizes locality so that memory accesses stay within the same NUMA node. While this sounds reasonable for most applications, it actually backfires for y-cruncher. Unlike most programs, y-cruncher wants to use the entire system.

Some of you might have noticed that y-cruncher's memory usage is static throughout the entire computation. What's happening is that y-cruncher allocates all the memory it needs upfront and reuses it throughout the computation. And that's where the problem is. That allocation is done by a single thread, so the OS will put all of it on one NUMA node.

During the computation, y-cruncher spawns threads that run on all the cores and all the sockets/NUMA nodes. Since all the memory is on one socket, all processors from all the sockets will hammer that one socket. Not only does this overload the memory bandwidth of that node, it also swamps the interconnect (QPI on Intel, HyperTransport on the Opterons) going in and out of that socket. Meanwhile, all the memory on the other nodes sits idle. In other words, it's a massive traffic jam where everybody tries to park in one garage while there are 3 others that are empty.

This is why the performance sucks on those quad-Opteron servers. It affects Intel machines as well, but to a lesser degree since they seem to have better interconnects.

What can you do about it?

The biggest problem is the traffic imbalance. If your BIOS has an option to disable NUMA (often called node interleaving), then use it. This doesn't actually disable NUMA, since NUMA is a physical property of the machine, but it tricks the OS into thinking there's no NUMA, so memory allocations end up spread across all the nodes.

In Linux, you have a bit more control. The numactl package lets you run a program with interleaved memory. This also spreads out the memory across the nodes.
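As a rough sketch of what that looks like: numactl's `--interleave=all` option asks for the program's pages to be spread round-robin across all nodes, so from a shell the command is `numactl --interleave=all ./y-cruncher`. The snippet below only builds and prints that command line. The helper name `interleaved_command` is made up for illustration, and `./y-cruncher` is an assumed path to the binary, so adjust it to wherever yours lives.

```python
import shlex
import shutil


def interleaved_command(program, *args):
    """Build a command line that runs `program` under numactl with
    memory interleaved across all NUMA nodes."""
    return ["numactl", "--interleave=all", program, *args]


if __name__ == "__main__":
    # "./y-cruncher" is an assumed location for the binary.
    cmd = interleaved_command("./y-cruncher")
    print(shlex.join(cmd))
    if shutil.which("numactl") is None:
        print("numactl was not found on PATH; install it before running this.")
```

To actually launch the run you could pass the list to `subprocess.run(cmd)`, or just type the printed command into a terminal.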

These tweaks will help y-cruncher run faster, but they don't completely solve the NUMA problem. There's still the latency problem, and even when the interconnect traffic is balanced out, it will still be a bottleneck.
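A toy model makes the distinction concrete. This is not how y-cruncher or any OS works internally. It just assumes 4 nodes, 1024 pages, and that the threads on every node touch every page exactly once. All names in it are made up for illustration.

```python
def simulate(num_nodes, num_pages, home_of):
    """Count page accesses served by each node, and how many were remote.

    Toy assumption: the threads on every node touch every page once.
    home_of(page) returns the node whose memory holds that page.
    """
    served = [0] * num_nodes
    remote = 0
    for cpu_node in range(num_nodes):
        for page in range(num_pages):
            home = home_of(page)
            served[home] += 1
            if home != cpu_node:
                remote += 1  # this access has to cross the interconnect
    return served, remote


NODES, PAGES = 4, 1024

# Everything allocated by one thread, so every page lands on node 0.
single_node = simulate(NODES, PAGES, lambda page: 0)

# Pages spread round-robin across the nodes (interleaved).
interleaved = simulate(NODES, PAGES, lambda page: page % NODES)

if __name__ == "__main__":
    print("single node:", single_node)
    print("interleaved:", interleaved)
```

In this model, interleaving evens out the per-node load (1024 accesses each instead of 4096 on one node), but the number of remote accesses stays at 3072 out of 4096 either way. The traffic jam goes away, but the trips over the interconnect don't.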

Solving the NUMA problem can only be done by redesigning the program. That's obviously beyond the scope of benchmarking.

That said, it doesn't mean you should avoid multi-socket systems. A high-end dual-socket machine that is properly configured will still beat out all the single-socket setups - LN2 or not.

What makes y-cruncher different from programs like wPrime?

y-cruncher actually needs to use memory - and a lot of it. (Not that I needed to say that.)