It is interesting to read the recent flurry of articles and blog posts discussing the role of General-Purpose Graphics Processing Units (GPGPUs) in machine learning, some even claiming that the co-processors will revolutionize AI.
I’ve been a longtime believer in the value of HW accelerators, both on-chip and off-chip, and even architected some during my hardware days at Sun Microsystems. It should therefore come as no surprise that Ayasdi is also embracing the potential performance benefits of accelerator cards. Indeed, Ayasdi is presenting at the upcoming GPU Technology Conference in March, and is currently working with Intel to better understand the opportunity to accelerate Topological Data Analysis (TDA) on Xeon Phi cards.
That said, I would reiterate one point made in some of the articles: don’t write off CPUs, and don’t believe that GPGPUs are a panacea. The truth, as always, is more complex. Outside of expensive, dedicated supercomputers, it will likely involve CPUs and GPGPUs effectively complementing each other.
Certainly GPGPUs are powerful, and can deliver significant improvements in compute density and even power efficiency. However, there can be a temptation to believe that CPUs simply can’t compete and that GPGPUs are a necessity.
In engineering at Ayasdi, we are adopting a more balanced approach. It recognizes the potential of accelerator cards and will make full use of them when available, but it also recognizes that modern multicore CPUs with vector instruction support are certainly no slouches. Fully exploiting CPUs and GPGPUs working in unison lets us deliver significantly improved performance, and it also ensures that Ayasdi provides a highly performant solution when deploying into environments where GPGPUs aren’t commonplace (e.g. many/most Hadoop data lakes).
Looking at typical modern server processors, each socket can contain up to 18 cores, each 2-way threaded, for a total of 36 hardware threads per socket. These, coupled with support for 256-bit vector instructions with fused multiply-add operations, provide significant compute power. Combined with large memory capacities and up to around 70GB/s of sustained off-chip bandwidth, this makes it readily possible to write optimized code that takes full advantage of these compute capabilities, even when working with out-of-core algorithms. Ayasdi’s implementations scale linearly with core count in a single system, be it composed of 1, 2 or 4 sockets.
As a more concrete example: with AVX2, each core can deliver up to 16 double-precision FLOPs per cycle. Accordingly, looking at something like the Intel E5-2699 v3, with 18 2-way threaded cores per socket running at a nominal 2.3GHz, a dual-socket server is capable of delivering over 1 TFLOP/s (18 cores x 2 sockets x 2.3GHz x 16 FLOPs per cycle is roughly 1.3 TFLOP/s of theoretical peak).
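To make the arithmetic behind that figure explicit, here is a minimal sketch of the peak calculation. The function name and the breakdown of the 16 FLOPs per cycle (4 doubles per 256-bit vector, 2 FLOPs per fused multiply-add, 2 FMA issues per cycle) are my own illustration rather than anything from a vendor tool, and the result is a theoretical peak at the nominal clock, not a measured number.

```python
def peak_dp_gflops(cores_per_socket, sockets, clock_ghz, flops_per_cycle):
    """Theoretical peak double-precision GFLOP/s for a multi-socket server."""
    return cores_per_socket * sockets * clock_ghz * flops_per_cycle


# Assumed breakdown of the 16 DP FLOPs/cycle quoted in the text:
DOUBLES_PER_VECTOR = 256 // 64   # 4 doubles fit in a 256-bit register
FLOPS_PER_FMA = 2                # one multiply plus one add
FMA_ISSUES_PER_CYCLE = 2         # assumption: two FMA units per core
FLOPS_PER_CYCLE = DOUBLES_PER_VECTOR * FLOPS_PER_FMA * FMA_ISSUES_PER_CYCLE

# Dual-socket E5-2699 v3 figures from the text: 18 cores/socket at 2.3GHz.
dual_socket_gflops = peak_dp_gflops(18, 2, 2.3, FLOPS_PER_CYCLE)

if __name__ == "__main__":
    print(f"Peak: {dual_socket_gflops / 1000:.2f} DP TFLOP/s")
```

Note that hardware threads do not appear in the formula: the two threads on a core share the same vector units, so they help keep those units busy rather than adding peak FLOPs.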
Contrast this with a recent Xeon Phi or other GPGPU, which delivers around 1.2-1.6 DP TFLOP/s per chip. Not wildly dissimilar. Accordingly, deploying a balanced solution that takes full advantage of both CPU and GPGPU resources has the potential to deliver almost double the performance/throughput of a solution that optimizes for only one.
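The "almost double" claim can be checked with a back-of-the-envelope sketch. It assumes one accelerator card paired with one dual-socket server, work that splits perfectly between the two, and no transfer or coordination overhead, so it is an upper bound rather than a prediction. The function name and the CPU figure of 1.3248 TFLOP/s (the theoretical peak of a dual-socket, 18-core, 2.3GHz, 16 FLOPs/cycle server) are illustrative.

```python
def balanced_speedup(cpu_tflops, accel_tflops):
    """Ideal speedup of using both devices over using only the faster one."""
    combined = cpu_tflops + accel_tflops
    best_single = max(cpu_tflops, accel_tflops)
    return combined / best_single


CPU_TFLOPS = 1.3248              # dual-socket peak derived from the text's numbers
ACCEL_RANGE_TFLOPS = (1.2, 1.6)  # per-chip range quoted in the text

speedups = [balanced_speedup(CPU_TFLOPS, a) for a in ACCEL_RANGE_TFLOPS]

if __name__ == "__main__":
    for accel, s in zip(ACCEL_RANGE_TFLOPS, speedups):
        print(f"accelerator at {accel} TFLOP/s -> ideal speedup {s:.2f}x")
```

Across the quoted range the ideal speedup lands between roughly 1.8x and 1.9x. It gets closest to 2x when the two devices are evenly matched, which is exactly the situation described here.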
That said, there are always algorithms where vectorization is not readily feasible, and throwing threads at the problem can be a key performance driver. However, even in these scenarios, significant effort is normally required to ensure that the active memory footprint can be configured to reside within the roughly 16GB on the accelerator card (so that data movement costs don’t outweigh the compute improvements), and that the necessary compute flows leverage the strengths of the GPGPU. Finally, there are also power consumption and density considerations to weigh when making design decisions.
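The trade-off between data movement and compute can be sketched with a very simple cost model. Everything here is hypothetical: the function names, the assumption that transfer and compute do not overlap, and especially the link bandwidth, which is an illustrative placeholder rather than a measured figure for any particular card or bus.

```python
ACCEL_MEMORY_GB = 16.0  # rough on-card capacity mentioned in the text


def fits_on_card(footprint_gb, capacity_gb=ACCEL_MEMORY_GB):
    """True if the active working set can stay resident on the accelerator."""
    return footprint_gb <= capacity_gb


def offload_pays_off(data_gb, work_tflop, cpu_tflops, accel_tflops, link_gb_per_s):
    """Compare CPU-only time with transfer-then-compute time on the card.

    Assumes no overlap between the transfer and the computation.
    """
    cpu_seconds = work_tflop / cpu_tflops
    accel_seconds = data_gb / link_gb_per_s + work_tflop / accel_tflops
    return accel_seconds < cpu_seconds


if __name__ == "__main__":
    LINK_GB_PER_S = 10.0  # hypothetical host-to-card bandwidth
    # Little work per byte moved: the transfer dominates.
    print(offload_pays_off(8.0, 1.0, 1.3248, 1.6, LINK_GB_PER_S))
    # Much more work on the same data: the faster card wins.
    print(offload_pays_off(8.0, 100.0, 1.3248, 1.6, LINK_GB_PER_S))
```

The point of the model is the shape of the result, not the specific numbers: offloading only helps when there is enough computation per byte moved to amortize the transfer, which is why keeping the working set resident on the card matters so much.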
Ultimately, GPGPUs offer a compelling path forward for compute-intensive users like Ayasdi. Even so, CPU architectures continue to offer a superb price/performance tradeoff. We think that the next few years will see blended implementations that use both. Let us know your thoughts at the bottom of the page or by emailing us at firstname.lastname@example.org.