Arm processors in servers have gone from failed starts (Calxeda) to modest successes (ThunderX2) to real contenders (ThunderX3, Ampere). Now, details have emerged about Japanese IT giant Fujitsu’s Arm processor, which it claims will offer better HPC performance than Nvidia GPUs but at a lower power cost.

Fujitsu is developing the A64FX, a 48-core Armv8 derivative specifically engineered for high-performance computing (HPC). Rather than design general-purpose compute cores, Fujitsu has added compute engines aimed at artificial intelligence, machine learning, and other workloads particular to HPC.

It will go in a new supercomputer called Fugaku, or Post-K. Post-K is a reference to the K supercomputer, at one time the fastest supercomputer in the world, which ran on custom Sparc chips before RIKEN Lab, where it was installed, pulled the plug.

Fujitsu has revealed some new details, and they are impressive. The A64FX is a major departure from traditional processor design. Instead of the chiplet design of the AMD Epyc and some Xeons, it is a single monolithic die. More important, there are four stacks of High Bandwidth Memory 2 (HBM2), an expensive but very fast memory used only in high-end systems, connected directly to the CPU. Two 8GB stacks are placed on each side of the CPU.

Prototypes of the A64FX motherboard reveal it has no DIMM sockets, where an Intel or AMD motherboard will have up to a dozen for each CPU. That’s because the A64FX carries its HBM2 memory inside the processor package, for 32GB per CPU.

In HPC, memory bandwidth has long been the bottleneck, and data-intensive workloads like analytics, simulations, and machine learning are slowing systems down. And much more power, up to 100 times as much, is used in moving data around in HPC than in actually processing it.
So to achieve energy efficiency, data needs to move as little as possible.

So the A64FX has a totally different design than your standard Arm or x86 chip. There is no DIMM-based system memory, just 32GB per processor of extremely fast memory directly connected to the chip via a high-speed interconnect instead of through a much slower memory bus. This will greatly reduce latency between CPU and memory and also reduce power, because data doesn’t have to be moved in and out of memory sockets.

The 48 cores of the A64FX function like a GPU in that they are connected by a very fast interconnect called Tofu, which was first used in the K supercomputer and has been advanced in the A64FX. Tofu is designed for energy efficiency and low latency. The A64FX is capable of 3Tflops of peak performance while being 10 times more power efficient than an x86 processor.

A Fugaku prototype took the number-one spot on the Green500 list, a ranking of the most energy-efficient supercomputers published by the same group that does the Top500 supercomputer list. And that’s a prototype, not a finished design.

In early benchmarks, Fujitsu claims the A64FX trounces the Xeon Platinum, Intel’s top of the line, and is competitive with Nvidia’s Volta line of HPC GPUs. However, that’s not final silicon, and I always wait for third-party benchmarks.

So why should you care? Because Fujitsu struck a deal with Cray to make HPC servers that use the A64FX and are sold under the Cray brand name. Cray has since been bought out by Hewlett Packard Enterprise (HPE), so HPE will be peddling not one but two Arm-based server lines: its more mainstream Project Moonshot servers and the A64FX systems.

And there is a long history of technologies starting in HPC and slowly mainstreaming, from GPU computing to liquid cooling to modular server design.
There’s no reason the A64FX can’t go mainstream too and bring AI, ML, and other high-performance tasks to more than just supercomputing facilities.

The HBM2/no-DIMMs approach is a massive twist on system memory, and I am really curious to see if Intel and AMD follow.
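
For readers who want to sanity-check the memory numbers, here is a rough back-of-the-envelope sketch in Python. The stack count, the per-stack capacity, and the 100-to-1 ratio are the figures quoted above; the function names and the "cut data movement in half" scenario are my own illustrative assumptions, not anything Fujitsu has published.

```python
# Figures quoted in the article.
HBM2_STACKS_PER_CPU = 4        # four HBM2 stacks around the CPU
GB_PER_STACK = 8               # 8GB each
MOVE_TO_COMPUTE_RATIO = 100.0  # "up to 100 times as much" power moving data as processing it


def hbm2_capacity_gb(stacks: int = HBM2_STACKS_PER_CPU,
                     gb_per_stack: int = GB_PER_STACK) -> int:
    """Total in-package memory per CPU, in GB."""
    return stacks * gb_per_stack


def data_movement_share(ratio: float = MOVE_TO_COMPUTE_RATIO) -> float:
    """Fraction of total energy spent moving data when movement
    costs `ratio` times as much as the computation itself."""
    return ratio / (ratio + 1.0)


def total_energy_saving(movement_cut: float,
                        ratio: float = MOVE_TO_COMPUTE_RATIO) -> float:
    """Fraction of total energy saved if data-movement energy is reduced
    by `movement_cut` (0.0 to 1.0). The cut is a hypothetical input."""
    return movement_cut * data_movement_share(ratio)


if __name__ == "__main__":
    print(f"HBM2 per CPU: {hbm2_capacity_gb()} GB")
    print(f"Energy spent moving data: {data_movement_share():.1%}")
    # Hypothetical scenario: halve the energy spent on data movement.
    print(f"Total saving from a 50% cut: {total_energy_saving(0.5):.1%}")
```

At the worst-case ratio, data movement accounts for roughly 99% of the energy budget, which is why trimming it pays off far more than making the cores themselves faster.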