Looking back at it now, especially with the advent of massively parallel computing on GPUs, maybe the techies at Tera Computing, and then Cray, had the right idea with their massively threaded “ThreadStorm” processors and their high-bandwidth interconnects.
Because many of the neural networks created by AI frameworks are themselves graphs – the kind with vertices holding data and edges showing the relationships between that data, not the kind you make in Excel and export as a chart – maybe what we need, in the end, is a really good graph processor. Or, rather, millions of them.
Gasp! Who would say such heretical things in a world where the Nvidia GPU and its wannabes are the universal salve rubbed on – are you sure that’s where it hurts? – every modern computing problem? We would. While GPUs excel at the dense matrix, high-precision floating-point math that dominates HPC simulation and modeling, a lot of the data that underpins AI frameworks is sparse and starts out at much lower precision. And that being the case, there may be better ways to do this.
The US Defense Advanced Research Projects Agency, the research and development arm of the Department of Defense, explores such cutting-edge questions, and it has been contemplating a massively parallel, massively connected graph processor since it established the Hierarchical Identify Verify Exploit (HIVE) program back in 2017. Intel was selected to produce the HIVE processor, and MIT Lincoln Laboratory and Amazon Web Services were chosen to create and host a trillion-edge graph dataset for a system based on such processors to chew on.
At Hot Chips 2023 this week, Intel showed off the processor it created for the HIVE project, initially codenamed “Piuma” after the Programmable Integrated Unified Memory Architecture (PIUMA) that underpins it. Intel gave an update on the PIUMA chip at DARPA’s ERI Summit in August 2019, and at IEEE’s High Performance Extreme Computing 2020 event in September 2020, Intel researchers Balasubramanian Seshasayee, Joshua Fryman, and Ibrahim Hur gave a presentation entitled Hash Table Scalability On Intel PIUMA, which is behind an IEEE paywall but provides an overview of the processor; a companion paper, entitled PIUMA: Programmable Integrated Unified Memory Architecture, is not behind a paywall. Both are vague about the architecture of the PIUMA system. But this week, Intel principal engineer Jason Howard gave an update on the PIUMA processor and the systems built from it, including the silicon photonics interconnect that Intel created with Ayar Labs to link large numbers of the processors together.
In the IEEE paper, the PIUMA researchers make no secret of the fact that they were inspired by the Cray XMT line. That line culminated a decade ago in a massively threaded, shared-memory monster built precisely for graph analytics, with up to 8,192 processors, each with 128 threads running at 500 MHz, plugged into the AMD Rev F socket used by the Opteron 8000 series X86 CPUs and all lashed together with a custom “SeaStar2+” torus interconnect, yielding 1.05 million threads and 512 TB of shared memory for graphs to romp around in. To Linux, the whole thing looked like a single CPU.
What’s old is new again with the PIUMA project, though this time the processors are more modest and the interconnect is much better. And maybe the price/performance is, too. And, for the love of all that is holy in heaven, maybe Intel will commercialize this PIUMA system and actually shake something up.
Taking Smaller Bytes Of Memory
According to Howard, when Intel started designing the PIUMA chip, researchers working on the HIVE project realized that graph workloads are not just massively parallel but embarrassingly parallel, meaning there might be ways to exploit that parallelism to boost the performance of graph analytics. When running on a standard X86 processor, graph databases have very poor cache line utilization, with only 8 bytes or less of a 72-byte cache line being used more than 80 percent of the time the graph database is running. Heavy branching in the instruction flow puts a strain on the CPU pipelines, and the memory subsystem is also under a lot of pressure from long chains of dependent loads, which hammer the caches on the CPU.
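To make that concrete, here is a minimal sketch – hypothetical code, not Intel’s – of the pointer-chasing access pattern that causes the problem: each hop of a graph traversal loads a single 8-byte vertex ID from an effectively random address, so nearly all of the cache line the hardware fetches alongside it goes unused.

```c
#include <stdint.h>
#include <stdlib.h>

/* Neighbor lists in a standard compressed sparse row (CSR) layout:
 * offsets[v] .. offsets[v+1] index into the neighbors array. */
typedef struct {
    uint64_t *offsets;
    uint64_t *neighbors;   /* 8-byte vertex IDs */
} Graph;

/* Walk a random path through the graph. Each step loads one 8-byte
 * neighbor ID from an essentially random address, so an X86 chip
 * that fetches a whole cache line uses 8 of its ~72 bytes and
 * discards the rest -- the behavior Howard described. */
static uint64_t random_walk(const Graph *g, uint64_t v, int hops) {
    for (int i = 0; i < hops; i++) {
        uint64_t begin  = g->offsets[v];
        uint64_t degree = g->offsets[v + 1] - begin;
        if (degree == 0)
            break;                                    /* dead end */
        v = g->neighbors[begin + rand() % degree];    /* one 8-byte load */
    }
    return v;
}

int main(void) {
    /* Toy three-vertex cycle: 0 -> 1 -> 2 -> 0. */
    uint64_t offsets[]   = {0, 1, 2, 3};
    uint64_t neighbors[] = {1, 2, 0};
    Graph g = { offsets, neighbors };
    return (int) random_walk(&g, 0, 10);
}
```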
The PIUMA chip has some big and little ideas embedded in it. Each core has four pipelines with 16 threads each (called MTPs) and two pipelines with one thread each (called STPs) that deliver 8X the performance of any one of the MTP threads. The cores are based on a custom RISC instruction set that Howard did not identify, and neither did his researcher colleagues at Intel or at Microsoft, which has also participated in the PIUMA effort.
“All of the pipelines use a custom ISA, which is RISC-like and fixed length,” Howard explained in his Hot Chips presentation. “And each pipeline has 32 physical registers available to it. We did this so that you can easily migrate your compute threads between any of the pipelines. So maybe I start out executing on one of the multithreaded pipelines, and if I see that it is taking too long, or maybe it is the last thread available, I can quickly migrate it over to a single-thread pipeline to get better performance.”
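Howard did not describe the migration machinery itself, but the policy implied by that quote is easy to sketch. The descriptor, threshold, and helper functions below are all invented for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical descriptor for a PIUMA compute thread. */
typedef struct {
    uint64_t cycles_running;   /* time spent on a multithreaded pipeline */
    bool     on_mtp;           /* currently occupying an MTP slot?       */
} PiumaThread;

/* Invented threshold and stub helpers, for illustration only. */
#define TOO_LONG 1000000ULL
static bool stp_slot_free(void)            { return true; }  /* stub */
static void migrate_to_stp(PiumaThread *t) { (void) t; }     /* stub */

/* The policy Howard described: because every pipeline shares the
 * same ISA and 32 physical registers, a thread that has run too
 * long on an MTP -- or is the last runnable thread -- can hop over
 * to a single-thread pipeline for roughly 8X the performance. */
static void maybe_migrate(PiumaThread *t, int runnable_threads) {
    bool last_thread = (runnable_threads == 1);
    if (t->on_mtp && stp_slot_free() &&
        (last_thread || t->cycles_running > TOO_LONG)) {
        migrate_to_stp(t);
        t->on_mtp = false;
    }
}

int main(void) {
    PiumaThread t = { 2000000ULL, true };  /* long-running MTP thread */
    maybe_migrate(&t, 8);                  /* moves it to an STP      */
    return t.on_mtp ? 1 : 0;
}
```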
The STPs and MTPs are linked to each other by a crossbar, have a combined 192 KB of L1 instruction and L1 data cache, and link out to a shared 4 MB scratchpad SRAM that is simpler than an L2 cache would be.
Each PIUMA chip has eight active cores, and each core has its own custom DDR5 memory controller with an 8-byte access granularity instead of the 72-byte granularity of conventional DDR5 memory controllers – so the chip pulls in only the 8 bytes a graph traversal typically needs rather than dragging in a whole line and throwing most of it away. Each socket has 32 GB of custom DDR5-4400 memory.
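The payoff of that narrow granularity is easy to quantify with the numbers above. A quick back-of-the-envelope calculation (our arithmetic, not Intel’s):

```c
#include <stdio.h>

/* How much of the data moved from DRAM is actually useful when a
 * graph workload touches one 8-byte word per access, per the
 * figures from Howard's talk. */
int main(void) {
    const double useful    = 8.0;   /* bytes a traversal step needs       */
    const double x86_fetch = 72.0;  /* standard cache line fill, per talk */
    const double piuma     = 8.0;   /* PIUMA's custom access granularity  */

    printf("x86-style fetch: %5.1f%% of moved bytes useful\n",
           100.0 * useful / x86_fetch);   /* ~11.1% */
    printf("PIUMA fetch:     %5.1f%% of moved bytes useful\n",
           100.0 * useful / piuma);       /* 100.0% */
    return 0;
}
```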
Each core also has a pair of routers that link the cores to each other in a 2D mesh, to the eight memory controllers, and to four high-speed Advanced Interface Bus (AIB) ports. AIB is the royalty-free PHY for interconnecting chiplets that Intel announced back in 2018. There are 32 optical I/O ports hanging off the chip – eight per AIB – an add-on from Ayar Labs that provides 32 GB/sec of bandwidth in each direction per port.
Here are the details of the on-chip routers that implement the 2D mesh on the PIUMA package:
It is a ten-port, cut-through router. The 2D mesh runs at 1 GHz, and it takes four cycles to traverse each router. The router has ten virtual channels and four different message classes, which Howard says avoids any deadlocks on the network, and it delivers 64 GB/sec of bandwidth per link.
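A couple of implied figures fall out of those specs, assuming – and this is our assumption, not a disclosed spec – that the sixteen routers sit in a 4×4 grid on the die:

```c
#include <stdio.h>

/* Implied figures from the router specs: at 1 GHz, 64 GB/sec per
 * link works out to 64 bytes per cycle, and a packet pays four
 * cycles at every router it cuts through. The 4x4 grid of sixteen
 * routers is our assumption about the die layout. */
int main(void) {
    const double mesh_ghz       = 1.0;
    const double link_gb_per_s  = 64.0;
    const int    cycles_per_hop = 4;

    printf("implied link width: %.0f bytes/cycle\n",
           link_gb_per_s / mesh_ghz);

    /* Corner to corner on a 4x4 grid is six hops. */
    for (int hops = 1; hops <= 6; hops++)
        printf("%d hops: %2d cycles (%2d ns at 1 GHz)\n",
               hops, hops * cycles_per_hop, hops * cycles_per_hop);
    return 0;
}
```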
The packaging of the routers and cores on the PIUMA chip is a bit more complicated than you might expect. Take a look:
As it turns out, there are sixteen router/core sites on the die, and only eight of them have cores enabled, because the mesh on the die needs twice as many routers as cores to feed the AIBs, which in turn feed the Ayar Labs silicon photonics. The silicon photonics interconnect is used only as a physical layer, and specifically to extend the on-die network across a large number of sockets.
And when we say a large number, we mean an astonishingly huge number. Like this:
A set of sixteen of these PIUMA chips can be linked together in an all-to-all configuration on a 4×4 grid sled using those silicon photonics interconnects. Each PIUMA chip burns about 75 watts at nominal voltage and workload, which means a sled burns about 1,200 watts – more than one Xeon SP socket, but not more than three.
Building The Perfect Graph Processing Beast
The PIUMA chip has 1 TB/sec of optical interconnect bandwidth coming out of it, and beyond the links within a sled, that bandwidth can be used to connect up to 131,072 sleds together into a massive shared-memory graph processing supercomputer. The router is the network: within a sled, the sixteen sockets are all directly connected to each other, and everything outside of the sled is connected using a HyperX topology.
Let’s walk through this. A single sled with 16 sockets has 128 cores with 8,448 threads and 512 GB of memory. The first level of the HyperX network has 256 sleds, 32,768 cores, 270,336 threads, and 128 TB of memory. Step up to level two of the HyperX network and you can build a PIUMA cluster with 16,384 sleds, 2.1 million cores, 17.3 million threads, and 8 PB of shared memory. And finally, at level three of the HyperX network, you can scale up to 131,072 sleds, 16.8 million cores, 138.4 million threads, and 64 PB of shared memory.
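The per-sled figures fall straight out of the component specs given earlier, and the memory capacity at each HyperX level is just the sled count multiplied by 512 GB per sled; the per-level core and thread counts come from Intel’s presentation. Here is that arithmetic as a quick check:

```c
#include <stdio.h>

/* Rebuild the per-sled figures from the component specs: 8 cores
 * per socket, 4 MTPs x 16 threads + 2 STPs per core, 16 sockets
 * per sled, 32 GB per socket. Memory per HyperX level is just the
 * sled count times 512 GB. */
int main(void) {
    const int threads_per_core = 4 * 16 + 2;               /* 66     */
    const int cores_per_sled   = 16 * 8;                   /* 128    */
    const int threads_per_sled = cores_per_sled * threads_per_core;
    const int gb_per_sled      = 16 * 32;                  /* 512 GB */

    printf("per sled: %d cores, %d threads, %d GB\n",
           cores_per_sled, threads_per_sled, gb_per_sled);

    const long sleds[] = {256L, 16384L, 131072L};          /* levels 1-3 */
    for (int i = 0; i < 3; i++)
        printf("HyperX level %d: %6ld sleds, %8ld TB shared memory\n",
               i + 1, sleds[i], sleds[i] * gb_per_sled / 1024);
    return 0;
}
```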
Admit it. You want to see what one of these beasts can do. The US National Security Agency and the US Department of Defense, as well as their peers around the world – all of whom have funded a lot of AI research over the past fifteen years – are certainly interested, too.
While you chew on that scale for a minute, let’s consider a few more things. First is the latency of that optical network:
The PIUMA nodes are linked to each other with single-mode optical fiber, and interestingly, the achieved bandwidth of the PIUMA network design, at 16 GB/sec per direction, is just shy of the theoretical design point. It is still a bandwidth monster, though, with a theoretical one-way bisection bandwidth of 16 PB/sec across the entire HyperX network.
The PIUMA chip is implemented in Taiwan Semiconductor Manufacturing Co’s 7-nanometer FinFET process and has 27.6 billion transistors on it, with only a relatively small 1.2 billion of them dedicated to the cores. Clearly, the AIB circuits eat up a lot of the transistor count.
Here’s what the PIUMA chip package looks like:
And here’s what the package and test board look like:
So far, Intel has built two boards, each with a single PIUMA chip, and linked them together to run tests and prove to DARPA that the design works.
The question now is: how much would such a machine cost at scale? Well, at $750 per node – which is not much at all – it would be $3.1 million for a system scaled up to HyperX level one with 4,096 PIUMA chips, almost $200 million for a system with 262,144 chips at HyperX level two, and $1.57 billion for a machine with 2.1 million chips scaled up to HyperX level three.
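That pricing is simple enough to check: sixteen chips per sled, the sled counts for each HyperX level, and the hypothetical $750 per chip:

```c
#include <stdio.h>

/* Check the back-of-the-envelope pricing: sixteen chips per sled,
 * the sled counts for HyperX levels one through three, and the
 * hypothetical $750 per chip. */
int main(void) {
    const long long sleds[]          = {256, 16384, 131072};
    const long long chips_per_sled   = 16;
    const double    dollars_per_chip = 750.0;

    for (int i = 0; i < 3; i++) {
        long long chips = sleds[i] * chips_per_sled;
        printf("level %d: %7lld chips, $%10.2f million\n",
               i + 1, chips, chips * dollars_per_chip / 1e6);
    }
    return 0;  /* prints $3.07M, $196.61M, and $1,572.86M */
}
```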
As the AI boom has shown, there are dozens of companies and dozens of government agencies around the world that would not even blink at a $1 billion price tag for a system anymore. That number did not even make my pulse quicken as I wrote it and read it back.
That is the moment we are living in right now.