Designing Meta’s new AI supercomputer

Any company with the same level of financial resources as Facebook parent company Meta could buy the same parts and systems it is using for its new AI supercomputer.

But connecting hundreds, and eventually thousands, of Nvidia’s DGX A100 systems together with storage systems takes the right kind of system integrator, one that can design and construct a high-performance computing cluster that meets the company’s requirements.

For its new AI Research SuperCluster, Meta turned to Penguin Computing to design and integrate the system, which Meta said will become the world’s fastest AI supercomputer later this year when the full cluster of 16,000 Nvidia A100 GPUs goes online.

In an interview with CRN US, Thierry Pellegrino, the executive who oversees Penguin Computing at parent company Smart Global Holdings, said the new AI supercomputer was the result of an “embedded collaboration” between Meta and Penguin Computing that began more than four years ago with a previous-generation AI supercomputer that used Nvidia’s older V100 GPUs.

“When you operate such a cluster, you have to understand the requirements around security, around performance, around availability and around all the unique aspects of a company like Meta that need to be taken into consideration. That’s been our value to Meta,” said Pellegrino, whose title is president and senior vice president of Smart Global Holdings’ Intelligent Platform Solutions.

Pellegrino, who previously ran Dell EMC’s HPC business, talked about how Penguin Computing designed and integrated Meta’s AI Research SuperCluster, which includes storage systems made by Pure Storage and Penguin Computing itself. He also discussed how the company overcame storage and networking bottlenecks and whether the new AI supercomputer is a sign of things to come in the larger enterprise world. The transcript was lightly edited for clarity.
