Meta is betting big on AI with custom chips — and a supercomputer

At a virtual event this morning, Meta pulled back the curtain on its efforts to develop in-house infrastructure for AI workloads, including the generative AI that underpins its recently launched ad design and creation tools.

It was something of a show of strength from Meta, which has historically been slow to adopt AI-friendly hardware systems, hampering its ability to keep pace with rivals like Google and Microsoft.

“Building our own [hardware] capabilities gives us control over every layer of the stack, from data center design to training frameworks,” said Alexis Bjorlin, VP of Infrastructure at Meta. “This level of vertical integration is needed to push the boundaries of AI research at scale.”

Over the past decade, Meta has spent billions of dollars recruiting top data scientists and building new kinds of AI, including AI that now powers the discovery engines, moderation filters, and ad recommenders found in its apps and services. But the company has struggled to turn many of its more ambitious AI research innovations into products, particularly in generative AI.

Until 2022, Meta largely ran its AI workloads with a combination of CPUs — typically less efficient than GPUs for that kind of task — and a custom chip designed to accelerate AI algorithms. Meta pulled the plug on a large-scale rollout of the custom chip, which was scheduled for 2022, and instead placed orders for billions of dollars worth of Nvidia GPUs that required major redesigns of several of its data centers.

In an effort to turn things around, Meta made plans to develop a more ambitious in-house chip, due out in 2025, that can both train and run AI models. That chip was the main topic of today’s presentation.

Meta calls the new chip the Meta Training and Inference Accelerator, or MTIA for short, and describes it as part of a “family” of chips for accelerating AI training and inferencing workloads. (“Inference” refers to running a trained model.) The MTIA is an ASIC, a kind of chip that combines several circuits on a single board, allowing it to be programmed to carry out one or more tasks in parallel.
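The training/inference distinction the article draws is worth making concrete: inference is just a fixed forward pass through an already-trained model, with no gradients or weight updates. A minimal sketch in Python — the weights here are arbitrary placeholders, not anything from Meta:

```python
import numpy as np

# Arbitrary "already-trained" weights for a tiny one-layer classifier.
# (Placeholder values for illustration only.)
W = np.array([[0.2, -0.5],
              [0.8,  0.1],
              [-0.3, 0.4]])   # 3 input features -> 2 classes
b = np.array([0.1, -0.1])

def infer(x):
    """Inference: a single fixed forward pass -- no gradients, no weight updates."""
    logits = x @ W + b
    return int(np.argmax(logits))  # index of the highest-scoring class

print(infer(np.array([1.0, 0.0, 0.0])))  # -> 0
```

An inference accelerator like the MTIA is optimized for exactly this read-only pattern, which is why it can trade flexibility for efficiency.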


An AI chip Meta custom designed for AI workloads. Image Credits: Meta

“To make our critical workloads more efficient and perform better, we needed a custom solution designed together with the model, software stack, and system hardware,” continued Bjorlin. “This provides a better experience for our users across services.”

Custom AI chips are increasingly the name of the game among Big Tech players. Google has created a processor called the TPU (short for “tensor processing unit”) to train large generative AI systems like PaLM-2 and Imagen. Amazon offers proprietary chips to AWS customers, both for training (Trainium) and inference (Inferentia). And Microsoft is reportedly working with AMD to develop an internal AI chip called Athena.

Meta says it created the first generation of the MTIA — MTIA v1 — in 2020, built on a 7-nanometer process. It can scale beyond its internal 128MB of memory to a maximum of 128GB, and in a benchmark test designed by Meta — which should be taken with a grain of salt, of course — the company claims the MTIA handled “low-complexity” and “medium-complexity” AI models more efficiently than a GPU.

Work still needs to be done in the memory and networking areas of the chip, says Meta, which become bottlenecks as AI models grow in size, requiring workloads to be split across several chips. (Not coincidentally, Meta recently acquired an Oslo-based team building AI networking technology from British chip unicorn Graphcore.) And for now, the MTIA’s focus is strictly on inference — not training — for “recommendation workloads” across Meta’s family of apps.
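Splitting a workload across several chips typically means sharding the model’s weights across devices and combining partial results over the interconnect — which is why memory and networking become the bottlenecks. A simplified tensor-parallel sketch, with NumPy arrays standing in for per-chip memory:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)        # activation vector
W = rng.standard_normal((8, 4))   # weight matrix too large for one (pretend) chip

# Shard the weights column-wise across two "chips".
W_chip0, W_chip1 = np.hsplit(W, 2)

# Each chip computes its slice locally...
y0 = x @ W_chip0
y1 = x @ W_chip1

# ...then the partial outputs are gathered over the interconnect.
y = np.concatenate([y0, y1])

# Sharded result matches the single-chip result exactly.
assert np.allclose(y, x @ W)
```

The local matrix multiplies are cheap; the gather step is network traffic, and it scales with model size — hence the interest in acquiring networking expertise.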

But Meta emphasized that the MTIA, which it continues to refine, “significantly” increases the company’s efficiency in terms of performance per watt when running recommendation workloads — in turn allowing Meta to run more “enhanced” and “advanced” (ostensibly) AI workloads.
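“Performance per watt” is simply throughput divided by power draw, and it is the metric where custom silicon tends to win even when raw speed doesn’t. A quick illustration — the figures below are invented for the example, not Meta’s numbers:

```python
# Hypothetical figures for illustration -- not actual MTIA or GPU specs.
chips = {
    "gpu":  {"inferences_per_s": 10_000, "power_w": 300},
    "asic": {"inferences_per_s": 6_000,  "power_w": 25},
}

for name, c in chips.items():
    perf_per_watt = c["inferences_per_s"] / c["power_w"]
    print(f"{name}: {perf_per_watt:.1f} inferences/s per watt")

# The ASIC is slower in absolute terms here (6,000 vs 10,000 inferences/s),
# yet far more efficient per watt (240.0 vs ~33.3) -- the trade-off that
# custom inference silicon aims for at data center scale.
```

At the scale of Meta’s data centers, per-watt efficiency translates directly into power and cooling costs, which is why the company leads with that metric rather than raw throughput.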

A supercomputer for AI

Perhaps one day Meta will delegate most of its AI workloads to banks of MTIAs. But for now, the social network relies on the GPUs in its research-focused supercomputer, the Research SuperCluster (RSC).

First unveiled in January 2022, the RSC was put together in collaboration with Penguin Computing, Nvidia, and Pure Storage, and has completed the second phase of its buildout. Meta says it now contains a total of 2,000 Nvidia DGX A100 systems with 16,000 Nvidia A100 GPUs.

So why build an internal supercomputer? First, there is peer pressure. Several years ago, Microsoft made a big to-do about its AI supercomputer built in partnership with OpenAI, and more recently said it would partner with Nvidia to build a new AI supercomputer in the Azure cloud. Elsewhere, Google has touted its own AI-focused supercomputer, which has 26,000 Nvidia H100 GPUs, putting it ahead of Meta’s.


Meta’s supercomputer for AI research. Image Credits: Meta

But beyond keeping up with the Joneses, Meta says the benefit of the RSC is that it allows its researchers to train models using real-world examples from Meta’s production systems. That’s a departure from the company’s previous AI infrastructure, which used only open source and publicly available datasets.

“The RSC AI supercomputer is being used to push the boundaries of AI research in several domains, including generative AI,” said a Meta spokesperson. “It’s really about AI research productivity. We wanted to provide AI researchers with a state-of-the-art infrastructure that allows them to develop models and give them a training platform to advance AI.”

At peak, the RSC can reach nearly 5 exaflops of compute, which the company claims makes it among the fastest in the world. (Lest that impress too much, it’s worth noting that some experts view exaflops performance figures with a grain of salt, and that the RSC is far outpaced by many of the world’s fastest supercomputers.)
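For a sense of scale, an exaflop is 10^18 floating-point operations per second, so “nearly 5 exaflops” is roughly 5 quintillion operations every second. A back-of-the-envelope comparison (the laptop figure is a generic assumption, not a measured benchmark):

```python
EXA = 10**18

rsc_peak_flops = 5 * EXA        # ~5 exaflops, per Meta's claim

# How long would one second of RSC peak compute take a hypothetical
# 100-gigaflop laptop? (Laptop figure is an illustrative assumption.)
laptop_flops = 100 * 10**9
seconds = rsc_peak_flops / laptop_flops   # 5e7 seconds

years = seconds / (365 * 24 * 3600)
print(f"{years:.1f} years")   # roughly a year and a half of laptop time
```

That gap is what makes training large models feasible in weeks rather than lifetimes — though, as the article notes, peak figures rarely reflect sustained real-world throughput.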

Meta says it used the RSC to train LLaMA, a tortured acronym for “Large Language Model Meta AI” — a large language model that the company shared as a “gated release” with researchers earlier this year (and which subsequently leaked to various internet communities). The largest LLaMA model was trained on 2,048 A100 GPUs, Meta says, which took 21 days.
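Meta’s own figures imply a substantial compute bill: 2,048 GPUs running for 21 days works out to over a million GPU-hours. The arithmetic, with an illustrative (assumed, not quoted) cloud rental rate attached:

```python
gpus = 2048
days = 21

gpu_hours = gpus * days * 24
print(f"{gpu_hours:,} GPU-hours")          # 1,032,192 GPU-hours

# At a hypothetical $2/GPU-hour cloud rate (illustrative assumption only):
print(f"~${gpu_hours * 2:,} of rented compute")   # ~$2,064,384
```

Numbers like these are a large part of the case for owning the hardware: at this scale, rented compute and in-house infrastructure diverge quickly in cost.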

“Building our own supercomputing capabilities gives us control over every layer of the stack, from data center design to training frameworks,” the spokesperson added. “RSC will help Meta’s AI researchers build new and better AI models that can learn from trillions of examples; work across hundreds of different languages; seamlessly analyze text, images and video together; develop new augmented reality tools; and much more.”

Video transcoder

In addition to MTIA, Meta is developing another chip to handle certain types of computing workloads, the company revealed at today’s event. The chip, called the Meta Scalable Video Processor, or MSVP, is Meta’s first in-house developed ASIC solution designed for the processing needs of video-on-demand and live streaming.

Meta began work on custom server-side video chips years ago, readers may recall, announcing an ASIC for video transcoding and inference work in 2019. The MSVP is the fruit of some of those efforts, as well as a renewed push for a competitive edge specifically in live video.

“On Facebook alone, people spend 50% of their time on the app watching video,” Meta’s Harikrishna Reddy and Yunqing Chen wrote in a co-authored blog post published this morning. “To accommodate the wide variety of devices around the world (mobile devices, laptops, TVs, etc.), MSVP is programmable and scalable, and can be configured to efficiently support both the high-quality transcoding needed for VOD and the low latency and faster processing times that live streaming requires.”

Meta video chip

Meta’s custom chip designed to accelerate video workloads such as streaming and transcoding. Image Credits: Meta

Meta says it plans to eventually move most of its “stable and mature” video processing workloads to the MSVP and use software video encoding only for workloads that require specific customization and “significantly” higher quality. Work continues to improve video quality with MSVP using pre-processing methods such as smart denoising and image enhancement, says Meta, as well as post-processing methods such as artifact removal and super resolution.
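Serving video to “the wide variety of devices around the world” typically means transcoding each upload into an adaptive-bitrate ladder — several renditions at different resolutions and bitrates, from which the player picks based on the viewer’s device and connection. A schematic sketch; the renditions and bitrates below are generic examples, not Meta’s actual production ladder:

```python
# Generic adaptive-bitrate ladder -- example values, not Meta's settings.
LADDER = [
    {"name": "1080p", "width": 1920, "height": 1080, "bitrate_kbps": 5000},
    {"name": "720p",  "width": 1280, "height": 720,  "bitrate_kbps": 3000},
    {"name": "480p",  "width": 854,  "height": 480,  "bitrate_kbps": 1200},
    {"name": "240p",  "width": 426,  "height": 240,  "bitrate_kbps": 400},
]

def plan_transcodes(source_height):
    """Only produce renditions at or below the source resolution --
    upscaling a 720p upload to 1080p would waste compute and bandwidth."""
    return [r for r in LADDER if r["height"] <= source_height]

for r in plan_transcodes(720):
    print(f'{r["name"]}: {r["width"]}x{r["height"]} @ {r["bitrate_kbps"]} kbps')
```

Every rendition is another encode of the same source, which is why a dedicated ASIC that runs the whole ladder in hardware pays off at Facebook’s upload volumes.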

“Going forward, MSVP will enable us to support even more of Meta’s key use cases and needs, including short videos – enabling efficient delivery of generative AI, AR/VR and other metaverse content,” said Reddy and Chen.

AI focus

If there’s a common thread running through today’s hardware announcements, it’s that Meta is desperately trying to pick up the pace when it comes to AI, specifically generative AI.

That much had been telegraphed before. In February, CEO Mark Zuckerberg — who has reportedly made increasing Meta’s computing power for AI a top priority — announced a new top-tier generative AI team to, in his words, “turbocharge” the company’s R&D. CTO Andrew Bosworth also recently said that generative AI was the area where he and Zuckerberg spent most of their time. And chief scientist Yann LeCun has said that Meta plans to deploy generative AI tools to create assets in virtual reality.

“We’re exploring chat experiences in WhatsApp and Messenger, visual creation tools for Facebook and Instagram posts and ads, and video and multimodal experiences over time,” Zuckerberg said during Meta’s Q1 earnings call in April. “I expect these tools will be valuable to everyone from everyday people to creators to businesses. For example, I expect there will be a lot of interest in AI agents for business messaging and customer support once we get to that experience. Over time, this will also extend to our work on the metaverse, where people can much more easily create avatars, objects, worlds and code to connect them all together.”

In part, Meta felt mounting pressure from investors concerned that the company isn’t moving fast enough to capture the (potentially huge) market for generative AI. It has no answer to chatbots like Bard, Bing Chat or ChatGPT yet. Nor has it made much progress in image generation, another important segment that has exploded.

If predictions are correct, the total addressable market for generative AI software could reach $150 billion. Goldman Sachs forecasts that generative AI could raise GDP by 7%.

Even a small fraction of that could erase the billions Meta has lost on investments in “metaverse” technologies like augmented reality headsets, conferencing software, and VR playgrounds like Horizon Worlds. Reality Labs, Meta’s division responsible for augmented reality technology, reported a net loss of $4 billion last quarter, and the company said on its Q1 call that it expects “operating losses to increase year over year in 2023.”
