
Machine learning (ML) models continue to evolve in challenging ways, both in scale and in technique. Large Language Models (LLMs) exemplify the former, while Deep Learning Recommendation Models (DLRMs) and large Transformer models such as BERT exemplify the latter. Google's ML supercomputer has grown from 256 TPU v2 nodes to 4096 TPU v4 nodes to accommodate the huge size of modern LLMs. Reaching this size raises reliability issues, which are further exacerbated by the fact that training a deep neural network (DNN) is done in an HPC-style, checkpoint/restore fashion in which every node must work in lockstep. This is quite different from the fault-tolerant software approach of Google's distributed online systems.
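The checkpoint/restore training style mentioned above can be illustrated with a toy Python loop (a minimal sketch, not Google's implementation; the file name, pickle format, and step logic are all hypothetical, purely to show the pattern of periodic state saves plus resume-on-restart):

```python
import os
import pickle

CKPT = "ckpt.pkl"  # hypothetical checkpoint path

def train(steps=100, ckpt_every=10):
    # Restore: resume from the last checkpoint if one exists.
    step, weights = 0, 0.0
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            step, weights = pickle.load(f)
    while step < steps:
        weights += 0.1  # stand-in for one synchronous training step
        step += 1
        if step % ckpt_every == 0:
            # Checkpoint: persist progress so a node failure only loses
            # the work done since the last save.
            with open(CKPT, "wb") as f:
                pickle.dump((step, weights), f)
    return step, weights
```

The key point is that if any node fails mid-run, the whole job restarts from the latest checkpoint, which is why every node being up simultaneously matters so much at 4K-node scale.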
Google researchers have identified three major improvements to TPU v4 that address these issues:
1. To overcome scalability and reliability challenges, they introduced optical circuit switches (OCSes) with optical data links, enabling a 4K-node supercomputer to tolerate 1K CPU hosts that are unavailable 0.1% to 1.0% of the time through reconfiguration.
2. They describe SparseCore (SC), hardware support for embeddings in DLRMs, a feature of TPUs since TPU v2.
3. Tying the two features above together, embeddings increase communication requirements at supercomputer scale by introducing all-to-all communication patterns. All-to-all patterns stress bisection bandwidth, unlike the all-reduce used in backpropagation, which maps well to 2D and 3D tori. OCSes allow flexible topology construction, including improved partitioning.
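The difference between the two collectives can be sketched with plain NumPy (an illustrative toy, not the TPU implementation; the worker count, slice sizes, and bisection cut are made-up assumptions):

```python
import numpy as np

n = 4  # hypothetical number of workers

# All-reduce (gradient sync in backpropagation): each worker contributes a
# local gradient and every worker ends up with the same element-wise sum.
grads = [np.full(3, w, dtype=np.int64) for w in range(n)]
allreduced = np.sum(grads, axis=0)  # every worker would hold this identical array

# All-to-all (embedding exchange in DLRMs): worker i holds n slices and sends
# its j-th slice to worker j, so every pair of workers exchanges distinct data.
sendbuf = np.arange(n * n).reshape(n, n)  # row i = slices held by worker i
recvbuf = sendbuf.T.copy()                # row j = slices received by worker j

# Distinct slices crossing a cut between workers {0, 1} and {2, 3}:
# all-to-all sends (n/2) * (n/2) slices each way across the cut, which is why
# it stresses bisection bandwidth more than neighbor-friendly all-reduce rings.
crossing = (n // 2) * (n // 2)
```

An all-reduce on a ring or torus only ever talks to immediate neighbors, while the all-to-all transpose above forces traffic between every pair of workers, half of which must cross any bisection of the machine.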
LLMs are now a hot topic in the ML community. The OCSes in TPU v4 were initially motivated by scale and reliability, but their topological flexibility and deployment advantages ended up significantly reducing LLM training time. Although TPU training and inference have already been addressed in previous publications, this study focuses on three unique aspects of TPU v4 that have not been covered before.
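To see why reconfigurability matters for reliability at this scale, consider a back-of-envelope calculation (the 1,000-host count and the 0.1%–1.0% unavailability figures come from the summary above; treating host failures as independent is an assumption of this sketch, not a claim from the paper):

```python
def all_up_probability(p, hosts=1000):
    """Probability that a job spanning `hosts` machines finds all of them up,
    assuming each host is independently unavailable with probability p."""
    return (1 - p) ** hosts

# Without reconfiguration, a fixed 1,000-host slice is fully available only:
p_best = all_up_probability(0.001)  # ~0.37 when each host is down 0.1% of the time
p_worst = all_up_probability(0.01)  # ~4e-5 when each host is down 1.0% of the time

# With OCS reconfiguration, the scheduler can route around down hosts, so the
# usable fraction of the machine stays near the per-host availability
# (99%-99.9%) instead of collapsing toward the products above.
```

Even at 99.9% per-host availability, a rigid 1,000-host slice would be fully up only about a third of the time, which is why routing around failures via the OCSes pays off.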
The following are the main contributions of the paper:
- Discusses and evaluates the first production deployment of OCSes in a supercomputer, and the first that lets users change the interconnect topology to improve performance.
- Discusses and evaluates the first accelerator hardware support for embeddings in a production ML system.
- Details the rapid evolution of production model types since 2016 in a fast-moving ML field.
- Shows how Google uses machine learning to co-optimize DNN models, OCS topologies, and SparseCore.
Check out the paper. All credit for this research goes to the researchers on this project.
Anish Teeku is a Consultant Trainee at MarktechPost. He is currently pursuing his undergraduate studies in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He likes to connect with people and collaborate on interesting projects.