China’s Fastest General-Purpose MCM GPU, The Birentech Biren BR100, Architecture Detailed
The Birentech BR100 is the flagship general-purpose GPU that China has to offer, featuring an in-house GPU architecture built on a 7nm process node and housing 77 billion transistors. The chip is packaged using TSMC's 2.5D CoWoS technology and comes packed with 300 MB of on-chip cache, 64 GB of HBM2e delivering 2.3 TB/s of memory bandwidth, and support for PCIe Gen 5.0 with the CXL interconnect protocol. The whole package measures 1074mm², well beyond the reticle limit of the process node, which is why the design is split across multiple dies. Some of the fundamentals that went into designing the BR100 GPU included:
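The reticle argument above can be checked with quick arithmetic. A minimal sketch, assuming a typical single-exposure reticle limit of roughly 858 mm² and an equal split of silicon between the two chiplets (both assumptions, not figures disclosed by Birentech):

```python
# Sketch: why the BR100 has to be a multi-chip module (MCM).
# RETICLE_LIMIT_MM2 is an assumed typical lithography limit;
# the equal split between the two chiplets is also an assumption.

TOTAL_AREA_MM2 = 1074      # total silicon area quoted for the BR100
RETICLE_LIMIT_MM2 = 858    # assumed single-exposure reticle limit
NUM_CHIPLETS = 2

per_die = TOTAL_AREA_MM2 / NUM_CHIPLETS

# A monolithic 1074 mm^2 die cannot be exposed in one shot...
print(f"Monolithic: {TOTAL_AREA_MM2} mm^2, over limit: "
      f"{TOTAL_AREA_MM2 > RETICLE_LIMIT_MM2}")
# ...but each ~537 mm^2 chiplet fits comfortably under the limit.
print(f"Per chiplet: {per_die:.0f} mm^2, over limit: "
      f"{per_die > RETICLE_LIMIT_MM2}")
```

Smaller dies also improve yield, since a single defect scraps less silicon, which matches Birentech's claimed 20% yield advantage over a monolithic design.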
- Break the reticle size limit and integrate more transistors on a chip
- One tape-out to empower multiple SKUs
- Smaller dies for better yield, and hence lower cost
- 896 GB/s high-speed die-to-die interconnect
- 30% more performance and 20% better yield compared with a monolithic design
Talking about the architecture itself, the Biren BR100 is made up of two chiplets, each housing 16 SPCs, or Streaming Processing Clusters. Each SPC contains 16 EUs (Execution Units); four EUs form a Compute Unit (CU) attached to 64 KB of L1 cache (LSC), while each SPC features 8 MB of L2 cache shared across all of its Execution Units. That adds up to 32 SPCs with 512 Execution Units, 256 MB of L2 cache, and 8 MB of L1 cache. Interestingly, a CU can be configured with 4, 8, or up to 16 EUs.

A deeper look at an Execution Unit reveals 16 streaming processing cores (V-Cores) and a single tensor engine (T-Core), along with 40 KB of TLR (Thread Local Register), 4 SFUs, and a TDA (Tensor Data Accelerator). The V-Core is a general-purpose SIMT processor that supports FP32, FP16, INT32, and INT16 along with SFU, load/store, and data-processing operations, while also handling deep-learning operations such as batch normalization and ReLU. It features an enhanced SIMT model that can run up to 128K threads across the 32 SPCs in a super-scalar mode (static and dynamic). The T-Cores, meanwhile, are designed to accelerate AI operations such as matrix multiply-accumulate (MMA) and convolution.

Birentech disclosed various performance metrics for the chip: up to 2048 TOPS (INT8), 1024 TFLOPs (BF16), 512 TFLOPs (TF32+), and 256 TFLOPs (FP32). Based on these figures, the chip looks set to be faster than the NVIDIA Ampere A100, at least on paper. Across various HPC workloads, Birentech showed the BR100 delivering a 2.6x average speedup, and up to a 2.8x peak speedup, over its main competitor. NVIDIA's newer Hopper H100, however, still offers roughly 2x to 2.5x the BR100's performance on the same metrics.

The chip also supports 64-channel video encoding and 512-channel video decoding. As for interconnects, it comes with an 8-link BLink solution offering 2.3 TB/s of external I/O bandwidth.
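The hierarchy totals above all follow from the per-unit figures, as a quick sanity check shows. A minimal sketch (the 4-EUs-per-CU default is assumed when deriving the CU count, since the article notes CUs can also be configured with 8 or 16 EUs):

```python
# Sketch: tallying the BR100 compute/cache hierarchy from the
# disclosed per-unit figures (2 chiplets x 16 SPCs, 16 EUs per SPC,
# 4 EUs per CU with 64 KB L1, 8 MB of shared L2 per SPC).

CHIPLETS = 2
SPCS_PER_CHIPLET = 16
EUS_PER_SPC = 16
EUS_PER_CU = 4          # assumed default CU configuration
L1_PER_CU_KB = 64
L2_PER_SPC_MB = 8

spcs = CHIPLETS * SPCS_PER_CHIPLET        # 32 SPCs
eus = spcs * EUS_PER_SPC                  # 512 Execution Units
cus = eus // EUS_PER_CU                   # 128 Compute Units
l1_total_mb = cus * L1_PER_CU_KB / 1024   # 8 MB of L1 (LSC) in total
l2_total_mb = spcs * L2_PER_SPC_MB        # 256 MB of L2 in total

print(spcs, eus, cus, l1_total_mb, l2_total_mb)  # 32 512 128 8.0 256

# The peak-rate ladder is consistent too: each step down in
# precision doubles throughput (FP32 -> TF32+ -> BF16 -> INT8).
rates = [256, 512, 1024, 2048]  # TFLOPs/TOPS as disclosed
assert all(b == 2 * a for a, b in zip(rates, rates[1:]))
```

Note that 256 MB of L2 plus 8 MB of L1 accounts for 264 MB of the quoted 300 MB of on-chip cache; the remainder presumably sits in other structures such as the TLR files.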
What’s interesting is that the BR100 isn’t far behind the NVIDIA H100 in overall transistor count. The H100 packs 80 billion transistors on the newer N4 process node, whereas the BR100 trails by only 3 billion transistors while using the older 7nm node, which explains its much bigger die size. The Biren BR100 isn’t the only chip that the China-based company has announced: there’s also the Biren BR104, which offers half the performance metrics of the BR100, though its full specifications haven’t been disclosed yet. The only detail available so far is that, unlike the chiplet-based BR100, the BR104 is a monolithic die and comes in a standard PCIe form factor with a 300W TDP. The company states that a chip with 77 billion transistors can mimic the nerve cells of the human brain, and since the chip itself targets DNN and AI workloads, it is more or less meant to reduce China’s dependence on NVIDIA’s AI GPUs.