AMD Announces MI350 Specifications: 185 Billion Transistors and 288GB Memory

kyojuro · Wednesday, August 27, 2025

AMD revealed comprehensive details of the Instinct MI350 series at Hot Chips 2025. Built on the CDNA 4 architecture, the accelerator targets the demands of large language models and high-performance computing. The MI350 series uses a 3D multi-chip package incorporating 185 billion transistors, manufactured on TSMC's N3P and N6 processes, with high-density interconnection provided by CoWoS-S packaging. A single package combines eight Accelerator Complex Dies (XCDs) with two I/O dies (IODs): the XCDs handle computation, while the IODs provide the Infinity Fabric interconnect and the HBM3e memory controllers.

The memory configuration is a pivotal feature of this generation. The MI350 series carries 288 GB of HBM3e delivering up to 8 TB/s of bandwidth, a substantial step up from the MI300 series' 6 TB/s. Each I/O die connects four HBM3e stacks, each a 12-Hi stack with 36 GB of capacity. This layout raises throughput for training large models and extends context-handling capacity for inference. In the cache hierarchy, the MI350 is equipped with a 256 MB Infinity Cache and enlarges the registers and LDS in each compute unit to support dense matrix operations.
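A quick back-of-envelope check shows how the stated stack layout adds up to the headline capacity (figures are those quoted above; the variable names are illustrative, not AMD's terminology):

```python
# Sanity-check the MI350 memory layout described in the article.
IOD_COUNT = 2            # I/O dies per package
STACKS_PER_IOD = 4       # HBM3e stacks attached to each IOD
STACK_CAPACITY_GB = 36   # each stack is a 12-Hi, 36 GB HBM3e stack

total_stacks = IOD_COUNT * STACKS_PER_IOD
total_capacity_gb = total_stacks * STACK_CAPACITY_GB
print(total_stacks, total_capacity_gb)  # 8 stacks, 288 GB
```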

In terms of computational specifications, a single MI350 card delivers 2.5 PFLOPS of FP16/BF16 matrix throughput and 5 PFLOPS of FP8, while the MXFP6/MXFP4 formats reach 10 PFLOPS. FP64 vector performance is sustained at 78.6 TFLOPS, with FP64 matrix performance slightly below the MI300's. The optimization for AI training and inference, however, shows remarkable improvement: AMD's on-site data showed the MI355X delivering a 35-fold throughput increase over the MI300 series on the Llama 3.1 405B inference task.
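The quoted peak rates follow a simple pattern: each halving of the element width roughly doubles matrix throughput, with MXFP6 reported to run at the same rate as MXFP4. A minimal sketch using the per-card figures from the article:

```python
# Per-card dense matrix throughput by precision, as reported (PFLOPS).
peak_pflops = {
    "FP16/BF16": 2.5,
    "FP8": 5.0,
    "MXFP6/MXFP4": 10.0,  # FP6 reportedly runs at the FP4 rate
}

# Narrower formats double throughput step by step:
assert peak_pflops["FP8"] == 2 * peak_pflops["FP16/BF16"]
assert peak_pflops["MXFP6/MXFP4"] == 2 * peak_pflops["FP8"]
```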

Interconnect and scalability are another set of highlights: via fourth-generation Infinity Fabric, each MI350 card reaches 1075 GB/s of aggregate bidirectional bandwidth and supports interconnection of up to eight cards, improving communication speed by roughly 20%. For system integration, AMD offers the air-cooled MI350X and the liquid-cooled MI355X, with thermal design powers (TDP) of 1000 W and 1400 W respectively. The air-cooled configuration scales up to 10U cabinets, while the liquid-cooled option supports higher density in a 5U setup. The standard cluster solution delivers 80 PFLOPS of FP8 performance and 2.25 TB of HBM memory per rack.
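The per-rack figures are consistent with an eight-card node, assuming (our assumption, not stated in the article) that the 80 PFLOPS figure uses the 10 PFLOPS-class FP8 rate with sparsity and that capacity is quoted in binary terabytes:

```python
# Rough consistency check of the quoted per-rack totals.
GPUS = 8                     # up to eight cards interconnected
FP8_SPARSE_PFLOPS = 10.0     # per card; assumed sparse FP8 rate
HBM_GB = 288                 # per card

rack_pflops = GPUS * FP8_SPARSE_PFLOPS
rack_hbm_tb = GPUS * HBM_GB / 1024   # binary TB
print(rack_pflops, rack_hbm_tb)      # 80.0 PFLOPS, 2.25 TB
```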

Compared with NVIDIA, AMD highlights that the MI355X offers a 1.6x memory-capacity advantage, double the FP64 performance, and close competition with the GB200 in mainstream precisions such as FP8 and FP16. Support for the FP6 data format makes the MI350 especially efficient in specific inference scenarios. AMD also emphasizes flexible chip partitioning: a single card can be split into multiple logical GPUs, allowing several instances of a 70B-class model to run simultaneously for better resource utilization.
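How many model copies a partitioned card can host depends mostly on serving precision. A hedged estimator (the overhead factor and the weights-only accounting are our simplifying assumptions, not AMD's sizing guidance):

```python
# Estimate how many N-billion-parameter model instances fit in 288 GB,
# counting weights only plus a rough slack factor for activations/KV cache.
def instances_fit(total_gb=288, params_b=70, bytes_per_param=2, overhead=1.2):
    """Return how many model copies fit in HBM under these assumptions."""
    weights_gb = params_b * bytes_per_param  # e.g. 70B * 2 B (FP16) = 140 GB
    return int(total_gb // (weights_gb * overhead))

print(instances_fit())                   # FP16: 1 instance
print(instances_fit(bytes_per_param=1))  # FP8: 3 instances
```

Dropping to FP8 or FP4 weights is what makes multi-instance serving of a 70B-class model on one card plausible.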

The MI350 series is anticipated to be distributed to partners and data centers by the third quarter of 2025. Additionally, AMD confirmed the MI400 series is under development, projected for release in 2026. As AI models continue to expand, the MI350's design emphasizes large memory capacity, scalable bandwidth, and energy efficiency, positioning AMD at the forefront of competition with NVIDIA. The revelations at Hot Chips underscore AMD's engineering prowess in advanced packaging and chip interconnections, and indicate an annual product iteration cycle to keep pace with the rapid advancements in generative AI.

© 2025 - TopCPU.net