Wild Conjectures About OpenAI's Datacenters And Chips

Release time: 2024-04-02 13:02

Last Friday, just as the Easter holiday was getting started, The Information reported on Stargate, a collaboration between Microsoft and OpenAI. The 1 million interconnect endpoint scalability target set by the Ultra Ethernet Consortium, of which Microsoft is a founding member, points to where Ethernet is headed.

 

The Stargate system has been a source of discussion ever since. Altman cannot seem to decide whether OpenAI should rely entirely on Microsoft, and who can blame him? Hence the rumors that OpenAI is designing its own chip for AI training and inference, as well as Altman's outrageous comments about trying to spearhead a $7 trillion investment in chip manufacturing, which he later walked back.

 

You can't blame Altman for the big numbers he is staring at. Training AI models is very expensive, and running inference – mainly generating tokens – is not cheap either. As Nvidia co-founder and CEO Jensen Huang pointed out in his recent GTC 2024 keynote, at current costs these workloads are unsustainably expensive. That is why Microsoft, Amazon Web Services, Google, and Meta Platforms have created, or are in the process of creating, their own CPUs and XPUs.

 

As parameter counts increase and training data shifts from text to other formats, if the current trend continues and the iron can be scaled, LLMs are only going to get bigger – perhaps 100X to 1,000X bigger in the next few years.

 

Hence the talk of Stargate, which suggests that the upper echelons of AI training are, without question, a rich company's game.

 

Based on the initial Stargate rumor report, Stargate is the fifth phase of a project that will cost somewhere between $100 billion and $115 billion, with Stargate itself to be delivered in 2028 and extended out into 2030 and beyond. Microsoft is apparently now in the third phase of that buildout. Presumably those funding figures cover all five phases of the machine, and it is unclear whether they include the datacenters, the machinery inside them, and the cost of electricity. Microsoft and OpenAI are unlikely to clarify any of this.

 

There has been no discussion yet of what technology the Stargate system will be based on, but we do not think it will be Nvidia GPUs and interconnects. We think it will be based on future generations of Microsoft's Cobalt Arm server processors and Maia XPUs, with Ethernet scaling to hundreds of thousands, and eventually 1 million, XPUs in a single machine.

 

We also believe that Microsoft acquired DPU maker Fungible to build scalable Ethernet networks, and may tap Pradeep Sindhu, the founder of both Juniper Networks and Fungible, to create matching Ethernet switch ASICs so that Microsoft can control its entire hardware stack.

 

Of course, this is just conjecture.

 

Whatever kind of Ethernet network Microsoft uses, we are fairly certain that 1 million endpoints is the eventual target, and we are just as certain that InfiniBand is not the answer.

 

We also think it is unlikely that these hypothetical XPUs will be as powerful as the future Nvidia X100/X200 GPUs or their successors (whatever those end up being called). Microsoft and OpenAI are more likely to massively scale out a network of cheaper devices and radically reduce the overall cost of AI training and inference.

 

Their business model depends on this happening.

 

And we can also reasonably assume that at some point Nvidia will have to create an XPU packed with matrix math units and ditch the vector and shader units that gave the company its start in data center computing. If Microsoft builds a better mousetrap for OpenAI, then Nvidia will have to follow suit.

 

Stargate certainly represents a step function in AI spending – perhaps two, depending on how you want to interpret the data.

 

In terms of datacenter budgets, all Microsoft has said publicly so far is that it will spend more than $10 billion on datacenters in 2024 and 2025, and we presume most of that spending covers AI servers. The $100 billion and $115 billion figures are too vague to mean anything specific, so for the moment they are just big talk. Bear in mind that Microsoft has held at least $100 billion in cash and equivalents for the past decade, peaking at close to $144 billion in the September 2023 quarter. As of the end of calendar 2023 (Microsoft's Q2 of fiscal 2024), that figure had dropped to $81 billion.

 

So Microsoft does not have enough cash on hand right now to do the entire Stargate project at once, but its software and cloud businesses generated $82.5 billion in net income on roughly $227.6 billion in sales over the past twelve months. If those businesses merely hold steady over the next six years, Microsoft will bring in about $1.37 trillion in revenue and roughly $500 billion in net income. It can afford the Stargate effort. Microsoft could also simply buy OpenAI and be done with it.
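The back-of-the-envelope math in that paragraph is easy to check. A minimal sketch, using only the trailing-twelve-month figures quoted above and assuming flat results:

```python
# Trailing-twelve-month figures quoted above, in billions of US dollars.
ttm_revenue = 227.6
ttm_net_income = 82.5
years = 6  # assumes the software and cloud businesses simply hold steady

projected_revenue = ttm_revenue * years        # ~1,365.6, i.e. ~$1.37 trillion
projected_net_income = ttm_net_income * years  # ~495.0, i.e. ~$500 billion

print(f"Revenue over {years} years: ${projected_revenue / 1000:.2f} trillion")
print(f"Net income over {years} years: ${projected_net_income:.0f} billion")
```

Flat revenue is a conservative assumption; any growth only makes the Stargate budget easier to swallow.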

 

In any event, we have taken a stab at a budget for the clusters that Microsoft has probably already built for OpenAI, as well as the clusters that OpenAI may build in the future, showing how their composition and size change over time. Take a look:

 

 

 

We believe that, over time, the number of AI clusters assigned to OpenAI will decrease while the size of those clusters increases.

 

We also think the share of GPUs in OpenAI's clusters will fall while the share of XPUs – most likely Maia series parts, though possibly co-designed with OpenAI – rises. Over time, the number of homegrown XPUs will match the number of GPUs, and we further estimate that these XPUs will cost less than half as much as datacenter GPUs. In addition, we believe the move from InfiniBand to Ethernet will also cut costs, especially if Microsoft uses its own Ethernet switch ASICs and its own NICs with built-in collective operation offload (much like the SHARP feature of Nvidia's InfiniBand switches).
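To make the cost argument concrete, here is a minimal sketch of how the blended accelerator bill for a cluster falls as the XPU share rises. The only ratio taken from the text is that an XPU costs less than half as much as a datacenter GPU; the per-unit dollar figures below are hypothetical placeholders, not reported prices:

```python
# Hypothetical per-accelerator costs in US dollars. Only the ratio
# (XPU under half the GPU price) comes from the article's estimate.
GPU_COST = 30_000
XPU_COST = 14_000

ENDPOINTS = 1_000_000  # the 1 million endpoint target discussed above

def cluster_cost(xpu_share: float) -> int:
    """Blended accelerator cost for a cluster with the given XPU fraction."""
    xpus = int(ENDPOINTS * xpu_share)
    gpus = ENDPOINTS - xpus
    return gpus * GPU_COST + xpus * XPU_COST

for share in (0.0, 0.5, 1.0):
    print(f"XPU share {share:.0%}: ${cluster_cost(share) / 1e9:.1f} billion")
```

Under these placeholder prices, an all-GPU million-endpoint cluster costs $30 billion in accelerators alone, a half-and-half cluster $22 billion, and an all-XPU cluster $14 billion – which is the whole point of building your own silicon.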

 

We also forced our spending model so that by 2028 there are two clusters with 1 million endpoints each – one made up of GPUs and one made up of homegrown XPUs, or two clusters that are half of each. We would like to estimate future cluster performance as well, but that is hard to do. The XPUs may only get a modest performance boost each year, but at a much better price/performance ratio.

 

It is important to remember that Microsoft can keep the current generation of GPUs or XPUs for OpenAI's use (and therefore its own), then rent the N-1 and N-2 generations out to other customers for many years to come, likely recouping much of its investment in OpenAI. These investments are therefore not sunk costs in themselves. It is more like a car dealer driving a bunch of different cars on dealer plates, without putting too many miles on them before selling them.

 

Here is the question: will Microsoft keep investing heavily in OpenAI so it can turn a profit renting it this capacity, or, rather than spending $100 billion or so on infrastructure, will it simply buy OpenAI – which was valued at $80 billion two months ago – and take full control of its AI stack?

 

Even for Microsoft, these numbers are quite large. But, as we said, over 2024-2028 Microsoft could have around $500 billion in net income at its disposal. Very few other companies can say that.

 

Microsoft got its start with a BASIC compiler and a junky DOS operating system cobbled together from third-party code for a desperate Big Blue that did not understand it was giving away the candy store.

 

Maybe that is Altman's nightmare, too. But given the huge sums of money it takes to push AI to new heights, it may already be too late.

 

 
