
How big is NVIDIA's inference advantage? A detailed explanation of the GTC inference performance charts

  • Writer: BedRock
  • Mar 28
  • 7 min read

At GTC, Jensen Huang showed two very interesting charts:

If the required TPS per user is low, the advantage of more advanced GPUs and more advanced interconnects is not obvious; when it is high, however, the difference becomes qualitative.

In the two charts above:

  • The horizontal axis is the response speed each user experiences when using the large model, in tokens per second (TPS) per user. One token corresponds to roughly 1.5-2 Chinese characters or 1.3 English words, so 100 TPS per user means the model can emit about 150-200 Chinese characters or 130 English words per second when answering a question;

  • The vertical axis is how many tokens a 1MW data center can output per second, i.e., throughput. The value depends on the GPU model (H100 or B200), the interconnect (NVL8 or NVL72), the precision of the model (FP8 or FP4), the inference framework (whether the newly launched Dynamo is used), and the required output speed (TPS per user).

The more advanced the GPU (B200 over H100), the lower the precision (FP4 vs. FP8), and the lower the required TPS per user, the more tokens a 1MW data center can output per second.

In addition, the interconnect also has a large impact on throughput, which is worth discussing in depth.

Under the same Blackwell and FP4 conditions, NVL72 and NVL8 show significant differences once TPS per user rises above 100:

  • When TPS per user is below 100, NVL72 provides about 1.1-1.5x the throughput of NVL8 in a 1MW data center; more, but not qualitatively different;

  • When TPS per user reaches 150, NVL72 provides roughly 2x the throughput of NVL8, and the advantage widens significantly;

  • When TPS per user reaches 200, NVL72 provides roughly 10x the throughput of NVL8, a qualitative difference.

In summary, once TPS per user exceeds 100, Blackwell NVL72 significantly outperforms NVL8, and even more clearly outperforms Hopper.

For the same task, whether to give users a high or a low experience also affects the choice of cards

Specifically, for a 1MW NVL8 Blackwell data center:

  • If the required TPS per user is 10, the data center can deliver ~9 million tokens per second;

  • If the required TPS per user reaches 100, the data center can only deliver ~6 million tokens per second.

Suppose the user's task is relatively simple, such as plain question answering: each question consumes 90 tokens, and peak demand is 100,000 requests per second (a peak of 100,000 users roughly corresponds to 10 million daily active users). If TPS per user only needs to reach 10, i.e., the user waits 9 seconds for the answer, a 1MW NVL8 Blackwell data center with a throughput of 9 million tokens/s can meet the requirement.

If we want to improve the user experience, for example by cutting the waiting time to 0.9 seconds, TPS per user must reach 100. The same 1MW data center can now deliver only 6 million tokens per second (instead of 9 million), so the peak demand it can serve drops to 60,000-70,000 requests per second, i.e., only 6-7 million daily active users, a roughly 30-40% reduction in service capacity.
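For concreteness, here is a minimal sketch of that arithmetic in Python. The 9 million and 6 million tokens/s throughput figures are the values read off NVIDIA's chart for a 1MW NVL8 Blackwell data center; the request rate and tokens per answer are the assumptions of this example:

```python
# Back-of-the-envelope check of the simple Q&A example above.
# Throughputs (tokens/s per 1MW NVL8 Blackwell data center) are read off
# NVIDIA's chart; request rate and tokens per answer are this example's assumptions.

TOKENS_PER_ANSWER = 90            # a simple Q&A answer
PEAK_REQUESTS_PER_SEC = 100_000   # ~10 million daily active users at peak

def peak_requests_served(dc_tokens_per_sec: float) -> float:
    """How many answers per second one data center can produce."""
    return dc_tokens_per_sec / TOKENS_PER_ANSWER

# 10 TPS per user: 9-second wait, data center delivers ~9M tokens/s
low_experience = peak_requests_served(9_000_000)    # 100,000 requests/s -> full demand met

# 100 TPS per user: 0.9-second wait, data center drops to ~6M tokens/s
high_experience = peak_requests_served(6_000_000)   # ~66,700 requests/s -> ~2/3 of demand

print(f"10 TPS/user : {low_experience:,.0f} requests/s")
print(f"100 TPS/user: {high_experience:,.0f} requests/s "
      f"({high_experience / PEAK_REQUESTS_PER_SEC:.0%} of peak demand)")
```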

In summary, lowering the user experience (reducing TPS per user) raises the service capacity of the data center (i.e., its token throughput per second); the two trade off against each other.

Under what circumstances is a TPS per user above 100 needed?

Data centers today generally target 20 TPS per user (corresponding to 30-40 Chinese characters or about 26 English words per second). From this perspective, multi-card interconnection currently matters little for inference.

From another perspective, humans read only ~5 Chinese characters or English words per second. If AI is used through the interactive form of a chatbot, it does not require very high TPS.

However, if AI applications become agents of the kind represented by Manus and DeepResearch, they will require very high TPS; the demand easily exceeds 100 TPS per user, and can go much higher.

The newly released Dynamo inference framework follows the same logic: it is very useful when TPS per user is above 250, but overkill and unnecessary below that.

When AI applications shift from simple tasks to complex tasks, the demand for cards increases a thousandfold

In Chat applications:

Serving a peak of 10,000 users (corresponding to 1 million daily active users), with each task requiring 900 tokens, the required throughput is 9 million tokens per second. If a Blackwell NVL72 data center is used:

  • Scenario 1: Serving at 20 TPS per user, each task takes 45 seconds. A 1MW NVL72 Blackwell data center (roughly ~1,000 cards) can output 9 million tokens per second, which is enough for this service.

  • Scenario 2: If we want to improve the experience and serve at 200 TPS per user, each user waits only 4.5 seconds. The same data center's output drops to 6 million tokens per second, so 1.5 data centers (roughly ~1,500 cards) are needed to support the service.

Currently, in applications that focus on simple tasks, it is enough to provide the experience of scenario 1.

In Agent applications (such as Manus and DeepResearch):

With a peak of 10,000 users (corresponding to 1 million daily active users), the tasks are more complex: completing one task requires 900,000 tokens (1,000x more than before), so the required throughput becomes 9 billion tokens per second. If Blackwell NVL72 data centers are used:

  • Scenario 1: If 20 TPS per user is still used, each task takes about 13 hours (no user would accept this experience). The data center still delivers 9 million tokens per second, so 1,000 data centers (roughly 1 million cards) are needed to serve 1 million daily active users, and the experience is poor.

  • Scenario 2: To improve the experience, services are provided at 200 TPS per user, and the waiting time per task falls to about 1.3 hours. A 1MW NVL72 data center delivers 6 million tokens per second, so 1,500 data centers (roughly 1.5 million cards) are needed (the most likely scenario in the agent era).

  • Scenario 3: If NVL8 data centers are insisted on while the service standard is raised to 200 TPS per user, each task still takes 1.3 hours, but a 1MW data center's output falls sharply to about 1 million tokens per second (possibly only a few hundred thousand; the chart's axis has no clear scale), so 9,000 data centers (corresponding to 9 million cards) are needed to serve 1 million daily active users.

At the same user experience, serving 1 million daily active users takes 1.5 million NVL72 cards, where NVL8 would need 6x as many (9 million).

If AI demand shifts from simple question answering to complex tasks, then at an acceptable experience (complex tasks tolerate longer waits than simple ones), serving the same number of users requires roughly 1,500x more cards, even after upgrading from NVL8 to NVL72.
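The card counts above follow from the same kind of arithmetic. A minimal sketch, using the per-1MW throughput values read off NVIDIA's chart and the rough 1,000-cards-per-1MW figure assumed above:

```python
# Card counts for the chat and agent scenarios above. The per-data-center
# throughputs (tokens/s per 1MW) are read off NVIDIA's chart; 1,000 cards
# per 1MW data center is the rough figure assumed in the text.

CARDS_PER_DC = 1_000
PEAK_TASKS_PER_SEC = 10_000       # ~1 million daily active users at peak

def cards_needed(tokens_per_task: int, dc_tokens_per_sec: float) -> float:
    required_tokens_per_sec = PEAK_TASKS_PER_SEC * tokens_per_task
    return required_tokens_per_sec / dc_tokens_per_sec * CARDS_PER_DC

chat_nvl72_20tps   = cards_needed(900, 9_000_000)        # ~1,000 cards, 45 s per task
agent_nvl72_200tps = cards_needed(900_000, 6_000_000)    # ~1.5M cards, ~1.3 h per task
agent_nvl8_200tps  = cards_needed(900_000, 1_000_000)    # ~9M cards, ~1.3 h per task

print(f"Chat,  NVL72, 20 TPS/user : {chat_nvl72_20tps:,.0f} cards")
print(f"Agent, NVL72, 200 TPS/user: {agent_nvl72_200tps:,.0f} cards "
      f"({agent_nvl72_200tps / chat_nvl72_20tps:,.0f}x the chat deployment)")
print(f"Agent, NVL8,  200 TPS/user: {agent_nvl8_200tps:,.0f} cards "
      f"({agent_nvl8_200tps / agent_nvl72_200tps:,.0f}x the NVL72 deployment)")
```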

In general, long reasoning and complex tasks strengthen NVDA's advantage: the more tokens a response contains, the higher the TPS per user must be to keep the response time within an acceptable range.

At the efficiency of today's cards, the economics of Agents do not yet add up

Following the Agent demand assumed above (serving 1 million daily active users with NVL72, i.e., Agent scenario 2), 1.5 million Blackwell cards are needed. At $35,000 per card, depreciated over 5 years, the annual card cost per user is roughly $10,000. Adding other server costs and operations costs, an Agent service provider must charge each user at least $20,000 to avoid losing money, and at least $50,000 to earn a reasonable return.

Average disposable income in the United States is about $60,000 per person. Even if an Agent could fully replace one person, a fee of $20,000-$50,000 is too high, and it will take time for Agents to improve to the point of fully replacing a person.

However, if inference cost can be cut by a further 10x through software optimization or hardware improvement, and Agents improve to the point of fully replacing a person, then each user would only need to spend about $5,000 per year. That price level is very attractive, and Agent companies would earn considerable returns. At present, hardware improvements have been cutting inference cost by roughly 30x every 2 years; if that pace holds, and software optimization adds to it, the success of Agents is still worth looking forward to.
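The cost figures can be checked the same way. A sketch under the assumptions stated above; the 2x and 5x markups for break-even and a reasonable return are rough multipliers implied by the $20,000 and $50,000 figures, not published costs:

```python
# Rough per-user economics of the NVL72 Agent scenario above.
# Every input is an assumption from the text, not a measured cost.

CARDS = 1_500_000
CARD_PRICE_USD = 35_000
DEPRECIATION_YEARS = 5
DAILY_ACTIVE_USERS = 1_000_000

# ~$10,500/year, which the text rounds to roughly $10,000
annual_card_cost_per_user = CARDS * CARD_PRICE_USD / DEPRECIATION_YEARS / DAILY_ACTIVE_USERS

breakeven_price = 2 * annual_card_cost_per_user   # roughly the $20,000 break-even level
target_price    = 5 * annual_card_cost_per_user   # roughly the $50,000 reasonable-return level

print(f"Annual card cost per user:        ${annual_card_cost_per_user:,.0f}")
print(f"Break-even price per user:        ${breakeven_price:,.0f}")
print(f"Reasonable-return price per user: ${target_price:,.0f}")
# A further ~10x reduction in inference cost brings the reasonable-return
# price to roughly $5,000 per user per year, as argued above.
print(f"After a 10x cost reduction:       ${target_price / 10:,.0f}")
```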

Finally, the conclusions above are calculated from the server inference performance curves NVDA has released so far. If software or hardware progress shifts these curves, the conclusions must be recalculated: for example, NVDA's Blackwell Ultra (launching in the second half of this year), Rubin (launching next year), or new models and new inference optimizations from DeepSeek could all change the conclusions above.



