Aware Original

Aug 20, 2025

Can Korean AI Chip Startup FuriosaAI Overtake Nvidia?


Ryunsu Seong

[Image: FuriosaAI's RNGD accelerator on a red background]
“If Nvidia is a gasoline car, Furiosa is an EV”
“Nvidia and FuriosaAI are, so to speak, competing in an F1 (Formula One) race. To survive in F1, you need to push for extreme speed and performance.” FuriosaAI founder and CEO Junho Baek gave his first interview on August 14 after the company achieved unicorn status (a startup valued at over 1 trillion won)…
Dong-A Ilbo - Reporter Eunji Jang

A few days ago, Tesla halted development of its in-house AI accelerator Dojo and officially disbanded the engineering team. Tesla’s existing and planned data centers mainly use Nvidia’s data-center product line to train AI models, but its first-generation Dojo chips are also reportedly being used to some extent.

Ganesh Venkataramanan, who joined Tesla in 2016 and oversaw everything from Dojo’s hardware/software design to mass production, was previously a senior director at AMD, managing more than 200 engineers in design engineering. He left Tesla in October 2023 to start his own company. Peter Bannon, the vice president in charge of custom semiconductors and low-voltage electrical systems who reported directly to Elon Musk, also left the company this month, and Tesla is now expected to focus on developing the AI6 chip that will be prioritized for its vehicles.

FuriosaAI CEO Junho Baek worked as an AMD GPU software engineer and a Samsung Electronics memory hardware engineer before founding FuriosaAI in 2017. Baek compares Nvidia, which currently boasts overwhelming performance, to a “gasoline car,” and his company’s inference-only accelerator under development to an “EV,” claiming that it delivers performance comparable to Nvidia’s top-tier inference product, the L40S, while achieving more than twice the power efficiency.

The L40S has a relatively small 48GB of memory and lacks NVLink, which Nvidia's flagship data-center products support. NVLink is a technology that lets multiple chips be connected and used as if they were one. Serving very large language models, or handling many users' inference requests in parallel, requires more memory than a single chip provides (80GB for the H100), which is why most data-center deployments group eight H100 chips into a single node.

As concerns grow that performance gains from simply scaling up model parameters have hit a ceiling (more precisely, that we are running out of training data), more models are being released that focus on the inference phase (for example, OpenAI's o3). These models are demonstrating that allocating more compute to inference leads directly to better output quality. In that sense, Baek's assertion that an "age of inference" is coming is not particularly controversial. Frontier models will keep growing in parameter count, but the pace will slow, while inference demand is surging as AI models improve.

Once a model has been trained, training costs are essentially fixed, whereas inference costs scale with user request volume, making them a highly sensitive area for service providers. Inference is usually measured in "cost per token," which is not an absolute metric because the amount of information a token carries varies by model. Still, cost per token is one of the key factors in delivering AI services: you need to maintain answer quality while keeping response latency low and token generation fast, and you must secure a context window large enough for answers to fully reflect the input information.
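As a rough illustration of the metric, the sketch below converts an assumed hourly hardware cost and an assumed sustained throughput into a cost per million output tokens; both inputs are placeholders rather than figures for any specific chip.

```python
# Illustrative cost-per-token arithmetic. The $2.00/hour hardware cost and the
# 1,000 tok/s sustained throughput are placeholder assumptions, not benchmark data.

HOURLY_COST_USD = 2.00          # assumed hourly cost of one accelerator
THROUGHPUT_TOK_PER_S = 1000     # assumed sustained output tokens per second

tokens_per_hour = THROUGHPUT_TOK_PER_S * 3600
cost_per_million_tokens = HOURLY_COST_USD / tokens_per_hour * 1e6

print(f"~${cost_per_million_tokens:.2f} per million output tokens")
# 1,000 tok/s -> 3.6M tokens/hour -> about $0.56 per million output tokens here.
```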

Source: FuriosaAI

The token/s metric frequently used in LLM inference benchmarks is not absolute. Furiosa claims on its RNGD product page that it outperforms Nvidia’s H100 and L40S, but the data it uses is based on a test environment described so sparsely that it is essentially meaningless for estimating performance in real-world data-center service conditions. I also looked for any documentation on the company’s website explaining the environment in which these performance test values were obtained, but could not find any.

The token/s/W metric that the company mainly promotes appears to be intended as a measure of power efficiency. However, as the table below (calculated independently by us) shows, for inference on the Llama 3.1 70B model it is almost certain that the figures were computed against TDP rather than actual power consumption; only under that assumption do we reproduce the same results, 80% better than the H100 SXM and 1,038% better than the L40S. TDP (thermal design power) indicates the maximum amount of heat a chip's cooling solution must be able to dissipate, so it is meant for designing server cooling and does not directly reflect actual power draw in use. Moreover, because TDP calculation methods are not standardized and differ by manufacturer, it is an even less reliable basis for discussing power efficiency.

Source: AWARE
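For reference, a TDP-based tokens/s/W figure is simply throughput divided by the published TDP. The sketch below shows how the same throughput yields a different "efficiency" number depending on whether you divide by TDP or by measured wall power. The throughput and measured-power values are hypothetical placeholders; only the TDP column reflects published specs (the 150 W figure for RNGD is taken from FuriosaAI's own materials).

```python
# Hypothetical illustration of TDP-based vs measured-power-based efficiency.
# Throughput and "measured watts" below are made-up placeholders, NOT the values
# in the table above; only the TDP numbers are the manufacturers' published specs.

def tokens_per_s_per_watt(tokens_per_s: float, watts: float) -> float:
    return tokens_per_s / watts

# name: (placeholder tokens/s, published TDP in W, placeholder measured W)
chips = {
    "H100 SXM": (2500.0, 700.0, 600.0),
    "L40S":     (200.0,  350.0, 300.0),
    "RNGD":     (950.0,  150.0, 140.0),   # 150 W TDP per FuriosaAI's materials
}

for name, (tps, tdp, measured) in chips.items():
    print(f"{name:8s}  TDP-based: {tokens_per_s_per_watt(tps, tdp):5.2f} tok/s/W"
          f"  measured-based: {tokens_per_s_per_watt(tps, measured):5.2f} tok/s/W")
```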

The H100 NVL, which Furiosa does not compare itself against, is a variant "optimized for inference" that cuts TDP to roughly half that of the standard H100 SXM while increasing memory bandwidth. Computed against TDP, as the company prefers, it delivers an outstanding 4.32 TFLOPS per watt.

There is no test data for the H100 NVL—the model that Nvidia itself describes as efficient for truly large-scale inference services—but I was curious what the results would look like if we applied the exact same metrics Furiosa chose to highlight RNGD. Using the “Theoretical Max Tokens/s” and “Theoretical Step Time (General)” formulas from a blog run by Google DeepMind, I first derived a theoretical maximum throughput of 2,690 tokens per second. I then applied the ratio between Furiosa’s reported H100 SXM test data and its theoretical maximum (77%) to estimate a realistic throughput of 2,091 tokens per second, and divided that by the manufacturer’s stated TDP (W) to obtain an estimated 5.98 output tokens per second per watt. Under these assumptions, the H100 NVL’s raw throughput is 119% higher than RNGD’s, and its performance per unit of power is 12% better.
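The chain of estimates above reduces to a couple of multiplications and a division. Here is a minimal sketch of that arithmetic, taking the figures stated in the text as inputs: the 2,690 tok/s theoretical maximum, the 77% realized-to-theoretical ratio from the H100 SXM data, and Nvidia's stated 350 W TDP for the H100 NVL (configurable up to 400 W).

```python
# Reproducing the H100 NVL estimate described above from the stated inputs.
# The small gap versus the 2,091 tok/s and 5.98 tok/s/W quoted in the text
# comes from rounding the realized-to-theoretical ratio to 77%.

THEORETICAL_TOKENS_PER_S = 2690.0   # from the scaling-book formulas, per the text
REALIZED_RATIO = 0.77               # Furiosa's reported H100 SXM result / its theoretical max
H100_NVL_TDP_W = 350.0              # Nvidia's stated TDP (configurable up to 400 W)

estimated_tps = THEORETICAL_TOKENS_PER_S * REALIZED_RATIO   # ≈ 2,071 tok/s
efficiency = estimated_tps / H100_NVL_TDP_W                 # ≈ 5.92 tok/s/W

print(f"Estimated H100 NVL throughput: {estimated_tps:,.0f} tok/s")
print(f"Estimated efficiency:          {efficiency:.2f} tok/s/W")
```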

In the case of the Nvidia L40S, the measured performance was only 24% of its theoretical maximum. This is because the Llama 3.1 70B model used in the test is roughly 70GB in size. Since each L40S GPU has 48GB of memory, you need to connect multiple GPUs to run this model, and because they are connected over the older PCIe 4.0 interface (64GB/s bidirectional bandwidth), you get communication latency when inferring multiple batches at once.
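A quick footprint calculation makes the constraint concrete. The ~70GB figure is consistent with roughly one byte per parameter, i.e. 8-bit weights as commonly used for inference; that assumption is ours, not something Furiosa states. A minimal sketch:

```python
# Back-of-the-envelope check of why a single L40S cannot hold Llama 3.1 70B.
# Assumes ~1 byte per parameter (8-bit weights); FP16 weights would roughly double it.
import math

PARAMS = 70e9
BYTES_PER_PARAM = 1.0        # assumed 8-bit (FP8/INT8) weights
L40S_MEMORY_GB = 48
H100_MEMORY_GB = 80

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9            # ≈ 70 GB
min_l40s = math.ceil(weights_gb / L40S_MEMORY_GB)      # 2 GPUs just to hold the weights
min_h100 = math.ceil(weights_gb / H100_MEMORY_GB)      # 1 GPU holds the weights

print(f"Weights: ~{weights_gb:.0f} GB -> at least {min_l40s}x L40S or {min_h100}x H100")
# Real deployments also need headroom for the KV cache and activations, so they
# shard across more GPUs than this lower bound suggests.
```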

The table below shows throughput by batch size, and for the L40S, it appears highly likely that inter-GPU communication latency will cause bottlenecks starting from a batch size of 32. For RNGD, which uses PCIe 5.0 (128GB/s bidirectional bandwidth), bottlenecks are expected to arise starting from a batch size of 170, driven more by the limit of compute throughput (TFLOPS) than by inter-GPU communication. The KV cache capacity required at that batch size is estimated at 53.1GB.

Source: AWARE
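For context, KV-cache demand can be estimated from the model's published architecture (80 layers, 8 key/value heads of dimension 128 under grouped-query attention). The sketch below additionally assumes an 8-bit KV cache and the 2,048-token prompt length from the test setup; under those assumptions the total comes out to roughly the 53.1GB shown in the table, though the table's exact assumptions may differ.

```python
# Rough KV-cache sizing sketch for Llama 3.1 70B (80 layers, 8 KV heads, head_dim 128).
# The 8-bit KV-cache precision and the 2,048-token context are our assumptions.

LAYERS = 80
KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 1        # assumes an FP8/INT8 KV cache; FP16 would double this
CONTEXT_TOKENS = 2048      # prompt length in the test setup
BATCH = 170

# Keys and values are both cached: 2 tensors per layer, per KV head, per token.
bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE   # 160 KiB
total_bytes = BATCH * CONTEXT_TOKENS * bytes_per_token

print(f"KV cache per token: {bytes_per_token / 2**10:.0f} KiB")
print(f"KV cache at batch {BATCH}: {total_bytes / 2**30:.1f} GiB")   # ≈ 53.1
```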

Another interesting point is that RNGD’s performance data was extremely close to its theoretical maximum—an astonishing 98%. No AI accelerator delivers that level of efficiency in real large-scale service environments. It appears that FuriosaAI’s engineers carried out an extraordinary degree of optimization to produce these results. The test setup the company chose—2,048 input tokens (prompt) and 128 output tokens—is tailored to tasks like reading a long document and generating a summary, but in actual user requests, the output is usually longer than the prompt, which makes memory bandwidth more important. If we instead assume 128 input tokens and 2,048 output tokens, the H100 SXM’s maximum theoretical throughput advantage over RNGD widens from 2.94x to 3.74x.

Even if we take FuriosaAI's chosen inference performance metrics and data at face value, RNGD appears likely to deliver around 89% of the performance per watt of Nvidia's H100 NVL, the inference-optimized product from the very company Baek likens to a gasoline-car maker. Nvidia's products are already deployed in huge numbers across data centers worldwide, and optimization know-how is widely shared among researchers and developers, so the gap in real large-scale service environments is likely to be much wider. Still, even allowing for the fact that the company has cherry-picked its data, RNGD's product specs represent very solid performance by the standards of AI accelerator startups.

However, Baek’s remarks comparing Nvidia to a gasoline car and positioning FuriosaAI as if it were a company like Tesla a decade ago—pioneering an entirely new market that incumbent manufacturers failed to create—have significant potential to mislead the vast majority of people who lack deep expertise in AI, even taking into account the enormous burden he carries on his shoulders as CEO.
