
Shanghai Neardi Technology Co., Ltd. sales@neardi.com 86-021-20952021


An In-Depth Interpretation of RK3588's 6TOPS Bottleneck and the Truth About NPU Computing Power

2025-12-15

Imagine you're working on an edge AI project on the RK3588: a camera video stream needs real-time face recognition and vehicle detection, while the system also handles UI display, data upload, and business logic. Then you notice that frames drop when many objects appear in the scene, larger models fail to run smoothly, and the chip temperature climbs sharply.

At this point, people usually say: "Your model is too large—RK3588's 6TOPS isn't enough."

But is it really a lack of computing power? Have you ever wondered why a 6TOPS NPU still drops frames and lags when running a model that nominally needs only 4TOPS? The answer lies in three dimensions of NPU computing power: peak performance (TOPS), precision (INT8/FP16), and efficiency (bandwidth).

You will see that various chips emphasize their NPU specifications, with a core parameter prominently displayed: NPU Computing Power: X TOPS. Examples include RK3588-6TOPS, RK3576-6TOPS, RK1820-20TOPS, Hi3403V100-10TOPS, Hi3519DV500-2.5TOPS, Jetson Orin Nano-20/40TOPS, Jetson Orin NX-70/100TOPS, and so on...

What is TOPS? Why is everyone talking about it?

  • Tera (T): represents 10¹².
  • Operations Per Second (OPS): the total number of AI operations the NPU can perform in one second.

In simple terms, 1 TOPS means the NPU can execute 1 trillion (10¹²) operations per second.
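To get an intuition for what 6 TOPS means at the frame level, here is a back-of-the-envelope sketch in Python. The 30 FPS target is an assumed example for a typical real-time video pipeline, not a figure from the chip datasheet:

```python
# Back-of-the-envelope: per-frame operation budget of a 6 TOPS NPU.
peak_tops = 6                          # RK3588's advertised peak computing power
ops_per_second = peak_tops * 10**12    # 1 TOPS = 10^12 operations per second

fps = 30                               # assumed real-time video target (hypothetical)
ops_per_frame = ops_per_second // fps  # operations available for each frame

print(f"{ops_per_frame:.2e} operations available per frame")  # 2.00e+11
```

Every operation a detector, a recognizer, and any pre/post-processing consume must fit inside that per-frame budget, which is why stacking several models on one NPU erodes headroom quickly.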

How is TOPS calculated?


Number of MAC Units: MAC (Multiply-Accumulate) units are the core of neural network computing. In convolutional and fully connected layers, the main computation multiplies input data by weights and then sums the results.

The design philosophy of an NPU lies in having an extremely large array of parallel MAC units. An NPU chip may contain thousands or even tens of thousands of MAC units, which can work simultaneously to achieve large-scale parallel computing.

The more MAC units there are, the greater the amount of computation the NPU can complete in a single clock cycle.

Clock Frequency: Determines the number of cycles the NPU chip and its MAC units operate per second (measured in Hertz, Hz). A higher frequency allows the MAC array to perform more multiply-accumulate operations per unit time. When manufacturers announce TOPS, they use the NPU's peak operating frequency (i.e., the maximum achievable frequency).

Operations per MAC: A complete MAC operation actually includes one multiplication and one addition. To align with the traditional FLOPS (Floating-Point Operations Per Second) counting method, many computing standards count one MAC operation as 2 basic operations (1 for multiplication and 1 for addition).

Precision Factor: The MAC units of an NPU are optimized for processing low-precision data (e.g., INT8).

Simplified speedup ratio of INT8 vs FP32: Since 32 bits / 8 bits = 4, a single FP32 unit can theoretically perform 4 times as many operations in one cycle when switched to INT8 computation. Therefore, if a manufacturer's TOPS is calculated based on INT8, it needs to be multiplied by a precision-related speedup ratio. This is why INT8 TOPS is much higher than FP32 TOPS.
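Putting the four factors together, peak TOPS is typically computed as MAC count × clock frequency × 2 (operations per MAC) × precision speedup, divided by 10¹². A minimal sketch with hypothetical unit counts (Rockchip does not publish RK3588's internal MAC configuration in this form, so the numbers below are illustrative only):

```python
def peak_tops(num_macs: int, freq_hz: float, precision_speedup: int = 1) -> float:
    """Theoretical peak = MACs * frequency * 2 ops/MAC * precision factor, in TOPS."""
    ops_per_second = num_macs * freq_hz * 2 * precision_speedup
    return ops_per_second / 1e12

# Hypothetical example: 3,072 INT8 MAC units clocked at 1 GHz
print(peak_tops(num_macs=3072, freq_hz=1e9))  # 6.144 TOPS
```

Note how a few thousand MAC units at a modest edge-class frequency already lands in the ~6 TOPS range, and how quoting the same hardware at FP32 instead of INT8 would divide the headline number by the precision speedup.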

TOPS measures peak theoretical computing power. In practical applications, due to factors such as data transmission, memory constraints, and model structure, the actual effective computing power of an NPU is often lower than this peak value.

Computing power is about speed; precision is about "fineness."


Computing power tells us how fast an NPU runs, while computational precision tells us how finely it operates. Precision is another key dimension of NPU performance, determining the number of bits used and the representation range of data during computation.

At the same TOPS level, the actual computing speed of INT8 is much faster than that of FP32. This is because the NPU's MAC units can process more 8-bit data at once and perform more operations.

The NPU TOPS claimed by manufacturers are usually based on INT8 precision. When making comparisons, ensure that you are comparing TOPS under the same precision.


High Precision (Typically Used for Training)
  • FP32 (Single-Precision Floating-Point, 32-bit): Offers the largest numerical range and precision. Commonly used in traditional GPU and PC computing. Models typically adopt FP32 during the training phase to ensure accuracy.
  • FP16/BF16 (Half-Precision Floating-Point, 16-bit): Reduces data volume by half while maintaining a certain level of precision, enabling faster computation and memory savings.
Low Precision (Typically Used for Inference)
  • INT8 (8-bit Integer): Currently the industry standard for evaluating inference performance of edge-side NPUs. The process of converting model weights and activation values from high precision (e.g., FP32) to 8-bit integers is called Quantization.
  • INT4 (Lower Bit-Width): Features further compression, suitable for scenarios with extremely high requirements for power consumption and latency, but imposes higher demands on controlling model precision loss.
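The FP32-to-INT8 quantization step described above can be illustrated with a minimal symmetric, per-tensor scheme. This is one of several common schemes, not the exact algorithm any particular toolchain uses; production toolchains (such as Rockchip's RKNN tools) also offer asymmetric and per-channel variants:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization: map FP32 weights onto [-127, 127]."""
    scale = np.abs(weights).max() / 127.0               # largest magnitude maps to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 values from the INT8 codes."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.27, 0.01, 1.0], dtype=np.float32)
q, s = quantize_int8(w)
print(q)                      # [  50 -127    1  100]
print(dequantize(q, s) - w)   # small per-element quantization error
```

The gap between `dequantize(q, s)` and the original weights is exactly the "precision loss" that quantization quality controls, and it is why two chips with identical INT8 TOPS can deliver different real-world accuracy.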
How to Understand the Actual Performance of an NPU?

When you see an NPU claiming 20 TOPS (INT8), you need to understand:

  • The peak computing power is 20 trillion operations per second.
  • This computing power is measured under 8-bit integer (INT8) precision. This means it is mainly used for AI inference (such as image recognition, speech processing, etc.), not training.
  • The final performance depends on the application: The actual user experience (such as face unlock speed, real-time translation latency) relies not only on the NPU's TOPS but also on:
    • Model quantization quality: Whether the quantized INT8 model maintains sufficient accuracy.
    • Memory bandwidth: The speed of data input and output.
    • Software stack and drivers: The optimization level of the toolchain and drivers provided by the chip manufacturer for model deployment.
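One way to see why memory bandwidth, rather than TOPS, is often the real ceiling is a simple roofline estimate: attainable throughput is the smaller of the compute peak and what the memory system can feed. The sketch below uses illustrative numbers; the bandwidth and arithmetic-intensity values are hypothetical, not measured RK3588 figures:

```python
def attainable_tops(peak_tops: float, bandwidth_gbs: float, ops_per_byte: float) -> float:
    """Roofline model: effective throughput = min(compute peak, bandwidth * arithmetic intensity)."""
    memory_bound_tops = bandwidth_gbs * 1e9 * ops_per_byte / 1e12
    return min(peak_tops, memory_bound_tops)

# Hypothetical: 6 TOPS peak, 20 GB/s effective DRAM bandwidth,
# and a model performing 100 INT8 ops per byte fetched from memory.
print(attainable_tops(6.0, 20.0, 100.0))  # 2.0 -> memory-bound at 2 TOPS
```

Under these assumed numbers the NPU delivers only 2 effective TOPS despite a 6 TOPS peak, which matches the opening puzzle: a "4 TOPS" workload can still drop frames if the memory system cannot keep the MAC array fed.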

An NPU's computing power (TOPS) is an indicator of its speed, while computational precision (e.g., INT8) is key to its efficiency and applicability. For end-user-facing devices, manufacturers generally aim to maximize INT8 TOPS while maintaining acceptable precision loss, to achieve low-power and high-efficiency AI inference performance.
