Shanghai Neardi Technology Co., Ltd. sales@neardi.com 86-021-20952021
Imagine you're working on an edge AI project with the RK3588: you need to run real-time face recognition and vehicle detection on a camera video stream while also handling UI display, data upload, and business logic. You notice frames drop when many objects appear in the scene, large models fail to run smoothly, and the chip temperature rises sharply.
At this point, people usually say: "Your model is too large—RK3588's 6TOPS isn't enough."
But is it really a lack of computing power? Have you ever wondered: Why does a 6TOPS NPU still experience frame drops and lag when running a 4TOPS model? The answer lies in three dimensions of NPU computing power: Peak Performance (TOPS), Precision (INT8/FP16), and Efficiency (Bandwidth).
You will see that various chips emphasize their NPU specifications, with a core parameter prominently displayed: NPU Computing Power: X TOPS. Examples include RK3588-6TOPS, RK3576-6TOPS, RK1820-20TOPS, Hi3403V100-10TOPS, Hi3519DV500-2.5TOPS, Jetson Orin Nano-20/40TOPS, Jetson Orin NX-70/100TOPS, and so on...
TOPS stands for Tera Operations Per Second.
Tera: represents 10¹² (one trillion).
Operations Per Second: the total number of AI operations the NPU can perform in one second. In simple terms, 1 TOPS means the NPU can execute 10¹² operations per second.
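To make the unit concrete, here is a minimal sketch of what a TOPS figure implies for frame rate. The model cost (8.7 GOPs per frame) is a hypothetical, illustrative number, and the calculation assumes 100% utilization of peak compute, which real workloads never reach:

```python
# Back-of-envelope check: the theoretical maximum frames per second a
# model could sustain on an NPU, assuming every peak operation is used.

def max_fps(peak_tops: float, gops_per_frame: float) -> float:
    """Upper bound on frame rate: peak ops/second divided by ops/frame."""
    ops_per_second = peak_tops * 1e12      # 1 TOPS = 10^12 operations/second
    ops_per_frame = gops_per_frame * 1e9   # 1 GOP  = 10^9 operations
    return ops_per_second / ops_per_frame

# A 6 TOPS NPU running a hypothetical detector costing ~8.7 GOPs per frame:
print(round(max_fps(6.0, 8.7)))  # ~690 FPS in theory; real throughput is far lower
```

The gap between this theoretical ceiling and observed frame rates is exactly what the efficiency dimension discussed below explains.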
MAC (multiply-accumulate) units are the core of neural network computing, and their total number largely determines peak throughput. In convolutional and fully connected layers, the main computation is multiplying input data by weights and summing the results.
The design philosophy of an NPU lies in having an extremely large array of parallel MAC units. An NPU chip may contain thousands or even tens of thousands of MAC units, which can work simultaneously to achieve large-scale parallel computing.
The more MAC units there are, the greater the amount of computation the NPU can complete in a single clock cycle.
Clock Frequency: Determines the number of cycles the NPU chip and its MAC units operate per second (measured in Hertz, Hz). A higher frequency allows the MAC array to perform more multiply-accumulate operations per unit time. When manufacturers announce TOPS, they use the NPU's peak operating frequency (i.e., the maximum achievable frequency).
Operations per MAC: A complete MAC operation actually includes one multiplication and one addition. To align with the traditional FLOPS (Floating-Point Operations Per Second) counting method, many computing standards count one MAC operation as 2 basic operations (1 for multiplication and 1 for addition).
Precision Factor: The MAC units of an NPU are optimized for processing low-precision data (e.g., INT8).
Simplified speedup ratio of INT8 vs FP32: Since 32 bits / 8 bits = 4, a single FP32 unit can theoretically perform 4 times as many operations in one cycle when switched to INT8 computation. Therefore, if a manufacturer's TOPS is calculated based on INT8, it needs to be multiplied by a precision-related speedup ratio. This is why INT8 TOPS is much higher than FP32 TOPS.
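Putting the four factors above together, peak TOPS can be sketched as: MAC units × clock frequency × operations per MAC × precision speedup, divided by 10¹². The hardware numbers below (3072 MAC units at 1 GHz) are hypothetical and chosen only to produce a round figure; they are not the internals of any specific chip:

```python
def peak_tops(mac_units: int, freq_hz: float,
              ops_per_mac: int = 2, precision_factor: int = 1) -> float:
    """Peak TOPS = MAC units x clock frequency x ops/MAC x precision speedup.

    ops_per_mac is 2 because one MAC counts as one multiply plus one add.
    precision_factor models the INT8-vs-FP32 speedup (e.g. 4 for INT8
    on hardware whose baseline unit is FP32).
    """
    return mac_units * freq_hz * ops_per_mac * precision_factor / 1e12

# Hypothetical NPU: 3072 MAC units at 1 GHz, native precision (factor 1):
print(peak_tops(3072, 1.0e9))  # 6.144 TOPS
```

Note how sensitive the headline number is to the precision factor: the same array advertised at INT8 (factor 4) would claim four times the TOPS of its FP32 rating.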
TOPS measures peak theoretical computing power. In practical applications, due to factors such as data transmission, memory constraints, and model structure, the actual effective computing power of an NPU is often lower than this peak value.
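The gap between peak and effective computing power can be estimated from a measured inference latency. The model cost and latency below are hypothetical example values, not benchmarks of any particular board:

```python
def effective_tops(gops_per_frame: float, latency_ms: float) -> float:
    """Delivered throughput: operations actually executed per second."""
    return gops_per_frame * 1e9 / (latency_ms * 1e-3) / 1e12

def utilization(effective: float, peak: float) -> float:
    """Fraction of the advertised peak that the workload actually achieves."""
    return effective / peak

# Hypothetical measurement: an 8 GOPs model takes 5 ms per frame on a 6 TOPS NPU.
eff = effective_tops(8.0, 5.0)          # 1.6 TOPS actually delivered
print(f"{utilization(eff, 6.0):.0%}")   # about 27% of peak
```

Utilization in this range is common in practice: memory bandwidth, data layout conversions, and layers that do not map well onto the MAC array all keep real workloads well below the peak figure.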
Computing power tells us how fast an NPU runs, while computational precision tells us how finely it operates. Precision is another key dimension of NPU performance, determining the number of bits used and the representation range of data during computation.
On the same NPU hardware, INT8 inference runs much faster than FP32. This is because the MAC units can process more 8-bit values at once and therefore complete more operations per clock cycle.
The NPU TOPS claimed by manufacturers are usually based on INT8 precision. When making comparisons, ensure that you are comparing TOPS under the same precision.
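To show what running a model at INT8 actually involves, here is a minimal, stdlib-only sketch of symmetric per-tensor quantization: floats are mapped to the integer range [-127, 127] with a single scale, then dequantized to see the error introduced. The weight values are made up for illustration; production toolchains use calibration data and per-channel scales:

```python
# Symmetric per-tensor INT8 quantization sketch (illustrative values only).

def quantize(values, scale):
    """Map floats to int8 range [-127, 127] using one shared scale."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def dequantize(q, scale):
    """Recover approximate floats from the quantized integers."""
    return [x * scale for x in q]

weights = [0.81, -0.33, 0.05, -1.20, 0.47]
scale = max(abs(v) for v in weights) / 127   # largest value maps to +/-127

q = quantize(weights, scale)                 # [86, -35, 5, -127, 50]
recovered = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, recovered))

print(q)
print(max_err <= scale / 2)  # rounding error stays within half a quantization step
```

This bounded, small error is the "acceptable precision loss" that makes the 4x INT8 speedup worthwhile for most vision workloads.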
When you see an NPU claiming 20 TOPS (INT8), you need to understand that this figure is a peak value measured at the lowest supported precision, not a guarantee of real-world throughput.
An NPU's computing power (TOPS) is an indicator of its speed, while computational precision (e.g., INT8) is key to its efficiency and applicability. For end-user-facing devices, manufacturers generally aim to maximize INT8 TOPS while keeping precision loss acceptable, achieving low-power, high-efficiency AI inference.