IBM and Samsung both presented ML processors for phones, where local AI processing will remove the need for cloud participation, so long as the architectures are powerful and flexible enough to cope with the diverse workload – able to reconfigure to suit neural networks that vary in bit-width, layer count and the other dimensions available to them.
IBM’s is a 7nm four-core chip with 25.6Tflop/s available for ‘hybrid’ FP8 training, and 102.4Top/s for INT4 inferencing.
Hybrid 8bit floating point (HFP8) is a format invented at IBM (revealed in 2019) as a way of overcoming the limitations of the standard 8bit (1 sign, 5 exponent, 2 mantissa) FP8 floating-point format, which works well when training certain standard neural networks, but results in poor accuracy when training others. Hybrid FP8 uses 4 exponent and 3 mantissa bits for forward propagation, then 5 exponent and 2 mantissa bits for back propagation, significantly increasing training accuracy, according to the company.
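To make the trade-off between the two bit layouts concrete, here is a minimal Python sketch that decodes an 8bit pattern under either split. The exponent bias, the handling of subnormals and the absence of reserved infinity/NaN codes are IEEE-style assumptions made for illustration; the article does not describe IBM's actual encoding, and the function name is hypothetical.

```python
def decode_fp8(byte, exp_bits, man_bits):
    """Decode an 8-bit pattern as 1 sign bit, exp_bits exponent, man_bits mantissa.

    Assumptions (not from the article): bias is 2**(exp_bits - 1) - 1,
    exponent code 0 means subnormal, and no codes are reserved for inf/NaN.
    """
    assert 1 + exp_bits + man_bits == 8
    bias = (1 << (exp_bits - 1)) - 1
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    frac = man / (1 << man_bits)
    if exp == 0:
        # Subnormal: no implicit leading 1
        return sign * frac * 2.0 ** (1 - bias)
    return sign * (1.0 + frac) * 2.0 ** (exp - bias)


for name, e, m in (("forward 1-4-3", 4, 3), ("backward 1-5-2", 5, 2)):
    largest = decode_fp8(0x7F, e, m)   # all exponent and mantissa bits set
    step = 2.0 ** -m                   # spacing between mantissa values
    print(name, "largest value:", largest, "mantissa step:", step)
```

Under these assumptions the 4-exponent layout gives finer steps over a narrower range, while the 5-exponent layout gives a wider range with coarser steps, which is consistent with using one for forward and the other for back propagation.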
The four cores are linked by a pair of wide fast data rings, one for clockwise transfer and one for anti-clockwise transfer. These can be kept closed within the chip, or opened and routed through external memory or multiple identical chips to process larger networks. Rings and cores are asynchronous to allow different clock rates to trade power for performance separately.
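As a rough illustration of why a pair of counter-rotating rings helps, the sketch below counts hops between cores in each direction. The core numbering and the policy of picking the shorter ring are assumptions for illustration only; the article does not say how IBM's hardware chooses a ring.

```python
def ring_hops(src, dst, n_cores=4):
    """Hops from src to dst on the clockwise and anti-clockwise rings."""
    clockwise = (dst - src) % n_cores
    anticlockwise = (src - dst) % n_cores
    return clockwise, anticlockwise


def shorter_ring(src, dst, n_cores=4):
    """Hypothetical policy: use whichever ring needs fewer hops."""
    clockwise, anticlockwise = ring_hops(src, dst, n_cores)
    if clockwise <= anticlockwise:
        return "clockwise", clockwise
    return "anti-clockwise", anticlockwise


print(shorter_ring(0, 3))  # core 0 to core 3 on a four-core ring
```

With four cores, no transfer needs more than two hops if both directions are available, against three on a single one-way ring.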
Each core is divided into two sub-cores sharing a scratchpad memory, then each sub-core has an 8×8 array of engines optimised to accelerate convolution and matrix multiplication with separate pipelines for floating-point and fixed-point computation – together providing FP16, HFP8, INT4 and INT2 capability for both AI training and inference.
The 36mm2 chip was made using EUV lithography and achieved the above performance metrics with 0.75V on the core and 0.95V on the SRAM. Using network knowledge gleaned when a network is compiled, the chip can throttle power-hungry network layers to stay within a power budget. Nominal operation (0.55V core, 0.7V SRAM) yields 1GHz clocking and sustained 3.5Tflop/s/W FP8 and 16Top/s/W INT4.
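The idea of compile-time, workload-aware throttling can be sketched as follows. This is a hypothetical policy, not IBM's method: it assumes the compiler has a power estimate for each layer and that power scales linearly with the fraction of cycles a layer is allowed to issue work. The layer figures are made up.

```python
def throttle_plan(layer_power_w, budget_w):
    """Return a duty factor (0..1) per layer so no layer exceeds the budget.

    Hypothetical model: throttled power = estimated power * duty factor.
    """
    return [min(1.0, budget_w / power) for power in layer_power_w]


# Made-up per-layer power estimates in watts, as a compiler might produce
layers = [2.0, 6.0, 3.0, 8.0]
plan = throttle_plan(layers, budget_w=4.0)

for power, duty in zip(layers, plan):
    print(f"estimated {power:.1f} W -> duty {duty:.2f} -> {power * duty:.1f} W")
```

Only the layers whose estimate exceeds the budget are slowed, so lighter layers still run at full rate.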
Samsung’s mobile AI processor is smaller, at 5.46mm2, and uses a 5nm process to implement its three cores, which in total can execute 623 inferences/s.
Each core has two sub-cores (‘convolutional engines’) along with a vector processing unit and 1Mbyte of scratchpad. Each sub-core has weight, feature map and partial sum fetchers and an array of 1,024 MACs (so >6,000 MACs on the chip). It can execute 64 dot-products of 16-dimensional vectors per cycle. The scratchpad holds all weights, input feature maps, output feature maps and partial sums for a layer or, if the layer is too big to fit at once, a tile from that layer. The vector processing unit executes complex non-linear functions such as normalization and softmax.
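The MAC counts quoted above can be checked with a few lines of arithmetic. All figures come from the text; reading the 64 dot-products per cycle as a per-sub-core figure is an inference from the numbers, not something the article states.

```python
CORES = 3
SUB_CORES_PER_CORE = 2
MACS_PER_SUB_CORE = 1024

# Total multiply-accumulate units on the chip
total_macs = CORES * SUB_CORES_PER_CORE * MACS_PER_SUB_CORE

# 64 dot-products of 16-element vectors need 64 * 16 multiply-accumulates
DOT_PRODUCTS_PER_CYCLE = 64
VECTOR_LENGTH = 16
macs_per_cycle = DOT_PRODUCTS_PER_CYCLE * VECTOR_LENGTH

print("MACs on chip:", total_macs)
print("MACs used by 64 x 16-wide dot-products:", macs_per_cycle)
```

The first figure lands just above 6,000, matching the '6k-MAC' in the paper title, and the second equals the size of one sub-core's array.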
Unlike IBM’s ring busses, the cores in this case are connected by a more conventional bus that uses DMA (direct memory access).
To save wasted processing and therefore power, feature map zero-skipping is implemented. “The MAC utilisation on convolutional layers in Inception-V3 can be improved by 36% on average by feature map zero-skipping,” according to the ISSCC presentation. “Unlike weight zero-skipping, feature map zero-skipping enhances effective performance and energy efficiency without any additional training steps such as weight pruning.”
The chip runs from 550mV to 900mV, clocked at 332MHz to 1.2GHz. Power and performance were measured while running convolution, pooling and fully-connected layers of the 8bit Inception-V3 model without weight pruning. Overall inference throughput was 194 inferences/s at 332MHz and 623 inferences/s at 1.196GHz in throughput-priority mode, equivalent to multi-thread CPU operation. 1,190 inferences/J was measured at 0.6V, corresponding to 13.6Top/s/W for Inception-V3.
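The principle behind feature map zero-skipping can be shown in software, although the chip does it in hardware and the article gives no detail of its mechanism. The sketch below is an illustration only: it skips any multiply whose feature value is zero (as commonly produced by ReLU activations) and counts the multiply-accumulates actually performed. The function name and data are hypothetical.

```python
def dot_with_zero_skip(features, weights):
    """Dot product that skips zero-valued features.

    Returns (result, number of multiply-accumulates actually performed).
    """
    acc = 0
    macs = 0
    for f, w in zip(features, weights):
        if f == 0:
            continue  # a zero feature contributes nothing, so skip the MAC
        acc += f * w
        macs += 1
    return acc, macs


features = [0, 3, 0, 0, 2, 0, 1, 0]   # made-up sparse feature map values
weights = [5, -1, 4, 2, 3, 7, -2, 6]  # made-up weights, left unpruned

result, macs_used = dot_with_zero_skip(features, weights)
dense = sum(f * w for f, w in zip(features, weights))
print(result, dense, macs_used, "of", len(features), "MACs")
```

The result is identical to the dense dot product, and the weights are untouched, which is why no retraining or pruning step is needed.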
Per area, the Samsung chip gets 2.69Top/s/mm2 and 114 inferences/s/mm2.
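These efficiency figures can be cross-checked against each other. The sketch below uses only numbers quoted in the article; the derived values (operations per inference and total Top/s) are back-of-envelope implications, not figures stated by Samsung.

```python
AREA_MM2 = 5.46
PEAK_INFERENCES_PER_S = 623
INFERENCES_PER_J = 1190
TOPS_PER_W = 13.6        # 1 Top/s/W is 1e12 operations per joule
TOPS_PER_MM2 = 2.69

# Throughput density: inferences per second per square millimetre
density = PEAK_INFERENCES_PER_S / AREA_MM2

# Operations per joule divided by inferences per joule gives ops per inference
ops_per_inference = TOPS_PER_W * 1e12 / INFERENCES_PER_J

# Total throughput implied by the per-area figure
implied_tops = TOPS_PER_MM2 * AREA_MM2

print(f"{density:.1f} inferences/s/mm2")
print(f"{ops_per_inference / 1e9:.1f} Gop per inference")
print(f"{implied_tops:.1f} Top/s implied for the whole chip")
```

The first line reproduces the 114 inferences/s/mm2 quoted, and the implied workload of roughly 11 billion operations per inference is plausible for Inception-V3.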
On the tiny side, Nanyang Technological University from Singapore and Columbia University have been looking at micro-power artificial intelligence.
Nanyang presented a real-time hand gesture recognition system for wearable and IoT devices. It works by extracting edge data from VGA (640 x 480 x 8bit) images, then applying hybrid compact classifiers for static gesture recognition and an error-tolerant sequence analyser for dynamic gesture recognition.
The 1.5mm2 65nm chip can recognise 24 dynamic gestures with an average accuracy of 92.6%, all for 184μW at 0.6V.
Columbia’s processor is a 65nm always-on keyword spotter designed to work in the presence of background noise.
More conventional noise-independent training – training with a lot of different noise levels and types – would have resulted in a too-large neural network, so the team used a simpler biologically-inspired scheme called ‘divisive energy normalisation’.
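The article does not give Columbia's formula, but the general idea of divisive normalisation can be sketched: each feature is divided by a running estimate of recent energy, so the output depends on relative rather than absolute level. The leaky average, the `leak` and `eps` parameters and the function name below are all assumptions for illustration.

```python
import math


def divisive_normalise(energies, leak=0.9, eps=1e-9):
    """Divide each energy sample by a leaky running average of the input.

    Generic textbook-style sketch, not the circuit described in the paper.
    """
    avg = energies[0]
    out = []
    for e in energies:
        avg = leak * avg + (1.0 - leak) * e  # slow-moving energy estimate
        out.append(e / (eps + avg))          # divisive step
    return out


quiet = [1.0, 1.2, 4.0, 1.1, 0.9]    # made-up feature energies
loud = [10.0 * e for e in quiet]     # same pattern, ten times louder

same = all(
    math.isclose(a, b, rel_tol=1e-6)
    for a, b in zip(divisive_normalise(quiet), divisive_normalise(loud))
)
print("output unchanged by overall level:", same)
```

Because the normalised features barely change with overall level, a downstream classifier would not have to learn every loudness separately, which is one way such a scheme could keep the network small.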
Processing was spread out across a ‘normalised acoustic feature extractor’ chip that takes an acoustic signal from a microphone and produces spike-rate coded features (for 109nW), and a spiking neural network classifier chip.
For 570nW, the two-chip system achieves 89 – 94% accuracy across -5 to 20dB signal-to-noise ratios with four different noise types (HeySnips data set). Capacity-wise, accuracy was 96.5% when seeking one keyword, or 90.2% looking for four keywords.
ISSCC paper 9.1 A 7nm 4-core AI chip with 25.6Tflops hybrid FP8 training, 102.4Tops INT4 inference and workload-aware throttling
ISSCC paper 9.5 A 6k-MAC feature-map-sparsity-aware neural processing unit in 5nm flagship mobile SoC
ISSCC paper 9.7 A 184µW real-time hand-gesture recognition system with hybrid tiny classifiers for smart wearable devices
ISSCC paper 9.9 A background-noise and process-variation-tolerant 109nW acoustic feature extractor based on spike-domain divisive-energy normalization for an always-on keyword spotting device
To see what else was in session 9, and the rest of ISSCC 2021, consult the conference programme.