Edge AI Acceleration: Leading Specialized Processors for 2024
AI and machine‑learning workloads are increasingly moving to the edge, and a growing array of processors is emerging to meet the diverse demands of these applications. From established giants to agile startups, each solution targets specific verticals, power envelopes, and price points. Below is a concise overview of the current market leaders.
Application Processors
Intel Movidius Myriad X
Originally developed by the Irish startup Movidius and acquired by Intel in 2016, the Myriad X is a third‑generation vision‑processing unit that introduced a dedicated neural‑network compute engine. It delivers 1 TOPS of deep‑neural‑network (DNN) compute, supports FP16 and INT8 precision, and is tightly coupled to an intelligent memory fabric to eliminate data‑transfer bottlenecks. The chip hosts 16 proprietary SHAVE cores and an enhanced suite of vision accelerators. It is available in Intel’s Neural Compute Stick 2, a plug‑and‑play USB evaluation platform that enables rapid deployment of AI and computer‑vision workloads on any workstation.
NXP Semiconductors i.MX 8M Plus
The i.MX 8M Plus is a heterogeneous application processor featuring a dedicated neural‑network accelerator from VeriSilicon (Vivante VIP8000). It offers 2.3 TOPS for inference, enough for tasks such as multi‑object identification and 40,000‑word speech recognition, and can run MobileNet v1 at 500 images/second. Complementing the accelerator, the chip houses a quad‑core Arm Cortex‑A53 subsystem at 2 GHz and a Cortex‑M7 real‑time subsystem. Vision support includes two high‑definition image signal processors that handle either a stereo pair or a single 12 MP camera, while an 800 MHz HiFi4 DSP handles voice pre‑ and post‑processing.
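The quoted MobileNet v1 throughput translates directly into per‑image latency; a quick sanity check in Python (the 500 images/second figure comes from the text, and the conversion is plain arithmetic, not a measured benchmark):

```python
# Convert the quoted MobileNet v1 throughput on the i.MX 8M Plus NPU
# into per-image latency.
throughput_ips = 500                      # images per second, from the text
latency_ms = 1000 / throughput_ips        # milliseconds per image
print(f"{latency_ms:.1f} ms per image")   # → 2.0 ms per image
```

At roughly 2 ms per frame, the accelerator leaves comfortable headroom for real‑time video pipelines running at 30 or 60 fps.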
NXP’s i.MX 8M Plus is the company’s first application processor with a dedicated neural‑network accelerator, tailored for IoT scenarios. (Image: NXP Semiconductors)
XMOS xcore.ai
The xcore.ai blends application‑processor performance with microcontroller‑level real‑time efficiency, targeting voice‑controlled AIoT devices. Built on XMOS’s proprietary Xcore architecture, the chip comprises 16 logical cores that can be allocated to I/O, DSP, control, or AI functions—effectively creating a software‑defined SoC. It supports 32‑bit, 16‑bit, 8‑bit, and 1‑bit (binarized) neural networks, delivering 3,200 MIPS, 51.2 GMACCs, and 1,600 MFLOPS. With 1 MB embedded SRAM and a low‑power DDR interface, it is well‑suited for compact, low‑latency voice‑processing workloads.
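One reason 1‑bit (binarized) networks suit such hardware is that a dot product of ±1 vectors collapses to an XOR followed by a population count, with no multiplies at all. A minimal Python illustration of the idea (this is not XMOS code; the function and packing convention are my own for demonstration):

```python
def binarized_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two n-element vectors of +1/-1 values,
    each packed as a bitmask (bit set = +1, bit clear = -1).
    Matching bits contribute +1 and mismatches -1, so the
    result equals n - 2 * popcount(a XOR b)."""
    mismatches = bin(a_bits ^ b_bits).count("1")
    return n - 2 * mismatches

# a = [+1, -1, +1, +1] -> 0b1011; b = [+1, +1, -1, +1] -> 0b1101
print(binarized_dot(0b1011, 0b1101, 4))  # two mismatches -> 4 - 4 = 0
```

Because one machine word holds 32 or more weights, a single XOR-plus-popcount replaces dozens of multiply‑accumulates, which is why binarized layers are so cheap in silicon.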
XMOS’s xcore.ai is a proprietary, versatile platform engineered for voice‑centric AI workloads. (Image: XMOS)
Automotive SoC
Texas Instruments Inc. TDA4VM
Part of the Jacinto 7 series, the TDA4VM is TI’s inaugural SoC with an integrated deep‑learning accelerator. The accelerator builds on the C7x DSP core and a custom matrix‑multiply accelerator (MMA), achieving up to 8 TOPS. Designed for ADAS, it can process an 8 MP front‑camera stream or multiple 3 MP cameras plus radar, LiDAR, and ultrasonic inputs—enabling sensor fusion for autonomous parking and other advanced driver‑assist functions. Targeted at 5–20 W power budgets, the TDA4VM is available as a pre‑production development kit.
TI’s TDA4VM powers complex automotive ADAS systems, delivering perception and sensor fusion on a single chip. (Image: Texas Instruments Inc.)
GPU
Nvidia Corp. Jetson Nano
Nvidia’s Jetson Nano brings a 128‑core Maxwell GPU and a quad‑core Arm Cortex‑A57 CPU to edge devices. Capable of 0.5 TFLOPS (FP16), the module can run multiple neural networks on high‑resolution image streams while consuming only 5 W. It supports CUDA‑X, Nvidia’s suite of GPU‑accelerated libraries, and is available in cost‑effective development kits.
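The 0.5 TFLOPS figure gives a rough upper bound on model throughput. A back‑of‑the‑envelope estimate (the ~4 GFLOPs per ResNet‑50 inference is a commonly cited figure, assumed here, not taken from the text):

```python
peak_flops = 0.5e12    # Jetson Nano peak compute, from the text
model_flops = 4e9      # ~4 GFLOPs per ResNet-50 inference (assumed, commonly cited)
ideal_fps = peak_flops / model_flops
print(ideal_fps)       # 125.0 frames/second, ideal-case ceiling
```

Real throughput lands well below this ceiling, since memory bandwidth and GPU utilization, not just raw FLOPS, limit sustained inference rates.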
Nvidia’s Jetson Nano delivers a powerful GPU for AI at the edge, in a compact, low‑power form factor. (Image: Nvidia Corp.)
Consumer Co‑processors
Kneron Inc. KL520
The KL520 is a dedicated neural‑network processor aimed at image processing and facial recognition for smart homes, security systems, and mobile devices. Optimized for convolutional neural networks, it delivers 0.3 TOPS at 0.5 W (≈0.6 TOPS/W) with over 90 % MAC efficiency. Its reconfigurable architecture and a compiler that applies compression enable larger models to fit within the chip’s resources, reducing power and cost. The KL520 is available in the AAEON M2AI‑2280‑520 accelerator card.
Kneron’s KL520 combines a reconfigurable design with compression techniques for efficient mobile image processing. (Image: Kneron Inc.)
Gyrfalcon Lightspeeur 5801
Targeted at consumer electronics, the Lightspeeur 5801 achieves 2.8 TOPS at 224 mW (≈12.6 TOPS/W) with 4 ms latency, thanks to a processor‑in‑memory architecture that offers excellent power efficiency. Its 10 MB on‑chip memory accommodates complete models, and clock speed can be tuned between 50 MHz and 200 MHz to trade off power against performance. The 5801 powers LG’s Q70 mid‑range smartphone, handling inference for camera effects, and is available as a USB thumb‑drive development kit.
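The efficiency figures quoted for these two consumer co‑processors follow directly from their TOPS and power numbers; a quick check using the values from the text (vendors round slightly differently, hence the ≈12.6 TOPS/W quoted for the 5801):

```python
def tops_per_watt(tops: float, watts: float) -> float:
    """Compute efficiency from throughput (TOPS) and power (W)."""
    return tops / watts

print(tops_per_watt(0.3, 0.5))    # Kneron KL520: 0.6 TOPS/W
print(tops_per_watt(2.8, 0.224))  # Gyrfalcon 5801: ~12.5 TOPS/W
```

The roughly 20× gap in efficiency reflects the 5801’s processor‑in‑memory design, which avoids most of the data movement that dominates power in conventional accelerators.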
Ultra‑Low‑Power
Eta Compute ECM3532
The ECM3532 is tailored for battery‑powered or energy‑harvesting IoT devices that require always‑on inference. It pairs an Arm Cortex‑M3 microcontroller with an NXP CoolFlux DSP, and uses a proprietary voltage/frequency scaling technique that optimizes power on a per‑clock‑cycle basis. Machine‑learning workloads—particularly voice tasks—can be run on either core. Samples are available now, with mass production slated for Q2 2020.
Syntiant Corp. NDP100
The NDP100 is a processor‑in‑memory device designed for voice‑command inference under stringent power budgets. Consuming less than 140 µW of active power, it supports keyword spotting, wake‑word detection, speaker identification, and event classification. It is destined for hands‑free consumer devices such as earbuds, hearing aids, smartwatches, and remote controls, and development kits are available.
Syntiant’s NDP100 delivers ultra‑low‑power voice processing for consumer IoT. (Image: Syntiant Corp.)
GreenWaves Technologies GAP9
The GAP9 is an ultra‑low‑power application processor featuring nine RISC‑V cores with a custom instruction set tuned for power efficiency. It includes bidirectional multi‑channel audio interfaces and 1.6 MB internal RAM. GAP9 can run MobileNet V1 on 160 × 160 images in 12 ms at a reported 806 µW per frame per second, making it ideal for battery‑powered image, audio, and vibration sensing applications.