Analog In‑Memory Computing: Power‑Efficient Edge AI Inference

Machine learning and deep learning are now woven into everyday technology. From natural‑language assistants to real‑time object detection, AI drives a growing number of consumer and industrial devices. Yet most of these applications rely on cloud‑based inference engines—an architecture that raises concerns about privacy, power consumption, latency and cost.

Running inference locally on the edge device avoids most of these problems. The main hurdle is power: in conventional digital neural‑network accelerators, memory access dominates the energy budget. Fetching weights and activations from DRAM or SRAM costs two to three orders of magnitude more energy than the arithmetic itself. Analog in‑memory computing tackles this bottleneck by performing multiply‑accumulate (MAC) operations directly inside a non‑volatile memory array.

Cloud‑Based AI: The Pain Points

When an edge device forwards raw data to a remote server, several issues arise:

Privacy & security – Personal or confidential information is transmitted and stored in data centers.
Unnecessary power use – Every bit sent over the radio, stored in the cloud and re‑retrieved burns energy.
Latency for small‑batch inference – Responses can take more than 100 ms, which is perceptible to users.
Data‑economy inefficiency – Sensors generate massive streams; uploading everything is neither economical nor sustainable.

To overcome these challenges, a model is first trained on a high‑performance platform (cloud, GPU farm, etc.). The resulting weights are then mapped to a lightweight inference engine that performs the computation locally.

The MAC‑Heavy Nature of Neural Networks

Even lightweight models are dominated by MACs. For example, MobileNet‑v1 with a 224×224 input contains 4.2 million parameters and requires 569 million MACs per inference. Nearly all of these MACs come from dense matrix‑vector products, in which each output is a sum of input values multiplied by stored weights.
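As a rough illustration of where such counts come from, the sketch below tallies MACs per layer. The function names are hypothetical, and the layer shapes are assumptions chosen to resemble the early layers of MobileNet‑v1 (a 3×3 stride‑2 convolution producing a 112×112×32 output, followed by a depthwise‑separable block). It does not attempt to reproduce the full 569 million total.

```python
def dense_macs(n_inputs: int, n_outputs: int) -> int:
    """A fully connected layer needs one MAC per weight."""
    return n_inputs * n_outputs


def conv_macs(out_h: int, out_w: int, k: int, c_in: int, c_out: int) -> int:
    """Standard k x k convolution: every output value sums k*k*c_in products."""
    return out_h * out_w * k * k * c_in * c_out


def depthwise_separable_macs(out_h: int, out_w: int, k: int,
                             c_in: int, c_out: int) -> int:
    """Depthwise k x k filter per channel, then a 1 x 1 pointwise convolution."""
    depthwise = out_h * out_w * k * k * c_in
    pointwise = out_h * out_w * c_in * c_out
    return depthwise + pointwise


# Assumed shapes (illustrative only): 224x224x3 input, stride-2 first layer.
first_conv = conv_macs(out_h=112, out_w=112, k=3, c_in=3, c_out=32)
first_block = depthwise_separable_macs(out_h=112, out_w=112, k=3,
                                       c_in=32, c_out=64)

print(f"first convolution:       {first_conv:,} MACs")
print(f"first separable block:   {first_block:,} MACs")
print(f"1024 -> 1000 dense head: {dense_macs(1024, 1000):,} MACs")
```

Two layers alone already account for tens of millions of MACs, which is why the per‑MAC energy cost matters so much.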

Why Digital Computing Struggles

In a conventional accelerator, weights and activations are stored in DRAM/SRAM and must be fetched into an arithmetic logic unit (ALU). Figure 3 illustrates the memory bottleneck: the energy spent moving data (50–100 pJ per transfer) is roughly 200 to 400 times the 250 fJ spent on the MAC operation itself. Even with aggressive data‑reuse techniques, the von Neumann architecture imposes a hard limit on power efficiency.
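The gap can be put into numbers with a short back‑of‑the‑envelope calculation. The sketch below uses only the figures quoted in this article. The assumption that every MAC triggers one memory transfer (no data reuse at all) is a deliberately pessimistic simplification, and the variable names are illustrative.

```python
MACS_PER_INFERENCE = 569_000_000   # MobileNet-v1 figure quoted earlier
E_MAC_J = 250e-15                  # 250 fJ per MAC
E_MOVE_LOW_J = 50e-12              # 50 pJ per transfer
E_MOVE_HIGH_J = 100e-12            # 100 pJ per transfer

# How many times more energy one transfer costs than one MAC.
ratio_low = E_MOVE_LOW_J / E_MAC_J
ratio_high = E_MOVE_HIGH_J / E_MAC_J

# Energy per inference, ASSUMING one transfer per MAC (no reuse).
compute_energy_j = MACS_PER_INFERENCE * E_MAC_J
move_energy_low_j = MACS_PER_INFERENCE * E_MOVE_LOW_J
move_energy_high_j = MACS_PER_INFERENCE * E_MOVE_HIGH_J

print(f"transfer / MAC energy ratio: {ratio_low:.0f}x to {ratio_high:.0f}x")
print(f"arithmetic only:  {compute_energy_j * 1e6:.2f} uJ per inference")
print(f"data movement:    {move_energy_low_j * 1e3:.2f} to "
      f"{move_energy_high_j * 1e3:.2f} mJ per inference")
```

Data reuse narrows the gap in practice, but under these assumptions the arithmetic accounts for well under one percent of the total.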

Analog In‑Memory Computing to the Rescue

By embedding computation into the memory array itself, the need to shuttle weights disappears. The remaining data transfer is limited to the input signal, which is typically a sensor output. Flash cells operating in the sub‑threshold regime consume sub‑nanoamp currents, making active power negligible. In standby, the non‑volatile cells retain data without any power draw.

The memBrain™ platform from Silicon Storage Technology (a Microchip company) exemplifies this approach. Built on the SuperFlash® (ESF3) memory technology, each bitcell can be programmed to a precise threshold voltage (V_t), establishing a conductance that represents a weight. When an input voltage is applied, the current through the cell equals the product of the input and the stored weight.

Multi‑Level Memory Architecture

Figure 4 shows the cross‑section of an ESF3 bitcell. The cell has five terminals: Control Gate (CG), Word Line (WL), Erase Gate (EG), Source Line (SL), and Bitline (BL). The cell is erased by applying a high voltage to the EG. It is programmed by applying high‑ and low‑voltage bias signals to the WL, CG, BL, and SL, and read by applying low‑voltage bias signals to those same terminals.

Fine‑grained programming tunes the floating‑gate V_t to produce a specific I‑V curve. Figure 5 demonstrates a 2‑bit example where four distinct V_t levels yield four separate current responses when a fixed CG voltage is applied.
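The idea of storing a weight as one of a few discrete levels can be sketched in software. The code below is a simplified model, not the memBrain programming algorithm: it assumes evenly spaced levels across a weight range, and the function names and the [-1, 1] range are hypothetical.

```python
def weight_to_level(weight: float, w_min: float, w_max: float,
                    bits: int = 2) -> int:
    """Pick the nearest of 2**bits evenly spaced levels for a weight."""
    n_levels = 2 ** bits
    clamped = min(max(weight, w_min), w_max)     # saturate out-of-range values
    step = (w_max - w_min) / (n_levels - 1)
    return round((clamped - w_min) / step)


def level_to_weight(level: int, w_min: float, w_max: float,
                    bits: int = 2) -> float:
    """Weight value that a stored level represents when read back."""
    step = (w_max - w_min) / (2 ** bits - 1)
    return w_min + level * step


# 2-bit example: four levels spanning an assumed weight range of [-1, 1].
for w in (-1.0, -0.4, 0.3, 1.0):
    level = weight_to_level(w, -1.0, 1.0)
    stored = level_to_weight(level, -1.0, 1.0)
    print(f"weight {w:+.2f} -> level {level} -> stored {stored:+.3f}")
```

Each level would correspond to one programmed V_t and therefore to one of the current responses in the 2‑bit example. More bits per cell shrink the step between levels and reduce the rounding error.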

MAC in the Memory Array

Each cell’s conductance g_m embodies a trained weight. The current I_out = g_m · V_in implements the multiply step. By wiring many cells in parallel, the column currents naturally sum, yielding the accumulate operation. Figure 6 visualises a 2×2 array performing two MACs and one accumulation.
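An idealised software model makes the multiply and accumulate steps explicit. The sketch below assumes perfectly linear cells and arbitrary units, and `column_currents` is a hypothetical helper name. Each cell contributes conductance times input voltage, and the currents of all cells on a column add up.

```python
def column_currents(conductances, voltages):
    """Idealised array: conductances[row][col], one input voltage per row.

    Each cell contributes I = g * V (the multiply); the currents of all
    cells sharing a column add together (the accumulate).
    """
    n_rows = len(voltages)
    n_cols = len(conductances[0])
    return [
        sum(conductances[r][c] * voltages[r] for r in range(n_rows))
        for c in range(n_cols)
    ]


# 2x2 example in arbitrary units.
g = [[1.0, 2.0],
     [3.0, 4.0]]
v_in = [0.5, 0.25]

currents = column_currents(g, v_in)
print(currents)  # [1.25, 2.0]
```

In hardware, all of these products and sums are formed simultaneously as currents, rather than by stepping through loops.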

On a larger scale, a weight matrix can be mapped onto a single memory array. Figure 7 shows the matrix layout for a fully connected layer.

The inference flow is straightforward: sensor data is digitised, converted to an analog voltage by a DAC, fed into the array, and the resulting column currents are digitised by an ADC for subsequent activation and pooling stages.
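The same flow can be traced end to end in a small behavioural model. Everything numeric here is an assumption made for illustration: 8‑bit converters, a 1.0 reference voltage, a 1.0 full‑scale current, ideal linear cells, and a digital zero point subtracted before a ReLU activation. None of these values come from the memBrain specification.

```python
def dac(code: int, bits: int, v_ref: float) -> float:
    """Digital code -> analog input voltage."""
    return v_ref * code / (2 ** bits - 1)


def adc(current: float, bits: int, i_full_scale: float) -> int:
    """Analog column current -> digital code, saturating at the rails."""
    clamped = min(max(current, 0.0), i_full_scale)
    return int(clamped / i_full_scale * (2 ** bits - 1) + 0.5)


def infer_layer(input_codes, conductances, bits=8, v_ref=1.0,
                i_full_scale=1.0, zero_point=0):
    """Digitised inputs -> DAC -> array MAC -> ADC -> ReLU activation."""
    voltages = [dac(code, bits, v_ref) for code in input_codes]
    n_cols = len(conductances[0])
    currents = [
        sum(conductances[r][c] * voltages[r] for r in range(len(voltages)))
        for c in range(n_cols)
    ]
    codes = [adc(i, bits, i_full_scale) for i in currents]
    return [max(0, code - zero_point) for code in codes]  # digital ReLU


g = [[0.5, 0.25],
     [0.25, 0.5]]
outputs = infer_layer([255, 0], g, zero_point=100)
print(outputs)  # [28, 0]
```

Only the input codes and the output codes cross the array boundary; the weights never leave the memory cells.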

Modularity and Scalability

memBrain tiles are designed to be stacked. Figure 8 shows a 3×4 configuration where multiple tiles share a common analog‑digital interconnect. Data moves across tiles via a shared bus, enabling the construction of deep networks without a single monolithic die.
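One way to picture the tiled arrangement is as a chain in which each tile holds the weights of one layer and hands its outputs to the next. The sketch below is a toy model under that assumption. The `Tile` class, the one‑layer‑per‑tile mapping, and the ReLU after each tile are hypothetical and do not describe the actual memBrain interconnect.

```python
class Tile:
    """Toy model of one array tile holding one layer's weights."""

    def __init__(self, weights):
        self.weights = weights            # weights[row][col]

    def run(self, inputs):
        n_cols = len(self.weights[0])
        sums = [
            sum(self.weights[r][c] * inputs[r] for r in range(len(inputs)))
            for c in range(n_cols)
        ]
        return [max(0.0, s) for s in sums]   # ReLU applied after each tile


def run_network(tiles, inputs):
    """Pass activations from tile to tile, standing in for the shared bus."""
    data = inputs
    for tile in tiles:
        data = tile.run(data)
    return data


tiles = [
    Tile([[1.0, -1.0],
          [1.0, 1.0]]),      # layer 1: 2 inputs -> 2 outputs
    Tile([[2.0],
          [1.0]]),           # layer 2: 2 inputs -> 1 output
]
result = run_network(tiles, [1.0, 2.0])
print(result)  # [7.0]
```

Adding depth then amounts to appending tiles, while each tile's weights stay where they were programmed.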

Software Development Kit (SDK)

The accompanying SDK is framework‑agnostic. Whether you train in TensorFlow, PyTorch, or another library, the SDK handles model quantisation, mapping, and deployment to the memBrain hardware. Figure 9 outlines the workflow from training to inference.

Key Advantages

Ultra‑low power – In‑memory compute removes the need to move weights, the flash cells operate in the sub‑threshold regime, and the non‑volatile array draws no standby power. The architecture also benefits from sparsity: a zero weight or a zero input leaves the corresponding cell idle.
Compact footprint – The 1.5‑transistor split‑gate cell stores a 4‑bit value, whereas storing the same 4 bits in 6‑transistor SRAM cells takes 24 transistors, 16 times as many. The result is a dramatically smaller die area.
Reduced development cost – By leveraging mature flash processes and avoiding the need for ultra‑dense geometries, mask set costs and lead times are significantly lowered compared to ASICs designed for edge AI.

Edge AI promises transformative applications, but power and cost remain limiting factors. Analog in‑memory computing, backed by proven multi‑level flash technology, offers a practical, low‑cost pathway to bring intelligent inference right to the sensor.
