Article

Programmable logic provides the key to efficient edge AI

Nishant Nishant
Engineers face a design choice between FPGAs and GPUs when developing embedded AI applications

Artificial intelligence (AI) brings more advanced capabilities to embedded and edge-computing systems. But running the neural networks on which many of these systems rely is computationally intensive. High-end graphics processing units (GPUs) are the go-to solution, but they typically carry a high average selling price and are power-hungry. Field programmable gate arrays (FPGAs) offer an attractive alternative.

How GPUs work in AI applications

GPUs can deliver high throughput for the matrix-matrix and matrix-vector operations on which deep learning relies. However, they carry overheads, such as the instruction bandwidth consumed by the execution units that process each kernel.

Repeated memory accesses to fetch instructions, even when they are mostly served from on-chip cache, can be avoided by hardwiring the operations. In general, GPUs are better optimized for training than for the inference performed at runtime.

 

Figure: Routing between processing units using a systolic array

In hardware, a systolic array provides an efficient way to route data between processing engines, reducing the overall number of main-memory accesses. (Source: Wikimedia Commons)

How FPGAs can be more efficient for AI processing

Programmable logic provides the ability to fine-tune these matrix and vector operations and pipeline them in more efficient ways. The combination of AI and DSP cores with general-purpose programmable logic now available on many FPGAs allows for the construction of high-throughput architectures such as systolic arrays.

These architectures support the rapid flow of data through the many layers of a deep neural network with far greater efficiency than a fixed-layout GPU or microprocessor.
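To make the dataflow concrete, the Python sketch below simulates an output-stationary systolic array cycle by cycle: operands enter skewed at the array edges, are forwarded from processing element to processing element, and each element accumulates its own output locally. The function name and array organization are illustrative only, not a description of any particular FPGA fabric.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-by-cycle model of an output-stationary systolic array.

    A is (M, K) and B is (K, N). One processing element (PE) per output;
    operands are forwarded PE-to-PE each cycle rather than re-read from memory.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    acc = np.zeros((M, N), dtype=A.dtype)    # per-PE local accumulators
    a_reg = np.zeros((M, N), dtype=A.dtype)  # operands flowing left-to-right
    b_reg = np.zeros((M, N), dtype=B.dtype)  # operands flowing top-to-bottom
    for t in range(M + N + K - 2):           # cycles for the skewed wavefront to drain
        # Register-to-register forwarding: one step right and one step down.
        a_reg[:, 1:] = a_reg[:, :-1].copy()
        b_reg[1:, :] = b_reg[:-1, :].copy()
        # Inject skewed inputs at the edges so A[i, k] and B[k, j] meet in PE(i, j).
        for i in range(M):
            k = t - i
            a_reg[i, 0] = A[i, k] if 0 <= k < K else 0
        for j in range(N):
            k = t - j
            b_reg[0, j] = B[k, j] if 0 <= k < K else 0
        # Every PE multiplies its current operand pair and accumulates locally.
        acc += a_reg * b_reg
    return acc

A = np.random.randint(-3, 4, size=(4, 6))
B = np.random.randint(-3, 4, size=(6, 5))
assert np.array_equal(systolic_matmul(A, B), A @ B)
```

In an FPGA, the forwarding registers and accumulators map onto DSP slices and fabric registers, which is what removes the repeated memory fetches modeled here.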

FPGAs that are dynamically programmable make it possible to build systems that load layer-execution logic on demand. That allows the use of advanced techniques that complement layer fusion and other optimizations that are used in neural networks designed for edge applications, such as ResNet and MobileNet.

Hardwired logic can easily forward the results of one operation to the next without using memory buffers. Implemented in FPGAs, this forwarding reduces power-hungry off-chip memory accesses and overall latency.

Because loading a new logic configuration takes time, dynamic reconfiguration may be restricted to applications that are less sensitive to latency. The embedded device can still support different processing modes, but only if the application can tolerate the time needed to reconfigure the programmable fabric.

Comparing FPGAs and GPUs for Edge AI applications

| Design aspect | FPGA pros | FPGA cons | GPU pros | GPU cons |
| --- | --- | --- | --- | --- |
| Matrix operations | Efficient pipelining and fine-tuning | Requires custom design | High throughput | High power consumption |
| Power efficiency | Low power by reducing off-chip memory access | Complex power-rail management | Prioritizes performance over power | High power consumption |
| Cost | Generally lower than high-end GPUs | More advanced devices carry a price premium over other FPGAs | Prioritizes performance over cost | Large-scale deployments are expensive |
| Architectural flexibility | Highly flexible architecture | High design complexity | Extremely parallel | No hardware flexibility, but software-defined |
| Sparsity and pruning | Good support for fine-grained structured sparsity (FSS) | Requires expert hardware design | Simpler, software-configured approach | SIMD/SIMT architecture is less efficient with sparse data |
| Power management | Exceptionally efficient by design | Gains require efficient logic design | Improving with every generation | High power consumption |
| Security | Customizable by design | Bitstreams could be intercepted | Mature software layers | Dependent on APIs and drivers |
| Memory efficiency | Low-latency pipelines | Limited on-chip capacity | High bandwidth | Limited flexibility |

The relative strengths and weaknesses of FPGAs and GPUs indicate that FPGAs are better suited for always-on edge nodes than GPUs, but GPUs offer a potentially simpler design path if the system resources can support them.

Understanding the role of acceleration in GPUs and FPGAs

There are techniques common to both microprocessor and FPGA implementations that improve performance. However, programmable logic provides engineering teams greater degrees of freedom to exploit them.

Quantization can deliver big improvements in operations per watt as well as in silicon-area efficiency. It involves converting a model trained using floating-point arithmetic to a version that uses integer weights; newer microscaling formats extend the idea by sharing a scaling factor across small blocks of values.

Floating-point numbers typically occupy 32-bit words, compared with just eight bits for an integer (INT8) representation. Many models offer near-equivalent accuracy when using INT8 formats, and some layers run almost as well with 4-bit, ternary or even binary weights.
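As a minimal illustration, assuming symmetric, per-tensor post-training quantization (one of several possible schemes), the Python sketch below converts FP32 weights to INT8 plus a single scale factor and reports the resulting approximation error and memory saving.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantization: map the largest magnitude to 127."""
    scale = max(np.abs(weights).max(), 1e-12) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate real values as q * scale."""
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)   # stand-in for a trained layer
q, s = quantize_int8(w)
err = np.abs(w - dequantize(q, s)).max()
print(f"max error: {err:.5f}, memory: {w.nbytes} B -> {q.nbytes} B (4x smaller)")
```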

A decade ago, engineers at Stanford University developed several techniques that accelerate deep neural networks, documenting the impact each had on overall performance. The experiments showed how the gains depend on various architectural factors, from memory organization to the ability to skip unnecessary operations dynamically.

Figure: Pruning a model involves removing zero-effect operations

Pruning removes connections between neurons that have no impact on the output, reducing the number of matrix-arithmetic operations needed. (Source: Google)

The greatest benefits were shown to come from distilling a model to the point that almost all data accesses can be supported using local SRAM rather than off-chip DRAM. Compressing the model to execute from SRAM can result in a 100-fold saving in energy.

The work was primarily aimed at microprocessor-like pipelines rather than FPGAs, but the distributed SRAM arrays in FPGAs provide further scope for locality-based optimization when models are compressed to use on-chip resources as much as possible.
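A back-of-the-envelope calculation shows where the 100-fold figure comes from. The per-access energies below are the approximate 45 nm values cited in that line of research (roughly 640 pJ for a 32-bit off-chip DRAM access versus about 5 pJ for on-chip SRAM); the model size and SRAM hit fraction are made-up inputs for illustration.

```python
# Approximate 45 nm per-access energies cited in the Stanford compression work.
DRAM_PJ = 640.0   # one 32-bit off-chip DRAM access
SRAM_PJ = 5.0     # one 32-bit on-chip SRAM access

def weight_fetch_energy_uj(num_reads, frac_from_sram):
    """Estimated energy in microjoules for weight fetches during one inference."""
    pj = num_reads * (frac_from_sram * SRAM_PJ + (1.0 - frac_from_sram) * DRAM_PJ)
    return pj * 1e-6

# A hypothetical 5M-weight model: all fetches from DRAM vs. fully SRAM-resident.
print(weight_fetch_energy_uj(5_000_000, 0.0))   # ~3200 uJ per inference
print(weight_fetch_energy_uj(5_000_000, 1.0))   # ~25 uJ, roughly 100x less
```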

The Stanford research found the next biggest benefit came from exploiting sparsity. Often, neural weights are small enough to have little or no impact on the output decision, so computing with them is wasted effort during inference. These negligible weights can be set to zero and skipped, a process known as pruning. The result is a sparser network that, in principle, needs far fewer arithmetic operations: skipping operations with zero-value weights can reduce the processing required roughly tenfold, and avoiding the lookup of activation functions that produce zero output yields further savings.
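A minimal sketch of magnitude-based pruning, the usual starting point, is shown below; real flows typically retrain after pruning to recover accuracy, and the threshold value here is arbitrary.

```python
import numpy as np

def prune_by_magnitude(weights, threshold):
    """Zero out weights whose magnitude falls below the threshold.

    Returns the pruned tensor and the fraction of multiply-accumulates that
    zero-skipping hardware could now avoid.
    """
    keep = np.abs(weights) >= threshold
    pruned = np.where(keep, weights, 0.0).astype(weights.dtype)
    return pruned, 1.0 - keep.mean()

w = np.random.randn(512, 512).astype(np.float32)   # stand-in for a trained layer
pruned, skipped = prune_by_magnitude(w, threshold=1.0)
print(f"{skipped:.0%} of weight MACs can be skipped")
```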

However, the memory controllers in GPUs are usually optimized for dense matrix processing. In this scenario, it is often faster to fetch all the weights in a single wide fetch operation than to split accesses into smaller operations scattered through the memory map.

Using FPGAs provides advantages over microprocessor and GPU implementations wherever sparsity offers a benefit. Programmable logic lends itself to the creation of custom address-generation units for non-contiguous data accesses, which can schedule and pipeline the scatter-gather operations needed to reach sparse neural-network weights and data values.

The result is a process of densification: on each clock cycle the matrix and vector units receive a full bank of data to operate on. Similarly, programmable logic provides much greater flexibility in implementing activation functions and in using zero-skipping to avoid processing neurons that will have little effect on the output.
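The sketch below is a software analogue of that scatter-gather addressing: a pruned weight matrix is packed into compressed sparse row (CSR) form so only surviving weights are stored and fetched, and the gather of input values mirrors what a custom address-generation unit would pipeline in fabric. The function names are illustrative.

```python
import numpy as np

def to_csr(dense):
    """Pack a pruned weight matrix into CSR arrays: values, column indices, row pointers."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        nz = np.nonzero(row)[0]
        values.extend(row[nz])
        col_idx.extend(nz)
        row_ptr.append(len(values))
    return np.array(values), np.array(col_idx), np.array(row_ptr)

def sparse_matvec(values, col_idx, row_ptr, x):
    """Multiply only the surviving weights; zero entries are never fetched."""
    y = np.zeros(len(row_ptr) - 1, dtype=x.dtype)
    for i in range(len(y)):
        lo, hi = row_ptr[i], row_ptr[i + 1]
        # The gather x[col_idx[lo:hi]] is the non-contiguous access pattern an
        # FPGA address-generation unit would schedule in hardware.
        y[i] = np.dot(values[lo:hi], x[col_idx[lo:hi]])
    return y

w = np.random.randn(64, 64)
w[np.abs(w) < 1.0] = 0.0          # a pruned (sparse) weight matrix
x = np.random.randn(64)
vals, cols, ptrs = to_csr(w)
assert np.allclose(sparse_matvec(vals, cols, ptrs, x), w @ x)
```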
 

Challenges and limitations in using FPGAs for AI functions

There are challenges associated with implementing AI functions on an FPGA platform. The power management of SRAM-based FPGAs can be complex. Devices that incorporate microprocessor cores, AI cores and programmable logic will often use multiple power rails so that each can be supplied independently.

All of these rails need to be managed by a hardware controller that keeps track of system state. It must ensure that power is removed and restored in a guaranteed order to prevent errors or vital parts of the device losing their configuration state. Though this increases design complexity and the number of off-chip power management integrated circuits (PMICs), it provides opportunities for fine-grained energy optimization. The key question is how many rails can be grouped and controlled together versus the flexibility gained from power-gating parts of the FPGA at a fine-grained level.
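As a sketch of the controller logic only (the rail names, ordering and settle times below are hypothetical; the real requirements come from the device datasheet), a sequencer brings rails up in a fixed order and removes them in strict reverse order so the configuration rail is never left stranded while dependent loads are still active.

```python
import time

# Hypothetical rails for an SRAM-based FPGA with embedded processor and AI cores.
POWER_UP_ORDER = ["VDD_CORE", "VDD_SRAM_CONFIG", "VDD_AI", "VDD_IO"]

class PowerSequencer:
    """Minimal model of a hardware controller that tracks rail state and
    enforces a guaranteed power-up/power-down order."""

    def __init__(self, rails, enable_rail):
        self.rails = list(rails)
        self.enable_rail = enable_rail          # callback driving the PMIC(s)
        self.state = {rail: False for rail in self.rails}

    def power_up(self, settle_s=0.001):
        for rail in self.rails:                 # fixed order on the way up
            self.enable_rail(rail, True)
            time.sleep(settle_s)                # stand-in for polling a power-good signal
            self.state[rail] = True

    def power_down(self, settle_s=0.001):
        for rail in reversed(self.rails):       # strict reverse order on the way down
            self.enable_rail(rail, False)
            time.sleep(settle_s)
            self.state[rail] = False

seq = PowerSequencer(POWER_UP_ORDER, lambda rail, on: print(rail, "on" if on else "off"))
seq.power_up()
seq.power_down()
```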

Security presents further challenges. Attackers may attempt to alter the models or their weights to cause the system to respond to false events. Or they may simply try to steal the valuable intellectual property (IP) contained in the bitstream used to configure the FPGA, or in the processor firmware. FPGA manufacturers have introduced measures to protect IP and prevent tampering, including bitstream encryption.

The need to provide regular updates to systems in the field presents another attack vector. But there are measures designers can take to prevent over-the-air (OTA) updates from being altered by attackers before they are loaded into the device. By encrypting and signing the payloads, developers can implement secure- or measured-boot protocols in which the host processor will only load valid firmware into flash memory for later execution. The same signing protocols can be used to check new FPGA configuration bitstreams.
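A minimal sketch of that payload check follows, assuming an Ed25519 signature and the Python `cryptography` package; a real device would run an equivalent check in its boot ROM or secure element, and the FPGA vendor's secure-boot flow would dictate the actual algorithms.

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def verify_update(payload: bytes, signature: bytes, vendor_pubkey: bytes) -> bool:
    """Accept an OTA image only if its signature verifies against the vendor
    public key provisioned in the device's root of trust."""
    try:
        Ed25519PublicKey.from_public_bytes(vendor_pubkey).verify(signature, payload)
        return True
    except InvalidSignature:
        return False

# Only a verified image is written to flash; the same check gates a new
# FPGA configuration bitstream before it is applied.
# if verify_update(image, sig, pubkey):
#     write_image_to_flash(image)   # hypothetical helper
```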

A key requirement in the hardware itself is a root of trust. By assigning individually signed management data to each node, a root of trust eases control and protection for edge AI, IoT and similar system-of-systems deployments. Servers can check the identity and validity of each node as it joins the network and refuse access to devices that do not pass. This enables the scalable deployment of edge systems by automating much of the enrollment process.

Despite these challenges, the advantages that programmable logic brings to AI functions make FPGAs worth considering when sourcing high-performance, lower-power platforms for advanced AI applications.

 

About Author

Nishant Nishant
Avnet Staff

