When did memory become so critical?
- How memory became a technical bottleneck
- Why low-level optimization is essential
- What processor architectures and memory controllers can mitigate
Even before AI arrived, the memory subsystem had become a primary driver of embedded design decisions. Memory now dictates cost and performance and, as recent events have shown, supply chain risk.
OEMs relying on LPDDR5 (Low-Power Double Data Rate 5) may have adopted it unwittingly, but they are now casualties of the current squeeze on high-performance, high-bandwidth memory. As processors became more capable and more dependent on software over recent years, this mobile-class DRAM (Dynamic Random-Access Memory) technology quietly became a requirement in many embedded systems, particularly those targeting edge AI.
The road that led us here is more nuanced than measured speed and capacity. Faster processors do need faster memory, but it takes more than larger off-chip DRAM to take embedded systems to the next level. There are many types of memory in play and while DDR (Double Data Rate) fits into the overall architecture, engineers have much more to consider.
What kind of memory hierarchy do modern embedded systems need?
Memory in embedded systems is not a single entity; it is a hierarchy of technologies combined in the most optimal way for the application. This dependency on the application belies the general-purpose nature of processors, and we’re also seeing this at the microcontroller level.
Optimization of this nature relates to speed, density, cost and persistence. It can be differentiated as volatile and non-volatile, but also falls into two broad categories: on-chip and off-chip memory. Looking at on-chip memory first, along with registers and pipelines, a processor will include Level 1 and Level 2 (possibly more) caches built using static random-access memory (SRAM), sitting between the processor’s core(s) and external memory. Cache holds most often/recently used instructions and data, reducing the latency associated with reading/writing to/from other types of memory.
Along with the cache, a processor may feature tightly coupled memory (TCM) and scratchpads, represented by smaller blocks of SRAM, mapped directly into the CPU’s address space. This memory provides deterministic access and is typically used for hard real-time routines.
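In practice, placing a routine in TCM usually means assigning it to a dedicated linker section. A minimal sketch, assuming a GCC-style toolchain; the section name `.tcm_code` is hypothetical and must match an entry in the project's linker script:

```c
#include <stdint.h>

/* Hypothetical: pin a hard real-time routine into tightly coupled
 * memory via a dedicated linker section. The section name ".tcm_code"
 * is an assumption and must match the project's linker script. */
__attribute__((section(".tcm_code")))
int32_t fir_step(const int32_t *coeff, const int32_t *hist, int taps)
{
    /* Small FIR accumulation: deterministic timing, no external
     * memory accesses once coeff/hist are resident in TCM. */
    int64_t acc = 0;
    for (int i = 0; i < taps; i++)
        acc += (int64_t)coeff[i] * hist[i];
    return (int32_t)(acc >> 15);
}
```

On a real target, the linker script would map `.tcm_code` to the TCM address range; on a host compiler the attribute is harmless, so the logic can be unit-tested anywhere.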
The capacity of this type of on-chip memory is small, in the kilobyte or low-megabyte range. However, it is the only memory in the architecture that has any chance of keeping up with the speed of the core(s), operating in a small number of clock cycles.
Non-volatile memory, now almost always flash, can be either on-chip or external. If external, the main interface types are parallel NOR, raw NAND and serial (SPI) NOR/NAND. The leading options here are NAND-based managed flash devices configured as either eMMC (embedded MultiMediaCard) or UFS (Universal Flash Storage), each using a JEDEC-defined interface standard.
This memory’s value proposition is persistent storage rather than performance. At power-up, a boot loader runs from on-chip ROM or flash to bring up the system’s clocks and peripherals. The next stage is to load the application code, including the operating system, into RAM. Due to the size of modern code bases, that almost always now involves external DRAM.
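The copy-to-RAM stage described above can be sketched in a few lines. On real hardware the source and destination would be linker-provided symbols (commonly named along the lines of `__data_load` and `__data_start`); here they are simulated with host arrays, an assumption made so the sketch runs anywhere:

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the boot stage that moves the application image from
 * flash into RAM. The arrays below stand in for linker symbols so
 * the logic is testable on a host. */
static const uint8_t flash_image[8] = {0xDE, 0xAD, 0xBE, 0xEF, 1, 2, 3, 4};
static uint8_t ram_data[8];
static uint8_t ram_bss[4] = {0xFF, 0xFF, 0xFF, 0xFF}; /* pretend stale RAM */

void boot_copy(void)
{
    /* Stage 1: copy initialized data from flash to its RAM address. */
    memcpy(ram_data, flash_image, sizeof ram_data);
    /* Stage 2: zero the .bss region before the application starts. */
    memset(ram_bss, 0, sizeof ram_bss);
}
```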
This is where DDR sits. External, or off-chip, DRAM holds instructions and data, including stacks, heaps, framebuffers, AI model weights, network buffers, logging and telemetry data. Embedded engineers are now dealing with memory subsystems, with technologies like LPDDR5 setting the overall performance envelope.
Table 1: Comparing DDR4 and DDR5
| Feature | DDR4 | DDR5 | DDR5 Advantage |
|---|---|---|---|
| Data rates | 1600 – 3200 MT/s | 3200 – 6400 MT/s | Increases bandwidth |
| VDD/VDDQ/VPP (V) | 1.2/1.2/2.5 | 1.1/1.1/1.8 | Lowers power |
| Internal VREF | VREFDQ | VREFDQ, VREFCA, VREFCS | Improves voltage margins, reduces BOM costs |
| Device densities | 2Gb – 16Gb | 8Gb – 64Gb | Larger monolithic memories |
| Prefetch | 8n (words per core cycle) | 16n (words per core cycle) | Keeps the internal core clock low |
| On-die ECC | None | 128b+8b SEC, error check and scrub | Strengthens on-chip RAS (reliability, availability and serviceability) |
| CRC | Write | Read/Write | Strengthens system RAS |
| Bank groups (BG)/banks | 4 BG x 4 banks (x4/x8); 2 BG x 4 banks (x16) | 8 BG x 2 banks (8Gb x4/x8); 4 BG x 2 banks (8Gb x16); 8 BG x 4 banks (16-64Gb x4/x8); 4 BG x 4 banks (16-64Gb x16) | Improves bandwidth/performance |
| Burst length | BL8 (and BL4) | BL16, BL32 (and BC8 OTF, BL32 OTF) | Allows 64B cache line fetch with only 1 DIMM subchannel |
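The burst-length row of Table 1 can be checked with simple arithmetic: a BL16 burst on a 32-bit DDR5 DIMM subchannel moves 16 beats of 4 bytes, exactly one 64-byte cache line. A minimal sketch:

```c
/* Bytes moved by one DRAM burst: one beat per burst position, each
 * beat as wide as the data bus. */
unsigned burst_bytes(unsigned burst_length, unsigned bus_width_bits)
{
    return burst_length * (bus_width_bits / 8);
}
```

DDR4 reaches the same 64-byte line with BL8 across a full 64-bit channel; DDR5's split into two 32-bit subchannels with BL16 lets each subchannel service a cache line independently.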
What runs faster, the memory or the processor core?
Although the DRAM sets the performance envelope, the processor core(s) are almost always faster than any off-chip memory. This asymmetry is decades old and, even though both sides, cores and memory interfaces, keep getting faster, the gap may still be widening.
The historic solution to dealing with “slow” memory was to insert wait states: idle bus cycles during which the processor stalls while the memory catches up. In simple embedded systems based on an MCU running at modest clock rates, a few wait states weren’t a huge problem. As clock speeds climbed from tens to hundreds of MHz and now into the GHz range, dealing with memory latency became a design driver.
In essence, wait states still exist, but they are now abstracted away from the designer through the memory controller. Flash interfaces include line-fill buffers and prefetch logic, which essentially act like cache memory. Handshaking in bus protocols allows the CPU to continue working while the memory interface waits.
DDR controllers handle the latency associated with row-to-column address delays (tRCD), the active cycles between row-active and precharge states (tRAS), the actual precharge delay (tRP), and the column address strobe latency (tCAS). Together, these parameters define the minimum time between DDR commands and are often combined into a single parameter in data sheets.
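The practical effect of these parameters is easiest to see as arithmetic. A read that misses the open row must precharge, activate the new row and then issue the column read, while a row hit pays only the CAS latency. A sketch with illustrative cycle counts (not taken from any specific datasheet):

```c
/* Illustrative DRAM timing arithmetic; all figures are in memory
 * controller clock cycles and are examples, not datasheet values. */
typedef struct {
    unsigned tRP;  /* precharge delay               */
    unsigned tRCD; /* row-to-column address delay   */
    unsigned tCAS; /* column address strobe latency */
} dram_timing;

/* Row miss: close the open row, open the new one, then read. */
unsigned row_miss_cycles(dram_timing t) { return t.tRP + t.tRCD + t.tCAS; }

/* Row hit: the row is already open, so only tCAS applies. */
unsigned row_hit_cycles(dram_timing t)  { return t.tCAS; }
```

The gap between the two cases is why controllers reorder commands to maximize row hits.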
In general, the latency of synchronous DRAM hasn’t improved much since early SDRAMs. The evolution from those early SDRAMs to the latest LPDDR5 memory doesn’t provide any significant improvements in latency; all the gains come from the bandwidth. Modern processors use this bandwidth to help manage the latency, by using local caches, out-of-order and multi-issue pipelines and memory controllers with reordering.

Which applications really need LPDDR5?
Not every embedded application needs high-bandwidth external memory like LPDDR5. Plenty of applications in industrial, commercial, medical and aerospace sectors can run on a memory system comprising internal flash and a modest amount of external RAM. But trends in the embedded space are pushing the limits of memory subsystems.
One of the more established trends is moving from a bare-metal system to one that uses a real-time operating system or an embedded Linux distribution. The memory requirements of the kernel, drivers, device tree and root filesystem quickly add up. This can set the baseline in the megabytes.
Connected devices also need sophisticated security features and network stacks. Add a graphical user interface framework and possibly containers, and the requirements soon reach the hundreds of megabytes.
At this point, the only viable option is off-chip DRAM. The adoption of LPDDR4 coincided with these trends, as it provides a good balance between power and bandwidth. That progression naturally led to LPDDR5, as processor suppliers continued to add features and functionality to high-end devices and systems-on-chip (SoCs).
Applications that ingest large amounts of real-world data from local sensors or online services increased the memory requirements further. Systems based on radar or lidar, and even industrial applications using vibration sensors, can generate hundreds of megabytes of data per second. Buffering that data, pre-processing it, aligning timing and filtering all require large amounts of SDRAM. Demand from these kinds of applications pushes the limits of LPDDR4, particularly if it’s doing double duty storing the operating system and networking stack. The higher bandwidth offered by LPDDR5 became essential.
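A back-of-envelope calculation shows how quickly sensor traffic adds up: sample rate, sample size and channel count multiply into a sustained write rate the DRAM must absorb on top of everything else it serves. The figures in the test below are hypothetical, chosen only to illustrate the arithmetic:

```c
#include <stdint.h>

/* Sustained sensor write bandwidth the memory subsystem must absorb:
 * samples per second x bytes per sample x channel count. */
uint64_t sensor_bytes_per_sec(uint64_t sample_rate_hz,
                              uint32_t bytes_per_sample,
                              uint32_t channels)
{
    return sample_rate_hz * bytes_per_sample * channels;
}
```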
The most visible trend today is AI inferencing at the edge. The models used, such as convolutional neural networks and transformer-based models, are normally memory-bound, rather than compute-bound. The neural processing unit (NPU) spends a lot of cycles fetching weights from memory, and as model sizes grow, so too does the time spent accessing memory.
Moving to LPDDR5 provides the extra bandwidth to keep NPUs busy, delivering more inferences per second with fewer stalls. This can be particularly crucial in vision systems such as ADAS (advanced driver assistance systems), robotics or smart surveillance. But high-end HMIs (human-machine interfaces) with rich graphics, gateways running multiple complex protocol stacks, or systems that feature multiple processors also need the performance offered by LPDDR5.
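For a memory-bound model, the ceiling on throughput follows directly from bandwidth: if every weight is streamed from DRAM once per inference, the inference rate cannot exceed bandwidth divided by model size, regardless of how many TOPS the NPU offers. A sketch with illustrative numbers (not vendor figures):

```c
#include <stdint.h>

/* Memory-bound ceiling on inference rate: assumes every weight is
 * fetched from DRAM once per inference (no on-chip weight caching).
 * Numbers used in tests are illustrative, not vendor figures. */
uint64_t max_inferences_per_sec(uint64_t dram_bytes_per_sec,
                                uint64_t model_weight_bytes)
{
    return dram_bytes_per_sec / model_weight_bytes;
}
```

This simple roofline-style bound explains why doubling DRAM bandwidth can double edge-AI throughput even when the NPU itself is unchanged.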
Table 2: Typical bandwidth and first-access latency by DRAM generation

| Type | Bandwidth | Latency |
|---|---|---|
| DDR3 | 12-17 GB/s | 50-60 ns |
| DDR4 | 19-25 GB/s | 60-70 ns |
| DDR5 | 38-51 GB/s | 70-90 ns |
| LPDDR4 | 25-34 GB/s | 80-100 ns |
| LPDDR5 | 50-68 GB/s | 90-110 ns |
How should engineers manage memory efficiency in embedded systems?
The relationship between processor and memory is now critical to overall performance. But assessing the efficiency of that relationship is not a simple task. Designing a memory subsystem to meet an efficiency goal requires low-level engineering.
For bare-metal and RTOS-based systems, engineers might rely on static analysis of the linker map to understand where code and data are placed, in terms of flash, SRAM or tightly coupled memory. Simple time-stamping and event logging, or even toggling an output pin, can help engineering teams work out how often external memory is accessed. A more sophisticated approach might include using a logic analyzer or mixed-signal oscilloscope to monitor external buses, measuring burst lengths and bus utilization. This will expose requests and responses, revealing whether the memory interface is creating a bottleneck.
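The time-stamping approach can be as simple as bracketing a memory-heavy region with a counter read. In this sketch, `clock()` stands in for a hardware cycle counter (on a real Arm target, the DWT cycle counter would be the usual choice); the strided variant exists to show how access pattern, not just byte count, drives the measured time:

```c
#include <stddef.h>
#include <time.h>

/* Bracket a memory-heavy traversal with timestamps. clock() is a
 * host-side stand-in for a hardware cycle counter. The stride
 * parameter lets you compare cache-friendly (stride 1) against
 * cache-hostile (large stride) access patterns over the same data. */
long sum_strided(const int *buf, size_t n, size_t stride, double *elapsed_s)
{
    clock_t t0 = clock();
    long sum = 0;
    for (size_t s = 0; s < stride; s++)
        for (size_t i = s; i < n; i += stride)
            sum += buf[i];
    clock_t t1 = clock();
    *elapsed_s = (double)(t1 - t0) / CLOCKS_PER_SEC;
    return sum;
}
```

Both traversal orders touch every element exactly once, so the sums match; on large buffers, the elapsed times typically will not.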
There are more tools in Linux-based systems, but it’s still far from automatic. The virtual file system (/proc) provides access to processor state and memory usage, plus other kernel parameters. Linux’s built-in profiling analysis tool, perf, uses counters and tracepoints to measure events like instructions retired, CPU cycles, cache misses, branches and page faults.
It also provides sub-tools, specifically perf mem, which looks at memory accesses, including loads and stores. Using perf mem, engineers can analyze cache misses and latency. It also records which instructions or functions generate the most memory traffic, the latency of those memory accesses, and where they hit (L1, L2, RAM).
While these tools can help engineers understand how their system is performing, they don’t provide recommendations. A memory controller can schedule commands to maximize row hits, maintain refresh, calibration and timing margins, and provide basic quality of service. Rewriting structures to be cache-friendly or to use fewer memory passes is still the engineer’s job. But with generative AI making huge breakthroughs in automating low-level coding, that job, too, may be automated sooner rather than later.
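One classic example of the restructuring that remains the engineer's job is converting an array-of-structs layout to struct-of-arrays, so a hot loop only pulls the fields it actually uses through the cache. A minimal sketch:

```c
#include <stddef.h>

/* Array-of-structs: a loop that only needs 'x' still drags y and z
 * through the cache with every line fetched from DRAM. */
struct point_aos { float x, y, z; };

float sum_x_aos(const struct point_aos *p, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) s += p[i].x;
    return s;
}

/* Struct-of-arrays: the same loop now reads a dense array, so every
 * byte fetched from memory is a byte the loop actually uses. */
struct points_soa { const float *x, *y, *z; };

float sum_x_soa(struct points_soa p, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) s += p.x[i];
    return s;
}
```

The results are identical; the difference is memory traffic, which is exactly what tools like perf mem reveal but cannot fix.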
The line between clock speed and memory performance
It would be convenient to draw a straight line between clock speed and memory performance, with both scaling linearly: double each, and the system gets twice as fast. In reality, the relationship is far more complicated. While on-chip SRAM may operate at speeds close to the processor clock, it remains small in comparison to the demands of modern code bases and data sets. Off-chip memory provides much greater capacity and high burst rates, but brings latency, with little sign that this latency will significantly improve in future generations.
The memory subsystem must reconcile this asynchronous co-existence, and as processors continue to grow more capable, adding dedicated hardware for AI and other high-bandwidth functions, tuning the memory subsystem will become more important and more complex. Simply scaling back the core speed doesn’t eliminate memory stalls; they just take fewer cycles. Increasing the bandwidth of SDRAM doesn’t reduce first-word latency. The real levers to pull are architectural: cache designs, data locality, parallelism and careful mapping of workloads to the memory subsystem.
Very soon, engineers will be looking at how LPDDR6 will help them increase system performance while balancing the same design tradeoffs. The gains from higher data rates may further expose the importance of those architectural decisions. Engineers navigating this landscape can benefit from the experienced Field Application Engineers and close supplier relationships Avnet brings, helping OEMs find the best solution for them.