
When did memory become so critical?

Philip Ling
Memory bandwidth is increasing but latency is still a challenge as processor cores get faster
KEY TAKEAWAYS:
  • How memory became a technical bottleneck
  • Why low-level optimization is essential
  • What processor architectures and memory controllers can mitigate

Even before AI arrived, memory subsystems had become a primary design constraint in embedded systems. Memory now dictates cost, performance and, as recent developments show, supply chain risk.

Many OEMs came to rely on LPDDR5 (Low-Power Double Data Rate 5) without realizing it, and they are now casualties of the current squeeze on high-performance, high-bandwidth memory. As processors became more capable and more dependent on software over recent years, this mobile-class DRAM (Dynamic Random-Access Memory) technology quietly became a requirement in many embedded systems, particularly those looking to include edge AI.

The road that led us here is more nuanced than measured speed and capacity. Faster processors do need faster memory, but it takes more than larger off-chip DRAM to take embedded systems to the next level. There are many types of memory in play and while DDR (Double Data Rate) fits into the overall architecture, engineers have much more to consider.

What kind of memory hierarchy do modern embedded systems need?

Memory in embedded systems is not a single entity; it is a hierarchy of technologies combined in the most optimal way for the application. This dependency on the application belies the general-purpose nature of processors, and we’re also seeing this at the microcontroller level.

Optimization of this nature relates to speed, density, cost and persistence. Memory can be differentiated as volatile or non-volatile, but it also falls into two broad categories: on-chip and off-chip. Looking at on-chip memory first: alongside registers and pipelines, a processor will include Level 1 and Level 2 (possibly more) caches built using static random-access memory (SRAM), sitting between the processor’s core(s) and external memory. Cache holds the most frequently and recently used instructions and data, reducing the latency of reading from and writing to other types of memory.

Along with the cache, a processor may feature tightly coupled memory (TCM) and scratchpads, represented by smaller blocks of SRAM, mapped directly into the CPU’s address space. This memory provides deterministic access and is typically used for hard real-time routines.

The capacity of this type of on-chip memory is small, in the kilobyte or low-megabyte range. However, it is the only memory in the architecture that has any chance of keeping up with the speed of the core(s), operating in a small number of clock cycles.

Non-volatile memory–almost always now flash–can be either on-chip or external. If external, the main types are parallel NOR, raw NAND and SPI NOR. The leading options here are NAND-based managed flash devices configured as either eMMC (embedded MultiMediaCard) or UFS (Universal Flash Storage), each using a JEDEC-defined interface standard.

This memory’s value proposition is persistent storage rather than performance. At power-up, a boot loader runs from on-chip ROM or flash to bring up the system’s clocks and peripherals. The next stage is to load the application code, including the operating system, into RAM. Due to the size of modern code bases, that almost always now involves external DRAM.

This is where DDR sits. External, or off-chip, DRAM holds instructions and data, including stacks, heaps, framebuffers, AI model weights, network buffers, logging and telemetry data. Embedded engineers are now dealing with memory subsystems, with technologies like LPDDR5 setting the overall performance envelope.

Table 1: Comparing DDR4 and DDR5

| Feature | DDR4 | DDR5 | DDR5 Advantage |
| --- | --- | --- | --- |
| Data rates | 1600–3200 MT/s | 3200–6400 MT/s | Increases bandwidth |
| VDD/VDDQ/VPP | 1.2 V / 1.2 V / 2.5 V | 1.1 V / 1.1 V / 1.8 V | Lowers power |
| Internal VREF | VREFDQ | VREFDQ, VREFCA, VREFCS | Improves voltage margins, reduces BOM cost |
| Device densities | 2 Gb – 16 Gb | 8 Gb – 64 Gb | Larger monolithic memories |
| Prefetch | 8n (words per core cycle) | 16n (words per core cycle) | Keeps the internal core clock low |
| On-die ECC | None | 128b+8b SEC, error check and scrub | Strengthens on-chip RAS (reliability, availability and serviceability) |
| CRC | Write | Read/Write | Strengthens system RAS |
| Bank groups (BG)/banks | 4 BG x 4 banks (x4/x8); 2 BG x 4 banks (x16) | 8 BG x 2 banks (8 Gb x4/x8); 4 BG x 2 banks (8 Gb x16); 8 BG x 4 banks (16–64 Gb x4/x8); 4 BG x 4 banks (16–64 Gb x16) | Improves bandwidth/performance |
| Burst length | BL8 (and BL4) | BL16, BL32 (and BC8 OTF, BL32 OTF) | Allows a 64-byte cache line fetch from a single DIMM subchannel |
The shift to DDR5 brings many benefits in terms of bandwidth and burst length, with better reliability, availability and serviceability. (Source: element14)

What runs faster, the memory or the processor core?

Although the DRAM sets the performance envelope, the processor core(s) are almost always faster than any off-chip memory. This asymmetry has a long history, and despite both sides–cores and memory interfaces–getting faster, the gap could be widening.
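To see why the gap matters, some round numbers help. The latency and clock figures below are illustrative assumptions, not vendor data: a roughly fixed DRAM access time costs more and more core cycles as the clock rises.

```python
# Back-of-the-envelope: a fixed DRAM access latency costs more core
# cycles as the clock rises (illustrative numbers, not vendor data).

DRAM_LATENCY_NS = 90  # assumed first-word latency, roughly flat across generations

def stall_cycles(core_mhz: float, latency_ns: float = DRAM_LATENCY_NS) -> int:
    """Core cycles spent waiting for one uncached DRAM access."""
    return round(core_mhz * 1e6 * latency_ns * 1e-9)

for mhz in (50, 400, 2000):
    print(f"{mhz:>5} MHz core: ~{stall_cycles(mhz)} cycles per DRAM access")
```

At 50 MHz the wait is a handful of cycles; at 2 GHz the same access idles the core for well over a hundred cycles, which is why caches and prefetching become unavoidable.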

The historic solution to dealing with “slow” memory was to insert wait states: idle clock cycles during which the processor simply paused while the memory caught up. In simple embedded systems based on an MCU running relatively slowly, a few wait states weren’t a huge problem. As clock speeds climbed from tens to hundreds of MHz and now into the GHz range, dealing with memory latency became a design driver.

In essence, wait states still exist, but they are now abstracted away from the designer through the memory controller. Flash interfaces include line-fill buffers and prefetch logic, which essentially act like cache memory. Handshaking in bus protocols allows the CPU to continue working while the memory interface waits.

DDR controllers handle the latency associated with row-to-column address delays (tRCD), the active cycles between row-active and precharge states (tRAS), the actual precharge delay (tRP), and the column address strobe latency (tCAS). Together, these parameters define the minimum time between DDR commands and are often combined into a single parameter in data sheets.
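The way these parameters compound can be sketched with a simple calculation. The clock frequency and cycle counts below are assumed, DDR4-3200-style example values, not a specific part’s datasheet figures:

```python
# Sketch: minimum delay from row activate to first data word, built from
# the JEDEC-style timing parameters named above (tRCD plus CAS latency).
# The cycle counts used below are assumed example values.

def activate_to_data_ns(clock_mhz: float, trcd_cycles: int, cas_cycles: int) -> float:
    """ACT -> READ delay (tRCD) plus READ -> first data (CAS latency), in ns."""
    t_ck = 1000.0 / clock_mhz  # clock period in ns
    return (trcd_cycles + cas_cycles) * t_ck

# Hypothetical DDR4-3200-class numbers: 1600 MHz clock, tRCD = 22, CL = 22
print(activate_to_data_ns(1600, 22, 22))  # -> 27.5 ns
```

Doubling the clock halves the cycle time, but cycle counts for tRCD and CL tend to rise in step, which is why first-word latency in nanoseconds barely moves between generations.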

In general, the latency of synchronous DRAM hasn’t improved much since early SDRAMs; the evolution from those parts to the latest LPDDR5 delivers its gains almost entirely in bandwidth. Modern processors compensate for the latency by exploiting that bandwidth through local caches, out-of-order and multi-issue pipelines, and memory controllers that reorder requests.

Memory bandwidth is increasing but latency improvements can’t keep up with increases in processor clock speeds, putting more pressure on design engineers.

Which applications really need LPDDR5?

Not every embedded application needs high-bandwidth external memory like LPDDR5. Plenty of applications in industrial, commercial, medical and aerospace sectors can run on a memory system comprising internal flash and a modest amount of external RAM. But trends in the embedded space are pushing the limits of memory subsystems.

One of the more established trends is moving from a bare-metal system to one that uses a real-time operating system or an embedded Linux distribution. The memory requirements of the kernel, drivers, device tree and root filesystem quickly add up. This can set the baseline in the megabytes.

Connected devices also need sophisticated security features and network stacks. Add a graphical user interface framework and possibly containers, and the requirements soon reach the hundreds of megabytes.

At this point, the only viable option is off-chip DRAM. The adoption of LPDDR4 coincided with these trends, as it provides a good balance between power and bandwidth. That progression naturally led to the use of LPDDR5, as processor suppliers continued to add features and functionality to high-end devices and systems-on-chip (SoCs).

Applications that ingest large amounts of real-world data from local sensors or online services increased memory requirements further. Systems based on radar or lidar, and even industrial applications using vibration sensors, can generate hundreds of megabytes of data per second. Buffering that data, pre-processing it, aligning it in time and filtering it all require large amounts of SDRAM. Demand from these kinds of applications pushes LPDDR4 to its limits, particularly if it’s doing double duty storing the operating system and networking stack. The higher bandwidth offered by LPDDR5 became essential.

The most visible trend today is AI inferencing at the edge. The models used, such as convolutional neural networks and transformer-based models, are normally memory-bound, rather than compute-bound. The neural processing unit (NPU) spends a lot of cycles fetching weights from memory, and as model sizes grow, so too does the time spent accessing memory.

Moving to LPDDR5 provides the extra bandwidth to keep the NPU busy, delivering more inferences per second with fewer stalls. This can be particularly crucial in vision systems such as ADAS (advanced driver assistance systems), robotics or smart surveillance. But high-end HMIs (human-machine interfaces) with rich graphics, gateways running multiple complex protocol stacks, and systems that feature multiple processors also need the performance offered by LPDDR5.
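A memory-bound workload has a simple performance ceiling: if every weight must be streamed from DRAM for each inference, bandwidth divided by model size bounds the inference rate. The model size and bandwidth figures below are assumptions for illustration only:

```python
# Rough ceiling on inference rate for a memory-bound model: if all
# weights are fetched from DRAM every inference, bandwidth divides
# model size. All figures here are illustrative assumptions.

def max_inferences_per_s(model_bytes: float, bandwidth_gb_s: float) -> float:
    return (bandwidth_gb_s * 1e9) / model_bytes

MODEL = 50e6  # 50 MB of weights (hypothetical INT8 CNN)
for name, bw in (("LPDDR4-class", 30), ("LPDDR5-class", 60)):
    print(f"{name}: <= {max_inferences_per_s(MODEL, bw):.0f} inferences/s")
```

Real NPUs reuse weights from on-chip buffers, so the achieved rate sits below this ceiling, but the scaling is the point: doubling DRAM bandwidth roughly doubles the bound.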

| Type | Bandwidth | Latency |
| --- | --- | --- |
| DDR3 | 12–17 GB/s | 50–60 ns |
| DDR4 | 19–25 GB/s | 60–70 ns |
| DDR5 | 38–51 GB/s | 70–90 ns |
| LPDDR4 | 25–34 GB/s | 80–100 ns |
| LPDDR5 | 50–68 GB/s | 90–110 ns |
Memory latency increases with each new DRAM standard, while on-chip cache latency is around 1-2 ns for Level 1, and 3-10 ns for Level 2. On-chip SRAM (TCM, scratchpad) latency is around 5-20 ns (Source: Avnet)
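These hierarchy latencies combine into an average memory access time (AMAT) once cache hit rates are factored in. The latencies below come from the figures above; the hit rates are illustrative assumptions, not measurements:

```python
# Average memory access time (AMAT) across a two-level cache hierarchy,
# using latencies like those in the table above. The hit rates are
# assumed for illustration, not measured values.

def amat_ns(l1_ns, l2_ns, dram_ns, l1_hit, l2_hit):
    """AMAT = L1 + L1-miss-rate * (L2 + L2-miss-rate * DRAM), in ns."""
    return l1_ns + (1 - l1_hit) * (l2_ns + (1 - l2_hit) * dram_ns)

# 1.5 ns L1, 5 ns L2, 100 ns LPDDR5-class DRAM; 95% / 80% hit rates
print(round(amat_ns(1.5, 5.0, 100.0, 0.95, 0.80), 2))  # -> 2.75
```

Despite DRAM being nearly two orders of magnitude slower than L1, high hit rates keep the average access close to cache speed, which is the whole argument for investing in locality.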

How should engineers manage memory efficiency in embedded systems?

The relationship between processor and memory is now critical to overall performance. But assessing the efficiency of that relationship is not a simple task. Designing a memory subsystem to meet an efficiency goal requires low-level engineering.

For bare-metal and RTOS-based systems, engineers might rely on static analysis of the linker map to understand where code and data are placed, in terms of flash, SRAM or tightly coupled memory. Simple time-stamping and event logging, or even toggling a GPIO, can help engineering teams work out how often external memory is accessed. A more sophisticated approach might include using a logic analyzer or mixed-signal oscilloscope to monitor external buses, measuring burst lengths and bus utilization. This will expose requests and responses, revealing whether the memory interface is creating a bottleneck.
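The time-stamping idea is simple enough to sketch on a host. On a target it would read a cycle counter (such as DWT CYCCNT on a Cortex-M) or toggle a GPIO for the analyzer; the host-side version below, using the standard library timer, only illustrates the wrap-and-log pattern:

```python
# Host-side sketch of the time-stamping idea: wrap a region of interest
# and log its duration. On an embedded target this would read a hardware
# cycle counter or toggle a GPIO pin instead of calling the OS timer.
import time

def timed(label, fn, *args):
    t0 = time.perf_counter_ns()
    result = fn(*args)
    dt = time.perf_counter_ns() - t0
    print(f"{label}: {dt} ns")
    return result

total = timed("sum 1M ints", sum, range(1_000_000))
```

Scattering a handful of these probes around suspected hot paths is often enough to spot which routines are dominated by memory traffic before reaching for an oscilloscope.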

There are more tools in Linux-based systems, but it’s still far from automatic. The virtual file system (/proc) provides access to processor state and memory usage, plus other kernel parameters. Linux’s built-in profiling analysis tool, perf, uses counters and tracepoints to measure events like instructions retired, CPU cycles, cache misses, branches and page faults.

It also provides sub-tools, specifically perf mem, which looks at memory accesses, including loads and stores. Using perf mem, engineers can analyze cache misses and latency. It also records which instructions or functions generate the most memory traffic, the latency of those memory accesses, and where they hit (L1, L2, RAM).

While these tools help engineers understand how their system is performing, they don’t provide recommendations. A memory controller can schedule commands to maximize row hits, maintain refresh, calibration and timing margins, and provide basic quality of service, but rewriting data structures to be cache-friendly or to use fewer memory passes is still the engineer’s job. With generative AI making significant progress in automating low-level coding, however, that kind of optimization may be automated sooner rather than later.
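“Fewer memory passes” can be shown in miniature. The two functions below compute the same result, but the first traverses a buffer twice while the second fuses the work into a single pass with better locality; the data and operations are purely illustrative:

```python
# "Fewer memory passes" in miniature: two separate traversals touch every
# element of a large buffer twice; fusing them touches each element once.
# The buffer contents and operations are illustrative.

def two_pass(data):
    total = sum(data)  # pass 1 over the buffer
    peak = max(data)   # pass 2 over the same buffer
    return total, peak

def one_pass(data):
    total, peak = 0, float("-inf")
    for x in data:     # single traversal: each element loaded once
        total += x
        if x > peak:
            peak = x
    return total, peak

data = list(range(1000))
assert two_pass(data) == one_pass(data)
```

On buffers that exceed the cache, the fused version halves the DRAM traffic for the same answer, which is exactly the kind of restructuring no controller can do on the engineer’s behalf.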

The line between clock speed and memory performance

It would be convenient to draw a straight line between clock speed and memory performance, with both scaling linearly: double either and the system gets proportionally faster. In reality, the relationship is far more complicated. While on-chip SRAM may operate at speeds close to the processor clock, it remains small compared to the demands of modern code bases and data sets. Off-chip memory provides much greater capacity and high burst rates, but brings latency, with little sign that this latency will improve significantly in future generations.

The memory subsystem must reconcile this asymmetric co-existence, and as processors continue to grow more capable, adding dedicated hardware for AI and other high-bandwidth functions, tuning the memory subsystem will become more important and more complex. Simply scaling back the core speed doesn’t eliminate memory stalls; it just makes them span fewer cycles. Increasing the bandwidth of SDRAM doesn’t reduce first-word latency. The real levers are architectural: cache design, data locality, parallelism and careful mapping of workloads to the memory subsystem.

Very soon, engineers will be looking at how LPDDR6 will help them increase system performance while balancing the same design tradeoffs. The gains from higher data rates may further expose the importance of those architectural decisions. Engineers navigating this landscape can benefit from the experienced Field Application Engineers and close supplier relationships Avnet brings, helping OEMs find the best solution for them.

 

About Author

Philip Ling, Technical Content Manager, Corporate Marketing

Philip Ling is a Technical Content Manager with Avnet. He holds a post-graduate diploma in Advanced ...
