On-demand Container Loading in AWS Lambda

Arpit Bhayani

curious, tinkerer, and explorer


These are my notes based on the paper On-demand Container Loading in AWS Lambda.

TL;DR

AWS Lambda scales to handle millions of requests per second, provisions containers at a rate of 15,000 per second, and achieves cold-start times as low as 50ms, even for large container images (up to 10GiB). The paper describes the design of the storage and caching system and the optimizations that make it this efficient.

The system tolerates infrastructure failures, significantly reduces storage requirements, and maintains low cold-start latency. These are the three things I found most interesting.

  • block-level demand loading
  • deduplication of blocks with convergent encryption
  • use of erasure coding in caching for optimizing tail latencies


Three interesting things

Continuing the discussion from the above TL;DR, here are three things that I found interesting in this paper and some quick details about each.

Block-level demand loading

Traditional approaches would require loading the entire container image before starting execution. This drastically increases cold-start times and memory usage. Lambda improves on this by loading only the necessary blocks of the container image on demand.

Effectively, this allows Lambda to start executing the function while the remainder of the container image continues to load in the background. Additional blocks are fetched asynchronously as they are needed, ensuring that Lambda can minimize both startup latency and resource consumption.

Deduplication with Convergent Encryption

Most containers used across customers share a common base image and other layers, so deduplication is essential to optimize storage and network bandwidth. But because container images are encrypted for security reasons, Lambda cannot use traditional deduplication methods, which need to peek at the plaintext content to find duplicates.

Lambda adopted convergent encryption - a technique that allows deduplication of container data without compromising security. The idea (elaborated below) is to use the hash of the content as the encryption key. This allows deduplication without having to maintain a central key store or share keys across customers.

Erasure Coding in Caching for Tail Latency Optimization

Erasure coding allows Lambda to ensure data redundancy without the overhead of full data replication. It strikes a balance between minimizing the amount of redundant data stored and maintaining high availability and fault tolerance.

Lambda divides data into chunks and encodes these using erasure codes. Only a subset of the encoded chunks is required to reconstruct the original data, so the cache needs less total storage than full replication would.

Notes and a quick explanation

When first launched, AWS Lambda only supported function deployment through compressed code packages up to 250MB in size. However, in 2020, AWS introduced support for container-based Lambda functions with sizes up to 10 GiB, enabling much larger workloads. This presented a challenge: maintaining Lambda’s scaling capabilities (up to 15,000 new containers per second for a single customer) and low start-up times (as low as 50ms).

Architecture

The core architecture of AWS Lambda comprises several key components.

  1. Frontend: a stateless, load-balanced component that handles incoming execution requests
  2. Worker Manager: stateful, sticky load balancer that tracks capacity for each unique function
  3. Workers: hosts that execute function code in isolated MicroVMs based on the Firecracker hypervisor


The entire design of this system minimizes the data movement required during cold starts, which is a critical factor in maintaining performance as deployment sizes increase.


Block-Level Loading

To optimize for large containers, AWS Lambda uses block-level demand loading, where only the required portions of a container image are loaded into memory during startup. This avoids fully loading large container images into memory, reducing cold-start times. As execution proceeds, additional blocks are loaded dynamically based on function requirements.
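To make the idea concrete, here is a minimal sketch of demand loading, not Lambda's actual implementation. The class name `LazyImage`, the 4 KiB `BLOCK_SIZE`, and the `fetch_block` callback are all assumptions made for illustration; the in-memory `remote` bytes object stands in for whatever storage holds the image.

```python
BLOCK_SIZE = 4096  # hypothetical block size, chosen only for illustration


class LazyImage:
    """Serves reads from an image, fetching each block on first access only."""

    def __init__(self, fetch_block, size):
        self._fetch_block = fetch_block  # callable: block index -> bytes
        self._size = size
        self._loaded = {}                # block index -> bytes already fetched

    def read(self, offset, length):
        end = min(offset + length, self._size)
        out = bytearray()
        pos = offset
        while pos < end:
            index, start = divmod(pos, BLOCK_SIZE)
            if index not in self._loaded:
                # Block not seen yet: fetch it now, on demand.
                self._loaded[index] = self._fetch_block(index)
            take = min(BLOCK_SIZE - start, end - pos)
            out += self._loaded[index][start:start + take]
            pos += take
        return bytes(out)

    @property
    def blocks_loaded(self):
        return len(self._loaded)


# A fake 1 MiB "remote" image standing in for the real backing storage.
remote = bytes(i % 251 for i in range(1024 * 1024))
fetches = []  # records which blocks were actually requested


def fetch_block(index):
    fetches.append(index)
    return remote[index * BLOCK_SIZE:(index + 1) * BLOCK_SIZE]


image = LazyImage(fetch_block, len(remote))
data = image.read(5000, 100)  # touches only the block that contains offset 5000
```

Reading 100 bytes out of a 256-block image pulls in a single block; the other 255 are never transferred unless something reads them.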

By the way, other block-level loading systems include Slacker and Starlight.

Image Flattening Process

Container image layers are deterministically flattened and collapsed into a single ext4 filesystem. This ensures that blocks containing unchanged files remain identical, which allows blocks to be deduplicated between containers that share common base layers. The loading system also introduces two new components: a local agent and a local cache.

The local agent handles reads by fetching data from the local cache or the tiered cache system. Writes are managed using a page-level copy-on-write approach, allowing shared immutable data in caches while supporting guest writes.
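The copy-on-write idea can be sketched in a few lines. This is an illustration under assumptions, not the paper's code: `CopyOnWriteOverlay` and the 4 KiB `PAGE_SIZE` are hypothetical names and values, and plain Python lists stand in for cached image pages.

```python
PAGE_SIZE = 4096  # hypothetical page size for this sketch


class CopyOnWriteOverlay:
    """Private per-guest writes layered over shared, read-only base pages."""

    def __init__(self, base_pages):
        self._base = base_pages  # shared list of bytes objects, never mutated
        self._dirty = {}         # page index -> this guest's private copy

    def read_page(self, index):
        if index in self._dirty:
            return bytes(self._dirty[index])
        return self._base[index]

    def write(self, index, offset, data):
        if index not in self._dirty:
            # First write to this page: copy it before modifying.
            self._dirty[index] = bytearray(self._base[index])
        self._dirty[index][offset:offset + len(data)] = data


shared = [bytes(PAGE_SIZE), bytes(PAGE_SIZE)]  # two zero-filled base pages
guest_a = CopyOnWriteOverlay(shared)
guest_b = CopyOnWriteOverlay(shared)
guest_a.write(0, 10, b"hello")  # only guest_a sees this change
```

Because the base pages are never modified, the same cached data can back many guests at once, and each guest pays only for the pages it actually writes.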

Deduplication Without Trust

Approximately 80% of newly uploaded Lambda functions result in zero unique chunks, while the remaining 20% contain a mean of 4.3% unique chunks (median 2.5%). This clearly shows that there is a widespread use of common base images which can be leveraged to achieve significant deduplication.
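As a rough illustration of how such a "unique chunk" fraction could be measured, here is a small sketch. The function names, the fixed-size chunking, and the use of a Python set as the index of known chunks are my assumptions, not details from the paper.

```python
import hashlib


def chunk_digests(image: bytes, chunk_size: int):
    """Split an image into fixed-size chunks and hash each one."""
    return [hashlib.sha256(image[i:i + chunk_size]).hexdigest()
            for i in range(0, len(image), chunk_size)]


def upload(image: bytes, known: set, chunk_size: int) -> float:
    """Add an image's chunks to `known`; return the fraction not seen before."""
    digests = chunk_digests(image, chunk_size)
    new = [d for d in digests if d not in known]
    known.update(digests)
    return len(new) / len(digests) if digests else 0.0
```

With this, an image built on an already-uploaded base contributes only its own application chunks, and re-uploading an identical image contributes nothing at all.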

AWS Lambda deduplicates blocks and makes sure they are stored only once, even across different containers. This reduces the storage footprint and optimizes bandwidth. Also, frequently used blocks are cached closer to the compute nodes, reducing the latency of loading container images. This caching occurs across three layers.

  1. local - on-worker cache
  2. regional - AZ-level distributed cache
  3. global - S3 backing store
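A read through these three tiers might look like the sketch below. It is a simplification under assumptions: `TieredChunkReader` is a hypothetical name, plain dicts stand in for each tier, and the fill-on-miss policy is my own choice rather than something the paper specifies.

```python
class TieredChunkReader:
    """Looks a chunk up tier by tier, filling faster tiers on the way back."""

    def __init__(self, local, regional, origin):
        self.local = local        # dict standing in for the on-worker cache
        self.regional = regional  # dict standing in for the AZ-level cache
        self.origin = origin      # dict standing in for the S3 backing store
        self.hits = {"local": 0, "regional": 0, "origin": 0}

    def get(self, chunk_id):
        if chunk_id in self.local:
            self.hits["local"] += 1
            return self.local[chunk_id]
        if chunk_id in self.regional:
            self.hits["regional"] += 1
            data = self.regional[chunk_id]
        else:
            self.hits["origin"] += 1
            data = self.origin[chunk_id]
            self.regional[chunk_id] = data  # assumed policy: fill the AZ tier
        self.local[chunk_id] = data         # assumed policy: fill the worker tier
        return data
```

Two workers that share the same regional dict show the benefit: the first worker's miss goes all the way to the origin, while the second worker's miss for the same chunk stops at the regional tier.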


Convergent Encryption

To maintain security without sacrificing performance, Lambda implements convergent encryption. The core idea is to ensure that identical container images, even if they are created by different customers, result in the same encrypted output, enabling deduplication without compromising security.

Convergent encryption uses the hash of the content as the encryption key. Thus, if two containers have the same content, their hashes are the same, and so are their encryption keys. This enables deduplication on encrypted containers while maintaining security boundaries between customer workloads.

The process involves:

  1. Deriving a key from each chunk using its SHA256 digest
  2. Encrypting the chunk with AES-CTR using the derived key and a deterministic IV
  3. Creating a manifest containing chunk offsets, unique keys, and SHA256 hashes
  4. Encrypting the manifest’s key table using a unique per-customer key managed by AWS KMS
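Steps 1 to 3 above can be sketched with the standard library alone. Two loud caveats: Python has no built-in AES, so `keystream` below is a toy SHA-256 counter-mode stand-in for AES-CTR and must not be used as a real cipher; and the all-zero IV, the use of the raw digest as the key, and naming stored chunks by the hash of their ciphertext are my assumptions. Step 4 (wrapping the key table with a per-customer KMS key) is left out.

```python
import hashlib

ZERO_IV = bytes(16)  # assumed deterministic IV; the notes do not say how it is chosen


def keystream(key: bytes, iv: bytes, length: int) -> bytes:
    """Toy SHA-256 counter-mode keystream. A stand-in for AES-CTR, not a real cipher."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + iv + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def xor_bytes(data: bytes, stream: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, stream))


def encrypt_chunk(chunk: bytes):
    """Return (key, ciphertext); the key depends only on the chunk's content."""
    key = hashlib.sha256(chunk).digest()  # step 1: key from the SHA256 digest
    ciphertext = xor_bytes(chunk, keystream(key, ZERO_IV, len(chunk)))  # step 2
    return key, ciphertext


def decrypt_chunk(key: bytes, ciphertext: bytes) -> bytes:
    return xor_bytes(ciphertext, keystream(key, ZERO_IV, len(ciphertext)))


def build_manifest(image: bytes, chunk_size: int, store: dict):
    """Step 3: record offset, key, and hash per chunk; store each ciphertext once."""
    manifest = []
    for offset in range(0, len(image), chunk_size):
        key, ciphertext = encrypt_chunk(image[offset:offset + chunk_size])
        name = hashlib.sha256(ciphertext).hexdigest()
        store.setdefault(name, ciphertext)  # identical chunks collapse to one entry
        manifest.append({"offset": offset, "key": key, "sha256": name})
    return manifest


def read_image(manifest, store) -> bytes:
    """Rebuild an image from its manifest and the shared ciphertext store."""
    return b"".join(decrypt_chunk(e["key"], store[e["sha256"]]) for e in manifest)
```

Because the key is a function of the plaintext alone, two customers who upload the same chunk produce byte-identical ciphertext, so the store keeps one copy even though neither customer ever shared a key with the other. Only someone who already holds a manifest entry for a chunk has the key to decrypt it.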

Salt Rotation for Blast Radius Limitation

To mitigate risks associated with widely referenced chunks, Lambda incorporates a varying salt in the key derivation step. This salt can be rotated based on factors such as time, chunk popularity, and infrastructure placement, allowing for fine-tuned control over the trade-off between deduplication efficiency and potential impact radius.
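One way to picture the effect of the salt is the sketch below. The HMAC-based construction and the function name `derive_salted_key` are assumptions for illustration; the notes do not say how Lambda actually mixes the salt into the key derivation.

```python
import hashlib
import hmac


def derive_salted_key(chunk: bytes, salt: bytes) -> bytes:
    """Derive a chunk key from its SHA256 digest and a rotating salt (assumed HMAC)."""
    digest = hashlib.sha256(chunk).digest()
    return hmac.new(salt, digest, hashlib.sha256).digest()
```

Identical chunks derived under the same salt still get the same key and therefore still deduplicate. Under different salts they get different keys and are stored separately, so a problem with one stored copy affects only the images that were encrypted under that salt.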

Erasure Coding for Tail Latency Optimization

To address tail latency concerns and ensure data redundancy and fault tolerance, the system employs erasure coding instead of simple replication. The current production deployment uses a 4-of-5 code, achieving:

  • 25% storage overhead
  • 25% increase in request rate
  • Significant decrease in tail latency

In the event of network or storage failures, erasure coding allows container images to be reconstructed from partial data without the need for complete replication, thus optimizing storage cost.
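The 4-of-5 numbers follow from simple arithmetic: storing 5 stripes for every 4 stripes of data is 5/4 = 1.25, hence 25% extra storage, and fetching all 5 instead of 4 is likewise 25% more requests. The sketch below uses a single XOR parity stripe, which is the simplest code with the "any 4 of 5" property; the notes do not say which code Lambda uses, so treat the scheme and the function names as assumptions.

```python
def xor_stripes(stripes):
    """Byte-wise XOR of equally sized stripes."""
    out = bytearray(len(stripes[0]))
    for stripe in stripes:
        for i, b in enumerate(stripe):
            out[i] ^= b
    return bytes(out)


def encode_4_of_5(chunk: bytes):
    """Split a chunk into 4 data stripes and append 1 XOR parity stripe."""
    k = 4
    stripe_len = -(-len(chunk) // k)  # ceiling division
    padded = chunk.ljust(stripe_len * k, b"\0")
    data = [padded[i * stripe_len:(i + 1) * stripe_len] for i in range(k)]
    return data + [xor_stripes(data)]


def decode_4_of_5(stripes, original_len):
    """Rebuild the chunk from 5 slots, at most one of which may be None."""
    missing = [i for i, s in enumerate(stripes) if s is None]
    if len(missing) > 1:
        raise ValueError("a 4-of-5 code tolerates only one missing stripe")
    stripes = list(stripes)
    if missing:
        present = [s for s in stripes if s is not None]
        # XOR of the four surviving stripes equals the missing one.
        stripes[missing[0]] = xor_stripes(present)
    return b"".join(stripes[:4])[:original_len]
```

The tail-latency benefit comes from not having to wait for every stripe: a reader can request all five and decode as soon as any four arrive, so one slow or failed cache node no longer delays the whole read.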

The content presented here is a collection of my notes and explanations based on the paper. You can access the full paper On-demand Container Loading in AWS Lambda. This is by no means an exhaustive explanation, and I strongly encourage you to read the actual paper for a comprehensive understanding. Any images you see are either taken directly from the paper or illustrated by me.

Arpit Bhayani

Creator of DiceDB, ex-Google Dataproc, ex-Amazon Fast Data, ex-Director of Engg. SRE and Data Engineering at Unacademy. I spark engineering curiosity through my no-fluff engineering videos on YouTube and my courses

