Advanced7 min read

How DynamoDB storage internals work

DynamoDB is a hash table whose values are sorted trees, spread across machines in three Availability Zones. Two data structures do almost all the work: a hash on the picks a machine, a B-tree on the orders items inside it.

How does DynamoDB store data?

DynamoDB stores your data as a giant distributed hash table whose values are sorted B-trees. A hash on the picks one storage node; a B-tree orders items by sort key inside it. Every write replicates synchronously to three nodes across three Availability Zones before it acknowledges.

  • The partition key is hashed, not searched. DynamoDB runs a hash function over your PK to find the one storage node holding that partition — an O(1) jump, regardless of table size.
  • The sort key lives in a B-tree. Inside a partition, items are stored in a B-tree ordered lexicographically by SK, which is why range reads (begins_with, between) are cheap and Scan is not.
  • Every write hits three nodes before it acks. A write goes to a leader, which replicates synchronously to two peers in other AZs — durability is bought before your PutItem returns.
  • This is why the access-pattern rules exist. Hash-then-tree is fast only when you read by key. No key, no fast path — you fall back to scanning the tree.

Start with the data structure, not the API

Coming from SQL, you picture a table as rows on disk with a query planner picking indexes. DynamoDB has no planner. The storage layout is the contract — what's fast and what's a footgun both fall straight out of two structures.

Picture a giant distributed map. The key is a hash of your partition key. The value isn't a single item — it's a whole B-tree of items that share that partition key, ordered by sort key.

Everything else — query semantics, the 10 GB warnings, why a missing key forces a Scan — is a consequence of that one sentence.

Hash the partition key to find the node

When a request arrives, DynamoDB applies an internal hash function to the partition-key value. The hash deterministically maps to one storage node — the physical partition that owns those items. AWS documents this as the mechanism behind constant-time key lookups regardless of table size.

That's the O(1) step. A 10 TB table and a 10 KB table cost the same to locate: hash, jump, done. There's no index scan to find the node, no statistics, no plan.

The catch is the flip side. If you don't supply the partition key, DynamoDB has no node to jump to — it has to walk every partition. That's a Scan, and it's the difference between O(1) and reading the whole table.

PutItem / GetItemPK=DEVICE#a91Hash(PK)Storage node(one partition)B-tree on SKREADING#... orderedItem

The request hashes to exactly one partition, then descends that partition's sort-key B-tree to the item — two cheap steps instead of a table walk.

Order items in a per-partition B-tree

Inside a single partition, items aren't a heap. They're held in a B-tree keyed by the sort key, ordered lexicographically. A B-tree search is O(log n), and crucially that n is the items in one partition, not the whole table.

This is the entire reason sort-key range reads are cheap. Take a fleet-telemetry table where each device's readings live under one partition key:

PKSK
PK = DEVICE#a91SK = READING#2026-06-23T08:00Z
PK = DEVICE#a91SK = READING#2026-06-23T08:05Z
PK = DEVICE#a91SK = READING#2026-06-23T08:10Z

Because the B-tree is sorted, "all readings between 08:00 and 09:00" is a tree descent to the start value plus a sequential walk — not a filter over every reading the device ever sent. You read only the matched range.

That ordering is also why a begins_with(SK, "READING#2026-06-23") query is fast while filtering on a non-key attribute is not. The tree can seek by SK; it can't seek by anything else. To compose those key conditions safely, build them in the DynamoDB Expression Builder rather than hand-concatenating strings:

KeyConditionExpression  PK = :pk AND begins_with(SK, :day)

Replicate every write to three AZs

A partition isn't one machine. Each one is replicated across three nodes in three Availability Zones — a design lineage that traces back to the 2007 Amazon Dynamo paper's replication model.

One node is the leader for the partition. A write goes to the leader, which writes locally and replicates to its two peers. The leader acks the write once a durable quorum of nodes has it — so durability across AZs is paid for before your PutItem returns.

Reads have a choice. A read goes to the leader and sees the latest committed write. An read can be served by any of the three nodes, one of which may be a few milliseconds behind — that's the lag you trade for cheaper, more available reads.

Eventually consistentStrongly consistent
Served byAny of the 3 nodesLeader node only
Sees latest writeMaybe (small lag)Always
RCU costHalfFull
AvailabilityHigherLower (single node)

The same async-propagation idea is why a GSI read can be stale — see GSIs are eventually consistent.

Read the rules off the structure

Almost every DynamoDB "rule" is just storage physics:

  • Always supply the partition key. No key, no hash target — you're scanning the whole map. This is the core of Query vs Scan.
  • Co-locate what you read together under one partition key, so a single hash + tree-walk returns the whole item collection. That's the foundation of single-table design.
  • Keep partitions bounded. One partition is one B-tree on a finite set of nodes; a runaway hot key or an LSI's 10 GB cap are both limits of that physical partition.

Once you see the hash-then-B-tree shape, the access-pattern discipline stops feeling arbitrary — you're just keeping every read on the fast path.

Next steps

Model your keys to match the structure with sort key strategies and single-table design, then assemble the actual expressions in the DynamoDB Expression Builder. Try DynoTable to watch these reads run against your own tables and see exactly which items a key condition pulls back.

Updated