How DynamoDB storage internals work
DynamoDB is a hash table whose values are sorted trees, spread across machines in three Availability Zones. Two data structures do almost all the work: a hash on the picks a machine, a B-tree on the orders items inside it.
How does DynamoDB store data?
DynamoDB stores your data as a giant distributed hash table whose values are sorted B-trees. A hash on the picks one storage node; a B-tree orders items by sort key inside it. Every write replicates synchronously to three nodes across three Availability Zones before it acknowledges.
- The partition key is hashed, not searched. DynamoDB runs a hash function
over your
PKto find the one storage node holding that partition — an O(1) jump, regardless of table size. - The sort key lives in a B-tree. Inside a partition, items are stored in a
B-tree ordered lexicographically by
SK, which is why range reads (begins_with,between) are cheap andScanis not. - Every write hits three nodes before it acks. A write goes to a leader, which
replicates synchronously to two peers in other AZs — durability is bought before
your
PutItemreturns. - This is why the access-pattern rules exist. Hash-then-tree is fast only when you read by key. No key, no fast path — you fall back to scanning the tree.
Start with the data structure, not the API
Coming from SQL, you picture a table as rows on disk with a query planner picking indexes. DynamoDB has no planner. The storage layout is the contract — what's fast and what's a footgun both fall straight out of two structures.
Picture a giant distributed map. The key is a hash of your partition key. The value isn't a single item — it's a whole B-tree of items that share that partition key, ordered by sort key.
Everything else — query semantics, the 10 GB warnings, why a missing key forces
a Scan — is a consequence of that one sentence.
Hash the partition key to find the node
When a request arrives, DynamoDB applies an internal hash function to the partition-key value. The hash deterministically maps to one storage node — the physical partition that owns those items. AWS documents this as the mechanism behind constant-time key lookups regardless of table size.
That's the O(1) step. A 10 TB table and a 10 KB table cost the same to locate: hash, jump, done. There's no index scan to find the node, no statistics, no plan.
The catch is the flip side. If you don't supply the partition key, DynamoDB has
no node to jump to — it has to walk every partition. That's a Scan, and it's
the difference between O(1) and reading the whole table.
The request hashes to exactly one partition, then descends that partition's sort-key B-tree to the item — two cheap steps instead of a table walk.
Order items in a per-partition B-tree
Inside a single partition, items aren't a heap. They're held in a B-tree keyed
by the sort key, ordered lexicographically. A B-tree search is O(log n), and
crucially that n is the items in one partition, not the whole table.
This is the entire reason sort-key range reads are cheap. Take a fleet-telemetry table where each device's readings live under one partition key:
| PK | SK |
|---|---|
| PK = DEVICE#a91 | SK = READING#2026-06-23T08:00Z |
| PK = DEVICE#a91 | SK = READING#2026-06-23T08:05Z |
| PK = DEVICE#a91 | SK = READING#2026-06-23T08:10Z |
Because the B-tree is sorted, "all readings between 08:00 and 09:00" is a tree descent to the start value plus a sequential walk — not a filter over every reading the device ever sent. You read only the matched range.
That ordering is also why a begins_with(SK, "READING#2026-06-23") query is fast
while filtering on a non-key attribute is not. The tree can seek by SK; it
can't seek by anything else. To compose those key conditions safely, build them
in the DynamoDB Expression Builder rather
than hand-concatenating strings:
KeyConditionExpression PK = :pk AND begins_with(SK, :day)
Replicate every write to three AZs
A partition isn't one machine. Each one is replicated across three nodes in three Availability Zones — a design lineage that traces back to the 2007 Amazon Dynamo paper's replication model.
One node is the leader for the partition. A write goes to the leader, which
writes locally and replicates to its two peers. The leader acks the write once a
durable quorum of nodes has it — so durability across AZs is paid for before
your PutItem returns.
Reads have a choice. A read goes to the leader and sees the latest committed write. An read can be served by any of the three nodes, one of which may be a few milliseconds behind — that's the lag you trade for cheaper, more available reads.
| Eventually consistent | Strongly consistent | |
|---|---|---|
| Served by | Any of the 3 nodes | Leader node only |
| Sees latest write | Maybe (small lag) | Always |
| RCU cost | Half | Full |
| Availability | Higher | Lower (single node) |
The same async-propagation idea is why a GSI read can be stale — see GSIs are eventually consistent.
Read the rules off the structure
Almost every DynamoDB "rule" is just storage physics:
- Always supply the partition key. No key, no hash target — you're scanning the whole map. This is the core of Query vs Scan.
- Co-locate what you read together under one partition key, so a single hash + tree-walk returns the whole item collection. That's the foundation of single-table design.
- Keep partitions bounded. One partition is one B-tree on a finite set of nodes; a runaway hot key or an LSI's 10 GB cap are both limits of that physical partition.
Once you see the hash-then-B-tree shape, the access-pattern discipline stops feeling arbitrary — you're just keeping every read on the fast path.
Next steps
Model your keys to match the structure with sort key strategies and single-table design, then assemble the actual expressions in the DynamoDB Expression Builder. Try DynoTable to watch these reads run against your own tables and see exactly which items a key condition pulls back.