How DynamoDB storage internals work

DynamoDB is a hash table whose values are sorted trees, spread across machines in three Availability Zones. Two data structures do almost all the work: a hash on the picks a machine, a B-tree on the orders items inside it.

How does DynamoDB store data?

DynamoDB stores your data as a giant distributed hash table whose values are sorted B-trees. A hash on the picks one storage node; a B-tree orders items by sort key inside it. Every write replicates synchronously to three nodes across three Availability Zones before it acknowledges.

The partition key is hashed, not searched. DynamoDB runs a hash function over your PK to find the one storage node holding that partition — an O(1) jump, regardless of table size.
The sort key lives in a B-tree. Inside a partition, items are stored in a B-tree ordered lexicographically by SK, which is why range reads (begins_with, between) are cheap and Scan is not.
Every write hits three nodes before it acks. A write goes to a leader, which replicates synchronously to two peers in other AZs — durability is bought before your PutItem returns.
This is why the access-pattern rules exist. Hash-then-tree is fast only when you read by key. No key, no fast path — you fall back to scanning the tree.

Start with the data structure, not the API

Coming from SQL, you picture a table as rows on disk with a query planner picking indexes. DynamoDB has no planner. The storage layout is the contract — what's fast and what's a footgun both fall straight out of two structures.

Picture a giant distributed map. The key is a hash of your partition key. The value isn't a single item — it's a whole B-tree of items that share that partition key, ordered by sort key.

Everything else — query semantics, the 10 GB warnings, why a missing key forces a Scan — is a consequence of that one sentence.

Hash the partition key to find the node

When a request arrives, DynamoDB applies an internal hash function to the partition-key value. The hash deterministically maps to one storage node — the physical partition that owns those items. AWS documents this as the mechanism behind constant-time key lookups regardless of table size.

That's the O(1) step. A 10 TB table and a 10 KB table cost the same to locate: hash, jump, done. There's no index scan to find the node, no statistics, no plan.

The catch is the flip side. If you don't supply the partition key, DynamoDB has no node to jump to — it has to walk every partition. That's a Scan, and it's the difference between O(1) and reading the whole table.

The request hashes to exactly one partition, then descends that partition's sort-key B-tree to the item — two cheap steps instead of a table walk.

Order items in a per-partition B-tree

Inside a single partition, items aren't a heap. They're held in a B-tree keyed by the sort key, ordered lexicographically. A B-tree search is O(log n), and crucially that n is the items in one partition, not the whole table.

This is the entire reason sort-key range reads are cheap. Take a fleet-telemetry table where each device's readings live under one partition key:

PK	SK
PK = DEVICE#a91	SK = READING#2026-06-23T08:00Z
PK = DEVICE#a91	SK = READING#2026-06-23T08:05Z
PK = DEVICE#a91	SK = READING#2026-06-23T08:10Z

Because the B-tree is sorted, "all readings between 08:00 and 09:00" is a tree descent to the start value plus a sequential walk — not a filter over every reading the device ever sent. You read only the matched range.

That ordering is also why a begins_with(SK, "READING#2026-06-23") query is fast while filtering on a non-key attribute is not. The tree can seek by SK; it can't seek by anything else. To compose those key conditions safely, build them in the DynamoDB Expression Builder rather than hand-concatenating strings:

KeyConditionExpression  PK = :pk AND begins_with(SK, :day)

Replicate every write to three AZs

A partition isn't one machine. Each one is replicated across three nodes in three Availability Zones — a design lineage that traces back to the 2007 Amazon Dynamo paper's replication model.

One node is the leader for the partition. A write goes to the leader, which writes locally and replicates to its two peers. The leader acks the write once a durable quorum of nodes has it — so durability across AZs is paid for before your PutItem returns.

Reads have a choice. A read goes to the leader and sees the latest committed write. An read can be served by any of the three nodes, one of which may be a few milliseconds behind — that's the lag you trade for cheaper, more available reads.

	Eventually consistent	Strongly consistent
Served by	Any of the 3 nodes	Leader node only
Sees latest write	Maybe (small lag)	Always
RCU cost	Half	Full
Availability	Higher	Lower (single node)

The same async-propagation idea is why a GSI read can be stale — see GSIs are eventually consistent.

Read the rules off the structure

Almost every DynamoDB "rule" is just storage physics:

Always supply the partition key. No key, no hash target — you're scanning the whole map. This is the core of Query vs Scan.
Co-locate what you read together under one partition key, so a single hash + tree-walk returns the whole item collection. That's the foundation of single-table design.
Keep partitions bounded. One partition is one B-tree on a finite set of nodes; a runaway hot key or an LSI's 10 GB cap are both limits of that physical partition.

Once you see the hash-then-B-tree shape, the access-pattern discipline stops feeling arbitrary — you're just keeping every read on the fast path.

Next steps

Model your keys to match the structure with sort key strategies and single-table design, then assemble the actual expressions in the DynamoDB Expression Builder. Try DynoTable to watch these reads run against your own tables and see exactly which items a key condition pulls back.