A 120-line lock-free token bucket in Rust
The rate limiter we run in our internal tooling. ~120 lines of Rust, lock-free, observable, and the failure modes we hit before it stabilised.
Why a token bucket
Of the half-dozen rate-limiting algorithms (fixed window, sliding window, leaky bucket, token bucket, GCRA), the token bucket has the best property for client-facing APIs: it allows bursts up to a configured size, then smooths out. Most users want to fire 5 requests in a row, not 1 every 200ms — a fixed window punishes them, a token bucket doesn't.
We use a Rust implementation in our internal tooling. ~120 lines, no locks on the hot path, exposes Prometheus metrics. This post walks through it.
The algorithm
Every key (user, IP, API key, whatever you bucket on) has:
- A capacity — the max tokens the bucket can hold (the burst size)
- A refill rate — tokens added per second
- A current token count
- A last-refill timestamp
On each request:
- Compute how many tokens have refilled since the last check (`elapsed * refill_rate`).
- Add them, capped at capacity.
- If tokens >= 1, decrement and allow the request.
- Otherwise, deny.
That's it. The whole thing is two arithmetic operations and a comparison.
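As a single-threaded sketch of that arithmetic (floating point for readability; the real implementation uses integers and an atomic, shown later), with illustrative names:

```rust
/// Single-threaded sketch of the token-bucket decision. Returns whether
/// the request is allowed, plus the updated token count.
fn decide(tokens: f64, capacity: f64, rate: f64, elapsed_secs: f64) -> (bool, f64) {
    // Refill for the elapsed time, capped at capacity.
    let refilled = (tokens + elapsed_secs * rate).min(capacity);
    if refilled >= 1.0 {
        (true, refilled - 1.0) // allow: spend one token
    } else {
        (false, refilled) // deny: keep the fractional balance
    }
}
```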
The Rust implementation
The bucket itself is a single AtomicU64 packing two values: the current token count (top 32 bits) and the last-refill timestamp in microseconds since some epoch (bottom 32 bits). We update both atomically with a single CAS loop. No Mutex, no RwLock, no contention.
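A sketch of that packing scheme (helper names are ours, for illustration):

```rust
/// Pack the token count (top 32 bits) and the last-refill timestamp in
/// µs since the shared epoch (bottom 32 bits) into one atomic word.
fn pack(tokens: u32, last_refill_us: u32) -> u64 {
    (tokens as u64) << 32 | last_refill_us as u64
}

/// Inverse of `pack`: returns (tokens, last_refill_us).
fn unpack(state: u64) -> (u32, u32) {
    ((state >> 32) as u32, state as u32)
}
```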
The struct sketch (using inline pseudo-Rust):
- `pub struct TokenBucket { state: AtomicU64, capacity: u32, refill_per_sec: u32, epoch: Instant }`
- `fn try_acquire(&self) -> bool`: the hot path, lock-free
- `fn refill(state: u64, now: u32, capacity: u32, rate: u32) -> u64`: pure function, easy to test
The try_acquire loop:
- Load current state.
- Decode (tokens, last_refill).
- Compute elapsed microseconds since last_refill.
- Compute new_tokens = min(capacity, tokens + elapsed * rate / 1_000_000).
- If new_tokens < 1, return false.
- Encode (new_tokens - 1, now), CAS into state. If CAS fails, retry.
- Return true.
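Put together, a minimal sketch of that loop (field names follow the struct sketch above; the fractional-refill fix described later is elided):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

pub struct TokenBucket {
    state: AtomicU64, // tokens in the top 32 bits, last-refill µs in the bottom 32
    capacity: u32,
    refill_per_sec: u32,
    epoch: Instant,
}

impl TokenBucket {
    pub fn try_acquire(&self) -> bool {
        // u32 µs since the shared epoch; the truncation gives the ~71-minute wrap.
        let now = self.epoch.elapsed().as_micros() as u32;
        let mut state = self.state.load(Ordering::Relaxed);
        loop {
            let tokens = (state >> 32) as u32;
            let last_refill = state as u32;
            // wrapping_sub keeps the delta correct across the u32 wrap.
            let elapsed = now.wrapping_sub(last_refill) as u64;
            let refilled = elapsed * self.refill_per_sec as u64 / 1_000_000;
            let new_tokens =
                (tokens as u64 + refilled).min(self.capacity as u64) as u32;
            if new_tokens < 1 {
                return false; // deny without writing: the refill keeps accruing
            }
            let next = ((new_tokens - 1) as u64) << 32 | now as u64;
            match self.state.compare_exchange_weak(
                state,
                next,
                Ordering::AcqRel,
                Ordering::Relaxed,
            ) {
                Ok(_) => return true,
                Err(actual) => state = actual, // lost the race; retry with fresh state
            }
        }
    }
}
```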
The CAS loop retries under contention, but each retry is a few instructions and takes microseconds at most. We've benchmarked it at >10M ops/sec on a single core.
Edge cases that bit us
Clock drift across cores. Instant::now() on Linux reads CLOCK_MONOTONIC, and on some hardware its readings can disagree slightly between cores. We initially packed an absolute microsecond count and saw negative elapsed values when threads were rescheduled across cores. Fix: derive every timestamp from a single shared epoch (epoch.elapsed()) and store it as a u32 microsecond delta. A u32 microsecond counter wraps every ~71 minutes, so the refill logic handles wrap explicitly.
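Concretely, wrapping subtraction keeps the elapsed delta correct across the u32 boundary, as long as two refills are never more than ~71 minutes apart. A small demonstration:

```rust
#[test]
fn wrapping_elapsed_survives_u32_wrap() {
    // Last refill just before the u32 boundary, "now" just after it.
    let last_refill: u32 = u32::MAX - 100;
    let now: u32 = 50; // 151 µs later in real time
    assert_eq!(now.wrapping_sub(last_refill), 151);
}
```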
Burst at startup. A fresh bucket starts at full capacity, which means an attacker can fire capacity requests in the first millisecond against a fresh process. Fix: optional start_empty flag, default off (we want the burst behavior for normal users), on for known abuse-prone keys.
The "rounded down to zero" trap. With low refill rates (say, 1 token per 60s), elapsed * rate / 1_000_000 rounds to 0 for any elapsed under 1 second. The bucket never refills. Fix: keep the fractional refill in a separate AtomicU32 field, accumulate, only add whole tokens when the fraction crosses 1.
Distributed deployments. A token bucket per-node doesn't enforce a global rate limit. We use the local bucket as the hot path and a Redis-backed bucket as a slower secondary check, only consulted when the local bucket is at < 25% capacity. Cuts Redis load by ~95% in practice.
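A sketch of the two-tier shape. The `tokens()` accessor and `redis_bucket_allow` are illustrative stand-ins, not the real implementation:

```rust
use std::sync::atomic::Ordering;

impl TokenBucket {
    /// Hypothetical accessor: sample the token count from the packed state.
    fn tokens(&self) -> u32 {
        (self.state.load(Ordering::Relaxed) >> 32) as u32
    }
}

/// Stand-in for the Redis-backed secondary bucket (not shown here).
fn redis_bucket_allow(_key: &str) -> bool {
    true
}

/// The local bucket is the hot path; Redis is consulted only when we're
/// running low, which is what cuts its load by ~95%.
fn allow(bucket: &TokenBucket, key: &str) -> bool {
    if !bucket.try_acquire() {
        return false; // local bucket already says no
    }
    if bucket.tokens() < bucket.capacity / 4 {
        return redis_bucket_allow(key); // slower, globally consistent check
    }
    true
}
```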
The test cases that matter
Don't trust a rate limiter you haven't tested under contention. Our test suite includes:
- Rate enforcement. Fire requests at 2x the configured rate from a single thread, assert that approximately half are denied.
- Burst capacity. Fresh bucket, fire `capacity` requests back-to-back, assert all allowed. Then fire one more, assert denied.
- Refill correctness. Drain the bucket, sleep `N / rate` seconds, assert exactly N tokens available.
- Concurrent access. 100 threads, each firing 10K requests. Aggregate the allow/deny counts. The total allowed should be within 1% of the theoretical max.
- Clock-wrap handling. Mock the clock at `u32::MAX - 100ms`, make a request, advance past the wrap, make another. Assert correct refill (this one caught a real bug in our first version).
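Two of those as runnable sketches, assuming a constructor `TokenBucket::new(capacity, refill_per_sec)` that starts the bucket full (the default behavior described above):

```rust
#[cfg(test)]
mod tests {
    use super::*;
    use std::sync::Arc;

    #[test]
    fn burst_capacity() {
        let bucket = TokenBucket::new(5, 1);
        for _ in 0..5 {
            assert!(bucket.try_acquire()); // the full burst is allowed
        }
        assert!(!bucket.try_acquire()); // the 6th request is denied
    }

    #[test]
    fn concurrent_access() {
        // Zero refill makes the bound exact: precisely `capacity` allows.
        let bucket = Arc::new(TokenBucket::new(1_000, 0));
        let handles: Vec<_> = (0..100)
            .map(|_| {
                let b = Arc::clone(&bucket);
                std::thread::spawn(move || {
                    (0..10_000).filter(|_| b.try_acquire()).count()
                })
            })
            .collect();
        let allowed: usize = handles.into_iter().map(|h| h.join().unwrap()).sum();
        assert_eq!(allowed, 1_000);
    }
}
```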
We also fuzz the pure refill function with cargo fuzz, throwing random tokens, capacity, rate, and elapsed values at it. It found one overflow on day 2 (our use of u64 for the multiply's intermediate value was off).
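The fuzz target is small. A sketch, assuming cargo-fuzz and a crate named `rate_limiter` (illustrative) exposing the pure refill function from the struct sketch above:

```rust
// fuzz/fuzz_targets/refill.rs
#![no_main]
use libfuzzer_sys::fuzz_target;

fuzz_target!(|input: (u64, u32, u32, u32)| {
    let (state, now, capacity, rate) = input;
    // The property under test: never panic, never overflow, for any input.
    let _ = rate_limiter::refill(state, now, capacity, rate);
});
```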
Observability
Three metrics:
- `rate_limit_acquire_total{key,result}`: counter, allow/deny
- `rate_limit_tokens{key}`: gauge, current tokens (sampled)
- `rate_limit_acquire_duration_seconds`: histogram, the hot-path latency
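A sketch of the wiring, assuming the `prometheus` crate; registration happens once at startup, and the gauge is sampled on a separate loop (omitted here):

```rust
use prometheus::{register_histogram, register_int_counter_vec, Histogram, IntCounterVec};

struct Metrics {
    acquires: IntCounterVec,
    latency: Histogram,
}

impl Metrics {
    fn new() -> prometheus::Result<Self> {
        Ok(Self {
            acquires: register_int_counter_vec!(
                "rate_limit_acquire_total",
                "Rate limiter decisions by key and result",
                &["key", "result"]
            )?,
            latency: register_histogram!(
                "rate_limit_acquire_duration_seconds",
                "Hot-path latency of try_acquire"
            )?,
        })
    }

    /// Wraps the hot path: time the acquire, then count the decision.
    fn record(&self, bucket: &TokenBucket, key: &str) -> bool {
        let timer = self.latency.start_timer();
        let allowed = bucket.try_acquire();
        timer.observe_duration();
        self.acquires
            .with_label_values(&[key, if allowed { "allow" } else { "deny" }])
            .inc();
        allowed
    }
}
```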
The duration histogram caught a regression where we accidentally introduced a HashMap lookup in the hot path. p99 went from 200ns to 4μs — invisible in average dashboards, glaring in p99.
When to use it
- API gateways. Per-key rate limiting at the edge.
- Background workers. Don't let a bug hammer downstream systems.
- LLM proxies. Token-cost-based rate limiting (each request costs N tokens proportional to its size; see the sketch after this list).
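For the LLM-proxy case, the weighted variant is the same CAS loop with a different decrement. A sketch, with `try_acquire_n` as our illustrative name:

```rust
use std::sync::atomic::Ordering;

impl TokenBucket {
    /// Spend `cost` tokens at once, e.g. proportional to an LLM prompt's size.
    pub fn try_acquire_n(&self, cost: u32) -> bool {
        let now = self.epoch.elapsed().as_micros() as u32;
        let mut state = self.state.load(Ordering::Relaxed);
        loop {
            let tokens = (state >> 32) as u32;
            let elapsed = now.wrapping_sub(state as u32) as u64;
            let refilled = elapsed * self.refill_per_sec as u64 / 1_000_000;
            let new_tokens =
                (tokens as u64 + refilled).min(self.capacity as u64) as u32;
            if new_tokens < cost {
                return false; // not enough budget for a request this big
            }
            let next = ((new_tokens - cost) as u64) << 32 | now as u64;
            match self.state.compare_exchange_weak(
                state, next, Ordering::AcqRel, Ordering::Relaxed,
            ) {
                Ok(_) => return true,
                Err(actual) => state = actual,
            }
        }
    }
}
```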
When *not* to use it:
- Strict global limits across many nodes — use a shared store (Redis/DynamoDB) as the source of truth, not a hint.
- Per-second precision on cold-start traffic — the bucket model is best for sustained traffic.
If you want this kind of work in your stack, our Embedded Engineer subscription covers exactly this sort of one-ticket-at-a-time infra work: async, $5,995/mo, pause anytime.