A 120-line lock-free token bucket in Rust
The rate limiter we run in our internal tooling. ~120 lines of Rust, lock-free, observable, and the failure modes we hit before it stabilised.
Why a token bucket
Of the half-dozen rate-limiting algorithms (fixed window, sliding window, leaky bucket, token bucket, GCRA), the token bucket has the best property for client-facing APIs: it allows bursts up to a configured size, then smooths out. Most users want to fire 5 requests in a row, not 1 every 200ms — a fixed window punishes them, a token bucket doesn't.
We use a Rust implementation in our internal tooling. ~120 lines, no locks on the hot path, exposes Prometheus metrics. This post walks through it.
The algorithm
Every key (user, IP, API key, whatever you bucket on) has:
- A capacity — the max tokens the bucket can hold (the burst size)
- A refill rate — tokens added per second
- A current token count
- A last-refill timestamp
On each request:
- Compute how many tokens have refilled since the last check (`elapsed * refill_rate`).
- Add them, capped at capacity.
- If tokens >= 1, decrement and allow the request.
- Otherwise, deny.
That's it. The whole thing is two arithmetic operations and a comparison.
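As a single-threaded sketch of that arithmetic (floating point for readability; the real implementation uses integers and an atomic, shown later), with illustrative names:

```rust
/// Single-threaded sketch of the token-bucket decision. Returns whether
/// the request is allowed, plus the updated token count.
fn decide(tokens: f64, capacity: f64, rate: f64, elapsed_secs: f64) -> (bool, f64) {
    // Refill for the elapsed time, capped at capacity.
    let refilled = (tokens + elapsed_secs * rate).min(capacity);
    if refilled >= 1.0 {
        (true, refilled - 1.0) // allow: spend one token
    } else {
        (false, refilled) // deny: keep the fractional balance
    }
}
```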
The Rust implementation
The bucket itself is a single AtomicU64 packing two values: the current token count (top 32 bits) and the last-refill timestamp in microseconds since some epoch (bottom 32 bits). We update both atomically with a single CAS loop. No Mutex, no RwLock, no contention.
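A sketch of that packing scheme (helper names are ours, for illustration):

```rust
/// Pack the token count (top 32 bits) and the last-refill timestamp in
/// µs since the shared epoch (bottom 32 bits) into one atomic word.
fn pack(tokens: u32, last_refill_us: u32) -> u64 {
    (tokens as u64) << 32 | last_refill_us as u64
}

/// Inverse of `pack`: returns (tokens, last_refill_us).
fn unpack(state: u64) -> (u32, u32) {
    ((state >> 32) as u32, state as u32)
}
```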
The struct sketch (using inline pseudo-Rust):
- `pub struct TokenBucket { state: AtomicU64, capacity: u32, refill_per_sec: u32, epoch: Instant }`
- `fn try_acquire(&self) -> bool`: the hot path, lock-free
- `fn refill(state: u64, now: u32, capacity: u32, rate: u32) -> u64`: pure function, easy to test
The try_acquire loop:
- Load current state.
- Decode (tokens, last_refill).
- Compute elapsed microseconds since last_refill.
- Compute new_tokens = min(capacity, tokens + elapsed * rate / 1_000_000).
- If new_tokens < 1, return false.
- Encode (new_tokens - 1, now), CAS into state. If CAS fails, retry.
- Return true.
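Put together, a minimal sketch of that loop (field names follow the struct sketch above; the fractional-refill fix described later is elided):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

pub struct TokenBucket {
    state: AtomicU64, // tokens in the top 32 bits, last-refill µs in the bottom 32
    capacity: u32,
    refill_per_sec: u32,
    epoch: Instant,
}

impl TokenBucket {
    pub fn try_acquire(&self) -> bool {
        // u32 µs since the shared epoch; the truncation gives the ~71-minute wrap.
        let now = self.epoch.elapsed().as_micros() as u32;
        let mut state = self.state.load(Ordering::Relaxed);
        loop {
            let tokens = (state >> 32) as u32;
            let last_refill = state as u32;
            // wrapping_sub keeps the delta correct across the u32 wrap.
            let elapsed = now.wrapping_sub(last_refill) as u64;
            let refilled = elapsed * self.refill_per_sec as u64 / 1_000_000;
            let new_tokens =
                (tokens as u64 + refilled).min(self.capacity as u64) as u32;
            if new_tokens < 1 {
                return false; // deny without writing: the refill keeps accruing
            }
            let next = ((new_tokens - 1) as u64) << 32 | now as u64;
            match self.state.compare_exchange_weak(
                state,
                next,
                Ordering::AcqRel,
                Ordering::Relaxed,
            ) {
                Ok(_) => return true,
                Err(actual) => state = actual, // lost the race; retry with fresh state
            }
        }
    }
}
```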
The CAS loop retries under contention, but each retry is a few instructions and takes microseconds at most. We've benchmarked it at >10M ops/sec on a single core.
Edge cases that bit us
Clock drift across cores. Instant::now() on Linux reads CLOCK_MONOTONIC, and on some hardware its readings can disagree slightly between cores. We initially packed an absolute microsecond count and saw negative elapsed values when threads were rescheduled across cores. Fix: derive every timestamp from a single shared epoch (epoch.elapsed()) and store it as a u32 microsecond delta. A u32 microsecond counter wraps every ~71 minutes, so the refill logic handles wrap explicitly.
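Concretely, wrapping subtraction keeps the elapsed delta correct across the u32 boundary, as long as two refills are never more than ~71 minutes apart. A small demonstration:

```rust
#[test]
fn wrapping_elapsed_survives_u32_wrap() {
    // Last refill just before the u32 boundary, "now" just after it.
    let last_refill: u32 = u32::MAX - 100;
    let now: u32 = 50; // 151 µs later in real time
    assert_eq!(now.wrapping_sub(last_refill), 151);
}
```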
Burst at startup. A fresh bucket starts at full capacity, which means an attacker can fire capacity requests in the first millisecond against a fresh process. Fix: optional start_empty flag, default off (we want the burst behavior for normal users), on for known abuse-prone keys.
The "rounded down to zero" trap. With low refill rates (say, 1 token per 60s), elapsed * rate / 1_000_000 rounds to 0 for any elapsed under 1 second. The bucket never refills. Fix: keep the fractional refill in a separate AtomicU32 field, accumulate, only add whole tokens when the fraction crosses 1.
Distributed deployments. A token bucket per-node doesn't enforce a global rate limit. We use the local bucket as the hot path and a Redis-backed bucket as a slower secondary check, only consulted when the local bucket is at < 25% capacity. Cuts Redis load by ~95% in practice.
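A sketch of the two-tier shape. The `tokens()` accessor and `redis_bucket_allow` are illustrative stand-ins, not the real implementation:

```rust
use std::sync::atomic::Ordering;

impl TokenBucket {
    /// Hypothetical accessor: sample the token count from the packed state.
    fn tokens(&self) -> u32 {
        (self.state.load(Ordering::Relaxed) >> 32) as u32
    }
}

/// Stand-in for the Redis-backed secondary bucket (not shown here).
fn redis_bucket_allow(_key: &str) -> bool {
    true
}

/// The local bucket is the hot path; Redis is consulted only when we're
/// running low, which is what cuts its load by ~95%.
fn allow(bucket: &TokenBucket, key: &str) -> bool {
    if !bucket.try_acquire() {
        return false; // local bucket already says no
    }
    if bucket.tokens() < bucket.capacity / 4 {
        return redis_bucket_allow(key); // slower, globally consistent check
    }
    true
}
```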
The test cases that matter
Don't trust a rate limiter you haven't tested under contention. Our test suite includes:
- Rate enforcement. Fire requests at 2x the configured rate from a single thread, assert that approximately half are denied.
- Burst capacity. Fresh bucket, fire `capacity` requests back-to-back, assert all allowed. Then fire one more, assert denied.
- Refill correctness. Drain the bucket, sleep `N / rate` seconds, assert exactly N tokens available.
- Concurrent access. 100 threads, each firing 10K requests. Aggregate the allow/deny counts. The total allowed should be within 1% of the theoretical max.
- Clock-wrap handling. Mock the clock at `u32::MAX - 100ms`, make a request, advance past the wrap, make another. Assert correct refill (this one caught a real bug in our first version).
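Two of those as runnable sketches, assuming a constructor `TokenBucket::new(capacity, refill_per_sec)` that starts the bucket full (the default behavior described above):

```rust
#[cfg(test)]
mod tests {
    use super::*;
    use std::sync::Arc;

    #[test]
    fn burst_capacity() {
        let bucket = TokenBucket::new(5, 1);
        for _ in 0..5 {
            assert!(bucket.try_acquire()); // the full burst is allowed
        }
        assert!(!bucket.try_acquire()); // the 6th request is denied
    }

    #[test]
    fn concurrent_access() {
        // Zero refill makes the bound exact: precisely `capacity` allows.
        let bucket = Arc::new(TokenBucket::new(1_000, 0));
        let handles: Vec<_> = (0..100)
            .map(|_| {
                let b = Arc::clone(&bucket);
                std::thread::spawn(move || {
                    (0..10_000).filter(|_| b.try_acquire()).count()
                })
            })
            .collect();
        let allowed: usize = handles.into_iter().map(|h| h.join().unwrap()).sum();
        assert_eq!(allowed, 1_000);
    }
}
```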
We also fuzz the pure refill function with cargo fuzz, throwing random tokens, capacity, rate, and elapsed values at it. It found one overflow on day 2 (our use of u64 for the multiply's intermediate value was off).
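The fuzz target is small. A sketch, assuming cargo-fuzz and a crate named `rate_limiter` (illustrative) exposing the pure refill function from the struct sketch above:

```rust
// fuzz/fuzz_targets/refill.rs
#![no_main]
use libfuzzer_sys::fuzz_target;

fuzz_target!(|input: (u64, u32, u32, u32)| {
    let (state, now, capacity, rate) = input;
    // The property under test: never panic, never overflow, for any input.
    let _ = rate_limiter::refill(state, now, capacity, rate);
});
```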
Observability
Three metrics:
- `rate_limit_acquire_total{key,result}`: counter, allow/deny
- `rate_limit_tokens{key}`: gauge, current tokens (sampled)
- `rate_limit_acquire_duration_seconds`: histogram, the hot-path latency
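A sketch of the wiring, assuming the `prometheus` crate; registration happens once at startup, and the gauge is sampled on a separate loop (omitted here):

```rust
use prometheus::{register_histogram, register_int_counter_vec, Histogram, IntCounterVec};

struct Metrics {
    acquires: IntCounterVec,
    latency: Histogram,
}

impl Metrics {
    fn new() -> prometheus::Result<Self> {
        Ok(Self {
            acquires: register_int_counter_vec!(
                "rate_limit_acquire_total",
                "Rate limiter decisions by key and result",
                &["key", "result"]
            )?,
            latency: register_histogram!(
                "rate_limit_acquire_duration_seconds",
                "Hot-path latency of try_acquire"
            )?,
        })
    }

    /// Wraps the hot path: time the acquire, then count the decision.
    fn record(&self, bucket: &TokenBucket, key: &str) -> bool {
        let timer = self.latency.start_timer();
        let allowed = bucket.try_acquire();
        timer.observe_duration();
        self.acquires
            .with_label_values(&[key, if allowed { "allow" } else { "deny" }])
            .inc();
        allowed
    }
}
```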
The duration histogram caught a regression where we accidentally introduced a HashMap lookup in the hot path. p99 went from 200ns to 4μs — invisible in average dashboards, glaring in p99.
When to use it
- API gateways. Per-key rate limiting at the edge.
- Background workers. Don't let a bug hammer downstream systems.
- LLM proxies. Token-cost-based rate limiting (each request costs N tokens proportional to its size; see the sketch after this list).
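For the LLM-proxy case, the weighted variant is the same CAS loop with a different decrement. A sketch, with `try_acquire_n` as our illustrative name:

```rust
use std::sync::atomic::Ordering;

impl TokenBucket {
    /// Spend `cost` tokens at once, e.g. proportional to an LLM prompt's size.
    pub fn try_acquire_n(&self, cost: u32) -> bool {
        let now = self.epoch.elapsed().as_micros() as u32;
        let mut state = self.state.load(Ordering::Relaxed);
        loop {
            let tokens = (state >> 32) as u32;
            let elapsed = now.wrapping_sub(state as u32) as u64;
            let refilled = elapsed * self.refill_per_sec as u64 / 1_000_000;
            let new_tokens =
                (tokens as u64 + refilled).min(self.capacity as u64) as u32;
            if new_tokens < cost {
                return false; // not enough budget for a request this big
            }
            let next = ((new_tokens - cost) as u64) << 32 | now as u64;
            match self.state.compare_exchange_weak(
                state, next, Ordering::AcqRel, Ordering::Relaxed,
            ) {
                Ok(_) => return true,
                Err(actual) => state = actual,
            }
        }
    }
}
```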
When *not* to use it:
- Strict global limits across many nodes — use a shared store (Redis/DynamoDB) as the source of truth, not a hint.
- Per-second precision on cold-start traffic — the bucket model is best for sustained traffic.
If you want this kind of work in your stack, our Embedded Engineer subscription covers exactly this sort of one-ticket-at-a-time infra work: async, $5,995/mo, pause anytime.