Chapter 3: Rate Limiting
An agent with the right permissions can still cause problems if it runs out of control: calling the same tool thousands of times, burning through API quotas, or running up costs. Rate limiting caps how often an agent can act.
This chapter covers two approaches:
| Section | Topic |
|---|---|
| max_tool_calls | A simple hard cap defined in YAML |
| TokenBucket | A per-second rate limiter for bursty workloads |
| Which to use? | Choosing the right strategy |
| Try it yourself | Exercises |
The problem
Without rate limiting, even a well-intentioned agent can:
- Burn API quotas: calling a search API 10,000 times in a loop
- Run up costs: each LLM call costs money, and a runaway agent multiplies that cost fast
- Overload downstream services: a database can only handle so many queries per second
Approach 1: max_tool_calls in YAML
The simplest approach. Add a max_tool_calls limit to the policy defaults:
version: "1.0"
name: rate-limit-policy
description: Policy that limits how many tool calls an agent can make
rules:
  - name: block-delete-database
    condition:
      field: tool_name
      operator: eq
      value: delete_database
    action: deny
    priority: 100
    message: "Deleting databases is not allowed"
defaults:
  action: allow
  max_tool_calls: 3
The key line is max_tool_calls: 3. The evaluator does not enforce this limit automatically; it is metadata that your application reads and enforces:
from agent_os.policies.schema import PolicyDocument

policy = PolicyDocument.from_yaml("03_rate_limit_policy.yaml")
max_calls = policy.defaults.max_tool_calls  # 3

# agent_tasks is whatever iterable of pending tasks your application provides
call_count = 0
for task in agent_tasks:
    if call_count >= max_calls:
        print("Limit reached, stopping agent")
        break
    call_count += 1
    # ... execute the task
Example output
Call 1: ✅ ALLOWED (1/3 used)
Call 2: ✅ ALLOWED (2/3 used)
Call 3: ✅ ALLOWED (3/3 used)
Call 4: 🚫 DENIED - limit of 3 calls reached
Call 5: 🚫 DENIED - limit of 3 calls reached
After three calls, the agent is stopped. Simple and predictable.
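The counting logic above can also be wrapped in a small helper so the cap travels with the agent loop. The following is a minimal sketch in plain Python, not part of agent_os: `CallBudget`, `try_acquire`, and `run_calls` are hypothetical names, and the literal 3 stands in for the value read from `policy.defaults.max_tool_calls`.

```python
class CallBudget:
    """Hypothetical helper: a hard cap on the total number of tool calls."""

    def __init__(self, max_calls: int) -> None:
        if max_calls < 0:
            raise ValueError("max_calls must be non-negative")
        self.max_calls = max_calls
        self.used = 0

    def try_acquire(self) -> bool:
        """Count one call if budget remains; return False once the cap is hit."""
        if self.used >= self.max_calls:
            return False
        self.used += 1
        return True


def run_calls(budget: CallBudget, attempts: int) -> list[str]:
    """Attempt `attempts` calls and report the decision for each one."""
    lines = []
    for n in range(1, attempts + 1):
        if budget.try_acquire():
            lines.append(f"Call {n}: ALLOWED ({budget.used}/{budget.max_calls} used)")
        else:
            lines.append(f"Call {n}: DENIED - limit of {budget.max_calls} calls reached")
    return lines


if __name__ == "__main__":
    # 3 stands in for policy.defaults.max_tool_calls from the YAML above.
    for line in run_calls(CallBudget(max_calls=3), attempts=5):
        print(line)
```

Keeping the counter inside one object means a denied call never increments it, so the budget cannot drift past the configured cap.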
Approach 2: TokenBucket for per-second limits
max_tool_calls is a total cap. But sometimes you want to allow many calls over time, just not all at once. That's where a token bucket helps.
Think of it like a vending machine that holds 3 coins:
- Each request costs 1 coin
- Coins refill at a steady rate (e.g., 1 per second)
- If there are no coins left, the request is denied until one refills
from agent_os.policies.rate_limiting import RateLimitConfig, TokenBucket

# Allow bursts of 3, refilling 1 token per second
config = RateLimitConfig(capacity=3, refill_rate=1.0)
bucket = TokenBucket.from_config(config)

# Try to make a request
if bucket.consume():
    print("Request allowed")
else:
    wait = bucket.time_until_available()
    print(f"Rate limited, retry in {wait:.1f}s")
Example output
Bucket: capacity=3, refill_rate=1.0/sec
Starting tokens: 3
Request 1: ✅ ALLOWED (2 tokens left)
Request 2: ✅ ALLOWED (1 tokens left)
Request 3: ✅ ALLOWED (0 tokens left)
Request 4: 🚫 DENIED - retry in 1.0s
Request 5: 🚫 DENIED - retry in 1.0s
The first three requests go through immediately (burst). After that, requests are denied until tokens refill. If you wait one second, another request will be allowed.
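Instead of dropping a denied request, a caller can wait for the refill and try again. The sketch below assumes only the two methods used in the snippet above, `consume()` and `time_until_available()`, with the latter returning seconds. `consume_with_retry`, `max_wait`, and the injectable `sleep` parameter are hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Protocol


class Bucket(Protocol):
    """Only the two methods used in the example above are assumed."""

    def consume(self) -> bool: ...

    def time_until_available(self) -> float: ...


def consume_with_retry(
    bucket: Bucket,
    max_wait: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Wait for a token, giving up once total sleep would exceed max_wait."""
    waited = 0.0
    while not bucket.consume():
        delay = bucket.time_until_available()
        if delay <= 0:
            delay = 0.01  # avoid a busy loop if the bucket reports no wait
        if waited + delay > max_wait:
            return False  # caller decides whether to drop or queue the request
        sleep(delay)
        waited += delay
    return True
```

Passing `sleep` as a parameter keeps the helper testable without real delays. In an application you would call `consume_with_retry(bucket)` with the `TokenBucket` created earlier.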
How the token bucket works
Time 0.0s [●●●] 3/3 tokens → Request 1: consume → [●●○]
Time 0.0s [●●○] 2/3 tokens → Request 2: consume → [●○○]
Time 0.0s [●○○] 1/3 tokens → Request 3: consume → [○○○]
Time 0.0s [○○○] 0/3 tokens → Request 4: DENIED
Time 1.0s [●○○] 1/3 tokens → (1 token refilled)
Time 2.0s [●●○] 2/3 tokens → (another refilled)
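The timeline above can be reproduced with a small self-contained token bucket. This is an illustrative sketch of the general technique, not the agent_os implementation: `SimpleTokenBucket` and its injectable `clock` parameter are hypothetical, and the real `TokenBucket` may differ internally.

```python
import time
from typing import Callable


class SimpleTokenBucket:
    """Illustrative token bucket; not the agent_os implementation."""

    def __init__(
        self,
        capacity: float,
        refill_rate: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if capacity <= 0 or refill_rate <= 0:
            raise ValueError("capacity and refill_rate must be positive")
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens added per second
        self._clock = clock
        self._tokens = float(capacity)  # start full, as in the timeline
        self._last = clock()

    def _refill(self) -> None:
        """Add tokens for the elapsed time, never exceeding capacity."""
        now = self._clock()
        gained = (now - self._last) * self.refill_rate
        self._tokens = min(self.capacity, self._tokens + gained)
        self._last = now

    def consume(self, tokens: float = 1.0) -> bool:
        """Take tokens if enough are available; otherwise deny."""
        self._refill()
        if self._tokens >= tokens:
            self._tokens -= tokens
            return True
        return False

    def time_until_available(self, tokens: float = 1.0) -> float:
        """Seconds until `tokens` would be available (0.0 if available now)."""
        self._refill()
        return max(0.0, (tokens - self._tokens) / self.refill_rate)


if __name__ == "__main__":
    now = [0.0]  # fake clock so the demo is deterministic
    bucket = SimpleTokenBucket(capacity=3, refill_rate=1.0, clock=lambda: now[0])
    print([bucket.consume() for _ in range(4)])  # [True, True, True, False]
    print(bucket.time_until_available())         # 1.0
    now[0] = 1.0                                 # one second passes
    print(bucket.consume())                      # True
```

Refilling lazily on each call, rather than with a background timer, keeps the bucket free of threads and makes its state depend only on the clock.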
Which approach should you use?
| Approach | Good for | Example |
|---|---|---|
| max_tool_calls | Hard lifetime cap: "agent can do at most N things total" | An agent that should only make 10 tool calls per task |
| TokenBucket | Throughput control: "agent can do N things per second" | Protecting a rate-limited external API |
In production, you often use both: max_tool_calls as a safety net and a TokenBucket for smooth throughput control.
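One way to layer the two limits might look like the sketch below. `CombinedGuard` and its string results are hypothetical names, and the limiter is assumed to expose only a `consume()` method like the `TokenBucket` shown earlier. Whether a rate-limited attempt should count against the lifetime cap is a design choice; here it does not.

```python
from typing import Protocol


class RateLimiter(Protocol):
    """Anything with a consume() method, such as the TokenBucket above."""

    def consume(self) -> bool: ...


class CombinedGuard:
    """Hypothetical guard: a total cap plus a per-second limiter."""

    def __init__(self, max_calls: int, limiter: RateLimiter) -> None:
        self.max_calls = max_calls
        self.limiter = limiter
        self.used = 0

    def check(self) -> str:
        """Return "allowed", "rate_limited", or "cap_reached" for the next call."""
        # Check the lifetime cap first so an exhausted agent spends no tokens.
        if self.used >= self.max_calls:
            return "cap_reached"
        # A rate-limit denial is transient: retrying later may succeed.
        if not self.limiter.consume():
            return "rate_limited"
        self.used += 1
        return "allowed"
```

Returning distinct results lets the caller stop the agent on `"cap_reached"` but merely wait and retry on `"rate_limited"`.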
Full example
Try it yourself
- Change max_tool_calls to 5 in the YAML file and re-run. The agent should now get 5 allowed calls before being stopped.
- Create a TokenBucket with capacity=1, refill_rate=0.5. This means only 1 request at a time, refilling every 2 seconds. How does the output change?
- Combine both approaches: load the policy to get max_tool_calls, create a TokenBucket, and check both limits before allowing each request.
What's missing?
We can now block dangerous tools, scope permissions by role, and rate-limit runaway agents. But every rule we've written applies the same way everywhere. What if a tool should be allowed in dev but blocked in production? And what happens when the security team and a product team write separate policies that disagree? That's conditional policies.
Previous: Chapter 2, Capability Scoping
Next: Chapter 4, Conditional Policies (environment-aware rules and conflict resolution when policies disagree)