This project is under active development and subject to breaking changes. See the changelog for release notes.

Cost & Token Usage

Use this page to project the cost of adopting DevSquad Copilot. It converts the framework's token consumption into a planning baseline you can map to your own models and plan.

The baseline is expressed in tokens, which stay stable over time, with a simple method to convert tokens into dollars or GitHub AI Credits using your model's current rates. Actual usage varies with codebase size, model choice, retry loops, and how much context lives outside the framework.

Small feature

~0.56M tokens total (specify through review). Around 4 tasks, ~200–500 LOC.

Medium feature

~4.6M tokens total. Around 18 tasks, ~1,500–3,000 LOC, new endpoint with integration.

Large feature / migration

~36M tokens total. Around 60 tasks, ~8,000+ LOC, cross-service and debug-heavy.

Typical squad

~51M tokens/month for 3 developers at a sustainable cadence (~17M per developer).

Framework overhead versus an unstructured Copilot session is ~1–2% on real work. The bill is dominated by the work being done (code reads, test loops, generation), not by the framework's prompts.

The token baseline does not go stale when prices change, because the volatile input (your model's rate) is supplied at estimate time. To convert tokens into cost:

  1. Pick the token figure for your scope from the per-feature or monthly tables below.

  2. Look up your model's input and output rates (per 1M tokens) on the live Models and pricing for GitHub Copilot page.

  3. Apply the formula:

    USD = (input_tokens_M * input_rate) + (output_tokens_M * output_rate)
    AI credits = USD / credit_value

    Where input_tokens_M / output_tokens_M are your scope's token counts in millions (e.g. 4.65M = 4.65), input_rate / output_rate are your model's price per 1M input/output tokens from step 2, and credit_value is the USD value of one AI credit (1 credit = $0.01 at the time of writing; confirm the current value on your billing page).
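The formula above can be sketched as a small helper. This is a minimal illustration, not part of DevSquad Copilot: the function name `estimate_cost` is hypothetical, and the $3.00 / $15.00 rates are placeholders that should be replaced with your model's current rates.

```python
def estimate_cost(input_tokens_m, output_tokens_m, input_rate, output_rate,
                  credit_value=0.01):
    """Convert a token budget (in millions of tokens) into USD and AI credits.

    input_rate / output_rate are USD per 1M tokens; credit_value is the USD
    value of one AI credit (assumed $0.01 here, confirm on your billing page).
    """
    usd = input_tokens_m * input_rate + output_tokens_m * output_rate
    credits = usd / credit_value
    return usd, credits


# Medium feature split (3.97M input, 0.68M output) at placeholder rates.
usd, credits = estimate_cost(3.97, 0.68, input_rate=3.00, output_rate=15.00)
print(f"${usd:.2f} is about {credits:,.0f} AI credits")
```

Keeping the rates as arguments mirrors the intent of the method: the token counts are the stable part, and the prices are supplied at estimate time.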

Output tokens are billed roughly 5x input on most models, so the input/output split matters. Per-feature splits:

| Feature | Input tokens | Output tokens | Total |
| --- | --- | --- | --- |
| Small | ~480K | ~83K | ~563K |
| Medium | ~3.97M | ~0.68M | ~4.65M |
| Large | ~30.8M | ~4.9M | ~35.7M |

Paid plans receive a 10% discount on model costs when using auto model selection. Cached input (the repeated prompt prefix on later turns) is billed at up to 10x less than the uncached input rate, and modern agentic harnesses keep cache hit rates high (around 94% on Anthropic models for agentic workloads), so estimates computed at the full uncached input rate are a conservative upper bound on input cost. Extended prompt caching (up to 24h retention on supported OpenAI models) keeps the cache warm across pauses.
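To see how much caching can lower the input term, the blended input rate can be sketched as below. This is a rough illustration under stated assumptions: the helper name is hypothetical, the 10x discount is the best case mentioned above, the 94% hit rate is the quoted figure rather than a guarantee, and any cache-write surcharge a provider may apply is ignored.

```python
def effective_input_cost(input_tokens_m, input_rate,
                         cache_hit_rate=0.94, cached_discount=10.0):
    """Estimate input cost in USD when part of the input is served from cache.

    cache_hit_rate:  fraction of input tokens billed at the cached rate (assumed).
    cached_discount: how many times cheaper cached input is (assumed best case).
    """
    cached_rate = input_rate / cached_discount
    blended_rate = (cache_hit_rate * cached_rate
                    + (1.0 - cache_hit_rate) * input_rate)
    return input_tokens_m * blended_rate


# Medium feature input (3.97M tokens) at a placeholder $3.00 per 1M tokens.
uncached = effective_input_cost(3.97, 3.00, cache_hit_rate=0.0)
cached = effective_input_cost(3.97, 3.00)
print(f"uncached ${uncached:.2f}, with caching ${cached:.2f}")
```

The gap between the two numbers is why the uncached estimate works as an upper bound on input cost; output tokens are unaffected by caching.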

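Worked example, using illustrative rates of $3.00 per 1M input tokens and $15.00 per 1M output tokens, with 1 AI credit = $0.01:

| Feature | Calculation | USD | AI credits |
| --- | --- | --- | --- |
| Small | 0.48 × $3.00 + 0.083 × $15.00 | ~$2.69 | ~270 |
| Medium | 3.97 × $3.00 + 0.68 × $15.00 | ~$22.11 | ~2,210 |
| Large | 30.8 × $3.00 + 4.9 × $15.00 | ~$165.90 | ~16,600 |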

The monthly squad volume scales the same way: apply the same two rates to its input/output split.

GitHub meters all of this in AI credits, and every paid plan includes a monthly credit allowance. To check whether your usage fits, compare your plan's allowance against the per-feature credit estimate above:

features per month before overage = monthly included AI credits / credits per feature

Read your plan's current included allowance (and the per-token overage rate) from the GitHub Copilot billing page. Allowances, plan tiers, and the variable flex portion change over time, so the live billing page is the only reliable source. Example: if your plan includes 7,000 credits and a medium feature costs ~2,210 credits at your current rates, you can ship roughly three medium features per month before additional usage applies.
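The allowance check above is a single division, rounded down to whole features. A minimal sketch follows; the function name is hypothetical, and the 7,000-credit allowance is only the example figure from the text, not a statement about any real plan.

```python
import math


def features_before_overage(included_credits, credits_per_feature):
    """Return how many whole features fit in the monthly credit allowance."""
    if credits_per_feature <= 0:
        raise ValueError("credits_per_feature must be positive")
    return math.floor(included_credits / credits_per_feature)


# Example from the text: 7,000 included credits, ~2,210 credits per medium feature.
print(features_before_overage(7000, 2210))
```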

| Profile | Stories | Tasks per story | Total tasks | Code change |
| --- | --- | --- | --- | --- |
| Small feature | 2 | 2 | 4 | ~200–500 LOC, well-scoped CRUD |
| Medium feature | 6 | 3 | 18 | ~1,500–3,000 LOC, new endpoint + integration |
| Large feature / migration | 18 | 3–4 | 60 | ~8,000+ LOC, cross-service, debug-heavy |

Token counts include all framework overhead, artifact reads, tool outputs, sub-agent calls, and produced artifacts.

| Phase | Small (in/out) | Medium (in/out) | Large (in/out) |
| --- | --- | --- | --- |
| envision (one-time per product) | 15K / 3K | 15K / 3K | 15K / 3K |
| kickoff (one-time per product) | 20K / 4K | 25K / 5K | 40K / 8K |
| specify (per feature) | 30K / 5K | 60K / 10K | 120K / 20K |
| plan (per feature) | 50K / 8K | 120K / 15K | 250K / 30K |
| decompose (per feature) | 30K / 5K | 70K / 10K | 150K / 25K |
| sprint (per sprint, amortized) | 25K / 4K | 25K / 4K | 25K / 4K |
| implement (per task) | 80K / 15K | 200K / 35K | 500K / 80K |
| review (per feature) | 50K / 5K | 120K / 10K | 300K / 20K |
| refine (per run, weekly) | 50K / 5K | 50K / 5K | 50K / 5K |
| security (when triggered) | 30K / 5K | 60K / 8K | 100K / 12K |

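Per-feature totals cover the planning phases (specify, plan, decompose) plus implement and review only; envision, kickoff, sprint, refine, and security are excluded. Each cell is input plus output tokens.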

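| Profile | specify | plan | decompose | implement (sum) | review | Total |
| --- | --- | --- | --- | --- | --- | --- |
| Small (4 tasks) | 35K | 58K | 35K | 380K | 55K | ~563K |
| Medium (18 tasks) | 70K | 135K | 80K | 4.23M | 130K | ~4.65M |
| Large (60 tasks) | 140K | 280K | 175K | 34.8M | 320K | ~35.7M |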
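The per-feature totals can be reproduced from the per-phase figures, which is useful if you want to substitute your own task counts. The sketch below copies the (input, output) values in thousands of tokens from the per-phase table; the dictionary and function names are hypothetical.

```python
# (input_k, output_k) per feature for the once-per-feature phases.
PER_FEATURE_K = {
    "small":  {"specify": (30, 5),   "plan": (50, 8),
               "decompose": (30, 5),   "review": (50, 5)},
    "medium": {"specify": (60, 10),  "plan": (120, 15),
               "decompose": (70, 10),  "review": (120, 10)},
    "large":  {"specify": (120, 20), "plan": (250, 30),
               "decompose": (150, 25), "review": (300, 20)},
}
# (input_k, output_k) for one implement task, and the default task counts.
PER_TASK_K = {"small": (80, 15), "medium": (200, 35), "large": (500, 80)}
DEFAULT_TASKS = {"small": 4, "medium": 18, "large": 60}


def feature_totals_k(profile, tasks=None):
    """Return (input_k, output_k, total_k) for one feature of the given profile."""
    tasks = DEFAULT_TASKS[profile] if tasks is None else tasks
    phases = PER_FEATURE_K[profile].values()
    task_in, task_out = PER_TASK_K[profile]
    input_k = sum(i for i, _ in phases) + tasks * task_in
    output_k = sum(o for _, o in phases) + tasks * task_out
    return input_k, output_k, input_k + output_k


for name in ("small", "medium", "large"):
    print(name, feature_totals_k(name))
```

Passing a different `tasks` value shows how quickly the implement phase moves the total, for example a medium feature that decomposes into 24 tasks instead of 18.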

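Implementation dominates the feature budget: about 67% of a small feature, 91% of a medium one, and 97% of a large one. The planning phases (specify, plan, decompose) combined account for roughly 23%, 6%, and 2% respectively, with review making up the remainder.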

Assumptions: one squad of 3 developers, 1-week sprints, mixed feature sizes.

| Mix (per month) | Volume | Tokens |
| --- | --- | --- |
| 2 small features | 2 × 563K | ~1.13M |
| 3 medium features | 3 × 4.65M | ~13.95M |
| 1 large feature | 1 × 35.7M | ~35.7M |
| 4 sprints | 4 × 29K | ~116K |
| 4 refine runs (weekly) | 4 × 55K | ~220K |
| Security reviews (2 triggered) | 2 × 68K | ~136K |
| Envision + kickoff (one-time amortized) | | ~48K |
| Monthly total per squad | | ~51.3M |
| Per developer (3 devs) | | ~17.1M |
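To adapt the monthly figure to a different feature mix, the table can be expressed as a list of (label, count, tokens per unit) rows. The sketch below uses the values from the table in thousands of tokens; the names `MONTHLY_MIX_K` and `monthly_total_m` are hypothetical.

```python
# (label, count per month, tokens per unit in thousands), from the table above.
MONTHLY_MIX_K = [
    ("small features", 2, 563),
    ("medium features", 3, 4650),
    ("large feature", 1, 35700),
    ("sprints", 4, 29),
    ("refine runs", 4, 55),
    ("security reviews", 2, 68),
    ("envision + kickoff", 1, 48),
]


def monthly_total_m(mix, developers=3):
    """Return (squad_total_m, per_developer_m) in millions of tokens."""
    total_k = sum(count * tokens_k for _, count, tokens_k in mix)
    total_m = total_k / 1000
    return total_m, total_m / developers


squad_m, per_dev_m = monthly_total_m(MONTHLY_MIX_K)
print(f"squad ~{squad_m:.1f}M tokens/month, ~{per_dev_m:.1f}M per developer")
```

Changing the counts in the list (for example, dropping the large feature) gives a quick sense of how much a single migration dominates the month.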

Framework prompts are a small fraction of real cost. The dominant terms are:

  1. Artifact re-reading between phases. Each phase reads spec, ADRs, and related plans from disk: 5–30K input tokens per phase. This is a deliberate trade of tokens for context isolation.
  2. Repository code reads during implement and review: 20–200K tokens depending on familiarity and scope.
  3. Tool output ingestion: test logs, build errors, lint output. Each failed test cycle adds 5–15K input tokens.
  4. Retry and debug loops: a stuck implementation can multiply the per-task cost 3–5x.
  5. Generated output: specs, plans, code edits, commit messages. Output tokens cost more per token than input on every model.
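The effect of retry and debug loops on a feature's total can be approximated by inflating the cost of the stuck tasks. This is a rough sketch under stated assumptions: the function name is hypothetical, the multiplier is applied to the whole per-task baseline (input plus output), and the 235K per-task figure is the medium implement cost (200K in + 35K out) from the per-phase table.

```python
def feature_total_with_stuck_tasks(baseline_total_k, per_task_k,
                                   stuck_tasks, multiplier):
    """Return the feature total (thousands of tokens) when some tasks get stuck.

    Each stuck task is assumed to cost `multiplier` times its baseline, so the
    extra spend per stuck task is per_task_k * (multiplier - 1).
    """
    extra_k = stuck_tasks * per_task_k * (multiplier - 1)
    return baseline_total_k + extra_k


# Medium feature (~4,645K baseline) where 2 of 18 tasks run at 4x their baseline.
print(feature_total_with_stuck_tasks(4645, 235, stuck_tasks=2, multiplier=4))
```

Even a couple of stuck tasks add noticeably to the total, which is why the retry term matters more than the framework's own prompts.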

What DevSquad Copilot adds on top of an unstructured Copilot session producing similar code:

| Source of overhead | Tokens per medium feature |
| --- | --- |
| Coordinator agent prompts (loaded per phase) | ~25K (mostly cached after first turn) |
| Sub-agent prompts (isolated contexts) | ~30K |
| Skill auto-triggers | ~15K |
| Artifact re-reading between phases | ~40K |
| Quality gates, handoff envelopes, reasoning logs | ~10K |
| Total | ~120K |

By raw token count, ~120K is about 2.6% of a medium feature's ~4.65M-token total, and the share shrinks further on a large feature (~0.2%).

Ways to reduce cost, listed in order of leverage:

  1. Model selection per agent (highest impact). The framework does not hardcode a model on any agent, since it has no control over which models, regions, or plan restrictions a consumer can access, and models change at a fast pace. Override an agent's model field with open-source tooling such as agext-cli, which layers repo-local overrides on top of the installed plugin without modifying the originals. As a rule, assign a lightweight model family (Haiku-class or mini/flash-class) to routine agents (validate, verify, finalize, decompose) and reserve a frontier family (Sonnet-, Opus-, or GPT-5-class) for plan, implement.execute, and review.code.
  2. Cap retry loops in implement, so a stuck task escalates to you after N failed attempts instead of running away.
  3. Tighter task decomposition. Smaller tasks read less surrounding code and have shorter debug tails.
  4. Honor context cleanup boundaries. Running an entire delivery in one mega-session raises the risk of context contamination, which leads to retries. Use phase boundaries to keep context clean.
  5. Keep sessions warm and stable. With extended prompt caching, resuming related work within the cache-retention window reuses the prompt prefix at the cached rate; a long idle gap forces a cold start that reprocesses the whole prefix at full price. Changing the model or reasoning effort mid-session can also invalidate the cache.
  6. Pooled entitlements for Business/Enterprise. Heavy implement sessions for one developer are offset by lighter envision/specify work elsewhere in the org.
  7. Code completions are free. Inline coding and Next Edit Suggestions remain unlimited. Use them for trivial edits instead of asking an agent.
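The retry cap in item 2 can be illustrated with a generic loop that gives up and escalates after a fixed number of attempts. This is a conceptual sketch only: it is not DevSquad Copilot's configuration or API, and `run_with_retry_cap`, `EscalateToHuman`, and the `attempt` callback are hypothetical names.

```python
from typing import Callable


class EscalateToHuman(Exception):
    """Raised when a task is still failing after the allowed attempts."""


def run_with_retry_cap(attempt: Callable[[int], bool], max_attempts: int = 3) -> int:
    """Call attempt(n) until it returns True or the cap is reached.

    Returns the number of attempts used on success; raises EscalateToHuman
    instead of looping indefinitely when every attempt fails.
    """
    for n in range(1, max_attempts + 1):
        if attempt(n):
            return n
    raise EscalateToHuman(f"task still failing after {max_attempts} attempts")


# Example: a hypothetical task whose tests pass on the second attempt.
attempts_used = run_with_retry_cap(lambda n: n >= 2, max_attempts=3)
print(attempts_used)
```

The point of the cap is to bound the worst-case token spend per task: the loop's cost is at most `max_attempts` times one attempt, rather than open-ended.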

To validate the baseline against your own usage:

  1. Pick one medium feature.

  2. Run the full sequence end-to-end: /devsquad.specify, then /devsquad.plan, then /devsquad.decompose, then /devsquad.implement, then /devsquad.review.

  3. Download the usage report CSV from the premium request analytics page. Each row includes aic_quantity and aic_gross_amount.

  4. Filter rows by the session window and compare the AI-credit total against the medium-feature estimate above. Because your account converts tokens to credits at whatever rates are current, this validates the baseline without hardcoding any price.

  5. If actuals deviate by more than 2x, the cause is almost always model choice mismatch or retry loops on a small number of tasks.

  6. Re-run this validation whenever you change your default model or after major harness updates, since both shift the token baseline.
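Steps 3 to 5 can be sketched as a short script over the usage report. The `aic_quantity` column comes from the report described above and is assumed here to hold the AI-credit count per row; the `date` column name, the helper names, and the sample rows are made up for illustration, so adjust them to match your actual export.

```python
import csv
import io
from datetime import datetime


def credits_in_window(csv_text, start, end, time_column="date"):
    """Sum aic_quantity for rows whose timestamp falls within [start, end]."""
    total = 0.0
    for row in csv.DictReader(io.StringIO(csv_text)):
        timestamp = datetime.fromisoformat(row[time_column])
        if start <= timestamp <= end:
            total += float(row["aic_quantity"])
    return total


def deviation_ratio(actual, estimate):
    """Return how many times larger the bigger value is than the smaller one."""
    return max(actual, estimate) / min(actual, estimate)


# Made-up sample rows standing in for a downloaded report.
SAMPLE_REPORT = (
    "date,aic_quantity,aic_gross_amount\n"
    "2025-01-10,1200,12.00\n"
    "2025-01-11,1500,15.00\n"
    "2025-02-01,900,9.00\n"
)

actual = credits_in_window(SAMPLE_REPORT, datetime(2025, 1, 10), datetime(2025, 1, 12))
ratio = deviation_ratio(actual, estimate=2210)
print(f"actual {actual:.0f} credits, deviation {ratio:.2f}x")
if ratio > 2:
    print("Investigate model choice and retry loops on individual tasks.")
```

To run it against a real export, read the file with `open(path, newline="").read()` and pass the text in place of `SAMPLE_REPORT`.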