In our article on tokenmaxxing, we argued that the biggest mistake companies make with AI coding tools is focusing on how much they use them rather than how well they use them. They measured and celebrated output volume, but no one asked what was going into the prompts: how clearly the problem was defined, how much context was provided, or how much judgment was applied before hitting generate. Blindly throwing more context and more prompts at a problem tends to produce worse code while driving up costs.
The lesson these companies learned the hard way is that you need to stop focusing on maximizing tokens and start focusing on maximizing the value of each token. But getting there requires a fundamental shift in how you approach software development.
Here’s how you can adopt a lean engineering mindset, stop draining your token budget, and get your AI assistant to write better code on the first try.
Treat every AI mistake as a bug in your input, not a bug in the output
When you write code manually, you’re the feature factory. When you use AI, the AI is the factory, and you’re the engineer calibrating the machinery.
Your job is no longer building features, but building the tools, rules, and systems that produce them. This requires shifting your focus. When your AI assistant makes a mistake, your instinct might be to fix the output by re-prompting. Instead, you should focus on fixing the input.
Let’s say an agent creates a component using styling conventions you don’t want. The obvious reaction is to tell it, “Use Tailwind utility classes instead.” That solves today’s problem, but you’ll probably have to repeat the same correction tomorrow.
A better question is: why did it happen in the first place? Maybe the specification was unclear. Maybe a rule is missing. Maybe the project documentation doesn’t establish a preferred approach. Instead of repeatedly fixing the output, update the source of truth so the agent gets it right automatically next time.
The same principle applies to planning. Spend more time refining PRDs and technical specifications before generating code. Fixing AI-generated code after the fact is usually more expensive than generating it correctly in the first place.
Understand what’s silently draining your token budget
Before you can optimize AI usage, you need to understand what’s draining your token budget. In my experience, there are two massive budget-drainers that usually go unnoticed.
The endless micro-adjustments
Every time you send a new prompt, the model receives the entire conversation history again. Those seemingly harmless back-and-forths, “center that element”, “actually, align it vertically”, “use flex instead”, don’t stay small for long. By the time you’re twenty messages deep, every tiny correction is being replayed over and over again, turning a simple UI tweak into thousands of unnecessary tokens.
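To see how quickly replayed history adds up, here is a rough back-of-the-envelope sketch. It assumes every turn resends the full history as input and ignores prompt caching, which some providers offer and which lowers the real cost. The function name and the token counts are made up for illustration.

```python
def cumulative_input_tokens(turn_tokens: list[int], reply_tokens: int = 0) -> int:
    """Total input tokens sent when every turn resends the whole history.

    Assumption: no prompt caching and no history compaction.
    """
    total = 0
    history = 0
    for prompt in turn_tokens:
        history += prompt        # the new message joins the history
        total += history         # the whole history is sent as input
        history += reply_tokens  # the model's reply joins the history too
    return total


# Twenty 50-token corrections, each answered with a 200-token reply.
typed = sum([50] * 20)                                         # 1000
replayed = cumulative_input_tokens([50] * 20, reply_tokens=200)  # 48500
```

Under these assumptions you typed 1,000 tokens of corrections, but the model processed 48,500 input tokens to handle them.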
Context pollution
Over time, context windows fill up with duplicated system rules, overlapping local/global skills, raw log outputs, massive files (mocks, translations, generated JSON), and other artifacts that have little to do with the task at hand. Context pollution inflates your bill and degrades the model’s reasoning accuracy, thereby reducing the quality of your output.
Context bloat and endless prompt iteration are two of the biggest sources of waste, but they’re not the only ones. Model choice, workflow design, and how you manage agent context all directly impact both cost and output quality.
Five strategies to optimize your AI token usage
Once you understand what’s burning your tokens, the fixes are surprisingly straightforward. Follow these strategies to reduce waste, keep context under control, and get better results from your AI tools without constantly reaching for more prompts or more powerful models.
1. Use the right model for the task
One of the fastest ways to waste tokens is by using the most powerful model for every task. Powerful models, extended reasoning modes, and premium agent settings all have their place, but most tasks don’t need that much horsepower.
Using the most powerful model for a simple task can easily backfire. Opus, for example, tends to over-engineer straightforward code, which means you end up waiting longer for a worse result than a lighter model would have produced. There’s no point spinning up Opus with extra-high thinking to open a PR when cheaper models would do it correctly and faster.
Reasoning models “think” before they answer, burning a significant number of internal tokens, often invisible, before they produce a single line of output. In some cases, that means 2,000 tokens of reasoning for a 50-token one-line fix. Use fast, non-reasoning models for boilerplate, refactoring, and mechanical chores. Reserve reasoning models for genuinely hard tasks, such as debugging architectural flaws or designing complex algorithms.
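To put the 2,000-versus-50 example in numbers, here is a tiny sketch. It assumes hidden reasoning tokens are billed like visible output tokens, which is common but worth checking against your provider's pricing. The function name is hypothetical.

```python
def output_tokens_billed(visible_tokens: int, reasoning_tokens: int = 0) -> int:
    """Output tokens you pay for, assuming reasoning is billed as output."""
    return visible_tokens + reasoning_tokens


with_reasoning = output_tokens_billed(50, reasoning_tokens=2000)  # 2050
without_reasoning = output_tokens_billed(50)                      # 50
overhead = with_reasoning / without_reasoning                     # 41.0
```

In this scenario the one-line fix costs roughly 41 times more output tokens than it would from a non-reasoning model.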
High-effort tiers cost a multiple of the standard rate, so spend them deliberately. In Cursor, Composer’s fast mode is often enabled by default, and it’s ~6x more expensive than the standard mode. That premium isn’t worth paying when a lighter model would have produced the same or better result. Turn off high-effort or “fast” modes by default and reach for them only when the task genuinely needs them.
Save your most powerful model for the thinking, not the doing. When scoping work with a PRD or a detailed technical specification, use a powerful model to create a plan, then hand the well-defined steps to a lighter, faster model to implement.
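One way to make this split a habit is to write the routing decision down. The sketch below is only an illustration: the tier labels and task categories are hypothetical placeholders, not real model names, and you would map them to whatever your tool offers.

```python
# Hypothetical tier labels; replace with the models available in your tool.
LIGHT = "light-fast-model"
HEAVY = "reasoning-model"

# Hypothetical task categories that justify the expensive tier.
HARD_TASKS = {"plan", "architecture_debug", "complex_algorithm"}


def pick_model(task_kind: str) -> str:
    """Default to the light model; escalate only for genuinely hard work."""
    return HEAVY if task_kind in HARD_TASKS else LIGHT


pick_model("plan")            # "reasoning-model"
pick_model("implement_step")  # "light-fast-model"
pick_model("open_pr")         # "light-fast-model"
```

Defaulting to the light tier means an unrecognized task never silently lands on the most expensive model.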
2. Ruthlessly manage your context window
Long sessions come with a hidden “bloat tax”: each new prompt becomes more expensive because the model must reprocess the entire context window. Over time, this also leads to context rot, in which noise and outdated information reduce the quality of the model’s outputs. To avoid that, keep your sessions clean and intentional using these strategies:
- Start fresh often. Track your context usage and spin up new agent sessions frequently.
- Export core context into .md files. Instead of keeping massive chat histories alive just to preserve a few decisions, export those conclusions to a Markdown file. Share that file between agents. This gives you far more control than auto-compacted logs.
- Pin only what matters. Keep context tight by pinning only files relevant to the current change.
- Set up ignore files. Stop agents from indexing build artifacts and massive JSON mock files. Treat this like a .gitignore for your AI: use .cursorignore in Cursor, content exclusion in GitHub Copilot, or deny permission settings in Claude.
- Trim chatty output. Use tools or skills such as caveman, RTK (Rust Token Killer), or similar to cut down on verbose, unnecessary response text. Keep in mind that these tools can affect the quality of the output.
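Before writing an ignore file, it helps to know which files are actually heavy. The following is a minimal sketch that ranks files by a rough token estimate. The four-bytes-per-token ratio is a crude assumption, not a real tokenizer, and the skipped directory names and the function name are illustrative.

```python
import os
from pathlib import Path

# Directories that are never worth scanning (illustrative defaults).
SKIP_DIRS = {".git", "node_modules"}


def heaviest_files(root: str, top: int = 10) -> list[tuple[str, int]]:
    """Return (relative path, estimated tokens) for the largest files."""
    results = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk does not descend.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in filenames:
            path = Path(dirpath) / name
            try:
                size = path.stat().st_size
            except OSError:
                continue  # unreadable or vanished file
            # Rough assumption: about 4 bytes per token.
            results.append((path.relative_to(root).as_posix(), size // 4))
    results.sort(key=lambda item: item[1], reverse=True)
    return results[:top]
```

Running `heaviest_files(".")` at the repository root gives you a shortlist of mocks, translations, and generated files to consider excluding.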
3. Keep skills and rules lean
Only install skills you directly need and understand. A curated list of 100 skills won’t magically improve your output, and bloated skill sets are just another form of context pollution. Audit both local and global skills regularly and remove anything that isn’t earning its place.
Split massive, catch-all skills into smaller, modular units. Instead of one giant “Code Review” skill, create a lightweight Code Review Orchestrator that selectively calls five or six highly specific sub-skills only when relevant. Smaller sub-skills are also much easier to maintain and keep up to date.
When it comes to rules, scope them to the files where they’re actually relevant. Component rules shouldn’t load into context when you’re modifying backend services. Use Cursor rules or Claude’s path-scoped rules to control this.
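The idea behind path scoping can be sketched in a few lines. This is a conceptual illustration only, not how Cursor or Claude implement it: the patterns, the rule text, and the function name are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical rules keyed by the path pattern they apply to.
# Note: fnmatch's "*" also matches "/" so these cover nested folders.
SCOPED_RULES = {
    "src/components/*": "Use Tailwind utility classes for styling.",
    "services/*": "Validate every request payload before using it.",
}


def rules_for(path: str) -> list[str]:
    """Return only the rules whose pattern matches the file being edited."""
    return [
        rule
        for pattern, rule in SCOPED_RULES.items()
        if fnmatchcase(path, pattern)
    ]


rules_for("src/components/ui/Button.tsx")  # only the Tailwind rule
rules_for("services/billing/api.py")       # only the backend rule
rules_for("README.md")                     # no rules loaded
```

Each edit loads only the rules that match its path, so component conventions never occupy context during backend work.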
4. Don’t waste agents on simple tasks
Using an agent for minor chores adds thousands of tokens to every subsequent prompt, especially in a session with a full context window. Do simple one-liners and quick commits manually.
The same applies to tests and linting. Avoid running massive test suites (e.g., 2k+ tests) in an automated AI loop. Run focused checks only, and trim the output with a tool such as RTK (Rust Token Killer) to prevent the results from polluting your context.
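If you don't use a dedicated tool, even a small filter helps. The sketch below is not how RTK works; it is a simplified stand-in that assumes failing lines contain "FAILED" or "ERROR" and that the last non-empty line is the summary. Adjust both assumptions to match your test runner.

```python
def trim_test_output(raw: str, max_lines: int = 20) -> str:
    """Keep failing lines and the final summary; drop everything else."""
    lines = [ln for ln in raw.splitlines() if ln.strip()]
    if not lines:
        return ""
    # Assumption: the last non-empty line is the runner's summary.
    *body, summary = lines
    failures = [ln for ln in body if "FAILED" in ln or "ERROR" in ln]
    kept = failures[:max_lines]
    if len(failures) > max_lines:
        kept.append(f"... {len(failures) - max_lines} more failing lines omitted")
    return "\n".join(kept + [summary])


raw = (
    "tests/a.py::test_one PASSED\n"
    "tests/a.py::test_two FAILED\n"
    "tests/b.py::test_three PASSED\n"
    "2 passed, 1 failed"
)
trimmed = trim_test_output(raw)
# trimmed == "tests/a.py::test_two FAILED\n2 passed, 1 failed"
```

The agent still sees what broke and the overall result, without hundreds of passing lines entering the context.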
One thing to watch out for: if you make manual changes outside the session, the agent won’t be aware of them and may burn tokens investigating what changed. After making manual changes, start a fresh agent session and update your saved plans and documentation so the next prompt reflects the current state.
5. Save recurring tasks as reusable files
When your agent writes a useful piece of automation, such as a bash script, a database query, or a boilerplate config, don’t let it disappear into the chat history. Save it to a file in your repository.
Beyond token savings, saved files give you determinism. LLMs are probabilistic, so if you ask an agent to generate the same setup script multiple times, it may subtly rewrite the logic and introduce inconsistencies between runs. A file on disk executes the exact same code every time. The next time you need that task done, don’t ask the agent to rewrite it; just tell it to run that specific file. You get predictable behavior and pay zero tokens to recreate it.
From code writer to AI architect
The era of writing every line of code is gone, but so is the brief era of lazily asking an AI to “just build it.” The engineers getting the most out of AI are the ones who shift their focus from building features to curating the rules, specs, and context that produce features.
Once you master this, you stop fighting the model and burning tokens on bloated context windows. Instead, you build a lean and predictable factory that outputs exactly what you need on the first try.