Engines Explained¶

An engine is an AI coding tool that RoboDev runs inside a container to complete tasks. The controller doesn't write code itself — it delegates to engines. Think of the controller as a dispatcher and the engine as the worker.

Which Engine Should I Use?¶

graph TD
    A["Which engine?"] --> B{"Need deterministic<br/>guard rails?"}
    B -->|Yes| C["Claude Code"]
    B -->|No| D{"Which AI provider?"}
    D -->|Anthropic| E{"Budget-sensitive?"}
    E -->|Yes| F["Aider"]
    E -->|No| C
    D -->|OpenAI| G["Codex"]
    D -->|Google| H["OpenCode"]
    D -->|AWS Bedrock| I["Cline"]
    D -->|Multiple| J{"Need MCP?"}
    J -->|Yes| I
    J -->|No| H

Short answer: Use Claude Code unless you have a specific reason not to. It has the best guard rail support (hook-based, not just prompt-based) and built-in heartbeat telemetry for the watchdog.

Comparison¶

Feature	Claude Code	Codex	Aider	OpenCode	Cline
Guard rails	Hook-based (deterministic)	Prompt-based	Prompt-based	Prompt-based	Prompt-based
AI provider	Anthropic	OpenAI	Anthropic or OpenAI	Anthropic, OpenAI, or Google	Anthropic, OpenAI, Google, or Bedrock
Heartbeat telemetry	Built-in via hooks	Not built-in	Not built-in	Not built-in	Not built-in
Agent teams	Yes (experimental)	No	No	No	No
MCP support	Yes	No	No	No	Yes
Repo context file	`CLAUDE.md`	`AGENTS.md`	`.aider/conventions.md`	`AGENTS.md`	`.clinerules`
Best for	General-purpose, large refactors	OpenAI shops	Lightweight edits, cost-sensitive	BYOM flexibility	Bedrock, MCP integration
Pre-built image	✅	✅	✅	✅	❌ Community template

What "Hook-Based" vs "Prompt-Based" Guard Rails Means¶

This is the most important difference between engines.

Hook-based (Claude Code): RoboDev generates scripts that run before and after every tool call. If the agent tries to run rm -rf / or write to a .env file, the hook script blocks it before it happens. The agent sees the rejection and adjusts its approach. This is deterministic — the rule always fires.

Prompt-based (all others): Guard rails are appended to the task prompt as text instructions ("Do NOT execute destructive commands..."). The AI model usually follows them, but there's no hard enforcement. A sufficiently confused or jailbroken model could ignore them.

Prompt-based guard rails are advisory

If your security requirements demand strict enforcement of file access or command restrictions, use Claude Code. For lower-risk tasks (documentation, test fixes), prompt-based engines are usually fine.

Engine Fallback Chains¶

You can configure a fallback chain so that if one engine fails, the next is tried:

engines:
  default: claude-code
  fallback_engines: [codex, aider]

The controller tries claude-code first. If it fails, it tries codex. If that also fails, it tries aider. This is useful for resilience against API outages or rate limits.

Note

Do not use cline in a fallback chain. No pre-built Cline image is published — see Engine Reference for details.

Per-Task Engine Override¶

You can override the engine for a specific task by adding a label to the issue:

engine:codex — use Codex for this task
engine:aider — use Aider for this task

This takes priority over the configured default.

Cost Considerations¶

Different engines have different cost profiles depending on the underlying model:

Engine	Typical Cost Range	Notes
Claude Code	$1–$50 per task	Depends on model (Opus vs Sonnet vs Haiku) and task complexity
Codex	$1–$30 per task	OpenAI pricing
Aider	$0.50–$20 per task	Can use cheaper models
OpenCode	Varies	Depends on provider
Cline	Varies	Depends on provider; Bedrock pricing may differ

Use max_cost_per_job in your guard rails to cap spending per task regardless of engine.

Intelligent Routing¶

RoboDev is building an intelligent task routing system (internal/routing/) that replaces static fallback chains with data-driven engine selection. Instead of always trying claude-code → cline → aider in order, the router will learn which engine is best for each combination of task type, repo language, repo size, and complexity.

How it works:

After each task completes, the outcome (success/failure, cost, duration) is recorded against the engine that ran it.
Per-engine "fingerprints" build up over time — statistical profiles of where each engine excels.
When a new task arrives, the router scores each available engine and picks the most likely to succeed.
An epsilon-greedy exploration factor (default 10%) ensures the system occasionally tries non-optimal engines to discover new strengths.

This will be transparent — Prometheus metrics will show which engine is being selected and why. Cold-start behaviour falls back to the static fallback chain until enough data accumulates (minimum 5 outcomes per engine).

Competitive Execution / Tournaments¶

For high-value tasks, RoboDev can launch multiple engines in parallel — a tournament — have a judge LLM compare the results, and select the best solution. Each candidate runs in an isolated git worktree so they cannot interfere with each other.

When to Use Tournaments¶

Tournaments are best suited for:

High-stakes changes (large refactors, API redesigns, security fixes) where you want the best result, not just a passing one
Tasks where different engines have meaningfully different strengths (e.g. Claude Code vs Aider on a Python-heavy repo)
Situations where code quality or correctness matters more than cost

Tournaments multiply cost

A 2-candidate tournament with a judge run costs roughly 3× a single task. Use max_predicted_cost_per_job to auto-reject tasks before they start when the predicted cost is too high.

Configuration¶

Enable competitive execution in your Helm values (or robodev-config.yaml):

competitive_execution:
  enabled: true
  default_candidates: 2          # number of engines to run in parallel
  judge_engine: claude-code      # engine used to evaluate and pick the winner
  early_termination_threshold: 0.6  # PRM score above which a candidate wins immediately
  max_concurrent_tournaments: 3  # cap on parallel tournaments cluster-wide

The default_candidates value determines how many engines are launched. RoboDev selects engines from the configured fallback chain in order — so with default: claude-code and fallback_engines: [codex, aider], a 2-candidate tournament runs Claude Code and Codex.

To specify engines explicitly for a tournament, use candidate_engines per task label (see Per-Task Engine Override).

Tournament Lifecycle¶

graph LR
    A["Task arrives"] --> B["Competing\n(all candidates running)"]
    B -->|"early termination\n(PRM score ≥ threshold)"| E["Winner selected"]
    B -->|"all candidates finish"| C["Judging\n(judge engine evaluates diffs)"]
    C --> E
    B -->|"infra failure"| D["Candidate eliminated"]
    D -->|"remaining candidates finish"| C

Competing — all candidate engines run simultaneously in isolated worktrees. The controller watches each candidate's PRM scores as they stream in.
Early termination — if any candidate's PRM score exceeds early_termination_threshold (default 0.6), remaining candidates are cancelled immediately and that candidate wins. This avoids paying for redundant runs when one engine is clearly superior.
Judging — once all candidates have finished (or been eliminated), the judge engine receives a structured prompt with each candidate's diff, summary, cost, duration, and PRM scores. It responds with a winner_index and reasoning.
Winner selected — the winning candidate's changes are used to open the pull request. Losing candidates' worktrees are discarded.

Judging Rubric¶

The judge evaluates candidates on five dimensions, in priority order:

Criterion	Weight
Correctness — does the solution address the task?	Highest
Code quality — clean, structured, maintainable	High
Completeness — edge cases, tests included	Medium
Efficiency — cost, tokens, duration	Lower
Safety — no vulnerabilities, follows best practices	Always checked

Correctness outweighs cost efficiency: the judge will prefer a correct solution that costs more over a cheaper incomplete one.

Per-Task Tournament Trigger¶

You can trigger a tournament for a specific issue by adding the tournament label. The configured default_candidates and candidate_engines are used.

Prometheus Metrics¶

Metric	Description
`robodev_tournament_total`	Total tournaments started
`robodev_tournament_candidates_total`	Candidates completed, labelled by engine
`robodev_tournament_winner_engine_total`	Tournament wins by engine
`robodev_tournament_cost_total`	Histogram of total tournament cost (all candidates + judge) in USD
`robodev_tournament_duration_seconds`	Histogram of tournament duration from start to winner selection

The winner engine metric is particularly useful for understanding which engine tends to produce the best results on your codebase over time.

Next Steps¶

Engine Reference — full configuration details for each engine
Guard Rails Overview — the six safety layers
Configuration Reference — how to configure engines in your values