Architecture¶
This document describes the architecture of RoboDev, a Kubernetes-native controller that orchestrates autonomous AI coding agents to perform development tasks at scale.
For the full technical plan, see oss-plan.md. For product requirements, see oss-prd.md.
Overview¶
RoboDev follows the Kubernetes operator pattern. A single controller binary runs inside the cluster and drives a reconciliation loop: it polls a ticketing backend for work, translates each ticket into a Kubernetes Job that runs an AI coding agent, monitors the job through to completion, and feeds the result back to the source control and ticketing systems.
The controller does not itself perform any code generation. It is purely an orchestration layer. The actual coding work happens inside short-lived Kubernetes Jobs, each running one of the supported AI engines (Claude Code, OpenAI Codex, or Aider). This separation means the controller can manage many concurrent agents across multiple repositories and organisations without coupling to any single AI provider.
All external integrations -- ticketing, notifications, approvals, secrets, SCM, and code review -- are abstracted behind plugin interfaces. Built-in plugins compile directly into the controller binary; third-party plugins run as separate gRPC subprocesses managed by hashicorp/go-plugin.
System Architecture¶
graph TD
TB["Ticketing Backend<br/>(GitHub Issues / GitLab / Jira)"] -->|poll tickets| Ctrl
subgraph Ctrl["RoboDev Controller"]
RL["Reconciliation Loop"]
GR["Guard Rails"]
TSM["TaskRun State Machine"]
ES["Engine Selector"]
JB["JobBuilder"]
RL --> GR --> TSM --> ES --> JB
end
Ctrl --> Secrets["Secrets Backend<br/>(Vault / K8s)"]
Ctrl --> WD["Watchdog Loop"]
Ctrl --> Approval["Approval Backend"]
Ctrl --> Notif["Notification Channel<br/>(Slack / Teams)"]
JB --> Job["K8s Job<br/>(AI Agent Pod)<br/>Claude Code / Codex / Aider"]
Job -->|result + branch| SCM["SCM Backend<br/>(GitHub / GitLab)<br/>create PR/MR"]
Job -->|result + branch| Review["Review Backend<br/>(CodeRabbit)<br/>auto-review"]
Controller¶
The controller lives in internal/controller/ and is built on top of controller-runtime and client-go. Its Reconciler struct drives the entire lifecycle.
Reconciliation Loop¶
The loop runs on a configurable poll interval (typically 30-60 seconds). Each tick performs the following steps:
- Capacity check -- count active (non-terminal, non-queued) TaskRuns against the configured max_concurrent_jobs limit (default: 5). If the limit is reached, the tick is skipped entirely.
- Poll -- call PollReadyTickets on the ticketing backend to retrieve tickets that are ready for processing.
- Per-ticket processing -- for each ticket, up to the remaining capacity:
  - Generate an idempotency key (<ticket_id>-<attempt>). If a non-terminal TaskRun already exists for this key, the ticket is skipped. This prevents duplicate work when the same ticket appears in consecutive polls.
  - Run guard rail validation (see Guard Rails below). If a ticket violates any rule, it is rejected and the ticketing backend is updated.
  - Select the execution engine (from configuration, defaulting to claude-code).
  - Create a TaskRun in Queued state, then build an ExecutionSpec via the engine.
  - Pass the spec to the JobBuilder, which produces a batch/v1.Job.
  - Create the Job in Kubernetes via client-go.
  - Transition the TaskRun to Running and record the Job name.
  - Emit Prometheus metrics (active_jobs, taskruns_total).
  - Mark the ticket as in-progress and fire start notifications.
- Job status check -- iterate all Running TaskRuns, fetch the corresponding Job status from the Kubernetes API, and handle completion or failure.
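The capacity check that opens each tick can be sketched as follows. This is a minimal illustration rather than the controller's actual code: the helper names remainingCapacity and selectForTick are hypothetical, and tickets are reduced to plain ID strings.

```go
package main

import "fmt"

// remainingCapacity returns how many new Jobs may start this tick.
// Hypothetical helper; maxConcurrent mirrors max_concurrent_jobs.
func remainingCapacity(active, maxConcurrent int) int {
	if active >= maxConcurrent {
		return 0
	}
	return maxConcurrent - active
}

// selectForTick trims the polled tickets to the remaining capacity.
func selectForTick(tickets []string, active, maxConcurrent int) []string {
	n := remainingCapacity(active, maxConcurrent)
	if n > len(tickets) {
		n = len(tickets)
	}
	return tickets[:n]
}

func main() {
	polled := []string{"T-1", "T-2", "T-3"}
	fmt.Println(selectForTick(polled, 3, 5)) // two slots left
	fmt.Println(selectForTick(polled, 5, 5)) // limit reached, nothing is started
}
```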
Retry Logic¶
When a Job fails and the TaskRun has not exhausted its MaxRetries (default: 1), the controller transitions the TaskRun through Failed to Retrying and back to Running, creating a new Job. The retry count is incremented on each attempt. Once retries are exhausted, the TaskRun enters a terminal Failed state, the ticket is marked as failed, and notifications are sent.
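A minimal sketch of the retry decision, assuming a simplified TaskRun that carries only RetryCount and MaxRetries. The function name afterFailure and the use of plain strings for states are illustrative choices, not the controller's API.

```go
package main

import "fmt"

// TaskRun holds only the retry bookkeeping needed for this sketch.
type TaskRun struct {
	RetryCount int
	MaxRetries int
}

// afterFailure returns the states the TaskRun moves through after a Job
// failure and increments RetryCount when a retry is scheduled.
func afterFailure(tr *TaskRun) []string {
	if tr.RetryCount < tr.MaxRetries {
		tr.RetryCount++
		return []string{"Failed", "Retrying", "Running"}
	}
	return []string{"Failed"} // terminal: retries exhausted
}

func main() {
	tr := &TaskRun{MaxRetries: 1} // default MaxRetries from the text
	fmt.Println(afterFailure(tr)) // first failure: a retry is scheduled
	fmt.Println(afterFailure(tr)) // second failure: terminal Failed
}
```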
Idempotency¶
Every TaskRun is keyed by an idempotency key derived from the ticket ID and attempt number. The controller maintains an in-memory map of all TaskRuns, protected by a read-write mutex. Before processing any ticket, the controller checks whether a non-terminal TaskRun already exists for that key. This ensures that a ticket is never processed twice, even if the ticketing backend returns it in multiple consecutive polls.
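The check described above might look like the sketch below. The Store type, its methods, and the reduced set of states are assumptions made for illustration; only the key format and the use of a read-write mutex come from the text.

```go
package main

import (
	"fmt"
	"sync"
)

// State is a simplified TaskRun state; only a subset is modelled here.
type State string

const (
	Running   State = "Running"
	Succeeded State = "Succeeded"
)

// terminal reports whether a state is final in this reduced model.
func terminal(s State) bool { return s == Succeeded }

// Store is a hypothetical in-memory index of TaskRuns by idempotency key.
type Store struct {
	mu   sync.RWMutex
	runs map[string]State
}

func NewStore() *Store { return &Store{runs: make(map[string]State)} }

// Key builds the <ticket_id>-<attempt> idempotency key.
func Key(ticketID string, attempt int) string {
	return fmt.Sprintf("%s-%d", ticketID, attempt)
}

// ShouldProcess returns false when a non-terminal TaskRun already exists.
func (s *Store) ShouldProcess(key string) bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	st, ok := s.runs[key]
	return !ok || terminal(st)
}

// Put records or updates the state for a key under the write lock.
func (s *Store) Put(key string, st State) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.runs[key] = st
}

func main() {
	s := NewStore()
	k := Key("TICKET-42", 1)
	fmt.Println(s.ShouldProcess(k)) // nothing recorded yet
	s.Put(k, Running)
	fmt.Println(s.ShouldProcess(k)) // a run is already in flight
}
```

Because the map lives in memory, a sketch like this only deduplicates within a single controller process lifetime.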
TaskRun State Machine¶
The TaskRun state machine is implemented in internal/taskrun/. Each TaskRun struct tracks the full lifecycle of a single execution attempt.
States¶
| State | Description |
|---|---|
| Queued | The TaskRun has been created but no Job has been launched yet. |
| Running | A Kubernetes Job is actively executing the AI agent. |
| NeedsHuman | The agent has asked a question that requires human intervention. Execution is paused. |
| Succeeded | The Job completed successfully and produced a valid result. Terminal state. |
| Failed | The Job failed. Terminal if retries are exhausted; otherwise transitions to Retrying. |
| Retrying | A retry has been scheduled. Transitions back to Running when the new Job is created. |
| TimedOut | The Job exceeded its deadline. Terminal state. |
Transition Diagram¶
stateDiagram-v2
[*] --> Queued
Queued --> Running
Queued --> NeedsHuman
Running --> Succeeded
Running --> Failed
Running --> TimedOut
Running --> NeedsHuman
NeedsHuman --> Running
Failed --> Retrying
Retrying --> Running
Succeeded --> [*]
TimedOut --> [*]
Valid Transitions¶
The validTransitions map enforces the following rules:
- Queued may only transition to Running.
- Running may transition to NeedsHuman, Succeeded, Failed, or TimedOut.
- NeedsHuman may only transition back to Running (once a human responds).
- Failed may only transition to Retrying.
- Retrying may only transition to Running.
- Succeeded and TimedOut are terminal -- no outgoing transitions.
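The rules listed above translate naturally into a lookup table. The sketch below follows those rules as written; the name validTransitions comes from the text, while the map's exact shape, the State type, and the canTransition helper are assumptions.

```go
package main

import "fmt"

type State string

const (
	Queued     State = "Queued"
	Running    State = "Running"
	NeedsHuman State = "NeedsHuman"
	Succeeded  State = "Succeeded"
	Failed     State = "Failed"
	Retrying   State = "Retrying"
	TimedOut   State = "TimedOut"
)

// validTransitions lists the allowed targets for each source state.
// Terminal states (Succeeded, TimedOut) have no entry.
var validTransitions = map[State][]State{
	Queued:     {Running},
	Running:    {NeedsHuman, Succeeded, Failed, TimedOut},
	NeedsHuman: {Running},
	Failed:     {Retrying},
	Retrying:   {Running},
}

// canTransition reports whether moving from one state to another is allowed.
func canTransition(from, to State) bool {
	for _, s := range validTransitions[from] {
		if s == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(Running, Succeeded)) // allowed
	fmt.Println(canTransition(Succeeded, Running)) // terminal, rejected
}
```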
Heartbeats and Staleness¶
Each TaskRun carries a HeartbeatAt timestamp and a configurable HeartbeatTTLSeconds (default: 300 seconds). The agent container is expected to push heartbeats at regular intervals. If the time since the last heartbeat exceeds the TTL, the IsStale() method returns true and the watchdog may intervene.
Execution Engines¶
The ExecutionEngine interface in pkg/engine/engine.go decouples AI coding tools from the Kubernetes runtime.
Interface¶
type ExecutionEngine interface {
    BuildExecutionSpec(task Task, config EngineConfig) (*ExecutionSpec, error)
    BuildPrompt(task Task) (string, error)
    Name() string
    InterfaceVersion() int
}
Each engine implements BuildExecutionSpec, which takes a Task (containing the ticket ID, title, description, repository URL, and labels) and an EngineConfig (image, resources, timeout), and returns an ExecutionSpec. The spec is a runtime-agnostic description of what container to run:
type ExecutionSpec struct {
    Image                 string
    Command               []string
    Env                   map[string]string
    SecretEnv             map[string]string
    ResourceRequests      Resources
    ResourceLimits        Resources
    Volumes               []VolumeMount
    ActiveDeadlineSeconds int
}
The JobBuilder (internal/jobbuilder/) then translates this spec into a Kubernetes batch/v1.Job, applying security contexts, tolerations, labels, and resource limits.
Supported Engines¶
| Engine | Description |
|---|---|
| Claude Code | Anthropic's CLI agent. Supports hooks for guard rail enforcement and the experimental agent-teams mode for parallel sub-agents. |
| Codex | OpenAI's coding agent. Configured via API key or credentials file. |
| Aider | Open-source AI pair programming tool. Supports multiple LLM backends. |
The default engine is claude-code, configurable via robodev-config.yaml.
TaskResult¶
Every engine writes a structured TaskResult to /workspace/result.json upon completion:
type TaskResult struct {
    Success         bool
    MergeRequestURL string
    BranchName      string
    Summary         string
    TokenUsage      *TokenUsage
    CostEstimateUSD float64
    ExitCode        int // 0=success, 1=agent failure, 2=guard rail blocked
}
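As an illustration of how a consumer might decode the result file and interpret the exit code, consider the sketch below. It uses a reduced copy of the struct and assumes the JSON keys match the Go field names, since the document does not specify the serialised key names; the describe helper is hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TaskResult is a reduced copy of the struct above (TokenUsage omitted).
type TaskResult struct {
	Success    bool
	BranchName string
	Summary    string
	ExitCode   int
}

// describe decodes a result document and maps ExitCode to its meaning.
func describe(raw []byte) (string, error) {
	var r TaskResult
	if err := json.Unmarshal(raw, &r); err != nil {
		return "", fmt.Errorf("decode result: %w", err)
	}
	switch r.ExitCode {
	case 0:
		return "success", nil
	case 1:
		return "agent failure", nil
	case 2:
		return "guard rail blocked", nil
	default:
		return "unknown", nil
	}
}

func main() {
	// Key names here are an assumption for illustration only.
	raw := []byte(`{"Success":false,"BranchName":"robodev/t-42","ExitCode":2}`)
	fmt.Println(describe(raw))
}
```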
Plugin System¶
RoboDev's plugin system, implemented in pkg/plugin/, provides two integration mechanisms:
Built-in Plugins¶
Built-in plugins are compiled directly into the controller binary. They implement the relevant Go interface and are registered at startup via functional options (e.g. WithTicketing, WithEngine, WithNotifier). These are first-class citizens with no serialisation overhead.
Third-party (gRPC) Plugins¶
Third-party plugins run as separate processes and communicate with the controller over gRPC using hashicorp/go-plugin. The plugin host (pkg/plugin/host.go) manages their lifecycle:
- Spawning -- the host starts the plugin binary as a subprocess using exec.Command.
- Handshake -- a magic cookie (ROBODEV_PLUGIN=robodev) and protocol version are exchanged to verify compatibility.
- Health monitoring -- the host tracks each plugin's health state and restart count.
- Automatic restart -- if a plugin dies, the host restarts it after an escalating backoff delay (default schedule: 1s, 5s, 30s), up to a configurable maximum restart count (default: 3).
- Graceful shutdown -- on controller termination, all plugin subprocesses are killed.
Plugin Interfaces¶
RoboDev defines six plugin interfaces. Every interface includes a Handshake RPC with an interface_version field for forward-compatible version negotiation.
| Interface | Type Constant | Purpose |
|---|---|---|
| TicketingBackend | ticketing | Polls for ready tickets, marks tickets as in-progress/complete/failed. |
| NotificationChannel | notifications | Fire-and-forget notifications (e.g. Slack, Microsoft Teams, email). |
| HumanApprovalBackend | approval | Event-driven human-in-the-loop approval workflow. |
| SecretsBackend | secrets | Retrieves secrets at runtime (Kubernetes Secrets, HashiCorp Vault, etc.). |
| SCMBackend | scm | Source control operations -- clone, branch, commit, push, create PR/MR. |
| ReviewBackend | review | Automated code review (e.g. CodeRabbit, Semgrep). |
All interfaces are defined as protobuf services in proto/ (the source of truth) and generated into Go, Python, and TypeScript SDKs.
Guard Rails¶
RoboDev enforces safety through six complementary layers. For full details, see the Guard Rails documentation.
- Controller validation -- the reconciler validates each ticket against configurable rules before creating a Job. This includes allowed repository patterns (glob matching), allowed task types, and blocked file patterns. Violations are rejected immediately and the ticket is marked as failed.
- Engine hooks -- for engines that support them (notably Claude Code), hooks run inside the agent container at tool-call boundaries. These hooks can intercept and block dangerous operations (e.g. writing to protected files, executing disallowed commands) before they take effect.
- Repository guard rail files -- guardrails.md and CLAUDE.md files placed in target repositories provide per-repo instructions that the AI agent may follow. These files are read by the agent naturally during execution (Claude Code reads CLAUDE.md automatically). The controller does not currently inject them; prompt-builder injection is on the roadmap.
- Task profiles -- configuration-driven profiles that define cost and duration limits per task type. The config schema is defined and values are stored, but per-task-type file pattern restrictions (allowed_file_patterns, blocked_file_patterns) are not yet enforced at runtime.
- Quality gate -- an optional post-completion review step. A separate AI engine (or the same one) reviews the agent's output for security issues, OWASP patterns, leaked secrets, and dependency CVEs. Configurable responses include retry_with_feedback, block_mr, or notify_human.
- Progress watchdog -- a continuous monitoring loop that detects stalled, looping, or unproductive agents during execution. See the Progress Watchdog section below for details.
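The first layer, controller validation, can be sketched with the standard library's glob matcher. The Rules type and validate function are hypothetical, and path.Match is used here as one possible glob implementation; the document does not say which matcher the controller uses.

```go
package main

import (
	"fmt"
	"path"
)

// Rules is a hypothetical subset of the guard rail configuration.
type Rules struct {
	AllowedRepos     []string // glob patterns
	AllowedTaskTypes []string
}

// validate rejects tickets whose repository or task type is not allowed.
func validate(r Rules, repo, taskType string) error {
	repoOK := false
	for _, pat := range r.AllowedRepos {
		ok, err := path.Match(pat, repo)
		if err != nil {
			return fmt.Errorf("bad repo pattern %q: %w", pat, err)
		}
		if ok {
			repoOK = true
			break
		}
	}
	if !repoOK {
		return fmt.Errorf("repository %q is not allowed", repo)
	}
	for _, t := range r.AllowedTaskTypes {
		if t == taskType {
			return nil
		}
	}
	return fmt.Errorf("task type %q is not allowed", taskType)
}

func main() {
	rules := Rules{
		AllowedRepos:     []string{"github.com/acme/*"},
		AllowedTaskTypes: []string{"bugfix", "docs"},
	}
	fmt.Println(validate(rules, "github.com/acme/api", "bugfix"))
	fmt.Println(validate(rules, "github.com/other/api", "bugfix"))
}
```

Note that with path.Match a `*` does not cross `/` separators, so a pattern such as github.com/acme/* covers one path segment only.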
Job Lifecycle¶
The complete lifecycle of a ticket from discovery to pull request:
sequenceDiagram
participant TP as Ticketing Poller
participant Ctrl as Controller
participant GR as Guard Rails
participant Eng as Engine
participant JB as JobBuilder
participant K8s as Kubernetes
participant WD as Watchdog
participant SCM as SCM Backend
participant Rev as Review Backend
participant Notif as Notifications
TP->>Ctrl: Ready ticket
Ctrl->>GR: Validate (repos, task types, files)
GR-->>Ctrl: Pass
Ctrl->>Eng: BuildExecutionSpec(task, config)
Eng-->>Ctrl: ExecutionSpec
Ctrl->>JB: Translate to batch/v1.Job
JB-->>Ctrl: Job manifest
Ctrl->>K8s: Create Job
Ctrl->>Notif: NotifyStart
loop Watchdog loop
WD->>K8s: Check heartbeats + anomalies
end
K8s-->>Ctrl: Job completed
Ctrl->>Ctrl: Read TaskResult
Ctrl->>SCM: Create PR/MR
Ctrl->>Rev: Review output (if quality gate enabled)
Ctrl->>TP: Update ticket (succeeded/failed)
Ctrl->>Notif: NotifyComplete
The steps in detail:
- Poll -- the ticketing backend returns a ready ticket.
- Validate -- the controller checks guard rails (allowed repos, task types, file patterns).
- Build spec -- the selected engine produces an ExecutionSpec containing the container image, command, environment variables, secret references, resource limits, and deadline.
- Create job -- the JobBuilder translates the spec into a batch/v1.Job with:
  - Labels: app=robodev-agent, robodev.io/task-run-id, robodev.io/engine
  - Security context: runAsNonRoot, runAsUser: 1000, readOnlyRootFilesystem, allowPrivilegeEscalation: false, all capabilities dropped, RuntimeDefault seccomp profile
  - Tolerations for the robodev.io/agent taint (to schedule on dedicated node pools)
  - BackoffLimit: 0 (retries are handled by the controller, not Kubernetes)
  - RestartPolicy: Never
- Monitor heartbeats -- the watchdog loop evaluates heartbeat telemetry from the running agent, checking for loops, thrashing, stalls, cost overruns, and telemetry failures.
- Collect result -- once the Job completes, the controller reads the TaskResult from the agent's output (success, branch name, MR URL, token usage, cost).
- Create PR/MR -- the SCM backend creates a pull request or merge request from the agent's branch.
- Review -- if the quality gate is enabled, the review backend performs automated code review.
- Update ticket -- the ticketing backend is updated with the final status (succeeded or failed, with reason).
- Notify -- all configured notification channels are informed of the outcome.
Progress Watchdog¶
The watchdog (internal/watchdog/) runs as a separate loop alongside the reconciler, checking active TaskRuns at a configurable interval (default: 60 seconds).
Detection Rules¶
| Rule | What It Detects | Default Threshold | Default Action |
|---|---|---|---|
| Loop detection | Agent calling the same tool with the same arguments repeatedly, with no file progress | 10 consecutive identical calls | terminate_with_feedback |
| Thrashing detection | High token consumption without meaningful file changes | 80,000 tokens without progress | warn, escalating to terminate_with_feedback |
| Stall detection | No tool calls despite heartbeat still advancing | 300 seconds idle | terminate |
| Cost velocity | Spending rate exceeds threshold | $15 USD per 10 minutes | warn |
| Telemetry failure | Heartbeat sequence number not advancing | 3 stale ticks | warn |
| Unanswered human | NeedsHuman state with no response | 30 minutes | terminate_and_notify |
Consecutive Tick Requirement¶
To avoid false positives, the watchdog requires an anomaly to persist for a configurable number of consecutive ticks (min_consecutive_ticks, default: 2) before taking action. If the anomaly resolves before reaching the threshold, the tick counter resets.
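One way to implement the consecutive-tick requirement is a per-anomaly counter that resets whenever the anomaly clears. The tickCounter type and the string key format are assumptions for illustration; only min_consecutive_ticks and its default of 2 come from the text.

```go
package main

import "fmt"

// tickCounter tracks how many consecutive ticks each anomaly has persisted.
type tickCounter struct {
	minTicks int // mirrors min_consecutive_ticks
	counts   map[string]int
}

func newTickCounter(minTicks int) *tickCounter {
	return &tickCounter{minTicks: minTicks, counts: make(map[string]int)}
}

// observe records one tick for the given key (for example "taskrun/rule")
// and reports whether the watchdog should act on it now.
func (c *tickCounter) observe(key string, anomalous bool) bool {
	if !anomalous {
		delete(c.counts, key) // anomaly resolved: reset the counter
		return false
	}
	c.counts[key]++
	return c.counts[key] >= c.minTicks
}

func main() {
	c := newTickCounter(2)
	fmt.Println(c.observe("tr-1/stall", true))  // first sighting, no action yet
	fmt.Println(c.observe("tr-1/stall", true))  // second consecutive tick, act
	fmt.Println(c.observe("tr-1/stall", false)) // resolved, counter resets
}
```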
Research Grace Period¶
Newly created TaskRuns receive a grace period (default: 5 minutes) during which thrashing detection is relaxed, since agents commonly consume many tokens during initial code analysis without yet producing file changes.
Actions¶
| Action | Behaviour |
|---|---|
| terminate | Kills the Job immediately. |
| terminate_with_feedback | Kills the Job and attaches diagnostic feedback to the TaskRun, available for the next retry. |
| terminate_and_notify | Kills the Job and sends a notification to the configured channels. |
| warn | Logs a structured warning and sets a condition on the TaskRun, but does not terminate. |
Diagnostic Reason structs are populated from templates (never from raw agent output) to prevent prompt injection into the watchdog feedback path.
Intelligence Layer¶
Seven subsystems extend RoboDev's intelligence beyond basic orchestration. PRM and Memory are fully wired into the controller — the remaining five have complete packages with unit tests and are tracked for integration in docs/roadmap.md under Phase I.
Subsystem Architecture¶
graph TD
subgraph Intelligence["Intelligence Layer"]
MEM["Episodic Memory<br/>internal/memory/<br/>✅ ACTIVE"]
PRM["Process Reward Model<br/>internal/prm/<br/>✅ ACTIVE"]
LLM["LLM Abstraction<br/>internal/llm/<br/>✅ ACTIVE"]
DIAG["Causal Diagnosis<br/>internal/diagnosis/"]
CAL["Adaptive Calibrator<br/>internal/watchdog/calibrator"]
ROUTE["Intelligent Routing<br/>internal/routing/"]
EST["Cost Estimator<br/>internal/estimator/"]
TOURN["Tournament<br/>internal/tournament/"]
end
subgraph Controller["Controller (Reconciler)"]
PT["ProcessTicket"]
HJC["handleJobComplete"]
HJF["handleJobFailed"]
SSR["startStreamReader"]
WD["Watchdog.Check"]
end
PT -.->|"predict cost"| EST
PT -.->|"select engine"| ROUTE
PT -.->|"start tournament"| TOURN
SSR -->|"score each step"| PRM
WD -.->|"calibrated thresholds"| CAL
HJC -->|"extract knowledge"| MEM
HJC -.->|"record outcome"| ROUTE
HJC -.->|"record outcome"| EST
HJC -.->|"record telemetry"| CAL
HJF -.->|"diagnose failure"| DIAG
HJF -->|"extract knowledge"| MEM
MEM -->|"inject context"| PT
DIAG -.->|"informed retry"| PT
style Intelligence fill:#1a1a2e,stroke:#16213e
style Controller fill:#0f3460,stroke:#16213e
Solid lines indicate active integrations. Dashed lines indicate planned integration points (not yet wired).
Controller-Level Process Reward Model (internal/prm/) — Active¶
The PRM evaluates agent behaviour in real-time using the NDJSON event stream from internal/agentstream/. It operates purely on observable telemetry — no agent modification required. The PRM is wired into the controller: when prm.enabled: true, the controller creates a prm.Evaluator per TaskRun, feeds streaming events via WithEventProcessor, records interventions with Prometheus metrics, and cleans up evaluators on job completion or failure.
Flow: Stream events → rolling window → rule-based scoring (1-10) → trajectory pattern detection → intervention decision.
Interventions:
- Continue -- agent is productive, no action needed
- Nudge -- log a structured hint with guidance and record on TaskRun
- Escalate -- signal the watchdog to terminate the Job with diagnostic feedback
Trajectory patterns detected: sustained decline (3+ consecutive drops), plateau (5+ identical scores), oscillation (alternating up/down), recovery (3+ consecutive increases).
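Three of these patterns can be detected by counting the trailing run of score changes, as in the sketch below. The function names are hypothetical, and oscillation is left out because the text gives no minimum length for it.

```go
package main

import "fmt"

// trailingRun counts how many of the most recent consecutive steps satisfy cmp.
func trailingRun(scores []int, cmp func(prev, cur int) bool) int {
	n := 0
	for i := len(scores) - 1; i > 0 && cmp(scores[i-1], scores[i]); i-- {
		n++
	}
	return n
}

// detect classifies the tail of a score series (each score in 1-10).
func detect(scores []int) string {
	drops := trailingRun(scores, func(p, c int) bool { return c < p })
	rises := trailingRun(scores, func(p, c int) bool { return c > p })
	flat := trailingRun(scores, func(p, c int) bool { return c == p })
	switch {
	case drops >= 3:
		return "sustained_decline"
	case rises >= 3:
		return "recovery"
	case flat >= 4: // five identical scores means four equal steps
		return "plateau"
	default:
		return "none"
	}
}

func main() {
	fmt.Println(detect([]int{8, 7, 5, 4}))
	fmt.Println(detect([]int{5, 5, 5, 5, 5}))
	fmt.Println(detect([]int{3, 4, 6, 7}))
}
```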
For full details, see Real-Time Agent Coaching (PRM).
Episodic Memory (internal/memory/) — Active¶
A persistent temporal knowledge graph that accumulates facts across all TaskRuns. Facts have confidence values that decay over time as repositories evolve. Memory is wired into the controller: when memory.enabled: true, knowledge is extracted on job completion and failure, relevant prior knowledge is queried before building prompts, and a background goroutine handles confidence decay and pruning.
Node types:
- Fact — a specific observation (e.g. "repo X has flaky test Y", "engine Z fails on Python monorepos")
- Pattern — a recurring observation across multiple tasks
- EngineProfile — per-engine capability summary
Storage: SQLite via modernc.org/sqlite (pure Go, no CGO). Auto-migration on startup.
Temporal weighting: queries weight facts by confidence × decay_factor(age). Stale facts below a configurable threshold are pruned.
Cross-tenant isolation: all queries are scoped by tenant ID. Fact extraction tags each node with the originating tenant.
For full details, see Episodic Memory.
LLM Abstraction (internal/llm/) — Active¶
A DSPy-inspired package providing typed, composable LLM interactions for all intelligent subsystems. Defines Signature types with typed input/output fields, Module interface with Predict and ChainOfThought implementations, and a Budget tracker for per-subsystem cost enforcement. Uses only net/http — no external SDK dependency.
For full details, see LLM Abstraction Layer.
Causal Diagnosis (internal/diagnosis/)¶
Replaces blind retry with informed corrective action. When a task fails, the analyser classifies the failure mode from the stream transcript, watchdog reason, and result data.
Failure modes: WrongApproach, DependencyMissing, TestMisunderstanding, ScopeCreep, PermissionBlocked, ModelConfusion, InfraFailure.
Prescriptions are generated from safe text/template templates (never from raw agent output) to prevent prompt injection into the retry prompt.
Deduplication: DiagnosisHistory on the TaskRun prevents repeating the same diagnosis — if the same failure mode recurs, the task goes terminal rather than retrying endlessly.
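The deduplication rule can be sketched as a simple membership check on the history. The shouldRetry function and the use of plain strings for failure modes are assumptions; the source only states that a recurring failure mode makes the task terminal.

```go
package main

import "fmt"

// shouldRetry returns the updated diagnosis history and whether another
// retry is warranted. A failure mode seen before means the task goes terminal.
func shouldRetry(history []string, mode string) ([]string, bool) {
	for _, seen := range history {
		if seen == mode {
			return history, false
		}
	}
	return append(history, mode), true
}

func main() {
	var history []string
	var retry bool
	history, retry = shouldRetry(history, "DependencyMissing")
	fmt.Println(retry) // new failure mode: retry with a prescription
	history, retry = shouldRetry(history, "DependencyMissing")
	fmt.Println(retry, history) // same mode again: go terminal
}
```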
Adaptive Watchdog Calibration (internal/watchdog/calibrator.go, profiles.go)¶
Extends the existing watchdog with per-(repo, engine, task_type) adaptive thresholds. Tracks running percentile statistics (P50, P90, P99) for key telemetry signals from completed TaskRuns.
Cold-start logic: requires a minimum of 10 completed TaskRuns for a given profile key before overriding static defaults.
Profile resolution: exact match → partial match (e.g. same engine but any repo) → global fallback → static config values.
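The resolution chain combined with the cold-start rule might look like the sketch below. The Key and Profile types, the "*" wildcard convention, and the single StallSeconds threshold are assumptions made to keep the example small.

```go
package main

import "fmt"

// Key identifies a calibration profile; "*" is used here as a wildcard.
type Key struct{ Repo, Engine, TaskType string }

// Profile holds a learned threshold and the number of samples behind it.
type Profile struct {
	Samples      int
	StallSeconds float64
}

const minSamples = 10 // cold-start minimum from the text

// resolve walks exact -> partial (any repo) -> global, skipping profiles
// that have too few samples, and falls back to the static config value.
func resolve(profiles map[Key]Profile, k Key, static float64) float64 {
	candidates := []Key{
		k,
		{Repo: "*", Engine: k.Engine, TaskType: k.TaskType},
		{Repo: "*", Engine: "*", TaskType: "*"},
	}
	for _, c := range candidates {
		if p, ok := profiles[c]; ok && p.Samples >= minSamples {
			return p.StallSeconds
		}
	}
	return static
}

func main() {
	profiles := map[Key]Profile{
		{"acme/api", "claude-code", "bugfix"}: {Samples: 4, StallSeconds: 120},
		{"*", "claude-code", "bugfix"}:        {Samples: 25, StallSeconds: 240},
	}
	k := Key{"acme/api", "claude-code", "bugfix"}
	fmt.Println(resolve(profiles, k, 300)) // exact match is still cold, partial wins
}
```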
Engine Fingerprinting and Routing (internal/routing/)¶
Builds statistical profiles of each engine from historical task outcomes. Uses Laplace-smoothed success rates across dimensions (task type, repo language, repo size, complexity).
Selection algorithm: epsilon-greedy — with probability ε (default 0.1), picks a random engine for exploration; otherwise picks the engine with the highest composite score.
Interface: implements the existing EngineSelector interface, so it's a drop-in replacement for DefaultEngineSelector.
Predictive Cost Estimation (internal/estimator/)¶
Pre-execution cost and duration prediction using multi-dimensional complexity scoring and k-nearest-neighbours from historical data.
Complexity dimensions: description length/complexity, label mapping, normalised repo size, task type base complexity.
Output: low/high ranges for cost (USD) and duration (minutes) with a confidence score based on sample count.
Competitive Execution / Tournament (internal/tournament/)¶
For high-value tasks, launches N parallel K8s Jobs (different engines or strategies). A "judge" Job compares the resulting diffs and selects the best solution.
Lifecycle: Start → Competing (N candidates running) → Judging (enough candidates complete, judge launched) → Selected/Eliminated (winner chosen, losers cleaned up).
Early termination: configurable threshold (default 60% of candidates must complete before triggering judge). Remaining slow candidates are terminated.
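The early-termination threshold reduces to a small calculation. The sketch assumes the required count is rounded up and uses integer arithmetic to avoid floating-point edge cases; both helper names are hypothetical.

```go
package main

import "fmt"

// requiredCompletions returns how many candidates must finish before the
// judge is launched, rounding up (an assumption; the text gives only 60%).
func requiredCompletions(candidates, thresholdPercent int) int {
	return (candidates*thresholdPercent + 99) / 100
}

// readyToJudge reports whether enough candidates have completed.
func readyToJudge(completed, candidates, thresholdPercent int) bool {
	return completed >= requiredCompletions(candidates, thresholdPercent)
}

func main() {
	fmt.Println(requiredCompletions(5, 60)) // candidates needed out of five
	fmt.Println(readyToJudge(2, 5, 60))     // not enough yet
	fmt.Println(readyToJudge(3, 5, 60))     // judge can be launched
}
```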
Security Architecture¶
RoboDev is a security-first project. For the full threat model and mitigations, see the Security Model.
Container Isolation¶
Every agent Job is created with a restrictive security context:
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
  seccompProfile:
    type: RuntimeDefault
Writable paths are limited to explicitly mounted emptyDir volumes (e.g. /workspace).
Network Policies¶
Agent pods should be deployed with Kubernetes NetworkPolicy resources that restrict egress to only the required endpoints (API providers, SCM hosts). The Helm chart includes templates for these policies.
Secret Management¶
Secrets are never passed as plain-text environment variables in Job specs. Instead, the SecretEnv map references Kubernetes Secrets by name, and Kubernetes injects them at pod startup. The SecretsBackend plugin interface supports external providers such as HashiCorp Vault. API keys and credentials are never logged.
Input Validation¶
All external input -- ticket descriptions, plugin responses, webhook payloads -- is validated before processing. The watchdog's Reason structs use templated messages rather than raw agent output to prevent prompt injection.
Workload Identity¶
Where possible, RoboDev prefers workload identity patterns (AWS IRSA, GCP WIF) over static credentials. The engine configuration supports bedrock and vertex authentication methods that leverage pod-level identity bindings.
Observability¶
Prometheus Metrics¶
The controller exposes the following metrics under the robodev_ namespace:
| Metric | Type | Labels | Description |
|---|---|---|---|
| robodev_taskruns_total | Counter | state | Total number of TaskRuns by final state. |
| robodev_taskrun_duration_seconds | Histogram | engine | Duration of TaskRuns. Buckets from 1 minute to ~2 hours (exponential, base 2, 8 buckets). |
| robodev_active_jobs | Gauge | -- | Number of currently active Jobs. |
| robodev_plugin_errors_total | Counter | plugin | Total number of plugin errors by plugin name. |
Metrics are registered via promauto and are available at the standard /metrics endpoint.
Structured Logging¶
All logging uses Go's standard library slog package with JSON output. Log entries include structured fields such as ticket_id, task_run_id, engine, job, error, and duration. Context-aware logging (InfoContext, ErrorContext, WarnContext) ensures that request-scoped values propagate correctly.
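A minimal example of this logging style, using log/slog with the JSON handler (Go 1.21 or later). The field values, the message text, and the helper names are illustrative; the output is written to a buffer here only so that it can be inspected.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"log/slog"
	"strings"
)

// logJobCreated emits one structured entry with the fields named above.
func logJobCreated(ctx context.Context, l *slog.Logger) {
	l.InfoContext(ctx, "job created",
		"ticket_id", "TICKET-42",
		"task_run_id", "tr-1",
		"engine", "claude-code",
	)
}

// render captures the JSON line produced by the logger.
func render() string {
	var buf bytes.Buffer
	logger := slog.New(slog.NewJSONHandler(&buf, nil))
	logJobCreated(context.Background(), logger)
	return buf.String()
}

// hasField checks for a string-valued key in the rendered JSON line.
func hasField(line, key, value string) bool {
	return strings.Contains(line, fmt.Sprintf("%q:%q", key, value))
}

func main() {
	line := render()
	fmt.Print(line)
	fmt.Println(hasField(line, "engine", "claude-code"))
}
```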
Grafana Dashboards¶
The Helm chart includes provisioning for Grafana dashboards that visualise:
- Active job count over time
- TaskRun success/failure rates by engine
- TaskRun duration distributions
- Plugin error rates
- Cost velocity and token consumption trends
- Watchdog anomaly frequency
Configuration¶
All controller behaviour is driven by robodev-config.yaml, loaded at startup by the internal/config/ package. The configuration covers:
- Ticketing -- backend selection and connection details
- Notifications -- one or more notification channels
- Secrets -- backend selection (Kubernetes Secrets, Vault, etc.)
- Engines -- default engine, per-engine image/auth/resource settings
- Guard rails -- max cost per job, concurrent job limit, duration limit, allowed repos, blocked file patterns, task types
- Plugin health -- max restarts, backoff schedule, critical plugin list
- Quality gate -- enabled/disabled, review engine, security checks, failure action
- Tenancy -- shared or namespace-per-tenant mode
- Progress watchdog -- interval, thresholds, and actions for each detection rule
Environment variables can override configuration values where appropriate, following twelve-factor principles.