Scaling¶
Overview¶
RoboDev is designed for horizontal scaling on Kubernetes. The controller runs as a single replica, whilst agent jobs scale independently via Karpenter node provisioning and configurable concurrency limits.
Controller High Availability¶
Planned feature
Leader election (multiple controller replicas with automatic failover) is not yet implemented. The controller currently runs as a single replica. HA support via controller-runtime Kubernetes Lease objects is on the roadmap.
For now, run a single controller replica. Agent jobs are independent Kubernetes Jobs and are unaffected by a controller restart — in-flight jobs complete normally.
Karpenter Integration¶
RoboDev agent jobs are ephemeral, CPU-bound workloads. Karpenter provisions dedicated nodes on demand when jobs are created and scales down when they complete.
Recommended NodePool¶
See examples/karpenter/nodepool.yaml for a production-ready configuration.
Key design decisions:
- Spot instances for cost savings (jobs are stateless and retryable)
- Taints to isolate agent workloads from production services
- WhenEmpty consolidation to avoid disrupting running jobs
- Instance type diversity across c6i/c7i/m6i for spot availability
- CPU limits to cap maximum infrastructure spend
Tolerations¶
Agent jobs include tolerations for the robodev.io/agent taint by default:
Concurrency Limits¶
The controller enforces a maximum number of concurrent jobs to prevent runaway scaling:
When the limit is reached, new tickets are queued until running jobs complete.
Cost Controls¶
Multiple layers of cost control are available:
- Per-job cost ceiling —
max_cost_per_jobterminates jobs exceeding the budget - Job duration limit —
max_job_duration_minutessets a hard timeout - Progress watchdog — detects and terminates unproductive agents early
- Karpenter CPU limits — caps total compute across all RoboDev nodes
- Cost velocity alerts — warns when spending exceeds thresholds
KEDA Scaling¶
For advanced scaling scenarios, KEDA can trigger scaling based on ticket queue depth:
Note: The controller currently runs as a single replica (leader election is on the roadmap). KEDA is most useful for alerting or for scaling downstream resources.
Multi-Tenancy¶
For shared clusters, RoboDev defines a tenancy configuration schema. Namespace-per-tenant runtime isolation is on the roadmap; the config structure and RBAC recommendations are in place:
tenancy:
mode: "namespace-per-tenant"
tenants:
- name: "team-alpha"
namespace: "robodev-alpha"
ticketing:
backend: github
config:
repo: "alpha-org/repos"
When namespace-per-tenant isolation is implemented, each tenant will receive: - Dedicated namespace with its own RBAC, NetworkPolicies, and ResourceQuotas - Separate Kubernetes Secrets (no cross-tenant secret access) - Independent job limits and cost budgets - Isolated Karpenter NodePool (optional, for compute isolation)
Tenancy runtime isolation is on the roadmap
The tenancy config block is parsed and stored but has no runtime effect today. All jobs currently run in the controller's own namespace regardless of tenant configuration.
Observability¶
Prometheus Metrics¶
The controller exposes the following metrics:
| Metric | Type | Description |
|---|---|---|
robodev_taskruns_total |
Counter | Total task runs by state |
robodev_active_jobs |
Gauge | Currently running jobs |
robodev_taskrun_duration_seconds |
Histogram | Job execution duration |
robodev_plugin_errors_total |
Counter | Plugin errors by plugin |
Grafana Dashboard¶
A Grafana dashboard JSON model is included in the Helm chart for visualising RoboDev metrics.
Structured Logging¶
All controller logs are structured JSON via Go's slog, suitable for log aggregation systems (Elasticsearch, Loki, CloudWatch Logs).