Troubleshooting
Common issues and their solutions when running RoboDev.
Controller Pod Is Not Starting
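```shell
# Show pod status and scheduling events
kubectl describe pod -n robodev -l app.kubernetes.io/name=robodev
# Show logs from the previous (crashed) container instance
kubectl logs -n robodev deployment/robodev --previous
```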
| Symptom | Cause | Fix |
|---|---|---|
| ImagePullBackOff | Wrong image repository or tag, or missing pull secret | Check image.repository and image.tag in your values. Set imagePullSecrets if using a private registry. |
| CrashLoopBackOff | Configuration error | Inspect logs for the specific error. The most common issue is a missing or malformed config section. |
| Pending | Insufficient cluster resources | Check kubectl describe pod for scheduling events. Ensure your cluster has enough CPU and memory. |
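For Pending or ImagePullBackOff pods, the warning events in the namespace usually name the failing step. A minimal sketch, wrapped in a helper function (the name recent_warnings is only for illustration and is not part of RoboDev):

```shell
# Show the most recent Warning events in the robodev namespace.
# Optional first argument: how many lines to keep (default 20).
recent_warnings() {
  kubectl get events -n robodev \
    --field-selector type=Warning \
    --sort-by=.lastTimestamp | tail -n "${1:-20}"
}
```

Run recent_warnings, or recent_warnings 5 for a shorter list. FailedScheduling events point to resource shortages, while image pull failures are described in the event message.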
Issues Are Not Being Picked Up
- Confirm the issue has the correct label (must match config.ticketing.config.labels exactly).
- Check that config.ticketing.config.owner and config.ticketing.config.repo match the repository.
- Verify the GitHub token secret exists and has the required scopes.
- Look for polling errors in the controller logs:
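The secret and log checks in this list can be scripted. A minimal sketch, assuming the token lives in a secret named robodev-github-token (a hypothetical name; use the secret referenced in your values) and that polling failures are logged with words such as error or unauthorized:

```shell
# Hypothetical secret name; substitute the one referenced in your values.
SECRET_NAME="robodev-github-token"

# Succeeds only if the secret exists in the robodev namespace.
check_token_secret() {
  kubectl get secret -n robodev "$SECRET_NAME" >/dev/null 2>&1
}

# Keep only log lines that mention polling or ticketing problems.
# The keywords are guesses; adjust them to the controller's actual wording.
polling_errors() {
  grep -iE 'poll|ticketing' | grep -iE 'error|fail|unauthorized|forbidden'
}

# Usage:
#   check_token_secret || echo "secret $SECRET_NAME not found"
#   kubectl logs -n robodev deployment/robodev --since=10m | polling_errors
```

This confirms only that the secret exists. Token scopes still have to be checked against the token itself, for example in your GitHub token settings.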
Tip
The controller polls on a configurable interval (default: 30 seconds). If you need faster pickup, consider enabling webhooks for near-instant ingestion.
Agent Jobs Are Failing
```shell
# List recent jobs and their status
kubectl get jobs -n robodev
# Get logs from a failed job's pod
kubectl logs -n robodev job/<job-name>
```
| Symptom | Cause | Fix |
|---|---|---|
| API key invalid | Expired or incorrect API key | Verify your Anthropic or OpenAI secret contains a valid key. Recreate the secret if needed. |
| Cost limit reached | Job exceeded max_cost_per_job | Increase the limit in your guard rails config or simplify the task. |
| Duration limit reached | Job exceeded max_job_duration_minutes | Increase the limit or break the task into smaller pieces. |
| Guard rail rejection | Ticket failed validation | Check logs for the specific guard rail that rejected the ticket. Adjust allowed_repos, allowed_task_types, or blocked_file_patterns as needed. |
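When a job fails before the agent writes any useful logs, the Job's status conditions record the Kubernetes-level reason. A small sketch (job_failure_reason is an illustrative helper name, not part of RoboDev):

```shell
# Print the reason attached to a Job's Failed condition, if any.
# $1 is the job name as shown by `kubectl get jobs -n robodev`.
job_failure_reason() {
  kubectl get job "$1" -n robodev \
    -o jsonpath='{.status.conditions[?(@.type=="Failed")].reason}'
  echo  # jsonpath output has no trailing newline
}
```

Typical Kubernetes values are BackoffLimitExceeded and DeadlineExceeded. Empty output means the Job has no Failed condition. RoboDev-specific causes from the table above, such as guard rail rejections, are more likely to appear in the logs than here.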
Metrics Endpoint Is Not Working
Ensure metrics.enabled is set to true in your values (this is the default). The metrics are served on the port specified by metrics.port (default 8080).
If you are using a ServiceMonitor, confirm that metrics.serviceMonitor.enabled is true and the labels match your Prometheus operator configuration.
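```shell
# Test metrics endpoint directly
kubectl port-forward -n robodev deployment/robodev 8080:8080 &
sleep 2  # give the port-forward time to bind before querying it
curl -s http://localhost:8080/metrics | head -20
kill $!  # stop the background port-forward
```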
Webhooks Are Not Working
- Verify the webhook pod port matches webhook.port in your values (default: 8081).
- Check that your ticketing provider is sending events to the correct URL.
- Verify the webhook secret matches between your provider and RoboDev configuration.
- Check network policies allow inbound traffic on the webhook port.
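One way to confirm the shared secret, assuming your provider is GitHub: GitHub signs the raw body of each delivery with HMAC-SHA256 and sends the result in the X-Hub-Signature-256 header as sha256=<hex>. The sketch below recomputes that value locally with openssl. The function name and the payload.json file are illustrative.

```shell
# Compute the GitHub-style signature for a payload read from stdin.
# $1 is the webhook secret.
webhook_signature() {
  printf 'sha256=%s\n' "$(openssl dgst -sha256 -hmac "$1" | awk '{print $NF}')"
}

# Usage (payload.json must hold the exact raw request body):
#   webhook_signature "$WEBHOOK_SECRET" < payload.json
```

If the printed value differs from the header on a recorded delivery, the secret configured in RoboDev probably does not match the one set at the provider. Make sure the payload is byte-for-byte identical to what was sent, since any reformatting changes the signature.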
Notifications Are Not Being Sent
- Verify the notification channel is correctly configured in your values.
- Check that the Slack bot token or Teams webhook URL is valid.
- Ensure the Slack bot has been invited to the target channel.
- Look for notification errors in the logs:
Info
Notification failures are non-critical — they are logged but do not block the controller. Check the robodev_plugin_errors_total Prometheus metric for persistent failures.
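Both the log check and the metric check can be done with small stdin filters. A rough sketch: the helper names are made up for illustration, and the log keywords are guesses at the controller's wording rather than its documented output.

```shell
# Keep only log lines that look like notification failures.
# The keywords are guesses; adjust them to the controller's actual wording.
notification_errors() {
  grep -iE 'notif|slack|teams' | grep -iE 'error|fail'
}

# Sum every robodev_plugin_errors_total sample in Prometheus text format.
# Assumes samples carry no trailing timestamp, so the value is the last field.
plugin_error_total() {
  awk '/^robodev_plugin_errors_total/ { sum += $NF } END { print sum + 0 }'
}

# Usage:
#   kubectl logs -n robodev deployment/robodev | notification_errors
#   curl -s http://localhost:8080/metrics | plugin_error_total
```

A total that keeps growing between two runs suggests a persistent failure rather than a one-off.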
Docker Compose Issues
Controller exits immediately
Check the logs for configuration errors:
Ensure your .env file contains valid GITHUB_TOKEN and ANTHROPIC_API_KEY values.
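The two checks above can be sketched as shell helpers. The service name controller is an assumption, so use whatever name your Compose file defines. The second helper only confirms that the variables are present and non-empty, not that the credentials are valid.

```shell
# Hypothetical service name; use the one defined in your Compose file.
SERVICE="controller"

# Print the last 100 log lines of the controller service.
controller_logs() {
  docker compose logs --no-color --tail=100 "$SERVICE"
}

# Print the name of each required variable that is missing or empty
# in the given env file (default: .env).
missing_env_vars() {
  env_file="${1:-.env}"
  for name in GITHUB_TOKEN ANTHROPIC_API_KEY; do
    # Require NAME= followed by a real character, optionally after a quote.
    grep -Eq "^${name}=[\"']?[^\"'[:space:]]" "$env_file" 2>/dev/null || echo "$name"
  done
}
```

Running missing_env_vars with no arguments checks .env in the current directory. No output means both variables are set.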
Agent container cannot reach GitHub
Ensure Docker has network access and can resolve api.github.com. If you are behind a corporate proxy, configure Docker's proxy settings.
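A quick way to test this from inside Docker rather than from the host is to run a throwaway container. The sketch below assumes the public curlimages/curl image can be pulled and passes its arguments to curl. The helper name is illustrative.

```shell
# Print the HTTP status that a throwaway container gets from api.github.com.
# 000 means DNS resolution or the connection failed.
github_status_from_docker() {
  docker run --rm curlimages/curl -sS -o /dev/null -m 10 \
    -w '%{http_code}\n' https://api.github.com
}
```

Any real HTTP status code shows that DNS and outbound HTTPS work from containers. A result of 000, together with the error curl prints, points to a DNS or proxy problem.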
Watchdog Is Terminating Jobs
The progress watchdog may terminate jobs that appear stalled, looping, or unproductive. Check the termination reason in the logs:
| Reason | What happened | Fix |
|---|---|---|
| Loop detected | Agent called the same tool repeatedly | The task may be too ambiguous. Provide clearer instructions in the issue. |
| Thrashing | High token use without file changes | Increase thrashing_token_threshold or add a longer research_grace_period_minutes for complex tasks. |
| Stall | No activity for extended period | Check if the agent container has network access to the AI API endpoint. |
| Cost velocity | Spending too fast | Reduce task complexity or increase cost_velocity_max_per_10_min. |
See Guard Rails Overview for details on each detection rule and how to tune thresholds.
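To see which reason fires most often, the log output can be tallied. This is a rough sketch that assumes the log lines contain the reason names from the table verbatim, which may not match the controller's exact wording. The helper name is made up for illustration.

```shell
# Count how often each watchdog reason appears in log text read from stdin.
# Matching is case-insensitive; counts are printed highest first.
watchdog_reasons() {
  grep -ioE 'loop detected|thrashing|stall|cost velocity' \
    | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn
}

# Usage:
#   kubectl logs -n robodev deployment/robodev | watchdog_reasons
```

A single dominant reason suggests tuning one threshold. A mix of reasons more often points to tasks that are too broad.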
Getting Help
If you cannot resolve an issue:
- Search the GitHub Issues for similar problems.
- Open a new issue with the controller logs, your values.yaml (with secrets redacted), and a description of the expected vs actual behaviour.
- Join the community discussion on GitHub Discussions.