Server Resilience
The Server (AgentSmith.Server.dll, the single long-running deployment since p0107)
is split into independent subsystems with separate health states. Most depend on
Redis; the webhook routing path does not. The server boots even when Redis is
missing or unreachable — it stays up and reports why it is degraded via the
/health endpoints, instead of crashing the container.
Subsystems
| Subsystem | Needs Redis | Purpose |
|---|---|---|
| redis | Yes | StackExchange.Redis multiplexer connection state. |
| queue_consumer | Yes | Pulls PipelineRequests off the queue and runs them. |
| housekeeping | Yes | Stale-job detector + enqueued reconciler (leader-elected). |
| poller | Yes | Per-platform ticket polling (leader-elected). |
The webhook routes (/webhook/{github,gitlab,azuredevops,jira}) are served by
the same Kestrel as /health and the chat-platform endpoints — no separate
listener subsystem since p0107.
Each subsystem is in one of four states:
- Up: running normally.
- Degraded: temporarily impaired (Redis is configured but disconnected, or a task crashed and is retrying).
- Down: fatal error.
- Disabled: REDIS_URL is not configured. The subsystem will never start in this process; restart with REDIS_URL set to enable it.
Endpoints
GET /health - liveness
Always returns HTTP 200 as long as the listener is alive. The body is JSON describing every subsystem:
    {
      "status": "degraded",
      "subsystems": [
        { "name": "queue_consumer", "state": "disabled", "reason": "REDIS_URL not configured", "last_changed_utc": "2026-04-26T12:00:00Z" },
        { "name": "housekeeping", "state": "disabled", "reason": "REDIS_URL not configured", "last_changed_utc": "2026-04-26T12:00:00Z" },
        { "name": "poller", "state": "disabled", "reason": "REDIS_URL not configured", "last_changed_utc": "2026-04-26T12:00:00Z" },
        { "name": "redis", "state": "disabled", "reason": "REDIS_URL not configured", "last_changed_utc": "2026-04-26T12:00:00Z" }
      ]
    }
status is ok when every subsystem is up, otherwise degraded. Use
/health for container liveness probes — Kubernetes / Docker should not restart
the pod just because Redis is briefly down.
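The aggregation rule for /health can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not the server's actual implementation; the function names `overall_status` and `liveness_body` are hypothetical, and the subsystem dictionaries mirror the JSON shape shown above.

```python
def overall_status(subsystems):
    """Return "ok" only when every subsystem is "up", otherwise "degraded"."""
    return "ok" if all(s["state"] == "up" for s in subsystems) else "degraded"


def liveness_body(subsystems):
    """Build the /health answer: always HTTP 200, details live in the body."""
    # Hypothetical helper: the status code never depends on subsystem state.
    return 200, {"status": overall_status(subsystems), "subsystems": subsystems}


if __name__ == "__main__":
    subsystems = [
        {"name": "redis", "state": "up", "reason": ""},
        {"name": "poller", "state": "degraded", "reason": "redis disconnected"},
    ]
    print(liveness_body(subsystems))
```

Because the status code stays at 200, a liveness probe built on this endpoint only fails when the listener itself stops answering.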
GET /health/ready - readiness (loud-fail)
Returns HTTP 503 whenever any subsystem is not Up — including Disabled.
This is intentional: a server with REDIS_URL unset is technically alive but
silently rejecting every webhook with 503, which would otherwise look identical
to a healthy server in monitoring. Loud-fail readiness ensures operators see
the misconfiguration immediately:
- 200 + {"status": "ready"}: every subsystem is Up.
- 503 + {"status": "not_ready", "subsystems": [...]}: at least one subsystem is not Up. The body lists every subsystem with its current state and reason so the operator can see why at a glance.
Use /health/ready for ingress / load-balancer readiness gates and alerting.
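The loud-fail rule can be expressed as a small decision function. The sketch below is an assumption-laden illustration in Python, not the server's code: `readiness_response` and `explain_not_ready` are hypothetical names, and the lowercase state strings follow the /health JSON example.

```python
def readiness_response(subsystems):
    """Return (status_code, body) following the loud-fail readiness rule."""
    if all(s["state"] == "up" for s in subsystems):
        return 200, {"status": "ready"}
    # Any non-up state, including "disabled", makes the server not ready.
    return 503, {"status": "not_ready", "subsystems": subsystems}


def explain_not_ready(body):
    """Summarise a not_ready body: one line per subsystem that is not up."""
    return [
        f"{s['name']}: {s['state']} ({s.get('reason', 'no reason given')})"
        for s in body.get("subsystems", [])
        if s["state"] != "up"
    ]


if __name__ == "__main__":
    code, body = readiness_response(
        [{"name": "redis", "state": "disabled", "reason": "REDIS_URL not configured"}]
    )
    print(code, explain_not_ready(body))
```

An alerting script could call something like `explain_not_ready` on the 503 body to put the failing subsystem and its reason directly into the alert text.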
Behaviour by deployment configuration
| Configuration | webhook | redis | queue_consumer | /health | /health/ready |
|---|---|---|---|---|---|
| REDIS_URL set, Redis reachable | Up | Up | Up | 200 ok | 200 ready |
| REDIS_URL set, Redis unreachable | Up | Degraded | Degraded | 200 degraded | 503 not_ready |
| REDIS_URL unset | Up | Disabled | Disabled | 200 degraded | 503 not_ready |
| Listener stopped (graceful shutdown) | Down | (any) | (any) | 200 / shutdown | 503 |
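The first three rows of the table reduce to two inputs: whether REDIS_URL is set and whether Redis is reachable. The Python sketch below encodes just those rows for illustration; the function names are hypothetical and the listener-stopped row is deliberately left out.

```python
from typing import Optional, Tuple


def redis_dependent_state(redis_url: Optional[str], redis_reachable: bool) -> str:
    """State of a Redis-dependent subsystem for a given configuration."""
    if not redis_url:
        return "disabled"  # REDIS_URL unset: never starts in this process
    return "up" if redis_reachable else "degraded"


def expected_probes(redis_url: Optional[str], redis_reachable: bool) -> Tuple[str, str]:
    """Expected (/health, /health/ready) results while the listener is running."""
    if redis_dependent_state(redis_url, redis_reachable) == "up":
        return "200 ok", "200 ready"
    return "200 degraded", "503 not_ready"


if __name__ == "__main__":
    # The URL below is a placeholder value for illustration only.
    print(expected_probes("redis://localhost:6379", False))
```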
Recovery semantics
- IConnectionMultiplexer is built with AbortOnConnectFail=false, so it reconnects automatically when Redis becomes reachable.
- queue_consumer, housekeeping, and poller each poll the multiplexer every queue.redis_retry_interval_seconds (default 30s, configurable in agentsmith.yml) while in Degraded. When the multiplexer reports IsConnected=true they transition to Up and start their work.
- A single INFO log line per state transition keeps the log readable during outages.
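The retry behaviour described above amounts to a small polling loop with one log line per transition. The following Python sketch models it under stated assumptions: the `Subsystem` class, its `is_connected` callable, and the `max_attempts` escape hatch are all hypothetical stand-ins, and the real server uses the StackExchange.Redis multiplexer rather than anything shown here.

```python
import logging
import time

log = logging.getLogger("subsystem")


class Subsystem:
    """Toy model of a Redis-dependent subsystem that waits in Degraded."""

    def __init__(self, name, is_connected, retry_interval_seconds=30, sleep=time.sleep):
        self.name = name
        self.state = "degraded"
        self._is_connected = is_connected      # stand-in for multiplexer.IsConnected
        self._interval = retry_interval_seconds
        self._sleep = sleep                    # injectable so tests need not wait

    def _transition(self, new_state, reason):
        # Log only when the state actually changes: one INFO line per transition.
        if new_state != self.state:
            log.info("%s: %s -> %s (%s)", self.name, self.state, new_state, reason)
            self.state = new_state

    def wait_until_connected(self, max_attempts=None):
        """Poll until connected; return False if max_attempts checks all fail."""
        attempts = 0
        while not self._is_connected():
            attempts += 1
            if max_attempts is not None and attempts >= max_attempts:
                return False
            self._sleep(self._interval)
        self._transition("up", "redis connected")
        return True
```

Passing `sleep` and `is_connected` in as parameters keeps the loop easy to exercise without a real Redis or real delays.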
Webhook behaviour while Redis is down
Structured ticket webhooks (/webhook/jira, /webhook/github, …) need
ITicketClaimService to enqueue work. When Redis is unavailable, every
structured webhook responds with HTTP 503.
Dialogue-answer webhooks (PR comment paths) reply 503 with the same body for
the same reason. Free-form TriggerInput webhooks that run pipelines
in-process (no claim required) continue to work.
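The three cases above can be summarised as a single decision function. This Python sketch is purely illustrative: the kind labels (`structured`, `dialogue_answer`, `free_form`) and the returned outcome strings are hypothetical names chosen for the example, not identifiers from the server.

```python
def webhook_outcome(kind: str, redis_available: bool) -> str:
    """Classify how a webhook is handled given Redis availability."""
    if kind == "free_form":
        # Free-form TriggerInput webhooks run in-process and need no claim.
        return "run_in_process"
    if kind in ("structured", "dialogue_answer"):
        # Both paths depend on Redis and reply 503 when it is unavailable.
        return "accept" if redis_available else "reject_503"
    raise ValueError(f"unknown webhook kind: {kind!r}")


if __name__ == "__main__":
    for kind in ("structured", "dialogue_answer", "free_form"):
        print(kind, webhook_outcome(kind, redis_available=False))
```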
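Platforms that retry webhook deliveries on 503 replay the missed events to the server once Redis is restored. Retry behaviour differs by platform: GitHub, for one, does not redeliver failed deliveries automatically, so events missed there have to be redelivered manually or through its API. Check each platform's delivery log after an outage.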