On-Call Runbook

Ten on-call scenarios with the symptom you see, what to check next, and how to recover. Each entry is self-contained so you can jump directly to the relevant section from an alert.

1. Actor system not draining at shutdown

Symptom: Deploy or pod termination takes the full terminationGracePeriodSeconds before the process exits. Logs show actors still alive after PoisonPill broadcast.

Check:

Is any actor blocking inside a handler? Look for sleep(), synchronous HTTP, or PDO calls in handler code. A blocked fiber cannot process PoisonPill.
Is the shutdown timeout shorter than the slowest actor's drain time? Check $config->shutdownTimeout in your server config.
Are there actors with large mailbox backlogs? Check dead-letter counts at shutdown — messages dropped to dead letters indicate the backlog was not drained.

Action:

Increase shutdownTimeout to at least the 99th-percentile handler duration, plus the time needed to work through any mailbox backlog.
If blocking calls are present, remove them, replace them with non-blocking equivalents, or move them to scheduleOnce (see troubleshooting: blocking call violation).
For immediate recovery on a stuck pod: scale the deployment to add a new replica, then force-delete the stuck pod (kubectl delete pod <pod-name> --grace-period=0 --force).

2. Mailbox overflow rate spiking

Symptom: MailboxOverflowException count is rising in logs, or the dead-letter count spikes without a corresponding rise in the error rate.

Check:

Which actor is overflowing? Search logs for MailboxOverflowException to find the actor path.
What is the overflow strategy? DropNewest and DropOldest discard silently; ThrowException logs on every drop; Backpressure blocks the sender.
Is the actor processing slower than it was (new deployment, DB slowdown, lock contention)?

Action:

Short-term: increase mailbox capacity via MailboxConfig::bounded($newCapacity, ...).
For stateless actors: spin up more instances behind a router actor to fan out the work.
For stateful actors: investigate the handler bottleneck — profile with Xdebug or add hrtime() instrumentation around the slow path.
Switch to the Backpressure strategy while investigating, so senders are slowed down instead of messages being dropped silently.
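
The hrtime() instrumentation mentioned above could look roughly like the following. This is a minimal sketch, not a Nexus API: the helper name timed, its parameters, and the reporting callback are all illustrative assumptions. Only hrtime() itself is a PHP built-in.

```php
<?php
declare(strict_types=1);

/**
 * Runs $work, measures its wall-clock time with hrtime(), and calls $report
 * when the elapsed time reaches $thresholdMs.
 * `timed` and its parameters are hypothetical names, not part of Nexus.
 *
 * @param callable(): mixed           $work   the slow path to measure
 * @param callable(string, float): void $report receives the label and elapsed milliseconds
 */
function timed(string $label, float $thresholdMs, callable $work, callable $report): mixed
{
    $start = hrtime(true); // monotonic clock, nanoseconds
    try {
        return $work();
    } finally {
        // Runs on both normal return and exception, so failures are timed too.
        $elapsedMs = (hrtime(true) - $start) / 1_000_000;
        if ($elapsedMs >= $thresholdMs) {
            $report($label, $elapsedMs);
        }
    }
}

// Example: report any run of the wrapped code that takes 50 ms or longer.
$total = timed(
    'sum-range',
    50.0,
    fn (): int => array_sum(range(1, 1000)),
    function (string $label, float $ms): void {
        error_log(sprintf('%s took %.1f ms', $label, $ms));
    }
);
```

Wrapping only the suspected slow path (for example a single query or serialization step) keeps the log volume low and points at the bottleneck more directly than timing the whole handler.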

3. Dead letter spike

Symptom: Dead-letter log entries appear at a rate significantly above baseline.

Check:

Are actors stopping unexpectedly? A dead-letter spike often follows an actor crash — messages sent to a stopped actor land in dead letters.
Are tell() calls made to stale references? Check if any code stores ActorRef objects in caches or long-lived structures without checking isAlive().
Did a deployment restart actors mid-flight? Messages in transit at restart land in dead letters.

Action:

Identify the target actor path from the dead-letter log entries.
Check if that actor is alive: if not, find why it stopped (supervision tree logs, ChildFailed signal).
If the actor is crashing on a specific message, fix the handler or add supervision to restart it.
If the spike is deployment-related and transient, no action required — monitor that it returns to baseline.
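
For the stale-reference case, one option is to check liveness at the point of use and evict dead entries. The sketch below assumes only the two methods the text mentions, isAlive() and tell(). The RefLike interface and the RefCache class are hypothetical stand-ins written for this example, not Nexus types, and the real signatures may differ.

```php
<?php
declare(strict_types=1);

/** Hypothetical stand-in for the two ActorRef methods referenced above. */
interface RefLike
{
    public function isAlive(): bool;
    public function tell(object $message): void;
}

/** Cache of refs that drops an entry once the actor behind it has stopped. */
final class RefCache
{
    /** @var array<string, RefLike> */
    private array $refs = [];

    public function put(string $key, RefLike $ref): void
    {
        $this->refs[$key] = $ref;
    }

    public function has(string $key): bool
    {
        return isset($this->refs[$key]);
    }

    /** Sends only to a live ref; evicts a stale one and returns false. */
    public function tellIfAlive(string $key, object $message): bool
    {
        $ref = $this->refs[$key] ?? null;
        if ($ref === null) {
            return false;
        }
        if (!$ref->isAlive()) {
            unset($this->refs[$key]); // stale: caller should re-resolve or respawn
            return false;
        }
        $ref->tell($message);
        return true;
    }
}
```

A check like this narrows the window but does not close it: the actor can still stop between isAlive() and tell(), so a low baseline of dead letters remains normal.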

4. Ask timeout cascade

Symptom: Multiple AskTimeoutException errors in a short window, affecting several actors or request handlers that use the ask pattern.

Check:

Is the actor being asked still alive? A stopped actor never replies; every pending ask times out.
Is the actor's mailbox backed up? If the actor is overloaded, replies arrive after the caller's timeout window closes.
Is there a downstream bottleneck (DB slow query, external HTTP call) that extended handler latency?

Action:

Check the target actor's mailbox depth and handler latency.
If the actor is down, restart it or check its supervision strategy — MaxRetriesExceededException stops an actor permanently.
Increase ask timeouts as a temporary measure while investigating the root cause.
Consider replacing synchronous ask with fire-and-forget tell plus a callback actor for non-latency-sensitive flows.

5. Worker thread crash

Symptom: A worker pool thread exits. In thread-mode Swoole, this is logged as Worker X crashed or the worker ID stops appearing in request logs.

Check:

Did $system->shutdown() throw an unhandled exception inside the worker's watchdog? Check worker-specific log lines around the crash time.
Did a Fatal error occur (OOM, stack overflow) that bypassed the actor supervision tree?
Is the WorkerStartHandler::onWorkerStart() implementation throwing during setup?

Action:

Swoole restarts crashed workers automatically in worker mode. In thread mode, the thread does not auto-restart — restart the server process.
Reproduce locally with the same worker count and message volume: docker compose exec php-swoole php bin/server.php.
If OOM: increase memory_limit in php.ini, or reduce per-request allocations.
Add a try/catch in onWorkerStart() with explicit logging so startup failures are visible.
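
The try/catch with explicit logging can be sketched as a small wrapper. The exact signature of WorkerStartHandler::onWorkerStart() is not shown in this runbook, so the helper below is deliberately generic: guardedWorkerStart, $setup, and $log are illustrative names, and you would call it from inside your own onWorkerStart() implementation.

```php
<?php
declare(strict_types=1);

/**
 * Runs worker setup and logs any failure with the worker ID before rethrowing.
 * All names here are hypothetical; adapt them to your onWorkerStart() signature.
 *
 * @param callable(int): void    $setup the real startup work
 * @param callable(string): void $log   your logger (for example error_log)
 */
function guardedWorkerStart(int $workerId, callable $setup, callable $log): void
{
    try {
        $setup($workerId);
    } catch (\Throwable $e) {
        $log(sprintf(
            'Worker %d failed during startup: %s: %s at %s:%d',
            $workerId,
            get_class($e),
            $e->getMessage(),
            $e->getFile(),
            $e->getLine()
        ));
        throw $e; // keep the failure fatal; the goal is only to make it visible
    }
}
```

Catching \Throwable covers both exceptions and Error subclasses such as TypeError. It does not catch engine-level fatal errors like memory exhaustion, so the OOM check above still applies.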

6. Doctrine connection pool exhaustion

Symptom: PoolExhaustedException appearing in logs, request error rate rising, database connections maxed out.

Check:

What is ConnectionPoolConfig::size()? Is it set lower than the database's max_connections?
Are connections being held too long? A handler that does multiple round-trips without releasing the connection back to the pool (or that has a long transaction) keeps it checked out for the full request duration.
Is there a deadlock or slow query holding a connection open? Check SHOW PROCESSLIST on the database.

Action:

Short-term: restart the server to flush any leaked connections.
Verify that the total pool size, summed across every worker and pod that opens its own pool, does not exceed your database's max_connections minus headroom for admin/migration connections.
Check for handlers that call $em->beginTransaction() without a matching commit() or rollback() — these hold connections indefinitely.
Add connection acquisition timeout logging: ConnectionPoolConfig::withAcquireTimeout(Duration::seconds(5)) so slow acquisitions appear in logs before they cascade.
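
To rule out unmatched beginTransaction() calls, handlers can route transactions through one helper that always ends in commit() or rollback(). The sketch below is an assumption-laden illustration: TransactionLike is a hypothetical stand-in declaring only the three methods named above, and inTransaction is an illustrative helper name. Doctrine ORM's EntityManager also provides a comparable built-in, wrapInTransaction(), which may be the simpler choice.

```php
<?php
declare(strict_types=1);

/** Hypothetical stand-in for the three transaction methods named above. */
interface TransactionLike
{
    public function beginTransaction(): void;
    public function commit(): void;
    public function rollback(): void;
}

/**
 * Runs $work inside a transaction and guarantees the transaction is closed:
 * commit on success, rollback on any Throwable (which is then rethrown).
 *
 * @param callable(TransactionLike): mixed $work
 */
function inTransaction(TransactionLike $em, callable $work): mixed
{
    $em->beginTransaction();
    try {
        $result = $work($em);
        $em->commit();
        return $result;
    } catch (\Throwable $e) {
        $em->rollback(); // never leave the connection holding an open transaction
        throw $e;
    }
}
```

With a single entry point like this, a search for direct beginTransaction() calls in handler code becomes a quick way to find the remaining leak candidates.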

7. ReceiveTimeout firing unexpectedly

Symptom: Actors are passivating or stopping earlier than expected. Log shows ReceiveTimeout signals firing.

Check:

Which actors call $ctx->setReceiveTimeout()? This is an opt-in feature — only actors that explicitly set a timeout receive the ReceiveTimeout signal.
Is the configured duration shorter than the expected message interval? A burst-quiet message pattern can leave the actor idle for longer than the timeout between bursts.
Was the timeout set during setup and never cleared? $ctx->setReceiveTimeout(null) disables it.

Action:

Increase the ReceiveTimeout duration to exceed the maximum expected idle window.
If passivation is the goal but it is triggering too aggressively, add a check before stopping: only stop if business-level state is empty.
If ReceiveTimeout is firing on an actor that should not have it set, search for setReceiveTimeout calls in the actor hierarchy — it can be set in a parent's setup closure.

8. Swoole reactor exit timeout

Symptom: On shutdown, the process hangs at the Swoole reactor shutdown phase. Logs show Server BeforeShutdown fired but the process does not exit within the expected window.

Check:

Is the BeforeShutdown watchdog completing? Look for Worker ActorSystem shutdown complete log lines from each worker.
Are there coroutines still running outside of the actor system (e.g. a custom Coroutine::create() that was never cancelled)?
Is the shutdownTimeout configured in SwooleThreadConfig large enough for actors to drain?

Action:

Search application code for Coroutine::create() calls outside of the Nexus actor system — these must be self-terminating or cancelled explicitly before server shutdown.
Increase SwooleThreadConfig::withShutdownTimeout().
If the hang is reproducible, attach strace or check Swoole\Coroutine::stats() (the coroutine_num field) for coroutines still running after BeforeShutdown fires.

9. Persistence writer conflict (split-brain)

Symptom: WriterConflictException: Writer conflict detected for persistence ID 'Order-abc' in logs. Events from two different writer IDs appear interleaved in the event store.

Cause: Two actor system instances are writing to the same PersistenceId. This is a violation of the single-writer principle: each persistence ID must be owned by exactly one ActorSystem at a time.

Check:

Are two pods writing to the same persistence IDs in a shared event store? Blue-green deployments with shared event stores hit this during the overlap window.
Did a pod restart without the old instance fully shutting down? The new instance generates a new writer ULID and detects the previous writer's events.

Action:

Ensure only one pod is active per persistence ID. Use rolling deployments, not blue-green, for event-sourced services.
If the conflict is from a restart, check the ReplayFilter mode. RepairByDiscardOld keeps only the latest writer's events and is the safest recovery option.
Inspect the event store for interleaved sequences: events with two different writer_id values on the same persistence_id.
After recovery, rotate the persistence ID or archive the conflicted sequence before resuming normal operation.
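
The inspection for interleaved sequences can be scripted once the events are exported. The sketch below is a rough illustration under stated assumptions: persistence_id and writer_id are the column names used above, while sequence_nr and the function name findInterleavedWriters are hypothetical and should be adapted to your event store schema. It flags a persistence ID only when a writer appears again after another writer has taken over, so a clean handover after a restart is not reported.

```php
<?php
declare(strict_types=1);

/**
 * Returns the persistence IDs whose events show interleaved writers,
 * i.e. writer A, then writer B, then writer A again in sequence order.
 * The row shape (including `sequence_nr`) is an assumption for this sketch.
 *
 * @param list<array{persistence_id: string, sequence_nr: int, writer_id: string}> $rows
 * @return list<string>
 */
function findInterleavedWriters(array $rows): array
{
    // Order events per persistence ID by sequence number.
    usort($rows, fn (array $a, array $b): int =>
        [$a['persistence_id'], $a['sequence_nr']] <=> [$b['persistence_id'], $b['sequence_nr']]);

    $last = [];       // persistence ID => writer of the previous event
    $superseded = []; // persistence ID => writers that another writer has replaced
    $conflicts = [];

    foreach ($rows as $row) {
        $pid = $row['persistence_id'];
        $writer = $row['writer_id'];

        if (isset($last[$pid]) && $last[$pid] !== $writer) {
            $superseded[$pid][$last[$pid]] = true; // previous writer handed over
        }
        if (isset($superseded[$pid][$writer])) {
            $conflicts[$pid] = true; // a superseded writer wrote again
        }
        $last[$pid] = $writer;
    }

    return array_map('strval', array_keys($conflicts));
}
```

Running this against a dump of the affected time window gives a short list of persistence IDs to archive or rotate, instead of scanning the event store by eye.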

10. Graceful-shutdown deadline missed

Symptom: The process exits with exit code 137 (SIGKILL from Kubernetes) rather than a clean 0. Pod logs show actors still processing messages at the moment of termination.

Check:

Is terminationGracePeriodSeconds in the Kubernetes deployment longer than the preStop delay plus shutdownTimeout? If SIGKILL fires before Nexus finishes draining, actors are killed mid-message.
Did the preStop hook add enough delay for traffic draining before SIGTERM? If traffic is still arriving at SIGTERM time, the mailbox backlog is larger at shutdown start.
Is shutdownTimeout long enough for the slowest actor to drain?

Action:

Set terminationGracePeriodSeconds = preStop delay + shutdownTimeout + 5 s headroom. Example: a preStop sleep of 5 s, a shutdownTimeout of 20 s, and 5 s of headroom give terminationGracePeriodSeconds: 30.
Add a preStop lifecycle hook to drain load balancer traffic before SIGTERM fires.
Monitor exit codes: exit 0 means clean shutdown; exit 137 means SIGKILL (deadline missed).
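
The sizing rule above is simple arithmetic, and it can be encoded as a check in a deploy script or test. The function names below are illustrative assumptions. The 5 s default headroom comes from the rule stated above.

```php
<?php
declare(strict_types=1);

/** Minimum terminationGracePeriodSeconds for the given settings, in seconds. */
function requiredGracePeriod(int $preStopSeconds, int $shutdownTimeoutSeconds, int $headroomSeconds = 5): int
{
    return $preStopSeconds + $shutdownTimeoutSeconds + $headroomSeconds;
}

/** True when the configured grace period covers preStop, draining, and headroom. */
function gracePeriodIsSufficient(
    int $terminationGracePeriodSeconds,
    int $preStopSeconds,
    int $shutdownTimeoutSeconds,
    int $headroomSeconds = 5
): bool {
    return $terminationGracePeriodSeconds
        >= requiredGracePeriod($preStopSeconds, $shutdownTimeoutSeconds, $headroomSeconds);
}

// The example from the rule above: 5 s preStop + 20 s shutdownTimeout + 5 s headroom.
$required = requiredGracePeriod(5, 20); // 30
```

Note that raising the preStop delay to 10 s with the same shutdownTimeout pushes the requirement to 35 s, which no longer fits a 30 s grace period.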
