# Dev Server Runbook (Backend + Infra)
Version: 1.0.0 | Last Updated: 2026-03-22 | Status: Active
## Purpose
This runbook is the current operational reference for developers, system administrators, and DevOps engineers working on the MBPanel development backend stack.
Use this for:

- live server configuration verification
- service restarts and health validation
- incident triage and recovery
- safe rollback of infra-level changes
## 1) Current Runtime Topology (verified 2026-03-22)
### 1.1 Hosts

| Component | Host/IP | Access User | Notes |
|---|---|---|---|
| Backend app node | dev-backend.mightybox.site / 15.204.15.210 | apache | Apache + mod_wsgi + FastAPI |
| Frontend node | dev-frontend.mightybox.site / 15.204.15.169 | nodejs | Next.js + PM2 + Nginx |
### 1.2 External dependencies

| Service | Host | Port | Usage |
|---|---|---|---|
| PostgreSQL | 172.16.4.139 | 5432 | app primary database |
| Redis | 172.16.3.213 | 6379 | cache + distributed state |
| RabbitMQ | 172.16.3.83 | 5672 | Celery + SSE broker transport |
### 1.3 Backend capacity settings

Apache MPM (`/etc/httpd/conf/httpd.conf`):

- `MaxRequestWorkers 150` (worker/event blocks)
- `ThreadsPerChild 25`

mod_wsgi daemon (`/etc/httpd/conf.d/wsgi.conf`):

- `WSGIDaemonProcess ... processes=3 threads=15 ...`

Celery worker config (`/etc/conf.d/celery`):

- `CELERYD_NODES="w1 w2"`
- `CELERYD_OPTS="--time-limit=300 --concurrency=1"`
Operational result:

- 2 Celery nodes online (w1, w2)
- queue `jobs.env.create` has 2 consumers in the normal healthy state
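The settings above imply simple concurrency ceilings that are worth sanity-checking before any tuning; a quick arithmetic sketch (values copied from the configs above):

```shell
# Capacity arithmetic implied by the settings above (a sketch, not authoritative)
MAX_REQUEST_WORKERS=150   # Apache MPM ceiling
THREADS_PER_CHILD=25
WSGI_PROCESSES=3          # mod_wsgi daemon group
WSGI_THREADS=15

# Apache can run at most this many child processes at peak:
echo "apache peak child processes: $((MAX_REQUEST_WORKERS / THREADS_PER_CHILD))"
# The app itself can serve at most this many requests concurrently:
echo "mod_wsgi concurrent app requests: $((WSGI_PROCESSES * WSGI_THREADS))"
```

Note the gap: Apache can hold up to 150 in-flight connections while the daemon group serves at most 45 concurrently, so slow app responses keep Apache threads occupied, which is how the worker pressure described in section 4.1 builds under burst load.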
## 2) How It Works (Runtime Flow)
### 2.1 Request path

- Client sends a request to https://dev-backend.mightybox.site
- Apache (httpd) terminates TLS and forwards to the mod_wsgi daemon group
- mod_wsgi executes wsgi.py and serves the FastAPI app
- App uses:
  - PostgreSQL for persistence
  - Redis for cache/state
  - RabbitMQ for Celery jobs and SSE fan-out messaging
### 2.2 Async jobs and events

- API enqueues background work to Celery
- Celery workers (w1, w2) consume jobs from RabbitMQ
- SSE broker consumes RabbitMQ status events and pushes them to connected clients
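The consumer count from section 1.3 can also be confirmed on the broker side; a sketch, assuming shell access to the RabbitMQ host and `rabbitmqctl` privileges (both assumptions):

```shell
# On the RabbitMQ host (172.16.3.83), as a user with rabbitmqctl access
rabbitmqctl list_queues name messages consumers | grep jobs.env.create
# Healthy state per section 1.3: 2 consumers (workers w1 and w2)
```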
### 2.3 Health endpoints and intended use

| Endpoint | Purpose | Expected behavior |
|---|---|---|
| /api/v1/health/ping | process/network reachability | very fast, always 200 if app reachable |
| /api/v1/health/live | liveness probe | lightweight process-alive check |
| /api/v1/health/ready | readiness probe | verifies critical dependencies |
| /api/v1/health/celery | Celery summary health | fast summary (mode: summary) |
| /api/v1/health/celery/deep | operator diagnostics | slower deep worker inspection |
| /api/v1/health | full composite check | comprehensive, can be slower |
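For automation, the lightweight probe endpoints above can be swept in one loop; a minimal sketch (the base URL is the dev backend from this runbook):

```shell
# Sweep the lightweight probe endpoints and print the HTTP status of each
BASE="${BASE:-https://dev-backend.mightybox.site}"

probe_all() {
  local ep code
  for ep in ping live ready celery; do
    code=$(curl -sk -o /dev/null -w '%{http_code}' "$BASE/api/v1/health/$ep")
    printf '%s: %s\n' "$ep" "$code"
  done
}
```

Run `probe_all` after any restart; anything other than four 200s points at the playbooks in section 4.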
## 3) Standard Operating Commands
### 3.1 Service status

```shell
ssh apache@15.204.15.210
systemctl is-active httpd celery
systemctl status httpd --no-pager -l
systemctl status celery --no-pager -l
```
### 3.2 Quick health verification

```shell
curl -sk https://dev-backend.mightybox.site/api/v1/health/ping
curl -sk https://dev-backend.mightybox.site/api/v1/health/live
curl -sk https://dev-backend.mightybox.site/api/v1/health/ready
curl -sk https://dev-backend.mightybox.site/api/v1/health/celery
```
### 3.3 Process checks
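Typical process checks on the backend node; a sketch (the match patterns are assumptions about this deployment):

```shell
# Each check prints matching processes, or a fallback note if none are found
pgrep -af "httpd"            || echo "no httpd processes found"
pgrep -af "wsgi"             || echo "no mod_wsgi daemons found"
pgrep -af "celery.*worker"   || echo "no celery workers found"
```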
### 3.4 Config inspection

```shell
grep -n "WSGIDaemonProcess" /etc/httpd/conf.d/wsgi.conf
grep -nE "MaxRequestWorkers|ThreadsPerChild|ServerLimit" /etc/httpd/conf/httpd.conf
grep -nE "CELERYD_NODES|CELERYD_OPTS" /etc/conf.d/celery
```
## 4) Troubleshooting Playbooks
### 4.1 Apache worker pressure (AH10159 / MaxRequestWorkers)

Symptom: the error log includes `AH10159: server is within MinSpareThreads of MaxRequestWorkers`.
Checks:
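A reasonable set of checks; a sketch (the error log path assumes the standard RHEL httpd layout, adjust if logs live elsewhere):

```shell
# Count recent AH10159 occurrences and review the relevant limits
sudo grep -c "AH10159" /var/log/httpd/error_log || echo "no AH10159 entries"
grep -nE "MaxRequestWorkers|ThreadsPerChild|ServerLimit" /etc/httpd/conf/httpd.conf
free -m   # confirm memory headroom before raising any limits
```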
Actions:
1. Confirm MaxRequestWorkers and mod_wsgi process/thread settings.
2. Confirm available memory before increasing capacity.
3. Apply small, reversible increments only.
4. Validate with burst probe + logs.
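Step 4's burst probe can be as simple as repeated pings with a status histogram; a sequential sketch (use a dedicated load tool for real concurrency testing):

```shell
# Hit the ping endpoint N times and tally the HTTP status codes seen
burst_probe() {
  local n=${1:-20}
  local url=${2:-https://dev-backend.mightybox.site/api/v1/health/ping}
  local i
  for i in $(seq "$n"); do
    curl -sk -o /dev/null -w '%{http_code}\n' "$url"
  done | sort | uniq -c
}
```

A healthy node shows only 200s; 503s or timeouts during the burst point back at the worker and daemon limits in section 1.3.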
### 4.2 Celery workers unhealthy or missing

Symptom: `/health/celery` shows `online_workers: 0` or the queue is not bound.
Checks:
```shell
systemctl status celery --no-pager -l
pgrep -af "celery.*worker"
curl -sk https://dev-backend.mightybox.site/api/v1/health/celery
```
Actions:
1. Restart Celery: `sudo systemctl restart celery`
2. Re-check the health endpoint and consumer count.
3. Inspect logs: `journalctl -u celery --since "30 min ago" --no-pager`
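For scripted re-checks, the worker count can be pulled straight out of the summary JSON; a sketch that avoids a jq dependency (the exact JSON shape, beyond the `online_workers` field named in the symptom above, is an assumption):

```shell
# Print the online_workers value from the Celery summary health endpoint
celery_workers_online() {
  local url=${1:-https://dev-backend.mightybox.site/api/v1/health/celery}
  curl -sk "$url" | tr -d ' \n' | grep -o '"online_workers":[0-9]*' | cut -d: -f2
}
```

The expected value in the healthy state is 2 (nodes w1 and w2).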
### 4.3 SSE issues (connected but no events)
Symptom: SSE endpoint connects but client receives no status updates.
Checks:
1. Verify broker dependency health in /health response (rabbitmq, celery_workers).
2. Confirm RabbitMQ reachable from app node.
3. Validate authenticated SSE path (cookie-based auth is required).
Actions:

- Verify the event publish path and routing keys.
- Check SSE broker metrics and recent app logs.
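A direct SSE smoke test can separate broker-side problems from client-side ones; a sketch (the SSE endpoint URL and the cookie file are assumptions, supply your deployment's values since cookie-based auth is required):

```shell
# Hold an authenticated SSE connection open for 15s and dump any events received
SSE_URL="${SSE_URL:?set SSE_URL to the SSE endpoint}"
curl -skN --max-time 15 \
  -H "Accept: text/event-stream" \
  -b cookies.txt \
  "$SSE_URL"
```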
### 4.4 Full health endpoint slow
Context: `/health` is comprehensive and can be slower than the probe endpoints.

Guidance:

- Use `/health/ping`, `/health/live`, `/health/ready`, and `/health/celery` for probes and automation.
- Reserve `/health/celery/deep` and `/health` for diagnostics.
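To quantify the difference, curl's timing output works well; a small sketch for comparing a probe endpoint against the composite check:

```shell
# Print the total request time (seconds) for a given URL
probe_time() {
  curl -sk -o /dev/null -w '%{time_total}' "$1"
}
# Example:
#   probe_time https://dev-backend.mightybox.site/api/v1/health/ping
#   probe_time https://dev-backend.mightybox.site/api/v1/health
```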
## 5) Safe Change Procedure (Infra)
- Backup first:
  - /etc/httpd/conf/httpd.conf
  - /etc/httpd/conf.d/wsgi.conf
  - /etc/conf.d/celery
- Validate syntax before restart: `sudo /usr/sbin/httpd -t`
- Restart only the needed services (httpd, celery)
- Post-change verification:
  - service status
  - key health endpoints
  - recent error logs
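The backup step can be scripted so copies always match the `.bak.<timestamp>` naming that section 6 restores from; a minimal helper sketch:

```shell
# Copy a config file to <name>.bak.<timestamp> and print the backup path
backup_file() {
  local ts
  ts=$(date +%Y%m%d%H%M%S)
  cp -p "$1" "$1.bak.$ts" && echo "$1.bak.$ts"
}
# On the backend node (run with sudo for files under /etc):
#   backup_file /etc/httpd/conf/httpd.conf
```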
## 6) Rollback
Use the latest backup copies and restart services.
```shell
# Example pattern (replace <timestamp> with the latest backup files)
sudo cp /etc/httpd/conf/httpd.conf.bak.<timestamp> /etc/httpd/conf/httpd.conf
sudo cp /etc/httpd/conf.d/wsgi.conf.bak.<timestamp> /etc/httpd/conf.d/wsgi.conf
sudo cp /etc/conf.d/celery.bak.<timestamp> /etc/conf.d/celery
sudo /usr/sbin/httpd -t
sudo systemctl restart httpd
sudo systemctl restart celery
```
## 7) Developer Notes

- The canonical favorites endpoints are under the environments domain routes (`/api/v1/environments/favorites` and `/favorites/toggle`).
- The legacy `/api/v1/favourites` route path is retired and should not be reintroduced.
- For backend reliability checks, always separate lightweight probe checks from deep diagnostics.