# Dev Server Runbook (Backend + Infra)
Version: 1.0.0 | Last Updated: 2026-03-22 | Status: Active
## Purpose
This runbook is the current operational reference for developers, system administrators, and DevOps engineers working on the MBPanel development backend stack.
Use this for:

- live server configuration verification
- service restarts and health validation
- incident triage and recovery
- safe rollback of infra-level changes
## 1) Current Runtime Topology (verified 2026-03-22)
### 1.1 Hosts

| Component | Host/IP | Access User | Notes |
|---|---|---|---|
| Backend app node | dev-backend.mightybox.site / 15.204.15.210 | apache | Apache + mod_wsgi + FastAPI |
| Frontend node | dev-frontend.mightybox.site / 15.204.15.169 | nodejs | Next.js + PM2 + Nginx |
### 1.2 External dependencies

| Service | Host | Port | Usage |
|---|---|---|---|
| PostgreSQL | 172.16.4.139 | 5432 | app primary database |
| Redis | 172.16.3.213 | 6379 | cache + distributed state |
| RabbitMQ | 172.16.3.83 | 5672 | Celery + SSE broker transport |
### 1.3 Backend capacity settings

Apache MPM (`/etc/httpd/conf/httpd.conf`):

- `MaxRequestWorkers 150` (worker/event blocks)
- `ThreadsPerChild 25`

mod_wsgi daemon (`/etc/httpd/conf.d/wsgi.conf`):

- `WSGIDaemonProcess ... processes=3 threads=15 ...`

Celery worker config (`/etc/conf.d/celery`):

- `CELERYD_NODES="w1 w2"`
- `CELERYD_OPTS="--time-limit=300 --concurrency=1"`
Operational result:

- 2 Celery nodes online (w1, w2)
- queue `jobs.env.create` has 2 consumers in the normal healthy state
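The settings above imply simple concurrency ceilings that are worth sanity-checking before any tuning; a quick arithmetic sketch (values copied from the configs above):

```shell
# Capacity arithmetic implied by the settings above (a sketch, not authoritative)
MAX_REQUEST_WORKERS=150   # Apache MPM ceiling
THREADS_PER_CHILD=25
WSGI_PROCESSES=3          # mod_wsgi daemon group
WSGI_THREADS=15

# Apache can run at most this many child processes at peak:
echo "apache peak child processes: $((MAX_REQUEST_WORKERS / THREADS_PER_CHILD))"
# The app itself can serve at most this many requests concurrently:
echo "mod_wsgi concurrent app requests: $((WSGI_PROCESSES * WSGI_THREADS))"
```

Note the gap: Apache can hold up to 150 in-flight connections while the daemon group serves at most 45 concurrently, so slow app responses keep Apache threads occupied, which is how the worker pressure described in section 4.1 builds under burst load.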
## 2) How It Works (Runtime Flow)
### 2.1 Request path

- Client sends a request to https://dev-backend.mightybox.site
- Apache (httpd) terminates TLS and forwards to the mod_wsgi daemon group
- mod_wsgi executes wsgi.py and serves the FastAPI app
- App uses:
  - PostgreSQL for persistence
  - Redis for cache/state
  - RabbitMQ for Celery jobs and SSE fan-out messaging
### 2.2 Async jobs and events

- API enqueues background work to Celery
- Celery workers (w1, w2) consume jobs from RabbitMQ
- SSE broker consumes RabbitMQ status events and pushes them to connected clients
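The consumer count from section 1.3 can also be confirmed on the broker side; a sketch, assuming shell access to the RabbitMQ host and `rabbitmqctl` privileges (both assumptions):

```shell
# On the RabbitMQ host (172.16.3.83), as a user with rabbitmqctl access
rabbitmqctl list_queues name messages consumers | grep jobs.env.create
# Healthy state per section 1.3: 2 consumers (workers w1 and w2)
```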
### 2.3 Health endpoints and intended use

| Endpoint | Purpose | Expected behavior |
|---|---|---|
| /api/v1/health/ping | process/network reachability | very fast, always 200 if app reachable |
| /api/v1/health/live | liveness probe | lightweight process-alive check |
| /api/v1/health/ready | readiness probe | verifies critical dependencies |
| /api/v1/health/celery | Celery summary health | fast summary (mode: summary) |
| /api/v1/health/celery/deep | operator diagnostics | slower deep worker inspection |
| /api/v1/health | full composite check | comprehensive, can be slower |
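For automation, the lightweight probe endpoints above can be swept in one loop; a minimal sketch (the base URL is the dev backend from this runbook):

```shell
# Sweep the lightweight probe endpoints and print the HTTP status of each
BASE="${BASE:-https://dev-backend.mightybox.site}"

probe_all() {
  local ep code
  for ep in ping live ready celery; do
    code=$(curl -sk -o /dev/null -w '%{http_code}' "$BASE/api/v1/health/$ep")
    printf '%s: %s\n' "$ep" "$code"
  done
}
```

Run `probe_all` after any restart; anything other than four 200s points at the playbooks in section 4.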
## 3) Standard Operating Commands
### 3.1 Service status

```shell
ssh apache@15.204.15.210
systemctl is-active httpd celery
systemctl status httpd --no-pager -l
systemctl status celery --no-pager -l
```
### 3.2 Quick health verification

```shell
curl -sk https://dev-backend.mightybox.site/api/v1/health/ping
curl -sk https://dev-backend.mightybox.site/api/v1/health/live
curl -sk https://dev-backend.mightybox.site/api/v1/health/ready
curl -sk https://dev-backend.mightybox.site/api/v1/health/celery
```
### 3.3 Process checks
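Typical process checks on the backend node; a sketch (the match patterns are assumptions about this deployment):

```shell
# Each check prints matching processes, or a fallback note if none are found
pgrep -af "httpd"            || echo "no httpd processes found"
pgrep -af "wsgi"             || echo "no mod_wsgi daemons found"
pgrep -af "celery.*worker"   || echo "no celery workers found"
```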
### 3.4 Config inspection

```shell
grep -n "WSGIDaemonProcess" /etc/httpd/conf.d/wsgi.conf
grep -nE "MaxRequestWorkers|ThreadsPerChild|ServerLimit" /etc/httpd/conf/httpd.conf
grep -nE "CELERYD_NODES|CELERYD_OPTS" /etc/conf.d/celery
```
## 4) Troubleshooting Playbooks
### 4.1 Apache worker pressure (AH10159 / MaxRequestWorkers)

Symptom: the error log includes `AH10159: server is within MinSpareThreads of MaxRequestWorkers`.
Checks:
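A reasonable set of checks; a sketch (the error log path assumes the standard RHEL httpd layout, adjust if logs live elsewhere):

```shell
# Count recent AH10159 occurrences and review the relevant limits
sudo grep -c "AH10159" /var/log/httpd/error_log || echo "no AH10159 entries"
grep -nE "MaxRequestWorkers|ThreadsPerChild|ServerLimit" /etc/httpd/conf/httpd.conf
free -m   # confirm memory headroom before raising any limits
```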
Actions:
1. Confirm MaxRequestWorkers and mod_wsgi process/thread settings.
2. Confirm available memory before increasing capacity.
3. Apply small, reversible increments only.
4. Validate with burst probe + logs.
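Step 4's burst probe can be as simple as repeated pings with a status histogram; a sequential sketch (use a dedicated load tool for real concurrency testing):

```shell
# Hit the ping endpoint N times and tally the HTTP status codes seen
burst_probe() {
  local n=${1:-20}
  local url=${2:-https://dev-backend.mightybox.site/api/v1/health/ping}
  local i
  for i in $(seq "$n"); do
    curl -sk -o /dev/null -w '%{http_code}\n' "$url"
  done | sort | uniq -c
}
```

A healthy node shows only 200s; 503s or timeouts during the burst point back at the worker and daemon limits in section 1.3.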
### 4.2 Celery workers unhealthy or missing

Symptom: `/health/celery` shows `online_workers: 0` or the queue is not bound.
Checks:
```shell
systemctl status celery --no-pager -l
pgrep -af "celery.*worker"
curl -sk https://dev-backend.mightybox.site/api/v1/health/celery
```
Actions:
1. Restart Celery: `sudo systemctl restart celery`
2. Re-check the health endpoint and consumer count.
3. Inspect logs: `journalctl -u celery --since "30 min ago" --no-pager`
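For scripted re-checks, the worker count can be pulled straight out of the summary JSON; a sketch that avoids a jq dependency (the exact JSON shape, beyond the `online_workers` field named in the symptom above, is an assumption):

```shell
# Print the online_workers value from the Celery summary health endpoint
celery_workers_online() {
  local url=${1:-https://dev-backend.mightybox.site/api/v1/health/celery}
  curl -sk "$url" | tr -d ' \n' | grep -o '"online_workers":[0-9]*' | cut -d: -f2
}
```

The expected value in the healthy state is 2 (nodes w1 and w2).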
### 4.3 SSE issues (connected but no events)
Symptom: SSE endpoint connects but client receives no status updates.
Checks:
1. Verify broker dependency health in /health response (rabbitmq, celery_workers).
2. Confirm RabbitMQ reachable from app node.
3. Validate authenticated SSE path (cookie-based auth is required).
Actions:

- Verify the event publish path and routing keys.
- Check SSE broker metrics and recent app logs.
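A direct SSE smoke test can separate broker-side problems from client-side ones; a sketch (the SSE endpoint URL and the cookie file are assumptions, supply your deployment's values since cookie-based auth is required):

```shell
# Hold an authenticated SSE connection open for 15s and dump any events received
SSE_URL="${SSE_URL:?set SSE_URL to the SSE endpoint}"
curl -skN --max-time 15 \
  -H "Accept: text/event-stream" \
  -b cookies.txt \
  "$SSE_URL"
```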
### 4.4 Full health endpoint slow
Context: `/health` is comprehensive and can be slower than the probe endpoints.

Guidance:

- Use `/health/ping`, `/health/live`, `/health/ready`, and `/health/celery` for probes and automation.
- Reserve `/health/celery/deep` and `/health` for diagnostics.
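To quantify the difference, curl's timing output works well; a small sketch for comparing a probe endpoint against the composite check:

```shell
# Print the total request time (seconds) for a given URL
probe_time() {
  curl -sk -o /dev/null -w '%{time_total}' "$1"
}
# Example:
#   probe_time https://dev-backend.mightybox.site/api/v1/health/ping
#   probe_time https://dev-backend.mightybox.site/api/v1/health
```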
## 5) Safe Change Procedure (Infra)
- Backup first:
  - /etc/httpd/conf/httpd.conf
  - /etc/httpd/conf.d/wsgi.conf
  - /etc/conf.d/celery
- Validate syntax before restart: `sudo /usr/sbin/httpd -t`
- Restart only the needed services (httpd, celery)
- Post-change verification:
  - service status
  - key health endpoints
  - recent error logs
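The backup step can be scripted so copies always match the `.bak.<timestamp>` naming that section 6 restores from; a minimal helper sketch:

```shell
# Copy a config file to <name>.bak.<timestamp> and print the backup path
backup_file() {
  local ts
  ts=$(date +%Y%m%d%H%M%S)
  cp -p "$1" "$1.bak.$ts" && echo "$1.bak.$ts"
}
# On the backend node (run with sudo for files under /etc):
#   backup_file /etc/httpd/conf/httpd.conf
```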
## 6) Rollback
Use the latest backup copies and restart services.
```shell
# Example pattern (replace <timestamp> with the latest backup files)
sudo cp /etc/httpd/conf/httpd.conf.bak.<timestamp> /etc/httpd/conf/httpd.conf
sudo cp /etc/httpd/conf.d/wsgi.conf.bak.<timestamp> /etc/httpd/conf.d/wsgi.conf
sudo cp /etc/conf.d/celery.bak.<timestamp> /etc/conf.d/celery
sudo /usr/sbin/httpd -t
sudo systemctl restart httpd
sudo systemctl restart celery
```
## 7) Developer Notes

- The canonical favorites endpoints are under the environments domain routes (`/api/v1/environments/favorites` and `/favorites/toggle`).
- The legacy `/api/v1/favourites` route path is retired and should not be reintroduced.
- For backend reliability checks, always separate lightweight probe checks from deep diagnostics.