
Monitoring Guide

Version: 1.0.0 | Last Updated: 2026-03-22 | Status: Active


Monitoring Objectives

  • Detect outages quickly
  • Detect degraded behavior before customer impact
  • Distinguish liveness and readiness probe failures from slow deep diagnostics

Core Health Endpoints

Endpoint                   | Use Case                     | Notes
/api/v1/health/ping        | network/process reachability | fastest signal
/api/v1/health/live        | liveness checks              | process health only
/api/v1/health/ready       | readiness gate               | validates critical dependencies
/api/v1/health/celery      | Celery operational summary   | expected mode: summary
/api/v1/health/celery/deep | operator diagnostics         | use on demand, can be slower
/api/v1/health             | full stack diagnostics       | comprehensive, not for tight probe intervals

Group probes by how often they should run:

  1. Fast checks (high frequency):
    • /api/v1/health/ping, /api/v1/health/live
  2. Readiness checks (medium frequency):
    • /api/v1/health/ready
  3. Operator diagnostics (manual/low frequency):
    • /api/v1/health/celery/deep, /api/v1/health
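
The three tiers above can be encoded as a small lookup so that a probe scheduler or cron wrapper never polls a diagnostic endpoint at a fast interval. This is a minimal sketch: the function name `probe_tier` and the tier labels are illustrative, not part of the service. The guide does not assign /api/v1/health/celery to a tier, so the sketch leaves it out.

```shell
# Map a health endpoint path to the probe tier described in this guide.
# probe_tier is a hypothetical helper name; the tier labels are illustrative.
probe_tier() {
  case "$1" in
    /api/v1/health/ping|/api/v1/health/live)
      echo "fast" ;;        # high frequency
    /api/v1/health/ready)
      echo "readiness" ;;   # medium frequency
    /api/v1/health/celery/deep|/api/v1/health)
      echo "manual" ;;      # operator diagnostics, manual/low frequency
    *)
      echo "unknown endpoint: $1" >&2
      return 1 ;;
  esac
}
```

A scheduler could call `probe_tier "$path"` and refuse to register any endpoint that resolves to `manual` for automated polling.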

What to Watch

Apache / mod_wsgi

  • AH10159 or AH00484 in the error log (MaxRequestWorkers warnings) signal worker pressure
  • active daemon process/thread settings in wsgi.conf
  • response latency spikes on probe endpoints
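
One way to turn the AH-code signal into a number is to count matching lines in the Apache error log. The sketch below assumes the log path listed under Log Locations; `count_worker_pressure` is a hypothetical helper name.

```shell
# Count Apache error-log lines carrying AH10159 or AH00484.
# Defaults to the error_log path from this guide; pass another path to override.
count_worker_pressure() {
  log_file="${1:-/var/log/httpd/error_log}"
  if [ ! -r "$log_file" ]; then
    echo "cannot read $log_file" >&2
    return 2
  fi
  # grep -c prints 0 and exits 1 when nothing matches; "|| true" keeps that
  # case from looking like a failure to the caller.
  grep -cE 'AH10159|AH00484' "$log_file" || true
}
```

A count that keeps rising between runs suggests reviewing the daemon process/thread settings in wsgi.conf.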

Celery

  • online worker count
  • queue consumers and pending message count
  • repeated worker exits or SIGKILLs in logs
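
Repeated worker kills can be counted from the Celery journal. The sketch below reads log text on stdin so it works with journalctl output or a saved file. The match patterns are an assumption about how the worker logs these events and may need adjusting to the actual log wording; `count_worker_kills` is a hypothetical helper name.

```shell
# Count log lines that mention a SIGKILL or a premature worker exit.
# Reads from stdin. The patterns are assumptions about the log wording.
count_worker_kills() {
  # "|| true" covers the no-match case, where grep -c prints 0 but exits 1.
  grep -ciE 'SIGKILL|exited prematurely' || true
}
```

Example: `journalctl -u celery --since "30 min ago" --no-pager | count_worker_kills`. For queue consumers and pending messages, `rabbitmqctl list_queues name messages consumers` on the broker host is one option, assuming the operator has access to it.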

Dependencies

  • Postgres availability
  • Redis ping/connectivity
  • RabbitMQ connectivity and consumer health
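
The dependency checks above can be run from a shell with each service's own client tool. The wrapper below is a minimal sketch: `check_dep` is a hypothetical helper, and the hosts and ports in the commented examples are assumptions to replace with the deployment's real values.

```shell
# Run one dependency check and print a one-line result.
# Usage: check_dep NAME COMMAND [ARGS...]
check_dep() {
  name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "$name: ok"
  else
    echo "$name: FAIL"
    return 1
  fi
}

# Example wiring (hosts and ports are assumptions, adjust to the deployment):
#   check_dep postgres pg_isready -h localhost -p 5432
#   check_dep redis    redis-cli -h localhost ping
#   check_dep rabbitmq rabbitmq-diagnostics -q ping
```

Each check relies only on the exit status of the client tool, so any command that exits non-zero on failure can be plugged in.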

Log Locations

  • Apache: /var/log/httpd/error_log, /var/log/httpd/access_log
  • Celery (systemd): journalctl -u celery

Quick Triage Commands

systemctl is-active httpd celery
curl -sk https://dev-backend.mightybox.site/api/v1/health/ready
curl -sk https://dev-backend.mightybox.site/api/v1/health/celery
tail -n 100 /var/log/httpd/error_log
journalctl -u celery --since "30 min ago" --no-pager
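
To spot latency spikes on probe endpoints, the same curl calls can report status code and total time. This is a sketch: `probe_timing` and `is_slow` are hypothetical helper names, and the 5-second timeout and any threshold passed to `is_slow` are assumptions, not values from this guide.

```shell
# Print "<http_code> <seconds>" for one request.
# -k skips TLS verification, matching the triage commands above;
# --max-time bounds a hung probe (default 5 seconds, an assumed value).
probe_timing() {
  curl -sk -o /dev/null --max-time "${2:-5}" \
    -w '%{http_code} %{time_total}\n' "$1"
}

# Exit 0 when a measured time exceeds a threshold (both in seconds).
is_slow() {
  awk -v t="$1" -v max="$2" 'BEGIN { exit !(t > max) }'
}
```

Example: `probe_timing https://dev-backend.mightybox.site/api/v1/health/ready` prints the status code and elapsed seconds; the second field can then be passed to `is_slow` with a threshold suited to the endpoint's tier.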

Escalation Guidance

  • If liveness fails: treat as immediate service incident.
  • If readiness fails: treat as dependency/system incident.
  • If only deep checks are slow: treat as diagnostic overhead/performance issue.
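
The escalation rules above can be expressed as a small decision function. This is a sketch: `classify_incident` is a hypothetical helper, and checking liveness before readiness is an assumed precedence that the guide does not state explicitly.

```shell
# Classify an incident from probe results.
# Arguments: live_ok ready_ok deep_slow, each "yes" or "no".
# Precedence (liveness, then readiness, then deep latency) is an assumption.
classify_incident() {
  live_ok="$1"
  ready_ok="$2"
  deep_slow="$3"
  if [ "$live_ok" != "yes" ]; then
    echo "service incident"
  elif [ "$ready_ok" != "yes" ]; then
    echo "dependency/system incident"
  elif [ "$deep_slow" = "yes" ]; then
    echo "diagnostic overhead/performance issue"
  else
    echo "healthy"
  fi
}
```

For example, `classify_incident yes no no` reports a dependency/system incident, which points the responder at Postgres, Redis, or RabbitMQ before the application itself.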