Observability System - Developer Guide¶
Last Updated: 2025-12-17
Status: Production-ready (Phases 0-8 complete)
Test Coverage: 33 observability tests (24 unit + 9 integration); 64/71 total backend tests passing
This document provides a comprehensive overview of the MBPanel API observability system for new developers joining the project.
What is Observability?¶
In MBPanel, observability means you can answer these questions in production:
- What happened? - Structured logs with correlation IDs
- Where did it fail? - Distributed tracing across services
- Why did it fail? - Error context with stack traces
- Who was affected? - Multi-tenant context (team_id, user_id)
- Is it being attacked? - Security violation patterns
- How are external APIs performing? - Request/response metrics
System Architecture¶
┌─────────────────────────────────────────────────────────────┐
│                     FastAPI Application                     │
├─────────────────────────────────────────────────────────────┤
│ Middleware Stack:                                           │
│   1. CorrelationIdMiddleware   → request_id                 │
│   2. RequestLoggingMiddleware  → duration_ms, status_code   │
│   3. AuthContextMiddleware     → user_id, team_id           │
├─────────────────────────────────────────────────────────────┤
│ Context Variables (thread-safe):                            │
│   • request_id_ctx   • team_id_ctx   • user_id_ctx          │
├─────────────────────────────────────────────────────────────┤
│ Structured Logging:                                         │
│   • Structlog → JSON → stdout → /var/www/error/*.jsonl      │
│   • PII filtering (passwords, tokens, secrets)              │
│   • Environment-based config (dev/staging/production)       │
├─────────────────────────────────────────────────────────────┤
│ Observability Modules:                                      │
│   • Security Events (violations, abuse detection)           │
│   • External HTTP (API calls, retries, failures)            │
│   • SSE/RabbitMQ (connections, metrics, backpressure)       │
│   • OpenTelemetry (optional distributed tracing)            │
│   • Sentry (optional error monitoring)                      │
└─────────────────────────────────────────────────────────────┘
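The context-variable layer in the diagram can be illustrated with the standard library alone. The following is a minimal sketch, not the project's code: `request_id_ctx` here is a stand-in for the variable of the same name in app/core/request_context.py, and `log()` and `handle_request()` are hypothetical helpers. It shows that two concurrent requests each see only their own ID.

```python
import asyncio
from contextvars import ContextVar
from typing import Optional

# Stand-in for request_id_ctx in app/core/request_context.py (assumption).
request_id_ctx: ContextVar[Optional[str]] = ContextVar("request_id", default=None)


def log(event: str, **fields) -> dict:
    """Build a log record the way a context-merging processor might."""
    return {"event": event, "request_id": request_id_ctx.get(), **fields}


async def handle_request(request_id: str) -> dict:
    # Each asyncio task runs in its own copy of the context, so this
    # set() is invisible to other in-flight requests.
    request_id_ctx.set(request_id)
    await asyncio.sleep(0)  # yield control so the two tasks interleave
    return log("request_handled")


async def main() -> list:
    # gather() wraps each coroutine in its own task (and context copy).
    return await asyncio.gather(handle_request("req-a"), handle_request("req-b"))


records = asyncio.run(main())
print(records)
```

Outside any request (here, at module level after the tasks finish) the variable still holds its default of `None`, which is why log calls made outside a request carry no correlation ID.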
Quick Start for Developers¶
1. Basic Logging¶
from app.core.logging import get_logger
logger = get_logger(__name__)
# Structured logging (automatically includes correlation_id, team_id, user_id)
logger.info("user_action", action="create_site", site_name="example.com")
logger.warning("validation_failed", field="email", reason="invalid_format")
logger.error("database_error", table="users", operation="insert", exc_info=True)
2. Security Violations¶
from fastapi import Request
from app.core.security_events import log_auth_violation

# Log authentication/authorization failures
# (`request` is the current fastapi.Request, e.g. the handler's parameter)
log_auth_violation(
    request=request,
    violation="missing_permission",
    http_status=403,
    detail="User lacks 'sites:create' permission",
)
3. External API Calls¶
from app.core.external_http import create_external_async_client

# "async with" is only valid inside an async function;
# create_charge is an illustrative name.
async def create_charge():
    # All external HTTP calls use this wrapper
    async with create_external_async_client(
        service="stripe",
        base_url="https://api.stripe.com",
        timeout=10.0,
    ) as client:
        response = await client.post("/v1/charges", json={"amount": 1000})
        # Automatically logs: method, URL, status, duration, headers (redacted)
        return response
4. Running Tests¶
# All unit tests (should always pass)
cd backend && source .venv/bin/activate
pytest tests/unit/ -v
# Observability-specific tests
pytest tests/unit/core/test_security_events.py -v
pytest tests/unit/core/test_external_http.py -v
# With coverage
pytest tests/unit/core/ --cov=app --cov-report=term --cov-report=html
Implementation Phases (All Complete ✅)¶
Phase 0: Foundation¶
- ✅ Structlog configuration
- ✅ Correlation ID middleware
- ✅ ContextVar-based request context
- Files: app/core/logging.py, app/core/middleware.py, app/core/request_context.py
Phase 1: Request Logging¶
- ✅ HTTP request/response lifecycle
- ✅ Duration tracking
- ✅ Status code logging
- Files: app/core/middleware.py (RequestLoggingMiddleware)
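To make the duration and status tracking concrete, here is a minimal pure-ASGI sketch. It is not the project's RequestLoggingMiddleware: the class name `TimingMiddleware`, the `on_complete` callback, and the demo app are all hypothetical, and the real middleware may be built on a framework base class instead.

```python
import asyncio
import time


class TimingMiddleware:
    """Pure-ASGI sketch: reports status code and duration for HTTP requests."""

    def __init__(self, app, on_complete):
        self.app = app
        self.on_complete = on_complete  # hypothetical callback, e.g. a logger call

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return
        start = time.perf_counter()
        status_code = None

        async def send_wrapper(message):
            # The status code travels in the first response message.
            nonlocal status_code
            if message["type"] == "http.response.start":
                status_code = message["status"]
            await send(message)

        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            # Runs even if the app raises, so failed requests are timed too.
            duration_ms = (time.perf_counter() - start) * 1000
            self.on_complete(
                method=scope["method"],
                path=scope["path"],
                status_code=status_code,
                duration_ms=round(duration_ms, 2),
            )


async def demo_app(scope, receive, send):
    await send({"type": "http.response.start", "status": 204, "headers": []})
    await send({"type": "http.response.body", "body": b""})


async def _receive():
    return {"type": "http.request", "body": b"", "more_body": False}


async def _send(message):
    pass


events = []
middleware = TimingMiddleware(demo_app, lambda **fields: events.append(fields))
asyncio.run(middleware({"type": "http", "method": "GET", "path": "/health"}, _receive, _send))
print(events)
```

Measuring in a `finally` block is a deliberate choice: a request that raises has no status code yet, but its duration is still recorded.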
Phase 2: Multi-Tenant Context¶
- ✅ Team ID injection from JWT
- ✅ User ID injection from JWT
- ✅ Context propagation to logs
- Files: app/core/middleware.py (AuthContextMiddleware)
Phase 3: PII Filtering¶
- ✅ Password field redaction
- ✅ Token/secret masking
- ✅ Credit card number filtering
- Files: app/core/logging.py (_redact_sensitive_data)
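The redaction step can be pictured as a structlog-style processor, which is a callable taking `(logger, method_name, event_dict)` and returning the event dict. The sketch below is an illustration under assumptions, not the project's `_redact_sensitive_data`: the key fragments, the card-number pattern, and the `[REDACTED]` marker are all made up for the example.

```python
import re

# Assumed key fragments and marker; the real lists live in app/core/logging.py.
SENSITIVE_KEY_PARTS = ("password", "token", "secret")
REDACTED = "[REDACTED]"
# 13-19 digits, optionally separated by single spaces or hyphens.
CARD_NUMBER_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def redact_sensitive_data(logger, method_name, event_dict):
    """Mask sensitive keys and card-like numbers before rendering."""
    for key, value in list(event_dict.items()):
        if any(part in key.lower() for part in SENSITIVE_KEY_PARTS):
            event_dict[key] = REDACTED
        elif isinstance(value, str):
            event_dict[key] = CARD_NUMBER_RE.sub(REDACTED, value)
    return event_dict


sample = redact_sensitive_data(
    None,
    "info",
    {
        "event": "payment_failed",
        "password": "hunter2",
        "api_token": "abc123",
        "note": "card 4111 1111 1111 1111 declined",
        "attempt": 3,
    },
)
print(sample)
```

A processor like this would need to run before the JSON renderer in the processor chain so that nothing sensitive reaches stdout or the log file.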
Phase 4: OpenTelemetry (Optional)¶
- ✅ Distributed tracing support
- ✅ Graceful fallback when disabled
- ✅ Span creation with context
- Files: app/core/telemetry.py
Phase 5: Sentry Integration (Optional)¶
- ✅ Error monitoring
- ✅ PII scrubbing before send
- ✅ Environment-based sampling
- Files: app/core/sentry.py
Phase 6: SSE/RabbitMQ Observability¶
- ✅ Connection lifecycle logging
- ✅ Message flow metrics (consumed/forwarded/dropped)
- ✅ Backpressure detection
- ✅ Subscriber management
- Files: app/infrastructure/messaging/sse.py, app/api/v1/events.py
- Docs: Phase 6 Summary
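One common way to detect backpressure is a bounded per-subscriber queue whose overflow is counted rather than awaited. The sketch below assumes that approach; the `Subscriber` class, its counters, and the buffer size of 2 are hypothetical and may differ from what sse.py actually does.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Subscriber:
    """One SSE client with a small bounded buffer (size is an assumed value)."""

    queue: asyncio.Queue = field(default_factory=lambda: asyncio.Queue(maxsize=2))
    forwarded: int = 0
    dropped: int = 0

    def offer(self, message: str) -> bool:
        """Try to buffer a message without blocking the consumer loop."""
        try:
            self.queue.put_nowait(message)
        except asyncio.QueueFull:
            # The client is reading too slowly: count the drop as a
            # backpressure signal instead of stalling other subscribers.
            self.dropped += 1
            return False
        self.forwarded += 1
        return True


subscriber = Subscriber()
results = [subscriber.offer(message) for message in ("m1", "m2", "m3")]
print(results, subscriber.forwarded, subscriber.dropped)
```

A rising `dropped` count for one subscriber is the kind of metric the message flow bullets above refer to.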
Phase 7: Auth/Security Observability¶
- ✅ Centralized violation logging
- ✅ Tenant boundary violations
- ✅ Token anomaly detection
- ✅ Abuse signal detection (host/user thresholds)
- Files: app/core/security_events.py, app/core/middleware.py
- Docs: Phase 7 Summary
- Tests: tests/unit/core/test_security_events.py (8 tests), tests/integration/api/test_auth_token_anomalies.py (9 tests)
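Threshold-based abuse detection is often implemented as a sliding-window counter per host or user. The following sketch assumes that technique; `AbuseTracker`, its default threshold, and its window length are illustrative and are not taken from `_track_abuse_signal()`.

```python
from collections import defaultdict, deque


class AbuseTracker:
    """Sliding-window violation counter keyed by host or user (hypothetical)."""

    def __init__(self, threshold: int = 5, window_seconds: float = 60.0):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)

    def record(self, key: str, now: float) -> bool:
        """Record one violation; return True once the key reaches the threshold."""
        events = self._events[key]
        events.append(now)
        # Discard violations that have fallen out of the window.
        while events and now - events[0] > self.window_seconds:
            events.popleft()
        return len(events) >= self.threshold


tracker = AbuseTracker(threshold=3, window_seconds=10)
burst = [tracker.record("host:203.0.113.7", t) for t in (0, 1, 2)]
later = tracker.record("host:203.0.113.7", 20)
print(burst, later)
```

Passing `now` explicitly keeps the sketch deterministic in tests; production code would typically read a monotonic clock instead.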
Phase 8: External API Observability¶
- ✅ Central HTTP wrapper with event hooks
- ✅ Header redaction (Authorization, Cookie, X-API-Key)
- ✅ URL sanitization (query param masking)
- ✅ Body preview (dev only, production-disabled)
- ✅ All clients migrated (Virtuozzo, Postmark, ip-api.com)
- Files: app/core/external_http.py, app/infrastructure/external/*
- Docs: Phase 8 Summary
- Tests: tests/unit/core/test_external_http.py (16 tests)
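Header redaction and query-parameter masking can be sketched with the standard library. The header names below come from the list above; the sensitive query parameter names, the `REDACTED` marker, and both function names are assumptions for illustration and may not match external_http.py.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

SENSITIVE_HEADERS = {"authorization", "cookie", "x-api-key"}
SENSITIVE_PARAMS = {"token", "api_key", "signature"}  # assumed names
MASK = "REDACTED"


def redact_headers(headers: dict) -> dict:
    """Return a copy with sensitive header values masked (case-insensitive)."""
    return {
        name: (MASK if name.lower() in SENSITIVE_HEADERS else value)
        for name, value in headers.items()
    }


def sanitize_url(url: str) -> str:
    """Mask sensitive query parameter values, leaving the rest of the URL intact."""
    parts = urlsplit(url)
    query = [
        (name, MASK if name.lower() in SENSITIVE_PARAMS else value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


safe_headers = redact_headers({"Authorization": "Bearer abc", "Accept": "application/json"})
safe_url = sanitize_url("https://api.example.com/v1/items?page=2&token=abc123")
print(safe_headers, safe_url)
```

Matching on lowercase names matters because HTTP header names are case-insensitive, which is also why the troubleshooting section asks for standard header names.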
Key Files Reference¶
| File | Purpose | Key Functions |
|---|---|---|
| app/core/logging.py | Structlog setup | setup_logging(), get_logger() |
| app/core/middleware.py | Request lifecycle | CorrelationIdMiddleware, RequestLoggingMiddleware, AuthContextMiddleware |
| app/core/request_context.py | Context storage | request_id_ctx, team_id_ctx, user_id_ctx |
| app/core/security_events.py | Security violations | log_auth_violation(), _track_abuse_signal() |
| app/core/external_http.py | External API wrapper | create_external_async_client() |
| app/core/telemetry.py | OpenTelemetry | configure_opentelemetry() |
| app/core/sentry.py | Sentry integration | init_sentry() |
Environment Variables¶
Core Settings¶
ENVIRONMENT=development|staging|production
LOG_LEVEL=DEBUG|INFO|WARNING|ERROR
# File logging (optional)
LOG_TO_FILE=false
LOG_DIR=/var/www/error
LOG_FILE_NAME=mbpanel-api.jsonl
LOG_FILE_MAX_BYTES=10485760
LOG_FILE_BACKUP_COUNT=5
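The file-logging variables map naturally onto the standard library's `RotatingFileHandler`. This is a sketch of that mapping, assuming the project uses such a handler; `build_file_handler` is a hypothetical helper, and the demo writes to a temporary directory rather than /var/www/error.

```python
import os
import tempfile
from logging.handlers import RotatingFileHandler


def build_file_handler(env: dict) -> RotatingFileHandler:
    """Create a rotating handler from the LOG_* settings (defaults as listed above)."""
    log_dir = env.get("LOG_DIR", "/var/www/error")
    os.makedirs(log_dir, exist_ok=True)
    return RotatingFileHandler(
        os.path.join(log_dir, env.get("LOG_FILE_NAME", "mbpanel-api.jsonl")),
        maxBytes=int(env.get("LOG_FILE_MAX_BYTES", 10485760)),  # 10 MiB
        backupCount=int(env.get("LOG_FILE_BACKUP_COUNT", 5)),
    )


# Demo with a throwaway directory so no special permissions are needed.
handler = build_file_handler({"LOG_DIR": tempfile.mkdtemp()})
print(handler.baseFilename, handler.maxBytes, handler.backupCount)
handler.close()
```

With these values a handler keeps the active file plus five backups, so disk usage for this log tops out at roughly 60 MiB.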
Optional Services¶
# OpenTelemetry
ENABLE_OPENTELEMETRY=false
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
# Sentry
ENABLE_SENTRY=false
SENTRY_DSN=https://...@sentry.io/...
SENTRY_ENVIRONMENT=production
SENTRY_TRACES_SAMPLE_RATE=0.1
External API (Dev only)¶
# Body preview (auto-disabled in production)
EXTERNAL_HTTP_ENABLE_BODY_LOGGING=true
EXTERNAL_HTTP_MAX_BODY_BYTES=1024
Common Patterns¶
Adding a New Endpoint¶
from fastapi import APIRouter, Depends, Request
from app.core.logging import get_logger
from app.core.dependencies import require_permission

# SiteCreate, User, and site_service come from your own modules (imports omitted here).

router = APIRouter()
logger = get_logger(__name__)

@router.post("/api/v1/sites")
async def create_site(
    request: Request,
    site_data: SiteCreate,
    user: User = Depends(require_permission("sites:create")),
):
    # Logs automatically include: correlation_id, team_id, user_id
    logger.info("site_creation_started", site_name=site_data.name)
    try:
        site = await site_service.create(site_data)
        logger.info("site_created", site_id=site.id, site_name=site.name)
        return site
    except Exception as e:
        logger.error(
            "site_creation_failed",
            site_name=site_data.name,
            error=str(e),
            exc_info=True,
        )
        raise
Adding a New External API Client¶
from app.core.external_http import create_external_async_client
from app.core.config import settings

class MyExternalService:
    def __init__(self):
        self._client_cm = None  # async context manager returned by the factory
        self.client = None

    async def get_client(self):
        if self.client is None:
            # Keep the context manager so close() exits the same object it entered
            self._client_cm = create_external_async_client(
                service="my-service",
                base_url=settings.my_service_url,
                timeout=10.0,
            )
            self.client = await self._client_cm.__aenter__()
        return self.client

    async def close(self):
        if self._client_cm is not None:
            await self._client_cm.__aexit__(None, None, None)
            self._client_cm = None
            self.client = None

    async def do_something(self, param: str):
        client = await self.get_client()
        # Automatically logs request/response with redacted headers
        # params= lets the client URL-encode the value
        response = await client.get("/endpoint", params={"param": param})
        return response.json()
Custom Security Violation¶
from fastapi import HTTPException, Request
from app.core.security_events import log_auth_violation

# User is your application's user model (import omitted here).
def check_custom_permission(request: Request, user: User):
    if not user.has_custom_permission():
        log_auth_violation(
            request=request,
            violation="custom_permission_denied",
            http_status=403,
            detail="User lacks custom permission",
            extra_field="custom_value",  # Optional extra context
        )
        raise HTTPException(status_code=403, detail="Forbidden")
Testing Your Observability Code¶
Unit Test Example¶
import pytest
from app.core.security_events import log_auth_violation
from unittest.mock import MagicMock
from fastapi import Request

def create_mock_request(method="GET", path="/", client_host="127.0.0.1"):
    mock_request = MagicMock(spec=Request)
    mock_request.method = method
    mock_request.url = MagicMock(path=path)
    mock_request.client = MagicMock(host=client_host)
    mock_request.headers = {"user-agent": "test"}
    return mock_request

def test_violation_logging(capsys):
    """Test that violations are logged with correct format."""
    request = create_mock_request(method="POST", path="/api/v1/admin")

    # Execute the function that logs
    log_auth_violation(
        request=request,
        violation="missing_permission",
        http_status=403,
        detail="Admin access required",
    )

    # Capture stdout (structlog writes here)
    captured = capsys.readouterr()

    # Validate structured log output
    assert "auth_violation" in captured.out
    assert "missing_permission" in captured.out
    assert "POST" in captured.out
    assert "/api/v1/admin" in captured.out
Troubleshooting¶
Issue: Logs not appearing¶
Check:
1. Is setup_logging() called in app_factory.py?
2. Is LOG_LEVEL set correctly?
3. Are you using get_logger(__name__) from app.core.logging?
Issue: Correlation ID missing¶
Check:
1. Is CorrelationIdMiddleware registered?
2. Are you in a request context? (Correlation ID only exists during requests)
3. Check logs for correlation_id=None (indicates the log call ran outside a request context or the context was not propagated)
Issue: Team/User ID not in logs¶
Check:
1. Is AuthContextMiddleware registered?
2. Is the endpoint authenticated? (No JWT = no team/user context)
3. Is the JWT valid, and does it include the sub and team_id claims?
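When checking the last point, it can help to look at the claims a token actually carries. The helper below is a debugging sketch only: it decodes the payload segment without verifying the signature, so it must never be used for authentication. `peek_jwt_claims` and the demo token are hypothetical.

```python
import base64
import json


def peek_jwt_claims(token: str) -> dict:
    """Decode the payload segment WITHOUT verifying the signature (debug only)."""
    payload_b64 = token.split(".")[1]
    # JWT segments drop base64 padding; restore it before decoding.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def _encode_segment(obj: dict) -> str:
    raw = base64.urlsafe_b64encode(json.dumps(obj).encode())
    return raw.rstrip(b"=").decode()


# Hand-built, unsigned demo token; a real one comes from the login flow.
demo_token = ".".join(
    [_encode_segment({"alg": "HS256"}), _encode_segment({"sub": "42", "team_id": "7"}), "sig"]
)
claims = peek_jwt_claims(demo_token)
missing = {"sub", "team_id"} - set(claims)
print(claims, missing)
```

A non-empty `missing` set would explain why team or user IDs are absent from the logs even though the request is authenticated.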
Issue: External API secrets visible in logs¶
Check:
1. Are you using create_external_async_client()?
2. Is ENVIRONMENT=production? (Body logging auto-disabled)
3. Are sensitive headers being passed with standard names? (Authorization, Cookie, X-API-Key)
Issue: Tests failing with "pytest: command not found"¶
Solution:
cd backend
python3 -m venv .venv # Only if .venv does not exist yet
source .venv/bin/activate
pip install -e ".[dev]"
pytest tests/unit/ -v
Issue: Python 3.10 Compatibility (Fixed Dec 17, 2025)¶
Problem: Tests failing with AttributeError: type object 'datetime.datetime' has no attribute 'UTC'
Root Cause: Used datetime.UTC (Python 3.11+) but project requires Python 3.10+
Solution: Replaced with timezone.utc throughout Phase 7/8 code
Files Fixed: security_events.py, test_security_events.py, test_auth_token_anomalies.py
Status: ✅ Fixed (64/71 tests pass; the 7 remaining failures are env-dependent)
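For reference, the portable spelling looks like this. The snippet is a generic illustration of `timezone.utc`, not a copy of the patched project code.

```python
from datetime import datetime, timezone

# timezone.utc exists on every supported Python 3 version;
# the datetime.UTC alias was only added in Python 3.11.
now = datetime.now(timezone.utc)
print(now.isoformat())
```

The resulting value is timezone-aware, so it compares and serializes the same way on Python 3.10 and on newer interpreters.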
Production Deployment Checklist¶
- ENVIRONMENT=production is set
- LOG_TO_FILE=true with writable LOG_DIR
- LOG_LEVEL=INFO (not DEBUG)
- ENABLE_SENTRY=true with valid SENTRY_DSN
- EXTERNAL_HTTP_ENABLE_BODY_LOGGING=false (redundant check)
- Log rotation configured (see LOG_FILE_BACKUP_COUNT)
- Application user has write permissions to log directory
- External log shipping configured (e.g., Filebeat, Fluentd)
Further Reading¶
- Architecture: docs/architecture/001-hybrid-modular-ddd.md
- Testing Guide: backend/tests/README.md
- Task Tracker: ../development/tasks/phase1_observability/logging_debugging_system.md
- Structlog Docs: https://www.structlog.org/
- HTTPX Event Hooks: https://www.python-httpx.org/advanced/#event-hooks
- OpenTelemetry Python: https://opentelemetry.io/docs/languages/python/
Support¶
For questions about the observability system:
1. Check this document first
2. Review phase-specific docs in docs/notes/
3. Run relevant tests to see working examples
4. Check CHANGELOG.md for recent changes
Version: 1.0.0
Maintainer: Backend Team
Last Review: 2025-12-16