Observability System - Developer Guide

Last Updated: 2025-12-17
Status: Production-ready (Phases 0-8 complete)
Test Coverage: 33 observability tests (24 unit + 9 integration); 64/71 total backend tests passing

This document provides a comprehensive overview of the MBPanel API observability system for new developers joining the project.

What is Observability?

In MBPanel, observability means you can answer these questions in production:

  1. What happened? - Structured logs with correlation IDs
  2. Where did it fail? - Distributed tracing across services
  3. Why did it fail? - Error context with stack traces
  4. Who was affected? - Multi-tenant context (team_id, user_id)
  5. Is it being attacked? - Security violation patterns
  6. How are external APIs performing? - Request/response metrics

System Architecture

┌─────────────────────────────────────────────────────────────┐
│                     FastAPI Application                      │
├─────────────────────────────────────────────────────────────┤
│  Middleware Stack:                                           │
│  1. CorrelationIdMiddleware    → request_id                 │
│  2. RequestLoggingMiddleware   → duration_ms, status_code   │
│  3. AuthContextMiddleware      → user_id, team_id           │
├─────────────────────────────────────────────────────────────┤
│  Context Variables (thread-safe):                           │
│  • request_id_ctx  • team_id_ctx  • user_id_ctx            │
├─────────────────────────────────────────────────────────────┤
│  Structured Logging:                                         │
│  • Structlog → JSON → stdout → /var/www/error/*.jsonl      │
│  • PII filtering (passwords, tokens, secrets)               │
│  • Environment-based config (dev/staging/production)        │
├─────────────────────────────────────────────────────────────┤
│  Observability Modules:                                      │
│  • Security Events (violations, abuse detection)            │
│  • External HTTP (API calls, retries, failures)             │
│  • SSE/RabbitMQ (connections, metrics, backpressure)        │
│  • OpenTelemetry (optional distributed tracing)             │
│  • Sentry (optional error monitoring)                       │
└─────────────────────────────────────────────────────────────┘
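
The context-variable layer in the diagram can be illustrated with the standard library alone. The sketch below is not the project's request_context.py: it reuses the name request_id_ctx from the diagram, while handle_request and main are hypothetical helpers. It shows that each asyncio task sees its own value, which is what lets concurrent requests log different IDs.

```python
import asyncio
import contextvars
import uuid

# Name taken from the diagram; the default value is an assumption.
request_id_ctx = contextvars.ContextVar("request_id", default=None)


async def handle_request(name: str) -> tuple[str, str]:
    # Each asyncio task runs in its own copy of the context, so this
    # set() is invisible to other requests running concurrently.
    request_id_ctx.set(f"{name}-{uuid.uuid4().hex[:8]}")
    await asyncio.sleep(0)  # yield so the two tasks interleave
    return name, request_id_ctx.get()


async def main() -> dict[str, str]:
    results = await asyncio.gather(handle_request("a"), handle_request("b"))
    return dict(results)


ids = asyncio.run(main())
print(ids)
```

Outside any request the variable keeps its default (None here), which matches the troubleshooting note that the correlation ID only exists during requests.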

Quick Start for Developers

1. Basic Logging

from app.core.logging import get_logger

logger = get_logger(__name__)

# Structured logging (automatically includes correlation_id, team_id, user_id)
logger.info("user_action", action="create_site", site_name="example.com")
logger.warning("validation_failed", field="email", reason="invalid_format")
logger.error("database_error", table="users", operation="insert", exc_info=True)

2. Security Violations

from fastapi import Request
from app.core.security_events import log_auth_violation

# Log authentication/authorization failures
log_auth_violation(
    request=request,
    violation="missing_permission",
    http_status=403,
    detail="User lacks 'sites:create' permission"
)

3. External API Calls

from app.core.external_http import create_external_async_client

# All external HTTP calls use this wrapper
async with create_external_async_client(
    service="stripe",
    base_url="https://api.stripe.com",
    timeout=10.0
) as client:
    response = await client.post("/v1/charges", json={"amount": 1000})
    # Automatically logs: method, URL, status, duration, headers (redacted)

4. Running Tests

# All unit tests (should always pass)
cd backend && source .venv/bin/activate
pytest tests/unit/ -v

# Observability-specific tests
pytest tests/unit/core/test_security_events.py -v
pytest tests/unit/core/test_external_http.py -v

# With coverage
pytest tests/unit/core/ --cov=app --cov-report=term --cov-report=html

Implementation Phases (All Complete ✅)

Phase 0: Foundation

  • ✅ Structlog configuration
  • ✅ Correlation ID middleware
  • ✅ ContextVar-based request context
  • Files: app/core/logging.py, app/core/middleware.py, app/core/request_context.py

Phase 1: Request Logging

  • ✅ HTTP request/response lifecycle
  • ✅ Duration tracking
  • ✅ Status code logging
  • Files: app/core/middleware.py (RequestLoggingMiddleware)
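
As an illustration of duration and status-code tracking, here is a minimal pure-ASGI sketch using only the standard library. TimingMiddleware, demo_app, and the sink callback are hypothetical names for this example; the real RequestLoggingMiddleware may be structured differently.

```python
import asyncio
import time


class TimingMiddleware:
    """Hypothetical ASGI middleware that records status code and duration."""

    def __init__(self, app, sink):
        self.app = app
        self.sink = sink  # callable that receives one dict per request

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return
        start = time.perf_counter()
        status = {"code": None}

        async def send_wrapper(message):
            # The status code travels in the first response message.
            if message["type"] == "http.response.start":
                status["code"] = message["status"]
            await send(message)

        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            # Emit the event even if the wrapped app raised.
            self.sink({
                "event": "http_request",
                "method": scope["method"],
                "path": scope["path"],
                "status_code": status["code"],
                "duration_ms": round((time.perf_counter() - start) * 1000, 2),
            })


async def demo_app(scope, receive, send):
    await send({"type": "http.response.start", "status": 204, "headers": []})
    await send({"type": "http.response.body", "body": b""})


async def _receive():
    return {"type": "http.request", "body": b"", "more_body": False}


async def _discard(message):
    pass


events = []
scope = {"type": "http", "method": "GET", "path": "/health"}
asyncio.run(TimingMiddleware(demo_app, events.append)(scope, _receive, _discard))
```

Measuring with time.perf_counter() rather than wall-clock time keeps the duration unaffected by system clock adjustments.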

Phase 2: Multi-Tenant Context

  • ✅ Team ID injection from JWT
  • ✅ User ID injection from JWT
  • ✅ Context propagation to logs
  • Files: app/core/middleware.py (AuthContextMiddleware)

Phase 3: PII Filtering

  • ✅ Password field redaction
  • ✅ Token/secret masking
  • ✅ Credit card number filtering
  • Files: app/core/logging.py (_redact_sensitive_data)
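
The redaction idea can be sketched as a function that walks a log event before it is rendered. This is not the project's _redact_sensitive_data: redact_event, the key list, and the card-number pattern are assumptions chosen for illustration, and the pattern is deliberately rough (it will also mask other long digit runs).

```python
import re

# Assumed list of key fragments that mark a value as sensitive.
SENSITIVE_KEYS = ("password", "token", "secret", "authorization")
# Rough pattern: 13 to 16 digits, optionally separated by spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def redact_event(event: dict) -> dict:
    """Return a copy of a log event with sensitive values masked."""
    clean = {}
    for key, value in event.items():
        if any(fragment in key.lower() for fragment in SENSITIVE_KEYS):
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact_event(value)  # recurse into nested context
        elif isinstance(value, str):
            clean[key] = CARD_RE.sub("[CARD]", value)
        else:
            clean[key] = value
    return clean


print(redact_event({"password": "hunter2", "msg": "card 4111 1111 1111 1111 declined"}))
```

Matching on key fragments means fields such as api_token or reset_password_url are caught without listing every name explicitly, at the cost of occasional over-redaction.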

Phase 4: OpenTelemetry (Optional)

  • ✅ Distributed tracing support
  • ✅ Graceful fallback when disabled
  • ✅ Span creation with context
  • Files: app/core/telemetry.py
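
One way to get a graceful fallback is to hand callers a span factory that degrades to a no-op. The sketch below is an assumption about the pattern, not the contents of telemetry.py; get_span_factory and the tracer name are hypothetical. It only touches the OpenTelemetry API when tracing is enabled and the package is importable.

```python
from contextlib import nullcontext


def get_span_factory(enabled: bool):
    """Return a callable that opens a span, or a no-op when tracing is off."""
    if enabled:
        try:
            from opentelemetry import trace  # optional dependency
            return trace.get_tracer("mbpanel.sketch").start_as_current_span
        except ImportError:
            pass  # package missing: fall through to the no-op
    # nullcontext() is a context manager that does nothing and yields None.
    return lambda name, **kwargs: nullcontext()


start_span = get_span_factory(enabled=False)
with start_span("create_site") as span:
    result = "work done"  # application code runs the same either way
```

Call sites stay identical whether tracing is on or off, so no feature-flag checks leak into business logic.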

Phase 5: Sentry Integration (Optional)

  • ✅ Error monitoring
  • ✅ PII scrubbing before send
  • ✅ Environment-based sampling
  • Files: app/core/sentry.py

Phase 6: SSE/RabbitMQ Observability

  • ✅ Connection lifecycle logging
  • ✅ Message flow metrics (consumed/forwarded/dropped)
  • ✅ Backpressure detection
  • ✅ Subscriber management
  • Files: app/infrastructure/messaging/sse.py, app/api/v1/events.py
  • Docs: Phase 6 Summary
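
Backpressure detection can be pictured as a bounded per-subscriber queue whose overflow is counted rather than awaited. The Subscriber class and its counters below are hypothetical and only mirror the consumed/forwarded/dropped vocabulary used above; the real sse.py may differ.

```python
import asyncio


class Subscriber:
    """Hypothetical SSE subscriber with a bounded buffer and flow counters."""

    def __init__(self, max_queue: int = 2):
        self.queue = asyncio.Queue(maxsize=max_queue)
        self.consumed = 0   # messages received from the broker
        self.forwarded = 0  # messages queued for the client
        self.dropped = 0    # messages discarded because the client is slow

    def offer(self, message: str) -> bool:
        self.consumed += 1
        try:
            self.queue.put_nowait(message)  # never block the broker consumer
        except asyncio.QueueFull:
            self.dropped += 1  # a rising count is the backpressure signal
            return False
        self.forwarded += 1
        return True


async def demo() -> Subscriber:
    sub = Subscriber(max_queue=2)
    for i in range(5):
        sub.offer(f"event-{i}")
    return sub


sub = asyncio.run(demo())
print(sub.consumed, sub.forwarded, sub.dropped)
```

Dropping instead of blocking keeps one slow client from stalling delivery to every other subscriber.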

Phase 7: Auth/Security Observability

  • ✅ Centralized violation logging
  • ✅ Tenant boundary violations
  • ✅ Token anomaly detection
  • ✅ Abuse signal detection (host/user thresholds)
  • Files: app/core/security_events.py, app/core/middleware.py
  • Docs: Phase 7 Summary
  • Tests: tests/unit/core/test_security_events.py (8 tests), tests/integration/api/test_auth_token_anomalies.py (9 tests)
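
Threshold-based abuse signals can be sketched as a sliding-window counter keyed by host or user. AbuseTracker, its default threshold, and its window are assumptions for illustration; they are not the values or structure used by _track_abuse_signal().

```python
from collections import defaultdict, deque


class AbuseTracker:
    """Hypothetical sliding-window violation counter keyed by host or user."""

    def __init__(self, threshold: int = 5, window_seconds: float = 60.0):
        self.threshold = threshold
        self.window = window_seconds
        self._events = defaultdict(deque)  # key -> timestamps of violations

    def record(self, key: str, now: float) -> bool:
        """Record one violation; return True while the key is at or over the threshold."""
        events = self._events[key]
        events.append(now)
        # Discard timestamps that have aged out of the window.
        while events and now - events[0] > self.window:
            events.popleft()
        return len(events) >= self.threshold


tracker = AbuseTracker(threshold=3, window_seconds=10.0)
signals = [tracker.record("203.0.113.7", t) for t in (0.0, 1.0, 2.0, 20.0)]
print(signals)
```

Passing the timestamp in explicitly keeps the class easy to unit test; production code would typically pass time.monotonic().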

Phase 8: External API Observability

  • ✅ Central HTTP wrapper with event hooks
  • ✅ Header redaction (Authorization, Cookie, X-API-Key)
  • ✅ URL sanitization (query param masking)
  • ✅ Body preview (dev only, production-disabled)
  • ✅ All clients migrated (Virtuozzo, Postmark, ip-api.com)
  • Files: app/core/external_http.py, app/infrastructure/external/*
  • Docs: Phase 8 Summary
  • Tests: tests/unit/core/test_external_http.py (16 tests)
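
Header redaction and URL sanitization can be approximated with the standard library. The header names come from the list above; the query-parameter names, the mask strings, and both function names are assumptions, and external_http.py may implement this differently.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

REDACTED_HEADERS = {"authorization", "cookie", "x-api-key"}
SENSITIVE_PARAMS = {"token", "api_key", "key", "secret"}  # assumed list


def redact_headers(headers: dict) -> dict:
    """Mask sensitive header values; header names compare case-insensitively."""
    return {
        name: "[REDACTED]" if name.lower() in REDACTED_HEADERS else value
        for name, value in headers.items()
    }


def sanitize_url(url: str) -> str:
    """Mask the values of sensitive query parameters, keeping their names."""
    parts = urlsplit(url)
    query = [
        (name, "REDACTED" if name.lower() in SENSITIVE_PARAMS else value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


print(sanitize_url("https://api.example.com/v1/items?page=2&api_key=abc123"))
print(redact_headers({"Authorization": "Bearer abc", "Accept": "application/json"}))
```

Keeping the parameter name while masking its value preserves enough of the URL to debug routing problems without exposing the credential.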

Key Files Reference

| File | Purpose | Key Functions |
|---|---|---|
| app/core/logging.py | Structlog setup | setup_logging(), get_logger() |
| app/core/middleware.py | Request lifecycle | CorrelationIdMiddleware, RequestLoggingMiddleware, AuthContextMiddleware |
| app/core/request_context.py | Context storage | request_id_ctx, team_id_ctx, user_id_ctx |
| app/core/security_events.py | Security violations | log_auth_violation(), _track_abuse_signal() |
| app/core/external_http.py | External API wrapper | create_external_async_client() |
| app/core/telemetry.py | OpenTelemetry | configure_opentelemetry() |
| app/core/sentry.py | Sentry integration | init_sentry() |

Environment Variables

Core Settings

ENVIRONMENT=development|staging|production
LOG_LEVEL=DEBUG|INFO|WARNING|ERROR

# File logging (optional)
LOG_TO_FILE=false
LOG_DIR=/var/www/error
LOG_FILE_NAME=mbpanel-api.jsonl
LOG_FILE_MAX_BYTES=10485760
LOG_FILE_BACKUP_COUNT=5
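
These file-logging variables map naturally onto the standard library's RotatingFileHandler. The sketch below is an assumption about how they could be consumed, not a copy of setup_logging(); build_file_handler is a hypothetical helper that takes the environment as a plain dict.

```python
import logging.handlers
import os
import tempfile


def build_file_handler(env: dict):
    """Return a RotatingFileHandler, or None unless LOG_TO_FILE is true."""
    if env.get("LOG_TO_FILE", "false").lower() != "true":
        return None
    path = os.path.join(env["LOG_DIR"], env.get("LOG_FILE_NAME", "mbpanel-api.jsonl"))
    return logging.handlers.RotatingFileHandler(
        path,
        maxBytes=int(env.get("LOG_FILE_MAX_BYTES", 10485760)),  # 10 MiB per file
        backupCount=int(env.get("LOG_FILE_BACKUP_COUNT", 5)),   # rotated files kept
        delay=True,  # open the file on first write, not at construction
    )


with tempfile.TemporaryDirectory() as log_dir:
    handler = build_file_handler({"LOG_TO_FILE": "true", "LOG_DIR": log_dir})
    max_bytes, backups = handler.maxBytes, handler.backupCount
    handler.close()
```

With the values shown above, disk usage is bounded at roughly 60 MiB: one active 10 MiB file plus five rotated backups.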

Optional Services

# OpenTelemetry
ENABLE_OPENTELEMETRY=false
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318

# Sentry
ENABLE_SENTRY=false
SENTRY_DSN=https://...@sentry.io/...
SENTRY_ENVIRONMENT=production
SENTRY_TRACES_SAMPLE_RATE=0.1

External API (Dev only)

# Body preview (auto-disabled in production)
EXTERNAL_HTTP_ENABLE_BODY_LOGGING=true
EXTERNAL_HTTP_MAX_BODY_BYTES=1024
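
The body-preview behaviour can be sketched as a small pure function. body_preview and its parameters are hypothetical; the sketch assumes the production guard and the byte limit work roughly as the settings above suggest.

```python
def body_preview(body: bytes, environment: str, enabled: bool = True, max_bytes: int = 1024):
    """Return a truncated, decoded preview, or None when previews are off."""
    if environment == "production" or not enabled:
        return None  # never log bodies in production, whatever the flag says
    # errors="replace" keeps binary or partially cut multi-byte data loggable.
    preview = body[:max_bytes].decode("utf-8", errors="replace")
    if len(body) > max_bytes:
        preview += f"... [{len(body) - max_bytes} more bytes]"
    return preview


print(body_preview(b'{"amount": 1000}', "development"))
print(body_preview(b'{"amount": 1000}', "production"))
```

Checking the environment before the flag means a stray EXTERNAL_HTTP_ENABLE_BODY_LOGGING=true in a production config cannot leak request bodies.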

Common Patterns

Adding a New Endpoint

from fastapi import APIRouter, Depends, Request
from app.core.logging import get_logger
from app.core.dependencies import require_permission

# SiteCreate, User, and site_service are assumed to come from your own
# schema, model, and service modules; import them from wherever they live.

router = APIRouter()
logger = get_logger(__name__)

@router.post("/api/v1/sites")
async def create_site(
    request: Request,
    site_data: SiteCreate,
    user: User = Depends(require_permission("sites:create"))
):
    # Logs automatically include: correlation_id, team_id, user_id
    logger.info("site_creation_started", site_name=site_data.name)

    try:
        site = await site_service.create(site_data)
        logger.info("site_created", site_id=site.id, site_name=site.name)
        return site
    except Exception as e:
        logger.error("site_creation_failed",
                    site_name=site_data.name,
                    error=str(e),
                    exc_info=True)
        raise

Adding a New External API Client

from app.core.external_http import create_external_async_client
from app.core.config import settings

class MyExternalService:
    def __init__(self):
        self.client = None

    async def get_client(self):
        if not self.client:
            self.client = await create_external_async_client(
                service="my-service",
                base_url=settings.my_service_url,
                timeout=10.0
            ).__aenter__()
        return self.client

    async def close(self):
        if self.client:
            await self.client.__aexit__(None, None, None)
            self.client = None  # so get_client() creates a fresh client after close()

    async def do_something(self, param: str):
        client = await self.get_client()
        # Automatically logs request/response with redacted headers
        response = await client.get(f"/endpoint?param={param}")
        return response.json()

Custom Security Violation

from fastapi import HTTPException, Request
from app.core.security_events import log_auth_violation

# User is assumed to be the application's user model, imported elsewhere.

def check_custom_permission(request: Request, user: User):
    if not user.has_custom_permission():
        log_auth_violation(
            request=request,
            violation="custom_permission_denied",
            http_status=403,
            detail="User lacks custom permission",
            extra_field="custom_value"  # Optional extra context
        )
        raise HTTPException(status_code=403, detail="Forbidden")

Testing Your Observability Code

Unit Test Example

import pytest
from app.core.security_events import log_auth_violation
from unittest.mock import MagicMock
from fastapi import Request

def create_mock_request(method="GET", path="/", client_host="127.0.0.1"):
    mock_request = MagicMock(spec=Request)
    mock_request.method = method
    mock_request.url = MagicMock(path=path)
    mock_request.client = MagicMock(host=client_host)
    mock_request.headers = {"user-agent": "test"}
    return mock_request

def test_violation_logging(capsys):
    """Test that violations are logged with correct format."""
    request = create_mock_request(method="POST", path="/api/v1/admin")

    # Execute the function that logs
    log_auth_violation(
        request=request,
        violation="missing_permission",
        http_status=403,
        detail="Admin access required"
    )

    # Capture stdout (structlog writes here)
    captured = capsys.readouterr()

    # Validate structured log output
    assert "auth_violation" in captured.out
    assert "missing_permission" in captured.out
    assert "POST" in captured.out
    assert "/api/v1/admin" in captured.out

Troubleshooting

Issue: Logs not appearing

Check:

  1. Is setup_logging() called in app_factory.py?
  2. Is LOG_LEVEL set correctly?
  3. Are you using get_logger(__name__) from app.core.logging?

Issue: Correlation ID missing

Check:

  1. Is CorrelationIdMiddleware registered?
  2. Are you in a request context? (Correlation ID only exists during requests)
  3. Check logs for correlation_id=None (indicates context leak)

Issue: Team/User ID not in logs

Check:

  1. Is AuthContextMiddleware registered?
  2. Is the endpoint authenticated? (No JWT = no team/user context)
  3. Is the JWT valid, and does it include the sub and team_id claims?

Issue: External API secrets visible in logs

Check:

  1. Are you using create_external_async_client()?
  2. Is ENVIRONMENT=production? (Body logging auto-disabled)
  3. Are sensitive headers being passed with standard names? (Authorization, Cookie, X-API-Key)

Issue: Tests failing with "pytest: command not found"

Solution:

cd backend
python3 -m venv .venv  # If not exists
source .venv/bin/activate
pip install -e ".[dev]"
pytest tests/unit/ -v

Issue: Python 3.10 Compatibility (Fixed Dec 17, 2025)

Problem: Tests failing with AttributeError: type object 'datetime.datetime' has no attribute 'UTC'
Root Cause: Used datetime.UTC (Python 3.11+) but project requires Python 3.10+
Solution: Replaced with timezone.utc throughout Phase 7/8 code
Files Fixed: security_events.py, test_security_events.py, test_auth_token_anomalies.py
Status: ✅ Fixed (64/71 core tests passing; the 7 remaining failures are env-dependent)
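
For reference, a version-independent way to get an aware UTC timestamp looks like this. utc_now is a hypothetical helper name; timezone.utc itself has been in the standard library since Python 3.2, so it works on 3.10 and later alike.

```python
from datetime import datetime, timezone


def utc_now() -> datetime:
    """Return the current time as a timezone-aware UTC datetime."""
    # timezone.utc works on every supported version; the datetime.UTC
    # alias was only added in Python 3.11.
    return datetime.now(timezone.utc)


stamp = utc_now()
print(stamp.isoformat())
```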

Production Deployment Checklist

  • ENVIRONMENT=production is set
  • LOG_TO_FILE=true with writable LOG_DIR
  • LOG_LEVEL=INFO (not DEBUG)
  • ENABLE_SENTRY=true with valid SENTRY_DSN
  • EXTERNAL_HTTP_ENABLE_BODY_LOGGING=false (redundant safeguard; body logging is already auto-disabled in production)
  • Log rotation configured (see LOG_FILE_BACKUP_COUNT)
  • Application user has write permissions to log directory
  • External log shipping configured (e.g., Filebeat, Fluentd)

Support

For questions about the observability system:

  1. Check this document first
  2. Review phase-specific docs in docs/notes/
  3. Run relevant tests to see working examples
  4. Check CHANGELOG.md for recent changes


Version: 1.0.0
Maintainer: Backend Team
Last Review: 2025-12-16