Observability System - Developer Guide

Last Updated: 2025-12-17
Status: Production-ready (Phases 0-8 complete)
Test Coverage: 33 observability tests (24 unit + 9 integration), 64/71 total backend tests passing

This document provides a comprehensive overview of the MBPanel API observability system for new developers joining the project.

What is Observability?

In MBPanel, observability means you can answer these questions in production:

What happened? - Structured logs with correlation IDs
Where did it fail? - Distributed tracing across services
Why did it fail? - Error context with stack traces
Who was affected? - Multi-tenant context (team_id, user_id)
Is it being attacked? - Security violation patterns
How are external APIs performing? - Request/response metrics

System Architecture

┌─────────────────────────────────────────────────────────────┐
│                     FastAPI Application                      │
├─────────────────────────────────────────────────────────────┤
│  Middleware Stack:                                           │
│  1. CorrelationIdMiddleware    → request_id                 │
│  2. RequestLoggingMiddleware   → duration_ms, status_code   │
│  3. AuthContextMiddleware      → user_id, team_id           │
├─────────────────────────────────────────────────────────────┤
│  Context Variables (thread-safe):                           │
│  • request_id_ctx  • team_id_ctx  • user_id_ctx            │
├─────────────────────────────────────────────────────────────┤
│  Structured Logging:                                         │
│  • Structlog → JSON → stdout → /var/www/error/*.jsonl      │
│  • PII filtering (passwords, tokens, secrets)               │
│  • Environment-based config (dev/staging/production)        │
├─────────────────────────────────────────────────────────────┤
│  Observability Modules:                                      │
│  • Security Events (violations, abuse detection)            │
│  • External HTTP (API calls, retries, failures)             │
│  • SSE/RabbitMQ (connections, metrics, backpressure)        │
│  • OpenTelemetry (optional distributed tracing)             │
│  • Sentry (optional error monitoring)                       │
└─────────────────────────────────────────────────────────────┘
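
To make the middleware-to-log flow in the diagram concrete, here is a minimal sketch of how context variables can feed a structlog-style processor. It uses only the standard library; the variable names mirror the diagram, but `add_request_context` and `handle_request` are hypothetical helpers, not the actual contents of app/core/request_context.py.

```python
import contextvars
import uuid

# Hypothetical stand-ins for the variables in app/core/request_context.py.
request_id_ctx = contextvars.ContextVar("request_id", default=None)
team_id_ctx = contextvars.ContextVar("team_id", default=None)
user_id_ctx = contextvars.ContextVar("user_id", default=None)


def add_request_context(logger, method_name, event_dict):
    """Structlog-style processor: copy context values into the log event."""
    event_dict.setdefault("request_id", request_id_ctx.get())
    event_dict.setdefault("team_id", team_id_ctx.get())
    event_dict.setdefault("user_id", user_id_ctx.get())
    return event_dict


def handle_request(team_id, user_id):
    """Simulate what the middleware stack does around one request."""
    rid_token = request_id_ctx.set(str(uuid.uuid4()))
    team_token = team_id_ctx.set(team_id)
    user_token = user_id_ctx.set(user_id)
    try:
        # Any log call made here would pass through the processor.
        return add_request_context(None, "info", {"event": "user_action"})
    finally:
        # Reset so values never carry over into the next request.
        user_id_ctx.reset(user_token)
        team_id_ctx.reset(team_token)
        request_id_ctx.reset(rid_token)
```

Because each asyncio task runs in its own copy of the context, concurrent requests do not see each other's values, which is why call sites never need to pass these IDs explicitly.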

Quick Start for Developers

1. Basic Logging

from app.core.logging import get_logger

logger = get_logger(__name__)

# Structured logging (automatically includes correlation_id, team_id, user_id)
logger.info("user_action", action="create_site", site_name="example.com")
logger.warning("validation_failed", field="email", reason="invalid_format")
logger.error("database_error", table="users", operation="insert", exc_info=True)

2. Security Violations

from fastapi import Request
from app.core.security_events import log_auth_violation

# Log authentication/authorization failures.
# `request` is the current fastapi.Request, e.g. the endpoint's `request` parameter.
log_auth_violation(
    request=request,
    violation="missing_permission",
    http_status=403,
    detail="User lacks 'sites:create' permission"
)

3. External API Calls

from app.core.external_http import create_external_async_client

# All external HTTP calls use this wrapper.
# `async with` must run inside a coroutine; create_charge is an illustrative name.
async def create_charge():
    async with create_external_async_client(
        service="stripe",
        base_url="https://api.stripe.com",
        timeout=10.0
    ) as client:
        # Automatically logs: method, URL, status, duration, headers (redacted)
        response = await client.post("/v1/charges", json={"amount": 1000})
        return response

4. Running Tests

# All unit tests (should always pass)
cd backend && source .venv/bin/activate
pytest tests/unit/ -v

# Observability-specific tests
pytest tests/unit/core/test_security_events.py -v
pytest tests/unit/core/test_external_http.py -v

# With coverage
pytest tests/unit/core/ --cov=app --cov-report=term --cov-report=html

Implementation Phases (All Complete ✅)

Phase 0: Foundation

✅ Structlog configuration
✅ Correlation ID middleware
✅ ContextVar-based request context
Files: app/core/logging.py, app/core/middleware.py, app/core/request_context.py

Phase 1: Request Logging

✅ HTTP request/response lifecycle
✅ Duration tracking
✅ Status code logging
Files: app/core/middleware.py (RequestLoggingMiddleware)
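
As an illustration of duration and status-code tracking, the sketch below wraps an ASGI app and emits one event per HTTP request. It is a simplified stand-in, not the real RequestLoggingMiddleware: the class name `TimingMiddleware`, the `emit` callback, and the event field layout are assumptions made for the example.

```python
import asyncio
import time


class TimingMiddleware:
    """Pure-ASGI sketch: record status code and duration for each HTTP request."""

    def __init__(self, app, emit=print):
        self.app = app
        self.emit = emit  # hypothetical sink; the real code logs via structlog

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            # Pass through websocket and lifespan traffic untouched.
            await self.app(scope, receive, send)
            return

        start = time.perf_counter()
        status = {"code": None}

        async def send_wrapper(message):
            # The status code travels in the first response message.
            if message["type"] == "http.response.start":
                status["code"] = message["status"]
            await send(message)

        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            # Emit even when the app raises, so failed requests are timed too.
            duration_ms = round((time.perf_counter() - start) * 1000, 2)
            self.emit({
                "event": "http_request",
                "method": scope.get("method"),
                "path": scope.get("path"),
                "status_code": status["code"],
                "duration_ms": duration_ms,
            })
```

Using `time.perf_counter()` rather than wall-clock time keeps the measurement monotonic, so clock adjustments cannot produce negative durations.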

Phase 2: Multi-Tenant Context

✅ Team ID injection from JWT
✅ User ID injection from JWT
✅ Context propagation to logs
Files: app/core/middleware.py (AuthContextMiddleware)

Phase 3: PII Filtering

✅ Password field redaction
✅ Token/secret masking
✅ Credit card number filtering
Files: app/core/logging.py (_redact_sensitive_data)
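
The redaction step can be pictured with a small key-based filter. This is a sketch under assumptions: the key fragments and the `[REDACTED]` placeholder are chosen for illustration, credit card filtering is left out, and the real `_redact_sensitive_data` may work differently.

```python
SENSITIVE_KEY_PARTS = ("password", "token", "secret")  # assumed key fragments
REDACTED = "[REDACTED]"  # assumed placeholder text


def redact_sensitive(value):
    """Recursively mask values whose key looks sensitive."""
    if isinstance(value, dict):
        return {
            key: REDACTED
            if any(part in str(key).lower() for part in SENSITIVE_KEY_PARTS)
            else redact_sensitive(item)
            for key, item in value.items()
        }
    if isinstance(value, (list, tuple)):
        # Sequences are returned as lists with each element filtered.
        return [redact_sensitive(item) for item in value]
    return value
```

Matching on key fragments rather than exact names means fields such as `api_token` or `client_secret` are masked without being listed one by one, at the cost of occasionally masking a harmless field.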

Phase 4: OpenTelemetry (Optional)

✅ Distributed tracing support
✅ Graceful fallback when disabled
✅ Span creation with context
Files: app/core/telemetry.py
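
One way to get a graceful fallback is to treat the OpenTelemetry package as an optional import and hand back a no-op context manager when it is missing or disabled. The helper name `start_span` and its `enabled` flag are hypothetical; app/core/telemetry.py may structure this differently.

```python
import contextlib

try:
    from opentelemetry import trace  # optional dependency
except ImportError:
    trace = None


def start_span(name, enabled=True):
    """Return a context manager: a real span when tracing is available, else a no-op."""
    if enabled and trace is not None:
        return trace.get_tracer(__name__).start_as_current_span(name)
    # nullcontext() does nothing on enter and exit, so callers need no branching.
    return contextlib.nullcontext()
```

Call sites can then write `with start_span("create_site"):` unconditionally, whether or not tracing is installed or switched on.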

Phase 5: Sentry Integration (Optional)

✅ Error monitoring
✅ PII scrubbing before send
✅ Environment-based sampling
Files: app/core/sentry.py

Phase 6: SSE/RabbitMQ Observability

✅ Connection lifecycle logging
✅ Message flow metrics (consumed/forwarded/dropped)
✅ Backpressure detection
✅ Subscriber management
Files: app/infrastructure/messaging/sse.py, app/api/v1/events.py
Docs: Phase 6 Summary
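
Backpressure detection with forwarded/dropped counters can be sketched with a bounded queue per subscriber. The `Subscriber` class, the queue size of 2, and the drop-newest policy are assumptions for illustration only; the actual logic lives in app/infrastructure/messaging/sse.py.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Subscriber:
    """One SSE connection with a bounded buffer and flow counters."""
    queue: asyncio.Queue = field(default_factory=lambda: asyncio.Queue(maxsize=2))
    forwarded: int = 0
    dropped: int = 0

    def offer(self, message):
        """Enqueue without blocking; count a drop when the client is too slow."""
        try:
            self.queue.put_nowait(message)
            self.forwarded += 1
            return True
        except asyncio.QueueFull:
            # A full queue is the backpressure signal worth logging.
            self.dropped += 1
            return False
```

Never blocking the producer on a slow client keeps one stalled browser tab from delaying delivery to every other subscriber.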

Phase 7: Auth/Security Observability

✅ Centralized violation logging
✅ Tenant boundary violations
✅ Token anomaly detection
✅ Abuse signal detection (host/user thresholds)
Files: app/core/security_events.py, app/core/middleware.py
Docs: Phase 7 Summary
Tests: tests/unit/core/test_security_events.py (8 tests), tests/integration/api/test_auth_token_anomalies.py (9 tests)
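
Threshold-based abuse detection is commonly built on a sliding window of recent violations per key. The sketch below shows that general technique; the `AbuseTracker` class, its default threshold, and the key format are hypothetical and not taken from `_track_abuse_signal()`.

```python
import time
from collections import defaultdict, deque


class AbuseTracker:
    """Count violations per key (host or user) inside a sliding time window."""

    def __init__(self, threshold=5, window_seconds=60.0):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)

    def record(self, key, now=None):
        """Record one violation; return True once the key reaches the threshold."""
        now = time.monotonic() if now is None else now
        events = self._events[key]
        events.append(now)
        # Drop timestamps that have fallen out of the window.
        while events and now - events[0] > self.window_seconds:
            events.popleft()
        return len(events) >= self.threshold
```

A caller might use keys such as `"host:203.0.113.7"` or `"user:42"` and emit an abuse-signal log event the first time `record()` returns True.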

Phase 8: External API Observability

✅ Central HTTP wrapper with event hooks
✅ Header redaction (Authorization, Cookie, X-API-Key)
✅ URL sanitization (query param masking)
✅ Body preview (dev only, production-disabled)
✅ All clients migrated (Virtuozzo, Postmark, ip-api.com)
Files: app/core/external_http.py, app/infrastructure/external/*
Docs: Phase 8 Summary
Tests: tests/unit/core/test_external_http.py (16 tests)
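
Header redaction and query-parameter masking can be illustrated with the standard library alone. The header names come from the list above; the sensitive parameter names, the `REDACTED` placeholder, and both function names are assumptions rather than the wrapper's real behavior.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

SENSITIVE_HEADERS = {"authorization", "cookie", "x-api-key"}
SENSITIVE_PARAMS = {"token", "api_key", "key", "secret"}  # assumed names


def redact_headers(headers):
    """Mask sensitive header values, matching names case-insensitively."""
    return {
        name: "REDACTED" if name.lower() in SENSITIVE_HEADERS else value
        for name, value in headers.items()
    }


def sanitize_url(url):
    """Mask sensitive query parameter values while keeping the rest of the URL."""
    parts = urlsplit(url)
    query = [
        (name, "REDACTED" if name.lower() in SENSITIVE_PARAMS else value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Keeping parameter names visible while masking only the values preserves enough of the URL to debug a failing call without exposing credentials.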

Key Files Reference

File	Purpose	Key Functions
`app/core/logging.py`	Structlog setup	`setup_logging()`, `get_logger()`
`app/core/middleware.py`	Request lifecycle	`CorrelationIdMiddleware`, `RequestLoggingMiddleware`, `AuthContextMiddleware`
`app/core/request_context.py`	Context storage	`request_id_ctx`, `team_id_ctx`, `user_id_ctx`
`app/core/security_events.py`	Security violations	`log_auth_violation()`, `_track_abuse_signal()`
`app/core/external_http.py`	External API wrapper	`create_external_async_client()`
`app/core/telemetry.py`	OpenTelemetry	`configure_opentelemetry()`
`app/core/sentry.py`	Sentry integration	`init_sentry()`

Environment Variables

Core Settings

ENVIRONMENT=development|staging|production
LOG_LEVEL=DEBUG|INFO|WARNING|ERROR

# File logging (optional)
LOG_TO_FILE=false
LOG_DIR=/var/www/error
LOG_FILE_NAME=mbpanel-api.jsonl
LOG_FILE_MAX_BYTES=10485760
LOG_FILE_BACKUP_COUNT=5
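
For orientation, the file-logging variables map naturally onto the standard library's `RotatingFileHandler`. This is a sketch assuming that mapping; `build_file_handler` is a hypothetical helper and `setup_logging()` may wire things up differently. Note that 10485760 bytes is 10 MiB.

```python
import logging
import logging.handlers
import os
from pathlib import Path


def build_file_handler(env=os.environ):
    """Return a RotatingFileHandler configured from the LOG_* variables, or None."""
    if env.get("LOG_TO_FILE", "false").lower() != "true":
        return None
    log_dir = Path(env.get("LOG_DIR", "/var/www/error"))
    log_dir.mkdir(parents=True, exist_ok=True)
    return logging.handlers.RotatingFileHandler(
        log_dir / env.get("LOG_FILE_NAME", "mbpanel-api.jsonl"),
        maxBytes=int(env.get("LOG_FILE_MAX_BYTES", "10485760")),
        backupCount=int(env.get("LOG_FILE_BACKUP_COUNT", "5")),
    )
```

With these defaults the handler keeps the active file plus five rotated backups, so the log directory needs room for roughly 60 MiB.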

Optional Services

# OpenTelemetry
ENABLE_OPENTELEMETRY=false
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318

# Sentry
ENABLE_SENTRY=false
SENTRY_DSN=https://...@sentry.io/...
SENTRY_ENVIRONMENT=production
SENTRY_TRACES_SAMPLE_RATE=0.1

External API (Dev only)

# Body preview (auto-disabled in production)
EXTERNAL_HTTP_ENABLE_BODY_LOGGING=true
EXTERNAL_HTTP_MAX_BODY_BYTES=1024
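
The interaction between these two settings and ENVIRONMENT can be summarized in a few lines. This is a sketch of the described behavior, not the wrapper's code; `body_preview` and its parameters are hypothetical names.

```python
def body_preview(body, environment, enabled=True, max_bytes=1024):
    """Return a truncated, decoded preview of a body, or None when previews are off."""
    # Production always wins, regardless of the enable flag.
    if environment == "production" or not enabled:
        return None
    # Truncate before decoding; a cut multi-byte character becomes U+FFFD.
    return body[:max_bytes].decode("utf-8", errors="replace")
```

Truncating on bytes rather than characters keeps the logged size bounded by EXTERNAL_HTTP_MAX_BODY_BYTES even for non-ASCII payloads.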

Common Patterns

Adding a New Endpoint

from fastapi import APIRouter, Depends, Request
from app.core.logging import get_logger
from app.core.dependencies import require_permission

# SiteCreate, User, and site_service are application-specific and imported elsewhere (not shown)

router = APIRouter()
logger = get_logger(__name__)

@router.post("/api/v1/sites")
async def create_site(
    request: Request,
    site_data: SiteCreate,
    user: User = Depends(require_permission("sites:create"))
):
    # Logs automatically include: correlation_id, team_id, user_id
    logger.info("site_creation_started", site_name=site_data.name)

    try:
        site = await site_service.create(site_data)
        logger.info("site_created", site_id=site.id, site_name=site.name)
        return site
    except Exception as e:
        # Log with the stack trace, then re-raise so the error handler still runs
        logger.error("site_creation_failed",
                    site_name=site_data.name,
                    error=str(e),
                    exc_info=True)
        raise

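Adding a New External API Client

from app.core.external_http import create_external_async_client
from app.core.config import settings

class MyExternalService:
    def __init__(self):
        self._client_cm = None  # context manager returned by the wrapper
        self.client = None

    async def get_client(self):
        # Lazily open one long-lived client and reuse it across calls
        if self.client is None:
            self._client_cm = create_external_async_client(
                service="my-service",
                base_url=settings.my_service_url,
                timeout=10.0
            )
            self.client = await self._client_cm.__aenter__()
        return self.client

    async def close(self):
        # Exit the same context manager that was entered, then allow reopening
        if self._client_cm is not None:
            await self._client_cm.__aexit__(None, None, None)
            self._client_cm = None
            self.client = None

    async def do_something(self, param: str):
        client = await self.get_client()
        # Automatically logs request/response with redacted headers.
        # Passing params (assumes an httpx-style client) URL-encodes the value.
        response = await client.get("/endpoint", params={"param": param})
        return response.json()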

Custom Security Violation

from fastapi import HTTPException, Request
from app.core.security_events import log_auth_violation

# User is the application's user model, imported elsewhere (not shown)
def check_custom_permission(request: Request, user: User):
    if not user.has_custom_permission():
        log_auth_violation(
            request=request,
            violation="custom_permission_denied",
            http_status=403,
            detail="User lacks custom permission",
            extra_field="custom_value"  # Optional extra context
        )
        raise HTTPException(status_code=403, detail="Forbidden")

Testing Your Observability Code

Unit Test Example

import pytest
from app.core.security_events import log_auth_violation
from unittest.mock import MagicMock
from fastapi import Request

def create_mock_request(method="GET", path="/", client_host="127.0.0.1"):
    mock_request = MagicMock(spec=Request)
    mock_request.method = method
    mock_request.url = MagicMock(path=path)
    mock_request.client = MagicMock(host=client_host)
    mock_request.headers = {"user-agent": "test"}
    return mock_request

def test_violation_logging(capsys):
    """Test that violations are logged with correct format."""
    request = create_mock_request(method="POST", path="/api/v1/admin")

    # Execute the function that logs
    log_auth_violation(
        request=request,
        violation="missing_permission",
        http_status=403,
        detail="Admin access required"
    )

    # Capture stdout (structlog writes here)
    captured = capsys.readouterr()

    # Validate structured log output
    assert "auth_violation" in captured.out
    assert "missing_permission" in captured.out
    assert "POST" in captured.out
    assert "/api/v1/admin" in captured.out
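
Substring checks like the ones above can pass by accident, for example when a value appears under the wrong key. If the test environment renders logs as JSON lines (an assumption; a development console renderer would not), a small helper lets a test assert on parsed fields instead. `parse_json_logs` is a hypothetical helper, not part of the test suite.

```python
import json


def parse_json_logs(text):
    """Parse captured stdout into a list of log events, skipping non-JSON lines."""
    events = []
    for line in text.splitlines():
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore blank lines and plain-text output
        if isinstance(event, dict):
            events.append(event)
    return events
```

A test could then call `parse_json_logs(capsys.readouterr().out)` and compare individual keys of the resulting dictionaries.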

Troubleshooting

Issue: Logs not appearing

Check:
1. Is setup_logging() called in app_factory.py?
2. Is LOG_LEVEL set correctly?
3. Are you using get_logger(__name__) from app.core.logging?

Issue: Correlation ID missing

Check:
1. Is CorrelationIdMiddleware registered?
2. Are you in a request context? (Correlation ID only exists during requests)
3. Check logs for correlation_id=None (indicates context leak)

Issue: Team/User ID not in logs

Check:
1. Is AuthContextMiddleware registered?
2. Is the endpoint authenticated? (No JWT = no team/user context)
3. Is the JWT valid, and does it include the sub and team_id claims?

Issue: External API secrets visible in logs

Check:
1. Are you using create_external_async_client()?
2. Is ENVIRONMENT=production? (Body logging auto-disabled)
3. Are sensitive headers being passed with standard names? (Authorization, Cookie, X-API-Key)

Issue: Tests failing with "pytest: command not found"

Solution:

cd backend
python3 -m venv .venv  # If not exists
source .venv/bin/activate
pip install -e ".[dev]"
pytest tests/unit/ -v

Issue: Python 3.10 Compatibility (Fixed Dec 17, 2025)

Problem: Tests failed with AttributeError: type object 'datetime.datetime' has no attribute 'UTC'
Root Cause: The code used datetime.UTC (Python 3.11+), but the project supports Python 3.10+
Solution: Replaced with timezone.utc throughout Phase 7/8 code
Files Fixed: security_events.py, test_security_events.py, test_auth_token_anomalies.py
Status: ✅ 64/71 core tests now pass; the 7 remaining failures are environment-dependent
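
For reference, the portable spelling looks like this. `timezone.utc` has been available throughout Python 3, while the `datetime.UTC` alias was only added in 3.11; the fallback import is an optional pattern shown for illustration, not something the codebase is known to use.

```python
from datetime import datetime, timezone

# Works on Python 3.10 and later: an aware timestamp in UTC.
now = datetime.now(timezone.utc)

# Optional compatibility shim if code elsewhere wants the shorter name.
try:
    from datetime import UTC  # Python 3.11+
except ImportError:
    UTC = timezone.utc
```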

Production Deployment Checklist

ENVIRONMENT=production is set
LOG_TO_FILE=true with writable LOG_DIR
LOG_LEVEL=INFO (not DEBUG)
ENABLE_SENTRY=true with valid SENTRY_DSN
EXTERNAL_HTTP_ENABLE_BODY_LOGGING=false (redundant in production, where body logging is auto-disabled, but set explicitly as a safeguard)
Log rotation configured (see LOG_FILE_BACKUP_COUNT)
Application user has write permissions to log directory
External log shipping configured (e.g., Filebeat, Fluentd)

Support

For questions about the observability system:
1. Check this document first
2. Review phase-specific docs in docs/notes/
3. Run relevant tests to see working examples
4. Check CHANGELOG.md for recent changes

Version: 1.0.0
Maintainer: Backend Team
Last Review: 2025-12-16