Logging & Debugging System — MBPanel API (Current Implementation + Remaining Tasks)

This document is the authoritative observability task list, revised to match the current repository structure and implementation.

Scope (what observability means here)

  • Structured logging: structlog-backed logs with request correlation.
  • Distributed tracing: OpenTelemetry support, but optional (app must not crash if OTel deps are absent).
  • Error visibility: a centralized error logging path for unhandled exceptions.
  • Multi-tenant context: logs must reliably include team_id (for isolation/debugging).
  • Security: PII-safe logging defaults (no secrets in logs).
  • Messaging visibility: clear runtime visibility into SSE + RabbitMQ event flow (connection, retries, consumers, drops, clients).

Current implementation (verified in code)

  • Structlog setup: backend/app/core/logging.py (setup_logging(), get_logger()).
  • Startup wiring: backend/app/core/app_factory.py calls setup_logging() in lifespan().
  • Correlation id
  • Middleware: backend/app/core/middleware.py CorrelationIdMiddleware
  • Storage: backend/app/core/request_context.py request_id_ctx: ContextVar[str|None]
  • Log enrichment: backend/app/core/logging.py processor add_correlation_id
  • OpenTelemetry (optional)
  • backend/app/core/telemetry.py provides configure_opentelemetry(app) with no-op fallbacks when packages are missing.
  • backend/app/core/app_factory.py calls it when settings.ENABLE_OPENTELEMETRY is true.
  • SSE + RabbitMQ
  • SSE endpoint: backend/app/api/v1/events.py
  • RabbitMQ→SSE broker: backend/app/infrastructure/messaging/sse.py
  • Smoke scripts: backend/tests/smoke/sse/ (real stack verification)

Non-goals / removals (important)

  • We do not use asgi-correlation-id in this repository.
  • We do not have backend/app/database/session.py (DB lives under backend/app/infrastructure/database/).

Dependency policy (do not paste pip install commands into this doc)

  • Backend dependencies are pinned in:
  • backend/pyproject.toml
  • backend/constraints.txt

Any new observability dependency must be:

  • added to both files
  • justified with official documentation


Phase 0 (DONE): Structured logging foundation

Task 0.1 — Structlog base configuration (DONE)

  • Files
  • backend/app/core/logging.py
  • Acceptance criteria
  • Logs are emitted through structlog
  • Correlation id can be injected via a processor (see Task 0.2)
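
A minimal sketch of the intended shape; the real setup_logging()/get_logger() in backend/app/core/logging.py may use a different processor chain and renderer:

import logging

import structlog


def setup_logging() -> None:
    """Route all application logs through structlog with JSON output."""
    logging.basicConfig(level=logging.INFO)
    structlog.configure(
        processors=[
            structlog.processors.TimeStamper(fmt="iso"),  # ISO-8601 timestamps
            structlog.processors.add_log_level,           # adds a "level" key
            structlog.processors.JSONRenderer(),          # one JSON object per line
        ],
        wrapper_class=structlog.make_filtering_bound_logger(logging.INFO),
    )


def get_logger(name: str):
    """Thin wrapper so call sites do not depend on structlog directly."""
    return structlog.get_logger(name)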

Task 0.2 — Correlation id middleware + contextvar (DONE)

  • Files
  • backend/app/core/middleware.py (CorrelationIdMiddleware)
  • backend/app/core/request_context.py (request_id_ctx)
  • Acceptance criteria
  • Response includes X-Request-ID
  • Logs include correlation_id when a request is in scope
  • Correlation id context is always cleared on exceptions (no cross-request leakage)
  • Incoming X-Request-ID is sanitized/capped (no control chars / no unbounded values)
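
A sketch of the middleware + contextvar pairing, assuming Starlette's BaseHTTPMiddleware; the in-repo CorrelationIdMiddleware additionally sanitizes and caps incoming IDs:

import uuid
from contextvars import ContextVar

from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request

request_id_ctx: ContextVar[str | None] = ContextVar("request_id", default=None)


class CorrelationIdMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        request_id = request.headers.get("X-Request-ID") or str(uuid.uuid4())
        token = request_id_ctx.set(request_id)
        try:
            response = await call_next(request)
            response.headers["X-Request-ID"] = request_id
            return response
        finally:
            request_id_ctx.reset(token)  # always cleared, even on exceptions


def add_correlation_id(logger, method_name, event_dict):
    """structlog processor: inject correlation_id when a request is in scope."""
    request_id = request_id_ctx.get()
    if request_id is not None:
        event_dict["correlation_id"] = request_id
    return event_dict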

Self-check (local python import sanity)

cd backend
python3 -c "from app.core.request_context import request_id_ctx; from app.core.logging import setup_logging, get_logger; setup_logging(); request_id_ctx.set('test'); get_logger(__name__).info('log-test')"

Phase 1 (DONE): Central request lifecycle logging + global exception capture

Why this phase exists

Request logging + exception capture shipped in Dec 2025; this section now documents the implemented behavior so regressions are easy to spot.

Task 1.1 — Request/response logging middleware (DONE)

  • Files
  • backend/app/core/middleware.py (RequestLoggingMiddleware)
  • backend/app/core/app_factory.py (ensure middleware order stays correct)
  • Requirements
  • Log at least: method, path, status_code, duration_ms, correlation_id
  • Avoid duplicate logs (exactly one “request completed” log entry per request)
  • Avoid logging cookies/authorization headers/body by default
  • Ensure contextvars are cleared between requests (no leakage)
  • Implementation (verified 2025-12-17)
  • RequestLoggingMiddleware logs request_completed on success and request_failed on exceptions (with exc_info=True), capturing method/path/status/duration.
  • Middleware order (create_app) keeps CorrelationIdMiddleware and AuthContextMiddleware ahead of request logging so processors enrich every entry with contextvars.
  • The dispatcher wraps call_next in try/except and never inspects bodies or headers, keeping logs PII-safe.
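
A sketch of the logging shape, assuming Starlette's BaseHTTPMiddleware; correlation_id/team_id/user_id arrive via the structlog processors, so they are not passed explicitly:

import time

import structlog
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request

logger = structlog.get_logger("app.request")


class RequestLoggingMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        start = time.perf_counter()
        try:
            response = await call_next(request)
        except Exception:
            logger.error(
                "request_failed",
                method=request.method,
                path=request.url.path,
                duration_ms=round((time.perf_counter() - start) * 1000, 2),
                exc_info=True,
            )
            raise
        logger.info(
            "request_completed",  # exactly one completion log per request
            method=request.method,
            path=request.url.path,
            status_code=response.status_code,
            duration_ms=round((time.perf_counter() - start) * 1000, 2),
        )
        return response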

Task 1.2 — Centralized error logging path (DONE)

Implement one of:

  • Global exception handler(s) via FastAPI @app.exception_handler(...), or
  • An outer middleware that logs exceptions (exc_info=True) and re-raises

Goal: any unhandled exception produces one structured error log with request context (and later Sentry).

  • Implementation (verified 2025-12-17)
  • _register_exception_handlers registers FastAPI handlers for HTTPException, RequestValidationError, and a catch-all Exception.
  • The catch-all handler logs unhandled_exception once per failure with exc_info=True, then returns a 500 JSON payload.
  • Only the centralized handler logs stack traces, preventing duplicate logging downstream.

Task 1.3 — Validation/HTTP exception logging contract (DONE)

FastAPI will raise/return structured errors for:

  • HTTPException (expected application errors)
  • request validation errors (422)

We must decide the policy and implement it consistently:

  • Do not log expected HTTPException at ERROR by default (to avoid noisy logs).
  • Do log 5xx exceptions with stack traces (once), including correlation id + route metadata.
  • Ensure 422/400 are either logged at DEBUG/INFO or sampled (avoid log spam).
  • Implementation (verified 2025-12-17)
  • 4xx HTTPException responses log at DEBUG via http_exception_4xx, meeting the “expected error” policy.
  • 5xx HTTPException responses log at ERROR (http_exception_5xx) with stack traces.
  • Validation errors emit INFO-level validation_error logs with structured errors payloads; correlation/user/team context bindings ensure multi-tenant traceability.
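
A hedged sketch of handler registration matching this policy; the event names mirror those cited above, but the real handlers carry more context:

import structlog
from fastapi import FastAPI, HTTPException, Request
from fastapi.encoders import jsonable_encoder
from fastapi.exceptions import RequestValidationError
from fastapi.responses import JSONResponse

logger = structlog.get_logger("app.errors")


def _register_exception_handlers(app: FastAPI) -> None:
    @app.exception_handler(HTTPException)
    async def handle_http_exception(request: Request, exc: HTTPException):
        if exc.status_code >= 500:
            logger.error("http_exception_5xx", status_code=exc.status_code,
                         path=request.url.path, exc_info=True)
        else:
            logger.debug("http_exception_4xx", status_code=exc.status_code,
                         path=request.url.path)
        return JSONResponse(status_code=exc.status_code, content={"detail": exc.detail})

    @app.exception_handler(RequestValidationError)
    async def handle_validation_error(request: Request, exc: RequestValidationError):
        logger.info("validation_error", path=request.url.path, errors=exc.errors())
        return JSONResponse(status_code=422, content={"detail": jsonable_encoder(exc.errors())})

    @app.exception_handler(Exception)
    async def handle_unhandled(request: Request, exc: Exception):
        # The only place that logs unhandled stack traces (once per failure).
        logger.error("unhandled_exception", path=request.url.path, exc_info=True)
        return JSONResponse(status_code=500, content={"detail": "Internal Server Error"})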


Phase 2 (DONE): Multi-tenant logging context (team_id)

Task 2.1 — Bind team_id into log context (DONE)

  • Files
  • backend/app/core/request_context.py (team_id_ctx, user_id_ctx)
  • backend/app/core/middleware.py (AuthContextMiddleware)
  • backend/app/core/logging.py (add_request_context processor)
  • Implementation
  • Contextvars for team_id and user_id added to request_context.py
  • AuthContextMiddleware binds team_id/user_id when authenticated
  • Processor add_request_context adds team_id/user_id to all logs
  • Only includes team_id/user_id when present (no null noise)
  • Contextvars cleared in finally block (no leakage on exceptions)
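
Illustrative sketch of the contextvar + processor pairing; the in-repo version lives in request_context.py and logging.py:

from contextvars import ContextVar

team_id_ctx: ContextVar[str | None] = ContextVar("team_id", default=None)
user_id_ctx: ContextVar[str | None] = ContextVar("user_id", default=None)


def add_request_context(logger, method_name, event_dict):
    """structlog processor: add team_id/user_id only when present (no null noise)."""
    team_id = team_id_ctx.get()
    user_id = user_id_ctx.get()
    if team_id is not None:
        event_dict["team_id"] = team_id
    if user_id is not None:
        event_dict["user_id"] = user_id
    return event_dict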

Phase 3 (DONE): PII filtering policy

Task 3.1 — PII/sensitive key filtering (DONE)

  • Files
  • backend/app/core/logging.py (filter_pii processor, SENSITIVE_KEYS)
  • backend/app/core/config.py (PII_FILTERING_ENABLED)
  • Implementation
  • Comprehensive SENSITIVE_KEYS set (password, token, secret, cookie, auth, csrf, etc.)
  • filter_pii processor redacts any key matching sensitive patterns
  • Case-insensitive matching with normalization
  • Applied to all log entries in structlog processor chain
  • Can be disabled via PII_FILTERING_ENABLED (dev only)
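
A simplified sketch of the redaction processor; the real SENSITIVE_KEYS set is considerably larger:

SENSITIVE_KEYS = {"password", "token", "secret", "cookie", "authorization", "csrf"}

REDACTED = "[REDACTED]"


def filter_pii(logger, method_name, event_dict):
    """structlog processor: redact values whose key matches a sensitive pattern."""
    for key in list(event_dict):
        normalized = key.lower()
        if any(sensitive in normalized for sensitive in SENSITIVE_KEYS):
            event_dict[key] = REDACTED
    return event_dict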

Task 3.2 — "Safe fields" contract for logs (DONE)

  • Files
  • backend/app/core/logging.py (SAFE_FIELDS)
  • Implementation
  • Defined SAFE_FIELDS allowlist: correlation_id, team_id, user_id, method, path, status_code, duration_ms, etc.
  • Production mode: only SAFE_FIELDS allowed (strict allowlist)
  • Development mode: permissive but still redacts SENSITIVE_KEYS
  • Unknown fields in production: logged as [FILTERED]

Phase 4 (DONE): OpenTelemetry tracing (optional)

Task 4.1 — Document how to enable tracing locally (DONE)

  • Files
  • docs/development/ENVIRONMENT-SETUP.md (new "OpenTelemetry Configuration" section)
  • Documentation added
  • Local development setup with Docker collector
  • Backend configuration (env vars)
  • Sampling rates for dev vs production
  • Verification commands (make test-tracing)
  • Supported backends (Jaeger, Tempo, Honeycomb, DataDog, New Relic)
  • How to disable (ENABLE_OPENTELEMETRY=false)

Phase 5 (DONE): Error monitoring (Sentry)

Task 5.1 — Sentry integration (DONE)

  • Files
  • backend/app/core/sentry.py (new file)
  • backend/app/core/app_factory.py (lifespan calls configure_sentry)
  • backend/app/core/config.py (SENTRY_DSN and sampling settings)
  • Implementation
  • Optional integration (no crash if sentry-sdk not installed or DSN missing)
  • PII scrubbing via before_send hook (_scrub_sensitive_data)
  • Scrubs: cookies, authorization headers, passwords, tokens, secrets, keys
  • FastAPI, SQLAlchemy, Redis, and Logging integrations
  • Environment and release tagging
  • Helper functions: capture_exception, capture_message (safe to call anytime)
  • Comprehensive error handling (never crashes the app)
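
A minimal sketch of the optional wiring, using only the documented sentry_sdk.init/before_send surface; the real configure_sentry() adds the integrations and scrubbing rules listed above:

def _scrub_sensitive_data(event, hint):
    """Drop obviously sensitive request data before the event leaves the app."""
    request = event.get("request", {})
    request.pop("cookies", None)
    headers = request.get("headers", {})
    for header in ("Authorization", "Cookie"):
        headers.pop(header, None)
    return event


def configure_sentry(dsn: str | None, environment: str) -> None:
    if not dsn:
        return  # no DSN configured: run without Sentry
    try:
        import sentry_sdk
    except ImportError:
        return  # sentry-sdk not installed: stay a no-op
    sentry_sdk.init(
        dsn=dsn,
        environment=environment,
        before_send=_scrub_sensitive_data,
    )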

Benchmarks / performance

  • Tracing benchmark script:
  • backend/tests/benchmarks/tracing_performance_benchmark.py

Repo alignment reminders

  • Architecture layout is enforced by: make validate (runs backend/scripts/validate_architecture.py)
  • Smoke scripts live under: backend/tests/smoke/ (not in scripts/)

Verification tasks

Test V1 — Correlation ID behavior (DONE)

  • File: backend/tests/integration/api/test_correlation_id.py
  • Coverage
  • test_correlation_id_echo_when_provided
  • test_correlation_id_generated_when_missing
  • test_correlation_id_rejects_oversized_values
  • test_correlation_id_rejects_control_characters
  • test_correlation_id_rejects_empty_values
  • test_correlation_id_preserved_across_errors
  • Evidence: async HTTPX tests (per FAST-TESTING guide) exercise success + error paths and validate header behavior + UUID format.

Test V2 — Context leakage guard (DONE)

  • File: backend/tests/integration/api/test_context_leakage.py
  • Coverage
  • test_correlation_id_no_leakage_on_success
  • test_correlation_id_no_leakage_after_error
  • test_context_vars_cleared_after_request
  • test_multiple_sequential_requests_isolated
  • test_concurrent_requests_isolated
  • test_exception_clears_context
  • Evidence: helper router /__tests__/observability/* snapshots contextvars while middleware is patched to emulate authenticated users, proving request-scoped ContextVars reset correctly even on concurrent + failure paths.

Phase 6 (DONE): Messaging observability — SSE + RabbitMQ (production reality)

Why this phase exists

SSE/RabbitMQ issues are notoriously “silent” in production (dropped connections, stalled consumers, backpressure, no active subscribers). We need explicit logging + metrics so we can answer:

  • Is the SSE client connected?
  • Is RabbitMQ connected and consuming?
  • Are messages being routed and fanned out?
  • Are we dropping messages due to backpressure?
  • Which tenant/team scope is affected?

Status summary (2025-12-17): Tasks 6.1–6.6 are live (manual OTEL spans now wrap broker + SSE lifecycle); smoke coverage remains in backend/tests/smoke/sse/.

Task 6.1 — SSE endpoint lifecycle logs (DONE)

  • Files
  • backend/app/api/v1/events.py
  • backend/app/infrastructure/messaging/sse.py
  • Requirements
  • Log “SSE connected” and “SSE disconnected” with: correlation_id, team_id, job_id (if provided)
  • Do not log any payload body (PII/security)
  • Implementation (verified 2025-12-17)
  • event_stream logs SSE stream started, SSE client disconnected, and SSE stream ended with team/job IDs and client metadata.
  • broker.subscribe() / broker.unsubscribe() log active subscriber counts and queue sizes; processors add correlation/user/team IDs automatically.

Task 6.2 — Broker connection/retry visibility (DONE)

  • Files
  • backend/app/infrastructure/messaging/sse.py
  • Requirements
  • Log broker start/stop, connect success, retry attempts, and last exception
  • Include exchange/queue/binding_key in logs (already partially present on crash)
  • Implementation (verified 2025-12-17)
  • SSEBroker.start() and .stop() emit structured logs covering exchange/queue/binding_key, subscriber counts, and message totals.
  • _consume_loop logs every connect attempt, successful connection, and crash (with attempt/failure counters and exc_info=True); broker.get_metrics() surfaces last_error.

Task 6.3 — Message flow counters (metrics) (DONE)

  • Files
  • backend/app/infrastructure/messaging/sse.py
  • Requirements
  • Track counters/gauges (at minimum):
    • subscribers_active (gauge)
    • messages_consumed_total
    • messages_forwarded_total
    • messages_dropped_total (queue full / stale)
  • These can be logged periodically (and later exported as metrics when we implement metrics properly).
  • Implementation (verified 2025-12-17)
  • SSEMetrics tracks the required counters; _handle_message/_fan_out increment totals, and broker.get_metrics() exposes them for diagnostics/health endpoints.
  • Active subscriber counts are logged whenever clients subscribe/unsubscribe.
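
Illustrative shape of the counter surface; the real SSEMetrics may track more (e.g. last_error used by get_metrics()):

from dataclasses import dataclass


@dataclass
class SSEMetrics:
    subscribers_active: int = 0          # gauge
    messages_consumed_total: int = 0
    messages_forwarded_total: int = 0
    messages_dropped_total: int = 0

    def as_dict(self) -> dict[str, int]:
        """Snapshot for diagnostics/health endpoints."""
        return {
            "subscribers_active": self.subscribers_active,
            "messages_consumed_total": self.messages_consumed_total,
            "messages_forwarded_total": self.messages_forwarded_total,
            "messages_dropped_total": self.messages_dropped_total,
        }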

Task 6.4 — Backpressure strategy must be observable (DONE)

  • Files
  • backend/app/infrastructure/messaging/sse.py
  • Requirements
  • If we drop messages when subscriber queues fill, we must log a structured warning with:
    • team_id/job_id (if available from routing key)
    • queue_size/max_size
    • drop_reason
  • Implementation (verified 2025-12-17)
  • _fan_out handles asyncio.QueueFull by dropping the oldest message, logs SSE subscriber queue full with routing_key + queue size, and increments messages_dropped_total.
  • A second warning (Messages dropped due to backpressure) records routing key, delivered vs dropped counts, and subscriber totals whenever drops occur.
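
A sketch of observable drop-oldest handling per subscriber queue; names are illustrative, not the exact in-repo helpers:

import asyncio

import structlog

logger = structlog.get_logger("app.sse")


def enqueue_or_drop(queue: asyncio.Queue, message: dict, routing_key: str, metrics) -> bool:
    """Try to enqueue; on a full queue, drop the oldest message and log why."""
    try:
        queue.put_nowait(message)
        return True
    except asyncio.QueueFull:
        try:
            queue.get_nowait()  # drop the oldest message to make room
        except asyncio.QueueEmpty:
            pass
        queue.put_nowait(message)
        metrics.messages_dropped_total += 1
        logger.warning(
            "SSE subscriber queue full",
            routing_key=routing_key,
            queue_size=queue.qsize(),
            max_size=queue.maxsize,
            drop_reason="queue_full",
        )
        return False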

Task 6.5 — Tracing hooks for event flow (optional) (DONE)

  • Files
  • backend/app/infrastructure/messaging/sse.py
  • backend/app/api/v1/events.py
  • Requirements
  • If OpenTelemetry is enabled, add manual spans around:
    • consume loop connect/setup
    • message handling
    • SSE write loop
  • This must not crash when OpenTelemetry deps are not installed.
  • Implementation (verified 2025-12-17)
  • _telemetry_span() helper wraps RabbitMQ connect/setup, message handling, and SSE send loops with manual spans using get_tracer; attributes are prefixed mbpanel.*.
  • Spans collapse to no-ops when OpenTelemetry is disabled or not installed (matches optional requirement).
  • Tests: backend/tests/unit/infrastructure/messaging/test_sse_telemetry.py ensures helper sets attributes without relying on OTEL packages.
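
A sketch of a span helper that degrades to a no-op when OpenTelemetry is missing or disabled; the in-repo _telemetry_span() may differ in detail:

from contextlib import contextmanager


@contextmanager
def _telemetry_span(name: str, **attributes):
    try:
        from opentelemetry import trace
    except ImportError:
        yield None  # OTel not installed: behave as a no-op context manager
        return
    tracer = trace.get_tracer("mbpanel.sse")
    with tracer.start_as_current_span(name) as span:
        for key, value in attributes.items():
            span.set_attribute(f"mbpanel.{key}", value)
        yield span

Call sites can then wrap individual steps, e.g. with _telemetry_span("sse.handle_message", routing_key=routing_key): ... without caring whether tracing is active.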

Task 6.6 — Repeatable verification (smoke scripts) (DONE, keep current)

  • Folder
  • backend/tests/smoke/sse/
  • Goal
  • Publish→consume roundtrip and authenticated SSE stream verification must remain repeatable.

Phase 7 (DONE): Auth/Security observability (violations, abuse, tenant boundaries)

Why this phase exists

Auth failures/violations are security signals. During development we must be able to answer:

  • Are we seeing brute-force/login abuse?
  • Are CSRF failures happening (broken client) vs actual attacks?
  • Are cross-tenant access attempts happening?
  • Are permission denials happening because of RBAC bugs?

See: ../../../notes/2025-12-16-phase7-auth-security-observability.md

Task 7.1 — Centralized auth violation logging (DONE - core hook added)

  • Files
  • backend/app/core/security_events.py (central logging helper)
  • backend/app/core/dependencies.py (401/403/CSRF denials)
  • backend/app/core/rate_limit.py (429 login rate limit)
  • Rules
  • Must never log secrets (cookies, tokens, Authorization headers, CSRF values, request bodies)
  • Must include correlation_id (via contextvar), plus method/path/status
  • Prefer stable violation codes (for dashboards/alerts)
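
A hypothetical sketch of such a helper; the function name and fields are illustrative, not the security_events.py API:

import structlog

logger = structlog.get_logger("app.security")


def log_auth_violation(violation_code: str, status_code: int, method: str, path: str, **extra) -> None:
    """Emit one structured WARNING per auth/security violation, never secrets."""
    logger.warning(
        "auth_violation",
        violation_code=violation_code,  # stable code for dashboards/alerts
        status_code=status_code,
        method=method,
        path=path,
        **extra,  # e.g. requested_team_id; correlation_id is added by the processor chain
    )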

Task 7.2 — Tenant-boundary violations (DONE)

  • Files
  • backend/app/api/v1/events.py (cross-tenant SSE access)
  • Other routers enforcing team scoping
  • Implementation
  • Logs cross-tenant attempts: cross_tenant_sse_access violation code
  • Includes team_id (requested/effective), user_id, correlation_id
  • Extra field: requested_team_id for debugging

Task 7.3 — Auth token anomalies (DONE)

  • Files
  • backend/app/core/middleware.py (AuthContextMiddleware)
  • Implementation
  • Token validation failures: DEBUG-level (error_type, error_detail)
  • Blacklisted tokens: INFO-level (with jti prefix only)
  • Missing JTI: DEBUG-level
  • Invalid subject: DEBUG-level
  • Escalates to WARNING via abuse detection (Task 7.4)

Task 7.4 — Abuse signals (baseline) (DONE)

  • Files
  • backend/app/core/security_events.py (_track_abuse_signal_async, _track_abuse_signal_memory)
  • Implementation
  • Redis-backed sliding windows (1h TTL) via async INCR + EXPIRE counters for both host and user scopes; thresholds unchanged (host ≥10, user ≥5) and emit structured warnings (abuse_signal_host, abuse_signal_user).
  • Automatic fallback to in-memory counters when Redis is unavailable or when running outside an event loop (e.g., unit tests), preserving deterministic behavior.
  • Async helper failures are surfaced via abuse_tracking_* warnings; unit coverage in backend/tests/unit/core/test_security_events.py validates Redis path + fallback semantics.
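
A simplified sketch of the Redis-backed counter path; key names are illustrative, and the real helper also handles the in-memory fallback:

import structlog
from redis.asyncio import Redis

logger = structlog.get_logger("app.security")

HOST_THRESHOLD = 10    # matches the documented host threshold
WINDOW_SECONDS = 3600  # 1h TTL


async def track_abuse_signal_host(redis: Redis, host: str) -> None:
    """Count violations per host in a TTL window; warn when the threshold is hit."""
    key = f"abuse:host:{host}"
    count = await redis.incr(key)
    if count == 1:
        await redis.expire(key, WINDOW_SECONDS)  # start the window on first increment
    if count >= HOST_THRESHOLD:
        logger.warning("abuse_signal_host", host=host, count=count)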

Phase 8 (DONE): External API observability (backend/app/infrastructure/external/*)

See: ../../../notes/2025-12-16-phase8-external-api-observability.md

Policy (must be strict)

  • Production
  • Log request/response metadata only (method, safe URL, status, duration, service name).
  • Never log raw request bodies or raw response bodies.
  • Local/Dev
  • Log request/response metadata.
  • Allow response body preview (capped + redacted) for debugging external integrations.
  • Never log secrets (Authorization/Cookie headers, tokens, passwords).

Task 8.1 — Central outbound HTTP logging wrapper (DONE - core wrapper added)

  • Files
  • backend/app/core/external_http.py
  • backend/app/core/config.py (env-gated body logging settings)
  • Mechanism
  • HTTPX event_hooks for request/response logging (official docs: https://www.python-httpx.org/advanced/#event-hooks)
  • Rules
  • ENVIRONMENT == production hard-disables body logging, even if env vars are mis-set.
  • All sensitive headers are redacted.
  • Query params are masked when keys look sensitive (token/key/secret/password/signature).
  • Body preview is capped by EXTERNAL_HTTP_MAX_BODY_BYTES.
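
A reduced sketch of the hook wiring; the real create_external_async_client adds header redaction, query-param masking, and env-gated body previews:

import time

import httpx
import structlog

logger = structlog.get_logger("app.external_http")


async def _log_request(request: httpx.Request) -> None:
    request.extensions["start_time"] = time.perf_counter()


async def _log_response(response: httpx.Response) -> None:
    start = response.request.extensions.get("start_time")
    duration_ms = round((time.perf_counter() - start) * 1000, 2) if start else None
    logger.info(
        "external_http_response",
        method=response.request.method,
        url=str(response.request.url),  # real wrapper masks sensitive query params
        status_code=response.status_code,
        duration_ms=duration_ms,
    )


def create_external_async_client(**kwargs) -> httpx.AsyncClient:
    return httpx.AsyncClient(
        event_hooks={"request": [_log_request], "response": [_log_response]},
        **kwargs,
    )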

Task 8.2 — Migrate external clients to use wrapper (DONE)

  • Files
  • backend/app/infrastructure/external/virtuozzo/client.py
  • backend/app/infrastructure/external/postmark/client.py
  • backend/app/infrastructure/external/geo/ip_api.py
  • Status: All external HTTP clients now use create_external_async_client with automatic observability

Task 8.3 — Production log sink on Virtuozzo Apache Python node (DONE)

Virtuozzo’s log files page does not clearly document an application “stdout/stderr sink” contract for Python apps in a way we can rely on. To avoid ambiguity, MBPanel implements a platform-agnostic managed log sink:

  • Always emit structured logs to stdout (so any platform can capture/ship them).
  • Optionally also write rotating JSONL logs to a writable directory controlled by env vars:
  • LOG_TO_FILE (bool)
  • LOG_DIR (string path)
  • LOG_FILE_NAME (string, default app.jsonl)
  • LOG_FILE_MAX_BYTES (int)
  • LOG_FILE_BACKUP_COUNT (int)

This makes production logging predictable even if the platform’s default log sink is unclear.

  • Implementation (2025-12-17)
  • _configure_file_logging() in backend/app/core/logging.py now sanitizes file names, enforces directory creation with structured warnings, and logs the resolved sink metadata once logging is initialized.
  • _resolve_log_dir() falls back to /var/www/error when LOG_DIR is unset in production to honor Ops defaults.
  • Smoke script backend/tests/smoke/logging/log_sink_probe.py forces file logging via env overrides and asserts the JSONL log contains a fresh marker (aligned with FAST-TESTING guide’s smoke expectations).
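
A simplified sketch of the optional file sink, assuming the stdlib RotatingFileHandler; the real _configure_file_logging() also sanitizes file names and emits structured warnings when the directory cannot be created:

import logging
import os
from logging.handlers import RotatingFileHandler


def configure_file_logging() -> None:
    """Attach an optional rotating JSONL file handler next to stdout logging."""
    if os.environ.get("LOG_TO_FILE", "").lower() not in {"1", "true", "yes"}:
        return  # stdout-only logging (default)
    log_dir = os.environ.get("LOG_DIR", "/var/www/error")  # real code falls back only in production
    os.makedirs(log_dir, exist_ok=True)
    handler = RotatingFileHandler(
        filename=os.path.join(log_dir, os.environ.get("LOG_FILE_NAME", "app.jsonl")),
        maxBytes=int(os.environ.get("LOG_FILE_MAX_BYTES", 10 * 1024 * 1024)),
        backupCount=int(os.environ.get("LOG_FILE_BACKUP_COUNT", 10)),
    )
    # Assumes structlog output is routed through the stdlib root logger as JSON
    # lines, so the handler can write each record verbatim.
    handler.setFormatter(logging.Formatter("%(message)s"))
    logging.getLogger().addHandler(handler)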

Virtuozzo production default (confirmed by ops)

  • Log directory: /var/www/error
  • Recommended env
ENVIRONMENT=production
LOG_TO_FILE=true
LOG_DIR=/var/www/error
LOG_FILE_NAME=mbpanel-api.jsonl
LOG_FILE_MAX_BYTES=10485760
LOG_FILE_BACKUP_COUNT=10

Permissions requirement

The application process user (Apache/Python node user) must have write permission to LOG_DIR.

Verification

  • Integration/unit coverage in backend/tests/unit/core/test_logging_config.py validates path sanitization + warning paths.
  • Smoke coverage via python backend/tests/smoke/logging/log_sink_probe.py --log-dir /var/www/error (run on target host) confirms file sink writes structured JSON entries that include the probe marker.