# Logging & Debugging System — MBPanel API (Current Implementation + Remaining Tasks)
This document is the authoritative observability task list, revised to match the current repository structure and implementation.
## Scope (what observability means here)

- Structured logging: `structlog`-backed logs with request correlation.
- Distributed tracing: OpenTelemetry support, but optional (the app must not crash if OTel deps are absent).
- Error visibility: a centralized error logging path for unhandled exceptions.
- Multi-tenant context: logs must reliably include `team_id` (for isolation/debugging).
- Security: PII-safe logging defaults (no secrets in logs).
- Messaging visibility: clear runtime visibility into the SSE + RabbitMQ event flow (connections, retries, consumers, drops, clients).
## Current implementation (verified in code)

- Structlog setup: `backend/app/core/logging.py` (`setup_logging()`, `get_logger()`).
- Startup wiring: `backend/app/core/app_factory.py` calls `setup_logging()` in `lifespan()`.
- Correlation id
  - Middleware: `backend/app/core/middleware.py` (`CorrelationIdMiddleware`)
  - Storage: `backend/app/core/request_context.py` (`request_id_ctx: ContextVar[str | None]`)
  - Log enrichment: `backend/app/core/logging.py` processor `add_correlation_id`
- OpenTelemetry (optional)
  - `backend/app/core/telemetry.py` provides `configure_opentelemetry(app)` with no-op fallbacks when packages are missing. `backend/app/core/app_factory.py` calls it when `settings.ENABLE_OPENTELEMETRY` is true. (The optional-import pattern is sketched after this list.)
- SSE + RabbitMQ
  - SSE endpoint: `backend/app/api/v1/events.py`
  - RabbitMQ→SSE broker: `backend/app/infrastructure/messaging/sse.py`
  - Smoke scripts: `backend/tests/smoke/sse/` (real-stack verification)
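The optional-import pattern referenced above underpins every "optional" integration in this doc. A minimal sketch of that pattern, assuming the standard OTel SDK packages; names and the console exporter are illustrative, not the exact repository code:

```python
# Sketch of the optional-import pattern: tracing activates only when the
# OpenTelemetry SDK is importable, and silently no-ops otherwise.
from __future__ import annotations

try:
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

    _OTEL_AVAILABLE = True
except ImportError:  # OTel packages absent: the app must keep working
    _OTEL_AVAILABLE = False


def configure_opentelemetry(app) -> bool:
    """Return True if tracing was enabled, False if we fell back to no-ops."""
    if not _OTEL_AVAILABLE:
        return False
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
    # `app` would be instrumented here (e.g. FastAPI auto-instrumentation),
    # guarded by the same try/except pattern for the instrumentation package.
    return True
```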
## Non-goals / removals (important)

- We do not use `asgi-correlation-id` in this repository.
- We do not have `backend/app/database/session.py` (DB code lives under `backend/app/infrastructure/database/`).
## Dependency policy (do not paste pip install commands into this doc)

- Backend dependencies are pinned in:
  - `backend/pyproject.toml`
  - `backend/constraints.txt`
- Any new observability dependency must be:
  - added to both files
  - justified with official documentation
## Phase 0 (DONE): Structured logging foundation

### Task 0.1 — Structlog base configuration (DONE)

- Files
  - `backend/app/core/logging.py`
- Acceptance criteria
  - Logs are emitted through `structlog`
  - Correlation id can be injected via a processor (see Task 0.2)
### Task 0.2 — Correlation id middleware + contextvar (DONE)

- Files
  - `backend/app/core/middleware.py` (`CorrelationIdMiddleware`)
  - `backend/app/core/request_context.py` (`request_id_ctx`)
- Acceptance criteria
  - Response includes `X-Request-ID`
  - Logs include `correlation_id` when a request is in scope
  - Correlation id context is always cleared on exceptions (no cross-request leakage)
  - Incoming `X-Request-ID` is sanitized/capped (no control chars, no unbounded values); see the sketch after this list
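A minimal sketch of a middleware satisfying these acceptance criteria; the repository's actual `CorrelationIdMiddleware` may differ in detail, and the 128-character cap and allowed alphabet here are assumptions:

```python
# Sketch of a correlation-id middleware: sanitize the inbound header,
# store it in a ContextVar, always reset it, echo it on the response.
import re
import uuid
from contextvars import ContextVar

from starlette.middleware.base import BaseHTTPMiddleware

request_id_ctx: ContextVar[str | None] = ContextVar("request_id", default=None)

_SAFE_ID = re.compile(r"^[A-Za-z0-9._-]{1,128}$")  # length cap, no control chars


class CorrelationIdMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        raw = request.headers.get("X-Request-ID", "")
        request_id = raw if _SAFE_ID.match(raw) else str(uuid.uuid4())
        token = request_id_ctx.set(request_id)
        try:
            response = await call_next(request)
        finally:
            request_id_ctx.reset(token)  # cleared even on exceptions
        response.headers["X-Request-ID"] = request_id
        return response
```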
Self-check (local Python import sanity):

```bash
cd backend
python3 -c "from app.core.request_context import request_id_ctx; from app.core.logging import setup_logging, get_logger; setup_logging(); request_id_ctx.set('test'); get_logger(__name__).info('log-test')"
```
## Phase 1 (DONE): Central request lifecycle logging + global exception capture

### Why this phase exists
Request logging + exception capture shipped in Dec 2025; this section now documents the implemented behavior so regressions are easy to spot.
### Task 1.1 — Request/response logging middleware (DONE)

- Files
  - `backend/app/core/middleware.py` (`RequestLoggingMiddleware`)
  - `backend/app/core/app_factory.py` (ensure middleware order stays correct)
- Requirements
  - Log at least: method, path, status_code, duration_ms, correlation_id
  - Avoid duplicate logs (exactly one "request completed" log entry per request)
  - Avoid logging cookies/authorization headers/body by default
  - Ensure contextvars are cleared between requests (no leakage)
- Implementation (verified 2025-12-17)
  - `RequestLoggingMiddleware` logs `request_completed` on success and `request_failed` on exceptions (with `exc_info=True`), capturing method/path/status/duration.
  - Middleware order (`create_app`) keeps `CorrelationIdMiddleware` and `AuthContextMiddleware` ahead of request logging so processors enrich every entry with contextvars.
  - The dispatcher wraps `call_next` in try/except and never inspects bodies or headers, keeping logs PII-safe.
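A minimal sketch of the pattern described above, assuming correlation and tenant IDs are attached by the structlog processors from Tasks 0.2/2.1 rather than passed explicitly:

```python
# Sketch of a request-lifecycle logging middleware: exactly one log entry
# per request, and no bodies or headers are ever inspected.
import time

import structlog
from starlette.middleware.base import BaseHTTPMiddleware

logger = structlog.get_logger(__name__)


class RequestLoggingMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        start = time.perf_counter()
        try:
            response = await call_next(request)
        except Exception:
            logger.error(
                "request_failed",
                method=request.method,
                path=request.url.path,
                duration_ms=round((time.perf_counter() - start) * 1000, 2),
                exc_info=True,
            )
            raise
        logger.info(
            "request_completed",
            method=request.method,
            path=request.url.path,
            status_code=response.status_code,
            duration_ms=round((time.perf_counter() - start) * 1000, 2),
        )
        return response
```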
### Task 1.2 — Centralized error logging path (DONE)

Implement one of:

- Global exception handler(s) via FastAPI `@app.exception_handler(...)`, or
- An outer middleware that logs exceptions (`exc_info=True`) and re-raises

Goal: any unhandled exception produces one structured error log with request context (and, later, Sentry).
- Implementation (verified 2025-12-17)
  - `_register_exception_handlers` registers FastAPI handlers for `HTTPException`, `RequestValidationError`, and a catch-all `Exception`.
  - The catch-all handler logs `unhandled_exception` once per failure with `exc_info=True`, then returns a 500 JSON payload.
  - Only the centralized handler logs stack traces, preventing duplicate logging downstream.
### Task 1.3 — Validation/HTTP exception logging contract (DONE)

FastAPI will raise/return structured errors for:

- `HTTPException` (expected application errors)
- Request validation errors (422)

We must decide the policy and implement it consistently:

- Do not log expected `HTTPException` at ERROR by default (to avoid noisy logs).
- Do log 5xx exceptions with stack traces (once), including correlation id + route metadata.
- Ensure 422/400 are either logged at DEBUG/INFO or sampled (avoid log spam).
- Implementation (verified 2025-12-17)
  - 4xx `HTTPException` responses log at DEBUG via `http_exception_4xx`, meeting the "expected error" policy.
  - 5xx `HTTPException` responses log at ERROR (`http_exception_5xx`) with stack traces.
  - Validation errors emit INFO-level `validation_error` logs with structured `errors` payloads; correlation/user/team context bindings ensure multi-tenant traceability.
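A minimal sketch of the combined Task 1.2/1.3 contract; the repository's `_register_exception_handlers` differs in detail (payload shapes, context bindings), but the leveling policy is the same:

```python
# Sketch of the leveled exception-handler policy: 4xx at DEBUG, 5xx at
# ERROR with one stack trace, validation at INFO, catch-all logged once.
import structlog
from fastapi import FastAPI, Request
from fastapi.exceptions import RequestValidationError
from fastapi.responses import JSONResponse
from starlette.exceptions import HTTPException as StarletteHTTPException

logger = structlog.get_logger(__name__)


def _register_exception_handlers(app: FastAPI) -> None:
    @app.exception_handler(StarletteHTTPException)
    async def http_exception(request: Request, exc: StarletteHTTPException):
        if exc.status_code >= 500:
            logger.error("http_exception_5xx", path=request.url.path,
                         status_code=exc.status_code, exc_info=True)
        else:
            logger.debug("http_exception_4xx", path=request.url.path,
                         status_code=exc.status_code)
        return JSONResponse({"detail": exc.detail}, status_code=exc.status_code)

    @app.exception_handler(RequestValidationError)
    async def validation_error(request: Request, exc: RequestValidationError):
        logger.info("validation_error", path=request.url.path, errors=exc.errors())
        return JSONResponse({"detail": exc.errors()}, status_code=422)

    @app.exception_handler(Exception)
    async def unhandled(request: Request, exc: Exception):
        logger.error("unhandled_exception", path=request.url.path, exc_info=True)
        return JSONResponse({"detail": "Internal Server Error"}, status_code=500)
```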
## Phase 2 (DONE): Multi-tenant logging context (team_id)

### Task 2.1 — Bind team_id into log context (DONE)

- Files
  - `backend/app/core/request_context.py` (`team_id_ctx`, `user_id_ctx`)
  - `backend/app/core/middleware.py` (`AuthContextMiddleware`)
  - `backend/app/core/logging.py` (`add_request_context` processor)
- Implementation
  - Contextvars for team_id and user_id added to `request_context.py`
  - `AuthContextMiddleware` binds team_id/user_id when authenticated
  - Processor `add_request_context` adds team_id/user_id to all logs
  - Only includes team_id/user_id when present (no null noise)
  - Contextvars cleared in a `finally` block (no leakage on exceptions)
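A minimal sketch of the processor contract (structlog processors take `(logger, method_name, event_dict)` and return the event dict):

```python
# Sketch of a structlog processor that injects tenant context from
# ContextVars, skipping unset values to avoid null noise.
from contextvars import ContextVar

team_id_ctx: ContextVar[str | None] = ContextVar("team_id", default=None)
user_id_ctx: ContextVar[str | None] = ContextVar("user_id", default=None)


def add_request_context(logger, method_name, event_dict):
    """Add team_id/user_id to every log entry when a request is in scope."""
    team_id = team_id_ctx.get()
    if team_id is not None:
        event_dict["team_id"] = team_id
    user_id = user_id_ctx.get()
    if user_id is not None:
        event_dict["user_id"] = user_id
    return event_dict
```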
## Phase 3 (DONE): PII filtering policy

### Task 3.1 — PII/sensitive key filtering (DONE)

- Files
  - `backend/app/core/logging.py` (`filter_pii` processor, `SENSITIVE_KEYS`)
  - `backend/app/core/config.py` (`PII_FILTERING_ENABLED`)
- Implementation
  - Comprehensive `SENSITIVE_KEYS` set (password, token, secret, cookie, auth, csrf, etc.)
  - `filter_pii` processor redacts any key matching sensitive patterns
  - Case-insensitive matching with normalization
  - Applied to all log entries in the structlog processor chain
  - Can be disabled via `PII_FILTERING_ENABLED` (dev only)
### Task 3.2 — "Safe fields" contract for logs (DONE)

- Files
  - `backend/app/core/logging.py` (`SAFE_FIELDS`)
- Implementation
  - Defined `SAFE_FIELDS` allowlist: correlation_id, team_id, user_id, method, path, status_code, duration_ms, etc.
  - Production mode: only `SAFE_FIELDS` allowed (strict allowlist)
  - Development mode: permissive, but still redacts `SENSITIVE_KEYS`
  - Unknown fields in production: logged as `[FILTERED]` (see the combined sketch below)
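A minimal sketch combining the Task 3.1 denylist with the Task 3.2 allowlist in one structlog processor; both key sets are heavily abbreviated compared with the repository's real ones:

```python
# Sketch of denylist redaction plus a production-only allowlist.
SENSITIVE_KEYS = {"password", "token", "secret", "cookie", "auth", "csrf"}
SAFE_FIELDS = {"event", "level", "timestamp", "correlation_id", "team_id",
               "user_id", "method", "path", "status_code", "duration_ms"}


def filter_pii(logger, method_name, event_dict, *, production: bool = False):
    for key in list(event_dict):
        normalized = key.lower()
        if any(marker in normalized for marker in SENSITIVE_KEYS):
            event_dict[key] = "[REDACTED]"   # denylist: always applied
        elif production and key not in SAFE_FIELDS:
            event_dict[key] = "[FILTERED]"   # allowlist: production only
    return event_dict
```

In the real processor chain, a partial such as `functools.partial(filter_pii, production=True)` would pin the mode at configuration time, since structlog calls processors with exactly three positional arguments.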
## Phase 4 (DONE): OpenTelemetry tracing (optional)

### Task 4.1 — Document how to enable tracing locally (DONE)

- Files
  - `docs/development/ENVIRONMENT-SETUP.md` (new "OpenTelemetry Configuration" section)
- Documentation added
  - Local development setup with a Docker collector
  - Backend configuration (env vars)
  - Sampling rates for dev vs production
  - Verification commands (`make test-tracing`)
  - Supported backends (Jaeger, Tempo, Honeycomb, DataDog, New Relic)
  - How to disable (`ENABLE_OPENTELEMETRY=false`)
## Phase 5 (DONE): Error monitoring (Sentry)

### Task 5.1 — Sentry integration (DONE)

- Files
  - `backend/app/core/sentry.py` (new file)
  - `backend/app/core/app_factory.py` (lifespan calls `configure_sentry`)
  - `backend/app/core/config.py` (`SENTRY_DSN` and sampling settings)
- Implementation
  - Optional integration (no crash if sentry-sdk is not installed or the DSN is missing)
  - PII scrubbing via a `before_send` hook (`_scrub_sensitive_data`)
  - Scrubs: cookies, authorization headers, passwords, tokens, secrets, keys
  - FastAPI, SQLAlchemy, Redis, and Logging integrations
  - Environment and release tagging
  - Helper functions: `capture_exception`, `capture_message` (safe to call anytime)
  - Comprehensive error handling (never crashes the app)
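A minimal sketch of the optional wiring and `before_send` scrubbing; the real `_scrub_sensitive_data` covers far more locations than request headers, and the sample rate here is illustrative:

```python
# Sketch of optional Sentry wiring: missing DSN or missing SDK means no-op.
def _scrub_sensitive_data(event, hint):
    """before_send hook: redact obvious secrets before the event leaves."""
    request = event.get("request") or {}
    headers = request.get("headers") or {}
    for header in ("Authorization", "Cookie"):
        if header in headers:
            headers[header] = "[REDACTED]"
    return event


def configure_sentry(dsn: str | None) -> bool:
    if not dsn:
        return False  # DSN missing: stay silent, never crash
    try:
        import sentry_sdk
    except ImportError:
        return False  # sentry-sdk not installed: no-op
    sentry_sdk.init(dsn=dsn, before_send=_scrub_sensitive_data,
                    traces_sample_rate=0.1)
    return True
```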
## Benchmarks / performance

- Tracing benchmark script: `backend/tests/benchmarks/tracing_performance_benchmark.py`
## Repo alignment reminders

- Architecture layout is enforced by `make validate` (runs `backend/scripts/validate_architecture.py`).
- Smoke scripts live under `backend/tests/smoke/` (not in `scripts/`).
## Verification tasks

### Test V1 — Correlation ID behavior (DONE)

- File: `backend/tests/integration/api/test_correlation_id.py`
- Coverage
  - `test_correlation_id_echo_when_provided`
  - `test_correlation_id_generated_when_missing`
  - `test_correlation_id_rejects_oversized_values`
  - `test_correlation_id_rejects_control_characters`
  - `test_correlation_id_rejects_empty_values`
  - `test_correlation_id_preserved_across_errors`
- Evidence: async HTTPX tests (per the FAST-TESTING guide) exercise success + error paths and validate header behavior + UUID format.
### Test V2 — Context leakage guard (DONE)

- File: `backend/tests/integration/api/test_context_leakage.py`
- Coverage
  - `test_correlation_id_no_leakage_on_success`
  - `test_correlation_id_no_leakage_after_error`
  - `test_context_vars_cleared_after_request`
  - `test_multiple_sequential_requests_isolated`
  - `test_concurrent_requests_isolated`
  - `test_exception_clears_context`
- Evidence: a helper router under `/__tests__/observability/*` snapshots contextvars while middleware is patched to emulate authenticated users, proving request-scoped ContextVars reset correctly even on concurrent + failure paths.
## Phase 6 (DONE): Messaging observability — SSE + RabbitMQ (production reality)

### Why this phase exists

SSE/RabbitMQ issues are notoriously "silent" in production (dropped connections, stalled consumers, backpressure, no active subscribers). We need explicit logging + metrics so we can answer:

- Is the SSE client connected?
- Is RabbitMQ connected and consuming?
- Are messages being routed and fanned out?
- Are we dropping messages due to backpressure?
- Which tenant/team scope is affected?

Status summary (2025-12-17): Tasks 6.1–6.6 are live (manual OTel spans now wrap the broker + SSE lifecycle); smoke coverage remains in `backend/tests/smoke/sse/`.
### Task 6.1 — SSE endpoint lifecycle logs (DONE)

- Files
  - `backend/app/api/v1/events.py`
  - `backend/app/infrastructure/messaging/sse.py`
- Requirements
  - Log "SSE connected" and "SSE disconnected" with: correlation_id, team_id, job_id (if provided)
  - Do not log any payload body (PII/security)
- Implementation (verified 2025-12-17)
  - `event_stream` logs `SSE stream started`, `SSE client disconnected`, and `SSE stream ended` with team/job IDs and client metadata.
  - `broker.subscribe()` / `broker.unsubscribe()` log active subscriber counts and queue sizes; processors add correlation/user/team IDs automatically.
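A minimal sketch of lifecycle logging inside an SSE generator; the event names mirror the doc, but the disconnect signal (here `asyncio.CancelledError`) depends on how the framework cancels the stream, so treat this as an assumption:

```python
# Sketch of SSE lifecycle logging: connect/disconnect logged with IDs
# only; the event payload is forwarded but never logged.
import asyncio

import structlog

logger = structlog.get_logger(__name__)


async def event_stream(queue: asyncio.Queue, team_id: str, job_id: str | None):
    logger.info("SSE stream started", team_id=team_id, job_id=job_id)
    try:
        while True:
            message = await queue.get()
            yield f"data: {message}\n\n"  # payload forwarded, never logged
    except asyncio.CancelledError:  # client went away mid-stream
        logger.info("SSE client disconnected", team_id=team_id, job_id=job_id)
        raise
    finally:
        logger.info("SSE stream ended", team_id=team_id, job_id=job_id)
```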
### Task 6.2 — Broker connection/retry visibility (DONE)

- Files
  - `backend/app/infrastructure/messaging/sse.py`
- Requirements
  - Log broker start/stop, connect success, retry attempts, and the last exception
  - Include exchange/queue/binding_key in logs (already partially present on crash)
- Implementation (verified 2025-12-17)
  - `SSEBroker.start()` and `.stop()` emit structured logs covering exchange/queue/binding_key, subscriber counts, and message totals.
  - `_consume_loop` logs every connect attempt, successful connection, and crash (with attempt/failure counters and `exc_info=True`); `broker.get_metrics()` surfaces `last_error`.
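A minimal sketch of an observable consume loop; `connect`/`consume` are placeholders for the real AMQP client calls, and the event names are illustrative rather than the repository's exact ones:

```python
# Sketch of a consume loop where every attempt, success, and crash is
# logged with counters, and cancellation shuts down cleanly.
import asyncio

import structlog

logger = structlog.get_logger(__name__)


async def _consume_loop(connect, consume, *, retry_delay: float = 5.0):
    attempt = 0
    while True:
        attempt += 1
        logger.info("broker_connect_attempt", attempt=attempt)
        try:
            connection = await connect()
            logger.info("broker_connected", attempt=attempt)
            await consume(connection)
        except asyncio.CancelledError:
            logger.info("broker_consume_loop_stopped")
            raise
        except Exception:
            logger.error("broker_consume_loop_crashed",
                         attempt=attempt, exc_info=True)
            await asyncio.sleep(retry_delay)  # back off before retrying
```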
### Task 6.3 — Message flow counters (metrics) (DONE)

- Files
  - `backend/app/infrastructure/messaging/sse.py`
- Requirements
  - Track counters/gauges (at minimum):
    - subscribers_active (gauge)
    - messages_consumed_total
    - messages_forwarded_total
    - messages_dropped_total (queue full / stale)
  - These can be logged periodically (and later exported as metrics when we implement metrics properly).
- Implementation (verified 2025-12-17)
  - `SSEMetrics` tracks the required counters; `_handle_message` / `_fan_out` increment totals, and `broker.get_metrics()` exposes them for diagnostics/health endpoints.
  - Active subscriber counts are logged whenever clients subscribe/unsubscribe.
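A minimal sketch of the counter set; field names follow the requirements above, while the `snapshot()` helper stands in for whatever `broker.get_metrics()` actually returns:

```python
# Sketch of the Task 6.3 counter set as a plain dataclass.
from dataclasses import dataclass


@dataclass
class SSEMetrics:
    subscribers_active: int = 0  # gauge
    messages_consumed_total: int = 0
    messages_forwarded_total: int = 0
    messages_dropped_total: int = 0
    last_error: str | None = None

    def snapshot(self) -> dict:
        """Expose counters for diagnostics/health endpoints."""
        return {
            "subscribers_active": self.subscribers_active,
            "messages_consumed_total": self.messages_consumed_total,
            "messages_forwarded_total": self.messages_forwarded_total,
            "messages_dropped_total": self.messages_dropped_total,
            "last_error": self.last_error,
        }
```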
### Task 6.4 — Backpressure strategy must be observable (DONE)

- Files
  - `backend/app/infrastructure/messaging/sse.py`
- Requirements
  - If we drop messages when subscriber queues fill, we must log a structured warning with:
    - team_id/job_id (if available from the routing key)
    - queue_size/max_size
    - drop_reason
- Implementation (verified 2025-12-17)
  - `_fan_out` handles `asyncio.QueueFull` by dropping the oldest message, logs `SSE subscriber queue full` with routing_key + queue size, and increments `messages_dropped_total`.
  - A second warning (`Messages dropped due to backpressure`) records the routing key, delivered vs dropped counts, and subscriber totals whenever drops occur.
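A minimal sketch of the drop-oldest strategy with the required structured warning; `_offer` is a hypothetical per-subscriber helper, not the repository's `_fan_out`:

```python
# Sketch of observable drop-oldest backpressure handling per subscriber.
import asyncio

import structlog

logger = structlog.get_logger(__name__)


def _offer(queue: asyncio.Queue, message, routing_key: str, metrics) -> bool:
    """Enqueue without blocking; drop the oldest message if the queue is full."""
    try:
        queue.put_nowait(message)
        return True
    except asyncio.QueueFull:
        try:
            queue.get_nowait()  # drop the oldest entry to make room
        except asyncio.QueueEmpty:
            pass
        queue.put_nowait(message)
        metrics.messages_dropped_total += 1
        logger.warning("SSE subscriber queue full",
                       routing_key=routing_key,
                       queue_size=queue.qsize(),
                       max_size=queue.maxsize,
                       drop_reason="queue_full")
        return False
```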
### Task 6.5 — Tracing hooks for event flow (optional) (DONE)

- Files
  - `backend/app/infrastructure/messaging/sse.py`
  - `backend/app/api/v1/events.py`
- Requirements
  - If OpenTelemetry is enabled, add manual spans around:
    - consume loop connect/setup
    - message handling
    - SSE write loop
  - This must not crash when OpenTelemetry deps are not installed.
- Implementation (verified 2025-12-17)
  - A `_telemetry_span()` helper wraps RabbitMQ connect/setup, message handling, and SSE send loops with manual spans using `get_tracer`; attributes are prefixed `mbpanel.*`.
  - Spans collapse to no-ops when OpenTelemetry is disabled or not installed (matches the optional requirement).
- Tests: `backend/tests/unit/infrastructure/messaging/test_sse_telemetry.py` ensures the helper sets attributes without relying on OTel packages.
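A minimal sketch of such a helper, collapsing to a no-op when the OTel import fails; the repository's `_telemetry_span()` may differ in detail:

```python
# Sketch of a span helper that degrades to a no-op without OpenTelemetry.
from contextlib import contextmanager


@contextmanager
def _telemetry_span(name: str, **attributes):
    try:
        from opentelemetry import trace
    except ImportError:
        yield None  # OTel not installed: callers proceed unchanged
        return
    tracer = trace.get_tracer("mbpanel")
    with tracer.start_as_current_span(name) as span:
        for key, value in attributes.items():
            span.set_attribute(f"mbpanel.{key}", value)
        yield span


# Usage: with _telemetry_span("sse.fan_out", routing_key=key): ...
```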
### Task 6.6 — Repeatable verification (smoke scripts) (DONE, keep current)

- Folder: `backend/tests/smoke/sse/`
- Goal: the publish→consume roundtrip and authenticated SSE stream verification must remain repeatable.
## Phase 7 (DONE): Auth/Security observability (violations, abuse, tenant boundaries)

### Why this phase exists

Auth failures/violations are security signals. During development we must be able to answer:

- Are we seeing brute-force/login abuse?
- Are CSRF failures happening (broken client) vs actual attacks?
- Are cross-tenant access attempts happening?
- Are permission denials happening because of RBAC bugs?
See: ../../../notes/2025-12-16-phase7-auth-security-observability.md
### Task 7.1 — Centralized auth violation logging (DONE - core hook added)

- Files
  - `backend/app/core/security_events.py` (central logging helper)
  - `backend/app/core/dependencies.py` (401/403/CSRF denials)
  - `backend/app/core/rate_limit.py` (429 login rate limit)
- Rules
  - Must never log secrets (cookies, tokens, Authorization headers, CSRF values, request bodies)
  - Must include correlation_id (via contextvar), plus method/path/status
  - Prefer stable `violation` codes (for dashboards/alerts)
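A minimal sketch of a central helper enforcing these rules; the real `security_events.py` also feeds abuse tracking (Task 7.4), and the function name/signature here are assumptions:

```python
# Sketch of a central security-event helper with stable violation codes.
import structlog

logger = structlog.get_logger("security")


def log_security_violation(violation: str, *, status_code: int,
                           method: str, path: str, **context) -> None:
    """One WARNING per violation; correlation_id arrives via processors."""
    logger.warning("security_violation",
                   violation=violation,   # stable code for dashboards/alerts
                   status_code=status_code,
                   method=method,
                   path=path,
                   **context)             # extra safe fields only, no secrets


# e.g. log_security_violation("cross_tenant_sse_access", status_code=403,
#                             method="GET", path="/api/v1/events",
#                             requested_team_id="team-123")
```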
### Task 7.2 — Tenant-boundary violations (DONE)

- Files
  - `backend/app/api/v1/events.py` (cross-tenant SSE access)
  - Other routers enforcing team scoping
- Implementation
  - Logs cross-tenant attempts: `cross_tenant_sse_access` violation code
  - Includes team_id (requested/effective), user_id, correlation_id
  - Extra field: `requested_team_id` for debugging
### Task 7.3 — Auth token anomalies (DONE)

- Files
  - `backend/app/core/middleware.py` (`AuthContextMiddleware`)
- Implementation
  - Token validation failures: DEBUG level (error_type, error_detail)
  - Blacklisted tokens: INFO level (with jti prefix only)
  - Missing JTI: DEBUG level
  - Invalid subject: DEBUG level
  - Escalates to WARNING via abuse detection (Task 7.4)
### Task 7.4 — Abuse signals (baseline) (DONE)

- Files
  - `backend/app/core/security_events.py` (`_track_abuse_signal_async`, `_track_abuse_signal_memory`)
- Implementation
  - Redis-backed sliding windows (1h TTL) via async `INCR + EXPIRE` counters for both host and user scopes; thresholds unchanged (host ≥10, user ≥5) and emit structured warnings (`abuse_signal_host`, `abuse_signal_user`).
  - Automatic fallback to in-memory counters when Redis is unavailable or when running outside an event loop (e.g., unit tests), preserving deterministic behavior.
  - Async helper failures are surfaced via `abuse_tracking_*` warnings; unit coverage in `backend/tests/unit/core/test_security_events.py` validates the Redis path + fallback semantics.
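A minimal sketch of the INCR+EXPIRE window with the in-memory fallback, assuming a `redis.asyncio` client; the key format and helper shape are assumptions, while thresholds and the 1h TTL mirror the doc:

```python
# Sketch of a Redis sliding-window abuse counter with memory fallback.
import structlog

logger = structlog.get_logger(__name__)
_memory_counters: dict[str, int] = {}  # fallback when Redis is unavailable


async def _track_abuse_signal_async(redis, scope: str, key: str,
                                    threshold: int, ttl: int = 3600) -> None:
    counter_key = f"abuse:{scope}:{key}"
    try:
        count = await redis.incr(counter_key)
        if count == 1:
            await redis.expire(counter_key, ttl)  # start the 1h window
    except Exception:
        _memory_counters[counter_key] = _memory_counters.get(counter_key, 0) + 1
        count = _memory_counters[counter_key]
        logger.warning("abuse_tracking_fallback", key=counter_key)
    if count >= threshold:  # e.g. host >= 10, user >= 5
        logger.warning(f"abuse_signal_{scope}", key=key, count=count)
```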
## Phase 8 (DONE): External API observability (`backend/app/infrastructure/external/*`)
See: ../../../notes/2025-12-16-phase8-external-api-observability.md
### Policy (must be strict)

- Production
  - Log request/response metadata only (method, safe URL, status, duration, service name).
  - Never log raw request bodies or raw response bodies.
- Local/Dev
  - Log request/response metadata.
  - Allow a response body preview (capped + redacted) for debugging external integrations.
  - Never log secrets (Authorization/Cookie headers, tokens, passwords).
### Task 8.1 — Central outbound HTTP logging wrapper (DONE - core wrapper added)

- Files
  - `backend/app/core/external_http.py`
  - `backend/app/core/config.py` (env-gated body logging settings)
- Mechanism
  - HTTPX `event_hooks` for request/response logging (official docs: https://www.python-httpx.org/advanced/#event-hooks)
- Rules
  - `ENVIRONMENT == production` hard-disables body logging, even if env vars are mis-set.
  - All sensitive headers are redacted.
  - Query params are masked when keys look sensitive (token/key/secret/password/signature).
  - Body preview is capped by `EXTERNAL_HTTP_MAX_BODY_BYTES`.
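A minimal sketch of the `event_hooks` mechanism from the linked HTTPX docs, logging metadata only; for brevity it drops query strings entirely rather than masking sensitive keys, and the factory name matches Task 8.2 while its body is an assumption:

```python
# Sketch of outbound HTTP observability via httpx event hooks.
import httpx
import structlog

logger = structlog.get_logger("external_http")


async def _log_request(request: httpx.Request) -> None:
    logger.info("external_request", method=request.method,
                url=str(request.url.copy_with(query=None)))  # query dropped


async def _log_response(response: httpx.Response) -> None:
    logger.info("external_response",
                method=response.request.method,
                url=str(response.request.url.copy_with(query=None)),
                status_code=response.status_code)


def create_external_async_client(**kwargs) -> httpx.AsyncClient:
    return httpx.AsyncClient(
        event_hooks={"request": [_log_request], "response": [_log_response]},
        **kwargs,
    )
```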
### Task 8.2 — Migrate external clients to use wrapper (DONE)

- Files
  - `backend/app/infrastructure/external/virtuozzo/client.py` ✅
  - `backend/app/infrastructure/external/postmark/client.py` ✅
  - `backend/app/infrastructure/external/geo/ip_api.py` ✅
- Status: all external HTTP clients now use `create_external_async_client` with automatic observability.
### Task 8.3 — Production log sink on Virtuozzo Apache Python node (DONE)
Virtuozzo’s log files page does not clearly document an application “stdout/stderr sink” contract for Python apps in a way we can rely on. To avoid ambiguity, MBPanel implements a platform-agnostic managed log sink:
- Always emit structured logs to stdout (so any platform can capture/ship them).
- Optionally also write rotating JSONL logs to a writable directory controlled by env vars (see the sketch after this list):
  - `LOG_TO_FILE` (bool)
  - `LOG_DIR` (string path)
  - `LOG_FILE_NAME` (string, default `app.jsonl`)
  - `LOG_FILE_MAX_BYTES` (int)
  - `LOG_FILE_BACKUP_COUNT` (int)

This makes production logging predictable even if the platform's default log sink is unclear.
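A minimal sketch of the env-driven sink using the standard library's `RotatingFileHandler`; the real `_configure_file_logging()` adds file-name sanitization and structured warnings (see the implementation notes below):

```python
# Sketch of an env-driven rotating JSONL sink; env names match the doc,
# the /var/www/error default mirrors the ops-confirmed production path.
import logging
import os
from logging.handlers import RotatingFileHandler
from pathlib import Path


def _configure_file_logging() -> None:
    if os.environ.get("LOG_TO_FILE", "").lower() != "true":
        return  # stdout-only: the default on any platform
    log_dir = Path(os.environ.get("LOG_DIR", "/var/www/error"))
    log_dir.mkdir(parents=True, exist_ok=True)
    handler = RotatingFileHandler(
        log_dir / os.environ.get("LOG_FILE_NAME", "app.jsonl"),
        maxBytes=int(os.environ.get("LOG_FILE_MAX_BYTES", 10_485_760)),
        backupCount=int(os.environ.get("LOG_FILE_BACKUP_COUNT", 10)),
    )
    logging.getLogger().addHandler(handler)  # structlog renders JSON upstream
```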
- Implementation (2025-12-17)
  - `_configure_file_logging()` in `backend/app/core/logging.py` now sanitizes file names, enforces directory creation with structured warnings, and logs the resolved sink metadata once logging is initialized.
  - `_resolve_log_dir()` falls back to `/var/www/error` when `LOG_DIR` is unset in production, honoring Ops defaults.
  - Smoke script `backend/tests/smoke/logging/log_sink_probe.py` forces file logging via env overrides and asserts the JSONL log contains a fresh marker (aligned with the FAST-TESTING guide's smoke expectations).
Virtuozzo production default (confirmed by ops):

- Log directory: `/var/www/error`
- Recommended env:

```bash
ENVIRONMENT=production
LOG_TO_FILE=true
LOG_DIR=/var/www/error
LOG_FILE_NAME=mbpanel-api.jsonl
LOG_FILE_MAX_BYTES=10485760
LOG_FILE_BACKUP_COUNT=10
```
Permissions requirement: the application process user (the Apache/Python node user) must have write permission to `LOG_DIR`.
Verification

- Integration/unit coverage in `backend/tests/unit/core/test_logging_config.py` validates path sanitization + warning paths.
- Smoke coverage via `python backend/tests/smoke/logging/log_sink_probe.py --log-dir /var/www/error` (run on the target host) confirms the file sink writes structured JSON entries that include the probe marker.