# Logging & Debugging System — MBPanel API (Current Implementation + Remaining Tasks)
This document is the authoritative observability task list, revised to match the current repository structure and implementation.
## Scope (what observability means here)

- Structured logging: `structlog`-backed logs with request correlation.
- Distributed tracing: OpenTelemetry support, but optional (the app must not crash if OTel deps are absent).
- Error visibility: a centralized error logging path for unhandled exceptions.
- Multi-tenant context: logs must reliably include `team_id` (for isolation/debugging).
- Security: PII-safe logging defaults (no secrets in logs).
- Messaging visibility: clear runtime visibility into the SSE + RabbitMQ event flow (connections, retries, consumers, drops, clients).
## Current implementation (verified in code)

- Structlog setup: `backend/app/core/logging.py` (`setup_logging()`, `get_logger()`).
- Startup wiring: `backend/app/core/app_factory.py` calls `setup_logging()` in `lifespan()`.
- Correlation id
  - Middleware: `backend/app/core/middleware.py` (`CorrelationIdMiddleware`)
  - Storage: `backend/app/core/request_context.py` (`request_id_ctx: ContextVar[str | None]`)
  - Log enrichment: `backend/app/core/logging.py` processor `add_correlation_id`
- OpenTelemetry (optional)
  - `backend/app/core/telemetry.py` provides `configure_opentelemetry(app)` with no-op fallbacks when packages are missing. `backend/app/core/app_factory.py` calls it when `settings.ENABLE_OPENTELEMETRY` is true. (The optional-import pattern is sketched after this list.)
- SSE + RabbitMQ
  - SSE endpoint: `backend/app/api/v1/events.py`
  - RabbitMQ→SSE broker: `backend/app/infrastructure/messaging/sse.py`
  - Smoke scripts: `backend/tests/smoke/sse/` (real-stack verification)
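The optional-import pattern referenced above underpins every "optional" integration in this doc. A minimal sketch of that pattern, assuming the standard OTel SDK packages; names and the console exporter are illustrative, not the exact repository code:

```python
# Sketch of the optional-import pattern: tracing activates only when the
# OpenTelemetry SDK is importable, and silently no-ops otherwise.
from __future__ import annotations

try:
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

    _OTEL_AVAILABLE = True
except ImportError:  # OTel packages absent: the app must keep working
    _OTEL_AVAILABLE = False


def configure_opentelemetry(app) -> bool:
    """Return True if tracing was enabled, False if we fell back to no-ops."""
    if not _OTEL_AVAILABLE:
        return False
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
    # `app` would be instrumented here (e.g. FastAPI auto-instrumentation),
    # guarded by the same try/except pattern for the instrumentation package.
    return True
```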
## Non-goals / removals (important)

- We do not use `asgi-correlation-id` in this repository.
- We do not have `backend/app/database/session.py` (DB code lives under `backend/app/infrastructure/database/`).
## Dependency policy (do not paste pip install commands into this doc)

- Backend dependencies are pinned in:
  - `backend/pyproject.toml`
  - `backend/constraints.txt`
- Any new observability dependency must be:
  - added to both files
  - justified with official documentation
## Phase 0 (DONE): Structured logging foundation

### Task 0.1 — Structlog base configuration (DONE)

- Files
  - `backend/app/core/logging.py`
- Acceptance criteria
  - Logs are emitted through `structlog`
  - Correlation id can be injected via a processor (see Task 0.2)
### Task 0.2 — Correlation id middleware + contextvar (DONE)

- Files
  - `backend/app/core/middleware.py` (`CorrelationIdMiddleware`)
  - `backend/app/core/request_context.py` (`request_id_ctx`)
- Acceptance criteria
  - Response includes `X-Request-ID`
  - Logs include `correlation_id` when a request is in scope
  - Correlation id context is always cleared on exceptions (no cross-request leakage)
  - Incoming `X-Request-ID` is sanitized/capped (no control chars, no unbounded values); see the sketch after this list
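A minimal sketch of a middleware satisfying these acceptance criteria; the repository's actual `CorrelationIdMiddleware` may differ in detail, and the 128-character cap and allowed alphabet here are assumptions:

```python
# Sketch of a correlation-id middleware: sanitize the inbound header,
# store it in a ContextVar, always reset it, echo it on the response.
import re
import uuid
from contextvars import ContextVar

from starlette.middleware.base import BaseHTTPMiddleware

request_id_ctx: ContextVar[str | None] = ContextVar("request_id", default=None)

_SAFE_ID = re.compile(r"^[A-Za-z0-9._-]{1,128}$")  # length cap, no control chars


class CorrelationIdMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        raw = request.headers.get("X-Request-ID", "")
        request_id = raw if _SAFE_ID.match(raw) else str(uuid.uuid4())
        token = request_id_ctx.set(request_id)
        try:
            response = await call_next(request)
        finally:
            request_id_ctx.reset(token)  # cleared even on exceptions
        response.headers["X-Request-ID"] = request_id
        return response
```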
Self-check (local Python import sanity):

```bash
cd backend
python3 -c "from app.core.request_context import request_id_ctx; from app.core.logging import setup_logging, get_logger; setup_logging(); request_id_ctx.set('test'); get_logger(__name__).info('log-test')"
```
## Phase 1 (DONE): Central request lifecycle logging + global exception capture

### Why this phase exists
Request logging + exception capture shipped in Dec 2025; this section now documents the implemented behavior so regressions are easy to spot.
### Task 1.1 — Request/response logging middleware (DONE)

- Files
  - `backend/app/core/middleware.py` (`RequestLoggingMiddleware`)
  - `backend/app/core/app_factory.py` (ensure middleware order stays correct)
- Requirements
  - Log at least: method, path, status_code, duration_ms, correlation_id
  - Avoid duplicate logs (exactly one "request completed" log entry per request)
  - Avoid logging cookies/authorization headers/body by default
  - Ensure contextvars are cleared between requests (no leakage)
- Implementation (verified 2025-12-17)
  - `RequestLoggingMiddleware` logs `request_completed` on success and `request_failed` on exceptions (with `exc_info=True`), capturing method/path/status/duration.
  - Middleware order (`create_app`) keeps `CorrelationIdMiddleware` and `AuthContextMiddleware` ahead of request logging so processors enrich every entry with contextvars.
  - The dispatcher wraps `call_next` in try/except and never inspects bodies or headers, keeping logs PII-safe.
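A minimal sketch of the pattern described above, assuming correlation and tenant IDs are attached by the structlog processors from Tasks 0.2/2.1 rather than passed explicitly:

```python
# Sketch of a request-lifecycle logging middleware: exactly one log entry
# per request, and no bodies or headers are ever inspected.
import time

import structlog
from starlette.middleware.base import BaseHTTPMiddleware

logger = structlog.get_logger(__name__)


class RequestLoggingMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        start = time.perf_counter()
        try:
            response = await call_next(request)
        except Exception:
            logger.error(
                "request_failed",
                method=request.method,
                path=request.url.path,
                duration_ms=round((time.perf_counter() - start) * 1000, 2),
                exc_info=True,
            )
            raise
        logger.info(
            "request_completed",
            method=request.method,
            path=request.url.path,
            status_code=response.status_code,
            duration_ms=round((time.perf_counter() - start) * 1000, 2),
        )
        return response
```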
### Task 1.2 — Centralized error logging path (DONE)

Implement one of:

- Global exception handler(s) via FastAPI `@app.exception_handler(...)`, or
- An outer middleware that logs exceptions (`exc_info=True`) and re-raises

Goal: any unhandled exception produces one structured error log with request context (and, later, Sentry).
- Implementation (verified 2025-12-17)
  - `_register_exception_handlers` registers FastAPI handlers for `HTTPException`, `RequestValidationError`, and a catch-all `Exception`.
  - The catch-all handler logs `unhandled_exception` once per failure with `exc_info=True`, then returns a 500 JSON payload.
  - Only the centralized handler logs stack traces, preventing duplicate logging downstream.
### Task 1.3 — Validation/HTTP exception logging contract (DONE)

FastAPI will raise/return structured errors for:

- `HTTPException` (expected application errors)
- Request validation errors (422)

We must decide the policy and implement it consistently:

- Do not log expected `HTTPException` at ERROR by default (to avoid noisy logs).
- Do log 5xx exceptions with stack traces (once), including correlation id + route metadata.
- Ensure 422/400 are either logged at DEBUG/INFO or sampled (avoid log spam).
- Implementation (verified 2025-12-17)
  - 4xx `HTTPException` responses log at DEBUG via `http_exception_4xx`, meeting the "expected error" policy.
  - 5xx `HTTPException` responses log at ERROR (`http_exception_5xx`) with stack traces.
  - Validation errors emit INFO-level `validation_error` logs with structured `errors` payloads; correlation/user/team context bindings ensure multi-tenant traceability.
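A minimal sketch of the combined Task 1.2/1.3 contract; the repository's `_register_exception_handlers` differs in detail (payload shapes, context bindings), but the leveling policy is the same:

```python
# Sketch of the leveled exception-handler policy: 4xx at DEBUG, 5xx at
# ERROR with one stack trace, validation at INFO, catch-all logged once.
import structlog
from fastapi import FastAPI, Request
from fastapi.exceptions import RequestValidationError
from fastapi.responses import JSONResponse
from starlette.exceptions import HTTPException as StarletteHTTPException

logger = structlog.get_logger(__name__)


def _register_exception_handlers(app: FastAPI) -> None:
    @app.exception_handler(StarletteHTTPException)
    async def http_exception(request: Request, exc: StarletteHTTPException):
        if exc.status_code >= 500:
            logger.error("http_exception_5xx", path=request.url.path,
                         status_code=exc.status_code, exc_info=True)
        else:
            logger.debug("http_exception_4xx", path=request.url.path,
                         status_code=exc.status_code)
        return JSONResponse({"detail": exc.detail}, status_code=exc.status_code)

    @app.exception_handler(RequestValidationError)
    async def validation_error(request: Request, exc: RequestValidationError):
        logger.info("validation_error", path=request.url.path, errors=exc.errors())
        return JSONResponse({"detail": exc.errors()}, status_code=422)

    @app.exception_handler(Exception)
    async def unhandled(request: Request, exc: Exception):
        logger.error("unhandled_exception", path=request.url.path, exc_info=True)
        return JSONResponse({"detail": "Internal Server Error"}, status_code=500)
```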
## Phase 2 (DONE): Multi-tenant logging context (team_id)

### Task 2.1 — Bind team_id into log context (DONE)

- Files
  - `backend/app/core/request_context.py` (`team_id_ctx`, `user_id_ctx`)
  - `backend/app/core/middleware.py` (`AuthContextMiddleware`)
  - `backend/app/core/logging.py` (`add_request_context` processor)
- Implementation
  - Contextvars for team_id and user_id added to `request_context.py`
  - `AuthContextMiddleware` binds team_id/user_id when authenticated
  - Processor `add_request_context` adds team_id/user_id to all logs
  - Only includes team_id/user_id when present (no null noise)
  - Contextvars cleared in a `finally` block (no leakage on exceptions)
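A minimal sketch of the processor contract (structlog processors take `(logger, method_name, event_dict)` and return the event dict):

```python
# Sketch of a structlog processor that injects tenant context from
# ContextVars, skipping unset values to avoid null noise.
from contextvars import ContextVar

team_id_ctx: ContextVar[str | None] = ContextVar("team_id", default=None)
user_id_ctx: ContextVar[str | None] = ContextVar("user_id", default=None)


def add_request_context(logger, method_name, event_dict):
    """Add team_id/user_id to every log entry when a request is in scope."""
    team_id = team_id_ctx.get()
    if team_id is not None:
        event_dict["team_id"] = team_id
    user_id = user_id_ctx.get()
    if user_id is not None:
        event_dict["user_id"] = user_id
    return event_dict
```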
## Phase 3 (DONE): PII filtering policy

### Task 3.1 — PII/sensitive key filtering (DONE)

- Files
  - `backend/app/core/logging.py` (`filter_pii` processor, `SENSITIVE_KEYS`)
  - `backend/app/core/config.py` (`PII_FILTERING_ENABLED`)
- Implementation
  - Comprehensive `SENSITIVE_KEYS` set (password, token, secret, cookie, auth, csrf, etc.)
  - `filter_pii` processor redacts any key matching sensitive patterns
  - Case-insensitive matching with normalization
  - Applied to all log entries in the structlog processor chain
  - Can be disabled via `PII_FILTERING_ENABLED` (dev only)
### Task 3.2 — "Safe fields" contract for logs (DONE)

- Files
  - `backend/app/core/logging.py` (`SAFE_FIELDS`)
- Implementation
  - Defined `SAFE_FIELDS` allowlist: correlation_id, team_id, user_id, method, path, status_code, duration_ms, etc.
  - Production mode: only `SAFE_FIELDS` allowed (strict allowlist)
  - Development mode: permissive, but still redacts `SENSITIVE_KEYS`
  - Unknown fields in production: logged as `[FILTERED]` (see the combined sketch below)
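A minimal sketch combining the Task 3.1 denylist with the Task 3.2 allowlist in one structlog processor; both key sets are heavily abbreviated compared with the repository's real ones:

```python
# Sketch of denylist redaction plus a production-only allowlist.
SENSITIVE_KEYS = {"password", "token", "secret", "cookie", "auth", "csrf"}
SAFE_FIELDS = {"event", "level", "timestamp", "correlation_id", "team_id",
               "user_id", "method", "path", "status_code", "duration_ms"}


def filter_pii(logger, method_name, event_dict, *, production: bool = False):
    for key in list(event_dict):
        normalized = key.lower()
        if any(marker in normalized for marker in SENSITIVE_KEYS):
            event_dict[key] = "[REDACTED]"   # denylist: always applied
        elif production and key not in SAFE_FIELDS:
            event_dict[key] = "[FILTERED]"   # allowlist: production only
    return event_dict
```

In the real processor chain, a partial such as `functools.partial(filter_pii, production=True)` would pin the mode at configuration time, since structlog calls processors with exactly three positional arguments.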
## Phase 4 (DONE): OpenTelemetry tracing (optional)

### Task 4.1 — Document how to enable tracing locally (DONE)

- Files
  - `docs/development/ENVIRONMENT-SETUP.md` (new "OpenTelemetry Configuration" section)
- Documentation added
  - Local development setup with a Docker collector
  - Backend configuration (env vars)
  - Sampling rates for dev vs production
  - Verification commands (`make test-tracing`)
  - Supported backends (Jaeger, Tempo, Honeycomb, DataDog, New Relic)
  - How to disable (`ENABLE_OPENTELEMETRY=false`)
## Phase 5 (DONE): Error monitoring (Sentry)

### Task 5.1 — Sentry integration (DONE)

- Files
  - `backend/app/core/sentry.py` (new file)
  - `backend/app/core/app_factory.py` (lifespan calls `configure_sentry`)
  - `backend/app/core/config.py` (`SENTRY_DSN` and sampling settings)
- Implementation
  - Optional integration (no crash if sentry-sdk is not installed or the DSN is missing)
  - PII scrubbing via a `before_send` hook (`_scrub_sensitive_data`)
  - Scrubs: cookies, authorization headers, passwords, tokens, secrets, keys
  - FastAPI, SQLAlchemy, Redis, and Logging integrations
  - Environment and release tagging
  - Helper functions: `capture_exception`, `capture_message` (safe to call anytime)
  - Comprehensive error handling (never crashes the app)
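A minimal sketch of the optional wiring and `before_send` scrubbing; the real `_scrub_sensitive_data` covers far more locations than request headers, and the sample rate here is illustrative:

```python
# Sketch of optional Sentry wiring: missing DSN or missing SDK means no-op.
def _scrub_sensitive_data(event, hint):
    """before_send hook: redact obvious secrets before the event leaves."""
    request = event.get("request") or {}
    headers = request.get("headers") or {}
    for header in ("Authorization", "Cookie"):
        if header in headers:
            headers[header] = "[REDACTED]"
    return event


def configure_sentry(dsn: str | None) -> bool:
    if not dsn:
        return False  # DSN missing: stay silent, never crash
    try:
        import sentry_sdk
    except ImportError:
        return False  # sentry-sdk not installed: no-op
    sentry_sdk.init(dsn=dsn, before_send=_scrub_sensitive_data,
                    traces_sample_rate=0.1)
    return True
```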
## Benchmarks / performance

- Tracing benchmark script: `backend/tests/benchmarks/tracing_performance_benchmark.py`
## Repo alignment reminders

- Architecture layout is enforced by `make validate` (runs `backend/scripts/validate_architecture.py`).
- Smoke scripts live under `backend/tests/smoke/` (not in `scripts/`).
## Verification tasks

### Test V1 — Correlation ID behavior (DONE)

- File: `backend/tests/integration/api/test_correlation_id.py`
- Coverage
  - `test_correlation_id_echo_when_provided`
  - `test_correlation_id_generated_when_missing`
  - `test_correlation_id_rejects_oversized_values`
  - `test_correlation_id_rejects_control_characters`
  - `test_correlation_id_rejects_empty_values`
  - `test_correlation_id_preserved_across_errors`
- Evidence: async HTTPX tests (per the FAST-TESTING guide) exercise success + error paths and validate header behavior + UUID format.
### Test V2 — Context leakage guard (DONE)

- File: `backend/tests/integration/api/test_context_leakage.py`
- Coverage
  - `test_correlation_id_no_leakage_on_success`
  - `test_correlation_id_no_leakage_after_error`
  - `test_context_vars_cleared_after_request`
  - `test_multiple_sequential_requests_isolated`
  - `test_concurrent_requests_isolated`
  - `test_exception_clears_context`
- Evidence: a helper router under `/__tests__/observability/*` snapshots contextvars while middleware is patched to emulate authenticated users, proving request-scoped ContextVars reset correctly even on concurrent + failure paths.
## Phase 6 (DONE): Messaging observability — SSE + RabbitMQ (production reality)

### Why this phase exists

SSE/RabbitMQ issues are notoriously "silent" in production (dropped connections, stalled consumers, backpressure, no active subscribers). We need explicit logging + metrics so we can answer:

- Is the SSE client connected?
- Is RabbitMQ connected and consuming?
- Are messages being routed and fanned out?
- Are we dropping messages due to backpressure?
- Which tenant/team scope is affected?

Status summary (2025-12-17): Tasks 6.1–6.6 are live (manual OTel spans now wrap the broker + SSE lifecycle); smoke coverage remains in `backend/tests/smoke/sse/`.
### Task 6.1 — SSE endpoint lifecycle logs (DONE)

- Files
  - `backend/app/api/v1/events.py`
  - `backend/app/infrastructure/messaging/sse.py`
- Requirements
  - Log "SSE connected" and "SSE disconnected" with: correlation_id, team_id, job_id (if provided)
  - Do not log any payload body (PII/security)
- Implementation (verified 2025-12-17)
  - `event_stream` logs `SSE stream started`, `SSE client disconnected`, and `SSE stream ended` with team/job IDs and client metadata.
  - `broker.subscribe()` / `broker.unsubscribe()` log active subscriber counts and queue sizes; processors add correlation/user/team IDs automatically.
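A minimal sketch of lifecycle logging inside an SSE generator; the event names mirror the doc, but the disconnect signal (here `asyncio.CancelledError`) depends on how the framework cancels the stream, so treat this as an assumption:

```python
# Sketch of SSE lifecycle logging: connect/disconnect logged with IDs
# only; the event payload is forwarded but never logged.
import asyncio

import structlog

logger = structlog.get_logger(__name__)


async def event_stream(queue: asyncio.Queue, team_id: str, job_id: str | None):
    logger.info("SSE stream started", team_id=team_id, job_id=job_id)
    try:
        while True:
            message = await queue.get()
            yield f"data: {message}\n\n"  # payload forwarded, never logged
    except asyncio.CancelledError:  # client went away mid-stream
        logger.info("SSE client disconnected", team_id=team_id, job_id=job_id)
        raise
    finally:
        logger.info("SSE stream ended", team_id=team_id, job_id=job_id)
```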
### Task 6.2 — Broker connection/retry visibility (DONE)

- Files
  - `backend/app/infrastructure/messaging/sse.py`
- Requirements
  - Log broker start/stop, connect success, retry attempts, and the last exception
  - Include exchange/queue/binding_key in logs (already partially present on crash)
- Implementation (verified 2025-12-17)
  - `SSEBroker.start()` and `.stop()` emit structured logs covering exchange/queue/binding_key, subscriber counts, and message totals.
  - `_consume_loop` logs every connect attempt, successful connection, and crash (with attempt/failure counters and `exc_info=True`); `broker.get_metrics()` surfaces `last_error`.
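A minimal sketch of an observable consume loop; `connect`/`consume` are placeholders for the real AMQP client calls, and the event names are illustrative rather than the repository's exact ones:

```python
# Sketch of a consume loop where every attempt, success, and crash is
# logged with counters, and cancellation shuts down cleanly.
import asyncio

import structlog

logger = structlog.get_logger(__name__)


async def _consume_loop(connect, consume, *, retry_delay: float = 5.0):
    attempt = 0
    while True:
        attempt += 1
        logger.info("broker_connect_attempt", attempt=attempt)
        try:
            connection = await connect()
            logger.info("broker_connected", attempt=attempt)
            await consume(connection)
        except asyncio.CancelledError:
            logger.info("broker_consume_loop_stopped")
            raise
        except Exception:
            logger.error("broker_consume_loop_crashed",
                         attempt=attempt, exc_info=True)
            await asyncio.sleep(retry_delay)  # back off before retrying
```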
### Task 6.3 — Message flow counters (metrics) (DONE)

- Files
  - `backend/app/infrastructure/messaging/sse.py`
- Requirements
  - Track counters/gauges (at minimum):
    - subscribers_active (gauge)
    - messages_consumed_total
    - messages_forwarded_total
    - messages_dropped_total (queue full / stale)
  - These can be logged periodically (and later exported as metrics when we implement metrics properly).
- Implementation (verified 2025-12-17)
  - `SSEMetrics` tracks the required counters; `_handle_message` / `_fan_out` increment totals, and `broker.get_metrics()` exposes them for diagnostics/health endpoints.
  - Active subscriber counts are logged whenever clients subscribe/unsubscribe.
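A minimal sketch of the counter set; field names follow the requirements above, while the `snapshot()` helper stands in for whatever `broker.get_metrics()` actually returns:

```python
# Sketch of the Task 6.3 counter set as a plain dataclass.
from dataclasses import dataclass


@dataclass
class SSEMetrics:
    subscribers_active: int = 0  # gauge
    messages_consumed_total: int = 0
    messages_forwarded_total: int = 0
    messages_dropped_total: int = 0
    last_error: str | None = None

    def snapshot(self) -> dict:
        """Expose counters for diagnostics/health endpoints."""
        return {
            "subscribers_active": self.subscribers_active,
            "messages_consumed_total": self.messages_consumed_total,
            "messages_forwarded_total": self.messages_forwarded_total,
            "messages_dropped_total": self.messages_dropped_total,
            "last_error": self.last_error,
        }
```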
### Task 6.4 — Backpressure strategy must be observable (DONE)

- Files
  - `backend/app/infrastructure/messaging/sse.py`
- Requirements
  - If we drop messages when subscriber queues fill, we must log a structured warning with:
    - team_id/job_id (if available from the routing key)
    - queue_size/max_size
    - drop_reason
- Implementation (verified 2025-12-17)
  - `_fan_out` handles `asyncio.QueueFull` by dropping the oldest message, logs `SSE subscriber queue full` with routing_key + queue size, and increments `messages_dropped_total`.
  - A second warning (`Messages dropped due to backpressure`) records the routing key, delivered vs dropped counts, and subscriber totals whenever drops occur.
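A minimal sketch of the drop-oldest strategy with the required structured warning; `_offer` is a hypothetical per-subscriber helper, not the repository's `_fan_out`:

```python
# Sketch of observable drop-oldest backpressure handling per subscriber.
import asyncio

import structlog

logger = structlog.get_logger(__name__)


def _offer(queue: asyncio.Queue, message, routing_key: str, metrics) -> bool:
    """Enqueue without blocking; drop the oldest message if the queue is full."""
    try:
        queue.put_nowait(message)
        return True
    except asyncio.QueueFull:
        try:
            queue.get_nowait()  # drop the oldest entry to make room
        except asyncio.QueueEmpty:
            pass
        queue.put_nowait(message)
        metrics.messages_dropped_total += 1
        logger.warning("SSE subscriber queue full",
                       routing_key=routing_key,
                       queue_size=queue.qsize(),
                       max_size=queue.maxsize,
                       drop_reason="queue_full")
        return False
```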
### Task 6.5 — Tracing hooks for event flow (optional) (DONE)

- Files
  - `backend/app/infrastructure/messaging/sse.py`
  - `backend/app/api/v1/events.py`
- Requirements
  - If OpenTelemetry is enabled, add manual spans around:
    - consume loop connect/setup
    - message handling
    - SSE write loop
  - This must not crash when OpenTelemetry deps are not installed.
- Implementation (verified 2025-12-17)
  - A `_telemetry_span()` helper wraps RabbitMQ connect/setup, message handling, and SSE send loops with manual spans using `get_tracer`; attributes are prefixed `mbpanel.*`.
  - Spans collapse to no-ops when OpenTelemetry is disabled or not installed (matches the optional requirement).
- Tests: `backend/tests/unit/infrastructure/messaging/test_sse_telemetry.py` ensures the helper sets attributes without relying on OTel packages.
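A minimal sketch of such a helper, collapsing to a no-op when the OTel import fails; the repository's `_telemetry_span()` may differ in detail:

```python
# Sketch of a span helper that degrades to a no-op without OpenTelemetry.
from contextlib import contextmanager


@contextmanager
def _telemetry_span(name: str, **attributes):
    try:
        from opentelemetry import trace
    except ImportError:
        yield None  # OTel not installed: callers proceed unchanged
        return
    tracer = trace.get_tracer("mbpanel")
    with tracer.start_as_current_span(name) as span:
        for key, value in attributes.items():
            span.set_attribute(f"mbpanel.{key}", value)
        yield span


# Usage: with _telemetry_span("sse.fan_out", routing_key=key): ...
```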
### Task 6.6 — Repeatable verification (smoke scripts) (DONE, keep current)

- Folder: `backend/tests/smoke/sse/`
- Goal: the publish→consume roundtrip and authenticated SSE stream verification must remain repeatable.
## Phase 7 (DONE): Auth/Security observability (violations, abuse, tenant boundaries)

### Why this phase exists

Auth failures/violations are security signals. During development we must be able to answer:

- Are we seeing brute-force/login abuse?
- Are CSRF failures happening (broken client) vs actual attacks?
- Are cross-tenant access attempts happening?
- Are permission denials happening because of RBAC bugs?
See: ../../../notes/2025-12-16-phase7-auth-security-observability.md
### Task 7.1 — Centralized auth violation logging (DONE - core hook added)

- Files
  - `backend/app/core/security_events.py` (central logging helper)
  - `backend/app/core/dependencies.py` (401/403/CSRF denials)
  - `backend/app/core/rate_limit.py` (429 login rate limit)
- Rules
  - Must never log secrets (cookies, tokens, Authorization headers, CSRF values, request bodies)
  - Must include correlation_id (via contextvar), plus method/path/status
  - Prefer stable `violation` codes (for dashboards/alerts)
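A minimal sketch of a central helper enforcing these rules; the real `security_events.py` also feeds abuse tracking (Task 7.4), and the function name/signature here are assumptions:

```python
# Sketch of a central security-event helper with stable violation codes.
import structlog

logger = structlog.get_logger("security")


def log_security_violation(violation: str, *, status_code: int,
                           method: str, path: str, **context) -> None:
    """One WARNING per violation; correlation_id arrives via processors."""
    logger.warning("security_violation",
                   violation=violation,   # stable code for dashboards/alerts
                   status_code=status_code,
                   method=method,
                   path=path,
                   **context)             # extra safe fields only, no secrets


# e.g. log_security_violation("cross_tenant_sse_access", status_code=403,
#                             method="GET", path="/api/v1/events",
#                             requested_team_id="team-123")
```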
### Task 7.2 — Tenant-boundary violations (DONE)

- Files
  - `backend/app/api/v1/events.py` (cross-tenant SSE access)
  - Other routers enforcing team scoping
- Implementation
  - Logs cross-tenant attempts: `cross_tenant_sse_access` violation code
  - Includes team_id (requested/effective), user_id, correlation_id
  - Extra field: `requested_team_id` for debugging
### Task 7.3 — Auth token anomalies (DONE)

- Files
  - `backend/app/core/middleware.py` (`AuthContextMiddleware`)
- Implementation
  - Token validation failures: DEBUG level (error_type, error_detail)
  - Blacklisted tokens: INFO level (with jti prefix only)
  - Missing JTI: DEBUG level
  - Invalid subject: DEBUG level
  - Escalates to WARNING via abuse detection (Task 7.4)
### Task 7.4 — Abuse signals (baseline) (DONE)

- Files
  - `backend/app/core/security_events.py` (`_track_abuse_signal_async`, `_track_abuse_signal_memory`)
- Implementation
  - Redis-backed sliding windows (1h TTL) via async `INCR + EXPIRE` counters for both host and user scopes; thresholds unchanged (host ≥10, user ≥5) and emit structured warnings (`abuse_signal_host`, `abuse_signal_user`).
  - Automatic fallback to in-memory counters when Redis is unavailable or when running outside an event loop (e.g., unit tests), preserving deterministic behavior.
  - Async helper failures are surfaced via `abuse_tracking_*` warnings; unit coverage in `backend/tests/unit/core/test_security_events.py` validates the Redis path + fallback semantics.
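A minimal sketch of the INCR+EXPIRE window with the in-memory fallback, assuming a `redis.asyncio` client; the key format and helper shape are assumptions, while thresholds and the 1h TTL mirror the doc:

```python
# Sketch of a Redis sliding-window abuse counter with memory fallback.
import structlog

logger = structlog.get_logger(__name__)
_memory_counters: dict[str, int] = {}  # fallback when Redis is unavailable


async def _track_abuse_signal_async(redis, scope: str, key: str,
                                    threshold: int, ttl: int = 3600) -> None:
    counter_key = f"abuse:{scope}:{key}"
    try:
        count = await redis.incr(counter_key)
        if count == 1:
            await redis.expire(counter_key, ttl)  # start the 1h window
    except Exception:
        _memory_counters[counter_key] = _memory_counters.get(counter_key, 0) + 1
        count = _memory_counters[counter_key]
        logger.warning("abuse_tracking_fallback", key=counter_key)
    if count >= threshold:  # e.g. host >= 10, user >= 5
        logger.warning(f"abuse_signal_{scope}", key=key, count=count)
```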
## Phase 8 (DONE): External API observability (`backend/app/infrastructure/external/*`)
See: ../../../notes/2025-12-16-phase8-external-api-observability.md
### Policy (must be strict)

- Production
  - Log request/response metadata only (method, safe URL, status, duration, service name).
  - Never log raw request bodies or raw response bodies.
- Local/Dev
  - Log request/response metadata.
  - Allow a response body preview (capped + redacted) for debugging external integrations.
  - Never log secrets (Authorization/Cookie headers, tokens, passwords).
### Task 8.1 — Central outbound HTTP logging wrapper (DONE - core wrapper added)

- Files
  - `backend/app/core/external_http.py`
  - `backend/app/core/config.py` (env-gated body logging settings)
- Mechanism
  - HTTPX `event_hooks` for request/response logging (official docs: https://www.python-httpx.org/advanced/#event-hooks)
- Rules
  - `ENVIRONMENT == production` hard-disables body logging, even if env vars are mis-set.
  - All sensitive headers are redacted.
  - Query params are masked when keys look sensitive (token/key/secret/password/signature).
  - Body preview is capped by `EXTERNAL_HTTP_MAX_BODY_BYTES`.
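A minimal sketch of the `event_hooks` mechanism from the linked HTTPX docs, logging metadata only; for brevity it drops query strings entirely rather than masking sensitive keys, and the factory name matches Task 8.2 while its body is an assumption:

```python
# Sketch of outbound HTTP observability via httpx event hooks.
import httpx
import structlog

logger = structlog.get_logger("external_http")


async def _log_request(request: httpx.Request) -> None:
    logger.info("external_request", method=request.method,
                url=str(request.url.copy_with(query=None)))  # query dropped


async def _log_response(response: httpx.Response) -> None:
    logger.info("external_response",
                method=response.request.method,
                url=str(response.request.url.copy_with(query=None)),
                status_code=response.status_code)


def create_external_async_client(**kwargs) -> httpx.AsyncClient:
    return httpx.AsyncClient(
        event_hooks={"request": [_log_request], "response": [_log_response]},
        **kwargs,
    )
```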
### Task 8.2 — Migrate external clients to use wrapper (DONE)

- Files
  - `backend/app/infrastructure/external/virtuozzo/client.py` ✅
  - `backend/app/infrastructure/external/postmark/client.py` ✅
  - `backend/app/infrastructure/external/geo/ip_api.py` ✅
- Status: all external HTTP clients now use `create_external_async_client` with automatic observability.
### Task 8.3 — Production log sink on Virtuozzo Apache Python node (DONE)
Virtuozzo’s log files page does not clearly document an application “stdout/stderr sink” contract for Python apps in a way we can rely on. To avoid ambiguity, MBPanel implements a platform-agnostic managed log sink:
- Always emit structured logs to stdout (so any platform can capture/ship them).
- Optionally also write rotating JSONL logs to a writable directory controlled by env vars (see the sketch after this list):
  - `LOG_TO_FILE` (bool)
  - `LOG_DIR` (string path)
  - `LOG_FILE_NAME` (string, default `app.jsonl`)
  - `LOG_FILE_MAX_BYTES` (int)
  - `LOG_FILE_BACKUP_COUNT` (int)

This makes production logging predictable even if the platform's default log sink is unclear.
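A minimal sketch of the env-driven sink using the standard library's `RotatingFileHandler`; the real `_configure_file_logging()` adds file-name sanitization and structured warnings (see the implementation notes below):

```python
# Sketch of an env-driven rotating JSONL sink; env names match the doc,
# the /var/www/error default mirrors the ops-confirmed production path.
import logging
import os
from logging.handlers import RotatingFileHandler
from pathlib import Path


def _configure_file_logging() -> None:
    if os.environ.get("LOG_TO_FILE", "").lower() != "true":
        return  # stdout-only: the default on any platform
    log_dir = Path(os.environ.get("LOG_DIR", "/var/www/error"))
    log_dir.mkdir(parents=True, exist_ok=True)
    handler = RotatingFileHandler(
        log_dir / os.environ.get("LOG_FILE_NAME", "app.jsonl"),
        maxBytes=int(os.environ.get("LOG_FILE_MAX_BYTES", 10_485_760)),
        backupCount=int(os.environ.get("LOG_FILE_BACKUP_COUNT", 10)),
    )
    logging.getLogger().addHandler(handler)  # structlog renders JSON upstream
```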
- Implementation (2025-12-17)
  - `_configure_file_logging()` in `backend/app/core/logging.py` now sanitizes file names, enforces directory creation with structured warnings, and logs the resolved sink metadata once logging is initialized.
  - `_resolve_log_dir()` falls back to `/var/www/error` when `LOG_DIR` is unset in production, honoring Ops defaults.
  - Smoke script `backend/tests/smoke/logging/log_sink_probe.py` forces file logging via env overrides and asserts the JSONL log contains a fresh marker (aligned with the FAST-TESTING guide's smoke expectations).
Virtuozzo production default (confirmed by ops):

- Log directory: `/var/www/error`
- Recommended env:

```bash
ENVIRONMENT=production
LOG_TO_FILE=true
LOG_DIR=/var/www/error
LOG_FILE_NAME=mbpanel-api.jsonl
LOG_FILE_MAX_BYTES=10485760
LOG_FILE_BACKUP_COUNT=10
```
Permissions requirement: the application process user (the Apache/Python node user) must have write permission to `LOG_DIR`.
Verification

- Integration/unit coverage in `backend/tests/unit/core/test_logging_config.py` validates path sanitization + warning paths.
- Smoke coverage via `python backend/tests/smoke/logging/log_sink_probe.py --log-dir /var/www/error` (run on the target host) confirms the file sink writes structured JSON entries that include the probe marker.