# AI Agent Guide — Logging, Error Handling & OTEL (SigNoz Aligned)
Active implementation snapshot: December 18, 2025
Grounded in the live code under `backend/app/core/*`, `backend/app/infrastructure/messaging/sse.py`, and the authoritative observability backlog in `docs/development/tasks/phase1_observability/logging_debugging_system.md`. External recommendations cite the official Structlog manual, OpenTelemetry Python docs, and SigNoz instrumentation guides.[^structlog][^otel-python][^signoz-python][^signoz-manual][^signoz-troubleshooting]
## 0. Purpose & Alignment
- **Confirm what already ships.** The FastAPI app initializes observability in `lifespan()` by calling `setup_logging()` and starting OpenTelemetry, Sentry, and the SSE broker.[^code-lifespan]
- **Document how to extend it.** Every new router, background worker, or infrastructure client must emit structured logs, consistent error telemetry, and OTEL spans that SigNoz can ingest.
- **Give AI agents an execution playbook.** Follow the workflow below whenever you touch backend code so you never guess about logging, tracing, or error handling.
## 1. Baseline Observability Map
| Concern | Implementation (Code) | Tests / Docs |
|---|---|---|
| Structured logging + PII filtering | `backend/app/core/logging.py` (`setup_logging`, `filter_pii`, OTEL log exporter) | Phase‑1 doc §§0‑5 |
| Correlation + auth context | `backend/app/core/middleware.py` (`CorrelationIdMiddleware`, `AuthContextMiddleware`, `RequestLoggingMiddleware`) | `backend/tests/integration/api/test_correlation_id.py`, `test_context_leakage.py` |
| Global exception policy | `backend/app/core/app_factory.py::_register_exception_handlers` | Phase‑1 doc §1, integration tests |
| Security events | `backend/app/core/security_events.py::log_auth_violation` | Phase‑7 doc, SSE auth tests |
| SSE/RabbitMQ visibility | `backend/app/infrastructure/messaging/sse.py` (metrics, manual spans) | `docs/architecture/WEBSOCKET/SSE-Notif-Phases.md`, smoke scripts |
| OpenTelemetry wiring | `backend/app/core/telemetry.py` + `OpenTelemetryRequestSpanMiddleware` | Phase‑4 doc, SigNoz dashboards |
| SigNoz log export | `backend/app/core/logging.py::_configure_otel_log_handler` (OTLP gRPC) | SigNoz ingest pipelines |
## 2. Implementation Workflow (Follow in Order)
### Step 1 — Boot the Observability Stack
- `setup_logging()` must run before any user code; `lifespan()` already guarantees this, so never bypass `create_app()`.[^code-lifespan]
- Middleware order matters: Correlation ID → CORS → OTEL span middleware → Auth context → Request logging.[^code-middleware]
```68:91:backend/app/core/app_factory.py
app.add_middleware(RequestLoggingMiddleware)
if getattr(settings, "ENABLE_OPENTELEMETRY", False):
    app.add_middleware(OpenTelemetryRequestSpanMiddleware)
app.add_middleware(AuthContextMiddleware)
app.add_middleware(CORSMiddleware, ...)
app.add_middleware(CorrelationIdMiddleware, header_name="X-Request-ID")
```
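
Starlette places each newly registered middleware outside the ones added before it, which is why the excerpt above registers `CorrelationIdMiddleware` last even though it has to see the request first. The toy model below illustrates that wrapping rule only. `MiniApp` and the layer names are hypothetical stand-ins, not the real Starlette API.

```python
from typing import Callable, List

Handler = Callable[[List[str]], None]


class MiniApp:
    """Toy stand-in for a Starlette app (hypothetical, not the real API)."""

    def __init__(self) -> None:
        self._names: List[str] = []

    def add_middleware(self, name: str) -> None:
        # Like Starlette, prepend: the newest middleware becomes the outermost layer.
        self._names.insert(0, name)

    @staticmethod
    def _wrap(name: str, inner: Handler) -> Handler:
        def layer(trace: List[str]) -> None:
            trace.append(name)  # this layer sees the request first...
            inner(trace)        # ...then hands it to the layer it wraps
        return layer

    def handle(self) -> List[str]:
        """Return the order in which layers see an incoming request."""
        trace: List[str] = []

        def endpoint(t: List[str]) -> None:
            t.append("endpoint")

        handler: Handler = endpoint
        # Build the onion from the innermost layer outwards.
        for name in reversed(self._names):
            handler = self._wrap(name, handler)
        handler(trace)
        return trace


app = MiniApp()
app.add_middleware("added_first")
app.add_middleware("added_second")
app.add_middleware("added_last")
print(app.handle())
```

Reading a registration block bottom-up therefore gives the request-time order, which is a quick sanity check whenever a new middleware is added to `create_app()`.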
### Step 2 — Emit Structured Logs Everywhere
1. **Get the logger once per module.**
```python
from app.core.logging import get_logger
logger = get_logger(__name__)
```
2. **Log events, not sentences.** Use `logger.info("job_dispatched", job_id=job.id, ...)`.
3. **PII guard rails:** `filter_pii()` redacts keys listed in `SENSITIVE_KEYS`, and production mode allow-lists `SAFE_FIELDS`, so stick to those field names unless you add new ones upstream.[^code-safe-fields]
4. **Respect request context.** `add_correlation_id` and `add_request_context` inject `correlation_id`, `team_id`, and `user_id` automatically; never log those manually unless you are outside request scope.
5. **Prefer domain helpers:**
   * For external HTTP clients, use `create_external_async_client()`, which already logs sanitized request/response metadata.
   * For auth/security misuse, call `log_auth_violation()` so abuse counters stay accurate.
   * For SSE fan-out, rely on `SSEBroker` logging hooks instead of ad-hoc prints.
```208:236:backend/app/core/logging.py
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.stdlib.filter_by_level,
        ...
        add_correlation_id,
        add_request_context,
        filter_pii,
        structlog.stdlib.ProcessorFormatter.wrap_for_formatter,
    ],
    ...
)
```
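
To make the PII guard rails concrete, here is a minimal sketch of a processor with structlog's `(logger, method_name, event_dict)` signature. It is not the project's `filter_pii`: the contents of `SENSITIVE_KEYS` and `SAFE_FIELDS`, the `[REDACTED]` marker, the `make_pii_filter` factory, and the choice to drop unknown fields in production are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, MutableMapping

# Hypothetical values; the real sets live in backend/app/core/logging.py.
SENSITIVE_KEYS = frozenset({"password", "token", "authorization"})
SAFE_FIELDS = frozenset({"event", "level", "job_id", "env_name", "error_type"})
REDACTED = "[REDACTED]"

EventDict = MutableMapping[str, Any]
Processor = Callable[[Any, str, EventDict], EventDict]


def make_pii_filter(production: bool) -> Processor:
    """Build a processor that redacts secrets and, in production, allow-lists fields."""

    def pii_filter(logger: Any, method_name: str, event_dict: EventDict) -> EventDict:
        cleaned: Dict[str, Any] = {}
        for key, value in event_dict.items():
            if key.lower() in SENSITIVE_KEYS:
                cleaned[key] = REDACTED  # never emit the raw secret
            elif production and key not in SAFE_FIELDS:
                continue  # assumption: unknown fields are dropped in production
            else:
                cleaned[key] = value
        return cleaned

    return pii_filter


dev_filter = make_pii_filter(production=False)
prod_filter = make_pii_filter(production=True)
event = {"event": "login_attempt", "password": "hunter2", "debug_note": "x", "job_id": "42"}
print(dev_filter(None, "info", dict(event)))
print(prod_filter(None, "info", dict(event)))
```

Because a processor is just a callable, it can be unit-tested like this without configuring structlog at all.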
#### Sample pattern for new code
```python
from app.core.logging import get_logger

# JobContext and ExternalAPIError are project types; import them from their own modules.
logger = get_logger(__name__)


async def provision_environment(job_ctx: JobContext) -> None:
    logger.info(
        "env_provision_started",
        job_id=str(job_ctx.job_id),
        env_name=job_ctx.env_name,
    )
    try:
        ...
    except ExternalAPIError as exc:
        # Log the domain event once, then re-raise for the global handler.
        logger.error(
            "env_provision_failed",
            job_id=str(job_ctx.job_id),
            env_name=job_ctx.env_name,
            error_type=type(exc).__name__,
            exc_info=True,
        )
        raise
```
### Step 3 — Domain-Specific Logging Hooks
| Scenario | Required action |
|---|---|
| FastAPI endpoints | Let `RequestLoggingMiddleware` emit the request lifecycle log; within handlers, log only meaningful domain transitions (authorization decision, cross-tenant attempt, etc.). |
| Background jobs / scripts | Call `setup_logging()` once (existing scripts already do) and log start/end plus key waypoints. Reuse `JobContext` so logs include `team_id`/`env_name`. |
| External HTTP calls | Instantiate clients via `create_external_async_client(service="virtuozzo", ...)` to inherit consistent logging and redaction. |
| Security events | Use `log_auth_violation(...)` instead of ad-hoc warnings so Redis/in-memory abuse counters stay in sync. |
| SSE / RabbitMQ | Wrap fan-out batches in `_telemetry_span()` and rely on `SSEBroker.subscribe`/`unsubscribe` logs for subscriber counts. Log metadata only, never payloads. |
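
Outside request scope there is no middleware to populate `team_id` or `env_name`, so background jobs have to bind that context themselves. In the real pipeline `structlog.contextvars.bind_contextvars()` together with `merge_contextvars` plays this role. The standard-library sketch below only shows the mechanism, and every name in it (`bound_job_context`, `add_job_context`, the field values) is hypothetical.

```python
import contextvars
from contextlib import contextmanager
from typing import Any, Dict, Iterator

# Hypothetical context variable; the project's own helpers may use different names.
_job_context = contextvars.ContextVar("job_context", default=None)


def current_job_context() -> Dict[str, Any]:
    """Return a copy of the fields bound for the current job, if any."""
    return dict(_job_context.get() or {})


@contextmanager
def bound_job_context(**fields: Any) -> Iterator[None]:
    """Bind fields for the duration of a job, then restore the previous state."""
    token = _job_context.set({**current_job_context(), **fields})
    try:
        yield
    finally:
        _job_context.reset(token)  # prevents leakage into the next job


def add_job_context(logger: Any, method_name: str, event_dict: Dict[str, Any]) -> Dict[str, Any]:
    """Processor-style helper: explicit log fields win over bound context."""
    return {**current_job_context(), **event_dict}


with bound_job_context(team_id="team-7", env_name="staging"):
    inside = add_job_context(None, "info", {"event": "job_started"})
outside = add_job_context(None, "info", {"event": "job_idle"})
print(inside)
print(outside)
```

Resetting the token in `finally` is the part that matters: it is what keeps one job's tenant fields from appearing in the next job's logs.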
### Step 4 — Centralized Error Handling
- The FastAPI exception handlers already log once per failure; do not duplicate stack traces.
- Raise `HTTPException` for expected user errors (401–404) and let the handlers registered by `_register_exception_handlers` log them at DEBUG.
- For 5xx scenarios, set `exc_info=True` on the log call and re-raise so the global handler emits `unhandled_exception`.
- Authentication/rate limit denials must flow through `log_auth_violation()` so SigNoz dashboards can detect abuse spikes.
```108:174:backend/app/core/app_factory.py
@app.exception_handler(Exception)
async def unhandled_exception_handler(request, exc):
    logger.error(
        "unhandled_exception",
        error_type=type(exc).__name__,
        error_msg=str(exc),
        method=request.method,
        path=request.url.path,
        exc_info=True,
    )
    return JSONResponse(status_code=500, content={"detail": "Internal server error"})
```
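
The severity policy described in this step can be written down as a small pure function, which makes it easy to test. This is an illustrative sketch rather than the repository's handler code: `log_level_for_status` and `wants_stack_trace` are hypothetical names, and the WARNING fallback for other 4xx codes is an assumption that this guide does not specify.

```python
import logging


def log_level_for_status(status_code: int) -> int:
    """Map an HTTP status code to a stdlib logging level (illustrative policy)."""
    if status_code >= 500:
        return logging.ERROR    # server faults are logged as errors
    if 401 <= status_code <= 404:
        return logging.DEBUG    # expected user errors, per the guidance above
    if status_code >= 400:
        return logging.WARNING  # assumption: not specified in this guide
    return logging.INFO


def wants_stack_trace(status_code: int) -> bool:
    """Only 5xx responses carry exc_info, so each trace is logged once."""
    return status_code >= 500


for code in (200, 404, 429, 500):
    print(code, logging.getLevelName(log_level_for_status(code)), wants_stack_trace(code))
```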
### Step 5 — OpenTelemetry & SigNoz Integration
1. **Enable tracing/log export by configuration:**
   * `ENABLE_OPENTELEMETRY=true`
   * `OTEL_EXPORTER_OTLP_ENDPOINT=https://ingest.<region>.signoz.cloud:443` (or your self-hosted collector)
   * `OTEL_EXPORTER_OTLP_INSECURE=false` in production, unless the SigNoz endpoint is plaintext.
   * `OTEL_LOGS_ENABLED=true` to push Structlog output through the OTLP log exporter.
2. **Automatic server spans** are generated by `OpenTelemetryRequestSpanMiddleware`, which extracts W3C headers, sets HTTP attributes, and forwards correlation/team/user ids to SigNoz.[^code-otel-middleware]
3. **Manual spans**: use the `_telemetry_span()` helper or `get_tracer("package.path")` when wrapping background work (DB migrations, RabbitMQ handlers, Virtuozzo client operations). Record exceptions so SigNoz surfaces them as errors.
```247:293:backend/app/core/telemetry.py
with self._tracer.start_as_current_span(span_name, context=ctx, kind=SpanKind.SERVER) as span:
    set_attr("http.method", method)
    set_attr("url.path", path)
    ...
    if sc >= 500:
        span.set_status(Status(StatusCode.ERROR))
```
- **SigNoz Cloud headers:** When using SigNoz Cloud, set `OTEL_EXPORTER_OTLP_HEADERS="signoz-ingestion-key=<KEY>"` per their ingestion guide.[^signoz-python]
- **Logs to SigNoz:** `_configure_otel_log_handler()` attaches `opentelemetry.sdk._logs.LoggingHandler` so structured logs flow over OTLP. Keep log volume manageable by respecting `SAFE_FIELDS`.
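
For readers unfamiliar with the W3C header that the request span middleware extracts, the sketch below parses a version `00` `traceparent` value with the standard library. It illustrates the header format only. The project's middleware may well delegate to OpenTelemetry's propagator API instead, and `parse_traceparent` and `TraceParent` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags (2 hex)
_TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: Optional[str]) -> Optional[TraceParent]:
    """Return the parsed header, or None when it is missing or malformed."""
    if not header:
        return None
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None or match["version"] == "ff":
        return None
    # All-zero ids are explicitly invalid in the W3C Trace Context spec.
    if set(match["trace_id"]) == {"0"} or set(match["parent_id"]) == {"0"}:
        return None
    sampled = bool(int(match["flags"], 16) & 0x01)
    return TraceParent(match["trace_id"], match["parent_id"], sampled)


print(parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"))
```

Returning `None` for anything malformed mirrors the usual behaviour of starting a fresh trace rather than failing the request.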
### Step 6 — Validation Before You Ship
- **Unit / integration coverage:** run `pytest backend/tests/integration/api/test_correlation_id.py backend/tests/integration/api/test_context_leakage.py`.
- **Smoke logging sink:** `python backend/tests/smoke/logging/log_sink_probe.py --log-dir logs`.
- **SSE smoke:** follow `backend/tests/smoke/sse/README` (ensures broker metrics/logs stay sane).
- **SigNoz verification:** trigger a request locally with OTEL enabled and confirm the span/log shows up in SigNoz's service view within 1–2 minutes.[^signoz-troubleshooting]
## 3. SigNoz Configuration Quickstart
| Variable | Purpose | Example value |
|---|---|---|
| `ENABLE_OPENTELEMETRY` | Turns on tracing middleware + exporter | `true` |
| `OTEL_SERVICE_NAME` | Service tag shown in SigNoz | `mbpanel-api` |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | OTLP gRPC endpoint | `https://ingest.us-east-2.signoz.cloud:443` |
| `OTEL_EXPORTER_OTLP_HEADERS` | Auth header for SigNoz Cloud | `signoz-ingestion-key=...` |
| `OTEL_LOGS_ENABLED` | Routes structlog output to OTLP | `true` |
| `OTEL_METRICS_ENABLED` | Enables OTLP metric exporter | `true` (if collector supports it) |
| `LOG_TO_FILE`, `LOG_DIR` | Optional rotating JSONL sink for Virtuozzo | `true`, `/var/www/error` |
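
Per SigNoz's OTLP instructions, OTEL exporters speak gRPC, by default on port 4317 for a self-hosted collector; SigNoz Cloud ingest endpoints use port 443, as in the example endpoint above. No custom protocol adapters are needed.[^signoz-python]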
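
As a rough illustration of how these variables fit together, the sketch below reads them from a mapping and parses the comma-separated `key=value` header string. It is not the project's settings loader: `load_otel_settings`, `env_flag`, `parse_otlp_headers`, the accepted truthy spellings, and the `unknown_service` fallback are assumptions.

```python
from typing import Dict, Mapping


def parse_otlp_headers(raw: str) -> Dict[str, str]:
    """Parse 'key1=value1,key2=value2' into a dict, skipping malformed entries."""
    headers: Dict[str, str] = {}
    for pair in raw.split(","):
        if "=" not in pair:
            continue
        key, value = pair.split("=", 1)
        headers[key.strip()] = value.strip()
    return headers


def env_flag(env: Mapping[str, str], name: str, default: bool = False) -> bool:
    """Interpret common truthy spellings (an assumption; adjust to your settings layer)."""
    value = env.get(name)
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_otel_settings(env: Mapping[str, str]) -> Dict[str, object]:
    """Collect the variables from the table above into one dictionary."""
    return {
        "enabled": env_flag(env, "ENABLE_OPENTELEMETRY"),
        "service_name": env.get("OTEL_SERVICE_NAME", "unknown_service"),
        "endpoint": env.get("OTEL_EXPORTER_OTLP_ENDPOINT", ""),
        "insecure": env_flag(env, "OTEL_EXPORTER_OTLP_INSECURE"),
        "headers": parse_otlp_headers(env.get("OTEL_EXPORTER_OTLP_HEADERS", "")),
        "logs_enabled": env_flag(env, "OTEL_LOGS_ENABLED"),
        "metrics_enabled": env_flag(env, "OTEL_METRICS_ENABLED"),
    }


example_env = {
    "ENABLE_OPENTELEMETRY": "true",
    "OTEL_SERVICE_NAME": "mbpanel-api",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "https://ingest.us-east-2.signoz.cloud:443",
    "OTEL_EXPORTER_OTLP_HEADERS": "signoz-ingestion-key=<KEY>",
    "OTEL_LOGS_ENABLED": "true",
}
print(load_otel_settings(example_env))
```

Passing a plain mapping instead of reading `os.environ` directly keeps the function easy to test; in application code you would call `load_otel_settings(os.environ)`.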
## 4. Pull Request Checklist (Copy Into Descriptions)
- Every new module uses `get_logger(__name__)` once, no bare `print`.
- Logs only include `SAFE_FIELDS` or documented additions; sensitive data is never logged.
- Error paths rely on central handlers (`HTTPException`, `log_auth_violation`, or `capture_exception`) instead of ad-hoc prints.
- OTEL spans wrap any new external calls / background loops; errors mark the span status.
- SigNoz-specific env vars documented in `README` or `.env.example` when new ones are required.
- `pytest backend/tests/integration/api/test_correlation_id.py backend/tests/integration/api/test_context_leakage.py` succeeds locally.
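
The "no bare `print`" item lends itself to an automated check. Below is a minimal sketch using the standard-library `ast` module. The function name `find_bare_prints` is hypothetical and not part of the repository, and the sample source is made up for the demonstration.

```python
import ast
from typing import List


def find_bare_prints(source: str) -> List[int]:
    """Return the line numbers of calls to the built-in print()."""
    tree = ast.parse(source)  # parses only; the source is never executed
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "print"
    )


sample = (
    "from app.core.logging import get_logger\n"
    "logger = get_logger(__name__)\n"
    "def run():\n"
    "    print('debugging')\n"
    "    logger.info('job_started')\n"
)
print(find_bare_prints(sample))
```

Walking the syntax tree avoids the false positives a plain text search would produce for strings or comments that merely mention `print`.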
## References
[^structlog]: Structlog best practices (contextvars, processor chains) — https://www.structlog.org/en/stable/best_practices.html
[^otel-python]: OpenTelemetry Python manual instrumentation guide — https://opentelemetry.io/docs/languages/python/instrumentation/
[^signoz-python]: SigNoz — Auto-instrument Python apps with OpenTelemetry & OTLP ingest headers — https://signoz.io/docs/instrumentation/opentelemetry-python/
[^signoz-manual]: SigNoz — Manual spans in Python applications — https://signoz.io/opentelemetry/manual-spans-in-python-application/
[^signoz-troubleshooting]: SigNoz — Troubleshooting Python with OpenTelemetry tracing — https://signoz.io/blog/troubleshooting-python-with-opentelemetry-tracing/
[^code-lifespan]: backend/app/core/app_factory.py::lifespan — initializes logging, Sentry, OTEL, SSE broker.
[^code-middleware]: backend/app/core/app_factory.py::create_app — documents middleware order for observability.
[^code-safe-fields]: backend/app/core/logging.py::SAFE_FIELDS — allow-listed log attributes enforced in production.
[^code-otel-middleware]: backend/app/core/telemetry.py::OpenTelemetryRequestSpanMiddleware — extracts trace context, sets HTTP + tenant attributes.