Error Tracker

Version: 1.0.0 | Last Updated: 2025-01-06 | Status: Active

Overview

The Error Tracker catalogues production errors encountered in MBPanel and documents the symptoms, diagnosis, and resolution procedure for each one. Each error entry also records its patterns, frequency, and prevention strategies.

Error Categories

Active Errors

Active Errors - Currently unresolved production errors requiring attention.

Resolved Errors

Resolved Errors - Historical errors that have been fixed and deployed.

Error Patterns

Error Patterns - Common error patterns and reusable solutions.

Quick Stats

| Metric | Value | Trend |
|---|---|---|
| Active Errors | 3 | ↔ Stable |
| Resolved (Last 30 Days) | 12 | ↓ Decreasing |
| Mean Time to Resolve | 4.2 hours | ↓ Improving |
| Critical Errors (P0) | 0 | ✅ None |

Recent Activity

Last 7 Days

  • 2025-01-06: [ERR-142] Database connection pool exhaustion (Resolved)
  • 2025-01-05: [ERR-141] RabbitMQ consumer slowdown (Active)
  • 2025-01-04: [ERR-140] Virtuozzo API timeout spike (Resolved)

Top 5 Error Patterns

  1. Database Connection Errors (15%)
  2. External API Timeouts (12%)
  3. Validation Errors (10%)
  4. Authentication Failures (8%)
  5. Rate Limit Exceeded (6%)

Error Severity Matrix

| Severity | Definition | Response Time | Examples |
|---|---|---|---|
| P0 - Critical | System-wide outage | < 15 min | Complete service down |
| P1 - High | Major feature broken | < 1 hour | Cannot create sites |
| P2 - Medium | Degraded performance | < 4 hours | Slow API responses |
| P3 - Low | Minor issue | < 24 hours | UI inconsistencies |
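
The response-time targets in the matrix can be expressed as a small lookup. The sketch below is illustrative only: `RESPONSE_TARGETS`, `response_deadline`, and `is_overdue` are hypothetical names, not part of MBPanel, and the durations are copied from the matrix above.

```python
from datetime import datetime, timedelta

# Response-time targets taken from the severity matrix.
RESPONSE_TARGETS = {
    "P0": timedelta(minutes=15),
    "P1": timedelta(hours=1),
    "P2": timedelta(hours=4),
    "P3": timedelta(hours=24),
}


def response_deadline(severity: str, detected_at: datetime) -> datetime:
    """Return the time by which a response is due (hypothetical helper)."""
    return detected_at + RESPONSE_TARGETS[severity]


def is_overdue(severity: str, detected_at: datetime, now: datetime) -> bool:
    """The matrix uses strict '<' targets, so reaching the deadline counts as overdue."""
    return now >= response_deadline(severity, detected_at)
```

An unknown severity raises `KeyError`, which is likely preferable to silently applying a default target.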

Error Lifecycle

stateDiagram-v2
    [*] --> Detected: Error occurs
    Detected --> Investigating: Logged
    Investigating --> Diagnosed: Root cause found
    Investigating --> Investigating: More info needed
    Diagnosed --> InProgress: Fix in progress
    InProgress --> Resolved: Fix deployed
    Resolved --> Verified: Monitoring confirms
    Verified --> [*]: Closed
    Diagnosed --> Recurring: Pattern identified
    Recurring --> [*]: Documented as pattern
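
The lifecycle diagram can also be read as a transition table. The following is a minimal sketch, assuming the state names from the diagram; `TRANSITIONS`, `can_transition`, and `advance` are hypothetical and do not describe any existing MBPanel code.

```python
# Allowed transitions, transcribed from the state diagram.
TRANSITIONS = {
    "Detected": {"Investigating"},
    "Investigating": {"Investigating", "Diagnosed"},  # self-loop: more info needed
    "Diagnosed": {"InProgress", "Recurring"},
    "InProgress": {"Resolved"},
    "Resolved": {"Verified"},
    "Verified": set(),   # terminal: closed
    "Recurring": set(),  # terminal: documented as pattern
}


def can_transition(current: str, target: str) -> bool:
    """Return True if the diagram allows moving from current to target."""
    return target in TRANSITIONS.get(current, set())


def advance(current: str, target: str) -> str:
    """Return the new state, or raise ValueError for a transition the diagram lacks."""
    if not can_transition(current, target):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target
```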

Error Entry Template

Each error entry follows this structure:

# ERR-XXX: [Error Title]

**Severity:** P0 | P1 | P2 | P3
**Status:** Active | Resolved | Recurring
**First Seen:** YYYY-MM-DD
**Last Seen:** YYYY-MM-DD
**Occurrences:** N times

## Symptoms
[What the error looks like from user/system perspective]

## Diagnosis
[Root cause analysis and debugging steps]

## Resolution
[Step-by-step resolution procedure]

## Prevention
[How to prevent recurrence]

## References
- [Related ADR-XXX](../../architecture/adr/xxx.md)
- [Related ADD-XXX](../../architecture/add/xxx.md)

Creating Error Entries

For Active Errors

  1. Create file in operations/errors/active/
  2. Name: ERR-XXX-error-title.md
  3. Update operations/errors/.meta/index.json
  4. Link to related incidents
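
As an illustration of the naming convention in step 2, the sketch below builds a file name and path from an error number and title. The slug rules (lowercase, runs of non-alphanumeric characters collapsed to hyphens) and the three-digit zero padding are assumptions; the source only specifies the `ERR-XXX-error-title.md` shape. The helper names are hypothetical.

```python
import re
from pathlib import PurePosixPath

ACTIVE_DIR = PurePosixPath("operations/errors/active")


def error_filename(error_number: int, title: str) -> str:
    """Build 'ERR-XXX-error-title.md' (slug rules are an assumption)."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"ERR-{error_number:03d}-{slug}.md"


def active_entry_path(error_number: int, title: str) -> PurePosixPath:
    """Return the path of a new entry under the active errors directory."""
    return ACTIVE_DIR / error_filename(error_number, title)
```

For example, `error_filename(141, "RabbitMQ Consumer Slowdown")` yields `ERR-141-rabbitmq-consumer-slowdown.md`.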

For Resolved Errors

  1. Move from active/ to resolved/
  2. Update with resolution details
  3. Add prevention measures
  4. Update metadata in operations/errors/.meta/index.json
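
The metadata update for a resolved error might look like the sketch below. It assumes the index layout shown under Error Metadata (`active_errors` and `resolved_errors` lists, with `resolved` and `time_to_resolve` fields on resolved entries); `mark_resolved` is a hypothetical helper, and moving the Markdown file itself is not covered here.

```python
def mark_resolved(index: dict, error_id: str, resolved_on: str, time_to_resolve: str) -> dict:
    """Move an entry from active_errors to resolved_errors in the loaded index.

    Mutates `index` in place and returns the new resolved record.
    Raises KeyError if the error is not currently active.
    """
    active = index.get("active_errors", [])
    for position, entry in enumerate(active):
        if entry["id"] == error_id:
            removed = active.pop(position)
            record = {
                "id": removed["id"],
                "title": removed["title"],
                "severity": removed["severity"],
                "resolved": resolved_on,
                "time_to_resolve": time_to_resolve,
            }
            index.setdefault("resolved_errors", []).append(record)
            return record
    raise KeyError(error_id)
```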

For Error Patterns

  1. Create in operations/errors/patterns/
  2. Document reusable solutions
  3. Link to related errors
  4. Include code examples

Error Metadata

The .meta/index.json file tracks all errors:

{
  "active_errors": [
    {
      "id": "ERR-141",
      "title": "RabbitMQ Consumer Slowdown",
      "severity": "P2",
      "first_seen": "2025-01-05",
      "occurrences": 23
    }
  ],
  "resolved_errors": [
    {
      "id": "ERR-140",
      "title": "Virtuozzo API Timeout Spike",
      "severity": "P1",
      "resolved": "2025-01-04",
      "time_to_resolve": "2.5 hours"
    }
  ],
  "patterns": [
    {
      "id": "PAT-001",
      "title": "Database Connection Pool Exhaustion",
      "occurrences": 15,
      "prevention": "Connection pooling best practices"
    }
  ]
}
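
A short sketch of how this index could be summarized, for example to derive figures like those under Quick Stats. It assumes `time_to_resolve` is always written as `"<number> hours"`, as in the example above; `parse_hours` and `summarize` are hypothetical names.

```python
import json


def parse_hours(text: str) -> float:
    """Parse a duration such as '2.5 hours' (the only format assumed here)."""
    value, unit = text.split()
    if unit not in ("hour", "hours"):
        raise ValueError(f"unsupported duration unit: {unit!r}")
    return float(value)


def summarize(index: dict) -> dict:
    """Count active and resolved errors and average the time to resolve."""
    active = index.get("active_errors", [])
    resolved = index.get("resolved_errors", [])
    durations = [parse_hours(entry["time_to_resolve"]) for entry in resolved]
    return {
        "active_count": len(active),
        "resolved_count": len(resolved),
        "mean_hours_to_resolve": sum(durations) / len(durations) if durations else None,
    }


def summarize_file(path: str) -> dict:
    """Load an index.json file from disk and summarize it."""
    with open(path, encoding="utf-8") as handle:
        return summarize(json.load(handle))
```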

Monitoring & Alerting

Automatic Error Detection

  • Errors are automatically detected from application logs
  • Threshold-based alerts for error rate increases
  • Pattern recognition for recurring issues
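
The source does not say how threshold-based alerting is implemented. As one possible reading, the sketch below counts errors in a sliding time window and signals when the count exceeds a threshold. `ErrorRateMonitor`, the 60-second window, and the threshold of 10 are all assumptions made for illustration.

```python
from collections import deque


class ErrorRateMonitor:
    """Sliding-window error counter (illustrative; not MBPanel's actual detector)."""

    def __init__(self, window_seconds: float = 60.0, threshold: int = 10):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._timestamps = deque()

    def record(self, timestamp: float) -> bool:
        """Record one error; return True if the window now exceeds the threshold."""
        self._timestamps.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop errors that have fallen out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) > self.threshold
```

Timestamps are plain seconds (for example from `time.time()`) and are expected to arrive in non-decreasing order.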

Alert Channels

  • P0/P1: PagerDuty + Slack #mbpanel-incidents
  • P2: Slack #mbpanel-ops
  • P3: Daily digest email
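
The routing above amounts to a lookup from severity to channels. In this sketch the channel strings (`pagerduty`, `slack:#mbpanel-incidents`, and so on) are hypothetical identifiers chosen for illustration; only the severity-to-channel mapping itself comes from the list above.

```python
# Severity-to-channel routing, transcribed from the Alert Channels list.
ALERT_CHANNELS = {
    "P0": ["pagerduty", "slack:#mbpanel-incidents"],
    "P1": ["pagerduty", "slack:#mbpanel-incidents"],
    "P2": ["slack:#mbpanel-ops"],
    "P3": ["email:daily-digest"],
}


def channels_for(severity: str) -> list:
    """Return a copy of the channel list for a severity; unknown values raise KeyError."""
    return list(ALERT_CHANNELS[severity])
```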

Integration with Incident Management

Errors often trigger incidents:

  1. Error detected and logged
  2. Threshold exceeded → Alert triggered
  3. Incident created if severity is P1 or higher (P0 or P1)
  4. Error entry created/updated
  5. Resolution documented after fix
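
Because P0 is more severe than P1, the incident-creation rule in this flow is easy to invert in code. A minimal sketch, assuming the four severity labels from the severity matrix; `SEVERITY_RANK` and `should_open_incident` are hypothetical names.

```python
# Lower rank means higher severity.
SEVERITY_RANK = {"P0": 0, "P1": 1, "P2": 2, "P3": 3}


def should_open_incident(severity: str) -> bool:
    """True for P0 and P1, the severities at or above the P1 cut-off."""
    return SEVERITY_RANK[severity] <= SEVERITY_RANK["P1"]
```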