
Product Requirements Document (PRD)

Virtuozzo Environment Synchronization - Stale-While-Revalidate Architecture

Document Information

| Field | Details |
| --- | --- |
| Product Name | Virtuozzo Environment Synchronization |
| Author | MBPanel Team |
| Date Created | 2026-01-09 |
| Last Updated | 2026-01-22 |
| Version | 2.0 |
| Status | Draft - Architecture Refresh |
| Architecture Type | Domain |
| Affected Components | app/domains/environments/, app/infrastructure/external/virtuozzo/, app/infrastructure/queue/ |

1. Executive Summary

The Virtuozzo Environment Synchronization feature applies a stale-while-revalidate caching pattern to eliminate redundant Virtuozzo API calls when team members access MBPanel. This is a classic distributed systems challenge involving caching strategies, consistency vs. availability trade-offs, and scalability.

1.1 Core Problem Statement

Current State:
- Every user login triggers a synchronous call to Virtuozzo API (2-5 second latency)
- 5 concurrent logins = 5 identical API calls (wasteful)
- UI blocks while fetching, creating poor UX
- Virtuozzo API rate limits are at risk during traffic spikes

Solution: Stale-While-Revalidate Architecture
- De-couple UI from Data Source: Dashboard never speaks directly to Virtuozzo for initial render
- Instant UI Response: Always return data immediately (fresh or stale), never block
- Background Queue System: Event-driven refresh for active users only (no global crons)
- Graceful Degradation: Serve stale data with warning rather than errors

1.2 Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                         STALE-WHILE-REVALIDATE PATTERN                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. UI Requests → Check Local Database (PostgreSQL)                         │
│  2. Data Found?                                                             │
│     - NO → Blocking fetch from Virtuozzo (unavoidable first-run)           │
│     - YES → Check timestamp (TTL)                                           │
│              ├─ FRESH (< 10 min) → Return immediately (Zero API calls)      │
│              └─ STALE (> 10 min) → Return stale IMMEDIATELY + push to queue │
│                                                                              │
│  3. Background Queue Consumer (Rate-limited, Active-User Priority)         │
│     - Processes queue at fixed rate (e.g., 50 req/sec)                      │
│     - Fetches from Virtuozzo asynchronously                                 │
│     - Validates payload before overwriting DB                               │
│     - Updates timestamp on success                                          │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
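The decision flow in the diagram can be sketched as a small pure function. This is a minimal sketch rather than the MBPanel implementation: `SyncDecision` and `decide` are hypothetical names, and treating data that is exactly 10 minutes old as stale is an assumption, since the diagram leaves that boundary unspecified.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class SyncDecision(Enum):
    BLOCKING_FETCH = "blocking_fetch"                    # no cached rows: first run
    SERVE_FRESH = "serve_fresh"                          # cached and within TTL
    SERVE_STALE_AND_ENQUEUE = "serve_stale_and_enqueue"  # cached but past TTL


def decide(
    last_synced_at: Optional[datetime],
    now: datetime,
    ttl: timedelta = timedelta(minutes=10),
) -> SyncDecision:
    """Map the cache state to one of the three branches in the diagram."""
    if last_synced_at is None:
        return SyncDecision.BLOCKING_FETCH
    if now - last_synced_at < ttl:
        return SyncDecision.SERVE_FRESH
    # Assumption: age == TTL counts as stale
    return SyncDecision.SERVE_STALE_AND_ENQUEUE


now = datetime(2026, 1, 22, 10, 30, tzinfo=timezone.utc)
print(decide(None, now).value)                         # blocking_fetch
print(decide(now - timedelta(minutes=5), now).value)   # serve_fresh
print(decide(now - timedelta(minutes=20), now).value)  # serve_stale_and_enqueue
```

Keeping the decision free of I/O makes each branch easy to unit test without a database or queue.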

1.3 Key Technical Decisions

| Decision | Rationale | Impact |
| --- | --- | --- |
| PostgreSQL as Primary Cache | Survives restarts, source of truth | No Redis dependency for cache |
| Queue-Based Refresh | Event-driven, not O(N) cron | Scales to millions of users |
| Active-User Priority | Only logged-in users trigger refresh | Passive users consume zero resources |
| Stale-While-Revalidate | Return stale data immediately, refresh async | UI always instant, never blocks |
| Rate-Limited Consumer | Protects Virtuozzo from traffic spikes | Controlled API call volume |
| No Global Cron | Don't iterate through all users | Linear scaling with active users |

2. Problem Statement

2.1 Background: Distributed Systems Optimization

This is a classic distributed caching problem with CAP theorem implications:

Consistency vs. Availability Trade-off:
- Strong Consistency: Every read gets latest data (requires calling Virtuozzo) → SLOW
- High Availability: Every read returns immediately (cached data) → MAY BE STALE
- Our Choice: AP with TTL-based eventual consistency (10-minute window)

Scalability Challenge (Millions of Users):
- Global Cron Approach: Iterate through N users = O(N) complexity
  - 1M users = 1M cron jobs, impossible to scale
- Queue-Based Approach: Only active users trigger refresh = O(active users)
  - 1M users, 10K active = 10K queue items, manageable
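A quick back-of-the-envelope check of the queue-based numbers, using the figures quoted in this document (10K active, 10-minute TTL, 50 req/sec consumer). The function names are illustrative, and the model assumes each active team or user triggers at most one refresh per TTL window.

```python
def steady_state_refresh_rate(active_teams: int, ttl_seconds: int) -> float:
    """Upper bound on refreshes/sec when each active team refreshes once per TTL."""
    return active_teams / ttl_seconds


def burst_drain_seconds(queued_tasks: int, rate_limit_per_sec: int) -> float:
    """Time to drain a burst of queued tasks at a fixed consumer rate."""
    return queued_tasks / rate_limit_per_sec


rate = steady_state_refresh_rate(active_teams=10_000, ttl_seconds=600)
drain = burst_drain_seconds(queued_tasks=10_000, rate_limit_per_sec=50)

print(f"steady state: {rate:.1f} refreshes/sec")  # steady state: 16.7 refreshes/sec
print(f"burst drain: {drain:.0f} sec")            # burst drain: 200 sec
```

Under these assumptions the steady-state load sits well below the 50 req/sec consumer limit, while a worst-case burst in which all 10K entries go stale at once would take a few minutes to drain.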

2.2 Current Implementation Analysis

File: backend/app/domains/environments/router.py

# CURRENT CODE (lines 11-17)
@router.get("/")
async def environments(
    user: AuthenticatedUser = Depends(get_current_user),
    db: AsyncSession = Depends(db_session),
) -> dict[str, list]:
    items = await list_environments(db=db, current_user=user)  # ← Just reads DB
    return {"items": items}

Current Behavior:
1. User logs in
2. Frontend calls GET /api/v1/environments
3. Backend queries PostgreSQL database directly
4. Returns whatever is in database (could be hours/days old)
5. No Virtuozzo API call is made
6. No freshness check is performed
7. Data only updates if someone manually calls POST /api/v1/environments/sync

Problems:
- Data can be stale forever
- No automatic refresh mechanism
- No staleness indicator to users
- Background refresh doesn't exist

2.3 Pain Points

| Pain Point | Current Impact | User Impact |
| --- | --- | --- |
| Indefinite Staleness | Data never refreshes automatically | Users see old environments |
| Manual Sync Required | Must remember to call POST /sync | Operational burden |
| No Freshness Visibility | Can't tell if data is current | Confusion about data accuracy |
| No Background Refresh | No proactive updates | Stale data persists |
| Blocking Manual Sync | POST /sync blocks for 2-5 seconds | Poor UX when forcing refresh |

3. Goals and Objectives

3.1 Business Goals

| Goal | Metric | Target |
| --- | --- | --- |
| Eliminate blocking UI | Percent of requests that block on the API | < 1% (first run only) |
| Reduce API calls | API calls per active user per session | 1 call per 10 min window |
| Maintain freshness | P95 data staleness | < 10 minutes |
| Scale to millions | System cost scales with | Active users, not total users |
| High availability | Uptime during Virtuozzo outages | Serve stale with warning |

3.2 User Goals

| User Story | Acceptance Criteria |
| --- | --- |
| Instant dashboard load | Environment list renders in < 100ms (p95) |
| Transparent staleness | Visual indicator when data is > 1 hour old |
| No blocking spinners | Never show loading spinner after first login |
| Automatic updates | Data refreshes within 10 minutes without manual action |
| Graceful errors | See "Data may be delayed" badge instead of error pages |

3.3 Non-Goals (Out of Scope)

  • Environment creation/deletion (existing provisioning flows)
  • Session key management (existing app/modules/sessions/)
  • Real-time WebSocket sync (use SSE for notifications)
  • Per-environment detailed status (focus on listing)
  • Multi-region Virtuozzo support (single endpoint per team)

4. Architecture: Stale-While-Revalidate Pattern

4.1 Happy Path: User Logs In

sequenceDiagram
    participant U as User Browser
    participant API as FastAPI
    participant DB as PostgreSQL
    participant Q as Refresh Queue
    participant W as Background Worker
    participant VZ as Virtuozzo API

    Note over U,DB: User logs in, navigates to environments
    U->>API: GET /api/v1/environments

    Note over API,DB: 1. Query Local Database
    API->>DB: SELECT * FROM environments WHERE team_id = ?
    DB-->>API: Return records (with last_updated timestamp)

    alt First Run (No Data)
        API-->>U: 404 + "Loading environments..." (spinner)
        Note over API,VZ: 2. BLOCKING FETCH (unavoidable)
        API->>VZ: GET /getenvs
        VZ-->>API: Return environments
        API->>DB: INSERT environments (last_updated = NOW())
        API-->>U: 200 OK (environments loaded)
    else Returning User (Data Exists)
        Note over API: 3. Check Timestamp (TTL = 10 min)
        alt FRESH (< 10 min old)
            API-->>U: 200 OK (instant, < 50ms)
            Note over U: User sees data immediately
        else STALE (> 10 min old)
            API-->>U: 200 OK (instant, stale data)
            Note over API: 4. PUSH TO UPDATE QUEUE (fire-and-forget)
            API->>Q: enqueue(team_id, priority="normal")
            Note over Q,W: 5. BACKGROUND WORKER (async, non-blocking)
            Q->>W: dequeue(team_id)
            W->>VZ: GET /getenvs
            VZ-->>W: Return environments
            W->>W: Validate payload
            W->>DB: UPDATE environments (last_updated = NOW())
            Note over U: Next page load shows fresh data
        end
    end
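In the stale branch of the diagram, every request that sees stale data enqueues a task, so several team members loading the dashboard at once could enqueue the same refresh several times. One way to collapse those into a single task is an "already queued" marker per team. The sketch below uses an in-memory dictionary as a stand-in for a shared store (in a multi-process deployment this would need something shared, such as a Redis key with an expiry); `InFlightRegistry` and its 60-second lease are hypothetical, not part of the existing codebase.

```python
import time
from typing import Callable, Dict


class InFlightRegistry:
    """In-memory stand-in for a shared 'refresh already queued' marker."""

    def __init__(
        self,
        lease_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._lease_seconds = lease_seconds
        self._clock = clock
        self._expires_at: Dict[int, float] = {}

    def try_claim(self, team_id: int) -> bool:
        """Return True only for the first caller within the lease window."""
        now = self._clock()
        expires = self._expires_at.get(team_id)
        if expires is not None and expires > now:
            return False  # a refresh for this team is already queued
        self._expires_at[team_id] = now + self._lease_seconds
        return True

    def release(self, team_id: int) -> None:
        """Called by the worker once the refresh finishes (or fails)."""
        self._expires_at.pop(team_id, None)


registry = InFlightRegistry()
requests = [123, 123, 123, 456, 123]  # team_id of each stale read
enqueued = [team_id for team_id in requests if registry.try_claim(team_id)]
print(enqueued)  # [123, 456]
```

The lease expiry means a crashed worker cannot block refreshes for a team indefinitely.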

4.2 Sad Path: Edge Cases and Failures

sequenceDiagram
    participant U as User Browser
    participant API as FastAPI
    participant DB as PostgreSQL
    participant Q as Refresh Queue
    participant W as Background Worker
    participant VZ as Virtuozzo API

    Note over U,DB: User logs in, data is STALE
    U->>API: GET /api/v1/environments
    API->>DB: SELECT * FROM environments WHERE team_id = ?
    DB-->>API: Return records (last_updated = 2 hours ago)

    Note over API: Data is STALE (> 10 min)
    API-->>U: 200 OK (return stale data immediately)
    API->>Q: enqueue(team_id, priority="normal")

    Note over Q,W: Background Worker processes queue
    Q->>W: dequeue(team_id)
    W->>VZ: GET /getenvs

    alt API Down / Timeout
        VZ--xW: Timeout (5 seconds)
        W->>W: Log error, retry later (exponential backoff)
        Note over DB: DB retains stale data (no overwrite)
        Note over U: User still sees old data (better than error)
    else API Returns 500
        VZ-->>W: 500 Internal Server Error
        W->>W: Log error, mark for retry
        Note over DB: Stale data preserved
    else Incomplete Data
        VZ-->>W: 200 OK (corrupted payload)
        W->>W: Schema validation FAILS
        W->>W: Discard new data, keep old
        Note over DB: Stale data protected from corruption
    end
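The "retry later (exponential backoff)" step can be sketched as a delay schedule. The document specifies only the retry count (3) and the backoff style; the 2-second base, the 60-second cap, and the optional jitter below are assumptions chosen for illustration.

```python
import random
from typing import List, Optional


def backoff_delays(
    attempts: int = 3,
    base_seconds: float = 2.0,
    cap_seconds: float = 60.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return the wait before each retry: base * 2^attempt, capped.

    When rng is given, each delay is drawn uniformly from [0, delay]
    ("full jitter") so that many failing tasks do not retry in lockstep.
    """
    delays = []
    for attempt in range(attempts):
        delay = min(cap_seconds, base_seconds * (2 ** attempt))
        if rng is not None:
            delay = rng.uniform(0.0, delay)
        delays.append(delay)
    return delays


print(backoff_delays())  # [2.0, 4.0, 8.0]
```

Because the database is never overwritten on failure, the cost of a long backoff is only that stale data is served for longer.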

4.3 ASCII Architecture Diagram

USER ACTION                  SYSTEM LOGIC                             DATA SOURCE
+-----------+       +-------------------------------------+
| User Logs |------>| 1. Query Local Database for Env     |
|    In     |       +-------------------------------------+
+-----------+                      |
                                   v
                      /-------------------------\
                      |   Data Exists in DB?    |
                      \-----------+-------------/
                                  |
               NO (First Run)     |     YES (Returning User)
             +--------------------+-----------------------+
             |                                            |
             v                                            v
+------------------------+                  /--------------------------\
| 2. BLOCKING FETCH      |                  |   Check Timestamp (TTL)  |
| (Show Loading Spinner) |                  \-----------+--------------/
+-----------+------------+                              |
            |                         +-----------------+-----------------+
            |                         |                                   |
            v                   FRESH (< 10m old)                 STALE (> 10m old)
+------------------------+            |                                   |
| Call Virtuozzo API     |            |                                   |
+-----------+------------+            |                     +-----------------------------+
            |                         |                     | 3. RETURN STALE DATA (FAST) |
            v                         v                     | (Dashboard loads instantly) |
+------------------------+    +----------------------+      +-------------+---------------+
| Validate & Save to DB  |<---|   RETURN DATA TO UI  |                    |
+-----------+------------+    +----------------------+                    |
            |                         ^                     +-------------v---------------+
            |                         |                     | 4. PUSH TO UPDATE QUEUE     |
            v                         |                     | (Fire-and-Forget)           |
+------------------------+            |                     +-------------+---------------+
|    RETURN DATA TO UI   |------------+                                    |
+------------------------+                                                 |
                                                                          |
       -------------------------------------------------------------------|
       | ASYNCHRONOUS BACKGROUND WORKER (Independent Process)
       |   - Rate-limited: 50 req/sec max
       |   - Active-user priority only
       |   - No global cron iteration
       |
       v
+----------------------+       +-------------------------+
|    Queue Consumer    |------>|    Call Virtuozzo API   |
+----------------------+       +-----------+-------------+
                                           |
                               /-------------------------\
                               |    Request Successful?  |
                               \-----------+-------------/
                                           |
                        YES                |             NO (Sad Path)
             +----------------------+      |      +-------------------------+
             | Compare Hash/Diff    |      |      | Log Error / Retry Later |
             | (Is data different?) |      |      | (Do NOT overwrite DB)   |
             +----------+-----------+      |      +-------------------------+
                        |                  |
             +----------v-----------+      |
             | Update DB Timestamp  |      |
             | & Save New Data      |      |
             +----------------------+      |

4.4 Queue-Based System Design

Why Queue-Based Instead of Global Cron?

| Approach | Complexity | Scalability | Resource Usage |
| --- | --- | --- | --- |
| Global Cron | O(N) - iterate all users | Breaks at 100K+ users | Wastes cycles on inactive users |
| Queue-Based | O(active users) | Scales to millions | Only processes active users |

Queue Architecture:

graph TB
    subgraph "Request Layer"
        API[FastAPI /environments endpoint]
    end

    subgraph "Queue Layer"
        Q[RabbitMQ / Redis Queue]
        ROUTER[Priority Router]
    end

    subgraph "Worker Layer"
        W1[Worker 1]
        W2[Worker 2]
        W3[Worker N]
    end

    subgraph "External Layer"
        VZ[Virtuozzo API]
    end

    API -->|enqueue sync task| Q
    Q --> ROUTER
    ROUTER -->|round-robin| W1
    ROUTER -->|round-robin| W2
    ROUTER -->|round-robin| W3

    W1 -->|rate-limited 50/sec| VZ
    W2 -->|rate-limited 50/sec| VZ
    W3 -->|rate-limited 50/sec| VZ

    style Q fill:#ff9,stroke:#333,stroke-width:3px
    style VZ fill:#f9f,stroke:#333,stroke-width:3px

Queue Message Format:

{
  "task_type": "sync_environments",
  "team_id": 123,
  "priority": "normal",
  "enqueued_at": "2026-01-22T10:30:00Z",
  "triggered_by": "stale_data",
  "session_key_encrypted": "..."
}
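For illustration, the message above could be modelled as a small typed object that serializes to and from JSON. This is a sketch using only the standard library; the `SyncTask` field names mirror the JSON keys shown, while the class itself, `utc_now_iso`, and the validation rule are assumptions rather than existing project code.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


def utc_now_iso() -> str:
    """Current UTC time in the same format as the example message."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


@dataclass(frozen=True)
class SyncTask:
    team_id: int
    session_key_encrypted: str
    enqueued_at: str
    priority: str = "normal"
    triggered_by: str = "stale_data"
    task_type: str = "sync_environments"

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "SyncTask":
        data = json.loads(raw)
        # Reject messages meant for a different consumer
        if data.get("task_type") != "sync_environments":
            raise ValueError(f"unexpected task_type: {data.get('task_type')!r}")
        return cls(**data)


task = SyncTask(team_id=123, session_key_encrypted="...", enqueued_at=utc_now_iso())
restored = SyncTask.from_json(task.to_json())
print(restored == task)  # True
```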

Worker Processing Logic:

# Pseudocode: decrypt, virtuozzo_client, validate_schema, hash_payload, logger
# and the DB helpers are assumed to be defined elsewhere in the codebase.
import asyncio

async def process_sync_task(task: SyncTask) -> None:
    # 1. Limit concurrent Virtuozzo calls.
    #    rate_limiter is assumed to be an asyncio.Semaphore; note that a
    #    semaphore caps in-flight calls, not calls per second.
    async with rate_limiter:
        # 2. Decrypt session key
        session_key = decrypt(task.session_key_encrypted)

        # 3. Call Virtuozzo with timeout
        try:
            response = await asyncio.wait_for(
                virtuozzo_client.get_envs(session=session_key, lazy=True),
                timeout=5.0,
            )
        except asyncio.TimeoutError:
            # DO NOT overwrite DB with error
            logger.error("sync_timeout", team_id=task.team_id)
            return  # Stale data preserved

        # 4. Validate payload schema
        if not validate_schema(response):
            logger.error("sync_invalid_payload", team_id=task.team_id)
            return  # Stale data protected from corruption

        # 5. Compare hash (avoid unnecessary writes)
        new_hash = hash_payload(response)
        old_hash = await get_last_sync_hash(task.team_id)
        if new_hash == old_hash:
            # Data unchanged: skip the rewrite but still refresh the timestamp,
            # otherwise the rows stay stale and every request re-enqueues a sync
            await update_last_synced_at(task.team_id, new_hash)
            logger.info("sync_unchanged", team_id=task.team_id)
            return

        # 6. Update database
        await upsert_environments(task.team_id, response)

        # 7. Update timestamp and stored hash
        await update_last_synced_at(task.team_id, new_hash)

        logger.info("sync_complete", team_id=task.team_id)

5. Technical Specifications

5.1 Configuration

| Setting | Default | Description |
| --- | --- | --- |
| VZ_ENV_TTL_SECONDS | 600 | 10 minutes - Data freshness window |
| VZ_ENV_MAX_STALE_SECONDS | 3600 | 1 hour - Show warning after this |
| VZ_SYNC_QUEUE_RATE_LIMIT | 50 | Requests per second max |
| VZ_SYNC_TIMEOUT_SECONDS | 5 | Virtuozzo API timeout |
| VZ_SYNC_RETRY_ATTEMPTS | 3 | Max retry attempts with exponential backoff |
| VZ_SYNC_WORKER_CONCURRENCY | 10 | Number of worker processes |
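As an illustration of how these settings might be loaded and sanity-checked, here is a standard-library sketch. The environment variable names and defaults come from the table above; the `SyncSettings` class and its validation rule are hypothetical, and the project may prefer to extend its existing settings object in config.py instead.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class SyncSettings:
    ttl_seconds: int = 600
    max_stale_seconds: int = 3600
    queue_rate_limit: int = 50
    timeout_seconds: float = 5.0
    retry_attempts: int = 3
    worker_concurrency: int = 10

    def __post_init__(self) -> None:
        # The warning threshold only makes sense at or beyond the TTL
        if self.max_stale_seconds < self.ttl_seconds:
            raise ValueError("VZ_ENV_MAX_STALE_SECONDS must be >= VZ_ENV_TTL_SECONDS")

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "SyncSettings":
        return cls(
            ttl_seconds=int(env.get("VZ_ENV_TTL_SECONDS", "600")),
            max_stale_seconds=int(env.get("VZ_ENV_MAX_STALE_SECONDS", "3600")),
            queue_rate_limit=int(env.get("VZ_SYNC_QUEUE_RATE_LIMIT", "50")),
            timeout_seconds=float(env.get("VZ_SYNC_TIMEOUT_SECONDS", "5")),
            retry_attempts=int(env.get("VZ_SYNC_RETRY_ATTEMPTS", "3")),
            worker_concurrency=int(env.get("VZ_SYNC_WORKER_CONCURRENCY", "10")),
        )


settings = SyncSettings.from_env({"VZ_ENV_TTL_SECONDS": "300"})
print(settings.ttl_seconds, settings.max_stale_seconds)  # 300 3600
```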

5.2 Database Schema

-- Add to existing environments table
ALTER TABLE environments
ADD COLUMN last_synced_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
ADD COLUMN last_sync_hash VARCHAR(64),  -- SHA-256 for change detection
ADD COLUMN sync_status VARCHAR(20) DEFAULT 'ok',  -- ok, stale, failed
ADD COLUMN sync_error_message TEXT,
ADD COLUMN api_calls_count INTEGER DEFAULT 0,
ADD COLUMN last_sync_duration_ms INTEGER;

-- Indexes for freshness queries
CREATE INDEX idx_environments_team_sync ON environments(team_id, last_synced_at);
CREATE INDEX idx_environments_sync_status ON environments(sync_status);

-- Sync queue tracking (optional, for monitoring)
CREATE TABLE environment_sync_queue (
    id SERIAL PRIMARY KEY,
    team_id INTEGER NOT NULL,
    enqueued_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    started_at TIMESTAMP WITH TIME ZONE,
    completed_at TIMESTAMP WITH TIME ZONE,
    status VARCHAR(20),  -- pending, processing, completed, failed
    error_message TEXT,
    retry_count INTEGER DEFAULT 0
);

5.3 API Contract

GET /api/v1/environments

Response:

{
  "items": [
    {
      "env_name": "staging",
      "display_name": "Staging Environment",
      "shortdomain": "staging.example.com",
      "app_id": "12345",
      "status": "active",
      "nodes": [...]
    }
  ],
  "meta": {
    "last_synced_at": "2026-01-22T10:25:00Z",
    "is_stale": false,
    "stale_warning": null,
    "sync_in_progress": false
  }
}

Headers:
- X-Data-Stale: true if data is older than TTL
- X-Data-Last-Sync: ISO timestamp of last sync
- X-Sync-In-Progress: true if background sync is running
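The `meta` object and the three headers can be derived from a single timestamp. The sketch below is one possible shape, not the existing service code: `build_staleness` is a hypothetical helper, the warning text reuses the "Data may be delayed" badge wording from section 3.2, and treating age equal to the TTL as stale is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional, Tuple


def build_staleness(
    last_synced_at: datetime,
    now: datetime,
    sync_in_progress: bool,
    ttl_seconds: int = 600,
    max_stale_seconds: int = 3600,
) -> Tuple[Dict[str, object], Dict[str, str]]:
    """Return (meta, headers) for the GET /api/v1/environments response."""
    age = (now - last_synced_at).total_seconds()
    is_stale = age >= ttl_seconds  # assumption: age == TTL counts as stale
    warning: Optional[str] = "Data may be delayed" if age > max_stale_seconds else None
    synced = last_synced_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

    meta = {
        "last_synced_at": synced,
        "is_stale": is_stale,
        "stale_warning": warning,
        "sync_in_progress": sync_in_progress,
    }
    headers = {
        "X-Data-Stale": str(is_stale).lower(),
        "X-Data-Last-Sync": synced,
        "X-Sync-In-Progress": str(sync_in_progress).lower(),
    }
    return meta, headers


now = datetime(2026, 1, 22, 10, 30, tzinfo=timezone.utc)
meta, headers = build_staleness(now - timedelta(minutes=5), now, sync_in_progress=False)
print(meta["last_synced_at"], meta["is_stale"])  # 2026-01-22T10:25:00Z False
print(headers["X-Data-Stale"])                   # false
```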


6. Sad Path Engineering

6.1 Failure Modes and Mitigations

| Failure Mode | Impact | Mitigation | Test Case |
| --- | --- | --- | --- |
| Virtuozzo API timeout | Stale data served | Log error, preserve stale data | TEST-TIMEOUT-001 |
| Virtuozzo API 500 | Stale data served | Exponential backoff retry | TEST-500-001 |
| Corrupted payload | Stale data protected | Schema validation before DB write | TEST-CORRUPT-001 |
| Queue processing failure | Stale data persists | Retry with exponential backoff | TEST-QUEUE-FAIL-001 |
| Database connection lost | 503 Service Unavailable | Circuit breaker, retry connection | TEST-DB-001 |
| Session key expired | Re-fetch session | Decrypt error → re-auth | TEST-AUTH-001 |
| Worker crash mid-sync | Partial data update | Transaction rollback | TEST-WORKER-CRASH-001 |
| Queue backlog | Data stays stale longer | Priority queue for recent logins | TEST-BACKLOG-001 |
| Massive concurrent logins | Queue spike | Rate limiter throttles API calls | TEST-SPIKE-001 |
| Clock skew between servers | Incorrect TTL calculation | Use DB time, not app time | TEST-CLOCK-001 |

6.2 Data Integrity Strategies

Atomic Updates:

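from sqlalchemy import delete

# db is the AsyncSession in scope; Environment is the project's ORM model
async def upsert_environments(team_id: int, env_data: list[dict]):
    async with db.begin():  # Commits on success, rolls back on any exception
        # Clear old environments
        await db.execute(
            delete(Environment).where(Environment.team_id == team_id)
        )
        # Insert new environments, making sure each row carries the owning team_id
        for env in env_data:
            db.add(Environment(**{**env, "team_id": team_id}))
        # No explicit commit here: leaving the block commits all or nothing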

Hash-Based Change Detection:

import hashlib
import json

def compute_env_hash(environments: list[dict]) -> str:
    """SHA-256 hash for change detection."""
    # sort_keys makes the hash independent of dict key order
    normalized = json.dumps(environments, sort_keys=True)
    return hashlib.sha256(normalized.encode()).hexdigest()

# Only update DB if hash changed
if new_hash != old_hash:
    await upsert_environments(team_id, environments)

Schema Validation:

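def validate_virtuozzo_response(data: dict) -> bool:
    """Validate response structure before DB write."""
    # Guard against non-object payloads (e.g. a bare list or string)
    if not isinstance(data, dict):
        return False

    required_fields = ["result", "infos"]
    if not all(k in data for k in required_fields):
        return False

    if not isinstance(data.get("infos"), list):
        return False

    for env in data["infos"]:
        # Each entry must be an object carrying an "env" key
        if not isinstance(env, dict) or "env" not in env:
            return False

    return True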


7. Observability

7.1 Metrics

| Metric | Type | Labels | Purpose |
| --- | --- | --- | --- |
| env_sync_requests_total | Counter | team_id, status | Track sync volume |
| env_sync_duration_seconds | Histogram | team_id, result | Track sync latency |
| env_sync_queue_depth | Gauge | priority | Monitor queue backlog |
| env_data_age_seconds | Histogram | team_id | Track data freshness |
| env_stale_served_total | Counter | team_id, staleness_bucket | Track staleness |
| virtuozzo_api_requests_total | Counter | endpoint, status | API call tracking |
| virtuozzo_api_errors_total | Counter | endpoint, error_type | Failure analysis |

7.2 Structured Logging

logger.info(
    "sync_started",
    team_id=team_id,
    triggered_by="stale_data",
    current_data_age_seconds=age.total_seconds(),
    task_id=task_id
)

logger.info(
    "sync_complete",
    team_id=team_id,
    duration_seconds=duration,
    env_count=len(environments),
    data_changed=(new_hash != old_hash),
    api_calls=1
)

logger.error(
    "sync_failed",
    team_id=team_id,
    error_type="timeout",
    error_message=str(e),
    retry_count=retry_count,
    will_retry=True
)

7.3 Alerts

| Alert | Condition | Severity | Action |
| --- | --- | --- | --- |
| High Stale Rate | stale_served / total > 0.5 | Warning | Check worker health |
| Queue Backlog | queue_depth > 1000 | Warning | Scale workers |
| API Error Rate | api_errors / api_calls > 0.1 | Critical | Check Virtuozzo status |
| Data Age | data_age_p95 > 1800 (30 min) | Warning | Review TTL settings |
| Worker Down | worker_heartbeat > 60s | Critical | Restart workers |

8. Testing Requirements

8.1 Unit Tests

# Test: Fresh data returned from cache
async def test_fresh_data_returned_immediately():
    # Setup: DB has data from 5 minutes ago
    await insert_test_envs(team_id=1, age_minutes=5)

    # Act: Request environments
    result = await list_environments(team_id=1)

    # Assert: No API call made, data returned
    assert_virtuozzo_not_called()
    assert len(result) == 5

# Test: Stale data returned, queue task created
async def test_stale_data_triggers_background_sync():
    # Setup: DB has data from 20 minutes ago
    await insert_test_envs(team_id=1, age_minutes=20)

    # Act: Request environments
    result = await list_environments(team_id=1)

    # Assert: Stale data returned immediately
    assert len(result) == 5

    # Assert: Queue task created
    assert_queue_contains(team_id=1)

# Test: Corrupted payload does not overwrite DB
async def test_corrupted_payload_preserves_stale_data():
    # Setup: DB has valid data
    await insert_test_envs(team_id=1)

    # Act: Simulate sync with corrupted response
    await process_sync_task(
        team_id=1,
        response={"invalid": "payload"}
    )

    # Assert: DB unchanged
    db_data = await get_environments(team_id=1)
    assert len(db_data) == 5  # Original data preserved

8.2 Sad Path Tests (3:1 Ratio)

| Test | Scenario | Expected |
| --- | --- | --- |
| test_timeout_preserves_stale | Virtuozzo times out | Stale data in DB, error logged |
| test_500_preserves_stale | Virtuozzo returns 500 | Stale data in DB, retry scheduled |
| test_invalid_json_rejected | Non-JSON response | Stale data preserved |
| test_schema_validation_fails | Missing required fields | Stale data preserved |
| test_worker_crash_rollback | Worker dies mid-transaction | Transaction rolled back |
| test_queue_full_rejected | Queue at capacity | Return 503, retry later |
| test_session_key_expired | 401 from Virtuozzo | Re-fetch session, retry |
| test_database_connection_lost | DB unavailable | Return 503, log error |

8.3 Integration Tests

# Test: End-to-end stale-while-revalidate flow
async def test_stale_while_revalidate_e2e():
    # 1. Setup: Insert stale data (20 min old)
    await insert_test_envs(team_id=1, age_minutes=20)

    # 2. Act: Request environments (should return stale + enqueue)
    response = await client.get("/api/v1/environments")
    assert response.status_code == 200
    assert response.json()["meta"]["is_stale"] is True

    # 3. Act: Process queue (background worker)
    await worker_process_one_task()

    # 4. Assert: DB updated with fresh data
    db_data = await get_environments(team_id=1)
    assert db_data[0]["last_synced_at"] > now() - timedelta(minutes=1)

    # 5. Act: Request again (should be fresh)
    response = await client.get("/api/v1/environments")
    assert response.json()["meta"]["is_stale"] is False

9. Success Metrics

9.1 Key Performance Indicators

| Metric | Baseline | Target | Measurement |
| --- | --- | --- | --- |
| UI Response Time (p95) | 2000-5000ms | < 100ms | Frontend RUM |
| Blocking Requests | 100% | < 1% | Backend metrics |
| API Calls per Session | 1 per login | 1 per 10 min | API call counter |
| Data Staleness (p95) | Infinite | < 10 min | last_synced_at age |
| Queue Processing Time | N/A | < 30 sec | Queue duration metric |
| Error Rate | < 0.1% | < 0.1% | HTTP 5xx rate |

9.2 Scalability Metrics

| Scenario | Metric | Target |
| --- | --- | --- |
| 1K concurrent logins | Queue depth | < 100 |
| 10K concurrent logins | Queue depth | < 1000 |
| 100K concurrent logins | API rate (per sec) | ≤ 50 |
| 1M total users, 10K active | Queue processing | Linear with active users |

10. Open Questions

| ID | Question | Owner | Priority |
| --- | --- | --- | --- |
| Q-001 | Is 10-minute TTL acceptable for business requirements? | Product | HIGH |
| Q-002 | Should stale data warning appear at 30 min or 1 hour? | UX | MEDIUM |
| Q-003 | What's the Virtuozzo API rate limit per session key? | Engineering | HIGH |
| Q-004 | Should we implement priority queue for paying customers? | Product | LOW |
| Q-005 | How many concurrent workers do we provision? | DevOps | MEDIUM |

11. Implementation Checklist

| Phase | Task | Status | Notes |
| --- | --- | --- | --- |
| Phase 1 | Database schema changes | Pending | Add last_synced_at, last_sync_hash |
| Phase 1 | Create queue infrastructure | Pending | RabbitMQ/Redis queue setup |
| Phase 2 | Implement sync_or_fetch_cached() | Pending | Core stale-while-revalidate logic |
| Phase 2 | Implement background worker | Pending | Queue consumer with rate limiting |
| Phase 3 | Add staleness indicators to API | Pending | Response headers, meta object |
| Phase 3 | Implement frontend stale badge | Pending | Visual indicator |
| Phase 4 | Observability (metrics, logging) | Pending | OpenTelemetry integration |
| Phase 4 | Testing (unit, integration, sad path) | Pending | 85% coverage target |
| Phase 5 | Documentation and runbooks | Pending | Ops guides |
| Phase 5 | Production deployment | Pending | Feature flags |

12. References

| Document | Location |
| --- | --- |
| Virtuozzo API Documentation | https://docs.jelastic.com/api/ |
| MBPanel Architecture Guide | /docs/architecture/001-hybrid-modular-ddd.md |
| SSE Notification Pattern | /docs/architecture/WEBSOCKET/SSE-Notif.md |
| Current Implementation | /backend/app/domains/environments/ |

13. Feasibility Analysis & Current Tech Stack Assessment

13.1 Overall Feasibility: FEASIBLE with MINOR MODIFICATIONS

Date: 2026-01-22

Assessment Method: Code-based evidence analysis of existing infrastructure

The Stale-While-Revalidate architecture design aligns well with the existing tech stack. All required infrastructure components are present or can be added with minimal effort.

13.2 Infrastructure Component Analysis

Queue Infrastructure

| PRD Requirement | Current State | Evidence File | Gap | Recommendation |
| --- | --- | --- | --- | --- |
| RabbitMQ message broker | AVAILABLE | docker-compose.yml:24-43 | None | Already configured |
| Redis for rate limiting | AVAILABLE | docker-compose.yml:16-22 | None | Already configured |
| Celery workers | AVAILABLE | celery_app.py:8-23 | None | Configured with RabbitMQ |
| SSE event fan-out | AVAILABLE | sse.py:48-100 | None | SSEBroker implemented |
| Dedicated sync queue | MISSING | N/A | Low | Add jobs.env.sync queue to config |

Configuration Evidence (backend/app/core/config.py:31-39):

rabbitmq_url: str = "amqp://mbpanel:mbpanel_pass@rabbitmq:5672/"  # pragma: allowlist secret
rabbitmq_jobs_exchange: str = "jobs.direct"
rabbitmq_jobs_queue: str = "jobs.env.create"  # Add new: jobs.env.sync

Database & ORM

| PRD Requirement | Current State | Evidence File | Gap | Recommendation |
| --- | --- | --- | --- | --- |
| PostgreSQL 16 | AVAILABLE | docker-compose.yml:2-14 | None | Supported major version |
| Alembic migrations | AVAILABLE | alembic/versions/004_add_environment_fields.py | None | Pattern established |
| SQLAlchemy 2.0 async | AVAILABLE | environment.py:45-80 | None | AsyncSession ready |
| last_synced_at column | MISSING | environment.py | Medium | Add via migration |
| last_sync_hash column | MISSING | environment.py | Low | Add via migration |
| sync_status column | MISSING | environment.py | Low | On-the-fly calculation already exists |
| Transaction rollback | AVAILABLE | SQLAlchemy core | None | async with db.begin() |

Existing Migration Pattern (004_add_environment_fields.py:18-23):

def upgrade() -> None:
    op.add_column('environments', sa.Column('ishaenabled', sa.Boolean(), nullable=True))
    op.add_column('environments', sa.Column('sslstate', sa.Boolean(), nullable=True))
    op.add_column('environments', sa.Column('region', sa.String(100), nullable=True))

Recommended Migration: 005_add_environment_sync_tracking.py

Background Workers (Celery)

| PRD Requirement | Current State | Evidence File | Gap | Recommendation |
| --- | --- | --- | --- | --- |
| Celery app configured | AVAILABLE | celery_app.py:8-23 | None | RabbitMQ broker |
| Task pattern | AVAILABLE | tasks.py:14-18 | None | create_environment_task exists |
| Late ack (at-least-once) | AVAILABLE | celery_app.py:15 | None | acks_late=True |
| Async support | AVAILABLE | tasks.py:17 | None | asyncio.run() |
| Sync task | MISSING | N/A | Medium | Add sync_environments_task |
| Rate limiting (50/s) | PARTIAL | rate_limit.py:11-26 | Low | Use semaphore + Redis limiter |

Existing Task Pattern (tasks.py:14-18):

@celery_app.task(name="domains.environments.create_environment", bind=True, acks_late=True)
def create_environment_task(self, payload: dict[str, Any]) -> None:
    logger.info("environment_job_received", job_id=payload.get("job_id"))
    asyncio.run(service.execute_environment_job(payload))

Virtuozzo Integration

| PRD Requirement | Current State | Evidence File | Gap | Recommendation |
| --- | --- | --- | --- | --- |
| VirtuozzoClient | AVAILABLE | client.py:36-57 | None | Async HTTP client |
| GetEnvs API | AVAILABLE | service.py:97-116 | None | fetch_environments_and_nodes() |
| Timeout support | AVAILABLE | client.py:53-55 | Very Low | 15s default, make configurable |
| Lazy loading | AVAILABLE | service.py:99 | None | lazy: bool parameter |
| Session management | AVAILABLE | Team model | None | Encrypted keys stored |
| Hash-based change detection | MISSING | N/A | Low | Add SHA-256 to service.py |

Function Signature Evidence (service.py:97-102):

async def fetch_environments_and_nodes(
    session_key: str,
    lazy: bool = False,
    owner_uid: int | None = None,
    vz_client: VirtuozzoClient | None = None
) -> dict[str, Any]:

API Layer

| PRD Requirement | Current State | Evidence File | Gap | Recommendation |
| --- | --- | --- | --- | --- |
| FastAPI router | AVAILABLE | router.py:11-26 | None | GET + POST exist |
| Service layer | AVAILABLE | service.py:41-45 | None | list_environments() |
| Response serialization | AVAILABLE | service.py:527-544 | None | _serialize_environment() |
| CSRF protection | AVAILABLE | router.py:20 | None | dependencies=[Depends(csrf_protected)] |
| meta object in response | MISSING | router.py:17 | Medium | Add to return type |
| Response headers | MISSING | N/A | Low | Add X-Data-Stale, etc. |
| Queue enqueue on stale | MISSING | N/A | Medium | Add publish_event() call |

Current Response Format (router.py:11-17):

@router.get("/")
async def environments(...) -> dict[str, list]:
    items = await list_environments(db=db, current_user=user)
    return {"items": items}  # Missing: meta object

Rate Limiting

| PRD Requirement | Current State | Evidence File | Gap | Recommendation |
| --- | --- | --- | --- | --- |
| Redis-based limiter | AVAILABLE | rate_limit.py:11-26 | None | Counter-based (sufficient) |
| Redis client | AVAILABLE | config.py:46 | None | redis_url configured |
| Login rate limit pattern | IMPLEMENTED | rate_limit.py:28-47 | None | Pattern to follow |
| Token bucket algorithm | MISSING | N/A | Medium | Use asyncio.Semaphore(50) instead (caps concurrent calls, not calls per second) |
| Virtuozzo-specific limiter | MISSING | N/A | Low | Create wrapper class |

Existing Implementation (rate_limit.py:11-26):

class RateLimiter:
    """Redis-based counter limiter using INCR semantics."""
    async def check(self, key: str, limit: int, window_seconds: int) -> None:
        count = await self.redis.incr(key)
        if count == 1:
            await self.redis.expire(key, window_seconds)
        if count > limit:
            raise HTTPException(status_code=429)
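The Virtuozzo-specific wrapper recommended in the table could look something like the sketch below, built on `asyncio.Semaphore`. The class name and its counters are hypothetical. Note that a semaphore caps the number of calls in flight at once rather than the number of calls per time window, so it complements the Redis counter above instead of replacing a true token bucket.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import TypeVar

T = TypeVar("T")


class VirtuozzoCallLimiter:
    """Caps how many Virtuozzo calls run concurrently (hypothetical wrapper)."""

    def __init__(self, max_concurrent: int = 50) -> None:
        self._max_concurrent = max_concurrent
        self._semaphore = None  # created lazily inside the running event loop
        self.in_flight = 0
        self.peak_in_flight = 0

    async def run(self, call: Callable[[], Awaitable[T]]) -> T:
        """Await `call()` once a slot is free, releasing the slot afterwards."""
        if self._semaphore is None:
            self._semaphore = asyncio.Semaphore(self._max_concurrent)
        async with self._semaphore:
            self.in_flight += 1
            self.peak_in_flight = max(self.peak_in_flight, self.in_flight)
            try:
                return await call()
            finally:
                self.in_flight -= 1
```

A worker would share one limiter instance and wrap each outbound call, for example `await limiter.run(lambda: fetch_environments_and_nodes(session_key))`.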

Frontend

| PRD Requirement | Current State | Evidence File | Gap | Recommendation |
| --- | --- | --- | --- | --- |
| Next.js 16 | AVAILABLE | package.json:39 | None | Latest version |
| React 19 | AVAILABLE | package.json:42 | None | Latest version |
| TanStack Query v5 | INSTALLED | package.json:32-33 | Low | Available but unused for environments |
| useEnvironments hook | IMPLEMENTED | useEnvironments.ts:20-106 | None | Custom hook pattern |
| API client | AVAILABLE | api-client.ts:18-135 | None | Full fetch wrapper |
| SyncStatus types | DEFINED | environment.types.ts:5-9 | None | Interface exists |
| Stale badge display | PARTIAL | sites/page.tsx:62-68 | Low | Code checks sync_status |
| Background refresh | MISSING | N/A | Medium | Add SSE listener |

Current Frontend Pattern (useEnvironments.ts:55-73):

const syncAndRefetch = useCallback(async () => {
  const syncResult = await triggerSync();  // Currently blocks
  const data = await getEnvironments();
  setEnvironments(data);
}, []);

Observability

| PRD Requirement | Current State | Evidence File | Gap | Recommendation |
| --- | --- | --- | --- | --- |
| Structured logging | AVAILABLE | logging.py | None | Structlog with JSON |
| OpenTelemetry | AVAILABLE | telemetry.py:92-172 | None | Full OTEL setup |
| Tracer helper | AVAILABLE | observability.py:10-23 | None | telemetry_span() |
| SigNoz compatibility | YES | telemetry.py:254-286 | None | Canonical attributes |
| Context propagation | AVAILABLE | Core modules | None | request_id_ctx, team_id_ctx |
| Sync-specific metrics | MISSING | N/A | Low | Add using get_meter() |

Existing Logging Pattern (service.py:624):

logger.info("sync_team_environments_started", team_id=team_id)

Security & Sad Path

| PRD Requirement | Current State | Evidence File | Gap | Recommendation |
| --- | --- | --- | --- | --- |
| Crypto (decrypt/encrypt) | AVAILABLE | crypto.py | None | decrypt_str(), encrypt_str() |
| Pydantic validation | AVAILABLE | Throughout | None | Request/response validation |
| SQLAlchemy transactions | AVAILABLE | Core | None | async with db.begin() |
| Transaction rollback | AUTOMATIC | SQLAlchemy | None | On exception |
| Error handling | AVAILABLE | HTTPException | None | Proper status codes |
| Security events logging | AVAILABLE | security_events.py | None | Auth violation tracking |

13.3 Configuration Variables

| PRD Variable | Default | Current State | Evidence File | Action |
| --- | --- | --- | --- | --- |
| VZ_ENV_TTL_SECONDS | 600 | HARDCODED | service.py:577 | Move to config |
| VZ_ENV_MAX_STALE_SECONDS | 3600 | HARDCODED | service.py:579 | Move to config |
| VZ_SYNC_QUEUE_RATE_LIMIT | 50 | NOT DEFINED | N/A | Add to config |
| VZ_SYNC_TIMEOUT_SECONDS | 5 | 15 (different) | client.py:53 | Override for sync |
| VZ_SYNC_RETRY_ATTEMPTS | 3 | NOT DEFINED | N/A | Add to config |
| VZ_SYNC_WORKER_CONCURRENCY | 10 | NOT DEFINED | N/A | Add to config |

Existing Config Pattern (config.py:19-33):

class Settings(BaseSettings):
    model_config = ConfigDict(
        env_prefix="VZ_",
        env_file=".env",
        env_file_encoding="utf-8",
        extra="ignore",
    )
    virtuozzo_timeout_seconds: int = 15
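To make the six variables from the table concrete, here is a standalone sketch that reads them from the environment with the defaults listed in this PRD. It deliberately uses only the standard library so it runs on its own; `SyncSettings` and `from_env` are hypothetical names, and in the real codebase the fields would presumably be added to the existing `Settings` class instead.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SyncSettings:
    """Sync-related settings; defaults mirror the PRD variable table."""

    env_ttl_seconds: int = 600             # VZ_ENV_TTL_SECONDS
    env_max_stale_seconds: int = 3600      # VZ_ENV_MAX_STALE_SECONDS
    sync_queue_rate_limit: int = 50        # VZ_SYNC_QUEUE_RATE_LIMIT
    sync_timeout_seconds: int = 5          # VZ_SYNC_TIMEOUT_SECONDS
    sync_retry_attempts: int = 3           # VZ_SYNC_RETRY_ATTEMPTS
    sync_worker_concurrency: int = 10      # VZ_SYNC_WORKER_CONCURRENCY

    @classmethod
    def from_env(cls, environ: Mapping[str, str] = os.environ) -> "SyncSettings":
        """Build settings from VZ_-prefixed variables, falling back to defaults."""
        values = {
            f.name: int(environ.get(f"VZ_{f.name.upper()}", f.default))
            for f in fields(cls)
        }
        return cls(**values)
```

For example, `SyncSettings.from_env({"VZ_SYNC_TIMEOUT_SECONDS": "8"})` overrides only the sync timeout and keeps the other five defaults.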

13.4 DDD Architecture Alignment

The design aligns with the existing Hybrid Modular DDD architecture: every PRD design component maps to an existing location in the codebase.

| PRD Design Component | Current Architecture Location | Evidence |
| --- | --- | --- |
| EnvironmentSyncService | app/domains/environments/service.py | ✅ Match |
| Queue Consumer Worker | app/domains/environments/tasks.py | ✅ Match |
| Repository Layer | app/domains/environments/repository.py | ✅ Match |
| API Router | app/domains/environments/router.py | ✅ Match |
| API Composition | app/api/v1/__init__.py:34-36 | ✅ Already wired |
| Virtuozzo External Service | app/infrastructure/external/virtuozzo/ | ✅ Match |
| SSE Events | app/infrastructure/messaging/events.py | ✅ Match |

13.5 Implementation Effort Estimate

| Phase | Tasks | Estimated Time | Dependencies |
| --- | --- | --- | --- |
| Phase 1 | Database migration + Config | 1 day | None |
| Phase 2 | Service layer + Queue logic | 2 days | Phase 1 |
| Phase 3 | Celery worker + Rate limiting | 1 day | Phase 2 |
| Phase 4 | API response format + Headers | 1 day | Phase 2 |
| Phase 5 | Frontend stale badges | 1 day | Phase 4 |
| Phase 6 | Testing (85% coverage) | 2 days | All phases |
| Phase 7 | Docs + Runbooks | 1 day | All phases |
| Phase 8 | Production deployment | 1 day | All phases |

Total Estimated Effort: 10 days

13.6 Risk Assessment

| Risk | Probability | Impact | Mitigation |
| --- | --- | --- | --- |
| Database migration failure | Low | High | Test migration in staging first |
| Rate limiter blocking workers | Low | Medium | Use semaphore + fallback |
| Virtuozzo API timeout too aggressive | Medium | Medium | Make configurable, monitor |
| Queue backlog during spikes | Medium | Low | Add priority routing |
| Frontend stale badge confusion | Low | Low | Clear UX copy + tooltip |

13.7 Recommendation

PROCEED WITH IMPLEMENTATION

The core infrastructure already exists. The remaining gaps are rated Medium or lower and are well understood, and the design fits the existing architecture patterns.

Next Steps:

1. Create 005_add_environment_sync_tracking.py migration
2. Add config variables to config.py
3. Implement sync_or_fetch_cached() in service.py
4. Add sync_environments_task to tasks.py
5. Update API response format
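As a rough illustration of the `sync_or_fetch_cached()` step, the sketch below returns cached data immediately and enqueues a refresh only when the data is older than the TTL. Everything except the function name and the 600-second default is an assumption: the in-memory `cache` dict stands in for the PostgreSQL lookup, `enqueue_refresh` stands in for the queue publish, and returning an empty list on a cache miss is a guess at the intended behaviour. The real function would likely be async and database-backed.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class CachedEnvironments:
    items: list[dict[str, Any]]
    synced_at: float  # Unix timestamp of the last successful sync


@dataclass
class SyncDecision:
    items: list[dict[str, Any]]
    is_stale: bool
    refresh_enqueued: bool


def sync_or_fetch_cached(
    team_id: int,
    cache: dict[int, CachedEnvironments],
    enqueue_refresh: Callable[[int], None],
    now: float,
    ttl_seconds: int = 600,
) -> SyncDecision:
    """Serve whatever is cached right away; refresh in the background if stale."""
    entry = cache.get(team_id)
    if entry is None:
        # Assumed cache-miss behaviour: answer with no items and let the worker fill in.
        enqueue_refresh(team_id)
        return SyncDecision(items=[], is_stale=True, refresh_enqueued=True)
    is_stale = (now - entry.synced_at) > ttl_seconds
    if is_stale:
        enqueue_refresh(team_id)
    return SyncDecision(items=entry.items, is_stale=is_stale, refresh_enqueued=is_stale)
```

The caller never waits on Virtuozzo in either branch, which is the property the stale-while-revalidate design depends on.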


14. Implementation Checklist (Updated with Evidence)

| Phase | Task | Status | Evidence File | Notes |
| --- | --- | --- | --- | --- |
| Phase 1 | Database schema changes | Pending | alembic/versions/ | Create 005_add_environment_sync_tracking.py |
| Phase 1 | Configuration variables | Pending | app/core/config.py | Add 6 new VZ_* settings |
| Phase 1 | Add sync queue config | Pending | celery_app.py | Add jobs.env.sync queue |
| Phase 2 | Implement sync_or_fetch_cached() | Pending | service.py:41-45 | Add after list_environments() |
| Phase 2 | Add hash computation | Pending | service.py | SHA-256 function |
| Phase 2 | Update list_environments() TTL check | Pending | service.py:41-45 | Add stale detection |
| Phase 2 | Add queue publish on stale | Pending | service.py | Use publish_event() |
| Phase 3 | Implement sync_environments_task | Pending | tasks.py:14-18 | Follow create_environment_task pattern |
| Phase 3 | Add rate limiting to worker | Pending | tasks.py | Use asyncio.Semaphore(50) |
| Phase 3 | Add retry with backoff | Pending | tasks.py | Celery autoretry_for |
| Phase 4 | Add meta object to response | Pending | router.py:11-17 | Update return type |
| Phase 4 | Add response headers | Pending | router.py | Add X-Data-Stale, etc. |
| Phase 4 | Update TypeScript types | Pending | environment.types.ts:45-47 | Add meta interface |
| Phase 5 | Display stale badges | Pending | sites/page.tsx:62-68 | Already checks sync_status |
| Phase 5 | Add SSE listener (optional) | Pending | Frontend | For background updates |
| Phase 6 | Add sync metrics | Pending | service.py | Use get_meter() |
| Phase 6 | Unit tests | Pending | tests/domains/environments/ | Target 85% coverage |
| Phase 6 | Sad path tests (3:1 ratio) | Pending | tests/ | Timeout, 500, corruption |
| Phase 6 | Integration tests | Pending | tests/integration/ | End-to-end flow |
| Phase 7 | Update runbooks | Pending | docs/ | Ops procedures |
| Phase 7 | Update API docs | Pending | docs/ | Include new response format |
| Phase 8 | Staging deployment | Pending | Infra | Test migration |
| Phase 8 | Production deployment | Pending | Infra | Feature flags |
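The "retry with backoff" item is expected to rely on Celery's `autoretry_for` option in the worker. To show the retry schedule itself without depending on Celery, here is a standalone asyncio sketch. The helper names, the one-second base delay, and the 30-second cap are illustrative assumptions; only the default of 3 retry attempts comes from this PRD.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import TypeVar

T = TypeVar("T")


def backoff_delays(attempts: int, base_seconds: float = 1.0, cap_seconds: float = 30.0) -> list[float]:
    """Delays before each retry: base * 2**n for n = 0..attempts-1, capped."""
    return [min(base_seconds * (2 ** n), cap_seconds) for n in range(attempts)]


async def call_with_retry(
    call: Callable[[], Awaitable[T]],
    retry_attempts: int = 3,  # VZ_SYNC_RETRY_ATTEMPTS default
    retry_on: tuple[type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Try `call` once, then retry up to `retry_attempts` times with backoff."""
    for delay in backoff_delays(retry_attempts):
        try:
            return await call()
        except retry_on:
            # Wait before the next attempt; other exception types propagate at once.
            await sleep(delay)
    # Final attempt: any exception raised here reaches the caller.
    return await call()
```

The `sleep` parameter is injectable so tests can record the delays instead of actually waiting.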

15. References

| Document | Location |
| --- | --- |
| Virtuozzo API Documentation | https://docs.jelastic.com/api/ |
| MBPanel Architecture Guide | /docs/architecture/001-hybrid-modular-ddd.md |
| SSE Notification Pattern | /docs/architecture/WEBSOCKET/SSE-Notif.md |
| Current Implementation | /backend/app/domains/environments/ |

16. Revision History

| Version | Date | Author | Changes |
| --- | --- | --- | --- |
| 1.0 | 2026-01-09 | MBPanel Team | Initial draft (lazy sync + background refresh) |
| 2.0 | 2026-01-22 | MBPanel Team | Major refactor: Stale-While-Revalidate + Queue-based architecture |
| 2.1 | 2026-01-22 | MBPanel Team | Added Section 13: Feasibility Analysis with evidence-based gaps and recommendations |

17. Approvals

| Role | Name | Signature | Date |
| --- | --- | --- | --- |
| Product Manager | | | |
| Engineering Lead | | | |
| Architecture Lead | | | |
| DevOps Lead | | | |