Product Requirements Document (PRD)

Virtuozzo Environment Synchronization - Stale-While-Revalidate Architecture

Document Information

Field	Details
Product Name	Virtuozzo Environment Synchronization
Author	MBPanel Team
Date Created	2026-01-09
Last Updated	2026-01-22
Version	2.0
Status	Draft - Architecture Refresh
Architecture Type	Domain
Affected Components	`app/domains/environments/`, `app/infrastructure/external/virtuozzo/`, `app/infrastructure/queue/`

1. Executive Summary

The Virtuozzo Environment Synchronization feature uses a stale-while-revalidate caching pattern to eliminate redundant Virtuozzo API calls when team members access MBPanel. The design touches three classic distributed-systems concerns: caching strategy, the consistency vs. availability trade-off, and scalability.

1.1 Core Problem Statement

Current State:

- Every user login triggers a synchronous call to Virtuozzo API (2-5 second latency)
- 5 concurrent logins = 5 identical API calls (wasteful)
- UI blocks while fetching, creating poor UX
- Virtuozzo API rate limits are at risk during traffic spikes

Solution: Stale-While-Revalidate Architecture

- Decouple UI from Data Source: Dashboard never speaks directly to Virtuozzo for initial render
- Instant UI Response: Always return data immediately (fresh or stale), never block
- Background Queue System: Event-driven refresh for active users only (no global crons)
- Graceful Degradation: Serve stale data with warning rather than errors

1.2 Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                         STALE-WHILE-REVALIDATE PATTERN                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. UI Requests → Check Local Database (PostgreSQL)                         │
│  2. Data Found?                                                             │
│     - NO → Blocking fetch from Virtuozzo (unavoidable first-run)           │
│     - YES → Check timestamp (TTL)                                           │
│              ├─ FRESH (< 10 min) → Return immediately (Zero API calls)      │
│              └─ STALE (> 10 min) → Return stale IMMEDIATELY + push to queue │
│                                                                              │
│  3. Background Queue Consumer (Rate-limited, Active-User Priority)         │
│     - Processes queue at fixed rate (e.g., 50 req/sec)                      │
│     - Fetches from Virtuozzo asynchronously                                 │
│     - Validates payload before overwriting DB                               │
│     - Updates timestamp on success                                          │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
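
The branch selection in steps 1 and 2 can be sketched as a pure function. This is a minimal illustration, not the service code: the `Freshness` enum and `classify` function are hypothetical names, and treating data at exactly 10 minutes as fresh is an assumption, since the diagram only defines "< 10 min" and "> 10 min".

```python
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional

TTL = timedelta(minutes=10)  # mirrors the 10-minute window in the diagram


class Freshness(Enum):
    MISSING = "missing"  # nothing synced yet: blocking first-run fetch
    FRESH = "fresh"      # serve from DB, zero API calls
    STALE = "stale"      # serve from DB immediately and enqueue a refresh


def classify(last_synced_at: Optional[datetime], now: datetime,
             ttl: timedelta = TTL) -> Freshness:
    """Decide which branch of the read path applies."""
    if last_synced_at is None:
        return Freshness.MISSING
    # Assumption: data exactly at the TTL boundary still counts as fresh.
    if now - last_synced_at <= ttl:
        return Freshness.FRESH
    return Freshness.STALE
```

Passing `now` explicitly keeps the function deterministic in tests and leaves room to supply the database clock instead of the application clock.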

1.3 Key Technical Decisions

Decision	Rationale	Impact
PostgreSQL as Primary Cache	Survives restarts, source of truth	No Redis dependency for cache
Queue-Based Refresh	Event-driven, not O(N) cron	Scales to millions of users
Active-User Priority	Only logged-in users trigger refresh	Passive users consume zero resources
Stale-While-Revalidate	Return stale data immediately, refresh async	UI always instant, never blocks
Rate-Limited Consumer	Protects Virtuozzo from traffic spikes	Controlled API call volume
No Global Cron	Don't iterate through all users	Linear scaling with active users

2. Problem Statement

2.1 Background: Distributed Systems Optimization

This is a classic distributed caching problem with CAP theorem implications:

Consistency vs. Availability Trade-off:

- Strong Consistency: Every read gets latest data (requires calling Virtuozzo) → SLOW
- High Availability: Every read returns immediately (cached data) → MAY BE STALE
- Our Choice: AP with TTL-based eventual consistency (10-minute window)

Scalability Challenge (Millions of Users):

- Global Cron Approach: Iterate through N users = O(N) complexity
  - 1M users = 1M refresh jobs per cycle, whether or not anyone is logged in
- Queue-Based Approach: Only active users trigger refresh = O(active users)
  - 1M users, 10K active = 10K queue items, manageable
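
A quick back-of-the-envelope check of the two approaches, assuming one Virtuozzo call per queue item and the example consumer rate of 50 req/sec used elsewhere in this document. The function name is illustrative only.

```python
TTL_SECONDS = 600  # 10-minute freshness window


def drain_time_seconds(items: int, rate_per_sec: int = 50) -> float:
    """Seconds needed to process `items` refreshes at a fixed rate."""
    return items / rate_per_sec


full_sweep = drain_time_seconds(1_000_000)  # 20000.0 s, roughly 5.6 hours
active_only = drain_time_seconds(10_000)    # 200.0 s

# A full sweep cannot complete inside one TTL window; the active set can.
print(full_sweep > TTL_SECONDS)    # True
print(active_only < TTL_SECONDS)   # True
```

Under these assumptions a global sweep of 1M users would take far longer than the 10-minute window it is meant to maintain, while 10K active users drain in a little over three minutes.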

2.2 Current Implementation Analysis

File: backend/app/domains/environments/router.py

# CURRENT CODE (lines 11-17)
@router.get("/")
async def environments(
    user: AuthenticatedUser = Depends(get_current_user),
    db: AsyncSession = Depends(db_session),
) -> dict[str, list]:
    items = await list_environments(db=db, current_user=user)  # ← Just reads DB
    return {"items": items}

Current Behavior:

1. User logs in
2. Frontend calls GET /api/v1/environments
3. Backend queries PostgreSQL database directly
4. Returns whatever is in database (could be hours/days old)
5. No Virtuozzo API call is made
6. No freshness check is performed
7. Data only updates if someone manually calls POST /api/v1/environments/sync

Problems:

- Data can be stale forever
- No automatic refresh mechanism
- No staleness indicator to users
- Background refresh doesn't exist

2.3 Pain Points

Pain Point	Current Impact	User Impact
Indefinite Staleness	Data never refreshes automatically	Users see old environments
Manual Sync Required	Must remember to call POST /sync	Operational burden
No Freshness Visibility	Can't tell if data is current	Confusion about data accuracy
No Background Refresh	No proactive updates	Stale data persists
Blocking Manual Sync	POST /sync blocks for 2-5 seconds	Poor UX when forcing refresh

3. Goals and Objectives

3.1 Business Goals

Goal	Metric	Target
Eliminate blocking UI	Percentage of requests that block on the Virtuozzo API	< 1% (first run only)
Reduce API calls	API calls per active user per session	1 call per 10 min window
Maintain freshness	P95 data staleness	< 10 minutes
Scale to millions	System cost scales with	Active users, not total users
High availability	Uptime during Virtuozzo outages	Serve stale with warning

3.2 User Goals

User Story	Acceptance Criteria
Instant dashboard load	Environment list renders in < 100ms (p95)
Transparent staleness	Visual indicator when data is > 1 hour old
No blocking spinners	Never show loading spinner after first login
Automatic updates	Data refreshes within 10 minutes without manual action
Graceful errors	See "Data may be delayed" badge instead of error pages

3.3 Non-Goals (Out of Scope)

- Environment creation/deletion (existing provisioning flows)
- Session key management (existing app/modules/sessions/)
- Real-time WebSocket sync (use SSE for notifications)
- Per-environment detailed status (focus on listing)
- Multi-region Virtuozzo support (single endpoint per team)

4. Architecture: Stale-While-Revalidate Pattern

4.1 Happy Path: User Logs In

sequenceDiagram
    participant U as User Browser
    participant API as FastAPI
    participant DB as PostgreSQL
    participant Q as Refresh Queue
    participant W as Background Worker
    participant VZ as Virtuozzo API

    Note over U,DB: User logs in, navigates to environments
    U->>API: GET /api/v1/environments

    Note over API,DB: 1. Query Local Database
    API->>DB: SELECT * FROM environments WHERE team_id = ?
    DB-->>API: Return records (with last_updated timestamp)

    alt First Run (No Data)
        Note over U: Frontend shows "Loading environments..." spinner while the request is pending
        Note over API,VZ: 2. BLOCKING FETCH (unavoidable)
        API->>VZ: GET /getenvs
        VZ-->>API: Return environments
        API->>DB: INSERT environments (last_updated = NOW())
        API-->>U: 200 OK (environments loaded)
    else Returning User (Data Exists)
        Note over API: 3. Check Timestamp (TTL = 10 min)
        alt FRESH (< 10 min old)
            API-->>U: 200 OK (instant, < 50ms)
            Note over U: User sees data immediately
        else STALE (> 10 min old)
            API-->>U: 200 OK (instant, stale data)
            Note over API: 4. PUSH TO UPDATE QUEUE (fire-and-forget)
            API->>Q: enqueue(team_id, priority="normal")
            Note over Q,W: 5. BACKGROUND WORKER (async, non-blocking)
            Q->>W: dequeue(team_id)
            W->>VZ: GET /getenvs
            VZ-->>W: Return environments
            W->>W: Validate payload
            W->>DB: UPDATE environments (last_updated = NOW())
            Note over U: Next page load shows fresh data
        end
    end

4.2 Sad Path: Edge Cases and Failures

sequenceDiagram
    participant U as User Browser
    participant API as FastAPI
    participant DB as PostgreSQL
    participant Q as Refresh Queue
    participant W as Background Worker
    participant VZ as Virtuozzo API

    Note over U,DB: User logs in, data is STALE
    U->>API: GET /api/v1/environments
    API->>DB: SELECT * FROM environments WHERE team_id = ?
    DB-->>API: Return records (last_updated = 2 hours ago)

    Note over API: Data is STALE (> 10 min)
    API-->>U: 200 OK (return stale data immediately)
    API->>Q: enqueue(team_id, priority="normal")

    Note over Q,W: Background Worker processes queue
    Q->>W: dequeue(team_id)
    W->>VZ: GET /getenvs

    alt API Down / Timeout
        VZ--xW: Timeout (5 seconds)
        W->>W: Log error, retry later (exponential backoff)
        Note over DB: DB retains stale data (no overwrite)
        Note over U: User still sees old data (better than error)
    else API Returns 500
        VZ-->>W: 500 Internal Server Error
        W->>W: Log error, mark for retry
        Note over DB: Stale data preserved
    else Incomplete Data
        VZ-->>W: 200 OK (corrupted payload)
        W->>W: Schema validation FAILS
        W->>W: Discard new data, keep old
        Note over DB: Stale data protected from corruption
    end
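
The "retry later (exponential backoff)" step can be sketched as follows. The default of three attempts matches `VZ_SYNC_RETRY_ATTEMPTS`; the 1-second base, the 30-second cap, and the function names are assumptions made for illustration.

```python
import random


def backoff_delays(attempts: int = 3, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Delay before each retry: base * 2**n, never above `cap`."""
    return [min(cap, base * 2 ** n) for n in range(attempts)]


def with_jitter(delay: float, rng: random.Random) -> float:
    """Full jitter: pick uniformly in [0, delay] so retries do not synchronize."""
    return rng.uniform(0.0, delay)


# Three attempts wait 1s, 2s, then 4s before jitter is applied.
print(backoff_delays())
```

Jitter matters here because many teams can fail at the same moment during a Virtuozzo outage; without it, their retries would arrive at the API in synchronized waves.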

4.3 ASCII Architecture Diagram

USER ACTION                  SYSTEM LOGIC                             DATA SOURCE
+-----------+       +-------------------------------------+
| User Logs |------>| 1. Query Local Database for Env     |
|    In     |       +-------------------------------------+
+-----------+                      |
                                   v
                      /-------------------------\
                      |   Data Exists in DB?    |
                      \-----------+-------------/
                                  |
               NO (First Run)     |     YES (Returning User)
             +--------------------+-----------------------+
             |                                            |
             v                                            v
+------------------------+                  /--------------------------\
| 2. BLOCKING FETCH      |                  |   Check Timestamp (TTL)  |
| (Show Loading Spinner) |                  \-----------+--------------/
+-----------+------------+                              |
            |                         +-----------------+-----------------+
            |                         |                                   |
            v                   FRESH (< 10m old)                 STALE (> 10m old)
+------------------------+            |                                   |
| Call Virtuozzo API     |            |                                   |
+-----------+------------+            |                     +-----------------------------+
            |                         |                     | 3. RETURN STALE DATA (FAST) |
            v                         v                     | (Dashboard loads instantly) |
+------------------------+    +----------------------+      +-------------+---------------+
| Validate & Save to DB  |<---|   RETURN DATA TO UI  |                    |
+-----------+------------+    +----------------------+                    |
            |                         ^                     +-------------v---------------+
            |                         |                     | 4. PUSH TO UPDATE QUEUE     |
            v                         |                     | (Fire-and-Forget)           |
+------------------------+            |                     +-------------+---------------+
|    RETURN DATA TO UI   |------------+                                    |
+------------------------+                                                 |
                                                                          |
       -------------------------------------------------------------------|
       | ASYNCHRONOUS BACKGROUND WORKER (Independent Process)
       |   - Rate-limited: 50 req/sec max
       |   - Active-user priority only
       |   - No global cron iteration
       |
       v
+----------------------+       +-------------------------+
|    Queue Consumer    |------>|    Call Virtuozzo API   |
+----------------------+       +-----------+-------------+
                                           |
                               /-------------------------\
                               |    Request Successful?  |
                               \-----------+-------------/
                                           |
                        YES                |             NO (Sad Path)
             +----------------------+      |      +-------------------------+
             | Compare Hash/Diff    |      |      | Log Error / Retry Later |
             | (Is data different?) |      |      | (Do NOT overwrite DB)   |
             +----------+-----------+      |      +-------------------------+
                        |                  |
             +----------v-----------+      |
             | Update DB Timestamp  |      |
             | & Save New Data      |      |
             +----------------------+      |

4.4 Queue-Based System Design

Why Queue-Based Instead of Global Cron?

Approach	Complexity	Scalability	Resource Usage
Global Cron	O(N) - iterate all users	Breaks at 100K+ users	Wastes cycles on inactive users
Queue-Based	O(active users)	Scales to millions	Only processes active users

Queue Architecture:

graph TB
    subgraph "Request Layer"
        API[FastAPI /environments endpoint]
    end

    subgraph "Queue Layer"
        Q[RabbitMQ / Redis Queue]
        ROUTER[Priority Router]
    end

    subgraph "Worker Layer"
        W1[Worker 1]
        W2[Worker 2]
        W3[Worker N]
    end

    subgraph "External Layer"
        VZ[Virtuozzo API]
    end

    API -->|enqueue sync task| Q
    Q --> ROUTER
    ROUTER -->|round-robin| W1
    ROUTER -->|round-robin| W2
    ROUTER -->|round-robin| W3

    W1 -->|rate-limited 50/sec| VZ
    W2 -->|rate-limited 50/sec| VZ
    W3 -->|rate-limited 50/sec| VZ

    style Q fill:#ff9,stroke:#333,stroke-width:3px
    style VZ fill:#f9f,stroke:#333,stroke-width:3px
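
One way to keep five concurrent logins from producing five queue items is to claim a per-team slot before enqueueing. The sketch below is an in-memory illustration only: `PendingSyncs`, `try_claim`, and `release` are hypothetical names, and a multi-process deployment would need shared state (for example a Redis key set with the NX and EX options) rather than a local dictionary.

```python
import time
from typing import Callable, Dict


class PendingSyncs:
    """Tracks teams that already have a refresh queued or in flight."""

    def __init__(self, lock_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._expires_at: Dict[int, float] = {}
        self._lock_seconds = lock_seconds
        self._clock = clock

    def try_claim(self, team_id: int) -> bool:
        """Return True only for the first caller within the lock window."""
        now = self._clock()
        expires = self._expires_at.get(team_id)
        if expires is not None and expires > now:
            return False  # a refresh is already pending; skip the duplicate
        self._expires_at[team_id] = now + self._lock_seconds
        return True

    def release(self, team_id: int) -> None:
        """Called by the worker when the sync finishes, success or failure."""
        self._expires_at.pop(team_id, None)


pending = PendingSyncs()
# Five simultaneous stale reads for team 123 yield a single enqueue.
print([pending.try_claim(123) for _ in range(5)])
```

The expiry acts as a safety net: if a worker crashes before calling `release`, the team becomes eligible for a refresh again once the lock window passes.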

Queue Message Format:

{
  "task_type": "sync_environments",
  "team_id": 123,
  "priority": "normal",
  "enqueued_at": "2026-01-22T10:30:00Z",
  "triggered_by": "stale_data",
  "session_key_encrypted": "..."
}
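
The message can be modelled as a small typed object so producers and workers agree on the fields. This is a sketch that assumes plain JSON on the wire; the field names come from the format above, while the `to_json` and `from_json` helpers are illustrative.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SyncTask:
    team_id: int
    session_key_encrypted: str
    enqueued_at: str  # ISO 8601 UTC timestamp, e.g. "2026-01-22T10:30:00Z"
    task_type: str = "sync_environments"
    priority: str = "normal"
    triggered_by: str = "stale_data"

    def to_json(self) -> str:
        """Serialize with sorted keys so identical tasks produce identical bytes."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "SyncTask":
        """Parse a message, rejecting task types this worker does not handle."""
        data = json.loads(raw)
        if data.get("task_type") != "sync_environments":
            raise ValueError(f"unexpected task_type: {data.get('task_type')!r}")
        return cls(**data)
```

Because `from_json` passes the parsed fields straight to the constructor, a message with unknown or missing keys fails fast with a `TypeError` instead of being processed partially.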

Worker Processing Logic:

# Pseudocode: rate_limiter, virtuozzo_client, and the helper functions are assumed to exist
import asyncio

async def process_sync_task(task: SyncTask) -> None:
    # 1. Rate limiting: wait for a free slot (asyncio.Semaphore is used via "async with")
    async with rate_limiter:
        # 2. Decrypt session key
        session_key = decrypt(task.session_key_encrypted)

        # 3. Call Virtuozzo with timeout
        try:
            response = await virtuozzo_client.get_envs(
                session=session_key,
                lazy=True,
                timeout=5.0
            )
        except asyncio.TimeoutError:  # assumes the client raises asyncio.TimeoutError
            # DO NOT overwrite DB with error
            logger.error("sync_timeout", team_id=task.team_id)
            return  # Stale data preserved

        # 4. Validate payload schema
        if not validate_schema(response):
            logger.error("sync_invalid_payload", team_id=task.team_id)
            return  # Stale data protected from corruption

        # 5. Compare hash (avoid unnecessary writes)
        new_hash = hash_payload(response)
        old_hash = await get_last_sync_hash(task.team_id)
        if new_hash == old_hash:
            # Data is unchanged but verified: refresh the timestamp so the next
            # read is FRESH instead of re-enqueueing a sync on every request
            await update_last_synced_at(task.team_id, new_hash)
            logger.info("sync_unchanged", team_id=task.team_id)
            return  # No data write needed

        # 6. Update database
        await upsert_environments(task.team_id, response)

        # 7. Update timestamp
        await update_last_synced_at(task.team_id, new_hash)

        logger.info("sync_complete", team_id=task.team_id)

5. Technical Specifications

5.1 Configuration

Setting	Default	Description
`VZ_ENV_TTL_SECONDS`	600	10 minutes - Data freshness window
`VZ_ENV_MAX_STALE_SECONDS`	3600	1 hour - Show warning after this
`VZ_SYNC_QUEUE_RATE_LIMIT`	50	Requests per second max
`VZ_SYNC_TIMEOUT_SECONDS`	5	Virtuozzo API timeout
`VZ_SYNC_RETRY_ATTEMPTS`	3	Max retry attempts with exponential backoff
`VZ_SYNC_WORKER_CONCURRENCY`	10	Number of worker processes
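
A minimal sketch of loading these settings with their defaults. It uses only the standard library to stay self-contained; the `SyncSettings` class is hypothetical, and the real project may prefer to extend its existing settings object in `config.py` instead.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class SyncSettings:
    ttl_seconds: int = 600
    max_stale_seconds: int = 3600
    queue_rate_limit: int = 50
    timeout_seconds: float = 5.0
    retry_attempts: int = 3
    worker_concurrency: int = 10

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "SyncSettings":
        """Read the VZ_* variables, falling back to the documented defaults."""
        d = cls()
        settings = cls(
            ttl_seconds=int(env.get("VZ_ENV_TTL_SECONDS", d.ttl_seconds)),
            max_stale_seconds=int(env.get("VZ_ENV_MAX_STALE_SECONDS", d.max_stale_seconds)),
            queue_rate_limit=int(env.get("VZ_SYNC_QUEUE_RATE_LIMIT", d.queue_rate_limit)),
            timeout_seconds=float(env.get("VZ_SYNC_TIMEOUT_SECONDS", d.timeout_seconds)),
            retry_attempts=int(env.get("VZ_SYNC_RETRY_ATTEMPTS", d.retry_attempts)),
            worker_concurrency=int(env.get("VZ_SYNC_WORKER_CONCURRENCY", d.worker_concurrency)),
        )
        # The warning threshold only makes sense if it is at least the TTL.
        if settings.max_stale_seconds < settings.ttl_seconds:
            raise ValueError("VZ_ENV_MAX_STALE_SECONDS must be >= VZ_ENV_TTL_SECONDS")
        return settings
```

Validating the relationship between the two thresholds at startup prevents a misconfiguration where the "delayed" warning would appear on data that is still considered fresh.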

5.2 Database Schema

-- Add to existing environments table
ALTER TABLE environments
ADD COLUMN last_synced_at TIMESTAMP WITH TIME ZONE,  -- NULL until the first successful sync (DEFAULT NOW() would mark existing rows as fresh)
ADD COLUMN last_sync_hash VARCHAR(64),  -- SHA-256 for change detection
ADD COLUMN sync_status VARCHAR(20) DEFAULT 'ok',  -- ok, stale, failed
ADD COLUMN sync_error_message TEXT,
ADD COLUMN api_calls_count INTEGER DEFAULT 0,
ADD COLUMN last_sync_duration_ms INTEGER;

-- Indexes for freshness queries
CREATE INDEX idx_environments_team_sync ON environments(team_id, last_synced_at);
CREATE INDEX idx_environments_sync_status ON environments(sync_status);

-- Sync queue tracking (optional, for monitoring)
CREATE TABLE environment_sync_queue (
    id SERIAL PRIMARY KEY,
    team_id INTEGER NOT NULL,
    enqueued_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    started_at TIMESTAMP WITH TIME ZONE,
    completed_at TIMESTAMP WITH TIME ZONE,
    status VARCHAR(20),  -- pending, processing, completed, failed
    error_message TEXT,
    retry_count INTEGER DEFAULT 0
);

5.3 API Contract

GET /api/v1/environments

Response:

{
  "items": [
    {
      "env_name": "staging",
      "display_name": "Staging Environment",
      "shortdomain": "staging.example.com",
      "app_id": "12345",
      "status": "active",
      "nodes": [...]
    }
  ],
  "meta": {
    "last_synced_at": "2026-01-22T10:25:00Z",
    "is_stale": false,
    "stale_warning": null,
    "sync_in_progress": false
  }
}

Headers:

- X-Data-Stale: true if data is older than TTL
- X-Data-Last-Sync: ISO timestamp of last sync
- X-Sync-In-Progress: true if background sync is running
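
The `meta` object and the three headers can be derived from the same inputs. The function below is a sketch under assumptions: `build_freshness` is a hypothetical helper, the thresholds default to the values in section 5.1, and the warning text reuses the "Data may be delayed" wording from the user goals.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Tuple


def build_freshness(
    last_synced_at: datetime,
    now: datetime,
    sync_in_progress: bool,
    ttl: timedelta = timedelta(seconds=600),
    max_stale: timedelta = timedelta(seconds=3600),
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Return (meta, headers) describing how fresh the served data is."""
    age = now - last_synced_at
    is_stale = age > ttl
    warning = "Data may be delayed" if age > max_stale else None
    iso = last_synced_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    meta = {
        "last_synced_at": iso,
        "is_stale": is_stale,
        "stale_warning": warning,
        "sync_in_progress": sync_in_progress,
    }
    headers = {
        "X-Data-Stale": str(is_stale).lower(),
        "X-Data-Last-Sync": iso,
        "X-Sync-In-Progress": str(sync_in_progress).lower(),
    }
    return meta, headers
```

Building both from one function keeps the body and the headers from disagreeing about staleness.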

6. Sad Path Engineering

6.1 Failure Modes and Mitigations

Failure Mode	Impact	Mitigation	Test Case
Virtuozzo API timeout	Stale data served	Log error, preserve stale data	TEST-TIMEOUT-001
Virtuozzo API 500	Stale data served	Exponential backoff retry	TEST-500-001
Corrupted payload	Stale data protected	Schema validation before DB write	TEST-CORRUPT-001
Queue processing failure	Stale data persists	Retry with exponential backoff	TEST-QUEUE-FAIL-001
Database connection lost	503 Service Unavailable	Circuit breaker, retry connection	TEST-DB-001
Session key expired	Sync fails until the session is re-fetched	Re-fetch session and retry; decrypt error → re-auth	TEST-AUTH-001
Worker crash mid-sync	Partial data update	Transaction rollback	TEST-WORKER-CRASH-001
Queue backlog	Data stays stale longer	Priority queue for recent logins	TEST-BACKLOG-001
Massive concurrent logins	Queue spike	Rate limiter throttles API calls	TEST-SPIKE-001
Clock skew between servers	Incorrect TTL calculation	Use DB time, not app time	TEST-CLOCK-001

6.2 Data Integrity Strategies

Atomic Updates:

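from sqlalchemy import delete

# `db` is the task's AsyncSession and `Environment` the ORM model (both assumed in scope)
async def upsert_environments(team_id: int, env_data: list[dict]):
    async with db.begin():  # Transaction: commits on exit, rolls back on exception
        # Clear old environments
        await db.execute(
            delete(Environment).where(Environment.team_id == team_id)
        )
        # Insert new environments, always scoped to this team
        for env in env_data:
            db.add(Environment(**{**env, "team_id": team_id}))
        # No explicit commit needed: leaving the block commits all or nothing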

Hash-Based Change Detection:

import hashlib
import json

def compute_env_hash(environments: list[dict]) -> str:
    """SHA-256 hash for change detection."""
    # sort_keys makes the hash independent of dict key order;
    # the order of the list itself still affects the result
    normalized = json.dumps(environments, sort_keys=True)
    return hashlib.sha256(normalized.encode()).hexdigest()

# Only update DB if hash changed
if new_hash != old_hash:
    await upsert_environments(team_id, environments)

Schema Validation:

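def validate_virtuozzo_response(data: dict) -> bool:
    """Validate response structure before DB write."""
    # Guard against non-object payloads (e.g. a JSON list or string)
    if not isinstance(data, dict):
        return False

    required_fields = ["result", "infos"]
    if not all(k in data for k in required_fields):
        return False

    if not isinstance(data["infos"], list):
        return False

    # Every entry must be an object carrying an "env" key
    for env in data["infos"]:
        if not isinstance(env, dict) or "env" not in env:
            return False

    return True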

7. Observability

7.1 Metrics

Metric	Type	Labels	Purpose
`env_sync_requests_total`	Counter	team_id, status	Track sync volume
`env_sync_duration_seconds`	Histogram	team_id, result	Track sync latency
`env_sync_queue_depth`	Gauge	priority	Monitor queue backlog
`env_data_age_seconds`	Histogram	team_id	Track data freshness
`env_stale_served_total`	Counter	team_id, staleness_bucket	Track staleness
`virtuozzo_api_requests_total`	Counter	endpoint, status	API call tracking
`virtuozzo_api_errors_total`	Counter	endpoint, error_type	Failure analysis
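
The `staleness_bucket` label on `env_stale_served_total` needs a fixed, small set of values. A possible mapping is sketched below; the bucket boundaries and label names are illustrative assumptions, with the 1-hour edge chosen to match `VZ_ENV_MAX_STALE_SECONDS`.

```python
import bisect

# Upper bounds in seconds for data that is already past the 10-minute TTL.
_BOUNDS = [3600, 86400]
_LABELS = ["10m_to_1h", "1h_to_1d", "over_1d"]


def staleness_bucket(age_seconds: float) -> str:
    """Map the age of served stale data to a low-cardinality metric label."""
    return _LABELS[bisect.bisect_left(_BOUNDS, age_seconds)]


print(staleness_bucket(1200))   # 10m_to_1h
print(staleness_bucket(7200))   # 1h_to_1d
print(staleness_bucket(172800)) # over_1d
```

Using a handful of coarse buckets keeps the number of time series bounded regardless of how stale individual teams become.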

7.2 Structured Logging

logger.info(
    "sync_started",
    team_id=team_id,
    triggered_by="stale_data",
    current_data_age_seconds=age.total_seconds(),
    task_id=task_id
)

logger.info(
    "sync_complete",
    team_id=team_id,
    duration_seconds=duration,
    env_count=len(environments),
    data_changed=(new_hash != old_hash),
    api_calls=1
)

logger.error(
    "sync_failed",
    team_id=team_id,
    error_type="timeout",
    error_message=str(e),
    retry_count=retry_count,
    will_retry=True
)

7.3 Alerts

Alert	Condition	Severity	Action
High Stale Rate	`stale_served / total > 0.5`	Warning	Check worker health
Queue Backlog	`queue_depth > 1000`	Warning	Scale workers
API Error Rate	`api_errors / api_calls > 0.1`	Critical	Check Virtuozzo status
Data Age	`data_age_p95 > 1800` (30 min)	Warning	Review TTL settings
Worker Down	`worker_heartbeat > 60s`	Critical	Restart workers

8. Testing Requirements

8.1 Unit Tests

# Test: Fresh data returned from cache
async def test_fresh_data_returned_immediately():
    # Setup: DB has data from 5 minutes ago
    await insert_test_envs(team_id=1, age_minutes=5)

    # Act: Request environments
    result = await list_environments(team_id=1)

    # Assert: No API call made, data returned
    assert_virtuozzo_not_called()
    assert len(result) == 5

# Test: Stale data returned, queue task created
async def test_stale_data_triggers_background_sync():
    # Setup: DB has data from 20 minutes ago
    await insert_test_envs(team_id=1, age_minutes=20)

    # Act: Request environments
    result = await list_environments(team_id=1)

    # Assert: Stale data returned immediately
    assert len(result) == 5

    # Assert: Queue task created
    assert_queue_contains(team_id=1)

# Test: Corrupted payload does not overwrite DB
async def test_corrupted_payload_preserves_stale_data():
    # Setup: DB has valid data
    await insert_test_envs(team_id=1)

    # Act: Simulate sync with corrupted response
    await process_sync_task(
        team_id=1,
        response={"invalid": "payload"}
    )

    # Assert: DB unchanged
    db_data = await get_environments(team_id=1)
    assert len(db_data) == 5  # Original data preserved

8.2 Sad Path Tests (3:1 Ratio)

Test	Scenario	Expected
`test_timeout_preserves_stale`	Virtuozzo times out	Stale data in DB, error logged
`test_500_preserves_stale`	Virtuozzo returns 500	Stale data in DB, retry scheduled
`test_invalid_json_rejected`	Non-JSON response	Stale data preserved
`test_schema_validation_fails`	Missing required fields	Stale data preserved
`test_worker_crash_rollback`	Worker dies mid-transaction	Transaction rolled back
`test_queue_full_rejected`	Queue at capacity	Return 503, retry later
`test_session_key_expired`	401 from Virtuozzo	Re-fetch session, retry
`test_database_connection_lost`	DB unavailable	Return 503, log error

8.3 Integration Tests

# Test: End-to-end stale-while-revalidate flow
async def test_stale_while_revalidate_e2e():
    # 1. Setup: Insert stale data (20 min old)
    await insert_test_envs(team_id=1, age_minutes=20)

    # 2. Act: Request environments (should return stale + enqueue)
    response = await client.get("/api/v1/environments")
    assert response.status_code == 200
    assert response.json()["meta"]["is_stale"] is True

    # 3. Act: Process queue (background worker)
    await worker_process_one_task()

    # 4. Assert: DB updated with fresh data
    db_data = await get_environments(team_id=1)
    assert db_data[0]["last_synced_at"] > now() - timedelta(minutes=1)

    # 5. Act: Request again (should be fresh)
    response = await client.get("/api/v1/environments")
    assert response.json()["meta"]["is_stale"] is False

9. Success Metrics

9.1 Key Performance Indicators

Metric	Baseline	Target	Measurement
UI Response Time (p95)	2000-5000ms	< 100ms	Frontend RUM
Blocking Requests	100%	< 1%	Backend metrics
API Calls per Session	1 per login	1 per 10 min	API call counter
Data Staleness (p95)	Infinite	< 10 min	`last_synced_at` age
Queue Processing Time	N/A	< 30 sec	Queue duration metric
Error Rate	< 0.1%	< 0.1%	HTTP 5xx rate

9.2 Scalability Metrics

Scenario	Metric	Target
1K concurrent logins	Queue depth	< 100
10K concurrent logins	Queue depth	< 1000
100K concurrent logins	API rate (per sec)	≤ 50
1M total users, 10K active	Queue processing	Linear with active users

10. Open Questions

ID	Question	Owner	Priority
Q-001	Is 10-minute TTL acceptable for business requirements?	Product	HIGH
Q-002	Should stale data warning appear at 30 min or 1 hour?	UX	MEDIUM
Q-003	What's the Virtuozzo API rate limit per session key?	Engineering	HIGH
Q-004	Should we implement priority queue for paying customers?	Product	LOW
Q-005	How many concurrent workers do we provision?	DevOps	MEDIUM

11. Implementation Checklist

Phase	Task	Status	Notes
Phase 1	Database schema changes	Pending	Add `last_synced_at`, `last_sync_hash`
Phase 1	Create queue infrastructure	Pending	RabbitMQ/Redis queue setup
Phase 2	Implement `sync_or_fetch_cached()`	Pending	Core stale-while-revalidate logic
Phase 2	Implement background worker	Pending	Queue consumer with rate limiting
Phase 3	Add staleness indicators to API	Pending	Response headers, meta object
Phase 3	Implement frontend stale badge	Pending	Visual indicator
Phase 4	Observability (metrics, logging)	Pending	OpenTelemetry integration
Phase 4	Testing (unit, integration, sad path)	Pending	85% coverage target
Phase 5	Documentation and runbooks	Pending	Ops guides
Phase 5	Production deployment	Pending	Feature flags

12. References

Document	Location
Virtuozzo API Documentation	https://docs.jelastic.com/api/
MBPanel Architecture Guide	`/docs/architecture/001-hybrid-modular-ddd.md`
SSE Notification Pattern	`/docs/architecture/WEBSOCKET/SSE-Notif.md`
Current Implementation	`/backend/app/domains/environments/`

13. Feasibility Analysis & Current Tech Stack Assessment

13.1 Overall Feasibility: FEASIBLE with MINOR MODIFICATIONS

Date: 2026-01-22

Assessment Method: Code-based evidence analysis of existing infrastructure

The Stale-While-Revalidate architecture design aligns well with the existing tech stack. All required infrastructure components are present or can be added with minimal effort.

13.2 Infrastructure Component Analysis

Queue Infrastructure

PRD Requirement	Current State	Evidence File	Gap	Recommendation
RabbitMQ message broker	AVAILABLE	`docker-compose.yml:24-43`	None	Already configured
Redis for rate limiting	AVAILABLE	`docker-compose.yml:16-22`	None	Already configured
Celery workers	AVAILABLE	`celery_app.py:8-23`	None	Configured with RabbitMQ
SSE event fan-out	AVAILABLE	`sse.py:48-100`	None	SSEBroker implemented
Dedicated sync queue	MISSING	N/A	Low	Add `jobs.env.sync` queue to config

Configuration Evidence (backend/app/core/config.py:31-39):

rabbitmq_url: str = "amqp://mbpanel:mbpanel_pass@rabbitmq:5672/"  # pragma: allowlist secret
rabbitmq_jobs_exchange: str = "jobs.direct"
rabbitmq_jobs_queue: str = "jobs.env.create"  # Add new: jobs.env.sync

Database & ORM

PRD Requirement	Current State	Evidence File	Gap	Recommendation
PostgreSQL 16	AVAILABLE	`docker-compose.yml:2-14`	None	Supported major version
Alembic migrations	AVAILABLE	`alembic/versions/004_add_environment_fields.py`	None	Pattern established
SQLAlchemy 2.0 async	AVAILABLE	`environment.py:45-80`	None	AsyncSession ready
`last_synced_at` column	MISSING	`environment.py`	Medium	Add via migration
`last_sync_hash` column	MISSING	`environment.py`	Low	Add via migration
`sync_status` column	MISSING	`environment.py`	Low	An on-the-fly calculation already exists
Transaction rollback	AVAILABLE	SQLAlchemy core	None	`async with db.begin()`

Existing Migration Pattern (004_add_environment_fields.py:18-23):

def upgrade() -> None:
    op.add_column('environments', sa.Column('ishaenabled', sa.Boolean(), nullable=True))
    op.add_column('environments', sa.Column('sslstate', sa.Boolean(), nullable=True))
    op.add_column('environments', sa.Column('region', sa.String(100), nullable=True))

Recommended Migration: 005_add_environment_sync_tracking.py

Background Workers (Celery)

PRD Requirement	Current State	Evidence File	Gap	Recommendation
Celery app configured	AVAILABLE	`celery_app.py:8-23`	None	RabbitMQ broker
Task pattern	AVAILABLE	`tasks.py:14-18`	None	`create_environment_task` exists
Late ack (at-least-once)	AVAILABLE	`celery_app.py:15`	None	`acks_late=True`
Async support	AVAILABLE	`tasks.py:17`	None	`asyncio.run()`
Sync task	MISSING	N/A	Medium	Add `sync_environments_task`
Rate limiting (50/s)	PARTIAL	`rate_limit.py:11-26`	Low	Use semaphore + Redis limiter

Existing Task Pattern (tasks.py:14-18):

@celery_app.task(name="domains.environments.create_environment", bind=True, acks_late=True)
def create_environment_task(self, payload: dict[str, Any]) -> None:
    logger.info("environment_job_received", job_id=payload.get("job_id"))
    asyncio.run(service.execute_environment_job(payload))

Virtuozzo Integration

PRD Requirement	Current State	Evidence File	Gap	Recommendation
VirtuozzoClient	AVAILABLE	`client.py:36-57`	None	Async HTTP client
GetEnvs API	AVAILABLE	`service.py:97-116`	None	`fetch_environments_and_nodes()`
Timeout support	AVAILABLE	`client.py:53-55`	Very Low	15s default, make configurable
Lazy loading	AVAILABLE	`service.py:99`	None	`lazy: bool` parameter
Session management	AVAILABLE	`Team` model	None	Encrypted keys stored
Hash-based change detection	MISSING	N/A	Low	Add SHA-256 to service.py

Function Signature Evidence (service.py:97-102):

async def fetch_environments_and_nodes(
    session_key: str,
    lazy: bool = False,
    owner_uid: int | None = None,
    vz_client: VirtuozzoClient | None = None
) -> dict[str, Any]:

API Layer

PRD Requirement	Current State	Evidence File	Gap	Recommendation
FastAPI router	AVAILABLE	`router.py:11-26`	None	GET + POST exist
Service layer	AVAILABLE	`service.py:41-45`	None	`list_environments()`
Response serialization	AVAILABLE	`service.py:527-544`	None	`_serialize_environment()`
CSRF protection	AVAILABLE	`router.py:20`	None	`dependencies=[Depends(csrf_protected)]`
`meta` object in response	MISSING	`router.py:17`	Medium	Add to return type
Response headers	MISSING	N/A	Low	Add `X-Data-Stale`, etc.
Queue enqueue on stale	MISSING	N/A	Medium	Add `publish_event()` call

Current Response Format (router.py:11-17):

@router.get("/")
async def environments(...) -> dict[str, list]:
    items = await list_environments(db=db, current_user=user)
    return {"items": items}  # Missing: meta object

Rate Limiting

PRD Requirement	Current State	Evidence File	Gap	Recommendation
Redis-based limiter	AVAILABLE	`rate_limit.py:11-26`	None	Counter-based (sufficient)
Redis client	AVAILABLE	`config.py:46`	None	`redis_url` configured
Login rate limit pattern	IMPLEMENTED	`rate_limit.py:28-47`	None	Pattern to follow
Token bucket algorithm	MISSING	N/A	Medium	`asyncio.Semaphore(50)` caps in-flight calls per process, not req/sec; pair it with the Redis limiter
Virtuozzo-specific limiter	MISSING	N/A	Low	Create wrapper class

Existing Implementation (rate_limit.py:11-26):

class RateLimiter:
    """Redis-based counter limiter using INCR semantics."""
    async def check(self, key: str, limit: int, window_seconds: int) -> None:
        count = await self.redis.incr(key)
        if count == 1:
            await self.redis.expire(key, window_seconds)
        if count > limit:
            raise HTTPException(status_code=429)

Frontend

PRD Requirement	Current State	Evidence File	Gap	Recommendation
Next.js 16	AVAILABLE	`package.json:39`	None	Latest version
React 19	AVAILABLE	`package.json:42`	None	Latest version
TanStack Query v5	INSTALLED	`package.json:32-33`	Low	Available but unused for environments
useEnvironments hook	IMPLEMENTED	`useEnvironments.ts:20-106`	None	Custom hook pattern
API client	AVAILABLE	`api-client.ts:18-135`	None	Full fetch wrapper
SyncStatus types	DEFINED	`environment.types.ts:5-9`	None	Interface exists
Stale badge display	PARTIAL	`sites/page.tsx:62-68`	Low	Code checks `sync_status`
Background refresh	MISSING	N/A	Medium	Add SSE listener

Current Frontend Pattern (useEnvironments.ts:55-73):

const syncAndRefetch = useCallback(async () => {
  const syncResult = await triggerSync();  // Currently blocks
  const data = await getEnvironments();
  setEnvironments(data);
}, []);

Observability

PRD Requirement	Current State	Evidence File	Gap	Recommendation
Structured logging	AVAILABLE	`logging.py`	None	Structlog with JSON
OpenTelemetry	AVAILABLE	`telemetry.py:92-172`	None	Full OTEL setup
Tracer helper	AVAILABLE	`observability.py:10-23`	None	`telemetry_span()`
SigNoz compatibility	YES	`telemetry.py:254-286`	None	Canonical attributes
Context propagation	AVAILABLE	Core modules	None	`request_id_ctx`, `team_id_ctx`
Sync-specific metrics	MISSING	N/A	Low	Add using `get_meter()`

Existing Logging Pattern (service.py:624):

logger.info("sync_team_environments_started", team_id=team_id)

Security & Sad Path

PRD Requirement	Current State	Evidence File	Gap	Recommendation
Crypto (decrypt/encrypt)	AVAILABLE	`crypto.py`	None	`decrypt_str()`, `encrypt_str()`
Pydantic validation	AVAILABLE	Throughout	None	Request/response validation
SQLAlchemy transactions	AVAILABLE	Core	None	`async with db.begin()`
Transaction rollback	AUTOMATIC	SQLAlchemy	None	On exception
Error handling	AVAILABLE	HTTPException	None	Proper status codes
Security events logging	AVAILABLE	`security_events.py`	None	Auth violation tracking

13.3 Configuration Variables

PRD Variable	Default	Current State	Evidence File	Action
`VZ_ENV_TTL_SECONDS`	600	HARDCODED	`service.py:577`	Move to config
`VZ_ENV_MAX_STALE_SECONDS`	3600	HARDCODED	`service.py:579`	Move to config
`VZ_SYNC_QUEUE_RATE_LIMIT`	50	NOT DEFINED	N/A	Add to config
`VZ_SYNC_TIMEOUT_SECONDS`	5	15 (client default differs)	`client.py:53`	Override for sync
`VZ_SYNC_RETRY_ATTEMPTS`	3	NOT DEFINED	N/A	Add to config
`VZ_SYNC_WORKER_CONCURRENCY`	10	NOT DEFINED	N/A	Add to config

Existing Config Pattern (config.py:19-33):

class Settings(BaseSettings):
    model_config = ConfigDict(
        env_prefix="VZ_",
        env_file=".env",
        env_file_encoding="utf-8",
        extra="ignore",
    )
    virtuozzo_timeout_seconds: int = 15

13.4 DDD Architecture Alignment

The design aligns with the existing Hybrid Modular DDD architecture. Each PRD component maps to an existing location:

PRD Design Component	Current Architecture Location	Evidence
`EnvironmentSyncService`	`app/domains/environments/service.py`	✅ Match
Queue Consumer Worker	`app/domains/environments/tasks.py`	✅ Match
Repository Layer	`app/domains/environments/repository.py`	✅ Match
API Router	`app/domains/environments/router.py`	✅ Match
API Composition	`app/api/v1/__init__.py:34-36`	✅ Already wired
Virtuozzo External Service	`app/infrastructure/external/virtuozzo/`	✅ Match
SSE Events	`app/infrastructure/messaging/events.py`	✅ Match

13.5 Implementation Effort Estimate

Phase	Tasks	Estimated Time	Dependencies
Phase 1	Database migration + Config	1 day	None
Phase 2	Service layer + Queue logic	2 days	Phase 1
Phase 3	Celery worker + Rate limiting	1 day	Phase 2
Phase 4	API response format + Headers	1 day	Phase 2
Phase 5	Frontend stale badges	1 day	Phase 4
Phase 6	Testing (85% coverage)	2 days	All phases
Phase 7	Docs + Runbooks	1 day	All phases
Phase 8	Production deployment	1 day	All phases

Total Estimated Effort: 10 days (sum of the phase estimates, assuming the phases run sequentially)

13.6 Risk Assessment

Risk	Probability	Impact	Mitigation
Database migration failure	Low	High	Test migration in staging first
Rate limiter blocking workers	Low	Medium	Use semaphore + fallback
Virtuozzo API timeout too aggressive	Medium	Medium	Make configurable, monitor
Queue backlog during spikes	Medium	Low	Add priority routing
Frontend stale badge confusion	Low	Low	Clear UX copy + tooltip

13.7 Recommendation

PROCEED WITH IMPLEMENTATION

The required building blocks already exist. The remaining gaps are rated Low or Medium and are well understood, and the design maps directly onto the existing architecture patterns.

Next Steps:

1. Create the `005_add_environment_sync_tracking.py` migration
2. Add config variables to `config.py`
3. Implement `sync_or_fetch_cached()` in `service.py`
4. Add `sync_environments_task` to `tasks.py`
5. Update the API response format

14. Implementation Checklist (Updated with Evidence)

Phase	Task	Status	Evidence File	Notes
Phase 1	Database schema changes	Pending	`alembic/versions/`	Create `005_add_environment_sync_tracking.py`
Phase 1	Configuration variables	Pending	`app/core/config.py`	Add 6 new `VZ_*` settings
Phase 1	Add sync queue config	Pending	`celery_app.py`	Add `jobs.env.sync` queue
Phase 2	Implement `sync_or_fetch_cached()`	Pending	`service.py:41-45`	Add after `list_environments()`
Phase 2	Add hash computation	Pending	`service.py`	SHA-256 function
Phase 2	Update `list_environments()` TTL check	Pending	`service.py:41-45`	Add stale detection
Phase 2	Add queue publish on stale	Pending	`service.py`	Use `publish_event()`
Phase 3	Implement `sync_environments_task`	Pending	`tasks.py:14-18`	Follow `create_environment_task` pattern
Phase 3	Add rate limiting to worker	Pending	`tasks.py`	Use `asyncio.Semaphore(50)`
Phase 3	Add retry with backoff	Pending	`tasks.py`	Celery `autoretry_for`
Phase 4	Add `meta` object to response	Pending	`router.py:11-17`	Update return type
Phase 4	Add response headers	Pending	`router.py`	Add `X-Data-Stale`, etc.
Phase 4	Update TypeScript types	Pending	`environment.types.ts:45-47`	Add `meta` interface
Phase 5	Display stale badges	Pending	`sites/page.tsx:62-68`	Already checks `sync_status`
Phase 5	Add SSE listener (optional)	Pending	Frontend	For background updates
Phase 6	Add sync metrics	Pending	`service.py`	Use `get_meter()`
Phase 6	Unit tests	Pending	`tests/domains/environments/`	Target 85% coverage
Phase 6	Sad path tests (3:1 ratio)	Pending	`tests/`	Timeout, 500, corruption
Phase 6	Integration tests	Pending	`tests/integration/`	End-to-end flow
Phase 7	Update runbooks	Pending	`docs/`	Ops procedures
Phase 7	Update API docs	Pending	`docs/`	Include new response format
Phase 8	Staging deployment	Pending	Infra	Test migration
Phase 8	Production deployment	Pending	Infra	Feature flags
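
The Phase 3 item "Add retry with backoff" relies on Celery's `autoretry_for` in the checklist. The framework-independent sketch below shows the retry behavior that setting is expected to provide, which can help when writing the sad path tests. `backoff_delays()` and `call_with_retry()` are hypothetical helpers, and the 1-second base and 60-second cap are assumed values.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import TypeVar

T = TypeVar("T")


def backoff_delays(
    retries: int = 3, base_seconds: float = 1.0, max_seconds: float = 60.0
) -> list[float]:
    """Delay before each retry: base * 2**n, capped at max_seconds."""
    return [min(base_seconds * 2**n, max_seconds) for n in range(retries)]


async def call_with_retry(
    call: Callable[[], Awaitable[T]],
    retries: int = 3,
    retry_on: tuple[type[Exception], ...] = (TimeoutError,),
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Await call(); on a retryable error, wait per the schedule and try again."""
    for delay in backoff_delays(retries):
        try:
            return await call()
        except retry_on:
            await sleep(delay)
    # Final attempt: any error now propagates to the caller
    return await call()
```

With the default of three retries, a call is attempted at most four times, with waits of 1, 2, and 4 seconds in between. The `sleep` parameter is injectable so tests can record the delays without actually waiting.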

15. References

Document	Location
Virtuozzo API Documentation	https://docs.jelastic.com/api/
MBPanel Architecture Guide	`/docs/architecture/001-hybrid-modular-ddd.md`
SSE Notification Pattern	`/docs/architecture/WEBSOCKET/SSE-Notif.md`
Current Implementation	`/backend/app/domains/environments/`

16. Revision History

Version	Date	Author	Changes
1.0	2026-01-09	MBPanel Team	Initial draft (lazy sync + background refresh)
2.0	2026-01-22	MBPanel Team	Major refactor: Stale-While-Revalidate + Queue-based architecture
2.1	2026-01-22	MBPanel Team	Added Section 13: Feasibility Analysis with evidence-based gaps and recommendations

17. Approvals

Role	Name	Signature	Date
Product Manager
Engineering Lead
Architecture Lead
DevOps Lead