Product Requirements Document (PRD)¶
Virtuozzo Environment Synchronization - Stale-While-Revalidate Architecture¶
Document Information¶
| Field | Details |
|---|---|
| Product Name | Virtuozzo Environment Synchronization |
| Author | MBPanel Team |
| Date Created | 2026-01-09 |
| Last Updated | 2026-01-22 |
| Version | 2.0 |
| Status | Draft - Architecture Refresh |
| Architecture Type | Domain |
| Affected Components | app/domains/environments/, app/infrastructure/external/virtuozzo/, app/infrastructure/queue/ |
1. Executive Summary¶
The Virtuozzo Environment Synchronization feature uses a stale-while-revalidate caching pattern to eliminate redundant Virtuozzo API calls when team members access MBPanel. The design touches three classic distributed-systems concerns: caching strategy, the consistency vs. availability trade-off, and scalability.
1.1 Core Problem Statement¶
Current State:
- Every user login triggers a synchronous call to the Virtuozzo API (2-5 second latency)
- 5 concurrent logins = 5 identical API calls (wasteful)
- UI blocks while fetching, creating poor UX
- Virtuozzo API rate limits are at risk during traffic spikes

Solution: Stale-While-Revalidate Architecture
- Decouple UI from Data Source: Dashboard never speaks directly to Virtuozzo for initial render
- Instant UI Response: Always return data immediately (fresh or stale), never block
- Background Queue System: Event-driven refresh for active users only (no global crons)
- Graceful Degradation: Serve stale data with a warning rather than errors
1.2 Architecture Overview¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ STALE-WHILE-REVALIDATE PATTERN │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. UI Requests → Check Local Database (PostgreSQL) │
│ 2. Data Found? │
│ - NO → Blocking fetch from Virtuozzo (unavoidable first-run) │
│ - YES → Check timestamp (TTL) │
│ ├─ FRESH (< 10 min) → Return immediately (Zero API calls) │
│ └─ STALE (> 10 min) → Return stale IMMEDIATELY + push to queue │
│ │
│ 3. Background Queue Consumer (Rate-limited, Active-User Priority) │
│ - Processes queue at fixed rate (e.g., 50 req/sec) │
│ - Fetches from Virtuozzo asynchronously │
│ - Validates payload before overwriting DB │
│ - Updates timestamp on success │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
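The three branches in the box above (no data, fresh, stale) reduce to a small pure function. The following is a minimal sketch rather than the project's implementation: the names `Freshness` and `classify` are hypothetical, and treating data that is exactly 10 minutes old as stale is an assumption, since the diagram leaves that boundary open.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional

TTL = timedelta(minutes=10)  # mirrors the 10-minute window in the diagram


class Freshness(Enum):
    MISSING = "missing"  # first run: a blocking fetch is required
    FRESH = "fresh"      # serve from PostgreSQL, zero API calls
    STALE = "stale"      # serve from PostgreSQL and enqueue a refresh


def classify(
    last_synced_at: Optional[datetime],
    now: datetime,
    ttl: timedelta = TTL,
) -> Freshness:
    """Map a team's last sync time onto the three branches of the diagram."""
    if last_synced_at is None:
        return Freshness.MISSING
    # Assumption: data exactly at the TTL boundary counts as stale.
    if now - last_synced_at >= ttl:
        return Freshness.STALE
    return Freshness.FRESH


now = datetime(2026, 1, 22, 10, 30, tzinfo=timezone.utc)
print(classify(now - timedelta(hours=2), now).value)  # prints "stale"
```

Passing `now` in explicitly keeps the function deterministic in tests and lets the caller choose which clock to trust.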
1.3 Key Technical Decisions¶
| Decision | Rationale | Impact |
|---|---|---|
| PostgreSQL as Primary Cache | Survives restarts, source of truth | No Redis dependency for cache |
| Queue-Based Refresh | Event-driven, not O(N) cron | Scales to millions of users |
| Active-User Priority | Only logged-in users trigger refresh | Passive users consume zero resources |
| Stale-While-Revalidate | Return stale data immediately, refresh async | UI always instant, never blocks |
| Rate-Limited Consumer | Protects Virtuozzo from traffic spikes | Controlled API call volume |
| No Global Cron | Don't iterate through all users | Linear scaling with active users |
2. Problem Statement¶
2.1 Background: Distributed Systems Optimization¶
This is a classic distributed caching problem with CAP theorem implications:
Consistency vs. Availability Trade-off:
- Strong Consistency: Every read gets the latest data (requires calling Virtuozzo) → SLOW
- High Availability: Every read returns immediately (cached data) → MAY BE STALE
- Our Choice: AP with TTL-based eventual consistency (10-minute window)

Scalability Challenge (Millions of Users):
- Global Cron Approach: Iterate through N users = O(N) complexity
  - 1M users = 1M sync calls per cron cycle, most of them for inactive users
- Queue-Based Approach: Only active users trigger refresh = O(active users)
  - 1M users, 10K active = 10K queue items, manageable
2.2 Current Implementation Analysis¶
File: backend/app/domains/environments/router.py
# CURRENT CODE (lines 11-17)
@router.get("/")
async def environments(
    user: AuthenticatedUser = Depends(get_current_user),
    db: AsyncSession = Depends(db_session),
) -> dict[str, list]:
    items = await list_environments(db=db, current_user=user)  # ← Just reads DB
    return {"items": items}
Current Behavior:
1. User logs in
2. Frontend calls GET /api/v1/environments
3. Backend queries PostgreSQL database directly
4. Returns whatever is in database (could be hours/days old)
5. No Virtuozzo API call is made
6. No freshness check is performed
7. Data only updates if someone manually calls POST /api/v1/environments/sync
Problems:
- Data can be stale forever
- No automatic refresh mechanism
- No staleness indicator to users
- Background refresh doesn't exist
2.3 Pain Points¶
| Pain Point | Current Impact | User Impact |
|---|---|---|
| Indefinite Staleness | Data never refreshes automatically | Users see old environments |
| Manual Sync Required | Must remember to call POST /sync | Operational burden |
| No Freshness Visibility | Can't tell if data is current | Confusion about data accuracy |
| No Background Refresh | No proactive updates | Stale data persists |
| Blocking Manual Sync | POST /sync blocks for 2-5 seconds | Poor UX when forcing refresh |
3. Goals and Objectives¶
3.1 Business Goals¶
| Goal | Metric | Target |
|---|---|---|
| Eliminate blocking UI | Percent of requests that block on a Virtuozzo API call | < 1% (first run only) |
| Reduce API calls | API calls per active user per session | 1 call per 10 min window |
| Maintain freshness | P95 data staleness | < 10 minutes |
| Scale to millions | System cost scales with | Active users, not total users |
| High availability | Uptime during Virtuozzo outages | Serve stale with warning |
3.2 User Goals¶
| User Story | Acceptance Criteria |
|---|---|
| Instant dashboard load | Environment list renders in < 100ms (p95) |
| Transparent staleness | Visual indicator when data is > 1 hour old |
| No blocking spinners | Never show loading spinner after first login |
| Automatic updates | Data refreshes within 10 minutes without manual action |
| Graceful errors | See "Data may be delayed" badge instead of error pages |
3.3 Non-Goals (Out of Scope)¶
- Environment creation/deletion (existing provisioning flows)
- Session key management (existing app/modules/sessions/)
- Real-time WebSocket sync (use SSE for notifications)
- Per-environment detailed status (focus on listing)
- Multi-region Virtuozzo support (single endpoint per team)
4. Architecture: Stale-While-Revalidate Pattern¶
4.1 Happy Path: User Logs In¶
sequenceDiagram
participant U as User Browser
participant API as FastAPI
participant DB as PostgreSQL
participant Q as Refresh Queue
participant W as Background Worker
participant VZ as Virtuozzo API
Note over U,DB: User logs in, navigates to environments
U->>API: GET /api/v1/environments
Note over API,DB: 1. Query Local Database
API->>DB: SELECT * FROM environments WHERE team_id = ?
DB-->>API: Return records (with last_updated timestamp)
alt First Run (No Data)
API-->>U: 404 + "Loading environments..." (spinner)
Note over API,VZ: 2. BLOCKING FETCH (unavoidable)
API->>VZ: GET /getenvs
VZ-->>API: Return environments
API->>DB: INSERT environments (last_updated = NOW())
API-->>U: 200 OK (environments loaded)
else Returning User (Data Exists)
Note over API: 3. Check Timestamp (TTL = 10 min)
alt FRESH (< 10 min old)
API-->>U: 200 OK (instant, < 50ms)
Note over U: User sees data immediately
else STALE (> 10 min old)
API-->>U: 200 OK (instant, stale data)
Note over API: 4. PUSH TO UPDATE QUEUE (fire-and-forget)
API->>Q: enqueue(team_id, priority="normal")
Note over Q,W: 5. BACKGROUND WORKER (async, non-blocking)
Q->>W: dequeue(team_id)
W->>VZ: GET /getenvs
VZ-->>W: Return environments
W->>W: Validate payload
W->>DB: UPDATE environments (last_updated = NOW())
Note over U: Next page load shows fresh data
end
end
4.2 Sad Path: Edge Cases and Failures¶
sequenceDiagram
participant U as User Browser
participant API as FastAPI
participant DB as PostgreSQL
participant Q as Refresh Queue
participant W as Background Worker
participant VZ as Virtuozzo API
Note over U,DB: User logs in, data is STALE
U->>API: GET /api/v1/environments
API->>DB: SELECT * FROM environments WHERE team_id = ?
DB-->>API: Return records (last_updated = 2 hours ago)
Note over API: Data is STALE (> 10 min)
API-->>U: 200 OK (return stale data immediately)
API->>Q: enqueue(team_id, priority="normal")
Note over Q,W: Background Worker processes queue
Q->>W: dequeue(team_id)
W->>VZ: GET /getenvs
alt API Down / Timeout
VZ--xW: Timeout (5 seconds)
W->>W: Log error, retry later (exponential backoff)
Note over DB: DB retains stale data (no overwrite)
Note over U: User still sees old data (better than error)
else API Returns 500
VZ-->>W: 500 Internal Server Error
W->>W: Log error, mark for retry
Note over DB: Stale data preserved
else Incomplete Data
VZ-->>W: 200 OK (corrupted payload)
W->>W: Schema validation FAILS
W->>W: Discard new data, keep old
Note over DB: Stale data protected from corruption
end
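The "retry later (exponential backoff)" step in the diagram above can be sketched as a delay schedule. This is an illustration under stated assumptions: the function name `backoff_delay`, the 1-second base, the 60-second cap, and the use of full jitter are not specified anywhere in this document.

```python
import random


def backoff_delay(
    attempt: int,
    base: float = 1.0,
    cap: float = 60.0,
    jitter: bool = True,
) -> float:
    """Seconds to wait before retry number `attempt` (1-based)."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    # Double the delay on every attempt, but never exceed the cap.
    delay = min(cap, base * 2 ** (attempt - 1))
    if jitter:
        # Full jitter: spread retries out so failed teams do not retry in lockstep.
        return random.uniform(0.0, delay)
    return delay


# Without jitter, three attempts wait 1s, 2s, and 4s.
schedule = [backoff_delay(n, jitter=False) for n in (1, 2, 3)]
```

With three retry attempts the worst-case extra wait before giving up is 7 seconds without jitter, during which the database keeps serving the stale rows.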
4.3 ASCII Architecture Diagram¶
USER ACTION SYSTEM LOGIC DATA SOURCE
+-----------+ +-------------------------------------+
| User Logs |------>| 1. Query Local Database for Env |
| In | +-------------------------------------+
+-----------+ |
v
/-------------------------\
| Data Exists in DB? |
\-----------+-------------/
|
NO (First Run) | YES (Returning User)
+--------------------+-----------------------+
| |
v v
+------------------------+ /--------------------------\
| 2. BLOCKING FETCH | | Check Timestamp (TTL) |
| (Show Loading Spinner) | \-----------+--------------/
+-----------+------------+ |
| +-----------------+-----------------+
| | |
v FRESH (< 10m old) STALE (> 10m old)
+------------------------+ | |
| Call Virtuozzo API | | |
+-----------+------------+ | +-----------------------------+
| | | 3. RETURN STALE DATA (FAST) |
v v | (Dashboard loads instantly) |
+------------------------+ +----------------------+ +-------------+---------------+
| Validate & Save to DB |<---| RETURN DATA TO UI | |
+-----------+------------+ +----------------------+ |
| ^ +-------------v---------------+
| | | 4. PUSH TO UPDATE QUEUE |
v | | (Fire-and-Forget) |
+------------------------+ | +-------------+---------------+
| RETURN DATA TO UI |------------+ |
+------------------------+ |
|
-------------------------------------------------------------------|
| ASYNCHRONOUS BACKGROUND WORKER (Independent Process)
| - Rate-limited: 50 req/sec max
| - Active-user priority only
| - No global cron iteration
|
v
+----------------------+ +-------------------------+
| Queue Consumer |------>| Call Virtuozzo API |
+----------------------+ +-----------+-------------+
|
/-------------------------\
| Request Successful? |
\-----------+-------------/
|
YES | NO (Sad Path)
+----------------------+ | +-------------------------+
| Compare Hash/Diff | | | Log Error / Retry Later |
| (Is data different?) | | | (Do NOT overwrite DB) |
+----------+-----------+ | +-------------------------+
| |
+----------v-----------+ |
| Update DB Timestamp | |
| & Save New Data | |
+----------------------+ |
4.4 Queue-Based System Design¶
Why Queue-Based Instead of Global Cron?
| Approach | Complexity | Scalability | Resource Usage |
|---|---|---|---|
| Global Cron | O(N) - iterate all users | Breaks at 100K+ users | Wastes cycles on inactive users |
| Queue-Based | O(active users) | Scales to millions | Only processes active users |
Queue Architecture:
graph TB
subgraph "Request Layer"
API[FastAPI /environments endpoint]
end
subgraph "Queue Layer"
Q[RabbitMQ / Redis Queue]
ROUTER[Priority Router]
end
subgraph "Worker Layer"
W1[Worker 1]
W2[Worker 2]
W3[Worker N]
end
subgraph "External Layer"
VZ[Virtuozzo API]
end
API -->|enqueue sync task| Q
Q --> ROUTER
ROUTER -->|round-robin| W1
ROUTER -->|round-robin| W2
ROUTER -->|round-robin| W3
W1 -->|rate-limited 50/sec| VZ
W2 -->|rate-limited 50/sec| VZ
W3 -->|rate-limited 50/sec| VZ
style Q fill:#ff9,stroke:#333,stroke-width:3px
style VZ fill:#f9f,stroke:#333,stroke-width:3px
Queue Message Format:
{
  "task_type": "sync_environments",
  "team_id": 123,
  "priority": "normal",
  "enqueued_at": "2026-01-22T10:30:00Z",
  "triggered_by": "stale_data",
  "session_key_encrypted": "..."
}
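A message in this shape could be assembled on the API side as follows. This is a sketch only: `build_sync_message` is a hypothetical helper, and the caller is assumed to pass an already-encrypted session key and a timezone-aware timestamp.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_sync_message(
    team_id: int,
    session_key_encrypted: str,
    triggered_by: str = "stale_data",
    priority: str = "normal",
    now: Optional[datetime] = None,
) -> str:
    """Serialize a sync task using the field names from the message format above."""
    # `now` must be timezone-aware; it defaults to the current UTC time.
    now = now or datetime.now(timezone.utc)
    message = {
        "task_type": "sync_environments",
        "team_id": team_id,
        "priority": priority,
        # ISO 8601 in UTC with a trailing "Z", matching the example payload.
        "enqueued_at": now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "triggered_by": triggered_by,
        "session_key_encrypted": session_key_encrypted,
    }
    return json.dumps(message)
```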
Worker Processing Logic:
# Pseudocode
async def process_sync_task(task: SyncTask):
    # 1. Rate limiting check (semaphore), held only around the API call
    async with rate_limiter.acquire():
        # 2. Decrypt session key
        session_key = decrypt(task.session_key_encrypted)
        # 3. Call Virtuozzo with timeout
        try:
            response = await virtuozzo_client.get_envs(
                session=session_key,
                lazy=True,
                timeout=5.0
            )
        except Timeout:
            # DO NOT overwrite DB with error
            logger.error("sync_timeout", team_id=task.team_id)
            return  # Stale data preserved

    # 4. Validate payload schema
    if not validate_schema(response):
        logger.error("sync_invalid_payload", team_id=task.team_id)
        return  # Stale data protected from corruption

    # 5. Compare hash (avoid unnecessary writes)
    new_hash = hash_payload(response)
    old_hash = await get_last_sync_hash(task.team_id)
    if new_hash == old_hash:
        # Data unchanged: skip the upsert but still refresh the timestamp,
        # otherwise the rows stay "stale" and every request re-enqueues a sync
        await update_last_synced_at(task.team_id, new_hash)
        logger.info("sync_unchanged", team_id=task.team_id)
        return

    # 6. Update database
    await upsert_environments(task.team_id, response)
    # 7. Update timestamp
    await update_last_synced_at(task.team_id, new_hash)
    logger.info("sync_complete", team_id=task.team_id)
5. Technical Specifications¶
5.1 Configuration¶
| Setting | Default | Description |
|---|---|---|
| VZ_ENV_TTL_SECONDS | 600 | 10 minutes - Data freshness window |
| VZ_ENV_MAX_STALE_SECONDS | 3600 | 1 hour - Show warning after this |
| VZ_SYNC_QUEUE_RATE_LIMIT | 50 | Requests per second max |
| VZ_SYNC_TIMEOUT_SECONDS | 5 | Virtuozzo API timeout (seconds) |
| VZ_SYNC_RETRY_ATTEMPTS | 3 | Max retry attempts with exponential backoff |
| VZ_SYNC_WORKER_CONCURRENCY | 10 | Number of worker processes |
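The settings above could be loaded and sanity-checked along these lines. This sketch deliberately uses only the standard library, so it is a stand-in rather than the project's actual configuration class; `SyncSettings` and `from_env` are hypothetical names, and the rule that the warning threshold must not be shorter than the TTL is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class SyncSettings:
    ttl_seconds: int = 600
    max_stale_seconds: int = 3600
    queue_rate_limit: int = 50
    timeout_seconds: int = 5
    retry_attempts: int = 3
    worker_concurrency: int = 10

    def __post_init__(self) -> None:
        # Assumption: a warning threshold below the TTL would flag fresh data.
        if self.max_stale_seconds < self.ttl_seconds:
            raise ValueError("VZ_ENV_MAX_STALE_SECONDS must be >= VZ_ENV_TTL_SECONDS")

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "SyncSettings":
        """Read each variable from the environment, falling back to the defaults."""
        source = os.environ if env is None else env

        def read(name: str, default: int) -> int:
            return int(source.get(name, default))

        return cls(
            ttl_seconds=read("VZ_ENV_TTL_SECONDS", 600),
            max_stale_seconds=read("VZ_ENV_MAX_STALE_SECONDS", 3600),
            queue_rate_limit=read("VZ_SYNC_QUEUE_RATE_LIMIT", 50),
            timeout_seconds=read("VZ_SYNC_TIMEOUT_SECONDS", 5),
            retry_attempts=read("VZ_SYNC_RETRY_ATTEMPTS", 3),
            worker_concurrency=read("VZ_SYNC_WORKER_CONCURRENCY", 10),
        )
```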
5.2 Database Schema¶
-- Add to existing environments table
ALTER TABLE environments
    ADD COLUMN last_synced_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    ADD COLUMN last_sync_hash VARCHAR(64),  -- SHA-256 for change detection
    ADD COLUMN sync_status VARCHAR(20) DEFAULT 'ok',  -- ok, stale, failed
    ADD COLUMN sync_error_message TEXT,
    ADD COLUMN api_calls_count INTEGER DEFAULT 0,
    ADD COLUMN last_sync_duration_ms INTEGER;

-- Indexes for freshness queries
CREATE INDEX idx_environments_team_sync ON environments(team_id, last_synced_at);
CREATE INDEX idx_environments_sync_status ON environments(sync_status);

-- Sync queue tracking (optional, for monitoring)
CREATE TABLE environment_sync_queue (
    id SERIAL PRIMARY KEY,
    team_id INTEGER NOT NULL,
    enqueued_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    started_at TIMESTAMP WITH TIME ZONE,
    completed_at TIMESTAMP WITH TIME ZONE,
    status VARCHAR(20),  -- pending, processing, completed, failed
    error_message TEXT,
    retry_count INTEGER DEFAULT 0
);
5.3 API Contract¶
GET /api/v1/environments
Response:
{
  "items": [
    {
      "env_name": "staging",
      "display_name": "Staging Environment",
      "shortdomain": "staging.example.com",
      "app_id": "12345",
      "status": "active",
      "nodes": [...]
    }
  ],
  "meta": {
    "last_synced_at": "2026-01-22T10:25:00Z",
    "is_stale": false,
    "stale_warning": null,
    "sync_in_progress": false
  }
}
Headers:
- X-Data-Stale: true if data is older than TTL
- X-Data-Last-Sync: ISO timestamp of last sync
- X-Sync-In-Progress: true if background sync is running
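The meta object and the three headers can be derived from a single timestamp. The following is a minimal sketch, not the existing service code: `build_meta_and_headers` is a hypothetical helper, the warning text reuses the "Data may be delayed" wording from the user goals, and the thresholds default to the values in the configuration table.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Tuple


def build_meta_and_headers(
    last_synced_at: datetime,
    now: datetime,
    sync_in_progress: bool,
    ttl_seconds: int = 600,
    max_stale_seconds: int = 3600,
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Return (meta, headers) for a GET /api/v1/environments response."""
    age = (now - last_synced_at).total_seconds()
    is_stale = age >= ttl_seconds
    # Only warn once the data is older than the max-stale threshold.
    stale_warning = "Data may be delayed" if age >= max_stale_seconds else None
    iso = last_synced_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

    meta = {
        "last_synced_at": iso,
        "is_stale": is_stale,
        "stale_warning": stale_warning,
        "sync_in_progress": sync_in_progress,
    }
    headers = {
        "X-Data-Stale": "true" if is_stale else "false",
        "X-Data-Last-Sync": iso,
        "X-Sync-In-Progress": "true" if sync_in_progress else "false",
    }
    return meta, headers
```

Keeping both outputs in one function makes it harder for the body and the headers to disagree about staleness.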
6. Sad Path Engineering¶
6.1 Failure Modes and Mitigations¶
| Failure Mode | Impact | Mitigation | Test Case |
|---|---|---|---|
| Virtuozzo API timeout | Stale data served | Log error, preserve stale data | TEST-TIMEOUT-001 |
| Virtuozzo API 500 | Stale data served | Exponential backoff retry | TEST-500-001 |
| Corrupted payload | Stale data protected | Schema validation before DB write | TEST-CORRUPT-001 |
| Queue processing failure | Stale data persists | Retry with exponential backoff | TEST-QUEUE-FAIL-001 |
| Database connection lost | 503 Service Unavailable | Circuit breaker, retry connection | TEST-DB-001 |
| Session key expired | Re-fetch session | Decrypt error → re-auth | TEST-AUTH-001 |
| Worker crash mid-sync | Partial data update | Transaction rollback | TEST-WORKER-CRASH-001 |
| Queue backlog | Data stays stale longer | Priority queue for recent logins | TEST-BACKLOG-001 |
| Massive concurrent logins | Queue spike | Rate limiter throttles API calls | TEST-SPIKE-001 |
| Clock skew between servers | Incorrect TTL calculation | Use DB time, not app time | TEST-CLOCK-001 |
6.2 Data Integrity Strategies¶
Atomic Updates:
from sqlalchemy import delete

async def upsert_environments(team_id: int, env_data: list[dict]):
    # Transaction: commits when the block exits cleanly, rolls back on any exception
    async with db.begin():
        # Clear old environments
        await db.execute(
            delete(Environment).where(Environment.team_id == team_id)
        )
        # Insert new environments
        for env in env_data:
            db.add(Environment(**env))
        # All or nothing: no explicit commit is needed inside db.begin()
Hash-Based Change Detection:
import hashlib
import json

def compute_env_hash(environments: list[dict]) -> str:
    """SHA-256 hash for change detection."""
    normalized = json.dumps(environments, sort_keys=True)
    return hashlib.sha256(normalized.encode()).hexdigest()

# Only update DB if hash changed
if new_hash != old_hash:
    await upsert_environments(team_id, environments)
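One caveat with hashing a JSON dump: sort_keys=True normalizes dictionary key order but not list order, so the same environments returned in a different order would hash differently and trigger a needless write. A minimal sketch of an order-insensitive variant is shown here; the name `compute_env_hash_stable` and the `env_name` sort key are assumptions for illustration, not part of the existing codebase.

```python
import hashlib
import json
from typing import Any, Dict, List


def compute_env_hash_stable(
    environments: List[Dict[str, Any]],
    key: str = "env_name",
) -> str:
    """SHA-256 over the environment list, independent of list and key order."""
    # Sort the top-level list by a stable identifier before serializing.
    ordered = sorted(environments, key=lambda env: str(env.get(key, "")))
    # Compact separators keep the digest independent of whitespace choices.
    normalized = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


a = [{"env_name": "prod", "status": "active"}, {"env_name": "staging", "status": "active"}]
b = [{"status": "active", "env_name": "staging"}, {"env_name": "prod", "status": "active"}]
same = compute_env_hash_stable(a) == compute_env_hash_stable(b)
```

Nested lists such as `nodes` are still order-sensitive in this sketch and would need the same treatment if their order is not stable.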
Schema Validation:
def validate_virtuozzo_response(data: dict) -> bool:
    """Validate response structure before DB write."""
    required_fields = ["result", "infos"]
    if not all(k in data for k in required_fields):
        return False
    if not isinstance(data.get("infos"), list):
        return False
    for env in data["infos"]:
        if "env" not in env:
            return False
    return True
7. Observability¶
7.1 Metrics¶
| Metric | Type | Labels | Purpose |
|---|---|---|---|
| env_sync_requests_total | Counter | team_id, status | Track sync volume |
| env_sync_duration_seconds | Histogram | team_id, result | Track sync latency |
| env_sync_queue_depth | Gauge | priority | Monitor queue backlog |
| env_data_age_seconds | Histogram | team_id | Track data freshness |
| env_stale_served_total | Counter | team_id, staleness_bucket | Track staleness |
| virtuozzo_api_requests_total | Counter | endpoint, status | API call tracking |
| virtuozzo_api_errors_total | Counter | endpoint, error_type | Failure analysis |
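The staleness_bucket label on env_stale_served_total needs a fixed mapping from data age to a small set of label values. One possible mapping is sketched here; the bucket boundaries (TTL, the 30-minute alert threshold, and the 1-hour warning threshold) and the label strings are assumptions, as the document does not define them.

```python
from typing import Tuple

# (exclusive upper bound in seconds, label); boundaries are illustrative.
_BUCKETS: Tuple[Tuple[int, str], ...] = (
    (600, "lt_10m"),
    (1800, "10m_to_30m"),
    (3600, "30m_to_1h"),
)


def staleness_bucket(age_seconds: float) -> str:
    """Return a low-cardinality label for the age of the served data."""
    # Clamp negative ages, which can appear under clock skew.
    age = max(0.0, age_seconds)
    for upper_bound, label in _BUCKETS:
        if age < upper_bound:
            return label
    return "ge_1h"
```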
7.2 Structured Logging¶
logger.info(
    "sync_started",
    team_id=team_id,
    triggered_by="stale_data",
    current_data_age_seconds=age.total_seconds(),
    task_id=task_id
)

logger.info(
    "sync_complete",
    team_id=team_id,
    duration_seconds=duration,
    env_count=len(environments),
    data_changed=(new_hash != old_hash),
    api_calls=1
)

logger.error(
    "sync_failed",
    team_id=team_id,
    error_type="timeout",
    error_message=str(e),
    retry_count=retry_count,
    will_retry=True
)
7.3 Alerts¶
| Alert | Condition | Severity | Action |
|---|---|---|---|
| High Stale Rate | stale_served / total > 0.5 | Warning | Check worker health |
| Queue Backlog | queue_depth > 1000 | Warning | Scale workers |
| API Error Rate | api_errors / api_calls > 0.1 | Critical | Check Virtuozzo status |
| Data Age | data_age_p95 > 1800 (30 min) | Warning | Review TTL settings |
| Worker Down | worker_heartbeat > 60s | Critical | Restart workers |
8. Testing Requirements¶
8.1 Unit Tests¶
# Test: Fresh data returned from cache
async def test_fresh_data_returned_immediately():
    # Setup: DB has data from 5 minutes ago
    await insert_test_envs(team_id=1, age_minutes=5)
    # Act: Request environments
    result = await list_environments(team_id=1)
    # Assert: No API call made, data returned
    assert_virtuozzo_not_called()
    assert len(result) == 5

# Test: Stale data returned, queue task created
async def test_stale_data_triggers_background_sync():
    # Setup: DB has data from 20 minutes ago
    await insert_test_envs(team_id=1, age_minutes=20)
    # Act: Request environments
    result = await list_environments(team_id=1)
    # Assert: Stale data returned immediately
    assert len(result) == 5
    # Assert: Queue task created
    assert_queue_contains(team_id=1)

# Test: Corrupted payload does not overwrite DB
async def test_corrupted_payload_preserves_stale_data():
    # Setup: DB has valid data
    await insert_test_envs(team_id=1)
    # Act: Simulate sync with corrupted response
    await process_sync_task(
        team_id=1,
        response={"invalid": "payload"}
    )
    # Assert: DB unchanged
    db_data = await get_environments(team_id=1)
    assert len(db_data) == 5  # Original data preserved
8.2 Sad Path Tests (3:1 Ratio)¶
| Test | Scenario | Expected |
|---|---|---|
| test_timeout_preserves_stale | Virtuozzo times out | Stale data in DB, error logged |
| test_500_preserves_stale | Virtuozzo returns 500 | Stale data in DB, retry scheduled |
| test_invalid_json_rejected | Non-JSON response | Stale data preserved |
| test_schema_validation_fails | Missing required fields | Stale data preserved |
| test_worker_crash_rollback | Worker dies mid-transaction | Transaction rolled back |
| test_queue_full_rejected | Queue at capacity | Return 503, retry later |
| test_session_key_expired | 401 from Virtuozzo | Re-fetch session, retry |
| test_database_connection_lost | DB unavailable | Return 503, log error |
8.3 Integration Tests¶
# Test: End-to-end stale-while-revalidate flow
async def test_stale_while_revalidate_e2e():
    # 1. Setup: Insert stale data (20 min old)
    await insert_test_envs(team_id=1, age_minutes=20)
    # 2. Act: Request environments (should return stale + enqueue)
    response = await client.get("/api/v1/environments")
    assert response.status_code == 200
    assert response.json()["meta"]["is_stale"] is True
    # 3. Act: Process queue (background worker)
    await worker_process_one_task()
    # 4. Assert: DB updated with fresh data
    db_data = await get_environments(team_id=1)
    assert db_data[0]["last_synced_at"] > now() - timedelta(minutes=1)
    # 5. Act: Request again (should be fresh)
    response = await client.get("/api/v1/environments")
    assert response.json()["meta"]["is_stale"] is False
9. Success Metrics¶
9.1 Key Performance Indicators¶
| Metric | Baseline | Target | Measurement |
|---|---|---|---|
| UI Response Time (p95) | 2000-5000ms | < 100ms | Frontend RUM |
| Blocking Requests | 100% | < 1% | Backend metrics |
| API Calls per Session | 1 per login | 1 per 10 min | API call counter |
| Data Staleness (p95) | Infinite | < 10 min | last_synced_at age |
| Queue Processing Time | N/A | < 30 sec | Queue duration metric |
| Error Rate | < 0.1% | < 0.1% | HTTP 5xx rate |
9.2 Scalability Metrics¶
| Scenario | Metric | Target |
|---|---|---|
| 1K concurrent logins | Queue depth | < 100 |
| 10K concurrent logins | Queue depth | < 1000 |
| 100K concurrent logins | API rate (per sec) | ≤ 50 |
| 1M total users, 10K active | Queue processing | Linear with active users |
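A back-of-the-envelope check on these targets, assuming one queued task per login, one Virtuozzo call per task, and the 50 req/sec ceiling from VZ_SYNC_QUEUE_RATE_LIMIT. The helper name `drain_time_seconds` is hypothetical.

```python
def drain_time_seconds(queue_depth: int, rate_per_sec: float = 50.0) -> float:
    """Time for a rate-limited consumer to empty a queue of the given depth."""
    if rate_per_sec <= 0:
        raise ValueError("rate_per_sec must be positive")
    return queue_depth / rate_per_sec


for burst in (1_000, 10_000, 100_000):
    seconds = drain_time_seconds(burst)
    print(f"{burst:>7} tasks -> {seconds:.0f} s ({seconds / 60:.1f} min)")
# prints:
#    1000 tasks -> 20 s (0.3 min)
#   10000 tasks -> 200 s (3.3 min)
#  100000 tasks -> 2000 s (33.3 min)
```

Under these assumptions a 100K burst takes about 33 minutes to drain, which is longer than both the 10-minute TTL and the 30-second queue processing target in 9.1, so those targets likely hold only for smaller bursts or when several logins from the same team share one task.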
10. Open Questions¶
| ID | Question | Owner | Priority |
|---|---|---|---|
| Q-001 | Is 10-minute TTL acceptable for business requirements? | Product | HIGH |
| Q-002 | Should stale data warning appear at 30 min or 1 hour? | UX | MEDIUM |
| Q-003 | What's the Virtuozzo API rate limit per session key? | Engineering | HIGH |
| Q-004 | Should we implement priority queue for paying customers? | Product | LOW |
| Q-005 | How many concurrent workers do we provision? | DevOps | MEDIUM |
11. Implementation Checklist¶
| Phase | Task | Status | Notes |
|---|---|---|---|
| Phase 1 | Database schema changes | Pending | Add last_synced_at, last_sync_hash |
| Phase 1 | Create queue infrastructure | Pending | RabbitMQ/Redis queue setup |
| Phase 2 | Implement sync_or_fetch_cached() | Pending | Core stale-while-revalidate logic |
| Phase 2 | Implement background worker | Pending | Queue consumer with rate limiting |
| Phase 3 | Add staleness indicators to API | Pending | Response headers, meta object |
| Phase 3 | Implement frontend stale badge | Pending | Visual indicator |
| Phase 4 | Observability (metrics, logging) | Pending | OpenTelemetry integration |
| Phase 4 | Testing (unit, integration, sad path) | Pending | 85% coverage target |
| Phase 5 | Documentation and runbooks | Pending | Ops guides |
| Phase 5 | Production deployment | Pending | Feature flags |
12. References¶
| Document | Location |
|---|---|
| Virtuozzo API Documentation | https://docs.jelastic.com/api/ |
| MBPanel Architecture Guide | /docs/architecture/001-hybrid-modular-ddd.md |
| SSE Notification Pattern | /docs/architecture/WEBSOCKET/SSE-Notif.md |
| Current Implementation | /backend/app/domains/environments/ |
13. Feasibility Analysis & Current Tech Stack Assessment¶
13.1 Overall Feasibility: FEASIBLE with MINOR MODIFICATIONS¶
Date: 2026-01-22
Assessment Method: Code-based evidence analysis of existing infrastructure
The Stale-While-Revalidate architecture design aligns well with the existing tech stack. All required infrastructure components are present or can be added with minimal effort.
13.2 Infrastructure Component Analysis¶
Queue Infrastructure¶
| PRD Requirement | Current State | Evidence File | Gap | Recommendation |
|---|---|---|---|---|
| RabbitMQ message broker | AVAILABLE | docker-compose.yml:24-43 | None | Already configured |
| Redis for rate limiting | AVAILABLE | docker-compose.yml:16-22 | None | Already configured |
| Celery workers | AVAILABLE | celery_app.py:8-23 | None | Configured with RabbitMQ |
| SSE event fan-out | AVAILABLE | sse.py:48-100 | None | SSEBroker implemented |
| Dedicated sync queue | MISSING | N/A | Low | Add jobs.env.sync queue to config |
Configuration Evidence (backend/app/core/config.py:31-39):
rabbitmq_url: str = "amqp://mbpanel:mbpanel_pass@rabbitmq:5672/" # pragma: allowlist secret
rabbitmq_jobs_exchange: str = "jobs.direct"
rabbitmq_jobs_queue: str = "jobs.env.create" # Add new: jobs.env.sync
Database & ORM¶
| PRD Requirement | Current State | Evidence File | Gap | Recommendation |
|---|---|---|---|---|
| PostgreSQL 16 | AVAILABLE | docker-compose.yml:2-14 | None | Latest version |
| Alembic migrations | AVAILABLE | alembic/versions/004_add_environment_fields.py | None | Pattern established |
| SQLAlchemy 2.0 async | AVAILABLE | environment.py:45-80 | None | AsyncSession ready |
| last_synced_at column | MISSING | environment.py | Medium | Add via migration |
| last_sync_hash column | MISSING | environment.py | Low | Add via migration |
| sync_status column | MISSING | environment.py | Low | On-the-fly calculation already exists |
| Transaction rollback | AVAILABLE | SQLAlchemy core | None | async with db.begin() |
Existing Migration Pattern (004_add_environment_fields.py:18-23):
def upgrade() -> None:
    op.add_column('environments', sa.Column('ishaenabled', sa.Boolean(), nullable=True))
    op.add_column('environments', sa.Column('sslstate', sa.Boolean(), nullable=True))
    op.add_column('environments', sa.Column('region', sa.String(100), nullable=True))
Recommended Migration: 005_add_environment_sync_tracking.py
Background Workers (Celery)¶
| PRD Requirement | Current State | Evidence File | Gap | Recommendation |
|---|---|---|---|---|
| Celery app configured | AVAILABLE | celery_app.py:8-23 | None | RabbitMQ broker |
| Task pattern | AVAILABLE | tasks.py:14-18 | None | create_environment_task exists |
| Late ack (at-least-once) | AVAILABLE | celery_app.py:15 | None | acks_late=True |
| Async support | AVAILABLE | tasks.py:17 | None | asyncio.run() |
| Sync task | MISSING | N/A | Medium | Add sync_environments_task |
| Rate limiting (50/s) | PARTIAL | rate_limit.py:11-26 | Low | Use semaphore + Redis limiter |
Existing Task Pattern (tasks.py:14-18):
@celery_app.task(name="domains.environments.create_environment", bind=True, acks_late=True)
def create_environment_task(self, payload: dict[str, Any]) -> None:
    logger.info("environment_job_received", job_id=payload.get("job_id"))
    asyncio.run(service.execute_environment_job(payload))
Virtuozzo Integration¶
| PRD Requirement | Current State | Evidence File | Gap | Recommendation |
|---|---|---|---|---|
| VirtuozzoClient | AVAILABLE | client.py:36-57 | None | Async HTTP client |
| GetEnvs API | AVAILABLE | service.py:97-116 | None | fetch_environments_and_nodes() |
| Timeout support | AVAILABLE | client.py:53-55 | Very Low | 15s default, make configurable |
| Lazy loading | AVAILABLE | service.py:99 | None | lazy: bool parameter |
| Session management | AVAILABLE | Team model | None | Encrypted keys stored |
| Hash-based change detection | MISSING | N/A | Low | Add SHA-256 to service.py |
Function Signature Evidence (service.py:97-102):
async def fetch_environments_and_nodes(
    session_key: str,
    lazy: bool = False,
    owner_uid: int | None = None,
    vz_client: VirtuozzoClient | None = None
) -> dict[str, Any]:
API Layer¶
| PRD Requirement | Current State | Evidence File | Gap | Recommendation |
|---|---|---|---|---|
| FastAPI router | AVAILABLE | router.py:11-26 | None | GET + POST exist |
| Service layer | AVAILABLE | service.py:41-45 | None | list_environments() |
| Response serialization | AVAILABLE | service.py:527-544 | None | _serialize_environment() |
| CSRF protection | AVAILABLE | router.py:20 | None | dependencies=[Depends(csrf_protected)] |
| meta object in response | MISSING | router.py:17 | Medium | Add to return type |
| Response headers | MISSING | N/A | Low | Add X-Data-Stale, etc. |
| Queue enqueue on stale | MISSING | N/A | Medium | Add publish_event() call |
Current Response Format (router.py:11-17):
@router.get("/")
async def environments(...) -> dict[str, list]:
    items = await list_environments(db=db, current_user=user)
    return {"items": items}  # Missing: meta object
Rate Limiting¶
| PRD Requirement | Current State | Evidence File | Gap | Recommendation |
|---|---|---|---|---|
| Redis-based limiter | AVAILABLE | rate_limit.py:11-26 | None | Counter-based (sufficient) |
| Redis client | AVAILABLE | config.py:46 | None | redis_url configured |
| Login rate limit pattern | IMPLEMENTED | rate_limit.py:28-47 | None | Pattern to follow |
| Token bucket algorithm | MISSING | N/A | Medium | Use asyncio.Semaphore(50) instead (note: a semaphore caps concurrent calls, not calls per second) |
| Virtuozzo-specific limiter | MISSING | N/A | Low | Create wrapper class |
Existing Implementation (rate_limit.py:11-26):
class RateLimiter:
    """Redis-based counter limiter using INCR semantics."""
    async def check(self, key: str, limit: int, window_seconds: int) -> None:
        count = await self.redis.incr(key)
        if count == 1:
            await self.redis.expire(key, window_seconds)
        if count > limit:
            raise HTTPException(status_code=429)
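For comparison with the counter-based limiter above, a token bucket enforces a per-second rate while still allowing short bursts. The class below is a minimal in-process sketch, not existing project code: `TokenBucket` and `try_acquire` are hypothetical names, and a real deployment with several worker processes would need the bucket state in a shared store such as Redis so that the 50 req/sec ceiling applies across all workers rather than per worker.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow up to `rate` acquisitions per second with bursts up to `capacity`."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity  # start full so the first burst is allowed
        self._last = clock()

    def try_acquire(self, tokens: float = 1.0) -> bool:
        """Take `tokens` if available; return False instead of blocking."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= tokens:
            self._tokens -= tokens
            return True
        return False


# Usage: a worker would requeue or delay the task when this returns False.
bucket = TokenBucket(rate=50, capacity=50)
allowed = bucket.try_acquire()
```

The injectable clock is there so the refill logic can be tested without sleeping.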
Frontend¶
| PRD Requirement | Current State | Evidence File | Gap | Recommendation |
|---|---|---|---|---|
| Next.js 16 | AVAILABLE | package.json:39 | None | Latest version |
| React 19 | AVAILABLE | package.json:42 | None | Latest version |
| TanStack Query v5 | INSTALLED | package.json:32-33 | Low | Available but unused for environments |
| useEnvironments hook | IMPLEMENTED | useEnvironments.ts:20-106 | None | Custom hook pattern |
| API client | AVAILABLE | api-client.ts:18-135 | None | Full fetch wrapper |
| SyncStatus types | DEFINED | environment.types.ts:5-9 | None | Interface exists |
| Stale badge display | PARTIAL | sites/page.tsx:62-68 | Low | Code checks sync_status |
| Background refresh | MISSING | N/A | Medium | Add SSE listener |
Current Frontend Pattern (useEnvironments.ts:55-73):
const syncAndRefetch = useCallback(async () => {
  const syncResult = await triggerSync(); // Currently blocks
  const data = await getEnvironments();
  setEnvironments(data);
}, []);
Observability¶
| PRD Requirement | Current State | Evidence File | Gap | Recommendation |
|---|---|---|---|---|
| Structured logging | AVAILABLE | logging.py | None | Structlog with JSON |
| OpenTelemetry | AVAILABLE | telemetry.py:92-172 | None | Full OTEL setup |
| Tracer helper | AVAILABLE | observability.py:10-23 | None | telemetry_span() |
| SigNoz compatibility | YES | telemetry.py:254-286 | None | Canonical attributes |
| Context propagation | AVAILABLE | Core modules | None | request_id_ctx, team_id_ctx |
| Sync-specific metrics | MISSING | N/A | Low | Add using get_meter() |
Existing Logging Pattern (service.py:624):
Security & Sad Path¶
| PRD Requirement | Current State | Evidence File | Gap | Recommendation |
|---|---|---|---|---|
| Crypto (decrypt/encrypt) | AVAILABLE | crypto.py | None | decrypt_str(), encrypt_str() |
| Pydantic validation | AVAILABLE | Throughout | None | Request/response validation |
| SQLAlchemy transactions | AVAILABLE | Core | None | async with db.begin() |
| Transaction rollback | AUTOMATIC | SQLAlchemy | None | On exception |
| Error handling | AVAILABLE | HTTPException | None | Proper status codes |
| Security events logging | AVAILABLE | security_events.py | None | Auth violation tracking |
13.3 Configuration Variables
| PRD Variable | Default | Current State | Evidence File | Action |
|---|---|---|---|---|
| VZ_ENV_TTL_SECONDS | 600 | HARDCODED | service.py:577 | Move to config |
| VZ_ENV_MAX_STALE_SECONDS | 3600 | HARDCODED | service.py:579 | Move to config |
| VZ_SYNC_QUEUE_RATE_LIMIT | 50 | NOT DEFINED | N/A | Add to config |
| VZ_SYNC_TIMEOUT_SECONDS | 5 | 15 (different) | client.py:53 | Override for sync |
| VZ_SYNC_RETRY_ATTEMPTS | 3 | NOT DEFINED | N/A | Add to config |
| VZ_SYNC_WORKER_CONCURRENCY | 10 | NOT DEFINED | N/A | Add to config |
Existing Config Pattern (config.py:19-33):
class Settings(BaseSettings):
    model_config = ConfigDict(
        env_prefix="VZ_",
        env_file=".env",
        env_file_encoding="utf-8",
        extra="ignore",
    )
    virtuozzo_timeout_seconds: int = 15
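To make the "Move to config" and "Add to config" actions concrete, here is a minimal standard-library sketch that loads the six variables with the PRD defaults from the table above. `SyncSettings` and `load_sync_settings` are hypothetical names used only for illustration; the real change would presumably add equivalent fields to the existing `Settings` class rather than introduce a separate loader.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping, Optional


@dataclass(frozen=True)
class SyncSettings:
    """Hypothetical container for the six PRD variables (defaults from the table)."""

    vz_env_ttl_seconds: int = 600
    vz_env_max_stale_seconds: int = 3600
    vz_sync_queue_rate_limit: int = 50
    vz_sync_timeout_seconds: int = 5
    vz_sync_retry_attempts: int = 3
    vz_sync_worker_concurrency: int = 10


def load_sync_settings(env: Optional[Mapping[str, str]] = None) -> SyncSettings:
    """Read each setting from the environment, falling back to the PRD default."""
    source = os.environ if env is None else env
    overrides = {}
    for field in fields(SyncSettings):
        # Field names map to upper-case variable names, e.g. VZ_ENV_TTL_SECONDS.
        raw = source.get(field.name.upper())
        if raw is not None:
            overrides[field.name] = int(raw)
    return SyncSettings(**overrides)
```

Calling `load_sync_settings({"VZ_SYNC_TIMEOUT_SECONDS": "15"})` overrides only the timeout and leaves the other five values at their defaults, which mirrors the "Override for sync" action in the table.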
13.4 DDD Architecture Alignment
The design aligns with the existing Hybrid Modular DDD architecture. Every PRD component maps to an existing location:
| PRD Design Component | Current Architecture Location | Evidence |
|---|---|---|
| EnvironmentSyncService | app/domains/environments/service.py | ✅ Match |
| Queue Consumer Worker | app/domains/environments/tasks.py | ✅ Match |
| Repository Layer | app/domains/environments/repository.py | ✅ Match |
| API Router | app/domains/environments/router.py | ✅ Match |
| API Composition | app/api/v1/__init__.py:34-36 | ✅ Already wired |
| Virtuozzo External Service | app/infrastructure/external/virtuozzo/ | ✅ Match |
| SSE Events | app/infrastructure/messaging/events.py | ✅ Match |
13.5 Implementation Effort Estimate
| Phase | Tasks | Estimated Time | Dependencies |
|---|---|---|---|
| Phase 1 | Database migration + Config | 1 day | None |
| Phase 2 | Service layer + Queue logic | 2 days | Phase 1 |
| Phase 3 | Celery worker + Rate limiting | 1 day | Phase 2 |
| Phase 4 | API response format + Headers | 1 day | Phase 2 |
| Phase 5 | Frontend stale badges | 1 day | Phase 4 |
| Phase 6 | Testing (85% coverage) | 2 days | All phases |
| Phase 7 | Docs + Runbooks | 1 day | All phases |
| Phase 8 | Production deployment | 1 day | All phases |
Total Estimated Effort: 10 days
13.6 Risk Assessment
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Database migration failure | Low | High | Test migration in staging first |
| Rate limiter blocking workers | Low | Medium | Use semaphore + fallback |
| Virtuozzo API timeout too aggressive | Medium | Medium | Make configurable, monitor |
| Queue backlog during spikes | Medium | Low | Add priority routing |
| Frontend stale badge confusion | Low | Low | Clear UX copy + tooltip |
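The "semaphore + fallback" mitigation for a rate limiter that blocks workers could look roughly like the sketch below. It assumes the fallback is to skip the refresh and keep serving cached data when no slot frees up in time. `run_with_limit` and `demo` are hypothetical names, not existing code.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def run_with_limit(
    semaphore: asyncio.Semaphore,
    call: Callable[[], Awaitable[T]],
    acquire_timeout: float,
) -> Optional[T]:
    """Run `call` under the semaphore; return None if no slot frees up in time."""
    try:
        await asyncio.wait_for(semaphore.acquire(), timeout=acquire_timeout)
    except asyncio.TimeoutError:
        # Fallback: the caller keeps serving stale data and retries later.
        return None
    try:
        return await call()
    finally:
        semaphore.release()


async def demo() -> list:
    # One slot for the demo; the checklist suggests asyncio.Semaphore(50).
    semaphore = asyncio.Semaphore(1)

    async def slow_fetch() -> str:
        await asyncio.sleep(0.2)  # stands in for a Virtuozzo API call
        return "fresh"

    # The second call cannot get the single slot within 50 ms and falls back.
    return await asyncio.gather(
        run_with_limit(semaphore, slow_fetch, acquire_timeout=0.05),
        run_with_limit(semaphore, slow_fetch, acquire_timeout=0.05),
    )
```

Note that an `asyncio.Semaphore` only bounds concurrency within a single event loop. A limit shared across several worker processes would need a different mechanism.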
13.7 Recommendation
PROCEED WITH IMPLEMENTATION
The required infrastructure already exists. The remaining gaps are minor and well understood, and the design aligns with the existing architecture patterns.
Next Steps:
1. Create 005_add_environment_sync_tracking.py migration
2. Add config variables to config.py
3. Implement sync_or_fetch_cached() in service.py
4. Add sync_environments_task to tasks.py
5. Update API response format
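As a rough illustration of the decision that sync_or_fetch_cached() in step 3 has to make, the sketch below classifies a cached record by age using the two TTL defaults from Section 13.3. The `Freshness` enum and `classify` function are hypothetical. How data older than the max-stale window is handled is an assumption here, not something this section specifies.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional

TTL_SECONDS = 600         # VZ_ENV_TTL_SECONDS default
MAX_STALE_SECONDS = 3600  # VZ_ENV_MAX_STALE_SECONDS default


class Freshness(Enum):
    MISSING = "missing"  # no cached rows: blocking first-run fetch
    FRESH = "fresh"      # younger than the TTL: return with zero API calls
    STALE = "stale"      # past the TTL: return cached data and queue a refresh
    EXPIRED = "expired"  # past max-stale: handling is an assumption of this sketch


def classify(last_synced_at: Optional[datetime], now: datetime) -> Freshness:
    """Map the age of the cached data to one of the four states."""
    if last_synced_at is None:
        return Freshness.MISSING
    age = (now - last_synced_at).total_seconds()
    if age < TTL_SECONDS:
        return Freshness.FRESH
    if age < MAX_STALE_SECONDS:
        return Freshness.STALE
    return Freshness.EXPIRED
```

Passing `now` in explicitly, instead of calling `datetime.now()` inside the function, keeps the boundary cases (exactly 10 minutes, exactly 1 hour) easy to cover in the sad-path tests.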
14. Implementation Checklist (Updated with Evidence)
| Phase | Task | Status | Evidence File | Notes |
|---|---|---|---|---|
| Phase 1 | Database schema changes | Pending | alembic/versions/ | Create 005_add_environment_sync_tracking.py |
| Phase 1 | Configuration variables | Pending | app/core/config.py | Add 6 new VZ_* settings |
| Phase 1 | Add sync queue config | Pending | celery_app.py | Add jobs.env.sync queue |
| Phase 2 | Implement sync_or_fetch_cached() | Pending | service.py:41-45 | Add after list_environments() |
| Phase 2 | Add hash computation | Pending | service.py | SHA-256 function |
| Phase 2 | Update list_environments() TTL check | Pending | service.py:41-45 | Add stale detection |
| Phase 2 | Add queue publish on stale | Pending | service.py | Use publish_event() |
| Phase 3 | Implement sync_environments_task | Pending | tasks.py:14-18 | Follow create_environment_task pattern |
| Phase 3 | Add rate limiting to worker | Pending | tasks.py | Use asyncio.Semaphore(50) |
| Phase 3 | Add retry with backoff | Pending | tasks.py | Celery autoretry_for |
| Phase 4 | Add meta object to response | Pending | router.py:11-17 | Update return type |
| Phase 4 | Add response headers | Pending | router.py | Add X-Data-Stale, etc. |
| Phase 4 | Update TypeScript types | Pending | environment.types.ts:45-47 | Add meta interface |
| Phase 5 | Display stale badges | Pending | sites/page.tsx:62-68 | Already checks sync_status |
| Phase 5 | Add SSE listener (optional) | Pending | Frontend | For background updates |
| Phase 6 | Add sync metrics | Pending | service.py | Use get_meter() |
| Phase 6 | Unit tests | Pending | tests/domains/environments/ | Target 85% coverage |
| Phase 6 | Sad path tests (3:1 ratio) | Pending | tests/ | Timeout, 500, corruption |
| Phase 6 | Integration tests | Pending | tests/integration/ | End-to-end flow |
| Phase 7 | Update runbooks | Pending | docs/ | Ops procedures |
| Phase 7 | Update API docs | Pending | docs/ | Include new response format |
| Phase 8 | Staging deployment | Pending | Infra | Test migration |
| Phase 8 | Production deployment | Pending | Infra | Feature flags |
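For the Phase 2 "Add hash computation" task, one possible shape of the SHA-256 function is sketched below. It assumes the hash is taken over the JSON payload returned by Virtuozzo so that an unchanged response can be detected cheaply. That purpose, the function name `compute_payload_hash`, and the example field names are assumptions for illustration.

```python
import hashlib
import json
from typing import Any


def compute_payload_hash(payload: Any) -> str:
    """Return a SHA-256 hex digest of a JSON-serializable payload.

    Keys are sorted and separators are fixed so that logically equal
    payloads produce the same digest regardless of key order.
    """
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical payloads: same content, different key order.
first = compute_payload_hash({"name": "env-1", "status": "running"})
second = compute_payload_hash({"status": "running", "name": "env-1"})
```

Here `first` and `second` are equal because the serialization is canonical, while any change to a value (for example `"status": "stopped"`) yields a different digest.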
15. References
| Document | Location |
|---|---|
| Virtuozzo API Documentation | https://docs.jelastic.com/api/ |
| MBPanel Architecture Guide | /docs/architecture/001-hybrid-modular-ddd.md |
| SSE Notification Pattern | /docs/architecture/WEBSOCKET/SSE-Notif.md |
| Current Implementation | /backend/app/domains/environments/ |
16. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2026-01-09 | MBPanel Team | Initial draft (lazy sync + background refresh) |
| 2.0 | 2026-01-22 | MBPanel Team | Major refactor: Stale-While-Revalidate + Queue-based architecture |
| 2.1 | 2026-01-22 | MBPanel Team | Added Section 13: Feasibility Analysis with evidence-based gaps and recommendations |
17. Approvals
| Role | Name | Signature | Date |
|---|---|---|---|
| Product Manager | | | |
| Engineering Lead | | | |
| Architecture Lead | | | |
| DevOps Lead | | | |