Architecture Requirements Document (ARD)

Virtuozzo Environment Synchronization - Stale-While-Revalidate Requirements

Document Information

| Field | Details |
| --- | --- |
| Project Name | MBPanel - Virtuozzo Environment Synchronization |
| Author | System Architecture Team |
| Date Created | 2026-01-09 |
| Last Updated | 2026-01-22 |
| Version | 2.0 |
| Status | Draft - Architecture Refresh |
| Related PRD | PRD-virtuozzo-environment-sync.md |
| Related ADD | ADD-virtuozzo-environment-sync.md |
| ARD Type | Domain Feature |

1. Introduction

1.1 Purpose

This document defines the architectural requirements for implementing the Stale-While-Revalidate pattern for Virtuozzo environment synchronization. The problem is one of distributed systems optimization: caching strategy, consistency versus availability trade-offs, event-driven queue processing, and scalable background workers.

The system prioritizes availability and user experience over strong consistency, serving cached data immediately while asynchronously refreshing in the background.

1.2 Architectural Shift (v2.0)

Key Changes from v1.0:

| Aspect | v1.0 (Previous) | v2.0 (Current) |
| --- | --- | --- |
| Refresh Strategy | Lazy sync + Celery cron | Queue-based event-driven |
| User Blocking | Could block on sync | NEVER blocks (except first run) |
| Cache Storage | Redis hot + PostgreSQL warm | PostgreSQL only (simpler) |
| Scalability | O(N) global cron iteration | O(active users) queue processing |
| API Protection | Distributed locking only | Rate-limited worker pool |
| Data Integrity | Direct overwrite | Hash-based change detection |

1.3 Core Principles

  1. Always Return Immediately: User requests NEVER wait for Virtuozzo API (except first run)
  2. Stale-While-Revalidate: Return stale data, refresh asynchronously
  3. Queue-Based Processing: Event-driven, not cron-based
  4. Active-User Priority: Only logged-in users trigger refresh
  5. Graceful Degradation: Serve stale with warning on errors
  6. Data Protection: Validate payload before DB write, never corrupt with error data

1.4 Scope

In Scope:

  • Stale-While-Revalidate pattern implementation
  • Queue-based background refresh (RabbitMQ/Redis)
  • Rate-limited worker pool for Virtuozzo API calls
  • PostgreSQL-only caching (survives restarts)
  • Hash-based change detection
  • Graceful error handling (preserve stale data)
  • Observability (metrics, logging, tracing)

Out of Scope:

  • Virtuozzo authentication (existing app/modules/sessions/)
  • Environment provisioning (existing app/domains/environments/tasks.py)
  • Real-time WebSocket sync (SSE for notifications only)
  • Redis hot cache (removed for simplicity)


2. System Overview

2.1 System Description

The Virtuozzo Environment Synchronization system implements the Stale-While-Revalidate pattern:

```
REQUEST → Check DB → Return Data (ALWAYS)
                    ├─ Fresh (< TTL)    → Done
                    └─ Stale (> TTL)    → Return stale + enqueue refresh

BACKGROUND WORKER (Independent)
→ Process queue at fixed rate
→ Call Virtuozzo API (rate-limited)
→ Validate payload
→ Update DB if changed
```

2.2 Architecture Type

Domain (Full DDD with state machines, invariants, and orchestration)

Rationale: Environment sync involves:

  • State management (fresh → stale → syncing → fresh/error)
  • Business invariants (max staleness, TTL thresholds)
  • Orchestration of multiple systems (Virtuozzo, Queue, PostgreSQL, SSE)
  • Complex failure modes requiring explicit handling

2.3 Context Diagram

```mermaid
graph TB
    subgraph "MBPanel Platform"
        subgraph "API Layer"
            API[FastAPI /environments]
        end

        subgraph "Sync Service"
            SYNC[EnvironmentSyncService<br/>stale-while-revalidate]
        end

        subgraph "Infrastructure"
            DB[(PostgreSQL<br/>primary cache)]
            Q[Refresh Queue<br/>RabbitMQ/Redis]
        end

        subgraph "Worker Pool"
            W1[Worker 1]
            W2[Worker 2]
            WN[Worker N]
            RATE[Rate Limiter<br/>50 req/sec]
        end
    end

    subgraph "External"
        VZ[Virtuozzo API<br/>source of truth]
    end

    API --> SYNC
    SYNC --> DB
    SYNC -->|enqueue stale| Q
    Q --> RATE
    RATE --> W1
    RATE --> W2
    RATE --> WN
    W1 --> VZ
    W2 --> VZ
    WN --> VZ

    style VZ fill:#f9f,stroke:#333,stroke-width:4px
    style Q fill:#ff9,stroke:#333,stroke-width:3px
    style RATE fill:#faa,stroke:#333,stroke-width:2px
```

2.4 Key Stakeholders

| Stakeholder | Role | Concerns |
| --- | --- | --- |
| Platform Users | End consumers | Instant page loads, transparent staleness |
| Development Team | Builders | Maintainable queue system, testable logic |
| DevOps Team | Operators | Monitorable queue depth, worker health |
| Product Management | Business | Feature availability, API cost control |
| Virtuozzo Integration | External dependency | Rate limit compliance, error handling |

3. Architectural Constraints

3.1 Technical Constraints

| ID | Constraint | Rationale |
| --- | --- | --- |
| TC-001 | Must use FastAPI with Python 3.12+ | Platform standard |
| TC-002 | Must use async/await for all I/O operations | Event loop non-blocking |
| TC-003 | Must use Pydantic v2 for validation | Type safety |
| TC-004 | Must use PostgreSQL via async SQLAlchemy | Primary cache store |
| TC-005 | Must use service names (postgres, rabbitmq) not localhost | Container networking |
| TC-006 | Must use structured logging (Structlog) with correlation IDs | Observability |
| TC-007 | Must propagate team context in all logs | Multi-tenancy |
| TC-008 | All Virtuozzo calls through infrastructure/external/virtuozzo | Centralized integration |
| TC-009 | Queue backend must be pluggable (RabbitMQ or Redis) | Flexibility |
| TC-010 | Rate limiter must use token bucket algorithm | Smooth throttling |

3.2 Business Constraints

| ID | Constraint | Rationale |
| --- | --- | --- |
| BC-001 | Maximum acceptable staleness: 10 minutes (TTL) | User experience |
| BC-002 | Maximum staleness before warning: 1 hour | Transparency |
| BC-003 | Must support 100K concurrent users with 10K active | Scalability |
| BC-004 | Must not exceed Virtuozzo API rate: 50 req/sec | API protection |
| BC-005 | UI must never block after first run | User experience |
| BC-006 | First run may block (unavoidable initial fetch) | Initial setup |

3.3 Quality Constraints

| ID | Constraint | Standard |
| --- | --- | --- |
| QC-001 | UI response time (p95): < 100ms | Performance |
| QC-002 | API response time (p95): < 50ms for cached data | Performance |
| QC-003 | Background sync processing: < 30 seconds | Queue latency |
| QC-004 | Code coverage: 85% overall, 95% core | Quality |
| QC-005 | Sad path test ratio: 3:1 (failure:success) | Test coverage |
| QC-006 | Never corrupt DB with error data | Data integrity |

4. Functional Requirements

4.1 Core Stale-While-Revalidate Requirements

| ID | Requirement | Acceptance Criteria |
| --- | --- | --- |
| FR-001 | Return cached data immediately (never block after first run) | Response time < 100ms p95 |
| FR-002 | Check data freshness based on last_synced_at timestamp | TTL comparison with DB clock |
| FR-003 | Return stale data if > TTL, enqueue background refresh | Stale data returned + task created |
| FR-004 | First run (no data) blocks for initial fetch | User sees loading spinner |
| FR-005 | Background worker processes queue asynchronously | Non-blocking to user |
| FR-006 | Rate limiter caps Virtuozzo API calls at 50 req/sec | Token bucket algorithm |
| FR-007 | Worker validates Virtuozzo response before DB write | Schema validation required |
| FR-008 | Worker computes hash for change detection | SHA-256 of environment list |
| FR-009 | Worker only updates DB if hash changed | Prevents unnecessary writes |
| FR-010 | Worker preserves stale data on Virtuozzo errors | No DB overwrite on timeout/error |

4.2 Queue System Requirements

| ID | Requirement | Details |
| --- | --- | --- |
| FR-020 | Enqueue sync task when data is stale | Task includes team_id, triggered_by |
| FR-021 | Worker dequeues tasks and processes sequentially | Round-robin distribution |
| FR-022 | Rate limiter acquires token before Virtuozzo call | Token bucket algorithm |
| FR-023 | Worker handles task failures gracefully | Exponential backoff retry |
| FR-024 | Queue depth exposed as metric | Monitoring visibility |
| FR-025 | Task processing timeout: 60 seconds | Prevents hanging tasks |

4.3 Data Integrity Requirements

| ID | Requirement | Details |
| --- | --- | --- |
| FR-030 | Validate Virtuozzo response structure before DB write | Required fields check |
| FR-031 | Compute SHA-256 hash of environment list | Change detection |
| FR-032 | Compare new hash with old hash | Skip update if unchanged |
| FR-033 | Use database transactions for upsert | Atomic all-or-nothing |
| FR-034 | Rollback transaction on worker crash | Partial writes prevented |
| FR-035 | Never overwrite DB with error data | Stale data protected |

4.4 API Requirements

| ID | Requirement | Details |
| --- | --- | --- |
| FR-040 | GET /environments returns data immediately | < 100ms p95 after first run |
| FR-041 | Response includes meta object with sync status | is_stale, last_synced_at, age_seconds |
| FR-042 | Response headers include staleness indicators | X-Data-Stale, X-Data-Last-Sync |
| FR-043 | Manual refresh endpoint available | POST /environments/sync (optional) |
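
The `meta` object and staleness headers from FR-041 and FR-042 could be derived from a single timestamp, as in the sketch below. The helper name `build_sync_meta` is hypothetical, and the header value formats (`"true"`/`"false"` and an ISO 8601 timestamp) are assumptions; the requirements name the fields and headers but not their encoding.

```python
from datetime import datetime, timezone


def build_sync_meta(last_synced_at: datetime, now: datetime,
                    ttl_seconds: int = 600) -> tuple[dict, dict]:
    """Return (meta, headers) describing how stale the cached data is."""
    age_seconds = max(0, int((now - last_synced_at).total_seconds()))
    is_stale = age_seconds >= ttl_seconds
    synced_iso = last_synced_at.isoformat()
    meta = {
        "is_stale": is_stale,
        "last_synced_at": synced_iso,
        "age_seconds": age_seconds,
    }
    headers = {
        # Assumed encodings; only the header names come from FR-042.
        "X-Data-Stale": "true" if is_stale else "false",
        "X-Data-Last-Sync": synced_iso,
    }
    return meta, headers


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(build_sync_meta(now, now))
```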

4.5 Observability Requirements

| ID | Requirement | Details |
| --- | --- | --- |
| FR-050 | Log all sync operations with team context | Structured JSON logs |
| FR-051 | Emit metrics for cache hit/miss | Counter: env_requests_total |
| FR-052 | Emit metrics for sync duration | Histogram: env_sync_duration_seconds |
| FR-053 | Emit metrics for queue depth | Gauge: env_sync_queue_depth |
| FR-054 | Emit metrics for Virtuozzo API calls | Counter: virtuozzo_api_requests_total |
| FR-055 | Alert on high stale rate (> 50%) | Operational awareness |
| FR-056 | Alert on queue backlog (> 1000) | Scaling trigger |

5. Quality Attribute Requirements

5.1 Performance Requirements

| ID | Requirement | Metric | Target | Priority |
| --- | --- | --- | --- | --- |
| PERF-001 | UI Response Time (cached data) | p95 | < 100ms | Critical |
| PERF-002 | API Response Time (fresh data) | p95 | < 50ms | Critical |
| PERF-003 | First Run Blocking Time | p95 | < 5000ms | High |
| PERF-004 | Queue Processing Latency | p95 | < 30 seconds | Medium |
| PERF-005 | Virtuozzo API Rate | per second | ≤ 50 | Critical |
| PERF-006 | Cache Hit Rate | percentage | > 90% | High |

5.2 Scalability Requirements

| ID | Requirement | Current | Target | Growth Rate |
| --- | --- | --- | --- | --- |
| SCAL-001 | Concurrent Users | 0 | 100K | Linear |
| SCAL-002 | Active Users (triggering refresh) | 0 | 10K | 10% of total |
| SCAL-003 | Environments per Team | 0 | 50 | 10% quarterly |
| SCAL-004 | Teams per Instance | 0 | 1000 | 20% quarterly |
| SCAL-005 | Queue Processing Capacity | 0 | 50 req/sec | Configurable |
| SCAL-006 | Worker Pool Size | 0 | 10 workers | Horizontal scaling |

Scalability Design:

| Operation | Complexity | Notes |
| --- | --- | --- |
| User login → Response | O(1) | Single DB read |
| Stale data → Queue enqueue | O(1) | Async enqueue |
| Queue processing | O(active users) | NOT O(total users) |
| Virtuozzo API calls | O(active users), capped at rate_limit per second | Bounded by rate limiter |
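
As a rough capacity check on the targets above, the sketch below compares refresh demand with the 50 req/sec cap. It assumes, beyond what the requirements state, that each active user triggers at most one refresh per TTL window and that each refresh costs one Virtuozzo call. The function names are illustrative.

```python
def steady_state_rate(active_users: int, ttl_seconds: int) -> float:
    """Average refreshes per second if each active user refreshes once per TTL."""
    return active_users / ttl_seconds


def drain_seconds(pending_tasks: int, rate_limit: float) -> float:
    """Time to work through a backlog when the rate limiter is the bottleneck."""
    return pending_tasks / rate_limit


if __name__ == "__main__":
    demand = steady_state_rate(active_users=10_000, ttl_seconds=600)
    backlog = drain_seconds(pending_tasks=10_000, rate_limit=50)
    print(f"steady-state demand: {demand:.1f} req/sec")
    print(f"full backlog drain: {backlog:.0f} s")
```

Under these assumptions, steady-state demand is about 16.7 req/sec, below the cap, while a simultaneous backlog of 10K tasks would take about 200 seconds to drain at 50 req/sec.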

5.3 Availability Requirements

| ID | Requirement | Target | Notes |
| --- | --- | --- | --- |
| AVAIL-001 | System Availability | 99.5% | Degrades gracefully |
| AVAIL-002 | Serve Stale Data on Virtuozzo Failure | Always | Fallback behavior |
| AVAIL-003 | Queue Processing Availability | 99% | Workers restart automatically |
| AVAIL-004 | Database Availability | 99.9% | Primary cache store |
| AVAIL-005 | RTO (Recovery Time) | < 5 minutes | Worker restart |
| AVAIL-006 | RPO (Recovery Point) | < 1 minute | Lost queue items |

5.4 Security Requirements

| ID | Requirement | Category | Details |
| --- | --- | --- | --- |
| SEC-001 | Authentication | Identity | JWT with team_id |
| SEC-002 | Authorization | Access Control | RBAC team-based |
| SEC-003 | Session Key Protection | Data | Encrypted at rest |
| SEC-004 | Input Validation | OWASP | Pydantic on all endpoints |
| SEC-005 | Rate Limiting | DoS Prevention | Queue-based throttling |
| SEC-006 | Audit Logging | Compliance | All sync ops logged |
| SEC-007 | PII Filtering | Privacy | No session keys in logs |

5.5 Reliability Requirements

| ID | Requirement | Target | Notes |
| --- | --- | --- | --- |
| REL-001 | Sync Success Rate (Virtuozzo healthy) | > 99% | Excludes outages |
| REL-002 | Queue Processing Success Rate | > 99% | With retries |
| REL-003 | Data Integrity | 100% | No corruption from errors |
| REL-004 | MTBF (Mean Time Between Failures) | > 720 hours | Worker stability |
| REL-005 | MTTR (Mean Time to Recovery) | < 5 minutes | Auto-restart |

5.6 Maintainability Requirements

| ID | Requirement | Details |
| --- | --- | --- |
| MAINT-001 | Code Coverage | 85% overall, 95% core |
| MAINT-002 | Test Ratio | 3 sad path : 1 success |
| MAINT-003 | Documentation | OpenAPI specs + runbooks |
| MAINT-004 | Zero-Downtime Deployment | Required |
| MAINT-005 | Configuration | Environment variables |
| MAINT-006 | Observability | Structured logs + metrics |

6. Integration Requirements

6.1 Internal System Integrations

| System | Integration Type | Protocol | Data Format | SLA |
| --- | --- | --- | --- | --- |
| PostgreSQL Database | Async Read/Write | SQLAlchemy async | SQL | < 50ms p95 |
| Refresh Queue | Async Enqueue | AMQP/Redis | JSON | < 20ms p95 |
| SSE Broker | Async Publish | aio-pika | JSON | < 50ms p95 |

6.2 External System Integrations

| System | Provider | Integration Type | Authentication | Rate Limits |
| --- | --- | --- | --- | --- |
| Virtuozzo getenvs | Virtuozzo | REST via VirtuozzoClient | Session Key | 50 req/sec (our limit) |

Virtuozzo API Endpoint:

GET /environment/control/rest/getenvs

  • Method: VirtuozzoClient.get_envs(session: str, lazy: bool=True)
  • Location: backend/app/infrastructure/external/virtuozzo/client.py:99-106
  • Timeout: 5 seconds (worker)
  • Retry: Exponential backoff (1s, 2s, 4s)
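
The timeout and backoff policy above could be wrapped around any async call, as in this minimal sketch. `call_with_retry` and its parameters are hypothetical names. Two further assumptions: only timeouts and connection errors are retried, and the delays (1s, 2s, 4s) mean one initial attempt followed by up to three retries.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import TypeVar

T = TypeVar("T")


async def call_with_retry(
    call: Callable[[], Awaitable[T]],
    *,
    timeout_seconds: float = 5.0,
    backoff_seconds: tuple[float, ...] = (1.0, 2.0, 4.0),
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Run `call` with a per-attempt timeout, retrying with fixed delays."""
    attempt = 0
    while True:
        try:
            return await asyncio.wait_for(call(), timeout=timeout_seconds)
        except (asyncio.TimeoutError, ConnectionError):
            # Out of retries: re-raise so the caller can preserve stale data.
            if attempt >= len(backoff_seconds):
                raise
            await sleep(backoff_seconds[attempt])
            attempt += 1


if __name__ == "__main__":
    async def fake_fetch() -> list[str]:
        return ["env-a"]

    print(asyncio.run(call_with_retry(fake_fetch)))
```

In a worker, `call` would presumably wrap `VirtuozzoClient.get_envs`. The `sleep` parameter exists so tests can run without real delays.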

6.3 Queue System Requirements

| ID | Requirement | Details |
| --- | --- | --- |
| Q-001 | Queue Backend | RabbitMQ or Redis (pluggable) |
| Q-002 | Message Format | JSON with task_id, team_id, triggered_by |
| Q-003 | Priority Levels | normal, high (for manual refresh) |
| Q-004 | Retry Policy | Exponential backoff, max 3 attempts |
| Q-005 | Dead Letter Queue | For failed tasks after retries |
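
One possible shape for the task message in Q-002 and Q-003 is sketched below with a standard-library dataclass. The class name `SyncTask`, the integer type of `team_id`, and the `attempt` counter are assumptions; the requirements only name `task_id`, `team_id`, `triggered_by`, and the two priority levels.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field

PRIORITIES = {"normal", "high"}  # Q-003


@dataclass(frozen=True)
class SyncTask:
    team_id: int                 # assumed integer key
    triggered_by: str            # e.g. "stale_read" or "manual" (illustrative values)
    priority: str = "normal"
    task_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    attempt: int = 0             # hypothetical retry counter

    def __post_init__(self) -> None:
        if self.priority not in PRIORITIES:
            raise ValueError(f"unknown priority: {self.priority!r}")

    def to_json(self) -> str:
        """Serialize to the JSON body placed on the queue."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "SyncTask":
        """Parse a queue message; unknown priorities are rejected."""
        return cls(**json.loads(raw))


if __name__ == "__main__":
    print(SyncTask(team_id=42, triggered_by="stale_read").to_json())
```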

6.4 Rate Limiter Requirements

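| ID | Requirement | Details |
| --- | --- | --- |
| RL-001 | Algorithm | Token bucket |
| RL-002 | Rate Limit | 50 requests per second (configurable) |
| RL-003 | Per-Worker | Each worker has its own limiter; the per-worker rates must sum to no more than the global limit (BC-004) |
| RL-004 | Burst Capacity | 50 tokens (allow initial burst) |
| RL-005 | Blocking Behavior | Block until token available |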
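
A minimal asyncio token bucket matching RL-001 through RL-005 might look like the sketch below. The class and method names are hypothetical, and the injectable `clock` is an assumption added for testability. If each worker holds its own bucket, the per-worker `rate` would need to be the global limit divided by the worker count for the total to stay within 50 req/sec.

```python
import asyncio
import time
from collections.abc import Callable


class TokenBucket:
    def __init__(self, rate: float = 50.0, capacity: float = 50.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate            # tokens added per second (RL-002)
        self.capacity = capacity    # maximum burst size (RL-004)
        self._clock = clock
        self._tokens = capacity     # start full to allow the initial burst
        self._updated = clock()
        self._lock = asyncio.Lock()

    def _refill(self) -> None:
        now = self._clock()
        elapsed = now - self._updated
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._updated = now

    def try_acquire(self) -> bool:
        """Take one token if available, without waiting."""
        self._refill()
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False

    async def acquire(self) -> None:
        """RL-005: wait until a token is available, then take it."""
        async with self._lock:
            while not self.try_acquire():
                # Sleep roughly until the next whole token has accrued.
                await asyncio.sleep((1 - self._tokens) / self.rate)


if __name__ == "__main__":
    async def main() -> None:
        bucket = TokenBucket(rate=50, capacity=50)
        for _ in range(60):  # the last 10 calls wait for refill
            await bucket.acquire()
        print("acquired 60 tokens")

    asyncio.run(main())
```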

7. Data Requirements

7.1 Data Storage Requirements

| Data Type | Volume | Retention | Storage | Encryption |
| --- | --- | --- | --- | --- |
| Environment Metadata | ~1KB/env | Until deleted | PostgreSQL | Yes |
| Session Keys | ~500 bytes | 24 hours | PostgreSQL | Yes |
| Queue Tasks | ~200 bytes | 24 hours | RabbitMQ/Redis | Yes |
| Sync Metadata | ~100 bytes | 90 days | PostgreSQL | Yes |

7.2 Database Schema Requirements

environments table additions:

```sql
ALTER TABLE environments
ADD COLUMN last_synced_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
ADD COLUMN sync_status VARCHAR(20) DEFAULT 'ok',
ADD COLUMN sync_error_message TEXT,
ADD COLUMN last_sync_duration_ms INTEGER;

CREATE INDEX idx_environments_team_sync
ON environments(team_id, last_synced_at);
```

teams table additions:

```sql
ALTER TABLE teams
ADD COLUMN last_env_sync_at TIMESTAMP WITH TIME ZONE,
ADD COLUMN last_env_sync_hash VARCHAR(64),
ADD COLUMN last_sync_checked_at TIMESTAMP WITH TIME ZONE,
ADD COLUMN last_sync_duration_ms INTEGER;
```

7.3 Multi-Tenancy Requirements

| ID | Requirement | Details |
| --- | --- | --- |
| MT-001 | Data Isolation | All data scoped to team_id |
| MT-002 | Context Propagation | Team context in middleware |
| MT-003 | RBAC Enforcement | Team-based permissions |
| MT-004 | Queue Isolation | Tasks scoped to team_id |

8. Sad Path Requirements

8.1 Failure Mode Handling

| Failure Mode | Expected Behavior | Requirement ID |
| --- | --- | --- |
| Virtuozzo API timeout | Preserve stale data, log error | SR-001 |
| Virtuozzo API 500 | Preserve stale data, retry | SR-002 |
| Virtuozzo returns invalid JSON | Preserve stale data, log | SR-003 |
| Database connection lost | Return 503, log error | SR-004 |
| Queue backend down | Return stale data, log error | SR-005 |
| Worker crash mid-sync | Transaction rollback | SR-006 |
| Session key expired | Re-fetch, retry sync | SR-007 |
| Queue full | Return 503, retry later | SR-008 |
| Rate limit exceeded | Block worker until token | SR-009 |

8.2 Data Protection Requirements

| ID | Requirement | Details |
| --- | --- | --- |
| DP-001 | Validate before write | Schema validation required |
| DP-002 | Hash comparison | Skip update if unchanged |
| DP-003 | Atomic transactions | All-or-nothing |
| DP-004 | Rollback on error | Never commit partial data |
| DP-005 | Stale data preservation | Never overwrite with errors |
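
The validate-before-write and stale-preservation rules (DP-001, DP-005) can be illustrated with the sketch below. TC-003 calls for Pydantic v2, so the plain-Python checks here are only a dependency-free stand-in. The required field names, the `"error"` status value, and all function names are hypothetical.

```python
from typing import Any

REQUIRED_FIELDS = ("env_name", "status")  # hypothetical required fields


class InvalidPayload(ValueError):
    """Raised when a Virtuozzo response fails structural validation."""


def validate_environments(payload: Any) -> list[dict[str, Any]]:
    """Return the payload unchanged if it is a list of well-formed dicts."""
    if not isinstance(payload, list):
        raise InvalidPayload("expected a list of environments")
    for index, env in enumerate(payload):
        if not isinstance(env, dict):
            raise InvalidPayload(f"item {index} is not an object")
        missing = [name for name in REQUIRED_FIELDS if name not in env]
        if missing:
            raise InvalidPayload(f"item {index} is missing {missing}")
    return payload


def next_cache_state(current: list[dict[str, Any]],
                     fetched: Any) -> tuple[list[dict[str, Any]], str]:
    """Return (rows to keep, sync_status); bad payloads never replace rows."""
    try:
        return validate_environments(fetched), "ok"
    except InvalidPayload:
        return current, "error"


if __name__ == "__main__":
    cached = [{"env_name": "prod", "status": 1}]
    print(next_cache_state(cached, {"error": "session expired"}))
```

The point of the design is that the error branch returns the existing rows untouched, so a failed sync can change `sync_status` but never the cached environments.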

9. Observability Requirements

9.1 Metrics

| ID | Metric | Type | Labels |
| --- | --- | --- | --- |
| M-001 | env_requests_total | Counter | team_id, status (fresh/stale/first_run) |
| M-002 | env_stale_served_total | Counter | team_id, staleness_bucket |
| M-003 | env_sync_duration_seconds | Histogram | team_id, result |
| M-004 | env_sync_queue_depth | Gauge | priority |
| M-005 | virtuozzo_api_requests_total | Counter | endpoint, status |
| M-006 | virtuozzo_api_duration_seconds | Histogram | endpoint |
| M-007 | worker_processing_duration_seconds | Histogram | worker_id |

9.2 Structured Logging

| Event | Level | Fields |
| --- | --- | --- |
| env_request | INFO | team_id, has_data, is_stale, age_seconds |
| env_first_run | INFO | team_id |
| env_stale_enqueueing | INFO | team_id, age_seconds, staleness_bucket |
| worker_sync_started | INFO | team_id, task_id, triggered_by |
| worker_sync_complete | INFO | team_id, duration_ms, env_count, hash |
| worker_sync_failed | ERROR | team_id, error_type, error_message |
| worker_virtuozzo_timeout | ERROR | team_id, timeout_seconds |

9.3 Alerting

| ID | Alert | Condition | Severity | Action |
| --- | --- | --- | --- | --- |
| A-001 | High Stale Rate | stale_served / total > 0.5 | Warning | Check workers |
| A-002 | Queue Backlog | queue_depth > 1000 | Warning | Scale workers |
| A-003 | API Error Rate | api_errors / api_calls > 0.1 | Critical | Check Virtuozzo |
| A-004 | Data Age | data_age_p95 > 1800 seconds (30 min) | Warning | Review TTL |
| A-005 | Worker Down | Time since last worker heartbeat > 60s | Critical | Restart |

10. Testing Requirements

10.1 Unit Tests

  • Fresh data returns immediately
  • Stale data returns + enqueues task
  • First run blocks for fetch
  • Hash computation
  • Schema validation

10.2 Sad Path Tests (3:1 Ratio)

| Test | Scenario | Expected |
| --- | --- | --- |
| test_timeout_preserves_stale | Virtuozzo timeout | Stale data preserved |
| test_500_preserves_stale | Virtuozzo 500 | Stale data preserved |
| test_invalid_payload_rejected | Invalid JSON | Stale data preserved |
| test_worker_crash_rollback | Worker crash | Transaction rollback |
| test_database_connection_lost | DB down | 503 returned |

10.3 Integration Tests

| Test | Scenario | Expected |
| --- | --- | --- |
| test_e2e_stale_while_revalidate | Full flow | Stale returned, then fresh |

10.4 Coverage Requirements

| Component | Minimum Coverage |
| --- | --- |
| Overall | 85% |
| Sync Service | 95% |
| Worker | 90% |
| Rate Limiter | 90% |

11. Configuration

11.1 Environment Variables

```bash
# Environment Sync
VZ_ENV_TTL_SECONDS=600              # 10 minutes
VZ_ENV_MAX_STALE_SECONDS=3600       # 1 hour
VZ_SYNC_QUEUE_RATE_LIMIT=50         # 50 req/sec
VZ_SYNC_TIMEOUT_SECONDS=5           # Virtuozzo timeout
VZ_SYNC_RETRY_ATTEMPTS=3            # Max retries
VZ_SYNC_WORKER_CONCURRENCY=10       # Worker count
VZ_SYNC_ENABLE_QUEUE=true           # Enable queue
```
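
These variables could be loaded into a typed settings object at startup, as in the sketch below. `SyncSettings` and `from_env` are hypothetical names, the defaults mirror the values listed above, and the accepted boolean spellings and the sanity check on the two staleness thresholds are assumptions.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncSettings:
    ttl_seconds: int = 600
    max_stale_seconds: int = 3600
    queue_rate_limit: int = 50
    timeout_seconds: int = 5
    retry_attempts: int = 3
    worker_concurrency: int = 10
    enable_queue: bool = True

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "SyncSettings":
        def as_int(name: str, default: int) -> int:
            return int(env.get(name, default))

        # Assumed truthy spellings; anything else disables the queue.
        raw_flag = env.get("VZ_SYNC_ENABLE_QUEUE", "true").strip().lower()
        settings = cls(
            ttl_seconds=as_int("VZ_ENV_TTL_SECONDS", 600),
            max_stale_seconds=as_int("VZ_ENV_MAX_STALE_SECONDS", 3600),
            queue_rate_limit=as_int("VZ_SYNC_QUEUE_RATE_LIMIT", 50),
            timeout_seconds=as_int("VZ_SYNC_TIMEOUT_SECONDS", 5),
            retry_attempts=as_int("VZ_SYNC_RETRY_ATTEMPTS", 3),
            worker_concurrency=as_int("VZ_SYNC_WORKER_CONCURRENCY", 10),
            enable_queue=raw_flag in {"1", "true", "yes"},
        )
        # Assumed sanity check: the warning threshold should not precede the TTL.
        if settings.max_stale_seconds < settings.ttl_seconds:
            raise ValueError("VZ_ENV_MAX_STALE_SECONDS must be >= VZ_ENV_TTL_SECONDS")
        return settings


if __name__ == "__main__":
    print(SyncSettings.from_env())
```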

11.2 Feature Flags

| Flag | Default | Description |
| --- | --- | --- |
| vz_sync_enable_queue | true | Enable queue-based sync |
| vz_sync_worker_enabled | true | Enable background workers |
| vz_sync_stale_warning_enabled | true | Show staleness warning in UI |

12. Deployment Checklist

| Phase | Task | Status |
| --- | --- | --- |
| Infrastructure | Set up RabbitMQ/Redis queue | |
| Infrastructure | Configure workers | |
| Database | Run migration | |
| Backend | Implement sync service | |
| Backend | Implement workers | |
| Backend | Update router | |
| Testing | Unit tests (85%) | |
| Testing | Sad path tests (3:1) | |
| Observability | Configure metrics | |
| Documentation | Update runbooks | |

13. Revision History

| Version | Date | Author | Changes |
| --- | --- | --- | --- |
| 1.0 | 2026-01-09 | MBPanel Team | Initial requirements (lazy sync + cron) |
| 2.0 | 2026-01-22 | MBPanel Team | Major refactor: Stale-While-Revalidate + Queue-based architecture |

14. Approvals

| Role | Name | Signature | Date |
| --- | --- | --- | --- |
| Product Manager | | | |
| Engineering Lead | | | |
| Architecture Lead | | | |
| DevOps Lead | | | |