Architecture Requirements Document (ARD)
Virtuozzo Environment Synchronization - Stale-While-Revalidate Requirements
1. Introduction
1.1 Purpose
This document defines the architectural requirements for implementing the Stale-While-Revalidate pattern for Virtuozzo environment synchronization. This is a distributed systems optimization problem involving caching strategies, consistency vs. availability trade-offs, event-driven queue processing, and scalable background workers.
The system prioritizes availability and user experience over strong consistency, serving cached data immediately while asynchronously refreshing in the background.
1.2 Architectural Shift (v2.0)
Key Changes from v1.0:
| Aspect | v1.0 (Previous) | v2.0 (Current) |
| --- | --- | --- |
| Refresh Strategy | Lazy sync + Celery cron | Queue-based event-driven |
| User Blocking | Could block on sync | NEVER blocks (except first run) |
| Cache Storage | Redis hot + PostgreSQL warm | PostgreSQL only (simpler) |
| Scalability | O(N) global cron iteration | O(active users) queue processing |
| API Protection | Distributed locking only | Rate-limited worker pool |
| Data Integrity | Direct overwrite | Hash-based change detection |
1.3 Core Principles
- Always Return Immediately: User requests NEVER wait for the Virtuozzo API (except first run)
- Stale-While-Revalidate: Return stale data, refresh asynchronously
- Queue-Based Processing: Event-driven, not cron-based
- Active-User Priority: Only logged-in users trigger refresh
- Graceful Degradation: Serve stale with warning on errors
- Data Protection: Validate payload before DB write, never corrupt with error data
1.4 Scope
In Scope:
- Stale-While-Revalidate pattern implementation
- Queue-based background refresh (RabbitMQ/Redis)
- Rate-limited worker pool for Virtuozzo API calls
- PostgreSQL-only caching (survives restarts)
- Hash-based change detection
- Graceful error handling (preserve stale data)
- Observability (metrics, logging, tracing)
Out of Scope:
- Virtuozzo authentication (existing app/modules/sessions/)
- Environment provisioning (existing app/domains/environments/tasks.py)
- Real-time WebSocket sync (SSE for notifications only)
- Redis hot cache (removed for simplicity)
2. System Overview
2.1 System Description
The Virtuozzo Environment Synchronization system implements the Stale-While-Revalidate pattern:
REQUEST → Check DB → Return Data (ALWAYS)
    │
    ├─ Fresh (< TTL) → Done
    └─ Stale (> TTL) → Return stale + enqueue refresh
BACKGROUND WORKER (Independent)
    → Process queue at fixed rate
    → Call Virtuozzo API (rate-limited)
    → Validate payload
    → Update DB if changed
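The request-path decision above can be sketched as a small pure function. This is a minimal illustration, not the service implementation: the names `Freshness` and `classify` are hypothetical, and treating an age exactly equal to the TTL as fresh is an assumption, since the flow only specifies `< TTL` and `> TTL`.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional

TTL = timedelta(seconds=600)  # 10-minute TTL from BC-001


class Freshness(str, Enum):
    FIRST_RUN = "first_run"  # no cached data yet: the only case that blocks
    FRESH = "fresh"          # within TTL: return cached data, nothing else
    STALE = "stale"          # beyond TTL: return cached data and enqueue a refresh


def classify(last_synced_at: Optional[datetime],
             now: datetime,
             ttl: timedelta = TTL) -> Freshness:
    """Decide how a request should be served from the cached row."""
    if last_synced_at is None:
        return Freshness.FIRST_RUN
    # Assumption: age == TTL still counts as fresh.
    if now - last_synced_at <= ttl:
        return Freshness.FRESH
    return Freshness.STALE


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(classify(now - timedelta(minutes=11), now).value)
```

The three values line up with the `status` label listed for `env_requests_total` (fresh/stale/first_run), so the same classification could feed the metric directly.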
2.2 Architecture Type
Domain (Full DDD with state machines, invariants, and orchestration)
Rationale: Environment sync involves:
- State management (fresh → stale → syncing → fresh/error)
- Business invariants (max staleness, TTL thresholds)
- Orchestration of multiple systems (Virtuozzo, Queue, PostgreSQL, SSE)
- Complex failure modes requiring explicit handling
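The state management item above names the states fresh → stale → syncing → fresh/error. One way to make those transitions explicit is a lookup table of allowed moves. The sketch below is illustrative only: `SyncState`, `ALLOWED`, and `transition` are hypothetical names, and the `error → syncing` edge (retrying after a failure) is an assumption that the list does not state.

```python
from enum import Enum


class SyncState(str, Enum):
    FRESH = "fresh"
    STALE = "stale"
    SYNCING = "syncing"
    ERROR = "error"


# Allowed transitions, taken from "fresh → stale → syncing → fresh/error".
ALLOWED = {
    SyncState.FRESH: {SyncState.STALE},
    SyncState.STALE: {SyncState.SYNCING},
    SyncState.SYNCING: {SyncState.FRESH, SyncState.ERROR},
    SyncState.ERROR: {SyncState.SYNCING},  # assumption: failed syncs are retried
}


def transition(current: SyncState, target: SyncState) -> SyncState:
    """Return the new state, or raise if the move violates the state machine."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```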
2.3 Context Diagram
graph TB
    subgraph "MBPanel Platform"
        subgraph "API Layer"
            API[FastAPI /environments]
        end
        subgraph "Sync Service"
            SYNC[EnvironmentSyncService<br/>stale-while-revalidate]
        end
        subgraph "Infrastructure"
            DB[(PostgreSQL<br/>primary cache)]
            Q[Refresh Queue<br/>RabbitMQ/Redis]
        end
        subgraph "Worker Pool"
            W1[Worker 1]
            W2[Worker 2]
            WN[Worker N]
            RATE[Rate Limiter<br/>50 req/sec]
        end
    end
    subgraph "External"
        VZ[Virtuozzo API<br/>source of truth]
    end
    API --> SYNC
    SYNC --> DB
    SYNC -->|enqueue stale| Q
    Q --> RATE
    RATE --> W1
    RATE --> W2
    RATE --> WN
    W1 --> VZ
    W2 --> VZ
    WN --> VZ
    style VZ fill:#f9f,stroke:#333,stroke-width:4px
    style Q fill:#ff9,stroke:#333,stroke-width:3px
    style RATE fill:#faa,stroke:#333,stroke-width:2px
2.4 Key Stakeholders
| Stakeholder | Role | Concerns |
| --- | --- | --- |
| Platform Users | End consumers | Instant page loads, transparent staleness |
| Development Team | Builders | Maintainable queue system, testable logic |
| DevOps Team | Operators | Monitorable queue depth, worker health |
| Product Management | Business | Feature availability, API cost control |
| Virtuozzo Integration | External dependency | Rate limit compliance, error handling |
3. Architectural Constraints
3.1 Technical Constraints
| ID | Constraint | Rationale |
| --- | --- | --- |
| TC-001 | Must use FastAPI with Python 3.12+ | Platform standard |
| TC-002 | Must use async/await for all I/O operations | Event loop non-blocking |
| TC-003 | Must use Pydantic v2 for validation | Type safety |
| TC-004 | Must use PostgreSQL via async SQLAlchemy | Primary cache store |
| TC-005 | Must use service names (postgres, rabbitmq) not localhost | Container networking |
| TC-006 | Must use structured logging (Structlog) with correlation IDs | Observability |
| TC-007 | Must propagate team context in all logs | Multi-tenancy |
| TC-008 | All Virtuozzo calls through infrastructure/external/virtuozzo | Centralized integration |
| TC-009 | Queue backend must be pluggable (RabbitMQ or Redis) | Flexibility |
| TC-010 | Rate limiter must use token bucket algorithm | Smooth throttling |
3.2 Business Constraints
| ID | Constraint | Rationale |
| --- | --- | --- |
| BC-001 | Maximum acceptable staleness: 10 minutes (TTL) | User experience |
| BC-002 | Maximum staleness before warning: 1 hour | Transparency |
| BC-003 | Must support 100K concurrent users with 10K active | Scalability |
| BC-004 | Must not exceed Virtuozzo API rate: 50 req/sec | API protection |
| BC-005 | UI must never block after first run | User experience |
| BC-006 | First run may block (unavoidable initial fetch) | Initial setup |
3.3 Quality Constraints
| ID | Constraint | Standard |
| --- | --- | --- |
| QC-001 | UI response time (p95): < 100ms | Performance |
| QC-002 | API response time (p95): < 50ms for cached data | Performance |
| QC-003 | Background sync processing: < 30 seconds | Queue latency |
| QC-004 | Code coverage: 85% overall, 95% core | Quality |
| QC-005 | Sad path test ratio: 3:1 (failure:success) | Test coverage |
| QC-006 | Never corrupt DB with error data | Data integrity |
4. Functional Requirements
4.1 Core Stale-While-Revalidate Requirements
| ID | Requirement | Acceptance Criteria |
| --- | --- | --- |
| FR-001 | Return cached data immediately (never block after first run) | Response time < 100ms p95 |
| FR-002 | Check data freshness based on last_synced_at timestamp | TTL comparison with DB clock |
| FR-003 | Return stale data if > TTL, enqueue background refresh | Stale data returned + task created |
| FR-004 | First run (no data) blocks for initial fetch | User sees loading spinner |
| FR-005 | Background worker processes queue asynchronously | Non-blocking to user |
| FR-006 | Rate-limiter caps Virtuozzo API calls at 50 req/sec | Token bucket algorithm |
| FR-007 | Worker validates Virtuozzo response before DB write | Schema validation required |
| FR-008 | Worker computes hash for change detection | SHA-256 of environment list |
| FR-009 | Worker only updates DB if hash changed | Prevents unnecessary writes |
| FR-010 | Worker preserves stale data on Virtuozzo errors | No DB overwrite on timeout/error |
4.2 Queue System Requirements
| ID | Requirement | Details |
| --- | --- | --- |
| FR-020 | Enqueue sync task when data is stale | Task includes team_id, triggered_by |
| FR-021 | Worker dequeues tasks and processes sequentially | Round-robin distribution |
| FR-022 | Rate limiter acquires token before Virtuozzo call | Token bucket algorithm |
| FR-023 | Worker handles task failures gracefully | Exponential backoff retry |
| FR-024 | Queue depth exposed as metric | Monitoring visibility |
| FR-025 | Task processing timeout: 60 seconds | Prevents hanging tasks |
4.3 Data Integrity Requirements
| ID | Requirement | Details |
| --- | --- | --- |
| FR-030 | Validate Virtuozzo response structure before DB write | Required fields check |
| FR-031 | Compute SHA-256 hash of environment list | Change detection |
| FR-032 | Compare new hash with old hash | Skip update if unchanged |
| FR-033 | Use database transactions for upsert | Atomic all-or-nothing |
| FR-034 | Rollback transaction on worker crash | Partial writes prevented |
| FR-035 | Never overwrite DB with error data | Stale data protected |
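FR-031 and FR-032 can be illustrated with a short sketch using the standard library. The serialization rules here are assumptions: the requirements only say "SHA-256 of environment list", so sorting the list, sorting keys, and the `env_name` sort field are hypothetical choices made so that reordered but identical data hashes the same.

```python
import hashlib
import json
from typing import Any, Optional


def compute_env_hash(environments: list[dict[str, Any]]) -> str:
    """SHA-256 over a canonical JSON form of the environment list."""
    # Assumption: "env_name" identifies an environment; sort so that list
    # order returned by the API does not affect the hash.
    ordered = sorted(environments, key=lambda env: str(env.get("env_name", "")))
    canonical = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def has_changed(new_environments: list[dict[str, Any]],
                old_hash: Optional[str]) -> bool:
    """True when the DB should be updated (FR-032: skip if unchanged)."""
    return compute_env_hash(new_environments) != old_hash
```

The hex digest is 64 characters long, which fits the `VARCHAR(64)` column proposed for `last_env_sync_hash`.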
4.4 API Requirements
| ID | Requirement | Details |
| --- | --- | --- |
| FR-040 | GET /environments returns data immediately | < 100ms p95 after first run |
| FR-041 | Response includes meta object with sync status | is_stale, last_synced_at, age_seconds |
| FR-042 | Response headers include staleness indicators | X-Data-Stale, X-Data-Last-Sync |
| FR-043 | Manual refresh endpoint available | POST /environments/sync (optional) |
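A minimal sketch of how the `meta` object (FR-041) and the staleness headers (FR-042) might be derived from the cached timestamp. The function name `build_sync_meta` is hypothetical, and the value formats (ISO 8601 timestamps, `"true"`/`"false"` header strings) are assumptions; the requirements name only the fields and headers.

```python
from datetime import datetime, timezone


def build_sync_meta(last_synced_at: datetime,
                    now: datetime,
                    ttl_seconds: int = 600) -> tuple[dict, dict]:
    """Return (meta, headers) describing how stale the cached data is."""
    age_seconds = max(0, int((now - last_synced_at).total_seconds()))
    is_stale = age_seconds > ttl_seconds
    stamp = last_synced_at.isoformat()
    meta = {
        "is_stale": is_stale,
        "last_synced_at": stamp,
        "age_seconds": age_seconds,
    }
    headers = {
        "X-Data-Stale": "true" if is_stale else "false",  # assumed format
        "X-Data-Last-Sync": stamp,                        # assumed format
    }
    return meta, headers
```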
4.5 Observability Requirements
| ID | Requirement | Details |
| --- | --- | --- |
| FR-050 | Log all sync operations with team context | Structured JSON logs |
| FR-051 | Emit metrics for cache hit/miss | Counter: env_requests_total |
| FR-052 | Emit metrics for sync duration | Histogram: env_sync_duration_seconds |
| FR-053 | Emit metrics for queue depth | Gauge: env_sync_queue_depth |
| FR-054 | Emit metrics for Virtuozzo API calls | Counter: virtuozzo_api_requests_total |
| FR-055 | Alert on high stale rate (> 50%) | Operational awareness |
| FR-056 | Alert on queue backlog (> 1000) | Scaling trigger |
5. Quality Attribute Requirements
5.1 Performance Requirements
| ID | Requirement | Metric | Target | Priority |
| --- | --- | --- | --- | --- |
| PERF-001 | UI Response Time (cached data) | p95 | < 100ms | Critical |
| PERF-002 | API Response Time (fresh data) | p95 | < 50ms | Critical |
| PERF-003 | First Run Blocking Time | p95 | < 5000ms | High |
| PERF-004 | Queue Processing Latency | p95 | < 30 seconds | Medium |
| PERF-005 | Virtuozzo API Rate | per second | ≤ 50 | Critical |
| PERF-006 | Cache Hit Rate | percentage | > 90% | High |
5.2 Scalability Requirements
| ID | Requirement | Current | Target | Growth Rate |
| --- | --- | --- | --- | --- |
| SCAL-001 | Concurrent Users | 0 | 100K | Linear |
| SCAL-002 | Active Users (triggering refresh) | 0 | 10K | 10% of total |
| SCAL-003 | Environments per Team | 0 | 50 | 10% quarterly |
| SCAL-004 | Teams per Instance | 0 | 1000 | 20% quarterly |
| SCAL-005 | Queue Processing Capacity | 0 | 50 req/sec | Configurable |
| SCAL-006 | Worker Pool Size | 0 | 10 workers | Horizontal scaling |
Scalability Design:
| Metric | O(Complexity) | Notes |
| --- | --- | --- |
| User login → Response | O(1) | Single DB read |
| Stale data → Queue enqueue | O(1) | Async enqueue |
| Queue processing | O(active users) | NOT O(total users) |
| Virtuozzo API calls | O(active users × rate_limit) | Bounded by rate limiter |
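As a rough sanity check of these targets, the arithmetic below compares refresh demand with the 50 req/sec budget. It is a back-of-the-envelope sketch under two stated assumptions that the requirements do not confirm: each refresh costs exactly one Virtuozzo call, and each of the 10K active users triggers at most one refresh per TTL window.

```python
ACTIVE_USERS = 10_000      # SCAL-002 target
RATE_LIMIT_PER_SEC = 50    # BC-004
TTL_SECONDS = 600          # BC-001


def steady_state_rate(active_users: int, ttl_seconds: int) -> float:
    """Average refreshes per second if every active user goes stale once per TTL."""
    return active_users / ttl_seconds


def drain_seconds(queued_tasks: int, rate_per_sec: int) -> float:
    """Time to empty a backlog when the rate limiter is the bottleneck."""
    return queued_tasks / rate_per_sec


if __name__ == "__main__":
    demand = steady_state_rate(ACTIVE_USERS, TTL_SECONDS)
    print(f"steady-state demand: {demand:.1f} req/sec")
    print(f"budget used: {demand / RATE_LIMIT_PER_SEC:.0%}")
    print(f"worst-case drain: {drain_seconds(ACTIVE_USERS, RATE_LIMIT_PER_SEC):.0f} s")
```

Under these assumptions the steady-state demand is about 16.7 req/sec, roughly a third of the budget. If all 10K active users went stale at the same moment, the backlog would take 200 seconds to drain, which is inside the 10-minute TTL but longer than the 30-second queue latency target, so a burst of that size would show up in the queue depth metric.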
5.3 Availability Requirements
| ID | Requirement | Target | Notes |
| --- | --- | --- | --- |
| AVAIL-001 | System Availability | 99.5% | Degrades gracefully |
| AVAIL-002 | Serve Stale Data on Virtuozzo Failure | Always | Fallback behavior |
| AVAIL-003 | Queue Processing Availability | 99% | Workers restart automatically |
| AVAIL-004 | Database Availability | 99.9% | Primary cache store |
| AVAIL-005 | RTO (Recovery Time) | < 5 minutes | Worker restart |
| AVAIL-006 | RPO (Recovery Point) | < 1 minute | Lost queue items |
5.4 Security Requirements
| ID | Requirement | Category | Details |
| --- | --- | --- | --- |
| SEC-001 | Authentication | Identity | JWT with team_id |
| SEC-002 | Authorization | Access Control | RBAC team-based |
| SEC-003 | Session Key Protection | Data | Encrypted at rest |
| SEC-004 | Input Validation | OWASP | Pydantic on all endpoints |
| SEC-005 | Rate Limiting | DoS Prevention | Queue-based throttling |
| SEC-006 | Audit Logging | Compliance | All sync ops logged |
| SEC-007 | PII Filtering | Privacy | No session keys in logs |
5.5 Reliability Requirements
| ID | Requirement | Target | Notes |
| --- | --- | --- | --- |
| REL-001 | Sync Success Rate (Virtuozzo healthy) | > 99% | Excludes outages |
| REL-002 | Queue Processing Success Rate | > 99% | With retries |
| REL-003 | Data Integrity | 100% | No corruption from errors |
| REL-004 | MTBF (Mean Time Between Failures) | > 720 hours | Worker stability |
| REL-005 | MTTR (Mean Time to Recovery) | < 5 minutes | Auto-restart |
5.6 Maintainability Requirements
| ID | Requirement | Details |
| --- | --- | --- |
| MAINT-001 | Code Coverage | 85% overall, 95% core |
| MAINT-002 | Test Ratio | 3 sad path : 1 success |
| MAINT-003 | Documentation | OpenAPI specs + runbooks |
| MAINT-004 | Zero-Downtime Deployment | Required |
| MAINT-005 | Configuration | Environment variables |
| MAINT-006 | Observability | Structured logs + metrics |
6. Integration Requirements
6.1 Internal System Integrations
| System | Integration Type | Protocol | Data Format | SLA |
| --- | --- | --- | --- | --- |
| PostgreSQL Database | Async Read/Write | SQLAlchemy async | SQL | < 50ms p95 |
| Refresh Queue | Async Enqueue | AMQP/Redis | JSON | < 20ms p95 |
| SSE Broker | Async Publish | aio-pika | JSON | < 50ms p95 |
6.2 External System Integrations
| System | Provider | Integration Type | Authentication | Rate Limits |
| --- | --- | --- | --- | --- |
| Virtuozzo getenvs | Virtuozzo | REST via VirtuozzoClient | Session Key | 50 req/sec (our limit) |
Virtuozzo API Endpoint:
GET /environment/control/rest/getenvs
- Method: VirtuozzoClient.get_envs(session: str, lazy: bool=True)
- Location: backend/app/infrastructure/external/virtuozzo/client.py:99-106
- Timeout: 5 seconds (worker)
- Retry: Exponential backoff (1s, 2s, 4s)
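The retry schedule above (1s, 2s, 4s) can be sketched as a small async helper. This is an illustration rather than the worker's actual code: `call_with_retry` is a hypothetical name, the set of retryable exceptions is an assumption, and so is the reading that "1s, 2s, 4s" means one initial call followed by up to three retries. The `sleep` parameter exists only so the delays can be observed without waiting.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

# Assumption: only timeouts and connection failures are worth retrying.
RETRYABLE = (TimeoutError, asyncio.TimeoutError, ConnectionError)


async def call_with_retry(
    operation: Callable[[], Awaitable[T]],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Run `operation`, retrying with delays of base_delay * 2**n seconds."""
    delay = base_delay
    for attempt in range(retries + 1):
        try:
            return await operation()
        except RETRYABLE:
            if attempt == retries:
                raise  # retries exhausted: let the caller preserve stale data
            await sleep(delay)
            delay *= 2
    raise RuntimeError("unreachable")  # loop always returns or raises
```

In a worker, `operation` would wrap the `VirtuozzoClient.get_envs` call with the 5-second timeout; a failure after the last retry would leave the cached rows untouched.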
6.3 Queue System Requirements
| ID | Requirement | Details |
| --- | --- | --- |
| Q-001 | Queue Backend | RabbitMQ or Redis (pluggable) |
| Q-002 | Message Format | JSON with task_id, team_id, triggered_by |
| Q-003 | Priority Levels | normal, high (for manual refresh) |
| Q-004 | Retry Policy | Exponential backoff, max 3 attempts |
| Q-005 | Dead Letter Queue | For failed tasks after retries |
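The message format in Q-002 and the priority levels in Q-003 could be modeled as follows. The class name `SyncTask`, the integer type of `team_id`, the UUID-based `task_id`, and carrying `priority` inside the message body are all assumptions made for this sketch; the requirements fix only the field names.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field

PRIORITIES = ("normal", "high")  # Q-003


@dataclass(frozen=True)
class SyncTask:
    team_id: int                 # assumption: numeric team identifier
    triggered_by: str            # e.g. "stale_read" or "manual" (hypothetical values)
    priority: str = "normal"
    task_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def __post_init__(self) -> None:
        if self.priority not in PRIORITIES:
            raise ValueError(f"unknown priority: {self.priority!r}")

    def to_json(self) -> str:
        """Serialize to the JSON body placed on the queue."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "SyncTask":
        """Rebuild a task on the worker side."""
        return cls(**json.loads(raw))
```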
6.4 Rate Limiter Requirements
| ID | Requirement | Details |
| --- | --- | --- |
| RL-001 | Algorithm | Token bucket |
| RL-002 | Rate Limit | 50 requests per second (configurable) |
| RL-003 | Per-Worker | Each worker has own limiter |
| RL-004 | Burst Capacity | 50 tokens (allow initial burst) |
| RL-005 | Blocking Behavior | Block until token available |
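A token bucket matching RL-001 through RL-005 can be sketched in a few lines. This is a single-process illustration, not the production limiter: the class and method names are hypothetical, and the injectable `clock` is there only to make the behavior deterministic in tests.

```python
import asyncio
import time
from typing import Callable


class TokenBucket:
    def __init__(self, rate: float = 50.0, capacity: float = 50.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate              # tokens added per second (RL-002)
        self.capacity = capacity      # maximum burst size (RL-004)
        self._clock = clock
        self._tokens = capacity       # start full to allow the initial burst
        self._last = clock()

    def _refill(self) -> None:
        now = self._clock()
        self._tokens = min(self.capacity,
                           self._tokens + (now - self._last) * self.rate)
        self._last = now

    def try_acquire(self) -> bool:
        """Take one token if available; never blocks."""
        self._refill()
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False

    def wait_time(self) -> float:
        """Seconds until the next token becomes available."""
        self._refill()
        return max(0.0, (1.0 - self._tokens) / self.rate)

    async def acquire(self) -> None:
        """Block (asynchronously) until a token is available (RL-005)."""
        while not self.try_acquire():
            await asyncio.sleep(self.wait_time())
```

One design point to settle: if each worker holds its own limiter as RL-003 states, each limiter's rate would presumably need to be the global budget divided by the worker count (for example 5 req/sec each for 10 workers) for the pool to stay within the 50 req/sec limit in BC-004.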
7. Data Requirements
7.1 Data Storage Requirements
| Data Type | Volume | Retention | Storage | Encryption |
| --- | --- | --- | --- | --- |
| Environment Metadata | ~1KB/env | Until deleted | PostgreSQL | Yes |
| Session Keys | ~500 bytes | 24 hours | PostgreSQL | Yes |
| Queue Tasks | ~200 bytes | 24 hours | RabbitMQ/Redis | Yes |
| Sync Metadata | ~100 bytes | 90 days | PostgreSQL | Yes |
7.2 Database Schema Requirements
environments table additions:
ALTER TABLE environments
    ADD COLUMN last_synced_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    ADD COLUMN sync_status VARCHAR(20) DEFAULT 'ok',
    ADD COLUMN sync_error_message TEXT,
    ADD COLUMN last_sync_duration_ms INTEGER;
CREATE INDEX idx_environments_team_sync
    ON environments(team_id, last_synced_at);
teams table additions:
ALTER TABLE teams
    ADD COLUMN last_env_sync_at TIMESTAMP WITH TIME ZONE,
    ADD COLUMN last_env_sync_hash VARCHAR(64),
    ADD COLUMN last_sync_checked_at TIMESTAMP WITH TIME ZONE,
    ADD COLUMN last_sync_duration_ms INTEGER;
7.3 Multi-Tenancy Requirements
| ID | Requirement | Details |
| --- | --- | --- |
| MT-001 | Data Isolation | All data scoped to team_id |
| MT-002 | Context Propagation | Team context in middleware |
| MT-003 | RBAC Enforcement | Team-based permissions |
| MT-004 | Queue Isolation | Tasks scoped to team_id |
8. Sad Path Requirements
8.1 Failure Mode Handling
| Failure Mode | Expected Behavior | Requirement ID |
| --- | --- | --- |
| Virtuozzo API timeout | Preserve stale data, log error | SR-001 |
| Virtuozzo API 500 | Preserve stale data, retry | SR-002 |
| Virtuozzo returns invalid JSON | Preserve stale data, log | SR-003 |
| Database connection lost | Return 503, log error | SR-004 |
| Queue backend down | Return stale data, log error | SR-005 |
| Worker crash mid-sync | Transaction rollback | SR-006 |
| Session key expired | Re-fetch, retry sync | SR-007 |
| Queue full | Return 503, retry later | SR-008 |
| Rate limit exceeded | Block worker until token | SR-009 |
8.2 Data Protection Requirements
| ID | Requirement | Details |
| --- | --- | --- |
| DP-001 | Validate before write | Schema validation required |
| DP-002 | Hash comparison | Skip update if unchanged |
| DP-003 | Atomic transactions | All-or-nothing |
| DP-004 | Rollback on error | Never commit partial data |
| DP-005 | Stale data preservation | Never overwrite with errors |
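DP-001 and DP-005 together mean a write happens only after the payload passes validation. The sketch below shows that ordering with a plain required-fields check and an in-memory dict standing in for the database. It uses only the standard library for brevity; the project itself mandates Pydantic v2 (TC-003), and the field names `env_name` and `status` are illustrative placeholders, not the actual Virtuozzo schema.

```python
from typing import Any

REQUIRED_FIELDS = ("env_name", "status")  # hypothetical required fields


class InvalidPayload(ValueError):
    """Raised when the upstream response does not match the expected shape."""


def validate_environments(payload: Any) -> list[dict[str, Any]]:
    """Return the payload unchanged if it is a list of well-formed environments."""
    if not isinstance(payload, list):
        raise InvalidPayload("expected a list of environments")
    for index, env in enumerate(payload):
        if not isinstance(env, dict):
            raise InvalidPayload(f"item {index} is not an object")
        missing = [name for name in REQUIRED_FIELDS if name not in env]
        if missing:
            raise InvalidPayload(f"item {index} is missing {missing}")
    return payload


def apply_sync(cache: dict[int, list[dict[str, Any]]],
               team_id: int, payload: Any) -> bool:
    """Write validated data; on invalid input leave the cached rows untouched."""
    try:
        environments = validate_environments(payload)
    except InvalidPayload:
        return False  # stale data preserved (DP-005)
    cache[team_id] = environments
    return True
```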
9. Observability Requirements
9.1 Metrics
| ID | Metric | Type | Labels |
| --- | --- | --- | --- |
| M-001 | env_requests_total | Counter | team_id, status (fresh/stale/first_run) |
| M-002 | env_stale_served_total | Counter | team_id, staleness_bucket |
| M-003 | env_sync_duration_seconds | Histogram | team_id, result |
| M-004 | env_sync_queue_depth | Gauge | priority |
| M-005 | virtuozzo_api_requests_total | Counter | endpoint, status |
| M-006 | virtuozzo_api_duration_seconds | Histogram | endpoint |
| M-007 | worker_processing_duration_seconds | Histogram | worker_id |
9.2 Structured Logging
| Event | Level | Fields |
| --- | --- | --- |
| env_request | INFO | team_id, has_data, is_stale, age_seconds |
| env_first_run | INFO | team_id |
| env_stale_enqueueing | INFO | team_id, age_seconds, staleness_bucket |
| worker_sync_started | INFO | team_id, task_id, triggered_by |
| worker_sync_complete | INFO | team_id, duration_ms, env_count, hash |
| worker_sync_failed | ERROR | team_id, error_type, error_message |
| worker_virtuozzo_timeout | ERROR | team_id, timeout_seconds |
9.3 Alerting
| ID | Alert | Condition | Severity | Action |
| --- | --- | --- | --- | --- |
| A-001 | High Stale Rate | stale_served / total > 0.5 | Warning | Check workers |
| A-002 | Queue Backlog | queue_depth > 1000 | Warning | Scale workers |
| A-003 | API Error Rate | api_errors / api_calls > 0.1 | Critical | Check Virtuozzo |
| A-004 | Data Age | data_age_p95 > 1800 (30min) | Warning | Review TTL |
| A-005 | Worker Down | worker_heartbeat > 60s | Critical | Restart |
10. Testing Requirements
10.1 Unit Tests
- Fresh data returns immediately
- Stale data returns + enqueues task
- First run blocks for fetch
- Hash computation
- Schema validation
10.2 Sad Path Tests (3:1 Ratio)
| Test | Scenario | Expected |
| --- | --- | --- |
| test_timeout_preserves_stale | Virtuozzo timeout | Stale data preserved |
| test_500_preserves_stale | Virtuozzo 500 | Stale data preserved |
| test_invalid_payload_rejected | Invalid JSON | Stale data preserved |
| test_worker_crash_rollback | Worker crash | Transaction rollback |
| test_database_connection_lost | DB down | 503 returned |
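To show the shape such a test might take, here is a hedged sketch of `test_timeout_preserves_stale` against an in-memory fake. `FakeStore` and `sync_once` are hypothetical stand-ins for the real repository and worker step, and recording the failure as `sync_status = "error"` is an assumption based on the `sync_status` column and the error state named earlier.

```python
import asyncio
from typing import Any, Awaitable, Callable


class FakeStore:
    """In-memory stand-in for the environments rows of one team."""

    def __init__(self, envs: list[dict[str, Any]]) -> None:
        self.envs = envs
        self.sync_status = "ok"


async def sync_once(store: FakeStore,
                    fetch: Callable[[], Awaitable[list[dict[str, Any]]]],
                    timeout: float = 5.0) -> bool:
    """One worker step: fetch with a timeout, write only on success."""
    try:
        envs = await asyncio.wait_for(fetch(), timeout=timeout)
    except (TimeoutError, asyncio.TimeoutError):
        store.sync_status = "error"  # record the failure, keep store.envs as-is
        return False
    store.envs = envs
    store.sync_status = "ok"
    return True


def test_timeout_preserves_stale() -> None:
    async def slow_fetch() -> list[dict[str, Any]]:
        await asyncio.sleep(1)  # longer than the timeout used below
        return []

    store = FakeStore([{"env_name": "prod"}])
    ok = asyncio.run(sync_once(store, slow_fetch, timeout=0.01))
    assert ok is False
    assert store.envs == [{"env_name": "prod"}]
    assert store.sync_status == "error"
```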
10.3 Integration Tests
| Test | Scenario | Expected |
| --- | --- | --- |
| test_e2e_stale_while_revalidate | Full flow | Stale returned, then fresh |
10.4 Coverage Requirements
| Component | Minimum Coverage |
| --- | --- |
| Overall | 85% |
| Sync Service | 95% |
| Worker | 90% |
| Rate Limiter | 90% |
11. Configuration
11.1 Environment Variables
# Environment Sync
VZ_ENV_TTL_SECONDS=600 # 10 minutes
VZ_ENV_MAX_STALE_SECONDS=3600 # 1 hour
VZ_SYNC_QUEUE_RATE_LIMIT=50 # 50 req/sec
VZ_SYNC_TIMEOUT_SECONDS=5 # Virtuozzo timeout
VZ_SYNC_RETRY_ATTEMPTS=3 # Max retries
VZ_SYNC_WORKER_CONCURRENCY=10 # Worker count
VZ_SYNC_ENABLE_QUEUE=true # Enable queue
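A minimal sketch of reading these variables into a typed settings object, using only the standard library. The class name `SyncSettings`, its attribute names, and the accepted boolean spellings are assumptions; the defaults mirror the values listed above. It expects the values without the trailing `#` comments, which whatever loads the environment file would normally strip.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class SyncSettings:
    ttl_seconds: int = 600
    max_stale_seconds: int = 3600
    rate_limit: int = 50
    timeout_seconds: int = 5
    retry_attempts: int = 3
    worker_concurrency: int = 10
    enable_queue: bool = True


def _as_bool(value: str) -> bool:
    # Assumption: these spellings count as true; anything else is false.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> SyncSettings:
    """Build settings from environment variables, falling back to the defaults."""
    return SyncSettings(
        ttl_seconds=int(env.get("VZ_ENV_TTL_SECONDS", 600)),
        max_stale_seconds=int(env.get("VZ_ENV_MAX_STALE_SECONDS", 3600)),
        rate_limit=int(env.get("VZ_SYNC_QUEUE_RATE_LIMIT", 50)),
        timeout_seconds=int(env.get("VZ_SYNC_TIMEOUT_SECONDS", 5)),
        retry_attempts=int(env.get("VZ_SYNC_RETRY_ATTEMPTS", 3)),
        worker_concurrency=int(env.get("VZ_SYNC_WORKER_CONCURRENCY", 10)),
        enable_queue=_as_bool(env.get("VZ_SYNC_ENABLE_QUEUE", "true")),
    )
```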
11.2 Feature Flags
| Flag | Default | Description |
| --- | --- | --- |
| vz_sync_enable_queue | true | Enable queue-based sync |
| vz_sync_worker_enabled | true | Enable background workers |
| vz_sync_stale_warning_enabled | true | Show staleness warning in UI |
12. Deployment Checklist
| Phase | Task | Status |
| --- | --- | --- |
| Infrastructure | Set up RabbitMQ/Redis queue | |
| Infrastructure | Configure workers | |
| Database | Run migration | |
| Backend | Implement sync service | |
| Backend | Implement workers | |
| Backend | Update router | |
| Testing | Unit tests (85%) | |
| Testing | Sad path tests (3:1) | |
| Observability | Configure metrics | |
| Documentation | Update runbooks | |
13. Revision History
| Version | Date | Author | Changes |
| --- | --- | --- | --- |
| 1.0 | 2026-01-09 | MBPanel Team | Initial requirements (lazy sync + cron) |
| 2.0 | 2026-01-22 | MBPanel Team | Major refactor: Stale-While-Revalidate + Queue-based architecture |
14. Approvals
| Role | Name | Signature | Date |
| --- | --- | --- | --- |
| Product Manager | | | |
| Engineering Lead | | | |
| Architecture Lead | | | |
| DevOps Lead | | | |