# Environments Sync
Below is an end-to-end sync architecture that keeps our Postgres copy of Virtuozzo environments/nodes fresh without hammering VZ on every login. The design leans on documented patterns so we have evidence-backed decisions (see refs).
## 1. Architectural Goals
- Minimize direct Virtuozzo calls on user-path requests.
- Bound “staleness” per resource type with explicit SLAs.
- Give Ops telemetry about freshness and job health.
- Avoid simultaneous bulk syncs that would trip rate limits or effectively DoS Virtuozzo.
## 2. Core Patterns & References
- Cache-Aside with TTL + versioning – read from our DB first, and only pull from the source when data is missing or stale. Microsoft’s Cache-Aside pattern formalizes this behavior and explicitly calls out the need for expiration + lazy population (Azure Architecture Center, “Cache-Aside pattern”, https://learn.microsoft.com/en-us/azure/architecture/patterns/cache-aside).
- Scheduler/Agent/Supervisor – dependable background orchestration that schedules work, runs worker agents, and monitors completion/failures. This avoids tying sync work to user requests (Azure Architecture Center, “Scheduler Agent Supervisor pattern”, https://learn.microsoft.com/en-us/azure/architecture/patterns/scheduler-agent-supervisor).
- Redis key expiry guidance – short-lived coordination data (locks, freshness hints) should rely on Redis expirations rather than bespoke cron cleanup. Redis’ own docs recommend using expirations for time-based invalidation (Redis official docs, “Key expiration”, https://redis.io/docs/latest/develop/use/keyspace/expiry/).
These three give us the evidence-backed foundation for the rest of the plan.
## 3. Domain Model Additions

- `vz_sync_states`
  - `team_id` PK → `teams.id`
  - `data_type` ENUM('environment','node','snapshot',…)
  - `last_full_sync_at` timestamptz
  - `last_delta_sync_at` timestamptz
  - `next_due_at` timestamptz
  - `max_allowed_staleness` interval (per type)
  - `status` ENUM('idle','running','error')
  - `pending_reason` TEXT
  - `requested_by` ENUM('scheduler','login','api','ops')
- `vz_sync_jobs`
- Existing resource tables (`environments`, `nodes`, etc.) gain `source_version`, `last_synced_at`, `soft_deleted`, and `needs_validation` columns.
These rows give us deterministic “freshness math” instead of heuristics scattered in job code.
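A minimal sketch of `vz_sync_states` as DDL, assuming the columns listed above and a composite `(team_id, data_type)` key (since staleness is tracked per data type). The real table would be created in Postgres via Alembic with proper ENUM/interval/timestamptz types; SQLite stand-ins are used here purely for illustration.

```python
import sqlite3

# Sketch of the vz_sync_states table. In production this is a Postgres table
# created via Alembic with real ENUM/interval/timestamptz types; the SQLite
# type stand-ins (TEXT/INTEGER) below are for illustration only.
DDL = """
CREATE TABLE vz_sync_states (
    team_id               INTEGER NOT NULL,              -- FK -> teams.id in Postgres
    data_type             TEXT NOT NULL,                 -- 'environment' | 'node' | 'snapshot' ...
    last_full_sync_at     TEXT,                          -- timestamptz in Postgres
    last_delta_sync_at    TEXT,
    next_due_at           TEXT,
    max_allowed_staleness INTEGER NOT NULL,              -- seconds here; interval in Postgres
    status                TEXT NOT NULL DEFAULT 'idle',  -- 'idle' | 'running' | 'error'
    pending_reason        TEXT,
    requested_by          TEXT,                          -- 'scheduler' | 'login' | 'api' | 'ops'
    PRIMARY KEY (team_id, data_type)
);
"""

conn = sqlite3.connect(":memory:")
conn.execute(DDL)
conn.execute(
    "INSERT INTO vz_sync_states (team_id, data_type, max_allowed_staleness) VALUES (?, ?, ?)",
    (1, "environment", 600),  # environments: 10-minute staleness budget
)
row = conn.execute(
    "SELECT status, max_allowed_staleness FROM vz_sync_states WHERE team_id = 1"
).fetchone()
```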
## 4. Scheduling & Freshness Algorithm

### 4.1 Freshness scoring
For every resource scope (team, environment, node):
    staleness_seconds = now() - last_successful_sync_at(scope)
    freshness_score = staleness_seconds / max_allowed_staleness(scope)

    if freshness_score >= 1.0:
        enqueue_sync(scope, reason='sla_exceeded')
    elif freshness_score >= 0.5 and recent_user_activity(scope):
        enqueue_delta(scope, reason='active_team_halfway_stale')
`max_allowed_staleness` defaults:

- Environments: 10 minutes
- Nodes: 5 minutes
- Archived jobs / logs: 30 minutes

(All configurable per team tier; enterprise customers can get tighter windows.)
The scheduler keeps a Redis sorted set `vz:sync:due` where the score is `next_due_at` as an epoch timestamp. Workers pop due ranges with `ZRANGEBYSCORE`, and coordination keys carry expirations (per the Redis documentation, use expiration to auto-clean stale entries).
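To make the sorted-set flow concrete, here is an in-memory model of `vz:sync:due`: `zadd` with score = `next_due_at` epoch seconds, then the equivalent of `ZRANGEBYSCORE 0 <now>` plus `ZREM` to pop everything due. A real deployment would issue these commands via a Redis client against a live server; the dict stand-in below (a hypothetical `DueSet` class) only illustrates the semantics.

```python
# In-memory model of the vz:sync:due sorted set. Real code would call Redis
# (ZADD / ZRANGEBYSCORE / ZREM); this stand-in shows the scheduling semantics.
class DueSet:
    def __init__(self) -> None:
        self._scores: dict[str, float] = {}  # member -> next_due_at (epoch seconds)

    def zadd(self, member: str, next_due_at: float) -> None:
        self._scores[member] = next_due_at

    def pop_due(self, now: float) -> list[str]:
        # Equivalent of ZRANGEBYSCORE vz:sync:due 0 <now>, then ZREM each hit.
        due = sorted(m for m, s in self._scores.items() if s <= now)
        for member in due:
            del self._scores[member]
        return due

due = DueSet()
due.zadd("team:1:environment", 100.0)  # due at t=100
due.zadd("team:2:environment", 900.0)  # due at t=900
ready = due.pop_due(now=500.0)         # only team 1 is due at t=500
```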
### 4.2 Trigger sources
- Baseline sweeps – `next_due_at = last_full_sync_at + full_sync_interval`. The full-sync interval defaults to 60 minutes; for large teams we do incremental sweeps (paginated by environment slug).
- Demand-driven – When an API endpoint (e.g., `/api/v1/environments`) loads:
  - Read from Postgres.
  - If `freshness_score < 1.0`, return data immediately and attach `{"freshness": "current", "lastSyncedAt": ...}` in the response metadata.
  - If stale, enqueue a delta sync (`scope='team'`, `type='delta'`) but still return the old data with `"freshness": "stale"` so the UI can show a badge.
  - We never block the request on Virtuozzo.
- User events – If the UI triggers a mutating action (create server, restart node), we already know that resource changed. Mark that resource's row `needs_validation = true` and enqueue a low-latency "confirmation sync" that pulls only that environment/node.
- Failure recovery – If a job fails with a retryable error (network error, 5xx from Virtuozzo), push `next_due_at` forward with exponential backoff, capped at 5 minutes to avoid permanent staleness.
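The failure-recovery rule can be sketched as a one-liner. The 5-minute cap comes from the text above; the 10-second base delay is an assumed tuning knob, not part of the design.

```python
# Sketch of the failure-recovery backoff: exponential growth on retryable
# errors, capped at 5 minutes so a flaky team never goes permanently stale.
BASE_DELAY_S = 10   # assumed base delay; a tunable, not from the design doc
MAX_DELAY_S = 300   # the 5-minute cap from the rule above

def retry_delay_seconds(failure_count: int) -> int:
    """How far to push next_due_at after the Nth consecutive failure."""
    return min(BASE_DELAY_S * 2 ** failure_count, MAX_DELAY_S)

delays = [retry_delay_seconds(n) for n in range(7)]
# grows 10, 20, 40, 80, 160 seconds, then flattens at the 300 s cap
```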
### 4.3 Locking & throttling
- Before a worker issues Virtuozzo calls, it obtains a `SETNX` lock `vz:sync:lock:{team_id}:{scope}` with a TTL of 2× the expected job duration. Redis' expiry guarantees the lock eventually clears even if the worker crashes (per Redis docs on expiration).
- Jobs read `vz_sync_states` with `SELECT ... FOR UPDATE SKIP LOCKED` so only one worker updates a scope at a time.
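A toy model of the lock's crash-safety property. Real code would issue a single Redis `SET key token NX EX ttl`; the hypothetical `TtlLock` class below only demonstrates why the TTL matters — once it passes, the lock is acquirable again without any cleanup job.

```python
# In-memory model of the SETNX-with-TTL lock from 4.3. Production code would
# call Redis (SET ... NX EX <ttl>); this dict stand-in shows the semantics:
# a held lock rejects acquirers until its TTL elapses, then self-clears.
class TtlLock:
    def __init__(self) -> None:
        self._locks: dict[str, float] = {}  # key -> expiry timestamp (epoch s)

    def acquire(self, key: str, ttl_s: float, now: float) -> bool:
        expiry = self._locks.get(key)
        if expiry is not None and expiry > now:
            return False                 # lock held by another worker
        self._locks[key] = now + ttl_s   # take (or re-take) the lock
        return True

locks = TtlLock()
got_first = locks.acquire("vz:sync:lock:1:environment", ttl_s=120, now=0)
got_second = locks.acquire("vz:sync:lock:1:environment", ttl_s=120, now=60)
# After the TTL (e.g. the holder crashed mid-sync), the lock self-clears:
got_after_ttl = locks.acquire("vz:sync:lock:1:environment", ttl_s=120, now=200)
```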
## 5. Sync Execution Flow
- Scheduler service (async FastAPI background task or separate worker)
  - Wakes every 30 seconds, loads all `vz_sync_states` with `next_due_at <= now`.
  - Pushes job descriptors onto RabbitMQ (`jobs.vz.sync`) with priority = `freshness_score`.
  - Also polls Redis `vz:sync:requests` for demand-driven triggers (login, API). Deduplicates by hashing `(team_id, scope, reason)`.
- Worker (Celery, Dramatiq, or our existing aio-pika consumer)
  - Pops a job, acquires the Redis lock.
  - Fetches VZ data:
    - For a full sync: paginate `fetchEnvironments` (ENV list) then `fetchEnvDetails`.
    - For a delta: pass `modified_since=last_delta_sync_at` if the Virtuozzo API supports it; otherwise, filter locally by the `updated` timestamp returned by VZ.
  - Compares each payload against `source_version` (SHA256 of sorted JSON). Only UPSERTs rows whose hash changed, which keeps Postgres writes lean.
  - Marks missing entries as `soft_deleted` to capture removals.
  - Updates `vz_sync_states` with the `last_*` fields and calculates `next_due_at`.
- Eventing
  - On success: publish a `sync.completed` SSE/RabbitMQ event so logged-in clients can show "data refreshed at …".
  - On failure: log to `JobLog`, increment `failure_count`, and set `vz_sync_states.status='error'` + `pending_reason`. The UI can show "sync retrying" banners.
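The change-detection step above is simple enough to show in full: `source_version` is the SHA-256 of the payload serialized as canonically sorted JSON, so key order and whitespace differences in the Virtuozzo response never cause spurious writes. This is a sketch of the idea, not the worker's actual code.

```python
import hashlib
import json

# source_version = SHA-256 of the payload as canonically sorted JSON.
# Sorting keys and fixing separators makes the hash deterministic, so only
# genuine data changes produce a new hash (and hence a Postgres write).
def source_version(payload: dict) -> str:
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

stored = source_version({"name": "env-1", "status": "running"})
# Same data, different key order from the API -> same hash -> skip the UPDATE.
unchanged = source_version({"status": "running", "name": "env-1"}) == stored
# A real change -> different hash -> UPSERT the row.
changed = source_version({"name": "env-1", "status": "stopped"}) != stored
```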
## 6. Avoiding the “login storm”
In the Laravel app we synced on every login. With this design:

- Login simply reads `vz_sync_states`. If `freshness_score >= 1`, it enqueues a job (non-blocking) and injects `{"freshness": "stale"}` into the login/session response so the frontend can show a toast ("data is refreshing in the background").
- Because the queue deduplicates on `(team_id, scope)` via a Redis SET with TTL, 100 concurrent logins create only one sync job.
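To show why 100 logins yield one job, here is an in-memory model of the dedup step. Real code would rely on a single Redis `SET key 1 NX EX ttl` per `(team_id, scope)`; the hypothetical `SyncDeduper` class below only demonstrates the counting behavior.

```python
# In-memory model of the login-storm dedup: a set member keyed by
# (team_id, scope) with a TTL means N concurrent logins enqueue at most one
# sync job per TTL window. Production code would use Redis SET ... NX EX.
class SyncDeduper:
    def __init__(self, ttl_s: float) -> None:
        self.ttl_s = ttl_s
        self._seen: dict[tuple[int, str], float] = {}  # key -> expiry (epoch s)

    def should_enqueue(self, team_id: int, scope: str, now: float) -> bool:
        key = (team_id, scope)
        expiry = self._seen.get(key)
        if expiry is not None and expiry > now:
            return False  # a sync for this scope is already queued
        self._seen[key] = now + self.ttl_s
        return True

dedupe = SyncDeduper(ttl_s=60)
# 100 concurrent logins for team 7 -> exactly one job descriptor.
enqueued = sum(dedupe.should_enqueue(7, "environment", now=0.0) for _ in range(100))
```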
## 7. Additional Optimizations
- Hot-spot protection
  - Track Virtuozzo rate-limit headers. If we detect near-threshold usage, push `next_due_at` for non-critical teams to spread load.
- Partial materialization
  - For "large" teams, store per-environment `last_synced_at` so a full sweep is just a collection of smaller jobs. We can store `sync_shard` (0..N-1) to distribute the daily load across intervals.
- Manual override
  - Admin API `/api/v1/teams/{id}/vz-sync` sets `next_due_at = now` for an emergency refresh.
- Observability
  - Add Prometheus metrics:
    - `vz_sync_staleness_seconds{team,scope}`
    - `vz_sync_job_duration_seconds_bucket`
    - `vz_sync_jobs_inflight`
  - On-call gets alerted if `staleness_seconds` exceeds the SLA for >5 minutes.
## 8. Implementation Plan
- Schema & metadata (2–3 days)
  - Add `vz_sync_states`, `vz_sync_jobs`, and the `source_version` columns via Alembic.
  - Create indexes on `(team_id, data_type)`.
- Scheduler service (2 days)
  - Background task using FastAPI lifespan or a dedicated worker that implements Cache-Aside semantics (per the Microsoft doc).
  - Use Redis sorted sets for due jobs + TTLs, per Redis best practices.
- Worker adaptation (3–4 days)
  - Port logic from the Laravel jobs, but refactor to operate on job descriptors rather than "sync on login".
  - Idempotent UPSERTs using SQLAlchemy `insert().on_conflict_do_update`.
  - Compute `source_version` to skip rewrites.
- API/UX integration (1–2 days)
  - Add `freshness` metadata to env/node responses.
  - UI surfaces stale indicators & background refresh notifications.
- Observability & rollout (1 day)
  - Instrument metrics/logs.
  - Toggle via feature flag per team to migrate gradually.
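The idempotent, hash-guarded UPSERT from the worker-adaptation step can be sketched in plain SQL. Production code would use SQLAlchemy's Postgres `insert().on_conflict_do_update`; SQLite's near-identical `ON CONFLICT` syntax is used below only so the sketch is self-contained. The `environments` schema here is a minimal stand-in.

```python
import sqlite3

# Hash-guarded idempotent UPSERT: rows are only rewritten when source_version
# (the payload hash) actually changed. Postgres + SQLAlchemy's
# insert().on_conflict_do_update would express the same statement in production.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE environments (
    slug           TEXT PRIMARY KEY,
    status         TEXT NOT NULL,
    source_version TEXT NOT NULL
)""")

UPSERT = """
INSERT INTO environments (slug, status, source_version)
VALUES (?, ?, ?)
ON CONFLICT (slug) DO UPDATE SET
    status = excluded.status,
    source_version = excluded.source_version
WHERE environments.source_version != excluded.source_version
"""

conn.execute(UPSERT, ("env-1", "running", "hash-a"))  # initial insert
conn.execute(UPSERT, ("env-1", "running", "hash-a"))  # same hash: no-op write
conn.execute(UPSERT, ("env-1", "stopped", "hash-b"))  # new hash: row updated
status = conn.execute(
    "SELECT status FROM environments WHERE slug = 'env-1'"
).fetchone()[0]
```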
## Why this solves the old pain points
- No login coupling: users always get the last persisted view; sync happens off-path.
- Guaranteed freshness: we enforce SLAs via `max_allowed_staleness` and the scheduler's prioritization.
- Resource-efficient: hash comparison + delta jobs mean we only write rows that actually changed.
- Operational insight: `vz_sync_states` tells support exactly how stale a team's data is and why.
This blueprint keeps Postgres within bounded staleness of Virtuozzo while respecting API limits and user latency, grounded in proven architectural patterns rather than ad-hoc heuristics.