
Environments Sync


Below is an end-to-end sync architecture that keeps our Postgres copy of Virtuozzo environments/nodes fresh without hammering VZ on every login. The design leans on documented patterns so we have evidence-backed decisions (see refs).


1. Architectural Goals

  • Minimize direct Virtuozzo calls on user-path requests.
  • Bound “staleness” per resource type with explicit SLAs.
  • Give Ops telemetry about freshness and job health.
  • Avoid simultaneous bulk syncs that would rate-limit or DOS Virtuozzo.

2. Core Patterns & References

  1. Cache-Aside with TTL + versioning – read from our DB first, and only pull from the source when data is missing or stale. Microsoft’s Cache-Aside pattern formalizes this behavior and explicitly calls out the need for expiration + lazy population (Azure Architecture Center, “Cache-Aside pattern”, https://learn.microsoft.com/en-us/azure/architecture/patterns/cache-aside).
  2. Scheduler/Agent/Supervisor – dependable background orchestration that schedules work, runs worker agents, and monitors completion/failures. This avoids tying sync work to user requests (Azure Architecture Center, “Scheduler Agent Supervisor pattern”, https://learn.microsoft.com/en-us/azure/architecture/patterns/scheduler-agent-supervisor).
  3. Redis key expiry guidance – short-lived coordination data (locks, freshness hints) should rely on Redis expirations rather than bespoke cron cleanup. Redis’ own docs recommend using expirations for time-based invalidation (Redis official docs, “Key expiration”, https://redis.io/docs/latest/develop/use/keyspace/expiry/).

These three give us the evidence-backed foundation for the rest of the plan.


3. Domain Model Additions

  • vz_sync_states

    team_id PK → teams.id
    data_type ENUM('environment','node','snapshot',…)
    last_full_sync_at timestamptz
    last_delta_sync_at timestamptz
    next_due_at timestamptz
    max_allowed_staleness interval (per type)
    status ENUM('idle','running','error')
    pending_reason TEXT
    requested_by ENUM('scheduler','login','api','ops')
    

  • vz_sync_jobs

    job_id UUID PK
    team_id
    scope ENUM('full','delta','single_env','node_subtree')
    priority SMALLINT
    created_at, started_at, finished_at
    state ENUM('queued','running','succeeded','failed','cancelled')
    attempt_count
    error_payload JSONB
    source_version TEXT (hash of Virtuozzo response for dedupe)
    

  • Existing resource tables (environments, nodes, etc.) gain:

    source_version TEXT
    synced_at timestamptz
    stale boolean (derived)
    

These rows give us deterministic “freshness math” instead of heuristics scattered in job code.
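As a concrete sketch, vz_sync_states could be declared in SQLAlchemy Core roughly as below. The column names follow the schema above, but the exact types, the composite (team_id, data_type) primary key, and the server default are assumptions; the foreign key to teams.id is omitted for brevity.

```python
# Illustrative SQLAlchemy Core definition of vz_sync_states (not the final
# migration). Composite PK is assumed because each team tracks several
# data types; adjust if one row per team is intended.
import sqlalchemy as sa

metadata = sa.MetaData()

vz_sync_states = sa.Table(
    "vz_sync_states",
    metadata,
    sa.Column("team_id", sa.Integer, primary_key=True),  # FK to teams.id omitted here
    sa.Column("data_type",
              sa.Enum("environment", "node", "snapshot", name="vz_data_type"),
              primary_key=True),
    sa.Column("last_full_sync_at", sa.TIMESTAMP(timezone=True)),
    sa.Column("last_delta_sync_at", sa.TIMESTAMP(timezone=True)),
    sa.Column("next_due_at", sa.TIMESTAMP(timezone=True), index=True),
    sa.Column("max_allowed_staleness", sa.Interval, nullable=False),
    sa.Column("status",
              sa.Enum("idle", "running", "error", name="vz_sync_status"),
              server_default="idle"),
    sa.Column("pending_reason", sa.Text),
    sa.Column("requested_by",
              sa.Enum("scheduler", "login", "api", "ops", name="vz_requested_by")),
)
```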


4. Scheduling & Freshness Algorithm

4.1 Freshness scoring

For every resource scope (team, environment, node):

staleness_seconds = now() - last_successful_sync_at(scope)
freshness_score = staleness_seconds / max_allowed_staleness(scope)

if freshness_score >= 1.0:
    enqueue_sync(scope, reason='sla_exceeded')
elif freshness_score >= 0.5 and recent_user_activity(scope):
    enqueue_delta(scope, reason='active_team_halfway_stale')

  • max_allowed_staleness defaults:
    • Environments: 10 minutes
    • Nodes: 5 minutes
    • Archived jobs / logs: 30 minutes

  All values are configurable per team tier; enterprise customers can get tighter windows.
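A minimal, testable version of this scoring in Python (helper and default names mirror the pseudocode and the defaults listed above; the explicit `now` parameter is added only for determinism):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def freshness_score(last_successful_sync_at: datetime,
                    max_allowed_staleness: timedelta,
                    now: Optional[datetime] = None) -> float:
    """Score >= 1.0 means the staleness SLA for this scope is exceeded."""
    now = now or datetime.now(timezone.utc)
    staleness = (now - last_successful_sync_at).total_seconds()
    return staleness / max_allowed_staleness.total_seconds()

# Defaults from the list above, keyed by resource type.
DEFAULT_MAX_STALENESS = {
    "environment": timedelta(minutes=10),
    "node": timedelta(minutes=5),
    "archived": timedelta(minutes=30),
}
```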

The scheduler keeps a Redis sorted set vz:sync:due whose score is next_due_at as a Unix timestamp (sorted-set scores must be numeric). Workers pop due ranges with ZRANGEBYSCORE, and short-lived coordination keys carry expirations (per the Redis documentation, use expiration to auto-clean stale entries).
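The due-queue pop could look like this with a redis-py-style client. The key name matches the text; the batch size and the ZREM-after-read step are assumptions (in production a Lua script or ZPOPMIN would make the pop atomic):

```python
# Pop all scopes whose next_due_at (stored as the sorted-set score, in epoch
# seconds) is already in the past, then remove them so other workers don't
# re-pop the same entries.
def pop_due_scopes(redis_client, now_ts: float, limit: int = 100):
    due = redis_client.zrangebyscore("vz:sync:due", "-inf", now_ts,
                                     start=0, num=limit)
    if due:
        redis_client.zrem("vz:sync:due", *due)
    return due
```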

4.2 Trigger sources

  1. Baseline sweeps – next_due_at = last_full_sync_at + full_sync_interval. The full-sync interval defaults to 60 minutes; for large teams we do incremental sweeps (paginate by environment slug).
  2. Demand-driven – when an API endpoint (e.g., /api/v1/environments) loads:
     • Read from Postgres.
     • If freshness_score < 1.0, return data immediately and attach {"freshness": "current", "lastSyncedAt": ...} in the response metadata.
     • If stale, enqueue a delta sync (scope='team', type='delta') but still return the old data with "freshness": "stale" so the UI can show a badge.
     • We never block the request on Virtuozzo.
  3. User events – if the UI triggers a mutating action (create server, restart node), we already know that resource changed. Mark that resource’s row needs_validation = true and enqueue a low-latency “confirmation sync” that pulls only that environment/node.
  4. Failure recovery – if a job fails with a retryable error (network, 5xx from Virtuozzo), move next_due_at forward by exponential backoff, capped at 5 minutes to avoid permanent staleness.
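The demand-driven read path (never block on Virtuozzo) can be sketched as a pure function; the enqueue_delta callback, the row shape, and the exact metadata field names are assumptions drawn from the steps above:

```python
# Return cached rows immediately with freshness metadata; only enqueue a
# background delta sync when the data is past its SLA. The request itself
# never waits on Virtuozzo.
from datetime import datetime, timedelta, timezone

def read_environments(rows, last_synced_at, max_staleness, enqueue_delta):
    score = ((datetime.now(timezone.utc) - last_synced_at).total_seconds()
             / max_staleness.total_seconds())
    meta = {
        "lastSyncedAt": last_synced_at.isoformat(),
        "freshness": "current" if score < 1.0 else "stale",
    }
    if score >= 1.0:
        enqueue_delta(scope="team")  # non-blocking; stale data still returned
    return {"data": rows, "meta": meta}
```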

4.3 Locking & throttling

  • Before a worker issues Virtuozzo calls, it obtains a SETNX lock vz:sync:lock:{team_id}:{scope} with a TTL of 2× the expected job duration. Redis expiry guarantees the lock eventually clears even if the worker crashes (per the Redis docs on expiration).
  • Jobs read vz_sync_states with SELECT ... FOR UPDATE SKIP LOCKED so only one worker updates a scope at a time.
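A sketch of the lock step, assuming a redis-py-style client with decode_responses=True. SET with nx/ex is the atomic SETNX-plus-TTL described above; the 2× duration TTL follows the text:

```python
import uuid
from typing import Optional

def acquire_sync_lock(redis_client, team_id: int, scope: str,
                      expected_duration_s: int) -> Optional[str]:
    """Return a lock token on success, or None if another worker holds it."""
    key = f"vz:sync:lock:{team_id}:{scope}"
    token = uuid.uuid4().hex
    # nx=True -> set only if absent (SETNX); ex=... -> Redis expires the key
    # on its own even if the worker crashes before releasing.
    acquired = redis_client.set(key, token, nx=True, ex=2 * expected_duration_s)
    return token if acquired else None

def release_sync_lock(redis_client, team_id: int, scope: str, token: str) -> None:
    key = f"vz:sync:lock:{team_id}:{scope}"
    # Delete only our own lock; in production a Lua script makes this
    # check-and-delete atomic.
    if redis_client.get(key) == token:
        redis_client.delete(key)
```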

5. Sync Execution Flow

  1. Scheduler service (async FastAPI background task or separate worker)
     • Wakes every 30 seconds and loads all vz_sync_states with next_due_at <= now.
     • Pushes job descriptors onto RabbitMQ (jobs.vz.sync) with priority = freshness_score.
     • Also polls Redis vz:sync:requests for demand-driven triggers (login, API); deduplicates by hashing (team_id, scope, reason).

  2. Worker (Celery, Dramatiq, or our existing aio-pika consumer)
     • Pops a job and acquires the Redis lock.
     • Fetches VZ data:
       • Full sync: paginate fetchEnvironments (env list), then fetchEnvDetails.
       • Delta: pass modified_since=last_delta_sync_at if the Virtuozzo API supports it; otherwise filter locally by the updated timestamp VZ returns.
     • Compares each payload against source_version (SHA-256 of sorted JSON) and only UPDATEs/UPSERTs rows whose hash changed. This keeps Postgres lean.
     • Marks missing entries soft_deleted to capture removals.
     • Updates vz_sync_states with the last_* fields and calculates next_due_at.

  3. Eventing
     • On success: publish a sync.completed SSE/RabbitMQ event so logged-in clients can show “data refreshed at …”.
     • On failure: log to JobLog, increment failure_count, and set vz_sync_states.status='error' + pending_reason; the UI can show “sync retrying” banners.
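The hash-based dedupe from the worker step is just canonical-JSON hashing; a minimal sketch (function names are illustrative):

```python
# Compute source_version as SHA-256 over a canonical JSON encoding, so the
# same payload always hashes identically regardless of key order, and skip
# the UPSERT when the stored hash already matches.
import hashlib
import json
from typing import Optional

def source_version(payload: dict) -> str:
    # sort_keys + compact separators give a stable byte representation
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def needs_write(payload: dict, stored_version: Optional[str]) -> bool:
    return source_version(payload) != stored_version
```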

6. Avoiding the “login storm”

In Laravel you synced on every login. With this design:

  • Login simply reads vz_sync_states. If freshness_score >= 1, it enqueues a job (non-blocking) and injects {"freshness":"stale"} into the login/session response so the frontend can show a toast (“data is refreshing in the background”).
  • Because the queue deduplicates on (team_id, scope) via a Redis SET with TTL, 100 concurrent logins create only one sync job.
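The dedup is a single atomic Redis call; the key name and default TTL here are assumptions:

```python
# Only the first caller within the TTL window wins the SET NX, so a burst
# of concurrent logins enqueues exactly one sync job per (team_id, scope).
def should_enqueue(redis_client, team_id: int, scope: str, ttl_s: int = 300) -> bool:
    key = f"vz:sync:requested:{team_id}:{scope}"
    return bool(redis_client.set(key, "1", nx=True, ex=ttl_s))
```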


7. Additional Optimizations

  1. Hot-spot protection – track Virtuozzo rate-limit headers; if we detect near-threshold usage, push next_due_at for non-critical teams forward to spread load.
  2. Partial materialization – for “large” teams, store per-environment last_synced_at so a full sweep is just a collection of smaller jobs. A sync_shard (0..N-1) column distributes them across intervals.
  3. Manual override – admin API /api/v1/teams/{id}/vz-sync sets next_due_at = now for an emergency refresh.
  4. Observability – add Prometheus metrics:
     • vz_sync_staleness_seconds{team,scope}
     • vz_sync_job_duration_seconds_bucket
     • vz_sync_jobs_inflight
     On-call is alerted if staleness_seconds exceeds the SLA for >5 minutes.

8. Implementation Plan

  1. Schema & metadata (2–3 days)
     • Add vz_sync_states, vz_sync_jobs, and the source_version columns via Alembic.
     • Create indexes on (team_id, data_type).

  2. Scheduler service (2 days)
     • Background task using FastAPI lifespan or a dedicated worker that implements Cache-Aside semantics (per the Microsoft doc).
     • Use Redis sorted sets for due jobs + TTLs per Redis best practices.

  3. Worker adaptation (3–4 days)
     • Port logic from the Laravel jobs, refactored to operate on job descriptors rather than “sync on login”.
     • Idempotent UPSERTs using SQLAlchemy insert().on_conflict_do_update.
     • Compute source_version to skip rewrites.

  4. API/UX integration (1–2 days)
     • Add freshness metadata to env/node responses.
     • UI surfaces stale indicators & background refresh notifications.

  5. Observability & rollout (1 day)
     • Instrument metrics/logs.
     • Toggle via a feature flag per team to migrate gradually.

Why this solves the old pain points

  • No login coupling: Users always get the last persisted view; sync happens off-path.
  • Guaranteed freshness: We enforce SLAs via max_allowed_staleness and the scheduler’s prioritization.
  • Resource-efficient: Hash comparison + delta jobs mean we only write rows that actually changed.
  • Operational insight: vz_sync_states tells support exactly how stale a team’s data is and why.

This blueprint keeps Postgres in lockstep with Virtuozzo while respecting API limits and user latency, grounded in proven architectural patterns rather than ad-hoc heuristics.