MBPanel FastAPI Migration - Centralized Product Requirements Document (PRD)

2. Executive Summary

This document serves as the single source of truth for the complete migration of the MightyBox Control Panel from Laravel to a modern FastAPI backend with a Next.js frontend. The migration is an enterprise-grade transformation that addresses performance, scalability, and maintainability challenges through a Hybrid Modular Domain-Driven Design (DDD) architecture.

Migration Overview

  • Current State: Laravel 11 with Inertia.js/React frontend (monolithic architecture), MySQL database, Node.js WebSocket server
  • Target State: FastAPI backend with Next.js frontend (decoupled architecture), PostgreSQL database, native FastAPI WebSocket support, Hybrid Modular DDD architecture
  • Timeline: 20-week implementation period
  • Performance Goals: 50-70% improvement in API response times (400-600ms → <200ms)

3. Problem Statement

Current Challenges

  1. Performance: Laravel application experiencing 400-600ms response times affecting user experience
  2. Scalability: Current architecture struggles with 5x user growth requirements
  3. Maintainability: Legacy codebase with complex monolithic structure and tight coupling
  4. Infrastructure Costs: Current setup requires 30% more resources than industry standards

Business Impact

  • Decreased user satisfaction due to slow response times
  • Limited ability to scale with business growth
  • Higher operational costs
  • Technical debt affecting feature delivery speed

4. User Personas

Understanding our diverse user base is critical for building an enterprise-grade WordPress hosting dashboard. Our platform serves millions of users with varying technical expertise, business sizes, and use cases.

4.1 Primary Personas

P1: Small Business Owner (Sarah)

Demographics - Role: Entrepreneur/Small Business Owner - Technical Skill: Beginner to Intermediate - Age Range: 28-45 - Business Size: 1-5 employees - Annual Revenue: $50K-$500K

Goals - Launch and manage business website quickly without technical expertise - Minimize costs while maintaining reliable hosting - Focus on business growth, not server management - Easy-to-understand dashboard with minimal learning curve

Pain Points - Overwhelmed by technical jargon and complex configurations - Limited time to learn hosting management - Fear of making mistakes that could break website - Budget constraints requiring cost-effective solutions - Needs 24/7 uptime for e-commerce transactions

Usage Patterns - Logs in 2-3 times per week - Primary tasks: View site status, check analytics, manage domains - Mobile usage: 40% of sessions - Peak activity: Evenings and weekends

Success Metrics - Dashboard task completion in <3 clicks - Zero technical support tickets for basic operations - 95% task success rate without documentation


P2: Web Agency Developer (Marcus)

Demographics - Role: Web Developer/Agency Owner - Technical Skill: Advanced - Age Range: 25-40 - Clients Managed: 20-100 WordPress sites - Team Size: 3-15 developers

Goals - Efficiently manage multiple client WordPress sites from single dashboard - Automate repetitive tasks (backups, updates, staging environments) - White-label solution for client presentation - Rapid deployment and cloning capabilities - API access for custom integrations and automation

Pain Points - Time wasted switching between multiple hosting control panels - Manual processes for routine maintenance across dozens of sites - Lack of bulk operations for multi-site management - Insufficient API documentation for automation - Need for staging/production environment workflows

Usage Patterns - Logs in daily, multiple times - Primary tasks: Bulk operations, API integrations, staging deployments - Desktop usage: 85% of sessions - Peak activity: Business hours (9am-6pm) - Heavy API usage for automation

Success Metrics - Bulk operations support for 50+ sites simultaneously - API response time <100ms for automation workflows - 80% reduction in manual task time vs. competitors


P3: Enterprise IT Administrator (Jennifer)

Demographics - Role: Senior IT Administrator/DevOps Lead - Technical Skill: Expert - Age Range: 30-50 - Organization Size: 500-10,000+ employees - WordPress Instances: 100-1,000+ sites

Goals - Enterprise-grade security, compliance, and governance - Centralized management with role-based access control (RBAC) - Advanced monitoring, alerting, and SLA tracking - Integration with existing enterprise tools (SSO, SIEM, ITSM) - Detailed audit logs and compliance reporting (SOC 2, GDPR, HIPAA)

Pain Points - Lack of enterprise SSO integration (SAML, OAuth, LDAP) - Insufficient audit trails for security compliance - No granular permission controls for team management - Missing integration with enterprise monitoring (DataDog, New Relic) - Inadequate disaster recovery and multi-region failover

Usage Patterns - Logs in multiple times daily - Primary tasks: Security monitoring, compliance reporting, team management - Desktop only: 100% of sessions - Peak activity: Continuous monitoring with automated alerts - Extensive use of API for integrations

Success Metrics - 100% audit trail coverage for all operations - SSO integration with <500ms authentication time - RBAC with 20+ custom role configurations - 99.99% SLA compliance with automated failover


P4: WordPress Developer/Blogger (Alex)

Demographics - Role: Content Creator/Independent Developer - Technical Skill: Intermediate to Advanced - Age Range: 22-35 - Sites Managed: 1-5 personal/client sites - Income Source: Blogging, freelance development

Goals - Easy WordPress installation and theme/plugin testing - Quick staging environment for development - Performance optimization tools for SEO - Cost-effective hosting for personal projects - Simple backup and restore for experimentation

Pain Points - Expensive hosting for multiple test/development sites - Slow deployment process for updates - Lack of performance insights for optimization - Complex staging environment setup - Fear of breaking production while testing

Usage Patterns - Logs in 3-5 times per week - Primary tasks: Deploy updates, check performance, manage backups - Mixed device usage: 60% desktop, 40% mobile - Peak activity: Evenings and weekends - Experimentation-heavy (frequent staging usage)

Success Metrics - One-click staging environment creation - Performance insights dashboard with actionable recommendations - <5 minute deployment time for WordPress updates


P5: SaaS Platform Administrator (David)

Demographics - Role: Platform Operations Manager - Technical Skill: Expert - Age Range: 28-45 - Organization: Multi-tenant SaaS company - WordPress Instances: 10,000-1,000,000+ tenant sites

Goals - Massive scale operations with automated provisioning - Per-tenant resource isolation and management - Real-time monitoring for millions of concurrent users - API-first architecture for full platform automation - Cost optimization with granular resource tracking

Pain Points - Inability to provision thousands of WordPress instances simultaneously - Lack of per-tenant resource monitoring and billing integration - Insufficient horizontal scaling capabilities - Missing webhooks for event-driven automation - No programmatic control over Jelastic environments at scale

Usage Patterns - 100% API-driven interactions - Continuous monitoring via webhooks and event streams - Automated provisioning triggered by customer signups - Dashboard usage: Primarily for high-level analytics - Peak activity: 24/7 automated operations

Success Metrics - Provision 1,000+ WordPress instances per hour - API uptime: 99.99% with <50ms response times - Zero-downtime scaling events - Real-time webhook delivery (<1 second latency)


4.2 Secondary Personas

P6: Non-Profit Organization Manager (Lisa)

Demographics - Role: Non-Profit Communications Director - Technical Skill: Beginner - Organization Size: 5-50 staff + volunteers - Budget: Highly constrained

Goals - Reliable hosting for fundraising campaigns - Simple content updates without developer dependency - Donation processing uptime during critical campaigns - Cost-effective solutions within limited budget

Pain Points - Limited technical support budget - Website breaks during high-traffic fundraising events - Difficulty understanding hosting bills - Dependency on volunteer IT help


P7: E-commerce Store Owner (Raj)

Demographics - Role: Online Retailer - Technical Skill: Intermediate - Business Type: WooCommerce-based store - Annual Sales: $500K-$5M

Goals - 99.99% uptime during peak shopping seasons - Fast page load times for conversion optimization - Easy scaling during Black Friday/Cyber Monday - PCI compliance for payment processing - Integrated CDN for global customers

Pain Points - Site crashes during traffic spikes (lost revenue) - Slow checkout process reducing conversions - Lack of real-time performance monitoring - Difficulty understanding security compliance requirements


4.3 Anti-Personas (Users We Don't Serve)

AP1: Enterprise Custom CMS Users

  • Organizations requiring non-WordPress CMS solutions
  • Custom enterprise applications not based on WordPress
  • Users needing bare-metal server access

AP2: Free Tier Seekers

  • Users expecting completely free hosting indefinitely
  • Users unwilling to pay for enterprise-grade reliability
  • Hobby projects with no growth trajectory

4.4 Persona-Driven Design Principles

Based on our persona analysis, the dashboard must:

  1. Progressive Disclosure: Simple interface for beginners (Sarah, Lisa) with advanced features accessible to experts (Marcus, Jennifer, David)

  2. Multi-Tenancy at Scale: Support individual sites (Sarah, Alex) through millions of sites (David) with the same architecture

  3. API-First Design: Every dashboard feature must have API equivalent for automation (Marcus, Jennifer, David)

  4. Role-Based Views: Customize dashboard complexity based on user role and technical expertise

  5. Performance Transparency: Real-time metrics for all personas with appropriate detail levels

  6. Mobile-First for Basic Operations: 40% of SMB users access via mobile (Sarah, Alex)

  7. Compliance Built-In: Enterprise personas (Jennifer, David) require built-in compliance frameworks

  8. Cost Visibility: All personas need clear resource usage and cost breakdowns


5. User Stories

This section documents comprehensive user stories for each persona, covering both success and failure scenarios. Each story follows the format: As a [persona], I want to [action], so that [benefit].

5.1 Small Business Owner (Sarah) - User Stories

Success Scenarios

US-001: Quick WordPress Site Launch - Story: As Sarah, I want to launch a WordPress site in under 5 minutes without technical knowledge, so that I can quickly establish my online presence - Acceptance Criteria: - One-click WordPress installation from dashboard - Pre-configured templates for common business types - Automatic SSL certificate provisioning - Default security settings applied automatically - Success confirmation with site URL and login credentials - Priority: P0 (Critical)

US-002: Visual Site Health Monitoring - Story: As Sarah, I want to see my website's health status at a glance, so that I know if there are any issues without understanding technical details - Acceptance Criteria: - Dashboard shows clear "Healthy" or "Needs Attention" status - Visual indicators (green/yellow/red) for uptime, performance, security - Plain-language explanations for any issues - Recommended actions in non-technical terms - Mobile-friendly view - Priority: P0 (Critical)

US-003: Simple Domain Management - Story: As Sarah, I want to connect my custom domain with simple instructions, so that my business has a professional web address - Acceptance Criteria: - Step-by-step domain connection wizard - Visual DNS configuration guide with screenshots - Automatic DNS verification - Email notifications when domain is live - Support for common domain registrars - Priority: P1 (High)

US-004: Easy Backup and Restore - Story: As Sarah, I want to restore my website if something goes wrong, so that I can recover from mistakes without losing my business data - Acceptance Criteria: - One-click restore to previous backup point - Calendar view of available backup dates - Preview of backup contents before restore - Confirmation dialog with plain-language warnings - Restore completion notification - Priority: P0 (Critical)

Failure Scenarios

US-005: Site Outage During Peak Hours - Story: As Sarah, I want to be immediately notified if my site goes down, so that I can minimize lost sales and customer frustration - Acceptance Criteria: - SMS/email alerts within 60 seconds of outage - Plain-language explanation of issue - Estimated recovery time - Option to contact emergency support - Automatic recovery attempts before alerting - Priority: P0 (Critical) - Failure Handling: Automated failover, rollback capabilities, 24/7 support escalation

US-006: Accidental Plugin Breaking Site - Story: As Sarah, I want the system to prevent broken plugins from taking my site offline, so that I don't lose customers due to technical mistakes - Acceptance Criteria: - Automatic site backup before plugin installations - Plugin compatibility checks with current WordPress version - Sandbox testing environment for plugins (optional) - One-click rollback if site becomes inaccessible - Error detection with automatic plugin deactivation - Priority: P1 (High) - Failure Handling: Automatic rollback, safe mode boot, plugin quarantine


5.2 Web Agency Developer (Marcus) - User Stories

Success Scenarios

US-010: Bulk Site Operations - Story: As Marcus, I want to update plugins across 50 client sites simultaneously, so that I can maintain all sites efficiently without manual repetition - Acceptance Criteria: - Multi-select sites from dashboard - Bulk actions: update plugins, update themes, update WordPress core - Progress tracking for bulk operations - Detailed results with success/failure per site - Rollback option for failed updates - Priority: P0 (Critical)

US-011: Automated Staging Environments - Story: As Marcus, I want to create staging environments with one click, so that I can test changes safely before deploying to client production sites - Acceptance Criteria: - Clone production to staging in <5 minutes - Separate subdomain for staging environment - Sync production data to staging on-demand - Push staging changes to production with approval workflow - Automatic staging environment cleanup after 30 days - Priority: P0 (Critical)

US-012: API-Driven Site Provisioning - Story: As Marcus, I want to provision new client sites via API, so that I can integrate WordPress deployment into my agency's onboarding workflow - Acceptance Criteria: - REST API endpoint for site creation - Configurable parameters: WordPress version, PHP version, plugins, theme - Webhook notifications when site is ready - API key-based authentication - Rate limiting: 100 sites per hour - Priority: P1 (High)

US-013: White-Label Client Portal - Story: As Marcus, I want to provide clients access to their sites under my agency branding, so that I can deliver a professional client experience - Acceptance Criteria: - Custom domain for client portal (e.g., portal.agency.com) - Agency logo and color scheme customization - Hide MBPanel branding from client-facing views - Granular permission controls per client - Client-specific limited feature access - Priority: P2 (Medium)

Failure Scenarios

US-014: Bulk Update Failures - Story: As Marcus, I want detailed failure reports when bulk updates fail, so that I can quickly identify and fix problematic sites - Acceptance Criteria: - Failed update list with specific error messages - Ability to retry failed updates individually - Automatic backup before each update attempt - Email digest of failures with recommended actions - Integration with ticketing system for client notifications - Priority: P1 (High) - Failure Handling: Atomic rollback per site, error categorization, automated retry logic

US-015: API Rate Limit Exceeded - Story: As Marcus, I want clear feedback when hitting API rate limits, so that I can adjust my automation scripts without breaking workflows - Acceptance Criteria: - HTTP 429 response with retry-after header - Dashboard showing current API usage vs. limits - Option to request temporary rate limit increase - Webhook notifications before hitting 80% of limit - Clear documentation on rate limit policies - Priority: P2 (Medium) - Failure Handling: Gradual backoff, request queuing, burst allowance
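The gradual-backoff behavior described in US-015 can be sketched as a small client-side helper. This is an illustrative sketch, not the shipped client: the cap and jitter strategy are assumptions, and the `Retry-After` value would come from the HTTP 429 response header.

```python
import random
from typing import Optional

def compute_backoff(attempt: int, retry_after: Optional[float], cap: float = 60.0) -> float:
    """Seconds to wait before retrying a rate-limited request: honor the
    server's Retry-After header when present, otherwise fall back to
    exponential backoff with jitter."""
    if retry_after is not None:
        return min(retry_after, cap)
    base = min(2 ** attempt, cap)                # 1, 2, 4, 8, ... seconds, capped
    return base * (0.5 + random.random() / 2)    # jitter avoids thundering herds
```

Honoring `Retry-After` first keeps automation scripts aligned with the server's view of the limit; the jittered fallback covers responses that omit the header.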


5.3 Enterprise IT Administrator (Jennifer) - User Stories

Success Scenarios

US-020: SSO Integration - Story: As Jennifer, I want to integrate MBPanel with our corporate SSO (SAML), so that employees can access the dashboard using existing company credentials - Acceptance Criteria: - SAML 2.0 protocol support - Support for major identity providers (Okta, Azure AD, Auth0) - Just-in-time (JIT) user provisioning - SSO login time <500ms - Fallback to local authentication if SSO unavailable - Priority: P0 (Critical)

US-021: Comprehensive Audit Logging - Story: As Jennifer, I want complete audit trails of all user actions, so that I can meet SOC 2 compliance requirements and investigate security incidents - Acceptance Criteria: - Log all CRUD operations with user, timestamp, IP address, action details - Searchable audit log with advanced filters - Export audit logs in CSV/JSON formats - Retention: 12 months minimum - Tamper-proof logging with cryptographic verification - Priority: P0 (Critical)
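One common way to satisfy the "tamper-proof logging with cryptographic verification" criterion is a hash chain, where each audit record commits to its predecessor's hash so any edit or reordering breaks verification. A minimal sketch, not the production schema:

```python
import hashlib
import json

def chain_entry(prev_hash: str, entry: dict) -> dict:
    """Build an append-only audit record that commits to its predecessor."""
    payload = json.dumps(entry, sort_keys=True)
    digest = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    return {**entry, "prev_hash": prev_hash, "hash": digest}

def verify_chain(records: list[dict]) -> bool:
    """Recompute every link; a modified or reordered record breaks the chain."""
    prev = "0" * 64  # genesis hash
    for rec in records:
        body = {k: v for k, v in rec.items() if k not in ("prev_hash", "hash")}
        payload = json.dumps(body, sort_keys=True)
        expected = hashlib.sha256((prev + payload).encode()).hexdigest()
        if rec["prev_hash"] != prev or rec["hash"] != expected:
            return False
        prev = rec["hash"]
    return True
```

In production the chain head would be anchored externally (e.g. written to WORM storage) so an attacker cannot simply rebuild the whole chain.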

US-022: Granular RBAC - Story: As Jennifer, I want to create custom roles with specific permissions, so that I can enforce least-privilege access across 100+ team members - Acceptance Criteria: - Create unlimited custom roles - 50+ granular permissions (e.g., "view sites", "deploy staging", "delete sites") - Role inheritance and composition - Bulk user role assignment - Permission templates for common scenarios - Priority: P0 (Critical)

US-023: Compliance Dashboard - Story: As Jennifer, I want a compliance dashboard showing GDPR/SOC 2 readiness, so that I can prepare for audits efficiently - Acceptance Criteria: - Real-time compliance score (percentage) - Checklist of compliance requirements with status - Data residency controls (EU/US/Asia regions) - Data retention policy enforcement - Automated compliance reports (PDF/HTML) - Priority: P1 (High)

Failure Scenarios

US-024: SSO Provider Outage - Story: As Jennifer, I want emergency access when our SSO provider is down, so that critical operations can continue during corporate authentication failures - Acceptance Criteria: - Break-glass emergency access for super admins - Multi-factor authentication for emergency access - Audit log entries for all emergency access usage - Automatic notification to security team - Time-limited emergency sessions (4 hours max) - Priority: P0 (Critical) - Failure Handling: Local credential fallback, emergency admin accounts, security team alerts

US-025: Compliance Violation Detection - Story: As Jennifer, I want immediate alerts when compliance violations occur, so that I can remediate issues before audit failures - Acceptance Criteria: - Real-time violation detection (e.g., PII exposure, retention policy breach) - Severity classification (Critical/High/Medium/Low) - Automated remediation workflows where possible - Integration with SIEM tools (Splunk, QRadar) - Incident response playbooks - Priority: P0 (Critical) - Failure Handling: Automatic quarantine, rollback capabilities, incident escalation


5.4 WordPress Developer/Blogger (Alex) - User Stories

Success Scenarios

US-030: One-Click Staging - Story: As Alex, I want to create a staging copy of my blog instantly, so that I can test new plugins without risking my live site - Acceptance Criteria: - Staging environment created in <3 minutes - Isolated subdomain (e.g., staging.myblog.com) - Search engine indexing blocked on staging - Easy data sync from production to staging - Push changes from staging to production option - Priority: P1 (High)

US-031: Performance Optimization Insights - Story: As Alex, I want actionable performance recommendations, so that I can improve my blog's SEO rankings and user experience - Acceptance Criteria: - Page load time metrics (Core Web Vitals) - Database query analysis with slow query detection - Image optimization recommendations - Plugin performance impact ranking - One-click fixes for common issues - Priority: P1 (High)

US-032: Git Integration for Themes/Plugins - Story: As Alex, I want to deploy theme changes from GitHub, so that I can use version control for my development workflow - Acceptance Criteria: - Connect GitHub/GitLab/Bitbucket repositories - Webhook-triggered deployments on git push - Deploy to staging or production environments - Rollback to previous git commits - Environment-specific configuration files - Priority: P2 (Medium)

Failure Scenarios

US-033: Plugin Conflict Detection - Story: As Alex, I want to be warned about plugin conflicts before activation, so that I don't crash my site when experimenting - Acceptance Criteria: - Known conflict database check before plugin install - Compatibility matrix display for active plugins - Sandbox testing option before production activation - Automatic safe mode if site becomes inaccessible - Conflict resolution recommendations - Priority: P1 (High) - Failure Handling: Automatic plugin deactivation, safe mode boot, rollback option


5.5 SaaS Platform Administrator (David) - User Stories

Success Scenarios

US-040: Mass Provisioning API - Story: As David, I want to provision 10,000 WordPress sites per day via API, so that I can onboard new SaaS customers automatically at scale - Acceptance Criteria: - Batch provisioning endpoint accepting 1,000 sites per request - Asynchronous processing with webhook completion notifications - Provisioning time: <5 minutes per site - Idempotent operations (safe retries) - Rate limit: 10,000 sites per hour - Priority: P0 (Critical)
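The "idempotent operations (safe retries)" criterion is typically met with a client-supplied idempotency key: a retried batch returns the stored result instead of provisioning twice. An in-memory sketch with hypothetical names; in production the key-to-result map would live in PostgreSQL or Redis:

```python
class IdempotentProvisioner:
    """Replays a retried batch request instead of provisioning duplicates."""

    def __init__(self):
        self._results: dict[str, list[str]] = {}  # idempotency key -> site IDs
        self._counter = 0

    def provision_batch(self, idempotency_key: str, site_names: list[str]) -> list[str]:
        if idempotency_key in self._results:   # retry: return the stored result
            return self._results[idempotency_key]
        site_ids = []
        for name in site_names:                # first attempt: provision each site
            self._counter += 1
            site_ids.append(f"site-{self._counter}-{name}")
        self._results[idempotency_key] = site_ids
        return site_ids
```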

US-041: Per-Tenant Resource Metering - Story: As David, I want granular resource usage data per WordPress instance, so that I can implement accurate usage-based billing for my SaaS customers - Acceptance Criteria: - Real-time metrics: CPU, memory, disk, bandwidth per site - Hourly/daily/monthly aggregation - Export to billing systems via API (Stripe, Chargebee) - Cost allocation tags for multi-tenant accounting - Resource usage alerts per tenant - Priority: P0 (Critical)

US-042: Programmatic Jelastic Environment Control - Story: As David, I want full API control over Jelastic environments, so that I can automate scaling, backups, and configuration for millions of tenant sites - Acceptance Criteria: - API endpoints mirroring all Jelastic capabilities - Bulk operations: scale, restart, configure - Environment templates for rapid provisioning - Webhook events for all environment changes - Sub-100ms API response times - Priority: P0 (Critical)

US-043: Zero-Downtime Auto-Scaling - Story: As David, I want automatic horizontal scaling during traffic spikes, so that my SaaS platform maintains performance during viral content events - Acceptance Criteria: - Cloudlet auto-scaling based on CPU/memory thresholds - Scale up in <60 seconds - Scale down after 10-minute cooldown period - No connection drops during scaling events - Configurable scaling policies per environment - Priority: P0 (Critical)

Failure Scenarios

US-044: Jelastic API Outage - Story: As David, I want graceful degradation when Jelastic API is unavailable, so that my SaaS dashboard remains operational during upstream outages - Acceptance Criteria: - Circuit breaker pattern with 5-second timeout - Cached data display with staleness indicator - Read-only mode during API outages - Automatic retry with exponential backoff - Status page integration showing Jelastic health - Priority: P0 (Critical) - Failure Handling: Circuit breaker, cached data, retry queue, health monitoring
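The circuit breaker pattern from US-044 can be sketched as follows. The failure threshold and reset window here are illustrative values, and the upstream Jelastic call is stubbed out; the real implementation would serve cached data while the circuit is open.

```python
import time
from typing import Optional

class CircuitBreaker:
    """Open after `max_failures` consecutive errors; while open, fail fast
    (so callers can fall back to cached data) until `reset_after` elapses."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: serve cached data")
            self.opened_at = None          # half-open: allow one probe request
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0                  # any success closes the circuit
        return result
```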

US-045: Mass Provisioning Partial Failure - Story: As David, I want detailed failure tracking when batch provisioning partially fails, so that I can retry only failed sites and maintain SaaS onboarding SLAs - Acceptance Criteria: - Per-site success/failure status in batch response - Failed sites queued for automatic retry (3 attempts) - Error categorization: transient vs. permanent failures - Webhook notifications for batch completion with failure details - Reconciliation API to verify provisioning state - Priority: P0 (Critical) - Failure Handling: Automatic retry with backoff, dead letter queue, manual reconciliation API


5.6 E-commerce Store Owner (Raj) - User Stories

Success Scenarios

US-050: Traffic Spike Auto-Scaling - Story: As Raj, I want my site to automatically scale during Black Friday sales, so that I don't lose revenue due to site crashes - Acceptance Criteria: - Automatic detection of traffic spikes (>200% baseline) - Instant cloudlet scaling up to 10x baseline capacity - No downtime during scaling events - Email notification when scaling occurs - Automatic scale-down after traffic normalizes - Priority: P0 (Critical)

US-051: Real-Time Performance Monitoring - Story: As Raj, I want real-time checkout page performance metrics, so that I can optimize conversion rates during peak shopping periods - Acceptance Criteria: - Live dashboard showing checkout page load times - Transaction success rate monitoring - WooCommerce-specific metrics (cart abandonment, payment gateway latency) - Alerts when performance degrades below thresholds - Historical comparison (vs. last Black Friday) - Priority: P1 (High)

Failure Scenarios

US-052: Payment Gateway Timeout - Story: As Raj, I want detailed logs when payment processing fails, so that I can troubleshoot lost sales and customer complaints - Acceptance Criteria: - Payment gateway API call logs with timestamps - Timeout/error categorization - Customer email/order ID correlation - Integration with WooCommerce order notes - Automatic retry recommendations - Priority: P0 (Critical) - Failure Handling: Automatic retry logic, detailed error logs, support ticket integration


5.7 User Story Summary Matrix

| Persona               | Critical (P0) | High (P1) | Medium (P2) | Total Stories |
|-----------------------|---------------|-----------|-------------|---------------|
| Sarah (SMB Owner)     | 3             | 2         | 1           | 6             |
| Marcus (Agency)       | 2             | 3         | 2           | 7             |
| Jennifer (Enterprise) | 5             | 1         | 0           | 6             |
| Alex (Developer)      | 0             | 4         | 1           | 5             |
| David (SaaS Platform) | 5             | 0         | 0           | 5             |
| Raj (E-commerce)      | 2             | 1         | 0           | 3             |
| TOTAL                 | 17            | 11        | 4           | 32            |

5.8 Cross-Persona User Journeys

Journey 1: Site Launch to Production

  1. Sarah creates account and provisions WordPress site (US-001)
  2. Alex helps Sarah with theme customization via staging (US-030)
  3. Marcus (hired as consultant) sets up performance optimization (US-011)
  4. Site goes live with monitoring alerts configured (US-002)
  5. Jennifer (if enterprise) integrates with corporate SSO for team access (US-020)

Journey 2: Agency Multi-Site Management

  1. Marcus provisions 50 client sites via API (US-012)
  2. Marcus configures white-label portal for clients (US-013)
  3. Bulk plugin updates across all sites (US-010)
  4. Client (Sarah) logs into white-label portal to view her site (US-002)
  5. Marcus handles bulk update failures for specific sites (US-014)

Journey 3: Enterprise Compliance Onboarding

  1. Jennifer integrates SSO with corporate Azure AD (US-020)
  2. Jennifer configures RBAC for 100+ team members (US-022)
  3. Jennifer runs compliance audit (US-023)
  4. Jennifer sets up audit logging for SOC 2 requirement (US-021)
  5. Jennifer handles compliance violations proactively (US-025)

Journey 4: SaaS Platform Scaling

  1. David provisions 10,000 tenant sites via batch API (US-040)
  2. David implements per-tenant resource metering for billing (US-041)
  3. Traffic spike triggers auto-scaling for high-traffic tenants (US-043)
  4. David handles Jelastic API outage gracefully (US-044)
  5. David reconciles partial provisioning failures (US-045)

6. Technical Requirements

6.1 Performance Requirements

  • API Response Times: <200ms for 95th percentile (50-70% improvement over current)
  • Concurrent Users: Support 5x more concurrent users than current system
  • Database Queries: <50ms average query time for common operations
  • Page Load Times: <2.5 seconds for dashboard pages
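The "95th percentile" targets above are measured over a window of latency samples; a minimal sketch using the nearest-rank method (the actual SLO tooling would compute this from metrics storage, e.g. Prometheus histograms):

```python
import math

def p95(samples_ms: list[float]) -> float:
    """95th-percentile latency over a sample window (nearest-rank method)."""
    ranked = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ranked))   # index (1-based) of the p95 sample
    return ranked[rank - 1]
```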

6.2 Scalability Requirements and Auto-Scaling Policies

Horizontal Scaling: Support for multiple instances behind load balancer (10+ backend instances, 5+ frontend instances)

6.2.1 Auto-Scaling Policies

Backend (FastAPI) Auto-Scaling:

| Metric              | Scale Up Threshold     | Scale Down Threshold   | Cooldown | Min Instances | Max Instances |
|---------------------|------------------------|------------------------|----------|---------------|---------------|
| CPU Utilization     | >70% for 3 min         | <30% for 10 min        | 5 min    | 5             | 50            |
| Memory Utilization  | >80% for 3 min         | <40% for 10 min        | 5 min    | 5             | 50            |
| Request Count       | >5000 req/instance/min | <1000 req/instance/min | 5 min    | 5             | 50            |
| Response Time (p95) | >200ms for 5 min       | <100ms for 15 min      | 10 min   | 5             | 50            |

Kubernetes HPA Configuration:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: fastapi-backend-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fastapi-backend
  minReplicas: 5
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "5000"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600  # 10 min
      policies:
      - type: Percent
        value: 50  # Scale down max 50% at a time
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0  # Immediate scale up
      policies:
      - type: Percent
        value: 100  # Double capacity if needed
        periodSeconds: 60
      - type: Pods
        value: 10  # Or add 10 pods, whichever is higher
        periodSeconds: 60
      selectPolicy: Max

Frontend (Next.js) Auto-Scaling:

| Metric             | Scale Up Threshold | Scale Down Threshold | Cooldown | Min Instances | Max Instances |
|--------------------|--------------------|----------------------|----------|---------------|---------------|
| CPU Utilization    | >60% for 3 min     | <20% for 10 min      | 5 min    | 3             | 20            |
| Memory Utilization | >75% for 3 min     | <30% for 10 min      | 5 min    | 3             | 20            |
| Active Connections | >500/instance      | <100/instance        | 5 min    | 3             | 20            |

Predictive Scaling:

  • Machine learning model predicts traffic patterns
  • Pre-scale 15 minutes before anticipated traffic spikes
  • Based on historical data (daily/weekly/seasonal patterns)
  • Example: Scale up at 8 AM UTC (business hours), scale down at 6 PM UTC

# Predictive scaling implementation (ml_model, get_current_instance_count,
# and scale_instances are application-level service helpers)
import math
from datetime import datetime, timezone

async def predictive_scaling():
    """Pre-emptive scaling based on historical traffic patterns."""
    now = datetime.now(timezone.utc)
    current_hour = now.hour
    current_day = now.weekday()

    # Forecast traffic (requests per second) from historical patterns
    traffic_forecast = await ml_model.predict_traffic(
        hour=current_hour,
        day_of_week=current_day,
        lookback_weeks=4
    )

    # Each instance handles ~5k req/s, so round up to the required count
    required_instances = math.ceil(traffic_forecast / 5000)

    # Pre-scale only if the forecast shows a >30% increase over current capacity
    current_instances = await get_current_instance_count()
    if required_instances > current_instances * 1.3:
        await scale_instances(required_instances, reason="PREDICTIVE")

Database (Citus) Auto-Scaling:

  • Worker Nodes: Add workers when shard size >500GB (see Section 7.7.3)
  • Read Replicas: Auto-scale replicas based on read query load
      • Scale up: Read query latency p95 >50ms for 5 minutes
      • Scale down: Read query latency p95 <20ms for 20 minutes
      • Min read replicas: 2 per shard
      • Max read replicas: 5 per shard
  • Connection Pooling (PgBouncer):
      • Supports 10,000+ concurrent connections
      • Auto-scale pool size based on active connections
      • Pool exhaustion alerts when >90% utilization
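The read-replica rules above reduce to a simple decision function. The thresholds and replica bounds are taken from the policy; the function name and signature are illustrative (the real implementation would also enforce the 5-minute and 20-minute sustain windows):

```python
def replica_scaling_action(p95_read_latency_ms: float, replicas: int,
                           min_replicas: int = 2, max_replicas: int = 5) -> str:
    """Map the p95 read-latency thresholds from the policy to an action."""
    if p95_read_latency_ms > 50 and replicas < max_replicas:
        return "scale_up"      # policy: sustained >50ms for 5 minutes
    if p95_read_latency_ms < 20 and replicas > min_replicas:
        return "scale_down"    # policy: sustained <20ms for 20 minutes
    return "hold"
```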

Redis Cluster Auto-Scaling:

  • Memory-based: Add nodes when memory >80% for 5 minutes
  • Eviction-based: Scale up if eviction rate >100 keys/min
  • Min nodes: 3 (1 master + 2 replicas)
  • Max nodes: 12 (4 masters + 8 replicas)

Jelastic Auto-Scaling (via API):

# Programmatic Jelastic environment scaling
import httpx

async def scale_jelastic_environment(env_id: str, cloudlets: int):
    """Scale a Jelastic environment's cloudlets (1 cloudlet = 128 MB RAM)."""
    # JELASTIC_API_URL and jelastic_session come from application config/auth.
    async with httpx.AsyncClient() as client:
        response = await client.post(
            f"{JELASTIC_API_URL}/environment/control/rest/changetopology",
            json={
                "envName": env_id,
                "session": jelastic_session,
                "nodes": [{
                    "cloudlets": cloudlets,                            # Target cloudlets
                    "fixedCloudlets": cloudlets // 4,                  # Reserved cloudlets
                    "flexibleCloudlets": cloudlets - (cloudlets // 4)  # Burstable cloudlets
                }]
            }
        )
        response.raise_for_status()
    await log_scaling_event(env_id, cloudlets, "JELASTIC_SCALE")

Scaling Triggers for Jelastic:
  • CPU >80% for 5 min → Increase cloudlets by 25%
  • Memory >85% for 5 min → Increase cloudlets by 25%
  • Disk I/O >70% for 10 min → Increase cloudlets or add node
  • CPU <30% for 20 min → Decrease cloudlets by 25%
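These triggers reduce to a small rule table. A hedged sketch that maps one metric sample to a new cloudlet target (the sustained 5/10/20-minute windows are omitted for brevity; a real implementation would gate on metric history):

```python
def next_cloudlet_count(current: int, cpu: float, memory: float) -> int:
    """Apply the ±25% cloudlet rules from the trigger list above.

    Disk I/O handling and duration windows are left out to keep the
    sketch minimal; names and rounding are illustrative.
    """
    if cpu > 0.80 or memory > 0.85:
        return max(current + 1, round(current * 1.25))  # scale up by 25%
    if cpu < 0.30:
        return max(1, round(current * 0.75))            # scale down by 25%
    return current
```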

6.2.2 Load Balancing Strategy

  • Algorithm: Weighted round-robin with health checks
  • Health Check Interval: Every 10 seconds
  • Health Check Endpoint: GET /health (returns 200 if healthy)
  • Unhealthy Threshold: 3 consecutive failures
  • Drain Timeout: 30 seconds (graceful shutdown for in-flight requests)
  • WebSocket Sticky Sessions: Enabled (session affinity via cookie)
  • SSL Termination: At load balancer (TLS 1.3)
  • Connection Timeout: 60 seconds
  • Keep-Alive: 120 seconds

Load Balancer Configuration (AWS ALB):

TargetGroup:
  Protocol: HTTP
  Port: 8000
  HealthCheck:
    Protocol: HTTP
    Path: /health
    Interval: 10
    Timeout: 5
    HealthyThreshold: 2
    UnhealthyThreshold: 3
  TargetGroupAttributes:
    - Key: deregistration_delay.timeout_seconds
      Value: '30'
    - Key: stickiness.enabled
      Value: 'true'
    - Key: stickiness.type
      Value: 'lb_cookie'
    - Key: stickiness.lb_cookie.duration_seconds
      Value: '86400'  # 24 hours for WebSocket

6.2.3 Caching Strategy

  • Multi-tier caching:
    • L1: Application-level in-memory cache (per-instance, TTL: 60s)
      • Environment metadata, user sessions
      • 128 MB per instance, LRU eviction
    • L2: Redis cluster (distributed caching)
      • API responses, computed data, Jelastic API cache
      • TTL: 5 minutes for API responses, 15 minutes for computed data
      • Eviction policy: allkeys-lru
    • L3: CDN edge caching (CloudFlare/CloudFront)
      • Static assets (JS, CSS, images)
      • TTL: 1 year with cache-busting via versioned URLs
Cache Invalidation:
  • Event-driven invalidation via Redis Pub/Sub
  • Webhook-triggered invalidation for Jelastic environment changes
  • Manual purge API: POST /api/v1/cache/purge

CDN Integration:
  • Provider: CloudFlare (primary) or CloudFront (fallback)
  • Static assets served via CDN with automatic image optimization
  • Lazy loading for images, SVG sprites for icons
  • Brotli compression for text assets

6.2.4 Target Capacity and Performance Benchmarks

  • Concurrent Users: 1M+ (sustained), 5M+ (burst)
  • Requests per Second: 10,000+ (sustained), 50,000+ (burst)
  • API Response Time: <200ms p95, <500ms p99
  • Database Query Time: <50ms p95 for common operations
  • Cache Hit Rate: >80% for Redis, >95% for CDN
  • WebSocket Connections: 500K+ concurrent connections
  • Throughput: 1 Gbps sustained, 10 Gbps burst

6.3 Security Requirements

  • Authentication: JWT-based with configurable token expiration
  • Authorization: Role-based access control (RBAC) system
  • Data Protection: Encryption at rest and in transit
  • API Security: Rate limiting, input validation, SQL injection protection
  • Compliance: Industry security standards compliance
  • Mutual TLS (mTLS): Service-to-service authentication for internal traffic
  • Field-level Encryption: PII protection with KMS-managed keys and rotation
  • Secrets Management: Centralized vault (e.g., HashiCorp Vault) is planned; today secrets live only in per-environment .env.backend.local copies and container environment variables. This document tracks that gap so future work can introduce a proper secrets manager.
  • Security Headers: CSP, HSTS, X-Frame-Options, X-Content-Type-Options
  • Token Management: Refresh token rotation and revocation lists
  • Account Security: Account lockout, IP/device throttling, anomaly detection
  • Data Handling: Data masking in non-production; structured redaction in logs
  • API Versioning: Explicit versioning with deprecation policy and sunset headers
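As an illustration of the rate-limiting requirement above, a minimal token-bucket limiter (a sketch only; this per-process version would not enforce limits across instances, so a production limiter would keep its buckets in Redis):

```python
class TokenBucket:
    """Minimal token-bucket rate limiter (one bucket per client/IP)."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self.tokens = float(capacity)
        self.last = 0.0               # timestamp of the previous check

    def allow(self, now: float) -> bool:
        """Refill based on elapsed time, then spend one token if available."""
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```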

6.4 Architecture Requirements

  • Modularity: High cohesion within modules, low coupling between modules
  • Maintainability: Clear separation of concerns within each domain module
  • Testability: Each module can be unit tested independently
  • Developer Productivity: Reduced cognitive load when working on specific features

6.5 High Availability Requirements (99.99% Uptime)

Target SLA: 99.99% availability for critical services (52.56 minutes downtime per year)

6.5.1 Availability Budget

SLA Level                Annual Downtime   Monthly Downtime   Weekly Downtime   Daily Downtime
99.9%                    8h 45m            43m 49s            10m 4s            1m 26s
99.99% (Target)          52m 35s           4m 23s             1m 0s             8.6s
99.999% (Aspirational)   5m 15s            26s                6s                0.9s
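The table values follow from downtime = (1 − SLA) × period; for example:

```python
# Reproducing the availability-budget table above.
def downtime_minutes(sla: float, days: float) -> float:
    """Allowed downtime in minutes over a period of `days` days."""
    return (1 - sla) * days * 24 * 60

annual = downtime_minutes(0.9999, 365)     # ≈ 52.6 minutes/year
monthly = downtime_minutes(0.9999, 30.44)  # ≈ 4.38 minutes/month (~4m 23s)
```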

Downtime Budget Allocation:
  • Planned Maintenance: 30% (15 minutes/month during low-traffic windows)
  • Unplanned Outages: 40% (allocated for unexpected failures)
  • Deployment Downtime: 20% (zero-downtime deployments)
  • Buffer: 10% (reserved for unforeseen incidents)

6.5.2 High Availability Architecture

1. Redundancy at Every Layer

Component            Redundancy Strategy                   Failover Time
Load Balancer        Multi-AZ AWS ALB (3 AZs)              Instant (automatic)
FastAPI Backend      Min 5 instances across 3 AZs          <10s (health check + drain)
Next.js Frontend     Min 3 instances across 3 AZs          <10s (health check + drain)
PostgreSQL (Citus)   Primary + 2 sync replicas per shard   <30s (Patroni auto-failover)
Redis                3-node Sentinel cluster               <5s (Sentinel failover)
Object Storage       S3 multi-AZ (11 9's durability)       Instant (AWS-managed)

2. Multi-AZ Deployment Architecture

Region: us-east-1
├── AZ-1 (us-east-1a)
│   ├── FastAPI: 2 instances
│   ├── Next.js: 1 instance
│   ├── Citus Worker-1 (primary)
│   └── Redis Node-1 (master)
├── AZ-2 (us-east-1b)
│   ├── FastAPI: 2 instances
│   ├── Next.js: 1 instance
│   ├── Citus Worker-1 (sync standby)
│   └── Redis Node-2 (replica)
└── AZ-3 (us-east-1c)
    ├── FastAPI: 1 instance
    ├── Next.js: 1 instance
    ├── Citus Worker-1 (async standby)
    └── Redis Node-3 (replica + Sentinel quorum)

3. Zero-Downtime Deployment Strategy

Blue-Green Deployment for Backend:

# Deploy new version (green) alongside existing (blue)
kubectl apply -f deployment-green.yaml

# Wait for green to be healthy
kubectl wait --for=condition=ready pod -l version=green --timeout=300s

# Gradual traffic shift (10% → 50% → 100%): a plain Service selector is
# all-or-nothing, so weighted shifting is done at the routing layer
# (e.g. Istio VirtualService weights or Argo Rollouts canary steps).

# Monitor error rates for 5 minutes at each step
sleep 300

# If error rate <0.1%, cut 100% of traffic over to green
kubectl patch service fastapi-backend \
  -p '{"spec":{"selector":{"version":"green"}}}'

# Decommission blue after 24 hours
kubectl delete deployment fastapi-backend-blue

Rolling Updates for Frontend:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nextjs-frontend
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1  # Add 1 extra instance during update
      maxUnavailable: 0  # Never reduce below min capacity
  minReadySeconds: 30  # Wait 30s before marking ready

6.5.3 Disaster Recovery Targets

Recovery Time Objective (RTO):
  • Critical Services (API, Dashboard): <15 minutes
  • Database (Citus): <30 minutes (see Section 7.7.5)
  • Full Regional Failover: <15 minutes (see Section 7.8)

Recovery Point Objective (RPO):
  • Database: <5 minutes (synchronous replication + WAL archiving)
  • Redis Cache: <30 seconds (asynchronous replication, acceptable loss)
  • Object Storage (S3): <15 minutes (cross-region replication)

Backup Strategy:
  • Continuous WAL Archiving: Real-time to S3 (PITR capability)
  • Full Database Backups: Daily at 2 AM UTC
  • Incremental Backups: Hourly
  • Backup Retention: 30 days (hot), 1 year (cold/Glacier)
  • Backup Validation: Automated daily restore tests

6.5.4 Chaos Engineering and Resilience Testing

Chaos Experiments (Monthly):

Experiment                    Frequency   Expected Behavior
Kill random backend pod       Weekly      Auto-heal within 10s, no user impact
Terminate AZ                  Monthly     Traffic shifts to remaining AZs, <1s latency spike
Database failover             Monthly     Patroni promotes standby in <30s
Redis master failure          Monthly     Sentinel promotes replica in <5s
Network partition             Quarterly   Split-brain prevention, read-only mode
Overload test (10x traffic)   Quarterly   Auto-scaling handles load, <5% error rate

Chaos Engineering Tools:
  • Chaos Mesh: Kubernetes-native chaos experiments
  • Gremlin: Controlled chaos testing
  • Litmus: CRD-based chaos workflows

Example Chaos Experiment:

# Kill random FastAPI pod every 5 minutes
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-fastapi-pod
spec:
  action: pod-kill
  mode: one  # Kill one random pod
  selector:
    labelSelectors:
      "app": "fastapi-backend"
  scheduler:
    cron: "*/5 * * * *"  # Every 5 minutes

6.5.5 Monitoring and Alerting for Availability

Availability Metrics:

Metric                     Target   Alert Threshold           Action
Uptime (monthly)           99.99%   <99.95%                   Page on-call SRE
Error Rate (5xx)           <0.1%    >0.5% for 5 min           Auto-rollback deployment
Latency p95                <200ms   >500ms for 5 min          Scale up instances
Health Check Failures      0        >2 consecutive failures   Remove from load balancer
Database Replication Lag   <5s      >30s                      Alert DBA, prepare failover
Backup Success Rate        100%     <100%                     Immediate investigation

Monitoring Stack:
  • Uptime Monitoring: UptimeRobot (external), Prometheus (internal)
  • APM: Datadog or New Relic (distributed tracing)
  • Logs: Loki + Grafana (centralized logging)
  • Metrics: Prometheus + Grafana (time-series metrics)
  • Alerting: PagerDuty (incident management)
  • Status Page: StatusPage.io (public status updates)

SLO Dashboard:

# Grafana SLO Dashboard
Panels:
  - Availability (current month): 99.99% ✓
  - Error Budget Remaining: 3.2 minutes (73% remaining)
  - Mean Time To Recovery (MTTR): 8 minutes (target: <15 min)
  - Mean Time Between Failures (MTBF): 720 hours (target: >168 hours)
  - Incident Count (current month): 2 (target: <5)
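The "Error Budget Remaining" panel above derives from the monthly budget in Section 6.5.1 (illustrative sketch; `consumed_min` is a hypothetical reading from the uptime monitor):

```python
# Deriving the "Error Budget Remaining" dashboard value (illustrative inputs).
MONTHLY_BUDGET_MIN = 4.38   # 99.99% target over an average month (~4m 23s)
consumed_min = 1.18         # hypothetical downtime recorded this month

remaining_min = MONTHLY_BUDGET_MIN - consumed_min         # ≈ 3.2 minutes
remaining_pct = remaining_min / MONTHLY_BUDGET_MIN * 100  # ≈ 73% remaining
```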

6.5.6 Incident Response for Availability Violations

Severity Levels:

Severity           Definition                                     Response Time   Notification
SEV-1 (Critical)   Complete service outage, >50% users affected   <5 min          Page on-call + manager
SEV-2 (High)       Degraded performance, >10% users affected      <15 min         Page on-call
SEV-3 (Medium)     Limited impact, <10% users affected            <1 hour         Email on-call
SEV-4 (Low)        Minimal impact, single component               <4 hours        Ticket queue

Incident Response Runbook (SEV-1):

1. Detection (T+0)
   - Automated alert fires (PagerDuty)
   - On-call engineer acknowledges within 5 minutes

2. Triage (T+5 min)
   - Assess impact (affected users, services)
   - Determine root cause (logs, metrics, traces)
   - Escalate to manager if needed

3. Mitigation (T+10 min)
   - Apply immediate fix (rollback, failover, scaling)
   - Update status page (investigating → identified → monitoring)

4. Recovery (T+15 min)
   - Verify service restoration
   - Monitor for 30 minutes post-recovery
   - Mark incident as resolved

5. Post-Mortem (T+48 hours)
   - Root cause analysis
   - Timeline of events
   - Action items to prevent recurrence
   - Update runbooks and alerts

6.5.7 Planned Maintenance Windows

Maintenance Schedule:
  • Frequency: Monthly (3rd Sunday, 2 AM - 4 AM UTC)
  • Duration: Max 2 hours
  • Notification: 7 days advance notice
  • Impact: <5 minutes downtime (rolling updates preferred)

Zero-Downtime Maintenance (Preferred):
  • Database schema migrations: Online DDL (no locks)
  • Application deployments: Blue-green or canary
  • Infrastructure updates: Rolling updates across AZs

6.5.8 Dependencies and External Service SLAs

Third-Party SLA Requirements:

Service                Provider             SLA Required   Actual SLA          Mitigation
Jelastic API           Virtuozzo Jelastic   99.9%          99.95%              Circuit breaker, cached data
Cloud Infrastructure   AWS                  99.99%         99.99% (multi-AZ)   Multi-region DR
Email Delivery         SendGrid             99.9%          99.95%              Queue retries, fallback SMTP
Payment Processing     Stripe               99.99%         99.99%              Retry logic, webhook replay
CDN                    CloudFlare           100%           100%                Fallback to CloudFront

Dependency Failure Handling:
  • Jelastic API: Circuit breaker opens after 3 failures, serve cached data for 15 minutes
  • SendGrid: Queue emails in Redis, retry 3 times with exponential backoff
  • Stripe: Webhook replay for missed events, idempotent payment handling
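The Jelastic circuit breaker described above (open after 3 consecutive failures, serve cached data for 15 minutes, then allow a trial request) can be sketched as follows; the surrounding HTTP client integration is assumed:

```python
class CircuitBreaker:
    """Minimal sketch of the Jelastic-API circuit breaker above."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 900.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # 15 minutes in seconds
        self.failures = 0
        self.opened_at = None               # set when the breaker trips

    def allow_request(self, now: float) -> bool:
        if self.opened_at is None:
            return True                     # closed: normal operation
        if now - self.opened_at >= self.reset_timeout:
            self.opened_at = None           # half-open: permit one trial request
            self.failures = 0
            return True
        return False                        # open: caller serves cached data

    def record_failure(self, now: float) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now            # trip the breaker

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None
```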

6.5.9 Cost of Achieving 99.99% Availability

Additional Costs vs. 99.9%:

Component             99.9% Cost   99.99% Cost   Delta      Justification
Multi-AZ Deployment   $5,000       $8,000        +$3,000    3 AZs instead of 2
Database Replicas     $8,000       $11,000       +$3,000    Synchronous replicas
Multi-Region DR       $0           $18,580       +$18,580   Secondary region (see 7.8.10)
Monitoring Tools      $1,000       $3,000        +$2,000    APM, advanced alerting
On-Call Staffing      $10,000      $20,000       +$10,000   24/7 coverage
TOTAL                 $24,000      $60,580       +$36,580   2.5x cost for 10x less downtime

Cost per User (1M users):
  • 99.9%: $0.024/month
  • 99.99%: $0.061/month (+$0.037/month or +154%)

ROI Analysis:
  • Revenue Impact of Downtime: Estimated $50,000/hour for 1M users
  • Annual Downtime Reduction: 8.75 hours → 0.88 hours (7.87 hours saved)
  • Annual Revenue Protected: ~$393,500
  • Additional Annual Cost: $439,000
  • Break-Even Point: ~1.1M users with $50K/hour revenue impact
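The ROI figures above reproduce as simple arithmetic:

```python
# Reproducing the ROI analysis (all figures taken from this section).
hours_saved = 8.75 - 0.88                           # downtime reduction: 7.87 h/year
revenue_per_hour = 50_000                           # estimated impact at 1M users
revenue_protected = hours_saved * revenue_per_hour  # ≈ $393,500 / year

additional_cost = 36_580 * 12                       # +$36,580/month ≈ $439,000 / year

# Protected revenue scales with users while the added cost is roughly fixed,
# so break-even lands near 1.1M users:
break_even_users = additional_cost / revenue_protected * 1_000_000
```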

6.6 Accessibility Requirements

  • WCAG 2.1 AA: Compliance with accessibility standards
  • Keyboard Navigation: Full functionality via keyboard
  • Screen Reader Support: Compatibility with major screen readers
  • Color Contrast: Minimum 4.5:1 contrast ratio

7. Architectural Overview

7.1 High-Level Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Next.js       │────│   FastAPI        │────│   PostgreSQL    │
│   Frontend      │    │   Backend        │    │   Database      │
│   (React 18)    │    │                  │    │                 │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                    ┌──────────────────┐
                    │    Redis         │
                    │   (Cache/Queue)  │
                    └──────────────────┘
         ┌─────────────────────────────────────────────────────┐
         │                Infrastructure                       │
         │  ┌─────────────┐   ┌─────────────┐   ┌───────────┐ │
         │  │   Grafana   │   │ Prometheus  │   │   Loki    │ │
         │  │ (Dashboards)│   │   (Metrics) │   │ (Logging) │ │
         │  └─────────────┘   └─────────────┘   └───────────┘ │
         └─────────────────────────────────────────────────────┘

7.2 Hybrid Modular Domain-Driven Design (DDD) Architecture

The MBPanel backend implements a Hybrid Modular Domain-Driven Design (DDD) architecture built around vertical slices: all components related to a specific business feature are co-located in a single module directory, while a layered structure (interface, application, domain, infrastructure) within each module preserves a clear separation of concerns. The approach combines DDD principles with a modular architecture, keeping coupling between modules low without scattering a feature across global layers.

7.2.1 Modular Organization (app/[MODULE_NAME]/)

The architecture is organized into independent domain modules where each module contains all layers needed for that specific business feature:

mbpanel/
├── frontend/                         # Next.js frontend application (decoupled)
│   ├── app/                         # Next.js App Router (recommended)
│   │   ├── (auth)/                  # Auth-related pages (route group)
│   │   │   ├── login/
│   │   │   └── register/
│   │   ├── dashboard/               # Main dashboard pages
│   │   │   ├── page.tsx
│   │   │   └── layout.tsx
│   │   ├── layout.tsx               # Root layout
│   │   ├── page.tsx                 # Home page
│   │   └── globals.css              # Global styles
│   ├── features/                    # ⭐ DOMAIN-ALIGNED FEATURE MODULES
│   │   ├── auth/                    # Authentication feature
│   │   │   ├── components/          # Auth-specific components
│   │   │   ├── hooks/               # useAuth, useLogin, useMFA hooks
│   │   │   ├── services/            # Auth API client
│   │   │   ├── types/               # Auth TypeScript types
│   │   │   └── utils/               # Auth utilities
│   │   ├── users/                   # User management feature
│   │   │   ├── components/
│   │   │   ├── hooks/
│   │   │   ├── services/
│   │   │   ├── types/
│   │   │   └── utils/
│   │   ├── teams/                   # Team management feature
│   │   │   └── ... (same structure)
│   │   ├── sites/                   # Site management feature
│   │   │   └── ... (same structure)
│   │   ├── environments/            # Environment management feature
│   │   │   └── ... (same structure)
│   │   ├── staging/                 # Staging feature
│   │   │   └── ... (same structure)
│   │   ├── backups/                 # Backup/restore feature
│   │   │   └── ... (same structure)
│   │   ├── cache/                   # Cache management feature
│   │   │   └── ... (same structure)
│   │   ├── domains/                 # Domain/SSL management feature
│   │   │   └── ... (same structure)
│   │   ├── sftp/                    # SFTP management feature
│   │   │   └── ... (same structure)
│   │   ├── wordpress/               # ⭐ WordPress feature (ISOLATED)
│   │   │   ├── components/          # WP-specific UI components
│   │   │   ├── hooks/               # useWordPress, useWpCli hooks
│   │   │   ├── services/            # WordPress API client
│   │   │   ├── types/               # WP TypeScript types
│   │   │   └── utils/               # WP-specific utilities
│   │   ├── payments/                # Payment processing feature
│   │   │   └── ... (same structure)
│   │   ├── billing/                 # Billing/subscriptions feature
│   │   │   └── ... (same structure)
│   │   ├── uptime/                  # Uptime monitoring feature
│   │   │   └── ... (same structure)
│   │   ├── nodes/                   # Node management feature
│   │   │   └── ... (same structure)
│   │   ├── activity/                # Activity logs feature
│   │   │   └── ... (same structure)
│   │   ├── favourites/              # Favourites feature
│   │   │   └── ... (same structure)
│   │   ├── config/                  # Configuration feature
│   │   │   └── ... (same structure)
│   │   ├── profile/                 # User profile feature
│   │   │   └── ... (same structure)
│   │   ├── sessions/                # Session management feature
│   │   │   └── ... (same structure)
│   │   └── webhook/                 # Webhook feature
│   │       └── ... (same structure)
│   ├── shared/                      # Shared/common frontend code
│   │   ├── components/              # Reusable UI components (Button, Modal, etc.)
│   │   │   ├── ui/                  # Base UI components
│   │   │   │   ├── button.tsx
│   │   │   │   ├── modal.tsx
│   │   │   │   └── input.tsx
│   │   │   └── layout/              # Layout components
│   │   │       ├── header.tsx
│   │   │       ├── sidebar.tsx
│   │   │       └── footer.tsx
│   │   ├── hooks/                   # Common hooks (useDebounce, useLocalStorage, etc.)
│   │   ├── lib/                     # API client, axios setup, utilities
│   │   │   ├── api-client.ts        # Base API client
│   │   │   ├── axios-config.ts      # Axios interceptors
│   │   │   └── openapi-client.ts    # Auto-generated from FastAPI
│   │   ├── types/                   # Common TypeScript types
│   │   │   ├── api.ts               # API response types
│   │   │   └── common.ts            # Common types
│   │   ├── utils/                   # Common utilities
│   │   │   ├── validation.ts        # Form validation helpers
│   │   │   ├── formatting.ts        # Date/number formatting
│   │   │   └── constants.ts         # Global constants
│   │   └── store/                   # Global state management (Zustand)
│   │       ├── auth-store.ts
│   │       └── theme-store.ts
│   ├── public/                      # Static assets
│   │   ├── images/
│   │   ├── icons/
│   │   └── fonts/
│   ├── styles/                      # Global styles
│   │   └── tailwind.css             # Tailwind CSS
│   ├── tests/                       # Frontend tests
│   │   ├── unit/                    # Unit tests
│   │   ├── integration/             # Integration tests
│   │   └── e2e/                     # End-to-end tests
│   ├── package.json                 # Frontend dependencies
│   ├── next.config.js               # Next.js configuration
│   ├── tsconfig.json                # TypeScript configuration
│   ├── tailwind.config.js           # Tailwind CSS configuration
│   ├── .env.local                   # Local environment variables
│   └── Dockerfile                   # Production Dockerfile for frontend
├── backend/                         # FastAPI backend application (decoupled)
│   ├── app/                         # Main application package
│   │   ├── __init__.py              # Package initialization
│   │   ├── main.py                  # Main FastAPI app (assembler)
│   │   ├── core/                    # ⭐ SHARED: Enhanced core infrastructure
│   │   │   ├── __init__.py
│   │   │   ├── config.py            # Pydantic BaseSettings for environment config
│   │   │   ├── security.py          # Core crypto functions (hash, verify)
│   │   │   ├── exceptions.py        # Custom exception hierarchy
│   │   │   ├── logging.py           # Structured logging setup
│   │   │   ├── middleware.py        # Common middleware (correlation ID, etc.)
│   │   │   ├── constants.py         # Global constants and enums
│   │   │   ├── utils.py             # Common utility functions
│   │   │   ├── http_client.py       # ⭐ HTTP Client Infrastructure (connection pooling, circuit breaker)
│   │   │   ├── cache.py             # ⭐ Caching utilities (Redis integration)
│   │   │   ├── rate_limit.py        # ⭐ Rate limiting utilities
│   │   │   ├── adapters/            # ⭐ SHARED EXTERNAL API ADAPTERS (used by multiple modules)
│   │   │   │   ├── __init__.py
│   │   │   │   ├── virtuozzo_adapter.py      # Used by: environments, wordpress, backups, sftp, staging, nodes, sessions
│   │   │   │   ├── bunny_cdn_adapter.py      # Used by: cache, domains
│   │   │   │   ├── cloudflare_adapter.py     # Used by: domains
│   │   │   │   ├── stripe_adapter.py         # Future: Used by: payments
│   │   │   │   ├── paypal_adapter.py         # Future: Used by: billing
│   │   │   │   ├── postmark_adapter.py       # Future: Used by: notifications
│   │   │   │   └── uptime_adapter.py         # Future: Used by: uptime
│   │   │   └── shared/              # ⭐ SHARED KERNEL (cross-cutting domain logic)
│   │   │       ├── __init__.py
│   │   │       ├── rbac.py          # Role-based access control helpers
│   │   │       ├── tenant.py        # Multi-tenant resolution logic (team_id)
│   │   │       ├── audit.py         # Audit logging helpers
│   │   │       └── events.py        # Domain event definitions (optional)
│   │   ├── database/                # ⭐ SHARED: Enhanced database layer
│   │   │   ├── __init__.py
│   │   │   ├── database.py          # Engine, SessionLocal, Base
│   │   │   ├── deps.py              # get_db, get_current_user dependencies
│   │   │   ├── mixins.py            # Common model mixins (TimestampMixin, etc.)
│   │   │   └── types.py             # Custom SQLAlchemy types (GUID, JSON, etc.)
│   │   ├── users/                   # --- DOMAIN MODULE: "Users" ---
│   │   │   ├── __init__.py
│   │   │   ├── router.py            # INTERFACE: APIRouter for /users/
│   │   │   ├── service.py           # APPLICATION: UserService (use cases)
│   │   │   ├── repository.py        # INFRASTRUCTURE: UserRepository (data access)
│   │   │   ├── model.py             # DOMAIN: User SQLAlchemy model
│   │   │   └── schema.py            # INTERFACE: UserCreate, UserRead Pydantic schemas
│   │   ├── auth/                    # --- DOMAIN MODULE: "Auth" ---
│   │   │   ├── __init__.py
│   │   │   ├── router.py            # INTERFACE: APIRouter for /auth/
│   │   │   ├── service.py           # APPLICATION: AuthService
│   │   │   ├── security.py          # INFRASTRUCTURE: OAuth2/JWT logic
│   │   │   └── schema.py            # INTERFACE: Token, LoginRequest schemas
│   │   ├── teams/                   # --- DOMAIN MODULE: "Teams" ---
│   │   │   ├── __init__.py
│   │   │   ├── router.py
│   │   │   ├── service.py
│   │   │   ├── repository.py
│   │   │   ├── model.py
│   │   │   └── schema.py
│   │   ├── sites/                   # --- DOMAIN MODULE: "Sites" ---
│   │   │   ├── __init__.py
│   │   │   ├── router.py
│   │   │   ├── service.py
│   │   │   ├── repository.py
│   │   │   ├── model.py
│   │   │   └── schema.py
│   │   ├── environments/            # --- DOMAIN MODULE: "Environments" ---
│   │   │   ├── __init__.py
│   │   │   ├── router.py
│   │   │   ├── service.py
│   │   │   ├── repository.py
│   │   │   ├── model.py
│   │   │   ├── schema.py
│   │   │   └── tasks.py             # ARQ background tasks
│   │   ├── staging/                 # --- DOMAIN MODULE: "Staging" ---
│   │   │   ├── __init__.py
│   │   │   ├── router.py
│   │   │   ├── service.py
│   │   │   ├── repository.py
│   │   │   ├── model.py
│   │   │   ├── schema.py
│   │   │   └── tasks.py             # ARQ background tasks
│   │   ├── backups/                 # --- DOMAIN MODULE: "Backups" ---
│   │   │   ├── __init__.py
│   │   │   ├── router.py
│   │   │   ├── service.py
│   │   │   ├── repository.py
│   │   │   ├── model.py
│   │   │   ├── schema.py
│   │   │   └── tasks.py             # ARQ background tasks
│   │   ├── cache/                   # --- DOMAIN MODULE: "Cache" ---
│   │   │   ├── __init__.py
│   │   │   ├── router.py
│   │   │   ├── service.py
│   │   │   ├── repository.py
│   │   │   ├── model.py
│   │   │   └── schema.py
│   │   ├── domains/                 # --- DOMAIN MODULE: "Domains" ---
│   │   │   ├── __init__.py
│   │   │   ├── router.py
│   │   │   ├── service.py
│   │   │   ├── repository.py
│   │   │   ├── model.py
│   │   │   ├── schema.py
│   │   │   └── tasks.py             # ARQ background tasks (SSL)
│   │   ├── sftp/                    # --- DOMAIN MODULE: "SFTP" ---
│   │   │   ├── __init__.py
│   │   │   ├── router.py
│   │   │   ├── service.py
│   │   │   ├── repository.py
│   │   │   ├── model.py
│   │   │   ├── schema.py
│   │   │   └── tasks.py             # ARQ background tasks
│   │   ├── wordpress/               # --- DOMAIN MODULE: "WordPress" (⭐ ISOLATED) ---
│   │   │   ├── __init__.py
│   │   │   ├── router.py            # INTERFACE: WP API endpoints
│   │   │   ├── service.py           # APPLICATION: WP business logic
│   │   │   ├── repository.py        # INFRASTRUCTURE: WP data access
│   │   │   ├── model.py             # DOMAIN: WP models (if needed)
│   │   │   ├── schema.py            # INTERFACE: WP request/response schemas
│   │   │   └── tasks.py             # INFRASTRUCTURE: WP background jobs (WP-CLI)
│   │   ├── payments/                # --- DOMAIN MODULE: "Payments" ---
│   │   │   ├── __init__.py
│   │   │   ├── router.py
│   │   │   ├── service.py
│   │   │   ├── repository.py
│   │   │   ├── model.py
│   │   │   ├── schema.py
│   │   │   └── tasks.py             # ARQ background tasks
│   │   ├── billing/                 # --- DOMAIN MODULE: "Billing" ---
│   │   │   ├── __init__.py
│   │   │   ├── router.py
│   │   │   ├── service.py
│   │   │   ├── repository.py
│   │   │   ├── model.py
│   │   │   └── schema.py
│   │   ├── uptime/                  # --- DOMAIN MODULE: "Uptime" ---
│   │   │   ├── __init__.py
│   │   │   ├── router.py
│   │   │   ├── service.py
│   │   │   ├── repository.py
│   │   │   ├── model.py
│   │   │   ├── schema.py
│   │   │   └── tasks.py             # ARQ background tasks
│   │   ├── nodes/                   # --- DOMAIN MODULE: "Nodes" ---
│   │   │   ├── __init__.py
│   │   │   ├── router.py
│   │   │   ├── service.py
│   │   │   ├── repository.py
│   │   │   ├── model.py
│   │   │   └── schema.py
│   │   ├── activity/                # --- DOMAIN MODULE: "Activity" ---
│   │   │   ├── __init__.py
│   │   │   ├── router.py
│   │   │   ├── service.py
│   │   │   ├── repository.py
│   │   │   ├── model.py
│   │   │   └── schema.py
│   │   ├── favourites/              # --- DOMAIN MODULE: "Favourites" ---
│   │   │   ├── __init__.py
│   │   │   ├── router.py
│   │   │   ├── service.py
│   │   │   ├── repository.py
│   │   │   ├── model.py
│   │   │   └── schema.py
│   │   ├── config/                  # --- DOMAIN MODULE: "Config" ---
│   │   │   ├── __init__.py
│   │   │   ├── router.py
│   │   │   ├── service.py
│   │   │   ├── repository.py
│   │   │   ├── model.py
│   │   │   └── schema.py
│   │   ├── profile/                 # --- DOMAIN MODULE: "Profile" ---
│   │   │   ├── __init__.py
│   │   │   ├── router.py
│   │   │   ├── service.py
│   │   │   ├── repository.py
│   │   │   ├── model.py
│   │   │   └── schema.py
│   │   ├── sessions/                # --- DOMAIN MODULE: "Sessions" (instant/virt login) ---
│   │   │   ├── __init__.py
│   │   │   ├── router.py
│   │   │   ├── service.py
│   │   │   ├── repository.py
│   │   │   ├── model.py
│   │   │   └── schema.py
│   │   ├── webhook/                 # --- DOMAIN MODULE: "Webhook" ---
│   │   │   ├── __init__.py
│   │   │   ├── router.py
│   │   │   ├── service.py
│   │   │   ├── repository.py
│   │   │   ├── model.py
│   │   │   └── schema.py
│   │   ├── websocket/               # --- DOMAIN MODULE: "WebSocket" ---
│   │   │   ├── __init__.py          # Module exports
│   │   │   ├── router.py            # INTERFACE: WebSocket endpoints and HTTP routes
│   │   │   ├── service.py           # APPLICATION: Business logic for message handling
│   │   │   ├── connection.py        # INFRASTRUCTURE: ConnectionManager for managing connections
│   │   │   ├── channel.py           # INFRASTRUCTURE: Channel subscription management
│   │   │   ├── presence.py          # INFRASTRUCTURE: Presence tracking (online/offline)
│   │   │   ├── publisher.py         # INFRASTRUCTURE: Message publishing to Redis
│   │   │   ├── repository.py        # INFRASTRUCTURE: Database operations (audit logs)
│   │   │   ├── model.py             # DOMAIN: WebSocket-related models (if needed)
│   │   │   └── schema.py            # INTERFACE: Pydantic schemas for messages
│   │   ├── jobs/                    # --- DOMAIN MODULE: "Background Jobs" ---
│   │   │   ├── __init__.py
│   │   │   ├── router.py            # INTERFACE: (Optional) API to trigger/check jobs
│   │   │   ├── service.py           # APPLICATION: Service to trigger tasks
│   │   │   ├── worker.py            # INFRASTRUCTURE: ARQ worker setup
│   │   │   └── schema.py            # INTERFACE: Job status schemas
│   │   └── tests/                   # ⭐ COMPREHENSIVE TESTING
│   │       ├── __init__.py
│   │       ├── conftest.py          # Shared fixtures (db, client, auth tokens)
│   │       ├── test_main.py         # Main app tests
│   │       ├── integration/         # Cross-module integration tests
│   │       │   ├── __init__.py
│   │       │   ├── test_site_creation_flow.py
│   │       │   ├── test_backup_restore_flow.py
│   │       │   └── test_user_auth_flow.py
│   │       ├── users/               # Per-module tests
│   │       │   ├── __init__.py
│   │       │   ├── test_router.py
│   │       │   ├── test_service.py
│   │       │   └── test_repository.py
│   │       ├── auth/
│   │       │   └── ... (same structure)
│   │       ├── teams/
│   │       │   └── ... (same structure)
│   │       ├── sites/
│   │       │   └── ... (same structure)
│   │       ├── wordpress/           # WordPress module tests
│   │       │   ├── __init__.py
│   │       │   ├── test_router.py
│   │       │   ├── test_service.py
│   │       │   └── test_repository.py
│   │       └── ... (one folder per module)
│   ├── alembic/                     # Database migration scripts
│   │   ├── versions/                # Migration versions
│   │   └── env.py                   # Alembic environment config
│   ├── requirements/                # Python requirements by environment (✅ Implemented)
│   │   ├── base.txt                 # Common dependencies
│   │   ├── dev.txt                  # Development dependencies
│   │   └── prod.txt                 # Production dependencies
│   ├── scripts/                     # ⭐ OPERATIONAL SCRIPTS
│   │   ├── migrate_data.py          # MySQL → PostgreSQL migration
│   │   ├── seed_dev_data.py         # Development data seeding
│   │   ├── health_check.py          # Health check utilities
│   │   ├── generate_module.py       # Generate new module scaffold
│   │   └── validate_architecture.py # Validate module isolation rules
│   ├── Dockerfile                   # Production Dockerfile for backend
│   ├── docker-compose.yml           # Docker Compose for local development (✅ Implemented)
│   ├── pyproject.toml               # Python project configuration
│   ├── pytest.ini                   # Pytest configuration
│   ├── .env.example                 # Example environment variables (✅ Implemented via scripts)
│   └── alembic.ini                  # Alembic configuration
├── local-infra/                     # Local development infrastructure (✅ Implemented)
│   ├── docker-compose.dev.yml       # Development Docker Compose (✅ Implemented)
│   ├── docker-compose.test.yml      # Testing environment Docker Compose (✅ Implemented)
│   ├── docker-compose.observability.yml  # Monitoring stack (✅ Implemented)
│   ├── nginx/                       # Nginx configuration for local dev (✅ Implemented)
│   │   ├── nginx.conf               # Main nginx configuration
│   │   └── sites-available/         # Site configurations
│   ├── scripts/                     # Development and deployment scripts (✅ Implemented)
│   │   ├── setup-dev.sh             # Development environment setup
│   │   ├── migrate-db.sh            # Database migration script
│   │   ├── health-check.sh          # Health check utilities
│   │   └── create-env-files.sh      # Environment file generator
│   ├── prometheus/                  # Prometheus configuration (✅ Implemented)
│   ├── loki/                        # Loki configuration (✅ Implemented)
│   ├── promtail/                    # Promtail configuration (✅ Implemented)
│   └── certs/                       # SSL certificates for local development (✅ Implemented)
├── docs/                            # Documentation
│   ├── development/                 # Development documentation
│   │   ├── MAINPRD.md              # This document
│   │   ├── MODULE_TEMPLATE.md      # Template for new modules
│   │   ├── index.md                # Developer onboarding guide (✅ Implemented)
│   │   └── LOCAL_SETUP_SUMMARY.md  # Local setup quick reference (✅ Implemented)
│   ├── architecture/                # Architecture decision records (ADRs)
│   │   ├── 001-hybrid-modular-ddd.md
│   │   ├── 002-database-strategy.md
│   │   └── 003-caching-strategy.md
│   └── deployment/                  # Deployment guides
│       ├── DEPLOYMENT_GUIDE.md
│       └── KUBERNETES.md
├── tests/                           # ⭐ END-TO-END TESTS (cross-component)
│   ├── e2e/                         # End-to-end tests
│   │   ├── test_site_creation_flow.py
│   │   ├── test_backup_restore_flow.py
│   │   └── test_user_registration_flow.py
│   └── performance/                 # Performance/load tests
│       ├── load_tests.py
│       └── stress_tests.py
├── .github/                         # GitHub Actions workflows
│   ├── workflows/                   # CI/CD pipeline definitions
│   │   ├── backend-ci.yml          # Backend CI pipeline
│   │   ├── frontend-ci.yml         # Frontend CI pipeline
│   │   ├── integration-ci.yml      # Integration tests pipeline
│   │   └── deploy-production.yml   # Production deployment
│   └── ISSUE_TEMPLATE/              # Issue templates
├── deployment/                      # Production deployment configurations (✅ Implemented)
│   └── docker-compose.prod.yml     # Production Docker Compose (optional, for full-stack testing)
├── .gitignore                       # Git ignore rules (✅ Implemented)
├── README.md                        # Project overview (✅ Implemented)
├── Makefile                         # Common development commands (✅ Implemented)
└── .editorconfig                    # Editor configuration (✅ Implemented)

7.2.2 Layered Architecture Within Each Module

Each domain module implements a clean layered architecture:

  • Interface Layer ([MODULE_NAME]/router.py):
    • Defines API endpoints for the module
    • Handles HTTP concerns: request validation, response serialization, error handling
    • Uses dependencies like Depends(get_db), Depends(get_current_user)
    • Validates request data using Pydantic schemas from schema.py
    • Catches domain exceptions and converts them to HTTPException
    • Rule: Must NOT contain business logic or SQLAlchemy query code

  • Application Layer ([MODULE_NAME]/service.py):
    • Orchestrates business logic and use cases for the module
    • Implements use cases like create_user, update_order, etc.
    • Coordinates between repository, model, and schema components
    • Contains business validation and workflow orchestration
    • Rule: Must NOT know about FastAPI or HTTP concerns; imports from repository.py, model.py, and schema.py within the same module

  • Domain Layer ([MODULE_NAME]/model.py):
    • Defines database structure and relationships
    • SQLAlchemy model classes inheriting from Base
    • Column definitions, relationships, and constraints
    • Rule: Should contain only data model definitions

  • Infrastructure Layer ([MODULE_NAME]/repository.py):
    • Contains all database communication logic for the module: SQLAlchemy queries and operations
    • Functions take db: Session as an argument
    • Handles query execution, commits, and refreshes
    • Rule: Must NOT contain business logic (only data access logic)

  • DTO Layer ([MODULE_NAME]/schema.py):
    • Defines Data Transfer Objects for API communication
    • Pydantic BaseModel classes for request/response
    • Create schemas for POST requests, Read schemas for GET responses, Update schemas for PUT/PATCH requests
    • Rule: Used by router.py for validation and serialization
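The dependency direction across these layers can be sketched in miniature. This is a pure-Python stand-in, assuming a hypothetical "users" module; the real code would use FastAPI, Pydantic, and SQLAlchemy, but the point is only that the router calls the service, the service calls the repository, and business rules live in the service.

```python
from dataclasses import dataclass

# DTO layer (schema.py): dataclasses stand in for Pydantic Create/Read schemas
@dataclass
class UserCreate:
    email: str

@dataclass
class UserRead:
    id: int
    email: str

# Infrastructure layer (repository.py): data access only, no business rules
def insert_user(db: dict, email: str) -> dict:
    row = {"id": len(db) + 1, "email": email}
    db[row["id"]] = row
    return row

# Application layer (service.py): business validation + orchestration
def create_user(db: dict, payload: UserCreate) -> UserRead:
    if "@" not in payload.email:  # business rule belongs here, not in router
        raise ValueError("invalid email")
    row = insert_user(db, payload.email)  # persistence delegated to repository
    return UserRead(**row)

# Interface layer (router.py): would catch ValueError and raise HTTPException
fake_db: dict = {}
user = create_user(fake_db, UserCreate(email="sarah@example.com"))
print(user)
```

Note how the service never sees HTTP concerns and the repository never sees validation rules, which is exactly what the layer rules above mandate.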

7.2.3 Module Isolation Principle

  • Core Rule: A module MUST NOT import from another module
  • Example: app/users/service.py MUST NOT import app/auth/service.py
  • Exception: All modules can import from shared directories (app/core/, app/database/)
  • Shared Kernel Pattern: Cross-cutting domain policies (e.g., RBAC checks, tenant resolution, audit helpers) live in app/core/shared/ and are the ONLY allowed bridge for shared business rules across modules. Modules may depend on shared kernel interfaces, not on other modules' concrete services.
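The "interfaces, not concrete services" rule can be illustrated with a small sketch. AuditLogger, InMemoryAuditLogger, and delete_site are illustrative names, not the real API: a domain module depends only on a Protocol published by the shared kernel, so it never imports another module's implementation.

```python
from typing import Protocol

class AuditLogger(Protocol):  # interface that would live in app/core/shared/
    def log(self, user_id: int, action: str) -> None: ...

class InMemoryAuditLogger:  # one concrete implementation, wired at startup
    def __init__(self) -> None:
        self.events: list = []

    def log(self, user_id: int, action: str) -> None:
        self.events.append((user_id, action))

# A sites/service.py function sees only the shared-kernel Protocol, so module
# isolation holds even though auditing is a cross-cutting concern.
def delete_site(audit: AuditLogger, site_id: int, user_id: int) -> str:
    audit.log(user_id, f"site:{site_id}:delete")
    return "deleted"

audit = InMemoryAuditLogger()
result = delete_site(audit, site_id=7, user_id=1)
print(result, audit.events)
```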

7.2.3.1 External API Adapters - Shared Infrastructure Pattern

Architectural Decision: External API adapters are SHARED infrastructure, not domain-specific code.

External API adapters (Virtuozzo, Bunny CDN, Cloudflare, etc.) are placed in app/core/adapters/ rather than duplicated in each module for the following architectural and practical reasons:

Why Adapters Are Shared (Global):

  1. Multiple Module Usage:
     • virtuozzo_adapter.py is used by 7+ modules: environments, wordpress, backups, sftp, staging, nodes, sessions
     • Duplicating this code 7 times violates the DRY principle and creates maintenance nightmares

  2. Infrastructure, Not Domain Logic:
     • Adapters handle infrastructure concerns: HTTP connections, retries, circuit breakers, rate limiting
     • Domain logic (what to do with the data) stays in domain modules
     • Analogy: just as we don't duplicate database connection pooling per module, we don't duplicate external API clients

  3. Single Point of Configuration:
     • API endpoint URLs, timeouts, rate limits, and connection pool sizes live in one place
     • If the Virtuozzo API URL changes, update ONE adapter, not 7 modules
     • If the circuit breaker threshold needs tuning, change it once

  4. Consistent Resilience Behavior:
     • All modules get the same connection pooling (47% latency improvement)
     • All modules get the same circuit breaker protection
     • All modules get the same retry logic (3 attempts with exponential backoff)
     • No risk of inconsistent behavior across modules

  5. Easier Testing:
     • Mock the adapter once in conftest.py
     • All module tests benefit from the same mock
     • Integration tests run against the real adapter once

  6. Module Isolation Still Maintained:
     • Modules still don't import from other modules
     • They import from app/core/adapters/, which is explicitly allowed (like app/core/ and app/database/)
     • Adapters are infrastructure dependencies, not domain dependencies

Example: Correct Usage in Domain Module:

# backend/app/environments/service.py
import uuid

from sqlalchemy.orm import Session
from app.core.adapters.virtuozzo_adapter import get_virtuozzo_adapter  # ✅ ALLOWED: Shared infrastructure
from app.environments import repository, schema  # ✅ ALLOWED: Own module
from app.core.shared.audit import log_audit_event  # ✅ ALLOWED: Shared kernel

# ❌ FORBIDDEN: from app.wordpress.service import WordPressService

class EnvironmentService:
    def __init__(self):
        # Use shared adapter (infrastructure)
        self.virtuozzo_adapter = get_virtuozzo_adapter()

    async def start_environment(
        self,
        db: Session,
        environment_id: int,
        user_id: int
    ) -> schema.EnvironmentRead:
        """Start environment using shared Virtuozzo adapter"""

        # Domain logic specific to environments module
        environment = repository.get_environment_by_id(db, environment_id)

        # Infrastructure call via shared adapter
        response = await self.virtuozzo_adapter.start_environment(
            env_name=environment.env_name,
            session_key=environment.session_key,
            correlation_id=str(uuid.uuid4())
        )

        # Domain logic: update environment status
        repository.update_environment(db, environment_id, {"status": "starting"})
        db.commit()
        db.refresh(environment)

        # Return the declared response schema (assumes EnvironmentRead sets
        # model_config = ConfigDict(from_attributes=True))
        return schema.EnvironmentRead.model_validate(environment)

Comparison: What Would Be Wrong (Anti-Pattern):

# ❌ ANTI-PATTERN: Duplicating adapter in each module

# backend/app/environments/adapters/virtuozzo_adapter.py
# backend/app/wordpress/adapters/virtuozzo_adapter.py  # 🚫 DUPLICATION!
# backend/app/backups/adapters/virtuozzo_adapter.py    # 🚫 DUPLICATION!
# backend/app/sftp/adapters/virtuozzo_adapter.py       # 🚫 DUPLICATION!
# ... 7+ copies of the same code

# Problems:
# 1. Update connection pool config → must update 7 files
# 2. Fix circuit breaker bug → must fix in 7 places
# 3. Update API endpoint → change 7 files
# 4. Inconsistent behavior if one file gets out of sync
# 5. 7x more code to test
# 6. 7x more code to maintain

When to Create Per-Module Adapters (Rare Cases):

Only create module-specific adapters when:

  1. Module-specific transformation logic: the adapter needs module-specific business logic (then it's not really infrastructure)
  2. Different external service per module: each module calls a completely different external API
  3. Module-specific configuration: the adapter needs fundamentally different configuration per module (rare)

Example where per-module might make sense:

  • If a payments/ module used a different Stripe client than a billing/ module, with completely different configuration
  • In practice this is rare; you'd typically have ONE Stripe adapter with different methods

Architectural Layers Distinction:

┌─────────────────────────────────────────────────────────┐
│  DOMAIN MODULES (app/[module]/)                         │
│  - Business logic specific to the domain                │
│  - router.py, service.py, repository.py, model.py       │
│  - CAN import from app/core/, app/database/             │
│  - CANNOT import from other domain modules               │
└─────────────────────────────────────────────────────────┘
                         │ imports from
┌─────────────────────────────────────────────────────────┐
│  SHARED INFRASTRUCTURE (app/core/)                      │
│  - Technical infrastructure, not business logic         │
│  - http_client.py, adapters/, cache.py, logging.py     │
│  - Provides reusable technical capabilities             │
└─────────────────────────────────────────────────────────┘
                         │ makes calls to
┌─────────────────────────────────────────────────────────┐
│  EXTERNAL SERVICES                                       │
│  - Virtuozzo API, Bunny CDN, Cloudflare, Stripe, etc.  │
└─────────────────────────────────────────────────────────┘

Summary: Shared Adapters Are Infrastructure, Not Domain Code

  • Location: app/core/adapters/ (shared)
  • Purpose: Isolate external API integration details (HTTP, retries, circuit breakers)
  • Reusability: One adapter, many modules (DRY principle)
  • Module Isolation: Still maintained (importing from app/core/ is explicitly allowed)
  • Domain Logic: Stays in domain modules (what to do with the data)
  • Infrastructure: Centralized in adapters (how to get the data)
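The "consistent resilience" point above can be sketched with a toy retry helper: one policy (3 attempts, exponential backoff) lives in the shared adapter, so every module that calls through it gets identical behavior. This is an illustration only; `send` stands in for the real HTTP call, and base_delay is 0 purely to keep the sketch fast (a real adapter would start around 0.5s).

```python
import time

def call_with_retry(send, url: str, attempts: int = 3, base_delay: float = 0.0):
    """Retry send(url) on ConnectionError with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return send(url)
        except ConnectionError:
            if attempt == attempts:
                raise  # give up after the final attempt
            time.sleep(base_delay * (2 ** (attempt - 1)))

calls = {"count": 0}

def flaky_send(url: str) -> dict:
    """Simulated transport: fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient network error")
    return {"status": "ok", "url": url}

result = call_with_retry(flaky_send, "https://api.example.invalid/start")
print(result["status"], calls["count"])
```

Because the retry loop is centralized, tuning attempts or backoff is a one-line change that every consuming module inherits.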

7.2.4 Standard Module Components

Each domain module must contain these files:

  • router.py: API endpoints and request/response handling
  • service.py: Business logic orchestration and use case implementation
  • repository.py: Database operations and persistence logic
  • model.py: SQLAlchemy models defining database structure
  • schema.py: Pydantic schemas for API request/response validation

7.2.5 Shared Components

  • Core Logic (app/core/):
    • Shared, non-domain-specific utilities available to all modules
    • config.py: Pydantic BaseSettings for environment configuration
    • security.py: Core crypto functions (hashing, verification) - NOT OAuth2 logic
    • Rule: Must NOT import from any domain module

  • Database Layer (app/database/):
    • Manages the single database connection for the entire application
    • database.py: SQLAlchemy engine, SessionLocal, and Base
    • deps.py: get_db() dependency for database sessions
    • Rule: All modules import get_db from app.database.deps

7.2.6 Application Entry Point (main.py)

  • The single application assembler and entry point
  • Instantiates the main FastAPI application object
  • Configures top-level middleware (e.g., CORSMiddleware)
  • Imports APIRouter from each domain module
  • Includes all module routers using app.include_router()
  • Rule: MUST NOT define any API endpoints directly
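The assembler pattern can be shown with a toy sketch. Router and App here are pure-Python stand-ins for fastapi.APIRouter and FastAPI; the point is only the shape: each module exposes a router, and main.py does nothing but include them.

```python
class Router:
    """Stand-in for fastapi.APIRouter: collects paths under a prefix."""
    def __init__(self, prefix: str):
        self.prefix = prefix
        self.routes: list = []

    def add(self, path: str) -> None:
        self.routes.append(self.prefix + path)

class App:
    """Stand-in for FastAPI: main.py only assembles module routers."""
    def __init__(self) -> None:
        self.routes: list = []

    def include_router(self, router: Router) -> None:
        self.routes.extend(router.routes)

# Each domain module defines its own router (users/router.py, sites/router.py)
users_router = Router("/api/v1/users")
users_router.add("/me")
sites_router = Router("/api/v1/sites")
sites_router.add("/")

# main.py just assembles; it declares no endpoints of its own
app = App()
app.include_router(users_router)
app.include_router(sites_router)
print(app.routes)
```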

7.2.7 Enhanced Shared Kernel (app/core/shared/)

The Shared Kernel is a critical component of the Hybrid Modular DDD architecture. It contains cross-cutting domain logic that is shared across multiple modules while maintaining module isolation principles.

Purpose:

  • Provide a controlled mechanism for sharing domain-level concerns across modules
  • Prevent code duplication for common business rules
  • Maintain a single source of truth for cross-cutting policies
  • Serve as the ONLY allowed bridge for shared business logic between modules

Components:

rbac.py - Role-Based Access Control Helpers:

# Example structure:
from enum import Enum
from typing import List
from app.database.deps import get_current_user

class Role(str, Enum):
    SUPER_ADMIN = "super_admin"
    TEAM_OWNER = "team_owner"
    TEAM_ADMIN = "team_admin"
    TEAM_MEMBER = "team_member"
    USER = "user"

class Permission(str, Enum):
    SITE_CREATE = "site:create"
    SITE_DELETE = "site:delete"
    BACKUP_CREATE = "backup:create"
    BACKUP_RESTORE = "backup:restore"
    # ... more permissions

def has_permission(user, permission: Permission) -> bool:
    """Check if user has specific permission"""
    pass

def require_permission(permission: Permission):
    """Dependency for endpoint protection"""
    pass

def is_team_owner(user, team_id: int) -> bool:
    """Check if user owns a specific team"""
    pass

def is_team_member(user, team_id: int) -> bool:
    """Check if user is member of a specific team"""
    pass
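
One way the has_permission stub above could be backed is a static role-to-permission map. The specific grants below are illustrative, not the product's real RBAC matrix:

```python
# Hypothetical grant table; the real mapping would come from product policy
ROLE_PERMISSIONS = {
    "super_admin": {"site:create", "site:delete", "backup:create", "backup:restore"},
    "team_owner":  {"site:create", "site:delete", "backup:create", "backup:restore"},
    "team_admin":  {"site:create", "backup:create"},
    "team_member": {"backup:create"},
    "user":        set(),
}

def has_permission(role: str, permission: str) -> bool:
    """Check a role against the grant table; unknown roles get nothing."""
    return permission in ROLE_PERMISSIONS.get(role, set())

print(has_permission("team_admin", "site:create"))   # True
print(has_permission("team_member", "site:delete"))  # False
```

require_permission would then wrap this check in a FastAPI dependency that raises HTTPException(403) on failure.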

tenant.py - Multi-Tenant Resolution Logic:

# Example structure:
from typing import Optional
from sqlalchemy.orm import Session

class TenantContext:
    """Context manager for tenant-scoped operations"""
    def __init__(self, team_id: int):
        self.team_id = team_id

    def __enter__(self):
        # Set tenant context
        pass

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Clear tenant context
        pass

def get_current_team_id(user) -> int:
    """Extract team_id from current user context"""
    pass

def filter_by_team(query, team_id: int):
    """Apply team_id filter to SQLAlchemy query"""
    return query.filter_by(team_id=team_id)

def validate_team_access(user, team_id: int) -> bool:
    """Validate user has access to specified team"""
    pass
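
filter_by_team is the heart of tenant isolation: every read path funnels through one team_id filter so scoping cannot be forgotten. A pure-Python stand-in (the row data is illustrative) makes the behavior concrete:

```python
# In production these rows would be SQLAlchemy results; dicts stand in here
rows = [
    {"id": 1, "team_id": 10, "name": "blog"},
    {"id": 2, "team_id": 11, "name": "shop"},
    {"id": 3, "team_id": 10, "name": "docs"},
]

def filter_by_team(rows: list, team_id: int) -> list:
    """Mirror of the SQLAlchemy query.filter_by(team_id=...) helper."""
    return [r for r in rows if r["team_id"] == team_id]

team_sites = filter_by_team(rows, team_id=10)
print([r["name"] for r in team_sites])  # ['blog', 'docs']
```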

audit.py - Audit Logging Helpers:

# Example structure:
from enum import Enum
from datetime import datetime
from typing import Any, Dict

class AuditAction(str, Enum):
    CREATE = "create"
    UPDATE = "update"
    DELETE = "delete"
    LOGIN = "login"
    LOGOUT = "logout"

def log_audit_event(
    user_id: int,
    team_id: int,
    action: AuditAction,
    resource_type: str,
    resource_id: int,
    metadata: Dict[str, Any] = None
):
    """Log audit event for compliance and tracking"""
    pass

def get_audit_trail(
    team_id: int,
    resource_type: str = None,
    start_date: datetime = None,
    end_date: datetime = None
):
    """Retrieve audit trail for a team"""
    pass

events.py - Domain Event Definitions (Optional):

# Example structure:
from typing import Any, Dict
from datetime import datetime
from pydantic import BaseModel

class DomainEvent(BaseModel):
    """Base class for all domain events"""
    event_id: str
    event_type: str
    timestamp: datetime
    team_id: int
    payload: Dict[str, Any]

class SiteCreatedEvent(DomainEvent):
    event_type: str = "site.created"

class BackupCompletedEvent(DomainEvent):
    event_type: str = "backup.completed"

class EnvironmentDeletedEvent(DomainEvent):
    event_type: str = "environment.deleted"

# Event publishing/subscribing logic
async def publish_event(event: DomainEvent):
    """Publish domain event to Redis/message broker"""
    pass
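
An in-memory sketch shows the publish/subscribe flow end to end; the real publish_event above would push to Redis or a message broker instead, and the event dicts mirror the DomainEvent fields:

```python
from collections import defaultdict

_subscribers = defaultdict(list)

def subscribe(event_type: str, handler) -> None:
    """Register a handler for one event type."""
    _subscribers[event_type].append(handler)

def publish_event(event: dict) -> int:
    """Deliver the event to every registered handler; return the count."""
    handlers = _subscribers[event["event_type"]]
    for handler in handlers:
        handler(event)
    return len(handlers)

received = []
subscribe("site.created", received.append)
delivered = publish_event({
    "event_type": "site.created",
    "team_id": 10,
    "payload": {"site_id": 1},
})
print(delivered, received[0]["payload"])
```

Decoupling publisher from subscribers this way is what lets, say, the backups module react to site.created without importing the sites module.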

Usage Guidelines:

  • Modules MAY import from app.core.shared.* for cross-cutting concerns
  • Modules MUST NOT import from other modules directly
  • The shared kernel should contain only domain-level abstractions, not infrastructure
  • Keep the shared kernel minimal; only add when truly cross-cutting
  • Document all shared kernel additions with a clear rationale

Architecture Validation:

# Example validation in app/backend/scripts/validate_architecture.py
from pathlib import Path

def validate_module_isolation():
    """
    Validates that:
    1. No module imports from another module
    2. Modules only import from app.core.* and app.database.*
    3. Shared kernel is used appropriately
    4. External API calls go through adapters in app.core.adapters/
    5. No duplicate adapter implementations in domain modules
    """
    pass

def validate_adapter_usage():
    """
    Validates proper adapter usage:
    1. All adapters are in app/core/adapters/
    2. No duplicate adapter implementations in domain modules
    3. Domain modules use adapters correctly (import from app.core.adapters/)
    4. No direct HTTP calls to external APIs in domain modules (must use adapters)
    """
    violations = []

    # Check for direct HTTP calls in domain modules (should use adapters)
    domain_modules = [p for p in Path("app").iterdir() if p.is_dir()]
    for module in domain_modules:
        if module.name in ['core', 'database', 'tests']:
            continue

        for py_file in module.rglob("*.py"):
            content = py_file.read_text()

            # Check for direct HTTP library usage (should use adapter)
            if 'import requests' in content or 'import httpx' in content or 'import urllib' in content:
                violations.append(
                    f"❌ {py_file}: Direct HTTP library usage detected. "
                    f"Use adapter from app.core.adapters/ instead."
                )

            # Check for adapter duplication (adapters should be in app/core/adapters/)
            if '/adapters/' in str(py_file) and 'core/adapters' not in str(py_file):
                violations.append(
                    f"❌ {py_file}: Adapter found in domain module. "
                    f"Move to app/core/adapters/ for reusability."
                )

    return violations

7.2.8 Enhanced Database Layer (app/database/)

The enhanced database layer provides shared database infrastructure, common model mixins, and custom types that are reusable across all domain modules.

Components:

database.py - Engine, SessionLocal, Base:

# Example structure:
from sqlalchemy import create_engine
from sqlalchemy.orm import declarative_base, sessionmaker
from app.core.config import settings

# Database engine with connection pooling
engine = create_engine(
    settings.DATABASE_URL,
    pool_pre_ping=True,
    pool_size=20,
    max_overflow=40,
    pool_recycle=3600,
    echo=settings.DEBUG
)

# Session factory
SessionLocal = sessionmaker(
    autocommit=False,
    autoflush=False,
    bind=engine
)

# Base class for all SQLAlchemy models
Base = declarative_base()

deps.py - Database Dependencies:

# Example structure:
from typing import Generator
from sqlalchemy.orm import Session
from fastapi import Depends, HTTPException, status
from fastapi.security import OAuth2PasswordBearer
from jose import JWTError, jwt
from app.core.config import settings
from app.database.database import SessionLocal

oauth2_scheme = OAuth2PasswordBearer(tokenUrl="auth/login")

def get_db() -> Generator[Session, None, None]:
    """Database session dependency"""
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()

async def get_current_user(
    db: Session = Depends(get_db),
    token: str = Depends(oauth2_scheme)
):
    """Get current authenticated user from JWT token"""
    credentials_exception = HTTPException(
        status_code=status.HTTP_401_UNAUTHORIZED,
        detail="Could not validate credentials",
        headers={"WWW-Authenticate": "Bearer"},
    )
    try:
        payload = jwt.decode(token, settings.SECRET_KEY, algorithms=[settings.ALGORITHM])
        sub = payload.get("sub")  # JWT serializes "sub" as a string
        if sub is None:
            raise credentials_exception
        user_id = int(sub)
    except (JWTError, ValueError):
        raise credentials_exception

    # Fetch user from database (local import avoids a circular dependency)
    from app.users.repository import get_user_by_id
    user = get_user_by_id(db, user_id=user_id)
    if user is None:
        raise credentials_exception
    return user

async def get_current_active_user(
    current_user = Depends(get_current_user)
):
    """Ensure user is active"""
    if not current_user.is_active:
        raise HTTPException(status_code=400, detail="Inactive user")
    return current_user

mixins.py - Common Model Mixins:

# Example structure:
from datetime import datetime
from sqlalchemy import Column, Integer, DateTime, Boolean
from sqlalchemy.orm import declared_attr

class TimestampMixin:
    """Adds created_at and updated_at timestamps to models"""
    created_at = Column(DateTime, default=datetime.utcnow, nullable=False)
    updated_at = Column(
        DateTime,
        default=datetime.utcnow,
        onupdate=datetime.utcnow,
        nullable=False
    )

class TeamScopedMixin:
    """Adds team_id for multi-tenant data isolation"""
    @declared_attr
    def team_id(cls):
        return Column(Integer, nullable=False, index=True)

class SoftDeleteMixin:
    """Adds soft delete functionality"""
    is_deleted = Column(Boolean, default=False, nullable=False)
    deleted_at = Column(DateTime, nullable=True)

class AuditMixin:
    """Adds audit tracking fields"""
    created_by = Column(Integer, nullable=True)
    updated_by = Column(Integer, nullable=True)

# Usage in models:
# class Site(Base, TimestampMixin, TeamScopedMixin, SoftDeleteMixin):
#     __tablename__ = "sites"
#     id = Column(Integer, primary_key=True)
#     name = Column(String(255))

types.py - Custom SQLAlchemy Types:

# Example structure:
from sqlalchemy.types import TypeDecorator, CHAR, Text
from sqlalchemy.dialects.postgresql import UUID as PG_UUID, JSONB
import uuid
import json

class GUID(TypeDecorator):
    """Platform-independent GUID type.
    Uses PostgreSQL's UUID type, otherwise uses CHAR(36) with string UUIDs."""
    impl = CHAR
    cache_ok = True

    def load_dialect_impl(self, dialect):
        if dialect.name == 'postgresql':
            return dialect.type_descriptor(PG_UUID())
        else:
            return dialect.type_descriptor(CHAR(36))

    def process_bind_param(self, value, dialect):
        if value is None:
            return value
        elif dialect.name == 'postgresql':
            return str(value)
        else:
            if not isinstance(value, uuid.UUID):
                return str(uuid.UUID(value))
            else:
                return str(value)

    def process_result_value(self, value, dialect):
        if value is None:
            return value
        else:
            if not isinstance(value, uuid.UUID):
                value = uuid.UUID(value)
            return value

class JSONEncodedDict(TypeDecorator):
    """Represents an immutable structure as a json-encoded string."""
    impl = Text
    cache_ok = True

    def process_bind_param(self, value, dialect):
        if value is not None:
            value = json.dumps(value)
        return value

    def process_result_value(self, value, dialect):
        if value is not None:
            value = json.loads(value)
        return value

# For PostgreSQL, use native JSONB
class JSONType(TypeDecorator):
    """JSON type that uses PostgreSQL JSONB or fallback to Text"""
    impl = Text
    cache_ok = True

    def load_dialect_impl(self, dialect):
        if dialect.name == 'postgresql':
            return dialect.type_descriptor(JSONB())
        else:
            return dialect.type_descriptor(Text())

Benefits of Enhanced Database Layer:

  • Consistency: All models inherit common behaviors (timestamps, soft delete)
  • Multi-tenancy: TeamScopedMixin ensures data isolation across teams
  • Maintainability: Change timestamp logic once and it applies to all models
  • Type Safety: Custom types provide better data handling across databases
  • Performance: Native PostgreSQL types (JSONB, UUID) when available

Usage in Domain Modules:

# Example: app/sites/model.py
import uuid

from sqlalchemy import Column, Integer, String, Boolean
from app.database.database import Base
from app.database.mixins import TimestampMixin, TeamScopedMixin, SoftDeleteMixin
from app.database.types import GUID, JSONType

class Site(Base, TimestampMixin, TeamScopedMixin, SoftDeleteMixin):
    __tablename__ = "sites"

    id = Column(Integer, primary_key=True, index=True)
    uuid = Column(GUID, unique=True, default=uuid.uuid4)
    name = Column(String(255), nullable=False)
    domain = Column(String(255), nullable=False)
    config = Column(JSONType, nullable=True)  # Site-specific config
    is_active = Column(Boolean, default=True)

    # TimestampMixin provides: created_at, updated_at
    # TeamScopedMixin provides: team_id
    # SoftDeleteMixin provides: is_deleted, deleted_at

7.2.9 Domain-Aligned Frontend Architecture (frontend/features/)

The frontend architecture mirrors the backend's Hybrid Modular DDD structure, organizing code by business features rather than technical layers. This alignment provides significant benefits for team collaboration and maintainability at scale.

Architecture Principles:

  • Feature-First Organization: Code is organized by business domain (wordpress, sites, backups), not by type (components, hooks)
  • Module Isolation: Each feature module is self-contained with its own components, hooks, services, types, and utils
  • Shared Components: Common UI elements live in frontend/shared/ for reusability
  • Backend Alignment: Feature modules map 1:1 with backend domain modules
  • Type Safety: Auto-generated TypeScript clients from FastAPI OpenAPI specs

Feature Module Structure:

Each feature module follows this standard structure:

frontend/features/[FEATURE_NAME]/
├── components/          # Feature-specific React components
│   ├── [Feature]List.tsx
│   ├── [Feature]Detail.tsx
│   ├── [Feature]Form.tsx
│   └── [Feature]Card.tsx
├── hooks/               # Feature-specific custom hooks
│   ├── use[Feature].ts
│   ├── use[Feature]List.ts
│   └── use[Feature]Mutations.ts
├── services/            # API client for this feature
│   └── [feature]-service.ts
├── types/               # TypeScript type definitions
│   └── index.ts
└── utils/               # Feature-specific utilities
    └── [feature]-helpers.ts

Example: WordPress Feature Module (frontend/features/wordpress/):

// frontend/features/wordpress/components/WordPressWpCli.tsx
import { useWordPressCli } from '../hooks/useWordPressCli';
import { Button } from '@/shared/components/ui/button';

export function WordPressWpCli({ siteId }: { siteId: number }) {
  const { executeCommand, isLoading } = useWordPressCli(siteId);

  return (
    <div>
      {/* WP-CLI interface */}
    </div>
  );
}

// frontend/features/wordpress/hooks/useWordPress.ts
import { useQuery, useMutation } from '@tanstack/react-query';
import { wordpressService } from '../services/wordpress-service';

export function useWordPress(siteId: number) {
  return useQuery({
    queryKey: ['wordpress', siteId],
    queryFn: () => wordpressService.getWordPressInfo(siteId)
  });
}

export function useWordPressCli(siteId: number) {
  const mutation = useMutation({
    mutationFn: (command: string) =>
      wordpressService.executeWpCli(siteId, command)
  });

  return {
    executeCommand: mutation.mutate,
    isLoading: mutation.isPending
  };
}

// frontend/features/wordpress/services/wordpress-service.ts
import { apiClient } from '@/shared/lib/api-client';
import { WordPressInfo, WpCliResponse } from '../types';

export const wordpressService = {
  async getWordPressInfo(siteId: number): Promise<WordPressInfo> {
    const { data } = await apiClient.get(`/api/v1/wordpress/${siteId}`);
    return data;
  },

  async executeWpCli(siteId: number, command: string): Promise<WpCliResponse> {
    const { data } = await apiClient.post(`/api/v1/wordpress/${siteId}/cli`, {
      command
    });
    return data;
  },

  async updateWordPress(siteId: number): Promise<void> {
    await apiClient.post(`/api/v1/wordpress/${siteId}/update`);
  }
};

// frontend/features/wordpress/types/index.ts
export interface WordPressInfo {
  version: string;
  is_active: boolean;
  site_id: number;
  plugins: Plugin[];
  themes: Theme[];
}

export interface WpCliResponse {
  output: string;
  exit_code: number;
  timestamp: string;
}

Shared Components (frontend/shared/):

// frontend/shared/components/ui/button.tsx
// Reusable base UI components (using shadcn/ui pattern)

// frontend/shared/lib/api-client.ts
import axios from 'axios';
import { getAuthToken } from '../utils/auth';

export const apiClient = axios.create({
  baseURL: process.env.NEXT_PUBLIC_API_URL,
  timeout: 30000,
});

// Request interceptor for auth token
apiClient.interceptors.request.use((config) => {
  const token = getAuthToken();
  if (token) {
    config.headers.Authorization = `Bearer ${token}`;
  }
  return config;
});

// Response interceptor for error handling
apiClient.interceptors.response.use(
  (response) => response,
  (error) => {
    if (error.response?.status === 401) {
      // Handle unauthorized
      window.location.href = '/login';
    }
    return Promise.reject(error);
  }
);

// frontend/shared/hooks/useDebounce.ts
import { useEffect, useState } from 'react';

export function useDebounce<T>(value: T, delay: number): T {
  const [debouncedValue, setDebouncedValue] = useState<T>(value);

  useEffect(() => {
    const handler = setTimeout(() => {
      setDebouncedValue(value);
    }, delay);

    return () => {
      clearTimeout(handler);
    };
  }, [value, delay]);

  return debouncedValue;
}

Benefits of Domain-Aligned Frontend:

  1. Developer Efficiency: Find all WordPress-related code in one place
  2. Reduced Cognitive Load: Working on a feature doesn't require context switching across directories
  3. Clear Ownership: Teams can own entire feature slices (frontend + backend)
  4. Parallel Development: Multiple teams work on different features without conflicts
  5. Easier Onboarding: New developers understand feature boundaries quickly
  6. Better Code Reuse: Shared components are explicitly separated from feature-specific code

Auto-Generated API Clients:

Use openapi-typescript-codegen to generate type-safe API clients:

# Generate TypeScript clients from FastAPI OpenAPI spec
npx openapi-typescript-codegen --input http://localhost:8000/openapi.json \
  --output ./frontend/shared/lib/openapi-client \
  --client axios

This generates:

  • Type-safe request/response interfaces
  • API service methods
  • Full IntelliSense support

State Management Strategy:

  • Server State: React Query (@tanstack/react-query) for API data caching
  • Client State: Zustand for global UI state (theme, sidebar, modals)
  • Form State: React Hook Form for form management
  • URL State: Next.js router for navigation state

// frontend/shared/store/auth-store.ts
import { create } from 'zustand';
import { persist } from 'zustand/middleware';
import type { User } from '@/features/users/types'; // illustrative import path for the User type

interface AuthState {
  user: User | null;
  token: string | null;
  setAuth: (user: User, token: string) => void;
  clearAuth: () => void;
}

export const useAuthStore = create<AuthState>()(
  persist(
    (set) => ({
      user: null,
      token: null,
      setAuth: (user, token) => set({ user, token }),
      clearAuth: () => set({ user: null, token: null }),
    }),
    {
      name: 'auth-storage',
    }
  )
);

7.2.10 Comprehensive Testing Strategy and Structure

A robust testing strategy is essential for maintaining quality at scale. The MBPanel testing framework covers unit, integration, end-to-end, and performance testing across both frontend and backend.

Testing Pyramid:

         /\
        /  \  E2E Tests (10%)
       /____\
      /      \
     / Integration Tests (30%)
    /________\
   /          \
  /  Unit Tests (60%)
 /______________\

Backend Testing Structure (backend/app/tests/):

backend/app/tests/
├── conftest.py              # ⭐ Shared pytest fixtures
├── test_main.py             # Main app tests
├── integration/             # Cross-module integration tests
│   ├── __init__.py
│   ├── test_site_creation_flow.py
│   ├── test_backup_restore_flow.py
│   └── test_user_auth_flow.py
├── users/                   # Per-module tests
│   ├── __init__.py
│   ├── test_router.py       # API endpoint tests
│   ├── test_service.py      # Business logic tests
│   └── test_repository.py   # Data access tests
├── wordpress/               # WordPress module tests
│   ├── __init__.py
│   ├── test_router.py
│   ├── test_service.py
│   └── test_repository.py
└── ... (one folder per module)

Shared Test Fixtures (backend/app/tests/conftest.py):

import pytest
from fastapi.testclient import TestClient
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from app.main import app
from app.database.database import Base
from app.database.deps import get_db

# Test database URL
SQLALCHEMY_TEST_DATABASE_URL = "postgresql://test:test@localhost:5432/test_db"

# Test engine
engine = create_engine(SQLALCHEMY_TEST_DATABASE_URL)
TestingSessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)

@pytest.fixture(scope="function")
def db():
    """Create test database and yield session"""
    Base.metadata.create_all(bind=engine)
    db = TestingSessionLocal()
    try:
        yield db
    finally:
        db.close()
        Base.metadata.drop_all(bind=engine)

@pytest.fixture(scope="function")
def client(db):
    """Test client with overridden database dependency"""
    def override_get_db():
        try:
            yield db
        finally:
            db.close()

    app.dependency_overrides[get_db] = override_get_db
    with TestClient(app) as test_client:
        yield test_client
    app.dependency_overrides.clear()

@pytest.fixture
def auth_headers(client, db):
    """Get authentication headers for testing protected endpoints"""
    # Create test user
    from app.users.repository import create_user
    from app.users.schema import UserCreate

    user_data = UserCreate(
        email="test@example.com",
        password="testpassword123",
        full_name="Test User"
    )
    user = create_user(db, user_data)

    # Login and get token
    response = client.post("/api/v1/auth/login", json={
        "email": "test@example.com",
        "password": "testpassword123"
    })
    token = response.json()["access_token"]

    return {"Authorization": f"Bearer {token}"}

@pytest.fixture
def team_factory(db):
    """Factory for creating test teams"""
    from app.teams.repository import create_team
    from app.teams.schema import TeamCreate

    def _create_team(name: str, owner_id: int):
        team_data = TeamCreate(name=name, owner_id=owner_id)
        return create_team(db, team_data)

    return _create_team

Example Module Tests:

# backend/app/tests/wordpress/test_router.py
import pytest
from fastapi.testclient import TestClient

def test_get_wordpress_info(client, auth_headers, db):
    """Test getting WordPress information"""
    # Create test site (create_test_site is assumed to be a shared helper, e.g. in conftest.py)
    site = create_test_site(db)

    response = client.get(
        f"/api/v1/wordpress/{site.id}",
        headers=auth_headers
    )

    assert response.status_code == 200
    data = response.json()
    assert "version" in data
    assert "is_active" in data

def test_execute_wp_cli_command(client, auth_headers, db):
    """Test executing WP-CLI command"""
    site = create_test_site(db)

    response = client.post(
        f"/api/v1/wordpress/{site.id}/cli",
        headers=auth_headers,
        json={"command": "core version"}
    )

    assert response.status_code == 200
    data = response.json()
    assert "output" in data
    assert data["exit_code"] == 0

# backend/app/tests/wordpress/test_service.py
import pytest
from app.wordpress.service import WordPressService

def test_wordpress_service_get_info(db):
    """Test WordPress service get_info method"""
    service = WordPressService()
    site = create_test_site(db)

    info = service.get_wordpress_info(db, site.id)

    assert info is not None
    assert hasattr(info, 'version')

# backend/app/tests/integration/test_site_creation_flow.py
import pytest

def test_complete_site_creation_flow(client, auth_headers, db):
    """Test complete site creation workflow"""
    # Step 1: Create team
    team_response = client.post(
        "/api/v1/teams/",
        headers=auth_headers,
        json={"name": "Test Team"}
    )
    team_id = team_response.json()["id"]

    # Step 2: Create site
    site_response = client.post(
        "/api/v1/sites/",
        headers=auth_headers,
        json={
            "team_id": team_id,
            "name": "Test Site",
            "domain": "test.example.com"
        }
    )
    assert site_response.status_code == 201
    site_id = site_response.json()["id"]

    # Step 3: Create environment
    env_response = client.post(
        f"/api/v1/sites/{site_id}/environments",
        headers=auth_headers,
        json={"name": "production"}
    )
    assert env_response.status_code == 201

    # Step 4: Install WordPress
    wp_response = client.post(
        f"/api/v1/wordpress/{site_id}/install",
        headers=auth_headers
    )
    assert wp_response.status_code == 200

Frontend Testing Structure (frontend/tests/):

frontend/tests/
├── unit/                    # Unit tests for components/hooks
│   ├── components/
│   │   └── Button.test.tsx
│   └── hooks/
│       └── useDebounce.test.ts
├── integration/             # Integration tests
│   └── features/
│       └── wordpress/
│           └── WordPressFlow.test.tsx
└── e2e/                     # End-to-end tests
    ├── auth.spec.ts
    ├── site-creation.spec.ts
    └── wordpress.spec.ts

Example Frontend Tests:

// frontend/tests/unit/hooks/useDebounce.test.ts
import { renderHook, act } from '@testing-library/react';
import { useDebounce } from '@/shared/hooks/useDebounce';

describe('useDebounce', () => {
  it('should debounce value changes', async () => {
    const { result, rerender } = renderHook(
      ({ value, delay }) => useDebounce(value, delay),
      { initialProps: { value: 'initial', delay: 500 } }
    );

    expect(result.current).toBe('initial');

    rerender({ value: 'updated', delay: 500 });
    expect(result.current).toBe('initial'); // Still initial

    await act(() => new Promise(resolve => setTimeout(resolve, 600)));
    expect(result.current).toBe('updated'); // Now updated
  });
});

// frontend/tests/e2e/wordpress.spec.ts (using Playwright)
import { test, expect } from '@playwright/test';

test('WordPress CLI execution flow', async ({ page }) => {
  // Login
  await page.goto('/login');
  await page.fill('[name="email"]', 'test@example.com');
  await page.fill('[name="password"]', 'testpassword123');
  await page.click('button[type="submit"]');

  // Navigate to WordPress section
  await page.goto('/sites/1/wordpress');

  // Execute WP-CLI command
  await page.fill('[name="command"]', 'core version');
  await page.click('button:has-text("Execute")');

  // Verify output
  await expect(page.locator('.wp-cli-output')).toContainText('5.9');
});

End-to-End Tests (tests/e2e/):

# tests/e2e/test_site_creation_flow.py
import pytest
from playwright.sync_api import Page, expect

def test_complete_site_creation_e2e(page: Page):
    """End-to-end test for site creation"""
    # Login
    page.goto("http://localhost:3000/login")
    page.fill("[name=email]", "test@example.com")
    page.fill("[name=password]", "testpassword123")
    page.click("button[type=submit]")

    # Wait for dashboard
    expect(page.locator("h1")).to_contain_text("Dashboard")

    # Create site
    page.click("text=Create Site")
    page.fill("[name=name]", "Test Site")
    page.fill("[name=domain]", "test.example.com")
    page.click("button:has-text('Create')")

    # Verify site created
    expect(page.locator(".site-list")).to_contain_text("Test Site")

Performance Tests (tests/performance/):

# tests/performance/load_tests.py
from locust import HttpUser, task, between

class MBPanelUser(HttpUser):
    wait_time = between(1, 3)

    def on_start(self):
        """Login before tests"""
        response = self.client.post("/api/v1/auth/login", json={
            "email": "test@example.com",
            "password": "testpassword123"
        })
        self.token = response.json()["access_token"]
        self.headers = {"Authorization": f"Bearer {self.token}"}

    @task(10)
    def list_sites(self):
        """Test listing sites endpoint"""
        self.client.get("/api/v1/sites/", headers=self.headers)

    @task(5)
    def get_site_details(self):
        """Test getting site details"""
        self.client.get("/api/v1/sites/1", headers=self.headers)

    @task(2)
    def create_backup(self):
        """Test creating backup"""
        self.client.post("/api/v1/backups/", headers=self.headers, json={
            "site_id": 1,
            "type": "full"
        })

Testing Guidelines:

  • Unit Tests: ~60% of the suite; test individual functions and components
  • Integration Tests: ~30% of the suite; test module interactions
  • E2E Tests: ~10% of the suite; test critical user journeys
  • Performance Tests: Run weekly; track response times and throughput
  • Coverage Threshold: Minimum 80% overall code coverage
  • CI Integration: All tests run on every PR; failures block merge
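
The 80% gate can be enforced in CI by reading the Cobertura-style report that `pytest --cov --cov-report=xml` emits and failing the build below the threshold. A minimal sketch (the inline report fragment is illustrative):

```python
import xml.etree.ElementTree as ET

THRESHOLD = 80.0  # minimum overall coverage, per the guidelines above

def coverage_percent(xml_text: str) -> float:
    """Read the overall line-rate from a Cobertura-style coverage report."""
    root = ET.fromstring(xml_text)
    return float(root.get("line-rate", "0")) * 100

# Minimal fragment in the shape `pytest --cov --cov-report=xml` produces
sample_report = '<coverage line-rate="0.85" branch-rate="0.74"></coverage>'

percent = coverage_percent(sample_report)
print(f"Overall coverage: {percent:.1f}% (gate: {THRESHOLD}%)")
assert percent >= THRESHOLD, "coverage below threshold, fail the build"
```

In CI this would parse the real coverage.xml and `sys.exit(1)` below the gate so the PR is blocked.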

7.2.11 Operational Scripts and DevOps Tooling

Operational scripts automate common development, deployment, and maintenance tasks. These scripts live in backend/scripts/ and provide essential tooling for developers and operators.

Script Inventory:

backend/scripts/
├── migrate_data.py          # MySQL → PostgreSQL migration
├── seed_dev_data.py         # Development data seeding
├── health_check.py          # Health check utilities
├── generate_module.py       # Generate new module scaffold
├── validate_architecture.py # Validate module isolation rules
├── backup_database.py       # Database backup script
├── restore_database.py      # Database restore script
└── performance_benchmark.py # Performance benchmarking

1. Module Generator (generate_module.py):

#!/usr/bin/env python3
"""
Generate a new domain module with standard structure.

Usage:
    python scripts/generate_module.py --name payments --description "Payment processing module"
"""
import os
import argparse
from pathlib import Path

TEMPLATES = {
    "router.py": '''"""
{description}
API Router for {module_name}.
"""
from fastapi import APIRouter, Depends, HTTPException
from sqlalchemy.orm import Session
from app.database.deps import get_db, get_current_active_user
from . import service, schema

router = APIRouter(prefix="/{module_name}", tags=["{module_name}"])

@router.get("/", response_model=list[schema.{ModuleName}Read])
def list_{module_name}(
    db: Session = Depends(get_db),
    current_user = Depends(get_current_active_user),
    skip: int = 0,
    limit: int = 100
):
    """List all {module_name}"""
    return service.get_{module_name}_list(db, skip=skip, limit=limit)

@router.post("/", response_model=schema.{ModuleName}Read, status_code=201)
def create_{module_singular}(
    {module_singular}_data: schema.{ModuleName}Create,
    db: Session = Depends(get_db),
    current_user = Depends(get_current_active_user)
):
    """Create new {module_singular}"""
    return service.create_{module_singular}(db, {module_singular}_data)
''',
    "service.py": '''"""
{description}
Business logic for {module_name}.
"""
from sqlalchemy.orm import Session
from . import repository, schema

def get_{module_name}_list(db: Session, skip: int = 0, limit: int = 100):
    """Get list of {module_name}"""
    return repository.get_{module_name}(db, skip=skip, limit=limit)

def create_{module_singular}(db: Session, {module_singular}_data: schema.{ModuleName}Create):
    """Create new {module_singular}"""
    return repository.create_{module_singular}(db, {module_singular}_data)
''',
    # ... more templates
}

def generate_module(name: str, description: str):
    """Generate module with all standard files"""
    module_path = Path(f"app/{name}")
    module_path.mkdir(exist_ok=True)

    for filename, template in TEMPLATES.items():
        content = template.format(
            module_name=name,
            module_singular=name[:-1] if name.endswith('s') else name,  # rstrip('s') would over-strip names like "address"
            ModuleName=name.capitalize(),
            description=description
        )
        (module_path / filename).write_text(content)

    print(f"✅ Module '{name}' generated successfully!")

2. Architecture Validator (validate_architecture.py):

#!/usr/bin/env python3
"""
Validate Hybrid Modular DDD architecture rules.

Checks:
- No module imports from another module
- Modules only import from app.core.* and app.database.*
- All modules have standard files (router, service, repository, model, schema)
"""
import ast
import sys
from pathlib import Path
from typing import Set

# First-level package names (under app.) that any module may import from
ALLOWED_IMPORTS = {'core', 'database'}
REQUIRED_FILES = ['router.py', 'service.py', 'repository.py', 'model.py', 'schema.py']

def get_module_imports(module_path: Path) -> Set[str]:
    """Extract first-level app.* package names imported by a module"""
    imports = set()
    for py_file in module_path.rglob("*.py"):
        try:
            tree = ast.parse(py_file.read_text())
            for node in ast.walk(tree):
                if isinstance(node, ast.ImportFrom):
                    if node.module and node.module.startswith('app.'):
                        imports.add(node.module.split('.')[1])  # e.g. 'core' from 'app.core.config'
        except SyntaxError:
            pass  # skip unparseable files rather than aborting the whole check
    return imports

def validate_module_isolation():
    """Validate no module imports from other modules"""
    app_path = Path("app")
    modules = [d for d in app_path.iterdir() if d.is_dir() and not d.name.startswith('_')]

    violations = []

    for module in modules:
        if module.name in ['core', 'database', 'tests']:
            continue

        imports = get_module_imports(module)
        forbidden_imports = imports - ALLOWED_IMPORTS - {module.name}

        if forbidden_imports:
            violations.append(f"❌ Module '{module.name}' imports from: {forbidden_imports}")

    return violations

def validate_module_structure():
    """Validate all modules have required files"""
    app_path = Path("app")
    modules = [d for d in app_path.iterdir() if d.is_dir() and not d.name.startswith('_')]

    violations = []

    for module in modules:
        if module.name in ['core', 'database', 'tests']:
            continue

        for required_file in REQUIRED_FILES:
            if not (module / required_file).exists():
                violations.append(f"❌ Module '{module.name}' missing {required_file}")

    return violations

if __name__ == "__main__":
    print("🔍 Validating architecture...")

    isolation_violations = validate_module_isolation()
    structure_violations = validate_module_structure()

    if isolation_violations:
        print("\n❌ Module Isolation Violations:")
        for violation in isolation_violations:
            print(f"  {violation}")

    if structure_violations:
        print("\n❌ Module Structure Violations:")
        for violation in structure_violations:
            print(f"  {violation}")

    if not isolation_violations and not structure_violations:
        print("✅ Architecture validation passed!")
        sys.exit(0)
    else:
        sys.exit(1)

3. Development Data Seeder (seed_dev_data.py):

#!/usr/bin/env python3
"""
Seed development database with test data.

Usage:
    python scripts/seed_dev_data.py --users 10 --teams 5 --sites 20
"""
import argparse
from faker import Faker
from sqlalchemy.orm import Session
from app.database.database import SessionLocal
from app.users.repository import create_user
from app.teams.repository import create_team
from app.sites.repository import create_site

fake = Faker()

def seed_users(db: Session, count: int):
    """Seed users"""
    print(f"Seeding {count} users...")
    users = []
    for _ in range(count):
        user_data = {
            "email": fake.email(),
            "full_name": fake.name(),
            "password": "password123"
        }
        user = create_user(db, user_data)
        users.append(user)
    return users

def seed_teams(db: Session, users: list, count: int):
    """Seed teams"""
    print(f"Seeding {count} teams...")
    teams = []
    for _ in range(count):
        team_data = {
            "name": fake.company(),
            "owner_id": fake.random_element(users).id
        }
        team = create_team(db, team_data)
        teams.append(team)
    return teams

def seed_sites(db: Session, teams: list, count: int):
    """Seed sites"""
    print(f"Seeding {count} sites...")
    for _ in range(count):
        site_data = {
            "name": fake.domain_word(),
            "domain": fake.domain_name(),
            "team_id": fake.random_element(teams).id
        }
        create_site(db, site_data)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--users", type=int, default=10)
    parser.add_argument("--teams", type=int, default=5)
    parser.add_argument("--sites", type=int, default=20)
    args = parser.parse_args()

    db = SessionLocal()
    try:
        users = seed_users(db, args.users)
        teams = seed_teams(db, users, args.teams)
        seed_sites(db, teams, args.sites)
        db.commit()
        print("✅ Seeding completed successfully!")
    except Exception as e:
        db.rollback()
        print(f"❌ Seeding failed: {e}")
    finally:
        db.close()

4. Health Check Script (health_check.py):

#!/usr/bin/env python3
"""
Comprehensive health check for all services.

Checks:
- Database connectivity
- Redis connectivity
- API endpoints
- Worker processes
"""
import os
import sys

import psycopg2
import redis
import requests

def check_database():
    """Check PostgreSQL connectivity"""
    try:
        conn = psycopg2.connect(
            host=os.getenv("DB_HOST", "localhost"),
            database=os.getenv("DB_NAME", "mbpanel"),
            user=os.getenv("DB_USER", "mbpanel"),
            password=os.getenv("DB_PASSWORD", "")  # read from environment; never hard-code credentials
        )
        conn.close()
        print("✅ Database: OK")
        return True
    except Exception as e:
        print(f"❌ Database: FAILED - {e}")
        return False

def check_redis():
    """Check Redis connectivity"""
    try:
        r = redis.Redis(host='localhost', port=6379, db=0)
        r.ping()
        print("✅ Redis: OK")
        return True
    except Exception as e:
        print(f"❌ Redis: FAILED - {e}")
        return False

def check_api():
    """Check API health endpoint"""
    try:
        response = requests.get("http://localhost:8000/health", timeout=5)
        if response.status_code == 200:
            print("✅ API: OK")
            return True
        else:
            print(f"❌ API: FAILED - Status {response.status_code}")
            return False
    except Exception as e:
        print(f"❌ API: FAILED - {e}")
        return False

if __name__ == "__main__":
    print("🔍 Running health checks...\n")

    checks = [
        check_database(),
        check_redis(),
        check_api()
    ]

    if all(checks):
        print("\n✅ All health checks passed!")
        sys.exit(0)
    else:
        print("\n❌ Some health checks failed!")
        sys.exit(1)

Makefile for Common Commands:

# Makefile
.PHONY: help dev test lint format seed validate health module

help:
    @echo "MBPanel Development Commands"
    @echo "  make dev         - Start development environment"
    @echo "  make test        - Run all tests"
    @echo "  make lint        - Run linters"
    @echo "  make format      - Format code"
    @echo "  make seed        - Seed development data"
    @echo "  make validate    - Validate architecture"
    @echo "  make health      - Run health checks"

dev:
    docker-compose -f local-infra/docker-compose.dev.yml up -d
    # run the API in the background so the frontend dev server can start from the same target
    cd backend && uvicorn app.main:app --reload &
    cd frontend && npm run dev

test:
    cd backend && pytest
    cd frontend && npm test

lint:
    cd backend && ruff check .
    cd backend && mypy .
    cd frontend && npm run lint

format:
    cd backend && ruff format .
    cd frontend && npm run format

seed:
    cd backend && python scripts/seed_dev_data.py

validate:
    cd backend && python scripts/validate_architecture.py

health:
    cd backend && python scripts/health_check.py

module:
    @read -p "Module name: " name; \
    read -p "Description: " desc; \
    cd backend && python scripts/generate_module.py --name $$name --description "$$desc"

7.3 Monorepo and Decoupled Architecture

The MBPanel project implements a monorepo architecture with a decoupled frontend and backend system. This approach combines the benefits of a unified repository structure with the operational advantages of independently deployable services.

7.3.1 Monorepo Approach

The monorepo approach consolidates all codebases into a single repository while maintaining clear separation between components. This strategy provides several key benefits:

Benefits:

  • Unified Version Control: All components share the same commit history and versioning system
  • Simplified Dependency Management: Shared dependencies can be managed collectively
  • Atomic Changes: Cross-component changes can be committed atomically
  • Improved Collaboration: Teams can easily understand the entire system architecture
  • Consistent Tooling: Shared linting, testing, and build tools across all components
  • Easier Refactoring: Changes that span multiple components can be coordinated more effectively
  • Shared Infrastructure: Common CI/CD pipelines and development tools

Challenges:

  • Repository Size: Can become large and impact clone/pull performance
  • Build Complexity: Requires sophisticated build systems to handle cross-component dependencies
  • Permission Management: Requires careful access control to prevent unauthorized changes
  • Branch Management: Coordination required between teams working on different components
  • Testing Complexity: Comprehensive testing strategies needed for cross-component changes

7.3.2 Decoupled Architecture

The decoupled architecture separates the frontend and backend into independently deployable services while maintaining API contract integrity:

Frontend Independence:

  • Next.js application can be developed, tested, and deployed independently
  • Frontend team can iterate quickly without backend dependencies
  • Can implement frontend-specific performance optimizations
  • Supports multiple frontend applications consuming the same backend API

Backend Independence:

  • FastAPI backend can be scaled independently of frontend
  • Backend team can implement new features without affecting frontend stability
  • Supports multiple client types (web, mobile, third-party integrations)
  • Allows for backend technology upgrades without frontend changes

API Contract Management:

  • OpenAPI specifications serve as the contract between frontend and backend
  • Automated client generation ensures consistency between API and frontend
  • Versioned APIs allow for gradual migration and backward compatibility
  • Contract testing validates API compliance during CI/CD
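
As a sketch of what that contract testing can look like, a check can diff the candidate OpenAPI spec against the last released one and flag removed paths or operations as breaking changes (the spec fragments below are illustrative; real ones would be loaded from the backend's /openapi.json):

```python
def breaking_changes(old_spec: dict, new_spec: dict) -> list[str]:
    """Flag removed paths/operations between two OpenAPI specs (additions are non-breaking)."""
    problems = []
    old_paths = old_spec.get("paths", {})
    new_paths = new_spec.get("paths", {})
    for path, operations in old_paths.items():
        if path not in new_paths:
            problems.append(f"removed path: {path}")
            continue
        for method in operations:
            if method not in new_paths[path]:
                problems.append(f"removed operation: {method.upper()} {path}")
    return problems

# Illustrative fragments: the candidate spec drops DELETE on /api/v1/sites/{id}
released = {"paths": {"/api/v1/sites/": {"get": {}, "post": {}},
                      "/api/v1/sites/{id}": {"get": {}, "delete": {}}}}
candidate = {"paths": {"/api/v1/sites/": {"get": {}, "post": {}},
                       "/api/v1/sites/{id}": {"get": {}}}}

for problem in breaking_changes(released, candidate):
    print(f"❌ {problem}")
```

Run in CI, a non-empty result would trigger the 30-day deprecation policy rather than an immediate merge.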

7.3.3 Branching Strategy

The project implements a multi-branch strategy to support independent development while maintaining integration capabilities:

Branch Structure:

  • frontend-main: Dedicated branch for frontend development
  • Contains only frontend-related code (in the /frontend directory)
  • Frontend team works exclusively on this branch
  • Independent CI/CD pipeline for frontend-specific builds
  • Feature branches created from and merged back to this branch
  • Frontend-specific dependencies and configurations

  • backend-main: Dedicated branch for backend development
  • Contains only backend-related code (in the /backend directory)
  • Backend team works exclusively on this branch
  • Independent CI/CD pipeline for backend-specific builds
  • Feature branches created from and merged back to this branch
  • Backend-specific dependencies and configurations

  • LOCALDEV: Complete monorepo branch containing all files
  • Contains the full project structure with both frontend and backend
  • Used for local development and integration testing
  • Maintains the complete codebase for developers who need to work on both components
  • Serves as the integration point for cross-component changes
  • Used for end-to-end testing and local development environments

Branch Management Workflow:

  1. Developers working on frontend-only changes use frontend-main
  2. Developers working on backend-only changes use backend-main
  3. Developers working on cross-component features use LOCALDEV
  4. Cross-component changes are coordinated through pull requests to LOCALDEV
  5. Periodic synchronization ensures consistency between specialized branches and LOCALDEV

7.3.4 Git Workflow and Collaboration Guidelines

Collaboration Process:

  • Feature Development: Create feature branches from the appropriate main branch
  • Code Reviews: All changes require peer review before merging
  • Cross-Component Changes: Use the LOCALDEV branch for changes affecting both frontend and backend
  • Merge Strategy: Squash and merge for feature branches, merge commits for releases
  • Pull Request Requirements: Automated tests must pass, code coverage requirements met, security scans clear, and documentation updated

Team Coordination:

  • Frontend team coordinates API changes with backend team
  • Backend team provides advance notice of breaking API changes
  • Regular sync meetings to discuss cross-component dependencies
  • Shared documentation for API contracts and integration points
  • Joint planning sessions for features requiring both frontend and backend changes

7.3.5 Dependency and API Contract Management

Dependency Management:

  • Frontend Dependencies: Managed in /frontend/package.json
  • Backend Dependencies: Managed in /backend/requirements/
  • Shared Dependencies: Documented in /docs/dependencies.md
  • Version Synchronization: Automated tools ensure compatible versions
  • Security Updates: Automated scanning and updates for vulnerabilities

API Contract Management:

  • OpenAPI Specifications: Generated from FastAPI code and stored in /backend/app/api/specs/
  • Client Generation: Automated TypeScript client generation for frontend
  • Contract Testing: Automated tests validate API compliance
  • Versioning Strategy: Semantic versioning for API endpoints
  • Breaking Change Policy: 30-day deprecation notice for breaking changes
  • Documentation: Interactive API documentation available at /api/docs

Integration Validation:

  • Pre-commit hooks validate API contract compliance
  • CI/CD pipelines run integration tests between components
  • Automated contract testing in staging environment
  • Mock servers for testing frontend without backend dependencies
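
The mock-server piece can be as small as a stdlib HTTP server replaying canned fixtures. A sketch, with an illustrative fixture path and payload (a real setup would derive fixtures from the recorded API contract):

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Canned responses keyed by request path (illustrative payloads)
FIXTURES = {
    "/api/v1/sites/": [{"id": 1, "name": "Test Site", "domain": "test.example.com"}],
}

class MockApiHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = FIXTURES.get(self.path)
        self.send_response(200 if body is not None else 404)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps(body if body is not None else {"detail": "Not found"}).encode())

    def log_message(self, *args):
        pass  # keep test output quiet

server = HTTPServer(("127.0.0.1", 0), MockApiHandler)  # port 0 picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

with urllib.request.urlopen(f"http://127.0.0.1:{server.server_port}/api/v1/sites/") as resp:
    sites = json.loads(resp.read())
server.shutdown()

print(sites)
```

The frontend dev server would simply point NEXT_PUBLIC_API_URL at the mock, letting UI work proceed with no backend running.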

7.3.6 Deployment Strategies

Independent Deployments:

  • Frontend Deployment:
  • Static site deployment to CDN
  • Independent release cycle from backend
  • Blue-green deployment strategy
  • Rollback capability within seconds
  • A/B testing support for UI changes

  • Backend Deployment:
  • Containerized deployment with Docker
  • Rolling updates with health checks
  • Database migration management
  • API version compatibility during updates
  • Graceful degradation for service updates

Coordinated Deployments:

  • Breaking Changes: Coordinated deployments when API contracts change
  • Feature Flags: Enable/disable features without deployment coordination
  • Canary Releases: Gradual rollout of changes affecting both components
  • Rollback Coordination: Synchronized rollback procedures for integrated changes

Environment Strategy:

  • Development: Local development with full monorepo
  • Staging: Full integration testing environment
  • Production: Independent scaling of frontend and backend
  • Hotfix: Emergency procedures for critical bug fixes

7.3.7 CI/CD Pipeline Configuration

Multi-Branch Pipeline Setup:

  • Frontend Pipeline:
  • Triggered on frontend-main and frontend feature branches
  • Runs frontend-specific tests (unit, integration, e2e)
  • Builds and deploys frontend application
  • Performs accessibility and performance checks
  • Deploys to frontend-specific staging environment

  • Backend Pipeline:
  • Triggered on backend-main and backend feature branches
  • Runs backend-specific tests (unit, integration, load)
  • Builds and deploys backend services
  • Performs security scanning and compliance checks
  • Deploys to backend-specific staging environment

  • Integration Pipeline:
  • Triggered on LOCALDEV branch
  • Runs comprehensive integration tests
  • Validates API contracts between components
  • Performs end-to-end testing
  • Deploys to integrated staging environment

Pipeline Components:

  • Build Stages: Separate build processes for frontend and backend
  • Testing Stages: Component-specific and integration tests
  • Security Scanning: Automated vulnerability detection
  • Performance Testing: Load and performance validation
  • Deployment Stages: Environment-specific deployment configurations
  • Monitoring: Post-deployment health checks and monitoring setup

7.3.8 Component Interaction in Decoupled Architecture

Change Impact Management:

  • Frontend Changes: Minimal impact on backend, primarily UI/UX updates
  • Backend API Changes: Require coordination with the frontend team for integration
  • Breaking Changes: 30-day notification period with migration support
  • Non-Breaking Changes: Can be deployed independently with minimal coordination

Communication Protocols:

  • API-First Development: Backend defines API contracts before implementation
  • Event-Driven Architecture: Backend publishes events for frontend consumption
  • WebSocket Integration: Real-time communication for live updates
  • Health Checks: Component health monitoring and alerting

Monitoring and Observability:

  • Component-Specific Metrics: Independent performance and error monitoring
  • Cross-Component Dependencies: Track API call performance and error rates
  • End-User Monitoring: Track user experience across both components
  • Alerting Systems: Component-specific and cross-component alerting

7.4 Technology Stack

  • Backend: Python 3.11+, FastAPI, SQLAlchemy 2.0 (async), Pydantic v2, Hybrid Modular DDD Architecture
  • Frontend: Next.js 14, React 18, TypeScript
  • Database: PostgreSQL 15+ with PgBouncer connection pooling
  • Cache/Queue: Redis 7 (Cluster mode for production)
  • Message Queue: ARQ (Asynchronous Redis Queue) for background jobs
  • WebSocket: FastAPI native WebSocket with Redis Pub/Sub
  • Infrastructure: Docker & Docker Compose, Kubernetes-ready
  • Observability: Grafana, Loki, Prometheus, Tempo (distributed tracing)
  • Testing: Pytest, Playwright, Locust (performance testing)
  • CI/CD: GitHub Actions
  • Static Analysis: Ruff (linting), MyPy (type checking)

7.5 Multi-Developer Collaboration and Workflow Guidelines

For a web hosting panel serving millions of users with 10+ developers, clear collaboration guidelines are essential to maintain code quality, prevent conflicts, and ensure consistent development practices.

7.5.1 Team Organization and Ownership

Module Ownership Model:

  • Each domain module has a primary owner (developer or small team)
  • Owners are responsible for:
    • Code quality within their module
    • Reviewing PRs touching their module
    • Maintaining documentation for their module
    • Breaking down large features into tasks

Example Ownership Map:

Module Ownership:
├── users/          → Team: Core Platform (Lead: Alice)
├── auth/           → Team: Core Platform (Lead: Alice)
├── teams/          → Team: Core Platform (Lead: Bob)
├── sites/          → Team: Hosting Platform (Lead: Carol)
├── environments/   → Team: Hosting Platform (Lead: Carol)
├── backups/        → Team: Hosting Platform (Lead: Dave)
├── wordpress/      → Team: CMS Features (Lead: Eve)
├── domains/        → Team: Infrastructure (Lead: Frank)
├── payments/       → Team: Billing (Lead: Grace)
└── billing/        → Team: Billing (Lead: Grace)

Benefits:

  • Clear accountability for each feature
  • Faster PR reviews (owners have deep knowledge)
  • Reduced conflicts (teams work in different modules)
  • Easier onboarding (new devs join a specific team)
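The ownership map can be enforced mechanically with a GitHub CODEOWNERS file so the right reviewers are requested automatically on every PR. The team handles below are placeholders, not the organization's actual team names:

```text
# .github/CODEOWNERS (illustrative team handles)
/backend/app/users/         @mbpanel/core-platform
/backend/app/auth/          @mbpanel/core-platform
/backend/app/teams/         @mbpanel/core-platform
/backend/app/sites/         @mbpanel/hosting-platform
/backend/app/environments/  @mbpanel/hosting-platform
/backend/app/backups/       @mbpanel/hosting-platform
/backend/app/wordpress/     @mbpanel/cms-features
/backend/app/domains/       @mbpanel/infrastructure
/backend/app/payments/      @mbpanel/billing
/backend/app/billing/       @mbpanel/billing
```

Combined with branch protection, this makes the "code owner approval" step in the PR process automatic rather than a convention reviewers must remember.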

7.5.2 Git Workflow and Branching Strategy

Branch Structure:

Main Branches:
├── LOCALDEV          # Full monorepo (local development & integration)
├── backend-main      # Backend-only code
└── frontend-main     # Frontend-only code

Feature Branches:
├── feature/wordpress-cli-enhancement    # From backend-main or LOCALDEV
├── feature/site-dashboard-ui            # From frontend-main
└── bugfix/backup-restore-issue          # From appropriate main branch

Workflow:

  1. For Backend-Only Changes:

    git checkout backend-main
    git pull origin backend-main
    git checkout -b feature/add-wordpress-themes
    # Make changes in backend/ directory only
    git commit -m "feat(wordpress): add theme management endpoints"
    git push origin feature/add-wordpress-themes
    # Create PR to backend-main
    

  2. For Frontend-Only Changes:

    git checkout frontend-main
    git pull origin frontend-main
    git checkout -b feature/wordpress-theme-ui
    # Make changes in frontend/ directory only
    git commit -m "feat(wordpress): add theme management UI"
    git push origin feature/wordpress-theme-ui
    # Create PR to frontend-main
    

  3. For Cross-Component Changes:

    git checkout LOCALDEV
    git pull origin LOCALDEV
    git checkout -b feature/wordpress-theme-management
    # Make changes in both backend/ and frontend/
    git commit -m "feat(wordpress): add complete theme management feature"
    git push origin feature/wordpress-theme-management
    # Create PR to LOCALDEV
    

7.5.3 Commit Message Convention

Format: <type>(<scope>): <subject>

Types:

  • feat: New feature
  • fix: Bug fix
  • docs: Documentation only
  • style: Code style changes (formatting, no logic change)
  • refactor: Code refactoring
  • test: Adding or updating tests
  • chore: Build process or auxiliary tool changes

Scopes: Module names (users, wordpress, sites, etc.)

Examples:

feat(wordpress): add WP-CLI command execution endpoint
fix(backups): resolve restore failure for large databases
docs(architecture): update DDD module isolation rules
test(sites): add integration tests for site creation flow
refactor(auth): extract JWT logic to shared module
chore(ci): update GitHub Actions workflow for backend tests

Commit Body (Optional):

feat(wordpress): add WP-CLI command execution endpoint

- Implement execute_wp_cli_command service method
- Add validation for allowed commands
- Implement timeout and error handling
- Add background job support for long-running commands

Closes #123

7.5.4 Pull Request (PR) Guidelines

PR Template:

## Description
Brief description of changes

## Type of Change
- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Documentation update

## Module(s) Affected
- [ ] users
- [ ] wordpress
- [ ] sites
- [ ] Other: ___

## Testing
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing completed

## Checklist
- [ ] Code follows style guidelines (passed ruff, mypy)
- [ ] Self-review completed
- [ ] Documentation updated
- [ ] No new warnings generated
- [ ] Module isolation rules maintained (validated with `make validate`)

## Related Issues
Closes #123, #456

PR Review Process:

  1. Automated Checks (must pass before review):
    • Linting (Ruff, ESLint)
    • Type checking (MyPy, TypeScript)
    • Unit tests
    • Integration tests
    • Architecture validation (validate_architecture.py)
    • Code coverage threshold (80%+)

  2. Manual Review (2 approvals required):
    • Code owner approval (if module owner exists)
    • One additional team member approval
    • Architecture review for cross-module changes

  3. Merge Strategy:
    • Squash and merge for feature branches (clean history)
    • Merge commit for releases (preserve branch history)
    • Rebase and merge is discouraged (it replays every branch commit onto main, cluttering history)

7.5.5 Code Review Guidelines

For Reviewers:

Must Check:

  • Module isolation rules respected (no cross-module imports except core/database)
  • Proper use of shared kernel for cross-cutting concerns
  • All layers present in module (router, service, repository, model, schema)
  • Security considerations (authentication, authorization, input validation)
  • Error handling and edge cases covered
  • Tests cover new functionality (minimum 80% coverage)
  • Documentation updated (docstrings, comments for complex logic)

Nice to Have:

  • Performance considerations (N+1 queries, caching opportunities)
  • Code readability and maintainability
  • Consistent naming conventions
  • Reusable patterns extracted to shared utilities

Review Turnaround Time:

  • Small PRs (< 200 lines): 24 hours
  • Medium PRs (200-500 lines): 48 hours
  • Large PRs (500+ lines): 72 hours; prefer breaking them into smaller PRs

For Authors:

Before Submitting PR:

# Run checks locally
make lint          # Lint code
make format        # Auto-format code
make validate      # Validate architecture rules
make test          # Run all tests
make health        # Health check (if adding infrastructure changes)

During Review:

  • Respond to all comments (resolve, explain, or implement suggestions)
  • Request clarification if feedback is unclear
  • Update PR description if scope changes
  • Rebase if main branch has significant changes

7.5.6 Development Environment Setup

✅ Implementation Status: The local development environment setup has been fully implemented. See index.md for complete setup instructions.

Prerequisites:

  • Python 3.11+
  • Node.js 20+
  • Docker & Docker Compose
  • PostgreSQL 15+ (or via Docker)
  • Redis 7 (or via Docker)

Quick Setup (Automated):

# Run automated setup script
make setup

# Start development environment
make dev

Manual Setup Steps:

# Clone repository
git clone https://github.com/yourorg/mbpanel.git
cd mbpanel

# Backend setup
cd backend
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements/dev.txt

# Database setup
docker-compose -f local-infra/docker-compose.dev.yml up -d postgres redis
alembic upgrade head
python scripts/seed_dev_data.py --users 10 --teams 5 --sites 20

# Frontend setup
cd ../frontend
npm install
npm run dev

# Backend server (in separate terminal)
cd backend
uvicorn app.main:app --reload --port 8000

# Verify setup
make health

Daily Development Workflow:

# Start services
make dev

# Work on feature
git checkout -b feature/my-feature
# Make changes...

# Before committing
make lint
make format
make validate
make test

# Commit and push
git add .
git commit -m "feat(module): description"
git push origin feature/my-feature

# Create PR on GitHub

7.5.7 Documentation Standards

Module Documentation (docs/development/modules/[MODULE_NAME].md):

Each module should have documentation covering:

  • Purpose: What the module does
  • Dependencies: Which core/database components it uses
  • API Endpoints: List of routes with request/response examples
  • Business Logic: Key workflows and use cases
  • Database Schema: Tables and relationships
  • Integration Points: How other modules interact with it
  • Testing: How to test the module
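A matching skeleton for a module page might look like the following; the section headings mirror the list above, and the filename is illustrative:

```text
# docs/development/modules/wordpress.md
## Purpose
## Dependencies (core/database components used)
## API Endpoints (with request/response examples)
## Business Logic (key workflows)
## Database Schema (tables and relationships)
## Integration Points (how other modules interact with it)
## Testing (how to run and extend the module's tests)
```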

Code Documentation:

Docstrings (Google Style):

def create_wordpress_site(db: Session, site_data: WordPressCreate) -> WordPress:
    """Create a new WordPress installation for a site.

    Args:
        db: Database session
        site_data: WordPress creation data containing site_id, version, etc.

    Returns:
        Created WordPress instance

    Raises:
        HTTPException: If site doesn't exist or already has WordPress

    Example:
        >>> site_data = WordPressCreate(site_id=1, version="6.4")
        >>> wp = create_wordpress_site(db, site_data)
        >>> print(wp.version)
        '6.4'
    """
    pass

Inline Comments:

  • Use for complex logic that isn't immediately obvious
  • Explain why, not what
  • Keep comments up-to-date with code changes

7.5.8 Continuous Integration (CI) Pipeline

GitHub Actions Workflow (.github/workflows/backend-ci.yml):

name: Backend CI

on:
  pull_request:
    branches: [backend-main, LOCALDEV]
    paths:
      - 'backend/**'
  push:
    branches: [backend-main]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          cd backend
          pip install -r requirements/dev.txt
      - name: Run Ruff
        run: |
          cd backend
          ruff check .
      - name: Run MyPy
        run: |
          cd backend
          mypy .

  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_PASSWORD: test
          POSTGRES_DB: test_db
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
      redis:
        image: redis:7
        options: >-
          --health-cmd "redis-cli ping"
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          cd backend
          pip install -r requirements/dev.txt
      - name: Run tests
        run: |
          cd backend
          pytest --cov=app --cov-report=xml --cov-report=term
      - name: Upload coverage
        uses: codecov/codecov-action@v3
        with:
          files: ./backend/coverage.xml

  architecture-validation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Validate architecture
        run: |
          cd backend
          python scripts/validate_architecture.py
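The architecture-validation step could work roughly as follows. This is a hypothetical sketch of what scripts/validate_architecture.py might check, not its actual contents: it flags any `app.*` import in a domain module that is neither the module's own package nor the shared `app.core`/`app.database` packages.

```python
# Hypothetical module-isolation checker (sketch, not the real script)
import ast

ALLOWED_SHARED = ("app.core", "app.database")

def find_violations(source: str, module: str) -> list[str]:
    """Return app.* imports that break isolation for the given domain module."""
    own = f"app.{module}"
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            # Only app.* imports are policed; stdlib/third-party are fine
            if name.startswith("app.") and not (
                name.startswith(own) or name.startswith(ALLOWED_SHARED)
            ):
                violations.append(name)
    return violations
```

The real script would walk `backend/app/*/` and exit non-zero on any violation, which is what lets CI fail the build.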

Protection Rules:

  • Require passing CI checks before merge
  • Require 2 approvals
  • Require branch to be up-to-date before merging
  • Prevent force pushes to main branches

7.5.9 Communication and Coordination

Daily Standup (Async via Slack):

- Yesterday: Completed WordPress theme management API
- Today: Working on frontend UI for theme selection
- Blockers: Waiting for design review on theme preview

Weekly Planning:

  • Review upcoming features from product backlog
  • Assign module ownership for new features
  • Identify cross-module dependencies
  • Plan integration testing for complex features

Documentation Channels:

  • Slack #dev: General development discussion
  • Slack #architecture: Architecture decisions and reviews
  • Slack #incidents: Production issues and postmortems
  • GitHub Discussions: Long-form technical discussions
  • Confluence/Notion: Architecture decision records (ADRs), runbooks

7.5.10 Conflict Resolution

Code Conflicts:

  • Module isolation minimizes conflicts
  • If conflicts occur in shared code (core/, database/), coordinate with the team
  • Prefer pull-based updates (rebase your branch on the latest main) over push-based updates (force pushing)

Design Conflicts:

  • Escalate to tech lead or architect
  • Document the decision in an ADR (Architecture Decision Record)
  • Communicate the decision to all team members

Performance/Scalability Concerns:

  • Load test before deploying to production
  • Monitor metrics post-deployment
  • Roll back if performance degrades

7.6 Performance and Scalability Patterns for High-Traffic Systems

For a web hosting control panel serving millions of users, performance and scalability are critical. This section outlines proven patterns and best practices for high-traffic FastAPI applications.

7.6.1 Database Optimization Patterns

1. Connection Pooling with PgBouncer:

# backend/app/database/database.py
from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

# Use PgBouncer for connection pooling (external)
engine = create_engine(
    "postgresql://user:pass@pgbouncer:6432/mbpanel",
    pool_pre_ping=True,
    poolclass=NullPool,  # Let PgBouncer handle pooling
    echo=False
)

# PgBouncer config (pgbouncer.ini)
# [databases]
# mbpanel = host=postgres port=5432 dbname=mbpanel
#
# [pgbouncer]
# pool_mode = transaction
# max_client_conn = 1000
# default_pool_size = 25

2. Read Replicas for Read-Heavy Operations:

# backend/app/database/database.py
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# Primary database (read-write)
engine_primary = create_engine("postgresql://user:pass@primary:5432/mbpanel")
SessionPrimary = sessionmaker(bind=engine_primary)

# Read replica (read-only)
engine_replica = create_engine("postgresql://user:pass@replica:5432/mbpanel")
SessionReplica = sessionmaker(bind=engine_replica)

# Usage in repository
def get_sites_list(team_id: int, skip: int = 0, limit: int = 100):
    """List sites (read-only, so route the query to the replica)"""
    replica_db = SessionReplica()
    try:
        query = replica_db.query(Site).filter(Site.team_id == team_id)
        return query.offset(skip).limit(limit).all()
    finally:
        replica_db.close()

3. Query Optimization:

# BAD: N+1 query problem
def get_sites_with_environments(db: Session, team_id: int):
    sites = db.query(Site).filter(Site.team_id == team_id).all()
    for site in sites:
        # Triggers additional query for each site!
        environments = db.query(Environment).filter(Environment.site_id == site.id).all()
    return sites

# GOOD: Eager loading with joinedload
from sqlalchemy.orm import joinedload

def get_sites_with_environments(db: Session, team_id: int):
    sites = db.query(Site).options(
        joinedload(Site.environments)
    ).filter(Site.team_id == team_id).all()
    return sites  # All environments loaded in single query

4. Database Indexing Strategy:

# backend/app/sites/model.py
from sqlalchemy import Column, Integer, String, Index
from app.database.database import Base
from app.database.mixins import TimestampMixin, TeamScopedMixin

class Site(Base, TimestampMixin, TeamScopedMixin):
    __tablename__ = "sites"

    id = Column(Integer, primary_key=True, index=True)
    name = Column(String(255), nullable=False)
    domain = Column(String(255), nullable=False, unique=True)
    # team_id from TeamScopedMixin, already indexed

    # Composite index for common queries
    __table_args__ = (
        Index('ix_sites_team_created', 'team_id', 'created_at'),
        Index('ix_sites_domain_active', 'domain', 'is_active'),
    )

7.6.2 Caching Strategies

1. Multi-Tier Caching:

# backend/app/core/cache.py
import hashlib
import pickle
from functools import wraps

import redis

redis_client = redis.Redis(
    host='redis',
    port=6379,
    db=0,
    decode_responses=False  # For pickle
)

def cache_result(key_prefix: str, ttl: int = 300):
    """Cache decorator for expensive operations"""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            # Build a deterministic cache key: the builtin hash() is randomized
            # per process, which would defeat caching across workers
            arg_digest = hashlib.md5((str(args) + str(kwargs)).encode()).hexdigest()
            cache_key = f"{key_prefix}:{func.__name__}:{arg_digest}"

            # Try to get from cache
            cached = redis_client.get(cache_key)
            if cached:
                return pickle.loads(cached)

            # Execute function
            result = await func(*args, **kwargs)

            # Store in cache
            redis_client.setex(cache_key, ttl, pickle.dumps(result))

            return result
        return wrapper
    return decorator

# Usage in service
from app.core.cache import cache_result

@cache_result(key_prefix="sites", ttl=60)
async def get_site_statistics(db: Session, site_id: int):
    """Get site statistics (cached for 60 seconds)"""
    # Expensive computation...
    return statistics

2. Cache Invalidation:

# backend/app/core/cache.py
def invalidate_cache(pattern: str):
    """Invalidate all cache keys matching pattern"""
    for key in redis_client.scan_iter(match=pattern):
        redis_client.delete(key)

# Usage after data modification
from app.core.cache import invalidate_cache

def update_site(db: Session, site_id: int, site_data: SiteUpdate):
    """Update site and invalidate related caches"""
    # Update database
    site = repository.update_site(db, site_id, site_data)
    db.commit()

    # Invalidate caches
    invalidate_cache(f"sites:*:{site_id}:*")
    invalidate_cache(f"sites:list:team:{site.team_id}:*")

    return site

3. Application-Level Caching (LRU):

from functools import lru_cache

@lru_cache(maxsize=128)
def get_wordpress_allowed_commands() -> list[str]:
    """Get list of allowed WP-CLI commands (cached in memory)"""
    return [
        "core version",
        "plugin list",
        "theme list",
        "user list",
        # ... more commands
    ]

7.6.3 Asynchronous Processing

1. Background Jobs with ARQ:

# backend/app/jobs/worker.py
from arq import create_pool
from arq.connections import RedisSettings

async def startup(ctx):
    ctx['redis'] = await create_pool(RedisSettings())

async def shutdown(ctx):
    await ctx['redis'].close()

# Job definitions
async def create_full_backup(ctx, site_id: int):
    """Background job for creating full site backup"""
    # Long-running operation
    db = SessionLocal()
    try:
        # Create backup logic...
        pass
    finally:
        db.close()

async def install_wordpress(ctx, site_id: int, version: str):
    """Background job for WordPress installation"""
    # Long-running operation
    pass

class WorkerSettings:
    functions = [create_full_backup, install_wordpress]
    on_startup = startup
    on_shutdown = shutdown
    redis_settings = RedisSettings()
    job_timeout = 3600  # 1 hour
    max_jobs = 10

2. Enqueue Jobs from API:

# backend/app/backups/router.py
from arq import create_pool
from arq.connections import RedisSettings

@router.post("/backups/", status_code=202)
async def create_backup(
    backup_data: BackupCreate,
    db: Session = Depends(get_db),
    current_user = Depends(get_current_active_user)
):
    """Create backup (async operation)"""
    # Validate request
    site = service.get_site(db, backup_data.site_id)
    if not site:
        raise HTTPException(status_code=404, detail="Site not found")

    # Enqueue background job (for production, create this pool once at
    # application startup and reuse it instead of opening one per request)
    redis = await create_pool(RedisSettings())
    job = await redis.enqueue_job('create_full_backup', backup_data.site_id)

    return {
        "job_id": job.job_id,
        "status": "queued",
        "message": "Backup job queued successfully"
    }

7.6.4 Rate Limiting and Throttling

# backend/app/core/rate_limit.py
from fastapi import HTTPException, Request
from redis import Redis
import time

redis_client = Redis(host='redis', port=6379, db=1)

async def rate_limit(
    request: Request,
    calls: int = 100,
    period: int = 60
):
    """Rate limit middleware: {calls} requests per {period} seconds"""
    # Get user ID from request (assumes authentication)
    user_id = request.state.user.id if hasattr(request.state, 'user') else 'anonymous'

    key = f"rate_limit:{user_id}:{request.url.path}"
    current = redis_client.get(key)

    if current is None:
        redis_client.setex(key, period, 1)
        return

    current = int(current)
    if current >= calls:
        raise HTTPException(
            status_code=429,
            detail=f"Rate limit exceeded. Try again in {redis_client.ttl(key)} seconds"
        )

    redis_client.incr(key)

# Usage in router: wrap rate_limit in a dependency factory — a bare lambda
# won't work with Depends, because FastAPI needs the Request type annotation
# to know it should inject the request object
from app.core.rate_limit import rate_limit

def rate_limited(calls: int, period: int):
    async def dependency(request: Request):
        await rate_limit(request, calls=calls, period=period)
    return dependency

@router.post("/wordpress/{site_id}/cli")
async def execute_wp_cli(
    site_id: int,
    command: WpCliCommand,
    db: Session = Depends(get_db),
    current_user = Depends(get_current_active_user),
    _rate_limit = Depends(rate_limited(calls=10, period=60))
):
    """Execute WP-CLI command (rate limited: 10/min)"""
    pass

7.6.5 Horizontal Scaling Considerations

1. Stateless API Servers:

  • No session state stored on API servers
  • All session data in Redis
  • Load balancer can route to any instance

2. WebSocket Scaling:

# backend/app/websocket/connection.py
import json

import redis.asyncio as aioredis
from fastapi import WebSocket

class ConnectionManager:
    def __init__(self):
        self.active_connections: dict[int, list[WebSocket]] = {}
        self.redis = None

    async def connect(self, websocket: WebSocket, user_id: int):
        await websocket.accept()
        if user_id not in self.active_connections:
            self.active_connections[user_id] = []
        self.active_connections[user_id].append(websocket)

        # Subscribe to Redis pub/sub for cross-instance messages (a background
        # task must also read from pubsub and forward messages to the local
        # WebSockets; omitted here for brevity)
        if not self.redis:
            self.redis = await aioredis.from_url("redis://redis:6379")

        pubsub = self.redis.pubsub()
        await pubsub.subscribe(f"user:{user_id}")

    async def broadcast_to_user(self, user_id: int, message: dict):
        """Broadcast message to all user connections across all instances"""
        # Publish to Redis (all instances will receive)
        await self.redis.publish(f"user:{user_id}", json.dumps(message))

3. Database Connection Management:

  • Use PgBouncer to prevent connection exhaustion
  • Monitor active connections: SELECT count(*) FROM pg_stat_activity;
  • Set appropriate pool_size and max_overflow values
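A quick sanity check for those values: each app instance can open up to pool_size + max_overflow connections, and the total across all instances must stay below PostgreSQL's max_connections. The numbers below are illustrative.

```python
# Worst-case connection arithmetic for pool sizing (illustrative numbers)
def total_possible_connections(instances: int, pool_size: int, max_overflow: int) -> int:
    """Maximum connections the whole app tier can open against Postgres."""
    return instances * (pool_size + max_overflow)

# e.g. 6 instances x (10 + 5) = 90, which fits under max_connections = 100
assert total_possible_connections(6, 10, 5) <= 100
```

When routing through PgBouncer (section 7.6.1), the same arithmetic applies to PgBouncer's max_client_conn and default_pool_size instead.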

4. Health Checks for Load Balancer:

# backend/app/main.py
from fastapi import FastAPI, Response
from sqlalchemy import text

@app.get("/health", tags=["health"])
async def health_check():
    """Health check endpoint for load balancer"""
    # Check database connectivity (SQLAlchemy 2.0 requires text() for raw SQL)
    try:
        db = SessionLocal()
        db.execute(text("SELECT 1"))
        db.close()
    except Exception:
        return Response(status_code=503, content="Database unhealthy")

    # Check Redis connectivity
    try:
        redis_client.ping()
    except Exception:
        return Response(status_code=503, content="Redis unhealthy")

    return {"status": "healthy", "version": "1.0.0"}

7.6.6 External API Optimization Patterns

Context: The MBPanel system integrates with 40+ external services (Virtuozzo, Bunny CDN, Cloudflare, payment processors). External API calls represent a significant performance bottleneck if not optimized properly.

Key Optimization Strategies:

1. Connection Pooling for External APIs:

# backend/app/core/http_client.py
import httpx

# Connection pooling configuration
client = httpx.AsyncClient(
    timeout=httpx.Timeout(30.0),
    limits=httpx.Limits(
        max_keepalive_connections=20,   # Reuse up to 20 connections
        max_connections=100,              # Max concurrent connections
        keepalive_expiry=60.0            # Keep connections alive for 60s
    ),
    http2=True  # Enable HTTP/2 multiplexing (requires the h2 extra: pip install httpx[http2])
)

# Performance Impact:
# - Without pooling: DNS (20ms) + TLS handshake (50ms) + Request (30ms) = 100ms per request
# - With pooling: First request (100ms), subsequent requests (30ms) = 70% reduction

2. Response Caching for External API Calls:

# Cache frequently accessed external API responses
from app.core.cache import cache_result

@cache_result(key_prefix="virtuozzo", ttl=300)
async def fetch_environments(session_key: str):
    """Fetch environments from Virtuozzo (cached for 5 minutes)"""
    response = await virtuozzo_client.get(
        f"/getenvs?session={session_key}"
    )
    return response.json()

# Performance Impact:
# - First request: 90ms (API call)
# - Subsequent requests: 2ms (Redis cache)
# - Improvement: 98% faster for cached requests

3. Circuit Breaker Pattern:

# Prevent cascading failures when external APIs are down
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=300):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.last_failure_time = 0.0
        self.state = "CLOSED"  # CLOSED, OPEN, HALF_OPEN

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            # After the recovery timeout, allow one trial call (HALF_OPEN)
            if time.monotonic() - self.last_failure_time >= self.recovery_timeout:
                self.state = "HALF_OPEN"
            else:
                # Fail fast without calling the external API
                raise Exception("Circuit breaker OPEN")

        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        self.failure_count = 0
        self.state = "CLOSED"

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.monotonic()
        if self.failure_count >= self.failure_threshold:
            self.state = "OPEN"

# Performance Impact:
# - Prevents wasting resources on failing external APIs
# - Fails fast (0ms) instead of waiting for timeout (30s+)
# - Protects system from cascading failures

4. Retry with Exponential Backoff:

# Automatic retry for transient failures
import asyncio
import httpx

async def get_with_retry(client: httpx.AsyncClient, path: str, max_retries: int = 3):
    """GET with exponential backoff on transient transport errors."""
    for attempt in range(max_retries + 1):
        try:
            return await client.get(path)
        except httpx.TransportError:  # covers timeouts and network errors
            if attempt == max_retries:
                raise
            await asyncio.sleep(2 ** attempt)  # 1s, 2s, 4s backoff

# Performance Impact:
# - Recovers from transient failures automatically
# - Reduces error rate by 90%+ for flaky external APIs

5. Rate Limiting for External APIs:

# Prevent overwhelming external APIs (sliding one-second window)
import asyncio
from datetime import datetime, timedelta

class RateLimiter:
    def __init__(self, max_calls_per_second: int = 10):
        self.max_calls = max_calls_per_second
        self.calls = []
        self.lock = asyncio.Lock()

    async def check_rate_limit(self):
        async with self.lock:
            now = datetime.now()
            # Remove calls older than 1 second
            self.calls = [t for t in self.calls if now - t < timedelta(seconds=1)]

            if len(self.calls) >= self.max_calls:
                # Wait until oldest call expires
                sleep_time = 1.0 - (now - self.calls[0]).total_seconds()
                if sleep_time > 0:
                    await asyncio.sleep(sleep_time)

            self.calls.append(now)

# Performance Impact:
# - Prevents API throttling by external services
# - Maintains consistent throughput
# - Avoids 429 (Too Many Requests) errors

6. Idempotency for Safe Retries:

# Safe retries for mutation operations
idempotency_key = f"create_env:{user_id}:{env_name}"

response = await external_api_client.post(
    path="/create",
    json=data,
    headers={"Idempotency-Key": idempotency_key}
)

# Check for idempotent errors
if response.status_code == 400 and "already exists" in response.text:
    # Treat as success (idempotent operation)
    logger.info("Resource already exists, treating as success")
    return {"status": "success", "idempotent": True}

# Performance Impact:
# - Enables safe retries without duplicate operations
# - Prevents data corruption from duplicate requests
# - Reduces complexity in error handling

7. Parallel API Calls with asyncio:

# Execute multiple external API calls concurrently
import asyncio

async def fetch_all_data(site_id: int):
    # Execute in parallel instead of sequentially
    environment_task = virtuozzo_adapter.fetch_environment(site_id)
    backup_task = virtuozzo_adapter.fetch_backups(site_id)
    domain_task = cloudflare_adapter.fetch_dns_records(site_id)

    # Wait for all to complete
    environment, backups, domains = await asyncio.gather(
        environment_task,
        backup_task,
        domain_task
    )

    return {
        "environment": environment,
        "backups": backups,
        "domains": domains
    }

# Performance Impact:
# - Sequential: 90ms + 80ms + 70ms = 240ms
# - Parallel: max(90ms, 80ms, 70ms) = 90ms
# - Improvement: 62% faster

8. HTTP/2 Multiplexing:

# Enable HTTP/2 for multiple requests over single connection
client = httpx.AsyncClient(
    http2=True  # Enable HTTP/2 (requires the h2 extra: pip install httpx[http2])
)

# Performance Impact:
# - Reduces connection overhead
# - Enables request/response multiplexing
# - Reduces head-of-line blocking
# - 20-30% latency improvement for multiple requests

External API Performance Targets:

Metric                    | Target | Measurement Method
--------------------------|--------|----------------------------------------
Average API Call Latency  | <50ms  | P50 response time
P95 API Call Latency      | <150ms | P95 response time
Cache Hit Rate            | >80%   | Cache hits / total requests
Circuit Breaker Open Rate | <1%    | Opens / total time
Retry Success Rate        | >90%   | Successful retries / total retries
Connection Reuse Rate     | >90%   | Pooled connections / total connections
Rate Limit Violation Rate | 0%     | 429 errors / total requests

Monitoring External API Performance:

from prometheus_client import Counter, Histogram, Gauge

# Track external API performance
external_api_duration = Histogram(
    'external_api_duration_seconds',
    'External API call duration',
    ['adapter', 'endpoint']
)

external_api_errors = Counter(
    'external_api_errors_total',
    'External API errors',
    ['adapter', 'endpoint', 'error_type']
)

circuit_breaker_state = Gauge(
    'circuit_breaker_state',
    'Circuit breaker state (0=CLOSED, 1=OPEN, 2=HALF_OPEN)',
    ['adapter']
)

cache_hit_rate = Counter(
    'external_api_cache_hits',
    'Cache hits for external API calls',
    ['adapter', 'endpoint']
)

7.6.7 Monitoring and Observability

1. Metrics Instrumentation:

# backend/app/core/metrics.py
from prometheus_client import Counter, Histogram, Gauge
import time

# Request metrics
http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

http_request_duration_seconds = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint']
)

active_websocket_connections = Gauge(
    'active_websocket_connections',
    'Number of active WebSocket connections'
)

# Middleware to track metrics
from fastapi import Request
import time

@app.middleware("http")
async def track_metrics(request: Request, call_next):
    start_time = time.time()

    response = await call_next(request)

    duration = time.time() - start_time
    # Prefer the matched route template (e.g. "/users/{id}") over the raw path
    # in production; otherwise path parameters explode label cardinality.
    endpoint = request.url.path
    http_requests_total.labels(
        method=request.method,
        endpoint=endpoint,
        status=response.status_code
    ).inc()

    http_request_duration_seconds.labels(
        method=request.method,
        endpoint=endpoint
    ).observe(duration)

    return response

2. Performance Targets:
- API p95 latency: <200ms
- Database query p95: <50ms
- Cache hit rate: >80%
- Error rate: <0.1%
- Uptime: 99.9%
- External API p95 latency: <150ms
- External API cache hit rate: >80%

7.7 Database Sharding Strategy for Millions of Users

To support millions of concurrent users and billions of records, a single PostgreSQL database instance is insufficient. This section defines our enterprise-grade database sharding strategy using Citus (distributed PostgreSQL) for horizontal scalability.

7.7.1 Sharding Architecture Overview

Why Citus?
- PostgreSQL-native: Maintains full PostgreSQL compatibility
- Transparent sharding: Application code requires minimal changes
- Distributed queries: Automatic query parallelization across shards
- Reference tables: Replicated lookup tables across all shards
- Multi-tenant friendly: Natural sharding by tenant_id/team_id

Architecture Diagram:

┌─────────────────────────────────────────────────────────────┐
│                  MBPanel FastAPI Application                │
│                  (Connection to Coordinator)                │
└────────────────────┬────────────────────────────────────────┘
         ┌───────────▼──────────┐
         │  Citus Coordinator   │  ← Query planning & routing
         │  (PostgreSQL 15+)    │
         └───────────┬──────────┘
      ┌──────────────┼──────────────┐
      │              │              │
┌─────▼─────┐  ┌─────▼─────┐  ┌─────▼─────┐
│  Shard 1  │  │  Shard 2  │  │  Shard 3  │  ... Shard N
│ (Worker)  │  │ (Worker)  │  │ (Worker)  │
│ team_id:  │  │ team_id:  │  │ team_id:  │
│ 0-333M    │  │ 333M-666M │  │ 666M-999M │
└───────────┘  └───────────┘  └───────────┘

Each shard contains:
- environments (distributed table)
- nodes (distributed table)
- team-specific data (colocated with team_id)

Replicated across all shards:
- users (reference table)
- teams (reference table)
- api_keys (reference table)

7.7.2 Sharding Key Selection

Primary Sharding Key: team_id

Rationale:
1. Natural multi-tenancy: All WordPress sites belong to a team (even single-user teams)
2. Query locality: 95% of queries filter by team_id
3. Even distribution: Teams have similar data volumes (sites, backups, logs)
4. No hotspots: No single team dominates traffic (unlike user_id on many SaaS platforms)

Distribution Strategy:

-- Create distributed tables (run on coordinator)
SELECT create_distributed_table('environments', 'team_id');
SELECT create_distributed_table('nodes', 'team_id');
SELECT create_distributed_table('job_logs', 'team_id');
SELECT create_distributed_table('backups', 'team_id');
SELECT create_distributed_table('metrics', 'team_id');

-- Create reference tables (replicated to all shards)
SELECT create_reference_table('users');
SELECT create_reference_table('teams');
SELECT create_reference_table('api_keys');

Colocation Strategy:
- All tables sharded by team_id are colocated
- Enables local JOINs within a shard (no cross-shard queries)
- Example: a JOIN of environments and nodes for the same team executes on a single worker

7.7.3 Shard Sizing and Scaling

Initial Configuration (1M teams, 10M sites):
- Coordinator: 1 node (16 vCPU, 64GB RAM)
- Worker Shards: 8 nodes (8 vCPU, 32GB RAM each)
- Shard Size: ~125K teams per shard
- Data per Shard: ~250GB (estimated)

Scaling Trigger Points:

| Metric | Threshold | Action |
|--------|-----------|--------|
| Shard size | >500GB | Add 4 more workers, rebalance |
| Query latency p95 | >50ms | Add read replicas per shard |
| CPU utilization | >70% | Vertical scaling or add workers |
| Teams per shard | >500K | Add workers, redistribute |

Shard Rebalancing:

-- Rebalance shards after adding new workers
SELECT rebalance_table_shards('environments');
SELECT rebalance_table_shards('nodes');
-- ... other distributed tables

Automated Rebalancing Policy:
- Trigger rebalancing when shard size variance exceeds 20%
- Run during low-traffic windows (2am-4am UTC)
- Monitor query performance during rebalancing
- Roll back if p95 latency degrades by more than 30%
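
The variance trigger above can be sketched as a small check over shard sizes. This is illustrative, not a Citus API: the 20% threshold comes from this section, `needs_rebalance` is our own helper, and the SQL in the comments assumes the `citus_shards` view available in Citus 10+.

```python
# Sketch (assumption): decide when to trigger Citus shard rebalancing based on
# the spread of shard sizes; threshold and helper name are our own convention.

def needs_rebalance(shard_sizes_bytes: list[int], max_variance: float = 0.20) -> bool:
    """Return True when shard sizes deviate from the mean by more than max_variance."""
    if not shard_sizes_bytes:
        return False
    mean = sum(shard_sizes_bytes) / len(shard_sizes_bytes)
    if mean == 0:
        return False
    spread = (max(shard_sizes_bytes) - min(shard_sizes_bytes)) / mean
    return spread > max_variance

# Sizes could be pulled with something like:
#   SELECT shard_size FROM citus_shards WHERE table_name = 'environments'::regclass;
# and, if rebalancing is needed, run during the 2am-4am UTC window:
#   SELECT rebalance_table_shards('environments');
print(needs_rebalance([500, 510, 490, 505]))   # balanced shards
print(needs_rebalance([500, 510, 490, 700]))   # one oversized shard
```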

7.7.4 Query Patterns and Optimizations

Supported Query Patterns:

  1. Single-Tenant Queries (Fast Path - 95% of queries)

    -- Executes on single shard (local query)
    SELECT * FROM environments WHERE team_id = 12345;
    
    -- Multi-table JOIN within same team (local query)
    SELECT e.*, n.*
    FROM environments e
    JOIN nodes n ON e.id = n.environment_id
    WHERE e.team_id = 12345;
    

  2. Cross-Tenant Aggregations (Slow Path - 5% of queries)

    -- Parallel execution across all shards
    SELECT COUNT(*) FROM environments; -- Aggregated from all workers
    
    -- Distributed GROUP BY
    SELECT team_id, COUNT(*)
    FROM environments
    GROUP BY team_id;
    

  3. Reference Table Queries (Local - Replicated)

    -- Executes locally on each shard (no coordinator roundtrip)
    SELECT * FROM users WHERE email = 'user@example.com';
    

Query Optimization Guidelines:
- Always include team_id in the WHERE clause for distributed tables
- Avoid cross-shard JOINs (denormalize if necessary)
- Use reference tables for lookup data (<10M rows)
- Leverage Citus query pushdown for aggregations
- Use connection pooling (PgBouncer) per worker

7.7.5 High Availability for Sharded Database

Replication Strategy:
- Per-Shard Replication: Each worker runs as a primary with 2 standbys
- Coordinator Replication: Active-passive coordinator setup
- Automatic Failover: Patroni + etcd for consensus-based failover
- Failover Time: <30 seconds (RTO)
- Data Loss: <5 seconds (RPO with synchronous replication)

Architecture:

Coordinator:
├── coordinator-primary (active)
├── coordinator-standby-1 (sync replica)
└── coordinator-standby-2 (async replica)

Shard 1:
├── worker-1-primary (active)
├── worker-1-standby-1 (sync replica)
└── worker-1-standby-2 (async replica)

... (repeat for all shards)

Health Monitoring:
- Continuous replication lag monitoring (<5s threshold)
- Automatic promotion of a standby on primary failure
- Application-level connection retry with exponential backoff
- Alerts on failover events (PagerDuty integration)
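
The application-level retry mentioned above can be sketched as exponential backoff with jitter around a connect call. The delay schedule is illustrative, and `connect` stands in for any asyncpg/SQLAlchemy connection factory; during a <30s failover, retries bridge the gap until the promoted standby accepts connections.

```python
# Sketch: capped exponential backoff with jitter for reconnects during failover.
import asyncio
import random

def backoff_delays(base: float = 0.1, factor: float = 2.0, retries: int = 5, cap: float = 5.0):
    """Yield capped exponential delays (0.1, 0.2, 0.4, ... seconds) plus up to 10% jitter."""
    for attempt in range(retries):
        yield min(cap, base * factor ** attempt) * (1 + random.random() * 0.1)

async def connect_with_retry(connect, retries: int = 5):
    """Retry an async connect callable, sleeping between attempts; re-raise on exhaustion."""
    last_exc = None
    for delay in backoff_delays(retries=retries):
        try:
            return await connect()
        except ConnectionError as exc:  # e.g. primary unreachable mid-failover
            last_exc = exc
            await asyncio.sleep(delay)
    raise last_exc
```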

7.7.6 Backup Strategy for Sharded Database

Backup Requirements:
- Frequency: Continuous WAL archiving + daily base backups
- Retention: 30 days for base backups, 7 days for WAL
- Point-in-Time Recovery (PITR): Restore to any timestamp within 7 days
- Backup Storage: S3 with cross-region replication

Implementation:

# Using pgBackRest for each shard
# Coordinator backup
pgbackrest --stanza=coordinator --type=full backup

# Worker backups (parallel execution)
for shard in worker-{1..8}; do
  pgbackrest --stanza=$shard --type=full backup &
done
wait

# Incremental backups (hourly)
pgbackrest --stanza=coordinator --type=incr backup
for shard in worker-{1..8}; do
  pgbackrest --stanza=$shard --type=incr backup &
done

Disaster Recovery Testing:
- Monthly DR drills: Restore a random shard to verify backups
- Quarterly full cluster restore to a staging environment
- Documented recovery times for different scenarios:
  - Single shard failure: <10 minutes
  - Coordinator failure: <5 minutes
  - Full cluster failure: <2 hours

7.7.7 Migration Path from Single PostgreSQL to Citus

Phase 1: Preparation (Week 1-2)
1. Audit the current schema for sharding compatibility
2. Add team_id to all tables lacking it (backfill historical data)
3. Identify reference vs. distributed tables
4. Update queries to always include a team_id filter
5. Deploy the Citus coordinator + 2 initial workers (staging)

Phase 2: Parallel Run (Week 3-4)
1. Enable dual-write to both the legacy DB and Citus (staging)
2. Backfill historical data to Citus using pg_dump + COPY
3. Validate data consistency between legacy and Citus
4. Load test Citus with production traffic replay
5. Performance benchmarking (compare p50/p95/p99)

Phase 3: Cutover (Week 5)
1. Enable read-only mode on the legacy PostgreSQL
2. Final incremental sync to Citus
3. Switch the application to the Citus coordinator (feature flag)
4. Monitor for 24 hours with the rollback plan ready
5. Decommission the legacy PostgreSQL after a 7-day validation period

Rollback Plan:
- Feature flag to switch back to the legacy DB (<5 minutes)
- Citus → legacy reverse sync available for 7 days
- Automated health checks with automatic rollback if:
  - Error rate >1%
  - Query latency p95 >2x baseline
  - Any shard becomes unavailable
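
The automated rollback guardrails above can be sketched as a single decision function. The thresholds are taken from this section; the metric sources and the flag store (`feature_flags`) are assumptions.

```python
# Sketch of the cutover guardrail check; metric plumbing is assumed.

def should_rollback(error_rate: float,
                    p95_latency_ms: float,
                    baseline_p95_ms: float,
                    unavailable_shards: int) -> bool:
    """Return True when any cutover health check from the rollback plan is breached."""
    return (
        error_rate > 0.01                       # error rate >1%
        or p95_latency_ms > 2 * baseline_p95_ms  # p95 >2x baseline
        or unavailable_shards > 0                # any shard unavailable
    )

# e.g. a monitor loop would call feature_flags.set("db_backend", "legacy")
# (hypothetical flag store) when this returns True
print(should_rollback(error_rate=0.002, p95_latency_ms=180, baseline_p95_ms=150, unavailable_shards=0))
print(should_rollback(error_rate=0.002, p95_latency_ms=320, baseline_p95_ms=150, unavailable_shards=0))
```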

7.7.8 Cost Optimization for Sharded Database

Cost Breakdown (Estimated for 1M teams):

| Component | Spec | Monthly Cost |
|-----------|------|--------------|
| Coordinator | 16 vCPU, 64GB RAM | $500 |
| Workers (8x) | 8 vCPU, 32GB RAM each | $2,400 ($300 × 8) |
| Replicas (16x) | Same as workers (standby) | $4,800 ($300 × 16) |
| Backup Storage | 10TB S3 Standard | $230 |
| TOTAL | | $7,930/month |

Cost per User: $0.0079/month (for 1M teams)

Optimization Strategies:
1. Use spot instances for standby replicas (60% cost reduction)
2. Tiered storage: Move old WAL logs to S3 Glacier after 7 days
3. Auto-scaling workers: Add workers during peak hours, remove during off-peak
4. Compression: Use Citus columnar storage for append-heavy tables such as metrics and logs (saves ~40% disk)
5. Vacuuming: Aggressive autovacuum to reclaim space from deleted data

7.7.9 Monitoring and Alerting for Sharded Database

Key Metrics to Monitor:

| Metric | Threshold | Action |
|--------|-----------|--------|
| Shard query latency p95 | >50ms | Investigate slow queries |
| Replication lag | >5 seconds | Alert DBA, check network |
| Shard disk usage | >80% | Provision more storage |
| Connection pool saturation | >90% | Increase pool size |
| Cross-shard query % | >10% | Optimize queries |
| Failed shard health checks | >1 | Initiate failover |

Monitoring Stack:
- Prometheus: Metrics collection from all Citus nodes
- Grafana: Real-time dashboards for the coordinator and all shards
- PgWatch2: PostgreSQL-specific metrics (bloat, vacuum, locks)
- Alertmanager: PagerDuty integration for critical alerts

Sample Alerts:

# Prometheus alert rules
- alert: ShardHighLatency
  expr: pg_stat_statements_mean_time_seconds{shard!=""} > 0.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Shard {{ $labels.shard }} has high query latency"

- alert: ReplicationLagHigh
  expr: pg_replication_lag_seconds > 5
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Replication lag on {{ $labels.instance }} is {{ $value }}s"

7.7.10 Alternative Sharding Strategies Considered

Rejected Alternatives:

1. Application-Level Sharding
   - ❌ Requires custom connection routing logic in FastAPI
   - ❌ Complex schema migrations across shards
   - ❌ No support for cross-shard queries
   - ❌ Higher maintenance burden

2. Vitess (MySQL Sharding)
   - ❌ Requires migrating from PostgreSQL to MySQL
   - ❌ Loss of PostgreSQL-specific features (JSONB, CTEs)
   - ❌ Different query syntax and limitations

3. MongoDB Sharding
   - ❌ NoSQL requires application rewrites
   - ❌ Lack of ACID guarantees for multi-document transactions
   - ❌ Team lacks MongoDB expertise

4. CockroachDB (Distributed SQL)
   - ✅ Automatic sharding and rebalancing
   - ❌ Higher operational cost (3x vs. Citus)
   - ❌ Different SQL dialect (requires query rewrites)
   - ❌ Less mature ecosystem vs. PostgreSQL

Citus Advantages:
- ✅ PostgreSQL-compatible (minimal migration)
- ✅ Mature, fully open-source project (backed by Microsoft)
- ✅ Proven at scale (millions of shards in production)
- ✅ Strong community and enterprise support

7.8 Multi-Region Disaster Recovery (DR) Plan

For an enterprise-grade WordPress hosting platform serving millions of users, a comprehensive multi-region disaster recovery strategy is critical to ensure business continuity, data integrity, and regulatory compliance.

7.8.1 DR Objectives and Requirements

Recovery Objectives:
- Recovery Time Objective (RTO): <15 minutes for critical services
- Recovery Point Objective (RPO): <5 minutes of data loss (asynchronous cross-region replication)
- Availability Target: 99.99% uptime (52.56 minutes of downtime per year)
- Geographic Redundancy: Minimum 2 regions, 500+ miles apart

Disaster Scenarios Covered:
1. Regional Outage: Complete AWS/cloud provider region failure
2. Data Center Failure: Single availability zone (AZ) failure
3. Database Corruption: Logical corruption requiring point-in-time recovery
4. Ransomware/Cyberattack: Complete system compromise
5. Human Error: Accidental deletion or misconfiguration
6. Network Partition: Split-brain scenarios

7.8.2 Multi-Region Architecture

Primary Region: us-east-1 (Virginia)
Secondary Region: us-west-2 (Oregon)
Tertiary Region (Optional): eu-west-1 (Ireland) for GDPR compliance

Architecture Diagram:

┌──────────────────────────────────────────────────────────────┐
│                  Global Load Balancer (Route53)              │
│              Health Checks + Failover Routing                │
└────────────────┬─────────────────────────┬───────────────────┘
                 │                         │
     ┌───────────▼───────────┐   ┌────────▼────────────┐
     │   PRIMARY REGION      │   │   SECONDARY REGION  │
     │   (us-east-1)         │   │   (us-west-2)       │
     │                       │   │                     │
     │  ┌─────────────────┐  │   │  ┌───────────────┐ │
     │  │ FastAPI Cluster │  │   │  │ FastAPI (Hot) │ │
     │  │ (Active)        │  │   │  │ (Standby)     │ │
     │  └────────┬────────┘  │   │  └───────┬───────┘ │
     │           │           │   │          │         │
     │  ┌────────▼────────┐  │   │  ┌───────▼───────┐ │
     │  │ Citus Primary   │  │===│==│ Citus Replica │ │
     │  │ (Active Writer) │  │   │  │ (Async Rep)   │ │
     │  └────────┬────────┘  │   │  └───────┬───────┘ │
     │           │           │   │          │         │
     │  ┌────────▼────────┐  │   │  ┌───────▼───────┐ │
     │  │ Redis Primary   │  │===│==│ Redis Replica │ │
     │  └─────────────────┘  │   │  └───────────────┘ │
     │                       │   │                     │
     │  ┌─────────────────┐  │   │  ┌───────────────┐ │
     │  │ S3 Bucket       │  │===│==│ S3 Replica    │ │
     │  │ (Backups/Assets)│  │   │  │ (Cross-Region)│ │
     │  └─────────────────┘  │   │  └───────────────┘ │
     └───────────────────────┘   └─────────────────────┘

Replication Flow:
- Database: Asynchronous streaming replication (5-10 second lag)
- Redis: Redis Sentinel with async replication
- S3: Cross-region replication (CRR) enabled
- Application: Stateless, can run in both regions simultaneously

7.8.3 Failover Strategies

1. Automated Failover (Preferred)

Triggers:
- Primary region health checks fail for >60 seconds
- Database replication lag >30 seconds
- API error rate >5% for 5 minutes
- Manual trigger via the admin dashboard

Failover Process:

1. Route53 detects primary region failure (health check)
2. DNS failover to secondary region (TTL: 60 seconds)
3. Promote secondary database to primary (automated via Patroni)
4. Update application config to point to new primary DB
5. Scale up secondary region FastAPI instances
6. Send alerts to incident response team
7. Update status page (https://status.mbpanel.io)

Total Time: <5 minutes

Automated Failover Script:

#!/bin/bash
# Disaster Recovery Automated Failover Script
# (sketch: check_primary_region_health is an operator-defined helper, and the
#  kubectl context "us-west-2" must point at the secondary-region cluster)

# 1. Health Check Failure Detection
if [[ $(check_primary_region_health) == "FAILED" ]]; then
  echo "PRIMARY REGION FAILURE DETECTED"

  # 2. Promote Secondary Database
  patronictl failover --candidate secondary-db-1 --force

  # 3. Update DNS
  aws route53 change-resource-record-sets \
    --hosted-zone-id Z1234567890ABC \
    --change-batch file://failover-dns.json

  # 4. Scale Secondary Region
  kubectl --context us-west-2 scale deployment fastapi --replicas=20

  # 5. Update Application Config
  kubectl --context us-west-2 set env deployment/fastapi \
    DATABASE_HOST=secondary-db-1.us-west-2.rds.amazonaws.com

  # 6. Notify Incident Response Team (the PagerDuty REST API also requires a
  #    "From" header with a valid user email)
  curl -X POST https://api.pagerduty.com/incidents \
    -H "Authorization: Token token=$PAGERDUTY_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"incident": {"type": "incident", "title": "PRIMARY REGION FAILOVER"}}'

  # 7. Update Status Page
  curl -X POST "https://api.statuspage.io/v1/pages/$PAGE_ID/incidents" \
    -H "Authorization: OAuth $STATUSPAGE_TOKEN" \
    -d "incident[name]=Regional Failover in Progress" \
    -d "incident[status]=investigating"

  echo "FAILOVER COMPLETE"
fi

2. Manual Failover

Use Cases:
- Planned maintenance in the primary region
- Degraded performance (not a full outage)
- Compliance testing

Process:
1. Enable maintenance mode (read-only for 2 minutes)
2. Confirm the secondary DB is <5 seconds behind the primary
3. Manually promote the secondary DB to primary
4. Update DNS records (gradual weighted routing shift)
5. Monitor error rates and latency
6. Full cutover after validation (15-minute window)
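
The gradual weighted routing shift in step 4 maps to Route53 weighted record sets. A sketch, under assumptions: the record names, targets, and hosted zone ID are placeholders, and the resulting dict is what boto3's `change_resource_record_sets` expects as its `ChangeBatch` argument.

```python
# Sketch: build a Route53 ChangeBatch that splits traffic between regions by weight.

def weighted_change_batch(primary_weight: int, secondary_weight: int) -> dict:
    """Two weighted CNAMEs for the same name; Route53 routes traffic proportionally."""
    def record(set_id: str, target: str, weight: int) -> dict:
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "api.mbpanel.io.",            # placeholder record name
                "Type": "CNAME",
                "SetIdentifier": set_id,               # distinguishes weighted records
                "Weight": weight,
                "TTL": 60,
                "ResourceRecords": [{"Value": target}],
            },
        }
    return {"Changes": [
        record("primary-us-east-1", "lb.us-east-1.example.com", primary_weight),
        record("secondary-us-west-2", "lb.us-west-2.example.com", secondary_weight),
    ]}

# Shift 10% of traffic to the secondary region, then ramp up in later calls:
batch = weighted_change_batch(primary_weight=90, secondary_weight=10)
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="Z1234567890ABC", ChangeBatch=batch)
```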

7.8.4 Data Replication Strategy

Database Replication (Citus):

Primary Region (us-east-1):
├── Coordinator (primary)
│   ├── Standby 1 (synchronous, same AZ)
│   └── Standby 2 (asynchronous, us-west-2)  ← DR replica
├── Worker Shard 1 (primary)
│   ├── Standby 1 (synchronous, same AZ)
│   └── Standby 2 (asynchronous, us-west-2)
... (repeat for all shards)

Replication Lag Monitoring:
- Alert if lag >10 seconds (warning)
- Alert if lag >30 seconds (critical)
- Automatic failover if lag >60 seconds + primary failure

Redis Replication:
- Redis Sentinel with a 3-node quorum
- Asynchronous replication to the secondary region
- Acceptable data loss: last 5 seconds of cache/session data
- Session replay from the database if Redis is unavailable

Object Storage (S3) Replication:
- Cross-region replication (CRR) enabled
- Replication time: <15 minutes for 95% of objects
- Versioning enabled (30-day retention)
- Lifecycle policies archive old backups to Glacier

7.8.5 Backup Strategy

Backup Tiers:

| Backup Type | Frequency | Retention | Storage Location | RTO | RPO |
|-------------|-----------|-----------|------------------|-----|-----|
| Hot Backups | Continuous WAL | 7 days | Primary + Secondary S3 | <5 min | <1 min |
| Warm Backups | Daily full DB dump | 30 days | S3 Standard | <2 hours | 24 hours |
| Cold Backups | Weekly full snapshot | 1 year | S3 Glacier | <24 hours | 7 days |
| Archive | Monthly compliance backup | 7 years | S3 Glacier Deep Archive | <48 hours | 30 days |

Backup Validation:
- Automated daily backup restoration tests in an isolated environment
- Monthly full DR drill: restore the complete production environment from backups
- Quarterly chaos engineering: simulate regional failures during business hours

Backup Encryption:
- AES-256 encryption at rest (AWS KMS)
- Encryption keys stored in multi-region AWS Secrets Manager
- Key rotation every 90 days
- Backup integrity checks using SHA-256 checksums

7.8.6 DR Testing and Validation

Test Frequency:

| Test Type | Frequency | Duration | Scope |
|-----------|-----------|----------|-------|
| Automated Failover Test | Weekly | 15 min | DNS + Application layer |
| Database Failover Test | Monthly | 1 hour | Full database promotion |
| Full DR Drill | Quarterly | 4 hours | Complete regional failover |
| Chaos Engineering | Bi-annual | 8 hours | Random component failures |

Full DR Drill Runbook:

1. Pre-Drill Preparation (1 hour before)
   - Notify all stakeholders
   - Enable enhanced monitoring
   - Take pre-test snapshots
   - Document current system state

2. Simulated Regional Failure (T+0)
   - Artificially fail primary region health checks
   - Trigger automated failover scripts
   - Monitor failover process

3. Validation Phase (T+15 min)
   - Verify DNS propagation (<5 min)
   - Check database promotion success
   - Test read/write operations
   - Validate application functionality

4. Recovery Phase (T+2 hours)
   - Fail back to primary region
   - Verify data consistency
   - Reconcile any data conflicts

5. Post-Drill Review (T+4 hours)
   - Document RTO/RPO achieved
   - Identify improvement areas
   - Update DR playbooks
   - Generate DR report for stakeholders

7.8.7 Split-Brain Prevention

Problem: Network partition causes both regions to believe they are primary, leading to data conflicts.

Prevention Mechanisms:

1. Fencing (STONITH - Shoot The Other Node In The Head)

    # Automatic fencing via Patroni DCS TTLs: an isolated primary loses its
    # leader key when the TTL expires and demotes itself
    patronictl edit-config --set ttl=30 --set retry_timeout=10

2. Distributed Consensus (etcd)
   - 5-node etcd cluster across 3 regions (2 nodes in the primary region, 2 in the secondary, 1 tiebreaker in eu-west-1)
   - Requires quorum (3/5 nodes) for leader election
   - A network partition automatically isolates the minority partition

3. Write Fencing
   - Only the region holding etcd quorum can accept writes
   - The minority partition automatically enters read-only mode
   - The application layer enforces a "write to the primary region only" rule
Split-Brain Detection:

# Application-level split-brain detection
# (sketch: check_region_health, both_accepting_writes, and the remediation
#  helpers are application-defined, not library functions)
def check_split_brain():
    primary_region_health = check_region_health("us-east-1")
    secondary_region_health = check_region_health("us-west-2")

    if primary_region_health and secondary_region_health:
        # Both regions are reachable; split brain exists only if BOTH
        # are also accepting writes
        if both_accepting_writes():
            alert_critical("SPLIT BRAIN DETECTED")
            enter_read_only_mode(determine_minority_region())
            page_dba_oncall()

7.8.8 Data Consistency After Failover

Conflict Resolution Strategy:

1. No Conflicts (99.9% of cases)
   - Asynchronous replication lag <5 seconds
   - Acceptable data loss: last 5 seconds of writes
   - Lost writes are logged for manual review

2. Conflict Detection
   - Compare WAL log sequence numbers (LSNs) between regions
   - Identify divergent transactions
   - Flag conflicting records in a conflict_resolution table

3. Resolution Process

    -- Automated conflict resolution (Last Write Wins)
    -- Note: resolve_conflicts is an application-defined stored procedure,
    -- not a PostgreSQL built-in
    SELECT resolve_conflicts(
      resolution_strategy := 'LAST_WRITE_WINS',
      timestamp_field := 'updated_at'
    );

    -- Manual conflict resolution for critical records
    SELECT * FROM conflict_resolution
    WHERE status = 'PENDING_MANUAL_REVIEW'
    ORDER BY severity DESC;

7.8.9 Monitoring and Alerting for DR

Critical DR Metrics:

| Metric | Threshold | Alert Severity | Action |
|--------|-----------|----------------|--------|
| Replication lag | >10s | Warning | Investigate primary DB load |
| Replication lag | >30s | Critical | Prepare for failover |
| Cross-region latency | >100ms | Warning | Check network connectivity |
| Backup failure | 1 failed backup | Critical | Immediate investigation |
| DR test failure | Any failure | High | Update DR playbooks |
| etcd cluster health | <3 nodes | Critical | Restore quorum immediately |

Monitoring Dashboard:

# Grafana DR Dashboard
Panels:
  - Primary Region Health (Green/Yellow/Red)
  - Secondary Region Health (Green/Yellow/Red)
  - Replication Lag (real-time graph, last 24h)
  - Last Successful Failover Test (timestamp)
  - Last Successful Backup (per shard)
  - Cross-Region Latency (p50/p95/p99)
  - DR Drill Results (historical trend)

7.8.10 Cost Optimization for Multi-Region DR

Cost Breakdown (Estimated for 1M users):

| Component | Primary Region | Secondary Region | Monthly Cost |
|-----------|----------------|------------------|--------------|
| FastAPI Instances | 20 instances (active) | 5 instances (standby) | $3,000 + $750 |
| Database (Citus) | 8 workers + replicas | 8 async replicas | $7,930 + $3,200 |
| Redis | 3-node cluster | 3-node replica | $600 + $300 |
| S3 Storage | 50TB | 50TB (CRR) | $1,150 + $1,150 |
| Data Transfer | Egress fees | Cross-region transfer | $500 |
| TOTAL | | | $18,580/month |

Cost per User: $0.0186/month

Optimization Strategies:
1. Right-size the secondary region: Run at 25% capacity, scale up on failover
2. Intelligent data tiering: Replicate only critical data in real time
3. Backup compression: Reduce S3 storage by 60% with Zstandard compression
4. Spot instances for DR testing: 70% cost reduction for quarterly drills
5. Incremental backups: Reduce cross-region transfer costs

7.8.11 Compliance and Regulatory Considerations

GDPR (EU Users):
- Tertiary region in eu-west-1 for EU user data
- Data residency guarantees (EU data stays in the EU)
- Cross-region replication only between EU regions for EU data
- Right to be forgotten: Automated data purge across all regions

SOC 2 Type II:
- Documented DR procedures (this PRD section)
- Quarterly DR testing with audit trails
- Incident response playbooks
- Change management for DR infrastructure

HIPAA (if applicable):
- Encrypted backups (AES-256)
- Access logging for all DR operations
- Business Associate Agreements (BAAs) with cloud providers

7.8.12 Runbook: Common DR Scenarios

Scenario 1: Complete Primary Region Failure

1. Automatic detection via health checks (60s)
2. Patroni promotes secondary DB coordinator (30s)
3. Route53 DNS failover (60s TTL propagation)
4. Scale up secondary region FastAPI (2 min)
5. Validate application functionality (5 min)
Total RTO: <10 minutes

Scenario 2: Database Corruption

1. Identify corruption timestamp
2. Restore nearest hourly backup before corruption
3. Replay WAL logs up to corruption point
4. Validate data integrity
5. Resume normal operations
Total RTO: <30 minutes, RPO: <1 hour

Scenario 3: Ransomware Attack

1. Isolate compromised systems immediately
2. Restore from immutable backup (S3 Object Lock)
3. Deploy clean infrastructure in secondary region
4. Forensic analysis in parallel
5. Gradual traffic migration after validation
Total RTO: <4 hours, RPO: <24 hours

7.9 Enterprise Compliance Framework (GDPR, SOC 2, HIPAA)

For an enterprise-grade WordPress hosting platform serving millions of users globally, compliance with international data protection regulations and industry standards is mandatory. This section defines our comprehensive compliance framework.

7.9.1 GDPR Compliance (General Data Protection Regulation)

Scope: All users in the European Economic Area (EEA) + UK

Key Requirements:

1. Lawful Basis for Processing
   - Consent: Explicit consent for marketing communications
   - Contract: Processing necessary for service delivery
   - Legitimate Interest: Fraud prevention, system security

2. Data Subject Rights Implementation

| Right | Implementation | API Endpoint | Response Time |
|-------|----------------|--------------|---------------|
| Right to Access (Art. 15) | User data export (JSON/CSV) | GET /api/v1/gdpr/data-export | <48 hours |
| Right to Rectification (Art. 16) | Profile update API | PATCH /api/v1/users/{id} | Real-time |
| Right to Erasure (Art. 17) | Account deletion with cascading purge | DELETE /api/v1/users/{id}/gdpr-erase | <30 days |
| Right to Data Portability (Art. 20) | Machine-readable export | GET /api/v1/gdpr/portability | <48 hours |
| Right to Object (Art. 21) | Marketing opt-out | POST /api/v1/gdpr/object-processing | Real-time |
| Right to Restriction (Art. 18) | Suspend processing flag | POST /api/v1/gdpr/restrict-processing | <24 hours |

Implementation Example:

# GDPR Data Erasure Implementation
from datetime import datetime, timezone

@router.delete("/users/{user_id}/gdpr-erase")
async def gdpr_erase_user(user_id: int, request: GDPREraseRequest):
    """
    Right to Erasure (Article 17 GDPR).
    Permanently delete user data across all systems.
    """
    # 1. Verify user identity (2FA required)
    await verify_user_identity(user_id, request.verification_code)

    # 2. Log erasure request for audit trail
    await audit_log.log_event(
        event_type="GDPR_ERASURE_REQUESTED",
        user_id=user_id,
        ip_address=request.ip,
        timestamp=datetime.now(timezone.utc)
    )

    # 3. Schedule asynchronous data purge
    await erasure_queue.enqueue(
        task="purge_user_data",
        user_id=user_id,
        cascade=True  # Delete environments, backups, logs
    )

    # 4. Anonymize logs (replace PII with hashed user_id)
    await anonymize_historical_logs(user_id)

    # 5. Notify DPO (Data Protection Officer)
    await notify_dpo(user_id, "GDPR_ERASURE_INITIATED")

    return {"status": "accepted", "completion_eta": "30 days"}

3. Data Residency and Cross-Border Transfers
   - EU Data Stays in EU: eu-west-1 region for EU users
   - Standard Contractual Clauses (SCCs): For US cloud provider (AWS)
   - Data Transfer Impact Assessment (DTIA): Annual review
   - No data transfers to non-adequate countries without explicit consent

4. Consent Management

    # Granular Consent Tracking
    consent_categories = {
        "essential": True,  # Always required for service
        "analytics": user.consent_analytics,  # Optional
        "marketing": user.consent_marketing,  # Optional
        "third_party_integrations": user.consent_integrations  # Optional
    }

    # Cookie consent banner
    def check_consent(user_id: int, category: str) -> bool:
        consent = get_user_consent(user_id)
        if category == "essential":
            return True
        return consent.get(category, False)

5. Data Breach Notification
   - Detection: Automated monitoring for unauthorized access
   - Assessment: <24 hours to determine breach severity
   - Supervisory Authority Notification: <72 hours if high risk
   - User Notification: Immediate if high risk to rights and freedoms
   - Breach Register: Maintained for all incidents (regardless of notification)

6. Privacy by Design and Default
   - Pseudonymization of logs (no PII in application logs)
   - Data minimization (only collect necessary fields)
   - Encryption at rest and in transit (TLS 1.3, AES-256)
   - Default privacy settings (marketing opt-in, not opt-out)

7. Data Protection Impact Assessment (DPIA)
   - Required for high-risk processing (millions of user records)
   - Annual DPIA review
   - Processing activities documented in a ROPA (Record of Processing Activities)
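
The log pseudonymization mentioned under Privacy by Design can be sketched with a keyed hash: PII is replaced by an HMAC-SHA-256 token so log lines remain correlatable per user without exposing identity. A sketch, with the key sourcing (a secrets manager in production) and token format as assumptions.

```python
# Sketch: keyed pseudonymization of user identifiers for application logs.
import hashlib
import hmac

PSEUDONYM_KEY = b"rotate-me-from-secrets-manager"  # placeholder; load from a secret store

def pseudonymize(user_id: int) -> str:
    """Return a stable, non-reversible log token for a user.

    HMAC (rather than a bare hash) prevents log readers from brute-forcing
    small ID spaces without the key.
    """
    mac = hmac.new(PSEUDONYM_KEY, str(user_id).encode(), hashlib.sha256)
    return f"user_{mac.hexdigest()[:16]}"

# The same user always maps to the same token, so logs stay joinable:
print(pseudonymize(42))
print(pseudonymize(42) == pseudonymize(42))
```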

7.9.2 SOC 2 Type II Compliance

Scope: Trust Services Criteria - Security, Availability, Confidentiality, Processing Integrity

Audit Timeline:
- Year 1: SOC 2 Type I (point-in-time assessment)
- Year 2: SOC 2 Type II (12-month continuous monitoring)
- Annual Re-certification: Required

1. Security (Common Criteria)

| Control | Implementation | Evidence |
|---------|----------------|----------|
| CC1.1: Control Environment | Security policies, code of conduct | Policy documents, training records |
| CC2.1: Risk Assessment | Quarterly risk assessments | Risk register, mitigation plans |
| CC3.1: Logical Access | RBAC, MFA enforcement | Access logs, user provisioning audit |
| CC4.1: Monitoring | Prometheus, Grafana, PagerDuty | Alert history, incident reports |
| CC5.1: Change Management | Git workflow, CI/CD approvals | Pull request logs, deployment audit |
| CC6.1: Encryption | TLS 1.3, AES-256 at rest | Encryption policy, key rotation logs |
| CC7.1: Incident Response | Incident playbooks, on-call rotation | Incident timeline, post-mortems |

2. Availability (Uptime Commitment)
- SLA: 99.99% uptime (52.56 min downtime/year)
- Monitoring: 24/7 automated health checks
- Incident Response: <15 min response time for critical issues
- Evidence: Uptime dashboards, incident response logs

3. Confidentiality
- Data Classification: Public, Internal, Confidential, Restricted
- Access Controls: Need-to-know basis, least privilege
- Encryption: Field-level encryption for PII (SSN, payment info)
- Evidence: Data classification policy, encryption audit

4. Processing Integrity
- Input Validation: All API inputs validated (Pydantic models)
- Data Integrity Checks: Checksums for backups, DB constraints
- Error Handling: Graceful degradation, no data corruption
- Evidence: Test coverage reports, integrity check logs

5. Privacy (if applicable)
- Notice: Privacy policy, cookie consent
- Choice: Opt-in/opt-out mechanisms
- Collection: Data minimization
- Evidence: GDPR compliance documentation

SOC 2 Audit Evidence Collection:

# Automated evidence collection for SOC 2
async def collect_soc2_evidence(audit_period: str):
    evidence = {
        "access_logs": await export_access_logs(audit_period),
        "change_management": await export_git_pr_history(audit_period),
        "incident_response": await export_pagerduty_incidents(audit_period),
        "backup_validation": await export_backup_test_results(audit_period),
        "encryption_audits": await export_encryption_compliance(audit_period),
        "security_training": await export_training_completion(audit_period),
        "vendor_assessments": await export_vendor_security_reviews(audit_period),
    }
    return evidence

7.9.3 HIPAA Compliance (If Hosting Healthcare WordPress Sites)

Note: HIPAA compliance is optional unless explicitly offering healthcare hosting services.

Key Requirements:

  1. Business Associate Agreement (BAA)
     - Required contract with all customers handling PHI
     - Subcontractor BAAs with AWS, Jelastic

  2. Physical Safeguards
     - AWS data centers (SOC 2, ISO 27001 certified)
     - Badge access, video surveillance (delegated to AWS)

  3. Technical Safeguards
     - Access Control: Unique user IDs, automatic logoff (15 min inactivity)
     - Audit Controls: Comprehensive logging of PHI access
     - Integrity Controls: Checksums for PHI transmission
     - Transmission Security: TLS 1.3 only, no TLS 1.2

  4. Administrative Safeguards
     - Security Officer: Designated HIPAA Security Officer
     - Training: Annual HIPAA training for all staff
     - Risk Analysis: Annual HIPAA risk assessment
     - Incident Response: <60 days breach notification

  5. PHI Encryption

    # HIPAA-compliant field-level encryption
    from datetime import datetime
    from cryptography.fernet import Fernet

    # get_kms_key, log_phi_access, and get_current_user_id are
    # application-provided helpers (KMS key lookup and audit logging).

    def encrypt_phi(plaintext: str, patient_id: int) -> str:
        """Encrypt Protected Health Information (PHI)."""
        key = get_kms_key(f"phi-encryption-{patient_id}")
        cipher = Fernet(key)
        encrypted = cipher.encrypt(plaintext.encode())

        # Every encryption operation is written to the PHI audit log
        log_phi_access(
            action="ENCRYPT",
            patient_id=patient_id,
            user_id=get_current_user_id(),
            timestamp=datetime.utcnow()
        )

        return encrypted.decode()

7.9.4 PCI DSS Compliance (Payment Card Industry)

Scope: If processing credit cards for WooCommerce sites

Recommended Approach: Avoid PCI scope by using payment processors (Stripe, PayPal) - Never store credit card numbers - Tokenization via Stripe/PayPal - No PCI audit required (out of scope)

If PCI Compliance Required: - Level: PCI DSS Level 2 (1M-6M transactions/year) - Requirements: 12 requirements, 78 sub-requirements - Quarterly: Vulnerability scans (ASV - Approved Scanning Vendor) - Annual: Penetration testing, onsite audit

7.9.5 ISO 27001 Certification (Optional)

Timeline: 18-24 months to certification

Benefits: - Global recognition for information security - Competitive advantage for enterprise customers - Comprehensive ISMS (Information Security Management System)

Key Requirements: - 114 controls across 14 domains - Annual surveillance audits - Continual improvement process

7.9.6 Compliance Monitoring and Reporting

1. Automated Compliance Dashboard

# Grafana Compliance Dashboard
Metrics:
  - GDPR Requests (Access, Erasure, Portability) - Last 30 days
  - SOC 2 Control Effectiveness (Pass/Fail) - Real-time
  - Data Residency Compliance (% EU data in EU region) - Real-time
  - Encryption Coverage (% encrypted data) - Real-time
  - Access Control Violations (failed auth attempts) - Last 24h
  - Incident Response Time (p95) - Last 90 days
  - Backup Success Rate - Last 30 days
  - Compliance Training Completion - Current status

2. Compliance Audit Trail

# Immutable audit log for compliance
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass(frozen=True)  # frozen: instances cannot be mutated after creation
class ComplianceAuditEvent:
    event_id: str  # UUID
    timestamp: datetime
    event_type: str  # GDPR_ACCESS, SOC2_CONTROL_TEST, etc.
    user_id: Optional[int]
    ip_address: str
    action: str
    result: str  # SUCCESS, FAILURE, PARTIAL
    evidence_s3_path: str  # Link to evidence
    digital_signature: str  # SHA-256 hash for tamper-proofing

# Write to append-only log (no deletes/updates allowed)
await append_to_audit_log(event)
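One way the digital_signature field can make the log tamper-evident is hash chaining, where each signature also covers the previous one. A minimal sketch (the GENESIS sentinel and both helper names are illustrative choices, not mandated above):

```python
import hashlib
import json
from typing import List

def sign_event(event: dict, previous_signature: str) -> str:
    """Hash-chain: each signature covers the event plus the previous
    signature, so editing any past entry invalidates all later ones."""
    payload = json.dumps(event, sort_keys=True, default=str)
    return hashlib.sha256((previous_signature + payload).encode()).hexdigest()

def verify_chain(events: List[dict]) -> bool:
    """Recompute every signature over the log; True only if untampered."""
    prev = "GENESIS"
    for event in events:
        body = {k: v for k, v in event.items() if k != "digital_signature"}
        expected = sign_event(body, prev)
        if event.get("digital_signature") != expected:
            return False
        prev = expected
    return True
```

An auditor can then re-verify the whole chain from the first entry without trusting the database that stored it.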

3. Quarterly Compliance Reports - Audience: CISO, DPO, Board of Directors - Contents: - GDPR requests summary (volume, response times) - SOC 2 control test results - Security incidents and resolutions - Compliance training completion rates - Vendor security assessments - Recommendations for next quarter

7.9.7 Third-Party Vendor Compliance

Vendor Risk Management:

| Vendor | Service | Compliance Certs | Assessment Frequency |
|--------|---------|------------------|----------------------|
| AWS | Infrastructure | SOC 2, ISO 27001, PCI DSS | Annual |
| Jelastic | Container Platform | SOC 2 (in progress) | Annual |
| Stripe | Payment Processing | PCI DSS Level 1 | Annual |
| SendGrid | Email Delivery | SOC 2 | Annual |
| PagerDuty | Incident Management | SOC 2 | Annual |

Vendor Assessment Process: 1. Request SOC 2 report or equivalent 2. Review security questionnaire (SIG Lite) 3. Assess data access and residency 4. Document in vendor register 5. Annual re-assessment

7.9.8 Compliance Training Program

Training Matrix:

| Role | Training Required | Frequency | Completion Target |
|------|-------------------|-----------|-------------------|
| All Staff | Security Awareness | Annual | 100% |
| Developers | Secure Coding (OWASP Top 10) | Annual | 100% |
| DevOps | Infrastructure Security | Annual | 100% |
| Support | GDPR Data Subject Rights | Quarterly | 100% |
| Management | Compliance Overview | Annual | 100% |
| DPO/Security Team | Advanced GDPR, SOC 2 | Bi-annual | 100% |

Training Tracking:

# Automated training reminders
async def check_training_compliance():
    overdue_users = await get_users_with_overdue_training()
    for user in overdue_users:
        await send_training_reminder(user)

        # Escalate if 30 days overdue
        if user.training_overdue_days > 30:
            await notify_manager(user.manager_id, user)

7.9.9 Compliance Cost Estimation

Annual Compliance Budget (Estimated):

| Item | Cost (Annual) |
|------|---------------|
| SOC 2 Type II Audit | $40,000 - $80,000 |
| GDPR DPO (Consultant or FTE) | $80,000 - $150,000 |
| Compliance Software (Vanta, Drata) | $20,000 |
| Penetration Testing | $30,000 |
| Security Training Platform | $10,000 |
| Legal Review (Policies, Contracts) | $20,000 |
| TOTAL | $200,000 - $310,000 |

Cost per User: $0.20 - $0.31 per user per year, roughly $0.02 - $0.03/month (for 1M users)

7.9.10 Compliance Roadmap

Year 1 (Months 1-12): - Q1: Implement GDPR data subject rights APIs - Q2: SOC 2 Type I audit preparation - Q3: SOC 2 Type I audit execution - Q4: Begin SOC 2 Type II observation period

Year 2 (Months 13-24): - Q1-Q4: SOC 2 Type II continuous monitoring - Q4: SOC 2 Type II audit and certification

Year 3+: - Maintain SOC 2 certification (annual) - Optional: ISO 27001 certification - Optional: HIPAA if expanding to healthcare vertical

7.9.11 Compliance Incident Response

Scenario: GDPR Data Breach

1. Detection (T+0)
   - Automated anomaly detection alerts security team
   - Example: Unauthorized bulk data export

2. Assessment (T+1 hour)
   - Determine scope: How many users affected?
   - Determine sensitivity: PII, financial data, health data?
   - Classify severity: High/Medium/Low risk

3. Containment (T+2 hours)
   - Revoke compromised credentials
   - Block unauthorized access
   - Preserve forensic evidence

4. Notification Decision (T+24 hours)
   - If >5,000 users + sensitive data: Notify supervisory authority
   - If high risk to user rights: Notify affected users immediately

5. Supervisory Authority Notification (T+48-72 hours)
   - Submit to lead supervisory authority (Irish DPC for EU)
   - Include: Nature of breach, affected users, measures taken
   - Template: GDPR Article 33 notification form

6. User Notification (T+72 hours if required)
   - Email to affected users
   - Clear explanation of breach and mitigation steps
   - Offer free identity protection services if financial data

7. Post-Incident Review (T+2 weeks)
   - Root cause analysis
   - Update security controls
   - Document lessons learned
   - Report to board of directors
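The step-4 notification decision can be encoded as a small rule function. This sketch simplifies the "high risk to user rights" judgment down to data categories, and all names (BreachAssessment, notification_plan) are illustrative:

```python
from dataclasses import dataclass
from typing import Set

SENSITIVE = {"pii", "financial", "health"}

@dataclass
class BreachAssessment:
    affected_users: int
    data_categories: Set[str]  # e.g. {"pii", "financial"}

def notification_plan(assessment: BreachAssessment) -> dict:
    """Apply the runbook thresholds: >5,000 users plus sensitive data
    triggers supervisory-authority notification; high-risk categories
    (financial/health) also trigger direct user notification."""
    sensitive = bool(assessment.data_categories & SENSITIVE)
    high_risk = bool(assessment.data_categories & {"financial", "health"})
    return {
        "notify_authority": assessment.affected_users > 5000 and sensitive,  # GDPR Art. 33, within 72h
        "notify_users": high_risk,                                           # GDPR Art. 34
        "offer_identity_protection": "financial" in assessment.data_categories,
    }
```

In practice the DPO makes the final call; a function like this only pre-classifies the incident for the on-call responder.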

8. Detailed Component Specifications

8.1 Authentication & Authorization System

  • JWT Implementation: Access tokens (15-min expiry), refresh tokens (7-day expiry)
  • Multi-factor Authentication: Support for TOTP and SMS-based MFA
  • Social Login: OAuth integration for Google, GitHub, and Microsoft
  • Session Management: Redis-based session storage with configurable TTL
  • API Key Management: Per-user API key generation with usage tracking
  • Role Management: Hierarchical role system with granular permissions
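To make the token lifetimes concrete, here is a stdlib-only sketch of issuing and verifying an access/refresh pair with HMAC signatures. The real system would use a JWT library (e.g. PyJWT); the SECRET constant and all helper names here are illustrative:

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SECRET = b"demo-secret"  # illustrative; load from a secrets manager in production

def _sign(payload: dict) -> str:
    """Encode a payload and append an HMAC-SHA256 signature."""
    body = base64.urlsafe_b64encode(json.dumps(payload, sort_keys=True).encode()).decode()
    sig = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    return f"{body}.{sig}"

def issue_token_pair(user_id: int, now: Optional[float] = None) -> dict:
    """Access token with 15-minute expiry; refresh token with 7-day expiry."""
    now = time.time() if now is None else now
    return {
        "access_token": _sign({"sub": user_id, "exp": now + 15 * 60, "typ": "access"}),
        "refresh_token": _sign({"sub": user_id, "exp": now + 7 * 24 * 3600, "typ": "refresh"}),
    }

def verify(token: str, now: Optional[float] = None) -> Optional[dict]:
    """Return the payload if the signature checks out and it has not expired."""
    now = time.time() if now is None else now
    body, _, sig = token.rpartition(".")
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None  # tampered or malformed token
    payload = json.loads(base64.urlsafe_b64decode(body))
    return payload if payload["exp"] > now else None
```

An expired access token returns None, prompting the client to exchange its refresh token for a new pair.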

8.2 Database Migration (MySQL to PostgreSQL)

  • Schema Migration: 45+ Laravel MySQL tables to PostgreSQL optimized schema
  • Data Migration: Complete data transfer with validation and integrity checks
  • Performance Optimization: PostgreSQL-specific optimizations (indexes, partitioning)
  • Migration Strategy: Schema-first approach with data validation
  • Rollback Plan: Complete rollback capability with data integrity checks
  • Performance Targets: 50% improvement in common query execution times

8.3 Job Queue System (ARQ-based)

  • Queue Architecture: Redis-backed ARQ workers for background processing
  • Job Types: 48 different job types migrated from Laravel queues
  • Priority System: Multi-level priority queue for critical operations
  • Progress Tracking: Real-time progress updates for long-running jobs
  • Error Handling: Automatic retry with exponential backoff
  • Dead Letter Queue: Fallback queue for failed jobs with manual processing
  • Idempotency: Task-level idempotency keys; deduplication window and replay protection
  • Exactly-Once Semantics (Best-Effort): Use outbox pattern for external side-effects
  • Tracing: Correlation IDs propagated across HTTP → job → WebSocket notifications
  • Safety: Per-queue concurrency limits and rate limits; circuit breakers for flaky integrations
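The retry and idempotency behaviors above can be sketched in a few lines; both function names are illustrative, not part of the ARQ API:

```python
import hashlib
import json
import random

def retry_delay(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Exponential backoff with full jitter: attempt 0 draws from [0, 1s],
    attempt 1 from [0, 2s], and so on, capped at 5 minutes."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def idempotency_key(job_type: str, payload: dict) -> str:
    """Stable key for deduplication: a re-enqueued duplicate hashes to the
    same key, so a Redis SET NX with a TTL (the dedup window) can drop it."""
    digest = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    return f"{job_type}:{digest[:16]}"
```

Full jitter avoids thundering-herd retries when many jobs fail at once, and the sorted JSON serialization makes the key insensitive to dict ordering.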

8.4 Real-time WebSocket System

8.4.1 Executive Summary

The WebSocket system provides real-time bidirectional communication for alerts, notifications, job completion updates, and live feature updates. Built using FastAPI's native WebSocket support with Redis Pub/Sub for horizontal scaling, this system replaces the legacy Node.js WebSocket server while maintaining feature parity and improving performance.

Key Capabilities: - Real-time alerts and notifications delivery - Job progress tracking and completion notifications - Live updates for site creation, backups, deployments, and domain operations - Multi-tenant channel architecture with team-scoped authorization - Message durability with Redis Streams for offline message replay - Presence tracking for user online/offline status - Horizontal scaling across multiple FastAPI instances without sticky sessions

8.4.2 Problem Statement

Current Challenges: - Legacy Node.js WebSocket server requires separate infrastructure and deployment - Limited scalability due to in-memory connection state - No message durability - messages lost if client disconnects - Complex integration between Laravel backend and Node.js WebSocket server - No standardized channel authorization model

Business Requirements: - Real-time notifications for critical operations (backups, deployments, SSL certificates) - Job progress updates for long-running operations (environment creation, WordPress installation) - Alert system for system events and user actions - Multi-device support (user can have multiple active connections) - Message replay capability for missed notifications

8.4.3 Technical Requirements

Performance Requirements: - WebSocket connection establishment: <100ms - Message broadcast latency (p95): <250ms across all instances - Support 10,000+ concurrent WebSocket connections per instance - Message throughput: 50,000+ messages/second across cluster - Heartbeat interval: 30 seconds (configurable) - Connection timeout: 5 minutes of inactivity

Scalability Requirements: - Horizontal scaling: Support 10+ FastAPI instances behind load balancer - No sticky sessions required - any instance can handle any connection - Redis Pub/Sub for cross-instance message propagation - Connection state externalized to Redis (no in-memory state) - Channel subscription management via Redis Sets

Reliability Requirements: - Message durability: Redis Streams with consumer groups - Message replay: Support replaying missed messages up to 24 hours - Automatic reconnection: Client-side exponential backoff - Graceful degradation: Fallback to polling if WebSocket unavailable - Connection health monitoring with automatic cleanup
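For the 24-hour replay window, a reconnecting client only needs a starting stream ID, because Redis stream entry IDs embed a millisecond timestamp ('<unix-ms>-<seq>'). A small sketch (the helper name is illustrative):

```python
import time
from typing import Optional

REPLAY_WINDOW_SECONDS = 24 * 3600

def replay_start_id(now: Optional[float] = None) -> str:
    """Redis stream entry IDs have the form '<unix-ms>-<seq>'; an XRANGE
    from this ID to '+' returns every entry from the last 24 hours."""
    now = time.time() if now is None else now
    start_ms = int((now - REPLAY_WINDOW_SECONDS) * 1000)
    return f"{start_ms}-0"
```

On reconnect, the server would issue something like `XRANGE ws:stream:{channel} <start-id> +` per subscribed channel and push the results before resuming live delivery.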

Security Requirements: - JWT-based authentication for WebSocket connections - Team-scoped channel authorization (users can only subscribe to their team channels) - Rate limiting: Max 5 connections per user, 100 messages/minute per connection - Input validation: All incoming messages validated against Pydantic schemas - Message size limits: Max 64KB per message

8.4.4 Architectural Overview

High-Level Architecture:

┌─────────────────────────────────────────────────────────────────┐
│                    FastAPI Backend Instances                    │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐        │
│  │  Instance 1  │  │  Instance 2  │  │  Instance N  │        │
│  │              │  │              │  │              │        │
│  │ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐ │        │
│  │ │WebSocket │ │  │ │WebSocket │ │  │ │WebSocket │ │        │
│  │ │  Module  │ │  │ │  Module  │ │  │ │  Module  │ │        │
│  │ └────┬─────┘ │  │ └────┬─────┘ │  │ └────┬─────┘ │        │
│  └──────┼───────┘  └──────┼───────┘  └──────┼───────┘        │
│         │                 │                 │                 │
│         └─────────────────┼─────────────────┘                 │
│                           │                                    │
└───────────────────────────┼────────────────────────────────────┘
                ┌───────────▼───────────┐
                │   Redis Cluster       │
                │                       │
                │  ┌─────────────────┐  │
                │  │   Pub/Sub      │  │  Cross-instance messaging
                │  └─────────────────┘  │
                │  ┌─────────────────┐  │
                │  │   Streams       │  │  Message durability
                │  └─────────────────┘  │
                │  ┌─────────────────┐  │
                │  │   Sets          │  │  Channel subscriptions
                │  └─────────────────┘  │
                │  ┌─────────────────┐  │
                │  │   Strings       │  │  Presence tracking
                │  └─────────────────┘  │
                └───────────────────────┘
                ┌───────────▼───────────┐
                │   PostgreSQL          │
                │   (for audit logs)    │
                └───────────────────────┘

Channel Architecture:

Channels follow a hierarchical naming pattern: {scope}.{identifier}.{feature}

Channel Types:
  - User Channels: user.{userId} - Personal notifications and alerts
  - Team Channels: team.{teamId}.{feature} - Team-scoped updates
    - team.{teamId}.jobs - Job status updates
    - team.{teamId}.backups - Backup progress and completion
    - team.{teamId}.activities - Activity feed updates
    - team.{teamId}.sites.creating - Site creation progress
    - team.{teamId}.domains - Domain/SSL updates
    - team.{teamId}.favourites - Favourite changes
  - System Channels: system.{event} - System-wide announcements (admin only)
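The team-scoped authorization implied by this naming scheme can be checked by parsing the channel string; a minimal sketch (the can_subscribe helper is illustrative, not part of the module spec):

```python
from typing import List

def can_subscribe(channel: str, user_id: int, team_ids: List[int],
                  is_admin: bool = False) -> bool:
    """Authorize a subscription against the {scope}.{identifier}.{feature}
    pattern: users get their own channel, their teams' channels, and
    system channels only if they are admins."""
    parts = channel.split(".")
    scope = parts[0]
    if scope == "user":
        return len(parts) >= 2 and parts[1] == str(user_id)
    if scope == "team":
        return len(parts) >= 2 and parts[1].isdigit() and int(parts[1]) in team_ids
    if scope == "system":
        return is_admin
    return False  # unknown scope: deny by default
```

Denying unknown scopes by default keeps the check safe when new channel types are added before the authorizer learns about them.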

Message Flow:

1. Client connects → WebSocket endpoint (/ws)
2. Authentication → JWT token validation
3. Channel subscription → Subscribe to authorized channels
4. Message publishing:
   - Module publishes message → Redis Pub/Sub
   - All instances receive → Filter by subscriptions
   - Deliver to connected clients
5. Message persistence → Redis Streams (for replay)
6. Presence update → Redis Sets (online/offline status)

8.4.5 Detailed Component Specifications

Module Structure (app/websocket/):

app/websocket/
├── __init__.py                 # Module exports
├── router.py                   # INTERFACE: WebSocket endpoints and HTTP routes
├── service.py                  # APPLICATION: Business logic for message handling
├── connection.py               # INFRASTRUCTURE: ConnectionManager for managing connections
├── channel.py                  # INFRASTRUCTURE: Channel subscription management
├── presence.py                 # INFRASTRUCTURE: Presence tracking (online/offline)
├── publisher.py                # INFRASTRUCTURE: Message publishing to Redis
├── repository.py               # INFRASTRUCTURE: Database operations (audit logs)
├── model.py                    # DOMAIN: WebSocket-related models (if needed)
└── schema.py                   # INTERFACE: Pydantic schemas for messages

Component Responsibilities:

1. router.py (Interface Layer) - WebSocket endpoint: ws://api.example.com/ws - HTTP endpoints: - GET /ws/health - Health check - GET /ws/metrics - Prometheus metrics - GET /ws/channels - List available channels - POST /ws/broadcast - Admin broadcast endpoint - WebSocket connection handling - JWT authentication for WebSocket connections - Rate limiting per connection

2. connection.py (Infrastructure Layer) - ConnectionManager class: - Manages active WebSocket connections per user - Handles connection lifecycle (connect, disconnect, cleanup) - Tracks connection metadata (user_id, team_ids, subscribed channels) - Implements heartbeat mechanism - Connection health monitoring - Redis integration for cross-instance state - Connection pooling and resource management

3. channel.py (Infrastructure Layer) - ChannelManager class: - Channel subscription management - Authorization validation (team-scoped access) - Channel pattern matching (wildcard subscriptions) - Subscription persistence in Redis Sets - Channel discovery and validation - Subscription limits enforcement

4. presence.py (Infrastructure Layer) - PresenceManager class: - User online/offline status tracking - Presence set management in Redis - TTL-based presence expiration - Cross-instance presence synchronization - Presence events (user online, user offline) - Presence query API

5. publisher.py (Infrastructure Layer) - MessagePublisher class: - Publish messages to Redis Pub/Sub - Write messages to Redis Streams for durability - Message routing to appropriate channels - Batch publishing for performance - Integration with other modules for publishing events

6. service.py (Application Layer) - WebSocketService class: - Business logic for message handling - Message validation and transformation - Channel authorization logic - Message replay service - Connection management orchestration - Integration with other domain modules - Event handling and routing

7. repository.py (Infrastructure Layer) - Database operations for audit logging - WebSocket connection history - Message delivery tracking (optional) - Analytics queries

8. schema.py (Interface Layer) - WebSocketMessage - Base message schema - SubscribeMessage - Channel subscription request - UnsubscribeMessage - Channel unsubscription request - PingMessage - Heartbeat ping - PongMessage - Heartbeat pong - JobUpdateMessage - Job status update - NotificationMessage - Alert/notification - PresenceUpdateMessage - User presence change - ErrorMessage - Error response
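The stale-connection cleanup that connection.py and presence.py are responsible for reduces to a timestamp comparison against the 5-minute timeout; a minimal sketch (the find_stale helper and its dict shape are illustrative):

```python
from datetime import datetime, timedelta
from typing import Dict, List

HEARTBEAT_INTERVAL = timedelta(seconds=30)   # clients ping on this cadence
CONNECTION_TIMEOUT = timedelta(minutes=5)    # inactivity limit from the requirements

def find_stale(last_heartbeats: Dict[str, datetime], now: datetime) -> List[str]:
    """Return connection IDs whose last heartbeat is older than the timeout;
    these are candidates for cleanup and a presence-offline event."""
    return [cid for cid, seen in last_heartbeats.items()
            if now - seen > CONNECTION_TIMEOUT]
```

A periodic background task would run this against the Redis-backed heartbeat state, close the stale sockets, and remove their presence entries.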

8.4.6 Shared Components Integration

Integration with app/core/: - app/core/security.py - JWT token validation for WebSocket connections - app/core/cache.py - Redis connection pooling and management - app/core/rate_limit.py - Rate limiting for WebSocket connections - app/core/logging.py - Structured logging with correlation IDs - app/core/exceptions.py - Custom exceptions for WebSocket errors

Integration with app/database/: - app/database/database.py - Database session for audit logs - app/database/deps.py - Database dependency injection

Integration with Other Modules:

Job Updates (from jobs/ module):

# In jobs/tasks.py or jobs/service.py
from app.websocket.publisher import MessagePublisher

async def notify_job_completion(job_id: str, status: str, result: dict):
    publisher = MessagePublisher()
    await publisher.publish(
        channel=f"team.{team_id}.jobs",
        message={
            "type": "job.update",
            "job_id": job_id,
            "status": status,
            "result": result,
            "timestamp": datetime.utcnow().isoformat()
        }
    )

Backup Updates (from backups/ module):

# In backups/tasks.py
from app.websocket.publisher import MessagePublisher

async def notify_backup_progress(backup_id: int, progress: int, team_id: int):
    publisher = MessagePublisher()
    await publisher.publish(
        channel=f"team.{team_id}.backups",
        message={
            "type": "backup.progress",
            "backup_id": backup_id,
            "progress": progress,
            "timestamp": datetime.utcnow().isoformat()
        }
    )

Activity Feed (from activity/ module):

# In activity/service.py
from app.websocket.publisher import MessagePublisher

async def broadcast_activity(activity: Activity, team_id: int):
    publisher = MessagePublisher()
    await publisher.publish(
        channel=f"team.{team_id}.activities",
        message={
            "type": "activity.new",
            "activity": activity.dict(),
            "timestamp": datetime.utcnow().isoformat()
        }
    )

Module Isolation Compliance: - ✅ WebSocket module does NOT import from other domain modules - ✅ Other modules import MessagePublisher from app.websocket.publisher (allowed - WebSocket is infrastructure) - ✅ WebSocket module uses shared components from app/core/ and app/database/ - ✅ Clear separation: WebSocket handles delivery, modules handle business logic

8.4.7 Implementation Roadmap

Phase 1: Core Infrastructure (Weeks 1-2) 1. Set up WebSocket module structure (app/websocket/) 2. Implement ConnectionManager with Redis state management 3. Implement basic WebSocket endpoint in router.py 4. JWT authentication for WebSocket connections 5. Basic message publishing to Redis Pub/Sub 6. Unit tests for connection management

Phase 2: Channel Management (Weeks 3-4) 1. Implement ChannelManager with subscription management 2. Team-scoped channel authorization 3. Channel subscription/unsubscription endpoints 4. Redis Sets for subscription persistence 5. Integration tests for channel operations

Phase 3: Message Durability (Weeks 5-6) 1. Implement Redis Streams integration 2. Message persistence for durability 3. Message replay API 4. Consumer groups for message delivery 5. ACK mechanism for message processing

Phase 4: Presence & Advanced Features (Weeks 7-8) 1. Implement PresenceManager for online/offline tracking 2. Heartbeat mechanism (ping/pong) 3. Connection health monitoring 4. Automatic cleanup of stale connections 5. Presence query endpoints

Phase 5: Integration & Testing (Weeks 9-10) 1. Integrate with jobs/ module for job updates 2. Integrate with backups/ module for backup progress 3. Integrate with activity/ module for activity feed 4. Integrate with sites/ module for site creation updates 5. End-to-end integration tests 6. Load testing (10,000+ concurrent connections)

Phase 6: Migration & Cutover (Weeks 11-12) 1. Feature parity validation with Node.js WebSocket server 2. Dual-publish period (both Node and FastAPI) 3. Canary deployment (10% traffic to FastAPI) 4. Gradual rollout (25%, 50%, 100%) 5. Node.js server deprecation

8.4.8 Success Criteria

Functional Requirements: - ✅ WebSocket connections established successfully with JWT authentication - ✅ Users can subscribe to team-scoped channels - ✅ Messages published from any module delivered to subscribed clients - ✅ Message replay works for missed messages (24-hour window) - ✅ Presence tracking accurately reflects user online/offline status - ✅ Job completion notifications delivered in <250ms (p95) - ✅ Backup progress updates delivered in real-time - ✅ Activity feed updates broadcast to team members

Performance Requirements: - ✅ WebSocket connection establishment: <100ms (p95) - ✅ Message broadcast latency: <250ms (p95) across all instances - ✅ Support 10,000+ concurrent connections per instance - ✅ Message throughput: 50,000+ messages/second across cluster - ✅ Zero message loss during normal operations - ✅ Graceful handling of 1,000+ simultaneous reconnections

Reliability Requirements: - ✅ 99.9% WebSocket availability - ✅ Automatic reconnection with exponential backoff - ✅ Message durability: 100% of messages persisted to Redis Streams - ✅ Connection cleanup: Stale connections removed within 5 minutes - ✅ Cross-instance message delivery: 100% reliability

Security Requirements: - ✅ JWT authentication required for all connections - ✅ Team-scoped channel authorization enforced - ✅ Rate limiting: Max 5 connections per user enforced - ✅ Message size limits: 64KB max enforced - ✅ Input validation: All messages validated against schemas

8.4.9 Risk Assessment

Technical Risks:

  1. Redis Pub/Sub Message Loss
     - Risk: Messages may be lost if an instance crashes before delivery
     - Mitigation: Redis Streams for durability + ACK mechanism
     - Impact: Medium
     - Probability: Low

  2. Connection State Synchronization
     - Risk: Connection state may become inconsistent across instances
     - Mitigation: All state in Redis, no in-memory state
     - Impact: High
     - Probability: Low

  3. Scalability Bottlenecks
     - Risk: Redis Pub/Sub may become a bottleneck at high message rates
     - Mitigation: Redis Cluster, message batching, connection pooling
     - Impact: High
     - Probability: Medium

  4. Message Replay Performance
     - Risk: Replaying a large message history may be slow
     - Mitigation: Pagination, time-windowed queries, consumer groups
     - Impact: Medium
     - Probability: Medium

Operational Risks:

  1. Migration Complexity
     - Risk: Migrating from Node.js to FastAPI may cause downtime
     - Mitigation: Dual-publish period, gradual rollout, rollback plan
     - Impact: High
     - Probability: Medium

  2. Monitoring Gaps
     - Risk: Insufficient visibility into WebSocket performance
     - Mitigation: Comprehensive metrics, structured logging, alerting
     - Impact: Medium
     - Probability: Low

8.4.10 Module Isolation Verification

Compliance Checklist: - ✅ WebSocket module is self-contained in app/websocket/ - ✅ No imports from other domain modules (users/, teams/, sites/, etc.) - ✅ Uses shared components from app/core/ (security, cache, rate_limit, logging) - ✅ Uses shared components from app/database/ (database session) - ✅ Other modules import MessagePublisher from WebSocket (infrastructure dependency - allowed) - ✅ Clear separation: WebSocket handles delivery, modules handle business logic - ✅ WebSocket module can be tested independently

Dependency Graph:

websocket/
  ├── imports from: app/core/ (security, cache, rate_limit, logging)
  ├── imports from: app/database/ (database session)
  └── NO imports from: other domain modules

other modules (jobs/, backups/, activity/, etc.)
  ├── imports from: app/core/
  ├── imports from: app/database/
  └── imports from: app/websocket/publisher.py (infrastructure - allowed)

8.4.11 Code Examples

WebSocket Connection Handler:

# app/websocket/router.py
from datetime import datetime
from typing import Optional

from fastapi import APIRouter, Query, WebSocket, WebSocketDisconnect
from app.core.logging import get_logger
from app.core.security import verify_websocket_token
from app.websocket.connection import ConnectionManager
from app.websocket.service import WebSocketService
from app.websocket.schema import WebSocketMessage, SubscribeMessage

logger = get_logger(__name__)
router = APIRouter(prefix="/ws", tags=["websocket"])
connection_manager = ConnectionManager()
websocket_service = WebSocketService()

@router.websocket("")
async def websocket_endpoint(
    websocket: WebSocket,
    token: Optional[str] = Query(default=None)
):
    """
    WebSocket endpoint for real-time communication.

    Query Parameters:
    - token: JWT access token for authentication
    """
    # Authenticate connection
    if not token:
        await websocket.close(code=1008, reason="Missing authentication token")
        return

    try:
        payload = verify_websocket_token(token)
        user_id = payload.get("sub")
        team_ids = payload.get("team_ids", [])
    except Exception:
        await websocket.close(code=1008, reason="Invalid authentication token")
        return

    # Accept connection
    await connection_manager.connect(websocket, user_id, team_ids)

    try:
        while True:
            # Receive message from client
            data = await websocket.receive_json()
            message = WebSocketMessage(**data)

            # Handle message based on type
            if message.type == "subscribe":
                subscribe_msg = SubscribeMessage(**message.data)
                await websocket_service.handle_subscribe(
                    websocket, user_id, team_ids, subscribe_msg.channel
                )
            elif message.type == "unsubscribe":
                # Handle unsubscribe
                pass
            elif message.type == "ping":
                # Handle heartbeat
                await websocket.send_json({"type": "pong", "timestamp": datetime.utcnow().isoformat()})
            else:
                await websocket.send_json({
                    "type": "error",
                    "message": f"Unknown message type: {message.type}"
                })

    except WebSocketDisconnect:
        connection_manager.disconnect(websocket, user_id)
    except Exception as e:
        logger.error(f"WebSocket error: {e}", exc_info=True)
        await websocket.close(code=1011, reason="Internal server error")
        connection_manager.disconnect(websocket, user_id)

Message Publisher:

# app/websocket/publisher.py
import json
import redis.asyncio as aioredis
from datetime import datetime
from typing import Dict, Any, Optional
from app.core.cache import get_redis_client
from app.core.logging import get_logger

logger = get_logger(__name__)

class MessagePublisher:
    """Publishes messages to Redis Pub/Sub and Streams for durability."""

    def __init__(self):
        self.redis: Optional[aioredis.Redis] = None

    async def _get_redis(self) -> aioredis.Redis:
        """Get Redis client (lazy initialization)."""
        if self.redis is None:
            self.redis = await get_redis_client()
        return self.redis

    async def publish(
        self,
        channel: str,
        message: Dict[str, Any],
        persist: bool = True
    ) -> None:
        """
        Publish message to channel.

        Args:
            channel: Channel name (e.g., "team.123.jobs")
            message: Message payload (must be JSON-serializable)
            persist: Whether to persist message to Redis Streams
        """
        redis = await self._get_redis()

        # Add metadata
        message_with_meta = {
            **message,
            "channel": channel,
            "timestamp": datetime.utcnow().isoformat(),
            "id": f"{datetime.utcnow().timestamp()}-{hash(str(message))}"
        }

        # Publish to Pub/Sub (for real-time delivery)
        await redis.publish(
            f"ws:channel:{channel}",
            json.dumps(message_with_meta)
        )

        # Persist to Redis Streams (for durability and replay).
        # XADD fields must be a flat string-to-string mapping, so the
        # payload is serialized into a single "data" field.
        if persist:
            await redis.xadd(
                f"ws:stream:{channel}",
                {"data": json.dumps(message_with_meta)},
                maxlen=10000  # Keep roughly the last 10,000 messages per channel
            )

        logger.info(
            f"Published message to channel {channel}",
            extra={"channel": channel, "message_type": message.get("type")}
        )

    async def publish_to_user(
        self,
        user_id: int,
        message: Dict[str, Any],
        persist: bool = True
    ) -> None:
        """Publish message to user's personal channel."""
        await self.publish(f"user.{user_id}", message, persist)

    async def publish_to_team(
        self,
        team_id: int,
        feature: str,
        message: Dict[str, Any],
        persist: bool = True
    ) -> None:
        """Publish message to team feature channel."""
        await self.publish(f"team.{team_id}.{feature}", message, persist)

Connection Manager:

# app/websocket/connection.py
from typing import Dict, List, Set, Optional
from fastapi import WebSocket
import redis.asyncio as aioredis
from datetime import datetime, timedelta
from app.core.cache import get_redis_client
from app.core.logging import get_logger

logger = get_logger(__name__)

class ConnectionManager:
    """Manages WebSocket connections with Redis-backed state."""

    def __init__(self):
        self.redis: Optional[aioredis.Redis] = None
        self.local_connections: Dict[int, List[WebSocket]] = {}
        self.connection_metadata: Dict[WebSocket, Dict] = {}

    async def _get_redis(self) -> aioredis.Redis:
        """Get Redis client (lazy initialization)."""
        if self.redis is None:
            self.redis = await get_redis_client()
        return self.redis

    async def connect(
        self,
        websocket: WebSocket,
        user_id: int,
        team_ids: List[int]
    ) -> None:
        """Accept and register WebSocket connection."""
        await websocket.accept()

        # Store connection locally
        if user_id not in self.local_connections:
            self.local_connections[user_id] = []
        self.local_connections[user_id].append(websocket)

        # Store metadata
        self.connection_metadata[websocket] = {
            "user_id": user_id,
            "team_ids": team_ids,
            "connected_at": datetime.utcnow(),
            "last_heartbeat": datetime.utcnow()
        }

        # Update presence in Redis
        redis = await self._get_redis()
        await redis.sadd(f"ws:presence:user:{user_id}", str(id(websocket)))
        await redis.setex(
            f"ws:presence:ttl:user:{user_id}",
            300,  # 5 minutes TTL
            "online"
        )

        # Subscribe to Redis Pub/Sub for this user
        await self._subscribe_to_user_channels(websocket, user_id, team_ids)

        logger.info(
            f"WebSocket connected for user {user_id}",
            extra={"user_id": user_id, "team_ids": team_ids}
        )

    async def disconnect(
        self,
        websocket: WebSocket,
        user_id: int
    ) -> None:
        """Remove WebSocket connection."""
        # Remove from local connections
        if user_id in self.local_connections:
            if websocket in self.local_connections[user_id]:
                self.local_connections[user_id].remove(websocket)
            if not self.local_connections[user_id]:
                del self.local_connections[user_id]

        # Remove metadata
        if websocket in self.connection_metadata:
            del self.connection_metadata[websocket]

        # Update presence in Redis
        redis = await self._get_redis()
        await redis.srem(f"ws:presence:user:{user_id}", str(id(websocket)))

        # If no more connections for user, mark as offline
        if user_id not in self.local_connections:
            await redis.delete(f"ws:presence:ttl:user:{user_id}")

        logger.info(f"WebSocket disconnected for user {user_id}")

    async def send_to_user(
        self,
        user_id: int,
        message: dict
    ) -> None:
        """Send message to all connections for a user."""
        if user_id in self.local_connections:
            import json
            message_json = json.dumps(message)
            for websocket in list(self.local_connections[user_id]):  # iterate over a copy; disconnect() mutates the list
                try:
                    await websocket.send_text(message_json)
                except Exception as e:
                    logger.error(f"Error sending message to user {user_id}: {e}")
                    # Remove failed connection
                    await self.disconnect(websocket, user_id)

    async def _subscribe_to_user_channels(
        self,
        websocket: WebSocket,
        user_id: int,
        team_ids: List[int]
    ) -> None:
        """Subscribe to Redis Pub/Sub channels for this connection."""
        redis = await self._get_redis()
        pubsub = redis.pubsub()

        # Subscribe to user channel
        await pubsub.subscribe(f"ws:channel:user.{user_id}")

        # Subscribe to team channels
        for team_id in team_ids:
            await pubsub.psubscribe(f"ws:channel:team.{team_id}.*")

        # Start listening for messages
        # This would run in a background task
        # Implementation details omitted for brevity

Channel Manager:

# app/websocket/channel.py
from typing import List, Set, Optional
import redis.asyncio as aioredis
from app.core.cache import get_redis_client
from app.core.logging import get_logger

logger = get_logger(__name__)

class ChannelManager:
    """Manages channel subscriptions with authorization."""

    def __init__(self):
        self.redis: Optional[aioredis.Redis] = None

    async def _get_redis(self) -> aioredis.Redis:
        """Get Redis client."""
        if self.redis is None:
            self.redis = await get_redis_client()
        return self.redis

    def validate_channel_access(
        self,
        channel: str,
        user_id: int,
        team_ids: List[int]
    ) -> bool:
        """
        Validate if user can access channel.

        Channel patterns:
        - user.{userId} - User can only access their own channel
        - team.{teamId}.{feature} - User must be member of team
        - system.* - Admin only (not implemented in this example)
        """
        if channel.startswith("user."):
            channel_user_id = int(channel.split(".")[1])
            return channel_user_id == user_id

        elif channel.startswith("team."):
            parts = channel.split(".")
            if len(parts) >= 2:
                channel_team_id = int(parts[1])
                return channel_team_id in team_ids

        return False

    async def subscribe(
        self,
        channel: str,
        user_id: int,
        team_ids: List[int]
    ) -> bool:
        """Subscribe user to channel."""
        # Validate access
        if not self.validate_channel_access(channel, user_id, team_ids):
            logger.warning(
                f"User {user_id} attempted to subscribe to unauthorized channel {channel}"
            )
            return False

        # Add to subscription set
        redis = await self._get_redis()
        await redis.sadd(f"ws:subscriptions:user:{user_id}", channel)
        await redis.sadd(f"ws:subscribers:channel:{channel}", str(user_id))

        logger.info(f"User {user_id} subscribed to channel {channel}")
        return True

    async def unsubscribe(
        self,
        channel: str,
        user_id: int
    ) -> None:
        """Unsubscribe user from channel."""
        redis = await self._get_redis()
        await redis.srem(f"ws:subscriptions:user:{user_id}", channel)
        await redis.srem(f"ws:subscribers:channel:{channel}", str(user_id))

        logger.info(f"User {user_id} unsubscribed from channel {channel}")

    async def get_user_subscriptions(self, user_id: int) -> Set[str]:
        """Get all channels user is subscribed to."""
        redis = await self._get_redis()
        subscriptions = await redis.smembers(f"ws:subscriptions:user:{user_id}")
        return {s.decode() for s in subscriptions}
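
The authorization rules enforced by `validate_channel_access` are pure string logic and easy to unit-test in isolation. A standalone restatement (`can_access` is an illustrative name):

```python
def can_access(channel: str, user_id: int, team_ids: list) -> bool:
    """Channel authorization: own user channel, or a feature channel of an owned team."""
    parts = channel.split(".")
    if channel.startswith("user.") and len(parts) == 2 and parts[1].isdigit():
        return int(parts[1]) == user_id
    if channel.startswith("team.") and len(parts) >= 3 and parts[1].isdigit():
        return int(parts[1]) in team_ids
    return False  # system.* and anything unrecognized is denied by default

# A user may read their own channel and their teams' feature channels only.
assert can_access("user.7", 7, []) is True
assert can_access("user.8", 7, []) is False
assert can_access("team.3.jobs", 7, [3, 9]) is True
assert can_access("system.broadcast", 7, [3]) is False
```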

Message Schemas:

# app/websocket/schema.py
from pydantic import BaseModel, Field
from typing import Optional, Dict, Any, Literal
from datetime import datetime

class WebSocketMessage(BaseModel):
    """Base WebSocket message schema."""
    type: str = Field(..., description="Message type")
    data: Dict[str, Any] = Field(default_factory=dict, description="Message payload")
    timestamp: Optional[str] = Field(default=None, description="Message timestamp")

class SubscribeMessage(BaseModel):
    """Channel subscription request."""
    channel: str = Field(..., description="Channel name to subscribe to")
    type: Literal["subscribe"] = "subscribe"

class UnsubscribeMessage(BaseModel):
    """Channel unsubscription request."""
    channel: str = Field(..., description="Channel name to unsubscribe from")
    type: Literal["unsubscribe"] = "unsubscribe"

class PingMessage(BaseModel):
    """Heartbeat ping message."""
    type: Literal["ping"] = "ping"
    timestamp: str = Field(default_factory=lambda: datetime.utcnow().isoformat())

class PongMessage(BaseModel):
    """Heartbeat pong response."""
    type: Literal["pong"] = "pong"
    timestamp: str = Field(default_factory=lambda: datetime.utcnow().isoformat())

class JobUpdateMessage(BaseModel):
    """Job status update message."""
    type: Literal["job.update"] = "job.update"
    job_id: str
    status: str  # "pending", "running", "completed", "failed"
    progress: Optional[int] = None  # 0-100
    result: Optional[Dict[str, Any]] = None
    error: Optional[str] = None
    timestamp: str = Field(default_factory=lambda: datetime.utcnow().isoformat())

class NotificationMessage(BaseModel):
    """Alert/notification message."""
    type: Literal["notification"] = "notification"
    level: str  # "info", "warning", "error", "success"
    title: str
    message: str
    action_url: Optional[str] = None
    timestamp: str = Field(default_factory=lambda: datetime.utcnow().isoformat())

class PresenceUpdateMessage(BaseModel):
    """User presence update message."""
    type: Literal["presence.update"] = "presence.update"
    user_id: int
    status: str  # "online", "offline"
    timestamp: str = Field(default_factory=lambda: datetime.utcnow().isoformat())

class ErrorMessage(BaseModel):
    """Error response message."""
    type: Literal["error"] = "error"
    code: str
    message: str
    timestamp: str = Field(default_factory=lambda: datetime.utcnow().isoformat())
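
On the receiving side, messages are dispatched on their `type` discriminator. A minimal dispatch table (plain dicts here to keep the sketch dependency-free; in the real handler the Pydantic models above would validate the payload first):

```python
from typing import Any, Callable, Dict

Handler = Callable[[Dict[str, Any]], str]

def handle_job_update(msg: Dict[str, Any]) -> str:
    return f"job {msg['job_id']} -> {msg['status']}"

def handle_ping(msg: Dict[str, Any]) -> str:
    return "pong"  # the real handler would reply with a PongMessage

DISPATCH: Dict[str, Handler] = {
    "job.update": handle_job_update,
    "ping": handle_ping,
}

def dispatch(msg: Dict[str, Any]) -> str:
    handler = DISPATCH.get(msg.get("type", ""))
    if handler is None:
        return "error: unknown message type"  # the real handler would send an ErrorMessage
    return handler(msg)
```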

6.4.12 Monitoring & Observability

Metrics (Prometheus):

  • websocket_connections_total - Total active connections
  • websocket_connections_per_user - Connections per user (histogram)
  • websocket_messages_sent_total - Total messages sent
  • websocket_messages_received_total - Total messages received
  • websocket_message_latency_seconds - Message delivery latency (histogram)
  • websocket_channel_subscriptions - Active channel subscriptions
  • websocket_errors_total - Error count by type
  • websocket_reconnect_total - Reconnection count

Health Endpoints:

  • GET /ws/health - Health check (verifies Redis connectivity and connection count)
  • GET /ws/metrics - Prometheus metrics endpoint
  • GET /ws/channels - List all active channels and subscriber counts

Logging:

  • Structured JSON logs with correlation IDs
  • Logged events: connection, disconnection, subscription, message delivery, errors
  • PII scrubbing for user data in logs
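
The structured-log requirement reduces to emitting one JSON object per event with the correlation ID attached; a minimal formatter (illustrative field names):

```python
import json
from datetime import datetime, timezone

def format_ws_event(event: str, correlation_id: str, **fields) -> str:
    """Render one structured JSON log line for a WebSocket lifecycle event."""
    record = {
        "event": event,
        "correlation_id": correlation_id,
        "ts": datetime.now(timezone.utc).isoformat(),
        **fields,  # caller-supplied context, e.g. user_id, channel (PII-scrubbed upstream)
    }
    return json.dumps(record, sort_keys=True)
```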

6.4.13 Migration Strategy

Phase A: Bridge (Weeks 1-4)

  • Keep the Node.js WebSocket server running
  • Develop the FastAPI WebSocket module
  • Validate feature parity
  • Dual-publish testing (both systems receive messages)

Phase B: Dual-Run (Weeks 5-8)

  • Deploy the FastAPI WebSocket service alongside Node.js
  • Mirror all publishes to both systems
  • Validators compare payload parity
  • Gradual client migration (10% → 25% → 50%)

Phase C: Cutover (Weeks 9-12)

  • Switch ingress to the FastAPI WebSocket service (100% traffic)
  • Keep the Node.js server as a hot standby
  • Monitor metrics and error rates
  • Roll back if issues are detected
  • Deprecate the Node.js server after 2 weeks of stable operation
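
The gradual client migration in Phase B can be driven by deterministic hash bucketing, so a given user always lands on the same backend as the percentage ramps up (a sketch; the rollout percentage would come from configuration):

```python
import hashlib

def routes_to_fastapi(user_id: int, rollout_percent: int) -> bool:
    """Deterministically bucket a user into the FastAPI rollout cohort."""
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket in [0, 100)
    return bucket < rollout_percent
```

Because the bucket depends only on the user ID, raising the percentage from 10 to 25 keeps every already-migrated user on FastAPI.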

8.5 API Endpoints Mapping (206 routes)

  • Authentication Endpoints: 15+ routes for user authentication and session management
  • User Management: 20+ routes for user CRUD and permissions
  • Site Management: 30+ routes for site creation, configuration, and deployment
  • Environment Management: 25+ routes for environment lifecycle management
  • Team Management: 15+ routes for team creation, member management, and permissions
  • Billing/Payment: 12+ routes for subscription management and payment processing
  • Webhook Support: 8+ routes for webhook management and delivery
  • Admin Operations: 20+ routes for administrative functions

8.6 Service Layer Architecture with Module Isolation Validation

The Service Layer implements business logic within each domain module following strict isolation principles. This architecture ensures maintainability, testability, and scalability as the codebase grows to serve millions of users.

8.6.1 Service Layer Principles

Core Principles:

  • Domain Services: 40+ services implemented using the modular DDD approach
  • Single Responsibility: Each service handles one domain's business logic
  • Module Isolation: Services MUST NOT import from other modules (except core/database)
  • Adapter Pattern: Clean separation between business logic and external integrations
  • Transaction Management: Services orchestrate database transactions through repositories
  • Validation: Business rule validation is kept separate from API request validation
  • Error Handling: Domain-specific exceptions that routers translate to HTTP responses

8.6.2 Service Layer Structure

Standard Service Pattern:

# backend/app/wordpress/service.py
"""
WordPress Service Layer
Implements business logic for WordPress management.
"""
from sqlalchemy.orm import Session
from fastapi import HTTPException
from typing import Optional

from . import repository, schema
from app.core.shared.rbac import require_permission, Permission
from app.core.shared.tenant import validate_team_access
from app.core.shared.audit import log_audit_event, AuditAction
from app.core.cache import cache_result, invalidate_cache

class WordPressService:
    """Service for WordPress operations"""

    def __init__(self):
        pass

    @cache_result(key_prefix="wordpress", ttl=300)
    def get_wordpress_info(
        self,
        db: Session,
        site_id: int,
        user_id: int
    ) -> schema.WordPressRead:
        """
        Get WordPress information for a site.

        Business Rules:
        - User must have permission to view site
        - Site must have WordPress installed

        Args:
            db: Database session
            site_id: Site ID
            user_id: Requesting user ID

        Returns:
            WordPress information

        Raises:
            HTTPException: If site not found or no permission
        """
        # Validate user has access to site
        site = self._validate_site_access(db, site_id, user_id)

        # Get WordPress installation
        wp = repository.get_wordpress_by_site_id(db, site_id)
        if not wp:
            raise HTTPException(
                status_code=404,
                detail="WordPress not installed on this site"
            )

        return wp

    async def execute_wp_cli_command(
        self,
        db: Session,
        site_id: int,
        command: str,
        user_id: int
    ) -> schema.WpCliResponse:
        """
        Execute WP-CLI command.

        Business Rules:
        - User must have execute permission
        - Command must be in allowed commands list
        - Long-running commands must be queued

        Args:
            db: Database session
            site_id: Site ID
            command: WP-CLI command to execute
            user_id: Requesting user ID

        Returns:
            Command execution result

        Raises:
            HTTPException: If validation fails or command not allowed
        """
        # Validate access
        site = self._validate_site_access(db, site_id, user_id)

        # Validate command is allowed
        if not self._is_command_allowed(command):
            raise HTTPException(
                status_code=400,
                detail=f"Command '{command}' is not allowed"
            )

        # Check if command is long-running
        if self._is_long_running_command(command):
            # Enqueue background job
            job_id = await self._enqueue_wp_cli_job(site_id, command)
            return schema.WpCliResponse(
                output="Command queued",
                exit_code=0,
                job_id=job_id,
                status="queued"
            )

        # Execute command immediately
        result = repository.execute_wp_cli(db, site_id, command)

        # Log audit event
        log_audit_event(
            user_id=user_id,
            team_id=site.team_id,
            action=AuditAction.UPDATE,
            resource_type="wordpress",
            resource_id=site_id,
            metadata={"command": command}
        )

        return result

    def _validate_site_access(
        self,
        db: Session,
        site_id: int,
        user_id: int
    ):
        """Validate user has access to site (internal helper)"""
        # NOTE: cross-module import; under strict isolation this lookup would move to a core-level helper
        from app.sites.repository import get_site_by_id

        site = get_site_by_id(db, site_id)
        if not site:
            raise HTTPException(status_code=404, detail="Site not found")

        # Validate team access
        if not validate_team_access(user_id, site.team_id):
            raise HTTPException(status_code=403, detail="Access denied")

        return site

    def _is_command_allowed(self, command: str) -> bool:
        """Check if WP-CLI command is allowed"""
        from app.core.utils import get_wordpress_allowed_commands
        allowed = get_wordpress_allowed_commands()
        return any(command.startswith(cmd) for cmd in allowed)

    def _is_long_running_command(self, command: str) -> bool:
        """Check if command is long-running"""
        long_running = ['plugin update', 'core update', 'db export', 'db import']
        return any(command.startswith(cmd) for cmd in long_running)

    async def _enqueue_wp_cli_job(self, site_id: int, command: str) -> str:
        """Enqueue WP-CLI command as background job"""
        from arq import create_pool
        from arq.connections import RedisSettings

        redis = await create_pool(RedisSettings())
        job = await redis.enqueue_job('execute_wp_cli', site_id, command)
        return job.job_id

# Singleton instance
wordpress_service = WordPressService()
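
The allow-list and long-running checks above are simple prefix matches, which makes the routing decision easy to test in isolation (standalone sketch with an illustrative allow-list, since `get_wordpress_allowed_commands()` is defined elsewhere):

```python
ALLOWED = ("plugin list", "plugin update", "core version", "db export")
LONG_RUNNING = ("plugin update", "core update", "db export", "db import")

def classify(command: str) -> str:
    """Return how a WP-CLI command would be handled: rejected, queued, or run."""
    if not any(command.startswith(c) for c in ALLOWED):
        return "rejected"  # router raises 400 for disallowed commands
    if any(command.startswith(c) for c in LONG_RUNNING):
        return "queued"    # enqueued as an ARQ background job
    return "run"           # executed synchronously

assert classify("plugin list --format=json") == "run"
assert classify("db export backup.sql") == "queued"
assert classify("eval 'echo hi'") == "rejected"
```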

8.6.3 Module Isolation Validation

Automated Validation:

The validate_architecture.py script enforces module isolation rules:

# backend/scripts/validate_architecture.py (enhanced version)
import ast
import sys
from pathlib import Path
from typing import Set, Dict, List

ALLOWED_IMPORTS = {'core', 'database'}
REQUIRED_FILES = ['router.py', 'service.py', 'repository.py', 'model.py', 'schema.py']
SHARED_MODULES = ['core', 'database', 'tests']

def get_module_imports(module_path: Path) -> Dict[str, Set[str]]:
    """Extract all app.* imports from a module"""
    imports = {}
    for py_file in module_path.rglob("*.py"):
        if py_file.name.startswith('_'):
            continue

        file_imports = set()
        try:
            tree = ast.parse(py_file.read_text())
            for node in ast.walk(tree):
                if isinstance(node, ast.ImportFrom):
                    if node.module and node.module.startswith('app.'):
                        parts = node.module.split('.')
                        if len(parts) >= 2:
                            file_imports.add(parts[1])  # Extract module name
        except Exception as e:
            print(f"Warning: Could not parse {py_file}: {e}")
            continue

        if file_imports:
            imports[str(py_file.relative_to(module_path.parent))] = file_imports

    return imports

def validate_module_isolation() -> List[str]:
    """Validate no module imports from other modules"""
    app_path = Path("app")
    modules = [d for d in app_path.iterdir() if d.is_dir() and d.name not in SHARED_MODULES]

    violations = []

    for module in modules:
        module_imports = get_module_imports(module)

        for file_path, imports in module_imports.items():
            # Allowed imports: core, database, self
            forbidden_imports = imports - set(ALLOWED_IMPORTS) - {module.name}

            if forbidden_imports:
                violations.append(
                    f"❌ {file_path} imports from forbidden modules: {forbidden_imports}\n"
                    f"   Allowed: {ALLOWED_IMPORTS} + own module ({module.name})"
                )

    return violations

def validate_module_structure() -> List[str]:
    """Validate all modules have required files"""
    app_path = Path("app")
    modules = [d for d in app_path.iterdir() if d.is_dir() and d.name not in SHARED_MODULES]

    violations = []

    for module in modules:
        for required_file in REQUIRED_FILES:
            file_path = module / required_file
            if not file_path.exists():
                violations.append(f"❌ Module '{module.name}' missing {required_file}")
            else:
                # Check file is not empty (at least has imports or docstring)
                content = file_path.read_text().strip()
                if len(content) < 10:  # Arbitrary minimum
                    violations.append(f"⚠️  Module '{module.name}' has empty {required_file}")

    return violations

def validate_layer_separation() -> List[str]:
    """Validate layer separation within modules"""
    app_path = Path("app")
    modules = [d for d in app_path.iterdir() if d.is_dir() and d.name not in SHARED_MODULES]

    violations = []

    for module in modules:
        # Check routers don't import SQLAlchemy query code
        router_file = module / "router.py"
        if router_file.exists():
            content = router_file.read_text()
            if 'from sqlalchemy' in content and 'Session' not in content:
                violations.append(
                    f"❌ {module.name}/router.py imports SQLAlchemy (should use service layer)"
                )
            if '.query(' in content or '.filter(' in content:
                violations.append(
                    f"❌ {module.name}/router.py contains database queries (should use service layer)"
                )

        # Check services use repositories for database access
        service_file = module / "service.py"
        if service_file.exists():
            content = service_file.read_text()
            if '.query(' in content or '.add(' in content or '.commit(' in content:
                violations.append(
                    f"⚠️  {module.name}/service.py directly accesses database (should use repository)"
                )

    return violations

def generate_dependency_graph():
    """Generate module dependency graph"""
    app_path = Path("app")
    modules = [d for d in app_path.iterdir() if d.is_dir() and d.name not in SHARED_MODULES]

    print("\n📊 Module Dependency Graph:")
    print("=" * 60)

    for module in modules:
        imports = get_module_imports(module)
        all_imports = set()
        for file_imports in imports.values():
            all_imports.update(file_imports)

        # Filter out own module and allowed shared imports
        external_imports = all_imports - {module.name} - set(ALLOWED_IMPORTS)

        if external_imports:
            print(f"{module.name}{', '.join(external_imports)}")
        else:
            print(f"{module.name} → (no external dependencies) ✓")

if __name__ == "__main__":
    print("🔍 Validating Hybrid Modular DDD Architecture...\n")

    # Run all validations
    isolation_violations = validate_module_isolation()
    structure_violations = validate_module_structure()
    layer_violations = validate_layer_separation()

    # Report violations
    if isolation_violations:
        print("❌ Module Isolation Violations:")
        for violation in isolation_violations:
            print(f"  {violation}")
        print()

    if structure_violations:
        print("❌ Module Structure Violations:")
        for violation in structure_violations:
            print(f"  {violation}")
        print()

    if layer_violations:
        print("❌ Layer Separation Violations:")
        for violation in layer_violations:
            print(f"  {violation}")
        print()

    # Generate dependency graph
    generate_dependency_graph()

    # Exit with appropriate code
    total_violations = len(isolation_violations) + len(structure_violations) + len(layer_violations)

    if total_violations == 0:
        print("\n✅ All architecture validation checks passed!")
        sys.exit(0)
    else:
        print(f"\n❌ Found {total_violations} architecture violations!")
        sys.exit(1)
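
The AST walk at the heart of the script can be exercised directly on a source string, which is how its unit tests might look (sketch; `app_modules_imported` is an illustrative standalone restatement of `get_module_imports`'s inner loop):

```python
import ast

def app_modules_imported(source: str) -> set:
    """Extract the second path segment of every `from app.X...` import."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.module and node.module.startswith("app."):
            found.add(node.module.split(".")[1])
    return found

snippet = (
    "from app.core.cache import cache_result\n"
    "from app.sites.repository import get_site_by_id\n"
)
# The second import would be flagged as a cross-module violation for any
# module other than `sites`, since only {'core', 'database'} are allowed.
assert app_modules_imported(snippet) == {"core", "sites"}
```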

8.6.4 Service Layer Best Practices

1. Transaction Management:

# GOOD: Service manages transaction, repository executes queries
def create_site_with_environment(
    db: Session,
    site_data: schema.SiteCreate,
    env_data: schema.EnvironmentCreate
) -> schema.SiteRead:
    """Create site and default environment (transactional)"""
    try:
        # Create site
        site = repository.create_site(db, site_data)

        # Create environment
        env_data.site_id = site.id
        environment = repository.create_environment(db, env_data)

        # Commit transaction
        db.commit()
        db.refresh(site)

        return site
    except Exception as e:
        db.rollback()
        raise HTTPException(status_code=500, detail=f"Failed to create site: {str(e)}")

# BAD: Service doesn't manage transaction properly
def create_site_with_environment_bad(
    db: Session,
    site_data: schema.SiteCreate,
    env_data: schema.EnvironmentCreate
):
    site = repository.create_site(db, site_data)
    db.commit()  # Commits too early!

    try:
        environment = repository.create_environment(db, env_data)
        db.commit()
    except:
        # Site already committed, can't rollback!
        pass

2. Caching Strategy:

from app.core.cache import cache_result, invalidate_cache

class SiteService:
    @cache_result(key_prefix="sites", ttl=300)
    def get_site_statistics(self, db: Session, site_id: int):
        """Get site statistics (cached for 5 minutes)"""
        return repository.get_site_statistics(db, site_id)

    def update_site(self, db: Session, site_id: int, site_data: schema.SiteUpdate):
        """Update site and invalidate cache"""
        site = repository.update_site(db, site_id, site_data)
        db.commit()

        # Invalidate related caches
        invalidate_cache(f"sites:*:{site_id}:*")

        return site

3. Error Handling:

class WordPressError(Exception):
    """Base exception for WordPress operations"""
    pass

class WordPressNotInstalledError(WordPressError):
    """Raised when WordPress is not installed"""
    pass

class CommandNotAllowedError(WordPressError):
    """Raised when WP-CLI command is not allowed"""
    pass

# In service
def get_wordpress_info(self, db: Session, site_id: int):
    wp = repository.get_wordpress_by_site_id(db, site_id)
    if not wp:
        raise WordPressNotInstalledError(f"WordPress not installed on site {site_id}")
    return wp

# In router (translates domain exceptions to HTTP exceptions)
try:
    wp = service.get_wordpress_info(db, site_id)
    return wp
except WordPressNotInstalledError as e:
    raise HTTPException(status_code=404, detail=str(e))
except WordPressError as e:
    raise HTTPException(status_code=500, detail=str(e))

8.6.5 Multi-Tier Caching Strategy

Application-Level Caching:

  • In-memory LRU caches for static data (allowed commands, configuration)
  • Request-scoped caching within a single API call

Distributed Caching (Redis):

  • User session data
  • API response caching
  • Computed statistics and dashboards
  • Rate limiting counters

Database-Level Optimization:

  • Query result caching
  • Connection pooling via PgBouncer
  • Read replicas for read-heavy operations

CDN Caching:

  • Static assets (images, CSS, JS)
  • Public API responses with Cache-Control headers
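
The first two tiers combine naturally into a read-through pattern; a minimal sketch (a plain dict stands in for Redis, and all names are illustrative):

```python
from functools import lru_cache
from typing import Callable, Dict

REDIS_STAND_IN: Dict[str, str] = {}  # placeholder for the distributed (Redis) tier

def read_through(key: str, compute: Callable[[], str]) -> str:
    """Tier-2 read-through: serve from the shared cache, else compute and store."""
    if key in REDIS_STAND_IN:
        return REDIS_STAND_IN[key]
    value = compute()
    REDIS_STAND_IN[key] = value
    return value

@lru_cache(maxsize=256)
def allowed_commands() -> tuple:
    """Tier-1: process-local cache for static data (no network hop at all)."""
    return ("plugin list", "theme list", "core version")
```

Static data stays in-process via `lru_cache`, while per-tenant computed values go through the shared tier so all API workers see the same entry.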

8.6.6 Monitoring and Metrics

Service-Level Metrics:

from prometheus_client import Counter, Histogram

# Service operation metrics
service_operation_total = Counter(
    'service_operation_total',
    'Total service operations',
    ['module', 'operation', 'status']
)

service_operation_duration = Histogram(
    'service_operation_duration_seconds',
    'Service operation duration',
    ['module', 'operation']
)

# Usage in service
def create_site(self, db: Session, site_data: schema.SiteCreate):
    with service_operation_duration.labels(module='sites', operation='create').time():
        try:
            site = repository.create_site(db, site_data)
            db.commit()
            service_operation_total.labels(module='sites', operation='create', status='success').inc()
            return site
        except Exception as e:
            db.rollback()
            service_operation_total.labels(module='sites', operation='create', status='error').inc()
            raise

8.6.7 Integration with Background Jobs

Services coordinate with background jobs for long-running operations:

from arq import create_pool
from arq.connections import RedisSettings

async def create_full_backup(
    self,
    db: Session,
    site_id: int,
    user_id: int
) -> dict:
    """Create full backup (queued as background job)"""
    # Validate access
    site = self._validate_site_access(db, site_id, user_id)

    # Create backup record (status: pending)
    backup = repository.create_backup(db, schema.BackupCreate(
        site_id=site_id,
        type="full",
        status="pending"
    ))
    db.commit()

    # Enqueue background job
    redis = await create_pool(RedisSettings())
    job = await redis.enqueue_job('create_full_backup', site_id, backup.id)

    return {
        "backup_id": backup.id,
        "job_id": job.job_id,
        "status": "queued"
    }

Summary:

  • 40+ Domain Services: Each module has its own service implementing business logic
  • Strict Module Isolation: Enforced via automated validation in CI/CD
  • Transaction Management: Services orchestrate multi-step operations
  • Caching: Multi-tier caching strategy for performance
  • Error Handling: Domain-specific exceptions translated to HTTP responses
  • Monitoring: Built-in metrics for all service operations
  • Background Jobs: Integration with ARQ for long-running operations

8.7 Next.js Frontend Integration

  • Modern React: Next.js 14 with App Router and React Server Components
  • API Integration: Auto-generated TypeScript clients from OpenAPI specs
  • State Management: Zustand for global state with React Query for server state
  • Authentication Flow: Seamless JWT token management with refresh logic
  • Real-time Updates: WebSocket integration for live updates
  • Progressive Web App: Offline support and native app-like experience

8.8 Domain Modules Inventory & Migration Map

Based on the Laravel 11 inventory (controllers, jobs, services, routes), the following FastAPI modules will be implemented. Each module owns its router, service, repository, model, and schema:

Core Modules (with External API Integration):

  • Auth (auth/): login, refresh, MFA, social login; replaces Laravel Sanctum flows
  • External API: None (internal authentication only)
  • Shared Components: Uses app/core/security.py for JWT management

  • Users (users/): profile, OTP, audit logs; maps UserController

  • External API: None (internal user management)
  • Shared Components: Uses app/core/shared/rbac.py for permissions

  • Teams (teams/): team CRUD, membership; maps TeamController, TeamService

  • External API: None (internal team management)
  • Shared Components: Uses app/core/shared/tenant.py for multi-tenancy

Environment & Infrastructure Modules (Virtuozzo Integration):

  • Sites (sites/): site CRUD/dashboard; maps SitesController, SiteDetailsController
  • External API: None (internal site management)
  • Shared Components: Uses app/core/shared/audit.py for audit logging

  • Environments (environments/): lifecycle start/stop/sleep/restart/rename/delete; maps ApiController, DeleteEnvWorkflowController, Virtuozzo services

  • External API: ⭐ Virtuozzo API via app/core/adapters/virtuozzo_adapter.py
  • Service Methods: start_environment(), stop_environment(), sleep_environment(), restart_environment(), delete_environment()
  • Performance: Connection pooling, caching (5 min TTL), circuit breaker, retries (3x)
  • Shared Components: Uses virtuozzo_adapter for all Virtuozzo API calls

  • Staging (staging/): staging creation pipeline (export DB, create env, install addons, import DB, search/replace); maps StagingController

  • External API: ⭐ Virtuozzo API via app/core/adapters/virtuozzo_adapter.py
  • Service Methods: create_staging_environment(), sync_staging_to_production()
  • Background Jobs: Long-running staging operations via ARQ
  • Shared Components: Uses virtuozzo_adapter for environment creation

  • Backups (backups/): full/custom backup, restore, sessions; maps BackupController, BackupStatusController

  • External API: ⭐ Virtuozzo API via app/core/adapters/virtuozzo_adapter.py
  • Service Methods: create_backup(), restore_backup(), list_backup_sessions()
  • Background Jobs: Backup/restore operations via ARQ
  • Shared Components: Uses virtuozzo_adapter for backup operations

CDN & Cache Modules:

  • Cache (cache/): Redis/OPCache/Relay/LiteSpeed controls and metrics; maps CacheController, ApiController cache routes
  • External API: ⭐ Bunny CDN API via app/core/adapters/bunny_cdn_adapter.py
  • Service Methods: create_dns_record(), configure_cdn(), purge_cache()
  • Performance: HTTP/2, idempotency keys, automatic retries
  • Shared Components: Uses bunny_cdn_adapter for CDN operations

  • Domains (domains/): SSL issuance/renewal, domain verify/bind/remove; maps DomainsController, SiteDetailsController

  • External API:
    • Cloudflare API via app/core/adapters/cloudflare_adapter.py
    • Bunny CDN API via app/core/adapters/bunny_cdn_adapter.py
  • Service Methods: issue_ssl(), verify_domain(), bind_domain(), remove_domain()
  • Background Jobs: SSL issuance/renewal via ARQ
  • Shared Components: Uses multiple adapters for DNS and SSL management

Infrastructure Services:

  • SFTP (sftp/): users CRUD, addon install; maps SftpUserController
  • External API: ⭐ Virtuozzo API via app/core/adapters/virtuozzo_adapter.py
  • Service Methods: create_sftp_user(), install_sftp_addon()
  • Shared Components: Uses virtuozzo_adapter for SFTP user management

  • WordPress (wordpress/): WP CLI proxy, update history, activities; maps WpCacheController, UpdateHistoryController

  • External API: ⭐ Virtuozzo API via app/core/adapters/virtuozzo_adapter.py
  • Service Methods: execute_wp_cli(), install_plugin(), update_wordpress()
  • Background Jobs: Long-running WP-CLI commands via ARQ
  • Rate Limiting: 10 WP-CLI commands per minute per user
  • Shared Components: Uses virtuozzo_adapter for WP-CLI execution

Payment & Billing Modules:

  • Payments (payments/): Stripe intents, configs, post-payment; maps FundController, InvoiceController
  • External API: ⭐ Stripe API (future adapter: app/core/adapters/stripe_adapter.py)
  • Service Methods: create_payment_intent(), process_payment(), refund_payment()
  • Performance: Idempotency keys for safe retries, webhook validation
  • Shared Components: Future Stripe adapter implementation

  • Billing (billing/): PayPal one-time/recurring, plan deactivate/cancel; maps WebhookController

  • External API: ⭐ PayPal API (future adapter: app/core/adapters/paypal_adapter.py)
  • Service Methods: create_subscription(), cancel_subscription(), process_webhook()
  • Shared Components: Future PayPal adapter implementation

Monitoring & Notifications:

  • Uptime (uptime/): uptime monitor/checker; maps UptimeController, Uptime*Service
  • External API: ⭐ UptimeRobot API (future adapter: app/core/adapters/uptime_adapter.py)
  • Service Methods: create_monitor(), check_uptime(), get_monitor_status()
  • Background Jobs: Periodic uptime checks via ARQ
  • Shared Components: Future UptimeRobot adapter implementation

  • Nodes & Metrics (nodes/): node stats, env metrics time-series; maps NodeController, NodeStatsController, EnvMetricsController

  • External API: ⭐ Virtuozzo API via app/core/adapters/virtuozzo_adapter.py
  • Service Methods: fetch_node_stats(), get_environment_metrics()
  • Caching: Aggressive caching (1 min TTL) for metrics data
  • Shared Components: Uses virtuozzo_adapter for metrics collection

Utility Modules:

  • Activity & Notes (activity/): notes CRUD, site activities; maps ActivityController
  • External API: None (internal activity logging)
  • Shared Components: Uses app/core/shared/audit.py for audit trail

  • Favourites (favourites/): add/remove/list; maps FavouritesController

  • External API: None (internal user preferences)

  • Config (config/): versioned config endpoints api/v1/config/*; maps ConfigController

  • External API: None (internal configuration management)

  • Profile (profile/): user profile update/delete; maps ProfileController

  • External API: None (internal user profile management)

  • Instant/Virt Login (sessions/): instant login, virt login; maps InstantLoginController, VirtLoginController

  • External API: ⭐ Virtuozzo API via app/core/adapters/virtuozzo_adapter.py
  • Service Methods: create_instant_login(), create_virt_login()
  • Shared Components: Uses virtuozzo_adapter for session key generation

  • Jobs (jobs/): dispatch/status for background tasks; maps JobController, JobLogsController, JobStatusController

  • External API: None (internal job queue management)
  • Shared Components: Uses ARQ for background job processing

  • Webhook (webhook/): inbound webhooks processing; maps WebhookController

  • External API: Receives webhooks from external services (Stripe, PayPal, etc.)
  • Service Methods: process_stripe_webhook(), process_paypal_webhook()

  • WebSocket (websocket/): channels, health, metrics, broadcast; maps WebSocketController, Node bridge parity

  • External API: None (internal real-time communication)
  • Shared Components: Uses Redis Pub/Sub for cross-instance messaging

Shared External API Infrastructure:

Located in app/core/adapters/:

app/core/adapters/
├── __init__.py
├── virtuozzo_adapter.py        # Used by: environments, wordpress, backups, sftp, staging, nodes, sessions
├── bunny_cdn_adapter.py         # Used by: cache, domains
├── cloudflare_adapter.py        # Used by: domains
├── stripe_adapter.py            # Future: Used by: payments
├── paypal_adapter.py            # Future: Used by: billing
├── postmark_adapter.py          # Future: Used by: notifications
└── uptime_adapter.py            # Future: Used by: uptime

Adapter Usage Matrix:

Adapter Modules Using It API Operations Performance Features
virtuozzo_adapter.py environments, wordpress, backups, sftp, staging, nodes, sessions Environment lifecycle, WP-CLI, backups, SFTP, metrics Connection pooling, caching (5 min), circuit breaker, rate limiting (10 req/s)
bunny_cdn_adapter.py cache, domains DNS records, CDN config, cache purging HTTP/2, idempotency keys, retries (3x)
cloudflare_adapter.py domains DNS management, SSL verification Connection pooling, automatic retries

For full endpoint-by-endpoint mapping, see docs/04-API-ENDPOINTS-MAPPING.md (auto-generated from route analyzers), kept in lockstep with OpenAPI.

8.9 API Versioning & Backward Compatibility Strategy

  • Versioned FastAPI routes under /api/v1/** mirroring Laravel endpoints to enable incremental frontend migration
  • Legacy compatibility shims maintained where necessary:
  • Path aliases for legacy endpoints (e.g., /get-backup-list → /api/v1/backups/list)
  • Payload/response translators to preserve current frontend contracts
  • Deprecation policy:
  • All shims carry Deprecation and Sunset headers
  • Removal only after frontend cutover and 30-day window
  • Contract testing:
  • Snapshot tests for JSON shapes versus Laravel responses
  • Contract CI checks block incompatible changes
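The payload/response translators behind the legacy path aliases can be plain functions that reshape v1 responses into the contract the current frontend expects. A minimal sketch — the legacy camelCase field names and the `/get-backup-list` payload shape here are illustrative assumptions, not the real contract:

```python
from typing import Any, Dict

def translate_backup_list_response(v1_payload: Dict[str, Any]) -> Dict[str, Any]:
    """Translate a /api/v1/backups/list response into the legacy
    /get-backup-list shape (field names are illustrative)."""
    return {
        "status": "success" if v1_payload.get("ok", True) else "error",
        # Legacy clients expect a flat "backups" array with camelCase keys
        "backups": [
            {
                "backupId": item["id"],
                "createdAt": item["created_at"],
                "sizeBytes": item.get("size_bytes", 0),
            }
            for item in v1_payload.get("items", [])
        ],
    }
```

Snapshot tests can then compare this output byte-for-byte against recorded Laravel responses, which is what lets contract CI block incompatible changes.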

8.10 Background Jobs Migration Mapping

  • Queues:
  • critical, default, bulk, io-bound, webhooks, emails, dlq
  • Representative job mappings (Laravel → FastAPI/ARQ task):
  • CreateEnvironmentJob → environments.tasks.create_environment
  • DeleteEnvironmentJob → environments.tasks.delete_environment
  • CreateFullBackupJob / CreateCustomBackupJob → backups.tasks.create_backup
  • RestoreBackup → backups.tasks.restore_backup
  • InstallWordPressJob / RunDynamicWpCli → wordpress.tasks.run_wp_cli
  • InstallLetsEncryptSSLJob / UpdateSSLJob / IssueSslCertJob → domains.tasks.manage_ssl
  • SyncSftpUsersJob / InstallAddSftpJob → sftp.tasks.sync_users
  • SetupBunnyCdnJob → cache.tasks.configure_cdn
  • DispatchSyncJob / DispatchSingleSyncJob → sites.tasks.sync_environment
  • ProcessDomainIP / VerifyDomainJob / RemoveDomainJob → domains.tasks.verify_or_remove
  • Execution guarantees:
  • Idempotency keys based on tuple (team_id, env, op, args_hash)
  • Outbox + inbox tables for external calls (payments, SSL, CDN)
  • DLQ viewer and replay tooling with backoff policies
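The idempotency-key tuple above can be derived deterministically from the job's identity, so a retried dispatch always maps to the same key. A sketch — the exact string format is an assumption:

```python
import hashlib
import json

def job_idempotency_key(team_id: int, env: str, op: str, args: dict) -> str:
    """Derive a stable key from (team_id, env, op, args_hash).

    sort_keys=True makes the hash independent of dict ordering, so two
    dispatches of the same logical job always produce the same key.
    """
    args_hash = hashlib.sha256(
        json.dumps(args, sort_keys=True, separators=(",", ":")).encode()
    ).hexdigest()[:16]
    return f"{team_id}:{env}:{op}:{args_hash}"
```

Before enqueueing, the dispatcher checks whether a job with this key is already pending or completed and short-circuits if so.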

8.11 WebSocket Migration Plan (Durability & Presence)

  • Phase A (Bridge): Keep Node WS (Redis Pub/Sub + Streams) behind /ws while FastAPI replicates features
  • Phase B (Dual-Run): Mirror publish to both WS stacks; validators compare payload parity
  • Phase C (Cutover): Switch ingress to FastAPI WS; Node remains as hot-standby for rollback
  • Feature parity requirements:
  • Channels: team.{teamId}.{feature} (jobs, backups, activities, sites.creating, domains, favourites)
  • Durability: Redis Streams with consumer groups, replay API, ACK timeouts
  • Presence: Redis-backed presence set with TTL, cross-instance fan-out
  • Security: JWT with team claims; per-channel authorization
  • Ops: /metrics, /health, /channels, structured logs with correlation IDs
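The Redis-backed presence set with TTL can be illustrated with an in-memory stand-in; production would use Redis keys with EXPIRE or a sorted set scored by expiry, and the `PresenceSet` class and method names here are illustrative:

```python
import time
from typing import Dict, Optional, Set

class PresenceSet:
    """In-memory stand-in for the Redis-backed presence set: each heartbeat
    refreshes a per-member TTL, and expired members drop out on read."""

    def __init__(self, ttl_seconds: float = 30.0):
        self.ttl = ttl_seconds
        # channel -> member -> expiry timestamp
        self._seen: Dict[str, Dict[str, float]] = {}

    def heartbeat(self, channel: str, member: str, now: Optional[float] = None) -> None:
        """Refresh the member's TTL; clients heartbeat on an interval < TTL."""
        now = time.monotonic() if now is None else now
        self._seen.setdefault(channel, {})[member] = now + self.ttl

    def members(self, channel: str, now: Optional[float] = None) -> Set[str]:
        """Return live members, pruning anyone whose TTL has lapsed."""
        now = time.monotonic() if now is None else now
        entries = self._seen.get(channel, {})
        self._seen[channel] = {m: exp for m, exp in entries.items() if exp > now}
        return set(self._seen[channel])
```

Cross-instance fan-out then only needs each WS node to heartbeat its local connections into the shared set and read the union back.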

8.12 Data Model Migration & SQLAlchemy Strategy

  • Model inventory (representative): User, Team, Member, Site, Environment, Backup, BackupSession, Domain, SftpUser, Node, Note, SiteActivity, Transaction, PaypalSubscription, TeamFund, PendingSite, UpdateHistory, SyncHistory
  • Approach:
  • Generate SQLAlchemy models to match target PostgreSQL schema (snake_case, explicit FKs, indexes)
  • Alembic migrations: baseline from current MySQL schema (via alembic revision --autogenerate after initial models), then hand-tune constraints and indexes
  • Data migration:
    • Extract: chunked reads from MySQL ordered by PK
    • Transform: enum/string normalization, timezone normalization, UUIDs where applicable
    • Load: COPY-based bulk inserts; verify row counts and checksums
  • Validation:
    • Referential integrity checks per batch
    • Sampling-based semantic validation (e.g., backup sessions reconcile with backups)
  • Cutover:
    • Dual-write window for critical tables (feature-flagged)
    • Read-from-Postgres dark launch; promote after parity
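The extract → transform → load steps above can be sketched independently of the database drivers; `chunked`, `transform`, and `migrate` are illustrative helpers, and the real normalization rules (timezones, UUIDs, per-table enums) would be plugged into `transform`:

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List

def chunked(rows: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield fixed-size batches, mirroring PK-ordered chunked reads from MySQL."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def transform(row: dict) -> dict:
    """Illustrative normalization: trim and lowercase enum-like strings."""
    out = dict(row)
    status = out.get("status")
    if isinstance(status, str):
        out["status"] = status.strip().lower()
    return out

def migrate(rows: Iterable[dict],
            load: Callable[[List[dict]], None],
            batch_size: int = 1000) -> int:
    """Extract -> transform -> load; the returned row count feeds the
    row-count/checksum verification step after the COPY-based load."""
    total = 0
    for batch in chunked(rows, batch_size):
        load([transform(r) for r in batch])
        total += len(batch)
    return total
```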

8.13 Observability, SLOs, and Load Testing

  • SLOs:
  • API p95 latency < 200ms; WS broadcast latency p95 < 250ms
  • Error rate < 0.1%; Availability 99.9%
  • Metrics:
  • API: latency, throughput, error rates, saturation (DB pool, Redis)
  • Jobs: queue depth, processing latency, retry rates, DLQ rates
  • WS: active connections, channel subscribers, publish/replay latency
  • Tracing:
  • End-to-end tracing (OpenTelemetry) across HTTP → jobs → WS
  • Logging:
  • Structured JSON logs; PII scrubbing; per-request correlation IDs
  • Load testing:
  • k6/Gatling scenarios per critical flows (site lifecycle, backups, domains)
  • Soak tests and spike tests with autoscaling validation
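Per-request correlation IDs propagate naturally through async code via `contextvars`, which is what lets one ID follow a request from HTTP handler into spawned jobs and WS publishes. A minimal sketch of the pattern — the function names are illustrative, and a real deployment would wire this into FastAPI middleware and structlog processors:

```python
import contextvars
import uuid
from typing import Optional

# Tasks spawned within a request inherit this ContextVar automatically,
# so HTTP -> jobs -> WS all log the same ID without passing it explicitly.
correlation_id_var: contextvars.ContextVar = contextvars.ContextVar(
    "correlation_id", default=None
)

def begin_request(incoming_header: Optional[str] = None) -> str:
    """Reuse the caller's X-Correlation-ID when present, else mint one."""
    cid = incoming_header or str(uuid.uuid4())
    correlation_id_var.set(cid)
    return cid

def log_fields(**fields) -> dict:
    """Attach the current correlation ID to a structured log entry."""
    return {"correlation_id": correlation_id_var.get(), **fields}
```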

8.14 Cutover & Rollback Strategy

  • Blue/Green deployments for backend; frontend deployed independently
  • Database:
  • Pre-warm read replicas; promote on failover
  • Point-in-time recovery enabled; backups validated daily
  • WS:
  • Dual-publish period for parity; feature flag to revert to Node WS
  • API:
  • Feature flags for compatibility shims; runtime kill-switches for risky features
  • Canary release to 10% traffic before full rollout
  • Rollback playbooks with time-bounded MTTR targets
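The runtime kill-switches above can be modeled as a small feature-flag registry. A sketch with in-memory storage — production flags would live in Redis or a config service so they flip without a deploy, and the flag names here are hypothetical:

```python
class FeatureFlags:
    """Minimal in-memory kill-switch registry (a sketch; real flags would
    be read from Redis/config so they can change at runtime)."""

    def __init__(self, defaults: dict):
        self._flags = dict(defaults)

    def is_enabled(self, name: str) -> bool:
        return self._flags.get(name, False)

    def kill(self, name: str) -> None:
        """Runtime kill-switch: disable a risky feature immediately."""
        self._flags[name] = False

# Hypothetical flag names for the WS cutover and the API compatibility shims
flags = FeatureFlags({"ws_fastapi_ingress": True, "legacy_shims": True})

def ws_ingress_target() -> str:
    """Route WS traffic to FastAPI during rollout; revert to Node on kill."""
    return "fastapi" if flags.is_enabled("ws_fastapi_ingress") else "node"
```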

8.15 External API Integration Architecture

Context: The MBPanel system integrates with 40+ external services including Virtuozzo (19 services), CDN/Cache providers (4 services), payment processors, and other third-party APIs. This section defines the architecture for migrating Laravel's HTTP client patterns to FastAPI with optimal performance, reliability, and resilience.

8.15.1 Problem Statement and Current Challenges

Laravel Pattern (Current State):

// Laravel: app/Services/Virtuozzo/MbAdminService.php
$response = Http::timeout(90)->post($apiUrl);
if (!$response->successful()) {
    throw new \Exception("Failed...");
}
return $response->json();

Critical Issues with Direct Migration:

1. ❌ No connection pooling → new connection per request (70ms overhead)
2. ❌ No circuit breaker → cascading failures during external API outages
3. ❌ No response caching → repeated calls to the same endpoint
4. ❌ No rate limiting → risk of being throttled by external APIs
5. ❌ No retry with exponential backoff → transient failures surface as errors
6. ❌ No request timeout strategy → hanging requests
7. ❌ No correlation IDs → difficult to trace requests across systems

Laravel's Good Patterns to Preserve:

  • Circuit breaker implementation in ExternalApiService.php
  • Idempotent operation handling in BunnyCdnService.php
  • Structured error logging

8.15.2 HTTP Client Infrastructure (app/core/http_client.py)

Purpose: Centralized HTTP client with connection pooling, retries, circuit breakers, caching, and rate limiting.

Key Components:

1. CircuitBreaker Class:

class CircuitBreaker:
    """Circuit breaker implementation for external API calls"""

    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: int = 300,  # 5 minutes
        expected_exception: type = Exception
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.expected_exception = expected_exception
        self.failure_count = 0
        self.last_failure_time = None
        self.state = "CLOSED"  # CLOSED, OPEN, HALF_OPEN

States:

  • CLOSED: Normal operation, requests pass through
  • OPEN: Circuit breaker triggered, fail fast without calling the API
  • HALF_OPEN: Testing recovery, a single request is allowed through
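The class above declares the breaker's fields but not the transition logic its `call` method must implement. A minimal self-contained sketch consistent with those fields — synchronous for brevity; the real client would await the wrapped coroutine:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: int = 300,
                 expected_exception: type = Exception):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.expected_exception = expected_exception
        self.failure_count = 0
        self.last_failure_time = None
        self.state = "CLOSED"  # CLOSED, OPEN, HALF_OPEN

    def call(self, func, *args, **kwargs):
        """Run func through the breaker (sync sketch of the pattern)."""
        if self.state == "OPEN":
            # Allow a single probe once the recovery window has elapsed
            if time.monotonic() - self.last_failure_time >= self.recovery_timeout:
                self.state = "HALF_OPEN"
            else:
                raise RuntimeError("Circuit breaker OPEN: failing fast")
        try:
            result = func(*args, **kwargs)
        except self.expected_exception:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            # A failed probe, or hitting the threshold, (re)opens the circuit
            if self.state == "HALF_OPEN" or self.failure_count >= self.failure_threshold:
                self.state = "OPEN"
            raise
        # Any success closes the breaker and resets the failure count
        self.failure_count = 0
        self.state = "CLOSED"
        return result
```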

2. ExternalAPIClient Class:

class ExternalAPIClient:
    """
    High-performance HTTP client for external API integrations.

    Features:
    - Connection pooling (reduces latency by 50-70%)
    - Automatic retries with exponential backoff
    - Circuit breaker pattern
    - Response caching
    - Rate limiting
    - Request/response logging
    """

    def __init__(
        self,
        base_url: str,
        timeout: float = 30.0,
        max_retries: int = 3,
        cache_ttl: int = 300,
        rate_limit: Optional[int] = None,  # requests per second
    ):
        self.base_url = base_url
        self.timeout = timeout
        self.max_retries = max_retries
        self.cache_ttl = cache_ttl
        self.rate_limit = rate_limit

        # Connection pooling: reuse connections across requests
        self.client = httpx.AsyncClient(
            base_url=base_url,
            timeout=httpx.Timeout(timeout),
            limits=httpx.Limits(
                max_keepalive_connections=20,
                max_connections=100,
                keepalive_expiry=60.0
            ),
            http2=True  # Enable HTTP/2 for better performance
        )

        # Circuit breaker
        self.circuit_breaker = CircuitBreaker(
            failure_threshold=5,
            recovery_timeout=300
        )

        # Rate limiting
        self._rate_limit_tokens = []
        self._rate_limit_lock = asyncio.Lock()

3. Request Methods with Resilience:

GET with Caching:

async def get(
    self,
    path: str,
    params: Optional[Dict[str, Any]] = None,
    headers: Optional[Dict[str, str]] = None,
    cache_key: Optional[str] = None,
    correlation_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Execute GET request with caching, retries, and circuit breaker"""

    # Check cache first (assumes redis_client deserializes stored JSON values)
    if cache_key:
        cached = redis_client.get(cache_key)
        if cached:
            logger.info(
                "external_api_cache_hit",
                path=path,
                cache_key=cache_key,
                correlation_id=correlation_id
            )
            return cached

    # Check rate limit
    await self._check_rate_limit()

    # Propagate correlation ID (create the headers dict if the caller passed none)
    if correlation_id:
        headers = dict(headers or {})
        headers["X-Correlation-ID"] = correlation_id

    # Execute with circuit breaker and retry
    return await self.circuit_breaker.call(
        self._execute_get, path, params, headers, cache_key, correlation_id
    )

POST with Idempotency:

async def post(
    self,
    path: str,
    data: Optional[Dict[str, Any]] = None,
    json: Optional[Dict[str, Any]] = None,
    headers: Optional[Dict[str, str]] = None,
    correlation_id: Optional[str] = None,
    idempotency_key: Optional[str] = None,
) -> Dict[str, Any]:
    """
    Execute POST request with retries and circuit breaker.

    Args:
        idempotency_key: Idempotency key for safe retries
    """

    # Check rate limit
    await self._check_rate_limit()

    # Add correlation ID and idempotency key to headers
    if headers is None:
        headers = {}
    if correlation_id:
        headers["X-Correlation-ID"] = correlation_id
    if idempotency_key:
        headers["Idempotency-Key"] = idempotency_key

    # Execute the request (circuit breaker + retry wrapping, as in GET,
    # is elided in this excerpt)
    response = await self.client.post(path, data=data, json=json, headers=headers)

    # Check for idempotent errors (already exists)
    if response.status_code == 400:
        response_body = response.text
        if any(
            phrase in response_body.lower()
            for phrase in ["already registered", "already exists", "duplicate"]
        ):
            logger.info(
                "external_api_idempotent_success",
                method="POST",
                path=path,
                status_code=response.status_code,
                correlation_id=correlation_id
            )
            # Treat as success (idempotent operation)
            return {"status": "success", "idempotent": True}

    response.raise_for_status()
    return response.json()

4. Retry Logic with Exponential Backoff:

retry_count = 0
while retry_count <= self.max_retries:
    try:
        response = await self.client.get(path, params=params, headers=headers)
        response.raise_for_status()
        return response.json()
    except httpx.HTTPStatusError as e:
        # Don't retry client errors (4xx)
        if 400 <= e.response.status_code < 500:
            raise

        retry_count += 1
        if retry_count > self.max_retries:
            raise  # retries exhausted: surface the last error

        # Exponential backoff: 1s, 2s, 4s
        backoff = 2 ** (retry_count - 1)
        logger.warning(
            "external_api_retry",
            path=path,
            retry=retry_count,
            backoff_seconds=backoff,
            error=str(e),
            correlation_id=correlation_id
        )
        await asyncio.sleep(backoff)

5. Singleton Pattern for Client Management:

# Global HTTP client instances (singleton pattern)
_clients: Dict[str, ExternalAPIClient] = {}

def get_http_client(
    name: str,
    base_url: str,
    **kwargs
) -> ExternalAPIClient:
    """
    Get or create HTTP client instance.

    Args:
        name: Client identifier (e.g., 'virtuozzo', 'bunny_cdn')
        base_url: Base URL for API
        **kwargs: Additional client configuration

    Returns:
        ExternalAPIClient instance
    """
    if name not in _clients:
        _clients[name] = ExternalAPIClient(base_url, **kwargs)
    return _clients[name]

async def close_all_clients():
    """Close all HTTP clients (called on app shutdown)"""
    for client in _clients.values():
        await client.close()
    _clients.clear()

8.15.3 Service Adapter Pattern (app/core/adapters/)

Purpose: Isolate external API logic from domain service logic using the Adapter Pattern.

Adapter Location Strategy:

app/
├── core/
│   ├── adapters/           # ⭐ SHARED ADAPTERS (used by multiple modules)
│   │   ├── __init__.py
│   │   ├── virtuozzo_adapter.py      # Used by environments, wordpress, backups
│   │   ├── bunny_cdn_adapter.py      # Used by cache, domains
│   │   └── cloudflare_adapter.py     # Used by domains
│   └── http_client.py      # Shared HTTP client
├── environments/
│   └── service.py          # Uses virtuozzo_adapter
├── wordpress/
│   └── service.py          # Uses virtuozzo_adapter
└── cache/
    └── service.py          # Uses bunny_cdn_adapter

Example: Virtuozzo Adapter:

# backend/app/core/adapters/virtuozzo_adapter.py
"""
Virtuozzo API Adapter

Handles all interactions with Virtuozzo API.
Isolates external API concerns from domain logic.
"""

from typing import Dict, Any, Optional
from sqlalchemy.orm import Session
from app.core.http_client import get_http_client
from app.core.config import settings
import structlog

logger = structlog.get_logger()

class VirtuozzoAdapter:
    """
    Adapter for Virtuozzo API integration.

    Responsibilities:
    - Execute Virtuozzo API calls
    - Handle session key management
    - Transform Virtuozzo responses to domain models
    - Cache frequently accessed data
    """

    def __init__(self):
        self.client = get_http_client(
            name="virtuozzo",
            base_url=settings.VIRTUOZZO_API_URL,
            timeout=90.0,  # Virtuozzo needs longer timeout
            max_retries=3,
            cache_ttl=300,
            rate_limit=10  # Max 10 requests per second
        )

    async def fetch_environments_and_nodes(
        self,
        session_key: str,
        correlation_id: Optional[str] = None
    ) -> Dict[str, Any]:
        """
        Fetch environments and nodes from Virtuozzo API.

        Args:
            session_key: Encrypted session key
            correlation_id: Request correlation ID for tracing

        Returns:
            Dict containing environments and nodes data
        """

        cache_key = f"virtuozzo:environments:{session_key[:8]}"

        response = await self.client.get(
            path="/1.0/environment/control/rest/getenvs",
            params={"session": session_key},
            cache_key=cache_key,
            correlation_id=correlation_id
        )

        # Normalize response
        if "error" in response and response["error"]:
            logger.error(
                "virtuozzo_api_error",
                error=response["error"],
                correlation_id=correlation_id
            )
            raise Exception(f"Virtuozzo API error: {response['error']}")

        # Process and normalize environment data
        if "infos" in response and isinstance(response["infos"], list):
            for info in response["infos"]:
                if "env" in info:
                    # Ensure displayName exists (fallback to envName)
                    if not info["env"].get("displayName"):
                        info["env"]["displayName"] = (
                            info["env"].get("envName") or
                            info["env"].get("shortdomain") or
                            "Unknown"
                        )

        return response

    async def execute_mbadmin_action(
        self,
        app_unique_name: str,
        session_key: str,
        action: str,
        params: Dict[str, Any],
        correlation_id: Optional[str] = None,
        idempotency_key: Optional[str] = None
    ) -> Dict[str, Any]:
        """
        Execute MbAdmin action via Virtuozzo marketplace API.
        """

        import json
        import urllib.parse

        # Encode params
        params_json = json.dumps(params)
        params_encoded = urllib.parse.quote(params_json)

        # Construct URL path
        path = (
            f"/1.0/marketplace/installation/rest/executeaction"
            f"?appUniqueName={app_unique_name}"
            f"&session={session_key}"
            f"&action={action}"
            f"&params={params_encoded}"
        )

        logger.info(
            "virtuozzo_execute_action",
            app_unique_name=app_unique_name,
            action=action,
            correlation_id=correlation_id
        )

        response = await self.client.post(
            path=path,
            headers={},
            correlation_id=correlation_id,
            idempotency_key=idempotency_key
        )

        return response

    async def start_environment(
        self,
        env_name: str,
        session_key: str,
        correlation_id: Optional[str] = None
    ) -> Dict[str, Any]:
        """Start environment"""
        return await self.client.post(
            path="/1.0/environment/control/rest/startenv",
            json={"envName": env_name, "session": session_key},
            correlation_id=correlation_id
        )

# Singleton instance
_virtuozzo_adapter: Optional[VirtuozzoAdapter] = None

def get_virtuozzo_adapter() -> VirtuozzoAdapter:
    """Get Virtuozzo adapter instance"""
    global _virtuozzo_adapter
    if _virtuozzo_adapter is None:
        _virtuozzo_adapter = VirtuozzoAdapter()
    return _virtuozzo_adapter

Example: Bunny CDN Adapter:

# backend/app/core/adapters/bunny_cdn_adapter.py
from app.core.http_client import get_http_client
from app.core.config import settings

class BunnyCDNAdapter:
    def __init__(self):
        self.client = get_http_client(
            name="bunny_cdn",
            base_url="https://api.bunny.net",
            timeout=30.0,
            max_retries=3
        )
        self.access_key = settings.BUNNY_CDN_ACCESS_KEY

    async def create_dns_record(
        self,
        env_name: str,
        platform_domain: str,
        correlation_id: str
    ) -> dict:
        """Create DNS CNAME record"""

        payload = {
            "Type": "CNAME",
            "Name": env_name,
            "Value": f"{env_name}.{platform_domain}",
            "Ttl": 15,
            "Accelerated": True,
            "MonitorType": "Monitor",
            "AutoSslIssuance": True
        }

        headers = {
            "AccessKey": self.access_key,
            "Accept": "application/json",
            "Content-Type": "application/json"
        }

        # Generate idempotency key for safe retries
        idempotency_key = f"bunny:dns:{env_name}:{platform_domain}"

        response = await self.client.post(
            path=f"/dnszone/{settings.BUNNY_DNS_ZONE_ID}/records",
            json=payload,
            headers=headers,
            correlation_id=correlation_id,
            idempotency_key=idempotency_key
        )

        return response

8.15.4 Service Integration Pattern

Purpose: Use adapters in service layer while maintaining domain logic separation.

Example: Environment Service Using Virtuozzo Adapter:

# backend/app/environments/service.py
from sqlalchemy.orm import Session
from typing import Optional
from app.core.shared.audit import log_audit_event, AuditAction
from app.core.adapters.virtuozzo_adapter import get_virtuozzo_adapter
from app.environments import repository, schema
from app.core.exceptions import EnvironmentNotFoundError
import structlog
import uuid

logger = structlog.get_logger()

class EnvironmentService:
    """Service for environment operations"""

    def __init__(self):
        self.virtuozzo_adapter = get_virtuozzo_adapter()

    async def start_environment(
        self,
        db: Session,
        environment_id: int,
        user_id: int
    ) -> schema.EnvironmentRead:
        """
        Start environment.

        Business Rules:
        - User must have permission
        - Environment must exist
        - Environment must be in stopped/sleeping state
        """

        # Generate correlation ID for tracing
        correlation_id = str(uuid.uuid4())

        logger.info(
            "environment_start_requested",
            environment_id=environment_id,
            user_id=user_id,
            correlation_id=correlation_id
        )

        # Get environment from database
        environment = repository.get_environment_by_id(db, environment_id)
        if not environment:
            raise EnvironmentNotFoundError(f"Environment {environment_id} not found")

        # Validate user has access (via team)
        self._validate_access(db, user_id, environment.team_id)

        # Validate environment state
        if environment.status == "running":
            logger.info(
                "environment_already_running",
                environment_id=environment_id,
                correlation_id=correlation_id
            )
            return environment

        try:
            # Call Virtuozzo API via adapter
            response = await self.virtuozzo_adapter.start_environment(
                env_name=environment.env_name,
                session_key=environment.session_key,
                correlation_id=correlation_id
            )

            # Update environment status
            environment.status = "starting"
            repository.update_environment(db, environment_id, {"status": "starting"})
            db.commit()

            # Log audit event
            log_audit_event(
                user_id=user_id,
                team_id=environment.team_id,
                action=AuditAction.UPDATE,
                resource_type="environment",
                resource_id=environment_id,
                metadata={
                    "action": "start",
                    "correlation_id": correlation_id
                }
            )

            logger.info(
                "environment_start_success",
                environment_id=environment_id,
                correlation_id=correlation_id
            )

            return environment

        except Exception as e:
            logger.error(
                "environment_start_failed",
                environment_id=environment_id,
                error=str(e),
                correlation_id=correlation_id
            )

            # Update status to error
            repository.update_environment(db, environment_id, {"status": "error"})
            db.commit()

            raise

8.15.5 Performance Optimizations

1. Connection Pooling Benefits:

Before (Laravel - No Pooling):

Request 1: DNS lookup (20ms) + TLS handshake (50ms) + Request (30ms) = 100ms
Request 2: DNS lookup (20ms) + TLS handshake (50ms) + Request (30ms) = 100ms
Request 3: DNS lookup (20ms) + TLS handshake (50ms) + Request (30ms) = 100ms
Total: 300ms

After (FastAPI - With Pooling):

Request 1: DNS lookup (20ms) + TLS handshake (50ms) + Request (30ms) = 100ms
Request 2: Reuse connection + Request (30ms) = 30ms
Request 3: Reuse connection + Request (30ms) = 30ms
Total: 160ms (47% improvement)

2. Response Caching:

# Cache frequently accessed Virtuozzo data
cache_key = f"virtuozzo:environments:{session_key[:8]}"
cache_ttl = 300  # 5 minutes

# First request: 90ms (API call)
# Subsequent requests: 2ms (Redis cache)
# Improvement: 98% faster

3. HTTP/2 Multiplexing:

# Enable HTTP/2 for better performance
self.client = httpx.AsyncClient(
    http2=True  # Multiple requests over single connection
)

# Multiple parallel requests share single TCP connection
# Reduces latency and connection overhead
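The latency benefit of multiplexing parallel requests can be demonstrated with a timing sketch, using `asyncio.sleep` as a stand-in for real HTTP calls over the shared connection:

```python
import asyncio
import time

async def fake_api_call(delay: float = 0.05) -> str:
    """Stand-in for one request multiplexed over the shared connection."""
    await asyncio.sleep(delay)
    return "ok"

async def fetch_all(n: int = 3) -> float:
    """Issue n requests concurrently: wall time ~ one request, not n."""
    start = time.monotonic()
    results = await asyncio.gather(*(fake_api_call() for _ in range(n)))
    assert all(r == "ok" for r in results)
    return time.monotonic() - start

elapsed = asyncio.run(fetch_all())
# Three sequential 50ms calls would take ~150ms; concurrent takes ~50ms
```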

8.15.6 Resilience Patterns

1. Circuit Breaker:

# Protect against cascading failures
circuit_breaker = CircuitBreaker(
    failure_threshold=5,  # Open after 5 failures
    recovery_timeout=300  # Try again after 5 minutes
)

# States:
# - CLOSED: Normal operation
# - OPEN: Fail fast, don't call API
# - HALF_OPEN: Try single request to test recovery

2. Retry with Exponential Backoff:

# Automatic retries for transient failures
retry_count = 0
while retry_count <= max_retries:
    try:
        return await self.client.get(path)
    except httpx.NetworkError:
        retry_count += 1
        if retry_count > max_retries:
            raise  # retries exhausted
        backoff = 2 ** (retry_count - 1)  # 1s, 2s, 4s
        await asyncio.sleep(backoff)

3. Idempotent Operations:

# Safe retries for mutations
idempotency_key = f"create_env:{user_id}:{env_name}"

await self.client.post(
    path="/create",
    json=data,
    idempotency_key=idempotency_key
)

# If retry hits "already exists" error, treat as success

8.15.7 Module Isolation Compliance

Import Rules for Adapters:

# ✅ ALLOWED: Import shared adapter
from app.core.adapters.virtuozzo_adapter import get_virtuozzo_adapter

# ✅ ALLOWED: Import HTTP client
from app.core.http_client import get_http_client

# ❌ FORBIDDEN: Import from other module
from app.wordpress.service import WordPressService  # VIOLATION!

Adapter Directory Structure:

app/
├── core/
│   ├── adapters/           # ⭐ SHARED ADAPTERS
│   │   ├── __init__.py
│   │   ├── virtuozzo_adapter.py
│   │   ├── bunny_cdn_adapter.py
│   │   └── cloudflare_adapter.py
│   └── http_client.py      # Shared HTTP client
├── environments/
│   └── service.py          # Uses virtuozzo_adapter
├── wordpress/
│   └── service.py          # Uses virtuozzo_adapter
└── backups/
    └── service.py          # Uses virtuozzo_adapter
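These import rules can be enforced mechanically in CI; a small AST-based sketch (the forbidden-prefix matrix below is illustrative; in practice it would be generated from the full module list):

```python
import ast

# Illustrative isolation matrix: each domain module may not import the others.
FORBIDDEN = {
    "environments": ("app.wordpress", "app.backups"),
    "wordpress": ("app.environments", "app.backups"),
    "backups": ("app.environments", "app.wordpress"),
}

def cross_module_imports(module: str, source: str) -> list[str]:
    """Return dotted import targets in `source` that violate `module`'s rules."""
    banned = FORBIDDEN.get(module, ())
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.module:
            targets = [node.module]
        elif isinstance(node, ast.Import):
            targets = [alias.name for alias in node.names]
        else:
            continue
        violations += [
            t for t in targets
            if any(t == b or t.startswith(b + ".") for b in banned)
        ]
    return violations
```

Running this over each module's source files as a CI step turns the "FORBIDDEN" comment above into a failing build instead of a code-review convention.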

6.15.8 Testing Strategy for External APIs

1. HTTP Client Testing:

# backend/app/tests/core/test_http_client.py
import pytest
from app.core.http_client import ExternalAPIClient

@pytest.mark.asyncio
async def test_circuit_breaker_opens_after_failures():
    """Test circuit breaker opens after threshold failures"""
    client = ExternalAPIClient(
        base_url="http://failing-service",
        timeout=1.0,
        max_retries=0
    )

    # Trigger 5 failures
    for i in range(5):
        with pytest.raises(Exception):
            await client.get("/fail")

    # Circuit should be open now
    assert client.circuit_breaker.state == "OPEN"

    # Next call should fail immediately without API call
    with pytest.raises(Exception, match="Circuit breaker OPEN"):
        await client.get("/fail")

2. Adapter Testing (Mocked):

# backend/app/tests/core/adapters/test_virtuozzo_adapter.py
import pytest
from unittest.mock import AsyncMock, patch
from app.core.adapters.virtuozzo_adapter import VirtuozzoAdapter

@pytest.mark.asyncio
async def test_start_environment_success():
    """Test successful environment start"""
    adapter = VirtuozzoAdapter()

    # Mock HTTP client response
    with patch.object(adapter.client, 'post', new=AsyncMock(return_value={
        "result": 0,
        "message": "Environment started successfully"
    })):
        result = await adapter.start_environment(
            env_name="test-env",
            session_key="test-session",
            correlation_id="test-123"
        )

    assert result["result"] == 0
    assert "started successfully" in result["message"]

3. Integration Testing (Real API - Dev/Staging Only):

# backend/app/tests/integration/test_virtuozzo_integration.py
import pytest
from app.core.adapters.virtuozzo_adapter import get_virtuozzo_adapter

@pytest.mark.integration
@pytest.mark.asyncio
async def test_fetch_environments_real_api():
    """Integration test with real Virtuozzo API (staging)"""
    adapter = get_virtuozzo_adapter()

    session_key = "test-session-key"  # From staging environment

    result = await adapter.fetch_environments_and_nodes(
        session_key=session_key,
        correlation_id="integration-test"
    )

    assert "infos" in result
    assert isinstance(result["infos"], list)

6.15.9 Migration Strategy by Service Type

Services Requiring Migration:

Virtuozzo Services (19 files):

| Laravel Service | FastAPI Module | Adapter | Priority |
|---|---|---|---|
| VirtuozzoService.php | environments/ | virtuozzo_adapter.py | P0 |
| MbAdminService.php | wordpress/ | virtuozzo_adapter.py | P0 |
| EnvironmentStartService.php | environments/ | virtuozzo_adapter.py | P0 |
| EnvironmentStopService.php | environments/ | virtuozzo_adapter.py | P0 |
| EnvironmentSleepService.php | environments/ | virtuozzo_adapter.py | P0 |
| EnvironmentRestartService.php | environments/ | virtuozzo_adapter.py | P0 |
| EnvironmentDeletionService.php | environments/ | virtuozzo_adapter.py | P0 |
| VirtuozzoBackupService.php | backups/ | virtuozzo_adapter.py | P1 |
| SftpService.php | sftp/ | virtuozzo_adapter.py | P1 |
| AddonManagementService.php | environments/ | virtuozzo_adapter.py | P1 |

CDN/Cache Services (4 files):

| Laravel Service | FastAPI Module | Adapter | Priority |
|---|---|---|---|
| BunnyCdnService.php | cache/ | bunny_cdn_adapter.py | P1 |
| CloudflareDetectionService.php | domains/ | cloudflare_adapter.py | P2 |
| RelayService.php | cache/ | relay_adapter.py | P2 |
| CacheManagementService.php | cache/ | N/A (internal) | P2 |

External Integration Services (8 files):

| Laravel Service | FastAPI Module | Adapter | Priority |
|---|---|---|---|
| ExternalApiService.php | core/ | http_client.py (base) | P0 |
| ExternalAuthService.php | auth/ | N/A (internal) | P1 |
| PostmarkService.php | notifications/ | postmark_adapter.py | P2 |
| UptimeCheckerService.php | uptime/ | uptime_adapter.py | P2 |
| UptimeMonitorService.php | uptime/ | uptime_adapter.py | P2 |
| DomainAvailabilityService.php | domains/ | domain_check_adapter.py | P2 |
| CnameChecker.php | domains/ | N/A (internal) | P2 |
| SslChecker.php | domains/ | N/A (internal) | P2 |

Other Services (9 files):

| Laravel Service | FastAPI Module | Adapter | Priority |
|---|---|---|---|
| WebSocketBroadcastService.php | websocket/ | N/A (internal) | P0 |
| WebSocketLogger.php | core/ | N/A (internal) | P1 |
| WebSocketTelemetry.php | websocket/ | N/A (internal) | P1 |
| WebSocketTokenService.php | websocket/ | N/A (internal) | P1 |
| UserAgentService.php | core/ | N/A (internal) | P2 |
| TeamService.php | teams/ | N/A (internal) | P0 |
| AccountUpgradeService.php | billing/ | N/A (internal) | P1 |
| DeadLetterQueueService.php | jobs/ | N/A (internal) | P1 |
| DlqMonitoringService.php | jobs/ | N/A (internal) | P1 |

Total: 40 services to migrate

6.15.10 Performance Targets

| Metric | Laravel Baseline | FastAPI Target | Improvement |
|---|---|---|---|
| Average API Call Latency | 100ms | 50ms | 50% |
| P95 Latency | 300ms | 150ms | 50% |
| Connection Overhead | 70ms per request | 70ms first request, 5ms subsequent | 93% |
| Cache Hit Rate | 0% (no caching) | 80%+ | N/A |
| Failed Request Recovery | Manual retry | Automatic (3 retries) | N/A |
| Cascading Failure Protection | None | Circuit breaker | N/A |
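The connection-overhead row follows directly from the pool reuse rate; a quick arithmetic check using the numbers above (helper name is illustrative):

```python
def blended_overhead(reuse_rate: float, first_ms: float = 70.0, reused_ms: float = 5.0) -> float:
    """Expected per-request connection overhead for a given pool reuse rate."""
    return reuse_rate * reused_ms + (1.0 - reuse_rate) * first_ms

# At the 90% reuse target, per-request overhead falls from 70ms to ~11.5ms
```

As the reuse rate approaches 100%, the steady-state figure approaches the 5ms "subsequent request" cost in the table.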

6.15.11 Rollback Strategy

Feature Flags:

# backend/app/core/config.py
class Settings(BaseSettings):
    USE_EXTERNAL_API_CLIENT: bool = True  # Feature flag
    CIRCUIT_BREAKER_ENABLED: bool = True
    RESPONSE_CACHING_ENABLED: bool = True

# In service
if settings.USE_EXTERNAL_API_CLIENT:
    adapter = get_virtuozzo_adapter()
else:
    # Fallback to legacy implementation
    adapter = LegacyVirtuozzoAdapter()

Gradual Rollout:
1. Week 1: Deploy with feature flag OFF
2. Week 2: Enable for 10% of traffic
3. Week 3: Enable for 50% of traffic
4. Week 4: Enable for 100% of traffic
5. Week 5: Remove feature flag
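A percentage rollout needs a stable per-user bucket so the same user doesn't flip between the new and legacy implementations across requests; a common sketch (function and salt are assumptions):

```python
import hashlib

def in_rollout(user_id: str, percent: int) -> bool:
    """Deterministically place user_id in a bucket 0-99; enabled if below percent."""
    digest = hashlib.sha256(f"external_api_client:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent
```

Because the bucket is deterministic, raising the percentage from 10 to 50 only adds users; nobody already on the new client is silently moved back.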

6.15.12 Monitoring and Metrics

External API Metrics:

from prometheus_client import Counter, Gauge, Histogram

# External API call metrics
external_api_requests_total = Counter(
    'external_api_requests_total',
    'Total external API requests',
    ['adapter', 'method', 'endpoint', 'status']
)

external_api_duration_seconds = Histogram(
    'external_api_duration_seconds',
    'External API call duration',
    ['adapter', 'method', 'endpoint']
)

circuit_breaker_state = Gauge(
    'circuit_breaker_state',
    'Circuit breaker state (0=CLOSED, 1=OPEN, 2=HALF_OPEN)',
    ['adapter']
)

# Hit counter; the hit *rate* is derived as hits / total requests (e.g. in Grafana)
external_api_cache_hits = Counter(
    'external_api_cache_hits',
    'Cache hits for external API calls',
    ['adapter', 'endpoint']
)

Alerting Rules:
- Circuit breaker opens (severity: WARNING)
- High error rate (>5% over 5 minutes, severity: CRITICAL)
- High latency (p95 > 500ms, severity: WARNING)
- Cache hit rate drops below 50% (severity: INFO)


9. Development Priorities & Sequencing

This section provides a strategic overview of implementation priorities. Detailed task lists are maintained in separate files (see docs/development/tasks/ directory).

9.1 Priority Framework

P0 (Critical - Must Have for MVP):
- Authentication & Authorization System (JWT, RBAC, API Keys)
- Database Migration (MySQL → PostgreSQL with Citus sharding preparation)
- Core Jelastic API Integration (environments, nodes, basic operations)
- Real-time WebSocket System (environment status updates)
- Basic Frontend Dashboard (environment list, status monitoring)

P1 (High - Required for Production):
- Advanced Jelastic Operations (scaling, backups, logs)
- Multi-Region Disaster Recovery
- Compliance Framework (GDPR data subject rights APIs)
- Auto-Scaling Policies & Implementation
- Performance Optimization & Caching

P2 (Medium - Post-Launch Enhancements):
- White-label Client Portal (for agencies)
- Advanced Analytics & Reporting
- Git Integration for Deployments
- Plugin Conflict Detection
- Predictive Scaling with ML

P3 (Low - Future Roadmap):
- Mobile App (iOS/Android)
- AI-Powered Performance Recommendations
- Multi-Cloud Support (beyond AWS)
- Blockchain-based Audit Logs

9.2 Development Sequencing Strategy

Vertical Slice Approach: Each phase delivers end-to-end functionality for a subset of features, rather than building all layers of a single feature.

Phase 1: Foundation (Authentication + Basic CRUD)
├── Backend: JWT auth, user CRUD, basic Jelastic integration
├── Frontend: Login page, dashboard shell, environment list
├── Database: PostgreSQL setup, initial migrations
└── Testing: Auth e2e tests, API integration tests

Phase 2: Core Business Logic (Environment Management)
├── Backend: Full Jelastic CRUD operations
├── Frontend: Environment details, node management
├── Database: Environments, nodes, job_logs tables
└── Testing: Environment lifecycle tests

Phase 3: Advanced Features (Real-time, Scaling, Compliance)
├── Backend: WebSocket server, auto-scaling, GDPR APIs
├── Frontend: Real-time updates, compliance dashboard
├── Database: WebSocket state, audit logs
└── Testing: Load tests, chaos engineering

Phase 4: Production Hardening (DR, Monitoring, Compliance)
├── Infrastructure: Multi-region setup, Citus sharding
├── Monitoring: Full observability stack (Prometheus, Grafana)
├── Compliance: SOC 2 audit preparation
└── Testing: DR drills, penetration testing

9.3 Component Development Order

Week 1-3: Foundation
1. ✅ Authentication System → docs/development/tasks/auth_system.md
2. ✅ Database Setup → docs/development/tasks/database_migration.md
3. ✅ API Framework → docs/development/tasks/api_foundation.md
4. ✅ Frontend Shell → docs/development/tasks/frontend_foundation.md

Week 4-8: Core Features
5. ✅ Jelastic Integration → docs/development/tasks/jelastic_integration.md
6. ✅ Environment CRUD → docs/development/tasks/environment_management.md
7. ✅ Node Management → docs/development/tasks/node_management.md
8. ✅ Job Queue System → docs/development/tasks/job_queue.md

Week 9-12: Advanced Features
9. ✅ WebSocket System → docs/development/tasks/websocket_system.md
10. ✅ Auto-Scaling → docs/development/tasks/auto_scaling.md
11. ✅ GDPR Compliance → docs/development/tasks/gdpr_compliance.md
12. ✅ Performance Optimization → docs/development/tasks/performance_optimization.md

Week 13-16: Frontend & UX
13. ✅ Dashboard UI → docs/development/tasks/dashboard_ui.md
14. ✅ Monitoring Views → docs/development/tasks/monitoring_views.md
15. ✅ User Management → docs/development/tasks/user_management_ui.md

Week 17-20: Production Deployment
16. ✅ Multi-Region DR → docs/development/tasks/disaster_recovery.md
17. ✅ Database Sharding → docs/development/tasks/citus_sharding.md
18. ✅ SOC 2 Preparation → docs/development/tasks/soc2_compliance.md
19. ✅ Load Testing → docs/development/tasks/load_testing.md
20. ✅ Production Cutover → docs/development/tasks/production_cutover.md

9.4 Cross-Cutting Concerns (Continuous)

These activities run parallel to all phases:

| Activity | Frequency | Owner | Deliverable |
|---|---|---|---|
| Code Reviews | Every PR | Tech Lead | Approved PRs, architecture feedback |
| Security Scanning | Daily (CI/CD) | DevOps | Vulnerability reports, fixes |
| Performance Testing | Weekly | Backend Lead | Latency reports, optimization backlog |
| Documentation | Per feature | Feature owner | Updated docs in /docs |
| Compliance Checks | Bi-weekly | Security Team | GDPR/SOC 2 checklist updates |
| DR Drills | Monthly | SRE Team | DR test reports, runbook updates |

9.5 External Task List Structure

All detailed implementation tasks are organized in separate files:

docs/development/
├── MAINPRD.md                          # This file (strategic blueprint)
├── tasks/
│   ├── README.md                       # Task list index
│   ├── phase1_foundation/
│   │   ├── auth_system.md              # Detailed auth tasks
│   │   ├── database_migration.md       # DB migration steps
│   │   ├── api_foundation.md           # FastAPI setup tasks
│   │   └── frontend_foundation.md      # Next.js setup tasks
│   ├── phase2_core_features/
│   │   ├── jelastic_integration.md     # Jelastic API integration
│   │   ├── environment_management.md   # Environment CRUD
│   │   └── job_queue.md                # ARQ job queue setup
│   ├── phase3_advanced_features/
│   │   ├── websocket_system.md         # WebSocket implementation
│   │   ├── auto_scaling.md             # Auto-scaling policies
│   │   └── gdpr_compliance.md          # GDPR API implementation
│   └── phase4_production/
│       ├── disaster_recovery.md        # Multi-region DR setup
│       ├── citus_sharding.md           # Database sharding migration
│       └── production_cutover.md       # Production deployment
└── runbooks/
    ├── deployment.md                   # Deployment procedures
    ├── incident_response.md            # Incident handling
    └── dr_failover.md                  # DR failover runbook

9.6 Task List Format (Standard Template)

Each task list file follows this structure:

# [Component Name] - Detailed Task List

## Overview
Brief description of component and its role in the system.

## Prerequisites
- Dependencies that must be completed first
- Required tools/libraries
- Access requirements

## Tasks

### Task 1: [Task Name]
**Priority**: P0/P1/P2/P3
**Estimated Time**: X hours/days
**Assignee**: [Role or name]
**Status**: ❌ Not Started | 🟡 In Progress | ✅ Completed

**Description:**
[Detailed task description]

**Acceptance Criteria:**
- [ ] Criterion 1
- [ ] Criterion 2
- [ ] Criterion 3

**Implementation Steps:**
1. Step 1
2. Step 2
3. Step 3

**Testing:**
- Unit tests: [description]
- Integration tests: [description]
- E2E tests: [description]

**Self-Verification:**
```bash
# Commands to verify task completion
pytest tests/test_component.py
curl -X GET http://localhost:8000/health
```

**Dependencies:**
- Depends on: [Other tasks]
- Blocks: [Future tasks]

### Task 2: [Next Task]

...

### 9.7 Risk Mitigation During Development

| Risk | Probability | Impact | Mitigation Strategy |
|------|-------------|--------|---------------------|
| **Jelastic API Breaking Changes** | Medium | High | Version pin, contract tests, fallback to cached data |
| **Database Migration Data Loss** | Low | Critical | Parallel run for 2 weeks, automated rollback |
| **Performance Regression** | Medium | High | Automated performance tests in CI, canary deployments |
| **Security Vulnerability** | Medium | Critical | Daily security scans, penetration testing before launch |
| **Scope Creep** | High | Medium | Strict prioritization framework (P0-P3), weekly review |
| **Key Developer Attrition** | Low | High | Documentation, pair programming, knowledge sharing |

---

## 10. Local Development Setup

This section provides everything developers need to run MBPanel locally.

### 10.1 Prerequisites

**Required Software:**
- **Python**: 3.11+ (recommended: 3.11.9)
- **Node.js**: 18+ (recommended: 18.20.0 LTS)
- **PostgreSQL**: 15+ (or Docker container)
- **Redis**: 7+ (or Docker container)
- **Docker**: 24+ (for containerized services)
- **Git**: 2.40+
- **Make**: 4.0+ (for Makefile commands)

**Recommended Tools:**
- **VS Code** with Python, Pylance, ESLint extensions
- **pgAdmin** or **TablePlus** for database management
- **RedisInsight** for Redis debugging
- **Postman** or **HTTPie** for API testing
- **Context7 MCP** for AI-assisted development (optional)

### 10.2 Quick Start (5 Minutes)

```bash
# 1. Clone repository
git clone https://github.com/mightybox-io/mbpanel.git
cd mbpanel

# 2. Start infrastructure with Docker Compose
make infra-up
# This starts: PostgreSQL, Redis, PgBouncer

# 3. Setup backend
cd backend
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt

# 4. Run database migrations
alembic upgrade head

# 5. Create superuser
python scripts/create_superuser.py --email admin@example.com --password admin123

# 6. Start backend server
uvicorn main:app --reload --host 0.0.0.0 --port 8000

# 7. Setup frontend (new terminal)
cd frontend
npm install
npm run dev

# 8. Access application
# - Frontend: http://localhost:3000
# - Backend API: http://localhost:8000
# - API Docs: http://localhost:8000/docs
# - PgAdmin: http://localhost:5050 (admin@admin.com / admin)
```

10.3 Environment Variables

Create .env files for backend and frontend:

Backend .env (backend/.env):

# Database
DATABASE_URL=postgresql://mbpanel:mbpanel_dev@localhost:5432/mbpanel_dev
DATABASE_POOL_SIZE=10
DATABASE_MAX_OVERFLOW=20

# Redis
REDIS_URL=redis://localhost:6379/0
REDIS_CACHE_DB=1

# Security
SECRET_KEY=your-secret-key-change-in-production
JWT_SECRET_KEY=your-jwt-secret-change-in-production
JWT_ACCESS_TOKEN_EXPIRE_MINUTES=15
JWT_REFRESH_TOKEN_EXPIRE_DAYS=7

# Jelastic API (use test environment)
JELASTIC_API_URL=https://test-api.jelastic.com
JELASTIC_API_TOKEN=your-test-token

# CORS (allow frontend)
ALLOWED_ORIGINS=http://localhost:3000,http://localhost:8000

# Logging
LOG_LEVEL=DEBUG
LOG_FORMAT=json

# Environment
ENVIRONMENT=development
DEBUG=true

Frontend .env.local (frontend/.env.local):

# API Endpoint
NEXT_PUBLIC_API_URL=http://localhost:8000
NEXT_PUBLIC_WS_URL=ws://localhost:8000/ws

# Authentication
NEXT_PUBLIC_AUTH_COOKIE_NAME=mbpanel_token

# Analytics (disable in dev)
NEXT_PUBLIC_ANALYTICS_ENABLED=false

# Feature Flags
NEXT_PUBLIC_ENABLE_GDPR_TOOLS=true
NEXT_PUBLIC_ENABLE_BETA_FEATURES=false

10.4 Development Workflow

Daily Development Cycle:

# 1. Start your day - pull latest changes
git checkout main
git pull origin main
make dev-start  # Starts infra + backend + frontend

# 2. Create feature branch
git checkout -b feature/US-042-jelastic-scaling

# 3. Make changes, run tests frequently
make test-watch  # Auto-runs tests on file changes

# 4. Lint and format before committing
make lint  # Runs ruff, mypy, eslint
make format  # Formats code

# 5. Commit with conventional commits
git add .
git commit -m "feat(jelastic): add auto-scaling API endpoint

- Implement POST /api/v1/environments/{id}/scale
- Add cloudlet calculation logic
- Include integration tests

Closes #42"

# 6. Push and create PR
git push origin feature/US-042-jelastic-scaling
gh pr create --title "feat: Jelastic auto-scaling API" --body "..."

# 7. End of day - stop services
make dev-stop

10.5 Makefile Commands

The Makefile provides convenient shortcuts:

Infrastructure:

make infra-up          # Start PostgreSQL, Redis, PgBouncer
make infra-down        # Stop all infrastructure
make infra-logs        # Tail infrastructure logs
make db-reset          # Drop and recreate database (WARNING: data loss)

Backend:

make backend-dev       # Start backend in dev mode (hot reload)
make backend-test      # Run all backend tests
make backend-lint      # Lint backend code (ruff, mypy)
make backend-format    # Format backend code
make migration-create  # Create new Alembic migration
make migration-up      # Apply migrations
make migration-down    # Rollback last migration

Frontend:

make frontend-dev      # Start frontend in dev mode
make frontend-test     # Run frontend tests (Jest, Playwright)
make frontend-lint     # Lint frontend code (ESLint, Prettier)
make frontend-build    # Production build

Combined:

make dev-start         # Start all services (infra + backend + frontend)
make dev-stop          # Stop all services
make test-all          # Run all tests (backend + frontend)
make lint-all          # Lint all code
make format-all        # Format all code
make clean             # Clean build artifacts, caches

10.6 Database Management

Creating Migrations:

# 1. Modify SQLAlchemy models in backend/app/models/
# 2. Generate migration
cd backend
alembic revision --autogenerate -m "Add team_id to environments"

# 3. Review migration file in alembic/versions/
# 4. Apply migration
alembic upgrade head

# 5. Verify migration
psql -U mbpanel -d mbpanel_dev -c "\d environments"

Seeding Test Data:

# Seed database with test data
python scripts/seed_database.py --teams 10 --envs-per-team 5

# This creates:
# - 10 test teams
# - 50 test environments (5 per team)
# - Test users for each team
# - Sample job logs

Database Backup/Restore (Local):

# Backup local database
make db-backup  # Creates backup in backups/mbpanel_dev_YYYYMMDD.sql

# Restore from backup
make db-restore BACKUP=backups/mbpanel_dev_20250117.sql

10.7 Testing Strategy

Backend Testing:

# Run all tests
pytest

# Run specific test file
pytest tests/test_auth.py

# Run with coverage
pytest --cov=app --cov-report=html
open htmlcov/index.html

# Run integration tests only
pytest -m integration

# Run unit tests only
pytest -m "not integration"

# Watch mode (auto-run on changes)
ptw  # pytest-watch

Frontend Testing:

# Unit tests (Jest)
cd frontend
npm run test

# Watch mode
npm run test:watch

# E2E tests (Playwright)
npm run test:e2e

# E2E in headed mode (see browser)
npm run test:e2e:headed

Load Testing (Locust):

# Start backend first
make backend-dev

# Run load test
cd tests/performance
locust -f locustfile.py --host=http://localhost:8000

# Open browser: http://localhost:8089
# Set users: 100, spawn rate: 10

10.8 Debugging

Backend Debugging (VS Code):

Create .vscode/launch.json:

{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Python: FastAPI",
      "type": "python",
      "request": "launch",
      "module": "uvicorn",
      "args": [
        "main:app",
        "--reload",
        "--host", "0.0.0.0",
        "--port", "8000"
      ],
      "jinja": true,
      "justMyCode": false,
      "cwd": "${workspaceFolder}/backend",
      "env": {
        "PYTHONPATH": "${workspaceFolder}/backend"
      }
    }
  ]
}

Set breakpoints in VS Code, press F5 to start debugging.

Frontend Debugging (Chrome DevTools):

# Start Next.js in debug mode
npm run dev

# Open Chrome DevTools
# Sources tab → Filesystem → Add folder → Select frontend/
# Set breakpoints in .tsx files

Redis Debugging:

# Connect to Redis CLI
redis-cli -h localhost -p 6379

# Monitor all commands
MONITOR

# Inspect cache keys
KEYS mbpanel:*

# Get specific cache value
GET mbpanel:environment:12345

Database Debugging:

# Connect to PostgreSQL
psql -U mbpanel -d mbpanel_dev

# Enable query logging
\set ECHO all

# Explain query
EXPLAIN ANALYZE SELECT * FROM environments WHERE team_id = 1;

# Check slow queries
SELECT * FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;

10.9 Common Issues & Solutions

Issue: Port already in use

# Error: uvicorn.error: Address already in use
# Solution: Kill process on port 8000
lsof -ti:8000 | xargs kill -9

# Or use different port
uvicorn main:app --port 8001

Issue: Database connection refused

# Error: psycopg2.OperationalError: could not connect to server
# Solution: Check PostgreSQL is running
docker ps | grep postgres

# Restart PostgreSQL
make infra-down && make infra-up

Issue: Redis connection refused

# Error: redis.exceptions.ConnectionError
# Solution: Check Redis is running
docker ps | grep redis

# Test Redis connection
redis-cli ping  # Should return PONG

Issue: Alembic migration conflicts

# Error: Multiple heads detected
# Solution: Merge migration heads
alembic merge heads -m "merge migration heads"
alembic upgrade head

Issue: Frontend module not found

# Error: Module not found: Can't resolve '@/components/...'
# Solution: Clear Next.js cache
cd frontend
rm -rf .next node_modules
npm install
npm run dev

10.10 Development Tools & Extensions

VS Code Extensions (Recommended):
- Python: ms-python.python
- Pylance: ms-python.vscode-pylance
- ESLint: dbaeumer.vscode-eslint
- Prettier: esbenp.prettier-vscode
- GitLens: eamodio.gitlens
- Thunder Client: rangav.vscode-thunder-client (API testing)
- PostgreSQL: ckolkman.vscode-postgres
- Docker: ms-azuretools.vscode-docker
- Tailwind CSS IntelliSense: bradlc.vscode-tailwindcss

Browser Extensions (Recommended):
- React Developer Tools: Chrome/Firefox extension
- Redux DevTools: For state debugging
- JSON Viewer: Pretty-print API responses

10.11 Code Quality Standards

Pre-Commit Hooks:

# Install pre-commit hooks
pip install pre-commit
pre-commit install

# Runs automatically on git commit:
# - Ruff linting
# - MyPy type checking
# - Prettier formatting
# - Trailing whitespace removal
# - JSON/YAML validation

Code Coverage Requirements:
- Backend: Minimum 80% coverage
- Frontend: Minimum 70% coverage
- Critical paths (auth, payments): 95% coverage

Performance Budgets:
- API Response Time: <200ms p95
- Frontend Bundle Size: <500KB gzipped
- Lighthouse Score: >90 for Performance, Accessibility, Best Practices


11. Implementation Roadmap

Phase 0: Discovery & Compatibility (Weeks 0-1)

Objective: Lock API contracts, inventory legacy behavior, stand up compatibility shims

  • Generate OpenAPI from FastAPI skeleton and align with legacy endpoints
  • Build compatibility layer for high-traffic legacy routes under /api/v1/**
  • Stand up ARQ worker scaffolding and DLQ
  • Mirror WebSocket publish path to both Node and FastAPI (no cutover)
  • Author data migration runbooks and validate on a sanitized subset
  • Establish CI baseline: lint, type checks, unit tests, contract tests

Phase 1: Foundation (Weeks 1-3)

Objective: Establish core infrastructure, authentication systems, and External API integration foundation

Week 1:
- Set up FastAPI project structure with Hybrid Modular DDD architecture
- Implement basic authentication system with JWT
- Set up PostgreSQL database with initial schema
- Configure Docker containers and CI/CD pipeline
- 🔥 NEW: HTTP Client Infrastructure
  - Implement app/core/http_client.py with ExternalAPIClient class
  - Implement CircuitBreaker class for resilience
  - Configure connection pooling with httpx (20 keepalive, 100 max connections)
  - Implement retry logic with exponential backoff (1s, 2s, 4s)
  - Add rate limiting mechanism
  - Write unit tests for circuit breaker and retry logic

Week 2:
- Implement user registration and login flows in users module
- Set up Redis for caching and session management
- Create basic API endpoints for user management
- Implement database migration system (Alembic)
- 🔥 NEW: Virtuozzo Adapter Foundation
  - Create app/core/adapters/ directory structure
  - Implement virtuozzo_adapter.py with basic structure
  - Implement session key management
  - Add response normalization logic
  - Create adapter unit tests (mocked)
  - Configure Virtuozzo API connection (timeout: 90s, rate limit: 10 req/s)

Week 3:
- Complete authentication system with refresh tokens in auth module
- Implement role-based access control
- Set up comprehensive logging infrastructure
- Create API documentation and testing framework
- 🔥 NEW: Virtuozzo Adapter Core Methods
  - Implement fetch_environments_and_nodes() with caching (5 min TTL)
  - Implement execute_mbadmin_action() with idempotency keys
  - Implement environment lifecycle methods: start_environment(), stop_environment(), sleep_environment()
  - Add correlation ID support for distributed tracing
  - Create integration tests for Virtuozzo adapter (staging environment)
  - Document adapter usage patterns for domain modules

Phase 2: Core Business Logic (Weeks 4-8)

Objective: Implement essential business functionality and integrate Virtuozzo services

Week 4:
- Implement user and team management services in users module
- Create basic site management endpoints
- Set up job queue system with ARQ in jobs module
- Begin migration of Laravel services to FastAPI modular services
- 🔥 NEW: Environments Module with Virtuozzo Integration
  - Implement environments/service.py using virtuozzo_adapter
  - Implement start_environment(), stop_environment(), sleep_environment() service methods
  - Add correlation ID generation and audit logging
  - Implement error handling and status management
  - Create environment lifecycle background jobs (ARQ)
  - Write integration tests for environment operations

Week 5:
- Complete site creation and configuration endpoints
- Implement environment management functionality
- Set up WebSocket system for real-time updates in websocket module
- Begin data migration from MySQL to PostgreSQL
- 🔥 NEW: WordPress Module with Virtuozzo Integration
  - Implement wordpress/service.py using virtuozzo_adapter
  - Implement execute_wp_cli() with command validation
  - Add long-running command queueing (ARQ background jobs)
  - Implement rate limiting (10 commands/minute/user)
  - Add WP-CLI allowed commands whitelist
  - Create WordPress service tests (mocked adapter)

Week 6:
- Complete environment lifecycle management
- Implement backup and restore functionality
- Finish core job queue implementations in jobs module
- Set up basic observability stack (Prometheus, Grafana)
- 🔥 NEW: Backups Module with Virtuozzo Integration
  - Implement backups/service.py using virtuozzo_adapter
  - Implement create_backup() with background job queueing
  - Implement restore_backup() with validation and status tracking
  - Add backup session management
  - Create backup/restore integration tests
  - Set up Prometheus metrics for backup operations

Week 7:
- Implement domain management and SSL certificate functionality
- Complete advanced site management features
- Set up comprehensive monitoring and alerting
- Begin performance testing and optimization
- 🔥 NEW: CDN/Cache Services Integration
  - Implement bunny_cdn_adapter.py for DNS and CDN operations
  - Implement cloudflare_adapter.py for DNS and SSL
  - Implement cache/service.py using bunny_cdn_adapter
  - Implement domains/service.py using multiple adapters
  - Add idempotency keys for DNS operations
  - Create CDN/DNS integration tests

Week 8:
- Complete all Phase 2 business logic
- Perform security audit and hardening
- Conduct performance validation testing
- Prepare for Phase 3 development
- 🔥 NEW: External API Performance Validation
  - Validate connection pooling effectiveness (target: 90%+ reuse rate)
  - Validate circuit breaker functionality (test with simulated failures)
  - Validate cache hit rates (target: 80%+ for Virtuozzo API)
  - Measure external API latency (target: p95 < 150ms)
  - Load test external API integrations (concurrent requests)
  - Document performance baselines and optimization opportunities

Phase 3: Advanced Features (Weeks 9-12)

Objective: Implement advanced functionality and integrations

Week 9:
- Implement billing and subscription management
- Set up payment processing integration
- Begin advanced job queue implementations
- Implement advanced caching strategies

Week 10:
- Complete billing and payment functionality
- Implement advanced team and permissions features
- Set up webhook system for external integrations
- Begin advanced performance optimization

Week 11:
- Complete all advanced feature implementations
- Set up comprehensive error tracking system
- Implement advanced security features (MFA, audit logs)
- Conduct load testing and optimization

Week 12:
- Complete all Phase 3 functionality
- Perform comprehensive security and performance validation
- Prepare staging environment for Phase 4
- Document all advanced features

Phase 4: Frontend Migration (Weeks 13-16)

Objective: Migrate frontend from React/Inertia to Next.js

Week 13:
- Set up Next.js project structure
- Implement authentication integration
- Create basic dashboard components
- Set up API client generation from OpenAPI specs

Week 14:
- Implement core UI components for site management
- Create environment management interfaces
- Set up real-time WebSocket integration
- Implement responsive design patterns

Week 15:
- Complete all major UI feature implementations
- Implement advanced dashboard functionality
- Set up comprehensive state management
- Begin user acceptance testing

Week 16:
- Complete frontend implementation
- Conduct comprehensive UI testing
- Perform cross-browser compatibility testing
- Prepare for production deployment

Phase 5: Production Deployment (Weeks 17-20)

Objective: Deploy to production and complete migration

Week 17:
- Set up production infrastructure
- Implement backup and disaster recovery procedures
- Conduct final security audit
- Prepare production deployment scripts

Week 18:
- Deploy to staging environment
- Conduct final end-to-end testing
- Perform performance validation in staging
- Prepare final data migration plan

Week 19:
- Execute production deployment
- Conduct data migration from legacy system
- Monitor system performance and stability
- Implement post-deployment monitoring

Week 20:
- Complete migration cutover from Laravel
- Monitor system performance and user feedback
- Address any post-deployment issues
- Document final system architecture


10. Success Criteria

10.1 Performance Metrics

  • API Response Times: Achieve <200ms for 95th percentile of requests
  • Database Queries: Achieve <50ms for common operations
  • Page Load Times: Achieve <2.5 seconds for dashboard pages
  • Concurrent Users: Successfully handle 5x current user load
  • 🔥 External API Performance:
      • Average API call latency: <50ms (50% improvement from Laravel baseline of 100ms)
      • P95 external API latency: <150ms (50% improvement from Laravel baseline of 300ms)
      • Cache hit rate for external APIs: >80% (compared to 0% in Laravel)
      • Connection reuse rate: >90% (via connection pooling)
      • Failed request recovery: 100% automatic retry with exponential backoff
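The "automatic retry with exponential backoff" target above can be sketched with a small helper. This is a minimal sketch with illustrative names and delays, not the actual adapter API:

```python
import asyncio
import random

async def retry_with_backoff(func, max_attempts=3, base_delay=1.0, jitter=0.1):
    """Retry an async callable with exponential backoff (1s, 2s, 4s, ...)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await func()
        except Exception:
            if attempt == max_attempts:
                raise  # all attempts exhausted: surface the error to the caller
            # Double the delay each attempt; jitter avoids synchronized retries.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, jitter)
            await asyncio.sleep(delay)
```

In practice this logic would live inside the shared HTTP client so every adapter inherits it.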

10.2 Quality Gates

  • Code Coverage: Maintain >90% test coverage for all services
  • Security Score: Achieve A+ rating on security scanning tools
  • Performance Score: Achieve 95+ on Lighthouse performance metrics
  • Error Rate: Maintain <0.1% error rate in production
  • 🔥 External API Quality Gates:
      • Circuit breaker integration tests: 100% passing
      • Retry logic tests: 100% passing (all 3 retry attempts tested)
      • Idempotency tests: 100% passing (verify duplicate requests handled correctly)
      • Rate limiting tests: 100% passing (verify 429 errors prevented)
      • Integration test coverage: >80% for all external API adapters

10.3 Architecture Metrics

  • Module Isolation: Validate that no modules import from other modules
  • Cohesion: Validate that each module contains all necessary layers
  • Maintainability: Measure cognitive load reduction when working on features
  • Developer Velocity: Track improvement in feature development time
  • 🔥 Adapter Architecture Metrics:
      • Adapter isolation: 100% compliance (all external API calls go through adapters)
      • Shared adapter reuse: Virtuozzo adapter used by 7+ modules
      • Module-to-adapter coupling: each module uses 1-2 adapters maximum
      • Adapter test coverage: >90% for all adapter implementations

10.4 Business Metrics

  • User Satisfaction: Achieve >4.5/5 satisfaction rating from user testing
  • Migration Completion: Successfully migrate 100% of Laravel functionality
  • Downtime: Maintain <4 hours total planned downtime during migration
  • Cost Reduction: Achieve 30% reduction in operational costs

10.5 External API Reliability Metrics

  • 🔥 NEW: Resilience Targets:
      • Circuit breaker open rate: <1% of total time
      • Retry success rate: >90% (transient failures recovered automatically)
      • Idempotent operation safety: 100% (no duplicate resource creation)
      • Rate limit violation rate: 0% (no 429 errors from external APIs)
      • Cascading failure prevention: 100% (circuit breaker prevents cascading failures)

  • 🔥 NEW: External Service Migration Targets:
      • Virtuozzo Services: 19/19 services migrated with feature parity
      • CDN/Cache Services: 4/4 services migrated (Bunny CDN, Cloudflare, Relay, Cache Management)
      • External Integrations: 8/8 services migrated (Postmark, UptimeRobot, etc.)
      • Other Services: 9/9 services migrated
      • Total: 40/40 services migrated successfully

  • 🔥 NEW: Performance Improvement Validation:
      • Connection overhead reduction: 93% (70ms → 5ms for subsequent requests)
      • Cache effectiveness: 98% latency reduction for cached responses (90ms → 2ms)
      • Parallel request optimization: 62% improvement (240ms → 90ms for 3 parallel calls)
      • HTTP/2 multiplexing: 20-30% latency improvement for multiple requests

10.6 Monitoring and Observability Criteria

  • 🔥 NEW: External API Monitoring:
      • Prometheus metrics captured for all external API calls
      • Grafana dashboards deployed for external API performance
      • Alerts configured for circuit breaker opens (severity: WARNING)
      • Alerts configured for high error rates (>5% over 5 min, severity: CRITICAL)
      • Alerts configured for high latency (p95 > 500ms, severity: WARNING)
      • Alerts configured for low cache hit rate (<50%, severity: INFO)
      • Correlation IDs tracked across all external API calls for distributed tracing

11. Risk Assessment

11.1 Risk Matrix

| Risk | Probability | Impact | Severity | Mitigation Strategy |
|---|---|---|---|---|
| Data Migration Issues | Medium | High | High | Comprehensive testing with validation, rollback plan |
| Performance Degradation | Low | High | Medium | Performance testing at each phase, optimization plan |
| Architecture Compliance | Medium | Medium | Medium | Regular architecture reviews, automated checks |
| Team Knowledge Gaps | Medium | Medium | Medium | Training sessions, documentation, code reviews |
| Third-party Integration Issues | Medium | Medium | Medium | Early integration testing, fallback options |
| Timeline Delays | Medium | Medium | Medium | Buffer time in schedule, parallel development |
| 🔥 External API Failure Cascade | Medium | Critical | High | Circuit breaker implementation, fail-fast patterns |
| 🔥 External API Rate Limiting | High | Medium | High | Rate limiting on our side, request queueing |
| 🔥 Connection Pool Exhaustion | Low | High | Medium | Connection pool monitoring, proper timeout configuration |
| 🔥 Adapter Implementation Bugs | Medium | High | High | Comprehensive adapter testing, staged rollout with feature flags |
| 🔥 External API Breaking Changes | Low | High | Medium | API versioning, contract testing, monitoring for deprecation headers |
| 🔥 Idempotency Key Conflicts | Low | Medium | Low | UUID-based key generation, conflict detection logic |

11.2 Mitigation Strategies

Existing Mitigation Strategies:
  - Data Migration: Multiple validation checkpoints, parallel migration testing
  - Performance: Continuous performance monitoring, optimization at each phase
  - Architecture Compliance: Regular architecture reviews and automated validation of module isolation
  - Security: Security-first development approach, regular audits
  - Knowledge Gaps: Comprehensive documentation, mentoring program
  - Integration Issues: Early integration validation, contract testing
  - Timeline Delays: Agile methodology with sprint reviews and adjustments

🔥 NEW: External API Integration Mitigation Strategies:

1. External API Failure Cascade
   - Risk: If the Virtuozzo API goes down, it could cause cascading failures across multiple modules (environments, wordpress, backups, etc.)
   - Mitigation:
     - Implement circuit breaker with 5-failure threshold and 5-minute recovery timeout
     - Fail-fast pattern: Return 503 Service Unavailable immediately when circuit is open
     - Graceful degradation: Cache last known state and serve stale data with a warning
     - Independent monitoring: Separate health checks for each external service
     - Incident response playbook: Document steps to take when external APIs are down
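The circuit-breaker behavior described in this mitigation (5-failure threshold, fail-fast while open, timed recovery) can be sketched roughly as follows. Class and method names are illustrative, not the production implementation:

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; half-opens again
    after `recovery_timeout` seconds to let a probe request through."""

    def __init__(self, threshold=5, recovery_timeout=300.0):
        self.threshold = threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.recovery_timeout:
            return True  # half-open: allow one probe request
        return False  # fail fast, e.g. return HTTP 503 to the caller

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()
```

A wrapper in the shared HTTP client would check `allow_request()` before each external call and record the outcome afterwards.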

2. External API Rate Limiting
   - Risk: Overwhelming external APIs with requests could result in 429 errors and service throttling
   - Mitigation:
     - Client-side rate limiting: Max 10 requests/second to Virtuozzo API
     - Request queueing: Queue requests when rate limit reached instead of failing
     - Backpressure: Implement queue depth limits to prevent memory exhaustion
     - Monitoring: Alert when approaching rate limits (>80% of limit)
     - Coordinate with vendors: Document rate limits for each external service
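Client-side rate limiting that queues (waits) rather than fails is commonly implemented as a token bucket. A minimal sketch, with illustrative names:

```python
import asyncio
import time

class TokenBucket:
    """Client-side limiter: at most `rate` requests/second, waiting for a
    refill when the bucket is empty instead of returning an error."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated = time.monotonic()

    async def acquire(self):
        while True:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            await asyncio.sleep((1 - self.tokens) / self.rate)
```

For the Virtuozzo limit above this would be instantiated as roughly `TokenBucket(rate=10.0, capacity=10)` and awaited before each adapter call.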

3. Connection Pool Exhaustion
   - Risk: Running out of available connections during traffic spikes
   - Mitigation:
     - Connection pool configuration: 20 keepalive + 100 max connections per service
     - Connection timeout: 60s keepalive expiry to prevent stale connections
     - Monitoring: Track active/idle connections ratio (alert when >90% active)
     - Load testing: Validate connection pool under peak traffic (5x current load)
     - Graceful degradation: Return 503 when pool exhausted instead of hanging

4. Adapter Implementation Bugs
   - Risk: Bugs in adapter code could cause widespread issues across multiple modules
   - Mitigation:
     - Comprehensive testing: Unit tests (mocked), integration tests (staging), contract tests
     - Staged rollout: Feature flags for gradual rollout (10% → 50% → 100%)
     - Canary deployment: Deploy to single instance first, monitor for errors
     - Automated testing: CI pipeline blocks merges with failed adapter tests
     - Code review: Require 2 approvals for adapter changes
     - Rollback procedure: Document quick rollback steps (disable feature flag)

5. External API Breaking Changes
   - Risk: External APIs (Virtuozzo, Bunny CDN, Cloudflare) could introduce breaking changes
   - Mitigation:
     - API versioning: Use explicit API versions in all requests (e.g., /v1.0/)
     - Contract testing: Automated tests validate API responses match expected schema
     - Monitoring for deprecation: Check for Deprecation and Sunset headers in responses
     - Version pinning: Pin to specific API versions, test new versions before migrating
     - Adapter abstraction: Changes isolated to adapters, not domain services
     - Vendor communication: Subscribe to API changelog notifications

6. Idempotency Key Conflicts
   - Risk: Duplicate idempotency keys could prevent legitimate operations
   - Mitigation:
     - UUID-based keys: Use UUIDs instead of sequential IDs
     - Scoped keys: Include user_id, team_id, and operation in key (e.g., create_env:123:456:uuid)
     - Key expiration: Store keys in Redis with 24-hour TTL
     - Conflict detection: Log and alert on idempotency key conflicts
     - Retry with new key: If conflict detected, generate new key and retry
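A minimal sketch of scoped, UUID-based key generation with conflict detection. The in-memory store below stands in for the Redis store with 24-hour TTL; all names are illustrative:

```python
import time
import uuid

def make_idempotency_key(operation: str, user_id: int, team_id: int) -> str:
    """Scoped, UUID-based key, e.g. 'create_env:123:456:<uuid4>'."""
    return f"{operation}:{user_id}:{team_id}:{uuid.uuid4()}"

class IdempotencyStore:
    """In-memory stand-in for the Redis key store (24h TTL in production)."""

    def __init__(self, ttl_seconds: float = 24 * 3600):
        self.ttl = ttl_seconds
        self._seen = {}  # key -> expiry timestamp

    def register(self, key: str) -> bool:
        """Return True if the key is new; False on a conflict (duplicate)."""
        now = time.monotonic()
        expiry = self._seen.get(key)
        if expiry is not None and expiry > now:
            return False  # conflict: log, alert, and retry with a new key
        self._seen[key] = now + self.ttl
        return True
```

Because the UUID suffix makes every generated key unique, conflicts should only occur on genuine duplicate submissions of the same request.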

7. Performance Degradation from External APIs
   - Risk: Slow external API responses could degrade overall system performance
   - Mitigation:
     - Aggressive timeouts: 30s for standard operations, 90s for Virtuozzo (long-running)
     - Timeout monitoring: Alert when p95 timeout rate >5%
     - Async operations: Long-running operations queued as background jobs (ARQ)
     - Caching strategy: Cache responses for 5 minutes (Virtuozzo), 1 minute (metrics)
     - Cache warming: Pre-fetch frequently accessed data during off-peak hours
     - Performance SLOs: External API p95 latency <150ms, alert if exceeded
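The caching strategy above (serve a cached response on a hit, fetch and store on a miss) is a standard cache-aside pattern. In this sketch an in-memory TTLCache stands in for Redis, and all names are illustrative:

```python
import asyncio
import time
from typing import Any, Awaitable, Callable

class TTLCache:
    """In-memory stand-in for the Redis response cache
    (5 min for Virtuozzo, 1 min for metrics, per the strategy above)."""

    def __init__(self):
        self._data = {}  # key -> (expiry timestamp, value)

    def get(self, key: str):
        item = self._data.get(key)
        if item and item[0] > time.monotonic():
            return item[1]
        return None  # missing or expired

    def set(self, key: str, value: Any, ttl: float):
        self._data[key] = (time.monotonic() + ttl, value)

async def cached_call(cache: TTLCache, key: str, ttl: float,
                      fetch: Callable[[], Awaitable[Any]]):
    """Cache-aside: serve from cache on a hit, otherwise fetch and store."""
    hit = cache.get(key)
    if hit is not None:
        return hit
    value = await fetch()
    cache.set(key, value, ttl)
    return value
```

Wrapping adapter reads this way is what drives the >80% external-API cache hit rate targeted in section 10.1.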

8. External API Authentication Issues
   - Risk: Session keys, API keys, or tokens could expire or become invalid
   - Mitigation:
     - Token refresh logic: Automatic token refresh before expiration
     - Encrypted storage: Future integration with a secrets management system (e.g., HashiCorp Vault). Currently secrets live in env files/container env vars, so interim guidance is to rotate .env values manually until Vault work ships.
     - Key rotation: Document and test key rotation procedures
     - Fallback authentication: Support multiple authentication methods where available
     - Monitoring: Alert on authentication errors (401, 403)
     - Manual intervention: Document steps for manual key rotation
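Proactive token refresh (renewing shortly before expiry rather than reacting to a 401) can be sketched like this. The `refresh_fn` callable and the skew window are assumptions for illustration, not the actual auth API:

```python
import time

class SessionKeyManager:
    """Refresh a session key shortly before expiry instead of after a 401.

    `refresh_fn` is assumed to call the vendor's auth endpoint and return
    (key, ttl_seconds); `skew` is how early we refresh before expiry."""

    def __init__(self, refresh_fn, skew: float = 60.0):
        self._refresh_fn = refresh_fn
        self._skew = skew
        self._key = None
        self._expires_at = 0.0

    def get_key(self) -> str:
        # Refresh when missing or within `skew` seconds of expiry.
        if self._key is None or time.monotonic() >= self._expires_at - self._skew:
            self._key, ttl = self._refresh_fn()
            self._expires_at = time.monotonic() + ttl
        return self._key
```

Adapters would call `get_key()` on every request, so rotation happens transparently without failed calls.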

9. Correlation ID Loss
   - Risk: Losing correlation IDs makes it difficult to trace issues across systems
   - Mitigation:
     - Mandatory correlation IDs: Generate UUID for every request
     - Propagation: Pass correlation ID in headers (X-Correlation-ID)
     - Structured logging: Include correlation ID in all log entries
     - Distributed tracing: Integrate with OpenTelemetry for end-to-end tracing
     - Monitoring: Verify correlation ID propagation in integration tests
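Correlation ID propagation can live in a small piece of ASGI middleware. This sketch reuses an incoming X-Correlation-ID header or generates a UUID, stashes it on the request scope, and echoes it on the response; names are illustrative:

```python
import uuid

class CorrelationIdMiddleware:
    """ASGI middleware: reuse an incoming X-Correlation-ID header or
    generate a UUID, and echo it on the response for downstream tracing."""

    HEADER = b"x-correlation-id"

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            return await self.app(scope, receive, send)

        headers = dict(scope.get("headers", []))
        cid = headers.get(self.HEADER, str(uuid.uuid4()).encode())
        # Make the ID available to handlers and structured logging.
        scope.setdefault("state", {})["correlation_id"] = cid.decode()

        async def send_with_header(message):
            if message["type"] == "http.response.start":
                message.setdefault("headers", []).append((self.HEADER, cid))
            await send(message)

        await self.app(scope, receive, send_with_header)
```

In FastAPI this would be added with `app.add_middleware(CorrelationIdMiddleware)`, and the same ID would be forwarded on outbound adapter calls.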

10. Adapter Testing Complexity
    - Risk: Testing adapters requires complex mocking and integration test infrastructure
    - Mitigation:
      - Three-tier testing: Unit (mocked), integration (staging API), contract tests
      - Test fixtures: Reusable mock responses for common scenarios
      - Staging environment: Dedicated staging Virtuozzo/CDN accounts for testing
      - Automated test data: Scripts to generate test data in staging
      - CI/CD integration: Run integration tests nightly against staging APIs
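The unit (mocked) tier of the three-tier approach can be sketched with `unittest.mock.AsyncMock`. The `EnvironmentStarter` class here is a hypothetical stand-in for a module service, not the actual code:

```python
import asyncio
from unittest.mock import AsyncMock

class EnvironmentStarter:
    """Hypothetical service-under-test, mirroring the adapter-injection
    pattern used by the module services."""

    def __init__(self, adapter):
        self.adapter = adapter

    async def start(self, env_name: str, session_key: str) -> str:
        await self.adapter.start_environment(env_name=env_name,
                                             session_key=session_key)
        return "starting"

def test_start_calls_adapter():
    adapter = AsyncMock()  # no network: the adapter is fully mocked
    service = EnvironmentStarter(adapter)
    status = asyncio.run(service.start("demo-env", "sess-1"))
    adapter.start_environment.assert_awaited_once_with(
        env_name="demo-env", session_key="sess-1")
    assert status == "starting"
```

Integration and contract tiers would replace the mock with a staging adapter and schema validation, respectively.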


12. Appendices

12.1 Glossary

  • FastAPI: Modern, fast web framework for building APIs with Python 3.7+ based on standard Python type hints
  • ARQ: Asynchronous, Redis-backed job queue library for Python, used for background task processing
  • Next.js: React-based framework for building production-ready web applications
  • JWT: JSON Web Token, an open standard (RFC 7519) for securely transmitting claims between parties, commonly used for authentication
  • RBAC: Role-Based Access Control, a method to regulate access to computer resources
  • DDD: Domain-Driven Design, an approach to software development that focuses on complex needs by connecting the implementation to an evolving model of the core business concepts
  • Modular Architecture: An architectural approach that organizes code into independent, interchangeable modules that encapsulate functionality

12.2 Open Questions

  • Integration approach for legacy payment systems
  • Data retention policies for historical records
  • Cross-region deployment strategy for future scaling
  • Virtuozzo API rate limits: What are the actual rate limits? Document: ___ (Recommended: 10 req/s based on testing)
  • Session key rotation: How often do session keys expire? Document: ___ (Implement automatic refresh logic)
  • Bunny CDN rate limits: What are the rate limits for DNS operations? Document: ___ (Test and document)
  • Idempotency window: How long should idempotency keys be cached? Recommendation: 24 hours
  • Circuit breaker thresholds: Are 5 failures in 5 minutes appropriate? Test and adjust based on production data

12.3 External Services Migration Map

This section provides the complete mapping of 40 Laravel services to FastAPI modules, including code comparison examples.

12.3.1 Complete Service Inventory

Virtuozzo Services (19 files):

| # | Laravel Service | FastAPI Module | Adapter | Priority | Status |
|---|---|---|---|---|---|
| 1 | VirtuozzoService.php | environments/ | virtuozzo_adapter.py | P0 | Required for environment lifecycle |
| 2 | MbAdminService.php | wordpress/ | virtuozzo_adapter.py | P0 | Required for WP-CLI execution |
| 3 | EnvironmentStartService.php | environments/ | virtuozzo_adapter.py | P0 | Core environment operation |
| 4 | EnvironmentStopService.php | environments/ | virtuozzo_adapter.py | P0 | Core environment operation |
| 5 | EnvironmentSleepService.php | environments/ | virtuozzo_adapter.py | P0 | Core environment operation |
| 6 | EnvironmentRestartService.php | environments/ | virtuozzo_adapter.py | P0 | Core environment operation |
| 7 | EnvironmentDeletionService.php | environments/ | virtuozzo_adapter.py | P0 | Core environment operation |
| 8 | EnvironmentRenameService.php | environments/ | virtuozzo_adapter.py | P0 | Core environment operation |
| 9 | EnvironmentStatusSyncService.php | environments/ | virtuozzo_adapter.py | P0 | Status synchronization |
| 10 | VirtuozzoBackupService.php | backups/ | virtuozzo_adapter.py | P1 | Backup operations |
| 11 | SftpService.php | sftp/ | virtuozzo_adapter.py | P1 | SFTP user management |
| 12 | AddonManagementService.php | environments/ | virtuozzo_adapter.py | P1 | Addon lifecycle |
| 13 | AddonService.php | environments/ | virtuozzo_adapter.py | P1 | Addon operations |
| 14 | VzAccountService.php | environments/ | virtuozzo_adapter.py | P1 | Account management |
| 15 | VZaccountGroup.php | environments/ | virtuozzo_adapter.py | P1 | Account grouping |
| 16 | MbAdminParamsService.php | wordpress/ | virtuozzo_adapter.py | P1 | WP parameter management |
| 17 | SearchAndReplaceService.php | wordpress/ | virtuozzo_adapter.py | P1 | DB search/replace |
| 18 | SyncDirectoriesService.php | environments/ | virtuozzo_adapter.py | P1 | Directory sync |
| 19 | RunDynamicWpCliService.php | wordpress/ | virtuozzo_adapter.py | P1 | Dynamic WP-CLI execution |

CDN/Cache Services (4 files):

| # | Laravel Service | FastAPI Module | Adapter | Priority | Status |
|---|---|---|---|---|---|
| 20 | BunnyCdnService.php | cache/ | bunny_cdn_adapter.py | P1 | DNS and CDN management |
| 21 | CloudflareDetectionService.php | domains/ | cloudflare_adapter.py | P2 | Cloudflare integration |
| 22 | RelayService.php | cache/ | relay_adapter.py | P2 | Relay cache management |
| 23 | CacheManagementService.php | cache/ | N/A (internal) | P2 | Internal cache operations |

External Integration Services (8 files):

| # | Laravel Service | FastAPI Module | Adapter | Priority | Status |
|---|---|---|---|---|---|
| 24 | ExternalApiService.php | core/ | http_client.py (base) | P0 | Base HTTP client infrastructure |
| 25 | ExternalAuthService.php | auth/ | N/A (internal) | P1 | Internal auth logic |
| 26 | PostmarkService.php | notifications/ | postmark_adapter.py | P2 | Email notifications |
| 27 | UptimeCheckerService.php | uptime/ | uptime_adapter.py | P2 | Uptime checking |
| 28 | UptimeMonitorService.php | uptime/ | uptime_adapter.py | P2 | Uptime monitoring |
| 29 | DomainAvailabilityService.php | domains/ | domain_check_adapter.py | P2 | Domain availability |
| 30 | CnameChecker.php | domains/ | N/A (internal) | P2 | CNAME validation |
| 31 | SslChecker.php | domains/ | N/A (internal) | P2 | SSL validation |

Other Services (9 files):

| # | Laravel Service | FastAPI Module | Adapter | Priority | Status |
|---|---|---|---|---|---|
| 32 | WebSocketBroadcastService.php | websocket/ | N/A (internal) | P0 | WebSocket broadcasting |
| 33 | WebSocketLogger.php | core/ | N/A (internal) | P1 | WebSocket logging |
| 34 | WebSocketTelemetry.php | websocket/ | N/A (internal) | P1 | WebSocket metrics |
| 35 | WebSocketTokenService.php | websocket/ | N/A (internal) | P1 | WebSocket auth |
| 36 | UserAgentService.php | core/ | N/A (internal) | P2 | User agent parsing |
| 37 | TeamService.php | teams/ | N/A (internal) | P0 | Team management |
| 38 | AccountUpgradeService.php | billing/ | N/A (internal) | P1 | Account upgrades |
| 39 | DeadLetterQueueService.php | jobs/ | N/A (internal) | P1 | DLQ management |
| 40 | DlqMonitoringService.php | jobs/ | N/A (internal) | P1 | DLQ monitoring |

Total: 40 services to migrate

12.3.2 Code Comparison Examples

Example 1: Environment Start (Laravel → FastAPI)

Laravel (Before):

// app/Services/Virtuozzo/EnvironmentStartService.php
public function startEnvironment(int $environmentId): void
{
    $environment = Environment::findOrFail($environmentId);

    $response = Http::timeout(90)->post($this->getApiUrl(), [
        'envName' => $environment->env_name,
        'session' => $environment->session_key
    ]);

    if (!$response->successful()) {
        throw new \Exception("Failed to start environment");
    }

    $environment->update(['status' => 'starting']);

    Log::info("Environment started", ['env_id' => $environmentId]);
}

FastAPI (After):

# backend/app/environments/service.py
import uuid

import structlog
from sqlalchemy.orm import Session

from app.core.adapters.virtuozzo_adapter import get_virtuozzo_adapter
from app.core.shared.audit import log_audit_event, AuditAction
from app.environments import repository, schema  # module-local layers (paths illustrative)
from app.environments.exceptions import EnvironmentNotFoundError

logger = structlog.get_logger()

class EnvironmentService:
    def __init__(self):
        self.virtuozzo_adapter = get_virtuozzo_adapter()

    async def start_environment(
        self,
        db: Session,
        environment_id: int,
        user_id: int
    ) -> schema.EnvironmentRead:
        """
        Start environment with full error handling and tracing.

        Improvements over Laravel:
        - Connection pooling (47% faster)
        - Automatic retries (3x with exponential backoff)
        - Circuit breaker (prevents cascading failures)
        - Correlation IDs (distributed tracing)
        - Structured logging
        """

        # Generate correlation ID for tracing
        correlation_id = str(uuid.uuid4())

        logger.info(
            "environment_start_requested",
            environment_id=environment_id,
            user_id=user_id,
            correlation_id=correlation_id
        )

        # Get environment from database
        environment = repository.get_environment_by_id(db, environment_id)
        if not environment:
            raise EnvironmentNotFoundError(f"Environment {environment_id} not found")

        # Validate user has access
        self._validate_access(db, user_id, environment.team_id)

        # Check if already running (idempotent)
        if environment.status == "running":
            logger.info("environment_already_running", correlation_id=correlation_id)
            return environment

        try:
            # Call Virtuozzo API via adapter (with connection pooling, retries, circuit breaker)
            response = await self.virtuozzo_adapter.start_environment(
                env_name=environment.env_name,
                session_key=environment.session_key,
                correlation_id=correlation_id
            )

            # Update status
            repository.update_environment(db, environment_id, {"status": "starting"})
            db.commit()

            # Log audit event
            log_audit_event(
                user_id=user_id,
                team_id=environment.team_id,
                action=AuditAction.UPDATE,
                resource_type="environment",
                resource_id=environment_id,
                metadata={"action": "start", "correlation_id": correlation_id}
            )

            logger.info("environment_start_success", correlation_id=correlation_id)
            return environment

        except Exception as e:
            logger.error("environment_start_failed", error=str(e), correlation_id=correlation_id)
            repository.update_environment(db, environment_id, {"status": "error"})
            db.commit()
            raise

Key Improvements:
  1. ✅ Connection pooling: 47% latency reduction (100ms → 53ms)
  2. ✅ Automatic retries: 3 attempts with exponential backoff (1s, 2s, 4s)
  3. ✅ Circuit breaker: Prevents cascading failures when Virtuozzo is down
  4. ✅ Correlation IDs: End-to-end request tracing across services
  5. ✅ Structured logging: Machine-parseable logs with context
  6. ✅ Idempotency: Safe to retry without side effects
  7. ✅ Audit logging: Compliance-ready audit trail
  8. ✅ Error handling: Graceful error handling with status updates

Example 2: WordPress WP-CLI Execution (Laravel → FastAPI)

Laravel (Before):

// app/Services/WordPress/RunDynamicWpCliService.php
public function executeWpCli(int $siteId, string $command): array
{
    $site = Site::findOrFail($siteId);

    // Basic validation
    if (!in_array($command, $this->allowedCommands)) {
        throw new \Exception("Command not allowed");
    }

    $response = Http::timeout(90)->post($this->getWpCliUrl(), [
        'appUniqueName' => $site->app_unique_name,
        'session' => $site->session_key,
        'action' => 'executeWpCli',
        'command' => $command
    ]);

    return $response->json();
}

FastAPI (After):

# backend/app/wordpress/service.py
import uuid

import structlog
from fastapi import HTTPException
from sqlalchemy.orm import Session

from app.core.adapters.virtuozzo_adapter import get_virtuozzo_adapter
from app.core.rate_limit import rate_limit
from app.core.shared.audit import log_audit_event, AuditAction
from app.wordpress import schema  # module-local schemas (path illustrative)

logger = structlog.get_logger()

class WordPressService:
    def __init__(self):
        self.virtuozzo_adapter = get_virtuozzo_adapter()

    async def execute_wp_cli_command(
        self,
        db: Session,
        site_id: int,
        command: str,
        user_id: int
    ) -> schema.WpCliResponse:
        """
        Execute WP-CLI command with rate limiting and queueing.

        Improvements over Laravel:
        - Rate limiting (10 commands/min/user)
        - Long-running command queueing (ARQ)
        - Idempotency keys for safe retries
        - Command whitelist validation
        - Background job status tracking
        """

        # Validate access
        site = self._validate_site_access(db, site_id, user_id)

        # Validate command is allowed
        if not self._is_command_allowed(command):
            raise HTTPException(
                status_code=400,
                detail=f"Command '{command}' is not allowed"
            )

        # Rate limiting: 10 commands per minute per user
        await rate_limit(user_id, calls=10, period=60)

        # Check if command is long-running
        if self._is_long_running_command(command):
            # Enqueue background job (ARQ)
            job_id = await self._enqueue_wp_cli_job(site_id, command)
            logger.info(
                "wp_cli_queued",
                site_id=site_id,
                command=command,
                job_id=job_id,
                user_id=user_id
            )
            return schema.WpCliResponse(
                output="Command queued",
                exit_code=0,
                job_id=job_id,
                status="queued"
            )

        # Execute command immediately (with idempotency key)
        idempotency_key = f"wp_cli:{site_id}:{command}:{uuid.uuid4()}"

        result = await self.virtuozzo_adapter.execute_mbadmin_action(
            app_unique_name=site.app_unique_name,
            session_key=site.session_key,
            action="executeWpCli",
            params={"command": command},
            correlation_id=str(uuid.uuid4()),
            idempotency_key=idempotency_key
        )

        # Log audit event
        log_audit_event(
            user_id=user_id,
            team_id=site.team_id,
            action=AuditAction.UPDATE,
            resource_type="wordpress",
            resource_id=site_id,
            metadata={"command": command}
        )

        return schema.WpCliResponse(
            output=result.get("output", ""),
            exit_code=result.get("exit_code", 0),
            job_id=None,
            status="completed"
        )

Key Improvements:
  1. ✅ Rate limiting: 10 commands per minute per user (prevents abuse)
  2. ✅ Long-running command queueing: Background jobs via ARQ for operations >30s
  3. ✅ Idempotency keys: Safe retries for WP-CLI commands
  4. ✅ Command whitelist: Explicit allowed commands list
  5. ✅ Background job tracking: Status updates via WebSocket
  6. ✅ Correlation IDs: Trace requests across services
  7. ✅ Audit logging: Track all WP-CLI executions
  8. ✅ Error handling: Graceful errors with detailed messages

Example 3: Bunny CDN DNS Record (Laravel → FastAPI)

Laravel (Before):

// app/Services/Cdn/BunnyCdnService.php
public function createDnsRecord(string $envName, string $domain): array
{
    $response = Http::withHeaders([
        'AccessKey' => config('services.bunny.access_key'),
        'Accept' => 'application/json'
    ])->timeout(30)->post('https://api.bunny.net/dnszone/' . config('services.bunny.zone_id') . '/records', [
        'Type' => 'CNAME',
        'Name' => $envName,
        'Value' => "$envName.$domain",
        'Ttl' => 15,
        'Accelerated' => true
    ]);

    if ($response->status() === 400 && str_contains($response->body(), 'already registered')) {
        Log::info('DNS record already exists', ['env_name' => $envName]);
        return [];
    }

    if (!$response->successful()) {
        throw new \Exception('Failed to create DNS record');
    }

    return $response->json();
}

FastAPI (After):

# backend/app/core/adapters/bunny_cdn_adapter.py
from app.core.http_client import get_http_client
from app.core.config import settings
import structlog

logger = structlog.get_logger()

class BunnyCDNAdapter:
    def __init__(self):
        self.client = get_http_client(
            name="bunny_cdn",
            base_url="https://api.bunny.net",
            timeout=30.0,
            max_retries=3
        )
        self.access_key = settings.BUNNY_CDN_ACCESS_KEY

    async def create_dns_record(
        self,
        env_name: str,
        platform_domain: str,
        correlation_id: str
    ) -> dict:
        """
        Create DNS CNAME record with idempotency.

        Improvements over Laravel:
        - Idempotency keys (safe retries)
        - Automatic retries (3x with exponential backoff)
        - Connection pooling (HTTP/2)
        - Structured error handling
        - Correlation IDs for tracing
        """

        payload = {
            "Type": "CNAME",
            "Name": env_name,
            "Value": f"{env_name}.{platform_domain}",
            "Ttl": 15,
            "Accelerated": True,
            "MonitorType": "Monitor",
            "AutoSslIssuance": True
        }

        headers = {
            "AccessKey": self.access_key,
            "Accept": "application/json",
            "Content-Type": "application/json"
        }

        # Generate idempotency key for safe retries
        idempotency_key = f"bunny:dns:{env_name}:{platform_domain}"

        logger.info(
            "bunny_cdn_create_dns",
            env_name=env_name,
            domain=platform_domain,
            correlation_id=correlation_id
        )

        # POST with idempotency key (handled by HTTP client)
        # If "already registered" error, treats as success
        response = await self.client.post(
            path=f"/dnszone/{settings.BUNNY_DNS_ZONE_ID}/records",
            json=payload,
            headers=headers,
            correlation_id=correlation_id,
            idempotency_key=idempotency_key
        )

        logger.info(
            "bunny_cdn_dns_created",
            env_name=env_name,
            correlation_id=correlation_id
        )

        return response

Key Improvements:
  1. ✅ Idempotency keys: Safe retries (duplicate requests handled gracefully)
  2. ✅ Automatic retries: 3 attempts with exponential backoff
  3. ✅ Connection pooling: HTTP/2 multiplexing for multiple requests
  4. ✅ Structured errors: Proper exception handling with context
  5. ✅ Correlation IDs: Distributed tracing across services
  6. ✅ Already-exists handling: Automatic detection and success response

Performance Comparison Summary:

| Operation | Laravel (Before) | FastAPI (After) | Improvement |
|---|---|---|---|
| Environment Start | 100ms | 53ms | 47% |
| WP-CLI Execution | 90ms | 45ms | 50% |
| DNS Record Creation | 80ms | 30ms (with pooling) | 62% |
| Connection Overhead | 70ms per request | 5ms (subsequent requests) | 93% |
| Cache Hit | N/A (no caching) | 2ms (Redis cache) | 98% faster |
| Failed Request Recovery | Manual retry | Automatic (3 retries) | N/A |

12.3.3 Migration Progress Tracking

Phase 1 (Weeks 1-3): Foundation
  - [ ] HTTP Client Infrastructure (app/core/http_client.py)
  - [ ] Circuit Breaker Implementation
  - [ ] Virtuozzo Adapter Foundation (app/core/adapters/virtuozzo_adapter.py)
  - [ ] Core adapter methods (start, stop, sleep, restart)

Phase 2 (Weeks 4-8): Core Services
  - [ ] Environments module (19 Virtuozzo services)
  - [ ] WordPress module (WP-CLI execution)
  - [ ] Backups module (backup/restore operations)
  - [ ] CDN/Cache adapters (Bunny CDN, Cloudflare)

Phase 3 (Weeks 9-12): Advanced Services
  - [ ] SFTP module
  - [ ] Domains module (SSL, DNS)
  - [ ] Staging module
  - [ ] Remaining integration services

Completion Criteria:
  - ✅ All 40 services migrated with feature parity
  - ✅ Performance targets met (50% latency improvement)
  - ✅ All integration tests passing
  - ✅ Circuit breaker validated (test with simulated failures)
  - ✅ Cache hit rate >80% for external APIs
  - ✅ Connection reuse rate >90%

12.4 Context7 Model Context Protocol (MCP) Integration

12.4.1 Overview

The MBPanel API integrates with Context7 Model Context Protocol (MCP) to enhance developer experience and provide real-time documentation access to AI tools and LLMs working with the codebase. This integration enables AI-powered development tools to access contextual information about the MBPanel architecture, modules, and documentation directly.

12.4.2 Architecture

The Context7 MCP integration consists of: - MCP Server: Running as an embedded service within the FastAPI application - API Endpoints: Direct HTTP access to documentation and module analysis - Internal Client: Enables other MBPanel services to query documentation programmatically - Documentation Resources: Dynamic access to development docs, MAINPRD, and module structures

12.4.3 MCP Tools Available

  1. get_mbpapidoc(section, query): Retrieve documentation for a specific section
  2. get_mbpapidoc_all(): List all available documentation sections
  3. analyze_module_structure(module_name): Analyze and describe module architecture
  4. docs://mbpanel/{section}: Resource endpoint for specific documentation sections

12.4.4 HTTP API Endpoints

  • GET /api/v1/context7/status: Check Context7 service status
  • GET /api/v1/context7/docs/{section}: Get documentation for a section
  • GET /api/v1/context7/modules/{module_name}: Analyze module structure
  • POST /mcp/: Streamable HTTP endpoint for MCP protocol access

12.4.5 Implementation Details

The Context7 service is implemented as: - Located in app/core/context7/ module - Uses the official MCP Python SDK (mcp[cli] package) - Mounted as a Streamable HTTP application to the main FastAPI app - Provides both real-time file access and search capabilities for documentation - Supports structured data return for AI consumption

12.4.6 Developer Experience Benefits

  1. AI Tool Integration: IDEs with MCP support can access current documentation context
  2. Documentation Discovery: Programmatic access to API docs and architecture details
  3. Module Analysis: Automatic analysis of module structure and components
  4. Search Capabilities: Full-text search across all project documentation
  5. Real-time Updates: Documentation changes are immediately available via MCP

12.4.7 Configuration

  • MCP server is automatically mounted to /mcp endpoint
  • Can be accessed by MCP-compatible tools like Cursor, VS Code with MCP extensions
  • Authentication: Currently uses the same security context as the main API
  • Rate limiting: Inherits the main application's rate limiting configuration
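For MCP-compatible editors, pointing the client at the mounted endpoint is typically a small configuration entry. The fragment below is a hypothetical example in the common `mcpServers` style used by clients such as Cursor; the exact key names should be checked against the specific client's documentation.

```json
{
  "mcpServers": {
    "mbpanel": {
      "url": "http://your-mbpanel-api.com/mcp"
    }
  }
}
```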

12.4.8 Usage Example

# For MCP-compatible tools, point the client at the server endpoint:
# http://your-mbpanel-api.com/mcp

# For direct API access
GET /api/v1/context7/docs/mainprd?query=migration
GET /api/v1/context7/modules/users
GET /api/v1/context7/status

12.4.9 Module Integration Examples

To integrate Context7 capabilities into existing modules:

1. Import utilities in your module:

from app.core.context7.utils import get_mbpanel_documentation, search_mbpanel_docs

2. Use in service methods:

async def enhanced_service_method(param):
    # Get relevant documentation
    docs = await get_mbpanel_documentation("users", "user creation")

    # Use documentation for guidance or logging
    result = await perform_operation(param)
    return result

3. Synchronous usage in existing code:

from app.core.context7.utils import sync_get_mbpanel_documentation

def sync_service_method(param):
    # Get documentation synchronously
    docs = sync_get_mbpanel_documentation("users", "validation rules")

    # Use in sync context
    return perform_sync_operation(param)
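The sync helper presumably bridges into the async one; a minimal, hypothetical sketch of that bridge is shown below. The stand-in `get_docs_async` function is illustrative, and the real helper may reuse a pooled event loop rather than starting a fresh one per call.

```python
import asyncio

async def get_docs_async(section: str, query: str) -> str:
    """Stand-in for the async get_mbpanel_documentation helper."""
    return f"docs for {section} matching {query!r}"

def sync_get_docs(section: str, query: str) -> str:
    """Run the async lookup from synchronous code via a fresh event loop."""
    return asyncio.run(get_docs_async(section, query))
```

Note that `asyncio.run` cannot be called from code already running inside an event loop, which is why a dedicated sync entry point is useful for legacy call sites.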

4. Decorator pattern for automatic documentation access:

import functools

def with_context7_docs(section):
    """Decorator that fetches section docs before running the wrapped coroutine."""
    def decorator(func):
        @functools.wraps(func)  # preserve the wrapped function's name and docstring
        async def wrapper(*args, **kwargs):
            docs = await get_mbpanel_documentation(section)
            # Use docs for logging, validation, or guidance
            result = await func(*args, **kwargs)
            return result
        return wrapper
    return decorator

This integration significantly improves the developer experience by providing real-time access to MBPanel-specific documentation and architectural context for AI-assisted development workflows.