MBPanel FastAPI Migration - Centralized Product Requirements Document (PRD)
2. Executive Summary¶
This document serves as the single source of truth for the complete migration of the MightyBox Control Panel from Laravel to a modern FastAPI backend with Next.js frontend. This is an enterprise-grade transformation to address performance, scalability, and maintainability challenges using a Hybrid Modular Domain-Driven Design (DDD) architecture.
Migration Overview¶
- Current State: Laravel 11 with Inertia.js/React frontend (monolithic architecture), MySQL database, Node.js WebSocket server
- Target State: FastAPI backend with Next.js frontend (decoupled architecture), PostgreSQL database, native FastAPI WebSocket support, Hybrid Modular DDD architecture
- Timeline: 20-week implementation period
- Performance Goals: 50-70% improvement in API response times (400-600ms → <200ms)
3. Problem Statement¶
Current Challenges¶
- Performance: Laravel application experiencing 400-600ms response times affecting user experience
- Scalability: Current architecture struggles with 5x user growth requirements
- Maintainability: Legacy codebase with complex monolithic structure and tight coupling
- Infrastructure Costs: Current setup requires 30% more resources than industry standards
Business Impact¶
- Decreased user satisfaction due to slow response times
- Limited ability to scale with business growth
- Higher operational costs
- Technical debt affecting feature delivery speed
4. User Personas¶
Understanding our diverse user base is critical for building an enterprise-grade WordPress hosting dashboard. Our platform serves millions of users with varying technical expertise, business sizes, and use cases.
4.1 Primary Personas¶
P1: Small Business Owner (Sarah)¶
Demographics - Role: Entrepreneur/Small Business Owner - Technical Skill: Beginner to Intermediate - Age Range: 28-45 - Business Size: 1-5 employees - Annual Revenue: $50K-$500K
Goals - Launch and manage business website quickly without technical expertise - Minimize costs while maintaining reliable hosting - Focus on business growth, not server management - Easy-to-understand dashboard with minimal learning curve
Pain Points - Overwhelmed by technical jargon and complex configurations - Limited time to learn hosting management - Fear of making mistakes that could break website - Budget constraints requiring cost-effective solutions - Needs 24/7 uptime for e-commerce transactions
Usage Patterns - Logs in 2-3 times per week - Primary tasks: View site status, check analytics, manage domains - Mobile usage: 40% of sessions - Peak activity: Evenings and weekends
Success Metrics - Dashboard task completion in <3 clicks - Zero technical support tickets for basic operations - 95% task success rate without documentation
P2: Web Agency Developer (Marcus)¶
Demographics - Role: Web Developer/Agency Owner - Technical Skill: Advanced - Age Range: 25-40 - Clients Managed: 20-100 WordPress sites - Team Size: 3-15 developers
Goals - Efficiently manage multiple client WordPress sites from single dashboard - Automate repetitive tasks (backups, updates, staging environments) - White-label solution for client presentation - Rapid deployment and cloning capabilities - API access for custom integrations and automation
Pain Points - Time wasted switching between multiple hosting control panels - Manual processes for routine maintenance across dozens of sites - Lack of bulk operations for multi-site management - Insufficient API documentation for automation - Need for staging/production environment workflows
Usage Patterns - Logs in daily, multiple times - Primary tasks: Bulk operations, API integrations, staging deployments - Desktop usage: 85% of sessions - Peak activity: Business hours (9am-6pm) - Heavy API usage for automation
Success Metrics - Bulk operations support for 50+ sites simultaneously - API response time <100ms for automation workflows - 80% reduction in manual task time vs. competitors
P3: Enterprise IT Administrator (Jennifer)¶
Demographics - Role: Senior IT Administrator/DevOps Lead - Technical Skill: Expert - Age Range: 30-50 - Organization Size: 500-10,000+ employees - WordPress Instances: 100-1,000+ sites
Goals - Enterprise-grade security, compliance, and governance - Centralized management with role-based access control (RBAC) - Advanced monitoring, alerting, and SLA tracking - Integration with existing enterprise tools (SSO, SIEM, ITSM) - Detailed audit logs and compliance reporting (SOC 2, GDPR, HIPAA)
Pain Points - Lack of enterprise SSO integration (SAML, OAuth, LDAP) - Insufficient audit trails for security compliance - No granular permission controls for team management - Missing integration with enterprise monitoring (DataDog, New Relic) - Inadequate disaster recovery and multi-region failover
Usage Patterns - Logs in multiple times daily - Primary tasks: Security monitoring, compliance reporting, team management - Desktop only: 100% of sessions - Peak activity: Continuous monitoring with automated alerts - Extensive use of API for integrations
Success Metrics - 100% audit trail coverage for all operations - SSO integration with <500ms authentication time - RBAC with 20+ custom role configurations - 99.99% SLA compliance with automated failover
P4: WordPress Developer/Blogger (Alex)¶
Demographics - Role: Content Creator/Independent Developer - Technical Skill: Intermediate to Advanced - Age Range: 22-35 - Sites Managed: 1-5 personal/client sites - Income Source: Blogging, freelance development
Goals - Easy WordPress installation and theme/plugin testing - Quick staging environment for development - Performance optimization tools for SEO - Cost-effective hosting for personal projects - Simple backup and restore for experimentation
Pain Points - Expensive hosting for multiple test/development sites - Slow deployment process for updates - Lack of performance insights for optimization - Complex staging environment setup - Fear of breaking production while testing
Usage Patterns - Logs in 3-5 times per week - Primary tasks: Deploy updates, check performance, manage backups - Mixed device usage: 60% desktop, 40% mobile - Peak activity: Evenings and weekends - Experimentation-heavy (frequent staging usage)
Success Metrics - One-click staging environment creation - Performance insights dashboard with actionable recommendations - <5 minute deployment time for WordPress updates
P5: SaaS Platform Administrator (David)¶
Demographics - Role: Platform Operations Manager - Technical Skill: Expert - Age Range: 28-45 - Organization: Multi-tenant SaaS company - WordPress Instances: 10,000-1,000,000+ tenant sites
Goals - Massive scale operations with automated provisioning - Per-tenant resource isolation and management - Real-time monitoring for millions of concurrent users - API-first architecture for full platform automation - Cost optimization with granular resource tracking
Pain Points - Inability to provision thousands of WordPress instances simultaneously - Lack of per-tenant resource monitoring and billing integration - Insufficient horizontal scaling capabilities - Missing webhooks for event-driven automation - No programmatic control over Jelastic environments at scale
Usage Patterns - 100% API-driven interactions - Continuous monitoring via webhooks and event streams - Automated provisioning triggered by customer signups - Dashboard usage: Primarily for high-level analytics - Peak activity: 24/7 automated operations
Success Metrics - Provision 1,000+ WordPress instances per hour - API uptime: 99.99% with <50ms response times - Zero-downtime scaling events - Real-time webhook delivery (<1 second latency)
4.2 Secondary Personas¶
P6: Non-Profit Organization Manager (Lisa)¶
Demographics - Role: Non-Profit Communications Director - Technical Skill: Beginner - Organization Size: 5-50 staff + volunteers - Budget: Highly constrained
Goals - Reliable hosting for fundraising campaigns - Simple content updates without developer dependency - Donation processing uptime during critical campaigns - Cost-effective solutions within limited budget
Pain Points - Limited technical support budget - Website breaks during high-traffic fundraising events - Difficulty understanding hosting bills - Dependency on volunteer IT help
P7: E-commerce Store Owner (Raj)¶
Demographics - Role: Online Retailer - Technical Skill: Intermediate - Business Type: WooCommerce-based store - Annual Sales: $500K-$5M
Goals - 99.99% uptime during peak shopping seasons - Fast page load times for conversion optimization - Easy scaling during Black Friday/Cyber Monday - PCI compliance for payment processing - Integrated CDN for global customers
Pain Points - Site crashes during traffic spikes (lost revenue) - Slow checkout process reducing conversions - Lack of real-time performance monitoring - Difficulty understanding security compliance requirements
4.3 Anti-Personas (Users We Don't Serve)¶
AP1: Enterprise Custom CMS Users¶
- Organizations requiring non-WordPress CMS solutions
- Custom enterprise applications not based on WordPress
- Users needing bare-metal server access
AP2: Free Tier Seekers¶
- Users expecting completely free hosting indefinitely
- Users unwilling to pay for enterprise-grade reliability
- Hobby projects with no growth trajectory
4.4 Persona-Driven Design Principles¶
Based on our persona analysis, the dashboard must:
- Progressive Disclosure: Simple interface for beginners (Sarah, Lisa) with advanced features accessible to experts (Marcus, Jennifer, David)
- Multi-Tenancy at Scale: Support individual sites (Sarah, Alex) through millions of sites (David) with the same architecture
- API-First Design: Every dashboard feature must have an API equivalent for automation (Marcus, Jennifer, David)
- Role-Based Views: Customize dashboard complexity based on user role and technical expertise
- Performance Transparency: Real-time metrics for all personas with appropriate detail levels
- Mobile-First for Basic Operations: 40% of SMB users access via mobile (Sarah, Alex)
- Compliance Built-In: Enterprise personas (Jennifer, David) require built-in compliance frameworks
- Cost Visibility: All personas need clear resource usage and cost breakdowns
5. User Stories¶
This section documents comprehensive user stories for each persona, covering both success and failure scenarios. Each story follows the format: As a [persona], I want to [action], so that [benefit].
5.1 Small Business Owner (Sarah) - User Stories¶
Success Scenarios¶
US-001: Quick WordPress Site Launch - Story: As Sarah, I want to launch a WordPress site in under 5 minutes without technical knowledge, so that I can quickly establish my online presence - Acceptance Criteria: - One-click WordPress installation from dashboard - Pre-configured templates for common business types - Automatic SSL certificate provisioning - Default security settings applied automatically - Success confirmation with site URL and login credentials - Priority: P0 (Critical)
US-002: Visual Site Health Monitoring - Story: As Sarah, I want to see my website's health status at a glance, so that I know if there are any issues without understanding technical details - Acceptance Criteria: - Dashboard shows clear "Healthy" or "Needs Attention" status - Visual indicators (green/yellow/red) for uptime, performance, security - Plain-language explanations for any issues - Recommended actions in non-technical terms - Mobile-friendly view - Priority: P0 (Critical)
US-003: Simple Domain Management - Story: As Sarah, I want to connect my custom domain with simple instructions, so that my business has a professional web address - Acceptance Criteria: - Step-by-step domain connection wizard - Visual DNS configuration guide with screenshots - Automatic DNS verification - Email notifications when domain is live - Support for common domain registrars - Priority: P1 (High)
US-004: Easy Backup and Restore - Story: As Sarah, I want to restore my website if something goes wrong, so that I can recover from mistakes without losing my business data - Acceptance Criteria: - One-click restore to previous backup point - Calendar view of available backup dates - Preview of backup contents before restore - Confirmation dialog with plain-language warnings - Restore completion notification - Priority: P0 (Critical)
Failure Scenarios¶
US-005: Site Outage During Peak Hours - Story: As Sarah, I want to be immediately notified if my site goes down, so that I can minimize lost sales and customer frustration - Acceptance Criteria: - SMS/email alerts within 60 seconds of outage - Plain-language explanation of issue - Estimated recovery time - Option to contact emergency support - Automatic recovery attempts before alerting - Priority: P0 (Critical) - Failure Handling: Automated failover, rollback capabilities, 24/7 support escalation
US-006: Accidental Plugin Breaking Site - Story: As Sarah, I want the system to prevent broken plugins from taking my site offline, so that I don't lose customers due to technical mistakes - Acceptance Criteria: - Automatic site backup before plugin installations - Plugin compatibility checks with current WordPress version - Sandbox testing environment for plugins (optional) - One-click rollback if site becomes inaccessible - Error detection with automatic plugin deactivation - Priority: P1 (High) - Failure Handling: Automatic rollback, safe mode boot, plugin quarantine
5.2 Web Agency Developer (Marcus) - User Stories¶
Success Scenarios¶
US-010: Bulk Site Operations - Story: As Marcus, I want to update plugins across 50 client sites simultaneously, so that I can maintain all sites efficiently without manual repetition - Acceptance Criteria: - Multi-select sites from dashboard - Bulk actions: update plugins, update themes, update WordPress core - Progress tracking for bulk operations - Detailed results with success/failure per site - Rollback option for failed updates - Priority: P0 (Critical)
US-011: Automated Staging Environments - Story: As Marcus, I want to create staging environments with one click, so that I can test changes safely before deploying to client production sites - Acceptance Criteria: - Clone production to staging in <5 minutes - Separate subdomain for staging environment - Sync production data to staging on-demand - Push staging changes to production with approval workflow - Automatic staging environment cleanup after 30 days - Priority: P0 (Critical)
US-012: API-Driven Site Provisioning - Story: As Marcus, I want to provision new client sites via API, so that I can integrate WordPress deployment into my agency's onboarding workflow - Acceptance Criteria: - REST API endpoint for site creation - Configurable parameters: WordPress version, PHP version, plugins, theme - Webhook notifications when site is ready - API key-based authentication - Rate limiting: 100 sites per hour - Priority: P1 (High)
US-013: White-Label Client Portal - Story: As Marcus, I want to provide clients access to their sites under my agency branding, so that I can deliver a professional client experience - Acceptance Criteria: - Custom domain for client portal (e.g., portal.agency.com) - Agency logo and color scheme customization - Hide MBPanel branding from client-facing views - Granular permission controls per client - Client-specific limited feature access - Priority: P2 (Medium)
Failure Scenarios¶
US-014: Bulk Update Failures - Story: As Marcus, I want detailed failure reports when bulk updates fail, so that I can quickly identify and fix problematic sites - Acceptance Criteria: - Failed update list with specific error messages - Ability to retry failed updates individually - Automatic backup before each update attempt - Email digest of failures with recommended actions - Integration with ticketing system for client notifications - Priority: P1 (High) - Failure Handling: Atomic rollback per site, error categorization, automated retry logic
US-015: API Rate Limit Exceeded - Story: As Marcus, I want clear feedback when hitting API rate limits, so that I can adjust my automation scripts without breaking workflows - Acceptance Criteria: - HTTP 429 response with retry-after header - Dashboard showing current API usage vs. limits - Option to request temporary rate limit increase - Webhook notifications before hitting 80% of limit - Clear documentation on rate limit policies - Priority: P2 (Medium) - Failure Handling: Gradual backoff, request queuing, burst allowance
5.3 Enterprise IT Administrator (Jennifer) - User Stories¶
Success Scenarios¶
US-020: SSO Integration - Story: As Jennifer, I want to integrate MBPanel with our corporate SSO (SAML), so that employees can access the dashboard using existing company credentials - Acceptance Criteria: - SAML 2.0 protocol support - Support for major identity providers (Okta, Azure AD, Auth0) - Just-in-time (JIT) user provisioning - SSO login time <500ms - Fallback to local authentication if SSO unavailable - Priority: P0 (Critical)
US-021: Comprehensive Audit Logging - Story: As Jennifer, I want complete audit trails of all user actions, so that I can meet SOC 2 compliance requirements and investigate security incidents - Acceptance Criteria: - Log all CRUD operations with user, timestamp, IP address, action details - Searchable audit log with advanced filters - Export audit logs in CSV/JSON formats - Retention: 12 months minimum - Tamper-proof logging with cryptographic verification - Priority: P0 (Critical)
US-022: Granular RBAC - Story: As Jennifer, I want to create custom roles with specific permissions, so that I can enforce least-privilege access across 100+ team members - Acceptance Criteria: - Create unlimited custom roles - 50+ granular permissions (e.g., "view sites", "deploy staging", "delete sites") - Role inheritance and composition - Bulk user role assignment - Permission templates for common scenarios - Priority: P0 (Critical)
US-023: Compliance Dashboard - Story: As Jennifer, I want a compliance dashboard showing GDPR/SOC 2 readiness, so that I can prepare for audits efficiently - Acceptance Criteria: - Real-time compliance score (percentage) - Checklist of compliance requirements with status - Data residency controls (EU/US/Asia regions) - Data retention policy enforcement - Automated compliance reports (PDF/HTML) - Priority: P1 (High)
Failure Scenarios¶
US-024: SSO Provider Outage - Story: As Jennifer, I want emergency access when our SSO provider is down, so that critical operations can continue during corporate authentication failures - Acceptance Criteria: - Break-glass emergency access for super admins - Multi-factor authentication for emergency access - Audit log entries for all emergency access usage - Automatic notification to security team - Time-limited emergency sessions (4 hours max) - Priority: P0 (Critical) - Failure Handling: Local credential fallback, emergency admin accounts, security team alerts
US-025: Compliance Violation Detection - Story: As Jennifer, I want immediate alerts when compliance violations occur, so that I can remediate issues before audit failures - Acceptance Criteria: - Real-time violation detection (e.g., PII exposure, retention policy breach) - Severity classification (Critical/High/Medium/Low) - Automated remediation workflows where possible - Integration with SIEM tools (Splunk, QRadar) - Incident response playbooks - Priority: P0 (Critical) - Failure Handling: Automatic quarantine, rollback capabilities, incident escalation
5.4 WordPress Developer/Blogger (Alex) - User Stories¶
Success Scenarios¶
US-030: One-Click Staging - Story: As Alex, I want to create a staging copy of my blog instantly, so that I can test new plugins without risking my live site - Acceptance Criteria: - Staging environment created in <3 minutes - Isolated subdomain (e.g., staging.myblog.com) - Search engine indexing blocked on staging - Easy data sync from production to staging - Push changes from staging to production option - Priority: P1 (High)
US-031: Performance Optimization Insights - Story: As Alex, I want actionable performance recommendations, so that I can improve my blog's SEO rankings and user experience - Acceptance Criteria: - Page load time metrics (Core Web Vitals) - Database query analysis with slow query detection - Image optimization recommendations - Plugin performance impact ranking - One-click fixes for common issues - Priority: P1 (High)
US-032: Git Integration for Themes/Plugins - Story: As Alex, I want to deploy theme changes from GitHub, so that I can use version control for my development workflow - Acceptance Criteria: - Connect GitHub/GitLab/Bitbucket repositories - Webhook-triggered deployments on git push - Deploy to staging or production environments - Rollback to previous git commits - Environment-specific configuration files - Priority: P2 (Medium)
Failure Scenarios¶
US-033: Plugin Conflict Detection - Story: As Alex, I want to be warned about plugin conflicts before activation, so that I don't crash my site when experimenting - Acceptance Criteria: - Known conflict database check before plugin install - Compatibility matrix display for active plugins - Sandbox testing option before production activation - Automatic safe mode if site becomes inaccessible - Conflict resolution recommendations - Priority: P1 (High) - Failure Handling: Automatic plugin deactivation, safe mode boot, rollback option
5.5 SaaS Platform Administrator (David) - User Stories¶
Success Scenarios¶
US-040: Mass Provisioning API - Story: As David, I want to provision 10,000 WordPress sites per day via API, so that I can onboard new SaaS customers automatically at scale - Acceptance Criteria: - Batch provisioning endpoint accepting 1,000 sites per request - Asynchronous processing with webhook completion notifications - Provisioning time: <5 minutes per site - Idempotent operations (safe retries) - Rate limit: 10,000 sites per hour - Priority: P0 (Critical)
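The idempotent-operations criterion in US-040 is commonly met with a content-derived idempotency key, so that an identical retried batch maps to the same server-side operation. A hedged sketch follows; the field names (`idempotency_key`, `customer_id`, `sites`) are illustrative, not the final API schema.

```python
import hashlib
import json

def batch_payload(customer_id: str, sites: list[dict]) -> dict:
    """Build a batch provisioning request with a deterministic
    idempotency key derived from the request contents, so safe
    retries of the same batch deduplicate on the server side."""
    body = {"customer_id": customer_id, "sites": sites}
    key = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()
    ).hexdigest()
    return {"idempotency_key": key, **body}
```

Because the key is a hash of the canonicalized body, resubmitting the exact same batch after a network failure produces the same key, while any change to the site list produces a new one.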
US-041: Per-Tenant Resource Metering - Story: As David, I want granular resource usage data per WordPress instance, so that I can implement accurate usage-based billing for my SaaS customers - Acceptance Criteria: - Real-time metrics: CPU, memory, disk, bandwidth per site - Hourly/daily/monthly aggregation - Export to billing systems via API (Stripe, Chargebee) - Cost allocation tags for multi-tenant accounting - Resource usage alerts per tenant - Priority: P0 (Critical)
US-042: Programmatic Jelastic Environment Control - Story: As David, I want full API control over Jelastic environments, so that I can automate scaling, backups, and configuration for millions of tenant sites - Acceptance Criteria: - API endpoints mirroring all Jelastic capabilities - Bulk operations: scale, restart, configure - Environment templates for rapid provisioning - Webhook events for all environment changes - Sub-100ms API response times - Priority: P0 (Critical)
US-043: Zero-Downtime Auto-Scaling - Story: As David, I want automatic horizontal scaling during traffic spikes, so that my SaaS platform maintains performance during viral content events - Acceptance Criteria: - Cloudlet auto-scaling based on CPU/memory thresholds - Scale up in <60 seconds - Scale down after 10-minute cooldown period - No connection drops during scaling events - Configurable scaling policies per environment - Priority: P0 (Critical)
Failure Scenarios¶
US-044: Jelastic API Outage - Story: As David, I want graceful degradation when Jelastic API is unavailable, so that my SaaS dashboard remains operational during upstream outages - Acceptance Criteria: - Circuit breaker pattern with 5-second timeout - Cached data display with staleness indicator - Read-only mode during API outages - Automatic retry with exponential backoff - Status page integration showing Jelastic health - Priority: P0 (Critical) - Failure Handling: Circuit breaker, cached data, retry queue, health monitoring
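The circuit-breaker behavior in US-044 can be illustrated with a minimal state machine (a sketch with configurable thresholds, not the production implementation): open the circuit after N consecutive failures, serve cached/read-only data while open, and allow a probe request again after a reset timeout.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after `max_failures` consecutive
    failures; while open, callers should fall back to cached data;
    after `reset_timeout` seconds a probe request is allowed (half-open)."""

    def __init__(self, max_failures: int = 5, reset_timeout: float = 30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at: float | None = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True  # circuit closed: normal operation
        # Half-open: permit a probe once the reset timeout has elapsed
        return time.monotonic() - self.opened_at >= self.reset_timeout

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None  # close the circuit again

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()  # open the circuit
```

In the dashboard, `allow_request()` returning False would switch the Jelastic data path to the cached, staleness-flagged, read-only view described in the acceptance criteria.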
US-045: Mass Provisioning Partial Failure - Story: As David, I want detailed failure tracking when batch provisioning partially fails, so that I can retry only failed sites and maintain SaaS onboarding SLAs - Acceptance Criteria: - Per-site success/failure status in batch response - Failed sites queued for automatic retry (3 attempts) - Error categorization: transient vs. permanent failures - Webhook notifications for batch completion with failure details - Reconciliation API to verify provisioning state - Priority: P0 (Critical) - Failure Handling: Automatic retry with backoff, dead letter queue, manual reconciliation API
5.6 E-commerce Store Owner (Raj) - User Stories¶
Success Scenarios¶
US-050: Traffic Spike Auto-Scaling - Story: As Raj, I want my site to automatically scale during Black Friday sales, so that I don't lose revenue due to site crashes - Acceptance Criteria: - Automatic detection of traffic spikes (>200% baseline) - Instant cloudlet scaling up to 10x baseline capacity - No downtime during scaling events - Email notification when scaling occurs - Automatic scale-down after traffic normalizes - Priority: P0 (Critical)
US-051: Real-Time Performance Monitoring - Story: As Raj, I want real-time checkout page performance metrics, so that I can optimize conversion rates during peak shopping periods - Acceptance Criteria: - Live dashboard showing checkout page load times - Transaction success rate monitoring - WooCommerce-specific metrics (cart abandonment, payment gateway latency) - Alerts when performance degrades below thresholds - Historical comparison (vs. last Black Friday) - Priority: P1 (High)
Failure Scenarios¶
US-052: Payment Gateway Timeout - Story: As Raj, I want detailed logs when payment processing fails, so that I can troubleshoot lost sales and customer complaints - Acceptance Criteria: - Payment gateway API call logs with timestamps - Timeout/error categorization - Customer email/order ID correlation - Integration with WooCommerce order notes - Automatic retry recommendations - Priority: P0 (Critical) - Failure Handling: Automatic retry logic, detailed error logs, support ticket integration
5.7 User Story Summary Matrix¶
| Persona | Critical (P0) | High (P1) | Medium (P2) | Total Stories |
|---|---|---|---|---|
| Sarah (SMB Owner) | 4 | 2 | 0 | 6 |
| Marcus (Agency) | 2 | 2 | 2 | 6 |
| Jennifer (Enterprise) | 5 | 1 | 0 | 6 |
| Alex (Developer) | 0 | 3 | 1 | 4 |
| David (SaaS Platform) | 6 | 0 | 0 | 6 |
| Raj (E-commerce) | 2 | 1 | 0 | 3 |
| TOTAL | 19 | 9 | 3 | 31 |
5.8 Cross-Persona User Journeys¶
Journey 1: Site Launch to Production¶
- Sarah creates account and provisions WordPress site (US-001)
- Alex helps Sarah with theme customization via staging (US-030)
- Marcus (hired as consultant) sets up performance optimization (US-011)
- Site goes live with monitoring alerts configured (US-002)
- Jennifer (if enterprise) integrates with corporate SSO for team access (US-020)
Journey 2: Agency Multi-Site Management¶
- Marcus provisions 50 client sites via API (US-012)
- Marcus configures white-label portal for clients (US-013)
- Bulk plugin updates across all sites (US-010)
- Client (Sarah) logs into white-label portal to view her site (US-002)
- Marcus handles bulk update failures for specific sites (US-014)
Journey 3: Enterprise Compliance Onboarding¶
- Jennifer integrates SSO with corporate Azure AD (US-020)
- Jennifer configures RBAC for 100+ team members (US-022)
- Jennifer runs compliance audit (US-023)
- Jennifer sets up audit logging for SOC 2 requirement (US-021)
- Jennifer handles compliance violations proactively (US-025)
Journey 4: SaaS Platform Scaling¶
- David provisions 10,000 tenant sites via batch API (US-040)
- David implements per-tenant resource metering for billing (US-041)
- Traffic spike triggers auto-scaling for high-traffic tenants (US-043)
- David handles Jelastic API outage gracefully (US-044)
- David reconciles partial provisioning failures (US-045)
6. Technical Requirements¶
6.1 Performance Requirements¶
- API Response Times: <200ms for 95th percentile (50-70% improvement over current)
- Concurrent Users: Support 5x more concurrent users than current system
- Database Queries: <50ms average query time for common operations
- Page Load Times: <2.5 seconds for dashboard pages
6.2 Scalability Requirements and Auto-Scaling Policies¶
Horizontal Scaling: Support for multiple instances behind load balancer (10+ backend instances, 5+ frontend instances)
6.2.1 Auto-Scaling Policies¶
Backend (FastAPI) Auto-Scaling:
| Metric | Scale Up Threshold | Scale Down Threshold | Cooldown | Min Instances | Max Instances |
|---|---|---|---|---|---|
| CPU Utilization | >70% for 3 min | <30% for 10 min | 5 min | 5 | 50 |
| Memory Utilization | >80% for 3 min | <40% for 10 min | 5 min | 5 | 50 |
| Request Count | >5000 req/instance/min | <1000 req/instance/min | 5 min | 5 | 50 |
| Response Time p95 | >200ms for 5 min | <100ms for 15 min | 10 min | 5 | 50 |
Kubernetes HPA Configuration:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: fastapi-backend-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fastapi-backend
  minReplicas: 5
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "5000"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600  # 10 min
      policies:
        - type: Percent
          value: 50              # Scale down max 50% at a time
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0    # Immediate scale up
      policies:
        - type: Percent
          value: 100             # Double capacity if needed
          periodSeconds: 60
        - type: Pods
          value: 10              # Or add 10 pods, whichever is higher
          periodSeconds: 60
      selectPolicy: Max
```
Frontend (Next.js) Auto-Scaling:
| Metric | Scale Up Threshold | Scale Down Threshold | Cooldown | Min Instances | Max Instances |
|---|---|---|---|---|---|
| CPU Utilization | >60% for 3 min | <20% for 10 min | 5 min | 3 | 20 |
| Memory Utilization | >75% for 3 min | <30% for 10 min | 5 min | 3 | 20 |
| Active Connections | >500/instance | <100/instance | 5 min | 3 | 20 |
Predictive Scaling: - Machine learning model predicts traffic patterns - Pre-scale 15 minutes before anticipated traffic spikes - Based on historical data (daily/weekly/seasonal patterns) - Example: Scale up at 8 AM UTC (business hours), scale down at 6 PM UTC
```python
import math
from datetime import datetime, timezone

# Predictive Scaling Implementation
async def predictive_scaling():
    """Pre-emptive scaling based on historical traffic patterns."""
    now = datetime.now(timezone.utc)
    current_hour = now.hour
    current_day = now.weekday()

    # Historical traffic patterns (requests per second)
    traffic_forecast = await ml_model.predict_traffic(
        hour=current_hour,
        day_of_week=current_day,
        lookback_weeks=4,
    )

    # Calculate required instances (5k req/s per instance)
    required_instances = math.ceil(traffic_forecast / 5000)

    # Pre-scale if the forecast shows a >30% increase
    current_instances = await get_current_instance_count()
    if required_instances > current_instances * 1.3:
        await scale_instances(required_instances, reason="PREDICTIVE")
```
Database (Citus) Auto-Scaling: - Worker Nodes: Add workers when shard size >500GB (see Section 7.7.3) - Read Replicas: Auto-scale replicas based on read query load - Scale up: Read query latency p95 >50ms for 5 minutes - Scale down: Read query latency p95 <20ms for 20 minutes - Min read replicas: 2 per shard - Max read replicas: 5 per shard - Connection Pooling (PgBouncer): - Supports 10,000+ concurrent connections - Auto-scale pool size based on active connections - Pool exhaustion alerts when >90% utilization
Redis Cluster Auto-Scaling: - Memory-based: Add nodes when memory >80% for 5 minutes - Eviction-based: Scale up if eviction rate >100 keys/min - Min nodes: 3 (1 master + 2 replicas) - Max nodes: 12 (4 masters + 8 replicas)
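The Redis thresholds above reduce to a small decision function. The sketch below operates on instantaneous metrics only; the 5-minute sustained-window check is assumed to happen upstream in the metrics pipeline.

```python
def redis_scale_decision(used_memory: int, maxmemory: int,
                         evicted_per_min: float, nodes: int,
                         min_nodes: int = 3, max_nodes: int = 12) -> int:
    """Target Redis cluster node count given current metrics:
    add a node on >80% memory utilization or >100 evictions/min,
    clamped to the configured min/max node counts."""
    target = nodes
    if maxmemory and used_memory / maxmemory > 0.80:
        target = nodes + 1  # memory pressure
    elif evicted_per_min > 100:
        target = nodes + 1  # eviction pressure
    return max(min_nodes, min(max_nodes, target))
```

The caller would feed this from Redis `INFO` fields (`used_memory`, `maxmemory`, `evicted_keys` deltas) on each evaluation tick and apply the delta via the orchestrator.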
Jelastic Auto-Scaling (via API):
```python
import httpx

# Programmatic Jelastic environment scaling
async def scale_jelastic_environment(env_id: str, cloudlets: int):
    """Scale Jelastic environment cloudlets (128 MB RAM/cloudlet)."""
    async with httpx.AsyncClient() as client:
        response = await client.post(
            f"{JELASTIC_API_URL}/environment/control/rest/changetopology",
            json={
                "envName": env_id,
                "session": jelastic_session,
                "nodes": [{
                    "cloudlets": cloudlets,                          # Target cloudlets
                    "fixedCloudlets": cloudlets // 4,                # Reserved cloudlets
                    "flexibleCloudlets": cloudlets - (cloudlets // 4),
                }],
            },
        )
        response.raise_for_status()
        await log_scaling_event(env_id, cloudlets, "JELASTIC_SCALE")
```
Scaling Triggers for Jelastic: - CPU >80% for 5 min → Increase cloudlets by 25% - Memory >85% for 5 min → Increase cloudlets by 25% - Disk I/O >70% for 10 min → Increase cloudlets or add node - CPU <30% for 20 min → Decrease cloudlets by 25%
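The trigger rules above can be expressed as a single pure function that maps current metrics to a target cloudlet count. A minimal sketch (the function name and the per-metric duration counters are illustrative assumptions, not part of the Jelastic API):

```python
# Sketch of the Jelastic scaling triggers listed above. The ±25% steps and
# the 5/20-minute windows come from the list; everything else is illustrative.

def next_cloudlet_count(current: int, cpu_pct: float, mem_pct: float,
                        cpu_high_minutes: int, mem_high_minutes: int,
                        cpu_low_minutes: int) -> int:
    """Return the target cloudlet count after applying the trigger rules."""
    if cpu_pct > 80 and cpu_high_minutes >= 5:
        return max(current + 1, round(current * 1.25))  # CPU >80% for 5 min -> +25%
    if mem_pct > 85 and mem_high_minutes >= 5:
        return max(current + 1, round(current * 1.25))  # Mem >85% for 5 min -> +25%
    if cpu_pct < 30 and cpu_low_minutes >= 20:
        return max(1, round(current * 0.75))            # CPU <30% for 20 min -> -25%
    return current
```

The result would then be passed to `scale_jelastic_environment` shown earlier; the disk I/O rule is omitted here because "increase cloudlets or add node" requires topology context beyond a per-environment function.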
6.2.2 Load Balancing Strategy¶
- Algorithm: Weighted round-robin with health checks
- Health Check Interval: Every 10 seconds
- Health Check Endpoint: GET /health (returns 200 if healthy)
- Unhealthy Threshold: 3 consecutive failures
- Drain Timeout: 30 seconds (graceful shutdown for in-flight requests)
- WebSocket Sticky Sessions: Enabled (session affinity via cookie)
- SSL Termination: At load balancer (TLS 1.3)
- Connection Timeout: 60 seconds
- Keep-Alive: 120 seconds
Load Balancer Configuration (AWS ALB):
TargetGroup:
Protocol: HTTP
Port: 8000
HealthCheck:
Protocol: HTTP
Path: /health
Interval: 10
Timeout: 5
HealthyThreshold: 2
UnhealthyThreshold: 3
TargetGroupAttributes:
- Key: deregistration_delay.timeout_seconds
Value: '30'
- Key: stickiness.enabled
Value: 'true'
- Key: stickiness.type
Value: 'lb_cookie'
- Key: stickiness.lb_cookie.duration_seconds
Value: '86400' # 24 hours for WebSocket
6.2.3 Caching Strategy¶
- Multi-tier caching:
- L1: Application-level in-memory cache (per-instance, TTL: 60s)
- Environment metadata, user sessions
- 128 MB per instance, LRU eviction
- L2: Redis cluster (distributed caching)
- API responses, computed data, Jelastic API cache
- TTL: 5 minutes for API responses, 15 minutes for computed data
- Eviction policy: allkeys-lru
- L3: CDN edge caching (CloudFlare/CloudFront)
- Static assets (JS, CSS, images)
- TTL: 1 year with cache-busting via versioned URLs
Cache Invalidation:
- Event-driven invalidation via Redis Pub/Sub
- Webhook-triggered invalidation for Jelastic environment changes
- Manual purge API: POST /api/v1/cache/purge
CDN Integration: - Provider: CloudFlare (primary) or CloudFront (fallback) - Static assets served via CDN with automatic image optimization - Lazy loading for images, SVG sprites for icons - Brotli compression for text assets
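The L1/L2 read path described above is easiest to see in code. A minimal, self-contained sketch: the Redis tier is stubbed with a plain dict, the TTLs match the strategy (60 s L1, 5 min L2), and `cache_invalidate` stands in for the Redis Pub/Sub fan-out that would clear every instance's L1 (all names are illustrative):

```python
# Sketch of the multi-tier cache read/write path described above.
import time

L1_TTL, L2_TTL = 60, 300
_l1: dict[str, tuple[float, object]] = {}   # per-instance cache: key -> (expiry, value)
_l2: dict[str, tuple[float, object]] = {}   # stand-in for the Redis cluster

def cache_get(key: str):
    now = time.monotonic()
    entry = _l1.get(key)
    if entry and entry[0] > now:            # L1 hit: cheapest path
        return entry[1]
    entry = _l2.get(key)
    if entry and entry[0] > now:            # L2 hit: promote into L1
        _l1[key] = (now + L1_TTL, entry[1])
        return entry[1]
    return None                             # miss: caller fetches from origin

def cache_set(key: str, value) -> None:
    now = time.monotonic()
    _l1[key] = (now + L1_TTL, value)
    _l2[key] = (now + L2_TTL, value)

def cache_invalidate(key: str) -> None:
    """Event-driven invalidation: in production, a Redis Pub/Sub message
    tells every instance to drop its L1 entry; here we drop both tiers."""
    _l1.pop(key, None)
    _l2.pop(key, None)
```

The short L1 TTL bounds how long an instance can serve a stale value even if an invalidation message is missed.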
6.2.4 Target Capacity and Performance Benchmarks¶
- Concurrent Users: 1M+ (sustained), 5M+ (burst)
- Requests per Second: 10,000+ (sustained), 50,000+ (burst)
- API Response Time: <200ms p95, <500ms p99
- Database Query Time: <50ms p95 for common operations
- Cache Hit Rate: >80% for Redis, >95% for CDN
- WebSocket Connections: 500K+ concurrent connections
- Throughput: 1 Gbps sustained, 10 Gbps burst
6.3 Security Requirements¶
- Authentication: JWT-based with configurable token expiration
- Authorization: Role-based access control (RBAC) system
- Data Protection: Encryption at rest and in transit
- API Security: Rate limiting, input validation, SQL injection protection
- Compliance: Industry security standards compliance
- Mutual TLS (mTLS): Service-to-service authentication for internal traffic
- Field-level Encryption: PII protection with KMS-managed keys and rotation
- Secrets Management: Planned centralized vault (e.g., HashiCorp Vault) for credentials/keys; today secrets are stored only via per-environment .env.backend.local copies and container environment variables. This document tracks that gap so future work can introduce an actual secrets manager. - Security Headers: CSP, HSTS, X-Frame-Options, X-Content-Type-Options
- Token Management: Refresh token rotation and revocation lists
- Account Security: Account lockout, IP/device throttling, anomaly detection
- Data Handling: Data masking in non-production; structured redaction in logs
- API Versioning: Explicit versioning with deprecation policy and sunset headers
6.4 Architecture Requirements¶
- Modularity: High cohesion within modules, low coupling between modules
- Maintainability: Clear separation of concerns within each domain module
- Testability: Each module can be unit tested independently
- Developer Productivity: Reduced cognitive load when working on specific features
6.5 High Availability Requirements (99.99% Uptime)¶
Target SLA: 99.99% availability for critical services (52.56 minutes downtime per year)
6.5.1 Availability Budget¶
| SLA Level | Annual Downtime | Monthly Downtime | Weekly Downtime | Daily Downtime |
|---|---|---|---|---|
| 99.9% | 8h 45m | 43m 49s | 10m 4s | 1m 26s |
| 99.99% (Target) | 52m 35s | 4m 23s | 1m 0s | 8.6s |
| 99.999% (Aspirational) | 5m 15s | 26s | 6s | 0.9s |
Downtime Budget Allocation: - Planned Maintenance: 30% (~1.3 minutes/month, scheduled during low-traffic windows) - Unplanned Outages: 40% (allocated for unexpected failures) - Deployment Downtime: 20% (zero-downtime deployments keep actual usage near zero) - Buffer: 10% (reserved for unforeseen incidents)
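Applying those percentages to the 99.99% monthly budget from the SLA table gives the per-category allowances (an average month is 365.25 × 24 × 60 / 12 ≈ 43,830 minutes, which is how the table arrives at 4m 23s):

```python
# Deriving the downtime-budget allocation above from the 99.99% target.
MINUTES_PER_MONTH = 365.25 * 24 * 60 / 12          # ≈ 43,830 minutes
monthly_budget_min = MINUTES_PER_MONTH * (1 - 0.9999)   # ≈ 4.38 minutes (4m 23s)

allocation = {
    "planned_maintenance": 0.30 * monthly_budget_min,   # ≈ 1.31 min
    "unplanned_outages":   0.40 * monthly_budget_min,   # ≈ 1.75 min
    "deployment_downtime": 0.20 * monthly_budget_min,   # ≈ 0.88 min
    "buffer":              0.10 * monthly_budget_min,   # ≈ 0.44 min
}
```

Note how small the planned-maintenance share is: at 99.99%, monthly maintenance must be effectively zero-downtime (rolling or blue-green) rather than a window measured in minutes.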
6.5.2 High Availability Architecture¶
1. Redundancy at Every Layer
| Component | Redundancy Strategy | Failover Time |
|---|---|---|
| Load Balancer | Multi-AZ AWS ALB (3 AZs) | Instant (automatic) |
| FastAPI Backend | Min 5 instances across 3 AZs | <10s (health check + drain) |
| Next.js Frontend | Min 3 instances across 3 AZs | <10s (health check + drain) |
| PostgreSQL (Citus) | Primary + 2 sync replicas per shard | <30s (Patroni auto-failover) |
| Redis | 3-node Sentinel cluster | <5s (Sentinel failover) |
| Object Storage | S3 multi-AZ (11 9's durability) | Instant (AWS-managed) |
2. Multi-AZ Deployment Architecture
Region: us-east-1
├── AZ-1 (us-east-1a)
│ ├── FastAPI: 2 instances
│ ├── Next.js: 1 instance
│ ├── Citus Worker-1 (primary)
│ └── Redis Node-1 (master)
│
├── AZ-2 (us-east-1b)
│ ├── FastAPI: 2 instances
│ ├── Next.js: 1 instance
│ ├── Citus Worker-1 (sync standby)
│ └── Redis Node-2 (replica)
│
└── AZ-3 (us-east-1c)
├── FastAPI: 1 instance
├── Next.js: 1 instance
├── Citus Worker-1 (async standby)
└── Redis Node-3 (replica + Sentinel quorum)
3. Zero-Downtime Deployment Strategy
Blue-Green Deployment for Backend:
# Deploy new version (green) alongside existing (blue)
kubectl apply -f deployment-green.yaml
# Wait for green to be healthy
kubectl wait --for=condition=ready pod -l version=green --timeout=300s
# Gradual traffic shift: 10% → 50% → 100%
# NOTE: a plain Service cannot weight traffic by selector; the partial
# steps require a service mesh (e.g., Istio VirtualService) or weighted
# ALB target groups. The cutover below repoints the Service to green.
# Monitor error rates for 5 minutes at each step
sleep 300
# If error rate <0.1%, cut the Service over to green entirely
kubectl patch service fastapi-backend \
  -p '{"spec":{"selector":{"version":"green"}}}'
# Decommission blue after 24 hours
kubectl delete deployment fastapi-backend-blue
Rolling Updates for Frontend:
apiVersion: apps/v1
kind: Deployment
metadata:
name: nextjs-frontend
spec:
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1 # Add 1 extra instance during update
maxUnavailable: 0 # Never reduce below min capacity
minReadySeconds: 30 # Wait 30s before marking ready
6.5.3 Disaster Recovery Targets¶
Recovery Time Objective (RTO): - Critical Services (API, Dashboard): <15 minutes - Database (Citus): <30 minutes (see Section 7.7.5) - Full Regional Failover: <15 minutes (see Section 7.8)
Recovery Point Objective (RPO): - Database: <5 minutes (synchronous replication + WAL archiving) - Redis Cache: <30 seconds (asynchronous replication, acceptable loss) - Object Storage (S3): <15 minutes (cross-region replication)
Backup Strategy: - Continuous WAL Archiving: Real-time to S3 (PITR capability) - Full Database Backups: Daily at 2 AM UTC - Incremental Backups: Hourly - Backup Retention: 30 days (hot), 1 year (cold/Glacier) - Backup Validation: Automated daily restore tests
6.5.4 Chaos Engineering and Resilience Testing¶
Chaos Experiments (Monthly):
| Experiment | Frequency | Expected Behavior |
|---|---|---|
| Kill random backend pod | Weekly | Auto-heal within 10s, no user impact |
| Terminate AZ | Monthly | Traffic shifts to remaining AZs, <1s latency spike |
| Database failover | Monthly | Patroni promotes standby in <30s |
| Redis master failure | Monthly | Sentinel promotes replica in <5s |
| Network partition | Quarterly | Split-brain prevention, read-only mode |
| Overload test (10x traffic) | Quarterly | Auto-scaling handles load, <5% error rate |
Chaos Engineering Tools: - Chaos Mesh: Kubernetes-native chaos experiments - Gremlin: Controlled chaos testing - Litmus: CRD-based chaos workflows
Example Chaos Experiment:
# Kill random FastAPI pod every 5 minutes
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
name: kill-fastapi-pod
spec:
action: pod-kill
mode: one # Kill one random pod
selector:
labelSelectors:
"app": "fastapi-backend"
scheduler:
cron: "*/5 * * * *" # Every 5 minutes
6.5.5 Monitoring and Alerting for Availability¶
Availability Metrics:
| Metric | Target | Alert Threshold | Action |
|---|---|---|---|
| Uptime (monthly) | 99.99% | <99.95% | Page on-call SRE |
| Error Rate (5xx) | <0.1% | >0.5% for 5 min | Auto-rollback deployment |
| Latency p95 | <200ms | >500ms for 5 min | Scale up instances |
| Health Check Failures | 0 | >2 consecutive failures | Remove from load balancer |
| Database Replication Lag | <5s | >30s | Alert DBA, prepare failover |
| Backup Success Rate | 100% | <100% | Immediate investigation |
Monitoring Stack: - Uptime Monitoring: UptimeRobot (external), Prometheus (internal) - APM: Datadog or New Relic (distributed tracing) - Logs: Loki + Grafana (centralized logging) - Metrics: Prometheus + Grafana (time-series metrics) - Alerting: PagerDuty (incident management) - Status Page: StatusPage.io (public status updates)
SLO Dashboard:
# Grafana SLO Dashboard
Panels:
- Availability (current month): 99.99% ✓
- Error Budget Remaining: 3.2 minutes (73% remaining)
- Mean Time To Recovery (MTTR): 8 minutes (target: <15 min)
- Mean Time Between Failures (MTBF): 720 hours (target: >168 hours)
- Incident Count (current month): 2 (target: <5)
6.5.6 Incident Response for Availability Violations¶
Severity Levels:
| Severity | Definition | Response Time | Notification |
|---|---|---|---|
| SEV-1 (Critical) | Complete service outage, >50% users affected | <5 min | Page on-call + manager |
| SEV-2 (High) | Degraded performance, >10% users affected | <15 min | Page on-call |
| SEV-3 (Medium) | Limited impact, <10% users affected | <1 hour | Email on-call |
| SEV-4 (Low) | Minimal impact, single component | <4 hours | Ticket queue |
Incident Response Runbook (SEV-1):
1. Detection (T+0)
- Automated alert fires (PagerDuty)
- On-call engineer acknowledges within 5 minutes
2. Triage (T+5 min)
- Assess impact (affected users, services)
- Determine root cause (logs, metrics, traces)
- Escalate to manager if needed
3. Mitigation (T+10 min)
- Apply immediate fix (rollback, failover, scaling)
- Update status page (investigating → identified → monitoring)
4. Recovery (T+15 min)
- Verify service restoration
- Monitor for 30 minutes post-recovery
- Mark incident as resolved
5. Post-Mortem (T+48 hours)
- Root cause analysis
- Timeline of events
- Action items to prevent recurrence
- Update runbooks and alerts
6.5.7 Planned Maintenance Windows¶
Maintenance Schedule: - Frequency: Monthly (3rd Sunday, 2 AM - 4 AM UTC) - Duration: Max 2 hours - Notification: 7 days advance notice - Impact: <5 minutes downtime (rolling updates preferred)
Zero-Downtime Maintenance (Preferred): - Database schema migrations: Online DDL (no locks) - Application deployments: Blue-green or canary - Infrastructure updates: Rolling updates across AZs
6.5.8 Dependencies and External Service SLAs¶
Third-Party SLA Requirements:
| Service | Provider | SLA Required | Actual SLA | Mitigation |
|---|---|---|---|---|
| Jelastic API | Virtuozzo Jelastic | 99.9% | 99.95% | Circuit breaker, cached data |
| Cloud Infrastructure | AWS | 99.99% | 99.99% (multi-AZ) | Multi-region DR |
| Email Delivery | SendGrid | 99.9% | 99.95% | Queue retries, fallback SMTP |
| Payment Processing | Stripe | 99.99% | 99.99% | Retry logic, webhook replay |
| CDN | CloudFlare | 100% | 100% | Fallback to CloudFront |
Dependency Failure Handling: - Jelastic API: Circuit breaker opens after 3 failures, serve cached data for 15 minutes - SendGrid: Queue emails in Redis, retry 3 times with exponential backoff - Stripe: Webhook replay for missed events, idempotent payment handling
6.5.9 Cost of Achieving 99.99% Availability¶
Additional Costs vs. 99.9%:
| Component | 99.9% Cost | 99.99% Cost | Delta | Justification |
|---|---|---|---|---|
| Multi-AZ Deployment | $5,000 | $8,000 | +$3,000 | 3 AZs instead of 2 |
| Database Replicas | $8,000 | $11,000 | +$3,000 | Synchronous replicas |
| Multi-Region DR | $0 | $18,580 | +$18,580 | Secondary region (see 7.8.10) |
| Monitoring Tools | $1,000 | $3,000 | +$2,000 | APM, advanced alerting |
| On-Call Staffing | $10,000 | $20,000 | +$10,000 | 24/7 coverage |
| TOTAL | $24,000 | $60,580 | +$36,580 | 2.5x cost for 10x less downtime |
Cost per User (1M users): - 99.9%: $0.024/month - 99.99%: $0.061/month (+$0.037/month or +154%)
ROI Analysis: - Revenue Impact of Downtime: Estimated $50,000/hour for 1M users - Annual Downtime Reduction: 8.75 hours → 0.88 hours (7.87 hours saved) - Annual Revenue Protected: ~$393,500 - Additional Annual Cost: $439,000 - Break-Even Point: ~1.1M users with $50K/hour revenue impact
6.6 Accessibility Requirements¶
- WCAG 2.1 AA: Compliance with accessibility standards
- Keyboard Navigation: Full functionality via keyboard
- Screen Reader Support: Compatibility with major screen readers
- Color Contrast: Minimum 4.5:1 contrast ratio
7. Architectural Overview¶
7.1 High-Level Architecture¶
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ Next.js │────│ FastAPI │────│ PostgreSQL │
│ Frontend │ │ Backend │ │ Database │
│ (React 18) │ │ │ │ │
└─────────────────┘ └──────────────────┘ └─────────────────┘
│
┌──────────────────┐
│ Redis │
│ (Cache/Queue) │
└──────────────────┘
│
┌─────────────────────────────────────────────────────┐
│ Infrastructure │
│ ┌─────────────┐ ┌─────────────┐ ┌───────────┐ │
│ │ Grafana │ │ Prometheus │ │ Loki │ │
│ │ (Dashboards)│ │ (Metrics) │ │ (Logging) │ │
│ └─────────────┘ └─────────────┘ └───────────┘ │
└─────────────────────────────────────────────────────┘
7.2 Hybrid Modular Domain-Driven Design (DDD) Architecture¶
The MBPanel backend implements a Hybrid Modular Domain-Driven Design (DDD) architecture that follows a vertical slice approach. This architecture combines Domain-Driven Design principles with modular architecture, where all components related to a specific business feature are co-located in a single module directory, while maintaining clear separation of concerns through layered architecture within each module.
7.2.1 Modular Organization (app/[MODULE_NAME]/)¶
The architecture is organized into independent domain modules where each module contains all layers needed for that specific business feature:
mbpanel/
├── frontend/ # Next.js frontend application (decoupled)
│ ├── app/ # Next.js App Router (recommended)
│ │ ├── (auth)/ # Auth-related pages (route group)
│ │ │ ├── login/
│ │ │ └── register/
│ │ ├── dashboard/ # Main dashboard pages
│ │ │ ├── page.tsx
│ │ │ └── layout.tsx
│ │ ├── layout.tsx # Root layout
│ │ ├── page.tsx # Home page
│ │ └── globals.css # Global styles
│ ├── features/ # ⭐ DOMAIN-ALIGNED FEATURE MODULES
│ │ ├── auth/ # Authentication feature
│ │ │ ├── components/ # Auth-specific components
│ │ │ ├── hooks/ # useAuth, useLogin, useMFA hooks
│ │ │ ├── services/ # Auth API client
│ │ │ ├── types/ # Auth TypeScript types
│ │ │ └── utils/ # Auth utilities
│ │ ├── users/ # User management feature
│ │ │ ├── components/
│ │ │ ├── hooks/
│ │ │ ├── services/
│ │ │ ├── types/
│ │ │ └── utils/
│ │ ├── teams/ # Team management feature
│ │ │ └── ... (same structure)
│ │ ├── sites/ # Site management feature
│ │ │ └── ... (same structure)
│ │ ├── environments/ # Environment management feature
│ │ │ └── ... (same structure)
│ │ ├── staging/ # Staging feature
│ │ │ └── ... (same structure)
│ │ ├── backups/ # Backup/restore feature
│ │ │ └── ... (same structure)
│ │ ├── cache/ # Cache management feature
│ │ │ └── ... (same structure)
│ │ ├── domains/ # Domain/SSL management feature
│ │ │ └── ... (same structure)
│ │ ├── sftp/ # SFTP management feature
│ │ │ └── ... (same structure)
│ │ ├── wordpress/ # ⭐ WordPress feature (ISOLATED)
│ │ │ ├── components/ # WP-specific UI components
│ │ │ ├── hooks/ # useWordPress, useWpCli hooks
│ │ │ ├── services/ # WordPress API client
│ │ │ ├── types/ # WP TypeScript types
│ │ │ └── utils/ # WP-specific utilities
│ │ ├── payments/ # Payment processing feature
│ │ │ └── ... (same structure)
│ │ ├── billing/ # Billing/subscriptions feature
│ │ │ └── ... (same structure)
│ │ ├── uptime/ # Uptime monitoring feature
│ │ │ └── ... (same structure)
│ │ ├── nodes/ # Node management feature
│ │ │ └── ... (same structure)
│ │ ├── activity/ # Activity logs feature
│ │ │ └── ... (same structure)
│ │ ├── favourites/ # Favourites feature
│ │ │ └── ... (same structure)
│ │ ├── config/ # Configuration feature
│ │ │ └── ... (same structure)
│ │ ├── profile/ # User profile feature
│ │ │ └── ... (same structure)
│ │ ├── sessions/ # Session management feature
│ │ │ └── ... (same structure)
│ │ └── webhook/ # Webhook feature
│ │ └── ... (same structure)
│ ├── shared/ # Shared/common frontend code
│ │ ├── components/ # Reusable UI components (Button, Modal, etc.)
│ │ │ ├── ui/ # Base UI components
│ │ │ │ ├── button.tsx
│ │ │ │ ├── modal.tsx
│ │ │ │ └── input.tsx
│ │ │ └── layout/ # Layout components
│ │ │ ├── header.tsx
│ │ │ ├── sidebar.tsx
│ │ │ └── footer.tsx
│ │ ├── hooks/ # Common hooks (useDebounce, useLocalStorage, etc.)
│ │ ├── lib/ # API client, axios setup, utilities
│ │ │ ├── api-client.ts # Base API client
│ │ │ ├── axios-config.ts # Axios interceptors
│ │ │ └── openapi-client.ts # Auto-generated from FastAPI
│ │ ├── types/ # Common TypeScript types
│ │ │ ├── api.ts # API response types
│ │ │ └── common.ts # Common types
│ │ ├── utils/ # Common utilities
│ │ │ ├── validation.ts # Form validation helpers
│ │ │ ├── formatting.ts # Date/number formatting
│ │ │ └── constants.ts # Global constants
│ │ └── store/ # Global state management (Zustand)
│ │ ├── auth-store.ts
│ │ └── theme-store.ts
│ ├── public/ # Static assets
│ │ ├── images/
│ │ ├── icons/
│ │ └── fonts/
│ ├── styles/ # Global styles
│ │ └── tailwind.css # Tailwind CSS
│ ├── tests/ # Frontend tests
│ │ ├── unit/ # Unit tests
│ │ ├── integration/ # Integration tests
│ │ └── e2e/ # End-to-end tests
│ ├── package.json # Frontend dependencies
│ ├── next.config.js # Next.js configuration
│ ├── tsconfig.json # TypeScript configuration
│ ├── tailwind.config.js # Tailwind CSS configuration
│ ├── .env.local # Local environment variables
│ └── Dockerfile # Production Dockerfile for frontend
│
├── backend/ # FastAPI backend application (decoupled)
│ ├── app/ # Main application package
│ │ ├── __init__.py # Package initialization
│ │ ├── main.py # Main FastAPI app (assembler)
│ │ ├── core/ # ⭐ SHARED: Enhanced core infrastructure
│ │ │ ├── __init__.py
│ │ │ ├── config.py # Pydantic BaseSettings for environment config
│ │ │ ├── security.py # Core crypto functions (hash, verify)
│ │ │ ├── exceptions.py # Custom exception hierarchy
│ │ │ ├── logging.py # Structured logging setup
│ │ │ ├── middleware.py # Common middleware (correlation ID, etc.)
│ │ │ ├── constants.py # Global constants and enums
│ │ │ ├── utils.py # Common utility functions
│ │ │ ├── http_client.py # ⭐ HTTP Client Infrastructure (connection pooling, circuit breaker)
│ │ │ ├── cache.py # ⭐ Caching utilities (Redis integration)
│ │ │ ├── rate_limit.py # ⭐ Rate limiting utilities
│ │ │ ├── adapters/ # ⭐ SHARED EXTERNAL API ADAPTERS (used by multiple modules)
│ │ │ │ ├── __init__.py
│ │ │ │ ├── virtuozzo_adapter.py # Used by: environments, wordpress, backups, sftp, staging, nodes, sessions
│ │ │ │ ├── bunny_cdn_adapter.py # Used by: cache, domains
│ │ │ │ ├── cloudflare_adapter.py # Used by: domains
│ │ │ │ ├── stripe_adapter.py # Future: Used by: payments
│ │ │ │ ├── paypal_adapter.py # Future: Used by: billing
│ │ │ │ ├── postmark_adapter.py # Future: Used by: notifications
│ │ │ │ └── uptime_adapter.py # Future: Used by: uptime
│ │ │ └── shared/ # ⭐ SHARED KERNEL (cross-cutting domain logic)
│ │ │ ├── __init__.py
│ │ │ ├── rbac.py # Role-based access control helpers
│ │ │ ├── tenant.py # Multi-tenant resolution logic (team_id)
│ │ │ ├── audit.py # Audit logging helpers
│ │ │ └── events.py # Domain event definitions (optional)
│ │ ├── database/ # ⭐ SHARED: Enhanced database layer
│ │ │ ├── __init__.py
│ │ │ ├── database.py # Engine, SessionLocal, Base
│ │ │ ├── deps.py # get_db, get_current_user dependencies
│ │ │ ├── mixins.py # Common model mixins (TimestampMixin, etc.)
│ │ │ └── types.py # Custom SQLAlchemy types (GUID, JSON, etc.)
│ │ ├── users/ # --- DOMAIN MODULE: "Users" ---
│ │ │ ├── __init__.py
│ │ │ ├── router.py # INTERFACE: APIRouter for /users/
│ │ │ ├── service.py # APPLICATION: UserService (use cases)
│ │ │ ├── repository.py # INFRASTRUCTURE: UserRepository (data access)
│ │ │ ├── model.py # DOMAIN: User SQLAlchemy model
│ │ │ └── schema.py # INTERFACE: UserCreate, UserRead Pydantic schemas
│ │ ├── auth/ # --- DOMAIN MODULE: "Auth" ---
│ │ │ ├── __init__.py
│ │ │ ├── router.py # INTERFACE: APIRouter for /auth/
│ │ │ ├── service.py # APPLICATION: AuthService
│ │ │ ├── security.py # INFRASTRUCTURE: OAuth2/JWT logic
│ │ │ └── schema.py # INTERFACE: Token, LoginRequest schemas
│ │ ├── teams/ # --- DOMAIN MODULE: "Teams" ---
│ │ │ ├── __init__.py
│ │ │ ├── router.py
│ │ │ ├── service.py
│ │ │ ├── repository.py
│ │ │ ├── model.py
│ │ │ └── schema.py
│ │ ├── sites/ # --- DOMAIN MODULE: "Sites" ---
│ │ │ ├── __init__.py
│ │ │ ├── router.py
│ │ │ ├── service.py
│ │ │ ├── repository.py
│ │ │ ├── model.py
│ │ │ └── schema.py
│ │ ├── environments/ # --- DOMAIN MODULE: "Environments" ---
│ │ │ ├── __init__.py
│ │ │ ├── router.py
│ │ │ ├── service.py
│ │ │ ├── repository.py
│ │ │ ├── model.py
│ │ │ ├── schema.py
│ │ │ └── tasks.py # ARQ background tasks
│ │ ├── staging/ # --- DOMAIN MODULE: "Staging" ---
│ │ │ ├── __init__.py
│ │ │ ├── router.py
│ │ │ ├── service.py
│ │ │ ├── repository.py
│ │ │ ├── model.py
│ │ │ ├── schema.py
│ │ │ └── tasks.py # ARQ background tasks
│ │ ├── backups/ # --- DOMAIN MODULE: "Backups" ---
│ │ │ ├── __init__.py
│ │ │ ├── router.py
│ │ │ ├── service.py
│ │ │ ├── repository.py
│ │ │ ├── model.py
│ │ │ ├── schema.py
│ │ │ └── tasks.py # ARQ background tasks
│ │ ├── cache/ # --- DOMAIN MODULE: "Cache" ---
│ │ │ ├── __init__.py
│ │ │ ├── router.py
│ │ │ ├── service.py
│ │ │ ├── repository.py
│ │ │ ├── model.py
│ │ │ └── schema.py
│ │ ├── domains/ # --- DOMAIN MODULE: "Domains" ---
│ │ │ ├── __init__.py
│ │ │ ├── router.py
│ │ │ ├── service.py
│ │ │ ├── repository.py
│ │ │ ├── model.py
│ │ │ ├── schema.py
│ │ │ └── tasks.py # ARQ background tasks (SSL)
│ │ ├── sftp/ # --- DOMAIN MODULE: "SFTP" ---
│ │ │ ├── __init__.py
│ │ │ ├── router.py
│ │ │ ├── service.py
│ │ │ ├── repository.py
│ │ │ ├── model.py
│ │ │ ├── schema.py
│ │ │ └── tasks.py # ARQ background tasks
│ │ ├── wordpress/ # --- DOMAIN MODULE: "WordPress" (⭐ ISOLATED) ---
│ │ │ ├── __init__.py
│ │ │ ├── router.py # INTERFACE: WP API endpoints
│ │ │ ├── service.py # APPLICATION: WP business logic
│ │ │ ├── repository.py # INFRASTRUCTURE: WP data access
│ │ │ ├── model.py # DOMAIN: WP models (if needed)
│ │ │ ├── schema.py # INTERFACE: WP request/response schemas
│ │ │ └── tasks.py # INFRASTRUCTURE: WP background jobs (WP-CLI)
│ │ ├── payments/ # --- DOMAIN MODULE: "Payments" ---
│ │ │ ├── __init__.py
│ │ │ ├── router.py
│ │ │ ├── service.py
│ │ │ ├── repository.py
│ │ │ ├── model.py
│ │ │ ├── schema.py
│ │ │ └── tasks.py # ARQ background tasks
│ │ ├── billing/ # --- DOMAIN MODULE: "Billing" ---
│ │ │ ├── __init__.py
│ │ │ ├── router.py
│ │ │ ├── service.py
│ │ │ ├── repository.py
│ │ │ ├── model.py
│ │ │ └── schema.py
│ │ ├── uptime/ # --- DOMAIN MODULE: "Uptime" ---
│ │ │ ├── __init__.py
│ │ │ ├── router.py
│ │ │ ├── service.py
│ │ │ ├── repository.py
│ │ │ ├── model.py
│ │ │ ├── schema.py
│ │ │ └── tasks.py # ARQ background tasks
│ │ ├── nodes/ # --- DOMAIN MODULE: "Nodes" ---
│ │ │ ├── __init__.py
│ │ │ ├── router.py
│ │ │ ├── service.py
│ │ │ ├── repository.py
│ │ │ ├── model.py
│ │ │ └── schema.py
│ │ ├── activity/ # --- DOMAIN MODULE: "Activity" ---
│ │ │ ├── __init__.py
│ │ │ ├── router.py
│ │ │ ├── service.py
│ │ │ ├── repository.py
│ │ │ ├── model.py
│ │ │ └── schema.py
│ │ ├── favourites/ # --- DOMAIN MODULE: "Favourites" ---
│ │ │ ├── __init__.py
│ │ │ ├── router.py
│ │ │ ├── service.py
│ │ │ ├── repository.py
│ │ │ ├── model.py
│ │ │ └── schema.py
│ │ ├── config/ # --- DOMAIN MODULE: "Config" ---
│ │ │ ├── __init__.py
│ │ │ ├── router.py
│ │ │ ├── service.py
│ │ │ ├── repository.py
│ │ │ ├── model.py
│ │ │ └── schema.py
│ │ ├── profile/ # --- DOMAIN MODULE: "Profile" ---
│ │ │ ├── __init__.py
│ │ │ ├── router.py
│ │ │ ├── service.py
│ │ │ ├── repository.py
│ │ │ ├── model.py
│ │ │ └── schema.py
│ │ ├── sessions/ # --- DOMAIN MODULE: "Sessions" (instant/virt login) ---
│ │ │ ├── __init__.py
│ │ │ ├── router.py
│ │ │ ├── service.py
│ │ │ ├── repository.py
│ │ │ ├── model.py
│ │ │ └── schema.py
│ │ ├── webhook/ # --- DOMAIN MODULE: "Webhook" ---
│ │ │ ├── __init__.py
│ │ │ ├── router.py
│ │ │ ├── service.py
│ │ │ ├── repository.py
│ │ │ ├── model.py
│ │ │ └── schema.py
│ │ ├── websocket/ # --- DOMAIN MODULE: "WebSocket" ---
│ │ │ ├── __init__.py # Module exports
│ │ │ ├── router.py # INTERFACE: WebSocket endpoints and HTTP routes
│ │ │ ├── service.py # APPLICATION: Business logic for message handling
│ │ │ ├── connection.py # INFRASTRUCTURE: ConnectionManager for managing connections
│ │ │ ├── channel.py # INFRASTRUCTURE: Channel subscription management
│ │ │ ├── presence.py # INFRASTRUCTURE: Presence tracking (online/offline)
│ │ │ ├── publisher.py # INFRASTRUCTURE: Message publishing to Redis
│ │ │ ├── repository.py # INFRASTRUCTURE: Database operations (audit logs)
│ │ │ ├── model.py # DOMAIN: WebSocket-related models (if needed)
│ │ │ └── schema.py # INTERFACE: Pydantic schemas for messages
│ │ ├── jobs/ # --- DOMAIN MODULE: "Background Jobs" ---
│ │ │ ├── __init__.py
│ │ │ ├── router.py # INTERFACE: (Optional) API to trigger/check jobs
│ │ │ ├── service.py # APPLICATION: Service to trigger tasks
│ │ │ ├── worker.py # INFRASTRUCTURE: ARQ worker setup
│ │ │ └── schema.py # INTERFACE: Job status schemas
│ │ └── tests/ # ⭐ COMPREHENSIVE TESTING
│ │ ├── __init__.py
│ │ ├── conftest.py # Shared fixtures (db, client, auth tokens)
│ │ ├── test_main.py # Main app tests
│ │ ├── integration/ # Cross-module integration tests
│ │ │ ├── __init__.py
│ │ │ ├── test_site_creation_flow.py
│ │ │ ├── test_backup_restore_flow.py
│ │ │ └── test_user_auth_flow.py
│ │ ├── users/ # Per-module tests
│ │ │ ├── __init__.py
│ │ │ ├── test_router.py
│ │ │ ├── test_service.py
│ │ │ └── test_repository.py
│ │ ├── auth/
│ │ │ └── ... (same structure)
│ │ ├── teams/
│ │ │ └── ... (same structure)
│ │ ├── sites/
│ │ │ └── ... (same structure)
│ │ ├── wordpress/ # WordPress module tests
│ │ │ ├── __init__.py
│ │ │ ├── test_router.py
│ │ │ ├── test_service.py
│ │ │ └── test_repository.py
│ │ └── ... (one folder per module)
│ ├── alembic/ # Database migration scripts
│ │ ├── versions/ # Migration versions
│ │ └── env.py # Alembic environment config
│ ├── requirements/ # Python requirements by environment
│ │ ├── base.txt # Common dependencies
│ │ ├── dev.txt # Development dependencies
│ │ └── prod.txt # Production dependencies
│ ├── scripts/ # ⭐ OPERATIONAL SCRIPTS
│ │ ├── migrate_data.py # MySQL → PostgreSQL migration
│ │ ├── seed_dev_data.py # Development data seeding
│ │ ├── health_check.py # Health check utilities
│ │ ├── generate_module.py # Generate new module scaffold
│ │ └── validate_architecture.py # Validate module isolation rules
│ ├── Dockerfile # Production Dockerfile for backend
│ ├── docker-compose.yml # Docker Compose for local development (✅ Implemented)
│ ├── pyproject.toml # Python project configuration
│ ├── pytest.ini # Pytest configuration
│ ├── .env.example # Example environment variables (✅ Implemented via scripts)
│ └── alembic.ini # Alembic configuration
│
├── local-infra/ # Local development infrastructure (✅ Implemented)
│ ├── docker-compose.dev.yml # Development Docker Compose (✅ Implemented)
│ ├── docker-compose.test.yml # Testing environment Docker Compose (✅ Implemented)
│ ├── docker-compose.observability.yml # Monitoring stack (✅ Implemented)
│ ├── nginx/ # Nginx configuration for local dev (✅ Implemented)
│ │ ├── nginx.conf # Main nginx configuration
│ │ └── sites-available/ # Site configurations
│ ├── scripts/ # Development and deployment scripts (✅ Implemented)
│ │ ├── setup-dev.sh # Development environment setup
│ │ ├── migrate-db.sh # Database migration script
│ │ ├── health-check.sh # Health check utilities
│ │ └── create-env-files.sh # Environment file generator
│ ├── prometheus/ # Prometheus configuration (✅ Implemented)
│ ├── loki/ # Loki configuration (✅ Implemented)
│ ├── promtail/ # Promtail configuration (✅ Implemented)
│ └── certs/ # SSL certificates for local development (✅ Implemented)
│
├── docs/ # Documentation
│ ├── development/ # Development documentation
│ │ ├── MAINPRD.md # This document
│ │ ├── MODULE_TEMPLATE.md # Template for new modules
│ │ ├── index.md # Developer onboarding guide (✅ Implemented)
│ │ └── LOCAL_SETUP_SUMMARY.md # Local setup quick reference (✅ Implemented)
│ ├── architecture/ # Architecture decision records (ADRs)
│ │ ├── 001-hybrid-modular-ddd.md
│ │ ├── 002-database-strategy.md
│ │ └── 003-caching-strategy.md
│ └── deployment/ # Deployment guides
│ ├── DEPLOYMENT_GUIDE.md
│ └── KUBERNETES.md
│
├── tests/ # ⭐ END-TO-END TESTS (cross-component)
│ ├── e2e/ # End-to-end tests
│ │ ├── test_site_creation_flow.py
│ │ ├── test_backup_restore_flow.py
│ │ └── test_user_registration_flow.py
│ └── performance/ # Performance/load tests
│ ├── load_tests.py
│ └── stress_tests.py
│
├── .github/ # GitHub Actions workflows
│ ├── workflows/ # CI/CD pipeline definitions
│ │ ├── backend-ci.yml # Backend CI pipeline
│ │ ├── frontend-ci.yml # Frontend CI pipeline
│ │ ├── integration-ci.yml # Integration tests pipeline
│ │ └── deploy-production.yml # Production deployment
│ └── ISSUE_TEMPLATE/ # Issue templates
│
├── deployment/ # Production deployment configurations (✅ Implemented)
│ └── docker-compose.prod.yml # Production Docker Compose (optional, for full-stack testing)
├── .gitignore # Git ignore rules (✅ Implemented)
├── README.md # Project overview (✅ Implemented)
├── Makefile # Common development commands (✅ Implemented)
└── .editorconfig # Editor configuration (✅ Implemented)
7.2.2 Layered Architecture Within Each Module¶
Each domain module implements a clean layered architecture:
- Interface Layer (`[MODULE_NAME]/router.py`):
  - Defines API endpoints for the module
  - Handles HTTP concerns: request validation, response serialization, error handling
  - Uses dependencies like `Depends(get_db)` and `Depends(get_current_user)`
  - Validates request data using Pydantic schemas from `schema.py`
  - Catches domain exceptions and converts them to `HTTPException`
  - Rule: Must NOT contain business logic or SQLAlchemy query code
- Application Layer (`[MODULE_NAME]/service.py`):
  - Orchestrates business logic and use cases for the module
  - Implements use cases like `create_user`, `update_order`, etc.
  - Coordinates between repository, model, and schema components
  - Contains business validation and workflow orchestration
  - Rule: Must NOT know about FastAPI or HTTP concerns; imports from `repository.py`, `model.py`, and `schema.py` within the same module
- Domain Layer (`[MODULE_NAME]/model.py`):
  - Defines database structure and relationships
  - SQLAlchemy model classes inheriting from `Base`
  - Column definitions, relationships, and constraints
  - Rule: Should contain only data model definitions
- Infrastructure Layer (`[MODULE_NAME]/repository.py`):
  - Contains all database communication logic for the module: all SQLAlchemy queries and operations
  - Functions take `db: Session` as an argument
  - Handles query execution, commits, and refreshes
  - Rule: Must NOT contain business logic (only data access logic)
- DTO Layer (`[MODULE_NAME]/schema.py`):
  - Defines Data Transfer Objects for API communication
  - Pydantic `BaseModel` classes for request/response
  - `Create` schemas for POST requests, `Read` for GET responses, `Update` for PUT/PATCH requests
  - Rule: Used by `router.py` for validation and serialization
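As a concrete illustration of the Create/Read/Update schema convention, a hypothetical `app/sites/schema.py` might look like the following sketch (field names are illustrative; `model_config` assumes Pydantic v2):

```python
from datetime import datetime
from typing import Optional
from pydantic import BaseModel

class SiteCreate(BaseModel):
    """Request body for POST /sites"""
    name: str
    domain: str

class SiteUpdate(BaseModel):
    """Request body for PUT/PATCH /sites/{id}; all fields optional"""
    name: Optional[str] = None
    domain: Optional[str] = None

class SiteRead(BaseModel):
    """Response body for GET /sites/{id}"""
    id: int
    name: str
    domain: str
    created_at: datetime

    # Allow building the schema directly from SQLAlchemy model instances
    model_config = {"from_attributes": True}
```

The `Read` schema is what the router returns after the repository hands back a SQLAlchemy object, so `from_attributes` does the translation without manual field copying.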
7.2.3 Module Isolation Principle¶
- Core Rule: A module MUST NOT import from another module
  - Example: `app/users/service.py` MUST NOT import `app/auth/service.py`
- Exception: All modules can import from shared directories (`app/core/`, `app/database/`)
- Shared Kernel Pattern: Cross-cutting domain policies (e.g., RBAC checks, tenant resolution, audit helpers) live in `app/core/shared/` and are the ONLY allowed bridge for shared business rules across modules. Modules may depend on shared kernel interfaces, not on other modules' concrete services.
7.2.3.1 External API Adapters - Shared Infrastructure Pattern¶
Architectural Decision: External API adapters are SHARED infrastructure, not domain-specific code.
External API adapters (Virtuozzo, Bunny CDN, Cloudflare, etc.) are placed in app/core/adapters/ rather than duplicated in each module for the following architectural and practical reasons:
Why Adapters Are Shared (Global):
- Multiple Module Usage:
  - `virtuozzo_adapter.py` is used by 7+ modules: environments, wordpress, backups, sftp, staging, nodes, sessions
  - Duplicating this code 7 times violates the DRY principle and creates a maintenance nightmare
- Infrastructure, Not Domain Logic:
  - Adapters handle infrastructure concerns: HTTP connections, retries, circuit breakers, rate limiting
  - Domain logic (what to do with the data) stays in domain modules
  - Analogy: just as we don't duplicate database connection pooling per module, we don't duplicate external API clients
- Single Point of Configuration:
  - API endpoint URLs, timeouts, rate limits, connection pool sizes
  - If the Virtuozzo API URL changes, update ONE adapter, not 7 modules
  - If a circuit breaker threshold needs tuning, change it once
- Consistent Resilience Behavior:
  - All modules get the same connection pooling (47% latency improvement)
  - All modules get the same circuit breaker protection
  - All modules get the same retry logic (3 attempts with exponential backoff)
  - No risk of inconsistent behavior across modules
- Easier Testing:
  - Mock the adapter once in `conftest.py`
  - All module tests benefit from the same mock
  - Integration tests run against the real adapter once
- Module Isolation Still Maintained:
  - Modules still don't import from other modules
  - They import from `app/core/adapters/`, which is explicitly allowed (like `app/core/` and `app/database/`)
  - Adapters are infrastructure dependencies, not domain dependencies
Example: Correct Usage in Domain Module:
# backend/app/environments/service.py
import uuid
from sqlalchemy.orm import Session
from app.core.adapters.virtuozzo_adapter import get_virtuozzo_adapter # ✅ ALLOWED: Shared infrastructure
from app.environments import repository, schema # ✅ ALLOWED: Own module
from app.core.shared.audit import log_audit_event # ✅ ALLOWED: Shared kernel
# ❌ FORBIDDEN: from app.wordpress.service import WordPressService
class EnvironmentService:
def __init__(self):
# Use shared adapter (infrastructure)
self.virtuozzo_adapter = get_virtuozzo_adapter()
async def start_environment(
self,
db: Session,
environment_id: int,
user_id: int
) -> schema.EnvironmentRead:
"""Start environment using shared Virtuozzo adapter"""
# Domain logic specific to environments module
environment = repository.get_environment_by_id(db, environment_id)
# Infrastructure call via shared adapter
response = await self.virtuozzo_adapter.start_environment(
env_name=environment.env_name,
session_key=environment.session_key,
correlation_id=str(uuid.uuid4())
)
# Domain logic: update environment status
repository.update_environment(db, environment_id, {"status": "starting"})
db.commit()
return environment
Comparison: What Would Be Wrong (Anti-Pattern):
# ❌ ANTI-PATTERN: Duplicating adapter in each module
# backend/app/environments/adapters/virtuozzo_adapter.py
# backend/app/wordpress/adapters/virtuozzo_adapter.py # 🚫 DUPLICATION!
# backend/app/backups/adapters/virtuozzo_adapter.py # 🚫 DUPLICATION!
# backend/app/sftp/adapters/virtuozzo_adapter.py # 🚫 DUPLICATION!
# ... 7+ copies of the same code
# Problems:
# 1. Update connection pool config → must update 7 files
# 2. Fix circuit breaker bug → must fix in 7 places
# 3. Update API endpoint → change 7 files
# 4. Inconsistent behavior if one file gets out of sync
# 5. 7x more code to test
# 6. 7x more code to maintain
When to Create Per-Module Adapters (Rare Cases):
Only create module-specific adapters when:
1. Module-specific transformation logic: The adapter needs module-specific business logic (then it's not really infrastructure)
2. Different external service per module: Each module calls a completely different external API
3. Module-specific configuration: The adapter needs fundamentally different configuration per module (rare)
Example where per-module might make sense:
- If the `payments/` module used a different Stripe client than the `billing/` module, with completely different configuration
- In practice this is rare; you'd typically have ONE Stripe adapter with different methods
Architectural Layers Distinction:
┌─────────────────────────────────────────────────────────┐
│ DOMAIN MODULES (app/[module]/) │
│ - Business logic specific to the domain │
│ - router.py, service.py, repository.py, model.py │
│ - CAN import from app/core/, app/database/ │
│ - CANNOT import from other domain modules │
└─────────────────────────────────────────────────────────┘
│
│ imports from
▼
┌─────────────────────────────────────────────────────────┐
│ SHARED INFRASTRUCTURE (app/core/) │
│ - Technical infrastructure, not business logic │
│ - http_client.py, adapters/, cache.py, logging.py │
│ - Provides reusable technical capabilities │
└─────────────────────────────────────────────────────────┘
│
│ makes calls to
▼
┌─────────────────────────────────────────────────────────┐
│ EXTERNAL SERVICES │
│ - Virtuozzo API, Bunny CDN, Cloudflare, Stripe, etc. │
└─────────────────────────────────────────────────────────┘
Summary: Shared Adapters Are Infrastructure, Not Domain Code
- Location: `app/core/adapters/` (shared)
- Purpose: Isolate external API integration details (HTTP, retries, circuit breakers)
- Reusability: One adapter, many modules (DRY principle)
- Module Isolation: Still maintained (importing from `app/core/` is explicitly allowed)
- Domain Logic: Stays in domain modules (what to do with the data)
- Infrastructure: Centralized in adapters (how to get the data)
7.2.4 Standard Module Components¶
Each domain module must contain these files:
- router.py: API endpoints and request/response handling
- service.py: Business logic orchestration and use case implementation
- repository.py: Database operations and persistence logic
- model.py: SQLAlchemy models defining database structure
- schema.py: Pydantic schemas for API request/response validation
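For a hypothetical `sites` module, that standard layout would look like this (mirroring the tree conventions used elsewhere in this document):

```
app/sites/
├── router.py      # API endpoints and request/response handling
├── service.py     # Business logic orchestration
├── repository.py  # Database operations
├── model.py       # SQLAlchemy models
└── schema.py      # Pydantic request/response schemas
```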
7.2.5 Shared Components¶
- Core Logic (`app/core/`):
  - Shared, non-domain-specific utilities available to all modules
  - `config.py`: Pydantic `BaseSettings` for environment configuration
  - `security.py`: Core crypto functions (hashing, verification), NOT OAuth2 logic
  - Rule: Must NOT import from any domain module
- Database Layer (`app/database/`):
  - Manages the single database connection for the entire application
  - `database.py`: SQLAlchemy engine, `SessionLocal`, and `Base`
  - `deps.py`: `get_db()` dependency for database sessions
  - Rule: All modules import `get_db` from `app.database.deps`
7.2.6 Application Entry Point (main.py)¶
- The single application assembler and entry point
- Instantiates the main FastAPI application object
- Configures top-level middleware (e.g., CORSMiddleware)
- Imports APIRouter from each domain module
- Includes all module routers using `app.include_router()`
- Rule: MUST NOT define any API endpoints directly
7.2.7 Enhanced Shared Kernel (app/core/shared/)¶
The Shared Kernel is a critical component of the Hybrid Modular DDD architecture. It contains cross-cutting domain logic that is shared across multiple modules while maintaining module isolation principles.
Purpose:
- Provide a controlled mechanism for sharing domain-level concerns across modules
- Prevent code duplication for common business rules
- Maintain a single source of truth for cross-cutting policies
- Serve as the ONLY allowed bridge for shared business logic between modules
Components:
rbac.py - Role-Based Access Control Helpers:
# Example structure:
from enum import Enum
from typing import List
from app.database.deps import get_current_user
class Role(str, Enum):
SUPER_ADMIN = "super_admin"
TEAM_OWNER = "team_owner"
TEAM_ADMIN = "team_admin"
TEAM_MEMBER = "team_member"
USER = "user"
class Permission(str, Enum):
SITE_CREATE = "site:create"
SITE_DELETE = "site:delete"
BACKUP_CREATE = "backup:create"
BACKUP_RESTORE = "backup:restore"
# ... more permissions
def has_permission(user, permission: Permission) -> bool:
"""Check if user has specific permission"""
pass
def require_permission(permission: Permission):
"""Dependency for endpoint protection"""
pass
def is_team_owner(user, team_id: int) -> bool:
"""Check if user owns a specific team"""
pass
def is_team_member(user, team_id: int) -> bool:
"""Check if user is member of a specific team"""
pass
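One way the `pass` stubs above could be filled in is with a static role-to-permission map. The sketch below is framework-free so it stands alone: the mapping and the `PermissionError` are illustrative, and the real `require_permission` would wrap this check in a FastAPI dependency that raises `HTTPException(403)` instead:

```python
from enum import Enum

class Role(str, Enum):
    SUPER_ADMIN = "super_admin"
    TEAM_OWNER = "team_owner"
    TEAM_MEMBER = "team_member"

class Permission(str, Enum):
    SITE_CREATE = "site:create"
    SITE_DELETE = "site:delete"
    BACKUP_CREATE = "backup:create"

# Illustrative role → permission mapping; the real table would be
# driven by the roles and permissions defined for MBPanel
ROLE_PERMISSIONS = {
    Role.SUPER_ADMIN: set(Permission),  # all permissions
    Role.TEAM_OWNER: {Permission.SITE_CREATE, Permission.SITE_DELETE,
                      Permission.BACKUP_CREATE},
    Role.TEAM_MEMBER: {Permission.BACKUP_CREATE},
}

def has_permission(user, permission: Permission) -> bool:
    """Check whether the user's role grants the permission."""
    return permission in ROLE_PERMISSIONS.get(user.role, set())

def check_permission(user, permission: Permission) -> None:
    """Raise if the permission is missing; a FastAPI dependency would
    translate this into a 403 HTTPException."""
    if not has_permission(user, permission):
        raise PermissionError(f"missing permission: {permission.value}")
```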
tenant.py - Multi-Tenant Resolution Logic:
# Example structure:
from typing import Optional
from sqlalchemy.orm import Session
class TenantContext:
"""Context manager for tenant-scoped operations"""
def __init__(self, team_id: int):
self.team_id = team_id
def __enter__(self):
# Set tenant context
pass
def __exit__(self, exc_type, exc_val, exc_tb):
# Clear tenant context
pass
def get_current_team_id(user) -> int:
"""Extract team_id from current user context"""
pass
def filter_by_team(query, team_id: int):
"""Apply team_id filter to SQLAlchemy query"""
return query.filter_by(team_id=team_id)
def validate_team_access(user, team_id: int) -> bool:
"""Validate user has access to specified team"""
pass
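To make the team-scoping behavior of `filter_by_team` concrete, here is a self-contained demonstration against in-memory SQLite; the `Project` model is illustrative, not part of the MBPanel schema:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Project(Base):
    """Illustrative team-scoped model"""
    __tablename__ = "projects"
    id = Column(Integer, primary_key=True)
    team_id = Column(Integer, nullable=False, index=True)
    name = Column(String(255))

def filter_by_team(query, team_id: int):
    """Apply the tenant filter, as in app/core/shared/tenant.py."""
    return query.filter_by(team_id=team_id)

# Seed two teams' rows, then query with the tenant filter applied
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
db = sessionmaker(bind=engine)()
db.add_all([Project(team_id=1, name="alpha"), Project(team_id=2, name="beta")])
db.commit()
team_one = filter_by_team(db.query(Project), team_id=1).all()
```

Every repository query for team-owned resources would pass through a helper like this, so tenant isolation is enforced in one place rather than per query.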
audit.py - Audit Logging Helpers:
# Example structure:
from enum import Enum
from datetime import datetime
from typing import Any, Dict
class AuditAction(str, Enum):
CREATE = "create"
UPDATE = "update"
DELETE = "delete"
LOGIN = "login"
LOGOUT = "logout"
def log_audit_event(
user_id: int,
team_id: int,
action: AuditAction,
resource_type: str,
resource_id: int,
metadata: Dict[str, Any] = None
):
"""Log audit event for compliance and tracking"""
pass
def get_audit_trail(
team_id: int,
resource_type: str = None,
start_date: datetime = None,
end_date: datetime = None
):
"""Retrieve audit trail for a team"""
pass
events.py - Domain Event Definitions (Optional):
# Example structure:
from typing import Any, Dict
from datetime import datetime
from pydantic import BaseModel
class DomainEvent(BaseModel):
"""Base class for all domain events"""
event_id: str
event_type: str
timestamp: datetime
team_id: int
payload: Dict[str, Any]
class SiteCreatedEvent(DomainEvent):
event_type: str = "site.created"
class BackupCompletedEvent(DomainEvent):
event_type: str = "backup.completed"
class EnvironmentDeletedEvent(DomainEvent):
event_type: str = "environment.deleted"
# Event publishing/subscribing logic
async def publish_event(event: DomainEvent):
"""Publish domain event to Redis/message broker"""
pass
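The publish/subscribe mechanics can be sketched without a broker. Below is a minimal in-process version: the real `publish_event` would serialize the event and push it to Redis, and the `subscribe` registry here is illustrative:

```python
import asyncio
from collections import defaultdict
from dataclasses import dataclass, field

# In-memory subscriber registry; production would use Redis pub/sub channels
_subscribers = defaultdict(list)

@dataclass
class DomainEvent:
    """Simplified stand-in for the Pydantic DomainEvent above"""
    event_type: str
    payload: dict = field(default_factory=dict)

def subscribe(event_type: str, handler) -> None:
    """Register an async handler for an event type."""
    _subscribers[event_type].append(handler)

async def publish_event(event: DomainEvent) -> None:
    """Fan the event out to every handler registered for its type."""
    for handler in _subscribers[event.event_type]:
        await handler(event)
```

This keeps modules decoupled: the `sites` module publishes `site.created` without knowing which modules (billing, audit, notifications) consume it.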
Usage Guidelines:
- Modules MAY import from app.core.shared.* for cross-cutting concerns
- Modules MUST NOT import from other modules directly
- Shared kernel should contain only domain-level abstractions, not infrastructure
- Keep shared kernel minimal - only add when truly cross-cutting
- Document all shared kernel additions with clear rationale
Architecture Validation:
# Example validation in app/backend/scripts/validate_architecture.py
from pathlib import Path
def validate_module_isolation():
"""
Validates that:
1. No module imports from another module
2. Modules only import from app.core.* and app.database.*
3. Shared kernel is used appropriately
4. External API calls go through adapters in app.core.adapters/
5. No duplicate adapter implementations in domain modules
"""
pass
def validate_adapter_usage():
"""
Validates proper adapter usage:
1. All adapters are in app/core/adapters/
2. No duplicate adapter implementations in domain modules
3. Domain modules use adapters correctly (import from app.core.adapters/)
4. No direct HTTP calls to external APIs in domain modules (must use adapters)
"""
violations = []
# Check for direct HTTP calls in domain modules (should use adapters)
domain_modules = Path("app").glob("*/")
for module in domain_modules:
if module.name in ['core', 'database', 'tests']:
continue
for py_file in module.rglob("*.py"):
content = py_file.read_text()
# Check for direct HTTP library usage (should use adapter)
if 'import requests' in content or 'import httpx' in content or 'import urllib' in content:
violations.append(
f"❌ {py_file}: Direct HTTP library usage detected. "
f"Use adapter from app.core.adapters/ instead."
)
# Check for adapter duplication (adapters should be in app/core/adapters/)
if '/adapters/' in str(py_file) and 'core/adapters' not in str(py_file):
violations.append(
f"❌ {py_file}: Adapter found in domain module. "
f"Move to app/core/adapters/ for reusability."
)
return violations
7.2.8 Enhanced Database Layer (app/database/)¶
The enhanced database layer provides shared database infrastructure, common model mixins, and custom types that are reusable across all domain modules.
Components:
database.py - Engine, SessionLocal, Base:
# Example structure:
from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from app.core.config import settings
# Database engine with connection pooling
engine = create_engine(
settings.DATABASE_URL,
pool_pre_ping=True,
pool_size=20,
max_overflow=40,
pool_recycle=3600,
echo=settings.DEBUG
)
# Session factory
SessionLocal = sessionmaker(
autocommit=False,
autoflush=False,
bind=engine
)
# Base class for all SQLAlchemy models
Base = declarative_base()
deps.py - Database Dependencies:
# Example structure:
from typing import Generator
from sqlalchemy.orm import Session
from fastapi import Depends, HTTPException, status
from fastapi.security import OAuth2PasswordBearer
from jose import JWTError, jwt
from app.core.config import settings
oauth2_scheme = OAuth2PasswordBearer(tokenUrl="auth/login")
def get_db() -> Generator[Session, None, None]:
"""Database session dependency"""
db = SessionLocal()
try:
yield db
finally:
db.close()
async def get_current_user(
db: Session = Depends(get_db),
token: str = Depends(oauth2_scheme)
):
"""Get current authenticated user from JWT token"""
credentials_exception = HTTPException(
status_code=status.HTTP_401_UNAUTHORIZED,
detail="Could not validate credentials",
headers={"WWW-Authenticate": "Bearer"},
)
try:
payload = jwt.decode(token, settings.SECRET_KEY, algorithms=[settings.ALGORITHM])
user_id = payload.get("sub")  # JWT "sub" claim is a string
if user_id is None:
raise credentials_exception
except JWTError:
raise credentials_exception
# Fetch user from database
from app.users.repository import get_user_by_id
user = get_user_by_id(db, user_id=user_id)
if user is None:
raise credentials_exception
return user
async def get_current_active_user(
current_user = Depends(get_current_user)
):
"""Ensure user is active"""
if not current_user.is_active:
raise HTTPException(status_code=400, detail="Inactive user")
return current_user
mixins.py - Common Model Mixins:
# Example structure:
from datetime import datetime
from sqlalchemy import Column, Integer, DateTime, Boolean
from sqlalchemy.ext.declarative import declared_attr
class TimestampMixin:
"""Adds created_at and updated_at timestamps to models"""
created_at = Column(DateTime, default=datetime.utcnow, nullable=False)
updated_at = Column(
DateTime,
default=datetime.utcnow,
onupdate=datetime.utcnow,
nullable=False
)
class TeamScopedMixin:
"""Adds team_id for multi-tenant data isolation"""
@declared_attr
def team_id(cls):
return Column(Integer, nullable=False, index=True)
class SoftDeleteMixin:
"""Adds soft delete functionality"""
is_deleted = Column(Boolean, default=False, nullable=False)
deleted_at = Column(DateTime, nullable=True)
class AuditMixin:
"""Adds audit tracking fields"""
created_by = Column(Integer, nullable=True)
updated_by = Column(Integer, nullable=True)
# Usage in models:
# class Site(Base, TimestampMixin, TeamScopedMixin, SoftDeleteMixin):
# __tablename__ = "sites"
# id = Column(Integer, primary_key=True)
# name = Column(String(255))
types.py - Custom SQLAlchemy Types:
# Example structure:
from sqlalchemy.types import TypeDecorator, CHAR, String, Text
from sqlalchemy.dialects.postgresql import UUID as PG_UUID, JSONB
import uuid
import json
class GUID(TypeDecorator):
"""Platform-independent GUID type.
Uses PostgreSQL's UUID type, otherwise uses CHAR(36) with string UUIDs."""
impl = CHAR
cache_ok = True
def load_dialect_impl(self, dialect):
if dialect.name == 'postgresql':
return dialect.type_descriptor(PG_UUID())
else:
return dialect.type_descriptor(CHAR(36))
def process_bind_param(self, value, dialect):
if value is None:
return value
elif dialect.name == 'postgresql':
return str(value)
else:
if not isinstance(value, uuid.UUID):
return str(uuid.UUID(value))
else:
return str(value)
def process_result_value(self, value, dialect):
if value is None:
return value
else:
if not isinstance(value, uuid.UUID):
value = uuid.UUID(value)
return value
class JSONEncodedDict(TypeDecorator):
"""Represents an immutable structure as a json-encoded string."""
impl = Text
cache_ok = True
def process_bind_param(self, value, dialect):
if value is not None:
value = json.dumps(value)
return value
def process_result_value(self, value, dialect):
if value is not None:
value = json.loads(value)
return value
# For PostgreSQL, use native JSONB
class JSONType(TypeDecorator):
"""JSON type that uses PostgreSQL JSONB or fallback to Text"""
impl = Text
cache_ok = True
def load_dialect_impl(self, dialect):
if dialect.name == 'postgresql':
return dialect.type_descriptor(JSONB())
else:
return dialect.type_descriptor(Text())
Benefits of Enhanced Database Layer:
- Consistency: All models inherit common behaviors (timestamps, soft delete)
- Multi-tenancy: TeamScopedMixin ensures data isolation across teams
- Maintainability: Change timestamp logic once, applies to all models
- Type Safety: Custom types provide better data handling across databases
- Performance: Native PostgreSQL types (JSONB, UUID) when available
Usage in Domain Modules:
# Example: app/sites/model.py
import uuid
from sqlalchemy import Column, Integer, String, Boolean
from app.database.database import Base
from app.database.mixins import TimestampMixin, TeamScopedMixin, SoftDeleteMixin
from app.database.types import GUID, JSONType
class Site(Base, TimestampMixin, TeamScopedMixin, SoftDeleteMixin):
__tablename__ = "sites"
id = Column(Integer, primary_key=True, index=True)
uuid = Column(GUID, unique=True, default=uuid.uuid4)
name = Column(String(255), nullable=False)
domain = Column(String(255), nullable=False)
config = Column(JSONType, nullable=True) # Site-specific config
is_active = Column(Boolean, default=True)
# TimestampMixin provides: created_at, updated_at
# TeamScopedMixin provides: team_id
# SoftDeleteMixin provides: is_deleted, deleted_at
7.2.9 Domain-Aligned Frontend Architecture (frontend/features/)¶
The frontend architecture mirrors the backend's Hybrid Modular DDD structure, organizing code by business features rather than technical layers. This alignment provides significant benefits for team collaboration and maintainability at scale.
Architecture Principles:
- Feature-First Organization: Code is organized by business domain (wordpress, sites, backups) not by type (components, hooks)
- Module Isolation: Each feature module is self-contained with its own components, hooks, services, types, and utils
- Shared Components: Common UI elements live in frontend/shared/ for reusability
- Backend Alignment: Feature modules map 1:1 with backend domain modules
- Type Safety: Auto-generated TypeScript clients from FastAPI OpenAPI specs
Feature Module Structure:
Each feature module follows this standard structure:
frontend/features/[FEATURE_NAME]/
├── components/ # Feature-specific React components
│ ├── [Feature]List.tsx
│ ├── [Feature]Detail.tsx
│ ├── [Feature]Form.tsx
│ └── [Feature]Card.tsx
├── hooks/ # Feature-specific custom hooks
│ ├── use[Feature].ts
│ ├── use[Feature]List.ts
│ └── use[Feature]Mutations.ts
├── services/ # API client for this feature
│ └── [feature]-service.ts
├── types/ # TypeScript type definitions
│ └── index.ts
└── utils/ # Feature-specific utilities
└── [feature]-helpers.ts
Example: WordPress Feature Module (frontend/features/wordpress/):
// frontend/features/wordpress/components/WordPressWpCli.tsx
import { useWordPressCli } from '../hooks/useWordPressCli';
import { Button } from '@/shared/components/ui/button';
export function WordPressWpCli({ siteId }: { siteId: number }) {
const { executeCommand, isLoading } = useWordPressCli(siteId);
return (
<div>
{/* WP-CLI interface */}
</div>
);
}
// frontend/features/wordpress/hooks/useWordPress.ts
import { useQuery, useMutation } from '@tanstack/react-query';
import { wordpressService } from '../services/wordpress-service';
export function useWordPress(siteId: number) {
return useQuery({
queryKey: ['wordpress', siteId],
queryFn: () => wordpressService.getWordPressInfo(siteId)
});
}
export function useWordPressCli(siteId: number) {
const mutation = useMutation({
mutationFn: (command: string) =>
wordpressService.executeWpCli(siteId, command)
});
return {
executeCommand: mutation.mutate,
isLoading: mutation.isPending
};
}
// frontend/features/wordpress/services/wordpress-service.ts
import { apiClient } from '@/shared/lib/api-client';
import { WordPressInfo, WpCliResponse } from '../types';
export const wordpressService = {
async getWordPressInfo(siteId: number): Promise<WordPressInfo> {
const { data } = await apiClient.get(`/api/v1/wordpress/${siteId}`);
return data;
},
async executeWpCli(siteId: number, command: string): Promise<WpCliResponse> {
const { data } = await apiClient.post(`/api/v1/wordpress/${siteId}/cli`, {
command
});
return data;
},
async updateWordPress(siteId: number): Promise<void> {
await apiClient.post(`/api/v1/wordpress/${siteId}/update`);
}
};
// frontend/features/wordpress/types/index.ts
export interface WordPressInfo {
version: string;
is_active: boolean;
site_id: number;
plugins: Plugin[];
themes: Theme[];
}
export interface WpCliResponse {
output: string;
exit_code: number;
timestamp: string;
}
Shared Components (frontend/shared/):
// frontend/shared/components/ui/button.tsx
// Reusable base UI components (using shadcn/ui pattern)
// frontend/shared/lib/api-client.ts
import axios from 'axios';
import { getAuthToken } from '../utils/auth';
export const apiClient = axios.create({
baseURL: process.env.NEXT_PUBLIC_API_URL,
timeout: 30000,
});
// Request interceptor for auth token
apiClient.interceptors.request.use((config) => {
const token = getAuthToken();
if (token) {
config.headers.Authorization = `Bearer ${token}`;
}
return config;
});
// Response interceptor for error handling
apiClient.interceptors.response.use(
(response) => response,
(error) => {
if (error.response?.status === 401) {
// Handle unauthorized
window.location.href = '/login';
}
return Promise.reject(error);
}
);
// frontend/shared/hooks/useDebounce.ts
import { useEffect, useState } from 'react';
export function useDebounce<T>(value: T, delay: number): T {
const [debouncedValue, setDebouncedValue] = useState<T>(value);
useEffect(() => {
const handler = setTimeout(() => {
setDebouncedValue(value);
}, delay);
return () => {
clearTimeout(handler);
};
}, [value, delay]);
return debouncedValue;
}
Benefits of Domain-Aligned Frontend:
1. Developer Efficiency: Find all WordPress-related code in one place
2. Reduced Cognitive Load: Working on a feature doesn't require context switching across directories
3. Clear Ownership: Teams can own entire feature slices (frontend + backend)
4. Parallel Development: Multiple teams work on different features without conflicts
5. Easier Onboarding: New developers understand feature boundaries quickly
6. Better Code Reuse: Shared components are explicitly separated from feature-specific code
Auto-Generated API Clients:
Use openapi-typescript-codegen to generate type-safe API clients:
# Generate TypeScript clients from FastAPI OpenAPI spec
npx openapi-typescript-codegen --input http://localhost:8000/openapi.json \
--output ./frontend/shared/lib/openapi-client \
--client axios
This generates:
- Type-safe request/response interfaces
- API service methods
- Full IntelliSense support
State Management Strategy:
- Server State: React Query (`@tanstack/react-query`) for API data caching
- Client State: Zustand for global UI state (theme, sidebar, modals)
- Form State: React Hook Form for form management
- URL State: Next.js router for navigation state
// frontend/shared/store/auth-store.ts
import { create } from 'zustand';
import { persist } from 'zustand/middleware';
interface AuthState {
user: User | null;
token: string | null;
setAuth: (user: User, token: string) => void;
clearAuth: () => void;
}
export const useAuthStore = create<AuthState>()(
persist(
(set) => ({
user: null,
token: null,
setAuth: (user, token) => set({ user, token }),
clearAuth: () => set({ user: null, token: null }),
}),
{
name: 'auth-storage',
}
)
);
7.2.10 Comprehensive Testing Strategy and Structure¶
A robust testing strategy is essential for maintaining quality at scale. The MBPanel testing framework covers unit, integration, end-to-end, and performance testing across both frontend and backend.
Testing Pyramid:
/\
/ \ E2E Tests (10%)
/____\
/ \
/ Integration Tests (30%)
/________\
/ \
/ Unit Tests (60%)
/______________\
Backend Testing Structure (backend/app/tests/):
backend/app/tests/
├── conftest.py # ⭐ Shared pytest fixtures
├── test_main.py # Main app tests
├── integration/ # Cross-module integration tests
│ ├── __init__.py
│ ├── test_site_creation_flow.py
│ ├── test_backup_restore_flow.py
│ └── test_user_auth_flow.py
├── users/ # Per-module tests
│ ├── __init__.py
│ ├── test_router.py # API endpoint tests
│ ├── test_service.py # Business logic tests
│ └── test_repository.py # Data access tests
├── wordpress/ # WordPress module tests
│ ├── __init__.py
│ ├── test_router.py
│ ├── test_service.py
│ └── test_repository.py
└── ... (one folder per module)
Shared Test Fixtures (backend/app/tests/conftest.py):
import pytest
from fastapi.testclient import TestClient
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from app.main import app
from app.database.database import Base
from app.database.deps import get_db
# Test database URL
SQLALCHEMY_TEST_DATABASE_URL = "postgresql://test:test@localhost:5432/test_db"
# Test engine
engine = create_engine(SQLALCHEMY_TEST_DATABASE_URL)
TestingSessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)
@pytest.fixture(scope="function")
def db():
"""Create test database and yield session"""
Base.metadata.create_all(bind=engine)
db = TestingSessionLocal()
try:
yield db
finally:
db.close()
Base.metadata.drop_all(bind=engine)
@pytest.fixture(scope="function")
def client(db):
"""Test client with overridden database dependency"""
def override_get_db():
try:
yield db
finally:
db.close()
app.dependency_overrides[get_db] = override_get_db
with TestClient(app) as test_client:
yield test_client
app.dependency_overrides.clear()
@pytest.fixture
def auth_headers(client, db):
"""Get authentication headers for testing protected endpoints"""
# Create test user
from app.users.repository import create_user
from app.users.schema import UserCreate
user_data = UserCreate(
email="test@example.com",
password="testpassword123",
full_name="Test User"
)
user = create_user(db, user_data)
# Login and get token
response = client.post("/api/v1/auth/login", json={
"email": "test@example.com",
"password": "testpassword123"
})
token = response.json()["access_token"]
return {"Authorization": f"Bearer {token}"}
@pytest.fixture
def team_factory(db):
"""Factory for creating test teams"""
from app.teams.repository import create_team
from app.teams.schema import TeamCreate
def _create_team(name: str, owner_id: int):
team_data = TeamCreate(name=name, owner_id=owner_id)
return create_team(db, team_data)
return _create_team
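The example tests that follow call a `create_test_site` helper that is not defined in this PRD. A hypothetical minimal version, shown here against in-memory SQLite with illustrative columns (the real helper would build a `Site` via the sites repository), might be:

```python
import uuid
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Site(Base):
    """Illustrative stand-in for app.sites.model.Site"""
    __tablename__ = "sites"
    id = Column(Integer, primary_key=True)
    name = Column(String(255), nullable=False)
    domain = Column(String(255), nullable=False)

def create_test_site(db, name: str = None, domain: str = None) -> Site:
    """Insert a uniquely named site for a test and return it."""
    suffix = uuid.uuid4().hex[:8]
    site = Site(
        name=name or f"site-{suffix}",
        domain=domain or f"{suffix}.example.com",
    )
    db.add(site)
    db.commit()
    db.refresh(site)
    return site
```

Random suffixes keep fixtures collision-free when multiple tests share a database session.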
Example Module Tests:
# backend/app/tests/wordpress/test_router.py
import pytest
from fastapi.testclient import TestClient
def test_get_wordpress_info(client, auth_headers, db):
"""Test getting WordPress information"""
# Create test site
site = create_test_site(db)
response = client.get(
f"/api/v1/wordpress/{site.id}",
headers=auth_headers
)
assert response.status_code == 200
data = response.json()
assert "version" in data
assert "is_active" in data
def test_execute_wp_cli_command(client, auth_headers, db):
"""Test executing WP-CLI command"""
site = create_test_site(db)
response = client.post(
f"/api/v1/wordpress/{site.id}/cli",
headers=auth_headers,
json={"command": "core version"}
)
assert response.status_code == 200
data = response.json()
assert "output" in data
assert data["exit_code"] == 0
# backend/app/tests/wordpress/test_service.py
import pytest
from app.wordpress.service import WordPressService
def test_wordpress_service_get_info(db):
"""Test WordPress service get_info method"""
service = WordPressService()
site = create_test_site(db)
info = service.get_wordpress_info(db, site.id)
assert info is not None
assert hasattr(info, 'version')
# backend/app/tests/integration/test_site_creation_flow.py
import pytest
def test_complete_site_creation_flow(client, auth_headers, db):
"""Test complete site creation workflow"""
# Step 1: Create team
team_response = client.post(
"/api/v1/teams/",
headers=auth_headers,
json={"name": "Test Team"}
)
team_id = team_response.json()["id"]
# Step 2: Create site
site_response = client.post(
"/api/v1/sites/",
headers=auth_headers,
json={
"team_id": team_id,
"name": "Test Site",
"domain": "test.example.com"
}
)
assert site_response.status_code == 201
site_id = site_response.json()["id"]
# Step 3: Create environment
env_response = client.post(
f"/api/v1/sites/{site_id}/environments",
headers=auth_headers,
json={"name": "production"}
)
assert env_response.status_code == 201
# Step 4: Install WordPress
wp_response = client.post(
f"/api/v1/wordpress/{site_id}/install",
headers=auth_headers
)
assert wp_response.status_code == 200
Frontend Testing Structure (frontend/tests/):
frontend/tests/
├── unit/ # Unit tests for components/hooks
│ ├── components/
│ │ └── Button.test.tsx
│ └── hooks/
│ └── useDebounce.test.ts
├── integration/ # Integration tests
│ └── features/
│ └── wordpress/
│ └── WordPressFlow.test.tsx
└── e2e/ # End-to-end tests
├── auth.spec.ts
├── site-creation.spec.ts
└── wordpress.spec.ts
Example Frontend Tests:
```typescript
// frontend/tests/unit/hooks/useDebounce.test.ts
import { renderHook, act } from '@testing-library/react';
import { useDebounce } from '@/shared/hooks/useDebounce';

describe('useDebounce', () => {
  it('should debounce value changes', async () => {
    const { result, rerender } = renderHook(
      ({ value, delay }) => useDebounce(value, delay),
      { initialProps: { value: 'initial', delay: 500 } }
    );
    expect(result.current).toBe('initial');

    rerender({ value: 'updated', delay: 500 });
    expect(result.current).toBe('initial'); // Still initial

    await act(() => new Promise(resolve => setTimeout(resolve, 600)));
    expect(result.current).toBe('updated'); // Now updated
  });
});
```
```typescript
// frontend/tests/e2e/wordpress.spec.ts (using Playwright)
import { test, expect } from '@playwright/test';

test('WordPress CLI execution flow', async ({ page }) => {
  // Login
  await page.goto('/login');
  await page.fill('[name="email"]', 'test@example.com');
  await page.fill('[name="password"]', 'testpassword123');
  await page.click('button[type="submit"]');

  // Navigate to WordPress section
  await page.goto('/sites/1/wordpress');

  // Execute WP-CLI command
  await page.fill('[name="command"]', 'core version');
  await page.click('button:has-text("Execute")');

  // Verify output
  await expect(page.locator('.wp-cli-output')).toContainText('5.9');
});
```
End-to-End Tests (tests/e2e/):
```python
# tests/e2e/test_site_creation_flow.py
import pytest
from playwright.sync_api import Page, expect


def test_complete_site_creation_e2e(page: Page):
    """End-to-end test for site creation"""
    # Login
    page.goto("http://localhost:3000/login")
    page.fill("[name=email]", "test@example.com")
    page.fill("[name=password]", "testpassword123")
    page.click("button[type=submit]")

    # Wait for dashboard
    expect(page.locator("h1")).to_contain_text("Dashboard")

    # Create site
    page.click("text=Create Site")
    page.fill("[name=name]", "Test Site")
    page.fill("[name=domain]", "test.example.com")
    page.click("button:has-text('Create')")

    # Verify site created
    expect(page.locator(".site-list")).to_contain_text("Test Site")
```
Performance Tests (tests/performance/):
```python
# tests/performance/load_tests.py
from locust import HttpUser, task, between


class MBPanelUser(HttpUser):
    wait_time = between(1, 3)

    def on_start(self):
        """Login before tests"""
        response = self.client.post("/api/v1/auth/login", json={
            "email": "test@example.com",
            "password": "testpassword123"
        })
        self.token = response.json()["access_token"]
        self.headers = {"Authorization": f"Bearer {self.token}"}

    @task(10)
    def list_sites(self):
        """Test listing sites endpoint"""
        self.client.get("/api/v1/sites/", headers=self.headers)

    @task(5)
    def get_site_details(self):
        """Test getting site details"""
        self.client.get("/api/v1/sites/1", headers=self.headers)

    @task(2)
    def create_backup(self):
        """Test creating backup"""
        self.client.post("/api/v1/backups/", headers=self.headers, json={
            "site_id": 1,
            "type": "full"
        })
```
Testing Guidelines:
- Unit Tests: 60% coverage target; test individual functions/components
- Integration Tests: 30% coverage target; test module interactions
- E2E Tests: 10% coverage target; test critical user journeys
- Performance Tests: Run weekly, track response times and throughput
- Coverage Threshold: Minimum 80% overall code coverage
- CI Integration: All tests run on every PR, blocking merge on failures
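The 80% threshold above can be enforced by tooling rather than convention. A minimal coverage configuration sketch follows; the file layout and `omit` patterns are illustrative, not prescribed by this PRD:

```toml
# backend/pyproject.toml (illustrative fragment)
[tool.coverage.run]
source = ["app"]
omit = ["app/tests/*"]

[tool.coverage.report]
fail_under = 80   # fail the build below the 80% overall threshold

[tool.pytest.ini_options]
addopts = "--cov=app --cov-report=term-missing"
```

With `fail_under` set, `pytest` exits non-zero when coverage drops, which is enough for CI to block the merge.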
7.2.11 Operational Scripts and DevOps Tooling¶
Operational scripts automate common development, deployment, and maintenance tasks. These scripts live in backend/scripts/ and provide essential tooling for developers and operators.
Script Inventory:
```
backend/scripts/
├── migrate_data.py             # MySQL → PostgreSQL migration
├── seed_dev_data.py            # Development data seeding
├── health_check.py             # Health check utilities
├── generate_module.py          # Generate new module scaffold
├── validate_architecture.py    # Validate module isolation rules
├── backup_database.py          # Database backup script
├── restore_database.py         # Database restore script
└── performance_benchmark.py    # Performance benchmarking
```
1. Module Generator (generate_module.py):
```python
#!/usr/bin/env python3
"""
Generate a new domain module with standard structure.

Usage:
    python scripts/generate_module.py --name payments --description "Payment processing module"
"""
import argparse
from pathlib import Path

TEMPLATES = {
    "router.py": '''"""
{description}

API Router for {module_name}.
"""
from fastapi import APIRouter, Depends, HTTPException
from sqlalchemy.orm import Session
from app.database.deps import get_db, get_current_active_user
from . import service, schema

router = APIRouter(prefix="/{module_name}", tags=["{module_name}"])


@router.get("/", response_model=list[schema.{ModuleName}Read])
def list_{module_name}(
    db: Session = Depends(get_db),
    current_user = Depends(get_current_active_user),
    skip: int = 0,
    limit: int = 100
):
    """List all {module_name}"""
    return service.get_{module_name}_list(db, skip=skip, limit=limit)


@router.post("/", response_model=schema.{ModuleName}Read, status_code=201)
def create_{module_singular}(
    {module_singular}_data: schema.{ModuleName}Create,
    db: Session = Depends(get_db),
    current_user = Depends(get_current_active_user)
):
    """Create new {module_singular}"""
    return service.create_{module_singular}(db, {module_singular}_data)
''',
    "service.py": '''"""
{description}

Business logic for {module_name}.
"""
from sqlalchemy.orm import Session
from . import repository, schema


def get_{module_name}_list(db: Session, skip: int = 0, limit: int = 100):
    """Get list of {module_name}"""
    return repository.get_{module_name}(db, skip=skip, limit=limit)


def create_{module_singular}(db: Session, {module_singular}_data: schema.{ModuleName}Create):
    """Create new {module_singular}"""
    return repository.create_{module_singular}(db, {module_singular}_data)
''',
    # ... more templates
}


def generate_module(name: str, description: str):
    """Generate module with all standard files"""
    module_path = Path(f"app/{name}")
    module_path.mkdir(exist_ok=True)
    for filename, template in TEMPLATES.items():
        content = template.format(
            module_name=name,
            module_singular=name.rstrip('s'),  # naive singularization ("sites" -> "site")
            ModuleName=name.capitalize(),
            description=description
        )
        (module_path / filename).write_text(content)
    print(f"✅ Module '{name}' generated successfully!")
```
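Because the scaffolder is plain string templating, its output can be sanity-checked without touching the filesystem. A small illustrative check (the template excerpt mirrors the router template above):

```python
# Sanity-check the substitution used by generate_module.py.
# The template line below is an excerpt of the router template shown earlier.
template = 'router = APIRouter(prefix="/{module_name}", tags=["{module_name}"])'
rendered = template.format(module_name="payments")
print(rendered)  # → router = APIRouter(prefix="/payments", tags=["payments"])
```

A unit test asserting on the rendered strings catches template typos before a broken module is generated.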
2. Architecture Validator (validate_architecture.py):
```python
#!/usr/bin/env python3
"""
Validate Hybrid Modular DDD architecture rules.

Checks:
- No module imports from another module
- Modules only import from app.core.* and app.database.*
- All modules have standard files (router, service, repository, model, schema)
"""
import ast
import sys
from pathlib import Path
from typing import Set

# First-level packages under "app." that any module may import
ALLOWED_IMPORTS = {'core', 'database'}
REQUIRED_FILES = ['router.py', 'service.py', 'repository.py', 'model.py', 'schema.py']


def get_module_imports(module_path: Path) -> Set[str]:
    """Extract the first-level app.* packages imported by a module"""
    imports = set()
    for py_file in module_path.rglob("*.py"):
        try:
            tree = ast.parse(py_file.read_text())
        except (SyntaxError, UnicodeDecodeError):
            continue
        for node in ast.walk(tree):
            if isinstance(node, ast.ImportFrom):
                if node.module and node.module.startswith('app.'):
                    parts = node.module.split('.')
                    if len(parts) > 1:
                        imports.add(parts[1])  # e.g. "app.core.x" -> "core"
    return imports


def validate_module_isolation():
    """Validate no module imports from other modules"""
    app_path = Path("app")
    modules = [d for d in app_path.iterdir() if d.is_dir() and not d.name.startswith('_')]
    violations = []
    for module in modules:
        if module.name in ['core', 'database', 'tests']:
            continue
        imports = get_module_imports(module)
        forbidden_imports = imports - ALLOWED_IMPORTS - {module.name}
        if forbidden_imports:
            violations.append(f"❌ Module '{module.name}' imports from: {forbidden_imports}")
    return violations


def validate_module_structure():
    """Validate all modules have required files"""
    app_path = Path("app")
    modules = [d for d in app_path.iterdir() if d.is_dir() and not d.name.startswith('_')]
    violations = []
    for module in modules:
        if module.name in ['core', 'database', 'tests']:
            continue
        for required_file in REQUIRED_FILES:
            if not (module / required_file).exists():
                violations.append(f"❌ Module '{module.name}' missing {required_file}")
    return violations


if __name__ == "__main__":
    print("🔍 Validating architecture...")
    isolation_violations = validate_module_isolation()
    structure_violations = validate_module_structure()

    if isolation_violations:
        print("\n❌ Module Isolation Violations:")
        for violation in isolation_violations:
            print(f"  {violation}")
    if structure_violations:
        print("\n❌ Module Structure Violations:")
        for violation in structure_violations:
            print(f"  {violation}")

    if not isolation_violations and not structure_violations:
        print("✅ Architecture validation passed!")
        sys.exit(0)
    else:
        sys.exit(1)
```
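Since the validator exits non-zero on violations, it can run on every commit rather than only in CI. A sketch of wiring it up with the pre-commit framework (an assumption; any hook runner with an `entry` command works the same way):

```yaml
# .pre-commit-config.yaml (illustrative; assumes the pre-commit framework)
repos:
  - repo: local
    hooks:
      - id: validate-architecture
        name: Validate module isolation
        entry: python scripts/validate_architecture.py
        language: system
        pass_filenames: false
        files: ^backend/app/
```

`pass_filenames: false` is needed because the script scans the whole `app/` tree itself instead of taking file arguments.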
3. Development Data Seeder (seed_dev_data.py):
```python
#!/usr/bin/env python3
"""
Seed development database with test data.

Usage:
    python scripts/seed_dev_data.py --users 10 --teams 5 --sites 20
"""
import argparse

from faker import Faker
from sqlalchemy.orm import Session

from app.database.database import SessionLocal
from app.users.repository import create_user
from app.teams.repository import create_team
from app.sites.repository import create_site

fake = Faker()


def seed_users(db: Session, count: int):
    """Seed users"""
    print(f"Seeding {count} users...")
    users = []
    for _ in range(count):
        user_data = {
            "email": fake.email(),
            "full_name": fake.name(),
            "password": "password123"
        }
        user = create_user(db, user_data)
        users.append(user)
    return users


def seed_teams(db: Session, users: list, count: int):
    """Seed teams"""
    print(f"Seeding {count} teams...")
    teams = []
    for _ in range(count):
        team_data = {
            "name": fake.company(),
            "owner_id": fake.random_element(users).id
        }
        team = create_team(db, team_data)
        teams.append(team)
    return teams


def seed_sites(db: Session, teams: list, count: int):
    """Seed sites"""
    print(f"Seeding {count} sites...")
    for _ in range(count):
        site_data = {
            "name": fake.domain_word(),
            "domain": fake.domain_name(),
            "team_id": fake.random_element(teams).id
        }
        create_site(db, site_data)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--users", type=int, default=10)
    parser.add_argument("--teams", type=int, default=5)
    parser.add_argument("--sites", type=int, default=20)
    args = parser.parse_args()

    db = SessionLocal()
    try:
        users = seed_users(db, args.users)
        teams = seed_teams(db, users, args.teams)
        seed_sites(db, teams, args.sites)
        db.commit()
        print("✅ Seeding completed successfully!")
    except Exception as e:
        db.rollback()
        print(f"❌ Seeding failed: {e}")
    finally:
        db.close()
```
4. Health Check Script (health_check.py):
```python
#!/usr/bin/env python3
"""
Comprehensive health check for all services.

Checks:
- Database connectivity
- Redis connectivity
- API endpoints
- Worker processes
"""
import sys

import psycopg2
import redis
import requests


def check_database():
    """Check PostgreSQL connectivity"""
    try:
        conn = psycopg2.connect(
            host="localhost",
            database="mbpanel",
            user="mbpanel",
            password="password"  # dev default; read from env in real deployments
        )
        conn.close()
        print("✅ Database: OK")
        return True
    except Exception as e:
        print(f"❌ Database: FAILED - {e}")
        return False


def check_redis():
    """Check Redis connectivity"""
    try:
        r = redis.Redis(host='localhost', port=6379, db=0)
        r.ping()
        print("✅ Redis: OK")
        return True
    except Exception as e:
        print(f"❌ Redis: FAILED - {e}")
        return False


def check_api():
    """Check API health endpoint"""
    try:
        response = requests.get("http://localhost:8000/health", timeout=5)
        if response.status_code == 200:
            print("✅ API: OK")
            return True
        print(f"❌ API: FAILED - Status {response.status_code}")
        return False
    except Exception as e:
        print(f"❌ API: FAILED - {e}")
        return False


if __name__ == "__main__":
    print("🔍 Running health checks...\n")
    checks = [
        check_database(),
        check_redis(),
        check_api()
    ]
    if all(checks):
        print("\n✅ All health checks passed!")
        sys.exit(0)
    else:
        print("\n❌ Some health checks failed!")
        sys.exit(1)
```
Makefile for Common Commands:
```makefile
# Makefile (recipes must be indented with tabs)
.PHONY: help dev test lint format seed validate health module

help:
	@echo "MBPanel Development Commands"
	@echo "  make dev      - Start development environment"
	@echo "  make test     - Run all tests"
	@echo "  make lint     - Run linters"
	@echo "  make format   - Format code"
	@echo "  make seed     - Seed development data"
	@echo "  make validate - Validate architecture"
	@echo "  make health   - Run health checks"

dev:
	docker-compose -f local-infra/docker-compose.dev.yml up -d
	# Note: the dev servers block; in practice run them in separate terminals
	cd backend && uvicorn app.main:app --reload
	cd frontend && npm run dev

test:
	cd backend && pytest
	cd frontend && npm test

lint:
	cd backend && ruff check .
	cd backend && mypy .
	cd frontend && npm run lint

format:
	cd backend && ruff format .
	cd frontend && npm run format

seed:
	cd backend && python scripts/seed_dev_data.py

validate:
	cd backend && python scripts/validate_architecture.py

health:
	cd backend && python scripts/health_check.py

module:
	@read -p "Module name: " name; \
	read -p "Description: " desc; \
	cd backend && python scripts/generate_module.py --name $$name --description "$$desc"
```
7.3 Monorepo and Decoupled Architecture¶
The MBPanel project implements a monorepo architecture with a decoupled frontend and backend system. This approach combines the benefits of a unified repository structure with the operational advantages of independently deployable services.
7.3.1 Monorepo Approach¶
The monorepo approach consolidates all codebases into a single repository while maintaining clear separation between components. This strategy provides several key benefits:
Benefits:
- Unified Version Control: All components share the same commit history and versioning system
- Simplified Dependency Management: Shared dependencies can be managed collectively
- Atomic Changes: Cross-component changes can be committed atomically
- Improved Collaboration: Teams can easily understand the entire system architecture
- Consistent Tooling: Shared linting, testing, and build tools across all components
- Easier Refactoring: Changes that span multiple components can be coordinated more effectively
- Shared Infrastructure: Common CI/CD pipelines and development tools

Challenges:
- Repository Size: Can become large and impact clone/pull performance
- Build Complexity: Requires sophisticated build systems to handle cross-component dependencies
- Permission Management: Requires careful access control to prevent unauthorized changes
- Branch Management: Coordination required between teams working on different components
- Testing Complexity: Comprehensive testing strategies needed for cross-component changes
7.3.2 Decoupled Architecture¶
The decoupled architecture separates the frontend and backend into independently deployable services while maintaining API contract integrity:
Frontend Independence:
- Next.js application can be developed, tested, and deployed independently
- Frontend team can iterate quickly without backend dependencies
- Can implement frontend-specific performance optimizations
- Supports multiple frontend applications consuming the same backend API

Backend Independence:
- FastAPI backend can be scaled independently of the frontend
- Backend team can implement new features without affecting frontend stability
- Supports multiple client types (web, mobile, third-party integrations)
- Allows backend technology upgrades without frontend changes

API Contract Management:
- OpenAPI specifications serve as the contract between frontend and backend
- Automated client generation ensures consistency between API and frontend
- Versioned APIs allow for gradual migration and backward compatibility
- Contract testing validates API compliance during CI/CD
7.3.3 Branching Strategy¶
The project implements a multi-branch strategy to support independent development while maintaining integration capabilities:
Branch Structure:
- frontend-main: Dedicated branch for frontend development
  - Contains only frontend-related code (in the /frontend directory)
  - Frontend team works exclusively on this branch
  - Independent CI/CD pipeline for frontend-specific builds
  - Feature branches created from and merged back to this branch
  - Frontend-specific dependencies and configurations
- backend-main: Dedicated branch for backend development
  - Contains only backend-related code (in the /backend directory)
  - Backend team works exclusively on this branch
  - Independent CI/CD pipeline for backend-specific builds
  - Feature branches created from and merged back to this branch
  - Backend-specific dependencies and configurations
- LOCALDEV: Complete monorepo branch containing all files
  - Contains the full project structure with both frontend and backend
  - Used for local development and integration testing
  - Maintains the complete codebase for developers who need to work on both components
  - Serves as the integration point for cross-component changes
  - Used for end-to-end testing and local development environments
Branch Management Workflow:
1. Developers working on frontend-only changes use frontend-main
2. Developers working on backend-only changes use backend-main
3. Developers working on cross-component features use LOCALDEV
4. Cross-component changes are coordinated through pull requests to LOCALDEV
5. Periodic synchronization ensures consistency between specialized branches and LOCALDEV
7.3.4 Git Workflow and Collaboration Guidelines¶
Collaboration Process:
- Feature Development: Create feature branches from the appropriate main branch
- Code Reviews: All changes require peer review before merging
- Cross-Component Changes: Use LOCALDEV branch for changes affecting both frontend and backend
- Merge Strategy: Squash and merge for feature branches, merge commits for releases
- Pull Request Requirements:
- Automated tests must pass
- Code coverage requirements met
- Security scans clear
- Documentation updated
Team Coordination:
- Frontend team coordinates API changes with backend team
- Backend team provides advance notice of breaking API changes
- Regular sync meetings to discuss cross-component dependencies
- Shared documentation for API contracts and integration points
- Joint planning sessions for features requiring both frontend and backend changes
7.3.5 Dependency and API Contract Management¶
Dependency Management:
- Frontend Dependencies: Managed in /frontend/package.json
- Backend Dependencies: Managed in /backend/requirements/
- Shared Dependencies: Documented in /docs/dependencies.md
- Version Synchronization: Automated tools ensure compatible versions
- Security Updates: Automated scanning and updates for vulnerabilities
API Contract Management:
- OpenAPI Specifications: Generated from FastAPI code and stored in /backend/app/api/specs/
- Client Generation: Automated TypeScript client generation for frontend
- Contract Testing: Automated tests validate API compliance
- Versioning Strategy: Semantic versioning for API endpoints
- Breaking Change Policy: 30-day deprecation notice for breaking changes
- Documentation: Interactive API documentation available at /api/docs
Integration Validation:
- Pre-commit hooks validate API contract compliance
- CI/CD pipelines run integration tests between components
- Automated contract testing in staging environment
- Mock servers for testing frontend without backend dependencies
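As a sketch of what a minimal contract check can look like, a CI step can load the exported OpenAPI document and assert that the paths and methods the frontend depends on are still present. The spec content and endpoint list below are illustrative; a real setup would read the generated spec from `/backend/app/api/specs/`:

```python
# Minimal API contract check (illustrative spec and endpoints).
import json

def check_contract(spec: dict, required: dict) -> list:
    """Return a list of violations: required paths/methods missing from the spec."""
    violations = []
    for path, methods in required.items():
        if path not in spec.get("paths", {}):
            violations.append(f"missing path: {path}")
            continue
        for method in methods:
            if method not in spec["paths"][path]:
                violations.append(f"missing method: {method.upper()} {path}")
    return violations

# Example spec as the frontend expects it (normally read from openapi.json)
spec = json.loads("""{
  "openapi": "3.1.0",
  "paths": {
    "/api/v1/sites/": {"get": {}, "post": {}},
    "/api/v1/teams/": {"post": {}}
  }
}""")

required = {"/api/v1/sites/": ["get", "post"], "/api/v1/teams/": ["post"]}
print(check_contract(spec, required))  # → [] when the contract holds
```

Running this against the previous release's `required` map turns a silently removed endpoint into a failing CI check.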
7.3.6 Deployment Strategies¶
Independent Deployments:
- Frontend Deployment:
  - Static site deployment to CDN
  - Independent release cycle from backend
  - Blue-green deployment strategy
  - Rollback capability within seconds
  - A/B testing support for UI changes
- Backend Deployment:
- Containerized deployment with Docker
- Rolling updates with health checks
- Database migration management
- API version compatibility during updates
- Graceful degradation for service updates
Coordinated Deployments:
- Breaking Changes: Coordinated deployments when API contracts change
- Feature Flags: Enable/disable features without deployment coordination
- Canary Releases: Gradual rollout of changes affecting both components
- Rollback Coordination: Synchronized rollback procedures for integrated changes

Environment Strategy:
- Development: Local development with full monorepo
- Staging: Full integration testing environment
- Production: Independent scaling of frontend and backend
- Hotfix: Emergency procedures for critical bug fixes
7.3.7 CI/CD Pipeline Configuration¶
Multi-Branch Pipeline Setup:
- Frontend Pipeline:
  - Triggered on frontend-main and frontend feature branches
  - Runs frontend-specific tests (unit, integration, e2e)
  - Builds and deploys frontend application
  - Performs accessibility and performance checks
  - Deploys to frontend-specific staging environment
- Backend Pipeline:
  - Triggered on backend-main and backend feature branches
  - Runs backend-specific tests (unit, integration, load)
  - Builds and deploys backend services
  - Performs security scanning and compliance checks
  - Deploys to backend-specific staging environment
- Integration Pipeline:
  - Triggered on the LOCALDEV branch
  - Runs comprehensive integration tests
  - Validates API contracts between components
  - Performs end-to-end testing
  - Deploys to integrated staging environment
Pipeline Components:
- Build Stages: Separate build processes for frontend and backend
- Testing Stages: Component-specific and integration tests
- Security Scanning: Automated vulnerability detection
- Performance Testing: Load and performance validation
- Deployment Stages: Environment-specific deployment configurations
- Monitoring: Post-deployment health checks and monitoring setup
7.3.8 Component Interaction in Decoupled Architecture¶
Change Impact Management:
- Frontend Changes: Minimal impact on backend, primarily UI/UX updates
- Backend API Changes: Require coordination with the frontend team for integration
- Breaking Changes: 30-day notification period with migration support
- Non-Breaking Changes: Can be deployed independently with minimal coordination

Communication Protocols:
- API-First Development: Backend defines API contracts before implementation
- Event-Driven Architecture: Backend publishes events for frontend consumption
- WebSocket Integration: Real-time communication for live updates
- Health Checks: Component health monitoring and alerting

Monitoring and Observability:
- Component-Specific Metrics: Independent performance and error monitoring
- Cross-Component Dependencies: Track API call performance and error rates
- End-User Monitoring: Track user experience across both components
- Alerting Systems: Component-specific and cross-component alerting
7.4 Technology Stack¶
- Backend: Python 3.11+, FastAPI, SQLAlchemy 2.0 (async), Pydantic v2, Hybrid Modular DDD Architecture
- Frontend: Next.js 14, React 18, TypeScript
- Database: PostgreSQL 15+ with PgBouncer connection pooling
- Cache/Queue: Redis 7 (Cluster mode for production)
- Message Queue: ARQ (Asynchronous Redis Queue) for background jobs
- WebSocket: FastAPI native WebSocket with Redis Pub/Sub
- Infrastructure: Docker & Docker Compose, Kubernetes-ready
- Observability: Grafana, Loki, Prometheus, Tempo (distributed tracing)
- Testing: Pytest, Playwright, Locust (performance testing)
- CI/CD: GitHub Actions
- Static Analysis: Ruff (linting), MyPy (type checking)
7.5 Multi-Developer Collaboration and Workflow Guidelines¶
For a web hosting panel serving millions of users with 10+ developers, clear collaboration guidelines are essential to maintain code quality, prevent conflicts, and ensure consistent development practices.
7.5.1 Team Organization and Ownership¶
Module Ownership Model:
- Each domain module has a primary owner (developer or small team)
- Owners are responsible for:
  - Code quality within their module
  - Reviewing PRs touching their module
  - Maintaining documentation for their module
  - Breaking down large features into tasks
Example Ownership Map:
```
Module Ownership:
├── users/         → Team: Core Platform (Lead: Alice)
├── auth/          → Team: Core Platform (Lead: Alice)
├── teams/         → Team: Core Platform (Lead: Bob)
├── sites/         → Team: Hosting Platform (Lead: Carol)
├── environments/  → Team: Hosting Platform (Lead: Carol)
├── backups/       → Team: Hosting Platform (Lead: Dave)
├── wordpress/     → Team: CMS Features (Lead: Eve)
├── domains/       → Team: Infrastructure (Lead: Frank)
├── payments/      → Team: Billing (Lead: Grace)
└── billing/       → Team: Billing (Lead: Grace)
```
Benefits:
- Clear accountability for each feature
- Faster PR reviews (owners have deep knowledge)
- Reduced conflicts (teams work in different modules)
- Easier onboarding (new devs join a specific team)
7.5.2 Git Workflow and Branching Strategy¶
Branch Structure:
```
Main Branches:
├── LOCALDEV        # Full monorepo (local development & integration)
├── backend-main    # Backend-only code
└── frontend-main   # Frontend-only code

Feature Branches:
├── feature/wordpress-cli-enhancement   # From backend-main or LOCALDEV
├── feature/site-dashboard-ui           # From frontend-main
└── bugfix/backup-restore-issue         # From appropriate main branch
```
Workflow:
- For Backend-Only Changes: create feature branches from backend-main and merge back to it
- For Frontend-Only Changes: create feature branches from frontend-main and merge back to it
- For Cross-Component Changes: work on LOCALDEV and coordinate via pull requests to it
7.5.3 Commit Message Convention¶
Format: `<type>(<scope>): <subject>`
Types:
- feat: New feature
- fix: Bug fix
- docs: Documentation only
- style: Code style changes (formatting, no logic change)
- refactor: Code refactoring
- test: Adding or updating tests
- chore: Build process or auxiliary tool changes
Scopes: Module names (users, wordpress, sites, etc.)
Examples:
```
feat(wordpress): add WP-CLI command execution endpoint
fix(backups): resolve restore failure for large databases
docs(architecture): update DDD module isolation rules
test(sites): add integration tests for site creation flow
refactor(auth): extract JWT logic to shared module
chore(ci): update GitHub Actions workflow for backend tests
```
Commit Body (Optional):
```
feat(wordpress): add WP-CLI command execution endpoint

- Implement execute_wp_cli_command service method
- Add validation for allowed commands
- Implement timeout and error handling
- Add background job support for long-running commands

Closes #123
```
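The convention above can also be enforced mechanically in a commit-msg hook. A minimal sketch follows; the scope pattern is an assumption (lowercase module names), not something this PRD prescribes:

```python
# Minimal check for the <type>(<scope>): <subject> convention.
# TYPES mirrors the list above; the scope pattern is illustrative.
import re

TYPES = ("feat", "fix", "docs", "style", "refactor", "test", "chore")
PATTERN = re.compile(rf"^({'|'.join(TYPES)})\([a-z0-9_-]+\): .+")

def is_valid_commit_subject(subject: str) -> bool:
    """Return True if the first line follows <type>(<scope>): <subject>."""
    return bool(PATTERN.match(subject))

print(is_valid_commit_subject("feat(wordpress): add WP-CLI command execution endpoint"))  # → True
print(is_valid_commit_subject("added stuff"))  # → False
```

Wired into a commit-msg hook, this rejects non-conforming messages at commit time instead of during review.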
7.5.4 Pull Request (PR) Guidelines¶
PR Template:
```markdown
## Description
Brief description of changes

## Type of Change
- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Documentation update

## Module(s) Affected
- [ ] users
- [ ] wordpress
- [ ] sites
- [ ] Other: ___

## Testing
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing completed

## Checklist
- [ ] Code follows style guidelines (passed ruff, mypy)
- [ ] Self-review completed
- [ ] Documentation updated
- [ ] No new warnings generated
- [ ] Module isolation rules maintained (validated with `make validate`)

## Related Issues
Closes #123, #456
```
PR Review Process:
1. Automated Checks (must pass before review):
   - Linting (Ruff, ESLint)
   - Type checking (MyPy, TypeScript)
   - Unit tests
   - Integration tests
   - Architecture validation (validate_architecture.py)
   - Code coverage threshold (80%+)
2. Manual Review (2 approvals required):
   - Code owner approval (if module owner exists)
   - One additional team member approval
   - Architecture review for cross-module changes
3. Merge Strategy:
   - Squash and merge for feature branches (clean history)
   - Merge commit for releases (preserve branch history)
   - Rebase-and-merge is discouraged (puts every feature commit on main, cluttering history)
7.5.5 Code Review Guidelines¶
For Reviewers:
Must Check:
- Module isolation rules respected (no cross-module imports except core/database)
- Proper use of shared kernel for cross-cutting concerns
- All layers present in module (router, service, repository, model, schema)
- Security considerations (authentication, authorization, input validation)
- Error handling and edge cases covered
- Tests cover new functionality (minimum 80% coverage)
- Documentation updated (docstrings, comments for complex logic)

Nice to Have:
- Performance considerations (N+1 queries, caching opportunities)
- Code readability and maintainability
- Consistent naming conventions
- Reusable patterns extracted to shared utilities

Review Turnaround Time:
- Small PRs (< 200 lines): 24 hours
- Medium PRs (200-500 lines): 48 hours
- Large PRs (500+ lines): 72 hours, or preferably broken into smaller PRs
For Authors:
Before Submitting PR:
```bash
# Run checks locally
make lint      # Lint code
make format    # Auto-format code
make validate  # Validate architecture rules
make test      # Run all tests
make health    # Health check (if adding infrastructure changes)
```
During Review:
- Respond to all comments (resolve, explain, or implement suggestions)
- Request clarification if feedback is unclear
- Update PR description if scope changes
- Rebase if main branch has significant changes
7.5.6 Development Environment Setup¶
✅ Implementation Status: The local development environment setup has been fully implemented. See index.md for complete setup instructions.
Prerequisites:
- Python 3.11+
- Node.js 20+
- Docker & Docker Compose
- PostgreSQL 15+ (or via Docker)
- Redis 7 (or via Docker)
Quick Setup (Automated):
Manual Setup Steps:
```bash
# Clone repository
git clone https://github.com/yourorg/mbpanel.git
cd mbpanel

# Backend setup
cd backend
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements/dev.txt

# Database setup
docker-compose -f local-infra/docker-compose.dev.yml up -d postgres redis
alembic upgrade head
python scripts/seed_dev_data.py --users 10 --teams 5 --sites 20

# Frontend setup
cd ../frontend
npm install
npm run dev

# Backend server (in separate terminal)
cd backend
uvicorn app.main:app --reload --port 8000

# Verify setup
make health
```
Daily Development Workflow:
```bash
# Start services
make dev

# Work on feature
git checkout -b feature/my-feature
# Make changes...

# Before committing
make lint
make format
make validate
make test

# Commit and push
git add .
git commit -m "feat(module): description"
git push origin feature/my-feature
# Create PR on GitHub
```
7.5.7 Documentation Standards¶
Module Documentation (docs/development/modules/[MODULE_NAME].md):
Each module should have documentation covering:
- Purpose: What the module does
- Dependencies: Which core/database components it uses
- API Endpoints: List of routes with request/response examples
- Business Logic: Key workflows and use cases
- Database Schema: Tables and relationships
- Integration Points: How other modules interact with it
- Testing: How to test the module
Code Documentation:
Docstrings (Google Style):
```python
def create_wordpress_site(db: Session, site_data: WordPressCreate) -> WordPress:
    """Create a new WordPress installation for a site.

    Args:
        db: Database session
        site_data: WordPress creation data containing site_id, version, etc.

    Returns:
        Created WordPress instance

    Raises:
        HTTPException: If site doesn't exist or already has WordPress

    Example:
        >>> site_data = WordPressCreate(site_id=1, version="6.4")
        >>> wp = create_wordpress_site(db, site_data)
        >>> print(wp.version)
        '6.4'
    """
    pass
```
Inline Comments:
- Use for complex logic that isn't immediately obvious
- Explain why, not what
- Keep comments up-to-date with code changes
7.5.8 Continuous Integration (CI) Pipeline¶
GitHub Actions Workflow (.github/workflows/backend-ci.yml):
```yaml
name: Backend CI

on:
  pull_request:
    branches: [backend-main, LOCALDEV]
    paths:
      - 'backend/**'
  push:
    branches: [backend-main]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          cd backend
          pip install -r requirements/dev.txt
      - name: Run Ruff
        run: |
          cd backend
          ruff check .
      - name: Run MyPy
        run: |
          cd backend
          mypy .

  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_PASSWORD: test
          POSTGRES_DB: test_db
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
      redis:
        image: redis:7
        options: >-
          --health-cmd "redis-cli ping"
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          cd backend
          pip install -r requirements/dev.txt
      - name: Run tests
        run: |
          cd backend
          pytest --cov=app --cov-report=xml --cov-report=term
      - name: Upload coverage
        uses: codecov/codecov-action@v3
        with:
          files: ./backend/coverage.xml

  architecture-validation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Validate architecture
        run: |
          cd backend
          python scripts/validate_architecture.py
```
Protection Rules:
- Require passing CI checks before merge
- Require 2 approvals
- Require branch to be up-to-date before merging
- Prevent force pushes to main branches
7.5.9 Communication and Coordination¶
Daily Standup (Async via Slack):
- Yesterday: Completed WordPress theme management API
- Today: Working on frontend UI for theme selection
- Blockers: Waiting for design review on theme preview
Weekly Planning: - Review upcoming features from product backlog - Assign module ownership for new features - Identify cross-module dependencies - Plan integration testing for complex features
Documentation Channels: - Slack #dev: General development discussion - Slack #architecture: Architecture decisions and reviews - Slack #incidents: Production issues and postmortems - GitHub Discussions: Long-form technical discussions - Confluence/Notion: Architecture decision records (ADRs), runbooks
7.5.10 Conflict Resolution¶
Code Conflicts: - Module isolation minimizes conflicts - If conflicts occur in shared code (core/, database/), coordinate with team - Prefer pull-based updates (rebase on latest main) over push-based (force push)
Design Conflicts: - Escalate to tech lead or architect - Document decision in ADR (Architecture Decision Record) - Communicate decision to all team members
Performance/Scalability Concerns: - Load test before deploying to production - Monitor metrics post-deployment - Rollback if performance degrades
7.6 Performance and Scalability Patterns for High-Traffic Systems¶
For a web hosting control panel serving millions of users, performance and scalability are critical. This section outlines proven patterns and best practices for high-traffic FastAPI applications.
7.6.1 Database Optimization Patterns¶
1. Connection Pooling with PgBouncer:
# backend/app/database/database.py
from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

# Use PgBouncer for connection pooling (external)
engine = create_engine(
    "postgresql://user:pass@pgbouncer:6432/mbpanel",
    pool_pre_ping=True,
    poolclass=NullPool,  # Let PgBouncer handle pooling
    echo=False,
)

# PgBouncer config (pgbouncer.ini)
# [databases]
# mbpanel = host=postgres port=5432 dbname=mbpanel
#
# [pgbouncer]
# pool_mode = transaction
# max_client_conn = 1000
# default_pool_size = 25
2. Read Replicas for Read-Heavy Operations:
# backend/app/database/database.py
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# Primary database (read-write)
engine_primary = create_engine("postgresql://user:pass@primary:5432/mbpanel")
SessionPrimary = sessionmaker(bind=engine_primary)

# Read replica (read-only)
engine_replica = create_engine("postgresql://user:pass@replica:5432/mbpanel")
SessionReplica = sessionmaker(bind=engine_replica)

# Usage in repository
def get_sites_list(db: Session, team_id: int, skip: int = 0, limit: int = 100):
    """List sites (use replica for read-only operation)"""
    # Use read replica for this query
    replica_db = SessionReplica()
    try:
        query = replica_db.query(Site).filter(Site.team_id == team_id)
        return query.offset(skip).limit(limit).all()
    finally:
        replica_db.close()
3. Query Optimization:
# BAD: N+1 query problem
def get_sites_with_environments(db: Session, team_id: int):
    sites = db.query(Site).filter(Site.team_id == team_id).all()
    for site in sites:
        # Triggers an additional query for each site!
        environments = db.query(Environment).filter(Environment.site_id == site.id).all()
    return sites

# GOOD: Eager loading with joinedload
from sqlalchemy.orm import joinedload

def get_sites_with_environments(db: Session, team_id: int):
    sites = db.query(Site).options(
        joinedload(Site.environments)
    ).filter(Site.team_id == team_id).all()
    return sites  # All environments loaded in a single query
4. Database Indexing Strategy:
# backend/app/sites/model.py
from sqlalchemy import Boolean, Column, Integer, String, Index
from app.database.database import Base
from app.database.mixins import TimestampMixin, TeamScopedMixin

class Site(Base, TimestampMixin, TeamScopedMixin):
    __tablename__ = "sites"

    id = Column(Integer, primary_key=True, index=True)
    name = Column(String(255), nullable=False)
    domain = Column(String(255), nullable=False, unique=True)
    is_active = Column(Boolean, nullable=False, default=True)
    # team_id from TeamScopedMixin, already indexed

    # Composite indexes for common queries
    __table_args__ = (
        Index('ix_sites_team_created', 'team_id', 'created_at'),
        Index('ix_sites_domain_active', 'domain', 'is_active'),
    )
7.6.2 Caching Strategies¶
1. Multi-Tier Caching:
# backend/app/core/cache.py
import pickle
from functools import wraps

import redis

redis_client = redis.Redis(
    host='redis',
    port=6379,
    db=0,
    decode_responses=False  # For pickle
)

def cache_result(key_prefix: str, ttl: int = 300):
    """Cache decorator for expensive operations"""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            # Build cache key from function args.
            # Note: avoid hash() here -- it is salted per process, so keys
            # would never match across workers. In real code, also exclude
            # non-deterministic args (e.g. the db session) from the key.
            cache_key = f"{key_prefix}:{func.__name__}:{str(args)}:{str(kwargs)}"

            # Try to get from cache
            cached = redis_client.get(cache_key)
            if cached:
                return pickle.loads(cached)

            # Execute function
            result = await func(*args, **kwargs)

            # Store in cache
            redis_client.setex(cache_key, ttl, pickle.dumps(result))
            return result
        return wrapper
    return decorator

# Usage in service
from app.core.cache import cache_result

@cache_result(key_prefix="sites", ttl=60)
async def get_site_statistics(db: Session, site_id: int):
    """Get site statistics (cached for 60 seconds)"""
    # Expensive computation...
    return statistics
2. Cache Invalidation:
# backend/app/core/cache.py
def invalidate_cache(pattern: str):
    """Invalidate all cache keys matching pattern"""
    for key in redis_client.scan_iter(match=pattern):
        redis_client.delete(key)

# Usage after data modification
from app.core.cache import invalidate_cache

def update_site(db: Session, site_id: int, site_data: SiteUpdate):
    """Update site and invalidate related caches"""
    # Update database
    site = repository.update_site(db, site_id, site_data)
    db.commit()

    # Invalidate caches
    invalidate_cache(f"sites:*:{site_id}:*")
    invalidate_cache(f"sites:list:team:{site.team_id}:*")
    return site
3. Application-Level Caching (LRU):
from functools import lru_cache

@lru_cache(maxsize=128)
def get_wordpress_allowed_commands() -> list[str]:
    """Get list of allowed WP-CLI commands (cached in memory)"""
    return [
        "core version",
        "plugin list",
        "theme list",
        "user list",
        # ... more commands
    ]
7.6.3 Asynchronous Processing¶
1. Background Jobs with ARQ:
# backend/app/jobs/worker.py
from arq import create_pool
from arq.connections import RedisSettings

async def startup(ctx):
    ctx['redis'] = await create_pool(RedisSettings())

async def shutdown(ctx):
    await ctx['redis'].close()

# Job definitions
async def create_full_backup(ctx, site_id: int):
    """Background job for creating full site backup"""
    # Long-running operation
    db = SessionLocal()
    try:
        # Create backup logic...
        pass
    finally:
        db.close()

async def install_wordpress(ctx, site_id: int, version: str):
    """Background job for WordPress installation"""
    # Long-running operation
    pass

class WorkerSettings:
    functions = [create_full_backup, install_wordpress]
    on_startup = startup
    on_shutdown = shutdown
    redis_settings = RedisSettings()
    job_timeout = 3600  # 1 hour
    max_jobs = 10
2. Enqueue Jobs from API:
# backend/app/backups/router.py
from arq import create_pool
from arq.connections import RedisSettings

@router.post("/backups/", status_code=202)
async def create_backup(
    backup_data: BackupCreate,
    db: Session = Depends(get_db),
    current_user = Depends(get_current_active_user)
):
    """Create backup (async operation)"""
    # Validate request
    site = service.get_site(db, backup_data.site_id)
    if not site:
        raise HTTPException(status_code=404, detail="Site not found")

    # Enqueue background job
    # (pool created inline for clarity -- in production, create it once at
    # application startup and reuse it rather than per request)
    redis = await create_pool(RedisSettings())
    job = await redis.enqueue_job('create_full_backup', backup_data.site_id)

    return {
        "job_id": job.job_id,
        "status": "queued",
        "message": "Backup job queued successfully"
    }
7.6.4 Rate Limiting and Throttling¶
# backend/app/core/rate_limit.py
from fastapi import Depends, HTTPException, Request
from redis import Redis

redis_client = Redis(host='redis', port=6379, db=1)

def rate_limit(calls: int = 100, period: int = 60):
    """Dependency factory: limit to {calls} requests per {period} seconds"""
    async def dependency(request: Request):
        # Get user ID from request (assumes authentication middleware ran)
        user_id = request.state.user.id if hasattr(request.state, 'user') else 'anonymous'
        key = f"rate_limit:{user_id}:{request.url.path}"

        # INCR first, set expiry on the first hit -- atomic, no read/write race
        current = redis_client.incr(key)
        if current == 1:
            redis_client.expire(key, period)
        if current > calls:
            raise HTTPException(
                status_code=429,
                detail=f"Rate limit exceeded. Try again in {redis_client.ttl(key)} seconds"
            )
    return dependency

# Usage in router
from app.core.rate_limit import rate_limit

@router.post(
    "/wordpress/{site_id}/cli",
    dependencies=[Depends(rate_limit(calls=10, period=60))]
)
async def execute_wp_cli(
    site_id: int,
    command: WpCliCommand,
    db: Session = Depends(get_db),
    current_user = Depends(get_current_active_user),
):
    """Execute WP-CLI command (rate limited: 10/min)"""
    pass
7.6.5 Horizontal Scaling Considerations¶
1. Stateless API Servers: - No session state stored on API servers - All session data in Redis - Load balancer can route to any instance
2. WebSocket Scaling:
# backend/app/websocket/connection.py
import json

import redis.asyncio as aioredis
from fastapi import WebSocket

class ConnectionManager:
    def __init__(self):
        self.active_connections: dict[int, list[WebSocket]] = {}
        self.redis = None

    async def connect(self, websocket: WebSocket, user_id: int):
        await websocket.accept()
        if user_id not in self.active_connections:
            self.active_connections[user_id] = []
        self.active_connections[user_id].append(websocket)

        # Subscribe to Redis pub/sub for cross-instance messages
        # (a background task must consume the subscription and forward
        # messages to the local WebSockets -- omitted here for brevity)
        if not self.redis:
            self.redis = await aioredis.from_url("redis://redis:6379")
        pubsub = self.redis.pubsub()
        await pubsub.subscribe(f"user:{user_id}")

    async def broadcast_to_user(self, user_id: int, message: dict):
        """Broadcast message to all user connections across all instances"""
        # Publish to Redis (all instances will receive)
        await self.redis.publish(f"user:{user_id}", json.dumps(message))
3. Database Connection Management:
- Use PgBouncer to prevent connection exhaustion
- Monitor active connections: SELECT count(*) FROM pg_stat_activity;
- Set appropriate pool_size and max_overflow values
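For services that connect to PostgreSQL directly (without PgBouncer in front), the pool sizing above is configured on the SQLAlchemy engine instead. A sketch with illustrative starting values, not tuned numbers:

```python
# backend/app/database/database.py -- direct-connection variant (no PgBouncer)
# The numbers below are illustrative starting points, not tuned values.
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql://user:pass@postgres:5432/mbpanel",
    pool_size=20,        # persistent connections held per app instance
    max_overflow=10,     # extra connections allowed under burst load
    pool_timeout=30,     # seconds to wait for a free connection
    pool_recycle=1800,   # recycle connections before server-side timeouts
    pool_pre_ping=True,  # detect dead connections before use
)
```

Total possible connections per instance is pool_size + max_overflow, so multiply by the number of instances when checking against PostgreSQL's max_connections.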
4. Health Checks for Load Balancer:
# backend/app/main.py
from fastapi import FastAPI, Response
from sqlalchemy import text

@app.get("/health", tags=["health"])
async def health_check():
    """Health check endpoint for load balancer"""
    # Check database connectivity
    try:
        db = SessionLocal()
        db.execute(text("SELECT 1"))  # text() is required on SQLAlchemy 2.x
        db.close()
    except Exception:
        return Response(status_code=503, content="Database unhealthy")

    # Check Redis connectivity
    try:
        redis_client.ping()
    except Exception:
        return Response(status_code=503, content="Redis unhealthy")

    return {"status": "healthy", "version": "1.0.0"}
7.6.6 External API Optimization Patterns¶
Context: The MBPanel system integrates with 40+ external services (Virtuozzo, Bunny CDN, Cloudflare, payment processors). External API calls represent a significant performance bottleneck if not optimized properly.
Key Optimization Strategies:
1. Connection Pooling for External APIs:
# backend/app/core/http_client.py
import httpx

# Connection pooling configuration
client = httpx.AsyncClient(
    timeout=httpx.Timeout(30.0),
    limits=httpx.Limits(
        max_keepalive_connections=20,  # Reuse up to 20 connections
        max_connections=100,           # Max concurrent connections
        keepalive_expiry=60.0          # Keep connections alive for 60s
    ),
    http2=True  # Enable HTTP/2 for multiplexing
)

# Performance Impact:
# - Without pooling: DNS (20ms) + TLS handshake (50ms) + Request (30ms) = 100ms per request
# - With pooling: First request (100ms), subsequent requests (30ms) = 70% reduction
2. Response Caching for External API Calls:
# Cache frequently accessed external API responses
from app.core.cache import cache_result

@cache_result(key_prefix="virtuozzo", ttl=300)
async def fetch_environments(session_key: str):
    """Fetch environments from Virtuozzo (cached for 5 minutes)"""
    response = await virtuozzo_client.get(
        f"/getenvs?session={session_key}"
    )
    return response.json()

# Performance Impact:
# - First request: 90ms (API call)
# - Subsequent requests: 2ms (Redis cache)
# - Improvement: 98% faster for cached requests
3. Circuit Breaker Pattern:
# Prevent cascading failures when external APIs are down
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=300):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.opened_at = 0.0
        self.state = "CLOSED"  # CLOSED, OPEN, HALF_OPEN

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "HALF_OPEN"  # Allow one trial call through
            else:
                # Fail fast without calling the external API
                raise Exception("Circuit breaker OPEN")
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        self.failure_count = 0
        self.state = "CLOSED"

    def _on_failure(self):
        self.failure_count += 1
        if self.failure_count >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = time.monotonic()

# Performance Impact:
# - Prevents wasting resources on failing external APIs
# - Fails fast (0ms) instead of waiting for timeout (30s+)
# - Protects system from cascading failures
4. Retry with Exponential Backoff:
# Automatic retry for transient failures
import asyncio

import httpx

async def get_with_retry(path: str, max_retries: int = 3):
    retry_count = 0
    while retry_count <= max_retries:
        try:
            response = await external_api_client.get(path)
            return response
        except (httpx.TimeoutException, httpx.TransportError):
            retry_count += 1
            if retry_count <= max_retries:
                backoff = 2 ** (retry_count - 1)  # 1s, 2s, 4s
                await asyncio.sleep(backoff)
            else:
                raise

# Performance Impact:
# - Recovers from transient failures automatically
# - Reduces error rate by 90%+ for flaky external APIs
5. Rate Limiting for External APIs:
# Prevent overwhelming external APIs
import asyncio
from datetime import datetime, timedelta

class RateLimiter:
    def __init__(self, max_calls_per_second: int = 10):
        self.max_calls = max_calls_per_second
        self.calls = []
        self.lock = asyncio.Lock()

    async def check_rate_limit(self):
        async with self.lock:
            now = datetime.now()
            # Remove calls older than 1 second
            self.calls = [t for t in self.calls if now - t < timedelta(seconds=1)]
            if len(self.calls) >= self.max_calls:
                # Wait until the oldest call expires
                sleep_time = 1.0 - (now - self.calls[0]).total_seconds()
                if sleep_time > 0:
                    await asyncio.sleep(sleep_time)
            self.calls.append(now)

# Performance Impact:
# - Prevents API throttling by external services
# - Maintains consistent throughput
# - Avoids 429 (Too Many Requests) errors
6. Idempotency for Safe Retries:
# Safe retries for mutation operations
idempotency_key = f"create_env:{user_id}:{env_name}"
response = await external_api_client.post(
    path="/create",
    json=data,
    headers={"Idempotency-Key": idempotency_key}
)

# Check for idempotent errors
if response.status_code == 400 and "already exists" in response.text:
    # Treat as success (idempotent operation)
    logger.info("Resource already exists, treating as success")
    return {"status": "success", "idempotent": True}

# Performance Impact:
# - Enables safe retries without duplicate operations
# - Prevents data corruption from duplicate requests
# - Reduces complexity in error handling
7. Parallel API Calls with asyncio:
# Execute multiple external API calls concurrently
import asyncio

async def fetch_all_data(site_id: int):
    # Start all calls in parallel instead of sequentially
    environment_task = virtuozzo_adapter.fetch_environment(site_id)
    backup_task = virtuozzo_adapter.fetch_backups(site_id)
    domain_task = cloudflare_adapter.fetch_dns_records(site_id)

    # Wait for all to complete
    environment, backups, domains = await asyncio.gather(
        environment_task,
        backup_task,
        domain_task
    )
    return {
        "environment": environment,
        "backups": backups,
        "domains": domains
    }

# Performance Impact:
# - Sequential: 90ms + 80ms + 70ms = 240ms
# - Parallel: max(90ms, 80ms, 70ms) = 90ms
# - Improvement: 62% faster
8. HTTP/2 Multiplexing:
# Enable HTTP/2 so multiple requests share a single connection
# (requires the optional dependency: pip install 'httpx[http2]')
client = httpx.AsyncClient(
    http2=True  # Enable HTTP/2
)

# Performance Impact:
# - Reduces connection overhead
# - Enables request/response multiplexing
# - Reduces head-of-line blocking
# - 20-30% latency improvement for multiple requests
External API Performance Targets:
| Metric | Target | Measurement Method |
|---|---|---|
| Average API Call Latency | <50ms | P50 response time |
| P95 API Call Latency | <150ms | P95 response time |
| Cache Hit Rate | >80% | Cache hits / total requests |
| Circuit Breaker Open Rate | <1% | Opens / total time |
| Retry Success Rate | >90% | Successful retries / total retries |
| Connection Reuse Rate | >90% | Pooled connections / total connections |
| Rate Limit Violation Rate | 0% | 429 errors / total requests |
Monitoring External API Performance:
from prometheus_client import Counter, Histogram, Gauge

# Track external API performance
external_api_duration = Histogram(
    'external_api_duration_seconds',
    'External API call duration',
    ['adapter', 'endpoint']
)
external_api_errors = Counter(
    'external_api_errors_total',
    'External API errors',
    ['adapter', 'endpoint', 'error_type']
)
circuit_breaker_state = Gauge(
    'circuit_breaker_state',
    'Circuit breaker state (0=CLOSED, 1=OPEN, 2=HALF_OPEN)',
    ['adapter']
)
cache_hit_rate = Counter(
    'external_api_cache_hits',
    'Cache hits for external API calls',
    ['adapter', 'endpoint']
)
7.6.7 Monitoring and Observability¶
1. Metrics Instrumentation:
# backend/app/core/metrics.py
import time

from fastapi import Request
from prometheus_client import Counter, Histogram, Gauge

# Request metrics
http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)
http_request_duration_seconds = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint']
)
active_websocket_connections = Gauge(
    'active_websocket_connections',
    'Number of active WebSocket connections'
)

# Middleware to track metrics
@app.middleware("http")
async def track_metrics(request: Request, call_next):
    start_time = time.time()
    response = await call_next(request)
    duration = time.time() - start_time

    http_requests_total.labels(
        method=request.method,
        endpoint=request.url.path,
        status=response.status_code
    ).inc()
    http_request_duration_seconds.labels(
        method=request.method,
        endpoint=request.url.path
    ).observe(duration)
    return response
2. Performance Targets: - API p95 latency: <200ms - Database query p95: <50ms - Cache hit rate: >80% - Error rate: <0.1% - Uptime: 99.9% - External API p95 latency: <150ms - External API cache hit rate: >80%
7.7 Database Sharding Strategy for Millions of Users¶
To support millions of concurrent users and billions of records, a single PostgreSQL database instance is insufficient. This section defines our enterprise-grade database sharding strategy using Citus (distributed PostgreSQL) for horizontal scalability.
7.7.1 Sharding Architecture Overview¶
Why Citus? - PostgreSQL-native: Maintains full PostgreSQL compatibility - Transparent sharding: Application code requires minimal changes - Distributed queries: Automatic query parallelization across shards - Reference tables: Replicated lookup tables across all shards - Multi-tenant friendly: Natural sharding by tenant_id/team_id
Architecture Diagram:
┌─────────────────────────────────────────────────────────────┐
│ MBPanel FastAPI Application │
│ (Connection to Coordinator) │
└────────────────────┬────────────────────────────────────────┘
│
┌───────────▼──────────┐
│ Citus Coordinator │ ← Query planning & routing
│ (PostgreSQL 15+) │
└───────────┬──────────┘
│
┌──────────────┼──────────────┐
│ │ │
┌─────▼─────┐ ┌─────▼─────┐ ┌─────▼─────┐
│ Shard 1 │ │ Shard 2 │ │ Shard 3 │ ... Shard N
│ (Worker) │ │ (Worker) │ │ (Worker) │
│ team_id: │ │ team_id: │ │ team_id: │
│ 0-333M │ │ 333M-666M │ │ 666M-999M │
└───────────┘ └───────────┘ └───────────┘
Each shard contains:
- environments (distributed table)
- nodes (distributed table)
- team-specific data (colocated with team_id)
Replicated across all shards:
- users (reference table)
- teams (reference table)
- api_keys (reference table)
7.7.2 Sharding Key Selection¶
Primary Sharding Key: team_id
Rationale: 1. Natural multi-tenancy: All WordPress sites belong to a team (even single-user teams) 2. Query locality: 95% of queries filter by team_id 3. Even distribution: Teams have similar data volumes (sites, backups, logs) 4. No hotspots: No single team dominates traffic (unlike user_id for SaaS platforms)
Distribution Strategy:
-- Create distributed tables (run on coordinator)
SELECT create_distributed_table('environments', 'team_id');
SELECT create_distributed_table('nodes', 'team_id');
SELECT create_distributed_table('job_logs', 'team_id');
SELECT create_distributed_table('backups', 'team_id');
SELECT create_distributed_table('metrics', 'team_id');
-- Create reference tables (replicated to all shards)
SELECT create_reference_table('users');
SELECT create_reference_table('teams');
SELECT create_reference_table('api_keys');
Colocation Strategy:
- All tables sharded by team_id are colocated
- Enables local JOINs within a shard (no cross-shard queries)
- Example: JOIN environments + nodes for same team executes on single worker
7.7.3 Shard Sizing and Scaling¶
Initial Configuration (1M teams, 10M sites): - Coordinator: 1 node (16 vCPU, 64GB RAM) - Worker Shards: 8 nodes (8 vCPU, 32GB RAM each) - Shard Size: ~125K teams per shard - Data per Shard: ~250GB (estimated)
Scaling Trigger Points:

| Metric | Threshold | Action |
|--------|-----------|--------|
| Shard size | >500GB | Add 4 more workers, rebalance |
| Query latency p95 | >50ms | Add read replicas per shard |
| CPU utilization | >70% | Vertical scaling or add workers |
| Teams per shard | >500K | Add workers, redistribute |
Shard Rebalancing:
-- Rebalance shards after adding new workers
SELECT rebalance_table_shards('environments');
SELECT rebalance_table_shards('nodes');
-- ... other distributed tables
Automated Rebalancing Policy: - Trigger rebalancing when shard size variance > 20% - Run during low-traffic windows (2am-4am UTC) - Monitor query performance during rebalancing - Rollback mechanism if p95 latency degrades >30%
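The ">20% variance" trigger above can be operationalized as a small check (illustrative function name; shard sizes measured in GB):

```python
def should_rebalance(shard_sizes_gb: list[float], max_variance: float = 0.20) -> bool:
    """Trigger rebalancing when any shard's size diverges more than
    max_variance from the mean -- one way to encode the policy above."""
    mean = sum(shard_sizes_gb) / len(shard_sizes_gb)
    return any(abs(size - mean) / mean > max_variance for size in shard_sizes_gb)
```

A scheduler in the low-traffic window would call this against current shard sizes and, if it returns True, kick off `rebalance_table_shards` while watching p95 latency for the rollback condition.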
7.7.4 Query Patterns and Optimizations¶
Supported Query Patterns:
1. Single-Tenant Queries (Fast Path - 95% of queries)
2. Cross-Tenant Aggregations (Slow Path - 5% of queries)
3. Reference Table Queries (Local - Replicated)
Query Optimization Guidelines: - Always include team_id in WHERE clause for distributed tables - Avoid cross-shard JOINs (denormalize if necessary) - Use reference tables for lookup data (<10M rows) - Leverage Citus query pushdown for aggregations - Use connection pooling (PgBouncer) per worker
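The first guideline ("always include team_id") is easy to enforce with a small guard in repository code; a sketch with a hypothetical helper name:

```python
def require_team_scope(filters: dict) -> dict:
    """Reject distributed-table queries that would fan out to every shard
    because the team_id filter is missing (illustrative guard)."""
    if filters.get("team_id") is None:
        raise ValueError("Queries against distributed tables must filter by team_id")
    return filters
```

Calling this at the top of every repository function for a distributed table turns an expensive cross-shard scan into an immediate, visible error during development.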
7.7.5 High Availability for Sharded Database¶
Replication Strategy: - Per-Shard Replication: Each worker has 2 replicas (primary + 2 standby) - Coordinator Replication: Active-passive coordinator setup - Automatic Failover: Patroni + etcd for consensus-based failover - Failover Time: <30 seconds (RTO) - Data Loss: <5 seconds (RPO with synchronous replication)
Architecture:
Coordinator:
├── coordinator-primary (active)
├── coordinator-standby-1 (sync replica)
└── coordinator-standby-2 (async replica)
Shard 1:
├── worker-1-primary (active)
├── worker-1-standby-1 (sync replica)
└── worker-1-standby-2 (async replica)
... (repeat for all shards)
Health Monitoring: - Continuous replication lag monitoring (<5s threshold) - Automatic promotion of standby on primary failure - Application-level connection retry with exponential backoff - Alert on failover events (PagerDuty integration)
7.7.6 Backup Strategy for Sharded Database¶
Backup Requirements: - Frequency: Continuous WAL archiving + daily base backups - Retention: 30 days for base backups, 7 days for WAL - Point-in-Time Recovery (PITR): Restore to any timestamp within 7 days - Backup Storage: S3 with cross-region replication
Implementation:
# Using pgBackRest for each shard
# Coordinator backup
pgbackrest --stanza=coordinator --type=full backup
# Worker backups (parallel execution)
for shard in worker-{1..8}; do
pgbackrest --stanza=$shard --type=full backup &
done
wait
# Incremental backups (hourly)
pgbackrest --stanza=coordinator --type=incr backup
for shard in worker-{1..8}; do
pgbackrest --stanza=$shard --type=incr backup &
done
Disaster Recovery Testing: - Monthly DR drills: Restore random shard to verify backups - Quarterly full cluster restore to staging environment - Document recovery time for different scenarios: - Single shard failure: <10 minutes - Coordinator failure: <5 minutes - Full cluster failure: <2 hours
7.7.7 Migration Path from Single PostgreSQL to Citus¶
Phase 1: Preparation (Week 1-2)
1. Audit current schema for sharding compatibility
2. Add team_id to all tables lacking it (backfill historical data)
3. Identify reference vs. distributed tables
4. Update queries to always include team_id filter
5. Deploy Citus coordinator + 2 initial workers (staging)
Phase 2: Parallel Run (Week 3-4)
1. Enable dual-write to both legacy DB and Citus (staging)
2. Backfill historical data to Citus using pg_dump + COPY
3. Validate data consistency between legacy and Citus
4. Load test Citus with production traffic replay
5. Performance benchmarking (compare p50/p95/p99)
Phase 3: Cutover (Week 5)
1. Enable read-only mode on legacy PostgreSQL
2. Final incremental sync to Citus
3. Switch application to Citus coordinator (feature flag)
4. Monitor for 24 hours, rollback plan ready
5. Disable legacy PostgreSQL after 7-day validation period
Rollback Plan: - Feature flag to switch back to legacy DB (< 5 minutes) - Citus → Legacy reverse sync available for 7 days - Automated health checks with automatic rollback if: - Error rate >1% - Query latency p95 >2x baseline - Any shard becomes unavailable
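The feature-flag cutover and rollback reduce to a single routing decision, so switching databases is a config change rather than a deploy. A minimal sketch (flag name and URLs are hypothetical):

```python
def select_database_url(flags: dict, legacy_url: str, citus_url: str) -> str:
    """Route connections to Citus or the legacy database based on a
    feature flag, enabling <5 minute rollback by flipping the flag."""
    return citus_url if flags.get("use_citus", False) else legacy_url
```

The automated health checks would flip `use_citus` back to False when the error-rate or latency rollback conditions above are met.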
7.7.8 Cost Optimization for Sharded Database¶
Cost Breakdown (Estimated for 1M teams):

| Component | Spec | Monthly Cost |
|-----------|------|--------------|
| Coordinator | 16 vCPU, 64GB RAM | $500 |
| Workers (8x) | 8 vCPU, 32GB RAM each | $2,400 ($300 × 8) |
| Replicas (16x) | Same as workers (standby) | $4,800 ($300 × 16) |
| Backup Storage | 10TB S3 Standard | $230 |
| TOTAL | | $7,930/month |
Cost per User: $0.0079/month (for 1M teams)
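The totals above can be sanity-checked in a few lines:

```python
# Monthly cost components from the breakdown above
monthly_costs = {
    "coordinator": 500,
    "workers": 300 * 8,       # 8 workers at $300 each
    "replicas": 300 * 16,     # 16 standby replicas at $300 each
    "backup_storage_s3": 230,
}
total = sum(monthly_costs.values())        # $7,930/month
cost_per_team = total / 1_000_000          # ≈ $0.0079/month for 1M teams
```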
Optimization Strategies: 1. Use spot instances for standby replicas (60% cost reduction) 2. Tiered storage: Move old WAL logs to S3 Glacier after 7 days 3. Auto-scaling workers: Add workers during peak hours, remove during off-peak 4. Compression: Enable PostgreSQL table compression (saves 40% disk) 5. Vacuuming: Aggressive autovacuum to reclaim space from deleted data
7.7.9 Monitoring and Alerting for Sharded Database¶
Key Metrics to Monitor:

| Metric | Threshold | Action |
|--------|-----------|--------|
| Shard query latency p95 | >50ms | Investigate slow queries |
| Replication lag | >5 seconds | Alert DBA, check network |
| Shard disk usage | >80% | Provision more storage |
| Connection pool saturation | >90% | Increase pool size |
| Cross-shard query % | >10% | Optimize queries |
| Failed shard health checks | >1 | Initiate failover |
Monitoring Stack: - Prometheus: Metrics collection from all Citus nodes - Grafana: Real-time dashboards for coordinator + all shards - PgWatch2: PostgreSQL-specific metrics (bloat, vacuum, locks) - Alertmanager: PagerDuty integration for critical alerts
Sample Alerts:
# Prometheus alert rules
- alert: ShardHighLatency
expr: pg_stat_statements_mean_time_seconds{shard!=""} > 0.05
for: 5m
labels:
severity: warning
annotations:
summary: "Shard {{ $labels.shard }} has high query latency"
- alert: ReplicationLagHigh
expr: pg_replication_lag_seconds > 5
for: 2m
labels:
severity: critical
annotations:
summary: "Replication lag on {{ $labels.instance }} is {{ $value }}s"
7.7.10 Alternative Sharding Strategies Considered¶
Rejected Alternatives:

1. Application-Level Sharding
   - ❌ Requires custom connection routing logic in FastAPI
   - ❌ Complex schema migrations across shards
   - ❌ No support for cross-shard queries
   - ❌ Higher maintenance burden

2. Vitess (MySQL Sharding)
   - ❌ Requires migrating from PostgreSQL to MySQL
   - ❌ Loss of PostgreSQL-specific features (JSONB, CTEs)
   - ❌ Different query syntax and limitations

3. MongoDB Sharding
   - ❌ NoSQL requires application rewrites
   - ❌ Lack of ACID guarantees for multi-document transactions
   - ❌ Team lacks MongoDB expertise

4. CockroachDB (Distributed SQL)
   - ✅ Automatic sharding and rebalancing
   - ❌ Higher operational cost (3x vs. Citus)
   - ❌ Different SQL dialect (requires query rewrites)
   - ❌ Less mature ecosystem vs. PostgreSQL
Citus Advantages: - ✅ PostgreSQL-compatible (minimal migration) - ✅ Mature open-source project (developed by Citus Data, now maintained by Microsoft) - ✅ Proven at scale (millions of shards in production) - ✅ Strong community and enterprise support
7.8 Multi-Region Disaster Recovery (DR) Plan¶
For an enterprise-grade WordPress hosting platform serving millions of users, a comprehensive multi-region disaster recovery strategy is critical to ensure business continuity, data integrity, and regulatory compliance.
7.8.1 DR Objectives and Requirements¶
Recovery Objectives: - Recovery Time Objective (RTO): <15 minutes for critical services - Recovery Point Objective (RPO): <5 minutes data loss (synchronous replication) - Availability Target: 99.99% uptime (52.56 minutes downtime per year) - Geographic Redundancy: Minimum 2 regions, 500+ miles apart
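The 99.99% availability target translates directly into the stated downtime budget:

```python
def downtime_budget_minutes(availability: float, days: int = 365) -> float:
    """Minutes of allowed downtime per year for a given availability target."""
    return (1 - availability) * days * 24 * 60

# 99.99% over a 365-day year → 52.56 minutes, matching the target above
```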
Disaster Scenarios Covered: 1. Regional Outage: Complete AWS/cloud provider region failure 2. Data Center Failure: Single availability zone (AZ) failure 3. Database Corruption: Logical corruption requiring point-in-time recovery 4. Ransomware/Cyberattack: Complete system compromise 5. Human Error: Accidental deletion or misconfiguration 6. Network Partition: Split-brain scenarios
7.8.2 Multi-Region Architecture¶
Primary Region: us-east-1 (Virginia) Secondary Region: us-west-2 (Oregon) Tertiary Region (Optional): eu-west-1 (Ireland) for GDPR compliance
Architecture Diagram:
┌──────────────────────────────────────────────────────────────┐
│ Global Load Balancer (Route53) │
│ Health Checks + Failover Routing │
└────────────────┬─────────────────────────┬───────────────────┘
│ │
┌───────────▼───────────┐ ┌────────▼────────────┐
│ PRIMARY REGION │ │ SECONDARY REGION │
│ (us-east-1) │ │ (us-west-2) │
│ │ │ │
│ ┌─────────────────┐ │ │ ┌───────────────┐ │
│ │ FastAPI Cluster │ │ │ │ FastAPI (Hot) │ │
│ │ (Active) │ │ │ │ (Standby) │ │
│ └────────┬────────┘ │ │ └───────┬───────┘ │
│ │ │ │ │ │
│ ┌────────▼────────┐ │ │ ┌───────▼───────┐ │
│ │ Citus Primary │ │===│==│ Citus Replica │ │
│ │ (Active Writer) │ │ │ │ (Async Rep) │ │
│ └────────┬────────┘ │ │ └───────┬───────┘ │
│ │ │ │ │ │
│ ┌────────▼────────┐ │ │ ┌───────▼───────┐ │
│ │ Redis Primary │ │===│==│ Redis Replica │ │
│ └─────────────────┘ │ │ └───────────────┘ │
│ │ │ │
│ ┌─────────────────┐ │ │ ┌───────────────┐ │
│ │ S3 Bucket │ │===│==│ S3 Replica │ │
│ │ (Backups/Assets)│ │ │ │ (Cross-Region)│ │
│ └─────────────────┘ │ │ └───────────────┘ │
└───────────────────────┘ └─────────────────────┘
Replication Flow:
- Database: Asynchronous streaming replication (5-10 second lag)
- Redis: Redis Sentinel with async replication
- S3: Cross-region replication (CRR) enabled
- Application: Stateless, can run in both regions simultaneously
7.8.3 Failover Strategies¶
1. Automated Failover (Preferred)
Triggers: - Primary region health checks fail for >60 seconds - Database replication lag >30 seconds - API error rate >5% for 5 minutes - Manual trigger via admin dashboard
Failover Process:
1. Route53 detects primary region failure (health check)
2. DNS failover to secondary region (TTL: 60 seconds)
3. Promote secondary database to primary (automated via Patroni)
4. Update application config to point to new primary DB
5. Scale up secondary region FastAPI instances
6. Send alerts to incident response team
7. Update status page (https://status.mbpanel.io)
Total Time: <5 minutes
Automated Failover Script:
#!/bin/bash
# Disaster Recovery Automated Failover Script

# 1. Health Check Failure Detection
if [[ $(check_primary_region_health) == "FAILED" ]]; then
    echo "PRIMARY REGION FAILURE DETECTED"

    # 2. Promote Secondary Database
    patronictl failover --candidate secondary-db-1 --force

    # 3. Update DNS
    aws route53 change-resource-record-sets \
        --hosted-zone-id Z1234567890ABC \
        --change-batch file://failover-dns.json

    # 4. Scale Secondary Region
    kubectl scale deployment fastapi --replicas=20 -n us-west-2

    # 5. Update Application Config
    kubectl set env deployment/fastapi \
        DATABASE_HOST=secondary-db-1.us-west-2.rds.amazonaws.com

    # 6. Notify Incident Response Team
    curl -X POST https://api.pagerduty.com/incidents \
        -H "Authorization: Token $PAGERDUTY_TOKEN" \
        -d '{"incident": {"type": "incident", "title": "PRIMARY REGION FAILOVER"}}'

    # 7. Update Status Page
    curl -X POST https://api.statuspage.io/v1/pages/$PAGE_ID/incidents \
        -d "status=investigating&impact=major&name=Regional+Failover+in+Progress"

    echo "FAILOVER COMPLETE"
fi
2. Manual Failover
Use Cases:
- Planned maintenance in primary region
- Degraded performance (not full outage)
- Compliance testing
Process:
1. Enable maintenance mode (read-only for 2 minutes)
2. Ensure secondary DB is <5 seconds behind primary
3. Manually promote secondary DB to primary
4. Update DNS records (gradual weighted routing shift)
5. Monitor error rates and latency
6. Full cutover after validation (15-minute window)
7.8.4 Data Replication Strategy¶
Database Replication (Citus):
Primary Region (us-east-1):
├── Coordinator (primary)
│ ├── Standby 1 (synchronous, same AZ)
│ └── Standby 2 (asynchronous, us-west-2) ← DR replica
├── Worker Shard 1 (primary)
│ ├── Standby 1 (synchronous, same AZ)
│ └── Standby 2 (asynchronous, us-west-2)
... (repeat for all shards)
Replication Lag Monitoring:
- Alert if lag >10 seconds (warning)
- Alert if lag >30 seconds (critical)
- Automatic failover if lag >60 seconds + primary failure
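The tiers above can be expressed as a small classifier. This is an illustrative sketch, not production code; the function name and return labels are assumptions, and real monitoring would read the lag from `pg_stat_replication`.

```python
# Sketch: map replication lag to the alert tiers defined above.
# Thresholds mirror the PRD; function and labels are illustrative only.

def classify_replication_lag(lag_seconds: float, primary_healthy: bool = True) -> str:
    """Return the alert tier for a given replication lag in seconds."""
    if lag_seconds > 60 and not primary_healthy:
        return "FAILOVER"   # lag >60s combined with primary failure
    if lag_seconds > 30:
        return "CRITICAL"   # prepare for failover
    if lag_seconds > 10:
        return "WARNING"    # investigate primary DB load
    return "OK"
```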
Redis Replication:
- Redis Sentinel with 3-node quorum
- Asynchronous replication to secondary region
- Acceptable data loss: Last 5 seconds of cache/session data
- Session replay from database if Redis unavailable
Object Storage (S3) Replication:
- Cross-region replication (CRR) enabled
- Replication time: <15 minutes for 95% of objects
- Versioning enabled (30-day retention)
- Lifecycle policies to archive old backups to Glacier
7.8.5 Backup Strategy¶
Backup Tiers:
| Backup Type | Frequency | Retention | Storage Location | RTO | RPO |
|---|---|---|---|---|---|
| Hot Backups | Continuous WAL | 7 days | Primary + Secondary S3 | <5 min | <1 min |
| Warm Backups | Daily full DB dump | 30 days | S3 Standard | <2 hours | 24 hours |
| Cold Backups | Weekly full snapshot | 1 year | S3 Glacier | <24 hours | 7 days |
| Archive | Monthly compliance | 7 years | S3 Glacier Deep Archive | <48 hours | 30 days |
Backup Validation:
- Automated daily backup restoration tests in isolated environment
- Monthly full DR drill: Restore complete production environment from backups
- Quarterly chaos engineering: Simulate regional failures during business hours
Backup Encryption:
- AES-256 encryption at rest (AWS KMS)
- Encryption keys stored in multi-region AWS Secrets Manager
- Key rotation every 90 days
- Backup integrity checks using SHA-256 checksums
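The integrity-check step can be sketched with the standard library; file I/O is reduced to bytes here for illustration, and the helper names are not from the codebase.

```python
# Sketch: SHA-256 backup integrity checks, as described above.
import hashlib

def backup_checksum(data: bytes) -> str:
    """Compute the SHA-256 checksum recorded alongside a backup."""
    return hashlib.sha256(data).hexdigest()

def verify_backup(data: bytes, expected_checksum: str) -> bool:
    """Re-hash the restored backup and compare against the stored checksum."""
    return backup_checksum(data) == expected_checksum
```

In practice the checksum would be computed over the backup stream as it is written to S3 and stored as object metadata.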
7.8.6 DR Testing and Validation¶
Test Frequency:

| Test Type | Frequency | Duration | Scope |
|---|---|---|---|
| Automated Failover Test | Weekly | 15 min | DNS + Application layer |
| Database Failover Test | Monthly | 1 hour | Full database promotion |
| Full DR Drill | Quarterly | 4 hours | Complete regional failover |
| Chaos Engineering | Bi-annual | 8 hours | Random component failures |
Full DR Drill Runbook:
1. Pre-Drill Preparation (1 hour before)
- Notify all stakeholders
- Enable enhanced monitoring
- Take pre-test snapshots
- Document current system state
2. Simulated Regional Failure (T+0)
- Artificially fail primary region health checks
- Trigger automated failover scripts
- Monitor failover process
3. Validation Phase (T+15 min)
- Verify DNS propagation (<5 min)
- Check database promotion success
- Test read/write operations
- Validate application functionality
4. Recovery Phase (T+2 hours)
- Fail back to primary region
- Verify data consistency
- Reconcile any data conflicts
5. Post-Drill Review (T+4 hours)
- Document RTO/RPO achieved
- Identify improvement areas
- Update DR playbooks
- Generate DR report for stakeholders
7.8.7 Split-Brain Prevention¶
Problem: Network partition causes both regions to believe they are primary, leading to data conflicts.
Prevention Mechanisms:
1. Fencing (STONITH - Shoot The Other Node In The Head)
2. Distributed Consensus (etcd)
   - 5-node etcd cluster across 3 regions (2 primary, 2 secondary, 1 tiebreaker in eu-west-1)
   - Requires quorum (3/5 nodes) for leader election
   - Network partition automatically isolates minority partition
3. Write Fencing
   - Only region with etcd quorum can accept writes
   - Minority partition automatically enters read-only mode
   - Application layer enforces "write to primary region only" rule
Split-Brain Detection:
```python
# Application-level split-brain detection
def check_split_brain():
    primary_healthy = check_region_health("us-east-1")
    secondary_healthy = check_region_health("us-west-2")
    # Both regions being reachable is normal; split brain is when both
    # partitions are accepting writes at the same time.
    if primary_healthy and secondary_healthy and both_accepting_writes():
        alert_critical("SPLIT BRAIN DETECTED")
        enter_read_only_mode(determine_minority_region())
        page_dba_oncall()
```
7.8.8 Data Consistency After Failover¶
Conflict Resolution Strategy:

1. No Conflicts (99.9% of cases)
   - Asynchronous replication lag <5 seconds
   - Acceptable data loss: Last 5 seconds of writes
   - Lost writes logged for manual review
2. Conflict Detection
   - Compare WAL sequence numbers between regions
   - Identify divergent transactions
   - Flag conflicting records in `conflict_resolution` table
3. Resolution Process

```sql
-- Automated conflict resolution (Last Write Wins)
SELECT resolve_conflicts(
    resolution_strategy := 'LAST_WRITE_WINS',
    timestamp_field := 'updated_at'
);

-- Manual conflict resolution for critical records
SELECT * FROM conflict_resolution
WHERE status = 'PENDING_MANUAL_REVIEW'
ORDER BY severity DESC;
```
7.8.9 Monitoring and Alerting for DR¶
Critical DR Metrics:
| Metric | Threshold | Alert Severity | Action |
|---|---|---|---|
| Replication lag | >10s | Warning | Investigate primary DB load |
| Replication lag | >30s | Critical | Prepare for failover |
| Cross-region latency | >100ms | Warning | Check network connectivity |
| Backup failure | 1 failed backup | Critical | Immediate investigation |
| DR test failure | Any failure | High | Update DR playbooks |
| etcd cluster health | <3 nodes | Critical | Restore quorum immediately |
Monitoring Dashboard:
```yaml
# Grafana DR Dashboard
Panels:
  - Primary Region Health (Green/Yellow/Red)
  - Secondary Region Health (Green/Yellow/Red)
  - Replication Lag (real-time graph, last 24h)
  - Last Successful Failover Test (timestamp)
  - Last Successful Backup (per shard)
  - Cross-Region Latency (p50/p95/p99)
  - DR Drill Results (historical trend)
```
7.8.10 Cost Optimization for Multi-Region DR¶
Cost Breakdown (Estimated for 1M users):
| Component | Primary Region | Secondary Region | Monthly Cost |
|---|---|---|---|
| FastAPI Instances | 20 instances (active) | 5 instances (standby) | $3,000 + $750 |
| Database (Citus) | 8 workers + replicas | 8 async replicas | $7,930 + $3,200 |
| Redis | 3-node cluster | 3-node replica | $600 + $300 |
| S3 Storage | 50TB | 50TB (CRR) | $1,150 + $1,150 |
| Data Transfer | Egress fees | Cross-region transfer | $500 |
| TOTAL | | | $18,580/month |
Cost per User: $0.0186/month
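As a sanity check, the table's line items can be totaled and divided by the user base; the numbers come straight from the table above.

```python
# Verify the DR cost table above (values in USD/month).
monthly_costs = {
    "fastapi": 3000 + 750,       # active + standby instances
    "database": 7930 + 3200,     # Citus workers + async replicas
    "redis": 600 + 300,          # cluster + replica
    "s3": 1150 + 1150,           # primary storage + CRR copy
    "data_transfer": 500,        # egress + cross-region transfer
}
total = sum(monthly_costs.values())   # 18,580
cost_per_user = total / 1_000_000     # ~$0.0186/month for 1M users
```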
Optimization Strategies:
1. Right-size secondary region: Run at 25% capacity (scale on failover)
2. Intelligent data tiering: Replicate only critical data in real-time
3. Backup compression: Reduce S3 storage by 60% with Zstandard compression
4. Spot instances for DR testing: 70% cost reduction for quarterly drills
5. Incremental backups: Reduce cross-region transfer costs
7.8.11 Compliance and Regulatory Considerations¶
GDPR (EU Users):
- Tertiary region in eu-west-1 for EU user data
- Data residency guarantees (EU data stays in EU)
- Cross-region replication only for EU → EU regions
- Right to be forgotten: Automated data purge across all regions

SOC 2 Type II:
- Documented DR procedures (this PRD section)
- Quarterly DR testing with audit trails
- Incident response playbooks
- Change management for DR infrastructure

HIPAA (if applicable):
- Encrypted backups (AES-256)
- Access logging for all DR operations
- Business Associate Agreements (BAAs) with cloud providers
7.8.12 Runbook: Common DR Scenarios¶
Scenario 1: Complete Primary Region Failure
1. Automatic detection via health checks (60s)
2. Patroni promotes secondary DB coordinator (30s)
3. Route53 DNS failover (60s TTL propagation)
4. Scale up secondary region FastAPI (2 min)
5. Validate application functionality (5 min)
Total RTO: <10 minutes
Scenario 2: Database Corruption
1. Identify corruption timestamp
2. Restore nearest hourly backup before corruption
3. Replay WAL logs up to corruption point
4. Validate data integrity
5. Resume normal operations
Total RTO: <30 minutes, RPO: <1 hour
Scenario 3: Ransomware Attack
1. Isolate compromised systems immediately
2. Restore from immutable backup (S3 Object Lock)
3. Deploy clean infrastructure in secondary region
4. Forensic analysis in parallel
5. Gradual traffic migration after validation
Total RTO: <4 hours, RPO: <24 hours
7.9 Enterprise Compliance Framework (GDPR, SOC 2, HIPAA)¶
For an enterprise-grade WordPress hosting platform serving millions of users globally, compliance with international data protection regulations and industry standards is mandatory. This section defines our comprehensive compliance framework.
7.9.1 GDPR Compliance (General Data Protection Regulation)¶
Scope: All users in the European Economic Area (EEA) + UK
Key Requirements:
- Lawful Basis for Processing
  - Consent: Explicit consent for marketing communications
  - Contract: Processing necessary for service delivery
  - Legitimate Interest: Fraud prevention, system security
- Data Subject Rights Implementation
| Right | Implementation | API Endpoint | Response Time |
|---|---|---|---|
| Right to Access (Art. 15) | User data export (JSON/CSV) | GET /api/v1/gdpr/data-export | <48 hours |
| Right to Rectification (Art. 16) | Profile update API | PATCH /api/v1/users/{id} | Real-time |
| Right to Erasure (Art. 17) | Account deletion with cascading purge | DELETE /api/v1/users/{id}/gdpr-erase | <30 days |
| Right to Data Portability (Art. 20) | Machine-readable export | GET /api/v1/gdpr/portability | <48 hours |
| Right to Object (Art. 21) | Marketing opt-out | POST /api/v1/gdpr/object-processing | Real-time |
| Right to Restriction (Art. 18) | Suspend processing flag | POST /api/v1/gdpr/restrict-processing | <24 hours |
Implementation Example:
```python
# GDPR Data Erasure Implementation
@router.delete("/users/{user_id}/gdpr-erase")
async def gdpr_erase_user(user_id: int, request: GDPREraseRequest):
    """
    Right to Erasure (Article 17 GDPR)
    Permanently delete user data across all systems
    """
    # 1. Verify user identity (2FA required)
    await verify_user_identity(user_id, request.verification_code)

    # 2. Log erasure request for audit trail
    await audit_log.log_event(
        event_type="GDPR_ERASURE_REQUESTED",
        user_id=user_id,
        ip_address=request.ip,
        timestamp=datetime.utcnow(),
    )

    # 3. Schedule asynchronous data purge
    await erasure_queue.enqueue(
        task="purge_user_data",
        user_id=user_id,
        cascade=True,  # Delete environments, backups, logs
    )

    # 4. Anonymize logs (replace PII with hashed user_id)
    await anonymize_historical_logs(user_id)

    # 5. Notify DPO (Data Protection Officer)
    await notify_dpo(user_id, "GDPR_ERASURE_INITIATED")

    return {"status": "accepted", "completion_eta": "30 days"}
```
- Data Residency and Cross-Border Transfers
  - EU Data Stays in EU: eu-west-1 region for EU users
  - Standard Contractual Clauses (SCCs): For US cloud provider (AWS)
  - Data Transfer Impact Assessment (DTIA): Annual review
  - No data transfers to non-adequate countries without explicit consent
- Consent Management

```python
# Granular consent tracking
consent_categories = {
    "essential": True,                                    # Always required for service
    "analytics": user.consent_analytics,                  # Optional
    "marketing": user.consent_marketing,                  # Optional
    "third_party_integrations": user.consent_integrations # Optional
}

# Cookie consent banner check
def check_consent(user_id: int, category: str) -> bool:
    consent = get_user_consent(user_id)
    if category == "essential":
        return True
    return consent.get(category, False)
```
- Data Breach Notification
  - Detection: Automated monitoring for unauthorized access
  - Assessment: <24 hours to determine breach severity
  - Supervisory Authority Notification: <72 hours if high risk
  - User Notification: Immediate if high risk to rights and freedoms
  - Breach Register: Maintained for all incidents (regardless of notification)
- Privacy by Design and Default
  - Pseudonymization of logs (no PII in application logs)
  - Data minimization (only collect necessary fields)
  - Encryption at rest and in transit (TLS 1.3, AES-256)
  - Default privacy settings (marketing opt-in, not opt-out)
- Data Protection Impact Assessment (DPIA)
  - Required for high-risk processing (millions of user records)
  - Annual DPIA review
  - Document processing activities in ROPA (Record of Processing Activities)
7.9.2 SOC 2 Type II Compliance¶
Scope: Trust Services Criteria - Security, Availability, Confidentiality, Processing Integrity
Audit Timeline:
- Year 1: SOC 2 Type I (point-in-time assessment)
- Year 2: SOC 2 Type II (12-month continuous monitoring)
- Annual Re-certification: Required
1. Security (Common Criteria)
| Control | Implementation | Evidence |
|---|---|---|
| CC1.1: Control Environment | Security policies, code of conduct | Policy documents, training records |
| CC2.1: Risk Assessment | Quarterly risk assessments | Risk register, mitigation plans |
| CC3.1: Logical Access | RBAC, MFA enforcement | Access logs, user provisioning audit |
| CC4.1: Monitoring | Prometheus, Grafana, PagerDuty | Alert history, incident reports |
| CC5.1: Change Management | Git workflow, CI/CD approvals | Pull request logs, deployment audit |
| CC6.1: Encryption | TLS 1.3, AES-256 at rest | Encryption policy, key rotation logs |
| CC7.1: Incident Response | Incident playbooks, on-call rotation | Incident timeline, post-mortems |
2. Availability (Uptime Commitment)
- SLA: 99.99% uptime (52.56 min downtime/year)
- Monitoring: 24/7 automated health checks
- Incident Response: <15 min response time for critical issues
- Evidence: Uptime dashboards, incident response logs

3. Confidentiality
- Data Classification: Public, Internal, Confidential, Restricted
- Access Controls: Need-to-know basis, least privilege
- Encryption: Field-level encryption for PII (SSN, payment info)
- Evidence: Data classification policy, encryption audit

4. Processing Integrity
- Input Validation: All API inputs validated (Pydantic models)
- Data Integrity Checks: Checksums for backups, DB constraints
- Error Handling: Graceful degradation, no data corruption
- Evidence: Test coverage reports, integrity check logs

5. Privacy (if applicable)
- Notice: Privacy policy, cookie consent
- Choice: Opt-in/opt-out mechanisms
- Collection: Data minimization
- Evidence: GDPR compliance documentation
SOC 2 Audit Evidence Collection:
```python
# Automated evidence collection for SOC 2
async def collect_soc2_evidence(audit_period: str):
    evidence = {
        "access_logs": await export_access_logs(audit_period),
        "change_management": await export_git_pr_history(audit_period),
        "incident_response": await export_pagerduty_incidents(audit_period),
        "backup_validation": await export_backup_test_results(audit_period),
        "encryption_audits": await export_encryption_compliance(audit_period),
        "security_training": await export_training_completion(audit_period),
        "vendor_assessments": await export_vendor_security_reviews(audit_period),
    }
    return evidence
```
7.9.3 HIPAA Compliance (If Hosting Healthcare WordPress Sites)¶
Note: HIPAA compliance is required only if the platform explicitly offers healthcare hosting services; otherwise it is out of scope.
Key Requirements:
- Business Associate Agreement (BAA)
  - Required contract with all customers handling PHI
  - Subcontractor BAAs with AWS, Jelastic
- Physical Safeguards
  - AWS data centers (SOC 2, ISO 27001 certified)
  - Badge access, video surveillance (delegated to AWS)
- Technical Safeguards
  - Access Control: Unique user IDs, automatic logoff (15 min inactivity)
  - Audit Controls: Comprehensive logging of PHI access
  - Integrity Controls: Checksums for PHI transmission
  - Transmission Security: TLS 1.3 only, no TLS 1.2
- Administrative Safeguards
  - Security Officer: Designated HIPAA Security Officer
  - Training: Annual HIPAA training for all staff
  - Risk Analysis: Annual HIPAA risk assessment
  - Incident Response: <60 days breach notification
- PHI Encryption

```python
# HIPAA-compliant field-level encryption
from cryptography.fernet import Fernet

def encrypt_phi(plaintext: str, patient_id: int) -> str:
    """Encrypt Protected Health Information (PHI)"""
    key = get_kms_key(f"phi-encryption-{patient_id}")
    cipher = Fernet(key)
    encrypted = cipher.encrypt(plaintext.encode())
    # Audit log
    log_phi_access(
        action="ENCRYPT",
        patient_id=patient_id,
        user_id=get_current_user_id(),
        timestamp=datetime.utcnow(),
    )
    return encrypted.decode()
```
7.9.4 PCI DSS Compliance (Payment Card Industry)¶
Scope: If processing credit cards for WooCommerce sites
Recommended Approach: Avoid PCI scope by using payment processors (Stripe, PayPal)
- Never store credit card numbers
- Tokenization via Stripe/PayPal
- No PCI audit required (out of scope)

If PCI Compliance Required:
- Level: PCI DSS Level 2 (1M-6M transactions/year)
- Requirements: 12 requirements, 78 sub-requirements
- Quarterly: Vulnerability scans (ASV - Approved Scanning Vendor)
- Annual: Penetration testing, onsite audit
7.9.5 ISO 27001 Certification (Optional)¶
Timeline: 18-24 months to certification
Benefits:
- Global recognition for information security
- Competitive advantage for enterprise customers
- Comprehensive ISMS (Information Security Management System)

Key Requirements:
- 114 controls across 14 domains
- Annual surveillance audits
- Continual improvement process
7.9.6 Compliance Monitoring and Reporting¶
1. Automated Compliance Dashboard
```yaml
# Grafana Compliance Dashboard
Metrics:
  - GDPR Requests (Access, Erasure, Portability) - Last 30 days
  - SOC 2 Control Effectiveness (Pass/Fail) - Real-time
  - Data Residency Compliance (% EU data in EU region) - Real-time
  - Encryption Coverage (% encrypted data) - Real-time
  - Access Control Violations (failed auth attempts) - Last 24h
  - Incident Response Time (p95) - Last 90 days
  - Backup Success Rate - Last 30 days
  - Compliance Training Completion - Current status
```
2. Compliance Audit Trail
```python
# Immutable audit log for compliance
@dataclass
class ComplianceAuditEvent:
    event_id: str               # UUID
    timestamp: datetime
    event_type: str             # GDPR_ACCESS, SOC2_CONTROL_TEST, etc.
    user_id: Optional[int]
    ip_address: str
    action: str
    result: str                 # SUCCESS, FAILURE, PARTIAL
    evidence_s3_path: str       # Link to evidence
    digital_signature: str      # SHA-256 hash for tamper-proofing

# Write to append-only log (no deletes/updates allowed)
await append_to_audit_log(event)
```
3. Quarterly Compliance Reports
- Audience: CISO, DPO, Board of Directors
- Contents:
  - GDPR requests summary (volume, response times)
  - SOC 2 control test results
  - Security incidents and resolutions
  - Compliance training completion rates
  - Vendor security assessments
  - Recommendations for next quarter
7.9.7 Third-Party Vendor Compliance¶
Vendor Risk Management:
| Vendor | Service | Compliance Certs | Assessment Frequency |
|---|---|---|---|
| AWS | Infrastructure | SOC 2, ISO 27001, PCI DSS | Annual |
| Jelastic | Container Platform | SOC 2 (in progress) | Annual |
| Stripe | Payment Processing | PCI DSS Level 1 | Annual |
| SendGrid | Email Delivery | SOC 2 | Annual |
| PagerDuty | Incident Management | SOC 2 | Annual |
Vendor Assessment Process:
1. Request SOC 2 report or equivalent
2. Review security questionnaire (SIG Lite)
3. Assess data access and residency
4. Document in vendor register
5. Annual re-assessment
7.9.8 Compliance Training Program¶
Training Matrix:
| Role | Training Required | Frequency | Completion Target |
|---|---|---|---|
| All Staff | Security Awareness | Annual | 100% |
| Developers | Secure Coding (OWASP Top 10) | Annual | 100% |
| DevOps | Infrastructure Security | Annual | 100% |
| Support | GDPR Data Subject Rights | Quarterly | 100% |
| Management | Compliance Overview | Annual | 100% |
| DPO/Security Team | Advanced GDPR, SOC 2 | Bi-annual | 100% |
Training Tracking:
```python
# Automated training reminders
async def check_training_compliance():
    overdue_users = await get_users_with_overdue_training()
    for user in overdue_users:
        await send_training_reminder(user)
        # Escalate if 30 days overdue
        if user.training_overdue_days > 30:
            await notify_manager(user.manager_id, user)
```
7.9.9 Compliance Cost Estimation¶
Annual Compliance Budget (Estimated):
| Item | Cost (Annual) |
|---|---|
| SOC 2 Type II Audit | $40,000 - $80,000 |
| GDPR DPO (Consultant or FTE) | $80,000 - $150,000 |
| Compliance Software (Vanta, Drata) | $20,000 |
| Penetration Testing | $30,000 |
| Security Training Platform | $10,000 |
| Legal Review (Policies, Contracts) | $20,000 |
| TOTAL | $200,000 - $310,000 |
Cost per User: $0.20 - $0.31 per year for 1M users (approximately $0.017 - $0.026/month)
7.9.10 Compliance Roadmap¶
Year 1 (Months 1-12):
- Q1: Implement GDPR data subject rights APIs
- Q2: SOC 2 Type I audit preparation
- Q3: SOC 2 Type I audit execution
- Q4: Begin SOC 2 Type II observation period

Year 2 (Months 13-24):
- Q1-Q4: SOC 2 Type II continuous monitoring
- Q4: SOC 2 Type II audit and certification

Year 3+:
- Maintain SOC 2 certification (annual)
- Optional: ISO 27001 certification
- Optional: HIPAA if expanding to healthcare vertical
7.9.11 Compliance Incident Response¶
Scenario: GDPR Data Breach
1. Detection (T+0)
- Automated anomaly detection alerts security team
- Example: Unauthorized bulk data export
2. Assessment (T+1 hour)
- Determine scope: How many users affected?
- Determine sensitivity: PII, financial data, health data?
- Classify severity: High/Medium/Low risk
3. Containment (T+2 hours)
- Revoke compromised credentials
- Block unauthorized access
- Preserve forensic evidence
4. Notification Decision (T+24 hours)
- If >5,000 users + sensitive data: Notify supervisory authority
- If high risk to user rights: Notify affected users immediately
5. Supervisory Authority Notification (T+48-72 hours)
- Submit to lead supervisory authority (Irish DPC for EU)
- Include: Nature of breach, affected users, measures taken
- Template: GDPR Article 33 notification form
6. User Notification (T+72 hours if required)
- Email to affected users
- Clear explanation of breach and mitigation steps
- Offer free identity protection services if financial data
7. Post-Incident Review (T+2 weeks)
- Root cause analysis
- Update security controls
- Document lessons learned
- Report to board of directors
8. Detailed Component Specifications¶
8.1 Authentication & Authorization System¶
- JWT Implementation: Access tokens (15-min expiry), refresh tokens (7-day expiry)
- Multi-factor Authentication: Support for TOTP and SMS-based MFA
- Social Login: OAuth integration for Google, GitHub, and Microsoft
- Session Management: Redis-based session storage with configurable TTL
- API Key Management: Per-user API key generation with usage tracking
- Role Management: Hierarchical role system with granular permissions
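The access/refresh token policy above can be illustrated with a minimal stdlib sketch. This is not the planned implementation (which would use a proper JWT library and asymmetric or rotated keys); `issue_token`, `verify_token`, and the demo secret are assumptions for illustration only.

```python
# Sketch: signed tokens with 15-min access / 7-day refresh expiry.
# A real deployment would use a JWT library; names here are illustrative.
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SECRET = b"demo-secret"  # placeholder; load from a secrets manager in practice

def issue_token(user_id: int, ttl_seconds: int) -> str:
    """Create a signed token carrying the user id and an expiry timestamp."""
    payload = json.dumps({"sub": user_id, "exp": time.time() + ttl_seconds})
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(payload.encode()).decode() + "." + sig

def verify_token(token: str) -> Optional[dict]:
    """Return the claims if the signature is valid and the token unexpired."""
    body, _, sig = token.rpartition(".")
    payload = base64.urlsafe_b64decode(body.encode()).decode()
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None  # tampered or wrong key
    claims = json.loads(payload)
    return claims if claims["exp"] > time.time() else None  # None if expired

access_token = issue_token(42, ttl_seconds=15 * 60)          # 15-min access token
refresh_token = issue_token(42, ttl_seconds=7 * 24 * 3600)   # 7-day refresh token
```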
8.2 Database Migration (MySQL to PostgreSQL)¶
- Schema Migration: 45+ Laravel MySQL tables to PostgreSQL optimized schema
- Data Migration: Complete data transfer with validation and integrity checks
- Performance Optimization: PostgreSQL-specific optimizations (indexes, partitioning)
- Migration Strategy: Schema-first approach with data validation
- Rollback Plan: Complete rollback capability with data integrity checks
- Performance Targets: 50% improvement in common query execution times
8.3 Job Queue System (ARQ-based)¶
- Queue Architecture: Redis-backed ARQ workers for background processing
- Job Types: 48 different job types migrated from Laravel queues
- Priority System: Multi-level priority queue for critical operations
- Progress Tracking: Real-time progress updates for long-running jobs
- Error Handling: Automatic retry with exponential backoff
- Dead Letter Queue: Fallback queue for failed jobs with manual processing
- Idempotency: Task-level idempotency keys; deduplication window and replay protection
- Exactly-Once Semantics (Best-Effort): Use outbox pattern for external side-effects
- Tracing: Correlation IDs propagated across HTTP → job → WebSocket notifications
- Safety: Per-queue concurrency limits and rate limits; circuit breakers for flaky integrations
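The idempotency-key deduplication described above can be sketched as follows. In production this state would live in Redis with TTLs; the `IdempotencyGuard` class and in-memory dict are illustrative stand-ins.

```python
# Sketch: task-level idempotency keys with a deduplication window.
# Redis would hold this state in production; a dict stands in here.
import time
from typing import Optional

class IdempotencyGuard:
    def __init__(self, window_seconds: float = 3600):
        self.window = window_seconds
        self._seen = {}  # idempotency key -> first-seen timestamp

    def should_run(self, key: str, now: Optional[float] = None) -> bool:
        """Return True the first time a key is seen inside the window."""
        now = time.time() if now is None else now
        first_seen = self._seen.get(key)
        if first_seen is not None and now - first_seen < self.window:
            return False  # duplicate enqueue within the window: skip
        self._seen[key] = now
        return True
```

A worker would call `should_run(job.idempotency_key)` before executing, making duplicate enqueues (retries, double-clicks) harmless.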
8.4 Real-time WebSocket System¶
8.4.1 Executive Summary¶
The WebSocket system provides real-time bidirectional communication for alerts, notifications, job completion updates, and live feature updates. Built using FastAPI's native WebSocket support with Redis Pub/Sub for horizontal scaling, this system replaces the legacy Node.js WebSocket server while maintaining feature parity and improving performance.
Key Capabilities:
- Real-time alerts and notifications delivery
- Job progress tracking and completion notifications
- Live updates for site creation, backups, deployments, and domain operations
- Multi-tenant channel architecture with team-scoped authorization
- Message durability with Redis Streams for offline message replay
- Presence tracking for user online/offline status
- Horizontal scaling across multiple FastAPI instances without sticky sessions
8.4.2 Problem Statement¶
Current Challenges:
- Legacy Node.js WebSocket server requires separate infrastructure and deployment
- Limited scalability due to in-memory connection state
- No message durability - messages lost if client disconnects
- Complex integration between Laravel backend and Node.js WebSocket server
- No standardized channel authorization model

Business Requirements:
- Real-time notifications for critical operations (backups, deployments, SSL certificates)
- Job progress updates for long-running operations (environment creation, WordPress installation)
- Alert system for system events and user actions
- Multi-device support (user can have multiple active connections)
- Message replay capability for missed notifications
8.4.3 Technical Requirements¶
Performance Requirements:
- WebSocket connection establishment: <100ms
- Message broadcast latency (p95): <250ms across all instances
- Support 10,000+ concurrent WebSocket connections per instance
- Message throughput: 50,000+ messages/second across cluster
- Heartbeat interval: 30 seconds (configurable)
- Connection timeout: 5 minutes of inactivity
Scalability Requirements:
- Horizontal scaling: Support 10+ FastAPI instances behind load balancer
- No sticky sessions required - any instance can handle any connection
- Redis Pub/Sub for cross-instance message propagation
- Connection state externalized to Redis (no in-memory state)
- Channel subscription management via Redis Sets
Reliability Requirements:
- Message durability: Redis Streams with consumer groups
- Message replay: Support replaying missed messages up to 24 hours
- Automatic reconnection: Client-side exponential backoff
- Graceful degradation: Fallback to polling if WebSocket unavailable
- Connection health monitoring with automatic cleanup
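The client-side exponential backoff mentioned above can be sketched in a few lines; the base delay and cap are illustrative parameters, not values prescribed by this PRD (a real client would also add jitter).

```python
# Sketch: reconnection delays with exponential backoff and a cap.
# Base/cap values are illustrative; production clients should add jitter.

def backoff_schedule(attempts: int, base: float = 1.0, cap: float = 30.0) -> list:
    """Delay in seconds before each reconnection attempt: 1, 2, 4, ... capped."""
    return [min(base * (2 ** n), cap) for n in range(attempts)]
```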
Security Requirements:
- JWT-based authentication for WebSocket connections
- Team-scoped channel authorization (users can only subscribe to their team channels)
- Rate limiting: Max 5 connections per user, 100 messages/minute per connection
- Input validation: All incoming messages validated against Pydantic schemas
- Message size limits: Max 64KB per message
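The per-connection message limit above (100 messages/minute) can be modeled as a sliding-window counter. This is a minimal in-memory sketch; in the target architecture this state would live in Redis so any instance can enforce it, and the class name is illustrative.

```python
# Sketch: per-connection message rate limit via a sliding-window counter.
# Redis would hold this state in production; a deque stands in here.
from collections import deque

class MessageRateLimiter:
    def __init__(self, limit: int = 100, window_seconds: float = 60.0):
        self.limit = limit
        self.window = window_seconds
        self._timestamps = deque()  # send times inside the current window

    def allow(self, now: float) -> bool:
        """Record a message at time `now`; False if over the per-window limit."""
        # Drop timestamps that have aged out of the window
        while self._timestamps and now - self._timestamps[0] >= self.window:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.limit:
            return False
        self._timestamps.append(now)
        return True
```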
8.4.4 Architectural Overview¶
High-Level Architecture:
┌─────────────────────────────────────────────────────────────────┐
│ FastAPI Backend Instances │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Instance 1 │ │ Instance 2 │ │ Instance N │ │
│ │ │ │ │ │ │ │
│ │ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │
│ │ │WebSocket │ │ │ │WebSocket │ │ │ │WebSocket │ │ │
│ │ │ Module │ │ │ │ Module │ │ │ │ Module │ │ │
│ │ └────┬─────┘ │ │ └────┬─────┘ │ │ └────┬─────┘ │ │
│ └──────┼───────┘ └──────┼───────┘ └──────┼───────┘ │
│ │ │ │ │
│ └─────────────────┼─────────────────┘ │
│ │ │
└───────────────────────────┼────────────────────────────────────┘
│
┌───────────▼───────────┐
│ Redis Cluster │
│ │
│ ┌─────────────────┐ │
│ │ Pub/Sub │ │ Cross-instance messaging
│ └─────────────────┘ │
│ ┌─────────────────┐ │
│ │ Streams │ │ Message durability
│ └─────────────────┘ │
│ ┌─────────────────┐ │
│ │ Sets │ │ Channel subscriptions
│ └─────────────────┘ │
│ ┌─────────────────┐ │
│ │ Strings │ │ Presence tracking
│ └─────────────────┘ │
└───────────────────────┘
│
┌───────────▼───────────┐
│ PostgreSQL │
│ (for audit logs) │
└───────────────────────┘
Channel Architecture:
Channels follow a hierarchical naming pattern: `{scope}.{identifier}.{feature}`
Channel Types:
- User Channels: user.{userId} - Personal notifications and alerts
- Team Channels: team.{teamId}.{feature} - Team-scoped updates
- team.{teamId}.jobs - Job status updates
- team.{teamId}.backups - Backup progress and completion
- team.{teamId}.activities - Activity feed updates
- team.{teamId}.sites.creating - Site creation progress
- team.{teamId}.domains - Domain/SSL updates
- team.{teamId}.favourites - Favourite changes
- System Channels: system.{event} - System-wide announcements (admin only)
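The channel types above imply a simple authorization rule: users may subscribe to their own `user.*` channel, to channels of teams they belong to, and (admins only) to `system.*`. A minimal sketch of that check, with an assumed function name:

```python
# Sketch: team-scoped channel authorization for the {scope}.{identifier}.{feature}
# naming pattern above; `can_subscribe` is an illustrative helper name.

def can_subscribe(channel: str, user_id: int, team_ids: set, is_admin: bool = False) -> bool:
    """Allow user channels for self, team channels for member teams, system for admins."""
    parts = channel.split(".")
    scope = parts[0]
    if scope == "user" and len(parts) >= 2:
        return parts[1] == str(user_id)
    if scope == "team" and len(parts) >= 2:
        return parts[1].isdigit() and int(parts[1]) in team_ids
    if scope == "system":
        return is_admin
    return False  # unknown scope: deny by default
```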
Message Flow:
1. Client connects → WebSocket endpoint (/ws)
2. Authentication → JWT token validation
3. Channel subscription → Subscribe to authorized channels
4. Message publishing:
- Module publishes message → Redis Pub/Sub
- All instances receive → Filter by subscriptions
- Deliver to connected clients
5. Message persistence → Redis Streams (for replay)
6. Presence update → Redis Sets (online/offline status)
8.4.5 Detailed Component Specifications¶
Module Structure (app/websocket/):
app/websocket/
├── __init__.py # Module exports
├── router.py # INTERFACE: WebSocket endpoints and HTTP routes
├── service.py # APPLICATION: Business logic for message handling
├── connection.py # INFRASTRUCTURE: ConnectionManager for managing connections
├── channel.py # INFRASTRUCTURE: Channel subscription management
├── presence.py # INFRASTRUCTURE: Presence tracking (online/offline)
├── publisher.py # INFRASTRUCTURE: Message publishing to Redis
├── repository.py # INFRASTRUCTURE: Database operations (audit logs)
├── model.py # DOMAIN: WebSocket-related models (if needed)
└── schema.py # INTERFACE: Pydantic schemas for messages
Component Responsibilities:
1. router.py (Interface Layer)
- WebSocket endpoint: ws://api.example.com/ws
- HTTP endpoints:
- GET /ws/health - Health check
- GET /ws/metrics - Prometheus metrics
- GET /ws/channels - List available channels
- POST /ws/broadcast - Admin broadcast endpoint
- WebSocket connection handling
- JWT authentication for WebSocket connections
- Rate limiting per connection
2. connection.py (Infrastructure Layer)
- ConnectionManager class:
- Manages active WebSocket connections per user
- Handles connection lifecycle (connect, disconnect, cleanup)
- Tracks connection metadata (user_id, team_ids, subscribed channels)
- Implements heartbeat mechanism
- Connection health monitoring
- Redis integration for cross-instance state
- Connection pooling and resource management
3. channel.py (Infrastructure Layer)
- ChannelManager class:
- Channel subscription management
- Authorization validation (team-scoped access)
- Channel pattern matching (wildcard subscriptions)
- Subscription persistence in Redis Sets
- Channel discovery and validation
- Subscription limits enforcement
4. presence.py (Infrastructure Layer)
- PresenceManager class:
- User online/offline status tracking
- Presence set management in Redis
- TTL-based presence expiration
- Cross-instance presence synchronization
- Presence events (user online, user offline)
- Presence query API
5. publisher.py (Infrastructure Layer)
- MessagePublisher class:
- Publish messages to Redis Pub/Sub
- Write messages to Redis Streams for durability
- Message routing to appropriate channels
- Batch publishing for performance
- Integration with other modules for publishing events
6. service.py (Application Layer)
- WebSocketService class:
- Business logic for message handling
- Message validation and transformation
- Channel authorization logic
- Message replay service
- Connection management orchestration
- Integration with other domain modules
- Event handling and routing
7. repository.py (Infrastructure Layer)
- Database operations for audit logging
- WebSocket connection history
- Message delivery tracking (optional)
- Analytics queries
8. schema.py (Interface Layer)
- WebSocketMessage - Base message schema
- SubscribeMessage - Channel subscription request
- UnsubscribeMessage - Channel unsubscription request
- PingMessage - Heartbeat ping
- PongMessage - Heartbeat pong
- JobUpdateMessage - Job status update
- NotificationMessage - Alert/notification
- PresenceUpdateMessage - User presence change
- ErrorMessage - Error response
6.4.6 Shared Components Integration¶
Integration with app/core/:
- app/core/security.py - JWT token validation for WebSocket connections
- app/core/cache.py - Redis connection pooling and management
- app/core/rate_limit.py - Rate limiting for WebSocket connections
- app/core/logging.py - Structured logging with correlation IDs
- app/core/exceptions.py - Custom exceptions for WebSocket errors
Integration with app/database/:
- app/database/database.py - Database session for audit logs
- app/database/deps.py - Database dependency injection
Integration with Other Modules:
Job Updates (from jobs/ module):
# In jobs/tasks.py or jobs/service.py
from datetime import datetime

from app.websocket.publisher import MessagePublisher

async def notify_job_completion(job_id: str, status: str, result: dict, team_id: int):
    publisher = MessagePublisher()
    await publisher.publish(
        channel=f"team.{team_id}.jobs",
        message={
            "type": "job.update",
            "job_id": job_id,
            "status": status,
            "result": result,
            "timestamp": datetime.utcnow().isoformat()
        }
    )
Backup Updates (from backups/ module):
# In backups/tasks.py
from datetime import datetime

from app.websocket.publisher import MessagePublisher

async def notify_backup_progress(backup_id: int, progress: int, team_id: int):
    publisher = MessagePublisher()
    await publisher.publish(
        channel=f"team.{team_id}.backups",
        message={
            "type": "backup.progress",
            "backup_id": backup_id,
            "progress": progress,
            "timestamp": datetime.utcnow().isoformat()
        }
    )
Activity Feed (from activity/ module):
# In activity/service.py
from datetime import datetime

from app.websocket.publisher import MessagePublisher

from .model import Activity

async def broadcast_activity(activity: Activity, team_id: int):
    publisher = MessagePublisher()
    await publisher.publish(
        channel=f"team.{team_id}.activities",
        message={
            "type": "activity.new",
            "activity": activity.dict(),
            "timestamp": datetime.utcnow().isoformat()
        }
    )
Module Isolation Compliance:
- ✅ WebSocket module does NOT import from other domain modules
- ✅ Other modules import MessagePublisher from app.websocket.publisher (allowed - WebSocket is infrastructure)
- ✅ WebSocket module uses shared components from app/core/ and app/database/
- ✅ Clear separation: WebSocket handles delivery, modules handle business logic
6.4.7 Implementation Roadmap¶
Phase 1: Core Infrastructure (Weeks 1-2)
1. Set up WebSocket module structure (app/websocket/)
2. Implement ConnectionManager with Redis state management
3. Implement basic WebSocket endpoint in router.py
4. JWT authentication for WebSocket connections
5. Basic message publishing to Redis Pub/Sub
6. Unit tests for connection management
Phase 2: Channel Management (Weeks 3-4)
1. Implement ChannelManager with subscription management
2. Team-scoped channel authorization
3. Channel subscription/unsubscription endpoints
4. Redis Sets for subscription persistence
5. Integration tests for channel operations
Phase 3: Message Durability (Weeks 5-6)
1. Implement Redis Streams integration
2. Message persistence for durability
3. Message replay API
4. Consumer groups for message delivery
5. ACK mechanism for message processing
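Message replay reads missed messages back from the per-channel Redis Stream. As an illustrative sketch (not the final replay API), a replay window can be translated into Redis Stream ID bounds, since stream entry IDs have the form `<unix-ms>-<seq>`; the helper name `replay_window_ids` is an assumption:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

def replay_window_ids(
    window_hours: int = 24,
    now: Optional[datetime] = None
) -> Tuple[str, str]:
    """Compute XRANGE bounds covering the last `window_hours` of messages.

    Redis Stream IDs are "<unix-ms>-<seq>", so a time window maps directly
    to an ID range. "+" means "newest entry" in XRANGE.
    """
    now = now or datetime.now(timezone.utc)
    start_ms = int((now - timedelta(hours=window_hours)).timestamp() * 1000)
    return f"{start_ms}-0", "+"

# A consumer would then fetch the window in pages (redis-py asyncio):
#   entries = await redis.xrange(f"ws:stream:{channel}", min=start_id, max=end_id, count=500)
```

Pagination via `count` keeps a large replay from blocking the event loop, which is also why Phase 3 lists consumer groups for delivery tracking.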
Phase 4: Presence & Advanced Features (Weeks 7-8)
1. Implement PresenceManager for online/offline tracking
2. Heartbeat mechanism (ping/pong)
3. Connection health monitoring
4. Automatic cleanup of stale connections
5. Presence query endpoints
Phase 5: Integration & Testing (Weeks 9-10)
1. Integrate with jobs/ module for job updates
2. Integrate with backups/ module for backup progress
3. Integrate with activity/ module for activity feed
4. Integrate with sites/ module for site creation updates
5. End-to-end integration tests
6. Load testing (10,000+ concurrent connections)
Phase 6: Migration & Cutover (Weeks 11-12)
1. Feature parity validation with Node.js WebSocket server
2. Dual-publish period (both Node and FastAPI)
3. Canary deployment (10% traffic to FastAPI)
4. Gradual rollout (25%, 50%, 100%)
5. Node.js server deprecation
6.4.8 Success Criteria¶
Functional Requirements:
- ✅ WebSocket connections established successfully with JWT authentication
- ✅ Users can subscribe to team-scoped channels
- ✅ Messages published from any module delivered to subscribed clients
- ✅ Message replay works for missed messages (24-hour window)
- ✅ Presence tracking accurately reflects user online/offline status
- ✅ Job completion notifications delivered in <250ms (p95)
- ✅ Backup progress updates delivered in real-time
- ✅ Activity feed updates broadcast to team members
Performance Requirements:
- ✅ WebSocket connection establishment: <100ms (p95)
- ✅ Message broadcast latency: <250ms (p95) across all instances
- ✅ Support 10,000+ concurrent connections per instance
- ✅ Message throughput: 50,000+ messages/second across cluster
- ✅ Zero message loss during normal operations
- ✅ Graceful handling of 1,000+ simultaneous reconnections
Reliability Requirements:
- ✅ 99.9% WebSocket availability
- ✅ Automatic reconnection with exponential backoff
- ✅ Message durability: 100% of messages persisted to Redis Streams
- ✅ Connection cleanup: Stale connections removed within 5 minutes
- ✅ Cross-instance message delivery: 100% reliability
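The reliability requirements call for client reconnection with exponential backoff. A minimal sketch of the delay calculation (capped exponential with jitter; the function name and constants are illustrative, not part of the spec):

```python
import random

def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 30.0, jitter: float = 0.5) -> float:
    """Delay in seconds before reconnect attempt `attempt` (0-based).

    Doubles each attempt, capped at `cap`, plus up to 50% random jitter
    so 1,000+ simultaneous reconnections spread out instead of stampeding.
    """
    delay = min(cap, base * (2 ** attempt))
    return delay + random.uniform(0, jitter * delay)
```

The jitter term is what makes the "graceful handling of simultaneous reconnections" target achievable: without it, all clients dropped by a restart retry at the same instants.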
Security Requirements:
- ✅ JWT authentication required for all connections
- ✅ Team-scoped channel authorization enforced
- ✅ Rate limiting: Max 5 connections per user enforced
- ✅ Message size limits: 64KB max enforced
- ✅ Input validation: All messages validated against schemas
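The 64KB size limit and input validation can be enforced before a frame ever reaches Pydantic parsing. A hedged sketch (the function name and error strings are illustrative):

```python
import json

MAX_MESSAGE_BYTES = 64 * 1024  # 64KB limit from the security requirements

def parse_client_frame(raw: str) -> dict:
    """Reject oversized or malformed frames before schema validation."""
    if len(raw.encode("utf-8")) > MAX_MESSAGE_BYTES:
        raise ValueError("message exceeds 64KB limit")
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        raise ValueError("message is not valid JSON")
    if not isinstance(data, dict) or not isinstance(data.get("type"), str):
        raise ValueError("message must be an object with a string 'type' field")
    return data
```

Checking the byte length before `json.loads` keeps oversized frames from consuming parser CPU; the schema-level validation in `schema.py` then handles per-message-type fields.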
6.4.9 Risk Assessment¶
Technical Risks:
- Redis Pub/Sub Message Loss
  - Risk: Messages may be lost if instance crashes before delivery
  - Mitigation: Redis Streams for durability + ACK mechanism
  - Impact: Medium
  - Probability: Low
- Connection State Synchronization
  - Risk: Connection state may become inconsistent across instances
  - Mitigation: All state in Redis, no in-memory state
  - Impact: High
  - Probability: Low
- Scalability Bottlenecks
  - Risk: Redis Pub/Sub may become bottleneck at high message rates
  - Mitigation: Redis Cluster, message batching, connection pooling
  - Impact: High
  - Probability: Medium
- Message Replay Performance
  - Risk: Replaying large message history may be slow
  - Mitigation: Pagination, time-windowed queries, consumer groups
  - Impact: Medium
  - Probability: Medium
Operational Risks:
- Migration Complexity
  - Risk: Migrating from Node.js to FastAPI may cause downtime
  - Mitigation: Dual-publish period, gradual rollout, rollback plan
  - Impact: High
  - Probability: Medium
- Monitoring Gaps
  - Risk: Insufficient visibility into WebSocket performance
  - Mitigation: Comprehensive metrics, structured logging, alerting
  - Impact: Medium
  - Probability: Low
6.4.10 Module Isolation Verification¶
Compliance Checklist:
- ✅ WebSocket module is self-contained in app/websocket/
- ✅ No imports from other domain modules (users/, teams/, sites/, etc.)
- ✅ Uses shared components from app/core/ (security, cache, rate_limit, logging)
- ✅ Uses shared components from app/database/ (database session)
- ✅ Other modules import MessagePublisher from WebSocket (infrastructure dependency - allowed)
- ✅ Clear separation: WebSocket handles delivery, modules handle business logic
- ✅ WebSocket module can be tested independently
Dependency Graph:
websocket/
├── imports from: app/core/ (security, cache, rate_limit, logging)
├── imports from: app/database/ (database session)
└── NO imports from: other domain modules
other modules (jobs/, backups/, activity/, etc.)
├── imports from: app/core/
├── imports from: app/database/
└── imports from: app/websocket/publisher.py (infrastructure - allowed)
6.4.11 Code Examples¶
WebSocket Connection Handler:
# app/websocket/router.py
from datetime import datetime
from typing import Optional

from fastapi import APIRouter, WebSocket, WebSocketDisconnect

from app.core.security import verify_websocket_token
from app.core.logging import get_logger
from app.websocket.connection import ConnectionManager
from app.websocket.service import WebSocketService
from app.websocket.schema import WebSocketMessage, SubscribeMessage

logger = get_logger(__name__)

router = APIRouter(prefix="/ws", tags=["websocket"])
connection_manager = ConnectionManager()
websocket_service = WebSocketService()

@router.websocket("")
async def websocket_endpoint(
    websocket: WebSocket,
    token: Optional[str] = None
):
    """
    WebSocket endpoint for real-time communication.

    Query Parameters:
    - token: JWT access token for authentication
    """
    # Authenticate connection
    if not token:
        await websocket.close(code=1008, reason="Missing authentication token")
        return

    try:
        payload = verify_websocket_token(token)
        user_id = payload.get("sub")
        team_ids = payload.get("team_ids", [])
    except Exception:
        await websocket.close(code=1008, reason="Invalid authentication token")
        return

    # Accept connection
    await connection_manager.connect(websocket, user_id, team_ids)

    try:
        while True:
            # Receive message from client
            data = await websocket.receive_json()
            message = WebSocketMessage(**data)

            # Handle message based on type
            if message.type == "subscribe":
                subscribe_msg = SubscribeMessage(**message.data)
                await websocket_service.handle_subscribe(
                    websocket, user_id, team_ids, subscribe_msg.channel
                )
            elif message.type == "unsubscribe":
                # Handle unsubscribe
                pass
            elif message.type == "ping":
                # Handle heartbeat
                await websocket.send_json(
                    {"type": "pong", "timestamp": datetime.utcnow().isoformat()}
                )
            else:
                await websocket.send_json({
                    "type": "error",
                    "message": f"Unknown message type: {message.type}"
                })
    except WebSocketDisconnect:
        await connection_manager.disconnect(websocket, user_id)
    except Exception as e:
        logger.error(f"WebSocket error: {e}", exc_info=True)
        await websocket.close(code=1011, reason="Internal server error")
        await connection_manager.disconnect(websocket, user_id)
Message Publisher:
# app/websocket/publisher.py
import json
from datetime import datetime
from typing import Dict, Any, Optional

import redis.asyncio as aioredis

from app.core.cache import get_redis_client
from app.core.logging import get_logger

logger = get_logger(__name__)

class MessagePublisher:
    """Publishes messages to Redis Pub/Sub and Streams for durability."""

    def __init__(self):
        self.redis: Optional[aioredis.Redis] = None

    async def _get_redis(self) -> aioredis.Redis:
        """Get Redis client (lazy initialization)."""
        if self.redis is None:
            self.redis = await get_redis_client()
        return self.redis

    async def publish(
        self,
        channel: str,
        message: Dict[str, Any],
        persist: bool = True
    ) -> None:
        """
        Publish message to channel.

        Args:
            channel: Channel name (e.g., "team.123.jobs")
            message: Message payload (must be JSON-serializable)
            persist: Whether to persist message to Redis Streams
        """
        redis = await self._get_redis()

        # Add metadata
        message_with_meta = {
            **message,
            "channel": channel,
            "timestamp": datetime.utcnow().isoformat(),
            "id": f"{datetime.utcnow().timestamp()}-{hash(str(message))}"
        }
        payload = json.dumps(message_with_meta)

        # Publish to Pub/Sub (for real-time delivery)
        await redis.publish(f"ws:channel:{channel}", payload)

        # Persist to Redis Streams (for durability and replay).
        # XADD fields must be flat string values, so store the serialized message.
        if persist:
            await redis.xadd(
                f"ws:stream:{channel}",
                {"payload": payload},
                maxlen=10000  # Keep last 10,000 messages per channel
            )

        logger.info(
            f"Published message to channel {channel}",
            extra={"channel": channel, "message_type": message.get("type")}
        )

    async def publish_to_user(
        self,
        user_id: int,
        message: Dict[str, Any],
        persist: bool = True
    ) -> None:
        """Publish message to user's personal channel."""
        await self.publish(f"user.{user_id}", message, persist)

    async def publish_to_team(
        self,
        team_id: int,
        feature: str,
        message: Dict[str, Any],
        persist: bool = True
    ) -> None:
        """Publish message to team feature channel."""
        await self.publish(f"team.{team_id}.{feature}", message, persist)
Connection Manager:
# app/websocket/connection.py
import json
from datetime import datetime
from typing import Dict, List, Optional

import redis.asyncio as aioredis
from fastapi import WebSocket

from app.core.cache import get_redis_client
from app.core.logging import get_logger

logger = get_logger(__name__)

class ConnectionManager:
    """Manages WebSocket connections with Redis-backed state."""

    def __init__(self):
        self.redis: Optional[aioredis.Redis] = None
        self.local_connections: Dict[int, List[WebSocket]] = {}
        self.connection_metadata: Dict[WebSocket, Dict] = {}

    async def _get_redis(self) -> aioredis.Redis:
        """Get Redis client (lazy initialization)."""
        if self.redis is None:
            self.redis = await get_redis_client()
        return self.redis

    async def connect(
        self,
        websocket: WebSocket,
        user_id: int,
        team_ids: List[int]
    ) -> None:
        """Accept and register WebSocket connection."""
        await websocket.accept()

        # Store connection locally
        if user_id not in self.local_connections:
            self.local_connections[user_id] = []
        self.local_connections[user_id].append(websocket)

        # Store metadata
        self.connection_metadata[websocket] = {
            "user_id": user_id,
            "team_ids": team_ids,
            "connected_at": datetime.utcnow(),
            "last_heartbeat": datetime.utcnow()
        }

        # Update presence in Redis
        redis = await self._get_redis()
        await redis.sadd(f"ws:presence:user:{user_id}", str(id(websocket)))
        await redis.setex(
            f"ws:presence:ttl:user:{user_id}",
            300,  # 5 minutes TTL
            "online"
        )

        # Subscribe to Redis Pub/Sub for this user
        await self._subscribe_to_user_channels(websocket, user_id, team_ids)

        logger.info(
            f"WebSocket connected for user {user_id}",
            extra={"user_id": user_id, "team_ids": team_ids}
        )

    async def disconnect(
        self,
        websocket: WebSocket,
        user_id: int
    ) -> None:
        """Remove WebSocket connection."""
        # Remove from local connections
        if user_id in self.local_connections:
            if websocket in self.local_connections[user_id]:
                self.local_connections[user_id].remove(websocket)
            if not self.local_connections[user_id]:
                del self.local_connections[user_id]

        # Remove metadata
        self.connection_metadata.pop(websocket, None)

        # Update presence in Redis
        redis = await self._get_redis()
        await redis.srem(f"ws:presence:user:{user_id}", str(id(websocket)))

        # If no more connections for user, mark as offline
        if user_id not in self.local_connections:
            await redis.delete(f"ws:presence:ttl:user:{user_id}")

        logger.info(f"WebSocket disconnected for user {user_id}")

    async def send_to_user(
        self,
        user_id: int,
        message: dict
    ) -> None:
        """Send message to all connections for a user."""
        if user_id in self.local_connections:
            message_json = json.dumps(message)
            # Iterate over a copy: disconnect() mutates the list on failure
            for websocket in list(self.local_connections[user_id]):
                try:
                    await websocket.send_text(message_json)
                except Exception as e:
                    logger.error(f"Error sending message to user {user_id}: {e}")
                    # Remove failed connection
                    await self.disconnect(websocket, user_id)

    async def _subscribe_to_user_channels(
        self,
        websocket: WebSocket,
        user_id: int,
        team_ids: List[int]
    ) -> None:
        """Subscribe to Redis Pub/Sub channels for this connection."""
        redis = await self._get_redis()
        pubsub = redis.pubsub()

        # Subscribe to user channel
        await pubsub.subscribe(f"ws:channel:user.{user_id}")

        # Subscribe to team channels (pattern match on feature suffix)
        for team_id in team_ids:
            await pubsub.psubscribe(f"ws:channel:team.{team_id}.*")

        # Start listening for messages
        # This would run in a background task
        # Implementation details omitted for brevity
Channel Manager:
# app/websocket/channel.py
from typing import List, Optional, Set

import redis.asyncio as aioredis

from app.core.cache import get_redis_client
from app.core.logging import get_logger

logger = get_logger(__name__)

class ChannelManager:
    """Manages channel subscriptions with authorization."""

    def __init__(self):
        self.redis: Optional[aioredis.Redis] = None

    async def _get_redis(self) -> aioredis.Redis:
        """Get Redis client."""
        if self.redis is None:
            self.redis = await get_redis_client()
        return self.redis

    def validate_channel_access(
        self,
        channel: str,
        user_id: int,
        team_ids: List[int]
    ) -> bool:
        """
        Validate if user can access channel.

        Channel patterns:
        - user.{userId} - User can only access their own channel
        - team.{teamId}.{feature} - User must be member of team
        - system.* - Admin only (not implemented in this example)
        """
        parts = channel.split(".")
        try:
            if channel.startswith("user.") and len(parts) >= 2:
                return int(parts[1]) == user_id
            elif channel.startswith("team.") and len(parts) >= 2:
                return int(parts[1]) in team_ids
        except ValueError:
            # Non-numeric ID segment; treat as unauthorized
            return False
        return False

    async def subscribe(
        self,
        channel: str,
        user_id: int,
        team_ids: List[int]
    ) -> bool:
        """Subscribe user to channel."""
        # Validate access
        if not self.validate_channel_access(channel, user_id, team_ids):
            logger.warning(
                f"User {user_id} attempted to subscribe to unauthorized channel {channel}"
            )
            return False

        # Add to subscription set
        redis = await self._get_redis()
        await redis.sadd(f"ws:subscriptions:user:{user_id}", channel)
        await redis.sadd(f"ws:subscribers:channel:{channel}", str(user_id))

        logger.info(f"User {user_id} subscribed to channel {channel}")
        return True

    async def unsubscribe(
        self,
        channel: str,
        user_id: int
    ) -> None:
        """Unsubscribe user from channel."""
        redis = await self._get_redis()
        await redis.srem(f"ws:subscriptions:user:{user_id}", channel)
        await redis.srem(f"ws:subscribers:channel:{channel}", str(user_id))
        logger.info(f"User {user_id} unsubscribed from channel {channel}")

    async def get_user_subscriptions(self, user_id: int) -> Set[str]:
        """Get all channels user is subscribed to."""
        redis = await self._get_redis()
        subscriptions = await redis.smembers(f"ws:subscriptions:user:{user_id}")
        return {s.decode() if isinstance(s, bytes) else s for s in subscriptions}
Message Schemas:
# app/websocket/schema.py
from datetime import datetime
from typing import Optional, Dict, Any, Literal

from pydantic import BaseModel, Field

class WebSocketMessage(BaseModel):
    """Base WebSocket message schema."""
    type: str = Field(..., description="Message type")
    data: Dict[str, Any] = Field(default_factory=dict, description="Message payload")
    timestamp: Optional[str] = Field(default=None, description="Message timestamp")

class SubscribeMessage(BaseModel):
    """Channel subscription request."""
    type: Literal["subscribe"] = "subscribe"
    channel: str = Field(..., description="Channel name to subscribe to")

class UnsubscribeMessage(BaseModel):
    """Channel unsubscription request."""
    type: Literal["unsubscribe"] = "unsubscribe"
    channel: str = Field(..., description="Channel name to unsubscribe from")

class PingMessage(BaseModel):
    """Heartbeat ping message."""
    type: Literal["ping"] = "ping"
    timestamp: str = Field(default_factory=lambda: datetime.utcnow().isoformat())

class PongMessage(BaseModel):
    """Heartbeat pong response."""
    type: Literal["pong"] = "pong"
    timestamp: str = Field(default_factory=lambda: datetime.utcnow().isoformat())

class JobUpdateMessage(BaseModel):
    """Job status update message."""
    type: Literal["job.update"] = "job.update"
    job_id: str
    status: str  # "pending", "running", "completed", "failed"
    progress: Optional[int] = None  # 0-100
    result: Optional[Dict[str, Any]] = None
    error: Optional[str] = None
    timestamp: str = Field(default_factory=lambda: datetime.utcnow().isoformat())

class NotificationMessage(BaseModel):
    """Alert/notification message."""
    type: Literal["notification"] = "notification"
    level: str  # "info", "warning", "error", "success"
    title: str
    message: str
    action_url: Optional[str] = None
    timestamp: str = Field(default_factory=lambda: datetime.utcnow().isoformat())

class PresenceUpdateMessage(BaseModel):
    """User presence update message."""
    type: Literal["presence.update"] = "presence.update"
    user_id: int
    status: str  # "online", "offline"
    timestamp: str = Field(default_factory=lambda: datetime.utcnow().isoformat())

class ErrorMessage(BaseModel):
    """Error response message."""
    type: Literal["error"] = "error"
    code: str
    message: str
    timestamp: str = Field(default_factory=lambda: datetime.utcnow().isoformat())
6.4.12 Monitoring & Observability¶
Metrics (Prometheus):
- websocket_connections_total - Total active connections
- websocket_connections_per_user - Connections per user (histogram)
- websocket_messages_sent_total - Total messages sent
- websocket_messages_received_total - Total messages received
- websocket_message_latency_seconds - Message delivery latency (histogram)
- websocket_channel_subscriptions - Active channel subscriptions
- websocket_errors_total - Error count by type
- websocket_reconnect_total - Reconnection count
Health Endpoints:
- GET /ws/health - Health check (checks Redis connectivity, connection count)
- GET /ws/metrics - Prometheus metrics endpoint
- GET /ws/channels - List all active channels and subscriber counts
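The `/ws/health` response body is not specified elsewhere in this document; a hedged sketch of a possible shape (the field names here are assumptions, not a finalized contract):

```python
def build_health_payload(redis_ok: bool, active_connections: int) -> dict:
    """Assemble a /ws/health response body from dependency checks."""
    return {
        "status": "healthy" if redis_ok else "degraded",
        "checks": {"redis": "up" if redis_ok else "down"},
        "active_connections": active_connections,
    }
```

The route handler would ping Redis and count local connections, then return this payload with a 200 or 503 status depending on `status`.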
Logging:
- Structured JSON logs with correlation IDs
- Log events: connection, disconnection, subscription, message delivery, errors
- PII scrubbing for user data in logs
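PII scrubbing can be a small filter applied to structured log fields before emission. A minimal sketch (the sensitive-key list and placeholder strings are assumptions):

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SENSITIVE_KEYS = {"email", "password", "token", "authorization"}

def scrub_log_fields(fields: dict) -> dict:
    """Redact sensitive keys and mask e-mail addresses in string values."""
    scrubbed = {}
    for key, value in fields.items():
        if key.lower() in SENSITIVE_KEYS:
            scrubbed[key] = "[REDACTED]"
        elif isinstance(value, str):
            scrubbed[key] = EMAIL_RE.sub("[EMAIL]", value)
        else:
            scrubbed[key] = value
    return scrubbed
```

Hooking a filter like this into the shared `app/core/logging.py` formatter keeps scrubbing consistent across all modules rather than per-call-site.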
6.4.13 Migration Strategy¶
Phase A: Bridge (Weeks 1-4)
- Keep Node.js WebSocket server running
- FastAPI WebSocket module development
- Feature parity validation
- Dual-publish testing (both systems receive messages)
Phase B: Dual-Run (Weeks 5-8)
- Deploy FastAPI WebSocket alongside Node.js
- Mirror all publishes to both systems
- Validators compare payload parity
- Gradual client migration (10% → 25% → 50%)
Phase C: Cutover (Weeks 9-12)
- Switch ingress to FastAPI WebSocket (100% traffic)
- Node.js server remains as hot-standby
- Monitor metrics and error rates
- Rollback capability if issues detected
- Node.js server deprecation after 2 weeks of stable operation
6.5 API Endpoints Mapping (206 routes)¶
- Authentication Endpoints: 15+ routes for user authentication and session management
- User Management: 20+ routes for user CRUD and permissions
- Site Management: 30+ routes for site creation, configuration, and deployment
- Environment Management: 25+ routes for environment lifecycle management
- Team Management: 15+ routes for team creation, member management, and permissions
- Billing/Payment: 12+ routes for subscription management and payment processing
- Webhook Support: 8+ routes for webhook management and delivery
- Admin Operations: 20+ routes for administrative functions
6.6 Service Layer Architecture with Module Isolation Validation¶
The Service Layer implements business logic within each domain module following strict isolation principles. This architecture ensures maintainability, testability, and scalability as the codebase grows to serve millions of users.
6.6.1 Service Layer Principles¶
Core Principles:
- Domain Services: 40+ services implemented using the modular DDD approach
- Single Responsibility: Each service handles one domain's business logic
- Module Isolation: Services MUST NOT import from other modules (except core/database)
- Adapter Pattern: Clean separation between business logic and external integrations
- Transaction Management: Services orchestrate database transactions through repositories
- Validation: Business rule validation separate from API request validation
- Error Handling: Domain-specific exceptions that routers translate to HTTP responses
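The last principle — domain exceptions that routers translate to HTTP responses — can be sketched as follows. The exception names and mapping are illustrative, not the project's actual hierarchy:

```python
class DomainError(Exception):
    """Base class for exceptions raised by service-layer business logic."""
    status_code = 500

class SiteNotFoundError(DomainError):
    status_code = 404

class TeamAccessDeniedError(DomainError):
    status_code = 403

def to_http_error(exc: DomainError) -> tuple:
    """Router-side translation: domain exception -> (HTTP status, response body).

    Services raise domain errors without knowing about HTTP; only the
    interface layer maps them to status codes.
    """
    return exc.status_code, {"detail": str(exc) or exc.__class__.__name__}
```

This keeps HTTP concerns out of the service layer; in FastAPI the translation would typically live in a shared exception handler registered on the app.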
6.6.2 Service Layer Structure¶
Standard Service Pattern:
# backend/app/wordpress/service.py
"""
WordPress Service Layer
Implements business logic for WordPress management.
"""
from typing import Optional

from fastapi import HTTPException
from sqlalchemy.orm import Session

from . import repository, schema
from app.core.shared.rbac import require_permission, Permission
from app.core.shared.tenant import validate_team_access
from app.core.shared.audit import log_audit_event, AuditAction
from app.core.cache import cache_result, invalidate_cache

class WordPressService:
    """Service for WordPress operations"""

    def __init__(self):
        pass

    @cache_result(key_prefix="wordpress", ttl=300)
    def get_wordpress_info(
        self,
        db: Session,
        site_id: int,
        user_id: int
    ) -> schema.WordPressRead:
        """
        Get WordPress information for a site.

        Business Rules:
        - User must have permission to view site
        - Site must have WordPress installed

        Args:
            db: Database session
            site_id: Site ID
            user_id: Requesting user ID

        Returns:
            WordPress information

        Raises:
            HTTPException: If site not found or no permission
        """
        # Validate user has access to site
        site = self._validate_site_access(db, site_id, user_id)

        # Get WordPress installation
        wp = repository.get_wordpress_by_site_id(db, site_id)
        if not wp:
            raise HTTPException(
                status_code=404,
                detail="WordPress not installed on this site"
            )
        return wp

    async def execute_wp_cli_command(
        self,
        db: Session,
        site_id: int,
        command: str,
        user_id: int
    ) -> schema.WpCliResponse:
        """
        Execute WP-CLI command.

        Business Rules:
        - User must have execute permission
        - Command must be in allowed commands list
        - Long-running commands must be queued

        Args:
            db: Database session
            site_id: Site ID
            command: WP-CLI command to execute
            user_id: Requesting user ID

        Returns:
            Command execution result

        Raises:
            HTTPException: If validation fails or command not allowed
        """
        # Validate access
        site = self._validate_site_access(db, site_id, user_id)

        # Validate command is allowed
        if not self._is_command_allowed(command):
            raise HTTPException(
                status_code=400,
                detail=f"Command '{command}' is not allowed"
            )

        # Check if command is long-running
        if self._is_long_running_command(command):
            # Enqueue background job
            job_id = await self._enqueue_wp_cli_job(site_id, command)
            return schema.WpCliResponse(
                output="Command queued",
                exit_code=0,
                job_id=job_id,
                status="queued"
            )

        # Execute command immediately
        result = repository.execute_wp_cli(db, site_id, command)

        # Log audit event
        log_audit_event(
            user_id=user_id,
            team_id=site.team_id,
            action=AuditAction.UPDATE,
            resource_type="wordpress",
            resource_id=site_id,
            metadata={"command": command}
        )
        return result

    def _validate_site_access(
        self,
        db: Session,
        site_id: int,
        user_id: int
    ):
        """Validate user has access to site (internal helper)"""
        from app.sites.repository import get_site_by_id

        site = get_site_by_id(db, site_id)
        if not site:
            raise HTTPException(status_code=404, detail="Site not found")

        # Validate team access
        if not validate_team_access(user_id, site.team_id):
            raise HTTPException(status_code=403, detail="Access denied")
        return site

    def _is_command_allowed(self, command: str) -> bool:
        """Check if WP-CLI command is allowed"""
        from app.core.utils import get_wordpress_allowed_commands
        allowed = get_wordpress_allowed_commands()
        return any(command.startswith(cmd) for cmd in allowed)

    def _is_long_running_command(self, command: str) -> bool:
        """Check if command is long-running"""
        long_running = ['plugin update', 'core update', 'db export', 'db import']
        return any(command.startswith(cmd) for cmd in long_running)

    async def _enqueue_wp_cli_job(self, site_id: int, command: str) -> str:
        """Enqueue WP-CLI command as background job"""
        from arq import create_pool
        from arq.connections import RedisSettings

        redis = await create_pool(RedisSettings())
        job = await redis.enqueue_job('execute_wp_cli', site_id, command)
        return job.job_id

# Singleton instance
wordpress_service = WordPressService()
6.6.3 Module Isolation Validation¶
Automated Validation:
The validate_architecture.py script enforces module isolation rules:
# backend/scripts/validate_architecture.py (enhanced version)
import ast
import sys
from pathlib import Path
from typing import Set, Dict, List

ALLOWED_IMPORTS = {'core', 'database'}
REQUIRED_FILES = ['router.py', 'service.py', 'repository.py', 'model.py', 'schema.py']
SHARED_MODULES = ['core', 'database', 'tests']

def get_module_imports(module_path: Path) -> Dict[str, Set[str]]:
    """Extract all app.* imports from a module"""
    imports = {}
    for py_file in module_path.rglob("*.py"):
        if py_file.name.startswith('_'):
            continue
        file_imports = set()
        try:
            tree = ast.parse(py_file.read_text())
            for node in ast.walk(tree):
                if isinstance(node, ast.ImportFrom):
                    if node.module and node.module.startswith('app.'):
                        parts = node.module.split('.')
                        if len(parts) >= 2:
                            file_imports.add(parts[1])  # Extract module name
        except Exception as e:
            print(f"Warning: Could not parse {py_file}: {e}")
            continue
        if file_imports:
            imports[str(py_file.relative_to(module_path.parent))] = file_imports
    return imports

def validate_module_isolation() -> List[str]:
    """Validate no module imports from other modules"""
    app_path = Path("app")
    modules = [d for d in app_path.iterdir() if d.is_dir() and d.name not in SHARED_MODULES]
    violations = []
    for module in modules:
        module_imports = get_module_imports(module)
        for file_path, imports in module_imports.items():
            # Allowed imports: core, database, self
            forbidden_imports = imports - ALLOWED_IMPORTS - {module.name}
            if forbidden_imports:
                violations.append(
                    f"❌ {file_path} imports from forbidden modules: {forbidden_imports}\n"
                    f"   Allowed: {ALLOWED_IMPORTS} + own module ({module.name})"
                )
    return violations

def validate_module_structure() -> List[str]:
    """Validate all modules have required files"""
    app_path = Path("app")
    modules = [d for d in app_path.iterdir() if d.is_dir() and d.name not in SHARED_MODULES]
    violations = []
    for module in modules:
        for required_file in REQUIRED_FILES:
            file_path = module / required_file
            if not file_path.exists():
                violations.append(f"❌ Module '{module.name}' missing {required_file}")
            else:
                # Check file is not empty (at least has imports or docstring)
                content = file_path.read_text().strip()
                if len(content) < 10:  # Arbitrary minimum
                    violations.append(f"⚠️ Module '{module.name}' has empty {required_file}")
    return violations

def validate_layer_separation() -> List[str]:
    """Validate layer separation within modules"""
    app_path = Path("app")
    modules = [d for d in app_path.iterdir() if d.is_dir() and d.name not in SHARED_MODULES]
    violations = []
    for module in modules:
        # Check routers don't import SQLAlchemy query code
        router_file = module / "router.py"
        if router_file.exists():
            content = router_file.read_text()
            if 'from sqlalchemy' in content and 'Session' not in content:
                violations.append(
                    f"❌ {module.name}/router.py imports SQLAlchemy (should use service layer)"
                )
            if '.query(' in content or '.filter(' in content:
                violations.append(
                    f"❌ {module.name}/router.py contains database queries (should use service layer)"
                )
        # Check services use repositories for database access
        service_file = module / "service.py"
        if service_file.exists():
            content = service_file.read_text()
            if '.query(' in content or '.add(' in content or '.commit(' in content:
                violations.append(
                    f"⚠️ {module.name}/service.py directly accesses database (should use repository)"
                )
    return violations

def generate_dependency_graph():
    """Generate module dependency graph"""
    app_path = Path("app")
    modules = [d for d in app_path.iterdir() if d.is_dir() and d.name not in SHARED_MODULES]
    print("\n📊 Module Dependency Graph:")
    print("=" * 60)
    for module in modules:
        imports = get_module_imports(module)
        all_imports = set()
        for file_imports in imports.values():
            all_imports.update(file_imports)
        # Filter out own module and allowed shared imports
        external_imports = all_imports - {module.name} - ALLOWED_IMPORTS
        if external_imports:
            print(f"{module.name} → {', '.join(external_imports)}")
        else:
            print(f"{module.name} → (no external dependencies) ✓")

if __name__ == "__main__":
    print("🔍 Validating Hybrid Modular DDD Architecture...\n")

    # Run all validations
    isolation_violations = validate_module_isolation()
    structure_violations = validate_module_structure()
    layer_violations = validate_layer_separation()

    # Report violations
    if isolation_violations:
        print("❌ Module Isolation Violations:")
        for violation in isolation_violations:
            print(f"  {violation}")
        print()
    if structure_violations:
        print("❌ Module Structure Violations:")
        for violation in structure_violations:
            print(f"  {violation}")
        print()
    if layer_violations:
        print("❌ Layer Separation Violations:")
        for violation in layer_violations:
            print(f"  {violation}")
        print()

    # Generate dependency graph
    generate_dependency_graph()

    # Exit with appropriate code
    total_violations = len(isolation_violations) + len(structure_violations) + len(layer_violations)
    if total_violations == 0:
        print("\n✅ All architecture validation checks passed!")
        sys.exit(0)
    else:
        print(f"\n❌ Found {total_violations} architecture violations!")
        sys.exit(1)
6.6.4 Service Layer Best Practices¶
1. Transaction Management:
```python
# GOOD: Service manages transaction, repository executes queries
def create_site_with_environment(
    db: Session,
    site_data: schema.SiteCreate,
    env_data: schema.EnvironmentCreate
) -> schema.SiteRead:
    """Create site and default environment (transactional)"""
    try:
        # Create site
        site = repository.create_site(db, site_data)
        # Create environment
        env_data.site_id = site.id
        environment = repository.create_environment(db, env_data)
        # Commit transaction
        db.commit()
        db.refresh(site)
        return site
    except Exception as e:
        db.rollback()
        raise HTTPException(status_code=500, detail=f"Failed to create site: {str(e)}")

# BAD: Service doesn't manage transaction properly
def create_site_with_environment_bad(
    db: Session,
    site_data: schema.SiteCreate,
    env_data: schema.EnvironmentCreate
):
    site = repository.create_site(db, site_data)
    db.commit()  # Commits too early!
    try:
        environment = repository.create_environment(db, env_data)
        db.commit()
    except Exception:
        # Site already committed, can't roll it back!
        pass
```
2. Caching Strategy:
```python
from app.core.cache import cache_result, invalidate_cache

class SiteService:
    @cache_result(key_prefix="sites", ttl=300)
    def get_site_statistics(self, db: Session, site_id: int):
        """Get site statistics (cached for 5 minutes)"""
        return repository.get_site_statistics(db, site_id)

    def update_site(self, db: Session, site_id: int, site_data: schema.SiteUpdate):
        """Update site and invalidate cache"""
        site = repository.update_site(db, site_id, site_data)
        db.commit()
        # Invalidate related caches
        invalidate_cache(f"sites:*:{site_id}:*")
        return site
```
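The `cache_result` and `invalidate_cache` helpers are imported from `app.core.cache` but not defined in this document. A minimal in-process sketch of what such helpers could look like follows; assume the real implementation is Redis-backed and pattern-aware, and treat the names `_cache` and the key layout here as illustrative only.

```python
import time
from functools import wraps

# Process-local stand-in for the Redis backend: key -> (stored_at, value)
_cache: dict = {}

def cache_result(key_prefix: str, ttl: int = 300):
    """Cache a method's return value under a key derived from its arguments."""
    def decorator(func):
        @wraps(func)
        def wrapper(self, db, *args, **kwargs):
            key = f"{key_prefix}:{func.__name__}:{args}:{sorted(kwargs.items())}"
            now = time.monotonic()
            hit = _cache.get(key)
            if hit is not None and now - hit[0] < ttl:
                return hit[1]  # Fresh entry: skip the repository call
            value = func(self, db, *args, **kwargs)
            _cache[key] = (now, value)
            return value
        return wrapper
    return decorator

def invalidate_cache(pattern: str) -> None:
    """Drop all entries whose key matches the prefix before the first '*'."""
    prefix = pattern.split("*", 1)[0]
    for key in [k for k in _cache if k.startswith(prefix)]:
        del _cache[key]
```

The Redis version would replace the dict with `SETEX`/`SCAN`-based calls, but the decorator shape and the prefix-based invalidation contract stay the same.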
3. Error Handling:
class WordPressError(Exception):
"""Base exception for WordPress operations"""
pass
class WordPressNotInstalledError(WordPressError):
"""Raised when WordPress is not installed"""
pass
class CommandNotAllowedError(WordPressError):
"""Raised when WP-CLI command is not allowed"""
pass
# In service
def get_wordpress_info(self, db: Session, site_id: int):
wp = repository.get_wordpress_by_site_id(db, site_id)
if not wp:
raise WordPressNotInstalledError(f"WordPress not installed on site {site_id}")
return wp
# In router (translates domain exceptions to HTTP exceptions)
try:
wp = service.get_wordpress_info(db, site_id)
return wp
except WordPressNotInstalledError as e:
raise HTTPException(status_code=404, detail=str(e))
except WordPressError as e:
raise HTTPException(status_code=500, detail=str(e))
6.6.5 Multi-Tier Caching Strategy¶
Application-Level Caching:
- In-memory LRU caches for static data (allowed commands, configuration)
- Request-scoped caching within a single API call
Distributed Caching (Redis):
- User session data
- API response caching
- Computed statistics and dashboards
- Rate limiting counters
Database-Level Optimization:
- Query result caching
- Connection pooling via PgBouncer
- Read replicas for read-heavy operations
CDN Caching:
- Static assets (images, CSS, JS)
- Public API responses with Cache-Control headers
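For the in-memory tier, static data such as the allowed-command list can be cached with nothing more than `functools.lru_cache` over a loader; the command set below is illustrative, not the project's actual allowlist.

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_allowed_wp_cli_commands() -> frozenset:
    """Static data: safe to cache in-process for the worker's lifetime."""
    # In the real app this would be loaded once from configuration
    return frozenset({"plugin list", "theme list", "core version", "cache flush"})
```

Because the loader runs at most once per worker, repeated permission checks on the hot path cost a dictionary lookup rather than a config read.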
6.6.6 Monitoring and Metrics¶
Service-Level Metrics:
```python
from prometheus_client import Counter, Histogram

# Service operation metrics
service_operation_total = Counter(
    'service_operation_total',
    'Total service operations',
    ['module', 'operation', 'status']
)
service_operation_duration = Histogram(
    'service_operation_duration_seconds',
    'Service operation duration',
    ['module', 'operation']
)

# Usage in service
def create_site(self, db: Session, site_data: schema.SiteCreate):
    with service_operation_duration.labels(module='sites', operation='create').time():
        try:
            site = repository.create_site(db, site_data)
            db.commit()
            service_operation_total.labels(module='sites', operation='create', status='success').inc()
            return site
        except Exception:
            db.rollback()
            service_operation_total.labels(module='sites', operation='create', status='error').inc()
            raise
```
6.6.7 Integration with Background Jobs¶
Services coordinate with background jobs for long-running operations:
```python
from arq import create_pool
from arq.connections import RedisSettings

async def create_full_backup(
    self,
    db: Session,
    site_id: int,
    user_id: int
) -> dict:
    """Create full backup (queued as background job)"""
    # Validate access
    site = self._validate_site_access(db, site_id, user_id)
    # Create backup record (status: pending)
    backup = repository.create_backup(db, schema.BackupCreate(
        site_id=site_id,
        type="full",
        status="pending"
    ))
    db.commit()
    # Enqueue background job
    redis = await create_pool(RedisSettings())
    job = await redis.enqueue_job('create_full_backup', site_id, backup.id)
    return {
        "backup_id": backup.id,
        "job_id": job.job_id,
        "status": "queued"
    }
```
Summary:
- 40+ Domain Services: each module has its own service implementing business logic
- Strict Module Isolation: enforced via automated validation in CI/CD
- Transaction Management: services orchestrate multi-step operations
- Caching: multi-tier caching strategy for performance
- Error Handling: domain-specific exceptions translated to HTTP responses
- Monitoring: built-in metrics for all service operations
- Background Jobs: integration with ARQ for long-running operations
8.7 Next.js Frontend Integration¶
- Modern React: Next.js 14 with App Router and React Server Components
- API Integration: Auto-generated TypeScript clients from OpenAPI specs
- State Management: Zustand for global state with React Query for server state
- Authentication Flow: Seamless JWT token management with refresh logic
- Real-time Updates: WebSocket integration for live updates
- Progressive Web App: Offline support and native app-like experience
8.8 Domain Modules Inventory & Migration Map¶
Based on the Laravel 11 inventory (controllers, jobs, services, routes), the following FastAPI modules will be implemented. Each module owns its router, service, repository, model, and schema:
Core Modules (with External API Integration):
- Auth (`auth/`): login, refresh, MFA, social login; replaces Laravel Sanctum flows
  - External API: None (internal authentication only)
  - Shared Components: uses `app/core/security.py` for JWT management
- Users (`users/`): profile, OTP, audit logs; maps `UserController`
  - External API: None (internal user management)
  - Shared Components: uses `app/core/shared/rbac.py` for permissions
- Teams (`teams/`): team CRUD, membership; maps `TeamController`, `TeamService`
  - External API: None (internal team management)
  - Shared Components: uses `app/core/shared/tenant.py` for multi-tenancy
Environment & Infrastructure Modules (Virtuozzo Integration):
- Sites (`sites/`): site CRUD/dashboard; maps `SitesController`, `SiteDetailsController`
  - External API: None (internal site management)
  - Shared Components: uses `app/core/shared/audit.py` for audit logging
- Environments (`environments/`): lifecycle start/stop/sleep/restart/rename/delete; maps `ApiController`, `DeleteEnvWorkflowController`, Virtuozzo services
  - External API: ⭐ Virtuozzo API via `app/core/adapters/virtuozzo_adapter.py`
  - Service Methods: `start_environment()`, `stop_environment()`, `sleep_environment()`, `restart_environment()`, `delete_environment()`
  - Performance: connection pooling, caching (5 min TTL), circuit breaker, retries (3x)
  - Shared Components: uses `virtuozzo_adapter` for all Virtuozzo API calls
- Staging (`staging/`): staging creation pipeline (export DB, create env, install addons, import DB, search/replace); maps `StagingController`
  - External API: ⭐ Virtuozzo API via `app/core/adapters/virtuozzo_adapter.py`
  - Service Methods: `create_staging_environment()`, `sync_staging_to_production()`
  - Background Jobs: long-running staging operations via ARQ
  - Shared Components: uses `virtuozzo_adapter` for environment creation
- Backups (`backups/`): full/custom backup, restore, sessions; maps `BackupController`, `BackupStatusController`
  - External API: ⭐ Virtuozzo API via `app/core/adapters/virtuozzo_adapter.py`
  - Service Methods: `create_backup()`, `restore_backup()`, `list_backup_sessions()`
  - Background Jobs: backup/restore operations via ARQ
  - Shared Components: uses `virtuozzo_adapter` for backup operations
CDN & Cache Modules:
- Cache (`cache/`): Redis/OPCache/Relay/LiteSpeed controls and metrics; maps `CacheController`, `ApiController` cache routes
  - External API: ⭐ Bunny CDN API via `app/core/adapters/bunny_cdn_adapter.py`
  - Service Methods: `create_dns_record()`, `configure_cdn()`, `purge_cache()`
  - Performance: HTTP/2, idempotency keys, automatic retries
  - Shared Components: uses `bunny_cdn_adapter` for CDN operations
- Domains (`domains/`): SSL issuance/renewal, domain verify/bind/remove; maps `DomainsController`, `SiteDetailsController`
  - External API:
    - ⭐ Cloudflare API via `app/core/adapters/cloudflare_adapter.py`
    - ⭐ Bunny CDN API via `app/core/adapters/bunny_cdn_adapter.py`
  - Service Methods: `issue_ssl()`, `verify_domain()`, `bind_domain()`, `remove_domain()`
  - Background Jobs: SSL issuance/renewal via ARQ
  - Shared Components: uses multiple adapters for DNS and SSL management
Infrastructure Services:
- SFTP (`sftp/`): users CRUD, addon install; maps `SftpUserController`
  - External API: ⭐ Virtuozzo API via `app/core/adapters/virtuozzo_adapter.py`
  - Service Methods: `create_sftp_user()`, `install_sftp_addon()`
  - Shared Components: uses `virtuozzo_adapter` for SFTP user management
- WordPress (`wordpress/`): WP-CLI proxy, update history, activities; maps `WpCacheController`, `UpdateHistoryController`
  - External API: ⭐ Virtuozzo API via `app/core/adapters/virtuozzo_adapter.py`
  - Service Methods: `execute_wp_cli()`, `install_plugin()`, `update_wordpress()`
  - Background Jobs: long-running WP-CLI commands via ARQ
  - Rate Limiting: 10 WP-CLI commands per minute per user
  - Shared Components: uses `virtuozzo_adapter` for WP-CLI execution
Payment & Billing Modules:
- Payments (`payments/`): Stripe intents, configs, post-payment; maps `FundController`, `InvoiceController`
  - External API: ⭐ Stripe API (future adapter: `app/core/adapters/stripe_adapter.py`)
  - Service Methods: `create_payment_intent()`, `process_payment()`, `refund_payment()`
  - Performance: idempotency keys for safe retries, webhook validation
  - Shared Components: future Stripe adapter implementation
- Billing (`billing/`): PayPal one-time/recurring, plan deactivate/cancel; maps `WebhookController`
  - External API: ⭐ PayPal API (future adapter: `app/core/adapters/paypal_adapter.py`)
  - Service Methods: `create_subscription()`, `cancel_subscription()`, `process_webhook()`
  - Shared Components: future PayPal adapter implementation
Monitoring & Notifications:
- Uptime (`uptime/`): uptime monitor/checker; maps `UptimeController`, `Uptime*Service`
  - External API: ⭐ UptimeRobot API (future adapter: `app/core/adapters/uptime_adapter.py`)
  - Service Methods: `create_monitor()`, `check_uptime()`, `get_monitor_status()`
  - Background Jobs: periodic uptime checks via ARQ
  - Shared Components: future UptimeRobot adapter implementation
- Nodes & Metrics (`nodes/`): node stats, env metrics time-series; maps `NodeController`, `NodeStatsController`, `EnvMetricsController`
  - External API: ⭐ Virtuozzo API via `app/core/adapters/virtuozzo_adapter.py`
  - Service Methods: `fetch_node_stats()`, `get_environment_metrics()`
  - Caching: aggressive caching (1 min TTL) for metrics data
  - Shared Components: uses `virtuozzo_adapter` for metrics collection
Utility Modules:
- Activity & Notes (`activity/`): notes CRUD, site activities; maps `ActivityController`
  - External API: None (internal activity logging)
  - Shared Components: uses `app/core/shared/audit.py` for audit trail
- Favourites (`favourites/`): add/remove/list; maps `FavouritesController`
  - External API: None (internal user preferences)
- Config (`config/`): versioned config endpoints `api/v1/config/*`; maps `ConfigController`
  - External API: None (internal configuration management)
- Profile (`profile/`): user profile update/delete; maps `ProfileController`
  - External API: None (internal user profile management)
- Instant/Virt Login (`sessions/`): instant login, virt login; maps `InstantLoginController`, `VirtLoginController`
  - External API: ⭐ Virtuozzo API via `app/core/adapters/virtuozzo_adapter.py`
  - Service Methods: `create_instant_login()`, `create_virt_login()`
  - Shared Components: uses `virtuozzo_adapter` for session key generation
- Jobs (`jobs/`): dispatch/status for background tasks; maps `JobController`, `JobLogsController`, `JobStatusController`
  - External API: None (internal job queue management)
  - Shared Components: uses ARQ for background job processing
- Webhook (`webhook/`): inbound webhooks processing; maps `WebhookController`
  - External API: receives webhooks from external services (Stripe, PayPal, etc.)
  - Service Methods: `process_stripe_webhook()`, `process_paypal_webhook()`
- WebSocket (`websocket/`): channels, health, metrics, broadcast; maps `WebSocketController`, Node bridge parity
  - External API: None (internal real-time communication)
  - Shared Components: uses Redis Pub/Sub for cross-instance messaging
Shared External API Infrastructure:
Located in `app/core/adapters/`:

```text
app/core/adapters/
├── __init__.py
├── virtuozzo_adapter.py   # Used by: environments, wordpress, backups, sftp, staging, nodes, sessions
├── bunny_cdn_adapter.py   # Used by: cache, domains
├── cloudflare_adapter.py  # Used by: domains
├── stripe_adapter.py      # Future: used by payments
├── paypal_adapter.py      # Future: used by billing
├── postmark_adapter.py    # Future: used by notifications
└── uptime_adapter.py      # Future: used by uptime
```
Adapter Usage Matrix:
| Adapter | Modules Using It | API Operations | Performance Features |
|---|---|---|---|
| `virtuozzo_adapter.py` | environments, wordpress, backups, sftp, staging, nodes, sessions | Environment lifecycle, WP-CLI, backups, SFTP, metrics | Connection pooling, caching (5 min), circuit breaker, rate limiting (10 req/s) |
| `bunny_cdn_adapter.py` | cache, domains | DNS records, CDN config, cache purging | HTTP/2, idempotency keys, retries (3x) |
| `cloudflare_adapter.py` | domains | DNS management, SSL verification | Connection pooling, automatic retries |
For full endpoint-by-endpoint mapping, see docs/04-API-ENDPOINTS-MAPPING.md (auto-generated from route analyzers), kept in lockstep with OpenAPI.
8.9 API Versioning & Backward Compatibility Strategy¶
- Versioned FastAPI routes under `/api/v1/**` mirror Laravel endpoints to enable incremental frontend migration
- Legacy compatibility shims maintained where necessary:
  - Path aliases for legacy endpoints (e.g., `/get-backup-list` → `/api/v1/backups/list`)
  - Payload/response translators to preserve current frontend contracts
- Deprecation policy:
  - All shims carry `Deprecation` and `Sunset` headers
  - Removal only after frontend cutover and a 30-day window
- Contract testing:
  - Snapshot tests for JSON shapes versus Laravel responses
  - Contract CI checks block incompatible changes
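The Deprecation/Sunset policy above can be implemented as a small helper that every shim attaches to its response; the helper name, sunset date, and route in the commented wiring are illustrative, not taken from the codebase.

```python
from datetime import date

def deprecation_headers(sunset: date, successor: str) -> dict:
    """Headers a legacy compatibility shim advertises until removal (RFC 8594 Sunset)."""
    return {
        "Deprecation": "true",
        "Sunset": sunset.strftime("%a, %d %b %Y 00:00:00 GMT"),
        "Link": f'<{successor}>; rel="successor-version"',
    }

# FastAPI wiring (sketch):
# @app.get("/get-backup-list", deprecated=True)
# async def legacy_backup_list(response: Response):
#     response.headers.update(deprecation_headers(date(2026, 1, 31), "/api/v1/backups/list"))
#     return await list_backups()
```

Marking the route `deprecated=True` also flags it in the generated OpenAPI spec, which keeps the auto-generated TypeScript clients honest about the shim's status.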
8.10 Background Jobs Migration Mapping¶
- Queues: `critical`, `default`, `bulk`, `io-bound`, `webhooks`, `emails`, `dlq`
- Representative job mappings (Laravel → FastAPI/ARQ task):
  - `CreateEnvironmentJob` → `environments.tasks.create_environment`
  - `DeleteEnvironmentJob` → `environments.tasks.delete_environment`
  - `CreateFullBackupJob` / `CreateCustomBackupJob` → `backups.tasks.create_backup`
  - `RestoreBackup` → `backups.tasks.restore_backup`
  - `InstallWordPressJob` / `RunDynamicWpCli` → `wordpress.tasks.run_wp_cli`
  - `InstallLetsEncryptSSLJob` / `UpdateSSLJob` / `IssueSslCertJob` → `domains.tasks.manage_ssl`
  - `SyncSftpUsersJob` / `InstallAddSftpJob` → `sftp.tasks.sync_users`
  - `SetupBunnyCdnJob` → `cache.tasks.configure_cdn`
  - `DispatchSyncJob` / `DispatchSingleSyncJob` → `sites.tasks.sync_environment`
  - `ProcessDomainIP` / `VerifyDomainJob` / `RemoveDomainJob` → `domains.tasks.verify_or_remove`
- Execution guarantees:
  - Idempotency keys based on the tuple (team_id, env, op, args_hash)
  - Outbox + inbox tables for external calls (payments, SSL, CDN)
  - DLQ viewer and replay tooling with backoff policies
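The idempotency-key rule above (a key derived from the tuple team_id, env, op, args_hash) can be sketched as follows; the helper name and exact key layout are illustrative.

```python
import hashlib
import json

def job_idempotency_key(team_id: int, env: str, op: str, args: dict) -> str:
    """Derive a stable idempotency key from (team_id, env, op, args_hash)."""
    # Canonical JSON so the same logical arguments always hash identically
    args_hash = hashlib.sha256(
        json.dumps(args, sort_keys=True, separators=(",", ":")).encode()
    ).hexdigest()[:16]
    return f"{team_id}:{env}:{op}:{args_hash}"
```

Because the key is deterministic, re-enqueueing the same operation (e.g., after a DLQ replay) maps to the same key, letting the worker detect and skip duplicates.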
8.11 WebSocket Migration Plan (Durability & Presence)¶
- Phase A (Bridge): keep Node WS (Redis Pub/Sub + Streams) behind `/ws` while FastAPI replicates features
- Phase B (Dual-Run): mirror publishes to both WS stacks; validators compare payload parity
- Phase C (Cutover): switch ingress to FastAPI WS; Node remains as hot-standby for rollback
- Feature parity requirements:
  - Channels: `team.{teamId}.{feature}` (jobs, backups, activities, sites.creating, domains, favourites)
  - Durability: Redis Streams with consumer groups, replay API, ACK timeouts
  - Presence: Redis-backed presence set with TTL, cross-instance fan-out
  - Security: JWT with team claims; per-channel authorization
  - Ops: `/metrics`, `/health`, `/channels`, structured logs with correlation IDs
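The TTL-based presence requirement above is usually implemented in Redis with a sorted set per channel (`ZADD` on each heartbeat, `ZREMRANGEBYSCORE` to expire stale members). The process-local sketch below shows only the TTL bookkeeping, with channel names following the `team.{teamId}.{feature}` convention; treat the class and its API as illustrative.

```python
import time

class PresenceSet:
    """TTL-based presence tracking (process-local sketch of the Redis ZSET pattern)."""
    def __init__(self, ttl_seconds: float = 30.0):
        self.ttl = ttl_seconds
        # (channel, member) -> last heartbeat timestamp
        self._last_seen: dict = {}

    def heartbeat(self, channel: str, member: str, now: float = None) -> None:
        self._last_seen[(channel, member)] = time.monotonic() if now is None else now

    def members(self, channel: str, now: float = None) -> set:
        cutoff = (time.monotonic() if now is None else now) - self.ttl
        # Expire stale entries, then report who is still present on this channel
        stale = [k for k, ts in self._last_seen.items() if ts < cutoff]
        for k in stale:
            del self._last_seen[k]
        return {m for (c, m) in self._last_seen if c == channel}
```

A member that stops heartbeating simply ages out after `ttl_seconds`, which handles crashed connections without an explicit "leave" message; cross-instance fan-out then only needs to broadcast join/leave deltas.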
8.12 Data Model Migration & SQLAlchemy Strategy¶
- Model inventory (representative): `User`, `Team`, `Member`, `Site`, `Environment`, `Backup`, `BackupSession`, `Domain`, `SftpUser`, `Node`, `Note`, `SiteActivity`, `Transaction`, `PaypalSubscription`, `TeamFund`, `PendingSite`, `UpdateHistory`, `SyncHistory`
- Approach:
  - Generate SQLAlchemy models to match the target PostgreSQL schema (snake_case, explicit FKs, indexes)
  - Alembic migrations: baseline from the current MySQL schema (via `alembic revision --autogenerate` after initial models), then hand-tune constraints and indexes
- Data migration:
  - Extract: chunked reads from MySQL ordered by PK
  - Transform: enum/string normalization, timezone normalization, UUIDs where applicable
  - Load: COPY-based bulk inserts; verify row counts and checksums
- Validation:
  - Referential integrity checks per batch
  - Sampling-based semantic validation (e.g., backup sessions reconcile with backups)
- Cutover:
  - Dual-write window for critical tables (feature-flagged)
  - Read-from-Postgres dark launch; promote after parity
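The "verify row counts and checksums" step above needs a checksum that can be computed independently on the MySQL and PostgreSQL sides and compared per batch. One common scheme, shown as a sketch here (the XOR-of-hashes approach is an assumption, not the project's actual tooling), is order-independent so parallel readers need not sort:

```python
import hashlib

def batch_checksum(rows: list) -> str:
    """Order-independent checksum for a batch of rows (compare source vs. target)."""
    digest = 0
    for row in rows:
        h = hashlib.sha256(repr(row).encode()).digest()
        # XOR aggregation makes the result independent of row order
        digest ^= int.from_bytes(h[:8], "big")
    return f"{digest:016x}"
```

In practice both sides must normalize values first (timezones, enum spellings) so that a semantically identical row hashes identically; otherwise every transform in the pipeline shows up as a spurious mismatch.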
8.13 Observability, SLOs, and Load Testing¶
- SLOs:
- API p95 latency < 200ms; WS broadcast latency p95 < 250ms
- Error rate < 0.1%; Availability 99.9%
- Metrics:
- API: latency, throughput, error rates, saturation (DB pool, Redis)
- Jobs: queue depth, processing latency, retry rates, DLQ rates
- WS: active connections, channel subscribers, publish/replay latency
- Tracing:
- End-to-end tracing (OpenTelemetry) across HTTP → jobs → WS
- Logging:
- Structured JSON logs; PII scrubbing; per-request correlation IDs
- Load testing:
- k6/Gatling scenarios per critical flows (site lifecycle, backups, domains)
- Soak tests and spike tests with autoscaling validation
8.14 Cutover & Rollback Strategy¶
- Blue/Green deployments for backend; frontend deployed independently
- Database:
- Pre-warm read replicas; promote on failover
- Point-in-time recovery enabled; backups validated daily
- WS:
- Dual-publish period for parity; feature flag to revert to Node WS
- API:
- Feature flags for compatibility shims; runtime kill-switches for risky features
- Canary release to 10% traffic before full rollout
- Rollback playbooks with time-bounded MTTR targets
8.15 External API Integration Architecture¶
Context: The MBPanel system integrates with 40+ external services including Virtuozzo (19 services), CDN/Cache providers (4 services), payment processors, and other third-party APIs. This section defines the architecture for migrating Laravel's HTTP client patterns to FastAPI with optimal performance, reliability, and resilience.
8.15.1 Problem Statement and Current Challenges¶
Laravel Pattern (Current State):
```php
// Laravel: app/Services/Virtuozzo/MbAdminService.php
$response = Http::timeout(90)->post($apiUrl);
if (!$response->successful()) {
    throw new \Exception("Failed...");
}
return $response->json();
```
Critical Issues with Direct Migration:

1. ❌ No connection pooling → new connection per request (70ms overhead)
2. ❌ No circuit breaker → cascading failures during external API outages
3. ❌ No response caching → repeated calls to the same endpoint
4. ❌ No rate limiting → risk of being throttled by external APIs
5. ❌ No retry with exponential backoff → transient failures cause errors
6. ❌ No request timeout strategy → hanging requests
7. ❌ No correlation IDs → difficult to trace across systems
Laravel's Good Patterns to Preserve:
- Circuit breaker implementation in `ExternalApiService.php`
- Idempotent operation handling in `BunnyCdnService.php`
- Structured error logging
8.15.2 HTTP Client Infrastructure (app/core/http_client.py)¶
Purpose: Centralized HTTP client with connection pooling, retries, circuit breakers, caching, and rate limiting.
Key Components:
1. CircuitBreaker Class:
```python
class CircuitBreaker:
    """Circuit breaker implementation for external API calls"""
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: int = 300,  # 5 minutes
        expected_exception: type = Exception
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.expected_exception = expected_exception
        self.failure_count = 0
        self.last_failure_time = None
        self.state = "CLOSED"  # CLOSED, OPEN, HALF_OPEN
```
States:
- CLOSED: Normal operation, requests pass through
- OPEN: Circuit breaker triggered, fail fast without calling API
- HALF_OPEN: Testing recovery, single request allowed
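The class above only shows its state fields; a minimal `call` method implementing the three-state transitions might look like the sketch below. This is an assumption about the eventual implementation, and a production version would also need a lock around the state transitions for concurrent callers.

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: int = 300,
                 expected_exception: type = Exception):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.expected_exception = expected_exception
        self.failure_count = 0
        self.last_failure_time = None
        self.state = "CLOSED"

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            # Fail fast until the recovery timeout elapses, then allow one probe
            if time.monotonic() - self.last_failure_time < self.recovery_timeout:
                raise RuntimeError("Circuit breaker is OPEN; failing fast")
            self.state = "HALF_OPEN"
        try:
            result = func(*args, **kwargs)
        except self.expected_exception:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            # Threshold reached, or the HALF_OPEN probe failed: trip the breaker
            if self.failure_count >= self.failure_threshold or self.state == "HALF_OPEN":
                self.state = "OPEN"
            raise
        else:
            # Any success closes the circuit and resets the failure count
            self.failure_count = 0
            self.state = "CLOSED"
            return result
```

The key behavior is that an OPEN breaker never touches the external API until `recovery_timeout` has passed, and a single HALF_OPEN probe decides whether to close or re-open the circuit.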
2. ExternalAPIClient Class:
```python
class ExternalAPIClient:
    """
    High-performance HTTP client for external API integrations.

    Features:
    - Connection pooling (reduces latency by 50-70%)
    - Automatic retries with exponential backoff
    - Circuit breaker pattern
    - Response caching
    - Rate limiting
    - Request/response logging
    """
    def __init__(
        self,
        base_url: str,
        timeout: float = 30.0,
        max_retries: int = 3,
        cache_ttl: int = 300,
        rate_limit: Optional[int] = None,  # requests per second
    ):
        self.base_url = base_url
        self.timeout = timeout
        self.max_retries = max_retries
        self.cache_ttl = cache_ttl
        self.rate_limit = rate_limit
        # Connection pooling: reuse connections across requests
        self.client = httpx.AsyncClient(
            base_url=base_url,
            timeout=httpx.Timeout(timeout),
            limits=httpx.Limits(
                max_keepalive_connections=20,
                max_connections=100,
                keepalive_expiry=60.0
            ),
            http2=True  # Enable HTTP/2 for better performance
        )
        # Circuit breaker
        self.circuit_breaker = CircuitBreaker(
            failure_threshold=5,
            recovery_timeout=300
        )
        # Rate limiting
        self._rate_limit_tokens = []
        self._rate_limit_lock = asyncio.Lock()
```
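The `_check_rate_limit` coroutine is referenced by the request methods below but never shown. Given the `_rate_limit_tokens` list initialized above, a sliding-window sketch consistent with "requests per second" semantics could look like this (the standalone class wrapper is for illustration; in the real client these would be methods on `ExternalAPIClient`):

```python
import asyncio
import time

class RateLimiter:
    """Sliding-window limiter: at most rate_limit calls per second (sketch)."""
    def __init__(self, rate_limit=None):
        self.rate_limit = rate_limit
        self._rate_limit_tokens = []
        self._rate_limit_lock = asyncio.Lock()

    async def _check_rate_limit(self) -> None:
        if self.rate_limit is None:
            return  # Rate limiting disabled for this client
        async with self._rate_limit_lock:
            now = time.monotonic()
            # Drop timestamps older than the 1-second window
            self._rate_limit_tokens = [t for t in self._rate_limit_tokens if now - t < 1.0]
            if len(self._rate_limit_tokens) >= self.rate_limit:
                # Sleep until the oldest token leaves the window
                await asyncio.sleep(1.0 - (now - self._rate_limit_tokens[0]))
            self._rate_limit_tokens.append(time.monotonic())
```

Holding the lock while sleeping serializes callers during a burst, which is usually the desired backpressure for an external API that throttles aggressively (such as the 10 req/s Virtuozzo limit mentioned later).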
3. Request Methods with Resilience:
GET with Caching:
```python
async def get(
    self,
    path: str,
    params: Optional[Dict[str, Any]] = None,
    headers: Optional[Dict[str, str]] = None,
    cache_key: Optional[str] = None,
    correlation_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Execute GET request with caching, retries, and circuit breaker"""
    # Check cache first
    if cache_key:
        cached = redis_client.get(cache_key)
        if cached:
            logger.info(
                "external_api_cache_hit",
                path=path,
                cache_key=cache_key,
                correlation_id=correlation_id
            )
            return cached
    # Check rate limit
    await self._check_rate_limit()
    # Add correlation ID to headers (create the dict if the caller passed none)
    if correlation_id:
        headers = headers or {}
        headers["X-Correlation-ID"] = correlation_id
    # Execute with circuit breaker and retry
    # (assumes circuit_breaker.call awaits async callables)
    return await self.circuit_breaker.call(
        self._execute_get, path, params, headers, cache_key, correlation_id
    )
```
POST with Idempotency:
```python
async def post(
    self,
    path: str,
    data: Optional[Dict[str, Any]] = None,
    json: Optional[Dict[str, Any]] = None,
    headers: Optional[Dict[str, str]] = None,
    correlation_id: Optional[str] = None,
    idempotency_key: Optional[str] = None,
) -> Dict[str, Any]:
    """
    Execute POST request with retries and circuit breaker.

    Args:
        idempotency_key: Idempotency key for safe retries
    """
    # Check rate limit
    await self._check_rate_limit()
    # Add correlation ID and idempotency key to headers
    if headers is None:
        headers = {}
    if correlation_id:
        headers["X-Correlation-ID"] = correlation_id
    if idempotency_key:
        headers["Idempotency-Key"] = idempotency_key
    # ... request execution (retry loop + circuit breaker) elided in this excerpt ...
    # Check for idempotent errors (already exists)
    if response.status_code == 400:
        response_body = response.text
        if any(
            phrase in response_body.lower()
            for phrase in ["already registered", "already exists", "duplicate"]
        ):
            logger.info(
                "external_api_idempotent_success",
                method="POST",
                path=path,
                status_code=response.status_code,
                correlation_id=correlation_id
            )
            # Treat as success (idempotent operation)
            return {"status": "success", "idempotent": True}
```
4. Retry Logic with Exponential Backoff:
```python
retry_count = 0
while retry_count <= self.max_retries:
    try:
        response = await self.client.get(path, params=params, headers=headers)
        response.raise_for_status()
        return response.json()
    except httpx.HTTPStatusError as e:
        # Don't retry client errors (4xx)
        if 400 <= e.response.status_code < 500:
            raise
        retry_count += 1
        if retry_count > self.max_retries:
            raise  # Retries exhausted: surface the last server error
        # Exponential backoff: 1s, 2s, 4s
        backoff = 2 ** (retry_count - 1)
        logger.warning(
            "external_api_retry",
            path=path,
            retry=retry_count,
            backoff_seconds=backoff,
            error=str(e),
            correlation_id=correlation_id
        )
        await asyncio.sleep(backoff)
```
5. Singleton Pattern for Client Management:
```python
# Global HTTP client instances (singleton pattern)
_clients: Dict[str, ExternalAPIClient] = {}

def get_http_client(
    name: str,
    base_url: str,
    **kwargs
) -> ExternalAPIClient:
    """
    Get or create an HTTP client instance.

    Args:
        name: Client identifier (e.g., 'virtuozzo', 'bunny_cdn')
        base_url: Base URL for the API
        **kwargs: Additional client configuration

    Returns:
        ExternalAPIClient instance
    """
    if name not in _clients:
        _clients[name] = ExternalAPIClient(base_url, **kwargs)
    return _clients[name]

async def close_all_clients():
    """Close all HTTP clients (called on app shutdown)"""
    for client in _clients.values():
        await client.close()
    _clients.clear()
```
8.15.3 Service Adapter Pattern (app/core/adapters/)¶
Purpose: Isolate external API logic from domain service logic using the Adapter Pattern.
Adapter Location Strategy:
```text
app/
├── core/
│   ├── adapters/                  # ⭐ SHARED ADAPTERS (used by multiple modules)
│   │   ├── __init__.py
│   │   ├── virtuozzo_adapter.py   # Used by environments, wordpress, backups
│   │   ├── bunny_cdn_adapter.py   # Used by cache, domains
│   │   └── cloudflare_adapter.py  # Used by domains
│   └── http_client.py             # Shared HTTP client
├── environments/
│   └── service.py                 # Uses virtuozzo_adapter
├── wordpress/
│   └── service.py                 # Uses virtuozzo_adapter
└── cache/
    └── service.py                 # Uses bunny_cdn_adapter
```
Example: Virtuozzo Adapter:
```python
# backend/app/core/adapters/virtuozzo_adapter.py
"""
Virtuozzo API Adapter

Handles all interactions with the Virtuozzo API.
Isolates external API concerns from domain logic.
"""
import json
import urllib.parse
from typing import Dict, Any, Optional

import structlog

from app.core.http_client import get_http_client
from app.core.config import settings

logger = structlog.get_logger()


class VirtuozzoAdapter:
    """
    Adapter for Virtuozzo API integration.

    Responsibilities:
    - Execute Virtuozzo API calls
    - Handle session key management
    - Transform Virtuozzo responses to domain models
    - Cache frequently accessed data
    """
    def __init__(self):
        self.client = get_http_client(
            name="virtuozzo",
            base_url=settings.VIRTUOZZO_API_URL,
            timeout=90.0,  # Virtuozzo needs a longer timeout
            max_retries=3,
            cache_ttl=300,
            rate_limit=10  # Max 10 requests per second
        )

    async def fetch_environments_and_nodes(
        self,
        session_key: str,
        correlation_id: Optional[str] = None
    ) -> Dict[str, Any]:
        """
        Fetch environments and nodes from the Virtuozzo API.

        Args:
            session_key: Encrypted session key
            correlation_id: Request correlation ID for tracing

        Returns:
            Dict containing environments and nodes data
        """
        cache_key = f"virtuozzo:environments:{session_key[:8]}"
        response = await self.client.get(
            path="/1.0/environment/control/rest/getenvs",
            params={"session": session_key},
            cache_key=cache_key,
            correlation_id=correlation_id
        )
        # Normalize response
        if "error" in response and response["error"]:
            logger.error(
                "virtuozzo_api_error",
                error=response["error"],
                correlation_id=correlation_id
            )
            raise Exception(f"Virtuozzo API error: {response['error']}")
        # Process and normalize environment data
        if "infos" in response and isinstance(response["infos"], list):
            for info in response["infos"]:
                if "env" in info:
                    # Ensure displayName exists (fallback to envName)
                    if not info["env"].get("displayName"):
                        info["env"]["displayName"] = (
                            info["env"].get("envName") or
                            info["env"].get("shortdomain") or
                            "Unknown"
                        )
        return response

    async def execute_mbadmin_action(
        self,
        app_unique_name: str,
        session_key: str,
        action: str,
        params: Dict[str, Any],
        correlation_id: Optional[str] = None,
        idempotency_key: Optional[str] = None
    ) -> Dict[str, Any]:
        """Execute MbAdmin action via the Virtuozzo marketplace API."""
        # Encode params
        params_json = json.dumps(params)
        params_encoded = urllib.parse.quote(params_json)
        # Construct URL path
        path = (
            f"/1.0/marketplace/installation/rest/executeaction"
            f"?appUniqueName={app_unique_name}"
            f"&session={session_key}"
            f"&action={action}"
            f"&params={params_encoded}"
        )
        logger.info(
            "virtuozzo_execute_action",
            app_unique_name=app_unique_name,
            action=action,
            correlation_id=correlation_id
        )
        response = await self.client.post(
            path=path,
            headers={},
            correlation_id=correlation_id,
            idempotency_key=idempotency_key
        )
        return response

    async def start_environment(
        self,
        env_name: str,
        session_key: str,
        correlation_id: Optional[str] = None
    ) -> Dict[str, Any]:
        """Start environment"""
        return await self.client.post(
            path="/1.0/environment/control/rest/startenv",
            json={"envName": env_name, "session": session_key},
            correlation_id=correlation_id
        )


# Singleton instance
_virtuozzo_adapter: Optional[VirtuozzoAdapter] = None

def get_virtuozzo_adapter() -> VirtuozzoAdapter:
    """Get the Virtuozzo adapter instance"""
    global _virtuozzo_adapter
    if _virtuozzo_adapter is None:
        _virtuozzo_adapter = VirtuozzoAdapter()
    return _virtuozzo_adapter
```
Example: Bunny CDN Adapter:
```python
# backend/app/core/adapters/bunny_cdn_adapter.py
from app.core.http_client import get_http_client
from app.core.config import settings


class BunnyCDNAdapter:
    def __init__(self):
        self.client = get_http_client(
            name="bunny_cdn",
            base_url="https://api.bunny.net",
            timeout=30.0,
            max_retries=3
        )
        self.access_key = settings.BUNNY_CDN_ACCESS_KEY

    async def create_dns_record(
        self,
        env_name: str,
        platform_domain: str,
        correlation_id: str
    ) -> dict:
        """Create DNS CNAME record"""
        payload = {
            "Type": "CNAME",
            "Name": env_name,
            "Value": f"{env_name}.{platform_domain}",
            "Ttl": 15,
            "Accelerated": True,
            "MonitorType": "Monitor",
            "AutoSslIssuance": True
        }
        headers = {
            "AccessKey": self.access_key,
            "Accept": "application/json",
            "Content-Type": "application/json"
        }
        # Generate idempotency key for safe retries
        idempotency_key = f"bunny:dns:{env_name}:{platform_domain}"
        response = await self.client.post(
            path=f"/dnszone/{settings.BUNNY_DNS_ZONE_ID}/records",
            json=payload,
            headers=headers,
            correlation_id=correlation_id,
            idempotency_key=idempotency_key
        )
        return response
```
8.15.4 Service Integration Pattern¶
Purpose: Use adapters in service layer while maintaining domain logic separation.
Example: Environment Service Using Virtuozzo Adapter:
```python
# backend/app/environments/service.py
import uuid
from typing import Optional

import structlog
from sqlalchemy.orm import Session

from app.core.shared.audit import log_audit_event, AuditAction
from app.core.adapters.virtuozzo_adapter import get_virtuozzo_adapter
from app.core.exceptions import EnvironmentNotFoundError
from app.environments import repository, schema

logger = structlog.get_logger()


class EnvironmentService:
    """Service for environment operations"""
    def __init__(self):
        self.virtuozzo_adapter = get_virtuozzo_adapter()

    async def start_environment(
        self,
        db: Session,
        environment_id: int,
        user_id: int
    ) -> schema.EnvironmentRead:
        """
        Start environment.

        Business Rules:
        - User must have permission
        - Environment must exist
        - Environment must be in stopped/sleeping state
        """
        # Generate correlation ID for tracing
        correlation_id = str(uuid.uuid4())
        logger.info(
            "environment_start_requested",
            environment_id=environment_id,
            user_id=user_id,
            correlation_id=correlation_id
        )
        # Get environment from database
        environment = repository.get_environment_by_id(db, environment_id)
        if not environment:
            raise EnvironmentNotFoundError(f"Environment {environment_id} not found")
        # Validate user has access (via team)
        self._validate_access(db, user_id, environment.team_id)
        # Validate environment state
        if environment.status == "running":
            logger.info(
                "environment_already_running",
                environment_id=environment_id,
                correlation_id=correlation_id
            )
            return environment
        try:
            # Call Virtuozzo API via adapter
            response = await self.virtuozzo_adapter.start_environment(
                env_name=environment.env_name,
                session_key=environment.session_key,
                correlation_id=correlation_id
            )
            # Update environment status
            repository.update_environment(db, environment_id, {"status": "starting"})
            db.commit()
            # Log audit event
            log_audit_event(
                user_id=user_id,
                team_id=environment.team_id,
                action=AuditAction.UPDATE,
                resource_type="environment",
                resource_id=environment_id,
                metadata={
                    "action": "start",
                    "correlation_id": correlation_id
                }
            )
            logger.info(
                "environment_start_success",
                environment_id=environment_id,
                correlation_id=correlation_id
            )
            return environment
        except Exception as e:
            logger.error(
                "environment_start_failed",
                environment_id=environment_id,
                error=str(e),
                correlation_id=correlation_id
            )
            # Update status to error
            repository.update_environment(db, environment_id, {"status": "error"})
            db.commit()
            raise
```
6.15.5 Performance Optimizations¶
1. Connection Pooling Benefits:
Before (Laravel - No Pooling):
Request 1: DNS lookup (20ms) + TLS handshake (50ms) + Request (30ms) = 100ms
Request 2: DNS lookup (20ms) + TLS handshake (50ms) + Request (30ms) = 100ms
Request 3: DNS lookup (20ms) + TLS handshake (50ms) + Request (30ms) = 100ms
Total: 300ms
After (FastAPI - With Pooling):
Request 1: DNS lookup (20ms) + TLS handshake (50ms) + Request (30ms) = 100ms
Request 2: Reuse connection + Request (30ms) = 30ms
Request 3: Reuse connection + Request (30ms) = 30ms
Total: 160ms (47% improvement)
2. Response Caching:
# Cache frequently accessed Virtuozzo data
cache_key = f"virtuozzo:environments:{session_key[:8]}"
cache_ttl = 300 # 5 minutes
# First request: 90ms (API call)
# Subsequent requests: 2ms (Redis cache)
# Improvement: 98% faster
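The cache-aside flow behind those numbers can be sketched with an in-memory stand-in for Redis. `TTLCache` and `cached_fetch` are illustrative names for this sketch, not classes in the codebase:

```python
import asyncio
import json
import time
from typing import Any, Awaitable, Callable


class TTLCache:
    """In-memory stand-in for the Redis cache: entries expire after a TTL."""

    def __init__(self) -> None:
        self._store: dict[str, tuple[float, str]] = {}

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, raw = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired: evict and report a miss
            return None
        return json.loads(raw)

    def set(self, key: str, value: Any, ttl: float) -> None:
        self._store[key] = (time.monotonic() + ttl, json.dumps(value))


async def cached_fetch(cache: TTLCache, key: str, ttl: float,
                       fetch: Callable[[], Awaitable[Any]]) -> Any:
    """Serve from cache when fresh; otherwise call the slow upstream fetch."""
    hit = cache.get(key)
    if hit is not None:
        return hit
    value = await fetch()
    cache.set(key, value, ttl)
    return value
```

The second call for the same key skips the upstream request entirely, which is where the ~90ms-to-2ms difference comes from.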
3. HTTP/2 Multiplexing:
# Enable HTTP/2 for better performance (requires the httpx[http2] extra)
self.client = httpx.AsyncClient(
    http2=True  # Multiple requests multiplexed over a single connection
)

# Multiple parallel requests share a single TCP connection,
# reducing latency and connection overhead
6.15.6 Resilience Patterns¶
1. Circuit Breaker:
# Protect against cascading failures
circuit_breaker = CircuitBreaker(
    failure_threshold=5,   # Open after 5 failures
    recovery_timeout=300   # Try again after 5 minutes
)
# States:
# - CLOSED: Normal operation
# - OPEN: Fail fast, don't call API
# - HALF_OPEN: Try single request to test recovery
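A minimal sketch of that state machine (not the production `CircuitBreaker` in `app/core/http_client.py`, just the transitions described above; the injectable clock is for testability):

```python
import time


class CircuitBreaker:
    """Sketch: CLOSED -> OPEN after `failure_threshold` consecutive failures;
    OPEN -> HALF_OPEN once `recovery_timeout` elapses; HALF_OPEN -> CLOSED
    on the first success."""

    def __init__(self, failure_threshold: int = 5,
                 recovery_timeout: float = 300.0, now=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.now = now            # injectable clock for testing
        self.failure_count = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        """Return False to fail fast while the circuit is open."""
        if self.state == "OPEN":
            if self.now() - self.opened_at >= self.recovery_timeout:
                self.state = "HALF_OPEN"  # let a single probe request through
                return True
            return False
        return True

    def record_success(self) -> None:
        self.failure_count = 0
        self.state = "CLOSED"

    def record_failure(self) -> None:
        self.failure_count += 1
        if self.state == "HALF_OPEN" or self.failure_count >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.now()
```

The caller wraps each external request with `allow_request()` before the call and `record_success()`/`record_failure()` after it.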
2. Retry with Exponential Backoff:
# Automatic retries for transient failures
retry_count = 0
while retry_count <= max_retries:
    try:
        return await self.client.get(path)
    except NetworkError:
        retry_count += 1
        if retry_count > max_retries:
            raise  # retries exhausted; surface the error
        backoff = 2 ** (retry_count - 1)  # 1s, 2s, 4s
        await asyncio.sleep(backoff)
3. Idempotent Operations:
# Safe retries for mutations
idempotency_key = f"create_env:{user_id}:{env_name}"
await self.client.post(
    path="/create",
    json=data,
    idempotency_key=idempotency_key
)
# If retry hits "already exists" error, treat as success
6.15.7 Module Isolation Compliance¶
Import Rules for Adapters:
# ✅ ALLOWED: Import shared adapter
from app.core.adapters.virtuozzo_adapter import get_virtuozzo_adapter
# ✅ ALLOWED: Import HTTP client
from app.core.http_client import get_http_client
# ❌ FORBIDDEN: Import from other module
from app.wordpress.service import WordPressService # VIOLATION!
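These import rules could be enforced mechanically in CI, for example with a small AST-based check. `forbidden_imports` is an illustrative sketch, not an existing tool in the repo:

```python
import ast


def forbidden_imports(source: str,
                      forbidden: tuple[str, ...] = ("app.wordpress",)) -> list[str]:
    """Return imported module paths in `source` that violate the isolation
    rules (i.e. reach into another domain module)."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            # Match the module itself or any submodule of it
            if any(name == p or name.startswith(p + ".") for p in forbidden):
                violations.append(name)
    return violations
```

Running this over each domain module's files (with that module's siblings as the forbidden prefixes) turns the "❌ FORBIDDEN" rule above into a failing test rather than a convention.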
Adapter Directory Structure:
app/
├── core/
│ ├── adapters/ # ⭐ SHARED ADAPTERS
│ │ ├── __init__.py
│ │ ├── virtuozzo_adapter.py
│ │ ├── bunny_cdn_adapter.py
│ │ └── cloudflare_adapter.py
│ └── http_client.py # Shared HTTP client
├── environments/
│ └── service.py # Uses virtuozzo_adapter
├── wordpress/
│ └── service.py # Uses virtuozzo_adapter
└── backups/
└── service.py # Uses virtuozzo_adapter
6.15.8 Testing Strategy for External APIs¶
1. HTTP Client Testing:
# backend/app/tests/core/test_http_client.py
import pytest
from app.core.http_client import ExternalAPIClient
@pytest.mark.asyncio
async def test_circuit_breaker_opens_after_failures():
    """Test circuit breaker opens after threshold failures"""
    client = ExternalAPIClient(
        base_url="http://failing-service",
        timeout=1.0,
        max_retries=0
    )

    # Trigger 5 failures
    for i in range(5):
        with pytest.raises(Exception):
            await client.get("/fail")

    # Circuit should be open now
    assert client.circuit_breaker.state == "OPEN"

    # Next call should fail immediately without API call
    with pytest.raises(Exception, match="Circuit breaker OPEN"):
        await client.get("/fail")
2. Adapter Testing (Mocked):
# backend/app/tests/core/adapters/test_virtuozzo_adapter.py
import pytest
from unittest.mock import AsyncMock, patch
from app.core.adapters.virtuozzo_adapter import VirtuozzoAdapter
@pytest.mark.asyncio
async def test_start_environment_success():
    """Test successful environment start"""
    adapter = VirtuozzoAdapter()

    # Mock HTTP client response
    with patch.object(adapter.client, 'post', new=AsyncMock(return_value={
        "result": 0,
        "message": "Environment started successfully"
    })):
        result = await adapter.start_environment(
            env_name="test-env",
            session_key="test-session",
            correlation_id="test-123"
        )

    assert result["result"] == 0
    assert "started successfully" in result["message"]
3. Integration Testing (Real API - Dev/Staging Only):
# backend/app/tests/integration/test_virtuozzo_integration.py
import pytest
from app.core.adapters.virtuozzo_adapter import get_virtuozzo_adapter
@pytest.mark.integration
@pytest.mark.asyncio
async def test_fetch_environments_real_api():
    """Integration test with real Virtuozzo API (staging)"""
    adapter = get_virtuozzo_adapter()
    session_key = "test-session-key"  # From staging environment

    result = await adapter.fetch_environments_and_nodes(
        session_key=session_key,
        correlation_id="integration-test"
    )

    assert "infos" in result
    assert isinstance(result["infos"], list)
6.15.9 Migration Strategy by Service Type¶
Services Requiring Migration:
Virtuozzo Services (19 files):
| Laravel Service | FastAPI Module | Adapter | Priority |
|---|---|---|---|
| VirtuozzoService.php | environments/ | virtuozzo_adapter.py | P0 |
| MbAdminService.php | wordpress/ | virtuozzo_adapter.py | P0 |
| EnvironmentStartService.php | environments/ | virtuozzo_adapter.py | P0 |
| EnvironmentStopService.php | environments/ | virtuozzo_adapter.py | P0 |
| EnvironmentSleepService.php | environments/ | virtuozzo_adapter.py | P0 |
| EnvironmentRestartService.php | environments/ | virtuozzo_adapter.py | P0 |
| EnvironmentDeletionService.php | environments/ | virtuozzo_adapter.py | P0 |
| VirtuozzoBackupService.php | backups/ | virtuozzo_adapter.py | P1 |
| SftpService.php | sftp/ | virtuozzo_adapter.py | P1 |
| AddonManagementService.php | environments/ | virtuozzo_adapter.py | P1 |
CDN/Cache Services (4 files):
| Laravel Service | FastAPI Module | Adapter | Priority |
|---|---|---|---|
| BunnyCdnService.php | cache/ | bunny_cdn_adapter.py | P1 |
| CloudflareDetectionService.php | domains/ | cloudflare_adapter.py | P2 |
| RelayService.php | cache/ | relay_adapter.py | P2 |
| CacheManagementService.php | cache/ | N/A (internal) | P2 |
External Integration Services (8 files):
| Laravel Service | FastAPI Module | Adapter | Priority |
|---|---|---|---|
| ExternalApiService.php | core/ | http_client.py (base) | P0 |
| ExternalAuthService.php | auth/ | N/A (internal) | P1 |
| PostmarkService.php | notifications/ | postmark_adapter.py | P2 |
| UptimeCheckerService.php | uptime/ | uptime_adapter.py | P2 |
| UptimeMonitorService.php | uptime/ | uptime_adapter.py | P2 |
| DomainAvailabilityService.php | domains/ | domain_check_adapter.py | P2 |
| CnameChecker.php | domains/ | N/A (internal) | P2 |
| SslChecker.php | domains/ | N/A (internal) | P2 |
Other Services (9 files):
| Laravel Service | FastAPI Module | Adapter | Priority |
|---|---|---|---|
| WebSocketBroadcastService.php | websocket/ | N/A (internal) | P0 |
| WebSocketLogger.php | core/ | N/A (internal) | P1 |
| WebSocketTelemetry.php | websocket/ | N/A (internal) | P1 |
| WebSocketTokenService.php | websocket/ | N/A (internal) | P1 |
| UserAgentService.php | core/ | N/A (internal) | P2 |
| TeamService.php | teams/ | N/A (internal) | P0 |
| AccountUpgradeService.php | billing/ | N/A (internal) | P1 |
| DeadLetterQueueService.php | jobs/ | N/A (internal) | P1 |
| DlqMonitoringService.php | jobs/ | N/A (internal) | P1 |
Total: 40 services to migrate
6.15.10 Performance Targets¶
| Metric | Laravel Baseline | FastAPI Target | Improvement |
|---|---|---|---|
| Average API Call Latency | 100ms | 50ms | 50% |
| P95 Latency | 300ms | 150ms | 50% |
| Connection Overhead | 70ms per request | 70ms first request, 5ms subsequent | 93% |
| Cache Hit Rate | 0% (no caching) | 80%+ | ∞ |
| Failed Request Recovery | Manual retry | Automatic (3 retries) | N/A |
| Cascading Failure Protection | None | Circuit breaker | N/A |
6.15.11 Rollback Strategy¶
Feature Flags:
# backend/app/core/config.py
class Settings(BaseSettings):
    USE_EXTERNAL_API_CLIENT: bool = True  # Feature flag
    CIRCUIT_BREAKER_ENABLED: bool = True
    RESPONSE_CACHING_ENABLED: bool = True

# In service
if settings.USE_EXTERNAL_API_CLIENT:
    adapter = get_virtuozzo_adapter()
else:
    # Fallback to legacy implementation
    adapter = LegacyVirtuozzoAdapter()
Gradual Rollout:
1. Week 1: Deploy with feature flag OFF
2. Week 2: Enable for 10% of traffic
3. Week 3: Enable for 50% of traffic
4. Week 4: Enable for 100% of traffic
5. Week 5: Remove feature flag
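Percentage-based rollout needs a deterministic bucketing function so the same user always gets the same variant. A minimal sketch (`in_rollout` is an illustrative name; the real flag service may bucket by team or request instead):

```python
import hashlib


def in_rollout(user_id: int, rollout_percent: int) -> bool:
    """Deterministic percentage rollout: hash the user ID into a stable
    bucket 0-99 and enable the flag when the bucket is below the threshold."""
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent
```

Moving from 10% to 50% to 100% then only means raising `rollout_percent`; users already enabled stay enabled, which keeps the rollout monotonic.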
6.15.12 Monitoring and Metrics¶
External API Metrics:
from prometheus_client import Counter, Histogram, Gauge

# External API call metrics
external_api_requests_total = Counter(
    'external_api_requests_total',
    'Total external API requests',
    ['adapter', 'method', 'endpoint', 'status']
)

external_api_duration_seconds = Histogram(
    'external_api_duration_seconds',
    'External API call duration',
    ['adapter', 'method', 'endpoint']
)

circuit_breaker_state = Gauge(
    'circuit_breaker_state',
    'Circuit breaker state (0=CLOSED, 1=OPEN, 2=HALF_OPEN)',
    ['adapter']
)

cache_hit_rate = Counter(
    'external_api_cache_hits',
    'Cache hits for external API calls',
    ['adapter', 'endpoint']
)
Alerting Rules:
- Circuit breaker opens (alert severity: WARNING)
- High error rate (>5% over 5 minutes, severity: CRITICAL)
- High latency (p95 > 500ms, severity: WARNING)
- Cache hit rate drops below 50% (severity: INFO)
9. Development Priorities & Sequencing¶
This section provides a strategic overview of implementation priorities. Detailed task lists are maintained in separate files (see docs/development/tasks/ directory).
9.1 Priority Framework¶
P0 (Critical - Must Have for MVP):
- Authentication & Authorization System (JWT, RBAC, API Keys)
- Database Migration (MySQL → PostgreSQL with Citus sharding preparation)
- Core Jelastic API Integration (environments, nodes, basic operations)
- Real-time WebSocket System (environment status updates)
- Basic Frontend Dashboard (environment list, status monitoring)

P1 (High - Required for Production):
- Advanced Jelastic Operations (scaling, backups, logs)
- Multi-Region Disaster Recovery
- Compliance Framework (GDPR data subject rights APIs)
- Auto-Scaling Policies & Implementation
- Performance Optimization & Caching

P2 (Medium - Post-Launch Enhancements):
- White-label Client Portal (for agencies)
- Advanced Analytics & Reporting
- Git Integration for Deployments
- Plugin Conflict Detection
- Predictive Scaling with ML

P3 (Low - Future Roadmap):
- Mobile App (iOS/Android)
- AI-Powered Performance Recommendations
- Multi-Cloud Support (beyond AWS)
- Blockchain-based Audit Logs
9.2 Development Sequencing Strategy¶
Vertical Slice Approach: Each phase delivers end-to-end functionality for a subset of features, rather than building all layers of a single feature.
Phase 1: Foundation (Authentication + Basic CRUD)
├── Backend: JWT auth, user CRUD, basic Jelastic integration
├── Frontend: Login page, dashboard shell, environment list
├── Database: PostgreSQL setup, initial migrations
└── Testing: Auth e2e tests, API integration tests
Phase 2: Core Business Logic (Environment Management)
├── Backend: Full Jelastic CRUD operations
├── Frontend: Environment details, node management
├── Database: Environments, nodes, job_logs tables
└── Testing: Environment lifecycle tests
Phase 3: Advanced Features (Real-time, Scaling, Compliance)
├── Backend: WebSocket server, auto-scaling, GDPR APIs
├── Frontend: Real-time updates, compliance dashboard
├── Database: WebSocket state, audit logs
└── Testing: Load tests, chaos engineering
Phase 4: Production Hardening (DR, Monitoring, Compliance)
├── Infrastructure: Multi-region setup, Citus sharding
├── Monitoring: Full observability stack (Prometheus, Grafana)
├── Compliance: SOC 2 audit preparation
└── Testing: DR drills, penetration testing
9.3 Component Development Order¶
Week 1-3: Foundation
1. ✅ Authentication System → docs/development/tasks/auth_system.md
2. ✅ Database Setup → docs/development/tasks/database_migration.md
3. ✅ API Framework → docs/development/tasks/api_foundation.md
4. ✅ Frontend Shell → docs/development/tasks/frontend_foundation.md
Week 4-8: Core Features
5. ✅ Jelastic Integration → docs/development/tasks/jelastic_integration.md
6. ✅ Environment CRUD → docs/development/tasks/environment_management.md
7. ✅ Node Management → docs/development/tasks/node_management.md
8. ✅ Job Queue System → docs/development/tasks/job_queue.md
Week 9-12: Advanced Features
9. ✅ WebSocket System → docs/development/tasks/websocket_system.md
10. ✅ Auto-Scaling → docs/development/tasks/auto_scaling.md
11. ✅ GDPR Compliance → docs/development/tasks/gdpr_compliance.md
12. ✅ Performance Optimization → docs/development/tasks/performance_optimization.md
Week 13-16: Frontend & UX
13. ✅ Dashboard UI → docs/development/tasks/dashboard_ui.md
14. ✅ Monitoring Views → docs/development/tasks/monitoring_views.md
15. ✅ User Management → docs/development/tasks/user_management_ui.md
Week 17-20: Production Deployment
16. ✅ Multi-Region DR → docs/development/tasks/disaster_recovery.md
17. ✅ Database Sharding → docs/development/tasks/citus_sharding.md
18. ✅ SOC 2 Preparation → docs/development/tasks/soc2_compliance.md
19. ✅ Load Testing → docs/development/tasks/load_testing.md
20. ✅ Production Cutover → docs/development/tasks/production_cutover.md
9.4 Cross-Cutting Concerns (Continuous)¶
These activities run parallel to all phases:
| Activity | Frequency | Owner | Deliverable |
|---|---|---|---|
| Code Reviews | Every PR | Tech Lead | Approved PRs, architecture feedback |
| Security Scanning | Daily (CI/CD) | DevOps | Vulnerability reports, fixes |
| Performance Testing | Weekly | Backend Lead | Latency reports, optimization backlog |
| Documentation | Per feature | Feature owner | Updated docs in /docs |
| Compliance Checks | Bi-weekly | Security Team | GDPR/SOC 2 checklist updates |
| DR Drills | Monthly | SRE Team | DR test reports, runbook updates |
9.5 External Task List Structure¶
All detailed implementation tasks are organized in separate files:
docs/development/
├── MAINPRD.md # This file (strategic blueprint)
├── tasks/
│ ├── README.md # Task list index
│ ├── phase1_foundation/
│ │ ├── auth_system.md # Detailed auth tasks
│ │ ├── database_migration.md # DB migration steps
│ │ ├── api_foundation.md # FastAPI setup tasks
│ │ └── frontend_foundation.md # Next.js setup tasks
│ ├── phase2_core_features/
│ │ ├── jelastic_integration.md # Jelastic API integration
│ │ ├── environment_management.md # Environment CRUD
│ │ └── job_queue.md # ARQ job queue setup
│ ├── phase3_advanced_features/
│ │ ├── websocket_system.md # WebSocket implementation
│ │ ├── auto_scaling.md # Auto-scaling policies
│ │ └── gdpr_compliance.md # GDPR API implementation
│ └── phase4_production/
│ ├── disaster_recovery.md # Multi-region DR setup
│ ├── citus_sharding.md # Database sharding migration
│ └── production_cutover.md # Production deployment
└── runbooks/
├── deployment.md # Deployment procedures
├── incident_response.md # Incident handling
└── dr_failover.md # DR failover runbook
9.6 Task List Format (Standard Template)¶
Each task list file follows this structure:
# [Component Name] - Detailed Task List
## Overview
Brief description of component and its role in the system.
## Prerequisites
- Dependencies that must be completed first
- Required tools/libraries
- Access requirements
## Tasks
### Task 1: [Task Name]
**Priority**: P0/P1/P2/P3
**Estimated Time**: X hours/days
**Assignee**: [Role or name]
**Status**: ❌ Not Started | 🟡 In Progress | ✅ Completed
**Description:**
[Detailed task description]
**Acceptance Criteria:**
- [ ] Criterion 1
- [ ] Criterion 2
- [ ] Criterion 3
**Implementation Steps:**
1. Step 1
2. Step 2
3. Step 3
**Testing:**
- Unit tests: [description]
- Integration tests: [description]
- E2E tests: [description]
**Self-Verification:**
```bash
# Commands to verify task completion
pytest tests/test_component.py
curl -X GET http://localhost:8000/health
```

**Dependencies:**
- Depends on: [Other tasks]
- Blocks: [Future tasks]

### Task 2: [Next Task]
...
9.7 Risk Mitigation During Development¶
| Risk | Probability | Impact | Mitigation Strategy |
|------|-------------|--------|---------------------|
| **Jelastic API Breaking Changes** | Medium | High | Version pin, contract tests, fallback to cached data |
| **Database Migration Data Loss** | Low | Critical | Parallel run for 2 weeks, automated rollback |
| **Performance Regression** | Medium | High | Automated performance tests in CI, canary deployments |
| **Security Vulnerability** | Medium | Critical | Daily security scans, penetration testing before launch |
| **Scope Creep** | High | Medium | Strict prioritization framework (P0-P3), weekly review |
| **Key Developer Attrition** | Low | High | Documentation, pair programming, knowledge sharing |
10. Local Development Setup¶
This section provides everything developers need to run MBPanel locally.
10.1 Prerequisites¶
**Required Software:**
- **Python**: 3.11+ (recommended: 3.11.9)
- **Node.js**: 18+ (recommended: 18.20.0 LTS)
- **PostgreSQL**: 15+ (or Docker container)
- **Redis**: 7+ (or Docker container)
- **Docker**: 24+ (for containerized services)
- **Git**: 2.40+
- **Make**: 4.0+ (for Makefile commands)
**Recommended Tools:**
- **VS Code** with Python, Pylance, ESLint extensions
- **pgAdmin** or **TablePlus** for database management
- **RedisInsight** for Redis debugging
- **Postman** or **HTTPie** for API testing
- **Context7 MCP** for AI-assisted development (optional)
10.2 Quick Start (5 Minutes)¶
# 1. Clone repository
git clone https://github.com/mightybox-io/mbpanel.git
cd mbpanel
# 2. Start infrastructure with Docker Compose
make infra-up
# This starts: PostgreSQL, Redis, PgBouncer
# 3. Setup backend
cd backend
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
# 4. Run database migrations
alembic upgrade head
# 5. Create superuser
python scripts/create_superuser.py --email admin@example.com --password admin123
# 6. Start backend server
uvicorn main:app --reload --host 0.0.0.0 --port 8000
# 7. Setup frontend (new terminal)
cd frontend
npm install
npm run dev
# 8. Access application
# - Frontend: http://localhost:3000
# - Backend API: http://localhost:8000
# - API Docs: http://localhost:8000/docs
# - PgAdmin: http://localhost:5050 (admin@admin.com / admin)
10.3 Environment Variables¶
Create .env files for backend and frontend:
Backend .env (backend/.env):
# Database
DATABASE_URL=postgresql://mbpanel:mbpanel_dev@localhost:5432/mbpanel_dev
DATABASE_POOL_SIZE=10
DATABASE_MAX_OVERFLOW=20
# Redis
REDIS_URL=redis://localhost:6379/0
REDIS_CACHE_DB=1
# Security
SECRET_KEY=your-secret-key-change-in-production
JWT_SECRET_KEY=your-jwt-secret-change-in-production
JWT_ACCESS_TOKEN_EXPIRE_MINUTES=15
JWT_REFRESH_TOKEN_EXPIRE_DAYS=7
# Jelastic API (use test environment)
JELASTIC_API_URL=https://test-api.jelastic.com
JELASTIC_API_TOKEN=your-test-token
# CORS (allow frontend)
ALLOWED_ORIGINS=http://localhost:3000,http://localhost:8000
# Logging
LOG_LEVEL=DEBUG
LOG_FORMAT=json
# Environment
ENVIRONMENT=development
DEBUG=true
Frontend .env.local (frontend/.env.local):
# API Endpoint
NEXT_PUBLIC_API_URL=http://localhost:8000
NEXT_PUBLIC_WS_URL=ws://localhost:8000/ws
# Authentication
NEXT_PUBLIC_AUTH_COOKIE_NAME=mbpanel_token
# Analytics (disable in dev)
NEXT_PUBLIC_ANALYTICS_ENABLED=false
# Feature Flags
NEXT_PUBLIC_ENABLE_GDPR_TOOLS=true
NEXT_PUBLIC_ENABLE_BETA_FEATURES=false
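Loading the backend variables above can be sketched with the standard library alone (the real backend would presumably use pydantic-settings' `BaseSettings`; `BackendSettings` and the subset of fields shown are illustrative):

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class BackendSettings:
    """Stdlib-only sketch of typed settings loaded from the environment."""
    database_url: str
    redis_url: str
    jwt_access_token_expire_minutes: int
    debug: bool

    @classmethod
    def from_env(cls, env=os.environ) -> "BackendSettings":
        return cls(
            database_url=env.get(
                "DATABASE_URL",
                "postgresql://mbpanel:mbpanel_dev@localhost:5432/mbpanel_dev"),
            redis_url=env.get("REDIS_URL", "redis://localhost:6379/0"),
            jwt_access_token_expire_minutes=int(
                env.get("JWT_ACCESS_TOKEN_EXPIRE_MINUTES", "15")),
            debug=env.get("DEBUG", "false").lower() == "true",
        )
```

Typed, validated settings fail fast at startup (e.g. a non-numeric token TTL raises immediately) instead of surfacing as runtime errors deep in a request.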
10.4 Development Workflow¶
Daily Development Cycle:
# 1. Start your day - pull latest changes
git checkout main
git pull origin main
make dev-start # Starts infra + backend + frontend
# 2. Create feature branch
git checkout -b feature/US-042-jelastic-scaling
# 3. Make changes, run tests frequently
make test-watch # Auto-runs tests on file changes
# 4. Lint and format before committing
make lint # Runs ruff, mypy, eslint
make format # Formats code
# 5. Commit with conventional commits
git add .
git commit -m "feat(jelastic): add auto-scaling API endpoint
- Implement POST /api/v1/environments/{id}/scale
- Add cloudlet calculation logic
- Include integration tests
Closes #42"
# 6. Push and create PR
git push origin feature/US-042-jelastic-scaling
gh pr create --title "feat: Jelastic auto-scaling API" --body "..."
# 7. End of day - stop services
make dev-stop
10.5 Makefile Commands¶
The Makefile provides convenient shortcuts:
Infrastructure:
make infra-up # Start PostgreSQL, Redis, PgBouncer
make infra-down # Stop all infrastructure
make infra-logs # Tail infrastructure logs
make db-reset # Drop and recreate database (WARNING: data loss)
Backend:
make backend-dev # Start backend in dev mode (hot reload)
make backend-test # Run all backend tests
make backend-lint # Lint backend code (ruff, mypy)
make backend-format # Format backend code
make migration-create # Create new Alembic migration
make migration-up # Apply migrations
make migration-down # Rollback last migration
Frontend:
make frontend-dev # Start frontend in dev mode
make frontend-test # Run frontend tests (Jest, Playwright)
make frontend-lint # Lint frontend code (ESLint, Prettier)
make frontend-build # Production build
Combined:
make dev-start # Start all services (infra + backend + frontend)
make dev-stop # Stop all services
make test-all # Run all tests (backend + frontend)
make lint-all # Lint all code
make format-all # Format all code
make clean # Clean build artifacts, caches
10.6 Database Management¶
Creating Migrations:
# 1. Modify SQLAlchemy models in backend/app/models/
# 2. Generate migration
cd backend
alembic revision --autogenerate -m "Add team_id to environments"
# 3. Review migration file in alembic/versions/
# 4. Apply migration
alembic upgrade head
# 5. Verify migration
psql -U mbpanel -d mbpanel_dev -c "\d environments"
Seeding Test Data:
# Seed database with test data
python scripts/seed_database.py --teams 10 --envs-per-team 5
# This creates:
# - 10 test teams
# - 50 test environments (5 per team)
# - Test users for each team
# - Sample job logs
Database Backup/Restore (Local):
# Backup local database
make db-backup # Creates backup in backups/mbpanel_dev_YYYYMMDD.sql
# Restore from backup
make db-restore BACKUP=backups/mbpanel_dev_20250117.sql
10.7 Testing Strategy¶
Backend Testing:
# Run all tests
pytest
# Run specific test file
pytest tests/test_auth.py
# Run with coverage
pytest --cov=app --cov-report=html
open htmlcov/index.html
# Run integration tests only
pytest -m integration
# Run unit tests only
pytest -m "not integration"
# Watch mode (auto-run on changes)
ptw # pytest-watch
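The `-m integration` / `-m "not integration"` selection above relies on the marker being registered with pytest; a minimal sketch of that configuration (file location and wording are assumptions):

```ini
# backend/pytest.ini (assumed location)
[pytest]
markers =
    integration: tests that call real external APIs (run in dev/staging only)
```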
Frontend Testing:
# Unit tests (Jest)
cd frontend
npm run test
# Watch mode
npm run test:watch
# E2E tests (Playwright)
npm run test:e2e
# E2E in headed mode (see browser)
npm run test:e2e:headed
Load Testing (Locust):
# Start backend first
make backend-dev
# Run load test
cd tests/performance
locust -f locustfile.py --host=http://localhost:8000
# Open browser: http://localhost:8089
# Set users: 100, spawn rate: 10
10.8 Debugging¶
Backend Debugging (VS Code):
Create .vscode/launch.json:
{
"version": "0.2.0",
"configurations": [
{
"name": "Python: FastAPI",
"type": "python",
"request": "launch",
"module": "uvicorn",
"args": [
"main:app",
"--reload",
"--host", "0.0.0.0",
"--port", "8000"
],
"jinja": true,
"justMyCode": false,
"cwd": "${workspaceFolder}/backend",
"env": {
"PYTHONPATH": "${workspaceFolder}/backend"
}
}
]
}
Set breakpoints in VS Code, press F5 to start debugging.
Frontend Debugging (Chrome DevTools):
# Start Next.js in debug mode
npm run dev
# Open Chrome DevTools
# Sources tab → Filesystem → Add folder → Select frontend/
# Set breakpoints in .tsx files
Redis Debugging:
# Connect to Redis CLI
redis-cli -h localhost -p 6379
# Monitor all commands
MONITOR
# Inspect cache keys
KEYS mbpanel:*
# Get specific cache value
GET mbpanel:environment:12345
Database Debugging:
# Connect to PostgreSQL
psql -U mbpanel -d mbpanel_dev
# Enable query logging
\set ECHO all
# Explain query
EXPLAIN ANALYZE SELECT * FROM environments WHERE team_id = 1;
# Check slow queries
SELECT * FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;
10.9 Common Issues & Solutions¶
Issue: Port already in use
# Error: uvicorn.error: Address already in use
# Solution: Kill process on port 8000
lsof -ti:8000 | xargs kill -9
# Or use different port
uvicorn main:app --port 8001
Issue: Database connection refused
# Error: psycopg2.OperationalError: could not connect to server
# Solution: Check PostgreSQL is running
docker ps | grep postgres
# Restart PostgreSQL
make infra-down && make infra-up
Issue: Redis connection refused
# Error: redis.exceptions.ConnectionError
# Solution: Check Redis is running
docker ps | grep redis
# Test Redis connection
redis-cli ping # Should return PONG
Issue: Alembic migration conflicts
# Error: Multiple heads detected
# Solution: Merge migration heads
alembic merge heads -m "merge migration heads"
alembic upgrade head
Issue: Frontend module not found
# Error: Module not found: Can't resolve '@/components/...'
# Solution: Clear Next.js cache
cd frontend
rm -rf .next node_modules
npm install
npm run dev
10.10 Development Tools & Extensions¶
VS Code Extensions (Recommended):
- Python: ms-python.python
- Pylance: ms-python.vscode-pylance
- ESLint: dbaeumer.vscode-eslint
- Prettier: esbenp.prettier-vscode
- GitLens: eamodio.gitlens
- Thunder Client: rangav.vscode-thunder-client (API testing)
- PostgreSQL: ckolkman.vscode-postgres
- Docker: ms-azuretools.vscode-docker
- Tailwind CSS IntelliSense: bradlc.vscode-tailwindcss
Browser Extensions (Recommended):
- React Developer Tools: Chrome/Firefox extension
- Redux DevTools: For state debugging
- JSON Viewer: Pretty-print API responses
10.11 Code Quality Standards¶
Pre-Commit Hooks:
# Install pre-commit hooks
pip install pre-commit
pre-commit install
# Runs automatically on git commit:
# - Ruff linting
# - MyPy type checking
# - Prettier formatting
# - Trailing whitespace removal
# - JSON/YAML validation
Code Coverage Requirements:
- Backend: Minimum 80% coverage
- Frontend: Minimum 70% coverage
- Critical paths (auth, payments): 95% coverage

Performance Budgets:
- API Response Time: <200ms p95
- Frontend Bundle Size: <500KB gzipped
- Lighthouse Score: >90 for Performance, Accessibility, Best Practices
11. Implementation Roadmap¶
Phase 0: Discovery & Compatibility (Weeks 0-1)¶
Objective: Lock API contracts, inventory legacy behavior, stand up compatibility shims
- Generate OpenAPI from FastAPI skeleton and align with legacy endpoints
- Build compatibility layer for high-traffic legacy routes under /api/v1/**
- Stand up ARQ worker scaffolding and DLQ
- Mirror WebSocket publish path to both Node and FastAPI (no cutover)
- Author data migration runbooks and validate on a sanitized subset
- Establish CI baseline: lint, type checks, unit tests, contract tests
Phase 1: Foundation (Weeks 1-3)¶
Objective: Establish core infrastructure, authentication systems, and External API integration foundation
Week 1:
- Set up FastAPI project structure with Hybrid Modular DDD architecture
- Implement basic authentication system with JWT
- Set up PostgreSQL database with initial schema
- Configure Docker containers and CI/CD pipeline
- 🔥 NEW: HTTP Client Infrastructure
- Implement app/core/http_client.py with ExternalAPIClient class
- Implement CircuitBreaker class for resilience
- Configure connection pooling with httpx (20 keepalive, 100 max connections)
- Implement retry logic with exponential backoff (1s, 2s, 4s)
- Add rate limiting mechanism
- Write unit tests for circuit breaker and retry logic
Week 2:
- Implement user registration and login flows in users module
- Set up Redis for caching and session management
- Create basic API endpoints for user management
- Implement database migration system (Alembic)
- 🔥 NEW: Virtuozzo Adapter Foundation
- Create app/core/adapters/ directory structure
- Implement virtuozzo_adapter.py with basic structure
- Implement session key management
- Add response normalization logic
- Create adapter unit tests (mocked)
- Configure Virtuozzo API connection (timeout: 90s, rate limit: 10 req/s)
Week 3:
- Complete authentication system with refresh tokens in auth module
- Implement role-based access control
- Set up comprehensive logging infrastructure
- Create API documentation and testing framework
- 🔥 NEW: Virtuozzo Adapter Core Methods
- Implement fetch_environments_and_nodes() with caching (5 min TTL)
- Implement execute_mbadmin_action() with idempotency keys
- Implement environment lifecycle methods: start_environment(), stop_environment(), sleep_environment()
- Add correlation ID support for distributed tracing
- Create integration tests for Virtuozzo adapter (staging environment)
- Document adapter usage patterns for domain modules
Phase 2: Core Business Logic (Weeks 4-8)¶
Objective: Implement essential business functionality and integrate Virtuozzo services
Week 4:
- Implement user and team management services in users module
- Create basic site management endpoints
- Set up job queue system with ARQ in jobs module
- Begin migration of Laravel services to FastAPI modular services
- 🔥 NEW: Environments Module with Virtuozzo Integration
- Implement environments/service.py using virtuozzo_adapter
- Implement start_environment(), stop_environment(), sleep_environment() service methods
- Add correlation ID generation and audit logging
- Implement error handling and status management
- Create environment lifecycle background jobs (ARQ)
- Write integration tests for environment operations
Week 5:
- Complete site creation and configuration endpoints
- Implement environment management functionality
- Set up WebSocket system for real-time updates in websocket module
- Begin data migration from MySQL to PostgreSQL
- 🔥 NEW: WordPress Module with Virtuozzo Integration
- Implement wordpress/service.py using virtuozzo_adapter
- Implement execute_wp_cli() with command validation
- Add long-running command queueing (ARQ background jobs)
- Implement rate limiting (10 commands/minute/user)
- Add WP-CLI allowed commands whitelist
- Create WordPress service tests (mocked adapter)
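The 10-commands-per-minute limit in the Week 5 plan could be enforced with a per-user token bucket; a minimal sketch (`TokenBucket` is an illustrative name, and the injectable clock exists only for testability):

```python
import time


class TokenBucket:
    """Sketch of a per-user rate limiter: for 10 commands/minute use
    capacity=10 and refill_per_sec=10/60."""

    def __init__(self, capacity: float, refill_per_sec: float,
                 now=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity       # start full
        self._now = now
        self._last = now()

    def allow(self) -> bool:
        """Spend one token if available; refill based on elapsed time."""
        t = self._now()
        self.tokens = min(self.capacity,
                          self.tokens + (t - self._last) * self.refill_per_sec)
        self._last = t
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

In the service layer one bucket would be kept per user (e.g. in Redis in production, so limits survive restarts and apply across workers).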
Week 6:
- Complete environment lifecycle management
- Implement backup and restore functionality
- Finish core job queue implementations in jobs module
- Set up basic observability stack (Prometheus, Grafana)
- 🔥 NEW: Backups Module with Virtuozzo Integration
- Implement backups/service.py using virtuozzo_adapter
- Implement create_backup() with background job queueing
- Implement restore_backup() with validation and status tracking
- Add backup session management
- Create backup/restore integration tests
- Set up Prometheus metrics for backup operations
Week 7:
- Implement domain management and SSL certificate functionality
- Complete advanced site management features
- Set up comprehensive monitoring and alerting
- Begin performance testing and optimization
- 🔥 NEW: CDN/Cache Services Integration
- Implement bunny_cdn_adapter.py for DNS and CDN operations
- Implement cloudflare_adapter.py for DNS and SSL
- Implement cache/service.py using bunny_cdn_adapter
- Implement domains/service.py using multiple adapters
- Add idempotency keys for DNS operations
- Create CDN/DNS integration tests
Week 8:
- Complete all Phase 2 business logic
- Perform security audit and hardening
- Conduct performance validation testing
- Prepare for Phase 3 development
- 🔥 NEW: External API Performance Validation
- Validate connection pooling effectiveness (target: 90%+ reuse rate)
- Validate circuit breaker functionality (test with simulated failures)
- Validate cache hit rates (target: 80%+ for Virtuozzo API)
- Measure external API latency (target: p95 < 150ms)
- Load test external API integrations (concurrent requests)
- Document performance baselines and optimization opportunities
Phase 3: Advanced Features (Weeks 9-12)¶
Objective: Implement advanced functionality and integrations
Week 9:
- Implement billing and subscription management
- Set up payment processing integration
- Begin advanced job queue implementations
- Implement advanced caching strategies
Week 10:
- Complete billing and payment functionality
- Implement advanced team and permissions features
- Set up webhook system for external integrations
- Begin advanced performance optimization
Week 11:
- Complete all advanced feature implementations
- Set up comprehensive error tracking system
- Implement advanced security features (MFA, audit logs)
- Conduct load testing and optimization
Week 12:
- Complete all Phase 3 functionality
- Perform comprehensive security and performance validation
- Prepare staging environment for Phase 4
- Document all advanced features
Phase 4: Frontend Migration (Weeks 13-16)¶
Objective: Migrate frontend from React/Inertia to Next.js
Week 13:
- Set up Next.js project structure
- Implement authentication integration
- Create basic dashboard components
- Set up API client generation from OpenAPI specs
Week 14:
- Implement core UI components for site management
- Create environment management interfaces
- Set up real-time WebSocket integration
- Implement responsive design patterns
Week 15:
- Complete all major UI feature implementations
- Implement advanced dashboard functionality
- Set up comprehensive state management
- Begin user acceptance testing
Week 16:
- Complete frontend implementation
- Conduct comprehensive UI testing
- Perform cross-browser compatibility testing
- Prepare for production deployment
Phase 5: Production Deployment (Weeks 17-20)¶
Objective: Deploy to production and complete migration
Week 17:
- Set up production infrastructure
- Implement backup and disaster recovery procedures
- Conduct final security audit
- Prepare production deployment scripts
Week 18:
- Deploy to staging environment
- Conduct final end-to-end testing
- Perform performance validation in staging
- Prepare final data migration plan
Week 19:
- Execute production deployment
- Conduct data migration from legacy system
- Monitor system performance and stability
- Implement post-deployment monitoring
Week 20:
- Complete migration cutover from Laravel
- Monitor system performance and user feedback
- Address any post-deployment issues
- Document final system architecture
10. Success Criteria¶
10.1 Performance Metrics¶
- API Response Times: Achieve <200ms for 95th percentile of requests
- Database Queries: Achieve <50ms for common operations
- Page Load Times: Achieve <2.5 seconds for dashboard pages
- Concurrent Users: Successfully handle 5x current user load
- 🔥 External API Performance:
- Average API call latency: <50ms (50% improvement from Laravel baseline of 100ms)
- P95 external API latency: <150ms (50% improvement from Laravel baseline of 300ms)
- Cache hit rate for external APIs: >80% (compared to 0% in Laravel)
- Connection reuse rate: >90% (via connection pooling)
- Failed request recovery: 100% automatic retry with exponential backoff
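The automatic-retry behavior behind the last metric can be sketched as follows. This is a minimal illustration only; the function name and defaults are ours, and the production HTTP client would also layer in circuit breaking and connection pooling:

```python
import asyncio

async def call_with_retries(func, *, attempts=3, base_delay=1.0):
    """Retry an async callable with exponential backoff.

    With the defaults, failed attempts are retried after 1s and then
    2s (the same doubling schedule continues if more attempts are
    configured), matching the 1s/2s/4s backoff described elsewhere
    in this document.
    """
    for attempt in range(attempts):
        try:
            return await func()
        except Exception:
            if attempt == attempts - 1:
                raise  # all attempts exhausted; surface the failure
            await asyncio.sleep(base_delay * (2 ** attempt))
```

Transient failures (the common case for external APIs) are recovered without any caller involvement, which is what makes the "100% automatic retry" target achievable.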
10.2 Quality Gates¶
- Code Coverage: Maintain >90% test coverage for all services
- Security Score: Achieve A+ rating on security scanning tools
- Performance Score: Achieve 95+ on Lighthouse performance metrics
- Error Rate: Maintain <0.1% error rate in production
- 🔥 External API Quality Gates:
- Circuit breaker integration tests: 100% passing
- Retry logic tests: 100% passing (all 3 retry attempts tested)
- Idempotency tests: 100% passing (verify duplicate requests handled correctly)
- Rate limiting tests: 100% passing (verify 429 errors prevented)
- Integration test coverage: >80% for all external API adapters
10.3 Architecture Metrics¶
- Module Isolation: Validate that no modules import from other modules
- Cohesion: Validate that each module contains all necessary layers
- Maintainability: Measure cognitive load reduction when working on features
- Developer Velocity: Track improvement in feature development time
- 🔥 Adapter Architecture Metrics:
- Adapter isolation: 100% compliance (all external API calls go through adapters)
- Shared adapter reuse: Virtuozzo adapter used by 7+ modules
- Module-to-adapter coupling: Each module uses 1-2 adapters maximum
- Adapter test coverage: >90% for all adapter implementations
10.4 Business Metrics¶
- User Satisfaction: Achieve >4.5/5 satisfaction rating from user testing
- Migration Completion: Successfully migrate 100% of Laravel functionality
- Downtime: Maintain <4 hours total planned downtime during migration
- Cost Reduction: Achieve 30% reduction in operational costs
10.5 External API Reliability Metrics¶
- 🔥 NEW: Resilience Targets:
- Circuit breaker open rate: <1% of total time
- Retry success rate: >90% (transient failures recovered automatically)
- Idempotent operation safety: 100% (no duplicate resource creation)
- Rate limit violation rate: 0% (no 429 errors from external APIs)
- Cascading failure prevention: 100% (circuit breaker prevents cascading failures)
- 🔥 NEW: External Service Migration Targets:
- Virtuozzo Services: 19/19 services migrated with feature parity
- CDN/Cache Services: 4/4 services migrated (Bunny CDN, Cloudflare, Relay, Cache Management)
- External Integrations: 8/8 services migrated (Postmark, UptimeRobot, etc.)
- Other Services: 9/9 services migrated
- Total: 40/40 services migrated successfully
- 🔥 NEW: Performance Improvement Validation:
- Connection overhead reduction: 93% (70ms → 5ms for subsequent requests)
- Cache effectiveness: 98% latency reduction for cached responses (90ms → 2ms)
- Parallel request optimization: 62% improvement (240ms → 90ms for 3 parallel calls)
- HTTP/2 multiplexing: 20-30% latency improvement for multiple requests
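The parallel-request optimization above comes from issuing independent external calls concurrently rather than sequentially, so total latency approaches the slowest single call instead of the sum. A minimal sketch, where `fetch_status` is a hypothetical stand-in for one adapter call:

```python
import asyncio

async def fetch_status(env_name: str) -> str:
    """Stand-in for one external API call; the sleep simulates
    network latency to the upstream service."""
    await asyncio.sleep(0.01)
    return f"{env_name}: running"

async def fetch_all(env_names):
    # Issue all calls concurrently; total wall-clock time is roughly
    # one call's latency, not the sum of all of them.
    return await asyncio.gather(*(fetch_status(n) for n in env_names))
```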
10.6 Monitoring and Observability Criteria¶
- 🔥 NEW: External API Monitoring:
- Prometheus metrics captured for all external API calls
- Grafana dashboards deployed for external API performance
- Alerts configured for circuit breaker opens (severity: WARNING)
- Alerts configured for high error rates (>5% over 5 min, severity: CRITICAL)
- Alerts configured for high latency (p95 > 500ms, severity: WARNING)
- Alerts configured for low cache hit rate (<50%, severity: INFO)
- Correlation IDs tracked across all external API calls for distributed tracing
11. Risk Assessment¶
11.1 Risk Matrix¶
| Risk | Probability | Impact | Severity | Mitigation Strategy |
|---|---|---|---|---|
| Data Migration Issues | Medium | High | High | Comprehensive testing with validation, rollback plan |
| Performance Degradation | Low | High | Medium | Performance testing at each phase, optimization plan |
| Architecture Compliance | Medium | Medium | Medium | Regular architecture reviews, automated checks |
| Team Knowledge Gaps | Medium | Medium | Medium | Training sessions, documentation, code reviews |
| Third-party Integration Issues | Medium | Medium | Medium | Early integration testing, fallback options |
| Timeline Delays | Medium | Medium | Medium | Buffer time in schedule, parallel development |
| 🔥 External API Failure Cascade | Medium | Critical | High | Circuit breaker implementation, fail-fast patterns |
| 🔥 External API Rate Limiting | High | Medium | High | Rate limiting on our side, request queueing |
| 🔥 Connection Pool Exhaustion | Low | High | Medium | Connection pool monitoring, proper timeout configuration |
| 🔥 Adapter Implementation Bugs | Medium | High | High | Comprehensive adapter testing, staged rollout with feature flags |
| 🔥 External API Breaking Changes | Low | High | Medium | API versioning, contract testing, monitoring for deprecation headers |
| 🔥 Idempotency Key Conflicts | Low | Medium | Low | UUID-based key generation, conflict detection logic |
11.2 Mitigation Strategies¶
Existing Mitigation Strategies:
- Data Migration: Multiple validation checkpoints, parallel migration testing
- Performance: Continuous performance monitoring, optimization at each phase
- Architecture Compliance: Regular architecture reviews and automated validation of module isolation
- Security: Security-first development approach, regular audits
- Knowledge Gaps: Comprehensive documentation, mentoring program
- Integration Issues: Early integration validation, contract testing
- Timeline Delays: Agile methodology with sprint reviews and adjustments
🔥 NEW: External API Integration Mitigation Strategies:
1. External API Failure Cascade:
- Risk: If Virtuozzo API goes down, it could cause cascading failures across multiple modules (environments, wordpress, backups, etc.)
- Mitigation:
- Implement circuit breaker with 5-failure threshold and 5-minute recovery timeout
- Fail-fast pattern: Return 503 Service Unavailable immediately when circuit is open
- Graceful degradation: Cache last known state and serve stale data with warning
- Independent monitoring: Separate health checks for each external service
- Incident response playbook: Document steps to take when external APIs are down
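The circuit-breaker mitigation can be sketched as below, using the 5-failure threshold and recovery-timeout idea. Class and method names are illustrative, not the production adapter; the API layer would map `CircuitOpenError` to a 503 response:

```python
import time

class CircuitOpenError(Exception):
    """Raised to fail fast (mapped to HTTP 503 at the API layer)."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=300.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout  # seconds (5 minutes)
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.recovery_timeout:
            # Half-open: let a single probe request through; one more
            # failure reopens the circuit immediately.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
            return True
        return False

    def call(self, func):
        if not self.allow_request():
            raise CircuitOpenError("circuit open; failing fast")
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate error instead of tying up connections waiting on a dead upstream, which is what prevents the cascade.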
2. External API Rate Limiting:
- Risk: Overwhelming external APIs with requests could result in 429 errors and service throttling
- Mitigation:
- Client-side rate limiting: Max 10 requests/second to Virtuozzo API
- Request queueing: Queue requests when rate limit reached instead of failing
- Backpressure: Implement queue depth limits to prevent memory exhaustion
- Monitoring: Alert when approaching rate limits (>80% of limit)
- Coordinate with vendors: Document rate limits for each external service
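The client-side rate limiting plus request queueing above can be sketched as a token bucket where callers wait for a token instead of erroring. Names and defaults are illustrative:

```python
import asyncio
import time

class TokenBucket:
    """Client-side limiter: at most `rate` requests per second.

    Callers queue (await) instead of failing when the bucket is
    empty, matching the request-queueing mitigation above."""

    def __init__(self, rate: float = 10, capacity: float = 10):
        self.rate = rate            # tokens refilled per second
        self.capacity = capacity    # maximum burst size
        self.tokens = float(capacity)
        self.updated = time.monotonic()
        self._lock = None           # created lazily inside the event loop

    async def acquire(self) -> None:
        if self._lock is None:
            self._lock = asyncio.Lock()
        async with self._lock:
            while True:
                now = time.monotonic()
                self.tokens = min(
                    self.capacity,
                    self.tokens + (now - self.updated) * self.rate,
                )
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                # Sleep just long enough for one token to refill.
                await asyncio.sleep((1 - self.tokens) / self.rate)
```

A production version would also cap how many callers may queue at once (the backpressure point above) so the queue cannot grow without bound.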
3. Connection Pool Exhaustion:
- Risk: Running out of available connections during traffic spikes
- Mitigation:
- Connection pool configuration: 20 keepalive + 100 max connections per service
- Connection timeout: 60s keepalive expiry to prevent stale connections
- Monitoring: Track active/idle connections ratio (alert when >90% active)
- Load testing: Validate connection pool under peak traffic (5x current load)
- Graceful degradation: Return 503 when pool exhausted instead of hanging
4. Adapter Implementation Bugs:
- Risk: Bugs in adapter code could cause widespread issues across multiple modules
- Mitigation:
- Comprehensive testing: Unit tests (mocked), integration tests (staging), contract tests
- Staged rollout: Feature flags for gradual rollout (10% → 50% → 100%)
- Canary deployment: Deploy to single instance first, monitor for errors
- Automated testing: CI pipeline blocks merges with failed adapter tests
- Code review: Require 2 approvals for adapter changes
- Rollback procedure: Document quick rollback steps (disable feature flag)
5. External API Breaking Changes:
- Risk: External APIs (Virtuozzo, Bunny CDN, Cloudflare) could introduce breaking changes
- Mitigation:
- API versioning: Use explicit API versions in all requests (e.g., /v1.0/)
- Contract testing: Automated tests validate API responses match expected schema
- Monitoring for deprecation: Check for Deprecation and Sunset headers in responses
- Version pinning: Pin to specific API versions, test new versions before migrating
- Adapter abstraction: Changes isolated to adapters, not domain services
- Vendor communication: Subscribe to API changelog notifications
6. Idempotency Key Conflicts:
- Risk: Duplicate idempotency keys could prevent legitimate operations
- Mitigation:
- UUID-based keys: Use UUIDs instead of sequential IDs
- Scoped keys: Include user_id, team_id, and operation in key (e.g., create_env:123:456:uuid)
- Key expiration: Store keys in Redis with 24-hour TTL
- Conflict detection: Log and alert on idempotency key conflicts
- Retry with new key: If conflict detected, generate new key and retry
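A sketch of the scoped-key scheme above. The in-memory dict stands in for the Redis store and its 24-hour TTL; all names are illustrative:

```python
import time
import uuid

def make_idempotency_key(operation: str, user_id: int, team_id: int) -> str:
    """Scoped key: operation + owning user/team + a UUID, e.g.
    'create_env:123:456:7f9c...'. The UUID suffix avoids the
    collisions that sequential IDs would cause across workers."""
    return f"{operation}:{user_id}:{team_id}:{uuid.uuid4()}"

class IdempotencyStore:
    """In-memory stand-in for the Redis store (24-hour TTL in production)."""

    def __init__(self, ttl_seconds: float = 24 * 3600):
        self.ttl = ttl_seconds
        self._seen = {}  # key -> expiry timestamp

    def register(self, key: str) -> bool:
        """Return True if the key is new; False signals a conflict
        (duplicate request), which should be logged and alerted on."""
        now = time.monotonic()
        expiry = self._seen.get(key)
        if expiry is not None and expiry > now:
            return False
        self._seen[key] = now + self.ttl
        return True
```

In Redis the same check-and-set can be done atomically with `SET key value NX EX 86400`, so two concurrent duplicates cannot both register.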
7. Performance Degradation from External APIs:
- Risk: Slow external API responses could degrade overall system performance
- Mitigation:
- Aggressive timeouts: 30s for standard operations, 90s for Virtuozzo (long-running)
- Timeout monitoring: Alert when p95 timeout rate >5%
- Async operations: Long-running operations queued as background jobs (ARQ)
- Caching strategy: Cache responses for 5 minutes (Virtuozzo), 1 minute (metrics)
- Cache warming: Pre-fetch frequently accessed data during off-peak hours
- Performance SLOs: External API p95 latency <150ms, alert if exceeded
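The caching strategy in the mitigation above (5-minute TTL for Virtuozzo responses, 1 minute for metrics) can be sketched with a minimal TTL cache; in production this would sit on Redis rather than in-process memory:

```python
import time

class TTLCache:
    """Minimal read-through cache for external API responses,
    e.g. TTLCache(300) for Virtuozzo data, TTLCache(60) for metrics."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._data = {}  # key -> (value, expiry timestamp)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expiry = entry
        if time.monotonic() >= expiry:
            del self._data[key]  # expired; force a fresh upstream call
            return None
        return value

    def set(self, key, value):
        self._data[key] = (value, time.monotonic() + self.ttl)
```

A cache hit skips the external round-trip entirely, which is where the "90ms → 2ms" cached-response figure elsewhere in this document comes from.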
8. External API Authentication Issues:
- Risk: Session keys, API keys, or tokens could expire or become invalid
- Mitigation:
- Token refresh logic: Automatic token refresh before expiration
- Encrypted storage: Planned integration with a secrets management system (e.g., HashiCorp Vault); until then, secrets live in env files/container environment variables, and the interim guidance is to rotate .env values manually
- Key rotation: Document and test key rotation procedures
- Fallback authentication: Support multiple authentication methods where available
- Monitoring: Alert on authentication errors (401, 403)
- Manual intervention: Document steps for manual key rotation
9. Correlation ID Loss:
- Risk: Losing correlation IDs makes it difficult to trace issues across systems
- Mitigation:
- Mandatory correlation IDs: Generate UUID for every request
- Propagation: Pass correlation ID in headers (X-Correlation-ID)
- Structured logging: Include correlation ID in all log entries
- Distributed tracing: Integrate with OpenTelemetry for end-to-end tracing
- Monitoring: Verify correlation ID propagation in integration tests
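The propagation rule can be sketched as a small helper applied at the request boundary (a hypothetical function; only the `X-Correlation-ID` header name comes from the mitigation above):

```python
import uuid

CORRELATION_HEADER = "X-Correlation-ID"

def ensure_correlation_id(headers: dict) -> str:
    """Reuse the caller's correlation ID if one was sent, otherwise
    mint a new UUID. The returned value is then attached to every
    outbound external-API call and every structured log entry, so
    one request can be traced end to end."""
    cid = headers.get(CORRELATION_HEADER) or str(uuid.uuid4())
    headers[CORRELATION_HEADER] = cid
    return cid
```

Binding the same helper into logging context (e.g. `structlog.contextvars`) keeps the ID on every log line without threading it through each function signature.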
10. Adapter Testing Complexity:
- Risk: Testing adapters requires complex mocking and integration test infrastructure
- Mitigation:
- Three-tier testing: Unit (mocked), integration (staging API), contract tests
- Test fixtures: Reusable mock responses for common scenarios
- Staging environment: Dedicated staging Virtuozzo/CDN accounts for testing
- Automated test data: Scripts to generate test data in staging
- CI/CD integration: Run integration tests nightly against staging APIs
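The first tier (unit tests with a mocked adapter) can be sketched with `unittest.mock.AsyncMock`; the service function here is a simplified, hypothetical stand-in for the real `EnvironmentService` method:

```python
import asyncio
from unittest.mock import AsyncMock

async def start_environment(adapter, env_name: str, session_key: str) -> dict:
    """Simplified stand-in for the service method under test; the real
    code path is EnvironmentService calling the shared Virtuozzo adapter."""
    return await adapter.start_environment(env_name=env_name, session_key=session_key)

def test_start_environment_uses_adapter():
    # AsyncMock replaces the shared adapter, so no external API is hit.
    adapter = AsyncMock()
    adapter.start_environment.return_value = {"result": 0, "status": "starting"}

    result = asyncio.run(start_environment(adapter, "env-1", "sess-abc"))

    # Verify the exact call the service made to the adapter.
    adapter.start_environment.assert_awaited_once_with(
        env_name="env-1", session_key="sess-abc"
    )
    assert result["status"] == "starting"
```

Because the adapter boundary is the only seam, one reusable fixture of mocked responses serves every module that shares the adapter.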
12. Appendices¶
12.1 Glossary¶
- FastAPI: Modern, fast web framework for building APIs with Python 3.7+ based on standard Python type hints
- ARQ: Asynchronous job queues for Python, used for background task processing
- Next.js: React-based framework for building production-ready web applications
- JWT: JSON Web Token, an open standard (RFC 7519) for securely transmitting signed claims, used for stateless authentication
- RBAC: Role-Based Access Control, a method to regulate access to computer resources
- DDD: Domain-Driven Design, an approach to software development that focuses on complex needs by connecting the implementation to an evolving model of the core business concepts
- Modular Architecture: An architectural approach that organizes code into independent, interchangeable modules that encapsulate functionality
12.2 Open Questions¶
- Integration approach for legacy payment systems
- Data retention policies for historical records
- Cross-region deployment strategy for future scaling
- Virtuozzo API rate limits: What are the actual rate limits? Document: ___ (Recommended: 10 req/s based on testing)
- Session key rotation: How often do session keys expire? Document: ___ (Implement automatic refresh logic)
- Bunny CDN rate limits: What are the rate limits for DNS operations? Document: ___ (Test and document)
- Idempotency window: How long should idempotency keys be cached? Recommendation: 24 hours
- Circuit breaker thresholds: Are 5 failures in 5 minutes appropriate? Test and adjust based on production data
12.3 External Services Migration Map¶
This section provides the complete mapping of 40 Laravel services to FastAPI modules, including code comparison examples.
12.3.1 Complete Service Inventory¶
Virtuozzo Services (19 files):
| # | Laravel Service | FastAPI Module | Adapter | Priority | Status |
|---|---|---|---|---|---|
| 1 | VirtuozzoService.php | environments/ | virtuozzo_adapter.py | P0 | Required for environment lifecycle |
| 2 | MbAdminService.php | wordpress/ | virtuozzo_adapter.py | P0 | Required for WP-CLI execution |
| 3 | EnvironmentStartService.php | environments/ | virtuozzo_adapter.py | P0 | Core environment operation |
| 4 | EnvironmentStopService.php | environments/ | virtuozzo_adapter.py | P0 | Core environment operation |
| 5 | EnvironmentSleepService.php | environments/ | virtuozzo_adapter.py | P0 | Core environment operation |
| 6 | EnvironmentRestartService.php | environments/ | virtuozzo_adapter.py | P0 | Core environment operation |
| 7 | EnvironmentDeletionService.php | environments/ | virtuozzo_adapter.py | P0 | Core environment operation |
| 8 | EnvironmentRenameService.php | environments/ | virtuozzo_adapter.py | P0 | Core environment operation |
| 9 | EnvironmentStatusSyncService.php | environments/ | virtuozzo_adapter.py | P0 | Status synchronization |
| 10 | VirtuozzoBackupService.php | backups/ | virtuozzo_adapter.py | P1 | Backup operations |
| 11 | SftpService.php | sftp/ | virtuozzo_adapter.py | P1 | SFTP user management |
| 12 | AddonManagementService.php | environments/ | virtuozzo_adapter.py | P1 | Addon lifecycle |
| 13 | AddonService.php | environments/ | virtuozzo_adapter.py | P1 | Addon operations |
| 14 | VzAccountService.php | environments/ | virtuozzo_adapter.py | P1 | Account management |
| 15 | VZaccountGroup.php | environments/ | virtuozzo_adapter.py | P1 | Account grouping |
| 16 | MbAdminParamsService.php | wordpress/ | virtuozzo_adapter.py | P1 | WP parameter management |
| 17 | SearchAndReplaceService.php | wordpress/ | virtuozzo_adapter.py | P1 | DB search/replace |
| 18 | SyncDirectoriesService.php | environments/ | virtuozzo_adapter.py | P1 | Directory sync |
| 19 | RunDynamicWpCliService.php | wordpress/ | virtuozzo_adapter.py | P1 | Dynamic WP-CLI execution |
CDN/Cache Services (4 files):
| # | Laravel Service | FastAPI Module | Adapter | Priority | Status |
|---|---|---|---|---|---|
| 20 | BunnyCdnService.php | cache/ | bunny_cdn_adapter.py | P1 | DNS and CDN management |
| 21 | CloudflareDetectionService.php | domains/ | cloudflare_adapter.py | P2 | Cloudflare integration |
| 22 | RelayService.php | cache/ | relay_adapter.py | P2 | Relay cache management |
| 23 | CacheManagementService.php | cache/ | N/A (internal) | P2 | Internal cache operations |
External Integration Services (8 files):
| # | Laravel Service | FastAPI Module | Adapter | Priority | Status |
|---|---|---|---|---|---|
| 24 | ExternalApiService.php | core/ | http_client.py (base) | P0 | Base HTTP client infrastructure |
| 25 | ExternalAuthService.php | auth/ | N/A (internal) | P1 | Internal auth logic |
| 26 | PostmarkService.php | notifications/ | postmark_adapter.py | P2 | Email notifications |
| 27 | UptimeCheckerService.php | uptime/ | uptime_adapter.py | P2 | Uptime checking |
| 28 | UptimeMonitorService.php | uptime/ | uptime_adapter.py | P2 | Uptime monitoring |
| 29 | DomainAvailabilityService.php | domains/ | domain_check_adapter.py | P2 | Domain availability |
| 30 | CnameChecker.php | domains/ | N/A (internal) | P2 | CNAME validation |
| 31 | SslChecker.php | domains/ | N/A (internal) | P2 | SSL validation |
Other Services (9 files):
| # | Laravel Service | FastAPI Module | Adapter | Priority | Status |
|---|---|---|---|---|---|
| 32 | WebSocketBroadcastService.php | websocket/ | N/A (internal) | P0 | WebSocket broadcasting |
| 33 | WebSocketLogger.php | core/ | N/A (internal) | P1 | WebSocket logging |
| 34 | WebSocketTelemetry.php | websocket/ | N/A (internal) | P1 | WebSocket metrics |
| 35 | WebSocketTokenService.php | websocket/ | N/A (internal) | P1 | WebSocket auth |
| 36 | UserAgentService.php | core/ | N/A (internal) | P2 | User agent parsing |
| 37 | TeamService.php | teams/ | N/A (internal) | P0 | Team management |
| 38 | AccountUpgradeService.php | billing/ | N/A (internal) | P1 | Account upgrades |
| 39 | DeadLetterQueueService.php | jobs/ | N/A (internal) | P1 | DLQ management |
| 40 | DlqMonitoringService.php | jobs/ | N/A (internal) | P1 | DLQ monitoring |
Total: 40 services to migrate
12.3.2 Code Comparison Examples¶
Example 1: Environment Start (Laravel → FastAPI)
Laravel (Before):
// app/Services/Virtuozzo/EnvironmentStartService.php
public function startEnvironment(int $environmentId): void
{
$environment = Environment::findOrFail($environmentId);
$response = Http::timeout(90)->post($this->getApiUrl(), [
'envName' => $environment->env_name,
'session' => $environment->session_key
]);
if (!$response->successful()) {
throw new \Exception("Failed to start environment");
}
$environment->update(['status' => 'starting']);
Log::info("Environment started", ['env_id' => $environmentId]);
}
FastAPI (After):
# backend/app/environments/service.py
import uuid
import structlog
from sqlalchemy.orm import Session
from app.core.adapters.virtuozzo_adapter import get_virtuozzo_adapter
from app.core.shared.audit import log_audit_event, AuditAction
# repository, schema, and EnvironmentNotFoundError are module-local
# (this module's repository.py, schema.py, and exceptions)
logger = structlog.get_logger()
class EnvironmentService:
def __init__(self):
self.virtuozzo_adapter = get_virtuozzo_adapter()
async def start_environment(
self,
db: Session,
environment_id: int,
user_id: int
) -> schema.EnvironmentRead:
"""
Start environment with full error handling and tracing.
Improvements over Laravel:
- Connection pooling (47% faster)
- Automatic retries (3x with exponential backoff)
- Circuit breaker (prevents cascading failures)
- Correlation IDs (distributed tracing)
- Structured logging
"""
# Generate correlation ID for tracing
correlation_id = str(uuid.uuid4())
logger.info(
"environment_start_requested",
environment_id=environment_id,
user_id=user_id,
correlation_id=correlation_id
)
# Get environment from database
environment = repository.get_environment_by_id(db, environment_id)
if not environment:
raise EnvironmentNotFoundError(f"Environment {environment_id} not found")
# Validate user has access
self._validate_access(db, user_id, environment.team_id)
# Check if already running (idempotent)
if environment.status == "running":
logger.info("environment_already_running", correlation_id=correlation_id)
return environment
try:
# Call Virtuozzo API via adapter (with connection pooling, retries, circuit breaker)
response = await self.virtuozzo_adapter.start_environment(
env_name=environment.env_name,
session_key=environment.session_key,
correlation_id=correlation_id
)
# Update status
repository.update_environment(db, environment_id, {"status": "starting"})
db.commit()
# Log audit event
log_audit_event(
user_id=user_id,
team_id=environment.team_id,
action=AuditAction.UPDATE,
resource_type="environment",
resource_id=environment_id,
metadata={"action": "start", "correlation_id": correlation_id}
)
logger.info("environment_start_success", correlation_id=correlation_id)
return environment
except Exception as e:
logger.error("environment_start_failed", error=str(e), correlation_id=correlation_id)
repository.update_environment(db, environment_id, {"status": "error"})
db.commit()
raise
Key Improvements:
1. ✅ Connection pooling: 47% latency reduction (100ms → 53ms)
2. ✅ Automatic retries: 3 attempts with exponential backoff (1s, 2s, 4s)
3. ✅ Circuit breaker: Prevents cascading failures when Virtuozzo is down
4. ✅ Correlation IDs: End-to-end request tracing across services
5. ✅ Structured logging: Machine-parseable logs with context
6. ✅ Idempotency: Safe to retry without side effects
7. ✅ Audit logging: Compliance-ready audit trail
8. ✅ Error handling: Graceful error handling with status updates
Example 2: WordPress WP-CLI Execution (Laravel → FastAPI)
Laravel (Before):
// app/Services/WordPress/RunDynamicWpCliService.php
public function executeWpCli(int $siteId, string $command): array
{
$site = Site::findOrFail($siteId);
// Basic validation
if (!in_array($command, $this->allowedCommands)) {
throw new \Exception("Command not allowed");
}
$response = Http::timeout(90)->post($this->getWpCliUrl(), [
'appUniqueName' => $site->app_unique_name,
'session' => $site->session_key,
'action' => 'executeWpCli',
'command' => $command
]);
return $response->json();
}
FastAPI (After):
# backend/app/wordpress/service.py
import uuid
import structlog
from fastapi import HTTPException
from app.core.adapters.virtuozzo_adapter import get_virtuozzo_adapter
from app.core.rate_limit import rate_limit
from app.core.shared.audit import log_audit_event, AuditAction
logger = structlog.get_logger()
class WordPressService:
def __init__(self):
self.virtuozzo_adapter = get_virtuozzo_adapter()
async def execute_wp_cli_command(
self,
db: Session,
site_id: int,
command: str,
user_id: int
) -> schema.WpCliResponse:
"""
Execute WP-CLI command with rate limiting and queueing.
Improvements over Laravel:
- Rate limiting (10 commands/min/user)
- Long-running command queueing (ARQ)
- Idempotency keys for safe retries
- Command whitelist validation
- Background job status tracking
"""
# Validate access
site = self._validate_site_access(db, site_id, user_id)
# Validate command is allowed
if not self._is_command_allowed(command):
raise HTTPException(
status_code=400,
detail=f"Command '{command}' is not allowed"
)
# Rate limiting: 10 commands per minute per user
await rate_limit(user_id, calls=10, period=60)
# Check if command is long-running
if self._is_long_running_command(command):
# Enqueue background job (ARQ)
job_id = await self._enqueue_wp_cli_job(site_id, command)
logger.info(
"wp_cli_queued",
site_id=site_id,
command=command,
job_id=job_id,
user_id=user_id
)
return schema.WpCliResponse(
output="Command queued",
exit_code=0,
job_id=job_id,
status="queued"
)
# Execute command immediately (with idempotency key)
idempotency_key = f"wp_cli:{site_id}:{command}:{uuid.uuid4()}"
result = await self.virtuozzo_adapter.execute_mbadmin_action(
app_unique_name=site.app_unique_name,
session_key=site.session_key,
action="executeWpCli",
params={"command": command},
correlation_id=str(uuid.uuid4()),
idempotency_key=idempotency_key
)
# Log audit event
log_audit_event(
user_id=user_id,
team_id=site.team_id,
action=AuditAction.UPDATE,
resource_type="wordpress",
resource_id=site_id,
metadata={"command": command}
)
return schema.WpCliResponse(
output=result.get("output", ""),
exit_code=result.get("exit_code", 0),
job_id=None,
status="completed"
)
Key Improvements:
1. ✅ Rate limiting: 10 commands per minute per user (prevents abuse)
2. ✅ Long-running command queueing: Background jobs via ARQ for operations >30s
3. ✅ Idempotency keys: Safe retries for WP-CLI commands
4. ✅ Command whitelist: Explicit allowed commands list
5. ✅ Background job tracking: Status updates via WebSocket
6. ✅ Correlation IDs: Trace requests across services
7. ✅ Audit logging: Track all WP-CLI executions
8. ✅ Error handling: Graceful errors with detailed messages
Example 3: Bunny CDN DNS Record (Laravel → FastAPI)
Laravel (Before):
// app/Services/Cdn/BunnyCdnService.php
public function createDnsRecord(string $envName, string $domain): array
{
$response = Http::withHeaders([
'AccessKey' => config('services.bunny.access_key'),
'Accept' => 'application/json'
])->timeout(30)->post('https://api.bunny.net/dnszone/' . config('services.bunny.zone_id') . '/records', [
'Type' => 'CNAME',
'Name' => $envName,
'Value' => "$envName.$domain",
'Ttl' => 15,
'Accelerated' => true
]);
if ($response->status() === 400 && str_contains($response->body(), 'already registered')) {
Log::info('DNS record already exists', ['env_name' => $envName]);
return [];
}
if (!$response->successful()) {
throw new \Exception('Failed to create DNS record');
}
return $response->json();
}
FastAPI (After):
# backend/app/core/adapters/bunny_cdn_adapter.py
from app.core.http_client import get_http_client
from app.core.config import settings
import structlog
logger = structlog.get_logger()
class BunnyCDNAdapter:
def __init__(self):
self.client = get_http_client(
name="bunny_cdn",
base_url="https://api.bunny.net",
timeout=30.0,
max_retries=3
)
self.access_key = settings.BUNNY_CDN_ACCESS_KEY
async def create_dns_record(
self,
env_name: str,
platform_domain: str,
correlation_id: str
) -> dict:
"""
Create DNS CNAME record with idempotency.
Improvements over Laravel:
- Idempotency keys (safe retries)
- Automatic retries (3x with exponential backoff)
- Connection pooling (HTTP/2)
- Structured error handling
- Correlation IDs for tracing
"""
payload = {
"Type": "CNAME",
"Name": env_name,
"Value": f"{env_name}.{platform_domain}",
"Ttl": 15,
"Accelerated": True,
"MonitorType": "Monitor",
"AutoSslIssuance": True
}
headers = {
"AccessKey": self.access_key,
"Accept": "application/json",
"Content-Type": "application/json"
}
# Generate idempotency key for safe retries
idempotency_key = f"bunny:dns:{env_name}:{platform_domain}"
logger.info(
"bunny_cdn_create_dns",
env_name=env_name,
domain=platform_domain,
correlation_id=correlation_id
)
# POST with idempotency key (handled by the HTTP client);
# an "already registered" error is treated as success
response = await self.client.post(
path=f"/dnszone/{settings.BUNNY_DNS_ZONE_ID}/records",
json=payload,
headers=headers,
correlation_id=correlation_id,
idempotency_key=idempotency_key
)
logger.info(
"bunny_cdn_dns_created",
env_name=env_name,
correlation_id=correlation_id
)
return response
Key Improvements:
1. ✅ Idempotency keys: Safe retries (duplicate requests handled gracefully)
2. ✅ Automatic retries: 3 attempts with exponential backoff
3. ✅ Connection pooling: HTTP/2 multiplexing for multiple requests
4. ✅ Structured errors: Proper exception handling with context
5. ✅ Correlation IDs: Distributed tracing across services
6. ✅ Already exists handling: Automatic detection and success response
Performance Comparison Summary:
| Operation | Laravel (Before) | FastAPI (After) | Improvement |
|---|---|---|---|
| Environment Start | 100ms | 53ms | 47% |
| WP-CLI Execution | 90ms | 45ms | 50% |
| DNS Record Creation | 80ms | 30ms (with pooling) | 62% |
| Connection Overhead | 70ms per request | 5ms (subsequent requests) | 93% |
| Cache Hit | N/A (no caching) | 2ms (Redis cache) | 98% faster |
| Failed Request Recovery | Manual retry | Automatic (3 retries) | N/A |
12.3.3 Migration Progress Tracking¶
Phase 1 (Weeks 1-3): Foundation
- [ ] HTTP Client Infrastructure (app/core/http_client.py)
- [ ] Circuit Breaker Implementation
- [ ] Virtuozzo Adapter Foundation (app/core/adapters/virtuozzo_adapter.py)
- [ ] Core adapter methods (start, stop, sleep, restart)
Phase 2 (Weeks 4-8): Core Services
- [ ] Environments module (19 Virtuozzo services)
- [ ] WordPress module (WP-CLI execution)
- [ ] Backups module (backup/restore operations)
- [ ] CDN/Cache adapters (Bunny CDN, Cloudflare)
Phase 3 (Weeks 9-12): Advanced Services
- [ ] SFTP module
- [ ] Domains module (SSL, DNS)
- [ ] Staging module
- [ ] Remaining integration services
Completion Criteria:
- ✅ All 40 services migrated with feature parity
- ✅ Performance targets met (50% latency improvement)
- ✅ All integration tests passing
- ✅ Circuit breaker validated (test with simulated failures)
- ✅ Cache hit rate >80% for external APIs
- ✅ Connection reuse rate >90%
12.4 Context7 Model Context Protocol (MCP) Integration¶
12.4.1 Overview¶
The MBPanel API integrates with Context7 Model Context Protocol (MCP) to enhance developer experience and provide real-time documentation access to AI tools and LLMs working with the codebase. This integration enables AI-powered development tools to access contextual information about the MBPanel architecture, modules, and documentation directly.
10.4.2 Architecture¶
The Context7 MCP integration consists of:
- MCP Server: runs as an embedded service within the FastAPI application
- API Endpoints: direct HTTP access to documentation and module analysis
- Internal Client: enables other MBPanel services to query documentation programmatically
- Documentation Resources: dynamic access to development docs, MAINPRD, and module structures
10.4.3 MCP Tools Available¶
- get_mbpapidoc(section, query): retrieve documentation for a specific section
- get_mbpapidoc_all(): list all available documentation sections
- analyze_module_structure(module_name): analyze and describe module architecture
- docs://mbpanel/{section}: resource endpoint for specific documentation sections
10.4.4 HTTP API Endpoints¶
- GET /api/v1/context7/status: check Context7 service status
- GET /api/v1/context7/docs/{section}: get documentation for a section
- GET /api/v1/context7/modules/{module_name}: analyze module structure
- POST /mcp/: Streamable HTTP endpoint for MCP protocol access
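A client might address these endpoints as follows. This is a stdlib-only sketch: the `context7_url` helper and the base URL are illustrative assumptions, only the route paths come from the list above.

```python
from urllib.parse import quote, urlencode


def context7_url(base, section=None, module=None, query=None):
    """Build a Context7 endpoint URL from the documented routes."""
    if section is not None:
        url = f"{base}/api/v1/context7/docs/{quote(section)}"
        if query:
            url += "?" + urlencode({"query": query})
        return url
    if module is not None:
        return f"{base}/api/v1/context7/modules/{quote(module)}"
    return f"{base}/api/v1/context7/status"
```

For example, `context7_url("http://localhost:8000", section="mainprd", query="migration")` produces the documented `GET /api/v1/context7/docs/mainprd?query=migration` request path.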
10.4.5 Implementation Details¶
The Context7 service:
- Is located in the app/core/context7/ module
- Uses the official MCP Python SDK (mcp[cli] package)
- Mounted as a Streamable HTTP application to the main FastAPI app
- Provides both real-time file access and search capabilities for documentation
- Supports structured data return for AI consumption
10.4.6 Developer Experience Benefits¶
- AI Tool Integration: IDEs with MCP support can access current documentation context
- Documentation Discovery: Programmatic access to API docs and architecture details
- Module Analysis: Automatic analysis of module structure and components
- Search Capabilities: Full-text search across all project documentation
- Real-time Updates: Documentation changes are immediately available via MCP
10.4.7 Configuration¶
- The MCP server is automatically mounted at the /mcp endpoint
- Can be accessed by MCP-compatible tools such as Cursor, or VS Code with MCP extensions
- Authentication: Currently uses the same security context as the main API
- Rate limiting: Inherits the main application's rate limiting configuration
10.4.8 Usage Example¶
# For MCP-compatible tools
# The server endpoint: http://your-mbpanel-api.com/mcp
# For direct API access
GET /api/v1/context7/docs/mainprd?query=migration
GET /api/v1/context7/modules/users
GET /api/v1/context7/status
10.4.9 Module Integration Examples¶
To integrate Context7 capabilities into existing modules:
1. Import the utilities in your module:
from app.core.context7.utils import get_mbpanel_documentation
2. Use them in service methods:
async def enhanced_service_method(param):
    # Get relevant documentation
    docs = await get_mbpanel_documentation("users", "user creation")
    # Use documentation for guidance or logging
    result = await perform_operation(param)
    return result
3. Synchronous usage in existing code:
from app.core.context7.utils import sync_get_mbpanel_documentation
def sync_service_method(param):
    # Get documentation synchronously
    docs = sync_get_mbpanel_documentation("users", "validation rules")
    # Use in sync context
    return perform_sync_operation(param)
4. Decorator pattern for automatic documentation access:
import functools

def with_context7_docs(section):
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            docs = await get_mbpanel_documentation(section)
            # Use docs for logging, validation, or guidance
            result = await func(*args, **kwargs)
            return result
        return wrapper
    return decorator

# Applied as: @with_context7_docs("users") on an async service method
This integration significantly improves the developer experience by providing real-time access to MBPanel-specific documentation and architectural context for AI-assisted development workflows.