
Authentication System - Operational Metrics

Key performance indicators, monitoring, and alerting for the authentication system.


Key Metrics Overview

Business Metrics

| Metric | Description | Target | Current |
|---|---|---|---|
| Login Success Rate | % of successful logins | > 99% | 98.5% |
| Registration Rate | New users per day | Track | 15/day |
| Active Users (7d) | Unique users in last 7 days | Track | 450 |
| Team Growth | New teams per week | Track | 3/week |

Security Metrics

| Metric | Description | Target | Current |
|---|---|---|---|
| Failed Login Rate | % of failed logins | < 1% | 0.8% |
| Suspicious Login Rate | % flagged as suspicious | < 5% | 3.2% |
| Device OTP Rate | % of logins requiring OTP | Track | 8% |
| Concurrent Login Denials | % denied vs. approved | Track | 12% |

Technical Metrics

| Metric | Description | Target | Current |
|---|---|---|---|
| Login Latency (p95) | Time to complete login | < 500ms | 320ms |
| Token Refresh Latency (p95) | Time to refresh token | < 200ms | 145ms |
| Session Expiry Rate | % expiring naturally | > 80% | 85% |
| Redis Hit Rate | Cache effectiveness | > 95% | 97% |
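
The targets above mix rates and p95 latencies. As a minimal sketch of how such values could be derived from raw samples and compared against a target, the Python below computes a success rate and a nearest-rank p95. The function names and sample numbers are hypothetical, and the percentile method is an assumption: the monitoring backend may interpolate differently.

```python
import math


def success_rate_pct(successes, attempts):
    """Return the share of successful logins as a percentage."""
    if attempts == 0:
        raise ValueError("no login attempts recorded")
    return 100.0 * successes / attempts


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # 1-based rank of the smallest value that covers pct percent of the samples.
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]


if __name__ == "__main__":
    # Illustrative numbers only, not production data.
    rate = success_rate_pct(successes=985, attempts=1000)
    print("login success rate meets > 99% target:", rate > 99.0)

    latencies_ms = [120, 180, 210, 250, 320, 410]
    p95 = percentile_nearest_rank(latencies_ms, 95)
    print("login latency p95 meets < 500ms target:", p95 < 500)
```

Nearest-rank always returns an observed sample, which makes it easy to trace a reported p95 back to a concrete request.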

Monitoring Dashboard

Dashboard Sections

1. Authentication Health

  • Login Success Rate (line chart, last 24h)
  • Failed Login Rate (line chart, last 24h)
  • Active Sessions (gauge)
  • Token Refresh Rate (line chart)

2. Security Overview

  • Failed Logins by IP (bar chart, top 10)
  • Suspicious Logins (line chart)
  • Rate Limit Violations (line chart)
  • CSRF Failures (line chart)

3. User Activity

  • Registrations (line chart, daily)
  • New Devices (line chart, daily)
  • Password Resets (line chart, daily)
  • Team Invitations (line chart, daily)

4. Technical Performance

  • Endpoint Latency (heatmap)
  • Redis Operations (line chart)
  • Database Query Time (line chart)
  • Error Rate (line chart)

Alerts

Critical Alerts (Immediate Action Required)

| Alert | Condition | Threshold | Escalation |
|---|---|---|---|
| Login Failure Spike | Failed logins spike | > 10% in 5 min | → On-call |
| Redis Down | Redis unavailable | Any | → On-call |
| Database Down | PostgreSQL unavailable | Any | → On-call |
| Email Service Down | Postmark failing | > 5 min | → On-call |
| High Rate Limit Hits | Abuse pattern detected | > 100/min | → Security |
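
The Login Failure Spike condition can be read as "the failed-login rate over the last 5 minutes exceeds 10%". That reading is an assumption, as is everything in the sketch below: the class name, the `min_attempts` guard against tiny samples, and the expectation that attempts arrive in timestamp order. It is meant only to illustrate a sliding-window check, not to describe the production alert rule.

```python
from collections import deque
from datetime import datetime, timedelta


class LoginFailureSpikeDetector:
    """Tracks login attempts in a sliding window and flags a failure spike."""

    def __init__(self, window=timedelta(minutes=5), threshold_pct=10.0, min_attempts=20):
        self.window = window
        self.threshold_pct = threshold_pct
        self.min_attempts = min_attempts  # hypothetical guard, not from the alert table
        self._attempts = deque()  # (timestamp, succeeded) pairs, oldest first

    def record(self, timestamp, succeeded):
        """Record one attempt; return True if the spike condition currently holds."""
        self._attempts.append((timestamp, succeeded))
        # Drop attempts that have fallen out of the window.
        cutoff = timestamp - self.window
        while self._attempts and self._attempts[0][0] <= cutoff:
            self._attempts.popleft()
        total = len(self._attempts)
        if total < self.min_attempts:
            return False
        failures = sum(1 for _, ok in self._attempts if not ok)
        return 100.0 * failures / total > self.threshold_pct


if __name__ == "__main__":
    detector = LoginFailureSpikeDetector()
    start = datetime(2024, 1, 1, 12, 0, 0)
    fired = False
    for i in range(30):
        # Every fifth attempt fails in this made-up stream.
        fired = detector.record(start + timedelta(seconds=i), succeeded=(i % 5 != 0))
    print("alert firing:", fired)
```

Without a minimum sample size, a single failed login during a quiet period would read as a 100% failure rate and page the on-call engineer.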

Warning Alerts (Investigate Within 1 Hour)

| Alert | Condition | Threshold | Escalation |
|---|---|---|---|
| Elevated Failed Logins | Failed login rate elevated | > 5% sustained | → Team Lead |
| Suspicious Login Increase | More suspicious than usual | > 10% of logins | → Security |
| Slow Login Performance | Login latency degraded | p95 > 1s | → Backend |
| High Session Churn | Users logging in/out frequently | > 3x normal | → Backend |

Informational Alerts

| Alert | Condition | Threshold | Action |
|---|---|---|---|
| New Device Spike | More new devices than usual | > 2x normal | Monitor |
| Password Reset Spike | More resets than usual | > 2x normal | Monitor |
| Registration Drop | Fewer registrations than usual | < 50% normal | Investigate |
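
Several thresholds above are relative to "normal" (for example > 2x normal or < 50% normal), and the document does not define that baseline. The sketch below assumes it is the mean of recent daily counts. The function name, default factors, and sample values are hypothetical.

```python
from statistics import mean


def baseline_status(today, history, spike_factor=2.0, drop_factor=0.5):
    """Classify today's count against the mean of recent daily counts.

    Returns "spike", "drop", "normal", or "no-baseline" when history is
    empty or averages to zero.
    """
    if not history:
        return "no-baseline"
    baseline = mean(history)
    if baseline == 0:
        return "no-baseline"
    ratio = today / baseline
    if ratio > spike_factor:
        return "spike"
    if ratio < drop_factor:
        return "drop"
    return "normal"


if __name__ == "__main__":
    # Made-up daily registration counts for the previous week.
    recent_registrations = [15, 14, 16, 15, 13, 17, 15]
    print(baseline_status(6, recent_registrations))   # well under half the baseline
    print(baseline_status(15, recent_registrations))  # in line with the baseline
```

A mean is sensitive to one-off outliers; a median or a same-weekday comparison may suit metrics with strong weekly patterns better.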

Log Queries

Useful Queries for Monitoring

Failed Login Analysis

-- Top IPs with failed logins (last 24h)
SELECT
    SUBSTRING(error_detail FROM 'ip":"([^"]+)') as ip,
    COUNT(*) as failed_count
FROM auth_violation_log
WHERE violation_type = 'not_authenticated'
  AND created_at > NOW() - INTERVAL '24 hours'
GROUP BY ip
ORDER BY failed_count DESC
LIMIT 10;
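
To see what the `SUBSTRING(... FROM ...)` pattern in the query above extracts, here is a rough Python equivalent of the grouping step. It assumes `error_detail` is JSON-like text containing an `"ip":"..."` field, which is inferred from the regex rather than confirmed by a schema. The function name and sample rows are made up.

```python
import re
from collections import Counter

# Same capture as the SQL pattern: the text between ip":" and the next quote.
IP_PATTERN = re.compile(r'ip":"([^"]+)')


def top_failed_ips(error_details, limit=10):
    """Count failed logins per extracted IP and return the most frequent ones."""
    counts = Counter()
    for detail in error_details:
        match = IP_PATTERN.search(detail)
        if match:  # rows without an IP are skipped here
            counts[match.group(1)] += 1
    return counts.most_common(limit)


if __name__ == "__main__":
    sample_rows = [
        '{"ip":"203.0.113.7","reason":"bad password"}',
        '{"ip":"203.0.113.7","reason":"bad password"}',
        '{"ip":"198.51.100.4","reason":"expired token"}',
        '{"reason":"missing cookie"}',
    ]
    for ip, failed_count in top_failed_ips(sample_rows):
        print(ip, failed_count)
```

One difference from the SQL: rows where the pattern does not match are dropped here, whereas the query groups them under a NULL `ip`.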

Suspicious Logins by Day

-- Suspicious logins by day (last 30 days)
SELECT
    DATE(created_at) as date,
    COUNT(*) FILTER (WHERE suspicious = true) as suspicious_count,
    COUNT(*) as total_count,
    ROUND(100.0 * COUNT(*) FILTER (WHERE suspicious = true) / COUNT(*), 2) as suspicious_pct
FROM login_activity
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY DATE(created_at)
ORDER BY date DESC;

Session Duration

-- Average session duration (last 7 days)
SELECT
    AVG(EXTRACT(EPOCH FROM (revoked_at - created_at))) / 60 as avg_duration_minutes,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (revoked_at - created_at))) / 60 as median_duration_minutes
FROM active_sessions
WHERE created_at > NOW() - INTERVAL '7 days'
  AND revoked_at IS NOT NULL;
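
The query reports both the average and the median because a few very long sessions can pull the average far above the typical duration. The sketch below mirrors the calculation on in-memory `(created_at, revoked_at)` pairs. The function name and sample data are hypothetical. For the 0.5 fraction, `statistics.median` averages the two middle values in the same way `PERCENTILE_CONT(0.5)` interpolates between them.

```python
from datetime import datetime, timedelta
from statistics import mean, median


def session_duration_minutes(sessions):
    """Return (average, median) duration in minutes for revoked sessions.

    sessions is an iterable of (created_at, revoked_at) pairs; sessions that
    have not been revoked (revoked_at is None) are ignored, as in the query.
    """
    durations = [
        (revoked_at - created_at).total_seconds() / 60
        for created_at, revoked_at in sessions
        if revoked_at is not None
    ]
    if not durations:
        return None
    return mean(durations), median(durations)


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 9, 0, 0)
    sample = [
        (t0, t0 + timedelta(minutes=10)),
        (t0, t0 + timedelta(minutes=20)),
        (t0, t0 + timedelta(minutes=120)),  # one long session skews the mean
        (t0, None),                         # still active, excluded
    ]
    avg_minutes, median_minutes = session_duration_minutes(sample)
    print(avg_minutes, median_minutes)
```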

SigNoz Dashboards

Dashboard Configuration

Import the following dashboard JSON into SigNoz:

{
  "name": "Authentication System",
  "panels": [
    {
      "title": "Login Success Rate",
      "query": "SELECT 100.0 * COUNT_IF(status = 200) / COUNT(*) FROM Span WHERE serviceName = 'backend' AND spanName LIKE '%login%'"
    },
    {
      "title": "Token Refresh Latency",
      "query": "SELECT percentile(duration_ms, 95) FROM Span WHERE serviceName = 'backend' AND spanName = 'AuthService.refresh_tokens'"
    },
    {
      "title": "Active Sessions",
      "query": "SELECT COUNT(*) FROM active_sessions WHERE revoked_at IS NULL"
    },
    {
      "title": "Failed Logins by IP",
      "query": "SELECT request_headers_ip, COUNT(*) FROM auth_violation_log WHERE violation_type = 'not_authenticated' GROUP BY request_headers_ip ORDER BY COUNT(*) DESC LIMIT 10"
    }
  ]
}