Authentication System - Operational Metrics
Key performance indicators, monitoring, and alerting for the authentication system.
Key Metrics Overview
Business Metrics
| Metric | Description | Target | Current |
| --- | --- | --- | --- |
| Login Success Rate | % of successful logins | > 99% | 98.5% |
| Registration Rate | New users per day | Track | 15/day |
| Active Users (7d) | Unique users in last 7 days | Track | 450 |
| Team Growth | New teams per week | Track | 3/week |
Security Metrics
| Metric | Description | Target | Current |
| --- | --- | --- | --- |
| Failed Login Rate | % of failed logins | < 1% | 0.8% |
| Suspicious Login Rate | % flagged as suspicious | < 5% | 3.2% |
| Device OTP Rate | % of logins requiring OTP | Track | 8% |
| Concurrent Login Denials | % denied vs. approved | Track | 12% |
Technical Metrics
| Metric | Description | Target | Current |
| --- | --- | --- | --- |
| Login Latency (p95) | Time to complete login | < 500ms | 320ms |
| Token Refresh Latency (p95) | Time to refresh token | < 200ms | 145ms |
| Session Expiry Rate | % expiring naturally | > 80% | 85% |
| Redis Hit Rate | Cache effectiveness | > 95% | 97% |
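The Target and Current columns above can be compared mechanically. The following is a minimal sketch, not part of the existing tooling: the `Metric` class, `meets_target`, and `off_target` are hypothetical names, and the values are copied by hand from the tables with units dropped.

```python
import operator
from dataclasses import dataclass
from typing import List, Optional

# Comparison symbols used in the Target column.
_OPS = {">": operator.gt, "<": operator.lt}


@dataclass(frozen=True)
class Metric:
    name: str
    target: str     # e.g. "> 99" or "< 500"; "Track" means no target is set
    current: float  # same unit as the target (%, ms)


def meets_target(metric: Metric) -> Optional[bool]:
    """Return True/False for metrics with a target, None for 'Track' metrics."""
    if metric.target == "Track":
        return None
    symbol, threshold = metric.target.split()
    return _OPS[symbol](metric.current, float(threshold))


def off_target(metrics: List[Metric]) -> List[str]:
    """Names of metrics whose current value misses the target."""
    return [m.name for m in metrics if meets_target(m) is False]


# Values copied from the tables above (illustrative snapshot only).
METRICS = [
    Metric("Login Success Rate", "> 99", 98.5),
    Metric("Registration Rate", "Track", 15),
    Metric("Failed Login Rate", "< 1", 0.8),
    Metric("Suspicious Login Rate", "< 5", 3.2),
    Metric("Login Latency (p95)", "< 500", 320),
    Metric("Token Refresh Latency (p95)", "< 200", 145),
    Metric("Session Expiry Rate", "> 80", 85),
    Metric("Redis Hit Rate", "> 95", 97),
]

print(off_target(METRICS))
```

With the values listed here, only Login Success Rate (98.5% against a target of > 99%) is reported as off target.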
Monitoring Dashboard
Dashboard Sections
1. Authentication Health
- Login Success Rate (line chart, last 24h)
- Failed Login Rate (line chart, last 24h)
- Active Sessions (gauge)
- Token Refresh Rate (line chart)
2. Security Overview
- Failed Logins by IP (bar chart, top 10)
- Suspicious Logins (line chart)
- Rate Limit Violations (line chart)
- CSRF Failures (line chart)
3. User Activity
- Registrations (line chart, daily)
- New Devices (line chart, daily)
- Password Resets (line chart, daily)
- Team Invitations (line chart, daily)
4. Technical Performance
- Endpoint Latency (heatmap)
- Redis Operations (line chart)
- Database Query Time (line chart)
- Error Rate (line chart)
Alerts
Critical Alerts (Immediate Response)
| Alert | Condition | Threshold | Escalation |
| --- | --- | --- | --- |
| Login Failure Spike | Failed logins spike | > 10% in 5 min | → On-call |
| Redis Down | Redis unavailable | Any | → On-call |
| Database Down | PostgreSQL unavailable | Any | → On-call |
| Email Service Down | Postmark failing | > 5 min | → On-call |
| High Rate Limit Hits | Abuse pattern detected | > 100/min | → Security |
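One way to read the "Login Failure Spike" threshold (> 10% in 5 min) is as a failure rate over a sliding five-minute window. The sketch below assumes that reading. `FailureSpikeDetector` is a hypothetical helper, and the `min_attempts` guard is an added assumption that stops a handful of logins from triggering the alert.

```python
from collections import deque


class FailureSpikeDetector:
    """Tracks login attempts and flags a failure rate above a threshold
    within a sliding time window (assumed reading of '> 10% in 5 min')."""

    def __init__(self, window_seconds=300, threshold=0.10, min_attempts=20):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_attempts = min_attempts  # assumption: ignore tiny samples
        self._events = deque()            # (timestamp, failed) in arrival order
        self._failures = 0

    def record(self, timestamp: float, failed: bool) -> bool:
        """Record one login attempt; return True if the alert should fire."""
        self._events.append((timestamp, failed))
        if failed:
            self._failures += 1

        # Evict attempts that fell out of the window.
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, was_failed = self._events.popleft()
            if was_failed:
                self._failures -= 1

        total = len(self._events)
        if total < self.min_attempts:
            return False
        return self._failures / total > self.threshold


detector = FailureSpikeDetector(min_attempts=10)
for second in range(9):
    detector.record(second, failed=False)    # nine successful logins
print(detector.record(9, failed=True))       # 1 of 10 failed: not above 10%
print(detector.record(10, failed=True))      # 2 of 11 failed: alert fires
```

Timestamps are passed in explicitly so the detector can be driven from log replay as well as live traffic.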
Warning Alerts (Investigate Within 1 Hour)
| Alert | Condition | Threshold | Escalation |
| --- | --- | --- | --- |
| Elevated Failed Logins | Failed login rate elevated | > 5% sustained | → Team Lead |
| Suspicious Login Increase | More suspicious than usual | > 10% of logins | → Security |
| Slow Login Performance | Login latency degraded | p95 > 1s | → Backend |
| High Session Churn | Users logging in/out frequently | > 3x normal | → Backend |
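The "Slow Login Performance" condition (p95 > 1s) depends on how the 95th percentile is computed. The sketch below uses the nearest-rank definition as an assumption. Monitoring backends may interpolate instead, so values near the threshold can differ slightly. Both function names are hypothetical.

```python
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct% of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be an integer in 1..100")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def login_latency_degraded(latencies_ms: Sequence[float],
                           threshold_ms: float = 1000) -> bool:
    """True when the p95 login latency exceeds the warning threshold."""
    return percentile_nearest_rank(latencies_ms, 95) > threshold_ms


# 10% of logins are slow, so the p95 falls on a slow sample.
print(login_latency_degraded([200] * 90 + [1500] * 10))
# Only 4% are slow, so the p95 is still a fast sample.
print(login_latency_degraded([200] * 96 + [1500] * 4))
```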
Info Alerts (Track Trends)
| Alert | Condition | Threshold | Action |
| --- | --- | --- | --- |
| New Device Spike | More new devices than usual | > 2x normal | Monitor |
| Password Reset Spike | More resets than usual | > 2x normal | Monitor |
| Registration Drop | Fewer registrations than usual | < 50% normal | Investigate |
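The thresholds above are relative to "normal", which this document does not define. As a sketch, the example below assumes the baseline is the mean of recent daily counts. `classify_against_baseline` and its parameters are hypothetical.

```python
from statistics import mean
from typing import Sequence


def classify_against_baseline(current: float,
                              history: Sequence[float],
                              spike_factor: float = 2.0,
                              drop_factor: float = 0.5) -> str:
    """Compare today's count with the mean of recent daily counts.

    Returns "spike" (> spike_factor x baseline), "drop"
    (< drop_factor x baseline), or "normal".
    """
    if not history:
        raise ValueError("history must contain at least one value")
    baseline = mean(history)  # assumption: 'normal' is the trailing mean
    if baseline == 0:
        # No activity in the baseline period: any activity counts as a spike.
        return "spike" if current > 0 else "normal"
    ratio = current / baseline
    if ratio > spike_factor:
        return "spike"
    if ratio < drop_factor:
        return "drop"
    return "normal"


recent_registrations = [15, 14, 16, 15, 15]  # illustrative daily counts
print(classify_against_baseline(7, recent_registrations))   # below 50% of normal
print(classify_against_baseline(31, recent_registrations))  # above 2x normal
```

A median or a same-weekday average may suit metrics with strong weekly patterns better than a plain mean.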
Log Queries
Useful Queries for Monitoring
Failed Login Analysis
-- Top IPs with failed logins (last 24h)
SELECT
  SUBSTRING(error_detail FROM 'ip":"([^"]+)') as ip,
  COUNT(*) as failed_count
FROM auth_violation_log
WHERE violation_type = 'not_authenticated'
  AND created_at > NOW() - INTERVAL '24 hours'
GROUP BY ip
ORDER BY failed_count DESC
LIMIT 10;
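To show what the `SUBSTRING ... FROM` pattern in the query above captures, here is a small Python sketch using the same regular expression. It assumes `error_detail` holds compact JSON-like text such as `{"ip":"203.0.113.7",...}`. The sample strings and helper names are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List, Optional, Tuple

# Same pattern as the SQL query: capture everything after `ip":"`
# up to the next double quote.
IP_PATTERN = re.compile(r'ip":"([^"]+)')


def extract_ip(error_detail: str) -> Optional[str]:
    """Return the captured IP, or None when the pattern does not match."""
    match = IP_PATTERN.search(error_detail)
    return match.group(1) if match else None


def top_failed_ips(details: Iterable[str], limit: int = 10) -> List[Tuple[str, int]]:
    """Count failed logins per IP, most frequent first (unmatched rows skipped)."""
    counts = Counter(ip for ip in map(extract_ip, details) if ip is not None)
    return counts.most_common(limit)


samples = [
    '{"ip":"203.0.113.7","reason":"bad password"}',
    '{"ip":"198.51.100.4","reason":"bad password"}',
    '{"ip":"203.0.113.7","reason":"unknown user"}',
    '{"ip": "192.0.2.1"}',  # space after the colon: the pattern does not match
]
print(top_failed_ips(samples))
```

The pattern only matches when there is no whitespace between the colon and the opening quote. In the SQL query, rows that do not match are grouped under a NULL `ip` rather than skipped.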
Suspicious Login Trends
-- Suspicious logins by day (last 30 days)
SELECT
  DATE(created_at) as date,
  COUNT(*) FILTER (WHERE suspicious = true) as suspicious_count,
  COUNT(*) as total_count,
  ROUND(100.0 * COUNT(*) FILTER (WHERE suspicious = true) / COUNT(*), 2) as suspicious_pct
FROM login_activity
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY DATE(created_at)
ORDER BY date DESC;
Session Duration
-- Average and median duration of revoked sessions created in the last 7 days
-- (sessions that have not been revoked yet are excluded)
SELECT
  AVG(EXTRACT(EPOCH FROM (revoked_at - created_at))) / 60 as avg_duration_minutes,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (revoked_at - created_at))) / 60 as median_duration_minutes
FROM active_sessions
WHERE created_at > NOW() - INTERVAL '7 days'
  AND revoked_at IS NOT NULL;
SigNoz Dashboards
Dashboard Configuration
Import the following dashboard JSON into SigNoz:
{
  "name": "Authentication System",
  "panels": [
    {
      "title": "Login Success Rate",
      "query": "SELECT 100.0 * COUNT_IF(status = 200) / COUNT(*) FROM Span WHERE serviceName = 'backend' AND spanName LIKE '%login%'"
    },
    {
      "title": "Token Refresh Latency",
      "query": "SELECT percentile(duration_ms, 95) FROM Span WHERE serviceName = 'backend' AND spanName = 'AuthService.refresh_tokens'"
    },
    {
      "title": "Active Sessions",
      "query": "SELECT COUNT(*) FROM active_sessions WHERE revoked_at IS NULL"
    },
    {
      "title": "Failed Logins by IP",
      "query": "SELECT request_headers_ip, COUNT(*) FROM auth_violation_log WHERE violation_type = 'not_authenticated' GROUP BY request_headers_ip ORDER BY COUNT(*) DESC LIMIT 10"
    }
  ]
}
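Before importing, it can help to confirm that the file parses and that every panel carries a title and a query. The sketch below checks only the simplified structure shown above (`name`, `panels`, `title`, `query`). It makes no claim about the full dashboard schema SigNoz expects, and `validate_dashboard` is a hypothetical helper.

```python
import json
from typing import List

REQUIRED_PANEL_KEYS = ("title", "query")


def validate_dashboard(raw: str) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    if not isinstance(doc, dict):
        return ["top-level value must be an object"]

    problems = []
    if not doc.get("name"):
        problems.append("missing dashboard name")

    panels = doc.get("panels")
    if not isinstance(panels, list) or not panels:
        problems.append("missing or empty panels list")
        return problems

    for index, panel in enumerate(panels):
        if not isinstance(panel, dict):
            problems.append(f"panel {index}: must be an object")
            continue
        for key in REQUIRED_PANEL_KEYS:
            if not panel.get(key):
                problems.append(f"panel {index}: missing {key}")
    return problems


example = """
{
  "name": "Authentication System",
  "panels": [
    {"title": "Active Sessions",
     "query": "SELECT COUNT(*) FROM active_sessions WHERE revoked_at IS NULL"},
    {"title": "Panel without a query"}
  ]
}
"""
print(validate_dashboard(example))
```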