Monitoring API

The Monitoring API provides endpoints for tracking application performance, health status, and errors. These endpoints are designed for operational observability and debugging.

Overview

The Monitoring API tracks:

  • Request performance metrics (duration, success rate, percentiles)
  • Slowest operations (top 20)
  • Recent errors with full context
  • System health (uptime, memory, CPU usage)

Key Features:

  • Automatic performance tracking for all API endpoints
  • In-memory metrics buffer (last 1000 requests)
  • Slow request detection (>200ms threshold)
  • Real-time health monitoring
  • No authentication required (public endpoints)

Base URL

https://api.boardapi.io/api/v1/monitoring

For development:

http://localhost:4000/api/v1/monitoring

Endpoints Overview

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | /monitoring/performance/stats | Get performance statistics |
| GET | /monitoring/performance/slowest | Get slowest operations |
| GET | /monitoring/performance/errors | Get recent errors |
| GET | /monitoring/health | Health check endpoint |

Get Performance Statistics

Returns aggregated performance metrics for the application.

Request

http
GET /api/v1/monitoring/performance/stats

Authentication: None (public endpoint)

Response

json
{
  "timestamp": "2025-11-19T10:30:45.123Z",
  "stats": {
    "totalRequests": 1543,
    "successRate": 98.5,
    "averageDuration": 45.3,
    "p50": 32,
    "p95": 180,
    "p99": 350,
    "slowRequests": 23,
    "errors": 12
  }
}

Response Fields

| Field | Type | Description |
|-------|------|-------------|
| timestamp | string | ISO 8601 timestamp of the response |
| stats.totalRequests | number | Total number of tracked requests in the buffer |
| stats.successRate | number | Success rate as a percentage (0-100) |
| stats.averageDuration | number | Average request duration in milliseconds |
| stats.p50 | number | 50th percentile (median) duration in ms |
| stats.p95 | number | 95th percentile duration in ms |
| stats.p99 | number | 99th percentile duration in ms |
| stats.slowRequests | number | Number of slow requests (>200ms) |
| stats.errors | number | Number of failed requests |

Example

bash
curl http://localhost:4000/api/v1/monitoring/performance/stats
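
If you consume these stats from a typed client, the response shape above maps onto a small interface. A minimal sketch (the interface and function names are illustrative, not part of the API):

typescript
// Illustrative types for the stats response documented above
interface PerformanceStats {
  totalRequests: number;
  successRate: number;      // percentage, 0-100
  averageDuration: number;  // milliseconds
  p50: number;
  p95: number;
  p99: number;
  slowRequests: number;
  errors: number;
}

interface StatsResponse {
  timestamp: string;        // ISO 8601
  stats: PerformanceStats;
}

// Fetch the current stats from the monitoring API
async function getStats(baseUrl = 'http://localhost:4000'): Promise<StatsResponse> {
  const res = await fetch(`${baseUrl}/api/v1/monitoring/performance/stats`);
  if (!res.ok) throw new Error(`Stats request failed: ${res.status}`);
  return (await res.json()) as StatsResponse;
}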

Use Cases:

  • Dashboard monitoring
  • Performance trend analysis
  • SLA compliance verification
  • Capacity planning

Get Slowest Operations

Returns the 20 slowest successful operations tracked in the buffer.

Request

http
GET /api/v1/monitoring/performance/slowest

Authentication: None (public endpoint)

Response

json
{
  "timestamp": "2025-11-19T10:30:45.123Z",
  "operations": [
    {
      "requestId": "1732012245123-abc123xyz",
      "className": "BoardsController",
      "handlerName": "create",
      "method": "POST",
      "url": "/api/v1/boards",
      "duration": 485,
      "success": true,
      "timestamp": "2025-11-19T10:25:30.000Z"
    },
    {
      "requestId": "1732012200456-def456uvw",
      "className": "WebhooksService",
      "handlerName": "deliverWebhook",
      "method": "POST",
      "url": "/api/v1/webhooks/subscriptions",
      "duration": 420,
      "success": true,
      "timestamp": "2025-11-19T10:20:15.000Z"
    }
  ]
}

Response Fields

| Field | Type | Description |
|-------|------|-------------|
| timestamp | string | ISO 8601 timestamp of the response |
| operations | array | Array of performance metrics (max 20) |
| operations[].requestId | string | Unique request identifier |
| operations[].className | string | NestJS controller/service class name |
| operations[].handlerName | string | Method/handler name |
| operations[].method | string | HTTP method (GET, POST, etc.) or "WS" for WebSocket |
| operations[].url | string | Request URL or handler name |
| operations[].duration | number | Request duration in milliseconds |
| operations[].success | boolean | Whether the request succeeded |
| operations[].timestamp | string | ISO 8601 timestamp of the request |

Example

bash
curl http://localhost:4000/api/v1/monitoring/performance/slowest
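
To turn the raw list into actionable hotspots, the returned operations can be grouped by class and handler. A rough sketch (field names match the response above; the helper itself is not part of the API):

typescript
// Group the slowest operations by "ClassName.handlerName" to spot recurring hotspots
interface SlowOperation {
  className: string;
  handlerName: string;
  duration: number;  // milliseconds
  // other fields (requestId, method, url, success, timestamp) omitted for brevity
}

async function slowestByHandler(baseUrl = 'http://localhost:4000') {
  const res = await fetch(`${baseUrl}/api/v1/monitoring/performance/slowest`);
  const { operations } = (await res.json()) as { operations: SlowOperation[] };

  const byHandler = new Map<string, { count: number; maxDuration: number }>();
  for (const op of operations) {
    const key = `${op.className}.${op.handlerName}`;
    const entry = byHandler.get(key) ?? { count: 0, maxDuration: 0 };
    entry.count += 1;
    entry.maxDuration = Math.max(entry.maxDuration, op.duration);
    byHandler.set(key, entry);
  }
  return byHandler; // e.g. Map { 'BoardsController.create' => { count: 3, maxDuration: 485 } }
}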

Use Cases:

  • Identifying performance bottlenecks
  • Optimization prioritization
  • Database query analysis
  • API endpoint optimization

Get Recent Errors

Returns the 20 most recent failed requests with error details.

Request

http
GET /api/v1/monitoring/performance/errors

Authentication: None (public endpoint)

Response

json
{
  "timestamp": "2025-11-19T10:30:45.123Z",
  "errors": [
    {
      "requestId": "1732012300789-ghi789rst",
      "className": "BoardsController",
      "handlerName": "findOneByToken",
      "method": "GET",
      "url": "/api/v1/boards/invalid-uuid",
      "duration": 15,
      "success": false,
      "error": "Board not found",
      "timestamp": "2025-11-19T10:28:30.000Z"
    },
    {
      "requestId": "1732012150234-jkl012mno",
      "className": "AuthController",
      "handlerName": "validateBoardToken",
      "method": "POST",
      "url": "/api/v1/auth/validate-board-token",
      "duration": 8,
      "success": false,
      "error": "Invalid or expired token",
      "timestamp": "2025-11-19T10:15:45.000Z"
    }
  ]
}

Response Fields

| Field | Type | Description |
|-------|------|-------------|
| timestamp | string | ISO 8601 timestamp of the response |
| errors | array | Array of error metrics (max 20) |
| errors[].requestId | string | Unique request identifier |
| errors[].className | string | NestJS controller/service class name |
| errors[].handlerName | string | Method/handler name where the error occurred |
| errors[].method | string | HTTP method or "WS" for WebSocket |
| errors[].url | string | Request URL or handler name |
| errors[].duration | number | Request duration before failure (ms) |
| errors[].success | boolean | Always false for errors |
| errors[].error | string | Error message |
| errors[].timestamp | string | ISO 8601 timestamp of the error |

Example

bash
curl http://localhost:4000/api/v1/monitoring/performance/errors
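
For pattern detection, the returned errors can be counted by message. A small sketch under the same assumptions as the previous examples:

typescript
// Count recent errors by message to surface recurring failure patterns
async function errorPatterns(baseUrl = 'http://localhost:4000'): Promise<Map<string, number>> {
  const res = await fetch(`${baseUrl}/api/v1/monitoring/performance/errors`);
  const { errors } = (await res.json()) as { errors: { error: string }[] };

  const counts = new Map<string, number>();
  for (const { error } of errors) {
    counts.set(error, (counts.get(error) ?? 0) + 1);
  }
  return counts; // e.g. Map { 'Board not found' => 3, 'Invalid or expired token' => 1 }
}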

Use Cases:

  • Error rate monitoring
  • Debugging production issues
  • Error pattern detection
  • Alerting and incident response

Health Check

Returns the current health status of the application, including uptime, memory usage, and CPU metrics.

Request

http
GET /api/v1/monitoring/health

Authentication: None (public endpoint)

Response

json
{
  "status": "healthy",
  "timestamp": "2025-11-19T10:30:45.123Z",
  "uptime": 86400.5,
  "memory": {
    "rss": 125829120,
    "heapTotal": 83296256,
    "heapUsed": 52428800,
    "external": 2097152,
    "arrayBuffers": 1048576
  },
  "cpu": {
    "user": 1500000,
    "system": 500000
  }
}

Response Fields

| Field | Type | Description |
|-------|------|-------------|
| status | string | Health status (always "healthy" if responding) |
| timestamp | string | ISO 8601 timestamp of the response |
| uptime | number | Process uptime in seconds |
| memory | object | Memory usage statistics (in bytes) |
| memory.rss | number | Resident Set Size (total memory allocated for the process) |
| memory.heapTotal | number | Total heap size |
| memory.heapUsed | number | Heap memory currently in use |
| memory.external | number | Memory used by C++ objects bound to JavaScript |
| memory.arrayBuffers | number | Memory allocated for ArrayBuffers and SharedArrayBuffers |
| cpu | object | CPU usage statistics (in microseconds) |
| cpu.user | number | CPU time spent in user mode |
| cpu.system | number | CPU time spent in system mode |

Example

bash
curl http://localhost:4000/api/v1/monitoring/health
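
Beyond a simple 200 check, the payload can be used to track resource pressure. A sketch that derives heap utilization from the response (the 80% warning level is an assumption, not an API guarantee):

typescript
// Poll the health endpoint and derive heap utilization and uptime
async function checkHealth(baseUrl = 'http://localhost:4000'): Promise<boolean> {
  const res = await fetch(`${baseUrl}/api/v1/monitoring/health`);
  const health = await res.json();

  const heapUsedRatio = health.memory.heapUsed / health.memory.heapTotal;
  const uptimeHours = health.uptime / 3600;
  console.log(`status=${health.status} uptime=${uptimeHours.toFixed(1)}h heap=${(heapUsedRatio * 100).toFixed(0)}%`);

  // Treat >80% heap usage as a warning for alerting purposes (assumed threshold)
  return res.ok && heapUsedRatio < 0.8;
}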

Use Cases:

  • Load balancer health checks
  • Uptime monitoring
  • Resource usage tracking
  • Kubernetes/Docker liveness probes

Implementation Details

Performance Tracking

The monitoring system automatically tracks all HTTP requests and WebSocket events through a NestJS interceptor:

  • Buffer Size: 1000 most recent requests (circular buffer)
  • Slow Request Threshold: 200ms
  • Metrics Collected: Duration, success/failure, error messages, timestamps
  • Auto-cleanup: Old metrics are automatically removed when buffer is full
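
The interceptor itself is internal to the backend, but the pattern looks roughly like the sketch below (class and field names here are illustrative, not the actual implementation):

typescript
import { CallHandler, ExecutionContext, Injectable, Logger, NestInterceptor } from '@nestjs/common';
import { Observable, throwError } from 'rxjs';
import { catchError, tap } from 'rxjs/operators';

const BUFFER_SIZE = 1000;       // circular buffer of recent requests
const SLOW_THRESHOLD_MS = 200;  // slow request threshold

interface RequestMetric {
  className: string;
  handlerName: string;
  duration: number;
  success: boolean;
  error?: string;
  timestamp: string;
}

@Injectable()
export class PerformanceInterceptor implements NestInterceptor {
  private readonly metrics: RequestMetric[] = [];

  intercept(context: ExecutionContext, next: CallHandler): Observable<unknown> {
    const start = Date.now();
    const className = context.getClass().name;
    const handlerName = context.getHandler().name;

    return next.handle().pipe(
      tap(() => this.record(className, handlerName, Date.now() - start, true)),
      catchError((err) => {
        this.record(className, handlerName, Date.now() - start, false, err.message);
        return throwError(() => err);
      }),
    );
  }

  private record(className: string, handlerName: string, duration: number, success: boolean, error?: string) {
    this.metrics.push({ className, handlerName, duration, success, error, timestamp: new Date().toISOString() });
    if (this.metrics.length > BUFFER_SIZE) this.metrics.shift(); // evict oldest (FIFO)
    if (duration > SLOW_THRESHOLD_MS) {
      Logger.warn(`Slow request: ${className}.${handlerName} - ${duration}ms`);
    }
  }
}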

Log Levels

typescript
// Debug: All successful requests (if debug logging enabled)
Logger.debug('BoardsController.create - 45ms [POST /api/v1/boards]')

// Warning: Slow requests (>200ms)
Logger.warn('Slow request: BoardsController.create - 485ms [POST /api/v1/boards]')

// Error: Failed requests
Logger.error('Failed request: BoardsController.findOneByToken - 15ms [GET /api/v1/boards/invalid-uuid] - Board not found')

Memory Considerations

  • The metrics buffer has a fixed size of 1000 entries
  • Older metrics are automatically evicted (FIFO)
  • No persistent storage - metrics are lost on restart
  • Typical memory usage: ~500KB for full buffer

Integration Examples

Prometheus/Grafana

javascript
// Fetch stats and expose them as Prometheus gauges (sketch using the prom-client package)
const client = require('prom-client');

const gauges = {
  totalRequests: new client.Gauge({ name: 'boardapi_requests_total', help: 'Tracked requests in buffer' }),
  successRate: new client.Gauge({ name: 'boardapi_success_rate', help: 'Success rate (0-100)' }),
  averageDuration: new client.Gauge({ name: 'boardapi_duration_average_ms', help: 'Average duration in ms' }),
  p95: new client.Gauge({ name: 'boardapi_duration_p95_ms', help: '95th percentile duration in ms' }),
  slowRequests: new client.Gauge({ name: 'boardapi_slow_requests', help: 'Requests slower than 200ms' }),
  errors: new client.Gauge({ name: 'boardapi_errors', help: 'Failed requests in buffer' }),
};

async function scrapeStats() {
  const response = await fetch('http://localhost:4000/api/v1/monitoring/performance/stats');
  const { stats } = await response.json();

  // Map each stat onto its gauge; Prometheus scrapes them via prom-client's registry
  for (const [key, gauge] of Object.entries(gauges)) {
    gauge.set(stats[key]);
  }
}

Uptime Monitoring (UptimeRobot, Pingdom)

bash
# Simple health check URL
https://api.boardapi.io/api/v1/monitoring/health

Error Alerting

javascript
// Check for recent errors every minute (sendAlert is a placeholder for your alerting integration)
setInterval(async () => {
  const response = await fetch('http://localhost:4000/api/v1/monitoring/performance/errors');
  const { errors } = await response.json();

  // The endpoint returns at most the 20 most recent errors from the buffer
  if (errors.length > 10) {
    sendAlert(`High error rate: ${errors.length} recent errors`);
  }
}, 60000);

Dashboard Widget

javascript
// Real-time performance dashboard
async function updateDashboard() {
  const { stats } = await fetch('/api/v1/monitoring/performance/stats').then(r => r.json());

  document.getElementById('totalRequests').textContent = stats.totalRequests;
  document.getElementById('successRate').textContent = stats.successRate.toFixed(2) + '%';
  document.getElementById('p95Duration').textContent = stats.p95 + 'ms';
  document.getElementById('errorCount').textContent = stats.errors;
}

setInterval(updateDashboard, 5000); // Update every 5 seconds

Best Practices

1. Regular Monitoring

  • Poll /performance/stats every 30-60 seconds for dashboards
  • Use /health for load balancer health checks (every 10-30 seconds)
  • Check /performance/errors when error rates spike

2. Alerting Thresholds

  • Success Rate: Alert if < 95%
  • P95 Duration: Alert if > 500ms
  • Error Count: Alert if > 5% of total requests
  • Memory Usage: Alert if heapUsed > 80% of heapTotal
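
A minimal sketch of these checks, assuming the endpoints above and leaving the alert delivery mechanism abstract:

typescript
// Evaluate the alerting thresholds listed above and return any violations
async function evaluateThresholds(baseUrl = 'http://localhost:4000'): Promise<string[]> {
  const { stats } = await (await fetch(`${baseUrl}/api/v1/monitoring/performance/stats`)).json();
  const { memory } = await (await fetch(`${baseUrl}/api/v1/monitoring/health`)).json();

  const alerts: string[] = [];
  if (stats.successRate < 95) alerts.push(`Success rate ${stats.successRate}% is below 95%`);
  if (stats.p95 > 500) alerts.push(`p95 duration ${stats.p95}ms exceeds 500ms`);
  if (stats.totalRequests > 0 && stats.errors / stats.totalRequests > 0.05) {
    alerts.push('Errors exceed 5% of tracked requests');
  }
  if (memory.heapUsed / memory.heapTotal > 0.8) {
    alerts.push('heapUsed is above 80% of heapTotal');
  }
  return alerts; // forward non-empty results to your alerting system
}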

3. Performance Optimization

  • Identify slow operations via /performance/slowest
  • Focus on endpoints with p95 > 200ms
  • Investigate patterns in error messages

4. Production Deployment

  • Consider restricting monitoring endpoints to internal network
  • Use reverse proxy authentication for sensitive metrics
  • Export metrics to external monitoring systems (Prometheus, DataDog)

Security Considerations

Public Endpoints: All monitoring endpoints are currently public (no authentication required).

Production Recommendations:

  • Restrict access via reverse proxy (nginx, API Gateway)
  • Use IP whitelisting for monitoring systems
  • Consider adding API key authentication
  • Do not expose sensitive error messages to external users

Example nginx configuration:

nginx
# Restrict monitoring endpoints to internal IPs
location /api/v1/monitoring {
    allow 10.0.0.0/8;      # Internal network
    allow 192.168.0.0/16;  # Local network
    deny all;

    proxy_pass http://backend:4000;
}

Troubleshooting

No Metrics Available

Problem: /performance/stats returns all zeros

Solution:

  • Metrics buffer is empty (no requests processed yet)
  • Application recently restarted (metrics are not persisted)
  • Make a few API requests to populate metrics

High Memory Usage

Problem: Application memory increasing over time

Solution:

  • Metrics buffer has fixed size (1000 entries)
  • Check for memory leaks elsewhere in application
  • Monitor heapUsed via /health endpoint

Missing Slow Requests

Problem: Known slow endpoint not appearing in /performance/slowest

Solution:

  • Only successful requests are tracked
  • Buffer size is limited to 1000 entries
  • Slow requests may have been evicted if buffer is full
  • Check if request is actually >200ms (threshold)


Support

For monitoring-related issues:

  • Check application logs for errors
  • Verify endpoints are accessible via curl
  • Review nginx/proxy configuration
  • Contact DevOps team for production access