Monitoring API

The Monitoring API provides endpoints for tracking application performance, health status, and errors. These endpoints are designed for operational observability and debugging.

Overview

The Monitoring API tracks:

  • Request performance metrics (duration, success rate, percentiles)
  • Slowest operations (top 20)
  • Recent errors with full context
  • System health (uptime, memory, CPU usage)

Key Features:

  • Automatic performance tracking for all API endpoints
  • In-memory metrics buffer (last 1000 requests)
  • Slow request detection (>200ms threshold)
  • Real-time health monitoring
  • No authentication required (public endpoints)

Base URL

https://api.boardapi.io/api/v1/monitoring

For development:

http://localhost:4000/api/v1/monitoring

Endpoints Overview

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | /monitoring/performance/stats | Get performance statistics |
| GET | /monitoring/performance/slowest | Get slowest operations |
| GET | /monitoring/performance/errors | Get recent errors |
| GET | /monitoring/health | Health check endpoint |

Get Performance Statistics

Returns aggregated performance metrics for the application.

Request

http
GET /api/v1/monitoring/performance/stats

Authentication: None (public endpoint)

Response

json
{
  "timestamp": "2025-11-19T10:30:45.123Z",
  "stats": {
    "totalRequests": 1543,
    "successRate": 98.5,
    "averageDuration": 45.3,
    "p50": 32,
    "p95": 180,
    "p99": 350,
    "slowRequests": 23,
    "errors": 12
  }
}

Response Fields

| Field | Type | Description |
|-------|------|-------------|
| timestamp | string | ISO 8601 timestamp of the response |
| stats.totalRequests | number | Total number of tracked requests in the buffer |
| stats.successRate | number | Success rate as a percentage (0-100) |
| stats.averageDuration | number | Average request duration in milliseconds |
| stats.p50 | number | 50th percentile (median) duration in ms |
| stats.p95 | number | 95th percentile duration in ms |
| stats.p99 | number | 99th percentile duration in ms |
| stats.slowRequests | number | Number of slow requests (>200ms) |
| stats.errors | number | Number of failed requests |

Example

bash
curl http://localhost:4000/api/v1/monitoring/performance/stats
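
If you consume these stats from a typed client, the response shape above maps onto a small interface. A minimal sketch (the interface and function names are illustrative, not part of the API):

typescript
// Illustrative types for the stats response documented above
interface PerformanceStats {
  totalRequests: number;
  successRate: number;      // percentage, 0-100
  averageDuration: number;  // milliseconds
  p50: number;
  p95: number;
  p99: number;
  slowRequests: number;
  errors: number;
}

interface StatsResponse {
  timestamp: string;        // ISO 8601
  stats: PerformanceStats;
}

// Fetch the current stats from the monitoring API
async function getStats(baseUrl = 'http://localhost:4000'): Promise<StatsResponse> {
  const res = await fetch(`${baseUrl}/api/v1/monitoring/performance/stats`);
  if (!res.ok) throw new Error(`Stats request failed: ${res.status}`);
  return (await res.json()) as StatsResponse;
}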

Use Cases:

  • Dashboard monitoring
  • Performance trend analysis
  • SLA compliance verification
  • Capacity planning

Get Slowest Operations

Returns the 20 slowest successful operations tracked in the buffer.

Request

http
GET /api/v1/monitoring/performance/slowest

Authentication: None (public endpoint)

Response

json
{
  "timestamp": "2025-11-19T10:30:45.123Z",
  "operations": [
    {
      "requestId": "1732012245123-abc123xyz",
      "className": "BoardsController",
      "handlerName": "create",
      "method": "POST",
      "url": "/api/v1/boards",
      "duration": 485,
      "success": true,
      "timestamp": "2025-11-19T10:25:30.000Z"
    },
    {
      "requestId": "1732012200456-def456uvw",
      "className": "WebhooksService",
      "handlerName": "deliverWebhook",
      "method": "POST",
      "url": "/api/v1/webhooks/subscriptions",
      "duration": 420,
      "success": true,
      "timestamp": "2025-11-19T10:20:15.000Z"
    }
  ]
}

Response Fields

| Field | Type | Description |
|-------|------|-------------|
| timestamp | string | ISO 8601 timestamp of the response |
| operations | array | Array of performance metrics (max 20) |
| operations[].requestId | string | Unique request identifier |
| operations[].className | string | NestJS controller/service class name |
| operations[].handlerName | string | Method/handler name |
| operations[].method | string | HTTP method (GET, POST, etc.) or "WS" for WebSocket |
| operations[].url | string | Request URL or handler name |
| operations[].duration | number | Request duration in milliseconds |
| operations[].success | boolean | Whether the request succeeded |
| operations[].timestamp | string | ISO 8601 timestamp of the request |

Example

bash
curl http://localhost:4000/api/v1/monitoring/performance/slowest
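
To turn the raw list into actionable hotspots, the returned operations can be grouped by class and handler. A rough sketch (field names match the response above; the helper itself is not part of the API):

typescript
// Group the slowest operations by "ClassName.handlerName" to spot recurring hotspots
interface SlowOperation {
  className: string;
  handlerName: string;
  duration: number;  // milliseconds
  // other fields (requestId, method, url, success, timestamp) omitted for brevity
}

async function slowestByHandler(baseUrl = 'http://localhost:4000') {
  const res = await fetch(`${baseUrl}/api/v1/monitoring/performance/slowest`);
  const { operations } = (await res.json()) as { operations: SlowOperation[] };

  const byHandler = new Map<string, { count: number; maxDuration: number }>();
  for (const op of operations) {
    const key = `${op.className}.${op.handlerName}`;
    const entry = byHandler.get(key) ?? { count: 0, maxDuration: 0 };
    entry.count += 1;
    entry.maxDuration = Math.max(entry.maxDuration, op.duration);
    byHandler.set(key, entry);
  }
  return byHandler; // e.g. Map { 'BoardsController.create' => { count: 3, maxDuration: 485 } }
}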

Use Cases:

  • Identifying performance bottlenecks
  • Optimization prioritization
  • Database query analysis
  • API endpoint optimization

Get Recent Errors

Returns the 20 most recent failed requests with error details.

Request

http
GET /api/v1/monitoring/performance/errors

Authentication: None (public endpoint)

Response

json
{
  "timestamp": "2025-11-19T10:30:45.123Z",
  "errors": [
    {
      "requestId": "1732012300789-ghi789rst",
      "className": "BoardsController",
      "handlerName": "findOneByToken",
      "method": "GET",
      "url": "/api/v1/boards/invalid-uuid",
      "duration": 15,
      "success": false,
      "error": "Board not found",
      "timestamp": "2025-11-19T10:28:30.000Z"
    },
    {
      "requestId": "1732012150234-jkl012mno",
      "className": "AuthController",
      "handlerName": "validateBoardToken",
      "method": "POST",
      "url": "/api/v1/auth/validate-board-token",
      "duration": 8,
      "success": false,
      "error": "Invalid or expired token",
      "timestamp": "2025-11-19T10:15:45.000Z"
    }
  ]
}

Response Fields

| Field | Type | Description |
|-------|------|-------------|
| timestamp | string | ISO 8601 timestamp of the response |
| errors | array | Array of error metrics (max 20) |
| errors[].requestId | string | Unique request identifier |
| errors[].className | string | NestJS controller/service class name |
| errors[].handlerName | string | Method/handler name where the error occurred |
| errors[].method | string | HTTP method or "WS" for WebSocket |
| errors[].url | string | Request URL or handler name |
| errors[].duration | number | Request duration before failure (ms) |
| errors[].success | boolean | Always false for errors |
| errors[].error | string | Error message |
| errors[].timestamp | string | ISO 8601 timestamp of the error |

Example

bash
curl http://localhost:4000/api/v1/monitoring/performance/errors
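
For pattern detection, the returned errors can be counted by message. A small sketch under the same assumptions as the previous examples:

typescript
// Count recent errors by message to surface recurring failure patterns
async function errorPatterns(baseUrl = 'http://localhost:4000'): Promise<Map<string, number>> {
  const res = await fetch(`${baseUrl}/api/v1/monitoring/performance/errors`);
  const { errors } = (await res.json()) as { errors: { error: string }[] };

  const counts = new Map<string, number>();
  for (const { error } of errors) {
    counts.set(error, (counts.get(error) ?? 0) + 1);
  }
  return counts; // e.g. Map { 'Board not found' => 3, 'Invalid or expired token' => 1 }
}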

Use Cases:

  • Error rate monitoring
  • Debugging production issues
  • Error pattern detection
  • Alerting and incident response

Health Check

Returns the current health status of the application, including uptime, memory usage, and CPU metrics.

Request

http
GET /api/v1/monitoring/health

Authentication: None (public endpoint)

Response

json
{
  "status": "healthy",
  "timestamp": "2025-11-19T10:30:45.123Z",
  "uptime": 86400.5,
  "memory": {
    "rss": 125829120,
    "heapTotal": 83296256,
    "heapUsed": 52428800,
    "external": 2097152,
    "arrayBuffers": 1048576
  },
  "cpu": {
    "user": 1500000,
    "system": 500000
  }
}

Response Fields

| Field | Type | Description |
|-------|------|-------------|
| status | string | Health status (always "healthy" if responding) |
| timestamp | string | ISO 8601 timestamp of the response |
| uptime | number | Process uptime in seconds |
| memory | object | Memory usage statistics (in bytes) |
| memory.rss | number | Resident Set Size (total memory allocated for the process) |
| memory.heapTotal | number | Total heap size |
| memory.heapUsed | number | Heap memory currently in use |
| memory.external | number | Memory used by C++ objects bound to JavaScript |
| memory.arrayBuffers | number | Memory allocated for ArrayBuffers and SharedArrayBuffers |
| cpu | object | CPU usage statistics (in microseconds) |
| cpu.user | number | CPU time spent in user mode |
| cpu.system | number | CPU time spent in system mode |

Example

bash
curl http://localhost:4000/api/v1/monitoring/health
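
Beyond a simple 200 check, the payload can be used to track resource pressure. A sketch that derives heap utilization from the response (the 80% warning level is an assumption, not an API guarantee):

typescript
// Poll the health endpoint and derive heap utilization and uptime
async function checkHealth(baseUrl = 'http://localhost:4000'): Promise<boolean> {
  const res = await fetch(`${baseUrl}/api/v1/monitoring/health`);
  const health = await res.json();

  const heapUsedRatio = health.memory.heapUsed / health.memory.heapTotal;
  const uptimeHours = health.uptime / 3600;
  console.log(`status=${health.status} uptime=${uptimeHours.toFixed(1)}h heap=${(heapUsedRatio * 100).toFixed(0)}%`);

  // Treat >80% heap usage as a warning for alerting purposes (assumed threshold)
  return res.ok && heapUsedRatio < 0.8;
}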

Use Cases:

  • Load balancer health checks
  • Uptime monitoring
  • Resource usage tracking
  • Kubernetes/Docker liveness probes

Implementation Details

Performance Tracking

The monitoring system automatically tracks all HTTP requests and WebSocket events through a NestJS interceptor:

  • Buffer Size: 1000 most recent requests (circular buffer)
  • Slow Request Threshold: 200ms
  • Metrics Collected: Duration, success/failure, error messages, timestamps
  • Auto-cleanup: Old metrics are automatically removed when buffer is full
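
The interceptor itself is internal to the backend, but the pattern looks roughly like the sketch below (class and field names here are illustrative, not the actual implementation):

typescript
import { CallHandler, ExecutionContext, Injectable, Logger, NestInterceptor } from '@nestjs/common';
import { Observable, throwError } from 'rxjs';
import { catchError, tap } from 'rxjs/operators';

const BUFFER_SIZE = 1000;       // circular buffer of recent requests
const SLOW_THRESHOLD_MS = 200;  // slow request threshold

interface RequestMetric {
  className: string;
  handlerName: string;
  duration: number;
  success: boolean;
  error?: string;
  timestamp: string;
}

@Injectable()
export class PerformanceInterceptor implements NestInterceptor {
  private readonly metrics: RequestMetric[] = [];

  intercept(context: ExecutionContext, next: CallHandler): Observable<unknown> {
    const start = Date.now();
    const className = context.getClass().name;
    const handlerName = context.getHandler().name;

    return next.handle().pipe(
      tap(() => this.record(className, handlerName, Date.now() - start, true)),
      catchError((err) => {
        this.record(className, handlerName, Date.now() - start, false, err.message);
        return throwError(() => err);
      }),
    );
  }

  private record(className: string, handlerName: string, duration: number, success: boolean, error?: string) {
    this.metrics.push({ className, handlerName, duration, success, error, timestamp: new Date().toISOString() });
    if (this.metrics.length > BUFFER_SIZE) this.metrics.shift(); // evict oldest (FIFO)
    if (duration > SLOW_THRESHOLD_MS) {
      Logger.warn(`Slow request: ${className}.${handlerName} - ${duration}ms`);
    }
  }
}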

Log Levels

typescript
// Debug: All successful requests (if debug logging enabled)
Logger.debug('BoardsController.create - 45ms [POST /api/v1/boards]')

// Warning: Slow requests (>200ms)
Logger.warn('Slow request: BoardsController.create - 485ms [POST /api/v1/boards]')

// Error: Failed requests
Logger.error('Failed request: BoardsController.findOneByToken - 15ms [GET /api/v1/boards/invalid-uuid] - Board not found')

Memory Considerations

  • The metrics buffer has a fixed size of 1000 entries
  • Older metrics are automatically evicted (FIFO)
  • No persistent storage - metrics are lost on restart
  • Typical memory usage: ~500KB for full buffer

Integration Examples

Prometheus/Grafana

javascript
// Fetch stats and expose them as Prometheus gauges (sketch using the prom-client package)
const client = require('prom-client');

const gauges = {
  totalRequests: new client.Gauge({ name: 'boardapi_requests_total', help: 'Tracked requests in buffer' }),
  successRate: new client.Gauge({ name: 'boardapi_success_rate', help: 'Success rate (0-100)' }),
  averageDuration: new client.Gauge({ name: 'boardapi_duration_average_ms', help: 'Average duration in ms' }),
  p95: new client.Gauge({ name: 'boardapi_duration_p95_ms', help: '95th percentile duration in ms' }),
  slowRequests: new client.Gauge({ name: 'boardapi_slow_requests', help: 'Requests slower than 200ms' }),
  errors: new client.Gauge({ name: 'boardapi_errors', help: 'Failed requests in buffer' }),
};

async function scrapeStats() {
  const response = await fetch('http://localhost:4000/api/v1/monitoring/performance/stats');
  const { stats } = await response.json();

  // Map each stat onto its gauge; Prometheus scrapes them via prom-client's registry
  for (const [key, gauge] of Object.entries(gauges)) {
    gauge.set(stats[key]);
  }
}

Uptime Monitoring (UptimeRobot, Pingdom)

bash
# Simple health check URL
https://api.boardapi.io/api/v1/monitoring/health

Error Alerting

javascript
// Check for recent errors every minute (sendAlert is a placeholder for your alerting integration)
setInterval(async () => {
  const response = await fetch('http://localhost:4000/api/v1/monitoring/performance/errors');
  const { errors } = await response.json();

  // The endpoint returns at most the 20 most recent errors from the buffer
  if (errors.length > 10) {
    sendAlert(`High error rate: ${errors.length} recent errors`);
  }
}, 60000);

Dashboard Widget

javascript
// Real-time performance dashboard
async function updateDashboard() {
  const { stats } = await fetch('/api/v1/monitoring/performance/stats').then(r => r.json());

  document.getElementById('totalRequests').textContent = stats.totalRequests;
  document.getElementById('successRate').textContent = stats.successRate.toFixed(2) + '%';
  document.getElementById('p95Duration').textContent = stats.p95 + 'ms';
  document.getElementById('errorCount').textContent = stats.errors;
}

setInterval(updateDashboard, 5000); // Update every 5 seconds

Best Practices

1. Regular Monitoring

  • Poll /performance/stats every 30-60 seconds for dashboards
  • Use /health for load balancer health checks (every 10-30 seconds)
  • Check /performance/errors when error rates spike

2. Alerting Thresholds

  • Success Rate: Alert if < 95%
  • P95 Duration: Alert if > 500ms
  • Error Count: Alert if > 5% of total requests
  • Memory Usage: Alert if heapUsed > 80% of heapTotal
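
A minimal sketch of these checks, assuming the endpoints above and leaving the alert delivery mechanism abstract:

typescript
// Evaluate the alerting thresholds listed above and return any violations
async function evaluateThresholds(baseUrl = 'http://localhost:4000'): Promise<string[]> {
  const { stats } = await (await fetch(`${baseUrl}/api/v1/monitoring/performance/stats`)).json();
  const { memory } = await (await fetch(`${baseUrl}/api/v1/monitoring/health`)).json();

  const alerts: string[] = [];
  if (stats.successRate < 95) alerts.push(`Success rate ${stats.successRate}% is below 95%`);
  if (stats.p95 > 500) alerts.push(`p95 duration ${stats.p95}ms exceeds 500ms`);
  if (stats.totalRequests > 0 && stats.errors / stats.totalRequests > 0.05) {
    alerts.push('Errors exceed 5% of tracked requests');
  }
  if (memory.heapUsed / memory.heapTotal > 0.8) {
    alerts.push('heapUsed is above 80% of heapTotal');
  }
  return alerts; // forward non-empty results to your alerting system
}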

3. Performance Optimization

  • Identify slow operations via /performance/slowest
  • Focus on endpoints with p95 > 200ms
  • Investigate patterns in error messages

4. Production Deployment

  • Consider restricting monitoring endpoints to internal network
  • Use reverse proxy authentication for sensitive metrics
  • Export metrics to external monitoring systems (Prometheus, DataDog)

Security Considerations

Public Endpoints: All monitoring endpoints are currently public (no authentication required).

Production Recommendations:

  • Restrict access via reverse proxy (nginx, API Gateway)
  • Use IP whitelisting for monitoring systems
  • Consider adding API key authentication
  • Do not expose sensitive error messages to external users

Example nginx configuration:

nginx
# Restrict monitoring endpoints to internal IPs
location /api/v1/monitoring {
    allow 10.0.0.0/8;      # Internal network
    allow 192.168.0.0/16;  # Local network
    deny all;

    proxy_pass http://backend:4000;
}

Troubleshooting

No Metrics Available

Problem: /performance/stats returns all zeros

Solution:

  • Metrics buffer is empty (no requests processed yet)
  • Application recently restarted (metrics are not persisted)
  • Make a few API requests to populate metrics

High Memory Usage

Problem: Application memory increasing over time

Solution:

  • Metrics buffer has fixed size (1000 entries)
  • Check for memory leaks elsewhere in application
  • Monitor heapUsed via /health endpoint

Missing Slow Requests

Problem: Known slow endpoint not appearing in /performance/slowest

Solution:

  • Only successful requests are tracked
  • Buffer size is limited to 1000 entries
  • Slow requests may have been evicted if buffer is full
  • Check if request is actually >200ms (threshold)


Support

For monitoring-related issues:

  • Check application logs for errors
  • Verify endpoints are accessible via curl
  • Review nginx/proxy configuration
  • Contact DevOps team for production access