Building a Budget-Friendly, Production-Ready Monitoring Stack
Before launching a new project, I make it a priority to have a robust monitoring system in place. When an issue arises, having access to searchable logs, historical metrics, and resource usage data makes my life easier. For my latest project, I made a conscious decision to use a dedicated server instead of a major cloud provider like AWS or GCP, primarily for financial reasons. That choice meant I wouldn't have access to managed monitoring tools, and with third-party services outside my budget, I set out to engineer a cost-effective replacement. This document details the architecture and implementation of the self-hosted observability stack I built.
System Architecture for Resilient Monitoring
A key requirement was that the monitoring server be separate from my production server, and that second machine would be the largest cost driver. I considered a Raspberry Pi or my Synology NAS, since either would be the most cost-effective option, but relying on my residential ISP, combined with spotty power at home (paying bills on time is a struggle), ruled that out. Major cloud providers do offer free tiers for small instances, but I found that hosting providers like Linode and DigitalOcean cost about the same without the overwhelming UIs of AWS and Azure.
So here is the two-server model I settled on, which keeps the monitoring stack isolated from failures on the production machine:
- The Production Server: This machine is dedicated to running the core application. It is instrumented with lightweight, efficient data shippers that export telemetry with minimal performance overhead.
- The Monitoring Server: This is a dedicated machine that runs the core data aggregation, storage, and visualization stack.
For the monitoring server, I selected a small cloud VPS. This approach provides a highly available, professionally managed environment that avoids the unreliability of residential ISPs and power, while remaining exceptionally cost-effective. This architecture not only ensures robust monitoring for the current project but also establishes a scalable foundation that could be extended to monitor a fleet of services or even evolve into a managed monitoring solution.
Part 1: Deploying the Core Monitoring Stack
The monitoring server is the centralized hub of the operation. After provisioning a small VM, I installed Docker and Docker Compose and deployed the core stack using the following configuration.
docker-compose.yml for the Monitoring Server
This file orchestrates the primary services: Grafana for data visualization, Loki for log aggregation, Prometheus for time-series metrics, and Uptime Kuma for external uptime checks.
services:
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    restart: unless-stopped

  loki:
    image: grafana/loki:latest
    container_name: loki
    command: -config.file=/etc/loki/local-config.yaml
    ports:
      - "3100:3100"
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus-data:/prometheus
    command: --config.file=/etc/prometheus/prometheus.yml
    restart: unless-stopped

  uptime-kuma:
    image: louislam/uptime-kuma:1
    container_name: uptime-kuma
    ports:
      - "3001:3001"
    volumes:
      - uptime-kuma-data:/app/data
    restart: unless-stopped

volumes:
  grafana-data: {}
  prometheus-data: {}
  uptime-kuma-data: {}
Prometheus Configuration
I created a prometheus directory containing a prometheus.yml file to define the scrape targets on the production server.
# ./prometheus/prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['<PRODUCTION_SERVER_IP>:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['<PRODUCTION_SERVER_IP>:8085']
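Before bringing the stack up, the file can be sanity-checked with promtool, which is bundled in the prom/prometheus image. Run from the directory containing docker-compose.yml:

# Validate prometheus.yml syntax using the promtool binary bundled in the image
docker run --rm -v "$(pwd)/prometheus:/etc/prometheus" \
  --entrypoint promtool prom/prometheus:latest \
  check config /etc/prometheus/prometheus.yml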
With a docker-compose up -d, the monitoring server was operational.
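A few quick checks against the ports published above confirm each service is responding; the readiness and health endpoints below are built into Prometheus, Loki, and Grafana:

# Prometheus readiness
curl -s http://localhost:9090/-/ready
# Loki readiness (may take a few seconds after startup)
curl -s http://localhost:3100/ready
# Grafana health
curl -s http://localhost:3000/api/health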
Part 2: Instrumenting the Production Server
On the production server, I deployed a lean Docker Compose configuration containing only the necessary data shippers.
- Node Exporter: Exposes host-level system metrics.
- cAdvisor: Exposes container-level performance metrics.
- Promtail: The log collection agent that forwards container and system logs to Loki.
docker-compose.yml for the Production Server
services:
  # --- Existing application services ---
  # my-app:
  #   image: ...

  # --- Monitoring agents ---
  promtail:
    image: grafana/promtail:latest
    container_name: promtail
    volumes:
      - /var/log:/var/log
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./promtail:/etc/promtail
    command: -config.file=/etc/promtail/promtail-config.yml
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    ports:
      - "9100:9100"
    pid: host
    volumes:
      - /:/host:ro,rslave
    command: --path.rootfs=/host
    restart: unless-stopped

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    ports:
      - "8085:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    restart: unless-stopped
Promtail Configuration
I created a promtail directory with a promtail-config.yml file to configure the log shipping destination.
# ./promtail/promtail-config.yml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://<MONITORING_SERVER_IP>:3100/loki/api/v1/push
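One caveat: as written, this file only tells Promtail where to push. It also needs a scrape_configs section describing which log files to tail. Here is a minimal sketch, assuming Docker's default json-file log driver and the /var/lib/docker/containers and /var/log mounts from the Compose file above:

# Appended to promtail-config.yml — minimal sketch for container and system logs
scrape_configs:
  - job_name: docker
    static_configs:
      - targets: [localhost]
        labels:
          job: docker
          __path__: /var/lib/docker/containers/*/*-json.log

  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          __path__: /var/log/*log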
I then ran docker-compose up -d to deploy the agents.
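A quick local check against the ports published above confirms both exporters are serving metrics before any firewall work:

# Node Exporter host metrics
curl -s http://localhost:9100/metrics | head
# cAdvisor container metrics (mapped to host port 8085)
curl -s http://localhost:8085/metrics | head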
Part 3: Establishing Secure Connectivity
A common point of failure in distributed systems is firewall misconfiguration. I configured both the host-level (ufw) and network-level (cloud provider) firewalls to allow the necessary traffic between the two servers.
On the Production Server (ufw):
# Allow ingress from monitoring server for metric scraping
sudo ufw allow from <MONITORING_SERVER_IP> to any port 9100 proto tcp comment 'Allow Prometheus Node Exporter'
sudo ufw allow from <MONITORING_SERVER_IP> to any port 8085 proto tcp comment 'Allow Prometheus cAdvisor'
sudo ufw reload
On the Monitoring Server (ufw):
# Allow ingress from production server for log ingestion
sudo ufw allow from <PRODUCTION_SERVER_IP> to any port 3100 proto tcp comment 'Allow Loki Log Ingestion'
sudo ufw reload
Critically, I mirrored these rules in the respective cloud provider's network firewall console to ensure traffic was not dropped at the edge.
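With both firewall layers in place, connectivity can be verified end to end from each side:

# From the monitoring server: confirm Prometheus can reach the exporters
curl -s http://<PRODUCTION_SERVER_IP>:9100/metrics | head
curl -s http://<PRODUCTION_SERVER_IP>:8085/metrics | head

# From the production server: confirm Promtail can reach Loki
curl -s http://<MONITORING_SERVER_IP>:3100/ready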
Part 4: Securing Monitoring Endpoints
The web interfaces for Prometheus, Grafana, and Uptime Kuma were now functional but publicly exposed. I implemented a reverse proxy using Caddy on the monitoring server to enforce authentication and provide automatic HTTPS.
- I installed Caddy on the monitoring server.
- I generated a secure, hashed password by running caddy hash-password.
- I configured the /etc/caddy/Caddyfile to proxy traffic and require authentication for sensitive endpoints, after pointing the relevant subdomains' DNS records to the monitoring server's IP.
- I reloaded the Caddy service (sudo systemctl reload caddy) and ensured the host firewall allowed HTTPS traffic (sudo ufw allow 443/tcp).
prometheus.yourdomain.com {
    basic_auth {
        admin <YOUR_HASHED_PASSWORD>
    }
    reverse_proxy localhost:9090
}

grafana.yourdomain.com {
    reverse_proxy localhost:3000
}

uptime.yourdomain.com {
    reverse_proxy localhost:3001
}
The monitoring endpoints were now served over TLS, with basic authentication protecting Prometheus and the built-in logins of Grafana and Uptime Kuma covering the rest.
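As a final check from an outside machine, an unauthenticated request to the Prometheus subdomain should be rejected, while the credentials hashed earlier should get through (the subdomain and password below are placeholders):

# Expect 401 without credentials
curl -s -o /dev/null -w "%{http_code}\n" https://prometheus.yourdomain.com
# Expect 200 with the basic-auth credentials
curl -s -o /dev/null -w "%{http_code}\n" -u admin:<YOUR_PASSWORD> https://prometheus.yourdomain.com/-/healthy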