Monitoring¶

The metrics and log stack runs on yggdrasil. Uptime Kuma is responsible for public endpoint checks and the public status page.

Components¶

Module	Role	Binding
`services/prometheus/`	node metrics, alert rule evaluation, 15d retention	`127.0.0.1:9090`
`services/grafana/`	dashboards, Prometheus provisioned as default datasource	`127.0.0.1:3003`
`services/loki.nix`	log store	localhost
`services/alloy.nix`	log collection on each host → ships to Loki (all hosts)
`services/node-exporter.nix`	node metrics (all hosts, systemd collector enabled)	`:9100`
`services/uptime-kuma.nix`	public endpoint checks + status page	`127.0.0.1:3001`

Prometheus scrape targets¶

127.0.0.1:9100                  -> yggdrasil node_exporter
midgard.tail6fc192.ts.net:9100  -> midgard node_exporter (over the tailnet)

The scrape interval is 3m.

Alert rules¶

Defined in services/prometheus/node-health-alert-rule.yml.

NodeDown, CriticalServiceInactive, SystemdServiceFailed
RootDiskLow, RootInodesLow, RootFilesystemReadOnly
LowMemory, HighCpuUsage, HighLoad

No Alertmanager yet

Rules are visible in the Prometheus UI, but external notification delivery through Alertmanager is not configured yet.

Access¶

Grafana, from a tailnet-connected client:

https://grafana.ridewithmin.com

Prometheus UI, through SSH port forwarding:

ssh -L 9090:127.0.0.1:9090 yggdrasil
# open http://127.0.0.1:9090

Health checks¶

# on yggdrasil
systemctl is-active prometheus prometheus-node-exporter grafana loki
curl -fsS http://127.0.0.1:9090/-/ready
curl -fsS http://127.0.0.1:9090/api/v1/targets | jq
curl -fsS http://127.0.0.1:3003/api/health | jq