
Observability stack (Graylog + Prometheus + Grafana + Zabbix)

Built an internal monitoring setup so ops could search logs and graph server health without hopping between boxes. Graylog handled syslog, Prometheus scraped exporters, Grafana made it readable, and Zabbix covered classic host monitoring.

At a glance

  • Logs: Graylog + OpenSearch, with separate streams for firewall vs. network gear.
  • Metrics: Prometheus + Grafana, scraping Node Exporter and Windows Exporter every 15s.
  • Host monitoring: Zabbix backed by PostgreSQL, running on a separate Hyper-V host.
  • Result: One place to search logs and check server health during triage.

Logs: Graylog + OpenSearch

The starting point was simple: get firewall and network logs into one place so you can actually search them when something breaks. In this environment, that meant PFSense firewall events, UniFi logs, and the usual app/server noise that otherwise ends up scattered across boxes.

I deployed Graylog 6.0 backed by OpenSearch 2.11 for indexing/search and MongoDB 7.0 for Graylog's configuration/metadata. OpenSearch was given 8GB of heap so indexing stayed responsive under load.

Ingestion is plain syslog: UDP 514 for PFSense (filterlog, system events, Suricata alerts) and UDP 515 for UniFi (AP/switch/controller events). Keeping them on separate inputs makes it easy to split streams and avoid mixing firewall traffic with WiFi chatter during an incident.

The only gotcha early on: Graylog's web interface was bound to localhost, so nothing external could reach it. Once I bound it to the right interface and locked it down with firewall rules, logs started flowing and were searchable within seconds.
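
For reference, the change is a one-liner in Graylog's config (a sketch; the interface address below is a placeholder for whatever the server actually uses):

    # /etc/graylog/server/server.conf
    # Default binds the web interface/API to localhost only:
    #   http_bind_address = 127.0.0.1:9000
    # Bind to the server's LAN interface instead (placeholder address):
    http_bind_address = 192.168.1.50:9000

After editing, restart Graylog (systemctl restart graylog-server) and restrict access to port 9000 with firewall rules.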

Metrics: Prometheus + Grafana

Logs are great for forensics. For “is it melting right now?” you want metrics. I deployed Prometheus 2.48 to scrape exporters across the fleet.

On the Linux side, Node Exporter covers CPU, memory, disk, and network. For Windows (5 production servers), I used Windows Exporter on port 9182 and scraped every 15 seconds for CPU/memory/disk queue/network counters.

Prometheus stores time-series data in its TSDB with 15-day retention. A 15s interval is a good compromise: it catches short spikes without turning storage into a problem.
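
A minimal prometheus.yml along these lines covers both exporter jobs; hostnames are placeholders, and retention is a launch flag rather than a config-file setting:

    # /etc/prometheus/prometheus.yml (sketch; hostnames are placeholders)
    global:
      scrape_interval: 15s

    scrape_configs:
      - job_name: "node"          # Linux hosts via Node Exporter
        static_configs:
          - targets: ["linux-srv-01:9100", "linux-srv-02:9100"]
      - job_name: "windows"       # Windows hosts via Windows Exporter
        static_configs:
          - targets: ["win-srv-01:9182", "win-srv-02:9182"]

    # Retention is set on the Prometheus command line, e.g.
    #   --storage.tsdb.retention.time=15d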

Grafana sits on top as the front end. I used community dashboards (Node Exporter Full #1860 and Windows Exporter #14694) so the team could pick a server from a dropdown and immediately see health + trends without building dashboards from scratch.
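
Pointing Grafana at Prometheus can be done in the UI or via provisioning; a sketch of the provisioning route, assuming everything runs on the same VM:

    # /etc/grafana/provisioning/datasources/prometheus.yaml (sketch)
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://localhost:9090
        isDefault: true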

Host monitoring: Zabbix

Prometheus/Grafana covered metrics well, but the environment also needed traditional host monitoring with templates and agents. That's where Zabbix 7.0 fits in.

I ran Zabbix on a separate Hyper-V host (i5-10500) and backed it with PostgreSQL 16. The install was straightforward once database auth was set up correctly.

Zabbix's PHP frontend authenticates to PostgreSQL with a password, but the default pg_hba.conf rules didn't allow that on the connection path Zabbix used. The fix was in pg_hba.conf: allow md5 for the local connections Zabbix makes, restart PostgreSQL, and confirm the connection before continuing. After that, importing the Zabbix schema and finishing the web installer went normally.

The Zabbix UI came up on an internal URL (e.g., http://zabbix-server/zabbix), ready for agent rollout. Since Zabbix is agent-based, it can be easier to fit into networks where pull-based scraping is awkward.
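
Agent rollout is mostly a matter of pointing each agent back at the server; a minimal zabbix_agent2.conf sketch, with placeholder values:

    # /etc/zabbix/zabbix_agent2.conf (sketch; values are placeholders)
    Server=192.168.1.60          # Zabbix server allowed to poll this agent (passive checks)
    ServerActive=192.168.1.60    # Zabbix server to report to (active checks)
    Hostname=app-srv-01          # Must match the host name configured in the Zabbix UI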

┌─────────────────────────────────────────────────────────────┐
│                   Observability Platform                    │
│                    (internal deployment)                    │
│                                                             │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐     │
│  │   Graylog    │   │  Prometheus  │   │    Zabbix    │     │
│  │   Stack      │   │  + Grafana   │   │    Server    │     │
│  └──────────────┘   └──────────────┘   └──────────────┘     │
│                                                             │
│   Logs (Syslog)      Metrics (HTTP)      Host checks        │
└─────────────────────────────────────────────────────────────┘
          ▲                  ▲                  ▲
          │                  │                  │
     ┌────┴────┐        ┌────┴────┐        ┌────┴────┐
     │ PFSense │        │ Windows │        │ Network │
     │  :514   │        │ Servers │        │ Devices │
     │  UniFi  │        │Exporters│        │ Agents  │
     │  :515   │        │  :9182  │        │         │
     └─────────┘        └─────────┘        └─────────┘

Database + tuning

When the Zabbix web installer failed with "Cannot connect to the database," I treated it like any other connectivity issue: verify the service, verify the database exists, and then test the exact connection Zabbix is trying to make.

That meant checking basics (systemctl status postgresql, sudo -u postgres psql -c "\l") and then running the same kind of login Zabbix would use: PGPASSWORD=zabbix123 psql -U zabbix -d zabbix -h localhost. The error pointed straight at auth rules: pg_hba.conf was set to scram-sha-256 for the localhost path Zabbix was using.

Switching the relevant local rule to md5, restarting PostgreSQL, and re-testing fixed it. The important part wasn't the specific setting; it was making sure the change was small, testable, and easy to explain later.
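
In practice that was one line in pg_hba.conf plus a restart and a re-test. A sketch, assuming Zabbix connects over TCP to localhost (the file path varies by distro):

    # /etc/postgresql/16/main/pg_hba.conf (sketch; path is the Ubuntu default)
    # TYPE   DATABASE   USER     ADDRESS        METHOD
    host     zabbix     zabbix   127.0.0.1/32   md5

    # Apply and re-test with the same login Zabbix uses:
    #   sudo systemctl restart postgresql
    #   PGPASSWORD=zabbix123 psql -U zabbix -d zabbix -h localhost -c "SELECT 1;"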

OpenSearch also needed sensible heap sizing: 8GB via /etc/opensearch/jvm.options.d/heap.options (-Xms8g -Xmx8g) on a 16GB VM. That left room for MongoDB (~2GB), Graylog (~2GB), Prometheus (~3GB), plus OS overhead.
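
The heap override is just a drop-in file, applied with a service restart:

    # /etc/opensearch/jvm.options.d/heap.options
    -Xms8g
    -Xmx8g

    # Apply: sudo systemctl restart opensearch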

Syslog routing + streams

Once the stack was up, the next step was keeping logs separated so troubleshooting stayed sane (firewall events in one place, UniFi noise in another).

I routed logs into streams based on source address. One small checkbox matters here: Remove matches from Default Stream. Without it, events show up in both the custom stream and the default catch-all.

After that, the PFSense stream was just firewall logs (filterlog, blocked connections, Suricata alerts) and the UniFi stream was just WiFi/switch/controller events. For one-off questions, Graylog queries like source:filterlog or source:php-cgi are usually enough without touching routing rules again.

PFSense was configured under Status → System Logs → Settings → Remote Logging to send to the Graylog server on UDP 514. UniFi lives under Settings → CyberSecure → Traffic Logging. UniFi syslog can be finicky; sometimes it needed a reboot or a toggle to start emitting reliably.
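
When a source goes quiet, a quick packet capture on the Graylog host confirms whether syslog is even arriving before blaming streams or routing rules (a sketch, assuming tcpdump is installed):

    # Watch for syslog from PFSense (514) and UniFi (515) on the Graylog host
    sudo tcpdump -n -i any 'udp port 514 or udp port 515'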

Scaling notes (multi-site)

This setup ran on a single VM, but the obvious next question in healthcare IT is multi-site: “how do we watch multiple locations without standing up a full stack in every building?”

For Zabbix, the usual pattern is proxies: a central server plus lightweight proxies at each site to collect/buffer data locally and forward it back up.
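
A per-site proxy needs little more than a name and the central server's address; a minimal zabbix_proxy.conf sketch with placeholder values:

    # /etc/zabbix/zabbix_proxy.conf (sketch; values are placeholders)
    ProxyMode=0                     # 0 = active proxy (connects out to the central server)
    Server=zabbix.example.internal  # Central Zabbix server
    Hostname=proxy-site-a           # Must match the proxy name registered on the server
    DBName=zabbix_proxy             # Proxies keep a small local buffer database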

Prometheus can do something similar via federation: local Prometheus instances per site, plus a central Prometheus that scrapes aggregates for org-wide dashboards.
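
On the central Prometheus, federation is just another scrape job against each site's /federate endpoint; a sketch with placeholder job and target names:

    # Central prometheus.yml scrape job (sketch; names are placeholders)
    scrape_configs:
      - job_name: "federate"
        honor_labels: true            # Keep the site instances' original labels
        metrics_path: "/federate"
        params:
          "match[]":                  # Only pull the series the org-wide dashboards need
            - '{job="node"}'
            - '{job="windows"}'
        static_configs:
          - targets: ["prometheus-site-a:9090", "prometheus-site-b:9090"]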

Graylog scales mostly by scaling OpenSearch: add nodes as volume grows, and put Graylog behind a load balancer if you need more than one instance.

On a 16GB VM, the rough budget was OpenSearch (8GB heap), MongoDB (~2GB), Graylog (~2GB), Prometheus (~3GB), Grafana (~500MB), and the rest for the OS. If this grew, the first split would be OpenSearch onto its own VM(s), then Prometheus, leaving Grafana wherever it's convenient.

Impact

Day to day, this meant less time spelunking through individual systems. Logs were searchable in Graylog, and server health was visible in Grafana with a simple per-server dropdown. Zabbix templates made it easier to add monitoring for new hosts without hand-rolling everything.

It's intentionally a “right tool for the job” setup: Graylog for logs, Prometheus for metrics, Grafana for dashboards, and Zabbix for traditional monitoring and templates.

Graylog 6.0 · OpenSearch 2.11 · MongoDB 7.0 · Prometheus 2.48 · Grafana · Zabbix 7.0 · PostgreSQL 16 · Ubuntu 24.04 LTS · Hyper-V Virtualization · PFSense Firewall · UniFi Networking · Node Exporter · Windows Exporter · Syslog Protocol · Time-Series Database · Log Aggregation · Infrastructure Monitoring