Prometheus is a systems and service monitoring system [Link]. Out of the box, it monitors the host where it is running, and it scrapes metrics from other endpoints through exporters such as Prometheus Node Exporter [Link].

On the other hand, Grafana is a visualizer that consumes metrics from data sources like Prometheus, to build graphs and dashboards [Link].


INSTALLING PROMETHEUS

Some will say that you should not use the distribution packages because they are not updated frequently, but they are the most reliable source and are upgraded and patched along with the system:

sudo apt-get install prometheus prometheus-node-exporter prometheus-pushgateway prometheus-alertmanager -y
sudo systemctl stop prometheus

Enable the lifecycle API (it allows reloading the configuration and shutting down Prometheus over HTTP):

sudo nano /etc/systemd/system/multi-user.target.wants/prometheus.service

Edit the following line:

ExecStart=/usr/bin/prometheus --web.enable-lifecycle $ARGS

Note: the default retention period of the collected data is 15 days. To customize this period, add the argument --storage.tsdb.retention.time=30d with the desired period.

Apply the change and start the service:

sudo systemctl daemon-reload
sudo systemctl start prometheus
sudo systemctl status prometheus
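
Besides checking the unit status, you can ask Prometheus itself whether it is ready to serve requests. The following is a minimal sketch, assuming the default address http://localhost:9090; the function name and the retry count are my own choices for illustration:

```shell
# Poll the Prometheus readiness endpoint until it answers or we give up.
# The URL and the number of attempts are assumptions; adjust them as needed.
wait_for_prometheus() {
    prom_url="${1:-http://localhost:9090}"
    attempts="${2:-10}"
    i=1
    while [ "$i" -le "$attempts" ]; do
        # -f makes curl fail on HTTP errors (e.g. 503 while still starting)
        if curl -fsS -o /dev/null "$prom_url/-/ready"; then
            echo "Prometheus is ready"
            return 0
        fi
        sleep 1
        i=$((i + 1))
    done
    echo "Prometheus did not become ready after $attempts attempts" >&2
    return 1
}

# Example:
#   wait_for_prometheus http://localhost:9090 15
```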

It could also be deployed as a Docker Container (just for reference):

sudo docker run --name prometheus -d -p 127.0.0.1:9090:9090 prom/prometheus

INSTALLING NODE_EXPORTER ON EACH OF THE MONITORED HOSTS

sudo apt install prometheus-node-exporter -y
sudo systemctl start prometheus-node-exporter
sudo systemctl enable prometheus-node-exporter
sudo systemctl status prometheus-node-exporter
sudo ufw allow from 192.168.1.162 to any port 9100

Note: replace the IP 192.168.1.162 with the IP of the server where Prometheus is running.
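
To confirm that the exporter is actually serving metrics, you can read a single value from its /metrics endpoint. This is a small sketch, assuming the default port 9100; metric_value is a hypothetical helper name, and it only handles metrics without labels:

```shell
# Print the value of an un-labelled metric from the Prometheus text
# exposition format read on stdin. Lines starting with '#' are HELP/TYPE
# comments and never match because their first field is '#'.
metric_value() {
    awk -v name="$1" '$1 == name { print $2; exit }'
}

# Example (run on a monitored host):
#   curl -s http://localhost:9100/metrics | metric_value node_load1
```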

CONFIGURING PROMETHEUS TO REACH THE MONITORED HOSTS

sudo nano /etc/prometheus/prometheus.yml

Example of configuration:

global:
  scrape_interval: 1s
  evaluation_interval: 1s
alerting:
  alertmanagers:
  - static_configs:
    - targets: ['192.168.1.162:9093']
scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 1s
    scrape_timeout: 1s
    static_configs:
      - targets: ['192.168.1.162:9090']
  - job_name: 'nodes'
    scrape_interval: 1s
    scrape_timeout: 1s
    static_configs:
      - targets: ['192.168.1.162:9100', '192.168.1.163:9100', '192.168.1.164:9100']

Note: I recommend you use names instead of IPs. You can configure the translation in the file /etc/hosts.

Reload the configuration using the API:

curl -X POST http://localhost:9090/-/reload
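
A broken configuration file is rejected on reload, so it can be worth validating it first. The sketch below assumes that promtool (normally shipped alongside Prometheus) is available in the PATH and that the lifecycle API is enabled; safe_reload is a name I made up for illustration:

```shell
# Validate the configuration with promtool and reload only if it passes.
# Assumes promtool is installed and --web.enable-lifecycle is set.
safe_reload() {
    config="${1:-/etc/prometheus/prometheus.yml}"
    prom_url="${2:-http://localhost:9090}"
    # promtool exits non-zero when the file has syntax or semantic errors
    if ! promtool check config "$config"; then
        echo "Configuration is invalid, not reloading" >&2
        return 1
    fi
    curl -fsS -X POST "$prom_url/-/reload" && echo "Reloaded"
}

# Example:
#   safe_reload /etc/prometheus/prometheus.yml
```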

Open the Prometheus web UI in a browser at http://192.168.1.162:9090/ and navigate to Status > Targets to confirm that every target is up.
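
The same target information is exposed by the HTTP API, which is handy on a server without a browser. Here is a rough sketch that counts the targets whose health is not "up"; it assumes the compact JSON that Prometheus returns from /api/v1/targets and uses plain grep and awk, so a proper JSON parser such as jq would be more robust:

```shell
# Read the JSON of /api/v1/targets on stdin and print how many active
# targets report a health other than "up" (i.e. "down" or "unknown").
count_unhealthy_targets() {
    grep -o '"health":"[a-z]*"' |
        awk '$0 != "\"health\":\"up\"" { n++ } END { print n + 0 }'
}

# Example:
#   curl -s http://localhost:9090/api/v1/targets | count_unhealthy_targets
```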


INSTALLING GRAFANA

Grafana is a powerful tool to visualize data over time. It is the most popular dashboard viewer for Prometheus, but it is not limited to it [Link].

sudo apt-get install -y apt-transport-https
sudo apt-get install -y software-properties-common wget
curl -fsSL "https://keyserver.ubuntu.com/pks/lookup?op=get&search=0x963FA27710458545" | gpg --dearmor | sudo tee /usr/share/keyrings/repository-keyring.gpg >/dev/null
echo "deb [signed-by=/usr/share/keyrings/repository-keyring.gpg] https://apt.grafana.com stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list
sudo apt-get update
sudo apt-get install grafana -y
sudo systemctl start grafana-server
sudo systemctl enable grafana-server
sudo systemctl status grafana-server
sudo ufw allow 3000

Access the Web UI at http://192.168.1.162:3000/ and change the default password (admin:admin) immediately.

Back on the Home page, import a dashboard from the online repository [Link].

Import the dashboard 1860 and select the data source as Prometheus.

Repeat the import procedure with the number 405.
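
The import assumes that a Prometheus data source already exists in Grafana. If you prefer not to create it through the web UI, it can be provisioned from a file. The following is a sketch under the assumption that Grafana was installed from the package and reads provisioning files from /etc/grafana/provisioning/datasources; the function name and the file name prometheus.yaml are arbitrary:

```shell
# Write a Grafana data source provisioning file that points to Prometheus.
# Directory and URL are assumptions; pass your own values as arguments.
write_prometheus_datasource() {
    dir="${1:-/etc/grafana/provisioning/datasources}"
    url="${2:-http://localhost:9090}"
    cat > "$dir/prometheus.yaml" <<EOF
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: $url
    isDefault: true
EOF
}

# Example (run as root, then restart grafana-server to load the file):
#   write_prometheus_datasource /etc/grafana/provisioning/datasources http://192.168.1.162:9090
```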

Go to the dashboard, then navigate and customize your imported view.


REFLECTION

Keep monitoring the storage usage on /var/lib/prometheus/metrics2/ until you define the retention time and scraping frequency that best suit your needs.

For reference, my setup consists of the server where Prometheus runs and two other servers. Prometheus scrapes 4 targets every second and accumulates about 800 MB of data every 24 hours.
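
As a back-of-the-envelope check, the disk space needed is roughly the daily growth multiplied by the retention period. The sketch below uses the figure of about 800 MB per day from my setup; it is a linear estimate only and ignores compaction and any change in the number of targets:

```shell
# Rough linear estimate: daily growth (MB) times retention (days).
estimate_storage_mb() {
    daily_mb="$1"
    retention_days="$2"
    echo $((daily_mb * retention_days))
}

# About 800 MB/day kept for the default 15 days
estimate_storage_mb 800 15   # prints 12000

# Compare the estimate with the real usage from time to time:
#   sudo du -sh /var/lib/prometheus/metrics2/
```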

The email alerts would look much nicer with the image renderer:

sudo grafana-cli plugins install grafana-image-renderer

The SMTP configuration can be found at:

sudo nano /etc/grafana/grafana.ini
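
The file is long and mostly commented out, so it helps to print only the active lines of one section before editing it. This is a small sketch; show_ini_section is a hypothetical helper, and it assumes that comments start with ';' or '#' at the beginning of the line:

```shell
# Print the non-comment, non-empty lines of one section of an INI file.
show_ini_section() {
    section="$1"
    file="${2:-/etc/grafana/grafana.ini}"
    awk -v sec="[$section]" '
        /^\[/ { in_sec = ($0 == sec) }   # every header switches the state
        in_sec && !/^[;#]/ && NF          # keep active lines of the section
    ' "$file"
}

# Example:
#   sudo sh -c '. ./helpers.sh; show_ini_section smtp'   # helpers.sh is hypothetical
```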

It is always recommended to restart Grafana after making changes to its configurations or installing plugins:

sudo systemctl restart grafana-server

To use AWS CloudWatch as a data source for Grafana, follow the steps below, adjusting them to your environment:

IAM > Policies > Add service: CloudWatch > Allow for: ListMetrics, GetMetricData, GetMetricStatistics.
IAM > Roles > AWS Service > EC2 > Select the Policy to the Role.
IAM > Users > Add User > Attach Existing Policy > Select the Policy to the User > Get AccessKey and SecretKey.
EC2 > Select the Instance > Actions > Instance Settings > Attach/Replace IAM Role.

Then use the created AccessKey and SecretKey to add the data source using the Grafana web application.
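
For those who prefer the command line over the console, the policy from the first step can be written as a JSON document. The following is a sketch that allows only the three actions listed above; the function name and the policy name GrafanaCloudWatchRead are hypothetical, and "Resource": "*" is my assumption:

```shell
# Print an IAM policy document that allows the three CloudWatch read
# actions mentioned in the steps above.
cloudwatch_policy_json() {
    cat <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:ListMetrics",
        "cloudwatch:GetMetricData",
        "cloudwatch:GetMetricStatistics"
      ],
      "Resource": "*"
    }
  ]
}
EOF
}

# Example with the AWS CLI:
#   cloudwatch_policy_json > policy.json
#   aws iam create-policy --policy-name GrafanaCloudWatchRead --policy-document file://policy.json
```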


BONUS

For monitoring Databases use Percona Monitoring and Management (PMM) [Link].

Server

curl -fsSL https://www.percona.com/get/pmm | /bin/bash

Client

wget https://repo.percona.com/apt/percona-release_latest.$(lsb_release -sc)_all.deb
sudo dpkg -i percona-release_latest.$(lsb_release -sc)_all.deb
sudo apt-get update
sudo apt-get install pmm2-client

Connect the Client to the Server

sudo pmm-admin config --server-insecure-tls --server-url=https://admin:strong_password@pmm-server.local

Create an account in the DB for acquiring metrics from the DB engine (example for MySQL 8):

CREATE USER 'pmm'@'localhost' IDENTIFIED BY 'strong_password' WITH MAX_USER_CONNECTIONS 10;
GRANT SELECT, PROCESS, SUPER, REPLICATION CLIENT, RELOAD, BACKUP_ADMIN ON *.* TO 'pmm'@'localhost';

BONUS OF THE BONUS

For monitoring the uptime of different service types (e.g., HTTP(S), DNS, Ping, and TCP), there is an amazing tool called Uptime Kuma [Link].

Docker Compose Deployment

apt update && apt-get upgrade -y
apt install docker.io docker-compose -y
mkdir uptime-kuma && cd uptime-kuma
curl -o compose.yaml https://raw.githubusercontent.com/louislam/uptime-kuma/master/compose.yaml
docker-compose up -d

Docker Deployment (the lazy way)

sudo docker run -d --restart=always --network=host -v uptime-kuma:/app/data --name uptime-kuma louislam/uptime-kuma:base3

Ubuntu Installation

git clone https://github.com/louislam/uptime-kuma.git
cd uptime-kuma
sudo apt install npm -y
sudo npm run setup
sudo npm install pm2 -g && sudo pm2 install pm2-logrotate
sudo pm2 start server/server.js --name uptime-kuma
sudo pm2 save && sudo pm2 startup

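To run it behind a reverse proxy with TLS termination and automated Let’s Encrypt certificates, install Nginx and Certbot, replace the default site with the server block below, then reload Nginx and request the certificate: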

sudo apt install nginx certbot python3-certbot-nginx -y
sudo nano /etc/nginx/sites-enabled/default
server {
   listen 80 default_server;
   listen [::]:80 default_server;
   root /var/www/html;
   index index.html;
   server_name _;
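   location / {
      proxy_pass http://127.0.0.1:3001;
      proxy_http_version 1.1;
      proxy_set_header Host $host;
      # Uptime Kuma relies on WebSocket, so forward the upgrade headers
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection "upgrade";
   }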
}
sudo nginx -t && sudo nginx -s reload
sudo certbot --nginx --non-interactive --agree-tos --redirect --email [email protected] -d uptime.domain.com

If you need to monitor external or public services to identify whether they are experiencing an outage that affects your monitoring, check out Status Shield [Link].


MOREOVER

For monitoring cron jobs or background tasks that could silently fail, try healthchecks.io [Link].

The following docker-compose.yaml takes care of the whole deployment and configuration.

version: "3"
services:
  healthchecks:
    image: healthchecks/healthchecks:latest
    container_name: healthchecks
    environment:
      - DB=sqlite
      - DB_NAME=/data/hc.sqlite
      - DEBUG=False
      - [email protected]
      - EMAIL_HOST=smtp.example.com
      - [email protected]
      - EMAIL_HOST_PASSWORD=strong_password

###### For STARTTLS - Explicit TLS
      - EMAIL_PORT=587
      - EMAIL_USE_TLS=True

###### For SMTPS - Implicit TLS (deprecated)
#      - EMAIL_PORT=465
#      - EMAIL_USE_TLS=False
#      - EMAIL_USE_SSL=True

      - SECRET_KEY=f7993db8-1004-414b-be03-ad4a5d5153a3
      - ALLOWED_HOSTS=healthchecks.simnet.cloud
      - SITE_ROOT=https://healthchecks.simnet.cloud/
      - REGISTRATION_OPEN=False
    ports:
      - 8000:8000
    volumes:
      - healthchecks-data:/data
    restart: unless-stopped
volumes:
    healthchecks-data:

It is also recommended to keep it behind a reverse proxy that terminates TLS.
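
Once the instance is up, a cron job reports in by requesting its ping URL when it finishes, and the same URL with /fail appended when it fails. Below is a minimal sketch of such a wrapper; run_with_ping is a name I made up, and the URL in the example is hypothetical (copy the real ping URL of your check from the web UI):

```shell
# Run a command, then signal success or failure to a healthchecks ping URL.
# Usage: run_with_ping <ping_url> <command> [args...]
run_with_ping() {
    ping_url="$1"
    shift
    if "$@"; then
        # Job succeeded: a plain request to the ping URL marks the check as up
        curl -fsS -m 10 --retry 5 -o /dev/null "$ping_url"
    else
        status=$?
        # Job failed: the /fail suffix reports the failure right away
        curl -fsS -m 10 --retry 5 -o /dev/null "$ping_url/fail"
        return "$status"
    fi
}

# Example:
#   run_with_ping https://healthchecks.example.com/ping/your-check-uuid /usr/local/bin/backup.sh
```

The wrapper returns the exit status of the job itself when it fails, so cron mail and other tooling that depend on it keep working.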


See also Grafana Alloy and Grafana Loki for combining logs with metrics and visualizing and drilling down with Grafana Dashboards.

  • Grafana Alloy [Link] scrapes log files and sends them to Grafana Loki.
  • Grafana Loki [Link] indexes only the log labels (metadata), not the full log content, and stores the data efficiently.