Date:      Sun, 11 Dec 2016 20:35:32 +0100
From:      Ed Schouten <ed@nuxi.nl>
To:        hackers@freebsd.org
Subject:   Sysctl as a Service, or: making sysctl(3) more friendly for monitoring systems
Message-ID:  <CABh_MKk87hJTsu1ETX8Ffq9E8gqRPELeSEKzf1jKk_wwUROgAw@mail.gmail.com>

Hi there,

The last couple of months I've been playing around with a monitoring
system called Prometheus (https://prometheus.io/). In short,
Prometheus works like this:

----- If you already know Prometheus, skip this -----

1. For the thing you want to monitor, you either integrate the
Prometheus client library into your codebase or you write a separate
exporter process. The client library or the exporter process then
exposes key metrics of your application over HTTP. Simplified example:

$ curl http://localhost:12345/metrics
# HELP open_file_descriptors The number of files opened by the process
# TYPE open_file_descriptors gauge
open_file_descriptors 12
# HELP http_requests The number of HTTP requests received.
# TYPE http_requests counter
http_requests{result="2xx"} 100
http_requests{result="4xx"} 14
http_requests{result="5xx"} 0

2. You fire up Prometheus and configure it to scrape and store all of
the things you want to monitor. Prometheus can then add more labels to
the metrics it scrapes. So the example above may get transformed by
Prometheus to look like this:

open_file_descriptors{job="nginx",instance="web1.mycompany.com"} 12
http_requests{job="nginx",instance="web1.mycompany.com",result="2xx"} 100
http_requests{job="nginx",instance="web1.mycompany.com",result="4xx"} 14
http_requests{job="nginx",instance="web1.mycompany.com",result="5xx"} 0

Fun fact: Prometheus can also scrape Prometheus, so if you operate
multiple datacenters, you can let a global instance scrape a per-DC
instance and add a dc="..." label to all metrics.
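
So the first sample from the example above might end up in the global
instance looking like this (the dc value is of course illustrative):

open_file_descriptors{dc="ams1",job="nginx",instance="web1.mycompany.com"} 12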

3. After scraping data for some time, you can do fancy queries like these:

- Compute the 5-minute rate of HTTP requests per server and per HTTP result code:
rate(http_requests[5m])

- Compute the 5-minute rate of all HTTP requests on the entire cluster:
sum(rate(http_requests[5m]))

- Same as the above, but aggregated by HTTP result code:
sum(rate(http_requests[5m])) by (result)

Prometheus can do alerting as well, using expressions like these as alert conditions.

4. Set up Grafana and voila: you can create fancy dashboards!

----- If you skipped the introduction, start reading here -----

The Prometheus folks have developed a tool called the node_exporter
(https://github.com/prometheus/node_exporter). Basically, it extracts
a whole bunch of interesting system-level metrics (disk usage, network
I/O, etc.) by calling sysctl(3), invoking ioctl(2), parsing /proc
files, and so on, and exposes that information in Prometheus'
exposition format.
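
To give an idea of what such a collector boils down to on FreeBSD,
here is a minimal sketch (not the node_exporter's actual code) that
reads a single integer sysctl and prints it in the exposition format
shown above:

#include <sys/types.h>
#include <sys/sysctl.h>

#include <stdio.h>

int
main(void)
{
	int openfiles;
	size_t len = sizeof(openfiles);

	/* Fetch kern.openfiles, a read-only integer sysctl. */
	if (sysctlbyname("kern.openfiles", &openfiles, &len, NULL, 0) == -1) {
		perror("sysctlbyname");
		return (1);
	}
	/* Print it in Prometheus' text format. */
	printf("# HELP sysctl_kern_openfiles System-wide number of open files\n");
	printf("# TYPE sysctl_kern_openfiles gauge\n");
	printf("sysctl_kern_openfiles %d\n", openfiles);
	return (0);
}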

The other day I was thinking: in a certain way, the node exporter is a
bit of a redundant tool on the BSDs. Instead of needing to write
custom collectors for every kernel subsystem, we could write a generic
exporter for converting the entire sysctl(3) tree to Prometheus
metrics, which is exactly what I'm experimenting with here:

https://github.com/EdSchouten/prometheus_sysctl_exporter

An example of what this tool's output looks like:

$ ./prometheus_sysctl_exporter
...
# HELP sysctl_kern_maxfiles Maximum number of files
sysctl_kern_maxfiles 1043382
# HELP sysctl_kern_openfiles System-wide number of open files
sysctl_kern_openfiles 316
...

You could use this to write alerting rules such as:

ALERT FileDescriptorUsageHigh
  IF sysctl_kern_openfiles / sysctl_kern_maxfiles > 0.5
  FOR 15m
  ANNOTATIONS {
    description = "More than half of all FDs are in use!",
  }

There you go: access to a very large number of metrics without too much effort.

My main question here is: is anyone here interested in seeing
something like this developed into something usable? If so, let me
know and I'll pursue it further.

I also have a couple of technical questions related to sysctl(3)'s
in-kernel design:

- Prometheus differentiates between gauges (memory usage), counters
(number of HTTP requests), histograms (per-RPC latency stats), etc.,
while sysctl(3) makes no such distinction. It would be nice to have
that information on a per-sysctl basis. Mind if I add CTLFLAG_GAUGE,
CTLFLAG_COUNTER, etc.?
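
To make that concrete, something along these lines (entirely
hypothetical: the flag values are made up and would have to be bits
that are actually unused in <sys/sysctl.h>, and the OID is invented):

/* Hypothetical new flags describing the metric type of an OID. */
#define	CTLFLAG_GAUGE	0x00000800	/* value may go up and down */
#define	CTLFLAG_COUNTER	0x00000400	/* value only ever increases */

static u_long ipc_msgs_sent;
SYSCTL_ULONG(_kern, OID_AUTO, ipc_msgs_sent,
    CTLFLAG_RD | CTLFLAG_COUNTER, &ipc_msgs_sent, 0,
    "Total number of IPC messages sent");

An exporter could then emit a matching "# TYPE ... counter" line
instead of having to guess.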

- Semantically sysctl(3) and Prometheus are slightly different.
Consider this sysctl:

hw.acpi.thermal.tz0.temperature: 27.8C

My tool currently converts this metric's name to
sysctl_hw_acpi_thermal_tz0_temperature. This is suboptimal, as it
would ideally be called
sysctl_hw_acpi_thermal_temperature{sensor="tz0"}. Otherwise you
wouldn't be able to write generic alerting rules, use aggregation in
queries, etc.
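
(The flat translation itself is trivial, by the way; roughly this,
with the function name being mine:)

#include <stdio.h>

/*
 * Turn a sysctl name such as "kern.openfiles" into a flat
 * Prometheus metric name such as "sysctl_kern_openfiles".
 */
static void
print_metric_name(const char *oid)
{

	fputs("sysctl_", stdout);
	for (; *oid != '\0'; ++oid)
		putchar(*oid == '.' ? '_' : *oid);
}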

I was thinking: we could do such a translation quite easily by
attaching labels to SYSCTL_NODE objects. For example, the
hw.acpi.thermal node would get the label "sensor". Any OID placed
underneath that node would then no longer contribute a component to
the metric name, but would instead become the value of that label.
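
Hypothetically, something like this (SYSCTL_NODE_WITH_LABEL is a macro
I just made up; the extra argument names the label whose values the
node's children provide):

/*
 * Hypothetical: children of hw.acpi.thermal are instances
 * identified by the "sensor" label, not name components.
 */
SYSCTL_NODE_WITH_LABEL(_hw_acpi, OID_AUTO, thermal, CTLFLAG_RD, NULL,
    "ACPI thermal zones", "sensor");

The exporter would then render hw.acpi.thermal.tz0.temperature as
sysctl_hw_acpi_thermal_temperature{sensor="tz0"} instead of encoding
tz0 into the metric name. Thoughts?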

A final remark: a concern might be that changes like these are not
generic, but apply only to Prometheus. I tend to disagree. First of
all, an advantage of Prometheus is that the coupling is very loose:
it's just a GET request returning key-value pairs over HTTP. Anyone
is free to write their own implementation.

Second, emaste@ also pointed me to another monitoring framework being
developed by Intel right now:

https://github.com/intelsdi-x/snap

The changes I'm proposing would also seem to make exporting sysctl
data to that system easier.

Anyway, thanks for reading this huge wall of text.

Best regards,
-- 
Ed Schouten <ed@nuxi.nl>
Nuxi, 's-Hertogenbosch, the Netherlands
KvK-nr.: 62051717


