From owner-freebsd-hackers@freebsd.org Sun Dec 11 19:36:04 2016
From: Ed Schouten <ed@nuxi.nl>
Date: Sun, 11 Dec 2016 20:35:32 +0100
Subject: Sysctl as a Service, or: making sysctl(3) more friendly for monitoring systems
To: hackers@freebsd.org
List-Id: Technical Discussions relating to FreeBSD

Hi there,

The last couple of months I've been playing around with a monitoring
system called Prometheus (https://prometheus.io/). In short, Prometheus
works like this:

----- If you already know Prometheus, skip this -----

1. For the thing you want to monitor, you either integrate the
Prometheus client library into your codebase or you write a separate
exporter process. The client library or the exporter process then
exposes key metrics of your application over HTTP.
Simplified example:

  $ curl http://localhost:12345/metrics
  # HELP open_file_descriptors The number of files opened by the process
  # TYPE open_file_descriptors gauge
  open_file_descriptors 12
  # HELP http_requests The number of HTTP requests received.
  # TYPE http_requests counter
  http_requests{result="2xx"} 100
  http_requests{result="4xx"} 14
  http_requests{result="5xx"} 0

2. You fire up Prometheus and configure it to scrape and store all of
the things you want to monitor. Prometheus can then add more labels to
the metrics it scrapes, so the example above may get transformed by
Prometheus to look like this:

  open_file_descriptors{job="nginx",instance="web1.mycompany.com"} 12
  http_requests{job="nginx",instance="web1.mycompany.com",result="2xx"} 100
  http_requests{job="nginx",instance="web1.mycompany.com",result="4xx"} 14
  http_requests{job="nginx",instance="web1.mycompany.com",result="5xx"} 0

Fun fact: Prometheus can also scrape Prometheus, so if you operate
multiple datacenters, you can let a global instance scrape a per-DC
instance and add a dc="..." label to all metrics.

3. After scraping data for some time, you can do fancy queries like
these:

- Compute the 5-minute rate of HTTP requests per server and per HTTP
  error code:

    rate(http_requests[5m])

- Compute the 5-minute rate of all HTTP requests on the entire cluster:

    sum(rate(http_requests[5m]))

- Same as the above, but aggregated by HTTP error code:

    sum(rate(http_requests[5m])) by (result)

Prometheus can do alerting as well by using these expressions as
matchers.

4. Set up Grafana and voila: you can create fancy dashboards!

----- If you skipped the introduction, start reading here -----

The Prometheus folks have developed a tool called the node_exporter
(https://github.com/prometheus/node_exporter). Basically, it extracts a
whole bunch of interesting system-related metrics (disk usage, network
I/O, etc.) by calling sysctl(3), invoking ioctl(2), parsing /proc
files, etc., and exposes that information using Prometheus' syntax.
The other day I was thinking: in a certain way, the node exporter is a
bit of a redundant tool on the BSDs. Instead of needing to write custom
collectors for every kernel subsystem, we could write a generic
exporter that converts the entire sysctl(3) tree to Prometheus metrics,
which is exactly what I'm experimenting with here:

https://github.com/EdSchouten/prometheus_sysctl_exporter

An example of what this tool's output looks like:

  $ ./prometheus_sysctl_exporter
  ...
  # HELP sysctl_kern_maxfiles Maximum number of files
  sysctl_kern_maxfiles 1043382
  # HELP sysctl_kern_openfiles System-wide number of open files
  sysctl_kern_openfiles 316
  ...

You could use this to write alerting rules like this:

  ALERT FileDescriptorUsageHigh
    IF sysctl_kern_openfiles / sysctl_kern_maxfiles > 0.5
    FOR 15m
    ANNOTATIONS {
      description = "More than half of all FDs are in use!",
    }

There you go: access to a very large number of metrics without too much
effort.

My main question here is: are there any people in here interested in
seeing something like this being developed into something usable? If
so, let me know and I'll pursue this further.

I also have a couple of technical questions related to sysctl(3)'s
in-kernel design:

- Prometheus differentiates between gauges (memory usage), counters
  (number of HTTP requests), histograms (per-RPC latency stats), etc.,
  while sysctl(3) does not. It would be nice if we could have that info
  on a per-sysctl basis. Mind if I add a CTLFLAG_GAUGE,
  CTLFLAG_COUNTER, etc.?

- Semantically, sysctl(3) and Prometheus are slightly different.
  Consider this sysctl:

    hw.acpi.thermal.tz0.temperature: 27.8C

  My tool currently converts this metric's name to
  sysctl_hw_acpi_thermal_tz0_temperature. This is suboptimal, as it
  would ideally be called sysctl_hw_acpi_thermal_temperature{sensor="tz0"}.
  Otherwise you wouldn't be able to write generic alerting rules, use
  aggregation in queries, etc. I was thinking: we could quite easily do
  such a translation by attaching labels to SYSCTL_NODE objects.
As in, the hw.acpi.thermal node would get a label "sensor". Any OID
placed underneath this node would then not become an infix of the
metric name, but the value of that label instead. Thoughts?

A final remark I want to make: a concern might be that changes like
these would not be generic, but only apply to Prometheus. I tend to
disagree. First of all, an advantage of Prometheus is that the coupling
is very loose: it's just a GET request returning key-value pairs.
Anyone is free to add their own implementation. Second, emaste@ also
pointed me to another monitoring framework being developed by Intel
right now:

https://github.com/intelsdi-x/snap

The changes I'm proposing would also make exporting sysctl data to that
system easier.

Anyway, thanks for reading this huge wall of text.

Best regards,
-- 
Ed Schouten <ed@nuxi.nl>
Nuxi, 's-Hertogenbosch, the Netherlands
KvK-nr.: 62051717