2018-06-13 14:47:30 +04:00
|
|
|
# ADR 011: Monitoring
|
|
|
|
|
|
|
|
## Changelog
|
|
|
|
|
|
|
|
08-06-2018: Initial draft
|
|
|
|
11-06-2018: Reorg after @xla comments
|
|
|
|
13-06-2018: Clarification about usage of labels
|
|
|
|
|
|
|
|
## Context
|
|
|
|
|
|
|
|
In order to bring more visibility into Tendermint, we would like it to report
|
|
|
|
metrics and, maybe later, traces of transactions and RPC queries. See
|
|
|
|
https://github.com/tendermint/tendermint/issues/986.
|
|
|
|
|
|
|
|
A few solutions were considered:
|
|
|
|
|
|
|
|
1. [Prometheus](https://prometheus.io)
|
2018-08-27 15:33:46 +08:00
|
|
|
a) Prometheus API
|
|
|
|
b) [go-kit metrics package](https://github.com/go-kit/kit/tree/master/metrics) as an interface plus Prometheus
|
|
|
|
c) [telegraf](https://github.com/influxdata/telegraf)
|
|
|
|
d) new service, which will listen to events emitted by pubsub and report metrics
|
|
|
|
2. [OpenCensus](https://opencensus.io/go/index.html)
|
2018-06-13 14:47:30 +04:00
|
|
|
|
|
|
|
### 1. Prometheus
|
|
|
|
|
|
|
|
Prometheus seems to be the most popular product out there for monitoring. It has
|
|
|
|
a Go client library, powerful queries, alerts.
|
|
|
|
|
|
|
|
**a) Prometheus API**
|
|
|
|
|
|
|
|
We can commit to using Prometheus in Tendermint, but I think Tendermint users
|
|
|
|
should be free to choose whatever monitoring tool they feel will better suit
|
|
|
|
their needs (if they don't have existing one already). So we should try to
|
|
|
|
abstract interface enough so people can switch between Prometheus and other
|
|
|
|
similar tools.
|
|
|
|
|
|
|
|
**b) go-kit metrics package as an interface**
|
|
|
|
|
|
|
|
metrics package provides a set of uniform interfaces for service
|
|
|
|
instrumentation and offers adapters to popular metrics packages:
|
|
|
|
|
|
|
|
https://godoc.org/github.com/go-kit/kit/metrics#pkg-subdirectories
|
|
|
|
|
|
|
|
Comparing to Prometheus API, we're losing customisability and control, but gaining
|
|
|
|
freedom in choosing any instrument from the above list given we will extract
|
|
|
|
metrics creation into a separate function (see "providers" in node/node.go).
|
|
|
|
|
|
|
|
**c) telegraf**
|
|
|
|
|
|
|
|
Unlike already discussed options, telegraf does not require modifying Tendermint
|
|
|
|
source code. You create something called an input plugin, which polls
|
|
|
|
Tendermint RPC every second and calculates the metrics itself.
|
|
|
|
|
|
|
|
While it may sound good, but some metrics we want to report are not exposed via
|
|
|
|
RPC or pubsub, therefore can't be accessed externally.
|
|
|
|
|
|
|
|
**d) service, listening to pubsub**
|
|
|
|
|
|
|
|
Same issue as the above.
|
|
|
|
|
|
|
|
### 2. opencensus
|
|
|
|
|
|
|
|
opencensus provides both metrics and tracing, which may be important in the
|
|
|
|
future. It's API looks different from go-kit and Prometheus, but looks like it
|
|
|
|
covers everything we need.
|
|
|
|
|
|
|
|
Unfortunately, OpenCensus go client does not define any
|
|
|
|
interfaces, so if we want to abstract away metrics we
|
|
|
|
will need to write interfaces ourselves.
|
|
|
|
|
|
|
|
### List of metrics
|
|
|
|
|
2018-08-27 15:33:46 +08:00
|
|
|
| | Name | Type | Description |
|
|
|
|
| --- | ------------------------------------ | ------ | ----------------------------------------------------------------------------- |
|
|
|
|
| A | consensus_height | Gauge | |
|
|
|
|
| A | consensus_validators | Gauge | Number of validators who signed |
|
|
|
|
| A | consensus_validators_power | Gauge | Total voting power of all validators |
|
|
|
|
| A | consensus_missing_validators | Gauge | Number of validators who did not sign |
|
|
|
|
| A | consensus_missing_validators_power | Gauge | Total voting power of the missing validators |
|
|
|
|
| A | consensus_byzantine_validators | Gauge | Number of validators who tried to double sign |
|
|
|
|
| A | consensus_byzantine_validators_power | Gauge | Total voting power of the byzantine validators |
|
|
|
|
| A | consensus_block_interval | Timing | Time between this and last block (Block.Header.Time) |
|
|
|
|
| | consensus_block_time | Timing | Time to create a block (from creating a proposal to commit) |
|
|
|
|
| | consensus_time_between_blocks | Timing | Time between committing last block and (receiving proposal creating proposal) |
|
|
|
|
| A | consensus_rounds | Gauge | Number of rounds |
|
|
|
|
| | consensus_prevotes | Gauge | |
|
|
|
|
| | consensus_precommits | Gauge | |
|
|
|
|
| | consensus_prevotes_total_power | Gauge | |
|
|
|
|
| | consensus_precommits_total_power | Gauge | |
|
|
|
|
| A | consensus_num_txs | Gauge | |
|
|
|
|
| A | mempool_size | Gauge | |
|
|
|
|
| A | consensus_total_txs | Gauge | |
|
|
|
|
| A | consensus_block_size | Gauge | In bytes |
|
|
|
|
| A | p2p_peers | Gauge | Number of peers node's connected to |
|
|
|
|
|
|
|
|
`A` - will be implemented in the fist place.
|
2018-06-13 14:47:30 +04:00
|
|
|
|
|
|
|
**Proposed solution**
|
|
|
|
|
|
|
|
## Status
|
|
|
|
|
2018-06-15 17:19:09 +04:00
|
|
|
Proposed.
|
|
|
|
|
2018-06-13 14:47:30 +04:00
|
|
|
## Consequences
|
|
|
|
|
|
|
|
### Positive
|
|
|
|
|
2018-06-15 17:19:09 +04:00
|
|
|
Better visibility, support of variety of monitoring backends
|
|
|
|
|
2018-06-13 14:47:30 +04:00
|
|
|
### Negative
|
|
|
|
|
2018-06-15 17:19:09 +04:00
|
|
|
One more library to audit, messing metrics reporting code with business domain.
|
|
|
|
|
2018-06-13 14:47:30 +04:00
|
|
|
### Neutral
|
2018-06-15 17:19:09 +04:00
|
|
|
|
|
|
|
-
|