tendermint/docs/specification/corruption.rst

Corruption
==========

Important step
--------------

Make sure you have a backup of the Tendermint data directory.

Possible causes
---------------

Remember that most corruption is caused by hardware issues:

- RAID controllers with faulty / worn out battery backup, and an unexpected power loss
- Hard disk drives with write-back cache enabled, and an unexpected power loss
- Cheap SSDs with insufficient power-loss protection, and an unexpected power-loss
- Defective RAM
- Defective or overheating CPU(s)

Other causes can be:

- Database systems configured with fsync=off and an OS crash or power loss
- Filesystems configured to use write barriers plus a storage layer that ignores write barriers. LVM is a particular culprit.
- Tendermint bugs
- Operating system bugs
- Admin error
  - directly modifying Tendermint data-directory contents

(Source: https://wiki.postgresql.org/wiki/Corruption)

WAL Corruption
--------------

If consensus WAL is corrupted at the lastest height and you are trying to start
Tendermint, replay will fail with panic.

Recovering from data corruption can be hard and time-consuming. Here are two approaches you can take:

1) Delete the WAL file and restart Tendermint. It will attempt to sync with other peers.
2) Try to repair the WAL file manually:

  1. Create a backup of the corrupted WAL file:

  .. code:: bash

      cp "$TMHOME/data/cs.wal/wal" > /tmp/corrupted_wal_backup

  2. Use ./scripts/wal2json to create a human-readable version

  .. code:: bash

      ./scripts/wal2json/wal2json "$TMHOME/data/cs.wal/wal" > /tmp/corrupted_wal

  3. Search for a "CORRUPTED MESSAGE" line.
  4. By looking at the previous message and the message after the corrupted one
       and looking at the logs, try to rebuild the message. If the consequent
       messages are marked as corrupted too (this may happen if length header
       got corrupted or some writes did not make it to the WAL ~ truncation),
       then remove all the lines starting from the corrupted one and restart
       Tendermint.

  .. code:: bash

      $EDITOR /tmp/corrupted_wal

  5. After editing, convert this file back into binary form by running:

  .. code:: bash

      ./scripts/json2wal/json2wal /tmp/corrupted_wal > "$TMHOME/data/cs.wal/wal"
briefly describe the recover process [ci skip] 2017-12-12 13:02:52 -06:00			`Corruption`
			`==========`

			`Important step`
			`--------------`

			`Make sure you have a backup of the Tendermint data directory.`

			`Possible causes`
			`---------------`

			`Remember that most corruption is caused by hardware issues:`

			`- RAID controllers with faulty / worn out battery backup, and an unexpected power loss`
			`- Hard disk drives with write-back cache enabled, and an unexpected power loss`
			`- Cheap SSDs with insufficient power-loss protection, and an unexpected power-loss`
			`- Defective RAM`
			`- Defective or overheating CPU(s)`

			`Other causes can be:`

			`- Database systems configured with fsync=off and an OS crash or power loss`
			`- Filesystems configured to use write barriers plus a storage layer that ignores write barriers. LVM is a particular culprit.`
			`- Tendermint bugs`
			`- Operating system bugs`
			`- Admin error`
			`- directly modifying Tendermint data-directory contents`

			`(Source: https://wiki.postgresql.org/wiki/Corruption)`

			`WAL Corruption`
			`--------------`

			`If consensus WAL is corrupted at the lastest height and you are trying to start`
			`Tendermint, replay will fail with panic.`

			`Recovering from data corruption can be hard and time-consuming. Here are two approaches you can take:`

			`1) Delete the WAL file and restart Tendermint. It will attempt to sync with other peers.`
			`2) Try to repair the WAL file manually:`
update docs/examples & terraform/ansible (#1534) * update docs/examples * ansible: add node IDs from docs/examples * better monikers * ansible: clearer paths * upgrade version * examples: updates * docs: consolidate terraform & ansible * remove deprecated info, small reorgs * docs build fix * docs: t&a critical commit * s/dummy/kvstore/g * terraform/DO region unavailable, persistent error can't be bothered to debug rn * terraform: need vars * networks: t&a standalone integration script for DO * t&a more updates * examples: add script that shows what the testnet command does * use AMS3, since AMS2 is not available 2018-05-17 10:05:59 -04:00
briefly describe the recover process [ci skip] 2017-12-12 13:02:52 -06:00			`1. Create a backup of the corrupted WAL file:`

			`.. code:: bash`

			`cp "$TMHOME/data/cs.wal/wal" > /tmp/corrupted_wal_backup`

			`2. Use ./scripts/wal2json to create a human-readable version`

			`.. code:: bash`

			`./scripts/wal2json/wal2json "$TMHOME/data/cs.wal/wal" > /tmp/corrupted_wal`

			`3. Search for a "CORRUPTED MESSAGE" line.`
explain what to do in case of truncation [ci skip] 2017-12-15 11:11:21 -06:00			`4. By looking at the previous message and the message after the corrupted one`
			`and looking at the logs, try to rebuild the message. If the consequent`
			`messages are marked as corrupted too (this may happen if length header`
			`got corrupted or some writes did not make it to the WAL ~ truncation),`
			`then remove all the lines starting from the corrupted one and restart`
			`Tendermint.`
briefly describe the recover process [ci skip] 2017-12-12 13:02:52 -06:00
			`.. code:: bash`

			`$EDITOR /tmp/corrupted_wal`

			`5. After editing, convert this file back into binary form by running:`

			`.. code:: bash`

			`./scripts/json2wal/json2wal /tmp/corrupted_wal > "$TMHOME/data/cs.wal/wal"`