31 Commits

Author SHA1 Message Date
Phil Salant
d1d517a9b7 linters: enable scopelint (#3963)
* Pin range scope vars

* Don't disable scopelint

This PR repairs linter errors seen when running the following commands:
golangci-lint run --no-config --disable-all=true --enable=scopelint

Contributes to #3262
2019-09-11 09:15:18 +04:00
Anton Kaliaev
ec9bff5234
rename WAL#Flush to WAL#FlushAndSync (#3345)
* rename WAL#Flush to WAL#FlushAndSync

- rename auto#Flush to auto#FlushAndSync
- cleanup WAL interface to not leak implementation details!
  * remove Group()
  * add WALReader interface and return it in SearchForEndHeight()
- add interface assertions

Refs #3337

* replace WALReader with io.ReadCloser
2019-02-25 09:11:07 +04:00
Anton Kaliaev
67fd428354
fix TestWALPeriodicSync (#3342)
The test was sometimes failing due to processFlushTicks being called too
early. The solution is to call wal#Start later in the test.
2019-02-22 11:49:08 +04:00
Thane Thomson
dff3deb2a9 cs: sync WAL more frequently (#3300)
As per #3043, this adds a ticker to sync the WAL every 2s while the WAL is running.

* Flush WAL every 2s

This adds a ticker that flushes the WAL every 2s while the WAL is
running. This is related to #3043.

* Fix spelling

* Increase timeout to 2mins for slower build environments

* Make WAL sync interval configurable

* Add TODO to replace testChan with more comprehensive testBus

* Remove extraneous debug statement

* Remove testChan in favour of using system time

As per
https://github.com/tendermint/tendermint/pull/3300#discussion_r255886586,
this removes the `testChan` WAL member and replaces the approach with a
system time-oriented one. In this new approach, we keep track of the
system time at which each flush and periodic flush successfully
occurred.

The naming of the various functions is also updated here to be more
consistent with "flushing" as opposed to "sync'ing".

* Update naming convention and ensure lock for timestamp update

* Add Flush method as part of WAL interface

Adds a `Flush` method as part of the WAL interface to enforce the idea
that we can manually trigger a WAL flush from outside of the WAL. This
is employed in the consensus state management to flush the WAL prior to
signing votes/proposals, as per https://github.com/tendermint/tendermint/issues/3043#issuecomment-453853630

* Update CHANGELOG_PENDING

* Remove mutex approach and replace with DI

The dependency injection approach to dealing with testing concerns could
allow similar effects to some kind of "testing bus"-based approach. This
commit introduces an example of this, where instead of relying on
(potentially fragile) timing of things between the code and the test, we
inject code into the function under test that can signal the test
through a channel.

This allows us to avoid the `time.Sleep()`-based approach previously
employed.

* Update comment on WAL flushing during vote signing

Co-Authored-By: thanethomson <connect@thanethomson.com>

* Simplify flush interval definition

Co-Authored-By: thanethomson <connect@thanethomson.com>

* Expand commentary on WAL disk flushing

Co-Authored-By: thanethomson <connect@thanethomson.com>

* Add broken test to illustrate WAL sync test problem

Removes test-related state (dependency injection code) from the WAL data
structure and adds test code to illustrate the problem with using
`WALGenerateNBlocks` and `wal.SearchForEndHeight` to test periodic
sync'ing.

* Fix test error messages

* Use WAL group buffer size to check for flush

A function is added to `libs/autofile/group.go#Group` in order to return
the size of the buffered data (i.e. data that has not yet been flushed
to disk). The test now checks that, prior to a `time.Sleep`, the group
buffer has data in it. After the `time.Sleep` (during which time the
periodic flush should have been called), the buffer should be empty.

* Remove config root dir removal from #3291

* Add godoc for NewWAL mentioning periodic sync
2019-02-20 09:45:18 +04:00
Alessio Treglia
59cc6d36c9 improve ResetTestRootWithChainID() concurrency safety (#3291)
* improve ResetTestRootWithChainID() concurrency safety

Rely on ioutil.TempDir() to create test root directories and ensure
multiple same-chain id test cases can run in parallel.

* Update config/toml.go

Co-Authored-By: alessio <quadrispro@ubuntu.com>

* clean up test directories after completion

Closes: #1034

* Remove redundant EnsureDir call

* s/PanicSafety()/panic()/s

* Put create dir functionality back in ResetTestRootWithChainID

* Place test directories in OS's tempdir

In modern UNIX and UNIX-like systems /tmp is very often
mounted as tmpfs. This might speed test execution a bit.

* Set 0700 to a const

* rootsDirs -> configRootDirs

* Don't double remove directories

* Avoid global variables

* Fix consensus tests

* Reduce defer stack

* Address review comments

* Try to fix tests

* Update CHANGELOG_PENDING.md

Co-Authored-By: alessio <quadrispro@ubuntu.com>

* Update consensus/common_test.go

Co-Authored-By: alessio <quadrispro@ubuntu.com>

* Update consensus/common_test.go

Co-Authored-By: alessio <quadrispro@ubuntu.com>
2019-02-18 11:45:27 +04:00
Anton Kaliaev
0b0a8b3128
cs/wal: refuse to encode msg that is bigger than maxMsgSizeBytes (#3303)
Earlier this week somebody posted this in GoS Riot chat:

```
E[2019-02-12|10:38:37.596] Corrupted entry. Skipping... module=consensus wal=/home/gaia/.gaiad/data/cs.wal/wal err="DataCorruptionError[length 878916964 exceeded maximum possible value of 1048576 bytes]"
E[2019-02-12|10:38:37.596] Corrupted entry. Skipping... module=consensus wal=/home/gaia/.gaiad/data/cs.wal/wal err="DataCorruptionError[length 825701731 exceeded maximum possible value of 1048576 bytes]"
E[2019-02-12|10:38:37.596] Corrupted entry. Skipping... module=consensus wal=/home/gaia/.gaiad/data/cs.wal/wal err="DataCorruptionError[length 1631073634 exceeded maximum possible value of 1048576 bytes]"
E[2019-02-12|10:38:37.596] Corrupted entry. Skipping... module=consensus wal=/home/gaia/.gaiad/data/cs.wal/wal err="DataCorruptionError[length 912418148 exceeded maximum possible value of 1048576 bytes]"
E[2019-02-12|10:38:37.600] Corrupted entry. Skipping... module=consensus wal=/home/gaia/.gaiad/data/cs.wal/wal err="DataCorruptionError[failed to read data: EOF]"
E[2019-02-12|10:38:37.600] Error on catchup replay. Proceeding to start ConsensusState anyway module=consensus err="Cannot replay height 7242. WAL does not contain #ENDHEIGHT for 7241"
E[2019-02-12|10:38:37.861] Error dialing peer module=p2p err="dial tcp 35.183.126.181:26656: i/o timeout
```

Note the length error messages. What has happened is the length field got corrupted probably. I've looked at the code and noticed that we don't check the msg size during encoding. This PR fixes that. It also improves a few error messages in WALDecoder.
2019-02-18 11:23:06 +04:00
Ethan Buchman
45b70ae031
fix non deterministic test failures and race in privval socket (#3258)
* node: decrease retry conn timeout in test

Should fix #3256

The retry timeout was set to the default, which is the same as the
accept timeout, so it's no wonder this would fail. Here we decrease the
retry timeout so we can try many times before the accept timeout.

* p2p: increase handshake timeout in test

This fails sometimes, presumably because the handshake timeout is so low
(only 50ms). So increase it to 1s. Should fix #3187

* privval: fix race with ping. closes #3237

Pings happen in a go-routine and can happen concurrently with other
messages. Since we use a request/response protocol, we expect to send a
request and get back the corresponding response. But with pings
happening concurrently, this assumption could be violated. We were using
a mutex, but only a RWMutex, where the RLock was being held for sending
messages - this was to allow the underlying connection to be replaced if
it fails. Turns out we actually need to use a full lock (not just a read
lock) to prevent multiple requests from happening concurrently.

* node: fix test name. DelayedStop -> DelayedStart

* autofile: Wait() method

In the TestWALTruncate in consensus/wal_test.go we remove the WAL
directory at the end of the test. However the wal.Stop() does not
properly wait for the autofile group to finish shutting down. Hence it
was possible that the group's go-routine is still running when the
cleanup happens, which causes a panic since the directory disappeared.
Here we add a Wait() method to properly wait until the go-routine exits
so we can safely clean up. This fixes #2852.
2019-02-06 10:24:43 -05:00
Anton Kaliaev
d470945503
update gometalinter to 3.0.0 (#3233)
in the attempt to fix https://circleci.com/gh/tendermint/tendermint/43165

also

    code is simplified by running gofmt -s .
    remove unused vars
    enable linters we're currently passing
    remove deprecated linters
2019-01-30 12:24:26 +04:00
Thane Thomson
a335caaedb alias amino imports (#3219)
As per conversation here: https://github.com/tendermint/tendermint/pull/3218#discussion_r251364041

This is the result of running the following code on the repo:

```bash
find . -name '*.go' | grep -v 'vendor/' | xargs -n 1 goimports -w
```
2019-01-28 16:13:17 +04:00
Anton Kaliaev
b487feba42
node: refactor privValidator ext client code & tests (#2895)
* update ConsensusState#OnStop comment

* consensus: set logger for WAL in tests

* refactor privValidator client code and tests

follow-up on https://github.com/tendermint/tendermint/pull/2866
2018-11-21 21:24:13 +04:00
goolAdapter
110b07fb3f libs: Call Flush() before rename #2428 (#2439)
* fix Group.RotateFile need call Flush() before rename. #2428
* fix some review issue. #2428
 refactor Group's config: replace  setting member with initial option
* fix a handwriting mistake
* fix a time window error between rename and write.
* fix a syntax mistake.
* change option name Get_ to With_
* fix review issue
* fix review issue
2018-09-25 13:22:45 +02:00
Zarko Milosevic
7b88172f41 Implement BFT time (#2203)
* Implement BFT time

* set LastValidators when creating state in state helper

for heights >= 2
2018-08-31 19:33:51 -04:00
Dev Ojha
2756be5a59 libs: Remove usage of custom Fmt, in favor of fmt.Sprintf (#2199)
* libs: Remove usage of custom Fmt, in favor of fmt.Sprintf

Closes #2193

* Fix bug that was masked by custom Fmt!
2018-08-10 09:25:57 +04:00
Ethan Buchman
d55243f0e6 fix import paths 2018-07-01 22:36:49 -04:00
Anton Kaliaev
e88f74bb9b
remove wal_light setting
Closes #1428
2018-04-11 10:08:03 +02:00
Jae Kwon
e4492afbad Merge 2018-04-05 08:17:10 -07:00
Ethan Buchman
799beebd36 fix consensus tests 2018-04-05 17:54:26 +03:00
Jae Kwon
45ec5fd170 WIP consensus 2018-04-05 07:05:45 -07:00
Ethan Buchman
1cb76625d3 consensus: rename test funcs 2018-01-19 00:59:09 -05:00
Anton Kaliaev
40f9261d48
handle data corruption errors
Refs #573
2017-12-11 19:48:20 -06:00
Anton Kaliaev
5cb936fa00
fixes after my own review 2017-12-06 18:28:14 -06:00
Anton Kaliaev
c6f025f40e
generate WAL on the fly (Refs #468) 2017-12-06 16:01:08 -06:00
Anton Kaliaev
922af7c405
int64 height
uint64 is considered dangerous. the details will follow in a blog post.
2017-12-01 19:04:53 -06:00
Anton Kaliaev
86af889dfb remove unnecessary casts (Refs #911) 2017-12-01 17:17:22 -05:00
Emmanuel Odeke
42da8cd297
consensus/WAL: benchmark WALDecode across data sizes 2017-11-23 12:43:11 -07:00
Ethan Buchman
57a684d5ac fixes from review 2017-10-25 21:54:56 -04:00
Anton Kaliaev
3115c23762
binary format for WAL 2017-10-23 22:27:24 +04:00
Jae Kwon
1788a68b1c Consensus WAL uses AutoFile/Group 2016-10-28 15:01:14 -07:00
Ethan Buchman
47acada2cb consensus: t.Fatal -> panic 2016-07-11 22:37:39 -04:00
Ethan Buchman
3891e4d66d config: cswal_light, mempool_broadcast, mempool_reap 2016-03-03 06:31:59 +00:00
Ethan Buchman
26f0e2bc2d msgLogFP -> write ahead log 2016-01-18 14:44:45 -05:00