68 Commits

Author SHA1 Message Date
Ethan Buchman
dc6567c677 consensus: flush wal on stop (#3297)
Should fix #3295
Also partial fix of #3043
2019-02-12 09:32:40 +04:00
Ethan Buchman
45b70ae031
fix non deterministic test failures and race in privval socket (#3258)
* node: decrease retry conn timeout in test

Should fix #3256

The retry timeout was set to the default, which is the same as the
accept timeout, so it's no wonder this would fail. Here we decrease the
retry timeout so we can try many times before the accept timeout.

* p2p: increase handshake timeout in test

This fails sometimes, presumably because the handshake timeout is so low
(only 50ms). So increase it to 1s. Should fix #3187

* privval: fix race with ping. closes #3237

Pings happen in a go-routine and can happen concurrently with other
messages. Since we use a request/response protocol, we expect to send a
request and get back the corresponding response. But with pings
happening concurrently, this assumption could be violated. We were using
a mutex, but only a RWMutex, where the RLock was being held for sending
messages - this was to allow the underlying connection to be replaced if
it fails. Turns out we actually need to use a full lock (not just a read
lock) to prevent multiple requests from happening concurrently.

* node: fix test name. DelayedStop -> DelayedStart

* autofile: Wait() method

In the TestWALTruncate in consensus/wal_test.go we remove the WAL
directory at the end of the test. However the wal.Stop() does not
properly wait for the autofile group to finish shutting down. Hence it
was possible that the group's go-routine is still running when the
cleanup happens, which causes a panic since the directory disappeared.
Here we add a Wait() method to properly wait until the go-routine exits
so we can safely clean up. This fixes #2852.
2019-02-06 10:24:43 -05:00
Ethan Buchman
39eba4e154
WAL: better errors and new fail point (#3246)
* privval: more info in errors

* wal: change Debug logs to Info

* wal: log and return error on corrupted wal instead of panicing

* fail: Exit right away instead of sending interupt

* consensus: FAIL before handling our own vote

allows to replicate #3089:
- run using `FAIL_TEST_INDEX=0`
- delete some bytes from the end of the WAL
- start normally

Results in logs like:

```
I[2019-02-03|18:12:58.225] Searching for height module=consensus wal=/Users/ethanbuchman/.tendermint/data/cs.wal/wal height=1 min=0 max=0
E[2019-02-03|18:12:58.225] Error on catchup replay. Proceeding to start ConsensusState anyway module=consensus err="failed to read data: EOF"
I[2019-02-03|18:12:58.225] Started node module=main nodeInfo="{ProtocolVersion:{P2P:6 Block:9 App:1} ID_:35e87e93f2e31f305b65a5517fd2102331b56002 ListenAddr:tcp://0.0.0.0:26656 Network:test-chain-J8JvJH Version:0.29.1 Channels:4020212223303800 Moniker:Ethans-MacBook-Pro.local Other:{TxIndex:on RPCAddress:tcp://0.0.0.0:26657}}"
E[2019-02-03|18:12:58.226] Couldn't connect to any seeds module=p2p
I[2019-02-03|18:12:59.229] Timed out module=consensus dur=998.568ms height=1 round=0 step=RoundStepNewHeight
I[2019-02-03|18:12:59.230] enterNewRound(1/0). Current: 1/0/RoundStepNewHeight module=consensus height=1 round=0
I[2019-02-03|18:12:59.230] enterPropose(1/0). Current: 1/0/RoundStepNewRound module=consensus height=1 round=0
I[2019-02-03|18:12:59.230] enterPropose: Our turn to propose module=consensus height=1 round=0 proposer=AD278B7767B05D7FBEB76207024C650988FA77D5 privValidator="PrivValidator{AD278B7767B05D7FBEB76207024C650988FA77D5 LH:1, LR:0, LS:2}"
E[2019-02-03|18:12:59.230] enterPropose: Error signing proposal module=consensus height=1 round=0 err="Error signing proposal: Step regression at height 1 round 0. Got 1, last step 2"
I[2019-02-03|18:13:02.233] Timed out module=consensus dur=3s height=1 round=0 step=RoundStepPropose
I[2019-02-03|18:13:02.233] enterPrevote(1/0). Current: 1/0/RoundStepPropose module=consensus
I[2019-02-03|18:13:02.233] enterPrevote: ProposalBlock is nil module=consensus height=1 round=0
E[2019-02-03|18:13:02.234] Error signing vote module=consensus height=1 round=0 vote="Vote{0:AD278B7767B0 1/00/1(Prevote) 000000000000 000000000000 @ 2019-02-04T02:13:02.233897Z}" err="Error signing vote: Conflicting data"
```

Notice the EOF, the step regression, and the conflicting data.

* wal: change errors to be DataCorruptionError

* exit on corrupt WAL

* fix log

* fix new line
2019-02-04 13:00:06 -05:00
Anton Kaliaev
d178ea9eaf use our logger in autofile/group 2018-11-09 16:29:43 +01:00
goolAdapter
110b07fb3f libs: Call Flush() before rename #2428 (#2439)
* fix Group.RotateFile need call Flush() before rename. #2428
* fix some review issue. #2428
 refactor Group's config: replace  setting member with initial option
* fix a handwriting mistake
* fix a time window error between rename and write.
* fix a syntax mistake.
* change option name Get_ to With_
* fix review issue
* fix review issue
2018-09-25 13:22:45 +02:00
Anton Kaliaev
0e1cd88863 Remove ConsensusParams.TxSize and ConsensusParams.BlockGossip (#2364)
* remove ConsensusParams.TxSize and ConsensusParams.BlockGossip

Refs #2347

* block part size is now fixed

Refs #2347

* use max data size, not max bytes for tx limit

Refs #2347
2018-09-12 15:44:43 -04:00
Zarko Milosevic
7b88172f41 Implement BFT time (#2203)
* Implement BFT time

* set LastValidators when creating state in state helper

for heights >= 2
2018-08-31 19:33:51 -04:00
Dev Ojha
2756be5a59 libs: Remove usage of custom Fmt, in favor of fmt.Sprintf (#2199)
* libs: Remove usage of custom Fmt, in favor of fmt.Sprintf

Closes #2193

* Fix bug that was masked by custom Fmt!
2018-08-10 09:25:57 +04:00
Ethan Buchman
d55243f0e6 fix import paths 2018-07-01 22:36:49 -04:00
Anton Kaliaev
1f22f34edf
flush wal group on stop
Refs #1659
Refs https://github.com/tendermint/tmlibs/pull/217
2018-06-04 16:47:44 +04:00
Anton Kaliaev
708f35e5c1
do not look for height in older files if we've seen height - 1
Refs #1600
2018-05-25 15:11:15 +04:00
Anton Kaliaev
f3f5c7f472
we must only return io.EOF to progress to the next file in auto.Group
since we never write msg partially, if we've encountered io.EOF in the
middle of the msg, we must abort
2018-05-25 15:10:51 +04:00
Anton Kaliaev
68f6226bea
data is corrupted, but this requires manual intervention
i.e., can't be skipped

and we should only return DataCorruptionError if we can skip a msg safely
2018-05-25 15:10:51 +04:00
Anton Kaliaev
118b86b1ef
fix nil panic error
msg is nil and if we continue executing, we'll get nil exception at
`msg.Msg.(....)`
2018-05-25 15:10:51 +04:00
Anton Kaliaev
b9afcbe3a2
fix typo 2018-05-25 15:10:51 +04:00
Ethan Buchman
ee4eb59355 update comments 2018-05-20 16:44:08 -04:00
Ethan Buchman
082a02e6d1 consensus: only fsync wal after internal msgs 2018-05-20 14:40:47 -04:00
Anton Kaliaev
e88f74bb9b
remove wal_light setting
Closes #1428
2018-04-11 10:08:03 +02:00
Jae Kwon
fb64314d1c Review from Anton 2018-04-06 13:46:40 -07:00
Ethan Buchman
799beebd36 fix consensus tests 2018-04-05 17:54:26 +03:00
Jae Kwon
45ec5fd170 WIP consensus 2018-04-05 07:05:45 -07:00
Ethan Buchman
a17105fd46 p2p: peer.Key -> peer.ID 2018-01-01 22:39:05 -05:00
Anton Kaliaev
843e1ed400
Updates -> ValidatoSetUpdates 2017-12-19 13:03:39 -06:00
Anton Kaliaev
e57cad6c3f
correct maxMsgSizeBytes 2017-12-15 11:42:53 -06:00
Anton Kaliaev
06aece31cf
lower the max message size 2017-12-12 13:02:40 -06:00
Anton Kaliaev
af79a2a59e
fix error msg 2017-12-11 19:50:05 -06:00
Anton Kaliaev
ee66476d62
set max msg size
otherwise, it is easy to get OutOfMemory panic (somebody can even expoit
this)
2017-12-11 19:48:57 -06:00
Anton Kaliaev
40f9261d48
handle data corruption errors
Refs #573
2017-12-11 19:48:20 -06:00
Anton Kaliaev
90944bb1a2
be specific about what type we're encoding
to be consistent with Decode, which returns TimedWALMessage
2017-12-07 11:45:50 -06:00
Anton Kaliaev
07571741c5
[consensus] remove WAL separator (Refs #785)
We don't really need a separator unless we have complex structures
(rows, cells like RDBMS have https://www.sqlite.org/fileformat.html).
2017-12-07 11:36:46 -06:00
Anton Kaliaev
c6f025f40e
generate WAL on the fly (Refs #468) 2017-12-06 16:01:08 -06:00
Anton Kaliaev
922af7c405
int64 height
uint64 is considered dangerous. the details will follow in a blog post.
2017-12-01 19:04:53 -06:00
Anton Kaliaev
69b5da766c
service#Start, service#Stop signatures were changed
See https://github.com/tendermint/tmlibs/issues/45
2017-11-29 10:38:58 -06:00
Zach Ramsay
6f3c05545d fix new linting errors 2017-11-27 22:39:12 +00:00
Zach Ramsay
b75d4f73e7 errcheck: PR comment fixes 2017-11-27 22:39:11 +00:00
Anton Kaliaev
61d76a273f
fixes from Bucky's and Emmanuel's reviews 2017-10-30 11:12:01 -05:00
Anton Kaliaev
f6539737de
new pubsub package
comment out failing consensus tests for now

rewrite rpc httpclient to use new pubsub package

import pubsub as tmpubsub, query as tmquery

make event IDs constants
EventKey -> EventTypeKey

rename EventsPubsub to PubSub

mempool does not use pubsub

rename eventsSub to pubsub

new subscribe API

fix channel size issues and consensus tests bugs

refactor rpc client

add missing discardFromChan method

add mutex

rename pubsub to eventBus

remove IsRunning from WSRPCConnection interface (not needed)

add a comment in broadcastNewRoundStepsAndVotes

rename registerEventCallbacks to broadcastNewRoundStepsAndVotes

See https://dave.cheney.net/2014/03/19/channel-axioms

stop eventBuses after reactor tests

remove unnecessary Unsubscribe

return subscribe helper function

move discardFromChan to where it is used

subscribe now returns an err

this gives us ability to refuse to subscribe if pubsub is at its max
capacity.

use context for control overflow

cache queries

handle err when subscribing in replay_test

rename testClientID to testSubscriber

extract var

set channel buffer capacity to 1 in replay_file

fix byzantine_test

unsubscribe from single event, not all events

refactor httpclient to return events to appropriate channels

return failing testReplayCrashBeforeWriteVote test

fix TestValidatorSetChanges

refactor code a bit

fix testReplayCrashBeforeWriteVote

add comment

fix TestValidatorSetChanges

fixes from Bucky's review

update comment [ci skip]

test TxEventBuffer

update changelog

fix TestValidatorSetChanges (2nd attempt)

only do wg.Done when no errors

benchmark event bus

create pubsub server inside NewEventBus

only expose config params (later if needed)

set buffer capacity to 0 so we are not testing cache

new tx event format: key = "Tx" plus a tag {"tx.hash": XYZ}

This should allow to subscribe to all transactions! or a specific one
using a query: "tm.events.type = Tx and tx.hash = '013ABF99434...'"

use TimeoutCommit instead of afterPublishEventNewBlockTimeout

TimeoutCommit is the time a node waits after committing a block, before
it goes into the next height. So it will finish everything from the last
block, but then wait a bit. The idea is this gives it time to hear more
votes from other validators, to strengthen the commit it includes in the
next block. But it also gives it time to hear about new transactions.

waitForBlockWithUpdatedVals

rewrite WAL crash tests

Task:
test that we can recover from any WAL crash.

Solution:
the old tests were relying on event hub being run in the same thread (we
were injecting the private validator's last signature).

when considering a rewrite, we considered two possible solutions: write
a "fuzzy" testing system where WAL is crashing upon receiving a new
message, or inject failures and trigger them in tests using something
like https://github.com/coreos/gofail.

remove sleep

no cs.Lock around wal.Save

test different cases (empty block, non-empty block, ...)

comments

add comments

test 4 cases: empty block, non-empty block, non-empty block with smaller part size, many blocks

fixes as per Bucky's last review

reset subscriptions on UnsubscribeAll

use a simple counter to track message for which we panicked

also, set a smaller part size for all test cases
2017-10-30 00:32:22 -05:00
Ethan Buchman
57a684d5ac fixes from review 2017-10-25 21:54:56 -04:00
Anton Kaliaev
c74a359c46
fixes per Bucky's review 2017-10-24 12:14:21 +04:00
Anton Kaliaev
3115c23762
binary format for WAL 2017-10-23 22:27:24 +04:00
Anton Kaliaev
31030c6514
make crc32c a global var
change echo format in build.sh script
2017-10-23 22:09:42 +04:00
Anton Kaliaev
7b8ffc9981
add checksum and msg size to TimedWALMessage
updated test_data/build.sh script
2017-10-23 22:09:17 +04:00
Zach Ramsay
d56b44f3a5 all: no more anonymous imports 2017-10-04 16:40:45 -04:00
Anton Kaliaev
f8fdbe3dbc
changes as per Bucky's review 2017-05-13 16:22:51 +02:00
Anton Kaliaev
f803544195
new logging 2017-05-13 10:24:58 +02:00
Ethan Buchman
d1926bcad1 use tmlibs 2017-04-21 18:12:54 -04:00
Ethan Buchman
54b26869d5 consensus/wal: #HEIGHT -> #ENDHEIGHT 2017-04-18 21:27:31 -04:00
Ethan Buchman
4fd1471f11 remove BaseService.OnStart 2017-03-28 12:09:11 -04:00
Ethan Buchman
0bec99fbd4 consensus: handshake replay test using wal 2017-02-17 19:12:05 -05:00
Ethan Buchman
f4e6cf4439 consensus: sync wal.writeHeight 2016-12-22 15:01:02 -05:00