"kernel activity hash": add some amount of swingset state into the cosmos AppHash #3442
Comments
I'm adding this to the Metering Phase, but since adding this feature will expose any existing non-determinism pretty quickly (and we're already finding some from the previous phase), it's really important that we do extensive multi-validator multi-restart testing before the start of that phase.
It occurred to me that we don't even need to do any hashing from the kernel side. Simply writing the block buffer into a single key of the cosmos DB at the end of each block should suffice, because it will be sampled immediately (it will change the AppHash for that same block). That might consume more space in the state vector, but it will be replaced on the next block, so it won't accumulate (except for archival nodes, which record the full history of the state vector). Recording a hash of the current block buffer, instead of the full contents, would reduce that.
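A minimal sketch of that variant, assuming a `blockBuffer` Map of the block's key/value writes and a `cosmosKVStore` handle (both names are illustrative stand-ins, not existing APIs):

```js
import { createHash } from 'crypto';

// At end-of-block, store a hash of the block buffer under a single cosmos DB key.
// The key is overwritten every block, so it does not accumulate state.
function publishBlockBufferHash(blockBuffer, cosmosKVStore) {
  const h = createHash('sha256');
  for (const [key, value] of blockBuffer.entries()) {
    h.update(`${key}\n${value}\n`); // fold in every write made during this block
  }
  cosmosKVStore.set('swingset-block-buffer-hash', h.digest('hex'));
}
```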
The mailbox device tracked the highest inbound message number and ack for each peer, to de-duplicate repeated messages and thereby reduce the amount of kernel activity. Each call to `deliverInbound` would return a boolean to indicate whether the messages/ack were new, and thus whether the kernel needed to be cycled. However, the device was holding this tracking data in non-durable state, so if/when the kernel was restarted, the state would be lost. A duplicate message/ack arriving in the restarted process would trigger kernel activity that would not have run in the original process. These extra cranks caused divergence between validators when one of them was restarted and a client sent a duplicate message (such as the pre-emptive `ack` all clients send at startup). The extra crank does not get very far, because vattp does its own deduplication, so the divergence was only visible in the slog. But when #3442 is implemented, even a single extra crank will flag the validator as out of consensus. The fix is to remove the mailbox device's dedup code and rely upon vattp for this function. The test was also updated to match, and a new test (comparing two parallel kernels, one restarted, one not) was added. closes #3471
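For context, a hedged sketch of how a host could use that boolean; the surrounding names are illustrative, and only "`deliverInbound` returns whether the messages/ack were new" comes from the description above:

```js
// Illustrative host-side handler, not the real cosmic-swingset code: cycle the
// kernel only when the mailbox device reports that the messages/ack were new.
async function handleInbound(mb, controller, hostStorage, peer, messages, ack) {
  const wasNew = mb.deliverInbound(peer, messages, ack);
  if (wasNew) {
    await controller.run(); // run the kernel only for fresh work
    hostStorage.commit();
  }
  return wasNew;
}
```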
My current idea for the design:
await controller.run(runPolicy);
swingstore.commit();
const crankHash = controller.getCrankHash();
appState.save('swingset-crankhash', crankHash);
appState.commit();

In addition, the …
The host application (which uses swingset as a library) is obligated to provide a "hostStorage" object, which is a trio of kvStore, streamStore, and snapStore. The host typically uses openLMDBSwingStore() to create this, and holds onto the "commit block" facet. Previously the kernelKeeper was built on top of the streamStore, the snapStore, and an "enhancedCrankBuffer", which was the host-provided key-value store plus an abortable/committable "crank buffer", plus some extra convenience methods. The kernel held the abort/commit facet. The kernel was responsible for extracting the raw kvStore from `hostStorage`, wrapping it, and passing the enhanced version into `makeKernelKeeper` along with the other two values. Instead, we now give the entire "hostStorage" object to `makeKernelKeeper`, and give the kernel keeper responsibility for doing the wrapping. When the kernel wants to commit or abort, it calls `kernelKeeper.abortCrank()` or `commitCrank()`. This simplifies the `makeKernelKeeper` construction and will provide a place for `commitCrank` to store a hash of kernel activity. refs #3442
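Schematically, the wiring described above might look like the following sketch; the exact return shape of `openLMDBSwingStore()` is an assumption, and `makeKernelKeeper` is kernel-internal (no public import path), so this is a data-flow illustration rather than the actual code:

```js
import { openLMDBSwingStore } from '@agoric/swing-store-lmdb';

// host side: open the swing-store; the host keeps the "commit block" facet
const swingStore = openLMDBSwingStore(kernelStateDBDir);
const hostStorage = {
  kvStore: swingStore.kvStore,
  streamStore: swingStore.streamStore,
  snapStore: swingStore.snapStore,
};

// kernel side (internal): the keeper receives the whole trio and does the
// crank-buffer wrapping itself
const kernelKeeper = makeKernelKeeper(hostStorage);
// ... at the end of each crank, the kernel chooses one of:
kernelKeeper.commitCrank(); // keep this crank's changes (and later, hash them)
// kernelKeeper.abortCrank(); // or discard them

// host side, at end-of-block:
swingStore.commit();
```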
The multiple members of a consensus machine are supposed to perform identical execution of every crank. To detect any possible divergence as quickly as possible, the kernel maintains the "crankhash": a constantly-updated string which incorporates (by SHA256 hash) a copy of every DB write and delete. `controller.getCrankHash()` can be run after one or more cranks have finished, and the resulting hex string can, for example, be stored in a host application's consensus state vector. If two members diverge in a way that causes their swingset state to differ, they will have different crankhashes, and the consensus state vectors will diverge. This should cause at least one of them to fall out of consensus. Some keys are excluded from consensus: currently just those involving vat snapshots and the truncation (non-initial starting point) of the transcript. refs #3442
This 'crankhash' covers all kernel state changes, so if any two validators diverge, their crankhashes should diverge too. By storing them into the cosmos-sdk state vector, the AppHashes will diverge as well, causing at least one of them to fall out of consensus. This should let us catch divergence as soon as possible. closes #3442
We're defining the `local.*` namespace as the portion of the DB that is excluded from consensus (the #3442 kernel activity hash). Vat snapshots are a local concern: each member of a consensus machine can decide for itself if/when to make a snapshot. So both the vat->snapshot and the snapshot->vats tables must be moved to `local.`.
The kernel stats are merely incidentally deterministic: unlike c-lists and run-queues and object/promise state, stats aren't the primary thing we want to compare across replicas in a consensus machine. And we might want to add more in the future without causing a compatibility break. So this changes the DB key from `kernelStats` to `local.kernelStats`, which will exclude them from the #3442 kernel activity hash.
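A tiny illustrative helper (not the actual kernel code) showing the kind of check this `local.` namespace enables when feeding DB changes into the activity hash:

```js
// Keys under `local.` are host-local (snapshots, stats, etc.) and stay out of
// the consensus-sensitive hash; everything else participates.
function isConsensusKey(key) {
  return !key.startsWith('local.');
}

// e.g. only consensus keys would be folded into the crankhash
function maybeRecordWrite(crankhash, key, value) {
  if (isConsensusKey(key)) {
    crankhash.update(`add ${key} ${value}\n`); // encoding here is illustrative
  }
}
```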
The multiple members of a consensus machine are supposed to perform identical execution of every crank. To detect any possible divergence as quickly as possible, the kernel maintains the "activityhash": a constantly-updated string which incorporates (by SHA256 hash) a copy of every DB write and delete. Each crank's DB writes/deletes are accumulated into the "crankhash". If the crank is committed, its crankhash is hashed into the old activityhash to form the new "activityhash". `controller.getActivityhash()` can be run after one or more cranks have finished, and the resulting hex string can, for example, be stored in a host application's consensus state vector. If two members diverge in a way that causes their swingset state to differ, they will have different activityhashes, and the consensus state vectors will diverge. This should cause at least one of them to fall out of consensus. Some keys are excluded from consensus: currently just those involving vat snapshots and the truncation (non-initial starting point) of the transcript. refs #3442
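The two-level accumulation described here could look roughly like the following sketch; the real crank buffer also records deletes and uses its own encoding, so treat this as a shape, not the implementation:

```js
import { createHash } from 'crypto';

function makeActivityTracker() {
  let crankhash = createHash('sha256'); // accumulates this crank's DB changes
  let activityhash = ''; // hex string carried across committed cranks

  return {
    // record one DB write or delete into the current crank's hash
    recordChange(changeString) {
      crankhash.update(changeString);
    },
    // on commit, fold the finished crankhash into the running activityhash
    commitCrank() {
      const ch = crankhash.digest('hex');
      activityhash = createHash('sha256')
        .update(activityhash)
        .update(ch)
        .digest('hex');
      crankhash = createHash('sha256'); // start fresh for the next crank
    },
    // on abort, throw the crank's accumulated hash away
    abortCrank() {
      crankhash = createHash('sha256');
    },
    getActivityhash: () => activityhash,
  };
}
```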
This 'activityhash' covers all kernel state changes, so if any two validators diverge, their activityhashes should diverge too. By storing them into the cosmos-sdk state vector, the AppHashes will diverge as well, causing at least one of them to fall out of consensus. This should let us catch divergence as soon as possible. closes #3442
What is the Problem Being Solved?
In trying to debug #3428, @mhofman and I compared slog traces of two different validators to confirm that they were running the same cranks. He found variation in the vbank messages (#3435), and we also observed differences in the GC Actions sent into vats between validators which had or had not reloaded a vat from snapshot (due to #3428 causing different GC syscalls to be made slightly earlier).
We've been kind of on the fence about how much variation to allow in the swingsets running on different validators. Currently, the only swingset variations that will cause a validator to break consensus are outbound comms/IBC messages (written into the cosmos-sdk "swingset" Module's state) and messages sent to the vbank Module (also saved in its state). Anything else will silently diverge until one of those two values observes a difference.
It's probably best to detect these kernel-level variations as quickly as possible. Ideally we'd store all consensus-sensitive kernel state in the cosmos DB where it can be rolled up into the AppHash along with everything else. We tried this early on, and found the performance was troublesome, but maybe we should consider trying it again.
Description of the Design
We may benefit from a partial approach, something like: around each `controller.step()`, read a "previous kernel state hash" value from the cosmos DB, fold in a record of the delivery just performed, and write the updated value back.

That would at least be sensitive to deviations in the deliveries that the kernel is performing. This would have caught the GC Action variations we observed.
The next level would be to hash all the DB changes made during a block (the contents of the "block buffer") and fold them in as well. It wouldn't provide any insight into what the variance was, but it would detect it immediately. We've talked about excluding certain kernelDB keys from consensus (vat snapshot ID, transcript prefix offset); we'd need to implement that before adding the block buffer into the hash.
Consensus Considerations
This would be a consensus-breaking change. If we choose to implement it, we should do so before our first release.
Security Considerations
Most chain environments are entirely transparent, but if we anticipate some sort of hybrid model in which the swingset state is somehow concealed from the host-side application, we must consider the information leakage of this activity hash. This leakage can be prevented by including a full-strength random value (256 bits or more) into each hash computation, and not revealing this value to the host application.
Test Plan
A unit test which somehow modifies the kernel between two replays of the same state (maybe by swapping two items on the run-queue), and verifies that the two instances finish with different kernel activity hash values.
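One possible shape for that test, in the repo's AVA style; the helpers here are placeholders, not existing test utilities:

```js
import test from 'ava';

test('activity hash diverges when kernel activity differs', async t => {
  // two kernels built from identical state; helpers are hypothetical
  const c1 = await makeTestController();
  const c2 = await makeTestController();

  await c1.run();
  swapFirstTwoRunQueueItems(c2); // hypothetical perturbation of kernel state
  await c2.run();

  t.not(c1.getActivityhash(), c2.getActivityhash());
});
```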