Introduce P3 and RocksDB fixes #624
Conversation
@ldmberman a short explanation of the ar_kv changes for you:
Worth reading, even though it is 10 years old: facebook/rocksdb#236 (comment) (the other comments are also useful).
Update: I found some traces of column family support being added to the repair mechanism, but I could not make it work either way, neither with the Erlang bindings nor with the RocksDB CLI tool (ldb).
{exception, io_lib:format("~p", [Exc])}]),
{error, Exc};
_ ->
case reconnect(Name, Ref) of
The new code discards the retry functionality. Is this what you were referring to in your PR comment about the reconnect functionality not working right? If the retry didn't work or, worse, introduced a race condition, then I agree it's good to remove - but I just want to double-check that we're not accidentally throwing away a valuable bit of retry logic.
The reconnect mechanism introduced a race condition and was not used anyway, so it was removed.
Ah, I see. All methods were called with RetryCount set to 1? In that case, I completely agree - kill the retry functionality
Yes, but the problem was bigger than that.
The reconnect was called from functions executed on the caller side (not in the ar_kv process), so these calls could happen concurrently, several at once. This means that if the database is down for some reason and the db reference is outdated, several other processes will attempt a gen_server:call demanding that the database be reconnected. These calls will be serialized in the process mailbox, and the gen_server will perform the close/open sequence several times. While the database is closed, other processes will find the database reference dead, will call for reconnects, and this will never end.
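A minimal sketch of the problematic pattern described above (the module and function names here are hypothetical, not the actual ar_kv code): the reference check runs in each caller process, so every caller that observes a stale reference enqueues its own reconnect request in the server mailbox.

```erlang
%% Illustrative only: models the caller-side reconnect pattern, not ar_kv itself.
-module(kv_reconnect_sketch).
-behaviour(gen_server).

-export([start_link/0, get/2]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Runs in the *caller* process. If the read fails, every concurrent caller
%% issues its own reconnect request; the requests pile up in the server mailbox.
get(Name, Key) ->
    case do_get(Name, Key) of
        {error, closed} ->
            {ok, _NewRef} = gen_server:call(?MODULE, {reconnect, Name}),
            do_get(Name, Key);
        Result ->
            Result
    end.

%% Placeholder for the real read against a cached database reference.
do_get(_Name, _Key) ->
    {error, closed}.

init([]) ->
    {ok, #{}}.

handle_call({reconnect, Name}, _From, State) ->
    %% Each queued request closes and reopens the database again,
    %% invalidating the reference the other waiting callers are about to use.
    NewRef = make_ref(),
    {reply, {ok, NewRef}, State#{Name => NewRef}};
handle_call(_Req, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```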
ah got it - sounds like a mess. Good call removing.
These calls will be serialized in the process mailbox, and the gen_server will perform the close/open sequence several times. While the database is closed, other processes will find the database reference dead, will call for reconnects, and this will never end.
This is not true; the first reconnect changes the reference and the subsequent processes simply pick it up - https://github.com/ArweaveTeam/arweave/blob/master/apps/arweave/src/ar_kv.erl#L171
Nevertheless, we probably do not need the reconnect functionality now so it makes sense to remove it.
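For illustration, this is roughly the guard being referred to, expressed against the hypothetical sketch above rather than the linked ar_kv code: the server only reopens the database when the reference the caller complained about is still the current one, so reconnect requests queued behind the first become no-ops.

```erlang
%% Replaces the {reconnect, Name} clause of the sketch above; names are illustrative.
%% The caller passes along the reference it found dead when issuing the request.
handle_call({reconnect, Name, StaleRef}, _From, State) ->
    case maps:get(Name, State, undefined) of
        StaleRef ->
            %% First request for this stale reference: close and reopen once.
            NewRef = make_ref(),
            {reply, {ok, NewRef}, State#{Name => NewRef}};
        CurrentRef ->
            %% A previous request already replaced the reference; just return it.
            {reply, {ok, CurrentRef}, State}
    end.
```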
@ldmberman yes, that is exactly the race condition: several processes will hit the same gen_server:call at the same time, and the ar_kv server will perform the reconnect sequence several times.
If you're referring to the fact that each database is only used by one process (I'm not sure that is true), then yes, there is no race condition, simply because the race involves only one process.
This is a relatively recent discussion of the repairing procedure. The key point from there:
This is (so far) in line with what @shizzard observed locally in tests. There are no other caveats mentioned, so I think it is only about flushing the CFs once they are created (no need to flush them explicitly later on). In any case, @shizzard and I had a discussion and decided to remove the repairing code for now. Regarding the atomic_flush option, we do not need it because we do not use RocksDB transactions. I would not introduce it until we need it.
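A rough sketch of what the agreed approach could look like with the erlang-rocksdb bindings; the path, CF names, and the availability of rocksdb:flush/2 with a {wait, true} option are assumptions made for illustration, not code taken from ar_kv:

```erlang
-module(kv_open_sketch).
-export([open_and_flush_once/0]).

open_and_flush_once() ->
    Path = "data/rocksdb_sketch",
    %% atomic_flush is deliberately not set here, matching the decision above.
    DbOpts = [{create_if_missing, true}, {create_missing_column_families, true}],
    %% The "default" CF must always be listed; "tx_index" is a made-up example.
    CFs = [{"default", []}, {"tx_index", []}],
    {ok, Db, _CfHandles} = rocksdb:open_with_cf(Path, DbOpts, CFs),
    %% One explicit flush right after the CFs are created; assumes the binding
    %% exposes rocksdb:flush/2 taking flush options such as {wait, true}.
    ok = rocksdb:flush(Db, [{wait, true}]),
    {ok, Db}.
```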
… atomic_flush from default db options
@ldmberman @JamesPiechota I think it is ready to merge, unless you have any changes in mind.
This PR includes several directions of work: