Incremental verify checkpoints #4487

Merged Oct 29, 2024 (1 commit)

@ThomasBrady ThomasBrady commented Sep 25, 2024

Resolves #4454

Description

Adds a --trusted-hash-file argument to the verify-checkpoints command. New verified checkpoints are appended starting from the last checkpoint in the trusted hash file.

Adds a --from-ledger argument, which generates a verified checkpoint hash file covering the range from a specific ledger up to the LCL (or a specified end ledger).

Design doc: https://docs.google.com/document/d/1GRzHAO4_YrfanXqoVc1UDIMhUV10PFqIMQyOxlPOW_s/edit

Usage example:

--from-ledger:

% src/stellar-core verify-checkpoints --from-ledger=53736369 --output-file=out.json --conf=../stellar-core.cfg
Result:

% cat out.json 
[
[53736575, "1de4bfa30f8af81716d2295b7c9f077afea250ddb88839345c13176de7b75e36"],
[53736511, "9f1bd24f21facc606b49216853c0e2162d55d2e3e898da96dd910ddd1ede784f"],
[53736447, "80a3083ea9e987b48949c2ad33006a5e750f06c6836c4814d5a853cab6bac1e3"],
[53736383, "2363bc49669667aa28da768588b5be7f09dc8c69c5e20416d870748b3739509b"],
[0, ""]
]
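The layout of the output file can be checked mechanically. Below is a minimal Python sketch, assuming (from the sample above, not from any documented schema) that entries are `[ledger, hash]` pairs listed newest first, that checkpoint ledgers sit one below a multiple of 64, and that the file ends with a `[0, ""]` terminator. `validate_hash_file` is a hypothetical helper and not part of stellar-core.

```python
import json

CHECKPOINT_FREQUENCY = 64  # assumption: one checkpoint every 64 ledgers


def validate_hash_file(text):
    """Return the (ledger, hash) pairs from a verify-checkpoints output file.

    Raises ValueError if the layout differs from the sample shown above.
    """
    entries = json.loads(text)
    if not entries or entries[-1] != [0, ""]:
        raise ValueError('missing [0, ""] terminator')
    pairs = [(ledger, h) for ledger, h in entries[:-1]]
    for i, (ledger, h) in enumerate(pairs):
        # Checkpoint ledgers are one below a multiple of the frequency.
        if (ledger + 1) % CHECKPOINT_FREQUENCY != 0:
            raise ValueError(f"{ledger} is not a checkpoint ledger")
        # Hashes in the sample are 64 lowercase hex characters.
        if len(h) != 64 or any(c not in "0123456789abcdef" for c in h):
            raise ValueError(f"bad hash for ledger {ledger}")
        # Entries are listed newest first, exactly one checkpoint apart.
        if i > 0 and pairs[i - 1][0] - ledger != CHECKPOINT_FREQUENCY:
            raise ValueError(f"gap or wrong order before ledger {ledger}")
    return pairs
```

Calling `validate_hash_file(open("out.json").read())` on the file above would return its four pairs if these assumptions hold.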

Append to existing file:

src/stellar-core verify-checkpoints --trusted-hash-file=out.json --output-file=out2.json --conf=../stellar-core.cfg
Result:

cat out2.json 
[
[53736959, "4b1900cb4bbaa77e86e3c8abb33be966e24a84098acdbda3d57977f237c5b13e"],
[53736895, "a163415903fa39efb53e4c79198fa2857cdbb12f92cc64f0ac3bcd0e6a7f2cce"],
[53736831, "2977e0c5653960a11359552dd74508a17982a5ca422db961f809fc335cd17901"],
[53736767, "ff7d80daad82981c1512c0f296a9ff9902f7b9d1ffa8ec8ad02e588cca16a9fd"],
[53736703, "0fb92338560bfac48ebd78dac530735ca988009132846fd93e42c061caa8cc5f"],
[53736639, "ba407b9b13e077cf9fb0a1c277416e12c6ff6857a42beef62f5805a9fdeec8ce"],
[53736575, "1de4bfa30f8af81716d2295b7c9f077afea250ddb88839345c13176de7b75e36"],
[53736511, "9f1bd24f21facc606b49216853c0e2162d55d2e3e898da96dd910ddd1ede784f"],
[53736447, "80a3083ea9e987b48949c2ad33006a5e750f06c6836c4814d5a853cab6bac1e3"],
[53736383, "2363bc49669667aa28da768588b5be7f09dc8c69c5e20416d870748b3739509b"],
[0, ""]
]
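In the sample above, out2.json contains every entry of out.json unchanged, with the newer checkpoints placed in front. A small Python sketch of that property check follows; `is_append_of` is a hypothetical name, and the newest-first ordering is inferred from the two samples rather than from a specification.

```python
import json


def is_append_of(old_text, new_text):
    """True if new_text keeps every entry of old_text, in order, and only
    adds entries in front of them (files list the newest checkpoint first)."""
    old = json.loads(old_text)
    new = json.loads(new_text)
    # The old file must reappear verbatim as the tail of the new file.
    return len(new) >= len(old) and new[len(new) - len(old):] == old
```

A check like this could run after each incremental invocation to confirm that previously trusted hashes were not altered.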

Using both --from-ledger and --trusted-hash-file is an error:

 % src/stellar-core verify-checkpoints --trusted-hash-file=out2.json --output-file=out3.json --from-ledger=9999 --conf=../stellar-core.cfg --ll trace 
Warning: running non-release version v22.0.0rc1-3-ge94e61395-dirty of stellar-core
2024-09-30T15:56:36.748 [default ERROR] Cannot specify both --from-ledger and --trusted-hash-file

Performance

Time to verify checkpoints from --from-ledger=53737040 to LCL=53739327.
Output: hashes for checkpoints 53737023 to 53739327, a total of 2304 ledgers = 2287 ledgers (from --from-ledger=53737040 to LCL=53739327) + 17 ledgers (from checkpoint 53737023 to --from-ledger=53737040):

time src/stellar-core verify-checkpoints --output-file=out4.json --from-ledger=53737040 --conf=../stellar-core.cfg

src/stellar-core verify-checkpoints --output-file=out4.json    15.22s user 1.25s system 8% cpu 3:25.09 total
  0.80s user 0.31s system 18% cpu 5.825 total

205 seconds / 2304 ledgers ≈ 0.09 seconds (90 milliseconds) per ledger

Caveat: there is additional overhead because the LCL is obtained from the network. On average we wait half a checkpoint (32 ledgers) to reach a checkpoint-boundary LCL (32 ledgers * 5 seconds = 160 seconds).
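The arithmetic behind these figures can be reproduced directly. The sketch below uses only numbers quoted in this section; the 64-ledger checkpoint size follows from "half a checkpoint (32 ledgers)", and the variable names are illustrative.

```python
CHECKPOINT_FREQUENCY = 64   # ledgers per checkpoint (32 is half a checkpoint)
LEDGER_CLOSE_SECONDS = 5    # ledger close time used in the caveat

first_checkpoint = 53737023
from_ledger = 53737040
lcl = 53739327

ledgers_total = lcl - first_checkpoint                  # 2304
ledgers_requested = lcl - from_ledger                   # 2287
ledgers_before_from = from_ledger - first_checkpoint    # 17
checkpoints = ledgers_total // CHECKPOINT_FREQUENCY     # 36

wall_seconds = 3 * 60 + 25.09                           # "3:25.09 total"
seconds_per_ledger = wall_seconds / ledgers_total       # about 0.089
average_wait_seconds = (CHECKPOINT_FREQUENCY // 2) * LEDGER_CLOSE_SECONDS  # 160

print(f"{ledgers_total} ledgers, {checkpoints} checkpoints, "
      f"{seconds_per_ledger * 1000:.0f} ms per ledger, "
      f"{average_wait_seconds} s average wait for a checkpoint boundary")
```

The gap between checkpoint 53737023 and --from-ledger=53737040 works out to 17 ledgers, which together with the 2287 requested ledgers gives the 2304 total.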

Checklist

  • Reviewed the contributing document
  • Rebased on top of master (no merge commits)
  • Ran clang-format v8.0.0 (via make format or the Visual Studio extension)
  • Compiles
  • Ran all tests
  • If change impacts performance, include supporting evidence per the performance document

@ThomasBrady ThomasBrady changed the title from "WIP: Incremental verify checkpoints" to "Incremental verify checkpoints" Sep 30, 2024
@ThomasBrady ThomasBrady force-pushed the incremental-verify-checkpoints branch from 5abf2eb to c4810de October 1, 2024 01:05
@SirTyson SirTyson left a comment

Thanks for this change, sorry we kept going back and forth so much in the design phase :(. I did a quick pass, but I think there's a couple of issues with the interface that need to be fixed, then I'll do another pass once things are working a bit better. In particular

stellar-core --conf test.cfg verify-checkpoints --trusted-hash-file does-not-exist  

crashes after syncing with the network, but based on the help text for --trusted-hash-file it looks like this should work. If the crash is intended behavior, the help text should be changed and the check should happen on startup; otherwise the crash should be fixed.

I'm also not quite sure what the intended interface for this is. It looks like in the doc, we have

stellar-core verify-checkpoints --conf=core.cfg --trusted-hash-file=path/to/verified.json

which takes in a previous file called path/to/verified.json, and at the end of the call updates path/to/verified.json such that it contains hashes to lcl. However, it looks like the interface has changed in this PR, where we take in

stellar-core verify-checkpoints --trusted-hash-file=path/to/verified.json --output-file=path/to/verified2.json

where the output file is a new file which contains the hashes from path/to/verified.json. The issue is, this doesn't actually work as an append operation, as the --output-file must not be the same as the --trusted-hash-file. To demonstrate this, I ran the following commands on testnet:

stellar-core --conf testnet.cfg verify-checkpoints --output-file out --from-ledger 249443

This command succeeded. After a few checkpoints passed, I then attempted to append to the file to catch up to lcl with

stellar-core --conf testnet.cfg verify-checkpoints --output-file out --trusted-hash-file out

which crashed. I doubt that Horizon operators will want to manage a collection of files, so we probably do want a true append operation.

While I found a couple of issues, I think it would be helpful to:

  1. Do validity checking on startup. If we crash due to a file not existing that's fine, but this should happen immediately on startup and not after waiting for the network's next checkpoint ledger.
  2. Take a step back and solidify what the interface should be. I know we've had some irl conversations back and forth and the expectations have been changing a lot throughout, but currently the design doc, commands.md doc, and command line "help" output all define different, mutually exclusive interfaces. I think this is making review and implementation a bit tricky.

@ThomasBrady

> Thanks for this change, sorry we kept going back and forth so much in the design phase :(. I did a quick pass, but I think there's a couple of issues with the interface that need to be fixed, then I'll do another pass once things are working a bit better. In particular
>
> stellar-core --conf test.cfg verify-checkpoints --trusted-hash-file does-not-exist
>
> crashes after syncing with the network, but based on the help text for --trusted-hash-file it looks like this should work. If the crash is intended behavior, the help text should be changed and the check should happen on startup; otherwise the crash should be fixed.

Do you know what error was printed when you ran this? For me I get 2024-10-02T15:43:40.210 GAL3A [default FATAL] Got an exception: error opening output file. If I specify a non-existent trusted hash file (with an output-file to write to), it verifies to genesis without raising an error.

I agree that the error reporting should happen earlier. I thought that calling .required() on the clara parser for --output-file would have raised an error immediately if that flag isn't provided, but that doesn't seem to be the case. I'll raise an error before connecting to the network if output-file isn't specified. If --trusted-hash-file does not exist, I think it should also result in an error being reported rather than silently verifying from genesis so I'll report that too.
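The fail-fast checks discussed here can be sketched as a single validation step that runs before any network activity. This is an illustrative Python sketch and not the PR's C++ implementation; `check_args` and its parameter names are hypothetical, and the rules (required output file, mutually exclusive flags, trusted hash file must exist) are taken from this thread.

```python
import os


def check_args(output_file, trusted_hash_file=None, from_ledger=None):
    """Return a list of error messages; an empty list means it is safe to proceed.

    Intended to run on startup, before connecting to the network.
    """
    errors = []
    if not output_file:
        errors.append("--output-file is required")
    if trusted_hash_file is not None and from_ledger is not None:
        errors.append("Cannot specify both --from-ledger and --trusted-hash-file")
    if trusted_hash_file is not None and not os.path.isfile(trusted_hash_file):
        errors.append(f"--trusted-hash-file does not exist: {trusted_hash_file}")
    return errors
```

Collecting all messages before exiting lets an operator fix every problem in one pass instead of discovering them one run at a time.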

> I'm also not quite sure what the intended interface for this is. It looks like in the doc, we have
>
> stellar-core verify-checkpoints --conf=core.cfg --trusted-hash-file=path/to/verified.json
>
> which takes in a previous file called path/to/verified.json, and at the end of the call updates path/to/verified.json such that it contains hashes to lcl. However, it looks like the interface has changed in this PR, where we take in
>
> stellar-core verify-checkpoints --trusted-hash-file=path/to/verified.json --output-file=path/to/verified2.json
>
> where the output file is a new file which contains the hashes from path/to/verified.json. The issue is, this doesn't actually work as an append operation, as the --output-file must not be the same as the --trusted-hash-file. To demonstrate this, I ran the following commands on testnet:

Correct, the design was updated not to append to the trusted-hash-file implicitly. An output-file must be explicitly specified with all invocations. I'll modify the file output logic to write to a temporary file if the specified --output-file is equal to the --trusted-hash-file to support the append use case.
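Writing to a temporary file and then swapping it into place is a common way to let the output path equal the input path without risking a half-written file. The following is a minimal Python sketch of that general technique, not the code in this PR; `write_hashes_atomically` is a hypothetical name.

```python
import json
import os
import tempfile


def write_hashes_atomically(entries, output_path):
    """Write entries as JSON to output_path without exposing a partial file.

    The data goes to a temporary file in the same directory, which then
    replaces output_path in one step, so output_path may be the input file.
    """
    directory = os.path.dirname(os.path.abspath(output_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(entries, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the data is on disk first
        # Atomic on POSIX when both paths are on the same filesystem.
        os.replace(tmp_path, output_path)
    except BaseException:
        # Leave the original file untouched and clean up the temporary one.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Creating the temporary file in the destination directory keeps both paths on one filesystem, which is what allows the final rename to be a single step.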

> stellar-core --conf testnet.cfg verify-checkpoints --output-file out --from-ledger 249443
>
> This command succeeded. After a few checkpoints passed, I then attempted to append to the file to catch up to lcl with
>
> stellar-core --conf testnet.cfg verify-checkpoints --output-file out --trusted-hash-file out
>
> which crashed. I doubt that Horizon operators will want to manage a collection of files, so we probably do want a true append operation.
>
> While I found a couple of issues, I think it would be helpful to:
>
>   1. Do validity checking on startup. If we crash due to a file not existing that's fine, but this should happen immediately on startup and not after waiting for the network's next checkpoint ledger.
>   2. Take a step back and solidify what the interface should be. I know we've had some irl conversations back and forth and the expectations have been changing a lot throughout, but currently the design doc, commands.md doc, and command line "help" output all define different, mutually exclusive interfaces. I think this is making review and implementation a bit tricky.

I've spotted a typo in commands.md (--trusted-checkpoint-hashes should be --trusted-checkpoint-file), and there were example invocations in the design doc that erroneously included both --trusted-checkpoint-file and --from-ledger and omitted the mandatory --output-file argument. I've updated both. Is that all you were referring to, or are there other places where the interface differs?

SirTyson commented Oct 3, 2024

> Do you know what error was printed when you ran this? For me I get 2024-10-02T15:43:40.210 GAL3A [default FATAL] Got an exception: error opening output file. If I specify a non-existent trusted hash file (with an output-file to write to), it verifies to genesis without raising an error.

Ya the error I was referring to was that one, with no output-file.

> If --trusted-hash-file does not exist, I think it should also result in an error being reported rather than silently verifying from genesis so I'll report that too.

Sounds like a good idea!

> Is that all you were referring to or are there other issues with the interface differing?

That definitely cleans up most of it, but I think there's still an issue in the command help message for "--trusted-hash-file":

        "file containing trusted hashes, generated by a previous call to "
        "verify-checkpoints or a non-existent file to generate a new one");

I don't think a non-existent file should be valid, and we should probably just crash immediately on startup in this case.

@SirTyson SirTyson left a comment

Overall working much better! A few small issues regarding graceful failure and making sure we don't corrupt output files.

@SirTyson SirTyson left a comment

Looks good! Just a few final cleanups and one edge case question.

@SirTyson SirTyson left a comment

LGTM!

@SirTyson SirTyson enabled auto-merge October 29, 2024 17:28
@SirTyson SirTyson added this pull request to the merge queue Oct 29, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 29, 2024
…llow for incremental verification of checkpoints.
@ThomasBrady ThomasBrady force-pushed the incremental-verify-checkpoints branch from 7eb1104 to b827e40 October 29, 2024 18:59
@SirTyson SirTyson added this pull request to the merge queue Oct 29, 2024
Merged via the queue into stellar:master with commit acf111d Oct 29, 2024
13 checks passed
Linked issue: allow running verify-checkpoints incrementally