Incremental verify checkpoints #4487

ThomasBrady · 2024-09-25T21:13:04Z

Resolves #4454

Description

Adds --trusted-hash-file argument to the verify-checkpoints command to support appending new verified checkpoints starting from the last checkpoint in the trusted hash file.

Adds --from-ledger to support generating a verified checkpoint hash file that starts at a specific ledger and runs to the LCL (or to a specified end ledger).

Design doc: https://docs.google.com/document/d/1GRzHAO4_YrfanXqoVc1UDIMhUV10PFqIMQyOxlPOW_s/edit

Usage example:

`--from-ledger` :

% src/stellar-core verify-checkpoints --from-ledger=53736369 --output-file=out.json --conf=../stellar-core.cfg
Result:

% cat out.json 
[
[53736575, "1de4bfa30f8af81716d2295b7c9f077afea250ddb88839345c13176de7b75e36"],
[53736511, "9f1bd24f21facc606b49216853c0e2162d55d2e3e898da96dd910ddd1ede784f"],
[53736447, "80a3083ea9e987b48949c2ad33006a5e750f06c6836c4814d5a853cab6bac1e3"],
[53736383, "2363bc49669667aa28da768588b5be7f09dc8c69c5e20416d870748b3739509b"],
[0, ""]
]
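For readers consuming this file: the output above is plain JSON, a list of `[ledger, hash]` pairs ordered newest first and closed by a `[0, ""]` entry. A minimal sketch of reading the newest verified checkpoint from it follows. This is not stellar-core code; the function name `latest_checkpoint` is made up for illustration, and the layout is assumed from the sample output only.

```python
import json

def latest_checkpoint(text):
    """Return (ledger, hash) for the newest entry of a verify-checkpoints output file.

    Assumes the layout shown in the sample: a JSON list of [ledger, hash]
    pairs that ends with a [0, ""] terminator entry.
    """
    entries = [(int(seq), h) for seq, h in json.loads(text)]
    # Drop the [0, ""] terminator; it carries no hash.
    real = [(seq, h) for seq, h in entries if seq != 0]
    if not real:
        raise ValueError("no verified checkpoints in file")
    # Do not rely on ordering; pick the highest ledger number explicitly.
    return max(real, key=lambda e: e[0])

sample = """[
[53736575, "1de4bfa30f8af81716d2295b7c9f077afea250ddb88839345c13176de7b75e36"],
[53736511, "9f1bd24f21facc606b49216853c0e2162d55d2e3e898da96dd910ddd1ede784f"],
[0, ""]
]"""
seq, h = latest_checkpoint(sample)
print(seq, h)
```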

Append to existing file:

% src/stellar-core verify-checkpoints --trusted-hash-file=out.json --output-file=out2.json --conf=../stellar-core.cfg
Result:

% cat out2.json
[
[53736959, "4b1900cb4bbaa77e86e3c8abb33be966e24a84098acdbda3d57977f237c5b13e"],
[53736895, "a163415903fa39efb53e4c79198fa2857cdbb12f92cc64f0ac3bcd0e6a7f2cce"],
[53736831, "2977e0c5653960a11359552dd74508a17982a5ca422db961f809fc335cd17901"],
[53736767, "ff7d80daad82981c1512c0f296a9ff9902f7b9d1ffa8ec8ad02e588cca16a9fd"],
[53736703, "0fb92338560bfac48ebd78dac530735ca988009132846fd93e42c061caa8cc5f"],
[53736639, "ba407b9b13e077cf9fb0a1c277416e12c6ff6857a42beef62f5805a9fdeec8ce"],
[53736575, "1de4bfa30f8af81716d2295b7c9f077afea250ddb88839345c13176de7b75e36"],
[53736511, "9f1bd24f21facc606b49216853c0e2162d55d2e3e898da96dd910ddd1ede784f"],
[53736447, "80a3083ea9e987b48949c2ad33006a5e750f06c6836c4814d5a853cab6bac1e3"],
[53736383, "2363bc49669667aa28da768588b5be7f09dc8c69c5e20416d870748b3739509b"],
[0, ""]
]
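One way to sanity-check an appended file like the one above is to confirm that it keeps every entry of the trusted file unchanged at its tail and that consecutive checkpoints are 64 ledgers apart. The sketch below is illustrative only: the helper names `extends` and `evenly_spaced` are hypothetical, and the hashes are shortened placeholders rather than real values.

```python
CHECKPOINT_FREQUENCY = 64  # ledgers per checkpoint, matching the spacing in the output above

def extends(old, new):
    """True if `new` keeps every entry of `old` unchanged as its tail."""
    return len(new) >= len(old) and new[len(new) - len(old):] == old

def evenly_spaced(entries):
    """True if consecutive non-terminator entries are exactly one checkpoint apart."""
    seqs = [seq for seq, _ in entries if seq != 0]
    return all(a - b == CHECKPOINT_FREQUENCY for a, b in zip(seqs, seqs[1:]))

# Ledger numbers taken from the sample output; hashes shortened for the sketch.
old = [[53736447, "80a3"], [53736383, "2363"], [0, ""]]
new = [[53736575, "1de4"], [53736511, "9f1b"]] + old

print(extends(old, new), evenly_spaced(new))
```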

Usage of both `--from-ledger` and `--trusted-hash-file` -> ERROR

% src/stellar-core verify-checkpoints --trusted-hash-file=out2.json --output-file=out3.json --from-ledger=9999 --conf=../stellar-core.cfg --ll trace
Warning: running non-release version v22.0.0rc1-3-ge94e61395-dirty of stellar-core
2024-09-30T15:56:36.748 [default ERROR] Cannot specify both --from-ledger and --trusted-hash-file
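
The rule demonstrated above (the two flags cannot be combined) can be illustrated with a small stand-alone sketch. It uses Python's `argparse` purely as an analogue; it is not how stellar-core parses its command line, and the program name `verify-checkpoints-sketch` is made up.

```python
import argparse

def build_parser():
    """Build a toy parser in which the two starting-point flags exclude each other."""
    parser = argparse.ArgumentParser(prog="verify-checkpoints-sketch")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--from-ledger", type=int)
    group.add_argument("--trusted-hash-file")
    return parser

# Either flag alone parses fine.
args = build_parser().parse_args(["--from-ledger", "9999"])
print(args.from_ledger, args.trusted_hash_file)

# Passing both makes argparse print a usage error and exit with status 2.
```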

Performance

Time for verification of checkpoints --from-ledger=53737040 to LCL=53739327
Output: hashes for checkpoints 53737023 to 53739327, a total of 2304 ledgers = 2287 ledgers (from --from-ledger=53737040 to LCL=53739327) + 17 ledgers (from checkpoint 53737023 to --from-ledger=53737040):

time src/stellar-core verify-checkpoints --output-file=out4.json --from-ledger=53737040 --conf=../stellar-core.cfg

src/stellar-core verify-checkpoints --output-file=out4.json    15.22s user 1.25s system 8% cpu 3:25.09 total
  0.80s user 0.31s system 18% cpu 5.825 total

205 seconds / 2304 ledgers ≈ 0.09 seconds, i.e. roughly 90 milliseconds per ledger

Caveat: There is an overhead because the LCL is obtained from the network. On average we will wait half a checkpoint (32 ledgers) for the LCL to reach a checkpoint boundary (32 ledgers * 5 seconds = 160 seconds).
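
The figures in this section can be reproduced with a few lines of arithmetic. The sketch below only restates the numbers quoted above (checkpoint range, wall-clock time, a 64-ledger checkpoint and a 5-second close time); the variable names are illustrative.

```python
CHECKPOINT_FREQUENCY = 64   # ledgers per checkpoint
LEDGER_CLOSE_SECONDS = 5    # approximate close time used in the caveat

first_checkpoint = 53737023
lcl = 53739327
wall_clock_seconds = 3 * 60 + 25.09  # "3:25.09 total" from the time output

ledgers = lcl - first_checkpoint                      # span covered by the output file
checkpoints = ledgers // CHECKPOINT_FREQUENCY         # number of checkpoints verified
per_ledger_ms = 1000 * wall_clock_seconds / ledgers   # average cost per ledger
expected_wait = (CHECKPOINT_FREQUENCY // 2) * LEDGER_CLOSE_SECONDS  # half a checkpoint

print(ledgers, checkpoints, round(per_ledger_ms), expected_wait)
# prints: 2304 36 89 160
```

Note that the expected wait for a checkpoint boundary (160 seconds) is a large share of the 205-second total, so the per-ledger figure is best read as an upper bound on verification cost.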

Checklist

Reviewed the contributing document
Rebased on top of master (no merge commits)
Ran clang-format v8.0.0 (via make format or the Visual Studio extension)
Compiles
Ran all tests
If change impacts performance, include supporting evidence per the performance document

SirTyson

Thanks for this change, sorry we kept going back and forth so much in the design phase :(. I did a quick pass, but I think there are a couple of issues with the interface that need to be fixed, then I'll do another pass once things are working a bit better. In particular,

stellar-core --conf test.cfg verify-checkpoints --trusted-hash-file does-not-exist

crashes after syncing with the network, but it looks like this should work based on the help comment from --trusted-hash-file. Either the comment should be changed and this error check should happen on startup if this is intended behavior, or it should be addressed.

I'm also not quite sure what the intended interface for this is. It looks like in the doc, we have

stellar-core verify-checkpoints --conf=core.cfg --trusted-hash-file=path/to/verified.json

which takes in a previous file called path/to/verified.json, and at the end of the call updates path/to/verified.json such that it contains hashes up to the LCL. However, it looks like the interface has changed in this PR, where we take in

stellar-core verify-checkpoints --trusted-hash-file=path/to/verified.json   --output-file=path/to/verified2.json

where the output file is a new file which contains the hashes from path/to/verified.json. The issue is, this doesn't actually work as an append operation, as the --output-file must not be the same as --trusted-hash-file. To demonstrate this, I ran the following commands on testnet:

stellar-core --conf testnet.cfg verify-checkpoints --output-file out --from-ledger 249443

This command succeeded. After a few checkpoints passed, I then attempted to append to the file to catch up to lcl with

stellar-core --conf testnet.cfg verify-checkpoints --output-file out --trusted-hash-file out

which crashed. I doubt that Horizon operators will want to manage a collection of files, so we probably do want a true append operation.

While I found a couple of issues, I think it would be helpful to:

1. Do validity checking on startup. If we crash due to a file not existing that's fine, but this should happen immediately on startup and not after waiting for the network's next checkpoint ledger.
2. Take a step back and solidify what the interface should be. I know we've had some irl conversations back and forth and the expectations have been changing a lot throughout, but currently the design doc, commands.md doc, and command line "help" output all define different, mutually exclusive interfaces. I think this is making review and implementation a bit tricky.

ThomasBrady · 2024-10-02T23:19:55Z

> Thanks for this change, sorry we kept going back and forth so much in the design phase :(. I did a quick pass, but I think there's a couple of issues with the interface that need to be fixed, then I'll do another pass once things are working a bit better. In particular
>
> stellar-core --conf test.cfg verify-checkpoints --trusted-hash-file does-not-exist
>
> crashes after syncing with the network, but it looks like this should work based on the help comment from --trusted-hash-file. Either the comment should be changed and this error check should happen on startup if this is intended behavior, or it should be addressed.

Do you know what error was printed when you ran this? For me I get `2024-10-02T15:43:40.210 GAL3A [default FATAL] Got an exception: error opening output file`. If I specify a non-existent trusted hash file (with an output-file to write to), it verifies to genesis without raising an error.

I agree that the error reporting should happen earlier. I thought that calling .required() on the clara parser for --output-file would have raised an error immediately if that flag isn't provided, but that doesn't seem to be the case. I'll raise an error before connecting to the network if output-file isn't specified. If --trusted-hash-file does not exist, I think it should also result in an error being reported rather than silently verifying from genesis so I'll report that too.
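
The checks described here (fail before touching the network if --output-file is missing or the trusted hash file does not exist) could look roughly like the sketch below. It is a stand-alone Python illustration, not the stellar-core change itself; `preflight_errors` and its error strings are hypothetical.

```python
import os

def preflight_errors(output_file, trusted_hash_file=None):
    """Collect problems that can be detected before any network activity."""
    errors = []
    if not output_file:
        errors.append("--output-file must be specified")
    # A trusted hash file is optional, but if one is named it has to exist.
    if trusted_hash_file is not None and not os.path.isfile(trusted_hash_file):
        errors.append("--trusted-hash-file does not exist: " + trusted_hash_file)
    return errors

# Both problems are reported up front instead of after syncing.
for message in preflight_errors("", "does-not-exist"):
    print(message)
```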

> I'm also not quite sure what the intended interface for this is. It looks like in the doc, we have
>
> stellar-core verify-checkpoints --conf=core.cfg --trusted-hash-file=path/to/verified.json
>
> which takes in a previous file called path/to/verified.json, and at the end of the call updates path/to/verified.json such that it contains hashes up to the LCL. However, it looks like the interface has changed in this PR, where we take in
>
> stellar-core verify-checkpoints --trusted-hash-file=path/to/verified.json --output-file=path/to/verified2.json
>
> where the output file is a new file which contains the hashes from path/to/verified.json. The issue is, this doesn't actually work as an append operation, as the --output-file must not be the same as --trusted-hash-file. To demonstrate this, I ran the following commands on testnet:

Correct, the design was updated not to append to the trusted-hash-file implicitly. An output-file must be explicitly specified with all invocations. I'll modify the file output logic to write to a temporary file if the specified --output-file is equal to the --trusted-hash-file to support the append use case.
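
A common way to make "output file equals input file" safe is to write to a temporary file in the same directory and rename it over the target only after the write has succeeded. The sketch below illustrates that general pattern in Python; it is an assumption about the approach, not the code in this PR, and `write_hashes_atomically` is a hypothetical name.

```python
import json
import os
import tempfile

def write_hashes_atomically(path, entries):
    """Write `entries` as JSON to `path` via a temporary file in the same directory."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(entries, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes are on disk before renaming
        # Replaces the target in one step; both paths share a directory,
        # so they are on the same filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        # On any failure, leave the original file untouched and clean up.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this shape, a crash midway through the write leaves the previous trusted file intact, which also addresses the concern about corrupting the file being appended to.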

> stellar-core --conf testnet.cfg verify-checkpoints --output-file out --from-ledger 249443
>
> This command succeeded. After a few checkpoints passed, I then attempted to append to the file to catch up to lcl with
>
> stellar-core --conf testnet.cfg verify-checkpoints --output-file out --trusted-hash-file out
>
> which crashed. I doubt that Horizon operators will want to manage a collection of files, so we probably do want a true append operation.
>
> While I found a couple issues, I think it would be helpful to
>
> Validity checking on startup. If we crash due to a file not existing that's fine, but this should happen immediately on startup and not after waiting for the network's next checkpoint ledger.
>
> Take a step back and solidify what the interface should be. I know we've had some irl conversations back and forth and the expectations have been changing a lot throughout, but currently the design doc, commands.md doc, and command line "help" output all define different, mutually exclusive interfaces. I think this is making review and implementation a bit tricky.

I've spotted a typo in commands.md (--trusted-checkpoint-hashes should be --trusted-checkpoint-file), and there were example invocations in the design doc that erroneously included both --trusted-checkpoint-file and --from-ledger and omitted the mandatory --output-file argument. I've updated both. Is that all you were referring to, or are there other places where the interface differs?

SirTyson · 2024-10-03T01:11:05Z

> Do you know what error was printed when you ran this? For me I get 2024-10-02T15:43:40.210 GAL3A [default FATAL] Got an exception: error opening output file. If I specify a non-existent trusted hash file (with an output-file to write to), it verifies to genesis without raising an error.

Ya the error I was referring to was that one, with no output-file.

> If --trusted-hash-file does not exist, I think it should also result in an error being reported rather than silently verifying from genesis so I'll report that too.

Sounds like a good idea!

> Is that all you were referring to or are there other issues with the interface differing?

That definitely cleans up most of it, but I think there's still an issue in the command help message for "--trusted-hash-file":

        "file containing trusted hashes, generated by a previous call to "
        "verify-checkpoints or a non-existent file to generate a new one");

I don't think a non-existent file should be valid, and we should probably just crash immediately on startup in this case.

SirTyson

Overall working much better! A few small issues regarding graceful failure and making sure we don't corrupt output files.

SirTyson

Looks good! Just a few final cleanups and one edge case question.

SirTyson

LGTM!

…llow for incremental verification of checkpoints.

ThomasBrady changed the title ~~WIP: Incremental verify checkpoints~~ Incremental verify checkpoints Sep 30, 2024

ThomasBrady force-pushed the incremental-verify-checkpoints branch from 5abf2eb to c4810de Compare October 1, 2024 01:05

ThomasBrady requested review from marta-lokhova and SirTyson October 1, 2024 01:17

ThomasBrady mentioned this pull request Oct 2, 2024

Horizon: missing information passed to captive-core when configured to run "on disk" stellar/go#4538

Open

SirTyson requested changes Oct 2, 2024

ThomasBrady commented Oct 3, 2024
