Optimize HashPrefixStore #28099

Open
wants to merge 2 commits into
base: master

Collaborator

@atuchin-m atuchin-m commented Mar 11, 2025

The PR:

  • reduces the size of the rewards publisher prefix storage by using a different storage structure;
  • uses FlatBuffers as the container.

The structure:
Each prefix is split into its first 2 bytes (used as an index) and the remaining suffix.
The first 2 bytes index into the 'offsets' array, which has 256*256 entries,
one for every possible 2-byte value.
Each offset points to the position in 'all_suffixes' where the suffixes
for that index begin.
Suffixes are stored contiguously, so the suffixes for one index can be looked up with a binary search.
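
A minimal sketch of the layout described above, in plain C++ rather than FlatBuffers. The names `PrefixIndex`, `Build`, `Contains` and `BucketOf` are hypothetical and not taken from the PR. It assumes the input prefixes are sorted, all have the same size (greater than 2), and that offsets are counted in suffixes rather than bytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

constexpr size_t kBuckets = 256 * 256;  // one bucket per possible 2-byte value

// Hypothetical container: not the PR's actual class.
struct PrefixIndex {
  size_t suffix_size = 0;         // prefix_size - 2
  std::vector<uint32_t> offsets;  // kBuckets entries, counted in suffixes
  std::string all_suffixes;       // suffixes stored back to back
};

// The first 2 bytes of a prefix select the bucket.
size_t BucketOf(std::string_view prefix) {
  return 256u * static_cast<uint8_t>(prefix[0]) +
         static_cast<uint8_t>(prefix[1]);
}

// `sorted_prefixes` is a concatenation of sorted, fixed-size prefixes.
PrefixIndex Build(std::string_view sorted_prefixes, size_t prefix_size) {
  PrefixIndex out;
  out.suffix_size = prefix_size - 2;
  out.offsets.assign(kBuckets, 0);
  std::vector<uint32_t> counts(kBuckets, 0);
  const size_t count = sorted_prefixes.size() / prefix_size;
  for (size_t i = 0; i < count; ++i) {
    const std::string_view p =
        sorted_prefixes.substr(i * prefix_size, prefix_size);
    ++counts[BucketOf(p)];
    // Sorted input keeps each bucket contiguous and internally sorted.
    out.all_suffixes.append(p.substr(2));
  }
  uint32_t running = 0;
  for (size_t b = 0; b < kBuckets; ++b) {
    out.offsets[b] = running;  // where bucket b begins
    running += counts[b];
  }
  return out;
}

bool Contains(const PrefixIndex& idx, std::string_view prefix) {
  if (prefix.size() != idx.suffix_size + 2) return false;
  const size_t b = BucketOf(prefix);
  const std::string_view suffix = prefix.substr(2);
  size_t lo = idx.offsets[b];
  // The bucket ends where the next one begins (or at the end of the data).
  size_t hi = (b + 1 < kBuckets)
                  ? idx.offsets[b + 1]
                  : idx.all_suffixes.size() / idx.suffix_size;
  while (lo < hi) {  // binary search inside one bucket
    const size_t mid = lo + (hi - lo) / 2;
    const std::string_view cand(
        idx.all_suffixes.data() + mid * idx.suffix_size, idx.suffix_size);
    if (cand == suffix) return true;
    if (cand < suffix) {
      lo = mid + 1;
    } else {
      hi = mid;
    }
  }
  return false;
}
```

With the prefixes "aaaa", "aabb", "abcd" and "zzzz", `Build("aaaaaabbabcdzzzz", 4)` puts two suffixes in the "aa" bucket and one each in "ab" and "zz"; a lookup then touches only the bucket selected by the first two bytes.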

Memory usage before = 1.7M prefixes * 4 bytes ≈ 6.5 MB
Memory usage after = header + 256 KB (offsets) + 1.7M * 2 bytes (prefix_count * (prefix_size - 2)) ≈ 3.5 MB
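
The arithmetic can be checked at compile time. This is only a sketch: the prefix count is the approximate 1.7M figure quoted above, the 4-byte offset entry is an assumption inferred from 256 KB / 65536 entries, and the FlatBuffers header is left out.

```cpp
#include <cstddef>
#include <cstdint>

constexpr size_t kPrefixCount = 1'700'000;        // approximate, from the description
constexpr size_t kPrefixSize = 4;                 // bytes per stored prefix
constexpr size_t kBuckets = 256 * 256;            // entries in 'offsets'
constexpr size_t kOffsetSize = sizeof(uint32_t);  // assumption: 4 bytes per offset

// Old layout: every prefix stored in full.
constexpr size_t BeforeBytes() {
  return kPrefixCount * kPrefixSize;
}

// New layout: offsets table plus suffixes only (header not counted).
constexpr size_t AfterBytes() {
  return kBuckets * kOffsetSize + kPrefixCount * (kPrefixSize - 2);
}

static_assert(kBuckets * kOffsetSize == 262'144);  // 256 KB
static_assert(BeforeBytes() == 6'800'000);         // about 6.5 MB (2^20 bytes each)
static_assert(AfterBytes() == 3'662'144);          // about 3.5 MB
```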

When this version is installed, the old RewardsCreators.db will be dropped and the prefixes will be re-downloaded.

Submitter Checklist:

  • I confirm that no security/privacy review is needed and no other type of reviews are needed, or that I have requested them
  • There is a ticket for my issue
  • Used Github auto-closing keywords in the PR description above
  • Wrote a good PR/commit description
  • Squashed any review feedback or "fixup" commits before merge, so that history is a record of what happened in the repo, not your PR
  • Added appropriate labels (QA/Yes or QA/No; release-notes/include or release-notes/exclude; OS/...) to the associated issue
  • Checked the PR locally:
    • npm run test -- brave_browser_tests, npm run test -- brave_unit_tests
    • npm run presubmit, npm run gn_check, npm run tslint
  • Ran git rebase master (if needed)

Reviewer Checklist:

  • A security review is not needed, or a link to one is included in the PR description
  • New files have MPL-2.0 license header
  • Adequate test coverage exists to prevent regressions
  • Major classes, functions and non-trivial code blocks are well-commented
  • Changes in component dependencies are properly reflected in gn
  • Code follows the style guide
  • Test plan is specified in PR before merging

After-merge Checklist:

Test Plan:

@atuchin-m atuchin-m self-assigned this Mar 11, 2025
@atuchin-m atuchin-m requested a review from a team as a code owner March 11, 2025 23:20
// Serialize the vectors and the main table.
auto flat_offsets = builder.CreateVector(offsets.data(), offsets.size());
auto all_suffixes = builder.CreateVector(
reinterpret_cast<const uint8_t*>(all_suffixes_data.data()),
Contributor

reported by reviewdog 🐶
[semgrep] Using reinterpret_cast against some data types may lead to undefined behaviour. In general, when needing to do these conversions, check how Chromium upstream does them. Most of the times a reinterpret_cast is wrong and there's no guarantee the compiler will generate the code that you thought it would.

Source: https://github.com/brave/security-action/blob/main/assets/semgrep_rules/client/reinterpret_cast.yaml


Cc @stoletheminerals @thypon @cdesouza-chromium

Collaborator

Flatbuffers is sad... can any of this be realistically spanified?

Collaborator Author

@cdesouza-chromium
Flatbuffers supports span and I use it here.
The problem is that FB doesn't support char for arrays, only int8_t or uint8_t.
Because the surrounding code uses std::string_view, we have to convert span<uint8_t> to span<char>/string_view somewhere.

I've managed to remove a reinterpret_cast. I also guarded the rest with a static_assert on the source type.

@atuchin-m atuchin-m force-pushed the optimize-HashPrefixStore branch from efadeed to 1b367cc March 11, 2025 23:49
@atuchin-m atuchin-m requested a review from a team as a code owner March 11, 2025 23:49
std::string hash = crypto::SHA256HashString(value);
hash.resize(prefix_size_);
static_assert(std::is_same_v<const uint8_t*, decltype(all_suffixes.data())>);
std::string_view data(reinterpret_cast<const char*>(all_suffixes.data()),
Contributor

reported by reviewdog 🐶
[semgrep] Using reinterpret_cast against some data types may lead to undefined behaviour. In general, when needing to do these conversions, check how Chromium upstream does them. Most of the times a reinterpret_cast is wrong and there's no guarantee the compiler will generate the code that you thought it would.

Source: https://github.com/brave/security-action/blob/main/assets/semgrep_rules/client/reinterpret_cast.yaml


Cc @stoletheminerals @thypon @cdesouza-chromium

Collaborator

@cdesouza-chromium cdesouza-chromium Mar 12, 2025

We don't need the reinterpret_cast here. It should be something like:

auto data = base::as_string_view(base::span(all_suffixes));

Collaborator

@cdesouza-chromium cdesouza-chromium Mar 12, 2025

Example corrected above. But I suspect it could be even simpler as base::as_string_view(all_suffixes);. You're gonna need to test.

Collaborator Author

There is no as_string_view() in base::span<const uint8_t>. It's available only for span`

Comment on lines +50 to +51
const uint16_t first_bytes =
256u * static_cast<uint8_t>(hash[0]) + static_cast<uint8_t>(hash[1]);
Collaborator

Are these ops safe? Is there any chance this could overflow?
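
One way to reason about the question is to check the extreme values at compile time. The sketch below mirrors the expression from the diff in a hypothetical `FirstBytes` function; it assumes the hash always has at least 2 bytes, which is not shown in the excerpt.

```cpp
#include <cstdint>

// Hypothetical wrapper around the expression under review.
constexpr uint16_t FirstBytes(char b0, char b1) {
  // Each char is first narrowed to 0..255, then the arithmetic runs in
  // unsigned int, so the largest possible value is 256 * 255 + 255.
  return static_cast<uint16_t>(256u * static_cast<uint8_t>(b0) +
                               static_cast<uint8_t>(b1));
}

static_assert(FirstBytes('\x00', '\x00') == 0);
static_assert(FirstBytes('\x01', '\x02') == 258);
// The maximum is exactly the largest uint16_t value, so nothing is truncated.
static_assert(FirstBytes('\xFF', '\xFF') == 65535);
```

The `static_cast<uint8_t>` on each byte matters: without it, a signed `char` holding a value above 0x7F would be negative before the multiplication.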

for (size_t i = 0; i < prefix_count.value(); i++) {
const auto prefix = prefixes_view.substr(i * prefix_size, prefix_size);
const auto [index, suffix] = GetIndexAndSuffix(prefix);
suffix_arrays[index].insert(suffix_arrays[index].end(), suffix.begin(),
Collaborator

We are doing double searches with suffix_arrays[index] being repeated twice.
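
A sketch of what a single lookup could look like, assuming `suffix_arrays` is a map from index to a string of suffix bytes; the container type is not visible in the excerpt, and `AppendSuffix` is a hypothetical helper.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <string_view>

// Hypothetical helper: appends one suffix to the bucket for `index`.
void AppendSuffix(std::map<uint16_t, std::string>& suffix_arrays,
                  uint16_t index,
                  std::string_view suffix) {
  // Look the bucket up once and reuse the reference.
  std::string& bucket = suffix_arrays[index];
  bucket.insert(bucket.end(), suffix.begin(), suffix.end());
}
```

If `suffix_arrays` is a plain vector or array, the second `operator[]` is cheap anyway and the reference mainly helps readability.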

const std::string_view prefixes_view(prefixes);

for (size_t i = 0; i < prefix_count.value(); i++) {
const auto prefix = prefixes_view.substr(i * prefix_size, prefix_size);
Collaborator

This is pretty tricky code. TBH, I'm leaning towards preferring the simpler current approach for now. What would you think about leaving as-is for a couple of releases, and then introducing the optimization later, if necessary?

Also, since we're changing the file format, we're going to need another pref migration to reset the fetch timer.
