
Rework crawl page migration #2412

Merged

ikreymer merged 70 commits into main from issue-2406-crawl-migration-rework on Feb 20, 2025
Conversation

tw4l (Member) commented on Feb 18, 2025

Fixes #2406

Converts migration 0042 to launch a background job (parallelized across several pods) that migrates all crawls by optimizing their pages and setting version: 2 on each crawl when complete.

Also optimizes MongoDB queries for better performance.

Migration Improvements:

  • Add isMigrating and version fields to BaseCrawl (see the model sketch after this list)
  • Add a new background job type for the migration, with an accompanying migration_job.yaml template that allows for parallelization
  • Add a new API endpoint to launch this crawl migration job, and ensure superusers have list and retry endpoints that work with background jobs not tied to a specific org
  • Rework background job models and methods now that not all background jobs are tied to a single org
  • Ensure new crawls and uploads have version set to 2
  • Modify crawl and collection replay.json endpoints to only include fields for replay optimization (initialPages, pagesQueryUrl, preloadResources) if all relevant crawls/uploads have version set to 2
  • Remove distinct calls from migration pathways
  • Consolidate collection stats recomputation into a single function
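
For illustration, here is a minimal sketch of the new fields on a shared crawl base model, using pydantic; the surrounding field names are assumptions, not the actual Browsertrix model definitions:

```python
# Hedged sketch: how version/isMigrating might be declared on a shared
# crawl/upload base model. "id" and "oid" are placeholder fields.
from typing import Optional
from pydantic import BaseModel


class BaseCrawl(BaseModel):
    """Base model shared by crawls and uploads (simplified for illustration)."""

    id: str
    oid: str

    # Page-data schema version: new crawls and uploads are written with
    # version=2; older ones are upgraded by the optimize-pages background job.
    version: Optional[int] = None

    # True while the background job is rewriting this crawl's pages, so
    # replay endpoints can skip it until migration completes.
    isMigrating: Optional[bool] = False
```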

Query Optimizations:

  • Remove all uses of $group and $facet
  • Optimize /replay.json endpoints to precompute preload_resources and avoid fetching the crawl list twice
  • Optimize the /collections endpoint by not fetching resources
  • Rename /urls -> /pageUrlCounts and avoid $group; instead sort with an index, either by seed + ts or by url, to get the top matches
  • Use $gte instead of $regex to get prefix matches on URL
  • Use $text instead of $regex for text search on title (see the query sketch after this list)
  • Remove total from /pages and /pageUrlCounts responses by not using $facet
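
The two search changes are sketched below; the collection and field names (pages, crawl_id, url, title) are placeholders for illustration, not the exact Browsertrix schema:

```python
# Hedged sketch of the $gte prefix match and $text title search described
# above, using pymongo.
from pymongo import ASCENDING, TEXT, MongoClient

client = MongoClient("mongodb://localhost:27017")
pages = client["example_db"]["pages"]

# Indexes assumed by the queries below: a compound index for the URL range
# scan and a text index for title search.
pages.create_index([("crawl_id", ASCENDING), ("url", ASCENDING)])
pages.create_index([("title", TEXT)])


def url_prefix_matches(crawl_id: str, prefix: str, limit: int = 10):
    """Prefix matches on URL via $gte plus an index-backed sort, instead of
    an unanchored $regex scan; matching URLs sort immediately at or after
    the prefix, so the first `limit` results are the top matches."""
    cursor = (
        pages.find({"crawl_id": crawl_id, "url": {"$gte": prefix}})
        .sort("url", ASCENDING)
        .limit(limit)
    )
    return list(cursor)


def title_text_search(crawl_id: str, search: str, limit: int = 10):
    """Text search on title via $text, sorted by relevance score, instead of
    a case-insensitive $regex scan."""
    cursor = (
        pages.find(
            {"crawl_id": crawl_id, "$text": {"$search": search}},
            {"score": {"$meta": "textScore"}},
        )
        .sort([("score", {"$meta": "textScore"})])
        .limit(limit)
    )
    return list(cursor)
```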

Remaining work:

  • Testing
  • Add documentation on migration, how to check if it succeeds, how to restart if necessary (this should also be linked to from the release notes!)

ikreymer and others added 21 commits February 18, 2025 17:22
- use group instead of distinct in unique page stats, page filename migration
- consolidate collection recompute stats into single function
- bump to 1.14.0-beta.2
add timeout for backend gunicorn worker

update
Also adds isMigrating, which we'll use in migration 0042
- Adds new optimize pages background job that updates crawl pages
and sets version on updated crawls to 2
- New job uses a new migration_job.yaml template with parallelism
set to 3 (see the illustrative sketch after this commit message)
- Update BackgroundJob model and some ops methods to allow for
creating and retrying a background job with no oid
- Add new API endpoint to retry one specific background job that
isn't tied to a specific org (superuser-only)
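
The actual chart renders the migration_job.yaml template; purely to illustrate the parallelism, here is a sketch of an equivalent batch/v1 Job created with the official kubernetes Python client, where the image, namespace, and entrypoint are hypothetical:

```python
# Hedged illustration only: equivalent of a parallelized migration Job.
# The real deployment uses the migration_job.yaml chart template.
from kubernetes import client, config


def launch_migration_job(namespace: str = "btrix", parallelism: int = 3):
    """Create a Job whose pods concurrently pull crawls that still need
    their pages optimized (version != 2) until none remain."""
    config.load_incluster_config()  # use config.load_kube_config() outside the cluster

    container = client.V1Container(
        name="optimize-pages",
        image="example/backend:latest",                      # hypothetical image
        command=["python", "-m", "example.migrate_pages"],   # hypothetical entrypoint
    )
    template = client.V1PodTemplateSpec(
        spec=client.V1PodSpec(restart_policy="Never", containers=[container]),
    )
    job = client.V1Job(
        metadata=client.V1ObjectMeta(generate_name="optimize-pages-"),
        spec=client.V1JobSpec(
            parallelism=parallelism,   # several pods migrate crawls at once
            completions=parallelism,   # assumption; a work-queue style Job also works
            backoff_limit=3,
            template=template,
        ),
    )
    return client.BatchV1Api().create_namespaced_job(namespace, job)
```
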
Now that we have background jobs that aren't tied to a specific org,
much of the background_jobs module needs to be reworked to account
for that.

That will take some time, so for now, to let us test the migration,
we just pass the default org to retry_background_job if the job
doesn't have an oid set.
Only include initialPages, pagesQueryUrl, and preloadResources in
replay.json responses for crawls and collections if all of the
relevant crawls have version set to 2.
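
A minimal sketch of that gating logic; the function and parameter names are assumptions, not the actual implementation:

```python
# Hedged sketch: attach replay-optimization fields only when every relevant
# crawl/upload has been migrated to version 2.
from typing import Any, Dict, List, Optional


def add_replay_optimization_fields(
    out: Dict[str, Any],
    crawl_versions: List[Optional[int]],
    initial_pages: List[Dict[str, Any]],
    pages_query_url: str,
    preload_resources: List[Dict[str, Any]],
) -> Dict[str, Any]:
    all_migrated = bool(crawl_versions) and all(v == 2 for v in crawl_versions)
    if all_migrated:
        out["initialPages"] = initial_pages
        out["pagesQueryUrl"] = pages_query_url
        out["preloadResources"] = preload_resources
    return out
```
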
@tw4l tw4l requested a review from ikreymer February 18, 2025 22:35
- move preloadResources to be precomputed in update_collection_counts_and_tags()
- only query list of crawls once, reuse ids
- remove facet from list_collection_pages(), support passing in crawlIds
- rename /urls -> /pageSnapshots
- remove facet, just return results
- frontend: update to new model, don't check total, just look at results list length
- rename list_collection_pages -> list_replay_query_pages to be used for replay page querying for collection or single crawl, no page totals
- use list_replay_query_pages in crawl /replay.json
ikreymer and others added 19 commits February 19, 2025 13:32
… dialog until open (#2414)

Delays rendering contents of the collection settings dialog until
actually needed. Various fetches that internal components were running
were causing slowdowns in other parts of the app, so this should resolve
some of that.
- restore pages response models to include 'items' again
- rename endpoint /pageSnapshots -> /pageUrlCounts
- fix tests to not check 'total'
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Instead of regex search on url and title:
- If search string starts with https:// or http://, do a prefix search
on URL
- Otherwise, do a $text search on title, sort by match score.

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
add migration_jobs_scale to customize this setting
chart values: document rerun_from_migration and migration_jobs_scale
chart: remove unneeded background worker timeout customization
ikreymer (Member) left a comment:

🚀

@ikreymer ikreymer merged commit f8fb2d2 into main Feb 20, 2025
29 checks passed
@ikreymer ikreymer deleted the issue-2406-crawl-migration-rework branch February 20, 2025 23:26
Successfully merging this pull request may close these issues.

[Feature]: Convert migration 0042 to background job / versioned crawl objects