
Rework crawl page migration #2412

Merged

ikreymer merged 70 commits into main from issue-2406-crawl-migration-rework on Feb 20, 2025
Conversation

tw4l (Member) commented on Feb 18, 2025

Fixes #2406

Converts migration 0042 to launch a background job (parallelized across several pods) that migrates all crawls by optimizing their pages and setting version: 2 on each crawl when complete.

Also optimizes MongoDB queries for better performance.

Migration Improvements:

  • Add isMigrating and version fields to BaseCrawl (see the model sketch after this list)
  • Add a new background job type for the migration, with an accompanying migration_job.yaml template that allows for parallelization
  • Add a new API endpoint to launch this crawl migration job, and ensure superusers have list and retry endpoints that work with background jobs not tied to a specific org
  • Rework background job models and methods now that not all background jobs are tied to a single org
  • Ensure new crawls and uploads have version set to 2
  • Modify crawl and collection replay.json endpoints to only include fields for replay optimization (initialPages, pagesQueryUrl, preloadResources) if all relevant crawls/uploads have version set to 2
  • Remove distinct calls from migration pathways
  • Consolidate collection stats recomputation into a single function
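
For illustration, here is a minimal sketch of the new fields on a shared crawl base model, using pydantic; the surrounding field names are assumptions, not the actual Browsertrix model definitions:

```python
# Hedged sketch: how version/isMigrating might be declared on a shared
# crawl/upload base model. "id" and "oid" are placeholder fields.
from typing import Optional
from pydantic import BaseModel


class BaseCrawl(BaseModel):
    """Base model shared by crawls and uploads (simplified for illustration)."""

    id: str
    oid: str

    # Page-data schema version: new crawls and uploads are written with
    # version=2; older ones are upgraded by the optimize-pages background job.
    version: Optional[int] = None

    # True while the background job is rewriting this crawl's pages, so
    # replay endpoints can skip it until migration completes.
    isMigrating: Optional[bool] = False
```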

Query Optimizations:

  • Remove all uses of $group and $facet
  • Optimize /replay.json endpoints to precompute preload_resources and avoid fetching the crawl list twice
  • Optimize the /collections endpoint by not fetching resources
  • Rename /urls -> /pageUrlCounts and avoid $group; instead sort with an index, either by seed + ts or by url, to get the top matches
  • Use $gte instead of $regex to get prefix matches on URL
  • Use $text instead of $regex for text search on title (see the query sketch after this list)
  • Remove total from /pages and /pageUrlCounts responses by not using $facet
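
The two search changes are sketched below; the collection and field names (pages, crawl_id, url, title) are placeholders for illustration, not the exact Browsertrix schema:

```python
# Hedged sketch of the $gte prefix match and $text title search described
# above, using pymongo.
from pymongo import ASCENDING, TEXT, MongoClient

client = MongoClient("mongodb://localhost:27017")
pages = client["example_db"]["pages"]

# Indexes assumed by the queries below: a compound index for the URL range
# scan and a text index for title search.
pages.create_index([("crawl_id", ASCENDING), ("url", ASCENDING)])
pages.create_index([("title", TEXT)])


def url_prefix_matches(crawl_id: str, prefix: str, limit: int = 10):
    """Prefix matches on URL via $gte plus an index-backed sort, instead of
    an unanchored $regex scan; matching URLs sort immediately at or after
    the prefix, so the first `limit` results are the top matches."""
    cursor = (
        pages.find({"crawl_id": crawl_id, "url": {"$gte": prefix}})
        .sort("url", ASCENDING)
        .limit(limit)
    )
    return list(cursor)


def title_text_search(crawl_id: str, search: str, limit: int = 10):
    """Text search on title via $text, sorted by relevance score, instead of
    a case-insensitive $regex scan."""
    cursor = (
        pages.find(
            {"crawl_id": crawl_id, "$text": {"$search": search}},
            {"score": {"$meta": "textScore"}},
        )
        .sort([("score", {"$meta": "textScore"})])
        .limit(limit)
    )
    return list(cursor)
```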

Remaining work:

  • Testing
  • Add documentation on migration, how to check if it succeeds, how to restart if necessary (this should also be linked to from the release notes!)

ikreymer and others added 21 commits February 18, 2025 17:22
- use group instead of distinct in unique page stats, page filename migration
- consolidate collection recompute stats into single function
- bump to 1.14.0-beta.2
add timeout for backend gunicorn worker

update
Also adds isMigrating, which we'll use in migration 0042
- Adds new optimize pages background job that updates crawl pages
and sets version on updated crawls to 2
- New job uses a new migration_job.yaml template with parallelism
set to 3 (see the illustrative sketch after this commit message)
- Update BackgroundJob model and some ops methods to allow for
creating and retrying a background job with no oid
- Add new API endpoint to retry one specific background job that
isn't tied to a specific org (superuser-only)
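
The actual chart renders the migration_job.yaml template; purely to illustrate the parallelism, here is a sketch of an equivalent batch/v1 Job created with the official kubernetes Python client, where the image, namespace, and entrypoint are hypothetical:

```python
# Hedged illustration only: equivalent of a parallelized migration Job.
# The real deployment uses the migration_job.yaml chart template.
from kubernetes import client, config


def launch_migration_job(namespace: str = "btrix", parallelism: int = 3):
    """Create a Job whose pods concurrently pull crawls that still need
    their pages optimized (version != 2) until none remain."""
    config.load_incluster_config()  # use config.load_kube_config() outside the cluster

    container = client.V1Container(
        name="optimize-pages",
        image="example/backend:latest",                      # hypothetical image
        command=["python", "-m", "example.migrate_pages"],   # hypothetical entrypoint
    )
    template = client.V1PodTemplateSpec(
        spec=client.V1PodSpec(restart_policy="Never", containers=[container]),
    )
    job = client.V1Job(
        metadata=client.V1ObjectMeta(generate_name="optimize-pages-"),
        spec=client.V1JobSpec(
            parallelism=parallelism,   # several pods migrate crawls at once
            completions=parallelism,   # assumption; a work-queue style Job also works
            backoff_limit=3,
            template=template,
        ),
    )
    return client.BatchV1Api().create_namespaced_job(namespace, job)
```
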
Now that we have background jobs that aren't tied to a specific org,
much of the background_jobs module needs to be reworked to account
for that.

That will take some time, so for now, to let us test the migration,
we just pass the default org to retry_background_job if the job
doesn't have an oid set.
Only include initialPages, pagesQueryUrl, and preloadResources in
replay.json responses for crawls and collections if all of the
relevant crawls have version set to 2.
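
A minimal sketch of that gating logic; the function and parameter names are assumptions, not the actual implementation:

```python
# Hedged sketch: attach replay-optimization fields only when every relevant
# crawl/upload has been migrated to version 2.
from typing import Any, Dict, List, Optional


def add_replay_optimization_fields(
    out: Dict[str, Any],
    crawl_versions: List[Optional[int]],
    initial_pages: List[Dict[str, Any]],
    pages_query_url: str,
    preload_resources: List[Dict[str, Any]],
) -> Dict[str, Any]:
    all_migrated = bool(crawl_versions) and all(v == 2 for v in crawl_versions)
    if all_migrated:
        out["initialPages"] = initial_pages
        out["pagesQueryUrl"] = pages_query_url
        out["preloadResources"] = preload_resources
    return out
```
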
@tw4l tw4l requested a review from ikreymer February 18, 2025 22:35
- move preloadResources to be precomputed in update_collection_counts_and_tags()
- only query list of crawls once, reuse ids
- remove facet from list_collection_pages(), support passing in crawlIds
- rename /urls -> /pageSnapshots
- remove facet, just return results
- frontend: update to new model, don't check total, just look at results list length
- rename list_collection_pages -> list_replay_query_pages to be used for replay page querying for collection or single crawl, no page totals
- use list_replay_query_pages in crawl /replay.json
ikreymer and others added 19 commits February 19, 2025 13:32
… dialog until open (#2414)

Delays rendering contents of the collection settings dialog until
actually needed. Various fetches that internal components were running
were causing slowdowns in other parts of the app, so this should resolve
some of that.
- restore pages response models to include 'items' again
- rename endpoint /pageSnapshots -> /pageUrlCounts
- fix tests to not check 'total'
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Instead of regex search on url and title:
- If search string starts with https:// or http://, do a prefix search
on URL
- Otherwise, do a $text search on title, sort by match score.

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
add migration_jobs_scale to customize this setting
chart values: document rerun_from_migration and migration_jobs_scale
chart: remove unneeded background worker timeout customization
ikreymer (Member) left a comment:

🚀

@ikreymer ikreymer merged commit f8fb2d2 into main Feb 20, 2025
29 checks passed
@ikreymer ikreymer deleted the issue-2406-crawl-migration-rework branch February 20, 2025 23:26
Successfully merging this pull request may close these issues.

[Feature]: Convert migration 0042 to background job / versioned crawl objects