Org-wide filename template for created/downloaded/ WARC/WACZ-files #412

Klindten · 2022-12-13T13:21:08Z

An input/template where you can define how files are named, for everyone, i.e. easy to use.
Also the possibility to add a WARC field for the user, like in the Conifer service.

At KB we need to name files webrec_"collectionname" (an added date would be fine). Also, since some search/discovery tools, like SolrWayback, have case-sensitive search, it would be great to have a way to customize which characters are allowed in the WARC file name.

ldko · 2023-02-07T18:05:41Z

For consistency with other WARCs we process at UNT, I would also find it helpful to be able to name the created WARC files with a configured pattern, similar to pywb, where in config.yml you can configure:

recorder:
   filename_template: UNT-{timestamp}-{hostname}-{random}.warc.gz
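For illustration, here is a minimal Python sketch of how a template like the one above could be expanded. This is not pywb's actual implementation: the function name `expand_template`, the 14-digit timestamp format, and the 8-character random suffix are all assumptions made for the example.

```python
import secrets
import socket
from datetime import datetime, timezone


def expand_template(template, now=None, hostname=None):
    """Fill {timestamp}, {hostname} and {random} in a filename template.

    Hypothetical helper: the placeholder formats are assumptions,
    not pywb's real behaviour.
    """
    now = now or datetime.now(timezone.utc)
    return template.format(
        timestamp=now.strftime("%Y%m%d%H%M%S"),     # assumed 14-digit timestamp
        hostname=hostname or socket.gethostname(),  # machine writing the WARC
        random=secrets.token_hex(4),                # 8 hex chars to avoid clashes
    )


name = expand_template(
    "UNT-{timestamp}-{hostname}-{random}.warc.gz",
    now=datetime(2023, 2, 7, 18, 5, 41, tzinfo=timezone.utc),
    hostname="crawler01",
)
print(name)
```

The random part differs on every call, so only the fixed prefix, timestamp, hostname, and extension are predictable.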

Klindten · 2023-11-07T10:42:13Z

Any news on this? We still need this functionality to test more and go into production.

Shrinks99 · 2023-11-07T18:10:12Z

@Klindten Not much to report unfortunately. We're busy working on QA features! As this is your issue, when it gets updated you'll be the first to know. ;)

I didn't want to completely leave you hanging though, and admittedly this has also been taking up space in my head, so I took a few minutes today to create a mockup of where this might fit into the Org Settings page.

Would something like this meet your needs? What fields are required? What might I be missing?

Also worth noting that we may go with a simpler text-based customization system initially as that might be faster to get out the door!

Klindten · 2023-11-08T13:39:37Z

@Shrinks99 Thanks for getting back to me.

I think we should set up a short meeting with you, Colin, Tue, and me asap; this will be easier than ping-pong here. I'll try to find a good time. Where are you based, so it won't be too early? Do other people from Webrec need to be in on this?

Here are our initial thoughts:

The file naming template should apply to both WACZ and WARC files:

prefix-browsertrix_crawlname-browsertrix_created_by(initials?)-browsertrix_crawltimestamp(YYYYMMDDHHMM).warc.gz

prefix-browsertrix_crawlname-browsertrix_created_by(initials?)-browsertrix_crawltimestamp(YYYYMMDDHHMM).wacz.gz

e.g.

webrec-tv2_tema_taleban_afganistan-jens_moeller-202308181556.warc.gz

webrec-tv2_tema_taleban_afganistan-jens_moeller-202308181556.wacz.gz

Only the prefix is customized/entered by the user (once and for all).
(The prefix is at the organisational level and should be set automatically.)

Could also be. CSR: netarkivet_${crawl_name}${username}${timestamp}.warc.gz
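The proposed pattern can be sketched with Python's `string.Template`, which uses the same `${name}` placeholder style as the last example. This is only an illustration of the proposal, not Browsertrix code: the names `NAME_TEMPLATE` and `build_filename` are hypothetical, and the timestamp is assumed to be year, month, day, hour, minute, which is what the 12-digit example value `202308181556` suggests.

```python
from datetime import datetime
from string import Template

# Hypothetical template mirroring the proposal:
# prefix-crawlname-created_by-timestamp.extension
NAME_TEMPLATE = Template("${prefix}-${crawl_name}-${created_by}-${timestamp}.${ext}")


def build_filename(prefix, crawl_name, created_by, started, ext="warc.gz"):
    """Build a file name from an org-level prefix and per-crawl values."""
    return NAME_TEMPLATE.substitute(
        prefix=prefix,                             # set once per organisation
        crawl_name=crawl_name,
        created_by=created_by,
        timestamp=started.strftime("%Y%m%d%H%M"),  # assumed 12-digit format
        ext=ext,
    )


example = build_filename(
    "webrec",
    "tv2_tema_taleban_afganistan",
    "jens_moeller",
    datetime(2023, 8, 18, 15, 56),
)
print(example)
```

`Template.substitute` raises `KeyError` when a placeholder has no value, which would surface a misconfigured template early instead of producing a half-filled file name.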

- supports setting 'WARC_PREFIX' env var in browsertrix crawler (requires crawler 1.0.0-beta.4 or higher)
- prefix set to <org slug>-<slug [crawl name | first seed host]>
- using either crawl name, if provided, or host name of first seed. both are converted to slug (lowercase alphanum, separated by dashes)
- fixes #412

ikreymer · 2024-02-22T02:28:12Z

The proposed solution for WARCs is implemented in #1537. The WARCs will start with the prefix: <org slug>-<slug [crawl name | first seed host]>-.

WACZs are stored in S3 following a specific convention, so a bit harder to change, but very easy to simply rename after downloading / download to a custom named file simply by doing 'Save Link As...'. Additionally, collections offer a way to have custom WACZ name based on the collection and are the recommended approach for curation.

…/ First Seed URL. (#1537)

Supports setting WARC prefix for WARCs inside WACZ to `<org slug>-<slug [crawl name | first seed host]>`.
- Prefix set via WARC_PREFIX env var, supported in browsertrix-crawler 1.0.0-beta.4 or higher

If crawl name is provided, uses crawl name, otherwise hostname of first seed. The name is 'sluggified', using lowercase alphanum characters separated by dashes.

Ex: in an organization called `Default Org`, a crawl of `https://specs.webrecorder.net/` and no name will have WARCs named: `default-org-specs-webrecorder-net-....warc.gz`. If the crawl is given the name `SPECS`, the WARCs will be named `default-org-specs-manual-....warc.gz`.

Fixes #412 in a default way.
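The slug rule described here (lowercase alphanumerics separated by dashes, crawl name if given, otherwise the first seed's host) can be sketched roughly as follows. This is not the code from #1537: `slugify` and `warc_prefix` are hypothetical helpers, and the sketch assumes the org slug can be derived from the org name, which may not be how Browsertrix stores it. Whatever follows the prefix in the real file names is not modelled.

```python
import re
from urllib.parse import urlsplit


def slugify(text):
    """Lowercase the text and join runs of alphanumerics with single dashes."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def warc_prefix(org_name, first_seed_url, crawl_name=""):
    """Return '<org slug>-<slug of crawl name or first seed host>'.

    Hypothetical helper; assumes the org slug is derived from the org name.
    """
    label = crawl_name or urlsplit(first_seed_url).hostname or ""
    return f"{slugify(org_name)}-{slugify(label)}"


unnamed = warc_prefix("Default Org", "https://specs.webrecorder.net/")
named = warc_prefix("Default Org", "https://specs.webrecorder.net/", crawl_name="SPECS")
print(unnamed)
print(named)
```

Because everything is lowercased, the resulting prefix also sidesteps the case-sensitive search concern raised at the top of the thread.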

Klindten · 2025-02-27T11:12:28Z

Graveyard question.

In archiveweb.page you can rename your crawl after crawling, before download (thus giving the final name to the WARC files). Sometimes this is handy if you started crawling with a not-so-descriptive name.
In Btrix you can change the name of a crawl under the metadata tab while it is "Running" and "Waiting".
In "Archived Items" the name can't be changed. It can be changed for new crawls though. Would it be possible to do the same as in AWP, i.e. to be able to change the crawl name after the crawl? I suspect there are good reasons for the current procedure :-)

Shrinks99 added the "back end" label Feb 7, 2023

Shrinks99 self-assigned this Feb 7, 2023

Shrinks99 added this to Webrecorder Projects Feb 7, 2023

github-project-automation bot moved this to Todo in Webrecorder Projects Feb 7, 2023

Shrinks99 mentioned this issue Feb 7, 2023

Org Storage Management V1 #580

Closed

4 tasks

Shrinks99 changed the title ~~Naming of created/downloaded/ WARC/WACZ-files via template~~ Org-wide filename template for created/downloaded/ WARC/WACZ-files Feb 7, 2023

Shrinks99 mentioned this issue Feb 21, 2023

Customize downloaded WACZ filename #616

Closed

Shrinks99 added the "feature design" and "ui/ux" labels Feb 21, 2023

ikreymer mentioned this issue Feb 22, 2024

More friendly WARC prefix inside WACZ based on Org slug + Crawl Name / First Seed URL. #1537

Merged

Shrinks99 added this to the v1.9.2 milestone Feb 22, 2024

Shrinks99 removed this from the v1.9.2 milestone Feb 22, 2024

ikreymer closed this as completed in 8ae032f Feb 23, 2024

github-project-automation bot moved this from Todo to Done! in Webrecorder Projects Feb 23, 2024
