Org-wide filename template for created/downloaded/ WARC/WACZ-files #412

Closed
Klindten opened this issue Dec 13, 2022 · 6 comments
Assignees
Labels
back end Requires back end dev work feature design This issue tracks smaller sub issues that compose a feature ui/ux This issue requires UI/UX work

Comments

@Klindten

  1. An input/template where you can define how files are named. It should be easy for everyone to use.
  2. The possibility to add a WARC field for the user, as in the Conifer service.

At KB we need to name files webrec_"collectionname" (an added date would be fine). Also, as some search and discovery tools, like SolrWayback, have case-sensitive search, it would be great to have a way to customize which characters are allowed in the WARC file name.

@ldko

ldko commented Feb 7, 2023

For consistency with other WARCs we process at UNT, I would also find it helpful to be able to name the created WARC files with a configured pattern, as in pywb, where you can set the following in config.yml:

recorder:
   filename_template: UNT-{timestamp}-{hostname}-{random}.warc.gz

@Shrinks99 Shrinks99 added the back end Requires back end dev work label Feb 7, 2023
@Shrinks99 Shrinks99 self-assigned this Feb 7, 2023
@Shrinks99 Shrinks99 mentioned this issue Feb 7, 2023
4 tasks
@Shrinks99 Shrinks99 changed the title Naming of created/downloaded/ WARC/WACZ-files via template Org-wide filename template for created/downloaded/ WARC/WACZ-files Feb 7, 2023
@Shrinks99 Shrinks99 added feature design This issue tracks smaller sub issues that compose a feature ui/ux This issue requires UI/UX work labels Feb 21, 2023
@Klindten
Author

Klindten commented Nov 7, 2023

Any news on this? We still need this functionality to test more and go into production.

@Shrinks99
Member

Shrinks99 commented Nov 7, 2023

@Klindten Not much to report unfortunately. We're busy working on QA features! As this is your issue, when it gets updated you'll be the first to know. ;)

I didn't want to completely leave you hanging though, and admittedly this has also been taking up space in my head, so I took a few minutes today to create a mockup of where this might fit into the Org Settings page.

Screenshot 2023-11-07 133331

Would something like this meet your needs? What fields are required? What might I be missing?

Also worth noting that we may go with a simpler text-based customization system initially as that might be faster to get out the door!

@Klindten
Author

Klindten commented Nov 8, 2023

@Shrinks99 Thanks for getting back to me.

I think we should set up a short meeting with you, Colin, Tue, and me asap; this will be easier than ping-pong here. I'll try to find a good time. Where are you based, so it won't be too early? Do other people from Webrec need to be in on this?

Here are our initial thoughts:

The file naming template should apply to both WACZ and WARC files:

prefix-browsertrix_crawlname-browsertrix_created_by(initials?)-browsertrix_crawltimestamp(YYYYMMDDHHMM).warc.gz

prefix-browsertrix_crawlname-browsertrix_created_by(initials?)-browsertrix_crawltimestamp(YYYYMMDDHHMM).wacz.gz

e.g.

webrec-tv2_tema_taleban_afganistan-jens_moeller-202308181556.warc.gz

webrec-tv2_tema_taleban_afganistan-jens_moeller-202308181556.wacz.gz

Only the prefix is customized/input by the user (but once and for all).
(The prefix is set at the organisational level and should be applied automatically.)

Could also be. CSR: netarkivet_${crawl_name}${username}${timestamp}.warc.gz
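
As a rough illustration of the proposal above, here is a minimal Python sketch of an org-wide filename template. It is not Browsertrix code: the field names (`prefix`, `crawl_name`, `created_by`, `timestamp`), the `normalise` helper, and the choice of lowercase letters, digits, and underscores as the allowed character set are all assumptions made for this example. The timestamp uses minute resolution, matching the example filenames.

```python
import re
from datetime import datetime
from string import Template

# Hypothetical org-wide template, modelled on the pattern proposed in this comment.
FILENAME_TEMPLATE = Template(
    "${prefix}-${crawl_name}-${created_by}-${timestamp}.warc.gz"
)


def normalise(value: str) -> str:
    """Lowercase the value and collapse runs of other characters into "_".

    Lowercasing keeps names predictable for case-sensitive search tools.
    """
    return re.sub(r"[^a-z0-9]+", "_", value.lower()).strip("_")


def render_filename(
    prefix: str, crawl_name: str, created_by: str, started: datetime
) -> str:
    """Fill the template with normalised fields and a minute-resolution timestamp."""
    return FILENAME_TEMPLATE.substitute(
        prefix=normalise(prefix),
        crawl_name=normalise(crawl_name),
        created_by=normalise(created_by),
        timestamp=started.strftime("%Y%m%d%H%M"),
    )


if __name__ == "__main__":
    print(
        render_filename(
            "webrec",
            "TV2 tema: Taleban Afganistan",
            "Jens Moeller",
            datetime(2023, 8, 18, 15, 56),
        )
    )
```

Using `string.Template` means a `${crawl_name}`-style pattern, like the one suggested above, could be stored as a plain string in an org setting. A different allowed character set would only require changing the regular expression in `normalise`.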

ikreymer added a commit that referenced this issue Feb 21, 2024
- supports setting 'WARC_PREFIX' env var in browsertrix crawler (requires crawler 1.0.0-beta.4 or higher)
- prefix set to <org slug>-<slug [crawl name | first seed host]>
- using either crawl name, if provided, or host name of first seed. Both are converted to a slug (lowercase alphanum, separated by dashes)
- fixes #412
@ikreymer
Member

The proposed solution for WARCs is implemented in #1537. The WARCs will start with the prefix: <org slug>-<slug [crawl name | first seed host]>-.

WACZs are stored in S3 following a specific convention, so they are a bit harder to change, but it is easy to rename them after downloading, or to download to a custom-named file via 'Save Link As...'. Additionally, collections offer a way to have a custom WACZ name based on the collection and are the recommended approach for curation.

@Shrinks99 Shrinks99 added this to the v1.9.2 milestone Feb 22, 2024
ikreymer added a commit that referenced this issue Feb 22, 2024
…/ First Seed URL. (#1537)

Supports setting WARC prefix for WARCs inside WACZ to `<org slug>-<slug
[crawl name | first seed host]>`.
- Prefix set via WARC_PREFIX env var, supported in browsertrix-crawler
1.0.0-beta.4 or higher
If crawl name is provided, uses crawl name, otherwise hostname of first
seed. The name is 'sluggified', using lowercase alphanum characters
separated by dashes.

Ex: in an organization called `Default Org`, a crawl of
`https://specs.webrecorder.net/` and no name will have WARCs named:
`default-org-specs-webrecorder-net-....warc.gz`
If the crawl is given the name `SPECS`, the WARCs will be named
`default-org-specs-manual-....warc.gz`

Fixes #412 in a default way.
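
The prefix rule described in this commit message can be sketched in Python as follows. This is an illustrative approximation, not the code from #1537: the function names `slugify` and `warc_prefix` are hypothetical, and the sketch only builds the `<org slug>-<slug [crawl name | first seed host]>` part, not whatever the crawler appends after it.

```python
import re
from typing import Optional
from urllib.parse import urlsplit


def slugify(text: str) -> str:
    """Lowercase alphanumeric characters separated by dashes."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def warc_prefix(
    org_slug: str, first_seed: str, crawl_name: Optional[str] = None
) -> str:
    """Build '<org slug>-<slug>' from the crawl name, else the first seed's host."""
    # Fall back to the raw seed string if no hostname can be parsed from it.
    label = crawl_name or urlsplit(first_seed).hostname or first_seed
    return f"{org_slug}-{slugify(label)}"


if __name__ == "__main__":
    org = slugify("Default Org")
    # Unnamed crawl: the host of the first seed is used.
    print(warc_prefix(org, "https://specs.webrecorder.net/"))
    # Named crawl: the crawl name takes precedence over the seed host.
    print(warc_prefix(org, "https://specs.webrecorder.net/", "SPECS"))
```

A value like this would then be passed to the crawler through the `WARC_PREFIX` environment variable mentioned above.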
@Shrinks99 Shrinks99 removed this from the v1.9.2 milestone Feb 22, 2024
@github-project-automation github-project-automation bot moved this from Todo to Done! in Webrecorder Projects Feb 23, 2024
@Klindten
Author

Graveyard question.

  • In archiveweb.page you can rename your crawl after crawling and before download (thus giving the final name to the WARC files). This is handy if you started crawling with a name that was not very descriptive.

  • In Btrix you can change the name of a crawl under the metadata tab while it is "Running" or "Waiting".

  • In "Archived Items" the name can't be changed. It can be changed for new crawls though. Would it be possible to do the same as in AWP and allow changing the crawl name after the crawl? I suspect there are good reasons for the current procedure :-)


4 participants