Org-wide filename template for created/downloaded/ WARC/WACZ-files #412
Comments
I would also find it helpful for consistency with other WARCs we process at UNT to be able to name the WARC files created with a configured pattern akin to in pywb where in config.yml you can configure:
Any news on this? We still need this functionality to test more/go in production.
@Klindten Not much to report unfortunately. We're busy working on QA features! As this is your issue, when it gets updated you'll be the first to know. ;) I didn't want to completely leave you hanging though, and admittedly this has also been taking up space in my head, so I took a few minutes today to create a mockup of where this might fit into the Org Settings page. Would something like this meet your needs? What fields are required? What might I be missing? Also worth noting that we may go with a simpler text-based customization system initially, as that might be faster to get out the door!
@Shrinks99 Thanks for getting back to me. I think we should set up a short meeting with you, Colin, Tue, and me asap; this will be easier than ping-pong here. I'll try to find a good time. Where are you based, so it won't be too early? Do other people from Webrec need to be in on this?

Here are our initial thoughts. The filenaming template should be for both WACZ and WARC files:

- prefix-browsertrix_crawlname-browsertrix_created_by(initials?)-browsertrix_crawltimestamp(YYYYMMDDHHSS).warc.gz
- prefix-browsertrix_crawlname-browsertrix_created_by(initials?)-browsertrix_crawltimestamp(YYYYMMDDHHSS).wacz.gz

e.g.

- webrec-tv2_tema_taleban_afganistan-jens_moeller-202308181556.warc.gz
- webrec-tv2_tema_taleban_afganistan-jens_moeller-202308181556.wacz.gz

Only the prefix is customized/input by the user (but once and for all).

Could also be. CSR: netarkivet_${crawl_name}${username}${timestamp}.warc.gz
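To make the proposal above concrete, here is a minimal sketch of how such an org-wide template could be rendered. It is not Browsertrix code: the placeholder names (`prefix`, `crawl_name`, `created_by`, `timestamp`), the `render_filename` helper, and the choice of a year-to-minute timestamp (matching the 12-digit example `202308181556`) are all assumptions made for illustration.

```python
from datetime import datetime
from string import Template

# Hypothetical org-wide template; placeholder names are assumptions,
# loosely following the dash-separated pattern proposed in this thread.
FILENAME_TEMPLATE = Template("${prefix}-${crawl_name}-${created_by}-${timestamp}.warc.gz")


def render_filename(template, prefix, crawl_name, created_by, started):
    """Fill the template; the timestamp is rendered as year..minute (12 digits)."""
    return template.substitute(
        prefix=prefix,
        crawl_name=crawl_name,
        created_by=created_by,
        timestamp=started.strftime("%Y%m%d%H%M"),
    )


name = render_filename(
    FILENAME_TEMPLATE,
    prefix="webrec",
    crawl_name="tv2_tema_taleban_afganistan",
    created_by="jens_moeller",
    started=datetime(2023, 8, 18, 15, 56),
)
print(name)  # webrec-tv2_tema_taleban_afganistan-jens_moeller-202308181556.warc.gz
```

With `string.Template`, an unknown or missing placeholder raises an error at render time, which would surface a misconfigured template early instead of producing an oddly named file.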
- Supports setting the 'WARC_PREFIX' env var in browsertrix crawler (requires crawler 1.0.0-beta.4 or higher)
- Prefix set to <org slug>-<slug [crawl name | first seed host]>
- Uses either the crawl name, if provided, or the host name of the first seed; both are converted to a slug (lowercase alphanum, separated by dashes)
- Fixes #412
The proposed solution for WARCs is implemented in #1537. The WARCs will start with the prefix:

WACZs are stored in S3 following a specific convention, so they are a bit harder to change, but it is very easy to rename them after downloading, or to download to a custom-named file by doing 'Save Link As...'. Additionally, collections offer a way to have a custom WACZ name based on the collection and are the recommended approach for curation.
…/ First Seed URL. (#1537)

Supports setting WARC prefix for WARCs inside WACZ to `<org slug>-<slug [crawl name | first seed host]>`.

- Prefix set via WARC_PREFIX env var, supported in browsertrix-crawler 1.0.0-beta.4 or higher

If a crawl name is provided, the crawl name is used; otherwise the hostname of the first seed is used. The name is 'sluggified', using lowercase alphanum characters separated by dashes.

Ex: in an organization called `Default Org`, a crawl of `https://specs.webrecorder.net/` with no name will have WARCs named: `default-org-specs-webrecorder-net-....warc.gz`

If the crawl is given the name `SPECS`, the WARCs will be named `default-org-specs-manual-....warc.gz`

Fixes #412 in a default way.
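The prefix rule described in this commit message can be sketched as follows. This is an illustration only, not the code from #1537: `slugify` and `warc_prefix` are hypothetical helpers, the slug rule is assumed to be ASCII-only, and the sketch covers just the `<org slug>-<slug>` prefix, not whatever the crawler appends after it.

```python
import re
from urllib.parse import urlparse


def slugify(text):
    """Lowercase, then collapse each run of non-alphanumeric characters into one dash."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def warc_prefix(org_name, crawl_name, first_seed_url):
    """Build '<org slug>-<slug of crawl name, or of the first seed host if no name>'."""
    label = crawl_name or urlparse(first_seed_url).hostname or ""
    return f"{slugify(org_name)}-{slugify(label)}"


# Unnamed crawl: falls back to the host of the first seed.
print(warc_prefix("Default Org", "", "https://specs.webrecorder.net/"))
# default-org-specs-webrecorder-net

# Named crawl: the crawl name wins over the seed host.
print(warc_prefix("Default Org", "SPECS", "https://specs.webrecorder.net/"))
# default-org-specs
```

In a deployment, a value like this would then be passed to the crawler through the `WARC_PREFIX` environment variable mentioned above.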
Graveyard question.
At KB we need to name files webrec_"collectionname" (an added date can be fine). Also, as some search and discovery tools, like SolrWayback, have case-sensitive search, it would be great to have a way to customize which characters are allowed in the WARC file name.
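As a rough sketch of what a configurable character policy could look like (purely hypothetical, not an existing Browsertrix setting): the `sanitize_filename_part` helper below, its `allowed` parameter (the body of a regex character class), and the `lowercase` switch are all assumptions made for illustration.

```python
import re


def sanitize_filename_part(text, allowed="A-Za-z0-9_", replacement="_", lowercase=False):
    """Replace each run of characters outside `allowed` with `replacement`.

    `allowed` is the body of a regex character class (an assumption of this
    sketch); set lowercase=True for tools that search case-sensitively.
    """
    if lowercase:
        text = text.lower()
    return re.sub(f"[^{allowed}]+", replacement, text).strip(replacement)


collection = "TV2 Tema: Taleban/Afganistan"
print("webrec_" + sanitize_filename_part(collection))
# webrec_TV2_Tema_Taleban_Afganistan
print("webrec_" + sanitize_filename_part(collection, lowercase=True))
# webrec_tv2_tema_taleban_afganistan
```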