Add support for Harvard case.law #3225

Closed
adam3smith opened this issue Jan 13, 2024 · 10 comments · Fixed by #3230
Labels
Difficulty: Easy New Translator Pull requests for new translators

Comments

@adam3smith
Collaborator

https://cite.case.law/

Requested: https://forums.zotero.org/discussion/110768/zotero-connector-and-case-law-courtlistener/p1

Looks like you can easily get to API-based JSON results that should work well

@adam3smith adam3smith added New Translator Pull requests for new translators Difficulty: Easy labels Jan 13, 2024
@franklindyer
Contributor

I was tinkering with this and noticed some really strange query selector behavior, maybe someone can point out what I'm doing wrong. I'm using this entry as my test page.

The first step was to extract the api.case.law URL corresponding to the given cite.case.law page. At first glance it might look like we can take the case ID (e.g. 9903854) directly from the URL, but this isn't reliably possible: see, for example, this page, whose cite page URL doesn't contain the ID used in the API page URL. So it will be necessary to extract the API page URL from an HTML element's href attribute.

When I use the following query selector in my scrape function, it yields no results:

attr(doc, "a[href*='api.case.law/v1/cases/']", 'href');

But when I open the page in another browser, the corresponding query selector does yield results:

document.querySelectorAll("a[href*='api.case.law/v1/cases/']")

A regex-based text search of the document body's innerHTML (which is awful, but I needed a sanity check) also yields a match:

doc.body.innerHTML.match(/api\.case\.law\/v1\/cases\/([0-9]+)/)[0];

Any ideas what's going wrong here? Can anyone reproduce this? Maybe there is a silly mistake in my query selector. But I can't see why the same query selector that fails in scrape would succeed in another browser (I also have "defer": true, so I don't think it's a timing issue) even when the desired element is actually present in the raw text of the page's HTML.
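For illustration, the two lookups above could be combined into one helper that tries the selector first and only falls back to the regex when the selector finds nothing. This is only a sketch: findApiCaseUrl is a hypothetical name, and it assumes the href is an absolute http(s) URL ending in a numeric case ID.

```javascript
// Hypothetical helper (not part of Zotero): locate the api.case.law case URL
// on a cite.case.law page.
function findApiCaseUrl(doc) {
	// Preferred route: the same CSS selector used in the comment above.
	const link = doc.querySelector("a[href*='api.case.law/v1/cases/']");
	if (link) {
		return link.getAttribute('href');
	}
	// Fallback: search the raw HTML of the body.
	// Assumption: the link is an absolute URL with a numeric case ID.
	const html = doc.body ? doc.body.innerHTML : '';
	const match = html.match(/https?:\/\/api\.case\.law\/v1\/cases\/[0-9]+\/?/);
	return match ? match[0] : null;
}
```

In a translator, scrape could call findApiCaseUrl(doc) and stop early when it returns null.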

@adam3smith
Collaborator Author

adam3smith commented Jan 15, 2024

How exactly are you testing this? I just loaded https://cite.case.law/am-samoa/2/3/ into the Scaffold browser, created a new translator with the Web Translator template, changed detect so it always detects as a case and then put
Z.debug(attr(doc, "a[href*='api.case.law/v1/cases/']", 'href')) into the scrape function and ran doWeb. That returned https://api.case.law/v1/cases/206939/ as expected.
It sounds like you are starting with the test cases (otherwise defer wouldn't matter)? I wouldn't recommend that. Test-driven development is not an effective way to develop Zotero translators. Figure out what works with the built-in detect and do buttons first, then handle page loading issues (which can come up with tests in general) later.
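Roughly, the experiment described above amounts to the following translator body. It is a minimal sketch, not the Web Translator template itself: in Scaffold, attr and Z.debug are provided by Zotero, while the debug and getAttr fallbacks here are stand-ins (assumptions) so that the snippet can also run outside Zotero.

```javascript
// Use Zotero's helpers when available; otherwise fall back to stand-ins.
const debug = (typeof Z !== 'undefined') ? (msg) => Z.debug(msg) : console.log;
const getAttr = (typeof attr === 'function')
	? attr
	: (doc, selector, name) => {
		const elem = doc.querySelector(selector);
		return elem ? elem.getAttribute(name) : null;
	};

// Always detect as a case, as in the quick test described above.
function detectWeb(doc, url) {
	return 'case';
}

function doWeb(doc, url) {
	return scrape(doc, url);
}

// Log (and return) the API link found on the page, or null if there is none.
function scrape(doc, url) {
	const apiUrl = getAttr(doc, "a[href*='api.case.law/v1/cases/']", 'href');
	debug(apiUrl);
	return apiUrl;
}
```

With https://cite.case.law/am-samoa/2/3/ loaded in the Scaffold browser, running doWeb should log the API URL if the selector matches.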

@franklindyer
Contributor

Okay, thanks for the tip. I've tried again using just doWeb and added the exact same line of code Z.debug(attr(doc, "a[href*='api.case.law/v1/cases/']", 'href')) to my scrape function. It still gives no hits, despite the fact that the body plainly contains the anchor tag I'm looking for, and the regex match continues to find it successfully.

So, at the very least, we can say this isn't a page loading issue now... but I'm not sure what local misconfiguration could be causing this only on my end. (I'm not behind on any Zotero updates.)

@adam3smith
Collaborator Author

This is in scaffold?

@franklindyer
Contributor

That's correct.

@adam3smith
Collaborator Author

Hmm -- at this point hard to say anything without seeing more code -- could you either put this into a draft PR or on a gist?

@franklindyer
Contributor

Here's a gist containing the code I currently have for the translator, and I'm testing on the /am-samoa/2/3/ test case using the Run do* button as per your suggestion. The output is

14:02:48 Running doWeb
14:02:48 
14:02:48 Translation successful

@adam3smith
Collaborator Author

Works for me. I'm in Zotero 7, but I can't really see how that matters.
You sure you have the page open in the browser in Scaffold? Because you also should be getting a monster return for the doc.body function debug you have in there.

@franklindyer
Contributor

franklindyer commented Jan 15, 2024

Oops, I commented out the doc.body dump locally but forgot to do so in the gist. In any case, yes, with that line uncommented I get a huge return for the body and nothing for the href.

Turns out I was using Zotero 6, and I just installed the Zotero 7 beta as a last resort... and the code works for me there! Perplexing...

@adam3smith
Collaborator Author

Cool. I'm guessing it might be the old Firefox version running underneath Zotero 6? I don't think I have seen this before. We obviously want to test with Z6, but I'm guessing it'll work outside of Scaffold.

2 participants