Full-text Indexing #5030
I ran `` [ERROR] AdminIT.testConvertShibUserToBuiltin:300 Expected status code <200> doesn't match actual status code <400>. [ERROR] BuiltinUsersIT.testLeadingWhitespaceInEmailAddress:134 null expected:7862da6e@mailinator.com but was: [ERROR] SearchIT.testIdentifier:625 JSON path data.total_count doesn't match. [ERROR] UsersIT.convertNonBcryptUserFromBuiltinToShib:77 JSON path message doesn't match. [INFO] `` |
I can confirm these IT tests should resolve with a new merge of develop (fixed in #5036):
|
@qqmyers I just saw your note on pull request #5032 at #5032 (comment) about which files to make searchable and I'm wondering if this issue should now be in the "code review" column at https://waffle.io/IQSS/dataverse |
@pdurbin - checking one more thing - I'll ping you when this is ready to go again. |
The integration of Tika could be useful for the upgrade to Java 11, too. See #4259. I get failing tests while detecting file types and I wonder if Tika can help. This is just a cross-reference; I will open a new issue for this if necessary. |
@pdurbin - I think this is ready to go. FWIW/future reference, I looked into whether one could reuse the full-text field from a prior file (since the file content never changes) and think it should be possible but I see a couple issues that looked bigger than I can handle right now:
|
Hi @qqmyers your pull request was extra work to code review because it included so many formatting changes that I can only assume were introduced by whatever IDE you're using. It doesn't seem to like long lines, for example. I created a new branch and pull request at #5147 that is the same code as your pull request but that represents a smaller diff so it's easy to review now and in the future. Basically, I backed out a number of your formatting-only changes. I did preserve some of your other cleanup such as not using deprecated methods for the global id. Off to QA. Thanks! |
@pdurbin - thanks. I'm trying to sync my Eclipse settings with Dataverse preferences. I've got tabs going to 4 spaces already - I'll add to allow lines out to 480 chars (was 120) and to not merge already wrapped lines. Hopefully that will limit unnecessary changes. Let me know if you spot other differences and I'll try to keep watching as well. |
@qqmyers no sweat. Since I was on the branch anyway I went ahead and deployed it to my laptop and it works! I uploaded a PDF that has the word "learning" in it and I was able to find it by searching for "learning". Very cool! |
Oh, there were no docs so I added some in 278f37e and noted that this (awesome) feature is off by default but we could switch the boolean and docs so that it's on by default. Or we could leave it alone and treat it as an experimental feature for now. |
@qqmyers Still testing, but I noticed that the file size limit for indexing, :SolrMaxFileSizeForFullTextIndexing, does not seem to take effect. When I set it to 1, it still indexes a small file of 3 words and 15 bytes. Not sure whether it has something to do with the small values involved or what. I'll keep testing but I am out tomorrow. Will return on Friday to finish up. |
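For reference, the behavior expected in the report above (a limit of 1 should skip a 15-byte file) can be sketched as a small size check. This is only an illustration under stated assumptions, not the code from the pull request: the class and method names are hypothetical, the limit is assumed to be in bytes and inclusive, and a missing or non-numeric setting is assumed to mean "no limit".

```java
import java.util.OptionalLong;

// Hypothetical sketch; none of these names come from the Dataverse codebase.
public class FullTextSizeLimitSketch {

    // Parse the raw setting value (e.g. the string stored for
    // :SolrMaxFileSizeForFullTextIndexing). Assumption: an absent, blank,
    // or non-numeric value means no limit is applied.
    static OptionalLong parseLimit(String rawSetting) {
        if (rawSetting == null || rawSetting.trim().isEmpty()) {
            return OptionalLong.empty();
        }
        try {
            return OptionalLong.of(Long.parseLong(rawSetting.trim()));
        } catch (NumberFormatException e) {
            return OptionalLong.empty();
        }
    }

    // Assumption: the limit is in bytes and inclusive, so a file is indexed
    // only when there is no limit or its size does not exceed the limit.
    static boolean shouldIndexFullText(long fileSizeBytes, OptionalLong limit) {
        return !limit.isPresent() || fileSizeBytes <= limit.getAsLong();
    }

    public static void main(String[] args) {
        // The case from the report: limit set to 1, file of 15 bytes.
        OptionalLong limit = parseLimit("1");
        System.out.println(shouldIndexFullText(15, limit)); // prints "false"
    }
}
```

Under these assumptions the 15-byte file would be skipped when the limit is 1, so indexing it anyway suggests the setting is either not being read or not being compared against the file size.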
I've looked into full-text indexing and have created a basic implementation using Tika that should index most common file types (see https://tika.apache.org/1.18/formats.html). I'll put this in a pull-request. There are a few areas where testing/discussion/further work may be needed: