Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Full-text Indexing #5030

Closed
qqmyers opened this issue Sep 7, 2018 · 14 comments
Closed

Full-text Indexing #5030

qqmyers opened this issue Sep 7, 2018 · 14 comments
Assignees

Comments

@qqmyers
Copy link
Member

qqmyers commented Sep 7, 2018

I've looked into full-text indexing and have created a basic implementation using Tika that should index most common file types (see https://tika.apache.org/1.18/formats.html). I'll put this in a pull-request. There are a few areas where testing/discussion/further work may be needed:

  • Tika had conflicts with old XML/Sax parsing libraries (xerces) and I updated several components in the pom.xml file and had to exclude the xerces dependency of sword. I don't know if everything is forward compatible or if this has broken something.
  • I haven't looked at performance as the index grows. By default, I've configured Tika to index all files that it can. On QDR's small dev machine, there is one text file (187MB csv) that causes an outofmemory error - I can catch that and processing can continue, but it may be that some file size/type limits would make sense in order to limit overall index size and keep performance up. (Full-text indexing is optional, off by default, but is done as part of the overall indexing process if it runs, so performance during indexing may also be an issue).
  • I explored using a separate field for full-text indexing but decided to start by putting the full text in the catchall text field. That makes simple queries work across the title/description/metadata and full-text. There may be reasons to keep it separate in solr (to allow advanced queries over full text only, to apply English word stemming, etc. without doing that to text as a whole) and I think this can be done will keeping the ability to query across metadata and full text from the simple query box (just need to expand plain text queries under the hood to query across metadata OR full-text fields). W.r.t. stemming - I see that the text field is defined as text-general, but I wonder if text_en would be OK (text_en does extra processing for English words but is not English specific, and stemming may be useful in descriptions and titles as well).
@djbrooke
Copy link
Contributor

djbrooke commented Sep 7, 2018

Thanks @qqmyers. We were thinking of turning off Tika because it wasn't used (#5018). I'll keep that issue open for now but sounds like we shouldn't work on it in the near term.

@pameyer
Copy link
Contributor

pameyer commented Sep 10, 2018

I ran conf/docker-aio/run-test-suite.sh on ca6f59989b60d7d9e08b460b9c9a583b33adbd06:

``
[ERROR] Failures:
[ERROR] AdminIT.testConvertOAuthUserToBuiltin:392 Expected status code <200> doesn't match actual status code <400>.

[ERROR] AdminIT.testConvertShibUserToBuiltin:300 Expected status code <200> doesn't match actual status code <400>.

[ERROR] BuiltinUsersIT.testLeadingWhitespaceInEmailAddress:134 null expected:7862da6e@mailinator.com but was:
[ERROR] HarvestingServerIT.testSetCreation:77 expected:<201> but was:<500>
[ERROR] SearchIT.testDatasetThumbnail:353 Expected status code <200> doesn't match actual status code <500>.

[ERROR] SearchIT.testIdentifier:625 JSON path data.total_count doesn't match.
Expected: <0>
Actual: 1

[ERROR] UsersIT.convertNonBcryptUserFromBuiltinToShib:77 JSON path message doesn't match.
Expected: ["User doesn't know password."]
Actual: No user to convert. We couldn't find a single existing user account based on null and no user was found using specified id 0

[INFO]
[ERROR] Tests run: 74, Failures: 7, Errors: 0, Skipped: 0

``
It wouldn't surprise me if some or all of these went away after #5036 was merged into this branch (meaning updated to develop, since that PR's merged already). I'd suspect the failures reported at https://build.hmdc.harvard.edu:8443/job/phoenix.dataverse.org-apitest-develop/233/testReport/ to be picked up, but suspect the SearchIT ones might not.

@matthew-a-dunlap
Copy link
Contributor

I can confirm these IT tests should resolve with a new merge of develop (fixed in #5036):

  • AdminIT.testConvertOAuthUserToBuiltin
  • AdminIT.testConvertShibUserToBuiltin
  • BuiltinUsersIT.testLeadingWhitespaceInEmailAddress
  • UsersIT.convertNonBcryptUserFromBuiltinToShib

@pdurbin
Copy link
Member

pdurbin commented Sep 28, 2018

@qqmyers I just saw your note on pull request #5032 at #5032 (comment) about which files to make searchable and I'm wondering if this issue should now be in the "code review" column at https://waffle.io/IQSS/dataverse

@qqmyers
Copy link
Member Author

qqmyers commented Sep 28, 2018

@pdurbin - checking one more thing - I'll ping you when this is ready to go again.

@poikilotherm
Copy link
Contributor

The integration of Tika could be usefull for the upgrade to Java 11, too. See #4259. I get failing tests while detecting file types and I wonder if Tika can help.

This is just a cross-reference, will open a new issue for this if necessary.

@qqmyers
Copy link
Member Author

qqmyers commented Oct 2, 2018

@pdurbin - I think this is ready to go.

FWIW/future reference, I looked into whether one could reuse the full-text field from a prior file (since the file content never changes) and think it should be possible but I see a couple issues that looked bigger than I can handle right now:

  • the existing file documents are deleted in the complex logic of the IndexServiceBean.indexDataset method before the addOrUpdateDataset method, where the the file solr document is created/submitted. The old docs would need to be preserved long enough to read them - perhaps just by passing the list of docs to delete into the addOrUpdateDataset method, but I'm not sure given the multiple cases involved.
  • when I look at the file document now, I don't see the text field, which is where the full text results are being added. I presume that field is not being store since it is also the catch-all field. Making that stored, or using a separate field for the full text would probably handle this. (Given that the full-text field could be large, I don't think one would want to harvest it into the index field, but I did play earlier with adjusting the simple query to search index and another full-text field to keep the simple results the same as they are with the current PR IQSS/5030 full text idexing #5032)

@pdurbin
Copy link
Member

pdurbin commented Oct 5, 2018

Hi @qqmyers your pull request was extra work to code review because it included so many formatting changed that I can only assume were introduced whatever IDE you're using. It doesn't seem to like long lines, for example. I created a new branch and pull request at #5147 that is the same code as your pull request but that represents a smaller diff so it's easy to review now and in the future. Basically, I backed out of a number of your formatting only changes. I did preserve some of your other cleanup such as not using deprecated methods for the global id. Off to QA. Thanks!

@qqmyers
Copy link
Member Author

qqmyers commented Oct 5, 2018

@pdurbin - thanks. I'm trying to sync my Eclipse settings with Dataverse preferences. I've got tabs going to 4 spaces already - I'll add to allow lines out to 480 chars (was 120) and to not merge already wrapped lines. Hopefully that will limit unnecessary changes. Let me know if you spot other differences and I'll try to keep watching as well.

@pdurbin
Copy link
Member

pdurbin commented Oct 5, 2018

@qqmyers no sweat. Since I was on the branch anyway I went ahead and deployed it to my laptop and it works! I uploaded a PDF that has the word "learning" in it...

screen shot 2018-10-05 at 3 27 12 pm

... and I was able to find it by searching for "learning":

screen shot 2018-10-05 at 3 26 02 pm

Very cool!

@pdurbin
Copy link
Member

pdurbin commented Oct 5, 2018

Oh, there were no docs so I added some in 278f37e and noted that this (awesome) feature is off by default but we could switch the boolean and docs so that it's on by default. Or we could leave it alone and treat it as an experimental feature for now.

@kcondon kcondon self-assigned this Oct 5, 2018
@kcondon
Copy link
Contributor

kcondon commented Oct 10, 2018

@qqmyers Still testing but noticed the file size limit for indexing does not seem to limit: :SolrMaxFileSizeForFullTextIndexing

When I set it to 1, it still indexes a small file of 3 words and 15 bytes. Not sure whether it has something to do with the small values involved or what.

I'll keep testing but I am out tomorrow. Will return on Friday to finish up.

@qqmyers
Copy link
Member Author

qqmyers commented Oct 11, 2018

@kcondon - good catch. I hadn't realized that open() has to be called before the accessIO object size is available. Fixed (and tested locally) in the new commit. FWIW - this also lead me to identify what I think are resource leaks (see #5165 ).

@pdurbin
Copy link
Member

pdurbin commented Oct 11, 2018

@kcondon the new commit @qqmyers is talking about ( 3aad24b ) having to do with :SolrMaxFileSizeForFullTextIndexing and open() is now in the branch you're testing (5030-full-text-indexing / pull request #5147).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants