Full-text Indexing #5030

qqmyers · 2018-09-07T14:38:17Z

I've looked into full-text indexing and have created a basic implementation using Tika that should index most common file types (see https://tika.apache.org/1.18/formats.html). I'll put this in a pull-request. There are a few areas where testing/discussion/further work may be needed:

Tika had conflicts with old XML/Sax parsing libraries (xerces) and I updated several components in the pom.xml file and had to exclude the xerces dependency of sword. I don't know if everything is forward compatible or if this has broken something.
I haven't looked at performance as the index grows. By default, I've configured Tika to index all files that it can. On QDR's small dev machine, there is one text file (187MB csv) that causes an outofmemory error - I can catch that and processing can continue, but it may be that some file size/type limits would make sense in order to limit overall index size and keep performance up. (Full-text indexing is optional, off by default, but is done as part of the overall indexing process if it runs, so performance during indexing may also be an issue).
I explored using a separate field for full-text indexing but decided to start by putting the full text in the catchall text field. That makes simple queries work across the title/description/metadata and full-text. There may be reasons to keep it separate in solr (to allow advanced queries over full text only, to apply English word stemming, etc. without doing that to text as a whole) and I think this can be done will keeping the ability to query across metadata and full text from the simple query box (just need to expand plain text queries under the hood to query across metadata OR full-text fields). W.r.t. stemming - I see that the text field is defined as text-general, but I wonder if text_en would be OK (text_en does extra processing for English words but is not English specific, and stemming may be useful in descriptions and titles as well).

djbrooke · 2018-09-07T14:50:56Z

Thanks @qqmyers. We were thinking of turning off Tika because it wasn't used (#5018). I'll keep that issue open for now but sounds like we shouldn't work on it in the near term.

pameyer · 2018-09-10T22:09:32Z

I ran conf/docker-aio/run-test-suite.sh on ca6f59989b60d7d9e08b460b9c9a583b33adbd06:

``
[ERROR] Failures:
[ERROR] AdminIT.testConvertOAuthUserToBuiltin:392 Expected status code <200> doesn't match actual status code <400>.

[ERROR] AdminIT.testConvertShibUserToBuiltin:300 Expected status code <200> doesn't match actual status code <400>.

[ERROR] BuiltinUsersIT.testLeadingWhitespaceInEmailAddress:134 null expected:7862da6e@mailinator.com but was:
[ERROR] HarvestingServerIT.testSetCreation:77 expected:<201> but was:<500>
[ERROR] SearchIT.testDatasetThumbnail:353 Expected status code <200> doesn't match actual status code <500>.

[ERROR] SearchIT.testIdentifier:625 JSON path data.total_count doesn't match.
Expected: <0>
Actual: 1

[ERROR] UsersIT.convertNonBcryptUserFromBuiltinToShib:77 JSON path message doesn't match.
Expected: ["User doesn't know password."]
Actual: No user to convert. We couldn't find a single existing user account based on null and no user was found using specified id 0

[INFO]
[ERROR] Tests run: 74, Failures: 7, Errors: 0, Skipped: 0

``
It wouldn't surprise me if some or all of these went away after #5036 was merged into this branch (meaning updated to develop, since that PR's merged already). I'd suspect the failures reported at https://build.hmdc.harvard.edu:8443/job/phoenix.dataverse.org-apitest-develop/233/testReport/ to be picked up, but suspect the SearchIT ones might not.

matthew-a-dunlap · 2018-09-10T22:57:28Z

I can confirm these IT tests should resolve with a new merge of develop (fixed in #5036):

AdminIT.testConvertOAuthUserToBuiltin
AdminIT.testConvertShibUserToBuiltin
BuiltinUsersIT.testLeadingWhitespaceInEmailAddress
UsersIT.convertNonBcryptUserFromBuiltinToShib

pdurbin · 2018-09-28T19:28:02Z

@qqmyers I just saw your note on pull request #5032 at #5032 (comment) about which files to make searchable and I'm wondering if this issue should now be in the "code review" column at https://waffle.io/IQSS/dataverse

qqmyers · 2018-09-28T19:29:24Z

@pdurbin - checking one more thing - I'll ping you when this is ready to go again.

poikilotherm · 2018-09-28T20:40:58Z

The integration of Tika could be usefull for the upgrade to Java 11, too. See #4259. I get failing tests while detecting file types and I wonder if Tika can help.

This is just a cross-reference, will open a new issue for this if necessary.

qqmyers · 2018-10-02T21:00:42Z

@pdurbin - I think this is ready to go.

FWIW/future reference, I looked into whether one could reuse the full-text field from a prior file (since the file content never changes) and think it should be possible but I see a couple issues that looked bigger than I can handle right now:

the existing file documents are deleted in the complex logic of the IndexServiceBean.indexDataset method before the addOrUpdateDataset method, where the the file solr document is created/submitted. The old docs would need to be preserved long enough to read them - perhaps just by passing the list of docs to delete into the addOrUpdateDataset method, but I'm not sure given the multiple cases involved.
when I look at the file document now, I don't see the text field, which is where the full text results are being added. I presume that field is not being store since it is also the catch-all field. Making that stored, or using a separate field for the full text would probably handle this. (Given that the full-text field could be large, I don't think one would want to harvest it into the index field, but I did play earlier with adjusting the simple query to search index and another full-text field to keep the simple results the same as they are with the current PR IQSS/5030 full text idexing #5032)

pdurbin · 2018-10-05T19:12:56Z

Hi @qqmyers your pull request was extra work to code review because it included so many formatting changed that I can only assume were introduced whatever IDE you're using. It doesn't seem to like long lines, for example. I created a new branch and pull request at #5147 that is the same code as your pull request but that represents a smaller diff so it's easy to review now and in the future. Basically, I backed out of a number of your formatting only changes. I did preserve some of your other cleanup such as not using deprecated methods for the global id. Off to QA. Thanks!

qqmyers · 2018-10-05T19:22:14Z

@pdurbin - thanks. I'm trying to sync my Eclipse settings with Dataverse preferences. I've got tabs going to 4 spaces already - I'll add to allow lines out to 480 chars (was 120) and to not merge already wrapped lines. Hopefully that will limit unnecessary changes. Let me know if you spot other differences and I'll try to keep watching as well.

pdurbin · 2018-10-05T19:27:57Z

@qqmyers no sweat. Since I was on the branch anyway I went ahead and deployed it to my laptop and it works! I uploaded a PDF that has the word "learning" in it...

... and I was able to find it by searching for "learning":

Very cool!

pdurbin · 2018-10-05T19:33:57Z

Oh, there were no docs so I added some in 278f37e and noted that this (awesome) feature is off by default but we could switch the boolean and docs so that it's on by default. Or we could leave it alone and treat it as an experimental feature for now.

kcondon · 2018-10-10T23:02:19Z

@qqmyers Still testing but noticed the file size limit for indexing does not seem to limit: :SolrMaxFileSizeForFullTextIndexing

When I set it to 1, it still indexes a small file of 3 words and 15 bytes. Not sure whether it has something to do with the small values involved or what.

I'll keep testing but I am out tomorrow. Will return on Friday to finish up.

qqmyers · 2018-10-11T14:06:00Z

@kcondon - good catch. I hadn't realized that open() has to be called before the accessIO object size is available. Fixed (and tested locally) in the new commit. FWIW - this also lead me to identify what I think are resource leaks (see #5165 ).

pdurbin · 2018-10-11T14:36:10Z

@kcondon the new commit @qqmyers is talking about ( 3aad24b ) having to do with :SolrMaxFileSizeForFullTextIndexing and open() is now in the branch you're testing (5030-full-text-indexing / pull request #5147).

pdurbin added the communitydev label Sep 7, 2018

pdurbin assigned qqmyers Sep 7, 2018

This was referenced Sep 7, 2018

Solr: Disable Solr Cell/ Tika from solr config #5018

Closed

IQSS/5030 full text idexing #5032

Closed

djbrooke added Status: Code Review and removed communitydev labels Oct 2, 2018

djbrooke assigned matthew-a-dunlap and unassigned qqmyers Oct 2, 2018

pdurbin self-assigned this Oct 5, 2018

pdurbin added a commit that referenced this issue Oct 5, 2018

run through netbeans formatter (no-op) #5030

e78e73e

pdurbin added a commit that referenced this issue Oct 5, 2018

back out of formatting changes to make code review easier #5030

c5de143

pdurbin added a commit that referenced this issue Oct 5, 2018

make code review easier, back out of formatting changes #5030

bc1b6ca

pdurbin added a commit that referenced this issue Oct 5, 2018

doc :SolrFullTextIndexing :SolrMaxFileSizeForFullTextIndexing #5030

278f37e

pdurbin mentioned this issue Oct 5, 2018

5030 full text indexing #5147

Merged

pdurbin added Status: QA and removed Status: Code Review labels Oct 5, 2018

pdurbin unassigned pdurbin and matthew-a-dunlap Oct 5, 2018

kcondon self-assigned this Oct 5, 2018

This was referenced Oct 9, 2018

Iqss/5030 full text indexing #5156

Merged

poi update to match tika #5159

Merged

qqmyers mentioned this issue Oct 11, 2018

must open AccessIO object before comparing size with max #5166

Merged

5 tasks

kcondon closed this as completed Oct 12, 2018

kcondon removed the Status: QA label Oct 12, 2018

pdurbin added this to the 4.10 - Additional Data Transfer Options milestone Dec 19, 2018

pdurbin mentioned this issue Mar 6, 2019

Expand Index & Search Within Files #2185

Closed

jggautier mentioned this issue Jan 8, 2020

Add full text indexing to Harvard Dataverse IQSS/dataverse.harvard.edu#51

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Full-text Indexing #5030

Full-text Indexing #5030

qqmyers commented Sep 7, 2018

djbrooke commented Sep 7, 2018

pameyer commented Sep 10, 2018 •

edited

Loading

matthew-a-dunlap commented Sep 10, 2018

pdurbin commented Sep 28, 2018

qqmyers commented Sep 28, 2018

poikilotherm commented Sep 28, 2018

qqmyers commented Oct 2, 2018

pdurbin commented Oct 5, 2018

qqmyers commented Oct 5, 2018

pdurbin commented Oct 5, 2018

pdurbin commented Oct 5, 2018

kcondon commented Oct 10, 2018 •

edited

Loading

qqmyers commented Oct 11, 2018

pdurbin commented Oct 11, 2018

Full-text Indexing #5030

Full-text Indexing #5030

Comments

qqmyers commented Sep 7, 2018

djbrooke commented Sep 7, 2018

pameyer commented Sep 10, 2018 • edited Loading

matthew-a-dunlap commented Sep 10, 2018

pdurbin commented Sep 28, 2018

qqmyers commented Sep 28, 2018

poikilotherm commented Sep 28, 2018

qqmyers commented Oct 2, 2018

pdurbin commented Oct 5, 2018

qqmyers commented Oct 5, 2018

pdurbin commented Oct 5, 2018

pdurbin commented Oct 5, 2018

kcondon commented Oct 10, 2018 • edited Loading

qqmyers commented Oct 11, 2018

pdurbin commented Oct 11, 2018

pameyer commented Sep 10, 2018 •

edited

Loading

kcondon commented Oct 10, 2018 •

edited

Loading