
10734 Refactoring and optimization of importing and reindexing of harvested content #10836

Merged 11 commits into develop from 10734-harvest-reindex-refactoring on Oct 25, 2024

Conversation

@landreev (Contributor) commented Sep 11, 2024:

What this PR does / why we need it:

Despite the title of the original issue - "Refactor reindexing of harvested datasets" - a careful reviewer will notice that the code changes in the PR have relatively little to do with the actual indexing. It nevertheless achieves the specified goal in #10734, which was to be able to take advantage of the recent indexing improvements, specifically where we now make an effort to avoid unnecessarily deleting and rebuilding from scratch the existing solr documents associated with the objects that are being reindexed. That approach could not be used with harvested content, because our harvesting model relied on the idea of destroying re-harvested datasets and recreating them from scratch. Note that this is a practical concern, especially when harvesting from other Dataverses, in one of the richer formats that result in harvesting both datasets and files. If a remote Dataverse marks a large number of their OAI-served datasets as updated, harvesting from them can potentially result in a very large number of dataset and file documents that need to be reindexed in Solr all at once. And that can be extra expensive on a Solr instance that already has a million indexed documents. (Harvesting and importing metadata are relatively low-cost tasks on their own).

The code changes in the PR address this by attempting to preserve and update the existing database entities of reharvested objects (and, therefore, the database ids of such, on which the tracking of Solr documents is based).
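
In rough outline, the intended flow looks something like this (an illustrative sketch only, not the actual code in the PR; findExistingHarvestedDataset(), importNewHarvestedDataset() and updateHarvestedDatasetInPlace() are hypothetical names, while parser.parseDatasetVersion(...) is quoted from the changes discussed below):

// Illustrative sketch: reuse the existing Dataset entity (and its database id)
// when a record is re-harvested, instead of destroying it and recreating it.
Dataset existingDataset = findExistingHarvestedDataset(globalId); // hypothetical lookup by persistent id
DatasetVersion harvestedVersion = parser.parseDatasetVersion(obj.getJsonObject("datasetVersion"));

if (existingDataset == null) {
    // first harvest of this record: import it as a brand-new harvested dataset
    importNewHarvestedDataset(harvestedVersion); // hypothetical
} else {
    // re-harvest: attach the freshly parsed version to the existing dataset,
    // keeping its database id so the associated Solr documents can be updated
    // in place rather than deleted and rebuilt
    updateHarvestedDatasetInPlace(existingDataset, harvestedVersion); // hypothetical
}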

Which issue(s) this PR closes:

Closes #10734

Special notes for your reviewer:

See the description.

One area of concern is that there are currently no tests covering this new functionality. I absolutely want to add more tests for the Harvesting Client subsystem, and I have a good idea of how to test the improvements I've just made. However, I would strongly prefer to address this separately, while at the same time refactoring our current testing setup, which relies on harvesting from https://demo.dataverse.org/oai. We've discussed various ways of achieving that - via test containers, potentially, or even via a crude mock CGI script running under Apache on a Jenkins-run AWS instance. But that deserves to be handled as a task of its own, and I'd prefer not to work on any new tests until that's done.

Suggestions on how to test this:

  1. The obvious functionality test is to confirm that harvesting still works, with the end result identical to what was observed before the changes in this PR - the content from the remote OAI set should get imported and indexed. Harvesting from another Dataverse (such as demo) in the dataverse_json format is recommended, since it provides more and richer metadata to import and index than the default oai_dc and, most importantly, supports harvesting files in addition to datasets.

  2. The goal of the PR was to improve re-harvesting of datasets and files that have already been harvested and imported, so this scenario needs to be tested specifically, by re-harvesting the content created in step 1 above. In normal, real-life operations this would happen when the datasets get updated and re-published on the source side and made available to the next incremental harvest ("incremental" means that, for existing harvesting clients, subsequent harvesting runs ask the remote OAI server for all the records updated since the last successful harvest). For testing, this can be achieved by actually modifying the remote datasets and/or by artificially manipulating the OAI records on the remote server. However, it is much easier to get the same end result by using a simple hack:
    Once you have created a harvesting client and run at least one successful harvest, you should be seeing something like this in the clients dashboard:
    [screenshot: harvesting clients dashboard showing the completed harvest]
    but, if you delete the record of this, and any other harvests from that OAI server, like this:
    DELETE FROM clientharvestrun WHERE harvestingclient_id=N
    your Dataverse is going to assume that this is a brand new client, showing up like this:
    [screenshot: harvesting clients dashboard showing the client as if no harvest had been run yet]
    so, next time you run a harvest it will request and re-harvest ALL the records from the configured set again.
    The goal is to confirm that the end result is still the same - all the previously harvested datasets (and files) should be re-imported and re-indexed. An extra-thorough check would be to verify that the underlying metadata gets re-imported (new DatasetVersions, with new database ids and timestamps, in the database). However, unlike with the code prior to this PR, the re-harvested datasets themselves should be preserved with their preexisting database ids (i.e., we should no longer be destroying and recreating from scratch the datasets that are being re-harvested).

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

@cmbz added the Size: 30 (a percentage of a sprint, 21 hours; formerly size:33) label Sep 11, 2024
@coveralls commented Sep 11, 2024:

Coverage Status: coverage 20.031% (-0.8%) from 20.87% when pulling b2e2f18 on 10734-harvest-reindex-refactoring into f970ab3 on develop.

String globalIdString = globalId.asString();

if (StringUtils.isEmpty(globalIdString)) {
    // @todo this check may not be necessary, now that there's a null check above
Member:

I think that's true



if (ds.getVersions().get(0).getReleaseTime() == null) {
ds.getVersions().get(0).setReleaseTime(oaiDateStamp);
if (harvestedVersion.getReleaseTime() == null) {
Member:

So don't update a current existing version's releaseTime if the oaiDateStamp is different? (I'm assuming a new one just parsed is always null? Or could there be a newly parsed version where releaseTime and oaiDateStamp are different?)

Contributor Author:

Let me check, but I don't think it's always null - if it's a harvest from another Dataverse using our proprietary json format, I would expect it to be the actual releaseTime from the source version.

Contributor Author:

(that said, this was one of the few lines I just adopted intact from the old code without thinking about it much)

Contributor Author:

Interesting. The json export as supplied by the source Dataverse does contain the "releaseTime" timestamp. However, our parseDatasetVersion() method in JsonParser does this:
dsv.setReleaseTime(parseDate(obj.getString("releaseDate", null)));

I can't think of where this releaseDate token comes from right away. The export method in JsonPrinter does

add("releaseTime", format(dsv.getReleaseTime()))

and it never outputs releaseDate. 🤔

Contributor Author:

Haven't tested, but we may be getting a non-null releaseTime from the remote Dataverse when harvesting using DDI as the format. (It looks like the JSON produced from the DTO that comes out of our OAI_DDI will have this "releaseDate" field.) 🤯

Contributor Author:

All that aside, the oaiDateStamp is a sensible substitute for releaseTime, since that is what the remote archive advertises as the date the record in question was last modified and/or made available - i.e., released via OAI.
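
In code, that fallback amounts to something like the following (a minimal sketch mirroring the diff excerpt above; harvestedVersion and oaiDateStamp are the names used there):

if (harvestedVersion.getReleaseTime() == null) {
    // fall back to the OAI datestamp - the date the remote archive reports the
    // record was last modified / made available
    harvestedVersion.setReleaseTime(oaiDateStamp);
}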

@cmbz added the FY25 Sprint 6 label Sep 11, 2024
    for (ConstraintViolation<DatasetField> v : violations) {
        DatasetField f = v.getRootBean();
        f.setSingleValue(DatasetField.NA_VALUE);
    }
}

// is this the right place to call tidyUpFields()?
Member:

I guess removing blank fields is useful before validating further, but do you need to call DatasetVersion.validateRequired() above if you're going to call DatasetVersion.validate() below?

Contributor Author:

Again, I adopted this from the old code without a second thought. It's done like this, in separate passes, in doImport() (the non-harvesting import method) as well. Maybe the two validate methods didn't use to overlap?
But, sure, it does seem like validate() should be enough.
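
A sketch of what the single validation pass could look like (assuming, per the discussion above, that DatasetVersion.validate() also flags missing required values; harvestedVersion comes from the diff, the loop body mirrors the excerpt above, and the exact signature of validate() is an assumption):

Set<ConstraintViolation<DatasetField>> violations = harvestedVersion.validate(); // assumed signature
for (ConstraintViolation<DatasetField> v : violations) {
    DatasetField f = v.getRootBean();
    // blank out the offending value with N/A rather than failing the whole import
    f.setSingleValue(DatasetField.NA_VALUE);
}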

Contributor Author:

(done)


int fileIndex = existingFilesIndex.get(location);

// Make sure to update the existing DataFiles that we are going
// to keep in case their newly-harvested versions have different
Member:

Not sure I'm following the overall goal - if these are different, are you ensuring they're reindexed, and that they aren't otherwise?

Contributor Author:

The goal is to end up with a harvested dataset/datafiles in the database that match the original objects on the source side as closely as possible, while preserving the database ids of the previously harvested objects as much as possible.
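
Roughly, the matching works like this (a sketch; only existingFilesIndex, location and fileIndex appear in the diff above - the other names and getters are hypothetical):

// Index the files already in the database by their storage location.
Map<String, Integer> existingFilesIndex = new HashMap<>();
for (int i = 0; i < existingFiles.size(); i++) {
    existingFilesIndex.put(existingFiles.get(i).getStorageLocation(), i); // hypothetical getter
}

for (DataFile harvestedFile : harvestedFiles) {
    String location = harvestedFile.getStorageLocation(); // hypothetical getter
    if (existingFilesIndex.containsKey(location)) {
        // same file as before: keep the existing DataFile (and its database id),
        // but refresh its metadata from the newly harvested copy
        int fileIndex = existingFilesIndex.get(location);
        updateFromHarvested(existingFiles.get(fileIndex), harvestedFile); // hypothetical
    } else {
        // a genuinely new file on the source side: import it as a new DataFile
        importNewHarvestedFile(harvestedFile); // hypothetical
    }
}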

Member:

OK - I guess my underlying concern is to make sure we don't over-optimize in indexing and assume that, if you have the same fileMetadata and datafile ids, we don't have to check for changes. Not sure we do that now, but not changing existing published ones here would avoid the problem as well. (I don't think Curate reuses ids for fileMetadatas - perhaps it could, if they haven't changed at all, to do the same optimization - although I understand the issue for Harvest is doing many datasets all at once.)

Contributor Author:

May be obvious by now, but my line of thought was, since our solr documents are based on db ids, if I preserve whatever db ids I can, we can, at least in theory, save on deleting the old solr docs and re-creating them from scratch; and instead modify those existing docs. I am actually not 100% sure that that's what our indexing code currently does when reindexing existing datasets. (?) But at least it'll be possible to achieve this, if the solr doc names stay the same for datasets and datafiles that have already been harvested before.

I was not yet thinking of/planning to be able to skip the indexing of the datasets and datafiles that haven't changed since the last harvest. (I don't think we can, in any economical way - ?) But I am hoping that not having to recreate solr docs unnecessarily should be a benefit.

Case in point: we have 20K datasets harvested from Borealis. With files, it's 300K dvobjects, plus the same number of permission docs - on the order of 1M. They have marked most of the OAI records as having been updated since the last time we harvested. So then we are talking about deleting and rebuilding from scratch that many docs on an index that has close to 10M docs total. Being able to reuse most of them ought to be a serious saving - ?
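
For a sense of where the saving comes from: the Solr document ids are derived from the database ids, so preserving the ids means the existing documents can be updated instead of recreated (the exact naming below is an assumption, for illustration only):

// Illustrative only - assumed naming scheme for Solr doc ids based on db ids:
String datasetDocId  = "dataset_"  + existingDataset.getId();  // stays the same across re-harvests with this PR
String datafileDocId = "datafile_" + existingDataFile.getId();
// If the db ids changed on every re-harvest (the old destroy-and-recreate model),
// all of those docs would have to be deleted and rebuilt from scratch.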

throw new ImportException("Error importing Harvested Dataset, existing dataset has " + existingDataset.getVersions().size() + " versions");
}

harvestedVersion = parser.parseDatasetVersion(obj.getJsonObject("datasetVersion"));
Contributor Author:

The next few lines felt kludgy to write... and they are, I can see now how some of the extra logic is not needed and how to streamline things here.

Contributor Author:

(I have reviewed and reorganized the code slightly since then; I wasn't able to simplify it as much as I thought I was going to. Regardless, I feel comfortable with the state of the code now)

// purge all the SOLR documents associated with the files
// we have just deleted:
if (!solrIdsOfDocumentsToDelete.isEmpty()) {
    ctxt.index().deleteHarvestedDocuments(solrIdsOfDocumentsToDelete);
Member:

I don't think you need to do this - the reindex will drop any docs that don't correspond to current files.

Contributor Author:

I thought about it - but I figured I would leave the delete call in place. It's so easy to do there - I already know the ids of the documents that need to be purged; so, less work for the reindex code, and I don't think it can possibly hurt.

@qqmyers assigned landreev and unassigned qqmyers Sep 13, 2024


@landreev removed their assignment Sep 19, 2024
@qqmyers self-assigned this Sep 23, 2024
@qqmyers (Member) left a comment:

Looks good.


@qqmyers removed their assignment Sep 23, 2024
@cmbz added the FY25 Sprint 7 (2024-09-25 - 2024-10-09) and FY25 Sprint 8 (2024-10-09 - 2024-10-23) labels Oct 8, 2024
@pdurbin added the Type: Feature (a feature request) label Oct 9, 2024
@ofahimIQSS (Contributor) commented:

Hi @landreev, can you please update the revision number from 6.3 to 6.4 in modules/dataverse-parent/pom.xml?

@landreev (Contributor, author) commented:

Missed that, sorry. Will do in a sec.


📦 Pushed preview images as

ghcr.io/gdcc/dataverse:10734-harvest-reindex-refactoring
ghcr.io/gdcc/configbaker:10734-harvest-reindex-refactoring

🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name.

@cmbz added the FY25 Sprint 9 (2024-10-23 - 2024-11-06) label Oct 23, 2024
@ofahimIQSS self-assigned this Oct 24, 2024
@ofahimIQSS (Contributor) commented:
Testing Done - No issues found
Testing of 10734.docx

@ofahimIQSS merged commit 858eab6 into develop Oct 25, 2024 (19 of 22 checks passed)
@ofahimIQSS deleted the 10734-harvest-reindex-refactoring branch October 25, 2024 15:54
@ofahimIQSS removed their assignment Oct 25, 2024
@pdurbin added this to the 6.5 milestone Oct 25, 2024
Labels: Feature: Harvesting; FY25 Sprint 6; FY25 Sprint 7 (2024-09-25 - 2024-10-09); FY25 Sprint 8 (2024-10-09 - 2024-10-23); FY25 Sprint 9 (2024-10-23 - 2024-11-06); Size: 30; Type: Feature
Projects: none yet
Development: successfully merging this pull request may close "Refactor reindexing of harvested datasets" (#10734)
7 participants