
10734 Refactoring and optimization of importing and reindexing of harvested content #10836

Merged 11 commits into develop from 10734-harvest-reindex-refactoring on Oct 25, 2024

Conversation

@landreev (Contributor) commented Sep 11, 2024:

What this PR does / why we need it:

Despite the title of the original issue - "Refactor reindexing of harvested datasets" - a careful reviewer will notice that the code changes in the PR have relatively little to do with the actual indexing. It nevertheless achieves the specified goal in #10734, which was to be able to take advantage of the recent indexing improvements, specifically where we now make an effort to avoid unnecessarily deleting and rebuilding from scratch the existing solr documents associated with the objects that are being reindexed. That approach could not be used with harvested content, because our harvesting model relied on the idea of destroying re-harvested datasets and recreating them from scratch. Note that this is a practical concern, especially when harvesting from other Dataverses, in one of the richer formats that result in harvesting both datasets and files. If a remote Dataverse marks a large number of their OAI-served datasets as updated, harvesting from them can potentially result in a very large number of dataset and file documents that need to be reindexed in Solr all at once. And that can be extra expensive on a Solr instance that already has a million indexed documents. (Harvesting and importing metadata are relatively low-cost tasks on their own).

The code changes in the PR address this by attempting to preserve and update the existing database entities of reharvested objects (and, therefore, the database ids of such, on which the tracking of Solr documents is based).
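
In rough outline, the intended flow looks something like this (an illustrative sketch only, not the actual code in the PR; findExistingHarvestedDataset(), importNewHarvestedDataset() and updateHarvestedDatasetInPlace() are hypothetical names, while parser.parseDatasetVersion(...) is quoted from the changes discussed below):

// Illustrative sketch: reuse the existing Dataset entity (and its database id)
// when a record is re-harvested, instead of destroying it and recreating it.
Dataset existingDataset = findExistingHarvestedDataset(globalId); // hypothetical lookup by persistent id
DatasetVersion harvestedVersion = parser.parseDatasetVersion(obj.getJsonObject("datasetVersion"));

if (existingDataset == null) {
    // first harvest of this record: import it as a brand-new harvested dataset
    importNewHarvestedDataset(harvestedVersion); // hypothetical
} else {
    // re-harvest: attach the freshly parsed version to the existing dataset,
    // keeping its database id so the associated Solr documents can be updated
    // in place rather than deleted and rebuilt
    updateHarvestedDatasetInPlace(existingDataset, harvestedVersion); // hypothetical
}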

Which issue(s) this PR closes:

Closes #10734

Special notes for your reviewer:

See the description.

One area of concern is that there are currently no tests covering this new functionality. I absolutely want to add more tests for the Harvesting Client subsystem, and I have a good idea of how to test the improvements I've just made. However, I would strongly prefer to address this separately, while at the same time refactoring our current testing setup, which relies on harvesting from https://demo.dataverse.org/oai. We've discussed various ways of achieving that - via test containers, potentially, or even via a crude mock CGI script running under Apache on a Jenkins-run AWS instance. But that deserves to be handled as a task of its own, and I'd prefer not to work on any new tests until that's done.

Suggestions on how to test this:

  1. The obvious functionality test is to confirm that harvesting still works, with the end result identical to what was observed before the changes in this PR - the content from the remote OAI set should get imported and indexed. Harvesting from another Dataverse (such as demo) in the dataverse_json format is recommended, since it provides more and richer metadata to import and index than the default oai_dc and, most importantly, supports harvesting files in addition to datasets.

  2. The goal of the PR was to improve re-harvesting of datasets and files that have already been harvested and imported, so this scenario needs to be tested specifically, by re-harvesting the content created in step 1 above. In normal, real-life operations this would happen when the datasets get updated and re-published on the source side and made available to the next incremental harvest ("incremental" means that, for existing harvesting clients, subsequent harvesting runs ask the remote OAI server for all the records updated since the last successful harvest). For testing, this can be achieved by actually modifying the remote datasets and/or by artificially manipulating the OAI records on the remote server. However, it is much easier to get the same end result by using a simple hack:
    Once you have created a harvesting client and run at least one successful harvest, you should be seeing something like this in the clients dashboard:
    [screenshot: harvesting clients dashboard showing the completed harvest]
    but, if you delete the record of this, and any other harvests from that OAI server, like this:
    DELETE FROM clientharvestrun WHERE harvestingclient_id=N
    your Dataverse is going to assume that this is a brand new client, showing up like this:
    [screenshot: harvesting clients dashboard showing the client as if no harvest had been run yet]
    so, next time you run a harvest it will request and re-harvest ALL the records from the configured set again.
    The goal is to confirm that the end result is still the same - all the previously harvested datasets (and files) should be re-imported and re-indexed. An extra-thorough check would be to verify that the underlying metadata gets re-imported (new DatasetVersions, with new database ids and timestamps, in the database). However, unlike with the code prior to this PR, the re-harvested datasets themselves should be preserved with their preexisting database ids (i.e., we should no longer be destroying and recreating from scratch the datasets that are being re-harvested).

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

@cmbz added the Size: 30 (a percentage of a sprint, 21 hours; formerly size:33) label Sep 11, 2024
@coveralls commented Sep 11, 2024:

Coverage Status: coverage 20.031% (-0.8%) from 20.87% when pulling b2e2f18 on 10734-harvest-reindex-refactoring into f970ab3 on develop.

String globalIdString = globalId.asString();

if (StringUtils.isEmpty(globalIdString)) {
    // @todo this check may not be necessary, now that there's a null check above
Member:

I think that's true



if (ds.getVersions().get(0).getReleaseTime() == null) {
ds.getVersions().get(0).setReleaseTime(oaiDateStamp);
if (harvestedVersion.getReleaseTime() == null) {
Member:

So don't update a current existing version's releaseTime if the oaiDateStamp is different? (I'm assuming a new one just parsed is always null? Or could there be a newly parsed version where releaseTime and oaiDateStamp are different?)

Contributor Author:

Let me check, but I don't think it's always null - if it's a harvest from another Dataverse using our proprietary json format, I would expect it to be the actual releaseTime from the source version.

Contributor Author:

(that said, this was one of the few lines I just adopted intact from the old code without thinking about it much)

Contributor Author:

Interesting. The json export as supplied by the source Dataverse does contain the "releaseTime" timestamp. However, our parseDatasetVersion() method in JsonParser does this:
dsv.setReleaseTime(parseDate(obj.getString("releaseDate", null)));

I can't think of where this releaseDate token comes from right away. The export method in JsonPrinter does

add("releaseTime", format(dsv.getReleaseTime()))

and it never outputs releaseDate. 🤔

Contributor Author:

Haven't tested, but we may be getting a non-null releaseTime from the remote Dataverse when harvesting using DDI as the format. (It looks like the JSON produced from the DTO that comes out of our OAI_DDI will have this "releaseDate" field.) 🤯

Contributor Author:

All that aside, the oaiDateStamp is a sensible substitute for releaseTime, since that is what the remote archive advertises as the date the record in question was last modified and/or made available - i.e., released via OAI.
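
In code, that fallback amounts to something like the following (a minimal sketch mirroring the diff excerpt above; harvestedVersion and oaiDateStamp are the names used there):

if (harvestedVersion.getReleaseTime() == null) {
    // fall back to the OAI datestamp - the date the remote archive reports the
    // record was last modified / made available
    harvestedVersion.setReleaseTime(oaiDateStamp);
}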

@cmbz added the FY25 Sprint 6 label Sep 11, 2024
    for (ConstraintViolation<DatasetField> v : violations) {
        DatasetField f = v.getRootBean();
        f.setSingleValue(DatasetField.NA_VALUE);
    }
}

// is this the right place to call tidyUpFields()?
Member:

I guess removing blank fields is useful before validating further, but do you need to call DatasetVersion.validateRequired() above if you're going to call DatasetVersion.validate() below?

Contributor Author:

Again, I adopted this from the old code without a second thought. It's done like this, in separate passes, in doImport() (the non-harvesting import method) as well. Maybe the two validate methods didn't use to overlap?
But, sure, it does seem like validate() should be enough.
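
A sketch of what the single validation pass could look like (assuming, per the discussion above, that DatasetVersion.validate() also flags missing required values; harvestedVersion comes from the diff, the loop body mirrors the excerpt above, and the exact signature of validate() is an assumption):

Set<ConstraintViolation<DatasetField>> violations = harvestedVersion.validate(); // assumed signature
for (ConstraintViolation<DatasetField> v : violations) {
    DatasetField f = v.getRootBean();
    // blank out the offending value with N/A rather than failing the whole import
    f.setSingleValue(DatasetField.NA_VALUE);
}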

Contributor Author:

(done)


int fileIndex = existingFilesIndex.get(location);

// Make sure to update the existing DataFiles that we are going
// to keep in case their newly-harvested versions have different
Member:

Not sure I'm following the overall goal - if these are different, are you ensuring they're reindexed, and that they aren't otherwise?

Contributor Author:

The goal is to end up with a harvested dataset/datafiles in the database that match the original objects on the source side as closely as possible, while preserving the database ids of the previously harvested objects as much as possible.
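
Roughly, the matching works like this (a sketch; only existingFilesIndex, location and fileIndex appear in the diff above - the other names and getters are hypothetical):

// Index the files already in the database by their storage location.
Map<String, Integer> existingFilesIndex = new HashMap<>();
for (int i = 0; i < existingFiles.size(); i++) {
    existingFilesIndex.put(existingFiles.get(i).getStorageLocation(), i); // hypothetical getter
}

for (DataFile harvestedFile : harvestedFiles) {
    String location = harvestedFile.getStorageLocation(); // hypothetical getter
    if (existingFilesIndex.containsKey(location)) {
        // same file as before: keep the existing DataFile (and its database id),
        // but refresh its metadata from the newly harvested copy
        int fileIndex = existingFilesIndex.get(location);
        updateFromHarvested(existingFiles.get(fileIndex), harvestedFile); // hypothetical
    } else {
        // a genuinely new file on the source side: import it as a new DataFile
        importNewHarvestedFile(harvestedFile); // hypothetical
    }
}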

Member:

OK - I guess my underlying concern is to make sure we don't over-optimize in indexing and assume that, if you have the same fileMetadata and datafile ids, we don't have to check for changes. Not sure we do that now, but not changing existing published ones here would avoid the problem as well. (I don't think Curate reuses ids for fileMetadatas - perhaps it could, if they haven't changed at all, to do the same optimization - although I understand the issue for Harvest is doing many datasets all at once.)

Contributor Author:

May be obvious by now, but my line of thought was, since our solr documents are based on db ids, if I preserve whatever db ids I can, we can, at least in theory, save on deleting the old solr docs and re-creating them from scratch; and instead modify those existing docs. I am actually not 100% sure that that's what our indexing code currently does when reindexing existing datasets. (?) But at least it'll be possible to achieve this, if the solr doc names stay the same for datasets and datafiles that have already been harvested before.

I was not yet thinking of/planning to be able to skip the indexing of the datasets and datafiles that haven't changed since the last harvest. (I don't think we can, in any economical way - ?) But I am hoping that not having to recreate solr docs unnecessarily should be a benefit.

Case in point: we have 20K datasets harvested from Borealis. With files, it's 300K dvobjects, plus the same number of permission docs - on the order of 1M. They have marked most of the OAI records as having been updated since the last time we harvested. So then we are talking about deleting and rebuilding from scratch that many docs on an index that has close to 10M docs total. Being able to reuse most of them ought to be a serious saving - ?
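
For a sense of where the saving comes from: the Solr document ids are derived from the database ids, so preserving the ids means the existing documents can be updated instead of recreated (the exact naming below is an assumption, for illustration only):

// Illustrative only - assumed naming scheme for Solr doc ids based on db ids:
String datasetDocId  = "dataset_"  + existingDataset.getId();  // stays the same across re-harvests with this PR
String datafileDocId = "datafile_" + existingDataFile.getId();
// If the db ids changed on every re-harvest (the old destroy-and-recreate model),
// all of those docs would have to be deleted and rebuilt from scratch.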

throw new ImportException("Error importing Harvested Dataset, existing dataset has " + existingDataset.getVersions().size() + " versions");
}

harvestedVersion = parser.parseDatasetVersion(obj.getJsonObject("datasetVersion"));
Contributor Author:

The next few lines felt kludgy to write... and they are, I can see now how some of the extra logic is not needed and how to streamline things here.

Contributor Author:

(I have reviewed and reorganized the code slightly since then; I wasn't able to simplify it as much as I thought I was going to. Regardless, I feel comfortable with the state of the code now)

// purge all the SOLR documents associated with the files
// we have just deleted:
if (!solrIdsOfDocumentsToDelete.isEmpty()) {
    ctxt.index().deleteHarvestedDocuments(solrIdsOfDocumentsToDelete);
Member:

I don't think you need to do this - the reindex will drop any docs that don't correspond to current files.

Contributor Author:

I thought about it - but I figured I would leave the delete call in place. It's so easy to do there - I already know the ids of the documents that need to be purged; so, less work for the reindex code, and I don't think it can possibly hurt.

@qqmyers assigned landreev and unassigned qqmyers Sep 13, 2024


@landreev removed their assignment Sep 19, 2024
@qqmyers self-assigned this Sep 23, 2024
@qqmyers (Member) left a comment:

Looks good.


@qqmyers removed their assignment Sep 23, 2024
@cmbz added the FY25 Sprint 7 (2024-09-25 - 2024-10-09) and FY25 Sprint 8 (2024-10-09 - 2024-10-23) labels Oct 8, 2024
@pdurbin added the Type: Feature (a feature request) label Oct 9, 2024
@ofahimIQSS (Contributor) commented:

Hi @landreev, can you please update the revision number from 6.3 to 6.4 in modules/dataverse-parent/pom.xml?

@landreev (Contributor, author) commented:

Missed that, sorry. Will do in a sec.


📦 Pushed preview images as

ghcr.io/gdcc/dataverse:10734-harvest-reindex-refactoring
ghcr.io/gdcc/configbaker:10734-harvest-reindex-refactoring

🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name.

@cmbz added the FY25 Sprint 9 (2024-10-23 - 2024-11-06) label Oct 23, 2024
@ofahimIQSS self-assigned this Oct 24, 2024
@ofahimIQSS (Contributor) commented:
Testing Done - No issues found
Testing of 10734.docx

@ofahimIQSS merged commit 858eab6 into develop Oct 25, 2024 (19 of 22 checks passed)
@ofahimIQSS deleted the 10734-harvest-reindex-refactoring branch October 25, 2024 15:54
@ofahimIQSS removed their assignment Oct 25, 2024
@pdurbin added this to the 6.5 milestone Oct 25, 2024
Labels: Feature: Harvesting; FY25 Sprint 6; FY25 Sprint 7 (2024-09-25 - 2024-10-09); FY25 Sprint 8 (2024-10-09 - 2024-10-23); FY25 Sprint 9 (2024-10-23 - 2024-11-06); Size: 30; Type: Feature
Projects: none yet
Development: successfully merging this pull request may close "Refactor reindexing of harvested datasets" (#10734)
7 participants