10734 Refactoring and optimization of importing and reindexing of harvested content #10836
Conversation
String globalIdString = globalId.asString();

if (StringUtils.isEmpty(globalIdString)) {
    // @todo this check may not be necessary, now that there's a null check above
I think that's true
if (ds.getVersions().get(0).getReleaseTime() == null) {
    ds.getVersions().get(0).setReleaseTime(oaiDateStamp);
if (harvestedVersion.getReleaseTime() == null) {
So don't update a current existing version's releaseTime if the oaiDateStamp is different? (I'm assuming a new one just parsed is always null? Or could there be a newly parsed version where releaseTime and oaiDateStamp are different?)
Let me check, but I don't think it's always null - if it's a harvest from another Dataverse using our proprietary json format, I would expect it to be the actual releaseTime from the source version.
(that said, this was one of the few lines I just adopted intact from the old code without thinking about it much)
Interesting. The json export as supplied by the source Dataverse does contain the "releaseTime"
timestamp. However, our parseDatasetVersion()
method in JsonParser
does this:
dsv.setReleaseTime(parseDate(obj.getString("releaseDate", null)));
I can't think of where this releaseDate
token comes from right away. The export method in JsonPrinter
does
add("releaseTime", format(dsv.getReleaseTime()))
and it never outputs releaseDate
. 🤔
Haven't tested, but we may be getting a non-null releaseTime from the remote Dataverse when harvesting using DDI as the format. (It looks like the json produced from the DTO produced by our OAI_DDI import will have this "releaseDate" field.) 🤯
All that aside, the oaiDateStamp is a sensible substitute for releaseTime, since that is what the remote archive advertises as the date the record in question was last modified and/or made available - or released via OAI.
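To make the fallback being discussed concrete, here is a minimal sketch of that behavior, reusing the harvestedVersion / oaiDateStamp names from the diff above; the helper method itself is hypothetical and not the actual import code path.

```java
import java.util.Date;
import edu.harvard.iq.dataverse.DatasetVersion;

// Hypothetical helper, for illustration only -- not the actual import code path.
public final class ReleaseTimeFallbackSketch {
    static void applyReleaseTimeFallback(DatasetVersion harvestedVersion, Date oaiDateStamp) {
        // Keep the releaseTime supplied by the source (e.g., present in a
        // dataverse_json export from another Dataverse); otherwise fall back
        // to the OAI datestamp, i.e., the date the remote archive says the
        // record was last modified or made available.
        if (harvestedVersion.getReleaseTime() == null) {
            harvestedVersion.setReleaseTime(oaiDateStamp);
        }
    }
}
```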
for (ConstraintViolation<DatasetField> v : violations) {
    DatasetField f = v.getRootBean();
    f.setSingleValue(DatasetField.NA_VALUE);
}
}

// is this the right place to call tidyUpFields()?
I guess removing blank fields is useful before validating further, but do you need to call DatasetVersion.validateRequired above if you're going to call DatasetVersion.validate() below?
Again, adopted this from the old code without a second thought. It's done like this, in separate passes, in doImport() (the non-harvesting import method) as well. Maybe the 2 validate methods did not use to overlap? But, sure, it does seem like validate() should be enough.
(done)
int fileIndex = existingFilesIndex.get(location);

// Make sure to update the existing DataFiles that we are going
// to keep in case their newly-harvested versions have different
Not sure I'm following the overall goal - if these are different, are you ensuring they're reindexed, and that they aren't otherwise?
The goal is to end up with a harvested dataset/datafiles in the database that match the original objects on the source side as closely as possible, while preserving the database ids of the previously harvested objects as much as possible.
OK - I guess my underlying concern is to make sure we don't over-optimize in indexing and assume that if you have the same fileMetadata and datafile ids, we don't have to check for changes. Not sure we do that now, but not changing existing published ones here would avoid the problem as well. (I don't think Curate reuses ids for fileMetadatas - perhaps it could, if they haven't changed at all, to do the same optimization - although I understand the issue for Harvest is doing many datasets all at once.)
May be obvious by now, but my line of thought was, since our solr documents are based on db ids, if I preserve whatever db ids I can, we can, at least in theory, save on deleting the old solr docs and re-creating them from scratch; and instead modify those existing docs. I am actually not 100% sure that that's what our indexing code currently does when reindexing existing datasets. (?) But at least it'll be possible to achieve this, if the solr doc names stay the same for datasets and datafiles that have already been harvested before.
I was not yet thinking of/planning to be able to skip the indexing of the datasets and datafiles that haven't changed since the last harvest. (I don't think we can, in any economical way - ?) But I am hoping that not having to recreate solr docs unnecessarily should be a benefit.
Case in point: we have 20K datasets harvested from Borealis. With files, that's 300K dvobjects, plus the same number of permission docs - on the order of magnitude of 1M. They have marked most of the OAI records as having been updated since the last time we harvested. So then we are talking about deleting and rebuilding from scratch that many docs on an index that has close to 10M docs total. Being able to reuse most of them has got to be a serious saving - ?
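For illustration of the kind of saving this enables - and explicitly not a claim about what IndexServiceBean currently does when reindexing - here is roughly what modifying an existing solr document in place looks like at the plain SolrJ level. The "dataset_<dbId>" document id scheme and the "title" field name are assumptions for the sketch.

```java
import java.io.IOException;
import java.util.Map;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.common.SolrInputDocument;

// Illustration only: a Solr "atomic update" modifies selected fields of an
// existing document instead of deleting and re-adding the whole document.
public final class SolrInPlaceUpdateSketch {
    static void updateReharvestedDatasetDoc(SolrClient solr, long datasetDbId, String newTitle)
            throws SolrServerException, IOException {
        SolrInputDocument doc = new SolrInputDocument();
        // Because the database id of the re-harvested dataset is preserved,
        // the solr document id (assumed "dataset_<dbId>" naming) stays the same.
        doc.addField("id", "dataset_" + datasetDbId);
        // The nested "set" map tells Solr to replace just this field in place.
        doc.addField("title", Map.of("set", newTitle));
        solr.add(doc);
        solr.commit();
    }
}
```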
throw new ImportException("Error importing Harvested Dataset, existing dataset has " + existingDataset.getVersions().size() + " versions"); | ||
} | ||
|
||
harvestedVersion = parser.parseDatasetVersion(obj.getJsonObject("datasetVersion")); |
The next few lines felt kludgy to write... and they are, I can see now how some of the extra logic is not needed and how to streamline things here.
(I have reviewed and reorganized the code slightly since then; I wasn't able to simplify it as much as I thought I was going to. Regardless, I feel comfortable with the state of the code now)
// purge all the SOLR documents associated with the files
// we have just deleted:
if (!solrIdsOfDocumentsToDelete.isEmpty()) {
    ctxt.index().deleteHarvestedDocuments(solrIdsOfDocumentsToDelete);
I don't think you need to do this - the reindex will drop any docs that don't correspond to current files.
I thought about it - but I figured I would leave the delete call in place. It's so easy to do there - I already know the ids of the documents that need to be purged; so, less work for the reindex code, and I don't think it can possibly hurt.
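For context, purging a known list of documents is indeed cheap at the SolrJ level; a sketch under that assumption, not the actual deleteHarvestedDocuments() implementation:

```java
import java.io.IOException;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;

// Sketch only: deleting solr documents by a known list of ids.
public final class SolrPurgeSketch {
    static void purgeDocs(SolrClient solr, List<String> solrIdsOfDocumentsToDelete)
            throws SolrServerException, IOException {
        if (!solrIdsOfDocumentsToDelete.isEmpty()) {
            // The ids are already known at this point in the harvest, so the
            // purge is a single cheap request rather than extra work left for
            // the subsequent reindex to figure out.
            solr.deleteById(solrIdsOfDocumentsToDelete);
            solr.commit();
        }
    }
}
```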
Looks good.
Hi @landreev, can you please update the revision number from 6.3 to 6.4 in the modules/dataverse-parent/pom.xml
Missed that, sorry. Will do in a sec.
📦 Pushed preview images as
🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name.
Testing Done - No issues found
What this PR does / why we need it:
Despite the title of the original issue - "Refactor reindexing of harvested datasets" - a careful reviewer will notice that the code changes in the PR have relatively little to do with the actual indexing. It nevertheless achieves the specified goal in #10734, which was to be able to take advantage of the recent indexing improvements, specifically where we now make an effort to avoid unnecessarily deleting and rebuilding from scratch the existing solr documents associated with the objects that are being reindexed. That approach could not be used with harvested content, because our harvesting model relied on the idea of destroying re-harvested datasets and recreating them from scratch. Note that this is a practical concern, especially when harvesting from other Dataverses, in one of the richer formats that result in harvesting both datasets and files. If a remote Dataverse marks a large number of their OAI-served datasets as updated, harvesting from them can potentially result in a very large number of dataset and file documents that need to be reindexed in Solr all at once. And that can be extra expensive on a Solr instance that already has a million indexed documents. (Harvesting and importing metadata are relatively low-cost tasks on their own).
The code changes in the PR address this by attempting to preserve and update the existing database entities of reharvested objects (and, therefore, the database ids of such, on which the tracking of Solr documents is based).
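To sketch the idea (an illustration based on the diff excerpts above, such as existingFilesIndex, and not the actual code in the PR): previously harvested DataFiles are matched to the newly harvested ones by storage location, and the matching entities are kept, so their database ids - and therefore their solr document ids - survive the re-harvest.

```java
import java.util.HashMap;
import java.util.Map;
import edu.harvard.iq.dataverse.DataFile;
import edu.harvard.iq.dataverse.DatasetVersion;
import edu.harvard.iq.dataverse.FileMetadata;

// Rough sketch of the matching idea only; the real code in the PR differs in
// the details (and also handles deletions, checksums, etc.).
public final class ReharvestFileMatchingSketch {

    static void carryOverExistingDataFiles(DatasetVersion existingVersion, DatasetVersion harvestedVersion) {
        // Index the previously harvested files by their storage location.
        Map<String, DataFile> existingFilesIndex = new HashMap<>();
        for (FileMetadata fmd : existingVersion.getFileMetadatas()) {
            existingFilesIndex.put(fmd.getDataFile().getStorageIdentifier(), fmd.getDataFile());
        }
        // For every newly harvested file that matches an existing location,
        // reuse the existing DataFile entity so its db id (and therefore its
        // solr document id) is preserved across the re-harvest.
        for (FileMetadata harvestedFmd : harvestedVersion.getFileMetadatas()) {
            DataFile existing = existingFilesIndex.get(harvestedFmd.getDataFile().getStorageIdentifier());
            if (existing != null) {
                harvestedFmd.setDataFile(existing);
            }
            // Unmatched harvested files are treated as new; existing files that
            // no longer appear in the re-harvested version are deleted elsewhere.
        }
    }
}
```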
Which issue(s) this PR closes:
Closes #10734
Special notes for your reviewer:
See the description.
One area of concern is that there are currently no tests covering this new functionality. I absolutely want to add more tests for the Harvesting Client subsystem and I have a good idea of how to test the improvements I've just made. However, I would strongly prefer to address this separately, while at the same time refactoring our current testing setup, which relies on harvesting from https://demo.dataverse.org/oai. We've discussed various ways of achieving that - via test containers potentially, or even via a crude mock CGI script running under Apache on a Jenkins-run AWS instance. But that deserves to be handled as a task of its own, and I'd prefer not to work on any new tests until that's done.

Suggestions on how to test this:
1. The obvious functionality test is to confirm that harvesting is still working, with the end result identical to what was observed without the changes in this PR - the content from the remote OAI set should get imported and indexed. Harvesting from another Dataverse (such as demo) in the dataverse_json format is recommended, since it provides more and richer metadata to import and index than the default oai_dc and, most importantly, this format offers harvesting files in addition to datasets.
2. The goal of the PR was to improve re-harvesting of datasets and files that have already been harvested and imported. So this scenario needs to be specifically tested by re-harvesting the content created in step 1., above. In normal, real-life operations this would happen when the datasets get updated and re-published on the source side and made available to the next incremental harvest. ("Incremental" means that for existing harvesting clients, subsequent harvesting runs ask the remote OAI server for all the records updated since the last successful harvest.) For testing, this can be achieved by actually modifying the remote datasets and/or by artificially manipulating the OAI records on the remote server. However, it is much easier to get the same end result by using a simple hack:


Once you have created a harvesting client and run at least one successful harvest, you should be seeing something like this in the clients dashboard:
but, if you delete the record of this harvest, and of any other harvests from that OAI server, like this:
DELETE FROM clientharvestrun WHERE harvestingclient_id=N
your Dataverse is going to assume that this is a brand new client, showing up like this:
so, next time you run a harvest it will request and re-harvest ALL the records from the configured set again.
The goal is to confirm that the end result is still the same - all the previously harvested datasets (and files) should be re-imported and re-indexed. An extra thorough check would be to verify that the underlying metadata gets re-imported (new DatasetVersions with new database ids and time stamps in the database). However, unlike with the code prior to this PR, the re-harvested datasets themselves should be preserved, with their preexisting database ids (i.e., we should no longer be destroying the datasets that are being re-harvested and recreating them from scratch).
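One way to spot-check the id preservation is to snapshot the dataset ids for the client before and after the re-harvest; a rough sketch, where the table/column names (dataset.harvestingclient_id) are an assumption and N is the client id, as in the hack above:

```sql
-- Run before and after the re-harvest; the two lists of database ids should
-- be identical if the datasets were updated in place rather than destroyed
-- and recreated. (Table and column names are assumptions.)
SELECT id FROM dataset WHERE harvestingclient_id = N ORDER BY id;
```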
Does this PR introduce a user interface change? If mockups are available, please link/include them here:
Is there a release notes update needed for this change?:
Additional documentation: