Develop/find some end-user stories #30
Some helpful comments from @mattfry-ceh on the product description document for kick-starting things here:
Some more ideas, from the product description document:
Next question to think about: how do the above use cases affect the need for, and form of, an API?
The three dominant languages in use in the hydrological community are Python, R and Fortran (note this is just my hunch from experience, I haven't verified it anywhere...).

Modellers using Fortran code will struggle to make use of this product. They will want to be able to download the driving data and have it available locally on disk (which we don't really want). An alternative is for us to write some bespoke Fortran code, or wrap some Python in Fortran, that these modellers could use to access the data from the object store instead. That could be a lot of work if Fortran libraries to do this don't already exist. Are there Fortran libraries to integrate with an API?

Python and R modellers will hopefully take this up more readily, given we are aiming to make the necessary changes to scripts as minimal as possible. My hunch is that most Python modellers use xarray to work with NetCDF; those that don't might need to switch to it, but we can already point them at a UKCEH training course I developed for xarray (or the vast "array" of training courses for xarray that already exist).

I'm not so sure about R, but I would hope we can do similar things to what we can do in Python: provide an intermediate library (possibly with an API) between the object store and the netcdf4 R package that essentially makes the object store appear as a disk to the user.
People are used to running scripts like this on local and local-ish machines (such as UKCEH private-cloud VMs or JASMIN sci servers). For small requests of data (that fit in memory) I guess this remains fine. I don't think there's a specific case for an API here from the users' perspective, but we might want one as developers for other development/monitoring reasons. Either way, we would need to supply the template/boilerplate code needed to access the data (which can either involve an API or not). A separate issue is consistent Python/R environment setups, but we can provide basic instructions with the example code and wash our hands of the rest ("not our problem"), especially if we're providing other environment-controlled infrastructure on which the code can be run. Such as...:
...if the example code on the EIDC catalogue page of the data links to a notebook on Datalabs (or the JASMIN notebook service, or JupyterLab on AWS, etc.) with the right environment already installed. Previously, I (and users doing this sort of thing) have accessed object-storage data using some boilerplate "run once and get out of the way" code/code libraries like fsspec or Intake, where all the necessary config can be prefilled/pregenerated for a given dataset and then ignored. All the user then needs to do is input their secrets/credentials if necessary (usually when the dataset is not public) and essentially run code as normal via an xarray open_zarr command. Any API and associated config would have to play nicely with these libraries and be similarly "get out of the way" code.
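For illustration, the pregenerated "run once and get out of the way" boilerplate described above might look something like this. The bucket name, endpoint URL and credential placeholders are all hypothetical, and `open_gear` is an invented helper name — this is a sketch of the pattern, not a real interface:

```python
# Pregenerated per-dataset config that the user can treat as a black box.
# Every value here is a hypothetical placeholder, not a real endpoint.
GEAR_1HRLY = {
    "url": "s3://hypothetical-bucket/gear_1hrly.zarr",
    "storage_options": {
        "anon": False,
        "key": "<ACCESS_KEY>",     # secrets only needed when the
        "secret": "<SECRET_KEY>",  # dataset is not public
        "client_kwargs": {"endpoint_url": "https://objectstore.example.org"},
    },
}


def open_gear():
    """Open the dataset lazily; from here on, analysis code is ordinary xarray."""
    # Imported inside the function so the config above can be inspected
    # without fsspec/xarray installed.
    import fsspec
    import xarray as xr

    mapper = fsspec.get_mapper(GEAR_1HRLY["url"], **GEAR_1HRLY["storage_options"])
    return xr.open_zarr(mapper)
```

A user would then just call `open_gear()` once at the top of their script; any API we add would need to slot in behind a helper like this without changing that experience.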
What does "access this type of data over the web" mean? I'm going to assume it means accessing via a website/portal with a clicky GUI and maps.
Those are the simple ones. More complicated use cases, which would need the aggregation to be processed somewhere, are:
Exactly which will be the dominant use case will depend on the dataset. For our trial dataset, GEAR 1hrly, extracting timeseries for spatial points/catchment areas for analysis is probably the dominant use case?
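To make the point-timeseries use case concrete, here is a plain-numpy sketch of the nearest-grid-cell extraction that xarray's `ds.rainfall.sel(x=..., y=..., method="nearest")` would perform on a gridded product like GEAR 1hrly. The grid spacing, coordinates and rainfall values are all synthetic:

```python
import numpy as np

# Synthetic hourly rainfall: 48 timesteps on a 10 x 10 grid
# (a hypothetical 1 km easting/northing grid, not the real GEAR grid).
rng = np.random.default_rng(0)
rain = rng.random((48, 10, 10))
eastings = np.arange(0, 10_000, 1_000)
northings = np.arange(0, 10_000, 1_000)


def point_timeseries(easting, northing):
    """Timeseries for the grid cell nearest a point — the indexing that
    xarray's .sel(..., method="nearest") does under the hood."""
    i = np.abs(eastings - easting).argmin()
    j = np.abs(northings - northing).argmin()
    return rain[:, j, i]


ts = point_timeseries(4_320, 7_890)
print(ts.shape)  # one value per timestep
```

A catchment-area extraction is the same idea with a boolean mask over the grid followed by a spatial mean, which is where the "aggregation processed somewhere" question starts to bite.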
This is more about the rechunking tool stage of the product, so less relevant to the API discussion.
How will they want to use the available data/interact with it?
This will inform discussions about the API and version-control layers.