new recipe format #59
So, to have an overview: one of the problems we have to solve is that we are joining two or more scopes of concepts that might be incompatible with one another. Right now the scope of procedures is always the ingredient. Two ingredients in one recipe can use the same concept (with a different definition, values, etc.) without a problem; only when merging two ingredients into one could merging the two scopes cause problems.

Now, with our new plan, the scope of what a recipe modifies moves to a dataset, and it would be great if we could keep the scope of a recipe to one dataset as well; it keeps recipes simple. But in a recipe we want to join the scopes of different datasets into one, and to be able to join the scopes, transformations need to be done. To do those transformations we need to be able to work in the scopes of the datasets before joining. So... options are:
I would like to move towards something similar to datapackage-pipelines, so we could use their development resources for our benefit. However, of course not at any cost; it should still make sense.
@semio how would your proposal enable streaming?
I want each recipe to have only one main pipeline, and in one pipeline the procedures are processed one by one, with datasets as the input/output of procedures. So datasets will be passed from one procedure to another. Chef will always start from the main pipeline, and when it needs the output from another pipeline, it will open a new process and get the result from that pipeline. Sub-pipelines can also be set up like dependencies in datapackage-pipelines, and stored in separate recipe files. So the streaming in my proposal is based on datasets, unlike datapackage-pipelines, which can process resources row by row.
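The dataset-in/dataset-out chaining described above could be sketched roughly as follows. All names here (`run_pipeline`, the example procedures, the toy data) are hypothetical illustrations, not the actual chef API:

```python
# Minimal sketch of dataset-based pipeline chaining. A "dataset" is
# modeled here as a plain dict of named tables (lists of row dicts).

def filter_rows(dataset, keep_key):
    # A procedure: takes a whole dataset, returns a whole dataset.
    return {name: [row for row in rows if keep_key in row]
            for name, rows in dataset.items()}

def rename_table(dataset, old, new):
    # Another procedure: rename one table inside the dataset.
    out = dict(dataset)
    out[new] = out.pop(old)
    return out

def run_pipeline(dataset, steps):
    # Datasets are passed from one procedure to the next.
    for proc, kwargs in steps:
        dataset = proc(dataset, **kwargs)
    return dataset

result = run_pipeline(
    {"population": [{"geo": "swe", "pop": 9}, {"pop": 10}]},
    [(filter_rows, {"keep_key": "geo"}),
     (rename_table, {"old": "population", "new": "pop"})],
)
# result == {"pop": [{"geo": "swe", "pop": 9}]}
```

The whole dataset is the unit of streaming here; a sub-pipeline would simply be another `run_pipeline` call whose result is fed in as an input dataset.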
In reply to #59 (comment): OK, I see there are a few ways we can solve the scope problem. I think point 1 is most like datapackage-pipelines, where all pipelines are described in the same spec file. For point 2, I think the problem is that we can't manage the ontology dataset itself with a recipe, which means that when we want to add some new data (concepts/entities) from another dataset to the ontology dataset, we can't use a recipe, because we don't have an ontology dataset that contains all information.
Yes, that would be good, and I think it's possible to do that, because their input/output are also datasets with a datapackage.json. Options are:
@jheeffer I think point 3 is the better option for us, because we want to make procedures based on datasets, but datapackage-pipelines is based on rows. Also, datapackage-pipelines doesn't use the pydata packages (e.g. numpy/pandas etc.) at all, so it might add a lot of work for us to fit into their system. I will try to build a plugin for datapackage-pipelines, add some simple processors to it, and see if it works: https://github.com/semio/datapackage_pipelines_ddf

UPDATE: I dug into their source code and tried to create processors for DDF, but I don't think it will be good to make our recipe this way. The disadvantages are:
So my opinion is to continue with our chef module and try to learn the good things from them.
Alright! Thanks for the research @semio! So, if I understand correctly, you tried to see how our pipeline projects could merge, which in essence means applying the logic of Gapminder procedures to datapackage processors. The main problems are that:
So it seems we now choose to use more memory in exchange for faster processing? I suppose it's possible to implement many of our procedures, applied on a dataset scope, in a row-based manner. However, it would probably be a lot harder, since you can't use all the yummyness in pandas. I haven't come to a conclusion yet on a recipe format. In the meantime, maybe you can work on ddfSchema generation in Python?
Yes, you are correct about the problems. Indeed we can implement our procedures in a row-based way, and I guess we can improve the performance by using multithreading/multiprocessing.

On the debugging problem: see the implementation of join or concatenate, which I think would be easier to write if we stream based on tables. They have many lines of code, so when I tried to understand what was going on in the program, I wanted to stop the program at some point and see what's in the variables. That's why I want to do interactive debugging. But now I have to add many logging statements, so it's just like not being able to inspect elements in the browser, only print things to the console.

When we face memory limits, we can reconsider. OK, I will switch to ddfSchema for now. Let me know when you have new ideas or questions :) P.S. I also found
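The join case mentioned above illustrates the trade-off: with whole tables in memory a join is a single pass over fully-loaded data, while row streaming still forces you to buffer one side before the other can stream through. A stdlib-only sketch (helper names and toy data are made up for illustration):

```python
# Hypothetical sketch: joining two tables on a key, dataset-scope
# (everything in memory) vs. row-streaming (buffer one side first).

left = [{"geo": "swe", "pop": 9}, {"geo": "nor", "pop": 5}]
right = [{"geo": "swe", "gdp": 500}]

def join_tables(left, right, key):
    # Dataset scope: both tables fully loaded, one indexed pass.
    index = {row[key]: row for row in right}
    return [dict(l, **index[l[key]]) for l in left if l[key] in index]

def join_streaming(left_stream, right_stream, key):
    # Row streaming: the right side must still be fully buffered
    # before the left side can stream through it.
    index = {}
    for row in right_stream:          # buffering step
        index[row[key]] = row
    for l in left_stream:             # streaming step
        if l[key] in index:
            yield dict(l, **index[l[key]])

assert join_tables(left, right, "geo") == \
    list(join_streaming(iter(left), iter(right), "geo"))
```

Both produce the same merged rows; the streaming version only saves memory on the left side, which is why a join is a worst case for pure row-by-row processing.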
goal for new recipe format:
proposal
on `ingredients`: we don't need key/value any more, so we just drop them and keep the id/dataset/row_filter
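Under this proposal an ingredient entry might look like the following (illustrative YAML only; the dataset name and filter values are made up):

```yaml
ingredients:
  - id: gm-pop
    dataset: gapminder_population   # hypothetical dataset name
    row_filter:
      geo: [swe, nor]               # hypothetical filter
```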
on `cooking`: taking Gapminder population as an example, the pipeline process is as in this image.

We can see that the gm-pop and un-pop pipelines are independent, and the results of these pipelines are used in the main pipeline. So I suggest we make these pipelines objects in the `cooking` section. In each pipeline the data will be streamed from one procedure to another, and the final result will be available as an ingredient dataset with the same name as the pipeline. If we want to reference the pipeline itself, use the `this` ingredient. Furthermore, we can point a pipeline object to a recipe file, which means running that recipe file to get the pipeline result.
on `serve`: the result of the last procedure of main_pipeline should be the final dataset, and we can use the `serve` procedure to set options for the output format. But `serve` should be optional.

Example (note: the usage of the procedures is not fully discussed yet):
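Since the example is not written out above, here is a hypothetical sketch of how the pieces could fit together; all procedure names, options, and dataset names are illustrative only and not a settled format:

```yaml
ingredients:
  - id: gm-pop
    dataset: gapminder_population   # hypothetical dataset names
  - id: un-pop
    dataset: un_population

cooking:
  gm-pop-cleaned:                   # independent pipeline; its result
    - procedure: filter_row         # becomes an ingredient with this name
      ingredients: [gm-pop]
  main_pipeline:
    - procedure: merge
      ingredients: [gm-pop-cleaned, un-pop]
    - procedure: translate_header
      ingredients: [this]           # `this` references the pipeline itself

serve:                              # optional
  options:
    digits: 2
```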