Discussion: Pushing batches of data to online store: Should conn.commit() happen in the for loop or after? #4036
Comments
@job-almekinders thanks for raising this. Using multiple commits (while not ideal) is not the end of the world here. Note that depending on which materialisation engine you're using, you might still end up with multiple commits even if we pull the commit out of the loop (if using Spark or Bytewax). Having said that, it seems to be impossible to make that change now anyway. A batch size of 5000 is a really small number for even the most common use cases. I have never used this in practice, but I suspect that for feature views numbering several million entities and tens of features, the whole process probably takes ages. Leaving the connection uncommitted for so long is not ideal either. We could make the batch size configurable first; increasing it is also not too risky, if I'm not mistaken (?)
Thank you for your reply and the additional information, this helps a lot :) I think in the short term we could indeed make the batch_size configurable, to give a bit more control! And maybe also add some documentation in the docstring and/or a log statement about the risks of setting the batch size to a very large number. The bulk writes from ADBC look very promising indeed! It would be very interesting to try this out.
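The configurable batch_size plus a warning log suggested above could be sketched roughly as follows. This is illustrative only, not Feast's actual API: the `iter_batches` helper and the warning threshold are made up for the example.

```python
import itertools
import logging

logger = logging.getLogger(__name__)


def iter_batches(rows, batch_size=5000):
    """Yield successive lists of at most batch_size rows.

    batch_size is the hypothetical configurable knob: larger values mean
    fewer commits, but each transaction stays open longer, so we warn
    when the caller picks a very large value (threshold is arbitrary).
    """
    if batch_size > 100_000:
        logger.warning(
            "batch_size=%d keeps the transaction open for a long time; "
            "consider a smaller value.",
            batch_size,
        )
    it = iter(rows)
    # islice consumes the iterator lazily, so rows never need to fit in
    # memory all at once beyond one batch.
    while batch := list(itertools.islice(it, batch_size)):
        yield batch


batches = list(iter_batches(range(7), batch_size=3))
```

With 7 rows and a batch size of 3, this yields batches of 3, 3, and 1 rows, and each batch boundary is where a per-batch commit would happen.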
@job-almekinders - Would you like to create a PR for this change to make it configurable?
Hey @lokeshrangineni - I'm currently occupied with other work, unfortunately. My focus might shift back to this topic in some time, so I might open a PR then! Feel free to open one in the meantime yourself if you need this change earlier :)
This piece of code in the online_write_batch function in the Postgres online store pushes data in batches to the online store. I was wondering whether it makes more sense to put the conn.commit() inside the for loop, or after the for loop. I would love to hear the different trade-offs between the two options! Copy of the code snippet here: