PostgresOnlineStore: Improve materialization #4309
Comments
Performance comparison sounds good. Both use the COPY command under the hood, so I wouldn't expect too much of a difference, but stranger things have happened :).
Thanks for your input! We will probably give it a spin sometime soon, so let's see what the results are :)
So unfortunately, this is indeed what happens (at least with my dataset and Postgres database): the chunks are oftentimes a lot smaller than 10,000. So how best to proceed?
Does anyone have any suggestions? (P.S. I also had a look at the
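For reference, a minimal pyarrow sketch (not the actual Feast code path) of why the chunks can come out smaller than the requested maximum: `Table.to_batches(max_chunksize=...)` only splits existing chunks, it never coalesces them, so a table built from many small record batches keeps yielding small batches. The 3,000-row chunking below is made up purely for illustration.

```python
import pyarrow as pa

# A table assembled from many small record batches, e.g. the result of
# reading several small Parquet row groups or concatenating tables.
batches = [
    pa.RecordBatch.from_pydict({"driver_id": list(range(i, i + 3_000))})
    for i in range(0, 30_000, 3_000)
]
table = pa.Table.from_batches(batches)

# Asking for chunks of up to 10,000 rows does not merge the underlying
# chunks: every batch printed here still has 3,000 rows.
for batch in table.to_batches(max_chunksize=10_000):
    print(batch.num_rows)
```

Calling `table.combine_chunks()` before `to_batches()` would consolidate the chunks (at the cost of an extra copy), which is one way to get batches closer to the requested size.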
There is one small improvement possible, and that is getting rid of the batching within
@TomSteenbergen Can you share some numbers as well? How much of an improvement are we talking about if we do away with batching in the local materialization engine?
@tokoko Testing using a feature view with 100,000 entities and 6 features (so 600,000 table rows):
@TomSteenbergen Thanks! Are the last numbers with the materialization engine batching left intact or disabled? Could you also set DEFAULT_BATCH_SIZE to a very large number and run the workload like that?
Intact, I did not alter
I can, but changing that number will affect materialization for all online stores. Note that many of the batches returned by
Just ran the materialization with the same feature set by changing
Yeah, I got that. It might be worth it to mess with other online stores if it proves to be important for performance here (or we can make the materialization engine ask the online store for the value instead of using a default one).
I'm thinking maybe that's the case because pyarrow is following some heuristics and keeping those numbers below the configured maximum. Leaving all that aside, this seems to be much more of an upsert rather than an insert problem. We might have more luck optionally switching to a "batch upsert" mode when the dataset to materialize is large enough (write to a staging table, lock the table, run update and insert statements in a transaction). This means quite a few design changes in both materialization engines and online store interfaces, though. We might also need to distinguish between full table materializations and small inserts from
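For concreteness, a rough sketch of that batch-upsert flow with psycopg 3. The table name, column names, and lock mode below are hypothetical placeholders, not Feast's actual online-store schema.

```python
import psycopg

# Hypothetical names, for illustration only.
MAIN = "online_store_fv"
STAGING = "online_store_fv_staging"


def batch_upsert(conninfo: str, rows) -> None:
    """Bulk-load rows into a staging table, then merge them into the main
    table (update existing keys, insert new ones) in a single transaction."""
    with psycopg.connect(conninfo) as conn, conn.cursor() as cur:
        cur.execute(f"CREATE TEMP TABLE {STAGING} (LIKE {MAIN})")

        # Bulk-load the freshly materialized rows via COPY.
        with cur.copy(
            f"COPY {STAGING} (entity_key, feature_name, value, event_ts) FROM STDIN"
        ) as copy:
            for row in rows:
                copy.write_row(row)

        # Lock the main table so the UPDATE and INSERT below see a consistent view.
        cur.execute(f"LOCK TABLE {MAIN} IN SHARE ROW EXCLUSIVE MODE")
        cur.execute(f"""
            UPDATE {MAIN} m
            SET value = s.value, event_ts = s.event_ts
            FROM {STAGING} s
            WHERE m.entity_key = s.entity_key AND m.feature_name = s.feature_name
        """)
        cur.execute(f"""
            INSERT INTO {MAIN} (entity_key, feature_name, value, event_ts)
            SELECT s.entity_key, s.feature_name, s.value, s.event_ts
            FROM {STAGING} s
            LEFT JOIN {MAIN} m
              ON m.entity_key = s.entity_key AND m.feature_name = s.feature_name
            WHERE m.entity_key IS NULL
        """)
    # The connection context manager commits the transaction on a clean exit.
```

If the main table has a unique index on the key columns, the UPDATE + INSERT pair could also be collapsed into a single INSERT ... SELECT ... ON CONFLICT ... DO UPDATE statement.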
@TomSteenbergen Another request, if it's not too much trouble. Can you get rid of
With
Agreed. Inserting data into a staging table with a
Perhaps we should, in addition to
(This might be a nice method to have in any case, e.g. when an earlier materialization run materialized some erroneous data to the online store.)
Not necessarily, instead of
Maybe, but if we solve the materialize problem above, this could just be (
@tokoko That's an option indeed, but why would that be a lot faster? A quick Google search seems to indicate that
Fair point, indeed not the easiest route. As a short-term improvement, I can get rid of the batching in
WDYT @tokoko?
Hmm, after a Google search of my own, sure, I may be wrong. I still feel like there should be a more performant workaround, though: something like doing a merge with a simple join and writing the results out to a (second 😄) staging table, then truncating the existing table and moving the data from staging to main. idk, maybe, maybe not.
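If I read that right, the join-merge variant could look roughly like the sketch below (psycopg 3, with hypothetical table names: `new_data` holds the freshly materialized rows, `online_store_fv` is the existing online-store table).

```python
import psycopg

with psycopg.connect("postgresql://localhost/feast") as conn, conn.cursor() as cur:
    # Merge via a simple anti-join: every new row, plus every existing row
    # that has no replacement in new_data, goes into a second staging table.
    cur.execute("""
        CREATE TEMP TABLE merged AS
        SELECT n.entity_key, n.feature_name, n.value, n.event_ts
        FROM new_data n
        UNION ALL
        SELECT m.entity_key, m.feature_name, m.value, m.event_ts
        FROM online_store_fv m
        LEFT JOIN new_data n
          ON n.entity_key = m.entity_key AND n.feature_name = m.feature_name
        WHERE n.entity_key IS NULL
    """)
    # Truncate the main table and move the merged data back in.
    cur.execute("TRUNCATE online_store_fv")
    cur.execute("""
        INSERT INTO online_store_fv (entity_key, feature_name, value, event_ts)
        SELECT entity_key, feature_name, value, event_ts FROM merged
    """)
# The connection context manager commits the whole transaction on a clean exit.
```

One caveat: TRUNCATE takes an ACCESS EXCLUSIVE lock, so readers block until the transaction commits, which may or may not be acceptable for an online store.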
Sure, that certainly can't hurt.
@tokoko Created a PR here: #4331. Now that we are using
Is your feature request related to a problem? Please describe.
The current implementation of `PostgreSQLOnlineStoreConfig.online_write_batch()` is quite slow.

Describe the solution you'd like
One of the suggested options discussed in this issue is using bulk writes with adbc.
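A minimal sketch of what the ADBC route could look like, using the adbc-driver-postgresql DBAPI bindings; the table name, column names, and sample data are hypothetical, not Feast's actual online-store schema.

```python
from datetime import datetime, timezone

import pyarrow as pa
import adbc_driver_postgresql.dbapi as pg_dbapi

# Hypothetical sample data and target table, just to show the API shape.
data = pa.table({
    "driver_id": pa.array([1001, 1002, 1003], type=pa.int64()),
    "conv_rate": pa.array([0.45, 0.61, 0.73], type=pa.float64()),
    "event_ts": pa.array(
        [datetime.now(timezone.utc)] * 3, type=pa.timestamp("us", tz="UTC")
    ),
})

with pg_dbapi.connect("postgresql://localhost/feast") as conn:
    with conn.cursor() as cur:
        # adbc_ingest() streams the Arrow table into Postgres in bulk;
        # mode="append" expects the target table to already exist.
        cur.adbc_ingest("driver_hourly_stats_online", data, mode="append")
    conn.commit()
```

Like COPY, this is an insert-only path, so handling keys that already exist in the online store would still need one of the upsert strategies discussed in this thread.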
Describe alternatives you've considered
Another option could be to use the `COPY FROM` command. This performance benchmark shows that the number of rows written per second is significantly higher compared to the batch insert we are doing now.

This image shows the comparison from the performance benchmark:

Additional context
I think we should benchmark both solutions and pick the one that achieves the best performance.