Data is replicated when overwriting data concurrently #1254
Comments
Thanks for reporting this. We should indeed test this better. It's possible that #632 will fix this. I will try that out soon.
On first glance I would think @wjones127 is correct. Since we are not doing conflict resolution, we can delete the same file twice. In the above scenario I would assume that the table is at version 3 and contains a duplicate remove action for the initial file, as well as both new added files. @sfilimonov-exos, can you confirm if this is what we are seeing, or share the delta log?
Removed the sequential overwrites from the repro code to shrink the log and got:
00000000000000000000.json
00000000000000000001.json
00000000000000000002.json
00000000000000000003.json
At this point the table already has duplicated data.
Yeah, I think Will was correct, this is a commit issue. Ops 1 and 2 both read version 0 of the table; you can see that both contain identical remove actions. Op 3 must have read version 1 of the table, where one commit had already happened, thus it only removes the file from op 1, but not the file from op 2. So we are left with two active files, from ops 2 and 3. As Will pointed out, we are about to merge a PR that will enable us to support the optimistic commit protocol and resolve situations like this.
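The replay described above can be sketched in plain Python. This is a toy model of Delta log replay, not delta-rs internals; the file names are hypothetical stand-ins for the parquet files in the log.

```python
# Toy model: replay add/remove actions in commit order and return the
# set of files still active in the table.
def replay(commits):
    active = set()
    for actions in commits:
        for kind, path in actions:
            if kind == "add":
                active.add(path)
            elif kind == "remove":
                active.discard(path)  # removing an already-removed file is a no-op
    return active

log = [
    [("add", "part-0.parquet")],                                # version 0: initial write
    [("remove", "part-0.parquet"), ("add", "part-1.parquet")],  # op 1: read version 0
    [("remove", "part-0.parquet"), ("add", "part-2.parquet")],  # op 2: also read version 0
    [("remove", "part-1.parquet"), ("add", "part-3.parquet")],  # op 3: read version 1
]

# Op 3 never saw op 2's file, so it cannot remove it: two "overwrite"
# results survive side by side.
print(sorted(replay(log)))  # → ['part-2.parquet', 'part-3.parquet']
```

The duplicate remove of `part-0.parquet` is harmless on its own; the damage is that op 2's added file is never removed by any later overwrite.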
#632 will not fix this for Python, but in a follow-up PR we will get the Python bindings using the new conflict checker. |
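To illustrate what a conflict checker buys us, here is a minimal sketch of an optimistic commit. The class and check below are assumptions for illustration, not the actual delta-rs conflict checker: a commit records the version its transaction read, and is rejected if a later commit already removed a file it also wants to remove.

```python
# Hedged sketch of optimistic concurrency control for a Delta-style log.
class CommitConflict(Exception):
    pass

class Table:
    def __init__(self):
        self.commits = []  # list of (removed_paths, added_paths)
        self.active = set()

    @property
    def version(self):
        return len(self.commits) - 1

    def commit(self, read_version, removes, adds):
        # Conflict check: if any commit after read_version removed a file
        # this transaction also removes, abort instead of double-removing.
        for removed, _ in self.commits[read_version + 1:]:
            if removed & removes:
                raise CommitConflict("concurrent overwrite detected; retry")
        self.commits.append((removes, adds))
        self.active = (self.active - removes) | adds

t = Table()
t.commit(-1, set(), {"part-0.parquet"})              # version 0: initial write
v = t.version
t.commit(v, {"part-0.parquet"}, {"part-1.parquet"})  # op 1 succeeds
try:
    # Op 2 also read version 0, so its commit is rejected.
    t.commit(v, {"part-0.parquet"}, {"part-2.parquet"})
except CommitConflict:
    print("op 2 must re-read the table and retry")
```

With the check in place, the second writer re-reads the table and its retried overwrite removes op 1's file, so no duplicate survives.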
# Description

Switches the Python implementation over to the new transaction API.

# Related Issue(s)

* Closes #1254
Environment
Delta-rs version: 0.8.1
Environment:
Bug
What happened: Data is duplicated in the Delta table when writing to it with `write_deltalake(..., mode="overwrite")`.
What you expected to happen: Data is not duplicated; there are no extra copies of the rows.
How to reproduce it:
More details:
The same happens when writing to S3 with locking enabled through DynamoDB (`AWS_S3_LOCKING_PROVIDER = "dynamodb"` in the environment variables).