
[Bug]: WriteToBigQuery doesn't do WRITE_TRUNCATE properly with identical table names but in different datasets #34247

Open
portikCoder opened this issue Mar 11, 2025 · 1 comment

Comments

@portikCoder

portikCoder commented Mar 11, 2025

What happened?

Describe the bug

Using the WriteToBigQuery transform for a batch load with the write disposition set to truncate does not work as intended: instead of truncating all target tables, it truncates only the first one.
This happens only when the table IDs are identical within a single batch job but the tables are located in different BQ datasets.

To Reproduce

Steps to reproduce the behavior:

  1. Prepare a JSON file with several records targeting identically named BQ tables located in different datasets
  2. Initiate a BQ load job through the WriteToBigQuery transform
  3. Set the write disposition to BigQueryDisposition.WRITE_TRUNCATE
  4. Run it several times
  5. Observe that only the first table is truncated correctly; none of the others are.

E.g.:

with Pipeline(options=pipeline_options) as pipeline:
    data = pipeline | "ReadAll" >> ReadFromText(user_options.source_path)
    data | "Load data into BQ" >> WriteToBigQuery(
        ..., write_disposition=BigQueryDisposition.WRITE_TRUNCATE)
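For step 1, the newline-delimited JSON input might look like the following (dataset, table, and field names are purely illustrative; each record carries the dataset it should land in, while the table name is the same in both):

```json
{"dataset": "dataset_a", "table": "events", "user_id": 1}
{"dataset": "dataset_b", "table": "events", "user_id": 2}
```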

Expected behavior

All of the identically named tables across the different datasets must be truncated properly.

Actual behavior

Only the first table is truncated (whatever "first" means in a heavily distributed system).

Environment (tested on)

  • Apache Beam version: 2.63.0
  • Runner: DirectRunner, DataflowRunner
  • OS: MacOS 15.3.1; build: 24D70
  • Python version: Python 3.11.9

Additional context

I already have a solution; I just need to add a test, if that is even possible. I haven't validated that yet.
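As context for why this class of bug can occur, here is a minimal, hypothetical illustration in plain Python (this is not the actual Beam internals): if destinations pending truncation were deduplicated by the bare table ID rather than the full project:dataset.table reference, the two destinations below would collide and one truncation would be silently dropped.

```python
# Hypothetical illustration of the reported symptom: two destinations that
# differ only in dataset ID collapse into one entry when keyed by table ID alone.
# All names below are made up for the example.
destinations = [
    "my-project:dataset_a.events",
    "my-project:dataset_b.events",
]

def table_id_only(spec):
    # Keep only the part after the last dot, i.e. the bare table name.
    return spec.rsplit(".", 1)[-1]

# Buggy keying: both destinations map to the same key "events",
# so only one of them would ever be scheduled for truncation.
buggy = {table_id_only(d): d for d in destinations}
print(len(buggy))   # prints 1 -> one truncation is lost

# Correct keying: use the full project:dataset.table reference.
fixed = {d: d for d in destinations}
print(len(fixed))   # prints 2 -> both tables are truncated
```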

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@portikCoder
Author

portikCoder commented Mar 11, 2025

.take-issue
.add-labels P2,dataflow,python

portikCoder added a commit to portikCoder/beam that referenced this issue Mar 11, 2025
…th WRITE_TRUNCATE write disposition (apache#34247)

* It doesn't take care of identical table-ids but from different dataset-id.
portikCoder added a commit to portikCoder/beam that referenced this issue Mar 11, 2025
…th WRITE_TRUNCATE write disposition (apache#34247)

* It only truncates the first table, but originally didn't take care of identical table-ids but from different dataset-id.