[Bug]: javaJar provider does not work with --yaml_pipeline flag: TypeError: a bytes-like object is required, not 'str' #34343

Closed
jonathaningram opened this issue Mar 19, 2025 · 11 comments · Fixed by #34351
@jonathaningram commented Mar 19, 2025

What happened?

Beam version: at least v2.63.0.

The --yaml_pipeline flag contains a string-like version of the pipeline. The --yaml_pipeline_file flag contains a path to the file.

We can successfully use the --yaml_pipeline_file flag locally to run our YAML pipeline. As soon as we switch to --yaml_pipeline, it fails with an error. We tried both --yaml-pipeline and --yaml-pipeline-file flags from gcloud dataflow yaml run, and both seem to have the same issue.

Note: We haven't been able to run any YAML pipeline with a Java provider successfully in Dataflow, so we're interested in the possibility of a patch being applied to Dataflow; alternatively, a workaround would be great.

Stack trace
<snip>
INFO:apache_beam.yaml.yaml_transform:Expanding "Create" at line 4
INFO:apache_beam.yaml.yaml_transform:Expanding "Identity" at line 18
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_transform.py", line 371, in create_ptransform
    ptransform = provider.create_transform(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_provider.py", line 192, in create_transform
    self._service = self._service()
                    ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_provider.py", line 328, in <lambda>
    urns, lambda: external.JavaJarExpansionService(jar_provider()))
                                                   ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_provider.py", line 260, in <lambda>
    urns, lambda: _join_url_or_filepath(provider_base_path, jar))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_provider.py", line 1282, in _join_url_or_filepath
    path_scheme = urllib.parse.urlparse(path, base_scheme).scheme
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/urllib/parse.py", line 395, in urlparse
    splitresult = urlsplit(url, scheme, allow_fragments)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/urllib/parse.py", line 478, in urlsplit
    scheme = scheme.strip(_WHATWG_C0_CONTROL_OR_SPACE)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: a bytes-like object is required, not 'str'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/main.py", line 154, in <module>
    run()
  File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/main.py", line 143, in run
    yaml_transform.expand_pipeline(
  File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_transform.py", line 1077, in expand_pipeline
    providers or {})).expand(beam.pvalue.PBegin(pipeline))
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_transform.py", line 1042, in expand
    result = expand_transform(
             ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_transform.py", line 442, in expand_transform
    return expand_composite_transform(spec, scope)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_transform.py", line 520, in expand_composite_transform
    return CompositePTransform.expand(None)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_transform.py", line 508, in expand
    inner_scope.compute_all()
  File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_transform.py", line 196, in compute_all
    self.compute_outputs(transform_id)
  File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_transform.py", line 97, in wrapper
    self._cache[key] = func(self, *args)
                       ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_transform.py", line 232, in compute_outputs
    return expand_transform(self._transforms_by_uuid[transform_id], self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_transform.py", line 444, in expand_transform
    return expand_leaf_transform(spec, scope)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_transform.py", line 466, in expand_leaf_transform
    ptransform = scope.create_ptransform(spec, inputs_dict.values())
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_transform.py", line 413, in create_ptransform
    raise ValueError(
ValueError: Invalid transform specification at "Identity" at line 18: a bytes-like object is required, not 'str'
Building pipeline...

I've made a repro here: https://github.com/jonathaningram/beam-starter-java-provider-repro which contains much of the same info as I've put in this ticket.

The issue seems to be an encoding one.
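
A minimal sketch of the failure mode (my own reduction, not Beam code): when the provider base path arrives as bytes, `urlparse` hands back a bytes `.scheme`. An empty bytes scheme (`b''`) then slips past urllib's "Cannot mix str and non-str arguments" check, because empty arguments are exempted from that check, and the error only surfaces deep inside `urlsplit` (observed on current Python 3.11 builds):

```python
import urllib.parse

# A bytes base path yields bytes components; here the scheme comes back as b''.
base_scheme = urllib.parse.urlparse(b'/template/staging', '').scheme
assert base_scheme == b''

# Passing that empty bytes scheme alongside a str path skips urllib's
# str/bytes mixing check (empty arguments are not type-checked), so the
# failure surfaces inside urlsplit as the TypeError shown in the trace.
error = None
try:
    urllib.parse.urlparse('example.jar', base_scheme)
except TypeError as e:
    error = e
print(error)
```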

Below is a possible patch that works locally. I haven't verified how suitable the fix is, so I've not proposed a PR.

Inside the beam repo:

➜  beam git:(v2.63.0) ✗ gb
* (HEAD detached at sdks/v2.63.0)
  master
➜  beam git:(v2.63.0) ✗ gd
diff --git a/sdks/python/apache_beam/yaml/yaml_provider.py b/sdks/python/apache_beam/yaml/yaml_provider.py
index aa3c5d90515..f9d1bcf914c 100755
--- a/sdks/python/apache_beam/yaml/yaml_provider.py
+++ b/sdks/python/apache_beam/yaml/yaml_provider.py
@@ -1279,7 +1279,7 @@ def _as_list(func):

 def _join_url_or_filepath(base, path):
   base_scheme = urllib.parse.urlparse(base, '').scheme
-  path_scheme = urllib.parse.urlparse(path, base_scheme).scheme
+  path_scheme = urllib.parse.urlparse(path.encode(), base_scheme).scheme
   if path_scheme != base_scheme:
     return path
   elif base_scheme and base_scheme in urllib.parse.uses_relative:
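
For comparison, here is a standalone sketch of an alternative guard: decode bytes inputs up front so `urlparse` never sees a str/bytes mix at all. This is a hypothetical helper modelled on the shape of `_join_url_or_filepath`, not the fix that was eventually merged:

```python
import os
import urllib.parse


def join_url_or_filepath(base, path):
    """Hypothetical variant of _join_url_or_filepath that decodes bytes
    inputs before any URL parsing (a sketch, not Beam's merged fix)."""
    if isinstance(base, bytes):
        base = base.decode('utf-8')
    if isinstance(path, bytes):
        path = path.decode('utf-8')
    base_scheme = urllib.parse.urlparse(base, '').scheme
    path_scheme = urllib.parse.urlparse(path, base_scheme).scheme
    if path_scheme != base_scheme:
        return path
    if base_scheme and base_scheme in urllib.parse.uses_relative:
        return urllib.parse.urljoin(base, path)
    # Scheme-less or non-relative-scheme paths: plain join against the
    # base's directory (an assumption for this sketch).
    return os.path.join(os.path.dirname(base), path)


# A bytes base no longer raises the TypeError:
print(join_url_or_filepath(b'/template/pipeline.yaml', 'jars/provider.jar'))
# /template/jars/provider.jar
```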

You can mount the Beam source code into the container in my repro and observe that it now works:

docker run -v "$(pwd):/app" \
    -v "$BEAM_PYTHON_SRC:/usr/local/lib/python3.11/site-packages/apache_beam/yaml" \
    -v ~/.config/gcloud:/root/.config/gcloud \
    -w /app \
    --entrypoint /bin/bash beam_python3.11_sdk_with_java:2.63.0 \
    -c "python -m apache_beam.yaml.main --yaml_pipeline='$(yq -o=json '.' "$PIPELINE_FILE")' --runner=DataflowRunner"

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@chamikaramj (Contributor)

@robertwb or @derrickaw, can you check please?

@chamikaramj (Contributor)

BTW @jonathaningram, is it possible for you to also try running from a machine where Beam is installed directly in a virtual environment [1][2] instead of running from a Docker container? As mentioned elsewhere, the published "beam_python3.11_sdk" container [3] is intended to be our worker container, used internally by Beam runners. Submitting jobs from that container is not something we test/support currently.

[1] https://beam.apache.org/documentation/sdks/yaml/#prerequisites
[2] https://beam.apache.org/get-started/quickstart/python/#create-and-activate-a-virtual-environment
[3] https://hub.docker.com/r/apache/beam_python3.11_sdk

@jonathaningram (Author)

@chamikaramj yep, I can look at that. Is that mostly about doing my local setup “the right way” for future issues/support? Or is there info you're hoping to gain for this ticket that I can provide after doing that?

@chamikaramj (Contributor)

I think it should be useful for this ticket. It could be that running from SDK harness containers is just broken in a strange way since that's something we don't test/support officially.

@chamikaramj (Contributor)

BTW, @robertwb submitted #34304 to provide a better error when providers are unavailable. So hopefully you'll see a better error starting with Beam 2.64.0.

@robertwb (Contributor) commented Mar 19, 2025

The containers such as beam_python3.11_sdk_with_java:2.63.0 are not meant for constructing Beam pipelines; we build them to provide to workers for executing the pipeline as a distributed system (I filed #34350). Granted, they install a lot of the same bits, but they're certainly not tested as a full working dev environment.

@robertwb (Contributor)

That being said, this does look like a bug and a PR with your suggested patch would be appreciated. (Still looking into why we get a bytes object here to begin with.)

@robertwb (Contributor)

Thanks for catching this! Alternative fix at #34351 (turns out the bytes object was coming from urllib.parse.urlparse). We'll try to get this in the next release.
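
For context on that last point, `urllib.parse.urlparse` mirrors the type of its input, so a bytes path in means bytes components out, including `.scheme`:

```python
import urllib.parse

# str in, str components out
assert urllib.parse.urlparse('gs://bucket/pipeline.yaml').scheme == 'gs'

# bytes in, bytes components out: this is how a bytes scheme can leak into
# a later urlparse call as its default-scheme argument.
assert urllib.parse.urlparse(b'gs://bucket/pipeline.yaml').scheme == b'gs'
```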

@jonathaningram (Author)

@robertwb awesome, thank you. I'm glad you did the fix; it would have taken me some time to work out whether it was the right one. Do you know what the release timing looks like for this landing in GCP? Just want to set my own expectations for when I can try again.

@robertwb (Contributor)

Yeah, that behavior is pretty surprising.

The timing is actually pretty good: the Beam release will be cut shortly and should be out within a few weeks. GCP should pick things up shortly after that.

Until then, you can use your patch for this.

@jonathaningram (Author)

@robertwb thanks. Excuse my ignorance, but how do I apply this patch when running a job in GCP Dataflow?
