
[Python] pyarrow: using role_arn with S3FileSystem results in Anonymous user error #38421

Open
MisterT314 opened this issue Oct 23, 2023 · 5 comments


MisterT314 commented Oct 23, 2023

Hi, I'm initializing S3FileSystem with a role_arn, to get refreshable temporary credentials, as described in
https://arrow.apache.org/docs/python/generated/pyarrow.fs.S3FileSystem.html

The init call is the following:

    self._fs = S3FileSystem(
        region=_AWS_REGION,
        role_arn=self._role_arn,
        anonymous=False,
    )

A bit later I'm initializing a ParquetWriter with the file system:

    fs = _get_service().get_file_system()
    output_url = output_url.replace("s3://", "")
    schema = Schema.from_pandas(df=data)
    writer = ParquetWriter(where=output_url, schema=schema, filesystem=fs)

I then get the error:

OSError: When initiating multiple part upload for key 'resources/01HBTZN7DBR1BE62S5TJRQDZXP/data.parquet' in bucket 'my-bucket-name': AWS Error ACCESS_DENIED during CreateMultipartUpload operation: Anonymous users cannot initiate multipart uploads. Please authenticate.

I expect that S3FileSystem would use STS to get temporary credentials, as described in the documentation. The role is configured to allow the service (Glue) to assume it.

PyArrow version 12.0.0
s3fs 2023.6.0
Using PyArrow in an AWS Glue v4 environment

Component(s)

Python

@antonio-yuen-zocdoc

@MisterT314 have you been able to resolve this or work around it by any chance? I'm running into the same issue.

@kou kou changed the title pyarrow: using role_arn with S3FileSystem results in Anonymous user error [Python] pyarrow: using role_arn with S3FileSystem results in Anonymous user error Feb 27, 2024
@nph
Contributor

nph commented Feb 27, 2024

@antonio-yuen-zocdoc A potential workaround is to get the temporary AWS credentials using boto3 and then pass them directly to S3FileSystem, e.g.:

import boto3
from pyarrow.fs import S3FileSystem

# Assume the role via STS to obtain temporary credentials
session = boto3.session.Session()
sts = session.client('sts')
response = sts.assume_role(
    RoleArn=role_arn,
    RoleSessionName='my-role-session',
)

# Pass the temporary credentials to S3FileSystem directly
fs = S3FileSystem(
    access_key=response['Credentials']['AccessKeyId'],
    secret_key=response['Credentials']['SecretAccessKey'],
    session_token=response['Credentials']['SessionToken'],
)

I've used a similar approach to create an S3FileSystem using credentials for a specific AWS profile name (which currently isn't possible using S3FileSystem directly).

@kharigardner

Seeing this issue in the pandas.to_parquet method with SSO credentials, but not with env vars. Any workaround for that?

@lauragalera
Copy link

I am facing a similar (though not identical) error using pyarrow version 10.0.0 in a Glue job.

After calling S3FileSystem with a role_arn and using it to read a table, I get an access error:

df = pq.read_table('key', columns=["_id", "preferences"], schema=custom_schema, filesystem=s3).to_pandas()

OSError: When getting information for key 'X' in bucket 'X': AWS Error ACCESS_DENIED during HeadObject operation: No response body.

I thought it could be due to role policies, but the role has enough permissions on the S3 bucket and the trust relationship looks fine.

I have come up with a workaround using boto3.client(), but I'd like to know what is causing the access denied error.

@antonio-yuen-zocdoc

> Seeing this issue in pandas.to_parquet method, with SSO credentials, but not env vars. Any workaround for that?

I got rid of pandas on my end and replaced it with Polars, but I believe I was able to solve this issue using this doc:
https://pandas.pydata.org/docs/user_guide/io.html#reading-writing-remote-files

I believe I set this option on the to_parquet method: storage_options={"s3": {"anon": False}}
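For what it's worth, the pandas docs describe storage_options as being forwarded to fsspec/s3fs, so I believe the key would be anon at the top level rather than nested under "s3". A hedged sketch (the bucket URL is a placeholder):

```python
import pandas as pd


def write_parquet_authenticated(df: pd.DataFrame, url: str) -> None:
    """Write a DataFrame to S3 as Parquet, forcing authenticated access."""
    # storage_options is forwarded to fsspec/s3fs; anon=False forces a
    # real credential lookup instead of anonymous access.
    df.to_parquet(url, storage_options={"anon": False})
```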
