Add Parquet Modular decryption (read) support + encryption flag #6637

Open
wants to merge 84 commits into
base: main
from


rok
Member

@rok rok commented Oct 28, 2024

Which issue does this PR close?

This PR is based on branch and an internal patch and aims to provide basic modular decryption support. Partially closes #3511. We decided to split encryption work into a separate PR.

Rationale for this change

See #3511.

What changes are included in this PR?

This introduces AesGcmV1 cipher decryption to ArrowReaderMetadata and ParquetRecordBatchReader. The introduced classes and functions are tested on sample files from parquet-dataset.

Are there any user-facing changes?

Several new classes and method parameters are introduced. If the project is compiled without the encryption flag, the changes are not breaking. If the encryption flag is on, some methods and constructors (e.g. ParquetMetaData::new) will require new parameters, which would be a breaking change.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Oct 28, 2024
@rok
Member Author

rok commented Oct 28, 2024

Currently this is a rough rebase of work done by @ggershinsky. As ParquetMetaDataReader is now available some refactoring will be required.

@etseidl
Contributor

etseidl commented Oct 28, 2024

As ParquetMetaDataReader is now available some refactoring will be required.

@rok let me know if you want any help shoehorning this into ParquetMetaDataReader.

@brainslush

Is there any help, input or contribution needed here?

@rok
Member Author

rok commented Nov 21, 2024

Thanks for the offer @etseidl & @brainslush! I'm making some progress and would definitely appreciate a review! I'll ping once I push.

@rok rok force-pushed the decryption-basics-fork branch 2 times, most recently from fe488b3 to d263510 Compare November 23, 2024 23:06
@rok
Member Author

rok commented Dec 4, 2024

As ParquetMetaDataReader is now available some refactoring will be required.

@rok let me know if you want any help shoehorning this into ParquetMetaDataReader.

@etseidl could you please do a quick pass and say whether this makes sense with respect to ParquetMetaDataReader?
I'll continue with data decryption.

Contributor

@etseidl etseidl left a comment

Only looking at the metadata bits for now...looks good to me so far. Just a few minor nits. Thanks @rok!

@rok rok force-pushed the decryption-basics-fork branch from f90d8b4 to 29d55eb Compare December 16, 2024 23:51
@adamreeve adamreeve force-pushed the decryption-basics-fork branch from 27e77ad to 7db06cc Compare December 20, 2024 02:15
@rok rok force-pushed the decryption-basics-fork branch from a4105d5 to 3e7646d Compare January 9, 2025 21:59
@rok rok force-pushed the decryption-basics-fork branch 4 times, most recently from deedba9 to 951f2fa Compare January 21, 2025 20:35
@rok rok changed the title from "Parquet Modular Encryption support" to "Parquet Modular decryption support" Jan 21, 2025
@rok rok force-pushed the decryption-basics-fork branch 2 times, most recently from f6b9e88 to 23375d1 Compare January 23, 2025 18:17
@adamreeve adamreeve force-pushed the decryption-basics-fork branch 3 times, most recently from 7f94e39 to 177d826 Compare January 24, 2025 02:46
@rok rok force-pushed the decryption-basics-fork branch 2 times, most recently from ac4ac21 to 3241425 Compare February 5, 2025 10:01
@rok rok force-pushed the decryption-basics-fork branch from 9e5156f to a3708b2 Compare March 7, 2025 23:46
@rok rok force-pushed the decryption-basics-fork branch 3 times, most recently from cca1d57 to 27a1071 Compare March 8, 2025 01:52
@rok rok force-pushed the decryption-basics-fork branch 3 times, most recently from 8d5bc46 to 7a6ec1f Compare March 8, 2025 03:16
Co-authored-by: Corwin Joy <corwin.joy@gmail.com>
Co-authored-by: Adam Reeve <adreeve@gmail.com>
@rok rok force-pushed the decryption-basics-fork branch from 7a6ec1f to 276fc1a Compare March 8, 2025 03:28
@rok
Member Author

rok commented Mar 8, 2025

Thanks for reviews so far! I've tried to address all comments (especially regarding the API changes) and I think this is now ready for another round of review.

@rok rok requested review from alamb, etseidl and adamreeve March 8, 2025 03:30
@alamb
Contributor

alamb commented Mar 8, 2025

I plan to give this a good review Monday

Contributor

@adamreeve adamreeve left a comment

I've left some review comments, thanks Rok. This is looking pretty good to me, but my main concern is the handling of encrypted files when decryption properties aren't specified. After some of the recent changes it looks like this will no longer raise a helpful error, and will instead go down the code paths that don't support encryption even if the encryption feature is enabled.

I think it would also be good to reduce the number of public structs here so it's safer to change things in future without worrying too much about breaking things. I haven't left detailed comments on that, but it seems like users would need to create FileDecryptionProperties but shouldn't need public access to a lot of the other structs in the encryption module. E.g. CryptoContext and FileDecryptor seem like implementation details that could be private? I could be wrong, though, if these are needed as parameters for other public methods.


#[cfg(feature = "encryption")]
{
let ret = Ok(ret.unwrap().with_crypto_context(crypto_context));
Contributor

Unwrap here could panic. This section including the #[cfg(not(feature = "encryption"))] part below could be simplified to something like:

        #[cfg(feature = "encryption")]
        let ret = ret.map(|reader| reader.with_crypto_context(crypto_context));

        Some(ret.map(|x| Box::new(x) as _))
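To make the difference concrete, here is a minimal standalone sketch of the `map` approach. `PageReader` and `CryptoContext` below are hypothetical stand-ins, not the crate's real types, and the error type is simplified to `String`.

```rust
// Hypothetical stand-ins for the reader and crypto context discussed above;
// the real types in the parquet crate differ.
#[derive(Debug, PartialEq)]
struct CryptoContext {
    row_group_ordinal: usize,
}

#[derive(Debug, PartialEq)]
struct PageReader {
    crypto_context: Option<CryptoContext>,
}

impl PageReader {
    // Builder-style setter that consumes and returns the reader.
    fn with_crypto_context(mut self, ctx: CryptoContext) -> Self {
        self.crypto_context = Some(ctx);
        self
    }
}

// `map` only touches the `Ok` value, so an `Err` passes through unchanged
// instead of panicking the way `ret.unwrap()` would.
fn attach_context(
    ret: Result<PageReader, String>,
    ctx: CryptoContext,
) -> Result<PageReader, String> {
    ret.map(|reader| reader.with_crypto_context(ctx))
}
```

With this shape, an I/O error from building the reader is returned to the caller as-is, and the crypto context is only attached on success.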

@@ -1167,6 +1297,8 @@ mod tests {
data: data.clone(),
metadata: metadata.clone(),
requests: Default::default(),
#[cfg(feature = "encryption")]
file_decryption_properties: None,
Contributor

This could be tidied up by using a similar approach to what we've done elsewhere and adding a TestReader::new method and a TestReader::with_file_decryption_properties method.
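A rough sketch of what that constructor pair could look like. The field types here are simplified assumptions (the real `TestReader` holds different types), and the `#[cfg(feature = "encryption")]` gating on the field and setter is left out so the snippet compiles on its own.

```rust
use std::ops::Range;
use std::sync::Arc;

// Hypothetical, simplified placeholder for the real properties type.
#[derive(Debug, Clone, PartialEq)]
struct FileDecryptionProperties {
    footer_key: Vec<u8>,
}

// Simplified test fixture; field types are assumptions for illustration.
#[derive(Debug, Clone)]
struct TestReader {
    data: Arc<Vec<u8>>,
    requests: Vec<Range<usize>>,
    file_decryption_properties: Option<FileDecryptionProperties>,
}

impl TestReader {
    // Default constructor: tests that don't care about encryption never
    // have to mention the decryption field.
    fn new(data: Vec<u8>) -> Self {
        Self {
            data: Arc::new(data),
            requests: Vec::new(),
            file_decryption_properties: None,
        }
    }

    // Opt-in setter; in the real code this would sit behind the feature flag.
    fn with_file_decryption_properties(mut self, props: FileDecryptionProperties) -> Self {
        self.file_decryption_properties = Some(props);
        self
    }
}
```

The benefit is that struct literals in existing tests no longer need a cfg-gated `file_decryption_properties: None` line.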

.with_prefetch_hint(self.metadata_size_hint);
#[cfg(feature = "encryption")]
let metadata = metadata
.with_decryption_properties(self.file_decryption_properties.clone().as_ref());
Contributor

It looks like this with_decryption_properties call should be removed as this is now in the get_encrypted_metadata method.


pub mod ciphers;
pub mod decryption;
pub mod modules;
Contributor

I don't think the ciphers or modules modules should need to be public. Can these be made private? Or at least pub(crate).
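As an illustration of the visibility split, the sketch below keeps `ciphers` and `modules` crate-private while `decryption` stays public. The module contents (constants and helper functions) are hypothetical and only there to show that internal code can still reach the private modules.

```rust
// Hypothetical module layout, not the PR's actual file tree.
pub mod encryption {
    // Crate-private: code outside the crate cannot name these items.
    pub(crate) mod ciphers {
        pub(crate) const NONCE_LEN: usize = 12; // AES-GCM nonce size in bytes
        pub(crate) const TAG_LEN: usize = 16; // AES-GCM tag size in bytes
    }

    pub(crate) mod modules {
        // Module type byte for the footer (illustrative constant).
        pub(crate) const FOOTER: u8 = 0;
    }

    // The only module users need: it exposes the properties type.
    pub mod decryption {
        use super::{ciphers, modules};

        pub struct FileDecryptionProperties {
            footer_key: Vec<u8>,
        }

        impl FileDecryptionProperties {
            pub fn new(footer_key: Vec<u8>) -> Self {
                Self { footer_key }
            }

            pub fn footer_key(&self) -> &[u8] {
                &self.footer_key
            }
        }

        // Internal helpers can still use the private modules.
        pub(crate) fn footer_aad(file_aad: &[u8]) -> Vec<u8> {
            let mut aad = file_aad.to_vec();
            aad.push(modules::FOOTER);
            aad
        }

        pub(crate) fn min_ciphertext_len() -> usize {
            ciphers::NONCE_LEN + ciphers::TAG_LEN
        }
    }
}
```

Anything marked `pub(crate)` can later be renamed or restructured without it counting as a breaking API change.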

@@ -1788,6 +1859,151 @@ mod tests {
assert!(col.value(2).is_nan());
}

#[test]
Contributor

Can we also add a test for the encrypt_columns_and_footer_disable_aad_storage.parquet.encrypted file? It looks like that will need a change to get_file_decryptor so it uses the AAD from the file decryption properties rather than the one from the file.

And if an AAD prefix is specified in both the decryption properties and the file, the one in the decryption properties should be used.

A test that specifying the wrong AAD prefix results in an error when the AAD is stored in the file would also be useful.

This could possibly be a follow up change rather than needing to be done in this PR though.
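One possible reading of the precedence rules above, as a standalone sketch. `resolve_aad_prefix` is a hypothetical helper, not an existing function in the crate, and the explicit mismatch check is an assumption: a real implementation might instead let the wrong prefix surface as a decryption failure.

```rust
// Hypothetical helper: pick the AAD prefix to use for decryption.
// `from_properties` is the prefix supplied by the user, `stored_in_file`
// is the prefix read from the file metadata (if the writer stored it).
fn resolve_aad_prefix(
    from_properties: Option<&[u8]>,
    stored_in_file: Option<&[u8]>,
) -> Result<Option<Vec<u8>>, String> {
    match (from_properties, stored_in_file) {
        // Assumption: a user-supplied prefix that contradicts the stored one
        // is reported early as an error.
        (Some(p), Some(f)) if p != f => Err(
            "AAD prefix in decryption properties does not match the one stored in the file"
                .to_string(),
        ),
        // The decryption properties take precedence when both are present.
        (Some(p), _) => Ok(Some(p.to_vec())),
        // Fall back to the prefix stored in the file.
        (None, Some(f)) => Ok(Some(f.to_vec())),
        // No prefix anywhere: decrypt without one.
        (None, None) => Ok(None),
    }
}
```

The `(Some(p), _)` arm is the case the disable_aad_storage test file would exercise, since the file itself carries no prefix there.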

column_index: None,
offset_index: None,
}
}

#[allow(missing_docs)]
Contributor

Can this just have a doc comment added :)

@@ -372,6 +414,11 @@ impl ParquetMetaDataReader {
mut fetch: F,
file_size: usize,
) -> Result<()> {
#[cfg(feature = "encryption")]
Contributor

The encryption and not(encryption) blocks here are identical.

@@ -175,6 +227,15 @@ impl ArrowReaderMetadata {
) -> Result<Self> {
// TODO: this is all rather awkward. It would be nice if AsyncFileReader::get_metadata
// took an argument to fetch the page indexes.
#[cfg(feature = "encryption")]
let mut metadata = if options.file_decryption_properties.is_some() {
Contributor

@adamreeve adamreeve Mar 9, 2025

With these recent changes it doesn't look like we properly handle the case where a file is encrypted but no decryption properties are provided. Here we check that file decryption properties are set before calling get_encrypted_metadata, so the later check in ParquetMetaDataReader::decrypt_metadata, which raises an error if the footer is encrypted but no decryption properties are set, is never reached in that case:

return Err(general_err!("Parquet file has an encrypted footer but no decryption properties were provided"));

The code here also seems to have the same problem:

if self.file_decryption_properties.is_some() {

I think we can just remove the if options.file_decryption_properties.is_some() check, and maybe rename get_encrypted_metadata etc to indicate that the metadata isn't necessarily encrypted. (maybe get_metadata_with_encryption?). Plus ParquetMetaDataReader::decrypt_metadata also seems misnamed. There are also some documentation comments that will need to be fixed to indicate that these methods aren't only for encrypted files, but can handle encrypted files.

Or alternatively we could move the check for an encrypted footer higher up and simplify some of the lower down methods to not handle the un-encrypted case?

Can we fix that and add a test case? And also a test case for a file with an encrypted footer that's run when encryption is disabled to check we get this error:

"Parquet file has an encrypted footer but the encryption feature is disabled"
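The decision being asked for can be sketched as a small pure function. Everything here is hypothetical (`plan_footer_read`, `FooterAction`, the simplified properties type), and the feature flag is passed as a plain `bool` so both error paths can be exercised in one build. It only covers the footer; a file with a plaintext footer can still have encrypted columns.

```rust
// Hypothetical, simplified placeholder for the real properties type.
#[derive(Debug, Clone, PartialEq)]
struct FileDecryptionProperties {
    footer_key: Vec<u8>,
}

#[derive(Debug, PartialEq)]
enum FooterAction<'a> {
    ReadPlaintext,
    Decrypt(&'a FileDecryptionProperties),
}

// Decide how to read the footer. Whether the footer is encrypted comes from
// the file itself, not from whether the caller supplied properties.
fn plan_footer_read<'a>(
    footer_is_encrypted: bool,
    encryption_feature_enabled: bool,
    props: Option<&'a FileDecryptionProperties>,
) -> Result<FooterAction<'a>, String> {
    match (footer_is_encrypted, encryption_feature_enabled, props) {
        (false, _, _) => Ok(FooterAction::ReadPlaintext),
        (true, false, _) => Err(
            "Parquet file has an encrypted footer but the encryption feature is disabled"
                .to_string(),
        ),
        (true, true, None) => Err(
            "Parquet file has an encrypted footer but no decryption properties were provided"
                .to_string(),
        ),
        (true, true, Some(p)) => Ok(FooterAction::Decrypt(p)),
    }
}
```

The two requested test cases map onto the two `Err` arms: encrypted footer with the feature off, and encrypted footer with the feature on but no properties.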

)
.await;

// todo: should this be double_field?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These todos should all be addressed. It looks like column_2_key is used for float_field so the error should be about double_field, so something is wrong here.

I think the problem is the if column_1_key.is_empty() and if column_2_key.is_empty() checks above. These should surely be if !column_1_key.is_empty() and if !column_2_key.is_empty().

That should also fix the expected key length error problem.
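For reference, the corrected checks would look roughly like this in isolation. `build_column_keys` is a hypothetical helper, and mapping `column_1_key` to double_field is an assumption inferred from the comment that `column_2_key` is used for float_field.

```rust
use std::collections::HashMap;

// Hypothetical helper: register a column key only when one was supplied.
fn build_column_keys(column_1_key: &[u8], column_2_key: &[u8]) -> HashMap<String, Vec<u8>> {
    let mut keys = HashMap::new();
    // Note the negation: an empty key means "no key for this column".
    if !column_1_key.is_empty() {
        // Assumed mapping: column_1_key belongs to double_field.
        keys.insert("double_field".to_string(), column_1_key.to_vec());
    }
    if !column_2_key.is_empty() {
        keys.insert("float_field".to_string(), column_2_key.to_vec());
    }
    keys
}
```

With the inverted check, a test that passes an empty `column_1_key` would have registered an empty key for the first column and skipped the second, which would explain both the wrong column name in the error and the key length error.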

rok and others added 3 commits March 10, 2025 10:19
Co-authored-by: Adam Reeve <adreeve@gmail.com>
Co-authored-by: Adam Reeve <adreeve@gmail.com>
@rok rok force-pushed the decryption-basics-fork branch from 1fcca9c to fb6d3e3 Compare March 10, 2025 13:27
Development

Successfully merging this pull request may close these issues.

Parquet Modular Encryption support
8 participants