-
Notifications
You must be signed in to change notification settings - Fork 889
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Parquet Modular decryption (read) support + encryption
flag
#6637
base: main
Are you sure you want to change the base?
Conversation
Currently this is a rough rebase of work done by @ggershinsky. As |
@rok let me know if you want any help shoehorning this into |
Is there any help, input or contribution needed here? |
Thanks for the offer @etseidl & @brainslush! I'm making some progress and would definitely appreciate a review! I'll ping once I push. |
fe488b3
to
d263510
Compare
@etseidl could you please do a quick pass to say if this makes sense in respect to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only looking at the metadata bits for now...looks good to me so far. Just a few minor nits. Thanks @rok!
f90d8b4
to
29d55eb
Compare
27e77ad
to
7db06cc
Compare
a4105d5
to
3e7646d
Compare
deedba9
to
951f2fa
Compare
f6b9e88
to
23375d1
Compare
7f94e39
to
177d826
Compare
ac4ac21
to
3241425
Compare
9e5156f
to
a3708b2
Compare
cca1d57
to
27a1071
Compare
8d5bc46
to
7a6ec1f
Compare
Co-authored-by: Corwin Joy <corwin.joy@gmail.com> Co-authored-by: Adam Reeve <adreeve@gmail.com>
7a6ec1f
to
276fc1a
Compare
Thanks for reviews so far! I've tried to address all comments (especially regarding the API changes) and I think this is now ready for another round of review. |
I plan to give this a good review Monday |
I've made a few follow up issues for features that aren't in this initial PR: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left some review comments thanks Rok. This is looking pretty good to me but my main concern is handling of encrypted files when decryption properties aren't specified, which looks like it will no longer raise a helpful error after some of the recent changes, and will instead go down the code paths that don't support encryption even if the encryption feature is enabled.
I think it would also be good to reduce the number of public structs here so it's safer to change things in future without worrying too much about breaking things. I haven't left detailed comments on that but it seems like users would need to create FileDecryptionProperties
but shouldn't need public access to a lot of the other structs in the encryption module. Eg. CryptoContext
and FileDecryptor
seem like implementation details that could be private? I could be wrong though if these are needed as parameters for other public methods.
|
||
#[cfg(feature = "encryption")] | ||
{ | ||
let ret = Ok(ret.unwrap().with_crypto_context(crypto_context)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unwrap here could panic. This section including the #[cfg(not(feature = "encryption"))]
part below could be simplified to something like:
#[cfg(feature = "encryption")]
let ret = ret.map(|reader| reader.with_crypto_context(crypto_context));
Some(ret.map(|x| Box::new(x) as _))
@@ -1167,6 +1297,8 @@ mod tests { | |||
data: data.clone(), | |||
metadata: metadata.clone(), | |||
requests: Default::default(), | |||
#[cfg(feature = "encryption")] | |||
file_decryption_properties: None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be tidied up by using a similar approach to what we've done elsewhere and making a TestReader::new
method and TestReader::with_file_decryption_properties
.with_prefetch_hint(self.metadata_size_hint); | ||
#[cfg(feature = "encryption")] | ||
let metadata = metadata | ||
.with_decryption_properties(self.file_decryption_properties.clone().as_ref()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this with_decryption_properties
call should be removed as this is now in the get_encrypted_metadata
method.
parquet/src/encryption/mod.rs
Outdated
|
||
pub mod ciphers; | ||
pub mod decryption; | ||
pub mod modules; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the ciphers
or modules
modules should need to be public. Can these be made private? Or at least pub(crate)
.
@@ -1788,6 +1859,151 @@ mod tests { | |||
assert!(col.value(2).is_nan()); | |||
} | |||
|
|||
#[test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also add a test for the encrypt_columns_and_footer_disable_aad_storage.parquet.encrypted
file? It looks like that will need a change to get_file_decryptor
so it uses the AAD from the file decryption properties rather than the one from the file.
And if an AAD prefix is specified in both the decryption properties and the file, the one in the decryption properties should be used.
A test that specifying the wrong AAD prefix results in an error when the AAD is stored in the file would also be useful.
This could possibly be a follow up change rather than needing to be done in this PR though.
column_index: None, | ||
offset_index: None, | ||
} | ||
} | ||
|
||
#[allow(missing_docs)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this just have a doc comment added :)
@@ -372,6 +414,11 @@ impl ParquetMetaDataReader { | |||
mut fetch: F, | |||
file_size: usize, | |||
) -> Result<()> { | |||
#[cfg(feature = "encryption")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The encryption
and not(encryption)
blocks here are identical.
@@ -175,6 +227,15 @@ impl ArrowReaderMetadata { | |||
) -> Result<Self> { | |||
// TODO: this is all rather awkward. It would be nice if AsyncFileReader::get_metadata | |||
// took an argument to fetch the page indexes. | |||
#[cfg(feature = "encryption")] | |||
let mut metadata = if options.file_decryption_properties.is_some() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With these recent changes it doesn't look like we properly handle when a file is encrypted but no decryption properties are provided. Here we check that file decryption properties are set before calling get_encrypted_metadata
, and then later in ParquetMetadataReader::decrypt_metadata
we raise an error if the footer is encrypted but no decryption properties are set (
arrow-rs/parquet/src/file/metadata/reader.rs
Line 730 in 276fc1a
return Err(general_err!("Parquet file has an encrypted footer but no decryption properties were provided")); |
The code here also seems to have the same problem:
arrow-rs/parquet/src/file/metadata/reader.rs
Line 577 in 276fc1a
if self.file_decryption_properties.is_some() { |
I think we can just remove the if options.file_decryption_properties.is_some()
check, and maybe rename get_encrypted_metadata
etc to indicate that the metadata isn't necessarily encrypted. (maybe get_metadata_with_encryption
?). Plus ParquetMetaDataReader::decrypt_metadata
also seems misnamed. There are also some documentation comments that will need to be fixed to indicate that these methods aren't only for encrypted files, but can handle encrypted files.
Or alternatively we could move the check for an encrypted footer higher up and simplify some of the lower down methods to not handle the un-encrypted case?
Can we fix that and add a test case? And also a test case for a file with an encrypted footer that's run when encryption is disabled to check we get this error:
arrow-rs/parquet/src/file/metadata/reader.rs
Line 571 in 276fc1a
"Parquet file has an encrypted footer but the encryption feature is disabled" |
) | ||
.await; | ||
|
||
// todo: should this be double_field? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These todos should all be addressed. It looks like column_2_key
is used for float_field
so the error should be about double_field
, so something is wrong here.
I think the problem is the if column_1_key.is_empty()
and if column_2_key.is_empty()
checks above. These should surely be if !column_1_key.is_empty()
and if !column_2_key.is_empty()
.
That should also fix the expected key length error problem.
Co-authored-by: Adam Reeve <adreeve@gmail.com>
Co-authored-by: Adam Reeve <adreeve@gmail.com>
…ion_disabled_aad_storage
1fcca9c
to
fb6d3e3
Compare
Which issue does this PR close?
This PR is based on branch and an internal patch and aims to provide basic modular decryption support. Partially closes #3511. We decided to split encryption work into a separate PR.
Rationale for this change
See #3511.
What changes are included in this PR?
This introduces AesGcmV1 cypher decryption to
ArrowReaderMetadata
andParquetRecordBatchReader
. Introduced classes and functions are tested on sample files fromparquet-dataset
.Are there any user-facing changes?
Several new classes and method parameters are introduced. If project is compiled without
encryption
flag changes are not breaking. Ifencryption
flag is on some methods and constructors (e.g.ParquetMetaData::new
) will require new parameters which would be a breaking change.