Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds protobuf serialization to FeatureMap, Hasher, and VariableInfo implementations #226

Merged
merged 40 commits into from
Jun 2, 2022

Conversation

Craigacp
Copy link
Member

@Craigacp Craigacp commented Mar 25, 2022

Description

Adds protobuf definitions for FeatureMap, Hasher, Transformer, and VariableInfo which are of the form (int version, String class_name, Any serialized_data). Each of the concrete implementations of those classes has its own protobuf which forms the serialized_data field. The higher level objects are redirect hooks for the serialization system, the class name field denotes which class should handle deserialization. Classes which are protobuf serializable implement org.tribuo.ProtoSerializable which exposes a single public T serialize() method. They must also implement a public static U deserializeFromProto(int version, String className, Any serializedData) which rebuilds the object from the protobuf. These deserialize methods are called by org.tribuo.util.ProtoUtil.Instantiate, and at the moment FeatureMap also has a static deserialize(FeatureDomainProto proto) method.

I think there probably need to be more static helpers which manage the serialization and deserialization to input/output streams automatically, and I'm also considering a top level redirection proto which can contain at most one of each serializable type in Tribuo. That will allow us to expose deserialize and serialize methods on ProtoUtil which operate on any Tribuo types, redirecting to the appropriate classes as necessary, and thus get us most of the flexibility back that we lose in a non-java.io.Serializable system.

To support the tests I added equals methods to FeatureMap, Hasher implementations and VariableInfo implementations.

There are other protos defined in tribuo-core.proto which aren't used in this PR, the design of those is still under consideration (especially the model one, as it's different from the rest), so they could change in future PRs (and as such shouldn't be considered blockers for reviewing this PR).

Note at the moment the deserializeFromProto method is accessed reflectively, and tries to change the access permissions on the method so it's accessible. I think that's likely to cause problems with modules in the future, and will require an update to our security policy if we release a version with this in, however I'm currently unsure if we want the deserialization methods to be public or not.

The plan is to roll this out to all Tribuo classes which implement java.io.Serializable before the next minor release. Next up after this PR is merged is Output, then Example, Prediction and Dataset. Then we'll work through all the models.

This is currently a draft PR as I've removed the generated protos for ease of reviewing. When we're happy with all the code I'll regenerate them and add them to the branch.

Motivation

java.io.Serializable makes it difficult to evolve the in memory representations of objects without drastically affecting the serialization format, plus it's the source of a number of security issues. We've mitigated the security issues by providing a JEP290 filter, however the restraint on evolving classes remains. Adding protobuf serialization should allow us to evolve models past major version boundaries which involve class renaming or replacement (with some caveats).

@Craigacp Craigacp added Oracle employee This PR is from an Oracle employee squash-commits Squash the commits when merging this PR labels Mar 25, 2022
@Craigacp Craigacp marked this pull request as draft March 25, 2022 14:21
@Craigacp Craigacp force-pushed the protobuf-serialization branch from 88748f8 to 578763a Compare April 9, 2022 00:55
@Craigacp
Copy link
Member Author

Craigacp commented May 3, 2022

@pogren has added an automatic serialization mechanism based on annotations, and I've modified the redirect mechanism a bit after internal discussion. This PR still needs some further cleanup wrt the deserialization methods (as they are a little inconsistent and need to have version number validation everywhere).

@Craigacp Craigacp marked this pull request as ready for review May 9, 2022 01:27
Copy link
Member

@JackSullivan JackSullivan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good apart from the comments, just a few places where we want more documentation

pogren and others added 23 commits May 24, 2022 22:23
added @ProtoSerializableMapValuesField
remove register redirects
clean up serialization of fields logic
- remove getMapFields
- clean up getFields
findMethod can be used for setters and getters by specifying expected
param count.
remove various compile warnings in package.
which allows you to annotate e.g. an array of doubles as done in
BinningTransformer (found in BinningTransformation).
developer must implement deserializeFromProto and won't get a crappy
solution that does something if they don't.

also, put back in detailed error handling messages in the higher-level
'deserialize' method (FKA 'instantiate').
this version has a bunch of sysout statements
added unit tests
better unit tests
removed sysout statements
fixed a bug found in unit tests.
fix problem with field name in annotation
added default impls to all the effected subclasses that now require
them.
@Craigacp Craigacp force-pushed the protobuf-serialization branch from c24d4d2 to bfb1c2b Compare May 25, 2022 02:28
@Craigacp
Copy link
Member Author

I fixed the things raised in the review comments, should be good for another review now.

Copy link
Member

@JackSullivan JackSullivan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

@Craigacp Craigacp merged commit 5940cb3 into oracle:main Jun 2, 2022
@Craigacp Craigacp deleted the protobuf-serialization branch June 2, 2022 17:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Oracle employee This PR is from an Oracle employee squash-commits Squash the commits when merging this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants