-
Notifications
You must be signed in to change notification settings - Fork 180
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds protobuf serialization to FeatureMap, Hasher, and VariableInfo implementations #226
Conversation
88748f8
to
578763a
Compare
@pogren has added an automatic serialization mechanism based on annotations, and I've modified the redirect mechanism a bit after internal discussion. This PR still needs some further cleanup wrt the deserialization methods (as they are a little inconsistent and need to have version number validation everywhere). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good apart from the comments, just a few places where we want more documentation
…quality test to CategoricalInfoTest.
…for equality and serialization in FeatureMapTest.
…mplementations package private so they are easier to test.
added @ProtoSerializableMapValuesField remove register redirects clean up serialization of fields logic - remove getMapFields - clean up getFields findMethod can be used for setters and getters by specifying expected param count.
remove various compile warnings in package.
which allows you to annotate e.g. an array of doubles as done in BinningTransformer (found in BinningTransformation).
developer must implement deserializeFromProto and won't get a crappy solution that does something if they don't. also, put back in detailed error handling messages in the higher-level 'deserialize' method (FKA 'instantiate').
this version has a bunch of sysout statements added unit tests
better unit tests removed sysout statements fixed a bug found in unit tests.
fix problem with field name in annotation
refactored serialize()
added default impls to all the effected subclasses that now require them.
…king version not default, adding javadoc.
c24d4d2
to
bfb1c2b
Compare
I fixed the things raised in the review comments, should be good for another review now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me
Description
Adds protobuf definitions for
FeatureMap
,Hasher
,Transformer
, andVariableInfo
which are of the form(int version, String class_name, Any serialized_data)
. Each of the concrete implementations of those classes has its own protobuf which forms the serialized_data field. The higher level objects are redirect hooks for the serialization system, the class name field denotes which class should handle deserialization. Classes which are protobuf serializable implementorg.tribuo.ProtoSerializable
which exposes a singlepublic T serialize()
method. They must also implement apublic static U deserializeFromProto(int version, String className, Any serializedData)
which rebuilds the object from the protobuf. These deserialize methods are called byorg.tribuo.util.ProtoUtil.Instantiate
, and at the momentFeatureMap
also has a staticdeserialize(FeatureDomainProto proto)
method.I think there probably need to be more static helpers which manage the serialization and deserialization to input/output streams automatically, and I'm also considering a top level redirection proto which can contain at most one of each serializable type in Tribuo. That will allow us to expose
deserialize
andserialize
methods onProtoUtil
which operate on any Tribuo types, redirecting to the appropriate classes as necessary, and thus get us most of the flexibility back that we lose in a non-java.io.Serializable
system.To support the tests I added equals methods to
FeatureMap
,Hasher
implementations andVariableInfo
implementations.There are other protos defined in
tribuo-core.proto
which aren't used in this PR, the design of those is still under consideration (especially the model one, as it's different from the rest), so they could change in future PRs (and as such shouldn't be considered blockers for reviewing this PR).Note at the moment the
deserializeFromProto
method is accessed reflectively, and tries to change the access permissions on the method so it's accessible. I think that's likely to cause problems with modules in the future, and will require an update to our security policy if we release a version with this in, however I'm currently unsure if we want the deserialization methods to be public or not.The plan is to roll this out to all Tribuo classes which implement
java.io.Serializable
before the next minor release. Next up after this PR is merged isOutput
, thenExample
,Prediction
andDataset
. Then we'll work through all the models.This is currently a draft PR as I've removed the generated protos for ease of reviewing. When we're happy with all the code I'll regenerate them and add them to the branch.
Motivation
java.io.Serializable
makes it difficult to evolve the in memory representations of objects without drastically affecting the serialization format, plus it's the source of a number of security issues. We've mitigated the security issues by providing a JEP290 filter, however the restraint on evolving classes remains. Adding protobuf serialization should allow us to evolve models past major version boundaries which involve class renaming or replacement (with some caveats).