Docs specify Pandas DataFrame Codec input shapes incorrectly. #1678

Closed
ReveStobinson opened this issue Apr 11, 2024 · 0 comments · Fixed by #1679

This is possibly related to #1625, but I'm submitting it separately because I think there is an additional problem that is specific to the docs and won't be solved by a fix for #1625.

I will fork and make a PR for fixing this, since it's just documentation-related, but I do think that PR should be conditional on #1625 also being closed as complete.

The Pandas DataFrame section of the Codec docs provides an example DataFrame for encoding to the V2 inference protocol:

A B C
a1 b1 c1
a2 b2 c2
a3 b3 c3
a4 b4 c4

The tabs below it show two encodings of that DataFrame, but neither is a "correct" encoding that e.g. a scikit-learn model would accept without modifying the input.

JSON Payload error

The JSON encoding provided is:

{
  "parameters": {
    "content_type": "pd"
  },
  "inputs": [
    {
      "name": "A",
      "data": ["a1", "a2", "a3", "a4"]
      "datatype": "BYTES",
      "shape": [3],
    },
    {
      "name": "B",
      "data": ["b1", "b2", "b3", "b4"]
      "datatype": "BYTES",
      "shape": [3],
    },
    {
      "name": "C",
      "data": ["c1", "c2", "c3", "c4"]
      "datatype": "BYTES",
      "shape": [3],
    },
  ]
}

The "shape": [3] present in all three of those inputs is not the correct encoding of that DataFrame. While the DataFrame does have 3 columns, it has 4 rows, and the shape of each input should be at least the number of rows in the DataFrame. This is actually mentioned in the last bullet point above the example DataFrame on that very page:

  • The shape field of each input (or output) entry will contain (at least)
    the amount of rows included in the dataframe.

So the true JSON encoding should be:

{
  "parameters": {
    "content_type": "pd"
  },
  "inputs": [
    {
      "name": "A",
      "data": ["a1", "a2", "a3", "a4"],
      "datatype": "BYTES",
      "shape": [4]
    },
    {
      "name": "B",
      "data": ["b1", "b2", "b3", "b4"],
      "datatype": "BYTES",
      "shape": [4]
    },
    {
      "name": "C",
      "data": ["c1", "c2", "c3", "c4"],
      "datatype": "BYTES",
      "shape": [4]
    }
  ]
}
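
To make the row-count rule concrete, here is a minimal pure-Python sketch of how the expected payload could be derived from the columns and sanity-checked. The helpers `encode_columns` and `shapes_match_rows` are hypothetical names for illustration only; they are not part of MLServer and do not reflect how `PandasCodec` is implemented. The check assumes the leading dimension should equal the number of rows, which is the strictest reading of the "at least" wording in the docs.

```python
from typing import Dict, List


def encode_columns(columns: Dict[str, List[str]]) -> dict:
    """Hypothetical helper: build a V2-style payload with one input per column."""
    return {
        "parameters": {"content_type": "pd"},
        "inputs": [
            {
                "name": name,
                "data": list(values),
                "datatype": "BYTES",
                # One entry per row, so the shape is the row count.
                "shape": [len(values)],
            }
            for name, values in columns.items()
        ],
    }


def shapes_match_rows(payload: dict) -> bool:
    """Return True if every input's leading dimension equals its number of data entries."""
    return all(inp["shape"][0] == len(inp["data"]) for inp in payload["inputs"])


columns = {
    "A": ["a1", "a2", "a3", "a4"],
    "B": ["b1", "b2", "b3", "b4"],
    "C": ["c1", "c2", "c3", "c4"],
}
payload = encode_columns(columns)
```

Run against the payload from the docs, with "shape": [3] on four data entries, `shapes_match_rows` would return False.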

Pandas Request Codec Payload Error

The output of the Pandas Request Codec is also incorrect, as mentioned in #1625.

The following code snippet is presented:

import pandas as pd

from mlserver.codecs import PandasCodec

foo = pd.DataFrame({
  "A": ["a1", "a2", "a3", "a4"],
  "B": ["b1", "b2", "b3", "b4"],
  "C": ["c1", "c2", "c3", "c4"]
})

inference_request = PandasCodec.encode_request(foo)

But if we actually run this code and inspect the resulting request, it doesn't match either of the JSON payloads above. inference_request.dict() yields a dictionary that looks like this:

{
  "parameters": {
    "content_type": "pd"
  },
  "inputs": [
    {
      "name": "A",
      "shape": [4, 1],
      "datatype": "BYTES",
      "data": ["a1", "a2", "a3", "a4"]
    },
    {
      "name": "B",
      "shape": [4, 1],
      "datatype": "BYTES",
      "data": ["b1", "b2", "b3", "b4"]
    },
    {
      "name": "C",
      "shape": [4, 1],
      "datatype": "BYTES",
      "data": ["c1", "c2", "c3", "c4"]
    }
  ]
}

Again, the proper serialization here should probably be "shape": [4], not [4, 1].
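
As a rough illustration of the difference between the two shapes, the sketch below shows that [4] and [4, 1] describe the same four elements and differ only in rank, whereas [3] describes too few elements for the data. Both helpers are hypothetical and written only for this comparison; MLServer does not expose them.

```python
from math import prod
from typing import List


def num_elements(shape: List[int]) -> int:
    """Total number of elements a tensor of this shape holds."""
    return prod(shape)


def drop_trailing_singleton(shape: List[int]) -> List[int]:
    """Hypothetical normalization: turn [n, 1] into [n], leave other shapes unchanged."""
    if len(shape) > 1 and shape[-1] == 1:
        return shape[:-1]
    return list(shape)


column_data = ["a1", "a2", "a3", "a4"]

# Both shapes account for all four entries; only the rank differs.
same_size = num_elements([4]) == num_elements([4, 1]) == len(column_data)

# The shape from the docs does not cover the data at all.
docs_shape_fits = num_elements([3]) == len(column_data)

normalized = drop_trailing_singleton([4, 1])
```

Whether a consumer accepts the two-dimensional form depends on the model runtime, so this only shows that the two shapes are interchangeable in terms of element count.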
