Docs specify Pandas DataFrame Codec input shapes incorrectly. #1678

ReveStobinson · 2024-04-11T16:26:32Z

This is possibly semi-related to #1625, but I'm submitting this as a separate issue because I think there is an additional issue that is specific to the docs and won't be solved with a solution to #1625.

I will fork and make a PR for fixing this, since it's just documentation-related, but I do think that PR should be conditional on #1625 also being closed as complete.

The Pandas DataFrame section of the Codec docs provides an example DataFrame for encoding to the V2 inference protocol:

A	B	C
a1	b1	c1
a2	b2	c2
a3	b3	c3
a4	b4	c4

The tabs below it show the encoding for that payload, but neither one actually produces a "correct" encoding for that data that would be acceptable to e.g. a scikit-learn model without input modifications.

JSON Payload error

The JSON encoding provided is:

{
  "parameters": {
    "content_type": "pd"
  },
  "inputs": [
    {
      "name": "A",
      "data": ["a1", "a2", "a3", "a4"]
      "datatype": "BYTES",
      "shape": [3],
    },
    {
      "name": "B",
      "data": ["b1", "b2", "b3", "b4"]
      "datatype": "BYTES",
      "shape": [3],
    },
    {
      "name": "C",
      "data": ["c1", "c2", "c3", "c4"]
      "datatype": "BYTES",
      "shape": [3],
    },
  ]
}

The "shape": [3] present in all three of those inputs is not the correct encoding of that DataFrame. The DataFrame does have 3 columns, but each input represents a single column, so its shape should be at least the number of rows in the DataFrame, which is 4 here. This is actually mentioned in the last bullet point above the example DataFrame on that very page:

The shape field of each input (or output) entry will contain (at least)
the amount of rows included in the dataframe.

So the true JSON encoding should be:

{
  "parameters": {
    "content_type": "pd"
  },
  "inputs": [
    {
      "name": "A",
      "data": ["a1", "a2", "a3", "a4"],
      "datatype": "BYTES",
      "shape": [4]
    },
    {
      "name": "B",
      "data": ["b1", "b2", "b3", "b4"],
      "datatype": "BYTES",
      "shape": [4]
    },
    {
      "name": "C",
      "data": ["c1", "c2", "c3", "c4"],
      "datatype": "BYTES",
      "shape": [4]
    }
  ]
}
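As a quick sanity check, here is a minimal sketch in plain Python (no MLServer required) that flags inputs whose declared shape does not account for every element in data. It assumes the product of the shape dimensions should equal the number of elements; `make_payload` and `shape_mismatches` are hypothetical helper names used only for illustration, not part of MLServer.

```python
from math import prod

# Column data taken from the example DataFrame in the docs.
COLUMNS = {
    "A": ["a1", "a2", "a3", "a4"],
    "B": ["b1", "b2", "b3", "b4"],
    "C": ["c1", "c2", "c3", "c4"],
}


def make_payload(shape):
    """Build a V2-style payload where every column declares the given shape."""
    return {
        "parameters": {"content_type": "pd"},
        "inputs": [
            {"name": name, "data": data, "datatype": "BYTES", "shape": list(shape)}
            for name, data in COLUMNS.items()
        ],
    }


def shape_mismatches(payload):
    """Return (name, shape, n_elements) for each input whose shape is inconsistent.

    Assumption: the product of the shape should equal len(data).
    """
    bad = []
    for inp in payload["inputs"]:
        n_elements = len(inp["data"])
        if prod(inp["shape"]) != n_elements:
            bad.append((inp["name"], inp["shape"], n_elements))
    return bad


print(shape_mismatches(make_payload([3])))  # the shape currently shown in the docs
print(shape_mismatches(make_payload([4])))  # the shape proposed above
```

Under that assumption, the [3] payload reports all three columns as inconsistent, while the [4] payload reports none.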

Pandas Request Codec Payload Error

The output of the Pandas Request Codec is also incorrect, as mentioned in #1625.

The following code snippet is presented:

import pandas as pd

from mlserver.codecs import PandasCodec

foo = pd.DataFrame({
  "A": ["a1", "a2", "a3", "a4"],
  "B": ["b1", "b2", "b3", "b4"],
  "C": ["c1", "c2", "c3", "c4"]
})

inference_request = PandasCodec.encode_request(foo)

But if we actually run this code and get the output of that request, we see it doesn't match either of the JSON Payloads above. inference_request.dict() yields a dictionary that looks like this:

{
  "parameters": {
    "content_type": "pd"
  },
  "inputs": [
    {
      "name": "A",
      "shape": [4, 1],
      "datatype": "BYTES",
      "data": ["a1", "a2", "a3", "a4"]
    },
    {
      "name": "B",
      "shape": [4, 1],
      "datatype": "BYTES",
      "data": ["b1", "b2", "b3", "b4"]
    },
    {
      "name": "C",
      "shape": [4, 1],
      "datatype": "BYTES",
      "data": ["c1", "c2", "c3", "c4"]
    }
  ]
}

Again, it seems like the proper serialization here should be "shape": [4], not [4, 1].
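To make the difference concrete, here is a small sketch that rewrites a request dictionary like the one above so each column's shape drops its trailing size-1 dimension. This is only an illustration of the [4, 1] versus [4] distinction: `squeeze_trailing_ones` and `normalize_request` are hypothetical helpers, not MLServer functions, and the dictionary is copied by hand from the output shown above rather than produced by the codec.

```python
def squeeze_trailing_ones(shape):
    """Drop trailing dimensions of size 1, always keeping at least one dimension."""
    squeezed = list(shape)
    while len(squeezed) > 1 and squeezed[-1] == 1:
        squeezed.pop()
    return squeezed


def normalize_request(request_dict):
    """Return a copy of the request dict with every input shape squeezed."""
    return {
        **request_dict,
        "inputs": [
            {**inp, "shape": squeeze_trailing_ones(inp["shape"])}
            for inp in request_dict["inputs"]
        ],
    }


# Hand-copied from the inference_request.dict() output shown above.
codec_output = {
    "parameters": {"content_type": "pd"},
    "inputs": [
        {"name": "A", "shape": [4, 1], "datatype": "BYTES",
         "data": ["a1", "a2", "a3", "a4"]},
        {"name": "B", "shape": [4, 1], "datatype": "BYTES",
         "data": ["b1", "b2", "b3", "b4"]},
        {"name": "C", "shape": [4, 1], "datatype": "BYTES",
         "data": ["c1", "c2", "c3", "c4"]},
    ],
}

normalized = normalize_request(codec_output)
print([inp["shape"] for inp in normalized["inputs"]])  # [[4], [4], [4]]
```

The original dictionary is left untouched. Only the returned copy carries the one-dimensional shapes.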


ReveStobinson mentioned this issue Apr 11, 2024

Fix JSON input shapes #1679

Merged

sakoush closed this as completed in #1679 Apr 17, 2024
