Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataContent fails with valid data URI #6055

Closed
luisquintanilla opened this issue Mar 7, 2025 · 12 comments · Fixed by #6072
Closed

DataContent fails with valid data URI #6055

luisquintanilla opened this issue Mar 7, 2025 · 12 comments · Fixed by #6072
Labels
area-ai Microsoft.Extensions.AI libraries bug This issue describes a behavior which is not expected - a bug.

Comments

@luisquintanilla
Copy link
Contributor

luisquintanilla commented Mar 7, 2025

Description

When using Microsoft.Extensions.AI.AzureAIInference IChatClient, sending DataContent with a valid data URI fails.

Repro:

https://gist.github.com/luisquintanilla/48a3dae23ca160a505b29ac8b87d38a4

For text content, rather than checking that the media type is text, it assumes that the AIContent is TextContent.

case TextContent textContent:
parts.Add(new ChatMessageTextContentItem(textContent.Text));
break;

I would expect the switch statement to treat everything as DataContent and for text to also use the media type.

This also raises another question. For custom media / MIME types, what is the mechanism for handling those? Would I have to implement my own DelegateChatClient?

Reproduction Steps

https://gist.github.com/luisquintanilla/48a3dae23ca160a505b29ac8b87d38a4

Expected behavior

I can get a response from the model provider.

Actual behavior

Error: System.ArgumentException: Value cannot be an empty collection. (Parameter 'content')
   at Azure.AI.Inference.Argument.AssertNotNullOrEmpty[T](IEnumerable`1 value, String name)
   at Azure.AI.Inference.ChatRequestUserMessage..ctor(IEnumerable`1 content)
   at Microsoft.Extensions.AI.AzureAIInferenceChatClient.ToAzureAIInferenceChatMessages(IList`1 inputs)+MoveNext()
   at System.Collections.Generic.List`1..ctor(IEnumerable`1 collection)
   at System.Linq.Enumerable.ToList[TSource](IEnumerable`1 source)
   at Azure.AI.Inference.ChatCompletionsOptions..ctor(IEnumerable`1 messages)
   at Microsoft.Extensions.AI.AzureAIInferenceChatClient.ToAzureAIOptions(IList`1 chatContents, ChatOptions options)
   at Microsoft.Extensions.AI.AzureAIInferenceChatClient.GetResponseAsync(IList`1 chatMessages, ChatOptions options, CancellationToken cancellationToken)
   at Submission#8.<<Initialize>>d__0.MoveNext()
--- End of stack trace from previous location ---
   at Microsoft.CodeAnalysis.Scripting.ScriptExecutionState.RunSubmissionsAsync[TResult](ImmutableArray`1 precedingExecutors, Func`2 currentExecutor, StrongBox`1 exceptionHolderOpt, Func`2 catchExceptionOpt, CancellationToken cancellationToken)

Regression?

No response

Known Workarounds

No response

Configuration

No response

Other information

No response

@luisquintanilla luisquintanilla added bug This issue describes a behavior which is not expected - a bug. untriaged labels Mar 7, 2025
@github-actions github-actions bot added the area-ai Microsoft.Extensions.AI libraries label Mar 7, 2025
@luisquintanilla luisquintanilla changed the title DataContent fails for valid data URI. DataContent fails with valid data URI. Mar 7, 2025
@luisquintanilla luisquintanilla changed the title DataContent fails with valid data URI. DataContent fails with valid data URI Mar 7, 2025
@stephentoub
Copy link
Member

I would expect the switch statement to treat everything as DataContent and for text to also use the media type.

I'm not following. Are you saying you expect providers to treat a DataContent with a text/* media type the same as TextContent?

@luisquintanilla
Copy link
Contributor Author

luisquintanilla commented Mar 7, 2025

In the case of AzureAIInferenceChatClient.cs, I would expect text to be handled similar to image and audio.

https://github.com/dotnet/extensions/blob/d325bcb7c81eb59d1df2fd93d836571ed97d57d9/src/Libraries/Microsoft.Extensions.AI.AzureAIInference/AzureAIInferenceChatClient.cs#L493C17-L493C93

In the code I shared, that should be valid content I'm passing but the AI Inference implementation doesn't think so because for text it assumes that I'm sending TextContent.

In that particular implementation, as an end-user without access to the source, when the code throws with message:

System.ArgumentException: Value cannot be an empty collection. (Parameter 'content')

It's not obvious to me that what I should have done was use TextContent.

More generally though, I don't think text should to be treated differently.

@stephentoub
Copy link
Member

stephentoub commented Mar 7, 2025

More generally though, I don't think text should to be treated differently.

Right, so you're saying you'd want implementations to handle both TextContent and DataContent with "text/*" as text content.

It's not obvious to me that what I should have done was use TextContent.

To be honest, it did not occur to me that anyone would try to pass in a DataContent for handling text. How did you end up there?

cc: @MackinnonBuck, @SteveSandersonMS

@luisquintanilla
Copy link
Contributor Author

luisquintanilla commented Mar 7, 2025

Right, so you're saying you'd want implementations to handle both TextContent and DataContent with "text/*" as text content.

Yeah. Although I think in most cases, I think this would be up to the implementer. Not sure what, if anything, we could do there to make that pattern easier. In the Azure AI Inference case since that implementation is under the extensions repo, it seems like the right thing to do.

To be honest, it did not occur to me that anyone would try to pass in a DataContent for handling text. How did you end up there?

I was tinkering and trying to understand how custom mime types might work (the bottom portion of the notebook). When that threw an error, to rule out user error on my part, I tested the text scenario which I expected to work.

@stephentoub
Copy link
Member

stephentoub commented Mar 7, 2025

Although I think in most cases, I think this would be up to the implementer.

Yes, but we need to provide guidance / best practices on how these implementations handle things. If we'd consider it a bug in an implementation that it didn't handle DataContent.StartsWithMediaType("text/") specially, then we're saying that any conforming implementation does so. And it's not ideal when there are two different ways the same information can be specified.

It's also not trivial to handle "text/"... the MIME type can include an optional charset, which dictates an encoding with which the bytes should be converted back to text, so it's possibly quite complicated. Here's what HttpClient does:
https://github.com/dotnet/runtime/blob/81dff0439effb8dabb62421904cdcea8f26c8f0f/src/libraries/System.Net.Http/src/System/Net/Http/HttpContent.cs#L121-L148

If this is really something we think needs to be handled, then we should be providing some kind of helper to make that easier.

I'm not convinced it's worth the complication, though; do we have real uses cases where we'd expect this to occur?

@luisquintanilla
Copy link
Contributor Author

luisquintanilla commented Mar 7, 2025

Just looking at the content types in Abstractions, text seems to be the only one that is treated special and has its own class.

Maybe we just make everything DataContent and rely on mime types?

I also think this starts to get into the last part of the issue which may be worth having a separate thread on.

For custom media / MIME types, what is the mechanism for handling those? Would I have to implement my own DelegateChatClient?

At least when it comes to handling MIME types and parsing them, looks like we have the DataUriParser, handles the most common cases. Anything else, we fall back to behavior similar to HttpClient.

public static bool IsValidMediaType(ReadOnlySpan<char> mediaTypeSpan, ref string? mediaType)
{
Debug.Assert(
mediaType is null || mediaTypeSpan.Equals(mediaType.AsSpan(), StringComparison.Ordinal),
"mediaType string should either be null or the same as the span");
// If the media type is empty or all whitespace, normalize it to null.
if (mediaTypeSpan.IsWhiteSpace())
{
mediaType = null;
return true;
}
// For common media types, we can avoid both allocating a string for the span and avoid parsing overheads.
string? knownType = mediaTypeSpan switch
{
"application/json" => "application/json",
"application/octet-stream" => "application/octet-stream",
"application/pdf" => "application/pdf",
"application/xml" => "application/xml",
"audio/mpeg" => "audio/mpeg",
"audio/ogg" => "audio/ogg",
"audio/wav" => "audio/wav",
"image/apng" => "image/apng",
"image/avif" => "image/avif",
"image/bmp" => "image/bmp",
"image/gif" => "image/gif",
"image/jpeg" => "image/jpeg",
"image/png" => "image/png",
"image/svg+xml" => "image/svg+xml",
"image/tiff" => "image/tiff",
"image/webp" => "image/webp",
"text/css" => "text/css",
"text/csv" => "text/csv",
"text/html" => "text/html",
"text/javascript" => "text/javascript",
"text/plain" => "text/plain",
"text/plain;charset=UTF-8" => "text/plain;charset=UTF-8",
"text/xml" => "text/xml",
_ => null,
};
if (knownType is not null)
{
mediaType ??= knownType;
return true;
}
// Otherwise, do the full validation using the same logic as HttpClient.
mediaType ??= mediaTypeSpan.ToString();
return MediaTypeHeaderValue.TryParse(mediaType, out _);
}

In the case of text, params like charset are well-known. For code, you might also have information like the programming language. Regardless of whether they are well-known or custom params, seems like it'd be good to have a way to handle or expose to users.

@luisquintanilla
Copy link
Contributor Author

luisquintanilla commented Mar 7, 2025

Also, for this constructor. Would we ever expect the Uri to be a non-data URI? If not, is the mediaType needed?

public DataContent([StringSyntax(StringSyntaxAttribute.Uri)] string uri, string? mediaType = null)

@stephentoub
Copy link
Member

Just looking at the content types in Abstractions, text seems to be the only one that is treated special and has its own class.
Maybe we just make everything DataContent and rely on mime types?

Text represents the 99% case today when dealing with LLMs. I think it needs to be simple. To your point, we could make it work with just DataContent, but that feels like it's adding a lot of complication and expense in the name of consistency/consolidation. We'd end up wanting special factory methods for creating DataContent from text, and then we'd need special methods for extracting text if the media type was "text/*", plus the performance overhead.

If we believe the DataContent case is real, we can add the relevant code to handle it. We'd need to do so not only in all of the abstraction implementations, but also in the object model types, such as ChatResponse, which have properties like Text that today enumerate all contents looking for TextContent... those would also need to look for DataContent with "text/*". And as mentioned, we'd likely need to add some kind of TryGetText(out string text) helper for DataContent that would parse out the encoding and use it to convert the bytes into text.

Also, for this constructor. Would we ever expect the Uri to be a non-data URI?

Yes, some services let you specify urls to public locations and they'll download the image rather than you having to send the base64-encoded data. This is why DataContent.Data is a ReadOnlyMemory<byte>? rather than a ReadOnlyMemory<byte>, because it may not contain any data. We do suffer a bit from the fact that in these cases you really need the media type to be specified, at least the prefix, e.g. "image/*", as otherwise implementations won't know what kind of data is at the other end of that url. Ideally there'd be a way to force the caller to provide a media type in such cases, but unless we make the parameter required always, we lack a good way to do that, and making it required would be duplicative of most data uris that include the media type in them. We could come up with different factory methods, but it's not clear to me we have anything significantly better. Open to ideas.

@SteveSandersonMS, any opinions on all this?

@SteveSandersonMS
Copy link
Member

I don't think of TextContent and "DataContent with a text/* media type" as being equivalent at all. In my mental model, the two are different in the same way that typing text into an email and adding a .txt file as an attachment are different, even though both are ways of supplying text.

Since we leave it up to IChatClient implementers to decide how to map the MEAI object model (which, today, differentiates those two representations) to the backend's API, each implementation is free to choose which content types it supports and how to handle it in each case. As such it can choose whether to support text/*, and within that, which text encodings or other subformats to handle, and beyond that whether to map it to be equivalent to TextContent or differentiate it in some way.

@luisquintanilla Can you suggest scenarios where app developers would want or need to use DataContent with a text/* media type instead of TextContent? I personally don't think this will occur to most app developers, or if they do it, I wouldn't necessarily assume they intend it to be handled as equivalent to TextContent.

A significant drawback of us choosing to make DataContent+text/* equivalent to TextContent is the perf overhead that @stephentoub mentions, both in having to check media types on all content objects whenever processing text, but also possibly needing to convert between different text representations and losing the representation that was originally used. So I'd only want to add all this extra sophistication if there were really compelling reasons.

@luisquintanilla
Copy link
Contributor Author

I think it needs to be simple.

I agree that the ergonomics need to be simple for a developer as well as for implementers. At minimum I think it should be consistent for implementers. If we want to make it easy and reduce overhead by providing a type to represent a set of commonly used MIME types (TextContent in the case of text/), should ImageContent and AudioContent be brought back to make it feel consistent? It feels weird that Text is the only one that gets treated differently, especially since we expect to extend support for other image and audio scenarios.

Yes, some services let you specify urls to public locations and they'll download the image rather than you having to send the base64-encoded data.

Thanks for clarifying.

making it required would be duplicative of most data uris that include the media type in them

This is mainly why I was asking. Making it optional though seems like a good way to handle for now.

Can you suggest scenarios where app developers would want or need to use DataContent with a text/* media type instead of TextContent?

As an end-user, if you give me a type that covers a large percentage of cases, which I think TextContent should for text scenarios, I'll use that for cases where the input is a prompt.

both in having to check media types on all content objects whenever processing text, but also possibly needing to convert between different text representations and losing the representation that was originally used.

Would we need to do this in the abstractions or is this something that the implementers would handle? In the AI Inference package we are the implementers.

@stephentoub
Copy link
Member

It feels weird that Text is the only one that gets treated differently

Why does it feel weird? Text is the 99.9% case. It is special.

Note that it's special in .NET in general. There's char, even though you could represent it as just multiple bytes. And there's string, even though you could represent it as a char[] or a byte[].

TextContent then benefits from being able to represent the data exactly as it needs to be consumed throughout .NET, as a string. That is not the case for ImageContent/AudioContent: those types add no additional representation, the derived types had no additional functionality or value... all they provided was a way to check for the kind of data represented, which is handled by dataContent.HasTopLevelMediaType("image") just as dataContent is ImageContent.

Would we need to do this in the abstractions or is this something that the implementers would handle?

Both.

@luisquintanilla
Copy link
Contributor Author

Why does it feel weird?

Because there are other forms of data the models support.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-ai Microsoft.Extensions.AI libraries bug This issue describes a behavior which is not expected - a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants