[BUG] BWC bug that cause query shard failure #17125

Closed
zhichao-aws opened this issue Jan 26, 2025 · 9 comments
Labels: bug (Something isn't working), Search (Search query, autocomplete ...etc)

Comments

@zhichao-aws
Member

Describe the bug

Queries fail during serialization/deserialization between a 3.0 node and a 2.19 node, which results in query shard failures.

Related component

Search

To Reproduce

opensearch-project/neural-search#1142 (comment)

  1. Set up a two-node cluster with one 3.0.0 node and one 2.19 snapshot node.
  2. Create the index:
PUT test
{
    "settings": {
      "number_of_shards": 2,
      "number_of_replicas": 0
    },
    "mappings": {
      "properties": {
        "passage_embedding": {
          "type": "rank_features"
        },
        "passage_text": {
          "type": "text"
        }
      }
    }
}
  3. Ingest the documents:
POST _bulk
{"index":{"_index":"test"}}
{"passage_embedding":{"hello":1.1,"world":1.2}, "passage_text": "hello world"}
{"index":{"_index":"test"}}
{"passage_embedding":{"hello":1.1,"world":1.2}, "passage_text": "hello world"}
{"index":{"_index":"test"}}
{"passage_embedding":{"hello":1.1,"world":1.2}, "passage_text": "hello world"}
{"index":{"_index":"test"}}
{"passage_embedding":{"hello":1.1,"world":1.2}, "passage_text": "hello world"}
{"index":{"_index":"test"}}
{"passage_embedding":{"hello":1.1,"world":1.2}, "passage_text": "hello world"}
  4. Run a search:
GET test/_search

Any query works here (an empty body is treated as match_all), and the request can be sent to either node. The response reports a shard failure caused by a serialization/deserialization error:

{'took': 63,
 'timed_out': False,
 '_shards': {'total': 2,
  'successful': 1,
  'skipped': 0,
  'failed': 1,
  'failures': [{'shard': 1,
    'index': 'test2',
    'node': 'AEmZn6uASnGjAU0SCF-EXg',
    'reason': {'type': 'illegal_state_exception',
     'reason': 'unexpected byte [0x3f]'}}]},
 'hits': {'total': {'value': 3, 'relation': 'eq'},
  'max_score': 1.0,
  'hits': [ ... ]}}

Expected behavior

The search should complete with no shard failures and return all 5 hits.
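
The expected behavior can be checked mechanically against a `_search` response body. The following is a minimal sketch, assuming the response has already been parsed into a Python dict like the one shown above; the helper names `summarize_shard_failures` and `check_expected` are hypothetical and not part of any OpenSearch client.

```python
def summarize_shard_failures(response):
    """Return (failed_count, reasons) from a parsed _search response body."""
    shards = response.get("_shards", {})
    reasons = [
        f"shard {f.get('shard')} on node {f.get('node')}: "
        f"{f.get('reason', {}).get('reason')}"
        for f in shards.get("failures", [])
    ]
    return shards.get("failed", 0), reasons


def check_expected(response, expected_hits=5):
    """True only if no shard failed and the total hit count matches."""
    failed, reasons = summarize_shard_failures(response)
    total = response["hits"]["total"]["value"]
    return failed == 0 and total == expected_hits, reasons


# Response body copied from the report above (the hits list is elided there).
reported = {
    "took": 63,
    "timed_out": False,
    "_shards": {
        "total": 2,
        "successful": 1,
        "skipped": 0,
        "failed": 1,
        "failures": [
            {
                "shard": 1,
                "index": "test2",
                "node": "AEmZn6uASnGjAU0SCF-EXg",
                "reason": {
                    "type": "illegal_state_exception",
                    "reason": "unexpected byte [0x3f]",
                },
            }
        ],
    },
    "hits": {"total": {"value": 3, "relation": "eq"}, "max_score": 1.0, "hits": []},
}

ok, reasons = check_expected(reported)
print(ok)
for line in reasons:
    print(line)
```

For the reported response the check fails, because one shard failed and only 3 of the 5 documents were counted.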

@zhichao-aws zhichao-aws added bug Something isn't working untriaged labels Jan 26, 2025
@github-actions github-actions bot added the Search Search query, autocomplete ...etc label Jan 26, 2025
@sandeshkr419
Contributor

[Search Triage] I checked with the team, and this was broken only very briefly. Could you please pull the latest 2.x / 3.0 changes? It should work now.

@zhichao-aws
Member Author

Hi @sandeshkr419, I can still reproduce the error with the latest main and 2.x.

@reta
Collaborator

reta commented Feb 8, 2025

I believe the custom-codecs plugin hits the same BWC issue [1]:


CustomCodecsBwcCompatibilityIT > testDataIngestionAndSearchBackwardsCompatibility FAILED
    org.opensearch.client.ResponseException: method [POST], host [http://[::]:64743], URI [test-custom-codec-index/_search], status line [HTTP/1.1 500 Internal Server Error]
    {"error":{"root_cause":[{"type":"illegal_state_exception","reason":"unexpected byte [0x9c]"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"test-custom-codec-index","node":"xB6tPA3zShebHizq068Z6Q","reason":{"type":"illegal_state_exception","reason":"unexpected byte [0x9c]"}}],"caused_by":{"type":"illegal_state_exception","reason":"unexpected byte [0x9c]","caused_by":{"type":"illegal_state_exception","reason":"unexpected byte [0x9c]"}}},"status":500}
        at __randomizedtesting.SeedInfo.seed([1392CC734671C79D:6150F3CB4484D999]:0)
        at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:501)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:384)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:359)
        at app//org.opensearch.customcodecs.bwc.helper.RestHelper.makeRequest(RestHelper.java:62)
        at app//org.opensearch.customcodecs.bwc.helper.RestHelper.requestAgainstAllNodes(RestHelper.java:82)
        at app//org.opensearch.customcodecs.bwc.helper.RestHelper.requestAgainstAllNodes(RestHelper.java:69)
        at app//org.opensearch.customcodecs.bwc.CustomCodecsBwcCompatibilityIT.searchMatchAll(CustomCodecsBwcCompatibilityIT.java:214)
        at app//org.opensearch.customcodecs.bwc.CustomCodecsBwcCompatibilityIT.testDataIngestionAndSearchBackwardsCompatibility(CustomCodecsBwcCompatibilityIT.java:166)

[1] https://github.com/opensearch-project/custom-codecs/actions/runs/13167039041/job/36749522655?pr=217

@peterzhuamazon
Member

@zhichao-aws @reta Maybe not related but JS is able to fix its own bwc issues:
opensearch-project/job-scheduler#730 (comment)

Thanks.

@reta
Collaborator

reta commented Feb 11, 2025

Thanks @peterzhuamazon, but the issues do not seem to be related.

@andrross
Member

The server-side stack trace is:

Caused by: java.lang.IllegalStateException: unexpected byte [0x9c]
  at org.opensearch.core.common.io.stream.StreamInput.readBoolean(StreamInput.java:596) ~[opensearch-core-2.20.0-SNAPSHOT.jar:2.20.0-SNAPSHOT]
  at org.opensearch.core.common.io.stream.StreamInput.readOptionalBoolean(StreamInput.java:606) ~[opensearch-core-2.20.0-SNAPSHOT.jar:2.20.0-SNAPSHOT]
  at org.opensearch.search.internal.ShardSearchRequest.<init>(ShardSearchRequest.java:255) ~[opensearch-2.20.0-SNAPSHOT.jar:2.20.0-SNAPSHOT]
  at org.opensearch.transport.RequestHandlerRegistry.newRequest(RequestHandlerRegistry.java:87) ~[opensearch-2.20.0-SNAPSHOT.jar:2.20.0-SNAPSHOT]
  at org.opensearch.transport.NativeMessageHandler.newRequest(NativeMessageHandler.java:316) ~[opensearch-2.20.0-SNAPSHOT.jar:2.20.0-SNAPSHOT]
  at org.opensearch.transport.NativeMessageHandler.handleRequest(NativeMessageHandler.java:271) ~[opensearch-2.20.0-SNAPSHOT.jar:2.20.0-SNAPSHOT]
  at org.opensearch.transport.NativeMessageHandler.handleMessage(NativeMessageHandler.java:146) ~[opensearch-2.20.0-SNAPSHOT.jar:2.20.0-SNAPSHOT]
  at org.opensearch.transport.NativeMessageHandler.messageReceived(NativeMessageHandler.java:126) ~[opensearch-2.20.0-SNAPSHOT.jar:2.20.0-SNAPSHOT]
  at org.opensearch.transport.InboundHandler.messageReceivedFromPipeline(InboundHandler.java:120) ~[opensearch-2.20.0-SNAPSHOT.jar:2.20.0-SNAPSHOT]
  at org.opensearch.transport.InboundHandler.inboundMessage(InboundHandler.java:112) ~[opensearch-2.20.0-SNAPSHOT.jar:2.20.0-SNAPSHOT]
  at org.opensearch.transport.TcpTransport.inboundMessage(TcpTransport.java:768) ~[opensearch-2.20.0-SNAPSHOT.jar:2.20.0-SNAPSHOT]
  at org.opensearch.transport.InboundBytesHandler.forwardFragments(InboundBytesHandler.java:137) ~[opensearch-2.20.0-SNAPSHOT.jar:2.20.0-SNAPSHOT]
  at org.opensearch.transport.InboundBytesHandler.doHandleBytes(InboundBytesHandler.java:77) ~[opensearch-2.20.0-SNAPSHOT.jar:2.20.0-SNAPSHOT]
  at org.opensearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:124) ~[opensearch-2.20.0-SNAPSHOT.jar:2.20.0-SNAPSHOT]
  at org.opensearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:113) ~[opensearch-2.20.0-SNAPSHOT.jar:2.20.0-SNAPSHOT]
  at org.opensearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:95) ~[?:?]
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[?:?]
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[?:?]
  at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[?:?]
  at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280) ~[?:?]
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[?:?]
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[?:?]
  at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[?:?]
  at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:107) ~[?:?]
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[?:?]
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[?:?]
  at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[?:?]
  at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1357) ~[?:?]
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) ~[?:?]
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[?:?]
  at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:868) ~[?:?]
  at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) ~[?:?]
  at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) ~[?:?]
  at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:689) ~[?:?]
  at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:652) ~[?:?]
  at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) ~[?:?]
  at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[?:?]
  at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
  at java.lang.Thread.run(Thread.java:1583) ~[?:?]

which points to ShardSearchRequest.java:255, the readOptionalBoolean call in the ShardSearchRequest constructor. There are no recent changes there, though, so it's not immediately obvious what is causing this...

@andrross
Member

There is a more recent change (#17098) in SearchSourceBuilder which is deserialized in the ShardSearchRequest constructor.

@reta @zhichao-aws I think #17098 is the fix for this issue, but the problem is that the fix was committed after the change to publish the "3.0.0-alpha1" snapshot, so the "3.0.0" snapshot does not have this fix. FYI @peterzhuamazon
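
To illustrate why a mismatch in an earlier field surfaces as "unexpected byte" in readBoolean, here is a toy model in Python. It is only a sketch under stated assumptions: it is not the OpenSearch StreamInput/StreamOutput code, the extra one-byte field and its value 0x3f are made up for illustration (the value is borrowed from the error in the report), and it does not reproduce the actual change in #17098. The one detail taken from the stack trace is that an optional boolean is read as a single byte and anything other than the known encodings is rejected.

```python
import io


def write_optional_boolean(out, value):
    # Toy encoding: 2 means "absent", 0 means False, 1 means True.
    out.write(bytes([2 if value is None else int(value)]))


def read_boolean_byte(b):
    if b == 0:
        return False
    if b == 1:
        return True
    # Mirrors the message format seen in the stack trace.
    raise RuntimeError(f"unexpected byte [0x{b:02x}]")


def read_optional_boolean(inp):
    b = inp.read(1)[0]
    return None if b == 2 else read_boolean_byte(b)


def encode(sender_has_extra_field):
    """Serialize a toy request; the extra field is hypothetical."""
    out = io.BytesIO()
    if sender_has_extra_field:
        out.write(bytes([0x3F]))  # field the receiving node does not know about
    write_optional_boolean(out, True)
    return out.getvalue()


def decode_on_old_node(payload):
    """The receiving node expects the optional boolean first."""
    return read_optional_boolean(io.BytesIO(payload))


print(decode_on_old_node(encode(sender_has_extra_field=False)))
try:
    decode_on_old_node(encode(sender_has_extra_field=True))
except RuntimeError as err:
    print(err)
```

When both sides agree on the layout the boolean decodes normally. When the sender writes a field the receiver does not consume, the receiver reads that field's byte as the boolean and fails, even though the boolean-reading code itself never changed.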

@reta
Collaborator

reta commented Feb 13, 2025

Thanks @andrross, so moving to 3.0.0-alpha1 should fix it. The custom-codecs plugin is still in the process of migrating.

@zhichao-aws
Member Author

The BWC test passed after bumping to 3.0.0-alpha1. Thanks @andrross!

@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in Search Project Board Feb 16, 2025