Default readiness on historicals breaks operator. #97

mindreader · 2023-08-21T17:54:56Z

In pr #72 , defaults probes are set for various components.

This breaks the operator because while /druid/historical/v1/readiness is just fine to call /druid/historical/v1/loadstatus is a privileged call that requires authentication. Since it places this readinessProbe by default if you don't have one, the only way to disable it is to hard wire one yourself to another call.

curl -vvv http://127.0.0.1:4124/druid/historical/v1/readiness
HTTP/1.1 200 OK

curl -vvv http://127.0.0.1:4124/druid/historical/v1/loadstatus
HTTP/1.1 401 Unauthorized

The text was updated successfully, but these errors were encountered:

mindreader · 2023-08-21T21:10:59Z

This also caused a painful outage, as the operator seemed to be forcing a default startupProbe onto the coordinator/overlord. And the tasks that are spawned through kubernetes-overlord-extensions copied that pod spec for themselves, but they are hard coded in the druid source code to use 8100, which doesn't match the druid.port (8088) in my operator druid spec., so the startupProbes failed always, causing the pods to die after a couple minutes repeatedly.

I ended up having to disable the operator entirely to get things back in order.

cyril-corbon · 2023-08-22T06:15:09Z

Hello @mindreader

Can you give us more details on your configuration and the druid version your are using ?

I cannot reproduce the 401 on druid version 0.23.0 to 25.0.0

as both path have the same goals I will change this probe.

mindreader · 2023-08-22T12:24:09Z

I have a druid basic escalator which means intra-communication in my cluster requires credentials. I'm on druid version 25.

In any case, you don't have to believe me, the relevant code in druid is here: https://github.com/apache/druid/blob/master/server/src/main/java/org/apache/druid/server/http/HistoricalResource.java

That StateResourceFilter class is what is responsible for requiring permission to view api routes that peer at druid's internal state. It is not clear to me how k8s could be supplied credentials capable of querying that route.

The second issue is more serious, by putting Probes in that I did not have in my spec, pods just get killed, especially in the case of the task pods, I could not stop them from inheriting a bad configuration, as the overlord definitely needs a readiness probe, and the tasks definitely need a very different one.

cyril-corbon mentioned this issue Aug 22, 2023

fix: readiness probe on historical #98

Merged

4 tasks

itamar-marom added the bug Something isn't working label Aug 22, 2023

cyril-corbon closed this as completed in #98 Aug 23, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Default readiness on historicals breaks operator. #97

Default readiness on historicals breaks operator. #97

mindreader commented Aug 21, 2023

mindreader commented Aug 21, 2023

cyril-corbon commented Aug 22, 2023 •

edited

Loading

mindreader commented Aug 22, 2023 •

edited

Loading

Default readiness on historicals breaks operator. #97

Default readiness on historicals breaks operator. #97

Comments

mindreader commented Aug 21, 2023

mindreader commented Aug 21, 2023

cyril-corbon commented Aug 22, 2023 • edited Loading

mindreader commented Aug 22, 2023 • edited Loading

cyril-corbon commented Aug 22, 2023 •

edited

Loading

mindreader commented Aug 22, 2023 •

edited

Loading