I have implemented an HTTP health check and a separate HTTP liveness check for my pod.
For both, I see that Kubernetes behaves as expected when my pod delays before responding.
However, when the checks respond immediately with status 500, Kubernetes treats that as a successful probe.
This happens after the pod is up and running OK, before the checks start returning status 500.
In fact, I see that returning status 500 actually resets the failure count, so it caused my pod to be treated as healthy again.
My question: am I doing something wrong?
How do I get Kubernetes to react when my pod is unhealthy?
$ k version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.8", GitCommit:"9f2892aab98fe339f3bd70e3c470144299398ace", GitTreeState:"clean", BuildDate:"2020-08-13T16:12:48Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-18T12:00:47Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
To investigate this problem, I added test endpoints to my pod so that I can change its behaviour at runtime: pass (return 200), fail (return 500), delay fail (wait 15 seconds, then return 500).
I also separated the health and liveness endpoints.
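(The app itself is Node/Express, but for illustration, a stand-in for the toggleable endpoints could look like the sketch below — Python stdlib only, route names taken from this question, everything else hypothetical.)

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

# Mutable flags mirroring the question's failHealth / delayFailHealth toggles.
STATE = {"failHealth": False, "delayFailHealth": False}

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/failhealth":
            STATE["failHealth"], STATE["delayFailHealth"] = True, False
            self._reply(200, "failhealth: force all health checks to fail")
        elif self.path == "/unfailhealth":
            STATE["failHealth"] = STATE["delayFailHealth"] = False
            self._reply(200, "unfailhealth: health checks pass again")
        elif self.path == "/delayfailhealth":
            STATE["failHealth"], STATE["delayFailHealth"] = False, True
            self._reply(200, "delayfailhealth: sleep 15 sec, then fail")
        elif self.path in ("/healthz", "/livez"):
            if STATE["delayFailHealth"]:
                time.sleep(15)  # long enough to trip the probe's 1 s timeout
                self._reply(500, "FAKE HEALTH CHECK FAILURE - AFTER 15 SEC DELAY")
            elif STATE["failHealth"]:
                self._reply(500, "FAKE HEALTH CHECK FAILURE")
            else:
                self._reply(200, "OK")
        else:
            self._reply(404, "not found")

    def _reply(self, status, body):
        data = body.encode()
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, *args):
        pass  # keep output quiet

def serve(port=30030):
    """Start the test server on a background thread and return it."""
    srv = HTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=srv.serve_forever, daemon=True).start()
    return srv
```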
From kubectl describe pod:
Liveness: exec [curl http://localhost:30030/livez] delay=10s timeout=1s period=10s #success=1 #failure=5
Readiness: exec [curl http://localhost:30030/healthz] delay=10s timeout=1s period=10s #success=1 #failure=3
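(Note that these are exec probes that run curl, not httpGet probes. For comparison, a kubelet httpGet probe counts any status from 200 up to but excluding 400 as success; a sketch of that criterion, with a hypothetical helper name, assuming Python:)

```python
import urllib.request
import urllib.error

def http_get_probe(url, timeout=1.0):
    """Mimic kubelet's httpGet probe: success iff 200 <= status < 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the probe just sees a failing status.
        return 200 <= err.code < 400
    except OSError:
        # Connection refused or timed out -> probe failure.
        return False
```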
I tested the endpoints by exec'ing into the pod and curling them from there (details below).
Then I cycled both the liveness check and the health check through the three modes and monitored Kubernetes' response.
Health check: expect the pod to be restarted after the health check fails 5 times in a row.
Liveness check: describe the Service and expect the pod's IP address to be removed from its list of endpoints.
Success case:
bash-4.4$ curl http://localhost:30030/unfailhealth
unfailhealth: REMOVE force all health checks to fail, was failHealth=false, delayFailHealth=false
bash-4.4$ curl http://localhost:30030/healthz -v
* Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 30030 (#0)
> GET /healthz HTTP/1.1
> Host: localhost:30030
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 200 OK
< X-Powered-By: Express
< Content-Type: text/html; charset=utf-8
< Content-Length: 3
< ETag: W/"3-CftlTBfMBbEe9TvTWqcB9tVQ6OE"
< Date: Fri, 05 Feb 2021 13:30:59 GMT
< Connection: keep-alive
< Keep-Alive: timeout=5
<
OK
* Connection #0 to host localhost left intact
Failure case:
bash-4.4$ curl http://localhost:30030/failhealth
failhealth: force all health checks to fail, was failHealth=true, delayFailHealth=false
bash-4.4$ curl http://localhost:30030/healthz -v
* Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 30030 (#0)
> GET /healthz HTTP/1.1
> Host: localhost:30030
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 500 Internal Server Error
< X-Powered-By: Express
< Content-Type: text/html; charset=utf-8
< Content-Length: 26
< ETag: W/"1a-yI5D4Rtao1KH34GZVYKKvxZoEVo"
< Date: Fri, 05 Feb 2021 13:29:14 GMT
< Connection: keep-alive
< Keep-Alive: timeout=5
<
FAKE HEALTH CHECK FAILURE
* Connection #0 to host localhost left intact
Delayed failure case:
bash-4.4$ curl http://localhost:30030/delayfailhealth
delayfailhealth: force all health checks to sleep 15sec, then fail, was failHealth=false, delayFailHealth=true
bash-4.4$ date; curl http://localhost:30030/healthz -v
Fri Feb 5 13:33:08 UTC 2021
* Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 30030 (#0)
> GET /healthz HTTP/1.1
> Host: localhost:30030
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 500 Internal Server Error
< X-Powered-By: Express
< Content-Type: text/html; charset=utf-8
< Content-Length: 47
< ETag: W/"2f-n+Ix8oU/09OT9+cpPVm1/EejE9Y"
< Date: Fri, 05 Feb 2021 13:33:23 GMT
< Connection: keep-alive
< Keep-Alive: timeout=5
<
FAKE HEALTH CHECK FAILURE - AFTER 15 SEC DELAY
* Connection #0 to host localhost left intact
Test Results
Default to SUCCESS for both health and liveness endpoints (return status 200) -> pod starts and works OK.
Set liveness check to FAIL (return status 500 immediately) -> no change; pod IP still in the Service, requests still dispatched to the pod.
Set liveness check to DELAY before responding (then 500) -> pod is removed from the Kubernetes Service (yippee).
Set liveness check to FAIL (quickly) again -> pod is restored to the Service (treated like success).
Set health check to FAIL (return status 500) -> no effect; pod continues without restart.
Set health check to DELAY before responding (then 500) -> pod is restarted after 5 failed probes.
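(One possible reading of these results: an exec probe succeeds iff the command exits 0 within the probe timeout, and plain curl exits 0 for any completed HTTP exchange, even a 500, unless it is given --fail. A 15-second delay, by contrast, trips the 1-second timeout. A minimal sketch of that criterion, with a hypothetical helper name, assuming Python:)

```python
import subprocess

def exec_probe(cmd, timeout=1.0):
    """Mimic kubelet's exec probe: success iff cmd exits 0 within timeout."""
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        # kubelet kills the probe command and counts a failure.
        return False

# `curl URL` exits 0 even when the server answers 500, so an immediate
# 500 looks like exit 0 (probe success) here; only `curl -f URL`
# (exit 22 on status >= 400) or hitting the timeout produces a failure.
```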
Thanks for any help with this. I guess I could change my code to delay before responding in the failure case, but that seems like a workaround.
question from:
https://stackoverflow.com/questions/66064514/kubernetes-http-health-check-not-working-as-expected-500-response-is-ignored