weaveworks/kured: Kubernetes Reboot Daemon

原作者: [db:作者] 来自: 网络收藏邀请

开源软件名称（OpenSource Name）：

weaveworks/kured

开源软件地址(OpenSource Url)：

https://github.com/weaveworks/kured

开源编程语言(OpenSource Language)：

Go 88.6%

开源软件介绍(OpenSource Introduction)：

kured - Kubernetes Reboot Daemon

kured - Kubernetes Reboot Daemon

Introduction

Kured (KUbernetes REboot Daemon) is a Kubernetes daemonset that performs safe automatic node reboots when the need to do so is indicated by the package management system of the underlying OS.

Watches for the presence of a reboot sentinel file e.g. /var/run/reboot-required or the successful run of a sentinel command.
Utilises a lock in the API server to ensure only one node reboots at a time
Optionally defers reboots in the presence of active Prometheus alerts or selected pods
Cordons & drains worker nodes before reboot, uncordoning them after

Kubernetes & OS Compatibility

The daemon image contains versions of k8s.io/client-go and k8s.io/kubectl (the binary of kubectl in older releases) for the purposes of maintaining the lock and draining worker nodes. Kubernetes aims to provide forwards and backwards compatibility of one minor version between client and server:

kured	kubectl	k8s.io/client-go	k8s.io/apimachinery	expected kubernetes compatibility
main	1.23.6	v0.23.6	v0.23.6	1.22.x, 1.23.x, 1.24.x
1.10.1	1.23.6	v0.23.6	v0.23.6	1.22.x, 1.23.x, 1.24.x
1.9.2	1.22.4	v0.22.4	v0.22.4	1.21.x, 1.22.x, 1.23.x
1.8.1	1.21.4	v0.21.4	v0.21.4	1.20.x, 1.21.x, 1.22.x
1.7.0	1.20.5	v0.20.5	v0.20.5	1.19.x, 1.20.x, 1.21.x
1.6.1	1.19.4	v0.19.4	v0.19.4	1.18.x, 1.19.x, 1.20.x
1.5.1	1.18.8	v0.18.8	v0.18.8	1.17.x, 1.18.x, 1.19.x
1.4.4	1.17.7	v0.17.0	v0.17.0	1.16.x, 1.17.x, 1.18.x
1.3.0	1.15.10	v12.0.0	release-1.15	1.15.x, 1.16.x, 1.17.x
1.2.0	1.13.6	v10.0.0	release-1.13	1.12.x, 1.13.x, 1.14.x
1.1.0	1.12.1	v9.0.0	release-1.12	1.11.x, 1.12.x, 1.13.x
1.0.0	1.7.6	v4.0.0	release-1.7	1.6.x, 1.7.x, 1.8.x

See the release notes for specific version compatibility information, including which combination have been formally tested.

Versions >=1.1.0 enter the host mount namespace to invoke systemctl reboot, so should work on any systemd distribution.

Installation

To obtain a default installation without Prometheus alerting interlock or Slack notifications:

latest=$(curl -s https://api.github.com/repos/weaveworks/kured/releases | jq -r .[0].tag_name)
kubectl apply -f "https://github.com/weaveworks/kured/releases/download/$latest/kured-$latest-dockerhub.yaml"

If you want to customise the installation, download the manifest and edit it in accordance with the following section before application.

Configuration

The following arguments can be passed to kured via the daemonset pod template:

Kubernetes Reboot Daemon

Usage:
  kured [flags]

Flags:
      --alert-filter-regexp regexp.Regexp   alert names to ignore when checking for active alerts
      --alert-firing-only                   only consider firing alerts when checking for active alerts
      --annotate-nodes                      if set, the annotations 'weave.works/kured-reboot-in-progress' and 'weave.works/kured-most-recent-reboot-needed' will be given to nodes undergoing kured reboots
      --blocking-pod-selector stringArray   label selector identifying pods whose presence should prevent reboots
      --drain-grace-period int              time in seconds given to each pod to terminate gracefully, if negative, the default value specified in the pod will be used (default -1)
      --drain-timeout duration              timeout after which the drain is aborted (default: 0, infinite time)
      --ds-name string                      name of daemonset on which to place lock (default "kured")
      --ds-namespace string                 namespace containing daemonset on which to place lock (default "kube-system")
      --end-time string                     schedule reboot only before this time of day (default "23:59:59")
      --force-reboot                        force a reboot even if the drain fails or times out
  -h, --help                                help for kured
      --lock-annotation string              annotation in which to record locking node (default "weave.works/kured-node-lock")
      --lock-release-delay duration         delay lock release for this duration (default: 0, disabled)
      --lock-ttl duration                   expire lock annotation after this duration (default: 0, disabled)
      --log-format string                   use text or json log format (default "text")
      --message-template-drain string       message template used to notify about a node being drained (default "Draining node %s")
      --message-template-reboot string      message template used to notify about a node being rebooted (default "Rebooting node %s")
      --message-template-uncordon string    message template used to notify about a node being successfully uncordoned (default "Node %s rebooted & uncordoned successfully!")
      --node-id string                      node name kured runs on, should be passed down from spec.nodeName via KURED_NODE_ID environment variable
      --notify-url string                   notify URL for reboot notifications (cannot use with --slack-hook-url flags)
      --period duration                     sentinel check period (default 1h0m0s)
      --post-reboot-node-labels strings     labels to add to nodes after uncordoning
      --pre-reboot-node-labels strings      labels to add to nodes before cordoning
      --prefer-no-schedule-taint string     Taint name applied during pending node reboot (to prevent receiving additional pods from other rebooting nodes). Disabled by default. Set e.g. to "weave.works/kured-node-reboot" to enable tainting.
      --prometheus-url string               Prometheus instance to probe for active alerts
      --reboot-command string               command to run when a reboot is required (default "/bin/systemctl reboot")
      --reboot-days strings                 schedule reboot on these days (default [su,mo,tu,we,th,fr,sa])
      --reboot-delay duration               delay reboot for this duration (default: 0, disabled)
      --reboot-sentinel string              path to file whose existence triggers the reboot command (default "/var/run/reboot-required")
      --reboot-sentinel-command string      command for which a zero return code will trigger a reboot command
      --skip-wait-for-delete-timeout int    when seconds is greater than zero, skip waiting for the pods whose deletion timestamp is older than N seconds while draining a node
      --slack-channel string                slack channel for reboot notifications
      --slack-hook-url string               slack hook URL for reboot notifications [deprecated in favor of --notify-url]
      --slack-username string               slack username for reboot notifications (default "kured")
      --start-time string                   schedule reboot only after this time of day (default "0:00")
      --time-zone string                    use this timezone for schedule inputs (default "UTC")

Reboot Sentinel File & Period

By default kured checks for the existence of /var/run/reboot-required every sixty minutes; you can override these values with --reboot-sentinel and --period. Each replica of the daemon uses a random offset derived from the period on startup so that nodes don't all contend for the lock simultaneously.

Reboot Sentinel Command

Alternatively, a reboot sentinel command can be used. If a reboot sentinel command is used, the reboot sentinel file presence will be ignored. When the command exits with code 0, kured will assume that a reboot is required.

For example, if you're using RHEL or its derivatives, you can set the sentinel command to sh -c "! needs-restarting --reboothint" (by default the command will return 1 if a reboot is required, so we wrap it in sh -c and add ! to negate the return value).

configuration:
  rebootSentinelCommand: sh -c "! needs-restarting --reboothint"

Setting a schedule

By default, kured will reboot any time it detects the sentinel, but this may cause reboots during odd hours. While service disruption does not normally occur, anything is possible and operators may want to restrict reboots to predictable schedules. Use --reboot-days, --start-time, --end-time, and --time-zone to set a schedule. For example, business hours on the west coast USA can be specified with:

  --reboot-days=mon,tue,wed,thu,fri
  --start-time=9am
  --end-time=5pm
  --time-zone=America/Los_Angeles

Times can be formatted in numerous ways, including 5pm, 5:00pm 17:00, and 17. --time-zone represents a Go time.Location, and can be UTC, Local, or any entry in the standard Linux tz database.

Note that when using smaller time windows, you should consider shortening the sentinel check period (--period).

Blocking Reboots via Alerts

You may find it desirable to block automatic node reboots when there are active alerts - you can do so by providing the URL of your Prometheus server:

--prometheus-url=http://prometheus.monitoring.svc.cluster.local

By default the presence of any active (pending or firing) alerts will block reboots, however you can ignore specific alerts:

--alert-filter-regexp=^(RebootRequired|AnotherBenignAlert|...$

You can also only block reboots for firing alerts:

--alert-firing-only=true

See the section on Prometheus metrics for an important application of this filter.

Blocking Reboots via Pods

You can also block reboots of an individual node when specific pods are scheduled on it:

--blocking-pod-selector=runtime=long,cost=expensive

Since label selector strings use commas to express logical 'and', you can specify this parameter multiple times for 'or':

--blocking-pod-selector=runtime=long,cost=expensive
--blocking-pod-selector=name=temperamental

In this case, the presence of either an (appropriately labelled) expensive long running job or a known temperamental pod on a node will stop it rebooting.

Try not to abuse this mechanism - it's better to strive for restartability where possible. If you do use it, make sure you set up a RebootRequired alert as described in the next section so that you can intervene manually if reboots are blocked for too long.

Adding node labels before and after reboots

If you need to add node labels before and after the reboot process, you can use --pre-reboot-node-labels and --post-reboot-node-labels:

      --pre-reboot-node-labels=zalando=notready
      --post-reboot-node-labels=zalando=ready

Labels can be comma-delimited (e.g. --pre-reboot-node-labels=zalando=notready,thisnode=disabled) or you can supply the flags multiple times.

Note that label keys specified by these two flags should match. If they do not match, a warning will be generated.

Prometheus Metrics

Each kured pod exposes a single gauge metric (:8080/metrics) that indicates the presence of the sentinel file:

# HELP kured_reboot_required OS requires reboot due to software updates.
# TYPE kured_reboot_required gauge
kured_reboot_required{node="ip-xxx-xxx-xxx-xxx.ec2.internal"} 0

The purpose of this metric is to power an alert which will summon an operator if the cluster cannot reboot itself automatically for a prolonged period:

# Alert if a reboot is required for any machines. Acts as a failsafe for the
# reboot daemon, which will not reboot nodes if there are pending alerts save
# this one.
ALERT RebootRequired
  IF          max(kured_reboot_required) != 0
  FOR         24h
  LABELS      { severity="warning" }
  ANNOTATIONS {
    summary = "Machine(s) require being rebooted, and the reboot daemon has failed to do so for 24 hours",
    impact = "Cluster nodes more vulnerable to security exploits. Eventually, no disk space left.",
    description = "Machine(s) require being rebooted, probably due to kernel update.",
  }

If you choose to employ such an alert and have configured kured to probe for active alerts before rebooting, be sure to specify --alert-filter-regexp=^RebootRequired$ to avoid deadlock!

Notifications

When you specify a formatted URL using --notify-url, kured will notify about draining and rebooting nodes across a list of technologies.

Alternatively you can use the --message-template-drain, --message-template-reboot and --message-template-uncordon to customize the text of the message, e.g.

--message-template-drain="Draining node %s part of *my-cluster* in region *xyz*"

Here is the syntax:

slack: slack://tokenA/tokenB/tokenC

(slack://<USERNAME>@tokenA/tokenB/tokenC - in case you want to respect username)

(--slack-hook-url is deprecated but possible to use)

For the new slack App integration, use:
slack://xoxb:123456789012-1234567890123-4mt0t4l1YL3g1T5L4cK70k3N@<CHANNEL_NAME>?botname=<BOTNAME>
for more information, look here
rocketchat: rocketchat://[username@]rocketchat-host/token[/channel|@recipient]
teams: teams://group@tenant/altId/groupOwner?host=organization.webhook.office.com
Email: smtp://username:password@host:port/?fromAddress=fromAddress&toAddresses=recipient1[,recipient2,...]

More details here: containrrr.dev/shoutrrr/v0.5/services/overview

Overriding Lock Configuration

The --ds-name and --ds-namespace arguments should match the name and namespace of the daemonset used to deploy the reboot daemon - the locking is implemented by means of an annotation on this resource. The defaults match the daemonset YAML provided in the repository.

Similarly --lock-annotation can be used to change the name of the annotation kured will use to store the lock, but the default is almost certainly safe.

Operation

The example commands in this section assume that you have not overriden the default lock annotation, daemonset name or namespace; if you have, you will have to adjust the commands accordingly.

Testing

You can test your configuration by provoking a reboot on a node:

sudo touch /var/run/reboot-required

Disabling Reboots

If you need to temporarily stop kured from rebooting any nodes, you can take the lock manually:

kubectl -n kube-system annotate ds kured weave.works/kured-node-lock='{"nodeID":"manual"}'

Don't forget to release it afterwards!

Manual Unlock

In exceptional circumstances, such as a node experiencing a permanent failure whilst rebooting, manual intervention may be required to remove the cluster lock:

kubectl -n kube-system annotate ds kured weave.works/kured-node-lock-

NB the - at the end of the command is important - it instructs kubectl to remove that annotation entirely.

Automatic Unlock

In exceptional circumstances (especially when used with cluster-autoscaler) a node which holds lock might be killed thus annotation will stay there for ever.

Using --lock-ttl=30m will allow other nodes to take over if TTL has expired (in this case 30min) and continue reboot process.

Delaying Lock Release

Using --lock-release-delay=30m will cause nodes to hold the lock for the specified time frame (in this case 30min) before it is released and the reboot process continues. This can be used to throttle reboots across the cluster.

Building

Kured now uses Go Modules, so build instructions vary depending on where you have checked out the repository:

Building outside $GOPATH:

make

Building inside $GOPATH:

GO111MODULE=on make

You can find the current preferred version of Golang in the go.mod file.

If you are interested in contributing code to kured, please take a look at our development docs.

Frequently Asked/Anticipated Questions

Why is there no `latest` tag on Docker Hub?

Use of latest for production deployments is bad practice - see here for details. The manifest on main refers to latest for local development testing with minikube only; for production use choose a versioned manifest from the release page.

Getting Help

If you have any questions about, feedback for or problems with kured:

Invite yourself to the Weave Users Slack.
Ask a question on the #kured slack channel.
File an issue.
Join us in our monthly meeting, every fourth Wednesday of the month at 16:00 UTC.

We follow the CNCF Code of Conduct.

Your feedback is always welcome!

鲜花

握手

雷人

路过

鸡蛋

该文章已有0人参与评论

请发表评论

全部评论

专题导读

More+

10-27 六六分期app的软件客服如何联系？(六六分期

11-06 可心卡盟:win10系统火狐flash插件崩溃怎么

11-06 亲亲特价:怎么删除回收站图标

11-06 济南大学虚拟社区:鲁大师节能降温的具体办

11-06 xlueops.exe:无线网络安装向导

11-06 女斗合众国:win7系统cf与主机连接不稳定怎

11-06 0xc000022-[cf烟雾头]cf怎么调烟雾头

11-06 qizideyouhuo:应用程序无法正常启动0xc0000

11-06 ipz-185:win7系统vcf文件怎么打开

11-06 傻哥蹦迪:win10系统s4怎么打开usb调试

11-06 八神浩树gtaste:回收站清空了怎么恢复

11-06 妖尾之黑色守护:win10系统电脑没有1440x900

11-06 校园至尊魔王小说:win7系统浏览网页时字体

11-06 女斗合众国:win10系统访问共享文件夹提示请

11-06 tokyo hot n0654:恢复win7系统默认字体一招

11-06 雨酷仙境:设置win7系统转移临时文件夹腾出

11-06 阿穆纳伊之杖:win7系统开始菜单在右边还原

11-06 tunespotting:win10系统火狐flash插件总是

11-06 甘尔葛分析师：计谋网站seo关键词暴涨有什

11-06 蔡贵霖: 计谋网站seo关键词暴涨有什么秘密

11-06 博益网首页:ao3网页版进入不了解决方法

11-06 漏斗子专栏: 网站数据分析小白易懂精华篇

11-06 见证双虹怎么做:win7系统开启telnet命令的

11-06 颾狐蝶蜋:系统资源不足无法完成请求的服务

11-06 国光中学校歌:提交网站到alexa查询详细步骤

11-06 西安有情天:静态网页和动态网页的区别

11-06 红木雅尚斋:外部链接构造对网站的好处

11-06 前官礼遇：防止域名劫持–增强域安全性的10

11-06 密传二转答案: 中文分词算法有哪些

11-06 金泉家园邮编:百度快照劫持的表现及应对方

pires/kubernetes-elasticsearch-cluster: Elasticsearch cluster on top of Kubernet ...发布时间：2022-08-13

devtron-labs/devtron: Tool integration platform for Kubernetes发布时间：2022-08-13

剪的笔顺,诠释剪的笔画,认识剪的部首

1 六六分期app的软件客服如何联系？(六六分期

六六分期app的软件客服如何联系？不知道吗？加qq群【895510560】即可！标题：六六分期

阅读：19587|2023-10-27

2 可心卡盟:win10系统火狐flash插件崩溃怎么

今天小编告诉大家如何处理win10系统火狐flash插件总是崩溃的问题，可能很多用户都不知

阅读：10092|2022-11-06

3 亲亲特价:怎么删除回收站图标

今天小编告诉大家如何对win10系统删除桌面回收站图标进行设置，可能很多用户都不知道

阅读：8397|2022-11-06

4 济南大学虚拟社区:鲁大师节能降温的具体办

今天小编告诉大家如何对win10系统电脑设置节能降温的设置方法，想必大家都遇到过需要

阅读：8751|2022-11-06

5 xlueops.exe:无线网络安装向导

我们在使用xp系统的过程中,经常需要对xp系统无线网络安装向导设置进行设置，可能很多

阅读：8704|2022-11-06

6 女斗合众国:win7系统cf与主机连接不稳定怎

今天小编告诉大家如何处理win7系统玩cf老是与主机连接不稳定的问题，可能很多用户都不

阅读：9745|2022-11-06

7 0xc000022-[cf烟雾头]cf怎么调烟雾头

电脑对日常生活的重要性小编就不多说了，可是一旦碰到win7系统设置cf烟雾头的问题，很

阅读：8692|2022-11-06

8 qizideyouhuo:应用程序无法正常启动0xc0000

我们在日常使用电脑的时候，有的小伙伴们可能在打开应用的时候会遇见提示应用程序无法

阅读：8057|2022-11-06

9 ipz-185:win7系统vcf文件怎么打开

今天小编告诉大家如何对win7系统打开vcf文件进行设置，可能很多用户都不知道怎么对win

阅读：8737|2022-11-06

10 傻哥蹦迪:win10系统s4怎么打开usb调试

今天小编告诉大家如何对win10系统s4开启USB调试模式进行设置，可能很多用户都不知道怎

阅读：7593|2022-11-06

客服电话

电子邮件

weaveworks/kured: Kubernetes Reboot Daemon

开源软件名称（OpenSource Name）：

开源软件地址(OpenSource Url)：

开源编程语言(OpenSource Language)：

开源软件介绍(OpenSource Introduction)：

kured - Kubernetes Reboot Daemon

Introduction

Kubernetes & OS Compatibility

Installation

Configuration

Reboot Sentinel File & Period

Reboot Sentinel Command

Setting a schedule

Blocking Reboots via Alerts

Blocking Reboots via Pods

Adding node labels before and after reboots

Prometheus Metrics

Notifications

Overriding Lock Configuration

Operation

Testing

Disabling Reboots

Manual Unlock

Automatic Unlock

Delaying Lock Release

Building

Frequently Asked/Anticipated Questions

Why is there no latest tag on Docker Hub?

Getting Help

请发表评论

全部评论

上一篇：

下一篇：

krishnaik06/Machine-Learning-in-90-days

微信小程序上拉加载和下拉刷新（wepy）

armancodv/building-energy-model-matlab:

美元符号为什么是“$”

FGRibreau/import-tweets-to-mastodon: How

剪的笔顺,诠释剪的笔画,认识剪的部首

六六分期app的软件客服如何联系？(六六分期

florent37/ViewAnimator: A fluent Android

florent37/Shrine-MaterialDesign2: implem

CVE-2020-36276

SimpleSoftwareIO/simple-sms: Send and re

关于我们

产品与服务

解决方案

139-2527-9053

Why is there no `latest` tag on Docker Hub?