Kubernetes debugging
By Aravind
Table of contents
- Set alias for long commands
- Check which file is consuming the most space
- Delete all Evicted pods
- Advertise out of a specific interface
- Label a node
- Join a cluster
- Starting cluster
- Uninstall k8s
- Install k8s
- Renew certificates
Set alias for long commands
alias k=kubectl
Try it out!
root@k8s-master:~/cka_practice/cert# k get pods
NAME READY STATUS RESTARTS AGE
virt-launcher-ubuntu-kv1-mgvvz 1/1 Running 0 18d
virt-launcher-vsrx-sriov-qrfxm 2/2 Running 0 19d
Check which file is consuming the most space
du -h <dir> 2>/dev/null | grep '[0-9\.]\+G'
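A variant that lists the largest individual files and directories instead of everything over a gigabyte (a sketch assuming GNU du and sort with human-readable sorting support):
du -ah <dir> 2>/dev/null | sort -rh | head -20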
Delete all Evicted pods
This can be used when there are hundreds of evicted pods and you want to delete all of them.
kubectl get pods | grep Evicted | awk '{print $1}' | xargs kubectl delete pod
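If the evicted pods are spread across namespaces, a hedged variant of the same pipeline (relying on the column order of kubectl get pods -A) is:
kubectl get pods -A | grep Evicted | awk '{print $2, "-n", $1}' | xargs -L1 kubectl delete pod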
Advertise out of a specific interface
Typically, kubeadm init picks the interface that carries the default route when creating the cluster. You can advertise the API server on a specific interface instead by passing the flag below to kubeadm init:
--apiserver-advertise-address=<the eth1 IP address>
Example
kubeadm init --apiserver-advertise-address=192.169.1.11 --pod-network-cidr=10.244.0.0/16
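One way to double-check which address the API server ended up advertising (a sketch; the endpoints of the built-in kubernetes Service reflect the advertise address):
kubectl get endpoints kubernetes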
Label a node
Sometimes pods may not get scheduled because a node is not labeled as master, or because node selectors are in use and the expected label is mismatched or missing. You can label the node as shown below.
Find labels
root@master:~# kubectl get nodes --show-labels
NAME STATUS ROLES AGE VERSION LABELS
master Ready control-plane 2m43s v1.25.3 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=master,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node.kubernetes.io/exclude-from-external-load-balancers=
worker1 Ready <none> 116s v1.25.3 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=worker1,kubernetes.io/os=linux
worker2 Ready <none> 72s v1.25.3 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=worker2,kubernetes.io/os=linux
Label a node
root@master:~# kubectl label nodes master node-role.kubernetes.io/master=
node/master labeled
root@master:~# kubectl label nodes worker1 node-role.kubernetes.io/worker=
node/worker1 labeled
root@master:~# kubectl label nodes worker2 node-role.kubernetes.io/worker=
Verify
root@master:~# kubectl get nodes
NAME STATUS ROLES AGE VERSION
master Ready control-plane,master 4m26s v1.25.3
worker1 Ready worker 3m39s v1.25.3
worker2 Ready worker 2m55s v1.25.3
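For the node-selector case mentioned above, a minimal sketch of consuming the worker label (the pod name nginx-on-worker and the nginx image are just placeholders):
kubectl run nginx-on-worker --image=nginx --overrides='{"apiVersion":"v1","spec":{"nodeSelector":{"node-role.kubernetes.io/worker":""}}}'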
Join a cluster
When a cluster is already created and you want to join a new node, you need a bootstrap token.
Obtain the token
$ kubeadm token list
TOKEN TTL EXPIRES USAGES DESCRIPTION EXTRA GROUPS
jt5hul.iy8150scfrqvf3l3 1h 2022-11-17T19:29:31Z authentication,signing The default bootstrap token generated by 'kubeadm init'. system:bootstrappers:kubeadm:default-node-token
Create a token and print the join command
$ kubeadm token create --print-join-command
kubeadm join 192.168.2.12:6443 --token nmlxzk.kt5gr2e5pgi8nb4n --discovery-token-ca-cert-hash sha256:3a032c12536c479c959aec0ee2d300a8847d9d0a75f6bd9892be1d61c6afbdb5
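If the default 24h TTL does not suit, the TTL can be set explicitly when creating the token (sketch):
kubeadm token create --ttl 2h --print-join-command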
Use the token on new worker node
user@newworker:~$ sudo kubeadm join 192.168.2.12:6443 --token nmlxzk.kt5gr2e5pgi8nb4n --discovery-token-ca-cert-hash sha256:3a032c12536c479c959aec0ee2d300a8847d9d0a75f6bd9892be1d61c6afbdb5
[preflight] Running pre-flight checks
[WARNING SystemVerification]: missing optional cgroups: blkio
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.
Run 'kubectl get nodes' on the control-plane to see this node join the cluster.
Verify
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
master NotReady control-plane 22h v1.25.4
worker1 NotReady <none> 22h v1.25.4
worker2 NotReady <none> 6m18s v1.25.4
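Nodes typically stay NotReady like this until a CNI plugin (Flannel, in the install section below) is applied. To see the exact reason (a sketch, using worker2 from the output above):
kubectl describe node worker2 | grep -i ready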
Starting cluster
To start using your cluster, you need to run the following as a regular user:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
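Alternatively, if you are running as root, exporting the admin kubeconfig also works:
export KUBECONFIG=/etc/kubernetes/admin.conf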
Error while dialing dial unix /var/run/dockershim.sock
When a k8s cluster is brought up, there are situations where you might see the error "Error while dialing dial unix /var/run/dockershim.sock", as shown below:
root@jcnr2:~# crictl ps
WARN[0000] runtime connect using default endpoints: [unix:///var/run/dockershim.sock unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead.
WARN[0000] image connect using default endpoints: [unix:///var/run/dockershim.sock unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead.
E0428 19:02:30.325516 24230 remote_runtime.go:390] "ListContainers with filter from runtime service failed" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /var/run/dockershim.sock: connect: no such file or directory\"" filter="&ContainerFilter{Id:,State:&ContainerStateValue{State:CONT
To solve this, ensure the following is present in the file /etc/crictl.yaml:
root@vm:~# more /etc/crictl.yaml
runtime-endpoint: unix:///run/containerd/containerd.sock
image-endpoint: unix:///run/containerd/containerd.sock
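Depending on the crictl version, the same endpoints can be written by crictl itself instead of editing the file by hand (a sketch, not verified on every release):
sudo crictl config --set runtime-endpoint=unix:///run/containerd/containerd.sock --set image-endpoint=unix:///run/containerd/containerd.sock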
Restart service
sudo systemctl restart containerd
Verify
root@jcnr2:~# crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD
kubelet failing
root@vm:~# service kubelet status
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset:>
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: activating (auto-restart) (Result: exit-code) since Fri 2023-04-28 1>
Docs: https://kubernetes.io/docs/home/
Process: 25125 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_C>
Main PID: 25125 (code=exited, status=1/FAILURE)
Notice that kubelet is not running, and hence kubeadm init fails. To fix this, the following sequence sometimes helps:
sudo kubeadm reset
sudo kubeadm init phase certs all
sudo kubeadm init phase kubeconfig all
sudo kubeadm init phase control-plane all --pod-network-cidr 10.244.0.0/16
sudo sed -i 's/initialDelaySeconds: [0-9][0-9]/initialDelaySeconds: 240/g' /etc/kubernetes/manifests/kube-apiserver.yaml
sudo sed -i 's/failureThreshold: [0-9]/failureThreshold: 18/g' /etc/kubernetes/manifests/kube-apiserver.yaml
sudo sed -i 's/timeoutSeconds: [0-9][0-9]/timeoutSeconds: 20/g' /etc/kubernetes/manifests/kube-apiserver.yaml
sudo kubeadm init --v=1 --skip-phases=certs,kubeconfig,control-plane --ignore-preflight-errors=all --pod-network-cidr 10.244.0.0/16
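Before and after the reset, kubelet's own logs usually point at the actual failure (assuming a systemd-based host with journald):
journalctl -u kubelet --no-pager | tail -50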
Load images into the container runtime
Images to be loaded should be in .tar format, so when the files are .tgz we need to gunzip them first.
root@jcnr2:~# gunzip crpd.tgz
root@jcnr2:~# ls -l crpd.tar
-rw-r--r-- 1 root root 507279360 Apr 28 19:26 crpd.tar
root@jcnr2:~# ctr -n=k8s.io image import crpd.tar
unpacking docker.io/library/crpd:23.1R1.8 (sha256:bb82530036904d12f19bc2036a3734450712014780e3d27b8de841929a16fc97)...done
root@jcnr2:~# crictl images | grep crpd
docker.io/library/crpd 23.1R1.8 a1748707249d3 507MB
Pods not scheduled
This could be because the node is tainted (for example, with the control-plane taint). Remove the taint to allow pod scheduling, since this is meant to be an all-in-one cluster where the master node itself runs pods.
kubectl taint nodes --all node-role.kubernetes.io/control-plane-
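To confirm the taint is gone (sketch):
kubectl describe nodes | grep -i taint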
Master node labeling
kubectl label node ubuntu node-role.kubernetes.io/master=
Make file issues when compiling
Error when compiling on Ubuntu using make:
strip jcnr
make: strip: Command not found
make: *** [Makefile:38: docker-images] Error 127
To fix this, install binutils (which provides strip) along with make:
apt install binutils
Connection refused when using kubectl commands
We might see a connection-refused error when running commands such as kubectl get pods:
[root@testvm user]# kubectl get pods -A
The connection to the server localhost:8080 was refused - did you specify the right host or port?
In such cases, there could be multiple reasons.
Verify kubectl config
[root@testvm user]# kubectl config view
apiVersion: v1
clusters: null
contexts: null
current-context: ""
kind: Config
preferences: {}
users: null
This output is empty, which is wrong; it should contain cluster, context, and user entries.
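While the kubeconfig is being fixed, a quick way to confirm the API server itself is reachable is to point kubectl directly at the admin config (sketch):
sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get nodes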
Verify if pods are running
[root@testvm03 test]# crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD
a421932c4e18f 28d55f91d3d8f7f6bd66168eed2cfd72b448be5e7807055c05a77499ce5c0674 3 minutes ago Running kube-proxy 0 5503fb3113a7a kube-proxy-2xz8r
d43febcc6a688 3ea2571fcc83d8e2cb02bff3da18165a38799e59c78cbe845cd807631f0c5cc3 4 minutes ago Running kube-controller-manager 0 1c7c2ed2f969b kube-controller-manager-testvm.ocpvm2.net
578e38437b299 165df46c1bb9b79c9b441ac039ac408ed6788164404dad56f966c810dc61f05a 4 minutes ago Running kube-scheduler 1 a705545a02333 kube-scheduler-testvm.ocpvm2.net
e50fdc53512fc dc245db8c2faecaeac427ebcdf308ebe2c60e40728bf4f45f33d857ef3179969 4 minutes ago Running kube-apiserver 1 18e27e12fa20e kube-apiserver-testvm.ocpvm2.net
4d1213b151d2e 4694d02f8e611efdffe9fb83a86d9d2969ef57b4b56622388eca6627287d6fd6 4 minutes ago Running etcd 1 065ab508a642c etcd-testvm.ocpvm2.net
This shows that the pods are running correctly and the kubeconfig context is the issue. This can happen when the config file is not placed correctly.
- Verify that the config file under ~/.kube/config is correct.
- If it has content and you brought up the cluster with sudo, verify the same file is present under /root/.kube; if not, copy it:
cp -r /home/test/.kube /root/
Validate the above using
[root@testvm03 test ~]# kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-565d847f94-d6r69 0/1 ContainerCreating 0 8m37s
kube-system coredns-565d847f94-sbmt8 0/1 ContainerCreating 0 8m37s
kube-system etcd-testvm.ocpvm2.net 1/1 Running 1 8m51s
kube-system kube-apiserver-testvm.ocpvm2.net 1/1 Running 1 8m53s
kube-system kube-controller-manager-testvm.ocpvm2.net 1/1 Running 0 8m51s
kube-system kube-proxy-2xz8r 1/1 Running 0 8m37s
kube-system kube-scheduler-testvm.ocpvm2.net 1/1 Running 1 8m51s
Multiple runtimes found
If you hit the multiple-CRI-endpoints error below, you need to tell kubeadm which runtime socket to use:
[user@testvm images]$ sudo kubeadm init --pod-network-cidr=10.244.0.0/16
Found multiple CRI endpoints on the host. Please define which one do you wish to use by setting the 'criSocket' field in the kubeadm configuration file: unix:///var/run/containerd/containerd.sock, unix:///var/run/crio/crio.sock
To see the stack trace of this error execute with --v=5 or higher
This is because both the CRI-O and containerd sockets are present on the host, and we would need to uninstall one of them. Alternatively, point kubeadm to the exact runtime we want to use:
sudo kubeadm init --pod-network-cidr=10.244.0.0/16 --cri-socket /var/run/containerd/containerd.sock
The --cri-socket flag should also be passed to kubeadm reset if the cluster is running on a non-default socket.
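For example (a sketch, assuming containerd is the runtime to keep):
sudo kubeadm reset --cri-socket unix:///var/run/containerd/containerd.sock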
Images not present in crictl even though loaded successfully
This happens when the container runtime is not recognized correctly. Ensure containerd is referenced consistently everywhere. Follow the steps below if you need to migrate from one container runtime to another (docker -> containerd or crio -> containerd).
Migrating to a different container run time
Drain the node
[user@testvm images]$ kubectl drain testvm.ocpvm2.net --ignore-daemonsets
node/testvm.ocpvm2.net cordoned
Warning: ignoring DaemonSet-managed Pods: kube-flannel/kube-flannel-ds-c5v45, kube-system/kube-multus-ds-tnjhl, kube-system/kube-proxy-2xz8r
evicting pod kube-system/coredns-565d847f94-w7snq
evicting pod kube-system/coredns-565d847f94-sbmt8
pod/coredns-565d847f94-w7snq evicted
pod/coredns-565d847f94-sbmt8 evicted
node/testvm.ocpvm2.net drained
Stop kubelet
[user@testvm images]$ systemctl stop kubelet
Edit node file
[user@testvm images]$ sudo kubectl edit no testvm.ocpvm2.net
node/testvm.ocpvm2.net edited
Change the value of the kubeadm.alpha.kubernetes.io/cri-socket annotation to unix:///run/containerd/containerd.sock, then save and close.
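The same change can be made non-interactively with kubectl annotate (a sketch, using the node name from this example):
kubectl annotate node testvm.ocpvm2.net --overwrite kubeadm.alpha.kubernetes.io/cri-socket=unix:///run/containerd/containerd.sock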
Verify that the node is using the new runtime
[user@testvm images]$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
testvm.ocpvm2.net Ready,SchedulingDisabled control-plane 16h v1.25.4 192.168.2.16 <none> Red Hat Enterprise Linux 8.7 (Ootpa) 4.18.0-425.19.2.el8_7.x86_64 containerd://1.6.21
[user@testvm ~]$ sudo kubectl get pods -A
[sudo] password for user:
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-565d847f94-9wb5z 0/1 Pending 0 6m37s
kube-system coredns-565d847f94-gmqkg 0/1 Pending 0 6m37s
kube-system etcd-testvm.ocpvm2.net 1/1 Running 8 6m50s
kube-system kube-apiserver-testvm.ocpvm2.net 1/1 Running 8 6m50s
kube-system kube-controller-manager-testvm.ocpvm2.net 1/1 Running 28 6m49s
kube-system kube-proxy-t66jp 1/1 Running 0 6m37s
kube-system kube-scheduler-testvm.ocpvm2.net 1/1 Running 21 6m50s
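The node still shows SchedulingDisabled because it was drained earlier. Assuming the migration is complete, start kubelet again (if it is not already running) and uncordon the node:
sudo systemctl start kubelet
kubectl uncordon testvm.ocpvm2.net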
Kubernetes nodes not ready?
When the nodes are not ready, kubectl describe node <name> gives an idea of what is wrong. Here, we notice an InvalidDiskCapacity warning:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 42m kube-proxy
Normal NodeHasSufficientMemory 42m (x4 over 42m) kubelet Node testvm.ocpvm2.net status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 42m (x4 over 42m) kubelet Node testvm.ocpvm2.net status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 42m (x4 over 42m) kubelet Node testvm.ocpvm2.net status is now: NodeHasSufficientPID
Normal Starting 42m kubelet Starting kubelet.
Warning InvalidDiskCapacity 42m kubelet invalid capacity 0 on image filesystem
Normal NodeAllocatableEnforced 42m kubelet Updated Node Allocatable limit across pods
Normal NodeHasSufficientMemory 42m kubelet Node testvm.ocpvm2.net status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 42m kubelet Node testvm.ocpvm2.net status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 42m kubelet Node testvm.ocpvm2.net status is now: NodeHasSufficientPID
Look at disk space and ensure things are correct:
df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 58G 0 58G 0% /dev
tmpfs 63G 0 63G 0% /dev/shm
tmpfs 63G 92M 63G 1% /run
tmpfs 63G 0 63G 0% /sys/fs/cgroup
/dev/nvme0n1p5 625G 32G 594G 5% /
/dev/nvme0n1p3 301G 12G 289G 4% /home
/dev/nvme0n1p2 1014M 414M 601M 41% /boot
/dev/nvme0n1p1 1.1G 10M 1.1G 1% /boot/efi
tmpfs 13G 4.0K 13G 1% /run/user/1007
Disk space looks fine, so the containerd configuration is likely the problem. Regenerate the default configuration and restart containerd:
sudo containerd config default > config.toml
sudo cp config.toml /etc/containerd/config.toml
[user@testvm ~]$ systemctl restart containerd
Validate
[user@testvm ~]$ sudo kubectl get nodes
NAME STATUS ROLES AGE VERSION
testvm.ocpvm2.net Ready control-plane,master 50m v1.25.4
[user@testvm ~]$ sudo kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-flannel kube-flannel-ds-p2xrc 1/1 Running 0 43m
kube-system coredns-565d847f94-9wb5z 1/1 Running 0 50m
kube-system coredns-565d847f94-gmqkg 1/1 Running 0 50m
kube-system etcd-testvm.ocpvm2.net 1/1 Running 8 50m
kube-system kube-apiserver-testvm.ocpvm2.net 1/1 Running 8 50m
kube-system kube-controller-manager-testvm.ocpvm2.net 1/1 Running 28 50m
kube-system kube-proxy-t66jp 1/1 Running 0 50m
kube-system kube-scheduler-testvm.ocpvm2.net 1/1 Running 21 50m
Uninstall k8s cluster
kubeadm reset
# on Debian-based systems
sudo apt-get purge kubeadm kubectl kubelet kubernetes-cni kube*
sudo apt-get autoremove
# on CentOS-based systems
sudo yum autoremove
# for all distributions
sudo rm -rf ~/.kube
kubeadm reset -f
rm -rf /etc/cni /etc/kubernetes /var/lib/dockershim /var/lib/etcd /var/lib/kubelet /var/run/kubernetes ~/.kube/*
iptables -F && iptables -X
iptables -t nat -F && iptables -t nat -X
iptables -t raw -F && iptables -t raw -X
iptables -t mangle -F && iptables -t mangle -X
systemctl restart docker
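If Flannel was the CNI, leftover virtual interfaces may also need removal (the interface names below are the Flannel defaults):
sudo ip link delete cni0
sudo ip link delete flannel.1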
Install k8s
sudo apt-get update
sudo apt install apt-transport-https curl
Install containerd (reference: https://docs.docker.com/engine/install/ubuntu/)
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install containerd.io
Create containerd configuration
sudo mkdir -p /etc/containerd
sudo containerd config default | sudo tee /etc/containerd/config.toml
Edit /etc/containerd/config.toml and set SystemdCgroup = true:
sudo nano /etc/containerd/config.toml
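A non-interactive alternative, assuming the default config.toml layout generated above:
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml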
sudo systemctl restart containerd
Install Kubernetes
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add
sudo apt-add-repository "deb http://apt.kubernetes.io/ kubernetes-xenial main"
sudo apt install kubeadm kubelet kubectl kubernetes-cni
Disable swap
sudo swapoff -a
Check and remove any swap entry if exists
sudo nano /etc/fstab
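A sketch that comments out any swap line non-interactively (check the file afterwards before rebooting):
sudo sed -i.bak '/\sswap\s/ s/^/#/' /etc/fstab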
Avoid the error "/proc/sys/net/bridge/bridge-nf-call-iptables does not exist" on kubeadm init (reference: https://github.com/kubernetes/kubeadm/issues/1062). This is not necessary if Docker is also installed.
sudo modprobe br_netfilter
Enable IP forwarding by setting /proc/sys/net/ipv4/ip_forward to 1:
sudo sysctl -w net.ipv4.ip_forward=1
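To make the module and sysctl settings survive a reboot, the usual approach (a sketch mirroring the upstream kubeadm prerequisites) is:
cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
br_netfilter
EOF
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
EOF
sudo sysctl --system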
kubeadm init for use with Flannel
sudo kubeadm init --pod-network-cidr=10.244.0.0/16
Copy the admin kubeconfig as the kubeadm output says:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
Apply Flannel (reference https://github.com/flannel-io/flannel)
kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/v0.20.2/Documentation/kube-flannel.yml
All should be running now:
kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-flannel kube-flannel-ds-mcjmm 1/1 Running 0 76s
kube-system coredns-787d4945fb-fb59g 1/1 Running 0 8m8s
kube-system coredns-787d4945fb-t25tj 1/1 Running 0 8m8s
kube-system etcd-kube-master 1/1 Running 0 8m19s
kube-system kube-apiserver-kube-master 1/1 Running 0 8m19s
kube-system kube-controller-manager-kube-master 1/1 Running 0 8m19s
kube-system kube-proxy-2hz29 1/1 Running 0 8m8s
kube-system kube-scheduler-kube-master 1/1 Running 0 8m19s
Renew certificates
Sometimes the cluster certificates expire, and fixing that can be painful given the scattered resources out there.
First check if pods are running as expected
kubectl get pods --insecure-skip-tls-verify=true
Verify cert expirations
kubeadm certs check-expiration
Delete all existing certs
rm /etc/kubernetes/pki/apiserver*
rm /etc/kubernetes/pki/front*
Re-initialize and renew the certificates to make sure the process completes cleanly without errors:
kubeadm init phase certs all
kubeadm certs renew all
Restart kubelet and set up the kubeconfig again:
service kubelet restart
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
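Re-running the expiration check should now show the renewed dates:
kubeadm certs check-expiration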
All commands should now work fine.
Switching context on Kubernetes/OpenShift
[core@b4-96-91-d4-e2-c0 Juniper_Cloud_Native_Router_23.2]$ kubectl config get-contexts
CURRENT NAME CLUSTER AUTHINFO NAMESPACE
admin qctq05 admin
* jcnr/api-qctq05-idev-net:6443/system:admin api-qctq05-idev-net:6443 system:admin/api-qctq05-idev-net:6443 jcnr
pktgen/api-qctq05-idev-net:6443/system:admin api-qctq05-idev-net:6443 system:admin/api-qctq05-idev-net:6443 pktgen
[core@b4-96-91-d4-e2-c0 Juniper_Cloud_Native_Router_23.2]$ oc config use-context admin
Switched to context "admin".
[core@b4-96-91-d4-e2-c0 Juniper_Cloud_Native_Router_23.2]$ helm ls
[core@b4-96-91-d4-e2-c0 Juniper_Cloud_Native_Router_23.2]$ kubectl config get-contexts
CURRENT NAME CLUSTER AUTHINFO NAMESPACE
* admin qctq05 admin
jcnr/api-qctq05-idev-net:6443/system:admin api-qctq05-idev-net:6443 system:admin/api-qctq05-idev-net:6443 jcnr
pktgen/api-qctq05-idev-net:6443/system:admin api-qctq05-idev-net:6443 system:admin/api-qctq05-idev-net:6443 pktgen
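To print only the active context (sketch):
kubectl config current-context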
tags: kubernetes