SUSE CaaS Platform/Troubleshooting
Kubernetes
Kubernetes Cluster Information
To get information about the cluster and its master nodes, run:
kubectl cluster-info
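If `cluster-info` looks fine but workloads misbehave, a quick next step is to check for unready nodes. The parsing step can be sketched against canned `kubectl get nodes` output (node names and versions below are illustrative, not from a real cluster):

```shell
# Canned `kubectl get nodes` output standing in for a live call.
kubectl_get_nodes() {
  cat <<'EOF'
NAME       STATUS     ROLES    AGE   VERSION
master-0   Ready      master   12d   v1.9.8
worker-0   Ready      <none>   12d   v1.9.8
worker-1   NotReady   <none>   12d   v1.9.8
EOF
}

# NR>1 skips the header row; $2 is the STATUS column.
kubectl_get_nodes | awk 'NR>1 && $2 != "Ready" {print $1}'
```

On a live cluster you would pipe `kubectl get nodes` directly into the same awk filter.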
Etcd Troubleshooting
Etcd Cluster Status
For information about etcd, log in to any node of the cluster and run:
set -a; source /etc/sysconfig/etcdctl; set +a; etcdctl cluster-health
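On a large cluster the health output can be long, so it helps to filter for unhealthy members. A sketch against canned `etcdctl cluster-health` output (the member IDs and URLs below are made up):

```shell
# Canned `etcdctl cluster-health` output standing in for a live call.
cluster_health() {
  cat <<'EOF'
member 8211f1d0f64f3269 is healthy: got healthy result from https://10.10.0.10:2379
member 91bc3c398fb3c146 is unhealthy: got unhealthy result from https://10.10.0.11:2379
cluster is degraded
EOF
}

# Print only the members that report unhealthy.
cluster_health | grep ' is unhealthy'
```

In practice you would pipe the real `etcdctl cluster-health` command into the same grep; any output here points at the member(s) whose logs to check first.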
Docker Troubleshooting
Each Docker container has a "logs" command with which you can review the logs for that container.
Example 1
If you have a MySQL container running, this is what it would look like if you reviewed its logs:
jsevans@lab:~> docker logs clever_kalam
error: database is uninitialized and password option is not specified
  You need to specify one of MYSQL_ROOT_PASSWORD, MYSQL_ALLOW_EMPTY_PASSWORD and MYSQL_RANDOM_ROOT_PASSWORD
jsevans@lab:~>
In the example, you can see that this container failed to start because a root password was never set.
CaaS Platform has several containers that run on the admin node:
jse-velum:~ # docker ps
CONTAINER ID        IMAGE                COMMAND                  CREATED      STATUS      PORTS   NAMES
07ce66a2fb0d        2e581ae1971b         "bash /usr/local/bin/"   2 hours ago  Up 2 hours          k8s_haproxy_haproxy-127.0.0.1_kube-system_fc86b28e4b267305c6be1c4873664816_0
713107d66b64        sles12/pause:1.0.0   "/usr/share/suse-dock"   2 hours ago  Up 2 hours          k8s_POD_haproxy-127.0.0.1_kube-system_fc86b28e4b267305c6be1c4873664816_0
90371851a838        0ae495fb075d         "entrypoint.sh salt-m"   3 hours ago  Up 3 hours          k8s_salt-master_velum-public-127.0.0.1_default_8febe56752d5b78228314baf894d5740_1
66a569fe4bb0        82023d5e30f1         "entrypoint.sh bundle"   3 hours ago  Up 3 hours          k8s_velum-event-processor_velum-public-127.0.0.1_default_8febe56752d5b78228314baf894d5740_2
111ebdbc8d89        1b49b518ec09         "/usr/local/bin/entry"   3 hours ago  Up 2 hours          k8s_openldap_velum-public-127.0.0.1_default_8febe56752d5b78228314baf894d5740_0
0128b51867fe        82023d5e30f1         "entrypoint.sh bin/in"   3 hours ago  Up 2 hours          k8s_velum-dashboard_velum-public-127.0.0.1_default_8febe56752d5b78228314baf894d5740_0
fb62bc923356        3dbd223dcfac         "salt-minion.sh"         3 hours ago  Up 3 hours          k8s_salt-minion-ca_velum-public-127.0.0.1_default_8febe56752d5b78228314baf894d5740_0
f328d25e963d        23b3d88e6f1a         "salt-api"               3 hours ago  Up 3 hours          k8s_salt-api_velum-public-127.0.0.1_default_8febe56752d5b78228314baf894d5740_0
8f795c149af7        82023d5e30f1         "entrypoint.sh bundle"   3 hours ago  Up 3 hours          k8s_velum-autoyast_velum-public-127.0.0.1_default_8febe56752d5b78228314baf894d5740_0
f409e46f8938        1732313eb42f         "entrypoint.sh /usr/l"   3 hours ago  Up 3 hours          k8s_velum-mariadb_velum-private-127.0.0.1_default_bf640ab62f9d8d01fa0c2f7e66744787_0
341a7eba9003        sles12/pause:1.0.0   "/usr/share/suse-dock"   3 hours ago  Up 3 hours          k8s_POD_velum-public-127.0.0.1_default_8febe56752d5b78228314baf894d5740_0
6a26fbdeba6c        sles12/pause:1.0.0   "/usr/share/suse-dock"   3 hours ago  Up 3 hours          k8s_POD_velum-private-127.0.0.1_default_bf640ab62f9d8d01fa0c2f7e66744787_0
For each of these, you can run the "docker logs" command against the CONTAINER ID, or use the following bash shortcut to see the current logs for that container:
Example 2
This is how to see the logs from the salt-master:
docker logs -f `docker ps | grep salt-master | awk '{print $1}'`
local:
    - master.pem
    - master.pub
minions:
    - admin
    - ca
minions_denied:
minions_pre:
    - 86c31b34f5694c0f968f7ac4b09ad9fd
    - fc02599431da43b0bef03aa0343efe35
minions_rejected:
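The shortcut works because `docker ps` prints the container ID in the first column. The extraction step can be illustrated on canned output (a sketch; only two rows of the earlier `docker ps` listing are reproduced here):

```shell
# Canned `docker ps` output standing in for a live call.
docker_ps() {
  cat <<'EOF'
CONTAINER ID        IMAGE               COMMAND                  NAMES
90371851a838        0ae495fb075d        "entrypoint.sh salt-m"   k8s_salt-master_velum-public-127.0.0.1_default_8febe56752d5b78228314baf894d5740_1
f328d25e963d        23b3d88e6f1a        "salt-api"               k8s_salt-api_velum-public-127.0.0.1_default_8febe56752d5b78228314baf894d5740_0
EOF
}

# grep selects the salt-master row; awk prints column 1 (the container ID),
# which is what gets substituted into `docker logs ...`.
docker_ps | grep salt-master | awk '{print $1}'
```

Note that `grep salt-master` only matches the salt-master row, not salt-api or salt-minion containers, because the full string "salt-master" appears only in that container's name.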
Salt
You can add access to Salt by adding the following aliases to /root/.bashrc (you will need to create this file). Then log out and log back in to use them. Using Salt directly is good for troubleshooting, but should generally be avoided unless you are very familiar with Salt.
alias salt='docker exec `docker ps -q --filter name=salt-master` salt'
alias salt-key='docker exec `docker ps -q --filter name=salt-master` salt-key'
Example 3
Check salt keys:
jse-velum:~ # docker exec `docker ps -q --filter name=salt-master` salt-key -L
Accepted Keys:
0d07bd4cd7e54a4880423fc42f025b88
2dc9b41185c649f1bf01e4a451efb1bf
56b5e2057320402fb06c757ceaebbe04
6ba7e76c7deb482685362a596ae24442
872e3dd4eeb04007ad3ae0aabe4018bf
9a2c6ef0187e4c5faaebc255e074b793
a6dfc9dff43a4b29ae18213ebd743295
admin
ca
de2a71947dc544739e4f46489288984f
Denied Keys:
Unaccepted Keys:
Rejected Keys:
Example 4
Check IP addresses:
jse-velum:~ # docker exec `docker ps -q --filter name=salt-master` salt '*' grains.get ipv4
2dc9b41185c649f1bf01e4a451efb1bf:
    - 127.0.0.1
    - 149.44.138.247
    - 149.44.139.246
    - 172.17.0.1
6ba7e76c7deb482685362a596ae24442:
    - 127.0.0.1
    - 149.44.138.242
    - 149.44.139.211
    - 172.17.0.1
a6dfc9dff43a4b29ae18213ebd743295:
    - 127.0.0.1
    - 149.44.139.162
    - 172.17.0.1
56b5e2057320402fb06c757ceaebbe04:
    - 127.0.0.1
    - 149.44.138.241
    - 149.44.139.247
    - 172.17.0.1
872e3dd4eeb04007ad3ae0aabe4018bf:
    - 127.0.0.1
    - 149.44.138.246
    - 149.44.139.164
    - 172.17.0.1
0d07bd4cd7e54a4880423fc42f025b88:
    - 127.0.0.1
    - 149.44.138.248
    - 149.44.139.223
    - 172.17.0.1
9a2c6ef0187e4c5faaebc255e074b793:
    - 127.0.0.1
    - 149.44.139.224
    - 172.17.0.1
admin:
    - 127.0.0.1
    - 149.44.138.239
    - 172.17.0.1
de2a71947dc544739e4f46489288984f:
    - 127.0.0.1
    - 149.44.138.240
    - 149.44.139.225
    - 172.17.0.1
ca:
    - 127.0.0.1
    - 149.44.138.239
    - 172.17.0.1
Example 5
Check whether all salt-minions are up and responding to Salt. This can be useful when the Velum web interface shows "We're sorry, but something went wrong.":
jse-velum:~ # docker exec `docker ps -q --filter name=salt-master` salt '*' test.ping
admin:
    True
ca:
    True
adbb810767ec43209ec10338d1cfdc27:
    True
32b55f8a6c6149a6ac095438a215fe22:
    True
ee1aea918d924391b54d7ef56c8df027:
    True
ce8b28d0fa1f406cb48bcb1213a7f45e:
    True
6751dcc7fafe42ffa37932b79f83a240:
    True
2ce8e42643b44c8b8f72a2292aac8640:
    True
c10985b20e034e98acdee76822284534:
    True
37c5bb6101614a5fad3a56fb2cb03442:
    Minion did not return. [No response]
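On a large cluster the ping output gets long, so it helps to filter it down to just the failures. A minimal sketch, run here against a canned excerpt of the output above rather than a live salt call:

```shell
# Canned `salt '*' test.ping` output with one unresponsive minion.
test_ping() {
  cat <<'EOF'
admin:
    True
ca:
    True
37c5bb6101614a5fad3a56fb2cb03442:
    Minion did not return. [No response]
EOF
}

# -B1 also prints the minion ID line directly above each failure,
# so you see which minion did not respond.
test_ping | grep -B1 'Minion did not return'
```

In practice you would pipe the real `docker exec ... test.ping` command into the same grep.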
Node accepted and present in the Salt accepted minions list, but not visible on the "New nodes" bootstrapping page in Velum
It may happen that a node that has been accepted does not become available for selection on the "New nodes" (assign_nodes) page. In that case, check that the node is in the accepted minions list on the salt master.
If you know the hostname, you can look up the node ID on the admin node:
grep caasp-worker-scale-159 /etc/hosts
10.10.0.175 caasp-worker-scale-159 caasp-worker-scale-159.infra.caasp.local bc50b74d23724a448050ca0dd9412b1c bc50b74d23724a448050ca0dd9412b1c.infra.caasp.local
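If you only need the bare machine-id from that /etc/hosts entry, it can be cut out with awk. This assumes the admin node writes entries in the layout shown above (IP, hostname, FQDN, machine-id, machine-id FQDN):

```shell
# The /etc/hosts line from the grep above, as a sample string.
hosts_line='10.10.0.175 caasp-worker-scale-159 caasp-worker-scale-159.infra.caasp.local bc50b74d23724a448050ca0dd9412b1c bc50b74d23724a448050ca0dd9412b1c.infra.caasp.local'

# Field 4 of the entry is the bare machine-id.
echo "$hosts_line" | awk '{print $4}'
```

In practice you would pipe `grep <hostname> /etc/hosts` straight into the awk command.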
Still on the admin node, you can check that the node is accepted in Salt:
# docker exec `docker ps -q --filter name=salt-master` salt '*' grains.get ipv4 | grep -C5 bc50b74d23724a448050ca0dd9412b1c
    - 172.17.0.1
8d8868b39cf94ce4a25ef104a1ecfb78:
    - 10.10.0.154
    - 127.0.0.1
    - 172.17.0.1
bc50b74d23724a448050ca0dd9412b1c:
    - 10.10.0.175
    - 127.0.0.1
    - 172.17.0.1
fe3aaeae4ee2406a990d4ec8e1ccfe16:
    - 10.10.0.54
You can also ping it with Salt (here $saltid holds the salt-master container ID):
# docker exec -it $saltid salt bc50b74d23724a448050ca0dd9412b1c test.ping
bc50b74d23724a448050ca0dd9412b1c:
    True
So from the Salt perspective, the node is ready and accepted. A possible workaround is to restart the salt-minion on the node missing in Velum:
# ssh root@bc50b74d23724a448050ca0dd9412b1c
caasp-worker-scale-159:~ # systemctl restart salt-minion
Replace salt master address on all minions
You may want to replace the salt master address on all salt minions (master/worker nodes), for example to replace an IP address with a hostname.
On admin node:
# docker exec -it $(docker ps | grep salt-master | awk '{print $1}') salt "*" file.replace /etc/salt/minion.d/master.conf pattern='^master:.*' repl='master: admin.infra.caasp.local' append_if_not_found=false
# docker exec -i $(docker ps | grep salt-master | awk '{print $1}') salt --batch 15 -P "roles:(admin|kube-(master|minion))" cmd.run 'systemctl restart salt-minion'
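The `file.replace` call rewrites the `master:` line in /etc/salt/minion.d/master.conf on every minion. The edit itself is equivalent to a sed substitution with the same pattern and replacement, which can be verified locally on a sample file (a sketch; the original IP value is made up):

```shell
# A minion's master.conf as it might look before the change.
conf=$(mktemp)
printf 'master: 10.10.0.100\n' > "$conf"

# Same pattern/repl as the salt file.replace call above.
sed -i 's/^master:.*/master: admin.infra.caasp.local/' "$conf"

cat "$conf"
rm -f "$conf"
```

The `--batch 15` on the restart command limits how many minions restart salt-minion at once, so the whole cluster does not drop its master connection simultaneously.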
To check that it worked, you can use a Salt ping:
# docker exec `docker ps -q --filter name=salt-master` salt '*' test.ping
supportconfig
The following files are added specifically for CaaS Platform:
velum-files.txt
velum-migrations.txt
velum-minions.yml
velum-routes.txt
velum-salt-events.yml
velum-salt-pillars.yml
kubernetes.txt
The problem with these logs is that they are in YAML format and not easy to troubleshoot. Rather than grepping for a single line as we normally would, a better strategy is to grep with context.
Example 6
grep -C3 fail velum-salt-events.yml
    fun: grains.get
    id: 0d07bd4cd7e54a4880423fc42f025b88
  - fun_args:
    - tx_update_failed
    jid: '20180214102215466369'
    return:
      retcode: 0
The -C3 flag will print 3 lines before and after each matching line, so you get a much better idea of what is happening than from a single line that tells you nothing about what the error is actually about.
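The difference is easy to demonstrate on a toy excerpt in the same shape as velum-salt-events.yml (canned here so the comparison is runnable):

```shell
# Toy excerpt mimicking the structure of velum-salt-events.yml.
events() {
  cat <<'EOF'
    fun: grains.get
    id: 0d07bd4cd7e54a4880423fc42f025b88
  - fun_args:
    - tx_update_failed
    jid: '20180214102215466369'
    return:
      retcode: 0
EOF
}

# Plain grep shows only the matching line...
events | grep fail

# ...while -C3 also shows which minion (id:) and job (jid:) it belongs to.
events | grep -C3 fail
```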
Bootstrapping
If a bootstrap fails, we are not given any direct output about why it failed. We now have a couple of new tools:
First, we can manually kick off another bootstrap and record the output to bootstrap.log.
docker exec -it $(docker ps | grep salt-master | awk '{print $1}') salt-run -l debug state.orchestrate orch.kubernetes | tee bootstrap.log
The CaaS Platform team says, "This is unsupported and may cause issues further down the line but you can run it," so we don't want to do this unless it is a recurring issue that the customer can't get past. In CaaS Platform 3 this will be an option built into Velum, though I don't know if it will be logged.
Secondly, if this is recurring, we can put the salt-master into debug mode after installing the admin node and before installing the other nodes.
vim /etc/caasp/salt-master-custom.conf (on admin node)
and add
# Custom Configurations for Salt-Master
log_level: debug
You can then restart the salt-master container with:
docker stop `docker ps | grep salt-master | awk '{print $1}'`
CaaSP will automatically restart the container with debugging turned on.
The logs that the salt-master produces are in JSON format. Before starting the bootstrap, run "script" and then run the command from Example 2. Once the bootstrap has failed, press CTRL-C and then exit. The output that was just on your screen from the log command will be in a file called "typescript".
From the dev team:
During bootstrap, thousands of "things" are done and sequenced across the set of cluster machines - and the set of things is constantly changing - any list would be out of date pretty quickly.
Generally speaking, finding which specific Salt step has failed is not *all that difficult* by looking at the velum-salt-events.yml file in the admin node's supportconfig dump.
The process looks like this:
- Open velum-salt-events.yml in an editor
- Search for "result: False"
- If the match you find says something like "Failed as prerequisite bar failed" (I don't have the exact wording handy), trace "up" the chain of failure logs to the "bar" failure
- Repeat until you see a failure where the message is not simply a prerequisite failing, and you have found the real failure event.
- The failure event will, more often than not, have a reasonably descriptive name. The name will indicate which area failed and needs investigation - e.g. if the name includes "etcd", start checking the etcd logs on each node, etc.
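The search and trace-back steps above can be sketched on a toy excerpt of velum-salt-events.yml (the event names and comments below are invented for illustration):

```shell
# Toy excerpt with a chained failure: kube-apiserver fails only
# because its requisite etcd-setup failed first.
events() {
  cat <<'EOF'
  - name: etcd-setup
    result: False
    comment: etcd service failed to start
  - name: kube-apiserver
    result: False
    comment: One or more requisite failed: etcd-setup
EOF
}

# Show each failure together with its comment; the requisite message
# points back to etcd-setup, which is the real failure to investigate.
events | grep -A1 'result: False'
```

On a real dump, `grep -A1 'result: False' velum-salt-events.yml` (or a larger -A/-C value) gives the same kind of view.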
Removing unwanted node from Velum before accepting node/bootstrapping
If you see any unwanted pending node(s) in Velum, e.g. from a previous CaaSP deployment, you can discard the pending node(s) before accepting them and bootstrapping the cluster by running the following command on the admin node:
docker exec `docker ps -q --filter name=salt-master` salt-key -y -d f3c95d561fc6462b82805e40d7f0d17f
Where 'f3c95d561fc6462b82805e40d7f0d17f' is the machine-id of the unwanted node.
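If there are several leftover nodes, you can first list just the unaccepted key names from the `salt-key -L` output and then feed each one to `salt-key -y -d`. A sketch on canned output (key names are from the examples above):

```shell
# Canned `salt-key -L` output standing in for a live call.
salt_key_L() {
  cat <<'EOF'
Accepted Keys:
admin
ca
Denied Keys:
Unaccepted Keys:
f3c95d561fc6462b82805e40d7f0d17f
Rejected Keys:
EOF
}

# Print only the lines between "Unaccepted Keys:" and the next
# "... Keys:" section header.
salt_key_L | awk '/^Unaccepted Keys:/{f=1; next} /^[A-Z].*Keys:/{f=0} f'
```

Review the resulting list carefully before deleting anything; each printed ID would then be passed to the `salt-key -y -d <id>` command shown above.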