Clean install and unable to launch workspace

I get the following error:

Sorry, something went wrong :sweat:

Error: Request createWorkspace failed with message: 14 UNAVAILABLE: failed to connect to all addresses

While it is trying to build, I can see that the node-daemon pod starts to crash-loop.


Error: Request createWorkspace failed with message: 14 UNAVAILABLE: failed to connect to all addresses

This indicates that the server pod cannot connect to the ws-manager pod. I have seen this with k3s a few times as well, and I don’t know why yet. If I recall correctly, redeploying fixed it. I’ll investigate this problem further soon, though I’m not sure I’ll get to it this week.
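
If you want to dig in yourself in the meantime, a quick way to check that theory is to look at ws-manager directly. This is only a rough sketch; the component=ws-manager label and the ws-manager service name are assumptions based on a default Helm install:

  # Is ws-manager running and ready? (assumes the default component=ws-manager label)
  kubectl get pods -l component=ws-manager

  # Does the ws-manager service have endpoints the server could connect to?
  kubectl get svc ws-manager
  kubectl get endpoints ws-manager

  # Anything suspicious in the ws-manager logs?
  kubectl logs -l component=ws-manager --tail=100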

Regarding the crash-looping node-daemon:

Could you please provide the output of

  • kubectl logs node-daemon...
  • kubectl describe pod node-daemon...
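
If the exact pod name isn’t handy, you can also select the pods by label; this assumes the labels a default install applies (component=node-daemon):

  kubectl get pods -l component=node-daemon
  kubectl logs -l component=node-daemon --tail=100
  kubectl describe pods -l component=node-daemon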
Output of kubectl logs node-daemon...:

Theia (version v0.4.0) became available BUT we've failed to mark the node (attempt 8/10)
node/gitpod2-3obzm not patched

Output of kubectl describe pod node-daemon...:
Name:         node-daemon-65vd5
Namespace:    default
Priority:     0
Node:         gitpod2-3obzm/10.108.0.18
Start Time:   Fri, 10 Jul 2020 15:47:10 -0500
Labels:       app=gitpod
              component=node-daemon
              controller-revision-hash=686b546d64
              kind=daemonset
              pod-template-generation=2
              stage=production
              subcomponent=node-daemon
Annotations:  <none>
Status:       Running
IP:           10.244.1.28
IPs:
  IP:           10.244.1.28
Controlled By:  DaemonSet/node-daemon
Init Containers:
  theia:
    Container ID:  docker://e5836f3a38adbbdca673f17bb9246cd507dbf61a707e9bea5b8d77d3c26a9969
    Image:         gcr.io/gitpod-io/theia-server:v0.4.0
    Image ID:      docker-pullable://gcr.io/gitpod-io/theia-server@sha256:79ab7d75beffdef3fa018cbe3eaebec749561ab6244885e08d3fabe256076984
    Port:          <none>
    Host Port:     <none>
    Args:
      --copy-to
      /mnt/theia-storage/theia/theia-v0.4.0
      -d
      /theia
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 10 Jul 2020 15:47:12 -0500
      Finished:     Fri, 10 Jul 2020 15:47:12 -0500
    Ready:          True
    Restart Count:  0
    Limits:
      memory:  250Mi
    Requests:
      cpu:        5m
      memory:     250Mi
    Environment:  <none>
    Mounts:
      /mnt/theia-storage from theia-storage (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from node-daemon-token-qcdmt (ro)
  node-init:
    Container ID:  docker://1776252f610303e49a40917faba77a1245096c2330a25ee82df3349a5fddfc5f
    Image:         alpine:3.7
    Image ID:      docker-pullable://alpine@sha256:8421d9a84432575381bfabd248f1eb56f3aa21d9d7cd2511583c68c9b7511d10
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      apk add findutils
      trap end 15; end() {
        echo "[node] Received SIGTERM, exiting with 0";
        exit 0;
      }; echo "[node] Start";
      (
        echo "[node] Patching node..." &&
        sysctl -w net.core.somaxconn=4096 &&
        sysctl -w "net.ipv4.ip_local_port_range=5000 65000" &&
        sysctl -w "net.ipv4.tcp_tw_reuse=1" &&
        sysctl -w fs.inotify.max_user_watches=1000000 &&
        sysctl -w "kernel.dmesg_restrict=1"
      ) && echo "[node] done!" || echo "[node] failed!" &&
      echo "[node] Initialized."

    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 10 Jul 2020 15:47:13 -0500
      Finished:     Fri, 10 Jul 2020 15:47:14 -0500
    Ready:          True
    Restart Count:  0
    Limits:
      memory:  50Mi
    Requests:
      cpu:        5m
      memory:     50Mi
    Environment:  <none>
    Mounts:
      /mnt/theia-storage from theia-storage (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from node-daemon-token-qcdmt (ro)
Containers:
  node:
    Container ID:   docker://d211eac546b54a706a80645f164a4f13a37b469816d7e8738ce6f33d716e0ba0
    Image:          gcr.io/gitpod-io/node-daemon:v0.4.0
    Image ID:       docker-pullable://gcr.io/gitpod-io/node-daemon@sha256:216017ca27131241de5608e948969783112d04f4e7b9b56a226e2fe9d34cfd92
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Mon, 13 Jul 2020 10:48:08 -0500
      Finished:     Mon, 13 Jul 2020 10:48:19 -0500
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Mon, 13 Jul 2020 10:42:57 -0500
      Finished:     Mon, 13 Jul 2020 10:43:07 -0500
    Ready:          False
    Restart Count:  765
    Limits:
      memory:  50Mi
    Requests:
      cpu:     5m
      memory:  50Mi
    Environment:
      KUBE_STAGE:                     production
      KUBE_NAMESPACE:                 default (v1:metadata.namespace)
      VERSION:                        v0.4.0
      HOST_URL:                       https://my.url.here
      GITPOD_REGION:                  local
      GITPOD_INSTALLATION_LONGNAME:   production.gitpod.local.00
      GITPOD_INSTALLATION_SHORTNAME:  local-00
      EXECUTING_NODE_NAME:             (v1:spec.nodeName)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from node-daemon-token-qcdmt (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  theia-storage:
    Type:          HostPath (bare host directory volume)
    Path:          /var/gitpod
    HostPathType:  DirectoryOrCreate
  node-exporter:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/node_exporter/textfile_collector
    HostPathType:  DirectoryOrCreate
  node-daemon-token-qcdmt:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  node-daemon-token-qcdmt
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/pid-pressure:NoSchedule
                 node.kubernetes.io/unreachable:NoExecute
                 node.kubernetes.io/unschedulable:NoSchedule
Events:
  Type     Reason   Age                      From                    Message
  ----     ------   ----                     ----                    -------
  Normal   Pulled   16m (x763 over 2d19h)    kubelet, gitpod2-3obzm  Container image "gcr.io/gitpod-io/node-daemon:v0.4.0" already present on machine
  Warning  BackOff  75s (x17861 over 2d19h)  kubelet, gitpod2-3obzm  Back-off restarting failed container

I have a similar issue on a Civo k3s cluster, with node-daemon being the only component that is not coming up as it should.

I am guessing it has something to do with the default /var/gitpod/ directory, which I can see is used on the nodes if nothing else is specified in the values.yaml file for the Helm deployment.

I’ve tried the same install on a Rancher RKE cluster I created and don’t get the same problem there. (The proxy load balancer is not working, but that is a different issue.)

Spoke too soon: I uninstalled and re-installed Gitpod on my RKE cluster and am now also getting the ... Theia (version v0.4.0) became available BUT we've failed to mark the node error.

I tried deleting the /var/gitpod/theia directory on one of the nodes to see whether a leftover directory from an older install was the problem, but that did not make any difference.

Does anyone know what needs to happen for the nodes to be marked as successful?

From the log above it’s not clear why the patching does not work.

The failing command that node-daemon attempts to execute is

kubectl patch node $EXECUTING_NODE_NAME --patch '{"metadata":{"labels":{"gitpod.io/theia.v0.4.0": "available"}}}'

Could you run this command (replacing $EXECUTING_NODE_NAME with a proper node name) and see if that fails?
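
It may also be worth checking whether that label is already present on the node before patching. A hypothetical check, reusing the label key from the command above:

  # shows the label value as an extra column for every node (empty if the label is not set)
  kubectl get nodes -L gitpod.io/theia.v0.4.0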

I then get:

node/nio-scw-k3s-001 patched (no change)

Removing the label from the node seems to help; the node-daemon pods are then able to complete.
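
For reference, removing it looks roughly like this (the node name is a placeholder, and the label key is taken from the patch command above; the trailing dash tells kubectl to delete the label):

  kubectl label node <node-name> gitpod.io/theia.v0.4.0-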

There we have it … the script does not cope well if the label is already present on the node. I’ll file a bug internally.

:+1:

I now have a fully running install on k3s managed with Rancher, even after an uninstall and re-install, which is what broke things last week.

But… trying to create a new workspace for one of my GitLab repos, I get the same error @wade_wilson was getting:

Sorry, something went wrong 😓
Error: Request createWorkspace failed with message: 14 UNAVAILABLE: failed to connect to all addresses
Please file an issue if you think this is a bug.

Which service(s) should I be checking for errors to try and track this down?

Looking at the server logs, I see the following:

...
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"v0.4.0"},"stack_trace":"Error: 14 UNAVAILABLE: failed to connect to all addresses\n    at Object.exports.createStatusError (/app/node_modules/grpc/src/common.js:91:15)\n    at Object.onReceiveStatus (/app/node_modules/grpc/src/client_interceptors.js:1204:28)\n    at InterceptingListener._callNext (/app/node_modules/grpc/src/client_interceptors.js:568:42)\n    at InterceptingListener.onReceiveStatus (/app/node_modules/grpc/src/client_interceptors.js:618:8)\n    at callback (/app/node_modules/grpc/src/client_interceptors.js:845:24)","component":"server","severity":"ERROR","time":"2020-07-22T18:44:36.002Z","environment":"production","region":"local","message":"Request createWorkspace failed with internal server error","error":"Error: 14 UNAVAILABLE: failed to connect to all addresses\n    at Object.exports.createStatusError (/app/node_modules/grpc/src/common.js:91:15)\n    at Object.onReceiveStatus (/app/node_modules/grpc/src/client_interceptors.js:1204:28)\n    at InterceptingListener._callNext (/app/node_modules/grpc/src/client_interceptors.js:568:42)\n    at InterceptingListener.onReceiveStatus (/app/node_modules/grpc/src/client_interceptors.js:618:8)\n    at callback (/app/node_modules/grpc/src/client_interceptors.js:845:24)","payload":{"method":"createWorkspace","args":[{"contextUrl":"https://gitlab.com/nodeable/mi-poc","mode":"select-if-running"},{"_isCancelled":false}]}}
{"component":"server","severity":"INFO","time":"2020-07-22T18:44:36.072Z","environment":"production","region":"local","context":{"userId":"271b85d7-86cc-4814-8b2b-f72d465c492a"},"message":"getLoggedInUser"}
{"component":"server","severity":"INFO","time":"2020-07-22T18:44:36.073Z","environment":"production","region":"local","context":{"userId":"271b85d7-86cc-4814-8b2b-f72d465c492a"},"message":"getShowPaymentUI"}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"v0.4.0"},"component":"server","severity":"ERROR","time":"2020-07-22T18:44:54.590Z","environment":"production","region":"local","message":"Error in fetching sampling strategy: Error: connect ECONNREFUSED 0.0.0.0:5778.","loggedViaConsole":true}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"v0.4.0"},"component":"server","severity":"ERROR","time":"2020-07-22T18:45:27.385Z","environment":"production","region":"local","message":"Error in fetching sampling strategy: Error: connect ECONNREFUSED 0.0.0.0:5778.","loggedViaConsole":true}
... 

@csweichel OMG, thanks! I was also experiencing this same issue and was manually removing the labels after uninstalling. :rofl:

Good