Harbor Containers Fail to Start

Harbor 啟動失敗調查

這篇文章將介紹客戶端 Harbor 啟動時有機率發生錯誤的紀錄,並說明調查步驟及結果。

事件發生

某日客戶通報一個 issue,他們在啟動 Harbor 時,有機率會在啟動後,部分的 API 無法使用,可以透過訪問 80 port 看到 web 介面,但登入後點擊部分連結,卻會回覆 500 error。在錯誤發生後,經過多次重啟 container,有機率會恢復正常,否則就需要重安裝 Harbor

客戶詢問我這是不是 Harbor 的一個 bug,是否可以升級 Harbor 版本來解決,但我在 github 上沒有看到相關的 issue。

客戶的 Harbor 版本為 v2.5.6

調查經過

調查 container 狀態

Harbor 的啟動方式是透過 docker-compose 工具,設定檔為 install.sh 所生成的 docker-compose.yaml,啟動時會建立以下的幾個服務,可以透過 docker ps 查看 container 的狀態:

harbor-jobservice
nginx
harbor-core
registry
harbor-db
redis
harbor-portal
registryctl
harbor-log

我發現 harbor-jobservice 一直處於 restart 的狀態,透過 docker logs -f harbor-jobservice 查看 logs 後,發現如下的錯誤訊息:

2024-01-24T01:44:33Z [ERROR] [/pkg/config/rest/rest.go:50]: Failed on load rest config err:invalid character '<' looking for beginning of value, url:http://core:8080/api/v2.0/internalconfig
2024-01-24T01:44:33Z [INFO] [/pkg/config/rest/rest.go:47]: get configuration from url: http://core:8080/api/v2.0/internalconfig
2024-01-24T01:44:33Z [INFO] [/jobservice/logger/sweeper_controller.go:97]: 0 outdated log entries are sweepped by sweeper *sweeper.FileSweeper
2024-01-24T01:44:33Z [ERROR] [/pkg/config/rest/rest.go:50]: Failed on load rest config err:invalid character '<' looking for beginning of value, url:http://core:8080/api/v2.0/internalconfig
2024-01-24T01:44:33Z [ERROR] [/jobservice/job/impl/context.go:75]: Job context initialization error: failed to load rest config
2024-01-24T01:44:33Z [INFO] [/jobservice/job/impl/context.go:78]: Retry in 9 seconds
2024-01-24T01:44:42Z [INFO] [/pkg/config/rest/rest.go:47]: get configuration from url: http://core:8080/api/v2.0/internalconfig
2024-01-24T01:44:42Z [ERROR] [/pkg/config/rest/rest.go:50]: Failed on load rest config err:invalid character '<' looking for beginning of value, url:http://core:8080/api/v2.0/internalconfig
2024-01-24T01:44:42Z [ERROR] [/jobservice/job/impl/context.go:75]: Job context initialization error: failed to load rest config
2024-01-24T01:44:42Z [INFO] [/jobservice/job/impl/context.go:78]: Retry in 13 seconds
2024-01-24T01:44:55Z [INFO] [/pkg/config/rest/rest.go:47]: get configuration from url: http://core:8080/api/v2.0/internalconfig
2024-01-24T01:44:55Z [ERROR] [/pkg/config/rest/rest.go:50]: Failed on load rest config err:invalid character '<' looking for beginning of value, url:http://core:8080/api/v2.0/internalconfig
2024-01-24T01:44:55Z [ERROR] [/jobservice/job/impl/context.go:75]: Job context initialization error: failed to load rest config
2024-01-24T01:44:55Z [INFO] [/jobservice/job/impl/context.go:78]: Retry in 19 seconds
2024-01-24T01:45:14Z [INFO] [/pkg/config/rest/rest.go:47]: get configuration from url: http://core:8080/api/v2.0/internalconfig
2024-01-24T01:45:14Z [ERROR] [/pkg/config/rest/rest.go:50]: Failed on load rest config err:invalid character '<' looking for beginning of value, url:http://core:8080/api/v2.0/internalconfig
2024-01-24T01:45:14Z [ERROR] [/jobservice/job/impl/context.go:75]: Job context initialization error: failed to load rest config
2024-01-24T01:45:14Z [INFO] [/jobservice/job/impl/context.go:78]: Retry in 29 seconds
2024-01-24T01:45:43Z [INFO] [/pkg/config/rest/rest.go:47]: get configuration from url: http://core:8080/api/v2.0/internalconfig
2024-01-24T01:45:43Z [ERROR] [/pkg/config/rest/rest.go:50]: Failed on load rest config err:invalid character '<' looking for beginning of value, url:http://core:8080/api/v2.0/internalconfig
2024-01-24T01:45:43Z [ERROR] [/jobservice/job/impl/context.go:75]: Job context initialization error: failed to load rest config
2024-01-24T01:45:43Z [FATAL] [/jobservice/main.go:87]: initialize job context error: job context initialization error: failed to load rest config (5 times tried)

從錯誤訊息中,大概可以看出,harbor-jobservice 不斷嘗試訪問 http://core:8080/api/v2.0/internalconfig,但卻發生 Failed on load 無法讀取的問題,原因與 response 的內容有關,可能有包含 <:

invalid character '<' looking for beginning of value

調查 API 問題

為了精準定位錯誤,需要先調查 container 的行為,透過 cat docker-compose.yaml 查看 container 的設定:

...
core:
    image: goharbor/harbor-core:v2.5.6
    container_name: harbor-core
    env_file:
      - ./common/config/core/env
    restart: always
    cap_drop:
      - ALL
    cap_add:
      - SETGID
      - SETUID
    volumes:
      - /data/ca_download/:/etc/core/ca/:z
      - /data/:/data/:z
      - ./common/config/core/certificates/:/etc/core/certificates/:z
      - type: bind
        source: ./common/config/core/app.conf
        target: /etc/core/app.conf
      - type: bind
        source: /data/secret/core/private_key.pem
        target: /etc/core/private_key.pem
      - type: bind
        source: /data/secret/keys/secretkey
        target: /etc/core/key
      - type: bind
        source: ./common/config/shared/trust-certificates
        target: /harbor_cust_cert
    networks:
      harbor:
    depends_on:
      - log
      - registry
      - redis
      - postgresql
    logging:
      driver: "syslog"
      options:
        syslog-address: "tcp://localhost:1514"
        tag: "core"
...
jobservice:
    image: goharbor/harbor-jobservice:v2.5.6
    container_name: harbor-jobservice
    env_file:
      - ./common/config/jobservice/env
    restart: always
    cap_drop:
      - ALL
    cap_add:
      - CHOWN
      - SETGID
      - SETUID
    volumes:
      - /data/job_logs:/var/log/jobs:z
      - type: bind
        source: ./common/config/jobservice/config.yml
        target: /etc/jobservice/config.yml
      - type: bind
        source: ./common/config/shared/trust-certificates
        target: /harbor_cust_cert
    networks:
      - harbor
    depends_on:
      - core
    logging:
      driver: "syslog"
      options:
        syslog-address: "tcp://localhost:1514"
        tag: "jobservice"
...

透過以下指令,調查 harbor-jobservice 的 entrypoint,可以得知 container 在啟動時執行 /harbor/entrypoint.sh:

docker inspect -f '{{.Config.Entrypoint}}' harbor-jobservice
[/harbor/entrypoint.sh]

嘗試使用與 harbor-jobservice 相同 namespace 的環境,查看 /harbor/entrypoint.sh 的內容:

docker exec harbor-jobservice cat /harbor/entrypoint.sh
#!/bin/sh

set -e

/harbor/install_cert.sh

exec /harbor/harbor_jobservice -c /etc/jobservice/config.yml

/harbor/entrypoint.sh 猜測 container 的行為,/harbor/harbor_jobservice 存取 http://core:8080/api/v2.0/internalconfig 時,發生了非預期的錯誤,導致啟動失敗:

harbor-jobservice-fail-call

透過 curl 指令存取 http://core:8080/api/v2.0/internalconfig,或許能看到回傳的內容:

docker exec harbor-jobservice curl http://core:8080/api/v2.0/internalconfig

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    99  100    99    0     0  12895      0 --:--:-- --:--:-- --:--:-- 14142
{"errors":[{"code":"UNAUTHORIZED","message":"only internal service is allowed to call this API"}]}

可以看到 core:8080 回傳一段 JSON format 的 response 內容,並沒有 character <,由於 harbor-jobservice 呼叫 API 時會帶入 token,透過 package document 得知,呼叫 API 時需要加入 header Authorization : Harbor-Secret <secret>,而 token 則被記錄在 common/config/jobservice/env 上:

https://pkg.go.dev/github.com/goharbor/harbor/src/jobservice#section-readme

job-service-env

存取 ./common/config/jobservice/env ,透過查看變數 JOBSERVICE_SECRET 獲取 token,重新使用 curl:

請將 <secret> 替換為 token 內容

docker exec harbor-jobservice curl -H "Authorization: Harbor-Secret <secret>" http://core:8080/api/v2.0/internalconfig
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3962    0  3962    0     0  1029k      0 --:--:-- --:--:-- --:--:-- 1289k
{"admin_initial_password":{"editable":true,"value":"Admin12345"},"auth_mode":{"editable":false,"value":"ldap_auth"},"cache_enabled":{"editable":true,"value":false},"cache_expire_hours":{"editable":true,"value":24},"core_local_url":{"editable":true,"value":"http://127.0.0.1:8080"},"core_url":{"editable":true,"value":"http://core:8080"},"database_type":{"editable":true,"value":"postgresql"},"execution_status_refresh_interval_seconds":{"editable":true,"value":30},"ext_endpoint":{"editable":true,"value":"https://harbor.cnsdomain.com"},"gdpr_delete_user":{"editable":true,"value":false},"http_authproxy_skip_search":{"editable":true,"value":false},"http_authproxy_verify_cert":{"editable":true,"value":true},"jobservice_url":{"editable":true,"value":"http://jobservice:8080"},"ldap_base_dn":{"editable":true,"value":"DC=ad,DC=cnsdomain,DC=com"},"ldap_group_membership_attribute":{"editable":true,"value":"memberof"},"ldap_group_search_scope":{"editable":true,"value":2},"ldap_scope":{"editable":true,"value":2},"ldap_search_dn":{"editable":true,"value":"CN=Administrator,CN=Users,DC=ad,DC=cnsdomain,DC=com"},"ldap_search_password":{"editable":true,"value":"Admin12345"},"ldap_timeout":{"editable":true,"value":5},"ldap_uid":{"editable":true,"value":"cn"},"ldap_url":{"editable":true,"value":"192.168.50.35"},"ldap_verify_cert":{"editable":true,"value":true},"max_job_workers":{"editable":true,"value":10},"metric_enable":{"editable":true,"value":false},"metric_path":{"editable":true,"value":"/metrics"},"metric_port":{"editable":true,"value":9090},"notary_url":{"editable":true,"value":"http://notary-server:4443"},"notification_enable":{"editable":true,"value":true},"oidc_auto_onboard":{"editable":true,"value":false},"oidc_extra_redirect_parms":{"editable":true,"value":"{}"},"oidc_verify_cert":{"editable":true,"value":true},"postgresql_conn_max_idle_time":{"editable":true,"value":"0"},"postgresql_conn_max_lifetime":{"editable":true,"value":"5m"},"postgresql_database":{"editable":true,"value":"registry"},"postgresql_host":{"editable":true,"value":"postgresql"},"postgresql_max_idle_conns":{"editable":true,"value":100},"postgresql_max_open_conns":{"editable":true,"value":900},"postgresql_password":{"editable":true,"value":"root123"},"postgresql_port":{"editable":true,"value":5432},"postgresql_sslmode":{"editable":true,"value":"disable"},"postgresql_username":{"editable":true,"value":"postgres"},"primary_auth_mode":{"editable":true,"value":true},"project_creation_restriction":{"editable":true,"value":"everyone"},"pull_audit_log_disable":{"editable":true,"value":false},"pull_count_update_disable":{"editable":true,"value":false},"pull_time_update_disable":{"editable":true,"value":false},"quota_per_project_enable":{"editable":true,"value":true},"read_only":{"editable":true,"value":false},"registry_controller_url":{"editable":true,"value":"http://registryctl:8080"},"registry_storage_provider_name":{"editable":true,"value":"filesystem"},"registry_url":{"editable":true,"value":"http://registry:5000"},"robot_name_prefix":{"editable":true,"value":"robot$"},"robot_scanner_name_prefix":{"editable":true,"value":"H8T9ACT4"},"robot_token_duration":{"editable":true,"value":30},"scanner_skip_update_pulltime":{"editable":true,"value":false},"self_registration":{"editable":true,"value":false},"session_timeout":{"editable":true,"value":60},"skip_audit_log_database":{"editable":true,"value":false},"storage_per_project":{"editable":true,"value":-1},"token_expiration":{"editable":true,"value":30},"token_service_url":{"editable":true,"value":"http://core:8080/service/token"},"trace_enabled":{"editable":true,"value":false},"trace_jaeger_agent_port":{"editable":true,"value":"6831"},"trace_sample_rate":{"editable":true,"value":1},"trivy_adapter_url":{"editable":true,"value":"http://trivy-adapter:8080"},"uaa_verify_cert":{"editable":true,"value":false},"with_notary":{"editable":true,"value":false},"with_trivy":{"editable":true,"value":false}}

由於 response 的內容也沒有出現 character <,因此我懷疑 curlharbor-jobservice 在 call API 時的行為不一樣,需要錄製封包才能得知兩者的差異。

錄製封包

可以透過 tcpdump 指定固定的 veth,但 container 不斷地 restart,會使 docker 不斷更換 veth。為了使 veth 固定,需要改寫 container 的 entrypoint,修改 docker-compose.yaml 並加入 entrypoint,透過 docker-compose up -d 重啟 container:

...
jobservice:
    image: goharbor/harbor-jobservice:v2.5.6
    container_name: harbor-jobservice
    entrypoint: tail -f /dev/null
...

查看 container 在 node 上的 namespace 的 veth id:

docker exec harbor-jobservice cat /sys/class/net/eth0/iflink
<ifid>

透過 ip link 可以查看 node 上的 network interface,找到與 <ifid> 相同 id 的 veth 後,複製名稱,使用 tcpdump 指令錄製封包,並加入背景執行 &,結果存放至 /tmp/network1.pcap:

請替換 <veth-name>veth 的名稱

tcpdump -i <veth-name> -w /tmp/network1.pcap &

執行 harbor-jobservice 後按下 ctrl+c 關閉 session,然後執行 fg 後按下 ctrl+c 結束 process:

docker exec harbor-jobservice /harbor/harbor_jobservice -c /etc/jobservice/config.yml

重複以上作業,將錄製目錄改為 /tmp/network2.pcap,執行 curl 後錄製封包,然後執行 fg 後按下 ctrl+c 結束 process:

docker exec harbor-jobservice curl -H "Authorization: Harbor-Secret <secret>" http://core:8080/api/v2.0/internalconfig

可以使用 tcpdump 讀取 pcap 檔案:

-nn 將 port 從 name 轉換成 number

tcpdump -nn -r /tmp/network1.pcap
tcpdump -nn -r /tmp/network2.pcap

以下是 /tmp/network1.pcap 的內容,透過 wireshark 打開:

network1

以下是 /tmp/network2.pcap 的內容,透過 wireshark 打開:

network2

pcap 的內容可以看到,harbor-jobservicecurl 呼叫 API 的 IP 地址不同,雖然都是訪問 core:8080,但 curl 訪問的是 docker dns 解析後的 container IP,但 harbor-jobservice 卻訪問 HTTP_PROXY 的 IP,這不應該發生,因為 common/config/jobservice/env 上會設定 NO_PROXY 為 container 的名稱,如下所示:

NO_PROXY=.internal,127.0.0.1,localhost,registry,registryctl,exporter,nginx,trivy-adapter,redis,notary-signer,notary-server,db,.local,postgresql,jobservice,core,portal,log

透過 docker inspect 查看 harbor-jobservice 的 env ,發現 NO_PROXY 的內容與 env_file 不符:

docker inspect -f '{{.Config.Env}}' harbor-jobservice

env

Docker Compose Bug

查看 container 的 env 後發現,所使用的 env 是 node 上的 env,而啟動 container 的方式是透過 docker-compose up

換句話說 docker-compose 本來應該在新增 container 時,賦予 container 的 config 的 env 為 env_file 的設定,但卻優先讀取 node 上的 env variable,且這個問題時好時壞的,並不是每次發生。我在 docker-compose 的 github 上並沒有看到相關的 issue,而客戶所使用的 docker-compose 版本為 v2.6.0

https://github.com/docker/compose/releases/tag/v2.6.0

在我使用相同版本的 docker-compose 測試後,成功復現優先讀取 node 上的 env 的行為,我認為這是 docker-compose 的 bug

結論

客戶環境並沒有安裝 docker-compose-plugin,新版本的 docker 已經可以安裝 docker-compose-plugin 使包含 compose 的指令,且不會有沒有讀取 env_file 的行為,因此我建議客戶透過 升級 docker 版本 來啟動 container,這個問題也與 Harbor 無關,因此即使升級 Harbor 也可能發生這個問題。