Harbor Containers Fail to Start
Harbor 啟動失敗調查
這篇文章將介紹客戶端 Harbor
啟動時有機率發生錯誤的紀錄,並說明調查步驟及結果。
事件發生
某日客戶通報一個 issue,他們在啟動 Harbor
時,有機率會在啟動後,部分的 API 無法使用,可以透過訪問 80
port 看到 web 介面,但登入後點擊部分連結,卻會回覆 500 error。在錯誤發生後,經過多次重啟 container,有機率會恢復正常,否則就需要重安裝 Harbor
。
客戶詢問我這是不是 Harbor
的一個 bug,是否可以升級 Harbor
版本來解決,但我在 github 上沒有看到相關的 issue。
客戶的 Harbor
版本為 v2.5.6
。
調查經過
調查 container 狀態
Harbor
的啟動方式是透過 docker-compose
工具,設定檔為 install.sh
所生成的 docker-compose.yaml
,啟動時會建立以下的幾個服務,可以透過 docker ps
查看 container 的狀態:
harbor-jobservice
nginx
harbor-core
registry
harbor-db
redis
harbor-portal
registryctl
harbor-log
我發現 harbor-jobservice
一直處於 restart
的狀態,透過 docker logs -f harbor-jobservice
查看 logs 後,發現如下的錯誤訊息:
2024-01-24T01:44:33Z [ERROR] [/pkg/config/rest/rest.go:50]: Failed on load rest config err:invalid character '<' looking for beginning of value, url:http://core:8080/api/v2.0/internalconfig
2024-01-24T01:44:33Z [INFO] [/pkg/config/rest/rest.go:47]: get configuration from url: http://core:8080/api/v2.0/internalconfig
2024-01-24T01:44:33Z [INFO] [/jobservice/logger/sweeper_controller.go:97]: 0 outdated log entries are sweepped by sweeper *sweeper.FileSweeper
2024-01-24T01:44:33Z [ERROR] [/pkg/config/rest/rest.go:50]: Failed on load rest config err:invalid character '<' looking for beginning of value, url:http://core:8080/api/v2.0/internalconfig
2024-01-24T01:44:33Z [ERROR] [/jobservice/job/impl/context.go:75]: Job context initialization error: failed to load rest config
2024-01-24T01:44:33Z [INFO] [/jobservice/job/impl/context.go:78]: Retry in 9 seconds
2024-01-24T01:44:42Z [INFO] [/pkg/config/rest/rest.go:47]: get configuration from url: http://core:8080/api/v2.0/internalconfig
2024-01-24T01:44:42Z [ERROR] [/pkg/config/rest/rest.go:50]: Failed on load rest config err:invalid character '<' looking for beginning of value, url:http://core:8080/api/v2.0/internalconfig
2024-01-24T01:44:42Z [ERROR] [/jobservice/job/impl/context.go:75]: Job context initialization error: failed to load rest config
2024-01-24T01:44:42Z [INFO] [/jobservice/job/impl/context.go:78]: Retry in 13 seconds
2024-01-24T01:44:55Z [INFO] [/pkg/config/rest/rest.go:47]: get configuration from url: http://core:8080/api/v2.0/internalconfig
2024-01-24T01:44:55Z [ERROR] [/pkg/config/rest/rest.go:50]: Failed on load rest config err:invalid character '<' looking for beginning of value, url:http://core:8080/api/v2.0/internalconfig
2024-01-24T01:44:55Z [ERROR] [/jobservice/job/impl/context.go:75]: Job context initialization error: failed to load rest config
2024-01-24T01:44:55Z [INFO] [/jobservice/job/impl/context.go:78]: Retry in 19 seconds
2024-01-24T01:45:14Z [INFO] [/pkg/config/rest/rest.go:47]: get configuration from url: http://core:8080/api/v2.0/internalconfig
2024-01-24T01:45:14Z [ERROR] [/pkg/config/rest/rest.go:50]: Failed on load rest config err:invalid character '<' looking for beginning of value, url:http://core:8080/api/v2.0/internalconfig
2024-01-24T01:45:14Z [ERROR] [/jobservice/job/impl/context.go:75]: Job context initialization error: failed to load rest config
2024-01-24T01:45:14Z [INFO] [/jobservice/job/impl/context.go:78]: Retry in 29 seconds
2024-01-24T01:45:43Z [INFO] [/pkg/config/rest/rest.go:47]: get configuration from url: http://core:8080/api/v2.0/internalconfig
2024-01-24T01:45:43Z [ERROR] [/pkg/config/rest/rest.go:50]: Failed on load rest config err:invalid character '<' looking for beginning of value, url:http://core:8080/api/v2.0/internalconfig
2024-01-24T01:45:43Z [ERROR] [/jobservice/job/impl/context.go:75]: Job context initialization error: failed to load rest config
2024-01-24T01:45:43Z [FATAL] [/jobservice/main.go:87]: initialize job context error: job context initialization error: failed to load rest config (5 times tried)
從錯誤訊息中,大概可以看出,harbor-jobservice
不斷嘗試訪問 http://core:8080/api/v2.0/internalconfig
,但卻發生 Failed on load
無法讀取的問題,原因與 response 的內容有關,可能有包含 <
:
invalid character '<' looking for beginning of value
調查 API 問題
為了精準定位錯誤,需要先調查 container 的行為,透過 cat docker-compose.yaml
查看 container 的設定:
...
core:
image: goharbor/harbor-core:v2.5.6
container_name: harbor-core
env_file:
- ./common/config/core/env
restart: always
cap_drop:
- ALL
cap_add:
- SETGID
- SETUID
volumes:
- /data/ca_download/:/etc/core/ca/:z
- /data/:/data/:z
- ./common/config/core/certificates/:/etc/core/certificates/:z
- type: bind
source: ./common/config/core/app.conf
target: /etc/core/app.conf
- type: bind
source: /data/secret/core/private_key.pem
target: /etc/core/private_key.pem
- type: bind
source: /data/secret/keys/secretkey
target: /etc/core/key
- type: bind
source: ./common/config/shared/trust-certificates
target: /harbor_cust_cert
networks:
harbor:
depends_on:
- log
- registry
- redis
- postgresql
logging:
driver: "syslog"
options:
syslog-address: "tcp://localhost:1514"
tag: "core"
...
jobservice:
image: goharbor/harbor-jobservice:v2.5.6
container_name: harbor-jobservice
env_file:
- ./common/config/jobservice/env
restart: always
cap_drop:
- ALL
cap_add:
- CHOWN
- SETGID
- SETUID
volumes:
- /data/job_logs:/var/log/jobs:z
- type: bind
source: ./common/config/jobservice/config.yml
target: /etc/jobservice/config.yml
- type: bind
source: ./common/config/shared/trust-certificates
target: /harbor_cust_cert
networks:
- harbor
depends_on:
- core
logging:
driver: "syslog"
options:
syslog-address: "tcp://localhost:1514"
tag: "jobservice"
...
透過以下指令,調查 harbor-jobservice
的 entrypoint,可以得知 container 在啟動時執行 /harbor/entrypoint.sh
:
docker inspect -f '{{.Config.Entrypoint}}' harbor-jobservice
[/harbor/entrypoint.sh]
嘗試使用與 harbor-jobservice
相同 namespace 的環境,查看 /harbor/entrypoint.sh
的內容:
docker exec harbor-jobservice cat /harbor/entrypoint.sh
#!/bin/sh
set -e
/harbor/install_cert.sh
exec /harbor/harbor_jobservice -c /etc/jobservice/config.yml
從 /harbor/entrypoint.sh
猜測 container 的行為,/harbor/harbor_jobservice
存取 http://core:8080/api/v2.0/internalconfig
時,發生了非預期的錯誤,導致啟動失敗:
透過 curl
指令存取 http://core:8080/api/v2.0/internalconfig
,或許能看到回傳的內容:
docker exec harbor-jobservice curl http://core:8080/api/v2.0/internalconfig
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 99 100 99 0 0 12895 0 --:--:-- --:--:-- --:--:-- 14142
{"errors":[{"code":"UNAUTHORIZED","message":"only internal service is allowed to call this API"}]}
可以看到 core:8080
回傳一段 JSON format 的 response 內容,並沒有 character <
,由於 harbor-jobservice
呼叫 API 時會帶入 token,透過 package document 得知,呼叫 API 時需要加入 header Authorization : Harbor-Secret <secret>
,而 token 則被記錄在 common/config/jobservice/env
上:
https://pkg.go.dev/github.com/goharbor/harbor/src/jobservice#section-readme
存取 ./common/config/jobservice/env
,透過查看變數 JOBSERVICE_SECRET
獲取 token,重新使用 curl
:
請將
<secret>
替換為 token 內容
docker exec harbor-jobservice curl -H "Authorization: Harbor-Secret <secret>" http://core:8080/api/v2.0/internalconfig
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 3962 0 3962 0 0 1029k 0 --:--:-- --:--:-- --:--:-- 1289k
{"admin_initial_password":{"editable":true,"value":"Admin12345"},"auth_mode":{"editable":false,"value":"ldap_auth"},"cache_enabled":{"editable":true,"value":false},"cache_expire_hours":{"editable":true,"value":24},"core_local_url":{"editable":true,"value":"http://127.0.0.1:8080"},"core_url":{"editable":true,"value":"http://core:8080"},"database_type":{"editable":true,"value":"postgresql"},"execution_status_refresh_interval_seconds":{"editable":true,"value":30},"ext_endpoint":{"editable":true,"value":"https://harbor.cnsdomain.com"},"gdpr_delete_user":{"editable":true,"value":false},"http_authproxy_skip_search":{"editable":true,"value":false},"http_authproxy_verify_cert":{"editable":true,"value":true},"jobservice_url":{"editable":true,"value":"http://jobservice:8080"},"ldap_base_dn":{"editable":true,"value":"DC=ad,DC=cnsdomain,DC=com"},"ldap_group_membership_attribute":{"editable":true,"value":"memberof"},"ldap_group_search_scope":{"editable":true,"value":2},"ldap_scope":{"editable":true,"value":2},"ldap_search_dn":{"editable":true,"value":"CN=Administrator,CN=Users,DC=ad,DC=cnsdomain,DC=com"},"ldap_search_password":{"editable":true,"value":"Admin12345"},"ldap_timeout":{"editable":true,"value":5},"ldap_uid":{"editable":true,"value":"cn"},"ldap_url":{"editable":true,"value":"192.168.50.35"},"ldap_verify_cert":{"editable":true,"value":true},"max_job_workers":{"editable":true,"value":10},"metric_enable":{"editable":true,"value":false},"metric_path":{"editable":true,"value":"/metrics"},"metric_port":{"editable":true,"value":9090},"notary_url":{"editable":true,"value":"http://notary-server:4443"},"notification_enable":{"editable":true,"value":true},"oidc_auto_onboard":{"editable":true,"value":false},"oidc_extra_redirect_parms":{"editable":true,"value":"{}"},"oidc_verify_cert":{"editable":true,"value":true},"postgresql_conn_max_idle_time":{"editable":true,"value":"0"},"postgresql_conn_max_lifetime":{"editable":true,"value":"5m"},"postgresql_database":{"editable":true,"value":"registry"},"postgresql_host":{"editable":true,"value":"postgresql"},"postgresql_max_idle_conns":{"editable":true,"value":100},"postgresql_max_open_conns":{"editable":true,"value":900},"postgresql_password":{"editable":true,"value":"root123"},"postgresql_port":{"editable":true,"value":5432},"postgresql_sslmode":{"editable":true,"value":"disable"},"postgresql_username":{"editable":true,"value":"postgres"},"primary_auth_mode":{"editable":true,"value":true},"project_creation_restriction":{"editable":true,"value":"everyone"},"pull_audit_log_disable":{"editable":true,"value":false},"pull_count_update_disable":{"editable":true,"value":false},"pull_time_update_disable":{"editable":true,"value":false},"quota_per_project_enable":{"editable":true,"value":true},"read_only":{"editable":true,"value":false},"registry_controller_url":{"editable":true,"value":"http://registryctl:8080"},"registry_storage_provider_name":{"editable":true,"value":"filesystem"},"registry_url":{"editable":true,"value":"http://registry:5000"},"robot_name_prefix":{"editable":true,"value":"robot$"},"robot_scanner_name_prefix":{"editable":true,"value":"H8T9ACT4"},"robot_token_duration":{"editable":true,"value":30},"scanner_skip_update_pulltime":{"editable":true,"value":false},"self_registration":{"editable":true,"value":false},"session_timeout":{"editable":true,"value":60},"skip_audit_log_database":{"editable":true,"value":false},"storage_per_project":{"editable":true,"value":-1},"token_expiration":{"editable":true,"value":30},"token_service_url":{"editable":true,"value":"http://core:8080/service/token"},"trace_enabled":{"editable":true,"value":false},"trace_jaeger_agent_port":{"editable":true,"value":"6831"},"trace_sample_rate":{"editable":true,"value":1},"trivy_adapter_url":{"editable":true,"value":"http://trivy-adapter:8080"},"uaa_verify_cert":{"editable":true,"value":false},"with_notary":{"editable":true,"value":false},"with_trivy":{"editable":true,"value":false}}
由於 response 的內容也沒有出現 character <
,因此我懷疑 curl
與 harbor-jobservice
在 call API 時的行為不一樣,需要錄製封包才能得知兩者的差異。
錄製封包
可以透過 tcpdump
指定固定的 veth
,但 container 不斷地 restart,會使 docker
不斷更換 veth
。為了使 veth
固定,需要改寫 container 的 entrypoint,修改 docker-compose.yaml
並加入 entrypoint
,透過 docker-compose up -d
重啟 container:
...
jobservice:
image: goharbor/harbor-jobservice:v2.5.6
container_name: harbor-jobservice
entrypoint: tail -f /dev/null
...
查看 container 在 node 上的 namespace 的 veth
id:
docker exec harbor-jobservice cat /sys/class/net/eth0/iflink
<ifid>
透過 ip link
可以查看 node 上的 network interface,找到與 <ifid>
相同 id 的 veth
後,複製名稱,使用 tcpdump
指令錄製封包,並加入背景執行 &
,結果存放至 /tmp/network1.pcap
:
請替換
<veth-name>
為veth
的名稱
tcpdump -i <veth-name> -w /tmp/network1.pcap &
執行 harbor-jobservice
後按下 ctrl+c
關閉 session,然後執行 fg
後按下 ctrl+c
結束 process:
docker exec harbor-jobservice /harbor/harbor_jobservice -c /etc/jobservice/config.yml
重複以上作業,將錄製目錄改為 /tmp/network2.pcap
,執行 curl
後錄製封包,然後執行 fg
後按下 ctrl+c
結束 process:
docker exec harbor-jobservice curl -H "Authorization: Harbor-Secret <secret>" http://core:8080/api/v2.0/internalconfig
可以使用 tcpdump
讀取 pcap
檔案:
-nn
將 port 從 name 轉換成 number
tcpdump -nn -r /tmp/network1.pcap
tcpdump -nn -r /tmp/network2.pcap
以下是 /tmp/network1.pcap
的內容,透過 wireshark
打開:
以下是 /tmp/network2.pcap
的內容,透過 wireshark
打開:
從 pcap
的內容可以看到,harbor-jobservice
與 curl
呼叫 API 的 IP 地址不同,雖然都是訪問 core:8080
,但 curl
訪問的是 docker dns
解析後的 container IP,但 harbor-jobservice
卻訪問 HTTP_PROXY
的 IP,這不應該發生,因為 common/config/jobservice/env
上會設定 NO_PROXY
為 container 的名稱,如下所示:
NO_PROXY=.internal,127.0.0.1,localhost,registry,registryctl,exporter,nginx,trivy-adapter,redis,notary-signer,notary-server,db,.local,postgresql,jobservice,core,portal,log
透過 docker inspect
查看 harbor-jobservice
的 env ,發現 NO_PROXY
的內容與 env_file
不符:
docker inspect -f '{{.Config.Env}}' harbor-jobservice
Docker Compose Bug
查看 container 的 env 後發現,所使用的 env 是 node 上的 env,而啟動 container 的方式是透過 docker-compose up
。
換句話說 docker-compose
本來應該在新增 container 時,賦予 container 的 config 的 env 為 env_file
的設定,但卻優先讀取 node 上的 env variable,且這個問題時好時壞的,並不是每次發生。我在 docker-compose
的 github 上並沒有看到相關的 issue,而客戶所使用的 docker-compose
版本為 v2.6.0
。
https://github.com/docker/compose/releases/tag/v2.6.0
在我使用相同版本的 docker-compose
測試後,成功復現優先讀取 node 上的 env 的行為,我認為這是 docker-compose
的 bug。
結論
客戶環境並沒有安裝 docker-compose-plugin
,新版本的 docker
已經可以安裝 docker-compose-plugin
使包含 compose
的指令,且不會有沒有讀取 env_file
的行為,因此我建議客戶透過 升級 docker 版本 來啟動 container,這個問題也與 Harbor
無關,因此即使升級 Harbor
也可能發生這個問題。