This article describes how to troubleshoot ClusterResourcePlacementRolloutStarted issues that occur when you propagate resources by using the ClusterResourcePlacement API object in Azure Kubernetes Fleet Manager.
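Symptoms
When you use the ClusterResourcePlacement API object in Azure Kubernetes Fleet Manager to propagate resources, the selected resources aren't rolled out to all scheduled clusters, and the ClusterResourcePlacementRolloutStarted condition status is False.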
Note
To learn more about why the rollout doesn't start, check the rollout controller logs.
Cause
The ClusterResourcePlacement rollout is blocked because the RollingUpdate configuration is too strict.
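Troubleshooting steps
- In the ClusterResourcePlacement status section, check placementStatuses to identify the clusters whose RolloutStarted status is False.
- Locate the corresponding ClusterResourceBinding for each identified cluster. For more information, see How can I find the latest ClusterResourceBinding resource? This resource should indicate the Work status (whether it was created or updated).
- Verify the values of maxUnavailable and maxSurge to make sure that they match your expectations.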
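The first step can be sketched in a few lines of Python. This is a minimal sketch, not part of Fleet Manager: the helper name clusters_with_rollout_not_started and the hand-written sample_status dictionary are illustrative assumptions. In practice, the dictionary would be the status section parsed from the JSON or YAML output of the ClusterResourcePlacement object.

```python
from typing import Any


def clusters_with_rollout_not_started(status: dict[str, Any]) -> list[str]:
    """Return the names of clusters whose RolloutStarted condition is "False".

    `status` is assumed to be the parsed `status` section of a
    ClusterResourcePlacement object (hypothetical helper, for illustration).
    """
    blocked = []
    for placement in status.get("placementStatuses", []):
        for cond in placement.get("conditions", []):
            # Condition statuses are strings ("True"/"False"), not booleans.
            if cond.get("type") == "RolloutStarted" and cond.get("status") == "False":
                blocked.append(placement.get("clusterName", "<unknown>"))
    return blocked


# Hand-written sample shaped like a ClusterResourcePlacement status section.
sample_status = {
    "placementStatuses": [
        {
            "clusterName": "kind-cluster-2",
            "conditions": [
                {"type": "Scheduled", "status": "True"},
                {"type": "RolloutStarted", "status": "False",
                 "reason": "RolloutNotStartedYet"},
            ],
        },
        {
            "clusterName": "kind-cluster-1",
            "conditions": [
                {"type": "Scheduled", "status": "True"},
                {"type": "RolloutStarted", "status": "True",
                 "reason": "RolloutStarted"},
            ],
        },
    ]
}

print(clusters_with_rollout_not_started(sample_status))
```

The clusters that this helper returns are the ones whose ClusterResourceBinding is worth inspecting next.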
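Case study
In the following example, the ClusterResourcePlacement tries to propagate a namespace to three member clusters. However, when the ClusterResourcePlacement is first created, the namespace doesn't exist on the hub cluster, and the fleet currently contains only two member clusters, named kind-cluster-1 and kind-cluster-2.
ClusterResourcePlacement spec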
spec:
  policy:
    numberOfClusters: 3
    placementType: PickN
  resourceSelectors:
  - group: ""
    kind: Namespace
    name: test-ns
    version: v1
  revisionHistoryLimit: 10
  strategy:
    type: RollingUpdate
ClusterResourcePlacement status
status:
  conditions:
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: could not find all the clusters needed as specified by the scheduling
      policy
    observedGeneration: 1
    reason: SchedulingPolicyUnfulfilled
    status: "False"
    type: ClusterResourcePlacementScheduled
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: All 2 cluster(s) start rolling out the latest resource
    observedGeneration: 1
    reason: RolloutStarted
    status: "True"
    type: ClusterResourcePlacementRolloutStarted
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: No override rules are configured for the selected resources
    observedGeneration: 1
    reason: NoOverrideSpecified
    status: "True"
    type: ClusterResourcePlacementOverridden
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: Works(s) are successfully created or updated in the 2 target clusters'
      namespaces
    observedGeneration: 1
    reason: WorkSynchronized
    status: "True"
    type: ClusterResourcePlacementWorkSynchronized
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: The selected resources are successfully applied to 2 clusters
    observedGeneration: 1
    reason: ApplySucceeded
    status: "True"
    type: ClusterResourcePlacementApplied
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: The selected resources in 2 cluster are available now
    observedGeneration: 1
    reason: ResourceAvailable
    status: "True"
    type: ClusterResourcePlacementAvailable
  observedResourceIndex: "0"
  placementStatuses:
  - clusterName: kind-cluster-2
    conditions:
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: 'Successfully scheduled resources for placement in kind-cluster-2 (affinity
        score: 0, topology spread score: 0): picked by scheduling policy'
      observedGeneration: 1
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: Detected the new changes on the resources and started the rollout process
      observedGeneration: 1
      reason: RolloutStarted
      status: "True"
      type: RolloutStarted
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: No override rules are configured for the selected resources
      observedGeneration: 1
      reason: NoOverrideSpecified
      status: "True"
      type: Overridden
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: All of the works are synchronized to the latest
      observedGeneration: 1
      reason: AllWorkSynced
      status: "True"
      type: WorkSynchronized
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: All corresponding work objects are applied
      observedGeneration: 1
      reason: AllWorkHaveBeenApplied
      status: "True"
      type: Applied
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: All corresponding work objects are available
      observedGeneration: 1
      reason: AllWorkAreAvailable
      status: "True"
      type: Available
  - clusterName: kind-cluster-1
    conditions:
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: 'Successfully scheduled resources for placement in kind-cluster-1 (affinity
        score: 0, topology spread score: 0): picked by scheduling policy'
      observedGeneration: 1
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: Detected the new changes on the resources and started the rollout process
      observedGeneration: 1
      reason: RolloutStarted
      status: "True"
      type: RolloutStarted
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: No override rules are configured for the selected resources
      observedGeneration: 1
      reason: NoOverrideSpecified
      status: "True"
      type: Overridden
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: All of the works are synchronized to the latest
      observedGeneration: 1
      reason: AllWorkSynced
      status: "True"
      type: WorkSynchronized
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: All corresponding work objects are applied
      observedGeneration: 1
      reason: AllWorkHaveBeenApplied
      status: "True"
      type: Applied
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: All corresponding work objects are available
      observedGeneration: 1
      reason: AllWorkAreAvailable
      status: "True"
      type: Available
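The preceding output indicates that the test-ns namespace doesn't exist on the hub cluster at this point, and it shows the following ClusterResourcePlacement condition statuses:
- The ClusterResourcePlacementScheduled condition status is False because the policy asks for three clusters, but the scheduler can place resources on only the two clusters that are currently available and joined.
- The ClusterResourcePlacementRolloutStarted condition status is True because the rollout process has started for the two selected clusters.
- The ClusterResourcePlacementOverridden condition status is True because no override rules are configured for the selected resources.
- The ClusterResourcePlacementWorkSynchronized condition status is True.
- The ClusterResourcePlacementApplied condition status is True.
- The ClusterResourcePlacementAvailable condition status is True.
To propagate the namespace to the relevant clusters, create the test-ns namespace on the hub cluster.
ClusterResourcePlacement status after the test-ns namespace is created on the hub cluster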
status:
  conditions:
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: could not find all the clusters needed as specified by the scheduling
      policy
    observedGeneration: 1
    reason: SchedulingPolicyUnfulfilled
    status: "False"
    type: ClusterResourcePlacementScheduled
  - lastTransitionTime: "2024-05-07T23:13:51Z"
    message: The rollout is being blocked by the rollout strategy in 2 cluster(s)
    observedGeneration: 1
    reason: RolloutNotStartedYet
    status: "False"
    type: ClusterResourcePlacementRolloutStarted
  observedResourceIndex: "1"
  placementStatuses:
  - clusterName: kind-cluster-2
    conditions:
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: 'Successfully scheduled resources for placement in kind-cluster-2 (affinity
        score: 0, topology spread score: 0): picked by scheduling policy'
      observedGeneration: 1
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastTransitionTime: "2024-05-07T23:13:51Z"
      message: The rollout is being blocked by the rollout strategy
      observedGeneration: 1
      reason: RolloutNotStartedYet
      status: "False"
      type: RolloutStarted
  - clusterName: kind-cluster-1
    conditions:
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: 'Successfully scheduled resources for placement in kind-cluster-1 (affinity
        score: 0, topology spread score: 0): picked by scheduling policy'
      observedGeneration: 1
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastTransitionTime: "2024-05-07T23:13:51Z"
      message: The rollout is being blocked by the rollout strategy
      observedGeneration: 1
      reason: RolloutNotStartedYet
      status: "False"
      type: RolloutStarted
  selectedResources:
  - kind: Namespace
    name: test-ns
    version: v1
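In the preceding output, the ClusterResourcePlacementScheduled condition status is False. The ClusterResourcePlacementRolloutStarted condition status is also False, with the message: The rollout is being blocked by the rollout strategy in 2 cluster(s).
Check the latest ClusterResourceSnapshot by running the command described in How can I find the latest ClusterResourceBinding resource?
Latest ClusterResourceSnapshot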
apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourceSnapshot
metadata:
  annotations:
    kubernetes-fleet.io/number-of-enveloped-object: "0"
    kubernetes-fleet.io/number-of-resource-snapshots: "1"
    kubernetes-fleet.io/resource-hash: 72344be6e268bc7af29d75b7f0aad588d341c228801aab50d6f9f5fc33dd9c7c
  creationTimestamp: "2024-05-07T23:13:51Z"
  generation: 1
  labels:
    kubernetes-fleet.io/is-latest-snapshot: "true"
    kubernetes-fleet.io/parent-CRP: crp-3
    kubernetes-fleet.io/resource-index: "1"
  name: crp-3-1-snapshot
  ownerReferences:
  - apiVersion: placement.kubernetes-fleet.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: ClusterResourcePlacement
    name: crp-3
    uid: b4f31b9a-971a-480d-93ac-93f093ee661f
  resourceVersion: "14434"
  uid: 85ee0e81-92c9-4362-932b-b0bf57d78e3f
spec:
  selectedResources:
  - apiVersion: v1
    kind: Namespace
    metadata:
      labels:
        kubernetes.io/metadata.name: test-ns
      name: test-ns
    spec:
      finalizers:
      - kubernetes
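In the ClusterResourceSnapshot spec, the selectedResources section now shows the test-ns namespace.
Check whether the ClusterResourceBinding for kind-cluster-1 was updated after the test-ns namespace was created. For more information, see How can I find the latest ClusterResourceBinding resource?
ClusterResourceBinding for kind-cluster-1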
apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourceBinding
metadata:
  creationTimestamp: "2024-05-07T23:08:53Z"
  finalizers:
  - kubernetes-fleet.io/work-cleanup
  generation: 2
  labels:
    kubernetes-fleet.io/parent-CRP: crp-3
  name: crp-3-kind-cluster-1-7114c253
  resourceVersion: "14438"
  uid: 0db4e480-8599-4b40-a1cc-f33bcb24b1a7
spec:
  applyStrategy:
    type: ClientSideApply
  clusterDecision:
    clusterName: kind-cluster-1
    clusterScore:
      affinityScore: 0
      priorityScore: 0
    reason: picked by scheduling policy
    selected: true
  resourceSnapshotName: crp-3-0-snapshot
  schedulingPolicySnapshotName: crp-3-0
  state: Bound
  targetCluster: kind-cluster-1
status:
  conditions:
  - lastTransitionTime: "2024-05-07T23:13:51Z"
    message: The resources cannot be updated to the latest because of the rollout
      strategy
    observedGeneration: 2
    reason: RolloutNotStartedYet
    status: "False"
    type: RolloutStarted
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: No override rules are configured for the selected resources
    observedGeneration: 2
    reason: NoOverrideSpecified
    status: "True"
    type: Overridden
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: All of the works are synchronized to the latest
    observedGeneration: 2
    reason: AllWorkSynced
    status: "True"
    type: WorkSynchronized
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: All corresponding work objects are applied
    observedGeneration: 2
    reason: AllWorkHaveBeenApplied
    status: "True"
    type: Applied
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: All corresponding work objects are available
    observedGeneration: 2
    reason: AllWorkAreAvailable
    status: "True"
    type: Available
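The ClusterResourceBinding remains unchanged. In the ClusterResourceBinding spec, resourceSnapshotName still references the old ClusterResourceSnapshot name (crp-3-0-snapshot instead of crp-3-1-snapshot). This issue occurs when the user provides no explicit RollingUpdate input, because the following default values are applied:
- The maxUnavailable value is configured as 25% × 3 (the desired number of clusters), which is rounded to 1.
- The maxSurge value is configured as 25% × 3 (the desired number of clusters), which is rounded to 1.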
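The default-value arithmetic above can be reproduced with a short calculation. This is a minimal sketch: the function name scaled_value is hypothetical, and rounding up is an assumption made here because it matches the "rounded to 1" result stated above. It isn't taken from the Fleet Manager source code.

```python
def scaled_value(percent: int, desired: int) -> int:
    """Scale a percentage against the desired cluster count.

    Assumption: the fractional result is rounded up, which agrees with the
    article's "25% x 3, rounded to 1". Integer arithmetic is used so that
    no floating-point rounding is involved.
    """
    return (percent * desired + 99) // 100


desired_clusters = 3  # numberOfClusters in the ClusterResourcePlacement spec

# 25% of 3 is 0.75, which becomes 1 under the rounding assumed here.
print("maxUnavailable:", scaled_value(25, desired_clusters))
print("maxSurge:", scaled_value(25, desired_clusters))
```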
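Why isn't the ClusterResourceBinding updated?
Initially, when the ClusterResourcePlacement was created, two ClusterResourceBindings were generated. Because the rollout strategy doesn't apply to this initial phase, the ClusterResourcePlacementRolloutStarted condition was set to True.
After the test-ns namespace was created on the hub cluster, the rollout controller tried to update the two existing ClusterResourceBindings. However, maxUnavailable is 1 and one of the three requested member clusters is missing, so the RollingUpdate configuration is too strict: with a target of three clusters, at least two bindings must stay available, which means that neither of the two existing bindings can be taken out for an update.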
Note
During the update, if one of the bindings fails to apply, that also violates the RollingUpdate configuration, because maxUnavailable is set to 1.
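The blocking condition can be illustrated with a deliberately simplified model. This sketch is not the rollout controller's actual implementation: the function bindings_allowed_to_update and its parameters are hypothetical, and the only rule modeled is that the number of available bindings must not drop below the target minus maxUnavailable.

```python
def bindings_allowed_to_update(target: int, max_unavailable: int,
                               ready_bindings: int) -> int:
    """Return how many ready bindings may be updated at the same time.

    Simplified assumption: a binding counts as unavailable while it is being
    updated, and at least (target - max_unavailable) bindings must stay
    available throughout the rollout.
    """
    min_available = target - max_unavailable
    return max(0, ready_bindings - min_available)


# Case study: target of 3 clusters, default maxUnavailable of 1, 2 bindings.
print(bindings_allowed_to_update(target=3, max_unavailable=1, ready_bindings=2))

# Relaxing maxUnavailable to 2 leaves room to update one binding at a time.
print(bindings_allowed_to_update(target=3, max_unavailable=2, ready_bindings=2))
```

Under this model, the first call yields 0, which corresponds to the blocked rollout in the case study, and the second call yields 1.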
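Resolution
To resolve the issue in this scenario, consider manually setting maxUnavailable to a value greater than 1 to relax the RollingUpdate configuration. Alternatively, you can join a third member cluster.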
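One way to express the first option is as a JSON merge patch against the ClusterResourcePlacement. The following is a sketch under stated assumptions: the field path spec.strategy.rollingUpdate.maxUnavailable is assumed from the strategy section of the spec shown earlier, the helper name max_unavailable_patch is hypothetical, and the value 2 is only an example of "greater than 1". Verify the field names against the API version that your fleet uses before you apply anything.

```python
import json


def max_unavailable_patch(value: int) -> str:
    """Build a JSON merge patch that sets maxUnavailable on a placement.

    Assumption: the rolling update settings live under
    spec.strategy.rollingUpdate, next to the strategy type shown in the spec.
    """
    patch = {
        "spec": {
            "strategy": {
                "type": "RollingUpdate",
                "rollingUpdate": {"maxUnavailable": value},
            }
        }
    }
    return json.dumps(patch)


# Example value only; any value greater than 1 relaxes the configuration.
print(max_unavailable_patch(2))
```

The printed document could then be supplied to a merge-style patch of the placement (crp-3 in this case study), for example through kubectl patch with --type merge.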