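Resource propagation failure: ClusterResourcePlacementRolloutStarted is False

This article describes how to troubleshoot ClusterResourcePlacementRolloutStarted issues that occur when you propagate resources by using the ClusterResourcePlacement API object in Azure Kubernetes Fleet Manager.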

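Symptoms

When you use the ClusterResourcePlacement API object in Azure Kubernetes Fleet Manager to propagate resources, the selected resources aren't rolled out in all scheduled clusters, and the ClusterResourcePlacementRolloutStarted condition status shows as False.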

Note

To learn more about why the rollout doesn't start, you can check the rollout controller logs.

Cause

The ClusterResourcePlacement rollout strategy is blocked because the RollingUpdate configuration is too strict.

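Troubleshooting steps

  1. In the ClusterResourcePlacement status section, check placementStatuses to identify the clusters whose RolloutStarted status is False.
  2. Locate the corresponding ClusterResourceBinding for each identified cluster. For more information, see How can I find the latest ClusterResourceBinding resource? This resource should indicate the Work status (whether the Work was created or updated).
  3. Verify the values of maxUnavailable and maxSurge to make sure that they match your expectations.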
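
Step 1 can be scripted when many clusters are involved. The following is a minimal sketch, assuming the ClusterResourcePlacement was exported as JSON (for example, by using kubectl get with -o json) and loaded into a Python dictionary. The function name and the trimmed sample data are hypothetical; the field names follow the status output shown later in this article.

```python
def clusters_with_rollout_not_started(crp):
    """Return the names of clusters whose RolloutStarted condition is "False"."""
    blocked = []
    for placement in crp.get("status", {}).get("placementStatuses", []):
        for condition in placement.get("conditions", []):
            if condition.get("type") == "RolloutStarted" and condition.get("status") == "False":
                blocked.append(placement.get("clusterName"))
                break
    return blocked


# Hypothetical, trimmed sample that mirrors the status layout in this article.
sample_crp = {
    "status": {
        "placementStatuses": [
            {
                "clusterName": "kind-cluster-2",
                "conditions": [
                    {"type": "Scheduled", "status": "True"},
                    {"type": "RolloutStarted", "status": "False"},
                ],
            },
            {
                "clusterName": "kind-cluster-1",
                "conditions": [
                    {"type": "Scheduled", "status": "True"},
                    {"type": "RolloutStarted", "status": "True"},
                ],
            },
        ]
    }
}

print(clusters_with_rollout_not_started(sample_crp))  # ['kind-cluster-2']
```

Note that Kubernetes condition statuses are strings ("True", "False", "Unknown"), so the comparison is against the string "False" and not a Boolean.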

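Case study

In the following example, the ClusterResourcePlacement tries to propagate a namespace to three member clusters. However, during the initial creation of the ClusterResourcePlacement, the namespace doesn't exist on the hub cluster, and the fleet currently includes only two member clusters, named kind-cluster-1 and kind-cluster-2.

ClusterResourcePlacement spec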

spec:
  policy:
    numberOfClusters: 3
    placementType: PickN
  resourceSelectors:
  - group: ""
    kind: Namespace
    name: test-ns
    version: v1
  revisionHistoryLimit: 10
  strategy:
    type: RollingUpdate

ClusterResourcePlacement status

status:
  conditions:
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: could not find all the clusters needed as specified by the scheduling
      policy
    observedGeneration: 1
    reason: SchedulingPolicyUnfulfilled
    status: "False"
    type: ClusterResourcePlacementScheduled
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: All 2 cluster(s) start rolling out the latest resource
    observedGeneration: 1
    reason: RolloutStarted
    status: "True"
    type: ClusterResourcePlacementRolloutStarted
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: No override rules are configured for the selected resources
    observedGeneration: 1
    reason: NoOverrideSpecified
    status: "True"
    type: ClusterResourcePlacementOverridden
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: Works(s) are successfully created or updated in the 2 target clusters'
      namespaces
    observedGeneration: 1
    reason: WorkSynchronized
    status: "True"
    type: ClusterResourcePlacementWorkSynchronized
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: The selected resources are successfully applied to 2 clusters
    observedGeneration: 1
    reason: ApplySucceeded
    status: "True"
    type: ClusterResourcePlacementApplied
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: The selected resources in 2 cluster are available now
    observedGeneration: 1
    reason: ResourceAvailable
    status: "True"
    type: ClusterResourcePlacementAvailable
  observedResourceIndex: "0"
  placementStatuses:
  - clusterName: kind-cluster-2
    conditions:
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: 'Successfully scheduled resources for placement in kind-cluster-2 (affinity
        score: 0, topology spread score: 0): picked by scheduling policy'
      observedGeneration: 1
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: Detected the new changes on the resources and started the rollout process
      observedGeneration: 1
      reason: RolloutStarted
      status: "True"
      type: RolloutStarted
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: No override rules are configured for the selected resources
      observedGeneration: 1
      reason: NoOverrideSpecified
      status: "True"
      type: Overridden
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: All of the works are synchronized to the latest
      observedGeneration: 1
      reason: AllWorkSynced
      status: "True"
      type: WorkSynchronized
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: All corresponding work objects are applied
      observedGeneration: 1
      reason: AllWorkHaveBeenApplied
      status: "True"
      type: Applied
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: All corresponding work objects are available
      observedGeneration: 1
      reason: AllWorkAreAvailable
      status: "True"
      type: Available
  - clusterName: kind-cluster-1
    conditions:
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: 'Successfully scheduled resources for placement in kind-cluster-1 (affinity
        score: 0, topology spread score: 0): picked by scheduling policy'
      observedGeneration: 1
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: Detected the new changes on the resources and started the rollout process
      observedGeneration: 1
      reason: RolloutStarted
      status: "True"
      type: RolloutStarted
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: No override rules are configured for the selected resources
      observedGeneration: 1
      reason: NoOverrideSpecified
      status: "True"
      type: Overridden
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: All of the works are synchronized to the latest
      observedGeneration: 1
      reason: AllWorkSynced
      status: "True"
      type: WorkSynchronized
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: All corresponding work objects are applied
      observedGeneration: 1
      reason: AllWorkHaveBeenApplied
      status: "True"
      type: Applied
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: All corresponding work objects are available
      observedGeneration: 1
      reason: AllWorkAreAvailable
      status: "True"
      type: Available

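The previous output indicates that the test-ns namespace never existed on the hub cluster, and it shows the following ClusterResourcePlacement condition statuses:

  • The ClusterResourcePlacementScheduled condition status shows as False because the specified policy aims to pick three clusters, but the scheduler can place resources on only the two clusters that are currently available and joined.
  • The ClusterResourcePlacementRolloutStarted condition status shows as True because the rollout process has started for the two selected clusters.
  • The ClusterResourcePlacementOverridden condition status shows as True because no override rules are configured for the selected resources.
  • The ClusterResourcePlacementWorkSynchronized condition status shows as True.
  • The ClusterResourcePlacementApplied condition status shows as True.
  • The ClusterResourcePlacementAvailable condition status shows as True.

To make sure that the namespace is propagated across the relevant clusters, create the test-ns namespace on the hub cluster.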

ClusterResourcePlacement status after the test-ns namespace is created on the hub cluster

status:
  conditions:
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: could not find all the clusters needed as specified by the scheduling
      policy
    observedGeneration: 1
    reason: SchedulingPolicyUnfulfilled
    status: "False"
    type: ClusterResourcePlacementScheduled
  - lastTransitionTime: "2024-05-07T23:13:51Z"
    message: The rollout is being blocked by the rollout strategy in 2 cluster(s)
    observedGeneration: 1
    reason: RolloutNotStartedYet
    status: "False"
    type: ClusterResourcePlacementRolloutStarted
  observedResourceIndex: "1"
  placementStatuses:
  - clusterName: kind-cluster-2
    conditions:
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: 'Successfully scheduled resources for placement in kind-cluster-2 (affinity
        score: 0, topology spread score: 0): picked by scheduling policy'
      observedGeneration: 1
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastTransitionTime: "2024-05-07T23:13:51Z"
      message: The rollout is being blocked by the rollout strategy
      observedGeneration: 1
      reason: RolloutNotStartedYet
      status: "False"
      type: RolloutStarted
  - clusterName: kind-cluster-1
    conditions:
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: 'Successfully scheduled resources for placement in kind-cluster-1 (affinity
        score: 0, topology spread score: 0): picked by scheduling policy'
      observedGeneration: 1
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastTransitionTime: "2024-05-07T23:13:51Z"
      message: The rollout is being blocked by the rollout strategy
      observedGeneration: 1
      reason: RolloutNotStartedYet
      status: "False"
      type: RolloutStarted
  selectedResources:
  - kind: Namespace
    name: test-ns
    version: v1

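In the previous output, the ClusterResourcePlacementScheduled condition status shows as False. The ClusterResourcePlacementRolloutStarted status also shows as False, together with the following message: The rollout is being blocked by the rollout strategy in 2 cluster(s).

Check the latest ClusterResourceSnapshot by running the command that's described in How can I find the latest ClusterResourceBinding resource?

Latest ClusterResourceSnapshot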

apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourceSnapshot
metadata:
  annotations:
    kubernetes-fleet.io/number-of-enveloped-object: "0"
    kubernetes-fleet.io/number-of-resource-snapshots: "1"
    kubernetes-fleet.io/resource-hash: 72344be6e268bc7af29d75b7f0aad588d341c228801aab50d6f9f5fc33dd9c7c
  creationTimestamp: "2024-05-07T23:13:51Z"
  generation: 1
  labels:
    kubernetes-fleet.io/is-latest-snapshot: "true"
    kubernetes-fleet.io/parent-CRP: crp-3
    kubernetes-fleet.io/resource-index: "1"
  name: crp-3-1-snapshot
  ownerReferences:
  - apiVersion: placement.kubernetes-fleet.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: ClusterResourcePlacement
    name: crp-3
    uid: b4f31b9a-971a-480d-93ac-93f093ee661f
  resourceVersion: "14434"
  uid: 85ee0e81-92c9-4362-932b-b0bf57d78e3f
spec:
  selectedResources:
  - apiVersion: v1
    kind: Namespace
    metadata:
      labels:
        kubernetes.io/metadata.name: test-ns
      name: test-ns
    spec:
      finalizers:
      - kubernetes

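In the ClusterResourceSnapshot spec, the selectedResources section now shows the test-ns namespace.

Check whether the ClusterResourceBinding for kind-cluster-1 was updated after the test-ns namespace was created. For more information, see How can I find the latest ClusterResourceBinding resource?

ClusterResourceBinding for kind-cluster-1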

apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourceBinding
metadata:
  creationTimestamp: "2024-05-07T23:08:53Z"
  finalizers:
  - kubernetes-fleet.io/work-cleanup
  generation: 2
  labels:
    kubernetes-fleet.io/parent-CRP: crp-3
  name: crp-3-kind-cluster-1-7114c253
  resourceVersion: "14438"
  uid: 0db4e480-8599-4b40-a1cc-f33bcb24b1a7
spec:
  applyStrategy:
    type: ClientSideApply
  clusterDecision:
    clusterName: kind-cluster-1
    clusterScore:
      affinityScore: 0
      priorityScore: 0
    reason: picked by scheduling policy
    selected: true
  resourceSnapshotName: crp-3-0-snapshot
  schedulingPolicySnapshotName: crp-3-0
  state: Bound
  targetCluster: kind-cluster-1
status:
  conditions:
  - lastTransitionTime: "2024-05-07T23:13:51Z"
    message: The resources cannot be updated to the latest because of the rollout
      strategy
    observedGeneration: 2
    reason: RolloutNotStartedYet
    status: "False"
    type: RolloutStarted
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: No override rules are configured for the selected resources
    observedGeneration: 2
    reason: NoOverrideSpecified
    status: "True"
    type: Overridden
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: All of the works are synchronized to the latest
    observedGeneration: 2
    reason: AllWorkSynced
    status: "True"
    type: WorkSynchronized
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: All corresponding work objects are applied
    observedGeneration: 2
    reason: AllWorkHaveBeenApplied
    status: "True"
    type: Applied
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: All corresponding work objects are available
    observedGeneration: 2
    reason: AllWorkAreAvailable
    status: "True"
    type: Available

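The ClusterResourceBinding remains unchanged. In the ClusterResourceBinding spec, resourceSnapshotName still references the old ClusterResourceSnapshot name. This issue occurs when the user provides no explicit RollingUpdate input, because the following default values are applied:

  • The maxUnavailable value is configured as 25% × 3 (the desired number), rounded to 1.
  • The maxSurge value is configured as 25% × 3 (the desired number), rounded to 1.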
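
The arithmetic behind these defaults can be reproduced with a short sketch. It assumes that the percentage is rounded up to the next whole number, which is consistent with the result of 1 stated above; the function and variable names are hypothetical.

```python
def scaled_value(percent, target_number):
    """Convert a percentage of target_number to an absolute count, rounding up.

    Rounding up is an assumption; it matches the "rounded to 1" result
    described in this article (25% of 3 is 0.75).
    """
    # Integer ceiling division avoids floating-point rounding surprises.
    return (percent * target_number + 99) // 100


target_clusters = 3   # numberOfClusters in the case study
default_percent = 25  # default for both maxUnavailable and maxSurge

max_unavailable = scaled_value(default_percent, target_clusters)
max_surge = scaled_value(default_percent, target_clusters)
print(max_unavailable, max_surge)  # 1 1
```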

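Why isn't the ClusterResourceBinding updated?

Initially, when the ClusterResourcePlacement was created, two ClusterResourceBindings were generated. However, because the rollout strategy doesn't apply to the initial phase, the ClusterResourcePlacementRolloutStarted condition was set to True.

When the test-ns namespace was created on the hub cluster, the rollout controller tried to update the two existing ClusterResourceBindings. However, because one member cluster was missing, the maxUnavailable value of 1 made the RollingUpdate configuration too strict to update either binding.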
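
The blocking condition can be illustrated with a simplified model. This sketch is not the Fleet rollout controller's implementation; it only assumes that at least (target number minus maxUnavailable) bindings must stay available while others are updated. The function name and parameters are hypothetical.

```python
def bindings_allowed_to_update(target_number, available_bindings, max_unavailable):
    """Simplified model: how many available bindings may be updated at once.

    This is not the Fleet rollout controller's code. It only captures the
    idea that at least (target_number - max_unavailable) bindings must stay
    available while others are being updated.
    """
    min_available = target_number - max_unavailable
    return max(0, available_bindings - min_available)


# Case study: 3 clusters requested, 2 bound and available, maxUnavailable = 1.
print(bindings_allowed_to_update(target_number=3, available_bindings=2, max_unavailable=1))  # 0
```

In this model, the missing third cluster already consumes the single unavailable slot, so no existing binding can be taken out of service for an update.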

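Note

During the update, if one of the bindings fails to apply, that failure also violates the RollingUpdate configuration in which maxUnavailable is set to 1.

Resolution

To resolve the issue in this situation, consider manually setting maxUnavailable to a value that's greater than 1 to relax the RollingUpdate configuration. Alternatively, you can join a third member cluster.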
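
As one way to apply the first option, the following sketch builds a JSON merge patch that raises maxUnavailable. The field path spec.strategy.rollingUpdate.maxUnavailable isn't shown in this article and is an assumption, so verify it against the CRD version installed on your hub cluster. The name crp-3 is taken from the labels in the snapshot output, and the helper function name is hypothetical.

```python
import json


def build_strategy_patch(max_unavailable):
    """Build a JSON merge patch that sets rollingUpdate.maxUnavailable.

    The field layout under spec.strategy is an assumption; check it against
    the ClusterResourcePlacement CRD on your hub cluster.
    """
    return {
        "spec": {
            "strategy": {
                "type": "RollingUpdate",
                "rollingUpdate": {"maxUnavailable": max_unavailable},
            }
        }
    }


patch = build_strategy_patch(2)
# The printed JSON could be passed to a command such as:
#   kubectl patch clusterresourceplacement crp-3 --type merge -p '<json>'
print(json.dumps(patch))
```

A merge patch changes only the listed fields, so other parts of the spec, such as the policy and resource selectors, are left as they are.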

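Contact us for help

If you have questions or need help, create a support request, or ask Azure community support. You can also submit product feedback to the Azure feedback community.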