EC2 + GCE GPU CI Jobs not running any test cases #124950

Open
Vyom-Yadav opened this issue May 19, 2024 · 14 comments
Labels
kind/failing-test: Categorizes issue or PR as related to a consistently or frequently failing test.
sig/node: Categorizes an issue or PR as relevant to SIG Node.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@Vyom-Yadav
Member

Which jobs are failing?

master-blocking

  • gce-device-plugin-gpu-master

Which tests are failing?

kubetest.Up

Since when has it been failing?

05/17

Testgrid link

https://testgrid.k8s.io/sig-release-master-blocking#gce-device-plugin-gpu-master

Reason for failure (if possible)

ERROR: (gcloud.compute.instance-groups.managed.create) Could not fetch resource:
 - The resource 'projects/220512457637/zones/us-west1-b/acceleratorTypes/nvidia-tesla-k80' was not found
ERROR: (gcloud.compute.instance-groups.managed.wait-until) Some requests did not succeed:
 - The resource 'projects/k8s-infra-e2e-boskos-gpu-01/zones/us-west1-b/instanceGroupManagers/bootstrap-e2e-minion-group' was not found
Waiting for 2 ready nodes. 1 ready nodes, 1 registered. Retrying.
Using image: cos-109-17800-218-26 from project: cos-cloud as master image
Using image: cos-109-17800-147-22 from project: cos-cloud as node image
Detected 1 ready nodes, found 1 nodes out of expected 2. Your cluster may not be fully functional.
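The first error indicates the nvidia-tesla-k80 accelerator type is no longer offered in us-west1-b, which appears to cascade into the missing instance group and the single-node cluster. A quick way to confirm which accelerator types a zone still exposes (a minimal sketch with standard gcloud commands; the zone and type are taken from the log above):

  gcloud compute accelerator-types list --filter="zone:us-west1-b"
  gcloud compute accelerator-types describe nvidia-tesla-k80 --zone=us-west1-b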

Anything else we need to know?

No response

Relevant SIG(s)

/sig k8s-infra

@Vyom-Yadav added the kind/failing-test label on May 19, 2024
@k8s-ci-robot added the sig/k8s-infra and needs-triage labels on May 19, 2024
@Vyom-Yadav
Member Author

cc @BenTheElder @dims @ameukam

@BenTheElder
Member

Note that while the job is "green" after kubernetes/test-infra#32635, we are no longer running the [Feature:GPUDevicePlugin] run Nvidia GPU Device Plugin tests, and the Windows test is skipped, so no real tests are run.

https://testgrid.k8s.io/sig-release-master-blocking#gce-device-plugin-gpu-master&show-stale-tests=&width=5

This issue is a sub-variant of kubernetes/test-infra#32242

@BenTheElder
Member

/remove-sig k8s-infra
/sig node
/triage accepted

@k8s-ci-robot added the sig/node and triage/accepted labels and removed the sig/k8s-infra and needs-triage labels on May 20, 2024
@aojea
Member

aojea commented May 21, 2024

@BenTheElder
Member

It seems fixed https://testgrid.k8s.io/sig-release-master-blocking#gce-device-plugin-gpu-master

It's not; see the comment above, #124950 (comment).

The [Feature:GPUDevicePlugin] run Nvidia GPU Device Plugin tests entry is no longer even listed under stale tests because it hasn't run in so long (it was still showing as a stale test yesterday). We're currently running no actual GPU tests.

@haircommander haircommander moved this from Triage to Issues - In progress in SIG Node CI/Test Board May 22, 2024
@BenTheElder
Member

I don't even see [Feature:GPUDevicePlugin] run Nvidia GPU Device Plugin tests in the skipped results.

The only matching test is for Windows ... asking in the SIG Node Slack whether something changed with the test cases:
https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-device-plugin-gpu/1795477868910743552/artifacts/junit_01.xml

@BenTheElder
Member

The test cases were all deleted ... bf268f0#diff-7629c065680da0396ef2e8d190ce7cdd1dbf2c336f99c22ec543a4be61d74ccd

@BenTheElder
Member

NOTE: This also impacts the EC2 job, which is no longer running any test cases.

The GCE job is running and "passing", the same as the EC2 job now ... neither of which runs any tests.

/retitle EC2 + GCE GPU CI Jobs not running any test cases

See an old run: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-ec2-device-plugin-gpu/1781225142752382976

(ran: Kubernetes e2e suite: [It] [sig-scheduling] [Feature:GPUDevicePlugin] run Nvidia GPU Device Plugin tests, 8 tests passed; the other "tests" are just cluster bringup / test runner steps, etc.)

Current:

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-ec2-device-plugin-gpu/1795464027577520128

(7 "tests" passed, none of which are actual e2e tests)
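For comparison, one way to see which specs the GPU feature tag actually matches is to run the e2e suite with a focus on it (a sketch; it assumes an e2e.test binary built from the branch under test and a kubeconfig for a GPU-enabled cluster):

  _output/bin/e2e.test \
    --provider=gce \
    --kubeconfig="$HOME/.kube/config" \
    --ginkgo.focus='\[Feature:GPUDevicePlugin\]'

On a branch where the test cases were removed (bf268f0), this focus matches zero specs, which lines up with the current runs "passing" without exercising any GPU tests.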

@k8s-ci-robot k8s-ci-robot changed the title [Failing Test] gce-device-plugin-gpu-master (kubetest.Up) EC2 + GCE GPU CI Jobs not running any test cases May 28, 2024
@aojea
Member

aojea commented May 28, 2024

😅

@pacoxu
Member

pacoxu commented May 29, 2024

@BenTheElder
Member

Yes, I think the job is coming up but not the driver install or device plugin. We need to add more log dumping there. I've been discussing a bit with dims what we should do about the test removal in #sig-node: https://kubernetes.slack.com/archives/C0BP8PW9G/p1716914276819719?thread_ts=1716913485.823089&cid=C0BP8PW9G
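For that extra log dump, a few generic checks could show whether the driver installer and device plugin actually came up and whether nodes advertise GPUs (a sketch; the grep patterns are guesses, since the daemonset names depend on how the installer is deployed):

  # does any node report an allocatable nvidia.com/gpu resource?
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'

  # are the driver installer / device plugin daemonsets present and ready?
  kubectl get daemonsets -A | grep -i -e nvidia -e gpu

  # recent events often surface failed driver installs
  kubectl get events -A --sort-by=.lastTimestamp | tail -n 30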

@BenTheElder
Member

Talked to @elfinhe this morning about the driver install.

@dims
Member

dims commented May 30, 2024

FYI, #122828 documents that the [Feature:GPUDevicePlugin] run Nvidia GPU Device Plugin tests are getting dropped!

@BenTheElder
Member

BenTheElder commented May 30, 2024

I'm going to revisit the test cases once we figure out the driver install issue on 1.30 with the existing test cases on that branch. There are WIP PRs for this, and I'm in contact with the team supporting us on the driver problems, providing them upstream CI pointers: #125208 / #125206
