Running and Monitoring Jobs (Run:ai CLI)
This section describes how to use the Run:ai CLI to submit jobs and check their status.
Log in to Run:ai with the login subcommand. Run the following command and enter your username and password when prompted.
runai login
Example output
researcher1@basepod-head1:~/runai_dir$ runai login
Username: researcher1@nvidia.com
Password:
INFO[0022] Logged in successfully
Submit a distributed training job.
runai submit-mpi dist-job1 -i gcr.io/run-ai-demo/quickstart-distributed:tf-2.1.0 -g 4 -p researcher1
Example output
researcher1@basepod-head1:~/runai_dir$ runai submit-mpi dist-job1 -i gcr.io/run-ai-demo/quickstart-distributed:tf-2.1.0 -g 4 -p researcher1
Job dist-job1 submitted successfully.
Use runai describe to check the job status.
runai describe job dist-job1 -p researcher1
Example output
researcher1@basepod-head1:~/runai_dir$ runai describe job dist-job1 -p researcher1
Name: dist-job1
Namespace: runai-researcher1
Type: Train
Status: Running
Duration: 5m0
GPUs: 0
Total Requested GPUs: 0
Allocated GPUs: 2
Allocated GPUs memory: 0
Running PODs: 3
Pending PODs: 0
Parallelism: 1
Completions: 1
Succeeded PODs: 0
Failed PODs: 0
Is Distributed Workload: true
Service URLs:
Command Line: runai submit-mpi dist-job1 --processes=2 -g 1 -i gcr.io/run-ai-demo/quickstart-distributed:tf-2.1.0 -e RUNAI_SLEEP_SECS=60
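When the status check needs to be scripted, the output of runai describe can be filtered for a single field. The following is a minimal sketch, not part of the Run:ai CLI: describe_field and wait_for_running are hypothetical helper names, and the parsing assumes that describe prints one "Key: value" field per line. The default retry count and delay are arbitrary.

```shell
#!/bin/sh
# Hypothetical helper: read `runai describe job` output on stdin and print
# the value of one "Key: value" field. Assumes one field per line.
describe_field() {
  awk -v key="$1" '
    index($0, key ":") == 1 {
      value = substr($0, length(key) + 2)   # text after "Key:"
      sub(/^[ \t]+/, "", value)             # trim leading whitespace
      sub(/[ \t\r]+$/, "", value)           # trim trailing whitespace
      print value
      exit
    }'
}

# Hypothetical helper: poll the job until it reports "Status: Running".
# Arguments: job name, project, attempts (default 30), delay in seconds (default 10).
# Returns 0 once the job is running, 1 if the attempts are exhausted.
wait_for_running() {
  job=$1
  project=$2
  attempts=${3:-30}
  delay=${4:-10}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    status=$(runai describe job "$job" -p "$project" | describe_field "Status")
    if [ "$status" = "Running" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

For the job submitted above, calling wait_for_running dist-job1 researcher1 would block until the job reports Running or the attempts run out. Matching the key only at the start of a line keeps "Allocated GPUs" from also matching "Allocated GPUs memory".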