Running and Monitoring Jobs (Run:ai CLI)

This section describes how to use the Run:ai CLI to submit jobs and check their status.

  1. Log in to Run:ai with the login command.

    Run the following command, then enter your username and password when prompted.

    runai login
    

    Example output

    researcher1@basepod-head1:~/runai_dir$ runai login
    Username: researcher1@nvidia.com
    Password:
    INFO[0022] Logged in successfully
    
  2. Submit a distributed training job.

    runai submit-mpi dist-job1 -i gcr.io/run-ai-demo/quickstart-distributed:tf-2.1.0 -g 4 -p researcher1
    

    Example output

    researcher1@basepod-head1:~/runai_dir$ runai submit-mpi dist-job1 -i gcr.io/run-ai-demo/quickstart-distributed:tf-2.1.0 -g 4 -p researcher1
    Job dist-job1 submitted successfully.
    
  3. Check the job status with runai describe.

    runai describe job dist-job1 -p researcher1
    

    Example output

    researcher1@basepod-head1:~/runai_dir$ runai describe job dist-job1 -p researcher1
    Name: dist-job1
    Namespace: runai-researcher1
    Type: Train
    Status: Running
    Duration: 5m0
    GPUs: 0
    Total Requested GPUs: 0
    Allocated GPUs: 2
    Allocated GPUs memory: 0
    Running PODs: 3
    Pending PODs: 0
    Parallelism: 1
    Completions: 1
    Succeeded PODs: 0
    Failed PODs: 0
    Is Distributed Workload: true
    Service URLs:
    Command Line: runai submit-mpi dist-job1 --processes=2 -g 1 -i gcr.io/run-ai-demo/quickstart-distributed:tf-2.1.0 -e RUNAI_SLEEP_SECS=60
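
When checking job status from a script, it can help to pull a single field out of the runai describe output. The following is a minimal sketch, not part of the Run:ai CLI: get_field is a hypothetical helper name, and it assumes the output keeps the plain "Key: value" layout shown in step 3, which may differ between Run:ai versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a Run:ai command): read "Key: value" lines on
# stdin and print the value of the first line that starts with the given key.
get_field() {
  local key="$1"
  awk -v key="$key" '
    # Match only lines that begin with "<key>: ", so "GPUs" does not
    # match "Allocated GPUs".
    index($0, key ": ") == 1 {
      print substr($0, length(key) + 3)
      exit
    }'
}

# Assumed usage, given runai is installed and you are logged in:
#   runai describe job dist-job1 -p researcher1 | get_field "Status"
```

Matching on the full key prefix keeps similarly named fields apart, for example GPUs and Allocated GPUs. If the real output indents its fields, the match would need to allow for leading whitespace.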