User Guide
This documentation explains how regular users access Skyway and submit jobs that use cloud services. Please refer to the Skyway home page for more information and news.
Gaining Access
You first need an active RCC user account (see the accounts and allocations page). Next, contact your PI or class instructor for access to Skyway. Alternatively, you can reach out to our Help Desk at help@rcc.uchicago.edu for assistance.
Connecting
First, log in to the HPC cluster
ssh -Y [cnetid]@midway3.rcc.uchicago.edu
For Midway3 users, load the Skyway module
module load skyway
Running jobs on the cloud
You submit jobs to the cloud in a similar manner to what you do on your HPC cluster. The difference is that you specify the partitions and accounts corresponding to the cloud services you have access to. Additionally, the instance configuration is specified via --constraint.
List all the node types available to an account
skyway_nodetypes --account=your-cloud-account
skyway_nodetypes --account=your-gcp-account
The instance type is selected with --constraint=[VM Type]. For instance, the AWS VM types currently available through Skyway can be found in the table below. Other VM types will be included upon request.
| VM Type | AWS EC2 Instance | Configuration | Description |
|---|---|---|---|
| t1 | t2.micro | 1 core, 1GB RAM | for testing and building software |
| c1 | c5.large | 1 core, 4GB RAM | for serial jobs |
| c8 | c5.4xlarge | 8 cores, 32GB RAM | for medium sized multicore jobs |
| c36 | c5.18xlarge | 36 cores, 144GB RAM | for large multicore jobs |
| g1 | p3.2xlarge | 4 cores, 61 GB RAM, 1x V100 GPU | for GPU jobs |
| g4 | p3.8xlarge | 16 cores, 244 GB RAM, 4x V100 GPU | for heavy GPU jobs |
| g5 | g5.2xlarge | 8 cores, 32 GB RAM, 1x A10G GPU | for heavy GPU jobs |
| m24 | c5.12xlarge | 24 cores, 384GB RAM | for large memory jobs |
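For example, assuming your account is named your-cloud-account, a single-GPU g1 instance from the table above could be provisioned for one hour with the skyway_alloc command introduced in the workflow below:

```bash
# Allocate a g1 instance (4 cores, 61 GB RAM, 1x V100 GPU) for one hour.
# "your-cloud-account" is a placeholder for your actual cloud account name.
skyway_alloc --account=your-cloud-account --constraint=g1 --time=01:00:00
```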
The following steps show a representative workflow with Skyway.
Allocate/provision an instance
skyway_alloc --account=your-cloud-account --constraint=t1 --time=01:00:00
skyway_alloc -A your-cloud-account --constraint=g5 --time=00:30:00
List all the running VMs with an account
skyway_list --account=your-cloud-account
Transfer data
To copy a file from the login node to the instance named your-run
skyway_transfer -A your-cloud-account -J your-run training.py
Transfer a file from an instance to the login node
skyway_transfer -A your-cloud-account -J your-run --from-cloud --cloud-path=~/output.txt $HOME/output.txt
Connect to the VM named your-run
skyway_connect --account=your-cloud-account your-run
Once on the VM, do
nvidia-smi
source activate pytorch
python training.py > ~/output.txt
scp output.txt [yourcnetid]@midway3.rcc.uchicago.edu:~/
exit
Stop/restart a job
To stop an instance, use the skyway_stop command with your account and the instance ID:
skyway_stop -A your-cloud-account -i [instanceID]
To restart a stopped instance, use the skyway_restart command with your account, the instance ID, and a new time limit:
skyway_restart -A your-cloud-account -i [instanceID] -t 02:00:00
Cancel/terminate a job
skyway_cancel --account=your-cloud-account [job_name]
Expected behavior: the job's VM is terminated. When you run skyway_list (step 3 above), the VM will no longer be listed.
Note that when a VM is terminated, all data on the VM is erased. The user should transfer any intermediate output to a cloud storage space or to local storage before terminating.
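Putting this note into practice, a typical clean-up sequence might look like the following sketch, which reuses the skyway_transfer and skyway_cancel commands from this guide (account, job, and file names are placeholders):

```bash
# Copy the results off the VM while it is still running.
skyway_transfer -A your-cloud-account -J your-run --from-cloud --cloud-path=~/output.txt $HOME/output.txt

# Terminate the job only after the data is safely on local storage.
skyway_cancel --account=your-cloud-account your-run
```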
The following steps are for launching interactive and batch jobs.
Submit an interactive job (combining steps 4, 6 and 7)
skyway_interactive --account=your-cloud-account --constraint=t1 --time=01:00:00
skyway_interactive --account=your-cloud-account --constraint=g5 -t 00:30:00
Expected behavior: the user lands on a compute node or a VM in a separate terminal.
Submit a batch job
A sample job script job_script.sh is given below
#!/bin/sh
#SBATCH --job-name=your-run
#SBATCH --account=your-cloud-account
#SBATCH --constraint=g1
#SBATCH --time=06:00:00
skyway_transfer training.py
source activate pytorch
wget [url-of-the-dataset]
python training.py
To submit the job, use the skyway_batch command
skyway_batch job_script.sh
If the job is terminated because it reaches its time limit, the instance is stopped rather than deleted. You can find the instance ID with the skyway_list command and restart the instance to back up the data.
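As a sketch of this recovery workflow (the account name and instance ID are placeholders):

```bash
# Find the ID of the stopped instance.
skyway_list --account=your-cloud-account

# Restart it with a fresh time limit long enough to copy the data off.
skyway_restart -A your-cloud-account -i [instanceID] -t 00:30:00
```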
Transfer output data from the cloud
skyway_transfer -A your-cloud-account -J your-run --from-cloud --cloud-path=~/model*.pkl .
You can cancel the job using the skyway_cancel command.
Troubleshooting
For further assistance, please contact our Help Desk at help@rcc.uchicago.edu.