How to use supercomputers from Texas Advanced Computing Center (TACC)
Boost your training speed
Before you begin
- Create a TACC account (https://portal.tacc.utexas.edu/)
- Set up multi-factor authentication in the TACC user portal
- This is separate from utexas multi-factor authentication
- https://portal.tacc.utexas.edu/tutorials/multifactor-authentication
Login
ssh xy0000@hikari.tacc.utexas.edu
- Replace xy0000 with your own eid
- Replace hikari with your own system (e.g. maverick2, lonestar5)
Log in using your TACC password and multi-factor authentication token code
Transfer file
# For file
localhost$ scp path/to/file xy0000@maverick2.tacc.utexas.edu:\$WORK/path
# For folder
localhost$ tar cvf ./mydata.tar mydata # create archive
localhost$ scp ./mydata.tar xy0000@maverick2.tacc.utexas.edu:\$WORK # transfer archive
- The $WORK directory usually has more space than the $HOME directory, so put data there
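After the transfer, the archive still has to be unpacked on the TACC side (for example with tar xvf mydata.tar inside $WORK). If you prefer to pack and unpack from Python, the following is a minimal sketch using the standard tarfile module. The function names pack_folder and unpack_archive are hypothetical helpers, not part of any TACC tooling.

```python
import tarfile
from pathlib import Path


def pack_folder(folder, archive):
    """Create an uncompressed tar archive, similar to `tar cvf archive folder`."""
    with tarfile.open(archive, "w") as tar:
        # arcname stores only the folder name, not the full local path
        tar.add(folder, arcname=Path(folder).name)


def unpack_archive(archive, dest):
    """Extract every member of the archive into dest, similar to `tar xvf`."""
    with tarfile.open(archive, "r") as tar:
        tar.extractall(dest)
```

Run pack_folder("mydata", "mydata.tar") locally before scp, then unpack_archive("mydata.tar", ".") from the directory on TACC where the archive landed.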
Run
Method 1 (sbatch)
- Do not run your code directly on the login node.
Create a .slurm file
Example slurm file below:
#!/bin/bash
#----------------------------------------------------
# Example SLURM job script to run code
#----------------------------------------------------
#SBATCH -J lab_job # Job name
#SBATCH -o console_output.txt # Name of stdout output file
#SBATCH -e console_error_output.txt # Name of stderr output file
#SBATCH -p normal # Queue name
#SBATCH -N 1 # Total number of nodes requested; multi-node means parallel
#SBATCH -n 1 # Total number of tasks requested
#SBATCH -t 01:30:00 # Run time (hh:mm:ss); the job ends after 1.5 h even if the program has not finished
# The next line is required if the user has more than one project
#SBATCH -A XXXX # Project/allocation number, the one you applied to TACC with
# This example will run 1 task on 1 node
# Launch the job: the file you want to run
python ./file.py
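If you submit many runs with different names or time limits, it can help to generate the slurm file instead of editing it by hand. The sketch below is one possible way to do that; render_slurm is a hypothetical helper, and it only fills in the same #SBATCH options used in the example script.

```python
def render_slurm(job_name, queue, nodes, tasks, runtime, command, allocation=None):
    """Return the text of a slurm file using the options from the example."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH -J {job_name}",               # job name
        "#SBATCH -o console_output.txt",        # stdout file
        "#SBATCH -e console_error_output.txt",  # stderr file
        f"#SBATCH -p {queue}",                  # queue name
        f"#SBATCH -N {nodes}",                  # number of nodes
        f"#SBATCH -n {tasks}",                  # number of tasks
        f"#SBATCH -t {runtime}",                # run time, hh:mm:ss
    ]
    if allocation:
        # only needed when the user has more than one project
        lines.append(f"#SBATCH -A {allocation}")
    lines.append(command)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    script = render_slurm("lab_job", "normal", 1, 1, "01:30:00", "python ./file.py")
    with open("your_filename.slurm", "w") as handle:
        handle.write(script)
```

The generated file is submitted with sbatch in the same way as a hand-written one.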
Run slurm
login2.hikari(26)$ sbatch your_filename.slurm
Watch the job
login2.hikari(41)$ watch squeue
login2.hikari(29)$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
47767 normal lab_job xy0000 R 0:06 1 c262-102
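To check job state from a script rather than by eye, the squeue table can be split into one record per job. This is a rough sketch that assumes the default column layout shown above and job names without spaces; parse_squeue is a hypothetical helper.

```python
def parse_squeue(text):
    """Turn squeue table text into a list of dicts, one per job."""
    lines = [line for line in text.strip().splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    jobs = []
    for line in lines[1:]:
        # limit the split so a NODELIST(REASON) value with spaces stays whole
        fields = line.split(None, len(header) - 1)
        jobs.append(dict(zip(header, fields)))
    return jobs


sample = """JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
47767 normal lab_job xy0000 R 0:06 1 c262-102"""

jobs = parse_squeue(sample)
print(jobs[0]["JOBID"], jobs[0]["ST"])  # prints: 47767 R
```

In the ST column, R means the job is running. On the cluster, the text would come from running squeue itself, for example through the subprocess module.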
Check console output
cat console_output.txt
- This is the stdout file you set with #SBATCH -o in the slurm file
Cancel job
login2.hikari(41)$ scancel 47767
- scancel JOBID
Run
Method 2 (idev)
login2.hikari(36)$ idev -t 01:30:00
- idev starts an interactive development session on a compute node
- -t sets the total time you request (hh:mm:ss)
- This method does not need a slurm file
c262-104.hikari(2)$ python file.py
- Run the code as you would in a normal terminal; it runs at the same speed as a job submitted with a slurm file
c262-104.hikari(3)$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
47769 normal idv08693 xy0000 R 3:20 1 c262-104
Logout
c262-104.hikari(4)$ exit
Connection to c262-104 closed.
Cleaning up: submitted job (yes) removing job 47769.
Deep Learning Using Python
- Use Python 3 if you need h5py
Enter the following before running your code
module load intel/17.0.4 python3/3.6.3
module load cuda/10.0 cudnn/7.6.2 nccl/2.4.7
pip3 install --user tensorflow-gpu==1.13.2
pip3 install --user keras
pip3 install --user h5py
export HDF5_USE_FILE_LOCKING='FALSE'
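Before starting a long run, it may be worth confirming that the setup above took effect in the current shell. The sketch below reports whether the installed packages can be found and what HDF5_USE_FILE_LOCKING is set to; check_environment is a hypothetical helper, and it only looks the packages up without importing them.

```python
import importlib.util
import os


def check_environment(packages=("tensorflow", "keras", "h5py")):
    """Report which packages are installed and the HDF5 locking setting."""
    report = {}
    for name in packages:
        # find_spec returns None when the top-level package is not installed
        report[name] = importlib.util.find_spec(name) is not None
    # None here means the export line was not run in this shell
    report["HDF5_USE_FILE_LOCKING"] = os.environ.get("HDF5_USE_FILE_LOCKING")
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(key, "->", value)
```

A False entry suggests the matching pip3 install line did not succeed or a different Python is on the path.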