# Forward
This repository adapts https://github.com/vsoch/forward to CINECA resources.
## What is this?
Forward sets up an sbatch script on your cluster resource and port forwards it back to your local machine!
Useful for jupyter notebook and tensorboard, amongst other things.
- **start.sh** is intended for submitting a job and setting up ssh forwarding
- **start-node.sh** will submit the job and give you a command to ssh to the node, without port forwarding
The folder [sbatches](sbatches) contains scripts, organized by cluster resource, that are intended
for use and submission. It's up to you to decide if you want a port forwarded (e.g., for a jupyter notebook)
or just an instruction for how to connect to a running node with your application.
## Quick Start
```
bash hosts/galileo_ssh.sh >> ~/.ssh/config
bash setup.sh
bash start.sh jupyter-pip
```
You should be able to open the url `localhost:port` as printed in the output instructions.
Once finished,
```
bash end.sh jupyter
```
will kill the slurm job and the listeners.
## Tiny Tutorials
Here are some "tiny tutorials" to help you use the software. They are tiny because there
are many possible use cases!
- [Using sherlock/py3-jupyter](https://gist.github.com/vsoch/f2034e2ff768de7eb14d42fef92cc43e) and copying notebook first from your host to use a notebook module (python 3) on the Sherlock cluster at Stanford [Version 0.0.1](https://github.com/vsoch/forward/releases/tag/0.0.1).
- [Running an R Kernel in a Jupyter Notebook](https://vsoch.github.io/lessons/sherlock-juputer-r/)
- [Using containershare with repo2docker-julia](https://vsoch.github.io/lessons/containershare) a repo2docker-julia Singularity container deployed on Sherlock using [Version 0.0.1](https://github.com/vsoch/forward/releases/tag/0.0.1)
## Setup
For interested users, a few tutorials are provided on the [Research Computing Lessons](https://vsoch.github.io/lessons) site.
Brief instructions are also documented in this README.
### Clone the Repository
Clone this repository to your local machine.
You will then need to create a parameter file. To do so, follow the prompts at:
```bash
bash setup.sh
```
You can always edit `params.sh` later to change these configuration options.
It should look like:
```
USERNAME="your user name"
PORT="15432"
PARTITION="gll_usr_gpuprod --gres=gpu:kepler:1"
RESOURCE="galileo"
ACCOUNT="cin_staff"
TIME="8:00:00"
```
#### Parameters
- **RESOURCE** should refer to an identifier for your cluster resource that will be recorded in your ssh configuration, and then referenced in the scripts to interact with the resource (e.g., `ssh sherlock`).
- **PARTITION** If you intend to use a GPU (e.g., [sbatches/py2-tensorflow.sbatch](sbatches/py2-tensorflow.sbatch)), the PARTITION variable should be set to "gpu".
- **CONTAINERSHARE** (optional) is a location on your cluster resource (typically world readable) where you might find containers (named by a hash of the container name in the [library](https://vsoch.github.io/containershare)) that are ready to go! If you are at Stanford, leave this as the default. If you aren't, then ask your cluster admin about [setting up a containershare](https://www.github.com/vsoch/containershare).
If you want to modify the partition flag to have a different gpu setup (other than `--partition gpu --gres gpu:1`) then you should set this **entire** string for the partition variable.
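For example, to request a specific GPU type on Galileo (matching the example params.sh above), set the whole string:
```bash
# In params.sh: the full partition string, including any --gres flags
PARTITION="gll_usr_gpuprod --gres=gpu:kepler:1"
```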
### SSH config
You will also need, at a minimum, to configure your ssh to recognize your cluster (e.g., galileo) as
a valid host. We have provided a [hosts folder](hosts) with helper scripts that generate
recommended ssh configuration snippets to put in your `~/.ssh/config` file. Based
on the name of the folder, you can intuit that the configuration depends on the cluster
host. Here is how you can generate this configuration for Galileo:
```bash
bash hosts/galileo_ssh.sh
```
```
Host galileo
User put_your_username_here
Hostname login.galileo.cineca.it
GSSAPIDelegateCredentials yes
GSSAPIAuthentication yes
ControlMaster auto
ControlPersist yes
ControlPath ~/.ssh/%l%r@%h:%p
```
Using these options can reduce the number of times you need to authenticate. If you
don't have a file in the location `~/.ssh/config` then you can generate it programmatically:
```bash
bash hosts/galileo_ssh.sh >> ~/.ssh/config
```
Do not run this command if there is content in the file that you might overwrite!
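If the file already exists, a safer approach is to write the snippet to a temporary file first, review it, and append it yourself (the file name below is just an example):
```bash
bash hosts/galileo_ssh.sh > galileo_host.txt   # write the snippet to a temporary file
cat galileo_host.txt                           # review it
cat galileo_host.txt >> ~/.ssh/config          # then append it manually
```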
# Notebooks
Notebooks have associated sbatch scripts that are intended to start a jupyter (or similar)
notebook, and then forward the port back to your machine. If you just want to submit a job
(without port forwarding), see [the job submission](#job-submission) section. For
notebook job submission, you will want to use the [start.sh](start.sh) script.
## Notebook password
If you have not set up notebook authentication before, you will need to set a
password via `jupyter notebook password` on your cluster resource.
Make sure to pick a secure password!
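A minimal sketch of doing this on Galileo (assuming the `galileo` host alias from above, and that jupyter comes from the anaconda module as in the sbatch scripts):
```bash
ssh galileo
module load anaconda/2019.07
jupyter notebook password
```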
# Job Submission
Job submission can mean executing a command to a container, running a container, or
writing your own sbatch script (and submitting from your local machine). For
standard job submission, you will want to use the [start-node.sh](start-node.sh) script.
If your cluster has a containershare, you can use the `containershare-notebook`
set of scripts to have a faster deployment (without needing to pull).
## Usage
```bash
# Choose a containershare notebook, and launch it! On Galileo, the containers are already in the share
bash start.sh sherlock/containershare-notebook docker://vanessa/repo2docker-julia
# Run a Singularity container that already exists on your resource (recommended)
bash start-node.sh singularity-run /scratch/users/vsochat/share/pytorch-dev.simg
# Execute a custom command to the same Singularity container
bash start-node.sh singularity-exec /scratch/users/vsochat/share/pytorch-dev.simg echo "Hello World"
# Run a Singularity container from a url, `docker://ubuntu`
bash start-node.sh singularity-run docker://ubuntu
# Execute a custom command to the same container
bash start-node.sh singularity-exec docker://ubuntu echo "Hello World"
# Execute your own custom sbatch script
cp myscript.job sbatches/
bash start-node.sh myscript
```
As a service for Stanford users, @vsoch provides a [containershare](https://vsoch.github.io/containershare)
of ready to go containers to use on Sherlock! The majority of these deploy interactive notebooks,
but they can also be run without one (use [start-node.sh](start-node.sh) instead of [start.sh](start.sh)). If you
want to build your own container for containershare (or request a container) see the
[README](https://www.github.com/vsoch/containershare) in the repository that serves it.
```bash
# Run a containershare container with a notebook
bash start.sh sherlock/containershare-notebook docker://vanessa/repo2docker-julia
```
If you would like to request a custom notebook, please [reach out](https://www.github.com/vsoch/containershare/issues).
## Usage
```bash
# To start a jupyter notebook in a specific directory ON the cluster resource
bash start.sh jupyter <cluster-dir>
# If you don't specify a path on the cluster, it defaults to your ${SCRATCH}
bash start.sh jupyter /scratch/users/<username>
# To start a jupyter notebook with tensorflow in a specific directory
bash start.sh py2-tensorflow <cluster-dir>
# If you want a GPU node, make sure your partition is set to "gpu."
# To start a jupyter notebook (via a Singularity container!) in a specific directory
bash start.sh singularity-jupyter <cluster-dir>
```
Want to create your own Singularity jupyter container? Use [repo2docker](https://www.github.com/jupyter/repo2docker) and then specify the container URI at the end.
```bash
bash start.sh singularity-jupyter <cluster-dir> <container>
# You can also run a general singularity container!
bash start.sh singularity <cluster-dir> <container>
# To start tensorboard in a specific directory (careful here: not recommended, as it is not password protected)
bash start.sh tensorboard <cluster-dir>
# To stop the running jupyter notebook server
bash end.sh jupyter
```
If the sbatch job is still running, but your port forwarding stopped (e.g. if
your computer went to sleep), you can resume with:
```bash
bash resume.sh jupyter
```
# Debugging
Along with some good debugging notes [here](https://vsoch.github.io/lessons/jupyter-tensorflow#debugging), common errors are below.
### Connection refused after start.sh finished
Sometimes you can get connection refused messages after the script has started
up. Just wait up to a minute and then refresh the opened web page, and this
should fix the issue.
### Terminal Hangs after start.sh
Sometimes when your network changes, you need to re-authenticate. Just as you might
hit a login issue here, the hangup is usually resolved by opening a new shell.
### Terminal Hangs on "== Checking for previous notebook =="
This is the same bug as above - this command specifically is capturing output into
a variable, so if it hangs longer than 5-10 seconds, it's likely hit the password
prompt and would hang indefinitely. If you issue a standard command that will
re-prompt for your password in the terminal session, you should fix the issue.
```bash
$ ssh galileo pwd
```
### slurm_load_jobs error: Socket timed out on send/recv operation
[This error](https://www.rc.fas.harvard.edu/resources/faq/slurm-errors-socket-timed-out) is basically
saying something to the effect of "slurm is busy, try again later." It's not an issue with submitting
the job, but rather with the ping to slurm that performs the check. If the next check goes through, you should be ok. However, if the script terminates, while you can't control how busy slurm is, you **can**
control how likely your job is to be allocated a node, and the frequency of checking. Thus, you can do either of the
following to mitigate this issue:
**choose a partition that is more readily available**
In your params.sh file, choose a partition that is likely to be allocated sooner, thus reducing the
queries to slurm, and the chance of the error.
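For example, you might fall back to the default serial partition in your params.sh (a sketch; adjust to the partitions your account can access):
```bash
# In params.sh
PARTITION="gll_all_serial"
```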
**offset the checks by changing the timeout between attempts**
The script looks for an exported variable, `TIMEOUT`, and sets it to 1 (1 second) if
not defined. Thus, to change the timeout, you can export this variable:
```bash
export TIMEOUT=3
```
While the forward tool cannot control the busyness of slurm, these two strategies should help a bit.
### I ended a script, but can't start a new one
Just as you would kill a job on the cluster and see some delay before the node comes down, the
same can be true here! Try waiting 20-30 seconds to give the node time to exit, and try again.
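You can also check directly whether the old job has really exited (a sketch, assuming the `galileo` host alias and a job named `jupyter`):
```bash
ssh galileo squeue --name=jupyter --user=<your-username> -h
# no output means the job is gone and you can start again
```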
## How do I contribute?
First, please read the [contributing docs](CONTRIBUTING.md). Generally, you will want to (see the example git commands after this list):
- fork the repository to your username
- clone your fork
- checkout a new branch for your feature, commit and push
- add your name to the CONTRIBUTORS.md
- issue a pull request!
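A rough sketch of that flow (replace the fork URL and branch name with your own):
```bash
git clone https://github.com/<your-username>/forward
cd forward
git checkout -b my-feature
# ... make changes, add your name to CONTRIBUTORS.md ...
git commit -am "describe your feature"
git push origin my-feature
# then open a pull request from your fork on GitHub
```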
## Adding new sbatch scripts
You can add more sbatch scripts by putting them in the [sbatches](sbatches) directory (optionally under a subfolder named for your resource, e.g. `sbatches/galileo/`).
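The sbatch scripts in this repository receive the port as the first argument and the working directory as the second, so a new script can follow the same pattern (hypothetical file name, shown as a sketch):
```bash
#!/bin/bash
# sbatches/galileo/myscript.sbatch (hypothetical example)
PORT=$1          # port chosen in params.sh, passed in by start.sh
NOTEBOOK_DIR=$2  # working directory on the cluster
cd $NOTEBOOK_DIR
# load whatever modules your application needs, then start a service
# listening on 127.0.0.1:$PORT, for example:
jupyter notebook --ip=127.0.0.1 --no-browser --port=$PORT
```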
#!/bin/bash
#
# Ends a remote sbatch job and kills the local port forwarding listeners.
# Sample usage: bash end.sh jupyter
# bash end.sh tensorboard
if [ ! -f params.sh ]
then
echo "Need to configure params before first run, run setup.sh!"
exit
fi
source params.sh
if [ "$#" -eq 0 ]
then
echo "Need to give name of sbatch job to kill!"
exit
fi
NAME=$1
echo "Killing $NAME slurm job on ${RESOURCE}"
ssh ${RESOURCE} "squeue --name=$NAME --user=$USERNAME -o '%A' -h | xargs --no-run-if-empty /usr/bin/scancel"
echo "Killing listeners on ${RESOURCE}"
#ssh ${RESOURCE} "/usr/sbin/lsof -i :$PORT -t | xargs kill &>/dev/null"
lsof -a -i :$PORT -c ssh -t | xargs kill &>/dev/null
#!/bin/bash
#
# Helper Functions shared between forward tool scripts
#
# Configuration
#
function set_forward_script() {
FOUND="no"
echo "== Finding Script =="
declare -a FORWARD_SCRIPTS=("sbatches/${RESOURCE}/$SBATCH"
"sbatches/$SBATCH"
"${RESOURCE}/$SBATCH"
"$SBATCH");
for FORWARD_SCRIPT in "${FORWARD_SCRIPTS[@]}"
do
echo "Looking for ${FORWARD_SCRIPT}";
if [ -f "${FORWARD_SCRIPT}" ]
then
FOUND="${FORWARD_SCRIPT}"
echo "Script ${FORWARD_SCRIPT}";
break
fi
done
echo
if [ "${FOUND}" == "no" ]
then
echo "sbatch script not found!!";
echo "Make sure \$RESOURCE is defined" ;
echo "and that your sbatch script exists in the sbatches folder.";
exit
fi
}
#
# Job Manager
#
function check_previous_submit() {
echo "== Checking for previous notebook =="
PREVIOUS=`ssh ${RESOURCE} squeue --name=$NAME --user=$USERNAME -o "%R" -h`
if [ -z "$PREVIOUS" -a "${PREVIOUS+xxx}" = "xxx" ];
then
echo "No existing ${NAME} jobs found, continuing..."
else
echo "Found existing job for ${NAME}, ${PREVIOUS}."
echo "Please end.sh before using start.sh, or use resume.sh to resume."
exit 1
fi
}
function set_partition() {
if [ "${PARTITION}" == "gpu" ];
then
echo "== Requesting GPU =="
PARTITION="${PARTITION} --gres gpu:1"
fi
}
function get_machine() {
TIMEOUT=${TIMEOUT-1}
ATTEMPT=0
echo
echo "== Waiting for job to start, using exponential backoff =="
MACHINE=""
ALLOCATED="no"
while [[ $ALLOCATED == "no" ]]
do
# nodelist
MACHINE=`ssh ${RESOURCE} squeue --name=$NAME --user=$USERNAME -o "%N" -h`
if [[ "$MACHINE" != "" ]]
then
echo "Attempt ${ATTEMPT}: resources allocated to ${MACHINE}!.." 1>&2
ALLOCATED="yes"
break
fi
echo "Attempt ${ATTEMPT}: not ready yet... retrying in $TIMEOUT.." 1>&2
sleep $TIMEOUT
ATTEMPT=$(( ATTEMPT + 1 ))
TIMEOUT=$(( TIMEOUT * 2 ))
done
echo $MACHINE
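# Query again with %R, which prints the allocated node for a running job (or the pending reason otherwise)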
MACHINE="`ssh ${RESOURCE} squeue --name=$NAME --user=$USERNAME -o "%R" -h`"
echo $MACHINE
# If we didn't get a node...
if [[ "$MACHINE" == "" ]]
then
echo "Tried ${ATTEMPTS} attempts!" 1>&2
exit 1
fi
}
#
# Instructions
#
function instruction_get_logs() {
echo
echo "== View logs in separate terminal =="
echo "ssh ${RESOURCE} cat $RESOURCE_HOME/forward-util/${SBATCH_NAME}.out"
echo "ssh ${RESOURCE} cat $RESOURCE_HOME/forward-util/${SBATCH_NAME}.err"
}
function print_logs() {
ssh ${RESOURCE} cat $RESOURCE_HOME/forward-util/${SBATCH_NAME}.out
ssh ${RESOURCE} cat $RESOURCE_HOME/forward-util/${SBATCH_NAME}.err
}
#
# Port Forwarding
#
function setup_port_forwarding() {
echo
echo "== Setting up port forwarding =="
sleep 5
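# Chain two ssh tunnels: local machine -> login node (${RESOURCE}) -> compute node ($MACHINE), all on port $PORT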
echo "ssh -f -4 -L $PORT:localhost:$PORT ${RESOURCE} ssh -4 -L $PORT:localhost:$PORT -N $MACHINE "
ssh -f -4 -L $PORT:localhost:$PORT ${RESOURCE} ssh -4 -L $PORT:localhost:$PORT -N "$MACHINE"
}
#!/bin/bash
#
# Galileo cluster at CINECA
# Prints an ssh configuration snippet for the user
# Sample usage: bash galileo_ssh.sh
echo
read -p "Galileo username > " USERNAME
# Galileo login node address
LOGIN_NODE=login.galileo.cineca.it
echo "Host galileo
User ${USERNAME}
Hostname ${LOGIN_NODE}
StrictHostKeyChecking no"
#!/bin/bash
#
# Resumes an already running remote sbatch job.
# Sample usage: bash resume.sh jupyter
if [ ! -f params.sh ]
then
echo "Need to configure params before first run, run setup.sh!"
exit
fi
source params.sh
NAME="${1}"
# The job name is required; the port comes from params.sh
echo "ssh ${RESOURCE} squeue --name=$NAME --user=$USERNAME -o "%N" -h"
MACHINE=`ssh ${RESOURCE} squeue --name=$NAME --user=$USERNAME -o "%N" -h`
ssh -4 -L $PORT:localhost:$PORT ${RESOURCE} ssh -4 -L $PORT:localhost:$PORT -N $MACHINE &
#!/bin/bash
PORT=$1
NOTEBOOK_DIR=$2
cd $NOTEBOOK_DIR
. ~/.bashrc
module load cuda/10.0
conda activate /gpfs/scratch/userinternal/mrorro00/tensorflow-2.0rc
jupyter notebook --ip=127.0.0.1 --no-browser --port=$PORT
#!/bin/bash
PORT=$1
NOTEBOOK_DIR=$2
cd $NOTEBOOK_DIR
module load anaconda/2019.07
jupyter notebook --ip=127.0.0.1 --no-browser --port=$PORT
#!/bin/bash
PORT=$1
NOTEBOOK_DIR=$2
cd $NOTEBOOK_DIR
module load python/3.6.4 cuda/10.0
module load profile/deeplrn cudnn/7.6.3--cuda--10.0
#module load profile/deeplrn cudnn/7.5.1--cuda--10.0
. /gpfs/scratch/userinternal/mrorro00/tensorflow-gpu-2.0rc/bin/activate
jupyter notebook --ip=127.0.0.1 --no-browser --port=$PORT
#!/bin/bash
PORT=$1
NOTEBOOK_DIR=$2
cd $NOTEBOOK_DIR
module load profile/deeplrn
module load autoload tensorflow/1.12
. /gpfs/scratch/userinternal/mrorro00/tensorflow-gpu-2.0rc/bin/activate
jupyter notebook --ip=127.0.0.1 --no-browser --port=$PORT
#!/bin/bash
PORT=$1
NOTEBOOK_DIR=$2
cd $NOTEBOOK_DIR
module load python/3.6.4
. /gpfs/scratch/userinternal/mrorro00/jupyter/bin/activate
jupyter notebook --ip=127.0.0.1 --no-browser --port=$PORT
#!/bin/bash
#
# Sets up parameters for use with other scripts. Should be run once.
# Sample usage: bash setup.sh
echo "First, choose the resource identifier that specifies your cluster resoure. We
will set up this name in your ssh configuration, and use it to reference the resource (galileo)."
echo
read -p "Resource identifier (default: galileo) > " RESOURCE
RESOURCE=${RESOURCE:-galileo}
echo
read -p "${RESOURCE} username > " USERNAME
port=$((32000 + RANDOM))
echo
echo "Next, pick a port to use. If someone else is port forwarding using that
port already, this script will not work. If you pick a random number in the
range 49152-65335, you should be good."
echo
read -p "Port to use (default:-$port)> " PORT
PORT=${PORT:-$port}
echo
echo "Next, pick the ${RESOURCE} partition on which you will be running your
notebooks. Default partition (gll_all_serial). To specify gpu partition use gll_usr_gpuprod --gres=gpu:kepler:1"
echo
read -p "${RESOURCE} partition (default: gll_all_serial) > " PARTITION
PARTITION=${PARTITION:-gll_all_serial}
echo
echo "Specify the project to be accounted"
echo
read -p "Account to use > " ACCOUNT
TIME=4:00:00
for var in USERNAME PORT PARTITION RESOURCE ACCOUNT TIME
do
echo "$var="'"'"$(eval echo '$'"$var")"'"'
done >> params.sh
#!/bin/bash
#
# Starts a remote sbatch job without port forwarding.
# Sample usage: bash start-node.sh singularity docker://ubuntu
if [ ! -f params.sh ]
then
echo "Need to configure params before first run, run setup.sh!"
exit
fi
. params.sh
if [ ! -f helpers.sh ]
then
echo "Cannot find helpers.sh script!"
exit
fi
. helpers.sh
if [ "$#" -eq 0 ]
then
echo "Need to give name of sbatch job to run!"
exit
fi
NAME="${1:-}"
# The user could request either <resource>/<script>.sbatch or
# <name>.sbatch
SBATCH="$NAME.sbatch"
# Exponential backoff Configuration
# set FORWARD_SCRIPT and FOUND
set_forward_script
check_previous_submit
echo
echo "== Getting destination directory =="
RESOURCE_HOME=`ssh ${RESOURCE} pwd`
ssh ${RESOURCE} mkdir -p $RESOURCE_HOME/forward-util
echo
echo "== Uploading sbatch script =="
scp "${FORWARD_SCRIPT}" "${RESOURCE}:$RESOURCE_HOME/forward-util/"
# adjust PARTITION if necessary
set_partition
echo
echo "== Submitting sbatch =="
SBATCH_NAME=$(basename $SBATCH)
command="sbatch
--job-name=$NAME
--partition=$PARTITION