Workflow
This document introduces tools commonly used for remote editing and development. The general workflow is the following:
- You create and edit code on your local machine
- An independent program mirrors your code to the host (syncthing in this document)
- You interact with the host through a terminal, give commands and run code (tmux)
- Finally, you debug your code (debug section), and edit it on your local machine
This workflow allows you to:
- use your preferred software for editing
- use very slow connections (if you are on a train or in the Smurfs' house, for instance)
- use any language on the host
- run intensive tasks for many days, even if your connection is interrupted
- have the resources on the host machine always at hand
Basic set up
Create a .ssh folder in your home if it doesn’t exist and put in it a file called config with the following content:
Host *
    ControlMaster auto
    ControlPath ~/.ssh/sockets/%r@%h-%p
    ControlPersist 60

# Add the forwards only when no shared connection exists yet
Match host amadeus.lim.di.unimi.it !exec "[ -e ~/.ssh/sockets/%r@%h-%p ]"
    LocalForward 22001 localhost:22000
    RemoteForward 22001 localhost:22000
    LocalForward 8384 localhost:8384
    LocalForward 8097 localhost:8097
    LocalForward 6006 localhost:6006
This configuration lets several ssh commands reuse the first connection: once you connect through ssh to a host, all the following ssh commands to that host use the same connection. You won’t need to retype your password, the overhead decreases, and firewalls will not block you because of a growing number of connections. The ControlPath option tells ssh which file to use as the socket.
Moreover, we set ssh to always forward certain ports for the server amadeus.lim.di.unimi.it.
For instance, port 8097 is the one used by visdom for plotting from PyTorch and numpy, port 8384 serves the Syncthing web interface, and ports 22001 and 22000 are used by Syncthing for syncing over ssh. Port 6006 is the default port of TensorBoard.
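Note that ssh does not create the folder named in ControlPath by itself, so ~/.ssh/sockets has to exist before the first connection. A minimal sketch that creates it with owner-only permissions; the function name ensure_socket_dir and its ssh_dir parameter are hypothetical and only serve this example:

```python
from pathlib import Path


def ensure_socket_dir(ssh_dir=None):
    """Create <ssh_dir>/sockets with owner-only permissions and return its path."""
    ssh_dir = Path(ssh_dir) if ssh_dir is not None else Path.home() / ".ssh"
    sockets = ssh_dir / "sockets"
    # Create ~/.ssh first if it is missing, then the sockets folder inside it.
    ssh_dir.mkdir(mode=0o700, parents=True, exist_ok=True)
    sockets.mkdir(mode=0o700, exist_ok=True)
    # mkdir() leaves the mode of an already existing folder unchanged
    # and is filtered by the umask, so set the mode explicitly.
    sockets.chmod(0o700)
    return sockets
```

Calling ensure_socket_dir() without arguments works on ~/.ssh; passing another folder is mostly useful for trying the function out without touching your real configuration.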
Syncing with syncthing
syncthing is an open source p2p syncing tool, much more powerful than simple SFTP clients. However, it must be configured on both sides.
1. Download and install syncthing on both client and host (amadeus has it, but you can still download the pre-compiled binaries into your home if the host does not have it)
2. Start syncthing on the client and go to localhost:8384 or to the displayed address
3. Look into Actions > Show ID for the client device ID
4. Connect to the host via ssh tunneling: ssh user@host -L 8385:localhost:8384
5. Start syncthing on the host and go to localhost:8385 from the client browser to access the syncthing interface of the host
6. Look into Actions > Show ID for the host device ID
7. On both client and host add a remote device (on the host with the client ID and on the client with the host ID)
8. Add a new folder and share it with the host
9. Add ignore patterns (e.g. .git)
10. Set folder type to send only
11. Check that watch for changes is selected
12. Repeat points 8-11 on the host
13. On the client, click on the top-right Actions > Advanced > [synced folder name] > Fs Watcher Delay and set it to 1 (the minimum allowed, 1 second)
Save the configuration and restart.
If one of your machines is behind a strict firewall, it can be useful to connect through ssh tunneling.
This requires that you first connect to the remote machine through ssh with port forwarding and then start syncthing on both machines.
See the .ssh/config file in the previous section. More information is available in the official Syncthing guides.
Running tmux
Once you have synced your code, you should connect through ssh to the remote host and start tmux to run your code. tmux is a nice piece of software for working on a remote host. It lets you run long experiments that keep going after your machine disconnects (e.g. experiments lasting many days). I really recommend always using tmux. With it, you can also create multiple tabs and splits within the same terminal. One recurring pattern, for instance, is having the terminal with 3 splits: one for the running experiment, the second for monitoring the resources on the remote machine and the third for giving commands to the machine.
To control tmux you need to prefix each command with CTRL-b: everything pressed after this combination will be interpreted as a tmux command and will not be written to the terminal.
These steps exemplify the use of tmux:
- Connect to the remote host with ssh username@host
- Run tmux. Now, a new blank terminal is shown. Notice the bottom bar showing some useful information
- Press in sequence CTRL-b and %. Now the window is split.
- Try CTRL-b and ".
- With CTRL-b and x you can kill a split (and all the processes running in it). The same happens with the command exit
- With CTRL-b and an arrow key you can move to an existing split
- With CTRL-b and c you can create a new tab
- With CTRL-b and [ you enter “copy mode”: in this mode you can go up and down in the terminal and you can select and copy its content. Try moving around with the arrows and the PgUp and PgDown keys.
- Press CTRL+SPACE to start selecting
- Press ALT-w or CTRL-w to copy the selection
- Press q to exit the copy mode
- Press CTRL-b and ] to paste
- Try CTRL-b and d. The tmux session will be detached
- Exit from the ssh connection and reconnect
- Use tmux a to reattach to the previous session, which has kept running in the meantime
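The three-split pattern mentioned earlier can also be scripted instead of typed by hand. The following is only a sketch, assuming tmux and htop are installed on the host; the session name exp and the function name layout_commands are hypothetical. By default it only prints the tmux commands, and it runs them when called with --run:

```python
import shlex
import subprocess
import sys


def layout_commands(session="exp"):
    """Return the tmux commands that build a three-split session."""
    return [
        # Detached session: its first pane is meant for the experiment.
        ["tmux", "new-session", "-d", "-s", session],
        # Side-by-side split; the new pane becomes the active one.
        ["tmux", "split-window", "-h", "-t", session],
        # Start the resource monitor in the pane just created.
        ["tmux", "send-keys", "-t", session, "htop", "Enter"],
        # Split the monitor pane top/bottom: a free pane for commands.
        ["tmux", "split-window", "-v", "-t", session],
        # Finally attach the current terminal to the session.
        ["tmux", "attach-session", "-t", session],
    ]


if __name__ == "__main__":
    run = "--run" in sys.argv[1:]
    for command in layout_commands():
        print(shlex.join(command))
        if run:
            subprocess.run(command, check=True)
```

The commands are kept as a plain list so that you can inspect or adapt them before anything is executed on the host.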
Since the CTRL-b key combination is not that comfortable, I suggest changing it: create a file in your home called .tmux.conf and put the following lines in it:
unbind-key C-b
set -g prefix `
bind-key ` send-prefix
Now, instead of pressing CTRL-b, you can just press the ` character (on an Italian keyboard you could use the \ character, for instance).
You can also customize your keybindings to make them easier to remember. For instance, I use keybindings very similar to those of vim. For more information, see https://tmuxcheatsheet.com/.
N.B. If the terminal becomes unresponsive, it may be that the ssh connection has been closed server-side (e.g. because of a connection error). In such a case, press Enter, then ~ and then . to close the connection and return to your local terminal. If it doesn’t work, see the “SOS” section.
Controlling resources
It is good practice to continuously monitor the resources on the host in a separate split.
For CPU and RAM, just run the htop command in a separate, always-visible split.
For GPU, run watch -n 5 nvidia-smi. The watch command will run the nvidia-smi command every 5 seconds.
With these two commands you should be able to understand if:
- you are using too many resources (please, stop your program before the machine RAM is full, otherwise see “SOS” section)
- you are using too few resources (for instance if the GPU is not used or no parallel processing is in action)
- you are using wrong resources (too many parallel processes, wrong GPU, etc.)
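If you prefer to log GPU usage from a script rather than watch it in a split, nvidia-smi can print selected fields as CSV. A minimal sketch, assuming nvidia-smi is on the PATH of the host and that every queried field is numeric (a field reported as [N/A] would make the parsing fail); the function names and the sample values in the usage note are hypothetical:

```python
import subprocess

# Fields requested from nvidia-smi, in this order.
QUERY = "index,memory.used,memory.total,utilization.gpu"


def parse_gpu_csv(text):
    """Parse the CSV lines printed by nvidia-smi into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        index, used, total, util = (int(field) for field in line.split(","))
        gpus.append(
            {
                "index": index,
                "used_mib": used,
                "total_mib": total,
                "util_percent": util,
            }
        )
    return gpus


def query_gpus():
    """Run nvidia-smi once and return the parsed usage of every GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_csv(result.stdout)
```

For example, parse_gpu_csv("0, 1234, 11019, 45") returns a single entry with used_mib equal to 1234 and util_percent equal to 45.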
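Note that in pytorch you should use the internal functions torch.cuda.memory_allocated and torch.cuda.max_memory_allocated, since nvidia-smi does not show the amount of GPU memory actually used by your tensors: the caching allocator of PyTorch keeps freed memory reserved, and nvidia-smi reports it as used. See the PyTorch docs on memory management.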
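A minimal sketch of how these functions can be combined into a small report, assuming PyTorch is installed and a CUDA device is visible; the helper names to_mib and cuda_memory_report are hypothetical, and torch.cuda.memory_reserved is added here to show the memory held by the caching allocator:

```python
def to_mib(num_bytes):
    """Convert a byte count to mebibytes."""
    return num_bytes / 2**20


def cuda_memory_report(device=0):
    """Return the GPU memory usage of `device` as seen by PyTorch, in MiB."""
    import torch  # imported lazily so that to_mib() works without PyTorch

    return {
        # Memory currently occupied by live tensors.
        "allocated": to_mib(torch.cuda.memory_allocated(device)),
        # Peak of the value above since the start of the program.
        "peak_allocated": to_mib(torch.cuda.max_memory_allocated(device)),
        # Memory held by the caching allocator, including freed blocks.
        "reserved": to_mib(torch.cuda.memory_reserved(device)),
    }
```

Calling cuda_memory_report() at a few points of a training loop is usually enough to see whether the peak is close to the capacity of the card.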
Debugging code
To debug code remotely you need to use a debugger. I suggest always debugging your code without parallelism whenever possible.
Python
In Python, just use the default debugger. Since version 3.7 you can simply add the instruction breakpoint() one line before the line where you want to start debugging. You can set up your preferred debugger through the PYTHONBREAKPOINT environment variable (I suggest ipdb). For previous versions use import pdb; pdb.set_trace().
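As an illustration, here is a minimal sketch of where such a call could go. The function running_mean and the MYAPP_DEBUG environment variable are hypothetical: the variable only keeps the script non-interactive unless you ask for the debugger.

```python
import os


def running_mean(values):
    """Return the cumulative means of `values`."""
    total = 0.0
    means = []
    for count, value in enumerate(values, start=1):
        total += value
        if os.environ.get("MYAPP_DEBUG"):
            # Execution pauses here; inspect total, count and value,
            # then type c to continue.
            breakpoint()
        means.append(total / count)
    return means


if __name__ == "__main__":
    print(running_mean([1, 2, 3]))
```

Running the script with MYAPP_DEBUG=1 opens the debugger at every iteration. Setting PYTHONBREAKPOINT=ipdb.set_trace as well makes breakpoint() open ipdb instead of pdb, and PYTHONBREAKPOINT=0 disables all breakpoint() calls.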
A better debugger is ipdb (import ipdb; ipdb.set_trace()). You will probably need to install it: pip install ipdb. See its commands by pressing h or in its documentation.
If you really want a graphical debugger, you can try wdb. You will probably need to install it: pip install wdb.
Another useful tool is pysnooper. You can use it in place of printing to stdout or of logging. It’s much easier to use and very powerful. I use it for debugging scripts with multiprocessing on different files.
I also recommend using pyenv and poetry to isolate your project from the OS python packages.
Matlab
In matlab, you can use the default debugger by placing the statement keyboard right after the line where you want to stop. You will then get a prompt where you can use the usual matlab commands to show variables. However, editing the command line requires emacs keys (as of now I am still not able to change these mappings). Remember these ones:
KEY | ACTION |
---|---|
CTRL-a | move cursor to (at) beginning-of-line |
CTRL-e | move cursor to end-of-line |
CTRL-f | move cursor forward one character |
CTRL-b* | move cursor backward one character |
* if you are using the default tmux keymap, CTRL-b is also the tmux prefix; in this case you’ll have to press CTRL-b twice to send it to Matlab
For managing the debugging itself, you can use dbcont, dbstep, dbquit and all the commands listed in the official docs
Julia
Use the Debugger package: install it with the command ]add Debugger from the Julia REPL.
Set breakpoints with @bp. Start debugging a function with @enter functionName(args) (it will stop at the first instruction) or with @run functionName(args) (it will stop at the first breakpoint).
Open the REPL. Run include("filename"); mainFunction() to test mainFunction after having edited it. Alternatively, you can also try Revise to automatically reload changed modules.
SOS
It can happen that your program fills the host resources. In that case, you can:
1. try to kill the program from inside the ssh connection by holding down CTRL-C
2. if 1. doesn’t work, create a new terminal or a new split in tmux and run kill PROCESS_NUMBER (you can find the process number with ps x or htop)
3. if 2. doesn’t work either, try killing the tmux split with CTRL-b and x
4. if your ssh connection is completely saturated or the host doesn’t answer, then from another local terminal run ssh username@host killall program, where program is the command of the experiment that you were executing (e.g. python, python3, matlab, etc.). This command will try to create a new ssh connection, run the command killall program and then exit immediately.
5. if the machine is not responding, contact the administrator as soon as possible
Apptainer
Why
When you are on a server without root access and you need to install some application, you are in a difficult situation. You can try to install it in your home, but it is not always possible and usually requires the admin intervention (e.g. for Homebrew and Nix). Moreover, if you are worried about computational performance, Homebrew and Nix are not the best solutions. The best solution is to use a container.
Containers are environments that are isolated from the rest of the system. Most container technologies, however, still require the admin intervention to be installed, even though they are much more likely to be already set up on the server than Homebrew and Nix.
It is difficult to choose the proper technology, though. 99% of the benchmarks around are made by sysadmins, who are mainly interested in startup times and web server performance, not in the computational overhead imposed by the container while running our scientific code.
Some time ago I ran a few small benchmarks for this purpose. The results are shown below:
As you can see, the best solution is by far Apptainer, followed by docker, devbox and podman. It must be noted that the total score of docker is so high because of its ability to read and write files asynchronously. However, this is usually not that relevant for scientific code. Leaving that aside, I would suggest using podman over docker.
But the truth is that Apptainer has been built for scientific code and perfectly fits our needs:
- you can use it everywhere, without root access (docker and podman require some setup by the admin)
- an Apptainer image is easy to build and to share, so your code will be totally reproducible
- you can use Docker images in Apptainer, so you can use most of the Docker ecosystem
- it is fast, especially in terms of CPU and RAM usage
- it gives you access to your home and files on the server by default in an optimized way
How
So, use it. The workflow is like this:
1. Install Apptainer on your computer, where you have admin/root access
2. Prepare a definition file that specifies an OS, the packages to install and the commands to run to configure the container
3. Copy the Apptainer binary installed on your computer to the server (it’s portable! 🎉)
4. Build the container
5. Copy the Apptainer image to the server
6. Run the container
7. Execute your code
Points 1 and 2 are executed only once, while 3-5 are executed for each different server you deal with. Points 6-7 are executed every time you want to run your code.
Installation
Here (Linux) and here (Windows and Mac) are the instructions to install Apptainer. On Windows, I suggest you use WSL, which is simpler.
Building an image
Now you need to define an image. Here is an example that I have used in the past. However, here you can find more details.
You can build an image from a definition file with apptainer build mycontainer.sif mycontainer.def, where mycontainer.def is the name of the definition file (e.g. the one below) and mycontainer.sif is the name of the image you want to create.
# This line means "pick the base container image from
# the docker hub".
Bootstrap: docker
# Use this line instead when starting from a local image file (see below).
# Bootstrap: localimage
# Here we specify the particular image we are
# interested in using as the base image, in this case
# a basic `fedora` system at version `39`.
# The base image is the operating system configuration
# that you want to customize.
From: fedora:39
# You can also start building from an existing image to modify it.
# In this case, use `Bootstrap: localimage`, comment out the `From` line above
# and the lines in %post that were already executed, and uncomment this line:
# From: mycontainer.sif
# environment variables here
%environment
export LANG=en_US.UTF-8 # luatex needs locale set
# A definition file has several sections, see the documentation.
# In the `post` section you can run commands to customize
# your environment
%post
# This is the place where you can
# install additional dependencies.
########### My basic stuffs ###########
# software for development
dnf -y install tmux neovim syncthing fish zoxide fzf ripgrep openssh-clients openssh powerline git hostname fd-find copr-cli procps-ng htop
# python build dependencies
dnf -y install make gcc patch zlib-devel bzip2 bzip2-devel readline-devel sqlite sqlite-devel openssl-devel tk-devel libffi-devel xz-devel libuuid-devel gdbm-libs libnsl2
# python-pip
dnf -y install python3-pip
# lazygit
dnf -y install 'dnf-command(copr)'
dnf copr enable atim/lazygit -y
dnf -y install lazygit
# font
curl -OL https://github.com/ryanoasis/nerd-fonts/releases/latest/download/SourceCodePro.tar.xz
# locale
dnf -y install glibc-locale-source glibc-langpack-en
########### Other more specific stuffs ###########
# luatex and gregotex
dnf -y install texlive-collection-luatex texlive-collection-fontsrecommended texlive-collection-latexrecommended texlive-collection-latexextra texlive-collection-latex texlive-collection-music texlive-collection-mathscience texlive-gregoriotex
dnf -y install latexmk
# tesseract
dnf -y install tesseract tesseract-langpack-eng tesseract-langpack-ita tesseract-langpack-ita_old tesseract-script-latin tesseract-tools tesseract-osd
Running an image
Now you can run the image. You can do it in two ways:
- Running an instance, to which you can connect even after an ssh disconnection:
apptainer instance start mycontainer.sif a_name_for_this_instance
apptainer shell instance://a_name_for_this_instance
apptainer instance stop a_name_for_this_instance
- Just running some commands in the container and closing it:
apptainer shell mycontainer.sif
You could also define a command to run inside the container in the %runscript section of the definition file. In this case, you can just run ./mycontainer.sif to run it, or apptainer instance run mycontainer.sif a_name_for_this_instance to run it in a detached instance.