How to use chdb for launching embarrassingly parallel jobs ? A tutorial describing the supported use cases
A REPRENDRE !!!!
In this tutorial, you’ll learn how to use chdb to run your embarassingly parallel computations. All the usecases supported by chdb will be described here.
How to use this tutorial ?
Each entry point of the tutorial is dedicated to a usecase, from the most general and simple to the most specific and complicated. To understand what happens when chdb is launched for a real problem we’ll use some very simple bash commands. You should :
- read carefully the few lines before and after each code sample, to understand the point.
- copy-and-paste each code sample provided here, execute it from the frontal node, and look at the created files using the unix standard tools as find, ls, less, etc.
Of course, in the real life, you’ll have to replace the toy command line launched by chdb by your code, and yo must launch chdb through an sbatch command, as usual. Please remember you cannot use the frontal nodes to launch real applications, the frontal nodes are dedicated to file edition or test runs.
You should be connected to eos before working on this tutorial. You must initialize your environment  to be able to use the commands described here :
You should work in an empty temporary directory, to be sure you’ll not remove important files while executing the exercises described in the following paragraphs :
You can ask
chdb for help :
The goal of chdb is to execute the same code on a series of input files, and store the results produced by those instances.
Using chdb implies that the executions of your code are independent and that the order in which the files are treated is irrelevant.
In order to use
chdb, you must put all your input files inside a common "input directory".
Let’s create 10 files under a directory called
Then let’s use those 10 files as entry for a little toy program. We specify
--in-type txt to tell chdb to take all
*.txt files as input :
A new directory has been created, it is called
inp.out and it contains all the output files. If you don’t believe me, just try those commands :
Run again the toy program : it does not work, you get the following message :
chdb creates himself the output directory, and refuses to run if the output directory already exists : thus you are protected from destroying by mistake previous computation results
chdb is an mpi application running in master-slave mode. This means that the rank 0 process is the master, it distributes the tasks to the other processes, called "slaves". Those processes are responsible for launching your code. Thus if you request n processes for your chdb job, you will have n-1 slaves running.
However, the master should not take too much cpu time, as it only sends some work to the slaves and retrieves their result. So on a 20 cores-node machine, this makes sense to request 21 processes. If you start chdb on 2 nodes, you could request 41 processes, they should be correctly placed on the cores.
Please note that the number of slaves running should be less or equal than the number of input files. For instance, the following will not work (2 input files and 3 slaves) :
try again replacing
-n 4 by
If you do not provide chdb with any output specification, chdb will build an output directory name for you. But of course it is possible to change this behaviour. This is particularly useful when using chdb through slurm (which is the standard way of working), as you can integrate the job Id in the output directory name :
May be you’ll want to execute the program only on a subset of the files found in the input directory :
Create 10 files in an input directory :
Create a file containing the list of files to be processed. this will be our subset :
Execute chdb using only the subset as input :
The h in
chdb stands for hierarchy : the input files do not have to be in a "flat directory", they can be organized in a hierarchy, which is sometimes more convenient if you have a lot of files. The following commands will create a hierarchy in the input directory, execute the toy program on the .txt files, then recreate the same hierarchy in output an write here the outputfiles :
Please note the switch
--out-files is required here, as it is the only way for chdb to know which subdirectories must be made to recreate the hierarchy in the output directory (you may just try to run the command above without specifying
--out-files and see what happens)
The default behaviour of chdb is to stop as soon as an error is encountered, as shown here : we create 10 files, artificially provoke an error on two files, and run chdb :
chdb writes an error message on the standard error, and stops working : very few output files are indeed created. We know that the problem arose in file
2.txt, so it is easy to investigate :
Thanks to this behaviour, you’ll avoid wasting a lot of cpu time just because you had a error in some parameter. However, please note that the file 3.txt has the same error, but as chdb stopped before running processing this file, you are not aware of this situation. You can modify this behaviour :
Now all files were processed, and a file called
errors.txt is created : the files for which your program returned a code different from 0 (thus considered as an error) are cited in the output file errors.txt. It is thus simple to know exactly what files were not processed, and for what reason (if the error code returned by your code is meaningful). When the problem is corrected, you’ll be able to run chdb again using only those files on input.
for more clarity we changed the file name of errors.txt to input_files.txt, but this is not required :
You may ask an execution report with the switch
The report gives you some information about :
- The treatment time of every input file
- The computation time of every slave
- Some global data
Those informations are useful to check and improve load balancing, as explained under :
Load balancing is an important feature to be aware of, and may be to improve. You can have some information using the
--report switch, as explained above.
If the work is well load balanced, all the slaves finish their job at the same time, an no cpu cycles are wasted. In the opposite, if some slaves take more time to complete as others, the faster ones have to wait for the latecomers, and you can loose a lot of cpu time.
In the following example, we have have 10 files to treat, and we launch chdb with 4 slaves. each treatment takes about 1s :
Here is the report file :
It is easy to see that 2 slaves work during 2 seconds each, and 2 slaves work during 3 seconds. The first two slaves will have to wait for the last two. But you reserved 4 processors, so you’ll have to "pay" for 4x3=12 seconds, for only 10 useful seconds : 17% cpu time is wasted. In this very simple example, you can easily correct the problem, using 5 slaves instead of 4 :
The report.txt file shows that this time you will have to "pay" for only 5x2=10 seconds.
In the following example, we create files of different size, then we arrange for the toy code to last more time on bigger files :
The report shows that the situation is somewhat catastrophic in terms of load balancing :
Files 8 et 9 are treated at the end, because the files are by default distributed to the slaves in alphabetical order. Unfortunately, the last file needs a rather long time to be processed, leading to a bad load balancing.
A more clever scenario could consist of working at first with the long jobs : the file nb 9 will be treated first (by slave nb 1), letting the other files to the other slave, hence a much better load balancing :
The load balancing is dramatically improved :
NOTE - Many codes, but not all codes, take a longer time to treat bigger files than smaller ones. If this is a feature of your code, you can try using the
--sort-by-size switch. but if it is not, this will be useless.
While having different run times for the many code instances you launch may cause a performance issue (see above), launching a code which lasts always the same time may also be a problem : Real codes, when running, frequently need to read or write huge datafiles, generally at the beginning of the job and at the end (and sometimes during the job, between iterations). If you launch simultaneously 10 such codes with chdb, the file system may be rapidly completely saturated, because all the jobs will try to read or write data synchronously. chdb allows for you desynchronizing the jobs, as shown under :
--sleep n causes each slave to wait n * slave_nb seconds before the first run, where rank is the mpi rank (the slave number). With this trick, the slaves should be desynchronized : If every slave reads or writes files at the same time relative to the start of the run, the shift in start time insures that the I/O operations are not synchronized.
Warning ! If you work with say 100 slaves, the slave nb 99 will have to wait 2x99=198 seconds before starting, which can be a problem unless the total time of execution is much more than that. We are still working on this point to improve it.
The code you want to launch through chdb does not have to be sequential : it can be an mpi-base program. In this case you must tell this to chdb.
To go on with the tutorial, please download this toy program. This code :
- Prints the chdb slave number and the mpi rank number
- Is linked to the numa library, to print the cores that could be used by each mpi process
(if you do not understand the C code of hell-mpi.c this is not a problem, it is not needed to go on with this tutorial).
We compile the program :
Now, let’s run our mpi code through
chdb : for this exercise we use srun, in order to leave the front node and work on a complete node. We have 10 input files, and our slave needs 2 mpi processes to run. This makes sense reserving a complete node, starting 11 mpi processes :
NOTE : We are using srun with intel-mpi, do not forget the traditional export (see also here) :
WARNING : In a production context, do not put the >2/dev/null, as this throws away the error messages you could get !
If you look at the output above, you’ll see that if several mpi processes belonging to the same slave do not overlap (they use different core sets), there is a huge overlap between different chdb slaves. This can be avoided, slightly modifying the
chdb command 
The important switch is again
--mpi-slaves 10:2:1, telling chdb that :
- We are running 10 slaves per node
- Each slave is an mpi code using 2 mpi processes
- Each mpi process uses 1 thread
Many scientific codes prefer to work inside a current directory, reading and writing files in this directory, sometimes hardcoding file names.
This is possible with chdb, you just have to use
dir as a file type (consequently dir is a reserved file type and you should not use it for ordinary files). Or course, your code may - or not - be an mpi code.
Here is a sample :
The behaviour is somewhat different now :
- There is no default output directory, because we now work in the input directory, which is also the output directory.
- However, you can specify an output directory if this is convenient for you
- While in the previous chdb runs the input directory could have been read only, this is no more the case.
chdb can be used to execute a code a predefined number of times, without specifying any file or directory input :
If you program many iterations and generate 1 or more files per iteration, please store them into a bdbh container : you have to call the output directory
anything.db and to specify the switch
It is even not required specifying an output directory, as shown here :
The environment variables and the file specification templates are still available in this mode of operation.
WARNING - The feature described here is still considered as EXPERIMENTAL - Feedbacks are welcome
chdb receives a SIGTERM or a SIGKILL signals (for instance if you
scancel your running job), the signal is intercepted, and the input files or iterations not yet processed are saved to a file called
CHDB-INTERRUPTION.txt. The files being processed when the signal are received are considered as "not processed", because the processing will be interrupted.
It is then easy to launch
chdb again, using
CHDB-INTERRUPTION.txt as input.
Let’s try it using in-type iteration for more simplicity :
Execute the above commands, wait about 10 seconds, then type
CTRL-C. The program is interrupted,
CHDB-INTERRUPTION.txt is created.
First have a look to
CHDB-INTERRUPTION.txt, check the last line :
If this line is not present, the
CHDB-INTERRUPTION.txt is not complete and it can be difficult to resume the execution of your job.
CHDB-INTERRUPTION.txt is complete, we can launch
chdb again using the
--in-files switch, and this time we wait until the end :
You can now check that every file was correctly created ; the following command should display 100 :
Small files, big data
WARNING - The feature described here is still considered as EXPERIMENTAL - I you need using it, please contact us !
Sometimes, users have to process a huge quantity (says at least 100000) of very small (currently < 4Mb) files. Those situations are very difficult to manage for the filesystems. To support these use cases,
chdb can work with all your files (input and output) deposited in a "data container", so that the input and the output appears as a pair of big files to the filesystem. The data container can be created and managed with the program called
bdbh. You can read now the bdbh tutorial (to be published), however the following is enough for beginning with
The first step is to create the data container itself :
Then, we fill the data container with 100 little files :
Run chdb against those files, as usually :
Check that 100 files were created :
Read the output :
- When the input directory is called
something.db, chdb looks for a bdbh datacontainer, and complains if it does not find it.
- The default output directory is
- The default output directory is a bdbh data container
- The switch
out-filesis required, because chdb must know the name of the output files to be able to store them inside the output bdbh container.