How to use this tutorial?
Each entry point of the tutorial is dedicated to a use case, from the most general and simple to the most specific and complicated. To understand what happens when chdb is launched on a real problem, we’ll use some very simple bash commands. You should:
- read carefully the few lines before and after each code sample, to understand the point;
- copy and paste each code sample provided here, execute it from the frontal node, and look at the created files using standard Unix tools such as find, ls, less, etc.
Of course, in real life you’ll have to replace the toy command line launched by chdb with your own code, and you must launch chdb through an sbatch command, as usual. Please remember you cannot use the frontal nodes to launch real applications; the frontal nodes are dedicated to file editing and test runs.
Also in real life, you should use srun, not mpirun, to launch the chdb program. In this tutorial, mpirun is used to stay on the frontal node, and srun when we need to go to the compute nodes.
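For reference, a minimal submission script might look like the sketch below (the job name, task count and time limit are placeholders to adapt to your own case; only the srun line comes from the examples in this tutorial):

```shell
#!/bin/bash
#SBATCH --job-name=chdb-demo      # placeholder name
#SBATCH --nodes=1
#SBATCH --ntasks=37               # 1 master + 36 slaves on a 36-core node
#SBATCH --time=01:00:00           # placeholder time limit

module load intelmpi chdb/1.0

# Replace the toy command line below with your real code.
srun chdb --in-dir inp --in-type txt \
     --command-line 'cat %in-dir%/%path% >%out-dir%/%basename%.out' \
     --out-dir $SLURM_JOB_ID
```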
Introduction
Before starting...
You should be connected to Olympe before working on this tutorial. You must initialize your environment to be able to use the commands described here:
module load intelmpi chdb/1.0
You should work in an empty temporary directory, to be sure you will not remove important files while executing the exercises described in the following paragraphs:
mkdir TUTO-CHDB
cd TUTO-CHDB
Help!
You can ask chdb for help:
chdb --help
Essential
- Executing the same code on several input files
- How many processes for a chdb job?
- Specifying the output directory
- Executing your code on a subset of files
- Executing your code on a hierarchy of files
- Managing the errors
Executing the same code on several input files
The goal of chdb is to execute the same code on a series of input files, and store the results produced by those instances.
Using chdb implies that the executions of your code are independent and that the order in which the files are treated is irrelevant.
In order to use chdb, you must put all your input files inside a common "input directory".
Let’s create 10 files under a directory called inp:
rm -r inp inp.out; mkdir inp; for i in $(seq 1 10); do echo "coucou $i" > inp/$i.txt; done
Then let’s use those 10 files as input for a little toy program. We specify --in-type txt to tell chdb to take all *.txt files as input:
mpirun -n 2 chdb --in-dir inp \
       --in-type txt \
       --command-line 'cat %in-dir%/%path% >%out-dir%/%basename%.out'
A new directory called inp.out has been created, and it contains all the output files. If you don’t believe me, just try these commands:
find inp.out -ls
find inp.out -type f | xargs cat
Run the toy program again: it does not work, and you get the following message:
ERROR - Cannot create directory inp.out - Error= File exists application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
chdb creates the output directory itself, and refuses to run if the output directory already exists: thus you are protected from destroying previous computation results by mistake.
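As a mental model only (this is not chdb itself), the run above can be emulated sequentially in plain shell, including the template expansion and the output-directory protection; the demo_inp names are arbitrary:

```shell
#!/bin/sh
# Sequential sketch of what the slaves do (no MPI): expand the
# %in-dir%/%path% and %basename% templates for each .txt input file.
rm -rf demo_inp demo_inp.out
mkdir demo_inp
for i in 1 2 3; do echo "coucou $i" > demo_inp/$i.txt; done

in_dir=demo_inp
out_dir=demo_inp.out
if [ -e "$out_dir" ]; then
    # Same protection as chdb: never overwrite previous results
    echo "ERROR - Cannot create directory $out_dir - Error= File exists" >&2
    exit 1
fi
mkdir "$out_dir"
for path in $(cd "$in_dir" && ls *.txt); do
    basename=${path%.txt}              # %basename%: file name without extension
    cat "$in_dir/$path" > "$out_dir/$basename.out"
done
ls "$out_dir"
```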
How many processes for a chdb job?
chdb is an MPI application running in master-slave mode. This means that the rank 0 process is the master: it distributes the tasks to the other processes, called "slaves". Those processes are responsible for launching your code. Thus if you request n processes for your chdb job, you will have n-1 slaves running. However, the master does not take much CPU time, as it only sends some work to the slaves and retrieves their results. So on a 36-core node, it makes sense to request 37 processes. If you start chdb on 2 nodes, you could request 73 processes; they should be correctly placed on the cores. Please note that the number of slaves should be less than or equal to the number of input files. For instance, the following will not work (2 input files and 3 slaves):
rm -r inp inp.out; mkdir inp; for i in $(seq 1 2); do echo "coucou $i" > inp/$i.txt; done
mpirun -n 4 chdb --in-dir inp --in-type txt \
       --command-line 'cat %in-dir%/%path% >%out-dir%/%basename%.out'
Try again, replacing -n 4 with -n 3.
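The sizing rule above can be summed up in a small helper (a sketch; the 36-core figure is the node size mentioned in this tutorial):

```shell
# One master plus one slave per core, capped by the number of input files.
chdb_nprocs() {
    nodes=$1; cores_per_node=$2; nfiles=$3
    slaves=$(( nodes * cores_per_node ))
    if [ "$slaves" -gt "$nfiles" ]; then slaves=$nfiles; fi
    echo $(( slaves + 1 ))          # +1 for the rank 0 master
}
chdb_nprocs 1 36 100    # one full node, plenty of files: prints 37
chdb_nprocs 2 36 100    # two nodes: prints 73
chdb_nprocs 1 36 2      # only 2 input files: prints 3
```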
Specifying the output directory
If you do not provide chdb with any output specification, chdb will build an output directory name for you. But of course it is possible to change this behaviour. This is particularly useful when using chdb through slurm (which is the standard way of working), as you can integrate the job ID in the output directory name:
salloc --time=2:0
mpirun -n 2 chdb --in-dir inp --in-type txt \
       --command-line 'cat %in-dir%/%path% >%out-dir%/%basename%.out' \
       --out-dir $SLURM_JOB_ID
exit
Executing your code on a subset of files
Maybe you’ll want to execute the program only on a subset of the files found in the input directory:
Create 10 files in an input directory:
rm -r inp inp.out
mkdir inp
for i in $(seq 1 10); do echo "coucou $i" > inp/$i.txt; done
Create a file containing the list of files to be processed. This will be our subset:
for i in $(seq 1 2 10); do echo $i.txt >> input_files.txt; done
Execute chdb using only the subset as input:
mpirun -n 2 chdb --in-dir inp --in-type txt \
       --command-line 'cat %in-dir%/%path% >%out-dir%/%basename%.out' \
       --in-files input_files.txt
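The effect of --in-files can be emulated in plain shell (a sequential sketch, not chdb itself; the sub_* names are arbitrary): only the listed files are processed, the others are ignored:

```shell
# Emulate --in-files: process only the files listed in the subset file.
rm -rf sub_inp sub_inp.out sub_list.txt
mkdir sub_inp sub_inp.out
for i in $(seq 1 10); do echo "coucou $i" > sub_inp/$i.txt; done
for i in $(seq 1 2 10); do echo $i.txt >> sub_list.txt; done   # odd files only
while read f; do
    cat "sub_inp/$f" > "sub_inp.out/${f%.txt}.out"
done < sub_list.txt
ls sub_inp.out | wc -l      # 5 files: only the subset was processed
```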
Executing your code on a hierarchy of files
The h in chdb stands for hierarchy: the input files do not have to be in a "flat directory"; they can be organized in a hierarchy, which is sometimes more convenient if you have a lot of files. The following commands create a hierarchy in the input directory, execute the toy program on the .txt files, then recreate the same hierarchy in the output directory and write the output files there:
rm -r inp inp.out
mkdir -p inp/odd inp/even
for i in $(seq 1 2 9); do echo "coucou $i" > inp/odd/$i.txt; done
for i in $(seq 0 2 8); do echo "coucou $i" > inp/even/$i.txt; done
mpirun -n 2 chdb --in-dir inp --in-type txt \
       --command-line 'cat %in-dir%/%path% >%out-dir%/%dirname%/%basename%.out' \
       --out-files %out-dir%/%dirname%/%basename%.out
Please note the switch --out-files is required here, as it is the only way for chdb to know which subdirectories must be created to recreate the hierarchy in the output directory (you may try to run the command above without specifying --out-files and see what happens).
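Again as a sequential sketch (the h_* names are arbitrary, and this is not chdb itself), here is what --out-files allows chdb to do: compute %dirname% for each file and recreate that subdirectory on the output side before writing:

```shell
# Emulate the hierarchy run: recreate each input subdirectory (%dirname%)
# in the output tree before writing the output file.
rm -rf h_inp h_inp.out
mkdir -p h_inp/odd h_inp/even
for i in 1 3; do echo "coucou $i" > h_inp/odd/$i.txt; done
for i in 0 2; do echo "coucou $i" > h_inp/even/$i.txt; done
mkdir h_inp.out
( cd h_inp && find . -name '*.txt' ) | while read path; do
    dir=$(dirname "$path")                 # %dirname%
    base=$(basename "$path" .txt)          # %basename%
    mkdir -p "h_inp.out/$dir"              # the step --out-files makes possible
    cat "h_inp/$path" > "h_inp.out/$dir/$base.out"
done
find h_inp.out -type f
```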
Managing the errors
The default behaviour of chdb is to stop as soon as an error is encountered, as shown here: we create 10 files, artificially provoke an error on two of them, and run chdb:
rm -r inp inp.out; mkdir inp; for i in $(seq 1 10); do echo "coucou $i" > inp/$i.txt; done
chmod u-r inp/2.txt inp/3.txt
mpirun -n 2 chdb --in-dir inp --in-type txt \
       --command-line 'cat %in-dir%/%path% >%out-dir%/%basename%.out \
                       2>%out-dir%/%basename%.err'
find inp.out
chdb writes an error message on the standard error and stops working: very few output files are actually created. We know that the problem arose with the file 2.txt, so it is easy to investigate:
ERROR with external command - sts=1 input file=2.txt
ABORTING - Try again with --on-error

find inp.out
cat inp.out/2.err
Thanks to this behaviour, you’ll avoid wasting a lot of CPU time just because you had an error in some parameter. However, please note that the file 3.txt has the same error, but as chdb stopped before processing this file, you are not aware of this situation. You can modify this behaviour:
rm -r inp inp.out; mkdir inp; for i in $(seq 1 10); do echo "coucou $i" > inp/$i.txt; done
chmod u-r inp/2.txt inp/3.txt
mpirun -n 2 chdb --in-dir inp --in-type txt \
       --command-line 'cat %in-dir%/%path% >%out-dir%/%basename%.out \
                       2>%out-dir%/%basename%.err' \
       --on-error errors.txt
find inp.out
Now all files were processed, and a file called errors.txt was created: the files for which your program returned a code different from 0 (thus considered as errors) are listed in errors.txt. It is thus simple to know exactly which files were not processed, and for what reason (if the error code returned by your code is meaningful). When the problem is corrected, you’ll be able to run chdb again using only those files as input.
For more clarity we rename errors.txt to input_files.txt, but this is not required:
chmod u+r inp/2.txt inp/3.txt
mv errors.txt input_files.txt
mpirun -n 2 chdb --in-dir inp --in-type txt \
       --command-line 'cat %in-dir%/%path% >%out-dir%/%basename%.out \
                       2>%out-dir%/%basename%.err' \
       --on-error errors-1.txt \
       --out-dir inp.out-1 \
       --in-files input_files.txt
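The --on-error logic can also be sketched sequentially in plain shell (not chdb itself; here a failing grep stands in for your code returning a non-zero status, so the sketch works whatever your file permissions):

```shell
# Emulate --on-error: keep going on failures and record the failing
# input files instead of aborting at the first error.
rm -rf e_inp e_inp.out errors_demo.txt
mkdir e_inp e_inp.out
for i in $(seq 1 5); do echo "coucou $i" > e_inp/$i.txt; done
echo "oops" > e_inp/2.txt     # these two files will make the
echo "oops" > e_inp/3.txt     # command below exit with status 1
for f in e_inp/*.txt; do
    base=$(basename "$f" .txt)
    if ! grep coucou "$f" > "e_inp.out/$base.out"; then
        echo "$base.txt" >> errors_demo.txt      # non-zero status: log it
    fi
done
cat errors_demo.txt           # lists 2.txt and 3.txt
```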
Advanced
- Generating an execution report
- Improving the load balancing (1/2)
- Improving the load balancing (2/2) :
- Avoiding Input-output saturation
- Launching an mpi program
- Controlling the placement of an mpi code
Generating an execution report
You may ask for an execution report with the switch --report:
rm -r inp inp.out; mkdir inp; for i in $(seq 1 10); do echo "coucou $i" > inp/$i.txt; done
mpirun -n 3 chdb --in-dir inp --in-type txt \
       --command-line 'sleep 1; cat %in-dir%/%path% >%out-dir%/%basename%.out \
                       2>%out-dir%/%basename%.err' \
       --report report.txt
The report gives you some information about :
- The processing time of every input file
- The computation time of every slave
- Some global data
This information is useful to check and improve load balancing, as explained below.
Improving the load balancing (1/2)
Load balancing is an important feature to be aware of, and maybe to improve. You can get some information using the --report switch, as explained above.
If the work is well balanced, all the slaves finish their job at the same time, and no CPU cycles are wasted. On the contrary, if some slaves take more time to complete than others, the faster ones have to wait for the latecomers, and you can lose a lot of CPU time.
In the following example, we have 10 files to process, and we launch chdb with 4 slaves. Each treatment takes about 1 second:
rm -r inp inp.out; mkdir inp; for i in $(seq 1 10); do echo "coucou $i" > inp/$i.txt; done
mpirun -n 5 chdb --in-dir inp --in-type txt \
       --command-line 'sleep 1; cat %in-dir%/%path% >%out-dir%/%basename%.out \
                       2>%out-dir%/%basename%.err' \
       --report report.txt
cat report.txt
Here is the report file :
SLAVE  TIME(s)  STATUS  INPUT PATH
2      1.00965  0       2.txt
1      1.01282  0       1.txt
3      1.01392  0       10.txt
4      1.01529  0       3.txt
2      1.00846  0       4.txt
1      1.00863  0       5.txt
3      1.01006  0       6.txt
4      1.00955  0       7.txt
2      1.0082   0       8.txt
1      1.00727  0       9.txt
----------------------------------------------
SLAVE  N INP  CUMULATED TIME (s)
1      3      3.02872
2      3      3.02631
3      2      2.02398
4      2      2.02484
----------------------------------------------
AVERAGE TIME (s)       = 2.52596
STANDARD DEVIATION (s) = 0.579144
MIN VALUE (s)          = 2.02398
MAX VALUE (s)          = 3.02872
It is easy to see that 2 slaves work for 2 seconds each, and 2 slaves work for 3 seconds. The first two slaves have to wait for the last two. But you reserved 4 processors, so you have to "pay" for 4x3 = 12 seconds, for only 10 useful seconds: 17% of the CPU time is wasted. In this very simple example, you can easily correct the problem, using 5 slaves instead of 4:
rm -r inp.out
mpirun -n 6 chdb --in-dir inp --in-type txt \
       --command-line 'sleep 1; cat %in-dir%/%path% >%out-dir%/%basename%.out 2>%out-dir%/%basename%.err' \
       --report report.txt
cat report.txt
The report.txt file shows that this time you will have to "pay" for only 5x2=10 seconds.
SLAVE  TIME(s)  STATUS  INPUT PATH
2      1.01017  0       2.txt
3      1.01334  0       10.txt
1      1.01385  0       1.txt
5      1.01459  0       4.txt
4      1.01562  0       3.txt
2      1.00696  0       5.txt
1      1.0086   0       7.txt
3      1.01     0       6.txt
5      1.01138  0       8.txt
4      1.01139  0       9.txt
----------------------------------------------
SLAVE  N INP  CUMULATED TIME (s)
1      2      2.02245
2      2      2.01713
3      2      2.02334
4      2      2.02701
5      2      2.02597
----------------------------------------------
AVERAGE TIME (s)       = 2.02318
STANDARD DEVIATION (s) = 0.00386248
MIN VALUE (s)          = 2.01713
MAX VALUE (s)          = 2.02701
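The accounting used in these two runs can be generalized with a little shell arithmetic. This is an idealized model (N equal one-second tasks on S slaves, integer division rounds down), not something chdb computes:

```shell
# Paid slave-seconds vs useful seconds when S slaves process N equal tasks.
waste() {
    n=$1; s=$2
    wall=$(( (n + s - 1) / s ))    # wall-clock time: ceil(N/S) seconds
    paid=$(( s * wall ))           # reserved slave-seconds
    echo "$(( 100 * (paid - n) / paid ))% wasted ($paid paid, $n useful)"
}
waste 10 4     # the first run above
waste 10 5     # the second run: nothing wasted
```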
Improving the load balancing (2/2)
In the following example, we create files of different sizes, then we arrange for the toy code to run longer on bigger files:

rm -r inp inp.out
mkdir inp
X='X'; for i in $(seq 1 9); do echo "coucou $i" > inp/$i.txt; X="$X$X"; for j in $(seq 1 $i); do echo $X >> inp/$i.txt; done; done
mpirun -n 3 chdb --in-dir inp --in-type txt \
       --command-line \
       'sleep $(( $(cat %in-dir%/%path%|wc -c) / 100 + 1)); cat %in-dir%/%path% >%out-dir%/%basename%.out 2>%out-dir%/%basename%.err' \
       --report report.txt
The report shows that the situation is somewhat catastrophic in terms of load balancing :
SLAVE  TIME(s)  STATUS  INPUT PATH
1      2.01111  0       1.txt
2      2.01391  0       2.txt
1      2.01003  0       3.txt
2      2.01012  0       4.txt
1      3.01058  0       5.txt
2      5.00927  0       6.txt
1      11.0105  0       7.txt
2      22.01    0       8.txt
1      48.0117  0       9.txt
----------------------------------------------
SLAVE  N INP  CUMULATED TIME (s)
1      5      66.054
2      4      31.0433
----------------------------------------------
AVERAGE TIME (s)       = 48.5486
STANDARD DEVIATION (s) = 24.7562
MIN VALUE (s)          = 31.0433
MAX VALUE (s)          = 66.054
Files 8 and 9 are processed at the end, because by default the files are distributed to the slaves in alphabetical order. Unfortunately, the last file needs a rather long time to be processed, leading to bad load balancing. A more clever scenario consists of starting with the long jobs: file 9 is processed first (by slave 1), leaving the other files to the other slave, hence a much better load balancing:
rm -r inp.out
mpirun -n 3 chdb --in-dir inp --in-type txt \
       --command-line \
       'sleep $(( $(cat %in-dir%/%path%|wc -c) / 100 )); \
        cat %in-dir%/%path% >%out-dir%/%basename%.out 2>%out-dir%/%basename%.err' \
       --report report.txt \
       --sort-by-size
The load balancing is dramatically improved :
SLAVE  TIME(s)  STATUS  INPUT PATH
2      22.0131  0       8.txt
2      11.01    0       7.txt
2      5.00993  0       6.txt
2      3.0096   0       5.txt
2      2.00942  0       4.txt
2      2.00944  0       3.txt
2      2.01015  0       2.txt
1      48.0111  0       9.txt
2      2.0096   0       1.txt
----------------------------------------------
SLAVE  N INP  CUMULATED TIME (s)
1      1      48.0111
2      8      49.0813
----------------------------------------------
AVERAGE TIME (s)       = 48.5462
STANDARD DEVIATION (s) = 0.756772
MIN VALUE (s)          = 48.0111
MAX VALUE (s)          = 49.0813
NOTE - Many codes, but not all, take longer to process bigger files than smaller ones. If this is a feature of your code, you can try the --sort-by-size switch; but if it is not, this will be useless.
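You can preview which files a size-based ordering would hand out first with ls -S (sort by size, largest first); chdb's internal scheduling is its own, but the idea is the same:

```shell
# Largest-first ordering, the idea behind --sort-by-size.
rm -rf s_inp
mkdir s_inp
printf 'x'        > s_inp/small.txt
printf 'xxxxxxxx' > s_inp/big.txt
printf 'xxxx'     > s_inp/mid.txt
ls -S s_inp       # big.txt first, small.txt last
```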
Avoiding Input-output saturation
While having different run times for the many code instances you launch may cause a performance issue (see above), launching a code which always lasts the same time may also be a problem: real codes frequently need to read or write huge data files, generally at the beginning and at the end of the job (and sometimes during the job, between iterations). If you launch 10 such codes simultaneously with chdb, the file system may rapidly be completely saturated, because all the jobs will try to read or write data synchronously. chdb lets you desynchronize the jobs, as shown below:
rm -r inp inp.out; mkdir inp; for i in $(seq 1 10); do echo "coucou $i" > inp/$i.txt; done
mpirun -n 6 chdb --in-dir inp --in-type txt \
       --command-line 'cat %in-dir%/%path% >%out-dir%/%basename%.out' \
       --sleep 2
The switch --sleep n causes each slave to wait n * rank seconds before the first run, where rank is the MPI rank (the slave number). With this trick, the slaves should be desynchronized: if every slave reads or writes files at the same time relative to the start of the run, the shift in start time ensures that the I/O operations are not synchronized.
Warning! If you work with, say, 100 slaves, slave 99 will have to wait 2x99 = 198 seconds before starting, which can be a problem unless the total execution time is much longer than that. We are still working on improving this point.
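The staggering itself is easy to picture with a plain-shell sketch (three simulated slaves, the equivalent of --sleep 1; no MPI involved):

```shell
# Each simulated slave waits rank * 1 second before its first
# (I/O-heavy) step, so the steps do not start at the same instant.
rm -f stagger.log
for rank in 1 2 3; do
    ( sleep "$rank"; echo "slave $rank starts I/O at t=${rank}s" >> stagger.log ) &
done
wait
cat stagger.log
```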
Launching an mpi program
The code you want to launch through chdb does not have to be sequential: it can be an MPI-based program. In this case you must tell chdb about it. To go on with the tutorial, please download this toy program. This code:
- Prints the chdb slave number and the mpi rank number
- Is linked to the numa library, to print the cores that may be used by each MPI process (if you do not understand the C code of hello-mpi.c this is not a problem; it is not needed to go on with this tutorial).
We rename the downloaded file and compile the program :
mv hello-mpi.c.txt hello-mpi.c
mpicc -o hello-mpi hello-mpi.c -lnuma
Now, let’s run our MPI code through chdb: for this exercise we use srun, in order to leave the frontal node and work on a complete node. We have 18 input files, and each slave needs 2 MPI processes to run. It makes sense to reserve a complete node and start 19 processes (1 master plus 18 slaves):
rm -r inp inp.out; mkdir inp; for i in $(seq 1 18); do echo "coucou $i" > inp/$i.txt; done
srun -N 1 -n 19 chdb --in-dir inp --in-type txt \
     --command-line './hello-mpi %name%' \
     --mpi-slaves 2
The switch --mpi-slaves tells chdb that the external code is an MPI program, and how many processes to launch. Here is the output:
Controlling the placement of an mpi code
If you look at the output above, you’ll see that while the MPI processes belonging to the same slave do not overlap (they use different core sets), there is a huge overlap between different chdb slaves. This can be avoided by slightly modifying the chdb command:
srun -N 1 -n 19 chdb --in-dir inp --in-type txt \
     --command-line './hello-mpi %name%' \
     --mpi-slaves 18:2:1
The important switch is --mpi-slaves 18:2:1
, telling chdb that :
- We are running 18 slaves per node
- Each slave is an mpi code using 2 mpi processes
- Each mpi process uses 1 thread
Here is the output; this time there is no overlap between slaves:
Miscellaneous
Working with directories as input
Many scientific codes prefer to work inside a current directory, reading and writing files in this directory, sometimes hardcoding file names.
This is possible with chdb: you just have to use dir as the file type (consequently, dir is a reserved file type and you should not use it for ordinary files). Of course, your code may or may not be an MPI code.
Here is a sample :
rm -r inpd
for d in $(seq 1 10); do mkdir -p inpd/$d.dir; echo "coucou $d" > inpd/$d.dir/input; done
mpirun -n 11 chdb --in-dir inpd --in-type dir --command-line 'cat input >output'
The behaviour is somewhat different now:
- There is no default output directory, because we now work in the input directory, which is also the output directory.
- However, you can specify an output directory if this is convenient for you
- While in the previous chdb runs the input directory could have been read-only, this is no longer the case.
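A sequential sketch of the dir mode (the d_inpd name is arbitrary, and this is not chdb itself): for each input directory, the command is run with that directory as the current working directory, so hardcoded file names just work:

```shell
# Emulate --in-type dir: run the command inside each *.dir directory.
rm -rf d_inpd
for d in $(seq 1 3); do
    mkdir -p d_inpd/$d.dir
    echo "coucou $d" > d_inpd/$d.dir/input
done
for dir in d_inpd/*.dir; do
    ( cd "$dir" && cat input > output )   # relative names resolve in $dir
done
find d_inpd -name output | wc -l          # prints 3
```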
Executing a code a predefined number of times
chdb can be used to execute a code a predefined number of times, without specifying any input file or directory:
rm -r iter.out
mpirun -n 3 chdb --out-dir iter.out \
       --command-line 'echo %path% >%out-dir%/%path%' \
       --in-type "1 10"
If you program many iterations and generate one or more files per iteration, please store them in a bdbh container: you have to call the output directory anything.db and specify the switch --out-files:
rm -r iter.out.db
mpirun -n 3 chdb \
       --in-type "1 200" \
       --out-dir iter.out.db \
       --out-files='%out-dir%/%path%.txt' \
       --command-line 'echo slave nb $CHDB_RANK iteration nb %path% > %out-dir%/%path%.txt'
It is not even required to specify an output directory, as shown here:
rm -r iter.out
mpirun -n 3 chdb \
       --in-type "1 20 2" \
       --command-line 'echo slave nb $CHDB_RANK iteration nb %path%'
The environment variables and the file specification templates are still available in this mode of operation.
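In the last run above, the range "1 20 2" means from 1 to 20 by steps of 2; the same sequence can be produced with seq 1 2 20 (note seq puts the step in the middle). A plain-shell sketch of the ten iterations:

```shell
# Emulate --in-type "1 20 2": iterate %path% from 1 to 20 by steps of 2.
for path in $(seq 1 2 20); do
    echo "iteration nb $path"
done
seq 1 2 20 | wc -l     # prints 10: ten iterations in the range
```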
Checkpointing chdb
WARNING - The feature described here is still considered EXPERIMENTAL - Feedback is welcome
If chdb receives a SIGTERM or SIGINT signal (for instance if you scancel your running job, or type CTRL-C), the signal is intercepted, and the input files or iterations not yet processed are saved to a file called CHDB-INTERRUPTION.txt. The files being processed when the signal is received are considered "not processed", because their processing will be interrupted.
It is then easy to launch chdb again, using CHDB-INTERRUPTION.txt as input.
Let’s try it using the iteration in-type, for simplicity:
rm -rf iter.out iter1.out
mpirun -n 3 chdb --out-dir iter.out --in-type "1 100" \
       --command-line 'sleep 1;echo %path% >%out-dir%/%path%'
Execute the above commands, wait about 10 seconds, then type CTRL-C. The program is interrupted, and CHDB-INTERRUPTION.txt is created.
First have a look at CHDB-INTERRUPTION.txt and check the last line:
# Number of files not yet processed = 56
If this line is not present, CHDB-INTERRUPTION.txt is not complete and it can be difficult to resume the execution of your job.
If CHDB-INTERRUPTION.txt is complete, we can launch chdb again using the --in-files switch, and this time we wait until the end:
mpirun -n 3 chdb --out-dir iter1.out --in-type "1 100" \
       --command-line 'sleep 1;echo %path% >%out-dir%/%path%' \
       --in-files CHDB-INTERRUPTION.txt
You can now check that every file was correctly created; the following command should display 100:
find iter.out iter1.out -type f| wc -l