Read-me for workflow.jobTree, 10/07/2009
Benedict Paten (benedict AT soe DOT ucsc dot edu)

!!!!!!!!!!!!!!!!!!!!!!!!
BEFORE CONTINUING WITH THIS NOTE: THE PREFERRED WAY TO COMMUNICATE WITH JOBTREE IS VIA THE SCRIPTTREE INTERFACE, SO PLEASE REFER TO THAT README FILE (CALLED JUST README). THE XML COMMUNICATION SCHEME DESCRIBED BELOW IS LIABLE TO CHANGE TO SOMETHING LESS CLUNKY AND MORE SCALABLE, AND IS ALREADY SOMEWHAT OUT OF DATE.
!!!!!!!!!!!!!!!!!!!!!!

Most batch systems (such as LSF, Parasol, etc.) do not allow jobs to spawn other jobs in an easy way. The basic pattern that jobTree supports is as follows:

(1) You have a job running on your cluster which requires further parallelisation.

(2) You create a list of jobs to perform this parallelisation. These are the child jobs of your process; collectively we call them the 'children'.

(3) You create a 'follow on' job, to be performed after all the children have successfully completed. This job is responsible for cleaning up the input files created for the children and doing any further processing. CHILDREN SHOULD NOT CLEAN UP FILES CREATED BY PARENTS, in case of a batch system failure which requires the child to be re-run (see below).

(4) You end your current job successfully.

(5) The batch system runs the children. These jobs may in turn have children and follow on jobs.

(6) Upon completion of all the children (and the children's children and follow ons, collectively the descendants) the follow on job is run. The follow on job may create more children.

---

Overview of implementation instructions:

(a) You must install jobTree. See the instructions for installing sonTrace or, to install jobTree selectively, do a 'make all' inside the jobTree sub-directory. It should work just fine on its own, though it uses some of the sonLib Python library. You will also need to add the sonTrace/src dir to the PYTHONPATH environment variable.

(b) Set up your jobs to be compatible with jobTree.
jobTree jobs must accept a job-file argument (you get to specify how, see the --command argument below), which they fill out with instructions regarding children and a follow on job. See 'Setting up jobs' below.

(c) Call jobTree.py (which will be in the bin directory after making; that directory should be put on your path). It has two essential arguments:

(I) The --command '[STRING]' argument, which specifies the first job to run. The command string must contain the substring 'JOB_FILE', which will be substituted with a job filename. This job file is edited by the job (see 'Setting up jobs' below).

(II) The --jobTree [STRING] argument, which specifies where to create the jobTree directory from which the jobs will be managed. On first running jobTree.py this directory must not exist; it will be created by the job manager. Once created, this directory must be globally visible to all the jobs that will be run, so that some state about the jobs can be updated from the nodes.

(d) Wait until jobTree.py finishes. Run jobTreeStatus with the same --jobTree argument. If it reports that all jobs are done then you are okay; go to step (f). If it does not, go to step (e). See 'Crash only design' below to understand how jobTree is designed to recover from crashes.

(e) At some point something, or many things, went wrong. This may be because jobTree itself crashed, or because your children crashed (repeatedly). See below for details about crashes. If there are no bugs in your jobs then you can simply restart jobTree.py with the same --jobTree argument. You can omit the --command argument; jobTree.py will restart your jobs and try again. Return to step (d).

(f) You're done; you can delete the jobTree directory you created with the --jobTree argument.

Setting up jobs:

Each job, as described, must accept a job-file, which is an XML formatted data structure used to describe any children or follow on job. If you've updated the XML file, your job must then write the file back to disk before exiting.
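
The job-file hand-off described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the substitution shown is a plain string replace that may differ from what jobTree.py does internally, and the job-file is treated as an ordinary XML document.

```python
import xml.etree.ElementTree as ET

# Placeholder named in the description of the --command argument.
JOB_FILE_TOKEN = "JOB_FILE"


def fill_in_job_file(command_template, job_file_path):
    """Return the command with the JOB_FILE placeholder replaced by a path.

    Illustrative only: jobTree.py performs its own substitution.
    """
    if JOB_FILE_TOKEN not in command_template:
        raise ValueError("command must contain the substring 'JOB_FILE'")
    return command_template.replace(JOB_FILE_TOKEN, job_file_path)


def load_job_file(job_file_path):
    """Parse the XML job-file that was handed to the job."""
    return ET.parse(job_file_path)


def save_job_file(tree, job_file_path):
    """Write the (possibly updated) job-file back to disk before the job exits."""
    tree.write(job_file_path)
```

A job started with a command such as 'myJob.py --job JOB_FILE' (a hypothetical command) would call load_job_file on the path it receives, update the tree if it wants children or a follow on job, and call save_job_file before exiting.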
A job-file initially has an empty XML template structure. Its attributes are as follows.

The attribute "log_level", which is equal to one of the strings DEBUG/INFO/CRITICAL, gives the logging level passed to jobTree when it was invoked (this can be set by an argument to jobTree.py). Using this flag, though not mandatory, is a simple way to communicate global logging info to jobs. The wrapper will collect the std out/std err into a log file, so any logged errors are retrievable after a job has failed.

The attribute "local_temp_dir" gives a path to a directory on the local machine on which the job is being run and in which you can create temporary files. As this is on the local machine it is not guaranteed to be globally visible to other machines. This directory should be used for making files that last only the duration of the job (excluding any time for the children or follow on jobs). Its advantage is that, being on the local file system, it can be safely thrashed without bringing down the network file system. This directory will be cleaned up by jobTree upon finishing, so you can be sure your job leaves nothing behind if you use this directory for temp files.

The attribute "global_temp_dir" is a directory which persists for the length of the job, its children and any follow on jobs. The chain of follow on jobs will also have this same directory as their global_temp_dir, so this directory will not necessarily be empty at the start of a job or, indeed, when a job is rerun; some care must be taken to avoid collisions with other files. This directory is guaranteed to be globally visible to all machines on the cluster. Thus jobs may make files to be read by children, and altered/destroyed by follow on jobs.

The job number is simply the index of the job in the order in which jobs were issued. This job number will change if a job is rerun after failure. Having this number can be useful for deriving a unique integer id associated with the job.
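
As a hedged sketch of how a job might use these values: the code below assumes that log_level, local_temp_dir and global_temp_dir are stored as attributes on the root tag of the job-file (the template itself is not reproduced in this note, so that placement is an assumption), and read_job_settings, configure_logging and make_scratch_file are hypothetical helper names.

```python
import logging
import os
import tempfile
import xml.etree.ElementTree as ET


def read_job_settings(job_file_path):
    """Collect the logging level and temp dirs from the job-file.

    Assumes the attributes live on the root tag of the job-file.
    """
    root = ET.parse(job_file_path).getroot()
    return {
        # Falling back to INFO is this sketch's choice, not jobTree's.
        "log_level": root.get("log_level", "INFO"),
        "local_temp_dir": root.get("local_temp_dir"),
        "global_temp_dir": root.get("global_temp_dir"),
    }


def configure_logging(log_level):
    """Apply the level to the root logger.

    DEBUG, INFO and CRITICAL are all level names defined by the logging module.
    """
    logging.getLogger().setLevel(getattr(logging, log_level))


def make_scratch_file(local_temp_dir, suffix=".tmp"):
    """Create an empty scratch file inside the node-local temp dir."""
    handle, path = tempfile.mkstemp(suffix=suffix, dir=local_temp_dir)
    os.close(handle)
    return path
```

Files that children must read would go in global_temp_dir instead, since only that directory is guaranteed to be visible from other machines.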
If you wish to create a follow on job, add the 'command' attribute to the job tag.

If you wish to create children, add "child" tags, as an unordered list, to the children tag.

The optional time attribute on a child is used to determine how your jobs are distributed on the cluster. If you don't supply a value then it is assumed that time is some arbitrarily large value, and hence maximum parallelism will be used. If the time values are supplied and sufficiently small then the scheduler may issue child jobs in series, because of the latency costs associated with creating each new job. The time attribute, along with the parameter 'jobTime' to jobTree.py, allows jobTree to use your estimate of how long the child and its descendants and follow-ons will take to run to try to optimise how many jobs it runs in parallel and how many it schedules serially. Thus if you have lots of short running jobs and a few big jobs, jobTree will avoid scheduling each tiny job independently; instead it will clump them into batches of total running length 'jobTime', each of which will be executed in series on a node, while leaving your larger jobs to be scheduled on their own.

Atomicity:

jobTree is designed to be robust, so that individual jobs can fail (see below) and subsequently be restarted. It is assumed jobs can fail at any point. Thus, until jobTree knows your children have completed okay, you cannot assume that your job has been completed. To ensure that your pipeline can be restarted after a failure, ensure that every job:

(1) NEVER CLEANS UP OR ALTERS ITS OWN INPUT FILES. Instead, parents and follow on jobs may clean up the files of children or prior jobs.

(2) Can be re-run from just its input files any number of times. A job should only depend on its input files, and it should be possible to run the job as many times as desired, essentially until news of its completion is successfully transmitted to the jobTree master process.
These two properties are the key to job atomicity. Additionally, you'll find it much easier if a job:

(3) Only creates temp files in the two provided temporary file directories. This ensures we don't soil the cluster's disks.

(4) Logs sensibly, so that error messages can be transmitted back to the master and the pipeline can be successfully debugged.

Environment:

jobTree replicates the environment in which jobTree.py is called and provides this environment to all the jobs. This ensures uniformity of the environment variables for every job.

Probable FAQs:

Why do we use this pattern?

Ideally when issuing children the parent job could just go to sleep on the cluster. But unless it frees the machine it's sleeping on, the cluster soon jams up with sleeping jobs. This design is a pragmatic way of writing simple parallel code. It isn't heavy duty, it isn't map-reduce, but it has its niche.

What do you mean by 'crash only' software?

This is just a fancy way of saying that jobTree checkpoints its state on disk, so that it or the job manager can be wiped out and restarted. The test code shows how this works: it will keep crashing everything, at random points, but eventually everything will complete okay. As a consumer you needn't worry about any of this, but your child jobs must be atomic (as with all batch systems), and must follow the convention regarding input files.

How scalable?

Probably not very, but it could be. You should be safe to have 1000 concurrent jobs running, depending on your file-system and batch system.

Can you support my XYZ batch system?

See the abstract base class 'AbstractBatchSystem' in the code to see what's required. You'll probably need to speak to me as I haven't attempted to comprehensively document these functions, though it's pretty straightforward.

Is there an API for the jobTree top level commands?

Not really, but the jobTree.py functions could be called from Python. See the main() function to see how it works.
Some arguments are parsed from the command line by the Python optparse module, so you will need to provide these yourself.
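
To make the 'Setting up jobs' conventions concrete, here is a minimal sketch of a job recording children and a follow on job in its job-file. It assumes that the root tag is the job tag and holds a children tag, that each child carries its command in a 'command' attribute, and that child commands use the same JOB_FILE placeholder as --command. Beyond the tag and attribute names mentioned in this note, that layout is an assumption, and add_children_and_follow_on is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def add_children_and_follow_on(job_file_path, child_commands, follow_on_command):
    """Record children and a follow on job in the job-file, then save it.

    child_commands is a list of (command, time_estimate) pairs; a
    time_estimate of None leaves the optional time attribute unset.
    """
    tree = ET.parse(job_file_path)
    job = tree.getroot()  # assumed to be the job tag

    children = job.find("children")
    if children is None:
        # Assumed layout: create the children tag if the template lacks one.
        children = ET.SubElement(job, "children")

    for command, time_estimate in child_commands:
        child = ET.SubElement(children, "child")
        child.set("command", command)  # attribute name is an assumption
        if time_estimate is not None:
            child.set("time", str(time_estimate))

    # The follow on job is the 'command' attribute on the job tag.
    job.set("command", follow_on_command)

    # The job must write the file back to disk before exiting.
    tree.write(job_file_path)
```

Leaving the time estimate off a child means it is treated as arbitrarily long and scheduled on its own, while small estimates let the short children be batched up to 'jobTime'.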