Read-me for workflow.jobTree, 10/07/2009 Benedict Paten (benedict AT soe DOT ucsc dot edu)

!!!!!!!!!!!!!!!!!!!!!!!!
BEFORE CONTINUING WITH THIS NOTE: THE PREFERRED WAY TO COMMUNICATE WITH JOBTREE IS
VIA THE SCRIPTTREE INTERFACE, SO PLEASE REFER TO THAT README FILE (CALLED JUST README).
THE XML COMMUNICATION SCHEME DESCRIBED BELOW
IS LIABLE TO CHANGE TO SOMETHING LESS CLUNKY AND MORE SCALABLE, AND IS ALREADY SOMEWHAT
OUT OF DATE.
!!!!!!!!!!!!!!!!!!!!!!

Most batch systems (such as LSF, Parasol, etc.) do not allow jobs to spawn
other jobs in an easy way. jobTree provides a simple pattern for doing so.

The basic pattern is as follows: 

	(1) You have a job running on your cluster which requires further parallelisation. 
	
	(2) You create a list of jobs to perform this parallelisation. 
	These are the child jobs of your process; we call them collectively the 'children'.
	
	(3) You create a 'follow on' job, to be performed after all the children 
	have successfully completed. This job is responsible for cleaning up
	the input files created for the children and doing any further processing. 
	CHILDREN SHOULD NOT CLEAN UP FILES CREATED BY PARENTS, in case a batch system 
	failure requires the child to be re-run (see below).
	
	(4) You end your current job successfully.
	
	(5) The batch system runs the children. These jobs may in turn have children and follow on jobs.
	
	(6) Upon completion of all the children (and children's children and follow ons, collectively descendants) 
	the follow on job is run. The follow on job may create more children.
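
	As a rough illustration of the ordering in steps (1) to (6), the following Python sketch
	models jobs in memory and records the order in which they would run. The Job class and
	the run function are hypothetical and are not part of jobTree; on a real cluster the
	children would run in parallel rather than one after another.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Job:
    """A toy stand-in for a cluster job (hypothetical; not a jobTree class)."""
    name: str
    children: List["Job"] = field(default_factory=list)
    follow_on: Optional["Job"] = None


def run(job: Job, order: List[str]) -> List[str]:
    """Record the order in which jobs run under the pattern described above."""
    order.append(job.name)             # steps (1)-(4): the job itself runs and ends
    for child in job.children:         # step (5): children run (serially here, for simplicity)
        run(child, order)              # each child may have children and a follow on of its own
    if job.follow_on is not None:      # step (6): the follow on runs after all descendants
        run(job.follow_on, order)
    return order


root = Job(
    "parent",
    children=[Job("child1"), Job("child2", follow_on=Job("child2_follow_on"))],
    follow_on=Job("cleanup"),
)
execution_order = run(root, [])
print(execution_order)
# → ['parent', 'child1', 'child2', 'child2_follow_on', 'cleanup']
```

	Note that the follow on of the parent ("cleanup") only runs once every descendant,
	including the follow on of child2, has finished.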

---

Overview implementation instructions:

	(a) You must install jobTree (see the instructions for installing sonTrace, or, to
	selectively install jobTree, do a 'make all' inside the jobTree sub-directory).
	It should work just fine on its own, though it uses some of the sonLib Python library.
	You will also need to add the sonTrace/src dir to the PYTHONPATH environment variable.
	
	(b) Set up your jobs to be compatible with jobTree. Job tree jobs must accept
	a job-file argument (you get to specify how, see the --command argument below)
	which they fill out with instructions regarding children and a follow on job.
	See 'Setting up jobs' below.
	
	(c) You call jobTree.py (which will be in the bin directory after making, and which
	should be put on your path). It has two essential arguments:
	
		(I) The --command '[STRING]' argument, which specifies the first job to run. 
		The command string must contain the substring 'JOB_FILE', which will be substituted with
		a job filename. This job file is edited by the job (see 'Setting up jobs' below).
		
		(II) The --jobTree [STRING] argument, which specifies where to create the jobTree
		directory from which the jobs will be managed. On first running jobTree.py
		this directory must not exist, but will be created by the job manager. 
		Once created, this directory must be globally visible to all the jobs that will be run, 
		so that some state about the jobs can be updated from the nodes.
		
	(d) Wait until jobTree.py finishes. Run jobTreeStatus with the same --jobTree
	argument. If it reports that all jobs are done then you are okay; go to step (f). 
	If it does not, go to step (e). See "What do you mean 'crash only' software?" below to
	understand how jobTree is designed to cope with crashes. 
	
	(e) At some point something, or many things, went wrong. This may be because 
	jobTree crashed, or because your children crashed (repeatedly). 
	See below for details about crashes. If there are no bugs in your jobs then you can simply 
	restart jobTree.py with the same --jobTree argument. You can omit
	the --command argument; jobTree.py will restart your jobs and try again. Return
	to step (d).
	
	(f) You're done. You can delete the jobTree directory you created with the
	--jobTree argument.
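
	The invocations described above could be assembled from Python roughly as follows. This is
	only a sketch: it builds the argument lists without running anything, and the directory
	/shared/myJobTree, the script my_first_job.py and its --job-file option are made-up
	examples. Only the --command and --jobTree arguments, the JOB_FILE substring and the
	jobTreeStatus program come from this document.

```python
import shlex
from typing import List, Optional


def jobtree_argv(job_tree_dir: str, command: Optional[str] = None) -> List[str]:
    """Build the argument list for jobTree.py.

    Leaving out `command` corresponds to restarting an existing jobTree directory.
    """
    argv = ["jobTree.py", "--jobTree", job_tree_dir]
    if command is not None:
        if "JOB_FILE" not in command:
            raise ValueError("the command string must contain the substring 'JOB_FILE'")
        argv += ["--command", command]
    return argv


def status_argv(job_tree_dir: str) -> List[str]:
    """Build the argument list for checking on a run with jobTreeStatus."""
    return ["jobTreeStatus", "--jobTree", job_tree_dir]


# Hypothetical first job and shared directory, for illustration only.
first_run = jobtree_argv("/shared/myJobTree", "my_first_job.py --job-file JOB_FILE")
restart = jobtree_argv("/shared/myJobTree")
status = status_argv("/shared/myJobTree")

# Show the equivalent shell command line for the first run.
print(" ".join(shlex.quote(arg) for arg in first_run))
```

	Each list could then be handed to subprocess.call, first for the initial run, then for
	the status check, and then, if needed, for the restart.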


Setting up jobs:

	Each job, as described, must accept a job-file, which is an XML formatted 
	data structure used to describe any children or follow on job. If you've 
	updated the XML file, your job must then write the file back to disk before exiting.
	
	A job-file initially has the following empty XML template structure:
	
	<job 
		log_level="DEBUG/INFO/CRITICAL" 
		local_temp_dir="path to local temporary directory"
		global_temp_dir="path to global temporary directory"
		job_number="number of job">
		<children>
		</children>
	</job>	
	
	The attribute "log_level", which is equal to one of the strings DEBUG, INFO or CRITICAL,
	gives the logging level given to jobTree when it was invoked (this can be set by an
	argument to jobTree.py). Using this flag (though not mandatory) allows a simple
	way to communicate global logging info to jobs. The wrapper will collect the std out/
	std err into a log file, so any logged errors are retrievable after a job has
	failed.
	
	The attribute "local_temp_dir" gives a path to a directory on the local
	machine on which the job is being run and in which you can create temporary files. 
	As this is on the local machine it is not guaranteed to be globally visible
	to other machines. This directory should be used for making files
	that last only the duration of the job (excluding any time for the children
	or follow on jobs). Its advantage is that, being on the local file system,
	it can be safely thrashed without bringing down the network file system.
	This directory will be cleaned up by jobTree when the job finishes, so you
	can guarantee your job leaves nothing behind if you use this directory for temp files.
	
	The attribute "global_temp_dir" is a directory which persists for the 
	length of the job, its children and any follow on jobs. The chain of follow on jobs
	will also have this same directory as their global_temp_dir, so this directory 
	will not necessarily be empty at the start of a job (or when the job
	is rerun), and some care must be taken to avoid collisions with other files. 
	This directory is guaranteed to be globally visible to all machines on the
	cluster. Thus jobs may make files to be read by children, and altered/destroyed
	by follow on jobs.
	
	The attribute "job_number" is simply the index of the job, in the order in which jobs are issued.
	This job number will change if a job is rerun after failure.
	Having this number can be useful for deriving a unique integer id associated with the job.
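
	A job might read these attributes along the following lines. This is a minimal sketch using
	the standard xml.etree.ElementTree and logging modules; the helper names (read_job_file,
	configure_logging, scratch_path) are made up for illustration, and the demonstration writes
	its own small job-file into a scratch directory rather than using one produced by jobTree.

```python
import logging
import os
import tempfile
import xml.etree.ElementTree as ET


def read_job_file(job_file_path):
    """Parse a job-file and return (tree, root <job> element)."""
    tree = ET.parse(job_file_path)
    return tree, tree.getroot()


def configure_logging(job):
    """Map the log_level attribute (DEBUG, INFO or CRITICAL) onto the logging module."""
    level = getattr(logging, job.attrib.get("log_level", "INFO"), logging.INFO)
    logging.getLogger().setLevel(level)
    return level


def scratch_path(job, name):
    """Path for a file that only needs to last for the duration of this job."""
    return os.path.join(job.attrib["local_temp_dir"], name)


# Demonstration: a hand-written job-file standing in for one created by jobTree.
demo_dir = tempfile.mkdtemp()
demo_job_file = os.path.join(demo_dir, "job.xml")
with open(demo_job_file, "w") as handle:
    handle.write(
        '<job log_level="DEBUG" local_temp_dir="%s" global_temp_dir="%s" '
        'job_number="3"><children></children></job>' % (demo_dir, demo_dir)
    )

tree, job = read_job_file(demo_job_file)
level = configure_logging(job)
job_number = int(job.attrib["job_number"])          # may differ if the job is rerun
global_temp_dir = job.attrib["global_temp_dir"]     # visible to children and follow ons
logging.debug("job %d scratch file: %s", job_number, scratch_path(job, "work.tmp"))
```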
	
	If you wish to create a follow on job you add the 'command' attribute to the 
	job tag, thus:
	
	<job command="follow on command string" ... >
		<children>
		</children>
	</job>
	
	If you wish to create children then you add "child" tags, in an unordered list,
	to the children tag, thus:
	
	<job command="follow on command string" ... >
		<children>
			<child command="child command string 1" time="estimated time for job and its children and follow ons to complete in seconds"/>
			<child command="child command string 2" time="estimated time for job and its children and follow ons to complete in seconds"/>
			...
			<child command="child command string N" time="estimated time for job and its children and follow ons to complete in seconds"/>
		</children>
	</job>
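
	Filling out a job-file in this way might look like the sketch below, again using
	xml.etree.ElementTree. The helper functions and the command strings are hypothetical;
	the example also assumes that child and follow on commands carry the JOB_FILE substring
	in the same way as the --command argument, which this document does not state explicitly.

```python
import xml.etree.ElementTree as ET


def add_follow_on(job, command):
    """Request a follow on job by setting the 'command' attribute of the job tag."""
    job.attrib["command"] = command


def add_child(job, command, time=None):
    """Append a child tag; 'time' is the optional runtime estimate in seconds."""
    children = job.find("children")
    if children is None:
        children = ET.SubElement(job, "children")
    child = ET.SubElement(children, "child", {"command": command})
    if time is not None:
        child.attrib["time"] = str(time)
    return child


# An in-memory stand-in for the empty template a job receives.
job = ET.fromstring('<job log_level="INFO" job_number="0"><children></children></job>')

add_child(job, "align.py --chunk 1 --job-file JOB_FILE", time=30)   # hypothetical command
add_child(job, "align.py --chunk 2 --job-file JOB_FILE")            # no estimate supplied
add_follow_on(job, "merge.py --job-file JOB_FILE")                  # hypothetical command

serialized = ET.tostring(job, encoding="unicode")
print(serialized)
```

	In a real job the updated tree would be written back to the job-file path before the
	job exits, for example with ET.ElementTree(job).write(job_file_path).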
	
	The optional time attribute is used to determine how your jobs are distributed on the cluster. If you don't supply a value then
	it is assumed that time is some arbitrarily large value, and hence maximum parallelism will be used. If the time values
	are supplied and sufficiently small then the scheduler may issue child jobs in series, because of the latency costs associated with
	creating each new job.
	
	The time attribute, along with the parameter 'jobTime' to jobTree.py, allows job tree to use your estimate of how long the child and its descendants 
	and follow ons will take to run to try and optimise how many jobs it runs in parallel, and how many it schedules serially. 
	Thus if you have lots of short running jobs and a few big jobs, job tree will avoid scheduling each tiny job independently. 
	Instead it will clump them into batches of total running length 'jobTime', each of which will be executed in series on a node, while leaving your larger jobs to be scheduled on their own.
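
	To make the idea of clumping concrete, here is a simple greedy batching sketch. It is
	not jobTree's actual scheduling code, which this document does not describe in detail;
	the function batch_children and its behaviour are assumptions chosen only to illustrate
	how short jobs can be grouped up to a 'jobTime' budget while large or unestimated jobs
	stay on their own.

```python
from typing import List, Optional, Tuple


def batch_children(children: List[Tuple[str, Optional[float]]],
                   job_time: float) -> List[List[str]]:
    """Group (command, estimated seconds) pairs into batches to run in series.

    A child with no estimate is treated as arbitrarily long and gets its own batch.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_time = 0.0
    for command, time in children:
        if time is None or time >= job_time:
            batches.append([command])          # big or unknown: schedule on its own
            continue
        if current and current_time + time > job_time:
            batches.append(current)            # budget used up: start a new batch
            current, current_time = [], 0.0
        current.append(command)
        current_time += time
    if current:
        batches.append(current)
    return batches


example = [("a", 10), ("b", 10), ("big", None), ("c", 10), ("d", 5)]
batches = batch_children(example, job_time=25)
print(batches)
# → [['big'], ['a', 'b'], ['c', 'd']]
```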
	
Atomicity:
	
	Job tree is designed to be robust, so that individual jobs can fail (see below) and be 
	subsequently restarted. It is assumed jobs can fail at any point. Thus until 
	jobTree knows your children have been completed okay you cannot assume that 
	your job has been completed. To ensure that your pipeline can be restarted after a failure, 
	ensure that every job:
	
	(1) NEVER CLEANS UP OR ALTERS ITS OWN INPUT FILES. 
	Instead, parents and follow on jobs may clean up the files of children or prior jobs.
	
	(2) Can be re-run from just its input files any number of times. A job should only depend on its
	input files, and it should be possible to run the job as many times as desired, essentially
	until news of its completion is successfully transmitted to the job tree master process.
	
	These two properties are the key to job atomicity. Additionally, you'll find it much easier if a job:
	
	(3) Only creates temp files in the two provided temporary file directories. This ensures we don't
	soil the cluster's disks.
	
	(4) Logs sensibly, so that error messages can be transmitted back to the master and the pipeline can be successfully
	debugged.
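
	One way to satisfy properties (1) and (2) is sketched below: the job reads its input
	without touching it and publishes its output with a single rename, so it can be re-run
	any number of times. The function count_lines_job and the file names are made up for
	illustration, and a scratch directory stands in for global_temp_dir.

```python
import os
import tempfile


def count_lines_job(input_path, output_path):
    """A re-runnable job body: read the input, never alter it, publish the output atomically."""
    with open(input_path) as handle:
        line_count = sum(1 for _ in handle)
    # Write beside the final output so that the rename stays on one file system.
    fd, partial_path = tempfile.mkstemp(dir=os.path.dirname(output_path))
    with os.fdopen(fd, "w") as out:
        out.write("%d\n" % line_count)
    # A crash before this line leaves no half-written output file behind.
    os.replace(partial_path, output_path)
    return line_count


# Demonstration in a scratch directory standing in for global_temp_dir.
global_temp_dir = tempfile.mkdtemp()
input_path = os.path.join(global_temp_dir, "input.txt")
output_path = os.path.join(global_temp_dir, "count.txt")
with open(input_path, "w") as handle:
    handle.write("a\nb\nc\n")

first = count_lines_job(input_path, output_path)
second = count_lines_job(input_path, output_path)   # simulate a re-run of the same job
```

	A crash part way through may leave a stray partial file in the global temporary directory;
	under the conventions above, removing it is the responsibility of a follow on job.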
	
Environment:

	Job tree replicates the environment in which jobTree.py is called and provides this environment to all
	the jobs. This ensures uniformity of the environment variables for every job.
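
	The general idea can be illustrated with the standard library, as in the sketch below. This
	is not how jobTree itself is implemented (that is not described here); it only shows a
	snapshot of the calling environment being handed to a child process. The variable name
	JOBTREE_DEMO_VAR is invented for the demonstration.

```python
import os
import subprocess
import sys


def snapshot_environment():
    """Copy the environment of the calling process."""
    return dict(os.environ)


def run_with_environment(argv, environment):
    """Run a command with exactly the given environment and return its standard output."""
    result = subprocess.run(argv, env=environment, stdout=subprocess.PIPE,
                            universal_newlines=True, check=True)
    return result.stdout


os.environ["JOBTREE_DEMO_VAR"] = "hello"        # invented variable, for illustration
environment = snapshot_environment()
os.environ["JOBTREE_DEMO_VAR"] = "changed"      # later changes do not affect the snapshot

child_code = "import os; print(os.environ['JOBTREE_DEMO_VAR'])"
output = run_with_environment([sys.executable, "-c", child_code], environment)
```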

Probable FAQs:

	Why do we use this pattern?
	
		Ideally when issuing children the parent job could just go to sleep on the cluster.
		But unless it frees the machine it's sleeping on, the cluster soon jams up
		with sleeping jobs. This design is a pragmatic way of designing simple parallel code.
		It isn't heavy duty, it isn't map-reduce, but it has its niche.

	What do you mean 'crash only' software?
	
		This is just a fancy way of saying that job-tree checkpoints its state on  
		disk, so that it or the job manager can be wiped out and restarted. 
		The test code shows how this works: it will keep crashing everything, at random
		points, but eventually everything will complete okay.
		As a consumer you needn't worry about any of this, but your child jobs must 
		be atomic (as with all batch systems), and must follow the convention regarding
		input files.
	
	How scalable is it?
	
		Probably not very, but it could be. You should be safe to have 1000 concurrent
		jobs running, depending on your file-system and batch system.
		
	Can you support my XYZ batch system?
	
		See the abstract base class 'AbstractBatchSystem' in the code to see what's required.
		You'll probably need to speak to me as I haven't attempted to comprehensively document these
		functions, though it's pretty straightforward.
		
	Is there an API for the jobTree top level commands?
	
		Not really, but jobTree.py functions could be called from Python. See
		the main() function to see how it works. There are some arguments 
		that are munged from the command line by the Python optparse module
		that you need to provide, etc.