Files
mercury/compiler
Peter Wang 712027f307 This patch changes the parallel execution mechanism in the low level backend.
Estimated hours taken: 100
Branches: main

This patch changes the parallel execution mechanism in the low level backend.
The main idea is that, even in programs with only moderate parallelism, we
won't have enough processors to exploit it all.  We should try to reduce the
cost in the common case, i.e. when a parallel conjunction gets executed
sequentially.  This patch does two things along those lines:

(1) Instead of unconditionally executing all parallel conjuncts (but the last)
in separate Mercury contexts, we allow a context to continue execution of the
next conjunct of a parallel conjunction if it has just finished executing the
previous conjunct.  This saves on allocating unnecessary contexts, which can
be a big reduction in memory usage.

We also try to execute conjuncts left-to-right so as to minimise the
need to suspend contexts when there are dependencies between conjuncts.

(2) Conjuncts that *are* executed in parallel still need separate contexts.
We used to pass variable bindings to those conjuncts by flushing input
variable values to stack slots and copying the procedure's stack frame to the
new context.  When the conjunct finished, we would copy new variable bindings
back to stack slots in the original context.

What happens now is that we don't do any copying back and forth.  We introduce
a new abstract machine register `parent_sp' which points to the location of
the stack pointer at the time that a parallel conjunction began.  In parallel
conjuncts we refer to all stack slots via the `parent_sp' pointer, since we
could be running on a different context altogether and `sp' would be pointing
into a new detstack.  Since parallel conjuncts now share the procedure's stack
frame, we have to allocate stack slots such that all parallel conjuncts in a
procedure that could be executing simultaneously have distinct sets of stack
slots.  We currently use the simplest possible strategy, i.e. don't allow
variables in parallel conjuncts to reuse stack slots.

Note: in effect parent_sp is a frame pointer which is only set for and used by
the code of parallel conjuncts.  We don't call it a frame pointer as it can be
confused with "frame variables" which have to do with the nondet stack.


compiler/code_info.m:
	Add functionality to keep track of how deep inside of nested parallel
	conjunctions the code generator is.

	Add functionality to acquire and release "persistent" temporary stack
	slots.  Unlike normal temporary stack slots, these don't get implicitly
	released when the code generator's location-dependent state is reset.

	Conform to additions of `parent_sp' and parent stack variables.

compiler/exprn_aux.m:
	Generalise the `substitute_lval_in_*' predicates by
	`transform_lval_in_*' predicates.  Instead of performing a fixed
	substitution, these take a higher order predicate which performs some
	operation on each lval.  Redefine the substitution predicates in terms
	of the transformation predicates.

	Conform to changes in `fork', `join_and_terminate' and
	`join_and_continue' instructions.

	Conform to additions of `parent_sp' and parent stack variables.

	Remove `substitute_rval_in_args' and `substitute_rval_in_arg' which
	were unused.

compiler/live_vars.m:
	Introduce a new type `parallel_stackvars' which is threaded through
	`build_live_sets_in_goal'.  We accumulate the sets of variables which
	are assigned stack slots in each parallel conjunct.  At the end of
	processing a parallel conjunction, use this information to force
	variables which are assigned stack slots to have distinct slots.

compiler/llds.m:
	Change the semantics of the `fork' instruction.  It now takes a single
	argument: the label of the next conjunct after the current one.  The
	instruction now "sparks" the next conjunct to be run, either in a
	different context (possibly in parallel, on another Mercury engine) or
	is queued to be executed in the current context after the current
	conjunct is finished.

	Change the semantics of the `join_and_continue' instruction.  This
	instruction now serves to end all parallel conjuncts, not just the
	last one in a parallel conjunction.

	Remove the `join_and_terminate' instruction (no longer used).

	Add the new abstract machine register `parent_sp'.

	Introduce "parent stack slots", which are similar to normal stack
	slots but relative to the `parent_sp' register.

compiler/par_conj_gen.m:
	Change the code generated for parallel conjunctions.  That is:

	- use the new `fork' instruction at the beginning of a parallel
	  conjunct;

	- use the `join_and_continue' instruction at the end of all parallel
	  conjuncts;

	- keep track of how deep the code generator currently is in parallel
	  conjunctions;

	- set and restore the `parent_sp' register when entering a non-nested
	  parallel conjunction;

	- after generating the code of a parallel conjunct, replace all
	  references to stack slots by parent stack slots;

	- remove code to copy back output variables when a parallel conjunct
	  finishes.

	Update some comments.

runtime/mercury_context.c:
runtime/mercury_context.h:
	Add the type `MR_Spark'.  Sparks are allocated on the heap and contain
	enough information to begin execution of a single parallel conjunct.

	Add globals `MR_spark_queue_head' and `MR_spark_queue_tail'.  These
	are pointers to the start and end of a global queue of sparks.  Idle
	engines can pick up work from this queue in the same way that they can
	pick up work from the global context queue (the "run queue").

	Add new fields to the MR_Context structure.  `MR_ctxt_parent_sp' is a
	saved copy of the `parent_sp' register for when the context is
	suspended.  `MR_ctxt_spark_stack' is a stack of sparks that we decided
	not to put on the global spark queue.

	Update `MR_load_context' and `MR_save_context' to save and restore
	`MR_ctxt_parent_sp'.

	Add the counters `MR_num_idle_engines' and
	`MR_num_outstanding_contexts_and_sparks'.  These are used to decide,
	when a `fork' instruction is reached, whether a spark should be put on
	the global spark queue (with potential for parallelism but also more
	overhead) or on the calling context's spark stack (no parallelism and
	less overhead).

	Rename `MR_init_context' to `MR_init_context_maybe_generator'.  When
	initialising contexts, don't reset redzones of already allocated
	stacks.  It seems to be unnecessary (and the reset implementation is
	buggy anyway, though it's fine on Linux).

	Rename `MR_schedule' to `MR_schedule_context'.  Add new functions
	`MR_schedule_spark_globally' and `MR_schedule_spark_locally'.

	In `MR_do_runnext', add code for idle engines to get work from the
	global spark queue.  Resuming contexts are prioritised over sparks.

	Rename `MR_fork_new_context' to `MR_fork_new_child'.  Change the
	definitions of `MR_fork_new_child' and `MR_join_and_continue' as per
	the new behaviour of the `fork' and `join_and_continue' instructions.
	Delete `MR_join_and_terminate'.

	Add a new field `MR_st_orig_context' to the MR_SyncTerm structure to
	record which context originated the parallel conjunction instance
	represented by a MR_SyncTerm instance, and update `MR_init_sync_term'.
	This is needed by the new behaviour of `MR_join_and_continue'.

	Update some comments.

runtime/mercury_engine.h:
runtime/mercury_regs.c:
runtime/mercury_regs.h:
runtime/mercury_stacks.h:
	Add the abstract machine register `parent_sp' and code to copy it to
	and from the fake_reg array.

	Add a macro `MR_parent_sv' to access stack slots via `parent_sp'.

	Add `MR_eng_parent_sp' to the MercuryEngine structure.

runtime/mercury_wrapper.c:
runtime/mercury_wrapper.h:
	Add Mercury runtime option `--max-contexts-per-thread' which is saved
	in the global variable `MR_max_contexts_per_thread'.  The number
	`MR_max_outstanding_contexts' is derived from this.  It sets a soft
	limit on the number of sparks we put in the global spark queue,
	relative to the number of threads we are running.  We don't want to
	put too many sparks on the global queue if there are plenty of ready
	contexts or sparks already on the global queues, as they are likely to
	result in new contexts being allocated.

	When initially creating worker engines, wait until all the worker
	engines have acknowledged that they are idle before continuing.  This
	is mainly so programs (especially benchmarks and test cases) with only
	a few fork instructions near the beginning of the program don't
	execute the forks before any worker engines are ready, resulting in no
	parallelism.

runtime/mercury_engine.c:
runtime/mercury_thread.c:
	Don't allocate a context at the time a Mercury engine is created.  An
	engine only needs a new context when it is about to pick up a spark.

configure.in:
compiler/options.m:
scripts/Mercury.config.in:
	Update to reflect the extra field in MR_SyncTerm.

	Add the option `--sync-term-size' and actually make use the result of
	the sync term size calculated during configuration.

compiler/code_util.m:
compiler/continuation_info.m:
compiler/dupelim.m:
compiler/dupproc.m:
compiler/global_data.m:
compiler/hlds_llds.m:
compiler/jumpopt.m:
compiler/livemap.m:
compiler/llds_out.m:
compiler/middle_rec.m:
compiler/opt_debug.m:
compiler/opt_util.m:
compiler/reassign.m:
compiler/stack_layout.m:
compiler/use_local_vars.m:
compiler/var_locn.m:
	Conform to changes in `fork', `join_and_terminate' and
	`join_and_continue' instructions.

	Conform to additions of `parent_sp' and parent stack variables.

	XXX not sure about the changes in stack_layout.m

library/par_builtin.m:
	Conform to changes in the runtime system.
2006-09-26 03:53:23 +00:00
..
2006-09-20 09:33:08 +00:00