19 Commits

Author SHA1 Message Date
Mark Brown
d465fa53cb Update the COPYING.LIB file and references to it.
Discussion of these changes can be found on the Mercury developers
mailing list archives from June 2018.

COPYING.LIB:
    Add a special linking exception to the LGPL.

*:
    Update references to COPYING.LIB.

    Clean up some minor errors that have accumulated in copyright
    messages.
2018-06-09 17:43:12 +10:00
Zoltan Somogyi
53b573692a Convert C code to use // style comments.
runtime/*.[ch]:
trace/*.[chyl]:
    As above. In some places, improve comments, e.g. by expanding contractions
    such as "we've". Add #ifndef guards against double inclusion around
    the trace/*.h files that did not already have them.

tools/*:
    Make the corresponding changes in shell scripts that generate .[ch] files
    in the runtime.

tests/*:
    Conform to a slight change in the text of a message.
2016-07-14 13:57:35 +02:00
Zoltan Somogyi
67326f16e4 Fix style issues in the runtime.
Move all .h and .c files to four-space indentation without tabs,
if they weren't there already.

Use the same vim line for all .h and .c files.

Align all backslashes at the ends of lines in macro definitions.
Align close comment signs.

In some places, fix inconsistent indentation.

Fix a bunch of comments. Add XXXs to a few of them.
2016-07-09 12:14:00 +02:00
Julien Fischer
787f8b2c6d Fix spelling and grammar in runtime comments.
runtime/*.[ch]:
    As above.
2015-09-03 15:43:35 +10:00
Julien Fischer
f522a3dbaf Define atomic ops in high-level .par grades.
runtime/mercury_atomic_ops.c:
	Fix a mismatch between the macros protecting the declarations and
	the definitions of the atomic ops.
2014-08-13 14:36:41 +10:00
Paul Bone
a9f82d004b On some systems the CPU's time stamp counter (TSC) cannot reliably be
used.  Mercury's ThreadScope support will now use gettimeofday() by
default, but use of the TSC may be enabled.

Note that in Linux, gettimeofday() does not always make a system call.
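
As a rough illustration of the gettimeofday() approach described above, the
sketch below reads the wall clock and converts it to microseconds.  The
helper name example_now_usecs is hypothetical and is not part of the Mercury
runtime; the real code in runtime/mercury_threadscope.c may differ.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

// Hypothetical helper (not a Mercury runtime function): return the
// wall-clock time in microseconds since the epoch, or 0 on failure.
static uint64_t example_now_usecs(void)
{
    struct timeval tv;

    if (gettimeofday(&tv, NULL) != 0) {
        return 0;
    }
    return (uint64_t) tv.tv_sec * UINT64_C(1000000) + (uint64_t) tv.tv_usec;
}
```

Unlike the TSC, this clock can be adjusted by the system, so two successive
readings are not guaranteed to be increasing.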

runtime/mercury_threadscope.[ch]:
    Add support for measuring time with gettimeofday().

    Use gettimeofday() to measure time by default.

runtime/mercury_atomic_ops.[ch]
    Add a new function MR_tsc_is_sensible().  It returns true if the TSC can
    (as far as the RTS can detect) be used.

    Fix trailing whitespace.

runtime/mercury_wrapper.c:
    Add a new runtime option --threadscope-use-tsc.
    When specified this option allows threadscope to use the CPU's TSC to
    measure time.

doc/userguide.texi:
    Document the --threadscope-use-tsc option.  This documentation is
    commented out.
2012-06-20 13:13:34 +00:00
Julien Fischer
8af00f7a2a Avoid using the __GNUC__ macro in the runtime as a test for the presence of
Branches: main, 11.07

Avoid using the __GNUC__ macro in the runtime as a test for the presence of
gcc, since clang also defines that macro.  Since clang doesn't support all
of the GNU C extensions, we can't actually use __GNUC__ without also checking
whether we are actually using clang.

runtime/mercury_conf_param.h:
	Add three new macros, MR_CLANG, MR_GNUC and MR_MSVC that are defined
	only when the C compiler is clang, gcc, or Visual C respectively.
	(In particular, MR_GNUC will _not_ be defined when the C compiler
	is clang.)
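
A minimal sketch of this kind of compiler detection follows.  The EX_*
macro names and example_compiler_name are hypothetical; the actual
definitions of MR_CLANG, MR_GNUC and MR_MSVC are in
runtime/mercury_conf_param.h and may be written differently.

```c
// Hypothetical macros for illustration only.  The order matters:
// clang also defines __GNUC__, so __clang__ must be tested first.
#if defined(__clang__)
  #define EX_CLANG 1
#elif defined(__GNUC__)
  #define EX_GNUC 1
#elif defined(_MSC_VER)
  #define EX_MSVC 1
#endif

// Report which of the hypothetical macros above was selected.
static const char *example_compiler_name(void)
{
#if defined(EX_CLANG)
    return "clang";
#elif defined(EX_GNUC)
    return "gcc";
#elif defined(EX_MSVC)
    return "msvc";
#else
    return "unknown";
#endif
}
```

With this ordering EX_GNUC is never defined under clang, which is the
property the change above asks of MR_GNUC.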

runtime/mercury.c:
runtime/mercury.h:
runtime/mercury_atomic_ops.c:
runtime/mercury_atomic_ops.h
runtime/mercury_bitmap.h:
runtime/mercury_float.h:
runtime/mercury_getopt.c:
runtime/mercury_goto.h:
runtime/mercury_heap.h:
runtime/mercury_std.h:
	Replace uses of the __GNUC__ and __clang__ macros with the above.

runtime/mercury_regs.h:
	As above, also #include mercury_conf_param.h directly since
	this file is #included by some of the tests in the configure
	script.
2011-08-01 07:06:21 +00:00
Peter Wang
7e26b55e74 Implement a new form of memory profiling, which tells the user what memory
Branches: main

Implement a new form of memory profiling, which tells the user what memory
is being retained during a program run.  This is done by allocating an extra
word before each cell, which is used to "attribute" the cell to an
allocation site.  The attribution, or "allocation id", is an address to an
MR_AllocSiteInfo structure generated by the Mercury compiler, giving the
procedure, filename and line number of the allocation, and the type
constructor and arity of the cell that it allocates.
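
The extra-word technique can be sketched as below.  This is an illustration
only: it uses malloc rather than the Boehm collector, and ExampleAllocSite
and the example_* functions are hypothetical stand-ins whose fields are
assumptions, not the real MR_AllocSiteInfo layout.

```c
#include <stddef.h>
#include <stdlib.h>

// Hypothetical stand-in for MR_AllocSiteInfo.
typedef struct {
    const char *proc_name;
    const char *file_name;
    int         line_number;
} ExampleAllocSite;

// Allocate num_words words plus one hidden word in front of the cell.
// The hidden word records the allocation site.  Returns a pointer to
// the cell proper, or NULL if allocation fails.
static void **example_alloc_attributed(size_t num_words,
    const ExampleAllocSite *site)
{
    void **raw = malloc((num_words + 1) * sizeof(void *));

    if (raw == NULL) {
        return NULL;
    }
    raw[0] = (void *) site;
    return raw + 1;
}

// Recover the allocation site from the word before the cell.
static const ExampleAllocSite *example_site_of(void **cell)
{
    return (const ExampleAllocSite *) cell[-1];
}

// Free a cell, taking the one-word offset into account.
static void example_free_attributed(void **cell)
{
    if (cell != NULL) {
        free(cell - 1);
    }
}
```

Because the pointer handed to the program is one word past the start of the
block, anything that frees or resizes the cell has to undo that offset first.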

The user must manually instrument the program with calls to
`benchmarking.report_memory_attribution', which forces a GC and summarises
the live objects on the heap using the attributions.  The mprof tool is
extended with a new mode to parse and present that data.

Objects which are unattributed (e.g. by hand-written C code which hasn't
been updated) are still accounted for, but show up in profiles as "unknown".

Currently this profiling mode only works in conjunction with the Boehm
garbage collector, though in principle it can work with any memory allocator
for which we can access a list of the live objects.  Since term size
profiling relies on the same technique of using an extra word per memory
cell, the two profiling modes are incompatible.

The output from `mprof -s' looks like this:

------ [1] some label ------
   cells            words         cumul  procedure / type (location)
   14150            38872                total

*   1949/ 13.8%      4872/ 12.5%  12.5%  <predicate `parser.parse_rest/7' mode 0>
     975/  6.9%      1950/  5.0%         list.list/1 (parser.m:502)
     487/  3.4%      1948/  5.0%         term.term/1 (parser.m:501)
     487/  3.4%       974/  2.5%         term.const/0 (parser.m:501)

*   1424/ 10.1%      4272/ 11.0%  23.5%  <predicate `parser.parse_simple_term_2/6' mode 0>
     708/  5.0%      2832/  7.3%         term.term/1 (parser.m:643)
     708/  5.0%      1416/  3.6%         term.const/0 (parser.m:643)
...


boehm_gc/alloc.c:
boehm_gc/include/gc.h:
boehm_gc/misc.c:
boehm_gc/reclaim.c:
	Add a callback function to be called for every live object after a GC.

	Add a function to write out the GC_size_map array.

compiler/layout.m:
	Define the alloc_site_info type which is equivalent to the
	MR_AllocSiteInfo C structure.

	Add alloc_site_array as a kind of "layout" array.

compiler/llds.m:
	Add allocation sites to `cfile' structure.

	Replace TypeMsg argument (which was also for profiling) on `incr_hp'
	instructions by an allocation site identifier.

	Add a new foreign_proc_component for allocation site ids.

compiler/code_info.m:
compiler/global_data.m:
compiler/proc_gen.m:
	Keep the set of allocation sites in the code_info and global_data
	structures.

compiler/unify_gen.m:
	Add allocation sites to LLDS allocation instructions.

compiler/layout_out.m:
compiler/llds_out_file.m:
compiler/llds_out_instr.m:
	Output MR_AllocSiteInfo arrays in generated C files.

	Output code to register the MR_AllocSiteInfo array with the Mercury
	runtime.

	Output allocation site ids for memory allocation instructions.

compiler/llds_out_util.m:
	Add allocation sites to llds_out_info.

compiler/pragma_c_gen.m:
compiler/ml_foreign_proc_gen.m:
	Generate a macro MR_ALLOC_ID which resolves to an allocation site
	structure, for every foreign_proc whose C code contains the string
	"MR_ALLOC_ID".  This is to be used by hand-written C code which
	allocates memory.

	MR_PROC_LABELs are retained for backwards compatibility.  Though
	they were introduced for profiling, they seem to have been co-opted
	for printf-debugging since then.

compiler/ml_global_data.m:
	Add allocation site structures to the MLDS global data.

compiler/mlds.m:
compiler/ml_unify_gen.m:
	Add allocation site id to `new_object' instruction.

compiler/mlds_to_c.m:
	Output allocation site arrays and allocation ids in high-level C code.

	Output a call to register the allocation site array with the Mercury
	runtime.

	Delete an unused predicate.

compiler/exprn_aux.m:
compiler/jumpopt.m:
compiler/livemap.m:
compiler/mercury_compile_llds_back_end.m:
compiler/middle_rec.m:
compiler/ml_accurate_gc.m:
compiler/ml_elim_nested.m:
compiler/ml_optimize.m:
compiler/ml_util.m:
compiler/mlds_to_cs.m:
compiler/mlds_to_gcc.m:
compiler/mlds_to_il.m:
compiler/mlds_to_java.m:
compiler/mlds_to_managed.m:
compiler/opt_debug.m:
compiler/opt_util.m:
compiler/use_local_vars.m:
compiler/var_locn.m:
	Conform to changes.

compiler/pickle.m:
compiler/prog_event.m:
compiler/timestamp.m:
	Conform to changes in memory allocation macros.

library/benchmarking.m:
	Add the `report_memory_attribution' instrumentation predicates.

	Conform to changes to MR_memprof_record.

library/array.m:
library/bit_buffer.m:
library/bitmap.m:
library/construct.m:
library/deconstruct.m:
library/dir.m:
library/io.m:
library/mutvar.m:
library/store.m:
library/string.m:
library/thread.semaphore.m:
library/version_array.m:
	Use attributed memory allocation throughout the standard library so
	that objects don't show up in the memory profile as "unknown".

	Replace MR_PROC_LABEL by MR_ALLOC_ID.

mdbcomp/program_representation.m:
mdbcomp/rtti_access.m:
	Replace MR_PROC_LABEL by MR_ALLOC_ID.

profiler/Mercury.options:
profiler/globals.m:
profiler/mercury_profile.m:
profiler/options.m:
profiler/output.m:
profiler/snapshots.m:
	Add a new mode to `mprof' to parse and present the data from
	`Prof.Snapshots' files.

	Add options for the new profiling mode.

profiler/process_file.m:
	Fix a typo.

runtime/mercury_conf_param.h:
	#define MR_MPROF_PROFILE_MEMORY_ATTRIBUTION if memory profiling
	is enabled and we are using Boehm GC.

runtime/mercury.h:
	Make MR_new_object take an allocation id argument.

	Conform to changes in memory allocation macros.

runtime/mercury_memory.c:
runtime/mercury_memory.h:
runtime/mercury_types.h:
	Define MR_AllocSiteInfo.

	Add memory allocation functions and macros which take into account
	the additional word necessary for the new profiling mode.
	These should be used in preference to the raw memory allocation
	functions wherever possible so that objects do not show up in the
	profile as "unknown".

	Add analogues of realloc/free which take into account the offset
	introduced by the attribution word.

	Add function versions of the MR_new_object macros, which can't be
	written in standard C.  They are only used when necessary.

	Add built-in allocation site ids, to be used in the runtime and
	other hand-written code when context-specific ids are unavailable.

runtime/mercury_heap.h:
	Make MR_tag_offset_incr_hp_msg and MR_tag_offset_incr_hp_atomic_msg
	allocate an extra word when memory attribution is desired, and store
	the allocation id there.

	Similarly for MR_create{1,2,3}_msg.

	Replace proclabel arguments in allocation macros by alloc_id
	arguments.

	Replace MR_hp_alloc_atomic by MR_hp_alloc_atomic_msg.  It was only
	used for boxing floats.

	Conform to change to MR_new_object macro.

runtime/mercury_bootstrap.h:
	Delete obsolete macro hp_alloc_atomic.

runtime/mercury_heap_profile.c:
runtime/mercury_heap_profile.h:
	Add the code to summarise the live objects on the Boehm GC heap and
	write out the data to `Prof.Snapshots', for display by mprof.

	Don't store the procedure name in MR_memprof_record: the procedure
	address is enough and faster to compare.

runtime/mercury_prof.c:
	Finish and close the `Prof.Snapshots' file when the program
	terminates.

	Conform to changes in MR_memprof_record.

runtime/mercury_misc.h:
	Add a macro to expand to the name of the allocation sites array
	in LLDS grades.

runtime/mercury_bitmap.c:
runtime/mercury_bitmap.h:
	Pass allocation id through bitmap allocation functions.

	Delete unused function MR_string_to_bitmap.

runtime/mercury_string.h:
	Add MR_make_aligned_string_copy_msg.

	Make string allocation macros take allocation id arguments.

runtime/mercury.c:
runtime/mercury_array_macros.h:
runtime/mercury_context.c:
runtime/mercury_deconstruct.c:
runtime/mercury_deconstruct_macros.h:
runtime/mercury_dlist.c:
runtime/mercury_engine.c:
runtime/mercury_float.h:
runtime/mercury_hash_table.c:
runtime/mercury_ho_call.c:
runtime/mercury_label.c:
runtime/mercury_prof_mem.c:
runtime/mercury_stacks.c:
runtime/mercury_stm.c:
runtime/mercury_string.c:
runtime/mercury_thread.c:
runtime/mercury_trace_base.c:
runtime/mercury_trail.c:
runtime/mercury_type_desc.c:
runtime/mercury_type_info.c:
runtime/mercury_wsdeque.c:
	Use attributed memory allocation throughout the runtime so that
	objects don't show up in the profile as "unknown".

runtime/mercury_memory_zones.c:
	Attribute memory zones to the Mercury runtime.

runtime/mercury_tabling.c:
runtime/mercury_tabling.h:
	Use attributed memory allocation macros for tabling structures.

	Delete unused MR_table_realloc_* and MR_table_copy_bytes macros.

runtime/mercury_deep_copy_body.h:
	Try to retain the original attribution word when copying values.

runtime/mercury_ml_expand_body.h:
	Conform to changes in memory allocation macros.

runtime/mercury_tags.h:
	Replace proclabel arguments by alloc_id arguments in allocation macros.

runtime/mercury_wrapper.c:
	If memory attribution is enabled, tell Boehm GC that pointers may be
	displaced by an extra word.

trace/mercury_trace.c:
trace/mercury_trace_tables.c:
	Conform to changes in memory allocation macros.

extras/net/tcp.m:
extras/solver_types/library/any_array.m:
extras/trailed_update/tr_array.m:
	Conform to changes in memory allocation macros.

doc/user_guide.texi:
	Document the new profiling mode.

doc/reference_manual.texi:
	Update a commented out example.
2011-05-20 04:16:58 +00:00
Paul Bone
edc230406e Fix a number of errors and warnings in the runtime picked up by GCC 4.x in
parallel and threadscope grades.

We had been using types with the wrong signedness while calling atomic operations.
GCC 4.x also picked up an error where #elif was used instead of #else.

While testing these changes on a 32bit system more bugs were found on the i386
architecture and on AMD brand processors.

runtime/mercury_atomic_ops.h:
runtime/mercury_atomic_ops.c:
    Add unsigned variants of the following atomic operations:
        increment,
        add,
        add_and_fetch,
        dec_and_is_zero,

    Add a signed variant for compare and swap.

    Rename the MR_atomic_dec_<type>_and_is_zero operation to move the type to
    the end of the name.

    Use volatile storage in the MR_Stats structure.

    A 32bit machine cannot do atomic operations on 64bit values and MR_Stats
    must use 64bit values.  Therefore 64bit values in the MR_Stats structure
    are now protected by a lock on 32bit machines.
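
    A sketch of the lock-protected 64bit counter idea, assuming POSIX
    threads.  ExampleStat64 and the example_* functions are hypothetical
    names and do not reproduce the real MR_Stats code.

```c
#include <pthread.h>
#include <stdint.h>

// Hypothetical 64-bit statistic guarded by a mutex, for machines
// that cannot update a 64-bit value atomically.
typedef struct {
    pthread_mutex_t lock;
    uint64_t        sum;
} ExampleStat64;

static void example_stat_init(ExampleStat64 *stat)
{
    pthread_mutex_init(&stat->lock, NULL);
    stat->sum = 0;
}

// Add value to the statistic while holding the lock.
static void example_stat_add(ExampleStat64 *stat, uint64_t value)
{
    pthread_mutex_lock(&stat->lock);
    stat->sum += value;
    pthread_mutex_unlock(&stat->lock);
}

// Read the statistic while holding the lock, so that a reader never
// sees a half-updated 64-bit value.
static uint64_t example_stat_read(ExampleStat64 *stat)
{
    uint64_t result;

    pthread_mutex_lock(&stat->lock);
    result = stat->sum;
    pthread_mutex_unlock(&stat->lock);
    return result;
}
```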

runtime/mercury_atomic_ops.h:
    Fix a typo in the i386 version of MR_atomic_dec_and_is_zero_uint().

runtime/mercury_atomic_ops.c:
    AMD CPUs do not conform to Intel's specification for being able to
    extract the CPU clock speed from the brand string.  When we cannot
    determine the CPU's clock speed then we write out threadscope
    timestamps in raw clock cycles rather than nanoseconds.
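
    The fallback described above can be sketched as follows.  The function
    example_ticks_to_ns is hypothetical; it assumes that a clock speed of 0
    means "unknown", which is an assumption of this sketch and not something
    this log states about MR_cpu_cycles_per_sec.

```c
#include <stdint.h>

// Hypothetical helper.  If cycles_per_sec is non-zero, store the time
// in nanoseconds in *out and return 1.  Otherwise store the raw tick
// count unchanged and return 0.
static int example_ticks_to_ns(uint64_t ticks, uint64_t cycles_per_sec,
    uint64_t *out)
{
    uint64_t secs;
    uint64_t rem;

    if (cycles_per_sec == 0) {
        *out = ticks;
        return 0;
    }
    // Convert whole seconds and the remainder separately so that
    // ticks * 1000000000 is never computed directly, which could
    // overflow for large tick counts.
    secs = ticks / cycles_per_sec;
    rem = ticks % cycles_per_sec;
    *out = secs * UINT64_C(1000000000)
        + (rem * UINT64_C(1000000000)) / cycles_per_sec;
    return 1;
}
```

    For example, 3,000,000,000 ticks at 2,000,000,000 cycles per second is
    1.5 seconds, that is 1,500,000,000 nanoseconds.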

    On i386 machines the ebx register is used to implement PIC code,
    however the CPUID instruction uses it to output information.  Save
    this register on C's stack while we issue CPUID and retrieve the
    result in ebx.

    We now pass native machine sized values to the inline assembler code
    that implements RDTSC and RDTSCP.

    Fix commenting style in some places.

runtime/mercury_atomic_ops.c:
    Fix some incorrect C preprocessor code for conditional compilation.

runtime/mercury_grade.h:
    Increment binary compatibility number.  This should have been done in a
    prior change when the MR_runnext macro changed which broke binary
    compatibility in the parallel low-level C grades.

runtime/mercury_context.h:
    In MR_SyncTerm_Struct use an unsigned value for the number of conjuncts
    remaining before the conjunction is complete.

runtime/mercury_threadscope.c:
    Record raw cpu clock ticks rather than milliseconds when we don't
    know the processor's clock speed.

runtime/mercury_context.c:
runtime/mercury_wsdeque.h:
runtime/mercury_wsdeque.c:
    Conform to changes in mercury_atomic_ops.h
2010-03-20 10:15:51 +00:00
Paul Bone
6b2bc6a66a When an engine steals a spark and executes it using the context it is
currently holding it did not allocate a new context ID.  A user looking at
this behaviour from threadscope would see thread 27 (for instance) finish, and
then immediately begin executing again.  Therefore we now allocate a new
context ID when a context is reused making the context look different from
threadscope's point of view.  New context IDs are already allocated to
contexts that are allocated from the free context lists.

runtime/mercury_context.c:
    As above.

    The next context id variable is now accessed atomically rather than being
    protected by the free context list lock.

runtime/mercury_atomic_ops.h:
runtime/mercury_atomic_ops.c:
    Implement a new atomic operation, MR_atomic_add_and_fetch_int, this is
    used to allocate context ids.

    Reimplement MR_atomic_add_int in terms of MR_atomic_add_and_fetch_int when
    handwritten assembler support is not available.
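
    The relationship between the two operations can be sketched with C11
    atomics.  This is not the runtime's implementation, which uses
    hand-written assembler or compiler builtins; all example_* names are
    hypothetical.

```c
#include <stdatomic.h>

// Hypothetical counter standing in for the next context id variable.
static atomic_long example_next_context_id = 0;

// Atomically add addend to *addr and return the new value.
// atomic_fetch_add returns the old value, so addend is added again.
static long example_add_and_fetch(atomic_long *addr, long addend)
{
    return atomic_fetch_add(addr, addend) + addend;
}

// A plain atomic add is add-and-fetch with the result discarded.
static void example_atomic_add(atomic_long *addr, long addend)
{
    (void) example_add_and_fetch(addr, addend);
}

// Allocate a fresh id without taking any lock.
static long example_allocate_context_id(void)
{
    return example_add_and_fetch(&example_next_context_id, 1);
}
```

    Each caller receives a distinct id even when several threads allocate
    ids concurrently, which is why no lock is needed around the counter.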

runtime/mercury_atomic_ops.c:
    Re-order atomic operations to match the order in the header file.

runtime/mercury_atomic_ops.h:
    Place the definition of the MR_ATOMIC_PAUSE macro before the other atomic
    operations since MR_atomic_add_and_fetch_int depends on it.  This also
    conforms with the coding standard.

runtime/mercury_threadscope.h:
    Make the Context ID type a MR_Integer to match the argument size on the
    available atomic operations.
2010-02-17 02:37:45 +00:00
Paul Bone
83a6f14708 Create a threadscope grade component.
Threadscope grades are enabled by using the grade component 'threadscope'.
They are supported only with low-level C parallel grades.  Support for
threadscope in high level C grades is intended in the future but does not work
now.

runtime/mercury_conf_param.h:
    Create the MR_THREADSCOPE macro that is defined if the grade is a
    threadscope grade.

    Define MR_PROFILE_FOR_PARALLEL_EXECUTION if MR_THREADSCOPE is defined.

    Emit an error if MR_LL_PARALLEL_CONJ is defined before it is implied by
    MR_THREADSAFE and ! MR_HIGHLEVEL_CODE

runtime/mercury_grade.h
    Update the grade symbol for the threadscope grade component.

runtime/mercury_atomic_ops.c:
runtime/mercury_atomic_ops.h:
runtime/mercury_context.c:
runtime/mercury_context.h:
runtime/mercury_engine.c:
runtime/mercury_engine.h:
runtime/mercury_thread.c:
runtime/mercury_threadscope.c:
runtime/mercury_threadscope.h:
runtime/mercury_wrapper.c:
    Now that MR_PROFILE_FOR_IMPLICIT_PARALLELISM is implied by MR_THREADSAFE we
    don't need to test for MR_THREADSAFE when we test for
    MR_PROFILE_FOR_IMPLICIT_PARALLELISM.  The same is true for
    MR_LL_PARALLEL_CONJ which is implied by MR_THREADSAFE &&
    !MR_HIGHLEVEL_CODE.

    Replace some occurrences of MR_PROFILE_FOR_IMPLICIT_PARALLELISM with
    MR_THREADSCOPE where the conditionally compiled code is used to support
    threadscope profiling.

scripts/init_grade_options.sh-subr:
scripts/canonical_grade.sh-subr:
scripts/parse_grade_options.sh-subr:
scripts/final_grade_options.sh-subr:
scripts/mgnuc.in:
compiler/handle_options.m:
compiler/options.m:
compiler/compile_target_code.m:
configure.in:
    Add support for the new grade component.

    Pass -DMR_THREADSCOPE to the C compiler when using a threadscope grade.

    Add assertions to ensure that the 'threadscope' grade component is used
    only with the 'par' grade component.

doc/user_guide.texi:
    Added commented-out documentation for the threadscope grade component.

    Adjusted documentation of the --profile-parallel-execution runtime option
    to describe the correct prerequisite compile time options.

    Added my name to the authors list.

runtime/mercury_context.c:
    Corrected grammar and prose in comments in the MR_do_join_and_continue code.
2010-01-10 04:53:40 +00:00
Paul Bone
5cfd73644a Implement work stealing.
This patch is heavily based on earlier, uncommitted work by Peter Wang.  It
has been updated so that it applies against the current version of the source.
A number of other changes have been made.  Peter's original ChangeLog
follows:

	Implement work stealing for parallel conjunctions.  This builds on an
	older patch which introduced work-stealing deques to the runtime but
	didn't perform work stealing.

	Previously when we came across a parallel conjunct, we would place a spark
	into either the _global spark queue_ or the _local spark stack_ of the
	Mercury context.  A spark on the global spark queue may be picked up for
	parallel execution by an idle Mercury engine, whereas a spark on a local
	spark stack is confined to execution in the context that originated it.

	The problem is that we have to decide, ahead of time, where to put a
	spark.  Ideally, we should have just enough sparks in the global queue to
	keep the available Mercury engines busy, and leave the rest of the sparks
	to execute in their original contexts since that is more efficient.  But
	we can't predict the future so have to make do with guesses using simple
	heuristics.  A bad decision, once made, cannot be reversed.  An engine may
	sit idle due to an empty global spark queue, even while there are sparks
	available in some local spark stacks.

	In the work stealing scheme, sparks are always placed into each context's
	_local spark deque_.  Idle engines actively try to steal sparks from
	random spark deques.  We don't need to make irreversible and potentially
	suboptimal decisions about where to put sparks.  Making a spark available
	for parallel execution is cheap and happens by default because of the
	work-stealing deques; putting a spark on a global queue implies
	synchronisation with other threads.  The downside is that idle engines
	need to expend more time and effort to find the work from multiple places
	instead of just one place.
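
	The owner/thief access pattern can be sketched as below.  This is a
	deliberately simplified illustration: it guards the deque with a mutex
	and uses a fixed array with no wraparound, whereas production
	work-stealing deques are usually lock-free.  It does not reproduce
	runtime/mercury_wsdeque.c, and every name in it is hypothetical.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define EXAMPLE_DEQUE_CAP 64

// Hypothetical deque of sparks; a spark is just an int here.
typedef struct {
    pthread_mutex_t lock;
    int             items[EXAMPLE_DEQUE_CAP];
    size_t          top;        // index of the oldest item
    size_t          bottom;     // one past the newest item
} ExampleDeque;

static void example_deque_init(ExampleDeque *dq)
{
    pthread_mutex_init(&dq->lock, NULL);
    dq->top = 0;
    dq->bottom = 0;
}

// Owner: push a new spark at the bottom.  Fails when the array is full.
static bool example_push_bottom(ExampleDeque *dq, int spark)
{
    bool ok = false;

    pthread_mutex_lock(&dq->lock);
    if (dq->bottom < EXAMPLE_DEQUE_CAP) {
        dq->items[dq->bottom++] = spark;
        ok = true;
    }
    pthread_mutex_unlock(&dq->lock);
    return ok;
}

// Owner: take the most recently pushed spark.
static bool example_pop_bottom(ExampleDeque *dq, int *spark)
{
    bool ok = false;

    pthread_mutex_lock(&dq->lock);
    if (dq->bottom > dq->top) {
        *spark = dq->items[--dq->bottom];
        ok = true;
    }
    pthread_mutex_unlock(&dq->lock);
    return ok;
}

// Thief (an idle engine): take the oldest spark from the other end.
static bool example_steal_top(ExampleDeque *dq, int *spark)
{
    bool ok = false;

    pthread_mutex_lock(&dq->lock);
    if (dq->bottom > dq->top) {
        *spark = dq->items[dq->top++];
        ok = true;
    }
    pthread_mutex_unlock(&dq->lock);
    return ok;
}
```

	The owner works at the bottom end and thieves at the top end, so they
	only compete for the same element when the deque is nearly empty.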

	Practically, the new scheme seems to work as well as the old scheme and
	vice versa, except that the old scheme often required
	`--max-context-per-threads' to be set "correctly" to get good results.

	Only tested on x86-64, which has a relatively constrained memory model.

My modifications include:

	The difference between 'shared' and 'private' synchronisation terms has
	been removed.  All sync terms are assumed to be shared and thread-safe
	operations are used everywhere.  This allows us to remove complicated code
	used when a private synchronisation term became shared.  This may change
	the performance of thread stealing, in particular it may become slower due
	to the assumption that all sync terms are shared and therefore atomic
	operations must always be used when decrementing their count field.

	I've re-factored MR_do_join_and_continue.  It is now much simpler as the
	conditional code in it enumerates the possible cases clearly.

This change bootchecks and successfully runs the test suite in asm_fast.gc
asm_fast.gc.par hlc.gc and hlc.par, no other grades were tested.  I have not
yet tested performance.

runtime/mercury_context.c:
runtime/mercury_context.h:
	Keep pointers to all spark deques in a flat array, so we have access
    to them for stealing.

	Added functions to manage the global array of spark deques.

	Modify MR_do_run_next: it now attempts to steal work from other contexts'
	spark queues.  Threads sleeping on the condition variable in
	MR_do_run_next now use a timed wait so they can wakeup and try to steal
	sparks.

	Re-factored MR_do_join_and_continue.

	MR_num_idle_engines is used by atomic operations, it has been made an
	MR_Integer so that its size matches the expectations of the atomic
	operations we have defined.

	Modified the MR_SyncTerm and MR_Spark structures.  Sparks now point to
	their sync terms.  The parent stack pointer has been moved into the
	SyncTerm structure.  The MR_st_is_shared field in the MR_SyncTerm
	structure has been removed.

runtime/mercury_atomic_ops.c:
runtime/mercury_atomic_ops.h:
	Implement a new atomic operation: decrement integer and is zero.  On the
	x86/x86_64 one can't atomically decrement an integer and fetch the result
	in a single instruction, a loop with a 'compare and exchange' instruction
	is necessary.  However since we only want to test if the value has become
	zero after the decrement we can use the processor's flags.  This can be
	done in two instructions, but more importantly a loop is not required and
	only one instruction is atomic.
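
	For comparison, a portable version of "decrement and test for zero"
	can be written with C11 atomics, as sketched here.  This is not the
	flag-based assembler described above, and example_dec_and_is_zero is a
	hypothetical name.

```c
#include <stdatomic.h>
#include <stdbool.h>

// Atomically decrement *addr and report whether it became zero.
// atomic_fetch_sub returns the value held before the decrement, so the
// counter has just reached zero exactly when that old value was 1.
static bool example_dec_and_is_zero(atomic_long *addr)
{
    return atomic_fetch_sub(addr, 1) == 1;
}
```

	In a join counter this guarantees that exactly one of the finishing
	conjuncts sees true, even if several decrement at the same time.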

runtime/mercury_wrapper.c:
runtime/mercury_wrapper.h:
	Added runtime tunable options for work stealing.  These control the number
	of attempts an idle engine will make when looking for work, and the
	duration to sleep after failing to find any work.

runtime/mercury_thread.c:
runtime/mercury_thread.h:
	Added MR_COND_TIMED_WAIT, which waits on condition variables like
	MR_COND_WAIT except that it may time out.

runtime/mercury_wsdeque.h:
runtime/mercury_wsdeque.c:
	MR_wsdeque_pop_bottom now uses its second argument to return the code
	address to jump to rather than the whole spark.

runtime/mercury_conf.h.in:
configure.in:
	Test for sched_yield()

	Change the synchronisation term structure.

doc/user_guide.texi:
    Add commented out documentation for two new tunable parameters,
    `--worksteal-max-attempts' and `--worksteal-sleep-msecs'.
    Implementors may want to experiment with different values but end
    users shouldn't need to know about them.
2009-12-15 02:29:07 +00:00
Paul Bone
92afa23af5 Support for threadscope profiling of the parallel runtime.
This change adds support for threadscope profiling of the parallel runtime in
low level C grades.  It can be enabled by compiling _all_ code with the
MR_PROFILE_PARALLEL_EXECUTION_SUPPORT C macro defined.  The runtime, libraries
and applications must all have this flag defined as it alters the MercuryEngine
and MR_Context structures.

See Don Jones Jr, Simon Marlow, Satnam Singh - Parallel Performance Tuning for
Haskell.

This change also includes:

    Smarter thread pinning (the primordial thread is pinned to the thread that
    it is currently running on).

    The addition of callbacks from the Boehm GC to notify the runtime of
    stop the world garbage collections.

    Implement some userspace spin loops and conditions.  These are cheaper than
    their POSIX equivalents, do not support sleeping, and are signal handler
    safe.

boehm_gc/alloc.h:
boehm_gc/alloc.c:
    Declare and define the new callback functions.

boehm_gc/alloc.c:
    Call the start and stop collect callbacks when we start and stop a
    stop-the-world collection.

    Correct how we record the time spent collecting, it now includes
    collections that stop prematurely.

boehm_gc/pthread_stop_world.c:
    Call the pause and resume thread callbacks in each thread where the GC
    arranges for that thread to be stopped during a stop-the-world collection.

runtime/mercury_threadscope.c:
runtime/mercury_threadscope.h:
    New files implementing the threadscope support.

runtime/mercury_atomic_ops.c:
runtime/mercury_atomic_ops.h:
    Rename MR_configure_profiling_timers to MR_do_cpu_feature_detection.

    Add a new function MR_read_cpu_tsc() to read the TSC register from the CPU,
    this simply abstracts the static MR_rdtsc function.

runtime/mercury_atomic_ops.h:
    Modify the C inline assembler to ensure we tell the C compiler that the
    value in the register mapped to the 'old' parameter is also an output from
    the instructions.  That is, the C compiler must not depend on the value of
    'old' being the same before and after the instruction is executed.  This
    has never been a problem in practice though.

    Implement some cheap userspace mutual exclusion locks and condition
    variables.  These will be faster than pthread's mutexes when critical
    sections are short and threads are pinned to separate CPUs.
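
    A userspace spin lock of the kind described can be sketched with a C11
    atomic_flag.  The runtime's own locks are written differently (this log
    does not show them), and ExampleSpinLock and the example_* functions are
    hypothetical.

```c
#include <stdatomic.h>

// Hypothetical spin lock: the flag is set while the lock is held.
typedef struct {
    atomic_flag held;
} ExampleSpinLock;

#define EXAMPLE_SPIN_LOCK_INIT { ATOMIC_FLAG_INIT }

// Busy-wait until the flag can be set.  The thread never sleeps.
static void example_spin_lock(ExampleSpinLock *lock)
{
    while (atomic_flag_test_and_set_explicit(&lock->held,
        memory_order_acquire))
    {
        // Spin: another thread holds the lock.
    }
}

// Try once to take the lock.  Returns non-zero on success.
static int example_spin_try_lock(ExampleSpinLock *lock)
{
    return !atomic_flag_test_and_set_explicit(&lock->held,
        memory_order_acquire);
}

static void example_spin_unlock(ExampleSpinLock *lock)
{
    atomic_flag_clear_explicit(&lock->held, memory_order_release);
}
```

    Spinning only pays off when the holder is running on another CPU and
    will release the lock soon, which matches the conditions stated above.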

runtime/mercury_context.c:
runtime/mercury_context.h:
    Add a new function for pinning the primordial thread.  If the OS supports
    sched_getcpu we use it to determine which CPU the primordial thread should
    use.  No other thread will be pinned to this CPU.

    Add a numeric id field to each context, this id is uniquely assigned and
    identifies each context for threadscope.

    MR_schedule_context posts the 'context runnable' threadscope event.

    MR_do_runnext has been modified to destroy engines differently, it ensures
    they cleanup properly so that their threadscope events are flushed properly
    and then calls pthread_exit(0)

    MR_do_runnext posts events for threadscope.

    MR_do_join_and_continue posts events for threadscope.

runtime/mercury_engine.h:
    Add new fields to the MercuryEngine structure including a buffer of
    threadscope events, a clock offset (used to synchronize the TSC clocks) and
    a unique identifier for the engine.

runtime/mercury_engine.c:
    Call MR_threadscope_setup_engine() and MR_threadscope_finalize_engine for
    newly created and about-to-be-destroyed engines.

    When the main context finishes on a thread that's not the primordial thread
    post a 'context is yielding' message before re-scheduling the context on
    the primordial thread.

runtime/mercury_thread.c:
    Added an XXX comment about a potential problem, it's only relevant for
    programs using thread.spawn.

    Added calls to the TSC synchronisation code used for threadscope profiling.
    It appears that this is not necessary on modern x86 machines, it has been
    commented out.

    Post a threadscope event when we create a new context.

    Don't call pthread_exit in MR_destroy_thread, we now do this in
    MR_do_runnext so that we can unlock the runqueue mutex after cleaning up.

runtime/mercury_wrapper.c:
    Conform to changes in mercury_atomic_ops.[ch]

    Post an event immediately before calling main to mark the beginning of the
    program in the threadscope profile.

    Post a "context finished" event at the end of the program.

    Wait until all engines have exited before cleaning up global data, this is
    important for finishing writing the threadscope data file.

configure.in:
runtime/mercury_conf.h.in:
    Test for the sched_getcpu C function and utmpx.h header file, these are
    used for thread pinning.

runtime/Mmakefile:
    Include the mercury_threadscope.[hc] files in the list of runtime headers
    and sources respectively.
2009-12-03 05:28:00 +00:00
Paul Bone
3e75c9bd61 Print the CPU's clock speed in the profiling data for the parallelism runtime.
While the CPU's clock speed is not always the speed that the CPU operates at,
it is the number of TSC units per second when the CPU supports the constant TSC
feature (see /proc/cpuinfo / CPUID).  This change set is used in preparation for
upcoming work on exporting ThreadScope profiling data, see "Parallel
Performance Tuning for Haskell" - Don Jones Jr, Simon Marlow and Satnam Singh.

runtime/mercury_atomic_ops.c:
    Retrieve the CPU frequency from the CPUID instruction and store it in
    MR_cpu_cycles_per_sec

runtime/mercury_atomic_ops.h:
    Export MR_cpu_cycles_per_sec

runtime/mercury_context.c:
    Print the number of cycles per second in the parallel profiling data file
    if we were able to detect it.
2009-11-06 05:40:24 +00:00
Paul Bone
db9a526d6b Corrections in response to review comments on recent parallel runtime changes.
Made thread pinning off by default, the operating system should handle this
unless we have a good reason to.

configure.in:
    Removed newly added declaration checking code.

doc/user_guide.texi:
    Documentation corrections.
    Adjusted the --pin-threads runtime option default.

runtime/mercury_atomic_ops.c
runtime/mercury_atomic_ops.h
    Use __x86_64__ instead of __amd64__
    Altered comments at the beginning of sections of the file to better
    describe the contents of that section.
    Placed comments at the end of long conditional compilation blocks that
    match the condition at the beginning of the block.

runtime/mercury_conf.h.in:
    Added editor hint for vim at the top of the file.
    Remove newly added declarations section.

runtime/mercury_context.c:
    Adjusted default behaviour of --pin-threads
    Fixed some style issues.

runtime/mercury_context.h:
    Fixed grammatical error.

runtime/mercury_wrapper.c:
    Fixed grammatical error.
    Fixed a missing break statement in a switch statement.
2009-11-05 05:47:40 +00:00
Paul Bone
4f1bfc2ebc Parallel runtime profiling improvements.
Improve the profiling of the parallel runtime code in two main ways:
	+ Record data for more events.
	+ Record high-precision timing data on x86 machines via the TSC where
	  access to the TSC is available.

Access to the TSC is available via two machine instructions: RDTSC (read
TSC) and RDTSCP (read TSC and processor ID).  We prefer the latter as a
process migrated between two calls to RDTSC may cause an incorrect time
duration to be calculated (since TSC counts are seldom synchronized).  We
fall back to RDTSC when RDTSCP is not available and gracefully record no
timing information when neither is available.  Availability is detected via
the CPUID instruction, see MR_configure_profiling_timers().
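
The detection and fallback order described above can be sketched roughly as
follows.  This is not the code of MR_configure_profiling_timers(); the type and
function names are hypothetical, and the sketch assumes a gcc- or
clang-compatible compiler that provides <cpuid.h> and <x86intrin.h>.  On
non-x86 targets it reports that no timer is available.

```c
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#include <x86intrin.h>
#endif

// Hypothetical names, chosen for this sketch only.
typedef enum { TIMER_NONE, TIMER_RDTSC, TIMER_RDTSCP } timer_kind;

// Ask CPUID which timing instruction is usable, preferring RDTSCP.
timer_kind detect_timer(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax, ebx, ecx, edx;

    // Extended leaf 0x80000001: EDX bit 27 signals RDTSCP.
    if (__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx)
        && (edx & (1u << 27)))
    {
        return TIMER_RDTSCP;
    }
    // Standard leaf 1: EDX bit 4 signals the TSC itself.
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (edx & (1u << 4))) {
        return TIMER_RDTSC;
    }
#endif
    return TIMER_NONE;
}

// Read the counter. Returns 1 on success and 0 when no timer is available.
// With RDTSCP, *aux receives the OS-defined TSC_AUX value (on Linux this
// normally encodes the CPU number), so a caller can notice a migration
// between two readings.
int read_timer(timer_kind kind, uint64_t *ticks, unsigned *aux)
{
#if defined(__x86_64__) || defined(__i386__)
    switch (kind) {
        case TIMER_RDTSCP:
            *ticks = __rdtscp(aux);
            return 1;
        case TIMER_RDTSC:
            *ticks = __rdtsc();
            *aux = 0;
            return 1;
        default:
            break;
    }
#else
    (void) kind;
    (void) ticks;
    (void) aux;
#endif
    return 0;
}
```

A caller would run detect_timer() once at startup and, for each timed region,
discard the measurement if the aux values of the two readings differ.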

runtime/mercury_context.c:
runtime/mercury_context.h:
	Runtime profiling changes as above.

runtime/mercury_atomic_ops.c:
runtime/mercury_atomic_ops.h:
	Add runtime profiling timing code.
	Add new add and subtract atomic operations.

runtime/mercury_wrapper.c:
	Call the new MR_configure_profiling_timers() procedure to detect the CPU
	and configure access to the TSC.

Mmakefile:
runtime/Mmakefile:
	'mmake tags' at the top level now builds the tags file for the runtime
	directory.
	The tags target in the runtime directory is now marked as PHONY so it is
	generated even if it already exists.
2009-08-16 10:18:36 +00:00
Paul Bone
4d41cf6c23 Rename the runtime granularity control macros, variables and predicates.
Estimated hours taken: 3
Branches: main

Rename the runtime granularity control macros, variables and predicates.

Names of the runtime granularity control macros, variables and predicates are
now more descriptive and more consistent.

An alternative runtime granularity control predicate and macro is now
available; it considers the number of contexts and all sparks, whereas the
original predicate and macro consider only the number of contexts and sparks
on the global queue.

A new predicate has been added to determine the number of worker threads that
the Mercury runtime is configured to use.


library/par_builtin.m:
	Renamed predicates.
	Conform to changes in runtime/mercury_thread.h
	Added the new predicates.
	Removed some old foreign procedure attributes.
	Addressed an XXX comment left by Zoltan.

runtime/mercury_context.c:
runtime/mercury_context.h:
	Rename existing runtime granularity control variables and macros.
	Add new runtime granularity control variable and macro.

runtime/mercury_wrapper.c:
runtime/mercury_wrapper.h:
	Export MR_num_threads variable.
	Make this variable an MR_Unsigned.

runtime/mercury_atomic_ops.c:
runtime/mercury_atomic_ops.h:
	Introduce new atomic increment and decrement instructions.  These are used
	to count the number of local sparks created which is done outside of a
	critical section.

library/Mmakefile:
	Rebuild the par_builtin module when either runtime/mercury_context.h or
	runtime/mercury_thread.h change.

compiler/granularity.m:
	Conform to changes in runtime/mercury_context.h
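
For readers unfamiliar with the atomic increment and decrement operations
mentioned under runtime/mercury_atomic_ops.[ch] above, the sketch below shows
one way to express them with gcc's __sync builtins.  The names are
hypothetical, and the log does not say whether the runtime uses these builtins
or hand-written assembler, so treat this only as an illustration of counting
outside a critical section.

```c
#include <stdint.h>

// Hypothetical counter type and operations, not the runtime's own.
typedef uintptr_t counter_t;

// Atomically add one to *addr; safe to call from several threads
// without holding a lock.
void atomic_inc(volatile counter_t *addr)
{
    (void) __sync_add_and_fetch(addr, 1);
}

// Atomically subtract one from *addr.
void atomic_dec(volatile counter_t *addr)
{
    (void) __sync_sub_and_fetch(addr, 1);
}
```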
2009-06-17 03:26:00 +00:00
Julien Fischer
cd849f451a Avoid problems with some versions of gcc when compiling the runtime atomic_ops
Estimated hours taken: 0.2
Branches: main

Avoid problems with some versions of gcc when compiling the runtime atomic_ops
module at -O0 in non low-level .par grades.

runtime/mercury_atomic_ops.[hc]:
	Implement MR_compare_and_swap_word using the builtin gcc compare
	and swap if it is available.  (In particular prefer this to the
	handcoded assembler versions on x86 and x86_64.)

	Only include the code in these modules if MR_LL_PARALLEL_CONJ
	is defined.  This helps to avoid problems with gcc in grades
	where atomic operations are not used.
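
A minimal sketch of a word-sized compare-and-swap built on the gcc builtin,
assuming gcc 4.1 or later.  The function name and signature here are
hypothetical and are not claimed to match MR_compare_and_swap_word.

```c
#include <stdint.h>

// Hypothetical wrapper around gcc's builtin compare-and-swap.
// If *addr equals old_val, store new_val and return nonzero;
// otherwise leave *addr unchanged and return zero.
int compare_and_swap_word(volatile intptr_t *addr, intptr_t old_val,
    intptr_t new_val)
{
    return __sync_bool_compare_and_swap(addr, old_val, new_val);
}
```

Letting the compiler expand the builtin avoids maintaining separate inline
assembler for each architecture.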
2007-10-24 05:28:52 +00:00
Peter Wang
cb8459d517 Make the parallel conjunction execution mechanism more efficient.
Branches: main

Make the parallel conjunction execution mechanism more efficient.

1. Don't allocate sync terms on the heap.  Sync terms are now allocated in
the stack frame of the procedure call which originates a parallel
conjunction.

2. Don't allocate individual sparks on the heap.  Sparks are now stored in
preallocated, growing arrays using an algorithm that doesn't use locks.

3. Don't have one mutex per sync term.  Just use one mutex to protect
concurrent accesses to all sync terms (it is rarely needed anyway).  This
makes sync terms smaller and saves initialising a mutex for each parallel
conjunction encountered.

4. We don't bother to acquire the global sync term lock if we know a parallel
conjunction couldn't be executing in parallel.  In a highly parallel program,
the majority of parallel conjunctions will be executed sequentially so
protecting the sync terms from concurrent accesses is unnecessary.
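
Points 1, 3 and 4 can be pictured with the simplified sketch below.  It is not
the layout of MR_SyncTerm; every name in it is hypothetical, and it ignores the
memory-ordering details a real runtime must handle.  The sync term lives in the
caller's stack frame, carries no mutex of its own, and takes the single global
lock only when one of its sparks was scheduled globally.

```c
#include <pthread.h>

// Hypothetical sync term: small enough to sit in a stack frame.
typedef struct {
    unsigned remaining;     // conjuncts that have not finished yet
    int shared;             // nonzero once a spark was scheduled globally
} sync_term;

// One lock for all sync terms instead of one mutex per term.
pthread_mutex_t sync_term_lock = PTHREAD_MUTEX_INITIALIZER;

void sync_term_init(sync_term *st, unsigned num_conjuncts)
{
    st->remaining = num_conjuncts;
    st->shared = 0;
}

// Record that one conjunct has finished.
// Returns nonzero when it was the last one.
int sync_term_join(sync_term *st)
{
    unsigned left;

    if (st->shared) {
        // Another engine may touch this term: take the global lock.
        pthread_mutex_lock(&sync_term_lock);
        left = --st->remaining;
        pthread_mutex_unlock(&sync_term_lock);
    } else {
        // Only this context can see the term: no locking needed.
        left = --st->remaining;
    }
    return left == 0;
}
```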


par_fib(39) is ~8.4 times faster (user time) on my laptop (Linux 2.6, x86_64),
which is ~3.5 times as slow as sequential execution.


configure.in:
	Update the configuration for a changed MR_SyncTerm structure.

compiler/llds.m:
	Make the fork instruction take a second argument, which is the base
	stack slot of the sync term.

	Rename it to fork_new_child to match the macro name in the runtime.

compiler/par_conj_gen.m:
	Change the generated code for parallel conjunctions to allocate sync
	terms on the stack and to pass the sync term to fork_new_child.

compiler/dupelim.m:
compiler/dupproc.m:
compiler/exprn_aux.m:
compiler/global_data.m:
compiler/jumpopt.m:
compiler/livemap.m:
compiler/llds_out.m:
compiler/llds_to_x86_64.m:
compiler/middle_rec.m:
compiler/opt_debug.m:
compiler/opt_util.m:
compiler/reassign.m:
compiler/use_local_vars.m:
	Conform to the change in the fork instruction.

compiler/liveness.m:
compiler/proc_gen.m:
	Disable use of the parallel conjunction operator in the compiler as
	older versions of the compiler will generate code incompatible with
	the new runtime.

runtime/mercury_context.c:
runtime/mercury_context.h:
	Remove the next pointer field from MR_Spark as it's no longer needed.

	Remove the mutex from MR_SyncTerm.  Add a field to record if a spark
	belonging to the sync term was scheduled globally, i.e. if the
	parallel conjunction might be executed in parallel.

	Define MR_SparkDeque and MR_SparkArray.

	Use MR_SparkDeques to hold per-context sparks and global sparks.

	Change the abstract machine instructions MR_init_sync_term,
	MR_fork_new_child, MR_join_and_continue as per the main change log.

	Use a preprocessor macro MR_LL_PARALLEL_CONJ as a shorthand for
	!MR_HIGHLEVEL_CODE && MR_THREAD_SAFE.

	Take the opportunity to clean things up a bit.

runtime/mercury_wsdeque.c:
runtime/mercury_wsdeque.h:
	New files containing an implementation of work-stealing deques.  We
	don't do work stealing yet but we use the underlying data structure.

runtime/mercury_atomic_ops.c:
runtime/mercury_atomic_ops.h:
	New files to contain atomic operations.  Currently it just contains
	compare-and-swap for gcc/x86_64, gcc/x86 and gcc-4.1.

runtime/Mmakefile:
	Add the new files.

runtime/mercury_engine.h:
runtime/mercury_mm_own_stacks.c:
runtime/mercury_wrapper.c:
	Conform to runtime changes.

runtime/mercury_conf_param.h:
	Update an outdated comment.
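
The work-stealing deque added in runtime/mercury_wsdeque.[ch] is described
above only in outline.  The sketch below shows the general shape of such a
structure in the style of the Chase-Lev deque, using C11 atomics with the
default sequentially consistent ordering.  The log does not state which
algorithm the runtime follows, all names are hypothetical, and unlike the
growing arrays mentioned in the log this sketch uses a fixed capacity.

```c
#include <stdatomic.h>
#include <stddef.h>

#define WSD_CAP 64      // fixed capacity for the sketch; a power of two

typedef struct {
    atomic_long top;                // thieves remove items from this end
    atomic_long bottom;             // the owner pushes and pops at this end
    _Atomic(void *) slots[WSD_CAP];
} ws_deque;

void wsd_init(ws_deque *d)
{
    atomic_init(&d->top, 0);
    atomic_init(&d->bottom, 0);
}

// Owner only. Returns 0 if the deque is full.
int wsd_push_bottom(ws_deque *d, void *item)
{
    long b = atomic_load(&d->bottom);
    long t = atomic_load(&d->top);
    if (b - t >= WSD_CAP) {
        return 0;
    }
    atomic_store(&d->slots[b & (WSD_CAP - 1)], item);
    atomic_store(&d->bottom, b + 1);
    return 1;
}

// Owner only. Returns NULL if the deque is empty.
void *wsd_pop_bottom(ws_deque *d)
{
    long b = atomic_load(&d->bottom) - 1;
    atomic_store(&d->bottom, b);
    long t = atomic_load(&d->top);
    if (b < t) {
        atomic_store(&d->bottom, t);    // empty: restore bottom
        return NULL;
    }
    void *item = atomic_load(&d->slots[b & (WSD_CAP - 1)]);
    if (b > t) {
        return item;                    // more than one item: no race
    }
    // Exactly one item left: compete with thieves for it.
    long expected = t;
    if (!atomic_compare_exchange_strong(&d->top, &expected, t + 1)) {
        item = NULL;                    // a thief got there first
    }
    atomic_store(&d->bottom, t + 1);
    return item;
}

// Any thread. Returns NULL if the deque is empty or the race was lost.
void *wsd_steal_top(ws_deque *d)
{
    long t = atomic_load(&d->top);
    long b = atomic_load(&d->bottom);
    if (t >= b) {
        return NULL;
    }
    void *item = atomic_load(&d->slots[t & (WSD_CAP - 1)]);
    if (!atomic_compare_exchange_strong(&d->top, &t, t + 1)) {
        return NULL;
    }
    return item;
}
```

The owner works at the bottom without synchronising with other threads in the
common case; only removal of the last remaining item and steals from the top
go through a compare-and-swap.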
2007-10-11 12:18:03 +00:00