mirror of
https://github.com/Mercury-Language/mercury.git
synced 2026-04-15 17:33:38 +00:00
Estimated hours taken: 120
Branches: main

The algorithm that decides whether the order independent state update
transformation is applicable in a given module needs access to the list of
oisu pragmas in that module, and to information about the types of the
variables in the procedures named in those pragmas. This diff puts this
information into Deep.procrep files, to make it available to the
autoparallelization feedback program, to which that algorithm will later be
added.

Compilers that have this diff will generate Deep.procrep files in a new,
slightly different format, but the deep profiler will be able to read
Deep.procrep files not just in the new format, but in the old format as well.

runtime/mercury_stack_layout.h:
    Add to module layout structures the fields holding the new information
    we want to put into Deep.procrep files. This means three things:

    - a bytecode array in module layout structures encoding the list of
      oisu pragmas in the module;
    - additions to the bytecode arrays in procedure layout structures
      mapping the procedure's variables to their types; and
    - a bytecode array in the module layout structure containing the
      encoded versions of those types themselves. This allows us to
      represent each type used in the module just once.

    Since there is now information in module layout structures that is
    needed only for deep profiling, as well as information that is needed
    only for debugging, the old arrangement that split a module's
    information between two structures, MR_ModuleLayout (debug-specific
    info) and MR_ModuleCommonLayout (info used by both debugging and
    profiling), is no longer appropriate. We could add a third structure
    containing profiling-specific info, but it is simpler to move all the
    info into just one structure, some of whose fields may not be used.
    This wastes only a few words of memory per module, but allows the
    runtime system to avoid unnecessary indirections.

runtime/mercury_types.h:
    Remove the type synonym for the deleted type.

runtime/mercury_grade.h:
    The change to mercury_stack_layout.h destroys binary compatibility
    with previous versions of Mercury for debug and deep profiling grades,
    so bump their grade-component-specific version numbers.

runtime/mercury_deep_profiling.c:
    Write out the information in the new fields of module layout
    structures, if they are filled in. Since this changes the format of
    the Deep.procrep file, bump its version number.

runtime/mercury_deep_profiling.h:
runtime/mercury_stack_layout.c:
    Conform to the change to mercury_stack_layout.h.

mdbcomp/program_representation.m:
    Add to module representations information about the oisu pragmas
    defined in that module, and the type table of the module.

    Optionally add to procedure representations a map from the variables
    of the procedure to their types.

    Rename the old var_table type to var_name_table, since it contains
    just names. Keep the var-to-type map separate, since it will be there
    only for selected procedures.

    Modify the predicates reading in module and procedure representations
    to allow them to read in the new representation, while still accepting
    the old one. Use the version number in the Deep.procrep file to decide
    which format to expect.

mdbcomp/rtti_access.m:
    Add functions to encode the data representations that this module
    also decodes. Conform to the changes above.

mdbcomp/feedback.automatic_parallelism.m:
    Conform to the changes above.

mdbcomp/prim_data.m:
    Fix layout.

compiler/layout.m:
    Update the compiler's representation of layout structures to conform
    to the change to runtime/mercury_stack_layout.h.

compiler/layout_out.m:
    Output the new parts of module layout structures.

compiler/opt_debug.m:
    Allow the debugging of code referring to the new parts of module
    layout structures.

compiler/llds_out_file.m:
    Conform to the move to a single module layout structure.

compiler/prog_rep_tables.m:
    This new module provides mechanisms for building the string table and
    the type table components of module layouts. The string table part is
    old (it is moved here from stack_layout.m); the type table part is
    new. Putting this code in a module of its own allows us to remove a
    circular dependency between prog_rep.m and stack_layout.m; instead,
    both now just depend on prog_rep_tables.m.

compiler/ll_backend.m:
    Add the new module.

compiler/notes/compiler_design.html:
    Describe the new module.

compiler/prog_rep.m:
    When generating the representation of a module for deep profiling,
    include the information needed by the order independent state update
    analysis: the list of oisu pragmas in the module, if any, and
    information about the types of variables in selected procedures.

    To avoid having these additions increase the size of the bytecode
    representation too much, convert some fixed 32-bit numbers in the
    bytecode to variable-sized numbers, which will usually be 8 or 16
    bits. Do not use predicates from bytecode_gen.m to encode numbers,
    since there is nothing keeping these in sync with the code that reads
    them in mdbcomp/program_representation.m. Instead, use new predicates
    in program_representation.m itself.

compiler/stack_layout.m:
    Generate the new parts of module layouts. Remove the code moved to
    prog_rep_tables.m.

compiler/continuation_info.m:
compiler/proc_gen.m:
    Make some more information available to stack_layout.m.

compiler/prog_data.m:
    Fix some formatting.

compiler/introduce_parallelism.m:
    Conform to the renaming of the var_table type.

compiler/follow_code.m:
    Fix the bug that used to cause the failure of the
    hard_coded/mode_check_clauses test case in deep profiling grades.

deep_profiler/program_representation_utils.m:
    Output the new parts of module and procedure representations, to
    allow the correctness of this change to be tested.

deep_profiler/mdprof_create_feedback.m:
    If we cannot read the Deep.procrep file, print a single error message
    and exit, instead of continuing with an analysis that would generate
    a whole bunch of error messages, one for each attempt to access a
    procedure's representation.

deep_profiler/mdprof_procrep.m:
    Give this program an option that specifies what file it is to look
    at; do not hardwire in "Deep.procrep" in the current directory.

deep_profiler/report.m:
    Add a report type that just prints the representation of a module.
    It returns the same information as mdprof_procrep, but from within
    the deep profiler, which can be more convenient.

deep_profiler/create_report.m:
deep_profiler/display_report.m:
    Respectively create and display the new report type.

deep_profiler/query.m:
    Recognize a query asking for the new report type.

deep_profiler/autopar_calc_overlap.m:
deep_profiler/autopar_find_best_par.m:
deep_profiler/autopar_reports.m:
deep_profiler/autopar_search_callgraph.m:
deep_profiler/autopar_search_goals.m:
deep_profiler/autopar_types.m:
deep_profiler/branch_and_bound.m:
deep_profiler/coverage.m:
deep_profiler/display.m:
deep_profiler/html_format.m:
deep_profiler/mdprof_test.m:
deep_profiler/measurements.m:
deep_profiler/query.m:
deep_profiler/read_profile.m:
deep_profiler/recursion_patterns.m:
deep_profiler/top_procs.m:
    Conform to the changes above. Fix layout.

tests/debugger/declarative/dependency.exp2:
    Add this file as a possible expected output. It contains the new
    field added to module representations.
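The entry for compiler/prog_rep.m above says that fixed 32-bit numbers in the bytecode become variable-sized numbers, usually taking only 8 or 16 bits. As a rough illustration of how such an encoding works (this is a generic LEB128-style sketch, not necessarily the byte layout that program_representation.m actually uses):

```python
def encode_uint(n):
    # Emit 7 payload bits per byte, least significant group first; the
    # high bit of a byte is set iff more bytes follow (LEB128 style).
    out = bytearray()
    while True:
        byte = n & 0x7f
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def decode_uint(data, pos=0):
    # Inverse of encode_uint; returns the decoded value and the position
    # of the first byte after it.
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7f) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7
```

With this scheme, values below 128 cost one byte and values below 16384 cost two, which is why the common small counts in the bytecode shrink from four bytes to one or two.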
%-----------------------------------------------------------------------------%
% vim: ft=mercury ts=4 sw=4 et
%-----------------------------------------------------------------------------%
% Copyright (C) 2010-2011 The University of Melbourne.
% This file may only be copied under the terms of the GNU General
% Public License - see the file COPYING in the Mercury distribution.
%-----------------------------------------------------------------------------%
%
% File: feedback.automatic_parallelism.m.
% Main author: pbone.
%
% This module defines data structures for representing automatic parallelism
% feedback information, and some procedures for working with these structures.
%
% NOTE: after modifying any of these structures, please increment the
% feedback_version in feedback.m.
%
%-----------------------------------------------------------------------------%
%-----------------------------------------------------------------------------%

:- module mdbcomp.feedback.automatic_parallelism.
:- interface.

:- import_module mdbcomp.goal_path.
:- import_module mdbcomp.program_representation.

:- import_module assoc_list.
:- import_module bool.
:- import_module list.
:- import_module maybe.
:- import_module set.
:- import_module string.
%-----------------------------------------------------------------------------%

:- type stat_measure
    --->    stat_mean
    ;       stat_median.

:- type candidate_par_conjunctions_params
    --->    candidate_par_conjunctions_params(
                % The number of desired busy sparks.
                cpcp_desired_parallelism    :: float,

                % Follow variable use across module boundaries.
                cpcp_intermodule_var_use    :: bool,

                % The cost of creating a spark and adding it to the local
                % work queue, measured in call sequence counts.
                cpcp_sparking_cost          :: int,

                % The time taken between the creation of the spark and when
                % it starts being executed, measured in call sequence counts.
                cpcp_sparking_delay         :: int,

                % The cost of barrier synchronisation for each conjunct at
                % the end of the parallel conjunction.
                cpcp_barrier_cost           :: int,

                % The costs of maintaining a lock on a single dependent
                % variable, measured in call sequence counts. The first
                % number gives the cost of the call to signal, and the
                % second gives the cost of the call to wait, assuming that
                % the value is already available.
                cpcp_future_signal_cost     :: int,
                cpcp_future_wait_cost       :: int,

                % The time it takes for a context to resume execution once
                % it has been put on the runnable queue, assuming that an
                % engine is available to pick it up. Measured in call
                % sequence counts.
                %
                % We use this to calculate how soon a context can recover
                % after being blocked by a future. It is also used to
                % determine how quickly the context executing
                % MR_join_and_continue after completing the leftmost
                % conjunct of a parallel conjunction can recover after
                % being blocked on the completion of one of the other
                % conjuncts.
                cpcp_context_wakeup_delay   :: int,

                % The cost threshold in call sequence counts of a clique
                % before we consider it for parallel execution.
                cpcp_clique_threshold       :: int,

                % The cost threshold in call sequence counts of a call site
                % before we consider it for parallel execution.
                cpcp_call_site_threshold    :: int,

                % The speedup we require before we allow a conjunction to
                % be automatically parallelised. Should be either exactly
                % 1.0 or just above 1.0.
                cpcp_speedup_threshold      :: float,

                % Whether we will allow parallelisation to result in
                % dependent parallel conjunctions, and if so, how we
                % estimate the speedup we get for them.
                cpcp_parallelise_dep_conjs  :: parallelise_dep_conjs,

                cpcp_best_par_alg           :: best_par_algorithm
            ).

:- type parallelise_dep_conjs
    --->    do_not_parallelise_dep_conjs
    ;       parallelise_dep_conjs(speedup_estimate_alg).

:- type speedup_estimate_alg
    --->    estimate_speedup_naively
            % Be naive about dependent parallelism: pretend it is
            % independent.

    ;       estimate_speedup_by_num_vars
            % Use the num_vars approximation for how much the conjuncts
            % overlap.

    ;       estimate_speedup_by_overlap.
            % Use the overlap calculation for dependent parallelism.

    % This type is used to select the algorithm used to find the most
    % profitable parallelisation of a particular conjunction.
    %
    % TODO: the type name could be improved, to make it distinct from the
    % algorithm used to search through the clique graph.
    %
:- type best_par_algorithm
    --->    bpa_complete_branches(
                % Use the complete algorithm until this many branches have
                % been created during the search. Once this many
                % evaluations have occurred, the greedy algorithm is used;
                % that is to say, once this falls back, all existing
                % alternatives will be explored, but no new ones will be
                % generated.
                int
            )
    ;       bpa_complete_size(
                % Use the complete algorithm for conjunctions with fewer
                % than this many conjuncts, and a greedy algorithm
                % otherwise. The recommended value is 50.
                int
            )
    ;       bpa_complete
            % The complete (branch and bound) algorithm with no fallback.

    ;       bpa_greedy.
            % Always use a greedy and linear algorithm.

    % The set of candidate parallel conjunctions within a procedure.
    %
:- type candidate_par_conjunctions_proc(GoalType)
    --->    candidate_par_conjunctions_proc(
                % A variable name table for the variables that have
                % sensible names.
                cpcp_var_table  :: var_name_table,

                % Each push represents a program transformation.
                % Most of the time, we expect the list to be empty,
                % but if it is not, then the list of candidate conjunctions
                % is valid only AFTER the transformations described
                % by this list have been applied. (The transformations
                % should be independent of one another, so it should be
                % OK to apply them in any order.)
                cpcp_push_goals :: list(push_goal),

                cpcp_par_conjs  :: list(candidate_par_conjunction(GoalType))
            ).
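The bpa_complete_branches alternative above describes a complete search that falls back to greedy extension once a branch budget is exhausted. The following Python sketch illustrates only that control structure; the cost model (parallel time approximated as the most expensive sequential run plus a fixed per-split overhead) and all names are illustrative assumptions, not Mercury's actual algorithm.

```python
def run_costs(costs, splits):
    # Split the conjunct costs into sequential runs: splits[i] says
    # whether to parallelise between conjunct i and conjunct i + 1.
    runs, cur = [], costs[0]
    for cost, split in zip(costs[1:], splits):
        if split:
            runs.append(cur)
            cur = cost
        else:
            cur += cost
    runs.append(cur)
    return runs

def score(costs, splits, overhead):
    # Toy cost model: parallel time is the most expensive run, plus a
    # fixed overhead per extra parallel conjunct. (Illustrative only.)
    runs = run_costs(costs, splits)
    return max(runs) + overhead * (len(runs) - 1)

def best_partition(costs, branch_budget, overhead=0.0):
    # Complete (exhaustive) search until branch_budget branches have been
    # created; after that, each remaining partial solution is extended
    # greedily, so existing alternatives are still explored but no new
    # ones are generated.
    n = len(costs) - 1
    best_score = best_splits = None
    created = 0
    stack = [()]
    while stack:
        partial = stack.pop()
        if len(partial) == n:
            s = score(costs, partial, overhead)
            if best_score is None or s < best_score:
                best_score, best_splits = s, partial
            continue
        if created < branch_budget:
            stack.append(partial + (False,))
            stack.append(partial + (True,))
            created += 2
        else:
            # Greedy fallback: commit to the locally cheaper extension.
            exts = [partial + (d,) for d in (False, True)]
            stack.append(min(exts, key=lambda e: score(costs, e, overhead)))
    return best_score, best_splits
```

For example, with conjunct costs [10.0, 10.0, 1.0] and an overhead of 2.0 per split, the best partition parallelises only between the first two conjuncts.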
    % This type describes a 'push goal' transformation: a goal may be
    % pushed into the arms of a branching goal that occurs before it in
    % the same conjunction. This can allow the pushed goal to be
    % parallelised against goals in one or more branches, without
    % parallelising the whole branching goal (whose per-call cost may be
    % too small).
    %
:- type push_goal
    --->    push_goal(
                % The goal path of the conjunction in which the push
                % is done.
                pg_goal_path    :: goal_path_string,

                % The range of conjuncts to push (inclusive).
                pg_pushee_lo    :: int,
                pg_pushee_hi    :: int,

                % The set of expensive goals inside earlier conjuncts in
                % that conjunction "next" to which the pushee goals should
                % be pushed. By "next", we mean that the pushee goals
                % should be added to the end of whatever conjunction
                % contains the expensive goal, creating a containing
                % conjunction if there wasn't one there before.
                %
                % Each of these expensive goals should be on a different
                % execution path.
                %
                % This list should not be empty.
                pg_pushed_into  :: list(goal_path_string)
            ).

:- type candidate_par_conjunctions_proc ==
    candidate_par_conjunctions_proc(pard_goal).

    % A conjunction that is a candidate for parallelisation. It is
    % identified by a procedure label, the goal path of the conjunction,
    % and the call sites within the conjunction that are to be
    % parallelised.
    %
    % TODO: in the future, support more expressive candidate parallel
    % conjunctions, so that more opportunities for parallelism can be
    % found. However, it is probably not a good idea to parallelise three
    % or more conjuncts against one another without first having a good
    % system for reaching and maintaining the target amount of
    % parallelism; this may involve distance granularity.
    %
:- type candidate_par_conjunction(GoalType)
    --->    candidate_par_conjunction(
                % The path within the procedure to this conjunction.
                cpc_goal_path           :: goal_path_string,

                % If the candidate is dependent on a push being performed,
                % what is that push? Note that any push that specifies the
                % same goals being pushed and the same OR GREATER set of
                % goals next to which to push them is acceptable: if such
                % a push is performed, then this candidate is viable.
                cpc_maybe_push_goal     :: maybe(push_goal),

                % The position within the original conjunction at which
                % this parallelisation starts.
                cpc_first_conj_num      :: int,

                cpc_is_dependent        :: conjuncts_are_dependent,

                cpc_goals_before        :: list(GoalType),
                cpc_goals_before_cost   :: float,

                % A list of parallel conjuncts, each of which is a
                % sequential conjunction of inner goals. All inner goals
                % that are seen in the program representation must be
                % stored here, unless they are to be scheduled before or
                % after the sequential conjunction. If these conjuncts are
                % flattened, the inner goals will appear in the same order
                % as in the program representation. Maintaining these two
                % rules allows the compiler and the analysis tools to use
                % similar algorithms to construct the same parallel
                % conjunction from the same program representation/HLDS
                % structure.
                cpc_conjs               :: list(seq_conj(GoalType)),

                cpc_goals_after         :: list(GoalType),
                cpc_goals_after_cost    :: float,

                cpc_par_exec_metrics    :: parallel_exec_metrics
            ).

:- type seq_conj(GoalType)
    --->    seq_conj(
                sc_conjs    :: list(GoalType)
            ).

:- type callee_rep
    --->    unknown_callee
            % An unknown callee, such as a higher order or method call.

    ;       named_callee(
                % A known callee. Note that the arity and mode are not
                % stored at all. XXX why?
                nc_module_name  :: string,
                nc_proc_name    :: string
            ).

    % A parallelised goal (pard_goal): a goal within a parallel
    % conjunction. We don't yet have to represent many types of goals,
    % or many details about them.
    %
:- type pard_goal == goal_rep(pard_goal_annotation).

:- type pard_goal_annotation
    --->    pard_goal_annotation(
                % The per-call cost of this call in call sequence counts.
                pga_cost_percall            :: float,

                pga_cost_above_threshold    :: cost_above_par_threshold,

                % Variable use information.
                pga_var_productions         :: assoc_list(var_rep, float),
                pga_var_consumptions        :: assoc_list(var_rep, float)
            ).

:- type cost_above_par_threshold
    --->    cost_above_par_threshold
            % The goal has a significant enough cost to be considered for
            % parallelisation.

    ;       cost_not_above_par_threshold.
            % The goal is too cheap to be considered for parallelisation.
            % We track it in the feedback information to help inform the
            % compiler about _how_ to parallelise calls around it.

:- type conjuncts_are_dependent
    --->    conjuncts_are_dependent(set(var_rep))
    ;       conjuncts_are_independent.

:- pred convert_candidate_par_conjunctions_proc(
    pred(candidate_par_conjunction(A), A, B)::in(pred(in, in, out) is det),
    candidate_par_conjunctions_proc(A)::in,
    candidate_par_conjunctions_proc(B)::out) is det.

:- pred convert_candidate_par_conjunction(
    pred(candidate_par_conjunction(A), A, B)::in(pred(in, in, out) is det),
    candidate_par_conjunction(A)::in, candidate_par_conjunction(B)::out)
    is det.

:- pred convert_seq_conj(
    pred(A, B)::in(pred(in, out) is det),
    seq_conj(A)::in, seq_conj(B)::out) is det.
%-----------------------------------------------------------------------------%

    % Represents the metrics of a parallel execution.
    %
:- type parallel_exec_metrics
    --->    parallel_exec_metrics(
                % The number of calls into this parallelisation.
                pem_num_calls               :: int,

                % The elapsed time of the original sequential execution.
                pem_seq_time                :: float,

                % The estimated elapsed time of the parallel execution.
                pem_par_time                :: float,

                % The overheads of parallel execution. These are already
                % included in pem_par_time. Overheads are separated into
                % their different causes.
                pem_par_overhead_spark_cost :: float,
                pem_par_overhead_barrier    :: float,
                pem_par_overhead_signals    :: float,
                pem_par_overhead_waits      :: float,

                % The amount of time the initial (leftmost) conjunct
                % spends waiting for the other conjuncts. During this
                % time, the context used by this conjunct must be kept
                % alive, because it will resume executing sequential code
                % after the parallel conjunction; however, we know that it
                % cannot be resumed before its children have completed.
                pem_first_conj_dead_time    :: float,

                % The amount of time all conjuncts spend blocked on the
                % production of futures.
                pem_future_dead_time        :: float
            ).

    % The speedup per call: SeqTime / ParTime. For example, a value of
    % 2.0 means that the goal is twice as fast when parallelised.
    %
:- func parallel_exec_metrics_get_speedup(parallel_exec_metrics) = float.

    % The amount of time saved per call: SeqTime - ParTime.
    %
:- func parallel_exec_metrics_get_time_saving(parallel_exec_metrics) = float.

    % The amount of time spent "on cpu": the sequential time plus the
    % non-dead overheads.
    %
:- func parallel_exec_metrics_get_cpu_time(parallel_exec_metrics) = float.

    % The overheads of parallel execution.
    %
    % Add these to pem_seq_time to get the "time on cpu" of this
    % execution.
    %
:- func parallel_exec_metrics_get_overheads(parallel_exec_metrics) = float.

%-----------------------------------------------------------------------------%
%-----------------------------------------------------------------------------%
:- implementation.

:- import_module exception.
:- import_module float.
:- import_module map.
:- import_module require.
:- import_module unit.
:- import_module univ.

%-----------------------------------------------------------------------------%

parallel_exec_metrics_get_speedup(PEM) = SeqTime / ParTime :-
    SeqTime = PEM ^ pem_seq_time,
    ParTime = PEM ^ pem_par_time.

parallel_exec_metrics_get_time_saving(PEM) = SeqTime - ParTime :-
    SeqTime = PEM ^ pem_seq_time,
    ParTime = PEM ^ pem_par_time.

parallel_exec_metrics_get_cpu_time(PEM) = SeqTime + Overheads :-
    SeqTime = PEM ^ pem_seq_time,
    Overheads = parallel_exec_metrics_get_overheads(PEM).

parallel_exec_metrics_get_overheads(PEM) =
        SparkCosts + BarrierCosts + SignalCosts + WaitCosts :-
    PEM = parallel_exec_metrics(_, _, _, SparkCosts, BarrierCosts,
        SignalCosts, WaitCosts, _, _).
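The four getter functions above implement simple arithmetic over the metrics record. As a quick sanity check of those formulas, here is the same arithmetic in Python; the dictionary keys mirror the Mercury record's fields and the numbers are illustrative only:

```python
# Plain-function mirror of the parallel_exec_metrics getters above.
def get_speedup(pem):
    # SeqTime / ParTime: 2.0 means twice as fast when parallelised.
    return pem["seq_time"] / pem["par_time"]

def get_time_saving(pem):
    # SeqTime - ParTime, per call.
    return pem["seq_time"] - pem["par_time"]

def get_overheads(pem):
    # The four overhead causes, already included in the parallel time.
    return (pem["spark_cost"] + pem["barrier"]
            + pem["signals"] + pem["waits"])

def get_cpu_time(pem):
    # "Time on cpu": sequential time plus the non-dead overheads.
    return pem["seq_time"] + get_overheads(pem)

# Illustrative values, in call sequence counts.
pem = {"seq_time": 100.0, "par_time": 60.0,
       "spark_cost": 2.0, "barrier": 1.0, "signals": 3.0, "waits": 4.0}
```

With these values the speedup is 100/60 (about 1.67), 40 counts are saved per call, and the cpu time is 110 counts.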
%-----------------------------------------------------------------------------%
%
% Helper predicates for the candidate parallel conjunctions type.
%

convert_candidate_par_conjunctions_proc(Conv, CPCProcA, CPCProcB) :-
    CPCProcA = candidate_par_conjunctions_proc(VarTable, PushGoals, CPCA),
    list.map(convert_candidate_par_conjunction(Conv), CPCA, CPCB),
    CPCProcB = candidate_par_conjunctions_proc(VarTable, PushGoals, CPCB).

convert_candidate_par_conjunction(Conv0, CPC0, CPC) :-
    CPC0 = candidate_par_conjunction(GoalPath, MaybePushGoal, FirstGoalNum,
        IsDependent, GoalsBefore0, GoalsBeforeCost, Conjs0,
        GoalsAfter0, GoalsAfterCost, Metrics),
    Conv = (pred(A::in, B::out) is det :-
        Conv0(CPC0, A, B)
    ),
    list.map(convert_seq_conj(Conv), Conjs0, Conjs),
    list.map(Conv, GoalsBefore0, GoalsBefore),
    list.map(Conv, GoalsAfter0, GoalsAfter),
    CPC = candidate_par_conjunction(GoalPath, MaybePushGoal, FirstGoalNum,
        IsDependent, GoalsBefore, GoalsBeforeCost, Conjs,
        GoalsAfter, GoalsAfterCost, Metrics).

convert_seq_conj(Conv, seq_conj(Conjs0), seq_conj(Conjs)) :-
    list.map(Conv, Conjs0, Conjs).

%-----------------------------------------------------------------------------%
:- end_module mdbcomp.feedback.automatic_parallelism.
%-----------------------------------------------------------------------------%