
<html>
<head>
<title>
Implementing tail recursion in the MLDS code generator
</title>
</head>

<body
	bgcolor="#ffffff"
	text="#000000"
>

<h1>Implementing tail recursion in the MLDS code generator</h1>

<h2>Tail recursion optimization versus last call optimization</h2>

<p>
Most implementations of declarative languages implement
<em>last call optimization</em> (LCO)
to allow recursive algorithms to handle arbitrary amounts of data
using constant stack space.
With LCO, when the last thing that a procedure does is
call a callee whose vector of return values is the same as
the vector of return values of the caller,
it deallocates the stack frame of the caller before the call,
allowing its space to be used to store the stack frame of the callee.

<p>
In its general form, LCO does not require any knowledge of the callee;
it works even when the identity of the callee is unknown at compile time,
as when the last call is a higher order call or a method call,
or its code is unavailable,
as when it is defined in a different compilation unit.
However, implementing this general form of LCO
requires the implementation to have
direct control over the use of the stack,
and the ability to generate jumps (not calls) to arbitrary locations.
In the Mercury compiler, the LLDS code generator can do both these things,
but the MLDS code generator can do neither.
This is why it can implement only
<em>tail recursion optimization</em> (TRO).
This differs from LCO in two ways.
<ul>
<li>
We don't deallocate the stack frame of the caller before the tail call;
instead, we reuse that stack frame to become the stack frame of the callee.
<li>
We don't use a global (between functions) jump
to the start of the code of the callee wherever it happens to be in memory;
instead, we include the code of the callee next to the code of the caller,
and use local (within a function) branches to transfer control.
</ul>
This means that TRO is a less general form of last call optimization,
because it is applicable only
when the call is first order, so the callee is statically known,
and the caller and callee are in the same compilation unit,
so they can be compiled together into a single target language function.

<p>
Tail recursion optimization is applicable
to both self-recursion and mutual recursion.
TRO for self-recursion is significantly simpler,
so we describe that first.
This is a general principle we use everywhere below:
we introduce the simplest case first,
and add the complications (and their solutions) later.

<p>
<h2>Self tail recursion</h2>

<p>
To explain how the Mercury compiler applies TRO to self-recursive calls,
we will use this example predicate:

<p>
<pre>
:- pred len(list(int)::in, int::in, int::out) is det.

len(L, Len0, Len) :-
  (
    L = [],
    Len = Len0
  ;
    L = [_ | T],
    Len1 = Len0 + 1,
    len(T, Len1, Len)
  ).
</pre>

<p>
Here is the C code generated by the Mercury compiler for this predicate
without TRO:

<p>
<pre>
void MR_CALL
x__len_3_p_0(
  MR_Word L_4,
  MR_Integer Len0_5,
  MR_Integer * Len_6)
{
  if ((L_4 == ((MR_Word) MR_mkword(MR_mktag(0), MR_mkbody((MR_Integer) 0)))))
    *Len_6 = Len0_5;
  else
  {
    MR_Word T_8;
    MR_Integer Len1_9;
    MR_Integer Var_10;
    MR_Integer Var_7;

    Var_7 = ((MR_Integer) (MR_hl_field(MR_mktag(1), L_4, (MR_Integer) 0)));
    T_8 = ((MR_Word) (MR_hl_field(MR_mktag(1), L_4, (MR_Integer) 1)));
    Var_10 = (MR_Integer) 1;
    Len1_9 = (Len0_5 + Var_10);
    x__len_3_p_0(T_8, Len1_9, Len_6);
  }
}
</pre>

<p>
The last call is a tail recursive call.
When this predicate is compiled with TRO,
we get this C code:

<p>
<pre>
void MR_CALL
x__len_3_p_0(
  MR_Word L_4,
  MR_Integer Len0_5,
  MR_Integer * Len_6)
{
  while (MR_TRUE)
  {
    if ((L_4 == ((MR_Word) MR_mkword(MR_mktag(0), MR_mkbody((MR_Integer) 0)))))
      *Len_6 = Len0_5;
    else
    {
      MR_Word T_8 = ((MR_Word) (MR_hl_field(MR_mktag(1), L_4, (MR_Integer) 1)));
      MR_Integer Len1_9;
      MR_Integer Var_10 = (MR_Integer) 1;
      MR_Integer Var_7 = ((MR_Integer) (MR_hl_field(MR_mktag(1), L_4, (MR_Integer) 0)));
      MR_Word next_value_of_L_4;
      MR_Integer next_value_of_Len0_5;

      Len1_9 = (Len0_5 + Var_10);
      // direct tailcall eliminated
      next_value_of_L_4 = T_8;
      next_value_of_Len0_5 = Len1_9;
      L_4 = next_value_of_L_4;
      Len0_5 = next_value_of_Len0_5;
      continue;
    }
    break;
  }
}
</pre>

<p>
This differs from the unoptimized code in two major aspects.

<p>
The first aspect affected by TRO
is the translation of the self tail call
(or self tail calls, plural, in the general case).
TRO replaces the call with

<ul>
<li>
code that assigns the input arguments of the self-recursive call
(in this case, T and Len1)
to the corresponding input arguments in the head (L and Len0), and
<li>
code that transfers control back to the start of the procedure,
i.e. the entry point of the callee.
</ul>

<p>
There is no code for handling the output arguments,
since (by the definition of tail calls)
these must be the same in the caller and the callee.
On every non-recursive path,
we return the values of the output arguments
using the exact same code as we would use without TRO.

<p>
Note that the code that passes the input arguments does so in two stages:
assignments of the actual parameter values
to the next_value_of_ forms of the input arguments,
followed by assignments of these next_value_of_ forms
to the input arguments themselves.
This is to handle the case where some variable is both
an input argument and an actual parameter of the call.
If we just assigned each actual parameter to the corresponding input directly
in (say) ascending order of argument number,
then the translation of a call such as foo(In2, In1, <i>outputs</i>)
in a predicate whose head looks like foo(In1, In2, <i>outputs</i>)
would consist of the assignments

<p>
<pre>
In1 = In2;
In2 = In1;
</pre>

and the first assignment would clobber the value to be assigned by the second.
This is the standard problem of swapping two values,
and its solution requires at least one temporary variable
(if we don't want to resort to unnecessarily complicated code using xors).
Our solution works because
the next_value_of_ forms of the input arguments are never live
outside the small blocks of code resulting from a single tail recursive call
(we simply don't generate references to them in any other context),
and inside each block, each such variable
is written exactly once and read exactly once (in that order).
The fact that we use more temporaries
than may be strictly necessary does not matter,
because the final decision on
how the assigned values end up in their target locations
is not up to the Mercury compiler;
it is up to the compiler that translates the generated C, C# or Java
to machine code.
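<p>
As a hand-written illustration (not actual compiler output),
consider a hypothetical predicate whose head is gcd(A, B, <i>outputs</i>)
and whose tail call is gcd(B, A mod B, <i>outputs</i>).
The sketch below contrasts direct in-order assignments
with the two-stage scheme.
The function names are our own invention,
and the result is returned directly just to keep the sketch short.

```c
/* Hypothetical illustration; not actual Mercury compiler output. */

/* Tail call gcd(B, A mod B) translated with direct, in-order assignments. */
long gcd_naive(long a, long b)
{
  while (1)
  {
    if (b == 0)
      return a;
    a = b;        /* clobbers the old value of a ... */
    b = a % b;    /* ... so this computes b % b, which is always 0 */
    continue;
  }
}

/* The same tail call translated in two stages. */
long gcd_two_stage(long a, long b)
{
  while (1)
  {
    if (b == 0)
      return a;
    {
      long next_value_of_a;
      long next_value_of_b;

      /* stage 1: evaluate every actual parameter using the old inputs */
      next_value_of_a = b;
      next_value_of_b = a % b;
      /* stage 2: only now overwrite the inputs */
      a = next_value_of_a;
      b = next_value_of_b;
      continue;
    }
  }
}
```

<p>
With the inputs 48 and 18, gcd_two_stage returns 6,
while gcd_naive returns 18,
because its second assignment reads the already overwritten input.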

<p>
The second aspect affected by TRO is that
the entire body of the target language (in this case C) code
we generate for the procedure is wrapped up in a loop.

<p>
The usual way we wrap the procedure body is with a while loop:

<p>
<pre>
ret_type func_name(args)
{
  while (MR_TRUE)
  {
    // procedure body
    // in which tail calls transfer control using "continue"
    break;      // reached only on non-recursive paths
  }
}
</pre>

<p>
However, we can also use gotos:

<p>
<pre>
ret_type func_name(args)
{
top_of_proc:
  {
    // procedure body
    // in which tail calls transfer control using "goto top_of_proc"
  }
}
</pre>
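<p>
For a concrete instance of the goto form,
here is a hand-written sketch of the len predicate from above.
It assumes a simplified list representation
(a struct with a null pointer standing for the empty list)
in place of the tagged MR_Word values used in real generated code;
the names struct cell and len_goto are hypothetical.

```c
/* Hypothetical list cell; real generated code uses tagged MR_Word values. */
struct cell
{
  int head;
  struct cell *tail;    /* a null pointer represents the empty list */
};

void len_goto(struct cell *L, long Len0, long *Len)
{
top_of_proc:
  {
    if (L == 0)
      *Len = Len0;      /* base case: return the output as usual */
    else
    {
      struct cell *T = L->tail;
      long Len1 = Len0 + 1;
      struct cell *next_value_of_L;
      long next_value_of_Len0;

      /* tail call len(T, Len1, Len) replaced by assignments and a branch */
      next_value_of_L = T;
      next_value_of_Len0 = Len1;
      L = next_value_of_L;
      Len0 = next_value_of_Len0;
      goto top_of_proc;
    }
  }
}
```

<p>
Unlike the while loop form, this form needs no break statement:
a non-recursive path simply falls off the end of the block.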
<p>

<h2>Mutual tail recursion</h2>

<p>
The MLDS code generator partitions the procedures of a module
into a sequence of SCCs,
where each SCC (strongly connected component)
consists of a set of procedures
that are all reachable from each other via calls, whether tail or non-tail.
Since TRO applies only to tail calls,
it also partitions each SCC further into one or more TSCCs (tail SCCs),
which are strongly connected components of a graph
whose nodes represent procedures
and in which there are edges only for <em>tail</em> calls.
This means that by definition, every procedure in a TSCC
is reachable from every procedure in that TSCC using only tail calls.
It then implements tail recursion optimization
in each TSCC that contains tail calls.

<p>
Most TSCCs contain only one procedure,
which means that we can implement TRO for them
using only the techniques above,
without using any of the techniques below.
The techniques below are needed only for TSCCs that contain
two or more procedures.

<p>
Note that two (or more) mutually recursive procedures
can end up in <em>different</em> TSCCs
even if there is a tail call between them,
if the tail calls go only one way,
e.g. if procedure p calls procedure q using tail calls,
but q calls p using only ordinary nontail calls.
The LLDS backend can optimize the tail calls to q in p,
but the MLDS backend cannot do so,
because it cannot generate nonlocal gotos.
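<p>
The following sketch shows such a one-way pair,
written directly in C rather than in Mercury to keep it short.
The procedures p and q are hypothetical.
They are in the same SCC, since each calls the other,
but in different TSCCs, since only the call from p to q is a tail call.

```c
/* Hypothetical illustration of one-way tail calls. */
long q(long n);

/* The call to q is the last thing p does: a tail call. */
long p(long n)
{
  if (n == 0)
    return 0;
  return q(n);
}

/* q still has work to do after p returns: a non-tail call. */
long q(long n)
{
  long r = p(n - 1);
  return r + 1;
}
```

<p>
In the graph that has edges only for tail calls,
there is an edge from p to q but none from q to p,
so neither procedure can reach itself using only tail calls.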

<p>
To implement mutual tail recursion between the procedures of a nontrivial TSCC,
we need to generalize
<ul>
<li>
the mechanism for passing input arguments at tail calls,
<li>
the mechanism for returning output arguments on nonrecursive paths, and
<li>
the mechanism for transferring control.
</ul>

<h3>Transfers of control</h3>

<p>
The easiest to generalize is the last one: the transfer of control.
To see how it is done,
consider a small TSCC containing two procedures, tscc_a and tscc_b.
Since we need to translate tail calls into <em>local</em> transfers of control,
we translate each TSCC together,
either using labels and gotos like this:

<p>
<pre>
ret_type_a tscc_a(args_a)
{
  goto top_of_proc_1;
top_of_proc_1:
  {
    // body of procedure tscc_a
    // in which tail calls transfer control using "goto top_of_proc_N"
    goto tscc_end;
  }
top_of_proc_2:
  {
    // body of procedure tscc_b
    // in which tail calls transfer control using "goto top_of_proc_N"
    goto tscc_end;
  }
tscc_end:
  return ...
}

ret_type_b tscc_b(args_b)
{
  goto top_of_proc_2;
top_of_proc_1:
  {
    // body of procedure tscc_a
    // in which tail calls transfer control using "goto top_of_proc_N"
    goto tscc_end;
  }
top_of_proc_2:
  {
    // body of procedure tscc_b
    // in which tail calls transfer control using "goto top_of_proc_N"
    goto tscc_end;
  }
tscc_end:
  return ...
}
</pre>

<p>
or using while loops and switches like this:

<p>
<pre>
ret_type_a tscc_a(args_a)
{
  int tscc_selector = 1;
  while (MR_TRUE)
  {
    switch (tscc_selector)
    {
      case 1:
        {
          // body of procedure tscc_a
          // in which tail calls transfer control using
          // "tscc_selector = N; continue"
        }
        break;
      case 2:
        {
          // body of procedure tscc_b
          // in which tail calls transfer control using
          // "tscc_selector = N; continue"
        }
        break;
    }
    break;
  }

  return ...
}

ret_type_b tscc_b(args_b)
{
  int tscc_selector = 2;
  while (MR_TRUE)
  {
    switch (tscc_selector)
    {
      case 1:
        {
          // body of procedure tscc_a
          // in which tail calls transfer control using
          // "tscc_selector = N; continue"
        }
        break;
      case 2:
        {
          // body of procedure tscc_b
          // in which tail calls transfer control using
          // "tscc_selector = N; continue"
        }
        break;
    }
    break;
  }

  return ...
}
</pre>

<p>
In both cases,
each procedure in the TSCC has its own number in the TSCC
(in this case, tscc_a is procedure 1 in the TSCC
and tscc_b is procedure 2 in the TSCC).
We call this number the procedure's in-TSCC id number.
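<p>
To make the loop-and-switch form concrete,
here is a hand-written sketch for a hypothetical TSCC
containing the procedures is_even (id 1) and is_odd (id 2).
Parameter passing is deliberately simplified:
both wrapped bodies share the single variable n,
and the variable result is our own stand-in for the returned value.

```c
/* Hypothetical illustration; not actual Mercury compiler output. */

/* Container for is_even (in-TSCC id 1). Returns 1 for true, 0 for false. */
int is_even(unsigned long n)
{
  int tscc_selector = 1;
  int result = 0;

  while (1)
  {
    switch (tscc_selector)
    {
      case 1:
        {
          /* wrapped body of is_even */
          if (n == 0)
            result = 1;
          else
          {
            n = n - 1;
            tscc_selector = 2;    /* tail call to is_odd */
            continue;
          }
        }
        break;
      case 2:
        {
          /* wrapped body of is_odd */
          if (n == 0)
            result = 0;
          else
          {
            n = n - 1;
            tscc_selector = 1;    /* tail call to is_even */
            continue;
          }
        }
        break;
    }
    break;
  }

  return result;
}

/* Container for is_odd (id 2): same bodies, different initial selector. */
int is_odd(unsigned long n)
{
  int tscc_selector = 2;
  int result = 0;

  while (1)
  {
    switch (tscc_selector)
    {
      case 1:
        {
          if (n == 0)
            result = 1;
          else
          {
            n = n - 1;
            tscc_selector = 2;
            continue;
          }
        }
        break;
      case 2:
        {
          if (n == 0)
            result = 0;
          else
          {
            n = n - 1;
            tscc_selector = 1;
            continue;
          }
        }
        break;
    }
    break;
  }

  return result;
}
```

<p>
The break after each case leaves only the switch;
it is the break after the switch that leaves the loop,
while each continue goes around the loop and re-enters the switch.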

<p>
We translate each procedure in the TSCC into MLDS code just once,
yielding the code represented by "body of procedure ..." above.
We call these <em>inner</em> or <em>wrapped</em> procedures.
If the TSCC contains N procedures,
then each C function we generate will contain N wrapped procedures.
We call the entirety of each C function
an <em>outer</em> or <em>container</em> procedure,
since each contains two or more wrapped procedures.
We must generate a container procedure
for every member of the TSCC
that may be called by a non-tail call from anywhere:
from other modules,
from other (higher) SCCs in the current module,
from procedures in the current SCC that are not in the TSCC,
and via non-tail calls from any procedure in the TSCC itself.
This means that
the code of every procedure in a TSCC that contains N procedures
will be present up to N times in the executable.
Since mutually-tail-recursive procedures are relatively rare,
and most TSCCs with more than one procedure contain only two or three,
this increase in the total code memory requirement
is usually a more than acceptable price to pay
for the ability to handle arbitrarily deep recursion in constant stack space.
(In fact, the increased memory requirement is probably not as important
as the reduction of the effectiveness of the instruction cache:
the cache misses that bring in the code of a wrapped procedure from main memory
have to be incurred for <em>each</em> one of its executed copies.)

<h3>Parameter passing</h3>

<p>
Parameter passing between the procedures of a TSCC at tail calls
is not as simple as parameter passing at self-tail-recursive calls,
because (except in the case of self-tail-calls)
the actual parameters in the caller and the formal parameters of the callee
will come from two different procedures, and thus from two different varsets.
Since every procedure's varset contains variables
whose numbers are allocated consecutively from zero,
the sets of variable numbers in two different procedures
will of course greatly overlap,
and it is possible for a variable with a given number
to have the same name in both varsets as well.
We don't want any such accidental name collisions
to result in the generated code using the same C variable
to represent <em>both</em> of the colliding variables,
since that would be semantically wrong.
(For starters, the two variables could even have different types,
but the sharing of their storage would be a bug
even if they had the same type.)
We therefore need a mechanism to avoid this problem.

<p>
One possible solution would be to rename (or renumber) apart
either the varsets of the procedures in the TSCC before code generation,
or their sets of MLDS variables either during or after code generation.
Both are problematic.
HLDS procedures contain <em>lots</em> of fields that contain variables,
so the code for renaming or renumbering variables in all of them
would be big (which would pose a program maintenance burden)
and would turn over a lot of memory (a performance problem).
And some compiler-generated MLDS variables
have fixed names and no changeable number.

<p>
Our chosen solution sidesteps such problems altogether
by inventing a new set of compiler-generated MLDS variables
specifically for parameter passing in TSCCs.

<ul>
<li>
For every input argument of a nondummy type
for every procedure in the TSCC,
we create the MLDS variable tscc_proc_N_input_M_VarName,
where N is the procedure's in-TSCC id number,
M is the input argument's position
in the procedure's list of input arguments
(i.e. this is the procedure's Mth input argument),
and VarName is the name of the argument.
The VarName part is not needed for correctness;
it is there only to make the generated MLDS easier to read.
<li>
For every output argument
in the vector of output arguments of the procedures of the TSCC
(which must be the same in all procedures of the TSCC),
we create the MLDS variable tscc_output_M_VarName.
M is the output argument's position in this vector,
while VarName is the output argument's name
in <em>one</em> of the procedures of the TSCC.
The fact that this name need not match the name of the output argument
in the other procedures does not matter, because again
the name is there only to make the generated MLDS easier to read.
<li>
All the procedures in a TSCC must have the same code model,
which may be either model_det or model_semi.
In the latter case, we also create the MLDS variable tscc_output_succeeded.
Model semi procedures return the value of the succeeded MLDS variable
as if it were a sort-of output argument;
the tscc_output_succeeded variable
has the same relationship to the succeeded variable
as the tscc_output_M_VarName variables have
to the MLDS output variables they correspond to.
</ul>

<p>
In every procedure,
every argument that participates in parameter passing
(i.e. every argument that is not of a dummy type and whose mode is not unused)
has either a corresponding tscc_proc_N_input_M_VarName variable
(if it is an input argument)
or a corresponding tscc_output_M_VarName variable
(if it is an output argument).
In each such pair of corresponding variables,
we call the MLDS variable representing the argument
the procedure's own variable,
and we call the other the tscc variable.

<p>
Suppose both tscc_a and tscc_b are det functions
whose argument vectors are tscc_a(AIn1, AIn2) = AOut1
and tscc_b(BIn1) = BOut1 respectively,
and the name of the MLDS type of each variable
is the name of the variable with "Type" inserted before its number
(so the type of AIn1 is AInType1).
However, since AOutType1 must be the same as BOutType1,
we will replace both with just "OutType1".
Then the parameter passing code we generate will look like this:

<p>
<pre>
OutType1
tscc_a(
  AInType1      tscc_proc_1_input_1_AIn1,
  AInType2      tscc_proc_1_input_2_AIn2)
{
  BInType1      tscc_proc_2_input_1_BIn1;
  OutType1      tscc_output_1_AOut1;

  goto top_of_proc_1;
top_of_proc_1:
  {
    AInType1    AIn1 = tscc_proc_1_input_1_AIn1;
    AInType2    AIn2 = tscc_proc_1_input_2_AIn2;
    OutType1    AOut1;

    // body of procedure tscc_a in which
    // tail calls to tscc_a look like this:
    //      tscc_proc_1_input_1_AIn1 = input arg 1 of tail call;
    //      tscc_proc_1_input_2_AIn2 = input arg 2 of tail call;
    //      goto top_of_proc_1;
    // tail calls to tscc_b look like this:
    //      tscc_proc_2_input_1_BIn1 = input arg 1 of tail call;
    //      goto top_of_proc_2;
    // and base cases assign to AOut1 as usual

    tscc_output_1_AOut1 = AOut1;
    goto tscc_end;
  }
top_of_proc_2:
  {
    BInType1    BIn1 = tscc_proc_2_input_1_BIn1;
    OutType1    BOut1;

    // body of procedure tscc_b in which
    // tail calls to both tscc_a and tscc_b look like they do above
    // and base cases assign to BOut1 as usual

    tscc_output_1_AOut1 = BOut1;
    goto tscc_end;
  }
tscc_end:
  return tscc_output_1_AOut1;
}

OutType1
tscc_b(
  BInType1      tscc_proc_2_input_1_BIn1)
{
  AInType1      tscc_proc_1_input_1_AIn1;
  AInType2      tscc_proc_1_input_2_AIn2;
  OutType1      tscc_output_1_AOut1;

  goto top_of_proc_2;
top_of_proc_1:
  {
    AInType1    AIn1 = tscc_proc_1_input_1_AIn1;
    AInType2    AIn2 = tscc_proc_1_input_2_AIn2;
    OutType1    AOut1;

    // body of procedure tscc_a in which
    // tail calls to tscc_a look like this:
    //      tscc_proc_1_input_1_AIn1 = input arg 1 of tail call;
    //      tscc_proc_1_input_2_AIn2 = input arg 2 of tail call;
    //      goto top_of_proc_1;
    // tail calls to tscc_b look like this:
    //      tscc_proc_2_input_1_BIn1 = input arg 1 of tail call;
    //      goto top_of_proc_2;
    // and base cases assign to AOut1 as usual

    tscc_output_1_AOut1 = AOut1;
    goto tscc_end;
  }
top_of_proc_2:
  {
    BInType1    BIn1 = tscc_proc_2_input_1_BIn1;
    OutType1    BOut1;

    // body of procedure tscc_b in which
    // tail calls to both tscc_a and tscc_b look like they do above
    // and base cases assign to BOut1 as usual

    tscc_output_1_AOut1 = BOut1;
    goto tscc_end;
  }
tscc_end:
  return tscc_output_1_AOut1;
}
</pre>

<p>
The general principles of our parameter passing scheme are as follows.

<ul>
<li>
The MLDS variables that
we would generate for a procedure in the absence of TRO,
which we call the procedure's <em>own</em> variables,
are visible <em>only</em> in the scope
containing the wrapped body of that procedure.
(These are the scopes after the top_of_proc_N labels above.)
Since these scopes never overlap,
there is never any place in the generated MLDS code
where the own variables of more than one TSCC procedure are visible at once.
<li>
Each scope containing a wrapped procedure occurs
either after the label corresponding to the procedure
(if we are using labels and gotos)
or in the switch case corresponding to the procedure
(if we are using while loops and switches),
and consists of
    <ul>
    <li>
    the declarations of the procedure's own MLDS variables
    for both the input and output arguments, followed by
    <li>
    assignments that set the value of each own MLDS input argument
    from the value of the corresponding tscc_proc_N_input_M_VarName variable
    (the above code examples show these
    merged with the definitions of the variables being set), followed by
    <li>
    the wrapped body of the procedure, followed by
    <li>
    assignments that copy the value of each own MLDS output argument
    to the corresponding tscc_output_M_VarName variable, followed by
    <li>
    a jump to the end of the function, via either a goto or a break.
    </ul>
Note that in this case,
the tail calls do not need to use any next_value_of_ variables.
The actual parameters can only be the procedure's own variables,
and the formal parameters being assigned to
are all tscc_proc_N_input_M_VarName variables,
so there cannot be any overlap between them.
<li>
For procedures whose outputs are all returned by value
(which includes both tscc_a and tscc_b above),
their container function consists of the following.
    <ul>
    <li>
    The declaration of the function signature,
    giving the type (or in general, types)
    of the output argument(s) returned by value,
    and the types and variable names of the input arguments.
    The variables in the signature
    are the tscc_proc_N_input_M_VarName variables
    of the procedure that the container is for,
    which this signature declares.
    <li>
    The body of the container function starts by declaring
    all the tscc_proc_N_input_M_VarName variables
    of all the <em>other</em> procedures in the TSCC,
    and all the tscc_output_M_VarName variables
    (each of which is shared by all the procedures in the TSCC).
    <p>
    The tscc_proc_N_input_M_VarName variables of the procedure
    that the container is for start out initialized by the caller.
    The tscc_proc_N_input_M_VarName variables
    of the other procedures in the TSCC start out uninitialized,
    but their contents will be read
    only by code in the corresponding wrapped procedure body,
    which is reachable only by tail call,
    and every such tail call will first set those variables
    to the appropriate values for the tail call.
    <li>
    Either a jump to the label at the start of the wrapped version
    of the procedure that the container function is for,
    or a setting of the tscc_selector variable that achieves
    the same effect.
    The target of the jump is the wrapped procedure
    whose tscc_proc_N_input_M_VarName variables
    are listed in the container function's signature,
    and whose values are therefore defined for us by the caller.
    <li>
    The bodies of the procedures of the TSCC,
    either with each being preceded by its own label,
    or wrapped up in a loop and a switch,
    as shown above.
    <li>
    A return statement for all the tscc_output_M_VarName variables
    corresponding to output arguments returned by value.
    (In the above example, all output arguments are returned by value.)
    Execution reaches here only after a branch here
    from the end of one of the wrapped procedures,
    each of which, immediately before the branch,
    assigns to all the tscc_output_M_VarName variables.
    </ul>
</ul>
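<p>
The principles above can be sketched as compilable C
for a hypothetical TSCC of two det functions,
alt_a(N, Acc) = Out (id 1) and alt_b(N, Acc) = Out (id 2),
which together compute an alternating sum.
The procedures and their bodies are invented for illustration;
only the variable naming scheme and the overall layout come from the text.
The sketch shows the container function for alt_a only.

```c
/* Hypothetical illustration; not actual Mercury compiler output. */

/* Container function for alt_a (in-TSCC id 1). */
long alt_a(long tscc_proc_1_input_1_N, long tscc_proc_1_input_2_Acc)
{
  /* inputs of the other procedure in the TSCC; set only by tail calls */
  long tscc_proc_2_input_1_N;
  long tscc_proc_2_input_2_Acc;
  /* output variable shared by all the procedures in the TSCC */
  long tscc_output_1_Out;

  goto top_of_proc_1;
top_of_proc_1:
  {
    /* own variables of alt_a, visible only in this scope */
    long N = tscc_proc_1_input_1_N;
    long Acc = tscc_proc_1_input_2_Acc;
    long Out;

    if (N == 0)
      Out = Acc;
    else
    {
      /* tail call alt_b(N - 1, Acc + N) */
      tscc_proc_2_input_1_N = N - 1;
      tscc_proc_2_input_2_Acc = Acc + N;
      goto top_of_proc_2;
    }

    tscc_output_1_Out = Out;
    goto tscc_end;
  }
top_of_proc_2:
  {
    /* own variables of alt_b; same names as in alt_a, but a separate scope */
    long N = tscc_proc_2_input_1_N;
    long Acc = tscc_proc_2_input_2_Acc;
    long Out;

    if (N == 0)
      Out = Acc;
    else
    {
      /* tail call alt_a(N - 1, Acc - N) */
      tscc_proc_1_input_1_N = N - 1;
      tscc_proc_1_input_2_Acc = Acc - N;
      goto top_of_proc_1;
    }

    tscc_output_1_Out = Out;
    goto tscc_end;
  }
tscc_end:
  return tscc_output_1_Out;
}
```

<p>
Both wrapped procedures use the own variable names N, Acc and Out,
yet there is no collision, because the two scopes do not overlap.
The tail calls need no next_value_of_ variables,
since they read only own variables and write only tscc variables.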

<p>
For model semi predicates,
the succeeded (own) variable
and the tscc_output_succeeded variable corresponding to it
effectively function as an output argument returned by value.
By its position in the vector of output arguments,
it is effectively output argument 0.

<p>
The final detail is the treatment of output arguments
that are passed by reference.
Our chosen approach is designed to work even in cases where
the procedures of the TSCC,
although they return the same vector of outputs,
return different subsets of them by reference.

<p>
The basic idea is to generate the wrapped procedures
as if all the output arguments were returned by value,
exactly as shown in the example above,
and to handle the difference at the container function level.

<p>
The parts of a container function
corresponding to an output argument returned by value are the following.

<ul>
<li>
The type of the output argument is part of the vector of return types
in the function signature.
<li>
The tscc_output_M_VarName variable of the output is declared
at the start of the container function.
<li>
The epilogue of each wrapped procedure
assigns to the tscc_output_M_VarName variable of the output.
<li>
The epilogue of the container function
returns the value of this tscc_output_M_VarName variable
as part of the vector of return values.
</ul>

<p>
When an output argument is passed by reference,
we create a tscc_output_ptr_M_VarName variable for it
as well as a tscc_output_M_VarName variable.
The parts of a container function
corresponding to such an output argument are the following.

<ul>
<li>
The type of the output argument
is <em>not</em> part of the vector of return types in the function signature.
Instead, the argument vector contains tscc_output_ptr_M_VarName,
and its type is a pointer to the type of the output argument.
The position of the tscc_output_ptr_M_VarName is derived from
the position of the argument in the HLDS.
<li>
The tscc_output_M_VarName variable of the output is declared
at the start of the container function, as before.
<li>
The epilogue of each wrapped procedure
assigns to the tscc_output_M_VarName variable of the output,
as before.
<li>
The epilogue of the container function
returns the value of this tscc_output_M_VarName variable,
but <em>not</em> as part of the vector of return values.
Instead, just before the return statement,
we include the assignment *tscc_output_ptr_M_VarName = tscc_output_M_VarName.
</ul>
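<p>
As a sketch of the by-reference case,
consider a hypothetical TSCC of two det procedures,
alt_a (id 1) and alt_b (id 2), each with inputs N and Acc and output Out.
The procedures and their bodies are invented for illustration.
The container function for alt_b below
assumes that alt_b's output is passed by reference;
the wrapped bodies are written exactly as they would be
if the output were returned by value.

```c
/* Hypothetical illustration; not actual Mercury compiler output. */

/* Container function for alt_b (in-TSCC id 2), output passed by reference. */
void alt_b(long tscc_proc_2_input_1_N, long tscc_proc_2_input_2_Acc,
  long *tscc_output_ptr_1_Out)
{
  long tscc_proc_1_input_1_N;
  long tscc_proc_1_input_2_Acc;
  /* still declared here, exactly as in the by-value case */
  long tscc_output_1_Out;

  goto top_of_proc_2;
top_of_proc_1:
  {
    long N = tscc_proc_1_input_1_N;
    long Acc = tscc_proc_1_input_2_Acc;
    long Out;

    if (N == 0)
      Out = Acc;
    else
    {
      /* tail call alt_b(N - 1, Acc + N) */
      tscc_proc_2_input_1_N = N - 1;
      tscc_proc_2_input_2_Acc = Acc + N;
      goto top_of_proc_2;
    }

    tscc_output_1_Out = Out;
    goto tscc_end;
  }
top_of_proc_2:
  {
    long N = tscc_proc_2_input_1_N;
    long Acc = tscc_proc_2_input_2_Acc;
    long Out;

    if (N == 0)
      Out = Acc;
    else
    {
      /* tail call alt_a(N - 1, Acc - N) */
      tscc_proc_1_input_1_N = N - 1;
      tscc_proc_1_input_2_Acc = Acc - N;
      goto top_of_proc_1;
    }

    tscc_output_1_Out = Out;
    goto tscc_end;
  }
tscc_end:
  /* the only place where the by-reference convention is visible */
  *tscc_output_ptr_1_Out = tscc_output_1_Out;
  return;
}
```

<p>
A container function for alt_a that returns Out by value
could reuse the same two wrapped bodies unchanged;
only its signature, its initial goto and its epilogue would differ.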

</body>
</html>