mercury/compiler/string_encoding.m
Zoltan Somogyi 9dbee8bdb4 Implement trie string switches for the LLDS backend.
For now, the implementation covers only non-lookup switches.

compiler/builtin_ops.m:
    Generalize the existing offset_str_eq binary op by adding an optional
    size parameter, which, if present, restricts the equality test to look at
    the given number of code units at most.

compiler/llds_out_data.m:
compiler/mlds_to_c_data.m:
    Generalize the output of binop rvals whose operation is offset_str_eq.
    In llds_out_data.m, fix a bug in the original code. (This bug did not
    lead to problems because before this diff, we never generated this op.)

compiler/string_switch_util.m:
    Add a predicate that recognizes when a trie node that is NOT a leaf
    nevertheless represents the top of a stick, which means that it has
    only one possible next code unit, which itself may have only one
    possible next code unit, and so on, until we reach a node that *does*
    have two or more next code units. (One of those may be the code unit
    of the string-ending NUL character.)

compiler/ml_string_switch.m:
    Use the new predicate in string_switch_util.m to generate better code
    for sticks. Instead of comparing each character in the stick individually
    against the relevant code unit of the string being switched on, compare
    them all at once using the new binary op.
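
As a rough illustration of the stick optimization described above, the following C sketch contrasts a per-code-unit comparison with a single bounded comparison at an offset. The function names are hypothetical and this is not the macro added to runtime/mercury_string.h; it only assumes that the string being switched on is NUL-terminated, that `offset` does not exceed its length, and that the stick contains at least `len` non-NUL code units.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: compare the stick one code unit at a time.
 * A NUL in `s` mismatches the (non-NUL) stick unit at that position,
 * so the loop returns before reading past the end of `s`.
 */
bool
stick_matches_one_by_one(const char *s, size_t offset,
    const char *stick, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (s[offset + i] != stick[i]) {
            return false;
        }
    }
    return true;
}

/*
 * Hypothetical helper: compare the whole stick in one bounded test.
 * strncmp looks at no more than `len` code units and stops at a NUL,
 * so a string that ends inside the stick simply compares unequal.
 */
bool
stick_matches(const char *s, size_t offset, const char *stick, size_t len)
{
    return strncmp(s + offset, stick, len) == 0;
}
```

Both helpers give the same answer under the stated assumptions; the second one replaces `len` separate tests and branches with one call, which is the kind of saving the size-limited offset_str_eq op is meant to enable.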

compiler/ml_switch_gen.m:
    Insist on both the host machine and the target machine
    using the C backend.

compiler/string_switch.m:
    Implement non-lookup trie switches. The code follows the approach used
    in ml_string_switch.m as much as possible, but there are plenty of
    differences caused by targeting the LLDS.

    Rename some predicates to specify which switch implementation method
    they belong to.

    Write a comment just once, and refer to it from elsewhere instead of
    duplicating it at each reference site.

compiler/switch_gen.m:
    Enable the use of trie switches when the option values call for it,
    and when the switch is not a lookup switch.

compiler/cse_detection.m:
    Do not flood the output of mmc -V with messages that have nothing to do
    with the module being compiled.

compiler/options.m:
    Add a way to specify --no-allow-inlining on the command line.
    This can help debug code generator changes like this, by disallowing
    a transform that can modify the Mercury code whose compilation process
    you are trying to debug. (The documentation of the --inlining option
    implies that --no-inlining should do the same job, but it does not.)
    The option is not documented for users.

compiler/string_encoding.m:
    Provide a version of from_code_unit_list_in_encoding that allows
    non-well-formed code unit sequences as input, and provide det versions
    of both versions. This is for use by both string_switch.m and
    ml_string_switch.m.

compiler/hlds_goal.m:
    Document the properties of case_ids.

compiler/llds.m:
    Document the possibility that string constants are not well formed.

compiler/bytecode.m:
compiler/code_util.m:
compiler/mlds_dump.m:
compiler/ml_global_data.m:
compiler/mlds_to_cs_data.m:
compiler/mlds_to_java_data.m:
compiler/opt_debug.m:
    Conform to the changes above.

library/string.m:
    Replace the non-exported test predicate internal_encoding_is_utf8 with
    an exported function that returns an enum specifying the string encoding.

NEWS.md:
    Announce the new function.

runtime/mercury_string.h:
    Add the C macro that implements the new form of the offset_str_eq
    binary op.

tests/hard_coded/string_switch4.{m,exp}:
    We have long had three copies of the exact same code, in string_switch.m,
    string_switch2.m and string_switch3.m, which were compiled with

    - no smart switch implementation
    - smart switch implementation forced to use the hash table method
    - smart switch implementation forced to use the binary search method

    Add this new copy, which is compiled with

    - smart switch implementation forced to use the new trie method

tests/hard_coded/Mmakefile:
    Add the new test case.

tests/hard_coded/Mercury.options:
    Update the options of the test cases, and specify them for the new one.

tests/hard_coded/string_switch.m:
tests/hard_coded/string_switch2.m:
tests/hard_coded/string_switch3.m:
    Update the top-of-module comment block to be identical in all four copies
    of this module.
2024-03-26 21:17:31 +11:00


%----------------------------------------------------------------------------%
% vim: ft=mercury ts=4 sw=4 et
%----------------------------------------------------------------------------%
% Copyright (C) 2015, 2024 The Mercury team.
% This file may only be copied under the terms of the GNU General
% Public License - see the file COPYING in the Mercury distribution.
%----------------------------------------------------------------------------%

:- module backend_libs.string_encoding.
:- interface.

:- import_module libs.
:- import_module libs.globals.

:- import_module list.
:- import_module string.

    % target_char_range(Target, Min, Max):
    %
    % Return the smallest and largest integers that represent
    % valid code points in the encoding we use on the given target platform.
    %
:- pred target_char_range(compilation_target::in, int::out, int::out) is det.

    % Return the string_encoding we use on the given target platform.
    %
:- func target_string_encoding(compilation_target) = string_encoding.

    % Convert a string to the list of its code units in the given encoding.
    %
:- pred to_code_unit_list_in_encoding(string_encoding::in, string::in,
    list(int)::out) is det.

    % Convert a list of code units in the given encoding to a string.
    % Fails if the list does not follow the rules of the encoding.
    %
:- pred from_code_unit_list_in_encoding(string_encoding::in, list(int)::in,
    string::out) is semidet.
:- pred det_from_code_unit_list_in_encoding(string_encoding::in, list(int)::in,
    string::out) is det.

    % Convert a list of code units in the given encoding to a string.
    % Allows ill-formed sequences, and will succeed *unless* the given list
    % includes a zero, signifying a null character.
    %
    % At the moment, it works only when the encoding specified by the first
    % argument is utf8, *and* the compiler's own encoding is utf8.
    % If either encoding is utf16, it will throw an exception.
    %
:- pred from_code_unit_list_in_encoding_allow_ill_formed(string_encoding::in,
    list(int)::in, string::out) is semidet.
:- pred det_from_code_unit_list_in_encoding_allow_ill_formed(
    string_encoding::in, list(int)::in, string::out) is det.

%----------------------------------------------------------------------------%
%----------------------------------------------------------------------------%

:- implementation.

:- import_module require.

target_char_range(_Target, Min, Max) :-
    % The range of `char' is the same for all existing targets.
    Min = 0,
    Max = 0x10ffff.

target_string_encoding(Target) = Encoding :-
    (
        Target = target_c,
        Encoding = utf8
    ;
        ( Target = target_java
        ; Target = target_csharp
        ),
        Encoding = utf16
    ).

to_code_unit_list_in_encoding(Encoding, String, CodeUnits) :-
    require_complete_switch [Encoding]
    (
        Encoding = utf8,
        string.to_utf8_code_unit_list(String, CodeUnits)
    ;
        Encoding = utf16,
        string.to_utf16_code_unit_list(String, CodeUnits)
    ).

from_code_unit_list_in_encoding(Encoding, CodeUnits, String) :-
    require_complete_switch [Encoding]
    (
        Encoding = utf8,
        string.from_utf8_code_unit_list(CodeUnits, String)
    ;
        Encoding = utf16,
        string.from_utf16_code_unit_list(CodeUnits, String)
    ).

det_from_code_unit_list_in_encoding(Encoding, CodeUnits, String) :-
    ( if from_code_unit_list_in_encoding(Encoding, CodeUnits, StringPrime) then
        String = StringPrime
    else
        unexpected($pred, "from_code_unit_list_in_encoding failed")
    ).

from_code_unit_list_in_encoding_allow_ill_formed(Encoding, CodeUnits, String) :-
    require_complete_switch [Encoding]
    (
        Encoding = utf8,
        InternalEncoding = internal_string_encoding,
        (
            InternalEncoding = utf8,
            string.from_code_unit_list_allow_ill_formed(CodeUnits, String)
        ;
            InternalEncoding = utf16,
            unexpected($pred, "implementing on utf16 is nyi")
        )
    ;
        Encoding = utf16,
        unexpected($pred, "utf16 is nyi")
    ).

det_from_code_unit_list_in_encoding_allow_ill_formed(Encoding,
        CodeUnits, String) :-
    ( if
        from_code_unit_list_in_encoding_allow_ill_formed(Encoding,
            CodeUnits, StringPrime)
    then
        String = StringPrime
    else
        unexpected($pred,
            "from_code_unit_list_in_encoding_allow_ill_formed failed")
    ).

%----------------------------------------------------------------------------%
:- end_module backend_libs.string_encoding.
%----------------------------------------------------------------------------%