467 Commits

Author SHA1 Message Date
Zoltan Somogyi
040d6717a6 Fix comments. 2021-01-26 23:22:54 +11:00
Zoltan Somogyi
9c248726a6 Add uint{64,}_to_lc_hex_string.
library/string.m:
    We have long had uint_to_hex_string and uint_to_uc_hex_string. Add
    uint_to_lc_hex_string as well, and make uint_to_hex_string call it.
    This way, users don't have to remember which of the upper and lower
    case versions is defined, and which is missing.

    Do the same for the 64 bit version.

NEWS:
    Announce the new functions.

library/string.format.m:
    Call the new functions.
2021-01-22 17:18:57 +11:00
Julien Fischer
52b31f5089 Add uint64 to string conversion for bases 8 and 16.
library/string.m:
     Add functions for converting uint64s to strings of base 8 or base 16
     digits. For most integer types we can cast to a uint and then use the
     uint versions of these operations but for 64-bit types we cannot since
     on some of our supported platforms uints are 32-bit.

NEWS:
     Announce the additions.

tests/hard_coded/Mmakefile:
tests/hard_coded/uint64_string_conv.{m,exp}:
     Add a test of the new functions.
2020-12-15 22:45:31 +11:00
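
The reason the 64-bit case above needs its own functions can be sketched in a few lines. This is an illustrative Python sketch, not Mercury library code: the function names and the 32-bit mask are assumptions that stand in for a platform where a uint is 32 bits wide.

```python
UINT32_MASK = 0xFFFFFFFF          # assumed width of a uint on a 32-bit platform
UINT64_MAX = 0xFFFFFFFFFFFFFFFF


def hex_via_32bit_uint(value: int) -> str:
    """Cast to a 32-bit uint first, then format; the high bits are lost."""
    return format(value & UINT32_MASK, "x")


def uint64_to_hex_string(value: int) -> str:
    """Format the full 64-bit value directly (hypothetical name)."""
    if not 0 <= value <= UINT64_MAX:
        raise ValueError("not a uint64 value")
    return format(value, "x")


def uint64_to_octal_string(value: int) -> str:
    """Base 8 counterpart of uint64_to_hex_string (hypothetical name)."""
    if not 0 <= value <= UINT64_MAX:
        raise ValueError("not a uint64 value")
    return format(value, "o")
```

For a value such as 0x1000000FF, the cast-first route prints only the low 32 bits, while the direct route keeps every digit.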
Julien Fischer
f8e65add3a Format uints directly.
Currently, the Mercury implementation of string formatting handles uints by
casting them to ints and then using the code for formatting signed integers as
unsigned values.  Add an implementation that works directly on uints and make
the code that formats signed integers as unsigned integers use that instead.
The new implementation is simpler and avoids unnecessary conversions to
arbitrary precision integers.

Add new functions for converting uint values directly to octal and hexadecimal
strings that use functionality provided by the underlying platforms; replace
the Mercury code that previously did that with calls to these new functions.

library/string.m:
    Add the functions uint_to_hex_string/1, uint_to_uc_hex_string/1 and
    uint_to_octal_string/1.

library/string.format.m:
    Make format_uint/6 operate directly on uints instead of casting the value
    to a signed int and calling format_unsigned_int/6.

    Make format_unsigned_int/6 cast the int value to a uint and then call
    format_uint/6.

    Delete predicates and functions used to convert ints to octal and
    hexadecimal strings.  We now just use the functions exported by
    the string module.

NEWS:
    Announce the additions to the string module.

tests/hard_coded/Mmakefile:
tests/hard_coded/uint_string_conv.{m,exp*}:
     Add a test of uint string conversion.
2020-11-20 23:07:52 +11:00
Julien Fischer
8f35be65f5 Delete default Mercury clauses previously used for the Erlang backend.
library/string.m:
    As above.
2020-11-14 14:39:08 +11:00
Zoltan Somogyi
d4861d739d Allow formatting of sized integers.
library/string.m:
    Add {i,u}{8,16,32,64} as function symbols in the poly_type type,
    each with a single argument containing an integer with the named
    signedness and size.

    The idea is that each of these poly_type values works exactly
    the same way as the i(_) poly_type (if signed) or the u(_) poly_type
    (if unsigned), with the exception that the value specified by the call
    is cast to int or uint before being processed.

library/string.parse_runtime.m:
    Parse the new kinds of poly_types. Change the representation of the result
    of the parsing to allow recording of the sizes of ints and uints.

    Put the code that does the parsing into a predicate of its own.

library/string.format.m:
    Do a cast to int or uint if the size information recorded in the
    specification of a signed or unsigned integer value calls for it.

    Provide functions to do the casting that do not require the import
    of {int,uint}{8,16,32,64}.m. This is to allow the compiler to generate
    calls to do such casts without having to implicitly import those modules.

    Abort if a 64 bit number is being cast to a 32 bit word.

compiler/parse_string_format.m:
    Make the same changes as in string.parse_runtime.m, mutatis mutandis.

compiler/format_call.m:
    Handle the new kinds of poly_types by adding a cast to int or uint
    if necessary, using the predicates added to library/string.format.m.

    Use a convenience function to make code creating instmap deltas
    more readable.

library/io.m:
library/pprint.m:
library/string.parse_util.m:
tests/invalid/string_format_bad.m:
tests/invalid/string_format_unknown.m:
    Conform to the changes above.

tests/string_format/string_format_d.m:
tests/string_format/string_format_u.m:
    Test the printing of some of the new poly_types.

tests/string_format/string_format_d.exp2:
tests/string_format/string_format_u.exp2:
    Update the expected output of these tests on 64-bit platforms.

tests/string_format/string_format_lib.m:
    Update programming style.
2020-11-10 11:00:47 +11:00
Peter Wang
0d3fcbaae3 Delete Erlang code from library/mdbcomp/browser directories.
library/*.m:
    Delete Erlang foreign code and foreign types.

    Delete documentation specific to Erlang targets.

library/deconstruct.m:
    Add pragma no_determinism_warning to allow functor_number_cc/3
    to compile for now.

library/Mercury.options:
    Delete workaround only needed when targeting Erlang.

browser/listing.m:
mdbcomp/rtti_access.m:
    Delete Erlang foreign code and foreign types.
2020-10-28 14:10:56 +11:00
Zoltan Somogyi
a36eed702d Add add_suffix to the standard library.
compiler/write_deps_file.m:
library/string.m:
    Move a generally-useful function to the library.

NEWS:
    Announce the addition.
2020-10-19 15:52:47 +11:00
Julien Fischer
9528f326d2 Formatting of uints using string.format etc.
Extend the operations that perform formatted conversion, such as
string.format/2, to be able to handle values of type uint directly. We have
always supported formatting values of type int as unsigned values, but
currently the only way to format uint values is by explicitly casting them to
an int. This addresses Mantis issue #502.

library/string.m:
    Add a new alternative to the poly_type/0 type that wraps uint
    values.

    Update the documentation for string.format. uint values may
    now be formatted using the u, x, X, o or p conversion specifiers.

library/string.format.m:
   Add the necessary machinery for handling formatting of uint values.

library/string.parse_runtime.m:
library/string.parse_util.m:
   Handle uint poly_types.

library/io.m:
   Handle uint values in the write_many predicates.

library/pprint.m:
   Handle uint values in the poly/1 function.

compiler/format_call.m:
compiler/parse_string_format.m:
    Conform to the above changes.

compiler/options.m:
    Add a way to detect if a compiler supports this change.

NEWS:
    Announce the above changes.

tests/hard_coded/stream_format.{m,exp}:
    Extend this test to cover uints.

tests/invalid/string_format_bad.m:
tests/invalid/string_format_unknown.m:
    Conform to the above changes.

tests/string_format/Mmakefile:
tests/string_format/string_format_uint_o.{m,exp,exp2}:
tests/string_format/string_format_uint_u.{m,exp,exp2}:
tests/string_format/string_format_uint_x.{m,exp,exp2}:
   Add tests of string.format with uints.
2020-05-23 14:01:01 +10:00
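
As a rough analogue of the conversion specifiers named above, the sketch below uses Python's printf-style `%` operator. It is not Mercury's string.format: the wrapper name `format_uint` is hypothetical, and Python's operator has no `p` conversion, so only u, x, X and o are shown.

```python
def format_uint(spec: str, value: int) -> str:
    """Apply a printf-style conversion to a non-negative integer."""
    if value < 0:
        # A uint cannot be negative, so reject the value rather than
        # reinterpret its bit pattern.
        raise ValueError("expected an unsigned value")
    return spec % value


if __name__ == "__main__":
    for spec in ("%u", "%x", "%X", "%o", "%08x"):
        print(spec, "->", format_uint(spec, 255))
```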
Zoltan Somogyi
a6228a9e1a Fix too-long lines. 2020-04-10 03:22:40 +10:00
Zoltan Somogyi
a2bdcece54 Improve English in some comments. 2020-04-07 22:24:00 +10:00
Peter Wang
ff0c363ea4 Define int to string conversions more precisely.
library/string.m:
    As above.
2020-01-21 16:19:27 +11:00
Peter Wang
7d52b9f593 Announce recent changes to string type and string module.
NEWS:
    Announce changes regarding ill-formed code unit sequences in
    strings.

library/string.m:
    Delete a note about ongoing work.
2019-11-19 14:23:15 +11:00
Peter Wang
78da14c581 Add string indexing predicates that indicate a code unit was replaced.
library/string.m:
    Add index_next_repl, unsafe_index_next_repl, prev_index_repl,
    unsafe_prev_index_repl predicates that return an indication if a
    replacement character was returned because an ill-formed code unit
    sequence was encountered.

    Add more pragma inlines for indexing predicates.

    Remove may_not_duplicate attribute on the Erlang version of
    unsafe_prev_index_repl, which would conflict with the pragma inline
    declaration. This requires the helper function do_unsafe_prev_index
    to be exported.

tests/hard_coded/string_append_ooi_ilseq.m:
tests/hard_coded/string_set_char_ilseq.m:
    Use index_next_repl in test cases.

NEWS:
    Announce additions.
2019-11-19 14:23:15 +11:00
Peter Wang
9a042f4fb1 Minor documentation changes.
library/string.m:
    Add missing word.

    Just write "code points" instead of "character" followed by
    clarification in a few spots.

    Delete _underscores_ which aren't particularly helpful.
2019-11-14 15:45:40 +11:00
Peter Wang
7ef407e937 Enable pragma obsolete_proc declarations.
library/string.m:
    Enable pragma obsolete_proc declarations since we now require a
    recent enough compiler version.
2019-11-14 11:28:25 +11:00
Peter Wang
5c3b392ed0 Implement string.(un)capitalize_first more efficiently.
library/string.m:
    Avoid creating temporary string in capitalize_first and
    uncapitalize_first.
2019-11-12 17:16:50 +11:00
Peter Wang
f71b5f20ed Define behaviour of string.set_char etc on ill-formed sequences.
library/string.m:
    Define behaviour of set_char, det_set_char and unsafe_set_char on
    ill-formed sequences. Also define them to throw an exception on an
    attempt to set a null character or surrogate code point in a UTF-8
    string.

    Delete claim that unsafe_set_char is constant time. That would only
    be true for the destructive mode of unsafe_set_char, and that mode
    has been disabled for a long time.

    Implement the defined behaviour for C and C# versions of
    unsafe_set_char. The Java version already behaved as defined.

    Use unsafe_set_char to implement set_char instead of duplicating
    foreign code.

    Replace a couple of uses of strcpy with MR_memcpy as it was
    convenient to do so. (On OpenBSD, the linker issues a warning
    whenever strcpy is used. Avoiding the warning is not high priority
    but we might still like to eliminate all uses of strcpy eventually.)

tests/hard_coded/Mmakefile:
tests/hard_coded/string_set_char_ilseq.exp:
tests/hard_coded/string_set_char_ilseq.exp2:
tests/hard_coded/string_set_char_ilseq.m:
    Add test case.
2019-11-12 17:16:34 +11:00
Peter Wang
ae2dda693e Avoid range checks in string.split_at_separator.
library/string.m:
    Avoid unnecessary range checks in split_at_separator.
2019-11-08 14:25:23 +11:00
Peter Wang
b68548d4dc Avoid garbage in Mercury versions of string.append_list/join_list.
library/string.m:
    Use unsafe_append_string_pieces in Mercury implementations of
    append_list and join_list. This has no practical effect as we have
    foreign code implementations of both, for all target languages.
2019-11-08 14:25:23 +11:00
Peter Wang
68ae33c426 Avoid intermediate strings in string.replace_all.
library/string.m:
    Implement string.replace_all using unsafe_append_string_pieces to
    avoid intermediate strings. Use unsafe_sub_string_search_start to
    avoid repeated range checks.
2019-11-08 14:25:23 +11:00
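
The "string pieces" approach used by the commit above can be sketched as follows. This is an illustrative Python model rather than the Mercury implementation: a piece is assumed to be either a whole string or a (base, start, end) triple, the range check is an assumption about the safe variant, and the empty-pattern case is left out.

```python
from typing import List, Tuple, Union

# A piece is a whole string or a (base, start, end) slice description.
Piece = Union[str, Tuple[str, int, int]]


def append_string_pieces(pieces: List[Piece]) -> str:
    """Concatenate all pieces, building the result string once."""
    parts = []
    for piece in pieces:
        if isinstance(piece, str):
            parts.append(piece)
        else:
            base, start, end = piece
            if not 0 <= start <= end <= len(base):
                raise IndexError("substring offsets out of range")
            parts.append(base[start:end])
    return "".join(parts)


def replace_all(text: str, pattern: str, replacement: str) -> str:
    """Replace every occurrence of a non-empty pattern using pieces."""
    if pattern == "":
        raise ValueError("empty pattern is not handled in this sketch")
    pieces: List[Piece] = []
    pos = 0
    while True:
        hit = text.find(pattern, pos)
        if hit < 0:
            break
        pieces.append((text, pos, hit))   # unchanged text before the match
        pieces.append(replacement)
        pos = hit + len(pattern)
    pieces.append((text, pos, len(text)))  # unchanged tail
    return append_string_pieces(pieces)
```

Python slices still copy, so the sketch shows only the shape of the technique: the search loop records where the unchanged runs are, and one final step assembles the result.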
Peter Wang
3daee4fc23 Avoid intermediate strings in string.replace.
library/string.m:
    Implement string.replace using unsafe_append_string_pieces.
2019-11-08 14:25:23 +11:00
Peter Wang
7eb78c66d1 Add string.unsafe_sub_string_search_start.
library/string.m:
    Add unsafe_sub_string_search_start/4.

NEWS:
    Announce addition.
2019-11-08 14:25:23 +11:00
Peter Wang
3621cfa650 Delete deprecated substring predicates and functions.
library/string.m:
    Delete long-deprecated substring/3 function and substring/4 predicate.
    The newly introduced `string_piece' type has a substring/3 data
    constructor which takes (start, end) offsets into the base string,
    whereas the function and predicate take (start, count) arguments.
    To reduce potential confusion, delete the deprecated function and
    predicate.

    Delete other deprecated substring predicates and functions as well.

tests/general/Mercury.options:
tests/general/string_foldl_substring.exp:
tests/general/string_foldl_substring.m:
tests/general/string_foldr_substring.exp:
tests/general/string_foldr_substring.m:
tests/hard_coded/Mercury.options:
tests/hard_coded/string_substring.m:
    Delete tests for deprecated predicates.

tests/tabling/mercury_java_parser_dead_proc_elim_bug.m:
tests/tabling/mercury_java_parser_dead_proc_elim_bug2.m:
tests/valid/mercury_java_parser_follow_code_bug.m:
    Replace calls to unsafe_substring with unsafe_between.

NEWS:
    Announce the changes.
2019-11-08 14:25:23 +11:00
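
The confusion this commit guards against comes from two conventions that take the same three arguments. A small Python sketch, with hypothetical function names and code point offsets standing in for Mercury's code unit offsets:

```python
def between(text: str, start: int, end: int) -> str:
    """(start, end) convention: take the characters in [start, end)."""
    return text[start:end]


def substring_by_count(text: str, start: int, count: int) -> str:
    """(start, count) convention: take `count` characters from `start`."""
    return text[start:start + count]


if __name__ == "__main__":
    # The same arguments select different text under each convention.
    print(between("hello world", 6, 8))
    print(substring_by_count("hello world", 6, 8))
```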
Peter Wang
96b2caf536 Add string.unsafe_append_string_pieces.
library/string.m:
    Add unsafe_append_string_pieces/2 predicate.

NEWS:
    Announce addition.
2019-11-08 14:23:06 +11:00
Peter Wang
f2e0998651 Add string.append_string_pieces.
library/string.m:
    Add append_string_pieces/2 predicate.

library/io.m:
    Add a comment about a potential future change.

tests/hard_coded/Mmakefile:
tests/hard_coded/string_append_pieces.exp:
tests/hard_coded/string_append_pieces.m:
    Add test case.

NEWS:
    Announce addition.
2019-11-08 14:23:06 +11:00
Peter Wang
d2c3ede17d Make string.replace_all with empty pattern preserve ill-formed sequences.
library/string.m:
    Define behaviour of string.replace_all on ill-formed code unit
    sequences when the pattern is empty.

    Implement that behaviour.

    Use better variable names in documentation of string.replace and
    string.replace_all.

tests/general/string_replace.exp:
tests/general/string_replace.exp2:
tests/general/string_replace.m:
    Extend test case.

    Update code style.
2019-11-08 13:57:38 +11:00
Peter Wang
0a1f289b6d Make generic versions of string.to_upper/lower preserve ill-formed sequences.
library/string.m:
    Make generic implementations of string.to_upper and string.to_lower
    preserve ill-formed sequences. (The foreign language implementations
    already did so.)
2019-11-06 13:43:54 +11:00
Peter Wang
031b6d915d Document that string.count_utf8_code_units throws exceptions.
library/string.m:
    Document that count_utf8_code_units throws an exception if the
    string contains an unpaired surrogate code point.

    Make the exception message thrown more useful to callers.

    Delete unnecessary foreign_procs.
2019-11-06 13:43:54 +11:00
Peter Wang
2e5f6ddef9 Make string.to_utf16_code_unit_list throw exception for ill-formed UTF-8.
library/string.m:
    As above.
2019-11-06 13:43:54 +11:00
Peter Wang
67234fc898 Document that string.to_utf8_code_unit_list throws exceptions.
library/string.m:
    Document that string.to_utf8_code_unit_list throws an exception
    if the string contains an unpaired surrogate code point.
2019-11-06 13:43:54 +11:00
Peter Wang
1e85dcb99e Add string.from_code_unit_list_allow_ill_formed.
library/string.m:
    Add string.from_code_unit_list_allow_ill_formed/2.

tests/hard_coded/string_from_code_unit_list.exp:
tests/hard_coded/string_from_code_unit_list.exp2:
tests/hard_coded/string_from_code_unit_list.m:
    Extend test case.

NEWS:
    Announce addition.
2019-11-06 13:43:54 +11:00
Peter Wang
adbf4c51c8 Tighten up string.from_code_unit_list et al.
library/string.m:
    Document that from_code_unit_list fails if the result string would
    contain a null character, and enforce that in the Java and C#
    implementations. It was already enforced in the C implementation.

    Make from_code_unit_list fail if the code unit list contains an
    invalid value (negative or >0xff or >0xffff).

    Document that from_utf{8,16}_code_unit_list fails if the result
    string would contain a null character.

    Make from_utf8_code_unit_list call semidet_from_rev_char_list rather
    than from_rev_char_list so that it fails as documented instead of
    throwing an exception if the code unit list correctly encodes a list
    of code points, but the code points cannot be encoded into a string.

    Similarly for from_utf16_code_unit_list.

tests/hard_coded/Mmakefile:
tests/hard_coded/string_from_code_unit_list.exp:
tests/hard_coded/string_from_code_unit_list.exp2:
tests/hard_coded/string_from_code_unit_list.m:
    Add test case.
2019-11-06 13:43:54 +11:00
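
The failure conditions described above can be modelled in Python as a function that returns None where the Mercury predicate would fail. This is a sketch under assumptions: the name mirrors from_utf8_code_unit_list only for readability, and rejecting ill-formed UTF-8 via strict decoding is this sketch's choice, not a statement about the library's implementation.

```python
from typing import List, Optional


def from_utf8_code_unit_list(units: List[int]) -> Optional[str]:
    """Build a string from UTF-8 code units, or return None on failure."""
    # Fail on any value that is not a valid 8-bit code unit.
    if any(not 0 <= unit <= 0xFF for unit in units):
        return None
    try:
        text = bytes(units).decode("utf-8")  # strict: rejects ill-formed input
    except UnicodeDecodeError:
        return None
    # Fail if the result would contain a null character.
    if "\x00" in text:
        return None
    return text
```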
Peter Wang
0c6778c89f Simplify Erlang implementation of sub_string_search_start.
library/string.m:
    As above. (Not that simple in the end.)
2019-10-31 17:20:09 +11:00
Peter Wang
c4fcbdaea3 Make generic version of string.sub_string_search_start more efficient.
library/string.m:
    Use unsafe_compare_substrings in generic version of
    sub_string_search_start.
2019-10-31 17:20:09 +11:00
Peter Wang
91868fe7ef Define string.sub_string_search_start for out-of-range starting offset.
library/string.m:
    Define sub_string_search_start to fail if the BeginAt parameter is
    negative or past the end of the string to search. The original C
    implementation did not check for an out-of-range starting offset,
    and could crash the program. The C implementation was later amended
    to fail instead, but not other implementations.

    Check for negative starting offset in non-C implementations of
    sub_string_search_start.

tests/hard_coded/string_sub_string_search.m:
    Extend test case.
2019-10-31 17:20:09 +11:00
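
A Python sketch of the range check defined above, returning None where the Mercury predicate would fail. It assumes that a starting offset equal to the string length is still in range, which the commit message does not spell out. The explicit check matters in Python too, because `str.find` silently treats a negative start as an offset from the end.

```python
from typing import Optional


def sub_string_search_start(text: str, pattern: str,
                            begin_at: int) -> Optional[int]:
    """Find `pattern` in `text` at or after `begin_at`, or return None."""
    # Fail for a negative offset or one past the end of the string.
    if begin_at < 0 or begin_at > len(text):
        return None
    index = text.find(pattern, begin_at)
    return index if index >= 0 else None


if __name__ == "__main__":
    print(sub_string_search_start("hello", "l", 3))
    print(sub_string_search_start("hello", "l", -2))  # rejected
    print("hello".find("l", -2))  # Python counts from the end instead
```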
Peter Wang
30d0933f59 Fix C# version of string.sub_string_search to be culture-insensitive.
library/string.m:
    Make C# implementation of sub_string_search perform ordinal
    (Unicode code point) based string search, instead of a
    culture-sensitive search.
2019-10-31 15:56:24 +11:00
Peter Wang
d40ab1ab44 Slightly improve string stripping functions.
library/string.m:
    Use unsafe_between for chomp, lstrip_pred, rstrip_pred
    to avoid range checks.
2019-10-30 16:51:00 +11:00
Peter Wang
09512195fc Make string.split_at_separator skip ill-formed sequences in UTF-8 strings.
library/string.m:
    Make split_at_separator never consider ill-formed sequences in UTF-8
    strings as potential separators, as they cannot contain any code
    points that could satify any given DelimP predicate on code points.
    Previously, split_at_separator would call DelimP(U+FFFD) for every
    code unit in an ill-formed sequence.
2019-10-30 16:51:00 +11:00
Peter Wang
1b91cf375c Make string.words_separator skip ill-formed sequences in UTF-8.
library/string.m:
    Make words_separator never consider ill-formed sequences in UTF-8
    strings as potential separators, as they cannot contain any code
    points that could satisfy any given SepP predicate on code points.
    Previously, words_separator would call SepP(U+FFFD) for every code
    unit in an ill-formed sequence.
2019-10-30 16:51:00 +11:00
Peter Wang
de2af8cdd7 Make string.all_match fail on UTF-8 string containing ill-formed sequence.
library/string.m:
    Make all_match(Pred, String) always fail if the string contains an
    ill-formed code unit sequence, and strings use UTF-8 encoding.
    Such sequences do not contain any code points that could satisfy a
    test on code points. Previously, all_match would call Pred(U+FFFD)
    for every code unit in an ill-formed sequence.

    Define all_match to rule out an interpretation that could ignore
    ill-formed sequences.
2019-10-30 16:51:00 +11:00
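
The difference between the old and new all_match behaviour can be illustrated on raw UTF-8 bytes in Python. This is a sketch, not the library code: strict decoding is used here as the assumed way to detect an ill-formed sequence, and `all_match_old` models the earlier behaviour by substituting U+FFFD.

```python
from typing import Callable


def all_match(pred: Callable[[str], bool], data: bytes) -> bool:
    """True only if `data` is well-formed UTF-8 and every code point passes."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # An ill-formed sequence contains no code point that could pass.
        return False
    return all(pred(ch) for ch in text)


def all_match_old(pred: Callable[[str], bool], data: bytes) -> bool:
    """Earlier behaviour: test U+FFFD in place of ill-formed code units."""
    text = data.decode("utf-8", errors="replace")
    return all(pred(ch) for ch in text)
```

With a predicate that accepts every code point, the old version succeeds on ill-formed input because U+FFFD passes the test, while the new one fails.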
Peter Wang
817cf44efd Make string.prefix_length/suffix_length stop at ill-formed sequence.
library/string.m:
    Make prefix_length and suffix_length stop at an ill-formed sequence
    in UTF-8 strings. Such a sequence does not contain any code point
    that could satisfy a test on code points. Previously, prefix_length
    and suffix_length would call Pred(U+FFFD) for every code unit
    in an ill-formed sequence.

    Tweak documentation.

    Delete obsolete comments.
2019-10-30 16:51:00 +11:00
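
A Python sketch of a prefix_length that stops at the first ill-formed sequence. It assumes the result is counted in UTF-8 code units, and it uses Python's `surrogateescape` error handler purely as a convenient marker for bytes that do not belong to a well-formed sequence.

```python
from typing import Callable


def prefix_length(pred: Callable[[str], bool], data: bytes) -> int:
    """Length in code units of the longest prefix whose code points pass."""
    length = 0
    for ch in data.decode("utf-8", errors="surrogateescape"):
        # Undecodable bytes arrive as U+DC80..U+DCFF: stop there
        # without ever calling the predicate.
        if 0xDC80 <= ord(ch) <= 0xDCFF:
            break
        if not pred(ch):
            break
        length += len(ch.encode("utf-8"))
    return length
```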
Peter Wang
265ffa15f0 Fix two bugs in string.contains_char.
library/string.m:
    Fix C implementation of contains_char to fail when asked to test for
    a surrogate code point in a string. It previously would (always)
    succeed, which is a bug.

    Fix generic implementation so that contains_char(String, '\uFFFD')
    will not succeed just because String contains an ill-formed sequence
    (in UTF-8 grades).

    Delete obsolete comment.
2019-10-30 16:51:00 +11:00
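
Both fixes above can be modelled in Python by searching for the encoded form of the character in the raw UTF-8 bytes. This is a sketch under assumptions, not the library's algorithm: a surrogate has no UTF-8 encoding, so the search fails immediately, and an ill-formed byte is never mistaken for an actual U+FFFD.

```python
def contains_char(data: bytes, ch: str) -> bool:
    """True if the UTF-8 bytes in `data` contain the code point `ch`."""
    if 0xD800 <= ord(ch) <= 0xDFFF:
        # A surrogate code point cannot occur in a UTF-8 string.
        return False
    # Match the encoded character itself, so an ill-formed byte does
    # not count as a replacement character.
    return ch.encode("utf-8") in data
```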
Peter Wang
6c0c337568 Add string indexing predicates that indicate if the char was replaced.
library/string.m:
    Add index_next_repl, unsafe_index_next_repl, prev_index_repl,
    unsafe_prev_index_repl predicates. These are internal for now,
    so we can try them out in the string module without committing
    to the interface.
2019-10-30 16:51:00 +11:00
Peter Wang
7da7c103df Improve definition of string.index, index_next, prev_index.
library/string.m:
    Fix definition of index/3 and index_next/4 to account for an offset
    into a non-initial code unit in a well-formed code unit sequence.

    Similarly for prev_index/4.
2019-10-30 16:51:00 +11:00
Peter Wang
9bee18553c Correct documentation for string.from_char_list. 2019-10-30 12:02:42 +11:00
Peter Wang
831003f042 Delete outdated todo. 2019-10-30 11:21:02 +11:00
Peter Wang
658c8a5ad5 Define behaviour of string.char_to_string on edge cases.
library/string.m:
    Define behaviour of char_to_string when the string is not
    well-formed or if the char is a surrogate code point.

    Implement char_to_string/2 using multiple clauses
    as the described behaviour doesn't match to_char_list/2.

tests/hard_coded/Mmakefile:
tests/hard_coded/char_to_string.exp:
tests/hard_coded/char_to_string.exp2:
tests/hard_coded/char_to_string.m:
    Add test case.
2019-10-30 11:21:02 +11:00
Peter Wang
56687d235e Define behaviour of string.first_char/3 on edge cases.
library/string.m:
    Define first_char/3 to fail if the input string begins with an
    ill-formed code unit sequence.

    Define the reverse mode to throw an exception on an attempt to
    encode a null character or surrogate code point in the output
    string.

    Reimplement first_char/3 in Mercury.

hard_coded/Mmakefile:
hard_coded/string_first_char_ilseq.exp:
hard_coded/string_first_char_ilseq.m:
    Add test case.
2019-10-30 11:21:02 +11:00
Peter Wang
025bee0549 Check for surrogates when converting list of char to string.
library/string.m:
    Make from_char_list, from_rev_char_list, to_char_list throw an
    exception if the list of chars includes a surrogate code point that
    cannot be encoded in a UTF-8 string.

    Make semidet_from_char_list, semidet_from_rev_char_list,
    to_char_list fail if the list of chars includes a surrogate code
    point that cannot be encoded in a UTF-8 string.

runtime/mercury_string.h:
    Document return value of MR_utf8_width.

tests/hard_coded/Mmakefile:
tests/hard_coded/string_from_char_list_ilseq.exp:
tests/hard_coded/string_from_char_list_ilseq.exp2:
tests/hard_coded/string_from_char_list_ilseq.m:
    Add test case.

tests/hard_coded/null_char.exp:
    Expect new message in exceptions thrown by from_char_list,
    from_rev_char_list.

tests/hard_coded/string_hash.m:
    Don't generate surrogate code points in random strings.
2019-10-30 11:21:02 +11:00
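
The two failure styles described above, throwing versus failing, can be sketched in Python. The names echo the Mercury predicates only for readability; the sketch raises ValueError where the library throws an exception and returns None where the semidet version fails. Rejecting null characters is taken from the neighbouring commits rather than from this one.

```python
from typing import List, Optional


def from_char_list(chars: List[str]) -> str:
    """Build a string, raising if a char cannot be encoded in UTF-8."""
    for ch in chars:
        if ch == "\x00":
            raise ValueError("null character in list")
        if 0xD800 <= ord(ch) <= 0xDFFF:
            raise ValueError("surrogate code point in list")
    return "".join(chars)


def semidet_from_char_list(chars: List[str]) -> Optional[str]:
    """As from_char_list, but return None instead of raising."""
    try:
        return from_char_list(chars)
    except ValueError:
        return None
```

Python's own encoder applies the same rule: encoding a lone surrogate such as "\ud800" as UTF-8 raises UnicodeEncodeError.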