494 Commits

Author SHA1 Message Date
Peter Wang
3621cfa650 Delete deprecated substring predicates and functions.
library/string.m:
    Delete long-deprecated substring/3 function and substring/4 predicate.
    The newly introduced `string_piece' type has a substring/3 data
    constructor which takes (start, end) offsets into the base string,
    whereas the function and predicate take (start, count) arguments.
    To reduce potential confusion, delete the deprecated function and
    predicate.

    Delete other deprecated substring predicates and functions as well.

tests/general/Mercury.options:
tests/general/string_foldl_substring.exp:
tests/general/string_foldl_substring.m:
tests/general/string_foldr_substring.exp:
tests/general/string_foldr_substring.m:
tests/hard_coded/Mercury.options:
tests/hard_coded/string_substring.m:
    Delete tests for deprecated predicates.

tests/tabling/mercury_java_parser_dead_proc_elim_bug.m:
tests/tabling/mercury_java_parser_dead_proc_elim_bug2.m:
tests/valid/mercury_java_parser_follow_code_bug.m:
    Replace calls to unsafe_substring with unsafe_between.

NEWS:
    Announce the changes.
2019-11-08 14:25:23 +11:00
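The two argument conventions mentioned in the commit above, (start, count) versus (start, end), can be sketched as follows. This is a Python illustration, not Mercury library code. Both function names are hypothetical, and the half-open reading of (start, end) is an assumption. It does not model how the Mercury predicates treat out-of-range offsets.

```python
def substring_by_count(s: str, start: int, count: int) -> str:
    """(start, count) convention: take `count` units beginning at `start`."""
    return s[start:start + count]


def substring_between(s: str, start: int, end: int) -> str:
    """(start, end) convention: take the units in the range [start, end)."""
    return s[start:end]


text = "hello world"

# The same numeric arguments select different text under each convention.
by_count = substring_by_count(text, 6, 3)    # three units starting at offset 6
between = substring_between(text, 6, 3)      # empty: the end precedes the start
equivalent = substring_between(text, 6, 9)   # same text as by_count
```

Passing (6, 3) to a routine that expects the other convention silently produces a different result, which is the kind of confusion the deletion is meant to avoid.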
Peter Wang
96b2caf536 Add string.unsafe_append_string_pieces.
library/string.m:
    Add unsafe_append_string_pieces/2 predicate.

NEWS:
    Announce addition.
2019-11-08 14:23:06 +11:00
Peter Wang
f2e0998651 Add string.append_string_pieces.
library/string.m:
    Add append_string_pieces/2 predicate.

library/io.m:
    Add a comment about a potential future change.

tests/hard_coded/Mmakefile:
tests/hard_coded/string_append_pieces.exp:
tests/hard_coded/string_append_pieces.m:
    Add test case.

NEWS:
    Announce addition.
2019-11-08 14:23:06 +11:00
Peter Wang
d2c3ede17d Make string.replace_all with empty pattern preserve ill-formed sequences.
library/string.m:
    Define behaviour of string.replace_all on ill-formed code unit
    sequences when the pattern is empty.

    Implement that behaviour.

    Use better variable names in documentation of string.replace and
    string.replace_all.

tests/general/string_replace.exp:
tests/general/string_replace.exp2:
tests/general/string_replace.m:
    Extend test case.

    Update code style.
2019-11-08 13:57:38 +11:00
Peter Wang
0a1f289b6d Make generic versions of string.to_upper/lower preserve ill-formed sequences.
library/string.m:
    Make generic implementations of string.to_upper and string.to_lower
    preserve ill-formed sequences. (The foreign language implementations
    already did so.)
2019-11-06 13:43:54 +11:00
Peter Wang
031b6d915d Document that string.count_utf8_code_units throws exceptions.
library/string.m:
    Document that count_utf8_code_units throws an exception if the
    string contains an unpaired surrogate code point.

    Make the exception message thrown more useful to callers.

    Delete unnecessary foreign_procs.
2019-11-06 13:43:54 +11:00
Peter Wang
2e5f6ddef9 Make string.to_utf16_code_unit_list throw exception for ill-formed UTF-8.
library/string.m:
    As above.
2019-11-06 13:43:54 +11:00
Peter Wang
67234fc898 Document that string.to_utf8_code_unit_list throws exceptions.
library/string.m:
    Document that string.to_utf8_code_unit_list throws an exception
    if the string contains an unpaired surrogate code point.
2019-11-06 13:43:54 +11:00
Peter Wang
1e85dcb99e Add string.from_code_unit_list_allow_ill_formed.
library/string.m:
    Add string.from_code_unit_list_allow_ill_formed/2.

tests/hard_coded/string_from_code_unit_list.exp:
tests/hard_coded/string_from_code_unit_list.exp2:
tests/hard_coded/string_from_code_unit_list.m:
    Extend test case.

NEWS:
    Announce addition.
2019-11-06 13:43:54 +11:00
Peter Wang
adbf4c51c8 Tighten up string.from_code_unit_list et al.
library/string.m:
    Document that from_code_unit_list fails if the result string would
    contain a null character, and enforce that in the Java and C#
    implementations. It was already enforced in the C implementation.

    Make from_code_unit_list fail if the code unit list contains an
    invalid value (negative or >0xff or >0xffff).

    Document that from_utf{8,16}_code_unit_list fails if the result
    string would contain a null character.

    Make from_utf8_code_unit_list call semidet_from_rev_char_list rather
    than from_rev_char_list so that it fails as documented instead of
    throwing an exception if the code unit list correctly encodes a list
    of code points, but the code points cannot be encoded into a string.

    Similarly for from_utf16_code_unit_list.

tests/hard_coded/Mmakefile:
tests/hard_coded/string_from_code_unit_list.exp:
tests/hard_coded/string_from_code_unit_list.exp2:
tests/hard_coded/string_from_code_unit_list.m:
    Add test case.
2019-11-06 13:43:54 +11:00
Peter Wang
0c6778c89f Simplify Erlang implementation of sub_string_search_start.
library/string.m:
    As above. (Not that simple in the end.)
2019-10-31 17:20:09 +11:00
Peter Wang
c4fcbdaea3 Make generic version of string.sub_string_search_start more efficient.
library/string.m:
    Use unsafe_compare_substrings in generic version of
    sub_string_search_start.
2019-10-31 17:20:09 +11:00
Peter Wang
91868fe7ef Define string.sub_string_search_start for out-of-range starting offset.
library/string.m:
    Define sub_string_search_start to fail if the BeginAt parameter is
    negative or past the end of the string to search. The original C
    implementation did not check for an out-of-range starting offset,
    and could crash the program. The C implementation was later amended
    to fail instead, but not other implementations.

    Check for negative starting offset in non-C implementations of
    sub_string_search_start.

tests/hard_coded/string_sub_string_search.m:
    Extend test case.
2019-10-31 17:20:09 +11:00
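A minimal sketch of the "fail on an out-of-range starting offset" behaviour described in the commit above, written in Python rather than Mercury. The function name `search_from` is hypothetical, failure is modelled as `None`, and accepting an offset equal to the string length is an assumption, since the commit message only says "past the end".

```python
from typing import Optional


def search_from(haystack: str, needle: str, begin_at: int) -> Optional[int]:
    """Return the index of the first match at or after begin_at, or None."""
    # Reject out-of-range offsets explicitly. Without this check, Python's
    # str.find would interpret a negative start as an offset from the end.
    # Treating begin_at == len(haystack) as in range is an assumption.
    if begin_at < 0 or begin_at > len(haystack):
        return None
    index = haystack.find(needle, begin_at)
    return None if index == -1 else index
```

The explicit check matters because the underlying search primitive may give a negative offset its own meaning, as `str.find` does here, instead of failing.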
Peter Wang
30d0933f59 Fix C# version of string.sub_string_search to be culture-insensitive.
library/string.m:
    Make C# implementation of sub_string_search perform ordinal
    (Unicode code point) based string search, instead of a
    culture-sensitive search.
2019-10-31 15:56:24 +11:00
Peter Wang
d40ab1ab44 Slightly improve string stripping functions.
library/string.m:
    Use unsafe_between for chomp, lstrip_pred, rstrip_pred
    to avoid range checks.
2019-10-30 16:51:00 +11:00
Peter Wang
09512195fc Make string.split_at_separator skip ill-formed sequences in UTF-8 strings.
library/string.m:
    Make split_at_separator never consider ill-formed sequences in UTF-8
    strings as potential separators, as they cannot contain any code
    points that could satisfy any given DelimP predicate on code points.
    Previously, split_at_separator would call DelimP(U+FFFD) for every
    code unit in an ill-formed sequence.
2019-10-30 16:51:00 +11:00
Peter Wang
1b91cf375c Make string.words_separator skip ill-formed sequences in UTF-8.
library/string.m:
    Make words_separator never consider ill-formed sequences in UTF-8
    strings as potential separators, as they cannot contain any code
    points that could satisfy any given SepP predicate on code points.
    Previously, words_separator would call SepP(U+FFFD) for every code
    unit in an ill-formed sequence.
2019-10-30 16:51:00 +11:00
Peter Wang
de2af8cdd7 Make string.all_match fail on UTF-8 string containing ill-formed sequence.
library/string.m:
    Make all_match(Pred, String) always fail if the string contains an
    ill-formed code unit sequence, and strings use UTF-8 encoding.
    Such sequences do not contain any code points that could satisfy a
    test on code points. Previously, all_match would call Pred(U+FFFD)
    for every code unit in an ill-formed sequence.

    Define all_match to rule out an interpretation that could ignore
    ill-formed sequences.
2019-10-30 16:51:00 +11:00
Peter Wang
817cf44efd Make string.prefix_length/suffix_length stop at ill-formed sequence.
library/string.m:
    Make prefix_length and suffix_length stop at an ill-formed sequence
    in UTF-8 strings. Such a sequence does not contain any code point
    that could satisfy a test on code points. Previously, prefix_length
    and suffix_length would call Pred(U+FFFD) for every code unit
    in an ill-formed sequence.

    Tweak documentation.

    Delete obsolete comments.
2019-10-30 16:51:00 +11:00
Peter Wang
265ffa15f0 Fix two bugs in string.contains_char.
library/string.m:
    Fix C implementation of contains_char to fail when asked to test for
    a surrogate code point in a string. It previously would (always)
    succeed, which is a bug.

    Fix generic implementation so that contains_char(String, '\uFFFD')
    will not succeed just because String contains an ill-formed sequence
    (in UTF-8 grades).

    Delete obsolete comment.
2019-10-30 16:51:00 +11:00
Peter Wang
6c0c337568 Add string indexing predicates that indicate if the char was replaced.
library/string.m:
    Add index_next_repl, unsafe_index_next_repl, prev_index_repl,
    unsafe_prev_index_repl predicates. These are internal for now,
    so we can try them out in the string module without committing
    to the interface.
2019-10-30 16:51:00 +11:00
Peter Wang
7da7c103df Improve definition of string.index, index_next, prev_index.
library/string.m:
    Fix definition of index/3 and index_next/4 to account for an offset
    into a non-initial code unit in a well-formed code unit sequence.

    Similarly for prev_index/4.
2019-10-30 16:51:00 +11:00
Peter Wang
9bee18553c Correct documentation for string.from_char_list.
2019-10-30 12:02:42 +11:00
Peter Wang
831003f042 Delete outdated todo.
2019-10-30 11:21:02 +11:00
Peter Wang
658c8a5ad5 Define behaviour of string.char_to_string on edge cases.
library/string.m:
    Define behaviour of char_to_string when the string is not
    well-formed or if the char is a surrogate code point.

    Implement char_to_string/2 using multiple clauses
    as the described behaviour doesn't match to_char_list/2.

tests/hard_coded/Mmakefile:
tests/hard_coded/char_to_string.exp:
tests/hard_coded/char_to_string.exp2:
tests/hard_coded/char_to_string.m:
    Add test case.
2019-10-30 11:21:02 +11:00
Peter Wang
56687d235e Define behaviour of string.first_char/3 on edge cases.
library/string.m:
    Define first_char/3 to fail if the input string begins with an
    ill-formed code unit sequence.

    Define the reverse mode to throw an exception on an attempt to
    encode a null character or surrogate code point in the output
    string.

    Reimplement first_char/3 in Mercury.

hard_coded/Mmakefile:
hard_coded/string_first_char_ilseq.exp:
hard_coded/string_first_char_ilseq.m:
    Add test case.
2019-10-30 11:21:02 +11:00
Peter Wang
025bee0549 Check for surrogates when converting list of char to string.
library/string.m:
    Make from_char_list, from_rev_char_list, to_char_list throw an
    exception if the list of chars includes a surrogate code point that
    cannot be encoded in a UTF-8 string.

    Make semidet_from_char_list, semidet_from_rev_char_list,
    to_char_list fail if the list of chars includes a surrogate code
    point that cannot be encoded in a UTF-8 string.

runtime/mercury_string.h:
    Document return value of MR_utf8_width.

tests/hard_coded/Mmakefile:
tests/hard_coded/string_from_char_list_ilseq.exp:
tests/hard_coded/string_from_char_list_ilseq.exp2:
tests/hard_coded/string_from_char_list_ilseq.m:
    Add test case.

tests/hard_coded/null_char.exp:
    Expect new message in exceptions thrown by from_char_list,
    from_rev_char_list.

tests/hard_coded/string_hash.m:
    Don't generate surrogate code points in random strings.
2019-10-30 11:21:02 +11:00
Peter Wang
527fae384e Make compare_ignore_case_ascii loop over code units.
library/string.m:
    Make compare_ignore_case_ascii loop over code units instead of code
    points, allowing it to work on strings that contain ill-formed code
    unit sequences.
2019-10-29 11:16:23 +11:00
Peter Wang
0ed3599f26 Deprecate multi modes of string.prefix and string.suffix.
The two modes of string.prefix and string.suffix are not equivalent in
the presence of ill-formed code unit sequences. The solution is to
deprecate the lesser used mode of each.

library/string.m:
    As above.

    Delete outdated comments.

NEWS:
    Announce the changes.
2019-10-25 15:10:45 +11:00
Peter Wang
cdddf3a047 Simplify string.prefix and string.suffix implementations.
library/string.m:
    Implement prefix(in, in) and suffix(in, in) using
    compare_substrings.
2019-10-25 15:10:45 +11:00
Peter Wang
a12ee1907e Add string.append(uo, in, in) mode.
library/string.m:
    Add string.append(uo, in, in) mode. The comment about it being
    multi instead of semidet was written back when string.append was
    implemented in terms of list.append.

    Implement remove_suffix using the new procedure (more efficient).

    Implement remove_suffix_if_present using remove_suffix
    (more efficient).

    Add comments about the argument orders of remove_suffix,
    det_remove_suffix, remove_suffix_if_present.
2019-10-25 15:10:45 +11:00
Peter Wang
a12663ea76 Deprecate string.append(out, out, in) mode.
Mark pointed out that the string.append(out, out, in) mode does not
match the forward modes. The simplest solution is to deprecate and
eventually remove it.

library/string.m:
    Deprecate string.append(out, out, in) mode.

    Add string.nondet_append/3 as its replacement.

    Add more documentation.

NEWS:
    Announce changes.
2019-10-24 12:58:28 +11:00
Peter Wang
cd899271c6 Make string.append(out, out, in) work with ill-formed sequences.
library/string.m:
    Simplify string.append(out, out, in) and make it work sensibly in
    the presence of ill-formed code unit sequences, breaking the input
    string after each code point or code unit in an ill-formed sequence.

tests/hard_coded/Mmakefile:
tests/hard_coded/string_append_ooi_ilseq.exp:
tests/hard_coded/string_append_ooi_ilseq.exp2:
tests/hard_coded/string_append_ooi_ilseq.m:
    Add test case.
2019-10-24 12:31:29 +11:00
Peter Wang
30f287951c Simplify string.append(in, in, in) implementation.
library/string.m:
    Replace foreign code implementations with Mercury code.
2019-10-24 12:31:29 +11:00
Peter Wang
93bf252632 Simplify string.append(in, out, in) implementation.
library/string.m:
    Replace foreign code implementations with Mercury code.
2019-10-24 12:31:29 +11:00
Peter Wang
3c68d3d8f2 Implement string.unsafe_compare_substrings with foreign code.
library/string.m:
    Add C and C# native implementations of unsafe_compare_substrings.
2019-10-24 12:31:29 +11:00
Peter Wang
bf1f624632 Add string.compare_substrings and unsafe_compare_substrings.
library/string.m:
    Add the new predicates.

tests/hard_coded/Mmakefile:
tests/hard_coded/string_compare_substrings.exp:
tests/hard_coded/string_compare_substrings.m:
    Add test case.

NEWS:
    Announce additions.
2019-10-24 12:31:29 +11:00
Peter Wang
e28ff5bbe7 Define string.between_codepoints more precisely and fix bug.
library/string.m:
    Define string.between_codepoints in terms of codepoint_offset.

    Fix behaviour in the case where
        Start < 0,
        End < 0,
        End > Start

tests/hard_coded/string_codepoint.exp:
tests/hard_coded/string_codepoint.exp2:
tests/hard_coded/string_codepoint.m:
    Extend test case.
2019-10-24 12:31:29 +11:00
Peter Wang
8143b07257 Simplify string.between implementation.
library/string.m:
    Replace foreign code implementations with Mercury code.
2019-10-24 12:31:29 +11:00
Peter Wang
5e52d45cc4 Make string.left and string.right not create unused substrings.
library/string.m:
    Implement string.left and string.right using string.between
    instead of string.split so as not to create unused substrings.
2019-10-24 12:31:29 +11:00
Peter Wang
b18a47c32f Simplify string.split implementation.
library/string.m:
    Replace foreign code implementations with Mercury code.
2019-10-24 12:31:29 +11:00
Peter Wang
778bff560d Deprecate modes of string predicates that imply round-trippability.
Mark pointed out that to_char_list/2 having multiple modes implies the
ability to round trip convert between a string and list of chars,
which is not true if to_char_list replaces code units in ill-formed
sequences with U+FFFD; converting the list of chars back to a string
may produce a different string from the original input.

library/string.m:
    Deprecate reverse modes of to_char_list/2, to_rev_char_list/2 and
    from_char_list/2. Add commented out `obsolete_proc' pragmas to be
    enabled at a later date.

    Delete the unused Mercury implementation of string.append/3
    that depends on multi-moded to_char_list/2. The implementation is
    incorrect anyway in the presence of ill-formed code unit sequences.

    Add comment about a future change to char_to_string.

NEWS:
    Announce changes.
2019-10-24 09:24:50 +11:00
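The round-trip problem described in the commit above can be reproduced in a few lines. This is a Python illustration, not Mercury code. Python's `errors="replace"` handler stands in for the U+FFFD substitution, and its replacement granularity may differ from Mercury's in other inputs.

```python
original = b"a\xffb"  # the byte 0xFF never appears in well-formed UTF-8

# Decode to a list of characters, substituting U+FFFD for the ill-formed byte.
chars = list(original.decode("utf-8", errors="replace"))

# Convert the list of characters back to an encoded string.
round_tripped = "".join(chars).encode("utf-8")

# The replacement character is re-encoded as three bytes, so the result
# no longer matches the input.
print(round_tripped == original)  # False
```

Once a code unit has been replaced with U+FFFD, the original value is lost, so no reverse conversion can recover the input string.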
Peter Wang
7350e7f0b6 Define behaviour of string.codepoint_offset on ill-formed sequences.
library/string.m:
    Define how string.codepoint_offset counts code units in ill-formed
    sequences.

    Delete C and C# foreign implementations in favour of the Mercury
    implementation that has the intended behaviour.
    (The Java implementation uses String.offsetByCodePoints which
    also matches our intended behaviour.)

tests/hard_coded/Mmakefile:
tests/hard_coded/string_codepoint_offset_ilseq.exp2:
tests/hard_coded/string_codepoint_offset_ilseq.m:
    Add test case.
2019-10-24 09:22:13 +11:00
Peter Wang
edfbeb1d9a Define behaviour of string.foldl etc on ill-formed sequences.
library/string.m:
    As above.

tests/hard_coded/Mmakefile:
tests/hard_coded/string_fold_ilseq.exp:
tests/hard_coded/string_fold_ilseq.exp2:
tests/hard_coded/string_fold_ilseq.m:
    Add test case.
2019-10-24 09:22:13 +11:00
Peter Wang
250b5bcc2e Define behaviour of string.count_codepoints with ill-formed sequences.
library/string.m:
    Make each code unit in an ill-formed sequence contribute one
    to the value of string.count_codepoints.

tests/hard_coded/Mmakefile:
tests/hard_coded/string_count_codepoints_ilseq.exp:
tests/hard_coded/string_count_codepoints_ilseq.exp2:
tests/hard_coded/string_count_codepoints_ilseq.m:
    Add test case.
2019-10-24 09:14:46 +11:00
Peter Wang
9b25e167e1 Define behaviour of string.to_char_list (and rev) on ill-formed sequences.
library/string.m:
    Define string.to_char_list and string.to_rev_char_list to either
    replace code units in ill-formed sequences with U+FFFD or return
    unpaired surrogate code points.

    Use Mercury version of do_to_char_list instead of updating
    the foreign language implementations.

tests/hard_coded/Mmakefile:
tests/hard_coded/string_char_list_ilseq.exp:
tests/hard_coded/string_char_list_ilseq.exp2:
tests/hard_coded/string_char_list_ilseq.m:
    Add test case.
2019-10-24 09:14:46 +11:00
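A minimal sketch of the U+FFFD replacement rule from the commit above, covering only the UTF-8 case (the unpaired surrogate alternative is not modelled). It is written in Python, not Mercury. The helper names are hypothetical, and "one U+FFFD per code unit of an ill-formed sequence" is taken from the wording of the neighbouring commits, not from the library source.

```python
from typing import List, Optional, Tuple

REPLACEMENT = "\ufffd"


def sequence_length(lead: int) -> int:
    """Expected UTF-8 sequence length for a lead byte, or 0 if it cannot lead."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def decode_one(data: bytes, i: int) -> Optional[Tuple[str, int]]:
    """Decode the well-formed sequence starting at offset i, if there is one."""
    n = sequence_length(data[i])
    if n == 0 or i + n > len(data):
        return None
    try:
        # Strict decoding rejects overlong forms and encoded surrogates.
        return data[i:i + n].decode("utf-8"), n
    except UnicodeDecodeError:
        return None


def to_char_list(data: bytes) -> List[str]:
    """Return one char per code point, or U+FFFD per ill-formed code unit."""
    chars = []
    i = 0
    while i < len(data):
        decoded = decode_one(data, i)
        if decoded is None:
            chars.append(REPLACEMENT)
            i += 1  # advance one code unit only
        else:
            char, n = decoded
            chars.append(char)
            i += n
    return chars
```

Under this reading, the length of the returned list matches the rule stated for string.count_codepoints, where each code unit in an ill-formed sequence contributes one.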
Peter Wang
0c9bdf2587 Define behaviour of string.prev_index on ill-formed sequences.
library/string.m:
    Make string.prev_index and string.unsafe_prev_index
    return either U+FFFD or an unpaired surrogate code point
    when an ill-formed code unit sequence is detected.

tests/hard_coded/Mmakefile:
tests/hard_coded/string_prev_index_ilseq.exp:
tests/hard_coded/string_prev_index_ilseq.exp2:
tests/hard_coded/string_prev_index_ilseq.m:
    Add test case.
2019-10-24 09:14:46 +11:00
Peter Wang
d055627fd2 Define behaviour of string.index_next on ill-formed sequences.
library/string.m:
    Make string.index_next and string.unsafe_index_next
    return either U+FFFD or an unpaired surrogate code point
    when an ill-formed code unit sequence is detected.

tests/hard_coded/Mmakefile:
tests/hard_coded/string_index_next_ilseq.exp:
tests/hard_coded/string_index_next_ilseq.exp2:
tests/hard_coded/string_index_next_ilseq.m:
    Add test case.
2019-10-24 09:14:46 +11:00
Peter Wang
47d0f70ea4 Define behaviour of string.index on ill-formed sequences.
library/string.m:
    Make string.index/3 and string.unsafe_index/3
    return either U+FFFD or an unpaired surrogate code point
    when an ill-formed code unit sequence is detected.

tests/hard_coded/Mmakefile:
tests/hard_coded/string_index_ilseq.exp:
tests/hard_coded/string_index_ilseq.exp2:
tests/hard_coded/string_index_ilseq.m:
    Add test case.
2019-10-24 09:14:46 +11:00
Peter Wang
1a619af68e Add more TODOs relating to ill-formed code unit sequences.
2019-09-13 15:51:02 +10:00