library/string.m:
Delete long-deprecated substring/3 function and substring/4 predicate.
The newly introduced `string_piece' type has a substring/3 data
constructor which takes (start, end) offsets into the base string,
whereas the function and predicate take (start, count) arguments.
To reduce potential confusion, delete the deprecated function and
predicate.
Delete other deprecated substring predicates and functions as well.
tests/general/Mercury.options:
tests/general/string_foldl_substring.exp:
tests/general/string_foldl_substring.m:
tests/general/string_foldr_substring.exp:
tests/general/string_foldr_substring.m:
tests/hard_coded/Mercury.options:
tests/hard_coded/string_substring.m:
Delete tests for deprecated predicates.
tests/tabling/mercury_java_parser_dead_proc_elim_bug.m:
tests/tabling/mercury_java_parser_dead_proc_elim_bug2.m:
tests/valid/mercury_java_parser_follow_code_bug.m:
Replace calls to unsafe_substring with unsafe_between.
NEWS:
Announce the changes.
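
The (start, count) versus (start, end) confusion described above can be
illustrated with a small Python model. This is only a sketch: the names
`substring_by_count` and `between` are hypothetical stand-ins rather than the
Mercury library code, and it assumes non-negative, in-range offsets.

```python
def substring_by_count(s, start, count):
    """(start, count) convention, as taken by the deleted function and predicate."""
    return s[start:start + count]


def between(s, start, end):
    """(start, end) convention, as taken by the string_piece constructor."""
    return s[start:end]


s = "abcdefgh"
by_count = substring_by_count(s, 2, 3)  # three code units starting at offset 2
by_end = between(s, 2, 3)               # code units from offset 2 up to offset 3
```

The same pair of integers selects different text under the two conventions,
which is why mechanically replacing one call with the other needs the second
argument recomputed as start + count.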
library/string.m:
Define behaviour of string.replace_all on ill-formed code unit
sequences when the pattern is empty.
Implement that behaviour.
Use better variable names in documentation of string.replace and
string.replace_all.
tests/general/string_replace.exp:
tests/general/string_replace.exp2:
tests/general/string_replace.m:
Extend test case.
Update code style.
library/string.m:
Make generic implementations of string.to_upper and string.to_lower
preserve ill-formed sequences. (The foreign language implementations
already did so.)
library/string.m:
Document that count_utf8_code_units throws an exception if the
string contains an unpaired surrogate code point.
Make the exception message thrown more useful to callers.
Delete unnecessary foreign_procs.
library/string.m:
Document that from_code_unit_list fails if the result string would
contain a null character, and enforce that in the Java and C#
implementations. It was already enforced in the C implementation.
Make from_code_unit_list fail if the code unit list contains an
invalid value (negative, or greater than 0xff for UTF-8 code units,
or greater than 0xffff for UTF-16 code units).
Document that from_utf{8,16}_code_unit_list fails if the result
string would contain a null character.
Make from_utf8_code_unit_list call semidet_from_rev_char_list rather
than from_rev_char_list so that it fails as documented instead of
throwing an exception if the code unit list correctly encodes a list
of code points, but the code points cannot be encoded into a string.
Similarly for from_utf16_code_unit_list.
tests/hard_coded/Mmakefile:
tests/hard_coded/string_from_code_unit_list.exp:
tests/hard_coded/string_from_code_unit_list.exp2:
tests/hard_coded/string_from_code_unit_list.m:
Add test case.
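
The two checks described for from_code_unit_list can be modelled in Python
as follows. This is a sketch under stated assumptions: it covers only 8-bit
(UTF-8) code units, it uses `None` to stand for failure, and it models only
the null and range checks named above, not any other validation the library
may perform.

```python
def from_code_unit_list(units):
    """Model of the described checks: return bytes, or None to signal failure."""
    for unit in units:
        if unit == 0:
            return None  # the result string would contain a null character
        if unit < 0 or unit > 0xFF:
            return None  # not a valid 8-bit code unit
    return bytes(units)


ok = from_code_unit_list([104, 105])
with_null = from_code_unit_list([104, 0, 105])
too_large = from_code_unit_list([104, 0x100])
```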
library/string.m:
Define sub_string_search_start to fail if the BeginAt parameter is
negative or past the end of the string to search. The original C
implementation did not check for an out-of-range starting offset,
and could crash the program. The C implementation was later amended
to fail instead, but not other implementations.
Check for negative starting offset in non-C implementations of
sub_string_search_start.
tests/hard_coded/string_sub_string_search.m:
Extend test case.
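
The range check on BeginAt can be sketched in Python. The function below is a
hypothetical model over bytes (standing in for code units), with `None` for
failure; treating an offset equal to the string length as still in range is
an assumption of this sketch.

```python
def sub_string_search_start(string, sub, begin_at):
    """Return the offset of sub in string at or after begin_at, or None."""
    # Without this check, Python would interpret a negative start as an
    # offset from the end, which is not the behaviour described above.
    if begin_at < 0 or begin_at > len(string):
        return None
    index = string.find(sub, begin_at)
    return index if index >= 0 else None


found = sub_string_search_start(b"hello world", b"o", 5)
negative = sub_string_search_start(b"hello world", b"o", -1)
past_end = sub_string_search_start(b"hello world", b"o", 12)
```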
library/string.m:
Make C# implementation of sub_string_search perform ordinal
(Unicode code point) based string search, instead of a
culture-sensitive search.
library/string.m:
Make split_at_separator never consider ill-formed sequences in UTF-8
strings as potential separators, as they cannot contain any code
points that could satisfy any given DelimP predicate on code points.
Previously, split_at_separator would call DelimP(U+FFFD) for every
code unit in an ill-formed sequence.
library/string.m:
Make words_separator never consider ill-formed sequences in UTF-8
strings as potential separators, as they cannot contain any code
points that could satisfy any given SepP predicate on code points.
Previously, words_separator would call SepP(U+FFFD) for every code
unit in an ill-formed sequence.
library/string.m:
Make all_match(Pred, String) always fail if the string contains an
ill-formed code unit sequence, and strings use UTF-8 encoding.
Such sequences do not contain any code points that could satisfy a
test on code points. Previously, all_match would call Pred(U+FFFD)
for every code unit in an ill-formed sequence.
Define all_match to rule out an interpretation that could ignore
ill-formed sequences.
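
A rough Python model of the all_match behaviour for UTF-8 strings follows.
It is a sketch only: it relies on Python's strict UTF-8 decoder to detect
ill-formed sequences, and the function name merely mirrors the Mercury
predicate.

```python
def all_match(pred, code_units):
    """True iff the bytes are well-formed UTF-8 and every code point satisfies pred."""
    try:
        text = code_units.decode("utf-8")  # strict: raises on ill-formed input
    except UnicodeDecodeError:
        return False  # an ill-formed sequence holds no code point to test
    return all(pred(ch) for ch in text)


well_formed = all_match(str.isalpha, b"abc")
ill_formed = all_match(lambda ch: True, b"ab\xffc")
```

Even a predicate that accepts every code point fails on the second input,
which rules out the reading in which ill-formed sequences are simply skipped.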
library/string.m:
Make prefix_length and suffix_length stop at an ill-formed sequence
in UTF-8 strings. Such a sequence does not contain any code point
that could satisfy a test on code points. Previously, prefix_length
and suffix_length would call Pred(U+FFFD) for every code unit
in an ill-formed sequence.
Tweak documentation.
Delete obsolete comments.
library/string.m:
Fix C implementation of contains_char to fail when asked to test for
a surrogate code point in a string. Previously it would always
succeed, which was a bug.
Fix generic implementation so that contains_char(String, '\uFFFD')
will not succeed just because String contains an ill-formed sequence
(in UTF-8 grades).
Delete obsolete comment.
library/string.m:
Add index_next_repl, unsafe_index_next_repl, prev_index_repl,
unsafe_prev_index_repl predicates. These are internal for now,
so we can try them out in the string module without committing
to the interface.
library/string.m:
Fix definition of index/3 and index_next/4 to account for an offset
into a non-initial code unit in a well-formed code unit sequence.
Similarly for prev_index/4.
library/string.m:
Define behaviour of char_to_string when the string is not
well-formed or if the char is a surrogate code point.
Implement char_to_string/2 using multiple clauses
as the described behaviour doesn't match to_char_list/2.
tests/hard_coded/Mmakefile:
tests/hard_coded/char_to_string.exp:
tests/hard_coded/char_to_string.exp2:
tests/hard_coded/char_to_string.m:
Add test case.
library/string.m:
Define first_char/3 to fail if the input string begins with an
ill-formed code unit sequence.
Define the reverse mode to throw an exception on an attempt to
encode a null character or surrogate code point in the output
string.
Reimplement first_char/3 in Mercury.
hard_coded/Mmakefile:
hard_coded/string_first_char_ilseq.exp:
hard_coded/string_first_char_ilseq.m:
Add test case.
library/string.m:
Make from_char_list, from_rev_char_list, to_char_list throw an
exception if the list of chars includes a surrogate code point that
cannot be encoded in a UTF-8 string.
Make semidet_from_char_list, semidet_from_rev_char_list,
to_char_list fail if the list of chars includes a surrogate code
point that cannot be encoded in a UTF-8 string.
runtime/mercury_string.h:
Document return value of MR_utf8_width.
tests/hard_coded/Mmakefile:
tests/hard_coded/string_from_char_list_ilseq.exp:
tests/hard_coded/string_from_char_list_ilseq.exp2:
tests/hard_coded/string_from_char_list_ilseq.m:
Add test case.
tests/hard_coded/null_char.exp:
Expect new message in exceptions thrown by from_char_list,
from_rev_char_list.
tests/hard_coded/string_hash.m:
Don't generate surrogate code points in random strings.
library/string.m:
Make compare_ignore_case_ascii loop over code units instead of code
points, allowing it to work on strings that contain ill-formed code
unit sequences.
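
Comparing code unit by code unit can be sketched in Python over bytes. This
is a hypothetical model: folding upper case to lower case (rather than the
reverse) and returning -1, 0 or 1 are assumptions of the sketch, not
documented behaviour of the library.

```python
def compare_ignore_case_ascii(a, b):
    """Compare two byte strings, folding only the ASCII letters A-Z."""
    def fold(unit):
        # Only 'A'..'Z' are changed; every other code unit, including those
        # in ill-formed sequences, is compared as-is.
        return unit + 32 if 0x41 <= unit <= 0x5A else unit

    folded_a = [fold(unit) for unit in a]
    folded_b = [fold(unit) for unit in b]
    return (folded_a > folded_b) - (folded_a < folded_b)


same = compare_ignore_case_ascii(b"Hello\xff", b"hELLO\xff")
less = compare_ignore_case_ascii(b"abc", b"ABD")
```

Because no decoding takes place, the stray 0xff code unit needs no special
handling.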
The two modes of string.prefix and string.suffix are not equivalent in
the presence of ill-formed code unit sequences. The solution is to
deprecate the lesser used mode of each.
library/string.m:
As above.
Delete outdated comments.
NEWS:
Announce the changes.
library/string.m:
Add string.append(uo, in, in) mode. The comment about it being
multi instead of semidet was written back when string.append was
implemented in terms of list.append.
Implement remove_suffix using the new procedure (more efficient).
Implement remove_suffix_if_present using remove_suffix
(more efficient).
Add comments about the argument orders of remove_suffix,
det_remove_suffix, remove_suffix_if_present.
Mark pointed out that the string.append(out, out, in) mode does not
match the forward modes. The simplest solution is to deprecate and
eventually remove it.
library/string.m:
Deprecate string.append(out, out, in) mode.
Add string.nondet_append/3 as its replacement.
Add more documentation.
NEWS:
Announce changes.
library/string.m:
Simplify string.append(out, out, in) and make it work sensibly in
the presence of ill-formed code unit sequences, breaking the input
string after each code point or code unit in an ill-formed sequence.
tests/hard_coded/Mmakefile:
tests/hard_coded/string_append_ooi_ilseq.exp:
tests/hard_coded/string_append_ooi_ilseq.exp2:
tests/hard_coded/string_append_ooi_ilseq.m:
Add test case.
library/string.m:
Add the new predicates.
tests/hard_coded/Mmakefile:
tests/hard_coded/string_compare_substrings.exp:
tests/hard_coded/string_compare_substrings.m:
Add test case.
NEWS:
Announce additions.
library/string.m:
Define string.between_codepoints in terms of codepoint_offset.
Fix behaviour in the case where
Start < 0,
End < 0,
End > Start
tests/hard_coded/string_codepoint.exp:
tests/hard_coded/string_codepoint.exp2:
tests/hard_coded/string_codepoint.m:
Extend test case.
Mark pointed out that to_char_list/2 having multiple modes implies the
ability to round trip convert between a string and list of chars,
which is not true if to_char_list replaces code units in ill-formed
sequences with U+FFFD; converting the list of chars back to a string
may produce a different string from the original input.
library/string.m:
Deprecate reverse modes of to_char_list/2, to_rev_char_list/2 and
from_char_list/2. Add commented out `obsolete_proc' pragmas to be
enabled at a later date.
Delete the unused Mercury implementation of string.append/3
that depends on multi-moded to_char_list/2. The implementation is
incorrect anyway in the presence of ill-formed code unit sequences.
Add comment about a future change to char_to_string.
NEWS:
Announce changes.
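
The failed round trip described above can be reproduced with Python's UTF-8
codec, used here only as a stand-in for a to_char_list that substitutes
U+FFFD:

```python
original = b"a\xffb"  # 0xff can never appear in well-formed UTF-8

# Forward direction: the ill-formed code unit becomes U+FFFD.
chars = list(original.decode("utf-8", errors="replace"))

# Reverse direction: U+FFFD is encoded as the three bytes EF BF BD.
round_tripped = "".join(chars).encode("utf-8")
```

The result is a different, longer string than the input, so the forward and
reverse directions are not inverses of each other.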
library/string.m:
Define how string.codepoint_offset counts code units in ill-formed
sequences.
Delete C and C# foreign implementations in favour of the Mercury
implementation that has the intended behaviour.
(The Java implementation uses String.offsetByCodePoints which
also matches our intended behaviour.)
tests/hard_coded/Mmakefile:
tests/hard_coded/string_codepoint_offset_ilseq.exp2:
tests/hard_coded/string_codepoint_offset_ilseq.m:
Add test case.
library/string.m:
As above.
tests/hard_coded/Mmakefile:
tests/hard_coded/string_fold_ilseq.exp:
tests/hard_coded/string_fold_ilseq.exp2:
tests/hard_coded/string_fold_ilseq.m:
Add test case.
library/string.m:
Make each code unit in an ill-formed sequence contribute one
to the value of string.count_codepoints.
tests/hard_coded/Mmakefile:
tests/hard_coded/string_count_codepoints_ilseq.exp:
tests/hard_coded/string_count_codepoints_ilseq.exp2:
tests/hard_coded/string_count_codepoints_ilseq.m:
Add test case.
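
A Python sketch of this counting rule for UTF-8 strings follows. The helper
names mirror the Mercury predicates but are hypothetical models: the decoder
tries strict 1 to 4 byte decodes at the current offset and, if none succeeds,
treats the single code unit at that offset as U+FFFD.

```python
def index_next(code_units, index):
    """Decode one code point at index; return (char, next_index)."""
    for width in (1, 2, 3, 4):
        chunk = code_units[index:index + width]
        if len(chunk) < width:
            break  # ran off the end of the string
        try:
            return chunk.decode("utf-8"), index + width
        except UnicodeDecodeError:
            continue
    # Ill-formed: report U+FFFD and advance by a single code unit.
    return "\ufffd", index + 1


def count_codepoints(code_units):
    """Count code points, with each ill-formed code unit counting as one."""
    count, index = 0, 0
    while index < len(code_units):
        _, index = index_next(code_units, index)
        count += 1
    return count


well_formed = count_codepoints("a\u20acb".encode("utf-8"))
truncated = count_codepoints(b"a\xe2\x82b")  # euro sign missing its last byte
```

In the second call the two leftover bytes of the truncated sequence each
count as one, giving four in total.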
library/string.m:
Define string.to_char_list and string.to_rev_char_list to either
replace code units in ill-formed sequences with U+FFFD or return
unpaired surrogate code points.
Use Mercury version of do_to_char_list instead of updating
the foreign language implementations.
tests/hard_coded/Mmakefile:
tests/hard_coded/string_char_list_ilseq.exp:
tests/hard_coded/string_char_list_ilseq.exp2:
tests/hard_coded/string_char_list_ilseq.m:
Add test case.
library/string.m:
Make string.prev_index and string.unsafe_prev_index
return either U+FFFD or an unpaired surrogate code point
when an ill-formed code unit sequence is detected.
tests/hard_coded/Mmakefile:
tests/hard_coded/string_prev_index_ilseq.exp:
tests/hard_coded/string_prev_index_ilseq.exp2:
tests/hard_coded/string_prev_index_ilseq.m:
Add test case.
library/string.m:
Make string.index_next and string.unsafe_index_next
return either U+FFFD or an unpaired surrogate code point
when an ill-formed code unit sequence is detected.
tests/hard_coded/Mmakefile:
tests/hard_coded/string_index_next_ilseq.exp:
tests/hard_coded/string_index_next_ilseq.exp2:
tests/hard_coded/string_index_next_ilseq.m:
Add test case.
library/string.m:
Make string.index/3 and string.unsafe_index/3
return either U+FFFD or an unpaired surrogate code point
when an ill-formed code unit sequence is detected.
tests/hard_coded/Mmakefile:
tests/hard_coded/string_index_ilseq.exp:
tests/hard_coded/string_index_ilseq.exp2:
tests/hard_coded/string_index_ilseq.m:
Add test case.