Delete the old random number generator; it has been deprecated since Mercury 20.01.
library/random.m:
Delete the old generator.
library/array.m:
Delete the predicate random_permutation/4.
NEWS:
Announce the above.
tests/hard_coded/string_hash.m:
Update this test to use the new RNG framework.
tests/hard_coded/Mmakefile:
tests/hard_coded/random_permutation.{m,exp}:
tests/hard_coded/random_simple.{m,exp}:
Delete these tests; they were specific to the old RNG.
extras/curs/samples/nibbles.m:
extras/solver_types/library/any_array.m:
Replace use of the old RNG.
library/string.m:
Make from_char_list, from_rev_char_list, to_char_list throw an
exception if the list of chars includes a surrogate code point that
cannot be encoded in a UTF-8 string.
Make semidet_from_char_list, semidet_from_rev_char_list,
to_char_list fail if the list of chars includes a surrogate code
point that cannot be encoded in a UTF-8 string.
runtime/mercury_string.h:
Document return value of MR_utf8_width.
tests/hard_coded/Mmakefile:
tests/hard_coded/string_from_char_list_ilseq.exp:
tests/hard_coded/string_from_char_list_ilseq.exp2:
tests/hard_coded/string_from_char_list_ilseq.m:
Add test case.
tests/hard_coded/null_char.exp:
Expect new message in exceptions thrown by from_char_list,
from_rev_char_list.
tests/hard_coded/string_hash.m:
Don't generate surrogate code points in random strings.
Estimated hours taken: 12
Branches: main
Further improvements in the implementation of string switches, along with
some bug fixes.
If the chosen hash function does not yield any collisions for the strings
in the switch arms, then we can optimize away the table column that we would
otherwise need for open addressing. This was implemented in a previous diff.
For an ordinary (non-lookup) string switch, the hash table has two columns
in the presence of collisions and one column in their absence. Therefore,
if doubling the number of rows in the table allows us to eliminate
collisions, the table's overall size is unchanged (twice the rows, but half
the columns), though the array of labels we have to put into the
computed_goto instruction we generate has to double as well.
Thus the only cost of such doubling is an increase in "code" size, and
for small tables, the elimination of the open addressing loop may compensate
for this, at least partially.
For lookup string switches, doubling the table size this way has a bigger
space cost, but the elimination of the open addressing loop still brings
a useful speed boost.
We therefore now DO double the table size if this eliminates collisions.
In the library, compiler etc directories, this eliminates collisions in
19 out of 47 string switches that had collisions with the standard table size.
compiler/switch_util.m:
Replace the separate sets of predicates we used to have for computing
hash maps (one for lookup switches and one for non-lookup switches)
with a single set that works for both.
Change this set to double the table size if this eliminates collisions.
This requires it to decide the table size, a task previously done
separately by each of its callers.
One version of this set had an old bug, which caused it to effectively
ignore the second and third string hash functions. This diff fixes it.
There were two bugs in my previous diff: the unneeded table column
was not being optimized away from several_soln lookup switches, and the
lookup code for one_soln lookup switches used the wrong column offset.
This diff fixes these too.
Since doubling the table size requires recalculating all the hash
values, decouple the computation of the hash values from the generation
of code for each switch arm; the latter should not be done more than
once.
Add a note on an old problem.
compiler/ml_string_switch.m:
compiler/string_switch.m:
Bring the code for generating code for the arms of string switches
here from switch_util.m.
tests/hard_coded/Mmakefile:
Fix the reason why the bugs mentioned above were not detected:
the relevant test cases weren't enabled.
tests/hard_coded/string_hash.m:
Update this test case to test the correspondence of the compiler's
and the runtime's versions of not just the first hash function,
but also the second and third.
runtime/mercury_string.h:
Fix a typo in a comment.
Branches: main
Improve Unicode support.
Declare that we use the Unicode character set, and UTF-8 or UTF-16 for the
internal string representation (depending on the backend). User code may be
written to those assumptions. Other external encodings can be supported in
the future by translating to/from Unicode internally.
The `char' type now represents a Unicode code point.
NOTE: questions about how to handle unpaired surrogate code points, etc. have
been left for later.
library/char.m:
Define a `char' to be a Unicode code point and extend ranges
appropriately.
Add predicates: to_utf8, to_utf16, is_surrogate, is_noncharacter.
Update some documentation.
library/io.m:
Declare I/O predicates on text streams to read/write code points, not
ambiguous "characters". Text files are expected to use UTF-8 encoding.
Supporting other encodings is for future work.
Update the C and Erlang implementations to understand UTF-8 encoding.
Update Java and C# implementations to read/write code points (Mercury
char) instead of UTF-16 code units.
Add `may_not_duplicate' attributes to some foreign_procs.
Improve Erlang implementations of seeking and getting the stream size.
library/string.m:
Declare the string representations, as described earlier.
Distinguish between code units and code points everywhere.
Existing functions and predicates which take offset and length
arguments continue to take them in terms of code units.
Add procedures: count_code_units, count_codepoints, codepoint_offset,
to_code_unit_list, from_code_unit_list, index_next, unsafe_index_next,
unsafe_prev_index, unsafe_index_code_unit, split_by_codepoint,
left_by_codepoint, right_by_codepoint, substring_by_codepoint.
Make index, index_det call error/1 if an illegal sequence is detected,
as they already do for invalid offsets.
Clarify that is_all_alpha, is_all_alnum_or_underscore,
is_alnum_or_underscore only succeed for the ASCII characters under each
of those categories.
Clarify that whitespace stripping functions only strip whitespace
characters in the ASCII range.
Add comments about the future treatment of surrogate code points
(not yet implemented).
Use the Mercury format implementation when necessary instead of `sprintf'.
The %c specifier does not work for code points which require multi-byte
representation. The field width modifier for %s only works if the
string contains only single-byte code points.
library/lexer.m:
Conform to string encoding changes.
Simplify code dealing with \uNNNN escapes now that encoding/decoding
is handled by the string module.
library/term_io.m:
Allow code points above 126 directly in Mercury source.
NOTE: \x and \o codes are treated as code points by this change.
runtime/mercury_types.h:
Redefine `MR_Char' to be `int' to hold a Unicode code point.
`MR_String' has to be defined as a pointer to `char' instead of a
pointer to `MR_Char'. Some C foreign code will be affected by this
change.
runtime/mercury_string.c:
runtime/mercury_string.h:
Add UTF-8 helper routines and macros.
Make hash routines conform to type changes.
compiler/c_util.m:
Fix output_quoted_string_lang so that it correctly outputs non-ASCII
characters for each of the target languages.
Fix quote_char for non-ASCII characters.
compiler/elds_to_erlang.m:
Write out code points above 126 normally instead of using escape
syntax.
Conform to string encoding changes.
compiler/mlds_to_cs.m:
Change Mercury `char' to be represented by C# `int'.
compiler/mlds_to_java.m:
Change Mercury `char' to be represented by Java `int'.
doc/reference_manual.texi:
Uncomment description of \u and \U escapes in string literals.
Update description of C# and Java representations for Mercury `char'
which are now `int'.
tests/debugger/tailrec1.m:
Conform to renaming.
tests/general/string_replace.exp:
tests/general/string_replace.m:
Test passing non-ASCII characters to string.replace.
tests/general/string_test.exp:
tests/general/string_test.m:
Test passing non-ASCII characters to string.duplicate_char,
string.pad_right, string.pad_left and string.format_table.
tests/hard_coded/char_unicode.exp:
tests/hard_coded/char_unicode.m:
Add test for new procedures in `char' module.
tests/hard_coded/contains_char_2.m:
Test passing non-ASCII characters to string.contains_char.
tests/hard_coded/nonascii.exp:
tests/hard_coded/nonascii.m:
tests/hard_coded/nonascii_gen.c:
Add code points above 255 to this test case.
Change test data encoding to UTF-8.
tests/hard_coded/string_class.exp:
tests/hard_coded/string_class.m:
Add test case for string.is_alpha, etc.
tests/hard_coded/string_codepoint.exp:
tests/hard_coded/string_codepoint.exp2:
tests/hard_coded/string_codepoint.m:
Add test case for new string procedures dealing with code points.
tests/hard_coded/string_first_char.exp:
tests/hard_coded/string_first_char.m:
Add test case for all modes of string.first_char.
tests/hard_coded/string_hash.m:
Don't use buggy random.random/5 predicate which can overflow on
a large range (such as the range of code points).
tests/hard_coded/string_presuffix.exp:
tests/hard_coded/string_presuffix.m:
Add test case for string.prefix, string.suffix, etc.
tests/hard_coded/string_set_char.m:
Test passing non-ASCII characters to string.set_char.
tests/hard_coded/string_strip.exp:
tests/hard_coded/string_strip.m:
Test passing non-ASCII characters to the string stripping procedures.
tests/hard_coded/string_sub_string_search.m:
Test passing non-ASCII characters to string.sub_string_search.
tests/hard_coded/unicode_test.exp:
Update expected output due to change of behaviour of
`string.to_char_list'.
tests/hard_coded/unicode_test.m:
Test non-ASCII character in separator string argument to
string.join_list.
tests/hard_coded/utf8_io.exp:
tests/hard_coded/utf8_io.m:
Add tests for UTF-8 I/O.
tests/hard_coded/words_separator.exp:
tests/hard_coded/words_separator.m:
Add test case for `string.words_separator'.
tests/hard_coded/Mmakefile:
Add new test cases.
Make special_char test case run on all backends.
tests/hard_coded/special_char.exp:
tests/valid/mercury_java_parser_follow_code_bug.m:
Reencode these files in UTF-8.
NEWS:
Add a news entry.
Estimated hours taken: 15
Branches: main
Make all functions which create strings from characters throw an exception
or fail if the list of characters contains a null character.
This removes a potential source of security vulnerabilities where one
part of the program performs checks against the whole of a string passed
in by an attacker (processing the string as a list of characters or using
`unsafe_index' to look past the null character), but then passes the string
to another part of the program or an operating system call that only sees
up to the first null character. Even if Mercury stored the length with
the string, allowing the creation of strings containing nulls would be a
bad idea because it would be too easy to pass a string to foreign code
without checking.
For examples see:
<http://insecure.org/news/P55-07.txt>
<http://www.securiteam.com/securitynews/5WP0B1FKKQ.html>
<http://www.securityfocus.com/archive/1/445788>
<http://www.securityfocus.com/archive/82/368750>
<http://secunia.com/advisories/16420/>
NEWS:
Document the change.
library/string.m:
Throw an exception if null characters are found in
string.from_char_list and string.from_rev_char_list.
Add string.from_char_list_semidet and string.from_rev_char_list_semidet
which fail rather than throwing an exception. This doesn't match the
normal naming convention, but string.from_{,rev_}char_list are widely
used, so changing their determinism would be a bit too disruptive.
Don't allocate an unnecessary extra word for each string created by
from_char_list and from_rev_char_list.
Explain that to_upper and to_lower only work on un-accented
Latin letters.
library/lexer.m:
Check for invalid characters when reading Mercury strings and
quoted names.
Improve error messages by skipping to the end of any string
or quoted name containing an error. Previously we just stopped
processing at the error, leaving an unmatched quote.
library/io.m:
Make io.read_line_as_string and io.read_file_as_string return
an error code if the input file contains a null character.
Fix an XXX: '\0\' is not recognised as a character constant,
but char.det_from_int can be used to make a null character.
library/char.m:
Explain the workaround for '\0\' not being accepted as a char
constant.
Explain that to_upper and to_lower only work on un-accented
Latin letters.
compiler/layout.m:
compiler/layout_out.m:
compiler/c_util.m:
compiler/stack_layout.m:
compiler/llds.m:
compiler/mlds.m:
compiler/ll_backend.*.m:
compiler/ml_backend.*.m:
Don't pass around strings containing null characters (the string
tables for the debugger). This doesn't cause any problems now,
but won't work with the accurate garbage collector. Use lists
of strings instead, and add the null characters when writing the
strings out.
tests/hard_coded/null_char.{m,exp}:
Change an existing test case to test that creation of a string
containing a null throws an exception.
tests/hard_coded/null_char.exp2:
Deleted because alternative output is no longer needed.
tests/invalid/Mmakefile:
tests/invalid/null_char.m:
tests/invalid/null_char.err_exp:
Test error messages for construction of strings containing null
characters by the lexer.
tests/invalid/unicode{1,2}.err_exp:
Update the expected output after the change to the handling of
invalid quoted names and strings.
Estimated hours taken: 1
Branches: main, release
runtime/mercury_string.h:
Fix a bug which caused the results of MR_hash_string()
and string__hash to differ -- cast each character to
MR_UnsignedChar before combining it with the hash value.
tests/hard_coded/Mmakefile:
tests/hard_coded/string_hash.{m,exp}:
Test case.