Preserves really uses Unicode scalar values, not code points.

Tony Garnock-Jones 2023-10-13 14:01:21 +02:00
parent ad3da3896b
commit 8edb657603
6 changed files with 77 additions and 45 deletions

@@ -13,5 +13,5 @@ defaults:
layout: page
title: "Preserves"
version_date: "March 2023"
version: "0.7.0"
version_date: "October 2023"
version: "0.7.1"

@@ -127,7 +127,7 @@ normalization form. A `NormalizedString` is a `Record` labelled with
`unicode-normalization` and having two fields, the first of which is a
`Symbol` specifying the normalization form used (e.g. `nfc`, `nfd`,
`nfkc`, `nfkd`), and the second of which is a `String` whose
underlying code point representation *MUST* be normalized according to
underlying Unicode scalar value sequence *MUST* be normalized according to
the named normalization form.
## IRIs (URIs, URLs, URNs, etc.).

@@ -143,8 +143,8 @@ example,
Syntax for these three types varies only in the tag used. For `String`
and `Symbol`, the data following the tag is a UTF-8 encoding of the
`Value`'s code points, while for `ByteString` it is the raw data
contained within the `Value` unmodified.
`Value`, while for `ByteString` it is the raw data contained within the
`Value` unmodified.
«S» = [0xB1] ++ varint(|utf8(S)|) ++ utf8(S) if S ∈ String
[0xB2] ++ varint(|S|) ++ S if S ∈ ByteString
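The encoding rule above can be sketched in Python. This is only an illustration, not the reference implementation: the function names are hypothetical, and `varint` is assumed to be the usual base-128 encoding with the least significant group first. Python's UTF-8 encoder refuses surrogate code points, which matches the requirement that a `String` contains only Unicode scalar values.

```python
def varint(n: int) -> bytes:
    """Base-128 varint, least significant 7-bit group first (assumed format)."""
    if n < 0:
        raise ValueError("varint is only defined for non-negative integers")
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)  # more groups follow
        else:
            out.append(low)         # final group
            return bytes(out)


def encode_string(s: str) -> bytes:
    """[0xB1] ++ varint(|utf8(S)|) ++ utf8(S)."""
    payload = s.encode("utf-8")  # raises UnicodeEncodeError on surrogates
    return bytes([0xB1]) + varint(len(payload)) + payload


def encode_bytestring(b: bytes) -> bytes:
    """[0xB2] ++ varint(|S|) ++ S."""
    return bytes([0xB2]) + varint(len(b)) + b
```

Note that the length prefix counts UTF-8 bytes, not scalar values: `"z水𝄞"` has three scalar values but an eight-byte payload.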
@@ -198,11 +198,10 @@ the same `Value` to yield different binary `Repr`s.
Every tag byte in a binary Preserves `Document` falls within the range
[`0x80`, `0xBF`]. These bytes, interpreted as UTF-8, are *continuation
bytes*, and will never occur as the first byte of a UTF-8 encoded code
point. This means no binary-encoded document can be misinterpreted as
valid UTF-8.
bytes*, and will never occur as the first byte of a UTF-8 encoding. This
means no binary-encoded document can be misinterpreted as valid UTF-8.
Conversely, a UTF-8 document must start with a valid codepoint,
Conversely, a UTF-8 document must start with a valid scalar value,
meaning in particular that it must not start with a byte in the range
[`0x80`, `0xBF`]. This means that no UTF-8 encoded textual-syntax
Preserves document can be misinterpreted as a binary-syntax document.
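The disjointness argument can be checked mechanically. The following Python sketch uses hypothetical helper names and looks only at the first byte, exactly as the paragraphs above do; it is an illustration of the reasoning, not a full syntax detector.

```python
def starts_with_tag_byte(data: bytes) -> bool:
    """True if the first byte lies in [0x80, 0xBF], the binary tag range."""
    return len(data) > 0 and 0x80 <= data[0] <= 0xBF


def is_valid_utf8(data: bytes) -> bool:
    """True if the whole byte sequence decodes as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Any input for which `starts_with_tag_byte` holds begins with a UTF-8 continuation byte, so `is_valid_utf8` must reject it; any non-empty valid UTF-8 input begins with a byte outside that range.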

@@ -21,7 +21,7 @@ The definition uses [case-sensitive ABNF][abnf].
ABNF allows easy definition of US-ASCII-based languages. However,
Preserves is a Unicode-based language. Therefore, we reinterpret ABNF as
a grammar for recognising sequences of Unicode code points.
a grammar for recognising sequences of Unicode scalar values.
**Encoding.** Textual syntax for a `Value` *SHOULD* be encoded using
UTF-8 where possible.
@@ -82,10 +82,13 @@ false, respectively.
Boolean = %s"#t" / %s"#f"
`String`s are,
[as in JSON](https://tools.ietf.org/html/rfc8259#section-7), possibly
escaped text surrounded by double quotes. The escaping rules are the
same as for JSON.[^string-json-correspondence] [^escaping-surrogate-pairs]
`String`s are, [as in JSON](https://tools.ietf.org/html/rfc8259#section-7),
possibly escaped text surrounded by double quotes. The escaping rules are
the same as for JSON,[^string-json-correspondence]
[^escaping-surrogate-pairs]
[except](https://tools.ietf.org/html/rfc8259#section-8.2) that unpaired
[surrogate code points](https://unicode.org/glossary/#surrogate_code_point)
*MUST NOT* be generated or accepted.[^unpaired-surrogates]
String = %x22 *char %x22
char = unescaped / %x7C / escape (escaped / %x22 / %s"u" 4HEXDIG)
@@ -106,33 +109,49 @@ same as for JSON.[^string-json-correspondence] [^escaping-surrogate-pairs]
largely unmodified from the text of RFC 8259.
[^escaping-surrogate-pairs]: In particular, note JSON's rules around
the use of surrogate pairs for code points not in the Basic
the use of surrogate pairs for scalar values not in the Basic
Multilingual Plane. We encourage implementations to avoid using
`\u` escapes when producing output, and instead to rely on the
UTF-8 encoding of the entire document to handle non-ASCII
codepoints correctly.
UTF-8 encoding of the entire document to handle scalar values outside
the ASCII range correctly.
A `ByteString` may be written in any of three different forms.
[^unpaired-surrogates]: Because Preserves forbids unpaired surrogates in
its text syntax, any valid JSON text including an unpaired [surrogate
code point](https://unicode.org/glossary/#surrogate_code_point) will
not be parseable using the Preserves text syntax rules.
The first is similar to a `String`, but prepended with a hash sign
`#`. In addition, only Unicode code points overlapping with printable
7-bit ASCII are permitted unescaped inside such a `ByteString`; other
byte values must be escaped by prepending a two-digit hexadecimal
value with `\x`.
A `ByteString` may be written in any of three different forms.[^rationale-bytestring]
[^rationale-bytestring]: **Rationale.** While the [machine-oriented
syntax](preserves-binary.html) defines just one representation for
binary data, the text syntax is intended primarily for humans to use,
and so it defines many. Different usages of binary data will be more
naturally expressed in text as hexadecimal, Base 64, or almost-ASCII.
Accepting multiple syntax variations improves the ergonomics of the
text syntax.
The first is similar to a `String`, but prepended with a hash sign `#`.
Many bytes map directly to printable 7-bit ASCII; the remainder must be
escaped, either as `\x` followed by a two-digit hexadecimal number, or
following the usual rules for double quote and backslash.
ByteString = "#" %x22 *binchar %x22
binchar = binunescaped / escape (escaped / %x22 / %s"x" 2HEXDIG)
binunescaped = %x20-21 / %x23-5B / %x5D-7E
The second is as a sequence of pairs of hexadecimal digits interleaved
The second is a sequence of pairs of hexadecimal digits interleaved
with whitespace and surrounded by `#x"` and `"`.
ByteString =/ %s"#x" %x22 *(ws / 2HEXDIG) ws %x22
The third is as a sequence of
[Base64](https://tools.ietf.org/html/rfc4648) characters, interleaved
with whitespace and surrounded by `#[` and `]`. Plain and URL-safe
Base64 characters are allowed.
The third is a sequence of [Base64](https://tools.ietf.org/html/rfc4648)
characters, interleaved with whitespace and surrounded by `#[` and `]`.
[Plain](https://datatracker.ietf.org/doc/html/rfc4648#section-4) (`+`,`/`)
and [URL-safe](https://datatracker.ietf.org/doc/html/rfc4648#section-5)
(`-`,`_`) Base64 characters are accepted;
[URL-safe](https://datatracker.ietf.org/doc/html/rfc4648#section-5)
(`-`,`_`) characters *SHOULD* be generated by default. Padding characters
(`=`) may be omitted.
ByteString =/ "#[" *(ws / base64char) ws "]"
base64char = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "="
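A reader for the body of the `#[` ... `]` form might look like the Python sketch below. The function names are hypothetical, only ASCII whitespace is treated as `ws` (an assumption), and no strict validation of illegal characters is attempted. It accepts both alphabets and optional padding, and it generates URL-safe output without padding.

```python
import base64


def decode_base64_body(body: str) -> bytes:
    """Decode the text between '#[' and ']' (delimiters already removed)."""
    compact = "".join(body.split())   # drop interleaved whitespace
    compact = compact.rstrip("=")     # padding may be present or omitted
    # Map the plain alphabet onto the URL-safe one so both are accepted.
    compact = compact.replace("+", "-").replace("/", "_")
    compact += "=" * (-len(compact) % 4)  # restore padding for the decoder
    return base64.urlsafe_b64decode(compact)


def encode_base64_body(data: bytes) -> str:
    """Generate URL-safe Base64 without padding, as the SHOULD suggests."""
    return base64.urlsafe_b64encode(data).decode("ascii").rstrip("=")
```

For example, `decode_base64_body("aGVs bG8")` and `decode_base64_body("aGVsbG8=")` both yield `b"hello"`.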
@@ -156,7 +175,7 @@ it must be interpreted as a bare `Symbol`.
baresymchar = ALPHA / DIGIT / sympunct / symuchar
sympunct = "~" / "!" / "$" / "%" / "^" / "&" / "*" /
"?" / "_" / "=" / "+" / "-" / "/" / "."
symuchar = <any code point greater than 127 whose Unicode
symuchar = <any scalar value greater than 127 whose Unicode
category is Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me, Nd,
Nl, No, Pc, Pd, Po, Sc, Sm, Sk, So, or Co>
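The `symuchar` rule can be approximated with Python's `unicodedata` module, as in the sketch below. The helper name is hypothetical, and the result depends on the Unicode database version bundled with the Python interpreter, so it should be read as an illustration rather than a normative test.

```python
import unicodedata

# General categories listed in the symuchar production.
SYMUCHAR_CATEGORIES = frozenset(
    "Lu Ll Lt Lm Lo Mn Mc Me Nd Nl No Pc Pd Po Sc Sm Sk So Co".split()
)


def is_symuchar(ch: str) -> bool:
    """True if ch is a scalar value above 127 in one of the listed categories."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    cp = ord(ch)
    if cp <= 127:
        return False  # ASCII is covered by ALPHA / DIGIT / sympunct instead
    if 0xD800 <= cp <= 0xDFFF:
        return False  # surrogate code points are not scalar values
    return unicodedata.category(ch) in SYMUCHAR_CATEGORIES
```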
@@ -240,7 +259,7 @@ denoted object, prefixed with `#!`.
Embedded = "#!" Value
## Annotations
## <a id="annotations"></a>Annotations and Comments
When written down, a `Value` may have an associated sequence of
*annotations* carrying “out-of-band” contextual metadata about the

@@ -163,9 +163,8 @@ For example,
### Strings, ByteStrings and Symbols.
Syntax for these three types varies only in the tag used. For `String` and
`Symbol`, the encoded data is a UTF-8 encoding of the `Value`'s code
points, while for `ByteString` it is the raw data contained within the
`Value` unmodified.
`Symbol`, the encoded data is a UTF-8 encoding of the `Value`, while for
`ByteString` it is the raw data contained within the `Value` unmodified.
Encoded data of length between 1 and 7 bytes is represented as an immediate
`Ref` where the low *five* bits are `00010` (`String`), `10001`

@@ -52,17 +52,20 @@ A `SignedInteger` is an arbitrarily-large signed integer.
### Unicode strings.
A `String` is a sequence of Unicode
[code-point](http://www.unicode.org/glossary/#code_point)s.[^nul-permitted]
`String`s are compared lexicographically, code-point by
code-point.[^utf8-is-awesome]
A `String` is a sequence of [Unicode
scalar value](http://www.unicode.org/glossary/#unicode_scalar_value)s.[^nul-permitted]
`String`s are compared lexicographically, scalar value by
scalar value.[^utf8-is-awesome]
[^utf8-is-awesome]: Happily, the design of UTF-8 is such that this
gives the same result as a lexicographic byte-by-byte comparison
of the UTF-8 encoding of a string!
[^nul-permitted]: All Unicode code-points are permitted, including NUL
(code point zero).
[^nul-permitted]: All Unicode scalar values are permitted, including NUL
(scalar value zero). Because scalar values are defined as code points
*excluding* surrogate code points
  (D800<sub>16</sub>–DFFF<sub>16</sub>), surrogates are *not* permitted
in Preserves Unicode data.
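The ordering rule, and the footnote's remark about UTF-8, can be illustrated in Python. The function names below are hypothetical. The sketch compares two strings scalar value by scalar value and rejects surrogates; the usage note shows that UTF-8 byte order agrees with this ordering while UTF-16 code unit order does not.

```python
def scalar_values(s: str) -> list[int]:
    """Return the scalar values of s, rejecting surrogate code points."""
    values = [ord(ch) for ch in s]
    for v in values:
        if 0xD800 <= v <= 0xDFFF:
            raise ValueError(f"surrogate code point U+{v:04X} is not a scalar value")
    return values


def compare_strings(a: str, b: str) -> int:
    """Lexicographic comparison by scalar value: -1, 0 or 1."""
    ka, kb = scalar_values(a), scalar_values(b)
    return (ka > kb) - (ka < kb)
```

Take U+FF5E and U+1D11E as an example: `compare_strings("\uff5e", "𝄞")` is `-1`, and the UTF-8 encodings (`EF BD 9E` and `F0 9D 84 9E`) sort the same way, whereas in UTF-16 the code unit `FF5E` sorts after the leading surrogate `D834`.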
### Binary data.
@@ -73,8 +76,8 @@ lexicographically.
Programming languages like Lisp and Prolog frequently use string-like
values called *symbols*. Here, a `Symbol` is, like a `String`, a
sequence of Unicode code-points representing an identifier of some
kind. `Symbol`s are also compared lexicographically by code-point.
sequence of Unicode scalar values representing an identifier of some
kind. `Symbol`s are also compared lexicographically by scalar value.
### Booleans.
@@ -198,7 +201,9 @@ The total ordering specified [above](#total-order) means that the following stat
| `<capture <discard>>` | B4 B3 07 'c' 'a' 'p' 't' 'u' 'r' 'e' B4 B3 07 'd' 'i' 's' 'c' 'a' 'r' 'd' 84 84 |
| `[1 2 3 4]` | B5 91 92 93 94 84 |
| `[-2 -1 0 1]` | B5 9E 9F 90 91 84 |
| `"hello"` (format B) | B1 05 'h' 'e' 'l' 'l' 'o' |
| `"hello"` | B1 05 'h' 'e' 'l' 'l' 'o' |
| `"z水𝄞"` | B1 08 'z' E6 B0 B4 F0 9D 84 9E |
| `"z水\uD834\uDD1E"` | B1 08 'z' E6 B0 B4 F0 9D 84 9E |
| `["a" b #"c" [] #{} #t #f]` | B5 B1 01 'a' B3 01 'b' B2 01 'c' B5 84 B6 84 81 80 84 |
| `-257` | A1 FE FF |
| `-1` | 9F |
@@ -250,9 +255,19 @@ encodes to
### JSON examples.
Preserves text syntax is a superset of JSON, so the examples from [RFC
8259](https://tools.ietf.org/html/rfc8259#section-13) read as valid
Preserves.
Preserves text syntax is a superset of JSON,[^json-string-caveat] so the
examples from [RFC 8259](https://tools.ietf.org/html/rfc8259#section-13)
read as valid Preserves.
[^json-string-caveat]: There is one caveat to be aware of. [Section 8.2
of RFC 8259](https://tools.ietf.org/html/rfc8259#section-8.2)
explicitly permits unpaired [surrogate code
point](https://unicode.org/glossary/#surrogate_code_point)s in JSON
texts without specifying an interpretation for them. Preserves mandates
UTF-8 in its binary syntax, forbids unpaired surrogates in its text
syntax, and disallows surrogate code points in `String`s and `Symbol`s,
meaning that any valid JSON text including an unpaired surrogate will
not be parseable using the Preserves text syntax rules.
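The caveat can be seen with Python's standard `json` module, which follows RFC 8259 in accepting unpaired surrogate escapes. The wrapper below is a hypothetical sketch of the stricter behaviour described in the footnote: surrogate pairs are combined into one scalar value, and anything left unpaired is rejected.

```python
import json


def read_json_string_strictly(text: str) -> str:
    """Parse a JSON string literal, rejecting unpaired surrogates."""
    value = json.loads(text)  # combines well-formed \uD8xx\uDCxx pairs
    if not isinstance(value, str):
        raise TypeError("expected a JSON string literal")
    for ch in value:
        if 0xD800 <= ord(ch) <= 0xDFFF:
            raise ValueError(f"unpaired surrogate U+{ord(ch):04X}")
    return value
```

With this sketch, the literal `"z\u6c34\ud834\udd1e"` reads as the three scalar values of `z水𝄞`, while `"\ud834"` on its own is valid JSON but raises `ValueError`.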
The JSON literals `true`, `false` and `null` all read as `Symbol`s, and
JSON numbers read (unambiguously) either as `SignedInteger`s or as