Preserves really uses Unicode scalar values, not code points.
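
To illustrate the distinction this commit draws, here is a minimal Python sketch, not taken from any Preserves implementation. The function names are hypothetical. It assumes that a length below 128 is written as a single varint byte, which is consistent with the `B1 05` and `B1 08` rows in the test table changed below, and it leans on Python's `json` module only to interpret `\u` escapes.

```python
import json

# Surrogate code points, D800 through DFFF, are code points but not scalar values.
SURROGATES = range(0xD800, 0xE000)


def is_scalar_value(cp: int) -> bool:
    """Return True for any code point that is not a surrogate."""
    return 0 <= cp <= 0x10FFFF and cp not in SURROGATES


def check_string(s: str) -> str:
    """Reject Python strings holding surrogates (Python itself allows them)."""
    for ch in s:
        if not is_scalar_value(ord(ch)):
            raise ValueError(f"surrogate U+{ord(ch):04X} is not a scalar value")
    return s


def encode_string(s: str) -> bytes:
    """Sketch of [0xB1] ++ varint(|utf8(S)|) ++ utf8(S) for short strings.

    Assumption: lengths below 128 fit in one varint byte; longer
    strings are deliberately left out of this sketch.
    """
    data = check_string(s).encode("utf-8")
    if len(data) >= 0x80:
        raise NotImplementedError("multi-byte varint lengths are out of scope")
    return bytes([0xB1, len(data)]) + data


def read_json_style_string(literal: str) -> str:
    """Read a double-quoted literal with JSON escapes, then check it.

    json.loads joins an escaped surrogate pair into one scalar value,
    but passes a lone escaped surrogate through; check_string rejects it.
    """
    return check_string(json.loads(literal))
```

Under these assumptions, `encode_string("z水𝄞").hex(" ")` gives `b1 08 7a e6 b0 b4 f0 9d 84 9e`, matching the new test-table row, and the escaped-pair spelling of the same string read through `read_json_style_string` encodes identically. A lone `\uD834` escape raises `ValueError`, which is the case where valid JSON is not valid Preserves.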

2023-10-13 14:01:21 +02:00 · 8edb657603
parent ad3da3896b
commit 8edb657603
6 changed files with 77 additions and 45 deletions
--- a/_config.yml
+++ b/_config.yml
@@ -13,5 +13,5 @@ defaults:
      layout: page

 title: "Preserves"
-version_date: "March 2023"
-version: "0.7.0"
+version_date: "October 2023"
+version: "0.7.1"
--- a/conventions.md
+++ b/conventions.md
@@ -127,7 +127,7 @@ normalization form. A `NormalizedString` is a `Record` labelled with
 `unicode-normalization` and having two fields, the first of which is a
 `Symbol` specifying the normalization form used (e.g. `nfc`, `nfd`,
 `nfkc`, `nfkd`), and the second of which is a `String` whose
-underlying code point representation *MUST* be normalized according to
+underlying Unicode scalar value sequence *MUST* be normalized according to
 the named normalization form.

 ## IRIs (URIs, URLs, URNs, etc.).
--- a/preserves-binary.md
+++ b/preserves-binary.md
@@ -143,8 +143,8 @@ example,

 Syntax for these three types varies only in the tag used. For `String`
 and `Symbol`, the data following the tag is a UTF-8 encoding of the
-`Value`'s code points, while for `ByteString` it is the raw data
-contained within the `Value` unmodified.
+`Value`, while for `ByteString` it is the raw data contained within the
+`Value` unmodified.

    «S» = [0xB1] ++ varint(|utf8(S)|) ++ utf8(S)  if S ∈ String
          [0xB2] ++ varint(|S|) ++ S              if S ∈ ByteString
@@ -198,11 +198,10 @@ the same `Value` to yield different binary `Repr`s.

 Every tag byte in a binary Preserves `Document` falls within the range
 [`0x80`, `0xBF`]. These bytes, interpreted as UTF-8, are *continuation
-bytes*, and will never occur as the first byte of a UTF-8 encoded code
-point. This means no binary-encoded document can be misinterpreted as
-valid UTF-8.
+bytes*, and will never occur as the first byte of a UTF-8 encoding. This
+means no binary-encoded document can be misinterpreted as valid UTF-8.

-Conversely, a UTF-8 document must start with a valid codepoint,
+Conversely, a UTF-8 document must start with a valid scalar value,
 meaning in particular that it must not start with a byte in the range
 [`0x80`, `0xBF`]. This means that no UTF-8 encoded textual-syntax
 Preserves document can be misinterpreted as a binary-syntax document.
--- a/preserves-text.md
+++ b/preserves-text.md
@@ -21,7 +21,7 @@ The definition uses [case-sensitive ABNF][abnf].

 ABNF allows easy definition of US-ASCII-based languages. However,
 Preserves is a Unicode-based language. Therefore, we reinterpret ABNF as
-a grammar for recognising sequences of Unicode code points.
+a grammar for recognising sequences of Unicode scalar values.

 **Encoding.** Textual syntax for a `Value` *SHOULD* be encoded using
 UTF-8 where possible.
@@ -82,10 +82,13 @@ false, respectively.

           Boolean = %s"#t" / %s"#f"

-`String`s are,
-[as in JSON](https://tools.ietf.org/html/rfc8259#section-7), possibly
-escaped text surrounded by double quotes. The escaping rules are the
-same as for JSON.[^string-json-correspondence] [^escaping-surrogate-pairs]
+`String`s are, [as in JSON](https://tools.ietf.org/html/rfc8259#section-7),
+possibly escaped text surrounded by double quotes. The escaping rules are
+the same as for JSON,[^string-json-correspondence]
+[^escaping-surrogate-pairs]
+[except](https://tools.ietf.org/html/rfc8259#section-8.2) that unpaired
+[surrogate code points](https://unicode.org/glossary/#surrogate_code_point)
+*MUST NOT* be generated or accepted.[^unpaired-surrogates]

            String = %x22 *char %x22
              char = unescaped / %x7C / escape (escaped / %x22 / %s"u" 4HEXDIG)
@@ -106,33 +109,49 @@ same as for JSON.[^string-json-correspondence] [^escaping-surrogate-pairs]
    largely unmodified from the text of RFC 8259.

  [^escaping-surrogate-pairs]: In particular, note JSON's rules around
-    the use of surrogate pairs for code points not in the Basic
+    the use of surrogate pairs for scalar values not in the Basic
    Multilingual Plane. We encourage implementations to avoid using
    `\u` escapes when producing output, and instead to rely on the
-    UTF-8 encoding of the entire document to handle non-ASCII
-    codepoints correctly.
+    UTF-8 encoding of the entire document to handle scalar values outside
+    the ASCII range correctly.

-A `ByteString` may be written in any of three different forms.
+  [^unpaired-surrogates]: Because Preserves forbids unpaired surrogates in
+    its text syntax, any valid JSON text including an unpaired [surrogate
+    code point](https://unicode.org/glossary/#surrogate_code_point) will
+    not be parseable using the Preserves text syntax rules.

-The first is similar to a `String`, but prepended with a hash sign
-`#`. In addition, only Unicode code points overlapping with printable
-7-bit ASCII are permitted unescaped inside such a `ByteString`; other
-byte values must be escaped by prepending a two-digit hexadecimal
-value with `\x`.
+A `ByteString` may be written in any of three different forms.[^rationale-bytestring]
+
+  [^rationale-bytestring]: **Rationale.** While the [machine-oriented
+    syntax](preserves-binary.html) defines just one representation for
+    binary data, the text syntax is intended primarily for humans to use,
+    and so it defines many. Different usages of binary data will be more
+    naturally expressed in text as hexadecimal, Base 64, or almost-ASCII.
+    Accepting multiple syntax variations improves the ergonomics of the
+    text syntax.
+
+The first is similar to a `String`, but prepended with a hash sign `#`.
+Many bytes map directly to printable 7-bit ASCII; the remainder must be
+escaped, either as `\x` followed by a two-digit hexadecimal number, or
+following the usual rules for double quote and backslash.

        ByteString = "#" %x22 *binchar %x22
           binchar = binunescaped / escape (escaped / %x22 / %s"x" 2HEXDIG)
      binunescaped = %x20-21 / %x23-5B / %x5D-7E

-The second is as a sequence of pairs of hexadecimal digits interleaved
+The second is a sequence of pairs of hexadecimal digits interleaved
 with whitespace and surrounded by `#x"` and `"`.

       ByteString =/ %s"#x" %x22 *(ws / 2HEXDIG) ws %x22

-The third is as a sequence of
-[Base64](https://tools.ietf.org/html/rfc4648) characters, interleaved
-with whitespace and surrounded by `#[` and `]`. Plain and URL-safe
-Base64 characters are allowed.
+The third is a sequence of [Base64](https://tools.ietf.org/html/rfc4648)
+characters, interleaved with whitespace and surrounded by `#[` and `]`.
+[Plain](https://datatracker.ietf.org/doc/html/rfc4648#section-4) (`+`,`/`)
+and [URL-safe](https://datatracker.ietf.org/doc/html/rfc4648#section-5)
+(`-`,`_`) Base64 characters are accepted;
+[URL-safe](https://datatracker.ietf.org/doc/html/rfc4648#section-5)
+(`-`,`_`) characters *SHOULD* be generated by default. Padding characters
+(`=`) may be omitted.

       ByteString =/ "#[" *(ws / base64char) ws "]"
        base64char = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "="
@@ -156,7 +175,7 @@ it must be interpreted as a bare `Symbol`.
       baresymchar = ALPHA / DIGIT / sympunct / symuchar
          sympunct = "~" / "!" / "$" / "%" / "^" / "&" / "*" /
                     "?" / "_" / "=" / "+" / "-" / "/" / "."
-          symuchar = <any code point greater than 127 whose Unicode
+          symuchar = <any scalar value greater than 127 whose Unicode
                      category is Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me, Nd,
                      Nl, No, Pc, Pd, Po, Sc, Sm, Sk, So, or Co>

@@ -240,7 +259,7 @@ denoted object, prefixed with `#!`.

           Embedded = "#!" Value

-## Annotations
+## <a id="annotations"></a>Annotations and Comments

 When written down, a `Value` may have an associated sequence of
 *annotations* carrying “out-of-band” contextual metadata about the
--- a/preserves-zerocopy.md
+++ b/preserves-zerocopy.md
@@ -163,9 +163,8 @@ For example,
 ### Strings, ByteStrings and Symbols.

 Syntax for these three types varies only in the tag used. For `String` and
-`Symbol`, the encoded data is a UTF-8 encoding of the `Value`'s code
-points, while for `ByteString` it is the raw data contained within the
-`Value` unmodified.
+`Symbol`, the encoded data is a UTF-8 encoding of the `Value`, while for
+`ByteString` it is the raw data contained within the `Value` unmodified.

 Encoded data of length between 1 and 7 bytes is represented as an immediate
 `Ref` where the low *five* bits are `00010` (`String`), `10001`
--- a/preserves.md
+++ b/preserves.md
@@ -52,17 +52,20 @@ A `SignedInteger` is an arbitrarily-large signed integer.

 ### Unicode strings.

-A `String` is a sequence of Unicode
-[code-point](http://www.unicode.org/glossary/#code_point)s.[^nul-permitted]
-`String`s are compared lexicographically, code-point by
-code-point.[^utf8-is-awesome]
+A `String` is a sequence of [Unicode
+scalar value](http://www.unicode.org/glossary/#unicode_scalar_value)s.[^nul-permitted]
+`String`s are compared lexicographically, scalar value by
+scalar value.[^utf8-is-awesome]

  [^utf8-is-awesome]: Happily, the design of UTF-8 is such that this
    gives the same result as a lexicographic byte-by-byte comparison
    of the UTF-8 encoding of a string!

-  [^nul-permitted]: All Unicode code-points are permitted, including NUL
-    (code point zero).
+  [^nul-permitted]: All Unicode scalar values are permitted, including NUL
+    (scalar value zero). Because scalar values are defined as code points
+    *excluding* surrogate code points
+    (D800<sub>16</sub>–DFFF<sub>16</sub>), surrogates are *not* permitted
+    in Preserves Unicode data.

 ### Binary data.

@@ -73,8 +76,8 @@ lexicographically.

 Programming languages like Lisp and Prolog frequently use string-like
 values called *symbols*. Here, a `Symbol` is, like a `String`, a
-sequence of Unicode code-points representing an identifier of some
-kind. `Symbol`s are also compared lexicographically by code-point.
+sequence of Unicode scalar values representing an identifier of some
+kind. `Symbol`s are also compared lexicographically by scalar value.

 ### Booleans.

@@ -198,7 +201,9 @@ The total ordering specified [above](#total-order) means that the following stat
 | `<capture <discard>>`                               | B4 B3 07 'c' 'a' 'p' 't' 'u' 'r' 'e' B4 B3 07 'd' 'i' 's' 'c' 'a' 'r' 'd' 84 84 |
 | `[1 2 3 4]`                                         | B5 91 92 93 94 84                                                               |
 | `[-2 -1 0 1]`                                       | B5 9E 9F 90 91 84                                                               |
-| `"hello"` (format B)                                | B1 05 'h' 'e' 'l' 'l' 'o'                                                       |
+| `"hello"`                                           | B1 05 'h' 'e' 'l' 'l' 'o'                                                       |
+| `"z水𝄞"`                                            | B1 08 'z' E6 B0 B4 F0 9D 84 9E                                                  |
+| `"z水\uD834\uDD1E"`                                 | B1 08 'z' E6 B0 B4 F0 9D 84 9E                                                  |
 | `["a" b #"c" [] #{} #t #f]`                         | B5 B1 01 'a' B3 01 'b' B2 01 'c' B5 84 B6 84 81 80 84                           |
 | `-257`                                              | A1 FE FF                                                                        |
 | `-1`                                                | 9F                                                                              |
@@ -250,9 +255,19 @@ encodes to

 ### JSON examples.

-Preserves text syntax is a superset of JSON, so the examples from [RFC
-8259](https://tools.ietf.org/html/rfc8259#section-13) read as valid
-Preserves.
+Preserves text syntax is a superset of JSON,[^json-string-caveat] so the
+examples from [RFC 8259](https://tools.ietf.org/html/rfc8259#section-13)
+read as valid Preserves.
+
+  [^json-string-caveat]: There is one caveat to be aware of. [Section 8.2
+    of RFC 8259](https://tools.ietf.org/html/rfc8259#section-8.2)
+    explicitly permits unpaired [surrogate code
+    point](https://unicode.org/glossary/#surrogate_code_point)s in JSON
+    texts without specifying an interpretation for them. Preserves mandates
+    UTF-8 in its binary syntax, forbids unpaired surrogates in its text
+    syntax, and disallows surrogate code points in `String`s and `Symbol`s,
+    meaning that any valid JSON text including an unpaired surrogate will
+    not be parseable using the Preserves text syntax rules.

 The JSON literals `true`, `false` and `null` all read as `Symbol`s, and
 JSON numbers read (unambiguously) either as `SignedInteger`s or as