diff --git a/_config.yml b/_config.yml index 52f1c49..7ba1bac 100644 --- a/_config.yml +++ b/_config.yml @@ -13,5 +13,5 @@ defaults: layout: page title: "Preserves" -version_date: "March 2023" -version: "0.7.0" +version_date: "October 2023" +version: "0.7.1" diff --git a/conventions.md b/conventions.md index b82a236..3ff423c 100644 --- a/conventions.md +++ b/conventions.md @@ -127,7 +127,7 @@ normalization form. A `NormalizedString` is a `Record` labelled with `unicode-normalization` and having two fields, the first of which is a `Symbol` specifying the normalization form used (e.g. `nfc`, `nfd`, `nfkc`, `nfkd`), and the second of which is a `String` whose -underlying code point representation *MUST* be normalized according to +underlying Unicode scalar value sequence *MUST* be normalized according to the named normalization form. ## IRIs (URIs, URLs, URNs, etc.). diff --git a/preserves-binary.md b/preserves-binary.md index ddae30c..235f022 100644 --- a/preserves-binary.md +++ b/preserves-binary.md @@ -143,8 +143,8 @@ example, Syntax for these three types varies only in the tag used. For `String` and `Symbol`, the data following the tag is a UTF-8 encoding of the -`Value`'s code points, while for `ByteString` it is the raw data -contained within the `Value` unmodified. +`Value`, while for `ByteString` it is the raw data contained within the +`Value` unmodified. «S» = [0xB1] ++ varint(|utf8(S)|) ++ utf8(S) if S ∈ String [0xB2] ++ varint(|S|) ++ S if S ∈ ByteString @@ -198,11 +198,10 @@ the same `Value` to yield different binary `Repr`s. Every tag byte in a binary Preserves `Document` falls within the range [`0x80`, `0xBF`]. These bytes, interpreted as UTF-8, are *continuation -bytes*, and will never occur as the first byte of a UTF-8 encoded code -point. This means no binary-encoded document can be misinterpreted as -valid UTF-8. +bytes*, and will never occur as the first byte of a UTF-8 encoding. 
This +means no binary-encoded document can be misinterpreted as valid UTF-8. -Conversely, a UTF-8 document must start with a valid codepoint, +Conversely, a UTF-8 document must start with a valid scalar value, meaning in particular that it must not start with a byte in the range [`0x80`, `0xBF`]. This means that no UTF-8 encoded textual-syntax Preserves document can be misinterpreted as a binary-syntax document. diff --git a/preserves-text.md b/preserves-text.md index d9eb5ad..b4b60e9 100644 --- a/preserves-text.md +++ b/preserves-text.md @@ -21,7 +21,7 @@ The definition uses [case-sensitive ABNF][abnf]. ABNF allows easy definition of US-ASCII-based languages. However, Preserves is a Unicode-based language. Therefore, we reinterpret ABNF as -a grammar for recognising sequences of Unicode code points. +a grammar for recognising sequences of Unicode scalar values. **Encoding.** Textual syntax for a `Value` *SHOULD* be encoded using UTF-8 where possible. @@ -82,10 +82,13 @@ false, respectively. Boolean = %s"#t" / %s"#f" -`String`s are, -[as in JSON](https://tools.ietf.org/html/rfc8259#section-7), possibly -escaped text surrounded by double quotes. The escaping rules are the -same as for JSON.[^string-json-correspondence] [^escaping-surrogate-pairs] +`String`s are, [as in JSON](https://tools.ietf.org/html/rfc8259#section-7), +possibly escaped text surrounded by double quotes. The escaping rules are +the same as for JSON,[^string-json-correspondence] +[^escaping-surrogate-pairs] +[except](https://tools.ietf.org/html/rfc8259#section-8.2) that unpaired +[surrogate code points](https://unicode.org/glossary/#surrogate_code_point) +*MUST NOT* be generated or accepted.[^unpaired-surrogates] String = %x22 *char %x22 char = unescaped / %x7C / escape (escaped / %x22 / %s"u" 4HEXDIG) @@ -106,33 +109,49 @@ same as for JSON.[^string-json-correspondence] [^escaping-surrogate-pairs] largely unmodified from the text of RFC 8259. 
[^escaping-surrogate-pairs]: In particular, note JSON's rules around - the use of surrogate pairs for code points not in the Basic + the use of surrogate pairs for scalar values not in the Basic Multilingual Plane. We encourage implementations to avoid using `\u` escapes when producing output, and instead to rely on the - UTF-8 encoding of the entire document to handle non-ASCII - codepoints correctly. + UTF-8 encoding of the entire document to handle scalar values outside + the ASCII range correctly. -A `ByteString` may be written in any of three different forms. + [^unpaired-surrogates]: Because Preserves forbids unpaired surrogates in + its text syntax, any valid JSON text including an unpaired [surrogate + code point](https://unicode.org/glossary/#surrogate_code_point) will + not be parseable using the Preserves text syntax rules. -The first is similar to a `String`, but prepended with a hash sign -`#`. In addition, only Unicode code points overlapping with printable -7-bit ASCII are permitted unescaped inside such a `ByteString`; other -byte values must be escaped by prepending a two-digit hexadecimal -value with `\x`. +A `ByteString` may be written in any of three different forms.[^rationale-bytestring] + + [^rationale-bytestring]: **Rationale.** While the [machine-oriented + syntax](preserves-binary.html) defines just one representation for + binary data, the text syntax is intended primarily for humans to use, + and so it defines many. Different usages of binary data will be more + naturally expressed in text as hexadecimal, Base 64, or almost-ASCII. + Accepting multiple syntax variations improves the ergonomics of the + text syntax. + +The first is similar to a `String`, but prepended with a hash sign `#`. +Many bytes map directly to printable 7-bit ASCII; the remainder must be +escaped, either as `\x` followed by a two-digit hexadecimal number, or +following the usual rules for double quote and backslash. 
ByteString = "#" %x22 *binchar %x22 binchar = binunescaped / escape (escaped / %x22 / %s"x" 2HEXDIG) binunescaped = %x20-21 / %x23-5B / %x5D-7E -The second is as a sequence of pairs of hexadecimal digits interleaved +The second is a sequence of pairs of hexadecimal digits interleaved with whitespace and surrounded by `#x"` and `"`. ByteString =/ %s"#x" %x22 *(ws / 2HEXDIG) ws %x22 -The third is as a sequence of -[Base64](https://tools.ietf.org/html/rfc4648) characters, interleaved -with whitespace and surrounded by `#[` and `]`. Plain and URL-safe -Base64 characters are allowed. +The third is a sequence of [Base64](https://tools.ietf.org/html/rfc4648) +characters, interleaved with whitespace and surrounded by `#[` and `]`. +[Plain](https://datatracker.ietf.org/doc/html/rfc4648#section-4) (`+`,`/`) +and [URL-safe](https://datatracker.ietf.org/doc/html/rfc4648#section-5) +(`-`,`_`) Base64 characters are accepted; +[URL-safe](https://datatracker.ietf.org/doc/html/rfc4648#section-5) +(`-`,`_`) characters *SHOULD* be generated by default. Padding characters +(`=`) may be omitted. ByteString =/ "#[" *(ws / base64char) ws "]" base64char = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "=" @@ -156,7 +175,7 @@ it must be interpreted as a bare `Symbol`. baresymchar = ALPHA / DIGIT / sympunct / symuchar sympunct = "~" / "!" / "$" / "%" / "^" / "&" / "*" / "?" / "_" / "=" / "+" / "-" / "/" / "." - symuchar = @@ -240,7 +259,7 @@ denoted object, prefixed with `#!`. Embedded = "#!" Value -## Annotations +## Annotations and Comments When written down, a `Value` may have an associated sequence of *annotations* carrying “out-of-band” contextual metadata about the diff --git a/preserves-zerocopy.md b/preserves-zerocopy.md index c5fd655..884c047 100644 --- a/preserves-zerocopy.md +++ b/preserves-zerocopy.md @@ -163,9 +163,8 @@ For example, ### Strings, ByteStrings and Symbols. Syntax for these three types varies only in the tag used. 
For `String` and -`Symbol`, the encoded data is a UTF-8 encoding of the `Value`'s code -points, while for `ByteString` it is the raw data contained within the -`Value` unmodified. +`Symbol`, the encoded data is a UTF-8 encoding of the `Value`, while for +`ByteString` it is the raw data contained within the `Value` unmodified. Encoded data of length between 1 and 7 bytes is represented as an immediate `Ref` where the low *five* bits are `00010` (`String`), `10001` diff --git a/preserves.md b/preserves.md index 475a12e..c4090c6 100644 --- a/preserves.md +++ b/preserves.md @@ -52,17 +52,20 @@ A `SignedInteger` is an arbitrarily-large signed integer. ### Unicode strings. -A `String` is a sequence of Unicode -[code-point](http://www.unicode.org/glossary/#code_point)s.[^nul-permitted] -`String`s are compared lexicographically, code-point by -code-point.[^utf8-is-awesome] +A `String` is a sequence of [Unicode +scalar value](http://www.unicode.org/glossary/#unicode_scalar_value)s.[^nul-permitted] +`String`s are compared lexicographically, scalar value by +scalar value.[^utf8-is-awesome] [^utf8-is-awesome]: Happily, the design of UTF-8 is such that this gives the same result as a lexicographic byte-by-byte comparison of the UTF-8 encoding of a string! - [^nul-permitted]: All Unicode code-points are permitted, including NUL - (code point zero). + [^nul-permitted]: All Unicode scalar values are permitted, including NUL + (scalar value zero). Because scalar values are defined as code points + *excluding* surrogate code points + (D800<sub>16</sub>–DFFF<sub>16</sub>), surrogates are *not* permitted + in Preserves Unicode data. ### Binary data. @@ -73,8 +76,8 @@ lexicographically. Programming languages like Lisp and Prolog frequently use string-like values called *symbols*. Here, a `Symbol` is, like a `String`, a -sequence of Unicode code-points representing an identifier of some -kind. `Symbol`s are also compared lexicographically by code-point. 
+sequence of Unicode scalar values representing an identifier of some +kind. `Symbol`s are also compared lexicographically by scalar value. ### Booleans. @@ -198,7 +201,9 @@ The total ordering specified [above](#total-order) means that the following stat | `<capture <discard>>` | B4 B3 07 'c' 'a' 'p' 't' 'u' 'r' 'e' B4 B3 07 'd' 'i' 's' 'c' 'a' 'r' 'd' 84 84 | | `[1 2 3 4]` | B5 91 92 93 94 84 | | `[-2 -1 0 1]` | B5 9E 9F 90 91 84 | -| `"hello"` (format B) | B1 05 'h' 'e' 'l' 'l' 'o' | +| `"hello"` | B1 05 'h' 'e' 'l' 'l' 'o' | +| `"z水𝄞"` | B1 08 'z' E6 B0 B4 F0 9D 84 9E | +| `"z水\uD834\uDD1E"` | B1 08 'z' E6 B0 B4 F0 9D 84 9E | | `["a" b #"c" [] #{} #t #f]` | B5 B1 01 'a' B3 01 'b' B2 01 'c' B5 84 B6 84 81 80 84 | | `-257` | A1 FE FF | | `-1` | 9F | @@ -250,9 +255,19 @@ encodes to ### JSON examples. -Preserves text syntax is a superset of JSON, so the examples from [RFC -8259](https://tools.ietf.org/html/rfc8259#section-13) read as valid -Preserves. +Preserves text syntax is a superset of JSON,[^json-string-caveat] so the +examples from [RFC 8259](https://tools.ietf.org/html/rfc8259#section-13) +read as valid Preserves. + + [^json-string-caveat]: There is one caveat to be aware of. [Section 8.2 + of RFC 8259](https://tools.ietf.org/html/rfc8259#section-8.2) + explicitly permits unpaired [surrogate code + point](https://unicode.org/glossary/#surrogate_code_point)s in JSON + texts without specifying an interpretation for them. Preserves mandates + UTF-8 in its binary syntax, forbids unpaired surrogates in its text + syntax, and disallows surrogate code points in `String`s and `Symbol`s, + meaning that any valid JSON text including an unpaired surrogate will + not be parseable using the Preserves text syntax rules. The JSON literals `true`, `false` and `null` all read as `Symbol`s, and JSON numbers read (unambiguously) either as `SignedInteger`s or as