Preserves really uses Unicode scalar values, not code points.
parent ad3da3896b
commit 8edb657603
|
@@ -13,5 +13,5 @@ defaults:
|
|||
layout: page
|
||||
|
||||
title: "Preserves"
|
||||
version_date: "March 2023"
|
||||
version: "0.7.0"
|
||||
version_date: "October 2023"
|
||||
version: "0.7.1"
|
||||
|
|
|
@@ -127,7 +127,7 @@ normalization form. A `NormalizedString` is a `Record` labelled with
|
|||
`unicode-normalization` and having two fields, the first of which is a
|
||||
`Symbol` specifying the normalization form used (e.g. `nfc`, `nfd`,
|
||||
`nfkc`, `nfkd`), and the second of which is a `String` whose
|
||||
underlying code point representation *MUST* be normalized according to
|
||||
underlying Unicode scalar value sequence *MUST* be normalized according to
|
||||
the named normalization form.
|
||||
|
||||
## IRIs (URIs, URLs, URNs, etc.).
|
||||
|
|
|
@@ -143,8 +143,8 @@ example,
|
|||
|
||||
Syntax for these three types varies only in the tag used. For `String`
|
||||
and `Symbol`, the data following the tag is a UTF-8 encoding of the
|
||||
`Value`'s code points, while for `ByteString` it is the raw data
|
||||
contained within the `Value` unmodified.
|
||||
`Value`, while for `ByteString` it is the raw data contained within the
|
||||
`Value` unmodified.
|
||||
|
||||
«S» = [0xB1] ++ varint(|utf8(S)|) ++ utf8(S) if S ∈ String
|
||||
[0xB2] ++ varint(|S|) ++ S if S ∈ ByteString
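The two encoding rules shown in this hunk can be sketched in Python as follows. This is an illustration only, not the reference implementation: the function names are hypothetical, and the definition of `varint` is not part of this excerpt, so the sketch assumes a little-endian base-128 encoding (seven payload bits per byte, high bit set on every byte except the last). For lengths below 128 this agrees with the single length byte seen in the specification's own example table.

```python
def varint(n: int) -> bytes:
    """Assumed varint: little-endian base-128, high bit marks continuation."""
    if n < 0:
        raise ValueError("varint is only defined here for non-negative integers")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # low seven bits, continuation bit set
        n >>= 7
    out.append(n)  # final byte, continuation bit clear
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Tag 0xB1, then the UTF-8 length as a varint, then the UTF-8 bytes."""
    # Python's strict UTF-8 encoder raises UnicodeEncodeError on surrogates,
    # which matches the rule that Strings hold scalar values only.
    payload = s.encode("utf-8")
    return b"\xB1" + varint(len(payload)) + payload


def encode_bytestring(b: bytes) -> bytes:
    """Tag 0xB2, then the byte length as a varint, then the raw bytes."""
    return b"\xB2" + varint(len(b)) + b


if __name__ == "__main__":
    print(encode_string("hello").hex(" "))
    print(encode_bytestring(b"c").hex(" "))
```

Note that the length prefix counts UTF-8 bytes, not scalar values, which is why the rule is written `varint(|utf8(S)|)` rather than `varint(|S|)` for `String`.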
|
||||
|
@@ -198,11 +198,10 @@ the same `Value` to yield different binary `Repr`s.
|
|||
|
||||
Every tag byte in a binary Preserves `Document` falls within the range
|
||||
[`0x80`, `0xBF`]. These bytes, interpreted as UTF-8, are *continuation
|
||||
bytes*, and will never occur as the first byte of a UTF-8 encoded code
|
||||
point. This means no binary-encoded document can be misinterpreted as
|
||||
valid UTF-8.
|
||||
bytes*, and will never occur as the first byte of a UTF-8 encoding. This
|
||||
means no binary-encoded document can be misinterpreted as valid UTF-8.
|
||||
|
||||
Conversely, a UTF-8 document must start with a valid codepoint,
|
||||
Conversely, a UTF-8 document must start with a valid scalar value,
|
||||
meaning in particular that it must not start with a byte in the range
|
||||
[`0x80`, `0xBF`]. This means that no UTF-8 encoded textual-syntax
|
||||
Preserves document can be misinterpreted as a binary-syntax document.
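The disambiguation argument above comes down to inspecting one byte. A minimal sketch, assuming a hypothetical helper named `looks_binary` that is given the raw bytes of a non-empty document:

```python
def looks_binary(document: bytes) -> bool:
    """Return True if the first byte lies in the tag range 0x80..0xBF.

    Bytes in that range are UTF-8 continuation bytes, so they can never
    begin a well-formed UTF-8 text; a textual document therefore yields False.
    """
    if not document:
        raise ValueError("cannot classify an empty document")
    return 0x80 <= document[0] <= 0xBF


if __name__ == "__main__":
    print(looks_binary(bytes.fromhex("B5 91 92 93 94 84")))  # binary [1 2 3 4]
    print(looks_binary('"hello"'.encode("utf-8")))           # textual syntax
```

The check only classifies the document; it does not validate that the remainder is well-formed in either syntax.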
|
||||
|
|
|
@@ -21,7 +21,7 @@ The definition uses [case-sensitive ABNF][abnf].
|
|||
|
||||
ABNF allows easy definition of US-ASCII-based languages. However,
|
||||
Preserves is a Unicode-based language. Therefore, we reinterpret ABNF as
|
||||
a grammar for recognising sequences of Unicode code points.
|
||||
a grammar for recognising sequences of Unicode scalar values.
|
||||
|
||||
**Encoding.** Textual syntax for a `Value` *SHOULD* be encoded using
|
||||
UTF-8 where possible.
|
||||
|
@@ -82,10 +82,13 @@ false, respectively.
|
|||
|
||||
Boolean = %s"#t" / %s"#f"
|
||||
|
||||
`String`s are,
|
||||
[as in JSON](https://tools.ietf.org/html/rfc8259#section-7), possibly
|
||||
escaped text surrounded by double quotes. The escaping rules are the
|
||||
same as for JSON.[^string-json-correspondence] [^escaping-surrogate-pairs]
|
||||
`String`s are, [as in JSON](https://tools.ietf.org/html/rfc8259#section-7),
|
||||
possibly escaped text surrounded by double quotes. The escaping rules are
|
||||
the same as for JSON,[^string-json-correspondence]
|
||||
[^escaping-surrogate-pairs]
|
||||
[except](https://tools.ietf.org/html/rfc8259#section-8.2) that unpaired
|
||||
[surrogate code points](https://unicode.org/glossary/#surrogate_code_point)
|
||||
*MUST NOT* be generated or accepted.[^unpaired-surrogates]
|
||||
|
||||
String = %x22 *char %x22
|
||||
char = unescaped / %x7C / escape (escaped / %x22 / %s"u" 4HEXDIG)
|
||||
|
@@ -106,33 +109,49 @@ same as for JSON.[^string-json-correspondence] [^escaping-surrogate-pairs]
|
|||
largely unmodified from the text of RFC 8259.
|
||||
|
||||
[^escaping-surrogate-pairs]: In particular, note JSON's rules around
|
||||
the use of surrogate pairs for code points not in the Basic
|
||||
the use of surrogate pairs for scalar values not in the Basic
|
||||
Multilingual Plane. We encourage implementations to avoid using
|
||||
`\u` escapes when producing output, and instead to rely on the
|
||||
UTF-8 encoding of the entire document to handle non-ASCII
|
||||
codepoints correctly.
|
||||
UTF-8 encoding of the entire document to handle scalar values outside
|
||||
the ASCII range correctly.
|
||||
|
||||
A `ByteString` may be written in any of three different forms.
|
||||
[^unpaired-surrogates]: Because Preserves forbids unpaired surrogates in
|
||||
its text syntax, any valid JSON text including an unpaired [surrogate
|
||||
code point](https://unicode.org/glossary/#surrogate_code_point) will
|
||||
not be parseable using the Preserves text syntax rules.
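To illustrate the surrogate rule, the following sketch combines the 16-bit values taken from consecutive `\uXXXX` escapes into scalar values and rejects any unpaired surrogate. The function name and the choice to operate on a list of already-parsed 16-bit units are assumptions made for brevity; a real reader would do this while scanning the string literal.

```python
from typing import Sequence


def combine_escaped_units(units: Sequence[int]) -> str:
    """Turn 16-bit values from consecutive \\u escapes into scalar values."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:  # high (leading) surrogate
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                low = units[i + 1]
                scalar = 0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00)
                out.append(chr(scalar))
                i += 2
                continue
            raise ValueError(f"unpaired high surrogate U+{u:04X}")
        if 0xDC00 <= u <= 0xDFFF:  # low surrogate with no preceding high one
            raise ValueError(f"unpaired low surrogate U+{u:04X}")
        out.append(chr(u))
        i += 1
    return "".join(out)


if __name__ == "__main__":
    # "\uD834\uDD1E" denotes the single scalar value U+1D11E.
    print(hex(ord(combine_escaped_units([0xD834, 0xDD1E]))))
```

A properly paired escape sequence and the directly written character therefore denote the same `String`, while a lone `\uD834` is an error rather than a value.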
|
||||
|
||||
The first is similar to a `String`, but prepended with a hash sign
|
||||
`#`. In addition, only Unicode code points overlapping with printable
|
||||
7-bit ASCII are permitted unescaped inside such a `ByteString`; other
|
||||
byte values must be escaped by prepending a two-digit hexadecimal
|
||||
value with `\x`.
|
||||
A `ByteString` may be written in any of three different forms.[^rationale-bytestring]
|
||||
|
||||
[^rationale-bytestring]: **Rationale.** While the [machine-oriented
|
||||
syntax](preserves-binary.html) defines just one representation for
|
||||
binary data, the text syntax is intended primarily for humans to use,
|
||||
and so it defines many. Different usages of binary data will be more
|
||||
naturally expressed in text as hexadecimal, Base 64, or almost-ASCII.
|
||||
Accepting multiple syntax variations improves the ergonomics of the
|
||||
text syntax.
|
||||
|
||||
The first is similar to a `String`, but prepended with a hash sign `#`.
|
||||
Many bytes map directly to printable 7-bit ASCII; the remainder must be
|
||||
escaped, either as `\x` followed by a two-digit hexadecimal number, or
|
||||
following the usual rules for double quote and backslash.
|
||||
|
||||
ByteString = "#" %x22 *binchar %x22
|
||||
binchar = binunescaped / escape (escaped / %x22 / %s"x" 2HEXDIG)
|
||||
binunescaped = %x20-21 / %x23-5B / %x5D-7E
|
||||
|
||||
The second is as a sequence of pairs of hexadecimal digits interleaved
|
||||
The second is a sequence of pairs of hexadecimal digits interleaved
|
||||
with whitespace and surrounded by `#x"` and `"`.
|
||||
|
||||
ByteString =/ %s"#x" %x22 *(ws / 2HEXDIG) ws %x22
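A sketch of a reader for the hexadecimal form follows. The function name is hypothetical, and the `ws` rule is defined outside this excerpt, so the sketch simply uses Python's notion of whitespace as a stand-in. Because the grammar interleaves whitespace with whole `2HEXDIG` pairs, each whitespace-separated chunk is required to contain an even number of digits.

```python
def decode_hex_bytestring(text: str) -> bytes:
    """Decode the #x"..." form: pairs of hex digits interleaved with whitespace."""
    if not (text.startswith('#x"') and text.endswith('"') and len(text) >= 4):
        raise ValueError('expected text of the form #x"..."')
    body = text[3:-1]
    # bytes.fromhex raises ValueError for odd-length chunks or non-hex digits,
    # so whitespace may separate pairs but never split one.
    return b"".join(bytes.fromhex(chunk) for chunk in body.split())


if __name__ == "__main__":
    print(decode_hex_bytestring('#x"68 65 6c6c 6f"'))
```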
|
||||
|
||||
The third is as a sequence of
|
||||
[Base64](https://tools.ietf.org/html/rfc4648) characters, interleaved
|
||||
with whitespace and surrounded by `#[` and `]`. Plain and URL-safe
|
||||
Base64 characters are allowed.
|
||||
The third is a sequence of [Base64](https://tools.ietf.org/html/rfc4648)
|
||||
characters, interleaved with whitespace and surrounded by `#[` and `]`.
|
||||
[Plain](https://datatracker.ietf.org/doc/html/rfc4648#section-4) (`+`,`/`)
|
||||
and [URL-safe](https://datatracker.ietf.org/doc/html/rfc4648#section-5)
|
||||
(`-`,`_`) Base64 characters are accepted;
|
||||
[URL-safe](https://datatracker.ietf.org/doc/html/rfc4648#section-5)
|
||||
(`-`,`_`) characters *SHOULD* be generated by default. Padding characters
|
||||
(`=`) may be omitted.
|
||||
|
||||
ByteString =/ "#[" *(ws / base64char) ws "]"
|
||||
base64char = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "="
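The Base64 form can be read with the standard library once the two alphabets are unified and any omitted padding is restored. This is a sketch under stated assumptions: the function name is hypothetical, Python's whitespace definition stands in for the `ws` rule (which is defined outside this excerpt), and `=` is only handled as trailing padding.

```python
import base64


def decode_base64_bytestring(text: str) -> bytes:
    """Decode the #[...] form, accepting plain and URL-safe Base64 characters."""
    if not (text.startswith("#[") and text.endswith("]")):
        raise ValueError("expected text of the form #[...]")
    body = "".join(text[2:-1].split())   # drop interleaved whitespace
    body = body.rstrip("=")              # padding may be present or omitted
    body = body.replace("-", "+").replace("_", "/")  # map URL-safe to plain
    body += "=" * (-len(body) % 4)       # restore canonical padding
    # validate=True rejects characters outside the Base64 alphabet;
    # binascii.Error is a subclass of ValueError.
    return base64.b64decode(body, validate=True)


if __name__ == "__main__":
    print(decode_base64_bytestring("#[aGVs bG8]"))
    print(decode_base64_bytestring("#[__4]") == decode_base64_bytestring("#[//4=]"))
```

When writing output, `base64.urlsafe_b64encode` produces the URL-safe alphabet that the text recommends generating by default.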
|
||||
|
@@ -156,7 +175,7 @@ it must be interpreted as a bare `Symbol`.
|
|||
baresymchar = ALPHA / DIGIT / sympunct / symuchar
|
||||
sympunct = "~" / "!" / "$" / "%" / "^" / "&" / "*" /
|
||||
"?" / "_" / "=" / "+" / "-" / "/" / "."
|
||||
symuchar = <any code point greater than 127 whose Unicode
|
||||
symuchar = <any scalar value greater than 127 whose Unicode
|
||||
category is Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me, Nd,
|
||||
Nl, No, Pc, Pd, Po, Sc, Sm, Sk, So, or Co>
|
||||
|
||||
|
@@ -240,7 +259,7 @@ denoted object, prefixed with `#!`.
|
|||
|
||||
Embedded = "#!" Value
|
||||
|
||||
## Annotations
|
||||
## <a id="annotations"></a>Annotations and Comments
|
||||
|
||||
When written down, a `Value` may have an associated sequence of
|
||||
*annotations* carrying “out-of-band” contextual metadata about the
|
||||
|
|
|
@@ -163,9 +163,8 @@ For example,
|
|||
### Strings, ByteStrings and Symbols.
|
||||
|
||||
Syntax for these three types varies only in the tag used. For `String` and
|
||||
`Symbol`, the encoded data is a UTF-8 encoding of the `Value`'s code
|
||||
points, while for `ByteString` it is the raw data contained within the
|
||||
`Value` unmodified.
|
||||
`Symbol`, the encoded data is a UTF-8 encoding of the `Value`, while for
|
||||
`ByteString` it is the raw data contained within the `Value` unmodified.
|
||||
|
||||
Encoded data of length between 1 and 7 bytes is represented as an immediate
|
||||
`Ref` where the low *five* bits are `00010` (`String`), `10001`
|
||||
|
|
39
preserves.md
|
@@ -52,17 +52,20 @@ A `SignedInteger` is an arbitrarily-large signed integer.
|
|||
|
||||
### Unicode strings.
|
||||
|
||||
A `String` is a sequence of Unicode
|
||||
[code-point](http://www.unicode.org/glossary/#code_point)s.[^nul-permitted]
|
||||
`String`s are compared lexicographically, code-point by
|
||||
code-point.[^utf8-is-awesome]
|
||||
A `String` is a sequence of [Unicode
|
||||
scalar value](http://www.unicode.org/glossary/#unicode_scalar_value)s.[^nul-permitted]
|
||||
`String`s are compared lexicographically, scalar value by
|
||||
scalar value.[^utf8-is-awesome]
|
||||
|
||||
[^utf8-is-awesome]: Happily, the design of UTF-8 is such that this
|
||||
gives the same result as a lexicographic byte-by-byte comparison
|
||||
of the UTF-8 encoding of a string!
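The ordering claim in this footnote is easy to check empirically. The sketch below, with hypothetical function names, sorts a few strings by scalar value and by UTF-8 bytes, and contrasts both with a UTF-16 byte ordering. It relies on the fact that Python compares `str` values code point by code point, which coincides with scalar-value order when no surrogates are present.

```python
def sort_by_scalar_value(strings):
    """Sort lexicographically, scalar value by scalar value."""
    return sorted(strings)


def sort_by_utf8_bytes(strings):
    """Sort lexicographically by the bytes of each string's UTF-8 encoding."""
    return sorted(strings, key=lambda s: s.encode("utf-8"))


def sort_by_utf16_bytes(strings):
    """Sort by big-endian UTF-16 bytes; shown only as a contrast."""
    return sorted(strings, key=lambda s: s.encode("utf-16-be"))


if __name__ == "__main__":
    # U+FF5E is in the Basic Multilingual Plane; U+1D11E is outside it.
    samples = ["z", "\u6c34", "\U0001D11E", "\uFF5E", "a", ""]
    print(sort_by_scalar_value(samples) == sort_by_utf8_bytes(samples))
    print(sort_by_scalar_value(samples) == sort_by_utf16_bytes(samples))
```

The UTF-16 ordering differs because U+1D11E is encoded with surrogates starting at `D8`, which sorts below the `FF` that begins U+FF5E, even though U+1D11E is the larger scalar value.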
|
||||
|
||||
[^nul-permitted]: All Unicode code-points are permitted, including NUL
|
||||
(code point zero).
|
||||
[^nul-permitted]: All Unicode scalar values are permitted, including NUL
|
||||
(scalar value zero). Because scalar values are defined as code points
|
||||
*excluding* surrogate code points
|
||||
(D800<sub>16</sub>–DFFF<sub>16</sub>), surrogates are *not* permitted
|
||||
in Preserves Unicode data.
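The distinction this footnote draws can be stated as a predicate. A minimal sketch, with hypothetical helper names, of what it means for a code point to be a Unicode scalar value and for a Python `str` to be acceptable as `String` data:

```python
def is_scalar_value(code_point: int) -> bool:
    """True for U+0000..U+10FFFF, excluding the surrogates U+D800..U+DFFF."""
    return 0 <= code_point <= 0x10FFFF and not (0xD800 <= code_point <= 0xDFFF)


def check_string(s: str) -> str:
    """Return s unchanged if every element is a scalar value, else raise."""
    for index, ch in enumerate(s):
        if not is_scalar_value(ord(ch)):
            raise ValueError(f"surrogate U+{ord(ch):04X} at index {index}")
    return s


if __name__ == "__main__":
    print(is_scalar_value(0x0000))   # NUL is permitted
    print(is_scalar_value(0xD834))   # a surrogate is not
```

Python's `str` type can hold lone surrogates, so a check of this kind is needed there; languages whose string type already guarantees scalar values (for example Rust's `char`) get it for free.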
|
||||
|
||||
### Binary data.
|
||||
|
||||
|
@@ -73,8 +76,8 @@ lexicographically.
|
|||
|
||||
Programming languages like Lisp and Prolog frequently use string-like
|
||||
values called *symbols*. Here, a `Symbol` is, like a `String`, a
|
||||
sequence of Unicode code-points representing an identifier of some
|
||||
kind. `Symbol`s are also compared lexicographically by code-point.
|
||||
sequence of Unicode scalar values representing an identifier of some
|
||||
kind. `Symbol`s are also compared lexicographically by scalar value.
|
||||
|
||||
### Booleans.
|
||||
|
||||
|
@@ -198,7 +201,9 @@ The total ordering specified [above](#total-order) means that the following stat
|
|||
| `<capture <discard>>` | B4 B3 07 'c' 'a' 'p' 't' 'u' 'r' 'e' B4 B3 07 'd' 'i' 's' 'c' 'a' 'r' 'd' 84 84 |
|
||||
| `[1 2 3 4]` | B5 91 92 93 94 84 |
|
||||
| `[-2 -1 0 1]` | B5 9E 9F 90 91 84 |
|
||||
| `"hello"` (format B) | B1 05 'h' 'e' 'l' 'l' 'o' |
|
||||
| `"hello"` | B1 05 'h' 'e' 'l' 'l' 'o' |
|
||||
| `"z水𝄞"` | B1 08 'z' E6 B0 B4 F0 9D 84 9E |
|
||||
| `"z水\uD834\uDD1E"` | B1 08 'z' E6 B0 B4 F0 9D 84 9E |
|
||||
| `["a" b #"c" [] #{} #t #f]` | B5 B1 01 'a' B3 01 'b' B2 01 'c' B5 84 B6 84 81 80 84 |
|
||||
| `-257` | A1 FE FF |
|
||||
| `-1` | 9F |
|
||||
|
@@ -250,9 +255,19 @@ encodes to
|
|||
|
||||
### JSON examples.
|
||||
|
||||
Preserves text syntax is a superset of JSON, so the examples from [RFC
|
||||
8259](https://tools.ietf.org/html/rfc8259#section-13) read as valid
|
||||
Preserves.
|
||||
Preserves text syntax is a superset of JSON,[^json-string-caveat] so the
|
||||
examples from [RFC 8259](https://tools.ietf.org/html/rfc8259#section-13)
|
||||
read as valid Preserves.
|
||||
|
||||
[^json-string-caveat]: There is one caveat to be aware of. [Section 8.2
|
||||
of RFC 8259](https://tools.ietf.org/html/rfc8259#section-8.2)
|
||||
explicitly permits unpaired [surrogate code
|
||||
point](https://unicode.org/glossary/#surrogate_code_point)s in JSON
|
||||
texts without specifying an interpretation for them. Preserves mandates
|
||||
UTF-8 in its binary syntax, forbids unpaired surrogates in its text
|
||||
syntax, and disallows surrogate code points in `String`s and `Symbol`s,
|
||||
meaning that any valid JSON text including an unpaired surrogate will
|
||||
not be parseable using the Preserves text syntax rules.
|
||||
|
||||
The JSON literals `true`, `false` and `null` all read as `Symbol`s, and
|
||||
JSON numbers read (unambiguously) either as `SignedInteger`s or as
|
||||
|
|