Split up spec!

2022-06-18 19:11:08 +02:00 · 2022-06-18 19:11:08 +02:00 · 7d3789e371
parent 1f495eef1e
commit 7d3789e371
6 changed files with 631 additions and 570 deletions
--- a/README.md
+++ b/README.md
@@ -6,22 +6,24 @@ no_site_title: true
 ---
 This [repository]({{page.projectpages}}) contains a
-[proposal](preserves.html) and various implementations of *Preserves*,
+[proposal](preserves.html) and various implementations of *Preserves*, a
-a new data model and serialization format in many ways comparable to
+new data model, with associated serialization formats, in many ways
-JSON, XML, S-expressions, CBOR, ASN.1 BER, and so on.
+comparable to JSON, XML, S-expressions, CBOR, ASN.1 BER, and so on.
 ## Core documents
 ### Preserves data model and serialization formats
 Preserves is defined in terms of a syntax-neutral
-[data model and semantics](preserves.html#starting-with-semantics)
+[data model and semantics](preserves.html#semantics)
 which all transfer syntaxes share. This allows trivial, completely
 automatic, perfect-fidelity conversion between syntaxes.
 - [Preserves specification](preserves.html):
    - [Preserves semantics and data model](preserves.html#semantics),
    - [Preserves textual syntax](preserves-text.html), and
    - [Preserves machine-oriented binary syntax](preserves-binary.html)
 - [Preserves tutorial](TUTORIAL.html)
 - [Preserves specification](preserves.html), including semantics,
   data model, textual syntax, and compact binary syntax
 - [Canonical Form for Binary Syntax](canonical-binary.html)
 - [Syrup](https://github.com/ocapn/syrup#pseudo-specification), a
   hybrid binary/human-readable syntax for the Preserves data model
--- a/_config.yml
+++ b/_config.yml
@@ -13,3 +13,5 @@ defaults:
      layout: page
 title: "Preserves"
 version_date: "June 2022"
 version: "0.6.3"
--- a/canonical-binary.md
+++ b/canonical-binary.md
@@ -17,8 +17,8 @@ their *syntax* for equivalence gives the same result as comparing them
 That is, canonical forms are equal if and only if the encoded `Value`s
 are equal.
-This document specifies canonical form for the Preserves compact
+This document specifies canonical form for the Preserves [machine-oriented
-binary syntax.
+binary syntax](preserves-binary.html).
 **Annotations.**
 Annotations *MUST NOT* be present.
--- a/preserves-binary.md
+++ b/preserves-binary.md
@@ -0,0 +1,260 @@
 ---
 no_site_title: true
 title: "Preserves: Binary Syntax"
 ---
 Tony Garnock-Jones <tonyg@leastfixedpoint.com>  
 {{ site.version_date }}. Version {{ site.version }}.
  [sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt
  [spki]: http://world.std.com/~cme/html/spki.html
  [varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
  [LEB128]: https://en.wikipedia.org/wiki/LEB128
  [erlang-map]: http://erlang.org/doc/reference_manual/data_types.html#map
  [abnf]: https://tools.ietf.org/html/rfc7405
  [canonical]: canonical-binary.html
 *Preserves* is a data model, with associated serialization formats. This
 document defines one of those formats: a binary syntax for `Value`s from
 the [Preserves data model](preserves.html) that is easy for computer
 software to read and write. An [equivalent human-readable text
 syntax](preserves-text.html) also exists.
 ## Machine-Oriented Binary Syntax
 A `Repr` is a binary-syntax encoding, or representation, of a `Value`.
 For a value `v`, we write `«v»` for the `Repr` of `v`.
 ### Type and Length representation.
 Each `Repr` starts with a tag byte, describing the kind of information
 represented. Depending on the tag, a length indicator, further encoded
 information, and/or an ending tag may follow.
    tag                          (simple atomic data and small integers)
    tag ++ binarydata            (most integers)
    tag ++ length ++ binarydata  (large integers, strings, symbols, and binary)
    tag ++ repr ++ ... ++ endtag (compound data)
 The unique end tag is byte value `0x84`.
 If present after a tag, the length of a following piece of binary data
 is formatted as a [base 128 varint][varint].[^see-also-leb128] We
 write `varint(m)` for the varint-encoding of `m`. Quoting the
 [Google Protocol Buffers][varint] definition,
  [^see-also-leb128]: Also known as [LEB128][] encoding, for unsigned
    integers. Varints and LEB128-encoded integers differ only for
    signed integers, which are not used in Preserves.
 > Each byte in a varint, except the last byte, has the most
 > significant bit (msb) set – this indicates that there are further
 > bytes to come. The lower 7 bits of each byte are used to store the
 > two's complement representation of the number in groups of 7 bits,
 > least significant group first.
 The following table illustrates varint-encoding.
 | Number, `m` | `m` in binary, grouped into 7-bit chunks  | `varint(m)` bytes |
 | ------      | -------------------                       | ------------      |
 | 15          | `0001111`                                 | 15                |
 | 300         | `0000010 0101100`                         | 172 2             |
 | 1000000000  | `0000011 1011100 1101011 0010100 0000000` | 128 148 235 220 3 |
 It is an error for a varint-encoded `m` in a `Repr` to be anything
 other than the unique shortest encoding for that `m`. That is, a
 varint-encoding of `m` *MUST NOT* end in a `0` byte unless `m`=0.
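
As a rough illustration of the varint rules above, the following Python sketch encodes and decodes lengths. The function names `varint` and `read_varint` are chosen here for illustration and are not part of the specification; the decoder's error handling is one possible way to enforce the shortest-encoding rule.

```python
def varint(m: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint, least significant group first."""
    if m < 0:
        raise ValueError("lengths are never negative")
    out = bytearray()
    while True:
        group = m & 0x7F
        m >>= 7
        if m:
            out.append(group | 0x80)  # msb set: further bytes follow
        else:
            out.append(group)         # final byte: msb clear
            return bytes(out)


def read_varint(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode a varint starting at data[pos]; return (value, next_pos)."""
    value = 0
    shift = 0
    start = pos
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    # A multi-byte encoding ending in 0 is not the shortest form.
    if byte == 0 and pos - start > 1:
        raise ValueError("overlong varint encoding")
    return value, pos
```

The three rows of the table can be reproduced with `varint(15)`, `varint(300)` and `varint(1000000000)`.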
 ### Records, Sequences, Sets and Dictionaries.
          «<L F_1...F_m>» = [0xB4] ++ «L» ++ «F_1» ++...++ «F_m» ++ [0x84]
            «[X_1...X_m]» = [0xB5] ++ «X_1» ++...++ «X_m» ++ [0x84]
           «#{E_1...E_m}» = [0xB6] ++ «E_1» ++...++ «E_m» ++ [0x84]
    «{K_1:V_1...K_m:V_m}» = [0xB7] ++ «K_1» ++ «V_1» ++...++ «K_m» ++ «V_m» ++ [0x84]
 There is *no* ordering requirement on the `E_i` elements or
 `K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any
 order. However, the `E_i` and `K_i` *MUST* be pairwise distinct. In
 addition, implementations *SHOULD* default to writing set elements and
 dictionary key/value pairs in order sorted lexicographically by their
 `Repr`s[^not-sorted-semantically], and *MAY* offer the option of
 serializing in some other implementation-defined order.
  [^no-sorting-rationale]: In the BitTorrent encoding format,
    [bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding),
    dictionary key/value pairs must be sorted by key. This is a
    necessary step for ensuring serialization of `Value`s is
    canonical. We do not require that key/value pairs (or set
    elements) be in sorted order for serialized `Value`s; however, a
    [canonical form][canonical] for `Repr`s does exist where a sorted
    ordering is required.
  [^not-sorted-semantically]: It's important to note that the sort
    ordering for writing out set elements and dictionary key/value
    pairs is *not* the same as the sort ordering implied by the
    semantic ordering of those elements or keys. For example, the
    `Repr` of a negative number very far from zero will start with a
    byte that is *greater* than the byte which starts the `Repr` of
    zero, making it sort lexicographically later by `Repr`, despite
    being semantically *less than* zero.
    **Rationale**. This is for ease-of-implementation reasons: not all
    languages can easily represent sorted sets or sorted dictionaries,
    but encoding and then sorting byte strings is much more likely to
    be within easy reach.
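
The recommended default ordering can be sketched in Python, where `bytes` values already compare lexicographically by unsigned byte. This is only a sketch: the helper name `encode_dictionary` is hypothetical, and it assumes the keys and values have already been encoded to `Repr`s by some other routine.

```python
def encode_dictionary(pairs: list[tuple[bytes, bytes]]) -> bytes:
    """Assemble a Dictionary Repr from already-encoded (key Repr, value Repr) pairs."""
    keys = [k for k, _ in pairs]
    # Identical key Reprs certainly denote identical keys. Distinct Reprs do
    # not by themselves prove distinct Values, so this check is only partial.
    if len(set(keys)) != len(keys):
        raise ValueError("dictionary keys must be pairwise distinct")
    # Sort pairs by the bytes of the key Repr, not by the meaning of the key.
    body = b"".join(k + v for k, v in sorted(pairs, key=lambda kv: kv[0]))
    return b"\xB7" + body + b"\x84"
```

For instance, with keys `-257` (`A1 FE FF`) and `0` (`90`), the pair for `0` is written first, even though `-257` is semantically smaller.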
 ### SignedIntegers.
    «x» when x ∈ SignedInteger = [0xB0] ++ varint(m) ++ intbytes(x)  if ¬(-3≤x≤12) ∧ m>16
                                 ([0xA0] + m - 1) ++ intbytes(x)     if ¬(-3≤x≤12) ∧ m≤16
                                 ([0xA0] + x)                        if  (-3≤x≤-1)
                                 ([0x90] + x)                        if  ( 0≤x≤12)
                               where m =        |intbytes(x)|
 Integers in the range [-3,12] are compactly represented with tags
 between `0x90` and `0x9F` because they are so frequently used.
 Integers up to 16 bytes long are represented with a single-byte tag
 encoding the length of the integer. Larger integers are represented
 with an explicit varint length. Every `SignedInteger` *MUST* be
 represented with its shortest possible encoding.
 The function `intbytes(x)` gives the big-endian two's-complement
 binary representation of `x`, taking exactly as many whole bytes as
 needed to unambiguously identify the value and its sign, and `m =
 |intbytes(x)|`. The most-significant bit in the first byte in
 `intbytes(x)` <!-- for `x`≠0 --> is the sign bit.[^zero-intbytes] For
 example,
      «87112285931760246646623899502532662132736»
        = B0 12 01 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00
                00 00
      «-257» = A1 FE FF        «-3» = 9D          «128» = A1 00 80
      «-256» = A1 FF 00        «-2» = 9E          «255» = A1 00 FF
      «-255» = A1 FF 01        «-1» = 9F          «256» = A1 01 00
      «-254» = A1 FF 02         «0» = 90        «32767» = A1 7F FF
      «-129» = A1 FF 7F         «1» = 91        «32768» = A2 00 80 00
      «-128» = A0 80           «12» = 9C        «65535» = A2 00 FF FF
      «-127» = A0 81           «13» = A0 0D     «65536» = A2 01 00 00
        «-4» = A0 FC          «127» = A0 7F    «131072» = A2 02 00 00
  [^zero-intbytes]: The value 0 needs zero bytes to identify the
    value, so `intbytes(0)` is the empty byte string. Non-zero values
    need at least one byte.
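
The case analysis above might be implemented along the following lines. This is a sketch in Python, with function names taken from the notation of this section (`intbytes`, `varint`) plus a hypothetical `encode_signed_integer`; it relies on Python's built-in arbitrary-precision integers.

```python
def varint(m: int) -> bytes:
    """Base-128 varint of a non-negative integer."""
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)


def intbytes(x: int) -> bytes:
    """Shortest big-endian two's-complement form of x; empty for zero."""
    if x == 0:
        return b""
    # Magnitude bits, excluding the sign bit (for negatives, measure ~x).
    bits = (x if x >= 0 else ~x).bit_length()
    # One extra bit is needed for the sign, then round up to whole bytes.
    return x.to_bytes(bits // 8 + 1, "big", signed=True)


def encode_signed_integer(x: int) -> bytes:
    if -3 <= x <= -1:
        return bytes([0xA0 + x])          # 0x9D, 0x9E, 0x9F
    if 0 <= x <= 12:
        return bytes([0x90 + x])          # 0x90 .. 0x9C
    body = intbytes(x)
    m = len(body)
    if m <= 16:
        return bytes([0xA0 + m - 1]) + body
    return b"\xB0" + varint(m) + body
```

Calling `encode_signed_integer(-257).hex()` gives `a1feff`, matching the table of examples.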
 ### Strings, ByteStrings and Symbols.
 Syntax for these three types varies only in the tag used. For `String`
 and `Symbol`, the data following the tag is a UTF-8 encoding of the
 `Value`'s code points, while for `ByteString` it is the raw data
 contained within the `Value` unmodified.
    «S» = [0xB1] ++ varint(|utf8(S)|) ++ utf8(S)  if S ∈ String
          [0xB2] ++ varint(|S|) ++ S              if S ∈ ByteString
          [0xB3] ++ varint(|utf8(S)|) ++ utf8(S)  if S ∈ Symbol
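
A minimal Python sketch of these three encodings follows. The function names are illustrative only; the point is that the length prefix counts bytes of UTF-8 (or raw bytes), not code points.

```python
def varint(m: int) -> bytes:
    """Base-128 varint of a non-negative integer."""
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)


def _tagged(tag: int, payload: bytes) -> bytes:
    # tag ++ length ++ binarydata
    return bytes([tag]) + varint(len(payload)) + payload


def encode_string(s: str) -> bytes:
    return _tagged(0xB1, s.encode("utf-8"))


def encode_bytestring(b: bytes) -> bytes:
    return _tagged(0xB2, b)


def encode_symbol(name: str) -> bytes:
    return _tagged(0xB3, name.encode("utf-8"))
```

For example, `encode_string("é")` has a length byte of 2, because `é` occupies two bytes in UTF-8.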
 ### Booleans.
    «#f» = [0x80]
    «#t» = [0x81]
 ### Floats and Doubles.
    «F» when F ∈ Float  = [0x82] ++ binary32(F)
    «D» when D ∈ Double = [0x83] ++ binary64(D)
 The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
 8-byte IEEE 754 binary representations of `F` and `D`, respectively.
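
In Python, the standard `struct` module can stand in for `binary32` and `binary64`. The sketch below uses hypothetical function names and assumes the caller has already decided whether a given number is a `Float` or a `Double`.

```python
import struct


def encode_float(f: float) -> bytes:
    # ">f" is big-endian IEEE 754 binary32.
    return b"\x82" + struct.pack(">f", f)


def encode_double(d: float) -> bytes:
    # ">d" is big-endian IEEE 754 binary64.
    return b"\x83" + struct.pack(">d", d)
```

Note that `struct.pack(">f", ...)` rounds a Python float to single precision, so values that need a `Double` to be represented exactly will lose precision if encoded with `encode_float`.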
 ### Embeddeds.
 The `Repr` of an `Embedded` is the `Repr` of a `Value` chosen to
 represent the denoted object, prefixed with `[0x86]`.
    «#!V» = [0x86] ++ «V»
 ### Annotations.
 To annotate a `Repr` `r` with some `Value` `v`, prepend `r` with
 `[0x85] ++ «v»`. For example, the `Repr` corresponding to textual
 syntax `@a@b[]`, i.e. an empty sequence annotated with two symbols,
 `a` and `b`, is
    «@a @b []»
      = [0x85] ++ «a» ++ [0x85] ++ «b» ++ «[]»
      = [0x85, 0xB3, 0x01, 0x61, 0x85, 0xB3, 0x01, 0x62, 0xB5, 0x84]
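
The same example can be worked through in Python. The helper `annotate` is a hypothetical name; it takes an existing `Repr` and the `Repr` of the annotating `Value`. Because each call prepends, the annotation written first in the text syntax is applied last.

```python
def annotate(r: bytes, annotation_repr: bytes) -> bytes:
    """Prepend [0x85] ++ «v» to an existing Repr r."""
    return b"\x85" + annotation_repr + r


SYMBOL_A = bytes([0xB3, 0x01, 0x61])   # «a»
SYMBOL_B = bytes([0xB3, 0x01, 0x62])   # «b»
EMPTY_SEQUENCE = bytes([0xB5, 0x84])   # «[]»

# @a @b []: annotate with b first, then with a, so that a ends up outermost.
annotated = annotate(annotate(EMPTY_SEQUENCE, SYMBOL_B), SYMBOL_A)
```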
 ## Security Considerations
 **Annotations.** In modes where a `Value` is being read while
 annotations are skipped, an endless sequence of annotations may give an
 illusion of progress.
 **Canonical form for cryptographic hashing and signing.** No canonical
 textual encoding of a `Value` is specified. A
 [canonical form][canonical] exists for binary encoded `Value`s, and
 implementations *SHOULD* produce canonical binary encodings by
 default; however, an implementation *MAY* permit two serializations of
 the same `Value` to yield different binary `Repr`s.
 ## Appendix. Autodetection of textual or binary syntax
 Every tag byte in a binary Preserves `Document` falls within the range
 [`0x80`, `0xBF`]. These bytes, interpreted as UTF-8, are *continuation
 bytes*, and will never occur as the first byte of a UTF-8 encoded code
 point. This means no binary-encoded document can be misinterpreted as
 valid UTF-8.
 Conversely, a UTF-8 document must start with a valid code point,
 meaning in particular that it must not start with a byte in the range
 [`0x80`, `0xBF`]. This means that no UTF-8 encoded textual-syntax
 Preserves document can be misinterpreted as a binary-syntax document.
 Examination of the top two bits of the first byte of a document gives
 its syntax: if the top two bits are `10`, it should be interpreted as
 a binary-syntax document; otherwise, it should be interpreted as text.
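
The top-two-bits test can be written as a single mask and compare. The following Python sketch uses a hypothetical function name and treats an empty input as an error, which is an assumption rather than something this appendix specifies.

```python
def detect_syntax(document: bytes) -> str:
    """Return "binary" or "text" based on the first byte of the document."""
    if not document:
        raise ValueError("cannot detect the syntax of an empty document")
    # 0xC0 selects the top two bits; 0x80 is the bit pattern 10xxxxxx.
    return "binary" if (document[0] & 0xC0) == 0x80 else "text"
```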
 ## Appendix. Table of tag values
     80 - False
     81 - True
     82 - Float
     83 - Double
     84 - End marker
     85 - Annotation
     86 - Embedded
    (8x)  RESERVED 87-8F
     9x - Small integers 0..12,-3..-1
     An - Medium integers, (n+1) bytes long
     B0 - Large integers, variable length
     B1 - String
     B2 - ByteString
     B3 - Symbol
     B4 - Record
     B5 - Sequence
     B6 - Set
     B7 - Dictionary
 ## Appendix. Binary SignedInteger representation
 Languages that provide fixed-width machine word types may find the
 following table useful in encoding and decoding binary `SignedInteger`
 values.
 | Integer range                              | Bytes required | Encoding (hex)                               |
 | ---                                        | ---            | ---                                          |
 | -3 ≤ n ≤ 12                                | 1              | `9X`                                         |
 | -2<sup>7</sup> ≤ n < 2<sup>7</sup> (i8)    | 2              | `A0` `XX`                                    |
 | -2<sup>15</sup> ≤ n < 2<sup>15</sup> (i16) | 3              | `A1` `XX` `XX`                               |
 | -2<sup>23</sup> ≤ n < 2<sup>23</sup> (i24) | 4              | `A2` `XX` `XX` `XX`                          |
 | -2<sup>31</sup> ≤ n < 2<sup>31</sup> (i32) | 5              | `A3` `XX` `XX` `XX` `XX`                     |
 | -2<sup>39</sup> ≤ n < 2<sup>39</sup> (i40) | 6              | `A4` `XX` `XX` `XX` `XX` `XX`                |
 | -2<sup>47</sup> ≤ n < 2<sup>47</sup> (i48) | 7              | `A5` `XX` `XX` `XX` `XX` `XX` `XX`           |
 | -2<sup>55</sup> ≤ n < 2<sup>55</sup> (i56) | 8              | `A6` `XX` `XX` `XX` `XX` `XX` `XX` `XX`      |
 | -2<sup>63</sup> ≤ n < 2<sup>63</sup> (i64) | 9              | `A7` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
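
Going the other way, a decoder for the small and medium forms in this table might look like the Python sketch below. The function name is hypothetical, the large `B0` form is deliberately left out, and the sketch does not check that the input used the shortest possible encoding.

```python
def decode_signed_integer(data: bytes) -> int:
    """Decode a Repr whose tag is 0x90-0x9F (small) or 0xA0-0xAF (medium)."""
    tag = data[0]
    if 0x90 <= tag <= 0x9F:
        x = tag - 0x90
        # 0x9D, 0x9E and 0x9F stand for -3, -2 and -1.
        return x - 16 if x > 12 else x
    if 0xA0 <= tag <= 0xAF:
        m = tag - 0xA0 + 1            # tag An is followed by n+1 bytes
        body = data[1:1 + m]
        if len(body) != m:
            raise ValueError("truncated integer")
        return int.from_bytes(body, "big", signed=True)
    raise ValueError("not a small or medium SignedInteger tag")
```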
 <!-- Heading to visually offset the footnotes from the main document: -->
 ## Notes
--- a/preserves-text.md
+++ b/preserves-text.md
@@ -0,0 +1,302 @@
 ---
 no_site_title: true
 title: "Preserves: Text Syntax"
 ---
 Tony Garnock-Jones <tonyg@leastfixedpoint.com>  
 {{ site.version_date }}. Version {{ site.version }}.
  [sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt
  [spki]: http://world.std.com/~cme/html/spki.html
  [varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
  [LEB128]: https://en.wikipedia.org/wiki/LEB128
  [erlang-map]: http://erlang.org/doc/reference_manual/data_types.html#map
  [abnf]: https://tools.ietf.org/html/rfc7405
  [canonical]: canonical-binary.html
 *Preserves* is a data model, with associated serialization formats. This
 document defines one of those formats: a textual syntax for `Value`s
 from the [Preserves data model](preserves.html) that is easy for people
 to read and write. An [equivalent machine-oriented binary
 syntax](preserves-binary.html) also exists.
 ## Preliminaries
 The definition uses [case-sensitive ABNF][abnf].
 ABNF allows easy definition of US-ASCII-based languages. However,
 Preserves is a Unicode-based language. Therefore, we reinterpret ABNF as
 a grammar for recognising sequences of Unicode code points.
 **Encoding.** Textual syntax for a `Value` *SHOULD* be encoded using
 UTF-8 where possible.
 **Whitespace.** Whitespace is defined as any number of spaces, tabs,
 carriage returns, line feeds, or commas.
                ws = *(%x20 / %x09 / newline / ",")
           newline = CR / LF
 ## Grammar
 Standalone documents may have trailing whitespace.
          Document = Value ws
 Any `Value` may be preceded by whitespace.
             Value = ws (Record / Collection / Atom / Embedded / Machine)
        Collection = Sequence / Dictionary / Set
              Atom = Boolean / Float / Double / SignedInteger /
                     String / ByteString / Symbol
 Each `Record` is an angle-bracket enclosed grouping of its
 label-`Value` followed by its field-`Value`s.
            Record = "<" Value *Value ws ">"
 `Sequence`s are enclosed in square brackets. `Dictionary` values are
 curly-brace-enclosed colon-separated pairs of values. `Set`s are
 written as values enclosed by the tokens `#{` and
 `}`.[^printing-collections] It is an error for a set to contain
 duplicate elements or for a dictionary to contain duplicate keys.
          Sequence = "[" *Value ws "]"
        Dictionary = "{" *(Value ws ":" Value) ws "}"
               Set = "#{" *Value ws "}"
  [^printing-collections]: **Implementation note.** When implementing
    printing of `Value`s using the textual syntax, consider supporting
    (a) optional pretty-printing with indentation, (b) optional
    JSON-compatible print mode for that subset of `Value` that is
    compatible with JSON, and (c) optional submodes for no commas,
    commas separating, and commas terminating elements or key/value
    pairs within a collection.
 `Boolean`s are the simple literal strings `#t` and `#f` for true and
 false, respectively.
           Boolean = %s"#t" / %s"#f"
 Numeric data follow the
 [JSON grammar](https://tools.ietf.org/html/rfc8259#section-6), with
 the addition of a trailing “f” distinguishing `Float` from `Double`
 values. `Float`s and `Double`s always have either a fractional part or
 an exponent part, whereas `SignedInteger`s never have
 either.[^reading-and-writing-floats-accurately]
 [^arbitrary-precision-signedinteger]
             Float = flt %i"f"
            Double = flt
     SignedInteger = int
          digit1-9 = %x31-39
               nat = %x30 / ( digit1-9 *DIGIT )
               int = ["-"] nat
              frac = "." 1*DIGIT
               exp = %i"e" ["-"/"+"] 1*DIGIT
               flt = int (frac exp / frac / exp)
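
One way to see how the three numeric productions partition tokens is to transcribe them into regular expressions. The Python sketch below is an illustration only; `classify_number` is a hypothetical helper and it expects a complete token with no surrounding whitespace.

```python
import re
from typing import Optional

_NAT = r"(?:0|[1-9][0-9]*)"                  # nat = %x30 / ( digit1-9 *DIGIT )
_INT = rf"-?{_NAT}"                          # int = ["-"] nat
_FRAC = r"\.[0-9]+"                          # frac = "." 1*DIGIT
_EXP = r"[eE][-+]?[0-9]+"                    # exp = %i"e" ["-"/"+"] 1*DIGIT
_FLT = rf"{_INT}(?:{_FRAC}{_EXP}|{_FRAC}|{_EXP})"

_FLOAT = re.compile(rf"{_FLT}[fF]")
_DOUBLE = re.compile(_FLT)
_SIGNED_INTEGER = re.compile(_INT)


def classify_number(token: str) -> Optional[str]:
    """Name the production that a whole token matches, or None if it matches none."""
    if _FLOAT.fullmatch(token):
        return "Float"
    if _DOUBLE.fullmatch(token):
        return "Double"
    if _SIGNED_INTEGER.fullmatch(token):
        return "SignedInteger"
    return None
```

Under this transcription, `1` is a `SignedInteger`, `1.0` and `1e10` are `Double`s, and `1.0f` is a `Float`, while `01`, `1.` and `.5` match none of the three.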
  [^reading-and-writing-floats-accurately]: **Implementation note.**
    Your language's standard library likely has a good routine for
    converting between decimal notation and IEEE 754 floating-point.
    However, if not, or if you are interested in the challenges of
    accurately reading and writing floating point numbers, see the
    excellent matched pair of 1990 papers by Clinger and Steele &
    White, and a recent follow-up by Jaffer:
    Clinger, William D. ‘How to Read Floating Point Numbers
    Accurately’. In Proc. PLDI. White Plains, New York, 1990.
    <https://doi.org/10.1145/93542.93557>.
    Steele, Guy L., Jr., and Jon L. White. ‘How to Print
    Floating-Point Numbers Accurately’. In Proc. PLDI. White Plains,
    New York, 1990. <https://doi.org/10.1145/93542.93559>.
    Jaffer, Aubrey. ‘Easy Accurate Reading and Writing of
    Floating-Point Numbers’. ArXiv:1310.8121 [Cs], 27 October 2013.
    <http://arxiv.org/abs/1310.8121>.
  [^arbitrary-precision-signedinteger]: **Implementation note.** Be
    aware when implementing reading and writing of `SignedInteger`s
    that the data model *requires* arbitrary-precision integers. Your
    implementation may (but, ideally, should not) truncate precision
    when reading or writing a `SignedInteger`; however, if it does so,
    it should (a) signal its client that truncation has occurred, and
    (b) make it clear to the client that comparing such truncated
    values for equality or ordering will not yield results that match
    the expected semantics of the data model.
 `String`s are,
 [as in JSON](https://tools.ietf.org/html/rfc8259#section-7), possibly
 escaped text surrounded by double quotes. The escaping rules are the
 same as for JSON.[^string-json-correspondence] [^escaping-surrogate-pairs]
            String = %x22 *char %x22
              char = unescaped / %x7C / escape (escaped / %x22 / %s"u" 4HEXDIG)
         unescaped = %x20-21 / %x23-5B / %x5D-7B / %x7D-10FFFF
            escape = %x5C              ; \
           escaped = ( %x5C /          ; \    reverse solidus U+005C
                       %x2F /          ; /    solidus         U+002F
                       %x62 /          ; b    backspace       U+0008
                       %x66 /          ; f    form feed       U+000C
                       %x6E /          ; n    line feed       U+000A
                       %x72 /          ; r    carriage return U+000D
                       %x74 )          ; t    tab             U+0009
  [^string-json-correspondence]: The grammar for `String` has the same
    effect as the
    [JSON](https://tools.ietf.org/html/rfc8259#section-7) grammar for
    `string`. Some auxiliary definitions (e.g. `escaped`) are lifted
    largely unmodified from the text of RFC 8259.
  [^escaping-surrogate-pairs]: In particular, note JSON's rules around
    the use of surrogate pairs for code points not in the Basic
    Multilingual Plane. We encourage implementations to avoid using
    `\u` escapes when producing output, and instead to rely on the
    UTF-8 encoding of the entire document to handle non-ASCII
    codepoints correctly.
 A `ByteString` may be written in any of three different forms.
 The first is similar to a `String`, but prepended with a hash sign
 `#`. In addition, only Unicode code points overlapping with printable
 7-bit ASCII are permitted unescaped inside such a `ByteString`; other
 byte values must be escaped by prepending a two-digit hexadecimal
 value with `\x`.
        ByteString = "#" %x22 *binchar %x22
           binchar = binunescaped / escape (escaped / %x22 / %s"x" 2HEXDIG)
      binunescaped = %x20-21 / %x23-5B / %x5D-7E
 The second is as a sequence of pairs of hexadecimal digits interleaved
 with whitespace and surrounded by `#x"` and `"`.
       ByteString =/ %s"#x" %x22 *(ws / 2HEXDIG) ws %x22
 The third is as a sequence of
 [Base64](https://tools.ietf.org/html/rfc4648) characters, interleaved
 with whitespace and surrounded by `#[` and `]`. Plain and URL-safe
 Base64 characters are allowed.
       ByteString =/ "#[" *(ws / base64char) ws "]"
        base64char = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "="
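
The second and third forms are straightforward to read with Python's standard library. The sketch below uses hypothetical function names. Its handling of Base64 padding (strip any trailing `=`, then re-pad) is an assumption made for the example, since the grammar allows `=` but this section does not say how padding must be used.

```python
import base64
import re

_WS = r"[ \t\r\n,]"   # the ws production: space, tab, CR, LF, comma


def parse_hex_bytestring(text: str) -> bytes:
    """Read the #x"..." form: pairs of hex digits interleaved with whitespace."""
    m = re.fullmatch(rf'#x"((?:{_WS}|[0-9A-Fa-f]{{2}})*)"', text)
    if m is None:
        raise ValueError("not a hexadecimal ByteString")
    return bytes.fromhex(re.sub(_WS, "", m.group(1)))


def parse_base64_bytestring(text: str) -> bytes:
    """Read the #[...] form, accepting plain and URL-safe Base64 characters."""
    m = re.fullmatch(rf"#\[((?:{_WS}|[A-Za-z0-9+/\-_=])*)\]", text)
    if m is None:
        raise ValueError("not a Base64 ByteString")
    chars = re.sub(_WS, "", m.group(1)).rstrip("=")
    chars = chars.translate(str.maketrans("-_", "+/"))   # URL-safe to plain
    chars += "=" * (-len(chars) % 4)                      # restore padding
    return base64.b64decode(chars, validate=True)
```

With these definitions, `#x"68 65,6c 6c 6f"` and `#[aGVs bG8=]` both read as the five bytes of `hello`.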
 A `Symbol` may be written in a “bare” form[^cf-sexp-token] so long as
 it conforms to certain restrictions on the characters appearing in the
 symbol. Alternatively, it may be written in a quoted form. The quoted
 form is much the same as the syntax for `String`s, including embedded
 escape syntax, except using a bar or pipe character (`|`) instead of a
 double quote mark.
            Symbol = symstart *symcont / "|" *symchar "|"
          symstart = ALPHA / sympunct / symustart
           symcont = ALPHA / sympunct / symustart / symucont / DIGIT / "-"
          sympunct = "~" / "!" / "$" / "%" / "^" / "&" / "*" /
                     "?" / "_" / "=" / "+" / "/" / "."
           symchar = unescaped / %x22 / escape (escaped / %x7C / %s"u" 4HEXDIG)
         symustart = <any code point greater than 127 whose Unicode
                      category is Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me,
                      Pc, Po, Sc, Sm, Sk, So, or Co>
          symucont = <any code point greater than 127 whose Unicode
                      category is Nd, Nl, No, or Pd>
  [^cf-sexp-token]: Compare with the [SPKI S-expression][sexp.txt]
    definition of “token representation”, and with the
    [R6RS definition of identifiers](http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-7.html#node_sec_4.2.4).
 An `Embedded` is written as a `Value` chosen to represent the denoted
 object, prefixed with `#!`.
           Embedded = "#!" Value
 Finally, any `Value` may be represented by escaping from the textual
 syntax to the [machine-oriented binary syntax](preserves-binary.html)
 by prefixing a `ByteString` containing the binary representation of the
 `Value` with `#=`.[^rationale-switch-to-binary]
 [^no-literal-binary-in-text] [^machine-value-annotations]
           Machine = "#=" ws ByteString
  [^rationale-switch-to-binary]: **Rationale.** The textual syntax
    cannot express every `Value`: specifically, it cannot express the
    several million floating-point NaNs, or the two floating-point
    Infinities. Since the machine-oriented binary format for `Value`s
    expresses each `Value` with precision, embedding binary `Value`s
    solves the problem.
  [^no-literal-binary-in-text]: Every text is ultimately physically
    stored as bytes; therefore, it might seem possible to escape to the
    raw form of binary encoding from within a piece of textual syntax.
    However, while bytes must be involved in any *representation* of
    text, the text *itself* is logically a sequence of *code points* and
    is not *intrinsically* a binary structure at all. It would be
    incoherent to expect to be able to access the representation of the
    text from within the text itself.
  [^machine-value-annotations]: Any text-syntax annotations preceding
    the `#` are prepended to any binary-syntax annotations yielded by
    decoding the `ByteString`.
 ## Annotations
 When written down, a `Value` may have an associated sequence of
 *annotations* carrying “out-of-band” contextual metadata about the
 value. Each annotation is, in turn, a `Value`, and may itself have
 annotations. The ordering of annotations attached to a `Value` is
 significant.
            Value =/ ws "@" Value Value
 Each annotation is preceded by `@`; the underlying annotated value
 follows its annotations. Here we extend only the syntactic nonterminal
 named “`Value`” without altering the semantic class of `Value`s.
 **Comments.** Strings annotating a `Value` are conventionally
 interpreted as comments associated with that value. Comments are
 sufficiently common that special syntax exists for them.
            Value =/ ws
                     ";" *(%x00-09 / %x0B-0C / %x0E-10FFFF) newline
                     Value
 When written this way, everything between the `;` and the newline is
 included in the string annotating the `Value`.
 **Equivalence.** Annotations appear within syntax denoting a `Value`;
 however, the annotations are not part of the denoted value. They are
 only part of the syntax. Annotations do not play a part in
 equivalences and orderings of `Value`s.
 Reflective tools such as debuggers, user interfaces, and message
 routers and relays---tools which process `Value`s generically---may
 use annotated inputs to tailor their operation, or may insert
 annotations in their outputs. By contrast, in ordinary programs, as a
 rule of thumb, the presence, absence or content of an annotation
 should not change the control flow or output of the program.
 Annotations are data *describing* `Value`s, and are not in the domain
 of any specific application of `Value`s. That is, an annotation will
 almost never cause a non-reflective program to do anything observably
 different.
 ## Security Considerations
 **Whitespace.** The textual format allows arbitrary whitespace in many
 positions. Consider optional restrictions on the amount of consecutive
 whitespace that may appear.
 **Annotations.** Similarly, in modes where a `Value` is being read
 while annotations are skipped, an endless sequence of annotations may
 give an illusion of progress.
 ## Acknowledgements
 The treatment of commas as whitespace in the text syntax is inspired
 by the same feature of [EDN](https://github.com/edn-format/edn).
 The text syntax for `Boolean`s, `Symbol`s, and `ByteString`s is
 directly inspired by [Racket](https://racket-lang.org/)'s lexical
 syntax.
 <!-- Heading to visually offset the footnotes from the main document: -->
 ## Notes
--- a/preserves.md
+++ b/preserves.md
@@ -4,7 +4,7 @@ title: "Preserves: an Expressive Data Language"
 ---
 Tony Garnock-Jones <tonyg@leastfixedpoint.com>  
-January 2022. Version 0.6.2.
+{{ site.version_date }}. Version {{ site.version }}.
  [sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt
  [spki]: http://world.std.com/~cme/html/spki.html
@@ -14,29 +14,35 @@ January 2022. Version 0.6.2.
  [abnf]: https://tools.ietf.org/html/rfc7405
  [canonical]: canonical-binary.html
-This document proposes a data model and serialization format called
+*Preserves* is a data model, with associated serialization formats.
 *Preserves*.
-Preserves supports *records* with user-defined *labels*, embedded
+It supports *records* with user-defined *labels*, embedded *references*,
-*references*, and the usual suite of atomic and compound data types,
+and the usual suite of atomic and compound data types, including
-including *binary* data as a distinct type from text strings. Its
+*binary* data as a distinct type from text strings. Its *annotations*
-*annotations* allow separation of data from metadata such as
+allow separation of data from metadata such as
-[comments](conventions.html#comments), trace information, and
+[comments](conventions.html#comments), trace information, and provenance
-provenance information.
+information.
 Preserves departs from many other data languages in defining how to
 *compare* two values. Comparison is based on the data model, not on
 syntax or on data structures of any particular implementation
 language.
-## Starting with Semantics
+This document defines the core semantics and data model of Preserves and
 presents a handful of examples. Two other core documents define
-Taking inspiration from functional programming, we start with a
+ - a [human-readable text syntax](preserves-text.html), and
-definition of the *values* that we want to work with and give them
+ - a [machine-oriented binary syntax](preserves-binary.html)
 meaning independent of their syntax.
-<a id="values"></a>
+for the Preserves data model.
-Our `Value`s fall into two broad categories: *atomic* and *compound*
+
 ## <a id="semantics"></a><a id="starting-with-semantics"></a>Values
 Preserves *values* are given meaning independent of their syntax. We
 will write "`Value`" when we mean the set of all Preserves values or an
 element of that set.
 `Value`s fall into two broad categories: *atomic* and *compound*
 data. Every `Value` is finite and non-cyclic. Embedded values, called
 `Embedded`s, are a third, special-case category.
@@ -76,20 +82,23 @@ neither is less than the other according to the total order.
 ### Signed integers.
-A `SignedInteger` is a signed integer of arbitrary width.
+A `SignedInteger` is an arbitrarily-large signed integer.
 `SignedInteger`s are compared as mathematical integers.
 ### Unicode strings.
 A `String` is a sequence of Unicode
-[code-point](http://www.unicode.org/glossary/#code_point)s. `String`s
+[code-point](http://www.unicode.org/glossary/#code_point)s.[^nul-permitted]
-are compared lexicographically, code-point by
+`String`s are compared lexicographically, code-point by
 code-point.[^utf8-is-awesome]
  [^utf8-is-awesome]: Happily, the design of UTF-8 is such that this
    gives the same result as a lexicographic byte-by-byte comparison
    of the UTF-8 encoding of a string!
  [^nul-permitted]: All Unicode code-points are permitted, including NUL
    (code point zero).
 ### Binary data.
 A `ByteString` is a sequence of octets. `ByteString`s are compared
@@ -111,11 +120,11 @@ less-than the “true” value.
 `Float`s and `Double`s are single- and double-precision IEEE 754
 floating-point values, respectively. `Float`s, `Double`s and
-`SignedInteger`s are disjoint; by the rules [above](#total-order),
+`SignedInteger`s are disjoint; by the rules [above](#total-order), every
-every `Float` is less than every `Double`, and every `SignedInteger`
+`Float` is less than every `Double`, and every `SignedInteger` is
-is greater than both. Two `Float`s or two `Double`s are to be ordered
+greater than both. Two `Float`s or two `Double`s are to be ordered by
-by the `totalOrder` predicate defined in section 5.10 of
+the `totalOrder` predicate defined in section 5.10 of [IEEE Std
-[IEEE Std 754-2008](https://dx.doi.org/10.1109/IEEESTD.2008.4610935).
+754-2008](https://dx.doi.org/10.1109/IEEESTD.2008.4610935).
 ### Records.
@@ -200,457 +209,13 @@ URL, compared according to
 usually be represented as ordinary `Value`s, in which case the
 ordinary rules for comparing `Value`s will apply.
 ## Textual Syntax
 Now that we have discussed `Value`s and their meanings, we may turn to
 techniques for *representing* `Value`s for communication or storage.
 In this section, we use [case-sensitive ABNF][abnf] to define a
 textual syntax that is easy for people to read and
 write.[^json-superset] Most of the examples in this document are
 written using this syntax. In the following section, we will define an
 equivalent compact machine-readable syntax.
  [^json-superset]: The grammar of the textual syntax is a superset of
    JSON, with the slightly unusual feature that `true`, `false`, and
    `null` are all read as `Symbol`s, and that `SignedInteger`s are
    never read as `Double`s.
    The following [schema](./preserves-schema.html) definitions match
    exactly the JSON subset of a Preserves input:
        version 1 .
        JSON = @string string / @integer int / @double double / @boolean JSONBoolean / @null =null
             / @array [JSON ...] / @object { string: JSON ...:... } .
        JSONBoolean = =true / =false .
 ### Character set.
 [ABNF][abnf] allows easy definition of US-ASCII-based languages.
 However, Preserves is a Unicode-based language. Therefore, we
 reinterpret ABNF as a grammar for recognising sequences of Unicode
 code points.
 Textual syntax for a `Value` *SHOULD* be encoded using UTF-8 where
 possible.
 ### Whitespace.
 Whitespace is defined as any number of spaces, tabs, carriage returns,
 line feeds, or commas.
                ws = *(%x20 / %x09 / newline / ",")
           newline = CR / LF
 ### Grammar.
 Standalone documents may have trailing whitespace.
          Document = Value ws
 Any `Value` may be preceded by whitespace.
             Value = ws (Record / Collection / Atom / Embedded / Compact)
        Collection = Sequence / Dictionary / Set
              Atom = Boolean / Float / Double / SignedInteger /
                     String / ByteString / Symbol
 Each `Record` is an angle-bracket enclosed grouping of its
 label-`Value` followed by its field-`Value`s.
            Record = "<" Value *Value ws ">"
 `Sequence`s are enclosed in square brackets. `Dictionary` values are
 curly-brace-enclosed colon-separated pairs of values. `Set`s are
 written as values enclosed by the tokens `#{` and
 `}`.[^printing-collections] It is an error for a set to contain
 duplicate elements or for a dictionary to contain duplicate keys.
          Sequence = "[" *Value ws "]"
        Dictionary = "{" *(Value ws ":" Value) ws "}"
               Set = "#{" *Value ws "}"
  [^printing-collections]: **Implementation note.** When implementing
    printing of `Value`s using the textual syntax, consider supporting
    (a) optional pretty-printing with indentation, (b) optional
    JSON-compatible print mode for that subset of `Value` that is
    compatible with JSON, and (c) optional submodes for no commas,
    commas separating, and commas terminating elements or key/value
    pairs within a collection.
 `Boolean`s are the simple literal strings `#t` and `#f` for true and
 false, respectively.
           Boolean = %s"#t" / %s"#f"
 Numeric data follow the
 [JSON grammar](https://tools.ietf.org/html/rfc8259#section-6), with
 the addition of a trailing “f” distinguishing `Float` from `Double`
 values. `Float`s and `Double`s always have either a fractional part or
 an exponent part, where `SignedInteger`s never have
 either.[^reading-and-writing-floats-accurately]
 [^arbitrary-precision-signedinteger]
             Float = flt %i"f"
            Double = flt
     SignedInteger = int
          digit1-9 = %x31-39
               nat = %x30 / ( digit1-9 *DIGIT )
               int = ["-"] nat
              frac = "." 1*DIGIT
               exp = %i"e" ["-"/"+"] 1*DIGIT
               flt = int (frac exp / frac / exp)
  [^reading-and-writing-floats-accurately]: **Implementation note.**
    Your language's standard library likely has a good routine for
    converting between decimal notation and IEEE 754 floating-point.
    However, if not, or if you are interested in the challenges of
    accurately reading and writing floating point numbers, see the
    excellent matched pair of 1990 papers by Clinger and Steele &
    White, and a recent follow-up by Jaffer:
    Clinger, William D. ‘How to Read Floating Point Numbers
    Accurately’. In Proc. PLDI. White Plains, New York, 1990.
    <https://doi.org/10.1145/93542.93557>.
    Steele, Guy L., Jr., and Jon L. White. ‘How to Print
    Floating-Point Numbers Accurately’. In Proc. PLDI. White Plains,
    New York, 1990. <https://doi.org/10.1145/93542.93559>.
    Jaffer, Aubrey. ‘Easy Accurate Reading and Writing of
    Floating-Point Numbers’. ArXiv:1310.8121 [Cs], 27 October 2013.
    <http://arxiv.org/abs/1310.8121>.
  [^arbitrary-precision-signedinteger]: **Implementation note.** Be
    aware when implementing reading and writing of `SignedInteger`s
    that the data model *requires* arbitrary-precision integers. Your
    implementation may (but, ideally, should not) truncate precision
    when reading or writing a `SignedInteger`; however, if it does so,
    it should (a) signal its client that truncation has occurred, and
    (b) make it clear to the client that comparing such truncated
    values for equality or ordering will not yield results that match
    the expected semantics of the data model.
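
 As an illustration of the numeric grammar above, the following Python
 sketch classifies a token as a `Float`, `Double` or `SignedInteger`
 literal. The function name `classify_number` and the use of regular
 expressions are assumptions made for this example only; they are not
 part of any Preserves implementation.

```python
import re

# Building blocks transcribed from the ABNF above.
INT = r"-?(?:0|[1-9][0-9]*)"   # int  = ["-"] nat
FRAC = r"\.[0-9]+"             # frac = "." 1*DIGIT
EXP = r"[eE][-+]?[0-9]+"       # exp  = %i"e" ["-"/"+"] 1*DIGIT
FLT = rf"{INT}(?:{FRAC}{EXP}|{FRAC}|{EXP})"


def classify_number(token):
    """Return the kind of numeric literal `token` is, or None."""
    if re.fullmatch(FLT + r"[fF]", token):
        return "Float"            # flt followed by a case-insensitive "f"
    if re.fullmatch(FLT, token):
        return "Double"           # has a fraction and/or an exponent
    if re.fullmatch(INT, token):
        return "SignedInteger"    # neither fraction nor exponent
    return None
```

 Because `flt` always requires a fractional part or an exponent, a
 token such as `10` can only be a `SignedInteger`, and a token with a
 leading zero such as `01` matches none of the three rules.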
 `String`s are,
 [as in JSON](https://tools.ietf.org/html/rfc8259#section-7), possibly
 escaped text surrounded by double quotes. The escaping rules are the
 same as for JSON.[^string-json-correspondence] [^escaping-surrogate-pairs]
            String = %x22 *char %x22
              char = unescaped / %x7C / escape (escaped / %x22 / %s"u" 4HEXDIG)
         unescaped = %x20-21 / %x23-5B / %x5D-7B / %x7D-10FFFF
            escape = %x5C              ; \
           escaped = ( %x5C /          ; \    reverse solidus U+005C
                       %x2F /          ; /    solidus         U+002F
                       %x62 /          ; b    backspace       U+0008
                       %x66 /          ; f    form feed       U+000C
                       %x6E /          ; n    line feed       U+000A
                       %x72 /          ; r    carriage return U+000D
                       %x74 )          ; t    tab             U+0009
  [^string-json-correspondence]: The grammar for `String` has the same
    effect as the
    [JSON](https://tools.ietf.org/html/rfc8259#section-7) grammar for
    `string`. Some auxiliary definitions (e.g. `escaped`) are lifted
    largely unmodified from the text of RFC 8259.
  [^escaping-surrogate-pairs]: In particular, note JSON's rules around
    the use of surrogate pairs for code points not in the Basic
    Multilingual Plane. We encourage implementations to avoid using
    `\u` escapes when producing output, and instead to rely on the
    UTF-8 encoding of the entire document to handle non-ASCII
    codepoints correctly.
 A `ByteString` may be written in any of three different forms.
 The first is similar to a `String`, but prepended with a hash sign
 `#`. In addition, only Unicode code points overlapping with printable
 7-bit ASCII are permitted unescaped inside such a `ByteString`; other
 byte values must be escaped by prepending a two-digit hexadecimal
 value with `\x`.
        ByteString = "#" %x22 *binchar %x22
           binchar = binunescaped / escape (escaped / %x22 / %s"x" 2HEXDIG)
      binunescaped = %x20-21 / %x23-5B / %x5D-7E
 The second is as a sequence of pairs of hexadecimal digits interleaved
 with whitespace and surrounded by `#x"` and `"`.
       ByteString =/ %s"#x" %x22 *(ws / 2HEXDIG) ws %x22
 The third is as a sequence of
 [Base64](https://tools.ietf.org/html/rfc4648) characters, interleaved
 with whitespace and surrounded by `#[` and `]`. Plain and URL-safe
 Base64 characters are allowed.
       ByteString =/ "#[" *(ws / base64char) ws "]"
        base64char = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "="
 A `Symbol` may be written in a “bare” form[^cf-sexp-token] so long as
 it conforms to certain restrictions on the characters appearing in the
 symbol. Alternatively, it may be written in a quoted form. The quoted
 form is much the same as the syntax for `String`s, including embedded
 escape syntax, except using a bar or pipe character (`|`) instead of a
 double quote mark.
            Symbol = symstart *symcont / "|" *symchar "|"
          symstart = ALPHA / sympunct / symustart
           symcont = ALPHA / sympunct / symustart / symucont / DIGIT / "-"
          sympunct = "~" / "!" / "$" / "%" / "^" / "&" / "*" /
                     "?" / "_" / "=" / "+" / "/" / "."
           symchar = unescaped / %x22 / escape (escaped / %x7C / %s"u" 4HEXDIG)
         symustart = <any code point greater than 127 whose Unicode
                      category is Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me,
                      Pc, Po, Sc, Sm, Sk, So, or Co>
          symucont = <any code point greater than 127 whose Unicode
                      category is Nd, Nl, No, or Pd>
  [^cf-sexp-token]: Compare with the [SPKI S-expression][sexp.txt]
    definition of “token representation”, and with the
    [R6RS definition of identifiers](http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-7.html#node_sec_4.2.4).
 An `Embedded` is written as a `Value` chosen to represent the denoted
 object, prefixed with `#!`.
           Embedded = "#!" Value
 Finally, any `Value` may be represented by escaping from the textual
 syntax to the [compact binary syntax](#compact-binary-syntax) by
 prefixing a `ByteString` containing the binary representation of the
 `Value` with `#=`.[^rationale-switch-to-binary]
 [^no-literal-binary-in-text] [^compact-value-annotations]
           Compact = "#=" ws ByteString
  [^rationale-switch-to-binary]: **Rationale.** The textual syntax
    cannot express every `Value`: specifically, it cannot express the
    several million floating-point NaNs, or the two floating-point
    Infinities. Since the compact binary format for `Value`s expresses
    each `Value` with precision, embedding binary `Value`s solves the
    problem.
  [^no-literal-binary-in-text]: Every text is ultimately physically
    stored as bytes; therefore, it might seem possible to escape to
    the raw binary form of compact binary encoding from within a
    piece of textual syntax. However, while bytes must be involved in
    any *representation* of text, the text *itself* is logically a
    sequence of *code points* and is not *intrinsically* a binary
    structure at all. It would be incoherent to expect to be able to
    access the representation of the text from within the text itself.
  [^compact-value-annotations]: Any text-syntax annotations preceding
    the `#` are prepended to any binary-syntax annotations yielded by
    decoding the `ByteString`.
 ### Annotations.
 **Syntax.** When written down, a `Value` may have an associated
 sequence of *annotations* carrying “out-of-band” contextual metadata
 about the value. Each annotation is, in turn, a `Value`, and may
 itself have annotations. The ordering of annotations attached to a
 `Value` is significant.
            Value =/ ws "@" Value Value
 Each annotation is preceded by `@`; the underlying annotated value
 follows its annotations. Here we extend only the syntactic nonterminal
 named “`Value`” without altering the semantic class of `Value`s.
 **Comments.** Strings annotating a `Value` are conventionally
 interpreted as comments associated with that value. Comments are
 sufficiently common that special syntax exists for them.
            Value =/ ws
                     ";" *(%x00-09 / %x0B-0C / %x0E-10FFFF) newline
                     Value
 When written this way, everything between the `;` and the newline is
 included in the string annotating the `Value`.
 **Equivalence.** Annotations appear within syntax denoting a `Value`;
 however, the annotations are not part of the denoted value. They are
 only part of the syntax. Annotations do not play a part in
 equivalences and orderings of `Value`s.
 Reflective tools such as debuggers, user interfaces, and message
 routers and relays---tools which process `Value`s generically---may
 use annotated inputs to tailor their operation, or may insert
 annotations in their outputs. By contrast, in ordinary programs, as a
 rule of thumb, the presence, absence or content of an annotation
 should not change the control flow or output of the program.
 Annotations are data *describing* `Value`s, and are not in the domain
 of any specific application of `Value`s. That is, an annotation will
 almost never cause a non-reflective program to do anything observably
 different.
 ## Compact Binary Syntax
 A `Repr` is a binary-syntax encoding, or representation, of a `Value`.
 For a value `v`, we write `«v»` for the `Repr` of `v`.
 ### Type and Length representation.
 Each `Repr` starts with a tag byte, describing the kind of information
 represented. Depending on the tag, a length indicator, further encoded
 information, and/or an ending tag may follow.
    tag                          (simple atomic data and small integers)
    tag ++ binarydata            (most integers)
    tag ++ length ++ binarydata  (large integers, strings, symbols, and binary)
    tag ++ repr ++ ... ++ endtag (compound data)
 The unique end tag is byte value `0x84`.
 If present after a tag, the length of a following piece of binary data
 is formatted as a [base 128 varint][varint].[^see-also-leb128] We
 write `varint(m)` for the varint-encoding of `m`. Quoting the
 [Google Protocol Buffers][varint] definition,
  [^see-also-leb128]: Also known as [LEB128][] encoding, for unsigned
    integers. Varints and LEB128-encoded integers differ only for
    signed integers, which are not used in Preserves.
 > Each byte in a varint, except the last byte, has the most
 > significant bit (msb) set – this indicates that there are further
 > bytes to come. The lower 7 bits of each byte are used to store the
 > two's complement representation of the number in groups of 7 bits,
 > least significant group first.
 The following table illustrates varint-encoding.
 | Number, `m` | `m` in binary, grouped into 7-bit chunks  | `varint(m)` bytes |
 | ------      | -------------------                       | ------------      |
 | 15          | `0001111`                                 | 15                |
 | 300         | `0000010 0101100`                         | 172 2             |
 | 1000000000  | `0000011 1011100 1101011 0010100 0000000` | 128 148 235 220 3 |
 It is an error for a varint-encoded `m` in a `Repr` to be anything
 other than the unique shortest encoding for that `m`. That is, a
 varint-encoding of `m` *MUST NOT* end in `0` unless `m`=0.
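
 The varint rules above can be sketched in Python as follows. The names
 `varint` and `decode_varint`, and the choice to raise `ValueError` for
 an overlong encoding, are assumptions made for this example.

```python
from typing import Tuple


def varint(m: int) -> bytes:
    """Encode a non-negative integer as a base 128 varint."""
    if m < 0:
        raise ValueError("varints encode non-negative integers only")
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        m >>= 7
    out.append(m)                      # final byte, continuation bit clear
    return bytes(out)


def decode_varint(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode a varint at `pos`; return (value, position after it)."""
    start, m, shift = pos, 0, 0
    while True:
        b = data[pos]
        pos += 1
        m |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:
            # A multi-byte encoding ending in 0 is not the shortest form.
            if b == 0 and pos - start > 1:
                raise ValueError("varint is not in shortest form")
            return m, pos
```

 The encoder reproduces the rows of the table above, and the decoder
 rejects an input such as `80 00`, which is an overlong encoding of 0.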
 ### Records, Sequences, Sets and Dictionaries.
          «<L F_1...F_m>» = [0xB4] ++ «L» ++ «F_1» ++...++ «F_m» ++ [0x84]
            «[X_1...X_m]» = [0xB5] ++ «X_1» ++...++ «X_m» ++ [0x84]
           «#{E_1...E_m}» = [0xB6] ++ «E_1» ++...++ «E_m» ++ [0x84]
    «{K_1:V_1...K_m:V_m}» = [0xB7] ++ «K_1» ++ «V_1» ++...++ «K_m» ++ «V_m» ++ [0x84]
 There is *no* ordering requirement on the `E_i` elements or
 `K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any
 order. However, the `E_i` and `K_i` *MUST* be pairwise distinct. In
 addition, implementations *SHOULD* default to writing set elements and
 dictionary key/value pairs in order sorted lexicographically by their
 `Repr`s[^not-sorted-semantically], and *MAY* offer the option of
 serializing in some other implementation-defined order.
  [^no-sorting-rationale]: In the BitTorrent encoding format,
    [bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding),
    dictionary key/value pairs must be sorted by key. This is a
    necessary step for ensuring serialization of `Value`s is
    canonical. We do not require that key/value pairs (or set
    elements) be in sorted order for serialized `Value`s; however, a
    [canonical form][canonical] for `Repr`s does exist where a sorted
    ordering is required.
  [^not-sorted-semantically]: It's important to note that the sort
    ordering for writing out set elements and dictionary key/value
    pairs is *not* the same as the sort ordering implied by the
    semantic ordering of those elements or keys. For example, the
    `Repr` of a negative number very far from zero will start with a
    byte that is *greater* than the byte which starts the `Repr` of
    zero, making it sort lexicographically later by `Repr`, despite
    being semantically *less than* zero.
    **Rationale**. This is for ease-of-implementation reasons: not all
    languages can easily represent sorted sets or sorted dictionaries,
    but encoding and then sorting byte strings is much more likely to
    be within easy reach.
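
 The recommended default ordering can be sketched in Python by sorting
 the already-encoded elements as byte strings. The helper name
 `encode_set` is hypothetical, and checking that the elements are
 pairwise distinct is deliberately left out of this sketch.

```python
def encode_set(element_reprs):
    """Wrap encoded elements as a Set, ordered by their Reprs."""
    # Python compares bytes objects lexicographically by unsigned
    # byte value, which is the ordering the text above recommends.
    return b"\xB6" + b"".join(sorted(element_reprs)) + b"\x84"


zero = bytes.fromhex("90")           # «0»
minus_257 = bytes.fromhex("A1FEFF")  # «-257»

# -257 is semantically less than 0, yet its Repr sorts after «0».
encoded = encode_set([minus_257, zero])
```

 Here `encoded` is `B6 90 A1 FE FF 84`: the element `0` is written
 first even though `-257` is the smaller integer.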
 ### SignedIntegers.
    «x» when x ∈ SignedInteger = [0xB0] ++ varint(m) ++ intbytes(x)  if ¬(-3≤x≤12) ∧ m>16
                                 ([0xA0] + m - 1) ++ intbytes(x)     if ¬(-3≤x≤12) ∧ m≤16
                                 ([0xA0] + x)                        if  (-3≤x≤-1)
                                 ([0x90] + x)                        if  ( 0≤x≤12)
                               where m =        |intbytes(x)|
 Integers in the range [-3,12] are compactly represented with tags
 between `0x90` and `0x9F` because they are so frequently used.
 Integers up to 16 bytes long are represented with a single-byte tag
 encoding the length of the integer. Larger integers are represented
 with an explicit varint length. Every `SignedInteger` *MUST* be
 represented with its shortest possible encoding.
 The function `intbytes(x)` gives the big-endian two's-complement
 binary representation of `x`, taking exactly as many whole bytes as
 needed to unambiguously identify the value and its sign, and `m =
 |intbytes(x)|`. The most-significant bit in the first byte in
 `intbytes(x)` <!-- for `x`≠0 --> is the sign bit.[^zero-intbytes] For
 example,
      «87112285931760246646623899502532662132736»
        = B0 12 01 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00
                00 00
      «-257» = A1 FE FF        «-3» = 9D          «128» = A1 00 80
      «-256» = A1 FF 00        «-2» = 9E          «255» = A1 00 FF
      «-255» = A1 FF 01        «-1» = 9F          «256» = A1 01 00
      «-254» = A1 FF 02         «0» = 90        «32767» = A1 7F FF
      «-129» = A1 FF 7F         «1» = 91        «32768» = A2 00 80 00
      «-128» = A0 80           «12» = 9C        «65535» = A2 00 FF FF
      «-127» = A0 81           «13» = A0 0D     «65536» = A2 01 00 00
        «-4» = A0 FC          «127» = A0 7F    «131072» = A2 02 00 00
  [^zero-intbytes]: The value 0 needs zero bytes to identify the
    value, so `intbytes(0)` is the empty byte string. Non-zero values
    need at least one byte.
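
 The four cases of the `SignedInteger` encoding can be sketched in
 Python as below. The function names are assumptions made for this
 example, and `varint` is repeated here only so that the block is
 self-contained.

```python
def varint(m):
    """Base 128 varint encoding of a non-negative integer."""
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)


def intbytes(x):
    """Shortest big-endian two's-complement form; empty for zero."""
    if x == 0:
        return b""
    # Bits needed for the magnitude, plus one bit for the sign.
    magnitude_bits = (x if x >= 0 else ~x).bit_length()
    n = (magnitude_bits + 1 + 7) // 8
    return x.to_bytes(n, "big", signed=True)


def encode_signed_integer(x):
    if 0 <= x <= 12:
        return bytes([0x90 + x])             # tags 0x90..0x9C
    if -3 <= x <= -1:
        return bytes([0xA0 + x])             # tags 0x9D..0x9F
    body = intbytes(x)
    m = len(body)
    if m <= 16:
        return bytes([0xA0 + m - 1]) + body  # length held in the tag
    return b"\xB0" + varint(m) + body        # explicit varint length
```

 For instance, `encode_signed_integer(-257)` yields `A1 FE FF` and
 `encode_signed_integer(128)` yields `A1 00 80`, matching the examples
 above.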
 ### Strings, ByteStrings and Symbols.
 Syntax for these three types varies only in the tag used. For `String`
 and `Symbol`, the data following the tag is a UTF-8 encoding of the
 `Value`'s code points, while for `ByteString` it is the raw data
 contained within the `Value` unmodified.
    «S» = [0xB1] ++ varint(|utf8(S)|) ++ utf8(S)  if S ∈ String
          [0xB2] ++ varint(|S|) ++ S              if S ∈ ByteString
          [0xB3] ++ varint(|utf8(S)|) ++ utf8(S)  if S ∈ Symbol
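
 A minimal Python sketch of these three encodings follows. The helper
 names are hypothetical, and `varint` is included only to keep the
 block self-contained. Note that the length is the number of UTF-8
 bytes, not the number of code points.

```python
def varint(m):
    """Base 128 varint encoding of a non-negative integer."""
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)


def _encode_text(tag, text):
    data = text.encode("utf-8")
    return bytes([tag]) + varint(len(data)) + data


def encode_string(s):
    return _encode_text(0xB1, s)


def encode_symbol(s):
    return _encode_text(0xB3, s)


def encode_bytestring(b):
    # Raw octets are written unmodified after the length.
    return b"\xB2" + varint(len(b)) + bytes(b)
```

 For example, `encode_string("Image")` yields `B1 05` followed by the
 five ASCII bytes of `Image`.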
 ### Booleans.
    «#f» = [0x80]
    «#t» = [0x81]
 ### Floats and Doubles.
    «F» when F ∈ Float  = [0x82] ++ binary32(F)
    «D» when D ∈ Double = [0x83] ++ binary64(D)
 The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
 8-byte IEEE 754 binary representations of `F` and `D`, respectively.
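
 In Python, the standard `struct` module can produce these big-endian
 representations. The function names below are assumptions made for
 this sketch.

```python
import struct


def encode_float(f):
    # ">f" packs a big-endian IEEE 754 binary32 value.
    return b"\x82" + struct.pack(">f", f)


def encode_double(d):
    # ">d" packs a big-endian IEEE 754 binary64 value.
    return b"\x83" + struct.pack(">d", d)
```

 For example, `encode_double(1.0)` yields `83 3F F0 00 00 00 00 00 00`.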
 ### Embeddeds.
 The `Repr` of an `Embedded` is the `Repr` of a `Value` chosen to
 represent the denoted object, prefixed with `[0x86]`.
    «#!V» = [0x86] ++ «V»
 ### Annotations.
 To annotate a `Repr` `r` with some `Value` `v`, prepend `r` with
 `[0x85] ++ «v»`. For example, the `Repr` corresponding to textual
 syntax `@a@b[]`, i.e. an empty sequence annotated with two symbols,
 `a` and `b`, is
    «@a @b []»
      = [0x85] ++ «a» ++ [0x85] ++ «b» ++ «[]»
      = [0x85, 0xB3, 0x01, 0x61, 0x85, 0xB3, 0x01, 0x62, 0xB5, 0x84]
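
 The same construction can be sketched in Python. The helper name
 `annotate` is hypothetical; it takes already-encoded `Repr`s and
 prepends each annotation in order.

```python
def annotate(repr_bytes, *annotation_reprs):
    """Prefix `repr_bytes` with [0x85] ++ «v» for each annotation v."""
    prefix = b"".join(b"\x85" + a for a in annotation_reprs)
    return prefix + repr_bytes


sym_a = bytes([0xB3, 0x01, 0x61])  # «a»
sym_b = bytes([0xB3, 0x01, 0x62])  # «b»
empty_seq = bytes([0xB5, 0x84])    # «[]»

annotated = annotate(empty_seq, sym_a, sym_b)
```

 The value of `annotated` is the ten-byte sequence shown above.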
 ## Examples
 The definitions above are independent of any particular concrete syntax.
 The examples of `Value`s that follow are written using [the Preserves
 text syntax](preserves-text.html), and the example encoded byte
 sequences use [the Preserves binary encoding](preserves-binary.html).
 ### Ordering.
 The total ordering specified [above](#total-order) means that the following statements are true:
@ -720,10 +285,23 @@ encodes to
 ### JSON examples.
-The examples from
+Preserves text syntax is a superset of JSON, so the examples from [RFC
-[RFC 8259](https://tools.ietf.org/html/rfc8259#section-13) read as
+8259](https://tools.ietf.org/html/rfc8259#section-13) read as valid
-valid Preserves, though the JSON literals `true`, `false` and `null`
+Preserves.
-read as `Symbol`s. The first example:
+
 The JSON literals `true`, `false` and `null` all read as `Symbol`s, and
 JSON numbers read (unambiguously) either as `SignedInteger`s or as
 `Double`s.[^json-superset]
  [^json-superset]: The following [schema](./preserves-schema.html)
    definitions match exactly the JSON subset of a Preserves input:
        version 1 .
        JSON = @string string / @integer int / @double double / @boolean JSONBoolean / @null =null
             / @array [JSON ...] / @object { string: JSON ...:... } .
        JSONBoolean = =true / =false .
 The first RFC 8259 example:
    {
      "Image": {
@ -740,7 +318,8 @@ read as `Symbol`s. The first example:
        }
    }
-encodes to binary as follows:
+when read using the Preserves text syntax encodes via the binary syntax
 as follows:
    B7
      B1 05 "Image"
@ -764,7 +343,7 @@ encodes to binary as follows:
      84
    84
-and the second example:
+The second RFC 8259 example:
    [
      {
@ -814,89 +393,5 @@ encodes to binary as follows:
      84
    84
 ## Security Considerations
 **Whitespace.** The textual format allows arbitrary whitespace in many
 positions. Consider optional restrictions on the amount of consecutive
 whitespace that may appear.
 **Annotations.** Similarly, in modes where a `Value` is being read
 while annotations are skipped, an endless sequence of annotations may
 give an illusion of progress.
 **Canonical form for cryptographic hashing and signing.** No canonical
 textual encoding of a `Value` is specified. A
 [canonical form][canonical] exists for binary encoded `Value`s, and
 implementations *SHOULD* produce canonical binary encodings by
 default; however, an implementation *MAY* permit two serializations of
 the same `Value` to yield different binary `Repr`s.
 ## Acknowledgements
 The treatment of commas as whitespace in the text syntax is inspired
 by the same feature of [EDN](https://github.com/edn-format/edn).
 The text syntax for `Boolean`s, `Symbol`s, and `ByteString`s is
 directly inspired by [Racket](https://racket-lang.org/)'s lexical
 syntax.
 ## Appendix. Autodetection of textual or binary syntax
 Every tag byte in a binary Preserves `Document` falls within the range
 [`0x80`, `0xBF`]. These bytes, interpreted as UTF-8, are *continuation
 bytes*, and will never occur as the first byte of a UTF-8 encoded code
 point. This means no binary-encoded document can be misinterpreted as
 valid UTF-8.
 Conversely, a UTF-8 document must start with a valid codepoint,
 meaning in particular that it must not start with a byte in the range
 [`0x80`, `0xBF`]. This means that no UTF-8 encoded textual-syntax
 Preserves document can be misinterpreted as a binary-syntax document.
 Examination of the top two bits of the first byte of a document gives
 its syntax: if the top two bits are `10`, it should be interpreted as
 a binary-syntax document; otherwise, it should be interpreted as text.
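
 The check on the top two bits can be sketched in Python as follows.
 The function name `detect_syntax` and its string results are
 assumptions made for this example.

```python
def detect_syntax(document):
    """Guess the syntax of a non-empty document from its first byte."""
    if not document:
        raise ValueError("empty document")
    # Top two bits 10 means a UTF-8 continuation byte, i.e. a binary tag.
    return "binary" if (document[0] & 0xC0) == 0x80 else "text"
```

 For example, a document beginning with the tag byte `0xB5` is detected
 as binary, while one beginning with `[` (`0x5B`) is detected as text.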
 ## Appendix. Table of tag values
     80 - False
     81 - True
     82 - Float
     83 - Double
     84 - End marker
     85 - Annotation
     86 - Embedded
    (8x)  RESERVED 87-8F
     9x - Small integers 0..12,-3..-1
     An - Medium integers, (n+1) bytes long
     B0 - Large integers, variable length
     B1 - String
     B2 - ByteString
     B3 - Symbol
     B4 - Record
     B5 - Sequence
     B6 - Set
     B7 - Dictionary
 ## Appendix. Binary SignedInteger representation
 Languages that provide fixed-width machine word types may find the
 following table useful in encoding and decoding binary `SignedInteger`
 values.
 | Integer range                              | Bytes required | Encoding (hex)                               |
 | ---                                        | ---            | ---                                          |
 | -3 ≤ n ≤ 12                                | 1              | `9X`                                         |
 | -2<sup>7</sup> ≤ n < 2<sup>7</sup> (i8)    | 2              | `A0` `XX`                                    |
 | -2<sup>15</sup> ≤ n < 2<sup>15</sup> (i16) | 3              | `A1` `XX` `XX`                               |
 | -2<sup>23</sup> ≤ n < 2<sup>23</sup> (i24) | 4              | `A2` `XX` `XX` `XX`                          |
 | -2<sup>31</sup> ≤ n < 2<sup>31</sup> (i32) | 5              | `A3` `XX` `XX` `XX` `XX`                     |
 | -2<sup>39</sup> ≤ n < 2<sup>39</sup> (i40) | 6              | `A4` `XX` `XX` `XX` `XX` `XX`                |
 | -2<sup>47</sup> ≤ n < 2<sup>47</sup> (i48) | 7              | `A5` `XX` `XX` `XX` `XX` `XX` `XX`           |
 | -2<sup>55</sup> ≤ n < 2<sup>55</sup> (i56) | 8              | `A6` `XX` `XX` `XX` `XX` `XX` `XX` `XX`      |
 | -2<sup>63</sup> ≤ n < 2<sup>63</sup> (i64) | 9              | `A7` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
 <!-- Heading to visually offset the footnotes from the main document: -->
 ## Notes