Smaller simpler (?) presentation of binary syntax

2022-06-19 15:56:03 +02:00 · b43d372014
parent f28ae51215
commit b43d372014
1 changed file with 91 additions and 110 deletions
--- a/preserves-binary.md
+++ b/preserves-binary.md
@@ -21,34 +21,72 @@ syntax](preserves-text.html) also exists.
 ## Machine-Oriented Binary Syntax
 A `Repr` is a binary-syntax encoding, or representation, of a `Value`.
 For a value `v`, we write `«v»` for the `Repr` of v.
 ### Type and Length representation.
 Each `Repr` starts with a tag byte, describing the kind of information
-represented.
+represented. The expected length of the `Repr` is always available
 However, inspired by [argdata][], a `Repr` does *not* describe its own
 length. Instead, the expected length of the `Repr` is always available
 from the surrounding context: either from a containing encoded value, or
 from the overall container of the data, which could be a file, an HTTP
 message, a UDP packet, etc.
-As a consequence, `Repr`s for `Compound` values store the lengths of
+### Atomic Values.
-their contained values. Each contained `Value` is represented as a
+
-length in bytes followed by its own `Repr`. Implementations use each
+**Booleans.** The "false" boolean's `Repr` is just tag `0xA0`; "true" is
-stored length to decide when to stop reading the following `Repr`.
+`0xA1`.
 **Floats and Doubles.** Both `Float` and `Double` values are represented
 as tag `0xA2` followed by big-endian 4- or 8-byte IEEE 754 binary
 representations of the values, respectively.
 **SignedIntegers.** A `SignedInteger` encodes as tag `0xA3` followed by
 a big-endian two's-complement binary representation of the value, taking
 at least as many whole bytes as needed to unambiguously identify the
 value and its sign. Zero may be represented as the tag alone, with no
 following bytes. The most-significant bit in the first byte after the
 tag is the sign bit.[^zero-intbytes] The shortest possible encoding
 *SHOULD* be used.[^overlong-signedinteger]
  [^zero-intbytes]: The value 0 needs zero bytes to identify the value,
    so `intbytes(0)` can be the empty byte string. Non-zero values need
    at least one byte.
  [^overlong-signedinteger]: **Implementation note.** The spec permits
    overlong `SignedInteger` encodings to allow e.g. construction of
    `Repr`s by filling in partially-completed templates, which can be
    useful in resource-constrained situations.
 **Strings.** A `String` encodes as tag `0xA4` followed by the UTF-8
 encoding of the string, with an additional trailing `NUL` (0) byte. The
 `NUL` byte *MUST NOT* be treated as part of the `String`: it exists to
 permit zero-copy C interoperability.[^zero-copy-c-string-interop]
  [^zero-copy-c-string-interop]: Some care must still be taken when
    passing `String` `Repr`s directly to a C-style ABI, since `String`s
    may contain the zero Unicode code point, which C library routines
    will usually misinterpret as an end-of-string marker.
 **ByteStrings.** A `ByteString` encodes as tag `0xA5` followed by the
 bytes themselves.
 **Symbols.** A `Symbol` encodes as tag `0xA6` followed by the UTF-8
 encoding of the symbol's code points.
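
The atomic encodings described above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the `Symbol` class and the `encode_atom` and `intbytes` helper names are hypothetical, Python `float` is assumed to map to `Double` only, and the shortest `SignedInteger` form is always chosen.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class Symbol:
    """Hypothetical stand-in for a Preserves Symbol."""
    name: str


def intbytes(n: int) -> bytes:
    """Shortest big-endian two's-complement bytes; zero is the empty string."""
    if n == 0:
        return b""
    # Bits needed for the magnitude, plus one sign bit, rounded up to bytes.
    magnitude = n if n >= 0 else ~n
    length = magnitude.bit_length() // 8 + 1
    return n.to_bytes(length, "big", signed=True)


def encode_atom(v) -> bytes:
    # bool must be tested before int, since bool is a subclass of int.
    if v is False:
        return b"\xa0"
    if v is True:
        return b"\xa1"
    if isinstance(v, float):
        return b"\xa2" + struct.pack(">d", v)  # 8-byte Double (assumption)
    if isinstance(v, int):
        return b"\xa3" + intbytes(v)
    if isinstance(v, str):
        return b"\xa4" + v.encode("utf-8") + b"\x00"  # trailing NUL
    if isinstance(v, bytes):
        return b"\xa5" + v
    if isinstance(v, Symbol):
        return b"\xa6" + v.name.encode("utf-8")
    raise TypeError(f"not an atom: {v!r}")
```

Note that none of these `Repr`s carries its own length; a decoder needs the length from the surrounding context to know where the integer, string, or symbol ends.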
 ### Compound Values.
 `Repr`s for `Compound` values store the lengths of their contained
 values. Each contained `Value` is converted to a `Repr` and stored as
 the length of the `Repr` in bytes followed by the `Repr` itself.
 Implementations use each stored length to decide when to stop reading
 the associated `Repr`. Similarly, no sentinel marks the end of a
 sequence of length-prefixed `Repr`s. Implementations use the length of
 the containing `Repr`, known from the surrounding context, to decide
 when to stop expecting more contained `Repr`s.
 <a id="varint"></a> Each length is stored as an [argdata][]-compatible
 big-endian base 128 *varint*.[^see-also-leb128] Each byte of a varint
 stores seven bits of the length. All bytes have a clear upper bit,
-except the final byte, which has the upper bit set. We write
+except the final byte, which has the upper bit set.
 `len(m)` for the varint-encoding of a non-negative integer `m`,
 defined recursively as follows:
    len(m) = e(m, 128)
           where e(v, d) = [v + d]                           if v < 128
                           e(v / 128, 0) ++ [(v % 128) + d]  if v ≥ 128
  [^see-also-leb128]: Argdata's length representation is very close to
    [Variable-length quantity (VLQ)][VLQ] encoding, differing only in
@@ -56,10 +94,8 @@ defined recursively as follows:
    big-endian, unlike [LEB128][] encoding ([as used by
    Google][google-varint] in protobufs).
-We write `len(|r|)` for the varint-encoding of the length of `Repr` `r`.
+There is no requirement that a varint-encoded length be the unique
-
+shortest encoding for the length.[^overlong-varint] However,
 There is no requirement that a varint-encoded `m` in a `Repr` be the
 unique shortest encoding for that `m`.[^overlong-varint] However,
 implementations *SHOULD* use the shortest encoding wherever possible
 when writing, and *MAY* reject encodings with more than eight leading
 `0` bytes when reading encoded values.
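
As an illustration of the varint definition above, here is a small Python sketch. The function names `varint` and `read_varint` are chosen for this example. The reader accepts overlong encodings and does not implement the optional limit on leading `0` bytes.

```python
def varint(m: int) -> bytes:
    """Shortest big-endian base-128 encoding; only the final byte has bit 7 set."""
    if m < 0:
        raise ValueError("lengths are non-negative")
    out = [(m % 128) + 128]  # final (least significant) byte, upper bit set
    m //= 128
    while m > 0:
        out.append(m % 128)  # earlier bytes keep a clear upper bit
        m //= 128
    return bytes(reversed(out))


def read_varint(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Return (value, position after the varint); overlong forms are accepted."""
    value = 0
    while True:
        b = data[pos]  # raises IndexError on truncated input
        pos += 1
        if b >= 128:
            return value * 128 + (b - 128), pos
        value = value * 128 + b
```

For example, `varint(300)` yields the bytes `2 172`, and `read_varint(b"\x00\x8f")` decodes the overlong two-byte form of 15.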
@@ -69,21 +105,24 @@ when writing, and *MAY* reject encodings with more than eight leading
    anything other than a very low-level language, it is likely to be able to use
    [IOList](./conventions.html#iolists)-style data structures to avoid unnecessary copying.
-### Records, Sequences, Sets and Dictionaries.
+**Records.** A `Record` is encoded as tag `0xA7` followed by the
 length-prefixed encodings of its label and fields.
-          «<L F_1...F_m>» = [0xA7] ++ seq(«L», «F_1», ..., «F_m»)
+**Sequences.** A `Sequence` is encoded as tag `0xA8` followed by the
-            «[X_1...X_m]» = [0xA8] ++ seq(«X_1», ..., «X_m»)
+length-prefixed encodings of its members.
           «#{E_1...E_m}» = [0xA9] ++ seq(«E_1», ..., «E_m»)
    «{K_1:V_1...K_m:V_m}» = [0xAA] ++ seq(«K_1», «V_1», ..., «K_m», «V_m»)
-       seq(R_1, ..., R_m) = len(|R_1|) ++ R_1 ++...++ len(|R_m|) ++ R_m
+**Sets.** A `Set` is encoded like a `Sequence`, but with tag `0xA9`, and
 in some arbitrary order.
-There is *no* ordering requirement on the `E_i` elements or
+**Dictionaries.** A `Dictionary` encodes as tag `0xAA` followed by the
-`K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any
+length-prefixed keys and values, in an alternating key/value sequence.
-order. However, the `E_i` and `K_i` *MUST* be pairwise distinct. In
+
-addition, implementations *SHOULD* default to writing set elements and
+There is *no* ordering requirement on the elements of sets or the
-dictionary key/value pairs in order sorted lexicographically by their
+key/value pairs of dictionaries.[^no-sorting-rationale] However,
-`Repr`s[^not-sorted-semantically], and *MAY* offer the option of
+elements of sets and keys in dictionaries *MUST* be pairwise distinct.
 In addition, implementations *SHOULD* default to writing set elements
 and dictionary key/value pairs in order sorted lexicographically by
 their `Repr`s[^not-sorted-semantically], and *MAY* offer the option of
 serializing in some other implementation-defined order.
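
The compound encodings can be sketched in Python as below. This is an illustration under stated assumptions: the helper names are hypothetical, arguments are already-encoded `Repr`s (as `bytes`), dictionary pairs are sorted by the key's `Repr`, and distinctness is checked on `Repr`s, which is only reliable if every contained `Repr` uses its shortest encoding.

```python
def varint(m: int) -> bytes:
    out = [(m % 128) + 128]
    m //= 128
    while m > 0:
        out.append(m % 128)
        m //= 128
    return bytes(reversed(out))


def seq(*reprs: bytes) -> bytes:
    """Concatenate each Repr prefixed by its varint-encoded length."""
    return b"".join(varint(len(r)) + r for r in reprs)


def record(label: bytes, *fields: bytes) -> bytes:
    return b"\xa7" + seq(label, *fields)


def sequence(*members: bytes) -> bytes:
    return b"\xa8" + seq(*members)


def set_(*elements: bytes) -> bytes:
    if len(set(elements)) != len(elements):
        raise ValueError("set elements must be pairwise distinct")
    # bytes compare lexicographically, giving the default writing order.
    return b"\xa9" + seq(*sorted(elements))


def dictionary(pairs: dict[bytes, bytes]) -> bytes:
    # dict keys are distinct by construction; sort pairs by key Repr.
    flat = [r for k in sorted(pairs) for r in (k, pairs[k])]
    return b"\xaa" + seq(*flat)
```

An empty sequence is just the tag byte `0xA8`, because no end marker is written; the decoder stops when the containing `Repr` is exhausted.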
  [^no-sorting-rationale]: In the BitTorrent encoding format,
@@ -109,93 +148,33 @@ serializing in some other implementation-defined order.
    but encoding and then sorting byte strings is much more likely to
    be within easy reach.
-No sentinel marks the end of a sequence of length-prefixed `Repr`s.
+### Embedded Values.
 During decoding, use the length of the containing `Repr` to decide when
 to stop expecting more contained `Repr`s.
-### SignedIntegers.
+Embedded values are encoded as tag `0xAB` followed by the encoding of
-
+some `Value` chosen to represent the denoted embedded object.
    «x» when x ∈ SignedInteger = [0xA3] ++ intbytes(x)
 The function `intbytes(x)` gives a big-endian two's-complement binary
 representation of `x`, taking at least as many whole bytes as needed to
 unambiguously identify the value and its sign; `intbytes(0)` may be the
 empty byte sequence.[^zero-intbytes] The most-significant bit in the
 first byte in `intbytes(x)` is the sign bit. While every `SignedInteger`
 *SHOULD* be represented with its shortest possible encoding (which will
 often include a necessary leading `0xFF` or `0x00`), redundant leading
 `0xFF` or `0x00` bytes *MAY* be used.[^overlong-signedinteger]
  [^zero-intbytes]: The value 0 needs zero bytes to identify the value,
    so `intbytes(0)` can be the empty byte string. Non-zero values need
    at least one byte.
  [^overlong-signedinteger]: **Implementation note.** The spec permits
    overlong `SignedInteger` encodings to allow e.g. construction of
    `Repr`s by filling in partially-completed templates, which can be
    useful in resource-constrained situations.
 ### Strings, ByteStrings and Symbols.
    «S» = [0xA4] ++ utf8(S) ++ [0]  if S ∈ String
          [0xA5] ++ S               if S ∈ ByteString
          [0xA6] ++ utf8(S)         if S ∈ Symbol
 For `String` and `Symbol`, the data following the tag is a UTF-8
 encoding of the `Value`'s code points, while for `ByteString` it is the
 raw data contained within the `Value` unmodified.
 Each `String` has a trailing zero byte appended. This extra byte *MUST
 NOT* be treated as part of the `Value`: it exists to permit zero-copy C
 interoperability.[^zero-copy-c-string-interop]
  [^zero-copy-c-string-interop]: Some care must still be taken when
    passing `String` `Repr`s directly to a C-style ABI, since `String`s
    may contain the zero Unicode code point, which C library routines
    will usually misinterpret as an end-of-string marker.
 ### Booleans.
    «#f» = [0xA0]
    «#t» = [0xA1]
 ### Floats and Doubles.
    «F» when F ∈ Float  = [0xA2] ++ binary32(F)
    «D» when D ∈ Double = [0xA2] ++ binary64(D)
 The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
 8-byte IEEE 754 binary representations of `F` and `D`, respectively.
 ### Embeddeds.
 The `Repr` of an `Embedded` is the `Repr` of a `Value` chosen to
 represent the denoted object, prefixed with `[0xAB]`.
    «#!V» = [0xAB] ++ «V»
 ### Annotations.
-To annotate a `Repr` `r` with some sequence of `Value`s `[v_1, ...,
+The encoding of a sequence of annotations for a `Repr` uses tag `0xBF`,
-v_m]`, surround `r` as follows:
+followed by the length-prefixed `Repr`, followed by the length-prefixed
 encoded annotations, in order. The `Repr` *MUST NOT* already have
 annotations (must not begin with `0xBF`), and there *MUST* be at least
 one `Value` in the sequence following the `Repr`.
-    [0xBF] ++ len(|r|) ++ r ++ len(|«v_1»|) ++ «v_1» ++...++ len(|«v_m»|) ++ «v_m»
+## Examples (normative)
-The `Repr` `r` *MUST NOT* already have annotations; that is, it must not
+We write `«v»` for the `Repr` of some `Value` `v`, and `varint(|«v»|)` for
-begin with `0xBF`. The sequence `[v_1, ..., v_m]` *MUST* contain at
+the varint-encoded length of the `Repr` of `v`.
 least one `Value`.
 ## Examples
 ### Varints (length representations).
 The following table illustrates varint-encoding.
-| Number, `m` | `m` in binary, grouped into 7-bit chunks  | `len(m)` bytes  |
+| Number, `m` | `m` in binary, grouped into 7-bit chunks  | `varint(m)` bytes |
-|-------------|-------------------------------------------|-----------------|
+|-------------|-------------------------------------------|-------------------|
-| 15          | `0001111`                                 | 143             |
+| 15          | `0001111`                                 | 143               |
-| 300         | `0000010 0101100`                         | 2 172           |
+| 300         | `0000010 0101100`                         | 2 172             |
-| 1000000000  | `0000011 1011100 1101011 0010100 0000000` | 3 92 107 20 128 |
+| 1000000000  | `0000011 1011100 1101011 0010100 0000000` | 3 92 107 20 128   |
 ### Atoms.
@@ -288,7 +267,9 @@ The `Repr` corresponding to textual syntax `@a@b[]`, i.e. an empty sequence anno
 symbols, `a` and `b`, is
    «@a @b []»
-      = [0xBF] ++ len(|«[]»|) ++ «[]» ++ len(|«a»|) ++ «a» ++ len(|«b»|) ++ «b»
+      = [0xBF] ++ varint(|«[]»|) ++ «[]»
               ++ varint(|«a»|) ++ «a»
               ++ varint(|«b»|) ++ «b»
      = [0xBF, 0x81, 0xA8, 0x82, 0xA6, 0x61, 0x82, 0xA6, 0x62]
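
The byte sequence above can be reproduced with a short Python sketch. The `annotate` helper is a hypothetical name; it takes already-encoded `Repr`s and enforces the two *MUST* rules for annotations.

```python
def varint(m: int) -> bytes:
    out = [(m % 128) + 128]
    m //= 128
    while m > 0:
        out.append(m % 128)
        m //= 128
    return bytes(reversed(out))


def annotate(r: bytes, *annotations: bytes) -> bytes:
    """Wrap Repr r with tag 0xBF and length-prefixed annotation Reprs."""
    if r[:1] == b"\xbf":
        raise ValueError("Repr is already annotated")
    if not annotations:
        raise ValueError("at least one annotation is required")
    body = varint(len(r)) + r
    for a in annotations:
        body += varint(len(a)) + a
    return b"\xbf" + body


empty_sequence = b"\xa8"
sym_a = b"\xa6" + b"a"
sym_b = b"\xa6" + b"b"
encoded = annotate(empty_sequence, sym_a, sym_b)
```

Here `encoded` is the nine bytes `BF 81 A8 82 A6 61 82 A6 62`.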
 ## Security Considerations
@@ -346,7 +327,7 @@ undetermined number of `Value`s across, say, a TCP/IP connection:
 - If the binary syntax is to be used for the connection, start the
   connection with byte `0xA8` (sequence). After the initial byte, send
-   each value `v` as `len(|«v»|) ++ «v»`. A side effect of this approach
+   each value `v` as `varint(|«v»|) ++ «v»`. A side effect of this approach
   is that the entire stream, when complete, is a valid `Sequence`
   `Repr`.
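
The streaming convention for the binary syntax might look like this in Python. It is a rough sketch: `StreamWriter` and `read_stream` are hypothetical names, an in-memory buffer stands in for the connection, and the reader works on a completed stream rather than incrementally.

```python
import io


def varint(m: int) -> bytes:
    out = [(m % 128) + 128]
    m //= 128
    while m > 0:
        out.append(m % 128)
        m //= 128
    return bytes(reversed(out))


class StreamWriter:
    def __init__(self, out):
        self.out = out
        self.out.write(b"\xa8")  # open the stream as a Sequence

    def send(self, r: bytes) -> None:
        """Send one already-encoded Repr, prefixed by its length."""
        self.out.write(varint(len(r)) + r)


def read_stream(data: bytes):
    """Yield each contained Repr from a completed stream."""
    if data[:1] != b"\xa8":
        raise ValueError("stream must start with the Sequence tag")
    pos = 1
    while pos < len(data):
        length = 0
        while True:  # decode one varint length
            b = data[pos]
            pos += 1
            if b >= 128:
                length = length * 128 + (b - 128)
                break
            length = length * 128 + b
        yield data[pos:pos + length]
        pos += length


buf = io.BytesIO()
writer = StreamWriter(buf)
writer.send(b"\xa3\x01")  # SignedInteger 1
writer.send(b"\xa6a")     # Symbol a
```

After these two sends, the buffer holds exactly the `Repr` of a two-element `Sequence`.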