From b43d372014de0b1532199e9b777d755151f1ba12 Mon Sep 17 00:00:00 2001 From: Tony Garnock-Jones Date: Sun, 19 Jun 2022 15:56:03 +0200 Subject: [PATCH] Smaller simpler (?) presentation of binary syntax --- preserves-binary.md | 201 ++++++++++++++++++++------------------------ 1 file changed, 91 insertions(+), 110 deletions(-) diff --git a/preserves-binary.md b/preserves-binary.md index a2fe195..30054f0 100644 --- a/preserves-binary.md +++ b/preserves-binary.md @@ -21,34 +21,72 @@ syntax](preserves-text.html) also exists. ## Machine-Oriented Binary Syntax A `Repr` is a binary-syntax encoding, or representation, of a `Value`. -For a value `v`, we write `«v»` for the `Repr` of v. ### Type and Length representation. Each `Repr` starts with a tag byte, describing the kind of information -represented. - -However, inspired by [argdata][], a `Repr` does *not* describe its own -length. Instead, the expected length of the `Repr` is always available +represented. The expected length of the `Repr` is always available from the surrounding context: either from a containing encoded value, or from the overall container of the data, which could be a file, an HTTP message, a UDP packet, etc. -As a consequence, `Repr`s for `Compound` values store the lengths of -their contained values. Each contained `Value` is represented as a -length in bytes followed by its own `Repr`. Implementations use each -stored length to decide when to stop reading the following `Repr`. +### Atomic Values. + +**Booleans.** The "false" boolean's `Repr` is just tag `0xA0`; "true" is +`0xA1`. + +**Floats and Doubles.** Both `Float` and `Double` values are represented +as tag `0xA2` followed by big-endian 4- or 8-byte IEEE 754 binary +representations of the values, respectively. 
+ +**SignedIntegers.** A `SignedInteger` encodes as tag `0xA3` followed by +a big-endian two's-complement binary representation of the value, taking +at least as many whole bytes as needed to unambiguously identify the +value and its sign. Zero may be represented as the tag alone, with no +following bytes. The most-significant bit in the first byte after the +tag is the sign bit.[^zero-intbytes] The shortest possible encoding +*SHOULD* be used.[^overlong-signedinteger] + + [^zero-intbytes]: The value 0 needs zero bytes to identify the value, + so `intbytes(0)` can be the empty byte string. Non-zero values need + at least one byte. + + [^overlong-signedinteger]: **Implementation note.** The spec permits + overlong `SignedInteger` encodings to allow e.g. construction of + `Repr`s by filling in partially-completed templates, which can be + useful in resource-constrained situations. + +**Strings.** A `String` encodes as tag `0xA4` followed by the UTF-8 +encoding of the string, with an additional trailing `NUL` (0) byte. The +`NUL` byte *MUST NOT* be treated as part of the `String`: it exists to +permit zero-copy C interoperability.[^zero-copy-c-string-interop] + + [^zero-copy-c-string-interop]: Some care must still be taken when + passing `String` `Repr`s directly to a C-style ABI, since `String`s + may contain the zero Unicode code point, which C library routines + will usually misinterpret as an end-of-string marker. + +**ByteStrings.** A `ByteString` encodes as tag `0xA5` followed by the +bytes themselves. + +**Symbols.** A `Symbol` encodes as tag `0xA6` followed by the UTF-8 +encoding of the symbol's code points. + +### Compound Values. + +`Repr`s for `Compound` values store the lengths of their contained +values. Each contained `Value` is converted to a `Repr` and stored as +the length of the `Repr` in bytes followed by the `Repr` itself. +Implementations use each stored length to decide when to stop reading +the associated `Repr`. 
Similarly, no sentinel marks the end of a +sequence of length-prefixed `Repr`s. Implementations use the length of +the containing `Repr`, known from the surrounding context, to decide +when to stop expecting more contained `Repr`s. Each length is stored as an [argdata][]-compatible big-endian base 128 *varint*.[^see-also-leb128] Each byte of a varint stores seven bits of the length. All bytes have a clear upper bit, -except the final byte, which has the upper bit set. We write -`len(m)` for the varint-encoding of a non-negative integer `m`, -defined recursively as follows: - - len(m) = e(m, 128) - where e(v, d) = [v + d] if v < 128 - e(v / 128, 0) ++ [(v % 128) + d] if v ≥ 128 +except the final byte, which has the upper bit set. [^see-also-leb128]: Argdata's length representation is very close to [Variable-length quantity (VLQ)][VLQ] encoding, differing only in @@ -56,10 +94,8 @@ defined recursively as follows: big-endian, unlike [LEB128][] encoding ([as used by Google][google-varint] in protobufs). -We write `len(|r|)` for the varint-encoding of the length of `Repr` `r`. - -There is no requirement that a varint-encoded `m` in a `Repr` be the -unique shortest encoding for that `m`.[^overlong-varint] However, +There is no requirement that a varint-encoded length be the unique +shortest encoding for the length.[^overlong-varint] However, implementations *SHOULD* use the shortest encoding whereever possible when writing, and *MAY* reject encodings with more than eight leading `0` bytes when reading encoded values. @@ -69,21 +105,24 @@ when writing, and *MAY* reject encodings with more than eight leading anything other than a very low-level language, it is likely to be able to use [IOList](./conventions.html#iolists)-style data structures to avoid unnecessary copying. -### Records, Sequences, Sets and Dictionaries. +**Records.** A `Record` is encoded as tag `0xA7` followed by the +length-prefixed encodings of its label and fields. 
- «<L F_1...F_m>» = [0xA7] ++ seq(«L», «F_1», ..., «F_m») - «[X_1...X_m]» = [0xA8] ++ seq(«X_1», ..., «X_m») - «#{E_1...E_m}» = [0xA9] ++ seq(«E_1», ..., «E_m») - «{K_1:V_1...K_m:V_m}» = [0xAA] ++ seq(«K_1», «V_1», ..., «K_m», «V_m») +**Sequences.** A `Sequence` is encoded as tag `0xA8` followed by the +length-prefixed encodings of its members. - seq(R_1, ..., R_m) = len(|R_1|) ++ R_1 ++...++ len(|R_m|) ++ R_m +**Sets.** A `Set` is encoded like a `Sequence`, but with tag `0xA9`, and +in some arbitrary order. -There is *no* ordering requirement on the `E_i` elements or -`K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any -order. However, the `E_i` and `K_i` *MUST* be pairwise distinct. In -addition, implementations *SHOULD* default to writing set elements and -dictionary key/value pairs in order sorted lexicographically by their -`Repr`s[^not-sorted-semantically], and *MAY* offer the option of +**Dictionaries.** A `Dictionary` encodes as tag `0xAA` followed by the +length-prefixed keys and values, in an alternating key/value sequence. + +There is *no* ordering requirement on the elements of sets or the +key/value pairs of dictionaries.[^no-sorting-rationale] However, +elements of sets and keys in dictionaries *MUST* be pairwise distinct. +In addition, implementations *SHOULD* default to writing set elements +and dictionary key/value pairs in order sorted lexicographically by +their `Repr`s[^not-sorted-semantically], and *MAY* offer the option of serializing in some other implementation-defined order. [^no-sorting-rationale]: In the BitTorrent encoding format, @@ -109,93 +148,33 @@ serializing in some other implementation-defined order. but encoding and then sorting byte strings is much more likely to be within easy reach. -No sentinel marks the end of a sequence of length-prefixed `Repr`s. -During decoding, use the length of the containing `Repr` to decide when -to stop expecting more contained `Repr`s. +### Embedded Values. -### SignedIntegers.
- - «x» when x ∈ SignedInteger = [0xA3] ++ intbytes(x) - -The function `intbytes(x)` gives a big-endian two's-complement binary -representation of `x`, taking at least as many whole bytes as needed to -unambiguously identify the value and its sign; `intbytes(0)` may be the -empty byte sequence.[^zero-intbytes] The most-significant bit in the -first byte in `intbytes(x)` is the sign bit. While every `SignedInteger` -*SHOULD* be represented with its shortest possible encoding (which will -often include a necessary leading `0xFF` or `0x00`), redundant leading -`0xFF` or `0x00` bytes *MAY* be used.[^overlong-signedinteger] - - [^zero-intbytes]: The value 0 needs zero bytes to identify the value, - so `intbytes(0)` can be the empty byte string. Non-zero values need - at least one byte. - - [^overlong-signedinteger]: **Implementation note.** The spec permits - overlong `SignedInteger` encodings to allow e.g. construction of - `Repr`s by filling in partially-completed templates, which can be - useful in resource-constrained situations. - -### Strings, ByteStrings and Symbols. - - «S» = [0xA4] ++ utf8(S) ++ [0] if S ∈ String - [0xA5] ++ S if S ∈ ByteString - [0xA6] ++ utf8(S) if S ∈ Symbol - -For `String` and `Symbol`, the data following the tag is a UTF-8 -encoding of the `Value`'s code points, while for `ByteString` it is the -raw data contained within the `Value` unmodified. - -Each `String` has a trailing zero byte appended. This extra byte *MUST -NOT* be treated as part of the `Value`: it exists to permit zero-copy C -interoperability.[^zero-copy-c-string-interop] - - [^zero-copy-c-string-interop]: Some care must still be taken when - passing `String` `Repr`s directly to a C-style ABI, since `String`s - may contain the zero Unicode code point, which C library routines - will usually misinterpret as an end-of-string marker. - -### Booleans. - - «#f» = [0xA0] - «#t» = [0xA1] - -### Floats and Doubles. 
- - «F» when F ∈ Float = [0xA2] ++ binary32(F) - «D» when D ∈ Double = [0xA2] ++ binary64(D) - -The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and -8-byte IEEE 754 binary representations of `F` and `D`, respectively. - -### Embeddeds. - -The `Repr` of an `Embedded` is the `Repr` of a `Value` chosen to -represent the denoted object, prefixed with `[0xAB]`. - - «#!V» = [0xAB] ++ «V» +Embedded values are encoded as tag `0xAB` followed by the encoding of +some `Value` chosen to represent the denoted embedded object. ### Annotations. -To annotate a `Repr` `r` with some sequence of `Value`s `[v_1, ..., -v_m]`, surround `r` as follows: +The encoding of a sequence of annotations for a `Repr` uses tag `0xBF`, +followed by the length-prefixed `Repr`, followed by the length-prefixed +encoded annotations, in order. The `Repr` *MUST NOT* already have +annotations (must not begin with `0xBF`), and there *MUST* be at least +one `Value` in the sequence following the `Repr`. - [0xBF] ++ len(|r|) ++ r ++ len(|«v_1»|) ++ «v_1» ++...++ len(|«v_m»|) ++ «v_m» +## Examples (normative) -The `Repr` `r` *MUST NOT* already have annotations; that is, it must not -begin with `0xBF`. The sequence `[v_1, ..., v_m]` *MUST* contain at -least one `Value`. - -## Examples +We write `«v»` for the `Repr` of some `Value` `v`, and `varint(|«v»|)` for +the varint-encoded length of the `Repr` of `v`. ### Varints (length representations). The following table illustrates varint-encoding. 
-| Number, `m` | `m` in binary, grouped into 7-bit chunks | `len(m)` bytes | -|-------------|-------------------------------------------|-----------------| -| 15 | `0001111` | 143 | -| 300 | `0000010 0101100` | 2 172 | -| 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 3 92 107 20 128 | +| Number, `m` | `m` in binary, grouped into 7-bit chunks | `varint(m)` bytes | +|-------------|-------------------------------------------|-------------------| +| 15 | `0001111` | 143 | +| 300 | `0000010 0101100` | 2 172 | +| 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 3 92 107 20 128 | ### Atoms. @@ -288,7 +267,9 @@ The `Repr` corresponding to textual syntax `@a@b[]`, i.e. an empty sequence anno symbols, `a` and `b`, is «@a @b []» - = [0xBF] ++ len(|«[]»|) ++ «[]» ++ len(|«a»|) ++ «a» ++ len(|«b»|) ++ «b» + = [0xBF] ++ varint(|«[]»|) ++ «[]» + ++ varint(|«a»|) ++ «a» + ++ varint(|«b»|) ++ «b» = [0xBF, 0x81, 0xA8, 0x82, 0xA6, 0x61, 0x82, 0xA6, 0x62] ## Security Considerations @@ -346,7 +327,7 @@ undetermined number of `Value`s across, say, a TCP/IP connection: - If the binary syntax is to be used for the connection, start the connection with byte `0xA8` (sequence). After the initial byte, send - each value `v` as `len(|«v»|) ++ «v»`. A side effect of this approach + each value `v` as `varint(|«v»|) ++ «v»`. A side effect of this approach is that the entire stream, when complete, is a valid `Sequence` `Repr`.
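
The varint and length-prefixing rules that this patch now states in prose can be sketched in a few lines of Python. This is a minimal illustration, not part of the patch or of any Preserves implementation: `varint` borrows its name from the notation above, while `prefixed`, `symbol`, `sequence` and `annotate` are hypothetical helper names chosen for this example. The byte values it checks come from the varint table and the `@a @b []` example in the diff.

```python
def varint(m: int) -> bytes:
    """Big-endian base-128 varint: only the final byte has its upper bit set."""
    if m < 0:
        raise ValueError("lengths are non-negative")
    out = [(m % 128) | 0x80]      # final byte: low seven bits, upper bit set
    m //= 128
    while m > 0:
        out.append(m % 128)       # earlier bytes: upper bit clear
        m //= 128
    return bytes(reversed(out))   # most-significant chunk first


def prefixed(r: bytes) -> bytes:
    """A contained Repr is stored as its length in bytes, then the Repr itself."""
    return varint(len(r)) + r


def symbol(name: str) -> bytes:
    """Hypothetical helper: tag 0xA6 followed by the UTF-8 code points."""
    return b"\xa6" + name.encode("utf-8")


def sequence(*members: bytes) -> bytes:
    """Hypothetical helper: tag 0xA8 followed by length-prefixed members."""
    return b"\xa8" + b"".join(prefixed(m) for m in members)


def annotate(r: bytes, *annotations: bytes) -> bytes:
    """Hypothetical helper: tag 0xBF, the prefixed Repr, then prefixed annotations."""
    if r[:1] == b"\xbf":
        raise ValueError("Repr must not already have annotations")
    if not annotations:
        raise ValueError("at least one annotation is required")
    return b"\xbf" + prefixed(r) + b"".join(prefixed(a) for a in annotations)


# Rows of the varint table.
print(list(varint(15)), list(varint(300)), list(varint(1000000000)))

# The annotated empty sequence from the Annotations example.
print(annotate(sequence(), symbol("a"), symbol("b")).hex(" "))
```

Building the final byte first and reversing at the end keeps the "upper bit set only on the last byte" rule in one place. A streaming writer of the kind described in the last hunk could, under the same assumptions, send `b"\xa8"` once and then `prefixed(v)` for each value.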