Smaller simpler (?) presentation of binary syntax

Tony Garnock-Jones 2022-06-19 15:56:03 +02:00
parent f28ae51215
commit b43d372014
1 changed files with 91 additions and 110 deletions

@ -21,34 +21,72 @@ syntax](preserves-text.html) also exists.
## Machine-Oriented Binary Syntax
A `Repr` is a binary-syntax encoding, or representation, of a `Value`.
For a value `v`, we write `«v»` for the `Repr` of `v`.
### Type and Length representation.
Each `Repr` starts with a tag byte, describing the kind of information
represented.
However, inspired by [argdata][], a `Repr` does *not* describe its own
length. Instead, the expected length of the `Repr` is always available
represented. The expected length of the `Repr` is always available
from the surrounding context: either from a containing encoded value, or
from the overall container of the data, which could be a file, an HTTP
message, a UDP packet, etc.
As a consequence, `Repr`s for `Compound` values store the lengths of
their contained values. Each contained `Value` is represented as a
length in bytes followed by its own `Repr`. Implementations use each
stored length to decide when to stop reading the following `Repr`.
### Atomic Values.
**Booleans.** The "false" boolean's `Repr` is just tag `0xA0`; "true" is
`0xA1`.
**Floats and Doubles.** Both `Float` and `Double` values are represented
as tag `0xA2` followed by big-endian 4- or 8-byte IEEE 754 binary
representations of the values, respectively.
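As a minimal sketch in Python, the two encodings share the tag and differ only in payload width. The function names here are illustrative and not part of the specification; the decoder assumes that the length of the `Repr`, known from the surrounding context, is what tells a `Float` from a `Double`.

```python
import struct

TAG_FLOAT_OR_DOUBLE = 0xA2  # tag shared by Float and Double, per the text above


def encode_float(x: float) -> bytes:
    """Tag byte followed by the big-endian 4-byte IEEE 754 form of x."""
    return bytes([TAG_FLOAT_OR_DOUBLE]) + struct.pack(">f", x)


def encode_double(x: float) -> bytes:
    """Tag byte followed by the big-endian 8-byte IEEE 754 form of x."""
    return bytes([TAG_FLOAT_OR_DOUBLE]) + struct.pack(">d", x)


def decode_float_or_double(repr_bytes: bytes) -> float:
    """Decode a complete Repr; its length (from context) selects the width."""
    if not repr_bytes or repr_bytes[0] != TAG_FLOAT_OR_DOUBLE:
        raise ValueError("not a Float or Double Repr")
    payload = repr_bytes[1:]
    if len(payload) == 4:
        return struct.unpack(">f", payload)[0]
    if len(payload) == 8:
        return struct.unpack(">d", payload)[0]
    raise ValueError("payload must be 4 or 8 bytes")
```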
**SignedIntegers.** A `SignedInteger` encodes as tag `0xA3` followed by
a big-endian two's-complement binary representation of the value, taking
at least as many whole bytes as needed to unambiguously identify the
value and its sign. Zero may be represented as the tag alone, with no
following bytes. The most-significant bit in the first byte after the
tag is the sign bit.[^zero-intbytes] The shortest possible encoding
*SHOULD* be used.[^overlong-signedinteger]
[^zero-intbytes]: The value 0 needs zero bytes to identify the value,
so `intbytes(0)` can be the empty byte string. Non-zero values need
at least one byte.
[^overlong-signedinteger]: **Implementation note.** The spec permits
overlong `SignedInteger` encodings to allow e.g. construction of
`Repr`s by filling in partially-completed templates, which can be
useful in resource-constrained situations.
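The shortest-encoding rule for `SignedInteger` can be sketched in Python as follows. The helper name `intbytes` follows the footnote above; `encode_signed_integer` is an illustrative name, and the sketch always emits the shortest form (so zero becomes the tag alone).

```python
TAG_SIGNED_INTEGER = 0xA3


def intbytes(x: int) -> bytes:
    """Shortest big-endian two's-complement form of x; empty for zero."""
    if x == 0:
        return b""
    # Magnitude bits needed, not counting the sign bit. For negative x,
    # ~x == -x - 1 is the non-negative number with the same bit demand.
    magnitude_bits = x.bit_length() if x > 0 else (~x).bit_length()
    # One extra bit for the sign, rounded up to whole bytes.
    n_bytes = magnitude_bits // 8 + 1
    return x.to_bytes(n_bytes, "big", signed=True)


def encode_signed_integer(x: int) -> bytes:
    return bytes([TAG_SIGNED_INTEGER]) + intbytes(x)
```

For example, 128 needs a leading `0x00` byte so that its sign bit is clear, while -128 fits in the single byte `0x80`.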
**Strings.** A `String` encodes as tag `0xA4` followed by the UTF-8
encoding of the string, with an additional trailing `NUL` (0) byte. The
`NUL` byte *MUST NOT* be treated as part of the `String`: it exists to
permit zero-copy C interoperability.[^zero-copy-c-string-interop]
[^zero-copy-c-string-interop]: Some care must still be taken when
passing `String` `Repr`s directly to a C-style ABI, since `String`s
may contain the zero Unicode code point, which C library routines
will usually misinterpret as an end-of-string marker.
**ByteStrings.** A `ByteString` encodes as tag `0xA5` followed by the
bytes themselves.
**Symbols.** A `Symbol` encodes as tag `0xA6` followed by the UTF-8
encoding of the symbol's code points.
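The three textual and binary atom encodings above can be sketched in Python as follows. The function names are illustrative, not part of the specification. Note that the decoder strips exactly one trailing `NUL` and leaves any embedded zero code points in place.

```python
TAG_STRING = 0xA4
TAG_BYTESTRING = 0xA5
TAG_SYMBOL = 0xA6


def encode_string(s: str) -> bytes:
    """Tag, UTF-8 bytes, then the trailing NUL that is not part of the String."""
    return bytes([TAG_STRING]) + s.encode("utf-8") + b"\x00"


def decode_string(repr_bytes: bytes) -> str:
    """Inverse of encode_string for a complete String Repr."""
    if len(repr_bytes) < 2 or repr_bytes[0] != TAG_STRING or repr_bytes[-1] != 0:
        raise ValueError("not a well-formed String Repr")
    return repr_bytes[1:-1].decode("utf-8")


def encode_bytestring(b: bytes) -> bytes:
    """Tag followed by the raw bytes, unmodified."""
    return bytes([TAG_BYTESTRING]) + b


def encode_symbol(name: str) -> bytes:
    """Tag followed by the UTF-8 bytes; no trailing NUL."""
    return bytes([TAG_SYMBOL]) + name.encode("utf-8")
```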
### Compound Values.
`Repr`s for `Compound` values store the lengths of their contained
values. Each contained `Value` is converted to a `Repr` and stored as
the length of the `Repr` in bytes followed by the `Repr` itself.
Implementations use each stored length to decide when to stop reading
the associated `Repr`. Similarly, no sentinel marks the end of a
sequence of length-prefixed `Repr`s. Implementations use the length of
the containing `Repr`, known from the surrounding context, to decide
when to stop expecting more contained `Repr`s.
<a id="varint"></a> Each length is stored as an [argdata][]-compatible
big-endian base 128 *varint*.[^see-also-leb128] Each byte of a varint
stores seven bits of the length. All bytes have a clear upper bit,
except the final byte, which has the upper bit set. We write
`len(m)` for the varint-encoding of a non-negative integer `m`,
defined recursively as follows:
len(m) = e(m, 128)
where e(v, d) = [v + d] if v < 128
e(v / 128, 0) ++ [(v % 128) + d] if v ≥ 128
except the final byte, which has the upper bit set.
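A reader built on these rules might look like the following Python sketch. The names `read_varint` and `split_contained` are hypothetical. `split_contained` takes the body of a compound `Repr` (everything after the tag byte) and relies only on the stored lengths and on the overall length of the body to find where each contained `Repr` ends.

```python
from typing import List, Tuple


def read_varint(data: bytes, pos: int) -> Tuple[int, int]:
    """Read a big-endian base-128 varint at pos; return (value, next position)."""
    value = 0
    while True:
        if pos >= len(data):
            raise ValueError("truncated varint")
        byte = data[pos]
        pos += 1
        value = value * 128 + (byte & 0x7F)
        if byte & 0x80:  # upper bit set marks the final byte
            return value, pos


def split_contained(body: bytes) -> List[bytes]:
    """Split a run of length-prefixed Reprs; stops when the body is used up."""
    items = []
    pos = 0
    while pos < len(body):
        length, pos = read_varint(body, pos)
        end = pos + length
        if end > len(body):
            raise ValueError("contained Repr overruns its container")
        items.append(body[pos:end])
        pos = end
    return items
```

Because leading `0` bytes contribute nothing to the accumulated value, this reader also accepts overlong length encodings.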
[^see-also-leb128]: Argdata's length representation is very close to
[Variable-length quantity (VLQ)][VLQ] encoding, differing only in
@ -56,10 +94,8 @@ defined recursively as follows:
big-endian, unlike [LEB128][] encoding ([as used by
Google][google-varint] in protobufs).
We write `len(|r|)` for the varint-encoding of the length of `Repr` `r`.
There is no requirement that a varint-encoded `m` in a `Repr` be the
unique shortest encoding for that `m`.[^overlong-varint] However,
There is no requirement that a varint-encoded length be the unique
shortest encoding for the length.[^overlong-varint] However,
implementations *SHOULD* use the shortest encoding wherever possible
when writing, and *MAY* reject encodings with more than eight leading
`0` bytes when reading encoded values.
@ -69,21 +105,24 @@ when writing, and *MAY* reject encodings with more than eight leading
anything other than a very low-level language, it is likely to be able to use
[IOList](./conventions.html#iolists)-style data structures to avoid unnecessary copying.
### Records, Sequences, Sets and Dictionaries.
**Records.** A `Record` is encoded as tag `0xA7` followed by the
length-prefixed encodings of its label and fields.
«<L F_1...F_m>» = [0xA7] ++ seq(«L», «F_1», ..., «F_m»)
«[X_1...X_m]» = [0xA8] ++ seq(«X_1», ..., «X_m»)
«#{E_1...E_m}» = [0xA9] ++ seq(«E_1», ..., «E_m»)
«{K_1:V_1...K_m:V_m}» = [0xAA] ++ seq(«K_1», «V_1», ..., «K_m», «V_m»)
**Sequences.** A `Sequence` is encoded as tag `0xA8` followed by the
length-prefixed encodings of its members.
seq(R_1, ..., R_m) = len(|R_1|) ++ R_1 ++...++ len(|R_m|) ++ R_m
**Sets.** A `Set` is encoded like a `Sequence`, but with tag `0xA9`, and
in some arbitrary order.
There is *no* ordering requirement on the `E_i` elements or
`K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any
order. However, the `E_i` and `K_i` *MUST* be pairwise distinct. In
addition, implementations *SHOULD* default to writing set elements and
dictionary key/value pairs in order sorted lexicographically by their
`Repr`s[^not-sorted-semantically], and *MAY* offer the option of
**Dictionaries.** A `Dictionary` encodes as tag `0xAA` followed by the
length-prefixed keys and values, in an alternating key/value sequence.
There is *no* ordering requirement on the elements of sets or the
key/value pairs of dictionaries.[^no-sorting-rationale] However,
elements of sets and keys in dictionaries *MUST* be pairwise distinct.
In addition, implementations *SHOULD* default to writing set elements
and dictionary key/value pairs in order sorted lexicographically by
their `Repr`s[^not-sorted-semantically], and *MAY* offer the option of
serializing in some other implementation-defined order.
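The four compound encodings can be sketched in Python as below, operating on already-encoded `Repr`s. All function names are illustrative. Two choices here are assumptions rather than requirements of the text: dictionary pairs are sorted by the `Repr` of the key, and distinctness is checked on `Repr`s, which is only an approximation of distinctness of the underlying `Value`s.

```python
from typing import Dict


def varint(m: int) -> bytes:
    """Big-endian base-128 length; only the final byte has its upper bit set."""
    if m < 0:
        raise ValueError("lengths are non-negative")
    out = [(m % 128) + 128]
    m //= 128
    while m > 0:
        out.append(m % 128)
        m //= 128
    return bytes(reversed(out))


def seq(*reprs: bytes) -> bytes:
    """Concatenate each Repr prefixed by its varint-encoded length."""
    return b"".join(varint(len(r)) + r for r in reprs)


def encode_record(label: bytes, *fields: bytes) -> bytes:
    return b"\xa7" + seq(label, *fields)


def encode_sequence(*members: bytes) -> bytes:
    return b"\xa8" + seq(*members)


def encode_set(*elements: bytes) -> bytes:
    if len(set(elements)) != len(elements):
        raise ValueError("set elements must be pairwise distinct")
    # bytes objects compare lexicographically by unsigned byte value
    return b"\xa9" + seq(*sorted(elements))


def encode_dictionary(pairs: Dict[bytes, bytes]) -> bytes:
    # Keys of a Python dict are already distinct; sort pairs by key Repr.
    flat = []
    for key in sorted(pairs):
        flat.extend([key, pairs[key]])
    return b"\xaa" + seq(*flat)
```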
[^no-sorting-rationale]: In the BitTorrent encoding format,
@ -109,93 +148,33 @@ serializing in some other implementation-defined order.
but encoding and then sorting byte strings is much more likely to
be within easy reach.
No sentinel marks the end of a sequence of length-prefixed `Repr`s.
During decoding, use the length of the containing `Repr` to decide when
to stop expecting more contained `Repr`s.
### Embedded Values.
### SignedIntegers.
«x» when x ∈ SignedInteger = [0xA3] ++ intbytes(x)
The function `intbytes(x)` gives a big-endian two's-complement binary
representation of `x`, taking at least as many whole bytes as needed to
unambiguously identify the value and its sign; `intbytes(0)` may be the
empty byte sequence.[^zero-intbytes] The most-significant bit in the
first byte in `intbytes(x)` is the sign bit. While every `SignedInteger`
*SHOULD* be represented with its shortest possible encoding (which will
often include a necessary leading `0xFF` or `0x00`), redundant leading
`0xFF` or `0x00` bytes *MAY* be used.[^overlong-signedinteger]
[^zero-intbytes]: The value 0 needs zero bytes to identify the value,
so `intbytes(0)` can be the empty byte string. Non-zero values need
at least one byte.
[^overlong-signedinteger]: **Implementation note.** The spec permits
overlong `SignedInteger` encodings to allow e.g. construction of
`Repr`s by filling in partially-completed templates, which can be
useful in resource-constrained situations.
### Strings, ByteStrings and Symbols.
«S» = [0xA4] ++ utf8(S) ++ [0] if S ∈ String
[0xA5] ++ S if S ∈ ByteString
[0xA6] ++ utf8(S) if S ∈ Symbol
For `String` and `Symbol`, the data following the tag is a UTF-8
encoding of the `Value`'s code points, while for `ByteString` it is the
raw data contained within the `Value` unmodified.
Each `String` has a trailing zero byte appended. This extra byte *MUST
NOT* be treated as part of the `Value`: it exists to permit zero-copy C
interoperability.[^zero-copy-c-string-interop]
[^zero-copy-c-string-interop]: Some care must still be taken when
passing `String` `Repr`s directly to a C-style ABI, since `String`s
may contain the zero Unicode code point, which C library routines
will usually misinterpret as an end-of-string marker.
### Booleans.
«#f» = [0xA0]
«#t» = [0xA1]
### Floats and Doubles.
«F» when F ∈ Float = [0xA2] ++ binary32(F)
«D» when D ∈ Double = [0xA2] ++ binary64(D)
The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
8-byte IEEE 754 binary representations of `F` and `D`, respectively.
### Embeddeds.
The `Repr` of an `Embedded` is the `Repr` of a `Value` chosen to
represent the denoted object, prefixed with `[0xAB]`.
«#!V» = [0xAB] ++ «V»
Embedded values are encoded as tag `0xAB` followed by the encoding of
some `Value` chosen to represent the denoted embedded object.
### Annotations.
To annotate a `Repr` `r` with some sequence of `Value`s `[v_1, ...,
v_m]`, surround `r` as follows:
The encoding of a sequence of annotations for a `Repr` uses tag `0xBF`,
followed by the length-prefixed `Repr`, followed by the length-prefixed
encoded annotations, in order. The `Repr` *MUST NOT* already have
annotations (must not begin with `0xBF`), and there *MUST* be at least
one `Value` in the sequence following the `Repr`.
[0xBF] ++ len(|r|) ++ r ++ len(|«v_1»|) ++ «v_1» ++...++ len(|«v_m»|) ++ «v_m»
## Examples (normative)
The `Repr` `r` *MUST NOT* already have annotations; that is, it must not
begin with `0xBF`. The sequence `[v_1, ..., v_m]` *MUST* contain at
least one `Value`.
## Examples
We write `«v»` for the `Repr` of some `Value` `v`, and `varint(|«v»|)` for
the varint-encoded length of the `Repr` of `v`.
### Varints (length representations).
The following table illustrates varint-encoding.
| Number, `m` | `m` in binary, grouped into 7-bit chunks | `len(m)` bytes |
|-------------|-------------------------------------------|-----------------|
| 15 | `0001111` | 143 |
| 300 | `0000010 0101100` | 2 172 |
| 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 3 92 107 20 128 |
| Number, `m` | `m` in binary, grouped into 7-bit chunks | `varint(m)` bytes |
|-------------|-------------------------------------------|-------------------|
| 15 | `0001111` | 143 |
| 300 | `0000010 0101100` | 2 172 |
| 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 3 92 107 20 128 |
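The rows of the table can be reproduced with a short Python sketch of the encoder. The function name `varint` matches the notation used in these examples; the implementation itself is illustrative and always produces the shortest encoding.

```python
def varint(m: int) -> bytes:
    """Encode a non-negative integer as a big-endian base-128 varint."""
    if m < 0:
        raise ValueError("varints encode non-negative integers only")
    # The least-significant 7-bit chunk is written last and carries the set upper bit.
    out = [(m % 128) + 128]
    m //= 128
    # Remaining chunks keep a clear upper bit.
    while m > 0:
        out.append(m % 128)
        m //= 128
    return bytes(reversed(out))


for number in (15, 300, 1000000000):
    print(number, list(varint(number)))
```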
### Atoms.
@ -288,7 +267,9 @@ The `Repr` corresponding to textual syntax `@a@b[]`, i.e. an empty sequence anno
symbols, `a` and `b`, is
«@a @b []»
= [0xBF] ++ len(|«[]»|) ++ «[]» ++ len(|«a»|) ++ «a» ++ len(|«b»|) ++ «b»
= [0xBF] ++ varint(|«[]»|) ++ «[]»
++ varint(|«a»|) ++ «a»
++ varint(|«b»|) ++ «b»
= [0xBF, 0x81, 0xA8, 0x82, 0xA6, 0x61, 0x82, 0xA6, 0x62]
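The byte sequence above can be reproduced with the following Python sketch. The function `annotate` is a hypothetical helper, not part of the specification; it takes the `Repr` being annotated and the `Repr`s of the annotations, and enforces the two *MUST* conditions on annotations.

```python
def varint(m: int) -> bytes:
    """Big-endian base-128 length; only the final byte has its upper bit set."""
    out = [(m % 128) + 128]
    m //= 128
    while m > 0:
        out.append(m % 128)
        m //= 128
    return bytes(reversed(out))


def annotate(r: bytes, *annotations: bytes) -> bytes:
    """Wrap Repr r with tag 0xBF, then length-prefixed r and annotations."""
    if r[:1] == b"\xbf":
        raise ValueError("the Repr must not already have annotations")
    if not annotations:
        raise ValueError("at least one annotation is required")
    parts = [b"\xbf", varint(len(r)), r]
    for a in annotations:
        parts.extend([varint(len(a)), a])
    return b"".join(parts)


empty_sequence = b"\xa8"   # «[]»
symbol_a = b"\xa6a"        # «a»
symbol_b = b"\xa6b"        # «b»
print(annotate(empty_sequence, symbol_a, symbol_b).hex(" "))
```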
## Security Considerations
@ -346,7 +327,7 @@ undetermined number of `Value`s across, say, a TCP/IP connection:
- If the binary syntax is to be used for the connection, start the
connection with byte `0xA8` (sequence). After the initial byte, send
each value `v` as `len(|«v»|) ++ «v»`. A side effect of this approach
each value `v` as `varint(|«v»|) ++ «v»`. A side effect of this approach
is that the entire stream, when complete, is a valid `Sequence`
`Repr`.