Smaller simpler (?) presentation of binary syntax
parent f28ae51215
commit b43d372014
|
@ -21,34 +21,72 @@ syntax](preserves-text.html) also exists.
|
||||||
## Machine-Oriented Binary Syntax
|
## Machine-Oriented Binary Syntax
|
||||||
|
|
||||||
A `Repr` is a binary-syntax encoding, or representation, of a `Value`.
|
A `Repr` is a binary-syntax encoding, or representation, of a `Value`.
|
||||||
For a value `v`, we write `«v»` for the `Repr` of v.
|
|
||||||
|
|
||||||
### Type and Length representation.
|
### Type and Length representation.
|
||||||
|
|
||||||
Each `Repr` starts with a tag byte, describing the kind of information
|
Each `Repr` starts with a tag byte, describing the kind of information
|
||||||
represented.
|
represented. The expected length of the `Repr` is always available
|
||||||
|
|
||||||
However, inspired by [argdata][], a `Repr` does *not* describe its own
|
|
||||||
length. Instead, the expected length of the `Repr` is always available
|
|
||||||
from the surrounding context: either from a containing encoded value, or
|
from the surrounding context: either from a containing encoded value, or
|
||||||
from the overall container of the data, which could be a file, an HTTP
|
from the overall container of the data, which could be a file, an HTTP
|
||||||
message, a UDP packet, etc.
|
message, a UDP packet, etc.
|
||||||
|
|
||||||
As a consequence, `Repr`s for `Compound` values store the lengths of
|
### Atomic Values.
|
||||||
their contained values. Each contained `Value` is represented as a
|
|
||||||
length in bytes followed by its own `Repr`. Implementations use each
|
**Booleans.** The "false" boolean's `Repr` is just tag `0xA0`; "true" is
|
||||||
stored length to decide when to stop reading the following `Repr`.
|
`0xA1`.
|
||||||
|
|
||||||
|
**Floats and Doubles.** Both `Float` and `Double` values are represented
|
||||||
|
as tag `0xA2` followed by big-endian 4- or 8-byte IEEE 754 binary
|
||||||
|
representations of the values, respectively.
|
||||||
|
|
||||||
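A minimal Python sketch of the boolean and floating-point encodings described above. The function names are illustrative assumptions and not part of the specification. Because `Float` and `Double` share tag `0xA2`, a reader can tell them apart only by the length of the `Repr`, which comes from the surrounding context.

```python
import struct

TAG_FALSE = 0xA0
TAG_TRUE = 0xA1
TAG_FLOAT = 0xA2  # shared by Float and Double; payload length tells them apart


def encode_boolean(value: bool) -> bytes:
    """A boolean's Repr is a single tag byte with no payload."""
    return bytes([TAG_TRUE if value else TAG_FALSE])


def encode_float(value: float) -> bytes:
    """Tag 0xA2 followed by a big-endian 4-byte IEEE 754 binary32."""
    return bytes([TAG_FLOAT]) + struct.pack(">f", value)


def encode_double(value: float) -> bytes:
    """Tag 0xA2 followed by a big-endian 8-byte IEEE 754 binary64."""
    return bytes([TAG_FLOAT]) + struct.pack(">d", value)
```

Under these assumptions a `Float` `Repr` is always 5 bytes long and a `Double` `Repr` is always 9 bytes long.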
|
**SignedIntegers.** A `SignedInteger` encodes as tag `0xA3` followed by
|
||||||
|
a big-endian two's-complement binary representation of the value, taking
|
||||||
|
at least as many whole bytes as needed to unambiguously identify the
|
||||||
|
value and its sign. Zero may be represented as the tag alone, with no
|
||||||
|
following bytes. The most-significant bit in the first byte after the
|
||||||
|
tag is the sign bit.[^zero-intbytes] The shortest possible encoding
|
||||||
|
*SHOULD* be used.[^overlong-signedinteger]
|
||||||
|
|
||||||
|
[^zero-intbytes]: The value 0 needs zero bytes to identify the value,
|
||||||
|
so `intbytes(0)` can be the empty byte string. Non-zero values need
|
||||||
|
at least one byte.
|
||||||
|
|
||||||
|
[^overlong-signedinteger]: **Implementation note.** The spec permits
|
||||||
|
overlong `SignedInteger` encodings to allow e.g. construction of
|
||||||
|
`Repr`s by filling in partially-completed templates, which can be
|
||||||
|
useful in resource-constrained situations.
|
||||||
|
|
||||||
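The shortest-encoding rule for `SignedInteger` can be sketched in Python as follows. The helper name `intbytes` follows the footnote above; `encode_signed_integer` and `decode_signed_integer` are hypothetical names chosen for illustration. The decoder accepts overlong encodings, since the text only says the shortest form *SHOULD* be used.

```python
TAG_SIGNED_INTEGER = 0xA3


def intbytes(x: int) -> bytes:
    """Shortest big-endian two's-complement form of x; empty for zero."""
    if x == 0:
        return b""
    # Bits needed for the magnitude; one further bit is needed for the sign.
    magnitude_bits = (x if x >= 0 else ~x).bit_length()
    length = magnitude_bits // 8 + 1
    return x.to_bytes(length, "big", signed=True)


def encode_signed_integer(x: int) -> bytes:
    return bytes([TAG_SIGNED_INTEGER]) + intbytes(x)


def decode_signed_integer(repr_bytes: bytes) -> int:
    """Decode a SignedInteger Repr, tolerating redundant leading bytes."""
    if not repr_bytes or repr_bytes[0] != TAG_SIGNED_INTEGER:
        raise ValueError("not a SignedInteger Repr")
    # An empty payload decodes to zero.
    return int.from_bytes(repr_bytes[1:], "big", signed=True)
```

Note that 128 needs two payload bytes (`0x00 0x80`) because a single `0x80` byte would have its sign bit set and decode as -128.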
|
**Strings.** A `String` encodes as tag `0xA4` followed by the UTF-8
|
||||||
|
encoding of the string, with an additional trailing `NUL` (0) byte. The
|
||||||
|
`NUL` byte *MUST NOT* be treated as part of the `String`: it exists to
|
||||||
|
permit zero-copy C interoperability.[^zero-copy-c-string-interop]
|
||||||
|
|
||||||
|
[^zero-copy-c-string-interop]: Some care must still be taken when
|
||||||
|
passing `String` `Repr`s directly to a C-style ABI, since `String`s
|
||||||
|
may contain the zero Unicode code point, which C library routines
|
||||||
|
will usually misinterpret as an end-of-string marker.
|
||||||
|
|
||||||
|
**ByteStrings.** A `ByteString` encodes as tag `0xA5` followed by the
|
||||||
|
bytes themselves.
|
||||||
|
|
||||||
|
**Symbols.** A `Symbol` encodes as tag `0xA6` followed by the UTF-8
|
||||||
|
encoding of the symbol's code points.
|
||||||
|
|
||||||
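As an illustration of the three text and byte encodings above, here is a short Python sketch. The function names are assumptions made for this example. The decoder shows one way to honour the rule that the trailing `NUL` is not part of the `String`.

```python
TAG_STRING = 0xA4
TAG_BYTE_STRING = 0xA5
TAG_SYMBOL = 0xA6


def encode_string(s: str) -> bytes:
    """Tag, UTF-8 bytes, then one trailing NUL that is not part of the value."""
    return bytes([TAG_STRING]) + s.encode("utf-8") + b"\x00"


def decode_string(repr_bytes: bytes) -> str:
    """Strip the tag and the trailing NUL; any interior NULs are kept."""
    if len(repr_bytes) < 2 or repr_bytes[0] != TAG_STRING or repr_bytes[-1] != 0:
        raise ValueError("not a well-formed String Repr")
    return repr_bytes[1:-1].decode("utf-8")


def encode_byte_string(data: bytes) -> bytes:
    """Tag followed by the raw bytes, with no terminator."""
    return bytes([TAG_BYTE_STRING]) + data


def encode_symbol(name: str) -> bytes:
    """Tag followed by the UTF-8 encoding of the symbol, with no terminator."""
    return bytes([TAG_SYMBOL]) + name.encode("utf-8")
```

Because the length of the `Repr` is known from context, a `String` containing the zero code point still round-trips: only the final byte is dropped.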
|
### Compound Values.
|
||||||
|
|
||||||
|
`Repr`s for `Compound` values store the lengths of their contained
|
||||||
|
values. Each contained `Value` is converted to a `Repr` and stored as
|
||||||
|
the length of the `Repr` in bytes followed by the `Repr` itself.
|
||||||
|
Implementations use each stored length to decide when to stop reading
|
||||||
|
the associated `Repr`. Similarly, no sentinel marks the end of a
|
||||||
|
sequence of length-prefixed `Repr`s. Implementations use the length of
|
||||||
|
the containing `Repr`, known from the surrounding context, to decide
|
||||||
|
when to stop expecting more contained `Repr`s.
|
||||||
|
|
||||||
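The reading procedure described above can be sketched as follows, assuming the varint length format defined in the next paragraph. `read_varint` and `split_items` are hypothetical helper names. The loop has no sentinel to look for: it stops only when the body of the containing `Repr`, whose length is known from context, is exhausted.

```python
from typing import List, Tuple


def read_varint(data: bytes, offset: int) -> Tuple[int, int]:
    """Read a big-endian base-128 varint; return (value, next offset)."""
    value = 0
    while True:
        if offset >= len(data):
            raise ValueError("truncated varint")
        byte = data[offset]
        offset += 1
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:  # the final byte has its upper bit set
            return value, offset


def split_items(body: bytes) -> List[bytes]:
    """Split the body of a compound Repr (tag already removed) into Reprs."""
    items = []
    offset = 0
    while offset < len(body):  # the container's length is the only stop signal
        length, offset = read_varint(body, offset)
        end = offset + length
        if end > len(body):
            raise ValueError("contained Repr overruns its container")
        items.append(body[offset:end])
        offset = end
    return items
```

For example, splitting the body of the `Sequence` `Repr` `[0xA8, 0x82, 0xA6, 0x61, 0x82, 0xA6, 0x62]` yields the two `Symbol` `Repr`s for `a` and `b`.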
<a id="varint"></a> Each length is stored as an [argdata][]-compatible
|
<a id="varint"></a> Each length is stored as an [argdata][]-compatible
|
||||||
big-endian base 128 *varint*.[^see-also-leb128] Each byte of a varint
|
big-endian base 128 *varint*.[^see-also-leb128] Each byte of a varint
|
||||||
stores seven bits of the length. All bytes have a clear upper bit,
|
stores seven bits of the length. All bytes have a clear upper bit,
|
||||||
except the final byte, which has the upper bit set. We write
|
except the final byte, which has the upper bit set.
|
||||||
`len(m)` for the varint-encoding of a non-negative integer `m`,
|
|
||||||
defined recursively as follows:
|
|
||||||
|
|
||||||
len(m) = e(m, 128)
|
|
||||||
where e(v, d) = [v + d] if v < 128
|
|
||||||
e(v / 128, 0) ++ [(v % 128) + d] if v ≥ 128
|
|
||||||
|
|
||||||
[^see-also-leb128]: Argdata's length representation is very close to
|
[^see-also-leb128]: Argdata's length representation is very close to
|
||||||
[Variable-length quantity (VLQ)][VLQ] encoding, differing only in
|
[Variable-length quantity (VLQ)][VLQ] encoding, differing only in
|
||||||
|
@ -56,10 +94,8 @@ defined recursively as follows:
|
||||||
big-endian, unlike [LEB128][] encoding ([as used by
|
big-endian, unlike [LEB128][] encoding ([as used by
|
||||||
Google][google-varint] in protobufs).
|
Google][google-varint] in protobufs).
|
||||||
|
|
||||||
We write `len(|r|)` for the varint-encoding of the length of `Repr` `r`.
|
There is no requirement that a varint-encoded length be the unique
|
||||||
|
shortest encoding for the length.[^overlong-varint] However,
|
||||||
There is no requirement that a varint-encoded `m` in a `Repr` be the
|
|
||||||
unique shortest encoding for that `m`.[^overlong-varint] However,
|
|
||||||
implementations *SHOULD* use the shortest encoding wherever possible
|
implementations *SHOULD* use the shortest encoding wherever possible
|
||||||
when writing, and *MAY* reject encodings with more than eight leading
|
when writing, and *MAY* reject encodings with more than eight leading
|
||||||
`0` bytes when reading encoded values.
|
`0` bytes when reading encoded values.
|
||||||
|
@ -69,21 +105,24 @@ when writing, and *MAY* reject encodings with more than eight leading
|
||||||
anything other than a very low-level language, it is likely to be able to use
|
anything other than a very low-level language, it is likely to be able to use
|
||||||
[IOList](./conventions.html#iolists)-style data structures to avoid unnecessary copying.
|
[IOList](./conventions.html#iolists)-style data structures to avoid unnecessary copying.
|
||||||
|
|
||||||
### Records, Sequences, Sets and Dictionaries.
|
**Records.** A `Record` is encoded as tag `0xA7` followed by the
|
||||||
|
length-prefixed encodings of its label and fields.
|
||||||
|
|
||||||
«<L F_1...F_m>» = [0xA7] ++ seq(«L», «F_1», ..., «F_m»)
|
**Sequences.** A `Sequence` is encoded as tag `0xA8` followed by the
|
||||||
«[X_1...X_m]» = [0xA8] ++ seq(«X_1», ..., «X_m»)
|
length-prefixed encodings of its members.
|
||||||
«#{E_1...E_m}» = [0xA9] ++ seq(«E_1», ..., «E_m»)
|
|
||||||
«{K_1:V_1...K_m:V_m}» = [0xAA] ++ seq(«K_1», «V_1», ..., «K_m», «V_m»)
|
|
||||||
|
|
||||||
seq(R_1, ..., R_m) = len(|R_1|) ++ R_1 ++...++ len(|R_m|) ++ R_m
|
**Sets.** A `Set` is encoded like a `Sequence`, but with tag `0xA9`, and
|
||||||
|
in some arbitrary order.
|
||||||
|
|
||||||
There is *no* ordering requirement on the `E_i` elements or
|
**Dictionaries.** A `Dictionary` encodes as tag `0xAA` followed by the
|
||||||
`K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any
|
length-prefixed keys and values, in an alternating key/value sequence.
|
||||||
order. However, the `E_i` and `K_i` *MUST* be pairwise distinct. In
|
|
||||||
addition, implementations *SHOULD* default to writing set elements and
|
There is *no* ordering requirement on the elements of sets or the
|
||||||
dictionary key/value pairs in order sorted lexicographically by their
|
key/value pairs of dictionaries.[^no-sorting-rationale] However,
|
||||||
`Repr`s[^not-sorted-semantically], and *MAY* offer the option of
|
elements of sets and keys in dictionaries *MUST* be pairwise distinct.
|
||||||
|
In addition, implementations *SHOULD* default to writing set elements
|
||||||
|
and dictionary key/value pairs in order sorted lexicographically by
|
||||||
|
their `Repr`s[^not-sorted-semantically], and *MAY* offer the option of
|
||||||
serializing in some other implementation-defined order.
|
serializing in some other implementation-defined order.
|
||||||
|
|
||||||
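A sketch of the four compound encodings, taking already-encoded `Repr`s as input. All function names are assumptions for illustration; `seq` mirrors the helper of that name in the previous presentation. Sets and dictionaries are written in the recommended default order, sorted lexicographically by `Repr`, and duplicate elements or keys are rejected.

```python
from typing import Iterable, List, Tuple


def varint(m: int) -> bytes:
    """Big-endian base-128 length; only the final byte has its upper bit set."""
    out = [(m & 0x7F) | 0x80]
    m >>= 7
    while m:
        out.append(m & 0x7F)
        m >>= 7
    return bytes(reversed(out))


def seq(reprs: Iterable[bytes]) -> bytes:
    """Concatenate Reprs, each prefixed with its varint-encoded length."""
    return b"".join(varint(len(r)) + r for r in reprs)


def encode_record(label: bytes, fields: List[bytes]) -> bytes:
    return b"\xa7" + seq([label, *fields])


def encode_sequence(members: List[bytes]) -> bytes:
    return b"\xa8" + seq(members)


def encode_set(elements: List[bytes]) -> bytes:
    if len(set(elements)) != len(elements):
        raise ValueError("set elements must be pairwise distinct")
    return b"\xa9" + seq(sorted(elements))  # bytes compare lexicographically


def encode_dictionary(pairs: List[Tuple[bytes, bytes]]) -> bytes:
    keys = [k for k, _ in pairs]
    if len(set(keys)) != len(keys):
        raise ValueError("dictionary keys must be pairwise distinct")
    ordered = sorted(pairs, key=lambda kv: kv[0])  # order by key Repr
    return b"\xaa" + seq([r for pair in ordered for r in pair])
```

Sorting the encoded bytes rather than the values themselves keeps the ordering independent of any semantic comparison on `Value`s.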
[^no-sorting-rationale]: In the BitTorrent encoding format,
|
[^no-sorting-rationale]: In the BitTorrent encoding format,
|
||||||
|
@ -109,93 +148,33 @@ serializing in some other implementation-defined order.
|
||||||
but encoding and then sorting byte strings is much more likely to
|
but encoding and then sorting byte strings is much more likely to
|
||||||
be within easy reach.
|
be within easy reach.
|
||||||
|
|
||||||
No sentinel marks the end of a sequence of length-prefixed `Repr`s.
|
### Embedded Values.
|
||||||
During decoding, use the length of the containing `Repr` to decide when
|
|
||||||
to stop expecting more contained `Repr`s.
|
|
||||||
|
|
||||||
### SignedIntegers.
|
Embedded values are encoded as tag `0xAB` followed by the encoding of
|
||||||
|
some `Value` chosen to represent the denoted embedded object.
|
||||||
«x» when x ∈ SignedInteger = [0xA3] ++ intbytes(x)
|
|
||||||
|
|
||||||
The function `intbytes(x)` gives a big-endian two's-complement binary
|
|
||||||
representation of `x`, taking at least as many whole bytes as needed to
|
|
||||||
unambiguously identify the value and its sign; `intbytes(0)` may be the
|
|
||||||
empty byte sequence.[^zero-intbytes] The most-significant bit in the
|
|
||||||
first byte in `intbytes(x)` is the sign bit. While every `SignedInteger`
|
|
||||||
*SHOULD* be represented with its shortest possible encoding (which will
|
|
||||||
often include a necessary leading `0xFF` or `0x00`), redundant leading
|
|
||||||
`0xFF` or `0x00` bytes *MAY* be used.[^overlong-signedinteger]
|
|
||||||
|
|
||||||
[^zero-intbytes]: The value 0 needs zero bytes to identify the value,
|
|
||||||
so `intbytes(0)` can be the empty byte string. Non-zero values need
|
|
||||||
at least one byte.
|
|
||||||
|
|
||||||
[^overlong-signedinteger]: **Implementation note.** The spec permits
|
|
||||||
overlong `SignedInteger` encodings to allow e.g. construction of
|
|
||||||
`Repr`s by filling in partially-completed templates, which can be
|
|
||||||
useful in resource-constrained situations.
|
|
||||||
|
|
||||||
### Strings, ByteStrings and Symbols.
|
|
||||||
|
|
||||||
«S» = [0xA4] ++ utf8(S) ++ [0] if S ∈ String
|
|
||||||
[0xA5] ++ S if S ∈ ByteString
|
|
||||||
[0xA6] ++ utf8(S) if S ∈ Symbol
|
|
||||||
|
|
||||||
For `String` and `Symbol`, the data following the tag is a UTF-8
|
|
||||||
encoding of the `Value`'s code points, while for `ByteString` it is the
|
|
||||||
raw data contained within the `Value` unmodified.
|
|
||||||
|
|
||||||
Each `String` has a trailing zero byte appended. This extra byte *MUST
|
|
||||||
NOT* be treated as part of the `Value`: it exists to permit zero-copy C
|
|
||||||
interoperability.[^zero-copy-c-string-interop]
|
|
||||||
|
|
||||||
[^zero-copy-c-string-interop]: Some care must still be taken when
|
|
||||||
passing `String` `Repr`s directly to a C-style ABI, since `String`s
|
|
||||||
may contain the zero Unicode code point, which C library routines
|
|
||||||
will usually misinterpret as an end-of-string marker.
|
|
||||||
|
|
||||||
### Booleans.
|
|
||||||
|
|
||||||
«#f» = [0xA0]
|
|
||||||
«#t» = [0xA1]
|
|
||||||
|
|
||||||
### Floats and Doubles.
|
|
||||||
|
|
||||||
«F» when F ∈ Float = [0xA2] ++ binary32(F)
|
|
||||||
«D» when D ∈ Double = [0xA2] ++ binary64(D)
|
|
||||||
|
|
||||||
The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
|
|
||||||
8-byte IEEE 754 binary representations of `F` and `D`, respectively.
|
|
||||||
|
|
||||||
### Embeddeds.
|
|
||||||
|
|
||||||
The `Repr` of an `Embedded` is the `Repr` of a `Value` chosen to
|
|
||||||
represent the denoted object, prefixed with `[0xAB]`.
|
|
||||||
|
|
||||||
«#!V» = [0xAB] ++ «V»
|
|
||||||
|
|
||||||
### Annotations.
|
### Annotations.
|
||||||
|
|
||||||
To annotate a `Repr` `r` with some sequence of `Value`s `[v_1, ...,
|
The encoding of a sequence of annotations for a `Repr` uses tag `0xBF`,
|
||||||
v_m]`, surround `r` as follows:
|
followed by the length-prefixed `Repr`, followed by the length-prefixed
|
||||||
|
encoded annotations, in order. The `Repr` *MUST NOT* already have
|
||||||
|
annotations (must not begin with `0xBF`), and there *MUST* be at least
|
||||||
|
one `Value` in the sequence following the `Repr`.
|
||||||
|
|
||||||
[0xBF] ++ len(|r|) ++ r ++ len(|«v_1»|) ++ «v_1» ++...++ len(|«v_m»|) ++ «v_m»
|
## Examples (normative)
|
||||||
|
|
||||||
The `Repr` `r` *MUST NOT* already have annotations; that is, it must not
|
We write `«v»` for the `Repr` of some `Value` `v`, and `varint(|«v»|)` for
|
||||||
begin with `0xBF`. The sequence `[v_1, ..., v_m]` *MUST* contain at
|
the varint-encoded length of the `Repr` of `v`.
|
||||||
least one `Value`.
|
|
||||||
|
|
||||||
## Examples
|
|
||||||
|
|
||||||
### Varints (length representations).
|
### Varints (length representations).
|
||||||
|
|
||||||
The following table illustrates varint-encoding.
|
The following table illustrates varint-encoding.
|
||||||
|
|
||||||
| Number, `m` | `m` in binary, grouped into 7-bit chunks | `len(m)` bytes |
|
| Number, `m` | `m` in binary, grouped into 7-bit chunks | `varint(m)` bytes |
|
||||||
|-------------|-------------------------------------------|-----------------|
|
|-------------|-------------------------------------------|-------------------|
|
||||||
| 15 | `0001111` | 143 |
|
| 15 | `0001111` | 143 |
|
||||||
| 300 | `0000010 0101100` | 2 172 |
|
| 300 | `0000010 0101100` | 2 172 |
|
||||||
| 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 3 92 107 20 128 |
|
| 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 3 92 107 20 128 |
|
||||||
|
|
||||||
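The rows of the table can be reproduced with a small Python function. This is a sketch; the name `varint` follows the notation used in these examples, and the argument check is an added assumption.

```python
def varint(m: int) -> bytes:
    """Encode a non-negative integer as a big-endian base-128 varint."""
    if m < 0:
        raise ValueError("lengths are non-negative")
    # Final byte: the low seven bits, with the upper bit set.
    out = [(m & 0x7F) | 0x80]
    m >>= 7
    # Earlier bytes: the remaining seven-bit chunks, upper bit clear.
    while m:
        out.append(m & 0x7F)
        m >>= 7
    return bytes(reversed(out))


for number in (15, 300, 1000000000):
    print(number, list(varint(number)))
```

Running the loop prints the byte values `[143]`, `[2, 172]` and `[3, 92, 107, 20, 128]`, matching the rightmost column.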
### Atoms.
|
### Atoms.
|
||||||
|
|
||||||
|
@ -288,7 +267,9 @@ The `Repr` corresponding to textual syntax `@a@b[]`, i.e. an empty sequence anno
|
||||||
symbols, `a` and `b`, is
|
symbols, `a` and `b`, is
|
||||||
|
|
||||||
«@a @b []»
|
«@a @b []»
|
||||||
= [0xBF] ++ len(|«[]»|) ++ «[]» ++ len(|«a»|) ++ «a» ++ len(|«b»|) ++ «b»
|
= [0xBF] ++ varint(|«[]»|) ++ «[]»
|
||||||
|
++ varint(|«a»|) ++ «a»
|
||||||
|
++ varint(|«b»|) ++ «b»
|
||||||
= [0xBF, 0x81, 0xA8, 0x82, 0xA6, 0x61, 0x82, 0xA6, 0x62]
|
= [0xBF, 0x81, 0xA8, 0x82, 0xA6, 0x61, 0x82, 0xA6, 0x62]
|
||||||
|
|
||||||
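The byte sequence above can be checked with a short Python sketch of the annotation rule. `annotate` and `varint` are hypothetical helper names; the two checks correspond to the *MUST NOT* and *MUST* requirements in the Annotations section.

```python
from typing import List


def varint(m: int) -> bytes:
    """Big-endian base-128 length; only the final byte has its upper bit set."""
    out = [(m & 0x7F) | 0x80]
    m >>= 7
    while m:
        out.append(m & 0x7F)
        m >>= 7
    return bytes(reversed(out))


def annotate(r: bytes, annotations: List[bytes]) -> bytes:
    """Wrap Repr r with the Reprs of one or more annotation Values."""
    if not annotations:
        raise ValueError("at least one annotation is required")
    if r[:1] == b"\xbf":
        raise ValueError("the Repr must not already be annotated")
    out = b"\xbf" + varint(len(r)) + r
    for a in annotations:
        out += varint(len(a)) + a
    return out


empty_sequence = b"\xa8"
symbol_a = b"\xa6a"
symbol_b = b"\xa6b"
print(annotate(empty_sequence, [symbol_a, symbol_b]).hex(" "))
```

The printed bytes are `bf 81 a8 82 a6 61 82 a6 62`, the same as the final line of the worked example.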
## Security Considerations
|
## Security Considerations
|
||||||
|
@ -346,7 +327,7 @@ undetermined number of `Value`s across, say, a TCP/IP connection:
|
||||||
|
|
||||||
- If the binary syntax is to be used for the connection, start the
|
- If the binary syntax is to be used for the connection, start the
|
||||||
connection with byte `0xA8` (sequence). After the initial byte, send
|
connection with byte `0xA8` (sequence). After the initial byte, send
|
||||||
each value `v` as `len(|«v»|) ++ «v»`. A side effect of this approach
|
each value `v` as `varint(|«v»|) ++ «v»`. A side effect of this approach
|
||||||
is that the entire stream, when complete, is a valid `Sequence`
|
is that the entire stream, when complete, is a valid `Sequence`
|
||||||
`Repr`.
|
`Repr`.
|
||||||
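A minimal sketch of this framing in Python, writing to any binary file-like object. `write_stream` and `stream_bytes` are hypothetical names, and the values are assumed to be already encoded as `Repr`s.

```python
import io
from typing import BinaryIO, Iterable


def varint(m: int) -> bytes:
    """Big-endian base-128 length; only the final byte has its upper bit set."""
    out = [(m & 0x7F) | 0x80]
    m >>= 7
    while m:
        out.append(m & 0x7F)
        m >>= 7
    return bytes(reversed(out))


def write_stream(out: BinaryIO, reprs: Iterable[bytes]) -> None:
    """Send the sequence tag once, then each Repr prefixed with its length."""
    out.write(b"\xa8")
    for r in reprs:
        out.write(varint(len(r)) + r)


def stream_bytes(reprs: Iterable[bytes]) -> bytes:
    """Convenience wrapper that collects the stream in memory."""
    buffer = io.BytesIO()
    write_stream(buffer, reprs)
    return buffer.getvalue()
```

Collecting a stream of the symbols `a` and `b` this way gives exactly the `Repr` of the `Sequence` `[a b]`, which illustrates the side effect mentioned above.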
|
|
||||||
|
|