From df1d75d1812c060c798f2d79c1ee537eb0604e3c Mon Sep 17 00:00:00 2001
From: Tony Garnock-Jones
Date: Mon, 20 Jun 2022 17:05:47 +0200
Subject: [PATCH] Put cheatsheet in an appendix

---
 _includes/cheatsheet-binary.md | 46 +++++++++++++++++++++++++++++++++
 cheatsheet.md                  | 47 +---------------------------------
 preserves-binary.md            | 31 ++++++++++++++--------
 3 files changed, 68 insertions(+), 56 deletions(-)
 create mode 100644 _includes/cheatsheet-binary.md

diff --git a/_includes/cheatsheet-binary.md b/_includes/cheatsheet-binary.md
new file mode 100644
index 0000000..4fbff08
--- /dev/null
+++ b/_includes/cheatsheet-binary.md
@@ -0,0 +1,46 @@
+For a value `v`, we write `«v»` for the binary encoding of `v`. The
+length of an encoding is always available from context: either from
+a containing encoded value, or from the overall container of the data,
+which could be a file, an HTTP message, a UDP packet, etc.
+
+                     «#f» = [0xA0]
+                     «#t» = [0xA1]
+                      «F» = [0xA2] ++ binary32(F)     if F ∈ Float
+                      «D» = [0xA2] ++ binary64(D)     if D ∈ Double
+                      «x» = [0xA3] ++ intbytes(x)     if x ∈ SignedInteger
+                      «S» = [0xA4] ++ utf8(S) ++ [0]  if S ∈ String
+                            [0xA5] ++ S               if S ∈ ByteString
+                            [0xA6] ++ utf8(S)         if S ∈ Symbol
+
+          «<L F_1...F_m>» = [0xA7] ++ seq(«L», «F_1», ..., «F_m»)
+            «[X_1...X_m]» = [0xA8] ++ seq(«X_1», ..., «X_m»)
+           «#{E_1...E_m}» = [0xA9] ++ seq(«E_1», ..., «E_m»)
+    «{K_1:V_1...K_m:V_m}» = [0xAA] ++ seq(«K_1», «V_1», ..., «K_m», «V_m»)
+
+       seq(R_1, ..., R_m) = len(|R_1|) ++ R_1 ++...++ len(|R_m|) ++ R_m
+
+                   len(m) = e(m, 128)
+
+                  e(v, d) = [v + d]                           if v < 128
+                            e(v / 128, 0) ++ [(v % 128) + d]  if v ≥ 128
+
+                    «#!V» = [0xAB] ++ «V»
+
+The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
+8-byte IEEE 754 binary representations of `F` and `D`, respectively.
+
+The function `intbytes(x)` is a big-endian two's-complement signed
+binary representation of `x`, taking at least as many whole bytes as
+needed to unambiguously identify the value and its sign. `intbytes(0)`
+may be the empty byte sequence.
+
+When reading, the length of the input is supplied externally. This means
+that, when reading a length/value pair in a `seq()`, each length should
+be passed down to the decoder for the corresponding value, so that the
+decoder knows when to stop.
+
+**Annotations.** To annotate a `Repr` `r` (that *MUST NOT* itself
+already be annotated) with some sequence of `Value`s `[v_1, ..., v_m]`
+(that *MUST* be non-empty), surround `r` as follows:
+
+    [0xBF] ++ len(|r|) ++ r ++ len(|«v_1»|) ++ «v_1» ++...++ len(|«v_m»|) ++ «v_m»
diff --git a/cheatsheet.md b/cheatsheet.md
index d07a670..50c3f0a 100644
--- a/cheatsheet.md
+++ b/cheatsheet.md
@@ -8,49 +8,4 @@ June 2022. Version 0.7.0.
 
 ## Machine-Oriented Binary Syntax
 
-For a value `v`, we write `«v»` for the binary encoding of `v`. The
-length of an encoding is always available from context: either from
-a containing encoded value, or from the overall container of the data,
-which could be a file, an HTTP message, a UDP packet, etc.
-
-                     «#f» = [0xA0]
-                     «#t» = [0xA1]
-                      «F» = [0xA2] ++ binary32(F)     if F ∈ Float
-                      «D» = [0xA2] ++ binary64(D)     if D ∈ Double
-                      «x» = [0xA3] ++ intbytes(x)     if x ∈ SignedInteger
-                      «S» = [0xA4] ++ utf8(S) ++ [0]  if S ∈ String
-                            [0xA5] ++ S               if S ∈ ByteString
-                            [0xA6] ++ utf8(S)         if S ∈ Symbol
-
-          «<L F_1...F_m>» = [0xA7] ++ seq(«L», «F_1», ..., «F_m»)
-            «[X_1...X_m]» = [0xA8] ++ seq(«X_1», ..., «X_m»)
-           «#{E_1...E_m}» = [0xA9] ++ seq(«E_1», ..., «E_m»)
-    «{K_1:V_1...K_m:V_m}» = [0xAA] ++ seq(«K_1», «V_1», ..., «K_m», «V_m»)
-
-       seq(R_1, ..., R_m) = len(|R_1|) ++ R_1 ++...++ len(|R_m|) ++ R_m
-
-                   len(m) = e(m, 128)
-
-                  e(v, d) = [v + d]                           if v < 128
-                            e(v / 128, 0) ++ [(v % 128) + d]  if v ≥ 128
-
-                    «#!V» = [0xAB] ++ «V»
-
-The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
-8-byte IEEE 754 binary representations of `F` and `D`, respectively.
-
-The function `intbytes(x)` is a big-endian two's-complement signed
-binary representation of `x`, taking at least as many whole bytes as
-needed to unambiguously identify the value and its sign. `intbytes(0)`
-may be the empty byte sequence.
-
-When reading, the length of the input is supplied externally. This means
-that, when reading a length/value pair in a `seq()`, each length should
-be passed down to the decoder for the corresponding value, so that the
-decoder knows when to stop.
-
-**Annotations.** To annotate a `Repr` `r` (that *MUST NOT* itself
-already be annotated) with some sequence of `Value`s `[v_1, ..., v_m]`
-(that *MUST* be non-empty), surround `r` as follows:
-
-    [0xBF] ++ len(|r|) ++ r ++ len(|«v_1»|) ++ «v_1» ++...++ len(|«v_m»|) ++ «v_m»
+{% include cheatsheet-binary.md %}
diff --git a/preserves-binary.md b/preserves-binary.md
index de6b6bc..e0209c0 100644
--- a/preserves-binary.md
+++ b/preserves-binary.md
@@ -86,7 +86,10 @@ when to stop expecting more contained `Repr`s.
 Each length is stored as an [argdata][]-compatible big-endian base 128
 *varint*.[^see-also-leb128] Each byte of a varint stores seven bits of
 the length. All bytes have a clear upper bit,
-except the final byte, which has the upper bit set.
+except the final byte, which has the upper bit set. Implementations
+*SHOULD* use the shortest encoding for a varint, and *MUST NOT* produce
+an encoded varint with more than nine leading `0`
+bytes.[^overlong-varint] [^nine-leading-varint-zeroes]
 
   [^see-also-leb128]: Argdata's length representation is very close to
     [Variable-length quantity (VLQ)][VLQ] encoding, differing only in
@@ -94,16 +97,20 @@ except the final byte, which has the upper bit set.
     big-endian, unlike [LEB128][] encoding ([as used by
     Google][google-varint] in protobufs).
 
-There is no requirement that a varint-encoded length be the unique
-shortest encoding for the length.[^overlong-varint] However,
-implementations *SHOULD* use the shortest encoding whereever possible
-when writing, and *MAY* reject encodings with more than eight leading
-`0` bytes when reading encoded values.
+  [^overlong-varint]: **Implementation note.** The spec permits overlong
+    length encodings to reduce wasted activity in resource-constrained
+    situations. If an implementation is in anything other than a very
+    low-level language, it is likely to be able to use
+    [IOList](./conventions.html#iolists)-style data structures to avoid
+    unnecessary copying.
 
-  [^overlong-varint]: **Implementation note.** The spec permits overlong length encodings to
-    reduce wasted activity in resource-constrained situations. If an implementation is in
-    anything other than a very low-level language, it is likely to be able to use
-    [IOList](./conventions.html#iolists)-style data structures to avoid unnecessary copying.
+  [^nine-leading-varint-zeroes]: Nine leading zero bytes, plus one
+    non-zero byte, equals ten bytes in total. Each byte of varint yields
+    7 bits of usable length indicator, so ten bytes gives 70 bits, while
+    nine would only give 63, not quite enough for a 64-bit value. Of
+    course, it may be some time before an encoder legitimately needs to
+    use a 64-bit length indicator, let alone in a resource-constrained
+    situation.
 
 **Records.** A `Record` is encoded as tag `0xA7` followed by the
 length-prefixed encodings of its label and fields.
@@ -298,6 +305,10 @@ The exclusion of lengths from `Repr`s, placing lengths instead ahead of
 contained values in sequences, is inspired by [argdata][], as is the
 inclusion of a `NUL` byte in `String` `Repr`s for C interoperability.
 
+## Appendix. Summary of syntax
+
+{% include cheatsheet-binary.md %}
+
 ## Appendix. Autodetection of textual or binary syntax
 
 Every tag byte in a binary Preserves `Repr` falls within the range