Put cheatsheet in an appendix

2022-06-20 17:05:47 +02:00 · df1d75d181
parent 9ee59562a1
commit df1d75d181
3 changed files with 68 additions and 56 deletions
--- a/_includes/cheatsheet-binary.md
+++ b/_includes/cheatsheet-binary.md
@@ -0,0 +1,46 @@
+For a value `v`, we write `«v»` for the binary encoding of `v`. The
+length of an encoding is always available from context: either from
+a containing encoded value, or from the overall container of the data,
+which could be a file, an HTTP message, a UDP packet, etc.
+
+                          «#f» = [0xA0]
+                          «#t» = [0xA1]
+                           «F» = [0xA2] ++ binary32(F)     if F ∈ Float
+                           «D» = [0xA2] ++ binary64(D)     if D ∈ Double
+                           «x» = [0xA3] ++ intbytes(x)     if x ∈ SignedInteger
+                           «S» = [0xA4] ++ utf8(S) ++ [0]  if S ∈ String
+                                 [0xA5] ++ S               if S ∈ ByteString
+                                 [0xA6] ++ utf8(S)         if S ∈ Symbol
+
+               «<L F_1...F_m>» = [0xA7] ++ seq(«L», «F_1», ..., «F_m»)
+                 «[X_1...X_m]» = [0xA8] ++ seq(«X_1», ..., «X_m»)
+                «#{E_1...E_m}» = [0xA9] ++ seq(«E_1», ..., «E_m»)
+         «{K_1:V_1...K_m:V_m}» = [0xAA] ++ seq(«K_1», «V_1», ..., «K_m», «V_m»)
+
+            seq(R_1, ..., R_m) = len(|R_1|) ++ R_1 ++...++ len(|R_m|) ++ R_m
+
+                        len(m) = e(m, 128)
+
+                       e(v, d) = [v + d]                           if v < 128
+                                 e(v / 128, 0) ++ [(v % 128) + d]  if v ≥ 128
+
+                         «#!V» = [0xAB] ++ «V»
+
+The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
+8-byte IEEE 754 binary representations of `F` and `D`, respectively.
+
+The function `intbytes(x)` is a big-endian two's-complement signed
+binary representation of `x`, taking at least as many whole bytes as
+needed to unambiguously identify the value and its sign. `intbytes(0)`
+may be the empty byte sequence.
+
+When reading, the length of the input is supplied externally. This means
+that, when reading a length/value pair in a `seq()`, each length should
+be passed down to the decoder for the corresponding value, so that the
+decoder knows when to stop.
+
+**Annotations.** To annotate a `Repr` `r` (that *MUST NOT* itself
+already be annotated) with some sequence of `Value`s `[v_1, ..., v_m]`
+(that *MUST* be non-empty), surround `r` as follows:
+
+    [0xBF] ++ len(|r|) ++ r ++ len(|«v_1»|) ++ «v_1» ++...++ len(|«v_m»|) ++ «v_m»
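
The summary added in `_includes/cheatsheet-binary.md` above can be read almost directly as code. The following is a minimal Python sketch, not part of the commit, covering `e`, `len`, `intbytes`, `seq`, a few of the `«v»` cases and the annotation wrapper. The function names `varint_len`, `encode` and `annotate` are hypothetical, and mapping Python `float` to the 8-byte `Double` case and Python `list` to sequences is an assumption made for illustration. Symbols, records, sets, dictionaries and embedded values are left out.

```python
import struct


def e(v: int, d: int) -> bytes:
    """e(v, d): big-endian base-128 digits, with d added to the last digit."""
    if v < 128:
        return bytes([v + d])
    return e(v // 128, 0) + bytes([(v % 128) + d])


def varint_len(m: int) -> bytes:
    """len(m) = e(m, 128): only the final byte has its upper bit set."""
    return e(m, 128)


def intbytes(x: int) -> bytes:
    """Shortest big-endian two's-complement form; 0 becomes the empty sequence."""
    if x == 0:
        return b""
    bits = x.bit_length() if x > 0 else (~x).bit_length()
    return x.to_bytes(bits // 8 + 1, "big", signed=True)  # +1 leaves room for the sign bit


def seq(*reprs: bytes) -> bytes:
    """seq(R_1, ..., R_m): each item is prefixed with its own length."""
    return b"".join(varint_len(len(r)) + r for r in reprs)


def encode(v) -> bytes:
    """Encode a small subset of values (assumed Python-to-Preserves mapping)."""
    if isinstance(v, bool):  # must precede the int check: bool is a subclass of int
        return b"\xA1" if v else b"\xA0"
    if isinstance(v, int):
        return b"\xA3" + intbytes(v)
    if isinstance(v, float):  # assumption: Python floats are treated as Double
        return b"\xA2" + struct.pack(">d", v)
    if isinstance(v, str):
        return b"\xA4" + v.encode("utf-8") + b"\x00"
    if isinstance(v, bytes):
        return b"\xA5" + v
    if isinstance(v, list):
        return b"\xA8" + seq(*(encode(x) for x in v))
    raise TypeError(f"unsupported value in this sketch: {v!r}")


def annotate(r: bytes, annotations: list) -> bytes:
    """Wrap an unannotated Repr r with a non-empty list of encoded annotations."""
    if not annotations:
        raise ValueError("annotation list must be non-empty")
    if r[:1] == b"\xBF":
        raise ValueError("r is already annotated")
    return b"\xBF" + varint_len(len(r)) + r + seq(*annotations)


print(encode([1, "a"]).hex())  # a882a30183a46100
```

In the printed example, `82` and `83` are the lengths 2 and 3 with the upper bit set, which is how a decoder recognises the final (here, only) byte of each varint.
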
--- a/cheatsheet.md
+++ b/cheatsheet.md
@@ -8,49 +8,4 @@ June 2022. Version 0.7.0.

 ## Machine-Oriented Binary Syntax

-For a value `v`, we write `«v»` for the binary encoding of `v`. The
-length of an encoding is always available from context: either from
-a containing encoded value, or from the overall container of the data,
-which could be a file, an HTTP message, a UDP packet, etc.
-
-                          «#f» = [0xA0]
-                          «#t» = [0xA1]
-                           «F» = [0xA2] ++ binary32(F)     if F ∈ Float
-                           «D» = [0xA2] ++ binary64(D)     if D ∈ Double
-                           «x» = [0xA3] ++ intbytes(x)     if x ∈ SignedInteger
-                           «S» = [0xA4] ++ utf8(S) ++ [0]  if S ∈ String
-                                 [0xA5] ++ S               if S ∈ ByteString
-                                 [0xA6] ++ utf8(S)         if S ∈ Symbol
-
-               «<L F_1...F_m>» = [0xA7] ++ seq(«L», «F_1», ..., «F_m»)
-                 «[X_1...X_m]» = [0xA8] ++ seq(«X_1», ..., «X_m»)
-                «#{E_1...E_m}» = [0xA9] ++ seq(«E_1», ..., «E_m»)
-         «{K_1:V_1...K_m:V_m}» = [0xAA] ++ seq(«K_1», «V_1», ..., «K_m», «V_m»)
-
-            seq(R_1, ..., R_m) = len(|R_1|) ++ R_1 ++...++ len(|R_m|) ++ R_m
-
-                        len(m) = e(m, 128)
-
-                       e(v, d) = [v + d]                           if v < 128
-                                 e(v / 128, 0) ++ [(v % 128) + d]  if v ≥ 128
-
-                         «#!V» = [0xAB] ++ «V»
-
-The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
-8-byte IEEE 754 binary representations of `F` and `D`, respectively.
-
-The function `intbytes(x)` is a big-endian two's-complement signed
-binary representation of `x`, taking at least as many whole bytes as
-needed to unambiguously identify the value and its sign. `intbytes(0)`
-may be the empty byte sequence.
-
-When reading, the length of the input is supplied externally. This means
-that, when reading a length/value pair in a `seq()`, each length should
-be passed down to the decoder for the corresponding value, so that the
-decoder knows when to stop.
-
-**Annotations.** To annotate a `Repr` `r` (that *MUST NOT* itself
-already be annotated) with some sequence of `Value`s `[v_1, ..., v_m]`
-(that *MUST* be non-empty), surround `r` as follows:
-
-    [0xBF] ++ len(|r|) ++ r ++ len(|«v_1»|) ++ «v_1» ++...++ len(|«v_m»|) ++ «v_m»
+{% include cheatsheet-binary.md %}
--- a/preserves-binary.md
+++ b/preserves-binary.md
@@ -86,7 +86,10 @@ when to stop expecting more contained `Repr`s.
 <a id="varint"></a> Each length is stored as an [argdata][]-compatible
 big-endian base 128 *varint*.[^see-also-leb128] Each byte of a varint
 stores seven bits of the length. All bytes have a clear upper bit,
-except the final byte, which has the upper bit set.
+except the final byte, which has the upper bit set. Implementations
+*SHOULD* use the shortest encoding for a varint, and *MUST NOT* produce
+an encoded varint with more than nine leading `0`
+bytes.[^overlong-varint] [^nine-leading-varint-zeroes]

  [^see-also-leb128]: Argdata's length representation is very close to
    [Variable-length quantity (VLQ)][VLQ] encoding, differing only in
@@ -94,16 +97,20 @@ except the final byte, which has the upper bit set.
    big-endian, unlike [LEB128][] encoding ([as used by
    Google][google-varint] in protobufs).

-There is no requirement that a varint-encoded length be the unique
-shortest encoding for the length.[^overlong-varint] However,
-implementations *SHOULD* use the shortest encoding whereever possible
-when writing, and *MAY* reject encodings with more than eight leading
-`0` bytes when reading encoded values.
+  [^overlong-varint]: **Implementation note.** The spec permits overlong
+    length encodings to reduce wasted activity in resource-constrained
+    situations. If an implementation is in anything other than a very
+    low-level language, it is likely to be able to use
+    [IOList](./conventions.html#iolists)-style data structures to avoid
+    unnecessary copying.

-  [^overlong-varint]: **Implementation note.** The spec permits overlong length encodings to
-    reduce wasted activity in resource-constrained situations. If an implementation is in
-    anything other than a very low-level language, it is likely to be able to use
-    [IOList](./conventions.html#iolists)-style data structures to avoid unnecessary copying.
+  [^nine-leading-varint-zeroes]: Nine leading zero bytes, plus one
+    non-zero byte, equals ten bytes in total. Each byte of varint yields
+    7 bits of usable length indicator, so ten bytes gives 70 bits, while
+    nine would only give 63, not quite enough for a 64-bit value. Of
+    course, it may be some time before an encoder legitimately needs to
+    use a 64-bit length indicator, let alone in a resource-constrained
+    situation.

 **Records.** A `Record` is encoded as tag `0xA7` followed by the
 length-prefixed encodings of its label and fields.
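
The varint wording changed in the hunk above limits what an encoder may produce: no more than nine leading `0` bytes. The Python sketch below, which is not part of the commit, shows what a matching reader might look like. The name `read_varint` is hypothetical, and rejecting overlong input on the reading side is an assumption: the new text only constrains encoders. The closing arithmetic restates the footnote: ten bytes carry 70 bits and nine carry 63.

```python
from typing import Tuple

MAX_LEADING_ZEROES = 9  # the encoder-side limit from the new spec text


def read_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Read a big-endian base-128 varint; return (value, next position).

    The final byte is the one with its upper bit set. Rejecting input with
    more than MAX_LEADING_ZEROES leading 0x00 bytes is this sketch's choice.
    """
    value = 0
    leading_zeroes = 0
    seen_nonzero = False
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        b = buf[pos]
        pos += 1
        if b == 0 and not seen_nonzero:
            leading_zeroes += 1
            if leading_zeroes > MAX_LEADING_ZEROES:
                raise ValueError("too many leading zero bytes")
        else:
            seen_nonzero = True
        value = (value << 7) | (b & 0x7F)  # each byte contributes seven bits
        if b & 0x80:  # upper bit set: this was the final byte
            return value, pos


# Footnote arithmetic: seven usable bits per byte.
assert 10 * 7 == 70 and 9 * 7 == 63
```

Under these assumptions, `read_varint(b"\x00" * 9 + b"\x81")` returns `(1, 10)`, while a tenth leading zero byte raises `ValueError`.
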
@@ -298,6 +305,10 @@ The exclusion of lengths from `Repr`s, placing lengths instead ahead of
 contained values in sequences, is inspired by [argdata][], as is the
 inclusion of a `NUL` byte in `String` `Repr`s for C interoperability.

+## Appendix. Summary of syntax
+
+{% include cheatsheet-binary.md %}
+
 ## Appendix. Autodetection of textual or binary syntax

 Every tag byte in a binary Preserves `Repr` falls within the range