Spec change proposal for #41

2023-03-15 16:24:02 +01:00 · 2023-03-15 16:24:02 +01:00 · d11f008705
parent 34f92c3870
commit d11f008705
2 changed files with 97 additions and 78 deletions
--- a/_includes/cheatsheet-binary.md
+++ b/_includes/cheatsheet-binary.md
@@ -1 +1,36 @@
-(TODO)
+  [varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
+
+                          «#f» = [0x80]
+                          «#t» = [0x81]
+
+                         «#!V» = [0x86] ++ «V»
+
+      «V» if V ∈ Float         = [0x87] ++ varint(|binary32(V)|) ++ binary32(V)
+      «V» if V ∈ Double        = [0x87] ++ varint(|binary64(V)|) ++ binary64(V)
+
+      «V» if V ∈ SignedInteger = [0xB0] ++ varint(|intbytes(V)|) ++ intbytes(V)
+      «V» if V ∈ String        = [0xB1] ++ varint(|utf8(V)|) ++ utf8(V)
+      «V» if V ∈ ByteString    = [0xB2] ++ varint(|V|) ++ V
+      «V» if V ∈ Symbol        = [0xB3] ++ varint(|utf8(V)|) ++ utf8(V)
+
+               «<L F_1...F_m>» = [0xB4] ++ «L» ++ «F_1» ++...++ «F_m» ++ [0x84]
+                 «[X_1...X_m]» = [0xB5] ++ «X_1» ++...++ «X_m» ++ [0x84]
+                «#{E_1...E_m}» = [0xB6] ++ «E_1» ++...++ «E_m» ++ [0x84]
+         «{K_1:V_1...K_m:V_m}» = [0xB7] ++ «K_1» ++ «V_1» ++...++ «K_m» ++ «V_m» ++ [0x84]
+
+               «@V_1...@V_n V» = [0xBF] ++ «V» ++ «V_1» ++...++ «V_n» ++ [0x84]
+
+Where
+
+ - `varint(m)` is the [varint-encoding][varint] of `m`; for example, `varint(15)` is `[0x0F]`,
+   and `varint(1000000000)` is `[0x80, 0x94, 0xeb, 0xdc, 0x03]`.
+
+ - `intbytes(x)` gives the big-endian two's-complement binary representation of `x`, taking
+   exactly as many whole bytes as needed to unambiguously identify the value and its sign. For
+   example, `intbytes(-128)` is `[0x80]`, `intbytes(-1)` is `[0xFF]`, `intbytes(0)` is `[]`,
+   `intbytes(1)` is `[0x01]`, `intbytes(128)` is `[0x00, 0x80]` etc.
+
+ - `utf8(S)` gives the sequence of bytes forming the UTF-8 encoding of the string `S`.
+
+ - `binary32(F)` and `binary64(D)` yield big-endian 4- and 8-byte IEEE 754 binary
+   representations of `F` and `D`, respectively.
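
Editorial note on the cheatsheet above: the helper functions it defines can be sketched in Python as follows. This is a minimal illustration and not part of the patch; the function names mirror the cheatsheet's notation, and `encode_integer` is a hypothetical name chosen here for the `SignedInteger` rule. It assumes unsigned lengths only, as the spec states for varints.

```python
def varint(m: int) -> bytes:
    """Unsigned varint: 7-bit groups, least significant group first."""
    if m < 0:
        raise ValueError("length indicators are never negative")
    out = bytearray()
    while True:
        group = m & 0x7F
        m >>= 7
        if m:
            out.append(group | 0x80)  # msb set: more groups follow
        else:
            out.append(group)         # last group: msb clear
            return bytes(out)


def intbytes(x: int) -> bytes:
    """Shortest big-endian two's-complement form of x; zero is empty."""
    if x == 0:
        return b""
    n = 1
    # Grow until x fits in an n-byte signed integer.
    while not (-(1 << (8 * n - 1)) <= x < (1 << (8 * n - 1))):
        n += 1
    return x.to_bytes(n, "big", signed=True)


def encode_integer(x: int) -> bytes:
    """Tag 0xB0, then the varint length of intbytes(x), then intbytes(x)."""
    body = intbytes(x)
    return b"\xB0" + varint(len(body)) + body


print(encode_integer(-257).hex(" "))
print(encode_integer(128).hex(" "))
```

Under these assumptions the helpers reproduce the examples quoted in the cheatsheet, such as `intbytes(128)` being `[0x00, 0x80]` and `intbytes(0)` being empty.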
--- a/preserves-binary.md
+++ b/preserves-binary.md
@@ -27,10 +27,9 @@ Each `Repr` starts with a tag byte, describing the kind of information
 represented. Depending on the tag, a length indicator, further encoded
 information, and/or an ending tag may follow.

-    tag                          (simple atomic data and small integers)
-    tag ++ binarydata            (most integers)
-    tag ++ length ++ binarydata  (large integers, strings, symbols, and binary)
-    tag ++ repr ++ ... ++ endtag (compound data)
+    tag                          (simple atomic data)
+    tag ++ length ++ binarydata  (floats, doubles, integers, strings, symbols, and binary)
+    tag ++ repr ++ ... ++ endtag (compound data and annotations)

 The unique end tag is byte value `0x84`.

@@ -41,7 +40,8 @@ write `varint(m)` for the varint-encoding of `m`. Quoting the

  [^see-also-leb128]: Also known as [LEB128][] encoding, for unsigned
    integers. Varints and LEB128-encoded integers differ only for
-    signed integers, which are not used in Preserves.
+    negative numbers, which cannot appear as length indicators and are
+    thus not used in Preserves.

 > Each byte in a varint, except the last byte, has the most
 > significant bit (msb) set – this indicates that there are further
@@ -49,13 +49,8 @@ write `varint(m)` for the varint-encoding of `m`. Quoting the
 > two's complement representation of the number in groups of 7 bits,
 > least significant group first.

-The following table illustrates varint-encoding.
-
-| Number, `m` | `m` in binary, grouped into 7-bit chunks  | `varint(m)` bytes |
-| ------      | -------------------                       | ------------      |
-| 15          | `0001111`                                 | 15                |
-| 300         | `0000010 0101100`                         | 172 2             |
-| 1000000000  | `0000011 1011100 1101011 0010100 0000000` | 128 148 235 220 3 |
+For example, `varint(15)` is `[0x0F]`, and `varint(1000000000)` is `[0x80, 0x94, 0xeb, 0xdc,
+0x03]`.

 It is an error for a varint-encoded `m` in a `Repr` to be anything
 other than the unique shortest encoding for that `m`. That is, a
@@ -80,7 +75,7 @@ serializing in some other implementation-defined order.
    [bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding),
    dictionary key/value pairs must be sorted by key. This is a
    necessary step for ensuring serialization of `Value`s is
-    canonical. We do not require that key/value pairs (or set
+    canonical. We encourage, but do not require, that key/value pairs (or set
    elements) be in sorted order for serialized `Value`s; however, a
    [canonical form][canonical] for `Repr`s does exist where a sorted
    ordering is required.
@@ -101,55 +96,38 @@ serializing in some other implementation-defined order.

 ### SignedIntegers.

-    «x» when x ∈ SignedInteger = [0xB0] ++ varint(m) ++ intbytes(x)  if ¬(-3≤x≤12) ∧ m>16
-                                 ([0xA0] + m - 1) ++ intbytes(x)     if ¬(-3≤x≤12) ∧ m≤16
-                                 ([0xA0] + x)                        if  (-3≤x≤-1)
-                                 ([0x90] + x)                        if  ( 0≤x≤12)
-                               where m =        |intbytes(x)|
-
-Integers in the range [-3,12] are compactly represented with tags
-between `0x90` and `0x9F` because they are so frequently used.
-Integers up to 16 bytes long are represented with a single-byte tag
-encoding the length of the integer. Larger integers are represented
-with an explicit varint length. Every `SignedInteger` *MUST* be
-represented with its shortest possible encoding.
+    «x» = [0xB0] ++ varint(|intbytes(x)|) ++ intbytes(x)  if x ∈ SignedInteger

 The function `intbytes(x)` gives the big-endian two's-complement
 binary representation of `x`, taking exactly as many whole bytes as
-needed to unambiguously identify the value and its sign, and `m =
-|intbytes(x)|`. The most-significant bit in the first byte in
-`intbytes(x)` <!-- for `x`≠0 --> is the sign bit.[^zero-intbytes] For
-example,
+needed to unambiguously identify the value and its sign. The value 0
+needs zero bytes to identify the value; non-zero values need at least
+one byte, and the most-significant bit in the first byte is the sign
+bit. For example,
+
+      «-257» = B0 02 FE FF     «-2» = B0 01 FE       «255» = B0 02 00 FF
+      «-256» = B0 02 FF 00     «-1» = B0 01 FF       «256» = B0 02 01 00
+      «-255» = B0 02 FF 01      «0» = B0 00        «32767» = B0 02 7F FF
+      «-129» = B0 02 FF 7F      «1» = B0 01 01     «32768» = B0 03 00 80 00
+      «-128» = B0 01 80       «127» = B0 01 7F     «65535» = B0 03 00 FF FF
+      «-127» = B0 01 81       «128» = B0 02 00 80  «65536» = B0 03 01 00 00

      «87112285931760246646623899502532662132736»
        = B0 12 01 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00
                00 00

-      «-257» = A1 FE FF        «-3» = 9D          «128» = A1 00 80
-      «-256» = A1 FF 00        «-2» = 9E          «255» = A1 00 FF
-      «-255» = A1 FF 01        «-1» = 9F          «256» = A1 01 00
-      «-254» = A1 FF 02         «0» = 90        «32767» = A1 7F FF
-      «-129» = A1 FF 7F         «1» = 91        «32768» = A2 00 80 00
-      «-128» = A0 80           «12» = 9C        «65535» = A2 00 FF FF
-      «-127» = A0 81           «13» = A0 0D     «65536» = A2 01 00 00
-        «-4» = A0 FC          «127» = A0 7F    «131072» = A2 02 00 00
-
-  [^zero-intbytes]: The value 0 needs zero bytes to identify the
-    value, so `intbytes(0)` is the empty byte string. Non-zero values
-    need at least one byte.
-
 ### Strings, ByteStrings and Symbols.

+    «S» = [0xB1] ++ varint(|utf8(S)|) ++ utf8(S)  if S ∈ String
+          [0xB2] ++ varint(|S|) ++ S              if S ∈ ByteString
+          [0xB3] ++ varint(|utf8(S)|) ++ utf8(S)  if S ∈ Symbol
+
 Syntax for these three types varies only in the tag used. For `String`
 and `Symbol`, the data following the tag is a UTF-8 encoding of the
 `Value`'s code points, while for `ByteString` it is the raw data
 contained within the `Value` unmodified.

-    «S» = [0xB1] ++ varint(|utf8(S)|) ++ utf8(S)  if S ∈ String
-          [0xB2] ++ varint(|S|) ++ S              if S ∈ ByteString
-          [0xB3] ++ varint(|utf8(S)|) ++ utf8(S)  if S ∈ Symbol
-
 ### Booleans.

    «#f» = [0x80]
@@ -157,39 +135,42 @@ contained within the `Value` unmodified.

 ### Floats and Doubles.

-    «F» when F ∈ Float  = [0x82] ++ binary32(F)
-    «D» when D ∈ Double = [0x83] ++ binary64(D)
+    «F» = [0x87, 0x04] ++ binary32(F)  if F ∈ Float
+    «D» = [0x87, 0x08] ++ binary64(D)  if D ∈ Double

 The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
 8-byte IEEE 754 binary representations of `F` and `D`, respectively.

 ### Embeddeds.

+    «#!V» = [0x86] ++ «V»
+
 The `Repr` of an `Embedded` is the `Repr` of a `Value` chosen to
 represent the denoted object, prefixed with `[0x86]`.

-    «#!V» = [0x86] ++ «V»
-
 ### Annotations.

-To annotate a `Repr` `r` with some `Value` `v`, prepend `r` with
-`[0x85] ++ «v»`. For example, the `Repr` corresponding to textual
-syntax `@a@b[]`, i.e. an empty sequence annotated with two symbols,
-`a` and `b`, is
+    «@V_1...@V_n V» = [0xBF] ++ «V» ++ «V_1» ++...++ «V_n» ++ [0x84]

-    «@a @b []»
-      = [0x85] ++ «a» ++ [0x85] ++ «b» ++ «[]»
-      = [0x85, 0xB3, 0x01, 0x61, 0x85, 0xB3, 0x01, 0x62, 0xB5, 0x84]
+`V` *MUST NOT* itself be annotated, but `V_1...V_n` *MAY* be
+annotated. For example, the `Repr` corresponding to textual syntax
+`@a@b[]`, i.e. an empty sequence annotated with two symbols, `a` and
+`b`, is
+
+    «@a @b []» = [0xBF] ++ «[]» ++ «a» ++ «b» ++ [0x84]
+               = [0xBF, 0xB5, 0x84, 0xB3, 0x01, 0x61, 0xB3, 0x01, 0x62, 0x84]
+
+Implementations *SHOULD* default to omitting annotations from binary `Repr`s.

 ## Security Considerations

 **Annotations.** In modes where a `Value` is being read while
-annotations are skipped, an endless sequence of annotations may give an
+annotations are skipped, an endless nesting of annotations may give an
 illusion of progress.

 **Canonical form for cryptographic hashing and signing.** No canonical
-textual encoding of a `Value` is specified. A
-[canonical form][canonical] exists for binary encoded `Value`s, and
+*textual* encoding of a `Value` is specified. However, a [canonical
+form][canonical] exists for binary encoded `Value`s, and
 implementations *SHOULD* produce canonical binary encodings by
 default; however, an implementation *MAY* permit two serializations of
 the same `Value` to yield different binary `Repr`s.
@@ -215,25 +196,29 @@ a binary-syntax document; otherwise, it should be interpreted as text.

     80 - False
     81 - True
-     82 - Float
-     83 - Double
+    (82)  RESERVED
+    (83)  RESERVED
     84 - End marker
-     85 - Annotation
+    (85)  RESERVED
     86 - Embedded
-    (8x)  RESERVED 87-8F
+     87 - Float and Double
+    (8x)  RESERVED 88-8F

-     9x - Small integers 0..12,-3..-1
-     An - Medium integers, (n+1) bytes long
-     B0 - Large integers, variable length
+    (9x)  RESERVED
+    (Ax)  RESERVED
+
+     B0 - Integer
     B1 - String
     B2 - ByteString
     B3 - Symbol
-
     B4 - Record
     B5 - Sequence
     B6 - Set
     B7 - Dictionary

+    (Bx)  RESERVED B8-BE
+     BF - Annotated Repr (not itself starting with BF) followed by annotations
+
 ## Appendix. Binary SignedInteger representation

 Languages that provide fixed-width machine word types may find the
@@ -242,15 +227,14 @@ values.

 | Integer range                              | Bytes required | Encoding (hex)                               |
 | ---                                        | ---            | ---                                          |
-| -3 ≤ n ≤ 12                                | 1              | `9X`                                         |
-| -2<sup>7</sup> ≤ n < 2<sup>7</sup> (i8)    | 2              | `A0` `XX`                                    |
-| -2<sup>15</sup> ≤ n < 2<sup>15</sup> (i16) | 3              | `A1` `XX` `XX`                               |
-| -2<sup>23</sup> ≤ n < 2<sup>23</sup> (i24) | 4              | `A2` `XX` `XX` `XX`                          |
-| -2<sup>31</sup> ≤ n < 2<sup>31</sup> (i32) | 5              | `A3` `XX` `XX` `XX` `XX`                     |
-| -2<sup>39</sup> ≤ n < 2<sup>39</sup> (i40) | 6              | `A4` `XX` `XX` `XX` `XX` `XX`                |
-| -2<sup>47</sup> ≤ n < 2<sup>47</sup> (i48) | 7              | `A5` `XX` `XX` `XX` `XX` `XX` `XX`           |
-| -2<sup>55</sup> ≤ n < 2<sup>55</sup> (i56) | 8              | `A6` `XX` `XX` `XX` `XX` `XX` `XX` `XX`      |
-| -2<sup>63</sup> ≤ n < 2<sup>63</sup> (i64) | 9              | `A7` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
+| -2<sup>7</sup> ≤ n < 2<sup>7</sup> (i8)    | 3              | `B0` `01` `XX`                                    |
+| -2<sup>15</sup> ≤ n < 2<sup>15</sup> (i16) | 4              | `B0` `02` `XX` `XX`                               |
+| -2<sup>23</sup> ≤ n < 2<sup>23</sup> (i24) | 5              | `B0` `03` `XX` `XX` `XX`                          |
+| -2<sup>31</sup> ≤ n < 2<sup>31</sup> (i32) | 6              | `B0` `04` `XX` `XX` `XX` `XX`                     |
+| -2<sup>39</sup> ≤ n < 2<sup>39</sup> (i40) | 7              | `B0` `05` `XX` `XX` `XX` `XX` `XX`                |
+| -2<sup>47</sup> ≤ n < 2<sup>47</sup> (i48) | 8              | `B0` `06` `XX` `XX` `XX` `XX` `XX` `XX`           |
+| -2<sup>55</sup> ≤ n < 2<sup>55</sup> (i56) | 9              | `B0` `07` `XX` `XX` `XX` `XX` `XX` `XX` `XX`      |
+| -2<sup>63</sup> ≤ n < 2<sup>63</sup> (i64) | 10             | `B0` `08` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |

 <!-- Heading to visually offset the footnotes from the main document: -->
 ## Notes
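
Editorial note on the proposed annotation and float layouts: the sketch below builds the `@a @b []` example from the Annotations hunk and the length-prefixed `0x87` float forms. It is an illustration only and not part of the patch; all function names are hypothetical, and each function takes already-encoded `Repr` bytes instead of a `Value` model.

```python
import struct
from typing import Iterable

END = b"\x84"  # the unique end tag


def varint(m: int) -> bytes:
    """Unsigned varint: 7-bit groups, least significant group first."""
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)


def encode_symbol(name: str) -> bytes:
    data = name.encode("utf-8")
    return b"\xB3" + varint(len(data)) + data


def encode_sequence(item_reprs: Iterable[bytes]) -> bytes:
    return b"\xB5" + b"".join(item_reprs) + END


def encode_annotated(value_repr: bytes, annotation_reprs: Iterable[bytes]) -> bytes:
    """0xBF, the underlying Repr, then each annotation Repr, then the end tag."""
    if value_repr[:1] == b"\xBF":
        # The proposal says V MUST NOT itself be annotated.
        raise ValueError("annotated value must not itself start with 0xBF")
    return b"\xBF" + value_repr + b"".join(annotation_reprs) + END


def encode_float(f: float) -> bytes:
    return b"\x87\x04" + struct.pack(">f", f)  # big-endian binary32


def encode_double(d: float) -> bytes:
    return b"\x87\x08" + struct.pack(">d", d)  # big-endian binary64


example = encode_annotated(encode_sequence([]),
                           [encode_symbol("a"), encode_symbol("b")])
print(example.hex(" "))
```

The printed bytes can be compared against the `[0xBF, 0xB5, 0x84, 0xB3, 0x01, 0x61, 0xB3, 0x01, 0x62, 0x84]` sequence given in the Annotations hunk.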