Loosen re overlong varints. Clarify re use of length info

2022-06-11 11:13:48 +02:00 · 2022-06-11 11:13:48 +02:00 · d6b0b8bbd8
parent dd231284f1
commit d6b0b8bbd8
2 changed files with 29 additions and 6 deletions
--- a/canonical-binary.md
+++ b/canonical-binary.md
@@ -23,6 +23,11 @@ binary syntax](preserves-binary.html).
 **Annotations.**
 Annotations *MUST NOT* be present.

+**Length representations.** [Varint-encoded
+lengths](./preserves-binary.html#varint) *MUST* appear in the unique shortest
+encoding for a given length. That is, canonical varint-encodings *MUST
+NOT* start with `0`.
+
 **Sets.**
 The elements of a `Set` *MUST* be serialized sorted in ascending order
 by comparing their canonical encoded binary representations.
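
A minimal sketch of how a reader might check the canonical-form rule for lengths added in this hunk. The function name `is_canonical_varint` is hypothetical. The byte layout (7-bit digits, most significant first, high bit set only on the final byte) is an assumption inferred from the table rows in `preserves-binary.md`, since the full varint description is not visible in this diff.

```python
def is_canonical_varint(encoded: bytes) -> bool:
    """Return True if `encoded` is a well-formed varint in shortest form.

    Assumed layout (inferred from the spec's table, not confirmed here):
    every byte except the last has its high bit clear, and the last byte
    has its high bit set.
    """
    if not encoded:
        return False
    *head, last = encoded
    well_formed = all(b < 0x80 for b in head) and last >= 0x80
    # Canonical encodings must not start with a redundant 0 byte.
    return well_formed and encoded[0] != 0
```

Under these assumptions, `bytes([2, 172])` (the table's encoding of 300) passes the check, while the overlong `bytes([0, 2, 172])` does not.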
--- a/preserves-binary.md
+++ b/preserves-binary.md
@@ -29,12 +29,13 @@ Each `Repr` starts with a tag byte, describing the kind of information
 represented.

 However, inspired by [argdata][], a `Repr` does *not* describe its own
-length. Instead, the surrounding context must supply the length of the
-`Repr`.
+length. Instead, the surrounding context must supply the expected length
+of the `Repr`.

 As a consequence, `Repr`s for `Compound` values store the lengths of
 their contained values. Each contained `Value` is represented as a
-length in bytes followed by its own `Repr`.
+length in bytes followed by its own `Repr`. Implementations use each
+stored length to decide when to stop reading the following `Repr`.

 <a id="varint"></a> Each length is stored as an [argdata][]-compatible
 big-endian base 128 *varint*.[^see-also-leb128] Each byte of a varint
@@ -63,9 +64,18 @@ The following table illustrates varint-encoding.
 | 300         | `0000010 0101100`                         | 2 172           |
 | 1000000000  | `0000011 1011100 1101011 0010100 0000000` | 3 92 107 20 128 |
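
The two table rows above can be reproduced with a short encoder. This is a sketch, not the reference implementation: the name `encode_varint` is hypothetical, and the layout (big-endian 7-bit digits with the high bit set only on the final byte) is an assumption read off the table, because the prose describing the bytes is cut off in this diff.

```python
def encode_varint(m: int) -> bytes:
    """Encode a non-negative length as the shortest big-endian base-128 varint.

    Assumption: the final byte carries the high bit (value + 128) and all
    earlier bytes have the high bit clear, as the table rows suggest.
    """
    if m < 0:
        raise ValueError("lengths must be non-negative")
    digits = [m & 0x7F]          # least significant 7-bit digit first
    m >>= 7
    while m:
        digits.append(m & 0x7F)
        m >>= 7
    digits.reverse()             # most significant digit first
    digits[-1] |= 0x80           # mark the final byte
    return bytes(digits)
```

For example, `list(encode_varint(300))` gives `[2, 172]`, matching the table.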

-It is an error for a varint-encoded `m` in a `Repr` to be anything other
-than the unique shortest encoding for that `m`. That is, a
-varint-encoding of `m` *MUST NOT* start with `0`.
+There is no requirement that a varint-encoded `m` in a `Repr` be the unique shortest encoding
+for that `m`.[^overlong-varint] However, implementations *SHOULD* use the shortest encoding
+wherever possible when writing, and *SHOULD* reject excessively long encodings when reading
+encoded values.[^excessively-long-varint]
+
+  [^overlong-varint]: **Implementation note.** The spec permits overlong length encodings to
+    reduce wasted work in resource-constrained situations. An implementation written in
+    anything other than a very low-level language can likely use
+    [IOList](./conventions.html#iolists)-style data structures to avoid unnecessary copying.
+
+  [^excessively-long-varint]: As a guideline, reject more than eight leading `0` bytes in a
+    varint.
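
One way a reader could tolerate overlong varints while still applying the guideline in the footnote above is sketched below. The name `decode_varint` and the parameter `max_leading_zeros` are hypothetical; the default of 8 comes from the footnote, and the byte layout (final byte has its high bit set) is an assumption inferred from the encoding table.

```python
def decode_varint(data: bytes, offset: int = 0,
                  max_leading_zeros: int = 8) -> tuple[int, int]:
    """Decode a varint starting at `offset`.

    Returns (value, offset_of_next_byte). Overlong encodings are accepted,
    but more than `max_leading_zeros` redundant leading 0 bytes is an error.
    """
    value = 0
    leading_zeros = 0
    seen_nonzero = False
    pos = offset
    while pos < len(data):
        byte = data[pos]
        pos += 1
        if byte == 0 and not seen_nonzero:
            leading_zeros += 1
            if leading_zeros > max_leading_zeros:
                raise ValueError("too many redundant leading 0 bytes in varint")
            continue
        seen_nonzero = True
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:          # assumed marker of the final byte
            return value, pos
    raise ValueError("truncated varint")
```

With these assumptions, `decode_varint(bytes([0, 0, 2, 172]))` yields the same value as the shortest form `bytes([2, 172])`, and an input with nine leading `0` bytes raises `ValueError`.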

 ### Records, Sequences, Sets and Dictionaries.

@@ -107,6 +117,10 @@ serializing in some other implementation-defined order.
    but encoding and then sorting byte strings is much more likely to
    be within easy reach.

+No sentinel marks the end of a sequence of length-prefixed `Repr`s.
+During decoding, use the length of the containing `Repr` to decide when
+to stop expecting more contained `Repr`s.
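
The stopping rule described above can be sketched as a loop that runs until the containing `Repr`'s bytes are used up. Everything named here is hypothetical: `split_contained` takes `body`, assumed to be the part of a containing `Repr` that holds its length-prefixed contained `Repr`s, and `_read_length` is a bare-bones varint reader based on the layout inferred from the encoding table (it does not limit leading `0` bytes).

```python
def _read_length(body: bytes, pos: int) -> tuple[int, int]:
    """Read one varint length; return (length, position after the varint)."""
    value = 0
    while pos < len(body):
        byte = body[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:          # assumed marker of the final varint byte
            return value, pos
    raise ValueError("truncated length")


def split_contained(body: bytes) -> list[bytes]:
    """Split `body` into contained Reprs using only the stored lengths.

    No sentinel is consulted: the loop ends exactly when `body` is exhausted.
    """
    reprs = []
    pos = 0
    while pos < len(body):
        length, pos = _read_length(body, pos)
        end = pos + length
        if end > len(body):
            raise ValueError("contained Repr overruns its container")
        reprs.append(body[pos:end])
        pos = end
    return reprs
```

For instance, `split_contained(bytes([0x82, 0xAA, 0xBB, 0x81, 0xCC]))` returns two items, `b"\xaa\xbb"` and `b"\xcc"`, and an empty `body` returns an empty list.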
+
 ### SignedIntegers.

    «x» when x ∈ SignedInteger = [0xA3] ++ intbytes(x)
@@ -192,6 +206,10 @@ an empty sequence annotated with two symbols, `a` and `b`, is
 annotations are skipped, an endless sequence of annotations may give an
 illusion of progress.

+**Overlong varints.** The binary format allows (but discourages) overlong [varint](#varint)s.
+Implementations may wish to limit the number of redundant leading `0` bytes accepted when
+reading a varint.
+
 **Canonical form for cryptographic hashing and signing.** No canonical
 textual encoding of a `Value` is specified. A
 [canonical form][canonical] exists for binary encoded `Value`s, and