From d6b0b8bbd87d6159b68bb1114061a368c4bc7b31 Mon Sep 17 00:00:00 2001
From: Tony Garnock-Jones <tonyg@leastfixedpoint.com>
Date: Sat, 11 Jun 2022 11:13:48 +0200
Subject: [PATCH] Loosen re overlong varints. Clarify re use of length info

---
 canonical-binary.md |  5 +++++
 preserves-binary.md | 30 ++++++++++++++++++++++++------
 2 files changed, 29 insertions(+), 6 deletions(-)
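
Note: the varint rules touched by this patch can be sketched roughly as
follows. This is a minimal illustration, not part of the patch or of any
existing Preserves implementation. The names `encode_varint`,
`decode_varint` and `max_leading_zeros` are hypothetical. The rule that the
final byte of a varint has its high bit set is inferred from the table rows
in preserves-binary.md (300 encodes as `2 172`), and the limit of eight
leading `0` bytes comes from the guideline footnote added below.

```python
from typing import Tuple


def encode_varint(n: int) -> bytes:
    """Encode n as the shortest big-endian base-128 varint.

    Assumption (inferred from the table in preserves-binary.md): only the
    final byte has its high bit set.
    """
    if n < 0:
        raise ValueError("lengths are non-negative")
    digits = [n & 0x7F]
    n >>= 7
    while n:
        digits.append(n & 0x7F)
        n >>= 7
    digits.reverse()          # most significant 7-bit group first
    digits[-1] |= 0x80        # mark the final byte
    return bytes(digits)


def decode_varint(data: bytes, max_leading_zeros: int = 8) -> Tuple[int, int]:
    """Decode one varint from the start of data.

    Returns (value, bytes_consumed). Overlong encodings are accepted, but
    more than max_leading_zeros redundant leading 0 bytes are rejected.
    """
    value = 0
    leading_zeros = 0
    seen_nonzero = False
    for i, b in enumerate(data):
        if b & 0x80:
            # Final byte: fold in its low 7 bits and stop.
            return (value << 7) | (b & 0x7F), i + 1
        if b == 0 and not seen_nonzero:
            leading_zeros += 1
            if leading_zeros > max_leading_zeros:
                raise ValueError("excessively long varint")
        else:
            seen_nonzero = True
        value = (value << 7) | b
    raise ValueError("truncated varint")
```

A canonical-form checker could additionally require that
`encode_varint(value)` equals the bytes actually consumed.
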
diff --git a/canonical-binary.md b/canonical-binary.md
index 8393cab..3e5ebef 100644
--- a/canonical-binary.md
+++ b/canonical-binary.md
@@ -23,6 +23,11 @@ binary syntax](preserves-binary.html).
 **Annotations.**
 Annotations *MUST NOT* be present.
 
+**Length representations.** [Varint-encoded
+lengths](./preserves-binary.html#varint) *MUST* appear in the unique shortest
+encoding for a given length. That is, canonical varint-encodings *MUST
+NOT* start with a `0` byte.
+
 **Sets.**
 The elements of a `Set` *MUST* be serialized sorted in ascending order
 by comparing their canonical encoded binary representations.
diff --git a/preserves-binary.md b/preserves-binary.md
index f188d60..4047048 100644
--- a/preserves-binary.md
+++ b/preserves-binary.md
@@ -29,12 +29,13 @@ Each `Repr` starts with a tag byte, describing the kind of information
 represented.
 
 However, inspired by [argdata][], a `Repr` does *not* describe its own
-length. Instead, the surrounding context must supply the length of the
-`Repr`.
+length. Instead, the surrounding context must supply the expected length
+of the `Repr`.
 
 As a consequence, `Repr`s for `Compound` values store the lengths of
 their contained values. Each contained `Value` is represented as a
-length in bytes followed by its own `Repr`.
+length in bytes followed by its own `Repr`. Implementations use each
+stored length to decide when to stop reading the following `Repr`.
 
 <a id="varint"></a> Each length is stored as an [argdata][]-compatible
 big-endian base 128 *varint*.[^see-also-leb128] Each byte of a varint
@@ -63,9 +64,18 @@ The following table illustrates varint-encoding.
 | 300         | `0000010 0101100`                         | 2 172           |
 | 1000000000  | `0000011 1011100 1101011 0010100 0000000` | 3 92 107 20 128 |
 
-It is an error for a varint-encoded `m` in a `Repr` to be anything other
-than the unique shortest encoding for that `m`. That is, a
-varint-encoding of `m` *MUST NOT* start with `0`.
+There is no requirement that a varint-encoded `m` in a `Repr` be the unique shortest encoding
+for that `m`.[^overlong-varint] However, implementations *SHOULD* use the shortest encoding
+wherever possible when writing, and *SHOULD* reject excessively long encodings when reading
+encoded values.[^excessively-long-varint]
+
+  [^overlong-varint]: **Implementation note.** The spec permits overlong length encodings to
+    reduce wasted activity in resource-constrained situations. If an implementation is in
+    anything other than a very low-level language, it is likely to be able to use
+    [IOList](./conventions.html#iolists)-style data structures to avoid unnecessary copying.
+
+  [^excessively-long-varint]: As a guideline, reject any varint with more than eight leading
+    `0` bytes.
 
 ### Records, Sequences, Sets and Dictionaries.
 
@@ -107,6 +117,10 @@ serializing in some other implementation-defined order.
     but encoding and then sorting byte strings is much more likely to
     be within easy reach.
 
+No sentinel marks the end of a sequence of length-prefixed `Repr`s.
+During decoding, use the length of the containing `Repr` to decide when
+to stop expecting more contained `Repr`s.
+
 ### SignedIntegers.
 
     «x» when x ∈ SignedInteger = [0xA3] ++ intbytes(x)
@@ -192,6 +206,10 @@ an empty sequence annotated with two symbols, `a` and `b`, is
 annotations are skipped, an endless sequence of annotations may give an
 illusion of progress.
 
+**Overlong varints.** The binary format allows (but discourages) overlong [varint](#varint)s.
+Consider optional restrictions on the number of redundant leading `0` bytes accepted when
+reading a varint.
+
 **Canonical form for cryptographic hashing and signing.** No canonical
 textual encoding of a `Value` is specified. A
 [canonical form][canonical] exists for binary encoded `Value`s, and