From d6b0b8bbd87d6159b68bb1114061a368c4bc7b31 Mon Sep 17 00:00:00 2001
From: Tony Garnock-Jones
Date: Sat, 11 Jun 2022 11:13:48 +0200
Subject: [PATCH] Loosen re overlong varints. Clarify re use of length info

---
 canonical-binary.md |  5 +++++
 preserves-binary.md | 30 ++++++++++++++++++++++++------
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/canonical-binary.md b/canonical-binary.md
index 8393cab..3e5ebef 100644
--- a/canonical-binary.md
+++ b/canonical-binary.md
@@ -23,6 +23,11 @@ binary syntax](preserves-binary.html).
 
 **Annotations.** Annotations *MUST NOT* be present.
 
+**Length representations.** [Varint-encoded
+lengths](./preserves-binary.html#varint) *MUST* appear in the unique shortest
+encoding for a given length. That is, canonical varint-encodings *MUST
+NOT* start with `0`.
+
 **Sets.** The elements of a `Set` *MUST* be serialized sorted in
 ascending order by comparing their canonical encoded binary
 representations.
diff --git a/preserves-binary.md b/preserves-binary.md
index f188d60..4047048 100644
--- a/preserves-binary.md
+++ b/preserves-binary.md
@@ -29,12 +29,13 @@ Each `Repr` starts with a tag byte, describing the kind of information
 represented.
 
 However, inspired by [argdata][], a `Repr` does *not* describe its own
-length. Instead, the surrounding context must supply the length of the
-`Repr`.
+length. Instead, the surrounding context must supply the expected length
+of the `Repr`.
 
 As a consequence, `Repr`s for `Compound` values store the lengths of
 their contained values. Each contained `Value` is represented as a
-length in bytes followed by its own `Repr`.
+length in bytes followed by its own `Repr`. Implementations use each
+stored length to decide when to stop reading the following `Repr`.
 
 Each length is stored as an [argdata][]-compatible big-endian base 128
 *varint*.[^see-also-leb128] Each byte of a varint
@@ -63,9 +64,18 @@ The following table illustrates varint-encoding.
 | 300 | `0000010 0101100` | 2 172 |
 | 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 3 92 107 20 128 |
 
-It is an error for a varint-encoded `m` in a `Repr` to be anything other
-than the unique shortest encoding for that `m`. That is, a
-varint-encoding of `m` *MUST NOT* start with `0`.
+There is no requirement that a varint-encoded `m` in a `Repr` be the unique shortest encoding
+for that `m`.[^overlong-varint] However, implementations *SHOULD* use the shortest encoding
+wherever possible when writing, and *SHOULD* reject excessively long encodings when reading
+encoded values.[^excessively-long-varint]
+
+  [^overlong-varint]: **Implementation note.** The spec permits overlong length encodings to
+    reduce wasted activity in resource-constrained situations. If an implementation is in
+    anything other than a very low-level language, it is likely to be able to use
+    [IOList](./conventions.html#iolists)-style data structures to avoid unnecessary copying.
+
+  [^excessively-long-varint]: As a guideline, reject more than eight leading `0` bytes in a
+    varint.
 
 ### Records, Sequences, Sets and Dictionaries.
 
@@ -107,6 +117,10 @@ serializing in some other implementation-defined order.
 but encoding and then sorting byte strings is much more likely to be
 within easy reach.
 
+No sentinel marks the end of a sequence of length-prefixed `Repr`s.
+During decoding, use the length of the containing `Repr` to decide when
+to stop expecting more contained `Repr`s.
+
 ### SignedIntegers.
 
     «x» when x ∈ SignedInteger = [0xA3] ++ intbytes(x)
@@ -192,6 +206,10 @@ an empty sequence annotated with two symbols, `a` and `b`, is
 annotations are skipped, an endless sequence of annotations may give an
 illusion of progress.
 
+**Overlong varints.** The binary format allows (but discourages) overlong [varint](#varint)s.
+Consider optional restrictions on the number of redundant leading `0` bytes accepted when
+reading a varint.
+
 **Canonical form for cryptographic hashing and signing.** No canonical
 textual encoding of a `Value` is specified. A [canonical form][canonical]
 exists for binary encoded `Value`s, and