Loosen re overlong varints. Clarify re use of length info

Tony Garnock-Jones 2022-06-11 11:13:48 +02:00
parent dd231284f1
commit d6b0b8bbd8
2 changed files with 29 additions and 6 deletions

@@ -23,6 +23,11 @@ binary syntax](preserves-binary.html).
**Annotations.**
Annotations *MUST NOT* be present.
**Length representations.** [Varint-encoded
lengths](./preserves-binary.html#varint) *MUST* appear in the unique shortest
encoding for a given length. That is, canonical varint-encodings *MUST
NOT* start with `0`.
**Sets.**
The elements of a `Set` *MUST* be serialized sorted in ascending order
by comparing their canonical encoded binary representations.
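As a rough illustration of the two canonical-form rules above (not part of the specification), the following Python sketch checks that an encoded length does not begin with a `0` byte and orders already-encoded set elements by their bytes. The function names are hypothetical, and reading "ascending order" as plain lexicographic byte order is an assumption.

```python
def is_shortest_varint(encoded: bytes) -> bool:
    """Return True if a well-formed varint has no redundant leading 0 byte.

    Assumes `encoded` is already known to be a complete varint.
    """
    return len(encoded) > 0 and encoded[0] != 0


def canonical_set_order(encoded_elements):
    """Sort canonical element encodings in ascending byte order.

    Assumption: the comparison is lexicographic over unsigned bytes,
    which is how Python compares `bytes` objects.
    """
    return sorted(encoded_elements)
```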

@@ -29,12 +29,13 @@ Each `Repr` starts with a tag byte, describing the kind of information
represented.
However, inspired by [argdata][], a `Repr` does *not* describe its own
length. Instead, the surrounding context must supply the length of the
`Repr`.
length. Instead, the surrounding context must supply the expected length
of the `Repr`.
As a consequence, `Repr`s for `Compound` values store the lengths of
their contained values. Each contained `Value` is represented as a
length in bytes followed by its own `Repr`.
length in bytes followed by its own `Repr`. Implementations use each
stored length to decide when to stop reading the following `Repr`.
<a id="varint"></a> Each length is stored as an [argdata][]-compatible
big-endian base 128 *varint*.[^see-also-leb128] Each byte of a varint
@@ -63,9 +64,18 @@ The following table illustrates varint-encoding.
| 300 | `0000010 0101100` | 2 172 |
| 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 3 92 107 20 128 |
It is an error for a varint-encoded `m` in a `Repr` to be anything other
than the unique shortest encoding for that `m`. That is, a
varint-encoding of `m` *MUST NOT* start with `0`.
There is no requirement that a varint-encoded `m` in a `Repr` be the unique shortest encoding
for that `m`.[^overlong-varint] However, implementations *SHOULD* use the shortest encoding
wherever possible when writing, and *SHOULD* reject excessively long encodings when reading
encoded values.[^excessively-long-varint]
[^overlong-varint]: **Implementation note.** The spec permits overlong length encodings to
reduce wasted activity in resource-constrained situations. If an implementation is in
anything other than a very low-level language, it is likely to be able to use
[IOList](./conventions.html#iolists)-style data structures to avoid unnecessary copying.
[^excessively-long-varint]: As a guideline, reject more than eight leading `0` bytes in a
varint.
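The reading and writing guidance above might be sketched in Python as follows. This is a minimal sketch, not a reference implementation: the byte layout (big-endian base-128 digits, with the high bit set only on the final byte) is inferred from the table's byte values, the names `encode_varint`, `decode_varint` and `MAX_LEADING_ZERO_BYTES` are hypothetical, and the limit of eight comes from the guideline footnote.

```python
MAX_LEADING_ZERO_BYTES = 8  # guideline value from the footnote; the name is hypothetical


def encode_varint(m: int) -> bytes:
    """Write `m` in its shortest form (assumed layout: high bit marks the final byte)."""
    if m < 0:
        raise ValueError("lengths are non-negative")
    digits = [m & 0x7F]
    m >>= 7
    while m:
        digits.append(m & 0x7F)
        m >>= 7
    digits.reverse()      # most significant 7-bit group first
    digits[-1] |= 0x80    # flag the final byte
    return bytes(digits)


def decode_varint(data: bytes, offset: int = 0):
    """Read a varint, tolerating a bounded number of redundant leading 0 bytes.

    Returns (value, offset just past the varint).
    """
    value = 0
    leading_zeros = 0
    seen_nonzero = False
    pos = offset
    while True:
        if pos >= len(data):
            raise ValueError("truncated varint")
        byte = data[pos]
        pos += 1
        if byte == 0 and not seen_nonzero:
            leading_zeros += 1
            if leading_zeros > MAX_LEADING_ZERO_BYTES:
                raise ValueError("excessively long varint")
        else:
            seen_nonzero = True
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:   # final byte reached
            return value, pos
```

Under these assumptions, a writer always emits the shortest form, while a reader accepts a few redundant leading `0` bytes and rejects anything beyond the configured limit.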
### Records, Sequences, Sets and Dictionaries.
@@ -107,6 +117,10 @@ serializing in some other implementation-defined order.
but encoding and then sorting byte strings is much more likely to
be within easy reach.
No sentinel marks the end of a sequence of length-prefixed `Repr`s.
During decoding, use the length of the containing `Repr` to decide when
to stop expecting more contained `Repr`s.
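A decoder following this rule could look roughly like the Python sketch below, which splits the body of a containing `Repr` into its contained `Repr`s using only the known end offset. The helper names are hypothetical, the varint layout (high bit set on the final byte) is inferred from the varint table, and the check for overlong varints is left out to keep the sketch short.

```python
def read_varint(buf: bytes, pos: int, end: int):
    """Read one length varint from buf[pos:end]; return (length, new position)."""
    value = 0
    while pos < end:
        byte = buf[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:  # assumed layout: high bit marks the final byte
            return value, pos
    raise ValueError("varint runs past the end of the containing Repr")


def split_contained_reprs(buf: bytes, start: int, end: int):
    """Return the contained Reprs found in buf[start:end].

    There is no sentinel: the loop stops exactly when `end`, the limit
    supplied by the containing Repr's own length, is reached.
    """
    items = []
    pos = start
    while pos < end:
        length, pos = read_varint(buf, pos, end)
        if pos + length > end:
            raise ValueError("contained Repr overruns its container")
        items.append(buf[pos:pos + length])
        pos += length
    return items
```

For instance, with placeholder payload bytes, `split_contained_reprs(bytes([0x82, 0xA3, 0x01, 0x81, 0xA3]), 0, 5)` yields two contained `Repr`s of two bytes and one byte respectively.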
### SignedIntegers.
«x» when x ∈ SignedInteger = [0xA3] ++ intbytes(x)
@@ -192,6 +206,10 @@ an empty sequence annotated with two symbols, `a` and `b`, is
annotations are skipped, an endless sequence of annotations may give an
illusion of progress.
**Overlong varints.** The binary format allows (but discourages) overlong [varint](#varint)s.
Consider optional restrictions on the number of redundant leading `0` bytes accepted when
reading a varint.
**Canonical form for cryptographic hashing and signing.** No canonical
textual encoding of a `Value` is specified. A
[canonical form][canonical] exists for binary encoded `Value`s, and