Spec change proposal for #41
parent 34f92c3870
commit d11f008705
@@ -1 +1,36 @@
(TODO)
[varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints

    «#f» = [0x80]
    «#t» = [0x81]

    «#!V» = [0x86] ++ «V»

    «V» if V ∈ Float  = [0x87] ++ varint(|binary32(V)|) ++ binary32(V)
    «V» if V ∈ Double = [0x87] ++ varint(|binary64(V)|) ++ binary64(V)

    «V» if V ∈ SignedInteger = [0xB0] ++ varint(|intbytes(V)|) ++ intbytes(V)
    «V» if V ∈ String        = [0xB1] ++ varint(|utf8(V)|) ++ utf8(V)
    «V» if V ∈ ByteString    = [0xB2] ++ varint(|V|) ++ V
    «V» if V ∈ Symbol        = [0xB3] ++ varint(|utf8(V)|) ++ utf8(V)

    «<L F_1...F_m>»       = [0xB4] ++ «L» ++ «F_1» ++...++ «F_m» ++ [0x84]
    «[X_1...X_m]»         = [0xB5] ++ «X_1» ++...++ «X_m» ++ [0x84]
    «#{E_1...E_m}»        = [0xB6] ++ «E_1» ++...++ «E_m» ++ [0x84]
    «{K_1:V_1...K_m:V_m}» = [0xB7] ++ «K_1» ++ «V_1» ++...++ «K_m» ++ «V_m» ++ [0x84]

    «@V_1...@V_n V» = [0xBF] ++ «V» ++ «V_1» ++...++ «V_n» ++ [0x84]

Where

- `varint(m)` is the [varint-encoding][varint] of `m`; for example, `varint(15)` is `[0x0F]`,
  and `varint(1000000000)` is `[0x80, 0x94, 0xeb, 0xdc, 0x03]`.

- `intbytes(x)` gives the big-endian two's-complement binary representation of `x`, taking
  exactly as many whole bytes as needed to unambiguously identify the value and its sign. For
  example, `intbytes(-128)` is `[0x80]`, `intbytes(-1)` is `[0xFF]`, `intbytes(0)` is `[]`,
  `intbytes(1)` is `[0x01]`, `intbytes(128)` is `[0x00, 0x80]`, etc.

- `utf8(S)` gives the sequence of bytes forming the UTF-8 encoding of the string `S`.

- `binary32(F)` and `binary64(D)` yield big-endian 4- and 8-byte IEEE 754 binary
  representations of `F` and `D`, respectively.
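
As an illustration only, the summary above can be exercised with a minimal Python sketch. The `Symbol` class, the `lengthed` helper and the mapping from Python types to `Value` kinds are assumptions made for this example, not part of the proposal. Every Python `float` is treated as a `Double`, and records, sets, embeddeds and annotations are left out.

```python
import struct


class Symbol(str):
    """Hypothetical marker type: a str subclass standing in for a Symbol."""


def varint(m: int) -> bytes:
    """Base-128 encoding of a length, least significant 7-bit group first."""
    if m < 0:
        raise ValueError("lengths are never negative")
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)  # msb set: more groups follow
        m >>= 7
    out.append(m)
    return bytes(out)


def intbytes(x: int) -> bytes:
    """Shortest big-endian two's-complement form; zero is the empty string."""
    if x == 0:
        return b""
    magnitude_bits = x.bit_length() if x > 0 else (~x).bit_length()
    # One extra bit is needed for the sign, hence the "+ 1" whole byte.
    return x.to_bytes(magnitude_bits // 8 + 1, "big", signed=True)


def lengthed(tag: int, payload: bytes) -> bytes:
    """tag ++ varint(|payload|) ++ payload"""
    return bytes([tag]) + varint(len(payload)) + payload


def encode(v) -> bytes:
    if v is False:
        return b"\x80"
    if v is True:
        return b"\x81"
    if isinstance(v, float):
        return lengthed(0x87, struct.pack(">d", v))  # assumption: always Double
    if isinstance(v, int):
        return lengthed(0xB0, intbytes(v))
    if isinstance(v, Symbol):  # checked before str, since Symbol subclasses str
        return lengthed(0xB3, v.encode("utf-8"))
    if isinstance(v, str):
        return lengthed(0xB1, v.encode("utf-8"))
    if isinstance(v, bytes):
        return lengthed(0xB2, v)
    if isinstance(v, list):
        return b"\xb5" + b"".join(encode(x) for x in v) + b"\x84"
    if isinstance(v, dict):
        body = b"".join(encode(k) + encode(val) for k, val in v.items())
        return b"\xb7" + body + b"\x84"
    raise TypeError(f"no encoding sketched for {type(v).__name__}")
```

Dictionary entries are written in insertion order here; the sketch makes no attempt at a canonical ordering.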
@@ -27,10 +27,9 @@ Each `Repr` starts with a tag byte, describing the kind of information
represented. Depending on the tag, a length indicator, further encoded
information, and/or an ending tag may follow.

    tag                          (simple atomic data and small integers)
    tag ++ binarydata            (most integers)
    tag ++ length ++ binarydata  (large integers, strings, symbols, and binary)
    tag ++ repr ++ ... ++ endtag (compound data)
    tag                          (simple atomic data)
    tag ++ length ++ binarydata  (floats, doubles, integers, strings, symbols, and binary)
    tag ++ repr ++ ... ++ endtag (compound data and annotations)

The unique end tag is byte value `0x84`.
@@ -41,7 +40,8 @@ write `varint(m)` for the varint-encoding of `m`. Quoting the

[^see-also-leb128]: Also known as [LEB128][] encoding, for unsigned
integers. Varints and LEB128-encoded integers differ only for
signed integers, which are not used in Preserves.
negative numbers, which cannot appear as length indicators and are
thus not used in Preserves.

> Each byte in a varint, except the last byte, has the most
> significant bit (msb) set – this indicates that there are further
@@ -49,13 +49,8 @@ write `varint(m)` for the varint-encoding of `m`. Quoting the
> two's complement representation of the number in groups of 7 bits,
> least significant group first.

The following table illustrates varint-encoding.

| Number, `m` | `m` in binary, grouped into 7-bit chunks | `varint(m)` bytes |
| ------ | ------------------- | ------------ |
| 15 | `0001111` | 15 |
| 300 | `0000010 0101100` | 172 2 |
| 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 128 148 235 220 3 |
For example, `varint(15)` is `[0x0F]`, and `varint(1000000000)` is `[0x80, 0x94, 0xeb, 0xdc,
0x03]`.

It is an error for a varint-encoded `m` in a `Repr` to be anything
other than the unique shortest encoding for that `m`. That is, a
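
The varint examples in this hunk can be reproduced with a short Python sketch. The function names `varint` and `read_varint` are chosen for this illustration; the reader below does not attempt to reject non-shortest encodings, which a conforming decoder would also have to do.

```python
def varint(m: int) -> bytes:
    """Encode a non-negative integer in 7-bit groups, least significant first."""
    if m < 0:
        raise ValueError("varint lengths must be non-negative")
    out = bytearray()
    while True:
        group = m & 0x7F
        m >>= 7
        if m:
            out.append(group | 0x80)  # msb set: further bytes follow
        else:
            out.append(group)  # last byte: msb clear
            return bytes(out)


def read_varint(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode a varint starting at pos; return (value, position after it)."""
    value = 0
    shift = 0
    while True:
        if pos >= len(data):
            raise ValueError("truncated varint")
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos
```

For instance, `list(varint(300))` gives `[172, 2]`, matching the table row for 300, and `read_varint` recovers `300` from those two bytes.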
@@ -80,7 +75,7 @@ serializing in some other implementation-defined order.
[bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding),
dictionary key/value pairs must be sorted by key. This is a
necessary step for ensuring serialization of `Value`s is
canonical. We do not require that key/value pairs (or set
canonical. We encourage, but do not require, that key/value pairs (or set
elements) be in sorted order for serialized `Value`s; however, a
[canonical form][canonical] for `Repr`s does exist where a sorted
ordering is required.
@@ -101,55 +96,38 @@ serializing in some other implementation-defined order.

### SignedIntegers.

    «x» when x ∈ SignedInteger = [0xB0] ++ varint(m) ++ intbytes(x)  if ¬(-3≤x≤12) ∧ m>16
                                 ([0xA0] + m - 1) ++ intbytes(x)     if ¬(-3≤x≤12) ∧ m≤16
                                 ([0xA0] + x)                        if (-3≤x≤-1)
                                 ([0x90] + x)                        if ( 0≤x≤12)
                                 where m = |intbytes(x)|

Integers in the range [-3,12] are compactly represented with tags
between `0x90` and `0x9F` because they are so frequently used.
Integers up to 16 bytes long are represented with a single-byte tag
encoding the length of the integer. Larger integers are represented
with an explicit varint length. Every `SignedInteger` *MUST* be
represented with its shortest possible encoding.
    «x» = [0xB0] ++ varint(|intbytes(x)|) ++ intbytes(x)  if x ∈ SignedInteger

The function `intbytes(x)` gives the big-endian two's-complement
binary representation of `x`, taking exactly as many whole bytes as
needed to unambiguously identify the value and its sign, and `m =
|intbytes(x)|`. The most-significant bit in the first byte in
`intbytes(x)` <!-- for `x`≠0 --> is the sign bit.[^zero-intbytes] For
example,
needed to unambiguously identify the value and its sign. The value 0
needs zero bytes to identify the value; non-zero values need at least
one byte, and the most-significant bit in the first byte is the sign
bit. For example,

    «-257» = B0 02 FE FF   «-2» = B0 01 FE       «255» = B0 02 00 FF
    «-256» = B0 02 FF 00   «-1» = B0 01 FF       «256» = B0 02 01 00
    «-255» = B0 02 FF 01    «0» = B0 00        «32767» = B0 02 7F FF
    «-129» = B0 02 FF 7F    «1» = B0 01 01     «32768» = B0 03 00 80 00
    «-128» = B0 01 80     «127» = B0 01 7F     «65535» = B0 03 00 FF FF
    «-127» = B0 01 81     «128» = B0 02 00 80  «65536» = B0 03 01 00 00

    «87112285931760246646623899502532662132736»
      = B0 12 01 00 00 00 00 00 00 00
              00 00 00 00 00 00 00 00
              00 00
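
One way to read "exactly as many whole bytes as needed" is to start from a representation that is certainly wide enough and strip leading bytes that merely repeat the sign bit. The Python sketch below takes that approach for the `B0` form; `encode_integer` is a name invented for this example, and it assumes the body is shorter than 128 bytes so that the varint length fits in one byte.

```python
def intbytes(x: int) -> bytes:
    """Shortest big-endian two's-complement representation of x."""
    if x == 0:
        return b""  # zero needs no bytes at all
    # Deliberately over-wide starting point (at least one spare byte).
    raw = x.to_bytes(x.bit_length() // 8 + 2, "big", signed=True)
    # Drop a leading 0x00 (or 0xFF) while the next byte still carries the
    # same sign bit, so the value and its sign stay unambiguous.
    while len(raw) > 1 and (
        (raw[0] == 0x00 and raw[1] < 0x80)
        or (raw[0] == 0xFF and raw[1] >= 0x80)
    ):
        raw = raw[1:]
    return raw


def encode_integer(x: int) -> bytes:
    """[0xB0] ++ varint(|intbytes(x)|) ++ intbytes(x), single-byte lengths only."""
    body = intbytes(x)
    if len(body) >= 0x80:
        # Assumption of this sketch: no multi-byte varint lengths.
        raise NotImplementedError("body too long for this sketch")
    return bytes([0xB0, len(body)]) + body
```

With these definitions, `encode_integer(-257).hex(" ")` yields `b0 02 fe ff`, in agreement with the first `B0` example above.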

    «-257» = A1 FE FF   «-3» = 9D        «128» = A1 00 80
    «-256» = A1 FF 00   «-2» = 9E        «255» = A1 00 FF
    «-255» = A1 FF 01   «-1» = 9F        «256» = A1 01 00
    «-254» = A1 FF 02    «0» = 90      «32767» = A1 7F FF
    «-129» = A1 FF 7F    «1» = 91      «32768» = A2 00 80 00
    «-128» = A0 80      «12» = 9C      «65535» = A2 00 FF FF
    «-127» = A0 81      «13» = A0 0D   «65536» = A2 01 00 00
      «-4» = A0 FC     «127» = A0 7F  «131072» = A2 02 00 00

[^zero-intbytes]: The value 0 needs zero bytes to identify the
value, so `intbytes(0)` is the empty byte string. Non-zero values
need at least one byte.

### Strings, ByteStrings and Symbols.

    «S» = [0xB1] ++ varint(|utf8(S)|) ++ utf8(S)  if S ∈ String
          [0xB2] ++ varint(|S|) ++ S              if S ∈ ByteString
          [0xB3] ++ varint(|utf8(S)|) ++ utf8(S)  if S ∈ Symbol

Syntax for these three types varies only in the tag used. For `String`
and `Symbol`, the data following the tag is a UTF-8 encoding of the
`Value`'s code points, while for `ByteString` it is the raw data
contained within the `Value` unmodified.

    «S» = [0xB1] ++ varint(|utf8(S)|) ++ utf8(S)  if S ∈ String
          [0xB2] ++ varint(|S|) ++ S              if S ∈ ByteString
          [0xB3] ++ varint(|utf8(S)|) ++ utf8(S)  if S ∈ Symbol

### Booleans.

    «#f» = [0x80]
@@ -157,39 +135,42 @@ contained within the `Value` unmodified.

### Floats and Doubles.

    «F» when F ∈ Float  = [0x82] ++ binary32(F)
    «D» when D ∈ Double = [0x83] ++ binary64(D)
    «F» = [0x87, 0x04] ++ binary32(F)  if F ∈ Float
    «D» = [0x87, 0x08] ++ binary64(D)  if D ∈ Double

The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
8-byte IEEE 754 binary representations of `F` and `D`, respectively.
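
Under the proposed `0x87` form, the length byte is what tells a `Float` from a `Double`. A small Python sketch using the standard `struct` module shows the idea; the function names are invented for this example, and Python has no separate single-precision type, so both kinds come back as a Python `float`.

```python
import struct


def encode_float(f: float) -> bytes:
    """[0x87, 0x04] ++ binary32(F): big-endian IEEE 754 single precision."""
    return bytes([0x87, 0x04]) + struct.pack(">f", f)


def encode_double(d: float) -> bytes:
    """[0x87, 0x08] ++ binary64(D): big-endian IEEE 754 double precision."""
    return bytes([0x87, 0x08]) + struct.pack(">d", d)


def decode_floating(repr_bytes: bytes) -> float:
    """Decode either form, dispatching on the length byte after the tag."""
    if len(repr_bytes) < 2 or repr_bytes[0] != 0x87:
        raise ValueError("not a Float or Double Repr")
    length, body = repr_bytes[1], repr_bytes[2:]
    if length != len(body):
        raise ValueError("length byte does not match the data")
    if length == 4:
        return struct.unpack(">f", body)[0]
    if length == 8:
        return struct.unpack(">d", body)[0]
    raise ValueError("only 4- and 8-byte bodies are described here")
```

For example, `encode_double(1.0).hex()` is `87083ff0000000000000`, and `encode_float(1.0).hex()` is `87043f800000`.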

### Embeddeds.

    «#!V» = [0x86] ++ «V»

The `Repr` of an `Embedded` is the `Repr` of a `Value` chosen to
represent the denoted object, prefixed with `[0x86]`.

    «#!V» = [0x86] ++ «V»

### Annotations.

To annotate a `Repr` `r` with some `Value` `v`, prepend `r` with
`[0x85] ++ «v»`. For example, the `Repr` corresponding to textual
syntax `@a@b[]`, i.e. an empty sequence annotated with two symbols,
`a` and `b`, is
    «@V_1...@V_n V» = [0xBF] ++ «V» ++ «V_1» ++...++ «V_n» ++ [0x84]

    «@a @b []»
      = [0x85] ++ «a» ++ [0x85] ++ «b» ++ «[]»
      = [0x85, 0xB3, 0x01, 0x61, 0x85, 0xB3, 0x01, 0x62, 0xB5, 0x84]
`V` *MUST NOT* itself be annotated, but `V_1...V_n` *MAY* be
annotated. For example, the `Repr` corresponding to textual syntax
`@a@b[]`, i.e. an empty sequence annotated with two symbols, `a` and
`b`, is

    «@a @b []» = [0xBF] ++ «[]» ++ «a» ++ «b» ++ [0x84]
               = [0xBF, 0xB5, 0x84, 0xB3, 0x01, 0x61, 0xB3, 0x01, 0x62, 0x84]

Implementations *SHOULD* default to omitting annotations from binary `Repr`s.
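
The proposed `0xBF` layout, in which the annotated `Repr` comes first and its annotations follow up to the end tag, can be sketched in Python as below. The helpers `symbol` and `annotate` are hypothetical names for this illustration. The sketch assumes symbol names shorter than 128 bytes, and returning the bare `Repr` when there are no annotations is an assumption of the sketch rather than something the text states.

```python
END_TAG = 0x84


def symbol(name: str) -> bytes:
    """[0xB3] ++ varint(|utf8(S)|) ++ utf8(S), single-byte lengths only."""
    data = name.encode("utf-8")
    if len(data) >= 0x80:
        raise NotImplementedError("sketch covers single-byte varint lengths only")
    return bytes([0xB3, len(data)]) + data


def annotate(value_repr: bytes, annotation_reprs: list[bytes]) -> bytes:
    """[0xBF] ++ «V» ++ «V_1» ++ ... ++ «V_n» ++ [0x84]"""
    if value_repr[:1] == b"\xbf":
        # V MUST NOT itself be annotated.
        raise ValueError("the annotated Repr must not itself start with BF")
    if not annotation_reprs:
        return value_repr  # assumption: nothing to wrap when n = 0
    return b"\xbf" + value_repr + b"".join(annotation_reprs) + bytes([END_TAG])


empty_sequence = bytes([0xB5, 0x84])
annotated = annotate(empty_sequence, [symbol("a"), symbol("b")])
```

Here `annotated.hex(" ")` is `bf b5 84 b3 01 61 b3 01 62 84`, the same bytes as the `@a @b []` example. The annotations themselves may be annotated `Repr`s, since only the leading `V` is restricted.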

## Security Considerations

**Annotations.** In modes where a `Value` is being read while
annotations are skipped, an endless sequence of annotations may give an
annotations are skipped, an endless nesting of annotations may give an
illusion of progress.

**Canonical form for cryptographic hashing and signing.** No canonical
textual encoding of a `Value` is specified. A
[canonical form][canonical] exists for binary encoded `Value`s, and
*textual* encoding of a `Value` is specified. However, a [canonical
form][canonical] exists for binary encoded `Value`s, and
implementations *SHOULD* produce canonical binary encodings by
default; however, an implementation *MAY* permit two serializations of
the same `Value` to yield different binary `Repr`s.
@@ -215,25 +196,29 @@ a binary-syntax document; otherwise, it should be interpreted as text.

     80 - False
     81 - True
     82 - Float
     83 - Double
    (82) RESERVED
    (83) RESERVED
     84 - End marker
     85 - Annotation
    (85) RESERVED
     86 - Embedded
    (8x) RESERVED 87-8F
     87 - Float and Double
    (8x) RESERVED 88-8F

     9x - Small integers 0..12,-3..-1
     An - Medium integers, (n+1) bytes long
     B0 - Large integers, variable length
    (9x) RESERVED
    (Ax) RESERVED

     B0 - Integer
     B1 - String
     B2 - ByteString
     B3 - Symbol

     B4 - Record
     B5 - Sequence
     B6 - Set
     B7 - Dictionary

    (Bx) RESERVED B8-BE
     BF - Annotated Repr (not itself starting with BF) followed by annotations

## Appendix. Binary SignedInteger representation

Languages that provide fixed-width machine word types may find the
@@ -242,15 +227,14 @@ values.

| Integer range | Bytes required | Encoding (hex) |
| --- | --- | --- |
| -3 ≤ n ≤ 12 | 1 | `9X` |
| -2<sup>7</sup> ≤ n < 2<sup>7</sup> (i8) | 2 | `A0` `XX` |
| -2<sup>15</sup> ≤ n < 2<sup>15</sup> (i16) | 3 | `A1` `XX` `XX` |
| -2<sup>23</sup> ≤ n < 2<sup>23</sup> (i24) | 4 | `A2` `XX` `XX` `XX` |
| -2<sup>31</sup> ≤ n < 2<sup>31</sup> (i32) | 5 | `A3` `XX` `XX` `XX` `XX` |
| -2<sup>39</sup> ≤ n < 2<sup>39</sup> (i40) | 6 | `A4` `XX` `XX` `XX` `XX` `XX` |
| -2<sup>47</sup> ≤ n < 2<sup>47</sup> (i48) | 7 | `A5` `XX` `XX` `XX` `XX` `XX` `XX` |
| -2<sup>55</sup> ≤ n < 2<sup>55</sup> (i56) | 8 | `A6` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
| -2<sup>63</sup> ≤ n < 2<sup>63</sup> (i64) | 9 | `A7` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
| -2<sup>7</sup> ≤ n < 2<sup>7</sup> (i8) | 3 | `B0` `01` `XX` |
| -2<sup>15</sup> ≤ n < 2<sup>15</sup> (i16) | 4 | `B0` `02` `XX` `XX` |
| -2<sup>23</sup> ≤ n < 2<sup>23</sup> (i24) | 5 | `B0` `03` `XX` `XX` `XX` |
| -2<sup>31</sup> ≤ n < 2<sup>31</sup> (i32) | 6 | `B0` `04` `XX` `XX` `XX` `XX` |
| -2<sup>39</sup> ≤ n < 2<sup>39</sup> (i40) | 7 | `B0` `05` `XX` `XX` `XX` `XX` `XX` |
| -2<sup>47</sup> ≤ n < 2<sup>47</sup> (i48) | 8 | `B0` `06` `XX` `XX` `XX` `XX` `XX` `XX` |
| -2<sup>55</sup> ≤ n < 2<sup>55</sup> (i56) | 9 | `B0` `07` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
| -2<sup>63</sup> ≤ n < 2<sup>63</sup> (i64) | 10 | `B0` `08` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
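
For the `B0` rows of the table, a reader targeting a 64-bit machine word might look like the Python sketch below. The name `decode_i64` and the choice of `OverflowError` for wider values are assumptions of this example; it also does not check that the body is the shortest possible encoding.

```python
def decode_i64(repr_bytes: bytes) -> int:
    """Decode a B0-tagged SignedInteger whose body fits in 8 bytes."""
    if len(repr_bytes) < 2 or repr_bytes[0] != 0xB0:
        raise ValueError("not a SignedInteger Repr")
    length = repr_bytes[1]
    if length & 0x80:
        # A multi-byte varint length implies a body of 128 bytes or more.
        raise OverflowError("integer does not fit in 64 bits")
    body = repr_bytes[2:]
    if len(body) != length:
        raise ValueError("length byte does not match the data")
    if length > 8:
        raise OverflowError("integer does not fit in 64 bits")
    if length == 0:
        return 0  # intbytes(0) is the empty byte string
    # The most-significant bit of the first byte is the sign bit.
    return int.from_bytes(body, "big", signed=True)
```

As a usage example, `decode_i64(bytes.fromhex("B00180"))` returns `-128`, a three-byte `Repr` as listed in the i8 row.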

<!-- Heading to visually offset the footnotes from the main document: -->
## Notes