18 KiB
no_site_title | title |
---|---|
true | Preserves: Binary Syntax |
Tony Garnock-Jones tonyg@leastfixedpoint.com
{{ site.version_date }}. Version {{ site.version }}.
Preserves is a data model, with associated serialization formats. This
document defines one of those formats: a binary syntax for Value
s from
the Preserves data model that is easy for computer
software to read and write. An equivalent human-readable text
syntax also exists.
Machine-Oriented Binary Syntax
A Repr
is a binary-syntax encoding, or representation, of a Value
.
Type and Length representation.
Each Repr
starts with a tag byte, describing the kind of information
represented. The expected length of the Repr
is always available
from the surrounding context: either from a containing encoded value, or
from the overall container of the data, which could be a file, an HTTP
message, a UDP packet, etc.
Atomic Values.
Booleans. The "false" boolean's Repr
is just tag 0xA0
; "true" is
0xA1
.
Floats and Doubles. Both Float
and Double
values are represented
as tag 0xA2
followed by big-endian 4- or 8-byte IEEE 754 binary
representations of the values, respectively.
SignedIntegers. A SignedInteger
encodes as tag 0xA3
followed by
a big-endian two's-complement binary representation of the value, taking
at least as many whole bytes as needed to unambiguously identify the
value and its sign. Zero may be represented as the tag alone, with no
following bytes. The most-significant bit in the first byte after the
tag is the sign bit.1 The shortest possible encoding
SHOULD be used.2
Strings. A String
encodes as tag 0xA4
followed by the UTF-8
encoding of the string, with an additional trailing NUL
(0) byte. The
NUL
byte MUST NOT be treated as part of the String
: it exists to
permit zero-copy C interoperability.3
ByteStrings. A ByteString
encodes as tag 0xA5
followed by the
bytes themselves.
Symbols. A Symbol
encodes as tag 0xA6
followed by the UTF-8
encoding of the symbol's code points.
Compound Values.
Repr
s for Compound
values store the lengths of their contained
values. Each contained Value
is converted to a Repr
and stored as
the length of the Repr
in bytes followed by the Repr
itself.
Implementations use each stored length to decide when to stop reading
the associated Repr
. Similarly, no sentinel marks the end of a
sequence of length-prefixed Repr
s. Implementations use the length of
the containing Repr
, known from the surrounding context, to decide
when to stop expecting more contained Repr
s.
Each length is stored as an argdata-compatible
big-endian base 128 varint.4 Each byte of a varint
stores seven bits of the length. All bytes have a clear upper bit,
except the final byte, which has the upper bit set. Implementations
SHOULD use the shortest encoding for a varint, and MUST NOT produce
an encoded varint with more than nine leading 0
bytes.5 6
Records. A Record
is encoded as tag 0xA7
followed by the
length-prefixed encodings of its label and fields.
Sequences. A Sequence
is encoded as tag 0xA8
followed by the
length-prefixed encodings of its members.
Sets. A Set
is encoded like a Sequence
, but with tag 0xA9
, and
in some arbitrary order.
Dictionaries. A Dictionary
encodes as tag 0xAA
followed by the
length-prefixed keys and values, in an alternating key/value sequence.
There is no ordering requirement on the elements of sets or the key/value pairs of
dictionaries.7 However, elements of sets and keys in dictionaries MUST
be pairwise distinct. In addition, implementations SHOULD default to writing set elements in
order sorted lexicographically by their Repr
s and SHOULD default to writing dictionary
key/value pairs in order sorted lexicographically by the Repr
s of their
keys8. Implementations MAY offer the option of serializing in some
other implementation-defined order.
Embedded Values.
Embedded values are encoded as tag 0xAB
followed by the encoding of
some Value
chosen to represent the denoted embedded object.
Annotations.
The encoding of a sequence of annotations for a Repr
uses tag 0xBF
,
followed by the length-prefixed Repr
, followed by the length-prefixed
encoded annotations, in order. The Repr
MUST NOT already have
annotations (must not begin with 0xBF
), and there MUST be at least
one Value
in the sequence following the Repr
.
Examples (normative)
We write «v»
for the Repr
of some Value
v
, and varint(|«v»|)
for
the varint-encoded length of the Repr
of v
.
Varints (length representations).
The following table illustrates varint-encoding.
Number, m |
m in binary, grouped into 7-bit chunks |
varint(m) bytes |
---|---|---|
15 | 0001111 |
143 |
300 | 0000010 0101100 |
2 172 |
1000000000 | 0000011 1011100 1101011 0010100 0000000 |
3 92 107 20 128 |
Atoms.
«#f» = A0 «#t» = A1
«0.123f» = A2 3D FB E7 6D «0.123» = A2 3F BF 7C ED 91 68 72 B0
«-257» = A3 FE FF «-3» = A3 FD «128» = A3 00 80
«-256» = A3 FF 00 «-2» = A3 FE «255» = A3 00 FF
«-255» = A3 FF 01 «-1» = A3 FF «256» = A3 01 00
«-254» = A3 FF 02 «0» = A3 «32767» = A3 7F FF
«-129» = A3 FF 7F «1» = A3 01 «32768» = A3 00 80 00
«-128» = A3 80 «12» = A3 0C «65535» = A3 00 FF FF
«-127» = A3 81 «13» = A3 0D «65536» = A3 01 00 00
«-4» = A3 FC «127» = A3 7F «131072» = A3 02 00 00
«87112285931760246646623899502532662132736»
= A3 01 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
00 00
«""» = A4 00 «||» = A6
«"a"» = A4 61 00 «|a|» = A6 61
«"hello"» = A4 68 65 6C 6C 6F 00 «|hello|» = A6 68 65 6C 6C 6F
«#[]» = A5
«#[AQ==]» = A5 01
«#[ATAyMDMwNDA1]» = A5 01 02 03 04 05
Compounds.
«<window 100 120 500 300>» = A7 87 A6 77696E646F77
82 A3 64
82 A3 78
83 A3 01F4
83 A3 012C
«["zzzz(...192 zs...)zzzz"]»
(a length-1 sequence containing a length-200 string)
= A8 01CA A4 7A7A7A7A (... 192 repetitions of 7A ...)
7A7A7A7A 00
«[H, He, Li, Be, B, C, N, O, F, Ne]»
= A8 82A648 83A64865 83A64C69 83A64265
82A642 82A643 82A64E 82A64F 82A646 83A64E65
«#{H He Li Be B C N O F Ne}»
= A9 82A642 (B)
83A64265 (Be)
82A643 (C)
82A646 (F)
82A648 (H)
83A64865 (He)
83A64C69 (Li)
82A64E (N)
83A64E65 (Ne)
82A64F (O)
«{H: 1.0080f, He: 4.0026f, Li: 6.94f, Be: 9.0122f,
B: 10.81f, C: 12.011f, N: 14.007f, O: 15.999f,
F: 18.998f, Ne: 20.180f}»
= AA 82A642 85A2412CF5C3 (B: 10.81f)
83A64265 85A2411031F9 (Be: 9.0122f)
82A643 85A241402D0E (C: 12.011f)
82A646 85A24197FBE7 (F: 18.998f)
82A648 85A23F810625 (H: 1.0080f)
83A64865 85A24080154D (He: 4.0026f)
83A64C69 85A240DE147B (Li: 6.94f)
82A64E 85A241601CAC (N: 14.007f)
83A64E65 85A241A170A4 (Ne: 20.180f)
82A64F 85A2417FFBE7 (O: 15.999f)
«[[H 1.0080f] [He 4.0026f] [Li 6.94f] [Be 9.0122f]
[B 10.81f] [C 12.011f] [N 14.007f] [O 15.999f]
[F 18.998f] [Ne 20.180f]]»
= A8 8A A8 82A648 85A23F810625 ([H 1.0080f])
8B A8 83A64865 85A24080154D ([He 4.0026f])
8B A8 83A64C69 85A240DE147B ([Li 6.94f])
8B A8 83A64265 85A2411031F9 ([Be 9.0122f])
8A A8 82A642 85A2412CF5C3 ([B 10.81f])
8A A8 82A643 85A241402D0E ([C 12.011f])
8A A8 82A64E 85A241601CAC ([N 14.007f])
8A A8 82A64F 85A2417FFBE7 ([O 15.999f])
8A A8 82A646 85A24197FBE7 ([F 18.998f])
8B A8 83A64E65 85A241A170A4 ([Ne 20.180f])
Annotations.
The Repr
corresponding to textual syntax @a@b[]
, i.e. an empty sequence annotated with two
symbols, a
and b
, is
«@a @b []»
= [0xBF] ++ varint(|«[]»|) ++ «[]»
++ varint(|«a»|) ++ «a»
++ varint(|«b»|) ++ «b»
= [0xBF, 0x81, 0xA8, 0x82, 0xA6, 0x61, 0x82, 0xA6, 0x62]
Security Considerations
Overlong varints. The binary format allows (but discourages)
overlong varints. Because every Repr
has a bound on its
length from its surrounding context, this is not a denial-of-service
vector per se; however, implementations may wish to consider optional
restrictions on the number of redundant leading 0
bytes accepted when
reading a varint.
Overlong SignedIntegers. Similarly, implementations may wish to
consider optional restrictions on the number of redundant leading 0xFF
or 0x00
bytes accepted when reading a SignedInteger
.
Canonical form for cryptographic hashing and signing. No canonical
textual encoding of a Value
is specified. A
canonical form exists for binary encoded Value
s, and
implementations SHOULD produce canonical binary encodings by
default; however, an implementation MAY permit two serializations of
the same Value
to yield different binary Repr
s.
Acknowledgements
The exclusion of lengths from Repr
s, placing lengths instead ahead of
contained values in sequences, is inspired by argdata, as is the
inclusion of a NUL
byte in String
Repr
s for C interoperability.
Appendix. Summary of syntax
{% include cheatsheet-binary.md %}
Appendix. Autodetection of textual or binary syntax
Every tag byte in a binary Preserves Repr
falls within the range
[0x80
, 0xBF
]. These bytes, interpreted as UTF-8, are continuation
bytes, and will never occur as the first byte of a UTF-8 encoded code
point. This means no binary-encoded Repr
can be misinterpreted as
valid UTF-8.
Conversely, a UTF-8 Document
must start with a valid codepoint,
meaning in particular that it must not start with a byte in the range
[0x80
, 0xBF
]. This means that no UTF-8 encoded textual-syntax
Preserves Document
can be misinterpreted as a binary-syntax Repr
.
Examination of the top two bits of the first byte of an encoded Value
gives its syntax: if the top two bits are 10
, it should be interpreted
as a binary-syntax Repr
; otherwise, it should be interpreted as text.
Streaming. Autodetection is still possible when streaming an
undetermined number of Value
s across, say, a TCP/IP connection:
-
If the text syntax is to be used for the connection, simply start writing each
Document
one after the other. Documents forAtom
s are in general ambiguous if not separated from their neighbours by whitespace; whitespace SHOULD be used to separate adjacent documents. Specifically, whitespace separating adjacent documents SHOULD be ASCII newline (10). -
If the binary syntax is to be used for the connection, start the connection with byte
0xA8
(sequence). After the initial byte, send each valuev
asvarint(|«v»|) ++ «v»
. A side effect of this approach is that the entire stream, when complete, is a validSequence
Repr
.
Appendix. Table of tag values
(8x) RESERVED 80-8F
(9x) RESERVED 90-9F
A0 - False
A1 - True
A2 - Float or Double (length disambiguates)
A3 - SignedIntegers (0 may be encoded with no bytes at all)
A4 - String (a trailing NUL is added)
A5 - ByteString
A6 - Symbol
A7 - Record
A8 - Sequence
A9 - Set
AA - Dictionary
AB - Embedded
(Ax) RESERVED AC-AF
(Bx) RESERVED B0-BE
BF - Annotations. {BF Lval val Lann0 ann0 Lann1 ann1 ...}
Appendix. Binary SignedInteger representation
Languages that provide fixed-width machine word types may find the
following table useful in encoding and decoding binary SignedInteger
values.
Integer range | Bytes required | Encoding (hex) |
---|---|---|
0 | 1 | A3 |
-27 ≤ n < 27 (i8) | 2 | A3 XX |
-215 ≤ n < 215 (i16) | 3 | A3 XX XX |
-223 ≤ n < 223 (i24) | 4 | A3 XX XX XX |
-231 ≤ n < 231 (i32) | 5 | A3 XX XX XX XX |
-239 ≤ n < 239 (i40) | 6 | A3 XX XX XX XX XX |
-247 ≤ n < 247 (i48) | 7 | A3 XX XX XX XX XX XX |
-255 ≤ n < 255 (i56) | 8 | A3 XX XX XX XX XX XX XX |
-263 ≤ n < 263 (i64) | 9 | A3 XX XX XX XX XX XX XX XX |
Notes
-
The value 0 needs zero bytes to identify the value, so
intbytes(0)
can be the empty byte string. Non-zero values need at least one byte. ↩︎ -
Implementation note. The spec permits overlong
SignedInteger
encodings to allow e.g. construction ofRepr
s by filling in partially-completed templates, which can be useful in resource-constrained situations. ↩︎ -
Some care must still be taken when passing
String
Repr
s directly to a C-style ABI, sinceString
s may contain the zero Unicode code point, which C library routines will usually misinterpret as an end-of-string marker. ↩︎ -
Argdata's length representation is very close to Variable-length quantity (VLQ) encoding, differing only in the flipped interpretation of the high bit of each byte. It is big-endian, unlike LEB128 encoding (as used by Google in protobufs). ↩︎
-
Implementation note. The spec permits overlong length encodings to reduce wasted activity in resource-constrained situations. If an implementation is in anything other than a very low-level language, it is likely to be able to use IOList-style data structures to avoid unnecessary copying. ↩︎
-
Nine leading zero bytes, plus one non-zero byte, equals ten bytes in total. Each byte of varint yields 7 bits of usable length indicator, so ten bytes gives 70 bits, while nine would only give 63, not quite enough for a 64-bit value. Of course, it may be some time before an encoder legitimately needs to use a 64-bit length indicator, let alone in a resource-constrained situation. ↩︎
-
In the BitTorrent encoding format, bencoding, dictionary key/value pairs must be sorted by key. This is a necessary step for ensuring serialization of
Value
s is canonical. We do not require that key/value pairs (or set elements) be in sorted order for serializedValue
s; however, a canonical form forRepr
s does exist where a sorted ordering is required. ↩︎ -
It's important to note that the sort ordering for writing out set elements and dictionary key/value pairs is not the same as the sort ordering implied by the semantic ordering of those elements or keys. For example, the
Repr
of a negative number very far from zero will start with a byte that is greater than the byte which starts theRepr
of zero, making it sort lexicographically later byRepr
, despite being semantically less than zero.Rationale. This is for ease-of-implementation reasons: not all languages can easily represent sorted sets or sorted dictionaries, but encoding and then sorting byte strings is much more likely to be within easy reach. ↩︎