Preserves: Binary Syntax
Tony Garnock-Jones tonyg@leastfixedpoint.com
{{ site.version_date }}. Version {{ site.version }}.
Preserves is a data model, with associated serialization formats. This
document defines one of those formats: a binary syntax for Values from
the Preserves data model that is easy for computer software to read and
write. An equivalent human-readable text syntax also exists.
Machine-Oriented Binary Syntax
A Repr is a binary-syntax encoding, or representation, of a Value.
For a value v, we write «v» for the Repr of v.
Type and Length representation.
Each Repr starts with a tag byte, describing the kind of information
represented.

However, inspired by argdata, a Repr does not describe its own length.
Instead, the expected length of the Repr is always available from the
surrounding context: either from a containing encoded value, or from
the overall container of the data, which could be a file, an HTTP
message, a UDP packet, etc.

As a consequence, Reprs for Compound values store the lengths of their
contained values. Each contained Value is represented as a length in
bytes followed by its own Repr. Implementations use each stored length
to decide when to stop reading the following Repr.
Each length is stored as an argdata-compatible big-endian base 128
varint.[1] Each byte of a varint stores seven bits of the length. All
bytes have a clear upper bit, except the final byte, which has the
upper bit set. We write len(m) for the varint-encoding of a
non-negative integer m, defined recursively as follows:
  len(m) = e(m, 128)
    where e(v, d) = [v + d]                          if v < 128
                    e(v / 128, 0) ++ [(v % 128) + d] if v ≥ 128
We write len(|r|) for the varint-encoding of the length of Repr r.

There is no requirement that a varint-encoded m in a Repr be the unique
shortest encoding for that m.[2] However, implementations SHOULD use
the shortest encoding wherever possible when writing, and MAY reject
encodings with more than eight leading 0 bytes when reading encoded
values.
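
As a minimal sketch of the definition above, the following Python functions compute the shortest len(m) and read a varint back. The names encode_varint and decode_varint are illustrative and not part of this specification, and v / 128 in the definition is taken to mean integer division.

```python
def encode_varint(m: int) -> bytes:
    """Shortest len(m): big-endian base-128 digits, high bit set on the final byte only."""
    if m < 0:
        raise ValueError("lengths are non-negative")
    out = [(m % 128) + 128]      # final byte carries the terminating high bit
    m //= 128
    while m > 0:
        out.append(m % 128)      # earlier bytes keep the high bit clear
        m //= 128
    return bytes(reversed(out))


def decode_varint(data: bytes, offset: int = 0) -> tuple[int, int]:
    """Return (value, offset just past the varint). Leading 0 bytes are tolerated."""
    value = 0
    while True:
        if offset >= len(data):
            raise ValueError("truncated varint")
        byte = data[offset]
        offset += 1
        value = value * 128 + (byte & 0x7F)
        if byte & 0x80:          # high bit set: this was the final byte
            return value, offset
```

For instance, encode_varint(300) yields the bytes 2 172, matching the table in the Examples section, and decode_varint accepts the overlong form 0 2 172 for the same number.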
Records, Sequences, Sets and Dictionaries.
«<L F_1...F_m>» = [0xA7] ++ seq(«L», «F_1», ..., «F_m»)
«[X_1...X_m]» = [0xA8] ++ seq(«X_1», ..., «X_m»)
«#{E_1...E_m}» = [0xA9] ++ seq(«E_1», ..., «E_m»)
«{K_1:V_1...K_m:V_m}» = [0xAA] ++ seq(«K_1», «V_1», ..., «K_m», «V_m»)
seq(R_1, ..., R_m) = len(|R_1|) ++ R_1 ++...++ len(|R_m|) ++ R_m
There is no ordering requirement on the E_i elements or K_i/V_i
pairs.[3] They may appear in any order. However, the E_i and K_i MUST
be pairwise distinct. In addition, implementations SHOULD default to
writing set elements and dictionary key/value pairs in order sorted
lexicographically by their Reprs,[4] and MAY offer the option of
serializing in some other implementation-defined order.
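
The default ordering can be sketched in Python as follows. This is an illustration under stated assumptions rather than a reference implementation: the function names are hypothetical, the inputs are already-encoded Reprs, and dictionary pairs are assumed to be ordered by the Repr of the key.

```python
def encode_varint(m: int) -> bytes:
    """Shortest varint len(m); hypothetical helper, repeated so the sketch stands alone."""
    out = [(m % 128) + 128]
    m //= 128
    while m > 0:
        out.append(m % 128)
        m //= 128
    return bytes(reversed(out))


def seq(reprs) -> bytes:
    """seq(R_1, ..., R_m): each Repr preceded by its varint-encoded length."""
    return b"".join(encode_varint(len(r)) + r for r in reprs)


def encode_set(element_reprs: list[bytes]) -> bytes:
    # Byte-identical Reprs certainly denote equal Values, so reject them.
    # This check cannot catch equal Values that happen to have differing Reprs;
    # the caller is assumed to have ensured the Values themselves are distinct.
    if len(set(element_reprs)) != len(element_reprs):
        raise ValueError("set elements must be pairwise distinct")
    return b"\xA9" + seq(sorted(element_reprs))


def encode_dictionary(pairs: list[tuple[bytes, bytes]]) -> bytes:
    keys = [k for k, _ in pairs]
    if len(set(keys)) != len(keys):
        raise ValueError("dictionary keys must be pairwise distinct")
    # Assumption: pairs are ordered by the Repr of their key.
    ordered = sorted(pairs, key=lambda kv: kv[0])
    return b"\xAA" + seq(r for kv in ordered for r in kv)
```

Given the Reprs of -1, 0 and 1 (A3 FF, A3 and A3 01), encode_set writes the elements in the order 0, 1, -1, because bytes are compared rather than numeric values.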
No sentinel marks the end of a sequence of length-prefixed Reprs.
During decoding, use the length of the containing Repr to decide when
to stop expecting more contained Reprs.
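
A decoder can therefore loop until the container's bytes are exhausted. The sketch below assumes the caller has already removed the compound tag byte and holds exactly the remaining bytes of the containing Repr; split_contained is a hypothetical name.

```python
def split_contained(body: bytes) -> list[bytes]:
    """Split the bytes following a compound tag into the contained Reprs."""
    items = []
    offset = 0
    while offset < len(body):            # the container's length is the only stop signal
        length = 0
        while True:                      # read one varint-encoded length
            if offset >= len(body):
                raise ValueError("truncated length")
            byte = body[offset]
            offset += 1
            length = length * 128 + (byte & 0x7F)
            if byte & 0x80:              # high bit marks the final varint byte
                break
        if offset + length > len(body):
            raise ValueError("contained Repr overruns its container")
        items.append(body[offset:offset + length])
        offset += length
    return items
```

Applied to the body of the Sequence Repr A8 82 A3 01 82 A3 02, this returns the two contained Reprs A3 01 and A3 02.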
SignedIntegers.
«x» when x ∈ SignedInteger = [0xA3] ++ intbytes(x)
The function intbytes(x) gives the big-endian two's-complement binary
representation of x, taking exactly as many whole bytes as needed to
unambiguously identify the value and its sign. As a special case,
intbytes(0) is the empty byte sequence. The most-significant bit in the
first byte in intbytes(x) (for x ≠ 0) is the sign bit.[5] Every
SignedInteger MUST be represented with its shortest possible encoding.
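
A possible Python rendering of intbytes, together with a decoder that enforces the shortest-encoding rule, is sketched below. The function names other than intbytes are illustrative only.

```python
def intbytes(x: int) -> bytes:
    """Shortest big-endian two's-complement bytes for x; empty for 0."""
    if x == 0:
        return b""
    # Magnitude bits plus one sign bit, rounded up to whole bytes.
    bits = (x if x > 0 else ~x).bit_length() + 1
    return x.to_bytes((bits + 7) // 8, "big", signed=True)


def encode_signed_integer(x: int) -> bytes:
    return b"\xA3" + intbytes(x)


def decode_signed_integer(repr_: bytes) -> int:
    if repr_[:1] != b"\xA3":
        raise ValueError("not a SignedInteger Repr")
    payload = repr_[1:]
    value = int.from_bytes(payload, "big", signed=True)  # empty payload gives 0
    if intbytes(value) != payload:
        raise ValueError("SignedInteger is not in its shortest encoding")
    return value
```

For example, encode_signed_integer(-257) gives A3 FE FF and encode_signed_integer(128) gives A3 00 80, as listed in the Examples section.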
Strings, ByteStrings and Symbols.
«S» = [0xA4] ++ utf8(S) ++ [0]   if S ∈ String
      [0xA5] ++ S                if S ∈ ByteString
      [0xA6] ++ utf8(S)          if S ∈ Symbol
For String and Symbol, the data following the tag is a UTF-8 encoding
of the Value's code points, while for ByteString it is the raw data
contained within the Value unmodified.

Each String has a trailing zero byte appended. This extra byte MUST NOT
be treated as part of the Value: it exists to permit zero-copy C
interoperability.[6]
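
The three encodings, and the handling of the trailing zero byte on decode, might look like this in Python. The function names are hypothetical; only the tag bytes and the trailing zero come from the definitions above.

```python
def encode_string(s: str) -> bytes:
    # The trailing zero byte is framing, not part of the String's value.
    return b"\xA4" + s.encode("utf-8") + b"\x00"


def encode_bytestring(b: bytes) -> bytes:
    return b"\xA5" + b


def encode_symbol(name: str) -> bytes:
    return b"\xA6" + name.encode("utf-8")


def decode_string(repr_: bytes) -> str:
    if len(repr_) < 2 or repr_[0] != 0xA4 or repr_[-1] != 0:
        raise ValueError("not a String Repr")
    # Strip the tag and exactly one trailing zero; interior zeros are kept.
    return repr_[1:-1].decode("utf-8")
```

Note that decode_string removes only the final byte, so a String containing the zero code point round-trips unchanged.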
Booleans.
«#f» = [0xA0]
«#t» = [0xA1]
Floats and Doubles.
«F» when F ∈ Float = [0xA2] ++ binary32(F)
«D» when D ∈ Double = [0xA2] ++ binary64(D)
The functions binary32(F) and binary64(D) yield big-endian 4- and
8-byte IEEE 754 binary representations of F and D, respectively.
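
Because both kinds share the tag 0xA2, a reader has to look at the payload length. A small Python sketch using the standard struct module follows; the function names are illustrative.

```python
import struct


def encode_float(f: float) -> bytes:
    return b"\xA2" + struct.pack(">f", f)    # big-endian IEEE 754 binary32


def encode_double(d: float) -> bytes:
    return b"\xA2" + struct.pack(">d", d)    # big-endian IEEE 754 binary64


def decode_float_or_double(repr_: bytes) -> float:
    if repr_[:1] != b"\xA2":
        raise ValueError("not a Float or Double Repr")
    payload = repr_[1:]
    if len(payload) == 4:                    # payload length disambiguates
        return struct.unpack(">f", payload)[0]
    if len(payload) == 8:
        return struct.unpack(">d", payload)[0]
    raise ValueError("Float/Double payload must be 4 or 8 bytes")
```

Python has a single float type, so this sketch returns the same type for both; an implementation that distinguishes Float from Double would need to keep track of which payload length it saw.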
Embeddeds.
The Repr of an Embedded is the Repr of a Value chosen to represent the
denoted object, prefixed with [0xAB].
«#!V» = [0xAB] ++ «V»
Annotations.
To annotate a Repr r with some sequence of Values [v_1, ..., v_m],
surround r as follows:
[0xBF] ++ len(|r|) ++ r ++ len(|«v_1»|) ++ «v_1» ++...++ len(|«v_m»|) ++ «v_m»
The Repr r MUST NOT already have annotations; that is, it must not
begin with 0xBF. The sequence [v_1, ..., v_m] MUST contain at least one
Value.
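
The wrapping rule and its two constraints can be sketched in Python as below. The name annotate is hypothetical, and the inputs are assumed to be already-encoded Reprs.

```python
def encode_varint(m: int) -> bytes:
    """Shortest varint len(m); hypothetical helper, repeated so the sketch stands alone."""
    out = [(m % 128) + 128]
    m //= 128
    while m > 0:
        out.append(m % 128)
        m //= 128
    return bytes(reversed(out))


def annotate(r: bytes, annotation_reprs: list[bytes]) -> bytes:
    if not annotation_reprs:
        raise ValueError("at least one annotation is required")
    if r[:1] == b"\xBF":
        raise ValueError("the annotated Repr must not itself be annotated")
    out = b"\xBF" + encode_varint(len(r)) + r
    for a in annotation_reprs:
        out += encode_varint(len(a)) + a     # each annotation is length-prefixed
    return out
```

Calling annotate with the Repr of an empty sequence (A8) and the Reprs of the symbols a and b (A6 61 and A6 62) reproduces the bytes BF 81 A8 82 A6 61 82 A6 62 given in the Examples section.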
Examples
Varints (length representations).
The following table illustrates varint-encoding.
Number, m | m in binary, grouped into 7-bit chunks | len(m) bytes |
---|---|---|
15 | 0001111 | 143 |
300 | 0000010 0101100 | 2 172 |
1000000000 | 0000011 1011100 1101011 0010100 0000000 | 3 92 107 20 128 |
SignedIntegers.
«87112285931760246646623899502532662132736»
= A3 01 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
00 00
«-257» = A3 FE FF «-3» = A3 FD «128» = A3 00 80
«-256» = A3 FF 00 «-2» = A3 FE «255» = A3 00 FF
«-255» = A3 FF 01 «-1» = A3 FF «256» = A3 01 00
«-254» = A3 FF 02 «0» = A3 «32767» = A3 7F FF
«-129» = A3 FF 7F «1» = A3 01 «32768» = A3 00 80 00
«-128» = A3 80 «12» = A3 0C «65535» = A3 00 FF FF
«-127» = A3 81 «13» = A3 0D «65536» = A3 01 00 00
«-4» = A3 FC «127» = A3 7F «131072» = A3 02 00 00
Annotations.
The Repr corresponding to textual syntax @a@b[], i.e. an empty sequence
annotated with two symbols, a and b, is
«@a @b []»
= [0xBF] ++ len(|«[]»|) ++ «[]» ++ len(|«a»|) ++ «a» ++ len(|«b»|) ++ «b»
= [0xBF, 0x81, 0xA8, 0x82, 0xA6, 0x61, 0x82, 0xA6, 0x62]
Security Considerations
Annotations. In modes where a Value is being read while annotations
are skipped, an endless sequence of annotations may give an illusion of
progress.

Overlong varints. The binary format allows (but discourages) overlong
varints. Because every Repr has a bound on its length from its
surrounding context, this is not a denial-of-service vector per se;
however, implementations may wish to consider optional restrictions on
the number of redundant leading 0 bytes accepted when reading a varint.
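
One way such a restriction could be applied is sketched below in Python. The default limit of eight matches the figure given earlier for encodings a reader MAY reject; the function and parameter names are hypothetical.

```python
def decode_varint_limited(data: bytes, offset: int = 0,
                          max_leading_zeros: int = 8) -> tuple[int, int]:
    """Decode a varint, rejecting more than max_leading_zeros redundant 0 bytes."""
    value = 0
    leading_zeros = 0
    seen_nonzero = False
    while True:
        if offset >= len(data):
            raise ValueError("truncated varint")
        byte = data[offset]
        offset += 1
        if byte == 0 and not seen_nonzero:
            leading_zeros += 1               # redundant padding before any digit
            if leading_zeros > max_leading_zeros:
                raise ValueError("too many leading 0 bytes in varint")
        else:
            seen_nonzero = True
        value = value * 128 + (byte & 0x7F)
        if byte & 0x80:                      # final byte of the varint
            return value, offset
```

With the default limit, eight 0 bytes followed by 0x81 still decode to 1, while nine 0 bytes are rejected.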
Canonical form for cryptographic hashing and signing. No canonical
textual encoding of a Value is specified. A canonical form exists for
binary encoded Values, and implementations SHOULD produce canonical
binary encodings by default; however, an implementation MAY permit two
serializations of the same Value to yield different binary Reprs.
Acknowledgements
The exclusion of lengths from Reprs, placing lengths instead ahead of
contained values in sequences, is inspired by argdata, as is the
inclusion of a NUL byte in String Reprs for C interoperability.
Appendix. Autodetection of textual or binary syntax
Every tag byte in a binary Preserves Repr falls within the range
[0x80, 0xBF]. These bytes, interpreted as UTF-8, are continuation
bytes, and will never occur as the first byte of a UTF-8 encoded code
point. This means no binary-encoded Repr can be misinterpreted as valid
UTF-8.

Conversely, a UTF-8 Document must start with a valid codepoint, meaning
in particular that it must not start with a byte in the range
[0x80, 0xBF]. This means that no UTF-8 encoded textual-syntax Preserves
Document can be misinterpreted as a binary-syntax Repr.

Examination of the top two bits of the first byte of an encoded Value
gives its syntax: if the top two bits are 10, it should be interpreted
as a binary-syntax Repr; otherwise, it should be interpreted as text.
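
The check amounts to a single mask-and-compare on the first byte. A Python sketch, with a hypothetical function name:

```python
def detect_syntax(data: bytes) -> str:
    """Classify an encoded Value as 'binary' or 'text' from its first byte."""
    if not data:
        raise ValueError("no input to examine")
    # Top two bits 10 (bytes 0x80 through 0xBF) mark a binary-syntax Repr.
    return "binary" if (data[0] & 0xC0) == 0x80 else "text"
```

A Sequence Repr beginning with 0xA8 is classified as binary, whereas text beginning with an ASCII character or with a UTF-8 lead byte such as 0xC3 is classified as text.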
Streaming. Autodetection is still possible when streaming an
undetermined number of Values across, say, a TCP/IP connection:
- If the text syntax is to be used for the connection, simply start writing each Document one after the other. Documents for Atoms are in general ambiguous if not separated from their neighbours by whitespace; whitespace SHOULD be used to separate adjacent documents. Specifically, whitespace separating adjacent documents SHOULD be ASCII newline (10).
- If the binary syntax is to be used for the connection, start the connection with byte 0xA8 (sequence). After the initial byte, send each value v as len(|«v»|) ++ «v». A side effect of this approach is that the entire stream, when complete, is a valid Sequence Repr.
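
The binary case can be sketched in Python as follows. The sketch assumes a blocking binary file-like object whose read(n) returns fewer than n bytes only at end of stream (io.BytesIO behaves this way); all function names are hypothetical.

```python
import io


def encode_varint(m: int) -> bytes:
    """Shortest varint len(m); hypothetical helper, repeated so the sketch stands alone."""
    out = [(m % 128) + 128]
    m //= 128
    while m > 0:
        out.append(m % 128)
        m //= 128
    return bytes(reversed(out))


def write_stream_header(out) -> None:
    out.write(b"\xA8")                       # the stream opens like a Sequence Repr


def write_value(out, r: bytes) -> None:
    out.write(encode_varint(len(r)) + r)     # len(|«v»|) ++ «v»


def read_values(stream):
    """Yield each Repr from a binary stream until it ends between values."""
    if stream.read(1) != b"\xA8":
        raise ValueError("binary stream must start with the sequence tag 0xA8")
    while True:
        first = stream.read(1)
        if not first:
            return                           # clean end of stream
        length, byte = 0, first[0]
        while True:
            length = length * 128 + (byte & 0x7F)
            if byte & 0x80:
                break
            nxt = stream.read(1)
            if not nxt:
                raise ValueError("stream ended inside a length")
            byte = nxt[0]
        r = stream.read(length)
        if len(r) != length:
            raise ValueError("stream ended inside a Repr")
        yield r
```

Writing the header followed by the Reprs A3 01 and A4 68 69 00 produces A8 82 A3 01 84 A4 68 69 00, which is also the Sequence Repr containing those two values.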
Appendix. Table of tag values
(8x) RESERVED 80-8F
(9x) RESERVED 90-9F
A0 - False
A1 - True
A2 - Float or Double (length disambiguates)
A3 - SignedIntegers (0 is encoded with no bytes at all)
A4 - String (a trailing NUL is added)
A5 - ByteString
A6 - Symbol
A7 - Record
A8 - Sequence
A9 - Set
AA - Dictionary
AB - Embedded
(Ax) RESERVED AC-AF
(Bx) RESERVED B0-BE
BF - Annotations. {BF Lval val Lann0 ann0 Lann1 ann1 ...}
Appendix. Binary SignedInteger representation
Languages that provide fixed-width machine word types may find the
following table useful in encoding and decoding binary SignedInteger
values.
Integer range | Bytes required | Encoding (hex) |
---|---|---|
0 | 1 | A3 |
-2^7 ≤ n < 2^7 (i8) | 2 | A3 XX |
-2^15 ≤ n < 2^15 (i16) | 3 | A3 XX XX |
-2^23 ≤ n < 2^23 (i24) | 4 | A3 XX XX XX |
-2^31 ≤ n < 2^31 (i32) | 5 | A3 XX XX XX XX |
-2^39 ≤ n < 2^39 (i40) | 6 | A3 XX XX XX XX XX |
-2^47 ≤ n < 2^47 (i48) | 7 | A3 XX XX XX XX XX XX |
-2^55 ≤ n < 2^55 (i56) | 8 | A3 XX XX XX XX XX XX XX |
-2^63 ≤ n < 2^63 (i64) | 9 | A3 XX XX XX XX XX XX XX XX |
Notes
1. Argdata's length representation is very close to Variable-length quantity (VLQ) encoding, differing only in the flipped interpretation of the high bit of each byte. It is big-endian, unlike LEB128 encoding (as used by Google in protobufs).
2. Implementation note. The spec permits overlong length encodings to reduce wasted activity in resource-constrained situations. If an implementation is in anything other than a very low-level language, it is likely to be able to use IOList-style data structures to avoid unnecessary copying.
3. In the BitTorrent encoding format, bencoding, dictionary key/value pairs must be sorted by key. This is a necessary step for ensuring serialization of Values is canonical. We do not require that key/value pairs (or set elements) be in sorted order for serialized Values; however, a canonical form for Reprs does exist where a sorted ordering is required.
4. It's important to note that the sort ordering for writing out set elements and dictionary key/value pairs is not the same as the sort ordering implied by the semantic ordering of those elements or keys. For example, the Repr of a negative number extends the Repr of zero (the lone tag byte 0xA3) with further bytes, making it sort lexicographically later by Repr, despite being semantically less than zero.
   Rationale. This is for ease-of-implementation reasons: not all languages can easily represent sorted sets or sorted dictionaries, but encoding and then sorting byte strings is much more likely to be within easy reach.
5. The value 0 needs zero bytes to identify the value, so intbytes(0) is the empty byte string. Non-zero values need at least one byte.
6. Some care must still be taken when passing String Reprs directly to a C-style ABI, since Strings may contain the zero Unicode code point, which C library routines will usually misinterpret as an end-of-string marker.