Preserves: Zero-copy Binary Syntax
Tony Garnock-Jones tonyg@leastfixedpoint.com
{{ site.version_date }}. Version {{ site.version }}.
Preserves is a data model, with associated serialization formats. This
document defines one of those formats: a binary syntax for Values from the
Preserves data model that avoids, in many cases, use of intermediate data
structures during reading and writing. This makes it suitable for
representing very large values whose fully-decoded representations may not
fit in working memory.
Zero-Copy Binary Syntax
A Buf is a zero-copy syntax encoding, or representation, of a
non-immediate Value. A Ref is either a type-tagged representation of a
small immediate Value or a type-tagged pointer to a Buf.
Each Ref is a 64-bit unsigned value. Its tag appears in the low 4 bits.
The remaining 60 bits encode either an unsigned offset pointing to a
previously-encoded Buf, or an immediate value. Pointers always point
backwards to earlier positions.
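The bit layout just described can be sketched in Python as follows. The helper names (make_ref, ref_tag, ref_payload) are illustrative assumptions and are not defined by this specification.

```python
TAG_BITS = 4
PAYLOAD_BITS = 60
PAYLOAD_MASK = (1 << PAYLOAD_BITS) - 1


def make_ref(tag: int, payload: int) -> int:
    """Pack a 4-bit tag and a 60-bit unsigned payload into a 64-bit Ref."""
    if not 0 <= tag < (1 << TAG_BITS):
        raise ValueError("tag must fit in 4 bits")
    if not 0 <= payload <= PAYLOAD_MASK:
        raise ValueError("payload must fit in 60 bits")
    return (payload << TAG_BITS) | tag


def ref_tag(ref: int) -> int:
    """Extract the tag from the low 4 bits of a Ref."""
    return ref & 0xF


def ref_payload(ref: int) -> int:
    """Extract the 60-bit payload (an offset or an immediate value)."""
    return ref >> TAG_BITS
```

Whether the payload is an offset or an immediate value is decided by the tag, so a reader would normally dispatch on ref_tag before interpreting ref_payload.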
Each Buf is prefixed with a 64-bit payload length, counted in units of
bytes, and is zero-padded to the next multiple of 16 bytes. Neither the
length of the padding nor the length of the length field itself is included
in the length.
Offsets in pointer Refs are counted in 16-byte units, measuring from the
beginning of the length indicator of the Buf in which the Ref appears.
A zero offset is special: it denotes an empty value of the type
associated with the tag in the Ref.
All multi-byte quantities are encoded using little-endian byte order.
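As a minimal sketch of the framing rules above, the following Python wraps a payload as a Buf and resolves a pointer offset to a byte position. The function names are hypothetical, and pointer_target reflects one reading of the offset rule (positions are byte indices into the encoded data, and targets lie before the containing Buf).

```python
import struct


def encode_buf(payload: bytes) -> bytes:
    """Frame a payload as a Buf: 64-bit little-endian length, payload, zero padding."""
    framed = struct.pack("<Q", len(payload)) + payload
    padding = -len(framed) % 16  # bytes needed to reach the next 16-byte boundary
    return framed + b"\x00" * padding


def pointer_target(containing_buf_start: int, offset_units: int) -> int:
    """Byte position of the Buf that a pointer Ref refers to.

    containing_buf_start is the position of the length indicator of the Buf
    in which the Ref appears; offsets count backwards in 16-byte units.
    """
    if offset_units == 0:
        raise ValueError("offset zero denotes an empty value, not a pointer")
    return containing_buf_start - 16 * offset_units
```

Because every Buf is padded to a 16-byte boundary, counting offsets in 16-byte units loses no precision and extends the reach of a 60-bit offset.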
Header.
Because Refs are typed, but Bufs are not, the outermost Value in e.g. a
file or network stream is always encoded with a special header preceding
it.
Offset   Length Description
-------- ------ -----------
00000000 1      Marker byte 0xFF
00000001 1      Version number 0x00
00000002 6      Reserved, 0x00
00000008 8      Special Ref
00000010 8      Length ("n") of encoded data, in bytes
00000018 n      Encoded data
-        -      Zero-padding to next 16-byte boundary
The Ref in the header at offset 8 is special.
If it encodes an immediate Value, that Value is the encoded value, and
the length field and encoded data are omitted. The entire encoded value is
exactly 16 bytes long in this case.
However, if the special Ref is an encoding of a pointer to a Buf, the
offset is interpreted as counting back from the very end of the padding at
the end of the encoded data. The total length of the entire encoded value
is the length of the encoded data, plus 24, rounded up to the next multiple
of 16.
Either way, the tag on the special Ref is the type of the encoded value.
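The two header shapes can be sketched as below. This is an illustration under assumptions: the function names are hypothetical, the special Ref is supplied by the caller already computed, and the encoded data is assumed to be an already well-formed run of Bufs.

```python
import struct

HEADER_PREFIX = bytes([0xFF, 0x00]) + bytes(6)  # marker, version, 6 reserved bytes


def encode_immediate_toplevel(special_ref: int) -> bytes:
    """Header for an immediate Value: length field and data are omitted."""
    return HEADER_PREFIX + struct.pack("<Q", special_ref)


def encode_pointer_toplevel(special_ref: int, data: bytes) -> bytes:
    """Header for a non-immediate Value, followed by data and zero padding."""
    out = HEADER_PREFIX
    out += struct.pack("<Q", special_ref)
    out += struct.pack("<Q", len(data))
    out += data
    return out + b"\x00" * (-len(out) % 16)


def total_length(n: int) -> int:
    """Length of the encoded data plus 24, rounded up to a multiple of 16."""
    return (n + 24 + 15) // 16 * 16
```

For instance, encode_immediate_toplevel(0x10) yields the 16-byte encoding of the Boolean true, and 13 bytes of encoded data produce a 48-byte result.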
Tags and Refs.
Tag Type          Interpretation of 60-bit payload
--- ------------- --------------------------------
0   Boolean       0 = False, 1 = True
1   IEEE 754      Offset to Buf holding little-endian 32/64-bit float
2   SignedInteger Signed 60-bit integer
3   SignedInteger Offset to Buf holding little-endian signed integer
4   String        0-7 bytes of UTF-8; length in lower 4 bits
5   String        Offset to Buf holding UTF-8 data
6   ByteString    0-7 bytes of raw binary; length in lower 4 bits
7   ByteString    Offset to Buf holding raw binary data
8   Symbol        0-7 bytes of UTF-8; length in lower 4 bits
9   Symbol        Offset to Buf holding UTF-8 data
A   Record        Offset to Buf holding Refs (label, fields)
B   Sequence      Offset to Buf holding Refs (sequence values)
C   Set           Offset to Buf holding Refs (elements in arbitrary order)
D   Dictionary    Offset to Buf holding Refs (key/value pairs)
E   Embedded      Offset to Buf holding a single Ref
F   -             (reserved)
Records, Sequences, Sets and Dictionaries.
Offset   Length Description
-------- ------ -----------
00000000 8      n*8: length of following sequence of n Refs, in bytes
00000008 8      Ref 0
...      ...    ...
n*8      8      Ref n-1
(n+1)*8  8      Padding, only if n is even
Each compound datum is represented as a sequence of Refs representing the
contained Values. Each Record's sequence represents the label, followed
by the fields in order. Each Sequence's representation is just its
contained values in order. Sets are ordered arbitrarily into a sequence.
The key-value pairs in a Dictionary are ordered arbitrarily, alternating
between keys and their matching values.
There is no ordering requirement on the elements of Sets or the
key-value pairs in a Dictionary. They may appear in any order. However,
the elements and keys MUST be pairwise distinct according to the
Preserves equivalence relation.
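The compound layout can be sketched as follows, assuming the member Refs have already been computed. The names encode_ref_buf and dictionary_refs are hypothetical; the sketch does not check the distinctness requirement.

```python
import struct
from typing import Iterable, List, Tuple


def encode_ref_buf(refs: Iterable[int]) -> bytes:
    """Frame a sequence of 64-bit Refs as a Buf (length, Refs, padding)."""
    payload = b"".join(struct.pack("<Q", r) for r in refs)
    framed = struct.pack("<Q", len(payload)) + payload
    # With an 8-byte length and n 8-byte Refs, padding is needed only if n is even.
    return framed + b"\x00" * (-len(framed) % 16)


def dictionary_refs(pairs: Iterable[Tuple[int, int]]) -> List[int]:
    """Flatten (key Ref, value Ref) pairs into the alternating Dictionary order."""
    flat: List[int] = []
    for key_ref, value_ref in pairs:
        flat.extend((key_ref, value_ref))
    return flat
```

A Record would pass its label Ref first, followed by its field Refs; a Sequence passes its element Refs in order.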
SignedIntegers.
Integers between -2^59 and 2^59-1, inclusive, are represented as immediate
values in a Ref with tag 2. Integers outside this range are represented
with a Ref with tag 3 pointing to a Buf containing exactly as many 64-bit
words as needed to unambiguously identify the value and its sign, in
little-endian byte and word ordering. Every SignedInteger MUST be
represented with its shortest possible encoding.
For example,
Number (decimal)                          Ref (64-bit)     Buf (hex bytes)
----------------------------------------- ---------------- ----------------
-576460752303423488                       8000000000000002 -
-257                                      FFFFFFFFFFFFEFF2 -
-1                                        FFFFFFFFFFFFFFF2 -
0                                         0000000000000002 -
1                                         0000000000000012 -
257                                       0000000000001012 -
576460752303423487                        7FFFFFFFFFFFFFF2 -
1000000000000000000000000000000           ...............3 1000000000000000
                                                           00000040EAED7446
                                                           D09C2C9F0C000000
                                                           0000000000000000
-1000000000000000000000000000000          ...............3 1000000000000000
                                                           000000C015128BB9
                                                           2F63D360F3FFFFFF
                                                           0000000000000000
87112285931760246646623899502532662132736 ...............3 1800000000000000
                                                           0000000000000000
                                                           0000000000000000
                                                           0001000000000000
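The rows of the table can be reproduced with a short Python sketch. The function names are hypothetical; the word count is derived from the two's-complement bit length plus one sign bit, which is one way to obtain the shortest encoding.

```python
import struct

IMMEDIATE_MIN = -(1 << 59)
IMMEDIATE_MAX = (1 << 59) - 1


def immediate_int_ref(value: int) -> int:
    """Ref with tag 2 holding a signed 60-bit two's-complement integer."""
    if not IMMEDIATE_MIN <= value <= IMMEDIATE_MAX:
        raise ValueError("value does not fit in 60 signed bits")
    return ((value & ((1 << 60) - 1)) << 4) | 2


def big_int_buf(value: int) -> bytes:
    """Buf holding the fewest 64-bit little-endian words that represent value."""
    magnitude_bits = (value if value >= 0 else ~value).bit_length()
    words = (magnitude_bits + 1 + 63) // 64  # +1 accounts for the sign bit
    payload = value.to_bytes(8 * words, "little", signed=True)
    framed = struct.pack("<Q", len(payload)) + payload
    return framed + b"\x00" * (-len(framed) % 16)
```

An encoder would call immediate_int_ref whenever the value is in range and fall back to big_int_buf (with a tag 3 pointer Ref) only otherwise, so that each integer has a single encoding.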
Strings, ByteStrings and Symbols.
Syntax for these three types varies only in the tag used. For String and
Symbol, the encoded data is a UTF-8 encoding of the Value's code points,
while for ByteString it is the raw data contained within the Value
unmodified.
Encoded data of length 7 bytes or shorter is represented as an immediate
Ref with tag 4 (String), 6 (ByteString) or 8 (Symbol). The lower 4 bits of
the 60-bit payload are the length of the encoded data; the upper 56 bits
are 7 bytes of data, with the first data byte in the lowest byte, so that
the order of data bytes in memory in an immediate encoding matches the
order in a Buf encoding.
Data longer than 7 bytes is represented with a Ref with tag 5, 7 or 9
pointing to a Buf containing the bytes of encoded data. Empty values
(length 0) MUST be encoded using pointer Ref form with special offset
zero.
For example,
Value                                     Ref (64-bit)     Buf (hex bytes)
----------------------------------------- ---------------- ----------------
""                                        0000000000000005 -
#""                                       0000000000000007 -
||                                        0000000000000009 -
"Hello"                                   00006F6C6C654854 -
"a\0a"                                    0000000061006134 -
"Hello, world!"                           ...............5 0D00000000000000
                                                           48656C6C6F2C2077
                                                           6F726C6421000000
                                                           0000000000000000
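The immediate form follows directly from the prose rules: tag in the low 4 bits, length in the next 4 bits, first data byte in the lowest byte of the upper 56 bits. The sketch below uses hypothetical helper names and takes the even immediate tag (4, 6 or 8) as a parameter.

```python
def immediate_ref(tag: int, data: bytes) -> int:
    """Pack 1 to 7 bytes of encoded data into an immediate Ref."""
    if tag not in (4, 6, 8):
        raise ValueError("expected tag 4 (String), 6 (ByteString) or 8 (Symbol)")
    if not 1 <= len(data) <= 7:
        raise ValueError("immediate form holds 1 to 7 bytes; empty uses offset zero")
    first_byte = (len(data) << 4) | tag
    # Little-endian: the tag/length byte is lowest, data bytes follow in order.
    return int.from_bytes(bytes([first_byte]) + data, "little")


def empty_ref(tag: int) -> int:
    """Pointer-form Ref (tag + 1) with the special offset zero."""
    if tag not in (4, 6, 8):
        raise ValueError("expected tag 4 (String), 6 (ByteString) or 8 (Symbol)")
    return tag + 1
```

Stored in little-endian memory, immediate_ref(4, b"Hello") is the byte 0x54 followed by the five data bytes and two zero bytes, which is the in-memory ordering property the text describes.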
Booleans.
Value                                     Ref (64-bit)     Buf (hex bytes)
----------------------------------------- ---------------- ----------------
#f                                        0000000000000000 -
#t                                        0000000000000010 -
Floats and Doubles.
Each IEEE 754 4- and 8-byte binary representation is encoded into a Buf,
pointed to with a Ref with tag 1. The length of the Buf disambiguates
between 32-bit floats and 64-bit doubles.
((This is a very sparse encoding! Each float/double takes up 24 bytes split
across the Buf and Ref.))
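A minimal sketch of the float Buf, using Python's struct module for the IEEE 754 conversions. The function names and the `double` flag are assumptions made for illustration.

```python
import struct


def encode_float_buf(value: float, double: bool = True) -> bytes:
    """Buf holding a little-endian IEEE 754 double (8 bytes) or float (4 bytes)."""
    payload = struct.pack("<d" if double else "<f", value)
    framed = struct.pack("<Q", len(payload)) + payload
    return framed + b"\x00" * (-len(framed) % 16)


def decode_float_buf(buf: bytes) -> float:
    """Read a float back; the Buf length selects between the two widths."""
    (length,) = struct.unpack_from("<Q", buf, 0)
    if length == 4:
        return struct.unpack_from("<f", buf, 8)[0]
    if length == 8:
        return struct.unpack_from("<d", buf, 8)[0]
    raise ValueError("float Buf must have length 4 or 8")
```

Both widths occupy a 16-byte Buf, which together with the 8-byte Ref gives the 24 bytes noted above.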
Embeddeds.
To encode an Embedded, first choose a Value to represent the denoted
object, and encode that, producing a Ref. Place that Ref in a Buf all
of its own (with length 8). Finally, point to the Buf with a Ref with
tag 14 (hex E).
Annotations.
((Not sure: put them as a trailer after a Header?))
Security Considerations
((TBD))
Appendix. Autodetection of textual or binary syntax
The first byte of a Header is 0xFF, which never appears in valid UTF-8. ((...))
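Since this appendix is unfinished, the following is only a sketch of the obvious check, with a hypothetical function name: input starting with 0xFF is treated as binary syntax, and anything else is assumed to be text.

```python
def looks_like_binary_syntax(data: bytes) -> bool:
    """True if data begins with the Header marker byte 0xFF."""
    return data[:1] == b"\xff"
```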