45 KiB
no_site_title | title |
---|---|
true | Preserves: an Expressive Data Language |
Tony Garnock-Jones tonyg@leastfixedpoint.com
May 2020. Version 0.0.7.
This document proposes a data model and serialization format called Preserves.
Preserves supports records with user-defined labels. This relieves the confusion caused by encoding records as dictionaries, seen in most data languages in use on the web. It also allows Preserves to easily represent the labelled sums of products as seen in many functional programming languages.
Preserves also supports the usual suite of atomic and compound data types, in particular including binary data as a distinct type from text strings. Its annotations allow separation of data from metadata such as comments, trace information, and provenance information.
Finally, Preserves defines precisely how to compare two values. Comparison is based on the data model, not on syntax or on data structures of any particular implementation language.
Starting with Semantics
Taking inspiration from functional programming, we start with a definition of the values that we want to work with and give them meaning independent of their syntax.
Our Value
s fall into two broad categories: atomic and compound
data.
Value = Atom
| Compound
Atom = Boolean
| Float
| Double
| SignedInteger
| String
| ByteString
| Symbol
Compound = Record
| Sequence
| Set
| Dictionary
Total order. As we go, we will
incrementally specify a total order over Value
s. Two values of the
same kind are compared using kind-specific rules. The ordering among
values of different kinds is essentially arbitrary, but having a total
order is convenient for many tasks, so we define it as
follows:
(Values) Atom < Compound
(Compounds) Record < Sequence < Set < Dictionary
(Atoms) Boolean < Float < Double < SignedInteger
< String < ByteString < Symbol
Equivalence. Two Value
s are equal if
neither is less than the other according to the total order.
Signed integers.
A SignedInteger
is a signed integer of arbitrary width.
SignedInteger
s are compared as mathematical integers.
Unicode strings.
A String
is a sequence of Unicode
code-points. String
s
are compared lexicographically, code-point by
code-point.1
Binary data.
A ByteString
is a sequence of octets. ByteString
s are compared
lexicographically.
Symbols.
Programming languages like Lisp and Prolog frequently use string-like
values called symbols. Here, a Symbol
is, like a String
, a
sequence of Unicode code-points representing an identifier of some
kind. Symbol
s are also compared lexicographically by code-point.
Booleans.
There are two Boolean
s, “false” and “true”. The “false” value is
less-than the “true” value.
IEEE floating-point values.
Float
s and Double
s are single- and double-precision IEEE 754
floating-point values, respectively. Float
s, Double
s and
SignedInteger
s are disjoint; by the rules above,
every Float
is less than every Double
, and every SignedInteger
is greater than both. Two Float
s or two Double
s are to be ordered
by the totalOrder
predicate defined in section 5.10 of
IEEE Std 754-2008.
Records.
A Record
is a labelled tuple of Value
s, the record's fields. A
label can be any Value
, but is usually a Symbol
.2
3 Record
s are compared lexicographically: first by
label, then by field sequence.
Sequences.
A Sequence
is a sequence of Value
s. Sequence
s are compared
lexicographically.
Sets.
A Set
is an unordered finite set of Value
s. It contains no
duplicate values, following the equivalence relation
induced by the total order on Value
s. Two Set
s are compared by
sorting their elements ascending using the total order
and comparing the resulting Sequence
s.
Dictionaries.
A Dictionary
is an unordered finite collection of pairs of Value
s.
Each pair comprises a key and a value. Keys in a Dictionary
are
pairwise distinct. Instances of Dictionary
are compared by
lexicographic comparison of the sequences resulting from ordering each
Dictionary
's pairs in ascending order by key.
Textual Syntax
Now we have discussed Value
s and their meanings, we may turn to
techniques for representing Value
s for communication or storage.
In this section, we use case-sensitive ABNF to define a textual syntax that is easy for people to read and write.4 Most of the examples in this document are written using this syntax. In the following section, we will define an equivalent compact machine-readable syntax.
Character set.
ABNF allows easy definition of US-ASCII-based languages. However, Preserves is a Unicode-based language. Therefore, we reinterpret ABNF as a grammar for recognising sequences of Unicode code points.
Textual syntax for a Value
SHOULD be encoded using UTF-8 where
possible.
Whitespace.
Whitespace is defined as any number of spaces, tabs, carriage returns, line feeds, or commas.
ws = *(%x20 / %x09 / newline / ",")
newline = CR / LF
Grammar.
Standalone documents may have trailing whitespace.
Document = Value ws
Any Value
may be preceded by whitespace.
Value = ws (Record / Collection / Atom / Compact)
Collection = Sequence / Dictionary / Set
Atom = Boolean / Float / Double / SignedInteger /
String / ByteString / Symbol
Each Record
is an angle-bracket enclosed grouping of its
label-Value
followed by its field-Value
s.
Record = "<" Value *Value ws ">"
Sequence
s are enclosed in square brackets. Dictionary
values are
curly-brace-enclosed colon-separated pairs of values. Set
s are
written either as one or more values enclosed in curly braces, or zero
or more values enclosed by the tokens #set{
and
}
.5 It is an error for a set to contain
duplicate elements or for a dictionary to contain duplicate keys.
Sequence = "[" *Value ws "]"
Dictionary = "{" *(Value ws ":" Value) ws "}"
Set = %s"#set{" *Value ws "}" / "{" 1*Value ws "}"
Boolean
s are the simple literal strings #true
and #false
.
Boolean = %s"#true" / %s"#false"
Numeric data follow the
JSON grammar, with
the addition of a trailing “f” distinguishing Float
from Double
values. Float
s and Double
s always have either a fractional part or
an exponent part, where SignedInteger
s never have
either.6
7
Float = flt %i"f"
Double = flt
SignedInteger = int
digit1-9 = %x31-39
nat = %x30 / ( digit1-9 *DIGIT )
int = ["-"] nat
frac = "." 1*DIGIT
exp = %i"e" ["-"/"+"] 1*DIGIT
flt = int (frac exp / frac / exp)
String
s are,
as in JSON, possibly
escaped text surrounded by double quotes. The escaping rules are the
same as for JSON.8 9
String = %x22 *char %x22
char = unescaped / %x7C / escape (escaped / %x22 / %s"u" 4HEXDIG)
unescaped = %x20-21 / %x23-5B / %x5D-7B / %x7D-10FFFF
escape = %x5C ; \
escaped = ( %x5C / ; \ reverse solidus U+005C
%x2F / ; / solidus U+002F
%x62 / ; b backspace U+0008
%x66 / ; f form feed U+000C
%x6E / ; n line feed U+000A
%x72 / ; r carriage return U+000D
%x74 ) ; t tab U+0009
A ByteString
may be written in any of three different forms.
The first is similar to a String
, but prepended with a hash sign
#
. In addition, only Unicode code points overlapping with printable
7-bit ASCII are permitted unescaped inside such a ByteString
; other
byte values must be escaped by prepending a two-digit hexadecimal
value with \x
.
ByteString = "#" %x22 *binchar %x22
binchar = binunescaped / escape (escaped / %x22 / %s"x" 2HEXDIG)
binunescaped = %x20-21 / %x23-5B / %x5D-7E
The second is as a sequence of pairs of hexadecimal digits interleaved
with whitespace and surrounded by #hex{
and }
.
ByteString =/ %s"#hex{" *(ws / 2HEXDIG) ws "}"
The third is as a sequence of
Base64 characters, interleaved
with whitespace and surrounded by #base64{
and }
. Plain and
URL-safe Base64 characters are allowed.
ByteString =/ %s"#base64{" *(ws / base64char) ws "}" /
base64char = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "="
A Symbol
may be written in a “bare” form10 so long as
it conforms to certain restrictions on the characters appearing in the
symbol. Alternatively, it may be written in a quoted form. The quoted
form is much the same as the syntax for String
s, including embedded
escape syntax, except using a bar or pipe character (|
) instead of a
double quote mark.
Symbol = symstart *symcont / "|" *symchar "|"
symstart = ALPHA / sympunct / symustart
symcont = ALPHA / sympunct / symustart / symucont / DIGIT / "-"
sympunct = "~" / "!" / "$" / "%" / "^" / "&" / "*" /
"?" / "_" / "=" / "+" / "/" / "."
symchar = unescaped / %x22 / escape (escaped / %x7C / %s"u" 4HEXDIG)
symustart = <any code point greater than 127 whose Unicode
category is Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me,
Pc, Po, Sc, Sm, Sk, So, or Co>
symucont = <any code point greater than 127 whose Unicode
category is Nd, Nl, No, or Pd>
Finally, any Value
may be represented by escaping from the textual
syntax to the compact binary syntax by
prefixing a ByteString
containing the binary representation of the
Value
with #value
.11
12 13
Compact = %s"#value" ws ByteString
Annotations.
Syntax. When written down, a Value
may have an associated
sequence of annotations carrying “out-of-band” contextual metadata
about the value. Each annotation is, in turn, a Value
, and may
itself have annotations.
Value =/ ws "@" Value Value
Each annotation is preceded by @
; the underlying annotated value
follows its annotations. Here we extend only the syntactic nonterminal
named “Value
” without altering the semantic class of Value
s.
Equivalence. Annotations appear within syntax denoting a Value
;
however, the annotations are not part of the denoted value. They are
only part of the syntax. Annotations do not play a part in
equivalences and orderings of Value
s.
Reflective tools such as debuggers, user interfaces, and message
routers and relays---tools which process Value
s generically---may
use annotated inputs to tailor their operation, or may insert
annotations in their outputs. By contrast, in ordinary programs, as a
rule of thumb, the presence, absence or content of an annotation
should not change the control flow or output of the program.
Annotations are data describing Value
s, and are not in the domain
of any specific application of Value
s. That is, an annotation will
almost never cause a non-reflective program to do anything observably
different.
Compact Binary Syntax
A Repr
is a binary-syntax encoding, or representation, of either
- a
Value
, - a “placeholder” for a
Value
, or - an annotation on a
Repr
.
Each Repr
comprises one or more bytes describing the kind of
represented information and the length of the representation, followed
by the encoded details.
For a value v
, we write [[v]]
for the Repr
of v.
Type and Length representation.
Each Repr
takes one of three possible forms:
-
(A) type-specific form, used for simple values such as
Boolean
s orFloat
s, for placeholders, and for introducing annotations. -
(B) a variable-length form with length specified up-front, used for compound and variable-length atomic data structures when their sizes are known at the time serialization begins.
-
(C) a variable-length streaming form with unknown or unpredictable length, used in cases when serialization begins before the number of elements or bytes in the corresponding
Value
is known.
Applications may choose between formats B and C depending on their needs at serialization time.
The lead byte.
Every Repr
starts with a lead byte, constructed by
leadbyte(t,n,m)
, where t
,n
∈{0,1,2,3} and 0≤m
<16:
leadbyte(t,n,m) = [t*64 + n*16 + m]
The arguments t
, n
and m
describe the rest of the
representation.14
t |
n |
m |
Meaning |
---|---|---|---|
0 | 0 | 0–3 | (format A) An Atom with fixed-length binary representation |
0 | 0 | 4 | (format C) Stream end |
0 | 0 | 5 | (format A) Annotation |
0 | 1 | (format A) Placeholder for an application-specific Value |
|
0 | 2 | (format C) Stream start | |
0 | 3 | (format A) Certain small SignedInteger s |
|
1 | (format B) An Atom with variable-length binary representation |
||
2 | (format B) A Compound with variable-length representation |
||
3 | 3 | 15 | (format A) 0xFF byte; no-op |
Encoding data of type-specific length (format A).
Each type of data defines its own rules for this format.
Of particular note is lead byte 0xFF
, which is a no-op byte acting
as a kind of pseudo-whitespace in a binary-syntax encoding.
Encoding data of known length (format B).
Format B is used where the length l
of the Value
to be encoded is
known when serialization begins. Format B Repr
s use m
in
leadbyte
to encode l
. The length counts bytes for atomic
Value
s, but counts contained values for compound Value
s.
- A length
l
between 0 and 14 is represented usingleadbyte
withm=l
. - A length of 15 or greater is represented by
m=15
and additional bytes describing the length following the lead byte.
The function header(t,n,m)
yields an appropriate sequence of bytes
describing a Repr
's type and length when t
, n
and m
are
appropriate non-negative integers:
header(t,n,m) = leadbyte(t,n,m) when m < 15
or leadbyte(t,n,15) ++ varint(m) otherwise
The additional length bytes are formatted as
base 128 varints.15 We write varint(m)
for
the varint-encoding of m
. Quoting the
Google Protocol Buffers definition,
Each byte in a varint, except the last byte, has the most significant bit (msb) set – this indicates that there are further bytes to come. The lower 7 bits of each byte are used to store the two's complement representation of the number in groups of 7 bits, least significant group first.
The following table illustrates varint-encoding.
Number, m |
m in binary, grouped into 7-bit chunks |
varint(m) bytes |
---|---|---|
15 | 0001111 |
15 |
300 | 0000010 0101100 |
172 2 |
1000000000 | 0000011 1011100 1101011 0010100 0000000 |
128 148 235 220 3 |
It is an error for a varint-encoded m
in a Repr
to be anything
other than the unique shortest encoding for that m
. That is, a
varint-encoding of m
MUST NOT end in 0
unless m
=0. However,
the varint(m)
encoding of a length MUST NOT be used when m
<15,
meaning that a Repr
MUST NOT contain any varint-encoding with
final byte 0
.
Streaming data of unknown length (format C).
A Repr
where the length of the Value
to be encoded is variable and
not known at the time serialization of the Value
starts is encoded
by a single Stream Start (“open”) byte, followed by zero or more
chunks, followed by a matching Stream End (“close”) byte:
open(t,n) = leadbyte(0,2, t*4 + n) = [0x20 + t*4 + n]
close() = leadbyte(0,0, 4) = [0x04]
For a format C Repr
of an atomic Value
, each chunk is to be a
format B Repr
of a ByteString
, no matter the type of the overall
Value
. Annotations are not allowed on these individual chunks.
For a format C Repr
of a compound Value
, each chunk is to be a
single Repr
, which may itself be annotated.
Each chunk within a format C Repr
MUST have non-zero length.
Software that decodes Repr
s MUST reject Repr
s that include
zero-length chunks.
Records.
Format B (known length):
[[ <L F_1...F_m> ]] = header(2,0,m+1) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]]
For m
fields, m+1
is supplied to header
, to account for the
encoding of the record label.
Format C (streaming):
[[ <L F_1...F_m> ]] = open(2,0) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]] ++ close()
Applications SHOULD prefer the known-length format for encoding
Record
s.
Placeholders.
Applications may define an interpretation for numbered placeholders
in the binary syntax, mapping each placeholder number n
to a
specific Value
. For example, a placeholder number may be assigned
for a frequently-used Record
label.
A Value
v
for which placeholder number n
has been assigned may
be tersely encoded as
[[v]] = header(0,1,n) when n is a placeholder number for v
Examples. For example, a protocol may choose to assign placeholder
number 4 to the symbol void
, making
[[void]] = header(0,1,4) = [0x14]
[[<void>]] = header(2,0,1) ++ [[void]] = [0x81, 0x14]
or it may map symbol person
to placeholder number 102, making
[[person]] = header(0,1,102) = [0x1F, 0x66]
and so
[[<person "Dr" "Elizabeth" "Blackwell">]]
= header(2,0,4) ++ [[person]] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]
= [0x84, 0x1F, 0x66] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]
for format B, or
open(2,0) ++ [[person]] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ close()
= [0x28, 0x1F, 0x66] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ [0x04]
for format C.
Sequences, Sets and Dictionaries.
Format B (known length):
[[ [X_1...X_m] ]] = header(2,1,m) ++ [[X_1]] ++...++ [[X_m]]
[[ #set{X_1...X_m} ]] = header(2,2,m) ++ [[X_1]] ++...++ [[X_m]]
[[ {K_1:V_1...K_m:V_m} ]] = header(2,3,m*2) ++ [[K_1]] ++ [[V_1]] ++...
++ [[K_m]] ++ [[V_m]]
Note that m*2
is given to header
for a Dictionary
, since there
are two Value
s in each key-value pair.
Format C (streaming):
[[ [X_1...X_m] ]] = open(2,1) ++ [[X_1]] ++...++ [[X_m]] ++ close()
[[ #set{E_1...E_m} ]] = open(2,2) ++ [[E_1]] ++...++ [[E_m]] ++ close()
[[ {K_1:V_1...K_m:V_m} ]] = open(2,3) ++ [[K_1]] ++ [[V_1]] ++...
++ [[K_m]] ++ [[V_m]] ++ close()
Applications may use whichever format suits their needs on a case-by-case basis.
There is no ordering requirement on the E_i
elements or
K_i
/V_i
pairs.16 They may appear in any
order. However, the E_i
and K_i
MUST be pairwise distinct.
SignedIntegers.
Format B/A (known length/fixed-size):
[[ x ]] when x ∈ SignedInteger = header(1,0,m) ++ intbytes(x) if x<-3 ∨ 13≤x
header(0,3,x+16) if -3≤x<0
header(0,3,x) if 0≤x<13
Integers in the range [-3,12] are compactly represented using format A because they are so frequently used. Other integers are represented using format B.
Format C MUST NOT be used for SignedInteger
s. Format A MUST be
used for integers in the range -3 to 12, inclusive.
The function intbytes(x)
gives the big-endian two's-complement
binary representation of x
, taking exactly as many whole bytes as
needed to unambiguously identify the value and its sign, and m = |intbytes(x)|
. The most-significant bit in the first byte in
intbytes(x)
is the sign bit.17
For example,
[[ -257 ]] = 42 FE FF [[ -3 ]] = 3D [[ 128 ]] = 42 00 80
[[ -256 ]] = 42 FF 00 [[ -2 ]] = 3E [[ 255 ]] = 42 00 FF
[[ -255 ]] = 42 FF 01 [[ -1 ]] = 3F [[ 256 ]] = 42 01 00
[[ -254 ]] = 42 FF 02 [[ 0 ]] = 30 [[ 32767 ]] = 42 7F FF
[[ -129 ]] = 42 FF 7F [[ 1 ]] = 31 [[ 32768 ]] = 43 00 80 00
[[ -128 ]] = 41 80 [[ 12 ]] = 3C [[ 65535 ]] = 43 00 FF FF
[[ -127 ]] = 41 81 [[ 13 ]] = 41 0D [[ 65536 ]] = 43 01 00 00
[[ -4 ]] = 41 FC [[ 127 ]] = 41 7F [[ 131072 ]] = 43 02 00 00
Strings, ByteStrings and Symbols.
Syntax for these three types varies only in the value of n
supplied
to header
and open
. In each case, the payload following the header
is a binary sequence; for String
and Symbol
, it is a UTF-8
encoding of the Value
's code points, while for ByteString
it is
the raw data contained within the Value
unmodified.
Format B (known length):
[[ S ]] = header(1,n,m) ++ encode(S)
where m = |encode(S)|
and (n,encode(S)) = (1,utf8(S)) if S ∈ String
(2,S) if S ∈ ByteString
(3,utf8(S)) if S ∈ Symbol
To stream a String
, ByteString
or Symbol
, emit open(1,n)
and
then a sequence of zero or more format B chunks, followed by
close()
. Every chunk must be a ByteString
, and no chunk may be
annotated.
While the overall content of a streamed String
or Symbol
must be
valid UTF-8, individual chunks do not have to conform to UTF-8.
Fixed-length Atoms.
Fixed-length atoms all use format A, and do not have a length
representation. They repurpose the bits that format B Repr
s use to
specify lengths. Applications MUST NOT use format C with open(0,n)
for any n
.
Booleans.
[[ #false ]] = header(0,0,0) = [0x00]
[[ #true ]] = header(0,0,1) = [0x01]
Floats and Doubles.
[[ F ]] when F ∈ Float = header(0,0,2) ++ binary32(F)
[[ D ]] when D ∈ Double = header(0,0,3) ++ binary64(D)
The functions binary32(F)
and binary64(D)
yield big-endian 4- and
8-byte IEEE 754 binary representations of F
and D
, respectively.
Annotations.
To annotate a Repr
r
with some Value
v
, prepend r
with
[0x05] ++ [[v]]
.
For example, the Repr
corresponding to textual syntax @a@b[]
,
i.e. an empty sequence annotated with two symbols, a
and b
, is
[[ @a @b [] ]]
= [0x05] ++ [[a]] ++ [0x05] ++ [[b]] ++ [[ [] ]]
= [0x05, 0x71, 0x61, 0x05, 0x71, 0x62, 0x90]
Examples
Simple examples.
For the following examples, imagine an application that maps
placeholder number 0 to symbol discard
, 1 to capture
, and 2 to
observe
.
Value | Encoded byte sequence |
---|---|
<capture <discard>> |
82 11 81 10 |
<observe <speak <discard> <capture <discard>>>> |
82 12 83 75 's' 'p' 'e' 'a' 'k' 81 10 82 11 81 11 |
[1 2 3 4] (format B) |
94 31 32 33 34 |
[1 2 3 4] (format C) |
29 31 32 33 34 04 |
[-2 -1 0 1] |
94 3E 3F 30 31 |
"hello" (format B) |
55 'h' 'e' 'l' 'l' 'o' |
"hello" (format C, 2 chunks) |
25 62 'h' 'e' 63 'l' 'l' 'o' 35 |
"hello" (format C, 5 chunks) |
25 61 'h' 61 'e' 61 'l' 61 'l' 61 'o' 35 |
["hello" there #"world" [] #set{} #true #false] |
97 55 'h' 'e' 'l' 'l' 'o' 75 't' 'h' 'e' 'r' 'e' 65 'w' 'o' 'r' 'l' 'd' 90 A0 01 00 |
-257 |
42 FE FF |
-1 |
3F |
0 |
30 |
1 |
31 |
255 |
42 00 FF |
1.0f |
02 3F 80 00 00 |
1.0 |
03 3F F0 00 00 00 00 00 00 |
-1.202e300 |
03 FE 3C B7 B7 59 BF 04 26 |
The next example uses a non-Symbol
label for a record.18 The Record
<[titled person 2 thing 1] 101 "Blackwell" <date 1821 2 3> "Dr">
encodes to
85 ;; Record, generic, 4+1
95 ;; Sequence, 5
76 74 69 74 6C 65 64 ;; Symbol, "titled"
76 70 65 72 73 6F 6E ;; Symbol, "person"
32 ;; SignedInteger, "2"
75 74 68 69 6E 67 ;; Symbol, "thing"
31 ;; SignedInteger, "1"
41 65 ;; SignedInteger, "101"
59 42 6C 61 63 6B 77 65 6C 6C ;; String, "Blackwell"
84 ;; Record, generic, 3+1
74 64 61 74 65 ;; Symbol, "date"
42 07 1D ;; SignedInteger, "1821"
32 ;; SignedInteger, "2"
33 ;; SignedInteger, "3"
52 44 72 ;; String, "Dr"
JSON examples.
The examples from
RFC 8259 read as
valid Preserves, though the JSON literals true
, false
and null
read as Symbol
s. The first example:
{
"Image": {
"Width": 800,
"Height": 600,
"Title": "View from 15th Floor",
"Thumbnail": {
"Url": "http://www.example.com/image/481989943",
"Height": 125,
"Width": 100
},
"Animated" : false,
"IDs": [116, 943, 234, 38793]
}
}
encodes to binary as follows:
B2
55 "Image"
BC
55 "Width" 42 03 20
55 "Title" 5F 14 "View from 15th Floor"
58 "Animated" 75 "false"
56 "Height" 42 02 58
59 "Thumbnail"
B6
55 "Width" 41 64
53 "Url" 5F 26 "http://www.example.com/image/481989943"
56 "Height" 41 7D
53 "IDs" 94
41 74
42 03 AF
42 00 EA
43 00 97 89
and the second example:
[
{
"precision": "zip",
"Latitude": 37.7668,
"Longitude": -122.3959,
"Address": "",
"City": "SAN FRANCISCO",
"State": "CA",
"Zip": "94107",
"Country": "US"
},
{
"precision": "zip",
"Latitude": 37.371991,
"Longitude": -122.026020,
"Address": "",
"City": "SUNNYVALE",
"State": "CA",
"Zip": "94085",
"Country": "US"
}
]
encodes to binary as follows:
92
BF 10
59 "precision" 53 "zip"
58 "Latitude" 03 40 42 E2 26 80 9D 49 52
59 "Longitude" 03 C0 5E 99 56 6C F4 1F 21
57 "Address" 50
54 "City" 5D "SAN FRANCISCO"
55 "State" 52 "CA"
53 "Zip" 55 "94107"
57 "Country" 52 "US"
BF 10
59 "precision" 53 "zip"
58 "Latitude" 03 40 42 AF 9D 66 AD B4 03
59 "Longitude" 03 C0 5E 81 AA 4F CA 42 AF
57 "Address" 50
54 "City" 59 "SUNNYVALE"
55 "State" 52 "CA"
53 "Zip" 55 "94085"
57 "Country" 52 "US"
Security Considerations
Empty chunks. Chunks of zero length are prohibited in streamed
(format C) Repr
s. However, a malicious or broken encoder may include
them nonetheless. This opens up a possibility for denial-of-service:
an attacker may begin streaming a String
, for example, sending an
endless sequence of zero length chunks, appearing to make progress but
not actually doing so. Implementations MUST reject zero length
chunks when decoding, and MUST NOT produce them when encoding.
Whitespace and no-ops. Similarly, the binary format allows 0xFF
no-ops and the textual format allows arbitrary whitespace in many
positions. In streaming transfer situations, consider optional
restrictions on the amount of consecutive whitespace or the number of
consecutive no-ops that may appear.
Annotations. Also similarly, in modes where a Value
is being
read while annotations are skipped, an endless sequence of annotations
may give an illusion of progress.
Canonical form for cryptographic hashing and signing. As
specified, neither the textual nor the compact binary encoding rules
for Value
s force canonical serializations. Two serializations of the
same Value
may yield different binary Repr
s.
Acknowledgements
The use of low-order bits of each lead byte for the length of short values is inspired by a similar feature of CBOR.
The treatment of commas as whitespace in the text syntax is inspired by the same feature of EDN.
The text syntax for Boolean
s, Symbol
s, and ByteString
s is
directly inspired by Racket's lexical
syntax.
Appendix. Autodetection of textual or binary syntax
Whitespace characters 0x09
(ASCII HT (tab)), 0x0A
(LF), 0x0D
(CR), 0x20
(space) and 0x2C
(comma) are ignored at the start of a
textual-syntax Preserves Document
, and their UTF-8 encodings are
reserved lead byte values in binary-syntax Preserves.
The byte 0xFF
, signifying a no-op in binary-syntax Preserves, has no
meaning in either 7-bit ASCII or UTF-8, and therefore cannot appear in
a valid textual-syntax Preserves Document
.
If applications prefix their textual-syntax documents with e.g. a
space or newline character, and their binary-syntax documents with a
0xFF
byte, consumers of these documents may reliably autodetect the
syntax being used. In a network protocol supporting this kind of
autodetection, clients may transmit LF or 0xFF
to select text or
binary syntax, respectively.
Furthermore, if an application consistently uses Record
s for its
top-level messages,19 eschewing Atom
s in
particular, then autodetection of the encoding used for a given input
can be done as follows:
First byte of encoded input | Encoding | Other conclusions |
---|---|---|
0x80 --0x8F |
binary | Record (format B) |
0x28 |
binary | Record (format C) |
0x05 |
binary | annotated value (presumably a Record ) |
0xFF |
binary | no-op; value will follow |
--- | --- | --- |
0x7B ("<") |
text | Record |
0x40 ("@") |
text | annotated value (presumably a Record ) |
0x09 , 0x0A , 0x0D , 0x20 or 0x2C |
text | whitespace; value will follow |
Appendix. Table of lead byte values
00 - False
01 - True
02 - Float
03 - Double
04 - End stream
05 - Annotation
(0x) RESERVED 06-0F (NB. 09, 0A, 0D specially reserved)
1x - Placeholder
2x - Start Stream (NB. 20, 2C specially reserved)
3x - Small integers 0..12,-3..-1
4x - SignedInteger
5x - String
6x - ByteString
7x - Symbol
8x - Record
9x - Sequence
Ax - Set
Bx - Dictionary
(Cx) RESERVED C0-CF
(Dx) RESERVED D0-DF
(Ex) RESERVED E0-EF
(Fx) RESERVED F0-FE
FF No-op
Appendix. Bit fields within lead byte values
tt nn mmmm contents
---------- ---------
00 00 0000 False
00 00 0001 True
00 00 0010 Float, 32 bits big-endian binary
00 00 0011 Double, 64 bits big-endian binary
00 00 0100 End Stream (to match a previous Start Stream)
00 00 0101 Annotation; two more Reprs follow
00 00 1001 (ASCII HT (tab)) \
00 00 1010 (ASCII LF) |- Reserved: may be used to indicate
00 00 1101 (ASCII CR) / use of text encoding
00 01 mmmm Placeholder; m is the placeholder number
00 10 ttnn Start Stream <tt,nn>
When tt = 00 --> error
When nn = 00 --> (ASCII space)
Reserved: may be used to indicate
use of text encoding
otherwise --> error
01 --> each chunk is a ByteString
10 --> each chunk is a single encoded Value
11 --> error (RESERVED)
When nn = 00 --> (ASCII comma)
Reserved: may be used to indicate
use of text encoding
otherwise --> error
00 11 xxxx Small integers 0..12,-3..-1
01 00 mmmm SignedInteger, big-endian binary
01 01 mmmm String, UTF-8 binary
01 10 mmmm ByteString
01 11 mmmm Symbol, UTF-8 binary
10 00 mmmm Record
10 01 mmmm Sequence
10 10 mmmm Set
10 11 mmmm Dictionary
11 nn mmmm error, RESERVED
11 11 1111 no-op; unambiguous indication of binary Preserves format
Where mmmm
appears, interpret it as an unsigned 4-bit number m
. If
m
<15, let l
=m
. Otherwise, m
=15; let l
be the result of
decoding the varint that follows.
Then, if ttnn
=0001
, l
is the placeholder number; otherwise, l
is the length of the body that follows, counted in bytes for tt
=01
and in Repr
s for tt
=10
.
Appendix. Binary SignedInteger representation
Languages that provide fixed-width machine word types may find the
following table useful in encoding and decoding binary SignedInteger
values.
Integer range | Bytes required | Encoding (hex) |
---|---|---|
-3 ≤ n < 13 (numbers -3..12 encoded specially) | 1 | 3X |
-27 ≤ n < 27 (i8) | 2 | 41 XX |
-215 ≤ n < 215 (i16) | 3 | 42 XX XX |
-223 ≤ n < 223 (i24) | 4 | 43 XX XX XX |
-231 ≤ n < 231 (i32) | 5 | 44 XX XX XX XX |
-239 ≤ n < 239 (i40) | 6 | 45 XX XX XX XX XX |
-247 ≤ n < 247 (i48) | 7 | 46 XX XX XX XX XX XX |
-255 ≤ n < 255 (i56) | 8 | 47 XX XX XX XX XX XX XX |
-263 ≤ n < 263 (i64) | 9 | 48 XX XX XX XX XX XX XX XX |
Notes
-
Happily, the design of UTF-8 is such that this gives the same result as a lexicographic byte-by-byte comparison of the UTF-8 encoding of a string! ↩︎
-
The Racket programming language defines “prefab” structure types, which map well to our
Record
s. Racket supports record extensibility by encoding record supertypes into record labels as specially-formatted lists. ↩︎ -
It is occasionally (but seldom) necessary to interpret such
Symbol
labels as UTF-8 encoded IRIs. Where a label can be read as a relative IRI, it is notionally interpreted with respect to the IRIurn:uuid:6bf094a6-20f1-4887-ada7-46834a9b5b34
; where a label can be read as an absolute IRI, it stands for that IRI; and otherwise, it cannot be read as an IRI at all, and so the label simply stands for itself—for its ownValue
. ↩︎ -
The grammar of the textual syntax is a superset of JSON, with the slightly unusual feature that
true
,false
, andnull
are all read asSymbol
s, and thatSignedInteger
s are never read asDouble
s. ↩︎ -
Implementation note. When implementing printing of
Value
s using the textual syntax, consider supporting (a) optional pretty-printing with indentation, (b) optional JSON-compatible print mode for that subset ofValue
that is compatible with JSON, and (c) optional submodes for no commas, commas separating, and commas terminating elements or key/value pairs within a collection. ↩︎ -
Implementation note. Your language's standard library likely has a good routine for converting between decimal notation and IEEE 754 floating-point. However, if not, or if you are interested in the challenges of accurately reading and writing floating point numbers, see the excellent matched pair of 1990 papers by Clinger and Steele & White, and a recent follow-up by Jaffer:
Clinger, William D. ‘How to Read Floating Point Numbers Accurately’. In Proc. PLDI. White Plains, New York, 1990. https://doi.org/10.1145/93542.93557.
Steele, Guy L., Jr., and Jon L. White. ‘How to Print Floating-Point Numbers Accurately’. In Proc. PLDI. White Plains, New York, 1990. https://doi.org/10.1145/93542.93559.
Jaffer, Aubrey. ‘Easy Accurate Reading and Writing of Floating-Point Numbers’. ArXiv:1310.8121 [Cs], 27 October 2013. http://arxiv.org/abs/1310.8121. ↩︎
-
Implementation note. Be aware when implementing reading and writing of
SignedInteger
s that the data model requires arbitrary-precision integers. Your I/O routines must not truncate precision either when reading or writing aSignedInteger
. ↩︎ -
The grammar for
String
has the same effect as the JSON grammar forstring
. Some auxiliary definitions (e.g.escaped
) are lifted largely unmodified from the text of RFC 8259. ↩︎ -
In particular, note JSON's rules around the use of surrogate pairs for code points not in the Basic Multilingual Plane. We encourage implementations to avoid escaping such characters when producing output, and instead to rely on the UTF-8 encoding of the entire document to handle them correctly. ↩︎
-
Compare with the SPKI S-expression definition of “token representation”, and with the R6RS definition of identifiers. ↩︎
-
Rationale. The textual syntax cannot express every
Value
: specifically, it cannot express the several million floating-point NaNs, or the two floating-point Infinities. Since the compact binary format forValue
s expresses eachValue
with precision, embedding binaryValue
s solves the problem. ↩︎ -
Every text is ultimately physically stored as bytes; therefore, it might seem possible to escape to the raw binary form of compact binary encoding from within a pieces of textual syntax. However, while bytes must be involved in any representation of text, the text itself is logically a sequence of code points and is not intrinsically a binary structure at all. It would be incoherent to expect to be able to access the representation of the text from within the text itself. ↩︎
-
Any text-syntax annotations preceding the
#value
are prepended to any binary-syntax annotations yielded by decoding theByteString
. ↩︎ -
Some encodings are unused. All such encodings are reserved for future versions of this specification. ↩︎
-
Also known as LEB128 encoding, for unsigned integers. Varints and LEB128-encoded integers differ only for signed integers, which are not used in Preserves. ↩︎
-
In the BitTorrent encoding format, bencoding, dictionary key/value pairs must be sorted by key. This is a necessary step for ensuring serialization of
Value
s is canonical. We do not require that key/value pairs (or set elements) be in sorted order for serializedValue
s, because (a) where canonicalization is used for cryptographic signatures, it is more reliable to simply retain the exact binary form of the signed document than to depend on canonical de- and re-serialization, and (b) sorting keys or elements makes no sense in streaming serialization formats.However, a quality implementation may wish to offer the programmer the option of serializing with set elements and dictionary keys in sorted order. ↩︎
-
The value 0 needs zero bytes to identify the value, so
intbytes(0)
is the empty byte string. Non-zero values need at least one byte. ↩︎ -
It happens to line up with Racket's representation of a record label for an inheritance hierarchy where
titled
extendsperson
extendsthing
:(struct date (year month day) #:prefab) (struct thing (id) #:prefab) (struct person thing (name date-of-birth) #:prefab) (struct titled person (title) #:prefab)
For more detail on Racket's representations of record labels, see the Racket documentation for
make-prefab-struct
. ↩︎ -
Similar reasoning can be used to permit unambiguous detection of encoding when
Collection
s are allowed as top-level messages as well asRecord
s. ↩︎