Preserves: an Expressive Data Language
Tony Garnock-Jones tonyg@leastfixedpoint.com
August 2019. Version 0.0.6.
This document proposes a data model and serialization format called Preserves.
Preserves supports records with user-defined labels. This relieves the confusion caused by encoding records as dictionaries, seen in most data languages in use on the web. It also allows Preserves to easily represent the labelled sums of products as seen in many functional programming languages.
Preserves also supports the usual suite of atomic and compound data types, in particular including binary data as a distinct type from text strings. Its annotations allow separation of data from metadata such as comments, trace information, and provenance information.
Finally, Preserves defines precisely how to compare two values. Comparison is based on the data model, not on syntax or on data structures of any particular implementation language.
Starting with Semantics
Taking inspiration from functional programming, we start with a definition of the values that we want to work with and give them meaning independent of their syntax.
Our Values fall into two broad categories: atomic and compound data.
Value = Atom
      | Compound
Atom = Boolean
     | Float
     | Double
     | SignedInteger
     | String
     | ByteString
     | Symbol
Compound = Record
         | Sequence
         | Set
         | Dictionary
Total order. As we go, we will incrementally specify a total order over Values. Two values of the same kind are compared using kind-specific rules. The ordering among values of different kinds is essentially arbitrary, but having a total order is convenient for many tasks, so we define it as follows:1
(Values) Atom < Compound
(Compounds) Record < Sequence < Set < Dictionary
(Atoms) Boolean < Float < Double < SignedInteger
< String < ByteString < Symbol
Equivalence. Two Values are equal if neither is less than the other according to the total order.
Signed integers.
A SignedInteger is a signed integer of arbitrary width. SignedIntegers are compared as mathematical integers.
Unicode strings.
A String is a sequence of Unicode code-points. Strings are compared lexicographically, code-point by code-point.2
Binary data.
A ByteString is a sequence of octets. ByteStrings are compared lexicographically.
Symbols.
Programming languages like Lisp and Prolog frequently use string-like values called symbols. Here, a Symbol is, like a String, a sequence of Unicode code-points representing an identifier of some kind. Symbols are also compared lexicographically by code-point.
Booleans.
There are two Booleans, “false” and “true”. The “false” value is less than the “true” value.
IEEE floating-point values.
Floats and Doubles are single- and double-precision IEEE 754 floating-point values, respectively. Floats, Doubles and SignedIntegers are disjoint; by the rules above, every Float is less than every Double, and every SignedInteger is greater than both. Two Floats or two Doubles are to be ordered by the totalOrder predicate defined in section 5.10 of IEEE Std 754-2008.
Records.
A Record is a labelled tuple of Values, the record's fields. A label can be any Value, but is usually a Symbol.3 4 Records are compared lexicographically: first by label, then by field sequence.
Sequences.
A Sequence is a sequence of Values. Sequences are compared lexicographically.
Sets.
A Set is an unordered finite set of Values. It contains no duplicate values, following the equivalence relation induced by the total order on Values. Two Sets are compared by sorting their elements ascending using the total order and comparing the resulting Sequences.
Dictionaries.
A Dictionary is an unordered finite collection of pairs of Values. Each pair comprises a key and a value. Keys in a Dictionary are pairwise distinct. Instances of Dictionary are compared by lexicographic comparison of the sequences resulting from ordering each Dictionary's pairs in ascending order by key.
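The ordering rules above can be sketched in code. The following is a minimal illustration for a subset of kinds only, assuming a hypothetical mapping of Preserves kinds onto Python types (bool for Boolean, int for SignedInteger, str for String, bytes for ByteString, list for Sequence, frozenset for Set). Float, Double, Symbol, Record and Dictionary are left out to keep it short, and the names kind_rank and compare are inventions of this sketch.

```python
from functools import cmp_to_key

def kind_rank(v):
    """Position of v's kind in the cross-kind order (subset of kinds only)."""
    if isinstance(v, bool):        # test bool first: Python bools are also ints
        return 0                   # Boolean
    if isinstance(v, int):
        return 3                   # SignedInteger (ranks 1 and 2 are Float, Double)
    if isinstance(v, str):
        return 4                   # String
    if isinstance(v, bytes):
        return 5                   # ByteString (rank 6 is Symbol, 7 is Record)
    if isinstance(v, list):
        return 8                   # Sequence
    if isinstance(v, frozenset):
        return 9                   # Set
    raise TypeError("kind not covered by this sketch")

def compare(a, b):
    """Return -1, 0 or 1 according to the total order."""
    ra, rb = kind_rank(a), kind_rank(b)
    if ra != rb:
        return -1 if ra < rb else 1
    if isinstance(a, frozenset):
        # Sets compare as the Sequences of their elements sorted ascending.
        a = sorted(a, key=cmp_to_key(compare))
        b = sorted(b, key=cmp_to_key(compare))
    if isinstance(a, list):
        for x, y in zip(a, b):
            c = compare(x, y)
            if c != 0:
                return c
        return (len(a) > len(b)) - (len(a) < len(b))
    # bool, int, str (by code point) and bytes (by octet) compare natively.
    return (a > b) - (a < b)
```

Under this mapping, equivalence is simply compare(a, b) == 0. Python itself treats True and 1 as equal, so a faithful implementation would need distinct wrapper types rather than the built-in ones used here.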
Textual Syntax
Now we have discussed Values and their meanings, we may turn to techniques for representing Values for communication or storage.
In this section, we use case-sensitive ABNF to define a textual syntax that is easy for people to read and write.5 Most of the examples in this document are written using this syntax. In the following section, we will define an equivalent compact machine-readable syntax.
Character set.
ABNF allows easy definition of US-ASCII-based languages. However, Preserves is a Unicode-based language. Therefore, we reinterpret ABNF as a grammar for recognising sequences of Unicode code points.
Textual syntax for a Value SHOULD be encoded using UTF-8 where possible.
Whitespace.
Whitespace is defined as any number of spaces, tabs, carriage returns, line feeds, or commas.
ws = *(%x20 / %x09 / newline / ",")
newline = CR / LF
Grammar.
Standalone documents may have trailing whitespace.
Document = Value ws
Any Value may be preceded by whitespace.
Value = ws (Record / Collection / Atom / Compact)
Collection = Sequence / Dictionary / Set
Atom = Boolean / Float / Double / SignedInteger /
String / ByteString / Symbol
Each Record is an angle-bracket enclosed grouping of its label-Value followed by its field-Values.
Record = "<" Value *Value ws ">"
Sequences are enclosed in square brackets. Dictionary values are curly-brace-enclosed colon-separated pairs of values. Sets are written either as one or more values enclosed in curly braces, or zero or more values enclosed by the tokens #set{ and }.6
Sequence = "[" *Value ws "]"
Dictionary = "{" *(Value ws ":" Value) ws "}"
Set = %s"#set{" *Value ws "}" / "{" 1*Value ws "}"
Booleans are the simple literal strings #true and #false.
Boolean = %s"#true" / %s"#false"
Numeric data follow the JSON grammar, with the addition of a trailing "f" distinguishing Float from Double values. Floats and Doubles always have either a fractional part or an exponent part, where SignedIntegers never have either.7 8
Float = flt %i"f"
Double = flt
SignedInteger = int
digit1-9 = %x31-39
nat = %x30 / ( digit1-9 *DIGIT )
int = ["-"] nat
frac = "." 1*DIGIT
exp = %i"e" ["-"/"+"] 1*DIGIT
flt = int (frac exp / frac / exp)
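As an illustration of how the numeric rules separate the three kinds, here is a minimal sketch that transcribes int, frac, exp and flt into Python regular expressions. The helper name classify_number is hypothetical, and the sketch only classifies a complete literal; it does not parse the value.

```python
import re

# Regular-expression transcription of the ABNF rules int, frac, exp and flt.
INT = r"-?(?:0|[1-9][0-9]*)"
FRAC = r"\.[0-9]+"
EXP = r"[eE][-+]?[0-9]+"
FLT = rf"{INT}(?:{FRAC}{EXP}|{FRAC}|{EXP})"

def classify_number(text):
    """Return 'Float', 'Double', 'SignedInteger', or None if text is not numeric."""
    if re.fullmatch(FLT + r"[fF]", text):   # trailing "f" is case-insensitive
        return "Float"
    if re.fullmatch(FLT, text):
        return "Double"
    if re.fullmatch(INT, text):
        return "SignedInteger"
    return None
```

Note that a literal such as 1f is rejected: a Float needs a fractional or exponent part before the trailing "f".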
Strings are, as in JSON, possibly escaped text surrounded by double quotes. The escaping rules are the same as for JSON.9 10
String = %x22 *char %x22
char = unescaped / %x7C / escape (escaped / %x22 / %s"u" 4HEXDIG)
unescaped = %x20-21 / %x23-5B / %x5D-7B / %x7D-10FFFF
escape = %x5C ; \
escaped = ( %x5C / ; \ reverse solidus U+005C
%x2F / ; / solidus U+002F
%x62 / ; b backspace U+0008
%x66 / ; f form feed U+000C
%x6E / ; n line feed U+000A
%x72 / ; r carriage return U+000D
%x74 ) ; t tab U+0009
A ByteString may be written in any of three different forms.
The first is similar to a String, but prepended with a hash sign #. In addition, only Unicode code points overlapping with printable 7-bit ASCII are permitted unescaped inside such a ByteString; other byte values must be escaped by prepending a two-digit hexadecimal value with \x.
ByteString = "#" %x22 *binchar %x22
binchar = binunescaped / escape (escaped / %x22 / %s"x" 2HEXDIG)
binunescaped = %x20-21 / %x23-5B / %x5D-7E
The second is as a sequence of pairs of hexadecimal digits interleaved with whitespace and surrounded by #hex{ and }.
ByteString =/ %s"#hex{" *(ws / 2HEXDIG) ws "}"
The third is as a sequence of Base64 characters, interleaved with whitespace and surrounded by #base64{ and }. Plain and URL-safe Base64 characters are allowed.
ByteString =/ %s"#base64{" *(ws / base64char) ws "}"
base64char = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "="
A Symbol may be written in a "bare" form11 so long as it conforms to certain restrictions on the characters appearing in the symbol. Alternatively, it may be written in a quoted form. The quoted form is much the same as the syntax for Strings, including embedded escape syntax, except using a bar or pipe character (|) instead of a double quote mark.
Symbol = symstart *symcont / "|" *symchar "|"
symstart = ALPHA / sympunct / symunicode
symcont = ALPHA / sympunct / symunicode / DIGIT / "-"
sympunct = "~" / "!" / "$" / "%" / "^" / "&" / "*" /
"?" / "_" / "=" / "+" / "/" / "."
symchar = unescaped / %x22 / escape (escaped / %x7C / %s"u" 4HEXDIG)
symunicode = <any code point greater than 127 whose Unicode
category is Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me, Nd,
Nl, No, Pd, Pc, Po, Sc, Sm, Sk, So, or Co>
Finally, any Value may be represented by escaping from the textual syntax to the compact binary syntax by prefixing a ByteString containing the binary representation of the Value with #value.12 13
Compact = %s"#value" ws ByteString
Annotations.
Syntax. When written down, a Value may have an associated sequence of annotations carrying “out-of-band” contextual metadata about the value. Each annotation is, in turn, a Value, and may itself have annotations.
Value =/ ws "@" Value Value
Each annotation is preceded by @; the underlying annotated value follows its annotations. Here we extend only the syntactic nonterminal named "Value" without altering the semantic class of Values.
Equivalence. Annotations appear within syntax denoting a Value; however, the annotations are not part of the denoted value. They are only part of the syntax. Annotations do not play a part in equivalences and orderings of Values.
Reflective tools such as debuggers, user interfaces, and message routers and relays (tools which process Values generically) may use annotated inputs to tailor their operation, or may insert annotations in their outputs. By contrast, in ordinary programs, as a rule of thumb, the presence, absence or content of an annotation should not change the control flow or output of the program. Annotations are data describing Values, and are not in the domain of any specific application of Values. That is, an annotation will almost never cause a non-reflective program to do anything observably different.
Compact Binary Syntax
A Repr is a binary-syntax encoding, or representation, of either
- a Value,
- a "placeholder" for a Value, or
- an annotation on a Repr.
Each Repr comprises one or more bytes describing the kind of represented information and the length of the representation, followed by the encoded details.
For a value v, we write [[v]] for the Repr of v.
Type and Length representation.
Each Repr takes one of three possible forms:
- (A) a type-specific form, used for simple values such as Booleans or Floats, for placeholders, and for introducing annotations.
- (B) a variable-length form with length specified up-front, used for compound and variable-length atomic data structures when their sizes are known at the time serialization begins.
- (C) a variable-length streaming form with unknown or unpredictable length, used in cases when serialization begins before the number of elements or bytes in the corresponding Value is known.
Applications may choose between formats B and C depending on their needs at serialization time.
The lead byte.
Every Repr starts with a lead byte, constructed by leadbyte(t,n,m), where t,n ∈ {0,1,2,3} and 0 ≤ m < 16:
leadbyte(t,n,m) = [t*64 + n*16 + m]
The arguments t, n and m describe the rest of the representation.14
t | n | m | Meaning |
---|---|---|---|
0 | 0 | 0–3 | (format A) An Atom with fixed-length binary representation |
0 | 0 | 4 | (format C) Stream end |
0 | 0 | 5 | (format A) Annotation |
0 | 1 | | (format A) Placeholder for an application-specific Value |
0 | 2 | | (format C) Stream start |
0 | 3 | | (format A) Certain small SignedIntegers |
1 | | | (format B) An Atom with variable-length binary representation |
2 | | | (format B) A Compound with variable-length representation |
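The lead byte packs t into the top two bits, n into the next two, and m into the low four. A minimal sketch of the construction and its inverse follows; split_leadbyte is a hypothetical helper name that does not appear in the specification.

```python
def leadbyte(t, n, m):
    """Pack t (2 bits), n (2 bits) and m (4 bits) into a one-byte sequence."""
    if not (0 <= t <= 3 and 0 <= n <= 3 and 0 <= m <= 15):
        raise ValueError("t and n must be in 0..3, and m in 0..15")
    return bytes([t * 64 + n * 16 + m])

def split_leadbyte(b):
    """Recover (t, n, m) from a byte value in 0..255."""
    return (b >> 6) & 3, (b >> 4) & 3, b & 15
```

For instance, leadbyte(0, 0, 4) gives the Stream end byte 0x04, and split_leadbyte(0x84) gives (2, 0, 4).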
Encoding data of type-specific length (format A).
Each type of data defines its own rules for this format.
Encoding data of known length (format B).
Format B is used where the length l of the Value to be encoded is known when serialization begins. Format B Reprs use m in leadbyte to encode l. The length counts bytes for atomic Values, but counts contained values for compound Values.
- A length l between 0 and 14 is represented using leadbyte with m=l.
- A length of 15 or greater is represented by m=15 and additional bytes describing the length following the lead byte.
The function header(t,n,m) yields an appropriate sequence of bytes describing a Repr's type and length when t, n and m are appropriate non-negative integers:
header(t,n,m) = leadbyte(t,n,m) when m < 15
or leadbyte(t,n,15) ++ varint(m) otherwise
The additional length bytes are formatted as base 128 varints. We write varint(m) for the varint-encoding of m. Quoting the Google Protocol Buffers definition,
Each byte in a varint, except the last byte, has the most significant bit (msb) set – this indicates that there are further bytes to come. The lower 7 bits of each byte are used to store the two's complement representation of the number in groups of 7 bits, least significant group first.
The following table illustrates varint-encoding.
Number, m | m in binary, grouped into 7-bit chunks | varint(m) bytes |
---|---|---|
15 | 0001111 | 15 |
300 | 0000010 0101100 | 172 2 |
1000000000 | 0000011 1011100 1101011 0010100 0000000 | 128 148 235 220 3 |
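The varint and header definitions translate directly into code. The following is a minimal sketch that reproduces the rows of the table above; the function names mirror the ones used in this document.

```python
def varint(m):
    """Base-128 varint of a non-negative integer, least significant group first."""
    if m < 0:
        raise ValueError("lengths are non-negative")
    out = bytearray()
    while True:
        group, m = m & 0x7F, m >> 7
        if m:
            out.append(group | 0x80)   # msb set: more bytes follow
        else:
            out.append(group)          # last byte: msb clear
            return bytes(out)

def header(t, n, m):
    """Lead byte, followed by a varint length when m does not fit in 4 bits."""
    if m < 15:
        return bytes([t * 64 + n * 16 + m])
    return bytes([t * 64 + n * 16 + 15]) + varint(m)
```

For example, header(0, 1, 102) yields the two bytes 0x1F 0x66, since 102 is too large for the four-bit m field.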
Streaming data of unknown length (format C).
A Repr where the length of the Value to be encoded is variable and not known at the time serialization of the Value starts is encoded by a single Stream Start (“open”) byte, followed by zero or more chunks, followed by a matching Stream End (“close”) byte:
open(t,n) = leadbyte(0,2, t*4 + n) = [0x20 + t*4 + n]
close() = leadbyte(0,0, 4) = [0x04]
For a format C Repr of an atomic Value, each chunk is to be a format B Repr of a ByteString, no matter the type of the overall Value. Annotations are not allowed on these individual chunks.
For a format C Repr of a compound Value, each chunk is to be a single Repr, which may itself be annotated.
Each chunk within a format C Repr MUST have non-zero length. Software that decodes Reprs MUST reject Reprs that include zero-length chunks.
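A minimal sketch of the open and close bytes follows, together with the framing of a streamed compound whose elements have already been encoded. The names open_stream, CLOSE and stream_compound are hypothetical; restricting t to 1 or 2 follows the lead byte appendix, which marks the other two values as errors.

```python
def open_stream(t, n):
    """Stream Start byte for a value whose format B lead byte would use (t, n)."""
    if t not in (1, 2) or not 0 <= n <= 3:
        raise ValueError("only t=1 (atoms) and t=2 (compounds) may be streamed")
    return bytes([0x20 + t * 4 + n])

CLOSE = bytes([0x04])  # Stream End

def stream_compound(n, element_reprs):
    """Frame already-encoded element Reprs as a format C compound."""
    return open_stream(2, n) + b"".join(element_reprs) + CLOSE
```

With the small-integer encodings 31 32 33 34 for the elements 1 to 4, stream_compound(1, ...) produces 29 31 32 33 34 04, the streamed form of the Sequence [1 2 3 4].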
Records.
Format B (known length):
[[ <L F_1...F_m> ]] = header(2,0,m+1) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]]
For m fields, m+1 is supplied to header, to account for the encoding of the record label.
Format C (streaming):
[[ <L F_1...F_m> ]] = open(2,0) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]] ++ close()
Applications SHOULD prefer the known-length format for encoding Records.
Placeholders.
Applications may define an interpretation for numbered placeholders in the binary syntax, mapping each placeholder number n to a specific Value. For example, a placeholder number may be assigned for a frequently-used Record label.
A Value v for which placeholder number n has been assigned may be tersely encoded as
[[v]] = header(0,1,n) when n is a placeholder number for v
Examples. For example, a protocol may choose to assign placeholder number 4 to the symbol void, making
[[void]] = header(0,1,4) = [0x14]
[[<void>]] = header(2,0,1) ++ [[void]] = [0x81, 0x14]
or it may map symbol person to placeholder number 102, making
[[person]] = header(0,1,102) = [0x1F, 0x66]
and so
[[<person "Dr" "Elizabeth" "Blackwell">]]
= header(2,0,4) ++ [[person]] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]
= [0x84, 0x1F, 0x66] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]
for format B, or
open(2,0) ++ [[person]] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ close()
= [0x28, 0x1F, 0x66] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ [0x04]
for format C.
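The format B examples above can be reproduced with a short sketch. The helper names placeholder, record and string are hypothetical, and the placeholder numbers 4 and 102 are the ones assumed in the examples, not fixed by the specification.

```python
def varint(m):
    """Base-128 varint, least significant group first."""
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)

def header(t, n, m):
    """Lead byte, plus a varint when m is 15 or greater."""
    if m < 15:
        return bytes([t * 64 + n * 16 + m])
    return bytes([t * 64 + n * 16 + 15]) + varint(m)

def placeholder(number):
    """Repr of the Value assigned to an application-defined placeholder number."""
    return header(0, 1, number)

def string(s):
    """Format B Repr of a String."""
    data = s.encode("utf-8")
    return header(1, 1, len(data)) + data

def record(label_repr, field_reprs):
    """Format B Repr of a Record from already-encoded label and fields."""
    return header(2, 0, len(field_reprs) + 1) + label_repr + b"".join(field_reprs)

void_record = record(placeholder(4), [])
person = record(placeholder(102),
                [string("Dr"), string("Elizabeth"), string("Blackwell")])
```

Here void_record is the two bytes 81 14, and person begins with 84 1F 66 followed by the three encoded Strings.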
Sequences, Sets and Dictionaries.
Format B (known length):
[[ [X_1...X_m] ]] = header(2,1,m) ++ [[X_1]] ++...++ [[X_m]]
[[ #set{X_1...X_m} ]] = header(2,2,m) ++ [[X_1]] ++...++ [[X_m]]
[[ {K_1:V_1...K_m:V_m} ]] = header(2,3,m*2) ++ [[K_1]] ++ [[V_1]] ++...
++ [[K_m]] ++ [[V_m]]
Note that m*2 is given to header for a Dictionary, since there are two Values in each key-value pair.
Format C (streaming):
[[ [X_1...X_m] ]] = open(2,1) ++ [[X_1]] ++...++ [[X_m]] ++ close()
[[ #set{X_1...X_m} ]] = open(2,2) ++ [[X_1]] ++...++ [[X_m]] ++ close()
[[ {K_1:V_1...K_m:V_m} ]] = open(2,3) ++ [[K_1]] ++ [[V_1]] ++...
++ [[K_m]] ++ [[V_m]] ++ close()
Applications may use whichever format suits their needs on a case-by-case basis.
There is no ordering requirement on the X_i elements or K_i/V_i pairs.15 They may appear in any order.
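A minimal sketch of the format B encodings for the three collection kinds follows, operating on element Reprs that have already been encoded. The function names are hypothetical. Elements are emitted in the order given, since the format imposes no ordering requirement.

```python
def varint(m):
    """Base-128 varint, least significant group first."""
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)

def header(t, n, m):
    """Lead byte, plus a varint when m is 15 or greater."""
    if m < 15:
        return bytes([t * 64 + n * 16 + m])
    return bytes([t * 64 + n * 16 + 15]) + varint(m)

def sequence(item_reprs):
    return header(2, 1, len(item_reprs)) + b"".join(item_reprs)

def set_(item_reprs):
    return header(2, 2, len(item_reprs)) + b"".join(item_reprs)

def dictionary(pair_reprs):
    """pair_reprs is a list of (key_repr, value_repr) tuples."""
    flat = [r for pair in pair_reprs for r in pair]
    return header(2, 3, len(flat)) + b"".join(flat)   # counts keys and values
```

A Dictionary with eight pairs therefore starts with BF 10: m is 16, which no longer fits in the lead byte.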
SignedIntegers.
Format B/A (known length/fixed-size):
[[ x ]] when x ∈ SignedInteger = header(1,0,m) ++ intbytes(x) if x<-3 ∨ 13≤x
header(0,3,x+16) if -3≤x<0
header(0,3,x) if 0≤x<13
Integers in the range [-3,12] are compactly represented using format A because they are so frequently used. Other integers are represented using format B.
Format C MUST NOT be used for SignedIntegers.
The function intbytes(x) gives the big-endian two's-complement binary representation of x, taking exactly as many whole bytes as needed to unambiguously identify the value and its sign, and m = |intbytes(x)|. The most-significant bit in the first byte in intbytes(x) is the sign bit.16
For example,
[[ -257 ]] = 42 FE FF [[ -3 ]] = 3D [[ 128 ]] = 42 00 80
[[ -256 ]] = 42 FF 00 [[ -2 ]] = 3E [[ 255 ]] = 42 00 FF
[[ -255 ]] = 42 FF 01 [[ -1 ]] = 3F [[ 256 ]] = 42 01 00
[[ -254 ]] = 42 FF 02 [[ 0 ]] = 30 [[ 32767 ]] = 42 7F FF
[[ -129 ]] = 42 FF 7F [[ 1 ]] = 31 [[ 32768 ]] = 43 00 80 00
[[ -128 ]] = 41 80 [[ 12 ]] = 3C [[ 65535 ]] = 43 00 FF FF
[[ -127 ]] = 41 81 [[ 13 ]] = 41 0D [[ 65536 ]] = 43 01 00 00
[[ -4 ]] = 41 FC [[ 127 ]] = 41 7F [[ 131072 ]] = 43 02 00 00
Strings, ByteStrings and Symbols.
Syntax for these three types varies only in the value of n supplied to header and open. In each case, the payload following the header is a binary sequence; for String and Symbol, it is a UTF-8 encoding of the Value's code points, while for ByteString it is the raw data contained within the Value unmodified.
Format B (known length):
[[ S ]] = header(1,n,m) ++ encode(S)
where m = |encode(S)|
and (n,encode(S)) = (1,utf8(S)) if S ∈ String
(2,S) if S ∈ ByteString
(3,utf8(S)) if S ∈ Symbol
To stream a String, ByteString or Symbol, emit open(1,n) and then a sequence of zero or more format B chunks, followed by close(). Every chunk must be a ByteString, and no chunk may be annotated.
While the overall content of a streamed String or Symbol must be valid UTF-8, individual chunks do not have to conform to UTF-8.
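Both forms can be sketched briefly. The function names below are hypothetical, and stream_atom assumes the caller has already split the payload bytes into pieces. Each piece is wrapped as a ByteString chunk, and empty pieces are refused because zero-length chunks are prohibited.

```python
def varint(m):
    """Base-128 varint, least significant group first."""
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)

def header(t, n, m):
    """Lead byte, plus a varint when m is 15 or greater."""
    if m < 15:
        return bytes([t * 64 + n * 16 + m])
    return bytes([t * 64 + n * 16 + 15]) + varint(m)

def string(s):
    data = s.encode("utf-8")
    return header(1, 1, len(data)) + data

def bytestring(b):
    return header(1, 2, len(b)) + b

def symbol(s):
    data = s.encode("utf-8")
    return header(1, 3, len(data)) + data

def stream_atom(n, pieces):
    """Format C: n is 1 (String), 2 (ByteString) or 3 (Symbol)."""
    out = bytes([0x20 + 1 * 4 + n])            # open(1, n)
    for piece in pieces:
        if len(piece) == 0:
            raise ValueError("zero-length chunks are prohibited")
        out += bytestring(piece)               # chunks are always ByteStrings
    return out + b"\x04"                       # close()
```

Splitting "hello" into the pieces "he" and "llo" gives 25 62 'h' 'e' 63 'l' 'l' 'o' 04.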
Fixed-length Atoms.
Fixed-length atoms all use format A, and do not have a length representation. They repurpose the bits that format B Reprs use to specify lengths. Applications MUST NOT use format C with open(0,n) for any n.
Booleans.
[[ #false ]] = header(0,0,0) = [0x00]
[[ #true ]] = header(0,0,1) = [0x01]
Floats and Doubles.
[[ F ]] when F ∈ Float = header(0,0,2) ++ binary32(F)
[[ D ]] when D ∈ Double = header(0,0,3) ++ binary64(D)
The functions binary32(F) and binary64(D) yield big-endian 4- and 8-byte IEEE 754 binary representations of F and D, respectively.
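In Python, binary32 and binary64 correspond to the big-endian "f" and "d" formats of the standard struct module. A minimal sketch, with hypothetical function names:

```python
import struct

def encode_float(f):
    """header(0,0,2) followed by the big-endian IEEE 754 binary32 form."""
    return b"\x02" + struct.pack(">f", f)

def encode_double(d):
    """header(0,0,3) followed by the big-endian IEEE 754 binary64 form."""
    return b"\x03" + struct.pack(">d", d)
```

Python has a single float type, so the caller must decide which of the two Preserves kinds a given number represents.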
Annotations.
To annotate a Repr r with some Value v, prepend r with [0x05] ++ [[v]].
For example, the Repr corresponding to textual syntax @a@b[], i.e. an empty sequence annotated with two symbols, a and b, is
[[ @a @b [] ]]
= [0x05] ++ [[a]] ++ [0x05] ++ [[b]] ++ [[ [] ]]
= [0x05, 0x71, 0x61, 0x05, 0x71, 0x62, 0x90]
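The annotation rule amounts to prefixing. A minimal sketch, using a hypothetical annotate helper that works on already-encoded Reprs:

```python
def annotate(repr_bytes, annotation_reprs):
    """Prefix a Repr with its annotations, in order, each introduced by 0x05."""
    out = b""
    for annotation in annotation_reprs:
        out += b"\x05" + annotation
    return out + repr_bytes

# The symbols a and b are 71 61 and 71 62; the empty sequence is 90.
example = annotate(b"\x90", [b"\x71a", b"\x71b"])
```

The result matches the byte sequence of the worked example above.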
Examples
Simple examples.
For the following examples, imagine an application that maps placeholder number 0 to symbol discard, 1 to capture, and 2 to observe.
Value | Encoded byte sequence |
---|---|
<capture <discard>> | 82 11 81 10 |
<observe <speak <discard> <capture <discard>>>> | 82 12 83 75 's' 'p' 'e' 'a' 'k' 81 10 82 11 81 10 |
[1 2 3 4] (format B) | 94 31 32 33 34 |
[1 2 3 4] (format C) | 29 31 32 33 34 04 |
[-2 -1 0 1] | 94 3E 3F 30 31 |
"hello" (format B) | 55 'h' 'e' 'l' 'l' 'o' |
"hello" (format C, 2 chunks) | 25 62 'h' 'e' 63 'l' 'l' 'o' 04 |
"hello" (format C, 5 chunks) | 25 61 'h' 61 'e' 61 'l' 61 'l' 61 'o' 04 |
["hello" there #"world" [] #set{} #true #false] | 97 55 'h' 'e' 'l' 'l' 'o' 75 't' 'h' 'e' 'r' 'e' 65 'w' 'o' 'r' 'l' 'd' 90 A0 01 00 |
-257 | 42 FE FF |
-1 | 3F |
0 | 30 |
1 | 31 |
255 | 42 00 FF |
1.0f | 02 3F 80 00 00 |
1.0 | 03 3F F0 00 00 00 00 00 00 |
-1.202e300 | 03 FE 3C B7 B7 59 BF 04 26 |
The next example uses a non-Symbol label for a record.17 The Record
<[titled person 2 thing 1] 101 "Blackwell" <date 1821 2 3> "Dr">
encodes to
85 ;; Record, generic, 4+1
95 ;; Sequence, 5
76 74 69 74 6C 65 64 ;; Symbol, "titled"
76 70 65 72 73 6F 6E ;; Symbol, "person"
32 ;; SignedInteger, "2"
75 74 68 69 6E 67 ;; Symbol, "thing"
31 ;; SignedInteger, "1"
41 65 ;; SignedInteger, "101"
59 42 6C 61 63 6B 77 65 6C 6C ;; String, "Blackwell"
84 ;; Record, generic, 3+1
74 64 61 74 65 ;; Symbol, "date"
42 07 1D ;; SignedInteger, "1821"
32 ;; SignedInteger, "2"
33 ;; SignedInteger, "3"
52 44 72 ;; String, "Dr"
JSON examples.
The examples from RFC 8259 read as valid Preserves, though the JSON literals true, false and null read as Symbols. The first example:
{
"Image": {
"Width": 800,
"Height": 600,
"Title": "View from 15th Floor",
"Thumbnail": {
"Url": "http://www.example.com/image/481989943",
"Height": 125,
"Width": 100
},
"Animated" : false,
"IDs": [116, 943, 234, 38793]
}
}
encodes to binary as follows:
B2
55 "Image"
BC
55 "Width" 42 03 20
55 "Title" 5F 14 "View from 15th Floor"
58 "Animated" 75 "false"
56 "Height" 42 02 58
59 "Thumbnail"
B6
55 "Width" 41 64
53 "Url" 5F 26 "http://www.example.com/image/481989943"
56 "Height" 41 7D
53 "IDs" 94
41 74
42 03 AF
42 00 EA
43 00 97 89
and the second example:
[
{
"precision": "zip",
"Latitude": 37.7668,
"Longitude": -122.3959,
"Address": "",
"City": "SAN FRANCISCO",
"State": "CA",
"Zip": "94107",
"Country": "US"
},
{
"precision": "zip",
"Latitude": 37.371991,
"Longitude": -122.026020,
"Address": "",
"City": "SUNNYVALE",
"State": "CA",
"Zip": "94085",
"Country": "US"
}
]
encodes to binary as follows:
92
BF 10
59 "precision" 53 "zip"
58 "Latitude" 03 40 42 E2 26 80 9D 49 52
59 "Longitude" 03 C0 5E 99 56 6C F4 1F 21
57 "Address" 50
54 "City" 5D "SAN FRANCISCO"
55 "State" 52 "CA"
53 "Zip" 55 "94107"
57 "Country" 52 "US"
BF 10
59 "precision" 53 "zip"
58 "Latitude" 03 40 42 AF 9D 66 AD B4 03
59 "Longitude" 03 C0 5E 81 AA 4F CA 42 AF
57 "Address" 50
54 "City" 59 "SUNNYVALE"
55 "State" 52 "CA"
53 "Zip" 55 "94085"
57 "Country" 52 "US"
Conventions for Common Data Types
The Value data type is essentially an S-Expression, able to represent semi-structured data over ByteString, String, SignedInteger atoms and so on.18
However, users need a wide variety of data types for representing domain-specific values such as various kinds of encoded and normalized text, calendrical values, machine words, and so on.
Appropriately-labelled Records denote these domain-specific data types.19
All of these conventions are optional. They form a layer atop the core Value structure. Non-domain-specific tools do not in general need to treat them specially.
Validity. Many of the labels we will describe in this section come with side-conditions on the contents of labelled Records. It is possible to construct an instance of Value that violates these side-conditions without ceasing to be a Value or becoming unrepresentable. However, we say that such a Value is invalid because it fails to honour the necessary side-conditions.
Implementations SHOULD allow two modes of working: one which treats all Values identically, without regard for side-conditions, and one which enforces validity (i.e. side-conditions) when reading, writing, or constructing Values.
IOLists.
Inspired by Erlang's notions of iolist() and iodata(), an IOList is any tree constructed from ByteStrings and Sequences. Formally, an IOList is either a ByteString or a Sequence of IOLists.
IOLists can be useful for vectored I/O. Additionally, the flexibility of IOList trees allows annotation of interior portions of a tree.
Comments.
String values used as annotations are conventionally interpreted as comments.
@"I am a comment for the Dictionary"
{
@"I am a comment for the key"
key: @"I am a comment for the value"
value
}
@"I am a comment for this entire IOList"
[
#hex{00010203}
@"I am a comment for the middle half of the IOList"
@"A second comment for the same portion of the IOList"
[
@"I am a comment for the following ByteString"
#hex{04050607}
#hex{08090A0B}
]
#hex{0C0D0E0F}
]
MIME-type tagged binary data.
Many internet protocols use media types (a.k.a. MIME types) to indicate the format of some associated binary data. For this purpose, we define MIMEData to be a record labelled mime with two fields, the first being a Symbol, the media type, and the second being a ByteString, the binary data.
While each media type may define its own rules for comparing documents, we define ordering among MIMEData representations of such media types following the general rules for ordering of Records.
Examples.
Value | Encoded hexadecimal byte sequence |
---|---|
<mime application/octet-stream #"abcde"> | 83 74 6D 69 6D 65 7F 18 61 70 70 6C 69 63 61 74 69 6F 6E 2F 6F 63 74 65 74 2D 73 74 72 65 61 6D 65 61 62 63 64 65 |
<mime text/plain #"ABC"> | 83 74 6D 69 6D 65 7A 74 65 78 74 2F 70 6C 61 69 6E 63 41 42 43 |
<mime application/xml #"<xhtml/>"> | 83 74 6D 69 6D 65 7F 0F 61 70 70 6C 69 63 61 74 69 6F 6E 2F 78 6D 6C 68 3C 78 68 74 6D 6C 2F 3E |
<mime text/csv #"123,234,345"> | 83 74 6D 69 6D 65 78 74 65 78 74 2F 63 73 76 6B 31 32 33 2C 32 33 34 2C 33 34 35 |
Applications making heavy use of mime records may choose to use a placeholder number for the symbol mime as well as the symbols for individual media types. For example, if placeholder number 1 were chosen for mime, and placeholder number 7 for text/plain, the second example above, <mime text/plain #"ABC">, would be encoded as 83 11 17 63 41 42 43.
Unicode normalization forms.
Unicode defines multiple normalization forms for text. While no particular normalization form is required for Strings, users may need to unambiguously signal or require a particular normalization form. A NormalizedString is a Record labelled with unicode-normalization and having two fields, the first of which is a Symbol specifying the normalization form used (e.g. nfc, nfd, nfkc, nfkd), and the second of which is a String whose underlying code point representation MUST be normalized according to the named normalization form.
IRIs (URIs, URLs, URNs, etc.).
An IRI is a Record labelled with iri and having one field, a String which is the IRI itself and which MUST be a valid absolute or relative IRI.
Machine words.
The definition of SignedInteger captures all integers. However, in certain circumstances it can be valuable to assert that a number inhabits a particular range, such as a fixed-width machine word.
A family of labels in and un, for n ∈ {8,16,32,64} (that is, i8, i16, i32, i64, u8, u16, u32 and u64), denote n-bit-wide signed and unsigned range restrictions, respectively. Records with these labels MUST have one field, a SignedInteger, which MUST fall within the appropriate range. That is, to be valid,
- in <i8 x>, -128 <= x <= 127.
- in <u8 x>, 0 <= x <= 255.
- in <i16 x>, -32768 <= x <= 32767.
- etc.
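The range side-conditions generalise to every width in the family. The following is a minimal sketch of a validity check, assuming the label is available as a Python string and the field as a Python int; is_valid_machine_word is a hypothetical name.

```python
import re

def is_valid_machine_word(label, x):
    """True if <label x> satisfies the range restriction for its label."""
    match = re.fullmatch(r"([iu])(8|16|32|64)", label)
    if match is None or isinstance(x, bool) or not isinstance(x, int):
        return False
    bits = int(match.group(2))
    if match.group(1) == "i":
        # Signed: -2^(bits-1) <= x <= 2^(bits-1) - 1
        return -(1 << (bits - 1)) <= x <= (1 << (bits - 1)) - 1
    # Unsigned: 0 <= x <= 2^bits - 1
    return 0 <= x < (1 << bits)
```

A reader working in the validity-enforcing mode could call such a check after decoding a Record, while the generic mode would skip it.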
Anonymous Tuples and Unit.
A Tuple is a Record with label tuple and zero or more fields, denoting an anonymous tuple of values.
The 0-ary tuple, <tuple>, denotes the empty tuple, sometimes called "unit" or "void" (but not e.g. JavaScript's "undefined" value).
Null and Undefined.
Tony Hoare's "billion-dollar mistake" can be represented with the 0-ary Record <null>. An "undefined" value can be represented as <undefined>.
Dates and Times.
Dates, times, moments, and timestamps can be represented with a Record with label rfc3339 having a single field, a String, which MUST conform to one of the full-date, partial-time, full-time, or date-time productions of section 5.6 of RFC 3339.
Security Considerations
Empty chunks. Chunks of zero length are prohibited in streamed (format C) Reprs. However, a malicious or broken encoder may include them nonetheless. This opens up a possibility for denial-of-service: an attacker may begin streaming a String, for example, sending an endless sequence of zero length chunks, appearing to make progress but not actually doing so. Implementations MUST reject zero length chunks when decoding, and MUST NOT produce them when encoding.
Whitespace. Similarly, the textual format for Values allows arbitrary whitespace in many positions. In streaming transfer situations, consider optional restrictions on the amount of consecutive whitespace that may appear in a serialized Value.
Annotations. Also similarly, in modes where a Value is being read while annotations are skipped, an endless sequence of annotations may give an illusion of progress.
Canonical form for cryptographic hashing and signing. As specified, neither the textual nor the compact binary encoding rules for Values force canonical serializations. Two serializations of the same Value may yield different binary Reprs.
Appendix. Table of lead byte values
00 - False
01 - True
02 - Float
03 - Double
04 - End stream
05 - Annotation
(0x) RESERVED 06-0F
1x - Placeholder
2x - Start Stream
3x - Small integers 0..12,-3..-1
4x - SignedInteger
5x - String
6x - ByteString
7x - Symbol
8x - Record
9x - Sequence
Ax - Set
Bx - Dictionary
(Cx) RESERVED C0-CF
(Dx) RESERVED D0-DF
(Ex) RESERVED E0-EF
(Fx) RESERVED F0-FF
Appendix. Bit fields within lead byte values
tt nn mmmm contents
---------- ---------
00 00 0000 False
00 00 0001 True
00 00 0010 Float, 32 bits big-endian binary
00 00 0011 Double, 64 bits big-endian binary
00 00 0100 End Stream (to match a previous Start Stream)
00 00 0101 Annotation; two more Reprs follow
00 01 mmmm Placeholder; m is the placeholder number
00 10 ttnn Start Stream <tt,nn>
When tt = 00 --> error
01 --> each chunk is a ByteString
10 --> each chunk is a single encoded Value
11 --> error (RESERVED)
00 11 xxxx Small integers 0..12,-3..-1
01 00 mmmm SignedInteger, big-endian binary
01 01 mmmm String, UTF-8 binary
01 10 mmmm ByteString
01 11 mmmm Symbol, UTF-8 binary
10 00 mmmm Record
10 01 mmmm Sequence
10 10 mmmm Set
10 11 mmmm Dictionary
11 nn mmmm error, RESERVED
Where mmmm appears, interpret it as an unsigned 4-bit number m. If m<15, let l=m. Otherwise, m=15; let l be the result of decoding the varint that follows.
Then, if ttnn=0001, l is the placeholder number; otherwise, l is the length of the body that follows, counted in bytes for tt=01 and in Reprs for tt=10.
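The decoding rule for m and l can be sketched as follows. The names read_varint and read_header are hypothetical, and the sketch covers only lead bytes that carry an mmmm field in the table above (placeholders and format B Reprs); it rejects the rest rather than interpreting them.

```python
def read_varint(data, pos):
    """Decode a base-128 varint starting at data[pos]; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:            # msb clear: this was the last byte
            return result, pos
        shift += 7

def read_header(data, pos=0):
    """Return (t, n, l, next_pos) for a placeholder or format B Repr."""
    lead = data[pos]
    t, n, m = lead >> 6, (lead >> 4) & 3, lead & 15
    if t == 3:
        raise ValueError("reserved lead byte")
    if t == 0 and n != 1:
        raise ValueError("lead byte does not carry a length or placeholder number")
    pos += 1
    if m < 15:
        return t, n, m, pos
    l, pos = read_varint(data, pos)
    return t, n, l, pos
```

For t=0, n=1 the returned l is the placeholder number; for t=1 it counts payload bytes, and for t=2 it counts contained Reprs.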
Appendix. Why not Just Use JSON?
JSON offers syntax for numbers, strings, booleans, null, arrays and string-keyed maps. However, it suffers from two major problems. First, it offers no semantics for the syntax: it is left to each implementation to determine how to treat each JSON term. This causes interoperability and even security issues. Second, JSON's lack of support for type tags leads to awkward and incompatible encodings of type information in terms of the fixed suite of constructors on offer.
There are other minor problems with JSON having to do with its syntax. Examples include its relative verbosity and its lack of support for binary data.
JSON syntax doesn't mean anything
When are two JSON values the same? When are they different?
The specifications are largely silent on these questions. Different JSON implementations give different answers.
Specifically, JSON does not:
- assign any meaning to numbers,20
- determine how strings are to be compared,21
- determine whether object key ordering is significant,22 or
- determine whether duplicate object keys are permitted, what it would mean if they were, or how to determine a duplicate in the first place.23
In short, JSON syntax doesn't denote anything.24 25
Some examples:
- are the JSON values 1, 1.0, and 1e0 the same or different?
- are the JSON values 1.0 and 1.0000000000000001 the same or different?
- are the JSON strings "päron" (UTF-8 70c3a4726f6e) and "päron" (UTF-8 7061cc88726f6e) the same or different?
- are the JSON objects {"a":1, "b":2} and {"b":2, "a":1} the same or different?
- which, if any, of {"a":1, "a":2}, {"a":1} and {"a":2} are the same? Are all three legal?
- are {"päron":1} and {"päron":1} the same or different?
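As an illustration, the following sketch puts these questions to one particular implementation, the json module in Python's standard library. The helper name same is hypothetical, introduced only for this example. The answers it prints are that module's choices, not anything required by the JSON specifications, and other implementations may answer differently.

```python
import json
import unicodedata


def same(a: str, b: str) -> bool:
    """Parse two JSON texts and compare the results with Python's ==."""
    return json.loads(a) == json.loads(b)


# Two spellings of the same word: precomposed (NFC) and decomposed (NFD).
NFC = '"p\u00e4ron"'
NFD = '"pa\u0308ron"'

answers = {
    "1 vs 1.0": same("1", "1.0"),
    "1 parsed as": type(json.loads("1")).__name__,
    "1.0 parsed as": type(json.loads("1.0")).__name__,
    "1e0 parsed as": type(json.loads("1e0")).__name__,
    "1.0 vs 1.0000000000000001": same("1.0", "1.0000000000000001"),
    "NFC vs NFD string": same(NFC, NFD),
    "NFC vs NFD after normalising": (
        unicodedata.normalize("NFC", json.loads(NFC))
        == unicodedata.normalize("NFC", json.loads(NFD))
    ),
    "key order ignored": same('{"a":1, "b":2}', '{"b":2, "a":1}'),
    "duplicate keys give": json.loads('{"a":1, "a":2}'),
}

for question, answer in answers.items():
    print(f"{question}: {answer}")
```

Here 1 and 1.0 compare equal even though they are parsed as different Python types, 1.0 and 1.0000000000000001 collapse to the same double, the two strings differ unless the caller normalises them, key order is ignored, and the last of a set of duplicate keys silently wins.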
JSON can multiply nicely, but it can't add very well
JSON includes a fixed set of types: numbers, strings, booleans, null, arrays and string-keyed maps. Domain-specific data must be encoded into these types. For example, dates and email addresses are often represented as strings with an implicit internal structure.
There is no convention for labelling a value as belonging to a
particular category. Instead, JSON-encoded data are often labelled in
an ad-hoc way. Multiple incompatible approaches exist. For example, a
"money" structure containing a currency field and an amount may be
represented in any number of ways:
{ "_type": "money", "currency": "EUR", "amount": 10 }
{ "type": "money", "value": { "currency": "EUR", "amount": 10 } }
[ "money", { "currency": "EUR", "amount": 10 } ]
{ "@money": { "currency": "EUR", "amount": 10 } }
This causes particular problems when JSON is used to represent sum or union types, such as "either a value or an error, but not both". Again, multiple incompatible approaches exist.
For example, imagine an API for depositing money in an account. The response might be either a "success" response indicating the new balance, or one of a set of possible errors.
Sometimes, a pair of values is used, with null marking the option not taken.26
{ "ok": { "balance": 210 }, "error": null }
{ "ok": null, "error": "Unauthorized" }
The branch not chosen is sometimes present, sometimes omitted as if it were an optional field:
{ "ok": { "balance": 210 } }
{ "error": "Unauthorized" }
Sometimes, an array of a label and a value is used:
[ "ok", { "balance": 210 } ]
[ "error", "Unauthorized" ]
Sometimes, the shape of the data is sufficient to distinguish among the alternatives, and the label is left implicit:
{ "balance": 210 }
"Unauthorized"
JSON itself does not offer any guidance for which of these options to choose. In many real cases on the web, poor choices have led to encodings that are irrecoverably ambiguous.
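To make the cost concrete, here is a sketch of what a client has to do if it cannot know in advance which of the encodings above a server uses. The function name classify and its return convention are hypothetical, chosen only for this illustration. Note how the label-free encoding forces the client to guess from the shape of the data, and how the pair-of-values encoding admits documents that fit neither alternative.

```python
from typing import Any, Tuple


def classify(doc: Any) -> Tuple[str, Any]:
    """Return ("ok", payload) or ("error", payload) for a decoded response."""
    # Array of a label and a value: ["ok", {...}] or ["error", "..."].
    if isinstance(doc, list) and len(doc) == 2 and doc[0] in ("ok", "error"):
        return doc[0], doc[1]

    # Pair of values, with the branch not taken either null or omitted.
    if isinstance(doc, dict) and ("ok" in doc or "error" in doc):
        ok, err = doc.get("ok"), doc.get("error")
        if ok is not None and err is not None:
            raise ValueError("both 'ok' and 'error' are non-null")
        if ok is None and err is None:
            raise ValueError("neither 'ok' nor 'error' carries a value")
        return ("ok", ok) if ok is not None else ("error", err)

    # Implicit label: guess from the shape of the data.
    if isinstance(doc, dict) and "balance" in doc:
        return "ok", doc
    if isinstance(doc, str):
        return "error", doc

    raise ValueError("unrecognised response shape")


responses = [
    {"ok": {"balance": 210}, "error": None},
    {"error": "Unauthorized"},
    ["ok", {"balance": 210}],
    "Unauthorized",
]
for r in responses:
    print(classify(r))
```

Every branch in this function is a convention the client had to learn out of band; nothing in the JSON text itself says which one applies.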
Open questions
Q. Should "symbols" instead be URIs? Relative, usually; relative to what? Some domain-specific base URI?
Q. Literal small integers: are they pulling their weight? They're not absolutely necessary.
Q. Should we go for trying to make the data ordering line up with the
encoding ordering? We'd have to only use streaming forms, and avoid
the small integer encoding, and not store record arities, and sort
sets and dictionaries, and mask floats and doubles (perhaps
like this),
and perhaps pick a specific NaN, and I don't know what to do about
SignedIntegers. Perhaps make them more like float formats, with the
byte count acting as a kind of exponent underneath the sign bit.
- Perhaps define separate additional canonicalization restrictions? Doesn't help the ordering, but does help the equivalence.
- Canonicalization and early-bailout-equivalence-checking are in tension with support for streaming values.
Q. To remain compatible with JSON, portions of the text syntax have to
remain case-insensitive (%i"..."). However, non-JSON extensions do
not. There's only one (?) at the moment, the %i"f" in Float;
should it be changed to case-sensitive?
Q. Should IOLists be wrapped in an identifying unary record constructor?
TODO: Examples of the ordering. "bzz" < "c" < "caa"; #true < 3 < "3" < |3|
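A minimal sketch of how the ordering among atoms of different kinds could be realised, assuming Python values stand in for Preserves atoms (bool for Boolean, float for Double, int for SignedInteger, str for String, bytes for ByteString) and a hypothetical Symbol class stands in for Symbol. It further assumes that Symbols compare like Strings, code point by code point. Float has no rank here because Python has no separate single-precision type; NaN handling and compound values are left out.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class Symbol:
    """Hypothetical stand-in for a Preserves Symbol such as |3|."""
    name: str


def kind_rank(v: Any) -> int:
    """Position of v's kind in Boolean < Float < Double < SignedInteger
    < String < ByteString < Symbol (rank 1, Float, is unused here)."""
    if isinstance(v, bool):  # must be tested before int: bool is an int subclass
        return 0
    if isinstance(v, float):
        return 2
    if isinstance(v, int):
        return 3
    if isinstance(v, str):
        return 4
    if isinstance(v, bytes):
        return 5
    if isinstance(v, Symbol):
        return 6
    raise TypeError(f"not an atom in this sketch: {v!r}")


def sort_key(v: Any) -> Tuple[int, Any]:
    """Compare first by kind, then by the kind-specific payload."""
    payload = v.name if isinstance(v, Symbol) else v
    return (kind_rank(v), payload)


print(sorted(["caa", "c", "bzz"], key=sort_key))
print(sorted([Symbol("3"), "3", 3, True], key=sort_key))
```

Because the kind rank comes first in the key, payloads of different Python types are never compared with one another.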
TODO: Probably should add a canonicalized subset. Consider adding
explicit "I promise this is canonical" marker, like a BOM, which
identifies a binary value as (first) binary and (second, optionally)
as canonical. UTF-8 disallows byte 0xFF from appearing anywhere in a
text; this might be a good candidate for a marker sequence.
((Actually, perhaps 0x10 would be good! It corresponds to DLE, "data
link escape"; it is not a printable ASCII character, and is disallowed
in the textual Preserves grammar; and it is also mnemonic for "version
0", since it is the Preserves binary encoding of the small integer
zero.))
Notes
1. The observant reader may note that the ordering here is the same as that implied by the tagging scheme used in the concrete binary syntax for Values.
2. Happily, the design of UTF-8 is such that this gives the same result as a lexicographic byte-by-byte comparison of the UTF-8 encoding of a string!
3. The Racket programming language defines “prefab” structure types, which map well to our Records. Racket supports record extensibility by encoding record supertypes into record labels as specially-formatted lists.
4. It is occasionally (but seldom) necessary to interpret such Symbol labels as UTF-8 encoded IRIs. Where a label can be read as a relative IRI, it is notionally interpreted with respect to the IRI urn:uuid:6bf094a6-20f1-4887-ada7-46834a9b5b34; where a label can be read as an absolute IRI, it stands for that IRI; and otherwise, it cannot be read as an IRI at all, and so the label simply stands for itself, that is, for its own Value.
5. The grammar of the textual syntax is a superset of JSON, with the slightly unusual feature that true, false, and null are all read as Symbols, and that SignedIntegers are never read as Doubles.
6. Implementation note. When implementing printing of Values using the textual syntax, consider supporting (a) optional pretty-printing with indentation, (b) optional JSON-compatible print mode for that subset of Value that is compatible with JSON, and (c) optional submodes for no commas, commas separating, and commas terminating elements or key/value pairs within a collection.
7. Implementation note. Your language's standard library likely has a good routine for converting between decimal notation and IEEE 754 floating-point. However, if not, or if you are interested in the challenges of accurately reading and writing floating point numbers, see the excellent matched pair of 1990 papers by Clinger and Steele & White, and a recent follow-up by Jaffer:
   Clinger, William D. ‘How to Read Floating Point Numbers Accurately’. In Proc. PLDI. White Plains, New York, 1990. https://doi.org/10.1145/93542.93557.
   Steele, Guy L., Jr., and Jon L. White. ‘How to Print Floating-Point Numbers Accurately’. In Proc. PLDI. White Plains, New York, 1990. https://doi.org/10.1145/93542.93559.
   Jaffer, Aubrey. ‘Easy Accurate Reading and Writing of Floating-Point Numbers’. ArXiv:1310.8121 [Cs], 27 October 2013. http://arxiv.org/abs/1310.8121.
8. Implementation note. Be aware when implementing reading and writing of SignedIntegers that the data model requires arbitrary-precision integers. Your I/O routines must not truncate precision either when reading or writing a SignedInteger.
9. The grammar for String has the same effect as the JSON grammar for string. Some auxiliary definitions (e.g. escaped) are lifted largely unmodified from the text of RFC 8259.
10. In particular, note JSON's rules around the use of surrogate pairs for code points not in the Basic Multilingual Plane. We encourage implementations to avoid escaping such characters when producing output, and instead to rely on the UTF-8 encoding of the entire document to handle them correctly.
11. Compare with the SPKI S-expression definition of "token representation", and with the R6RS definition of identifiers.
12. Rationale. The textual syntax cannot express every Value: specifically, it cannot express the several million floating-point NaNs, or the two floating-point Infinities. Since the compact binary format for Values expresses each Value with precision, embedding binary Values solves the problem.
13. Every text is ultimately physically stored as bytes; therefore, it might seem possible to escape to the raw binary form of compact binary encoding from within a piece of textual syntax. However, while bytes must be involved in any representation of text, the text itself is logically a sequence of code points and is not intrinsically a binary structure at all. It would be incoherent to expect to be able to access the representation of the text from within the text itself.
14. Some encodings are unused. All such encodings are reserved for future versions of this specification.
15. In the BitTorrent encoding format, bencoding, dictionary key/value pairs must be sorted by key. This is a necessary step for ensuring serialization of Values is canonical. We do not require that key/value pairs (or set elements) be in sorted order for serialized Values, because (a) where canonicalization is used for cryptographic signatures, it is more reliable to simply retain the exact binary form of the signed document than to depend on canonical de- and re-serialization, and (b) sorting keys or elements makes no sense in streaming serialization formats.
    However, a quality implementation may wish to offer the programmer the option of serializing with set elements and dictionary keys in sorted order.
16. The value 0 needs zero bytes to identify the value, so intbytes(0) is the empty byte string. Non-zero values need at least one byte.
17. It happens to line up with Racket's representation of a record label for an inheritance hierarchy where titled extends person extends thing:
    (struct date (year month day) #:prefab)
    (struct thing (id) #:prefab)
    (struct person thing (name date-of-birth) #:prefab)
    (struct titled person (title) #:prefab)
    For more detail on Racket's representations of record labels, see the Racket documentation for make-prefab-struct.
18. Rivest's S-Expressions are in many ways similar to Preserves. However, while they include binary data and sequences, and an obvious equivalence for them exists, they lack numbers per se as well as any kind of unordered structure such as sets or maps. In addition, while "display hints" allow labelling of binary data with an intended interpretation, they cannot be attached to any other kind of structure, and the "hint" itself can only be a binary blob.
19. Given Record's existence, it may seem odd that Dictionary, Set, Float, etc. are given special treatment. Preserves aims to offer a useful basic equivalence predicate to programmers, and so if a data type demands a special equivalence predicate, as Dictionary, Set and Float all do, then the type should be included in the base language. Otherwise, it can be represented as a Record and treated separately. Boolean, String and Symbol are seeming exceptions. The first two merit inclusion because of their cultural importance, while Symbols are included to allow their use as Record labels. Primitive Symbol support avoids a bootstrapping issue.
20. Section 6 of RFC 8259 does go so far as to indicate “good interoperability can be achieved” by imagining that parsers are able reliably to understand the syntax of numbers as denoting an IEEE 754 double-precision floating-point value.
21. Section 8.3 of RFC 8259 suggests that if an implementation compares strings used as object keys “code unit by code unit”, then it will interoperate with other such implementations, but neither requires this behaviour nor discusses comparisons of strings used in other contexts.
22. Section 4 of RFC 8259 remarks that “[implementations] differ as to whether or not they make the ordering of object members visible to calling software.”
23. Section 4 of RFC 8259 is the only place in the specification that mentions the issue. It explicitly sanctions implementations supporting duplicate keys, noting only that “when the names within an object are not unique, the behavior of software that receives such an object is unpredictable.” Implementations are free to choose any behaviour at all in this situation, including signalling an error, or discarding all but one of a set of duplicates.
24. The XML world has the concept of XML infoset. Loosely speaking, XML infoset is the denotation of an XML document; the meaning of the document.
25. Most other recent data languages are like JSON in specifying only a syntax with no associated semantics. While some do make a sketch of a semantics, the result is often underspecified (e.g. in terms of how strings are to be compared), overly machine-oriented (e.g. treating 32-bit integers as fundamentally distinct from 64-bit integers and from floating-point numbers), overly fine (e.g. giving visibility to the order in which map entries are written), or all three.
26. What is the meaning of a document where both ok and error are non-null? What might happen when a program is presented with such a document?