Preserves: Semantic Serialization of Node-labelled Data
_________
<_________> Tony Garnock-Jones <tonyg@leastfixedpoint.com>
| FRμIT | September 2018
|Preserves| Version 0.0.2
\_________/
Most data serialization formats used on the web represent edge-labelled semi-structured data.
This document proposes a data model and serialization format that takes a node-labelled approach.
This makes it both extensible and much more like S-expressions, and lets it easily represent the labelled sums of products found in Rust, Haskell, OCaml, and other functional programming languages.
Starting with Semantics
Taking inspiration from functional programming, we start with a definition of the values that we want to work with and give them meaning independent of their syntax. We will treat syntax separately, later in this document.
Value = Atom
| Compound
Atom = Boolean
| Float
| Double
| SignedInteger
| String
| ByteString
| Symbol
Compound = Record
| Sequence
| Set
| Dictionary
Our Values fall into two broad categories: atomic and compound data.1
Total order. As we go, we will incrementally specify a total order over Values. Two values of the same kind are compared using kind-specific rules. The ordering among values of different kinds is essentially arbitrary, but having a total order is convenient for many tasks, so we define it as follows:2
(Values) Atom < Compound
(Compounds) Record < Sequence < Set < Dictionary
(Atoms) Boolean < Float < Double < SignedInteger
< String < ByteString < Symbol
Equivalence. Two Values are equal if neither is less than the other according to the total order.
Signed integers.
A SignedInteger is a signed integer of arbitrary width. SignedIntegers are compared as mathematical integers. We will write examples of SignedIntegers using standard mathematical notation.
Examples. 10; -6; 0.
Non-examples. NaN (the clue is in the name!); ∞ (not finite); 0.2 (not an integer); 1/7 (likewise); 2+i3 (likewise); √2 (likewise).
Unicode strings.
A String is a sequence of Unicode code-points. Two Strings are compared lexicographically, code-point by code-point.3 We will write examples of Strings as text surrounded by double-quotes “"” using a monospace font.
Examples. "Hello world", an eleven-code-point string; "z水𝄞", the string containing the three Unicode code-points z (0x7A), 水 (0x6C34) and 𝄞 (0x1D11E); "", the empty string.
Normalization forms. Unicode defines multiple normalization forms for text. No particular normalization form is required for Strings; see below.
Binary data.
A ByteString is an ordered sequence of zero or more integers in the inclusive range [0..255]. ByteStrings are compared lexicographically, byte by byte. We will only write examples of ByteStrings that contain bytes mapping to printable ASCII characters, using “#"” as an opening quote mark and “"” as a closing quote mark.
Examples. The ByteString containing the integers 65, 66 and 67 (corresponding to ASCII characters A, B and C) is written as #"ABC". The empty ByteString is written as #"". N.B. Despite appearances, these are binary data.
Symbols or identifiers.
Programming languages like Lisp and Prolog frequently use string-like values called symbols. Here, a Symbol is, like a String, a sequence of Unicode code-points, intended to represent an identifier of some kind. Symbols are also compared lexicographically by code-point. We will write examples including only non-empty sequences of non-whitespace characters, using a monospace font without quotation marks.
Examples. hello-world; utf8-string; exact-integer?.
Booleans.
There are exactly two Boolean values, “false” and “true”. The “false” value compares less-than the “true” value. We write #f for “false”, and #t for “true”.
Examples. #f; #t.
IEEE floating-point values.
A Float is a single-precision IEEE 754 floating-point value; a Double is a double-precision IEEE 754 floating-point value. Floats, Doubles and SignedIntegers are considered disjoint, and so by the total order defined above (Float < Double < SignedInteger), every Float is less than every Double, and every SignedInteger is greater than both. Two Floats or two Doubles are to be ordered by the totalOrder predicate defined in section 5.10 of IEEE Std 754-2008.
We write examples using standard mathematical notation, avoiding NaN and infinities, using a suffix f or d to indicate Float or Double, respectively.
Examples. 10f; -6d; 0f; 0.5d; -1.202e300d.
Non-examples. 10, -6, and 0, because writing them this way indicates SignedIntegers, not Floats or Doubles.
Records.
A Record is a labelled tuple of zero or more Values, called the record's fields. A record's label is, itself, a Value, though it will usually be a Symbol.4 5 Records are compared lexicographically as if they were just tuples; that is, first by their labels, and then by the remainder of their fields. We will only write examples of Records having labels that are Symbols entirely composed of ASCII characters. Such Records will be written as a parenthesised, space-separated sequence of their label followed by their fields.
Examples. The Record with label foo and fields 1, 2 and 3 is written (foo 1 2 3); the Record with label void and no fields is written (void).
Sequences.
A Sequence is a general-purpose, variable-length ordered sequence of zero or more Values. Sequences are compared lexicographically, appealing to the ordering on Values for comparisons at each position in the Sequences. We write examples space-separated, surrounded with square brackets.
Examples. [], the empty sequence; [1 2 3], the sequence of SignedIntegers 1, 2 and 3.
Sets.
A Set is an unordered finite set of Values. It contains no duplicate values, following the equivalence relation induced by the total order on Values. Two Sets are compared by sorting their elements using the total order and comparing the resulting sequences as Sequences. We write examples space-separated, surrounded with curly braces, prefixed by #set.
Examples. #set{}, the empty set; #set{#set{}}, the set containing only the empty set; #set{4 "hello" (void) 9.0f}, the set containing 4, the string "hello", the record with label void and no fields, and the Float denoting the number 9.0; #set{1 1.0f}, the set containing a SignedInteger and a Float, both denoting the number 1; #set{(mime application/xml #"<x/>") (mime application/xml #"<x />")}, a set containing two different type-labelled byte arrays.6
Non-examples. #set{1 1 1}, because it contains multiple equivalent Values.
Dictionaries, hash-tables or maps.
A Dictionary is an unordered finite collection of zero or more pairs of Values. Each pair comprises a key and a value. Keys in a Dictionary must be pairwise distinct. Instances of Dictionary are compared by lexicographic comparison of the sequences resulting from ordering each Dictionary's pairs in ascending order by key. Examples are written as a #dict-prefixed, curly-brace-surrounded sequence of space-separated key-value pairs, each written with a colon between the key and value.
Examples. #dict{}, the empty dictionary; #dict{a:1}, the dictionary mapping the Symbol a to the SignedInteger 1; #dict{1:a}, mapping 1 to a; #dict{"hi":0 hi:0 there:[]}, having a String and two Symbol keys, and SignedInteger and Sequence values.
Non-examples. #dict{a:1 b:2 a:3}, because it contains duplicate keys; #dict{[]:[] []:99}, for the same reason.
Syntax
Now that we have discussed Values and their meanings, we may turn to techniques for representing Values for communication or storage.
The syntax we have used for the examples so far is inadequate in many ways, not least of which is that it cannot represent every Value.
Separation of the meaning of a piece of syntax from the syntax itself opens the door to domain-specific syntaxes, all equivalent and interconvertible.7 With a robust semantic foundation, connections to other data languages can also be made.
Binary syntax
For now, we limit our attention to an easily-parsed, easily-produced machine-readable syntax.
A Repr is an encoding, or representation, of a specific Value. Each Repr comprises one or more bytes describing first the kind of represented Value and the length of the representation, and then the encoded details of the Value itself.
For a value v, we write [[v]] for the Repr of v.
The following figure summarises the definitions below:
tt nn mmmm varint(m) contents
-------------------------------
00 00 0000 False
00 00 0001 True
00 00 0010 Float, 32 bits big-endian binary
00 00 0011 Double, 64 bits big-endian binary
00 00 x1xx RESERVED
00 00 1xxx RESERVED
00 01 xxxx RESERVED
00 10 ttnn Start Stream <tt,nn>
When tt = 00 --> error
01 --> each chunk is a <tt,nn> piece
1x --> each chunk is a single encoded Value
00 11 ttnn End Stream <tt,nn> (must match preceding Start Stream)
01 00 mmmm ... SignedInteger, big-endian binary
01 01 mmmm ... String, UTF-8 binary
01 10 mmmm ... ByteString
01 11 mmmm ... Symbol, UTF-8 binary
10 00 mmmm ... application-specific Record
10 01 mmmm ... application-specific Record
10 10 mmmm ... application-specific Record
10 11 mmmm ... Record
11 00 mmmm ... Sequence
11 01 mmmm ... Set
11 10 mmmm ... Dictionary
11 11 xxxx RESERVED
If mmmm = 1111, varint(m) is present; otherwise, m is the length
Type and Length representation
Each Repr takes one of three possible forms:
- (A) a fixed-length form, used for simple values such as Booleans or Floats.
- (B) a variable-length form with length specified up-front, used for almost all Records as well as for most Sequences and Strings, when their sizes are known at the time serialization begins.
- (C) a variable-length streaming form with unknown or unpredictable length, used only seldom for Records, since the number of fields in a Record is usually statically known, but sometimes used for Sequences, Strings etc., such as in cases when serialization begins before the number of elements or bytes in the corresponding Value is known.
Applications may choose between formats (B) and (C) depending on their needs at serialization time.
Every Repr, however, starts with a lead byte describing the remainder of the representation.
The lead byte
The lead byte is constructed by a function leadbyte:
leadbyte(t,n,m) = [t*64 + n*16 + m]
Both t and n are two-bit unsigned numbers; m is a four-bit unsigned number.
The lead byte describes the rest of the representation as follows:8
- leadbyte(0,0,-) (format A) represents an Atom with fixed-length binary representation.
- leadbyte(0,1,-) (format A) is RESERVED.
- leadbyte(0,2,-) (format C) is a Stream Start byte.
- leadbyte(0,3,-) (format C) is a Stream End byte.
- leadbyte(1,-,-) (format B) represents an Atom with variable-length binary representation.
- leadbyte(2,-,-) (format B) represents a Record.
- leadbyte(3,-,-) (format B) represents a Sequence, Set or Dictionary.
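The bit layout of the lead byte can be sketched in Python as follows. Only leadbyte comes from this document; split_leadbyte and describe are hypothetical helper names introduced here for illustration, and the classification simply mirrors the list above.

```python
def leadbyte(t: int, n: int, m: int) -> bytes:
    """Pack two 2-bit fields and one 4-bit field into a single lead byte."""
    if not (0 <= t <= 3 and 0 <= n <= 3 and 0 <= m <= 15):
        raise ValueError("t and n must fit in 2 bits, m in 4 bits")
    return bytes([t * 64 + n * 16 + m])


def split_leadbyte(b: int) -> tuple:
    """Hypothetical inverse of leadbyte: recover (t, n, m) from a byte value."""
    return (b >> 6) & 0b11, (b >> 4) & 0b11, b & 0b1111


def describe(b: int) -> str:
    """Hypothetical helper: classify a lead byte according to the list above."""
    t, n, _m = split_leadbyte(b)
    if t == 0:
        return ["fixed-length Atom (format A)", "RESERVED",
                "Stream Start (format C)", "Stream End (format C)"][n]
    if t == 1:
        return "variable-length Atom (format B)"
    if t == 2:
        return "Record (format B)"
    if n == 3:
        return "RESERVED"  # lead bytes F0-FF are unused
    return "Sequence, Set or Dictionary (format B)"
```

For instance, describe(0x2C) classifies the byte that opens a streamed Sequence, and split_leadbyte(0x93) recovers the triple (2, 1, 3).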
Encoding data of fixed length (format A)
Each specific type of data defines its own rules for this format.
Encoding data of known length (format B)
A Repr where the length of the Value to be encoded is variable but known uses the value of m in leadbyte to encode its length. The length counts bytes for atomic Values, but counts contained values for compound Values.
- A length l between 0 and 14 is represented using leadbyte with m=l.
- A length of 15 or greater is represented by m=15 and additional bytes describing the length following the lead byte.
The function header(t,n,m) yields an appropriate sequence of bytes describing a Repr's type and length when t, n and m are appropriate non-negative integers:
header(t,n,m) = leadbyte(t,n,m) when m < 15
or leadbyte(t,n,15) ++ varint(m) otherwise
The additional length bytes are formatted as base 128 varints. We write varint(m) for the varint-encoding of m. Quoting the Google Protocol Buffers definition,
Each byte in a varint, except the last byte, has the most significant bit (msb) set – this indicates that there are further bytes to come. The lower 7 bits of each byte are used to store the two's complement representation of the number in groups of 7 bits, least significant group first.
Examples.
- The varint representation of 15 is just the byte 15.
- 300 (binary, grouped into 7-bit chunks, 10 0101100) varint-encodes to the two bytes 172 and 2.
- 1000000000 (binary 11 1011100 1101011 0010100 0000000) varint-encodes to bytes 128, 148, 235, 220, and 3.
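A minimal Python sketch of varint and header, following the definitions above. It assumes lengths are non-negative integers and is meant only to reproduce the worked examples, not to serve as a reference implementation.

```python
def varint(m: int) -> bytes:
    """Base-128 varint: 7 bits per byte, least significant group first."""
    if m < 0:
        raise ValueError("lengths are non-negative")
    out = bytearray()
    while True:
        group = m & 0x7F
        m >>= 7
        if m:
            out.append(group | 0x80)  # more bytes follow, so set the msb
        else:
            out.append(group)         # last byte, msb clear
            return bytes(out)


def header(t: int, n: int, m: int) -> bytes:
    """Lead byte, plus a varint length when m does not fit in four bits."""
    if m < 15:
        return bytes([t * 64 + n * 16 + m])
    return bytes([t * 64 + n * 16 + 15]) + varint(m)
```

A length of exactly 15 already needs the long form, because m=15 in the lead byte is the marker for "a varint follows".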
Streaming data of unknown length (format C)
A Repr where the length of the Value to be encoded is variable and not known at the time serialization of the Value starts is encoded by a single Stream Start byte, followed by zero or more chunks, followed by a matching Stream End byte:
startbyte(t,n) = leadbyte(0,2, t*4 + n)
endbyte(t,n) = leadbyte(0,3, t*4 + n)
For a Repr of a Value containing binary data, each chunk is to be a format B Repr of the same type as the overall Repr.
For a Repr of a Value containing other Values, each chunk is to be a single Repr.
Records
Format B (known length):
[[ (L F_1 ... F_m) ]] = header(2,3,m+1) ++ [[L]] ++ [[F_1]] ++ ... ++ [[F_m]]
For m fields, m+1 is supplied to header, to account for the encoding of the record label.
Format C (streaming):
[[ (L F_1 ... F_m) ]]
= startbyte(2,3) ++ [[L]] ++ [[F_1]] ++ ... ++ [[F_m]] ++ endbyte(2,3)
Applications SHOULD prefer the known-length format for encoding Records.
Application-specific short form for labels
Any given protocol using Preserves may additionally define an interpretation for n ∈ {0,1,2}, mapping each short form label number n to a specific record label. When encoding m fields with short form label number n, format B becomes
header(2,n,m) ++ [[F_1]] ++ ... ++ [[F_m]]
and format C becomes
startbyte(2,n) ++ [[F_1]] ++ ... ++ [[F_m]] ++ endbyte(2,n)
Examples. A protocol may choose to map records labelled void to n=0, making
[[(void)]] = header(2,0,0) = [0x80]
or it may map records labelled person to short form label number 1, making
[[(person "Dr" "Elizabeth" "Blackwell")]]
= header(2,1,3) ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]
= [0x93] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]
for format B, or
= startbyte(2,1) ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ endbyte(2,1)
= [0x29] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ [0x39]
for format C.
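The short form encodings above can be reproduced with a small Python sketch. The names short_record and short_string are hypothetical, and the sketch assumes fewer than 15 fields and strings of fewer than 15 UTF-8 bytes so that no varint length is needed.

```python
def leadbyte(t: int, n: int, m: int) -> bytes:
    return bytes([t * 64 + n * 16 + m])


def startbyte(t: int, n: int) -> bytes:
    return leadbyte(0, 2, t * 4 + n)


def endbyte(t: int, n: int) -> bytes:
    return leadbyte(0, 3, t * 4 + n)


def short_string(s: str) -> bytes:
    """Format B String Repr; this sketch only handles short strings."""
    data = s.encode("utf-8")
    if len(data) >= 15:
        raise NotImplementedError("longer strings need the varint length form")
    return leadbyte(1, 1, len(data)) + data


def short_record(n: int, fields: list, streaming: bool = False) -> bytes:
    """Record with short form label number n; fields are already-encoded Reprs."""
    if n not in (0, 1, 2):
        raise ValueError("short form label numbers are 0, 1 and 2")
    body = b"".join(fields)
    if streaming:
        return startbyte(2, n) + body + endbyte(2, n)
    if len(fields) >= 15:
        raise NotImplementedError("more fields need the varint length form")
    return leadbyte(2, n, len(fields)) + body


fields = [short_string(s) for s in ("Dr", "Elizabeth", "Blackwell")]
person_b = short_record(1, fields)                  # begins with 0x93
person_c = short_record(1, fields, streaming=True)  # wrapped in 0x29 ... 0x39
```

Note that the label itself never appears in the output; both sides must agree out of band on what label number 1 means.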
Sequences, Sets and Dictionaries
Format B (known length):
[[ [X_1 ... X_m] ]] = header(3,0,m) ++ [[X_1]] ++ ... ++ [[X_m]]
[[ #set{X_1 ... X_m} ]] = header(3,1,m) ++ [[X_1]] ++ ... ++ [[X_m]]
[[ #dict{K_1:V_1 ... K_m:V_m} ]]
= header(3,2,m) ++ [[K_1]] ++ [[V_1]] ++ ... ++ [[K_m]] ++ [[V_m]]
Format C (streaming):
[[ [X_1 ... X_m] ]] = startbyte(3,0) ++ [[X_1]] ++ ... ++ [[X_m]] ++ endbyte(3,0)
[[ #set{X_1 ... X_m} ]] = startbyte(3,1) ++ [[X_1]] ++ ... ++ [[X_m]] ++ endbyte(3,1)
[[ #dict{K_1:V_1 ... K_m:V_m} ]]
= startbyte(3,2) ++ [[K_1]] ++ [[V_1]] ++ ... ++ [[K_m]] ++ [[V_m]] ++ endbyte(3,2)
Applications may use whichever format suits their needs on a case-by-case basis.
There is no ordering requirement on the X_i elements or K_i/V_i pairs.9 They may appear in any order.
Note that header(3,3,m) and startbyte(3,3)/endbyte(3,3) are unused and reserved.
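As an illustration of the Sequence and Set rules, here is a hedged Python sketch. The helper names small_int and collection are hypothetical; the sketch handles only one-byte integers and fewer than 15 elements, and it leaves Dictionaries out.

```python
def leadbyte(t: int, n: int, m: int) -> bytes:
    return bytes([t * 64 + n * 16 + m])


def small_int(x: int) -> bytes:
    """Repr of a SignedInteger in -128..127, which is all this sketch needs."""
    if not -128 <= x <= 127:
        raise ValueError("sketch only handles one-byte integers")
    if x == 0:
        return leadbyte(1, 0, 0)  # zero has an empty body
    return leadbyte(1, 0, 1) + x.to_bytes(1, "big", signed=True)


def collection(n: int, reprs: list, streaming: bool = False) -> bytes:
    """n = 0 for a Sequence, 1 for a Set; reprs are already-encoded elements."""
    body = b"".join(reprs)
    if streaming:
        # startbyte(3,n) and endbyte(3,n) bracket the element Reprs
        return leadbyte(0, 2, 3 * 4 + n) + body + leadbyte(0, 3, 3 * 4 + n)
    if len(reprs) >= 15:
        raise NotImplementedError("longer collections need the varint length form")
    return leadbyte(3, n, len(reprs)) + body


seq_b = collection(0, [small_int(i) for i in (1, 2, 3, 4)])
seq_c = collection(0, [small_int(i) for i in (1, 2, 3, 4)], streaming=True)

# Two byte-wise different Reprs of the same Set #set{1 2}: element order is free.
set_a = collection(1, [small_int(1), small_int(2)])
set_b = collection(1, [small_int(2), small_int(1)])
```

Because element order is unconstrained, a decoder comparing Sets has to compare decoded Values rather than raw bytes.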
Variable-length Atoms
SignedInteger
Format B (known length):
[[ x ]] when x ∈ SignedInteger = header(1,0,m) ++ intbytes(x)
where m = |intbytes(x)|
and intbytes(x) = a big-endian two's-complement representation
of the signed integer x, taking exactly as
many whole bytes as needed to unambiguously
identify the value
Format C MUST NOT be used for SignedIntegers.
The value 0 needs zero bytes to identify the value, so intbytes(0) is the empty byte string. Non-zero values need at least one byte; the most-significant bit in the first byte in intbytes(x) for x≠0 is the sign bit.
For example,
[[ -257 ]] = [0x42, 0xFE, 0xFF]
[[ -256 ]] = [0x42, 0xFF, 0x00]
[[ -255 ]] = [0x42, 0xFF, 0x01]
[[ -254 ]] = [0x42, 0xFF, 0x02]
[[ -129 ]] = [0x42, 0xFF, 0x7F]
[[ -128 ]] = [0x41, 0x80]
[[ -127 ]] = [0x41, 0x81]
[[ -2 ]] = [0x41, 0xFE]
[[ -1 ]] = [0x41, 0xFF]
[[ 0 ]] = [0x40]
[[ 1 ]] = [0x41, 0x01]
[[ 127 ]] = [0x41, 0x7F]
[[ 128 ]] = [0x42, 0x00, 0x80]
[[ 255 ]] = [0x42, 0x00, 0xFF]
[[ 256 ]] = [0x42, 0x01, 0x00]
[[ 32767 ]] = [0x42, 0x7F, 0xFF]
[[ 32768 ]] = [0x43, 0x00, 0x80, 0x00]
[[ 65535 ]] = [0x43, 0x00, 0xFF, 0xFF]
[[ 65536 ]] = [0x43, 0x01, 0x00, 0x00]
[[ 131072 ]] = [0x43, 0x02, 0x00, 0x00]
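The table above can be checked against a short Python sketch of intbytes. The function signed_integer is a hypothetical name for the complete format B Repr, and the sketch assumes the encoded integer occupies fewer than 15 bytes so that the length fits in the lead byte.

```python
def intbytes(x: int) -> bytes:
    """Shortest big-endian two's-complement encoding; zero is the empty string."""
    if x == 0:
        return b""
    # Magnitude bits plus one sign bit, rounded up to whole bytes.
    magnitude_bits = x.bit_length() if x > 0 else (~x).bit_length()
    return x.to_bytes((magnitude_bits + 1 + 7) // 8, "big", signed=True)


def signed_integer(x: int) -> bytes:
    """Format B Repr of a SignedInteger (hypothetical helper name)."""
    body = intbytes(x)
    if len(body) >= 15:
        raise NotImplementedError("wider integers need the varint length form")
    return bytes([1 * 64 + 0 * 16 + len(body)]) + body
```

The boundary cases are the interesting ones: 127 and -128 each fit in one byte, while 128 and -129 need a second byte to keep the sign bit correct.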
String
Format B (known length):
[[ S ]] when S ∈ String = header(1,1,m) ++ utf8(S)
where m = |utf8(S)|
and utf8(S) = the UTF-8 encoding of S
To stream a String, emit startbyte(1,1) and then a sequence of zero or more format B String chunks, followed by endbyte(1,1).
While the overall content of a streamed String must be valid UTF-8, individual chunks do not have to conform to UTF-8.
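To make the chunking rule concrete, the following Python sketch streams a String and reads it back. The function names are hypothetical, chunks are assumed to be at most 14 bytes, and the reader handles only this simple case. The last example splits the three-byte UTF-8 encoding of 水 across two chunks, neither of which is valid UTF-8 on its own.

```python
def string_chunk(data: bytes) -> bytes:
    """Format B String Repr for a chunk of at most 14 bytes."""
    if len(data) > 14:
        raise NotImplementedError("longer chunks need the varint length form")
    return bytes([0x50 + len(data)]) + data  # header(1,1,m) for m < 15


def stream_string(chunks: list) -> bytes:
    """startbyte(1,1), the chunks, then endbyte(1,1)."""
    return b"\x25" + b"".join(string_chunk(c) for c in chunks) + b"\x35"


def read_streamed_string(repr_: bytes) -> str:
    """Concatenate chunk payloads first, then decode the whole as UTF-8."""
    if repr_[0] != 0x25 or repr_[-1] != 0x35:
        raise ValueError("not a streamed String")
    payload = bytearray()
    i = 1
    while i < len(repr_) - 1:
        lead = repr_[i]
        if lead >> 4 != 0x5 or lead & 0x0F == 0x0F:
            raise ValueError("expected a short format B String chunk")
        m = lead & 0x0F
        payload += repr_[i + 1 : i + 1 + m]
        i += 1 + m
    return payload.decode("utf-8")


hello = stream_string([b"he", b"llo"])
water = stream_string([b"\xe6", b"\xb0\xb4"])  # 水 split mid-code-point
```

Decoding each chunk separately would fail on the second example, which is why validation has to wait until the payloads are joined.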
ByteString
Format B (known length):
[[ B ]] when B ∈ ByteString = header(1,2,m) ++ B
where m = |B|
To stream a ByteString, emit startbyte(1,2) and then a sequence of zero or more format B ByteString chunks, followed by endbyte(1,2).
Symbol
Format B (known length):
[[ S ]] when S ∈ Symbol = header(1,3,m) ++ utf8(S)
where m = |utf8(S)|
and utf8(S) = the UTF-8 encoding of S
To stream a Symbol, emit startbyte(1,3) and then a sequence of zero or more format B Symbol chunks, followed by endbyte(1,3).
Fixed-length Atoms
Fixed-length atoms all use format A, and do not have a length representation. They repurpose the bits that format B Reprs use to specify lengths. Applications MUST NOT use format C with startbyte(0,n) or endbyte(0,n) for any n.
Booleans
[[ #f ]] = header(0,0,0) = [0x00]
[[ #t ]] = header(0,0,1) = [0x01]
Floats and Doubles
[[ F ]] when F ∈ Float = header(0,0,2) ++ binary32(F)
[[ D ]] when D ∈ Double = header(0,0,3) ++ binary64(D)
where binary32(F) and binary64(D) are big-endian 4- and 8-byte
IEEE 754 binary representations
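In Python, the standard struct module produces the big-endian IEEE 754 layouts directly, so the fixed-length atoms can be sketched as below. The function names are hypothetical. Python's own float is a double, so float_repr rounds its argument to single precision when packing.

```python
import struct


def boolean_repr(b: bool) -> bytes:
    """#f is the lead byte 0x00, #t is 0x01."""
    return b"\x01" if b else b"\x00"


def float_repr(f: float) -> bytes:
    """Lead byte 0x02 followed by 4 bytes of big-endian binary32."""
    return b"\x02" + struct.pack(">f", f)


def double_repr(d: float) -> bytes:
    """Lead byte 0x03 followed by 8 bytes of big-endian binary64."""
    return b"\x03" + struct.pack(">d", d)
```

These Reprs carry no length field: the lead byte alone tells a decoder how many bytes follow.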
Examples
For the following examples, imagine an application that maps Record short form label number 0 to label discard, 1 to capture, and 2 to observe.
| Value | Encoded hexadecimal byte sequence |
|---|---|
| (capture (discard)) | 91 80 |
| (observe (speak (discard) (capture (discard)))) | A1 B3 75 73 70 65 61 6B 80 91 80 |
| [1 2 3 4] (format B) | C4 41 01 41 02 41 03 41 04 |
| [1 2 3 4] (format C) | 2C 41 01 41 02 41 03 41 04 3C |
| [-2 -1 0 1] | C4 41 FE 41 FF 40 41 01 |
| "hello" (format B) | 55 68 65 6C 6C 6F |
| "hello" (format C, 2 chunks) | 25 52 68 65 53 6C 6C 6F 35 |
| "hello" (format C, 5 chunks) | 25 52 68 65 52 6C 6C 50 50 51 6F 35 |
| ["hello" there #"world" [] #set{} #t #f] | C7 55 68 65 6C 6C 6F 75 74 68 65 72 65 65 77 6F 72 6C 64 C0 D0 01 00 |
| -257 | 42 FE FF |
| -1 | 41 FF |
| 0 | 40 |
| 1 | 41 01 |
| 255 | 42 00 FF |
| 1f | 02 3F 80 00 00 |
| 1d | 03 3F F0 00 00 00 00 00 00 |
| -1.202e300d | 03 FE 3C B7 B7 59 BF 04 26 |
Finally, a larger example, using a non-Symbol label for a record.10 The Record
([titled person 2 thing 1]
101
"Blackwell"
(date 1821 2 3)
"Dr")
encodes to
B5 ;; Record, generic, 4+1
C5 ;; Sequence, 5
76 74 69 74 6C 65 64 ;; Symbol, "titled"
76 70 65 72 73 6F 6E ;; Symbol, "person"
41 02 ;; SignedInteger, "2"
75 74 68 69 6E 67 ;; Symbol, "thing"
41 01 ;; SignedInteger, "1"
41 65 ;; SignedInteger, "101"
59 42 6C 61 63 6B 77 65 6C 6C ;; String, "Blackwell"
B4 ;; Record, generic, 3+1
74 64 61 74 65 ;; Symbol, "date"
42 07 1D ;; SignedInteger, "1821"
41 02 ;; SignedInteger, "2"
41 03 ;; SignedInteger, "3"
52 44 72 ;; String, "Dr"
Conventions for Common Data Types
The Value data type is essentially an S-Expression, able to represent semi-structured data over ByteString, String, SignedInteger atoms and so on.
However, users need a wide variety of data types for representing domain-specific values such as various kinds of encoded and normalized text, calendrical values, machine words, and so on.
We use appropriately-labelled Records to denote these domain-specific data types.
All of these conventions are optional. They form a layer atop the core Value structure. Non-domain-specific tools do not in general need to treat them specially.
Validity. Many of the labels we will describe in this section come with side-conditions on the contents of labelled Records. It is possible to construct an instance of Value that violates these side-conditions without ceasing to be a Value or becoming unrepresentable. However, we say that such a Value is invalid because it fails to honour the necessary side-conditions.
Implementations SHOULD allow two modes of working: one which treats all Values identically, without regard for side-conditions, and one which enforces validity (i.e. side-conditions) when reading, writing, or constructing Values.
MIME-type tagged binary data
Many internet protocols use media types (a.k.a. MIME types) to indicate the format of some associated binary data. For this purpose, we define MIMEData to be a record labelled mime with two fields, the first being a Symbol, the media type, and the second being a ByteString, the binary data.
While each media type may define its own rules for comparing documents, we define ordering among MIMEData representations of such media types lexicographically over the (Symbol, ByteString) pair.
Examples.
| Value | Encoded hexadecimal byte sequence |
|---|---|
| (mime application/octet-stream #"abcde") | B3 74 6D 69 6D 65 7F 18 61 70 70 6C 69 63 61 74 69 6F 6E 2F 6F 63 74 65 74 2D 73 74 72 65 61 6D 65 61 62 63 64 65 |
| (mime text/plain #"ABC") | B3 74 6D 69 6D 65 7A 74 65 78 74 2F 70 6C 61 69 6E 63 41 42 43 |
| (mime application/xml #"<xhtml/>") | B3 74 6D 69 6D 65 7F 0F 61 70 70 6C 69 63 61 74 69 6F 6E 2F 78 6D 6C 68 3C 78 68 74 6D 6C 2F 3E |
| (mime text/csv #"123,234,345") | B3 74 6D 69 6D 65 78 74 65 78 74 2F 63 73 76 6B 31 32 33 2C 32 33 34 2C 33 34 35 |
Applications making heavy use of mime records may choose to use a short form label number for the record type. For example, if short form label number 1 were chosen, the second example above, (mime text/plain #"ABC"), would be encoded with "92" in place of "B3 74 6D 69 6D 65".
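The mime encodings above, including the switch to a short form label, can be reproduced with the following Python sketch. The function mime and its short_form parameter are hypothetical names chosen for this illustration.

```python
def varint(m: int) -> bytes:
    out = bytearray()
    while True:
        group, m = m & 0x7F, m >> 7
        out.append(group | 0x80 if m else group)
        if not m:
            return bytes(out)


def header(t: int, n: int, m: int) -> bytes:
    if m < 15:
        return bytes([t * 64 + n * 16 + m])
    return bytes([t * 64 + n * 16 + 15]) + varint(m)


def symbol(s: str) -> bytes:
    data = s.encode("utf-8")
    return header(1, 3, len(data)) + data


def bytestring(b: bytes) -> bytes:
    return header(1, 2, len(b)) + b


def mime(media_type: str, data: bytes, short_form=None) -> bytes:
    """Encode a MIMEData record, optionally with a short form label number."""
    fields = symbol(media_type) + bytestring(data)
    if short_form is None:
        # Generic record: the label counts towards the length, hence 2 + 1.
        return header(2, 3, 3) + symbol("mime") + fields
    return header(2, short_form, 2) + fields
```

With short_form=1 the six label bytes collapse into the single lead byte 0x92, at the cost of requiring both peers to agree on the mapping.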
Text
Normalization forms
In order for users to unambiguously signal or require a particular normalization form, we define a NormalizedString, which is a Record labelled with unicode-normalization and having two fields, the first of which is a Symbol specifying the normalization form used (e.g. nfc, nfd, nfkc, nfkd), and the second of which is a String whose underlying code point representation MUST be normalized according to the named normalization form.
IRIs (URIs, URLs, URNs, etc.)
An IRI is a Record labelled with iri and having one field, a String which is the IRI itself and which MUST be a valid absolute or relative IRI.
Machine words
The definition of SignedInteger captures all integers. However, in certain circumstances it can be valuable to assert that a number inhabits a particular range, such as a fixed-width machine word.
A family of labels, written i followed by n and u followed by n for n ∈ {16,32,64} (that is, i16, u16, i32 and so on), denotes n-bit-wide signed and unsigned range restrictions, respectively. Records with these labels MUST have one field, a SignedInteger, which MUST fall within the appropriate range. That is, to be valid,
- in (i16 x), -32768 <= x <= 32767.
- in (u16 x), 0 <= x <= 65535.
- in (i32 x), -2147483648 <= x <= 2147483647.
- etc.
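The range side-conditions generalise to every width in the family. A hedged Python sketch, with hypothetical function names, might check validity like this:

```python
import re


def word_range(label: str) -> tuple:
    """Inclusive (low, high) bounds for a machine-word label such as i16 or u64."""
    match = re.fullmatch(r"([iu])(16|32|64)", label)
    if match is None:
        raise ValueError("not a machine-word label: " + repr(label))
    bits = int(match.group(2))
    if match.group(1) == "i":
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1


def is_valid_word(label: str, fields: list) -> bool:
    """True when the record has exactly one integer field within range."""
    # type(...) is int rejects Python booleans, which are not SignedIntegers.
    if len(fields) != 1 or type(fields[0]) is not int:
        return False
    low, high = word_range(label)
    return low <= fields[0] <= high
```

A validating implementation could run such a check when reading or constructing Values, while a non-validating one would pass (u16 65536) through unchanged.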
Anonymous Tuples and Unit
A Tuple is a Record with label tuple and zero or more fields, denoting an anonymous tuple of values.
The 0-ary tuple, (tuple), denotes the empty tuple, sometimes called "unit" or "void" (but not e.g. JavaScript's "undefined" value).
Null and Undefined
Tony Hoare's "billion-dollar mistake" can be represented with the 0-ary Record (null). An "undefined" value can be represented as (undefined).
Dates and Times
Dates, times, moments, and timestamps can be represented with a Record with label rfc3339 having a single field, a String, which MUST conform to one of the full-date, partial-time, full-time, or date-time productions of section 5.6 of RFC 3339.
Representing Values in Programming Languages
We have given a definition of Value and its semantics, and proposed a concrete syntax for communicating and storing Values. We now turn to suggested representations of Values as programming-language values for various programming languages.
When designing a language mapping, an important consideration is roundtripping: serialization after deserialization, and vice versa, should both be identities.
JavaScript
- SignedInteger ↔ numbers or BigInt [1, 2]
- String ↔ strings
- ByteString ↔ Uint8Array
- Symbol ↔ Symbol.for(...)
- Boolean ↔ Boolean
- Float and Double ↔ numbers
- Record ↔ { "_label": theLabel, "_fields": [field0, ..., fieldN] }, plus convenience accessors
  - (undefined) ↔ the undefined value
  - (rfc3339 F) ↔ Date, if F matches the date-time RFC 3339 production
- Sequence ↔ Array
- Set ↔ { "_set": M } where M is a Map from the elements of the set to true
- Dictionary ↔ a Map
Scheme/Racket
- SignedInteger ↔ exact numbers
- String ↔ strings
- ByteString ↔ byte vector (Racket: "Bytes")
- Symbol ↔ symbols
- Boolean ↔ booleans
- Float and Double ↔ inexact numbers (Racket: single- and double-precision floats)
- Record ↔ structures (Racket: prefab struct)
- Sequence ↔ lists
- Set ↔ Racket: sets
- Dictionary ↔ Racket: hash-table
Java
- SignedInteger ↔ Integer, Long, BigInteger
- String ↔ String
- ByteString ↔ byte[]
- Symbol ↔ a simple data class wrapping a String
- Boolean ↔ Boolean
- Float and Double ↔ Float and Double
- Record ↔ in a simple implementation, a generic Record class; else perhaps a bean mapping?
- Sequence ↔ an implementation of java.util.List
- Set ↔ an implementation of java.util.Set
- Dictionary ↔ an implementation of java.util.Map
Erlang
- SignedInteger ↔ integers
- String ↔ tuple of utf8 and a binary
- ByteString ↔ a binary
- Symbol ↔ the underlying string converted to an Erlang atom, if some kind of an "unsafe" mode is set on the decoder (because Erlang atoms are not GC'd); otherwise perhaps a tuple of symbol and a binary of the utf-8
- Boolean ↔ true and false
- Float and Double ↔ floats (unsure how Erlang deals with single-precision)
- Record ↔ a tuple with the label in the first position, and the fields in subsequent positions
- Sequence ↔ a list
- Set ↔ a sets set (is this unambiguous? Maybe a map from elements to true?)
- Dictionary ↔ a map (new in Erlang/OTP R17)
Appendix. Table of lead byte values
00 - False
01 - True
02 - Float
03 - Double
(0x) RESERVED 04-0F
(1x) RESERVED 10-1F
2x - Start Stream
3x - End Stream
4x - SignedInteger
5x - String
6x - ByteString
7x - Symbol
8x - short form Record label index 0
9x - short form Record label index 1
Ax - short form Record label index 2
Bx - Record
Cx - Sequence
Dx - Set
Ex - Dictionary
(Fx) RESERVED F0-FF
Appendix. Why not Just Use JSON?
JSON offers syntax for numbers, strings, booleans, null, arrays and string-keyed maps. However, it suffers from two major problems. First, it offers no semantics for the syntax: it is left to each implementation to determine how to treat each JSON term. This causes interoperability and even security issues. Second, JSON's lack of support for type tags leads to awkward and incompatible encodings of type information in terms of the fixed suite of constructors on offer.
There are other minor problems with JSON having to do with its syntax. Examples include its relative verbosity and its lack of support for binary data.
JSON syntax doesn't mean anything
When are two JSON values the same? When are they different?
The specifications are largely silent on these questions. Different JSON implementations give different answers.
Specifically, JSON does not:
- assign any meaning to numbers,11
- determine how strings are to be compared,12
- determine whether object key ordering is significant,13 or
- determine whether duplicate object keys are permitted, what it would mean if they were, or how to determine a duplicate in the first place.14
In short, JSON syntax doesn't denote anything.15 16
Some examples:
- are the JSON values 1, 1.0, and 1e0 the same or different?
- are the JSON values 1.0 and 1.0000000000000001 the same or different?
- are the JSON strings "päron" (UTF-8 70c3a4726f6e) and "päron" (UTF-8 7061cc88726f6e) the same or different?
- are the JSON objects {"a":1, "b":2} and {"b":2, "a":1} the same or different?
- which, if any, of {"a":1, "a":2}, {"a":1} and {"a":2} are the same? Are all three legal?
- are {"päron":1} and {"päron":1} the same or different?
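As one data point, here is how Python's standard json module happens to answer some of these questions. This is only an illustration of a single implementation's choices; other JSON libraries are free to behave differently, which is exactly the problem.

```python
import json
import unicodedata

# Number syntax: this implementation yields an int for 1 but a float for 1.0 and 1e0.
one, one_point_zero, one_e_zero = (json.loads(s) for s in ("1", "1.0", "1e0"))
number_types = [type(v).__name__ for v in (one, one_point_zero, one_e_zero)]

# Precision: both literals round to the same double here.
same_double = json.loads("1.0") == json.loads("1.0000000000000001")

# Duplicate keys: accepted silently, and the last occurrence wins.
duplicates = json.loads('{"a":1, "a":2}')

# Unicode: composed and decomposed spellings are different strings,
# so they would also be different object keys, unless the application normalizes.
composed = json.loads('"p\\u00e4ron"')      # UTF-8 70c3a4726f6e
decomposed = json.loads('"pa\\u0308ron"')   # UTF-8 7061cc88726f6e
equal_raw = composed == decomposed
equal_nfc = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))
```

None of these outcomes is mandated by the JSON specifications; each is a decision taken by this particular library.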
JSON can multiply nicely, but it can't add very well
JSON includes a fixed set of types: numbers, strings, booleans, null, arrays and string-keyed maps. Domain-specific data must be encoded into these types. For example, dates and email addresses are often represented as strings with an implicit internal structure.
There is no convention for labelling a value as belonging to a particular category. This makes it difficult to extract, say, all email addresses, or all URLs, from an arbitrary JSON document.
Instead, JSON-encoded data are often labelled in an ad-hoc way. Multiple incompatible approaches exist. For example, a "money" structure containing a currency field and an amount may be represented in any number of ways:
{ "_type": "money", "currency": "EUR", "amount": 10 }
{ "type": "money", "value": { "currency": "EUR", "amount": 10 } }
[ "money", { "currency": "EUR", "amount": 10 } ]
{ "@money": { "currency": "EUR", "amount": 10 } }
This causes particular problems when JSON is used to represent sum or union types, such as "either a value or an error, but not both". Again, multiple incompatible approaches exist.
For example, imagine an API for depositing money in an account. The response might be either a "success" response indicating the new balance, or one of a set of possible errors.
Sometimes, a pair of values is used, with null marking the option not taken.17
{ "ok": { "balance": 210 }, "error": null }
{ "ok": null, "error": "Unauthorized" }
The branch not chosen is sometimes present, sometimes omitted as if it were an optional field:
{ "ok": { "balance": 210 } }
{ "error": "Unauthorized" }
Sometimes, an array of a label and a value is used:
[ "ok", { "balance": 210 } ]
[ "error", "Unauthorized" ]
Sometimes, the shape of the data is sufficient to distinguish among the alternatives, and the label is left implicit:
{ "balance": 210 }
"Unauthorized"
JSON itself does not offer any guidance for which of these options to choose. In many real cases on the web, poor choices have led to encodings that are irrecoverably ambiguous.
Open questions
Q. Should "symbols" instead be URIs? Relative, usually; relative to what? Some domain-specific base URI?
Q. What about general rationals, subsuming integers and IEEE floats (except NaN and the Infinities)?
Q. Should I map to SPKI SEXP or is that nonsense / for later?18
Q. Should Symbol be a special syntax for a Record with a Symbol label (recursive!?) and a single String field?
Q. Should String be a special syntax for (utf8 ByteString)? Again, recursiveness problems...?
Q. Should Dictionary be a special syntax for etc etc.? Set? Float? Double?
--> Rule of thumb: if there's a special equivalence predicate for it, it needs to be built-in syntax. Otherwise it can be a regular record. (So: Boolean might not make the cut for special treatment?? Likewise String...? Ugh those are psychologically important perhaps)
Q. Are the language mappings reasonable? How about one for Python?
Literal small integers: could be nice? Not absolutely necessary.
-
This design was loosely inspired by S-expressions, as seen in Lisp, Scheme, SPKI/SDSI, and many others, and by the ML type system, as seen in languages such as SML, OCaml, Haskell, Rust, and many others. It is also related to Zephyr ASDL (h/t Darius Bacon), which doesn't offer much in the way of atoms, but offers general-purpose labelled sums and products. See D. C. Wang, A. W. Appel, J. L. Korn, and C. S. Serra, “The Zephyr Abstract Syntax Description Language,” in USENIX Conference on Domain-Specific Languages, 1997, pp. 213–228. PDF available. ↩︎
-
The observant reader may note that the ordering here is the same as that implied by the tagging scheme used in the concrete binary syntax for
Value
s. ↩︎ -
Happily, the design of UTF-8 is such that this gives the same result as a lexicographic byte-by-byte comparison of the UTF-8 encoding of a string! ↩︎
-
The Racket programming language defines “prefab” structure types, which map well to our Records. Racket supports record extensibility by encoding record supertypes into record labels as specially-formatted lists. ↩︎
-
It is occasionally (but seldom) necessary to interpret such Symbol labels as UTF-8 encoded IRIs. Where a label can be read as a relative IRI, it is notionally interpreted with respect to the IRI urn:uuid:6bf094a6-20f1-4887-ada7-46834a9b5b34; where a label can be read as an absolute IRI, it stands for that IRI; and otherwise, it cannot be read as an IRI at all, and so the label simply stands for itself, that is, for its own Value. ↩︎
-
The two XML documents <x/> and <x /> differ by bytewise comparison, and thus yield different record values, even though under the semantics of XML they denote identical XML infosets. ↩︎
-
Those who remember ASN.1 will recall BER, DER, PER, CER, XER and so on, each appropriate to a different setting. Similarly, Rivest's S-Expression design offers a human-friendly syntax, a syntax robust to network-induced message corruption, and an unambiguous, simple and easily-parsed machine-friendly syntax for the same underlying values. ↩︎
-
Some encodings are unused. All such encodings are reserved for future versions of this specification. ↩︎
-
In the BitTorrent encoding format, bencoding, dictionary key/value pairs must be sorted by key. This is a necessary step for ensuring serialization of Values is canonical. We do not require that key/value pairs (or set elements) be in sorted order for serialized Values, because (a) where canonicalization is used for cryptographic signatures, it is more reliable to simply retain the exact binary form of the signed document than to depend on canonical de- and re-serialization, and (b) sorting keys or elements makes no sense in streaming serialization formats. ↩︎
-
It happens to line up with Racket's representation of a record label for an inheritance hierarchy where titled extends person extends thing:
(struct date (year month day) #:prefab)
(struct thing (id) #:prefab)
(struct person thing (name date-of-birth) #:prefab)
(struct titled person (title) #:prefab)
↩︎
-
Section 6 of RFC 7159 does go so far as to indicate “good interoperability can be achieved” by imagining that parsers are able reliably to understand the syntax of numbers as denoting an IEEE 754 double-precision floating-point value. ↩︎
-
Section 8.3 of RFC 7159 suggests that if an implementation compares strings used as object keys “code unit by code unit”, then it will interoperate with other such implementations, but neither requires this behaviour nor discusses comparisons of strings used in other contexts. ↩︎
-
Section 4 of RFC 7159 remarks that “[implementations] differ as to whether or not they make the ordering of object members visible to calling software.” ↩︎
-
Section 4 of RFC 7159 is the only place in the specification that mentions the issue. It explicitly sanctions implementations supporting duplicate keys, noting only that “when the names within an object are not unique, the behavior of software that receives such an object is unpredictable.” Implementations are free to choose any behaviour at all in this situation, including signalling an error, or discarding all but one of a set of duplicates. ↩︎
-
The XML world has the concept of the XML infoset. Loosely speaking, the XML infoset is the denotation of an XML document: the meaning of the document. ↩︎
-
Most other recent data languages are like JSON in specifying only a syntax with no associated semantics. While some do make a sketch of a semantics, the result is often underspecified (e.g. in terms of how strings are to be compared), overly machine-oriented (e.g. treating 32-bit integers as fundamentally distinct from 64-bit integers and from floating-point numbers), overly fine (e.g. giving visibility to the order in which map entries are written), or all three. ↩︎
-
What is the meaning of a document where both ok and error are non-null? What might happen when a program is presented with such a document? ↩︎
-
Why not just use Rivest's S-Expressions as they are? While they include binary data and sequences, and an obvious equivalence for them exists, they lack numbers per se as well as any kind of unordered structure such as sets or maps. In addition, while "display hints" allow labelling of binary data with an intended interpretation, they cannot be attached to any other kind of structure, and the "hint" itself can only be a binary blob. ↩︎
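The footnote above on UTF-8 ordering can be checked with a small experiment. The sketch below assumes that the comparison the footnote refers to is code point order, which is how Python compares `str` values. The sample strings and the helper names `utf8_key` and `utf16_key` are illustrative only.

```python
def utf8_key(s: str) -> bytes:
    """Sort key: the UTF-8 encoding, compared byte by byte."""
    return s.encode("utf-8")

def utf16_key(s: str) -> bytes:
    """Sort key: the big-endian UTF-16 encoding, for contrast."""
    return s.encode("utf-16-be")

# A mix of ASCII, Latin-1, a high BMP code point, and a code point
# outside the BMP (which needs a surrogate pair in UTF-16).
samples = ["a", "\u00e9", "z", "\uffff", "\U0001F600", "", "ab", "a\u0000"]

by_code_point = sorted(samples)               # Python orders str by code point
by_utf8_bytes = sorted(samples, key=utf8_key)
by_utf16_units = sorted(samples, key=utf16_key)

# Byte-wise comparison of UTF-8 agrees with code point order.
assert by_code_point == by_utf8_bytes

# UTF-16 does not share this property: the surrogate pair for U+1F600
# starts with 0xD83D, which sorts before 0xFFFF.
assert by_code_point != by_utf16_units
```

The contrast with UTF-16 is why the property is described as a happy accident of UTF-8's design: an implementation that stores strings as UTF-8 can order them with a plain byte comparison, while one that stores UTF-16 code units cannot.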