27 KiB
2018-06-04 20:31:48 TODO: cwebber's email comments 2018-06-04 20:31:51 TODO: look at https://github.com/imbal/rson and at clojure EDN 2018-06-05 10:32:02 ... and at http://json-schema.org/latest/json-schema-core.html#rfc.section.4.2
SPKI CAT: SPKI S-Expressions with Canonical Atom Tags
Tony Garnock-Jones tonyg@leastfixedpoint.com
Christopher Lemmer Webber cwebber@dustycloud.org
May 2018
Version 0.0.1
________________
/ \
/\__/\ / boo! \
/ \ \ i'm very spki! \
\=\_^__^_/= ___/\__________________/
|/ \
\\ | | /
<_|--|_>
Introduction
This document proposes a language-neutral JSON-like data type, along with a robust equivalence relation ("semantics") and a total ordering over inhabitants of the type.1
It then suggests conventions for encoding common data formats in terms of the proposed data type.
Finally, it proposes concrete syntax for the data type, offering a language-neutral transfer syntax (based on Rivest's S-Expressions as used in SPKI/SDSI) and suggesting possible language-specific representations for the data type's inhabitants.
Why not Just Use JSON?
JSON offers syntax for numbers, strings, booleans, null, arrays and string-keyed maps. However, it offers no semantics for the syntax: it is left to each implementation to determine how to treat each JSON term. This causes interoperability and even security issues.
Specifically, JSON does not:
- assign any meaning to numbers,2
- determine how strings are to be compared,3
- determine whether object key ordering is significant, or
- determine whether duplicate object keys are permitted, what it would mean if they were, or how to determine a duplicate in the first place.
In short, JSON syntax doesn't denote anything.4 5
Some examples:
- are the JSON values
1
,1.0
, and1e0
the same or different? - are the JSON values
1.0
and1.0000000000000001
the same or different? - are the JSON strings
"päron"
(UTF-870c3a4726f6e
) and"päron"
(UTF-87061cc88726f6e
) the same or different? - are the JSON objects
{"a":1, "b":2}
and{"b":2, "a":1}
the same or different? - which, if any, of
{"a":1, "a":2}
,{"a":1}
and{"a":2}
are the same? Are all three legal? - are
{"päron":1}
and{"päron":1}
the same or different?
Different JSON implementations give different answers to these questions. The JSON specifications are silent on these questions.
There are other minor problems with JSON having to do with its syntax. Examples include its relative verbosity and its lack of support for binary data.
Starting with Semantics
Taking inspiration from functional programming, we start with a definition of the values that we want to work with and give them meaning independent of their syntax. We will treat syntax separately, later in this document.
We will want our data type to accommodate atoms (numbers and text), products (both tuples and sequences), and labelled sums.6 It should also include keyed maps. We should avoid unnecessary restrictions such as machine-oriented fixed-width integer or floating-point values where possible.
Values
A Value
is one of:
- a
ByteString
for general-purpose non-numeric atomic data,7 - a
Number
for integers and rational numbers, - a
List
for general-purpose variable-length sequences, - a
Map
for general-purpose variable-size key/value maps, or - a
Record
for tagging a tuple of values with an intended interpretation.
We define a total order over Value
s: Every ByteString
is less than
the other kinds of Value
; every Number
is less than any List
,
Map
or Record
, but greater than any ByteString
; and so on.
That is, ByteString < Number < List < Map < Record
.
Two values of the same kind are compared using kind-specific rules, given below.
Two Value
s are equal if neither is less than the other according to
the total order.
Byte strings
A ByteString
is an ordered sequence of zero or more integers in the
inclusive range [0..255].
ByteString
s are compared lexicographically.
We will write examples of ByteString
s that contain only ASCII
characters using “#"
” as an opening quote mark and “"
” as a
closing quote mark.
Examples. The ByteString
containing the three ASCII characters
A
, B
and C
is written as #"ABC"
. The empty ByteString
is
written as #""
. N.B. Despite appearances, these are binary
data.
Numbers
A Number
is a signed rational number of finite precision whose
magnitude can be exactly represented in base two with a finite number
of digits. This includes integers of arbitrary width as well as (for
example) the non-infinite non-NaN IEEE 754 floating-point values.
Number
s are compared as mathematical numbers.
We will write examples of Number
s using standard mathematical
notation.
Examples. 10, -6, 0.5, -3/2, 33/192, -1.202E4567.
Non-examples. NaN (the clue is in the name!), ∞ (not finite), 0.2 (cannot be exactly represented with a finite number of binary digits), 1/7 (likewise), 2+i3 (not rational), √2 (likewise).
Lists
A List
is an ordered sequence of zero or more Value
s.
List
s are compared lexicographically, appealing to the ordering on
Value
s for comparisons at each position in the List
s.
Maps
A Map
is an unordered collection of zero or more pairs of
Value
s. Each pair comprises a key and a value. Keys in a Map
must be pairwise distinct.
Instances of Map
are compared by lexicographic comparison of the
sequences resulting from ordering each Map
's pairs in ascending
order by key. ((TODO: Is this a good idea? Is it clearly-enough
written? An alternative approach is to compare first by the count of
pairs, and only if the count is the same, start comparing the pairs
themselves.))
Records
A Record
is a tuple of one or more Value
s. The first in the tuple
is called the label of the Record
, and the other elements of the
tuple are called its fields.
Record
labels are usually ByteString
s, but can be any kind of
Value
.8
Record
s are compared lexicographically as if they were just tuples;
that is, first by their labels, and then by the remainder of their
fields.
We will write examples of Record
s with ByteString
labels entirely
composed of ASCII characters as their label followed by their
parenthesised, comma-separated fields.
Examples. The Record
with label #"foo"
and fields 1, 2 and 3
is written #"foo"(1, 2, 3)
; the Record
with label #"void"
and no
fields is written #"void"()
.
Conventions for Common Data Types
The Value
data type is essentially an abstract S-Expression, able to
represent semi-structured data over ByteString
and Number
atoms.
However, users need a wide variety of data types for representing domain-specific values such as text, calendrical values, machine words, IEEE 754 floating-point values, booleans, and so on.
We use appropriately-labelled Record
s to denote these
domain-specific data types.
All of these conventions are optional. They form a layer atop the core
Value
structure. Non-domain-specific tools do not in general need to
treat them specially.
Validity. Many of the labels we will describe in this section come
with side-conditions on the contents of labelled Record
s. It is
possible to construct an instance of Value
that violates these
side-conditions without ceasing to be a Value
or becoming
unrepresentable. However, we say that such a Value
is invalid
because it fails to honour the necessary side-conditions.
Implementations SHOULD allow two modes of working: one which
treats all Value
s identically, without regard for side-conditions,
and one which enforces validity (i.e. side-conditions) when reading,
writing, or constructing Value
s.
Text
A Text
is a Record
labelled with the ByteString
#"utf-8"
and
having a single field that is also a ByteString
. The field MUST be
valid UTF-8.
We will write examples of Text
s that contain Unicode text using
“"
” as both an opening and closing quote mark.
Examples. The Text
containing the three Unicode code points z
(0x7A), 水
(0x6C34) and 𝄞
(0x1D11E) is written as "z水𝄞"
.
Normalization forms. Unicode defines multiple
normalization forms for text. The
ordering and equivalence relations defined for Value
s mean that, for
Unicode text, the UTF-8 encoded byte-level form of a text is used in
comparisons.9 In order for users to unambiguously
signal or require a particular normalization form, we define a
NormalizedText
, which is a Record
labelled with
#"unicode-normalization"
and having two fields, the first of which
is a Text
specifying the normalization form used (e.g. "nfc"
,
"nfd"
, "nfkc"
, "nfkd"
), and the second of which is a Text
whose underlying representation MUST be normalized according to the
named normalization form.
IRIs. (URIs, URLs, URNs, etc.) An IRI
is a Record
labelled
with #"iri"
and having one field, a Text
which is the IRI itself
and which MUST be a valid absolute or relative IRI.
Symbols. Programming languages like Lisp and Prolog frequently use
string-like values called symbols. A Symbol
is a Record
labelled
with #"symbol"
and having one field, a Text
.
Numbers
The definition of Number
captures all integers and all
finitely-representable floating-point values. However, in certain
circumstances it can be valuable to assert that a number inhabits a
particular range, such as a fixed-width machine word or an IEEE 754
floating-point value.
Fixed-width machine words. (16-, 32- and 64-bit) A family of
labels i
n and u
n for n ∈ {16,32,64} denote n-bit-wide
signed and unsigned range restrictions, respectively. Records with
these labels MUST have one field, a Number
, which MUST fall
within the appropriate range. That is, to be valid,
- in
#"i16"(
x)
, -32768 <= x <= 32767, and ⌊x⌋ = x. - in
#"u16"(
x)
, 0 <= x <= 65535, and ⌊x⌋ = x. - in
#"i32"(
x)
, -2147483648 <= x <= 2147483647, and ⌊x⌋ = x. - etc.
IEEE 754 floating-point. (single- and double-precision) The labels
f32
and f64
denote single- and double-precision IEEE 754
floating-point values, respectively. Records with these labels MUST
have one field. This field MUST either be a Number
, which MUST
fall within the appropriate representable range, or one of the records
#"nan"()
, #"+inf"()
or #"-inf"()
.
Anonymous Tuples and Unit
A Tuple
is a Record
with label #"tuple"
and zero or more fields,
denoting an anonymous tuple of values.
The 0-ary tuple, #"tuple"()
, denotes the empty tuple, sometimes
called "unit" or "void" (but not e.g. JavaScript's "undefined"
value).
Booleans, Null and Undefined
The two 0-ary Record
s #"true"()
and #"false"()
denote the "true"
and "false" Boolean values, respectively.
Tony Hoare's
"billion-dollar mistake"
can be represented with the 0-ary Record
#"null"()
. An "undefined"
value can be represented as #"undefined"()
.
Dates and Times
Dates, times, moments, and timestamps can be represented with a
Record
with label #"rfc3339"
having a single field, a Text
,
which MUST conform to one of the full-date
, partial-time
,
full-time
, or date-time
productions of
section 5.6 of RFC 3339.
Syntax
Now we have discussed Value
s and their meanings, we may turn to
techniques for representing Value
s for communication or storage.
The syntax we have used for the examples so far is inadequate in many
ways, not least of which is that it cannot represent every Value
.
Separation of the meaning of a piece of syntax from the syntax itself opens the door to domain-specific syntaxes, all equivalent and interconvertible.10 With a robust semantic foundation, connections to other data languages can also be made.
Transfer syntax: S-Expressions
For now, we limit our attention to an easily-parsed, easily-produced
machine-readable syntax by mapping our Value
s to the canonical form
of Rivest's S-Expressions.11
Byte strings
ByteString
s map to byte-string S-Expressions.
Examples.
- What we have been writing above as
#"ABC"
would be represented as the S-Expression3:ABC
. - The empty
ByteString
is represented by the S-Expression0:
.
Numbers
Numbers are the most complicated values to represent as an S-Expression.
((TODO: Consider cutting complexity by e.g. representing a Number
as
a sign bit, a little-endian blob of the integer part of the number,
and a little-endian blob of the fractional part of the number. Lots of
trailing/leading zeros for very large/small numbers!))
We represent Number
s using a sign-magnitude format, where the
magnitude is written using a little-endian, twos-complement binary
significand and a
(signed) shift amount.
In essence, we use a generalized, variable-width form of binary IEEE floating-point representation.
Let N
be the Number
to represent as an S-Expression.
The sign bit is 0 when N
is zero or positive, and 1 when N
is
negative.
The magnitude of N
can be viewed as an infinite sequence of bits
with a fraction-separator mark placed somewhere in the sequence,
···00.000 b_0 b_1 ··· b_{k-1} b_k ··· b_{n-1} 000000···
···000000 b_0 b_1 ··· b_{k-1} . b_k ··· b_{n-1} 000000···
···000000 b_0 b_1 ··· b_{k-1} b_k ··· b_{n-1} 000.00···
where b_0
is the leftmost (most significant) and b_{n-1}
the
rightmost (least significant) non-zero bit.
Let k
, the position of the fraction-separator mark, be i
when it
is immediately to the left of b_i
for some i
, generalizing to
negative values when it is to the left of b_0
and values greater
than n-1
when it is to the right of b_{n-1}
.
For example, k
will be:
- 0 when the fraction-separator is immediately (i.e. zero bits) to the left of
b_0
; - -3 (as in the first example above) when it is three bits left of
b_0
; n
when it is immediately (i.e. zero bits) to the right ofb_{n-1}
;n
+3 when it is three bits to the right ofb_{n-1}
.
The unpadded significand is b_0 b_1 ··· b_{n-1}
.
When k
< n
, the shift z
=k-n
and the significand is:
- the unpadded significand,
- with the sign bit appended to it on the right, and then
- padded on the left with zeroes until it is a whole number of octets wide.
When k
≥ n
, the shift z
=8×⌊(k-n)/8⌋
and the significand is:
- the unpadded significand,
- padded on the right with
(k-n) mod 8
zeroes, - with the sign bit then appended on the right, and then
- padded on the left with zeroes until it is a whole number of octets wide.
Now, let s
=2z
if z
is zero or positive, or s
=2|z
|+1 if z
is negative.
Finally, the S-Expression form of N
is:
(4:*num [SIGNIFICAND] [SHIFT])
, ifs
≠0; or(4:*num [SIGNIFICAND])
, ifs
=0 but the significand contains non-zero bits; or(4:*num)
, ifs
=0 and the significand contains no non-zero bits;
where
[SIGNIFICAND]
stands for a byte-string S-Expression containing a little-endian representation of the significand, and[SHIFT]
stands for a byte-string S-Expression containing a little-endian representation ofs
.
Examples. (Shown using the hexadecimal representation of byte-strings from section 4.4 of Rivest's S-Expression specification in places.)
N
=0 →(4:*num)
N
=1 →(4:*num#02#)
N
=-1 →(4:*num#03#)
N
=10₁₀=1010.0₂ →n
=3,k
=4,z
=0,s
=0 →(4:*num#14#)
N
=2560₁₀=101000000000.0₂ →n
=3,k
=12,z
=8,s
=16 →(4:*num#14##10#)
N
=-2560₁₀=-101000000000.0₂ →n
=3,k
=12,z
=8,s
=16 →(4:*num#15##10#)
N
=-6₁₀=-110.0₂ →n
=2,k
=3,z
=0,s
=0 →(4:*num#0D#)
N
=0.5₁₀=0.1₂ →n
=1,k
=0,z
=-1,s
=3 →(4:*num#02##03#)
N
=-3/2₁₀=-1.1₂ →n
=2,k
=1,z
=-1,s
=3 →(4:*num#07##03#)
N
=33/192₁₀=0.001011₂ →n
=4,k
=-2,z
=-6,s
=7 →(4:*num#16##07#)
N
=-1.202E4567=1011011001···000₂ (15172 binary digits, the last 4565 of which are zero) →n
=10607,k
=15172,z
=4560,s
=9120 →(4:*num#41828E···24CD16##A023#)
((TODO: figure out what this algorithm would actually look like in, say, C, Python and Racket.))
Lists
A List
maps to an S-Expression list of representations of its
elements, with the byte-string S-Expression 5:*list
prepended.
Examples.
- The
List
containing theByteString
s#"a"
,#"b"
, and#"c"
would be represented as the S-Expression(5:*list1:a1:b1:c)
. - The empty
List
is represented by the S-Expression(5:*list)
.
Maps
A Map
is represented by an S-Expression list of representations of
the Map
's key-value pairs, with the byte-string 4:*map
prepended.
Each key-value pair is represented by a two-element S-Expression list containing representations of the key and the value, in that order.
The key-value pairs MUST be ordered by Value
-order of their keys.
Examples.
- The
Map
containing entries mapping#"a"
to#"d"
and#"c"
to#"b"
is represented by(4:*map(1:a1:d)(1:c1:b))
. - The
Map
containing an entry mapping the empty list to a "true" Boolean value is represented by(4:*map((5:*list)(4:true)))
. - The empty
Map
is represented by(4:*map)
.
Non-examples.
- The S-Expression
(4:*map(1:c1:b)(1:a1:d))
is invalid, because its key-value pairs are not inValue
-order by key:#"c"
>#"a"
. - The S-Expression
(4:*map1:a1:d1:c1:b)
is invalid, because its key-value pairs appear "flattened" in the outer list, rather than each appearing in a two-element list of its own.
Records
A Record
is represented by an S-Expression list of its fields,
prepended by:
- the representation of its label, if its label is a
ByteString
and does not begin with byte 42 (ASCII "*
"); or - the S-Expression
1:*
followed by the representation of theRecord
's label, otherwise.
Examples.
- The
Text
"hello-world"
is represented by the S-Expression(5:utf-811:hello-world)
. - The
IRI
denotinghttp://www.w3.org/
is represented by the S-Expression(3:iri(5:utf-818:http://www.w3.org/))
. - The
Record
#"*"()
is represented by the S-Expression(1:*1:*)
. - The
Record
#"*foo"(#"*bar")
is represented by the S-Expression(1:*4:*foo4:*bar)
. - The
Record
with the empty list as its label and no fields is represented by the S-Expression(1:*(5:*list))
. (7:rfc3339(5:utf-83:foo))
represents a well-formedValue
that is aRecord
with#"rfc3339"
as its label, and a singleText
field. While it is a perfectly reasonableValue
, it does not represent a valid date or time, since theText
"foo"
does not conform to any of the RFC 3339 productions enumerated above.
Non-examples.
((5:*list))
is not a representation of theRecord
with the empty list as its label and no fields, because thatRecord
has a non-ByteString
as its label, mandating a1:*
prefix on its S-Expression representation.(4:*foo4:*bar)
does not represent theRecord
#"*foo"(#"*bar")
, because the label#"*foo"
begins with "*
", mandating a1:*
prefix on theRecord
's S-Expression representation.
Examples
((TODO: Give some examples of large and small SPKI-CAT documents, perhaps translated from various JSON blobs floating around the internet.))
Representing Values in Programming Languages
We have given a definition of Value
and its semantics, and proposed
a concrete syntax for communicating and storing Value
s. We now turn
to suggested representations of Value
s as programming-language
values for various programming languages.
JavaScript
ByteString
↔Uint8Array
Number
↔ numbers, problematically; bignums, perhaps; other?? TODOList
↔Array
Map
↔Object
Record
↔ an instance of something likeRecord
below, unless the label is...#"utf-8"
↔String
#"true"
↔true
#"false"
↔false
#"null"
↔null
#"undefined"
↔ the undefined value#"rfc3339"
↔Date
, if theRecord
's field matches thedate-time
RFC 3339 production
function Record(label, ...fields) {
this.label = label;
this.fields = fields;
}
Scheme/Racket
ByteString
↔ byte vector (Racket: "Bytes")Number
↔ numbersList
↔ (where possible, immutable) listMap
↔ hash-tableRecord
↔ a structure (Racket: a "prefab struct"), unless the label is...#"utf-8"
↔ a string#"true"
↔#t
#"false"
↔#f
#"symbol"
↔ a symbol
Java
ByteString
↔byte[]
Number
↔ numbers, problematically; bignums, perhaps; other?? TODOList
↔java.util.List
Map
↔java.util.Map
Record
↔ an instance of something likeRecord
below, unless the label is...#"utf-8"
↔java.lang.String
#"true"
↔java.lang.Boolean.TRUE
#"false"
↔java.lang.Boolean.FALSE
#"null"
↔ a special singleton object, but not Java'snull
#"rfc3339"
↔java.util.{Date,Time,Timestamp}
, according to which RFC 3339 production theRecord
's field matches
Erlang
ByteString
↔ a binaryNumber
↔ numbers, probably; TODOList
↔ a listMap
↔ a map (new in Erlang/OTP R17)Record
↔ a tuple with the label in the first position, and the fields in subsequent positions, unless the label is...#"true"
↔true
#"false"
↔false
#"null"
↔null
#"undefined"
↔undefined
#"symbol"
↔ theText
field converted to an Erlang atom, if some kind of an "unsafe" mode is set on the decoder (because Erlang atoms are not GC'd); otherwise like any other kind ofRecord
-
TJSON has a similar aim: “different on-the-wire representations of an object correspond to the same typed data object” (source); “TJSON is defined as a serialization format on top of a JSON-like data model” (source). ↩︎
-
Section 6 of RFC 7159 does go so far as to indicate “good interoperability can be achieved” by imagining that parsers are able reliably to understand the syntax of numbers as denoting an IEEE 754 double-precision floating-point value. ↩︎
-
Section 8.3 of RFC 7159 suggests that if an implementation compares strings used as object keys “code unit by code unit”, then it will interoperate with other such implementations, but neither requires this behaviour nor discusses comparisons of strings used in other contexts. ↩︎
-
The XML world has the concept of XML infoset. Loosely speaking, XML infoset is the denotation of an XML document; the meaning of the document. ↩︎
-
Most other recent data languages are like JSON in specifying only a syntax with no associated semantics. While some do make a sketch of a semantics, the result is often underspecified (e.g. in terms of how strings are to be compared), overly machine-oriented (e.g. treating 32-bit integers as fundamentally distinct from 64-bit integers and from floating-point numbers), overly fine (e.g. giving visibility to the order in which map entries are written), or all three. ↩︎
-
This design was loosely inspired by Zephyr ASDL (h/t Darius Bacon), which doesn't offer much in the way of atoms, but offers general-purpose labelled sums and products. See D. C. Wang, A. W. Appel, J. L. Korn, and C. S. Serra, “The Zephyr Abstract Syntax Description Language,” in USENIX Conference on Domain-Specific Languages, 1997, pp. 213–228. PDF available. ↩︎
-
Why include
ByteString
, when we could instead use a reservedRecord
along with aList
ofNumber
s? ((TODO: Actually decide about this! Similarly, why includeMap
rather than a restricted form ofList
with aRecord
? I think the answer has to do with the arbitrariness of the label we'd pick: unless extremely carefully chosen (i.e. number 0 (ideally even-Inf
!) for byte strings, number 1 for map, and have the order goNumber < Record < List
), they would mess up the prettiness of the ordering. Though... we could ultimately reduce this toNumber
andRecord
, and have a family of#"list"
and#"map"
Record
s...)) ↩︎ -
It is occasionally (but seldom) necessary to interpret such
ByteString
labels as UTF-8 encoded IRIs. Where a label can be read as a relative IRI, it is notionally interpreted with respect to the IRIhttp://spki-cat.org/
((TODO: placeholder)); where a label can be read as an absolute IRI, it stands for that IRI; and otherwise, it cannot be read as an IRI at all, and so the label simply stands for itself - for its ownValue
. ↩︎ -
Happily, the design of UTF-8 is such that this gives the same result as a lexicographic code-point-by-code-point comparison! ↩︎
-
Those who remember ASN.1 will recall BER, DER, PER, CER, XER and so on, each appropriate to a different setting. Similarly, Rivest's S-Expression design offers a human-friendly syntax, a syntax robust to network-induced message corruption, and an unambiguous, simple and easily-parsed machine-friendly syntax for the same underlying values. ↩︎
-
Why not just use Rivest's S-Expressions as they are? While they include binary data and sequences, and an obvious equivalence for them exists, they lack numbers per se as well as any kind of unordered structure such as sets or maps. In addition, while "display hints" allow labelling of binary data with an intended interpretation, they cannot be attached to any other kind of structure, and the "hint" itself can only be a binary blob. ↩︎