preserves/historical/codec.md

27 KiB
Raw Permalink Blame History

2018-06-04 20:31:48 TODO: cwebber's email comments 2018-06-04 20:31:51 TODO: look at https://github.com/imbal/rson and at clojure EDN 2018-06-05 10:32:02 ... and at http://json-schema.org/latest/json-schema-core.html#rfc.section.4.2

SPKI CAT: SPKI S-Expressions with Canonical Atom Tags

Tony Garnock-Jones tonyg@leastfixedpoint.com
Christopher Lemmer Webber cwebber@dustycloud.org
May 2018
Version 0.0.1

                 ________________
                /                \
   /\__/\      /  boo!            \
  /      \     \   i'm very spki!  \
\=\_^__^_/= ___/\__________________/
 |/      \
 \\ |  | /
  <_|--|_>

Introduction

This document proposes a language-neutral JSON-like data type, along with a robust equivalence relation ("semantics") and a total ordering over inhabitants of the type.1

It then suggests conventions for encoding common data formats in terms of the proposed data type.

Finally, it proposes concrete syntax for the data type, offering a language-neutral transfer syntax (based on Rivest's S-Expressions as used in SPKI/SDSI) and suggesting possible language-specific representations for the data type's inhabitants.

Why not Just Use JSON?

JSON offers syntax for numbers, strings, booleans, null, arrays and string-keyed maps. However, it offers no semantics for the syntax: it is left to each implementation to determine how to treat each JSON term. This causes interoperability and even security issues.

Specifically, JSON does not:

  • assign any meaning to numbers,2
  • determine how strings are to be compared,3
  • determine whether object key ordering is significant, or
  • determine whether duplicate object keys are permitted, what it would mean if they were, or how to determine a duplicate in the first place.

In short, JSON syntax doesn't denote anything.4 5

Some examples:

  • are the JSON values 1, 1.0, and 1e0 the same or different?
  • are the JSON values 1.0 and 1.0000000000000001 the same or different?
  • are the JSON strings "päron" (UTF-8 70c3a4726f6e) and "päron" (UTF-8 7061cc88726f6e) the same or different?
  • are the JSON objects {"a":1, "b":2} and {"b":2, "a":1} the same or different?
  • which, if any, of {"a":1, "a":2}, {"a":1} and {"a":2} are the same? Are all three legal?
  • are {"päron":1} and {"päron":1} the same or different?

Different JSON implementations give different answers to these questions. The JSON specifications are silent on these questions.

There are other minor problems with JSON having to do with its syntax. Examples include its relative verbosity and its lack of support for binary data.

Starting with Semantics

Taking inspiration from functional programming, we start with a definition of the values that we want to work with and give them meaning independent of their syntax. We will treat syntax separately, later in this document.

We will want our data type to accommodate atoms (numbers and text), products (both tuples and sequences), and labelled sums.6 It should also include keyed maps. We should avoid unnecessary restrictions such as machine-oriented fixed-width integer or floating-point values where possible.

Values

A Value is one of:

  • a ByteString for general-purpose non-numeric atomic data,7
  • a Number for integers and rational numbers,
  • a List for general-purpose variable-length sequences,
  • a Map for general-purpose variable-size key/value maps, or
  • a Record for tagging a tuple of values with an intended interpretation.

We define a total order over Values: Every ByteString is less than the other kinds of Value; every Number is less than any List, Map or Record, but greater than any ByteString; and so on.

That is, ByteString < Number < List < Map < Record.

Two values of the same kind are compared using kind-specific rules, given below.

Two Values are equal if neither is less than the other according to the total order.

Byte strings

A ByteString is an ordered sequence of zero or more integers in the inclusive range [0..255].

ByteStrings are compared lexicographically.

We will write examples of ByteStrings that contain only ASCII characters using “#"” as an opening quote mark and “"” as a closing quote mark.

Examples. The ByteString containing the three ASCII characters A, B and C is written as #"ABC". The empty ByteString is written as #"". N.B. Despite appearances, these are binary data.

Numbers

A Number is a signed rational number of finite precision whose magnitude can be exactly represented in base two with a finite number of digits. This includes integers of arbitrary width as well as (for example) the non-infinite non-NaN IEEE 754 floating-point values.

Numbers are compared as mathematical numbers.

We will write examples of Numbers using standard mathematical notation.

Examples. 10, -6, 0.5, -3/2, 33/192, -1.202E4567.

Non-examples. NaN (the clue is in the name!), ∞ (not finite), 0.2 (cannot be exactly represented with a finite number of binary digits), 1/7 (likewise), 2+i3 (not rational), √2 (likewise).

Lists

A List is an ordered sequence of zero or more Values.

Lists are compared lexicographically, appealing to the ordering on Values for comparisons at each position in the Lists.

Maps

A Map is an unordered collection of zero or more pairs of Values. Each pair comprises a key and a value. Keys in a Map must be pairwise distinct.

Instances of Map are compared by lexicographic comparison of the sequences resulting from ordering each Map's pairs in ascending order by key. ((TODO: Is this a good idea? Is it clearly-enough written? An alternative approach is to compare first by the count of pairs, and only if the count is the same, start comparing the pairs themselves.))

Records

A Record is a tuple of one or more Values. The first in the tuple is called the label of the Record, and the other elements of the tuple are called its fields.

Record labels are usually ByteStrings, but can be any kind of Value.8

Records are compared lexicographically as if they were just tuples; that is, first by their labels, and then by the remainder of their fields.

We will write examples of Records with ByteString labels entirely composed of ASCII characters as their label followed by their parenthesised, comma-separated fields.

Examples. The Record with label #"foo" and fields 1, 2 and 3 is written #"foo"(1, 2, 3); the Record with label #"void" and no fields is written #"void"().

Conventions for Common Data Types

The Value data type is essentially an abstract S-Expression, able to represent semi-structured data over ByteString and Number atoms.

However, users need a wide variety of data types for representing domain-specific values such as text, calendrical values, machine words, IEEE 754 floating-point values, booleans, and so on.

We use appropriately-labelled Records to denote these domain-specific data types.

All of these conventions are optional. They form a layer atop the core Value structure. Non-domain-specific tools do not in general need to treat them specially.

Validity. Many of the labels we will describe in this section come with side-conditions on the contents of labelled Records. It is possible to construct an instance of Value that violates these side-conditions without ceasing to be a Value or becoming unrepresentable. However, we say that such a Value is invalid because it fails to honour the necessary side-conditions. Implementations SHOULD allow two modes of working: one which treats all Values identically, without regard for side-conditions, and one which enforces validity (i.e. side-conditions) when reading, writing, or constructing Values.

Text

A Text is a Record labelled with the ByteString #"utf-8" and having a single field that is also a ByteString. The field MUST be valid UTF-8.

We will write examples of Texts that contain Unicode text using “"” as both an opening and closing quote mark.

Examples. The Text containing the three Unicode code points z (0x7A), (0x6C34) and 𝄞 (0x1D11E) is written as "z水𝄞".

Normalization forms. Unicode defines multiple normalization forms for text. The ordering and equivalence relations defined for Values mean that, for Unicode text, the UTF-8 encoded byte-level form of a text is used in comparisons.9 In order for users to unambiguously signal or require a particular normalization form, we define a NormalizedText, which is a Record labelled with #"unicode-normalization" and having two fields, the first of which is a Text specifying the normalization form used (e.g. "nfc", "nfd", "nfkc", "nfkd"), and the second of which is a Text whose underlying representation MUST be normalized according to the named normalization form.

IRIs. (URIs, URLs, URNs, etc.) An IRI is a Record labelled with #"iri" and having one field, a Text which is the IRI itself and which MUST be a valid absolute or relative IRI.

Symbols. Programming languages like Lisp and Prolog frequently use string-like values called symbols. A Symbol is a Record labelled with #"symbol" and having one field, a Text.

Numbers

The definition of Number captures all integers and all finitely-representable floating-point values. However, in certain circumstances it can be valuable to assert that a number inhabits a particular range, such as a fixed-width machine word or an IEEE 754 floating-point value.

Fixed-width machine words. (16-, 32- and 64-bit) A family of labels in and un for n ∈ {16,32,64} denote n-bit-wide signed and unsigned range restrictions, respectively. Records with these labels MUST have one field, a Number, which MUST fall within the appropriate range. That is, to be valid,

  • in #"i16"(x), -32768 <= x <= 32767, and ⌊x⌋ = x.
  • in #"u16"(x), 0 <= x <= 65535, and ⌊x⌋ = x.
  • in #"i32"(x), -2147483648 <= x <= 2147483647, and ⌊x⌋ = x.
  • etc.

IEEE 754 floating-point. (single- and double-precision) The labels f32 and f64 denote single- and double-precision IEEE 754 floating-point values, respectively. Records with these labels MUST have one field. This field MUST either be a Number, which MUST fall within the appropriate representable range, or one of the records #"nan"(), #"+inf"() or #"-inf"().

Anonymous Tuples and Unit

A Tuple is a Record with label #"tuple" and zero or more fields, denoting an anonymous tuple of values.

The 0-ary tuple, #"tuple"(), denotes the empty tuple, sometimes called "unit" or "void" (but not e.g. JavaScript's "undefined" value).

Booleans, Null and Undefined

The two 0-ary Records #"true"() and #"false"() denote the "true" and "false" Boolean values, respectively.

Tony Hoare's "billion-dollar mistake" can be represented with the 0-ary Record #"null"(). An "undefined" value can be represented as #"undefined"().

Dates and Times

Dates, times, moments, and timestamps can be represented with a Record with label #"rfc3339" having a single field, a Text, which MUST conform to one of the full-date, partial-time, full-time, or date-time productions of section 5.6 of RFC 3339.

Syntax

Now we have discussed Values and their meanings, we may turn to techniques for representing Values for communication or storage.

The syntax we have used for the examples so far is inadequate in many ways, not least of which is that it cannot represent every Value.

Separation of the meaning of a piece of syntax from the syntax itself opens the door to domain-specific syntaxes, all equivalent and interconvertible.10 With a robust semantic foundation, connections to other data languages can also be made.

Transfer syntax: S-Expressions

For now, we limit our attention to an easily-parsed, easily-produced machine-readable syntax by mapping our Values to the canonical form of Rivest's S-Expressions.11

Byte strings

ByteStrings map to byte-string S-Expressions.

Examples.

  • What we have been writing above as #"ABC" would be represented as the S-Expression 3:ABC.
  • The empty ByteString is represented by the S-Expression 0:.

Numbers

Numbers are the most complicated values to represent as an S-Expression.

((TODO: Consider cutting complexity by e.g. representing a Number as a sign bit, a little-endian blob of the integer part of the number, and a little-endian blob of the fractional part of the number. Lots of trailing/leading zeros for very large/small numbers!))

We represent Numbers using a sign-magnitude format, where the magnitude is written using a little-endian, twos-complement binary significand and a (signed) shift amount.

In essence, we use a generalized, variable-width form of binary IEEE floating-point representation.

Let N be the Number to represent as an S-Expression.

The sign bit is 0 when N is zero or positive, and 1 when N is negative.

The magnitude of N can be viewed as an infinite sequence of bits with a fraction-separator mark placed somewhere in the sequence,

···00.000 b_0 b_1 ··· b_{k-1}   b_k ··· b_{n-1} 000000···
···000000 b_0 b_1 ··· b_{k-1} . b_k ··· b_{n-1} 000000···
···000000 b_0 b_1 ··· b_{k-1}   b_k ··· b_{n-1} 000.00···

where b_0 is the leftmost (most significant) and b_{n-1} the rightmost (least significant) non-zero bit.

Let k, the position of the fraction-separator mark, be i when it is immediately to the left of b_i for some i, generalizing to negative values when it is to the left of b_0 and values greater than n-1 when it is to the right of b_{n-1}.

For example, k will be:

  • 0 when the fraction-separator is immediately (i.e. zero bits) to the left of b_0;
  • -3 (as in the first example above) when it is three bits left of b_0;
  • n when it is immediately (i.e. zero bits) to the right of b_{n-1};
  • n+3 when it is three bits to the right of b_{n-1}.

The unpadded significand is b_0 b_1 ··· b_{n-1}.

When k < n, the shift z=k-n and the significand is:

  • the unpadded significand,
  • with the sign bit appended to it on the right, and then
  • padded on the left with zeroes until it is a whole number of octets wide.

When kn, the shift z=8×⌊(k-n)/8⌋ and the significand is:

  • the unpadded significand,
  • padded on the right with (k-n) mod 8 zeroes,
  • with the sign bit then appended on the right, and then
  • padded on the left with zeroes until it is a whole number of octets wide.

Now, let s=2z if z is zero or positive, or s=2|z|+1 if z is negative.

Finally, the S-Expression form of N is:

  • (4:*num [SIGNIFICAND] [SHIFT]), if s≠0; or
  • (4:*num [SIGNIFICAND]), if s=0 but the significand contains non-zero bits; or
  • (4:*num), if s=0 and the significand contains no non-zero bits;

where

  • [SIGNIFICAND] stands for a byte-string S-Expression containing a little-endian representation of the significand, and
  • [SHIFT] stands for a byte-string S-Expression containing a little-endian representation of s.

Examples. (Shown using the hexadecimal representation of byte-strings from section 4.4 of Rivest's S-Expression specification in places.)

  • N=0 → (4:*num)
  • N=1 → (4:*num#02#)
  • N=-1 → (4:*num#03#)
  • N=10₁₀=1010.0₂ → n=3, k=4, z=0, s=0 → (4:*num#14#)
  • N=2560₁₀=101000000000.0₂ → n=3, k=12, z=8, s=16 → (4:*num#14##10#)
  • N=-2560₁₀=-101000000000.0₂ → n=3, k=12, z=8, s=16 → (4:*num#15##10#)
  • N=-6₁₀=-110.0₂ → n=2, k=3, z=0, s=0 → (4:*num#0D#)
  • N=0.5₁₀=0.1₂ → n=1, k=0, z=-1, s=3 → (4:*num#02##03#)
  • N=-3/2₁₀=-1.1₂ → n=2, k=1, z=-1, s=3 → (4:*num#07##03#)
  • N=33/192₁₀=0.001011₂ → n=4, k=-2, z=-6, s=7 → (4:*num#16##07#)
  • N=-1.202E4567=1011011001···000₂ (15172 binary digits, the last 4565 of which are zero) → n=10607, k=15172, z=4560, s=9120 → (4:*num#41828E···24CD16##A023#)

((TODO: figure out what this algorithm would actually look like in, say, C, Python and Racket.))

Lists

A List maps to an S-Expression list of representations of its elements, with the byte-string S-Expression 5:*list prepended.

Examples.

  • The List containing the ByteStrings #"a", #"b", and #"c" would be represented as the S-Expression (5:*list1:a1:b1:c).
  • The empty List is represented by the S-Expression (5:*list).

Maps

A Map is represented by an S-Expression list of representations of the Map's key-value pairs, with the byte-string 4:*map prepended.

Each key-value pair is represented by a two-element S-Expression list containing representations of the key and the value, in that order.

The key-value pairs MUST be ordered by Value-order of their keys.

Examples.

  • The Map containing entries mapping #"a" to #"d" and #"c" to #"b" is represented by (4:*map(1:a1:d)(1:c1:b)).
  • The Map containing an entry mapping the empty list to a "true" Boolean value is represented by (4:*map((5:*list)(4:true))).
  • The empty Map is represented by (4:*map).

Non-examples.

  • The S-Expression (4:*map(1:c1:b)(1:a1:d)) is invalid, because its key-value pairs are not in Value-order by key: #"c" > #"a".
  • The S-Expression (4:*map1:a1:d1:c1:b) is invalid, because its key-value pairs appear "flattened" in the outer list, rather than each appearing in a two-element list of its own.

Records

A Record is represented by an S-Expression list of its fields, prepended by:

  • the representation of its label, if its label is a ByteString and does not begin with byte 42 (ASCII "*"); or
  • the S-Expression 1:* followed by the representation of the Record's label, otherwise.

Examples.

  • The Text "hello-world" is represented by the S-Expression (5:utf-811:hello-world).
  • The IRI denoting http://www.w3.org/ is represented by the S-Expression (3:iri(5:utf-818:http://www.w3.org/)).
  • The Record #"*"() is represented by the S-Expression (1:*1:*).
  • The Record #"*foo"(#"*bar") is represented by the S-Expression (1:*4:*foo4:*bar).
  • The Record with the empty list as its label and no fields is represented by the S-Expression (1:*(5:*list)).
  • (7:rfc3339(5:utf-83:foo)) represents a well-formed Value that is a Record with #"rfc3339" as its label, and a single Text field. While it is a perfectly reasonable Value, it does not represent a valid date or time, since the Text "foo" does not conform to any of the RFC 3339 productions enumerated above.

Non-examples.

  • ((5:*list)) is not a representation of the Record with the empty list as its label and no fields, because that Record has a non-ByteString as its label, mandating a 1:* prefix on its S-Expression representation.
  • (4:*foo4:*bar) does not represent the Record #"*foo"(#"*bar"), because the label #"*foo" begins with "*", mandating a 1:* prefix on the Record's S-Expression representation.

Examples

((TODO: Give some examples of large and small SPKI-CAT documents, perhaps translated from various JSON blobs floating around the internet.))

Representing Values in Programming Languages

We have given a definition of Value and its semantics, and proposed a concrete syntax for communicating and storing Values. We now turn to suggested representations of Values as programming-language values for various programming languages.

JavaScript

  • ByteStringUint8Array
  • Number ↔ numbers, problematically; bignums, perhaps; other?? TODO
  • ListArray
  • MapObject
  • Record ↔ an instance of something like Record below, unless the label is...
    • #"utf-8"String
    • #"true"true
    • #"false"false
    • #"null"null
    • #"undefined" ↔ the undefined value
    • #"rfc3339"Date, if the Record's field matches the date-time RFC 3339 production
function Record(label, ...fields) {
  this.label = label;
  this.fields = fields;
}

Scheme/Racket

  • ByteString ↔ byte vector (Racket: "Bytes")
  • Number ↔ numbers
  • List ↔ (where possible, immutable) list
  • Map ↔ hash-table
  • Record ↔ a structure (Racket: a "prefab struct"), unless the label is...
    • #"utf-8" ↔ a string
    • #"true"#t
    • #"false"#f
    • #"symbol" ↔ a symbol

Java

  • ByteStringbyte[]
  • Number ↔ numbers, problematically; bignums, perhaps; other?? TODO
  • Listjava.util.List
  • Mapjava.util.Map
  • Record ↔ an instance of something like Record below, unless the label is...
    • #"utf-8"java.lang.String
    • #"true"java.lang.Boolean.TRUE
    • #"false"java.lang.Boolean.FALSE
    • #"null" ↔ a special singleton object, but not Java's null
    • #"rfc3339"java.util.{Date,Time,Timestamp}, according to which RFC 3339 production the Record's field matches

Erlang

  • ByteString ↔ a binary
  • Number ↔ numbers, probably; TODO
  • List ↔ a list
  • Map ↔ a map (new in Erlang/OTP R17)
  • Record ↔ a tuple with the label in the first position, and the fields in subsequent positions, unless the label is...
    • #"true"true
    • #"false"false
    • #"null"null
    • #"undefined"undefined
    • #"symbol" ↔ the Text field converted to an Erlang atom, if some kind of an "unsafe" mode is set on the decoder (because Erlang atoms are not GC'd); otherwise like any other kind of Record


  1. TJSON has a similar aim: “different on-the-wire representations of an object correspond to the same typed data object” (source); “TJSON is defined as a serialization format on top of a JSON-like data model” (source). ↩︎

  2. Section 6 of RFC 7159 does go so far as to indicate “good interoperability can be achieved” by imagining that parsers are able reliably to understand the syntax of numbers as denoting an IEEE 754 double-precision floating-point value. ↩︎

  3. Section 8.3 of RFC 7159 suggests that if an implementation compares strings used as object keys “code unit by code unit”, then it will interoperate with other such implementations, but neither requires this behaviour nor discusses comparisons of strings used in other contexts. ↩︎

  4. The XML world has the concept of XML infoset. Loosely speaking, XML infoset is the denotation of an XML document; the meaning of the document. ↩︎

  5. Most other recent data languages are like JSON in specifying only a syntax with no associated semantics. While some do make a sketch of a semantics, the result is often underspecified (e.g. in terms of how strings are to be compared), overly machine-oriented (e.g. treating 32-bit integers as fundamentally distinct from 64-bit integers and from floating-point numbers), overly fine (e.g. giving visibility to the order in which map entries are written), or all three. ↩︎

  6. This design was loosely inspired by Zephyr ASDL (h/t Darius Bacon), which doesn't offer much in the way of atoms, but offers general-purpose labelled sums and products. See D. C. Wang, A. W. Appel, J. L. Korn, and C. S. Serra, “The Zephyr Abstract Syntax Description Language,” in USENIX Conference on Domain-Specific Languages, 1997, pp. 213228. PDF available. ↩︎

  7. Why include ByteString, when we could instead use a reserved Record along with a List of Numbers? ((TODO: Actually decide about this! Similarly, why include Map rather than a restricted form of List with a Record? I think the answer has to do with the arbitrariness of the label we'd pick: unless extremely carefully chosen (i.e. number 0 (ideally even -Inf!) for byte strings, number 1 for map, and have the order go Number < Record < List), they would mess up the prettiness of the ordering. Though... we could ultimately reduce this to Number and Record, and have a family of #"list" and #"map" Records...)) ↩︎

  8. It is occasionally (but seldom) necessary to interpret such ByteString labels as UTF-8 encoded IRIs. Where a label can be read as a relative IRI, it is notionally interpreted with respect to the IRI http://spki-cat.org/ ((TODO: placeholder)); where a label can be read as an absolute IRI, it stands for that IRI; and otherwise, it cannot be read as an IRI at all, and so the label simply stands for itself - for its own Value. ↩︎

  9. Happily, the design of UTF-8 is such that this gives the same result as a lexicographic code-point-by-code-point comparison! ↩︎

  10. Those who remember ASN.1 will recall BER, DER, PER, CER, XER and so on, each appropriate to a different setting. Similarly, Rivest's S-Expression design offers a human-friendly syntax, a syntax robust to network-induced message corruption, and an unambiguous, simple and easily-parsed machine-friendly syntax for the same underlying values. ↩︎

  11. Why not just use Rivest's S-Expressions as they are? While they include binary data and sequences, and an obvious equivalence for them exists, they lack numbers per se as well as any kind of unordered structure such as sets or maps. In addition, while "display hints" allow labelling of binary data with an intended interpretation, they cannot be attached to any other kind of structure, and the "hint" itself can only be a binary blob. ↩︎