Split out inessential text from the spec

Tony Garnock-Jones 2019-08-18 17:51:26 +01:00
parent 1bb7e1862e
commit 9064258dbc
5 changed files with 500 additions and 479 deletions

181
conventions.md Normal file

@ -0,0 +1,181 @@
---
---
<title>Preserves: Conventions for Common Data Types</title>
<link rel="stylesheet" href="preserves.css">
# Preserves: Conventions for Common Data Types
The `Value` data type is essentially an S-Expression, able to
represent semi-structured data over `ByteString`, `String`,
`SignedInteger` atoms and so on.[^why-not-spki-sexps]
[^why-not-spki-sexps]: Rivest's S-Expressions are in many ways
similar to Preserves. However, while they include binary data and
sequences, and an obvious equivalence for them exists, they lack
numbers *per se* as well as any kind of unordered structure such
as sets or maps. In addition, while “display hints” allow
labelling of binary data with an intended interpretation, they
cannot be attached to any other kind of structure, and the “hint”
itself can only be a binary blob.
However, users need a wide variety of data types for representing
domain-specific values such as various kinds of encoded and normalized
text, calendrical values, machine words, and so on.
Appropriately-labelled `Record`s denote these domain-specific data
types.[^why-dictionaries]
[^why-dictionaries]: Given `Record`'s existence, it may seem odd
that `Dictionary`, `Set`, `Float`, etc. are given special
treatment. Preserves aims to offer a useful basic equivalence
predicate to programmers, and so if a data type demands a special
equivalence predicate, as `Dictionary`, `Set` and `Float` all do,
then the type should be included in the base language. Otherwise,
it can be represented as a `Record` and treated separately.
`Boolean`, `String` and `Symbol` are seeming exceptions. The first
two merit inclusion because of their cultural importance, while
`Symbol`s are included to allow their use as `Record` labels.
Primitive `Symbol` support avoids a bootstrapping issue.
All of these conventions are optional. They form a layer atop the core
`Value` structure. Non-domain-specific tools do not in general need to
treat them specially.
**Validity.** Many of the labels we will describe in this section come
with side-conditions on the contents of labelled `Record`s. It is
possible to construct an instance of `Value` that violates these
side-conditions without ceasing to be a `Value` or becoming
unrepresentable. However, we say that such a `Value` is *invalid*
because it fails to honour the necessary side-conditions.
Implementations *SHOULD* allow two modes of working: one which
treats all `Value`s identically, without regard for side-conditions,
and one which enforces validity (i.e. side-conditions) when reading,
writing, or constructing `Value`s.
## IOLists.
Inspired by Erlang's notions of
[`iolist()` and `iodata()`](http://erlang.org/doc/reference_manual/typespec.html),
an `IOList` is any tree constructed from `ByteString`s and
`Sequence`s. Formally, an `IOList` is either a `ByteString` or a
`Sequence` of `IOList`s.
`IOList`s can be useful for
[vectored I/O](https://en.wikipedia.org/wiki/Vectored_I/O).
Additionally, the flexibility of `IOList` trees allows annotation of
interior portions of a tree.
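The recursive definition above can be sketched in Python. This is an illustration only: it assumes a hypothetical mapping in which a `ByteString` is a Python `bytes` object and a `Sequence` is a `list` or `tuple`, and the function names are invented for this example.

```python
from typing import Iterator


def iolist_chunks(node: object) -> Iterator[bytes]:
    """Yield the ByteString leaves of an IOList in left-to-right order."""
    if isinstance(node, (bytes, bytearray)):
        # Base case: an IOList may be a single ByteString.
        yield bytes(node)
    elif isinstance(node, (list, tuple)):
        # Recursive case: a Sequence whose elements are themselves IOLists.
        for child in node:
            yield from iolist_chunks(child)
    else:
        raise TypeError(f"not an IOList: {node!r}")


def iolist_to_bytes(node: object) -> bytes:
    """Concatenate all leaves into one contiguous byte string."""
    return b"".join(iolist_chunks(node))


# A nested IOList of four 4-byte chunks.
example = [
    bytes.fromhex("00010203"),
    [bytes.fromhex("04050607"), bytes.fromhex("08090A0B")],
    bytes.fromhex("0C0D0E0F"),
]
flattened = iolist_to_bytes(example)
```

The chunk iterator is the part relevant to vectored I/O: the leaves can be handed to a gather-write call one by one without first copying them into a single buffer.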
## Comments.
`String` values used as annotations are conventionally interpreted as
comments.

    @"I am a comment for the Dictionary"
    {
      @"I am a comment for the key"
      key: @"I am a comment for the value"
           value
    }

    @"I am a comment for this entire IOList"
    [
      #hex{00010203}
      @"I am a comment for the middle half of the IOList"
      @"A second comment for the same portion of the IOList"
      [
        @"I am a comment for the following ByteString"
        #hex{04050607}
        #hex{08090A0B}
      ]
      #hex{0C0D0E0F}
    ]

## MIME-type tagged binary data.
Many internet protocols use
[media types](https://tools.ietf.org/html/rfc6838) (a.k.a. MIME types)
to indicate the format of some associated binary data. For this
purpose, we define `MIMEData` to be a record labelled `mime` with two
fields, the first being a `Symbol`, the media type, and the second
being a `ByteString`, the binary data.
While each media type may define its own rules for comparing
documents, we define ordering among `MIMEData` *representations* of
such media types following the general rules for ordering of
`Record`s.
**Examples.**
| Value | Encoded hexadecimal byte sequence |
|--------------------------------------------|-------------------------------------------------------------------------------------------------------------------|
| `<mime application/octet-stream #"abcde">` | 83 74 6D 69 6D 65 7F 18 61 70 70 6C 69 63 61 74 69 6F 6E 2F 6F 63 74 65 74 2D 73 74 72 65 61 6D 65 61 62 63 64 65 |
| `<mime text/plain #"ABC">` | 83 74 6D 69 6D 65 7A 74 65 78 74 2F 70 6C 61 69 6E 63 41 42 43 |
| `<mime application/xml #"<xhtml/>">` | 83 74 6D 69 6D 65 7F 0F 61 70 70 6C 69 63 61 74 69 6F 6E 2F 78 6D 6C 68 3C 78 68 74 6D 6C 2F 3E |
| `<mime text/csv #"123,234,345">` | 83 74 6D 69 6D 65 78 74 65 78 74 2F 63 73 76 6B 31 32 33 2C 32 33 34 2C 33 34 35 |
Applications making heavy use of `mime` records may choose to use a
placeholder number for the symbol `mime` as well as the symbols for
individual media types. For example, if placeholder number 1 were
chosen for `mime`, and placeholder number 7 for `text/plain`, the
second example above, `<mime text/plain #"ABC">`, would be encoded as
`83 11 17 63 41 42 43`.
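The byte sequences in the table can be reproduced with a short Python sketch. The layout used here is inferred only from the example rows and the placeholder example above, not taken from the full binary grammar: a lead byte of `83` for the three-element record, a high nibble of `7` for a `Symbol` and `6` for a `ByteString`, the length in the low nibble, and a low nibble of `F` followed by a single length byte for lengths of 15 or more. The function names are hypothetical, and lengths of 128 or more are rejected because none of the examples shows how they are encoded.

```python
def _header(type_nibble: int, length: int) -> bytes:
    """Lead byte(s) for a Symbol or ByteString body of `length` bytes."""
    if length < 15:
        return bytes([(type_nibble << 4) | length])
    if length < 128:
        # Low nibble 0xF signals that the length follows in the next byte.
        return bytes([(type_nibble << 4) | 0x0F, length])
    raise ValueError("lengths >= 128 are not covered by this sketch")


def encode_symbol(name: str) -> bytes:
    body = name.encode("utf-8")
    return _header(0x7, len(body)) + body


def encode_bytestring(data: bytes) -> bytes:
    return _header(0x6, len(data)) + data


def encode_mime(media_type: str, data: bytes) -> bytes:
    """Encode <mime media_type data> with both symbols written out in full."""
    return (bytes([0x83]) + encode_symbol("mime")
            + encode_symbol(media_type) + encode_bytestring(data))


def encode_mime_with_placeholders(mime_ph: int, type_ph: int, data: bytes) -> bytes:
    """Same record, with placeholder numbers (assumed < 15) for both symbols."""
    if not (0 <= mime_ph < 15 and 0 <= type_ph < 15):
        raise ValueError("placeholder numbers >= 15 are not covered by this sketch")
    return bytes([0x83, 0x10 | mime_ph, 0x10 | type_ph]) + encode_bytestring(data)


plain = encode_mime("text/plain", b"ABC")
short = encode_mime_with_placeholders(1, 7, b"ABC")
```

With placeholders, the 21-byte encoding of the second table row shrinks to 7 bytes, which is the saving the paragraph above describes.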
## Unicode normalization forms.
Unicode defines multiple
[normalization forms](http://unicode.org/reports/tr15/) for text.
While no particular normalization form is required for `String`s,
users may need to unambiguously signal or require a particular
normalization form. A `NormalizedString` is a `Record` labelled with
`unicode-normalization` and having two fields, the first of which is a
`Symbol` specifying the normalization form used (e.g. `nfc`, `nfd`,
`nfkc`, `nfkd`), and the second of which is a `String` whose
underlying code point representation *MUST* be normalized according to
the named normalization form.
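A validity check for this side-condition might look like the following Python sketch. It assumes the record's two fields have already been extracted as a symbol name and a string; the function name and the decision to reject unknown form names are choices made for this example, not part of the convention.

```python
import unicodedata

# Map the lowercase symbol names used above to the names Python expects.
_FORMS = {"nfc": "NFC", "nfd": "NFD", "nfkc": "NFKC", "nfkd": "NFKD"}


def is_valid_normalized_string(form_symbol: str, text: str) -> bool:
    """True if `text` is already normalized according to `form_symbol`."""
    try:
        form = _FORMS[form_symbol]
    except KeyError:
        raise ValueError(f"unknown normalization form: {form_symbol!r}") from None
    # A string is in a given form exactly when normalizing it changes nothing.
    return unicodedata.normalize(form, text) == text


composed = "p\u00e4ron"      # "ä" as the single code point U+00E4
decomposed = "pa\u0308ron"   # "a" followed by combining diaeresis U+0308
```

Here `composed` satisfies `nfc` but not `nfd`, and `decomposed` the reverse, even though both render as the same text.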
## IRIs (URIs, URLs, URNs, etc.).
An `IRI` is a `Record` labelled with `iri` and having one field, a
`String` which is the IRI itself and which *MUST* be a valid absolute
or relative IRI.
## Machine words.
The definition of `SignedInteger` captures all integers. However, in
certain circumstances it can be valuable to assert that a number
inhabits a particular range, such as a fixed-width machine word.
A family of labels `i`*n* and `u`*n* for *n* ∈ {8,16,32,64} denote
*n*-bit-wide signed and unsigned range restrictions, respectively.
Records with these labels *MUST* have one field, a `SignedInteger`,
which *MUST* fall within the appropriate range. That is, to be valid,
- in `<i8 `*x*`>`, -128 <= *x* <= 127.
- in `<u8 `*x*`>`, 0 <= *x* <= 255.
- in `<i16 `*x*`>`, -32768 <= *x* <= 32767.
- etc.
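The ranges follow the usual two's-complement bounds, so they can be computed from the label rather than tabulated. A minimal Python sketch, with hypothetical function names and assuming the label and the single field have already been extracted from the record:

```python
import re
from typing import Tuple

_LABEL = re.compile(r"([iu])(8|16|32|64)")


def word_range(label: str) -> Tuple[int, int]:
    """Inclusive (low, high) bounds for a label such as "i8" or "u64"."""
    match = _LABEL.fullmatch(label)
    if match is None:
        raise ValueError(f"not a machine-word label: {label!r}")
    bits = int(match.group(2))
    if match.group(1) == "i":
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1


def is_valid_word(label: str, x: object) -> bool:
    """True if `x` is an integer inside the range named by `label`."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(x, bool) or not isinstance(x, int):
        return False
    low, high = word_range(label)
    return low <= x <= high
```

For example, `word_range("i8")` evaluates to `(-128, 127)` and `is_valid_word("u8", 256)` to `False`, matching the bounds listed above.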
## Anonymous Tuples and Unit.
A `Tuple` is a `Record` with label `tuple` and zero or more fields,
denoting an anonymous tuple of values.
The 0-ary tuple, `<tuple>`, denotes the empty tuple, sometimes called
“unit” or “void” (but *not* e.g. JavaScript's “undefined” value).
## Null and Undefined.
Tony Hoare's
“[billion-dollar mistake](https://en.wikipedia.org/wiki/Tony_Hoare#Apologies_and_retractions)”
can be represented with the 0-ary `Record` `<null>`. An “undefined”
value can be represented as `<undefined>`.
## Dates and Times.
Dates, times, moments, and timestamps can be represented with a
`Record` with label `rfc3339` having a single field, a `String`, which
*MUST* conform to one of the `full-date`, `partial-time`, `full-time`,
or `date-time` productions of
[section 5.6 of RFC 3339](https://tools.ietf.org/html/rfc3339#section-5.6).
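As a rough illustration, the four productions can be approximated with regular expressions in Python. This sketch checks syntax only: it does not reject out-of-range fields such as month 13, which RFC 3339 restricts separately, and the function name is invented for this example.

```python
import re
from typing import Optional

_DATE = r"[0-9]{4}-[0-9]{2}-[0-9]{2}"
_PARTIAL_TIME = r"[0-9]{2}:[0-9]{2}:[0-9]{2}(?:\.[0-9]+)?"
_OFFSET = r"(?:[Zz]|[+-][0-9]{2}:[0-9]{2})"
_FULL_TIME = _PARTIAL_TIME + _OFFSET

_PRODUCTIONS = {
    "full-date": re.compile(_DATE),
    "partial-time": re.compile(_PARTIAL_TIME),
    "full-time": re.compile(_FULL_TIME),
    "date-time": re.compile(_DATE + "[Tt]" + _FULL_TIME),
}


def rfc3339_production(text: str) -> Optional[str]:
    """Name of the production that `text` matches in full, or None."""
    for name, pattern in _PRODUCTIONS.items():
        if pattern.fullmatch(text):
            return name
    return None
```

An enforcing implementation could treat an `rfc3339` record as invalid whenever this function returns `None` for its field, and then apply range checks on top.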
<!-- Heading to visually offset the footnotes from the main document: -->
## Notes


@ -26,7 +26,8 @@ programming languages.
Preserves also supports the usual suite of atomic and compound data
types, in particular including *binary* data as a distinct type from
text strings. Its *annotations* allow separation of data from metadata
such as comments, trace information, and provenance information.
such as [comments](conventions.html#comments), trace information, and
provenance information.
Finally, Preserves defines precisely how to *compare* two values.
Comparison is based on the data model, not on syntax or on data
@ -873,180 +874,6 @@ encodes to binary as follows:
53 "Zip" 55 "94085"
57 "Country" 52 "US"
## Conventions for Common Data Types
The `Value` data type is essentially an S-Expression, able to
represent semi-structured data over `ByteString`, `String`,
`SignedInteger` atoms and so on.[^why-not-spki-sexps]
[^why-not-spki-sexps]: Rivest's S-Expressions are in many ways
similar to Preserves. However, while they include binary data and
sequences, and an obvious equivalence for them exists, they lack
numbers *per se* as well as any kind of unordered structure such
as sets or maps. In addition, while “display hints” allow
labelling of binary data with an intended interpretation, they
cannot be attached to any other kind of structure, and the “hint”
itself can only be a binary blob.
However, users need a wide variety of data types for representing
domain-specific values such as various kinds of encoded and normalized
text, calendrical values, machine words, and so on.
Appropriately-labelled `Record`s denote these domain-specific data
types.[^why-dictionaries]
[^why-dictionaries]: Given `Record`'s existence, it may seem odd
that `Dictionary`, `Set`, `Float`, etc. are given special
treatment. Preserves aims to offer a useful basic equivalence
predicate to programmers, and so if a data type demands a special
equivalence predicate, as `Dictionary`, `Set` and `Float` all do,
then the type should be included in the base language. Otherwise,
it can be represented as a `Record` and treated separately.
`Boolean`, `String` and `Symbol` are seeming exceptions. The first
two merit inclusion because of their cultural importance, while
`Symbol`s are included to allow their use as `Record` labels.
Primitive `Symbol` support avoids a bootstrapping issue.
All of these conventions are optional. They form a layer atop the core
`Value` structure. Non-domain-specific tools do not in general need to
treat them specially.
**Validity.** Many of the labels we will describe in this section come
with side-conditions on the contents of labelled `Record`s. It is
possible to construct an instance of `Value` that violates these
side-conditions without ceasing to be a `Value` or becoming
unrepresentable. However, we say that such a `Value` is *invalid*
because it fails to honour the necessary side-conditions.
Implementations *SHOULD* allow two modes of working: one which
treats all `Value`s identically, without regard for side-conditions,
and one which enforces validity (i.e. side-conditions) when reading,
writing, or constructing `Value`s.
### IOLists.
Inspired by Erlang's notions of
[`iolist()` and `iodata()`](http://erlang.org/doc/reference_manual/typespec.html),
an `IOList` is any tree constructed from `ByteString`s and
`Sequence`s. Formally, an `IOList` is either a `ByteString` or a
`Sequence` of `IOList`s.
`IOList`s can be useful for
[vectored I/O](https://en.wikipedia.org/wiki/Vectored_I/O).
Additionally, the flexibility of `IOList` trees allows annotation of
interior portions of a tree.
### Comments.
`String` values used as annotations are conventionally interpreted as
comments.

    @"I am a comment for the Dictionary"
    {
      @"I am a comment for the key"
      key: @"I am a comment for the value"
           value
    }

    @"I am a comment for this entire IOList"
    [
      #hex{00010203}
      @"I am a comment for the middle half of the IOList"
      @"A second comment for the same portion of the IOList"
      [
        @"I am a comment for the following ByteString"
        #hex{04050607}
        #hex{08090A0B}
      ]
      #hex{0C0D0E0F}
    ]

### MIME-type tagged binary data.
Many internet protocols use
[media types](https://tools.ietf.org/html/rfc6838) (a.k.a. MIME types)
to indicate the format of some associated binary data. For this
purpose, we define `MIMEData` to be a record labelled `mime` with two
fields, the first being a `Symbol`, the media type, and the second
being a `ByteString`, the binary data.
While each media type may define its own rules for comparing
documents, we define ordering among `MIMEData` *representations* of
such media types following the general rules for ordering of
`Record`s.
**Examples.**
| Value | Encoded hexadecimal byte sequence |
|--------------------------------------------|-------------------------------------------------------------------------------------------------------------------|
| `<mime application/octet-stream #"abcde">` | 83 74 6D 69 6D 65 7F 18 61 70 70 6C 69 63 61 74 69 6F 6E 2F 6F 63 74 65 74 2D 73 74 72 65 61 6D 65 61 62 63 64 65 |
| `<mime text/plain #"ABC">` | 83 74 6D 69 6D 65 7A 74 65 78 74 2F 70 6C 61 69 6E 63 41 42 43 |
| `<mime application/xml #"<xhtml/>">` | 83 74 6D 69 6D 65 7F 0F 61 70 70 6C 69 63 61 74 69 6F 6E 2F 78 6D 6C 68 3C 78 68 74 6D 6C 2F 3E |
| `<mime text/csv #"123,234,345">` | 83 74 6D 69 6D 65 78 74 65 78 74 2F 63 73 76 6B 31 32 33 2C 32 33 34 2C 33 34 35 |
Applications making heavy use of `mime` records may choose to use a
placeholder number for the symbol `mime` as well as the symbols for
individual media types. For example, if placeholder number 1 were
chosen for `mime`, and placeholder number 7 for `text/plain`, the
second example above, `<mime text/plain #"ABC">`, would be encoded as
`83 11 17 63 41 42 43`.
### Unicode normalization forms.
Unicode defines multiple
[normalization forms](http://unicode.org/reports/tr15/) for text.
While no particular normalization form is required for `String`s,
users may need to unambiguously signal or require a particular
normalization form. A `NormalizedString` is a `Record` labelled with
`unicode-normalization` and having two fields, the first of which is a
`Symbol` specifying the normalization form used (e.g. `nfc`, `nfd`,
`nfkc`, `nfkd`), and the second of which is a `String` whose
underlying code point representation *MUST* be normalized according to
the named normalization form.
### IRIs (URIs, URLs, URNs, etc.).
An `IRI` is a `Record` labelled with `iri` and having one field, a
`String` which is the IRI itself and which *MUST* be a valid absolute
or relative IRI.
### Machine words.
The definition of `SignedInteger` captures all integers. However, in
certain circumstances it can be valuable to assert that a number
inhabits a particular range, such as a fixed-width machine word.
A family of labels `i`*n* and `u`*n* for *n* ∈ {8,16,32,64} denote
*n*-bit-wide signed and unsigned range restrictions, respectively.
Records with these labels *MUST* have one field, a `SignedInteger`,
which *MUST* fall within the appropriate range. That is, to be valid,
- in `<i8 `*x*`>`, -128 <= *x* <= 127.
- in `<u8 `*x*`>`, 0 <= *x* <= 255.
- in `<i16 `*x*`>`, -32768 <= *x* <= 32767.
- etc.
### Anonymous Tuples and Unit.
A `Tuple` is a `Record` with label `tuple` and zero or more fields,
denoting an anonymous tuple of values.
The 0-ary tuple, `<tuple>`, denotes the empty tuple, sometimes called
“unit” or “void” (but *not* e.g. JavaScript's “undefined” value).
### Null and Undefined.
Tony Hoare's
“[billion-dollar mistake](https://en.wikipedia.org/wiki/Tony_Hoare#Apologies_and_retractions)”
can be represented with the 0-ary `Record` `<null>`. An “undefined”
value can be represented as `<undefined>`.
### Dates and Times.
Dates, times, moments, and timestamps can be represented with a
`Record` with label `rfc3339` having a single field, a `String`, which
*MUST* conform to one of the `full-date`, `partial-time`, `full-time`,
or `date-time` productions of
[section 5.6 of RFC 3339](https://tools.ietf.org/html/rfc3339#section-5.6).
## Security Considerations
**Empty chunks.** Chunks of zero length are prohibited in streamed
@ -1141,309 +968,5 @@ Then, if `ttnn`=`0001`, `l` is the placeholder number; otherwise, `l`
is the length of the body that follows, counted in bytes for `tt`=`01`
and in `Repr`s for `tt`=`10`.
<!-- Not yet ready
## Appendix. Representing Values in Programming Languages
We have given a definition of `Value` and its semantics, and proposed
a concrete syntax for communicating and storing `Value`s. We now turn
to **suggested** representations of `Value`s as *programming-language
values* for various programming languages.
When designing a language mapping, an important consideration is
roundtripping: serialization after deserialization, and vice versa,
should both be identities.
Also, the presence or absence of annotations on a `Value` should not
affect comparisons of that `Value` to others in any way.
### JavaScript.
- `Boolean` ↔ `Boolean`
- `Float` and `Double` ↔ numbers
- `SignedInteger` ↔ numbers or `BigInt` (see [here](https://developers.google.com/web/updates/2018/05/bigint) and [here](https://github.com/tc39/proposal-bigint))
- `String` ↔ strings
- `ByteString` ↔ `Uint8Array`
- `Symbol` ↔ `Symbol.for(...)`
- `Record` ↔ `{ "_label": theLabel, "_fields": [field0, ..., fieldN] }`, plus convenience accessors
- `(undefined)` ↔ the undefined value
- `(rfc3339 F)` ↔ `Date`, if `F` matches the `date-time` RFC 3339 production
- `Sequence` ↔ `Array`
- `Set` ↔ `{ "_set": M }` where `M` is a `Map` from the elements of the set to `true`
- `Dictionary` ↔ a `Map`
### Scheme/Racket.
- `Boolean` ↔ booleans
- `Float` and `Double` ↔ inexact numbers (Racket: single- and double-precision floats)
- `SignedInteger` ↔ exact numbers
- `String` ↔ strings
- `ByteString` ↔ byte vector (Racket: "Bytes")
- `Symbol` ↔ symbols
- `Record` ↔ structures (Racket: prefab struct)
- `Sequence` ↔ lists
- `Set` ↔ Racket: sets
- `Dictionary` ↔ Racket: hash-table
### Java.
- `Boolean` ↔ `Boolean`
- `Float` and `Double` ↔ `Float` and `Double`
- `SignedInteger` ↔ `Integer`, `Long`, `BigInteger`
- `String` ↔ `String`
- `ByteString` ↔ `byte[]`
- `Symbol` ↔ a simple data class wrapping a `String`
- `Record` ↔ in a simple implementation, a generic `Record` class; else perhaps a bean mapping?
- `(mime T B)` ↔ an implementation of `javax.activation.DataSource`?
- `Sequence` ↔ an implementation of `java.util.List`
- `Set` ↔ an implementation of `java.util.Set`
- `Dictionary` ↔ an implementation of `java.util.Map`
### Erlang.
- `Boolean` ↔ `true` and `false`
- `Float` and `Double` ↔ floats (unsure how Erlang deals with single-precision)
- `SignedInteger` ↔ integers
- `String` ↔ pair of `utf8` and a binary
- `ByteString` ↔ a binary
- `Symbol` ↔ pair of `atom` and a binary
- `Record` ↔ triple of `obj`, label, and field list
- `Sequence` ↔ a list
- `Set` ↔ a `sets` set
- `Dictionary` ↔ a [map][erlang-map] (new in Erlang/OTP R17)
This is a somewhat unsatisfactory mapping because: (a) Erlang doesn't
garbage-collect its atoms, meaning that (a.1) representing `Symbol`s
as atoms could lead to denial-of-service and (a.2) representing
`Symbol`-labelled `Record`s as Erlang records must be rejected for the
same reason; (b) even if it did, Erlang's boolean values are atoms,
which would then clash with the `Symbol`s `true` and `false`; and (c)
Erlang has no distinct string type, making for a trilemma where
`String`s are in danger of clashing with `ByteString`s, `Sequence`s,
or `Record`s.
### Python.
- `Boolean` ↔ `True` and `False`
- `Float` ↔ a `Float` wrapper-class for a double-precision value
- `Double` ↔ float
- `SignedInteger` ↔ int and long
- `String` ↔ `unicode`
- `ByteString` ↔ `bytes`
- `Symbol` ↔ a simple data class wrapping a `unicode`
- `Record` ↔ something like `namedtuple`, but that doesn't care about class identity?
- `Sequence` ↔ `tuple` (but accept `list` during encoding)
- `Set` ↔ `frozenset` (but accept `set` during encoding)
- `Dictionary` ↔ a hashable (immutable) dictionary-like thing (but accept `dict` during encoding)
### Squeak Smalltalk.
- `Boolean` ↔ `true` and `false`
- `Float` ↔ perhaps a subclass of `Float`?
- `Double` ↔ `Float`
- `SignedInteger` ↔ `Integer`
- `String` ↔ `WideString`
- `ByteString` ↔ `ByteArray`
- `Symbol` ↔ `WideSymbol`
- `Record` ↔ a simple data class
- `Sequence` ↔ `ArrayedCollection` (usually `OrderedCollection`)
- `Set` ↔ `Set`
- `Dictionary` ↔ `Dictionary`
-->
## Appendix. Why not Just Use JSON?
<!-- JSON lacks semantics: JSON syntax doesn't denote anything -->
JSON offers *syntax* for numbers, strings, booleans, null, arrays and
string-keyed maps. However, it suffers from two major problems. First,
it offers no *semantics* for the syntax: it is left to each
implementation to determine how to treat each JSON term. This causes
[interoperability](http://seriot.ch/parsing_json.php) and even
[security](http://web.archive.org/web/20180906202559/http://docs.couchdb.org/en/stable/cve/2017-12635.html)
issues. Second, JSON's lack of support for type tags leads to awkward
and incompatible *encodings* of type information in terms of the fixed
suite of constructors on offer.
There are other minor problems with JSON having to do with its syntax.
Examples include its relative verbosity and its lack of support for
binary data.
### JSON syntax doesn't *mean* anything
When are two JSON values the same? When are they different?
<!-- When is one JSON value “less than” another? -->
The specifications are largely silent on these questions. Different
JSON implementations give different answers.
Specifically, JSON does not:
- assign any meaning to numbers,[^meaning-ieee-double]
- determine how strings are to be compared,[^string-key-comparison]
- determine whether object key ordering is significant,[^json-member-ordering] or
- determine whether duplicate object keys are permitted, what it
would mean if they were, or how to determine a duplicate in the
first place.[^json-key-uniqueness]
In short, JSON syntax doesn't *denote* anything.[^xml-infoset] [^other-formats]
[^meaning-ieee-double]:
[Section 6 of RFC 8259](https://tools.ietf.org/html/rfc8259#section-6)
does go so far as to indicate “good interoperability can be
achieved” by imagining that parsers are able reliably to
understand the syntax of numbers as denoting an IEEE 754
double-precision floating-point value.
[^string-key-comparison]:
[Section 8.3 of RFC 8259](https://tools.ietf.org/html/rfc8259#section-8.3)
suggests that *if* an implementation compares strings used as
object keys “code unit by code unit”, then it will interoperate
with *other such implementations*, but neither requires this
behaviour nor discusses comparisons of strings used in other
contexts.
[^json-member-ordering]:
[Section 4 of RFC 8259](https://tools.ietf.org/html/rfc8259#section-4)
remarks that “[implementations] differ as to whether or not they
make the ordering of object members visible to calling software.”
[^json-key-uniqueness]:
[Section 4 of RFC 8259](https://tools.ietf.org/html/rfc8259#section-4)
is the only place in the specification that mentions the issue. It
explicitly sanctions implementations supporting duplicate keys,
noting only that “when the names within an object are not unique,
the behavior of software that receives such an object is
unpredictable.” Implementations are free to choose any behaviour
at all in this situation, including signalling an error, or
discarding all but one of a set of duplicates.
[^xml-infoset]: The XML world has the concept of
[XML infoset](https://www.w3.org/TR/xml-infoset/). Loosely
speaking, XML infoset is the *denotation* of an XML document; the
*meaning* of the document.
[^other-formats]: Most other recent data languages are like JSON in
specifying only a syntax with no associated semantics. While some
do make a sketch of a semantics, the result is often
underspecified (e.g. in terms of how strings are to be compared),
overly machine-oriented (e.g. treating 32-bit integers as
fundamentally distinct from 64-bit integers and from
floating-point numbers), overly fine (e.g. giving visibility to
the order in which map entries are written), or all three.
Some examples:
- are the JSON values `1`, `1.0`, and `1e0` the same or different?
- are the JSON values `1.0` and `1.0000000000000001` the same or different?
- are the JSON strings `"päron"` (UTF-8 `70c3a4726f6e`) and `"päron"`
(UTF-8 `7061cc88726f6e`) the same or different?
- are the JSON objects `{"a":1, "b":2}` and `{"b":2, "a":1}` the same
or different?
- which, if any, of `{"a":1, "a":2}`, `{"a":1}` and `{"a":2}` are the
same? Are all three legal?
- are `{"päron":1}` and `{"päron":1}` the same or different?
### JSON can multiply nicely, but it can't add very well
JSON includes a fixed set of types: numbers, strings, booleans, null,
arrays and string-keyed maps. Domain-specific data must be *encoded*
into these types. For example, dates and email addresses are often
represented as strings with an implicit internal structure.
There is no convention for *labelling* a value as belonging to a
particular category. Instead, JSON-encoded data are often labelled in
an ad-hoc way. Multiple incompatible approaches exist. For example, a
“money” structure containing a `currency` field and an `amount` may be
represented in any number of ways:
{ "_type": "money", "currency": "EUR", "amount": 10 }
{ "type": "money", "value": { "currency": "EUR", "amount": 10 } }
[ "money", { "currency": "EUR", "amount": 10 } ]
{ "@money": { "currency": "EUR", "amount": 10 } }
This causes particular problems when JSON is used to represent *sum*
or *union* types, such as “either a value or an error, but not both”.
Again, multiple incompatible approaches exist.
For example, imagine an API for depositing money in an account. The
response might be either a “success” response indicating the new
balance, or one of a set of possible errors.
Sometimes, a *pair* of values is used, with `null` marking the option
not taken.[^interesting-failure-mode]
{ "ok": { "balance": 210 }, "error": null }
{ "ok": null, "error": "Unauthorized" }
[^interesting-failure-mode]: What is the meaning of a document where
both `ok` and `error` are non-null? What might happen when a
program is presented with such a document?
The branch not chosen is sometimes present, sometimes omitted as if it
were an optional field:
{ "ok": { "balance": 210 } }
{ "error": "Unauthorized" }
Sometimes, an array of a label and a value is used:
[ "ok", { "balance": 210 } ]
[ "error", "Unauthorized" ]
Sometimes, the shape of the data is sufficient to distinguish among
the alternatives, and the label is left implicit:
{ "balance": 210 }
"Unauthorized"
JSON itself does not offer any guidance for which of these options to
choose. In many real cases on the web, poor choices have led to
encodings that are irrecoverably ambiguous.
# Open questions
Q. Should "symbols" instead be URIs? Relative, usually; relative to
what? Some domain-specific base URI?
Q. Literal small integers: are they pulling their weight? They're not
absolutely necessary.
Q. Should we go for trying to make the data ordering line up with the
encoding ordering? We'd have to only use streaming forms, and avoid
the small integer encoding, and not store record arities, and sort
sets and dictionaries, and mask floats and doubles (perhaps
[like this](https://stackoverflow.com/questions/43299299/sorting-floating-point-values-using-their-byte-representation)),
and perhaps pick a specific `NaN`, and I don't know what to do about
SignedIntegers. Perhaps make them more like float formats, with the
byte count acting as a kind of exponent underneath the sign bit.
- Perhaps define separate additional canonicalization restrictions?
Doesn't help the ordering, but does help the equivalence.
- Canonicalization and early-bailout-equivalence-checking are in
tension with support for streaming values.
Q. To remain compatible with JSON, portions of the text syntax have to
remain case-insensitive (`%i"..."`). However, non-JSON extensions do
not. There's only one (?) at the moment, the `%i"f"` in `Float`;
should it be changed to case-sensitive?
Q. Should `IOList`s be wrapped in an identifying unary record constructor?
TODO: Examples of the ordering. `"bzz" < "c" < "caa"`; `#true < 3 < "3" < |3|`
TODO: Probably should add a canonicalized subset. Consider adding
explicit "I promise this is canonical" marker, like a BOM, which
identifies a binary value as (first) binary and (second, optionally)
as canonical. UTF-8 disallows byte `0xFF` from appearing anywhere in a
text; this might be a good candidate for a marker sequence.
((Actually, perhaps `0x10` would be good! It corresponds to DLE, "data
link escape"; it is not a printable ASCII character, and is disallowed
in the textual Preserves grammar; and it is also mnemonic for "version
0", since it is the Preserves binary encoding of the small integer
zero.))
<!-- Heading to visually offset the footnotes from the main document: -->
## Notes

47
questions.md Normal file

@ -0,0 +1,47 @@
---
---
<title>Preserves: Open questions</title>
<link rel="stylesheet" href="preserves.css">
# Open questions
Q. Should "symbols" instead be URIs? Relative, usually; relative to
what? Some domain-specific base URI?
Q. Literal small integers: are they pulling their weight? They're not
absolutely necessary.
Q. Should we go for trying to make the data ordering line up with the
encoding ordering? We'd have to only use streaming forms, and avoid
the small integer encoding, and not store record arities, and sort
sets and dictionaries, and mask floats and doubles (perhaps
[like this](https://stackoverflow.com/questions/43299299/sorting-floating-point-values-using-their-byte-representation)),
and perhaps pick a specific `NaN`, and I don't know what to do about
SignedIntegers. Perhaps make them more like float formats, with the
byte count acting as a kind of exponent underneath the sign bit.
- Perhaps define separate additional canonicalization restrictions?
Doesn't help the ordering, but does help the equivalence.
- Canonicalization and early-bailout-equivalence-checking are in
tension with support for streaming values.
Q. To remain compatible with JSON, portions of the text syntax have to
remain case-insensitive (`%i"..."`). However, non-JSON extensions do
not. There's only one (?) at the moment, the `%i"f"` in `Float`;
should it be changed to case-sensitive?
Q. Should `IOList`s be wrapped in an identifying unary record constructor?
TODO: Examples of the ordering. `"bzz" < "c" < "caa"`; `#true < 3 < "3" < |3|`
TODO: Probably should add a canonicalized subset. Consider adding
explicit "I promise this is canonical" marker, like a BOM, which
identifies a binary value as (first) binary and (second, optionally)
as canonical. UTF-8 disallows byte `0xFF` from appearing anywhere in a
text; this might be a good candidate for a marker sequence.
((Actually, perhaps `0x10` would be good! It corresponds to DLE, "data
link escape"; it is not a printable ASCII character, and is disallowed
in the textual Preserves grammar; and it is also mnemonic for "version
0", since it is the Preserves binary encoding of the small integer
zero.))

113
representations.md Normal file

@ -0,0 +1,113 @@
---
---
<title>Preserves: Representing Values in Programming Languages</title>
<link rel="stylesheet" href="preserves.css">
# Preserves: Representing Values in Programming Languages
**NOT YET READY**
We have given a definition of `Value` and its semantics, and proposed
a concrete syntax for communicating and storing `Value`s. We now turn
to **suggested** representations of `Value`s as *programming-language
values* for various programming languages.
When designing a language mapping, an important consideration is
roundtripping: serialization after deserialization, and vice versa,
should both be identities.
Also, the presence or absence of annotations on a `Value` should not
affect comparisons of that `Value` to others in any way.
## JavaScript.
- `Boolean` ↔ `Boolean`
- `Float` and `Double` ↔ numbers
- `SignedInteger` ↔ numbers or `BigInt` (see [here](https://developers.google.com/web/updates/2018/05/bigint) and [here](https://github.com/tc39/proposal-bigint))
- `String` ↔ strings
- `ByteString` ↔ `Uint8Array`
- `Symbol` ↔ `Symbol.for(...)`
- `Record` ↔ `{ "_label": theLabel, "_fields": [field0, ..., fieldN] }`, plus convenience accessors
- `(undefined)` ↔ the undefined value
- `(rfc3339 F)` ↔ `Date`, if `F` matches the `date-time` RFC 3339 production
- `Sequence` ↔ `Array`
- `Set` ↔ `{ "_set": M }` where `M` is a `Map` from the elements of the set to `true`
- `Dictionary` ↔ a `Map`
## Scheme/Racket.
- `Boolean` ↔ booleans
- `Float` and `Double` ↔ inexact numbers (Racket: single- and double-precision floats)
- `SignedInteger` ↔ exact numbers
- `String` ↔ strings
- `ByteString` ↔ byte vector (Racket: "Bytes")
- `Symbol` ↔ symbols
- `Record` ↔ structures (Racket: prefab struct)
- `Sequence` ↔ lists
- `Set` ↔ Racket: sets
- `Dictionary` ↔ Racket: hash-table
## Java.
- `Boolean` ↔ `Boolean`
- `Float` and `Double` ↔ `Float` and `Double`
- `SignedInteger` ↔ `Integer`, `Long`, `BigInteger`
- `String` ↔ `String`
- `ByteString` ↔ `byte[]`
- `Symbol` ↔ a simple data class wrapping a `String`
- `Record` ↔ in a simple implementation, a generic `Record` class; else perhaps a bean mapping?
- `(mime T B)` ↔ an implementation of `javax.activation.DataSource`?
- `Sequence` ↔ an implementation of `java.util.List`
- `Set` ↔ an implementation of `java.util.Set`
- `Dictionary` ↔ an implementation of `java.util.Map`
## Erlang.
- `Boolean` ↔ `true` and `false`
- `Float` and `Double` ↔ floats (unsure how Erlang deals with single-precision)
- `SignedInteger` ↔ integers
- `String` ↔ pair of `utf8` and a binary
- `ByteString` ↔ a binary
- `Symbol` ↔ pair of `atom` and a binary
- `Record` ↔ triple of `obj`, label, and field list
- `Sequence` ↔ a list
- `Set` ↔ a `sets` set
- `Dictionary` ↔ a [map][erlang-map] (new in Erlang/OTP R17)
This is a somewhat unsatisfactory mapping because: (a) Erlang doesn't
garbage-collect its atoms, meaning that (a.1) representing `Symbol`s
as atoms could lead to denial-of-service and (a.2) representing
`Symbol`-labelled `Record`s as Erlang records must be rejected for the
same reason; (b) even if it did, Erlang's boolean values are atoms,
which would then clash with the `Symbol`s `true` and `false`; and (c)
Erlang has no distinct string type, making for a trilemma where
`String`s are in danger of clashing with `ByteString`s, `Sequence`s,
or `Record`s.
## Python.
- `Boolean` ↔ `True` and `False`
- `Float` ↔ a `Float` wrapper-class for a double-precision value
- `Double` ↔ float
- `SignedInteger` ↔ int and long
- `String` ↔ `unicode`
- `ByteString` ↔ `bytes`
- `Symbol` ↔ a simple data class wrapping a `unicode`
- `Record` ↔ something like `namedtuple`, but that doesn't care about class identity?
- `Sequence` ↔ `tuple` (but accept `list` during encoding)
- `Set` ↔ `frozenset` (but accept `set` during encoding)
- `Dictionary` ↔ a hashable (immutable) dictionary-like thing (but accept `dict` during encoding)
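
One way the Python mapping above might look in code is sketched below. This is a hedged illustration, not an existing library: the names `Symbol`, `Float`, `Record`, `FrozenDict` and `freeze` are all hypothetical, and the sketch assumes Python 3, where `str` plays the role the list gives to `unicode`.

```python
import struct
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class Symbol:
    name: str


@dataclass(frozen=True)
class Float:
    """Single-precision value carried in a Python (double-precision) float."""
    value: float

    def __post_init__(self):
        # Round to the nearest single-precision value. Assumes the input
        # is within single-precision range; struct raises otherwise.
        single = struct.unpack(">f", struct.pack(">f", self.value))[0]
        object.__setattr__(self, "value", single)


@dataclass(frozen=True)
class Record:
    """Compared by label and fields only, not by a per-label class."""
    label: object
    fields: tuple


class FrozenDict(Mapping):
    """A hashable, immutable dictionary-like value."""

    def __init__(self, *args, **kwargs):
        self._items = dict(*args, **kwargs)

    def __getitem__(self, key):
        return self._items[key]

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)

    def __hash__(self):
        return hash(frozenset(self._items.items()))


def freeze(value):
    """Accept list, set and dict at encoding time; return immutable forms."""
    if isinstance(value, (list, tuple)):
        return tuple(freeze(v) for v in value)
    if isinstance(value, (set, frozenset)):
        return frozenset(freeze(v) for v in value)
    if isinstance(value, Mapping):
        return FrozenDict((freeze(k), freeze(v)) for k, v in value.items())
    return value


money = Record(Symbol("money"), ("EUR", 10))
frozen = freeze({"k": [1, {2, 3}]})
lookup = {frozen: "usable as a dictionary key"}
```

Making every compound value immutable and hashable is what allows a `Dictionary` or `Set` to appear as a key or element of another `Dictionary` or `Set`.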
## Squeak Smalltalk.
- `Boolean` ↔ `true` and `false`
- `Float` ↔ perhaps a subclass of `Float`?
- `Double` ↔ `Float`
- `SignedInteger` ↔ `Integer`
- `String` ↔ `WideString`
- `ByteString` ↔ `ByteArray`
- `Symbol` ↔ `WideSymbol`
- `Record` ↔ a simple data class
- `Sequence` ↔ `ArrayedCollection` (usually `OrderedCollection`)
- `Set` ↔ `Set`
- `Dictionary` ↔ `Dictionary`

157
why-not-json.md Normal file

@ -0,0 +1,157 @@
---
---
<title>Preserves: Why not Just Use JSON?</title>
<link rel="stylesheet" href="preserves.css">
# Why not Just Use JSON?
<!-- JSON lacks semantics: JSON syntax doesn't denote anything -->
JSON offers *syntax* for numbers, strings, booleans, null, arrays and
string-keyed maps. However, it suffers from two major problems. First,
it offers no *semantics* for the syntax: it is left to each
implementation to determine how to treat each JSON term. This causes
[interoperability](http://seriot.ch/parsing_json.php) and even
[security](http://web.archive.org/web/20180906202559/http://docs.couchdb.org/en/stable/cve/2017-12635.html)
issues. Second, JSON's lack of support for type tags leads to awkward
and incompatible *encodings* of type information in terms of the fixed
suite of constructors on offer.
There are other minor problems with JSON having to do with its syntax.
Examples include its relative verbosity and its lack of support for
binary data.
## JSON syntax doesn't *mean* anything
When are two JSON values the same? When are they different?
<!-- When is one JSON value “less than” another? -->
The specifications are largely silent on these questions. Different
JSON implementations give different answers.
Specifically, JSON does not:
- assign any meaning to numbers,[^meaning-ieee-double]
- determine how strings are to be compared,[^string-key-comparison]
- determine whether object key ordering is significant,[^json-member-ordering] or
- determine whether duplicate object keys are permitted, what it
would mean if they were, or how to determine a duplicate in the
first place.[^json-key-uniqueness]
In short, JSON syntax doesn't *denote* anything.[^xml-infoset] [^other-formats]
[^meaning-ieee-double]:
[Section 6 of RFC 8259](https://tools.ietf.org/html/rfc8259#section-6)
does go so far as to indicate “good interoperability can be
achieved” by imagining that parsers are able reliably to
understand the syntax of numbers as denoting an IEEE 754
double-precision floating-point value.
[^string-key-comparison]:
[Section 8.3 of RFC 8259](https://tools.ietf.org/html/rfc8259#section-8.3)
suggests that *if* an implementation compares strings used as
object keys “code unit by code unit”, then it will interoperate
with *other such implementations*, but neither requires this
behaviour nor discusses comparisons of strings used in other
contexts.
[^json-member-ordering]:
[Section 4 of RFC 8259](https://tools.ietf.org/html/rfc8259#section-4)
remarks that “[implementations] differ as to whether or not they
make the ordering of object members visible to calling software.”
[^json-key-uniqueness]:
[Section 4 of RFC 8259](https://tools.ietf.org/html/rfc8259#section-4)
is the only place in the specification that mentions the issue. It
explicitly sanctions implementations supporting duplicate keys,
noting only that “when the names within an object are not unique,
the behavior of software that receives such an object is
unpredictable.” Implementations are free to choose any behaviour
at all in this situation, including signalling an error, or
discarding all but one of a set of duplicates.
[^xml-infoset]: The XML world has the concept of
[XML infoset](https://www.w3.org/TR/xml-infoset/). Loosely
speaking, XML infoset is the *denotation* of an XML document; the
*meaning* of the document.
[^other-formats]: Most other recent data languages are like JSON in
specifying only a syntax with no associated semantics. While some
do make a sketch of a semantics, the result is often
underspecified (e.g. in terms of how strings are to be compared),
overly machine-oriented (e.g. treating 32-bit integers as
fundamentally distinct from 64-bit integers and from
floating-point numbers), overly fine (e.g. giving visibility to
the order in which map entries are written), or all three.
Some examples:
- are the JSON values `1`, `1.0`, and `1e0` the same or different?
- are the JSON values `1.0` and `1.0000000000000001` the same or different?
- are the JSON strings `"päron"` (UTF-8 `70c3a4726f6e`) and `"päron"`
(UTF-8 `7061cc88726f6e`) the same or different?
- are the JSON objects `{"a":1, "b":2}` and `{"b":2, "a":1}` the same
or different?
- which, if any, of `{"a":1, "a":2}`, `{"a":1}` and `{"a":2}` are the
same? Are all three legal?
- are `{"päron":1}` and `{"päron":1}` the same or different?
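
For illustration, here is how one particular implementation, the `json` module in Python's standard library, happens to answer these questions. Nothing in the JSON specifications requires these answers, and other implementations are free to give different ones. The variable names are chosen for this sketch only.

```python
import json
import unicodedata

# 1, 1.0 and 1e0 compare equal, yet they parse into different types.
numbers_equal = json.loads("1") == json.loads("1.0") == json.loads("1e0")
number_types = [type(json.loads(s)).__name__ for s in ("1", "1.0", "1e0")]

# Both literals round to the same IEEE 754 double.
close_equal = json.loads("1.0") == json.loads("1.0000000000000001")

# Precomposed U+00E4 versus "a" followed by combining U+0308.
precomposed = json.loads(r'"p\u00e4ron"')
decomposed = json.loads(r'"pa\u0308ron"')
strings_equal = precomposed == decomposed
normalized_equal = unicodedata.normalize("NFC", decomposed) == precomposed

# Objects become dicts, so member order is not significant here.
order_equal = json.loads('{"a":1, "b":2}') == json.loads('{"b":2, "a":1}')

# Duplicate keys are accepted silently; the last one wins by default.
last_wins = json.loads('{"a":1, "a":2}')
all_pairs = json.loads('{"a":1, "a":2}', object_pairs_hook=list)

print(numbers_equal, number_types)      # True ['int', 'float', 'float']
print(close_equal)                      # True
print(strings_equal, normalized_equal)  # False True
print(order_equal)                      # True
print(last_wins, all_pairs)             # {'a': 2} [('a', 1), ('a', 2)]
```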
## JSON can multiply nicely, but it can't add very well
JSON includes a fixed set of types: numbers, strings, booleans, null,
arrays and string-keyed maps. Domain-specific data must be *encoded*
into these types. For example, dates and email addresses are often
represented as strings with an implicit internal structure.
There is no convention for *labelling* a value as belonging to a
particular category. Instead, JSON-encoded data are often labelled in
an ad-hoc way. Multiple incompatible approaches exist. For example, a
“money” structure containing a `currency` field and an `amount` may be
represented in any number of ways:

    { "_type": "money", "currency": "EUR", "amount": 10 }
    { "type": "money", "value": { "currency": "EUR", "amount": 10 } }
    [ "money", { "currency": "EUR", "amount": 10 } ]
    { "@money": { "currency": "EUR", "amount": 10 } }

This causes particular problems when JSON is used to represent *sum*
or *union* types, such as “either a value or an error, but not both”.
Again, multiple incompatible approaches exist.
For example, imagine an API for depositing money in an account. The
response might be either a “success” response indicating the new
balance, or one of a set of possible errors.
Sometimes, a *pair* of values is used, with `null` marking the option
not taken.[^interesting-failure-mode]

    { "ok": { "balance": 210 }, "error": null }
    { "ok": null, "error": "Unauthorized" }

[^interesting-failure-mode]: What is the meaning of a document where
both `ok` and `error` are non-null? What might happen when a
program is presented with such a document?
Sometimes, the branch not chosen is simply omitted, as if it were an
optional field:

    { "ok": { "balance": 210 } }
    { "error": "Unauthorized" }

Sometimes, an array of a label and a value is used:

    [ "ok", { "balance": 210 } ]
    [ "error", "Unauthorized" ]

Sometimes, the shape of the data is sufficient to distinguish among
the alternatives, and the label is left implicit:

    { "balance": 210 }
    "Unauthorized"

JSON itself does not offer any guidance for which of these options to
choose. In many real cases on the web, poor choices have led to
encodings that are irrecoverably ambiguous.
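
To make the cost concrete, the sketch below shows what a client has to do if it cannot know in advance which of the encodings above a server uses. It is an illustration only: `classify` is a hypothetical helper, and the rules it applies (for example, treating any bare string as an error) are assumptions drawn from the deposit example, not from any real API.

```python
def classify(doc):
    """Return ("ok", payload) or ("error", payload) for a decoded response.

    Raises ValueError when the document cannot be interpreted.
    """
    # Array of a label and a value: ["ok", {...}] or ["error", "..."]
    if isinstance(doc, list) and len(doc) == 2 and doc[0] in ("ok", "error"):
        return (doc[0], doc[1])

    if isinstance(doc, dict):
        if "ok" in doc or "error" in doc:
            ok, error = doc.get("ok"), doc.get("error")
            # Covers both the pair-with-null and the omitted-field encodings.
            if ok is not None and error is not None:
                raise ValueError("both 'ok' and 'error' carry a value")
            if ok is not None:
                return ("ok", ok)
            if error is not None:
                return ("error", error)
            raise ValueError("neither 'ok' nor 'error' carries a value")
        # Implicit label: guess from the shape of the data.
        if "balance" in doc:
            return ("ok", doc)

    if isinstance(doc, str):
        return ("error", doc)

    raise ValueError("unrecognised response shape")


results = [
    classify({"ok": {"balance": 210}, "error": None}),
    classify({"error": "Unauthorized"}),
    classify(["ok", {"balance": 210}]),
    classify("Unauthorized"),
]
print(results)
```

Even this guesswork fails in some cases. If a successful payload may itself be `null` or a string, the pair encoding and the implicit-label encoding can no longer tell success from failure.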
<!-- Heading to visually offset the footnotes from the main document: -->
## Notes