Split out inessential text from the spec
parent
1bb7e1862e
commit
9064258dbc
|
@ -0,0 +1,181 @@
|
|||
---
|
||||
---
|
||||
<title>Preserves: Conventions for Common Data Types</title>
|
||||
<link rel="stylesheet" href="preserves.css">
|
||||
|
||||
# Preserves: Conventions for Common Data Types
|
||||
|
||||
The `Value` data type is essentially an S-Expression, able to
|
||||
represent semi-structured data over `ByteString`, `String`,
|
||||
`SignedInteger` atoms and so on.[^why-not-spki-sexps]
|
||||
|
||||
[^why-not-spki-sexps]: Rivest's S-Expressions are in many ways
|
||||
similar to Preserves. However, while they include binary data and
|
||||
sequences, and an obvious equivalence for them exists, they lack
|
||||
numbers *per se* as well as any kind of unordered structure such
|
||||
as sets or maps. In addition, while “display hints” allow
|
||||
labelling of binary data with an intended interpretation, they
|
||||
cannot be attached to any other kind of structure, and the “hint”
|
||||
itself can only be a binary blob.
|
||||
|
||||
However, users need a wide variety of data types for representing
|
||||
domain-specific values such as various kinds of encoded and normalized
|
||||
text, calendrical values, machine words, and so on.
|
||||
|
||||
Appropriately-labelled `Record`s denote these domain-specific data
|
||||
types.[^why-dictionaries]
|
||||
|
||||
[^why-dictionaries]: Given `Record`'s existence, it may seem odd
|
||||
that `Dictionary`, `Set`, `Float`, etc. are given special
|
||||
treatment. Preserves aims to offer a useful basic equivalence
|
||||
predicate to programmers, and so if a data type demands a special
|
||||
equivalence predicate, as `Dictionary`, `Set` and `Float` all do,
|
||||
then the type should be included in the base language. Otherwise,
|
||||
it can be represented as a `Record` and treated separately.
|
||||
`Boolean`, `String` and `Symbol` are seeming exceptions. The first
|
||||
two merit inclusion because of their cultural importance, while
|
||||
`Symbol`s are included to allow their use as `Record` labels.
|
||||
Primitive `Symbol` support avoids a bootstrapping issue.
|
||||
|
||||
All of these conventions are optional. They form a layer atop the core
|
||||
`Value` structure. Non-domain-specific tools do not in general need to
|
||||
treat them specially.
|
||||
|
||||
**Validity.** Many of the labels we will describe in this section come
|
||||
with side-conditions on the contents of labelled `Record`s. It is
|
||||
possible to construct an instance of `Value` that violates these
|
||||
side-conditions without ceasing to be a `Value` or becoming
|
||||
unrepresentable. However, we say that such a `Value` is *invalid*
|
||||
because it fails to honour the necessary side-conditions.
|
||||
Implementations *SHOULD* allow two modes of working: one which
|
||||
treats all `Value`s identically, without regard for side-conditions,
|
||||
and one which enforces validity (i.e. side-conditions) when reading,
|
||||
writing, or constructing `Value`s.
|
||||
|
||||
## IOLists.
|
||||
|
||||
Inspired by Erlang's notions of
|
||||
[`iolist()` and `iodata()`](http://erlang.org/doc/reference_manual/typespec.html),
|
||||
an `IOList` is any tree constructed from `ByteString`s and
|
||||
`Sequence`s. Formally, an `IOList` is either a `ByteString` or a
|
||||
`Sequence` of `IOList`s.
|
||||
|
||||
`IOList`s can be useful for
|
||||
[vectored I/O](https://en.wikipedia.org/wiki/Vectored_I/O).
|
||||
Additionally, the flexibility of `IOList` trees allows annotation of
|
||||
interior portions of a tree.
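
The recursive definition above ("either a `ByteString` or a `Sequence` of `IOList`s") can be sketched as a small flattening routine. This is only an illustration: it assumes Python `bytes` stands in for `ByteString` and `list`/`tuple` for `Sequence`, and the function names are hypothetical, not part of Preserves.

```python
def iolist_chunks(node):
    """Yield the ByteString leaves of an IOList, left to right.

    Assumption: bytes/bytearray model ByteString, list/tuple model Sequence.
    """
    if isinstance(node, (bytes, bytearray)):
        yield bytes(node)
    elif isinstance(node, (list, tuple)):
        for child in node:
            yield from iolist_chunks(child)
    else:
        raise TypeError(f"not an IOList: {node!r}")


def iolist_to_bytes(node):
    """Concatenate all leaves; a vectored write could consume the chunks instead."""
    return b"".join(iolist_chunks(node))


# A three-element IOList whose middle element is itself an IOList.
example = [
    bytes.fromhex("00010203"),
    [bytes.fromhex("04050607"), bytes.fromhex("08090A0B")],
    bytes.fromhex("0C0D0E0F"),
]
```

Keeping the chunks separate until the last moment is what makes the structure a fit for vectored I/O: the leaves can be handed to the operating system as-is, without first copying them into one buffer.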
|
||||
|
||||
## Comments.
|
||||
|
||||
`String` values used as annotations are conventionally interpreted as
|
||||
comments.
|
||||
|
||||
    @"I am a comment for the Dictionary"
    {
      @"I am a comment for the key"
      key: @"I am a comment for the value"
           value
    }

    @"I am a comment for this entire IOList"
    [
      #hex{00010203}
      @"I am a comment for the middle half of the IOList"
      @"A second comment for the same portion of the IOList"
      [
        @"I am a comment for the following ByteString"
        #hex{04050607}
        #hex{08090A0B}
      ]
      #hex{0C0D0E0F}
    ]
|
||||
|
||||
## MIME-type tagged binary data.
|
||||
|
||||
Many internet protocols use
|
||||
[media types](https://tools.ietf.org/html/rfc6838) (a.k.a. MIME types)
|
||||
to indicate the format of some associated binary data. For this
|
||||
purpose, we define `MIMEData` to be a record labelled `mime` with two
|
||||
fields, the first being a `Symbol`, the media type, and the second
|
||||
being a `ByteString`, the binary data.
|
||||
|
||||
While each media type may define its own rules for comparing
|
||||
documents, we define ordering among `MIMEData` *representations* of
|
||||
such media types following the general rules for ordering of
|
||||
`Record`s.
|
||||
|
||||
**Examples.**
|
||||
|
||||
| Value                                      | Encoded hexadecimal byte sequence                                                                                 |
|--------------------------------------------|-------------------------------------------------------------------------------------------------------------------|
| `<mime application/octet-stream #"abcde">` | 83 74 6D 69 6D 65 7F 18 61 70 70 6C 69 63 61 74 69 6F 6E 2F 6F 63 74 65 74 2D 73 74 72 65 61 6D 65 61 62 63 64 65 |
| `<mime text/plain #"ABC">`                 | 83 74 6D 69 6D 65 7A 74 65 78 74 2F 70 6C 61 69 6E 63 41 42 43                                                    |
| `<mime application/xml #"<xhtml/>">`       | 83 74 6D 69 6D 65 7F 0F 61 70 70 6C 69 63 61 74 69 6F 6E 2F 78 6D 6C 68 3C 78 68 74 6D 6C 2F 3E                   |
| `<mime text/csv #"123,234,345">`           | 83 74 6D 69 6D 65 78 74 65 78 74 2F 63 73 76 6B 31 32 33 2C 32 33 34 2C 33 34 35                                  |
|
||||
|
||||
Applications making heavy use of `mime` records may choose to use a
|
||||
placeholder number for the symbol `mime` as well as the symbols for
|
||||
individual media types. For example, if placeholder number 1 were
|
||||
chosen for `mime`, and placeholder number 7 for `text/plain`, the
|
||||
second example above, `<mime text/plain #"ABC">`, would be encoded as
|
||||
`83 11 17 63 41 42 43`.
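
The byte sequences in the table, and the placeholder variant just described, can be reproduced with a short encoder sketch. The lead nibbles used here (`8` for a record, `7` for a symbol, `6` for a byte string, `1` for a placeholder) and the use of a low nibble of 15 to introduce a longer length are read off the examples themselves. The base-128 encoding of those longer lengths is an assumption, since every length in the table fits in a single byte. All function names are hypothetical.

```python
SYMBOL, BYTESTRING, RECORD, PLACEHOLDER = 0x7, 0x6, 0x8, 0x1


def varint(n):
    """Base-128, least-significant group first (assumed; untested beyond 127)."""
    out = bytearray()
    while True:
        group = n & 0x7F
        n >>= 7
        if n:
            out.append(group | 0x80)
        else:
            out.append(group)
            return bytes(out)


def header(lead, length):
    """Lead nibble plus a 4-bit length; 15 means the length follows as a varint."""
    if length < 15:
        return bytes([(lead << 4) | length])
    return bytes([(lead << 4) | 0x0F]) + varint(length)


def encode_symbol(name, placeholders):
    """Emit a placeholder reference if one is registered, else the full symbol."""
    if name in placeholders:
        return header(PLACEHOLDER, placeholders[name])
    body = name.encode("utf-8")
    return header(SYMBOL, len(body)) + body


def encode_mime(media_type, data, placeholders=None):
    """Encode <mime media_type data>: a record of three items (label + 2 fields)."""
    placeholders = placeholders or {}
    return (
        header(RECORD, 3)
        + encode_symbol("mime", placeholders)
        + encode_symbol(media_type, placeholders)
        + header(BYTESTRING, len(data))
        + data
    )
```

For instance, `encode_mime("text/plain", b"ABC", {"mime": 1, "text/plain": 7})` yields the seven bytes `83 11 17 63 41 42 43`, against 21 bytes without placeholders.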
|
||||
|
||||
## Unicode normalization forms.
|
||||
|
||||
Unicode defines multiple
|
||||
[normalization forms](http://unicode.org/reports/tr15/) for text.
|
||||
While no particular normalization form is required for `String`s,
|
||||
users may need to unambiguously signal or require a particular
|
||||
normalization form. A `NormalizedString` is a `Record` labelled with
|
||||
`unicode-normalization` and having two fields, the first of which is a
|
||||
`Symbol` specifying the normalization form used (e.g. `nfc`, `nfd`,
|
||||
`nfkc`, `nfkd`), and the second of which is a `String` whose
|
||||
underlying code point representation *MUST* be normalized according to
|
||||
the named normalization form.
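
A validity check for this side-condition might look like the following sketch, which uses Python's standard `unicodedata` module. The function name is hypothetical, and treating an unrecognised form symbol as invalid is an assumption: the text above lists the four forms only as examples.

```python
import unicodedata

# Map the Symbol used in the record to the name Python's unicodedata expects.
FORMS = {"nfc": "NFC", "nfd": "NFD", "nfkc": "NFKC", "nfkd": "NFKD"}


def is_valid_normalized_string(form, text):
    """True if `text` is already normalized according to `form`."""
    py_form = FORMS.get(form)
    if py_form is None:
        return False  # assumption: unknown forms are rejected, not ignored
    # A string is in a normalization form iff normalizing it changes nothing.
    return unicodedata.normalize(py_form, text) == text
```

With this check, the precomposed spelling `"p\u00e4ron"` is valid under `nfc` but not under `nfd`, and the decomposed spelling `"pa\u0308ron"` is the reverse.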
|
||||
|
||||
## IRIs (URIs, URLs, URNs, etc.).
|
||||
|
||||
An `IRI` is a `Record` labelled with `iri` and having one field, a
|
||||
`String` which is the IRI itself and which *MUST* be a valid absolute
|
||||
or relative IRI.
|
||||
|
||||
## Machine words.
|
||||
|
||||
The definition of `SignedInteger` captures all integers. However, in
|
||||
certain circumstances it can be valuable to assert that a number
|
||||
inhabits a particular range, such as a fixed-width machine word.
|
||||
|
||||
A family of labels `i`*n* and `u`*n* for *n* ∈ {8,16,32,64} denote
|
||||
*n*-bit-wide signed and unsigned range restrictions, respectively.
|
||||
Records with these labels *MUST* have one field, a `SignedInteger`,
|
||||
which *MUST* fall within the appropriate range. That is, to be valid,
|
||||
- in `<i8 `*x*`>`, -128 <= *x* <= 127.
|
||||
- in `<u8 `*x*`>`, 0 <= *x* <= 255.
|
||||
- in `<i16 `*x*`>`, -32768 <= *x* <= 32767.
|
||||
- etc.
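
The "etc." follows the usual two's-complement pattern: `i`*n* covers -2^(*n*-1) to 2^(*n*-1)-1 and `u`*n* covers 0 to 2^*n*-1. A validity check along those lines could be sketched as below. The function names and the representation of a record as a label string plus a list of fields are assumptions made for the example.

```python
import re

_LABEL = re.compile(r"([iu])(8|16|32|64)")


def word_range(label):
    """Return the inclusive (low, high) range for a label such as 'i8' or 'u32'."""
    m = _LABEL.fullmatch(label)
    if m is None:
        raise ValueError(f"not a machine-word label: {label!r}")
    bits = int(m.group(2))
    if m.group(1) == "i":
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1


def is_valid_word(label, fields):
    """Check the side-conditions: exactly one integer field, within range."""
    low, high = word_range(label)
    if len(fields) != 1:
        return False
    x = fields[0]
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(x, int) and not isinstance(x, bool) and low <= x <= high
```

An implementation running in the permissive mode would skip `is_valid_word` entirely and still treat `<u8 256>` as an ordinary `Record`; only the validating mode would reject it.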
|
||||
|
||||
## Anonymous Tuples and Unit.
|
||||
|
||||
A `Tuple` is a `Record` with label `tuple` and zero or more fields,
|
||||
denoting an anonymous tuple of values.
|
||||
|
||||
The 0-ary tuple, `<tuple>`, denotes the empty tuple, sometimes called
|
||||
“unit” or “void” (but *not* e.g. JavaScript's “undefined” value).
|
||||
|
||||
## Null and Undefined.
|
||||
|
||||
Tony Hoare's
|
||||
“[billion-dollar mistake](https://en.wikipedia.org/wiki/Tony_Hoare#Apologies_and_retractions)”
|
||||
can be represented with the 0-ary `Record` `<null>`. An “undefined”
|
||||
value can be represented as `<undefined>`.
|
||||
|
||||
## Dates and Times.
|
||||
|
||||
Dates, times, moments, and timestamps can be represented with a
|
||||
`Record` with label `rfc3339` having a single field, a `String`, which
|
||||
*MUST* conform to one of the `full-date`, `partial-time`, `full-time`,
|
||||
or `date-time` productions of
|
||||
[section 5.6 of RFC 3339](https://tools.ietf.org/html/rfc3339#section-5.6).
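
A syntactic check against those four productions could be sketched with regular expressions, as below. This is an approximation under stated assumptions: it checks shape only (two-digit fields, optional fractional seconds, `Z` or a numeric offset) and does not enforce the value restrictions of RFC 3339, such as month or hour ranges. The names are hypothetical.

```python
import re

_DATE = r"[0-9]{4}-[0-9]{2}-[0-9]{2}"
_PARTIAL_TIME = r"[0-9]{2}:[0-9]{2}:[0-9]{2}(?:\.[0-9]+)?"
_OFFSET = r"(?:[Zz]|[+-][0-9]{2}:[0-9]{2})"
_FULL_TIME = _PARTIAL_TIME + _OFFSET

PRODUCTIONS = {
    "full-date": re.compile(_DATE),
    "partial-time": re.compile(_PARTIAL_TIME),
    "full-time": re.compile(_FULL_TIME),
    "date-time": re.compile(_DATE + r"[Tt]" + _FULL_TIME),
}


def matching_production(text):
    """Return the name of the production `text` matches, or None if invalid."""
    for name, pattern in PRODUCTIONS.items():
        if pattern.fullmatch(text):
            return name
    return None
```

Under this sketch, an `rfc3339` record would be considered valid when `matching_production` returns a name for its single `String` field.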
|
||||
|
||||
<!-- Heading to visually offset the footnotes from the main document: -->
|
||||
## Notes
|
481
preserves.md
|
@ -26,7 +26,8 @@ programming languages.
|
|||
Preserves also supports the usual suite of atomic and compound data
|
||||
types, in particular including *binary* data as a distinct type from
|
||||
text strings. Its *annotations* allow separation of data from metadata
|
||||
such as comments, trace information, and provenance information.
|
||||
such as [comments](conventions.html#comments), trace information, and
|
||||
provenance information.
|
||||
|
||||
Finally, Preserves defines precisely how to *compare* two values.
|
||||
Comparison is based on the data model, not on syntax or on data
|
||||
|
@ -873,180 +874,6 @@ encodes to binary as follows:
|
|||
53 "Zip" 55 "94085"
|
||||
57 "Country" 52 "US"
|
||||
|
||||
## Conventions for Common Data Types
|
||||
|
||||
The `Value` data type is essentially an S-Expression, able to
|
||||
represent semi-structured data over `ByteString`, `String`,
|
||||
`SignedInteger` atoms and so on.[^why-not-spki-sexps]
|
||||
|
||||
[^why-not-spki-sexps]: Rivest's S-Expressions are in many ways
|
||||
similar to Preserves. However, while they include binary data and
|
||||
sequences, and an obvious equivalence for them exists, they lack
|
||||
numbers *per se* as well as any kind of unordered structure such
|
||||
as sets or maps. In addition, while “display hints” allow
|
||||
labelling of binary data with an intended interpretation, they
|
||||
cannot be attached to any other kind of structure, and the “hint”
|
||||
itself can only be a binary blob.
|
||||
|
||||
However, users need a wide variety of data types for representing
|
||||
domain-specific values such as various kinds of encoded and normalized
|
||||
text, calendrical values, machine words, and so on.
|
||||
|
||||
Appropriately-labelled `Record`s denote these domain-specific data
|
||||
types.[^why-dictionaries]
|
||||
|
||||
[^why-dictionaries]: Given `Record`'s existence, it may seem odd
|
||||
that `Dictionary`, `Set`, `Float`, etc. are given special
|
||||
treatment. Preserves aims to offer a useful basic equivalence
|
||||
predicate to programmers, and so if a data type demands a special
|
||||
equivalence predicate, as `Dictionary`, `Set` and `Float` all do,
|
||||
then the type should be included in the base language. Otherwise,
|
||||
it can be represented as a `Record` and treated separately.
|
||||
`Boolean`, `String` and `Symbol` are seeming exceptions. The first
|
||||
two merit inclusion because of their cultural importance, while
|
||||
`Symbol`s are included to allow their use as `Record` labels.
|
||||
Primitive `Symbol` support avoids a bootstrapping issue.
|
||||
|
||||
All of these conventions are optional. They form a layer atop the core
|
||||
`Value` structure. Non-domain-specific tools do not in general need to
|
||||
treat them specially.
|
||||
|
||||
**Validity.** Many of the labels we will describe in this section come
|
||||
with side-conditions on the contents of labelled `Record`s. It is
|
||||
possible to construct an instance of `Value` that violates these
|
||||
side-conditions without ceasing to be a `Value` or becoming
|
||||
unrepresentable. However, we say that such a `Value` is *invalid*
|
||||
because it fails to honour the necessary side-conditions.
|
||||
Implementations *SHOULD* allow two modes of working: one which
|
||||
treats all `Value`s identically, without regard for side-conditions,
|
||||
and one which enforces validity (i.e. side-conditions) when reading,
|
||||
writing, or constructing `Value`s.
|
||||
|
||||
### IOLists.
|
||||
|
||||
Inspired by Erlang's notions of
|
||||
[`iolist()` and `iodata()`](http://erlang.org/doc/reference_manual/typespec.html),
|
||||
an `IOList` is any tree constructed from `ByteString`s and
|
||||
`Sequence`s. Formally, an `IOList` is either a `ByteString` or a
|
||||
`Sequence` of `IOList`s.
|
||||
|
||||
`IOList`s can be useful for
|
||||
[vectored I/O](https://en.wikipedia.org/wiki/Vectored_I/O).
|
||||
Additionally, the flexibility of `IOList` trees allows annotation of
|
||||
interior portions of a tree.
|
||||
|
||||
### Comments.
|
||||
|
||||
`String` values used as annotations are conventionally interpreted as
|
||||
comments.
|
||||
|
||||
@"I am a comment for the Dictionary"
|
||||
{
|
||||
@"I am a comment for the key"
|
||||
key: @"I am a comment for the value"
|
||||
value
|
||||
}
|
||||
|
||||
@"I am a comment for this entire IOList"
|
||||
[
|
||||
#hex{00010203}
|
||||
@"I am a comment for the middle half of the IOList"
|
||||
@"A second comment for the same portion of the IOList"
|
||||
[
|
||||
@"I am a comment for the following ByteString"
|
||||
#hex{04050607}
|
||||
#hex{08090A0B}
|
||||
]
|
||||
#hex{0C0D0E0F}
|
||||
]
|
||||
|
||||
### MIME-type tagged binary data.
|
||||
|
||||
Many internet protocols use
|
||||
[media types](https://tools.ietf.org/html/rfc6838) (a.k.a. MIME types)
|
||||
to indicate the format of some associated binary data. For this
|
||||
purpose, we define `MIMEData` to be a record labelled `mime` with two
|
||||
fields, the first being a `Symbol`, the media type, and the second
|
||||
being a `ByteString`, the binary data.
|
||||
|
||||
While each media type may define its own rules for comparing
|
||||
documents, we define ordering among `MIMEData` *representations* of
|
||||
such media types following the general rules for ordering of
|
||||
`Record`s.
|
||||
|
||||
**Examples.**
|
||||
|
||||
| Value                                      | Encoded hexadecimal byte sequence                                                                                 |
|--------------------------------------------|-------------------------------------------------------------------------------------------------------------------|
| `<mime application/octet-stream #"abcde">` | 83 74 6D 69 6D 65 7F 18 61 70 70 6C 69 63 61 74 69 6F 6E 2F 6F 63 74 65 74 2D 73 74 72 65 61 6D 65 61 62 63 64 65 |
| `<mime text/plain #"ABC">`                 | 83 74 6D 69 6D 65 7A 74 65 78 74 2F 70 6C 61 69 6E 63 41 42 43                                                    |
| `<mime application/xml #"<xhtml/>">`       | 83 74 6D 69 6D 65 7F 0F 61 70 70 6C 69 63 61 74 69 6F 6E 2F 78 6D 6C 68 3C 78 68 74 6D 6C 2F 3E                   |
| `<mime text/csv #"123,234,345">`           | 83 74 6D 69 6D 65 78 74 65 78 74 2F 63 73 76 6B 31 32 33 2C 32 33 34 2C 33 34 35                                  |
|
||||
|
||||
Applications making heavy use of `mime` records may choose to use a
|
||||
placeholder number for the symbol `mime` as well as the symbols for
|
||||
individual media types. For example, if placeholder number 1 were
|
||||
chosen for `mime`, and placeholder number 7 for `text/plain`, the
|
||||
second example above, `<mime text/plain #"ABC">`, would be encoded as
|
||||
`83 11 17 63 41 42 43`.
|
||||
|
||||
### Unicode normalization forms.
|
||||
|
||||
Unicode defines multiple
|
||||
[normalization forms](http://unicode.org/reports/tr15/) for text.
|
||||
While no particular normalization form is required for `String`s,
|
||||
users may need to unambiguously signal or require a particular
|
||||
normalization form. A `NormalizedString` is a `Record` labelled with
|
||||
`unicode-normalization` and having two fields, the first of which is a
|
||||
`Symbol` specifying the normalization form used (e.g. `nfc`, `nfd`,
|
||||
`nfkc`, `nfkd`), and the second of which is a `String` whose
|
||||
underlying code point representation *MUST* be normalized according to
|
||||
the named normalization form.
|
||||
|
||||
### IRIs (URIs, URLs, URNs, etc.).
|
||||
|
||||
An `IRI` is a `Record` labelled with `iri` and having one field, a
|
||||
`String` which is the IRI itself and which *MUST* be a valid absolute
|
||||
or relative IRI.
|
||||
|
||||
### Machine words.
|
||||
|
||||
The definition of `SignedInteger` captures all integers. However, in
|
||||
certain circumstances it can be valuable to assert that a number
|
||||
inhabits a particular range, such as a fixed-width machine word.
|
||||
|
||||
A family of labels `i`*n* and `u`*n* for *n* ∈ {8,16,32,64} denote
|
||||
*n*-bit-wide signed and unsigned range restrictions, respectively.
|
||||
Records with these labels *MUST* have one field, a `SignedInteger`,
|
||||
which *MUST* fall within the appropriate range. That is, to be valid,
|
||||
- in `<i8 `*x*`>`, -128 <= *x* <= 127.
|
||||
- in `<u8 `*x*`>`, 0 <= *x* <= 255.
|
||||
- in `<i16 `*x*`>`, -32768 <= *x* <= 32767.
|
||||
- etc.
|
||||
|
||||
### Anonymous Tuples and Unit.
|
||||
|
||||
A `Tuple` is a `Record` with label `tuple` and zero or more fields,
|
||||
denoting an anonymous tuple of values.
|
||||
|
||||
The 0-ary tuple, `<tuple>`, denotes the empty tuple, sometimes called
|
||||
“unit” or “void” (but *not* e.g. JavaScript's “undefined” value).
|
||||
|
||||
### Null and Undefined.
|
||||
|
||||
Tony Hoare's
|
||||
“[billion-dollar mistake](https://en.wikipedia.org/wiki/Tony_Hoare#Apologies_and_retractions)”
|
||||
can be represented with the 0-ary `Record` `<null>`. An “undefined”
|
||||
value can be represented as `<undefined>`.
|
||||
|
||||
### Dates and Times.
|
||||
|
||||
Dates, times, moments, and timestamps can be represented with a
|
||||
`Record` with label `rfc3339` having a single field, a `String`, which
|
||||
*MUST* conform to one of the `full-date`, `partial-time`, `full-time`,
|
||||
or `date-time` productions of
|
||||
[section 5.6 of RFC 3339](https://tools.ietf.org/html/rfc3339#section-5.6).
|
||||
|
||||
## Security Considerations
|
||||
|
||||
**Empty chunks.** Chunks of zero length are prohibited in streamed
|
||||
|
@ -1141,309 +968,5 @@ Then, if `ttnn`=`0001`, `l` is the placeholder number; otherwise, `l`
|
|||
is the length of the body that follows, counted in bytes for `tt`=`01`
|
||||
and in `Repr`s for `tt`=`10`.
|
||||
|
||||
<!-- Not yet ready
|
||||
|
||||
## Appendix. Representing Values in Programming Languages
|
||||
|
||||
We have given a definition of `Value` and its semantics, and proposed
|
||||
a concrete syntax for communicating and storing `Value`s. We now turn
|
||||
to **suggested** representations of `Value`s as *programming-language
|
||||
values* for various programming languages.
|
||||
|
||||
When designing a language mapping, an important consideration is
|
||||
roundtripping: serialization after deserialization, and vice versa,
|
||||
should both be identities.
|
||||
|
||||
Also, the presence or absence of annotations on a `Value` should not
|
||||
affect comparisons of that `Value` to others in any way.
|
||||
|
||||
### JavaScript.
|
||||
|
||||
- `Boolean` ↔ `Boolean`
|
||||
- `Float` and `Double` ↔ numbers
|
||||
- `SignedInteger` ↔ numbers or `BigInt` (see [here](https://developers.google.com/web/updates/2018/05/bigint) and [here](https://github.com/tc39/proposal-bigint))
|
||||
- `String` ↔ strings
|
||||
- `ByteString` ↔ `Uint8Array`
|
||||
- `Symbol` ↔ `Symbol.for(...)`
|
||||
- `Record` ↔ `{ "_label": theLabel, "_fields": [field0, ..., fieldN] }`, plus convenience accessors
|
||||
- `(undefined)` ↔ the undefined value
|
||||
- `(rfc3339 F)` ↔ `Date`, if `F` matches the `date-time` RFC 3339 production
|
||||
- `Sequence` ↔ `Array`
|
||||
- `Set` ↔ `{ "_set": M }` where `M` is a `Map` from the elements of the set to `true`
|
||||
- `Dictionary` ↔ a `Map`
|
||||
|
||||
### Scheme/Racket.
|
||||
|
||||
- `Boolean` ↔ booleans
|
||||
- `Float` and `Double` ↔ inexact numbers (Racket: single- and double-precision floats)
|
||||
- `SignedInteger` ↔ exact numbers
|
||||
- `String` ↔ strings
|
||||
- `ByteString` ↔ byte vector (Racket: "Bytes")
|
||||
- `Symbol` ↔ symbols
|
||||
- `Record` ↔ structures (Racket: prefab struct)
|
||||
- `Sequence` ↔ lists
|
||||
- `Set` ↔ Racket: sets
|
||||
- `Dictionary` ↔ Racket: hash-table
|
||||
|
||||
### Java.
|
||||
|
||||
- `Boolean` ↔ `Boolean`
|
||||
- `Float` and `Double` ↔ `Float` and `Double`
|
||||
- `SignedInteger` ↔ `Integer`, `Long`, `BigInteger`
|
||||
- `String` ↔ `String`
|
||||
- `ByteString` ↔ `byte[]`
|
||||
- `Symbol` ↔ a simple data class wrapping a `String`
|
||||
- `Record` ↔ in a simple implementation, a generic `Record` class; else perhaps a bean mapping?
|
||||
- `(mime T B)` ↔ an implementation of `javax.activation.DataSource`?
|
||||
- `Sequence` ↔ an implementation of `java.util.List`
|
||||
- `Set` ↔ an implementation of `java.util.Set`
|
||||
- `Dictionary` ↔ an implementation of `java.util.Map`
|
||||
|
||||
### Erlang.
|
||||
|
||||
- `Boolean` ↔ `true` and `false`
|
||||
- `Float` and `Double` ↔ floats (unsure how Erlang deals with single-precision)
|
||||
- `SignedInteger` ↔ integers
|
||||
- `String` ↔ pair of `utf8` and a binary
|
||||
- `ByteString` ↔ a binary
|
||||
- `Symbol` ↔ pair of `atom` and a binary
|
||||
- `Record` ↔ triple of `obj`, label, and field list
|
||||
- `Sequence` ↔ a list
|
||||
- `Set` ↔ a `sets` set
|
||||
- `Dictionary` ↔ a [map][erlang-map] (new in Erlang/OTP R17)
|
||||
|
||||
This is a somewhat unsatisfactory mapping because: (a) Erlang doesn't
|
||||
garbage-collect its atoms, meaning that (a.1) representing `Symbol`s
|
||||
as atoms could lead to denial-of-service and (a.2) representing
|
||||
`Symbol`-labelled `Record`s as Erlang records must be rejected for the
|
||||
same reason; (b) even if it did, Erlang's boolean values are atoms,
|
||||
which would then clash with the `Symbol`s `true` and `false`; and (c)
|
||||
Erlang has no distinct string type, making for a trilemma where
|
||||
`String`s are in danger of clashing with `ByteString`s, `Sequence`s,
|
||||
or `Record`s.
|
||||
|
||||
### Python.
|
||||
|
||||
- `Boolean` ↔ `True` and `False`
|
||||
- `Float` ↔ a `Float` wrapper-class for a double-precision value
|
||||
- `Double` ↔ float
|
||||
- `SignedInteger` ↔ int and long
|
||||
- `String` ↔ `unicode`
|
||||
- `ByteString` ↔ `bytes`
|
||||
- `Symbol` ↔ a simple data class wrapping a `unicode`
|
||||
- `Record` ↔ something like `namedtuple`, but that doesn't care about class identity?
|
||||
- `Sequence` ↔ `tuple` (but accept `list` during encoding)
|
||||
- `Set` ↔ `frozenset` (but accept `set` during encoding)
|
||||
- `Dictionary` ↔ a hashable (immutable) dictionary-like thing (but accept `dict` during encoding)
|
||||
|
||||
### Squeak Smalltalk.
|
||||
|
||||
- `Boolean` ↔ `true` and `false`
|
||||
- `Float` ↔ perhaps a subclass of `Float`?
|
||||
- `Double` ↔ `Float`
|
||||
- `SignedInteger` ↔ `Integer`
|
||||
- `String` ↔ `WideString`
|
||||
- `ByteString` ↔ `ByteArray`
|
||||
- `Symbol` ↔ `WideSymbol`
|
||||
- `Record` ↔ a simple data class
|
||||
- `Sequence` ↔ `ArrayedCollection` (usually `OrderedCollection`)
|
||||
- `Set` ↔ `Set`
|
||||
- `Dictionary` ↔ `Dictionary`
|
||||
|
||||
-->
|
||||
|
||||
## Appendix. Why not Just Use JSON?
|
||||
|
||||
<!-- JSON lacks semantics: JSON syntax doesn't denote anything -->
|
||||
|
||||
JSON offers *syntax* for numbers, strings, booleans, null, arrays and
|
||||
string-keyed maps. However, it suffers from two major problems. First,
|
||||
it offers no *semantics* for the syntax: it is left to each
|
||||
implementation to determine how to treat each JSON term. This causes
|
||||
[interoperability](http://seriot.ch/parsing_json.php) and even
|
||||
[security](http://web.archive.org/web/20180906202559/http://docs.couchdb.org/en/stable/cve/2017-12635.html)
|
||||
issues. Second, JSON's lack of support for type tags leads to awkward
|
||||
and incompatible *encodings* of type information in terms of the fixed
|
||||
suite of constructors on offer.
|
||||
|
||||
There are other minor problems with JSON having to do with its syntax.
|
||||
Examples include its relative verbosity and its lack of support for
|
||||
binary data.
|
||||
|
||||
### JSON syntax doesn't *mean* anything
|
||||
|
||||
When are two JSON values the same? When are they different?
|
||||
<!-- When is one JSON value “less than” another? -->
|
||||
|
||||
The specifications are largely silent on these questions. Different
|
||||
JSON implementations give different answers.
|
||||
|
||||
Specifically, JSON does not:
|
||||
|
||||
- assign any meaning to numbers,[^meaning-ieee-double]
|
||||
- determine how strings are to be compared,[^string-key-comparison]
|
||||
- determine whether object key ordering is significant,[^json-member-ordering] or
|
||||
- determine whether duplicate object keys are permitted, what it
|
||||
would mean if they were, or how to determine a duplicate in the
|
||||
first place.[^json-key-uniqueness]
|
||||
|
||||
In short, JSON syntax doesn't *denote* anything.[^xml-infoset] [^other-formats]
|
||||
|
||||
[^meaning-ieee-double]:
|
||||
[Section 6 of RFC 8259](https://tools.ietf.org/html/rfc8259#section-6)
|
||||
does go so far as to indicate “good interoperability can be
|
||||
achieved” by imagining that parsers are able reliably to
|
||||
understand the syntax of numbers as denoting an IEEE 754
|
||||
double-precision floating-point value.
|
||||
|
||||
[^string-key-comparison]:
|
||||
[Section 8.3 of RFC 8259](https://tools.ietf.org/html/rfc8259#section-8.3)
|
||||
suggests that *if* an implementation compares strings used as
|
||||
object keys “code unit by code unit”, then it will interoperate
|
||||
with *other such implementations*, but neither requires this
|
||||
behaviour nor discusses comparisons of strings used in other
|
||||
contexts.
|
||||
|
||||
[^json-member-ordering]:
|
||||
[Section 4 of RFC 8259](https://tools.ietf.org/html/rfc8259#section-4)
|
||||
remarks that “[implementations] differ as to whether or not they
|
||||
make the ordering of object members visible to calling software.”
|
||||
|
||||
[^json-key-uniqueness]:
|
||||
[Section 4 of RFC 8259](https://tools.ietf.org/html/rfc8259#section-4)
|
||||
is the only place in the specification that mentions the issue. It
|
||||
explicitly sanctions implementations supporting duplicate keys,
|
||||
noting only that “when the names within an object are not unique,
|
||||
the behavior of software that receives such an object is
|
||||
unpredictable.” Implementations are free to choose any behaviour
|
||||
at all in this situation, including signalling an error, or
|
||||
discarding all but one of a set of duplicates.
|
||||
|
||||
[^xml-infoset]: The XML world has the concept of
|
||||
[XML infoset](https://www.w3.org/TR/xml-infoset/). Loosely
|
||||
speaking, XML infoset is the *denotation* of an XML document; the
|
||||
*meaning* of the document.
|
||||
|
||||
[^other-formats]: Most other recent data languages are like JSON in
|
||||
specifying only a syntax with no associated semantics. While some
|
||||
do make a sketch of a semantics, the result is often
|
||||
underspecified (e.g. in terms of how strings are to be compared),
|
||||
overly machine-oriented (e.g. treating 32-bit integers as
|
||||
fundamentally distinct from 64-bit integers and from
|
||||
floating-point numbers), overly fine (e.g. giving visibility to
|
||||
the order in which map entries are written), or all three.
|
||||
|
||||
Some examples:
|
||||
|
||||
- are the JSON values `1`, `1.0`, and `1e0` the same or different?
|
||||
- are the JSON values `1.0` and `1.0000000000000001` the same or different?
|
||||
- are the JSON strings `"päron"` (UTF-8 `70c3a4726f6e`) and `"päron"`
|
||||
(UTF-8 `7061cc88726f6e`) the same or different?
|
||||
- are the JSON objects `{"a":1, "b":2}` and `{"b":2, "a":1}` the same
|
||||
or different?
|
||||
- which, if any, of `{"a":1, "a":2}`, `{"a":1}` and `{"a":2}` are the
|
||||
same? Are all three legal?
|
||||
- are `{"päron":1}` and `{"päron":1}` the same or different?
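
To make the point concrete, here is how one particular implementation, Python's standard `json` module, happens to answer these questions. It is shown as a single data point, not as the correct answer: other parsers are free to differ, which is exactly the problem.

```python
import json

# 1, 1.0 and 1e0: equal under ==, but parsed into different types.
one, one_point_zero, one_e_zero = (json.loads(t) for t in ("1", "1.0", "1e0"))
numbers_equal = one == one_point_zero == one_e_zero      # True
numbers_same_type = type(one) is type(one_point_zero)    # False: int vs float

# 1.0 and 1.0000000000000001 collapse to the same double.
close_floats_equal = json.loads("1.0") == json.loads("1.0000000000000001")  # True

# Precomposed and decomposed spellings of the same text compare unequal.
precomposed = json.loads('"p\\u00e4ron"')
decomposed = json.loads('"pa\\u0308ron"')
strings_equal = precomposed == decomposed                # False

# Key order is not significant when comparing the resulting dicts.
objects_equal = json.loads('{"a":1, "b":2}') == json.loads('{"b":2, "a":1}')  # True

# Duplicate keys are accepted silently; the last one wins.
duplicate = json.loads('{"a":1, "a":2}')                 # {'a': 2}
```

None of these outcomes is mandated by the JSON specifications; each is a choice made by this library and by Python's own equality rules.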
|
||||
|
||||
### JSON can multiply nicely, but it can't add very well
|
||||
|
||||
JSON includes a fixed set of types: numbers, strings, booleans, null,
|
||||
arrays and string-keyed maps. Domain-specific data must be *encoded*
|
||||
into these types. For example, dates and email addresses are often
|
||||
represented as strings with an implicit internal structure.
|
||||
|
||||
There is no convention for *labelling* a value as belonging to a
|
||||
particular category. Instead, JSON-encoded data are often labelled in
|
||||
an ad-hoc way. Multiple incompatible approaches exist. For example, a
|
||||
“money” structure containing a `currency` field and an `amount` may be
|
||||
represented in any number of ways:
|
||||
|
||||
{ "_type": "money", "currency": "EUR", "amount": 10 }
|
||||
{ "type": "money", "value": { "currency": "EUR", "amount": 10 } }
|
||||
[ "money", { "currency": "EUR", "amount": 10 } ]
|
||||
{ "@money": { "currency": "EUR", "amount": 10 } }
|
||||
|
||||
This causes particular problems when JSON is used to represent *sum*
|
||||
or *union* types, such as “either a value or an error, but not both”.
|
||||
Again, multiple incompatible approaches exist.
|
||||
|
||||
For example, imagine an API for depositing money in an account. The
|
||||
response might be either a “success” response indicating the new
|
||||
balance, or one of a set of possible errors.
|
||||
|
||||
Sometimes, a *pair* of values is used, with `null` marking the option
|
||||
not taken.[^interesting-failure-mode]
|
||||
|
||||
{ "ok": { "balance": 210 }, "error": null }
|
||||
{ "ok": null, "error": "Unauthorized" }
|
||||
|
||||
[^interesting-failure-mode]: What is the meaning of a document where
|
||||
both `ok` and `error` are non-null? What might happen when a
|
||||
program is presented with such a document?
|
||||
|
||||
The branch not chosen is sometimes present, sometimes omitted as if it
|
||||
were an optional field:
|
||||
|
||||
{ "ok": { "balance": 210 } }
|
||||
{ "error": "Unauthorized" }
|
||||
|
||||
Sometimes, an array of a label and a value is used:
|
||||
|
||||
[ "ok", { "balance": 210 } ]
|
||||
[ "error", "Unauthorized" ]
|
||||
|
||||
Sometimes, the shape of the data is sufficient to distinguish among
|
||||
the alternatives, and the label is left implicit:
|
||||
|
||||
{ "balance": 210 }
|
||||
"Unauthorized"
|
||||
|
||||
JSON itself does not offer any guidance for which of these options to
|
||||
choose. In many real cases on the web, poor choices have led to
|
||||
encodings that are irrecoverably ambiguous.
|
||||
|
||||
# Open questions
|
||||
|
||||
Q. Should "symbols" instead be URIs? Relative, usually; relative to
|
||||
what? Some domain-specific base URI?
|
||||
|
||||
Q. Literal small integers: are they pulling their weight? They're not
|
||||
absolutely necessary.
|
||||
|
||||
Q. Should we go for trying to make the data ordering line up with the
|
||||
encoding ordering? We'd have to only use streaming forms, and avoid
|
||||
the small integer encoding, and not store record arities, and sort
|
||||
sets and dictionaries, and mask floats and doubles (perhaps
|
||||
[like this](https://stackoverflow.com/questions/43299299/sorting-floating-point-values-using-their-byte-representation)),
|
||||
and perhaps pick a specific `NaN`, and I don't know what to do about
|
||||
SignedIntegers. Perhaps make them more like float formats, with the
|
||||
byte count acting as a kind of exponent underneath the sign bit.
|
||||
|
||||
- Perhaps define separate additional canonicalization restrictions?
|
||||
Doesn't help the ordering, but does help the equivalence.
|
||||
|
||||
- Canonicalization and early-bailout-equivalence-checking are in
|
||||
tension with support for streaming values.
|
||||
|
||||
Q. To remain compatible with JSON, portions of the text syntax have to
|
||||
remain case-insensitive (`%i"..."`). However, non-JSON extensions do
|
||||
not. There's only one (?) at the moment, the `%i"f"` in `Float`;
|
||||
should it be changed to case-sensitive?
|
||||
|
||||
Q. Should `IOList`s be wrapped in an identifying unary record constructor?
|
||||
|
||||
TODO: Examples of the ordering. `"bzz" < "c" < "caa"`; `#true < 3 < "3" < |3|`
|
||||
|
||||
TODO: Probably should add a canonicalized subset. Consider adding
|
||||
explicit "I promise this is canonical" marker, like a BOM, which
|
||||
identifies a binary value as (first) binary and (second, optionally)
|
||||
as canonical. UTF-8 disallows byte `0xFF` from appearing anywhere in a
|
||||
text; this might be a good candidate for a marker sequence.
|
||||
((Actually, perhaps `0x10` would be good! It corresponds to DLE, "data
|
||||
link escape"; it is not a printable ASCII character, and is disallowed
|
||||
in the textual Preserves grammar; and it is also mnemonic for "version
|
||||
0", since it is the Preserves binary encoding of the small integer
|
||||
zero.))
|
||||
|
||||
<!-- Heading to visually offset the footnotes from the main document: -->
|
||||
## Notes
|
||||
|
|
|
@ -0,0 +1,47 @@
|
|||
---
|
||||
---
|
||||
<title>Preserves: Open questions</title>
|
||||
<link rel="stylesheet" href="preserves.css">
|
||||
|
||||
# Open questions
|
||||
|
||||
Q. Should "symbols" instead be URIs? Relative, usually; relative to
|
||||
what? Some domain-specific base URI?
|
||||
|
||||
Q. Literal small integers: are they pulling their weight? They're not
|
||||
absolutely necessary.
|
||||
|
||||
Q. Should we go for trying to make the data ordering line up with the
|
||||
encoding ordering? We'd have to only use streaming forms, and avoid
|
||||
the small integer encoding, and not store record arities, and sort
|
||||
sets and dictionaries, and mask floats and doubles (perhaps
|
||||
[like this](https://stackoverflow.com/questions/43299299/sorting-floating-point-values-using-their-byte-representation)),
|
||||
and perhaps pick a specific `NaN`, and I don't know what to do about
|
||||
SignedIntegers. Perhaps make them more like float formats, with the
|
||||
byte count acting as a kind of exponent underneath the sign bit.
|
||||
|
||||
- Perhaps define separate additional canonicalization restrictions?
|
||||
Doesn't help the ordering, but does help the equivalence.
|
||||
|
||||
- Canonicalization and early-bailout-equivalence-checking are in
|
||||
tension with support for streaming values.
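
As an illustration of the "mask floats and doubles" idea in the question above, here is a minimal Python sketch of one common masking. It is an assumption that this is the technique intended; it may differ in detail from the linked answer. The function name `double_sort_key` is hypothetical. Negative values have every bit flipped and non-negative values have only the sign bit flipped, so that comparing the resulting big-endian bytes agrees with numeric order for non-`NaN` doubles.

```python
import struct


def double_sort_key(x: float) -> bytes:
    """Map a double to 8 bytes whose bytewise order follows numeric order.

    Hypothetical helper, not part of Preserves. NaN is not handled specially.
    """
    # Reinterpret the IEEE 754 double as an unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    if bits & (1 << 63):
        # Sign bit set: flip every bit so larger magnitudes sort first.
        bits ^= 0xFFFFFFFFFFFFFFFF
    else:
        # Sign bit clear: set it so these sort after all negatives.
        bits ^= 1 << 63
    return struct.pack(">Q", bits)


values = [2.5, float("-inf"), 1e-300, -2.5, 0.0, float("inf"), -1e-300, 1.0]
by_bytes = sorted(values, key=double_sort_key)
by_value = sorted(values)
```

Under this masking `-0.0` sorts strictly before `0.0`, and each `NaN` bit pattern lands at a position determined by its sign and payload, which is one reason the question also raises picking a specific `NaN`.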
|
||||
|
||||
Q. To remain compatible with JSON, portions of the text syntax have to
|
||||
remain case-insensitive (`%i"..."`). However, non-JSON extensions do
|
||||
not. There's only one (?) at the moment, the `%i"f"` in `Float`;
|
||||
should it be changed to case-sensitive?
|
||||
|
||||
Q. Should `IOList`s be wrapped in an identifying unary record constructor?
|
||||
|
||||
TODO: Examples of the ordering. `"bzz" < "c" < "caa"`; `#true < 3 < "3" < |3|`
|
||||
|
||||
TODO: Probably should add a canonicalized subset. Consider adding
|
||||
explicit "I promise this is canonical" marker, like a BOM, which
|
||||
identifies a binary value as (first) binary and (second, optionally)
|
||||
as canonical. UTF-8 disallows byte `0xFF` from appearing anywhere in a
|
||||
text; this might be a good candidate for a marker sequence.
|
||||
((Actually, perhaps `0x10` would be good! It corresponds to DLE, "data
|
||||
link escape"; it is not a printable ASCII character, and is disallowed
|
||||
in the textual Preserves grammar; and it is also mnemonic for "version
|
||||
0", since it is the Preserves binary encoding of the small integer
|
||||
zero.))
|
|
@ -0,0 +1,113 @@
|
|||
---
|
||||
---
|
||||
<title>Preserves: Representing Values in Programming Languages</title>
|
||||
<link rel="stylesheet" href="preserves.css">
|
||||
|
||||
# Preserves: Representing Values in Programming Languages
|
||||
|
||||
**NOT YET READY**
|
||||
|
||||
We have given a definition of `Value` and its semantics, and proposed
|
||||
a concrete syntax for communicating and storing `Value`s. We now turn
|
||||
to **suggested** representations of `Value`s as *programming-language
|
||||
values* for various programming languages.
|
||||
|
||||
When designing a language mapping, an important consideration is
|
||||
roundtripping: serialization after deserialization, and vice versa,
|
||||
should both be identities.
|
||||
|
||||
Also, the presence or absence of annotations on a `Value` should not
|
||||
affect comparisons of that `Value` to others in any way.
|
||||
|
||||
## JavaScript.
|
||||
|
||||
- `Boolean` ↔ `Boolean`
|
||||
- `Float` and `Double` ↔ numbers
|
||||
- `SignedInteger` ↔ numbers or `BigInt` (see [here](https://developers.google.com/web/updates/2018/05/bigint) and [here](https://github.com/tc39/proposal-bigint))
|
||||
- `String` ↔ strings
|
||||
- `ByteString` ↔ `Uint8Array`
|
||||
- `Symbol` ↔ `Symbol.for(...)`
|
||||
- `Record` ↔ `{ "_label": theLabel, "_fields": [field0, ..., fieldN] }`, plus convenience accessors
|
||||
- `(undefined)` ↔ the undefined value
|
||||
- `(rfc3339 F)` ↔ `Date`, if `F` matches the `date-time` RFC 3339 production
|
||||
- `Sequence` ↔ `Array`
|
||||
- `Set` ↔ `{ "_set": M }` where `M` is a `Map` from the elements of the set to `true`
|
||||
- `Dictionary` ↔ a `Map`
|
||||
|
||||
## Scheme/Racket.
|
||||
|
||||
- `Boolean` ↔ booleans
|
||||
- `Float` and `Double` ↔ inexact numbers (Racket: single- and double-precision floats)
|
||||
- `SignedInteger` ↔ exact numbers
|
||||
- `String` ↔ strings
|
||||
- `ByteString` ↔ byte vector (Racket: "Bytes")
|
||||
- `Symbol` ↔ symbols
|
||||
- `Record` ↔ structures (Racket: prefab struct)
|
||||
- `Sequence` ↔ lists
|
||||
- `Set` ↔ Racket: sets
|
||||
- `Dictionary` ↔ Racket: hash-table
|
||||
|
||||
## Java.
|
||||
|
||||
- `Boolean` ↔ `Boolean`
|
||||
- `Float` and `Double` ↔ `Float` and `Double`
|
||||
- `SignedInteger` ↔ `Integer`, `Long`, `BigInteger`
|
||||
- `String` ↔ `String`
|
||||
- `ByteString` ↔ `byte[]`
|
||||
- `Symbol` ↔ a simple data class wrapping a `String`
|
||||
- `Record` ↔ in a simple implementation, a generic `Record` class; else perhaps a bean mapping?
|
||||
- `(mime T B)` ↔ an implementation of `javax.activation.DataSource`?
|
||||
- `Sequence` ↔ an implementation of `java.util.List`
|
||||
- `Set` ↔ an implementation of `java.util.Set`
|
||||
- `Dictionary` ↔ an implementation of `java.util.Map`
|
||||
|
||||
## Erlang.
|
||||
|
||||
- `Boolean` ↔ `true` and `false`
|
||||
- `Float` and `Double` ↔ floats (Erlang floats are always double-precision, so single-precision `Float`s would need separate handling)
|
||||
- `SignedInteger` ↔ integers
|
||||
- `String` ↔ pair of `utf8` and a binary
|
||||
- `ByteString` ↔ a binary
|
||||
- `Symbol` ↔ pair of `atom` and a binary
|
||||
- `Record` ↔ triple of `obj`, label, and field list
|
||||
- `Sequence` ↔ a list
|
||||
- `Set` ↔ a `sets` set
|
||||
- `Dictionary` ↔ a [map][erlang-map] (new in Erlang/OTP R17)
|
||||
|
||||
This is a somewhat unsatisfactory mapping because: (a) Erlang doesn't
|
||||
garbage-collect its atoms, meaning that (a.1) representing `Symbol`s
|
||||
as atoms could lead to denial-of-service and (a.2) representing
|
||||
`Symbol`-labelled `Record`s as Erlang records must be rejected for the
|
||||
same reason; (b) even if it did, Erlang's boolean values are atoms,
|
||||
which would then clash with the `Symbol`s `true` and `false`; and (c)
|
||||
Erlang has no distinct string type, making for a trilemma where
|
||||
`String`s are in danger of clashing with `ByteString`s, `Sequence`s,
|
||||
or `Record`s.
|
||||
|
||||
## Python.
|
||||
|
||||
- `Boolean` ↔ `True` and `False`
|
||||
- `Float` ↔ a `Float` wrapper-class for a double-precision value
|
||||
- `Double` ↔ float
|
||||
- `SignedInteger` ↔ int and long
|
||||
- `String` ↔ `unicode`
|
||||
- `ByteString` ↔ `bytes`
|
||||
- `Symbol` ↔ a simple data class wrapping a `unicode`
|
||||
- `Record` ↔ something like `namedtuple`, but that doesn't care about class identity?
|
||||
- `Sequence` ↔ `tuple` (but accept `list` during encoding)
|
||||
- `Set` ↔ `frozenset` (but accept `set` during encoding)
|
||||
- `Dictionary` ↔ a hashable (immutable) dictionary-like thing (but accept `dict` during encoding)
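
A minimal sketch of what the `Symbol` and `Record` entries above might look like, assuming Python 3 (where `str` plays the role of `unicode`) and frozen dataclasses. The class and field names are illustrative assumptions, not an existing Preserves API. Frozen dataclasses compare by content and are hashable, which lets these values appear inside a `frozenset` or as dictionary keys.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class Symbol:
    """Hypothetical wrapper distinguishing a Symbol from a String."""
    name: str


@dataclass(frozen=True)
class Record:
    """Hypothetical generic record: equality depends only on label and fields."""
    label: Any
    fields: Tuple[Any, ...]


money_a = Record(Symbol("money"), ("EUR", 10))
money_b = Record(Symbol("money"), ("EUR", 10))

# Sequence ↔ tuple and Set ↔ frozenset keep the containing values hashable.
wallet = frozenset([money_a, money_b])
```

Because every `Record` shares one class, two records are equal exactly when their labels and fields are equal, which is one way to read "doesn't care about class identity".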
|
||||
|
||||
## Squeak Smalltalk.
|
||||
|
||||
- `Boolean` ↔ `true` and `false`
|
||||
- `Float` ↔ perhaps a subclass of `Float`?
|
||||
- `Double` ↔ `Float`
|
||||
- `SignedInteger` ↔ `Integer`
|
||||
- `String` ↔ `WideString`
|
||||
- `ByteString` ↔ `ByteArray`
|
||||
- `Symbol` ↔ `WideSymbol`
|
||||
- `Record` ↔ a simple data class
|
||||
- `Sequence` ↔ `ArrayedCollection` (usually `OrderedCollection`)
|
||||
- `Set` ↔ `Set`
|
||||
- `Dictionary` ↔ `Dictionary`
|
|
@ -0,0 +1,157 @@
|
|||
---
|
||||
---
|
||||
<title>Preserves: Why not Just Use JSON?</title>
|
||||
<link rel="stylesheet" href="preserves.css">
|
||||
|
||||
# Why not Just Use JSON?
|
||||
|
||||
<!-- JSON lacks semantics: JSON syntax doesn't denote anything -->
|
||||
|
||||
JSON offers *syntax* for numbers, strings, booleans, null, arrays and
|
||||
string-keyed maps. However, it suffers from two major problems. First,
|
||||
it offers no *semantics* for the syntax: it is left to each
|
||||
implementation to determine how to treat each JSON term. This causes
|
||||
[interoperability](http://seriot.ch/parsing_json.php) and even
|
||||
[security](http://web.archive.org/web/20180906202559/http://docs.couchdb.org/en/stable/cve/2017-12635.html)
|
||||
issues. Second, JSON's lack of support for type tags leads to awkward
|
||||
and incompatible *encodings* of type information in terms of the fixed
|
||||
suite of constructors on offer.
|
||||
|
||||
There are other minor problems with JSON having to do with its syntax.
|
||||
Examples include its relative verbosity and its lack of support for
|
||||
binary data.
|
||||
|
||||
## JSON syntax doesn't *mean* anything
|
||||
|
||||
When are two JSON values the same? When are they different?
|
||||
<!-- When is one JSON value “less than” another? -->
|
||||
|
||||
The specifications are largely silent on these questions. Different
|
||||
JSON implementations give different answers.
|
||||
|
||||
Specifically, JSON does not:
|
||||
|
||||
- assign any meaning to numbers,[^meaning-ieee-double]
|
||||
- determine how strings are to be compared,[^string-key-comparison]
|
||||
- determine whether object key ordering is significant,[^json-member-ordering] or
|
||||
- determine whether duplicate object keys are permitted, what it
|
||||
would mean if they were, or how to determine a duplicate in the
|
||||
first place.[^json-key-uniqueness]
|
||||
|
||||
In short, JSON syntax doesn't *denote* anything.[^xml-infoset] [^other-formats]
|
||||
|
||||
[^meaning-ieee-double]:
|
||||
[Section 6 of RFC 8259](https://tools.ietf.org/html/rfc8259#section-6)
|
||||
does go so far as to indicate “good interoperability can be
|
||||
achieved” by imagining that parsers are reliably able to
|
||||
understand the syntax of numbers as denoting an IEEE 754
|
||||
double-precision floating-point value.
|
||||
|
||||
[^string-key-comparison]:
|
||||
[Section 8.3 of RFC 8259](https://tools.ietf.org/html/rfc8259#section-8.3)
|
||||
suggests that *if* an implementation compares strings used as
|
||||
object keys “code unit by code unit”, then it will interoperate
|
||||
with *other such implementations*, but neither requires this
|
||||
behaviour nor discusses comparisons of strings used in other
|
||||
contexts.
|
||||
|
||||
[^json-member-ordering]:
|
||||
[Section 4 of RFC 8259](https://tools.ietf.org/html/rfc8259#section-4)
|
||||
remarks that “[implementations] differ as to whether or not they
|
||||
make the ordering of object members visible to calling software.”
|
||||
|
||||
[^json-key-uniqueness]:
|
||||
[Section 4 of RFC 8259](https://tools.ietf.org/html/rfc8259#section-4)
|
||||
is the only place in the specification that mentions the issue. It
|
||||
explicitly sanctions implementations supporting duplicate keys,
|
||||
noting only that “when the names within an object are not unique,
|
||||
the behavior of software that receives such an object is
|
||||
unpredictable.” Implementations are free to choose any behaviour
|
||||
at all in this situation, including signalling an error, or
|
||||
discarding all but one of a set of duplicates.
|
||||
|
||||
[^xml-infoset]: The XML world has the concept of
|
||||
[XML infoset](https://www.w3.org/TR/xml-infoset/). Loosely
|
||||
speaking, XML infoset is the *denotation* of an XML document; the
|
||||
*meaning* of the document.
|
||||
|
||||
[^other-formats]: Most other recent data languages are like JSON in
|
||||
specifying only a syntax with no associated semantics. While some
|
||||
do make a sketch of a semantics, the result is often
|
||||
underspecified (e.g. in terms of how strings are to be compared),
|
||||
overly machine-oriented (e.g. treating 32-bit integers as
|
||||
fundamentally distinct from 64-bit integers and from
|
||||
floating-point numbers), overly fine (e.g. giving visibility to
|
||||
the order in which map entries are written), or all three.
|
||||
|
||||
Some examples:
|
||||
|
||||
- are the JSON values `1`, `1.0`, and `1e0` the same or different?
|
||||
- are the JSON values `1.0` and `1.0000000000000001` the same or different?
|
||||
- are the JSON strings `"päron"` (UTF-8 `70c3a4726f6e`) and `"päron"`
|
||||
(UTF-8 `7061cc88726f6e`) the same or different?
|
||||
- are the JSON objects `{"a":1, "b":2}` and `{"b":2, "a":1}` the same
|
||||
or different?
|
||||
- which, if any, of `{"a":1, "a":2}`, `{"a":1}` and `{"a":2}` are the
|
||||
same? Are all three legal?
|
||||
- are `{"päron":1}` and `{"päron":1}` the same or different?
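
To make these questions concrete, the following sketch shows the answers that one particular implementation, the `json` module in Python's standard library, happens to give. It illustrates one implementation's choices only; other parsers may answer differently, which is the point of the list above.

```python
import json
import unicodedata

# Numbers: `1` parses as int, while `1.0` and `1e0` parse as float.
one_int, one_float, one_exp = (json.loads(s) for s in ("1", "1.0", "1e0"))
same_value = one_int == one_float == one_exp
same_type = type(one_int) is type(one_float)

# The extra digit is lost when the literal is rounded to a double.
rounded_away = json.loads("1.0000000000000001") == json.loads("1.0")

# Strings: precomposed U+00E4 versus "a" followed by combining U+0308.
precomposed = json.loads('"p\\u00e4ron"')
decomposed = json.loads('"pa\\u0308ron"')
equal_as_code_points = precomposed == decomposed
equal_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

# Objects: dict comparison ignores member order, and the last duplicate wins
# unless the caller asks for the raw list of pairs.
order_ignored = json.loads('{"a":1, "b":2}') == json.loads('{"b":2, "a":1}')
duplicate_collapsed = json.loads('{"a":1, "a":2}')
duplicate_preserved = json.loads('{"a":1, "a":2}', object_pairs_hook=list)
```

Here the three spellings of one compare equal but do not share a type, the two spellings of the string compare unequal until normalized, and `{"a":1, "a":2}` silently becomes `{"a": 2}`.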
|
||||
|
||||
## JSON can multiply nicely, but it can't add very well
|
||||
|
||||
JSON includes a fixed set of types: numbers, strings, booleans, null,
|
||||
arrays and string-keyed maps. Domain-specific data must be *encoded*
|
||||
into these types. For example, dates and email addresses are often
|
||||
represented as strings with an implicit internal structure.
|
||||
|
||||
There is no convention for *labelling* a value as belonging to a
|
||||
particular category. Instead, JSON-encoded data are often labelled in
|
||||
an ad-hoc way. Multiple incompatible approaches exist. For example, a
|
||||
“money” structure containing a `currency` field and an `amount` may be
|
||||
represented in any number of ways:
|
||||
|
||||
{ "_type": "money", "currency": "EUR", "amount": 10 }
|
||||
{ "type": "money", "value": { "currency": "EUR", "amount": 10 } }
|
||||
[ "money", { "currency": "EUR", "amount": 10 } ]
|
||||
{ "@money": { "currency": "EUR", "amount": 10 } }
|
||||
|
||||
This causes particular problems when JSON is used to represent *sum*
|
||||
or *union* types, such as “either a value or an error, but not both”.
|
||||
Again, multiple incompatible approaches exist.
|
||||
|
||||
For example, imagine an API for depositing money in an account. The
|
||||
response might be either a “success” response indicating the new
|
||||
balance, or one of a set of possible errors.
|
||||
|
||||
Sometimes, a *pair* of values is used, with `null` marking the option
|
||||
not taken.[^interesting-failure-mode]
|
||||
|
||||
{ "ok": { "balance": 210 }, "error": null }
|
||||
{ "ok": null, "error": "Unauthorized" }
|
||||
|
||||
[^interesting-failure-mode]: What is the meaning of a document where
|
||||
both `ok` and `error` are non-null? What might happen when a
|
||||
program is presented with such a document?
|
||||
|
||||
The branch not chosen is sometimes present, sometimes omitted as if it
|
||||
were an optional field:
|
||||
|
||||
{ "ok": { "balance": 210 } }
|
||||
{ "error": "Unauthorized" }
|
||||
|
||||
Sometimes, an array of a label and a value is used:
|
||||
|
||||
[ "ok", { "balance": 210 } ]
|
||||
[ "error", "Unauthorized" ]
|
||||
|
||||
Sometimes, the shape of the data is sufficient to distinguish among
|
||||
the alternatives, and the label is left implicit:
|
||||
|
||||
{ "balance": 210 }
|
||||
"Unauthorized"
|
||||
|
||||
JSON itself does not offer any guidance for which of these options to
|
||||
choose. In many real cases on the web, poor choices have led to
|
||||
encodings that are irrecoverably ambiguous.
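
The cost of this variety can be sketched from the consumer's side. The hypothetical Python function below (`decode_result` is an illustrative name, not part of any library) tries to accept all of the encodings shown above and has to guess whenever the shape alone does not settle the matter.

```python
def decode_result(doc):
    """Guess whether a parsed JSON document is a success or an error.

    Returns ("ok", value) or ("error", value). Hypothetical sketch only.
    """
    # Array of a label and a value.
    if isinstance(doc, list) and len(doc) == 2 and doc[0] in ("ok", "error"):
        return (doc[0], doc[1])
    # Pair of fields, with the branch not taken either null or omitted.
    if isinstance(doc, dict) and ("ok" in doc or "error" in doc):
        ok, err = doc.get("ok"), doc.get("error")
        if ok is not None and err is not None:
            raise ValueError("both ok and error are non-null")
        if err is not None:
            return ("error", err)
        return ("ok", ok)
    # Implicit label: assume a bare string is an error, anything else success.
    if isinstance(doc, str):
        return ("error", doc)
    return ("ok", doc)
```

The guesses are where the ambiguity shows: a successful response whose payload happens to be a string is reported as an error, and a payload that is itself an object with an `ok` key is mistaken for the wrapper.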
|
||||
|
||||
<!-- Heading to visually offset the footnotes from the main document: -->
|
||||
## Notes
|