Conventions for Common Data Types

The Value data type is essentially an S-Expression, able to represent semi-structured data over ByteString, String, SignedInteger atoms and so on.¹

However, users need a wide variety of data types for representing domain-specific values such as various kinds of encoded and normalized text, calendrical values, machine words, and so on.

Appropriately-labelled Records denote these domain-specific data types.²

All of these conventions are optional. They form a layer atop the core Value structure. Non-domain-specific tools do not in general need to treat them specially.

Validity. Many of the labels we will describe in this section come with side-conditions on the contents of labelled Records. It is possible to construct an instance of Value that violates these side-conditions without ceasing to be a Value or becoming unrepresentable. However, we say that such a Value is invalid because it fails to honour the necessary side-conditions. Implementations SHOULD allow two modes of working: one which treats all Values identically, without regard for side-conditions, and one which enforces validity (i.e. side-conditions) when reading, writing, or constructing Values.

IOLists.

Inspired by Erlang's notions of iolist() and iodata(), an IOList is any tree constructed from ByteStrings and Sequences. Formally, an IOList is either a ByteString or a Sequence of IOLists.

IOLists can be useful for vectored I/O. Additionally, the flexibility of IOList trees allows annotation of interior portions of a tree.
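The recursive definition above translates directly into a checker and a flattener. The following is a minimal sketch, not part of any Preserves library: it assumes ByteStrings are modelled as Python `bytes` and Sequences as Python `list`, and that "flattening" means concatenating the ByteStrings in left-to-right order, as Erlang's `iolist_to_binary` does. The names `is_iolist` and `flatten_iolist` are hypothetical.

```python
from typing import List, Union

# Assumption: bytes stands in for ByteString, list stands in for Sequence.
IOList = Union[bytes, List["IOList"]]


def is_iolist(value: object) -> bool:
    """Return True if value is a ByteString or a Sequence of IOLists."""
    if isinstance(value, bytes):
        return True
    if isinstance(value, list):
        return all(is_iolist(item) for item in value)
    return False


def flatten_iolist(value: IOList) -> bytes:
    """Concatenate every ByteString in the tree, left to right."""
    if not is_iolist(value):
        raise TypeError("not an IOList")
    chunks = []
    stack = [value]
    while stack:
        node = stack.pop()
        if isinstance(node, bytes):
            chunks.append(node)
        else:
            # Push children reversed so they are popped in original order.
            stack.extend(reversed(node))
    return b"".join(chunks)


example = [
    bytes.fromhex("00010203"),
    [bytes.fromhex("04050607"), bytes.fromhex("08090A0B")],
    bytes.fromhex("0C0D0E0F"),
]
flat = flatten_iolist(example)
```

For vectored I/O one would typically hand the individual chunks to the operating system rather than joining them; the join here only makes the ordering easy to inspect.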

Comments.

String values used as annotations are conventionally interpreted as comments. Special syntax exists for such string annotations, though the usual @-prefixed annotation notation can also be used.

;I am a comment for the Dictionary
{
  ;I am a comment for the key
  key: ;I am a comment for the value
       value
}

;I am a comment for this entire IOList
[
  #x"00010203"
  ;I am a comment for the middle half of the IOList
  ;A second comment for the same portion of the IOList
  @ ;I am the first and only comment for the following comment
    "A third (itself commented!) comment for the same part of the IOList"
  [
    ;"I am a comment for the following ByteString"
    #x"04050607"
    #x"08090A0B"
  ]
  #x"0C0D0E0F"
]

MIME-type tagged binary data.

Many internet protocols use media types (a.k.a. MIME types) to indicate the format of some associated binary data. For this purpose, we define MIMEData to be a Record labelled mime with two fields, the first being a Symbol, the media type, and the second being a ByteString, the binary data.

While each media type may define its own rules for comparing documents, we define ordering among MIMEData representations of such media types following the general rules for ordering of Records.

Examples.

«<mime application/octet-stream #"abcde">»
  = B4 B3 04 "mime" B3 18 "application/octet-stream" B2 05 "abcde" 84

«<mime text/plain #"ABC">»
  = B4 B3 04 "mime" B3 0A "text/plain" B2 03 "ABC" 84

«<mime application/xml #"<xhtml/>">»
  = B4 B3 04 "mime" B3 0F "application/xml" B2 08 "<xhtml/>" 84

«<mime text/csv #"123,234,345">»
  = B4 B3 04 "mime" B3 08 "text/csv" B2 0B "123,234,345" 84
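The byte layout in these examples can be reproduced with a few lines of code. The sketch below is inferred only from the examples themselves: it assumes B4 opens a Record, 84 closes it, B3 introduces a Symbol and B2 a ByteString, each followed by a length and the raw bytes. It handles only lengths below 128, which fit the single length byte seen above; longer values are rejected rather than guessed at. The function names are hypothetical.

```python
def _length_byte(n: int) -> bytes:
    # Assumption: this sketch covers only the single-byte lengths that
    # appear in the examples, so anything of 128 bytes or more is refused.
    if not 0 <= n < 128:
        raise ValueError("length out of range for this sketch")
    return bytes([n])


def encode_symbol(name: str) -> bytes:
    """B3, length, then the UTF-8 bytes of the symbol name."""
    data = name.encode("utf-8")
    return b"\xb3" + _length_byte(len(data)) + data


def encode_bytestring(data: bytes) -> bytes:
    """B2, length, then the raw bytes."""
    return b"\xb2" + _length_byte(len(data)) + data


def encode_mime(media_type: str, payload: bytes) -> bytes:
    """Encode <mime media_type payload>: B4, label, two fields, 84."""
    return (
        b"\xb4"
        + encode_symbol("mime")
        + encode_symbol(media_type)
        + encode_bytestring(payload)
        + b"\x84"
    )


encoded = encode_mime("text/plain", b"ABC")
```

Here `encoded` corresponds to the `text/plain` example: the label and media type are both Symbols, and only the payload is a ByteString.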

Unicode normalization forms.

Unicode defines multiple normalization forms for text. While no particular normalization form is required for Strings, users may need to unambiguously signal or require a particular normalization form. A NormalizedString is a Record labelled with unicode-normalization and having two fields, the first of which is a Symbol specifying the normalization form used (e.g. nfc, nfd, nfkc, nfkd), and the second of which is a String whose underlying code point representation MUST be normalized according to the named normalization form.
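A validity check for this side-condition might look like the following sketch, which uses Python's standard `unicodedata` module. It assumes the Symbol and String fields have already been extracted as Python `str` values; the function name is hypothetical. Treating any form other than the four named above as invalid is an assumption of this sketch, since the text lists those four only as examples.

```python
import unicodedata

# Maps the Symbol used in the Record to the name unicodedata expects.
_FORMS = {"nfc": "NFC", "nfd": "NFD", "nfkc": "NFKC", "nfkd": "NFKD"}


def is_valid_normalized_string(form: str, text: str) -> bool:
    """Check that text is already normalized according to form."""
    py_form = _FORMS.get(form)
    if py_form is None:
        # Assumption: unknown normalization forms are rejected.
        return False
    # A string is in a given form exactly when normalizing leaves it unchanged.
    return unicodedata.normalize(py_form, text) == text


composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT
```

With these inputs, `composed` is valid under nfc but not nfd, and `decomposed` is valid under nfd but not nfc.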

IRIs (URIs, URLs, URNs, etc.).

An IRI is a Record labelled with iri and having one field, a String which is the IRI itself and which MUST be a valid absolute or relative IRI.

Machine words.

The definition of SignedInteger captures all integers. However, in certain circumstances it can be valuable to assert that a number inhabits a particular range, such as a fixed-width machine word.

A family of labels i8, i16, i32, i64, i128 and u8, u16, u32, u64, u128 (that is, in and un for n ∈ {8,16,32,64,128}) denotes n-bit-wide signed and unsigned range restrictions, respectively. Records with these labels MUST have one field, a SignedInteger, which MUST fall within the appropriate range. That is, to be valid,

in <i8 x>, -128 <= x <= 127.
in <u8 x>, 0 <= x <= 255.
in <i16 x>, -32768 <= x <= 32767.
etc.
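The remaining ranges follow the usual two's-complement bounds: a signed n-bit word spans -2^(n-1) to 2^(n-1) - 1, and an unsigned one spans 0 to 2^n - 1. A minimal sketch of the check, assuming the label is available as a Python `str` and the field as a Python `int` (the function names are hypothetical):

```python
import re

# Labels are "i" or "u" followed by one of the permitted widths.
_LABEL = re.compile(r"([iu])(8|16|32|64|128)")


def word_range(label: str):
    """Return the inclusive (low, high) bounds for a machine-word label."""
    match = _LABEL.fullmatch(label)
    if match is None:
        raise ValueError("not a machine-word label: %r" % (label,))
    bits = int(match.group(2))
    if match.group(1) == "i":
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1


def is_valid_word(label: str, x: object) -> bool:
    """Check the side-condition for a Record such as <i8 x> or <u16 x>."""
    low, high = word_range(label)
    # bool is a subclass of int in Python, so it is excluded explicitly.
    return isinstance(x, int) and not isinstance(x, bool) and low <= x <= high
```

An implementation working in the mode that ignores side-conditions would simply skip the call to `is_valid_word`.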

Anonymous Tuples and Unit.

A Tuple is a Record with label tuple and zero or more fields, denoting an anonymous tuple of values.

The 0-ary tuple, <tuple>, denotes the empty tuple, sometimes called “unit” or “void” (but not e.g. JavaScript's “undefined” value).

Null and Undefined.

Tony Hoare's “billion-dollar mistake” can be represented with the 0-ary Record <null>. An “undefined” value can be represented as <undefined>.

Dates and Times.

Dates, times, moments, and timestamps can be represented with a Record with label rfc3339 having a single field, a String, which MUST conform to one of the full-date, partial-time, full-time, or date-time productions of section 5.6 of RFC 3339. (In date-time, "T" and "Z" MUST be upper-case and "T" MUST be used; a space separating the full-date and full-time MUST NOT be used.)
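The four permitted productions can be approximated with a regular expression. The sketch below is a syntactic check only: it enforces the field ranges given in RFC 3339 section 5.6 (months 01-12, hours 00-23, seconds up to 60 for leap seconds) but does not verify the number of days in a given month or when a leap second is actually allowed, so a string such as 2023-02-31 passes. The function name is hypothetical.

```python
import re

_FULL_DATE = r"[0-9]{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12][0-9]|3[01])"
_PARTIAL_TIME = r"(?:[01][0-9]|2[0-3]):[0-5][0-9]:(?:[0-5][0-9]|60)(?:\.[0-9]+)?"
# Upper-case "Z" only, or a numeric offset such as +01:00.
_OFFSET = r"(?:Z|[+-](?:[01][0-9]|2[0-3]):[0-5][0-9])"
_FULL_TIME = _PARTIAL_TIME + _OFFSET
# Upper-case "T" only; a space separator is not accepted.
_DATE_TIME = _FULL_DATE + "T" + _FULL_TIME

_RFC3339 = re.compile(
    "(?:%s|%s|%s|%s)" % (_FULL_DATE, _PARTIAL_TIME, _FULL_TIME, _DATE_TIME)
)


def looks_like_rfc3339(text: str) -> bool:
    """Syntactic check against full-date, partial-time, full-time, date-time."""
    return _RFC3339.fullmatch(text) is not None
```

A stricter validator would parse the matched fields and check them against the calendar.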

XML Infoset.

XML Infoset describes the semantics of XML, that is, the underlying information contained in a document, independent of surface syntax.

A useful subset of XML Infoset, namely its Element Information Items (omitting processing instructions, entities, entity references, comments, namespaces, name prefixes, and base URIs), can be captured with the schema

Node = Text / Element .
Text = string .
Element =
  / @withAttributes
    <<rec> @localName symbol [@attributes Attributes @children Node ...]>
  / @withoutAttributes
    <<rec> @localName symbol                        @children [Node ...]> .
Attributes = { symbol: string ...:... } .
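To make the shape of this schema concrete, the sketch below converts a parsed XML document into a nested structure following it. It is an illustration under stated assumptions, not a Preserves API: `Record` is a hypothetical stand-in class, Symbols and Strings are both modelled as Python `str`, and an element with no attributes is mapped to the withoutAttributes variant. Namespaces are dropped by keeping only local names, in line with the subset described above; comments and processing instructions are discarded by the standard-library parser by default.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Record:
    """Hypothetical stand-in for a Preserves Record."""
    label: str                  # the element's local name (a Symbol)
    fields: Tuple[object, ...]  # optional attribute dict, then child Nodes


def _local(name: str) -> str:
    # ElementTree writes namespaced names as "{uri}local"; keep "local".
    return name.rsplit("}", 1)[-1]


def element_to_record(elem: ET.Element) -> Record:
    fields = []
    if elem.attrib:
        # withAttributes: the first field is the Attributes dictionary.
        fields.append({_local(k): v for k, v in elem.attrib.items()})
    if elem.text:
        fields.append(elem.text)  # Text node before the first child
    for child in elem:
        fields.append(element_to_record(child))
        if child.tail:
            fields.append(child.tail)  # Text node following this child
    return Record(_local(elem.tag), tuple(fields))


doc = element_to_record(ET.fromstring('<p class="x">hi<b>there</b>!</p>'))
```

In Preserves text syntax, `doc` would correspond to `<p {class: "x"} "hi" <b "there"> "!">`. Because a Node is never a dictionary, a reader can tell the two Element variants apart by inspecting the first field.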