12 KiB
title |
---|
P-expressions |
Tony Garnock-Jones tonyg@leastfixedpoint.com
October 2023. Version 0.3.0.
Experimental. This document defines a grammar called Preserves Expressions (P-expressions, pexprs) that includes ordinary Preserves text syntax but offers extensions sufficient to support a Lisp- or Haskell-like programming notation.
Motivation. The text syntax for Preserves works well for writing
Value
s, i.e. data. However, in some contexts, Preserves applications
need a broader grammar that allows interleaving of expressions with
data. Two examples are the Preserves Schema
language and the Synit configuration scripting
language, both of
which (ab)use Preserves text syntax as a kind of programming notation.
Preliminaries
The P-expression grammar includes by reference the definition of Atom
from the
text syntax, as well as the definitions that Atom
depends on.
ws = *(%x20 / %x09 / CR / LF)
No changes to the Preserves semantic model are made. Every Preserves text-syntax term can be parsed as a valid P-expression, but in general P-expressions must be rewritten or otherwise interpreted before a meaningful Preserves value can be arrived at (see below).
Grammar
Standalone documents containing P-expressions are sequences of
individual Expr
s, followed by annotations, comments, and/or whitespace.
Document = *Expr Trailer ws
A single P-expression Expr
can be an Atom
from the text syntax,
a compound expression, special punctuation, an Embedded
expression, or
an Annotated
expression. The class SimpleExpr
includes all of Expr
except special punctuation.
Expr = ws (SimpleExpr / Punct)
SimpleExpr = Atom / Compound / Embedded / Annotated
Embedded and annotated values are as in the text syntax, differing only
in that uses of Value
are replaced with SimpleExpr
.
Embedded = "#!" SimpleExpr
Annotated = Annotation SimpleExpr
Annotation = "@" SimpleExpr / "#" [(%x20 / %x09) linecomment] (CR / LF)
linecomment = *<any unicode scalar value except CR or LF>
P-expression special punctuation marks are comma, semicolon, and sequences of one or more colons.1
Punct = "," / ";" / 1*":"
Compound expressions are sequences of Expr
s with optional trailing
Annotation
s, surrounded by various kinds of parentheses.
Compound = Sequence / Record / Block / Group / Set
Sequence = "[" *Expr Trailer ws "]"
Record = "<" *Expr Trailer ws ">"
Block = "{" *Expr Trailer ws "}"
Group = "(" *Expr Trailer ws ")"
Set = "#{" *Expr Trailer ws "}"
In an Annotated
P-expression, annotations and comments attach to the
term following them, just as in the ordinary text syntax. However, it is
common in programming notations to allow comments at the end of a file
or other sequential construct. The ordinary text syntax forbids comments
in these positions, but P-expressions allow them.
Trailer = *(ws Annotation)
Encoding P-expressions as Preserves
We write ⌜p⌝ for the encoding into Preserves of a P-expression p.
{:.pseudocode.equations}
| ⌜·⌝ : Expr | ⟶ | Value |
| ⌜[
p ...]
⌝ | = | [
⌜p⌝ ...]
|
| ⌜<
p ...>
⌝ | = | <r
⌜p⌝ ...>
|
| ⌜{
p ...}
⌝ | = | <b
⌜p⌝ ...>
|
| ⌜(
p ...)
⌝ | = | <g
⌜p⌝ ...>
|
| ⌜#{
p ...}
⌝ | = | <s
⌜p⌝ ...>
|
| ⌜#!
p⌝ | = | #!
⌜p⌝ |
| ⌜@
p q⌝ | = | @
⌜p⌝ ⌜q⌝ |
| ⌜p⌝ | = | p | when p ∈ Atom |
| ⌜,
⌝ | = | <p |,|>
|
| ⌜;
⌝ | = | <p |;|>
|
| ⌜:
...⌝ | = | <p |:
...|>
|
| ⌜t⌝ | = | @
⌜a⌝ ... <a>
| where @
a ... are the annotations in t and t ∈ Trailer |
The record <a>
acts as an anchor for the annotations in a Trailer
.
We overload the ⌜·⌝ notation for encoding whole Document
s into
sequences of Preserves values.
{:.pseudocode.equations}
| ⌜·⌝ : P-expression Document | ⟶ | Preserves Sequence |
| ⌜p ...⌝ | = | [
⌜p⌝ ...]
|
| ⌜p ... @
a ...⌝ | = | [
⌜p⌝ ... @
⌜a⌝ ... <a>]
|
| | | where @
a ... are trailing annotations |
Interpreting P-expressions as Preserves
The previous section discussed ways of representing P-expressions using Preserves. Here, we discuss interpreting P-expressions as Preserves so that (1) a Preserves datum (2) written using Preserves text syntax and then (3) read as a P-expression can be (4) interpreted from that P-expression to yield the original datum.
- Every
(
..)
or;
that appears is an error. - Every
:
,::
,:::
, ... is an error, except in context ofBlock
s as described below. - Every
,
that appears is discarded. - Every
Trailer
that appears is an error.2 - Every
Record
with no values in it is an error. - Every
Block
must contain zero or more repeating triplets ofSimpleExpr
,:
,SimpleExpr
. AnyBlock
not following this pattern is an error. EachBlock
following the pattern is translated to aDictionary
containing a key/value pair for each triplet. AnyBlock
with duplicate keys (under interpretation) is an error. - Every
Set
containing any duplicate expressions (under interpretation) is an error.
Appendix: Examples
Examples are given as pairs of P-expressions and their Preserves text-syntax encodings.
Individual P-expression Expr
s
⌜<date 1821 (lookup-month "February") 3>⌝
= <r date 1821 <g lookup-month "February"> 3>
⌜<>⌝
= <r>
⌜(begin (println! (+ 1 2)) (+ 3 4))⌝
= <g begin <g println! <g + 1 2>> <g + 3 4>>
⌜()⌝
= <g>
⌜[() () ()]⌝
= [<g>, <g>, <g>]
⌜{
setUp();
# Now enter the loop
loop: {
greet("World");
}
tearDown();
}⌝
= <b
setUp <g> <p |;|>
# Now enter the loop
loop <p |:|> <b
greet <g "World"> <p |;|>
>
tearDown <g> <p |;|>
>
⌜[1 + 2.0, print "Hello", predicate: #t, foo, #!remote, bar]⌝
= [1 + 2.0 <p |,|> print "Hello" <p |,|> predicate <p |:|> #t <p |,|>
foo <p |,|> #!remote <p |,|> bar]
⌜#{1 2 3}⌝
= <s 1 2 3>
⌜#{(read) (read) (read)}⌝
= <s <g read> <g read> <g read>>
⌜{
optional name: string,
address: Address,
}⌝
= <b
optional name <p |:|> string <p |,|>
address <p |:|> Address <p |,|>
>
Whole Document
s
⌜{
key: value
# example of a comment at the end of a dictionary
}
# example of a comment at the end of the input file⌝
= [ <b
key <p |:|> value
@"example of a comment at the end of a dictionary" <a>
>
@"example of a comment at the end of the input file"
<a>
]
Appendix: Reading vs. Parsing
Lisp systems first read streams of bytes into S-expressions and then parse those S-expressions into more abstract structures denoting various kinds of program syntax. Separation of reading from parsing is what gives Lisp its syntactic flexibility.
Similarly, the Apple programming language Dylan included a reader-parser split, with the Dylan reader producing D-expressions that are somewhat similar to P-expressions.
Finally, the Racket dialects Honu and Something use a reader-parser-macro setup, where the reader produces Racket data, the parser produces "syntax" and is user-extensible, and Racket's own modular macro system rewrites this "syntax" down to core forms to be compiled to machine code.
Similarly, when using P-expressions as the foundation for a language, a generic P-expression reader can then feed into special-purpose parsers. The reader captures the coarse syntactic structure of a program, and the parser refines this.
Often, a parser will wish to extract structure from sequences of
P-expression Expr
s.
-
A simple technique is repeated splitting of sequences; first by
Semicolon
, then byComma
, then by increasingly high binding-power operators. -
More refined is to use a Pratt parser or similar (1, 2, 3) to build a parse tree using an extensible specification of the pre-, in-, and postfix operators involved.
-
Finally, if you treat sequences of
Expr
s as pre-lexed token streams, almost any parsing formalism (such as PEG parsing, Ometa, etc.) can be used to extract further syntactic structure.
Appendix: Equations for interpreting P-expressions as Preserves
The function uncomma(p) removes all occurrences of ,
from a
P-expression p ∈ Expr
− {,
}.
{:.pseudocode.equations}
| uncomma : Expr − {,
} | ⟶ | Expr | |
| uncomma([
p ...]
) | = | [
uncomma(p) ...]
| omitting any p = ,
|
| uncomma(<
p ...>
) | = | <
uncomma(p) ...>
| omitting any p = ,
|
| uncomma({
p ...}
) | = | {
uncomma(p) ...}
| omitting any p = ,
|
| uncomma((
p ...)
) | = | (
uncomma(p) ...)
| omitting any p = ,
|
| uncomma(#{
p ...}
) | = | #{
uncomma(p) ...}
| omitting any p = ,
|
| uncomma(#!
p) | = | #!
uncomma(p) | |
| uncomma(@
p q) | = | @
uncomma(p) uncomma(q) | |
| uncomma(p) | = | p | if p ∈ Atom ∪ Punct − {,
} |
We write ⌞uncomma(p)⌟ for the partial function mapping a
P-expression p ∈ Expr
− {,
} to a corresponding Preserves Value
.
{:.pseudocode.equations}
| ⌞·⌟ : Expr − {,
} | ⇀ | Value | |
| ⌞[
p ...]
⌟ | = | [
⌞p⌟ ...]
| |
| ⌞<
ℓ p ...>
⌟ | = | <
⌞ℓ⌟ ⌞p⌟ ...>
| |
| ⌞{
k:
v ...}
⌟ | = | {
⌞k⌟:
⌞v⌟ ...}
| if all ⌞k⌟ ... are distinct |
| ⌞#{
p ...}
⌟ | = | #{
⌞p⌟ ...}
| if all ⌞p⌟ ... are distinct |
| ⌞#!
p⌟ | = | #!
⌞p⌟ | |
| ⌞@
p q⌟ | = | @
⌞p⌟ ⌞q⌟ | |
| ⌞p⌟ | = | p | when p ∈ Atom |
Notes
-
Colon matching is greedy: when reading, all adjacent colons are always taken into a single token, and when writing, adjacent colon-sequence punctuation marks must be written with whitespace separating them. ↩︎
-
Implementation note. When implementing parsing of P-expressions into Preserves, consider offering an optional mode where trailing annotations
Trailer
are discarded instead of causing an error to be signalled. ↩︎