#lang scribble/manual
@require[racket/include]
@include{prelude.inc}
@title{REMAINDER}

Figure~\ref{vm-interface-types} specifies the framework and its underlying library via stylized type signatures.@note{The actual implementation supports secondary features not essential to the system, such as debug-names for processes and user-accessible process identifiers. Also, in Typed Racket, we must encode the existentially-quantified types of Process and Spawn using second-order polymorphism.}

Additionally, our framework allows the recursive nesting of marketplaces, thus realizing Dijkstra's vision of a layered, virtualized operating system. Processes within a layer can themselves be the substrate for a further layer of sub-processes. Each layer communicates internally using protocols appropriate to just that layer. Relay processes translate messages between protocols as they cross layer boundaries.

Within a marketplace, the appearance or disappearance of a service becomes an event that affects interested parties. Our architecture comes with a notion of presence and absence notification integrated with each nested layer. Using presence, our architectural framework naturally delimits conversational contexts and manages associated resources.

While many existing environments use a "mailbox" metaphor, in which programs exchange messages with peers, real distributed systems do not behave like orderly postal services. In practice, messages are frequently lost to corruption and congestion. Programs engage in multiple simultaneous conversations. The services a program depends on may be crashed, down for maintenance, or still going through their startup procedures. An orderly startup sequence is an impossibility. The system as a whole frequently cannot be rebooted, existing instead in a state of constant recovery. Addresses become stale. Demand for services often outstrips their supply. The marketplace metaphor implies that such complications are not problems to be solved anew by each application, but issues that the programming environment should solve, once and for all.

In this paper we report on initial progress toward this vision. We take a three-pronged approach to scaling Worlds and Universes to systems programming: we make Worlds nestable, transform their event system into a pub/sub network, and integrate presence and absence notifications. In addition to satisfying the criteria of Hudak and Sundaresh, the combination of nesting and presence gives a principled approach to resource management and to subsystem isolation and composition. Presence gives a flexible communications topology to each layer in the layered architecture and provides a clean account of error signalling.

Our design is at heart a @emph{distributed operating system}. This idea, together with the recent virtualization trend, suggests the introduction of a @emph{virtual machine (VM)} in which user programs run. To each VM, we add pub/sub messaging. We escape the constraints of a hub-and-spoke routing topology by automatically deducing routing tables from the set of active pub/sub subscriptions. Basing message routing on active subscriptions in this way has a pleasant side effect: our VM notifies processes when routes relevant to their interests appear or disappear, yielding a generalized form of @emph{presence}, a concept originating in more restricted form in instant messaging networks such as XMPP.
Presence notifications are a common, though often disguised, feature of communications media, but to date they have not received wide attention. Our approach separates discrete actions, such as spawning new processes and sending and receiving messages, from more continuous reactions to changes in a process' environment, such as the arrival of a new service or the crashing of a peer.

To illustrate the idea of presence, consider a widely-used Internet-scale pub/sub network: Twitter. Each Twitter user is analogous to a process in our system. Following a user is equivalent to subscribing to their message stream. The analog of presence notification is the email Twitter sends to inform a user of a new follower. In some sense, users tailor their message stream to match the perceived interests of their followers; similarly, processes in our system base their decisions about what to send on @emph{who is listening}. Our system goes further in that presence is bidirectional, informing processes not only of subscribers matching their advertised intent to publish, but also of publishers matching their declared interest in receiving messages.

As a second illustration, consider the essence of the BitTorrent file-sharing protocol, as it might be implemented in our system. A group of processes share a communication space and collaborate to ensure all members have a copy of the file being shared. Each process advertises the chunks of a file it holds. Peers subscribe to chunks they wish to receive. The VM infrastructure computes the intersections between advertisements and subscriptions, and conveys that routing information to the processes. As processes arrive and depart, the subscription set changes, and the routes computed from subscriptions indicate changing demand and supply levels for chunks within the network. Presence, then, indicates what it is profitable to send to whom.

In order to properly encapsulate and isolate groups of processes collaborating on subtasks within a larger system, we take care to ensure that the type of our VM kernel program is a subtype of the type of its processes, which makes our system @emph{recursively nestable}. A VM instance can be run as a process within another VM. A layered structure of nested VMs arises, with each VM encapsulating a group of related processes. A @emph{ground VM} maps events and actions to real communication with the outside world. Each subsequent layer translates between its clients above and its substrate below, in a way similar both to layers in network architectures such as the OSI model and to the architecture envisaged in Hoare's quote above.

Because presence operates both inside and outside a nestable VM, it can be used to automatically propagate demand for services across layers. Consider a cloud scenario in which a single physical machine hosts $n$ Linux virtual machines, each of which hosts $m$ socket-based services. Using an approach such as @tt{systemd}'s socket-activated OS containers, incoming connections not only cause processes to be spawned, but cause whole virtual machines to be started. Our system achieves the same responsiveness to changing demand while avoiding the manual configuration step @tt{systemd} requires: presence expressed by the innermost processes flows across successive levels of containment to the ground VM, where it can be turned into actual TCP activity by a TCP driver.
\begin{figure}[t]
\centering
\begin{tabular}{|l|c|c|}
\hline
Challenge & Traditional model & Marketplace model \\
\hline
Application logic & App & App \\
User interface & App & App \\
Service discovery & App & Language \\
Session lifetime & App & Language \\
Demand tracking & App & Language \\
Fault isolation & App & Language \\
Routing & App & Language \\
Messaging & Language & Language \\
Concurrency & Language & Language \\
\hline
\end{tabular}
\ruledcaption{Challenges faced, and division of responsibility}
\label{asynchronous-challenges}
\end{figure}

In this way, we have moved from a "mailbox" model based strictly around producing and consuming messages toward a "marketplace" model. Figure~\ref{asynchronous-challenges} summarizes the burdens that our marketplace architecture lifts from applications. Each VM hosts a "bazaar" of interacting vendors and buyers. Groups of collaborating programs are placed within task-specific VMs to scope their interactions. Conversations between programs are multi-party, and programs naturally participate in many such conversations at once. Not only are messages sent and received, but programs react to presence notifications that report the comings and goings of their peers. Presence also serves to communicate changes in demand for and supply of services, both within a VM and across layers. Programs are no longer responsible for maintaining presence information or for scoping group communications; their containing virtual machine takes on those tasks for them.

@section{Interface}

Our @emph{processes} generalize World programs by replacing the latter's special-purpose input handlers with @emph{endpoints}, a single, general construct for handling (possibly message-carrying) @emph{events}. Existentially-quantified types hide process states (\State) from the kernel, and we hide kernel state from processes by never passing it into user code. Given an event and a current process state, event handlers respond with a @emph{transition}, which bundles a new process state with a list of @emph{actions}. The containing VM interprets these action data structures. Actions can be communication-related (@racket[add-endpoint], @racket[delete-endpoint] and @racket[send-message]), process-related (@racket[spawn], @racket[quit]), or cross-layer (@racket[at-meta-level]).

A virtual machine groups and isolates a collection of processes; in turn, it presents itself as a process to another group of processes. That is, a system consists of nested layers of processes that interact via messages. The bottom-most (ground) layer is the runtime library of our language, and interacts with the real world.

\paragraph*{Starting an Application.}
Applications differ from normal Racket modules only in their selection of language. A Racket module written with @tt{#lang marketplace}, such as the echo server in figure~\ref{echo-paper3}, specifies a sequence of definitions and startup actions for an application. Typically, initial actions spawn application processes and nested VMs, which in turn subscribe to sources of events from the outside world.

\paragraph*{Endpoints, Conversations, Messaging and Feedback.}
Processes engage in multiple simultaneous conversations. Each process therefore has a set of active subscription @emph{endpoints}, each of which selects a subset of the messages on the network. Roughly speaking, each endpoint plays a @emph{role} within an ongoing conversation.
Publishers and subscribers declare their interests to their containing VM via @emph{advertisements} and @emph{subscriptions}, respectively, created with @racket[add-endpoint] actions:

@#reader scribble/comment-reader
(racketblock
(add-endpoint @emph{endpoint-id}
              (role @emph{orientation} @emph{topic} @emph{interest-type})
              (\LAMBDA (event)
                (\LAMBDA (state)
                  @emph{... computation resulting in:}
                  (transition @emph{new-state} @emph{action0} @emph{action1} ...))))
)

Endpoints are the most complex structure in our system's interface, and so deserve careful explanation. They are named, for later reference in @racket[delete-endpoint] actions:

@#reader scribble/comment-reader
(racketblock
(delete-endpoint @emph{endpoint-id})
)

Endpoint IDs must be unique within the scope of a process.

\noindent Endpoints contain a @emph{role}, which generalizes traditional notions of advertisement and subscription by combining a topic of conversation with an orientation: @emph{publisher} or @emph{subscriber}. The topic filter is a pattern over S-expression-shaped messages@note{In pub/sub terminology, this is a @emph{content-based filter}.} expressed as a general datum with embedded wildcards. Choosing this representation gives both an intuitive pattern language and, with unification, a conventional operation for computing topic intersections.

Borrowing an example from the chat server implementation of section~\ref{sec:example}, the following constructs an endpoint advertising intent to publish@note{This endpoint exists solely to indicate presence to others, and its event handler therefore ignores incoming events.} on the "$X$ says $Y$" topic, where $X$ is bound to a user's name (@racket[me]) and $Y$ is wild (@racket[?]):

@#reader scribble/comment-reader
(racketblock
(add-endpoint 'speaker
              (role 'publisher `(,me says ,?) 'participant)
              (\LAMBDA (event) (\LAMBDA (state) (transition state))))
)

Event handlers dispatch on the type of event and the current process state, returning a transition structure for the VM to process. An endpoint matching the @racket['speaker] advertisement might be:

@#reader scribble/comment-reader
(racketblock
(add-endpoint 'listener
              (role 'subscriber `(,? says ,?) 'participant)
              (\LAMBDA (event)
                (\LAMBDA (state)
                  (match event
                    ;; describe the arrival of a user
                    [(presence-event arriving-role) ...]
                    ;; describe the departure of a user
                    [(absence-event departing-role reason) ...]
                    ;; inform the user that who said what
                    [(message-event sender-role `(,who says ,what)) ...]))))
)

Since the @emph{presence} of processes is as important as exchanging messages, we include (dis)appearances of processes as essential events of a conversation alongside regular message deliveries. Concretely, presence and absence events carry a VM-computed @racket[role] structure describing the @emph{intersection} between the advertised interests of the recipient and the appearing or disappearing peer. For example, if endpoint "A" takes on the role of subscriber to topic @racket[(? says ?)], and a peer process creates an endpoint "B" taking on the role of publisher within the topic @racket[(Bob ? ?)], then the VM sends a presence event to "A" noting that a publisher on topic @racket[(Bob says ?)] has appeared. Likewise, the VM informs "B" of a new subscriber on the same topic. Shared topics of conversation are just the intersections of the topics of the endpoints viewed as sets of messages.
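To make the unification-based intersection concrete, the following is a minimal sketch of topic intersection, not the routine our VMs actually use: it represents the wildcard as the quoted symbol @racket[?], and it glosses over complications such as messages containing a literal @racket[#f].

@#reader scribble/comment-reader
(racketblock
;; Sketch only. Topics are S-expressions; '? stands for the wildcard.
(define (topic-intersection a b)
  (cond
    [(eq? a '?) b]                        ;; a wildcard matches anything
    [(eq? b '?) a]
    [(and (pair? a) (pair? b))            ;; descend into structure
     (let ([head (topic-intersection (car a) (car b))]
           [tail (topic-intersection (cdr a) (cdr b))])
       (and head tail (cons head tail)))]
    [(equal? a b) a]                      ;; equal atoms intersect as themselves
    [else #f]))                           ;; otherwise, the topics are disjoint

;; Recovering the example above:
;; (topic-intersection '(? says ?) '(Bob ? ?)) evaluates to '(Bob says ?)
)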
@defstruct*[send-message ([body any/c] [orientation orientation?])]{

Processes send messages to peers with @racket[send-message] actions. The optional orientation defaults to @racket['publisher], for the common case in which the message @racket[body] is intended for matching @racket['subscriber]s. Because our system enjoys publisher/subscriber symmetry in its presence notifications and routing tables, @emph{subscribers} may offer feedback to @emph{publishers}: with @racket[send-message] orientation @racket['subscriber], messages can flow @emph{upstream} to processes playing the conversational role of publisher. Feedback can express flow-control, mode-selection and message acknowledgement.

To illustrate, endpoint "B" from above might take a transition

@#reader scribble/comment-reader
(racketblock
(transition (compute-bob-state)
            (send-message '(Bob says hello) 'publisher)
            (send-message '(Bob goes-to the-shop) 'publisher))
)

Endpoint "A" would receive just the first message, and might give feedback with

@#reader scribble/comment-reader
(racketblock
(transition (compute-alice-state)
            (send-message '(Alice hears (Bob says hello)) 'subscriber))
)

As another example, the chat program in section~\ref{sec:example} uses such feedback to manage flow-control between the chat process and the TCP driver.

}

\paragraph*{Participants and Observers.}
The @emph{interest type} given in an endpoint's @racket[role] structure allows endpoints to monitor interest in some topic of conversation without offering to participate in such conversations, or equivalently, to monitor demand for some service without offering to supply or consume that service. Endpoints with an interest type of @racket['participant] are regular subscribers, both receiving and causing presence notifications for matching participant endpoints in the system. Those with type @racket['observer], however, @emph{receive} presence notifications about participants but do not @emph{cause} any. Finally, endpoints using interest type @racket['everything] receive notifications about all three types of endpoint in the system. The ability to passively observe other participants in a conversation naturally supports supervisor processes. Such supervisors can create and destroy services in response to changes in demand.

\begin{figure}
\centering
\begin{tabular}{|r|c|c|}
\hline
 & Participant & Observer \\
\hline
Subscriber & Informed of pubs. & Informed of pubs. \\
 & Acts as listener & \\
\hline
Publisher & Informed of subs. & Informed of subs. \\
 & Acts as speaker & \\
\hline
\end{tabular}
\ruledcaption{Interest types, roles, and presence events.}
\label{interest-types-in-our-architecture}
\end{figure}

\paragraph*{Linguistic Simplifications.}
Often, only a subset of the flexibility of @racket[add-endpoint] is needed, and definitions like that of the @racket['listener] endpoint above look long-winded. For such cases, a small, optional endpoint-creation domain-specific language provides sensible defaults. The endpoints in figure~\ref{echo-paper3}, for example, are created using the DSL instead of building @racket[add-endpoint] structures directly.
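As a sketch of how a supervisor might exploit the @racket['observer] interest type (written in the DSL specified next; @racket[make-worker] is a hypothetical function returning the initial transition for a worker process):

@#reader scribble/comment-reader
(racketblock
;; A publisher-oriented observer is informed of matching subscribers
;; (figure~\ref{interest-types-in-our-architecture}) but causes no
;; presence notifications itself, so it can watch demand for a service
;; without appearing to supply it.
(endpoint #:publisher `(service-request ,?)
          #:observer
          #:on-presence (list (spawn #:child (make-worker))))
)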
\begin{figure}
\begin{tabular}{rcl}
$endpoint$ & := & @tt{(endpoint }$orientation$ $topic$ \\
 & & $\quad\quad\{interest\}$ \\ \\
 & & $\quad\quad$\{@tt{#:state }$pattern$\} \\
 & & $\quad\quad$\{@tt{#:conversation }$pattern$\} \\
 & & $\quad\quad$\{@tt{#:reason }$identifier$\} \\ \\
 & & $\quad\quad$\{@tt{#:let-name }$identifier$\} \\
 & & $\quad\quad$\{@tt{#:name }$expr$\} \\ \\
 & & $\quad\quad$\{@tt{#:on-presence }$handler$\} \\
 & & $\quad\quad$\{@tt{#:on-absence }$handler$\} \\ \\
 & & $\quad\quad message\mhyphen handler^*$@tt{)} \\ \\
$orientation$ & := & @tt{#:publisher} $|$ @tt{#:subscriber} \\ \\
$topic$ & := & $expr$ \\ \\
$interest$ & := & @tt{#:participant} $|$ \\
 & & @tt{#:observer} $|$ \\
 & & @tt{#:everything} \\ \\
$message\mhyphen handler$ & := & @tt{(}$pattern$ $handler$@tt{)} \\ \\
$handler$ & := & $expr$
\end{tabular}
\ruledcaption{Syntax of the @racket[endpoint] DSL. Braces indicate optional elements; Kleene star indicates repetition.}
\label{endpoint-dsl-syntax}
\end{figure}

\begin{figure}
\centering
\begin{tabular}{|r|c|c|c|c|}
\hline
Handler & \begin{sideways}@tt{#:state}\end{sideways} & \begin{sideways}@tt{#:conversation}\end{sideways} & \begin{sideways}@tt{#:reason}\end{sideways} & \begin{sideways}@tt{#:let-name}$\quad$\end{sideways} \\
\hline
message & \checkmark & \checkmark & & \checkmark \\
@tt{#:on-presence} & \checkmark & \checkmark & & \checkmark \\
@tt{#:on-absence} & \checkmark & \checkmark & \checkmark & \checkmark \\
\hline
\end{tabular}
\caption{Scope of bindings in @racket[endpoint] handlers}
\label{endpoint-dsl-scope}
\end{figure}

Figure~\ref{endpoint-dsl-syntax} specifies the syntax of the @racket[endpoint] language. The only mandatory parts of an @racket[endpoint] are its @emph{orientation}, that is, whether it is a subscription or a publication advertisement, and its @emph{topic}. Many of the optional clauses introduce new bindings into the scope of the endpoint's handlers. Figure~\ref{endpoint-dsl-scope} summarizes the visibility of new bindings in each kind of handler.

With a @racket[#:state] clause, handler expressions can refer to and update the current process state. Variables introduced in the associated pattern are scoped over all three types of handler. If @racket[#:state] is present, handler expressions are expected to return a full transition structure including a new process state. If it is absent, however, handler expressions are expected to return only a list of actions. This permits concision in the common case of a stateless process or endpoint. For example, consider the "no-op" event handler in the @racket['speaker] endpoint example above. Using @racket[endpoint], it becomes

@#reader scribble/comment-reader
(racketblock
(endpoint #:publisher `(,me says ,?))
)

The @racket[#:conversation] clause, again scoped over all handlers, gives access to the topic of conversation carried in each notification. The @racket[#:reason] clause, scoped solely over @racket[#:on-absence] handlers, conveys the exit reason code carried in absence notifications.

Endpoint names are introduced with @racket[#:name], if the program wishes to supply an explicitly-computed name, or @racket[#:let-name], if the program wishes to delegate name construction to the VM. When @racket[#:let-name] is used, a guaranteed-fresh endpoint name is supplied to handlers.
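To round out the discussion of @racket[#:state], here is a sketch of a stateful endpoint; the increment protocol is hypothetical and not part of our system. The pattern @racket[count] binds the current process state, and the handler returns a full transition carrying the updated state:

@#reader scribble/comment-reader
(racketblock
(endpoint #:subscriber `(increment ,?)
          #:state count
          ;; with #:state present, handlers return full transitions:
          [`(increment ,n)
           (transition (+ count n)
                       (send-message `(count-is ,(+ count n))))])
)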
Returning to @racket[#:let-name]: the guaranteed-fresh name permits an idiom for declaring a temporary endpoint:

@#reader scribble/comment-reader
(racketblock
(endpoint #:subscriber some-topic
          #:let-name e
          ;; message handler:
          [request (let ([reply (compute-reply request)])
                     (list (delete-endpoint e)
                           (send-message reply)))])
)

Message-handling clauses at the end of an @racket[endpoint] expression are run against delivered messages in the usual left-to-right order. If no clauses match, the delivered message is silently discarded.

\paragraph*{Cross-layer communication.}
Each VM has access to @emph{two} inter-process communication (IPC) facilities: the external network connecting it to its siblings and the internal network connecting its contained processes to each other. When a process hands normal @racket[add-endpoint], @racket[delete-endpoint] and @racket[send-message] actions to its VM, they apply to the internal network of the VM. Actions must be wrapped in an @racket[at-meta-level] structure to signal to the VM that they are to apply to the VM's external network.

\begin{figure}[tb]
@#reader scribble/comment-reader
(racketblock
(define relay-down
  (endpoint #:subscriber ?
            ;; message handler:
            [message (at-meta-level (send-message message))]))

(define relay-up
  (at-meta-level
   (endpoint #:subscriber ?
             ;; (meta-level) message handler:
             [message (send-message message)])))
)
\ruledcaption{Examples of the use of @racket[at-meta-level]}
\label{at-meta-level-examples}
\end{figure}

Figure~\ref{at-meta-level-examples} demonstrates the use of @racket[at-meta-level]. Both examples evaluate to @racket[add-endpoint] action structures. The @racket[relay-down] endpoint subscribes to the wildcard pattern on the internal network and, upon receipt of a message, transmits it on the external network. The @racket[relay-up] endpoint subscribes to the external network and transmits on the internal network.

Relaying messages between layers is straightforward, but relaying presence across layers requires the passive @racket['observer] interest-type. An observer subscription can be used to measure demand for some service at an upper layer and project it as demand for an analogous service at a lower layer, without appearing to satisfy the upper-layer demand until matching supply is detected at the lower layer.

\paragraph*{Creating Processes.}
A @racket[spawn] action requests the launch of a new process. Each @racket[spawn] contains a function producing an initial transition for the new process:

@#reader scribble/comment-reader
(racketblock
(make-spawn (\LAMBDA () (transition @emph{state0} @emph{action0} @emph{action1} ...)))
)

The function delays computation of the initial state and initial actions until the VM installs an appropriate exception handler, so that blame for any exceptions is correctly apportioned. Because this is syntactically awkward, a simple shorthand is provided:

@#reader scribble/comment-reader
(racketblock
(spawn #:child (transition @emph{state0} @emph{action0} @emph{action1} ...))
)

The VM interpreting the @racket[spawn] datum creates a new process record with the initial state and queues up the associated actions for execution. At the type level, a @racket[spawn] action involves a fresh, existentially-quantified state type variable.

\paragraph*{Exceptions and Process Termination.}
@defstruct*[quit ([pid pid?] [reason any/c])]{

A @racket[quit] action terminates the invoking process, cancelling all its subscriptions. The optional @emph{reason code} is passed along to other processes in any absence notifications arising from the process's termination.
This is analogous to the "exit reason" carried by Erlang's process exit signals~\cite[\S3.5.6]{Armstrong2003}. Any exception thrown in an event handler (or during the computation of an initial transition from a @racket[spawn] action) is caught by the VM and translated into a @racket[quit] action. This isolates processes, but not endpoints within processes, from each other's failures.

}

\paragraph*{Scheduling, Management and Monitoring.}
Our current VM implementations cooperatively schedule their processes, and so support an additional @racket[yield] action, which cedes control of the CPU to other processes:

@#reader scribble/comment-reader
(racketblock
(make-yield (\LAMBDA (state) (transition ...)))
)

VMs treat processes under their care as linear resources, leaving them free either to take a pure-functional approach to managing their state or to use side-effecting actions as they see fit.

Finally, many real operating and networking systems provide reflective facilities which permit listing of running processes, listing of active network endpoints, killing of processes by ID, attachment of debuggers to running processes, and so on. Programmers working with systems that do not provide such facilities often find themselves implementing makeshift substitutes. Our current implementation has limited support for such features; we conjecture that our design will naturally extend to this kind of reflection, but properly integrating these ideas remains future work.

@section{Implementation}

We have two interworking implementations of our VM abstraction: one nestable VM used to organize applications, and one ground VM mapping abstract events to actions in the outside world.

\paragraph*{The Nestable VM.}
The workhorses of our system, nested VM instances are created by a new linguistic construct, @racket[nested-vm]. Given a list of actions for a primordial process to run in the new VM, @racket[nested-vm] returns a @racket[spawn] action that requests the launch of the new VM:

@#reader scribble/comment-reader
(racketblock
(transition @emph{spawner-state}
            (nested-vm @emph{primordial-action} ...))
)

Figure~\ref{spawning-nested-vm} illustrates the creation of a new VM.

\begin{figure}[tb]
\centering
\includegraphics[height=3cm]{spawning-nested-vm.eps}
\ruledcaption{Spawning a nested VM}
\label{spawning-nested-vm}
\end{figure}

Nested VM instances are implemented as ordinary processes, and so have state, a state type, and a collection of active subscriptions. Their private state is nothing more than the table of contained processes:
$$ \State_{vm} = \textrm{PID} \mapsto \textrm{Process} $$
Recall from figure~\ref{vm-interface-types} that the Process type involves "EPs" and "MetaEPs", which are sets of endpoints interacting with the VM's internal and external networks, respectively.

Nested VMs interpret actions from contained processes as they respond to VM events. Ordinary actions, such as @racket[add-endpoint] and @racket[spawn], operate on the VM's resources. Meta-level actions, wrapped in an @racket[at-meta-level] action structure, are translated into actions that the VM hands back to its container. Where @racket[spawn] creates a process that is a sibling of the acting process, an @racket[at-meta-level] @racket[spawn] creates a process that is a sibling of the VM itself.
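For instance, a process might request one sibling for itself and one for its containing VM within a single transition (a sketch; the italicized states are placeholders):

@#reader scribble/comment-reader
(racketblock
(transition @emph{some-state}
            ;; sibling of the acting process, inside this VM:
            (spawn #:child (transition @emph{worker-state}))
            ;; sibling of the VM itself, in the containing VM:
            (at-meta-level (spawn #:child (transition @emph{monitor-state}))))
)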
Similarly, @racket[quit] can be used with @racket[at-meta-level] to terminate the entire VM, and @racket[send-message] with @racket[at-meta-level] transmits a message on the VM's external network, not its internal one.

A meta-level @racket[add-endpoint] action requests the creation of an endpoint in the @emph{external} network. The VM translates the request into an action at the VM-as-process level that creates a relaying endpoint in the @emph{internal} network of the VM's own container. A record of the relaying "meta-endpoint" is placed in the "MetaEPs" set of the requesting process, so that when the relaying event handler fires, the event can be passed to the correct handler in the contained process. The relaying event handler level-shifts events to compensate for the level-shifting that took place when the meta-endpoint was established.

\paragraph*{The Ground VM.}
Virtual machines can only be stacked so far. At some point, they must connect to the outside world. Our "ground" VM implementation does just that. Its processes produce real-world output by judicious use of side-effecting Racket procedures, and await input by using ordinary subscription endpoints with topics describing Racket's core events. The ground VM is automatically started for applications written in the language. Programs written in other languages built on Racket can also make use of our system by explicitly invoking the @racket[ground-vm] procedure.

The ground VM monitors subscriptions involving CML-style event descriptors, interpreting their presence as demand for the corresponding events and translating them into concrete I/O requests. When underlying Racket events fire, the resulting values are sent as messages on the ground VM's internal network. There, they match the subscription topics that caused the event to be activated in the first place and are delivered to the corresponding endpoints.

Concretely, I/O subscription topic patterns are structured as a pair of a Racket event descriptor and a pattern matching the values the event yields upon firing. For example, the timer driver process asks for events when the system clock advances past a certain point as follows:

@#reader scribble/comment-reader
(racketblock
(endpoint #:subscriber (cons (timer-evt deadline) ?)
          ;; message handler:
          [(cons _ current-system-clock-value)
           (begin (display "Time's up!\n")
                  '())])
)

where @racket[deadline] is the time of the next pending event and the @racket[timer-evt] function maps such a deadline to an I/O event descriptor. In the subscription topic, the @racket[car] of the pair is the event descriptor of interest, and the @racket[cdr] is a wildcard. In the message-handling pattern, however, the @racket[car] is ignored, since it is simply the event descriptor subscribed to, and the @racket[cdr] is expected to be the current value of the system clock. Drivers for other devices construct analogous subscriptions.

The ground VM is in some ways similar to an @emph{exokernel} in that it exposes the underlying "hardware" I/O mechanisms in terms of its own communication interface. In other words, it multiplexes access to the underlying system without abstracting away from it.

\paragraph*{Other VM Implementations.}
We have chosen to build our VM implementations in a completely functional style. Our VM API is deliberately formulated to permit side-effect-free implementations. Nothing in the interface forces this choice, however.
It is both possible and useful to consider implementations that internally use imperative features to manage their process tables, that use Racket's concurrency and parallelism to improve scalability on multicore machines, that transparently distribute their contained processes across different machines in a LAN, and so on. Because the observable behaviour of a VM is independent of its implementation, changing the way in which an application scales may be as simple as switching one VM implementation for another. We hope to explore this territory in the future.

@section{Example}
\label{sec:example}

\begin{figure}[tb]
@verbatim{
$ telnet localhost 5999
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
You are user63.
user81 arrived.
hello
user63: hello
user81: hi user63
user81 departed.
}
\ruledcaption{Transcript of a session with the chat service.}
\label{example-transcript}
\end{figure}

To illustrate how the pieces of our system fit together, we analyze the source code for a hub-and-spoke style, TCP-based chat server. The code in this section is the entirety of the program. Clients connect to the server with @tt{telnet}. The server assigns a unique name, such as @tt{user63}, to each connecting client. The arrivals and departures of peers in the chatroom are announced to connected clients. Each line of text sent by a client is relayed to every connected client; figure~\ref{example-transcript} shows a transcript.

Our chat service has two layers, shown in figure~\ref{chat-service-layering}: a ground layer for the TCP driver and a nested VM for chats. The latter hosts one process for accepting incoming connections plus one process per accepted chat connection. Three types of conversation take place: \circled{1} between the network socket and its socket manager process; \circled{2} between the socket manager and its associated chat process; and \circled{3} the multi-party conversation between these chat processes. Note how each process engages in two distinct conversations simultaneously.

The server's entry point is a module written in the @racket[marketplace] language, which automatically starts the ground VM with the actions given in the module's body:

@#reader scribble/comment-reader
(racketblock
;; \ensuremath{\forall\State . \Action{\State}}
(nested-vm
 (at-meta-level
  (endpoint #:subscriber (tcp-channel ? (tcp-listener 5999) ?)
            #:observer
            #:conversation (tcp-channel them us _)
            #:on-presence (spawn #:child (chat-session them us)))))
)

This initial action spawns a @racket[nested-vm] to contain processes specific to our chat service. Initially, its only process is the primordial process, which takes on the role of listening for incoming connections. Recall that each VM has access to two IPC facilities: the external network of its container and the internal network for its own processes. The primordial @racket[endpoint] is wrapped in an @racket[at-meta-level] structure to indicate that it relates to activity in the VM's external network. Specifically, it is interested in observing, but not participating in, TCP conversations on local port number {\tt 5999}. It is this advertisement of interest that @emph{implicitly} coordinates with the TCP driver through the presence mechanism.

\begin{figure}[tb]
\centering
\includegraphics[width=6cm]{chat-revised.eps}
\ruledcaption{Layering and levels of discourse within the chat service.
Processes started automatically by the system are shaded.}
\label{chat-service-layering}
\end{figure}

The system's TCP driver responds to the appearance of this observer subscription by creating a listening TCP server socket. When a new TCP connection arrives, the TCP driver spawns a "socket manager" process (see figure~\ref{chat-service-layering}) to manage the new socket, and that process creates a subscription for discussing activity on the socket. The new subscription matches the one shown above in the listening endpoint. The VM detects the match and sends an @racket[#:on-presence] notification to the listening endpoint, which then spawns a process within the App VM whose initial state and actions are given by @racket[chat-session]:

@#reader scribble/comment-reader
(racketblock
;; \TcpAddress\Times\TcpAddress \RArr \Transition{\Stateless}
(define (chat-session them us)
  (define user (gensym 'user))
  (transition stateless
              (listen-to-user user them us)
              (speak-to-user user them us)))
)

The arguments @racket[them] and @racket[us], representing the new connection's remote and local TCP/IP endpoint addresses, are extracted from the topic of the conversation that the new peer, the TCP socket manager process, is willing to have with the chat session: a conversation about management of a specific TCP connection.

The initial actions requested by a newly-spawned @racket[chat-session] are produced by the routines @racket[listen-to-user] and @racket[speak-to-user]. The @racket[listen-to-user] function subscribes to incoming TCP data and converts it to messages describing speech acts, which it then publishes on the internal (nested) network:

@#reader scribble/comment-reader
(racketblock
;; \ensuremath{\forall\State. \Symbol\Times\TcpAddress\Times\TcpAddress\rightarrow\Actions{\State}}
(define (listen-to-user user them us)
  (list
   (endpoint #:publisher `(,user says ,?))
   (at-meta-level
    (endpoint #:subscriber (tcp-channel them us ?)
              #:on-absence (quit)
              [(tcp-channel _ _ (? bytes? text))
               (send-message `(,user says ,text))]))))
)

It is the @racket[#:subscriber] endpoint that starts the ongoing conversation with the TCP socket manager (marked \circled{2} in figure~\ref{chat-service-layering}). The use of @racket[at-meta-level] attaches the endpoint to the VM's @emph{external} network, where the domain of discourse is TCP. The @racket[#:publisher] endpoint, by contrast, attaches to the @emph{internal} network, where a higher-level chat-specific protocol is used, and advertises an intent to send chat messages of the form "$X$ says $Y$."

The presence mechanism appears for the second time in @racket[listen-to-user]. Its @racket[#:on-absence] notification handler responds to a drop in presence on the topic for the socket's inbound data stream. This happens when the TCP connection is closed by the remote @tt{telnet} process; the TCP socket manager process responds to termination of the TCP connection by @racket[quit]ting. All its subscriptions are thus deleted, causing matching absence notifications. In particular, the handler in @racket[listen-to-user] terminates the chat session process, which causes @emph{its} subscriptions to be deleted in turn. Thus, changes in presence cascade through the system along lines determined by the subscriptions of processes.
The @racket[speak-to-user] function sends a greeting to the user and then relays events from the internal network to the user via the connected TCP socket:

@#reader scribble/comment-reader
(racketblock
;; \ensuremath{\forall\State. \Symbol\Times\TcpAddress\Times\TcpAddress\rightarrow\Actions{\State}}
(define (speak-to-user user them us)
  \ensuremath{...\textrm{definitions of} say \textrm{and} announce...}
  (list
   (say "You are ~s.~n" user)
   (at-meta-level
    (endpoint #:publisher (tcp-channel us them ?)))
   (endpoint #:subscriber `(,? says ,?)
             #:conversation `(,who says ,_)
             #:on-presence (announce who 'arrived)
             #:on-absence (announce who 'departed)
             [`(,who says ,what) (say "~a: ~a" who what)])))
)
\ \\
@#reader scribble/comment-reader
(racketblock
;; \ensuremath{\forall\State. \String\Times\Any\Times...\rightarrow\Action{\State}}
(define (say fmt . args)
  (at-meta-level
   (send-message (tcp-channel us them (apply format fmt args)))))

;; \ensuremath{\forall\State. \Symbol\Times\Symbol\rightarrow\Action{\State}}
(define (announce who did-what)
  (unless (equal? who user)
    (say "~s ~s.~n" who did-what)))
)

Here we see presence used a third time. In @racket[listen-to-user], sessions advertise presence as a @emph{publisher} on the "$X$ says $Y$" topic. This ensures that @emph{subscribers} matching this topic are informed of the presence of each such publisher. Concretely, when the publisher endpoint is created, the @racket[#:on-presence] handlers in @racket[speak-to-user]'s subscriber endpoints in existing sessions are run. The subscriber endpoint in @racket[speak-to-user] responds to presence or absence by describing the change to the user.

In sum, a single connection is represented in the system by a three-party relationship: the remote peer, the TCP socket manager process, and the chat session process. The remote peer communicates with the system over TCP as usual (marked \circled{1} in figure~\ref{chat-service-layering}). The bytes it sends manifest themselves as Racket-level events on the ground VM's pub/sub network. The TCP socket manager translates between these low-level events and the high-level conversational representation of the connection used with the chat session process (\circled{2} in figure~\ref{chat-service-layering}). Each chat session process manages its half of the conversation with its corresponding TCP socket manager as part of its other responsibilities. In this case, it relays input from the remote peer as speech acts on the nested VM's pub/sub network. The other chat sessions within the nested VM, each one representing the application side of another TCP connection, subscribe to these relayed speech acts (\circled{3} in figure~\ref{chat-service-layering}) and format and deliver them to their remote peers.

@section{Evaluation}

According to Hudak and Sundaresh, a functional I/O system should provide support for (1) equational reasoning, (2) efficiency, (3) interactivity, (4) extensibility, and (5) handling of "anomalous situations," or errors. Broadening our focus to systems programming, we add (6) resource management and (7) subsystem encapsulation to this list of criteria. We have found that the individual elements of our approach work well together to address this complex of issues as a whole. Pub/sub subscriptions not only permit flexible communications topologies, but also give rise to presence information. Presence, in turn, allows resource management and crash notification, and interacts with our nestable VMs to provide encapsulation, isolation, and layering.
\paragraph*{1: Equational Reasoning.}
Like Worlds and Universes, our system allows for equational reasoning because event handlers are functional state transducers. When side-effects are absolutely required, they can be encapsulated in a process, limiting their scope, as in our SSH server. The state of the system as a whole can be partitioned into independent processes, allowing programmers to avoid global reasoning when designing and unit-testing their code.

\paragraph*{2: Efficiency.}
Our VM implementations manage both their own state and the state of their contained processes in a linear way. Hudak and Sundaresh, discussing their "stream" model of I/O, remark that the state of their kernel "is a single-threaded object, and so can be implemented efficiently". Our system shares this advantage with streams. There are no theoretical obstacles to providing more efficient and scalable implementations of our core abstractions. Siena and Hermes both use subscription and advertisement information to construct efficient routing @emph{trees}. Using a similar technique for implementing a virtual machine would permit scale-out of the corresponding layer without changing any code in the application processes.

\paragraph*{3: Interactivity.}
The term "interactivity" in this context relates to the ability of the system to interleave communication and computation with other actors in the system, and in particular to permit user actions to affect the evolution of the system. Our system naturally satisfies this requirement because all processes are concurrently-evolving, communicating entities.

\paragraph*{4: Extensibility.}
Our system is extensible in that our ground VM multiplexes raw Racket events without abstracting away from them. Hence, driver processes can be written for our system to adapt it to any I/O facilities that Racket offers in the future.

The collection of request and response types for the "stream" model given by Hudak and Sundaresh~\cite[\S 4.1]{Hudak1988} is static and non-extensible because their operating system is monolithic, with device drivers baked into the kernel. On the one hand, this monolithic design means that the possible communication failures are obvious from the set of device drivers available; on the other hand, its simplistic treatment of user-to-driver communication means that the system cannot express the kinds of failures that arise in microkernel or distributed systems. Put differently, a monolithic stream system is not suitable for a functional approach to systems programming.

Our action type (figure~\ref{vm-interface-types}) appears to block future extensions because it consists of a finite set of variants. This appearance is deceiving. Actions are merely the interface between a program and its VM; extensibility comes from the messages exchanged between a program and its peers. In other words, the Action type is similar to the limited set of core forms in the lambda calculus, the limited set of methods in HTTP, and the handful of core system calls in Unix: a finite kernel generating an infinite spectrum of possibilities. Protocols such as HTTP and Plan 9's @tt{9p} file-system protocol take similar approaches: they provide a small number of general-purpose actions which can express a wide variety of effects in combination.

\paragraph*{5: Errors.}
In distributed systems, a request can fail in two distinct ways.
Some "failures" are successful communications with a service that simply happens to fail at the requested task; others are caused by the unreachability of the requested service. Our system represents the former kind of failure via protocols capable of expressing error responses to requests. For the latter kind of failure, it uses absence notifications.

\paragraph*{6: Resource Management.}
Presence and absence notifications are also the basis for resource management in our system. Through the presence mechanism, programs can measure demand for some resource and allocate or release it in response.

Presence arises from considering the intersection of pub/sub topic filters, but using pub/sub has another benefit: it generalizes point-to-point, multicast, broadcast and even anycast communication, so that the same few primitive actions are able to express any point along this spectrum. The VM network is responsible for routing based on interest, decoupling the language for declaring interest from the semantics of routing.

\paragraph*{7: Subsystem Encapsulation and Isolation.}
Finally, our use of layered, nested VMs encapsulates and isolates subsystems in a complete program. Our use of a fixed API between a VM and its processes decouples the implementation of each layer's virtual machine from its content. We can therefore swap out one VM implementation for another without altering its processes.

Isolation of process groups is required in a pub/sub system to avoid potential crosstalk between logically separate groups of processes. In our system, VMs provide the necessary isolation. If we had chosen point-to-point communication instead, nesting would not be absolutely required; however, the use of pub/sub is a key advantage of our system, since it gives rise to presence. Presence can be combined with nesting to build supervision hierarchies that restart entire nested VM instances in response to failures.

@section{Conclusion}

We present a novel approach to functional systems programming, building on previous work on functional approaches to managing state and I/O. By incorporating multi-party communications and explicitly considering concurrency, our model factors out numerous cross-cutting concerns, including discovery, synchronization, failure detection, and state lifetime. The connection between a process and its container is declarative. Our model encourages the programmer to think declaratively, yet in concurrent rather than sequential terms, writing programs that react smoothly to changes in their environment.

Placing the combination of presence, nested virtualization, and event-based publish/subscribe communication at the heart of a system design eliminates a large amount of scattered application code that recurs across many different kinds of projects. As a result, programs become smaller and more robust, and programmers are freed to concentrate on the functionality of their applications. Integrating treatment of lost messages, congestion and queue management into our approach remains future work.

\paragraph*{Code.}
The source code for our system, examples, and case studies is available at \url{https://github.com/tonyg/marketplace}.