sfsexp

the small, fast s-expression library


Written by Matthew Sottile (matt@galois.com)

Latest version: 1.2.1 (June 1, 2010)
[Return to Project Page]

Table of Contents
  1. Overview
  2. Use Cases
  3. Implementation details
  4. API documentation
  5. Questions


1. Overview

This library is intended for developers who wish to manipulate (read, parse, modify, and create) symbolic expressions from C or C++ programs. A symbolic expression, or s-expression, is essentially a LISP-like expression such as (a (b c)). S-expressions are able to represent complex, structured data without requiring additional meta-data describing the structure. They are recursively defined: an s-expression is a list of either atoms or s-expressions. In the example above, the expression contains an atom "a" and an s-expression, which in turn contains two atoms, "b" and "c". They are simple, useful, and well understood.

This library is intended to be a minimal set of functions and structures for the four functions listed above: reading s-expressions (I/O), parsing strings containing them into an AST equivalent, modifying the AST representation, and converting the AST back into a well formatted string. The primary goals are efficiency and simplicity. This library forms the basis of the data representation and transmission protocol for the supermon high-speed cluster monitoring system from the LANL ACL. The usefulness and lack of choice in available, open source s-expression libraries motivated the independent (from supermon) release of this library. Although the number of potential users represents a rather small community, the author felt it was a valuable contribution. As of March 2005, this library has actually received more interest in terms of downloads and page views than it's parent project!

This work was funded by the Department of Energy, Office of Science. The original development was performed at the Los Alamos National Laboratory between 2002 and 2007. It is currently maintained independently of any funding source as a service to the community (plus, it's fun to work on).

1.1. Revision Information

Starting with version 1.2 (October 2007), I will provide documentation of the most recent revisions here. Detailed notes are available from the file releases and news items, but the most important items will be highlighted here.

Version 1.2 -> 1.2.1 (June 1, 2010)

Version 1.1 -> 1.2 (October 2007)

[Return to top]

2. Use Cases

The two most appropriate and common uses of this library are for data structure representation and network protocols. It's a trivial step from a data structure (C struct, C++ class, or F95 user-defined type) to s-expression representation. For example, the following are very easy to represent as s-expressions: Quite often one will want to have a way to save and restore structures to and from disk, or transmit a data structure over a network between two applications. This naturally leads to network protocols where structurally non-trivial data is passed back and forth. The s-expression allows data to be stored internally in a format that mimics the original data structure, and transmitted in an efficiently parsed and unparsed format. Furthermore, this library extends the basic s-expression to allow inlined binary data, removing the need to encode binary data (such as images) before sending. Such binary payloads are easily encapsulated in the s-expression format without marshalling to ASCII or other non-binary formats.

Furthermore, data structures sent over the wire as s-expressions are not prone to differences in architecture. Unlike systems that send pure binary and are based on the assumption that structures are laid out consistently between languages and platforms, the s-expression is portable with minimal overhead to achive this portable representation. There is an innevitable trade-off in byte efficiency with s-expressions versus raw-binary, but unlike other representations (such as XML) it is mininal. Parsing is also simpler than XML or other 'document markup languages'. Also note that portability goes out the window with inlined binary, at least in terms of byte ordering.

Example uses of this library that have been made available to the author include:

Feel free to send your applications along to include here!

[Return to top]

3. Implementation details

An s-expression is an expression composed of elements that are either atoms or s-expressions. For example, the following are s-expressions containing two elements:
    (a b) 
    (a (b c))
    ((a b) (c d))
    (() ())
In the first case, the s-expression contains two atoms, the strings "a" and "b". The second example is contains an atom and another s-expression (which contains two atoms). The third example is composed of two other s-expressions. The final contains two empty s-expressions. One possible representation of these expressions is as nested lists, where a single element of a list is captured in the structure:
    struct elt {
      int type;
      char *val;
      struct elt *list; 
      struct elt *next;
    };
Since an element can be either a list or atom, the element structure has a type indicator that can be either LIST or VALUE. If the type indicator is LIST, then the structure member "list" will be a pointer to the head of the list represented by this element. If the type indicator is VALUE, then the structure member "val" will contain the atom represented by the element as a string. In both cases, the "next" pointer will point at the next element of the current s-expression. The following rough illustration shows how the expression (a (b c)) is translated to the structures above.
+----+------+-------+
|LIST|list=o|next=o--->NULL
+----+-----|+-------+
           |
   +-------+
   |
   v
+-----+-------+-------+  +----+------+-------+
|VALUE|val="a"|next=o--->|LIST|list=o|next=o--->NULL
+-----+-------+-------+  +----+-----|+-------+
                                    |
                         +----------+
                         |
                         v
                     +-----+-------+-------+  +-----+-------+-------+
                     +VALUE|val="b"|next=o--->|VALUE|val="c"|next=o--->NULL
                     +-----+-------+-------+  +-----+-------+-------+
(note: o---> represents a pointer) The actual structures are slightly more complex to deal with variable size values in atoms, but those fields aren't necessary to express the basic idea of the structures.

New! (12/20/04): A new function, sexp_to_dotfile() has been added to allow visualization of the sexp_t structures that represent the s-expressions using Graphviz. This is very useful for debugging. To further illustrate how the structure of an s-expression is represented by the library, the example available here shows the structure corresponding with the following:

;
; example input for sexpvis sample code
;
(
  (this (is a nested) list)
  (and this "has a dquote" in it)
  '(squote here)
)
[Return to top]

4. API documentation

All of the API docs are now produced by DOxygen in two formats:

3.1. Ruby language bindings

The ruby bindings are available in the src/ruby subdirectory of the source distribution. The following is a transcript of an interactive ruby session using the s-expression library:

irb(main):001:0> require "Sexp";
irb(main):002:0* s = Sexp.new("(simple s-expression (int 42) (float 22.3))");
irb(main):003:0* s.getStr
"(simple s-expression (int 42) (float 22.3))"
irb(main):004:0> sary = s.getAry
["simple", "s-expression", ["int", 42], ["float", 22.3]]
irb(main):005:0> sary[3][1] = 33.2;
irb(main):006:0* sary[3][0] = "newfloat";
irb(main):007:0* sary
["simple", "s-expression", ["int", 42], ["newfloat", 33.2]]
irb(main):008:0> s.setAry(sary);
irb(main):009:0* s.unparse;
irb(main):010:0* s.getStr
"(simple s-expression (int 42) (newfloat 33.200000))"
irb(main):011:0> 

[Return to top]

5. Questions

Why is it called "Smaller"?
Well, when I initially started looking for a way to manipulate s-expressions from C, I found some libraries that existed. Unfortunately, they were big. All I wanted was the ability to take a string representation of an s-expression, turn it into a linked data structure that is easy to walk and modify in C, and then turn that back into a string. So, this library lets you do exactly that. It's a bit bigger than before, but for most purposes a developer will only use three or four of the routines it provides - parse, print, destroy, and find. Others exist for different types of parsing or operations on the linked sexp_t data structure, but they're not necessary for all users.

What is a continuation?
Many definitions probably exist out there for this, and most people have likely seen continuations in languages like Scheme, ML, or others that have a little function called callcc. The idea is that a continuation represents state - in our case, the continuation contains the state of the parser when it exits. For example, if you have a very large s-expression being read off of a socket, you might not get the entire string in a single read call. Now, you can either accumulate the string into a buffer until it is complete, or you can tell the parser to start parsing the incomplete string. The trick is, when the next string-chunk comes in, you want to tell the parser "here's more data, and pick up where you left off." This involves restoring the stack the parser uses to put the linked structures onto while it works, restore the proper state (IE: was it parsing a simple atom or a quoted string?), and do other things to get it going again. Once the continuation has restored the state, the parser picks up and acts just like it would as if the whole s-expression string had been passed in at once. Clear? If not, you *probably* don't need to use this feature. In our case, we wanted to have multiple sockets all providing partial s-expressions, so the parser had to 1) not have any global state and 2) be able to pick up where each socket left off on each incomplete string. In addition, it had to be able to do this with minimal performance loss. So, we have a continuation based parser that takes a continuation and returns one, and the contents of the structure tell you about the state of the parser and the string being parsed.

Why would I use this library instead of just using LISP/Scheme/Emacs/etc.?
A few reasons - it's small, it has a tiny API, and it is fast. It isn't a language interpreter. It is just a way of efficiently manipulating LISP like expressions in a language like C. How many times have you wished for a cons operator in C? Or maybe a way of having lists of lists and atoms that was easy to deal with? Well, s-expressions are good for that. How about structured data? A tree in s-expression land is trivial to represent, as are more complex structures. In many ways, the structure of an s-expression might look close to XML, with parentheses in place of angle-bracketed tags. If you have an application that wants to pass around, parse, and manipulate complex data structures in a platform independent (non-binary) way, s-expressions are the easy way to do it. And again, they're fast. Oh yes, since you can create valid s-expressions with this library, there's no reason why you can't just feed them to a program in LISP or Emacs and start using them from there. It might involve a strategically placed setq or the like, but it's easy.

Why not XML?
The goal of any project should be to solve a problem in an elegant, well designed manner. Buzzword compliance rarely achieves this, other than allowing a lowly programmer or software engineer to appease higher-ups who read that "XML is the latest great thing!" XML suffers from certain flaws that make it worthless in an environment where speed and efficiency (memory and time complexity) are key requirements. Compare the performance, footprint, and developer complexity of Xerces and this s-expression library, and you'll see that for hierarchical data representation, one is significantly simpler than the other without losing the ability to do its job (represent data!). After working with XML for a while (~3-4 years) on multiple projects, I find that since starting to use s-expressions again as a data format has allowed me to spend less time debugging my data representation code and more time on problems that matter.

How fast is this library?
Updated 11/5/02: I've been meaning to update the answer to this question for a while. The library has undergone many optimizations since continuations were added, including self-contained memory management to avoid excessive malloc/free calls. The while loop of the parser now really does examine each character just once. Testing using the {c}torture.c tester included with the package yielded these results, which I give for each version released for comparison.

VersionAverage time(5 runs)
0.1.11.61s
0.1.21.12s
0.2.01.07s
0.2.11.09s
0.2.21.09s
0.2.31.11s
0.3.00.42s
0.3.10.70s

* Expression contained 380 elements, roughly equivalent to a full data set sample of 4 nodes using supermon

Each iteration of the test would parse the expression, unparse the AST back into a string and destroy the AST. This was performed 1,000 times and the required time averaged over 5 runs is shown above. Version 0.3.0 has every optimization thus far applied, although the one that proved most beneficial was the efficient memory management addition. Version 0.3.1 is slightly slower due to memory allocation operations that do NOT take advantage of the faster memory management but were added to remove the need for fixed-length buffers in sexp_t elements.

So why is this project on sourceforge?
To tell the truth, it's not really necessary to have it up here but I did so for two big reasons:

  1. I wanted to learn how to use sourceforge before putting a really big, complicated project up on it.
  2. The library is useful for more than just supermon - so if anyone wants to play with it, they can get the standalone version.

So there it is.

Why did you bother writing this?
This is one of those annoying questions that pops up occasionally. A quick search on google yields only one other obvious candidate: Rivest's SEXP Library. A quick look at the example code and guide to using this library shows that his library is far more complex due to it's purpose in cryptographic applications. All other references that I found for libraries to deal with s-expressions from C or C++ involve embedding GUILE or some other scheme/LISP interpreter in the application. For most purposes, this is overkill. In other languages, solutions exist. For example, in Java, the W3C web pages have references to a java package for dealing with s-expressions. In LISP or Scheme, the answer is obvious. In ML, a basic translation of LISP to ML should work fine since ML lists act in many ways like s-expressions. So no, this library isn't a waste of time in my opinion. Besides, how many times have you found many implementations of things like arbitrary precision arithmetic, window managers, GUI toolkits, etc... Each has its place and purpose and was written for a real reason. A final reason is that it is doubtful that many (if any) other s-expression parsers support continuations or some other form of partial parsing to be implemented. This is exceptionally useful when parsing off of a socket stream where the programmer doesn't have the space or desire to wait until the entire expression has been received before starting the parsing process.

Why is an unquoted word not a valid atom on its own?
The following are atoms that do not exist within parenthesis:

  1. "example one"
  2. '(example two)
  3. example_three

The first and second are valid and will be returned if those string are passed into the parser. Furthermore, they can be parsed and unparsed repeatedly and remain valid and unchanged over a number of iterations. The third example will ONLY parse if the last character of the string is whitespace or a carriage return. This is the only way to cue the parser into knowing that the string is in fact complete, and not to wait to see if the next call with the current continuation will introduce more characters to parse. Unfortunately, since whitespace and carriage returns are not considered part of the atom itself, when an atom is parsed that is followed by those characters, they do not appear in the unparsed version of that atom! This means that although it is possible to parse an expression such as:

simple_atom\n

Unparsing it will result in:

simple_atom

Which will not parse to the same result as the first. This is valid behaviour due to the semantics of the continuation based parser.

My code crashes in the parser with a core dump, segementation fault, assertion, or mysterious parser error message.
First, make sure you're passing a valid s-expression in. If this is the case (or you believe it to be the case), send me the s-expression and the code (or some code) that can reproduce the error. A note saying "the parser died with such-and-such error message" is of little use in tracking the problem. I'm not psychic! :)

[Return to top]


SourceForge Logo