I’ve been eyeing SCons for years now as a promising alternative to GNU Make, and now that I’m using it first-hand in nohtyP I am quite impressed.  The killer feature?  SCons is smart enough to rebuild if I change CPPDEFINES.  To get the same functionality from Make, all makefiles need to be dependencies of every object file, which then means any makefile change will cause a complete recompilation.  This is early days, but so far SCons’ dependency tracking is exceptional.

There are a few things SCons does by default that I didn’t want in nohtyP, though.  Namely:

  • It expects SConstruct to reside in the top-level directory
  • It leaves intermediate files in the source directories
  • It doesn’t raise errors if undefined build variables are referenced

When someone downloads nohtyP, I want them to be greeted by a simple and sparse project directory that puts the two most-important files, nohtyP.c and nohtyP.h, front-and-center.  The default SCons behaviour is counter to this goal but, thankfully, it has full support for alternate configurations.  This article will share how I was able to configure SCons to accomplish this.

To start, I created Build\make.bat, containing simply:

@scons -C %~dp0.. -f Build\make.scons %*

-C %~dp0.. makes SCons change to the parent of the directory containing make.bat (ie nohtyP, the top-level project directory), while -f (which must be specified relative to the -C directory) tells SCons to read make.scons instead of SConstruct. So long as the user has scons in their PATH, the build can be easily executed from any directory by running this batch file.

The rest of the configuration is in Build\make.scons. (Giving this file an extension is the only way to have it recognized by the Windows 7 file indexer, which I’ve found to be invaluable for development.) The file starts with:

vars = Variables( None )
env = DefaultEnvironment( variables=vars )
vars_unknown = vars.UnknownVariables( ) # must be after env created
if vars_unknown: Exit( "Unknown vars: %r" % vars_unknown.keys( ) )
AllowSubstExceptions( )

This defines the accepted command-line build variables (of which there are none, currently) and creates SCons’ default build environment. (If you instead call Environment(), as the docs suggest, SCons will create a separate environment for use as a default, and each environment will search for installed tools.) It then throws an error if unknown variables were given on the command-line, and tells SCons to fail the build if variable substitution (ie $FOOBAR/nohtyP) references an unknown variable.

I found there are a few variables that SCons references by default that were causing errors because I didn’t need them. Rather than removing the AllowSubstExceptions() call, which would suppress substitution errors for all variables, I defined these variables to empty values.


VariantDir() can be used to ensure intermediate and target files are placed in a separate directory. It has some surprising behaviour, though: it sets up one directory to completely overlay and mirror another. Mirroring can be disabled using duplicate=False, but for the overlay to work it requires changes to how source files are referenced, which I’ll show in a moment.

import os  # needed for os.path.abspath below

env["OBJDIR"] = "#Build/scons"
env.VariantDir( "$OBJDIR", "#", duplicate=False )
env.SConsignFile( os.path.abspath( "Build/.sconsign" ) )

# refers to the top-level directory, as set by the -C option, so #Build/scons expands to nohtyP/Build/scons. The SConsignFile() call prevents SCons from writing its .sconsign.dblite file in the top-level directory, keeping it in Build instead.

With this in place, we can now define the rules for building nohtyP.dll:

env.Replace(
    CPPPATH=["", ],
    PDB="", LIBS="" )
env.SharedLibrary( "$OBJDIR/nohtyP", source="$OBJDIR/nohtyP.c" )

Here’s where the VariantDir() overlay makes things weird: instead of specifying #nohtyP.c as the source, we must instead give the path relative to the target of the overlay ($OBJDIR/nohtyP.c). If we don’t, then the object file will be written in the source tree, despite what we told VariantDir(). PDB and LIBS are other variables SCons references by default; I kept these separate from CPPFLAGS et al because I want to be reminded of them when adding other targets.

And that’s it! SCons is smart enough to find an available compiler and build with it using the appropriate options (another killer feature). It also knows that the shared library should be nohtyP.dll on Windows and libnohtyP.so on Linux. (You’ll notice I used '/' as the path separator, as it’s accepted on both systems.) While the SCons documentation wasn’t quite sufficient to piece all this together, it only took about a day of Googling to get this new build system up-and-running. I’m looking forward to the day I can say goodbye to Make once-and-for-all.

[PS: The current version of SCons, 2.3.0, doesn’t work well with any version of Visual Studio Express on 64-bit Windows machines. Patches for this have been submitted, so until a new release is made you’re best to install directly from the project source.]

When designing a new code base it’s a good idea to consider how certain operations can be avoided.  With nohtyP, I intend to optimize for one-use objects and data copying, and if I can reduce the number of individually-allocated buffers, then so much the better.

I’ve been working on the list implementation, and something occurred to me when I wrote this code:

ypObject *list_insert( ypObject *sq, yp_ssize_t i, ypObject *x )
{
    ypObject *result;

    // Resize if necessary
    if( ypTuple_ALLOCLEN( sq ) - ypTuple_LEN( sq ) < 1 ) {
        result = _ypTuple_resize( sq, ypTuple_LEN( sq ) + 1 );
        if( yp_isexceptionC( result ) ) return result;
    }

    // Make room at i and add x
    ypTuple_ELEMMOVE( sq, i+1, i );
    ypTuple_ARRAY( sq )[i] = yp_incref( x );
    ypTuple_LEN( sq ) += 1;
    return yp_None;
}

_ypTuple_resize calls realloc, which potentially copies data; ypTuple_ELEMMOVE also copies data.  Together, they have the potential to copy a large amount of data twice.  Ideally, we should skip the memcpy in realloc and instead customize how data is moved between old and new buffers.

Unfortunately, plain-vanilla C doesn’t allow for this: there’s no way to know, before calling realloc, whether it will move (and thus copy) the data.  One solution is to always call malloc instead of realloc, but then you can’t take advantage of realloc’s ability to resize in-place.  Certain heap implementations recognize this and allow in-place resizes to be attempted before malloc’ing a new buffer (Windows’ _expand, jemalloc’s ALLOCM_NO_MOVE flag).  This appears to be as good as it gets, although I have one niggling concern: this strategy requires up to two calls into the heap and, therefore, two synchronization calls.

An ideal version of “realloc” would, in one function call, either expand the block or allocate a new one, but never free the old one. This API would be tricky to use, because you’d have to remember to call free if, and only if, the returned pointer was different than what you supplied.
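The tricky caller-side contract can be sketched in plain C. This is only a toy model of the idea: grow_nofree and grow_int_array are made-up names, and since it is built on plain malloc it never actually resizes in place, whereas a real heap could attempt an in-place expand first.

```c
#include <stdlib.h>
#include <string.h>

/* Toy model of the "ideal realloc": one call that either expands the
 * block in place or allocates a new one, but never frees the old block
 * and never copies data.  Plain malloc can't expand in place, so this
 * sketch always allocates; a real heap could try _expand first. */
static void *grow_nofree( void *old, size_t newsize )
{
    (void) old;  /* a real heap would try to expand 'old' in place */
    return malloc( newsize );
}

/* Caller-side usage: interleave our own element moves, then free the
 * old block if, and only if, the pointer changed. */
static int *grow_int_array( int *arr, size_t oldlen, size_t newlen )
{
    int *newarr = grow_nofree( arr, newlen * sizeof( int ) );
    if( newarr == NULL ) return NULL;
    if( newarr != arr ) {
        memcpy( newarr, arr, oldlen * sizeof( int ) );  /* custom move would go here */
        free( arr );  /* freed only because the block moved */
    }
    return newarr;
}
```

The key point is the last branch: the container gets to decide how data moves between the old and new buffers (one copy instead of realloc’s memcpy plus ELEMMOVE), at the price of remembering to free conditionally.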

Despite how tricky this could be to use, and that no heap implementations currently employ this, it is the design I’m using for nohtyP’s internal memory allocations. The key word being “internal”: if this causes issues, it’s easy enough to change.

Python does strings and other containers better than C, but C is much better at crunching numbers efficiently.  Not safely, but efficiently.  It is important to retain this efficiency for embedded firmware: malloc’ing a new object every time you add two numbers together will quickly bog down the system.  To this end, nohtyP will have a number of options for working with numbers.

The first option will be to simply use raw C numbers and operations.  There are many cases where you would never expect to see overflow, so checking for it would be wasteful.  If you do want to check for overflow, there will be a set of “library” functions (yp_addL et al) made available that will check for overflow but will otherwise accept and return a C number.  Either way, the functions that most-commonly deal with integers (len, range, getitem for sequences) will have versions that work directly with C integers.
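An overflow-checked add in the spirit of yp_addL might look like the following. This is a sketch under assumptions: the real nohtyP signature reports errors through a ypObject** (as shown further down), while this standalone version uses a simple int flag, and yp_int_t is assumed to be 64-bit.

```c
#include <stdint.h>

typedef int64_t yp_int_t;  /* nohtyP's C integer type (assumed 64-bit) */

/* Sketch of an overflow-checked "library" add: accepts and returns C
 * numbers, setting *overflow instead of wrapping around silently. */
static yp_int_t addL_sketch( yp_int_t x, yp_int_t y, int *overflow )
{
    if( ( y > 0 && x > INT64_MAX - y ) ||
        ( y < 0 && x < INT64_MIN - y ) ) {
        *overflow = 1;
        return 0;
    }
    return x + y;
}
```

Note the check happens entirely in C, before the addition: signed overflow is undefined behaviour, so it can’t be detected after the fact.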

As an aside, to enable stringing-together multiple library function calls, such functions will accept a ypObject** argument that is set only on error.  You would pass in the address of a variable that has been initialized to yp_None, then inspect it after your series of functions.  If multiple errors occur you will only see the most-recent, but you will see an error.

ypObject *e = yp_None; // set to exception on error
yp_int_t val = yp_mulL( 10, yp_powL( a, 100, &e ), &e );
if( yp_isexceptionC( e ) ) abort( );

The next option will be to use a mutable number, an “intstore” or a “floatstore”, and update it in-place.  This is a deviation from Python, which enforces that numbers are strictly immutable.  However, it’s the only way to reduce object creation in common programming cases, such as keeping running totals as the values of a dictionary.  Such objects will support in-place operations using C numbers as the second operand: for example, yp_iaddC will add a C integer to the given object, with full overflow checking.
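The intstore idea reduces to a struct updated in place. The following is a toy model only: the type layout and the iaddC-style function are made-up stand-ins for whatever nohtyP ends up with, but they show why a running total needn’t allocate per operation.

```c
#include <limits.h>

/* Toy "intstore": a mutable int updated in place, with full overflow
 * checking, in the spirit of yp_iaddC (names and layout assumed). */
typedef struct { long long value; } intstore_t;

/* Returns 0 on success, -1 on overflow (value left untouched). */
static int intstore_iaddC( intstore_t *self, long long x )
{
    if( ( x > 0 && self->value > LLONG_MAX - x ) ||
        ( x < 0 && self->value < LLONG_MIN - x ) ) return -1;
    self->value += x;  /* no allocation: the same object is reused */
    return 0;
}
```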

Finally, there will be a set of functions that, as in Python, return a new object with every operation.  Care will have to be taken to properly discard intermediate results in this case.

Python 3.3 was recently released. One change was to make setdefault atomic; I believe there are other operations on dicts and sets that could be improved.

It’s common to use dicts and sets to handle recursion. For example, yp_deepfreeze will need to maintain a set of visited object ids to avoid descending into already-frozen objects. The only way to implement this currently is to do two hash lookups: one to see if the id is in the set, and another to add it. However, if sets had an addunique method that raised KeyError if the id was already in the set, this would become an atomic operation. A similar method for dicts, say setunique, would be useful when you’re avoiding recursion and need to store additional data.
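The addunique semantics can be illustrated with a toy C set of ids. Everything here is made up for the example (a fixed-capacity linear scan stands in for a real hash probe); the point is that detection and insertion happen in one pass, not a lookup followed by an add.

```c
#include <stddef.h>

/* Toy fixed-capacity set of ids illustrating the proposed addunique:
 * one traversal both detects "already present" and inserts. */
#define TOY_CAP 64
typedef struct { size_t len; long ids[TOY_CAP]; } idset_t;

/* Returns 0 if added, -1 if id was already present (the "KeyError"),
 * -2 if the toy set is full. */
static int idset_addunique( idset_t *s, long id )
{
    size_t i;
    for( i = 0; i < s->len; i++ ) {
        if( s->ids[i] == id ) return -1;  /* the lookup found it: error */
    }
    if( s->len >= TOY_CAP ) return -2;
    s->ids[s->len++] = id;  /* ...or the same pass ends in the insert */
    return 0;
}
```

In a real hash table the single probe sequence that fails to find the id lands exactly on the slot to insert into, which is what makes the operation naturally atomic.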

Python uses dicts to intern objects, but this is wasteful as the object is stored in both the key and value. Instead, sets should have a way to return a specific stored object. I propose addinterned, which will return the existing item from the set that is equal to x, otherwise it will add x and return it. In nohtyP, this will also help ensure interned objects do not become “immortal”, as you can just discard the set once you’re done interning. (Python uses weakrefs to achieve the same benefit, something I’m not planning just yet.)

Finally, a setexisting method that sets a key in a dict only if the key is already present, and otherwise raises a KeyError, could be useful to catch when a key is misspelled, for example. I’ve found that this type of check is useful in data mining to alert me that there’s a new type of data field I need to handle, and making this a contained atomic operation would speed up such activities.

[Update: Oct 22, 2012] My original setunique example was poorly-chosen, so replaced with an addunique example.

A tweet pointed me to a paper [PDF] that discussed the merits of roll-your-own storage allocators.  To summarize: it’s generally not worth it.

This got me thinking about how Python allocates memory.  At its core, Python uses the system’s malloc routine, so nothing fancy there.  However, several of the types have custom allocation schemes, the most popular possibly being “free lists”.  The idea here is to maintain a list of old objects to reuse for future allocations, in order to save a pair of free/malloc calls.  This seems like something that should be handled inside of malloc itself.

You’d think I’d be in favour of free lists, since I’ve stated that reducing calls to malloc will be important in nohtyP.  This paper made me realize that what I’m really trying to limit is data copying and one-use objects.  To this end:

  • intstores and floatstores: mutable ints/floats so you don’t need to create a new object on every operation
  • C-versions of routines: functions where some of the inputs and/or outputs are C types, for example yp_lenC which returns the length as a C integer
  • “N-versions” of routines: functions that take a variable number of objects directly, rather than creating a one-use tuple
  • Pre-allocation: avoiding multiple realloc calls by allocating the necessary memory once, at object creation
  • yp_freeze: to make an object immutable, in-place, without copying data
  • Immortals: objects statically allocated at compile-time

There is an inherent distrust of malloc among the optimizationally-minded like myself.  The reality is that decades of development have gone into tuning these algorithms, so we should stop second-guessing them!  …Unless we know they suck, but even then we have other options.

Most immutable Python objects are hashable, so nohtyP will dedicate space (ob_hash) in every ypObject to cache the hash; this space is ignored for mutable objects.  The immutable constructors and yp_freeze will automatically populate ob_hash.  However, it is impossible to calculate the hashes of immortals with the C compiler, so if ob_hash is -1 at runtime, the hash is calculated then cached for future calls.  Also, -1 will be used if the object does not support hashing; the hash calculation function will then be able to return the appropriate error.
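The caching scheme is a few lines of C. This is a sketch with assumed names (only ob_hash comes from the text; the object layout and djb2-style hash are illustrative), including the wrinkle that a genuine hash of -1 must be nudged aside to keep the sentinel unambiguous.

```c
#include <stddef.h>

typedef long yp_hash_t;

/* Sketch of the ob_hash cache: -1 means "not yet computed" (immortals
 * are compiled with -1, since the C compiler can't hash at build time;
 * -1 also flags unhashable objects, handled elsewhere). */
typedef struct { yp_hash_t ob_hash; size_t len; const char *data; } toyobj_t;

static yp_hash_t toy_currenthash( const toyobj_t *x )
{
    yp_hash_t h = 5381;  /* djb2, purely illustrative */
    size_t i;
    for( i = 0; i < x->len; i++ ) h = h * 33 + (unsigned char) x->data[i];
    if( h == -1 ) h = -2;  /* keep -1 reserved as the sentinel */
    return h;
}

static yp_hash_t toy_cachedhash( toyobj_t *x )
{
    if( x->ob_hash == -1 ) x->ob_hash = toy_currenthash( x );  /* first call computes */
    return x->ob_hash;  /* subsequent calls hit the cache */
}
```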

So far so good, but some digging into Python shows that using hash() to indicate immutability has some problems:

  • “x in otherset” requires the hash of x in order to find it in otherset.  Because Python doesn’t allow a hash to be calculated if x is mutable, a frozenset copy of x is created and destroyed behind the scenes.  If x is large, this is quite inefficient!
  • The documentation states that user-defined classes are hashable by default, as a convenience to programmers.  However, unless a custom class sets __hash__ to None, its instances will have a hash value, even if they are mutable.

Really, the hash has more to do with the value of the object than its mutability.  The requirement of immutability only comes into play when storing objects in sets/dicts, as the hash is cached in the hash table to improve lookup speed.  It should be possible to compute the current hash, based on the current value, in contexts where the object is not being stored.  To this end, nohtyP will have a yp_currenthashC function, applicable for immutable objects.

The test for mutability will instead use the type code.  Recall that types will come in mutable/immutable pairs, so the least-significant bit of the type code will indicate if the type is mutable or not.
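The pairing convention makes the mutability test a single bit operation. The code values below are made up for illustration; only the low-bit convention comes from the text.

```c
/* Illustration of mutable/immutable pairs sharing all type-code bits
 * except the least-significant one (code values are invented). */
enum {
    ypTuple_CODE     = 0x10, ypList_CODE = 0x11,  /* immutable even, mutable odd */
    ypFrozenSet_CODE = 0x12, ypSet_CODE  = 0x13
};
#define TYPE_IS_MUTABLE( code )  ( ( (code) & 1 ) != 0 )
#define TYPE_PAIR( code )        ( (code) >> 1 )  /* same value for both halves of a pair */
```

A nice side effect: “are these two types the same pair?” (tuple vs list, say) is also a single shift-and-compare, which is handy for the shared-implementation scheme described below.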

The initial, proof-of-concept implementation of sets in Python used dict in the back-end and simply ignored the associated values.  When it came time to formalize this new type, a copy of the dict implementation was made, then modified to remove all notion of values.  I think this duplication of code can be avoided, both in Python and in nohtyP.

Although dicts were part of Python long before sets, you can think of dicts as being sets where each element has an associated value.  If you can find the index of an element in a hash table, you can index into a separate array table to retrieve the value.  Thus, dicts and sets can share the same hash table code.
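Concretely, the shared code returns a slot index, and only the dict touches a parallel values array. The toy below uses linear probing for brevity (unlike CPython’s probe sequence) and ignores resizing; every name in it is invented for the example.

```c
#include <stddef.h>
#include <string.h>

/* Sketch of one hash table shared by set and dict: the probe yields a
 * slot index into the key table; a dict reads values[] at that index. */
#define SLOTS 8  /* power of two, toy-sized, no resizing */
typedef struct {
    const char *keys[SLOTS];   /* the "set" part */
    int         values[SLOTS]; /* dict-only parallel array */
} toydict_t;

static size_t toy_hash( const char *k )
{
    size_t h = 5381;
    while( *k ) h = h * 33 + (unsigned char) *k++;
    return h;
}

/* Shared lookup: returns the slot holding key, or the empty slot to
 * insert into.  A set and a dict would use this code unchanged. */
static size_t toy_slot( toydict_t *d, const char *k )
{
    size_t i = toy_hash( k ) & ( SLOTS - 1 );
    while( d->keys[i] != NULL && strcmp( d->keys[i], k ) != 0 ) {
        i = ( i + 1 ) & ( SLOTS - 1 );  /* linear probe, toy only */
    }
    return i;
}

static void toy_setitem( toydict_t *d, const char *k, int v )
{
    size_t i = toy_slot( d, k );
    d->keys[i] = k;
    d->values[i] = v;  /* the only dict-specific step */
}

static int toy_getitem( toydict_t *d, const char *k )
{
    return d->values[toy_slot( d, k )];
}
```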

It occurred to me that this could potentially slow down retrieving the value: in large dicts, the hash and value tables would not be cached together, so retrieving the value would be a cache miss.  However, Python’s hash table code does not use linear probing; instead, it bounces around the entire width of the table.  Removing the values actually increases the density of hashes, allowing more to be cached at once and, thus, improving look-ups.  You inspect many hash entries while finding the key, but you only retrieve the value once.

As a side note, nohtyP will give sets an addunique method (setunique for dicts) that will add the element only if it doesn’t already exist, raising an error otherwise.  It is common in Python that, to avoid recursion, you track visited objects in a set like so:

  if id( obj ) in memo: return
  memo.add( id( obj ) )
  # process the object

addunique would follow the EAFP principle and turn the two hash lookups in the above code into one.  Like setdefault, it would be a useful addition to the library.

[Update: Oct 6, 2012] The just-released Python 3.3 includes PEP 412, which describes a separation of key and value tables similar to what I describe above.  (I recall reading this PEP before; it likely formed the basis for this idea.)  Combining these together leads me to an interesting thought: what if the key table for dicts was a set object?

In the early days of Python, mutability was an important property of each type.  ints and strs were immutable, and dicts mutable.  GvR asserted that tuples were not read-only lists, and to underscore that point some useful list methods (ie index, count) were missing for tuples.

Now we have bytes and bytearrays, frozensets and sets, tuples supporting all the appropriate list methods, strs that can be modified from the C API, and I believe there are some read-only dict instances poking around internally.  Nearly every type comes in an immutable form.

I’m planning on exploiting this trend to simplify the implementation of nohtyP.  Most types will be implemented as a pair of mutable (unfrozen) and immutable (frozen) types.  For example, tuples and lists will share the same implementation, differing only in that the mutating methods will be missing from tuple’s method table.

This opens up an interesting possibility. In the C API, you initialize a tuple to a certain NULL-filled size then fill in its contents, breaking the “immutable” abstraction. In Python, you build a list then use it to initialize a tuple, introducing an extra malloc/free; alternatively, you use a generator, which C doesn’t natively support.  In nohtyP, you will create a list, then transmute it into a tuple, in-place, using the one-way yp_freeze function.

These types of ideas, freezing and transmuting, have been considered before.  PEP 351 proposed an object method that would return a new, immutable version of itself; in-place freezing helps avoid those extra malloc/free calls.  While not recommended, Python supports transmutation by assigning to __class__; by sharing implementations between pairs, transmutation becomes much easier.

Still, is this a good idea?  For the answer, I invoke The Zen of Python:

Simple is better than complex.
Complex is better than complicated.
Special cases aren’t special enough to break the rules.
Although practicality beats purity.

Reference counting in Python’s C API can be tricky business, as you need to remember which inputs are stolen or borrowed, and which outputs are borrowed or new.  To simplify this situation, in nohtyP most inputs will be borrowed, and most outputs will be new.

However, I’m considering optimizing for a particular case.  Often, functions will have many intermediate steps on the way to a final value, and any one of those steps could cause an exception.  As such, you need to check for exceptions every step of the way, and potentially discard the unfinished object.  It would be handy to avoid those checks.

My idea is that functions that modify their inputs will steal a reference to the modified object.  When an exception occurs the reference is discarded, otherwise it is returned as a “new” reference by the function. Subsequent functions will immediately return any exception object they are given as input, so the output need not be inspected between function calls.

ypObject *list = yp_list_new( );
int i;
for( i=0; i<5; i++ ) {
  list = yp_append( list, yp_True );
}
if( yp_isexceptionC( list ) ) abort( );
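The steal-and-pass-through convention can be modeled with a self-contained toy. Nothing here is the real nohtyP implementation: the object layout, toy_append, and the capacity-3 failure are all stand-ins chosen to make the propagation visible.

```c
/* Toy model of "steal the modified input, pass exceptions through":
 * an exception fed in as the list is returned untouched, so a chain
 * of appends needs only one check at the end. */
typedef struct { int is_exception; int len; } ypToy;

static ypToy toy_MemoryError = { 1, 0 };

static int toy_isexception( const ypToy *x ) { return x->is_exception; }

/* Steals its reference to list; returns the grown list, or an
 * exception (this toy fails once the list reaches capacity 3). */
static ypToy *toy_append( ypToy *list )
{
    if( toy_isexception( list ) ) return list;  /* propagate, do nothing */
    if( list->len >= 3 ) return &toy_MemoryError;  /* stolen ref discarded here */
    list->len += 1;
    return list;
}
```

Run five appends back-to-back and the fourth fails; the fifth sees the exception and simply hands it along, which is exactly why the loop body needs no per-iteration check.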

If this behaviour isn’t desirable, you’d have options. If you didn’t want the original object modified, you could pass in a copy (yp_copy). If you didn’t want to lose the object on error, you could increment the reference count (yp_incref), optionally check for errors, then discard the return (yp_decref).

[Update: Sep 12, 2012] It occurred to me that a better alternative for modifying objects in-place is to pass in a ypObject**, like so:

yp_append( &list, yp_True );

This has the advantage of fewer keystrokes, and the ‘&’ character gives a reminder that this argument is stolen, not borrowed.

Without any kind of try/except functionality in C, it can be tedious to check every function call for error conditions.  A system where we could check for errors at the end of a function, or ignore them entirely, would be ideal, and if such a system could avoid relying on a platform-specific thread-local storage function, so much the better.

My thinking is this.  In C, error codes and valid values are both outputs from functions.  Outputs from one function are usually fed into the next function, stringing these calls together until the final value is obtained.  If all functions immediately returned any exception given as input, it would behave similar to how a try/except statement stops execution on the first exception.

(Incidentally, I’m writing this blog as a brain-dump of my ideas, but I am surely not the first person to consider these methods.  I’d be quite interested to hear about any other projects that have used similar methods.)

My current plan for nohtyP is to make exceptions immutable, immortal objects.  There will be a finite set of exception objects, and the only piece of data they will contain is the type of exception that occurred.  Every function will be declared to return an object, so that they can return an exception if need be; those that don’t otherwise need a return value will just use yp_None, also an immortal.  Being immortal, exceptions don’t need to be yp_decref’d, allowing them to co-exist with functions that return either borrowed or new references.

The end result means the following code will correctly return an exception if yp_join fails:

ypObject *FormatList( ypObject *list ) {
  yp_CONST_BYTES( sep, ", " );
  yp_CONST_BYTES( fmt, "({0})\n" );
  ypObject *sepList = yp_join( sep, list );
  ypObject *result = yp_format( fmt, sepList );
  yp_decref( sepList );
  return result;
}