Chapter 12 The Moby run-time system

12.1 Run-time representations

This section describes the run-time representations of various kinds of Moby data values. We use C-like type and expression syntax to describe these representations, where the C type MobyWord_t is the unsigned integer type that is the same size as a pointer.

12.1.1 Function closures

Function closures are represented by a two-word record object: the first word holds the function's code addressand the second word holds the function's environment pointer. Note that the environment pointer is not always a pointer --- if the function's environment consists of a single value that has uniform representation, then the environment pointer will be that value, and if the environment is empty, then the environment pointer will be zero.

12.1.2 Objects and method suites

Objects are represented as a pair of the object's data pointer and the object's class ID (see Figure 12.1.

Figure 12.1: Object representation

The object's data pointer points to a record whose first field is a pointer to the object's method suite, and whose other fields contain the object's fields. Every object generated from a particular class shares the same method suite. The object's class ID is used as a key when dynamically searching for the field or method slot for a given label.

When executing code inside a method body, we do not need the class ID to locate slots, since the slot offsets are static w.r.t. the particular class. Therefore, we represent self by just the pointer to the object's data inside methods (and when we pass self to a method's body during dispatch).

12.1.3 Tagtypes

Tag types are implemented using a scheme described by Reppy and Riecke [RR96]. For each declared tagtype, there is a statically allocated descriptor with the following layout:

typedef struct MOBY_tagtype_desc *MOBY_tagtype_t; struct MOBY_tagtype_desc { MobyWord_t _depth; MOBY_tagtype_t _id[N]; char _name[]; };

where N is the depth value for the tagtype. The address of the tagtype's descriptor serves as its unique ID. The _id field of the descriptor holds the IDs of the ancestors of the tagtype, with X->_id[X->_depth-1] holding X, for any tagtype descriptor X. Following the ancestor IDs is a null-terminated string, which is used by the run-time system to report the name of exceptions.

We can test if a tagtype value matches the constructor X as follows:

((tag->_depth >= X.depth)(tag->_id[tag->_depth] == X))

where tag is the value's ID.

12.1.4 Arrays and vectors

We use a two-level representation of arrays, strings, and vectors.

12.2 Runtime system data structures

Figure 12.2: Per-task data structures

12.3 Garbage collection

We are using a modified version of Morrisett and Smith's Mostly-copying Collector (MCC) [SM97]. Although we have maintained much of the algorithm's structure, we have made liberal modifications to the source code.

In this section, we describe only the single-threaded version of the garbage collector. By single-threaded, we mean the version that works only for Moby programs that do not spawn additional threads. Both versions of the collector run in one thread (they are not parallel) and must run while all mutator threads are stopped (they are not incremental). The next section explains how the multi-threaded version is different. Because the multi-threaded version is less stable, we currently have two source trees, one for each version. In the future, we hope to merge them, turning off expensive synchronization operations for single-threaded programs.

MCC has conservative objects and record objects. (It also has arrays and large objects.) Both objects have 32-bit headers. Conservative objects' headers just record the size of the object; we do not know which of the fields hold pointer values. Record objects' headers have a bitmask --- a 1 means a field is definitely a pointer, a 0 means a field is definitely not a pointer. Unfortunately, this encoding means an object with just one ambiguous field must be given a conservative header, so all fields are considered ambiguous. So in the future, we intend to change the header encoding for record objects to use two bits per object field. Then one of the four bit patterns can indicate that a particular field is ambiguous.

The glue code to transfer control between the mutator and the collector and back is new and subtle, so we discuss it in detail. (It is also rather different for multi-threaded programs, so read the next section for that case.) The biggest difference from the original MCC is that the allocation pointer does not stay in a global variable.

At program initialization time, the Moby runtime calls gcInit. This function gets an initial heap, breaks it up into blocks, breaks the blocks up into pages, etc. Unlike the original MCC, it does not establish an allocation pointer. (In MCC terminology, the entire space for collectable objects is called the heap. It is broken up into contiguous-memory segments called blocks and the blocks are broken up into pages. Large objects can span pages, but not blocks. Small objects do not span pages. Pages can be used for mutator objects or for system information about other pages. Page size is a power of 2 (currently 8K for Moby) and pages are aligned on the same boundary.

gcNewAllocPtr returns a pointer to a fresh to-space page. The mutator is then responsible for allocating off this pointer. It should not let the pointer leave the page.

Each module's initialization code registers its statically allocated pointers by calling
MOBY_add_staticseg so that they will be scanned during garbage collection. The collector assumes the values in such a segment are unambiguous pointers, so they can be small values, but they cannot be integers that could appear to be pointers in the Moby heap. Therefore, the compiler segregates top-level bindings by pointerness. This segregation works because top-level bindings are monomorphic, but it may make functors harder to compile. We may want a third category for the segregation for ambiguous pointers.

Because we expect allocation performance to be crucial for Moby applications, the compiler reserves a register (edi on the IA-32) to hold the allocation pointer while Moby code is running. The Moby code does heap-limit checks to ensure that there is room for ensuing allocations on the current page. (The placement of these checks is currently broken because it does not take into account that function calls can perform an arbitrary amount of allocation. That's the compiler's problem.)

When a heap-limit check fails, the mutator calls MOBY_callgc, written in assembly using a special calling convention. Basically, the function stores all the registers on the stack and then calls the C routine MOBY_do_gc. It is not necessary to preserve all the registers; indeed, the multi-threaded version does something more sophisticated. In other words, this assembly code should be improved for the single-threaded version.

The C routine then moves the registers from the stack into a global data structure and transfers control to the MCC routine gcAllocSpace. (Again, all this stuff is cleaner in the multi-threaded version, but we refrain from cleaning up the single-threaded case for fear of introducing bugs.) This routine marks the rest of the old page as unused, grabs a new page, and returns a pointer onto this new page. Upon return, MOBY_callgc moves this pointer into register edi and restores all the other registers.

The slow path, of course, is if the fast path described above has been taken enough times that a full garbage collection is warranted. The biggest difference in collection between Moby and the original MCC is how the roots are scanned. Therefore, the scanRoots function is re-implemented in the Moby run-time.

scanRoots first crawls the stack and then scans the registered static segments. This order is very important. MCC assumes that all ambiguous pointers are scanned before any unambiguous pointers. Otherwise, we may move an object that must be pinned. The order works because our current information about the stack requires us to treat all live locations as ambiguous pointers. In the future, we might put information in the PC map to indicate that some slots were live, unambiguous pointers. Doing this naively will break the collector. After all, it would break the invariant that all ambiguous roots are scanned first. The simplest fix is probably to maintain a stack of unambiguous roots and do them after all stack walking is done. Another option is two stack walks.

Stack-crawling must know where to stop. We currently use a straightforward kludge: Moby code is called from C code, so we will reach this C code's activation frame before ``falling off'' the deep-end of the stack. We can detect the C code because Moby frames have return addresses that appear in the PC map. So once a return address lookup fails, we stop. A more robust solution is to record the deep-end of the Moby stack. The trick here is to get it exactly right when transferring control to Moby code; some hand-written assembly may be necessary. Also, a more complicated solution will be necessary if we allow C code to ``callback'' into Moby code.

When we are done with the collection, we put the allocation pointer on a fresh page and return it. Note that the multi-threaded version does not move any allocation pointers during collection.

Because Moby code keeps its allocation pointer in a register, it is inconvenient to have foreign code allocate off of this pointer. Therefore, the collector maintains another allocation pointer, called specialAllocPtr in a global variable. Calls to malloc-replacement functions, such as gcMalloc, use this pointer to do allocation. The collector just needs to put this pointer onto a fresh page after collection is complete.

12.4 Multithreading

Note: the multithreaded version works fine for single-threaded programs, so we encourage using it unless bugs are encountered or performance demands otherwise. (The degraded performance would involve managing some unneeded mutexes, maintaining some unneeded global state, and enduring some unneeded function call indirections.) In fact, this situation is the only one that has been tested. That is, we have not yet invoked multiple threads with enough allocation to trigger a collection.

Currently, each Moby thread runs in its own Posix thread (pthread). Some day we hope to multiplex multiple Moby threads in a pthread, but we ignore that complication here and just focus on the garbage-collection issues. (A major reason we temporarily abandoned this multiplexing is that the current Linux pthread implementation determines thread identity by masking the stack pointer. Newer versions have an implementation that does not rely on register esp.) Be warned that by default, these threads have only 2 megabyte stacks.

For each thread, there is a Moby context (moby_exec_ctxt_l_t) structure. All the contexts are kept in a doubly-linked list that is reachable through the global moby_runtime_info. Each thread also keeps a pointer to its own context in thread-local-data (using the key MobyCtxtKey). But wait, there's more. Concerned that accessing thread-local-data is too expensive, we also put a pointer to the context at the bottom of the page into which the context's allocation pointer currently points. So we can mask the allocation pointer to get the context pointer. (To be precise, we write a record object with a header indicating one non-pointer field and then the context pointer. It is a non-pointer in the sense that it does not point into the Moby heap.) When a thread gets a new page for allocating, the context pointer must be copied into that page.

Access to the global MOBY runtime information is guarded by a single, global lock. We need the lock only for module initialization, thread creation, thread destruction, and (full-blown) garbage collection, so we don't expect this to be a major bottleneck. However, when we support lighter-weight threads (much lighter than a pthread), we probably should strive not to need hold the global lock just to create and destruct such a lightweight thread.

We now describe the fields of a moby_exec_ctxt_l_t structure:

typedef struct moby_exec_ctxt_l {
  greg_t     saved_alloc_ptr; 
  stk_ptr_t  saved_stack_ptr;
  stk_ptr_t  saved_frame_ptr;
  code_ptr_t saved_ret_addr;
  ucontext_t * uctxt;
  int        gc_instigator; 

  moby_thd_status_t thd_status;
  pthread_t         pthread;   

  MOBY_closure_t *  closure;    

  struct moby_exec_ctxt_l * prev;
  struct moby_exec_ctxt_l * next;
} moby_exec_ctxt_l_t;

The prev and next fields establish a doubly-linked list. Code modifying this list should hold the global MOBY info lock.

The closure field initially holds a pointer to the closure that a thread first executes. This field is necessary because the closure lives in the Moby heap: If a garbage collection occurs between the time the thread is created and when it starts running, then the closure field must be treated as a root and updated by the collector. Once the collector detects that the Moby thread has started running, it assigns NULL to this field.

The pthread field is what you think it is. When the garbage-collection instigator stops the world, it does so by walking through the list of contexts and sending a signal to each pthread encountered, except for itself.

The gc_instigator field is set only for the thread that instigated the garbage collection. The stackwalk must treat this thread differently for a couple of reasons. First, we can find the callee-save registers and Moby stack portion in known stack offsets. Second, the uctxt field is not correct because we do not signal ourselves (see below). Now that I think about it, we don't need to maintain this state because the fragment

pthread_equal(pthread_self(),ctxt->pthread)

should suffice. (This fragment works because the collection runs in the instigator's thread.)

The thd_status field should be deleted. We need something like this field when we do our own scheduling, but the operating system currently does all of the scheduling for us.

The uctxt field points to space that is allocated during thread-creation and destroyed during thread-destruction. The signal handler for an interrupted-for-garbage-collection thread copies the uctxt provided to it by the system into this field. (The signal handler kludges are described in more detail below.) Then on restarting, we somehow use this information to perform a setcontext. (This function is currently unimplemented in Linux, but we have some ideas on workarounds.) Note that this field is not correct for the instigating thread.

The saved_* fields are written by the trampoline code that is used to let Moby code call C code. In the single-threaded case, Moby code can call C code directly, especially because Moby has strictly fewer callee-save registers (namely, none). Also, for the time being we might assume that the C code is not allocating into the Moby heap, so we are assured that a garbage collection will not occur while executing C code. (If it can occur, we probably need trampoline code because C code does not use MOBY_callgc.) But in the multi-threaded case, another thread could instigate a garbage collection. We use the saved information to determine the Moby portion of the stack and the allocation pointer. Then the stack portion above (shallower) than the Moby portion is treated conservatively.

The trampoline code is in the file MOBY-ccall.S. The compiler uses the multithreaded flag (-t) to decide to generate C function calls to go through this code instead of directly to C. Also, the multithreaded version of MOBY_callgc actually calls the trampoline code, and relies on its implementation. Specifically, it stores the callee-save registers (ebx and esi) on the stack. Later, the stack-walker for the instigator will use the saved_stack_ptr field to find these values. This trick avoids the need to scan the C portion of the instigator's stack.

Now we can describe how garbage collection occurs. First we describe the synchronization and signalling. Then we describe how the actual collection process is different from the signal-threaded case.

Recall that most calls to gcAllocSpace take the fast path of finishing up the old page, grabbing a free page, writing the context-pointer record into it, and returning a pointer. These last three steps constitute a critical section for two reasons. First, we cannot have concurrent access to the free list (for grabbing a free page). Second, we cannot have a full-blown collection occur before the context-pointer record is written and the allocation pointer updated. Hence the call to collect is also in this critical section. We protect gcNewAllocPtr with the same lock because it also manipulates the list of free pages. In the future, we intend to use a (faster?) spinlock for this code because we do not expect much contention.

Once a thread decides to do a full-blown collection, it is the instigator. No other thread can also be instigating because of mutual exclusion. To stop the world, the instigator sends a signal to every other thread. It then waits for all of the threads to receive the signal, save appropriate state in their context, and wait to be restarted. The synchronization to do this stopping stage is essentially the mutex encoding of a counting semaphore. The synchronization for restarting is essentially the inverse --- we set a start flag and broadcast to the other threads. (These other threads check the flag before stopping in case collection happens ``really fast.'') None of this synchronization has been tested, of course.

Note that the description above implies that signal handlers are performing synchronization operations. The pthreads implementation forbids doing so and uses scary words like ``deadlock'' and ``unpredictable'' to dissuade us. Therefore, we pull a dirty trick. After the handler copies the ucontext into its thread's context record,¹ it updates the system's ucontext's program counter so that the handler will return somewhere else, specifically the C function stop_for_gc.

Therefore, stop_for_gc must not take any arguments! Also, no function can ever have data at negative offsets from the stack pointer! We access the context pointer using thread-local data and do the synchronization described above. Upon restart, we use setcontext along with the saved ctxt field to restore things how they were. The memory management of uctxt here is quite subtle. Why don't we pass setcontext the actual field from the context record, not a copy? If setcontext frees this memory (it probably doesn't), the next collection will have a memory error. If setcontext does not free the memory, then we have a race condition: Another collection might occur before setcontext is done reading the uctxt field. When I think about what would happen as a result, my head feels fuzzy. The other option is to pass a copy of the uctxt to setcontext. But now we have a memory leak if setcontext does not free the memory. So we need to stack allocate this copy. Particularly subtle is that this structure is not flat; the pointer field must also refer to a stack-allocated object. But if setcontext does free the memory, we're hosed because it is stack-allocated. The bottom line is we need to ensure that passing a stack-allocated parameter to setcontext is the right thing to do.

Once the world is stopped, the actual garbage collection is much like the single-threaded case. Of course, we may have more than one stack to crawl and only one stack actually instigated the garbage collection. Starting a stack crawl is difficult because of pre-emption and is probably bug-ridden. There are five cases:

The thread is the instigator. The top of the Moby portion of the stack is in the context record. The callee-save registers are in a known position. The C portion of the stack contains no roots.
The thread is running Moby code. The context record is not accurate. We may be in an allocation sequence; if so we roll back the program counter because the already-initialized fields may have moved.
The thread is running C code. The context record is accurate. Treat the C portion of the stack conservatively; doing so ensures we trace the callee-save registers.
The thread has not started running yet. Trace the closure field only. The stack has no Moby roots.
The thread is in the trampoline code. How to find the necessary roots depends on the exact value of the program counter. This case is unimplemented.

The other main difference in the multi-threaded collector is that we do not adjust any allocation pointers. For this policy to be correct, it suffices to pin the pages into which allocation pointers point. So if there are n contexts, we will pin an additional n+1 pages (we still have the specialAllocPtr). The obvious disadvantage is less compaction. The advantage is that allocation pointers are not volatile; they are never changed by another thread. Changing them wouldn't be so bad if they always lived in register edi. But with foreign calls and concurrent run-time calls, it gets very difficult to always track where a context holds its ``true'' allocation pointer. For example, imagine if another thread is in gcAllocSpace when it is pre-empted. Of course, we could have larger critical sections instead. Pinning allocation-pointer pages turned out to be a simple solution and a pleasing interaction with the mostly-coping concept.

Note that by not moving allocation pointers, it would be fairly simple to avoid the need for rollback. All we need to do is trace the already-initialized fields as roots. However, the real advantage of rollback comes when we have multiple Moby threads per context, all sharing an allocation pointer. Here, rollback is a duty of the thread scheduler anyway.

There are two unfortunate properties of MCC that may prove problematic for us. The first is that conservative objects (and, by extension, objects with ambiguous pointer fields) are maintained in an explicit data structure so that collection only has to traverse the pointer graph once. (This distinction is the primary difference between MCC and Bartlett's collector.) Therefore, allocating such objects is likely to be too complicated to do with in-line Moby code. Also the accessing the explicit data structure must by synchronized. BUG: The current code does not synchronize access to the conservative objects queue. A better solution, though, is probably to have a separate queue per context, thus avoiding the need for synchronization.

The second is that part of determining if an ambiguous pointer is valid involves testing whether it points to the beginning of an object. To detect object beginnings, the collector must walk from the beginning of a page to the beginning of the object. Every ambiguous pointer that is a pointer suffers this walk. Two changes suggest that this overhead will be much worse for us than for the original MCC. The first is that we use a larger page size. The second is that we intend, at least at first, to consider all polymorphic fields to be ambiguous pointers. (MCC's original mutator created conservative objects only in C code.)

The MCC code base has some support for using bitmasks to determine object beginnings. (So for each word --- or double-word if objects are so aligned --- have one bit indicating whether it is the beginning of an object.) I think these are a good idea, but the MCC code indicates that this feature is currently broken. It's probably worth fixing and/or finding out why it wasn't used. Another option, of course, is to be more conservative and conclude an ambiguous pointer might be a pointer as soon as it points into the Moby heap with correct alignment. The disadvantage here is more false pointers.

1: We use pthread_getspecific for this operation; the pthread implementation suggests this operation is safe in a handler even though the documentation does not specifically permit it.