Chapter 10 Code generation

10.1 Calling conventions

10.1.1 Classifying functions

Functions are grouped into clusters, which are functions at the same lexical level that can share the same closure and stack frame. We classify functions based on an analysis of their call sites.

ESCAPING: the function escapes and, thus, may have unknown call sites.
KNOWN: all call sites for the function are known.
TAIL: all call sites for the function are known tail calls.
LABEL: all call sites for the function are tail calls from the function's cluster.

We define an entry to a cluster is a function that is not a LABEL; i.e., it is called from outside the cluster.

10.1.2 Classifying call sites

10.1.3 Choosing a calling convention

The calling convention used at a call site depends on the target fucntion's classification, as well as the call-site's classification. The following table describes the analysis:

	Call-site class
Function class	Non-tail call	Tail call
ESCAPING	Standard call	Standard tail-call
KNOWN	Known call	Known tail-call
TAIL	n.a.
LABEL	n.a.	Goto

10.1.4 Notation

We use a simple C-like syntax for describing the calling conventions. The calling conventions are described in terms of a stack-based argument passing model; in practice some architecture-specific prefix of the arguments will be passed in registers. We may also use dedicated registers for the implicit parameters (e.g., the exception handler continuation and the environement pointer). In the discussion below, we assume that we are calling a function with the following signature:

val f : (t₁, ..., t_n) -> (s₁, ..., s_m)

In the code below, we use f to refer to the callee's closure. A function closure is a two-word record consisting of a code pointer (refered to by cp) and an environment pointer (refered to by ep). We use sp to refer to the stack pointer, fp to refer to the frame pointer, exh to refer to the current exception handler continuation, arg[i] to refer to the ith argument, and res[i] to refer to the ith result.

10.1.5 Escaping functions

For calls to escaping functions, the arguments and results are passed in uniform representation (i.e., one word per value). There are two additional implicit arguments: the exception handler continuation and the function's environment pointer. We let p = max(n+2, m) be the size of the argument/result area.

Upon function entry, an escaping function first saves the previous function's fp register and sets the fp to point to the location of the saved fp. It then allocates the stack frame. Figure 10.1 shows the stack layout.

Figure 10.1: The standard stack-frame layout

Standard call

When n+2 < m, we allocate k = m-(n+2) extra stack space in the calling convention.

cp = f[0] ep = f[1] *--sp = arg[1] ... *--sp = arg[n] *--sp = exh *--sp = ep sp -= k call *cp res[1] = sp[m-1] ... res[m] = sp[0] sp += m

When n+2 ³ m, let k = n+2-m

cp = f[0] ep = f[1] *--sp = arg[1] ... *--sp = arg[n] *--sp = exh *--sp = ep call *cp res[1] = sp[n+2] ... res[m] = sp[k] sp += n+2

Note that when f is a known function, we can call directly to f's label and we do not need cp.

Standard tail-call

When a standard function call is made from a tail position, we need to discard the caller's stack frame prior to transfering control to the callee. We also need to setup the call so that the callee returns to the caller's caller. Let l be the argument arity of the caller (not including any implicit arguments). Note that Moby's type system guarantees that the result arity of the tail call will be m.

The simplest case is when l = n and we do not have to move the return PC. Furthermore, since the call is a tail call, the exception handler continuation that was passed in to the caller will be the same as is passed to the callee, so we do not have to overwrite that stack location. These properties lead to the following code sequence:

cp = f[0] ep = f[1] fp[n+3] = arg[1] ... fp[4] = arg[n] fp[2] = ep sp = fp fp = *sp++ jump *cp

When l > n, let k = l - n.

cp = f[0] ep = f[1] fp[l+3] = arg[1] ... fp[k+4] = arg[n] fp[k+3] = exh fp[k+2] = ep fp[k+1] = fp[1] sp = fp fp = *sp sp += (k+1) jump *cp

10.1.6 Known functions

For calls to known functions, the arguments and results can be passed in unboxed representation. As with escaping functions, there are two additional implicit arguments: the exception handler continuation and the function's environment pointer.

Known call

Known tail-call

10.1.7 Goto

10.1.8 C function calls

The basic code generation for C function calls is handled using the MLRISC C_CALLS interface. This library supports the underlying ABI's argument-passing convention. The Moby compiler supports annotations on C functions that can be used to specialize the calling convention. These annotations are introduced in the prototype declarations for the C functions in the MBX files and are described below.

may-callback

This annotation marks the C function as one that might call back into Moby code prior to its own return (e.g., the main event loop of a GUI library). If this annotation is present, then we must save and restore the allocation pointer across the C call in such a way as to allow any Moby callbacks to have access to the allocation pointer. We achieve this goal by saving the allocation pointer in the host task's descriptor prior to the call and then restoring it on return.

basePtr = allocPtrMASK; taskPtr = basePtr->taskPtr; taskPtr->allocPtr = allocPtr;... call C function ...allocPtr = taskPtr->allocPtr;

The wrapper code for the Moby callback gets the allocation pointer from the task structure on entry and saves it on exit (see Section 10.1.9 below).

moby-lib

This annotation marks the C function as requiring access to the task descriptor of its host task. In addition to the saving and restoring of the allocation pointer (as in the case of the may-callback above), we pass the current task pointer as an extra argument to the C function.

pure

The pure annotation does not affect the calling convention, although it should not appear in conjunction with the ``may-callback'' annotation. Instead, it is used to mark C functions as having no visible side effects, which allows the optimizer more freedom (see Chapter 9). Examples of pure C functions include math library functions, such as sin and cos, as well as system calls, such as gettimeofday and access, that query the system for information.

10.1.9 C callable functions

The code generator also generates wrapper code for C-callable functions.

10.2 Leaf-procedure optimization

After register allocation, we check to see if the cluster is both a leaf procedure and does not need allocate stack space (either for stack objects or spill locations). If these two conditions are met, then we rewrite the instructions to not allocate or use a new stack frame. For example, the Moby function

fun inc (r : Ref(Int)) -> () { *r := *r + 1 }

is compiled into the following instruction sequence on the IA32:

inc: pushl movl movl 16(incl (leave ret $12

By applying the leaf-procedure optimization, we can reduce it to the following:

inc: movl 12(incl (ret $12

This optimization works by first rewriting the entry and exit code sequences and then rewriting each block to replace frame-pointer based addressing with stack-pointer based addressing. The one subtlety is that MLRISC considers any basic block that has non-local control flow (i.e., because of a throw) to be an exit block. Thus, the exit sequence rewriting has to recognize such cases and ignore them. The leaf-procedure optimization can be disabled with the --no-leaf-opt command-line option.

Eventually, we may want to apply this optimization to a larger class of procedures. First, we could support stack allocation by making the operand rewriting more sophisticated (this is essentially like using a virtual frame pointer). We might also apply the optimization to procedures that make tail calls.

10.3 Generating PC maps

Note: It may make more sense for this information to be in the run-time chapter. This section doesn't really discuss how the compiler generates the PC map information. That would be useful information to have written down, though

The actions that a run-time service performs may depend on the state of the threads in the system. For example, we prefer not to trace dead variables during garbage collection. If the relevant portions of the state are available statically (that is, they are a function of the program counter), then the run-time system can combine two techniques to acquire the state:

Use the program counter as an index into a compiler-generated table containing relevant information.
Examine the instruction stream and the values of dedicated registers to determine information not contained in the table.

One important design goal is that we want to be able to interrupt threads at any instruction without preventing run-time services, such as garbage collection, from running. Achieving this goal allows us to hide the latency of a page fault by running other threads. It is not generally possible in systems with so-called ``gc safe-points''.

In this section, we describe the division of responsibility between the run-time system and the compiler, as well as the binary encoding of the PC-map table. Activation-record layout and code generation for memory allocation are closely related subjects, but we assume they are described in detail elsewhere. First, we describe the overall organization of the table. Second, we give the detailed binary encoding. Third, we explain how we expect the run-time system to use this information. At this time, PC maps are only implemented for the IA-32 architecture; much of what follows is architecture-specific.

10.3.1 Table Organization

There is one PC-map table per file. Each module's initialization code calls a run-time routine to register the file's table. (Because a file can have multiple modules but only one PC map, the run-time routine detects if a table has already been registered. This feature has not been tested, though.) The table itself is in a read-only data section.

A file table has three levels. (You could call it a ``table of tables of tables'', but each level contains information other than just pointers to tables.) The top level is also called the cluster level. The middle level is also called the common level. The bottom level is also called the uncommon level.

A cluster-level entry contains:

The address at which the cluster begins. (By begins, we mean the lowest address; it need not be the entry point.)
The number of bits that is sufficient to describe the difference between the beginning address of the cluster and any other address in the cluster. That is, if the cluster's code occupies b bytes, then this number must be at least ceil(log₂ b). We call this number the ``offset bits''.¹
The address at which the common-level table for this cluster begins.
The number of words between the frame pointer and the top of the activation record for this cluster.
The number of words between the frame pointer and the beginning of the floating-point spill section of the activation record.
The number of floating-point spill slots for the activation record.

In general, the cluster level describes the activation record, the encoding of lower-level tables, and the location of lower-level tables.

A common-level entry is for a particular address (program counter value). It contains:

The difference between the address and the beginning of the cluster. (That is, the offset.)
The address at which the associated uncommon-level table begins. (This address should probably be changed to an offset in the future.) This uncommon table contains entries for PC values between this common-level entry's value and the previous one. That is, it is for instructions preceding the common one.
Static liveness information for local variables at this PC value. Specifically, a bit vector indicates which registers and stack slots may be used before they are defined (i.e., which locations are live).

An uncommon-level entry is usually for a particular PC value. It contains:

The length of the instruction at this PC.² (In the future we hope to have the run-time recompute this information by parsing the instruction stream.)
Whether this PC value is in an allocation sequence. That is, whether an object is partially initialized.
Two lists of local variables (registers or stack slots) called ``plus'' and ``minus''. Given the liveness information for after the instruction, subtracting the minus and adding the plus gives the liveness information for before this instruction.

Note that the only way to find the uncommon-level entry for a particular PC value is to start at the beginning of the correct uncommon-level table and iterate through the entries subtracting instruction lengths until the correct PC value is obtained.

For allocation sequences, the liveness information is the same for the entire sequence since the run-time system should reset the program counter to the beginning of the sequence. Therefore, there is only one entry for the entire sequence; the length is the length of the sequence.

We now give some brief justification for using three different levels, each with its own encoding. The top-level table encodes the activation-record layout. Since it's entries are a fixed size, binary search is fast.

The division between common and uncommon entries is even more important. For most PC values, we never expect the run-time system to examine the information for this value. The run-time system needs the information only if a thread is currently at this PC or it is a return address on the call stack. So, currently we classify the ends of basic blocks and instructions after calls as common, all other PC values are uncommon.

Ends of basic blocks are actually not common, but it is easier to emit them into the common table since their liveness information is not closely related to the liveness information for the following instruction. Perhaps in the future we should segregate block ends from return addresses.

There are two rare situations where a return address is in the uncommon table instead of the common table. The first is after a call instruction that is the last instruction in a basic block. The compiler could detect this case, but it is a pain to do, so it is unimplemented. The second is after a call instruction that immediately precedes an allocation. In this case, the entry cannot be a common one, because only uncommon entries can describe allocation sequences. These cases are both rare. A call instruction is not followed by a stack-pointer adjustment only if the call has no arguments (including an exception handler or environment pointer) or results. In the second case, we also need a heap-limit check between a call and an allocation unless we can bound the amount of memory allocated during the call.

All the entries in a particular middle-level table have the same size, so binary search is easy there. For the bottom-level table, table size is more important than lookup time, so entries have variable length and linear processing is necessary. Moreover, by keeping the uncommon entries in separate tables, we hope to have better paging behavior.

10.3.2 Binary Encoding

Location Numbering

To record the liveness of a register or stack slot, we use a mapping from registers and stack slots to small integers. The non-dedicated registers eax, ebx, ecx, edx, esi are represented by 0, 1, 2, 3, 4, respectively. The 4-byte (non floating point) stack slot in the lowest address (shallowest on the stack) is represented by 5. The next shallowest slot gets 6, and so on. This encoding is a bit wasteful since we use numbers for the slots containing the saved frame pointer and return address, yet these slots are never garbage-collection roots.

Cluster Level

The cluster-level table begins with one word that is the number of cluster entries in the table. The entries are sorted with respect to the beginning addresses of the clusters they describe.

Each cluster-level entry takes 3 words. The first word is the address of the associated common-entry table. The second word is the beginning address for the cluster. The third word contains four separate one-byte fields. The first byte is an 8-bit integer indicating the number of offset bits used in the common-entry table. The second byte is an 8-bit integer indicating the number of words above the frame pointer in the activation record. The third byte is an 8-bit integers indicationg the number of words between the frame pointer and the floating-point spill slots. The fourth byte is an 8-bit integer indicating the number of floating-piont spill slots. (Each slot is 12 bytes.)

Hence, the encoding fails if there are more than 253 argument/results/reserved slots, more than 255 word spill slots, or more than 255 floating-point spill slots.

A final word in the cluster-level table is a label which follows the last common-level table.

To give some intuituion for this layout, consider finding the correct cluster for a PC value for a table starting at address pcmap. The first cluster's lowest address is at [pcmap+8] and the last cluster's lowest address is at [pcmap+8+3*[pcmap+0]]. Each entry is 12 bytes, so binary search is now straightforward. Suppose such a search determines that the cluster containing the PC value has its lowest address at [pcmap+i]. Then the common-level table for the cluster is exactly between [pcmap+i-1] and [pcmap+i-1+12]. (The extra label at the end of the table makes this fact hold even for the last cluster.)

Common Level

Note: This level's encoding should change soon. And the current implementation is incomplete as explained below.

Because the cluster-level table specifies the size of a common-level table, the common-level table contains only common-level entries. The entries are in sorted order.

Each entry consumes an integral number of words. The first word is the address of the associated uncommon-level table. (A future change involves making this an offset so that it can normally consume less space.) The second word contains the PC offset for this entry, that is, the difference between the address of this common instruction and the beginning of the cluster. The second word also contains a bit vector encoding which registers and stack slots are live. The ``offset bits'' field of the associated cluster entry gives the information that the run-time system needs to extract these two fields correctly.

We expect that most clusters can fit the pc offset and bit vector in one word. However, it is likely that some cannot. Allowing longer entries is not yet implemented, but it will be very soon. Until then, the pc offset is in the low-order bits of the one shared word. The compiler raises an error if more than one word is needed.

Uncommon Level

The entries in an uncommon level table describe a sequence of instructions preceding the instructions that the associated common level entry describes. The entries are in ``backwards order'' --- the first entry is for the instruction just before the common-entry instruction, the second entry is the for the instruction just before the first-entry instruction, etc. Allocation sequences are logically one instruction (i.e., one entry); they are discussed below.

For non-allocation sequences, each entry has a variable length of at least one byte. The first byte contains four independent pieces of information:

The highest bit is unset to indicate that this entry is not for an allocation sequence.
The next highest bit is set iff the ``plus'' list for the entry is non-empty.
The next highest bit is set iff the ``minuss'' list for the entry is non-empty.
The low 5 bits encode an unsigned integer representing the length of the instruction. (The encoding fails if an instruction can span more than 32 bytes, including nop bytes introduced by .align directives.)

The rest of the entry consists of the plus and minus lists, if necessary. If a list is non-empty, it contains the integer-encoding for each slot in the list followed by an ``all 1s'' sequence to terminate the list. The number of bits per list element is minimal; a list element may cross any byte boundary. The minimal number of bits is ceil(log₂(5+s+1)) where s is the number of slots in the activation record (number below the frame pointer plus number above the frame pointer).

Although the lists have no internal alignment, each uncommon-table entry begins on a byte boundary.

Allocation-sequence entries are very similar. The differences are:

The highest bit of the first byte is set.
We use 13 bits to encode the length instead of 5. Hence an allocation sequence can span up to 8Kb. The 13 bits (5 in the first byte and 8 in a mandatory second byte) are interpreted big-endian). Because IA-32 is little-endian, generating the correct assembly directives requires some acrobatics.

10.3.3 Run-time Use

Having explained what is encoded and how, we sketch how we expect the run-time system to use the information. We assume that the run-time needs to perform a garbage collection and that it has access to the program counters of all live threads. The system needs to create an appropriate collection of garbage-collection roots and leave all threads in appropriate states for continued execution.

Obviously, the system must index into the PC-map tables to get the information for the correct instruction. Since the method for doing so is fairly obvious from the encoding, we do not discuss it further. It is how the information is used that is more interesting.

An obligation of the code generator is to ensure that allocation sequences can be re-started (rolled back) simply be resetting the program counter to the beginning of the sequence. So if the run-time learns that a thread was in the middle of an allocation sequence, it needs only to calculate where the address of the beginning of the sequence.

Finding garbage-collection roots seems straightforward at first. The live variables for the current activation record are a function of the PC-map tables and the current program counter. From the activation-record layout information we can find the return address and use that to determine the live variables in the caller's frame. Iterating, we can ``crawl the stack'' finding all of the thread's roots.

Problems arise because we strive to support garbage collection at any program point. Hence the shallowest frame can be in a temporarily weird state and the run-time plus the tables must account for such strangeness. Specifically, a thread could be in a function prologue, a function epilogue, a call sequence (some arguments pushed on the stack), a return sequence (the results not yet popped off the stack), or in a tail-call optimized version of one of the above.

The real source of the strangeness is that the frame pointer, the stack pointer, or both are not where they usually are. So hopefully, the run-time system can take the difference between the two and use this difference to determine the correct state of the frame. For example, suppose we are in a call or return sequence. Then the difference should be too high and we could conservatively treat the extra slots (the arguments or results to the callee) as live.

A standard function epilogue has only one strange position, between the leave and ret instructions. In this case, the simplest solution may be for the run-time system to check the instruction stream for this case. Similarly, function prologues use instructions that do not appear elsewhere.

Tail calls may be more problematic. Currently, their code generation is broken for other reasons, so we will re-examine the tail-call case later.

The problem with all these ``corner cases'' is that they make rather brittle assumptions about code generation. The alternative, however, seems to be including a lot of extra bloat in the PC-map tables. In the long run, it is probably easiest to manage this bloat. For example, we could use one uncommon-level bit to indicate ``this is a corner case'' and then a few bits to indicate how far off the frame pointer is from where it should be.

1: In practice, the compiler does not know instruction lengths, so it (very) conservatively estimates b to be 32 times the number of instructions.
2: It is quite frustrating that the IA-32 has variable-length instructions.