PYNTO CODE GENERATOR
--------------------

The Pynto Code Generator is largely machine independent, and is
parameterized by descriptions of the target machine which are embodied by
a 'machine definition'. The machine independent parts of code generation
are described here, and the machine definition below. The machine
independent code can be found in this file, codegen.py, and also in
machinedefn.py and instrsdefn.py, which provide the base classes and
utilities built upon by all actual machine definitions. The actual
machine definitions can be found in files named after the type of
backend, such as ppc.py or x86.py.

One thing that is somewhat confusing is the fact that I am not creative
enough to come up with an alternate term for Instruction, so we have two
kinds of instructions: the Instruction objects that come from the module
instrs.py, and the low level instructions that the code generator works
with.

Code Generation operates in two broad phases. First, we identify those
Value objects which are live across block boundaries. As is traditional
for a compiler, I will refer to these as "global" Value objects. While
gathering the set of global Values, we also compute a conflict graph: a
conflict graph describes which Values must be stored in separate
locations because they are used at the same time. We use the conflict
graph to do a simple storage allocation for all the global values: each
global is assigned to a register or a stack slot. The machine definition
is used at this point to determine the set of registers which globals may
be allocated in (those with the machinedefn.regbits.GLOBAL flag), as well
as to hand out new stack slots.

Once the globals have a home, we begin to generate code on a block by
block basis. We do the global Values first because we process each block
separately in its entirety; we never store the low level IR for the
entire function at a time, only for a single block. Therefore, if the
Values that extend between blocks are to be found at the same location by
every block, they must be allocated first.

We want to process the blocks in a reasonable order. In fact, we process
the blocks in Post Order, meaning that a block is processed only after
all of its successors have been (excluding back edges, of course). The
reason we do it this way is that, by the time we get to processing the
Start Block for a function, we know how many temporaries it has, so we
can compute the stack layout and generate the code for the LoadParam
instructions we will find there (some platforms need to know the stack
layout to know where the parameters will be found). In addition, as we
generate the machine code for each block, we insert it into a buffer:
however, rather than inserting it at the front, we start from the end of
the buffer and move towards the front, making it bigger as needed. The
reason we do this is that we end up with the code generated in Reverse
Post Order, which happens to be very nearly execution order for most
program graphs.

Processing a block consists of first selecting the code gen templates for
each Instruction and generating the low level pseudo-code from them using
the Machine Defn (see the ID, pre-filter, and emitter descriptions
below). This results in a final list of instructions for the block. We do
resource allocation on that list, replacing Temporarys and
TemporaryStacks with hard registers and stack slots, and possibly
spilling temporaries or global values as needed.
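To make the global allocation step a little more concrete, here is a
minimal sketch of a greedy assignment driven by a conflict graph. The
names (allocate_globals, conflicts, global_regs) and the ("stack", n)
slots are hypothetical illustrations, not the actual Pynto interfaces:

    # A minimal, hypothetical sketch of the "global" Value allocation step.
    # 'conflicts' maps each global value to the set of globals it conflicts
    # with, and 'global_regs' stands in for the registers the machine
    # definition flags with machinedefn.regbits.GLOBAL.
    def allocate_globals(conflicts, global_regs):
        location = {}      # value -> register name or ("stack", slot index)
        next_slot = 0
        for value in conflicts:
            # Locations already claimed by conflicting values.
            taken = {location[v] for v in conflicts[value] if v in location}
            # Prefer a free register; otherwise hand out a fresh stack slot.
            for reg in global_regs:
                if reg not in taken:
                    location[value] = reg
                    break
            else:
                location[value] = ("stack", next_slot)
                next_slot += 1
        return location

    # Example: 'a' conflicts with both 'b' and 'c', but 'b' and 'c' may share.
    print(allocate_globals({"a": {"b", "c"}, "b": {"a"}, "c": {"a"}},
                           global_regs=["r30", "r31"]))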
Once resource allocation is done, we use the Machine Defn's emitter table
once again to actually generate the bytes that will be placed in the
buffer (from the end forward, as described in the preceding paragraph).

After generating all the code, we do one last pass over the generated
machine code and patch up branches and other links. We can't do this
earlier because the offsets are not necessarily known.

The following sections describe the machine dependent parts of the code
generator.

Machine Definition
==================

Code Generation is parameterized by a machine definition, which consists
of three parts, one of which is nearly the same across all machines. I
will summarize the parts briefly here, but you will probably need to read
the detailed descriptions below to understand the interaction.

The low level instructions are described in detail in the Instruction
Definition section, but suffice it to say for the moment that they
consist of a small selection of very simple operations that can be
implemented by one or two native instructions on most processors. I will
use the capitalized "Instruction" when referring to the mid-level
Instruction objects, and I will try to use the term "low level
instruction" or "pseudo-code" to refer to the code generator
instructions. I say "pseudo-code" because they are, in a way, a kind of
"pseudo machine code" in that they hide some of the complexity of real
machine code, though the mapping between the two is obvious.

1. The Instruction Definition provides a mapping from the Pynto
   Instruction objects, such as LoadAttrInstruction, to the low-level
   pseudo-code. While it is technically true that the ID is selected on a
   per-machine basis, almost all machines use the common core definition
   which can be found in instrsdefn.py. Some machines (such as X86) take
   what is there and make changes to it in a programmatic way. In the ID,
   each Instruction object is mapped to one or more CodeGenTemplates,
   which are just sets of low-level instructions. During code generation,
   each basic block is converted to a set of low level instructions by
   concatenating the definitions of each Instruction object.

2. The "pre-filter" exists largely to implement calling conventions, but
   can be used to insert other custom processing as needed. When each
   block is being converted according to the ID, the pre-filter is given
   a chance to make any modifications it wants to the instructions in the
   ID before they end up in the final listing. Commonly this is used to
   expand "CALL" instructions to a set of instructions that set up the
   arguments and deal with the return value.

3. The "emitter" takes the form of a table defining, at the lowest level,
   what kinds of operands this machine can handle natively with each
   instruction and how to generate the actual machine instructions
   required. This table is used to determine which temporaries can be
   spilled to memory and which should be in registers, and of course to
   generate the final instructions for execution.

Instruction Definition
======================

As described above, the ID maps Pynto instructions to low level
operations. Strictly speaking, there is one ID for each back-end;
however, all the backends begin with the same ID and customize it. The
base ID can be found in instrsdefn.py; each back-end starts with that as
a template and can make small adjustments if needed. I have not yet had a
need to customize it. It is also possible to customize the instructions
in the pre-filter, described below.
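As a rough sketch of what such a customization might look like in
practice, a back-end could copy the base ID and tweak it programmatically
at load time. The CodeGenTemplate class shown here, the '%dst'-style
placeholders, and the AddInstruction entry are hypothetical illustrations,
not the actual contents of instrsdefn.py:

    import copy

    # Hypothetical stand-in for the template machinery in instrsdefn.py.
    class CodeGenTemplate:
        def __init__(self, condition, instructions):
            self.condition = condition        # callable(Instruction) -> bool
            self.instructions = instructions  # low-level pseudo-instructions

    BASE_ID = {
        'AddInstruction': [
            # One generic three-operand template that always applies.
            CodeGenTemplate(lambda instr: True,
                            [('ADD', ['%dst', '%left', '%right'])]),
        ],
    }

    def make_x86_id():
        # A back-end starts from the common core and adjusts it
        # programmatically, e.g. rewriting ADD into a two-operand form.
        backend_id = copy.deepcopy(BASE_ID)
        backend_id['AddInstruction'] = [
            CodeGenTemplate(lambda instr: True,
                            [('MOVE', ['%dst', '%left']),
                             ('ADD', ['%dst', '%dst', '%right'])]),
        ]
        return backend_id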
Ordinarily, you would want to do any customizations required in the ID if
possible, because in that case the customizations could be done once at
load time rather than over and over again for each function.
Unfortunately, not all the tweaks can be done ahead of time.

Each Pynto instruction is mapped to one or more CodeGenTemplate objects.
Each CodeGenTemplate object contains two parts:

1. A condition function: this is just a function that, when provided with
   an actual Instruction instance, determines if this template applies.
   Sometimes multiple CodeGenTemplates are provided for a given
   Instruction type, which indicates that it may be implemented in
   different ways depending on the conditions tested by this function.
   For example, there are two implementations of LoadAttrInstruction: one
   is for when the pointer is of a known type and we can load the
   attribute with a simple dereference; the other uses the helper
   PyObject_GetAttrString().

2. A set of instructions: these are a list of low-level pseudo
   instructions that will be translated to actual machine instructions
   after being put through the pre-filter and emitter of the MD.

A low-level instruction consists of an opcode (which is a constant
string), a list of operands, and possibly a label and/or target. A label
is used to allow an instruction to be targeted by a branch elsewhere. A
target is used when the instruction must move control flow; for example,
a BEQ instruction branches to the target if its two operands are equal.
The complete set of opcodes and their expected parameters, etc., are
described in the next section.

Low Level Pseudo Instructions
-----------------------------

Note that in all cases any operands refer to word-sized values right now.

'BR'
    Unconditional branch. Expects a 'target' which can be either a
    terminal of the instruction (i.e., instr.FALLTHROUGH or some such),
    or a local label.

'BEQ'
    Branch if equal: expects two word-sized operands and tests for
    equality. This is a simple integer comparison (pointer equality).

'BNE'
    Like 'BEQ', but tests for inequality.

'CALL'
    Performs a function call. Expects a list of parameters; the first
    item in the operand list is where the return value will be stored.
    If the return value should be ignored (or there is no return value)
    then the first operand should be the NullOperand. The function to
    call should be specified in the target; this will either be a string
    naming a known library function (like 'PyObject_IsTrue') or a
    knode.JittedCode object. Many MDs break down the CALL instruction
    further, moving the argument setup and return value handling to
    separate instructions and changing the CALL instruction so that it
    simply affects control flow.

'CALLPTR'
    This is not typically used in the instruction definition, but it
    indicates a call by pointer [whereas a CALL is statically linked].
    This would be used for a properly optimized call to a virtual
    function, or, in the case of PPC, in the call to a C library or
    other transition vector style function.

'ADD'
    Takes three operands (dst, left, right). Performs an integer addition
    of left and right and puts the result in dst. dst may be the same as
    left, right, or both.

'SUB'
    Like ADD, but subtracts!

'NOOP'
    No effect, no arguments. Used when a label is needed but no
    instruction is convenient.

'MOVE'
    Takes two operands (dst, src). Moves the value in src into dst. Note
    that on RISC processors this is the only instruction which may have a
    memory argument, and at most one (corresponding to a LOAD or a
    STORE).
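Putting these pieces together, a short block of pseudo instructions might
be modelled roughly as follows. The LowLevelInstr class and the literal
operands here are an illustrative sketch, not the code generator's actual
data structures:

    # A hypothetical in-memory form for one low-level pseudo-instruction.
    class LowLevelInstr:
        def __init__(self, opcode, operands=(), label=None, target=None):
            self.opcode = opcode            # constant string, e.g. 'MOVE'
            self.operands = list(operands)  # word-sized operands
            self.label = label              # lets a branch elsewhere target this
            self.target = target            # where control flow goes, if it moves

    # Pseudo-code for "dst = 1 if a == b else 0", purely for illustration:
    block = [
        LowLevelInstr('BEQ',  ['a', 'b'], target='equal'),
        LowLevelInstr('MOVE', ['dst', 0]),
        LowLevelInstr('BR',   target='done'),
        LowLevelInstr('MOVE', ['dst', 1], label='equal'),
        LowLevelInstr('NOOP', label='done'),
    ]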
Pre-Filter
==========

As described at the beginning of this file, once the global Values have
been allocated, we generate the code for each block, one at a time. The
way that we do this, in more detail, is to begin by walking over the
Instructions in the block. For each Instruction we select an appropriate
CodeGenTemplate from the ID (described in the previous section). The
CodeGenTemplate gives us a list of low level instructions that implement
this particular Instruction.

For each low level instruction, we emit it by first checking to see if
the pre-filter has a method named after the opcode of the instruction (so
prefilter.CALL() would be used to intervene with each 'CALL'
instruction). If such a method exists, we call it. The method is expected
to break the low level instruction down into other instructions or
otherwise tweak it and then pass it to the emitter, described below.
Should the pre-filter not have such a method, we pass the instruction to
the emitter directly.

The most common use of the pre-filter is to implement the calling
convention. For example, the ppc backend defines two pre-filters: one for
CALL and one for LOADPARAM; the CALL method takes the CALL instruction
described above, which contains a listing of arguments etc, and breaks it
apart into a series of MOVE instructions for each parameter, as well as a
CALL instruction which simply performs the call itself, followed by a
MOVE to recover the return value. It might also be possible, for example,
to use a pre-filter to convert arithmetic like ADDs and SUBs from
three-operand to two-operand form, though this could perhaps be done on
the ID directly for better performance. Note that the pre-filter often
calls emitter._emit(), which bypasses the pre-filter.

Emitter
=======

The emitter is entirely table driven. The code that reads the table is in
machinedefn.py; should this architecture be the one I settle on, which is
not clear, I intend to write a code generator for it which reads the
table and generates optimal code so that the actual compilations can be
faster. Doing that ahead of time, of course, would be somewhat premature
optimization, so at the moment the interface is interpreted.

The basic idea of the table is to define what sorts of operands each
opcode can natively handle on this particular architecture. The operands
are broken down into three categories right now: register, memory, and
constant. Register indicates a machine register. Memory indicates an
offset addressing mode: that is, *(reg+offset) where reg is a machine
register. Constant represents an inline constant.

The emitter takes as input an instruction which consists of an opcode and
source/dest operands of any type. The emitter then compares that
instruction to the templates in the table; if a template is found that
matches the types of the operands on the instruction, then the
instruction is added to the list as is. If not, the template which most
closely matches the operands is found and those operands which don't
match are converted appropriately. For example, suppose that on a RISC
machine, which only allows manipulation of registers, we are given an
instruction that adds a stack slot and a register. In that case, we would
find no matching ADD; the closest match for ADD in the table would be the
form which adds two registers, so we would convert the stack slot into a
temporary register.

The form of the table itself is a dictionary with the opcode strings as
the keys and lists of table entries as the values.
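Before turning to the table entries themselves, here is a rough sketch of
the pre-filter mechanism, in the spirit of the ppc CALL expansion
described above. The register names, the (opcode, operands, target)
tuples, and the _emit() interface are assumptions made for illustration,
not the actual Pynto code:

    # Hypothetical calling-convention data for the sketch.
    ARG_REGS = ['r3', 'r4', 'r5', 'r6', 'r7', 'r8', 'r9', 'r10']
    RETVAL_REG = 'r3'

    class CallPrefilter:
        def __init__(self, emitter):
            self.emitter = emitter

        def CALL(self, instr):
            opcode, operands, target = instr
            retval, args = operands[0], operands[1:]
            # Move each argument into the register the convention dictates.
            for reg, arg in zip(ARG_REGS, args):
                self.emitter._emit(('MOVE', [reg, arg], None))
            # The CALL itself now only moves control flow.
            self.emitter._emit(('CALL', [], target))
            # Recover the return value unless it should be ignored
            # (None stands in for the NullOperand in this sketch).
            if retval is not None:
                self.emitter._emit(('MOVE', [retval, RETVAL_REG], None))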
Each entry in this table is a tuple with three parts: the destination
types, the source types, and the byte-emitting instructions. The types of
the destination and source operands are specified by a tuple of integers;
each integer is a bitmask indicating which types of operands are legal.
If you look in machinedefn.optypes, you'll see the possible types of
operands. The only kinds of operands that reach this stage are CONST,
MEM, REG, and TEMP. TEMPs are removed during local resource allocation,
described above, but they should be considered as a REGISTER (or MEM if
the addr-taken flag is set), since that is what they will be replaced
with.
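A toy version of such a table, with made-up bitmask values and
placeholder byte-emitting functions (the real machinedefn.optypes
constants and entry format may differ), might look like this:

    # Hypothetical operand-type flags in the spirit of machinedefn.optypes.
    REG, MEM, CONST = 1, 2, 4

    # opcode -> list of (dest type masks, source type masks, emit function).
    # The emit functions below just return placeholder bytes.
    EMIT_TABLE = {
        'ADD': [
            # This illustrative RISC-like ADD only accepts registers.
            ((REG,), (REG, REG), lambda dst, left, right: b'\x00'),
        ],
        'MOVE': [
            ((REG,), (REG,),   lambda dst, src: b'\x01'),
            ((REG,), (MEM,),   lambda dst, src: b'\x02'),   # a LOAD
            ((MEM,), (REG,),   lambda dst, src: b'\x03'),   # a STORE
            ((REG,), (CONST,), lambda dst, src: b'\x04'),
        ],
    }

    def find_entry(opcode, dest_types, src_types):
        # Return the first entry whose bitmasks accept every operand type,
        # or None if the emitter would first have to rewrite some operands
        # (for example, by loading a MEM operand into a temporary register).
        for dests, srcs, emit in EMIT_TABLE[opcode]:
            if all(mask & t for mask, t in zip(dests, dest_types)) and \
               all(mask & t for mask, t in zip(srcs, src_types)):
                return emit
        return None

    # ADD of a register and a stack slot (MEM) finds no match here:
    print(find_entry('ADD', (REG,), (REG, MEM)))   # -> None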