PYNTO CODE GENERATOR
--------------------

The Pynto Code Generator is largely machine independent, and is
parameterized by descriptions of the target machine which are embodied by
a 'machine definition'. The machine independent parts of code generation
are described here, and the machine definition below. The machine
independent code can be found in this file, codegen.py, and also in
machinedefn.py and instrsdefn.py, which provide the base classes and
utilities built upon by all actual machine definitions. The actual
machine definitions can be found in files named after the type of
backend, such as ppc.py or x86.py.

One thing that is somewhat confusing is the fact that I am not creative
enough to come up with an alternate term for Instruction, so we have two
kinds of instructions: the Instruction objects that come from the module
instrs.py, and the low level instructions that the code generator works
with.

Code Generation operates in two broad phases. First, we identify those
Value objects which are live across block boundaries. As is traditional
for a compiler, I will refer to these as "global" Value objects. While
gathering the set of global Values, we also compute a conflict graph: a
conflict graph describes which Values must be stored in separate
locations because they are used at the same time. We use the conflict
graph to do a simple storage allocation for all the global values: each
global is assigned to a register or a stack slot. The machine definition
is used at this point to determine the set of registers which globals may
be allocated in (those with the machinedefn.regbits.GLOBAL flag), as well
as to hand out new stack slots.

Once the globals have a home, we begin to generate code on a block by
block basis. We do the global Values first because we process each block
separately in its entirety; we never store the low level IR for the
entire function at a time, only for a single block. Therefore, if the
Values that extend between blocks are to be found at the same location by
every block, they must be allocated first.

We want to process the blocks in a reasonable order. In fact, we process
the blocks in Post Order, meaning that a block is processed only after
all of its successors have been (excluding back edges, of course). The
reason we do it this way is that, by the time we get to processing the
Start Block for a function, we know how many temporaries it has, so we
can compute the stack layout and generate the code for the LoadParam
instructions we will find there (some platforms need to know the stack
layout to know where the parameters will be found). In addition, as we
generate the machine code for each block, we insert it into a buffer:
however, rather than inserting it at the front, we start from the end of
the buffer and move towards the front, making it bigger as needed. The
reason we do this is that we end up with the code generated in Reverse
Post Order, which happens to be very nearly execution order for most
program graphs.

Processing a block consists of first selecting the code gen templates for
each Instruction and generating the low level pseudo-code from them using
the Machine Defn (see the ID, pre-filter, and emitter descriptions
below). This results in a final list of instructions for the block. We do
resource allocation on that list, replacing Temporarys and
TemporaryStacks with hard registers and stack slots, and possibly
spilling temporaries or global values as needed.
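To make the global allocation step a little more concrete, here is a
minimal sketch of a greedy assignment driven by a conflict graph. The
names (allocate_globals, conflicts, global_regs) and the ("stack", n)
slots are hypothetical illustrations, not the actual Pynto interfaces:

    # A minimal, hypothetical sketch of the "global" Value allocation step.
    # 'conflicts' maps each global value to the set of globals it conflicts
    # with, and 'global_regs' stands in for the registers the machine
    # definition flags with machinedefn.regbits.GLOBAL.
    def allocate_globals(conflicts, global_regs):
        location = {}      # value -> register name or ("stack", slot index)
        next_slot = 0
        for value in conflicts:
            # Locations already claimed by conflicting values.
            taken = {location[v] for v in conflicts[value] if v in location}
            # Prefer a free register; otherwise hand out a fresh stack slot.
            for reg in global_regs:
                if reg not in taken:
                    location[value] = reg
                    break
            else:
                location[value] = ("stack", next_slot)
                next_slot += 1
        return location

    # Example: 'a' conflicts with both 'b' and 'c', but 'b' and 'c' may share.
    print(allocate_globals({"a": {"b", "c"}, "b": {"a"}, "c": {"a"}},
                           global_regs=["r30", "r31"]))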
Once resource allocation is done, we use the Machine Defn's emitter table
once again to actually generate the bytes that will be placed in the
buffer (from the end forward, as described in the preceding paragraph).

After generating all the code, we do one last pass over the generated
machine code and patch up branches and other links. We can't do this
earlier because the offsets are not necessarily known.

The following sections describe the machine dependent parts of the code
generator.

Machine Definition
==================

Code Generation is parameterized by a machine definition, which consists
of three parts, one of which is nearly the same across all machines. I
will summarize the parts briefly here, but you will probably need to read
the detailed descriptions below to understand the interaction.

The low level instructions are described in detail in the Instruction
Definition section, but suffice it to say for the moment that they
consist of a small selection of very simple operations that can be
implemented by one or two native instructions on most processors. I will
use the capitalized "Instruction" when referring to the mid-level
Instruction objects, and I will try to use the term "low level
instruction" or "pseudo-code" to refer to the code generator
instructions. I say "pseudo-code" because they are, in a way, a kind of
"pseudo machine code" in that they hide some of the complexity of real
machine code, though the mapping between the two is obvious.

1. The Instruction Definition provides a mapping from the Pynto
   Instruction objects, such as LoadAttrInstruction, to the low-level
   pseudo-code. While it is technically true that the ID is selected on a
   per-machine basis, almost all machines use the common core definition
   which can be found in instrsdefn.py. Some machines (such as X86) take
   what is there and make changes to it in a programmatic way. In the ID,
   each Instruction object is mapped to one or more CodeGenTemplates,
   which are just sets of low-level instructions. During code generation,
   each basic block is converted to a set of low level instructions by
   concatenating the definitions of each Instruction object.

2. The "pre-filter" exists largely to implement calling conventions, but
   can be used to insert other custom processing as needed. When each
   block is being converted according to the ID, the pre-filter is given
   a chance to make any modifications it wants to the instructions in the
   ID before they end up in the final listing. Commonly this is used to
   expand "CALL" instructions to a set of instructions that set up the
   arguments and deal with the return value.

3. The "emitter" takes the form of a table defining, at the lowest level,
   what kinds of operands this machine can handle natively with each
   instruction and how to generate the actual machine instructions
   required. This table is used to determine which temporaries can be
   spilled to memory and which should be in registers, and of course to
   generate the final instructions for execution.

Instruction Definition
======================

As described above, the ID maps Pynto instructions to low level
operations. Strictly speaking, there is one ID for each back-end;
however, all the backends begin with the same ID and customize it. The
base ID can be found in instrsdefn.py; each back-end starts with that as
a template and can make small adjustments if needed. I have not yet had a
need to customize it. It is also possible to customize the instructions
in the pre-filter, described below.
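As a rough sketch of what such a customization might look like in
practice, a back-end could copy the base ID and tweak it programmatically
at load time. The CodeGenTemplate class shown here, the '%dst'-style
placeholders, and the AddInstruction entry are hypothetical illustrations,
not the actual contents of instrsdefn.py:

    import copy

    # Hypothetical stand-in for the template machinery in instrsdefn.py.
    class CodeGenTemplate:
        def __init__(self, condition, instructions):
            self.condition = condition        # callable(Instruction) -> bool
            self.instructions = instructions  # low-level pseudo-instructions

    BASE_ID = {
        'AddInstruction': [
            # One generic three-operand template that always applies.
            CodeGenTemplate(lambda instr: True,
                            [('ADD', ['%dst', '%left', '%right'])]),
        ],
    }

    def make_x86_id():
        # A back-end starts from the common core and adjusts it
        # programmatically, e.g. rewriting ADD into a two-operand form.
        backend_id = copy.deepcopy(BASE_ID)
        backend_id['AddInstruction'] = [
            CodeGenTemplate(lambda instr: True,
                            [('MOVE', ['%dst', '%left']),
                             ('ADD', ['%dst', '%dst', '%right'])]),
        ]
        return backend_id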
Ordinarily, you would want to do any customizations required in the ID if
possible, because in that case the customizations could be done once at
load time rather than over and over again for each function.
Unfortunately, not all the tweaks can be done ahead of time.

Each Pynto instruction is mapped to one or more CodeGenTemplate objects.
Each CodeGenTemplate object contains two parts:

1. A condition function: this is just a function that, when provided with
   an actual Instruction instance, determines if this template applies.
   Sometimes multiple CodeGenTemplates are provided for a given
   Instruction type, which indicates that it may be implemented in
   different ways depending on the conditions tested by this function.
   For example, there are two implementations of LoadAttrInstruction: one
   is for when the pointer is of a known type and we can load the
   attribute with a simple dereference; the other uses the helper
   PyObject_GetAttrString().

2. A set of instructions: these are a list of low-level pseudo
   instructions that will be translated to actual machine instructions
   after being put through the pre-filter and emitter of the MD.

A low-level instruction consists of an opcode (which is a constant
string), a list of operands, and possibly a label and/or target. A label
is used to allow an instruction to be targeted by a branch elsewhere. A
target is used when the instruction must move control flow; for example,
a BEQ instruction branches to the target if its two operands are equal.
The complete set of opcodes and their expected parameters, etc., are
described in the next section.

Low Level Pseudo Instructions
-----------------------------

Note that in all cases any operands refer to word-sized values right now.

'BR'
    Unconditional branch. Expects a 'target' which can be either a
    terminal of the instruction (i.e., instr.FALLTHROUGH or some such),
    or a local label.

'BEQ'
    Branch if equal: expects two word-sized operands and tests for
    equality. This is a simple integer comparison (pointer equality).

'BNE'
    Like 'BEQ', but tests for inequality.

'CALL'
    Performs a function call. Expects a list of parameters; the first
    item in the operand list is where the return value will be stored.
    If the return value should be ignored (or there is no return value)
    then the first operand should be the NullOperand. The function to
    call should be specified in the target; this will either be a string
    naming a known library function (like 'PyObject_IsTrue') or a
    knode.JittedCode object. Many MDs break down the CALL instruction
    further, moving the argument setup and return value handling to
    separate instructions and changing the CALL instruction so that it
    simply affects control flow.

'CALLPTR'
    This is not typically used in the instruction definition, but it
    indicates a call by pointer [whereas a CALL is statically linked].
    This would be used for a properly optimized call to a virtual
    function, or, in the case of PPC, in the call to a C library or
    other transition vector style function.

'ADD'
    Takes three operands (dst, left, right). Performs an integer addition
    of left and right and puts the result in dst. dst may be the same as
    left, right, or both.

'SUB'
    Like ADD, but subtracts!

'NOOP'
    No effect, no arguments. Used when a label is needed but no
    instruction is convenient.

'MOVE'
    Takes two operands (dst, src). Moves the value in src into dst. Note
    that on RISC processors this is the only instruction which may have a
    memory argument, and at most one (corresponding to a LOAD or a
    STORE).
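Putting these pieces together, a short block of pseudo instructions might
be modelled roughly as follows. The LowLevelInstr class and the literal
operands here are an illustrative sketch, not the code generator's actual
data structures:

    # A hypothetical in-memory form for one low-level pseudo-instruction.
    class LowLevelInstr:
        def __init__(self, opcode, operands=(), label=None, target=None):
            self.opcode = opcode            # constant string, e.g. 'MOVE'
            self.operands = list(operands)  # word-sized operands
            self.label = label              # lets a branch elsewhere target this
            self.target = target            # where control flow goes, if it moves

    # Pseudo-code for "dst = 1 if a == b else 0", purely for illustration:
    block = [
        LowLevelInstr('BEQ',  ['a', 'b'], target='equal'),
        LowLevelInstr('MOVE', ['dst', 0]),
        LowLevelInstr('BR',   target='done'),
        LowLevelInstr('MOVE', ['dst', 1], label='equal'),
        LowLevelInstr('NOOP', label='done'),
    ]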
Pre-Filter
==========

As described at the beginning of this file, once the global Values have
been allocated, we generate the code for each block, one at a time. The
way that we do this, in more detail, is to begin by walking over the
Instructions in the block. For each Instruction we select an appropriate
CodeGenTemplate from the ID (described in the previous section). The
CodeGenTemplate gives us a list of low level instructions that implement
this particular Instruction.

For each low level instruction, we emit it by first checking to see if
the pre-filter has a method named after the opcode of the instruction (so
prefilter.CALL() would be used to intervene with each 'CALL'
instruction). If such a method exists, we call it. The method is expected
to break the low level instruction down into other instructions or
otherwise tweak it and then pass it to the emitter, described below.
Should the pre-filter not have such a method, we pass the instruction to
the emitter directly.

The most common use of the pre-filter is to implement the calling
convention. For example, the ppc backend defines two pre-filters: one for
CALL and one for LOADPARAM; the CALL method takes the CALL instruction
described above, which contains a listing of arguments etc, and breaks it
apart into a series of MOVE instructions for each parameter, as well as a
CALL instruction which simply performs the call itself, followed by a
MOVE to recover the return value. It might also be possible, for example,
to use a pre-filter to convert arithmetic like ADDs and SUBs from
three-operand to two-operand form, though this could perhaps be done on
the ID directly for better performance. Note that the pre-filter often
calls emitter._emit(), which bypasses the pre-filter.

Emitter
=======

The emitter is entirely table driven. The code that reads the table is in
machinedefn.py; should this architecture be the one I settle on, which is
not clear, I intend to write a code generator for it which reads the
table and generates optimal code so that the actual compilations can be
faster. Doing that ahead of time, of course, would be somewhat premature
optimization, so at the moment the interface is interpreted.

The basic idea of the table is to define what sorts of operands each
opcode can natively handle on this particular architecture. The operands
are broken down into three categories right now: register, memory, and
constant. Register indicates a machine register. Memory indicates an
offset addressing mode: that is, *(reg+offset) where reg is a machine
register. Constant represents an inline constant.

The emitter takes as input an instruction which consists of an opcode and
source/dest operands of any type. The emitter then compares that
instruction to the templates in the table; if a template is found that
matches the types of the operands on the instruction, then the
instruction is added to the list as is. If not, the template which most
closely matches the operands is found and those operands which don't
match are converted appropriately. For example, suppose that on a RISC
machine, which only allows manipulation of registers, we are given an
instruction that adds a stack slot and a register. In that case, we would
find no matching ADD; the closest match for ADD in the table would be the
form which adds two registers, so we would convert the stack slot into a
temporary register.

The form of the table itself is a dictionary with the opcode strings as
the keys and lists of table entries as the values.
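Before turning to the table entries themselves, here is a rough sketch of
the pre-filter mechanism, in the spirit of the ppc CALL expansion
described above. The register names, the (opcode, operands, target)
tuples, and the _emit() interface are assumptions made for illustration,
not the actual Pynto code:

    # Hypothetical calling-convention data for the sketch.
    ARG_REGS = ['r3', 'r4', 'r5', 'r6', 'r7', 'r8', 'r9', 'r10']
    RETVAL_REG = 'r3'

    class CallPrefilter:
        def __init__(self, emitter):
            self.emitter = emitter

        def CALL(self, instr):
            opcode, operands, target = instr
            retval, args = operands[0], operands[1:]
            # Move each argument into the register the convention dictates.
            for reg, arg in zip(ARG_REGS, args):
                self.emitter._emit(('MOVE', [reg, arg], None))
            # The CALL itself now only moves control flow.
            self.emitter._emit(('CALL', [], target))
            # Recover the return value unless it should be ignored
            # (None stands in for the NullOperand in this sketch).
            if retval is not None:
                self.emitter._emit(('MOVE', [retval, RETVAL_REG], None))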
Each entry in this table is a tuple with three parts: the destination
types, the source types, and the byte-emitting instructions. The types of
the destination and source operands are specified by a tuple of integers;
each integer is a bitmask indicating which types of operands are legal.
If you look in machinedefn.optypes, you'll see the possible types of
operands. The only kinds of operands that reach this stage are CONST,
MEM, REG, and TEMP. TEMPs are removed during local resource allocation,
described above, but they should be considered as a REGISTER (or MEM if
the addr-taken flag is set), since that is what they will be replaced
with.
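A toy version of such a table, with made-up bitmask values and
placeholder byte-emitting functions (the real machinedefn.optypes
constants and entry format may differ), might look like this:

    # Hypothetical operand-type flags in the spirit of machinedefn.optypes.
    REG, MEM, CONST = 1, 2, 4

    # opcode -> list of (dest type masks, source type masks, emit function).
    # The emit functions below just return placeholder bytes.
    EMIT_TABLE = {
        'ADD': [
            # This illustrative RISC-like ADD only accepts registers.
            ((REG,), (REG, REG), lambda dst, left, right: b'\x00'),
        ],
        'MOVE': [
            ((REG,), (REG,),   lambda dst, src: b'\x01'),
            ((REG,), (MEM,),   lambda dst, src: b'\x02'),   # a LOAD
            ((MEM,), (REG,),   lambda dst, src: b'\x03'),   # a STORE
            ((REG,), (CONST,), lambda dst, src: b'\x04'),
        ],
    }

    def find_entry(opcode, dest_types, src_types):
        # Return the first entry whose bitmasks accept every operand type,
        # or None if the emitter would first have to rewrite some operands
        # (for example, by loading a MEM operand into a temporary register).
        for dests, srcs, emit in EMIT_TABLE[opcode]:
            if all(mask & t for mask, t in zip(dests, dest_types)) and \
               all(mask & t for mask, t in zip(srcs, src_types)):
                return emit
        return None

    # ADD of a register and a stack slot (MEM) finds no match here:
    print(find_entry('ADD', (REG,), (REG, MEM)))   # -> None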