The Innovatic suggestion for an efficient microprocessor is based on a modified stack architecture. The stack architecture, which may be known from HP calculators and microprocessors such as the ST20, PSC1000 and PicoJava, has many advantages.
With the Innovatic architecture it is completely different. The processor contains both a data stack and a traditional accumulator (A). The accumulator holds one of the operands. The other is taken directly from the main memory, from the lowest stack level (B), as immediate data (#) from the instruction pipe-line, or from a register. The stack size only changes when parentheses in the formula expression make it necessary to push a temporary result on the stack, or when an operation involves the lowest stack level. This increases the efficiency and allows the processor to make do with a fairly small stack. If, for example, you want to add 3 numbers and save the result in two places, in principle only 5 8-bit instructions are needed (Load A, Add B, Add C, Store D, Store E), and in this case the stack size does not change at all! A traditional stack oriented processor needs 8 instructions (Load A, Load B, Add, Load C, Add, Push, Store D, Store E), and the stack size changes in every instruction!
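The instruction counts above can be checked with a small simulation. This is only a sketch: memory is modelled as a Python dict, the names A to E are memory cells as in the example, and the "Push" step of the traditional sequence is assumed to duplicate the top of the stack so that the result can be stored twice. The function names are hypothetical.

```python
def run_accumulator(memory):
    """Accumulator style: Load A, Add B, Add C, Store D, Store E."""
    program = [("load", "A"), ("add", "B"), ("add", "C"),
               ("store", "D"), ("store", "E")]
    acc = 0
    stack_changes = 0                      # the data stack is never touched here
    for op, cell in program:
        if op == "load":
            acc = memory[cell]
        elif op == "add":
            acc += memory[cell]
        elif op == "store":
            memory[cell] = acc
    return len(program), stack_changes


def run_pure_stack(memory):
    """Traditional stack style: Load A, Load B, Add, Load C, Add, Push, Store D, Store E."""
    program = [("load", "A"), ("load", "B"), ("add", None), ("load", "C"),
               ("add", None), ("dup", None), ("store", "D"), ("store", "E")]
    stack = []
    stack_changes = 0
    for op, cell in program:
        depth_before = len(stack)
        if op == "load":
            stack.append(memory[cell])
        elif op == "add":
            stack.append(stack.pop() + stack.pop())
        elif op == "dup":                  # assumption: "Push" duplicates the top
            stack.append(stack[-1])
        elif op == "store":
            memory[cell] = stack.pop()
        if len(stack) != depth_before:
            stack_changes += 1
    return len(program), stack_changes
```

Each function returns the number of instructions executed and the number of instructions that changed the stack depth.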
Execution Units
The processor contains two execution units, which are able to work in parallel: one for data and one for addresses. Each unit has its own stack(s). The data unit has one stack for calculations. The address unit has two stacks: a general purpose stack and an 8-level address register stack.
The general purpose stack is primarily intended for holding return addresses and status register content for subroutine calls and interrupt processing, but it may also be used for data. In addition, the lowest level (S) of this stack may be used as a predecrement counter in loops (DBZ - Decrement Branch on Zero, DBNZ - Decrement Branch on Non Zero). This is very convenient for nested count loops.
The address register stack is intended for memory addressing. The 8 levels are referred to as R0 to R7. The registers are divided into 4 pairs (R0/R1, R2/R3, R4/R5 and R6/R7), where each pair consists of a base register (R1, R3, R5 and R7) and an index register (R0, R2, R4 and R6). The index registers may be automatically predecremented, -(Rn), or postincremented, (Rn)+. Most addressing in the processor is performed by means of a base register and an offset.
The data stack and the general purpose stack are only accessible through the lowest level (B for the data stack and S for the general purpose stack), but all 8 levels of the address register stack are also directly accessible. A push to R0 (PushR) pushes the stack up and a pop to R0 (PopR) pulls it down. If an overflow should occur, the last level (R7) is automatically pushed on the general purpose stack.
All directly accessible levels of the stacks, that is, the lowest level of the data stack, the lowest level of the general purpose stack and the entire address register stack, are always held internally in the CPU, but advanced processors may also hold more stack levels internally. If an overflow should occur in the internal data or general purpose stack, it is automatically spilled to, and later refilled from, the external memory.
To be able to do this, each of these stacks has its own stack pointer, which consists of two parts: a base address register and a displacement register. The base address register points to the beginning of the external stack in memory. The displacement register holds the offset, which shows the stack size no matter how big the internal stacks are! Therefore, some of the first entries in the external stacks may not be utilized, depending on the size of the internal stacks.
As a supplement to the accumulator and the 3 stacks there are up to 248 (256 minus R0-R7) registers, which are divided into various groups. The number of registers in the groups marked with an (*) may vary depending on the processor model.
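The PushR and PopR behaviour of the address register stack can be pictured with the following sketch. Several details are assumptions made only for illustration: the class name, the counter of occupied levels, the use of a plain Python list as a stand-in for the general purpose stack, and the refill of R7 from that stack on PopR (the text only describes the spill direction).

```python
class AddressRegisterStack:
    """Toy model of the 8-level address register stack R0..R7."""

    LEVELS = 8

    def __init__(self):
        self.r = [0] * self.LEVELS     # r[0] is R0, r[7] is R7
        self.used = 0                  # assumed: number of occupied levels
        self.general_stack = []        # stand-in for the general purpose stack

    def push_r(self, value):
        """PushR: the new value enters at R0 and every level moves up by one."""
        if self.used == self.LEVELS:
            self.general_stack.append(self.r[7])   # overflow: R7 is spilled
        else:
            self.used += 1
        self.r = [value] + self.r[:7]

    def pop_r(self):
        """PopR: R0 is returned and every level moves down by one."""
        value = self.r[0]
        if self.general_stack:
            refill = self.general_stack.pop()      # assumed refill of R7
        else:
            refill = 0
            self.used = max(0, self.used - 1)
        self.r = self.r[1:] + [refill]
        return value
```

Pushing nine values in a row fills R0 to R7 and moves the oldest value to the general purpose stack; the next PopR brings it back into R7 under the refill assumption stated above.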
The data ALU consists of two serially connected units, so that the processor is able to fetch data from the memory by means of predecrement or postincrement addressing, multiply them by an immediate (#) coefficient and then add the result to the contents of the accumulator (A) in only one 8-bit instruction (Mul#Add). In this way, it is possible to implement digital filters as efficiently and as fast as on many dedicated digital signal processors (DSPs), especially if the processor uses a Harvard architecture, where the data and the coefficients are fetched simultaneously from two different memory areas (data and program)! The data ALU may also be split up into 2-4 units so that it is able to process e.g. an entire pixel or a stereo signal at a time. This is called vector processing.
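As a rough illustration of the multiply-accumulate step described above, one output sample of a FIR filter can be written so that each loop pass corresponds to one Mul#Add instruction. The function name and the use of Python floats are assumptions for the sketch; the real processor works on fixed-width words.

```python
def fir_output(samples, coefficients):
    """Compute one FIR output sample as a chain of multiply-accumulate steps."""
    acc = 0.0                          # plays the role of the accumulator (A)
    index = 0                          # plays the role of a postincremented index register
    for coeff in coefficients:         # coeff plays the role of the immediate (#) operand
        acc += samples[index] * coeff  # one Mul#Add: fetch, multiply, add to A
        index += 1                     # postincrement, (Rn)+
    return acc
```

With a Harvard architecture the sample would come from data memory and the coefficient from program memory, so both fetches in one loop pass could happen at the same time.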
Data Representation
The Innovatic processor architecture is based on the big-endian model (also known as the Motorola model), where the most significant byte is stored at the lowest address. For a 32 bit system it looks like this:
The data representation in the data ALU and the data stack is very untraditional in that all data are left shifted, with the most significant bit as bit 0, but this is the full consequence of the big-endian model!
The data representation is also very untraditional in that all data are regarded as being in the range from -1 to 1 (two's complement). The untraditional architecture gives a lot of advantages:
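The following sketch is not taken from the original list of advantages; it only illustrates what a left shifted fraction in the range from -1 to 1 means. It assumes plain two's complement words and shows that a narrow word keeps its value when it is left-aligned in a wider word. The function names are hypothetical.

```python
def to_fraction(word, bits):
    """Interpret a `bits`-wide two's complement word as a fraction in [-1, 1)."""
    if word >= 1 << (bits - 1):        # sign bit set: the value is negative
        word -= 1 << bits
    return word / (1 << (bits - 1))


def widen(word, from_bits, to_bits):
    """Left-align a narrow word in a wider one by appending zero bits on the right."""
    return word << (to_bits - from_bits)
```

For example, the 8-bit word 0x40 and the 32-bit word 0x40000000 both stand for 0.5, so data of different widths can be mixed without rescaling.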
Floating Point
A 32 or 64 bit processor either fully or partially supports the following 32-bit floating point format:
A 64-bit processor may also support the following 64-bit floating point format:
Note that in both cases the mantissa is a completely normal left shifted signed number, which is simply always in the range from ±0.5 to ±1. This is also rather untraditional. Traditional floating point formats scale the mantissa so that it is between 1 and 2. To utilize the range as much as possible, these formats therefore have an implied 1 to the left of the radix point and a separate sign bit. This makes the calculations more difficult and has the very big disadvantage that it is not possible to represent the number 0 without making exceptions (all 0's means the value 0)! With the Innovatic format the data are in standard two's complement notation, so that it is much easier to do calculations. Because of the left shifted architecture, ongoing calculations may be performed in full width once the exponent has been stripped off.
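A mantissa normalized to the range from ±0.5 to ±1 is the same convention that Python's `math.frexp` uses, so the idea can be sketched as below. The mantissa width of 24 bits is a hypothetical value chosen for the example, since the actual field layout is not reproduced here, and the function names are also assumptions.

```python
import math

MANTISSA_BITS = 24   # hypothetical width, not taken from the format description


def split(value):
    """Return (mantissa_word, exponent), the mantissa being a two's complement fraction."""
    fraction, exponent = math.frexp(value)       # 0.5 <= |fraction| < 1, or 0.0 for zero
    scaled = int(fraction * (1 << (MANTISSA_BITS - 1)))   # truncates surplus bits
    return scaled & ((1 << MANTISSA_BITS) - 1), exponent


def join(mantissa_word, exponent):
    """Rebuild the value from a two's complement mantissa word and an exponent."""
    if mantissa_word >= 1 << (MANTISSA_BITS - 1):          # sign bit set
        mantissa_word -= 1 << MANTISSA_BITS
    return math.ldexp(mantissa_word / (1 << (MANTISSA_BITS - 1)), exponent)
```

In this sketch zero needs no special case: `split(0.0)` gives an all-zero mantissa and a zero exponent, and the sign is carried by the mantissa itself instead of a separate sign bit.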
Address ALU and
The address ALU uses normal right shifted integer arithmetic, but still with bit 0 as the most significant bit.
Data from the lowest data stack level B or from a register (R0-R255) may be addressed directly. These data always have the full size of the registers, e.g. 32 bit on a 32 bit processor.
For branches and calls, the jump address is calculated as the sum of the content of a base register (PC, R1, R3, R5, R7 or A) and a 24 bit immediate displacement.
Immediate data (#) are addressed relative to the PC or taken from a register file. Because the PC is automatically postincremented, they are actually addressed as (PC)+.
Data from the memory are addressed by means of the contents of the lowest data stack level B, or relatively as the sum of a base address (Rb) and a displacement. The base address is contained in the register R1, R3, R5 or R7. The displacement is contained in the register R0, R2, R4 or R6, but in one addressing mode it may be further extended with an 8 bit immediate displacement, so that the address is calculated as a sum of 3 numbers. The 8 registers are grouped in pairs so that the base register for R0 is R1, the base register for R2 is R3 and so on. As previously mentioned, the displacement registers may be automatically predecremented or postincremented.
To make it easier to implement cyclic buffers and digital filters, and to ensure that stacks do not destroy other data areas in case of an overflow, the length (number of bits) of all displacement registers, including the displacement registers of the two stack pointers, may be programmed. This is done in the most significant 5 bits of the displacement register (bit 0-4). If, for example, one wishes to program a length of 1024 bytes, the number should be set to 10. Because programming a length of 0 has no meaning, the number 0 is defined to mean maximum length, that is, 11 bits on a 16 bit processor, 27 bits on a 32 bit processor and 59 bits on a 64 bit processor.
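The programmable displacement length can be sketched for a 32 bit processor as follows. The helper names are hypothetical, and it is an assumption of this sketch that the bits between the length field and the active offset bits are left unchanged when the offset wraps.

```python
WORD_BITS = 32
FIELD_BITS = 5                         # length field in the 5 most significant bits
MAX_LEN = WORD_BITS - FIELD_BITS       # 27 bits on a 32 bit processor


def make_displacement(length_bits, offset=0):
    """Pack a length (in bits) and an offset into one displacement register value."""
    return (length_bits << MAX_LEN) | offset


def effective_length(reg):
    """Read the length field; 0 means maximum length."""
    n = reg >> MAX_LEN
    return MAX_LEN if n == 0 else n


def step_displacement(reg, step=1):
    """Postincrement (step > 0) or predecrement (step < 0) with wrap-around."""
    mask = (1 << effective_length(reg)) - 1
    offset = (reg + step) & mask       # wraps inside the 2**n byte window
    return (reg & ~mask) | offset


def effective_address(base, reg, immediate=0):
    """Sum of base register, displacement offset and optional 8 bit immediate."""
    mask = (1 << effective_length(reg)) - 1
    return base + (reg & mask) + immediate
```

With a length of 10 the offset runs from 0 to 1023 and then wraps to 0 again, which gives a 1024 byte cyclic buffer without any explicit bounds check in the program.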
In total the processor has the following 16 addressing modes:
For immediate data and data from the memory, the two most significant bits of the address (bit 0 and 1) are used to determine the size of the operand in the following way:
For immediate data (#) taken from the program, it is the two most significant bits of the program counter (PC) which determine the size. These bits may be set up by means of special instructions. As the program counter is stored on the stack in case of interrupt processing or subroutine calls, the state is reestablished automatically on return. Immediate data from the register file are always used in full length.
Instruction Pipe-line
The processor may be based on a 9 state pipe-line where 4 states are
executed on the rising edge of the clock and the last 5 on the falling
edge.
Note that the address is read out quite early (with the address strobe), so that the memory has time to fetch the data before they are going to be used (Data read). This is a great advantage, especially with dynamic RAM, where it takes a lot of time to get the first data at random access. With this architecture a simple and cheap 66 MHz processor with 33 MHz EDO DRAM may perform as well as much faster traditional systems, especially with very big data areas where there are a lot of cache misses. Traditional processors usually only discover that the data is not present in the cache when the data is needed, and the processor is then stalled until the data is ready. Even though the architecture makes it possible to do without a cache, the processor ought to include at least a very simple instruction cache for several reasons:
Fast Data Transfer
By means of the historical buffer (cache) it is possible to execute small program loops without fetching the same instructions more than once. This is utilized for a kind of intelligent DMA. The processor has several data strobes, but only one address bus and one data bus. To transfer data from e.g. a peripheral unit to the memory, the memory is first addressed as shown under Instruction Pipe-line. After that, a read strobe is issued to the peripheral unit together with a write strobe to the memory. In this way, the data are transferred in the fast fly-by mode, but they are at the same time compared to the accumulator, so that the transfer may e.g. be terminated when a stop character is recognized. Because the transfer is controlled by a program and not a DMA controller, it may be made as intelligent as one wishes. Because the program only has to be fetched once, and the processor is usually so much faster than the memory that it has time for a few instructions for each data transfer, this way of transfer is as fast as normal DMA, but much more intelligent.
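The program-controlled transfer described above can be pictured as a short loop. This is only a sketch: the peripheral is modelled as a Python iterable, the memory as a `bytearray`, and it is assumed that the stop character itself is also written to memory, since the data are compared while they are being transferred. The function name is hypothetical.

```python
def fly_by_transfer(peripheral, memory, start, stop_char):
    """Copy items from the peripheral to memory until stop_char has been transferred."""
    address = start
    for byte in peripheral:            # read strobe to the peripheral unit
        memory[address] = byte         # write strobe to the memory in the same step
        address += 1                   # postincrement of the memory address
        if byte == stop_char:          # comparison against the accumulator
            break
    return address - start             # number of items transferred
```

Because the termination test is ordinary program code, it could be replaced by any other condition, which is what makes this kind of transfer more flexible than a fixed-length DMA block.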
Interrupt
The processor contains an 8-bit counter for interrupt enabling and disabling. The counter is incremented at interrupt disable and decremented at interrupt enable. Interrupts are only enabled when the counter is 0. The counter cannot be incremented beyond 255 or decremented below 0. The counter makes it possible to disable and enable interrupts without having to worry about the previous state.
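The counting scheme can be sketched as a small saturating counter. The class and method names are hypothetical; only the limits 0 and 255 and the rule that interrupts are enabled at 0 come from the description above.

```python
class InterruptGate:
    """Saturating 8-bit counter that gates interrupts."""

    def __init__(self):
        self.count = 0                              # 0 means interrupts are enabled

    def disable(self):
        self.count = min(255, self.count + 1)       # cannot go beyond 255

    def enable(self):
        self.count = max(0, self.count - 1)         # cannot go below 0

    @property
    def enabled(self):
        return self.count == 0
```

A routine that disables interrupts and later enables them again leaves the counter as it found it, so nested critical sections work without saving the previous state.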
PS
The high code density and the simple architecture, without the need for a floating point unit, a DMA controller or a complex cache, reduce the number of transistor elements in the system. Together with the high efficiency, which reduces the need for a high clock frequency, this makes it possible to obtain the required performance in a very cheap system, which at the same time gets a very high reliability - a typical Innovatic solution - simple, efficient and reliable.