
The Innovatic proposal for an efficient microprocessor is based on a modified stack architecture. The stack architecture has many advantages, which may be familiar from HP calculators and from microprocessors like ST20, PSC1000 and PicoJava.

  • Instructions do not need a source and destination description. This reduces the instruction length to almost half that of traditional RISC processors. Traditional 32-bit RISC processors typically use 15 bits to address two source operands and one destination, and traditional 16-bit RISC processors typically use 8 bits for source and destination addressing. The instruction length of the Innovatic architecture is only 8 bits (plus any immediate data)!

  • There is no data to save and restore during interrupts or subroutine calls.

  • By means of the stack it is easy to transfer any number of data items to or from a subroutine. This way of transferring data is also used in programming languages like Java, PostScript, Forth and C, which makes it easy to support these languages.

  • All subroutines automatically become reentrant, so that they may call themselves. This makes it easy to implement interpreters, soft PLCs etc.

  • There are no registers to manage. This makes it easy for a compiler to utilize the instruction set 100 %.

However, traditional stack oriented processors have the disadvantage that the stack is involved in all operations. First the stack is pushed up and the data is loaded into the lowest level. Then the operation is performed on the lowest two levels and the stack is pulled down again. The stack size jumps up and down for each operation like a piston in an engine. As long as the stack fits inside the processor this may not be the biggest problem, but if the stack grows beyond that limit, the last level must be spilled to and refilled from the main memory for almost every instruction. This lowers the efficiency a lot. Therefore, the stack has to be quite big, which makes the processor expensive and power consuming.

With the Innovatic architecture it is completely different. The processor contains both a data stack and a traditional accumulator (A). The accumulator holds one of the operands. The other is taken directly from the main memory, from the lowest stack level (B), as immediate data (#) from the instruction pipe-line or from a register. The stack size only changes when parentheses in an expression make it necessary to push a temporary result on the stack, or for operations which involve the lowest stack level. This increases the efficiency and allows the processor to make do with a fairly small stack. If you want to, for example, add 3 numbers together and save the result in two places, in principle only 5 8-bit instructions are needed - Load A, Add B, Add C, Store D, Store E - and in this case the stack size does not change at all! A traditional stack oriented processor needs 8 instructions - Load A, Load B, Add, Load C, Add, Push, Store D, Store E - and the stack size changes with every instruction!
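The comparison above can be sketched as a small simulation. The two toy interpreters below are hypothetical illustrations written for this purpose; their function names, the tuple encoding of instructions and the use of a Python dictionary as memory are assumptions and are not part of the Innovatic instruction set. "Push" is interpreted here as duplicating the top of the stack so that the result can be stored twice.

```python
# Hypothetical sketch: counts instructions and stack-size changes for
# D = E = A + B + C on two toy machines. Not the real instruction set.

def run_accumulator(program, mem):
    """Accumulator machine: one operand in acc, the other from memory."""
    acc = 0
    depth_changes = 0              # the data stack is never touched here
    for op, name in program:
        if op == "Load":
            acc = mem[name]
        elif op == "Add":
            acc += mem[name]
        elif op == "Store":
            mem[name] = acc
        else:
            raise ValueError(op)
    return len(program), depth_changes


def run_stack(program, mem):
    """Pure stack machine: operands always pass through the stack."""
    stack = []
    depth_changes = 0
    for op, name in program:
        before = len(stack)
        if op == "Load":
            stack.append(mem[name])
        elif op == "Add":
            stack.append(stack.pop() + stack.pop())
        elif op == "Push":         # assumed meaning: duplicate the top
            stack.append(stack[-1])
        elif op == "Store":
            mem[name] = stack.pop()
        else:
            raise ValueError(op)
        if len(stack) != before:
            depth_changes += 1
    return len(program), depth_changes


ACC_PROGRAM = [("Load", "A"), ("Add", "B"), ("Add", "C"),
               ("Store", "D"), ("Store", "E")]
STACK_PROGRAM = [("Load", "A"), ("Load", "B"), ("Add", None),
                 ("Load", "C"), ("Add", None), ("Push", None),
                 ("Store", "D"), ("Store", "E")]
```

Running both programs on the same input gives the same values in D and E, but the accumulator version needs 5 instructions with no change in stack size, while the stack version needs 8 instructions and changes the stack size in every one of them.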


Execution Units (ALUs)

The processor contains two execution units, which are able to work in parallel - one for data and one for address. Each unit has its own stack(s). The data unit has one stack for calculations. The address unit has two stacks - a general purpose stack and an 8-level address register stack.

The general purpose stack is primarily intended for holding return addresses and status register contents for subroutine calls and interrupt processing, but it may also be used for data. In addition, the lowest level (S) of this stack may be used as a predecrement counter in loops (DBZ - Decrement Branch on Zero, DBNZ - Decrement Branch on Non Zero). This is very convenient for nested count loops.

The address register stack is intended for memory addressing. The 8 levels are referred to as R0 to R7. The registers are divided into 4 pairs (R0/R1, R2/R3, R4/R5 and R6/R7), where each pair consists of a base register (R1, R3, R5 and R7) and an index register (R0, R2, R4 and R6). The index registers may be automatically predecremented -(Rn) or postincremented (Rn)+.

Most addressing in the processor is performed by means of a base register and an offset.

The data stack and the general purpose stack are only accessible through the lowest level (B for the data stack and S for the general purpose stack), but all 8 levels of the address register stack are also directly accessible. A push to R0 (PushR) pushes the stack up and a pop to R0 (PopR) pulls it down. If an overflow should occur, the last level (R7) is automatically pushed on the general purpose stack.
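The behaviour of the address register stack can be sketched as follows. This is only a model under stated assumptions: the class and method names are hypothetical, overflow is assumed to be detected by counting the levels in use, and nothing is refilled on PopR because the text does not describe that case.

```python
class AddressRegisterStack:
    """Toy model of the 8-level address register stack (R0 to R7)."""

    LEVELS = 8

    def __init__(self):
        self.regs = []       # regs[0] is R0, regs[7] is R7
        self.general = []    # stands in for the general purpose stack

    def push_r(self, value):
        """PushR: push the stack up and place the value in R0."""
        if len(self.regs) == self.LEVELS:
            # Overflow: R7 is pushed onto the general purpose stack.
            self.general.append(self.regs.pop())
        self.regs.insert(0, value)

    def pop_r(self):
        """PopR: take the value in R0 and pull the stack down."""
        return self.regs.pop(0)

    def reg(self, n):
        """Direct access to level Rn, as all 8 levels are accessible."""
        return self.regs[n]
```

Pushing a ninth value moves the oldest entry from R7 to the general purpose stack, while all other levels remain directly addressable as R0 to R7.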

All directly accessible levels of the stacks, that is, the lowest level of the data stack, the lowest level of the general purpose stack and the entire address register stack, are always held internally in the CPU, but advanced processors may also hold more stack levels internally. If an overflow should occur in the internal data or general purpose stack, they are automatically spilled to and refilled from the external memory. To be able to do this, each of these stacks has its own stack pointer, which consists of two parts - a base address register and a displacement register. The base address register points to the beginning of the external stack in memory. The displacement register holds the offset, which shows the stack size no matter how big the internal stacks are! Therefore, some of the first entries in the external stacks may not be utilized, depending on the size of the internal stacks.

As a supplement to the accumulator and the 3 stacks there are up to 248 (256 minus R0-R7) registers which are divided into various groups. The number of registers in the groups marked with an (*) may vary depending on the processor model.

  • The program counter (PC) which points to the next instruction.
  • The status register (SR).
  • The two stack pointer pairs.
  • General purpose RAM area for holding global variables like the base address for libraries or memory areas (*).
  • Various setup registers (*).
  • Registers with fixed (read only) contents like the numbers -64 to 64, 1/PI=0.3183, constants for rounding etc. (*).
  • A wrap-around register file which may be used for immediate data instead of the usual immediate data from the program (*).

The data ALU consists of two serially connected units, so that the processor is able to fetch data from the memory by means of predecrement or postincrement addressing, multiply them by an immediate (#) coefficient and then add the result to the contents of the accumulator (A) in only one 8-bit instruction (Mul#Add). In this way, it is possible to implement digital filters as efficiently and as fast as many dedicated digital signal processors (DSPs) - especially if the processor uses a Harvard architecture where the data and the coefficients are fetched simultaneously from two different memory areas (data and program)! The data ALU may also be split up into 2-4 units so that it is able to process e.g. an entire pixel or a stereo signal at a time. This is called vector processing.
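The multiply-and-add step described above is the inner loop of a FIR filter. The sketch below models it in Python under several assumptions: a 32-bit word with 31 fraction bits, truncation of the product, and the hypothetical helper names `to_fixed`, `to_float`, `mul_add` and `fir`. It illustrates the arithmetic only and does not model the timing of the real Mul#Add instruction.

```python
FRAC_BITS = 31  # assumed: 32-bit word, sign bit plus 31 fraction bits


def to_fixed(x):
    """Convert a float in the range -1 <= x < 1 to a signed fraction."""
    return int(round(x * (1 << FRAC_BITS)))


def to_float(v):
    """Convert a signed fraction back to a float for display."""
    return v / (1 << FRAC_BITS)


def mul_add(acc, sample, coeff):
    """One multiply-and-add step: acc + sample * coeff on fractions."""
    return acc + ((sample * coeff) >> FRAC_BITS)


def fir(samples, coeffs):
    """Sum of samples[i] * coeffs[i], one mul_add per coefficient."""
    acc = 0
    index = 0                    # stands in for an index register
    for coeff in coeffs:         # coefficients play the role of # data
        acc = mul_add(acc, samples[index], coeff)
        index += 1               # models postincrement addressing (Rn)+
    return acc
```

With the samples 0.5, 0.25 and -0.5 and the coefficients 0.5, 0.5 and 0.25, the three steps add 0.25, 0.125 and -0.125, so the accumulator ends at 0.25.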


Data Representation

The Innovatic processor architecture is based on the big-endian model (also known as the Motorola model) where the most significant byte is stored in the lowest address. For a 32 bit system it looks like this:

Addresses 8-11: Byte, Addr.=8; Byte, Addr.=9; Byte, Addr.=10; Byte, Addr.=11. The Word at Addr.=7 and the LongWord at Addr.=7 both extend into this row.
Addresses 4-7: Byte, Addr.=4; Byte, Addr.=5; Byte, Addr.=6; Byte, Addr.=7. Word, Addr.=5; Word, Addr.=7. LongWord (LW), Addr.=7.
Addresses 0-3: MSB, Addr.=0; Byte, Addr.=1; Byte, Addr.=2; Byte, Addr.=3. MSW, Addr.=0; Word, Addr.=2. MSLongWord, Addr.=0.

The data representation in the data ALU and the data stack is very unconventional in that all data are left shifted, with the most significant bit as bit 0, but this is the full consequence of the big-endian model!

MSb = bit 0 bit 1 bit 2  · · · · · · · ·  LSb = bit n

The data representation is also very unconventional in that all data are regarded as being in the range from -1 to 1 (two's complement).
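A minimal sketch of this representation, assuming a 32-bit register and using the hypothetical helper names `left_align` and `as_fraction`: a signed value of any width is placed at the most significant end of the word and is then read as a fraction between -1 and 1.

```python
WORD_BITS = 32  # assumed register width


def left_align(value, width):
    """Place a signed `width`-bit value at the most significant end."""
    return value << (WORD_BITS - width)


def as_fraction(word):
    """Interpret a left shifted signed word as a number from -1 to 1."""
    return word / (1 << (WORD_BITS - 1))
```

For example, the 10-bit signed value 256 and the 12-bit signed value 1024 both end up as the same word and both read as 0.5, and the most negative 10-bit value -512 reads as -1.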

This unconventional representation has many advantages:

  • There is no need for sign extension when data are loaded. This simplifies the instruction set and saves instruction codes.

  • If the registers are at least one bit wider than the data, it is easy to convert from unsigned to signed numbers without losing accuracy, because the data may be shifted one place to the right without being truncated. This makes it possible to do all calculations in signed arithmetic, which saves instruction codes and makes the program easier to follow. Unsigned bytes are automatically loaded into bits 1-8.

  • It is possible to change the resolution of an A/D or D/A converter without changing the program. With a 10-bit converter, bits 10 and up are always zero, but if the converter is replaced by e.g. a 12-bit converter, bits 10 and 11 will simply carry valid information. With traditional processors the program has to be rewritten and checked for any overflow in the calculations. It is also very easy to read out and display the same data with different resolutions.

    An A/D converter does not actually read an absolute value, but a ratio between the input voltage and a reference voltage. For a 10-bit plus sign converter this ratio is -1024/1024 to 1023/1024, or -1 to 0.999.

  • It is possible to calculate ratios without losing accuracy. One of the most common uses of a division is to calculate the ratio between two numbers, which may be equal. For example, a measured value (M) is often calculated as the ratio between an input value (I) and a reference (R), that is, M=I/R, where I<=R. The result of this division will be in the range from -1 to 1. This fits perfectly with the data representation on the left shifted processor, where 1 is the biggest number; but on a traditional right shifted processor, where 1 is the smallest number, it creates big problems, which often lead to the calculation being done in floating point. However, it is best to avoid floating point operations. Even on extremely fast RISC processors with a dedicated floating point unit they are usually at least 3 times slower than fixed point operations, and a floating point unit makes the processor much more expensive.

  • A multiplication will never generate an overflow except for the multiplication -1 x -1, which causes overflow because it is not possible to represent the number 1 in two's complement notation. However, because of the left shifted architecture the approximation becomes quite good, as the full processor width is utilized. On a 32-bit processor the result will be 0.9999999995, which is good enough for any practical application. The overflow flag is therefore not set.

  • One type of multiplication is enough for everything - regardless of the width of the operands. Traditional right shifted processors need several types and often several successive operations. If you want to multiply two numbers on a traditional 32-bit processor, and one or both are wider than 16 bits, it creates problems - especially if the processor is only provided with a 16x16 bit multiplier. In many cases it may be necessary to shift the data before the multiplication, or to assemble the result from several registers and perhaps several calculations.

  • For most applications it is possible to avoid floating point operations altogether, because the left shifted data are actually already in a kind of floating point format. The left shifted data solve the traditional digital signal processor (DSP) dilemma, because they make it possible to implement digital filters with fixed point speed, but at the same time with higher accuracy than floating point - 32 bits against 24 bits for a 32-bit processor! The world seen from a microprocessor is not floating point but fixed point, because the data are read in and out through an A/D or D/A converter. Besides, the data are scaled as far to the left as possible to utilize the converter range as much as possible. Therefore, there is usually no need for multiplication or division with big integers (which a traditional microprocessor is very good at), but there is much more need for multiplications with coefficients in the range from -1 to 1. Examples of this are digital filters and regulators, where the coefficients and factors (PID) are in exactly this range. For digital image processing you may e.g. load 8-bit unsigned video data (bits 1-8), multiply them by 16-bit filter coefficients (bits 0-15), sum the result with 25-bit accuracy (bits 0-24) and finally save the result as 8-bit unsigned video data (bits 1-8). It is not necessary to perform any shift operations or to keep track of the width of the various operands. This is of course only possible if the registers are sufficiently wide. Therefore 32-bit registers are preferred, but 16-bit registers may also be used in low-cost processors.

  • Floating point operations may be necessary for big numerical calculations, but with left shifted data they are easier to perform because the mantissa is already left shifted (in the range +/-0.5 to +/-1).

  • It is very fast and easy to convert from binary to BCD - even for very long words like e.g. 64 bits. This is simply done by multiplying the value by 10, which may be performed by means of 3 left shift operations and one addition. What is to the left of the implied radix point between bit 0 and 1 (bit 0 is the sign bit) is the most significant BCD digit, and what is to the right is the remainder. The process is then repeated with the remainder until there are no more digits or no more accuracy is needed. With traditional right shifted integers a difficult and very time consuming division by 10 is needed for each digit, and the least significant digit comes first.

    The opposite situation occurs during conversion from BCD to binary. However, a fast conversion from binary to BCD is far more important than the other way around. For example, a SCADA (Supervisory Control and Data Acquisition) system for electronic process control may need to convert several hundred or even several thousand values per second to keep its displays updated, but conversion from BCD to binary may only be needed when the user types in some data, which may only happen a few times a day.

  • The processor may be provided with different multiplication units without the need to rewrite the program. A simple processor (32 bit) may just have a unit, which multiplies two 16 bit numbers and provides a 32 bit result. A more complex processor may utilize the upper 24 bits, but still provide a 32 bit result, and finally the processor may of course utilize all 32 bits. Because the multiplier should never provide more than a 32 bit result it is fairly simple, and even if the program is run on a processor with a simple 16x16 bit multiplier, the loss in accuracy in case of data wider than 16 bit is fairly limited.

  • In many cases programs written for a 32-bit processor may be executed directly on a 16-bit processor with only a small loss of accuracy. This is not possible on traditional right shifted microprocessors.

  • The architecture is perfect for algorithms which use data in the range from -1 to 1. This is e.g. the case for probability calculations, some image processing algorithms etc.

The disadvantage of regarding all data as being in the range from -1 to 1 is that the numbers do not represent physical values. On the other hand, it is usually advantageous not to scale the data until display time. For example, a digital filter or a digital regulator normally needs to perform thousands of iterations per second, whereas it may only be necessary to display the data approximately 4 times per second. Therefore, it is a waste of time - and accuracy - to perform the scaling in the calculation routine. Because all data are in the range from -1 to 1, the scaling is very easy to perform: it is only necessary to multiply by a suitable factor in the range 0.1 to 1, and then put the decimal point in the right position after the data have been converted to BCD (which is also a very simple matter). On a traditional right shifted processor such a scaling usually requires both a multiplication and a division.
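The binary to BCD conversion mentioned above can be sketched as follows for a non-negative fraction. The sketch assumes a 32-bit word with 31 fraction bits to the right of the radix point, and the function name `fraction_to_bcd` is hypothetical. Python integers are unbounded, so the digit to the left of the radix point is simply read from the bits above the fraction; a real register would need room for those bits.

```python
FRAC_BITS = 31                      # assumed: bits to the right of the radix point
FRAC_MASK = (1 << FRAC_BITS) - 1


def fraction_to_bcd(word, digits):
    """Return the first `digits` decimal digits of a non-negative fraction."""
    result = []
    for _ in range(digits):
        # Multiply by 10 using three single-bit left shifts and one addition:
        # ((x << 2) + x) << 1 == x * 10
        word = ((word << 2) + word) << 1
        result.append(word >> FRAC_BITS)   # digit to the left of the radix point
        word &= FRAC_MASK                  # keep the remainder
    return result
```

For the fraction 0.375 this yields the digits 3, 7 and 5, most significant digit first, without any division.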

Floating Point

A 32 or 64 bit processor either fully or partially supports the following 32-bit floating point format:

Signed 24 bit Mantissa (bit 0-23) 8-bit Exponent

A 64-bit processor may also support the following 64-bit floating point format:

Signed 52 bit Mantissa (bit 0-51) 12-bit Exponent

Note that in both cases the mantissa is a completely normal left shifted signed number, which is simply always in the range from +/-0.5 to +/-1. This is also rather unconventional. Traditional floating point formats scale the mantissa in such a way that it is between 1 and 2. To utilize the range as much as possible, these formats therefore have an implied 1 to the left of the radix point and a separate sign bit. This makes the calculations more difficult and has the very big disadvantage that it is not possible to represent the number 0 without making exceptions (all 0's means the value 0)! With the Innovatic format the data are in standard two's complement notation, so that it is much easier to do calculations. Because of the left shifted architecture, intermediate calculations may be performed in full width once the exponent has been stripped off.
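The 32-bit format can be illustrated with a small pack and unpack routine. Several details below are assumptions, because the text does not specify them: the exponent is taken to be an 8-bit two's complement number stored in the 8 least significant bits, the mantissa is truncated rather than rounded, and the function names `pack` and `unpack` are hypothetical. Python's `math.frexp` happens to return a mantissa with a magnitude from 0.5 up to 1, which matches the mantissa range described above.

```python
import math

MANT_BITS = 24   # signed mantissa, sign bit plus 23 fraction bits
EXP_BITS = 8     # assumed: two's complement exponent in the low 8 bits


def pack(x):
    """Pack a float into a 32-bit word: mantissa on top, exponent below."""
    m, e = math.frexp(x)                       # 0.5 <= |m| < 1, or m == 0
    if not -128 <= e <= 127:
        raise OverflowError("exponent does not fit in 8 bits")
    mant = int(m * (1 << (MANT_BITS - 1)))     # truncated two's complement fraction
    return ((mant & 0xFFFFFF) << EXP_BITS) | (e & 0xFF)


def unpack(word):
    """Recover the float value from a packed 32-bit word."""
    mant = (word >> EXP_BITS) & 0xFFFFFF
    if mant & 0x800000:                        # sign bit of the mantissa
        mant -= 1 << MANT_BITS
    e = word & 0xFF
    if e & 0x80:                               # sign bit of the exponent
        e -= 1 << EXP_BITS
    return math.ldexp(mant / (1 << (MANT_BITS - 1)), e)
```

Under these assumptions the value 0 packs to a word of all zeros without any special case, because the mantissa is an ordinary two's complement number.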


Address ALU and Address Generation

The address ALU uses normal right shifted integer arithmetic, but still with bit 0 as the most significant bit.

Data from the lowest data stack level B or from a register (R0-R255) may be addressed directly. These data always have the full size of the registers, e.g. 32 bits on a 32-bit processor.

For branches and calls, the jump address is calculated as a sum of the content of a base register (PC, R1, R3, R5, R7 or A) and a 24 bit immediate displacement.

Immediate data (#) are addressed relative to the PC or taken from a register file. Because the PC is automatically postincremented they are actually addressed as (PC)+.

Data from the memory are addressed by means of the contents of the lowest data stack level B, or relative, as the sum of a base address (Rb) and a displacement. The base address is contained in the register R1, R3, R5 or R7. The displacement is contained in the register R0, R2, R4 or R6, but in one addressing mode it may be further extended with an 8-bit immediate displacement, so that the address is calculated as the sum of 3 numbers. The 8 registers are grouped together in pairs so that the base register for R0 is R1, the base register for R2 is R3 and so on. As previously mentioned, the displacement registers may be automatically predecremented or postincremented.

To make it easier to implement cyclic buffers and digital filters, and to ensure that stacks do not destroy other data areas in case of an overflow, the length (number of bits) of all displacement registers - including the displacement registers of the two stack pointers - may be programmed. This is done in the most significant 5 bits of the displacement register (bits 0-4). For example, to program a length of 1024 bytes, the number should be set to 10. Because a length of 0 would be meaningless, the number 0 is defined to mean maximum length, that is, 11 bits on a 16-bit processor, 27 bits on a 32-bit processor and 59 bits on a 64-bit processor.
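The effect of a programmed length can be sketched as a wrap-around on the low bits of the displacement. The model below assumes a 32-bit displacement register with the 5-bit length field at the most significant end, and assumes that only the programmed number of low bits change on increment or decrement while any higher displacement bits are kept. The names `make_disp` and `step` are hypothetical.

```python
WORD_BITS = 32                        # assumed register width
LEN_BITS = 5                          # length field, bits 0-4 (most significant)
DISP_BITS = WORD_BITS - LEN_BITS      # 27 bits left for the displacement
DISP_MASK = (1 << DISP_BITS) - 1


def make_disp(length_bits, disp=0):
    """Build a displacement register from a length field and an offset."""
    return (length_bits << DISP_BITS) | (disp & DISP_MASK)


def step(reg, delta):
    """Add delta to the displacement, wrapping within the programmed length."""
    n = reg >> DISP_BITS
    if n == 0:
        n = DISP_BITS                 # 0 means maximum length
    wrap_mask = (1 << n) - 1
    disp = reg & DISP_MASK
    disp = (disp & ~wrap_mask) | ((disp + delta) & wrap_mask)
    return (reg & ~DISP_MASK) | disp
```

With a length field of 10, incrementing the displacement 1023 gives 0 and decrementing 0 gives 1023, so a 1024-byte cyclic buffer needs no explicit bounds check.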

In total the processor has the following 16 addressing modes:

  • # data, immediate data from the program (address = (PC)+) or from a register file.
  • Rn (n=0-255), Register directly
  • B, Lowest data stack level directly
  • (B), Address = contents of the lowest data stack level

  • (R0) + (R1) + Disp8, Address = sum of R0, R1 and Disp8
  • (R2) + (R3) + Disp8
  • (R4) + (R5) + Disp8
  • (R6) + (R7) + Disp8

  • -(R0) + (R1), Address = sum of R0 and R1 with predecrement
  • -(R2) + (R3)
  • -(R4) + (R5)
  • -(R6) + (R7)

  • (R0)+ + (R1), Address = sum of R0 and R1 with postincrement
  • (R2)+ + (R3)
  • (R4)+ + (R5)
  • (R6)+ + (R7)
In assembler programming, the base register of data addressing is implied so that e.g. Mul#Add -(R0) + (R1) may just be written as Mul#Add -(R0).
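The three memory addressing modes for a register pair can be sketched as plain address arithmetic. In this illustration the register file is a Python list in which `regs[n]` is the index register and `regs[n + 1]` its base register; the function names and the `size` parameter (the amount by which the index register is stepped) are assumptions, since the step size is not stated in the text. Operand size bits in the address are ignored here.

```python
def ea_disp8(regs, n, disp8):
    """(Rn) + (Rn+1) + Disp8: sum of index, base and 8-bit displacement."""
    return regs[n] + regs[n + 1] + disp8


def ea_predecrement(regs, n, size=1):
    """-(Rn) + (Rn+1): decrement the index register first, then add the base."""
    regs[n] -= size
    return regs[n] + regs[n + 1]


def ea_postincrement(regs, n, size=1):
    """(Rn)+ + (Rn+1): use the current index, then increment it."""
    address = regs[n] + regs[n + 1]
    regs[n] += size
    return address
```

Since the base register is always the odd-numbered partner of the index register, only the index register has to be named, which is what makes the shortened assembler notation possible.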

For immediate data and data from the memory, the two most significant bits of the address (bit 0 and 1) are used to determine the size of the operand in the following way:

  • 00 = 32-bit long word data (default)
  • 01 = 24-bit immediate data or 64-bit very long word memory data
  • 10 = 16-bit word
  • 11 = 8-bit byte
For memory addressing, the bits are specified in the base address register or in the two most significant bits of the 8 bit displacement (6 bits left).

For immediate data (#) taken from the program, it is the two most significant bits of the program counter (PC) which determine the size. These bits may be set up by means of special instructions. As the program counter is stored on the stack in case of interrupt processing or subroutine calls, the state is restored automatically on return. Immediate data from the register file are always used in full length.
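Decoding the operand size from the two most significant address bits can be sketched as below, assuming a 32-bit address word. The function name `operand_size` and the `immediate` flag are hypothetical; the flag selects between the two meanings of the code 01.

```python
WORD_BITS = 32  # assumed address width


def operand_size(address, immediate=False):
    """Return the operand size in bits from address bits 0 and 1 (the top two)."""
    code = (address >> (WORD_BITS - 2)) & 0b11
    if code == 0b00:
        return 32                       # long word (default)
    if code == 0b01:
        return 24 if immediate else 64  # 24-bit immediate or 64-bit memory data
    if code == 0b10:
        return 16                       # word
    return 8                            # byte
```

For example, an address whose top two bits are 11 selects a byte operand, while the same low address bits with 00 on top select a 32-bit long word.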


Instruction Pipe-line

The processor may be based on a 9 state pipe-line where 4 states are executed on the rising edge of the clock and the last 5 on the falling edge.

  • Instruction fetch
  • Address predecrement
  • Memory addressing
  • Address strobe
  • Address postincrement
  • Instruction decode
  • Data read
  • Instruction execute
  • Write back

Immediate data and branch offsets are fetched alongside these states, in this order: 32-bit immediate data; 24- and 32-bit immediate data and 24-bit branch offset; immediate word data and word branch offset; immediate data and branch offset.

Note that the address is output quite early (with the address strobe), so that the memory has time to fetch the data before they are going to be used (Data read). This is a great advantage, especially with dynamic RAM, where it takes a lot of time to get the first data on a random access. With this architecture a simple and cheap 66 MHz processor with 33 MHz EDO DRAM may perform as well as much faster traditional systems - especially with very big data areas where there are a lot of cache misses. Traditional processors usually only discover that the data is not present in the cache when the data is needed, and then the processor is stalled until the data is ready.

Even though the architecture makes it possible to do without a cache, the processor ought to include at least a very simple instruction cache, for several reasons:

  • With DRAM addressing it takes a lot of time to get the first data word, because it is necessary to address the row and then the column before the data can be fetched. This is the case with all types of DRAM - even synchronous DRAM (SDRAM), double data rate (DDR) RAM, RAMBUS etc. On the other hand, the following words (at least 3) may be fetched very fast by means of various burst modes. Because there is a very high probability that the following instructions will also be used, it is wise to fetch e.g. 4 words at a time (of the bus width).

  • To be able to execute small program loops without fetching the same instructions again and again, it is a big advantage to save a number of the most recently executed instructions. Such a history buffer is also necessary for the fast data transfer (see below).


Fast Data Transfer

By means of the history buffer (cache) it is possible to execute small program loops without fetching the same instructions more than once. This is utilized for a kind of intelligent DMA. The processor has several data strobes, but only one address bus and one data bus. To transfer data from e.g. a peripheral unit to the memory, the memory is first addressed as shown under instruction pipe-line. After that, a read strobe is issued to the peripheral unit together with a write strobe to the memory. In this way, the data are transferred in the fast fly-by mode, but the data are at the same time compared to the accumulator, so that the transfer may e.g. be terminated when a stop character is recognized. Because the transfer is controlled by a program and not by a DMA controller, it may be made as intelligent as one wishes. Because the program only has to be fetched once, and the processor is usually so much faster than the memory that it has time for a few instructions per data transfer, this way of transferring data is as fast as normal DMA, but much more intelligent.
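The control flow of such a program-controlled transfer can be sketched as a simple loop. This is only a behavioural model with hypothetical names: `peripheral` is any iterable of byte values, `memory` is a `bytearray`, and the comparison with `stop_char` stands in for the comparison against the accumulator. Because the data are written and compared at the same time, the stop character itself is assumed to be written to memory.

```python
def transfer_until(peripheral, memory, start, stop_char):
    """Copy bytes from the peripheral to memory until stop_char is seen.

    Returns the number of bytes transferred, including the stop character.
    """
    address = start
    for byte in peripheral:
        memory[address] = byte     # fly-by: read strobe plus write strobe
        address += 1               # models postincrement addressing
        if byte == stop_char:      # compared with the accumulator
            break
    return address - start
```

A usage example: with the input bytes "AB", a line feed, then "CD" and a stop character of line feed (0x0A), three bytes are written and the remaining input is left untouched.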


Interrupt

The processor contains an 8-bit counter for interrupt enabling and disabling. The counter is incremented at interrupt disable and decremented at interrupt enable. Interrupts are only enabled when the counter is 0. The counter cannot be incremented beyond 255 or decremented below 0. The counter makes it possible to disable and enable interrupts without having to worry about the previous state.
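The counter can be modelled in a few lines. The class name `InterruptGate` and its method names are hypothetical, and the reset value of 0 (interrupts enabled) is an assumption.

```python
class InterruptGate:
    """Toy model of the 8-bit interrupt disable counter."""

    def __init__(self):
        self.count = 0                           # assumed reset value

    def disable(self):
        self.count = min(self.count + 1, 255)    # cannot go beyond 255

    def enable(self):
        self.count = max(self.count - 1, 0)      # cannot go below 0

    @property
    def enabled(self):
        return self.count == 0
```

A routine that disables interrupts and later enables them again leaves the state exactly as it found it, even when it is called from code that had already disabled interrupts.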



PS

The high code density and the simple architecture, without the need for a floating point unit, a DMA controller or a complex cache, reduce the number of transistors in the system. Together with the high efficiency, which reduces the need for a high clock frequency, this makes it possible to obtain the required performance in a very cheap system, which at the same time gets a very high reliability - a typical Innovatic solution - simple, efficient and reliable.