Tuesday, March 5, 2013

ADVANCED MICROPROCESSOR NOTES



ADDITIONAL FEATURES OF 80386
The Intel 80386, the third generation of the 80x86 family, was a 32 bit microprocessor backwards compatible with previous generations of 80x86 CPUs.
Protected mode itself was introduced with the 80286; the 80386 extended it to 32 bits. In protected mode it includes the complete set of 32 bit registers and 32 bit instructions.
The CPU still used a segmented memory architecture similar to the one present in earlier x86 microprocessors, but the maximum size of a memory segment was increased to 4GB. This simplified development of 32 bit software, and in most cases applications could run without worrying about switching memory segments.
It became possible to switch from protected mode to real mode without simulating processor reset.
Another new mode in the 80386 CPU was virtual 8086 mode. In this mode the CPU could run old 8086 applications while providing the necessary protection. The 80386 protected modes were a very significant step. All current 32 bit operating systems use these modes to run legacy 16 bit and more modern 32 bit applications.
The 80386 added a 32 bit architecture and a paging translation unit, which made it much easier to implement operating systems that use virtual memory.
Other distinct features of the 80386 to address include:
Memory organization
Control signal associated
Ability to handle switching between real and protected mode.
Key features
1.      The organization and architecture can execute all the software written for its predecessors as is, without any change. Hence it is downward compatible.
2.      The 80386 features 3 operating modes: real mode, protected mode and virtual mode.

Real mode:
Real mode, also called real address mode, is an operating mode of 80286 and later x86-compatible CPUs. Real mode is characterized by a 20 bit segmented memory address space (1 MB of addressable memory, or just over 1 MB on the 80286 and later when address line A20 is enabled) and unlimited direct software access to all memory and I/O addresses and peripheral hardware. Real mode provides no support for memory protection, multitasking, or code privilege levels. 80186 CPUs and earlier, back to the original 8086, have only one operational mode, which is equivalent to real mode in later chips.
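
The real mode address calculation can be illustrated with a short sketch. It assumes the standard segment:offset rule (physical address = segment x 16 + offset), which these notes do not spell out; the function name and the a20_enabled parameter are made up for illustration.

```python
def real_mode_address(segment, offset, a20_enabled=False):
    """Combine a 16 bit segment and a 16 bit offset into a physical address.

    Assumption: the usual real mode rule, physical = segment * 16 + offset.
    With A20 disabled (8086-style behaviour) the result wraps at 1 MB.
    """
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must be 16 bit values")
    address = (segment << 4) + offset
    if not a20_enabled:
        address &= 0xFFFFF  # keep only 20 address bits
    return address


print(hex(real_mode_address(0x1234, 0x0010)))                    # 0x12350
print(hex(real_mode_address(0xFFFF, 0xFFFF)))                    # 0xffef (wrapped)
print(hex(real_mode_address(0xFFFF, 0xFFFF, a20_enabled=True)))  # 0x10ffef
```

The last call shows where "just over 1 MB" comes from: the highest segment and offset together reach a little past the 1 MB boundary when the 21st address line is available.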

Protected mode:
It allows the use of all the capabilities of the 286 (the initial protected mode, released with the 286, was not widely used) and the protected mode extensions of the 386, especially addressing up to 4GB of memory. Protected mode, also called protected virtual address mode, is an operational mode of x86-compatible central processing units (CPUs). It allows system software to utilize features such as virtual memory, paging, safe multi-tasking, and other features designed to increase an operating system's control over application software.
When a processor that supports x86 protected mode is powered on, it begins executing instructions in real mode, in order to maintain backwards compatibility with earlier x86 processors.

Virtual mode:
This mode makes it possible to run one or more real mode programs in a protected environment. It allows the execution of real mode applications that are incapable of running directly in protected mode while the processor is running a protected mode operating system.
Other features:
1.     32 bit version and large word size
2.     Multi tasking
3.     Memory management
4.     Virtual memory with or without paging
5.     Software protection
6.     Large memory system: 4GB physical and 64 terabytes virtual.
7.     Ability to switch between real mode and protected mode without resetting.
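
The 4GB and 64 terabyte figures can be checked with a little arithmetic. The sketch below assumes the usual 80386 accounting, which these notes do not state: 32 physical address bits, and 14 selector bits (13 index bits plus 1 table indicator bit) choosing among segments of up to 4GB each. The constant names are illustrative only.

```python
PHYSICAL_ADDRESS_BITS = 32   # assumption: 32 bit physical address
SELECTOR_BITS = 14           # assumption: 13 bit index + 1 bit GDT/LDT indicator
SEGMENT_OFFSET_BITS = 32     # maximum segment size is 4GB

physical_bytes = 2 ** PHYSICAL_ADDRESS_BITS
virtual_bytes = (2 ** SELECTOR_BITS) * (2 ** SEGMENT_OFFSET_BITS)

physical_gb = physical_bytes // 2 ** 30
virtual_tb = virtual_bytes // 2 ** 40

print(physical_gb, "GB physical")   # 4 GB physical
print(virtual_tb, "TB virtual")     # 64 TB virtual
```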

The internal architecture of the 80386 is divided into 3 sections:
• Central processing unit
• Memory management unit
• Bus interface unit
The central processing unit is further divided into an execution unit and an instruction unit.
• The execution unit has 8 general purpose and 8 special purpose registers, which are used either for handling data or for calculating offset addresses.
• The instruction unit decodes the opcode bytes received from the 16-byte instruction code queue and arranges them in a 3-instruction decoded instruction queue.
• After decoding, the instructions are passed to the control section, which derives the necessary control signals. The barrel shifter increases the speed of all shift and rotate operations.
• The multiply / divide logic implements the bit-shift-rotate algorithms to complete the operations in minimum time.
• Even 32 bit multiplications can be executed within one microsecond by the multiply / divide logic.

The memory management unit consists of a segmentation unit and a paging unit.
• The segmentation unit allows the use of two address components, viz. segment and offset, for relocatability and sharing of code and data.
• The segmentation unit allows segments of up to 4 Gbytes.
• The paging unit organizes the physical memory in pages of 4 Kbytes each.
• The paging unit works under the control of the segmentation unit, i.e. each segment is further divided into pages. The virtual memory is also organized in terms of segments and pages by the memory management unit.
• The segmentation unit provides a 4 level protection mechanism for protecting and isolating the system code and data from those of the application program.
• The paging unit converts linear addresses into physical addresses.
• The control and attribute PLA checks the privileges at the page level. Each of the pages maintains the paging information of the task. The limit and attribute PLA checks segment limits and attributes at segment level to avoid invalid accesses to code and data in the memory segments.
The bus control unit has a prioritizer to resolve the priority of the various bus requests.
• This controls the access of the bus. The address driver drives the bus enable and address signals A0-A31. The pipeline and dynamic bus sizing unit handle the related control signals.
• The data buffers interface the internal data bus with the system bus.


Memory management in 80386

The memory management unit consists of a segmentation unit and a paging unit.
• The segmentation unit allows the use of two address components, viz. segment and offset, for relocatability and sharing of code and data.
• The segmentation unit allows segments of up to 4 Gbytes.
• The paging unit organizes the physical memory in pages of 4 Kbytes each.
• The paging unit works under the control of the segmentation unit, i.e. each segment is further divided into pages. The virtual memory is also organized in terms of segments and pages by the memory management unit.
• The segmentation unit provides a 4 level protection mechanism for protecting and isolating the system code and data from those of the application program.
• The paging unit converts linear addresses into physical addresses.
• The control and attribute PLA checks the privileges at the page level. Each of the pages maintains the paging information of the task. The limit and attribute PLA checks segment limits and attributes at segment level to avoid invalid accesses to code and data in the memory segments.
• The bus control unit has a prioritizer to resolve the priority of the various bus requests. This controls the access of the bus. The address driver drives the bus enable and address signals A0-A31. The pipeline and dynamic bus sizing unit handle the related control signals.
• The data buffers interface the internal data bus with the system bus.
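
The two translation stages described above (segment and offset to linear address, then linear address to physical address through 4 Kbyte pages) can be sketched as follows. This is a simplified model, not the actual hardware tables: the 10/10/12 bit split of the linear address is the standard 80386 layout but is not stated in these notes, and the function names and the nested-dictionary page table are hypothetical.

```python
PAGE_SIZE = 4096  # 4 Kbyte pages


def split_linear_address(linear):
    """Split a 32 bit linear address into (directory index, table index, offset).

    Assumption: 10 bit page directory index, 10 bit page table index,
    12 bit offset within the page.
    """
    directory_index = (linear >> 22) & 0x3FF
    table_index = (linear >> 12) & 0x3FF
    page_offset = linear & 0xFFF
    return directory_index, table_index, page_offset


def translate(segment_base, segment_limit, offset, page_tables):
    """Translate segment base + offset to a physical address.

    page_tables is a toy model: {directory_index: {table_index: frame_number}}.
    """
    if offset > segment_limit:
        raise ValueError("offset exceeds segment limit")
    linear = (segment_base + offset) & 0xFFFFFFFF  # segmentation stage
    directory_index, table_index, page_offset = split_linear_address(linear)
    table = page_tables.get(directory_index, {})
    if table_index not in table:
        raise LookupError("page not present")
    return table[table_index] * PAGE_SIZE + page_offset  # paging stage


tables = {1: {3: 0x1234}}
print(hex(translate(0x00400000, 0xFFFF, 0x3025, tables)))  # 0x1234025
```

The segment limit check stands in for the limit and attribute PLA, and the missing-page lookup for the page level check; real descriptors and page table entries carry further attribute bits that this model ignores.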

Additional features of Pentium processors
The Intel Pentium processor, like its predecessor the Intel486 microprocessor, is fully software compatible with the installed base of over 100 million compatible Intel architecture systems. In addition, the Intel Pentium processor provides new levels of performance to new and existing software through a reimplementation of the Intel 32-bit instruction set architecture using the latest, most advanced, design techniques. Optimized, dual execution units provide one-clock execution for "core" instructions, while advanced technology, such as superscalar architecture, branch prediction, and execution pipelining, enables multiple instructions to execute in parallel with high efficiency. Separate code and data caches combined with wide 128-bit and 256-bit internal data paths and a 64-bit, burstable, external bus allow these performance levels to be sustained in cost-effective systems. The application of this advanced technology in the Intel Pentium processor brings "state of the art" performance and capability to existing Intel architecture software as well as new and advanced applications.
The Pentium processor has two primary operating modes and a "system management mode."
The operating mode determines which instructions and architectural features are accessible.

The Intel Pentium, the fifth generation of the Intel x86 family, was the first superscalar x86 CPU.
The processor included 2 pipelined integer units which could execute up to 2 integer instructions per CPU cycle. The redesigned floating point unit considerably improved performance of floating point operations and could execute up to 1 floating point instruction per CPU cycle.

Other enhancements of Pentium Core
1.    To improve data transfer rate the size of data bus was increased to 64 bits.
2.    The first Pentium processors featured separate 8KB code and 8KB data caches. The size of both the data and code L1 caches was doubled in the Pentium processor with MMX technology.
3.    The Intel Pentium CPU used branch prediction to improve the effectiveness of the pipeline architecture. Branch prediction was enhanced in Pentium MMX processors.
4.    To reduce CPU power consumption the core voltage was reduced on all Pentium MMX processors.
5.    Superscalar architecture: the Pentium has 2 data paths (pipelines) that allow it to complete more than one instruction per clock cycle. One pipe (called “U”) can handle any instruction while the other (called “V”) can handle only the simplest and most common instructions.
6.    The use of more than one pipeline is a characteristic typical of RISC processor designs, showing that it was possible to merge both technologies, creating “hybrid” processors.
7.    The 64 bit data path doubles the amount of information pulled from memory on each fetch. This doesn’t mean that the Pentium can execute 64 bit applications; its main registers are still 32 bits wide.
8.    Software compatible with the 80386 / 80486
9.    A dual cache
10.  Dual integer unit
11.  64 bit wide data
12.  Higher speed operation
13.  Separate cache for instruction and data.
14.  Designed to run at over 100 million instructions per second.
15.  Harvard architecture
16.  Accessibility up to 512MB of physical memory
17.  Faster numeric coprocessor (5 times faster)
18.  Data dependency checks in hardware supported by software
19.  Usage of branch prediction logic.


Interfacing Coprocessors in 80386
A numerics coprocessor (e.g., the 80387 or 80287) extends the instruction set of the base architecture to support high-precision integer and floating-point calculations. This extended instruction set includes arithmetic, comparison, transcendental, and data transfer instructions. The coprocessor also contains a set of useful constants to enhance the speed of numeric calculations.
A program contains instructions for the coprocessor in line with the instructions for the CPU. The system executes these instructions in the same order as they appear in the instruction stream. The coprocessor operates concurrently with the CPU to provide maximum throughput for numeric calculations.
The 80386 is designed to operate with either an 80287 or 80387 math coprocessor. The ET bit of CR0 indicates which type of coprocessor is present. ET is set automatically by the 80386 after RESET according to the level detected on the ERROR# input. If desired, ET may also be set or reset by loading CR0 with a MOV instruction. If ET is set, the 80386 uses the 32-bit protocol of the 80387; if reset, the 80386 uses the 16-bit protocol of the 80287.
ESC and WAIT Instructions
The 80386 interprets the pattern 11011B in the first five bits of an instruction as an opcode intended for a coprocessor. Instructions thus marked are called ESCAPE or ESC instructions. The CPU performs the following functions upon encountering an ESC instruction before sending the instruction to the coprocessor:
  • Tests the emulation mode (EM) flag to determine whether coprocessor functions are being emulated by software.
  • Tests the TS flag to determine whether there has been a context change since the last ESC instruction.
  • For some ESC instructions, tests the ERROR# pin to determine whether the coprocessor detected an error in the previous ESC instruction.
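
The 11011B pattern test can be sketched in a few lines. The function name is hypothetical; the only assumption beyond the text is that "the first five bits" means the five most significant bits of the first opcode byte, which corresponds to opcode bytes D8H through DFH.

```python
ESC_PATTERN = 0b11011  # pattern in the top five bits of the first opcode byte


def is_esc_opcode(opcode_byte):
    """Return True if the byte's five most significant bits are 11011B."""
    if not 0 <= opcode_byte <= 0xFF:
        raise ValueError("opcode must be a single byte")
    return (opcode_byte >> 3) == ESC_PATTERN


esc_bytes = [b for b in range(256) if is_esc_opcode(b)]
print([hex(b) for b in esc_bytes])
```

Running the scan lists the eight byte values D8H to DFH. WAIT (opcode 9BH) does not match the pattern, which is consistent with the statement that WAIT is not an ESC instruction.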
The WAIT instruction is not an ESC instruction, but WAIT causes the CPU to perform some of the same tests that it performs upon encountering an ESC instruction. The processor performs the following actions for a WAIT instruction:
  • Waits until the coprocessor no longer asserts the BUSY# pin.
  • Tests the ERROR# pin (after BUSY# goes inactive). If ERROR# is active, the 80386 signals exception 16, which indicates that the coprocessor encountered an error in the previous ESC instruction.
  • WAIT can therefore be used to cause exception 16 if an error is pending from a previous ESC instruction. Note that, if no coprocessor is present, the ERROR# and BUSY# pins should be tied inactive to prevent WAIT from waiting forever or causing spurious exceptions.

EM and MP Flags

The EM and MP flags of CR0 control how the processor reacts to coprocessor instructions.
The EM bit indicates whether coprocessor functions are to be emulated. If the processor finds EM set when executing an ESC instruction, it signals exception 7, giving the exception handler an opportunity to emulate the ESC instruction.
The MP (monitor coprocessor) bit indicates whether a coprocessor is actually attached. The MP flag controls the function of the WAIT instruction. If, when executing a WAIT instruction, the CPU finds MP set, then it tests the TS flag; it does not otherwise test TS during a WAIT instruction. If it finds TS set under these conditions, the CPU signals exception 7.
The EM and MP flags can be changed with the aid of a MOV instruction using CR0 as the destination operand and read with the aid of a MOV instruction with CR0 as the source operand. These forms of the MOV instruction can be executed only at privilege level zero.
The Task-Switched Flag
The TS bit of CR0 helps to determine when the context of the coprocessor does not match that of the task being executed by the 80386 CPU. The 80386 sets TS each time it performs a task switch (whether triggered by software or by hardware interrupt). If, when interpreting one of the ESC instructions, the CPU finds TS already set, it causes exception 7. The WAIT instruction also causes exception 7 if both TS and MP are set. Operating systems can use this exception to switch the context of the coprocessor to correspond to the current task. Refer to the 80386 System Software Writer's Guide for an example.
The CLTS instruction (legal only at privilege level zero) resets the TS flag.

 Interrupt 7 -- Coprocessor Not Available

This exception occurs in either of two conditions:
  1. The CPU encounters an ESC instruction and EM is set. In this case, the exception handler should emulate the instruction that caused the exception. TS may also be set.
  2. The CPU encounters either the WAIT instruction or an ESC instruction when both MP and TS are set. In this case, the exception handler should update the state of the coprocessor, if necessary.
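
The way the EM, MP and TS flags combine can be summarized in a small decision function. This is a sketch rather than a model of the real microcode: it follows the per-flag descriptions given earlier (an ESC instruction faults if EM or TS is set, WAIT faults only if both MP and TS are set), and it assumes the standard CR0 bit positions (MP = bit 1, EM = bit 2, TS = bit 3, ET = bit 4), which these notes do not list. The function names are made up.

```python
CR0_MP = 1 << 1  # monitor coprocessor
CR0_EM = 1 << 2  # emulate coprocessor
CR0_TS = 1 << 3  # task switched
CR0_ET = 1 << 4  # extension type: set = 80387 protocol, clear = 80287


def raises_exception_7(cr0, instruction):
    """Return True if the instruction would signal exception 7."""
    em = bool(cr0 & CR0_EM)
    mp = bool(cr0 & CR0_MP)
    ts = bool(cr0 & CR0_TS)
    if instruction == "ESC":
        return em or ts
    if instruction == "WAIT":
        return mp and ts
    raise ValueError("instruction must be 'ESC' or 'WAIT'")


def coprocessor_protocol(cr0):
    """Return which coprocessor protocol the ET bit selects."""
    return "80387 (32-bit)" if cr0 & CR0_ET else "80287 (16-bit)"


print(raises_exception_7(CR0_EM, "ESC"))            # True: emulation requested
print(raises_exception_7(CR0_TS, "WAIT"))           # False: MP is clear
print(raises_exception_7(CR0_TS | CR0_MP, "WAIT"))  # True: MP and TS both set
print(coprocessor_protocol(CR0_ET))                 # 80387 (32-bit)
```

An operating system that saves coprocessor state lazily would typically clear TS with CLTS inside the exception 7 handler once the coprocessor context matches the current task.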

Interrupt 9 -- Coprocessor Segment Overrun

This exception occurs in protected mode under the following conditions:
  • An operand of a coprocessor instruction wraps around an addressing limit (0FFFFH for small segments, 0FFFFFFFFH for big segments, zero for expand-down segments). An operand may wrap around an addressing limit when the segment limit is near an addressing limit and the operand is near the largest valid address in the segment. Because of the wrap-around, the beginning and ending addresses of such an operand will be near opposite ends of the segment.
  • Both the first byte and the last byte of the operand (considering wrap-around) are at addresses located in the segment and in present and accessible pages.
  • The operand spans inaccessible addresses. There are two ways that such an operand may also span inaccessible addresses:
    1. The segment limit is not equal to the addressing limit (e.g., addressing limit is FFFFH and segment limit is FFFDH); therefore, the operand will span addresses that are not within the segment (e.g., an 8-byte operand that starts at valid offset FFFC will span addresses FFFC-FFFF and 0000-0003; however, addresses FFFE and FFFF are not valid, because they exceed the limit);
    2. The operand begins and ends in present and accessible pages but intermediate bytes of the operand fall either in a not-present page or in a page to which the current procedure does not have access rights.
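
The wrap-around case in item 1 can be reproduced with simple modular arithmetic. The sketch only computes which offsets an operand touches and which of them lie beyond the segment limit; it does not model the page checks or the processor's actual fault logic, and the function names are hypothetical.

```python
def operand_offsets(start, length, addressing_limit=0xFFFF):
    """Offsets touched by an operand, wrapping past the addressing limit."""
    size = addressing_limit + 1
    return [(start + i) % size for i in range(length)]


def offsets_beyond_limit(start, length, segment_limit, addressing_limit=0xFFFF):
    """Offsets of the operand that exceed the segment limit."""
    return [o for o in operand_offsets(start, length, addressing_limit)
            if o > segment_limit]


# The example from the text: 8-byte operand at FFFC, segment limit FFFD.
touched = operand_offsets(0xFFFC, 8)
invalid = offsets_beyond_limit(0xFFFC, 8, segment_limit=0xFFFD)
print([hex(o) for o in touched])
print([hex(o) for o in invalid])
```

For the example in the text, the first and last bytes (FFFC and 0003) are inside the segment while FFFE and FFFF are not, which is exactly the situation that leads to the segment overrun exception.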

Interrupt 16 -- Coprocessor Error

The numerics coprocessors can detect six different exception conditions during instruction execution. If the detected exception is not masked by a bit in the control word, the coprocessor communicates the fact that an error occurred to the CPU by a signal at the ERROR# pin. The CPU causes interrupt 16 the next time it checks the ERROR# pin, which is only at the beginning of a subsequent WAIT or certain ESC instructions. If the exception is masked, the numerics coprocessor handles the exception according to on-board logic; it does not assert the ERROR# pin in this case.
RISC
RISC, or Reduced Instruction Set Computer, is a type of microprocessor architecture that utilizes a small, highly-optimized set of instructions, rather than a more specialized set of instructions often found in other types of architectures.

Certain design features have been characteristic of most RISC processors:
  • one cycle execution time: RISC processors have a CPI (clocks per instruction) of one cycle. This is due to the optimization of each instruction on the CPU and a technique called pipelining;
  • pipelining: a technique that allows for simultaneous execution of parts, or stages, of instructions to more efficiently process instructions;
  • large number of registers: the RISC design philosophy generally incorporates a larger number of registers to prevent large amounts of interaction with memory.
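
The link between pipelining and a CPI close to one can be shown with an idealized cycle count. The sketch assumes a pipeline with no stalls, hazards or flushes, where every stage takes one clock; under that assumption n instructions on a k-stage pipeline take k + n - 1 cycles. The function names and the 5-stage figure are illustrative only.

```python
def unpipelined_cycles(instructions, stages):
    """Each instruction passes through every stage before the next one starts."""
    return instructions * stages


def pipelined_cycles(instructions, stages):
    """Ideal pipeline: fill it once, then complete one instruction per cycle."""
    if instructions == 0:
        return 0
    return stages + instructions - 1


def cycles_per_instruction(instructions, stages):
    return pipelined_cycles(instructions, stages) / instructions


n, k = 1000, 5
print(unpipelined_cycles(n, k))      # 5000
print(pipelined_cycles(n, k))        # 1004
print(cycles_per_instruction(n, k))  # 1.004
```

As the instruction count grows, the fill time of the pipeline becomes negligible and the CPI approaches one; real processors fall short of this whenever branches or data dependencies stall the pipeline.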
A RISC chip will typically have far fewer transistors dedicated to the core logic, which originally allowed designers to increase the size of the register set and increase internal parallelism.
Other features, which are typically found in RISC architectures are:
  • Uniform instruction format, using a single word with the opcode in the same bit positions in every instruction, demanding less decoding;
  • Identical general purpose registers, allowing any register to be used in any context, simplifying compiler design (although normally there are separate floating point registers);
  • Simple addressing modes. Complex addressing performed via sequences of arithmetic and/or load-store operations;
  • Few data types in hardware; some CISCs have byte string instructions or support complex numbers, which is so far unlikely to be found on a RISC.
  • Optimize H/W for common basic operations
  • Fixed instruction length
  • Shorter execution pipeline
  • Ease of instruction level parallelism
  • Large number of registers
  • Fewer memory accesses
  • Reduce pipeline flush events
  • No ‘complex’ H/W instructions
  • Handle exceptional conditions in S/W
  • Achieve maximum performance by right partitioning between H/W and S/W
  • Examples: MIPS, IBM Power and PowerPC, Sun SPARC
CISC COMPLEX INSTRUCTION SET COMPUTER
  • Rich architecture
  • Variable length instructions
  • Complex addressing modes
  • On-chip H/W / S/W partitioning required
  • H/W keeps executing ‘simple’ stuff
  • More instructions treated as ‘simple’ as more H/W is available
  • Large (and forever increasing) software base
  • Code development tools
  • Expertise
  • H/W and S/W spiral
  • Examples: Intel IA32, Motorola 680X0
  • Maximize information passed to the H/W

AMD PROCESSOR
AMD Athlon™ processors are members of a new family of seventh-generation AMD processors designed to meet the computation-intensive requirements of cutting-edge software applications running on high-performance desktop systems, workstations, and servers. With the introduction of the AMD Athlon processor with performance-enhancing cache memory, AMD continues to deliver superior solutions for high-performance computing.
AMD Athlon processors feature a superpipelined, nine-issue superscalar microarchitecture optimized for high clock frequency. AMD Athlon processors have a large dual-ported 128KB split-L1 cache (64KB instruction cache + 64KB data cache); a full-speed on-die 256KB L2 cache; a large multi-level, 512-entry Translation Look-aside Buffer (TLB); a two-way, 2048-entry branch prediction table; multiple parallel x86 instruction decoders; and multiple integer and floating point schedulers for independent superscalar, out-of-order, speculative execution of instructions. These elements are packed into an aggressive processing pipeline that includes 10-stage integer and 15-stage floating point pipelines.
The innovative AMD Athlon processor architecture implements the x86 instruction set by internally decoding x86 instructions into fixed-length “Macro-Ops” for higher instruction throughput and increased processing power. AMD Athlon processors contain nine execution pipelines: three for address calculations, three for integer calculations, and three for execution of MMX™ instructions, 3DNow! technology, and x87 floating point instructions.

The industry’s first nine-issue, superpipelined, superscalar x86 processor microarchitecture designed for high clock frequencies:

1.    Multiple x86 instruction decoders
2.    Three out-of-order, superscalar, fully pipelined floating point execution units, which execute all x87 (floating point), MMX and 3DNow! instructions
3.    Three out-of-order, superscalar, pipelined integer units
4.    Three out-of-order, superscalar, pipelined address calculation units
5.    72-entry instruction control unit
6.    Advanced dynamic branch prediction

7.    Enhanced 3DNow! technology with new instructions to enable improved integer math calculations for speech or video encoding and improved data movement for Internet plug-ins and other streaming applications

8.    200MHz AMD Athlon processor system bus (scalable beyond 400 MHz) enabling leading-edge system bandwidth for data movement-intensive applications

9.    High-performance cache architecture featuring a large split 128KB L1 cache, a high-speed L2 cache of 256KB (full-speed, on-chip), dedicated snoop tags, and a large multi-level, 512-entry Translation Look-aside Buffer
