William Stallings, Computer Organization and Architecture, 8th Edition. Chapter 4: Cache Memory

Characteristics

• Location
• Capacity
• Unit of transfer
• Access method
• Performance
• Physical type
• Physical characteristics

Capacity

• Word size
— The natural unit of organisation
• Number of words
— or Bytes

Unit of Transfer

• Internal
— Usually governed by data bus width
• External
— Usually a block which is much larger than a word
• Addressable unit
— Smallest location which can be uniquely addressed

Access Methods (1)

• Sequential
— Start at the beginning and read through in order
— Access time depends on location of data and previous location
— e.g. tape
• Direct
— Individual blocks have unique address
— Access time depends on location and previous location
— e.g. disk

Access Methods (2)

• Random
— Individual addresses identify locations exactly
— Access time is independent of location or previous access
— e.g. RAM
• Associative
— Data is located by a comparison with contents of a portion of the store
— Access time is independent of location or previous access
— e.g. cache

Memory Hierarchy

• Registers
— In CPU
• Internal or Main memory
— May include one or more levels of cache
— "RAM"
• External memory
— Backing store

Performance

• Access time
— Time between presenting the address and getting the valid data
• Memory Cycle time
— Time may be required for the memory to "recover" before next access
— Cycle time is access + recovery
• Transfer Rate
— Rate at which data can be moved

Physical Types

• Semiconductor
— RAM
• Magnetic
— Disk & Tape
• Optical
— CD & DVD
• Others
— Bubble

Organisation

• Physical arrangement of bits into words
• Not always obvious
• e.g. interleaved

Hierarchy List

• Registers
• L1 Cache
• L2 Cache
• Main memory
• Disk cache
• Disk
• Optical
• Tape

So you want fast?

• It is possible to build a computer which uses only static RAM (see later)
• This would be very fast
• This would need no cache
— How can you cache cache?
• This would cost a very large amount

Locality of Reference

• During the course of the execution of a program, memory references tend to cluster
• e.g. loops

Cache

• Small amount of fast memory
• Sits between normal main memory and CPU
• May be located on CPU chip or module

Cache operation – overview

• CPU requests contents of memory location
• Check cache for this data
• If present, get from cache (fast)
• If not present, read required block from main memory to cache
• Then deliver from cache to CPU
• Cache includes tags to identify which block of main memory is in each cache slot
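The read sequence above can be sketched as a toy model in Python. This is an illustrative sketch only, not hardware: the block number is used directly as the tag, a plain dictionary stands in for the tag store, and the names `read` and `BLOCK_SIZE` are assumptions for the example.

```python
# Toy model of the cache read sequence: check cache, fill on miss,
# then deliver the requested byte from the cached block.

BLOCK_SIZE = 4  # bytes per block (matches the 4-byte-block example later)

def read(address, cache, main_memory):
    """Return the byte at `address`, filling the cache on a miss."""
    block_number = address // BLOCK_SIZE
    offset = address % BLOCK_SIZE
    if block_number in cache:                      # hit: deliver from cache (fast)
        block = cache[block_number]
    else:                                          # miss: fetch whole block first
        start = block_number * BLOCK_SIZE
        block = main_memory[start:start + BLOCK_SIZE]
        cache[block_number] = block                # tag the slot with the block number
    return block[offset]

memory = bytes(range(16))            # toy 16-byte main memory
cache = {}
assert read(5, cache, memory) == 5   # miss: block 1 (bytes 4..7) loaded
assert read(6, cache, memory) == 6   # hit: same block, no memory access
```

Note that a miss always transfers a whole block, so the neighbouring bytes come along for free; this is what makes the locality argument below pay off.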

Cache Read Operation – Flowchart

Cache Design

• Addressing
• Size
• Mapping Function
• Replacement Algorithm
• Write Policy
• Block Size
• Number of Caches

Cache Addressing

• Where does cache sit?
— Between processor and virtual memory management unit
— Between MMU and main memory
• Logical cache (virtual cache) stores data using virtual addresses
— Processor accesses cache directly, not through the MMU
— Cache access faster, before MMU address translation
— Virtual addresses use same address space for different applications, so the cache must be flushed on each context switch
• Physical cache stores data using main memory physical addresses

Size does matter

• Cost
— More cache is expensive
• Speed
— More cache is faster (up to a point)
— Checking cache for data takes time

Comparison of Cache Sizes

Processor      Type          Year  L1 cache       L2 cache        L3 cache
IBM 360/85     Mainframe     1968  16 to 32 KB    —               —
PDP-11/70      Minicomputer  1975  1 KB           —               —
VAX 11/780     Minicomputer  1978  16 KB          —               —
IBM 3033       Mainframe     1978  64 KB          —               —
IBM 3090       Mainframe     1985  128 to 256 KB  —               —
Intel 80486    PC            1989  8 KB           —               —
Pentium        PC            1993  8 KB/8 KB      256 to 512 KB   —
PowerPC 601    PC            1993  32 KB          —               —
PowerPC 620    PC            1996  32 KB/32 KB    —               —
IBM S/390 G4   Mainframe     1997  32 KB          256 KB          2 MB
IBM S/390 G6   Mainframe     1999  256 KB         8 MB            —
PowerPC G4     PC/server     1999  32 KB/32 KB    256 KB to 1 MB  2 MB
Pentium 4      PC/server     2000  8 KB/8 KB      256 KB          —

Mapping Function

• Cache of 64 kByte
• Cache block of 4 bytes
— i.e. cache is 16k (2^14) lines of 4 bytes
• 16 MBytes main memory
• 24 bit address
— (2^24 = 16M)

Direct Mapping

• Each block of main memory maps to only one cache line
— i.e. if a block is in cache, it must be in one specific place
• Address is in two parts
• Least Significant w bits identify unique word
• Most Significant s bits specify one memory block
• The MSBs are split into a cache line field r and a tag of s - r (most significant)

Direct Mapping Address Structure

Tag s - r (8 bits) | Line or Slot r (14 bits) | Word w (2 bits)

• 24 bit address
• 2 bit word identifier (4 byte block)
• 22 bit block identifier
— 8 bit tag (= 22 - 14)
— 14 bit slot or line
• No two blocks in the same line have the same Tag field
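The 8/14/2 split above is plain bit arithmetic, and can be sketched in Python. The function name `split_direct` and the example address are assumptions for illustration; the field widths come from the 64 kByte cache / 16 MByte memory example.

```python
# Decompose a 24-bit address into (tag, line, word) fields:
# 8-bit tag | 14-bit line | 2-bit word.

TAG_BITS, LINE_BITS, WORD_BITS = 8, 14, 2

def split_direct(address):
    word = address & ((1 << WORD_BITS) - 1)                 # low 2 bits
    line = (address >> WORD_BITS) & ((1 << LINE_BITS) - 1)  # next 14 bits
    tag = address >> (WORD_BITS + LINE_BITS)                # top 8 bits
    return tag, line, word

tag, line, word = split_direct(0x16339C)
assert (tag, line, word) == (0x16, 0x0CE7, 0)
```

On a reference, the hardware indexes the cache with `line` and compares the stored tag against `tag`; only if they match is the access a hit.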

      

Direct Mapping from Cache to Main Memory

Direct Mapping Cache Line Table

Cache line   Main memory blocks held
0            0, m, 2m, 3m, …, 2^s - m
1            1, m+1, 2m+1, …, 2^s - m + 1
…            …
m-1          m-1, 2m-1, 3m-1, …, 2^s - 1

Direct Mapping Cache Organization

Direct Mapping Summary

• Address length = (s + w) bits
• Number of addressable units = 2^(s+w) words or bytes
• Block size = line size = 2^w words or bytes
• Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
• Number of lines in cache = m = 2^r
• Size of tag = (s - r) bits

Direct Mapping pros & cons

• Simple
• Inexpensive
• Fixed location for given block
— If a program accesses 2 blocks that map to the same line repeatedly, cache misses are very high
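The fixed-location drawback can be made concrete with a toy simulation (names and the 4-line size are assumptions for the sketch): two blocks that map to the same line evict each other on every access, giving a 100% miss rate even though the rest of the cache is empty.

```python
# Toy direct-mapped cache: each block can live in exactly one line,
# so two blocks sharing a line thrash.

NUM_LINES = 4

def access(block, lines, stats):
    line = block % NUM_LINES            # fixed line for a given block
    if lines.get(line) == block:
        stats["hits"] += 1
    else:
        stats["misses"] += 1
        lines[line] = block             # evict whatever was there

lines, stats = {}, {"hits": 0, "misses": 0}
for _ in range(10):                     # alternate blocks 0 and 4: both map to line 0
    access(0, lines, stats)
    access(4, lines, stats)
assert stats == {"hits": 0, "misses": 20}
```

This is exactly the pathology that the victim cache and set-associative mapping below are designed to relieve.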

Victim Cache

• Lower miss penalty
• Remember what was discarded
— Already fetched
— Use again with little penalty
• Fully associative
• 4 to 16 cache lines
• Between direct mapped L1 cache and next memory level

Associative Mapping

• A main memory block can load into any line of cache
• Memory address is interpreted as tag and word
• Tag uniquely identifies block of memory
• Every line's tag is examined for a match
• Cache searching gets expensive

      

Associative Mapping from Cache to Main Memory

Fully Associative Cache Organization

Associative Mapping Address Structure

Tag (22 bit) | Word (2 bit)

• 22 bit tag stored with each 32 bit block of data
• Compare tag field with tag entry in cache to check for hit
• Least significant 2 bits of address identify which word is required from the 32 bit data block
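A minimal sketch of the fully associative case, assuming the 22/2 split above (the helper names `split_associative` and `lookup` are invented for the example). The key point is that there is no line field, so a lookup must compare against every stored tag.

```python
# Fully associative mapping: 22-bit tag | 2-bit word, and a lookup
# that must examine every line's tag (this is what gets expensive).

def split_associative(address):
    """Split a 24-bit address into (tag, word)."""
    return address >> 2, address & 0x3

def lookup(tag, cache_lines):
    """Search every (tag, block) pair; cost grows with cache size."""
    for line_tag, block in cache_lines:
        if line_tag == tag:
            return block
    return None                         # miss

tag, word = split_associative(0xFFFFFC)
assert (tag, word) == (0x3FFFFF, 0)
assert lookup(tag, [(0x3FFFFF, b"data")]) == b"data"
```

Real hardware does all the tag comparisons in parallel with one comparator per line, which is why the associative search is costly in silicon rather than in time.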

Associative Mapping Summary

• Address length = (s + w) bits
• Number of addressable units = 2^(s+w) words or bytes
• Block size = line size = 2^w words or bytes
• Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
• Number of lines in cache = undetermined
• Size of tag = s bits

Set Associative Mapping

• Cache is divided into a number of sets
• Each set contains a number of lines
• A given block maps to any line in a given set
— e.g. Block B can be in any line of set i
• e.g. 2 lines per set
— 2 way associative mapping
— A given block can be in one of 2 lines in only one set
Set Associative Mapping Example

• 13 bit set number
• Block number in main memory is modulo 2^13
• 000000, 00A000, 00B000, 00C000 … map to same set

      

Mapping From Main Memory to Cache: v Associative

Mapping From Main Memory to Cache: k-way Associative

Two Way Set Associative Cache Organization

Set Associative Mapping Address Structure

Tag (9 bit) | Set (13 bit) | Word (2 bit)

• Use set field to determine cache set to look in
• Compare tag field to see if we have a hit
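The two-step lookup above (index by set, then compare tags) follows directly from the 9/13/2 split, and can be sketched as bit arithmetic; the function name `split_set_assoc` is an assumption for the example.

```python
# Decompose a 24-bit address into (tag, set, word) fields:
# 9-bit tag | 13-bit set | 2-bit word (2-way set associative example).

TAG_BITS, SET_BITS, WORD_BITS = 9, 13, 2

def split_set_assoc(address):
    word = address & ((1 << WORD_BITS) - 1)
    set_no = (address >> WORD_BITS) & ((1 << SET_BITS) - 1)
    tag = address >> (WORD_BITS + SET_BITS)
    return tag, set_no, word

# Addresses that differ only in the tag bits land in the same set:
assert split_set_assoc(0x000000)[1] == split_set_assoc(0x008000)[1]
```

Having selected one set, the hardware only needs k tag comparators (here k = 2) instead of one per cache line, which is the compromise between direct and fully associative mapping.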

Two Way Set Associative Mapping Example

Set Associative Mapping Summary

• Address length = (s + w) bits
• Number of addressable units = 2^(s+w) words or bytes
• Block size = line size = 2^w words or bytes
• Number of blocks in main memory = 2^s
• Number of lines in set = k
• Number of sets = v = 2^d
• Number of lines in cache = kv = k × 2^d
• Size of tag = (s - d) bits

Direct and Set Associative Cache Performance Differences

• Significant up to at least 64kB for 2-way
• Difference between 2-way and 4-way at 4kB much less than 4kB to 8kB
• Cache complexity increases with associativity
• Not justified against increasing cache to 8kB or 16kB

Figure 4.16 Varying Associativity over Cache Size
[Plot: hit ratio (0.0 to 1.0) versus cache size]

Replacement Algorithms (1) Direct mapping

• No choice
• Each block only maps to one line
• Replace that line

Replacement Algorithms (2) Associative & Set Associative

• Hardware implemented algorithm (speed)
• Least Recently Used (LRU)
• e.g. in 2 way set associative
— Which of the 2 blocks is LRU?
• First In First Out (FIFO)
— replace block that has been in cache longest
• Least Frequently Used (LFU)
— replace block which has had fewest hits
• Random
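LRU for one 2-way set can be sketched with an ordered dictionary (a software model only; real caches track recency with a single use bit per pair of lines, and the class name `LRUSet` is invented for the example).

```python
# Sketch of LRU replacement within one set: on a miss, the least
# recently used of the resident blocks is the one evicted.

from collections import OrderedDict

class LRUSet:
    def __init__(self, ways=2):
        self.ways = ways
        self.blocks = OrderedDict()          # tag -> block, oldest first

    def access(self, tag):
        """Return True on hit, False on miss (with possible eviction)."""
        if tag in self.blocks:               # hit: mark as most recently used
            self.blocks.move_to_end(tag)
            return True
        if len(self.blocks) >= self.ways:    # miss and set full: evict LRU
            self.blocks.popitem(last=False)
        self.blocks[tag] = object()
        return False

s = LRUSet()
s.access("A"); s.access("B"); s.access("A")  # A is now most recently used
s.access("C")                                # evicts B (the LRU), not A
assert "A" in s.blocks and "B" not in s.blocks
```

FIFO differs only in skipping the `move_to_end` on a hit: age then depends on insertion order alone, not on use.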

Write Policy

• Must not overwrite a cache block unless main memory is up to date
• Multiple CPUs may have individual caches
• I/O may address main memory directly

Write through

• All writes go to main memory as well as cache
• Multiple CPUs can monitor main memory traffic to keep local (to CPU) cache up to date
• Lots of traffic
• Slows down writes

Write back

• Updates initially made in cache only
• Update bit for cache slot is set when update occurs
• If block is to be replaced, write to main memory only if update bit is set
• Other caches get out of sync
• I/O must access main memory through cache
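The traffic difference between the two policies can be shown with a small sketch (an illustrative model; `Line`, the function names, and the list standing in for memory-bus traffic are all assumptions): write-through generates one memory write per store, while write-back generates at most one per eviction.

```python
# Write-through sends every write to main memory; write-back sets a
# dirty ("update") bit and writes back only when the line is replaced.

class Line:
    def __init__(self):
        self.data, self.dirty = None, False

def write_through(line, value, memory_writes):
    line.data = value
    memory_writes.append(value)      # every write reaches main memory

def write_back(line, value):
    line.data = value
    line.dirty = True                # main memory updated later, if at all

def evict(line, memory_writes):
    if line.dirty:                   # write back only if update bit is set
        memory_writes.append(line.data)
        line.dirty = False

wt_mem, wb_mem = [], []
wt, wb = Line(), Line()
for v in (1, 2, 3):
    write_through(wt, v, wt_mem)
    write_back(wb, v)
evict(wb, wb_mem)
assert len(wt_mem) == 3 and len(wb_mem) == 1   # bus traffic: 3 writes vs 1
```

The single deferred write is also why other caches and I/O can see stale memory under write-back, as noted above.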

Line Size

• Retrieve not only desired word but a number of adjacent words as well
• Increased block size will increase hit ratio at first
— the principle of locality
• Hit ratio will decrease as block becomes even bigger
— Probability of using newly fetched information becomes less than probability of reusing replaced information
• Larger blocks
— Reduce number of blocks that fit in cache
— Data overwritten shortly after being fetched
— Each additional word is less local, so less likely to be needed

Multilevel Caches

• High logic density enables caches on chip
— Faster than bus access
— Frees bus for other transfers
• Common to use both on and off chip cache
— L1 on chip, L2 off chip in static RAM
— L2 access much faster than DRAM or ROM
— L2 often uses separate data path

Hit Ratio (L1 & L2) for 8 kbyte and 16 kbyte L1

Unified v Split Caches

• One cache for data and instructions, or two: one for data and one for instructions
• Advantages of unified cache
— Higher hit rate
— Balances load of instruction and data fetch
— Only one cache to design & implement
• Advantages of split cache
— Eliminates cache contention between instruction fetch/decode unit and execution unit
— Important in pipelining

Pentium 4 Cache

• 80386 – no on chip cache
• 80486 – 8k using 16 byte lines and four way set associative organization
• Pentium (all versions) – two on chip L1 caches
— Data & instructions
• Pentium III – L3 cache added off chip
• Pentium 4
— L1 caches: 8k bytes, 64 byte lines, four way set associative
— L2 cache: feeding both L1 caches, 256k, 128 byte lines, eight way set associative

Intel Cache Evolution

Problem: External memory slower than the system bus.
Solution: Add external cache using faster memory technology.
First appears: 386

Problem: Increased processor speed results in external bus becoming a bottleneck for cache access.
Solution: Move external cache on-chip, operating at the same speed as the processor.
First appears: 486

Problem: Internal cache is rather small, due to limited space on chip.
Solution: Add external L2 cache using faster technology than main memory.
First appears: 486

Problem: Contention occurs when both the Instruction Prefetcher and the Execution Unit simultaneously require access to the cache. In that case, the Prefetcher is stalled while the Execution Unit's data access takes place.
Solution: Create separate data and instruction caches.
First appears: Pentium

Problem: Increased processor speed results in external bus becoming a bottleneck for L2 cache access.
Solution: Create separate back-side bus that runs at higher speed than the main (front-side) external bus.
First appears: Pentium Pro

Pentium 4 Core Processor

• Fetch/Decode Unit
— Fetches instructions from L2 cache
— Decodes into micro-ops
— Stores micro-ops in L1 cache
• Out of order execution logic
— Schedules micro-ops
— Based on data dependence and resources
— May speculatively execute
• Execution units
— Execute micro-ops
— Data from L1 cache
— Results in registers

Pentium 4 Design Reasoning

• Decodes instructions into RISC like micro-ops before L1 cache
• Micro-ops fixed length
— Superscalar pipelining and scheduling
• Pentium instructions long & complex
• Performance improved by separating decoding from scheduling & pipelining
— (More later – ch14)
• Data cache is write back
— Can be configured to write through
• L1 cache controlled by 2 bits in register
— CD = cache disable
— NW = not write through
ARM Cache Features

Core          Cache Type  Cache Size (kB)   Line Size (words)  Associativity  Location  Write Buffer Size (words)
ARM720T       Unified     8                 4                  4-way          Logical   8
ARM920T       Split       16/16 D/I         8                  64-way         Logical   16
ARM926EJ-S    Split       4-128/4-128 D/I   8                  4-way          Logical   16
ARM1022E      Split       16/16 D/I         8                  64-way         Logical   16
ARM1026EJ-S   Split       4-128/4-128 D/I   8                  4-way          Logical   8

ARM Cache Organization

• Small FIFO write buffer
— Enhances memory write performance
— Between cache and main memory
— Small compared to cache
— Data put in write buffer at processor clock speed
— Processor continues execution
— External write in parallel until empty
— If buffer full, processor stalls
— Data in write buffer not available until written
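The buffer behaviour described above can be sketched as a small queue model (illustrative only; the helper names and the choice of an 8-word buffer, matching the ARM720T row in the table, are assumptions):

```python
# Toy model of a small FIFO write buffer: the processor enqueues writes
# at full speed and only stalls when the buffer is full, while memory
# drains it in parallel.

from collections import deque

BUFFER_SIZE = 8                      # words, as for the ARM720T

def cpu_write(buffer, word):
    """Return True if accepted immediately, False if the CPU must stall."""
    if len(buffer) >= BUFFER_SIZE:
        return False                 # buffer full: processor stalls
    buffer.append(word)              # done at processor clock speed
    return True

def memory_drain(buffer):
    """External write proceeds in parallel until the buffer is empty."""
    if buffer:
        buffer.popleft()

buf = deque()
assert all(cpu_write(buf, w) for w in range(8))   # 8 writes accepted
assert cpu_write(buf, 99) is False                # 9th write would stall
memory_drain(buf)                                 # memory retires one word
assert cpu_write(buf, 99) is True                 # room again
```

This also illustrates the trade-off on the slide: a bigger buffer stalls less often, but holds more data that is invisible to subsequent reads until it is written out.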

      

ARM Cache and Write Buffer Organization