William Stallings Computer Organization and Architecture, 7th Edition
Chapter 4 Cache Memory

Characteristics

  • Location
  • Capacity
  • Unit of transfer
  • Access method
  • Performance
  • Physical type
  • Physical characteristics
  • Organisation

Location

  • CPU
  • Internal
  • External

Capacity

  • Word size
    — The natural unit of organisation
  • Number of words
    — or Bytes

Unit of Transfer

  • Internal
    — Usually governed by data bus width
  • External
    — Usually a block which is much larger than a word
  • Addressable unit
    — Smallest location which can be uniquely addressed
    — Word internally
    — Cluster on M$ disks

Access Methods (1)

  • Sequential
    — Start at the beginning and read through in order
    — Access time depends on location of data and previous location
    — e.g. tape
  • Direct
    — Individual blocks have unique address
    — Access is by jumping to vicinity plus sequential search
    — Access time depends on location and previous location
    — e.g. disk

Access Methods (2)

  • Random
    — Individual addresses identify locations exactly
    — Access time is independent of location or previous access
    — e.g. RAM
  • Associative
    — Data is located by a comparison with contents of a portion of the store
    — Access time is independent of location or previous access
    — e.g. cache

Memory Hierarchy

  • Registers
    — In CPU
  • Internal or Main memory
    — May include one or more levels of cache
    — "RAM"
  • External memory
    — Backing store

Memory Hierarchy - Diagram

Performance

  • Access time
    — Time between presenting the address and getting the valid data
  • Memory Cycle time
    — Time may be required for the memory to "recover" before next access
    — Cycle time is access + recovery
  • Transfer Rate
    — Rate at which data can be moved

Physical Types

  • Semiconductor
    — RAM
  • Magnetic
    — Disk & Tape
  • Optical
    — CD & DVD
  • Others
    — Bubble
    — Hologram

Physical Characteristics

  • Decay
  • Volatility
  • Erasable
  • Power consumption

Organisation

  • Physical arrangement of bits into words
  • Not always obvious
  • e.g. interleaved

The Bottom Line

  • How much?
    — Capacity
  • How fast?
    — Time is money
  • How expensive?

Hierarchy List

  • Registers
  • L1 Cache
  • Main memory
  • Disk cache
  • Disk
  • Optical
  • Tape

So you want fast?

  • It is possible to build a computer which uses only static RAM (see later)
  • This would be very fast
  • This would need no cache
    — How can you cache cache?
  • This would cost a very large amount

Locality of Reference

  • During the course of the execution of a program, memory references tend to cluster
  • e.g. loops

Cache

  • Small amount of fast memory
  • Sits between normal main memory and CPU
  • May be located on CPU chip or module

Cache/Main Memory Structure

Cache operation – overview

  • CPU requests contents of memory location
  • Check cache for this data
  • If present, get from cache (fast)
  • If not present, read required block from main memory to cache
  • Then deliver from cache to CPU
  • Cache includes tags to identify which block of main memory is in each cache slot
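
  The flow above can be sketched in a few lines of Python. This is only an
  illustration of the hit/miss logic, not the slides' figure: the direct-mapped
  placement, line count, block size and the toy main memory are assumptions
  chosen to match the 64 kByte example used later.

    # Illustrative sketch of the cache lookup flow described above.
    BLOCK_SIZE = 4          # bytes per block (matches the 4-byte example later)
    NUM_LINES = 16 * 1024   # 16k lines, as in the 64 kByte example cache

    cache_tags = [None] * NUM_LINES   # tag stored with each line
    cache_data = [None] * NUM_LINES   # block of data held in each line

    def read_word(address, main_memory):
        """CPU requests contents of a memory location."""
        block_number = address // BLOCK_SIZE
        line = block_number % NUM_LINES     # direct-mapped placement, for simplicity
        tag = block_number // NUM_LINES

        if cache_tags[line] == tag:         # hit: get from cache (fast)
            block = cache_data[line]
        else:                               # miss: read required block from main memory
            start = block_number * BLOCK_SIZE
            block = main_memory[start:start + BLOCK_SIZE]
            cache_tags[line] = tag          # tag identifies which memory block is cached
            cache_data[line] = block
        return block[address % BLOCK_SIZE]  # then deliver from cache to CPU

    # Example with a toy byte-addressable main memory
    memory = bytes(range(256)) * 64
    print(read_word(0x0012, memory))        # miss, block fetched from memory
    print(read_word(0x0013, memory))        # hit in the same block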

Cache Read Operation - Flowchart

Cache Design

  • Size
  • Mapping Function
  • Replacement Algorithm
  • Write Policy
  • Block Size
  • Number of Caches

Size does matter

  • Cost
    — More cache is expensive
  • Speed
    — More cache is faster (up to a point)
    — Checking cache for data takes time

Typical Cache Organization

Comparison of Cache Sizes

  Processor       | Type                          | Year of Introduction | L1 cache      | L2 cache       | L3 cache
  IBM 360/85      | Mainframe                     | 1968                 | 16 to 32 KB   | —              | —
  PDP-11/70       | Minicomputer                  | 1975                 | 1 KB          | —              | —
  VAX 11/780      | Minicomputer                  | 1978                 | 16 KB         | —              | —
  IBM 3033        | Mainframe                     | 1978                 | 64 KB         | —              | —
  IBM 3090        | Mainframe                     | 1985                 | 128 to 256 KB | —              | —
  Intel 80486     | PC                            | 1989                 | 8 KB          | —              | —
  Pentium         | PC                            | 1993                 | 8 KB/8 KB     | 256 to 512 KB  | —
  PowerPC 601     | PC                            | 1993                 | 32 KB         | —              | —
  PowerPC 620     | PC                            | 1996                 | 32 KB/32 KB   | —              | —
  IBM S/390 G4    | Mainframe                     | 1997                 | 32 KB         | 256 KB         | 2 MB
  IBM S/390 G6    | Mainframe                     | 1999                 | 256 KB        | 8 MB           | —
  PowerPC G4      | PC/server                     | 1999                 | 32 KB/32 KB   | 256 KB to 1 MB | 2 MB
  Pentium 4       | PC/server                     | 2000                 | 8 KB/8 KB     | 256 KB         | —
  IBM SP          | High-end server/supercomputer | 2000                 | 64 KB/32 KB   | 8 MB           | —
  CRAY MTA        | Supercomputer                 | 2000                 | 8 KB          | 2 MB           | —
  Itanium         | PC/server                     | 2001                 | 16 KB/16 KB   | 96 KB          | 4 MB
  SGI Origin 2001 | High-end server               | 2001                 | 32 KB/32 KB   | 4 MB           | —
  Itanium 2       | PC/server                     | 2002                 | 32 KB         | 256 KB         | 6 MB
  IBM POWER5      | High-end server               | 2003                 | 64 KB         | 1.9 MB         | 36 MB
  CRAY XD-1       | Supercomputer                 | 2004                 | 64 KB/64 KB   | 1 MB           | —

Mapping Function

  • Cache of 64 kByte
  • Cache block of 4 bytes
    — i.e. cache is 16k (2^14) lines of 4 bytes
  • 16 MBytes main memory
  • 24 bit address
    — (2^24 = 16M)
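
  The quoted figures follow directly from these parameters; a quick check in
  Python, using only the slide's own numbers:

    cache_size  = 64 * 1024         # 64 kByte cache
    block_size  = 4                 # 4-byte cache blocks
    main_memory = 16 * 1024 * 1024  # 16 MBytes of main memory

    lines = cache_size // block_size
    print(lines, lines == 2**14)              # 16384 lines, i.e. 16k = 2^14

    address_bits = (main_memory - 1).bit_length()
    print(address_bits)                       # 24-bit address, since 2^24 = 16M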

Direct Mapping

  • Each block of main memory maps to only one cache line
    — i.e. if a block is in cache, it must be in one specific place
  • Address is in two parts
  • Least Significant w bits identify unique word
  • Most Significant s bits specify one memory block
  • The MSBs are split into a cache line field r and a tag of s-r (most significant)

Direct Mapping Address Structure

  Tag s-r (8 bit) | Line or Slot r (14 bit) | Word w (2 bit)

  • 24 bit address
  • 2 bit word identifier (4 byte block)
  • 22 bit block identifier
    — 8 bit tag (= 22 - 14)
    — 14 bit slot or line
  • No two blocks in the same line have the same Tag field
  • Check contents of cache by finding line and checking Tag
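
  As an illustration, the 8/14/2 split above can be applied to a 24-bit address
  with a few shifts and masks. The helper name and sample address are ours, not
  the slides':

    # Split a 24-bit address into Tag (8 bits), Line (14 bits), Word (2 bits),
    # following the direct-mapping address structure above.
    def split_direct(address):
        word = address & 0b11             # least significant 2 bits
        line = (address >> 2) & 0x3FFF    # next 14 bits select the cache line
        tag  = (address >> 16) & 0xFF     # most significant 8 bits stored as the tag
        return tag, line, word

    tag, line, word = split_direct(0xFFFFFC)
    print(hex(tag), hex(line), word)      # 0xff 0x3fff 0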

Direct Mapping Cache Line Table

  Cache line | Main memory blocks held
  0          | 0, m, 2m, 3m … 2^s - m
  1          | 1, m+1, 2m+1 … 2^s - m + 1
  …          | …
  m-1        | m-1, 2m-1, 3m-1 … 2^s - 1

Direct Mapping Cache Organization

Direct Mapping Example

Direct Mapping Summary

  • Address length = (s + w) bits
  • Number of addressable units = 2^(s+w) words or bytes
  • Block size = line size = 2^w words or bytes
  • Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
  • Number of lines in cache = m = 2^r
  • Size of tag = (s - r) bits
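
  Plugging the running example (64 kByte cache, 4-byte blocks, 16 MByte main
  memory) into this summary gives the bit widths used earlier; a small sanity
  check:

    w = 2                  # block size = 2^w = 4 bytes
    s_plus_w = 24          # address length: 2^24 = 16M addressable bytes
    s = s_plus_w - w       # 22-bit block identifier
    r = 14                 # 2^r = 16k lines in the cache
    print(2**s)            # number of blocks in main memory = 2^s = 4M
    print(s - r)           # size of tag = s - r = 8 bits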

Direct Mapping pros & cons

  • Simple
  • Inexpensive
  • Fixed location for a given block
    — If a program accesses 2 blocks that map to the same line repeatedly, cache misses are very high

Associative Mapping

  • A main memory block can load into any line of cache
  • Memory address is interpreted as tag and word
  • Tag uniquely identifies block of memory
  • Every line's tag is examined for a match
  • Cache searching gets expensive

Fully Associative Cache Organization

Associative Mapping Example

Associative Mapping Address Structure

  Tag (22 bit) | Word (2 bit)

  • 22 bit tag stored with each 32 bit block of data
  • Compare tag field with tag entry in cache to check for hit
  • Least significant 2 bits of address identify which word is required from the 32 bit block of data
  • e.g.
    — Address FFFFFC, Tag 3FFFFF, Data 24682468, Cache line 3FFF

Associative Mapping Summary

  • Address length = (s + w) bits
  • Number of addressable units = 2^(s+w) words or bytes
  • Block size = line size = 2^w words or bytes
  • Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
  • Number of lines in cache = undetermined
  • Size of tag = s bits
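
  A minimal sketch of an associative lookup, assuming the 22-bit tag is simply
  the block number and using a tiny made-up cache snapshot that mirrors the
  example above:

    # Fully associative lookup: every line's tag must be examined for a match.
    def assoc_tag(address):
        return address >> 2               # drop the 2-bit word field; 22-bit tag remains

    cache = {                             # line -> (tag, data); illustrative contents only
        0x3FFF: (assoc_tag(0xFFFFFC), 0x24682468),
    }

    def lookup(address):
        tag = assoc_tag(address)
        for line, (line_tag, data) in cache.items():  # compare against every stored tag
            if line_tag == tag:
                return data               # hit
        return None                       # miss

    print(hex(assoc_tag(0xFFFFFC)))       # 0x3fffff
    print(hex(lookup(0xFFFFFC)))          # 0x24682468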

Set Associative Mapping

  • Cache is divided into a number of sets
  • Each set contains a number of lines
  • A given block maps to any line in a given set
    — e.g. Block B can be in any line of set i
  • e.g. 2 lines per set
    — 2 way associative mapping
    — A given block can be in one of 2 lines in only one set

Set Associative Mapping Example

  • 13 bit set number
  • Block number in main memory is modulo 2^13
  • 000000, 008000, 010000, 018000 … map to the same set
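
  A quick check that block numbers congruent modulo 2^13 land in the same set,
  using the example addresses above (4-byte blocks assumed, as in the running
  example):

    SETS = 2**13
    for address in (0x000000, 0x008000, 0x010000, 0x018000):
        block_number = address // 4              # 4-byte blocks
        print(hex(address), block_number % SETS) # all land in set 0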

Two Way Set Associative Cache Organization

Set Associative Mapping Address Structure

  Tag (9 bit) | Set (13 bit) | Word (2 bit)

  • Use set field to determine cache set to look in
  • Compare tag field to see if we have a hit
  • e.g.
    — Address 1FF 7FFC: Tag 1FF, Data 12345678, Set number 1FFF
    — Address 001 7FFC: Tag 001, Data 11223344, Set number 1FFF
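
  The two example entries can be reproduced by splitting the full 24-bit
  addresses (reconstructed here as FFFFFC and 00FFFC from the tag and offset
  shown) into the 9/13/2 fields:

    # Tag (9) / Set (13) / Word (2) split for the two-way set-associative example.
    def split_set_assoc(address):
        word = address & 0b11
        set_number = (address >> 2) & 0x1FFF   # 13-bit set field
        tag = (address >> 15) & 0x1FF          # 9-bit tag
        return tag, set_number, word

    for address in (0xFFFFFC, 0x00FFFC):       # tags 1FF and 001, both in set 1FFF
        tag, set_number, word = split_set_assoc(address)
        print(hex(tag), hex(set_number), word)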

Two Way Set Associative Mapping Example

Set Associative Mapping Summary

  • Address length = (s + w) bits
  • Number of addressable units = 2^(s+w) words or bytes
  • Block size = line size = 2^w words or bytes
  • Number of blocks in main memory = 2^d
  • Number of lines in set = k
  • Number of sets = v = 2^d
  • Number of lines in cache = kv = k * 2^d
  • Size of tag = (s - d) bits

Replacement Algorithms (1) Direct mapping

  • No choice
  • Each block only maps to one line

Replacement Algorithms (2) Associative & Set Associative

  • Hardware implemented algorithm (speed)
  • Least Recently used (LRU)
    — e.g. in 2 way set associative, which of the 2 blocks is LRU?
  • First in first out (FIFO)
    — replace block that has been in cache longest
  • Least frequently used
    — replace block which has had fewest hits
  • Random
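
  A sketch of LRU for one two-way set: a single indicator per set records which
  line was used least recently. The class and access pattern are illustrative
  assumptions, not a description of any particular hardware:

    class TwoWaySet:
        def __init__(self):
            self.tags = [None, None]   # the two lines in this set
            self.lru = 0               # index of the least-recently-used line

        def access(self, tag):
            if tag in self.tags:                  # hit: the other line becomes LRU
                self.lru = 1 - self.tags.index(tag)
                return True
            victim = self.lru                     # miss: replace the LRU line
            self.tags[victim] = tag
            self.lru = 1 - victim
            return False

    s = TwoWaySet()
    print(s.access(0x1FF), s.access(0x001), s.access(0x1FF),
          s.access(0x0AB), s.access(0x001))
    # False False True False False -- 0x001 was evicted as least recently used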

Write Policy

  • Must not overwrite a cache block unless main memory is up to date
  • Multiple CPUs may have individual caches
  • I/O may address main memory directly

Write through

  • All writes go to main memory as well as cache
  • Multiple CPUs can monitor main memory traffic to keep local (to CPU) cache up to date
  • Lots of traffic
  • Slows down writes
  • Remember bogus write through caches!
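
  A minimal write-through sketch, assuming a simplified word-granular cache:
  every write updates the cache and main memory together, which is exactly what
  generates the extra traffic noted above:

    main_memory = bytearray(1024)
    cache = {}                        # address -> value; simplified, illustrative cache

    def write_through(address, value):
        cache[address] = value        # update the cache
        main_memory[address] = value  # ...and immediately update main memory too

    write_through(0x10, 0xAB)
    print(cache[0x10], main_memory[0x10])   # both copies agree: 171 171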

Write back

  • Updates initially made in cache only
  • Update bit for cache slot is set when update occurs
  • If block is to be replaced, write to main memory only if update bit is set
  • Other caches get out of sync
  • I/O must access main memory through cache
  • N.B. 15% of memory references are writes

Pentium 4 Cache

  • 80386 – no on chip cache
  • 80486 – 8k using 16 byte lines and four way set associative organization
  • Pentium (all versions) – two on chip L1 caches
    — Data & instructions
  • Pentium III – L3 cache added off chip
  • Pentium 4
    — L1 caches
      – 8k bytes
      – 64 byte lines
      – four way set associative
    — L2 cache
      – Feeding both L1 caches
      – 256k
      – 128 byte lines
      – 8 way set associative
    — L3 cache on chip

Intel Cache Evolution

  • Problem: External memory slower than the system bus.
    — Solution: Add external cache using faster memory technology. (386)
  • Problem: Increased processor speed results in external bus becoming a bottleneck for cache access.
    — Solution: Move external cache on-chip, operating at the same speed as the processor. (486)
  • Problem: Internal cache is rather small, due to limited space on the chip.
    — Solution: Add external L2 cache using faster technology than main memory. (486)
  • Problem: Contention occurs when both the Instruction Prefetcher and the Execution Unit simultaneously require access to the cache. In that case, the Prefetcher is stalled while the Execution Unit's data access takes place.
    — Solution: Create separate data and instruction caches. (Pentium)
  • Problem: Increased processor speed results in external bus becoming a bottleneck for L2 cache access.
    — Solution: Create separate back-side bus that runs at higher speed than the main (front-side) external bus. The BSB is dedicated to the L2 cache. (Pentium Pro)
    — Solution: Move L2 cache on to the processor chip. (Pentium II)
  • Problem: Some applications deal with massive databases and must have rapid access to large amounts of data. The on-chip caches are too small.
    — Solution: Add external L3 cache. (Pentium III)
    — Solution: Move L3 cache on-chip. (Pentium 4)

Pentium 4 Block Diagram

Pentium 4 Core Processor

  • Fetch/Decode Unit
    — Fetches instructions from L2 cache
    — Decodes into micro-ops
    — Stores micro-ops in L1 cache
  • Out of order execution logic
    — Schedules micro-ops
    — Based on data dependence and resources
    — May speculatively execute
  • Execution units
    — Execute micro-ops
    — Data from L1 cache
    — Results in registers
  • Memory subsystem
    — L2 cache and system bus

Pentium 4 Design Reasoning

  • Decodes instructions into RISC-like micro-ops before L1 cache
  • Micro-ops fixed length
    — Superscalar pipelining and scheduling
  • Pentium instructions long & complex
  • Performance improved by separating decoding from scheduling & pipelining
    — (More later – ch14)
  • Data cache is write back
    — Can be configured to write through
  • L1 cache controlled by 2 bits in register
    — CD = cache disable
    — NW = not write through
    — 2 instructions to invalidate (flush) cache and write back then invalidate
  • L2 and L3 8-way set-associative
    — Line size 128 bytes

PowerPC Cache Organization

  • 601 – single 32kb 8 way set associative
  • 603 – 16kb (2 x 8kb) two way set associative
  • 604 – 32kb
  • 620 – 64kb
  • G3 & G4
    — 64kb L1 cache
      – 8 way set associative
    — 256k, 512k or 1M L2 cache
      – two way set associative
  • G5
    — 32kB instruction cache
    — 64kB data cache

PowerPC G5 Block Diagram

Internet Sources

  • Manufacturer sites
    — Intel
    — IBM/Motorola
  • Search on cache