William Stallings, Computer Organization and Architecture, 8th Edition
Chapter 4 – Cache Memory
Characteristics
• Location
• Capacity
• Unit of transfer
• Access method
• Performance
• Physical type
• Physical characteristics
Capacity
• Word size
  — The natural unit of organisation
• Number of words
  — or Bytes
Unit of Transfer
• Internal
  — Usually governed by data bus width
• External
  — Usually a block which is much larger than a word
• Addressable unit
  — Smallest location which can be uniquely addressed

Access Methods (1)
• Sequential
  — Start at the beginning and read through in order
  — Access time depends on location of data and previous location
  — e.g. tape
• Direct
  — Individual blocks have unique address
  — Access time depends on location and previous location
  — e.g. disk

Access Methods (2)
• Random
  — Individual addresses identify locations exactly
  — Access time is independent of location or previous access
  — e.g. RAM
• Associative
  — Data is located by a comparison with contents of a portion of the store
  — Access time is independent of location or previous access
  — e.g. cache
Memory Hierarchy
• Registers
  — In CPU
• Internal or Main memory
  — May include one or more levels of cache
  — “RAM”
• External memory
  — Backing store

Memory Hierarchy – Diagram (figure)
Performance
• Access time
  — Time between presenting the address and getting the valid data
• Memory Cycle time
  — Time may be required for the memory to “recover” before next access
  — Cycle time is access + recovery
• Transfer Rate
Physical Types
• Semiconductor
  — RAM
• Magnetic
  — Disk & Tape
• Optical
  — CD & DVD
• Others
  — Bubble
Organisation
• Physical arrangement of bits into words
• Not always obvious
• e.g. interleaved
Hierarchy List
• Registers
• L1 Cache
• L2 Cache
• Main memory
• Disk cache
• Disk
• Optical
So you want fast?
• It is possible to build a computer which uses only static RAM (see later)
• This would be very fast
• This would need no cache
  — How can you cache cache?
• This would cost a very large amount
Locality of Reference
• During the course of the execution of a program, memory references tend to cluster
• e.g. loops

Cache
• Small amount of fast memory
• Sits between normal main memory and CPU
• May be located on CPU chip or module
Cache/Main Memory Structure (figure)
Cache operation – overview
• CPU requests contents of memory location
• Check cache for this data
• If present, get from cache (fast)
• If not present, read required block from main memory to cache
• Then deliver from cache to CPU
• Cache includes tags to identify which block of main memory is in each cache slot
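The read sequence above can be sketched as a toy simulation (names and the 4-byte block size are illustrative assumptions, not any particular hardware):

```python
# Toy cache illustrating the read sequence: check cache -> hit: serve from
# cache; miss: fill the whole block from main memory, then serve from cache.
BLOCK_SIZE = 4  # bytes per block (assumed for illustration)

memory = {addr: addr % 251 for addr in range(64)}  # fake main memory contents
cache = {}   # block number -> list of bytes (block number acts as the tag)
stats = {"hits": 0, "misses": 0}

def read(addr):
    block = addr // BLOCK_SIZE
    if block in cache:                      # tag match: hit
        stats["hits"] += 1
    else:                                   # miss: fetch whole block from memory
        stats["misses"] += 1
        base = block * BLOCK_SIZE
        cache[block] = [memory[base + i] for i in range(BLOCK_SIZE)]
    return cache[block][addr % BLOCK_SIZE]  # deliver from cache to CPU

values = [read(a) for a in range(8)]        # sequential reads: 1 miss per block
```

Eight sequential byte reads touch two blocks, so only the first access to each block misses; the other six hit, which is locality of reference at work.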
Cache Read Operation – Flowchart (figure)
Cache Design
• Addressing
• Size
• Mapping Function
• Replacement Algorithm
• Write Policy
• Block Size
• Number of Caches
Cache Addressing
• Where does cache sit?
  — Between processor and virtual memory management unit
  — Between MMU and main memory
• Logical cache (virtual cache) stores data using virtual addresses
  — Processor accesses cache directly, not through MMU
  — Cache access faster, before MMU address translation
  — Virtual addresses use same address space for different applications
• Physical cache stores data using main memory physical addresses

Size does matter
• Cost
  — More cache is expensive
• Speed
  — More cache is faster (up to a point)
  — Checking cache for data takes time
Comparison of Cache Sizes

Processor      Type          Year  L1 cache       L2 cache        L3 cache
IBM 360/85     Mainframe     1968  16 to 32 KB    —               —
PDP-11/70      Minicomputer  1975  1 KB           —               —
VAX 11/780     Minicomputer  1978  16 KB          —               —
IBM 3033       Mainframe     1978  64 KB          —               —
IBM 3090       Mainframe     1985  128 to 256 KB  —               —
Intel 80486    PC            1989  8 KB           —               —
Pentium        PC            1993  8 KB/8 KB      256 to 512 KB   —
PowerPC 601    PC            1993  32 KB          —               —
PowerPC 620    PC            1996  32 KB/32 KB    —               —
IBM S/390 G4   Mainframe     1997  32 KB          256 KB          2 MB
IBM S/390 G6   Mainframe     1999  256 KB         8 MB            —
PowerPC G4     PC/server     1999  32 KB/32 KB    256 KB to 1 MB  2 MB
Pentium 4      PC/server     2000  8 KB/8 KB      256 KB          —
Mapping Function
• Cache of 64 KByte
• Cache block of 4 bytes
  — i.e. cache is 16k (2^14) lines of 4 bytes
• 16 MBytes main memory
• 24 bit address
  — (2^24 = 16M)
• Address is in two parts
  — 2 bit word identifier (4 byte block)
  — 22 bit block identifier
• No two blocks in the same line have the same Tag field
• Block size = line size = 2^w words or bytes
• Simple
• Inexpensive
• Fixed location for given block
• Lower miss penalty
  — Remember what was discarded
• Fully associative
• 4 to 16 cache lines
• Between direct mapped L1 cache and next memory level
• 22 bit tag stored with each 32 bit block of data
• Compare tag field with tag entry in cache to check for hit
• Least significant 2 bits of address identify which 16 bit word is required from 32 bit data block
• Block size = line size = 2^w words or bytes
• A given block maps to any line in a given set
• 13 bit set number
• Block number in main memory is modulo 2^13
• Address length = (s + w) bits
• Block size = line size = 2^w words or bytes
• Number of blocks in main memory = 2^s
• Number of lines in set = k
• Number of sets = v = 2^d
• Cache complexity increases with associativity
• No choice
• Each block only maps to one line
• Replace that line
• If block is to be replaced, write to main memory only if update bit is set
• High logic density enables caches on chip
• Common to use both on and off chip cache
• 80386 – no on chip cache
• 80486 – 8k, using 16 byte lines and four way set associative organization
• Pentium (all versions) – two on chip L1 caches
• Pentium 4 L1 caches use 64 byte lines
• Fetch/Decode Unit
• Out of order execution logic
• Decodes instructions into RISC-like micro-ops before L1 cache
• Micro-ops fixed length
• Pentium instructions long & complex
• Performance improved by separating decoding from scheduling
• Data cache is write back
Direct Mapping
• Each block of main memory maps to only one cache line
  — i.e. if a block is in cache, it must be in one specific place
• Least significant w bits identify unique word
• Most significant s bits specify one memory block
Direct Mapping Address Structure

Tag s-r   Line or Slot r   Word w
  8             14            2

• 8 bit tag (= 22 - 14)
• 14 bit slot or line
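For the 64 KByte cache / 4 byte line example, the field split can be checked numerically (a sketch; the field widths are the ones on the slide, the sample address is arbitrary):

```python
# Split a 24-bit address into tag (8), line (14), word (2) fields,
# matching the 64 KByte cache with 4 byte lines.
WORD_BITS, LINE_BITS = 2, 14

def split(addr):
    word = addr & ((1 << WORD_BITS) - 1)              # low 2 bits
    line = (addr >> WORD_BITS) & ((1 << LINE_BITS) - 1)  # next 14 bits
    tag = addr >> (WORD_BITS + LINE_BITS)             # top 8 bits
    return tag, line, word

tag, line, word = split(0x123456)
# Reassembling the three fields recovers the original address.
assert (tag << 16) | (line << 2) | word == 0x123456
```

A lookup then indexes the cache with `line` and compares the stored tag with `tag`; only on a match is `word` used to select the byte within the line.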
Direct Mapping from Cache to Main Memory (figure)
Direct Mapping Cache Line Table

Cache line   Main memory blocks held
0            0, m, 2m, 3m, … 2^s - m
1            1, m+1, 2m+1, … 2^s - m + 1
…            …
m - 1        m-1, 2m-1, 3m-1, … 2^s - 1
Direct Mapping Cache Organization (figure)
Direct Mapping Summary
• Address length = (s + w) bits
• Number of addressable units = 2^(s+w) words or bytes
• Block size = line size = 2^w words or bytes
• Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
• Number of lines in cache = m = 2^r
• Size of tag = (s - r) bits
Direct Mapping pros & cons
• Fixed location for given block
  — If a program accesses 2 blocks that map to the same line repeatedly, cache misses are very high
Victim Cache
• Remember what was discarded
  — Already fetched
  — Use again with little penalty
• Between direct mapped L1 cache and next memory level
Associative Mapping
• A main memory block can load into any line of cache
• Memory address is interpreted as tag and word
• Tag uniquely identifies block of memory
• Every line's tag is examined for a match
• Cache searching gets expensive
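Why searching gets expensive: a fully associative lookup must compare the request against every line's tag. A minimal software sketch (hardware performs all the comparisons in parallel; names and values here are made up for illustration):

```python
# Fully associative cache: a block may sit in any line, so a lookup
# examines every line's tag (hardware compares all tags at once).
lines = [{"tag": None, "data": None} for _ in range(4)]

def lookup(tag):
    for line in lines:          # one comparison per line in software
        if line["tag"] == tag:
            return line["data"]  # hit
    return None                  # miss

lines[2].update(tag=0x3FA9, data="block contents")
hit = lookup(0x3FA9)
miss = lookup(0x0001)
```

The cost of this compare-everything search is exactly what set associative mapping reduces: only the lines of one set need comparing.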
Associative Mapping from Cache to Main Memory (figure)
Associative Mapping Address Structure

Tag       Word
22 bit    2 bit
Associative Mapping Summary
• Address length = (s + w) bits
• Number of addressable units = 2^(s+w) words or bytes
• Block size = line size = 2^w words or bytes
• Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
• Number of lines in cache = undetermined
• Size of tag = s bits
Set Associative Mapping
• Cache is divided into a number of sets
• Each set contains a number of lines
• A given block maps to any line in a given set
  — e.g. Block B can be in any line of set i
• e.g. 2 lines per set
  — 2 way associative mapping
  — A given block can be in one of 2 lines, in only one set
Set Associative Mapping Example
• 13 bit set number
• Block number in main memory is modulo 2^13
• 000000, 008000, 010000, 018000 … map to same set
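With a 2-bit word field and a 13-bit set field, addresses whose block numbers agree modulo 2^13 land in the same set; addresses 0x8000 apart therefore share a set. A quick check (a sketch of the arithmetic, not any particular cache):

```python
# Set number = block number modulo 2**13,
# where block number = address >> 2 (2-bit word field).
def set_number(addr, word_bits=2, set_bits=13):
    block = addr >> word_bits
    return block % (1 << set_bits)

# Addresses that differ by multiples of 0x8000 (= 2**15) keep the same
# 13-bit set field, so they all map to set 0.
same = {set_number(a) for a in (0x000000, 0x008000, 0x010000, 0x018000)}
```

All four addresses collapse to a single set number, which is why they compete for the k lines of that one set.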
Mapping From Main Memory to Cache: v Associative (figure)

Mapping From Main Memory to Cache: k-way Associative (figure)
Set Associative Mapping Address Structure

Tag      Set       Word
9 bit    13 bit    2 bit

• Use set field to determine cache set to look in
• Compare tag field to see if we have a hit
Set Associative Mapping Summary
• Number of addressable units = 2^(s+w) words or bytes
• Number of lines in cache = kv = k × 2^d
• Size of tag = (s - d) bits
Direct and Set Associative Cache Performance Differences
• Significant up to at least 64kB for 2-way
• Difference between 2-way and 4-way at 4kB much less than 4kB to 8kB
• Cache complexity increases with associativity
• Not justified against increasing cache to 8kB or 16kB
Figure 4.16  Varying Associativity over Cache Size (plot: hit ratio vs. cache size)
Replacement Algorithms (1) – Direct mapping
• No choice
• Each block only maps to one line
• Replace that line
Replacement Algorithms (2) – Associative & Set Associative
• Hardware implemented algorithm (speed)
• Least Recently Used (LRU)
  — e.g. in 2 way set associative: which of the 2 blocks is LRU?
• First In First Out (FIFO)
  — replace block that has been in cache longest
• Least Frequently Used (LFU)
  — replace block which has had fewest hits
• Random
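LRU for one 2-way set can be sketched with an ordered list (a toy model; real hardware tracks a 2-way set's LRU state with a single bit per set):

```python
# LRU replacement for one 2-way set associative set: the set holds at
# most two tags; a hit makes the tag most-recently-used, a miss on a
# full set evicts the least-recently-used tag.
class TwoWaySet:
    def __init__(self):
        self.tags = []            # index 0 = LRU, index -1 = MRU

    def access(self, tag):
        if tag in self.tags:      # hit: promote to MRU
            self.tags.remove(tag)
            self.tags.append(tag)
            return True
        if len(self.tags) == 2:   # miss with full set: evict LRU
            self.tags.pop(0)
        self.tags.append(tag)
        return False              # miss

s = TwoWaySet()
pattern = ["A", "B", "A", "C", "B"]
hits = [s.access(t) for t in pattern]
```

In this trace "C" evicts "B" (the LRU block at that moment), so the final access to "B" misses again; FIFO would differ only in that a hit would not reorder the tags.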
Write Policy
• Must not overwrite a cache block unless main memory is up to date
• Multiple CPUs may have individual caches
• I/O may address main memory directly
Write through
• All writes go to main memory as well as cache
• Multiple CPUs can monitor main memory traffic to keep local (to CPU) cache up to date
• Lots of traffic
• Slows down writes
Write back
• Updates initially made in cache only
• Update bit for cache slot is set when update occurs
• If block is to be replaced, write to main memory only if update bit is set
• Other caches get out of sync
• I/O must access main memory through cache
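The traffic difference between the two policies shows up when one cached block is updated repeatedly (a deliberately minimal model; it counts main-memory writes only):

```python
# Main-memory write traffic for n updates to a single cached block.
def write_through(n_updates):
    # Every cache write is also sent to main memory.
    return n_updates

def write_back(n_updates):
    # Writes stay in the cache; the update (dirty) bit is set on the
    # first write, and the block is written back once on replacement.
    dirty = n_updates > 0
    return 1 if dirty else 0

wt = write_through(100)   # 100 memory writes
wb = write_back(100)      # 1 memory write, at replacement time
```

This is the "lots of traffic" cost of write through versus the single deferred write of write back; the price of write back is the coherence problem listed above.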
Line Size
• Retrieve not only desired word but a number of adjacent words as well
• Increased block size will increase hit ratio at first
  — the principle of locality
• Hit ratio will decrease as block becomes even bigger
  — Probability of using newly fetched information becomes less than probability of reusing replaced information
• Larger blocks
  — Reduce number of blocks that fit in cache
  — Data overwritten shortly after being fetched
  — Each additional word is less local, so less likely to be needed
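The first effect, larger blocks raising the hit ratio through locality, is visible even for a purely sequential scan (a toy model with a one-block cache; it does not capture the later decline, which needs blocks competing for cache space):

```python
# Sequential scan with a one-block cache: each miss fetches a whole
# block, so only 1 access in block_size misses.
def hit_ratio(n_accesses, block_size):
    cached_block = None
    hits = 0
    for addr in range(n_accesses):
        block = addr // block_size
        if block == cached_block:
            hits += 1
        else:
            cached_block = block      # miss: fetch the new block
    return hits / n_accesses

# Hit ratio approaches 1 - 1/block_size as the block grows.
ratios = {b: hit_ratio(64, b) for b in (1, 2, 4, 8)}
```

Doubling the block size halves the miss count here; real workloads eventually reverse the trend, as the slide notes, once fetched words stop being used before the block is replaced.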
Multilevel Caches
• High logic density enables caches on chip
  — Faster than bus access
  — Frees bus for other transfers
• Common to use both on and off chip cache
  — L1 on chip, L2 off chip in static RAM
  — L2 access much faster than DRAM or ROM
  — L2 often uses separate data path

Hit Ratio (L1 & L2) – For 8 kbyte and 16 kbyte L1 (figure)
Unified v Split Caches
• One cache for data and instructions, or two: one for data and one for instructions
• Advantages of unified cache
  — Higher hit rate
  — Balances load of instruction and data fetches
  — Only one cache to design & implement
• Advantages of split cache
  — Eliminates cache contention between instruction fetch/decode unit and execution unit
  — Important in pipelining
Pentium 4 Cache
• Pentium (all versions) – two on chip L1 caches
  — Data & instructions
• Pentium III – L3 cache added off chip
• Pentium 4
  — L1 caches: 8k bytes each
Intel Cache Evolution

Problem: External memory slower than the system bus.
Solution: Add external cache using faster memory technology.
First appears: 386

Problem: Increased processor speed results in external bus becoming a bottleneck for cache access.
Solution: Move external cache on-chip, operating at the same speed as the processor.
First appears: 486

Problem: Internal cache is rather small, due to limited space on chip.
Solution: Add external L2 cache using faster technology than main memory.
First appears: 486

Problem: Contention occurs when both the Instruction Prefetcher and the Execution Unit simultaneously require access to the cache. In that case, the Prefetcher is stalled while the Execution Unit's data access takes place.
Solution: Create separate data and instruction caches.
First appears: Pentium

Problem: Increased processor speed results in external bus becoming a bottleneck for L2 cache access.
Solution: Create separate back-side bus that runs at higher speed than the main (front-side) external bus.
First appears: Pentium Pro
Pentium 4 Core Processor
• Fetch/Decode Unit
  — Fetches instructions from L2 cache
  — Decode into micro-ops
  — Store micro-ops in L1 cache
• Out of order execution logic
  — Schedules micro-ops
  — Based on data dependence and resources
  — May speculatively execute
• Execution units
  — Execute micro-ops
Pentium 4 Design Reasoning
• Decodes instructions into RISC-like micro-ops before L1 cache
• Micro-ops fixed length
  — Superscalar pipelining and scheduling
• Pentium instructions long & complex
• Performance improved by separating decoding from scheduling & pipelining
  — (More later – ch14)
• Data cache is write back
  — Can be configured to write through
• L1 cache controlled by 2 bits in register
  — CD = cache disable
  — NW = not write through
ARM Cache Features

Core          Cache Type   Cache Size (kB)   Line Size (words)   Associativity   Location   Write Buffer Size (words)
ARM720T       Unified      8                 4                   4-way           Logical    8
ARM920T       Split        16/16 D/I         8                   64-way          Logical    16
ARM926EJ-S    Split        4-128/4-128 D/I   8                   4-way           Logical    16
ARM1022E      Split        16/16 D/I         8                   64-way          Logical    16
ARM1026EJ-S   Split        4-128/4-128 D/I   8                   4-way           Logical    8
ARM Cache Organization
• Small FIFO write buffer
  — Enhances memory write performance
  — Between cache and main memory
  — Small c.f. cache
  — Data put in write buffer at processor clock speed
  — Processor continues execution
  — External write in parallel until empty
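The FIFO write buffer described above can be modelled as a bounded queue: the processor enqueues writes at full speed and stalls only when the buffer is full, while memory drains it in parallel (a sketch with made-up sizes; the table above lists real buffer sizes of 8 or 16 words):

```python
# Toy FIFO write buffer between processor and main memory.
from collections import deque

class WriteBuffer:
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.queue = deque()
        self.stalls = 0

    def cpu_write(self, addr, value):
        if len(self.queue) == self.capacity:
            self.stalls += 1          # processor must wait for a drain
            self.drain()
        self.queue.append((addr, value))

    def drain(self):                  # external write, completes in parallel
        if self.queue:
            self.queue.popleft()

buf = WriteBuffer(capacity=2)
for i in range(4):                    # 4 back-to-back writes, 2-entry buffer
    buf.cpu_write(i, i)
```

With a tiny 2-entry buffer, the third and fourth writes each stall once; a larger buffer absorbs write bursts without stalling the processor at all.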
ARM Cache and Write Buffer Organization (figure)