
Simulating
a $2M Commercial Server
on a $2K PC
Alaa Alameldeen, Milo Martin, Carl Mauer,
Kevin Moore, Min Xu, Daniel Sorin,
Mark D. Hill, & David A. Wood

Multifacet Project (www.cs.wisc.edu/multifacet)
Computer Sciences Department
University of Wisconsin—Madison
February 2003
(C) 2003 Multifacet Project

Summary
• Context
– Commercial server design is important
– Multifacet project seeks improved designs
– Must evaluate alternatives


• Commercial Servers
– Processors, memory, disks: $2M
– Run large multithreaded transaction-oriented workloads
– Use commercial applications on commercial OS

• To Simulate on $2K PC
– Scale & tune workloads: keep L2 miss rates, etc.
– Manage simulation complexity: separate timing & function
– Cope with workload variability: use randomness & statistics
Wisconsin Multifacet Project

Outline
• Context
– Commercial Servers
– Multifacet Project
• Workload & Simulation Methods
• Separate Timing & Functional Simulation
• Cope with Workload Variability
• Summary

Why Commercial Servers?

• Many (Academic) Architects
– Desktop computing
– Wireless appliances

• We focus on servers
– (Important Market)
– Performance Challenges
– Robustness Challenges
– Methodological Challenges


3-Tier Internet Service
[Figure: three tiers joined by LAN/SAN links: PCs w/ “soft” state; servers running applications for “business” rules; servers running databases for “hard” state. The diagram marks the Multifacet focus.]

Multifacet: Commercial Server Design
• Wisconsin Multifacet Project
– Directed by Mark D. Hill & David A. Wood
– Sponsors: NSF, WI, Compaq, IBM, Intel, & Sun
– Current Contributors: Alaa Alameldeen, Brad Beckman,
Nikhil Gupta, Pacia Harper, Jarrod Lewis, Milo Martin, Carl Mauer,
Kevin Moore, Daniel Sorin, & Min Xu
– Past Contributors: Anastassia Ailamaki, Ender Bilir,
Ross Dickson, Ying Hu, Manoj Plakal, & Anne Condon

• Analysis
– Want 4-64 processors
– Many cache-to-cache misses

– Neither snooping nor directories ideal

• Multifacet Designs
– Snooping w/ multicast [ISCA99] or unordered network [ASPLOS00]
– Bandwidth-adaptive [HPCA02] & token coherence [ISCA03]

Outline
• Context
• Workload & Simulation Methods
– Select, scale, & tune workloads
– Transition workload to simulator
– Specify & test the proposed design
– Evaluate design with simple/detailed processor models
• Separate Timing & Functional Simulation
• Cope with Workload Variability
• Summary

Multifacet Simulation Overview
[Figure: components shown: Full Workloads; Commercial Server (Sun E6000); Scaled Workloads; Memory Protocol Generator (SLICC); Pseudo-Random Protocol Checker; Full System Functional Simulator (Simics); Memory Timing Simulator (Ruby); Processor Timing Simulator (Opal). Region labels: Workload Development, Protocol Development, Timing Simulator.]

• Virtutech Simics (www.virtutech.com)
• Rest is Multifacet software

Select Important Workloads
[Figure: stage shown: Full Workloads]

• Online Transaction Processing: DB2 w/ TPC-C-like
• Java Server Workload: SPECjbb
• Static web content serving: Apache
• Dynamic web content serving: Slashcode
• Java-based Middleware: (soon)

Setup & Tune Workloads (on real hardware)
[Figure: stages shown: Full Workloads; Commercial Server (Sun E6000)]

• Tune workload, OS parameters
• Measure transaction rate, speed-up, miss rates, I/O
• Compare to published results

Scale & Re-tune Workloads
[Figure: stages shown: Commercial Server (Sun E6000); Scaled Workloads]

• Scale-down for PC memory limits
• Retaining similar behavior (e.g., L2 cache miss rate)
• Re-tune to achieve higher transaction rates
(OLTP: raw disk, multiple disks, more users, etc.)

Transition Workloads to Simulation
[Figure: stages shown: Scaled Workloads; Full System Functional Simulator (Simics)]

• Create disk dumps of tuned workloads
• In simulator: Boot OS, start, & warm application
• Create Simics checkpoint (snapshot)

Specify Proposed Computer Design
[Figure: stages shown: Memory Protocol Generator (SLICC); Memory Timing Simulator (Ruby)]

• Coherence Protocol (control tables: states X events)
• Cache Hierarchy (parameters & queues)
• Interconnect (switches & queues)
• Processor (later)
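To make "control tables: states X events" concrete, here is a minimal sketch of a table-driven cache controller. It is not SLICC syntax and not one of the Multifacet protocols: the three MSI-style states, the event names, and the action names are all assumptions chosen for illustration.

```python
# Hypothetical MSI-style control table: (state, event) -> (next_state, actions).
# State, event, and action names are illustrative assumptions, not SLICC.
TABLE = {
    ("I", "Load"):       ("S", ["issue_GETS"]),
    ("I", "Store"):      ("M", ["issue_GETX"]),
    ("I", "Other_GETS"): ("I", []),
    ("I", "Other_GETX"): ("I", []),
    ("S", "Load"):       ("S", ["hit"]),
    ("S", "Store"):      ("M", ["issue_GETX"]),
    ("S", "Other_GETS"): ("S", []),
    ("S", "Other_GETX"): ("I", ["invalidate"]),
    ("M", "Load"):       ("M", ["hit"]),
    ("M", "Store"):      ("M", ["hit"]),
    ("M", "Other_GETS"): ("S", ["send_data"]),
    ("M", "Other_GETX"): ("I", ["send_data", "invalidate"]),
}


def step(state, event):
    """Look up one transition; a missing (state, event) pair is a spec error."""
    if (state, event) not in TABLE:
        raise KeyError(f"no transition for state {state!r} on event {event!r}")
    return TABLE[(state, event)]


def run(events, state="I"):
    """Apply a sequence of events; return the final state and the action log."""
    log = []
    for event in events:
        state, actions = step(state, event)
        log.extend(actions)
    return state, log


if __name__ == "__main__":
    print(run(["Load", "Store", "Other_GETS"]))
```

Keeping the protocol as data rather than control flow means every state/event pair is either listed or flagged as missing, which is what makes a table convenient to generate from and to test.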

Test Proposed Computer Design
[Figure: stages shown: Pseudo-Random Protocol Checker; Memory Timing Simulator (Ruby)]

• Randomly select write action & later read check
• Massive false-sharing for interaction
• Perverse network stresses design
• Transient error & deadlock detection
• Sound but not complete
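The write-then-check idea can be sketched in a few lines. This is only an illustration under stated assumptions: `ToyMemory` is a hypothetical stand-in for the memory system under test (not Ruby), the `drop_every` bug injection is invented for the demo, and the small address range is a crude way to force operations to collide.

```python
import random


class ToyMemory:
    """Hypothetical stand-in for the memory system under test (not Ruby)."""

    def __init__(self, drop_every=0):
        self.data = {}
        self.writes = 0
        self.drop_every = drop_every  # injected bug: silently drop every Nth write

    def write(self, addr, value):
        self.writes += 1
        if self.drop_every and self.writes % self.drop_every == 0:
            return  # the injected bug loses this write
        self.data[addr] = value

    def read(self, addr):
        return self.data.get(addr, 0)


def random_test(memory, steps=1000, num_addrs=8, seed=0):
    """Randomly write values, later read them back, and check them.

    Only a few addresses are used so that operations collide often.
    Returns a list of (addr, expected, observed) mismatches.
    """
    rng = random.Random(seed)
    expected = {}
    errors = []
    for _ in range(steps):
        addr = rng.randrange(num_addrs)
        if rng.random() < 0.5:
            value = rng.randrange(1 << 16)
            memory.write(addr, value)
            expected[addr] = value
        elif addr in expected:
            observed = memory.read(addr)
            if observed != expected[addr]:
                errors.append((addr, expected[addr], observed))
    return errors


if __name__ == "__main__":
    print(len(random_test(ToyMemory())))
    print(len(random_test(ToyMemory(drop_every=7))))
```

In the sense of the last bullet, every mismatch reported is a real error (sound), but a clean run does not prove the design correct (not complete).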

Simulate with Simple Blocking Processor
[Figure: stages shown: Scaled Workloads; Full System Functional Simulator (Simics); Memory Timing Simulator (Ruby)]

• Warm-up caches or sometimes sufficient (SafetyNet)
• Run for fixed number of transactions
– Some transactions partially done at start
– Other transactions partially done at end

• Cope with workload variability (later)

Simulate with Detailed Processor
[Figure: stages shown: Scaled Workloads; Full System Functional Simulator (Simics); Memory Timing Simulator (Ruby); Processor Timing Simulator (Opal)]

• Accurate (future) timing & (current) function
• Simulation complexity decoupled (discussed soon)
• Same transaction methodology & workload variability issues

Simulation Infrastructure & Workload Process
[Figure: components shown: Full Workloads; Commercial Server (Sun E6000); Scaled Workloads; Memory Protocol Generator (SLICC); Pseudo-Random Protocol Checker; Full System Functional Simulator (Simics); Memory Timing Simulator (Ruby); Processor Timing Simulator (Opal)]

• Select important workloads: run, tune, scale, & re-tune
• Specify system & pseudo-randomly test
• Create warm workload checkpoint
• Simulate with simple or detailed processor
• Fixed #transactions, manage simulation complexity (next),
cope with workload variability (next next)

Outline
• Context
• Simulation Infrastructure & Workload Process
• Separate Timing & Functional Simulation
– Simulation Challenges
– Managing Simulation Complexity
– Timing-First Simulation
– Evaluation
• Cope with Workload Variability
• Summary

Challenges to Timing Simulation
• Execution-driven simulation is getting harder
• Micro-architecture complexity
– Multiple “in-flight” instructions
– Speculative execution
– Out-of-order execution

• Thread-level parallelism
– Hardware Multi-threading
– Traditional Multi-processing


Challenges to Functional Simulation
• Commercial workloads have high functional fidelity demands

[Figure: target applications arranged by application complexity (Kernels, SPEC Benchmarks, Web Server, Database, Operating System) above the (simulated) target system: Processor, RAM, MMU, Status Registers, Real Time Clock, Serial Port, Terminal, I/O MMU Controller, DMA Controller, IRQ Controller, PCI Bus, Graphics Card, Ethernet Controller, CDROM, SCSI Controller, SCSI Disks, Fiber Channel Controller.]

Managing Simulator Complexity

• Integrated (SimOS): a single Timing and Functional Simulator
- Complex

• Functional-First (Trace-driven): Functional Simulator, then Timing Simulator
- Timing feedback

• Timing-Directed: Timing Simulator (Complete Timing, No? Function) with Functional Simulator (No Timing, Complete Function)
+ Timing feedback
- Tight Coupling
- Performance?

• Timing-First (Multifacet): Timing Simulator (Complete Timing, Partial Function) with Functional Simulator (No Timing, Complete Function)
+ Timing feedback
+ Using existing simulators
+ Software development advantages

Timing-First Simulation
• Timing Simulator
– does functional execution of user and privileged operations
– does speculative, out-of-order multiprocessor timing simulation
– does NOT implement functionality of full instruction set or any devices

• Functional Simulator
– does full-system multiprocessor simulation
– does NOT model detailed micro-architectural timing

[Figure: the Timing Simulator (CPU executing add and load, Cache, Network) commits instructions and verifies them against the Functional Simulator (CPU, RAM, System).]

Timing-First Operation
• As instruction retires, step CPU in functional simulator
• Verify instruction’s execution
• Reload state if timing simulator deviates from functional
– Loads in multi-processors
– Instructions with unidentified side-effects
– NOT loads/stores to I/O devices

[Figure: the Timing Simulator (CPU executing add and load, Cache, Network) commits and verifies each instruction against the Functional Simulator (CPU, RAM, System); on a deviation, state is reloaded from the Functional Simulator.]
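The retire, verify, and reload steps above can be sketched as a small loop. This is a toy under stated assumptions, not TFsim or the Simics interface: the two-instruction ISA, the latencies, and the choice to leave `mul` unimplemented in the timing model are all invented to force a deviation.

```python
# Toy ISA (an assumption for illustration): ("add" | "mul", rd, rs, imm).

def functional_step(regs, instr):
    """Complete function, no timing: the reference result of one instruction."""
    op, rd, rs, imm = instr
    regs = dict(regs)
    regs[rd] = regs[rs] + imm if op == "add" else regs[rs] * imm
    return regs


def timing_step(regs, instr):
    """Complete timing, partial function: 'mul' is deliberately unimplemented."""
    op, rd, rs, imm = instr
    regs = dict(regs)
    if op == "add":
        regs[rd] = regs[rs] + imm
        latency = 1
    else:
        latency = 4  # the latency is modeled, but the destination is left stale
    return regs, latency


def run_timing_first(program, regs):
    """Return (final registers, total cycles, number of deviations)."""
    timing_regs = dict(regs)
    func_regs = dict(regs)
    cycles = 0
    deviations = 0
    for instr in program:
        timing_regs, latency = timing_step(timing_regs, instr)  # execute, retire
        cycles += latency
        func_regs = functional_step(func_regs, instr)  # step functional CPU
        if timing_regs != func_regs:  # verify the retired instruction
            deviations += 1
            timing_regs = dict(func_regs)  # reload state from functional
    return func_regs, cycles, deviations


if __name__ == "__main__":
    program = [("add", "r1", "r0", 5), ("mul", "r2", "r1", 3), ("add", "r3", "r2", 1)]
    print(run_timing_first(program, {"r0": 0, "r1": 0, "r2": 0, "r3": 0}))
```

The point of the arrangement is that the functional simulator always holds the correct architected state, so a gap in the timing model costs a reload rather than a wrong answer.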

Benefits of Timing-First
• Supports speculative multi-processor timing models
• Leverages existing simulators
• Software development advantages
– Increases flexibility and reduces code complexity
– Immediate, precise check on timing simulator

• However:
– How much performance error is introduced in this approach?
– Are there simulation performance penalties?


Evaluation
• Our implementation, TFsim, uses:
– Functional Simulator: Virtutech Simics
– Timing simulator: implemented in less than one person-year

• Evaluated using OS-intensive commercial workloads
– OS Boot: > 1 billion instructions of Solaris 8 startup
– OLTP: TPC-C-like benchmark using a 1 GB database
– Dynamic Web: Apache serving message board, using code
and data similar to slashdot.org
– Static Web: Apache web server serving static web pages
– Barnes-Hut: Scientific SPLASH-2 benchmark


Measured Deviations
• Fewer than 20 deviations per 100,000 instructions (0.02%)

If the Timing Simulator Modeled Fewer Events


Analysis of Results
• Runs full-system workloads!
• Timing performance impact of deviations
– Worst case: less than 3% performance error

• ‘Overhead’ of redundant execution
– 18% on average for uniprocessors
– 18% (2 processors) up to 36% (16 processors)

[Figure: Total Execution Time, split between Functional Simulator and Timing Simulator]

Performance Comparison
                           RSIM                 Comparison   TFsim
Target Application         SPLASH-2 Kernels     match        SPLASH-2 Kernels
(Simulated) Target System  Out-of-Order MP,     close        Out-of-Order MP,
                           SPARC V9                          Full-system SPARC V9
Host Computer              400 MHz SPARC        different    1.2 GHz Pentium
                           running Solaris                   running Linux

• Absolute simulation performance comparison
– In kilo-instructions committed per second (KIPS)
– RSIM Scaled: 107 KIPS
– Uniprocessor TFsim: 119 KIPS


Timing-First Conclusions
• Execution-driven simulators are increasingly complex
• How to manage complexity?
• Our answer: Timing-First Simulation
[Figure: Timing Simulator (Complete Timing, Partial Function) paired with Functional Simulator (No Timing, Complete Function)]
– Introduces relatively little performance error (worst case: 3%)
– Has low overhead (18% uniprocessor average)
– Rapid development time

Outline
• Context
• Workload Process & Infrastructure
• Separate Timing & Functional Simulation
• Cope with Workload Variability
– Variability in Multithreaded Workloads
– Coping in Simulation
– Examples & Statistics
• Summary

What is Happening Here?

[Figure: OLTP plot]

What is Happening Here?
• How can slower memory lead to faster workload?
• Answer: Multithreaded workload takes different path
– Different lock race outcomes
– Different scheduling decisions

• (1) Does this happen for real hardware?
• (2) If so, what should we do about it?


One Second Intervals (on real hardware)

[Figure: OLTP plot]

60 Second Intervals (on real hardware)

[Figure: OLTP plot, annotated “16-day simulation”]

Coping with Workload Variability
• Running (simulating) long enough not appealing
• Need to separate coincidental & real effects
• Standard statistics on real hardware
– Variation within base system runs
vs. variation between base & enhanced system runs
– But deterministic simulation has no “within” variation

• Solution with deterministic simulation
– Add pseudo-random delay on L2 misses
– Simulate base (enhanced) system many times
– Use simple or complex statistics
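A minimal sketch of the "pseudo-random delay, many runs" idea follows. The `simulate` function is a hypothetical toy, not Ruby or Simics: it only averages per-miss latencies, so it shows how a seeded perturbation creates run-to-run variation but does not reproduce the divergent lock and scheduling paths of a real multithreaded workload. The latency and jitter values are assumptions.

```python
import random


def simulate(base_latency, seed, num_misses=1000, max_jitter=4):
    """Toy run: every L2 miss costs base_latency plus a pseudo-random delay.

    Returns the average miss latency. With max_jitter=0 the run is fully
    deterministic, so repeated runs show no "within" variation at all.
    """
    rng = random.Random(seed)
    total = sum(base_latency + rng.randint(0, max_jitter) for _ in range(num_misses))
    return total / num_misses


def run_many(base_latency, num_runs=10, **kwargs):
    """Simulate the same configuration once per seed to sample its variation."""
    return [simulate(base_latency, seed, **kwargs) for seed in range(num_runs)]


if __name__ == "__main__":
    print(run_many(80, max_jitter=0))  # identical results on every run
    print(run_many(80))                # perturbed runs differ from seed to seed
```

The resulting list of per-run results is the sample that the statistics on the following slides operate on, once for the base system and once for the enhanced system.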

Coincidental (Space) Variability


Wrong Conclusion Ratio

• WCR (16,32) = 18%
• WCR (16,64) = 7.5%
• WCR (32,64) = 26%
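The slide does not define the wrong conclusion ratio. The sketch below assumes it is the fraction of single-run comparisons (one run of each configuration) whose outcome contradicts the comparison of the two sample means. The function name, the lower-is-better convention, and the sample values are all assumptions for illustration.

```python
from itertools import product


def wrong_conclusion_ratio(runs_a, runs_b):
    """Fraction of (a, b) run pairs that contradict the comparison of means.

    Assumes lower is better (e.g. cycles per transaction) and that the two
    means differ. Ties between individual runs count as wrong conclusions.
    """
    mean_a = sum(runs_a) / len(runs_a)
    mean_b = sum(runs_b) / len(runs_b)
    a_is_better = mean_a < mean_b
    wrong = sum(1 for a, b in product(runs_a, runs_b) if (a < b) != a_is_better)
    return wrong / (len(runs_a) * len(runs_b))


if __name__ == "__main__":
    # Made-up per-run results for two configurations.
    print(wrong_conclusion_ratio([10, 12, 14], [11, 13, 15]))
```

Read this way, a WCR of 26% would mean that roughly one in four single-run experiments picks the wrong configuration.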

More Generally: Use Standard Statistics
• As one would for a measurement of a “live” system
• Confidence Intervals
– 95% confidence intervals contain true value 95% of the time
– Non-overlapping confidence intervals give statistically
significant conclusions

• Use ANOVA or Hypothesis Testing – even better!

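As a small worked example of the confidence-interval check, the sketch below computes an approximate 95% interval for the mean of several runs and tests two intervals for overlap. It uses the normal quantile 1.96 as a simplifying assumption (a Student-t quantile is more appropriate for a handful of runs), and the sample values are made up.

```python
import math
import statistics


def confidence_interval_95(samples):
    """Approximate 95% confidence interval for the mean (normal quantile 1.96)."""
    mean = statistics.mean(samples)
    half_width = 1.96 * statistics.stdev(samples) / math.sqrt(len(samples))
    return mean - half_width, mean + half_width


def intervals_overlap(ci_a, ci_b):
    """True if the two closed intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


if __name__ == "__main__":
    base = [100, 102, 98, 101, 99]     # made-up cycles per transaction
    enhanced = [90, 92, 88, 91, 89]
    ci_base = confidence_interval_95(base)
    ci_enhanced = confidence_interval_95(enhanced)
    print(ci_base, ci_enhanced, intervals_overlap(ci_base, ci_enhanced))
```

If the intervals do overlap, the comparison is inconclusive at this confidence level, and more runs or a hypothesis test are needed.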

Confidence Interval Example

[Figure: confidence-interval plot; axis label: ROB]

• Estimate #runs to get non-overlapping confidence intervals

Also Time Variability (on real hardware)

[Figure: OLTP plot]

• Therefore, select checkpoint(s) carefully

Workload Variability Summary
• Variability is a real phenomenon for multi-threaded
workloads
– Runs from same initial conditions are different

• Variability is a challenge for simulations
– Simulations are short
– Wrong conclusions may be drawn

• Our solution accounts for variability
– Multiple runs, confidence intervals
– Reduces wrong conclusion probability


Talk Summary
• Simulations of $2M Commercial Servers must
– Complete in reasonable time (on $2K PCs)
– Handle OS, devices, & multithreaded hardware
– Cope with variability of multithreaded software

• Multifacet
– Scale & tune transactional workloads
– Separate timing & functional simulation
– Cope w/ workload variability via randomness & statistics

• References (www.cs.wisc.edu/multifacet/papers)
– Simulating a $2M Commercial Server on a $2K PC [Computer03]
– Full-System Timing-First Simulation [Sigmetrics02]
– Variability in Architectural Simulations … [HPCA03]

Other Multifacet Methods Work
• Specifying & Verifying Coherence Protocols
– [SPAA98], [HPCA99], [SPAA99], & [TPDS02]

• Workload Analysis & Improvement
– Database systems [VLDB99] & [VLDB01]
– Pointer-based [PLDI99] & [Computer00]
– Middleware [HPCA03]

• Modeling & Simulation
– Commercial workloads [Computer02] & [HPCA03]
– Decoupling timing/functional simulation [Sigmetrics02]
– Simulation generation [PLDI01]
– Analytic modeling [Sigmetrics00] & [TPDS TBA]
– Micro-architectural slack [ISCA02]

Backup Slides


One Ongoing/Future Methods Direction
• Middleware Applications
– Memory system behavior of Java Middleware [HPCA 03]
– Machine measurements
– Full-system simulation

• Future Work: Multi-Machine Simulation
– Isolate middle-tier from client emulators and database

• Understand fundamental workload behaviors
– Drives future system design


ECperf vs. SPECjbb

• Different cache-to-cache transfer ratios!

Online Transaction Processing (OLTP)




DB2 with a TPC-C-like workload. The TPC-C benchmark is widely used to
evaluate system performance for the on-line transaction processing market. The
benchmark itself is a specification that describes the schema, scaling rules,
transaction types and transaction mix, but not the exact implementation of the
database. TPC-C transactions are of five transaction types, all related to an
order-processing environment. Performance is measured by the number of
“New Order” transactions performed per minute (tpmC).
Our OLTP workload is based on the TPC-C v3.0 benchmark. We use IBM’s
DB2 V7.2 EEE database management system and an IBM benchmark kit to
build the database and emulate users. We build an 800 MB 4000-warehouse
database on five raw disks and an additional dedicated database log disk. We
scaled down the sizes of each warehouse by maintaining the reduced ratios of 3
sales districts per warehouse, 30 customers per district, and 100 items per
warehouse (compared to 10, 30,000 and 100,000 required by the TPC-C
specification). Each user randomly executes transactions according to the
TPC-C transaction mix specifications, and we set the think and keying times for
users to zero. A different database thread is started for each user. We measure all
completed transactions, even those that do not satisfy timing constraints of the
TPC-C benchmark specification.


Java Server Workload (SPECjbb)
• Java-based middleware applications are increasingly used in modern
e-business settings. SPECjbb is a Java benchmark emulating a 3-tier
system with emphasis on the middle tier server business logic.
SPECjbb runs in a single Java Virtual Machine (JVM) in which
threads represent terminals in a warehouse. Each thread independently
generates random input (tier 1 emulation) before calling transaction-specific
business logic. The business logic operates on the data held in
binary trees of Java objects (tier 3 emulation). The specification states
that the benchmark does no disk or network I/O.
• We used Sun’s HotSpot 1.4.0 Server JVM and Solaris’s native thread
implementation. The benchmark includes driver threads to generate
transactions. We set the system heap size to 1.8 GB and the new object
heap size to 256 MB to reduce the frequency of garbage collection.
Our experiments used 24 warehouses, with a data size of
approximately 500 MB.

Static Web Content Serving: Apache
• Web servers such as Apache represent an important enterprise server
application. Apache is a popular open-source web server used in many
internet/intranet settings. In this benchmark, we focus on static web
content serving.
• We use Apache 2.0.39 for SPARC/Solaris 8 configured to use pthread
locks and minimal logging at the web server. We use the Scalable URL
Request Generator (SURGE) as the client. SURGE generates a
sequence of static URL requests which exhibit representative
distributions for document popularity, document sizes, request sizes,
temporal and spatial locality, and embedded document count. We use a
repository of 20,000 files (totalling ~500 MB), and use clients with
zero think time. We compiled both Apache and SURGE using Sun’s
WorkShop C 6.1 with aggressive optimization.


Dynamic Web Content Serving: Slashcode
• Dynamic web content serving has become increasingly important for
web sites that serve large amounts of information. Dynamic content is
used by online stores, instant news, and community message board
systems. Slashcode is an open-source dynamic web message posting
system used by the popular slashdot.org message board system.
• We used Slashcode 2.0, Apache 1.3.20, and Apache’s mod_perl
module 1.25 (with perl 5.6) on the server side. We used MySQL
3.23.39 as the database engine. The server content is a snapshot from
the slashcode.com site, containing approximately 3000 messages with
a total size of 5 MB. Most of the run time is spent on dynamic web
page generation. We use a multi-threaded user emulation program to
emulate user browsing and posting behavior. Each user independently
and randomly generates browsing and posting requests to the server
according to a transaction mix specification. We compiled both server
and client programs using Sun’s WorkShop C 6.1 with aggressive
optimization.
