Journal of Harbin Institute of Technology (New Series), Vol. 14, No. 1, 2007
A high performance fault-tolerant approach based
on simultaneous multithreading
YANG Hua,CUI Gang,WANG Ling,YANG Xiao-zong
杨华, 崔刚, 王玲, 杨孝宗
(School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China, E-mail: {yangh, cg, 1wang}@fle1.hit.edu.cn)
Abstract: To cope with the ever-increasing susceptibility to transient faults in modern processors, a scheme called Tri-modular Redundantly and Simultaneously Threaded processor with Recovery is proposed, which provides transient fault coverage and reconfiguration from partial permanent faults with high performance. Besides two redundant thread contexts, an arbitrator context is introduced to act as either an arbitrator or an ordinary thread, which makes better use of hardware resources. Its sphere of replication is reconfigurable and flexible in handling changing demands. Simulation with 11 SPEC2000 benchmarks shows that it outperforms SMT-Single by 21.5% on average, while maintaining flexibility and fault-tolerant capability.
Key words: fault-tolerant, simultaneous multithreading, reconfigurable, high-performance
CLC number: TP302.8  Document code: A  Article ID: 1005-9113(2007)01-0114-05
Due to smaller feature sizes, reduced voltage levels, and higher frequencies, future processors are increasingly prone to operational faults, especially transient faults. Transient faults appear sporadically, and their primary sources are single event radiation (SER) upsets, which are usually caused by energetic particle strikes (e.g. alpha particles and cosmic rays). Particle strikes can deposit or remove sufficient charge to temporarily turn a device on or off, possibly creating a logic error[1-3]. While shielding is possible, its physical construction and cost make it infeasible at present.
Most commercial fault-tolerant computers use fully replicated hardware components to detect faults, e.g. the IBM S/390 G5. The components are lock-stepped (cycle-by-cycle synchronized) to ensure that, in each cycle, they perform the same operation on the same inputs and produce the same outputs in the absence of faults[4]. Unfortunately, full replication doubles area, power, cost, etc., and for a given hardware budget it inevitably reduces performance by statically partitioning resources among redundant operations.
In the forthcoming chip multithreading era, one approach to redundant execution is redundant multithreading (RMT), which runs two identical copies of the same program as independent threads, feeds them identical inputs, and compares their outputs. Several RMT architectures have been proposed, e.g. SRT[5], AR-SMT[6], etc. These architectures have greatly sped up redundant execution, but still fall short in resource efficiency, flexibility, or fault coverage.
Received 2004-09-10.
Sponsored by the National Pre-research Foundation (Grant No. 41316.1.2)
We propose the Tri-modular Redundantly and Simultaneously Threaded processor with Recovery (TRSTR), which is SMT-based and combines the fault coverage of lockstep with the efficiency of SRT, while addressing both performance and flexibility. It focuses on transient faults, as well as reconfiguration from partial permanent faults, in the next-generation multithreaded processor environment.
1 Previous Work
Simultaneous multithreading (SMT) enables multiple threads to compete for and share hardware resources each cycle, alleviating both horizontal and vertical slot loss[7]. SMT is naturally attractive for fault detection, because it can provide redundancy by running more than one copy of the same program simultaneously.
AR-SMT[6] was the first to use SMT to execute two copies of the same program. It also suggests using speculation techniques that communicate data values and branch outcomes between the two redundant threads of a single program to accelerate execution.
Simultaneous and Redundantly Threaded (SRT)[5] improves on AR-SMT via a few optimizations such as slack-fetch, the load value queue (LVQ) and the branch outcome queue (BOQ). SRT replicates an application into two communicating threads called leading and trailing. The leading thread executes ahead of the trailing
one by a distance named slack-fetch. This technique, together with the LVQ and BOQ, enables the trailing thread to use the pre-fetched data and resolved branch results from the leading one, greatly improving overall performance. The concept of the sphere of replication (SoR) was introduced in SRT. It can be viewed as the logical boundary of redundancy within a system: components within the SoR enjoy fault coverage due to redundant execution, whereas components outside the SoR do not. Fig. 1 shows the sketch map of SRT.
[Figure: block diagram with a dashed boundary labeled "Sphere of Replication"]
Fig. 1 The sketch map of SRT
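The input replication that SRT performs through the LVQ can be illustrated with a small software model. The Python sketch below is only an illustration under stated assumptions, not the SRT hardware design: the class name `LoadValueQueue`, its method names, the address check and the helper `run_loads` are all hypothetical and do not come from the paper.

```python
from collections import deque


class LoadValueQueue:
    """Toy model of a load value queue (LVQ) shared by two redundant threads.

    The leading thread performs the real memory access and records the
    (address, value) pair. The trailing thread consumes the pairs in program
    order instead of reading memory again, so both threads see the same input.
    """

    def __init__(self):
        self._entries = deque()

    def push_from_leading(self, address, value):
        self._entries.append((address, value))

    def pop_for_trailing(self, address):
        if not self._entries:
            raise RuntimeError("trailing thread ran ahead of the leading thread")
        lead_address, value = self._entries.popleft()
        # Assumption of this sketch: an address mismatch means the two threads
        # diverged before the load, which is treated as evidence of a fault.
        if lead_address != address:
            raise ValueError("load address mismatch between redundant threads")
        return value


def run_loads(memory, addresses):
    """Run the same load sequence on both threads and return what each saw."""
    lvq = LoadValueQueue()
    leading = []
    for address in addresses:
        value = memory[address]          # only the leading thread touches memory
        lvq.push_from_leading(address, value)
        leading.append(value)
    trailing = [lvq.pop_for_trailing(address) for address in addresses]
    return leading, trailing
```

Because the trailing thread never reads memory itself, a value changed by another agent between the two executions cannot cause a false mismatch, which is the reason for replicating inputs at the boundary of the sphere of replication.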
2 TRSTR Architecture
To combine the fault coverage of lockstep with the efficiency of SRT, we propose a new architecture called TRSTR, which adds a third thread context, named the arbitrator, to the SRT environment. Fig. 2 shows the outline of TRSTR. Unlike the fully shared scheme in conventional SMT, the arbitrator context in TRSTR keeps a private set of basic hardware resources, which guarantees that the arbitrator can run independently and which normally cannot be accessed by the leading and trailing contexts. All other hardware resources in TRSTR are fully shared by the leading, trailing and arbitrator contexts. TRSTR borrows from SRT the LVQ and BOQ techniques in the input replicator, thereby enabling the trailing thread to use the pre-fetched data and resolved branch outcomes of the leading one. To meet different speed and fault coverage requirements, TRSTR has two working modes: Tri-Simultaneous with Voting (TSV) and Dual-Simultaneous with Arbitrator (DSA). The input replicator, output comparator and the small sphere of replication (surrounded by the inner dot-dash line in Fig. 2) have almost the same meaning as in SRT. The big sphere of replication (surrounded by the outer dot-dash line in Fig. 2) applies to DSA mode while arbitrating and to TSV mode.
[Figure: block diagram with an inner boundary labeled "Sphere of Replication (DSA running)" and an outer sphere of replication]
Fig. 2 The outline of TRSTR
2.1 TSV Mode
When working in TSV mode, the arbitrator context behaves just as a second trailing context. The leading, trailing and arbitrator contexts work in the fashion of tri-modular redundancy (TMR) with voting. That is, each of the three contexts executes a duplicate thread of the same application, and the output comparator compares their outputs and votes before committing to the rest of the system.
The outer dot-dash line in Fig. 2 surrounds the sphere of replication of TSV mode. The arbitrator context acts as a second trailing context, and communicates with the input replicator and the output comparator (arrows ②, ③). The leading thread runs ahead of the trailing one by a distance of slack-fetch-A, and the trailing thread runs ahead of the arbitrator thread by another distance, slack-fetch-B. In this way, each instruction is executed three times, separately and at staggered times, on the leading, trailing and arbitrator (as another trailing) contexts, before the output comparator votes on the outputs and commits the result. In addition, slack-fetch-B is much shorter than slack-fetch-A, because the only aim of slack-fetch-B is to stagger the execution of the same instruction in the trailing and arbitrator contexts, so as to alleviate contention for the same resources by corresponding instructions in the two threads. Thus, the total slack distance between the leading context and the arbitrator context, which equals slack-fetch-A plus slack-fetch-B, is nearly the same as the slack-fetch distance in SRT. Therefore, involving the arbitrator context in TSV mode does not reduce overall execution efficiency compared with SRT when all three outputs match.
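The staggering described above can be made concrete with a small calculation. The following Python sketch is a simplified illustration only: the function name `fetch_limits` is hypothetical, the slack values used in the usage note are made-up numbers rather than parameters from the paper, and the model assumes that both slacks are measured in fetched instructions relative to the leading thread.

```python
def fetch_limits(leading_fetched, slack_a, slack_b):
    """Return how far the trailing and arbitrator threads may have fetched.

    leading_fetched: instructions fetched so far by the leading thread.
    slack_a: required lead of the leading thread over the trailing thread.
    slack_b: required lead of the trailing thread over the arbitrator thread.

    The total distance between the leading and arbitrator threads is
    slack_a + slack_b, as stated in the text.
    """
    trailing_limit = max(0, leading_fetched - slack_a)
    arbitrator_limit = max(0, leading_fetched - slack_a - slack_b)
    return trailing_limit, arbitrator_limit
```

For instance, with hypothetical slacks of 240 and 16 instructions, `fetch_limits(1000, 240, 16)` gives a trailing limit of 760 and an arbitrator limit of 744, so the arbitrator stays 256 instructions behind the leading thread while only 16 behind the trailing one.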
When the outputs from different contexts do not match, which means a fault has occurred in some thread context, TSV mode does not re-execute the corresponding instruction in most cases, namely when a vote can be decided: two results are identical while the third is inconsistent, probably because of a transient fault. Because each instruction executes three times in the three contexts at different clock cycles, almost all transient faults in any component can be masked by voting, without any performance impact. Only when all three outputs differ from each other does the corresponding instruction need to be re-executed in the three contexts.
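The voting rule of TSV mode can be summarized in a few lines of code. The sketch below is a minimal software model, not the output comparator's actual logic: the function name `vote` and the shape of its return value are assumptions made for illustration.

```python
def vote(leading, trailing, arbitrator):
    """Three-way vote over the outputs of one instruction.

    Returns a tuple (value, faulty_context, must_reexecute):
      - all three agree: commit the value, no faulty context;
      - exactly two agree: commit the majority value and name the odd one out;
      - all three differ: nothing can be committed, re-execution is required.
    """
    outputs = {"leading": leading, "trailing": trailing, "arbitrator": arbitrator}
    if leading == trailing == arbitrator:
        return leading, None, False
    for name in outputs:
        # Compare the two outputs that remain when this context is left out.
        others = [value for other, value in outputs.items() if other != name]
        if others[0] == others[1]:
            return others[0], name, False
    return None, None, True
```

Only the last case costs extra execution time, which matches the observation in the text that a single transient fault is masked without re-executing the instruction.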
If a permanent fault occurs in a component of the leading and trailing contexts, e.g. a physical fault in an I-ALU unit, there will be consecutive failed votes within a short period of time, because some instructions are executed on that unit. Upon detecting this, all three threads stop running temporarily, and the arbitrator context begins executing a self-verification program to verify its private resources. If no fault is detected, the arbitrator communicates directly with the rest of the system (arrows ④, ⑤ in Fig. 2) and continues execution of the program; at the same time, the leading and trailing contexts carry out self-checking to locate and shield off the bad components. After that, all three contexts resume normal execution of the original program in TSV mode. On the contrary, if a faulty component is detected during the self-verification of the arbitrator context, it is simply labeled as unusable and the arbitrator rejoins the redundant execution of the original program's threads.
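One way to tell a permanent fault from isolated transient faults is to watch how densely failed votes cluster in time. The Python sketch below illustrates that idea under explicit assumptions: the class name `PermanentFaultDetector`, the threshold of three failures and the 100-cycle window are hypothetical choices, since the paper does not give concrete detection parameters, and the model counts failures inside a sliding window rather than strictly consecutive ones.

```python
from collections import deque


class PermanentFaultDetector:
    """Flags a suspected permanent fault when failed votes cluster in time."""

    def __init__(self, threshold=3, window_cycles=100):
        self.threshold = threshold          # hypothetical: failures needed to raise the flag
        self.window_cycles = window_cycles  # hypothetical: length of the observation window
        self._failure_cycles = deque()

    def record_vote(self, cycle, failed):
        """Record one vote outcome; return True if a permanent fault is suspected."""
        if failed:
            self._failure_cycles.append(cycle)
        # Forget failures that have fallen out of the observation window.
        while self._failure_cycles and cycle - self._failure_cycles[0] >= self.window_cycles:
            self._failure_cycles.popleft()
        return len(self._failure_cycles) >= self.threshold
```

A lone failure followed by a long quiet period is forgotten, whereas several failures within the window trigger the self-verification and self-checking steps described above.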
In short, TSV mode almost never re-executes instructions upon detecting transient faults. We call this feature seldom-re-execute; it makes TRSTR more suitable for circumstances where real-time behavior is desired as well as fault tolerance.
2.2 DSA Mode
When working in DSA mode, TRSTR behaves both like a conventional SMT and like an SRT, and the arbitrator context can be employed either as a conventional SMT thread or as an arbitrator thread. It behaves like an SMT because generally three threads from two different applications execute simultaneously in the three contexts: one application (called the crucial job) runs its two threads redundantly on the leading and trailing contexts, while the other application (called the common job) runs its single thread in the arbitrator context (arrows ④, ⑤ in Fig. 2). It behaves like an SRT because the crucial job's two redundant threads run exactly as in SRT, except when a mismatch is detected in the output comparator.
When the results from the leading and trailing thread contexts mismatch, the output comparator sends a request for arbitration (arrow ① in Fig. 2). On detecting an arbitration request, DSA mode temporarily halts the common job's execution in the arbitrator context, and copies the crucial job's committed state (PC, register file set, etc.) to the arbitrator context. After that, the arbitrator context re-executes the corresponding faulty instruction. The result produced by the arbitrator is compared with the two corresponding results from the leading and trailing contexts (arrows ②, ③ in Fig. 2), in order to determine whether the leading context or the trailing context is right and to commit the instruction. In addition, the leading and trailing contexts undergo self-recovery during the cycles in which the instruction is being re-executed in the arbitrator context. After that, the leading and trailing contexts continue to execute the crucial job's two redundant threads, trailing after leading, and the arbitrator context restarts the common job's thread. Almost all transient faults in the leading and trailing contexts can be detected and recovered in this way.
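The arbitration step of DSA mode can be sketched as follows. This is a minimal software model, not the authors' implementation: the function name `arbitrate` and the `reexecute` callback (standing in for the arbitrator context re-running the instruction from the crucial job's committed state) are hypothetical, and raising an error when all three results differ is an assumption of this sketch, since the paper does not describe that case for DSA mode.

```python
def arbitrate(leading_result, trailing_result, reexecute):
    """Resolve one instruction of the crucial job in a DSA-like scheme.

    reexecute: zero-argument callable that models the arbitrator context
    re-executing the instruction; it is called only on a mismatch, so the
    common job is left undisturbed in the fault-free case.

    Returns (committed_value, faulty_context), where faulty_context is None
    when the leading and trailing results already agree.
    """
    if leading_result == trailing_result:
        return leading_result, None
    arbitrator_result = reexecute()
    if arbitrator_result == leading_result:
        return leading_result, "trailing"
    if arbitrator_result == trailing_result:
        return trailing_result, "leading"
    # Assumption of this sketch: no two results agree, so nothing can be committed.
    raise RuntimeError("arbitration failed: all three results differ")
```

The key design point the model shows is that the third execution is paid for only when the two redundant threads disagree, whereas TSV mode executes every instruction three times.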
If a permanent fault occurs in the leading or trailing context, there will be consecutive mismatches within a short period of time. In this case, the arbitrator context switches to running the crucial job's thread; at the same time, the leading and trailing contexts begin to run a self-testing program to locate the faulty component (e.g. an I-ALU unit) and label it as unusable. After shielding off the faulty component, the leading and trailing contexts, together with the input replicator and the output comparator, resume the crucial job's execution, and the arbitrator switches back to the execution of its common job. In this way, some permanent faults can be detected and covered without stopping the crucial job's execution. We call this feature non-stop-crucial; it makes TRSTR more attractive for vital tasks that cannot be canceled or stopped.
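The behavior of DSA mode described in this section can be viewed as a small state machine. The Python sketch below is one possible reading of the text, not a specification from the paper: the state names, the event names and the table `ARBITRATOR_WORKLOAD` are hypothetical labels introduced for illustration.

```python
from enum import Enum


class DsaState(Enum):
    RUNNING = "running"            # crucial job on leading/trailing, common job on arbitrator
    ARBITRATING = "arbitrating"    # arbitrator re-executes one mismatching instruction
    SELF_TESTING = "self_testing"  # arbitrator runs the crucial job while the others self-test


# (state, event) -> next state. Event names are hypothetical.
TRANSITIONS = {
    (DsaState.RUNNING, "mismatch"): DsaState.ARBITRATING,
    (DsaState.ARBITRATING, "arbitration_done"): DsaState.RUNNING,
    (DsaState.ARBITRATING, "consecutive_mismatches"): DsaState.SELF_TESTING,
    (DsaState.SELF_TESTING, "fault_shielded"): DsaState.RUNNING,
}

# What the arbitrator context executes in each state.
ARBITRATOR_WORKLOAD = {
    DsaState.RUNNING: "common job",
    DsaState.ARBITRATING: "mismatching instruction of the crucial job",
    DsaState.SELF_TESTING: "crucial job",
}


def next_state(state, event):
    """Return the state that follows `state` when `event` occurs."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not valid in state {state.name}") from None


def run(events, state=DsaState.RUNNING):
    """Apply a sequence of events and return the list of visited states."""
    trace = [state]
    for event in events:
        state = next_state(state, event)
        trace.append(state)
    return trace
```

In every state of this model the crucial job keeps making progress on at least one context, which is the non-stop-crucial property; only the common job is ever suspended.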
2.3 Implementation
TRSTR is based on SMT, and needs three contexts (leading, trailing, arbitrator), each of which represents a thread. Besides the techniques commonly used in SMT, TRSTR borrows from SRT the concepts of slack-fetch, LVQ and BOQ, which serve as a bridge between the leading and trailing contexts to improve overall performance.
The arbitrator context has a minimal set of private resources that cannot be accessed by the leading and trailing contexts, which guarantees the non-stop-crucial feature while the leading and trailing contexts are carrying out self-checking. To support TSV mode, TRSTR extends the output comparator to compare the three outputs from the leading, trailing and arbitrator contexts. The input replicator undergoes a similar extension.
To support self-checking and self-recovery from permanent faults in the leading and trailing contexts, we need a table to record each functional unit's usability. This table also helps in the self-checking and
reconfiguring process, as well as in instruction scheduling. In addition, some logic is needed to support switching among different working states, such as back and forth between DSA normal running and DSA arbitrating, between TSV normal running and TSV self-checking, and between TSV mode and DSA mode.
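The functional-unit usability table mentioned above could look like the following sketch. It is an illustrative model only: the class name `UnitUsabilityTable`, the method names and the unit names such as `ialu0` are hypothetical, and the paper does not describe the table's actual layout.

```python
class UnitUsabilityTable:
    """Records which functional units are usable, for scheduling and reconfiguration."""

    def __init__(self, units):
        # Every unit starts out usable; self-checking may later mark some as bad.
        self._usable = {unit: True for unit in units}

    def mark_unusable(self, unit):
        """Shield off a unit that self-checking found to be faulty."""
        if unit not in self._usable:
            raise KeyError(unit)
        self._usable[unit] = False

    def usable_units(self, kind):
        """List usable units whose name starts with `kind`, e.g. 'ialu'."""
        return [u for u, ok in self._usable.items() if ok and u.startswith(kind)]

    def pick(self, kind, busy=()):
        """Return a usable, non-busy unit of the given kind, or None if there is none."""
        for unit in self.usable_units(kind):
            if unit not in busy:
                return unit
        return None
```

A scheduler that consults `pick` before issuing never sends an instruction to a shielded unit, so execution continues on the remaining units at reduced throughput.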
The involvement of the arbitrator context somewhat increases the complexity of TRSTR. However, the private resources of the arbitrator can be much fewer than those of the leading and trailing contexts (much less than 33% in most cases). In addition, its private resources are scalable to adapt to changing demands. The arbitrator greatly improves overall performance, both in execution speed (for DSA mode) and in fault tolerance capability, as shown in Section 3.
In summary, implementing TRSTR is a straightforward extension of the SMT architecture, which needs only moderate hardware complexity compared with its predecessors such as lockstep, AR-SMT, SRT, etc. More importantly, TRSTR has valuable features such as higher resource efficiency, seldom-re-execute and non-stop-crucial, which its predecessor architectures lack or support insufficiently.
3 Simulation Environment and Results
We modify the SimpleScalar tool set[8] to build two simulators: one for TRSTR, and one for the baseline SMT. Tab. 1 summarizes key parameters of the TRSTR processor and the baseline SMT. To evaluate our ideas, we use the 11 SPEC2000 benchmarks shown in Tab. 2. All benchmarks are compiled to the SimpleScalar PISA instruction set using gcc 2.7.2 with "-O2 -funroll-loops" optimizations. We run all benchmarks for 300 million instructions after skipping the first 500 million instructions to warm the simulator up, using the SPEC2000 ref data sets.
Tab. 1 Hardware parameters for TRSTR and the baseline SMT
L1 Inst. Cache        64K, 4-way, 32-byte block, LRU
L1 Data Cache         64K, 4-way, 32-byte block, LRU
Unified L2 Cache      1M, 4-way, 64-byte block, LRU
Branch predictor      Hybrid 8k-entry bimodal, 8k-entry gshare, 8k 2-bit selector; 16-entry RAS, 4-way 1K BTB
Main memory           Infinite capacity, 18 cycle latency
Issue/Commit Width    8 instructions/cycle
Function Units        [illegible]
Tab. 2 Benchmark statistics for the baseline SMT (Ld: load, St: store, Br Pred Acc: branch prediction accuracy, M = million)
Fig. 3 presents the simulation results. SMT-Single means each benchmark runs on the baseline SMT processor with comparable resources, as in Tab. 1. TRSTR-TSV means each benchmark runs in TSV mode. For DSA mode, we run each benchmark as the crucial job twice: once with itself as the common job (labeled TRSTR-DSA-Test 1), and once with a different benchmark as the common job (labeled TRSTR-DSA-Test 2). To better reveal DSA mode's capability for thread-level parallelism, in TRSTR-DSA-Test 2 we choose the FP benchmark mesa as the common job for all Int benchmarks, and the Int benchmark gcc as the common job for all FP benchmarks. For almost all benchmarks, the IPC of TRSTR-DSA-Test 2 surpasses that of TRSTR-DSA-Test 1, because the complementarity of Int and FP benchmarks makes better use of hardware resources.
[Figure: bar chart of IPC (vertical axis, 0 to 5) per benchmark (horizontal axis, 1 to 11 and Avg). Series: SMT-Single, TRSTR-TSV, TRSTR-DSA-Test 1-Crucial, TRSTR-DSA-Test 1-Common, TRSTR-DSA-Test 2-Crucial, TRSTR-DSA-Test 2-Common]
Fig. 3 IPC of baseline SMT-Single, TRSTR-TSV, TRSTR-DSA-Test 1, TRSTR-DSA-Test 2
As expected, the IPC (instructions per cycle) of both TSV mode and the crucial job of DSA mode is inferior to the IPC of SMT-Single, because of redundant execution. However, the sum of the IPCs of the crucial job and the common job in DSA mode outperforms SMT-Single
for most benchmarks, on average by 9.6% for TRSTR-DSA-Test 1 and 21.5% for TRSTR-DSA-Test 2, respectively.
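The comparison above adds the IPC of the crucial job to the IPC of the common job and relates the sum to the IPC of SMT-Single. The short Python sketch below shows that arithmetic for a single benchmark. The function name `throughput_gain` is hypothetical, the numbers in the usage note are invented for illustration and are not measurements from the paper, and the paper does not state how the per-benchmark gains are averaged.

```python
def throughput_gain(ipc_single, ipc_crucial, ipc_common):
    """Relative gain, in percent, of combined DSA throughput over SMT-Single.

    ipc_single:  IPC of the benchmark running alone on the baseline SMT.
    ipc_crucial: IPC of the crucial job in DSA mode (counted once, although
                 it is executed redundantly on two contexts).
    ipc_common:  IPC of the common job running on the arbitrator context.
    """
    if ipc_single <= 0:
        raise ValueError("ipc_single must be positive")
    return (ipc_crucial + ipc_common) / ipc_single * 100.0 - 100.0
```

With made-up values of 2.0 for SMT-Single, 1.4 for the crucial job and 0.9 for the common job, the gain is roughly 15%: the crucial job alone is slower than SMT-Single, but the common job more than makes up the difference.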
4 Conclusion
The tremendous number of transistors now available makes it more feasible to integrate multiple processor cores and to implement a multithreaded environment on a single chip. As a high performance fault-tolerant architecture, TRSTR behaves both like a conventional SMT and like an SRT, combining the fault coverage of lockstep with the efficiency of SRT. Its two working modes are inter-convertible, meeting different speed and fault coverage requirements at will. The involvement of the arbitrator context, as well as the online self-checking and reconfiguration ability, makes TRSTR more flexible and reliable than its predecessors.
References:
[1] Al-Asaad H, Murray B T, Hayes J P. Online BIST for embedded systems. IEEE Design and Test of Computers, 1998, 15(4): 17-24.
[2] Gaisler J. A portable and fault-tolerant microprocessor based on the SPARC V8 architecture. Proc of the Int'l Conf on Dependable Systems and Networks. Washington, DC: IEEE, 2002. 409-415.
[3] Rubinfeld P. Managing problems at high speed. IEEE Computer, 1998, 31(1): 47-48.
[4] Slegel T J. IBM's S/390 G5 microprocessor design. IEEE Micro, 1999, 19(2): 12-23.
[5] Reinhardt S K, Mukherjee S S. Transient fault detection via simultaneous multithreading. Proc of 27th Annual Int'l Symp on Computer Architecture. Vancouver: IEEE, 2000. 25-36.
[6] Rotenberg E. AR-SMT: a microarchitectural approach to fault tolerance in microprocessors. Proc of 29th Fault-Tolerant Computing Symposium. Madison: IEEE, 1999. 84-91.
[7] Tullsen D M, Eggers S J, Levy H M. Simultaneous multithreading: maximizing on-chip parallelism. Proc of 22nd Annual Int'l Symp on Computer Architecture. Santa Margherita Ligure: ACM, 1995. 392-403.
[8] Austin T, Larson E, Ernst D. SimpleScalar: an infrastructure for computer system modeling. IEEE Computer, 2002, 35(2): 59-67.