Journal of Harbin Institute of Technology (New Series), Vol. 14, No. 1, 2007

A high performance fault-tolerant approach based on simultaneous multithreading

YANG Hua, CUI Gang, WANG Ling, YANG Xiao-zong

杨华, 崔刚, 王玲, 杨孝宗

(School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China, E-mail: {yangh, cg, lwang}@fle1.hit.edu.cn)

Abstract: To cope with the ever-increasing susceptibility to transient faults in modern processors, a scheme called Tri-modular Redundantly and Simultaneously Threaded processor with Recovery is proposed, which provides transient fault coverage and reconfiguration from partial permanent faults with high performance. Besides two redundant thread contexts, an arbitrator context is introduced to act as either an arbitrator or an ordinary thread, which makes better use of hardware resources. Its sphere of replication is reconfigurable and flexible in handling changing demands. Simulation with 11 SPEC2000 benchmarks shows that it outperforms SMT-Single by 21.5% on average, while maintaining flexibility and fault-tolerant capability.

Key words: fault-tolerant, simultaneous multithreading, reconfigurable, high-performance

CLC number: TP302.8 Document code: A Article ID: 1005-9113(2007)01-0114-05

Due to smaller feature size, reduced voltage level, and higher frequency, future processors are increasingly prone to operational faults, especially transient faults. Transient faults appear sporadically, and the primary sources of these faults are single event radiation (SER) upsets, which are usually caused by energized particle strikes (e.g. alpha particles and cosmic rays). Particle strikes can deposit or remove sufficient charge to temporarily turn the device on or off, possibly creating a logic error [1-3]. While shielding is possible, its physical construction and cost make it infeasible at present.

Most commercial fault-tolerant computers use fully replicated hardware components to detect faults, e.g. IBM S/390 G5. The components are lock-stepped (cycle-by-cycle synchronized) to ensure that, in each cycle, they perform the same operation on the same inputs and produce the same outputs in the absence of faults. Unfortunately, full replication demands doubled area, power, cost, etc., and for a given hardware budget, full replication inevitably reduces performance by statically partitioning resources among redundant operations.

In the forthcoming chip multithreading era, one approach to redundant execution is redundant multithreading (RMT), which runs two identical copies of the same program as independent threads, feeds them with identical inputs, and compares their outputs. Several RMT architectures have been proposed, e.g., SRT [5], AR-SMT [6], etc. These architectures have greatly sped up redundant execution, but still fall short in resource efficiency, flexibility, or fault coverage.

Received 2004-09-10.

Sponsored by the National Pre-research Foundation (Grant No. 41316.1.2)

We propose the Tri-modular Redundantly and Simultaneously Threaded processor with Recovery (TRSTR), which is SMT-based and combines the fault coverage of lockstep with the efficiency of SRT, while addressing both performance and flexibility. It focuses on transient faults, as well as reconfiguration from partial permanent faults, in the next-generation multi-thread processor environment.

1 Previous Work

Simultaneous multithreading (SMT) enables multiple threads to compete for and share hardware resources each cycle, alleviating both horizontal and vertical slot loss [7]. SMT is naturally attractive for fault detection, because it can provide redundancy by running more than one copy of the same program simultaneously.

AR-SMT [6] was the first to use SMT to execute two copies of the same program. It also suggests using speculation techniques to allow communication of data values and branch outcomes between the two redundant threads of a single program to accelerate execution.

Simultaneous and Redundantly Threaded (SRT) [5] improves on AR-SMT via a few optimizations such as slack-fetch, load value queue (LVQ) and branch outcome queue (BOQ). SRT replicates an application into two communicating threads called leading and trailing. The leading thread executes ahead of the trailing one by a distance named slack-fetch. This technique, together with LVQ and BOQ, enables the trailing thread to utilize the pre-fetched data and resolved branch results from the leading one, improving overall performance greatly. The concept of sphere of replication (SoR) was introduced in SRT. It can be viewed as the logical boundary of redundancy within a system: components within the SoR enjoy fault coverage due to redundant execution, whereas components outside the SoR do not. Fig.1 shows the sketch map of SRT.

Fig.1 The sketch map of SRT (the diagram marks the sphere of replication)
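
The leading/trailing arrangement described above can be illustrated with a toy software model. This is only a sketch under stated assumptions, not the SRT hardware: the function name `run_srt`, the single-accumulator instruction format, and the `flip_trailing_at` fault hook are all hypothetical, and the leading thread is simply allowed to finish before the trailing one starts (an unbounded slack). The leading thread reads memory and pushes each loaded value into a load value queue; the trailing thread takes its loads from that queue, so both copies see identical inputs; stores are compared before they leave the sphere of replication.

```python
from collections import deque


def run_srt(program, memory, flip_trailing_at=None):
    """Toy model of SRT-style redundant execution (illustrative only).

    program: list of ("load", addr), ("add", const) or ("store", addr) ops
             acting on a single accumulator.
    flip_trailing_at: optional op index after which the trailing copy's
             accumulator gets a single-bit flip, mimicking a transient fault.
    Returns the op indices at which the output comparator saw a mismatch.
    """
    lvq = deque()          # load value queue: leading -> trailing
    pending_stores = []    # leading thread's stores awaiting comparison
    mismatches = []

    # Leading thread: runs ahead, reads real memory, records loads and stores.
    acc = 0
    for i, (op, arg) in enumerate(program):
        if op == "load":
            acc = memory[arg]
            lvq.append(acc)
        elif op == "add":
            acc += arg
        elif op == "store":
            pending_stores.append((i, arg, acc))

    # Trailing thread: replays the program, taking load values from the LVQ.
    acc = 0
    stores = iter(pending_stores)
    for i, (op, arg) in enumerate(program):
        if op == "load":
            acc = lvq.popleft()
        elif op == "add":
            acc += arg
        if i == flip_trailing_at:
            acc ^= 1                    # single-bit upset in the trailing copy
        if op == "store":
            _, addr, lead_val = next(stores)
            if lead_val == acc:
                memory[addr] = acc      # outputs agree: commit the store
            else:
                mismatches.append(i)    # outputs differ: fault detected
    return mismatches
```

In this model a fault is only detected, not corrected: with two copies the comparator cannot tell which one is wrong, which is the gap the arbitrator context in Section 2 is meant to close.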

2 TRSTR Architecture

To combine the fault coverage of lockstep with the efficiency of SRT, we propose a new architecture called TRSTR, which introduces a third thread context, named the arbitrator, into the SRT environment. Fig.2 shows the outline of TRSTR. Unlike the fully shared scheme in conventional SMT, the arbitrator context in TRSTR keeps a private set of basic hardware resources, which guarantees that the arbitrator can run independently and which normally cannot be accessed by the leading and trailing contexts. At the same time, all other hardware resources in TRSTR are fully shared by the leading, trailing and arbitrator contexts. TRSTR borrows from SRT the techniques of LVQ and BOQ in the input replicator, thereby enabling the trailing thread to utilize the data pre-fetched and the branch outcomes resolved by the leading one. To meet different speed and fault coverage requirements, TRSTR has two working modes: Tri-Simultaneous with Voting (TSV) and Dual-Simultaneous with Arbitrator (DSA). The input replicator, the output comparator and the small sphere of replication (surrounded by the inner dot-dash line in Fig.2) have almost the same meaning as in SRT. The big sphere of replication (surrounded by the outer dot-dash line in Fig.2) applies to DSA mode while arbitrating and to TSV mode.

Fig.2 The outline of TRSTR (legible boundary label: Sphere of Replication (DSA running))

2.1 TSV Mode

When working in TSV mode, the arbitrator context behaves just as a second trailing context. The leading, trailing and arbitrator contexts work in the fashion of tri-modular redundancy (TMR) with voting. That is, each of the three contexts executes a duplicated thread of the same application, and the output comparator compares their outputs and votes before committing to the rest of the system.

The outer dot-dash line in Fig.2 surrounds the sphere of replication of the TSV mode. The arbitrator context acts as a second trailing context, and communicates with the input replicator and the output comparator (arrows ②, ③). The leading thread runs ahead of the trailing one by a distance of slack-fetch-A, and the trailing thread runs ahead of the arbitrator thread by another distance, slack-fetch-B. In this way, each instruction must have been executed three times, separately and interleaved, on the leading, trailing and arbitrator (as another trailing) contexts before the output comparator can vote upon the outputs and commit it. In addition, slack-fetch-B is much shorter than slack-fetch-A, because the only aim of slack-fetch-B is to stagger the execution of the same instruction in the trailing and arbitrator contexts, so as to alleviate contention for the same resources by corresponding instructions in the two threads. Thus, the total slack distance between the leading context and the arbitrator context, which equals slack-fetch-A plus slack-fetch-B, is nearly the same as the slack-fetch distance in SRT. Therefore, the involvement of the arbitrator context in the TSV mode does not impact the overall execution efficiency compared with SRT when all three outputs match each other.

When the outputs from different contexts do not match, which means a fault has occurred in a certain thread context, TSV mode does not re-execute the corresponding instruction in most cases, namely whenever a vote can be made: two results are identical while the other is inconsistent, probably because of a transient fault. Because each instruction executes three times in the three contexts at different clock cycles, almost all transient faults of any component can be masked by voting, without any performance impact. Only when all three outputs differ from each other does the corresponding instruction need to be re-executed in the three contexts.
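
The voting rule just described can be written down compactly. The following is a minimal sketch of the decision only, assuming results can be compared for equality; the function name `tsv_vote` and the context labels are hypothetical and do not come from the paper.

```python
def tsv_vote(leading, trailing, arbitrator):
    """Majority vote over three redundant results (illustrative sketch).

    Returns (value, suspect): suspect names the context that disagreed,
    or is None if all three agree. Returns (None, "all") when all three
    results differ, meaning the instruction must be re-executed.
    """
    if leading == trailing == arbitrator:
        return leading, None
    if leading == trailing:
        return leading, "arbitrator"
    if leading == arbitrator:
        return leading, "trailing"
    if trailing == arbitrator:
        return trailing, "leading"
    return None, "all"
```

Only the last branch costs a re-execution; every single-context disagreement is resolved in place, which is the behaviour the text attributes to TSV mode.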

If a permanent fault occurs in a component used by the leading and trailing contexts, e.g. a physical fault in an I-ALU unit, there should be consecutive voting failures in a short period of time, because some instructions are executed on that unit. Upon detecting this, all three threads stop running temporarily, and the arbitrator context begins executing a self-verification program to verify its private resources. If no fault is detected, the arbitrator communicates directly with the rest of the system (arrows ④, ⑤ in Fig.2) and continues execution of the program, while at the same time the leading and trailing contexts carry out self-checking to locate and shield off the bad components. After that, all three contexts resume the original program's normal execution in TSV mode. On the contrary, if a faulty component is detected during the self-verification of the arbitrator context, it will simply be labeled as unusable and the arbitrator will rejoin the redundant execution of the original program's threads.
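
One way to picture the trigger "consecutive voting failures in a short period of time" is a sliding-window counter. The paper does not give a threshold or a window length, so `max_failures` and `window_cycles` below are invented placeholders, and the class name `PermanentFaultDetector` is hypothetical; this is a sketch of the idea rather than the authors' mechanism.

```python
from collections import deque


class PermanentFaultDetector:
    """Flags a suspected permanent fault when too many voting failures
    fall inside a short window of cycles (thresholds are hypothetical)."""

    def __init__(self, max_failures=3, window_cycles=100):
        self.max_failures = max_failures
        self.window_cycles = window_cycles
        self._failures = deque()   # cycle numbers of recent failures

    def record_failure(self, cycle):
        """Record a voting failure; return True if self-checking should start."""
        self._failures.append(cycle)
        # Drop failures that fall outside the window ending at this cycle.
        while self._failures and cycle - self._failures[0] >= self.window_cycles:
            self._failures.popleft()
        return len(self._failures) >= self.max_failures
```

An isolated transient fault ages out of the window and never trips the detector, whereas a stuck functional unit produces failures close together and does.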

In short, TSV mode almost never re-executes instructions upon detecting transient faults. We call this feature seldom-re-execute, which makes TRSTR more suitable for circumstances where real-time behavior is desired as well as fault tolerance.

2.2 DSA Mode

When working in DSA mode, TRSTR behaves both like a conventional SMT and like an SRT, and the arbitrator context can be employed either as a conventional SMT thread or as an arbitrator thread. We state that it behaves like an SMT because generally three threads from two different applications execute simultaneously in the three contexts: one application (called the crucial job) runs its two threads redundantly on the leading and trailing contexts, while the other application (called the common job) runs its single thread in the arbitrator context (arrows ④, ⑤ in Fig.2). We also state that it behaves like an SRT because the crucial job's two redundant threads run exactly as in SRT, except when a mismatch is detected in the output comparator.

When the results from the leading and trailing thread contexts mismatch, the output comparator sends a request for arbitration (arrow ① in Fig.2). On detecting an arbitration request, the DSA mode temporarily halts the common job's execution in the arbitrator context and copies the crucial job's committed state (PC, register file set, etc.) to the arbitrator context. After that, the arbitrator context re-executes the corresponding faulty instruction. The result produced by the arbitrator is compared with the two corresponding results from the leading and trailing contexts (arrows ②, ③ in Fig.2), in order to determine whether the leading context or the trailing context is right and to commit the instruction. In addition, the leading and trailing contexts undergo self-recovery during the cycles in which the instruction is being re-executed in the arbitrator context. After that, the leading and trailing contexts continue to execute the crucial job's two redundant threads, trailing after leading, and the arbitrator context restarts the common job's thread. Almost all transient faults in the leading and trailing contexts can be detected and recovered from in this way.
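
The arbitration sequence can be summarised as a small decision function. This is a sketch under assumptions: `dsa_commit` and the `reexecute` callback are hypothetical names, the callback stands in for the arbitrator context re-running the instruction from the copied committed state, and the behaviour when the arbitrator agrees with neither context is not specified in the text, so raising an error here is purely a placeholder.

```python
def dsa_commit(leading_result, trailing_result, reexecute):
    """Decide what to commit for one instruction in DSA mode (sketch).

    reexecute: zero-argument callable standing in for the arbitrator
               context re-running the instruction from the committed state.
    Returns (value, faulty_context); faulty_context is None when the
    leading and trailing results already agree.
    """
    if leading_result == trailing_result:
        return leading_result, None      # normal case: no arbitration needed
    arbitrated = reexecute()             # the common job is paused meanwhile
    if arbitrated == leading_result:
        return leading_result, "trailing"
    if arbitrated == trailing_result:
        return trailing_result, "leading"
    # Assumption: the text does not cover a three-way disagreement in DSA mode.
    raise RuntimeError("no two results agree; instruction must be retried")
```

Note that the arbitrator is only invoked on a mismatch, so in the fault-free case the common job keeps the arbitrator context to itself.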

If a permanent fault occurs in the leading or trailing context, there should be some consecutive mismatches in a short period of time. In this case, the arbitrator context switches to run the crucial job's thread; at the same time, the leading and trailing contexts begin to run a self-testing program to locate the faulty component (e.g. an I-ALU unit) and label it as unusable. After shielding off the faulty component, the leading and trailing contexts, together with the input replicator and the output comparator, resume the crucial job's execution, and the arbitrator switches back to the execution of its common job. In this way, some permanent faults can be detected and covered without stopping the crucial job's execution. We call this feature non-stop-crucial, which makes TRSTR more attractive for vital tasks that cannot be canceled or stopped.

2.3 Implementation

TRSTR is based on SMT, and needs three contexts (leading, trailing, arbitrator), each of which represents a thread. Besides the techniques commonly in use in SMT, TRSTR borrows from SRT the concepts of slack-fetch, LVQ and BOQ, which serve as a bridge between the leading and trailing contexts to improve the overall performance.

The arbitrator context has a minimal set of private resources that cannot be accessed by the leading and trailing contexts, which guarantees the non-stop-crucial feature when the leading and trailing contexts are carrying out self-checking. To support TSV mode, TRSTR extends the output comparator to support comparison of the three outputs from the leading, trailing, and arbitrator contexts. The input replicator undergoes a similar extension.

To support self-checking and self-recovery from permanent faults in the leading and trailing contexts, we need a table to record each functional unit's usability. This table also helps in the self-checking and reconfiguring process, as well as in instruction scheduling. In addition, some logic is needed to support switching among different working states, such as back and forth between DSA normal running and DSA arbitrating, between TSV normal running and TSV self-checking, and between TSV mode and DSA mode.
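
As a rough picture of the usability table and its role in scheduling, the sketch below keeps one flag per functional unit and lets a scheduler skip units that self-checking has marked as faulty. The class `UnitTable`, its method names, and the unit names used in the example are all hypothetical; the paper does not describe the table's layout.

```python
class UnitTable:
    """Records which functional units are usable (sketch; names hypothetical)."""

    def __init__(self, units):
        # units: mapping of unit name -> unit kind, e.g. {"ialu0": "ialu"}
        self.kind = dict(units)
        self.usable = {name: True for name in units}

    def mark_unusable(self, name):
        """Called after self-checking locates a faulty unit."""
        self.usable[name] = False

    def pick(self, kind):
        """Return the first usable unit of the requested kind, or None."""
        for name, unit_kind in self.kind.items():
            if unit_kind == kind and self.usable[name]:
                return name
        return None
```

For example, with two integer ALUs registered, marking `ialu0` unusable makes `pick("ialu")` fall through to `ialu1`, so execution continues at reduced width instead of stopping.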

The involvement of the arbitrator context somewhat increases the complexity of TRSTR. However, the private resources of the arbitrator can be much fewer than those of the leading and trailing contexts (much less than 33% in most cases). In addition, its private resources are scalable to adapt to changing demand. The arbitrator improves the overall performance greatly, both in execution speed (for DSA mode) and in fault tolerance capability, as shown in Section 3.

In summary, implementing TRSTR is a straightforward extension to the SMT architecture, which needs only moderate hardware complexity compared with its predecessors such as lockstep, AR-SMT, SRT, etc. More importantly, TRSTR has some valuable features such as higher resource efficiency, seldom-re-execute and non-stop-crucial, which its predecessor architectures lack or support insufficiently.

3 Simulation Environment and Results

We modify the SimpleScalar tool set [8] to build two simulators: one for TRSTR, and one for the baseline SMT. Tab.1 summarizes key parameters of the TRSTR processor and the baseline SMT. To evaluate our ideas, we use 11 SPEC2000 benchmarks shown in Tab.2. All benchmarks are compiled to the SimpleScalar PISA instruction set using gcc 2.7.2 with "-O2 -funroll-loops" optimizations. We run all benchmarks for 300 million instructions after skipping the first 500 million instructions to warm the simulator up, using the SPEC2000 ref data sets.

Tab.1 Hardware parameters for TRSTR and the baseline SMT

Parameter             Value
L1 Inst. Cache        64K, 4-way, 32-byte block, LRU
L1 Data Cache         64K, 4-way, 32-byte block, LRU
Unified L2 Cache      1M, 4-way, 64-byte block, LRU
Branch predictor      Hybrid 8k-entry bimodal, 8k-entry gshare, 8k 2-bit selector; 16-entry RAS, 4-way 1K BTB
Main memory           Infinite capacity, 18 cycle latency
Issue/Commit Width    8 instructions per cycle
Function Units        …

Tab.2 Benchmark statistics for the baseline SMT (Ld: load, St: store, Br Pred Acc: branch prediction accuracy, M = million)

Fig.3 presents the simulation results. SMT-Single means each benchmark runs on the baseline SMT processor with comparable resources as in Tab.1. TRSTR-TSV means each benchmark runs in TSV mode. For DSA mode, we run each benchmark as the crucial job twice: once with itself as the common job (labeled as TRSTR-DSA-Test 1), and once with a different benchmark as the common job (labeled as TRSTR-DSA-Test 2). To better reveal DSA mode's capability for thread-level parallelism, in TRSTR-DSA-Test 2 we choose the FP benchmark mesa as the common job for all Int benchmarks, and the Int benchmark gcc as the common job for all FP benchmarks. For almost all benchmarks, the IPC of TRSTR-DSA-Test 2 surpasses that of TRSTR-DSA-Test 1, because the complementarity of Int and FP benchmarks makes better use of hardware resources.

Fig.3 IPC of baseline SMT-Single, TRSTR-TSV, TRSTR-DSA-Test 1, TRSTR-DSA-Test 2 (x-axis: Benchmark, with an Avg column; series: SMT-Single, TRSTR-TSV, TRSTR-DSA-Test1-Crucial, TRSTR-DSA-Test1-Common, TRSTR-DSA-Test2-Crucial, TRSTR-DSA-Test2-Common)

Just as expected, the IPC (instructions per cycle) of both TSV mode and the crucial job of DSA mode is inferior to the IPC of SMT-Single, because of redundant execution. However, the sum of the IPC of the crucial job and the common job in DSA mode outperforms SMT-Single for most benchmarks, on average by 9.6% for TRSTR-DSA-Test 1 and 21.5% for TRSTR-DSA-Test 2.
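
The comparison being made here is a relative throughput gain: the summed IPC of the two jobs in DSA mode against the IPC of SMT-Single. A minimal sketch of that arithmetic follows; the function name `dsa_gain` is hypothetical, and it assumes the crucial job's IPC is reported once for the job as a whole rather than once per redundant copy.

```python
def dsa_gain(ipc_crucial, ipc_common, ipc_single):
    """Relative throughput gain of DSA mode over SMT-Single (sketch).

    Assumes ipc_crucial counts the crucial job once, not once per
    redundant copy. A return value of 0.10 means 10% higher total IPC.
    """
    return (ipc_crucial + ipc_common) / ipc_single - 1.0
```

With made-up inputs that are not taken from the paper, say a crucial-job IPC of 1.5, a common-job IPC of 0.93 and an SMT-Single IPC of 2.0, the function returns about 0.215, which is how a figure such as 21.5% would arise.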

4 Conclusion

The tremendous number of transistors now available makes it more feasible to integrate multiple processor cores and to implement a multithreaded environment on a single chip. As a high performance fault-tolerant architecture, TRSTR behaves both like a conventional SMT and like an SRT, combining the fault coverage of lockstep with the efficiency of SRT. Its two working modes are inter-convertible, meeting different speed and fault coverage requirements at will. The involvement of the arbitrator context, as well as the online self-checking and reconfiguration ability, makes TRSTR more flexible and reliable than its predecessors.

References:

[1] Al-Asaad H, Murray B T, Hayes J P. Online BIST for embedded systems. IEEE Design and Test of Computers, 1998, 15(4): 17-24.

[2] Gaisler J. A portable and fault-tolerant microprocessor based on the SPARC V8 architecture. Proc of the Int'l Conf on Dependable Systems and Networks. Washington, DC: IEEE, 2002. 409-415.

[3] Rubinfeld P. Managing problems at high speed. IEEE Computer, 1998, 31(1): 47-48.

[4] Siegel T J. IBM's S/390 G5 microprocessor design. IEEE Micro, 1999, 19(2): 12-23.

[5] Reinhardt S K, Mukherjee S S. Transient fault detection via simultaneous multithreading. Proc of 27th Annual Int'l Symp on Computer Architecture. Vancouver: IEEE, 2000. 25-36.

[6] Rotenberg E. AR-SMT: a microarchitectural approach to fault tolerance in microprocessors. Proc of 29th Fault-Tolerant Computing Symposium. Madison: IEEE, 1999. 84-91.

[7] Tullsen D M, Eggers S J, Levy H M. Simultaneous multithreading: maximizing on-chip parallelism. Proc of 22nd Annual Int'l Symp on Computer Architecture. Santa Margherita Ligure: ACM, 1995. 392-403.

[8] Austin T, Larson E, Ernst D. SimpleScalar: an infrastructure for computer system modeling. IEEE Computer, 2002, 35(2): 59-67.
