Compiling to a VLIW Fragment Pipeline
William R. Mark and Kekoa Proudfoot
Department of Computer Science∗
Stanford University
Abstract
The latest generation of graphics hardware supports fully programmable vertex and pixel/fragment operations, but programming this hardware at a low level is difficult and time consuming. To address this problem, we have developed a complete real-time procedural shading system that compiles a high-level shading language to programmable vertex and fragment hardware, as described in a separate publication. In this paper, we describe in detail the algorithms used by this system to generate and optimize fragment code for NVIDIA's register combiner architecture and show that our compiler generates efficient code. The register combiner architecture has some similarities to VLIW CPU architectures, so we compare our compilation algorithms to those described in the literature for VLIW CPU architectures. We also discuss some of the lessons we learned from building and using this compiler that may be useful to the designers of future programmable graphics hardware.
CR Categories: I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism—Color, shading, shadowing, and texture; D.3.4 [Programming Languages]: Processors—Compilers and code generation; I.3.1 [Computer Graphics]: Hardware Architecture—Graphics processors; I.3.6 [Computer Graphics]: Methodology and Techniques—Languages
1 Introduction
Real-time consumer-level graphics hardware is rapidly becoming application-programmable at both the vertex processing and pixel/fragment processing stages. NVIDIA's GeForce3 chip is the first example of such hardware, but we expect others to appear soon. Programmable graphics hardware enables applications to implement complex shading algorithms while maintaining high performance. These shading algorithms can more realistically model the enormous variety of materials and lighting effects that exist in the real world.
Applications specify vertex and fragment programs using either the DirectX 8 API or vendor-specific OpenGL extensions. In either case, the programs are specified at a level somewhere between assembly language and microcode. As a result, programs are difficult to write, difficult to understand, and often not portable across graphics chips from different vendors.

∗Gates Building; Stanford, CA 94305 USA. email: {billmark|kekoa}@graphics.stanford.edu; www: graphics.stanford.edu/~{billmark|kekoa}
One solution to this problem is to write shading programs (shaders) in a high-level shading language, and compile these programs to the graphics hardware. Procedural shading languages [14], such as RenderMan [1, 4], have been successfully used for off-line rendering for many years. More recently, procedural shading techniques have been adapted to real-time rendering, both for PixelFlow's SIMD pixel-processor arrays [12] and for conventional non-programmable graphics hardware [5, 13].
We have developed a real-time procedural shading system [15] that is designed to support graphics hardware with programmable vertex and fragment units. This system's compiler allows per-fragment, per-vertex, and per-primitive-group computations to be mixed within a single shader. The compiler's front end separates the three types of computations, and invokes separate compiler back ends to generate the code for each. The system can currently choose one of two different primitive-group back ends, one of three different vertex back ends, and one of two different fragment back ends.
Our system's first fragment back end generates code for standard OpenGL 1.2 hardware (plus a limited set of extensions), and the second fragment back end generates code specifically for NVIDIA's register-combiner hardware. Thus, this second fragment back end is the only one that targets what we consider to be programmable fragment hardware. This back end is the most complex one in our system, and is the subject of this paper.
In this paper, we:

• Briefly describe the register combiner architecture.
• Describe in detail the algorithms used by our register-combiner compiler, and the reasons for choosing these algorithms.
• Compare our compilation strategy to that used by conventional VLIW compilers.
• Demonstrate that the efficiency of code generated by our compiler is comparable to that of hand-written code.
• Discuss some of the lessons we learned by building and using the compiler, and the implications of these lessons for the design of future graphics hardware and compilers.

But first, we will describe the input to our compiler, and outline the basic problem that our compiler addresses.
The input to our register-combiner compiler back end is a directed acyclic graph (DAG) of operations such as ADD and MULTIPLY. This DAG represents a unification of a surface shader with zero or more co-compiled light shaders. The root node of this DAG returns a single RGBA color to be stored in the framebuffer. Most of the nodes that can appear in the DAG correspond directly to operators supported by our shading language.

The DAG does not contain any control constructs such as branches or loops; compiler writers would say that it represents a single basic block. Thus, the compilation problem that we address is that of generating and optimizing register-combiner code for a single basic block.
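As a concrete (if simplified) illustration, the following Python sketch shows one way such a basic-block DAG might be represented. The Node class and field names are our own invention for this paper's exposition, not the system's actual data structures.

    # Minimal sketch of an expression DAG for a single basic block.
    # Node names and fields are illustrative, not the compiler's real classes.
    class Node:
        def __init__(self, op, *children, value=None):
            self.op = op                # e.g. "ADD", "MULTIPLY", "TEXTURE", "CONST"
            self.children = list(children)
            self.value = value          # used only by leaf nodes

        def __repr__(self):
            if not self.children:
                return f"{self.op}({self.value})"
            return f"{self.op}({', '.join(map(repr, self.children))})"

    # A shared subexpression: t1 is referenced twice, which makes this a DAG
    # rather than a tree (printing it duplicates the shared node).
    t1 = Node("TEXTURE", value="tex1")
    x1 = Node("ADD", Node("MULTIPLY", Node("CONST", value="g1"), t1),
                     Node("CONST", value="g2"))
    root = Node("MULTIPLY", t1, x1)     # the root yields the fragment's color
    print(root)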
In developing this compiler back end, we have focused on the previously unsolved problem of compiling as many operations as possible into a single programmable rendering pass. Other systems [5, 13] have already shown that multi-pass compilation is possible, and our system's other fragment back end uses these techniques. Currently, our register-combiner compiler will fail with an error message if a shader is too complex for a single rendering pass, although we have designed the compiler's infrastructure to facilitate the future addition of multi-pass compilation capability.
Our compiler is re-targetable to two different generations of the register combiner architecture. The first generation (GeForce1 and 2) supports two register combiners with two textures, while the second generation (GeForce3) supports eight register combiners, four textures, and a different mechanism for specifying constant values. Our compiler also makes partial use of the GeForce3's texture-shader hardware. Because our compiler's design is intimately tied to some of the properties of the hardware architecture, we will describe the architecture in detail before discussing the compiler further.
2 Register Combiner Architecture

Pipeline organization
Figure 1 depicts the complete architecture to which we are compiling [11]. The host CPU or vertex program supplies a set of interpolants (position, "colors", and texture coordinates) with each vertex. We use the more generic term vertex interpolant to refer to the interpolated "colors". The texture coordinates are used to index textures, or, when the NV_texture_shader extension is available, to perform more complex texture-addressing operations. At the top of the register-combiner pipeline, the vertex interpolants and filtered texture colors are placed into specific registers.
Figure 1: The register combiner architecture.
Each register combiner stage has an "RGB combiner" and an "ALPHA combiner". These combiners can read values from several registers, operate on these values, and write the results to registers. The RGB and ALPHA combiners are controlled independently, and operate in parallel. For this reason, we consider each register-combiner stage to be similar to an instruction for a conventional VLIW CPU architecture. VLIW instructions allow several completely independent register-to-register operations to be performed concurrently.

Any register that is not written to by a particular register-combiner stage preserves its value from the previous stage. Again, this behavior is consistent with thinking of the register combiners as a series of register-to-register VLIW instructions. On a GeForce3, eight such VLIW instructions are available per rendering pass, and on a GeForce1, two such instructions are available per rendering pass.
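To make the VLIW analogy concrete, a rendering pass can be modeled as a short list of instruction-like stages, each with independent RGB and ALPHA slots. The Python sketch below is our own illustration of that view; the class and field names are invented and do not correspond to NVIDIA's API.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class CombinerOp:
        """One register-to-register operation (the RGB or ALPHA half of a stage)."""
        inputs: list            # register names read, e.g. ["T0", "V0"]
        compute: str            # e.g. "DOT", "MUL", "MUL+ADD"
        output: str             # register written, e.g. "S0"

    @dataclass
    class CombinerStage:
        """One register-combiner stage, viewed as one VLIW instruction:
        the RGB and ALPHA slots execute concurrently and independently."""
        rgb: Optional[CombinerOp] = None
        alpha: Optional[CombinerOp] = None

    # A GeForce3 pass is (up to) eight such instructions; a GeForce1 pass has two.
    pass_program = [CombinerStage() for _ in range(8)]
    pass_program[0].rgb = CombinerOp(["T0", "V0"], "DOT", "S0")
    pass_program[0].alpha = CombinerOp(["T1", "C0"], "MUL", "S1")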
The architecture also includes special "final" RGB and ALPHA combiners. These final combiners calculate the RGBA value to be placed in the framebuffer from values taken from the RGBA registers. The capabilities of the final combiners are different from those of the standard combiners. For example, the final combiners operate on unsigned values, while the standard combiners operate on signed values. For simplicity, our compiler only uses the final combiners to choose which RGB and A registers will be written to the framebuffer, and to perform simple input mappings such as y = (1 − x).
The architecture has two read-only "constant" registers, as well as a read-only zero-valued register (Z0). The values of the constant registers are set by the host at pipeline-configuration time. Our system uses the constant registers to hold both true constant values and per-primitive-group values. On the GeForce1, the constant registers are global, in the sense that the first constant register has the same value at every combiner stage, as does the second constant register. We refer to these global constant registers as C0 and C1. On the GeForce3, which supports the NV_register_combiners2 extension, the constant registers are per-stage, meaning that they can hold a different value at every combiner stage. This difference in behavior has a surprisingly broad impact on the compiler, and to emphasize the difference from the GeForce1's constant registers we refer to the GeForce3's constant registers as L0 and L1.
Internal combiner structure
Figure 2 shows the internal structure of an RGB (3-vector) combiner. The ALPHA (scalar) combiner is very similar, but because it operates on scalar values it cannot perform dot products. The complexity of this internal structure is the main feature that distinguishes the register combiner architecture from most VLIW architectures. In a typical VLIW architecture, the register-to-register operations are simple, but in the register combiner architecture they can be quite complex: e.g., the combination of two dot products, a multiplexing operation, and a scale/bias.
Table 1 lists the possible combinations of scale, bias, and clamping operations that can be performed by the input-mapping units and output-mapping units in a combiner. Note that while the input-mapping units can be controlled independently, the three output-mapping units in an RGB or ALPHA combiner share the same configuration. This coupling introduces added complexity to the compilation algorithms.
If an RGB or ALPHA combiner's ADD/MUX unit is not used, we can split the combiner into two independent pieces. We refer to each piece as a half combiner, as illustrated in Figure 3. In this case, we think of the VLIW instruction as controlling up to two half RGB combiners and two half ALPHA combiners. Unfortunately, there is still some potential coupling between half combiners, because the OUTMAP must be the same in the two halves.
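A compiler that wants to pack two half combiners into one combiner slot therefore has to check that their output mappings agree. A minimal sketch of that test (the function and mapping names below are ours, not the hardware's enumerants):

    # Sketch only: two HALF combiners can be packed into one RGB (or ALPHA)
    # combiner only if they request the same output mapping, because the
    # output-mapping units of a combiner share a single configuration.
    def halves_compatible(outmap_a: str, outmap_b: str) -> bool:
        return outmap_a == outmap_b

    assert halves_compatible("SCALE_BY_TWO", "SCALE_BY_TWO")
    assert not halves_compatible("SCALE_BY_TWO", "IDENTITY")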
Figure 2: Internal structure of a standard RGB combiner. All operations are performed on 3-vectors. The DOT/MULTIPLY unit can be configured to perform either a dot product or a vector multiply. The ADD/MUX unit can be configured to perform either a vector add or a vector multiplexing operation. A standard ALPHA combiner is similar to the standard RGB combiner, but operates on scalars.
INMAP's              OUTMAP's
−x                   0.5x
max(0, x)            2x
1 − x                4x
±2(x − 1/2)*         2(x − 1/2)*
±(max(0, x) − 1/2)*  x − 1/2*

Table 1: Combiner input mappings (INMAP's) and output mappings (OUTMAP's) other than identity. All mappings clamp to [−1, 1] after completion. The mappings with an asterisk can serve as either type of mapping (what we call a BOTHMAP) under some conditions.
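Restated as per-component scalar functions (with the final clamp to [−1, 1] noted in the caption), the mappings of Table 1 look like the sketch below. The dictionary keys are descriptive names we chose for readability; they are not the enumerants of the OpenGL extension.

    # Table 1 restated as per-component functions (illustrative names only).
    # All mappings clamp to [-1, 1] after completion.
    def clamp(v, lo=-1.0, hi=1.0):
        return max(lo, min(hi, v))

    INMAPS = {
        "negate":        lambda x: -x,
        "unsigned":      lambda x: max(0.0, x),
        "invert":        lambda x: 1.0 - x,
        "expand":        lambda x: 2.0 * (x - 0.5),     # asterisked (BOTHMAP candidate)
        "expand_negate": lambda x: -2.0 * (x - 0.5),
        "half_bias":     lambda x: max(0.0, x) - 0.5,   # asterisked (BOTHMAP candidate)
    }
    OUTMAPS = {
        "scale_half":    lambda x: 0.5 * x,
        "scale_two":     lambda x: 2.0 * x,
        "scale_four":    lambda x: 4.0 * x,
        "bias_scale":    lambda x: 2.0 * (x - 0.5),     # asterisked
        "bias":          lambda x: x - 0.5,             # asterisked
    }

    def apply_mapping(f, x):
        return clamp(f(x))

    print(apply_mapping(INMAPS["expand"], 0.25))        # -0.5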
An RGB combiner always writes its outputs to the RGB portion of registers, and an ALPHA combiner always writes its outputs to the ALPHA portion of registers. The possibilities for combiner inputs are more complex. An RGB combiner may take an input from an RGB register, or from an ALPHA register by replicating the scalar value across all three components. Later, we will refer to this second possibility as a TRIPLE operation. An ALPHA combiner may take an input from an ALPHA register, or from the BLUE component of an RGB register (which we will refer to as a BLUE operation).
There are several other important properties of the register combiner architecture. First, if the RGB combiner performs an ADD in the ADD/MUX unit, it cannot perform any dot products. This restriction allows sharing of the adder hardware. Second, the result of a DOT operation is replicated across all three RGB components. Third, the register combiner architecture uses a [−1, 1] range for most values, but uses a [−2, 2] range for values between the MUL/DOT unit and the scale/bias unit. Fourth, the control input for the MUX unit is always taken from the ALPHA portion of the SPARE0 register (S0.a). The control input is true if S0.a ≥ 0.5, and false otherwise.

Figure 3: Internal structure of a HALF combiner.
DirectX 8

Microsoft's DirectX 8 pixel-shader instructions express a subset of the capabilities of the NVIDIA register-combiner and texture-shader OpenGL extensions. Although our compiler does not currently target DirectX 8, we believe that it would be relatively easy to modify it to do so. The most important change would be the use of a simpler model for each register combiner.
3 Compilation Overview
We use an example both to illustrate the capabilities of our compiler and to provide a framework for understanding the compilation task. Figure 4 shows a shader written in our shading language.
// Bump-mapping function
surface float3
lightmodel_bumps(float3 a, float3 d, float3 s, texref bumps, float4 uv_bumps) {
  // Compute normalized tangent-space light vectors
  vertex perlight float3 Ltan = tangentspace(L);
  vertex perlight float3 Htan = tangentspace(H);
  // Lookup from bump map
  float4 Nlookup = texture(bumps, uv_bumps);        // alpha has short len
  float3 Nbump = 2.0 * (rgb(Nlookup) - triple(0.5));
  float N_avglen = Nlookup[3];   // Length of mipmap-filtered N, before renorm
  // Diffuse
  perlight fragment float3 Lfrag = Ltan;            // Interpolate vertex -> fragment
  perlight float NdotL = dot(Nbump, Lfrag);
  perlight float shadow = 4 * (Lfrag[2] + Lfrag[2]);  // Geometric shadow ramp
  perlight float3 diff = d * clamp01(NdotL) * clamp01(shadow) * N_avglen;
  // Specular
  perlight float3 Hlookup = cubenorm(Htan);         // Interpolate and normalize
  perlight float3 Hnorm = 2.0 * (Hlookup - {.5, .5, .5});
  perlight float NdotH = clamp01(dot(Nbump, Hnorm));
  perlight float NdotHs = select(Hlookup[2] >= 0.5, NdotH, 0.0);
  perlight float NdotH2 = NdotHs * NdotHs;
  perlight float NdotH4 = NdotH2 * NdotH2;
  perlight float NdotH8 = NdotH4 * NdotH4;
  perlight float3 spec = NdotH8 * shadow * s;
  // Combine diffuse, specular, and ambient
  perlight float3 C = diff + spec;
  return integrate(rgb(Cl) * C) + a;                // Sum over lights, and add ambient
}

// Surface shader for bowling pin
surface shader float4
bowling_pin(texref basemarks, texref decals, texref bumps, float4 uv) {
  // Per-vertex texture-coordinate computations omitted for brevity
  ...
  // Fragment computations
  float4 Decals = texture(decals, uv_decals);
  float4 BaseMarks = texture(basemarks, uv_basemarks);
  float Marks = alpha(BaseMarks);
  float3 Base = rgb(BaseMarks);
  float3 Ma = {.4, .4, .4};
  float3 Md = {.5, .5, .5};
  float3 Ms = {.3, .3, .3};
  float3 Kd = rgb((Decals over {Base, 1.0}) * Marks);
  float3 C = lightmodel_bumps(Kd * Ma, Kd * Md, Ms, bumps, uv_bumps);
  return {C, 1.0};
}

Figure 4: Example surface shader (a bump-mapped bowling pin inspired by [17]). This shader compiles to eight register combiners when it is used with a single light shader that returns a per-vertex light intensity. The clamp01() function clamps a value to the range [0, 1], and the tangentspace() function transforms a vector into local tangent space.
Most of the per-fragment operations expressed in the shading language are operations such as adds, multiplies, dot products, and texture lookups. The shading language also allows the user to extract a scalar from a vector (using the [] index notation), and to join a three-vector and scalar to form an RGBA four-vector (using the {} join notation).
When we first decided to build a compiler for the register combiner architecture, we hoped to adapt dynamic-programming algorithms to the problem [3]. We had successfully used these algorithms in our fragment back end for non-programmable graphics hardware, as had Peercy et al. before us [13]. Both of these systems used the algorithms to find the lowest-cost set of rendering passes that could evaluate an expression, by matching a set of rules (describing possible instructions) to an expression tree.
However, this strategy proved to be unworkable for the register combiner architecture. With eight combiners, it isn't feasible to treat each rendering pass as a single instruction, because the instruction would be too complex. This problem can be partially circumvented by representing the structure of the fragment pipeline hierarchically, but these hierarchical representations don't allow dynamic-programming algorithms to properly track shared resources, such as interpolants and registers. Furthermore, with an eight-combiner pipeline it is crucial to be able to work with expression DAGs as well as expression trees, but the dynamic-programming algorithms do not support DAGs within a single instruction.
In general, optimal compilation is an NP-hard problem [9]. Code generation for VLIW architectures is no exception [2]. We chose to adopt the usual approach to this challenge: we break the problem into a series of stages based on heuristic algorithms. While this strategy does not guarantee optimal code, it typically works well.
Our partitioning of the compilation problem roughly matches the structure of the register-combiner hardware. The five main stages of our compiler are the following:
1. Extract texture-shader operations. Extract any operations from the shader's fragment DAG that must be mapped to texture shaders rather than register combiners. We use simple pattern-matching techniques to attempt to compile these operations into a texture-shader configuration. The texture-shader hardware is too idiosyncratic to be considered truly programmable, so we will not discuss this part of the compiler any further.

2. Rewrite DAG to use hardware operations (§4). Rewrite the fragment DAG to use the basic operations and data types supported by the register-combiner hardware. The data types are three-vectors and scalars. The operations include ADD, MULTIPLY, MUX controlled by compare with 0.5, INMAP, OUTMAP, etc.

3. Select instructions (§5). Convert the DAG of basic operations into a DAG of register-to-register operations. That is, group basic operations together to form partial register combiners. Each partial combiner is either a half combiner (no ADD/MUX) or a full combiner (uses ADD/MUX). Each partial combiner is either 3-vector (RGB) or scalar (ALPHA).

4. Allocate pipeline-input registers (§6). Allocate registers for the values that enter the top of the combiner pipeline. These values include vertex interpolants, results from the texture units (or texture shaders), and GeForce1 global constants. They do not include GeForce3 per-stage constants.

5. Schedule instructions and allocate registers (§7). The partial combiners from step #3 are placed into specific positions in the register-combiner pipeline. The compiler also allocates temporary registers and GeForce3 per-stage constant registers.
We will describe each of these stages in detail in a corresponding section of the paper. An important advantage of breaking the compilation process into multiple stages is the flexibility it allows in implementing each stage. For example, step #3 from the list above uses a top-down DAG traversal, but step #5 uses a bottom-up DAG traversal. If all of the stages were unified into a single compilation stage, the compiler could only use one traversal order.
Our compiler takes less than one second to compile a shader, because it uses greedy algorithms rather than exponential-time algorithms. We have not performed in-depth studies of the compiler's running time because the compiler is already fast enough that it could be invoked every time that a graphics application starts up.
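Taken together, the back end is a linear pipeline of these five stages. The following Python sketch is only a schematic driver with invented function names and empty stand-in stages; it mirrors the list above rather than the actual implementation.

    # Schematic driver for the register-combiner back end.  The stage functions
    # are empty stand-ins with invented names; only the staging itself is real.
    def extract_texture_shader_ops(dag, target):
        return dag, None        # stub: would peel off texture-shader operations

    def rewrite_to_hardware_ops(dag):
        return dag              # stub: Section 4

    def select_instructions(dag):
        return []               # stub: Section 5, returns partial combiners

    def allocate_pass_input_registers(dag, target):
        return {}               # stub: Section 6

    def schedule_and_allocate(partial_combiners, input_alloc, target):
        return []               # stub: Section 7

    def compile_fragment_dag(dag, target="GeForce3"):
        dag, tex_config = extract_texture_shader_ops(dag, target)          # stage 1
        dag = rewrite_to_hardware_ops(dag)                                  # stage 2
        partial_combiners = select_instructions(dag)                        # stage 3
        input_alloc = allocate_pass_input_registers(dag, target)            # stage 4
        schedule = schedule_and_allocate(partial_combiners, input_alloc, target)  # stage 5
        return tex_config, input_alloc, schedule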
4 Rewriting the DAG to use Hardware Operations
The DAG generated by our system's compiler front end uses data types and operations that correspond to those supported in our shading language. However, these data types and operations do not necessarily have a one-to-one correspondence with the basic capabilities of the register-combiner hardware. For example, our language can express the addition of two four-vectors as one operation, but implementing this operation in the combiner hardware requires that both an RGB and an ALPHA combiner be configured to perform ADD operations. Therefore, the first step in the code-generation process is to rewrite the DAG to use fundamental combiner operations and data types.
The compiler performs the following transformations:
• The compiler replaces four-vector operations with corresponding combinations of three-vector and scalar operations. As a side effect, the root node of the DAG is split into a paired RGB root node and ALPHA root node.

• The compiler recognizes groups of operations that correspond to combiner input mappings and/or output mappings, and rewrites them as a single INMAP, OUTMAP, or BOTHMAP node. A BOTHMAP node indicates an operation that can be implemented using either an input mapping or an output mapping. The compiler uses a top-down DAG traversal to greedily perform this rewrite. The compiler gives higher priority to matches with more complex mappings.

• The compiler recognizes all conversions between 3-vector and scalar data, and rewrites them to use either a TRIPLE operation or a BLUE operation. Given a scalar x and a three-vector y, a TRIPLE operation represents y[0]=x; y[1]=x; y[2]=x, and a BLUE operation represents x=y[2]. These operations correspond to capabilities of the register-combiner hardware. An example transformation is the conversion of x*y into TRIPLE(x)*y. Another example is the conversion of y[0] into BLUE(DOT(y,{1,0,0})).

• The mathematical dot product operation (MATHDOT) is rewritten to use the combiner dot product operation (DOT), which produces a three-vector result. So, MATHDOT(vec3,vec3) becomes BLUE(DOT(vec3,vec3)).

• The language's select() operation (which is similar to the a?b:c operator in C) is rewritten to use the MUX operation. The control input to the MUX operation is implicitly tested against the value 0.5.
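As a concrete illustration of two of these rewrites, the following Python sketch lowers a MATHDOT and a scalar extraction on a toy tuple-encoded DAG; the encoding and operation names are our own simplification, not the compiler's representation.

    # Sketch of two lowering rewrites on a tuple-encoded DAG: ("OP", child0, ...).
    def lower(node):
        if not isinstance(node, tuple):           # leaf: a named value or a number
            return node
        op, *kids = node
        kids = [lower(k) for k in kids]
        if op == "MATHDOT":                       # scalar dot product ->
            return ("BLUE", ("DOT", *kids))       # combiner DOT, then take blue
        if op == "INDEX" and kids[1] == 0:        # only the y[0] case is shown
            return ("BLUE", ("DOT", kids[0], ("CONST3", 1.0, 0.0, 0.0)))
        if op == "MUL_SCALAR_VEC":                # x * y (scalar times 3-vector)
            return ("MULTIPLY", ("TRIPLE", kids[0]), kids[1])
        return (op, *kids)

    print(lower(("MATHDOT", "Nbump", "Lfrag")))
    # ('BLUE', ('DOT', 'Nbump', 'Lfrag'))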
When these transformations are applied to the example shader in Figure 5, the compiler produces the DAG shown in Figure 6.
The strategy for grouping language operations into INMAP, OUTMAP, and BOTHMAP operations is complex. Some of the INMAP operations include a clamp to the range [0, 1]. If the compiler can guarantee that the value being operated on is already in the range [0, 1], it is acceptable to use such an INMAP even though the shader did not explicitly request the clamp.
surface shader float4
example(texref tex1, texref tex2,           // Specifies textures #1 and #2
        vertex float4 uv, vertex float4 c,  // Texture coords. and color
        primitive group float g1,           // Slowly-changing value #1
        primitive group float g2) {         // Slowly-changing value #2
  float4 t1 = texture(tex1, uv);
  float4 t2 = 2 * (texture(tex2, uv) - {0.5, 0.5, 0.5, 0.5});
  float x1 = g1 * t2[2] + g2;
  float x2 = dot(rgb(t1), N);
  float3 x3 = rgb(t1) * x1 + x2 * rgb(t2);
  float x4 = c[3] * t1[3];
  return {x3, x4};
}

Figure 5: An example shader that we trace through the compilation process. Since we designed this shader for pedagogical purposes, it doesn't make much sense as a shader.
Figure 6: The DAG for the example shader after it has been rewritten to use combiner operations. Arrows indicate dependencies (as is customary in DAGs), so the direction of data flow is against the arrows.
The compiler performs an interval-arithmetic analysis on the entire fragment DAG to detect these cases.
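Such an analysis is a standard bottom-up interval propagation. The sketch below is our own minimal version over a toy node encoding, covering only a few operators; it is meant to show the idea, not the compiler's actual pass.

    # Minimal interval-arithmetic pass over a tuple-encoded DAG.
    # The goal is to prove a result already lies in [0,1], so that an INMAP's
    # implicit clamp to [0,1] cannot change the value.
    def interval(node):
        op, *rest = node
        if op == "IVAL":                              # leaf: carries (lo, hi)
            lo, hi = rest
            return (lo, hi)
        ivs = [interval(k) for k in rest]
        if op == "ADD":
            (a, b), (c, d) = ivs
            return (a + c, b + d)
        if op == "MULTIPLY":
            (a, b), (c, d) = ivs
            prods = [a * c, a * d, b * c, b * d]
            return (min(prods), max(prods))
        if op == "CLAMP01":
            a, b = ivs[0]
            return (max(0.0, min(1.0, a)), max(0.0, min(1.0, b)))
        raise ValueError(f"unhandled op: {op}")

    def clamp_is_redundant(node):
        lo, hi = interval(node)
        return 0.0 <= lo and hi <= 1.0

    tex = ("IVAL", 0.0, 1.0)                          # an unsigned texture component
    print(clamp_is_redundant(("MULTIPLY", tex, tex))) # True: product stays in [0,1]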
OUTMAP operations include multiply-by-2 and multiply-by-4. Therefore, a shader may specify a per-fragment multiply by 2.0 or 4.0. The compiler will issue an error for any other per-fragment use of constants outside the range [−1, 1]. It would be possible to support multiplies by any constant in the range [−4, 4] by automatically factoring the constant (e.g., 3.5x → 4.0 · 0.875x), but the compiler doesn't currently perform this transformation.
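For reference, the factoring that such a transformation would require is simple to express; the sketch below shows the idea (again, this is not something the compiler currently does):

    # Sketch of the unimplemented constant factoring mentioned above: split a
    # per-fragment constant c with |c| <= 4 into an OUTMAP scale and a constant
    # that fits in [-1, 1], e.g. 3.5x -> 4.0 * (0.875 x).
    def factor_constant(c):
        for scale in (1.0, 2.0, 4.0):
            if abs(c) <= scale:
                return scale, c / scale    # c == scale * (c / scale)
        raise ValueError("constant outside the representable range [-4, 4]")

    print(factor_constant(3.5))            # (4.0, 0.875)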
5 Instruction Selection
(Creating a DAG of register-to-register operations)
Roughly speaking, the next stage of the compilation recognizes groups of DAG operations that can be mapped to a single register combiner. For example, the compiler maps a three-vector sum of products to an RGB register combiner. Thus, this stage of the compiler decides where to place operations within the structure of a register combiner.
More precisely, this stage recognizes groups of operations that can be mapped to a partial register combiner. We define a partial register combiner as a subset of a combiner that reads one or more inputs from register(s), operates on them, and writes the result(s) to registers. Figure 7 shows the six types of partial combiners, and Figure 8 shows the output of this compilation stage for our example shader.

Figure 7: The six types of partial combiners used to represent register-to-register operations. In this figure, the outputs from each combiner are on the top, and the inputs are on the bottom. Note that the internal structure of a "HALF" combiner is described earlier, in Figure 3.
Figure 8: The DAG of partial register combiners for the example shader. Again, arrows indicate dependencies.
The output of this stage of the compilation is a new DAG whose nodes consist of partial combiners (i.e., register-to-register operations). For a conventional VLIW architecture, this compilation step can be greatly simplified or eliminated, because language operations such as ADD and MULTIPLY already directly correspond to the architecture's register-to-register operations. This direct correspondence is due to the fact that a conventional VLIW architecture's functional units always take their inputs from a register and write their outputs to a register. In contrast, the register combiner architecture directly connects some functional-unit outputs to other functional-unit inputs. Some application-specific signal processor architectures also have this property [16].
When deciding what algorithm to use for this stage of our compiler, we considered both dynamic programming algorithms and greedy algorithms. We decided against using a BURG-style dynamic programming algorithm [3] because this class of algorithms does not readily support some optimizations we wanted to perform. More specifically, the BURG-style algorithms have two shortcomings. First, they work best when an "instruction" produces only one result, but our FULL combiners produce up to three results. Second, these algorithms are fundamentally designed for expression trees rather than expression DAGs, but we need to support DAG-like behavior (and associated optimizations) even within a partial combiner. Despite these drawbacks, it would be possible to design this stage of a register-combiner compiler using a BURG-style algorithm, and such an approach might be attractive as part of a retargetable compiler.
We considered several classes of greedy algorithms for this stage of the compiler. Broadly speaking, these algorithms can work either from the root of the DAG towards the leaves (top-down) or from the leaves towards the root (bottom-up). Once the algorithms have begun filling a particular partial combiner, they can work either from the outputs towards the inputs, or from the inputs to the outputs.

The structure of a partial register combiner is mostly tree-like; the only exception is the DOT/MUL outputs from a FULL combiner. Given this tree-like structure, it is natural to work from the output(s) of a combiner towards the inputs, and this is the strategy that we chose to use. This traversal direction corresponds to a top-down traversal of the DAG, which we also chose to use. It would be possible to combine an outputs-to-inputs combiner-filling order with a bottom-up DAG traversal, but it would be more awkward to do so.
Algorithm Details
Our algorithm fills one partial combiner at a time. When it begins working on a new partial combiner, it starts with one particular DAG node that it will place in that combiner. The algorithm chooses this node by taking it from the head of a priority queue. This queue contains all DAG nodes whose parents have all been placed into partial combiners. At the start of this compilation stage, the queue is initialized with the RGB and ALPHA root nodes from the DAG.
The priority of nodes in the queue is their weighted distance from the furthest leaf node beneath them. Thus, nodes closer to the root of the DAG have greater priority. The weight of an ADD or MUX node is slightly lower than that of other nodes (by epsilon), which improves compilation quality for some combinations of ADD and MULTIPLY operations.
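A minimal sketch of this ready queue and priority computation, assuming a toy tuple-encoded DAG and an arbitrary epsilon value (the names and encoding are ours):

    import heapq

    # Priority = weighted distance from the furthest leaf below a node, so nodes
    # nearer the DAG root come out of the queue first.  ADD/MUX weigh slightly
    # less than other nodes.
    EPSILON = 1e-3

    def height(node, memo=None):
        memo = {} if memo is None else memo
        if not isinstance(node, tuple):
            return 0.0                                   # leaf
        key = id(node)
        if key not in memo:
            op, *kids = node
            weight = 1.0 - (EPSILON if op in ("ADD", "MUX") else 0.0)
            memo[key] = weight + max((height(k, memo) for k in kids), default=0.0)
        return memo[key]

    def make_ready_queue(roots):
        # heapq is a min-heap, so negate the priority to pop the "highest" node.
        heap = [(-height(r), i, r) for i, r in enumerate(roots)]
        heapq.heapify(heap)
        return heap

    rgb_root = ("ADD", ("MULTIPLY", "Kd", "NdotL"), "ambient")
    alpha_root = ("MULTIPLY", "c_a", "t1_a")
    queue = make_ready_queue([rgb_root, alpha_root])
    print(heapq.heappop(queue)[2][0])                    # "ADD": the deeper root wins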
Once the compilation algorithm has chosen a particular DAG node to place into a partial combiner, it continues working on that partial combiner until it is as full as possible. If the original DAG node is the root RGB or root ALPHA node, then the partial combiner is marked as a FINAL RGB or FINAL ALPHA combiner. Otherwise, the type of combiner is determined by whether or not an ADD or MUX node is encountered before a MUL or DOT node. In the first case, a FULL combiner is created and filled, and in the second case, a HALF combiner is created and filled.
Our algorithm begins at the output of the combiner and works towards the inputs. For a FULL combiner, the algorithm first fills the OUTMAP unit; then the ADD/MUX unit; then the left DOT/MUL unit; then the two INMAP units for this DOT/MUL. Next, it fills the right DOT/MUL unit and its two INMAP units. One of the possible choices when "filling" a unit is to configure it for pass-through.
新扎师妹1For a FULL combiner,the two“side”DOT/MUL outputs are handled specially,becau the algorithm never us them to start working on a combiner.Instead,the outputs are handled when the DOT/MUL unit isfilled,by checking to e if the relevant DOT/MUL DAG node has more than one its result needs to be stored in a register).There are some additional complications in this procedure when the combiner’s OUTMAP is t to something other than pass-through.
As a combiner is filled, the algorithm can dynamically transform the original DAG in one of two ways to better match the combiner structure. First, it can move a TRIPLE or BLUE operation leaf-ward to allow additional operations to be packed into the combiner. For example, the expression y*TRIPLE(INMAP(x)) will be rewritten as y*INMAP(TRIPLE(x)). The second expression is equivalent to the first, and matches the combiner structure better.

The second type of DAG transformation is the replication of a TRIPLE, BLUE, or INMAP node when the node has more than one parent. This transformation provides each such parent with its own copy of the node. The compiler performs this transformation when it is necessary to allow the TRIPLE, BLUE, or INMAP node to be incorporated into the current partial combiner. In compiler terminology, this transformation is referred to as forward substitution [9].
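The following Python sketch shows the shape of this forward-substitution step on a toy node class of our own: it replicates a multi-parent TRIPLE node so that one parent gets a private copy it can absorb into its partial combiner.

    # Sketch of forward substitution: give one parent a private copy of a
    # multi-parent TRIPLE/BLUE/INMAP node.  Illustrative node class only.
    class Node:
        def __init__(self, op, *children):
            self.op = op
            self.children = list(children)
            self.parents = []
            for c in children:
                c.parents.append(self)

    def forward_substitute(parent, child):
        assert child.op in ("TRIPLE", "BLUE", "INMAP") and len(child.parents) > 1
        copy = Node(child.op, *child.children)        # private copy for this parent
        parent.children[parent.children.index(child)] = copy
        child.parents.remove(parent)
        copy.parents = [parent]
        return copy

    x = Node("REGISTER")
    t = Node("TRIPLE", x)
    p1 = Node("MULTIPLY", t, Node("REGISTER"))
    p2 = Node("ADD", t, Node("REGISTER"))
    forward_substitute(p1, t)
    print(len(t.parents), [c.op for c in p1.children])   # 1 ['TRIPLE', 'REGISTER']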
In general, the need to efficiently handle cases where a DAG node has more than one parent (i.e., its result is used more than once) adds significant complexity to the compiler. To handle these cases, the compiler must keep track of which intermediate results can be stored into a register, and which are blocked from being stored in a register by other operations that have already been placed into the combiner.
6 Register allocation for pipeline inputs

The compiler performs most register allocation work at the same time that it schedules partial combiners into a pipeline (as we will discuss in Section 7). For example, if a variable holds the output of a partial combiner, the compiler will perform register allocation for that variable during scheduling.
However, the compiler uses a separate, earlier step to allocate registers to values that enter the top of the combiner pipeline. We refer to these values as pass inputs. The pass inputs include vertex-to-fragment interpolants, results from the texture units (or texture shaders), and GeForce1 global constants.
The compiler performs this register allocation separately because there are so many constraints on the allocation, and because good allocation is often critical to packing as much computation as possible into a single rendering pass. For example, on a GeForce1, vertex-to-fragment interpolants must be placed in either the primary color register (V0) or the secondary color register (V1). But these two registers are not interchangeable: V0 supports RGBA (4-vector) interpolants, but V1 only supports RGB (3-vector) interpolants. On a GeForce3, the texture registers T0-T3 can also be used to hold interpolants, by using the texture-shader pass-through mechanism.
The pass-input register allocations are further constrained because the RGB and ALPHA portions of a register are often tightly coupled. For example, allocating a texture register to a texture lookup is an all-or-nothing proposition. In contrast, it is possible to allocate the RGB portion of a constant register to a 3-vector, and allocate the ALPHA portion of that same constant register to a completely unrelated scalar. Our compiler correctly handles both of these cases.
We use a greedy algorithm to perform pass-input register allocation. To allow our algorithms to adapt to different architectures (GeForce1 vs. GeForce3), we use a fairly general mechanism to describe the capabilities of each register for each class of pass inputs.
There are four classes of pass inputs: true constants, primitive-group values, vertex-to-fragment interpolants, and texture-unit outputs. For each register and each class of pass inputs, we indicate which part(s) of the register are available. The two parts are RGB and ALPHA. So, the secondary-color register V1 is marked as having its RGB part available for vertex-to-fragment interpolants, but not its (non-existent) ALPHA part. Global constant register C0 is marked as having RGB and ALPHA available for true constants and for primitive-group values, but not for vertex-to-fragment interpolants.
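Concretely, such a capability description can be little more than a table keyed by register and pass-input class. The sketch below shows a GeForce1-flavored example with invented class names and an abbreviated register list; it illustrates the mechanism, not the compiler's actual tables.

    # Sketch of a per-target capability table for pass-input register allocation.
    # Register names follow the paper (V0, V1, T0, T1, C0, C1); the table layout
    # and class names are our own illustration.
    GEFORCE1_CAPS = {
        "V0": {"interpolant": {"RGB", "ALPHA"}},
        "V1": {"interpolant": {"RGB"}},                  # no ALPHA part available
        "T0": {"texture": {"RGB", "ALPHA"}},
        "T1": {"texture": {"RGB", "ALPHA"}},
        "C0": {"constant": {"RGB", "ALPHA"}, "primitive_group": {"RGB", "ALPHA"}},
        "C1": {"constant": {"RGB", "ALPHA"}, "primitive_group": {"RGB", "ALPHA"}},
    }

    def candidate_registers(caps, input_class, parts_needed):
        """Registers whose available parts cover everything the value needs."""
        return [reg for reg, by_class in caps.items()
                if parts_needed <= by_class.get(input_class, set())]

    print(candidate_registers(GEFORCE1_CAPS, "interpolant", {"RGB", "ALPHA"}))  # ['V0']
    print(candidate_registers(GEFORCE1_CAPS, "interpolant", {"RGB"}))           # ['V0', 'V1']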
For each class of pass inputs, we also specify whether or not each of five possible sets of manipulations is allowed during register allocation:
• SPLIT: A single RGBA value can be split up into an RGB part and an ALPHA part, with each part placed in a different register. This manipulation is allowed on constants, for example.
• JOIN: An unrelated RGB and ALPHA value can be allocated to the same RGBA register. This manipulation is not legal for