Manhattan-world Stereo
Yasutaka Furukawa  Brian Curless  Steven M. Seitz
Department of Computer Science & Engineering
University of Washington, USA
{furukawa,curless,seitz}@cs.washington.edu
Richard Szeliski
Microsoft Research
Redmond, USA
Abstract
Multi-view stereo (MVS) algorithms now produce reconstructions that rival laser range scanner accuracy. However, stereo algorithms require textured surfaces, and therefore work poorly for many architectural scenes (e.g., building interiors with textureless, painted walls). This paper presents a novel MVS approach to overcome these limitations for Manhattan-world scenes, i.e., scenes that consist of piecewise-planar surfaces with dominant directions. Given a set of calibrated photographs, we first reconstruct textured regions using an existing MVS algorithm, then extract dominant plane directions, generate plane hypotheses, and recover per-view depth maps using Markov random fields. We have tested our algorithm on several datasets ranging from office interiors to outdoor buildings, and demonstrate results that outperform the current state of the art for such texture-poor scenes.
1. Introduction
The 3D reconstruction of architectural scenes is an important research problem, with large scale efforts underway to recover models of cities at a global scale (e.g., Google Earth, Virtual Earth). Architectural scenes often exhibit strong structural regularities, including flat, texture-poor walls, sharp corners, and axis-aligned geometry, as shown in Figure 1. The presence of such structures suggests opportunities for constraining and therefore simplifying the reconstruction task. Paradoxically, however, these properties are problematic for traditional computer vision methods and greatly complicate the reconstruction problem. The lack of texture leads to ambiguities in matching, whereas the sharp angles and non-fronto-parallel geometry defeat the smoothness assumptions used in dense reconstruction algorithms.
In this paper, we propose a multi-view stereo (MVS) approach specifically designed to exploit the properties of architectural scenes. We focus on the problem of recovering depth maps, as opposed to full object models. The key idea is to replace the smoothness prior used in traditional methods with priors that are more appropriate.
Figure 1. Increasingly ubiquitous on the Internet are images of architectural scenes with texture-poor but highly structured surfaces.
To this end we invoke the so-called Manhattan-world assumption [10], which states that all surfaces in the world are aligned with three dominant directions, typically corresponding to the X, Y, and Z axes; i.e., the world is piecewise-axis-aligned-planar. We call the resulting approach Manhattan-world stereo. While this assumption may seem to be overly restrictive, note that any scene can be arbitrarily-well approximated (to first order) by axis-aligned geometry, as in the case of a high resolution voxel grid [14,17]. While the Manhattan-world model may be reminiscent of blocks-world models from the 70's and 80's, we demonstrate state-of-the-art results on very complex environments.
Our approach, within the constrained space of Manhattan-world scenes, offers the following advantages: 1) it is remarkably robust to lack of texture, and able to model flat painted walls, and 2) it produces remarkably clean, simple models as outputs. Our approach operates as follows. We identify dominant orientations in the scene, as well as a set of candidate planes on which most of the geometry lies. These steps are enabled by first running an existing MVS method to reconstruct the portion of the scene that contains texture, and analyzing the recovered geometry. We then recover a depth map for each image by assigning one of the candidate planes to each pixel in the image. This step is posed as a Markov random field (MRF) and solved with graph cuts [4,5,13] (Fig. 2).
1.1. Related work
Our work builds upon a long tradition of piecewise-planar stereo, beginning with the seminal work of Wang and Adelson on layered motion models [20].
Figure 2. Our reconstruction pipeline. From a set of input images, an MVS algorithm reconstructs oriented points. We estimate dominant axes d_1, d_2, d_3. Hypothesis planes are found by finding point density peaks along each axis d_i. The planes are then used as per-pixel labels in an MRF.
Several authors, including Baker et al. [1], Birchfield and Tomasi [3], and Tao et al. [19], have specialized the 2D affine motion models first suggested by Wang and Adelson to the rigid multi-view stereo setting. What all of these algorithms have in common is that they alternate between assigning pixels to 3D planes and refining the plane equations. In all of these approaches, the scene is treated as a collection of simple primitives. Missing in these models, however, is a model of the structural relations between the primitives that govern how they meet and combine to form more complex scenes. A key innovation in our work is to incorporate consistency constraints on how planes meet to ensure valid surfaces, and to exploit image lines as cues for crease junctions. Another departure from [1,3,19,20] is that we leverage a state-of-the-art multi-view stereo algorithm to derive plane equations and data terms, rather than directly optimizing photoconsistency (appearance similarity); photoconsistency can perform poorly in wide-baseline settings or in the presence of occlusions.
Another line of related research uses dominant plane orientations in outdoor architectural models to perform plane-sweep stereo reconstruction. Notable examples are the work of Coorg and Teller [8], Werner and Zisserman [21], and Pollefeys et al. [15]. These approaches first estimate the gravity (up) vector and then find one or two dominant plane directions orthogonal to this vector using low-level cues such as reconstructed 3D points or lines. They then sweep families of planes through the scene [6,16] and measure the photoconsistency or correlation at each pixel in order to estimate depth maps. There also exist approaches specifically designed for architectural scenes. Cornelis et al. [9] estimate ruled vertical facades in urban street scenes by correlating complete vertical scanlines in images. Barinova et al. [2] also use vertical facades to reconstruct city building models from a single image. However, these approaches that estimate vertical facades cannot handle more complex scenes consisting of mixed vertical and horizontal planes. In contrast, our approach uses robust multi-view stereo correlation scores to measure the likelihood of a given pixel lying on a plane hypothesis, and uses a novel MRF to interpolate the sparse measurements into dense depth maps. We demonstrate good reconstruction results on challenging complex indoor scenes with many small axis-aligned surfaces such as tables and appliances. Zebedin et al. [22] also use an MRF to reconstruct building models, where they segment out buildings from images based on a height field, a rough building mask, and 3D lines, then recover roof shapes. Their system produces impressive building models, but one important difference from our approach is that height fields (or depth maps) are given as input in their system to reconstruct a roof model, while our algorithm produces depth maps as outputs that can be used for further modeling. (See our future work in Section 5.)
2. Hypothesis planes
Rather than solve for per-pixel disparity or depth values, as is common in stereo algorithms, we instead restrict the search space to a set of axis-aligned hypothesis planes, and seek to assign one of the plane labels to each pixel in the image (Fig. 2). This section describes our procedure for identifying the hypothesis planes.
Given a set of calibrated photographs, the first step of our algorithm is to use publicly available MVS software [11] to reconstruct a set of oriented 3D points (positions and normals). We retain only high-confidence points in textured areas. The normals are then used to extract three dominant axes for the scene, and the positions are used to generate axis-aligned candidate planes. These candidate planes are later used as hypotheses in the MRF depth-map reconstruction.
2.1. MVS preprocessing
To recover oriented points, we employ freely available, patch-based MVS software (PMVS) [11]. PMVS takes calibrated photographs and produces a set of oriented points {P_i}. Associated with each point P_i are a 3D location P_i, a surface normal N_i, a set of visible images V_i, and a photometric consistency score (normalized cross correlation) C(P_i) ∈ [−1, 1]. Note that, with some abuse of notation, P_i is used to denote both the oriented point as well as its 3D position coordinates.
While PMVS works well for textured regions, the output tends to be unreliable where texture is weak or the surface is far from Lambertian. Since we do not require dense coverage for generating plane hypotheses, we reconstruct and retain points conservatively. In particular, we require PMVS to recover only points observed in at least three views, and we set its initial photometric consistency threshold to 0.95 (which PMVS iteratively relaxes to 0.65). Further, to remove points in nearly textureless regions, we project each point into its visible views and reject it if the local texture variance is low in those views. More precisely, we project each point P_i into its visible images V_i and, in each image, compute the standard deviation of image intensities inside a 7×7 window around the projected point. If the average standard deviation (averaged over all the images in V_i) is below a threshold τ, the point is rejected. We use τ = 3 for intensities in the range [0, 255].
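To make the rejection test concrete, here is a minimal sketch of the texture-variance filter. The point/field layout ('X', 'V') and the `project` callable are illustrative assumptions, not the PMVS API.

```python
import numpy as np

def reject_textureless(points, images, project, tau=3.0, window=7):
    """Drop points whose surrounding image texture is too weak (Sect. 2.1).

    points : list of dicts with 3D position p['X'] and visible view
             indices p['V'] (illustrative field names, not PMVS output).
    images : grayscale images as 2D float arrays in [0, 255].
    project: assumed callable (X, view_index) -> (row, col) in that view.
    """
    half = window // 2
    kept = []
    for p in points:
        stds = []
        for j in p['V']:
            img = images[j]
            r, c = (int(round(v)) for v in project(p['X'], j))
            if half <= r < img.shape[0] - half and half <= c < img.shape[1] - half:
                patch = img[r - half:r + half + 1, c - half:c + half + 1]
                stds.append(patch.std())   # local intensity deviation, 7x7
        # Keep the point only if the average deviation over V_i reaches tau.
        if stds and np.mean(stds) >= tau:
            kept.append(p)
    return kept
```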
In the remainder of the paper, some of the parameters depend on a measure of the 3D sampling rate R implied by the input images. For a given MVS point P_i and one of its visible views I ∈ V_i, we compute the diameter of a sphere centered at P_i whose projected diameter in I equals the pixel spacing in I, and then weight this diameter by the dot product between the normal N_i and the viewing direction to arrive at a foreshortened diameter. We set R to the average foreshortened diameter of all points projected into all their visible views in this manner.
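The following sketch estimates R under the same conventions as above; `pixel_footprint` and `center` are assumed camera helpers (for a pinhole camera the footprint is roughly depth divided by focal length in pixels), not part of any particular library.

```python
import numpy as np

def sampling_rate(points, cameras):
    """Estimate the scene sampling rate R (Sect. 2.1): a minimal sketch."""
    diameters = []
    for p in points:
        for j in p['V']:
            cam = cameras[j]
            # Assumed helper: sphere diameter at p['X'] whose projection in
            # view j spans one pixel.
            d = cam.pixel_footprint(p['X'])
            view_dir = cam.center - p['X']
            view_dir = view_dir / np.linalg.norm(view_dir)
            # Foreshorten by |N . viewing direction|, as in the paper.
            diameters.append(d * abs(np.dot(p['N'], view_dir)))
    return float(np.mean(diameters))
```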
2.2. Extracting dominant axes
Under the Manhattan-world assumption, scene structure is piecewise-axis-aligned-planar. We could require that the axes be mutually orthogonal; however, to compensate for possible errors in camera intrinsics and to handle architecture that itself is not composed of exactly orthogonal planes, we allow for some deviation from orthogonality. To estimate the axes, we employ a simple, greedy algorithm using the normal estimates N_i recovered by PMVS (see [8,15,21] for similar approaches). We first compute a histogram of normal directions over a unit hemisphere, subdivided into 1000 bins.¹ We then set the first dominant axis d_1 to the average of the normals within the largest bin. Next, we find the largest bin within the band of bins that are in the range 80 to 100 degrees away from d_1 and set the second dominant axis d_2 to the average normal within that bin. Finally, we find the largest bin in the region that is in the range 80 to 100 degrees away from both d_1 and d_2 and set the third dominant axis d_3 to the average normal within that bin. In our experiments, we typically find that the axes are within 2 degrees of perpendicular to each other.
¹A hemisphere is the voting space instead of a sphere, because dominant axes rather than (signed) directions are extracted in this step.
2.3. Generating hypothesis planes
Given the dominant axes, the next step of our algorithm is to generate axis-aligned candidate planes to be used as hypotheses in the MRF optimization. Our approach is to have the positions of the MVS points vote for a set of candidate planes. For a given point P_i, a plane with normal equal to axis direction d_k and passing through P_i has an offset d_k · P_i; i.e., the plane equation is d_k · X = d_k · P_i. For each axis direction d_k we compute the set of offsets {d_k · P_i} and perform a 1D mean shift clustering [7] to extract clusters and peaks. The candidate planes are generated at the offsets of the peaks. Some clusters may contain a small number of samples, thus providing only weak support for the corresponding hypothesis; we exclude clusters with fewer than 50 samples. The bandwidth σ of the mean shift algorithm controls how many clusters (and thus how many candidate planes) are created. In our experiments, we set σ to be either R or 2R. (See Sect. 4 for more details on the parameter selection.)
Note that we reconstruct surfaces using oriented points, i.e., we distinguish the front and back sides of candidate planes. Thus, for each plane, we include both the plane hypothesis with surface normal pointing along its corresponding dominant axis, and the same geometric plane with normal facing in the opposite direction.
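A hedged sketch of the hypothesis generation follows. It uses a flat-window mean shift, which is one common variant (the paper does not pin down the kernel), and the naive loop is quadratic in the number of points, which is fine for a sketch but not for production use.

```python
import numpy as np

def hypothesis_planes(points, axes, bandwidth, min_support=50, iters=20):
    """Axis-aligned plane hypotheses via 1D mean shift (Sect. 2.3): a sketch.

    Returns (axis, offset, sign) triples; each peak yields two hypotheses,
    one per normal orientation (last paragraph of Sect. 2.3).
    """
    planes = []
    P = np.array([p['X'] for p in points])
    for d in axes:
        offsets = P @ d                              # d_k . P_i for every point
        modes = offsets.copy()
        for _ in range(iters):                       # mean shift iterations
            for i, m in enumerate(modes):
                near = offsets[np.abs(offsets - m) <= bandwidth]
                modes[i] = near.mean()
        # Merge converged modes into clusters and count their support.
        order = np.argsort(modes)
        clusters, current = [], [order[0]]
        for a, b in zip(order, order[1:]):
            if modes[b] - modes[a] <= bandwidth / 2:
                current.append(b)
            else:
                clusters.append(current)
                current = [b]
        clusters.append(current)
        for c in clusters:
            if len(c) >= min_support:                # drop weakly supported peaks
                off = float(modes[c].mean())
                planes.append((d, off, +1))          # normal along +d_k
                planes.append((d, off, -1))          # same plane, flipped normal
    return planes
```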
3. Reconstruction
Given a set H = {H_1, H_2, ...} of plane hypotheses, we seek to recover a depth map for image I_t (referred to as the target image) by assigning one of the plane hypotheses to each pixel. We formulate this problem as an MRF and solve it with graph cuts [4,5,13].
The energy E to be minimized is the sum of a per-pixel data term E_d(h_p) and a pairwise smoothness term E_s(h_p, h_q):
E = \sum_{p} E_d(h_p) + \lambda \sum_{\{p,q\} \in N(p)} E_s(h_p, h_q),    (1)
where h_p is the hypothesis assigned to pixel p, and N(p) denotes pairs of neighboring pixels in a standard 4-connected neighborhood around p. λ is a scaling factor for the smoothness term. (See Table 1 for the choice of this parameter for each dataset.) Note that we do not consider plane hypotheses which are back-facing to the target image's center of projection.
3.1. Data term
The data term E_d(h_p) measures visibility conflicts between a plane hypothesis at a pixel and all of the points {P_i} reconstructed by PMVS. We start with some notational preliminaries. Let X_p^l denote the 3D point reconstructed for pixel p when H_l is assigned to p, i.e., the intersection between the viewing ray passing through p and the hypothesis plane H_l.
Figure 3. The data term measures visibility conflicts between a plane hypothesis at a pixel and all the reconstructed points {P_i}. There are three different cases in which a visibility conflict occurs. The smoothness term in this figure measures the penalty of assigning a hypothesis H_n to pixel q and a hypothesis H_m to pixel p. See the text for details.
We define π̂_j(P) as the projection of a point P into image I_j, rounded to the nearest pixel coordinate in I_j. Finally, we define the depth difference between two points P and Q observed in image I_j with optical center O_j as:

\Delta_d^j(P, Q) = (Q - P) \cdot \frac{O_j - P}{\|O_j - P\|}.    (2)
Δ_d^j(P, Q) can be interpreted as the signed distance of Q from the plane passing through P with normal pointing from P to O_j, where positive values indicate that Q is closer than P is to O_j.
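Eq. (2) translates directly into a few lines of numpy; this helper is reused by the visibility test sketched later in this section.

```python
import numpy as np

def depth_difference(P, Q, O_j):
    """Signed depth difference of Eq. (2): positive when Q is closer than P
    to the optical center O_j, measured along the unit ray from P to O_j."""
    ray = (O_j - P) / np.linalg.norm(O_j - P)
    return float(np.dot(Q - P, ray))
```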
A pixel hypothesis h_p is considered to be in visibility conflict with an MVS point P_i under any one of the three following cases (illustrated in Figure 3):
Case 1. If P_i is visible in image I_t, the hypothesized point X_p^l should not be in front of P_i (since it would occlude it) and should not be behind P_i (since it would be occluded). For each P_i with I_t ∈ V_i, we first determine if π̂_t(P_i) = p. If so, we declare h_p to be in conflict with P_i if |Δ_d^t(P_i, X_p^l)| > γ, where γ is a parameter that determines the width of the no-conflict region along the ray to P_i, and is set to 10R in our experiments.²
Case 2. If P_i is not visible in image I_t, X_p^l should not be behind P_i, since it would be occluded. Thus, for each P_i with I_t ∉ V_i and π̂_t(P_i) = p, we declare h_p to be in conflict with P_i if Δ_d^t(P_i, X_p^l) > γ.

Case 3. For any view I_j that sees P_i, not including the target view, the space in front of P_i on the line of sight to I_j should be empty. Thus, for each P_i and for each view I_j ∈ V_i, I_j ≠ I_t, we first check to see if P_i and X_p^l project to the same pixel in I_j, i.e., π̂_j(P_i) = π̂_j(X_p^l). If so, we declare h_p to be in conflict with P_i if Δ_d^j(P_i, X_p^l) < −γ̂_{i,j} with respect to any view I_j. In this case, we employ a modified distance threshold, γ̂_{i,j} = γ / |N_{h_p} · r_j(P_i)|, where N_{h_p} is the normal to the plane corresponding to h_p, and r_j(P_i) is the normalized viewing ray direction from I_j to P_i.³

²10R approximately corresponds to ten times the pixel spacing in the input images. This large margin is used in our work in order to handle erroneous points in texture-poor regions and/or to compensate for possible errors in camera calibration.
Now, given a point P_i and a hypothesis h_p, we set the contribution of P_i to the data term as follows:
E_d^i(h_p) = \begin{cases} \max(0, C(P_i) - 0.7) & \text{if } h_p \text{ conflicts with } P_i \\ 0 & \text{otherwise} \end{cases}    (3)
where C(P_i) is the photometric consistency score of P_i reported by PMVS. Note that the penalty is automatically zero if C(P_i) is less than 0.7. Finally, the data term for h_p is given as follows, where 0.5 is an upper bound imposed for robustness:
E_d(h_p) = \min\Bigl(0.5, \sum_i E_d^i(h_p)\Bigr).    (4)
3.2. Smoothness term

The smoothness term E_s(h_p, h_q) enforces spatial consistency and is 0 if h_p = h_q. Otherwise, we seek a smoothness function that penalizes inconsistent plane neighbors, except when evidence suggests that such inconsistency is reasonable (e.g., at a depth discontinuity).

3.2.1 Plane consistency
We score plane consistency Δ_s(h_p, h_q) by extrapolating the hypothesis planes corresponding to h_p and h_q and measuring their disagreement along the line of sight between p and q. In particular, Δ_s(h_p, h_q) is the (unsigned) distance between the candidate planes measured along the viewing ray that passes through the midpoint between p and q.⁴ Large values of Δ_s(h_p, h_q) indicate inconsistent neighboring planes.

³The modification of the threshold is necessary, because the visibility information becomes unreliable when the corresponding visible ray is nearly parallel to both the image plane of I_t and the plane hypothesis.
3.2.2 Exploiting dominant lines
When two dominant planes meet in a Manhattan-world scene, the resulting junction generates a crease line in the image (referred to as a dominant line) that is aligned with one of the vanishing points (Figure 4). Such dominant lines are very strong image cues which we can exploit as structural constraints on the depth map.
Our procedure for identifying dominant lines is described as follows. Given an image I, we know that the projections of all dominant lines parallel to dominant direction d_k pass through the vanishing point v_k. Thus, for a given pixel p, the projection of any such dominant line observed at p must pass through p and v_k and therefore has orientation l_k = v_k − p in the image plane. Thus, we seek an edge filter that strongly prefers an edge aligned with l_k, i.e., with gradient along l_k^⊥, the direction perpendicular to l_k. We measure the strength of an edge along l_k as:

e_k(p) = \frac{\sum_{p' \in w(p)} |\nabla_{l_k^\perp} I(p')|}{\sum_{p' \in w(p)} |\nabla_{l_k} I(p')|}    (5)

where ∇_{l_k} I(p') and ∇_{l_k^⊥} I(p') are the directional derivatives along l_k and l_k^⊥, respectively, and w(p) is a rectangular window centered at p with axes along l_k^⊥ and l_k.⁵ Intuitively, e_k(p) measures the aggregate edge orientation (or the tangent of that orientation) in a neighborhood around p. Note that, due to the absolute values of the directional derivatives, an aligned edge that exhibits just a rise in intensity and an aligned edge that both rises and falls in intensity will both give a strong response. We have observed both in corners of rooms and corners of buildings in our experiments. In addition, the ratio computation means that weak but consistent, aligned edges will still give strong responses.
To allow for orientation discontinuities, we modulate smoothness as follows:

s(p) = \begin{cases} 0.01 & \text{if } \max(e_1(p), e_2(p), e_3(p)) > \beta \\ 1 & \text{otherwise} \end{cases}    (6)

Thus, if an edge response is sufficiently strong for any orientation, then the smoothness weight is low (allowing a plane discontinuity to occur). We choose β = 2 in our experiments, which roughly corresponds to an edge within a patch being within 25 degrees of any dominant line direction. Note that the smoothness weight is set to a value slightly larger than zero; this is necessary to constrain the optimization at pixels with zero data term contribution.
⁴The distance between X_p^m and X_q^n may be a more natural smoothness penalty, but this function is not sub-modular [13] and graph cuts cannot be used.
⁵w(p) is not an axis-aligned rectangle, and image derivatives are computed with bilinear interpolation and finite differences. We use windows of size 7×21, elongated along the l_k direction.

Figure 4. Input image and extracted dominant lines, used as cues for the meeting of two surfaces. The red, green, and blue components in the right figure show the results of edge detection along the three dominant directions, respectively. Note that yellow indicates ambiguity between the red and green directions.

Table 1. Characteristics of the datasets. See text for details.
       kitchen   office    atrium    hall-1    hall-2
N_c    22        54        34        11        61
N_r    0.1M      0.1M      0.1M      0.1M      0.1M
N_p    548162    449476    235705    154750    647091
N_h    227       370       316       168       350
λ      0.2       0.4       0.4       0.4       0.4
σ      2R        2R        R         R         2R
T_1    44        21        13        3         49
T_2    2         2         3         2         15
T_3    2.2       3.5       3.0       1.9       8.0
Putting together the smoothness components in this section, we now give the expression for the smoothness term between two pixels:
E_s(h_p, h_q) = \min\Bigl(10, \; s(p)\,\frac{\Delta_s(h_p, h_q)}{R}\Bigr)    (7)
Note that the plane depth inconsistency is scored relative to the scene sampling rate, and the function is again truncated, at 10, for robustness.
To optimize the MRF, we employ the α-expansion algorithm to minimize the energy [4,5,13] (three iterations are sufficient). A depth map is computed for each image.
4. Experimental Results
We have tested our algorithm on five real datasets; sample input images are shown on the left side of Fig. 6. All datasets contain one or more structures (e.g., poorly textured walls, non-Lambertian surfaces, sharp corners) that are challenging for standard stereo and MVS approaches. The camera parameters for each dataset were recovered using publicly available structure-from-motion (SfM) software [18].
Table 1 summarizes some characteristics of the datasets, along with the choice of the parameters in our algorithm.
Figure 5. Oriented points reconstructed by [11], and the point clusters extracted by the mean shift algorithm, shown for each of the three dominant directions. Points that belong to the same cluster, and hence contribute to the same plane hypothesis, are shown with the same color.
N_c is the number of input photographs, and N_r denotes the resolution of the input images in pixels. Since we want to reconstruct a simple, piecewise-planar structure of a scene rather than a dense 3D model, the depth maps need not be high-resolution. N_p denotes the number of reconstructed oriented points, while N_h denotes the number of extracted plane hypotheses for all three directions.
Two parameters varied among the datasets. λ is the scalar weight associated with the smoothness term, and is set to 0.4 except for the kitchen dataset, which has more complicated geometric structure with many occlusions, and hence requires a smaller smoothness penalty. σ is the mean shift bandwidth, set to either R or 2R based on the overall size of the structure. We have observed that for large scenes a smaller bandwidth, and thus more plane hypotheses, is necessary. In particular, reconstructions of such scenes are more sensitive to even small errors in the SfM-recovered camera parameters or in the extraction of the dominant axes; augmenting the MRF with more planes to choose from helps alleviate the problem.
Finally, T_1, T_2, and T_3 represent computation time in minutes, running on a dual quad-core 2.66GHz PC. T_1 is the time to run the PMVS software (pre-processing). T_2 is the time for both the hypothesis generation step (Sections 2.2 and 2.3) and the edge map construction. T_3 is the running time of the depth map reconstruction process for a single target image. This process includes a step in which we pre-compute and cache the data costs for every possible hypothesis at every pixel. Thus, although the number of variables and possible labels in the MRFs are similar among all the datasets, reconstruction is relatively slow for the hall-2 dataset, because it has many views and hence more visibility consistency checks in the data term.
Figure 5 shows the points reconstructed by PMVS for the kitchen dataset. Note that each point P_i is rendered with a color computed by taking the average of the pixel colors at its image projections in all the visible images V_i. The right side of the figure shows the clusters of points extracted by the mean shift algorithm for each of the three dominant axes. Points that belong to the same cluster are rendered with the same color. As shown in the figure, almost no MVS points have been reconstructed at uniformly-colored surfaces; this dataset is challenging for standard stereo techniques. Furthermore, the photographs were taken with the use of flash, which changes the shading and shadow patterns in every image and makes the problem even harder. Our reconstruction results are given in Figure 6, with a target image shown in the left column. A reconstructed depth map is shown next, where the depth value is linearly converted to a pixel intensity so that the closest point has intensity 0 and the farthest point has intensity 255. A depth normal map is the same as a depth map except that the hue and saturation of a color are determined by the normal direction of the assigned hypothesis plane: there are three dominant directions, and each direction is assigned the hue and saturation of either red, green, or blue. The right two columns show mesh models reconstructed from the depth maps, with and without texture mapping.
In Figure 7, we compare the proposed algorithm with a state-of-the-art MVS approach, where the PMVS software [11] is used to recover oriented points, which are converted into a mesh model using the Poisson Surface Reconstruction software (PSR) [12]. The first row of the figure shows PSR reconstructions. PSR fills all holes with curved surfaces that do not respect the architectural structure of the scenes. PSR also generates closed surfaces, including floating disconnected component "blobs," that obscure or even encapsulate the desired structure; we show only front-facing surfaces here simply to make the renderings comprehensible. In the second row, we show PSR reconstructions with the hole-filled and spurious geometry removed using a threshold on large triangle edge lengths. The bottom two rows show that our algorithm successfully recovers plausible, flat surfaces even where there is little texture. We admit that our models are not perfect, but want to emphasize that these are very challenging datasets where standard stereo algorithms based on photometric consistency do not work well (e.g., changes of shading and shadow patterns, poorly textured