A Comparison and Evaluation of Multi-View Stereo Reconstruction Algorithms

Steven M. Seitz, Brian Curless

University of Washington

James Diebel

Stanford University

Daniel Scharstein

Middlebury College

Richard Szeliski

Microsoft Research

Abstract

This paper presents a quantitative comparison of several multi-view stereo reconstruction algorithms. Until now, the lack of suitable calibrated multi-view image datasets with known ground truth (3D shape models) has prevented such direct comparisons. In this paper, we first survey multi-view stereo algorithms and compare them qualitatively using a taxonomy that differentiates their key properties. We then describe our process for acquiring and calibrating multi-view image datasets with high-accuracy ground truth and introduce our evaluation methodology. Finally, we present the results of our quantitative comparison of state-of-the-art multi-view stereo reconstruction algorithms on six benchmark datasets. The datasets, evaluation details, and instructions for submitting new models are available online at vision.middlebury.edu/mview.

1. Introduction

The goal of multi-view stereo is to reconstruct a complete 3D object model from a collection of images taken from known camera viewpoints. Over the last few years, a number of high-quality algorithms have been developed, and the state of the art is improving rapidly. Unfortunately, the lack of benchmark datasets makes it difficult to quantitatively compare the performance of these algorithms and therefore to focus research on the most needed areas of development.

The situation in binocular stereo, where the goal is to produce a dense depth map from a pair of images, was until recently similar. Here, however, a database of images with ground-truth results has made the comparison of algorithms possible and hence stimulated an even faster increase in algorithm performance [1].

In this paper, we aim to rectify this imbalance by providing, for the first time, a collection of high-quality calibrated multi-view stereo images registered with ground-truth 3D models and an evaluation methodology for comparing multi-view algorithms.

Our paper's contributions include a taxonomy of multi-view stereo reconstruction algorithms inspired by [1] (Section 2), the acquisition and dissemination of a set of calibrated multi-view image datasets with high-accuracy ground-truth 3D surface models (Section 3), an evaluation methodology that measures reconstruction accuracy and completeness (Section 4), and a quantitative evaluation of some of the currently best-performing algorithms (Section 5). While the current evaluation only includes methods whose authors were able to provide us their results by CVPR final submission time, our datasets and evaluation results are publicly available [2] and open to the general community. We plan to regularly update the results, and publish a more comprehensive comparative evaluation as a full-length journal publication.

We limit the scope of this paper to algorithms that reconstruct dense object models from calibrated views. Our evaluation therefore does not include traditional binocular, trinocular, and multi-baseline stereo methods, which seek to reconstruct a single depth map, or structure-from-motion and sparse stereo methods that compute a sparse set of feature points. Furthermore, we restrict the current evaluation to objects that are nearly Lambertian, which is assumed by most algorithms. However, we also captured and plan to provide datasets of specular scenes and plan to extend our study to include such scenes in the future.

This paper is not the first to survey multi-view stereo algorithms; we refer readers to nice surveys by Dyer [3] and Slabaugh et al. [4] of algorithms up to 2001. However, the state of the art has changed dramatically in the last five years, warranting a new overview of the field. In addition, this paper provides the first quantitative evaluation of a broad range of multi-view stereo algorithms.

2. A multi-view stereo taxonomy

One of the challenges in comparing and evaluating multi-view stereo algorithms is that existing techniques vary significantly in their underlying assumptions, operating ranges, and behavior. Similar in spirit to the binocular stereo taxonomy [1], we categorize existing methods according to six fundamental properties that differentiate the major algorithms: the scene representation, photo-consistency measure, visibility model, shape prior, reconstruction algorithm, and initialization requirements.

2.1. Scene representation

The geometry of an object or scene can be represented in numerous ways; the vast majority of multi-view algorithms use voxels, level sets, polygon meshes, or depth maps. While some algorithms adopt a single representation, others employ different representations for various steps in the reconstruction pipeline. In this section we give a very brief overview of these representations, and in Section 2.5 we discuss how they are used in the reconstruction process.

Many techniques represent geometry on a regularly sampled 3D grid (volume), either as a discrete occupancy function (e.g., voxels [5–19]), or as a function encoding distance to the closest surface (e.g., level sets [20–26]). 3D grids are popular for their simplicity, uniformity, and ability to approximate any surface.
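To make the two grid-based representations concrete, the following minimal sketch samples a sphere on a regular grid in both ways. The function name `sample_sphere`, the unit-cube domain, and the convention that distances are negative inside the object are assumptions made for illustration; they are not taken from any of the cited methods.

```python
import math


def sample_sphere(n, radius):
    """Sample a sphere centered in the unit cube on an n x n x n grid.

    Returns (occupancy, distance), both indexed as [i][j][k]:
      occupancy -- 1 if the voxel center lies inside the sphere, else 0
      distance  -- signed distance from the voxel center to the sphere
                   surface (negative inside, positive outside)
    """
    center = (0.5, 0.5, 0.5)
    occupancy = [[[0] * n for _ in range(n)] for _ in range(n)]
    distance = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # Center of voxel (i, j, k) in the unit cube.
                point = ((i + 0.5) / n, (j + 0.5) / n, (k + 0.5) / n)
                d = math.dist(point, center) - radius
                distance[i][j][k] = d
                occupancy[i][j][k] = 1 if d < 0 else 0
    return occupancy, distance
```

The occupancy grid records only inside versus outside, whereas the distance grid also encodes how far each sample is from the surface, which is what allows a level-set method to locate the surface between grid samples.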

Polygon meshes represent a surface as a set of connected planar facets. They are efficient to store and render and are therefore a popular output format for multi-view algorithms. Meshes are also particularly well-suited for visibility computations and are used as the central representation in some algorithms [27–32].

Some methods represent the scene as a set of depth maps, one for each input view [33–38]. This multi-depth-map representation avoids resampling the geometry on a 3D domain, and the 2D representation is convenient particularly for smaller datasets. An alternative is to define the depth maps relative to scene surfaces to form a relief surface [39,40].

2.2. Photo-consistency measure

Numerous measures have been proposed for evaluating the visual compatibility of a reconstruction with a set of input images. The vast majority of these measures operate by comparing pixels in one image to pixels in other images to see how well they correlate. For this reason, they are often called photo-consistency measures [11]. The choice of measure is not necessarily intrinsic to a particular algorithm; it is often possible to take a measure from one method and substitute it in another. We categorize photo-consistency measures based on whether they are defined in scene space or image space [22].

Scene space measures work by taking a point, patch, or volume of geometry, projecting it into the input images, and evaluating the amount of mutual agreement between those projections. A simple measure of agreement is the variance of the projected pixels in the input images [8,11]. Other methods compare images two at a time, and use window-matching metrics such as sum of squared differences or normalized cross correlation [20,23,31]. An interesting feature of scene-space window-based methods is that the current estimate of the geometry can inform the size and shape of the window [20]. A number of other photo-consistency measures have been proposed to provide robustness to small shifts and other effects [12,18].
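The three agreement measures named above can be written down directly. The sketch below assumes grayscale intensities that have already been sampled from the images (one value per view for the variance, one flattened window per image for the pairwise metrics); the function names are illustrative only.

```python
import math


def variance(samples):
    """Variance of the intensities a scene point projects to, one per view."""
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)


def ssd(window_a, window_b):
    """Sum of squared differences between two equally sized windows."""
    return sum((a - b) ** 2 for a, b in zip(window_a, window_b))


def ncc(window_a, window_b):
    """Normalized cross correlation in [-1, 1]; 0.0 for a constant window."""
    mean_a = sum(window_a) / len(window_a)
    mean_b = sum(window_b) / len(window_b)
    dev_a = [a - mean_a for a in window_a]
    dev_b = [b - mean_b for b in window_b]
    denom = math.sqrt(sum(a * a for a in dev_a) * sum(b * b for b in dev_b))
    if denom == 0:
        return 0.0
    return sum(a * b for a, b in zip(dev_a, dev_b)) / denom
```

Low variance and low SSD indicate agreement, whereas NCC indicates agreement when it is close to 1. NCC is unchanged when one window is scaled or offset in brightness, which SSD is not.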

Image space methods use an estimate of scene geometry to warp an image from one viewpoint to predict a different view. Comparing the predicted and measured images yields a photo-consistency measure known as prediction error [26,41]. While prediction error is conceptually very similar to scene space measures, an important difference is the domain of integration. Scene space error functions are integrated over a surface and thus often tend to prefer smaller surfaces, whereas prediction error is integrated over the set of images of a scene and thus ascribes more weight to parts of the scene that appear frequently or occupy a large image area.

While most stereo algorithms have traditionally assumed approximately view-independent, i.e., Lambertian, scenes, a number of new photo-consistency metrics have been devised that seek to model more general reflection functions (BRDFs) [15–17,22,23,32]. Some methods also utilize silhouettes [27,30,31] or shadows [17,42].

2.3. Visibility model

Visibility models specify which views to consider when evaluating photo-consistency measures. Because scene visibility can change dramatically with viewpoint, almost all modern multi-view stereo algorithms account for occlusions in some way or another. Early algorithms that did not model visibility [6,27,43] have trouble scaling to large distributions of viewpoints. Techniques for handling visibility include geometric, quasi-geometric, and outlier-based approaches.

Geometric techniques seek to explicitly model the image formation process and the shape of the scene to determine which scene structures are visible in which images. A common approach in surface evolution methods is to use the current estimate of the geometry to predict visibility for every point on that surface [5,11,12,19,20,29,30,40]. Furthermore, if the surface evolution begins with a surface that encloses the scene volume and evolves by carving away that volume, this visibility approach can be shown to be conservative [11,18]; i.e., the set of cameras for which a scene point is predicted to be visible is a subset of the set of cameras in which that point is truly visible.

Visibility computations can be simplified by constraining the allowable distribution of camera viewpoints. If the scene lies outside the convex hull of the camera centers, the occlusion ordering of points in the scene is the same for all cameras [8], enabling a number of more efficient algorithms [8,10,13,35,44].

Quasi-geometric techniques use approximate geometric reasoning to infer visibility relationships. For example, a popular heuristic for minimizing the effects of occlusions is to limit the photo-consistency analysis to clusters of nearby cameras [31,45]. This approach is often used in combination with other forms of geometric reasoning to avoid oblique views and to minimize computations [5,11,26]. Another common quasi-geometric technique is to use a rough estimate of the surface such as the visual hull [46] to guess visibility for neighboring points [19,47,48].

The third type of method is to avoid explicit geometric reasoning and instead treat occlusions as outliers [31,34,37,38]. Especially in cases where scene points are visible more often than they are occluded, simple outlier rejection techniques [49] can be used to select the good views. A heuristic often used in tandem with outlier rejection is to avoid comparing views that are far apart, thereby increasing the likely percentage of inliers [31,34,37,38].
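As one hedged illustration of the outlier-based idea, the sketch below keeps the views whose sampled intensity lies close to the median over all views. The median rule, the function name `select_inlier_views`, and the fixed `tolerance` parameter are assumptions chosen for simplicity; this is not the specific technique of [49] or of any cited algorithm.

```python
import statistics


def select_inlier_views(intensities, tolerance):
    """Return indices of views whose intensity is within `tolerance`
    of the median intensity over all views.

    intensities -- one sampled intensity per view for a single scene point
    tolerance   -- maximum allowed absolute deviation from the median
    """
    median = statistics.median(intensities)
    return [i for i, value in enumerate(intensities)
            if abs(value - median) <= tolerance]
```

The median is a reasonable reference only under the condition stated above, namely that the point is visible in more views than it is occluded in; otherwise the occluded views would dominate the median.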

2.4. Shape prior

Photo-consistency measures alone are not always sufficient to recover precise geometry, particularly in low-textured scene regions [11,50]. It can therefore be helpful to impose shape priors that bias the reconstruction to have desired characteristics. While priors are essential for binocular stereo, they play a less important role in multi-view stereo where the constraints from many views are stronger.

Techniques that minimize scene-based photo-consistency measures naturally seek minimal surfaces with small overall surface area. This bias is what enables many level-set algorithms to converge from a gross initial shape [20]. The preference for minimal surfaces can also result in a tendency to smooth over points of high curvature (see [51,52] for ways to address this problem). Recent approaches based on volumetric min-cut [19,47] also have a bias for minimum surfaces. A number of mesh-based algorithms incorporate terms that cause triangles to shrink [29,31] or prefer reference shapes such as a sphere or a plane [27].

Many methods based on voxel coloring and space carving [5,8,9,11,12,16,18,53] instead prefer maximal surfaces. Since these methods operate by removing voxels only when they are not photo-consistent, they produce the largest photo-consistent scene reconstruction, known as the “photo hull.” Because they do not assume that the surface is smooth, these techniques are good at reconstructing high curvature or thin structures. However, the surface tends to bulge out in regions of low surface texture [8,11].

Rather than impose global priors on the overall size of the surface, other methods employ shape priors that encourage local smoothness. Approaches that represent the scene with depth maps typically optimize an image-based smoothness term [33–37,45] that seeks to give neighboring pixels the same depth value. This kind of prior fits nicely into a 2D Markov Random Field (MRF) framework, and can therefore take advantage of efficient MRF solvers [35]. A disadvantage is that there is a bias toward fronto-parallel surfaces. This bias can be avoided by enforcing surface-based priors, as in [27,29–32,40,47,48].
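The fronto-parallel bias of an image-based smoothness term is easy to see in a small example. The sketch below assumes an absolute-difference penalty over 4-connected neighbors; the penalty form and the name `smoothness_energy` are illustrative assumptions, since the cited methods differ in the exact term they use.

```python
def smoothness_energy(depth):
    """Sum of |d_p - d_q| over all horizontally and vertically
    adjacent pixel pairs (p, q) of a depth map given as a list of rows."""
    rows, cols = len(depth), len(depth[0])
    total = 0.0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:  # right neighbor
                total += abs(depth[r][c] - depth[r][c + 1])
            if r + 1 < rows:  # lower neighbor
                total += abs(depth[r][c] - depth[r + 1][c])
    return total


# A fronto-parallel plane has constant depth and incurs no penalty,
# while an equally planar but slanted surface is penalized.
fronto_parallel = [[2.0, 2.0], [2.0, 2.0]]
slanted = [[1.0, 2.0], [1.0, 2.0]]
```

Here `smoothness_energy(fronto_parallel)` is 0 while `smoothness_energy(slanted)` is positive, even though both surfaces are perfectly flat in 3D.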

2.5. Reconstruction algorithm

Multi-view stereo algorithms can be roughly categorized into four classes.

The first class operates by first computing a cost function on a 3D volume, and then extracting a surface from this volume. A simple example of this approach is the voxel coloring algorithm and its variants [8,17], which make a single sweep through the volume, computing costs and reconstructing voxels with costs below a threshold in the same pass (note that [13] avoids the need for a threshold). Other algorithms differ in the definition of the cost function and the surface extraction method. A number of methods define a volumetric MRF and use max-flow [6,19,47,48] or multi-way graph cut [35] to extract an optimal surface.

The second class of techniques works by iteratively evolving a surface to decrease or minimize a cost function. This class includes methods based on voxels, level sets, and surface meshes. Space carving [5,11] and its variants [9,11,12,14,18,40,53] progressively remove inconsistent voxels from an initial volume. Other variants of this approach enable adding as well as deleting voxels to minimize an energy function [15,54]. Level-set techniques minimize a set of partial differential equations defined on a volume. Like space carving methods, level-set methods typically start from a large initial volume and shrink inward; unlike most space carving methods, however, they can also locally expand if needed to minimize an energy function. Other approaches represent the scene as an evolving mesh [27–32] that moves as a function of internal and external forces.

In the third class are image-space methods that compute a set of depth maps. To ensure a single consistent 3D scene interpretation, these methods enforce consistency constraints between depth maps [33,35–37], or merge the set of depth maps into a 3D scene as a post process [45].

The final class consists of algorithms that first extract and match a set of feature points and then fit a surface to the reconstructed features [55–58].
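The carving loop used by the second class of methods can be sketched in a few lines. This is a deliberately simplified illustration: the name `carve` and the caller-supplied `is_consistent` predicate are hypothetical, and the sketch tests every remaining voxel on each pass, whereas actual space carving methods test only surface voxels and recompute visibility as the shape changes.

```python
def carve(voxels, is_consistent):
    """Remove voxels until every remaining voxel passes the consistency test.

    voxels        -- iterable of (i, j, k) voxel coordinates (initial volume)
    is_consistent -- callable (voxel, remaining_set) -> bool; a stand-in
                     for a photo-consistency test against the input images
    """
    remaining = set(voxels)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so voxels can be removed during the pass.
        for voxel in sorted(remaining):
            if not is_consistent(voxel, remaining):
                remaining.discard(voxel)
                changed = True
    return remaining
```

Because voxels are only ever removed, the loop terminates, and the result is the largest subset of the initial volume in which every voxel passes the test, which mirrors the maximal “photo hull” behavior described in Section 2.4.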

2.6. Initialization requirements

In addition to a set of calibrated images, all multi-view stereo algorithms assume or require as input some information about the geometric extent of the object or scene being reconstructed. Providing some constraints on scene geometry is in fact necessary to rule out trivial shapes, such as a different postcard placed in front of each camera lens.

Many algorithms require only a rough bounding box or volume (e.g., space carving variants [8,9,11,12,14,18,40,53] and level-set algorithms [20–26]). Some algorithms require a foreground/background segmentation (i.e., silhouette) for each input image and reconstruct a visual hull [46] that serves as an initial estimate of scene geometry [5,19,31,47,48].

Image-space algorithms [33,35–37] typically enforce constraints on the allowable range of disparity or depth values, thereby constraining scene geometry to lie within a near and far depth plane for each camera viewpoint.

Figure 1. Multi-view datasets with laser-scanned 3D models. Panels: temple, temple model, dino, dino model, bird, dogs.

Figure 2. The 317 camera positions and orientations for the temple dataset. The gaps are due to shadows. The 47 cameras corresponding to the ring dataset are shown in blue and red, and the 16 sparse ring cameras only in red.

3. Multi-view data sets

To enable a quantitative evaluation of multi-view stereo reconstruction algorithms, we collected several calibrated multi-view image sets and corresponding ground truth 3D mesh models. Similar data are available for surface light field studies [59,60]; we have followed similar procedures for acquiring the images and models and for registering them to one another (although we add a step to automatically refine the alignment of the ground truth to the image sets based on minimizing photo-consistency). The surface light field data sets themselves are not, however, suitable for this evaluation due to the highly specular nature of the objects selected for those studies. We note that a number of other high quality multi-view datasets are publicly available (without registered ground truth models), and we provide links to many of these through our web site.

The target objects for this study were selected to have a variety of characteristics that are challenging for typical multi-view stereo reconstruction algorithms. We sought objects that broadly sample the space of these characteristics by including both sharp and smooth features, complex topologies, strong concavities, and both strongly and weakly textured surfaces (see Figure 1).

The images were captured using the Stanford spherical gantry, a robotic arm that can be positioned on a one-meter radius sphere to an accuracy of approximately 0.01 degrees. Images were captured using a CCD camera with a resolution of 640×480 pixels attached to the tip of the gantry arm. At this resolution, a pixel in the image spans roughly 0.25 mm on the surface of the object (the temple object is 10 cm × 16 cm × 8 cm, and the dino is 7 cm × 9 cm × 7 cm). The system was calibrated by imaging a planar calibration grid from 68 viewpoints over the hemisphere and using [61] to compute intrinsic and extrinsic parameters. From these parameters, we computed the camera's translational and rotational offset relative to the tip of the gantry arm, enabling us to determine the camera's position and orientation as a function of any desired arm position.

The target object sits on a stationary platform near the center of the gantry sphere and is lit by three external spotlights. Because the gantry casts shadows on the object in certain viewpoints, we double-covered the hemisphere with two different arm configurations, capturing a total of 790 images. After shadowed images were manually removed, we obtained roughly 80% coverage of the sphere. From the resulting images, we created three datasets for each object, corresponding to a full hemisphere, a single ring around the object, and a sparsely sampled ring (Figure 2).

The reference 3D model was captured using a Cyberware Model 15 laser stripe scanner. This unit has a single-scan resolution of 0.25 mm and an accuracy of 0.05 mm to 0.2 mm, depending on the surface characteristics and the viewing angle. For each object, roughly 200 individual scans were captured, aligned and merged on a 0.25 mm grid, with the resulting mesh extracted with sub-voxel precision [62]; the accuracy of the combined scans is appreciably greater than the individual scans. The procedure also produces per-vertex confidence information, which we use in the evaluation procedure.

The reference models were aligned to their image sets using an iterative optimization approach that minimizes a photo-consistency function between the reference mesh and the images. The alignment parameters consist of a translation, rotation, and uniform scale. The scale factor was introduced to compensate for small differences in calibration between the laser scanner and each image set. The photo-consistency function for each vertex of the mesh is the variance of the color of all rays impinging on that vertex, times the number of images in which that vertex is visible, times the confidence of that vertex. This function is summed over all vertices in the mesh, and minimized using a coordinate descent method with a bounded finite difference Newton line search. The size of the finite difference increment is reduced between successive iterations by a factor of two until a minimum value is reached. After every step, a check is made to ensure that the objective function strictly decreases. The optimization was initialized with the output of an iterative closest point (ICP) alignment between the reference mesh and one of the submitted reconstructions. It was found that the result of the optimization was invariant to which sample reconstruction was selected for the ICP alignment. The quality of the alignments was validated by manually inspecting the reprojection of the full images; maximum reprojection errors were found to be on the order of 1 pixel, and usually substantially less.
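The alignment objective described above can be sketched as follows. The text does not say how the variance of an RGB color is reduced to a scalar, so the sketch assumes the sum of per-channel variances; the function names and the input layout are likewise illustrative, and the projection of vertices into the images and the optimizer itself are omitted.

```python
def vertex_cost(colors, confidence):
    """Photo-consistency cost of one mesh vertex.

    colors     -- list of (r, g, b) samples, one per image in which the
                  vertex is visible
    confidence -- per-vertex confidence value from the scanner
    Returns variance * number of visible images * confidence.
    """
    n = len(colors)
    if n == 0:
        return 0.0
    mean = [sum(c[ch] for c in colors) / n for ch in range(3)]
    # Assumed scalar variance: sum of the three per-channel variances.
    var = sum((c[ch] - mean[ch]) ** 2 for c in colors for ch in range(3)) / n
    return var * n * confidence


def alignment_objective(vertices):
    """Sum of vertex costs; `vertices` is a list of (colors, confidence)."""
    return sum(vertex_cost(colors, conf) for colors, conf in vertices)
```

Weighting by the number of visible images and by confidence means that well-observed, reliably scanned vertices dominate the objective, while vertices with no visible images contribute nothing.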

4. Evaluation methodology

We now describe how we evaluate reconstructions by geometric comparison to the ground truth model.

Let us denote the ground truth model as G and the submitted reconstruction result to be evaluated as R. The goal of our evaluation is to assess both the accuracy of R (how close R is to G), and the completeness of R (how much of G is modeled by R). For the purposes of this paper, we assume that R is itself a triangle mesh.

To measure the accuracy of a reconstruction, we compute the distance between the points in R and the nearest points on G. Since R is a surface, in theory, we should construct measures that entail integration over R, although in practice we simply sample R at its vertices.

A problem arises where G is incomplete. In this case, for a given point on R in an area where G is incomplete, the nearest point on G could be on its boundary or possibly on a distant part of the mesh. Rather than try to detect and remove such errors, we instead compute nearest distances to G′, a hole-filled version of G, and discount points in R whose nearest points on G′ are closest to the hole-filled regions. Figure 3(b) illustrates this approach. While this solution is itself imperfect, if the hole fills are reasonably “tight,” this approach will avoid penalizing accurate points in R at the cost of discarding some possibly less accurate points that happen to match to the hole fill. In practice, we use the hole-filled surfaces generated by space carving [62] during surface reconstruction from range scans, and we perform many scans (approximately 200 per object), so that the hole fills are fairly close to the actual surface and constitute a small portion of the surface of the model. In addition, the mesh G has per-vertex confidence values indicating how well it was sampled by the scanner [62]; we ignore points on R that map to low confidence regions of G.

Figure 3. Evaluation of reconstruction R relative to ground truth model G. (a) R and G are represented as meshes, each shown here to be incomplete at different parts of the surface. (b) To compute accuracy, for each vertex on R, we find the nearest point on G. We augment G with a hole-filled region (solid red) to give a mesh G′. Vertices (shown in red) that project to the hole-filled region are not used in the accuracy metric. (c) To measure completeness, for each vertex on G, we find the nearest points on R (where the dotted lines terminate on R). Vertices (shown in red) that map to the boundary of R or are beyond an “inlier distance” from R to G are treated as not covered by R.

After determining the nearest valid points on G from R, we compute the distances between them. We compute the signed distances to get a sense of whether a reconstruction tends to under- or over-estimate the true shape. We set the sign of each distance equal to the sign of the dot product between the outward facing normal at the nearest point on G and the vector from that point to the query point on R.
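The sign rule above translates directly into code. In this sketch the nearest point on G and its outward normal are assumed to be given; the function name and the choice to treat a zero dot product as positive are illustrative assumptions.

```python
import math


def signed_distance(query, nearest, normal):
    """Signed distance from a vertex of R to the ground truth G.

    query   -- (x, y, z) vertex on the reconstruction R
    nearest -- (x, y, z) nearest valid point on G
    normal  -- outward facing normal of G at `nearest`
    The magnitude is the Euclidean distance; the sign is that of the dot
    product between the normal and the vector from `nearest` to `query`.
    """
    offset = [q - p for q, p in zip(query, nearest)]
    dist = math.sqrt(sum(o * o for o in offset))
    dot = sum(o * n for o, n in zip(offset, normal))
    return dist if dot >= 0 else -dist
```

With this convention, a vertex of R lying outside G yields a positive value and a vertex lying inside G yields a negative one.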

Given the sampling of signed distances from the vertices of R to G (less the distances for points that project to hole fills of G′), we can now visualize their distribution and compute summary statistics useful in comparing the accuracy of the reconstruction algorithms. One useful example of such a statistic is to compute the distance d such that X% of the points on R are within distance d of G. When X = 50, for instance, this gives the median distance from R to G. One such statistic is presented in Section 5.
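The X% statistic can be computed by sorting the distance magnitudes. The sketch below assumes that distances for discounted points have already been removed and uses a simple rank-based rule without interpolation; the function name and that rule are assumptions made for illustration.

```python
import math


def accuracy_threshold(signed_distances, percent):
    """Smallest distance d such that at least `percent`% of the sampled
    vertices of R lie within d of G (using distance magnitudes)."""
    magnitudes = sorted(abs(d) for d in signed_distances)
    if not magnitudes:
        raise ValueError("no valid distances to evaluate")
    # Number of samples that must fall within the threshold.
    count = max(1, math.ceil(percent / 100 * len(magnitudes)))
    return magnitudes[count - 1]
```

For an odd number of samples, `accuracy_threshold(distances, 50)` returns the median magnitude, matching the X = 50 case described above.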

To measure completeness, we compute the distances from G to R, i.e., the opposite of what we do for measuring accuracy. Intuitively, points on G that have no suitable nearest points on R will be considered “not covered”. Again, while we could measure the covered area by integration, we instead sample using the vertices of G, which are fairly uniformly distributed over G for our models.
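A minimal sketch of a completeness measure along these lines is given below. It makes two simplifying assumptions that are not part of the methodology described here: distances are measured to the nearest vertex of R by brute force rather than to the nearest point on the surface of R, and vertices that map to the boundary of R are not treated specially. The function name and the `inlier_distance` parameter are illustrative.

```python
import math


def completeness(g_vertices, r_vertices, inlier_distance):
    """Fraction of ground truth vertices covered by the reconstruction.

    A vertex of G counts as covered when some vertex of R lies within
    `inlier_distance` of it.
    """
    if not g_vertices:
        raise ValueError("ground truth has no vertices")
    covered = 0
    for g in g_vertices:
        # Distance to the nearest vertex of R; infinite if R is empty.
        nearest = min((math.dist(g, r) for r in r_vertices), default=math.inf)
        if nearest <= inlier_distance:
            covered += 1
    return covered / len(g_vertices)
```

Because the vertices of G are assumed to be roughly uniformly distributed, the covered fraction of vertices serves as a stand-in for the covered fraction of surface area.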
