Abstract
In recent years, people have become increasingly acquainted with 3D technologies
such as 3DTV, 3D movies and 3D virtual navigation of city environments in their
daily life. Commercial 3D movies are now commonly available for consumers. Virtual
navigation of our living environment as used on a personal computer has become
a reality due to well-known web-based geographic applications using advanced imaging
technologies. To enable such 3D applications, many technological challenges,
such as 3D content creation, 3D display technology and 3D content transmission,
need to be tackled, and the solutions deployed at low cost. This thesis concentrates on the reconstruction
of 3D scene information from multiple 2D images, aiming for an automatic and
low-cost production of the 3D content.
In this thesis, two multiple-view 3D reconstruction systems are proposed: a 3D
modeling system for reconstructing the sparse 3D scene model from long video sequences
captured with a hand-held consumer camcorder, and a depth reconstruction
system for creating depth maps from multiple-view videos taken by multiple synchronized
cameras. Both systems are designed to compute the 3D scene information in an
automated way with minimal human intervention, in order to reduce the production
cost of 3D content. Experimental results on real videos of hundreds to thousands
of frames have shown that the two systems are able to accurately and automatically reconstruct
the 3D scene information from 2D image data. The findings of this research
are useful for emerging 3D applications such as 3D games, 3D visualization and 3D
content production.
Apart from designing and implementing the two proposed systems, we have made
three key scientific contributions that enable the two proposed 3D reconstruction
systems. The first contribution is that we have designed a novel feature point
matching algorithm that uses only a smoothness constraint for matching the points,
which states that neighboring feature points in images tend to move with similar directions
and magnitudes. The employed smoothness assumption is not only valid
but also robust for most images with limited image motion, regardless of the camera
motion and scene structure. Because of this, the algorithm obtains two major advantages.
First, the algorithm is robust to illumination changes, as the employed smoothness
constraint does not rely on any texture information. Second, the algorithm has
a good capability to handle the drift of the feature points over time, as the drift can
hardly lead to a violation of the smoothness constraint. This leads to a large number
of feature points matched and tracked by the proposed algorithm, which significantly
helps the subsequent 3D modeling process. Our feature point matching algorithm
is specifically designed for matching and tracking feature points in image/video sequences
where the image motion is limited. Our extensive experimental results show
that the proposed algorithm is able to track at least 2.5 times as many feature points
as the state-of-the-art algorithms, with a comparable or higher accuracy.
This contributes significantly to the robustness of the 3D reconstruction process.
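The smoothness constraint described above can be illustrated with a minimal sketch that scores a candidate match by how far its motion vector deviates from the motion of neighboring feature points. This is not the matching algorithm of the thesis; the function name `smoothness_residual`, the neighborhood radius and the use of a median are assumptions made purely for illustration.

```python
import math
from statistics import median

def smoothness_residual(points, motions, index, radius=50.0):
    """Deviation of one point's motion from the median motion of its neighbors.

    points  : list of (x, y) feature positions in the first image
    motions : list of (dx, dy) candidate displacements to the second image
    index   : which feature point to score
    radius  : neighborhood radius in pixels (illustrative value, an assumption)
    """
    px, py = points[index]
    neighbors = [
        motions[j]
        for j, (qx, qy) in enumerate(points)
        if j != index and math.hypot(qx - px, qy - py) <= radius
    ]
    if not neighbors:
        return 0.0  # no neighbors: the constraint gives no evidence either way
    mx = median(m[0] for m in neighbors)
    my = median(m[1] for m in neighbors)
    return math.hypot(motions[index][0] - mx, motions[index][1] - my)

# Made-up data: three points move consistently, the fourth does not.
pts = [(10, 10), (20, 12), (15, 25), (30, 18)]
mov = [(2.0, 1.0), (2.1, 0.9), (1.9, 1.1), (-8.0, 6.0)]
residuals = [smoothness_residual(pts, mov, i) for i in range(len(pts))]
```

A candidate with a large residual would be treated as violating the constraint. Note that the score uses only positions and displacements, not texture, which is the property that the abstract associates with robustness to illumination changes.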
The second contribution is that we have developed algorithms to detect critical
configurations where the factorization-based 3D reconstruction degenerates. Based
on the detection, we have proposed a sequence-dividing algorithm to divide a long sequence
into subsequences, such that successful 3D reconstructions can be performed
on individual subsequences with high confidence. The partial reconstructions are
merged later to obtain the 3D model of the complete scene. In the critical configuration
detection algorithm, four critical configurations are detected: (1) coplanar 3D
scene points, (2) pure camera rotation, (3) rotation around two camera centers, and (4)
presence of excessive noise and outliers in the measurements. The configurations in
cases (1), (2) and (4) will affect the rank of the Scaled Measurement Matrix (SMM).
The number of camera centers in case (3) will affect the number of independent rows
of the SMM. By examining the rank and the row space of the SMM, the abovementioned
critical configurations are detected. Based on the detection results, the
proposed sequence-dividing algorithm divides a long sequence into subsequences,
such that each subsequence is free of the four critical configurations in order to obtain
successful 3D reconstructions on individual subsequences. Experimental results
on both synthetic and real sequences have demonstrated that the above four critical
configurations are robustly detected, and a long sequence of thousands of frames is automatically
divided into subsequences, yielding successful 3D reconstructions. The
proposed critical configuration detection and sequence-dividing algorithms provide
an essential processing block for automatic 3D reconstruction on long sequences.
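The rank examination mentioned above can be sketched with a numerical-rank estimate based on singular values. The sketch assumes the standard projective factorization setting, in which a noise-free SMM for a general configuration is the product of stacked 3x4 camera matrices and homogeneous 3D points and therefore has rank 4. The tolerance, the random test data and the function name `numerical_rank` are illustrative assumptions, not the detection procedure of the thesis.

```python
import numpy as np

def numerical_rank(matrix, rel_tol=1e-6):
    """Count singular values above rel_tol times the largest singular value."""
    s = np.linalg.svd(matrix, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))

rng = np.random.default_rng(0)
m, n = 5, 40                                  # number of views and of points
cameras = rng.standard_normal((3 * m, 4))     # stacked 3x4 projection matrices
points = rng.standard_normal((4, n))          # homogeneous 3D points

# General configuration: SMM = cameras @ points.
general = cameras @ points

# Coplanar scene: homogeneous points confined to a 3-dimensional subspace.
plane_basis = rng.standard_normal((4, 3))
coplanar = cameras @ (plane_basis @ rng.standard_normal((3, n)))

# Noisy measurements: additive noise raises the numerical rank.
noisy = general + 0.01 * rng.standard_normal(general.shape)

ranks = {
    "general": numerical_rank(general),
    "coplanar": numerical_rank(coplanar),
    "noisy": numerical_rank(noisy),
}
```

In this toy setting the coplanar case drops below rank 4 and the noisy case rises above it. A practical detector would need a noise-aware threshold rather than the fixed relative tolerance used here.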
The third contribution is that we have proposed a coarse-to-fine multiple-view
depth labeling algorithm to compute depth maps from multiple-view videos, where
the accuracy of the resulting depth maps is gradually refined in multiple optimization
passes. In the proposed algorithm, multiple-view depth reconstruction is formulated
as an image-based labeling problem using the framework of Maximum A Posteriori
(MAP) on Markov Random Fields (MRF). The MAP-MRF framework allows the
combination of various objective and heuristic depth cues to define the local penalty
and the interaction energies, which provides a straightforward and computationally
tractable formulation. Furthermore, the globally optimal MAP solution to depth labeling
can be found by minimizing the local energies, using existing MRF optimization
algorithms. The proposed algorithm contains the following three key contributions.
(1) A graph construction algorithm is proposed to construct triangular meshes on
over-segmentation maps, in order to exploit the color and the texture information for
depth labeling. (2) Multiple depth cues are combined to define the local energies.
Furthermore, the local energies are adapted to the local image content, in order to
consider the varying nature of the image content for an accurate depth labeling. (3)
Both the density of the graph nodes and the intervals of the depth labels are gradually
refined in multiple labeling passes. By doing so, both the computational efficiency
and the robustness of the depth labeling process are improved. The experimental results
on real multiple-view videos show that the depth maps for the selected reference
view are accurately reconstructed. Depth discontinuities are very well preserved.
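
The MAP-MRF formulation of depth labeling can be illustrated by an energy that sums a local penalty per graph node and an interaction term per pair of neighboring nodes. The sketch below evaluates such an energy on a toy three-node chain and finds the minimum by exhaustive search, which is feasible only at this size. The cost values, the truncated-linear interaction term and the name `mrf_energy` are assumptions for illustration. They are not the energies or the optimizer used in the thesis.

```python
from itertools import product

def mrf_energy(labels, data_cost, edges, weight=1.0, truncation=2):
    """Sum of per-node penalties and truncated-linear pairwise penalties."""
    unary = sum(data_cost[node][label] for node, label in enumerate(labels))
    pairwise = sum(
        weight * min(abs(labels[p] - labels[q]), truncation) for p, q in edges
    )
    return unary + pairwise

# Three nodes in a chain and three depth labels; the costs are made-up numbers.
data_cost = [
    [0.0, 2.0, 4.0],   # node 0 clearly prefers label 0
    [1.5, 1.0, 1.6],   # node 1 weakly prefers label 1
    [0.0, 2.0, 4.0],   # node 2 clearly prefers label 0
]
edges = [(0, 1), (1, 2)]

# Exhaustive search over all 3**3 labelings (toy size only).
best = min(
    product(range(3), repeat=3),
    key=lambda labels: mrf_energy(labels, data_cost, edges),
)
best_energy = mrf_energy(best, data_cost, edges)
```

Here the node-wise best labeling (0, 1, 0) has a higher total energy than the smooth labeling (0, 0, 0), which shows how the interaction term overrides a weak local preference.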

| Original language | English |
|---|---|
| Qualification | Doctor of Philosophy |
| Awarding Institution | |
| Supervisors/Advisors | |
| Award date | 31 Oct 2011 |
| Place of Publication | Eindhoven |
| Publisher | |
| Print ISBNs | 978-90-386-2739-7 |
| DOIs | |
| Publication status | Published - 2011 |