Acquiring 3D scene information from 2D images

Ping Li

Research output: ThesisPhd Thesis 1 (Research TU/e / Graduation TU/e)

328 Downloads (Pure)

Abstract

In recent years, people are becoming increasingly acquainted with 3D technologies such as 3DTV, 3D movies and 3D virtual navigation of city environments in their daily life. Commercial 3D movies are now commonly available for consumers. Virtual navigation of our living environment as used on a personal computer has become a reality due to well-known web-based geographic applications using advanced imaging technologies. To enable such 3D applications, many technological challenges such as 3D content creation, 3D displaying technology and 3D content transmission need to tackled and deployed at low cost. This thesis concentrates on the reconstruction of 3D scene information from multiple 2D images, aiming for an automatic and low-cost production of the 3D content. In this thesis, two multiple-view 3D reconstruction systems are proposed: a 3D modeling system for reconstructing the sparse 3D scene model from long video sequences captured with a hand-held consumer camcorder, and a depth reconstruction system for creating depth maps from multiple-view videos taken by multiple synchronized cameras. Both systems are designed to compute the 3D scene information in an automated way with minimum human interventions, in order to reduce the production cost of 3D contents. Experimental results on real videos of hundreds and thousands frames have shown that the two systems are able to accurately and automatically reconstruct the 3D scene information from 2D image data. The findings of this research are useful for emerging 3D applications such as 3D games, 3D visualization and 3D content production. Apart from designing and implementing the two proposed systems, we have developed three key scientific contributions to enable the two proposed 3D reconstruction systems. The first contribution is that we have designed a novel feature point matching algorithm that uses only a smoothness constraint for matching the points, which states that neighboring feature points in images tend to move with similar directions and magnitudes. The employed smoothness assumption is not only valid but also robust for most images with limited image motion, regardless of the camera motion and scene structure. Because of this, the algorithm obtains two major advan- 1 tages. First, the algorithm is robust to illumination changes, as the employed smoothness constraint does not rely on any texture information. Second, the algorithm has a good capability to handle the drift of the feature points over time, as the drift can hardly lead to a violation of the smoothness constraint. This leads to the large number of feature points matched and tracked by the proposed algorithm, which significantly helps the subsequent 3D modeling process. Our feature point matching algorithm is specifically designed for matching and tracking feature points in image/video sequences where the image motion is limited. Our extensive experimental results show that the proposed algorithm is able to track at least 2.5 times as many feature points compared with the state-of-the-art algorithms, with a comparable or higher accuracy. This contributes significantly to the robustness of the 3D reconstruction process. The second contribution is that we have developed algorithms to detect critical configurations where the factorization-based 3D reconstruction degenerates. Based on the detection, we have proposed a sequence-dividing algorithm to divide a long sequence into subsequences, such that successful 3D reconstructions can be performed on individual subsequences with a high confidence. The partial reconstructions are merged later to obtain the 3D model of the complete scene. In the critical configuration detection algorithm, the four critical configurations are detected: (1) coplanar 3D scene points, (2) pure camera rotation, (3) rotation around two camera centers, and (4) presence of excessive noise and outliers in the measurements. The configurations in cases (1), (2) and (4) will affect the rank of the Scaled Measurement Matrix (SMM). The number of camera centers in case (3) will affect the number of independent rows of the SMM. By examining the rank and the row space of the SMM, the abovementioned critical configurations are detected. Based on the detection results, the proposed sequence-dividing algorithm divides a long sequence into subsequences, such that each subsequence is free of the four critical configurations in order to obtain successful 3D reconstructions on individual subsequences. Experimental results on both synthetic and real sequences have demonstrated that the above four critical configurations are robustly detected, and a long sequence of thousands frames is automatically divided into subsequences, yielding successful 3D reconstructions. The proposed critical configuration detection and sequence-dividing algorithms provide an essential processing block for an automatical 3D reconstruction on long sequences. The third contribution is that we have proposed a coarse-to-fine multiple-view depth labeling algorithm to compute depth maps from multiple-view videos, where the accuracy of resulting depth maps is gradually refined in multiple optimization passes. In the proposed algorithm, multiple-view depth reconstruction is formulated as an image-based labeling problem using the framework of Maximum A Posterior (MAP) on Markov Random Fields (MRF). The MAP-MRF framework allows the combination of various objective and heuristic depth cues to define the local penalty and the interaction energies, which provides a straightforward and computationally tractable formulation. Furthermore, the global optimal MAP solution to depth labeli ing can be found by minimizing the local energies, using existing MRF optimization algorithms. The proposed algorithm contains the following three key contributions. (1) A graph construction algorithm to proposed to construct triangular meshes on over-segmentation maps, in order to exploit the color and the texture information for depth labeling. (2) Multiple depth cues are combined to define the local energies. Furthermore, the local energies are adapted to the local image content, in order to consider the varying nature of the image content for an accurate depth labeling. (3) Both the density of the graph nodes and the intervals of the depth labels are gradually refined in multiple labeling passes. By doing so, both the computational efficiency and the robustness of the depth labeling process are improved. The experimental results on real multiple-view videos show that the depth maps of for selected reference view are accurately reconstructed. Depth discontinuities are very well preserved.
Original languageEnglish
QualificationDoctor of Philosophy
Awarding Institution
  • Department of Electrical Engineering
Supervisors/Advisors
  • de With, Peter H.N., Promotor
  • Vandewalle, P., Copromotor, External person
Award date31 Oct 2011
Place of PublicationEindhoven
Publisher
Print ISBNs978-90-386-2739-7
DOIs
Publication statusPublished - 2011

Fingerprint

Labeling
Cameras
Textures
Factorization
Personal computers
Labels
Costs
Navigation
Visualization
Lighting
Color
Processing

Cite this

Li, P. (2011). Acquiring 3D scene information from 2D images. Eindhoven: Technische Universiteit Eindhoven. https://doi.org/10.6100/IR716683
Li, Ping. / Acquiring 3D scene information from 2D images. Eindhoven : Technische Universiteit Eindhoven, 2011. 164 p.
@phdthesis{ad6f818a7fcc40718a0b9337367e1603,
title = "Acquiring 3D scene information from 2D images",
abstract = "In recent years, people are becoming increasingly acquainted with 3D technologies such as 3DTV, 3D movies and 3D virtual navigation of city environments in their daily life. Commercial 3D movies are now commonly available for consumers. Virtual navigation of our living environment as used on a personal computer has become a reality due to well-known web-based geographic applications using advanced imaging technologies. To enable such 3D applications, many technological challenges such as 3D content creation, 3D displaying technology and 3D content transmission need to tackled and deployed at low cost. This thesis concentrates on the reconstruction of 3D scene information from multiple 2D images, aiming for an automatic and low-cost production of the 3D content. In this thesis, two multiple-view 3D reconstruction systems are proposed: a 3D modeling system for reconstructing the sparse 3D scene model from long video sequences captured with a hand-held consumer camcorder, and a depth reconstruction system for creating depth maps from multiple-view videos taken by multiple synchronized cameras. Both systems are designed to compute the 3D scene information in an automated way with minimum human interventions, in order to reduce the production cost of 3D contents. Experimental results on real videos of hundreds and thousands frames have shown that the two systems are able to accurately and automatically reconstruct the 3D scene information from 2D image data. The findings of this research are useful for emerging 3D applications such as 3D games, 3D visualization and 3D content production. Apart from designing and implementing the two proposed systems, we have developed three key scientific contributions to enable the two proposed 3D reconstruction systems. The first contribution is that we have designed a novel feature point matching algorithm that uses only a smoothness constraint for matching the points, which states that neighboring feature points in images tend to move with similar directions and magnitudes. The employed smoothness assumption is not only valid but also robust for most images with limited image motion, regardless of the camera motion and scene structure. Because of this, the algorithm obtains two major advan- 1 tages. First, the algorithm is robust to illumination changes, as the employed smoothness constraint does not rely on any texture information. Second, the algorithm has a good capability to handle the drift of the feature points over time, as the drift can hardly lead to a violation of the smoothness constraint. This leads to the large number of feature points matched and tracked by the proposed algorithm, which significantly helps the subsequent 3D modeling process. Our feature point matching algorithm is specifically designed for matching and tracking feature points in image/video sequences where the image motion is limited. Our extensive experimental results show that the proposed algorithm is able to track at least 2.5 times as many feature points compared with the state-of-the-art algorithms, with a comparable or higher accuracy. This contributes significantly to the robustness of the 3D reconstruction process. The second contribution is that we have developed algorithms to detect critical configurations where the factorization-based 3D reconstruction degenerates. Based on the detection, we have proposed a sequence-dividing algorithm to divide a long sequence into subsequences, such that successful 3D reconstructions can be performed on individual subsequences with a high confidence. The partial reconstructions are merged later to obtain the 3D model of the complete scene. In the critical configuration detection algorithm, the four critical configurations are detected: (1) coplanar 3D scene points, (2) pure camera rotation, (3) rotation around two camera centers, and (4) presence of excessive noise and outliers in the measurements. The configurations in cases (1), (2) and (4) will affect the rank of the Scaled Measurement Matrix (SMM). The number of camera centers in case (3) will affect the number of independent rows of the SMM. By examining the rank and the row space of the SMM, the abovementioned critical configurations are detected. Based on the detection results, the proposed sequence-dividing algorithm divides a long sequence into subsequences, such that each subsequence is free of the four critical configurations in order to obtain successful 3D reconstructions on individual subsequences. Experimental results on both synthetic and real sequences have demonstrated that the above four critical configurations are robustly detected, and a long sequence of thousands frames is automatically divided into subsequences, yielding successful 3D reconstructions. The proposed critical configuration detection and sequence-dividing algorithms provide an essential processing block for an automatical 3D reconstruction on long sequences. The third contribution is that we have proposed a coarse-to-fine multiple-view depth labeling algorithm to compute depth maps from multiple-view videos, where the accuracy of resulting depth maps is gradually refined in multiple optimization passes. In the proposed algorithm, multiple-view depth reconstruction is formulated as an image-based labeling problem using the framework of Maximum A Posterior (MAP) on Markov Random Fields (MRF). The MAP-MRF framework allows the combination of various objective and heuristic depth cues to define the local penalty and the interaction energies, which provides a straightforward and computationally tractable formulation. Furthermore, the global optimal MAP solution to depth labeli ing can be found by minimizing the local energies, using existing MRF optimization algorithms. The proposed algorithm contains the following three key contributions. (1) A graph construction algorithm to proposed to construct triangular meshes on over-segmentation maps, in order to exploit the color and the texture information for depth labeling. (2) Multiple depth cues are combined to define the local energies. Furthermore, the local energies are adapted to the local image content, in order to consider the varying nature of the image content for an accurate depth labeling. (3) Both the density of the graph nodes and the intervals of the depth labels are gradually refined in multiple labeling passes. By doing so, both the computational efficiency and the robustness of the depth labeling process are improved. The experimental results on real multiple-view videos show that the depth maps of for selected reference view are accurately reconstructed. Depth discontinuities are very well preserved.",
author = "Ping Li",
year = "2011",
doi = "10.6100/IR716683",
language = "English",
isbn = "978-90-386-2739-7",
publisher = "Technische Universiteit Eindhoven",
school = "Department of Electrical Engineering",

}

Li, P 2011, 'Acquiring 3D scene information from 2D images', Doctor of Philosophy, Department of Electrical Engineering, Eindhoven. https://doi.org/10.6100/IR716683

Acquiring 3D scene information from 2D images. / Li, Ping.

Eindhoven : Technische Universiteit Eindhoven, 2011. 164 p.

Research output: ThesisPhd Thesis 1 (Research TU/e / Graduation TU/e)

TY - THES

T1 - Acquiring 3D scene information from 2D images

AU - Li, Ping

PY - 2011

Y1 - 2011

N2 - In recent years, people are becoming increasingly acquainted with 3D technologies such as 3DTV, 3D movies and 3D virtual navigation of city environments in their daily life. Commercial 3D movies are now commonly available for consumers. Virtual navigation of our living environment as used on a personal computer has become a reality due to well-known web-based geographic applications using advanced imaging technologies. To enable such 3D applications, many technological challenges such as 3D content creation, 3D displaying technology and 3D content transmission need to tackled and deployed at low cost. This thesis concentrates on the reconstruction of 3D scene information from multiple 2D images, aiming for an automatic and low-cost production of the 3D content. In this thesis, two multiple-view 3D reconstruction systems are proposed: a 3D modeling system for reconstructing the sparse 3D scene model from long video sequences captured with a hand-held consumer camcorder, and a depth reconstruction system for creating depth maps from multiple-view videos taken by multiple synchronized cameras. Both systems are designed to compute the 3D scene information in an automated way with minimum human interventions, in order to reduce the production cost of 3D contents. Experimental results on real videos of hundreds and thousands frames have shown that the two systems are able to accurately and automatically reconstruct the 3D scene information from 2D image data. The findings of this research are useful for emerging 3D applications such as 3D games, 3D visualization and 3D content production. Apart from designing and implementing the two proposed systems, we have developed three key scientific contributions to enable the two proposed 3D reconstruction systems. The first contribution is that we have designed a novel feature point matching algorithm that uses only a smoothness constraint for matching the points, which states that neighboring feature points in images tend to move with similar directions and magnitudes. The employed smoothness assumption is not only valid but also robust for most images with limited image motion, regardless of the camera motion and scene structure. Because of this, the algorithm obtains two major advan- 1 tages. First, the algorithm is robust to illumination changes, as the employed smoothness constraint does not rely on any texture information. Second, the algorithm has a good capability to handle the drift of the feature points over time, as the drift can hardly lead to a violation of the smoothness constraint. This leads to the large number of feature points matched and tracked by the proposed algorithm, which significantly helps the subsequent 3D modeling process. Our feature point matching algorithm is specifically designed for matching and tracking feature points in image/video sequences where the image motion is limited. Our extensive experimental results show that the proposed algorithm is able to track at least 2.5 times as many feature points compared with the state-of-the-art algorithms, with a comparable or higher accuracy. This contributes significantly to the robustness of the 3D reconstruction process. The second contribution is that we have developed algorithms to detect critical configurations where the factorization-based 3D reconstruction degenerates. Based on the detection, we have proposed a sequence-dividing algorithm to divide a long sequence into subsequences, such that successful 3D reconstructions can be performed on individual subsequences with a high confidence. The partial reconstructions are merged later to obtain the 3D model of the complete scene. In the critical configuration detection algorithm, the four critical configurations are detected: (1) coplanar 3D scene points, (2) pure camera rotation, (3) rotation around two camera centers, and (4) presence of excessive noise and outliers in the measurements. The configurations in cases (1), (2) and (4) will affect the rank of the Scaled Measurement Matrix (SMM). The number of camera centers in case (3) will affect the number of independent rows of the SMM. By examining the rank and the row space of the SMM, the abovementioned critical configurations are detected. Based on the detection results, the proposed sequence-dividing algorithm divides a long sequence into subsequences, such that each subsequence is free of the four critical configurations in order to obtain successful 3D reconstructions on individual subsequences. Experimental results on both synthetic and real sequences have demonstrated that the above four critical configurations are robustly detected, and a long sequence of thousands frames is automatically divided into subsequences, yielding successful 3D reconstructions. The proposed critical configuration detection and sequence-dividing algorithms provide an essential processing block for an automatical 3D reconstruction on long sequences. The third contribution is that we have proposed a coarse-to-fine multiple-view depth labeling algorithm to compute depth maps from multiple-view videos, where the accuracy of resulting depth maps is gradually refined in multiple optimization passes. In the proposed algorithm, multiple-view depth reconstruction is formulated as an image-based labeling problem using the framework of Maximum A Posterior (MAP) on Markov Random Fields (MRF). The MAP-MRF framework allows the combination of various objective and heuristic depth cues to define the local penalty and the interaction energies, which provides a straightforward and computationally tractable formulation. Furthermore, the global optimal MAP solution to depth labeli ing can be found by minimizing the local energies, using existing MRF optimization algorithms. The proposed algorithm contains the following three key contributions. (1) A graph construction algorithm to proposed to construct triangular meshes on over-segmentation maps, in order to exploit the color and the texture information for depth labeling. (2) Multiple depth cues are combined to define the local energies. Furthermore, the local energies are adapted to the local image content, in order to consider the varying nature of the image content for an accurate depth labeling. (3) Both the density of the graph nodes and the intervals of the depth labels are gradually refined in multiple labeling passes. By doing so, both the computational efficiency and the robustness of the depth labeling process are improved. The experimental results on real multiple-view videos show that the depth maps of for selected reference view are accurately reconstructed. Depth discontinuities are very well preserved.

AB - In recent years, people are becoming increasingly acquainted with 3D technologies such as 3DTV, 3D movies and 3D virtual navigation of city environments in their daily life. Commercial 3D movies are now commonly available for consumers. Virtual navigation of our living environment as used on a personal computer has become a reality due to well-known web-based geographic applications using advanced imaging technologies. To enable such 3D applications, many technological challenges such as 3D content creation, 3D displaying technology and 3D content transmission need to tackled and deployed at low cost. This thesis concentrates on the reconstruction of 3D scene information from multiple 2D images, aiming for an automatic and low-cost production of the 3D content. In this thesis, two multiple-view 3D reconstruction systems are proposed: a 3D modeling system for reconstructing the sparse 3D scene model from long video sequences captured with a hand-held consumer camcorder, and a depth reconstruction system for creating depth maps from multiple-view videos taken by multiple synchronized cameras. Both systems are designed to compute the 3D scene information in an automated way with minimum human interventions, in order to reduce the production cost of 3D contents. Experimental results on real videos of hundreds and thousands frames have shown that the two systems are able to accurately and automatically reconstruct the 3D scene information from 2D image data. The findings of this research are useful for emerging 3D applications such as 3D games, 3D visualization and 3D content production. Apart from designing and implementing the two proposed systems, we have developed three key scientific contributions to enable the two proposed 3D reconstruction systems. The first contribution is that we have designed a novel feature point matching algorithm that uses only a smoothness constraint for matching the points, which states that neighboring feature points in images tend to move with similar directions and magnitudes. The employed smoothness assumption is not only valid but also robust for most images with limited image motion, regardless of the camera motion and scene structure. Because of this, the algorithm obtains two major advan- 1 tages. First, the algorithm is robust to illumination changes, as the employed smoothness constraint does not rely on any texture information. Second, the algorithm has a good capability to handle the drift of the feature points over time, as the drift can hardly lead to a violation of the smoothness constraint. This leads to the large number of feature points matched and tracked by the proposed algorithm, which significantly helps the subsequent 3D modeling process. Our feature point matching algorithm is specifically designed for matching and tracking feature points in image/video sequences where the image motion is limited. Our extensive experimental results show that the proposed algorithm is able to track at least 2.5 times as many feature points compared with the state-of-the-art algorithms, with a comparable or higher accuracy. This contributes significantly to the robustness of the 3D reconstruction process. The second contribution is that we have developed algorithms to detect critical configurations where the factorization-based 3D reconstruction degenerates. Based on the detection, we have proposed a sequence-dividing algorithm to divide a long sequence into subsequences, such that successful 3D reconstructions can be performed on individual subsequences with a high confidence. The partial reconstructions are merged later to obtain the 3D model of the complete scene. In the critical configuration detection algorithm, the four critical configurations are detected: (1) coplanar 3D scene points, (2) pure camera rotation, (3) rotation around two camera centers, and (4) presence of excessive noise and outliers in the measurements. The configurations in cases (1), (2) and (4) will affect the rank of the Scaled Measurement Matrix (SMM). The number of camera centers in case (3) will affect the number of independent rows of the SMM. By examining the rank and the row space of the SMM, the abovementioned critical configurations are detected. Based on the detection results, the proposed sequence-dividing algorithm divides a long sequence into subsequences, such that each subsequence is free of the four critical configurations in order to obtain successful 3D reconstructions on individual subsequences. Experimental results on both synthetic and real sequences have demonstrated that the above four critical configurations are robustly detected, and a long sequence of thousands frames is automatically divided into subsequences, yielding successful 3D reconstructions. The proposed critical configuration detection and sequence-dividing algorithms provide an essential processing block for an automatical 3D reconstruction on long sequences. The third contribution is that we have proposed a coarse-to-fine multiple-view depth labeling algorithm to compute depth maps from multiple-view videos, where the accuracy of resulting depth maps is gradually refined in multiple optimization passes. In the proposed algorithm, multiple-view depth reconstruction is formulated as an image-based labeling problem using the framework of Maximum A Posterior (MAP) on Markov Random Fields (MRF). The MAP-MRF framework allows the combination of various objective and heuristic depth cues to define the local penalty and the interaction energies, which provides a straightforward and computationally tractable formulation. Furthermore, the global optimal MAP solution to depth labeli ing can be found by minimizing the local energies, using existing MRF optimization algorithms. The proposed algorithm contains the following three key contributions. (1) A graph construction algorithm to proposed to construct triangular meshes on over-segmentation maps, in order to exploit the color and the texture information for depth labeling. (2) Multiple depth cues are combined to define the local energies. Furthermore, the local energies are adapted to the local image content, in order to consider the varying nature of the image content for an accurate depth labeling. (3) Both the density of the graph nodes and the intervals of the depth labels are gradually refined in multiple labeling passes. By doing so, both the computational efficiency and the robustness of the depth labeling process are improved. The experimental results on real multiple-view videos show that the depth maps of for selected reference view are accurately reconstructed. Depth discontinuities are very well preserved.

U2 - 10.6100/IR716683

DO - 10.6100/IR716683

M3 - Phd Thesis 1 (Research TU/e / Graduation TU/e)

SN - 978-90-386-2739-7

PB - Technische Universiteit Eindhoven

CY - Eindhoven

ER -

Li P. Acquiring 3D scene information from 2D images. Eindhoven: Technische Universiteit Eindhoven, 2011. 164 p. https://doi.org/10.6100/IR716683