Human vision is marvelous in obtaining a structured representation of complex dynamic scenes, such as spatial scene-layout, re-organization of the scene into its constituent objects, support of each object, etc. We also see the complete extent of the scene, even parts which are occluded. For example, even when part of the scene directly below a car is not visible, we infer that it is a part of road. This kind of structured and complete 3D scene understanding is very useful for several applications like autonomous driving. Our objective is to build a 3D scene representation of complex, real-world urban scenes from images alone much like the capabilities of human vision. The classic top-down "analysis-by-synthesis" approach offers an elegant account for such richness in human vision, but is computationally expensive and the resulting energy landscape is highly multi-modal and thus difficult to optimize. Combining top-down analysis with fast, discriminatively trained bottom-up predictors offers to solve this problem. However even recent versions of this hybrid approach are still restricted to toy problems. We revisit analysis-by-synthesis approach for complex real-world 3D scene understanding in light of advances in deep-learning methods, andavailability of large-scale training data in the form of annotated images and 3D CAD models. In this thesis, we explore three different scene understanding frameworks with increasing richness in representation. The presented frameworks reasons jointly about the scene structure, their semantic labels along with 3D orientation and position of object instances over time. We also demonstrate seamless integration of different constraints and prior knowledge into our model and an effective fusion of measurements from multiple images in a video into a final representation of the scene. We evaluate these scene understanding frameworks on challenging real-world datasets of complex urban scenes.