Appearance-based approaches to object recognition mostly rely on measuring the visual similarity of objects using global or local descriptors. They have shown great success in object identification but often fail to generalize to the more challenging case of object categorization, where category membership is decided not only at the level of appearance, but also at a semantic level. It has been argued that model-based approaches are better suited to this problem, since they allow high-level knowledge to be injected, for example about the constituent object parts and their possible configurations. Postulating a set of object parts is problematic, though, since there is no guarantee that those parts can be reliably extracted from real-world images. What is needed is a middle layer that forms an interface between the visual information readily available from the image and the higher-level semantic information that can be used by reasoning processes. In this work, we investigate how such an interface can be learned. Because the appearance of object parts may vary considerably, this cannot be achieved by relying on visual similarity alone. Rather, this paper proposes to also use co-location and co-activation, together with weak top-down constraints such as alignment, as guiding principles for learning the appearance of local object parts. The learned structures generalize beyond the appearance of single objects and often correspond to semantically plausible object parts, such as the wheels, trunks, or windshields of cars. In a later stage, a Bayesian network over these extracted structures is used to successfully verify object hypotheses in difficult scenes.