While vanishing point (VP) estimation has received extensive attention, most approaches focus on static images or perform detection and tracking separately. In this paper, we focus on man-made environments and propose a novel method for jointly detecting and tracking groups of mutually orthogonal vanishing points (MOVPs), also known as Manhattan frames, in monocular videos. The method is unique in that it is designed to enforce orthogonality within each group of VPs, temporal consistency of each individual MOVP, and orientation consistency across all putative MOVPs. To this end, the method consists of three steps: 1) proposing MOVP candidates by directly incorporating mutual orthogonality; 2) extracting consistent tracks of MOVPs by minimizing the flow cost over a network whose nodes are putative MOVPs and whose edges are putative links across time; and 3) refining all MOVPs by enforcing consistency between lines and their identified vanishing directions, as well as consistency of the global camera orientation. The method is evaluated on six newly collected and annotated videos of urban scenes. Extensive experiments show that the method considerably outperforms a greedy MOVP tracking method. In addition, we test the method for camera orientation estimation and show that it obtains very promising results on a challenging street-view dataset.
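
To make the mutual-orthogonality constraint of step 1 concrete, the following minimal sketch shows how two vanishing points can be tested for orthogonality and completed to a full Manhattan frame. It is an illustration only and not the proposal procedure of the method itself: it assumes a calibrated pinhole camera with known focal length `f` and principal point `(cx, cy)`, and the function names `back_project`, `complete_frame`, and `project` are hypothetical.

```python
import math

def back_project(vp, f, cx, cy):
    """Map an image vanishing point (u, v) to a unit 3D direction.
    Assumes a pinhole camera with focal length f and principal point (cx, cy)."""
    x, y, z = (vp[0] - cx) / f, (vp[1] - cy) / f, 1.0
    n = math.sqrt(x * x + y * y + z * z)
    return (x / n, y / n, z / n)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def complete_frame(d1, d2, tol=1e-6):
    """Return the third direction of a Manhattan frame if d1 and d2 are
    orthogonal unit vectors (up to tol); otherwise return None."""
    if abs(dot(d1, d2)) > tol:
        return None
    return cross(d1, d2)

def project(d, f, cx, cy, eps=1e-9):
    """Project a 3D direction back to an image vanishing point.
    Returns None when the direction is parallel to the image plane,
    i.e. the vanishing point lies at infinity."""
    if abs(d[2]) < eps:
        return None
    return (f * d[0] / d[2] + cx, f * d[1] / d[2] + cy)

# Illustrative intrinsics and two horizontal vanishing points (assumed values).
f, cx, cy = 500.0, 320.0, 240.0
d1 = back_project((820.0, 240.0), f, cx, cy)
d2 = back_project((-180.0, 240.0), f, cx, cy)
d3 = complete_frame(d1, d2)
```

In this example the two vanishing points lie on the horizon at equal distances from the principal point, so the completed third direction is vertical and its vanishing point lies at infinity, for which `project` returns `None`. A pair of directions that violates orthogonality is rejected by `complete_frame`.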