This paper investigates definitions of and methods for developing multimodal spoken corpora. Facial expressions, hand gestures, body gestures, and head movements provide information essential to understanding spoken discourse, just as phonological, morphological, syntactic, and semantic cues do. For computer-based interpretation of human face-to-face dialogue, it is therefore important to construct corpora that store this variety of multimodal information. We examine how multimodality has been defined and analyzed in previous studies and propose a definition of the multimodal spoken corpus. We also discuss issues to keep in mind when creating and analyzing such a corpus. Finally, we describe how to transcribe and analyze multimodal data using the ELAN annotation tool.
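As a minimal sketch of the kind of analysis an ELAN-based corpus supports, the Python snippet below reads tier names and time-aligned annotations from a hypothetical ELAN file (dialogue.eaf). The file name and the tiers it contains are assumptions for illustration; the code relies only on the documented EAF XML layout, in which a TIME_ORDER block stores timestamps and each TIER groups time-aligned annotations such as speech, gesture, or gaze.

```python
import xml.etree.ElementTree as ET

# Hypothetical corpus file; any .eaf file saved by ELAN has this XML layout.
tree = ET.parse("dialogue.eaf")
root = tree.getroot()

# EAF stores all timestamps once, in a TIME_ORDER block; annotations refer
# to them by TIME_SLOT_ID. TIME_VALUE may be absent for unaligned slots.
time_slots = {
    ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE", 0))
    for ts in root.iter("TIME_SLOT")
}

# Each TIER holds one annotation layer (e.g. utterances, hand gestures,
# head movements); ALIGNABLE_ANNOTATIONs carry the start/end slot references.
for tier in root.iter("TIER"):
    tier_id = tier.get("TIER_ID")
    for ann in tier.iter("ALIGNABLE_ANNOTATION"):
        start = time_slots[ann.get("TIME_SLOT_REF1")]
        end = time_slots[ann.get("TIME_SLOT_REF2")]
        value = ann.findtext("ANNOTATION_VALUE", default="")
        print(f"{tier_id}\t{start}-{end} ms\t{value}")
```

Extracting annotations per tier in this way makes it straightforward to align, for example, a gesture tier against the corresponding speech tier by overlapping time intervals.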