The context of this research is one or more online video repositories containing several hours of documentary footage, and users who may be interested only in particular topics of that material. In such a setting it is not possible to craft a single version containing every topic a user might want to see, short of including all the material, which is clearly not feasible. The main motivation for this research is therefore to enable an alternative authoring process that lets film makers make all their material dynamically available to users, without having to edit a static final cut that would leave out potentially informative footage.

We propose a methodology to automatically organize video material into an edited video sequence with a rhetorical structure. This is enabled by defining an annotation schema for the material and a generation process with the following two requirements:
• the data repository used by the generation process can be extended by simply adding annotated material to it;
• the structure resulting from the video generation process seems familiar to a video-literate user.

The first requirement was satisfied by developing an annotation schema that explicitly identifies rhetorical elements in the video material, and a generation process that assembles longer sequences of video by manipulating the annotations in a bottom-up fashion. The second requirement was satisfied by modeling the generation process according to documentary-making and general film theory techniques, in particular by making the role of rhetoric in video documentaries explicit.

A specific case study was carried out using material for video documentaries. These used an interview structure, where people are asked to make statements about subjective matters. This category is characterized by rich information encoded in the audio track and by the controversy among the different opinions expressed in the interviews.
The approach was tested by implementing a system called Vox Populi that realizes user-driven generation of rhetoric-based video sequences. Using the annotation schema, Vox Populi can generate the story space and allow the user to select and browse that space. The user can specify not only the topic but also the characters of the rhetorical dialogue and the rhetorical form of the presentation. Presenting controversial topics can introduce some bias: Vox Populi tries to control this by modeling some rhetoric and film theory editing techniques that influence the bias, and by allowing the user to select the point of view she wants the generated sequence to have.

SUMMARY

Overview
We present a model to automatically generate documentaries and an implementation of it. We focus on matter-of-opinion documentaries based on interviews. Our model has the following characteristics, which are lacking in previous automatic generation approaches:
• it allows the viewer to select the subject and the point of view of the documentary;
• it allows the documentarist to add material to the repository without having to specify how this material should be presented (data-driven approach);
• it generates documentaries according to presentation forms used by documentarists.

This thesis answers the following research questions:

RESEARCH QUESTION 1 (DOCUMENTARY FORM) What characteristics of the presentation forms used by documentaries on matter-of-opinion issues must be modeled?

RESEARCH QUESTION 2 (ANNOTATION SCHEMA) What information should be captured in an annotation schema for an automatic video generation approach where:
• the viewer can specify the subject and the point of view,
• the documentarist can collect material to be used for documentaries, without having to specify how this material should be presented to the viewer,
• the material is presented according to presentation forms used by documentarists?
RESEARCH QUESTION 3 (GENERATION PROCESS) How must a generation process be defined for an automatic video generation approach where:
• the viewer can specify the subject and the point of view,
• the documentarist can collect material to be used for documentaries, without having to specify how this material should be presented to the viewer,
• the material is presented according to presentation forms used by documentarists?

RESEARCH QUESTION 4 (AUTHOR SUPPORT) How must a generation process be defined so that it can give the documentarist an indication of the quality of the documentaries it can generate?

Chapter 2
To determine what needs to be modeled, we analyze the domain of documentaries and the process of documentary making. This analysis leads to the definition of HIGH-LEVEL REQUIREMENTS, which specify the presentation forms a documentary generation model can use, and how to edit video material into a sequence that is correct according to traditional film making. These high-level requirements provide an answer to Research Question 1 (Documentary Form).

In more detail, these high-level requirements restate the first two bullet points, while the third one is further specified using an analysis of the domain. The requirements identify the presentation forms that can be used in documentaries, namely the narrative form (where the presentation of information is organized into stories), the categorical form (where the presentation of information is organized into categories) and the rhetorical form (where the presentation of information is organized according to points of view, positions and arguments). We consider two levels in a story: the level of the scene, called the micro-level, and the overall structure, called the macro-level. The narrative and categorical forms can be used at the macro-level, while the rhetorical form must be used at the micro-level. The rhetorical form is particularly relevant for our domain, namely matter-of-opinion documentaries.
This form is composed of points of view (propagandist and binary communicator), which communicate positions (e.g. "war in Afghanistan - For"), which in turn are expressed by arguments. Arguments are based on logos, pathos and ethos techniques. The high-level requirements also specify that the model should implement a montage technique often used in documentaries to present interviews. This technique, called vox populi, consists of showing in rapid sequence how interviewees answer related questions. To avoid misquoting an interviewee, the generation model is required to encode context information for the statements made during interviews. For the editing part, the analysis of the documentary-making process requires the generation model to include continuity editing rules as used in traditional film making.

Chapter 3
Having defined what aspects of the domain need to be modeled, we examine how related work has solved similar problems, and determine which existing technical solutions are feasible given the high-level requirements we set. This analysis leads to the definition of LOW-LEVEL REQUIREMENTS. These requirements are divided into two groups. The first group specifies what data structure can represent video material for the purpose of documentary generation. The second group determines the characteristics of a process that is capable of generating documentaries according to the high-level requirements.

In more detail, the first group of requirements, concerning the annotations, specifies that the video material should be segmented into discrete units called clips. The description of the clips should capture connotative as well as denotative aspects of the video material, using property-based annotations and a controlled vocabulary. Arguments contained in interviews and based on logos should be encoded by an argument model, the model of Toulmin. Arguments based on pathos and ethos should be evaluated using a cognitive model, the OCC model.
In addition to the OCC model, film theory provides another method to evaluate pathos, based on the cinematic properties of the clip, namely gaze direction and framing distance. The second group of requirements specifies that the generation process should dynamically create, using the annotations, a data structure (the Semantic Graph) that provides information about the argumentation relations (SUPPORTS and CONTRADICTS) among media items in the repository. Furthermore, based on argumentation theory, the requirements define means of composing arguments from single statements, such as rebuttals and undercutters, and specify that the categorical form should be used as the presentation form at the macro-level.

Chapter 4
Guided by the high-level requirements and the first group of low-level requirements, we examine the content of video to determine the characteristics of the information we need to model. Based on this analysis, we specify an annotation schema capable of encoding the rhetorical form, the categorical form, and the cinematic properties of video needed to support automatic editing. The definition of this annotation schema provides an answer to Research Question 2 (Annotation Schema).

In more detail, two components of the rhetorical form are modeled, namely arguments and positions. Arguments based on logos are encoded by modeling verbal information contained in the auditory and visual channels. The arguments are modeled using three-part sentence-like descriptions of what an interviewee says, called statements, a thesaurus for the controlled vocabulary of terms used in the statements, and the model of Toulmin for the role each statement plays in an argument. Arguments based on pathos are modeled using non-verbal information contained in the visual channel, by modeling two cinematic properties of the clip: framing distance and gaze direction.
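As a rough illustration of the annotation elements introduced so far, the sketch below represents one annotated clip in Python. It is not the schema's actual encoding: the class and field names, the subject/modifier/predicate naming of the three statement parts, the example vocabulary, and the example values for framing distance and gaze direction are all assumptions made for illustration. The role names follow the six parts of Toulmin's model as it is commonly presented.

```python
from dataclasses import dataclass
from enum import Enum


class ToulminRole(Enum):
    """Role a statement plays in an argument (Toulmin's six parts)."""
    CLAIM = "claim"
    DATA = "data"
    WARRANT = "warrant"
    BACKING = "backing"
    QUALIFIER = "qualifier"
    REBUTTAL = "rebuttal"


@dataclass(frozen=True)
class Statement:
    """Three-part description of what an interviewee says.

    The names of the three parts are hypothetical.
    """
    subject: str
    modifier: str
    predicate: str
    role: ToulminRole

    def terms(self):
        return (self.subject, self.modifier, self.predicate)


@dataclass(frozen=True)
class Clip:
    """One annotated video segment (field names are hypothetical)."""
    clip_id: str
    interviewee: str
    statements: tuple      # tuple of Statement
    framing_distance: str  # assumed values, e.g. "close-up", "medium", "long"
    gaze_direction: str    # assumed values, e.g. "left", "right", "center"


def unknown_terms(clip, vocabulary):
    """Return the statement terms that are not in the controlled vocabulary."""
    return {t for s in clip.statements for t in s.terms() if t not in vocabulary}


# Hypothetical controlled vocabulary and one example annotation.
VOCABULARY = {"war", "not", "solution", "diplomacy", "is", "effective"}

clip = Clip(
    clip_id="clip-001",
    interviewee="interviewee-A",
    statements=(
        Statement("war", "not", "solution", ToulminRole.CLAIM),
        Statement("diplomacy", "is", "cheaper", ToulminRole.DATA),
    ),
    framing_distance="close-up",
    gaze_direction="left",
)

# "cheaper" is not in the vocabulary, so it is reported for the annotator.
missing = unknown_terms(clip, VOCABULARY)
```

Checking statement terms against the vocabulary is one simple way a controlled vocabulary can be enforced at annotation time; the schema itself may enforce it differently.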
Ethos is modeled based on the OCC model, by using verbal and non-verbal information to determine the social categories an interviewee belongs to, such as gender, race and education level, together with a user profile that weights how important these categories are for the viewer. Positions are modeled as a subject and the interviewee’s attitude with respect to that subject, e.g. "war in Afghanistan - For". Further, we define the categories that support the categorical form: categories related to interviews, such as the question asked; location categories describing where the clip was shot, such as the geographical location; and temporal categories describing when the clip was shot, such as the time of day. Finally, the cinematic properties of video are modeled to support the continuity rules, such as gaze direction for the gaze continuity rule and framing distance for the framing continuity rule (both properties are also required to calculate pathos).

Chapter 5
Having encoded the information needed to generate documentaries of the form specified by the high-level requirements, we describe a process capable of generating these documentaries. This generation process first creates the Semantic Graph, a data structure that establishes the argumentation relations among media items, then manipulates this structure to form arguments using video clips. The selected video clips are presented according to the rhetorical form and the categorical form, in a video sequence that also satisfies the continuity rules and the montage specified in the high-level requirements. The definition of this generation process provides an answer to Research Question 3 (Generation Process). We then describe methods that give the documentarist a means of verifying the correctness of the annotations. The specification of these methods answers Research Question 4 (Author Support).
In more detail, the generation process dynamically creates the Semantic Graph in two steps. The first step generates possible candidate targets for linking, using the statements and the relations in the thesaurus. The second step verifies which of the possible targets are associated with media items present in the repository. The result of these two steps is a graph where the edges are the argumentation relations CONTRADICTS and SUPPORTS and the nodes correspond to media items. This structure is used to assemble arguments that show supporting or conflicting positions, using actions such as rebuttals or undercutters. Pathos and ethos are used to assess which side in a conflicting argument appears more convincing to the viewer. This allows the generation of video sequences that express a particular point of view, i.e. the propagandist, where one side is more convincing than the other, or the binary communicator, where both sides appear equally convincing. Selected video clips are then edited using rhetoric-driven editing, such as shot-reverse shot, and continuity rules, such as framing continuity. The process then uses the categorical form to assemble more arguments together and form longer video sequences. The resulting generation process is driven by the viewer's requests, as specified in the SUBJECT-POINT OF VIEW [HLR 2] requirement.

The feedback methods aim at pinpointing where the annotations do not fully support the creation of the Semantic Graph. These methods are based on the definition of indexes that measure the performance of the two steps used to create the graph. The documentarist can also use these indexes for two other purposes: to suggest possible annotations to be specified in the thesaurus, and to fine-tune the process of graph creation.

Chapter 6
Having defined the model, we provide an implementation of it with a demonstrator called Vox Populi.
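Before turning to the implementation, the two-step creation of the Semantic Graph summarized for Chapter 5 can be illustrated with a small Python sketch. It is not Vox Populi's code. It assumes that statements are plain three-term tuples, that the thesaurus relates terms as either "similar" or "opposite", and that substituting a similar term yields a SUPPORTS candidate while substituting an opposite term yields a CONTRADICTS candidate. The function names, the thesaurus contents and the clip identifiers are all hypothetical.

```python
from collections import defaultdict

SUPPORTS, CONTRADICTS = "SUPPORTS", "CONTRADICTS"

# Hypothetical thesaurus: term -> [(related term, relation kind)].
THESAURUS = {
    "useless": [("ineffective", "similar"), ("effective", "opposite")],
    "ineffective": [("useless", "similar"), ("effective", "opposite")],
    "effective": [("useless", "opposite"), ("ineffective", "opposite")],
}

# Hypothetical repository: clip id -> statements annotated on that clip.
REPOSITORY = {
    "clip1": [("war", "is", "useless")],
    "clip2": [("war", "is", "effective")],
    "clip3": [("war", "is", "ineffective")],
}


def candidate_targets(statement, thesaurus):
    """Step 1: derive candidate target statements by swapping one term
    for a related thesaurus term, tagging each with the implied relation."""
    candidates = []
    for i, term in enumerate(statement):
        for related, kind in thesaurus.get(term, []):
            target = statement[:i] + (related,) + statement[i + 1:]
            relation = SUPPORTS if kind == "similar" else CONTRADICTS
            candidates.append((target, relation))
    return candidates


def build_semantic_graph(repository, thesaurus):
    """Step 2: keep only candidates that are annotated on some clip in the
    repository, and return the edges (source clip, relation, target clip)."""
    clips_by_statement = defaultdict(set)
    for clip_id, statements in repository.items():
        for statement in statements:
            clips_by_statement[statement].add(clip_id)

    edges = set()
    for clip_id, statements in repository.items():
        for statement in statements:
            for target, relation in candidate_targets(statement, thesaurus):
                for other in clips_by_statement.get(target, ()):
                    if other != clip_id:
                        edges.add((clip_id, relation, other))
    return edges


graph = build_semantic_graph(REPOSITORY, THESAURUS)
for edge in sorted(graph):
    print(edge)
```

In this sketch, candidates whose statement appears on no clip are simply dropped. Counting how many candidates are generated and how many survive the repository check is one plausible basis for the kind of feedback indexes mentioned above, although the definitions used in the thesis may differ.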
Vox Populi’s architecture consists of a web user interface, through which the viewer specifies the subject and the point of view of the documentary, core functionality running inside a web server, and a storage back-end for the annotations. The documentarist can create the repository using an annotation editor and video editing tools. Vox Populi has also been used in two other projects, the Visual Jockey and Passepartout - move.me projects. We further report the results of a technical evaluation and our own experiences in using the system as documentarists.

Chapter 7
In this chapter we present an overview of the thesis. Our research contributions are the high-level requirements in chapter 2, the low-level requirements in chapter 3, the automatic video generation model, composed of the annotation schema in chapter 4 and the generation process in chapter 5, and the implementation of the model in chapter 6. We then discuss general issues for an automatic video generation approach (the annotation effort required, the influence of the documentarist/annotator on the process, and the consequences of the open-world assumption we make), as well as issues related to each research question. We conclude by examining future directions for our work, the most promising of which is the modeling of non-verbal information and its influence on pathos.
|Qualification||Doctor of Philosophy|
|Award date||30 Nov 2006|
|Place of Publication||Eindhoven|
|Publication status||Published - 2006|