Vox Populi : generating video documentaries from semantically annotated media repositories

S. Bocconi

    Research output: Thesis › Phd Thesis 2 (Research NOT TU/e / Graduation TU/e)


    The context of this research is one or more online video repositories containing several hours of documentary footage and users possibly interested only in particular topics of that material. In such a setting it is not possible to craft a single version containing all possible topics the user might like to see, unless including all the material, which is clearly not feasible. The main motivation for this research is, therefore, to enable an alternative authoring process for film makers to make all their material dynamically available to users, without having to edit a static final cut that would select out possibly informative footage.

    We propose a methodology to automatically organize video material in an edited video sequence with a rhetorical structure. This is enabled by defining an annotation schema for the material and a generation process with the following two requirements:
    • the data repository used by the generation process could be extended by simply adding annotated material to it
    • the final resulting structure of the video generation process would seem familiar to a video-literate user.

    The first requirement was satisfied by developing an annotation schema that explicitly identifies rhetorical elements in the video material, and a generation process that assembles longer sequences of video by manipulating the annotations in a bottom-up fashion. The second requirement was satisfied by modeling the generation process according to documentary making and general film theory techniques, in particular making the role of rhetoric in video documentaries explicit.

    A specific case study was carried out using material for video documentaries. These used an interview structure, where people are asked to make statements about subjective matters. This category is characterized by rich information encoded in the audio track and by the controversy of the different opinions expressed in the interviews.

    The approach was tested by implementing a system called Vox Populi that realizes a user-driven generation of rhetoric-based video sequences. Using the annotation schema, Vox Populi can be used to generate the story space and to allow the user to select and browse such a space. The user can specify the topic but also the characters of the rhetorical dialogue and the rhetoric form of the presentation.

    Presenting controversial topics can introduce some bias: Vox Populi tries to control this by modeling some rhetoric and film theory editing techniques that influence the bias and by allowing the user to select the point of view she wants the generated sequence to have.

    Overview

    We present a model to automatically generate documentaries and an implementation of it. We focus on matter-of-opinion documentaries based on interviews. Our model has the following characteristics, which are lacking in previous automatic generation approaches:
    • it allows the viewer to select the subject and the point of view of the documentary;
    • it allows the documentarist to add material to the repository without having to specify how this material should be presented (data-driven approach);
    • it generates documentaries according to presentation forms used by documentarists.

    This thesis answers the following research questions:

    RESEARCH QUESTION 1 (DOCUMENTARY FORM) What characteristics of the presentation forms used by documentaries on matter-of-opinion issues must be modeled?

    RESEARCH QUESTION 2 (ANNOTATION SCHEMA) What information should be captured in an annotation schema for an automatic video generation approach where:
    • the viewer can specify the subject and the point of view,
    • the documentarist can collect material to be used for documentaries, without having to specify how this material should be presented to the viewer,
    • the material is presented according to presentation forms used by documentarists?

    RESEARCH QUESTION 3 (GENERATION PROCESS) How must a generation process be defined for an automatic video generation approach where:
    • the viewer can specify the subject and the point of view,
    • the documentarist can collect material to be used for documentaries, without having to specify how this material should be presented to the viewer,
    • the material is presented according to presentation forms used by documentarists?

    RESEARCH QUESTION 4 (AUTHOR SUPPORT) How must a generation process be defined so that it can give to the documentarist an indication of the quality of the documentaries it can generate?

    Chapter 2

    To determine what needs to be modeled, we analyze the domain of documentaries and the process of documentary making. This analysis leads to the definition of HIGH-LEVEL REQUIREMENTS, which specify the presentation forms a documentary generation model can use, and how to edit video material into a correct (according to traditional film making) sequence. These high-level requirements provide an answer to Research Question Documentary Form [1].

    In more detail, these high-level requirements restate the first two bullet points, while the third one is further specified using an analysis of the domain. The requirements point out the presentation forms that can be used in documentaries, namely the narrative form (where the presentation of information is organized into stories), the categorical form (where the presentation of information is organized into categories) and the rhetorical form (where the presentation of information is organized according to points of view, positions and arguments). We consider two levels in a story: the level of the scene, called the micro-level, and the overall structure, called the macro-level. The narrative and categorical forms can be used at the macro-level, while the rhetorical form must be used at the micro-level. The rhetorical form is particularly relevant for our domain, namely matter-of-opinion documentaries. This form is composed of points of view (propagandist and binary communicator), which communicate positions (e.g. "war in Afghanistan - For"), which in turn are expressed by arguments. Arguments are based on logos, pathos and ethos techniques. The high-level requirements also specify that the model should implement a montage technique often used in documentaries to present interviews. This technique, called vox populi, consists of showing in a rapid sequence how interviewees answer related questions. To avoid misquoting an interviewee, the generation model is required to encode context information for the statements made during interviews. For the editing part, the analysis of the documentary-making process requires the generation model to include continuity editing rules as used in traditional film making.

    Chapter 3

    Having defined what aspects of the domain need to be modeled, we examine how related work has solved similar problems, and determine which existing technical solutions are feasible given the high-level requirements we set. This analysis leads to the definition of LOW-LEVEL REQUIREMENTS. These requirements are divided into two groups. The first group specifies what data structure can represent video material for the purpose of documentary generation. The second group determines the characteristics of a process that is capable of generating documentaries according to the high-level requirements.

    In more detail, the first group of requirements, concerning the annotations, specifies that the video material should be segmented into discrete units called clips. The description of the clips should capture connotative as well as denotative aspects of the video material, using property-based annotations and a controlled vocabulary. Arguments contained in interviews and based on logos should be encoded by an argument model, the model of Toulmin. Arguments based on pathos and ethos should be evaluated using a cognitive model, the OCC model. In addition to the OCC model, film theory provides another method to evaluate pathos, based on the cinematic properties of the clip, namely gaze direction and framing distance. The second group of requirements specifies that the generation process should dynamically create, using the annotations, a data structure (the Semantic Graph) that provides information about the argumentation relations (SUPPORTS and CONTRADICTS) among media items in the repository. Furthermore, based on argumentation theory the requirements define a means of composing arguments from single statements, such as rebuttals and undercutters, and specify that the categorical form should be used as the presentation form at the macro-level.

    Chapter 4

    Guided by the high-level requirements and the first group of low-level requirements, we examine the content of video to determine the characteristics of the information we need to model. Based on this analysis, we specify an annotation schema capable of encoding the rhetorical form and the categorical form, and the cinematic properties of video to support automatic editing. The definition of this annotation schema provides an answer to Research Question Annotation Schema [2].

    In more detail, two components of the rhetorical form are modeled, namely arguments and positions. Arguments based on logos are encoded by modeling verbal information contained in the auditory and visual channel. The arguments are modeled using three-part sentence-like descriptions of what an interviewee says, called statements, a thesaurus for the controlled vocabulary of terms used in the statements and the model of Toulmin for the role each statement plays in an argument. Arguments based on pathos are modeled using non-verbal information contained in the visual channel, by modeling the clip cinematic properties framing distance and gaze direction. Ethos is modeled based on the OCC model, by using verbal and non-verbal information to determine social categories an interviewee belongs to, such as gender, race, education level, and a user profile that values how important these categories are for the viewer. Positions are modeled as a subject and the interviewee’s attitude with respect to that subject, e.g. "war in Afghanistan - For". Further we define the categories to support the categorical form, namely categories related to interviews, such as question asked, location categories describing where the clip was shot, such as the geographical location, and temporal categories describing when the clip was shot, such as the time of the day. Finally, the cinematic properties of video are modeled to support the continuity rules, such as gaze direction for the gaze continuity rule and framing distance for the framing continuity rule (both properties are also required to calculate pathos).

    Chapter 5

    Having encoded the information needed to generate documentaries of the form specified by the high-level requirements, we describe a process capable of generating these documentaries. This generation process first creates the Semantic Graph, a data structure that establishes the argumentation relations among media items, then manipulates this structure to form arguments using video clips. The selected video clips are presented according to the rhetorical form and the categorical form, in a video sequence that also satisfies the continuity rules and the montage specified in the high-level requirements. The definition of this generation process provides an answer to Research Question Generation Process [3]. We then describe methods to provide the documentarist with a means of verifying the correctness of the annotations. The specification of these methods answers Research Question Author Support [4].

    In more detail, the generation process dynamically creates the Semantic Graph in two steps: the first one generates possible candidate targets for linking using the statements and the relations in the thesaurus. The second one verifies which of the possible targets is associated with media items present in the repository. The result of these two steps is a graph where the edges are the argumentation relations CONTRADICTS and SUPPORTS and the nodes correspond to media items. This structure is used to assemble arguments that show supporting or conflicting positions, using actions such as rebuttals or undercutters. Pathos and ethos are used to assess which side in a conflicting argument appears more convincing to the viewer. This allows the generation of video sequences that express a particular point of view, i.e. the propagandist where one side is more convincing than the other, or the binary communicator where both sides appear equally convincing. Selected video clips are then edited using rhetoric-driven editing such as shot-reverse shot and continuity rules such as framing continuity. The process then uses the categorical form to assemble more arguments together and form longer video sequences. The resulting generation process is driven by the viewer requests, as specified in the SUBJECT-POINT OF VIEW [HLR 2] requirement.

    The feedback methods aim at pinpointing where the annotations do not fully support the Semantic Graph creation. These methods are based on the definition of indexes that measure the performance of the two steps used to create the graph. The documentarist can also use these indexes for two other purposes: to suggest possible annotations to be specified in the thesaurus, and to fine-tune the process of graph creation.

    Chapter 6

    Having defined the model, we provide an implementation of it with a demonstrator called Vox Populi. Vox Populi’s architecture consists of a web user interface, through which the viewer specifies the subject and the point of view of the documentary, core functionality running inside a web server and a storage back-end for the annotations. The documentarist can create the repository using an annotations editor and video editing tools. Vox Populi has also been used in two other projects, the Visual Jockey and Passepartout - move.me projects. We further report the result of a technical evaluation and our own experiences in using the system as documentarists.

    Chapter 7

    In this chapter we present an overview of the thesis. Our research contributions are the high-level requirements in chapter 2, the low-level requirements in chapter 3, the automatic video generation model composed of the annotation schema in chapter 4 and of the generation process in chapter 5, and the implementation of the model in chapter 6. We then discuss general issues for an automatic video generation approach (the annotation effort required, the influence of the documentarist/annotator in the process, and the consequences deriving from the open-world assumption we make), as well as issues related to each research question. We conclude by examining future directions for our work, the most promising of which is the modeling of non-verbal information and its influence on pathos.
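
The two-step creation of the Semantic Graph summarized for Chapter 5 can be illustrated with a small Python sketch. It is not taken from the thesis or from Vox Populi: the `Statement` field names, the toy `THESAURUS`, the "similar"/"opposite" relation kinds, and the rule that an odd number of opposite substitutions yields CONTRADICTS are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Statement:
    """Three-part description of what an interviewee says (field names are assumed)."""
    subject: str
    modifier: str
    predicate: str


# Hypothetical thesaurus: term -> [(related term, relation kind)].
THESAURUS = {
    "war": [("military action", "similar"), ("diplomacy", "opposite")],
    "effective": [("useful", "similar"), ("useless", "opposite")],
}


def related_terms(term):
    """The term itself plus its thesaurus neighbours, each tagged with a kind."""
    return [(term, "same")] + THESAURUS.get(term, [])


def candidate_targets(stmt):
    """Step 1: derive candidate statements and the relation each would have to stmt.

    Assumption: an odd number of 'opposite' substitutions means CONTRADICTS,
    anything else means SUPPORTS.
    """
    candidates = {}
    parts = (stmt.subject, stmt.modifier, stmt.predicate)
    for combo in product(*(related_terms(p) for p in parts)):
        terms = [term for term, _ in combo]
        flips = sum(1 for _, kind in combo if kind == "opposite")
        relation = "CONTRADICTS" if flips % 2 else "SUPPORTS"
        candidates[Statement(*terms)] = relation
    return candidates


def build_semantic_graph(repository):
    """Step 2: keep only candidates that annotate a clip present in the repository.

    repository maps a clip id to its Statement; the result is a set of
    directed edges (source clip, relation, target clip).
    """
    clips_by_statement = {}
    for clip_id, stmt in repository.items():
        clips_by_statement.setdefault(stmt, []).append(clip_id)

    edges = set()
    for clip_id, stmt in repository.items():
        for candidate, relation in candidate_targets(stmt).items():
            for target in clips_by_statement.get(candidate, []):
                if target != clip_id:
                    edges.add((clip_id, relation, target))
    return edges


repository = {
    "clip1": Statement("war", "is", "effective"),
    "clip2": Statement("war", "is", "useless"),
    "clip3": Statement("military action", "is", "useful"),
}

for edge in sorted(build_semantic_graph(repository)):
    print(edge)
```

In this toy repository only `clip1` gets outgoing edges, because the hypothetical thesaurus lists relations in one direction only; a thesaurus with symmetric relations would produce edges in both directions.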
    Original language: English
    Qualification: Doctor of Philosophy
    Awarding Institution
    • Mathematics and Computer Science
    • Hardman, Lynda, Promotor
    • Nack, F.-M., Copromotor, External person
    Award date: 30 Nov 2006
    Place of Publication: Eindhoven
    Print ISBNs: 90-386-0824-1
    Publication status: Published - 2006

