VideoPoet is a video generation model developed by Google Research that uses a large language model to produce high-quality videos in a zero-shot manner. Its modeling approach can turn any autoregressive large language model (LLM) into a capable video generator.
By leveraging the capabilities of language models, VideoPoet can generate videos from text prompts describing subjects and scenarios it was never explicitly trained on. This zero-shot ability means it can handle a wide range of topics, including combinations of styles and actions that do not appear in its training data.
The strength of VideoPoet lies in its ability to integrate multiple modalities: text, images, audio, and video. It employs a pre-trained MAGVIT-v2 video tokenizer and a SoundStream audio tokenizer to transform videos, images, and audio clips into sequences of discrete codes. The autoregressive language model then predicts the next video or audio token in the sequence, and the predicted tokens are decoded back into realistic, temporally coherent video.
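The tokenize-then-predict loop described above can be sketched in a few lines. Everything below is a toy illustration, not VideoPoet's actual code: `tokenize_video` stands in for a MAGVIT-v2-style tokenizer, `next_token` stands in for the LLM (a real model outputs a learned distribution over the codebook; here a seeded random draw keeps the sketch runnable), and the codebook size is an arbitrary placeholder.

```python
import random

VOCAB_SIZE = 8192  # toy codebook size; the real tokenizer learns its own codebook


def tokenize_video(frames):
    """Stand-in for a MAGVIT-v2-style tokenizer: map raw frames to discrete codes.

    Here we simply hash each frame's pixel tuple into the codebook range.
    """
    return [hash(tuple(frame)) % VOCAB_SIZE for frame in frames]


def next_token(prefix):
    """Stand-in for the autoregressive LLM: predict the next code from the prefix.

    A real model conditions on the full token sequence; this toy version
    just derives a deterministic pseudo-random token from it.
    """
    rng = random.Random(sum(prefix))
    return rng.randrange(VOCAB_SIZE)


def generate(prompt_tokens, n_new):
    """Autoregressively extend the token sequence one code at a time."""
    seq = list(prompt_tokens)
    for _ in range(n_new):
        seq.append(next_token(seq))
    return seq


# Tokenize two toy "frames" (RGB triples), then generate four new video tokens.
prompt = tokenize_video([(0, 0, 0), (255, 255, 255)])
out = generate(prompt, 4)
```

In the real system the generated tokens would be handed back to the video tokenizer's decoder to render frames; the essential point is that video generation reduces to next-token prediction over a discrete vocabulary, exactly like text generation.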
To showcase the capabilities of VideoPoet, Google Research produced a short movie composed of clips generated by the model. Given a series of text prompts, the model brought to life a story about a traveling raccoon; the generated clips were stitched together into a visually engaging short film.
For more information about VideoPoet and to see additional examples, you can visit Google Research - VideoPoet.