Let’s explore image-to-video generation using CogVideoX’s 5B model. The process is straightforward: from the CogStudio web UI we have seen before, click the “image-to-video” tab at the top, upload an image, enter a prompt, then click the “Generate Video” button.
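If you prefer a script to the web UI, the same generation can be run with the Hugging Face diffusers pipeline. Below is a minimal sketch, assuming the THUDM/CogVideoX-5b-I2V checkpoint; the input image path and sampling parameters are my choices, so adjust them to your setup:

```python
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

# Load the image-to-video variant of the 5B model
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # helps the 5B model fit on a single consumer GPU

image = load_image("input.png")  # hypothetical path to your input image
video = pipe(
    prompt="Bear dancing inside the studio",
    image=image,
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]
export_to_video(video, "output.mp4", fps=8)
```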
Before we generate, note this tip from the CogVideoX authors:

> Since CogVideoX is trained on long texts, we need to transform the input text distribution to match the training data using an LLM.
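This “prompt upsampling” step can be automated. Here is a sketch of the idea using the OpenAI Python client; the model name and system instructions are my assumptions, not the official CogVideoX conversion script:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def upsample_prompt(short_prompt: str) -> str:
    """Expand a short prompt into a long caption closer to CogVideoX's training data."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical model choice
        messages=[
            {
                "role": "system",
                "content": (
                    "Rewrite the user's short video prompt as one detailed "
                    "paragraph describing the subject and its motion."
                ),
            },
            {"role": "user", "content": short_prompt},
        ],
    )
    return response.choices[0].message.content


print(upsample_prompt("Bear dancing inside the studio"))
```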
Okay, let’s do it. We start with the feature image from the previous post.

[Figure: feature image from the previous post]

This is our input for image-to-video generation, and I asked ChatGPT to produce a long text description as our prompt:
> In a cozy modern studio adorned with lush plants and colorful artwork, a cheerful animated bear with a warm smile and a bright yellow scarf dances joyfully against a stunning sunset skyline. The large windows frame the vibrant oranges and pinks of the setting sun, casting a warm glow across the room. As upbeat music fills the air, the bear captivates the scene with its lively movements, spinning and hopping to the rhythm. Its expressive face reflects pure delight, while its paws sway gracefully, tapping along to the catchy beat. As the dance reaches a crescendo, the bear gives a playful wave, radiating happiness and leaving a lasting impression of joy in this vibrant setting.
Here is our first image-to-video output:

[Video: output generated from the long prompt]

Well, it doesn’t look right. Let me use the original prompt to try again:
> Bear dancing inside the studio

[Video: output generated from the short prompt]

Hey, it looks more natural and better with the shorter prompt!
## The Chain-of-Thought on Image-to-Video Processing

At this point, let’s analyze how the model handles image-to-video generation. The CogVideoX authors suggested using a long text description as the prompt. We did, but it did not produce our desired outcome, while the shorter prompt was right on point. Okay, we have discovered something here:
*The longer text extended the prompt with unnecessary information.*

The prompt should focus on two things: the object in the image and the action of that object. A longer prompt should only elaborate on the object and the action. Let’s take another example; below is a photo of me taken at Himeji Castle.

[Figure: photo of the author at Himeji Castle]

I asked ChatGPT to extend the prompt “Man walks slowly along the corridor”, and it became:
> The man walks slowly, his hands in his pockets, reflecting on the history of the place, with a serene expression that invites viewers to share in his contemplation.
And we got:

[Video: output of the corridor-walking prompt]

It does a good job with the walking motion and the shifting background. But my face was morphed into another person’s, like an X-Men mutant power. We may assume that the CogVideoX model was trained predominantly on Western data.
So this time we use a public figure as an example, Elon Musk:

[Figure: photo of Elon Musk]

And our prompt, focusing on the object and the action, is:
> After a successful announcement, the character throws his arms up in a victorious pose, releasing a spectacular burst of flames that shoot into the air, igniting the excitement of the crowd.

[Video: output of the Elon Musk prompt]

The generated video looks smooth and shows no dramatic change in the subject’s face.
## Conclusion: Tips for Better Image-to-Video Outputs

Based on our experiments, we came up with the following tips for generating better image-to-video outputs:
1. **Focus on Object and Action:** ensure the model understands the primary elements to animate effectively
2. **Provide Detailed Description:** guide the model with descriptions that enrich the primary elements
3. **Avoid Unnecessary Information:** skip details like camera angles and background elements that can confuse the model
4. **Use Recognizable Objects for Clarity:** familiar items / public figures help the model understand and create relevant outputs
5. **Iterate and Adjust:** just like the rest of our machine learning journey, try-and-tune is how we get better results (see the sketch after this list)

By applying these tips, you can enhance the quality of your image-to-video outputs. Don’t hesitate to experiment with different prompts and approaches. Give it a try and see how your ideas come to life!
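For the last tip, one concrete way to iterate is to fix the random seed so that the prompt is the only variable between runs. A small sketch, reusing the hypothetical `pipe` and `image` from the earlier diffusers example:

```python
import torch
from diffusers.utils import export_to_video

# Assumes `pipe` and `image` from the earlier sketch are still in scope.
prompts = [
    "Bear dancing inside the studio",
    "Bear dancing inside the studio, spinning to upbeat music",
]
for i, prompt in enumerate(prompts):
    # Re-seed every run so differences between outputs come from the prompt alone
    generator = torch.Generator(device="cpu").manual_seed(42)
    frames = pipe(prompt=prompt, image=image, generator=generator).frames[0]
    export_to_video(frames, f"take_{i}.mp4", fps=8)
```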