In our last post, we explored how to generate images with FLUX. This time, we take it a step further by using generative AI to generate videos. The most popular generative video model nowadays is Sora from OpenAI, but it is not freely available. Another popular choice is Dream Machine from Luma AI, which is free for anyone to try. But just like what we did in the past, we prefer solutions that are open source and self-hosted. We want everything under our control, without worrying about usage limits. That brings us to CogVideoX, an open-source text-to-video and image-to-video model.
Text and Image to Video
If you have not tried text or image to video yet, visit the Dream Machine website and try it for free. There are two major types of video generation, and in this post we will focus on image to video.
Text to video is similar to what we tried in FLUX: we type the prompts and get the expected results, except the generated output is a video instead of an image. Image to video, on the other hand, generates a video based on the prompts *plus* the image content. Interestingly, text to video is often considered more challenging than image to video, because the model needs to understand the text prompt and generate output that matches the narrative content. But that is the consideration from a machine's perspective. From a human perspective, image to video is way harder than text to video, because we inherently have specific expectations based on the images we provide. Having already seen the images, we expect more than just the text description from the prompt.
Use of CogVideoX in Easy Mode
Okay, back to the post topic, CogVideoX. Among the few generative AI models that can produce videos, CogVideoX stands out as an open-source option. To get started, go to its GitHub page and clone the project. After that, we can use its Python code sample to generate a video, or use its Gradio web UI to do the same.
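For reference, the Python route can also go through the Hugging Face diffusers library. The snippet below is a minimal sketch, not the repository's exact sample script; it assumes the `CogVideoXPipeline` API and the `THUDM/CogVideoX-5b` checkpoint, and the frame count and guidance scale are placeholder values you may need to tune for your hardware.

```python
# Minimal text-to-video sketch with diffusers (assumed API and checkpoint;
# check the CogVideoX GitHub page for the officially supported sample).
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",          # assumed checkpoint name
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()    # helps fit the 5B model on consumer GPUs

frames = pipe(
    prompt="A cheerful animated bear dancing inside a cozy studio at sunset",
    num_frames=49,                 # placeholder clip length
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]

export_to_video(frames, "output.mp4", fps=8)
```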
For absolute beginners, we suggest using CogStudio. It is a Gradio web UI just like the one mentioned earlier, but it provides more functions and, most importantly, a one-click install. With a single click, it installs CogVideoX together with the CogStudio web UI. That's it! Once the installation is finished, we can run CogStudio directly in a web browser.

Look similar? Yes, our favorite FLUX Forge is also built on a Gradio web UI.
CogVideoX Image to Video Experiment
Let’s explore image to video generation using CogVideoX’s 5B model. The process is straightforward: from the CogStudio web UI we saw before, click the “image-to-video” tab at the top, upload an image, enter the prompt, and then click the “Generate Video” button.
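If you prefer scripting over the web UI, the same step can be done in Python. The following is only a minimal sketch, assuming the diffusers `CogVideoXImageToVideoPipeline` and the `THUDM/CogVideoX-5b-I2V` checkpoint; the file names and parameters are placeholders.

```python
# Minimal image-to-video sketch with diffusers (assumed API and checkpoint).
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V",      # assumed image-to-video checkpoint
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()

image = load_image("bear.png")     # placeholder input image
frames = pipe(
    prompt="Bear dancing inside the studio",
    image=image,
    num_frames=49,
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]

export_to_video(frames, "bear_dancing.mp4", fps=8)
```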
According to CogVideoX’s GitHub page:
Since CogVideoX is trained on long texts, we need to transform the input text distribution to match the training data using an LLM.
Okay, let’s do it. We start with the featured image from the previous post.

This is our input for image to video, and I asked ChatGPT to generate a long text description as our prompt:
In a cozy modern studio adorned with lush plants and colorful artwork, a cheerful animated bear with a warm smile and a bright yellow scarf dances joyfully against a stunning sunset skyline. The large windows frame the vibrant oranges and pinks of the setting sun, casting a warm glow across the room. As upbeat music fills the air, the bear captivates the scene with its lively movements, spinning and hopping to the rhythm. Its expressive face reflects pure delight, while its paws sway gracefully, tapping along to the catchy beat. As the dance reaches a crescendo, the bear gives a playful wave, radiating happiness and leaving a lasting impression of joy in this vibrant setting.
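This kind of prompt expansion can also be automated. Below is a minimal sketch using the OpenAI Python client; the model name and instruction text are placeholder assumptions, not what the CogVideoX authors ship, and any chat LLM could be swapped in.

```python
# Sketch: expand a short video prompt into a long description with an LLM
# (assumed model name and instruction; adapt to your LLM of choice).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

short_prompt = "Bear dancing inside the studio"
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {
            "role": "system",
            "content": (
                "Expand the user's short video prompt into one detailed "
                "paragraph describing the scene, subject, and motion."
            ),
        },
        {"role": "user", "content": short_prompt},
    ],
)
long_prompt = response.choices[0].message.content
print(long_prompt)
```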
Here is our first image to video output:
Well, it doesn’t look right. Let me try again with the original short prompt:
Bear dancing inside the studio
Hey, it looks more natural and better with a shorter prompt!
The Chain-of-Thought on Image to Video Processing
At this point, let’s analyze how the model logically processes image to video generation. The CogVideoX authors suggested using a long text description as the prompt. We did that, but it did not give our desired outcome, while the shorter prompt was right on point. Okay, we have discovered something here.
The longer text extended the prompt with unnecessary information.
The prompt should focus on two things: the object in the image and the action of that object. A longer prompt should only elaborate on the object and the action. Let’s take another example; below is a photo of me taken at Himeji Castle.

I asked ChatGPT to extend the prompt, “Man walks slowly along the corridor”, and it becomes:
The man walks slowly, his hands in his pockets, reflecting on the history of the place, with a serene expression that invites viewers to share in his contemplation.
And we got:
It looks good for creating the moving action and background shifting. But my face shifted into another person’s, like an X-Men mutant power. We may assume that the CogVideoX model was trained heavily on Western data.
So this time we use a public figure as an example, Elon Musk:

And our prompt that focuses on the object and the action is:
After a successful announcement, the character throws his arms up in a victorious pose, releasing a spectacular burst of flames that shoot into the air, igniting the excitement of the crowd.
The generated video looks smooth and does not show a dramatic change in the subject’s face.
Conclusion: Tips for Better Image-to-Video Outputs
Based on our experiments, we came up with the following tips for generating better image to video outputs:
- Focus on Object and Action: ensures the model understands the primary elements to animate effectively.
- Provide Detailed Description: guides the model with descriptions that enrich the primary elements.
- Avoid Unnecessary Information: skips details like camera angles and background elements that can confuse the model.
- Use Recognizable Objects for Clarity: familiar items and public figures help the model understand and create relevant outputs.
- Iterate and Adjust: just like the rest of our machine learning journey, try-and-tune is always what we do to get better results.
By applying these tips, you can enhance the quality of your image to video outputs. Don’t hesitate to experiment with different prompts and approaches. Give it a try and see how your ideas come to life!