{"id":2678,"date":"2025-03-02T06:19:32","date_gmt":"2025-03-02T06:19:32","guid":{"rendered":"https:\/\/www.codeastar.com\/?p=2678"},"modified":"2025-03-02T06:27:50","modified_gmt":"2025-03-02T06:27:50","slug":"cogvideox-self-hosted-ai-image-to-video-gene","status":"publish","type":"post","link":"https:\/\/www.codeastar.com\/cogvideox-self-hosted-ai-image-to-video-gene\/","title":{"rendered":"Generation Next: Self-hosted Image to Video with CogVideoX"},"content":{"rendered":"\n

In our last post<\/a>, we explored how to generate images using FLUX. This time, we are taking a step further by using generative AI to generate videos. The most popular generative video model nowadays is Sora<\/a> from OpenAI, but it is not freely available. Another popular choice is Dream Machine<\/a> from Luma AI, which is free for anyone to try. But just like in the past, we prefer solutions that are open source and self-hosted, so everything stays under our control and we do not need to worry about usage limits. That brings us to CogVideoX, an open-source text and image to video model.<\/p>\n\n\n\n

Text and Image to Video<\/h2>\n\n\n\n

If you have not tried text or image to video generation yet, visit the Dream Machine website to try it for free. There are two major types of video generation, but in this post, we will focus on image to video. <\/p>\n\n\n\n

Text to video is similar to what we tried in FLUX: we type the prompts and get the expected results, except the generated output is a video instead of an image. Image to video, on the other hand, generates a video based on the prompts *plus* the image content. Interestingly, text to video is often considered more challenging than image to video, because the model needs to understand the text prompts and generate output that matches the narrative content. But that is the view from a machine's perspective. From a human perspective, image to video is far harder than text to video, because we inherently have specific expectations based on the images we provide. Having already seen the images, we expect more than just what the text prompt describes.<\/p>\n\n\n\n

Use of CogVideoX in Easy Mode<\/h2>\n\n\n\n

Okay, back to the post topic, CogVideoX. Among the few generative AI models that can produce videos, CogVideoX stands out as an open-source option. To get started, go to its GitHub page<\/a> and clone the project. After that, we can use its Python code sample<\/a> to generate a video, or use its web UI<\/a> to do the video generation.<\/p>\n\n\n\n

For absolute beginners, we suggest using CogStudio<\/a>. It is a Gradio web UI just like the one mentioned earlier, but it provides more functions and, most importantly, a one-click install: a single click installs both CogVideoX and the CogStudio web UI. That's it! Once the installation is finished, we can run CogStudio directly in a web browser.<\/p>\n\n\n\n

\"CogVideoX<\/figure>\n\n\n\n

Looks familiar? Yes, our favorite FLUX Forge<\/a> is also built on a Gradio web UI.<\/p>\n\n\n\n

CogVideoX Image to Video Experiment<\/h2>\n\n\n\n

Let’s explore image to video generation using CogVideoX’s 5B model. The process is straightforward. From the CogStudio web UI we have seen before, click the “image-to-video” tab on the top, upload an image there, enter the prompt, then click the “Generate Video” button.<\/p>\n\n\n\n

According to CogVideoX’s GitHub page:<\/p>\n\n\n\n

\n

Since CogVideoX is trained on long texts, we need to transform the input text distribution to match the training data using an LLM.<\/p>\n<\/blockquote>\n\n\n\n

Okay, let’s do it. We start by using the feature image from the previous post.<\/p>\n\n\n\n

\"Digital<\/figure>\n\n\n\n

This is our input for image to video, and I asked ChatGPT to generate a long text description to use as our prompt: <\/p>\n\n\n\n

\n

In a cozy modern studio adorned with lush plants and colorful artwork, a cheerful animated bear with a warm smile and a bright yellow scarf dances joyfully against a stunning sunset skyline. The large windows frame the vibrant oranges and pinks of the setting sun, casting a warm glow across the room. As upbeat music fills the air, the bear captivates the scene with its lively movements, spinning and hopping to the rhythm. Its expressive face reflects pure delight, while its paws sway gracefully, tapping along to the catchy beat. As the dance reaches a crescendo, the bear gives a playful wave, radiating happiness and leaving a lasting impression of joy in this vibrant setting.<\/p>\n<\/blockquote>\n\n\n\n

Here is our first image to video output:<\/p>\n\n\n\n