Video Generation
It all started with a school project in which we were trying to prepare a training dataset for a custom detection model. However, due to a scarcity of resources, it was extremely difficult to obtain actual footage. One of the project supervisors proposed AI video generation as a way to synthesize the training data.
Although this kind of "data synthesis" is considered controversial in academia (Cai, 2025), it is still an interesting path for us to explore, since the data carries no meaning in itself and serves solely for segmentation-training purposes.
With two reference images, we started our image-to-video (I2V) journey.


Feed Images
Open-source Models
The supervisor quite explicitly indicated that the model we should try is ByteDance's recently published Seedance 2.0, which has earned great popularity thanks to an exponential leap in output quality; its outputs have gone viral across social media and video-streaming platforms. However, as an open-source lover, how could I resist trying the open-source models and seeing how far we have progressed since Will Smith eating spaghetti?
NVIDIA's Cosmos 2.5 Predict
This model was first brought up by one of my colleagues who works on an autonomous-driving project. One major difference distinguishes Cosmos from other video generation models: real-world physics. Cosmos is not a pure video generation model but a world foundation model; the series was developed with the intention of empowering robotics. Hence, it makes perfect sense to use the videos it generates for simulation.
NVIDIA's intended use case (Video Credit: NVIDIA)
However, this solution also has its own catch. As an NVIDIA product, it is closely bound to the CUDA ecosystem. That may sound fine, but bear in mind that running this 4B model requires over 40GB of VRAM. For most consumer-grade (or even professional-grade) NVIDIA GPUs with 16GB of VRAM, it will run into OOM in no time. Even with a beefy server-grade GPU like the L40 (with 48GB of VRAM), the generation time is still a pain in the ass: a 6-second video clip took more than half an hour (the clip below cost me something like $0.6). In the near future, absent some technological leap, I see no possibility of this model being deployed on a "non-nuclear-powered" robot. Still, to their credit, the setup procedure is quite simple with no known dependency issues, as long as you have the correct CUDA version installed.
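To put that 40GB figure in perspective, here is a back-of-the-envelope sketch (the bf16 precision and the weights/overhead split are my assumptions, not numbers documented by NVIDIA): the 4B weights themselves are a small fraction of the footprint, and the bulk goes to activations and attention state over many video frames.

```python
# Illustrative VRAM math for a 4B-parameter model (assumptions, not
# official NVIDIA figures): weights in bf16, ~40 GiB total reported.
PARAMS = 4_000_000_000   # 4B parameters, from the model name
BYTES_PER_PARAM = 2      # bf16 precision (assumed)
TOTAL_VRAM_GIB = 40      # reported requirement

weights_gib = PARAMS * BYTES_PER_PARAM / 1024**3
overhead_gib = TOTAL_VRAM_GIB - weights_gib

print(f"weights alone: {weights_gib:.1f} GiB")        # ~7.5 GiB
print(f"activations/overhead: ~{overhead_gib:.0f} GiB")  # ~33 GiB
```

In other words, a 16GB card could comfortably hold the weights; it is the per-frame activation memory of video diffusion that pushes the requirement past 40GB.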
NVIDIA's world foundation model
To be honest, the output of the model exceeded my expectations. The original structure was well maintained; even the Chinese characters shown on the advertisement were well preserved (a very challenging point for diffusion models). The most outstanding characteristic of the clip is, of course, the human physics: the person walks with the most natural gait of any open-source model we tried.
Alibaba's Wan2.6
Closed-source Models
Google Veo
Here comes the funny one. Although proclaimed a SOTA model, Google's Veo produced by far the worst clip I have ever seen. To some extent, I don't think it differs much from Will's classic video. I also found that Google offers very weak control: the output video can drift far from the inputted reference image.
I also suspect that the agent workflow behind the Gemini web interface has "side effects" on the user prompt: Veo does not take the prompt directly from user input but from whatever has been "tuned" by the Gemini models. "What are the effects?" you may ask. Well, the clip here was already Veo's second attempt, after I specifically asked it to maintain the bus structure shown in the image. The original clip was something Christopher Nolan would appreciate.
We also see that Veo failed at continuity, which resulted in this "Inception"-like special effect. Given that Nano Banana performs so well at image editing, I see no reason why Veo should fail at such a simple task. Maybe DeepMind's world foundation model should be the one for this use case.
Seedance 1.5
At the time of writing, ByteDance has yet to release public access for overseas users to their SOTA Seedance 2.0 model. Hence, we used their second-best model, Seedance 1.5. The following clip was generated using the free credits that come with first-time registration.
It was actually quite nice. The structure was well preserved and the people move relatively naturally (although the turn-around was quite a maneuver). The model actually fulfills the prompted requirements as it interprets them. It is the best performance I have seen so far in terms of overall quality, and I can't wait to see how their flagship model performs.