Exploring a New Video Generation Model LTX-Video Capable of Running on Most Home Computers

Introduction to the various parameter settings available for the LTX-Video model, and how they work.
Machine Learning
ComfyUI
Author

Ilua Dozhdikov

Published

November 26, 2024

Until recently, generative video models were almost inaccessible to the average user. To create videos, one had to rely on paid services like Runway or Kling, or use platforms that provide remote GPU power. All these options were, to some extent, paid solutions.

In the past few months, however, remarkable new models have been introduced, which can be run on a local computer, albeit with some challenges. Models like CogVideoX and Mochi have become popular. Yet, running these models often requires a graphics card with at least 24 GB of VRAM.

LTX-Video changes this: it is a groundbreaking model that allows users to generate videos up to 10 seconds long, at 25 FPS, with resolutions up to 720x1280 pixels, on a computer equipped with an above-average consumer GPU.

This article explores how various parameters of LTX-Video influence the results and what this means for users looking to harness its capabilities for video generation.

LTX-Video is an openly released model designed for video generation, but it comes with a restrictive license that allows only non-commercial use.

As for its architecture, the available information suggests that it is based on the Diffusion Transformer (DiT) framework. For a deeper dive into its underlying technology, check out this video.

txt2vid Pipeline

Let’s examine the basic setup for generating videos from text prompts.

Basic text to video pipeline in ComfyUI

For our model, we require a Variational Autoencoder (VAE), model weights from ltx-video-2b-v0.9, and text-encoder weights from t5xxl_fp16, loaded through ComfyUI's CLIP loader node. We can use both positive and negative prompts. The remaining nodes are the familiar samplers and schedulers; we will experiment with their configuration options later in this article.
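
For readers who prefer to drive this pipeline from a script, here is a minimal sketch that queues a generation through ComfyUI's HTTP API. It assumes a local ComfyUI server on the default port 8188 and that the workflow above was exported with "Save (API Format)"; the file name workflow_api.json is an assumption for illustration.

```python
import json
import urllib.request

# Minimal sketch: queue the text-to-video workflow on a local ComfyUI server.
# Assumes the server runs on the default port (8188) and that the pipeline
# shown above was exported via "Save (API Format)" as workflow_api.json.
COMFY_URL = "http://127.0.0.1:8188/prompt"

def queue_prompt(workflow: dict) -> dict:
    """POST a workflow graph to ComfyUI's /prompt endpoint."""
    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    req = urllib.request.Request(
        COMFY_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())  # response contains the queued prompt_id

with open("workflow_api.json") as f:
    workflow = json.load(f)

print(queue_prompt(workflow))
```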

Different parameter sets for inference

In general, the prompt should be very descriptive. The prompt below was taken from the model's official repo. Here's what it looks like:

The waves crash against the jagged rocks of the shoreline, sending spray high into the air. The rocks are a dark gray color, with sharp edges and deep crevices. The water is a clear blue-green, with white foam where the waves break against the rocks. The sky is a light gray, with a few white clouds dotting the horizon.

These experiments used 30 scheduler steps, a CFG scale of 3, and the Euler sampler as the defaults.
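
To reproduce these defaults programmatically, the same values can be written into the exported workflow before queueing it with the queue_prompt helper sketched earlier. The node ids below ("3" for the sampler node, "6" for the positive prompt) are hypothetical; look up the actual ids in your own workflow_api.json.

```python
# Hypothetical node ids: "3" = sampler node, "6" = positive CLIPTextEncode.
# Check your own workflow_api.json export before running this.
workflow["6"]["inputs"]["text"] = (
    "The waves crash against the jagged rocks of the shoreline, "
    "sending spray high into the air. ..."
)
workflow["3"]["inputs"]["steps"] = 30       # scheduler steps
workflow["3"]["inputs"]["cfg"] = 3.0        # classifier-free guidance scale
workflow["3"]["inputs"]["sampler_name"] = "euler"
queue_prompt(workflow)
```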

Varying scheduler steps

Varying this parameter does not change the overall result dramatically. At higher values, however, the camera appears to move, while at very low values, such as 20, a faint grid-like pattern becomes visible in the video.
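
To compare step counts side by side, one can queue one job per value, reusing the queue_prompt helper and the hypothetical node id from the sketches above; the specific values are just examples.

```python
# Queue one generation per step count; all other settings stay at the defaults.
for steps in (20, 30, 50):
    workflow["3"]["inputs"]["steps"] = steps
    queue_prompt(workflow)
```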

Varying common samplers

By default, the Euler sampler is selected. As the experiments show, some samplers can introduce motion to an otherwise static camera. ddim produces results that are essentially identical to Euler, while dpmpp_2s generates a significantly different, yet promising, output.
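
The same sweep pattern works for samplers. The name strings must match the options exposed by your ComfyUI build's sampler node; "dpmpp_2s_ancestral" is used below as a stand-in for the dpmpp_2s family, which is an assumption.

```python
# Sampler names must match your ComfyUI build's sampler list exactly;
# "dpmpp_2s_ancestral" stands in here for the dpmpp_2s family mentioned above.
for sampler in ("euler", "ddim", "dpmpp_2s_ancestral"):
    workflow["3"]["inputs"]["sampler_name"] = sampler
    queue_prompt(workflow)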

Varying cfg (classifier-free guidance)

The optimal range for this parameter appears to be between 2 and 5. Values outside this range produced less desirable results, with values below 2 causing excessive floating and values close to 10 resulting in complete failure.
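
A quick sweep across and beyond this range makes the effect easy to verify; as before, this reuses the hypothetical queue_prompt helper and sampler node id.

```python
# Sweep the classifier-free guidance scale from "floating" territory (below 2)
# up to values near 10, where generation fails outright.
for cfg in (1.0, 2.0, 3.0, 5.0, 10.0):
    workflow["3"]["inputs"]["cfg"] = cfg
    queue_prompt(workflow)
```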

Conclusion

LTX-Video is a significant step forward in accessible video generation. It offers a user-friendly way to create high-quality videos on local hardware, bypassing the need for cloud-based services. While its current limitations, such as the restricted license and hardware requirements, might pose challenges for some, its potential is undeniable.

As we’ve explored, understanding the influence of various parameters is crucial for optimizing video generation. By fine-tuning settings like scheduler steps, samplers, and CFG scale, users can achieve a wide range of visual styles and effects.

The future of video generation is exciting. As models like LTX-Video continue to evolve, we can anticipate even more powerful and accessible tools that will revolutionize the way we create visual content.