
How Do Video Gen Models Actually Work?

Kling and Sora turn a prompt into video in seconds — using noise, denoising, and a lot of training data. Hands still melt.

By Hardeep Gambhir

26 March 2026

You’ve typed a prompt into Kling, hit generate, stared at a progress bar for 90 seconds, and gotten back something that either made your jaw drop or looked like a melting wax museum.

But what actually happened in those 90 seconds?

Every major video model right now, whether it's Sora, Kling, Veo, Seedance, or Runway, is doing fundamentally the same thing under the hood. The differences matter and we'll get into them, but the core process is identical. Once you understand it, a lot of the weird behavior you've seen in your own generations starts making sense.

It starts with static

Every generated video begins as pure random noise. Like a TV with no signal. The model’s job is to take that static and clean it up, step by step, until a video appears.

Inside AI Video examples strip

The way it learned to do this is kind of counterintuitive. During training, the model watched millions of real videos get progressively destroyed. Noise layered on top, again and again, until the original was completely unrecognizable. The model’s job was to learn how to reverse that. Look at a noisy mess and figure out what the clean version underneath should be.
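
If you want to see that training trick in miniature, here's a sketch of the destruction step, assuming a simple variance-preserving noise schedule. Real models use carefully tuned schedules and work on compressed representations (covered below); the shapes and the add_noise name here are purely illustrative.

```python
import torch

def add_noise(clean_video, t):
    """Corrupt a video to noise level t (0 = untouched, 1 = pure static).

    Simple variance-preserving mix: as t rises, the original fades out
    and random static fades in. The model trains to predict `noise`
    from the corrupted result.
    """
    noise = torch.randn_like(clean_video)
    noisy = (1 - t) ** 0.5 * clean_video + t ** 0.5 * noise
    return noisy, noise

# one training example: pick a random corruption level, wreck a real clip,
# and ask the model to guess exactly what static was added
clean = torch.randn(120, 3, 64, 64)   # stand-in for a real clip: [frames, channels, h, w]
t = torch.rand(()).item()             # random noise level between 0 and 1
noisy, target = add_noise(clean, t)
```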

So when you hit generate, it starts with a block of random static shaped to your video dimensions, then runs 30 to 50 cleanup passes. Each one removes a bit of noise and adds a bit of structure. The early passes sketch out the broad strokes. Sky up here, ground down there, figure in the middle. The later passes fill in the details. Facial features, fabric texture, how light bounces off water.
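
Generation runs that process in reverse. Here's a toy version of the sampling loop, assuming the same simplified schedule as above and an abstract `denoiser` that predicts the noise in its input. Real samplers differ in the exact update rule, but the shape of the loop is the same: static in, repeated cleanup, video out.

```python
import torch

def generate(denoiser, prompt_embedding, num_steps=50, shape=(120, 3, 64, 64)):
    """Toy sampling loop: pure static in, cleaned-up video out."""
    x = torch.randn(shape)                            # start from random static
    for i in range(num_steps):
        t = (num_steps - i) / (num_steps + 1)         # current noise level, high -> low
        t_next = (num_steps - i - 1) / (num_steps + 1)
        predicted_noise = denoiser(x, t, prompt_embedding)
        # estimate the clean video implied by this prediction...
        clean_estimate = (x - t ** 0.5 * predicted_noise) / (1 - t) ** 0.5
        # ...then step to a slightly less noisy point for the next pass.
        # Early passes fix broad structure, later passes fill in detail.
        x = (1 - t_next) ** 0.5 * clean_estimate + t_next ** 0.5 * predicted_noise
    return x
```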

Denoising progression

Here’s the part that matters most: the whole video gets generated at once, not one frame at a time. Frame 30 already knows what frame 1 looks like while it’s being created. That’s why objects can go behind something and come back looking the same. And when that breaks, when you see flickering or a character’s shirt changing color halfway through, it means the model lost track of what it was doing across frames.
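
A minimal way to picture that: the frames get chopped into patches, every patch of every frame gets flattened into one long token sequence, and attention runs over the entire sequence at once, so a patch near the end of the clip can look directly at the first frame. The sizes below are tiny and illustrative; real sequences are orders of magnitude longer, which is a big part of why video generation is so expensive.

```python
import torch
import torch.nn as nn

frames, patches_per_frame, dim = 16, 64, 256     # toy sizes; real models use far more

# every patch of every frame, flattened into one sequence
tokens = torch.randn(1, frames * patches_per_frame, dim)   # [batch, 1024 tokens, dim]

# self-attention over the whole sequence: the last frame's patches attend to the
# first frame's, which is what lets an object leave view and come back unchanged
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)
out, _ = attn(tokens, tokens, tokens)
```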

Why it works on a compressed version

A single 1080p frame has about 2 million pixels. Multiply that by 24 frames per second across a 5 second clip and you’re at 240 million pixels. No model can work with that directly.

So before anything creative happens, the model compresses everything down. Think about how Netflix compresses a movie for streaming. It keeps the important stuff like shapes, colors, and motion patterns, and throws out the redundant detail. The model does the same thing, shrinking the video down by roughly 8x in each direction. That massive 1080p video becomes a small dense block of numbers that captures the essence of what the video looks like without all the pixel level noise.
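
The rough numbers, assuming the 8x-per-direction spatial shrink described above. The latent channel count below is illustrative, and many video models compress across time as well, so treat this as an order-of-magnitude sketch.

```python
# pixel space: 5 seconds of 1080p at 24 fps
width, height, fps, seconds = 1920, 1080, 24, 5
frames = fps * seconds
pixel_values = width * height * frames * 3                  # 3 color channels
print(f"pixel space:  {pixel_values:,} values")             # ~746 million

# latent space: 8x smaller in each spatial direction, more channels
latent_values = (width // 8) * (height // 8) * frames * 16  # 16 latent channels (illustrative)
print(f"latent space: {latent_values:,} values")            # ~62 million
print(f"about {pixel_values / latent_values:.0f}x fewer numbers to denoise")
```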

Latent space illustration

All 30 to 50 cleanup passes happen on this compressed version. Only at the very end does it expand everything back out to full resolution.

This explains a few things you’ve probably noticed. Upscaling is always a separate step in these tools because the model was literally working on a compressed low res version the entire time. When Runway offers 4K upscaling or Kling gives you resolution options, that’s a different process expanding the output into something crisp.

It also explains why fingers get destroyed. At that compressed scale, five fingers basically merge into a blob. The model isn’t being lazy. It genuinely cannot distinguish between four and five fingers at the internal resolution it operates at.

Hands failure mode example

How your prompt actually controls the output

You type “a golden retriever running through a wheat field at sunset.” But the model works with numbers. So your text goes through an encoder that converts it into a mathematical representation, basically a big block of numbers that captures the meaning and relationships between the words in your sentence.

That block of numbers then steers every single one of those cleanup passes. At each step the model is essentially asking, does this look more or less like what was described? And nudging accordingly.
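
You can see this step with an off-the-shelf text encoder. Which encoder each video model actually uses is generally not public, so t5-base here is just a stand-in; the point is the output shape: one vector per token, and that block of vectors is what every cleanup pass consults.

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-base")
encoder = T5EncoderModel.from_pretrained("t5-base")

prompt = "a golden retriever running through a wheat field at sunset"
tokens = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    embedding = encoder(**tokens).last_hidden_state   # [1, num_tokens, 768]

# `embedding` is the "big block of numbers" the denoiser is steered by at every step
print(embedding.shape)
```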

There’s a clever trick happening here that you’ve already interacted with even if you didn’t know it. That creativity or prompt adherence slider in most video tools? Here’s what it actually does.

The model generates two predictions at every cleanup step. One following your prompt and one completely ignoring it. Then it looks at the difference between those two and pushes the result toward the prompted version. The slider controls how aggressively it pushes. Crank it up and the model obeys your instructions precisely, but things can start looking overprocessed or unnatural. Turn it down and it takes creative liberties. Sometimes that’s brilliant. Sometimes it goes completely off script.
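
That trick is usually called classifier-free guidance. A minimal sketch of one step, with `denoiser` as an abstract stand-in for the model:

```python
def guided_prediction(denoiser, x, t, prompt_embedding, guidance_scale=7.5):
    """One cleanup step's noise prediction, pushed toward the prompt.

    guidance_scale is (roughly) the creativity / prompt-adherence slider:
    1.0 means no push at all, higher values follow the prompt harder and
    can start to look oversaturated or unnatural.
    """
    with_prompt = denoiser(x, t, prompt_embedding)
    without_prompt = denoiser(x, t, None)            # same step, prompt ignored
    # look at the difference between the two and push toward the prompted version
    return without_prompt + guidance_scale * (with_prompt - without_prompt)
```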

Prompt control panel example

Most tools default somewhere in the middle and that’s the range where things tend to look both accurate and natural. If you’ve ever cranked that slider to max and gotten back something that looked weirdly saturated or artificial, now you know why.

Text to video vs image to video

This is where understanding how these models work can immediately change your results.

Text to video takes your prompt and generates everything from scratch. The model decides what the scene looks like, what the characters look like, the lighting, the composition, all of it. Text is the only guide, and text is vague. “A woman walking through Tokyo at night” could look a million different ways.

Image to video is a completely different game. You give the model an actual image and it generates motion around it. Think of it like handing someone page 1 of a flip book and asking them to draw the rest of the pages to match. The model knows exactly what the scene looks like because you showed it.

Reference-to-motion example

The control difference is massive. An image gives the model exact information. This face, this lighting, this composition, this color palette. Text can’t do that.

This is why most serious workflows don’t go straight from text to video. They generate a still image first using Midjourney or Flux or whatever gets them the right look, lock in the visual, then feed that image into a video model for animation. If you’re skipping that step and wondering why you can’t get consistent results across shots, this is probably the reason. You’re giving the model the vaguest possible input and expecting it to make the same creative choices every time.
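
Mechanically, one common way to wire this up is to encode your reference image into the same compressed space and keep the first frame pinned to it on every cleanup pass. Models differ in how they condition on images, and `encode_image` and `denoiser` below are abstract placeholders, so read this as a sketch of the idea rather than any particular product's pipeline.

```python
import torch

def image_to_video(denoiser, encode_image, reference_image, prompt_embedding,
                   num_frames=120, num_steps=50):
    """Sketch of first-frame conditioning: generate motion around a known frame."""
    ref_latent = encode_image(reference_image)        # your image, compressed
    x = torch.randn(num_frames, *ref_latent.shape)    # static for every frame
    for i in range(num_steps):
        t = (num_steps - i) / (num_steps + 1)
        # re-anchor frame 0 to the reference at the current noise level, so the
        # model is always completing a clip that starts from your exact image
        x[0] = (1 - t) ** 0.5 * ref_latent + t ** 0.5 * torch.randn_like(ref_latent)
        predicted_noise = denoiser(x, t, prompt_embedding)
        x = x - predicted_noise / num_steps            # simplified update step
    x[0] = ref_latent                                  # frame 0 ends up exactly your image
    return x
```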

Some models push this even further. Kling 3.0 lets you feed in multiple reference images to build an internal map of faces, clothing, and environments. Runway supports defining both a start and end frame so you control exactly where a shot begins and ends. Seedance accepts text, image, and audio inputs all at once. Same underlying idea, give the model more specific information so it has less room to guess wrong.

What actually makes these models different from each other

If they all use the same basic process, compress, clean up, decompress, why do results look so different between Kling, Sora, Veo, and everything else?

A few things.

Training data is the biggest one and nobody talks about it. The underlying architecture has mostly converged across the industry. What actually separates these models is what videos they were trained on, how much data they used, and how carefully it was curated. It’s also the most closely guarded secret at every one of these companies.

Model output comparison panel

To understand just how seriously these companies take training data, look at the deals being made. Runway partnered with Lionsgate to train a custom model on their entire library of 20,000+ film and TV titles, everything from John Wick to The Hunger Games. The exact financials were never disclosed but Lionsgate’s vice chair said it would save them “millions and millions of dollars.” It was the first deal of its kind between an AI video company and a major Hollywood studio. And the rest of the industry is quietly doing the same thing.

Then there’s Luma Labs raising $900 million in a single round, over $1 billion total, at a $4 billion valuation. A massive chunk of that capital goes toward compute and data acquisition. From conversations across the industry, what’s become clear is that these companies are willing to pay almost anything for high quality training data. The race right now is about winning the market, not about being cautious with copyright. Most of these companies are expanding their legal teams in parallel, essentially building up their legal defense while moving as fast as possible on data acquisition. The copyright problem is being treated as a later problem. The immediate problem is falling behind competitors who trained on more or better data. That dynamic isn’t spoken about publicly but it’s the reality of how this market is operating right now.

How they handle motion across frames. Some models figure out what each individual frame looks like first, then separately work out how those frames connect to each other. Others like Sora and Kling process everything together, every frame aware of every other frame simultaneously. The second approach costs more compute but tends to produce smoother, more natural movement.

Audio generation. This is the single biggest shift from 2025 into 2026. Veo 3 generates audio alongside the video in one pass. Lip synced dialogue, ambient sounds, everything together. Kling 3.0 and Seedance 2.0 do something similar. A year ago you needed separate tools and had to manually sync audio in post. Now it’s just part of the output.

Multi shot generation. Kling 3.0 can produce up to 6 connected shots with consistent characters across cuts. That’s a massive deal because it’s the difference between making random isolated clips and making something that actually feels like an edited sequence. Runway takes a different approach, giving you more granular control over individual shots with strong keyframe tools.

Resolution and duration tradeoffs. Kling 3.0 pushed to native 4K at 60fps. Sora 2 goes up to about 20 seconds at 1080p. Seedance 2.0 prioritizes motion quality over raw resolution. These reflect different bets about what users care about most right now.

What’s still broken

Worth being honest here because the marketing from these companies won’t be.

Hands and fingers are still rough across every model. A hand is 27 bones in a shape that constantly flexes, folds, and occludes itself. At the compressed resolution the model works at internally, fingers just blur together. Getting better, not solved.

Body contact is messy. Two people hugging, a hand on a doorknob, someone sitting in a chair. Anything where surfaces touch or overlap tends to produce melting or merging artifacts. The model struggles to understand which parts belong to which object when they’re pressed together.

Text on screen doesn’t work reliably. Signs, logos, anything with readable words tends to scramble or shift between frames.

Longer clips lose consistency. Most models produce solid 3 to 5 second clips. Go past 10 seconds and things start drifting. Subtle appearance changes, physics breaking down, backgrounds shifting. Every lab is working on this.

Nothing is real-time yet. A 5 second clip takes anywhere from 30 seconds to 5 minutes depending on the model and settings. And roughly a third of generations still need a do-over. The actual workflow is still prompt, wait, check, retry.

Where this is going

The trajectory from the last 12 months tells the story pretty clearly.

We went from silent isolated clips to multi shot sequences with synchronized audio. From 720p to native 4K. From text only input to mixed inputs combining images, audio, and text in a single generation. From hoping characters stay consistent to systems that actively maintain identity across shots.

The next wave is real time generation, physics that actually holds up, and longer outputs that don’t fall apart.

The copyright question nobody has answered yet

The copyright situation exploded when Seedance 2.0 went viral and Disney and Paramount sent cease-and-desist letters, forcing a conversation the industry had been avoiding. But the existing frameworks, Creative Commons, fair use, the DMCA, were never designed for a world where a model can absorb 20,000 movies and produce something new that doesn't technically copy any of them but wouldn't exist without all of them.

Something new is going to emerge. One strong possibility is cryptographic provenance baked into media at the protocol level. Think of it as a permanent, verifiable record of what data was used to train a model, what content was generated by AI, and what rights are attached. Not speculative token trading, but the underlying technology of tamperproof, decentralized verification applied to creative rights. Some version of this feels inevitable because the alternative, trying to enforce 20th century copyright law on models that process the entire internet, is clearly not working. The companies building these models know this. The studios suing them know this. The next few years are going to produce a new framework that probably looks nothing like what either side is arguing for today.

The core takeaway

The core of how these models work is settled. Compress the video, clean up the noise guided by your input, decompress back to pixels. Every model is a variation on that recipe. The real competition now is training data, specific feature sets, and who builds the tool that creative professionals actually want to open every day.

Knowing this won’t instantly make your generations better. But understanding why image to video gives you more control than text to video, why that creativity slider behaves the way it does, why hands fall apart, why some models handle motion better than others, that kind of understanding changes how you approach every single prompt. And it adds up.

We’re LocalHost. Last year we ran the Mumbai AI Film Festival at the Royal Opera House, where over 1,200 teams applied, 15 were flown in from across the world, and 14 AI short films premiered on the red carpet in front of 600 people, judged by directors like Ram Madhvani and Shakun Batra, with Tanmay Bhat, Ritesh Deshmukh, and teams from Netflix India and Google in attendance. In February 2026 we followed that up with the India AI Film Festival at Qutub Minar during the India AI Impact Summit, in collaboration with the Government of India and sponsored by NVIDIA, screening films for 150+ investors, policymakers, and AI leaders from around the world. This year we’re going global: five more AI film festivals in Los Angeles, San Francisco, Paris, Tokyo (in collaboration with the Tokyo Metropolitan Government), and Mumbai. If you’re making things with these tools, come build with us.

Hardeep Gambhir

Co-founder, Content & Media

Hardeep co-founded LocalHostHQ and leads media and storytelling across the network. Previously part of the founding team at The Residency, an accelerator backed by Sam Altman.

The lab is open

Applications are reviewed on a rolling basis. We back young people from all backgrounds, regardless of credentials.