Smooth scrubbing videos

2022-11-07 (Mon)
5-minute read

Update

The new revision of this blog post can be found here.

You may have noticed that some videos you play in video players allow you to “scrub” smoothly. That is, when you “scrub” the video by moving the playhead back and forth, the frame is instantly shown on screen without any stuttering or “jank,” as long as your playback device can keep up, of course.

Not only can you scrub frame by frame, some videos even let you scrub backwards just as smoothly, which is great if you’re analyzing a replay or trying to find a specific moment in a video.

Well, in this blog post, I’ll explain how some videos manage to scrub so well, and how you can re-encode videos so that they scrub perfectly!

A quick overview on video encoding and decoding

Note: this is a simplified version of how video encoders and decoders work. For the purposes of this blog post, we don’t need to completely understand them, but we still need to know what the frame types are so that the next part makes sense. If any experts are wincing at the parts below, I’m sorry!

As we all know, video files are just containers holding one or more video streams and, usually, one or more audio streams, as well as other stream types (like subtitles). We’ll ignore audio streams and other stream types for now, because they’re not important here. What we’ll focus on is the video stream.

A video stream is just a sequence of pictures, called frames. But storing every frame as raw pixels is rather large and wasteful. Think about it.

A pixel is made up of red, green, and blue values (RGB), each from 0 to 255 (we’ll assume that the video is encoded in 8-bit color here).
That means there are three color channels, each with 8 bits of information. 8 bits is 1 byte, so we have 3 bytes of data per pixel to store.
For a given 1080p video file, there will be 1920 * 1080 pixels to store. 1920 * 1080 = 2,073,600. Each pixel has 3 bytes, so we’re up to 6,220,800 bytes for a single frame. (That’s 6,221 kilobytes, or 6.2 megabytes of data.)
Let’s say the video file has 24 frames per second (typical for movies and whatnot). 6,220,800 * 24 = 149,299,200 bytes, or 149,299 kilobytes, or 149 megabytes. For a single second of video!
Let’s say it’s a short video file, about 30 seconds long. 149,299,200 * 30 = 4,478,976,000 bytes, or 4,478,976 kilobytes, or 4,479 megabytes, or about 4.5 gigabytes.
60 seconds. 149,299,200 * 60 = 8,957,952,000 bytes = 8,957,952 kilobytes = 8,958 megabytes = 9 gigabytes!
How about a two-hour long movie? 149,299,200 * 60 * 60 * 2 = 1,074,954,240,000 bytes = 1,074,954,240 kilobytes = 1,074,954 megabytes = 1,075 gigabytes = about 1 terabyte.
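
If you’d like to redo the arithmetic above for other resolutions or frame rates, here is a minimal Python sketch. The function name raw_video_bytes is made up for this post, and it assumes the same things as the list above: 8-bit RGB (3 bytes per pixel), no compression at all, and decimal units (1 megabyte = 1,000,000 bytes).

```python
def raw_video_bytes(width, height, fps, seconds, bytes_per_pixel=3):
    """Size in bytes of uncompressed video.

    Assumes 8-bit RGB by default, i.e. 3 bytes per pixel.
    """
    frame_bytes = width * height * bytes_per_pixel  # one raw frame
    return frame_bytes * fps * seconds              # every frame of the clip


if __name__ == "__main__":
    # Same cases as in the text: 1080p at 24 frames per second.
    for label, seconds in [("1 second", 1), ("30 seconds", 30),
                           ("60 seconds", 60), ("2 hours", 2 * 60 * 60)]:
        size = raw_video_bytes(1920, 1080, fps=24, seconds=seconds)
        print(f"{label}: {size:,} bytes (about {size / 1e9:.1f} GB)")
```

Swapping in 3840 and 2160 for the width and height gives the 4K numbers, which are four times larger again.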

Obviously, you don’t want to store that much data. And that’s where video encoders and decoders come in.

Video streams are encoded and decoded according to various standards: H.264 being one, with H.265 and AV1 competing to replace it. (There are more, of course.) A video encoder’s job is to turn the raw frames of video into much smaller compressed frames. There are three types of compressed frames: I-frames, P-frames, and B-frames. (Strictly speaking, only I-frames are called keyframes.)

I-frames are like the full frames we talked about before. Each one contains all of the data needed to draw a single frame of video, without looking at any other frame.

P-frames reference earlier frames (an I-frame or another P-frame). Most video content has some sort of pattern (for example, a static background with a moving subject in front). In this case, the video encoder can just make an I-frame containing the entire scene, then make a P-frame that says “hey, the subject in the video moved right by two hundred pixels and moved up by four hundred pixels.” Then the decoder uses that data to just re-draw the changed parts on top of the previous frame.

B-frames reference frames both before and after them. They save the most space, at the cost of being harder to encode and decode.
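
To make the I-frame and P-frame idea a bit more concrete, here is a toy sketch in Python. It is not how H.264 or any real codec works (real encoders use motion compensation and much more); it only shows the principle of “store a full frame now and then, and store just the changes in between.” Frames are plain lists of numbers, and the names encode, decode, and gop are invented for this example. B-frames are left out entirely.

```python
def encode(frames, gop=4):
    """Toy encoder: a full "I" frame every `gop` frames, "P" deltas otherwise."""
    encoded = []
    previous = None
    for index, frame in enumerate(frames):
        if index % gop == 0:
            # I-frame: keep every pixel.
            encoded.append(("I", list(frame)))
        else:
            # P-frame: keep only the pixels that differ from the previous frame.
            changes = {pos: new
                       for pos, (old, new) in enumerate(zip(previous, frame))
                       if old != new}
            encoded.append(("P", changes))
        previous = frame
    return encoded


def decode(encoded):
    """Toy decoder: rebuild every frame by applying deltas in order."""
    frames = []
    current = None
    for kind, payload in encoded:
        if kind == "I":
            current = list(payload)
        else:
            current = list(current)  # start from the previous picture
            for pos, value in payload.items():
                current[pos] = value
        frames.append(current)
    return frames


if __name__ == "__main__":
    # A "video" of 8-pixel frames where a single bright pixel moves right.
    video = [[9 if pos == step else 0 for pos in range(8)] for step in range(6)]
    packed = encode(video)
    for kind, payload in packed:
        print(kind, payload)
    print("round trip ok:", decode(packed) == video)
```

Notice that a P-frame in this toy holds only a couple of entries, while an I-frame holds all eight pixels. That difference is where the space savings come from.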

So that’s a very brief and dumbed-down overview of video encoding and decoding. So how does that relate to scrubbing, anyway?

Keyframes and scrubbing

When you scrub back and forth in a video, the player must first find the nearest I-frame at or before the playhead. Then it applies the P-frames and B-frames that follow that I-frame, one after another, until it reaches the exact picture you want. (Some players skip this and just forcibly land you on the nearest I-frame. You may have experienced this if your playhead jumps back or forward a bit from where you let go of it.)

This is because P-frames and B-frames are partial frames. You can’t just display them on their own because they only describe how the picture has changed relative to other frames, and that chain of references leads back to an I-frame. (Like I said before, if I-frames are instructions like “draw a blue sky background with white clouds and a person wearing a green dress in front”, P and B-frames are like “the green dress moved right by two hundred pixels.” You can’t draw P and B-frames without reading the I-frame instruction first, and neither can a computer.)
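
The seek logic described above can be sketched in a few lines of Python. This is a simplified model, not code from any real player: it assumes a stream made of I and P-frames only, written as a string like "IPPPIPPP", and it assumes every I-frame is a valid place to start decoding. The function name frames_to_decode is made up for this post.

```python
def frames_to_decode(frame_types, target):
    """Return the indexes a player has to decode to show frame `target`.

    `frame_types` is a string of "I" and "P" characters, one per frame.
    """
    if not 0 <= target < len(frame_types):
        raise IndexError("target frame is outside the video")
    # Look backwards for the nearest I-frame at or before the target.
    start = frame_types.rfind("I", 0, target + 1)
    if start == -1:
        raise ValueError("no I-frame at or before the target frame")
    # Everything from that I-frame up to the target must be decoded in order.
    return list(range(start, target + 1))


if __name__ == "__main__":
    stream = "IPPPIPPP"
    for target in (0, 3, 4, 6):
        print(target, frames_to_decode(stream, target))
```

The further the playhead lands from the previous I-frame, the longer the list gets, which is why videos with I-frames spaced far apart take more work per scrub step.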

Now, devices (modern ones, anyway) can decode I-frames and P-frames relatively fast. So if you scrub through a video that is made up of I and P-frames only, it will be smooth. But B-frames take much more time and compute power to process, which means devices may start to lag and jank around when you scrub fast across the timeline.

The solution?

Be gone, B-frames!

You can re-encode videos to remove all B-frames from them! Here’s a quick ffmpeg command:

ffmpeg -i input.mp4 -bf 0 output.mp4

Note that depending on the encoder used, you may need to type something different, like:

ffmpeg -i input.mp4 -c:v libx264 -x264-params bframes=0 output.mp4
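
If you want to check that the re-encode really got rid of the B-frames, one option is to ask ffprobe (which ships with ffmpeg) for the picture type of every frame and count them. Below is a rough Python sketch of that. The helper names count_picture_types and probe_picture_types are my own, it assumes ffprobe is on your PATH, and it only looks at the first video stream. Some ffprobe versions append extra fields to each line, so the parser keeps just the first comma-separated value.

```python
import subprocess
import sys
from collections import Counter


def count_picture_types(ffprobe_output):
    """Count I, P, and B frames in ffprobe's one-type-per-line output."""
    counts = Counter()
    for line in ffprobe_output.splitlines():
        kind = line.split(",")[0].strip()  # ignore any trailing fields
        if kind:
            counts[kind] += 1
    return counts


def probe_picture_types(path):
    """Run ffprobe on `path` and return a Counter of picture types."""
    command = [
        "ffprobe", "-v", "error",
        "-select_streams", "v:0",            # first video stream only
        "-show_entries", "frame=pict_type",  # print just the frame type
        "-of", "csv=p=0",                    # one bare value per line
        path,
    ]
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return count_picture_types(result.stdout)


if __name__ == "__main__" and len(sys.argv) > 1:
    counts = probe_picture_types(sys.argv[1])
    print(dict(counts))
    print("B-frames found!" if counts.get("B") else "No B-frames.")
```

Save it under any name you like and pass the video path as the only argument. Note that ffprobe has to read the whole file, so this can take a while on long videos.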

Videos that scrub smoothly

If you want to try a video that scrubs smoothly, and if you own a Nintendo Switch, you’re in luck. Gameplay footage saved on a Nintendo Switch is made up entirely of I and P-frames. Just grab a clip and try scrubbing through it! This is a quick and easy example to check if you can’t be bothered with re-encoding video.
