Have you ever wondered how it is possible to put a full-length movie, which used to fit on dozens of rolls of film, on a memory stick that hangs on your keychain? Actually, it is usually possible to put dozens, even hundreds, of full-length movies on a memory stick today. This is digital video compression.
Video compression is one of the great successes in highly mathematical software outside of the research laboratory. A genuine technological breakthrough in video compression in 2003 enabled many video services such as Netflix, YouTube, and Skype that are in widespread use today.
What is Video Compression?
Let’s start with some basics. In computers and electronics, a bit is a single piece of information, just a 0 or a 1. A byte is a sequence of eight bits. A single character in a text file such as A or B is usually stored as a single byte (eight bits).
A kilobit (kb) is a thousand bits. A kilobyte (KB) is a thousand bytes. A megabit (Mb) is a million bits. A megabyte (MB) is a million bytes. A gigabit (Gb) is a billion bits. A gigabyte (GB) is a billion bytes. A terabit (Tb) is a trillion (one million million) bits. A terabyte (TB) is a trillion bytes. A petabit (Pb) is a quadrillion (one million billion) bits. A petabyte (PB) is a quadrillion bytes. Computer people frequently use a capital B as an abbreviation for byte and a lowercase b as an abbreviation for bit.
The size of files such as movies on a computer is usually expressed in bytes, or sometimes bits. An uncompressed text-only book (no illustrations) is often about one megabyte (MB). A full-length movie compressed with modern video compression technology takes up about 600 megabytes (MB) on a computer disk drive. A full-length movie compressed with the leading video compression technology of the 1990s (known as MPEG-2) takes about 3-4 gigabytes (GB).
The number of bits of compressed video for each second of video playback is known as the bitrate. This is a particularly important number when transmitting or playing a video over a computer network. YouTube and other Internet videos today (2013) often have a bitrate of 250-300 kilobits per second.
A digitized but uncompressed full-length movie is usually stored as 720 by 480 pixel frames at 24 frames per second. Each pixel is three bytes, one byte for each color component (loosely, red, green, and blue). Color in video is very complex, and I am simplifying the discussion of color. An uncompressed ninety-minute movie is 134 gigabytes (GB)! The bitrate of uncompressed digital video is about 199 megabits per second (Mbps).
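These figures follow from straightforward arithmetic. Here is a quick back-of-the-envelope check in Python, using the frame size, color depth, and frame rate given above:

```python
# Back-of-the-envelope arithmetic for uncompressed digital video,
# using the figures quoted above: 720x480 frames, 3 bytes per pixel, 24 fps.
width, height = 720, 480       # pixels per frame
bytes_per_pixel = 3            # one byte per color component
fps = 24                       # frames per second
minutes = 90                   # a ninety-minute movie

bytes_per_frame = width * height * bytes_per_pixel
bytes_per_second = bytes_per_frame * fps
bitrate_mbps = bytes_per_second * 8 / 1_000_000             # megabits per second
total_gb = bytes_per_second * minutes * 60 / 1_000_000_000  # gigabytes for the movie

print(f"{bitrate_mbps:.0f} Mbps, {total_gb:.0f} GB")        # about 199 Mbps and 134 GB
```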
What does this mean? An old-fashioned 1990s DVD compresses a full-length movie by a ratio of about 33.5 to one (134 GB / 4 GB)! The advances in video compression in 2003 make it possible to compress a full-length movie by a ratio of about 225 to one (or even higher)!
This would be as if — moving cross country — you could take all your possessions weighing 2000 pounds and shrink them down to less than ten pounds, put the compressed items in a small suitcase and drive from New York City to Los Angeles — then reconstitute your possessions in LA.
How Does Video Compression Work?
Under the hood, video compression is extremely complex. The programs (the jargon is video codec, short for video coder/decoder) that compress (encode) and decompress (decode) video run to tens of thousands of lines of computer code, often in the C programming language. For example, the free open-source x264 H.264 video encoder is over 67,000 lines of code. A line of code is something like a single moving part in a complex machine such as an automobile or a rocket engine. A modern video codec is comparable in complexity to a rocket engine such as the Space Shuttle Main Engine, which had about 50,000 parts.
In video compression, the original uncompressed digital video is converted (“encoded” or “compressed”) into a sequence of digital codes that are stored in memory or transmitted. These codes represent the uncompressed digital video but with fewer bits. The video player or decoder converts (“decodes”) the sequences of digital codes into uncompressed digital video which is then displayed. Video compression enthusiasts often refer to the video compression process as “encoding” the video and the playback as “decoding”.
Video codecs are difficult to implement. Like rocket engines, even a single error is often fatal. A single bug in a video codec often results in gross visible artifacts in the video that make it unwatchable. Video codecs are sufficiently complex and interrelated that it can take weeks to locate and fix a single bug.
A good programmer may have an error rate of about one bug per one hundred lines of code. A programmer who implements a 30,000-line video codec will therefore produce about three hundred bugs. If every bug must be fixed and each bug takes a week, this works out to about six years of debugging. In practice, bugs don't always take a week to find and, more importantly, modern video codecs are usually implemented by small teams of programmers.
I won’t go into the extensive, complex details of how video codecs work, but I will discuss the basic principles used by video codecs today. Keep in mind that real video codecs are much more complex than the simplified explanations below.
Omit Fine Details
The human visual system (eyes, optic nerves, and brain) actually cannot perceive many of the fine details that video cameras capture and digitize. In addition, there are fine details that humans can perceive but do not care much about and do not miss. For example, humans mostly perceive the edges or boundaries of objects and of components of objects. We rarely notice fine details of the textures of objects such as skin or clothing.
If you look closely at highly compressed video on YouTube or elsewhere, you will notice that the textures of objects are often smoothed out and lack fine detail. In rare cases, they will look blurry. This is the video compression at work.
In most videos, our attention is largely focused on the faces of the people in the video. This means that if the face (the eyes, nose, mouth, hairline, etc.), the skin color, and to a lesser degree the texture of the skin of the face are correct, we won't even notice problems elsewhere in the video. We often pay little attention to the backgrounds, the details of the speaker's clothing, and so on. The farther from the faces, the less we tend to care.
To be sure, there are exceptions to this, but they are exceptions.
Video compression technologies use mathematical methods such as the Discrete Cosine Transform (DCT) to filter out and heavily compress these fine details that humans either cannot perceive at all or pay little attention to.
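As a rough illustration of the idea (a toy sketch, not a real codec's transform pipeline; real codecs apply a two-dimensional transform to blocks of pixels and quantize the coefficients), here is a naive one-dimensional DCT applied to a row of eight pixel values:

```python
import math

def dct(block):
    """Naive 1-D DCT-II: express the samples as a sum of cosine frequencies."""
    n = len(block)
    return [sum(x * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i, x in enumerate(block))
            for k in range(n)]

def idct(coeffs):
    """Inverse transform (scaled DCT-III) reconstructing the samples."""
    n = len(coeffs)
    return [(coeffs[0] / 2
             + sum(c * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                   for k, c in enumerate(coeffs[1:], start=1))) * 2 / n
            for i in range(n)]

# A smooth row of 8 pixel brightness values (a gentle gradient).
pixels = [100, 104, 109, 115, 120, 124, 127, 128]
coeffs = dct(pixels)

# For smooth content, the energy concentrates in the low-frequency coefficients.
# Discard the fine-detail (high-frequency) half, as a codec's quantizer would.
kept = coeffs[:4] + [0.0] * 4
approx = idct(kept)
print([round(p, 1) for p in approx])  # very close to the original pixel values
```

Half of the coefficients have been thrown away, yet the reconstructed row is nearly indistinguishable from the original, because the discarded coefficients carried only fine detail.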
Only Encode Changes from Frame to Frame
Much of the dramatic success of video compression is due to mathematical techniques, known as motion estimation and motion compensation, that encode and transmit only changes between frames.
Consider a simple example: a “talking head” video. This is a common type of video in which a speaker talks in front of a static, unchanging background with little or no movement. His or her lips move, the eyes move, and very little else. This type of video is especially easy to compress. If we encode and transmit only the changes between frames (the background never changes), we can achieve very high compression levels.
Modern video compression methods are designed to compress more difficult video. Consider for example two people tossing a ball back and forth between them. They are standing in front of a mostly static background. Loosely, we can detect and track the movement of the ball and send only that movement from frame to frame. This is roughly what motion estimation and motion compensation do.
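The core of motion estimation is block matching: for each block of pixels in the current frame, search the previous frame for the best-matching block, and transmit only the offset (the motion vector). Here is a toy sketch in Python, with made-up 8-by-8 frames (real codecs use far larger frames, larger search ranges, and fast search heuristics):

```python
def sad(block_a, block_b):
    """Sum of absolute differences: a simple measure of how poorly blocks match."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

def block_at(frame, y, x, size):
    return [row[x:x + size] for row in frame[y:y + size]]

def find_motion(prev_frame, cur_frame, y, x, size=4, search=3):
    """Find where the block at (y, x) in cur_frame came from in prev_frame."""
    target = block_at(cur_frame, y, x, size)
    best_cost, best_vec = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            py, px = y + dy, x + dx
            if 0 <= py <= len(prev_frame) - size and 0 <= px <= len(prev_frame[0]) - size:
                cost = sad(block_at(prev_frame, py, px, size), target)
                if best_cost is None or cost < best_cost:
                    best_cost, best_vec = cost, (dy, dx)
    return best_vec  # the motion vector: transmit this instead of raw pixels

# Tiny 8x8 frames: a bright 4x4 "ball" on a dark background, moving 2 pixels right.
prev_frame = [[200 if 2 <= y < 6 and 1 <= x < 5 else 10 for x in range(8)] for y in range(8)]
cur_frame  = [[200 if 2 <= y < 6 and 3 <= x < 7 else 10 for x in range(8)] for y in range(8)]

print(find_motion(prev_frame, cur_frame, y=2, x=3))  # → (0, -2): from 2 pixels to the left
```

Instead of re-sending the pixels of the ball in every frame, the encoder sends a tiny motion vector, plus whatever small residual difference remains after the match.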
Use More Bits for Rare Occurrences
This is known by the fancy and rather confusing buzzphrase entropy coding. The basic idea is simple: use fewer bits (information) to encode common occurrences in the video and more bits (information) to encode rare occurrences in the video.
This is actually the way languages such as English mostly work. We have short, one-syllable words (such as “he”, “she”, “man”, “dog”, and “door”) for objects and concepts often used in conversation. English and other languages use longer, multi-syllable words for rarely encountered objects and concepts such as “xylophone”. On average, this enables us to communicate faster.
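Huffman coding is the classic example of this idea (modern codecs such as H.264 use more elaborate schemes, like arithmetic coding, but the principle is the same). A minimal sketch, with made-up symbols standing in for common and rare events in a video:

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Build a prefix code: frequent symbols get short codes, rare ones long codes."""
    freq = Counter(symbols)
    if len(freq) == 1:                     # degenerate case: only one symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, tie-break counter, {symbol: code-so-far}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)     # merge the two least-frequent subtrees
        fb, _, b = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in a.items()}
        merged.update({s: "1" + c for s, c in b.items()})
        heapq.heappush(heap, (fa + fb, counter, merged))
        counter += 1
    return heap[0][2]

# Mostly "no change" between frames, occasionally small or big changes.
data = ["same"] * 90 + ["small"] * 8 + ["big"] * 2
codes = huffman_codes(data)
encoded_bits = sum(len(codes[s]) for s in data)

# A fixed-length code would need 2 bits per symbol: 200 bits for 100 symbols.
print(codes, encoded_bits)  # the common symbol gets a 1-bit code; 110 bits total
```

Because the overwhelmingly common symbol costs only one bit, the total comes out well below what a fixed-length code would need, which is exactly the savings entropy coding delivers.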
Video compression programs combine advanced mathematical versions of these three methods in very complex ways to achieve the dramatic levels of compression that most people not only take for granted today but often are not aware of at all!
Some History of Video Compression
Digital video compression took off in the Far East in the early 1990s with a technology known as VideoCD that used the original MPEG-1 digital video compression standard from ISO (the International Organization for Standardization). VideoCD was used in Japan, Hong Kong, and other Asian nations for games, karaoke, some mainstream movies, and especially pornography.
VideoCD and MPEG-1 had a bitrate of about 1 megabit per second and achieved a video quality comparable to old analog NTSC television. This is about as low as one can go in video quality and still achieve widespread use.
VideoCD never took off in the United States, although it had some limited success with aficionados of porn. The DVD (Digital Versatile Disc) and the MPEG-2 digital video compression that it used did take off and achieve widespread mainstream use in the United States and around the world.
DVDs use MPEG-2 digital video with a bitrate of about 4-6 megabits per second. Some high-action video, such as sports or movies with heavy action, requires higher bitrates. MPEG-2 is very similar to MPEG-1 digital video but adds support for the alternating fields (interlacing) of television video and some other features. The basic compression is nearly the same and does not significantly outperform MPEG-1. That is, an MPEG-2 video with a bitrate of 4-6 megabits per second looks mostly the same as an MPEG-1 video at 4-6 megabits per second. MPEG-2 was also used for distributing cable television video and some other purposes.
There were many attempts between 1995 and 2003 to achieve much higher compression ratios (lower bitrates for the same perceived video quality), with negligible success. Higher compression means more cable TV channels, for example, and presumably more money. Probably most significantly, if one could push the bitrate below the 384 kilobits per second rate of basic Digital Subscriber Line (DSL), it would be possible to distribute videos in real time over the Internet as Netflix, YouTube, and others do today.
In 2003, video compression leaped forward in a rare technological breakthrough. A number of improvements, especially in motion compensation and motion estimation, were combined successfully in a new version of the H.264 video-conferencing standard (H.264/AVC), then added to the MPEG-4 video compression standard, and rapidly adopted in other video codecs such as Windows Media, Adobe Flash, and Xiph.org's Ogg Theora. The exact origins of these advances remain a bit murky. There does not seem to be a good account of the breakthrough, and I would be cautious of any account. There are various patents, and undoubtedly there are lawsuits afoot over who did what when.
In 2003, it became possible to download near-DVD-quality videos over a basic DSL line. A typical new video had a bitrate from as low as 140 kilobits per second for some talking-heads material to 350 kilobits per second (below the magic 384 kilobits per second). Netflix, YouTube, and many other services that we take for granted today became feasible.
Remarkably, when I talk to most audiences about video compression, most people never even noticed. They just took the sudden appearance of high quality real-time video over the Web for granted! It is rather as if Detroit came out with a car that got 200 miles to the gallon, the United States pulled out of the Middle East, and no one noticed!
The breakthrough video compression in 2003 had some initial problems. If one watched closely, there was an occasional jitter between frames which could be annoying. More seriously, the skin tone was often off or even somewhat pasty. Since our attention is mostly on the faces of speakers and actors, the skin tone was an especially serious problem, although the video was watchable.
In about 2008, there were widespread advances in reproducing skin tones more accurately, so that today it is rare, though not unheard of, to see a video with poor skin tones. Bitrates did not improve, but the perceived quality did. The jitter that I mentioned also seems to have mostly gone away.
It may be possible to improve the compression further. It is claimed that the new H.265 video compression standard achieves two times the compression of the current methods. There appear to be a number of groups trying to improve compression further.
The current compression ratio is very high (around 225:1 for high quality video). I did some theoretical calculations at NASA prior to 2003 that indicated it was (just) possible to achieve the compression levels that are now taken for granted. The calculations would also indicate it may be impossible or extremely difficult to get much more compression. Of course, in practice, theories can be wrong.
At present, the frontier of video compression lies in achieving reliable, easy-to-use video telephony and conferencing. Despite Skype and other video phone products, there remains a lot of room for improvement.
Although there has been some progress, it remains difficult to make a video phone call. Many big companies and organizations have sophisticated video conferencing systems that are often unused. Some organizations have large staffs to set up the video calls and conferences. In practice, the user interfaces and the systems are hard to use.
Technical Problems with Real-Time Video over the Internet
The Internet was designed over forty years ago, primarily for e-mail and other non-real-time text transmissions. This means that the Internet often cannot guarantee that a packet (e.g., a frame of video or associated audio) will arrive in time (say, within a quarter of a second for a phone conversation). In the old days, I would be happy if an e-mail reached someone in 24 hours. Even today, we rarely notice if an e-mail takes a few minutes to reach the recipient.
These delays have not been a serious problem for video downloading services such as Netflix or YouTube, because they can buffer minutes of video, or in some cases the entire video. If the Internet hangs or slows down for some reason, they can usually continue to play from the local buffer until downloading resumes.
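The effect of buffering can be sketched with a toy simulation (the rates and stall times below are made up for illustration):

```python
# Toy playback-buffer simulation. The numbers are made up for illustration:
# the player consumes 300 kilobits of video per second, the network delivers
# 400 kb/s, and the network stalls completely for a stretch in the middle.
def simulate(download_kbps, play_kbps, stall_start, stall_end, seconds):
    buffered_kb = 0.0
    paused_seconds = 0
    for t in range(seconds):
        if not (stall_start <= t < stall_end):
            buffered_kb += download_kbps   # the network delivers a second of data
        if buffered_kb >= play_kbps:
            buffered_kb -= play_kbps       # play one second of video
        else:
            paused_seconds += 1            # buffer empty: playback pauses
    return paused_seconds

# With a healthy buffer built up, a 10-second network stall goes unnoticed.
print(simulate(400, 300, stall_start=60, stall_end=70, seconds=120))  # → 0
```

A real-time phone call cannot build up a buffer like this, because a buffered second of audio is a second of delay in the conversation; that is exactly why the stalls that downloading services hide become audible dropouts in a call.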
In a phone conversation, we need to hear and see the other party within a fraction of a second of when they actually spoke, made a facial expression, gestured, or did something else. In this context, it can be difficult to use video compression effectively. It is still common to encounter audio dropouts, garbling, and a variety of other problems with Skype and other systems.
Video telephony and conferencing may offer huge economic benefits by greatly reducing travel time and costs, and by dramatically reducing demand for gasoline and other hydrocarbon products that have become increasingly expensive over the last decade.
For me, one of the most remarkable things is how unaware most people actually are of the video compression technology that they use and take for granted. Most people appear to have been completely unaware of the advances in 2003. Similarly, I find in conversations that most people are blissfully unaware of the sophistication and complexity hidden behind a YouTube or Netflix video player. They often seem to think it is quite easy to compress video.
Our enormous progress in video compression offers hope that we can successfully tackle other, more serious problems by combining the enormous power of today's computers and electronics with more advanced mathematics. Indeed, video compression is likely to help with our current energy shortage (rising prices mean a shortage in conventional economics).
So next time that you watch a Netflix or YouTube video, take a moment to reflect that you are watching a miracle of modern technology!
© 2013 John F. McGowan
About the Author
John F. McGowan, Ph.D. solves problems using mathematics and mathematical software, including developing video compression and speech recognition technologies. He has extensive experience developing software in C, C++, Visual Basic, Mathematica, MATLAB, and many other programming languages. He is probably best known for his AVI Overview, an Internet FAQ (Frequently Asked Questions) on the Microsoft AVI (Audio Video Interleave) file format. He has worked as a contractor at NASA Ames Research Center involved in the research and development of image and video processing algorithms and technology. He has published articles on the origin and evolution of life, the exploration of Mars (anticipating the discovery of methane on Mars), and cheap access to space. He has a Ph.D. in physics from the University of Illinois at Urbana-Champaign and a B.S. in physics from the California Institute of Technology (Caltech). He can be reached at [email protected].
See this announcement from Xiph.org on next-generation video: Introducing Daala