March 27th, 2017    

CIS 751



Database Notes

Final (Take Home)

Notes 0005

Data Compression

Nearly all the data we deal with on a day-to-day basis has some form of redundancy. For example, in English the letter E appears very often, while the letter Z is relatively rare. This is called `alphabetic redundancy'. Some letters also tend to follow each other; for example, the letter T is often followed by H, as in `th'. This is called `contextual redundancy'. In the case of images, the analogous observation is that pixels close to each other tend to have the same or similar color.
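The `alphabetic redundancy' idea is easy to see by counting letters in a sample of text. Below is a minimal sketch; the sample sentence is just an illustration, not from the notes.

```python
# A quick sketch of measuring `alphabetic redundancy': count how often
# each letter appears in a sample of English text.
from collections import Counter

text = "the quick brown fox jumps over the lazy dog and then the fox sleeps"
letters = [c for c in text.lower() if c.isalpha()]
freq = Counter(letters)

# Letters sorted from most to least common in this sample.
for letter, count in freq.most_common():
    print(letter, count)
```

Even in a short sample like this, E comes out on top, which is exactly the skew a compressor exploits.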

Similar ideas apply to ASCII codes in the computer. While each letter takes up 8 bits, the usual `range' of English text doesn't require nearly that much; i.e., there is lots of redundancy.

The point of data compression is to reduce this redundancy.

More precisely, we want to encode efficiently the things that occur often (and possibly less efficiently the things that occur rarely). As an `alphabetic redundancy' example, this would mean replacing E with something like `1' (1 bit), while possibly replacing Z with something like `0000000001' (10 bits). While it may appear inefficient to replace one letter with 10 bits, since Z appears so rarely and E so often, in the end we save space (and reduce the redundancy).

Kinds of Compression

Nonadaptive compression - where the compression method is not affected by the data being compressed.

Lossy/Lossless compression - The process of compression can lose information (or it may not). To see where this is useful, consider images. Nobody would really notice if a shade of some color was missing, so you can get rid of it. In a lossy scheme, if you compress and then decompress the data, the result won't be identical to what you started with. Lossy compression is primarily used for images, sound, and video; applying it to text isn't of much use.

Symmetrical compression - where compression and decompression are essentially the same algorithm working in reverse.

Strange Methods

There are a ton of data compression methods. Many are too old or too specialized to base a detailed discussion on. Here are some examples:

Compressing text by throwing away special characters and spaces. You'll notice that you can still read the text.

We know that English text has 26 letters, 10 digits, and possibly a few more punctuation and control symbols. We can conceivably represent all the `needed' characters in 6 bits. So, given text in ASCII, we can automatically `compress' each byte by 2 bits and save some space.
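The 6-bit idea can be sketched as follows. The particular 38-character alphabet below (lowercase letters, digits, space, period) is an assumption for illustration; a real scheme would fix the code table as part of the file format.

```python
# A minimal sketch of the 6-bit packing idea: give each character in a
# small alphabet its own 6-bit code, then pack the codes into bytes.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 ."
CODE = {ch: i for i, ch in enumerate(ALPHABET)}  # each code fits in 6 bits

def pack6(text):
    """Concatenate 6-bit codes and split the bit string into bytes."""
    bits = "".join(format(CODE[ch], "06b") for ch in text)
    bits += "0" * (-len(bits) % 8)  # pad to a whole number of bytes
    return bytes(int(bits[i:i+8], 2) for i in range(0, len(bits), 8))

packed = pack6("hello world.")
print(len("hello world."), "chars ->", len(packed), "bytes")  # 12 chars -> 9 bytes
```

Twelve 8-bit characters become nine bytes, exactly the 2-bits-per-byte saving described above.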

If we know that data is in a certain order (like in a dictionary) we can omit certain pieces of data. For example, in the case of a dictionary, we don't really need to list the beginning of each word, since words close together tend to share the same beginning. This technique is known as front compression.
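Front compression can be sketched as storing, for each word, how many leading characters it shares with the previous word, plus the differing tail. The word list below is an illustrative assumption.

```python
# A sketch of front compression on a sorted word list: store only how many
# leading characters each word shares with the previous one, plus the tail.
words = ["compress", "compression", "compressor", "computer", "concept"]

def front_encode(words):
    out, prev = [], ""
    for w in words:
        # length of the common prefix with the previous word
        n = 0
        while n < min(len(prev), len(w)) and prev[n] == w[n]:
            n += 1
        out.append((n, w[n:]))
        prev = w
    return out

def front_decode(pairs):
    out, prev = [], ""
    for n, tail in pairs:
        w = prev[:n] + tail
        out.append(w)
        prev = w
    return out

encoded = front_encode(words)
print(encoded)  # e.g. "compression" becomes (8, "ion")
assert front_decode(encoded) == words
```

Since the list is sorted, the shared-prefix counts are large and the stored tails are short.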

More `Serious' Methods


RLE is run-length encoding. The basic idea is that if we have repetition in the text, we can represent each run by its character and a count. For example, let's say we have text that looks something like this:

aaaabbbbbbbbbcccccc

Then we can encode that as ``a4b9c6''. Or something along these lines. Notice that if the text isn't `right' for this compression, the resulting output can actually be bigger than what we started with.
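The ``a4b9c6''-style encoding can be sketched as below. For simplicity this sketch assumes no run is longer than 9, so one digit suffices per run; a real RLE needs an escape scheme for longer runs and for digits in the input.

```python
# A minimal run-length encoder/decoder (single-digit counts assumed).
from itertools import groupby

def rle_encode(text):
    # groupby yields each run of identical characters in order
    return "".join(ch + str(len(list(run))) for ch, run in groupby(text))

def rle_decode(encoded):
    # consume (character, single-digit count) pairs
    return "".join(encoded[i] * int(encoded[i + 1])
                   for i in range(0, len(encoded), 2))

print(rle_encode("aaaabbbbbbbbbcccccc"))  # a4b9c6
assert rle_decode("a4b9c6") == "aaaabbbbbbbbbcccccc"
```

Note the expansion hazard mentioned above: encoding "abc" this way yields "a1b1c1", twice the original size.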

RLE is used in a relatively old image file format: PCX files.

The general-purpose RLE can compress images better than PCX files do by scanning the data in various directions. Instead of scanning the input from left to right, we can scan it sideways, in a zigzag, etc.

We can even get lossy compression from RLE by simply throwing away short runs of pixels.

We can also encode the difference between the current pixel and the last pixel, etc.
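The difference idea can be sketched as delta encoding: store the first value and then only each pixel's difference from the previous one. On smooth image rows the deltas are small and repetitive, which suits RLE and the statistical methods below. The sample row is an illustrative assumption.

```python
# A sketch of delta encoding a row of pixel values.
def delta_encode(values):
    # keep the first value, then store successive differences
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)  # rebuild by running sum
    return out

row = [100, 101, 101, 103, 104, 104, 200]
deltas = delta_encode(row)
print(deltas)  # [100, 1, 0, 2, 1, 0, 96]
assert delta_decode(deltas) == row
```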

Statistical Methods


This is where we figure out redundancy, and then directly attack it.

The process generally goes like this: We go through the input data computing something (usually the number of times each byte occurs). We then come up with a code for each byte, based on the frequency of that character. So if something occurs often, the code will be short, and if something occurs rarely, the code will be long.

The most `famous' (and optimal) approach to figuring out these codes is Huffman compression. It goes something like this:

Figure out the frequencies of each character (for every character, count the number of times it occurs).

For example, let's say we have the phrase: ``tobeornottobe''.

Frequencies: t:3 o:4 b:2 e:2 r:1 n:1

We order them in ascending order... [n:1][r:1][e:2][b:2][t:3][o:4]

We then pick out the least common two nodes, and combine them:[[n:1][r:1]:2][e:2][b:2][t:3][o:4], and reinsert it into the list (by the new combined weight).

We then continue on to do the same thing again: [[[n:1][r:1]:2][e:2]:4] [b:2][t:3][o:4], and reinsert into the list based on the weight: [b:2][t:3][[[n:1][r:1]:2][e:2]:4][o:4]

We do the same thing again (combine the two least frequent nodes): [[b:2][t:3]:5][[[n:1][r:1]:2][e:2]:4][o:4], and reinsert it into the list: [[[n:1][r:1]:2][e:2]:4] [o:4] [[b:2][t:3]:5]

We continue onwards: [[[[n:1][r:1]:2][e:2]:4] [o:4]:8] [[b:2][t:3]:5], and now to reinsert it: [[b:2][t:3]:5] [[[[n:1][r:1]:2][e:2]:4] [o:4]:8].

Now we have 2 nodes left, but we still continue and combine them: [[[b:2][t:3]:5] [[[[n:1][r:1]:2][e:2]:4] [o:4]:13]

Notice that now we're left with 1 node (more info on the board).

Now, according to the `tree' we've constructed, we can replace every letter with the path through the tree: left:0, right:1.

ie: t:01 o:11 b:00 e:101 r:1001 n:1000

Now we can replace ``tobeornottobe'' with: ``01 11 00 101 11 1001 1000 11 01 01 11 00 101'', which is 32 bits, or 4 bytes. (quite an improvement over the original which is 13 bytes).
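The whole procedure above (count, repeatedly merge the two lightest nodes, read codes off the tree) can be sketched with a heap. Ties in frequency can be broken either way, so the individual codes this produces may differ from the ones listed above, but every valid Huffman tree gives the same total encoded length: 32 bits for ``tobeornottobe''.

```python
# A sketch of Huffman coding applied to the phrase from the notes.
import heapq
from collections import Counter

def huffman_codes(text):
    freq = Counter(text)
    # heap entries: (weight, tiebreak, tree); a tree is a char or a pair
    heap = [(w, i, ch) for i, (ch, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        # combine the two least frequent nodes and reinsert the result
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, i, (t1, t2)))
        i += 1
    codes = {}
    def walk(tree, path):
        if isinstance(tree, tuple):
            walk(tree[0], path + "0")  # left: 0
            walk(tree[1], path + "1")  # right: 1
        else:
            codes[tree] = path
    walk(heap[0][2], "")
    return codes

text = "tobeornottobe"
codes = huffman_codes(text)
encoded = "".join(codes[ch] for ch in text)
print(len(encoded))  # 32 bits, i.e. 4 bytes vs. the original 13
```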

More examples in class.

Arithmetic Coding

When we write down floating point numbers, the more digits we write down, the more `precise' the number is. We apply the same idea to compression: we create a `number' between 0 and 1, and with every letter it becomes more and more precise.

Even though Huffman compression is `optimal', it assigns each letter a whole number of bits (we cannot replace a letter with a fractional number of bits). Arithmetic coding achieves that finer level of compression. Example in class.
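The interval-narrowing idea can be sketched as below, using floating point for clarity (real arithmetic coders use integer arithmetic to avoid precision loss on long inputs). The three-symbol model and its probabilities are illustrative assumptions.

```python
# A sketch of arithmetic coding's interval narrowing for a fixed model.
probs = {"a": 0.5, "b": 0.3, "c": 0.2}

# cumulative ranges: a -> [0.0, 0.5), b -> [0.5, 0.8), c -> [0.8, 1.0)
ranges, start = {}, 0.0
for ch, p in probs.items():
    ranges[ch] = (start, start + p)
    start += p

def encode_interval(text):
    low, high = 0.0, 1.0
    for ch in text:
        lo, hi = ranges[ch]
        width = high - low
        # shrink [low, high) to the subinterval belonging to ch
        low, high = low + width * lo, low + width * hi
    return low, high  # any number in [low, high) identifies the text

low, high = encode_interval("aab")
print(low, high)  # the interval [0.125, 0.2)
```

Frequent symbols own wide subintervals, so they shrink the interval only a little; that is how a symbol can cost a fraction of a bit.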


What can be compressed? Non-compressed files, i.e., files with lots of redundancy.

We cannot compress already-compressed data much further, since it doesn't have much redundancy left.

Similarly, we cannot compress random data. `True' random data has no redundancy.
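This claim is easy to check with a general-purpose compressor such as the standard-library zlib: highly redundant data shrinks dramatically, while random data comes out at about its original size (or slightly larger, due to format overhead).

```python
# Redundant vs. random data under a general-purpose compressor.
import os
import zlib

redundant = b"abc" * 1000       # 3000 bytes of pure repetition
random_data = os.urandom(3000)  # 3000 bytes with no redundancy

print(len(zlib.compress(redundant)))    # a few dozen bytes
print(len(zlib.compress(random_data)))  # roughly 3000 bytes or a bit more
```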

We can think of the limit on compression in terms of Kolmogorov complexity. Basically, we cannot compress data below its Kolmogorov complexity and still have lossless compression. More on this in class.

© 2006, Particle