Saturday, December 18, 2010

Parsing Level 1 : Lazy matching

All compressors that you can download on my site are using a pretty simple logic :

On each position, we try to find a match (the best possible one for LZ4 HC, or just a good enough one for all others).  If we do, we encode a match, then move forward by match length, a try again.

This is known as "greedy matching". We just use a match as soon as we get an opportunity for it.

However, even when making a full search for the longest possible match, this is not the best possible output. It's quite possible that, by using the match at current position, we miss a better one on next position.

Need an example ?
Just consider the following sentence :

(...) a roaming attempt (...) really necessary. (...) a really necessary (...)

On reaching the second "a", the match finder will find a 3 characters match : "a_r".
Reaching the final "y" will therefore need 2 match sequences, by then adding "eally necessary".

However, should we wait for the next position, we would create an "a" as as literal.
And then, all the rest of the sentence would be described as a single match.

This is cheaper. Considering LZ4 format (that's just an example), a match sequence costs 3 bytes, and a literal costs 1 byte. This gives us :
Don't wait (greedy) : 3 + 3 = 6 Bytes
Wait one byte : 1 (literal) + 3 (match) = 4 Bytes.
So we would lose 2 bytes with greedy matching.

There are a number of way to improve on this situation. Let's start with the simplest one : lazy matching.

The idea is very simple : on finding a match, we keep it in memory and try to find another and better match at the next position.
If we don't find a better match, then we use the first match.
If we do find a better match, we use this one instead.

The idea can also be applied recursively; that means that if we find a better match, we continue checking the next position, and so on.

This idea is known as "lazy matching", and used for several years in LZ compression programs.
However, it obviously requires more searches, and is therefore slower.
This is a typical trade-off that each compression program must choose.

For maximum compression strategy, advanced parsing is a necessity.

2 comments:

  1. sometimes we get better length matches but the cost sometimes that the better length matches found in far position which will cost more to point on it.

    ReplyDelete
    Replies
    1. Correct, although such cases become careful only when the distance is a lot larger while the match length is only marginally improved.

      But that's not a problem for LZ4, since all distances have the same cost...

      Delete