Saturday, November 20, 2010

First results of fusion search

As stated in an earlier post, segment search provides a nice boost to compression speed when compressed files contain some long stream of identical characters. However, this benefit does not ramp up with search depth, as there is no gradual elimination of bad choices, like MMC is doing for normal sequences.

Therefore, the ideal solution would be to "merge" both algorithms. But this is easier said than done.

Finally, after much time at correcting many small inefficiencies and inaccuracies, a first version of this merged algorithm is able to provide some figure, in order to verify the claim.
The current version is only limited to handling streams of zeros. When a segment is detected, it is fully populated with pointers to previous best possible match, then the search continue using the normal MMC algorithm.

This approach results in better elimination of bad positions, and therefore less comparison checks. On the other hand it requires much heavier updates on chain table.

So where does that lead us to ? Here are some extracted statistics, comparing new fusion algorithm with current 0.5 (segment) version of LZ4HC :

                       MMC  Segment  Fusion/0  Improvement
Calgary 64K Searches  20.3M  6.55M    5.62M       17%
Firefox 64K Searches  70.3M  47.1M    45.3M        4%
Enwik8  64K Searches   188M   187M     187M        0%

Calgary 512K Searches 34.2M  14.6M    11.1M       32%
Firefox 512K Searches  153M   121M     115M        5%
Enwik8  512K Searches  435M   435M     435M        0%

Calgary 4M Searches   38.7M  18.5M    14.4M       28%
Firefox 4M Searches    271M   308M     295M        4%
Enwik8  4M Searches    753M   758M     758M        0%


There is definitely some improvement, but not that large.
It seems to get even better with larger window size (64K->512K), but the tendency is surprisingly reversed at largest tested size (4M).

There are a number of potential explanations for that.
To begin with, the new Fusion algorithm only deals with streams of zero. The Segment algorithm is still being used for all other streams. As a consequence, there is still room for improvement, as can be witnessed in Firefox at 4M range : we still have more comparisons than with MMC alone. So there are probably some tangible benefits to expect with generalizing Fusion to all byte streams.

Then, at 4M, collision is really large for Firefox, reaching 64M, which is more than 20% of comparisons. It is also more than 1M for Calgary (about 7%). This amount is not reduced by Fusion, therefore its relative benefit importance is reduced by increased share of collision.
The easier solution is to increase hash table, to reduce collision effect. However, as i want directly comparable results, hash table has to remain the same for all search depth and all search algorithms.
Maybe i'll change that policy, to allow adapted hash size per search depth.

Calgary is also a strange beast : its size is less than 4M, so it does not fully benefit from largest window size. Still, the small difference between 512K and 4M is quite unexpected. This observation is valid whatever the search algorithm, so there is definitely something at stake. Maybe the fact that calgary.tar contains some very different file type, which individual size is less than 512K, could explain this behavior.

Next logical step seems to be generalization of Fusion, and measure potential improvements.

No comments:

Post a Comment