From jloup@chorus.fr Mon Jan 18 14:06:16 1993
Status: RO
X-VM-v5-Data: ([nil nil nil nil nil nil nil nil nil]
	["1689" "" "18" "January" "1993" "14:59:32" "GMT" "Jean-loup Gailly" "jloup@chorus.fr" nil "39" "Re: improvements to lha" "^From:" nil nil "1" nil nil nil nil]
	nil)
Newsgroups: comp.compression.research
Reply-To: jloup@chorus.fr (Jean-loup Gailly)
From: jloup@chorus.fr (Jean-loup Gailly)
Subject: Re: improvements to lha
Date: 18 Jan 93 14:59:32 GMT

jonathan forbes <jonathan.forbes@canrem.com> writes:

> Well, AR's method of creating a Huffman tree from a statistics table is
> rather complex; if there is a faster way of doing it, I'd love to know.

zip 1.9 spends a negligible portion of its time in the Huffman tree
construction. Most of the time is spent in the string matching.
Concerning the complexity of Huffman tree construction, see the FAQ
for textbooks.

> I thought AR002 used a digital hash trie.  I was thinking about having a
> 256*256 "perfect" hash table for the first occurance of two adjacent
> characters in a buffer, and then a separate array of size BUFLENGTH of
> the previous occurance of those two characters.  Brute force could be
> used to match lengths > 2.

This is close to what I'm doing in zip 1.9. I am however using a
smaller hash table, because my experiments showed that a 256*256 hash
table did not improve the speed. I am also using linked lists instead
of an array for the hash chains, because this uses less memory and
also because Robert Jung has patented (5,140,321) the lzrw3a algorithm
which uses arrays instead of linked lists. (If you want to argue about
the value of such a patent, please followup to misc.legal.computing,
not comp.compression.research.)

> What is lazy evaluation anyway? I did look at the ZIP 1.9 source a
> while ago, but it really didn't explain anything

Did you look at algorithm.doc in the zip 1.9 sources? (See the FAQ
for ftp sites.)

> and the match location routines simply weren't
> separable from the rest of the program.

Agreed. Restructuring the code to build a general purpose compression
library is on my list of things to do.

Jean-loup Gailly
info-zip

From nevries@accucx.cc.ruu.nl Thu Jan 14 13:06:52 1993
Status: RO
X-VM-v5-Data: ([nil nil nil nil nil nil nil nil nil]
	["1772" "" "13" "January" "1993" "23:57:25" "GMT" "Nico E de Vries" "nevries@accucx.cc.ruu.nl" nil "40" "Re: improvements to lha" "^From:" nil nil "1" nil nil nil nil]
	nil)
Newsgroups: comp.compression.research
Distribution: comp
Organization: Academic Computer Centre Utrecht
From: nevries@accucx.cc.ruu.nl (Nico E de Vries)
Subject: Re: improvements to lha
Date: 13 Jan 93 23:57:25 GMT

In <1993Jan12.2816.1018@dosgate> jonathan.forbes@canrem.com (jonathan forbes) writes:

>-> By the way, the basic algorithm of LHA is in my public-domain pre-LHA
>-> experimental archiver ar002 (which I wrote when I was a high-school
>-> teacher).  It's in CompuServe's IBMPRO Library and also on the disk

>What do you think programs such as ARJ are doing to get compression
>improvements over LHA?  Simply increasing the buffer size from 8K to 24K
>can't be all it is.  Perhaps there is a better way of encoding the

Actually buffer size is the most major factor. This is not unlimited
BTW. Making in 300k doesn't improve it on 200k.

>huffman tree and outputting it - there may also be a faster way of
>generating the huffman tree.

That is less interesting. Most time is soend looking for matches.

>Also, do you know of a faster way to locate matches (than the digital
>hash trie used in AR002)?  It doesn't matter if it takes more memory;
>I'm just looking for speed improvements in the match finding process.

The digital trie is nice from a theoretic viewpoint but useless in
practice. The overhead is too high. Simply brute force searching
with hashing  to speed it up is sufficient. (some small tricks
like lazy evaluation are needed of cource but those are already
available in AR002). Look at the ZIP 1.9 source to see excactly
what I mean.

>AR002 is a great program!

It is the best. ARJ, PKZIP and LHA are all derived from it. All faster
but essential no better compression.


Nico de Vries  "Personal opinion, 'AS IS', no waranties, bad grammar" etc.
    InterNet: nevries@cc.ruu.nl   CompuServe: 100115,2303

To get (many) ==> LOSSLESS DATA COMPRESSION SOURCES <== get lds_10.zip at 
garbo.uwasa.fi /pc/programming. Make files for Borland C are included. 

From jloup@chorus.fr Tue Jan 19 14:32:55 1993
Status: RO
X-VM-v5-Data: ([nil nil nil nil nil nil nil nil nil]
	["1886" "" "19" "January" "1993" "09:16:56" "GMT" "Jean-loup Gailly" "jloup@chorus.fr" nil "38" "Re: improvements to lha" "^From:" nil nil "1" nil nil nil nil]
	nil)
Newsgroups: comp.compression.research
Reply-To: jloup@chorus.fr (Jean-loup Gailly)
Distribution: comp
From: jloup@chorus.fr (Jean-loup Gailly)
Subject: Re: improvements to lha
Date: 19 Jan 93 09:16:56 GMT

jonathan forbes <jonathan.forbes@canrem.com> writes:

> The ZIP 1.9 source appears to used arrays for the hash chains (#ifndef
> DYN_ALLOC... Pos far prev[WSIZE]) just like I do.  That's not what
> you're referring to as being patented is it?

The prev[] array implements a linked list. This is the straightforward
implementation of external chaining in hash tables.  lzrw3a (ftp
location in FAQ item 5) uses a two dimensional array as the hash
table. All elements hashing to the same value are contiguous, so lzrw3a
has no need for a prev[] array. This is what Jung patented. (Yes, it is
pathetic that such things are patentable. Even run length encoding is
patented, see the FAQ item 8. But this subject is not appropriate for
comp.compression.research.)

> I wonder ifone can output matches in some way other than the standard
> LZ77 "Position,Length"... [while still using a sliding window type of
> algorithm]

Yes, you can. See for example the series of lzrw* algorithms and the Fiala and
Greene algorithm (also patented, see the FAQ for references). You can
use a hash index or a position in a trie instead of the usual distance.

> Also, did the use of dynamic Huffman trees improve compression any (in
> ZIP) or do they come out to about the same as static Huffman trees
> (which are, of course, much faster to use)?

There may be a problem of terminology here. zip uses static Huffman
coding, as opposed to adaptive Huffman coding, for which the tree is
changed after each input byte. (The review paper in ACM Computing
Surveys, 19,3, recommends the term 'adaptive coding' instead of the
more ambiguous 'dynamic coding').  As I mentioned in my previous
reply, building the tree takes negligible time, so it is not 'of
course much faster' to use a unique built-in tree for all files (as in
pkzip -es). This degrades compression significantly.

Jean-loup Gailly
jloup@chorus.fr