Show HN: COmpREss (a new compressor)

by 2roon 2/12/25, 10:16 PM with 7 comments
by 2roon 2/17/25, 9:12 AM

Don Knuth, Marcus Hutter, Terry Fuller, Tom Standish

by 2roon 2/14/25, 9:43 AM

my COmpREss algorithm CORE

here it is:

a sequence of bits that is not necessarily random can be made to look random - the bits are lined up with a stream of pseudorandom bits and each one is XORed with its partner in turn

example data: 01011010 .. 01001000

pseudorandom bits: 10101110 .. 00100111 ...

resulting bits: 11110100 .. 01101111

this is reversible (XORing with the same pseudorandom bits a second time restores the original) and the pseudorandom bits take up no storage space in the compressed file since they can be regenerated on demand

"run": a run ends with the first zero bit encountered

example: 01011010 is broken up into 4 runs (0, 10, 110, 10)

probability of two runs taken together being of length two is .25 , three is .25 , four is .1875 , and so on (in general (L-1)/2^L for a combined length of L)

for concreteness we make up some perfectly representative data using the following convention: i list the length of two runs taken together. for instance, our ascii "Z" = 01011010 is broken up as 010 , 11010 , so we write 3 , 5 in its place. note that this is ambiguous, because 3 , 5 might equally represent 100 , 01110 (to give just one example)

"perfect data":

1,1,1,1,1,1,1,1,2,2,2,2,3,3,4,6

1,1,1,1,1,2,1,2,1,3,1,4,2,3,2,6

2,2,3,3,4,5,5,8
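
a quick sketch that reproduces the last line from the pairs above (illustration only, not my real code):

    # sum each pair of "perfect" run lengths to get the length of the
    # two runs taken together - this reproduces 2,2,3,3,4,5,5,8
    run_lengths = [1, 1, 1, 1, 1, 2, 1, 2, 1, 3, 1, 4, 2, 3, 2, 6]

    pair_lengths = [run_lengths[i] + run_lengths[i + 1]
                    for i in range(0, len(run_lengths), 2)]

    print(pair_lengths)  # [2, 2, 3, 3, 4, 5, 5, 8]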

by 2roon 2/16/25, 8:46 PM

1/4+2/8+3/16+4/32+5/64+...=1
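
in other words the pair-length probabilities sum to one. a quick numerical check (illustration only):

    # partial sum of 1/4 + 2/8 + 3/16 + 4/32 + ... , which is the sum of
    # (L-1)/2**L over combined pair lengths L >= 2; it converges to 1
    total = sum((L - 1) / 2 ** L for L in range(2, 60))
    print(total)  # prints a value very close to 1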

by 2roon 2/14/25, 7:12 AM

test

by 2roon 2/13/25, 4:26 PM

my COmpREss algorithm CORE

here it is:

a sequence of bits that is not necessarily random can be made to look random - the bits are lined up with a stream of pseudorandom bits and each one is XORed with its partner in turn

example data: 01011010 .. 01001000

pseudorandom bits: 10101110 .. 00100111 ...

resulting bits: 11110100 .. 01101111

this is reversible (XORing with the same pseudorandom bits a second time restores the original) and the pseudorandom bits take up no storage space in the compressed file since they can be regenerated on demand
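
here is a tiny sketch of just this XOR step, using the exact bits from the example (the mask is hard coded here - in the real thing it would come from a seeded pseudorandom generator, which is an assumption on my part, not my actual generator):

    # XOR the data bits with a mask of pseudorandom bits; XORing with the
    # same mask again restores the original, so the step is reversible
    data = "0101101001001000"
    mask = "1010111000100111"   # stand-in for a seeded PRNG bit stream

    xored = "".join(str(int(d) ^ int(m)) for d, m in zip(data, mask))
    restored = "".join(str(int(x) ^ int(m)) for x, m in zip(xored, mask))

    print(xored)             # 1111010001101111
    print(restored == data)  # True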

"run": a run ends with the first zero bit encountered

example: 01011010 is broken up into 4 runs (0, 10, 110, 10)
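
a small helper that splits a bit string into runs the way i mean (illustration only):

    def runs(bits):
        # a run is everything up to and including the first zero bit
        out, current = [], ""
        for b in bits:
            current += b
            if b == "0":
                out.append(current)
                current = ""
        return out  # trailing ones with no closing zero are ignored here

    print(runs("01011010"))  # ['0', '10', '110', '10']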

probability of two runs taken together being of length two is .25 , three is .25 , four is .1875 , and so on (in general (L-1)/2^L for a combined length of L)

for concreteness we make up some perfectly representative data using the following convention: i list the length of two runs taken together. for instance, our ascii "Z" = 01011010 is broken up as 010 , 11010 , so we write 3 , 5 in its place. note that this is ambiguous, because 3 , 5 might equally represent 100 , 01110 (to give just one example)
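
to spell the ambiguity out, here is every concrete bit string a pair-length of 3 or 5 could stand for (sketch only):

    def run_pairs_of_length(total):
        # every way two runs (each ending in a zero) can have the given
        # combined length: first run of length a, second of length total - a
        return [("1" * (a - 1) + "0") + ("1" * (total - a - 1) + "0")
                for a in range(1, total)]

    print(run_pairs_of_length(3))  # ['010', '100']
    print(run_pairs_of_length(5))  # ['01110', '10110', '11010', '11100']

so writing 3 , 5 alone could stand for any of 2 x 4 = 8 bit strings - that lost information is what the disambiguation step further down puts back.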

"perfect data": 2,2,2,2,2,2,2,2,3,3,3,3,4,4,5,7

"perfect data": 2,2,2,2,3,3,4,6

"perfect data": 2,2,3,5

we will look at two runs at a time. if they are written as 2 when taken together, then concretely there is no choice to be made because that corresponds to the bits 00 and nothing else. for 3 we have a choice of concrete representation so it could be 010 or 100.

my compression works on the first two runs and is then repeated on the new bits obtained after the XOR step

the method: first make ambiguous and then disambiguate. XOR with pseudorandom bits. repeat.

again, i will be concrete.

input data: 00111010

run encoding: 1 , 1 , 4 , 2

ambiguous encoding: 2 , 6

efficient ambiguous encoding (subtract one from each, since two runs together are never shorter than two bits): 1 , 5

we must replace the current run encoding with a new run encoding: looking at the first pair of runs in the data, we want to replace them with some other runs, reversibly.

we always start by writing the efficient ambiguous encoding of the pair, so we write 1 in this case. and since a combined length of two can only be the concrete 1,1 , no disambiguating run is needed:

1,1 -> 1 and we XOR the bits and repeat

input data: 11101000

run encoding: 4 , 2 , 1 , 1

ambiguous encoding: 6 , 2

efficient ambiguous encoding: 5 , 1

again we write the efficient ambiguous encoding (in this case 5) and then disambiguate by writing the length of the first run (in this case 4):

4,2 -> 5,4 and we XOR the bits and repeat
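
here is a sketch of this replacement step on run lengths, reproducing the two examples above (this is just my reading of the rule written out, not the full compressor):

    def replace_first_pair(run_lengths):
        # replace the first two runs (lengths a, b) with the efficient
        # ambiguous encoding (a + b - 1) plus, when the pair length is
        # more than 2, the first run length a as the disambiguator
        a, b, rest = run_lengths[0], run_lengths[1], run_lengths[2:]
        if a + b == 2:
            return [1] + rest             # 1,1 -> 1 (nothing to disambiguate)
        return [a + b - 1, a] + rest      # e.g. 4,2 -> 5,4

    print(replace_first_pair([1, 1, 4, 2]))  # [1, 4, 2]
    print(replace_first_pair([4, 2, 1, 1]))  # [5, 4, 1, 1]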

at this point you can verify, by analyzing "perfect data" as above or by summing the infinite geometric series, that there is no bias: on average this method neither inflates nor deflates the data. that alone is good enough, because an unbiased random walk can be expected to spend most of its time away from zero - repeat the above steps a thousand times, then save a compression or discard an expansion with a one bit "failure" flag. but there is actually a special case which turns the algorithm towards compression more directly

input data: 11110010

run encoding: 5 , 1 , 2

ambiguous encoding of the first pair: 6

efficient ambiguous encoding of the first pair: 5

special case (the first run is not 1 and the second one is): now something very interesting happens. we would replace the first two runs with 5,5 but this is wasteful: by the time we get to the concrete code 111101111 we already know the next bit is zero, because the disambiguating run can never be longer than 5 here (the pair is 6 long and the second run takes at least one bit). so we drop that forced zero, let the following run (10) ride along with the disambiguator, and replace 5,1,2 -> 5,6 instead

as you can see this does not look like compression on the face of it, but it improves our random walk by biasing it towards compression
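
and a sketch of the special case itself, reproducing 5,1,2 -> 5,6 (again, just my reading of what i wrote above - the function name and the exact rule a,1,c -> a, a-1+c are my illustration, not necessarily how my code does it):

    def replace_special_pair(run_lengths):
        # special case: first run a >= 2, second run is the single zero "0".
        # the disambiguator's closing zero is forced, so drop it and let the
        # following run of length c ride along: a, 1, c  ->  a, a - 1 + c
        a, b, c, rest = run_lengths[0], run_lengths[1], run_lengths[2], run_lengths[3:]
        assert a >= 2 and b == 1
        return [a, a - 1 + c] + rest

    print(replace_special_pair([5, 1, 2]))  # [5, 6]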

i have the compressor already written as computer code - i just need to finish the decompressor - i am currently demonstrating compression on an AI generated image i had made

i would enter and win the Hutter Prize as a next step but my code is currently too slow

here is my website: https://ytp.me/ ("empty" backwards)

i'm up to all kinds of good things and have lots of good ideas

i would love to clarify any points you have questions about

here is my working copy of a Python implementation of the compressor (i reserve all rights for the time being):

<python code>