In a typical serialization library, encoding a collection gets carried out in the same way for lists (ordered) and sets (order-agnostic). Regardless of the collection type, the elements are simply written one after the other. This means that the encoded bytes implicitly carry the information of the order of elements. Is it possible to isolate this information, eliminate it from the encoding, therefore end up with a more compact byte representation?

Bottom line up front: Yes, it is possible to store a set using less than the regular amount of bytes. Both theoretically and practically. However, the practical algorithm is not so straightforward hence it's being classified as a "compression" algorithm.

First, let me demonstrate a very simple example to show how this is theoretically possible. I'll be using a "multiset" instead of a "set" for the order-agnostic collection to keep the example simpler.

Let's say our elements consist of only one bit: `0`

, `1`

. And we're trying to encode a three-element collection.

There will be 8 possible three-element lists. **They get encoded with 3 bits.**

No | Collection | Encoding |
---|---|---|

#1 | [0, 0, 0] | 000 |

#2 | [0, 0, 1] | 001 |

#3 | [0, 1, 0] | 010 |

#4 | [0, 1, 1] | 011 |

#5 | [1, 0, 0] | 100 |

#6 | [1, 0, 1] | 101 |

#7 | [1, 1, 0] | 110 |

#8 | [1, 1, 1] | 111 |

On the other hand, there will be only 4 possible three-element multisets. **They can be encoded with just 2 bits!**

No | Collection | Encoding | Notes |
---|---|---|---|

#1 | [0, 0, 0] | 00 | |

#2 | [0, 0, 1] | 01 | |

01 | Identical to #2 | ||

#3 | [0, 1, 1] | 10 | |

01 | Identical to #2 | ||

10 | Identical to #3 | ||

10 | Identical to #3 | ||

#4 | [1, 1, 1] | 11 | |

We were able to drop half of the possible collections in the multiset case since we don't respect the order of elements. This then allowed us to use one less bit during the encoding.

Also note that all these encodings given above are in their theoretical minimum length as we create a one-to-one mapping between all possible collections and encodings.

I have been thinking about this question for some time and searching lots of different keywords on Google to find a related work: "efficient set serialization", "order-agnostic collection encoding", "serializing orderless(?) collection efficiently", and many other weird ones... None of them returned any meaningful results. Then, I figured out that the correct keyword for this concept is "compression" so I finally started to come across some related stuff.

Among the results, the most relevant one was the following paper: Compressing Sets and Multisets of Sequences. If you're interested, I encourage you to do your reading on this, but here's my summary through a fast and shallow glance: The writers have come up with an algorithm that uses binary trees for encoding/decoding multisets. It seems to be reasonably efficient and able to do compression very close to the theoretical limits.

Why this algorithm is not a standard for serialization libraries?

I believe the first reason is keeping serialization simple by not depending on the type of the collection. Another important reason is that the maximum theoretical gain from this is not super impressive (it's something like 13%) with large element sizes. This might make it not worth the extra time complexity in most cases.

What is meant by "theoretical limits" or "theoretical maximum gain"?

The "theoretical maximum gain" I have been mentioning here refers to an ideal algorithm that eliminates the "order information" completely from the encoded bytes. The example we did above was an instance of such an ideal case but it seems that it's not easy to generalize it to more elements and larger element sizes.

Before finishing, I want to share one GitHub repository I came across during my search which I found quite interesting: BucketCompressionTrick. I think this is a very good (and possibly useful) demonstration of the concept as well.

It's a neat little trick to efficiently squeeze four 5-bit values into a single 16-bit value when we don't care about the order of those 5-bit values ...