Using gzip as a machine learning model
Paper: “Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors
Summary by Adrian Wilkins-Caruana
Today’s paper isn’t about training a new kind of LLM. It isn’t about multi-modal generative AI. It isn’t even about neural networks. And yet the message from this paper is one every ML practitioner should pay attention to since, sometimes, we just can’t see the forest for the trees.
Jiang et al. classify text documents using a lossless compression algorithm and a simple, distance-based classifier (k-nearest neighbors). That’s it! The state-of-the-art models for text classification use sophisticated neural networks like recurrent and graph NNs. Before we get into their results, let’s first build an intuitive understanding of why a lossless compression algorithm might be useful for classifying text, and then we’ll discuss exactly how their method works.
Lossless compressors aim to represent information using as few bits as possible by assigning shorter codes to symbols with higher probability. For example, in a black-and-white image with 200 consecutive white pixels, the algorithm might record this as "200 white pixels" instead of listing "white pixel" 200 times. When decompressed, the original image is perfectly reconstructed, with no loss of detail or information. There are two reasons why compressors might be good at classification: they’re good at identifying patterns or regularities, and objects belonging to the same category tend to exhibit more regularities than objects from different categories.
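You can see this in action with Python’s standard gzip module. This is just an illustrative sketch (the byte strings are made up for the example): a highly repetitive input shrinks to a tiny fraction of its size, while an equally long run of random bytes barely compresses at all.

```python
import gzip
import os

# A highly repetitive input, like a long run of identical pixels.
repetitive = b"white pixel " * 200        # 2,400 bytes of pure repetition
# Random bytes of the same length contain no patterns to exploit.
noise = os.urandom(len(repetitive))

print(len(gzip.compress(repetitive)))     # a few dozen bytes
print(len(gzip.compress(noise)))          # roughly 2,400 bytes, plus header overhead
```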
Let’s see an example of how the compressor-classification method works. In the sentences below, x1 and x2 are more similar to each other than either is to x3:
x1 = Japan’s Seiko Epson Corp. has developed a 12-gram flying microrobot.
x2 = The latest tiny flying robot has been unveiled in Japan.
x3 = Michael Phelps won the gold medal in the 400 individual medley.
A compressor would need less information to represent the concatenation of x1 and x2 than the concatenation of x1 and x3. Why? Because some of the patterns that the compressor identified in x1 can be reused to compress x2, but not x3, which contains different patterns that must be encoded separately. To formalize this intuition, the authors use something called the Normalized Compression Distance (NCD), which measures the similarity between two sentences by compressing each one individually and the two concatenated together (C(·) denotes the length of the compressed output). Its equation is quite simple:
NCD(x, y) = [C(xy) − min(C(x), C(y))] / max(C(x), C(y))

The equation reads: the NCD between x and y is the compressed size of the concatenation of x and y, minus the smaller of the two individual compressed sizes, divided by the larger of the two individual compressed sizes. In other words, the NCD is lower if sentences are similar, and higher if they’re dissimilar. Distance-based classifiers like k-nearest neighbors can use the NCD as a distance function in the same way they would use the Euclidean or cosine distance. The researchers also used a very standard compression program, called gzip. It’s probably installed on your computer.
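The whole method can be sketched in a few lines of Python. This is a simplified illustration, not the authors’ exact code: it uses gzip’s output length as C(·), a 1-nearest-neighbor rule in place of the paper’s k-NN, and made-up labels for the three example sentences above.

```python
import gzip

def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance, using gzip's output length as C(.)."""
    c = lambda s: len(gzip.compress(s.encode()))
    cx, cy, cxy = c(x), c(y), c(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

x1 = "Japan's Seiko Epson Corp. has developed a 12-gram flying microrobot."
x2 = "The latest tiny flying robot has been unveiled in Japan."
x3 = "Michael Phelps won the gold medal in the 400 individual medley."

# x1 shares patterns with x2 ("Japan", "flying", "robot"), so their
# NCD should come out smaller than the NCD between x1 and x3.
print(ncd(x1, x2), ncd(x1, x3))

# 1-nearest-neighbor classification: give the test sentence the label
# of the labeled sentence with the smallest NCD to it.
def classify(test: str, labeled: list[tuple[str, str]]) -> str:
    return min(labeled, key=lambda pair: ncd(test, pair[0]))[1]

print(classify(x2, [(x1, "sci/tech"), (x3, "sports")]))
```

No training, no parameters: the “model” is just gzip plus a distance comparison over the labeled examples.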
To measure the effectiveness of their method, the authors compared it to 13 other text-classification algorithms across 7 datasets. The authors claim that their approach outperformed the other algorithms 53% of the time, but it appears they made a mistake in how accuracy was scored, and the method is only better in a smaller subset of cases, such as the DengueFilipino dataset. Despite this mistake, their result is still quite meaningful for ML researchers. Neural networks aren’t the one-size-fits-all solution to every ML problem, and sometimes non-ML-based approaches, such as the gzip approach in this paper, can serve as surprisingly good baselines.