Monday, May 26, 2025

Optimizing objective functions defined over partitions using the Genetic Algorithm

Earlier this year, I needed to find the best way to divide a set of $n$ things into $k$ subsets so that when an optimization is run in batches with the universe divided that way, the best overall result is achieved.  Mathematically, the problem is to optimize an objective function defined over partitions of a set.  A common instance of this problem is cluster analysis, where the objective function is a summary measure of within-cluster variation.  Another example is the multi-way number partitioning problem that arises in job scheduling, where the problem is to allocate jobs to processing units so that some characteristics of performance or throughput are optimized.  These problems are in general NP-hard, unless the objective function has nice properties or there is some natural way to limit the search space or to direct the search.

Brute force is not a realistic option except for tiny instances.  The number of ways to divide a set with $n$ elements into $k$ subsets grows extremely fast.  For example, with $n = 100$ and $k = 15$, there are more than $10^{100}$ ways to partition the set.  See Stirling Numbers of the Second Kind for the formula.
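For the curious, here is a small stand-alone sketch (hypothetical helper code, not part of optimize-partition) that computes Stirling numbers of the second kind with the standard recurrence $S(n,k) = k \cdot S(n-1,k) + S(n-1,k-1)$, using BigInteger since the values overflow a long almost immediately:

```java
import java.math.BigInteger;
import java.util.Arrays;

/** Stirling numbers of the second kind via the recurrence
 *  S(n, k) = k * S(n-1, k) + S(n-1, k-1).
 *  Hypothetical helper, not part of optimize-partition. */
public class Stirling2 {

    public static BigInteger stirling2(int n, int k) {
        BigInteger[] row = new BigInteger[k + 1];
        Arrays.fill(row, BigInteger.ZERO);
        row[0] = BigInteger.ONE; // S(0, 0) = 1
        for (int i = 1; i <= n; i++) {
            // Walk right to left so row[j - 1] still holds S(i-1, j-1).
            for (int j = Math.min(i, k); j >= 1; j--) {
                row[j] = BigInteger.valueOf(j).multiply(row[j]).add(row[j - 1]);
            }
            row[0] = BigInteger.ZERO; // S(i, 0) = 0 for i > 0
        }
        return row[k];
    }

    public static void main(String[] args) {
        // Prints a number on the order of 10^105.
        System.out.println(stirling2(100, 15));
    }
}
```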

For the general problem, in which nothing useful is known about the objective function, dynamic programming, integer programming, greedy algorithms, local search, simulated annealing, multilevel k-way partitioning and genetic algorithms are all possible ways to attack the problem.  We focus here on the last one.

Solution using the Genetic Algorithm

One way to direct the search for good partitions is to represent partitions as chromosomes and use the Genetic Algorithm (GA) with chromosome fitness determined by the objective function.

This github repo contains a framework for optimizing objective functions defined over partitions using a Java implementation of the GA.  The Apache Commons Math GA implementation that it uses is not optimized for performance or scale, and neither is the optimize-partition framework built on top of it.  This is proof-of-concept code to use in experiments to see whether the GA can find good partitions.  The tl;dr is that it can, as long as either $n$ is small or good partitions are not too sparse in the topology of pointwise convergence on ${}^n k$ (viewing $k$-piece partitions as functions $n \to k$).  That is just a fancy way of saying that the implementation will find good solutions in reasonable time if, starting from a random partition, one can reach a "good" partition by changing the placement of a reasonably small number of elements.

The chromosome representation of a partition views the partition as a list of partition-piece values, one for each element of the universe.  Here is an example.

Suppose that $U = \{0, 1, 2, 3, 4, 5\}$ and $P = \{\{0,1\},\{2\}, \{3,4\}, \{5\}\}$  is a partition of $U.$

We can represent $P$ with the integer array $[0,0,1,2,2,3].$  The first two $0$ entries indicate that the first two elements of $U$ are assigned to the first partition piece in $P$.  The next value means that the third element of $U$ is assigned to the second partition piece.
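To make the encoding concrete, here is a minimal sketch (hypothetical helper code, not the repo's actual classes) that decodes a chromosome back into the partition it represents:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Decodes an integer-array chromosome into the partition it represents.
 *  Hypothetical sketch; optimize-partition's actual classes may differ. */
public class ChromosomeDecoder {

    public static List<Set<Integer>> decode(int[] chromosome, int k) {
        List<Set<Integer>> pieces = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            pieces.add(new TreeSet<>());
        }
        // chromosome[element] is the index of the piece that element belongs to.
        for (int element = 0; element < chromosome.length; element++) {
            pieces.get(chromosome[element]).add(element);
        }
        return pieces;
    }

    public static void main(String[] args) {
        // [0,0,1,2,2,3] represents {{0,1},{2},{3,4},{5}}.
        System.out.println(decode(new int[] {0, 0, 1, 2, 2, 3}, 4));
    }
}
```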

The sets in a partition don't really have an order, so relabeling them results in a different chromosome representing the same partition.  For example, $[2,2,1,0,0,3]$ is another chromosome representation of the partition represented by $[0,0,1,2,2,3].$   Every partition with $k$ pieces has $k!$ chromosome representations.  At first this looks like a suboptimal feature of the encoding, but it actually helps the search, as each good partition ends up having $k!$ basins of attraction.

To apply the Genetic Algorithm, we need a way for chromosomes to "cross" with one another to create new chromosomes.  Our first attempt at crossover interleaves the partition placement decisions of the parents.  So, for example,

$[0,0,1,2,2,3]$ crossed with $[2,0, 2, 3, 1, 3]$ gives $[0,0,1,3,2,3]$.  

The first parent determines the first value, the second parent the second, the third value again comes from the first parent, and so on.  Sometimes crossover defined this way results in "holes" (unused partition pieces); those children have to be discarded when we are searching for fixed-size partitions.
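Here is a sketch of this interleaving crossover, along with the check for holes (hypothetical code; the repo's operator classes may differ):

```java
/** Interleaving crossover for partition chromosomes: the child takes
 *  even-indexed genes from the first parent and odd-indexed genes from
 *  the second. Hypothetical sketch. */
public class InterleavingCrossover {

    public static int[] cross(int[] first, int[] second) {
        int[] child = new int[first.length];
        for (int i = 0; i < first.length; i++) {
            child[i] = (i % 2 == 0) ? first[i] : second[i];
        }
        return child;
    }

    /** True if every piece index in 0..k-1 is used, i.e. the child has
     *  no "holes" and is a valid k-piece partition. */
    public static boolean usesAllPieces(int[] chromosome, int k) {
        boolean[] seen = new boolean[k];
        for (int gene : chromosome) {
            seen[gene] = true;
        }
        for (boolean s : seen) {
            if (!s) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] child = cross(new int[] {0, 0, 1, 2, 2, 3},
                            new int[] {2, 0, 2, 3, 1, 3});
        System.out.println(java.util.Arrays.toString(child)); // [0, 0, 1, 3, 2, 3]
    }
}
```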

Our first idea for mutation is just random reassignment of a single value.
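In code, that mutation might look like the following (again a hypothetical sketch):

```java
import java.util.Random;

/** Point mutation: reassign one randomly chosen element to a random piece.
 *  Hypothetical sketch. */
public class RandomReassignmentMutation {

    private static final Random RNG = new Random();

    public static int[] mutate(int[] chromosome, int k) {
        int[] copy = chromosome.clone();
        int element = RNG.nextInt(copy.length); // which element to move
        copy[element] = RNG.nextInt(k);         // its new piece assignment
        return copy;
    }
}
```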

The algorithm starts with an initial population of chromosomes, usually randomly generated. 

Then successive generations are created as follows:

  1. Compute fitness for each chromosome in the current generation.
  2. Pass the top elitismRate fraction of chromosomes directly through to the next generation.
  3. While the next generation still has space, randomly select two parents from the current generation and
    • with probability crossoverRate, replace the parents with each crossed with the other;
    • with probability mutationRate, apply mutation to each member of the resulting pair.

Evolution continues until numGenerations generations have been created.  A sketch of this loop in code follows.
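Here is the generational scheme above as a self-contained Java skeleton.  This is a sketch of the loop just described, not the Apache Commons Math GA API that optimize-partition actually uses; the Fitness, Crossover, and Mutation interfaces are stand-ins, and the hole handling described earlier is omitted:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Skeleton of the generational loop. A sketch of the scheme described
 *  above, not the Apache Commons Math GA API. */
public class EvolutionSketch {

    interface Fitness { double of(int[] chromosome); }
    interface Crossover { int[] cross(int[] a, int[] b); }
    interface Mutation { int[] mutate(int[] chromosome); }

    static List<int[]> evolve(List<int[]> population, Fitness fitness,
                              Crossover crossover, Mutation mutation,
                              double elitismRate, double crossoverRate,
                              double mutationRate, int numGenerations,
                              Random rng) {
        for (int gen = 0; gen < numGenerations; gen++) {
            // 1. Rank the current generation by fitness, best first
            //    (fitness is recomputed in the comparator for brevity).
            population.sort((a, b) -> Double.compare(fitness.of(b), fitness.of(a)));
            List<int[]> next = new ArrayList<>();
            // 2. Elitism: the top chromosomes pass through unchanged.
            int elite = (int) (elitismRate * population.size());
            next.addAll(population.subList(0, elite));
            // 3. Fill the remaining slots from randomly selected parent pairs.
            while (next.size() < population.size()) {
                int[] p1 = population.get(rng.nextInt(population.size()));
                int[] p2 = population.get(rng.nextInt(population.size()));
                if (rng.nextDouble() < crossoverRate) {
                    int[] c1 = crossover.cross(p1, p2);
                    int[] c2 = crossover.cross(p2, p1);
                    p1 = c1;
                    p2 = c2;
                }
                if (rng.nextDouble() < mutationRate) {
                    p1 = mutation.mutate(p1);
                    p2 = mutation.mutate(p2);
                }
                next.add(p1);
                if (next.size() < population.size()) {
                    next.add(p2);
                }
            }
            population = next;
        }
        return population;
    }
}
```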

Experiments

Unit tests in optimize-partition apply the GA to some partition search problems with known optimal solutions.  In each case, with suitable parameters and enough generations, optimize-partition finds an optimal solution.


Spread the max value

Suppose $U = \{0, ..., 99\}$ and $g: U \to \mathbb{R}$ is defined by $g(i) = 10$ for $i < 5$ and $g(i) = 0$ for $i \geq 5.$

Then define partition fitness by summing the maximum value of $g$ over each piece of the partition.  So, for example, consider the $5$-piece partition

$P = \{\{0, 1\}, \{2,  3\}, \{4, 5\}, \{6\}, \{7, ..., 99\}\}$

$P$ has fitness $10 + 10 + 10 + 0 + 0 = 30.$

Optimal $5$-piece partitions are those that take the first five elements of $U$ (where $g$ takes the value $10$) and spread them across the pieces of the partition.  The maximum attainable fitness is $50$.
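The fitness function for this problem is straightforward to express in code.  Here is a sketch (the repo's test classes may structure it differently):

```java
import java.util.List;
import java.util.Set;

/** Fitness for the "spread the max value" problem: the sum, over the
 *  pieces of the partition, of the maximum of g on each piece.
 *  A sketch; assumes every piece is non-empty. */
public class SpreadMaxFitness {

    static double g(int i) {
        return i < 5 ? 10.0 : 0.0;
    }

    static double fitness(List<Set<Integer>> partition) {
        double total = 0.0;
        for (Set<Integer> piece : partition) {
            double max = Double.NEGATIVE_INFINITY;
            for (int element : piece) {
                max = Math.max(max, g(element));
            }
            total += max;
        }
        return total;
    }
}
```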

The test cases in the class TestOptimizePartition show that the GA consistently finds optimal partitions for this problem.

Cluster analysis

The test cases in the class TestClusterPartitionOptimizer test a form of GA-clustering on randomly generated test datasets.  The universe for each test consists of 50 random points in $\mathbb{R}^3$.  Each point is a random deviate of a pre-determined centroid, created by adding Gaussian noise to each component of the centroid.  Each centroid has 9 deviates around it.  The centroids are generated to be at least 10 units apart and the component noise has standard deviation $0.1$, so there is (almost certainly) a unique optimal solution and the clustering problem is easy.  The objective function is the negated sum of intra-cluster pairwise Euclidean distances.  The tests show that the GA consistently finds the unique optimal solution after 100 generations.
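Here is a sketch of that objective function (hypothetical code; see TestClusterPartitionOptimizer for the real setup):

```java
import java.util.List;
import java.util.Set;

/** Clustering objective: the sum of intra-cluster pairwise Euclidean
 *  distances, negated so that the GA maximizes fitness.
 *  Hypothetical sketch. */
public class ClusterFitness {

    static double fitness(List<Set<Integer>> partition, double[][] points) {
        double sum = 0.0;
        for (Set<Integer> cluster : partition) {
            Integer[] members = cluster.toArray(new Integer[0]);
            // Sum distances over all unordered pairs within the cluster.
            for (int i = 0; i < members.length; i++) {
                for (int j = i + 1; j < members.length; j++) {
                    sum += distance(points[members[i]], points[members[j]]);
                }
            }
        }
        return -sum;
    }

    static double distance(double[] a, double[] b) {
        double d2 = 0.0;
        for (int c = 0; c < a.length; c++) {
            double diff = a[c] - b[c];
            d2 += diff * diff;
        }
        return Math.sqrt(d2);
    }
}
```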

Our implementation of this example is basically the same as that described in A Genetic Algorithm Approach to Cluster Analysis.  See that paper for more empirical performance and accuracy results.