Thursday, October 20, 2016

Comparing Runtimes of Four Implementations for Calculating Collatz Path Lengths

Background

The Collatz Conjecture describes a phenomenon that is easily apprehended, but (so far) impossible to solve even by the world's best mathematicians. One of the most influential, Paul Erdős, famously remarked that "Mathematics may not be ready for such problems", and offered $500 of his own money as a reward to any potential solver.

The conjecture states that, given a positive integer n, iteratively applying these rules:
- if n is even, n = n/2

- if n is odd, n = 3n + 1
will lead any n, eventually, to 1.

Examples:
2 -> 1
3 -> 10 -> 5 -> 16 -> 8 -> 4 -> 2 -> 1
4 -> 2 -> 1
5 -> 16 -> 8 -> 4 -> 2 -> 1
(and so on)

The examples above show the Collatz Paths for the numbers two through five, inclusive. In this context and for the rest of this post, we will refer to this type of list more simply as a path.

Another way to state the conjecture is that every natural number has a finite path. Checking that this is true is usually where one starts when trying to grasp the problem. Pencil and paper are more than adequate for calculating a path length, but the iterative nature of the process lends itself nicely to automation.


Implementations


A simple, straightforward way to implement a function that calculates the path length of a number n is as follows:

def simple_collatz(n):
    """
    expects a number n > 0 as an argument
    returns the Collatz Path length for n
    """
    path_length = 0
    while n > 1:
        path_length += 1
        if n % 2 == 0:
            n //= 2
        else:
            n = 3 * n + 1

    return path_length

This works well for most purposes. Some examples:
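For instance, running it on the numbers from the examples above, plus 27, a famous small number with a long path (the definition is repeated here so the snippet is self-contained):

```python
def simple_collatz(n):
    """Collatz path length of n (definition repeated from above)."""
    path_length = 0
    while n > 1:
        path_length += 1
        if n % 2 == 0:
            n //= 2
        else:
            n = 3 * n + 1
    return path_length

# The example paths above, as path lengths:
for n in (2, 3, 4, 5):
    print(n, simple_collatz(n))   # -> 2 1, 3 7, 4 2, 5 5

# 27 is a well-known outlier: a tiny start, but 111 steps to reach 1.
print(simple_collatz(27))         # -> 111
```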




But let's say you wanted to gather some data about the distribution of path lengths. This would require performing the above calculation over a wide range of numbers, potentially millions. The simple implementation will not work well in that case. As was pointed out in the video where I learned about this conjecture, there are improvements that can be made:
def collatz(n):
    """
    expects a number n > 0 as an argument
    returns the Collatz Path length for n
    """
    path_length = 0
    while n > 1:
        if n % 2 == 1:
            # use integer division; plain / would produce a float in Python 3
            n = (3 * n + 1) // 2
            path_length += 2
        else:
            n //= 2
            path_length += 1
    return path_length

The improvement above is small. By noticing that if n is odd, then 3n + 1 must be even, we can bundle an odd step with the even step that follows it. In that case the transformation of n becomes (3n + 1)/2, and the path length is incremented by two, saving a check of the while condition on every odd step. As far as I can tell, this provides only modest improvements, and they are not sustained as the numbers and the number of path length calculations grow into the millions.

The next class of improvements yields a much quicker run time over a range of values. Notice that four and five traverse the same path for the last several steps:

4 -> 2 -> 1
5 -> 16 -> 8 -> 4 -> 2 -> 1

If we store numbers and their calculated path lengths as key-value pairs, we can eliminate a significant amount of duplicate work. By using a collection that leverages a hash lookup, retrieval can be very speedy (theoretically constant time). The implementation is as follows:
def quick_collatz(n, lookup={}):
    """
    expects a number n > 0 as an argument
    returns the Collatz Path length for n
    note: the mutable default dict is intentional; it persists
    across calls and acts as a cache of previous results
    """
    starting_number = n
    path_length = 0
    while n > 1:
        if n in lookup:
            path_length += lookup[n]
            break
        if n % 2 == 1:
            n = (3 * n + 1) // 2
            path_length += 2
        else:
            n //= 2
            path_length += 1
    lookup[starting_number] = path_length

    return path_length
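To see the cache at work, pass an explicit dictionary and sweep a couple of values. The calculation for 6 short-circuits as soon as it reaches 5, which is already stored (the definition is repeated so the snippet stands alone):

```python
def quick_collatz(n, lookup={}):
    """Memoized Collatz path length (repeated from above)."""
    starting_number = n
    path_length = 0
    while n > 1:
        if n in lookup:
            path_length += lookup[n]  # reuse a previously computed tail
            break
        if n % 2 == 1:
            n = (3 * n + 1) // 2
            path_length += 2
        else:
            n //= 2
            path_length += 1
    lookup[starting_number] = path_length
    return path_length

cache = {}
print(quick_collatz(5, cache))  # -> 5
# 6 -> 3 -> 5, and 5 is already cached, so the loop stops early
print(quick_collatz(6, cache))  # -> 8
print(cache)                    # -> {5: 5, 6: 8}
```

Note that only the starting numbers end up in the cache; the intermediate values visited along the way are not stored.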

And for completeness, I included a recursive implementation, which, as you will see, performs unexpectedly well up to a million. It is very similar to the above, but slightly more elegant.
def recursive_collatz(n, lookup={}):
    """
    expects a number n > 0 as an argument
    returns the Collatz Path length for n
    """
    if n < 2:
        return 0
    elif n in lookup:
        return lookup[n]
    elif n % 2 == 0:
        lookup[n] = 1 + recursive_collatz(n // 2, lookup)
    else:
        lookup[n] = 2 + recursive_collatz((3 * n + 1) // 2, lookup)

    return lookup[n]

The main benefit of a recursive implementation is that we can store path lengths not only for the number n whose path length we are calculating, but for every intermediate number visited on the way to one. In theory this reduces the number of redundant path calculations. But is there a performance cost to recursion? The usual assumption is yes. If so, which effect wins out, and does that depend on n?
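This caching of intermediates is easy to observe: after a single call for n = 7, the lookup table already holds path lengths for every number visited along the way (the definition is repeated so the snippet is self-contained):

```python
def recursive_collatz(n, lookup={}):
    """Recursive, memoized Collatz path length (repeated from above)."""
    if n < 2:
        return 0
    elif n in lookup:
        return lookup[n]
    elif n % 2 == 0:
        lookup[n] = 1 + recursive_collatz(n // 2, lookup)
    else:
        lookup[n] = 2 + recursive_collatz((3 * n + 1) // 2, lookup)
    return lookup[n]

memo = {}
print(recursive_collatz(7, memo))    # -> 16
# Intermediates such as 11, 17, and 26 were cached as a side effect:
print(memo[11], memo[17], memo[26])  # -> 14 12 10
```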

Results


I collected a little bit of data, which I will display and visualize here:
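The exact timing harness isn't shown here, but the numbers below could be gathered with something like the following sketch (time_sweep is a hypothetical helper name; time.perf_counter reports elapsed seconds):

```python
import time

def simple_collatz(n):
    """Collatz path length (repeated from above)."""
    path_length = 0
    while n > 1:
        path_length += 1
        n = n // 2 if n % 2 == 0 else 3 * n + 1
    return path_length

def time_sweep(collatz_fn, limit):
    """Hypothetical helper: time path-length calculations for 1..limit."""
    start = time.perf_counter()
    for n in range(1, limit + 1):
        collatz_fn(n)
    return time.perf_counter() - start

# e.g. one cell of the table below:
elapsed = time_sweep(simple_collatz, 10000)
```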



Up To      Simple      Improved    Lookup     Recursive
100        0.00084     0.00092     0.00023    0.00015
1000       0.01535     0.01703     0.00379    0.00145
10000      0.25462     0.27565     0.02224    0.01561
100000     2.95252     3.28385     0.20543    0.15675
1000000    36.42228    40.73934    2.31977    1.79296


Because the runtimes differ by orders of magnitude, the following is a plot of the log of the runtimes.



The results are dramatic, and not necessarily expected.

If I had ventured to guess beforehand, I would have said that the non-recursive lookup would generally perform best, followed by the recursive implementation, then the improved conventional implementation, and lastly the simplest one.

However, I feel compelled to listen to the story told by the data, which is:

- The recursive implementation performed best.
- The non-recursive lookup performed second best.
- The 'improved' simple implementation performed worst.

If you've enjoyed this post, maybe you could try this on your own hardware or suggest further refinements here. I would look forward to hearing from you.

Please find all the code on GitHub.

Thank you and best regards,
Kevin

