BEST SOLUTION TO THE COMPARISON SORT PROBLEM


Sorting digitally represented data on a computer system by comparisons has been a long-standing problem in theoretical computer science. This problem has received so much attention that it is now difficult to know exactly what the problem is and to what extent it has been solved! It is necessary to read a huge amount of published literature, and sometimes to perform experimental tests on algorithms, to know to what extent and in what aspects this problem has been solved. It is difficult even to know what the extent and aspects of the problem are, let alone to verify or know their solutions! Some authors even rely on making claims that are based on the correctness of their mathematical presentation rather than having any substance of practical value to the problem's solution! I think that Mathematics is a tool, and a wise person will not use it as a vice. I tend to use Mathematics as a device to perfect my creativity and to affirm the applicability and validity of a solution when solving Technical Problems!

The sorting problem has received a lot of attention mainly because of its practical applications and its theoretical implications. However, a theoretical implication, as in many other areas of academia, is not necessarily correct in practice. For example, the theoretical lower limit on the number of comparisons that can be done by any comparison-based sort algorithm does not always hold in practice, and I do not think it has been properly presented in the literature from a purely theoretical point of view. I have tested a comparison sort algorithm that demonstrates this point. The point I am making is that there are always new things to discover or innovate that sometimes go against the perceived limitations, or show that those limitations are not properly understood!

THE RANKING PROBLEM!

In this problem, we have a sequence of digitally represented data values that we want to place in an orderly fashion (rank) in space, based on the size of the individual data items in the sequence, and we want to do this with the use of a computer system that has a single processor. We would like to find a way to do this so that minimum processor time and system memory are used. This ranking is to be done on a digital system that operates under the "Stored Program" concept and that uses the Single Processor Random Access Memory (SPRAM) model. There are three fundamental ways in which this can be done. I have listed two of these below.

A set of data values can be linearly ranked in space by virtue of their relative size by
(1) A process of pair-wise comparison followed by data move operations or
(2) A logical or mathematical method whereby individual data value locations are determined

From this it can easily be seen that the process of ranking a set of data values requires data space and processing time that are related in some form to the method used to achieve this objective. The complexity of a process can be looked at in terms of its spatial arrangement over time or in terms of the time-related sub-processes that describe the process. The space and time complexity of the first method for ranking data values tends to be more involved and requires a higher degree of creativity or imaginative thinking to achieve useful results. Sorting by the first method is commonly termed "Comparison-Based Sorting". The second method is called hashing. In most cases, hashing is a probabilistic procedure for placing the values in a set at their sorted locations. You can also use a hash function to retrieve values from their sorted locations. We use the term "Hash Table" to refer to the set of sorted locations that the hash function can reference. Hash functions are usually mathematical in nature and tend to have a relatively simple algorithmic form in comparison to comparison-based sort algorithms. The first method is the more common solution to the ranking problem, as it places minimal restrictions on the type and range of data values to be ranked. We seek a solution to the ranking problem that uses minimum data space and processor time, and we seek this solution by use of a comparison-based sorting method. From here on, whenever we use the term sort we mean a method for ranking data values by comparison and data move operations.
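
As a minimal sketch of the first method, consider the Java code below (Java being the language used for code later on this page). Insertion sort is chosen here purely because it is the simplest example of ranking by pair-wise comparisons followed by data moves; it is not one of the efficient algorithms discussed below, and the sample values are made up for the demonstration.

    // Minimal illustration of ranking by pair-wise comparison and data moves.
    // Insertion sort is used only for simplicity; it does O(N^2) comparisons.
    public class ComparisonRanking {
        static void insertionSort(int[] a) {
            for (int i = 1; i < a.length; i++) {
                int key = a[i];              // value being placed in rank order
                int j = i - 1;
                // Pair-wise comparisons decide where key belongs...
                while (j >= 0 && a[j] > key) {
                    a[j + 1] = a[j];         // ...and data moves make room for it
                    j--;
                }
                a[j + 1] = key;              // final move puts key in its ranked place
            }
        }

        public static void main(String[] args) {
            int[] a = {5, 2, 9, 1, 5, 6};
            insertionSort(a);
            System.out.println(java.util.Arrays.toString(a)); // [1, 2, 5, 5, 6, 9]
        }
    }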

A hash function h maps keys belonging to a domain D to the address values of a table T[0:n], where n, a positive finite integer, is the number of elements in the table. The hash function simply uses a key value to compute an address in [0, n-1]. A simple example is a hash function that takes a sequence of positive integer values and maps each value to the address location equal to the value itself, i.e. h(x) = x. Hash functions are also used for other data processing operations such as lookup tables, dictionaries, encryption algorithms and so on. If for any two distinct keys xi and xj with xi < xj we have h(xi) < h(xj), we say that the hash function is order preserving. A more detailed explanation of these algorithms is available from the publication mentioned at the end of this page.
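
The identity hash h(x) = x described above can be sketched in Java as follows; the table size and key values here are assumptions made up for the example.

    // Sketch of the identity hash h(x) = x: each key in [0, n-1] is stored at
    // the address equal to its own value, so the table holds the keys in
    // sorted order. The table size and keys are assumptions for this demo.
    public class IdentityHash {
        static int h(int x) { return x; }  // order preserving: x < y implies h(x) < h(y)

        public static void main(String[] args) {
            int n = 10;                        // table T[0:n]
            Integer[] table = new Integer[n];  // null marks an empty slot
            int[] keys = {7, 2, 5, 0, 9};
            for (int k : keys) table[h(k)] = k;  // place each key at its sorted location
            // Scanning the table in address order retrieves the keys in sorted order.
            for (Integer v : table)
                if (v != null) System.out.print(v + " ");  // 0 2 5 7 9
        }
    }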

Hashing schemes generally have interesting properties that make them very useful for other data processing activities apart from sorting.

COMPARISON BASE SORT ALGORITHMS

It is well known that there is a minimum limit to the number of comparisons that must be done in order to sort a set of N data values. This minimum number of comparisons is known as the theoretical lower limit and is easily shown to be Nlog2(N) - N/ln 2 + log2(N)/2 + O(1) under certain assumptions, where O(1) is a constant value, ln 2 = loge(2) and e is 2.7182818 to 7 significant digits after the decimal point. This lower limit means that you cannot design an algorithm that will do fewer than this number of comparisons on average to sort a set of N data values. Similarly, the lower limit on the number of data moves is N-S+C, where S is the number of single-element permutation cycles and C is the number of permutation cycles in the input permutation. Note that the lower limit on the number of comparisons does not limit the amount of extra memory space available for sorting. Therefore, it has been a big challenge to develop an algorithm that will sort N data values using the lower limit and a constant amount of extra memory space. Research and development to solve this problem has been going on for over 60 years. The best known algorithms that approach the lower limit on the number of comparisons using a constant amount of extra memory and no more than a constant multiple of Nlog2(N) data moves are listed in the attached bibliography and are associated with the papers listed at the end of these pages. However, these algorithms do not achieve the exact lower limit and in some instances have other undesirable attributes, as explained below. From now on, we use the term sort to mean a process of ranking a set of data values by the method of pair-wise comparison of data values followed by data move operation(s). Actually, you can do an optimum comparison sort with O(N) data moves. I have just finished implementing several such in-place optimum algorithms using k-way merge-sort and other bit-manipulation techniques. One implementation actually does optimal comparisons with no more than 2N moves using N-1 additional bits. I will make it available to you soon, preferably via a book and a patent.
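
To see these two limits concretely, the following Java sketch (illustrative example code, not one of the algorithms discussed here) evaluates the exact information-theoretic bound log2(N!) against the approximation quoted above, and computes N-S+C from the cycle structure of a sample permutation.

    // Illustrative sketch: evaluates the two quoted lower limits. log2(N!) is
    // the exact bound on average comparisons; Stirling's formula gives the
    // approximation quoted in the text (its O(1) term is omitted here).
    public class LowerLimits {
        static double exactBound(int n) {        // log2(N!) = sum of log2(i)
            double s = 0;
            for (int i = 2; i <= n; i++) s += Math.log(i) / Math.log(2);
            return s;
        }

        static double stirlingBound(int n) {     // N*log2(N) - N/ln 2 + log2(N)/2
            double log2n = Math.log(n) / Math.log(2);
            return n * log2n - n / Math.log(2) + log2n / 2;
        }

        // Data-move limit N - S + C per the text: S counts single-element
        // cycles, C counts all permutation cycles of the input permutation.
        static int moveBound(int[] perm) {
            int n = perm.length, cycles = 0, singles = 0;
            boolean[] seen = new boolean[n];
            for (int i = 0; i < n; i++) {
                if (seen[i]) continue;
                int len = 0;
                for (int j = i; !seen[j]; j = perm[j]) { seen[j] = true; len++; }
                cycles++;
                if (len == 1) singles++;
            }
            return n - singles + cycles;
        }

        public static void main(String[] args) {
            int n = 1000;
            System.out.printf("exact: %.1f, approx: %.1f%n",
                              exactBound(n), stirlingBound(n));
            int[] perm = {2, 0, 1, 3, 5, 4};     // cycles: (0 2 1), (3), (4 5)
            System.out.println("move limit: " + moveBound(perm)); // 6 - 1 + 3 = 8
        }
    }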

There are many claims out there about the best or the fastest sort algorithm, but in most cases these are lacking in understanding or actual usefulness. For instance, there is a poor-quality hashing scheme that compares itself to quick-sort in order to make a claim about its speed. I am surprised that the system has actually been granted a patent! There are also claims that use confusing words and unnecessarily complex mathematics to affirm an improvement on something that already exists.

I can assure you, as an open-minded person, that even the above quoted expression on the lower limit for the number of comparisons is not correct in practical terms! I came to this conclusion from two accounts: (1) from an algorithm I tested a couple of years ago, and (2) from algorithm design considerations based on the main assumption used in deriving this expression.

FINDING THE BEST ALGORITHM

The design and analysis of algorithms is concerned with the efficient use of hardware and time resources on digital computer systems that do data manipulation by way of the Stored Program Concept. In the computational analysis of algorithms, we seek to analyse the correctness and complexity of methods that will efficiently use time and hardware resources on digital computer systems. By correctness, we mean that the algorithm will produce correct results for all instances of the input problem. We restrict our discussion by assuming that computational problems are generally defined over a data set. We restrict it further by saying that the computational problem deals with permutations of an input set of values. A more general but still restrictive view is that of the Object-Oriented concept of Classification. In this case, the descriptive attributes include the set of data values along with the algorithmic behaviour that can modify these values or give a dynamic description of an encapsulation.

Usually, most algorithms designed for digital computer systems are specific to the problem being solved. The way an algorithm works is highly dependent on the class of input problems it is designed to solve. Therefore, a thorough understanding of the problem to be solved is essential to the development of an efficient and correct algorithm that makes the best use of computer resources. The computational solution to some problems is a generalised solution that must be applicable to a large number of, if not all, input permutations of a given data set. Therefore, we see that in most cases the problem solution is also a question of generalisation and classification of a computational method that best fits the general class of input permutations. Assumptions about particular characteristics of the input problem can also be generalised. Generalisation is an important step in the derivation of efficient algorithmic solutions for computational methods on data values that can assume a large number of input permutations and a large variation in the structure of their component values. However, generalisation sometimes introduces counterproductive restrictions that can be very detrimental if they are not understood or their implications are not made known during the derivation of an algorithmic solution. For example, in deriving the solution to the merge/sort problem, it is commonly assumed that all data items in the input list to be sorted occupy the same number of memory locations. Most solutions to the sorting problem would not work if each data item were allowed to have variable length. Fortunately, our historical experience is that most instances of the sorting problem have fixed-length records for each data item in the input file to be sorted. Another problem with generalisation is the way in which the lower limit on the number of comparisons for sorting is derived and presented. Surely this presentation cannot be correct, because the number of comparisons must be a discrete value!

The permutation of discrete entities and their relationship to a computational method should be well understood before a generalisation is attempted in order to derive a general definition of the computational problem at hand. The process of generalisation should not be so restrictive as to exclude legitimate specific instances of the problem, nor should it be so inclusive as to extend the complexity of the problem. As we can see from the case of the sorting problem, many solutions have been derived that approximate the desired optimum solution.

I had to make this point because of my personal experience in researching solutions to the sorting problem. I can, for instance, state that there is indeed a case where the commonly quoted theoretical limit on the number of comparisons for comparison sort algorithms does not hold. I discovered this long ago, but I was always given a hostile response. I was also close to discovering a generality that gives a hint towards a solution to the network sorting problem, but I was horrified by those who had self-centred pursuits. Nevertheless, I have now managed to come up with a simple proof that is not based on the zero-one principle and that gives some other results. I am also more mature in dealing with these characters and have persevered!

BEST SOLUTION TO THE COMPARISON SORT PROBLEM

Why is the algorithm listed herein the best solution so far to the comparison sorting problem?

(1) It is based on the standard 2-way merge-sort (a plain textbook version of this base is sketched after this list for reference)
(2) It uses a constant amount k of extra memory to do no more than (1 + 2log2(k)/k)Nlog2(N) - (N+1) comparisons and 2(1 + 3/k)Nlog2(N) record moves
(3) The algorithm is in-place and stable
(4) It is easy to understand and code
(5) It is consistent: it always gives the same operational complexity regardless of the input permutation
(6) It can be extended to k-way merge-sort to implement an O(N)-moves optimum sort algorithm
(7) It is free! No patent or copyright to worry about, and I will give you some code at a later date so that you can run experimental tests to see for yourself. You can get information on this algorithm in the public domain. A copy is available at King's College DCS web site.
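
The in-place algorithm itself is not reproduced here. For reference, the sketch below is a plain textbook 2-way merge-sort in Java, the base mentioned in point (1); it is stable and does roughly Nlog2(N) comparisons, but it uses O(N) extra space for the merge buffer, which is exactly the cost the in-place algorithm avoids.

    // Plain textbook 2-way merge-sort, for reference only: this is the base
    // the in-place algorithm builds on, NOT the in-place algorithm itself.
    // Stable, O(N log N) comparisons, but O(N) extra space for the buffer.
    import java.util.Arrays;

    public class MergeSort2Way {
        static void sort(int[] a) {
            if (a.length > 1) sort(a, 0, a.length, new int[a.length]);
        }

        // Sorts a[lo:hi) using buf as scratch space.
        static void sort(int[] a, int lo, int hi, int[] buf) {
            if (hi - lo < 2) return;
            int mid = (lo + hi) >>> 1;
            sort(a, lo, mid, buf);
            sort(a, mid, hi, buf);
            // Merge the two sorted halves through the buffer.
            int i = lo, j = mid, k = lo;
            while (i < mid && j < hi)
                buf[k++] = (a[i] <= a[j]) ? a[i++] : a[j++]; // <= keeps it stable
            while (i < mid) buf[k++] = a[i++];
            while (j < hi)  buf[k++] = a[j++];
            System.arraycopy(buf, lo, a, lo, hi - lo);
        }

        public static void main(String[] args) {
            int[] a = {3, 7, 1, 9, 2, 8, 5, 4};
            sort(a);
            System.out.println(Arrays.toString(a)); // [1, 2, 3, 4, 5, 7, 8, 9]
        }
    }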

We also have an improvement on the above algorithm with implementation code in Java:
SORRY! At this point in time, I have to try to recoup my costs and to get some returns on my investment in time and effort! I'll put all of this in a book with more material so you can buy a copy. It won't be cheap! And I am now pleased to let you know that this book will soon be ready, some time in 2015. You will also see an explanation of the fastest merge-sort algorithm, which does a stable sort with Nlog2(N) - (N+1) comparisons and N-S+C data moves using o(N) pointers. With this technique you do not even need to move data values to do a search by the linear or binary techniques. In addition, most of your data manipulation during the sort can be done in the CPU and by bit manipulation.

 
