Saturday, 29 March 2014

Week 11 - Sorting and Efficiency

Hi Everyone,

This week we learned four new kinds of sorting algorithms: quick sort, merge sort, count sort and the built-in sort. Quick sort and merge sort both use a divide-and-conquer strategy to sort the list, which is completely new compared to what I learned in CSC108. Count sort just uses loops, but it is faster than both quick sort and merge sort since it runs in O(n + k) time, where k is the range of the values. The built-in sort is amazing: it is the fastest of them all, but I didn't really understand how it works until I reached my lab session.
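
We never wrote count sort out in class here, but a minimal sketch of the idea, assuming a list of non-negative integers, could look like this (my own toy version, not the code from lecture):

def count_sort(L: list) -> list:
    """Sort non-negative integers in O(n + k) time, k = max value."""
    if not L:
        return []
    counts = [0] * (max(L) + 1)         # one counter per possible value
    for x in L:                         # one pass over the n items
        counts[x] += 1
    result = []
    for value, c in enumerate(counts):  # one pass over the k + 1 counters
        result.extend([value] * c)
    return result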

In this post, I am going to talk about sorting and the efficiency computations I learned from my extra reading. Then I will use quick sort and merge sort as examples and work out their time complexity. After that I will show a graph from my last lab comparing the efficiency of all the sorting algorithms we've seen so far. Finally, I will tell you what my TA said, and what I think, about the origin of Python's built-in sort. Don't miss it!

Reflective Thinking

Danny taught us a way to measure the sorting time of different algorithms to check their efficiency: import the time module, record the start time before running the sort, then record the end time after it finishes. The difference is the running time of the sort. The less time it takes, the more efficient the sort is.
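
For example, here is a minimal sketch of that timing approach (the selection_sort below is my own toy version, not Danny's exact code):

import time
from random import shuffle

def selection_sort(L: list) -> None:
    """A toy sort to time; any sort from this post would do."""
    for i in range(len(L)):
        m = min(range(i, len(L)), key=lambda j: L[j])
        L[i], L[m] = L[m], L[i]

L = list(range(2000))
shuffle(L)
start = time.time()   # record the start time
selection_sort(L)
end = time.time()     # record the end time
print('running time:', end - start, 'seconds')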

A more formal way is to compute the time complexity of a sorting algorithm. Actually, we've already seen implicitly how to compute time complexity in CSC108. For example, think about how many iterations insertion sort does in its worst case on a list of length n. There are roughly n(n - 1)/2 = (n^2 - n)/2 iterations, and when n goes to infinity, the n^2 term grows much faster than the linear term, so we just ignore the smaller terms. Similarly, any constant coefficients can be ignored. Since the running time depends on how many iterations we do, the running time is bounded by n^2, and the time complexity is O(n^2). The time complexity of an algorithm is defined as a way "to quantify the amount of time taken by an algorithm to run as a function of the length of the string representing the input" (Wikipedia). It is always written in Big-O notation; in our example it is O(n^2).
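
To check that n^2 growth concretely, here is a small sketch that counts insertion sort's inner-loop steps on a reversed list, its worst case (my own toy code, not from class):

def insertion_sort_steps(n: int) -> int:
    """Count compare-and-swap steps on a reversed list of length n."""
    L = list(range(n, 0, -1))   # a reversed list is the worst case
    count = 0
    for i in range(1, n):
        j = i
        while j > 0 and L[j - 1] > L[j]:
            L[j - 1], L[j] = L[j], L[j - 1]
            j -= 1
            count += 1
    return count

for n in (10, 100, 1000):
    print(n, insertion_sort_steps(n), n * (n - 1) // 2)   # the two counts match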

But I was wondering about algorithms that may require more space to store values, even if they are less time-consuming. So I did some research on sorting algorithms and their efficiency. Comparing best-case and worst-case time complexity is one of the aspects we need to consider when judging efficiency. For a more complete and formal comparison, we also need to consider the memory needed and whether the algorithm is stable, that is, whether it keeps equal items in their original order.

I used to think sorting algorithms only applied to lists of integers, that they only helped to develop our logic skills and had no practical use. However, after some extra reading, I've changed my mind. For example, a list of countries can be sorted by population, by area, by country code and so on. It is very useful to have a sorted list.
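
For example, Python's built-in sorted takes a key function, so we can sort by whatever field we like (the country data below is made up for illustration):

# hypothetical (name, population, area) records
countries = [('Canada', 35000000, 9984670),
             ('Iceland', 325000, 103000),
             ('Japan', 127000000, 377975)]

by_population = sorted(countries, key=lambda c: c[1])
by_area = sorted(countries, key=lambda c: c[2], reverse=True)
print([c[0] for c in by_population])   # smallest population first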

Divide List Sorting:

We saw three kinds of sorts in CSC108: bubble sort, insertion sort and selection sort. Those sorts make more sense to me intuitively; when Danny talked about insertion sort in class, the first thing I thought of was sorting a hand of poker cards. Haha~
But quick sort and merge sort are definitely faster and more advanced than the three sorts we'd seen before. They both use recursion and divide the list.

Quick sort:

Algorithm:
Find a pivot (here, always the first item in the list)
Compare every other item with the pivot: smaller items go to the left, greater items go to the right
Keep doing this recursively until the left/right part contains only 0 or 1 items
Join the left list, the pivot, and the right list back together:
"return quick(left) + [pivot] + quick(right)"

Best case:
The pivot always ends up in the middle of the list.
Then the list needs to be divided in half log(n) times.
To find the pivot's correct position, every item is compared against the pivot value: that is about n comparisons per level of division.
Since every level of dividing the list costs about n comparisons, and there are log(n) levels, the result is O(nlog(n)).
(Another way I imagine it: there are n items in total, and every item gets used as a pivot once to divide the list, so the total time is roughly the number of items times the cost of each division.)
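
Written informally as a recurrence (with c for the constant cost of one comparison), the best case unrolls like this:

T(n) = 2T(n/2) + cn       # cn to partition, plus two half-size sorts
     = 4T(n/4) + 2cn
     = 8T(n/8) + 3cn
     ...
     = nT(1) + cn log(n)  ->  O(nlog(n))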

Worst case:
The pivot is the extreme smallest or largest value, leaving a completely uneven division.
Sorting n items divides into sorting 0 items and n - 1 items, then 0 and n - 2 items, then 0 and n - 3 items, and so on, so the list ends up being divided about n times.
To find each pivot's correct position, every remaining item is compared against the pivot value: that is up to n comparisons per division.
So with about n divisions, each costing up to n comparisons, the result is O(n^2).
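
The same kind of informal recurrence shows it:

T(n) = T(n - 1) + cn
     = T(n - 2) + c(n - 1) + cn
     ...
     = c(1 + 2 + ... + n)
     = cn(n + 1)/2  ->  O(n^2)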

Code:

def quick(L: list) -> list:
    if len(L) < 2:
        return L[:]   # a list of 0 or 1 items is already sorted
    else:
        i = 0         # pivot: the first item in the list
        # everything smaller than the pivot goes left, the rest goes right
        return (quick([x for x in L if x < L[i]]) +
                [L[i]] +
                quick([x for x in (L[0:i] + L[i+1:]) if x >= L[i]]))
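
For example:

>>> quick([3, 1, 4, 1, 5, 9, 2, 6])
[1, 1, 2, 3, 4, 5, 6, 9]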

Merge sort:

One main difference between merge sort and quick sort is that merge sort requires extra memory space for the merged lists. Also, merge sort's best case and worst case have the same time complexity.

Algorithm:
Divide the list in half
Keep doing this until each part contains only 0 or 1 items
From this point, merge the divided lists back together in order:
  Compare the front items of the left and right parts, and move the smaller one, item by item, into a new list
  That new list becomes the left/right part for the next level up, and is merged with its corresponding right/left in the same way
Keep doing this until all the items are merged back into one sorted list (my code sketch is below, after the complexity discussion)

Best Case & Worst Case:
Merge sort does exactly the same work in the best case and the worst case: no matter what, we divide the list all the way down to single items by halving each time, then put them back together one by one.
A list of length n will be divided log(n) times. Then, at each level of merging, we compare and move every item into a new list, so there are about n comparisons per level.
Therefore, the time complexity of merge sort is O(nlog(n)) in both the best case and the worst case.
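
Code:

We didn't write merge sort out above, so this is my own sketch of the algorithm (not the exact code from class):

def merge(left: list, right: list) -> list:
    """Merge two sorted lists into one new sorted list."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:        # move the smaller front item
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    return result + left[i:] + right[j:]   # one side may have leftovers

def merge_sort(L: list) -> list:
    if len(L) < 2:
        return L[:]        # 0 or 1 items: already sorted
    mid = len(L) // 2      # divide the list in half
    return merge(merge_sort(L[:mid]), merge_sort(L[mid:]))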

Graph Illustration

Since we didn't save the graph we made in the lab, I will just post my hand-drawn graph. It is not accurate, but the general shape and growth trend of each curve shouldn't be affected too much. We chose fairly large list lengths, since we are more interested in sorting performance as n goes to infinity. Below, I will explain why some sorts have the same time complexity but still differ in running time.

[Hand-drawn graph: running time vs. list length for bubble, insertion, selection, merge, quick and built-in sort]

We have seen that bubble, insertion and selection sort all have O(n^2) time complexity, while merge, quick and built-in sort all have O(nlog(n)) time complexity. In general, the graph coincides with our computation: bubble, insertion and selection obviously take more time than merge, quick and built-in.
*I will explain the built-in sort afterwards

Bubble > Insertion > Selection:
Bubble sort checks the entire list on every pass, and swaps any neighbouring items that are out of order.
Insertion sort only checks part of the sorted region: it inserts the first item of the unsorted part into its correct position in the sorted part by swapping it with its neighbours.
Selection sort finds the minimum value in the unsorted part and swaps it with the first item of the unsorted part, so it makes at most one swap per pass.
Bubble swaps the most, then insertion, then selection. Therefore, even though they are all O(n^2), bubble is the slowest and selection is the quickest of the three.
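
To see the difference in swaps concretely, here is a small sketch that counts swaps for bubble and selection on the same shuffled list (my own toy code):

from random import shuffle

def bubble_swaps(L: list) -> int:
    L, count = L[:], 0
    for i in range(len(L)):
        for j in range(len(L) - 1 - i):
            if L[j] > L[j + 1]:
                L[j], L[j + 1] = L[j + 1], L[j]
                count += 1
    return count

def selection_swaps(L: list) -> int:
    L, count = L[:], 0
    for i in range(len(L)):
        m = min(range(i, len(L)), key=lambda k: L[k])
        if m != i:
            L[i], L[m] = L[m], L[i]
            count += 1
    return count

L = list(range(200))
shuffle(L)
print(bubble_swaps(L), selection_swaps(L))   # bubble does far more swaps
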
Merge > Quick > Built-in:
It felt weird to me that merge takes longer than quick when I compared their complexities, since quick's worst case is O(n^2) while merge's worst case is O(nlog(n)).
So I went back and thought the two functions over.
Merge sort first divides the list all the way down to single items, still in arbitrary order, and only afterwards adds them back one by one, in order, into new lists.
Quick sort, however, separates the list into the parts smaller and greater than the pivot as it divides, so each pivot's final position is decided at the same time. By the time the list is fully divided, every item has been used as a pivot once and is already in the position where it should be.
That's why I think merge > quick in running time on typical input: quick sort does its useful work during the division, while merge sort still has all the merging to do afterwards.

Built-in sort:

Now... we come to what I thought was the hardest part...
I was frustrated about what the built-in sort actually is, why it is so fast, and what its time complexity is.
So I talked with my great TA ~~

Where does Python come from?
A computer language is usually written on top of another, more basic language. Python was written in C. C was invented in the 1970s, very old ha, but it is the basis for many programming languages we use nowadays: C++, Java and Python are all C-based languages. So the Python built-in sort was also written in C.
But, which C sort?
So I went on googling about this stuff, and learned that the Python built-in sort is actually Timsort, a hybrid of merge sort and insertion sort implemented in C (not a C quick sort, as I first guessed). That's why it has O(nlog(n)) complexity, the same as quick sort's average case.
Since C code is compiled and runs at very high speed, the Python built-in sort (which runs in C) is faster than a quick sort written in Python.

Thus we can conclude and explain why, in running time, Merge > Quick > Built-in sort.



It is a long post this week~ :P
I feel sorting and efficiency are not easy to comprehend. I used to train myself to spin a globe in my head in order to do well in my Geography class. Now, yesterday once more!
How does everyone feel about sorting and efficiency?
Any comments will be appreciated~ Well, especially on the built-in sort part :D

See you next week~

3 comments:

  1. Wow, now I see why Python built-in sort is so efficient.

  2. Notice that even in the worst case, quick sort is still better than the others (of course, not counting Timsort)

  3. Ya quick sort is very amazing~
