Leap

Computer Science. Technology. Impression

Notes on data structures and algorithms

Motivation

Computers are ubiquitous. They help us solve so many problems, and for many people they are already an integral part of daily life. We expect them to be fast and to use hardware resources effectively. We can achieve these goals through carefully crafted software.

Computers are dumb: they have only two fundamental capabilities, following instructions precisely and storing data. They don’t actually solve problems; we have to solve the problem and provide them with the steps of the solution. Those steps are called an algorithm. As the algorithm is written for a computer, we should have a good understanding of how computers work.

As we want an efficient solution/algorithm, we are interested in the time complexity and space complexity of the solution.

Unless the solution is just a mathematical expression (for example, calculating the sum of consecutive natural numbers up to a given number), it operates on data that is specified in (or derived from) the problem. For the sake of understanding, let us assume the data is a group of objects. The basic operations on the group will then be: adding objects, deleting objects, updating an object, simple exploration/traversal of the objects and, most importantly, searching for an object. As a computer’s memory is a giant array of bytes, we first need to implement those operations (on data) as required by the solution. Data structures are the study of the various ways to organize and operate on data, so that we have an idea of the time or space complexity of the various operations we will need. As the handling of data operations is an important part of the solution itself, the data structure we choose affects the time and space complexity of the solution.

We can produce many solutions/algorithms for a problem, but we want the best of them in terms of runtime and space complexity. To do that, we need to compare the solutions. Using a real computer to measure runtime and space for each is a very tedious task, and not all computers have the same capabilities. Hence, to simplify the complexity calculation and to make a common ground for benchmarking algorithms, we choose a theoretical computer — the RAM model of computation.

Random Access Machine

RAM is a simplified computer: it has a single processor, no cache, and each basic instruction takes constant time. Get to know more about RAM here and here.

But I’m also listing properties here:

Accounting on RAM

We calculate the time and space complexity of various algorithms for a problem and compare them to pick a relatively better algorithm. Those calculations are formulated in terms of the number of constant-time (low-level) operations needed on the RAM (for some amount of data).

TODO: worst-case, average-case and best case analysis somewhere

Asymptotic analysis

Sometimes we are interested in solving the problem assuming we are working on a large amount of input data, because these days computers operate on huge volumes of data. If that is the case, we use asymptotic analysis to simplify the complexity calculation of an algorithm. Once we’ve formulated the complexities, as part of asymptotic analysis we simplify the formulation assuming that the input is very large, i.e. close to infinity. With this, our process of selecting a relatively better algorithm becomes easier.

The analysis made for large inputs might not be suitable for small inputs (for example, insertion sort is better for small inputs than merge sort, even though merge sort is asymptotically better than insertion sort).

Refer: Big-O Notation

Data structures

As discussed, solutions that just need numerical computation may not need any data structure per se — for example, calculating the GCD, or checking whether a given number is prime. But many other solutions do need data structures. The memory of the RAM is a giant array of memory locations which can be randomly accessed. We store/organize the data in it and operate on it.
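As a sketch of such purely numerical solutions, here are the two examples just mentioned, written without any data structure beyond a few integer variables:

```python
# Euclid's algorithm for the GCD: a purely numerical solution.
def gcd(a, b):
    while b:
        a, b = b, a % b
    return a

# Trial-division primality test: also needs no data structure.
def is_prime(n):
    if n < 2:
        return False
    i = 2
    while i * i <= n:  # checking divisors up to sqrt(n) is enough
        if n % i == 0:
            return False
        i += 1
    return True
```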

Fundamental ways to organize data in memory

Fundamentally, we can store the data contiguously (as an array of objects) or non-contiguously (as linked objects).

Array:

As the objects in an array are of the same type and are stored contiguously, randomly accessing an object by its index takes constant time — you can always calculate the position of an object from its index. But insert/delete operations take more work, because room has to be made for the extra object (or the gap closed after a removed one). We also need to know the number of objects upfront to be able to allocate the required memory. An array is suitable if the data is of a fixed size known at the time of allocation and we perform the indexing operation often.

Linked objects:

If we want to insert/delete objects often, or we don’t know the size of the data upfront, we can go for linked objects, where objects are linked to each other. Here the objects may not be stored contiguously. As the physical locations of the objects are not evident, we can’t index into linked objects as quickly as we did in an array. Each object stores the position(s) of the object(s) that can be reached from it. Examples: singly linked list, doubly linked list.
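A minimal sketch of linked objects — a singly linked list where each node stores a reference to the next one (class and method names here are illustrative):

```python
class Node:
    def __init__(self, value, nxt=None):
        self.value = value
        self.next = nxt  # "position" of the next object: a reference


class SinglyLinkedList:
    def __init__(self):
        self.head = None

    def push_front(self, value):
        # O(1): no other objects need to be shifted, unlike in an array.
        self.head = Node(value, self.head)

    def __iter__(self):
        # Traversal must follow the links; there is no O(1) indexing.
        node = self.head
        while node:
            yield node.value
            node = node.next
```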

Where we place the data structure — the stack area or the heap area — depends on the lifetime of the data (function scope or program scope) and on its size as well (stack space is limited).

Importance of relations

Now, imagine we’ve an array of integers and our task is to check whether a given integer exists in the array. Finding the given integer costs time proportional to the size of the array. (Consider that we do this operation very often.) How can we reduce this? We sort the array. Interestingly, when we sort the array in ascending order, we’ve established a correlation between the locality of an integer and its value — an integer is located after all integers that are less than it. Using this correlation we can perform binary search. The same principle applies to binary search trees, where all smaller keys are stored in a node’s left subtree. Height-balanced binary search trees provide a find operation logarithmic in the input size even in the worst case, at the cost of some extra time to rebalance the tree right after each change (which may be acceptable if changes to the tree are rare compared to read/find operations).
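The binary search described above can be sketched like this — each comparison exploits the value/locality correlation to discard half of the remaining range:

```python
def binary_search(sorted_arr, target):
    """Return True if target is in sorted_arr, using O(log n) comparisons.

    Precondition: sorted_arr is sorted in ascending order.
    """
    lo, hi = 0, len(sorted_arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_arr[mid] == target:
            return True
        if sorted_arr[mid] < target:
            lo = mid + 1  # target can only be to the right of mid
        else:
            hi = mid - 1  # target can only be to the left of mid
    return False
```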

I would highly recommend watching Sean Parent’s talk “Better Code: Data Structures”, which helped me concretise this idea.

Another example is hashing, where we establish a correlation between the representation of an object and its location.

TODO: Try to give more examples.

Abstract data types:

To solve the problem, we first need to decide on the operations on the objects. The theoretical definition of the required operations is called an Abstract Data Type (ADT); the type of the data describes the operations allowed on it. Because an ADT doesn’t have an implementation, it is called abstract. For a given ADT, we try to implement data structures that support those operations; we compare them and pick the one with a reasonable amount of complexity. Some common ADTs that may be incorporated into a solution are the dynamic array, stack, queue, circular queue, priority queue, graph, min-max heap, map (of key-value pairs), and union-find. It is very rare that you’ll have to implement an ADT yourself; you only need to when the available implementations are not suitable for your use case, or you haven’t found one that suits your needs. Often, well-implemented ADTs are built into the programming language you work in or can be used from a library. It is worth knowing the various properties of the readily available implementations before using them in your particular case.
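To make the ADT/implementation split concrete, here is a sketch: the stack ADT defines only the operations (push, pop, peek, is_empty), while this particular implementation happens to back them with a Python list (the class and method names are illustrative):

```python
class Stack:
    """Stack ADT: LIFO access via push, pop, peek, is_empty.

    This implementation backs the ADT with a dynamic array (Python list);
    a linked-list implementation would satisfy the same interface.
    """

    def __init__(self):
        self._items = []

    def push(self, item):
        self._items.append(item)   # amortized O(1)

    def pop(self):
        return self._items.pop()   # O(1)

    def peek(self):
        return self._items[-1]     # O(1)

    def is_empty(self):
        return not self._items
```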

Map

For example, a key-to-value map can be implemented with hashing, which is called a HashMap (in Python it is the dict), or with a balanced BST, which can be called a TreeMap. These different implementations have different characteristics.
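A quick sketch of the hash-based variant using Python’s built-in dict (note: Python’s standard library has no tree-based map; an ordered, O(log n) map would come from a third-party package):

```python
# dict is Python's hash-based map: average O(1) insert and lookup,
# but it does not keep keys in sorted order.
ages = {}
ages["alice"] = 30
ages["bob"] = 25

assert ages.get("alice") == 30       # average O(1) lookup
assert ages.get("carol") is None     # missing key -> default

# A tree-based map would instead keep keys sorted and support
# ordered traversal and range queries, at O(log n) per operation.
```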

Set

A set of items can be implemented with hashing (a HashSet) or with a balanced BST (a TreeSet). The characteristics of a HashSet and a TreeSet differ, for example in the ordering of the stored elements: a HashSet doesn’t guarantee sorted order of the items while traversing through them, but a TreeSet does. If we want to remember the insertion order of items into the set, a LinkedHashSet may be used.
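In Python the built-in set is hash-based; there is no built-in TreeSet or LinkedHashSet, but the two orderings can be sketched like this:

```python
items = [3, 1, 2, 1]

hash_set = set(items)            # hash-based: no ordering guarantee
in_sorted_order = sorted(hash_set)   # what a TreeSet traversal would give

# Since Python 3.7, dict preserves insertion order, so dict.fromkeys
# can stand in for a LinkedHashSet: dedupe while keeping first-seen order.
in_insertion_order = list(dict.fromkeys(items))
```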

List

Let us take the example of the list type provided by Python. Though a list can be used as both a stack and a queue, it is not an optimal option as a queue; deque from Python’s collections library is more suitable as a queue.

list is an example of a dynamic array, or variable-sized array. A variable-sized array can be implemented with linked objects as well as with contiguously stored objects. If the indexing operation is required, then the variable-sized array should be implemented with contiguously stored objects — but that is a little tricky. In this lecture Prof. Erik Demaine explains implementing a variable-sized array with contiguously stored objects using table doubling. Python’s list type is backed by a table-doubling implementation, whereas deque is implemented as doubly linked data objects. Both can grow to variable length, but deque can insert and delete on both ends efficiently. And a deque can be used as a circular queue as well.
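The table-doubling idea mentioned above can be sketched as follows — keep a fixed-size table, and when it fills up, allocate one twice as large and copy everything over, which makes appends amortized O(1) (class name is illustrative):

```python
class DynamicArray:
    """Variable-sized array over contiguous storage via table doubling."""

    def __init__(self):
        self._capacity = 1
        self._size = 0
        self._table = [None] * self._capacity  # fixed-size backing table

    def append(self, value):
        if self._size == self._capacity:
            self._grow()  # rare O(n) copy; amortizes to O(1) per append
        self._table[self._size] = value
        self._size += 1

    def _grow(self):
        # Allocate a table twice as large and copy the old objects over.
        self._capacity *= 2
        new_table = [None] * self._capacity
        for i in range(self._size):
            new_table[i] = self._table[i]
        self._table = new_table

    def __getitem__(self, index):
        # Contiguous storage keeps indexing O(1).
        if not 0 <= index < self._size:
            raise IndexError(index)
        return self._table[index]

    def __len__(self):
        return self._size
```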

Priority Queue

A priority queue can be implemented with the heap data structure.
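Python’s heapq module does exactly this — it maintains a binary min-heap on top of a plain list, so the smallest priority is always popped first:

```python
import heapq

# heapq maintains a binary min-heap inside an ordinary list.
tasks = []
heapq.heappush(tasks, (2, "write tests"))
heapq.heappush(tasks, (1, "fix bug"))
heapq.heappush(tasks, (3, "refactor"))

# Pops come out in priority order: O(log n) per push/pop.
priority, task = heapq.heappop(tasks)
```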

Graphs can be implemented with linked nodes, an adjacency matrix (a two-dimensional array), or an adjacency list.
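An adjacency list, for instance, maps each vertex to the list of its neighbours — a minimal sketch for an undirected graph:

```python
from collections import defaultdict

# Adjacency list: each vertex maps to the list of its neighbours.
graph = defaultdict(list)

def add_edge(u, v):
    # Undirected graph: record the edge in both directions.
    graph[u].append(v)
    graph[v].append(u)

add_edge("a", "b")
add_edge("a", "c")
add_edge("b", "c")
```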

Explore trees: height-balanced binary search trees (AVL or red-black trees), B-trees etc.

Example

TODO: give more examples

Data structures in programming languages

Many high-level programming languages provide abstract data types built in, and others have a library to pick from. For example, Python and JavaScript have common ADTs built in; for C++ and Java, you can use their standard libraries. Try to understand the memory model of each language you are using.

TODO: more in this section?

Algorithms

As discussed, to produce algorithms for computers we need a good understanding of how computers work. An algorithm must be correct, and we should be able to check/prove its correctness before using it (for all problem instances it should produce the correct result). There are techniques to prove that a given algorithm is correct. Out of all correct (candidate) algorithms, we pick the optimal one (in our context). So we should always focus on producing correct algorithms first, and then we can think about optimizations.

Refer Chapter 1.1 of CLRS. (CLRS 3rd Ed)

Some important points:

Various common problems and solutions/algorithms:

Some well-known problems and their solutions are listed here.

Searching problems:

Sorting problems:

Graph algorithms

TODO: update this section

General problem solving techniques:

Give reference to examples of these:

Talk about these terms:

Computability

Introduce computability of problems, NP-completeness?

Preparation for interviews

References:

Videos

  1. MIT 6.006 Introduction to Algorithms, Fall 2011
  2. Better Code series by Sean Parent

Textbooks

  1. CLRS
  2. The Algorithm Design Manual

Further Reading: