Motivation
Computers are ubiquitous. They help us solve many problems, and for many people they are already an integral part of daily life. We expect them to be fast and to use hardware resources effectively. We can achieve these goals through carefully crafted software.
Computers are dumb: they have only two fundamental capabilities, following instructions precisely and storing data. They don't actually solve problems; we have to solve the problem and provide them with the steps of the solution. Those steps are called an algorithm. Since the algorithm is written for a computer, we should have a good understanding of how computers work.
Because we want an efficient solution/algorithm, we are interested in the time complexity and space complexity of the solution.
Unless the solution is just a mathematical expression (for example, calculating the sum of consecutive natural numbers up to a given number), it operates on data that is specified in (or derived from) the problem. For the sake of understanding, let us assume the data is a group of objects. The basic operations on the group are then: adding objects, deleting objects, updating an object, simple exploration/traversal of the objects, and, most importantly, searching for an object. Since a computer's memory is a giant array of bytes, we first need to work out how to support these data operations as required by the solution. A data structure is a way to organize and operate on data; we study data structures so that we know the time and space complexity of the various operations we will need. Because handling the data is an important part of the solution itself, the data structure we choose affects the time and space complexity of the solution.
We can produce many solutions/algorithms for a problem, but we want the best of them in terms of runtime and space complexity. To decide, we need to compare the solutions. Measuring the runtime and memory of each on a real computer is a tedious task, and not all computers have the same capabilities. Hence, to simplify the complexity calculation and to establish a common ground for benchmarking algorithms, we choose a theoretical computer: the RAM model of computation.
Random Access Machine
The RAM is a simplified computer: it has a single processor, no cache, and each basic instruction takes constant time. Get to know more about the RAM here and here.
But I'm also listing its properties here:
- Each simple operation (+, *, -, =, if, call) takes exactly 1 time step.
- Loops and subroutines are not considered simple operations. Instead, they are the composition of many single-step operations. It makes no sense for sort to be a single-step operation, since sorting 1,000,000 items will take much longer than sorting 10 items. The time it takes to run through a loop or execute a subprogram depends upon the number of loop iterations or the specific nature of the subprogram.
- Each memory access takes exactly one time step, and we have as much memory as we need. The RAM model takes no notice of whether an item is in cache or on the disk, which simplifies the analysis.
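To make the accounting concrete, here is a small illustration of counting RAM-model steps for summing a list of numbers. The exact per-statement charges (one step per assignment, comparison, addition, and access) are my own modeling choice for demonstration; the point is that the total grows linearly with the input size.

```python
# Count RAM-model steps for summing n numbers.
# Charging scheme (illustrative): each assignment, comparison,
# addition, and element access costs exactly 1 step.
def sum_with_step_count(items):
    steps = 0
    total = 0           # 1 step: assignment
    steps += 1
    for x in items:     # per iteration: 1 element access + 1 loop test
        steps += 2
        total += x      # 1 addition + 1 assignment
        steps += 2
    steps += 1          # the final loop test that fails
    return total, steps

total, steps = sum_with_step_count([1, 2, 3, 4])
print(total, steps)  # 10 18  -- steps = 4n + 2, linear in n
```

Under this charging scheme the step count is 4n + 2; different (but equally reasonable) charging schemes change the constants, not the linear growth.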
Accounting on RAM
We calculate the time and space complexity of various algorithms for a problem and compare them to pick a relatively better algorithm. Those calculations are formulated in terms of the number of constant-time (low-level) operations needed on the RAM for a given amount of data.
TODO: worst-case, average-case and best case analysis somewhere
Asymptotic analysis
Sometimes we solve a problem assuming we are working on a large amount of input data, because these days computers operate on huge volumes of data. In that case, we use asymptotic analysis to simplify the complexity calculation of an algorithm. Once we have formulated the complexities, asymptotic analysis simplifies the formulas by assuming the input is very large, i.e. approaching infinity. This makes the process of selecting a relatively better algorithm easier.
An analysis made for large inputs might not hold for small inputs. For example, insertion sort is better for small inputs than merge sort, even though merge sort is asymptotically better than insertion sort.
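A toy comparison makes the caveat visible. The cost models below are deliberately made up (insertion sort charged n² operations, merge sort charged 8·n·log₂n, where the constant 8 is an arbitrary assumption standing in for merge sort's larger per-operation overhead): the asymptotically worse algorithm wins for small n and loses for large n.

```python
import math

# Illustrative cost models; the constants are invented for demonstration.
def insertion_cost(n):
    return n * n                       # ~n^2 comparisons/shifts

def merge_cost(n):
    return 8 * n * math.log2(n) if n > 1 else 1   # ~c * n log n, c = 8

for n in (8, 64, 4096):
    print(n, insertion_cost(n), round(merge_cost(n)))
# At n=8 insertion sort is cheaper; by n=64 merge sort has overtaken it.
```

The crossover point depends entirely on the constants, which asymptotic analysis deliberately ignores; this is why real libraries often switch to insertion sort below a small threshold.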
Refer: Big-O Notation
Data structures
As discussed, solutions that need only numerical computation may not need any data structure per se: for example, calculating a GCD, or checking whether a given number is prime. But many other solutions do need data structures. The memory of the RAM is a giant array of memory locations which can be randomly accessed. We store/organize the data in it and operate on it.
Fundamental ways to organize data in memory
Fundamentally, we can store data contiguously (as an array of objects) or non-contiguously (as linked objects).
Array:
Since the objects in an array are stored contiguously and are of the same type, indexing is easy: you can always calculate the position of an object from its index, so random access takes constant time. But insert/delete operations take more work, because room must be made (or closed up) for the object. We also need to know the number of objects upfront to allocate the required memory. An array is suitable when the data has a fixed size known at allocation time and we perform the indexing operation often.
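Both points can be sketched in a few lines. The helpers `offset_of` and `insert_at` are hypothetical names for illustration: the first shows the constant-time offset arithmetic behind indexing, the second shows why a middle insertion costs linear time.

```python
# Indexing: with fixed-size elements, the byte offset of an element
# is pure arithmetic on the index, so access is constant time.
ELEM_SIZE = 4  # assume 4-byte integers

def offset_of(index, elem_size=ELEM_SIZE):
    return index * elem_size   # address = base + this offset

# Insertion: making room means shifting every later element right.
def insert_at(arr, index, value):
    arr.append(None)                      # grow by one slot at the end
    for i in range(len(arr) - 1, index, -1):
        arr[i] = arr[i - 1]               # O(n - index) element moves
    arr[index] = value

a = [10, 20, 40]
insert_at(a, 2, 30)
print(offset_of(5), a)  # 20 [10, 20, 30, 40]
```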
Linked objects:
If we want to insert/delete objects often, or we don't know the size of the data upfront, we can go for linked objects, where objects are linked to each other. Here objects may not be stored contiguously. Since the physical locations of the objects are not evident, we can't index into linked objects as quickly as we can in an array. Instead, each object stores the position(s) of the object(s) that can be reached from it. Examples: singly linked list, doubly linked list.
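A minimal singly linked list sketch (class and method names are my own) shows the trade-off: insertion at the head is constant time because nothing shifts, but reaching the n-th element requires walking n links.

```python
# Minimal singly linked list: O(1) insert at the head, O(n) traversal.
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next    # link to the next object, or None

class SinglyLinkedList:
    def __init__(self):
        self.head = None

    def push_front(self, value):
        # Constant time: just rewire the head link.
        self.head = Node(value, self.head)

    def to_list(self):
        # Linear time: follow the links one by one.
        out, node = [], self.head
        while node:
            out.append(node.value)
            node = node.next
        return out

lst = SinglyLinkedList()
for v in (3, 2, 1):
    lst.push_front(v)
print(lst.to_list())  # [1, 2, 3]
```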
Whether we place the data structure in the stack area or the heap area depends on the lifetime of the data (function scope or program scope) and on its size as well (the stack is limited).
Importance of relations
Now imagine we have an array of integers and our task is to check whether a given integer exists in the array. Finding the integer costs time proportional to the size of the array. Suppose we perform this operation very often; how can we reduce the cost? We sort the array. Interestingly, when we sort the array in ascending order, we establish a correlation between an integer's location and its value: an integer is located after all integers that are less than it. Using this correlation we can perform binary search. The same principle applies to binary search trees, where all smaller keys are stored in a node's left subtree. Height-balanced binary search trees give us a find operation that is logarithmic in the input size even in the worst case, at the cost of some extra work to rebalance the tree right after each change (which may be acceptable if changes are rare compared to read/find operations).
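The idea can be demonstrated with the standard-library `bisect` module, which performs binary search on a sorted sequence; the `contains` helper is a hypothetical wrapper for illustration.

```python
import bisect

# Sorting establishes the value-position correlation; binary search
# then discards half of the remaining range at every step.
data = sorted([42, 7, 19, 3, 88, 19])   # [3, 7, 19, 19, 42, 88]

def contains(sorted_arr, target):
    i = bisect.bisect_left(sorted_arr, target)  # leftmost insertion point
    return i < len(sorted_arr) and sorted_arr[i] == target

print(contains(data, 19))  # True
print(contains(data, 20))  # False
```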
I would highly recommend watching Sean Parent's talk “Better Code: Data Structures”, which helped me concretise this idea.
Another example is hashing, where we create a correlation between the representation of an object and its location.
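A bare-bones sketch of that correlation (the fixed capacity and the `put`/`get` helpers are my own simplifications; real hash tables resize and handle collisions more carefully): the bucket index is computed directly from the key's representation, so lookup jumps straight to the right location.

```python
# Toy hash table: bucket index derived from the key's hash.
CAPACITY = 8                                # illustrative fixed size
buckets = [[] for _ in range(CAPACITY)]

def bucket_index(key):
    return hash(key) % CAPACITY             # representation -> location

def put(key, value):
    b = buckets[bucket_index(key)]
    for pair in b:                          # collision chain within bucket
        if pair[0] == key:
            pair[1] = value
            return
    b.append([key, value])

def get(key):
    for k, v in buckets[bucket_index(key)]:
        if k == key:
            return v
    return None

put("apple", 3)
put("pear", 5)
print(get("apple"))  # 3
```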
TODO: Try to give more examples.
Abstract data types:
To solve the problem, we first need to decide the operations on our objects. The theoretical definition of the required operations is called an Abstract Data Type (ADT). The type of the data describes the operations allowed on it; because an ADT has no implementation, it is called abstract. For a given ADT, we implement data structures that support its operations, compare them, and pick one with a reasonable amount of complexity. Some common ADTs that may be incorporated into a solution are the dynamic array, stack, queue, circular queue, priority queue, graph, min-max heap, map (of key-value pairs), and union-find. It is very rare that you'll have to implement an ADT yourself; you may have to only when the available implementations are not suitable for your use case, or when you haven't found one that suits your need. Well-implemented ADTs are often built into the programming language you work in or available from a library. It is worth knowing the properties of the readily available implementations before using them in your particular case.
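The ADT/implementation split can be sketched with a stack: the ADT is just the operation set (push, pop, is_empty), and two interchangeable implementations (names are my own) back it with contiguous and linked storage respectively.

```python
# Two implementations of the same Stack ADT.
class ArrayStack:
    def __init__(self):
        self._items = []               # contiguous backing store

    def push(self, x):
        self._items.append(x)

    def pop(self):
        return self._items.pop()

    def is_empty(self):
        return not self._items

class LinkedStack:
    def __init__(self):
        self._head = None              # linked backing store

    def push(self, x):
        self._head = (x, self._head)   # cons a new head node

    def pop(self):
        x, self._head = self._head
        return x

    def is_empty(self):
        return self._head is None

for stack in (ArrayStack(), LinkedStack()):
    stack.push(1)
    stack.push(2)
    print(stack.pop(), stack.pop())  # 2 1 for both implementations
```

Client code written against the ADT works with either implementation; which one is better depends on the complexity trade-offs discussed above.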
Map
For example, a key-to-value map can be implemented with hashing, which is called a HashMap (in Python it is the dictionary), or with a balanced BST, which can be called a TreeMap. These different implementations have different characteristics.
Set
A set of items can be implemented with hashing (a HashSet) or with a balanced BST (a TreeSet). Their characteristics differ, for example in the ordering of stored elements: a HashSet doesn't guarantee sorted order of items, but a TreeSet does. If we want the set to remember the insertion order of items (when traversing the elements), then a LinkedHashSet may be used.
List
Let us take the example of the `list` type provided by Python. Though `list` can be used as both a stack and a queue, it is not an optimal option as a queue; `deque` from Python's `collections` module is more suitable. `list` is an example of a dynamic, or variable-sized, array. A variable-sized array can be implemented with linked objects as well as with contiguously stored objects. If the indexing operation is required, it should be implemented with contiguously stored objects, but that is a little tricky. In this lecture Prof. Erik Demaine explains implementing a variable-sized array over contiguous storage using table doubling. Python's `list` is backed by a table-doubling implementation, whereas `deque` is implemented as doubly linked blocks of objects. Both can grow to variable length, but `deque` can insert and delete at both ends efficiently, and it can be used as a circular queue as well.
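The difference in a few lines: `list.pop(0)` shifts every remaining element (linear time), while `deque.popleft()` is constant time; `deque(maxlen=...)` also gives circular-buffer behavior out of the box.

```python
from collections import deque

# deque as a FIFO queue: O(1) at both ends.
q = deque()
q.append("a")
q.append("b")
q.append("c")            # enqueue at the right
print(q.popleft())       # "a" -- dequeue at the left in O(1)
# (a list would do this with q.pop(0), shifting "b" and "c" over)

# deque as a bounded circular buffer:
ring = deque(maxlen=3)
for i in range(5):
    ring.append(i)       # once full, the oldest item falls off the left
print(list(ring))        # [2, 3, 4]
```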
Priority Queue
A priority queue can be implemented with the heap data structure.
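In Python this is available directly: the standard-library `heapq` module maintains a binary min-heap inside a plain list, so the smallest priority is always popped first.

```python
import heapq

# heapq keeps the list 'pq' arranged as a binary min-heap.
pq = []
heapq.heappush(pq, (2, "write code"))
heapq.heappush(pq, (1, "design"))
heapq.heappush(pq, (3, "test"))

while pq:
    priority, task = heapq.heappop(pq)   # always the smallest priority
    print(priority, task)                # prints in order 1, 2, 3
```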
Graphs can be implemented with linked nodes, an adjacency matrix (a two-dimensional array), or an adjacency list.
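An adjacency list is often the most natural of the three in Python: a dict mapping each node to the list of its neighbours (the graph below is a made-up example).

```python
# Adjacency-list representation of a small undirected graph:
#   A -- B -- D
#   |
#   C
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B"],
}

def neighbors(g, node):
    return g.get(node, [])   # unknown nodes simply have no neighbours

print(neighbors(graph, "A"))  # ['B', 'C']
```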
Explore trees: height-balanced binary search trees (AVL or red-black trees), B-trees, etc.
Example
TODO: give more examples
Data structures in programming languages
Many high-level programming languages provide abstract data types built in, and others offer a library to pick from. For example, Python and JavaScript have common ADTs built in; for C++ and Java, you can use their standard libraries. Try to understand the memory model of the language you are using.
TODO: more in this section?
Algorithms
As discussed, to produce algorithms for computers we need a good understanding of how computers work. An algorithm must be correct, and we should be able to check/prove its correctness before using it (it should produce the correct result for all problem instances). There are techniques to prove that a given algorithm is correct. Out of all correct (candidate) algorithms, we pick the one that is optimal in our context. So we should always focus on producing correct algorithms first, and then think about optimizations.
Refer Chapter 1.1 of CLRS. (CLRS 3rd Ed)
Some important points:
- Understanding recursion is important: to reuse/parameterize code, we break it into functions, and recursion is very useful for breaking a problem down into smaller problems that are easy to analyze.
- Understanding the problem context and the input data well helps produce algorithms that work best in that specific context. For example, in a sorting problem, if the range of the data values is very small compared to the size of the data, we can go for counting sort.
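The counting sort mentioned above is short enough to sketch in full. It sorts in linear time O(n + k), but only under the stated assumption that values are non-negative integers bounded by a small `max_value`.

```python
# Counting sort: linear time when the value range (0..max_value)
# is small relative to the number of items.
def counting_sort(items, max_value):
    counts = [0] * (max_value + 1)
    for x in items:
        counts[x] += 1                  # tally each value
    out = []
    for value, count in enumerate(counts):
        out.extend([value] * count)     # emit each value in order
    return out

print(counting_sort([3, 1, 4, 1, 5, 2], 5))  # [1, 1, 2, 3, 4, 5]
```

Note that no comparisons between elements happen at all; the algorithm sidesteps the comparison-sorting lower bound by exploiting knowledge about the data.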
Various common problems and solutions/algorithms:
Some well-known problems and their solutions are listed here.
Searching problems:
- Binary Search: This algorithm requires data to be sorted.
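A standard iterative version of binary search, for reference; each step halves the remaining range, giving O(log n) time on a sorted sequence.

```python
# Iterative binary search over a sorted list.
def binary_search(sorted_items, target):
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid                    # found: return its index
        elif sorted_items[mid] < target:
            lo = mid + 1                  # discard the left half
        else:
            hi = mid - 1                  # discard the right half
    return -1                             # not present

data = [3, 7, 19, 42, 88]
print(binary_search(data, 42))  # 3
print(binary_search(data, 5))   # -1
```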
Sorting problems:
- Insertion sort: in-place, stable, optimal for almost sorted data. If other properties of data are unknown, then suitable for small number of data items.
- Quick sort: divide and conquer (D&C)
- Merge sort: divide and conquer (D&C)
- Heap sort: uses the heap data structure.
- Counting/bucket sort: linear-time sorting under assumptions about the data.
- Lower bounds for searching or sorting.
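As a representative from the list above, here is insertion sort; it is in-place and stable, and on nearly sorted input each element moves only past its few out-of-order neighbours, which is why it wins on small or almost-sorted data.

```python
# Insertion sort: in-place, stable, O(n^2) worst case,
# near-linear on almost-sorted input.
def insertion_sort(a):
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0 and a[j] > key:   # shift larger elements right
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key                 # drop key into its slot
    return a

print(insertion_sort([5, 2, 4, 6, 1, 3]))  # [1, 2, 3, 4, 5, 6]
```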
Graph algorithms
TODO: update this section
General problem solving techniques:
- Divide and conquer
- Greedy
- Dynamic programming
- etc
- Explain the relation among those.
Give reference to examples of these:
- Recursion
- Back tracking
- Search algorithms
- Sort algorithms
- Graph algorithms
- Exploration algorithms
- BFS, DFS
- Shortest path algorithms (generic cases to specific cases (positive weighted, negative weighted, negative weighted with cycles))
- Union-find and Hacker rank union-find
- Famous problems (TSP, Knapsack)
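Of the structures in the list above, union-find is compact enough to sketch here; this version uses the two standard optimizations, path compression and union by size, which together make each operation nearly constant time.

```python
# Union-Find (disjoint sets) with path compression and union by size.
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))   # each element starts in its own set
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path compression
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra            # attach smaller tree under larger
        self.size[ra] += self.size[rb]

uf = UnionFind(5)
uf.union(0, 1)
uf.union(1, 2)
print(uf.find(0) == uf.find(2))  # True  -- 0, 1, 2 are connected
print(uf.find(0) == uf.find(4))  # False -- 4 is still on its own
```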
Talk about these terms:
- Invariant
- Correctness
- Reductions
Computability
Introduce computability of problems, NP-completeness?
Preparation for interviews
- [Top 5 Data Structures](https://www.youtube.com/watch?v=DDRo29ptFwE) and [Top 5 Algorithms](https://www.youtube.com/watch?v=EM8IgIIiOdY)
- [Several Coding Patterns for Solving Data Structures and Algorithms Problems during Interviews](https://github.com/Chanda-Abdul/Several-Coding-Patterns-for-Solving-Data-Structures-and-Algorithms-Problems-during-Interviews)
- [Coding Interview University](https://github.com/jwasham/google-interview-university/blob/master/README.md#google-interview-university)
References:
Videos
- MIT 6.006 Introduction to Algorithms, Fall 2011
- Better Code series by Sean Parent
Textbooks
- CLRS
- Algorithm Design Manual
Further Reading:
- What is Datastructure
- TODO: Update