Notes on data structures and algorithms
Introduction
Computers are ubiquitous. They help us solving so many problems and for many people they are already integral part of daily life. We expect them to be fast and use hardware resources effectively. We can achieve these goals through carefully crafted software.
Computers are dumb and they have only two fundamental capabilities: following instructions precisely and storing data. They don’t actually solve the problems, we have to solve the problem and provide them with steps of solution. Those steps of solution is called an algoritm. As the algorithm is written for computer, we should have a good understanding of how computers work.
As we want to have an efficient solution/algorithm, we are interested in time complexity and space complexity of the solution.
Unless the solution is just a mathematical expression (For example, calculate sum of consecutive natural numbers upto given number), it operates on the data that is specified in(or derived from) the problem. For the sake of understanding let us assume that the data is a group of objects, then the basic operations on the group will be: add more objects, delete objects, updating an object, simple exploration/traversal of objects and more importantly search for an object. As computer’s memory is a gaint array of bytes, we need to first solve those operations(on data) as needed by the solution. Data Structure is a concept where we study various ways to organize and operate on the data so that we’ve an idea of time complexity or space complexity of various operations we will need. As solution to data operations is an important part of the solution itself, the data structure we choose affects time and space complexities of solution.
We can produce many solutions/algorithms to a problem, but we want the best of those in terms of runtime and space complexity. To do that, we need to compare those solutions. Using a real computer to calculate runtime and space complexity for each is a very tedious task, and all computers don’t have same capabilities. Hece, to simplify the complexity calculation, to make a common ground for benchmarking algorithms we choose a theoritical computer — the RAM model of computation.
Random Access Machine
RAM is a simpliefied computer: it has single procesor, no cache, each basic instruction take constant time. Get to know more about RAM here and here.
But I’m also listing properties here:
- Each simple operation (+, *, -, =, if, call) takes exactly 1 time step.
- Loops and subroutines are not considered simple operations. Instead, they are the composition of many single-step operations. It makes no sense for sort to be a single-step operation, since sorting 1,000,000 items will take much longer than sorting 10 items. The time it takes to run through a loop or execute a subprogram depends upon the number of loop iterations or the specific nature of the subprogram.
- Each memory access takes exactly one time step, and we have as much memory as we need. The RAM model takes no notice of whether an item is in cache or on the disk, which simplifies the analysis.
Accounting on RAM
We calculate time and space complexity of various algorithms for a problem and compare them to pick a relatively better algorithm. Those calculations are formulated interms of number of constant-time(low-level) operations needed on RAM(for some amount of data).
TODO: worst-case, average-case and best case analysis somewhere
Asymptotic analysis
Sometimes we are interested in solving the problem assuming that we are working on large amount of input data, because these days computers are operating on huge volumes of data. If that is the case, we use asymptotic analysis to simplify the complexity calculation of an algorithm. Once we’ve formulated the complexities, as part of asymptotic analysis we simplify the formulation assuming that the input is very large, i.e. close to infinify. With this, our process of selecting a relatively better algorithm will be easier.
The analysis made for large inputs might not be suitable for small inputs (For ex: the insertion sort is better for small inputs than merge sort which asymptoticallly better than insertion sort).
Refer: Big-O Notation
Data structures
As discussed, solutions that just need numerical computation may not any data structure per se. For example, calculating GCD, Check if given number is prime number or not? etc. But many other solutions do need data structures. The memory of RAM, is a gaint array of memory locations which can be randomly accessed. We store/organize the data in it and operate.
Fundamental ways to organize data in memory
Fundamentally we can store the data contigously (as an array of objects) or non-contigous/linked (linked objects).
Array
As the objects stored as an array are stored contigously, randomly accessing an object based on its index takes constant time. But insert/delete operations will take more work. We also need to know the no. of objects upfront to be able to allocate the required memory.
In an array, objects are stored contigously. As an array objects are of same type and are stored contigously, is very easy to index into an array – you can always calculate the position of object if you know the index of the object in that array. But inserting into an array would be a costlier operation, because to make room for the extra object. Array is suitable if we the data is of fixed size known at the time of allocation and we perform indexing operation often.
Linked objects
But if we want to be able to insert/delete objects often or we don’t know the size of data upfront. We can go for Linked objects, where objects are linked to each other. In this objects may not be stored contigously. As the physical location of the objects are not evident, we can’t index into Linked objects as quickly we did in an array. Each object stores position of object(s) that can be reached. Eamples: Single linked list, double-linked list.
Where do we place the data structure: either stack area or heap area depends on the life time of the data(function scope or program scope) and the size of data as well(stack will be limited).
Read:
Operations on linked-list:
- Can’t index, linear traversal is required. Even if the linked list is sorted finding an object will take linear time.
- Binary seach won’t work, linked list can be augmented to be tranformed into a skip list to get same effect of binary search.
- Modified mergesort suits best for sorting.
- Using fast and slow pointer, we can find, middle of linked list, k th last element etc.
Imporatnace of relations
Now, imagine we’ve an array of integers and our task is to check if a given integer exists in our array. Here, we need to find the given integer in the array. It costs us time that is propotional to the size of the array. (Consider we do this opeartion very often) But how can we reduce this? We sort the array. Interestingly when we sort the array in ascending order, we’ve established correlation between locality of the integer with its value — an integer is located after integers that are less than this. Using this correlaation we perform binary search. The same principle applies to binary search trees where all keys that are lesser will be stored in its left side. We know the concept of Height Balanced Binary Search Trees, which provide us find operation in logarithmic of input even in worst case. But we also spend some extra time to balance the tree, right after a change is done on the tree(which may be ignored if changes to the tree is lesser compared to read/find operations).
I would highly recommend watching Sean Parent “Better Code: Data Structures”, which helped me concretise this idea.
Another example is Hashing: where we bring correlation between representaton of object and its location.
TODO: Try to give more examples.
Abstract data types
For solving the problem, we first need to decide the operations on objects. The theoretical definition of required operations is called an Abstract Data Type. Type of the data describes operations allowed on the data. Because ADT don’t have implementation, it is called as abstract. For a given ADT, we try to implement data structures that supports those operations; we compare them and pick the one with a reasonable amount of complexity. Some common ADTs that may be incorporated into the solution are Dynamic array, Stack, Queue, Circuar queue, Priority queue, Graph, Min-Max-heap, Map(of a key-value pair), and Union-Find etc. It is very rare that you’ll have to implement ADT yourself. You may have to implement ADT yourself only when you feel the availble implementaion is not suitable for your use case or you’ve not found any implemention that suits your need. Sometimes, well implemented ADTs may be built-into the programming language you work on or can be used from a library. It is worth knowing various properties of the readily availble implementsions before using them in your particular case.
List
list
is an example of dynamic array or variable sized array. Variable sized array can be implemented as linked objects as well as contigously stored objects. If indexing operation is required then Variable sized array should be implemented with contigously stored objects. But it is little tricky to implement Variable sized array with contigously stored objects. In this lecture Prof. Erik Demaine explains implementing Variable sized array as contigously stored objects using Table Doubling. Python’s list
data type is backed by table doubling implementation where as Deque
is implemented as double-linked data objects. Both can grow to variable length, but Deque can do insert and delete operations on both the sides effectively. And, python’s deque
container can be used as a Circular Queue aswell.
Python’s list can be used as both Stack and Queue. But, list
is not an optimal option to implement Queue as queue will take insertions at the beginning. Deque from Python’s collections library is more suitable as Queue.
Queue
FIFO is the removal strategy. Read:
- To Queue Or Not To Queue
- Queue can be implemented with both array and linked list. Lined list based implementation is easy to do. Because dequeue will create empty slots at the front of the list.
- Circular queue or ring buffer can be thought of a bounded queue, and when this is implemented using array, each deque operation will create empty slots at the front of the array. We need to use these empty slots to enqueue items further, hence we can see circular usage..
- We can use ring buffer implementation to track most recent elements in the stream, for example, given a stream of data, calculate max/sum of most recent K numbers.
Stack
LIFO is removal strategy. Example problem that uses stack datastructure:
Find immediante next greater element for each element in an array
[Backtracking, Browser tab’s next and previous navigation and undo-redo on a document] () TODO Read:
Know both array based and linked list based implementation.
Map
Map datastructure store key and value pairs to provide efficient look-ups. For example, a key to value map can be implemented with hashing which is called as HashMap (in Python it is Dictionary) or with a balanced BST which can be called as TreeMap. LinkedHashMap with a doubley linked list. These different implementations have different characteristics.
- HashMap provides faster access and can’t provide natural/sorted ordering of the keys. No order.
- TreeMap provides O(log N) operations and provides natural/sorted ordering of the keys. Natural order.
- LinkedHashMap remembers the insertion order.
Examples
- LRU cache implementation using HashMap and DLL
- HashMap will have the mapping of key and the objects location in a double-linked-list of the objects. We use linked list of objects to sort the objects based on “recently access”, the “least recently used” object will be at the end of the linked list. Whenever we insert/update a key’s value, we move that object to the beginning. When we needed to evict a key to insert a new key, we delete the last element from linked list (which is least recently used).
- We could also use a single linked list here, but the implementation will be tricky. The main reason for double linked list is to facilitate the easy deletion of object. So, check how we can replace that with single linked list
- LFU cache implementation - ADT using HashMap and DLL.
- This cache will evict the “least frequently used” element when it has to insert a new key-value pair and its limit is reached.
- Usually this is implemented with “min-heap” in which objects are stored based on their “access count”, but this provides O(log N) complexity. O(1) solution is discussed, using double linked list and HashMap, go through this.
Set
A set of items can be implemented with Hashing, which is HashSet or a balanced BST, which is TreeSet, or a linked list, which is LinkedHashSet. The characteristics of HashSet, TreeSet and LinkedHashSet are different. HashSet doesn’t remember any order, where as TreeSet and LinkedHashSet remember order of elements. Along with set functionality, if we need an iterator of elements to traverse them in the natural order then TreeSet can be used, if we need an iterator to traverse in the order of insertion then LinkedHashSet (ADT contains set and double linked list) can be used. Also, the timecomplexities may vary. HasSet and LinkedHashSet provides O(1) but TreeSet provides O(log N) for all basic operations like find, insert and delete.
Difference between Map and Set is elements of set will not have associated values but map’s elements will have associate members. In Map values are indexed by the key like elements in array are indexed by their position.
Redis Sorted Set
Let us say we’ve below requirements:
- We’ve a key value pair, we want to get the value of given key: HasMap functionality (Or dict in python).
- We want the ordering of elements based on their value. To answer queries like, rank of an element, top X elements based on value and update the value as and when required.
Example could be a leader board. Given a player you want to know their score. Who are top 10 players, what is the rank of a particular user.
This ADT is interesting and is provided in Redis, this isn’t provided in Python or Java but we can implement. This ADT can be implemented using a HasMap and a skip list. Here, skip list will have elements (key-value pairs) sorted using their values. Through skip list we can achieve binary search like functionality in linked lists – O(long N) for search. In this ADT, skip list can be augmented with extra information (like how many elements are skipped) to calculate the rank very quickly.
For more information refer Sorted Set. The name sorted set seems counterintuitive but I also have no name for a HashMap which also maintains the sorted ordering based on values.
Heap
Read:
Priority Queue
Priority Queue can be implemented with Heap data structure.
Graphs can be implemented with linked nodes or adjacency matrix (two-dimentional array) or adjacency list
Explore trees: Height balanced binary search trees(AVL or Red black trees), B-trees etc
TODO: give more examples
Trees
- in-order, pre-order and post-order traversal.
- depth-first traversal (recursion/backtracking/stack), breadth-first traversal (queue).
Data structuers in programming lanauges
Many high level programming lanauges provide abstract data types built-in and other will have a library where we can pick up. For ex: Python, Javascript have common ADTs as built-in. For C++ and Java, you can use from their standard library. Maybe try to Understand the memory model of each the language you are using.
In the above section, we’ve mentioned some datastructures from Java and some from Python. The goal is to understanding various implementations of ADTs, hence we’ve collated for different languages.
TODO: more in this section?
Algorithms
As discussed, for producing algorithms for computers, we need good understanding on how computers work. Algorithm must be correct, and we should be able to check/prove the correctness before using it(For all problem instances it should produce correct result). There are techniques to prove that given algorithm is correct. Out of all correct algorithms (candidate algorithms), we pick the optimal one(in our context). So, we should always focus on producing correct algorithms, and then we can think about optimations.
Refer Chapter 1.1 of CLRS. (CLRS 3rd Ed)
Some important points:
- Understanding recursion is important, because for reusing/parameterizing the code, we break the code in functions. Recursion is very useful in breaking down the problems into smaller problems of easy analysis.
- Understanding the problem context and the input data very well can help producing algorithms that work best in that specific context. For example, if we are solving a sorting problem, if our data range is very small compared to the size of the data, we go for a counting sort.
Various common problems and solutions/algorithms:
Some well-known problems and their solutions are listed here.
Searching problems:
- Binary Search: This algorithm requires data to be sorted.
Sorting problems:
- Insertion sort: in-place, stable, optimal for almost sorted data. If other properties of data are unknown, then suitable for small number of data items.
- Quick sort: D&C
- Merge sort: D&C
- Heap sort: Using heap data structure.
- Counting/bucket sort: Linear time sorting, data assumptions
- Lower bounds for searching or sorting.
Graph algorithms
TODO: update this section
General problem solving techniques:
- Divide and conquer
- Greedy
- Dynamic programming
- etc
- Explain the relation among those.
- Sliding window technique
Give reference to examples of these:
- Recursion
- Back tracking
- Search algorithms
- Sort algorithms
- Graph algorithms
- Exploration algorithms
- BFS, DFS
- Shortest path algorithms (generic cases to specific cases (positive weighted, negative weighted, negative weighted with cycles))
- Union-find and Hacker rank union-find
- Famous problems (TSP, Knapsack)
- Exploration algorithms
Talk about these terms:
- Invariant
- Correctness
- Reductions
Computablility
Introduce Computatability of problem, NP-Completeness?
Preperation for interviews
- Top 5 Data Structures and Top 5 Algorithms
- [Several Coding Patterns for Solving Data Structures and Algorithms Problems during Interviews] (https://github.com/Chanda-Abdul/Several-Coding-Patterns-for-Solving-Data-Structures-and-Algorithms-Problems-during-Interviews)
- Coding Interview University
References:
Videos
Textbooks
- CLRS
- Algorithm Design Manual