Let's Talk Linked-Lists

July 21, 2019

Nick Parlante

This blog is first and foremost a tribute to Nick Parlante, whose articles on pointers and linked-lists I have found tremendously helpful as I continue on this path of programming.

Arrays in C

As someone whose experience is primarily in JavaScript and Python, resizing an array is not something I ever needed to worry about.

If I add an element to a list (array) in Python or an array in JS, I can do so without worrying that the array in question will have enough space for these new items. In Python and JS, lists and arrays will resize themselves automatically based on need, however the same can't be said for languages like C.

In C, you need to say, "I need an array of size n of this type." Something like this:

char charArr[14];
int matrix[50][50]; 

For example, say the above code is part of a function somewhere in your program. At compile time, C knows you will have a char array of size 14 and an int matrix (or 2-D array) of size 2500. C will take this information and allocate memory for these arrays in what is known as the stack. The stack is a part of your computer's memory where functions and variables live. Items placed in the stack are temporary, since they only exist for the duration of a function and are then destroyed. Consequently, when a function returns and its local variables are destroyed, the memory being used by those variables is freed up, meaning it can be used for other stuff.

In a moment, we will see how the stack differs from another piece of your computer's memory known as the heap, though for now simply understand that the heap needs help and the stack got your back; that is, the programmer needs to explicitly declare what the heap should do with memory and when, whereas data in the stack is managed automatically by C.

Arrays in Memory

What is happening when memory is allocated in the stack for our arrays?

C designates in the stack a contiguous block of memory for each array. To understand why an array uses a contiguous block of memory, let's review how memory is represented in a computer.

Your computer's memory is a contiguous (someone get me a thesaurus!) sequence of bytes, each of which has its own address. These addresses are integers that range from 0 to the size of your memory (e.g., 8GB) and are in order, i.e., 1, 2, 3, ... n where n is the size of your memory. These addresses are important because it's where your data (or technically words) lives, so let's see how an array makes use of them.

In C a char is one byte, thus a char array of size 14 is 14 bytes. Let's say the first element in this array is at memory address 1. Since the data in our memory is contiguous and we know the size of our data-type (i.e., a char is 1 byte in memory) we know the next element is at address 2. Let's put this idea into in an equation.

Let B be our base address. For all of our arrays, the base address, B, will be the address of our first element. Let c be a contsant that represents how big of a step over our next piece of data is — we'll call this the stride. The stride will be the size of our data-type. For example, our stride with a char array is 1 byte, whereas with an int array it would be 4. Last, i is our index. Given this information, we can say:

address of element in array = B + c * i

This is why arrays start indexing at 0, because B + (0 * 1 byte) = B, i.e., the first element of an array.

What is our runtime of accessing an indexed element? Well our equation has one addition and one multiplication — two constant time procedures. Therefore indexing an array costs constant time.

Array Drawbacks

So constant runtime when indexing an array of any size is obviously very nice, but we only get this functionality because we predetermined the size of the array.

Though what if our array requires more space than we allocated at compile time? We could provide extra space in the array by making it bigger than we anticipate our needs to be, but then the stack will allocate memory that will never be used and that's poor use of resources. Also with this approach, theoretically, your array can still overflow.

Another approach would be to create a new array that is twice the size of the current array, then copy the elements from the current array to the new, larger array. But the copying of elements would take linear time and you still have the issue of allocating more memory than necessary.

This is where linked-lists provide an alternative. With linked-lists, we can construct a connected sequence of elements using dynamic memory allocation, and thus never encounter the issue of running out of space in our data-structure.

Linked-Lists

It's About the Address

Earlier I mentioned how the stack got your back, but the heap needs help. In order to implement dynamic-memory, the programmer needs to explicitly allocate and deallocate the heap's memory based on need — a common data-structure for this procedure is called a linked-list (For this post, a linked-list refers to a singly linked-list).

Before we get into code, let's get a picture of what a linked-list is doing. Let's say we have two values of type int we want to store, 1 and 2. We already know that if we put these values in an array, since there are only two, their addresses will make them neighbors in memory. We also know that because elements in an array are part of a contiguous block of memory, any element in an array can be indexed in constant time. Remember, indexing is about finding the memory address of a particular value in an array.

Then a linked-list comes a long and sees what really makes an array useful is knowing where the address of an element is in constant time. Knowing this, it says forget this contiguous memory stuff, I'm just going to keep track of these addresses, wherever they may be in memory.

So really a linked-list is a data-structure that will create a relationship between memory addresses when the memory is not contiguous. To understand how it does this, we need know about nodes.

Hello, Node!

So we still have our values of type int, 1 and 2, and we want to connect them in some way. We now understand that what we really need is someway of knowing the address of an element we want to connect to, i.,e. we are at value 1 and want to connect to 2. What if we could develop a data-structure that contains our value, e.g., 1, and the memory address of wherever 2 is — this is a node.

A node is comprised of two parts:

  • a value (e.g., int, char, etc.)
  • a reference to a memory address called next

If there isn't another value to reference, then next will point to NULL.

This is what a node looks like:

{
  value: someValue,
  next: addressOrNull
}

Now we can see a way to connect our two values. In our first node we will store 1 and an address to where 2 is and our second node will contain 2 and its next will point to NULL, since we only have two nodes in this case. But what if we did want to add another value, and thus another node? Well, we can tell the new node's next to point to the first node, i.e., the one containing 1, and then we're done.

It would just be nice to know where this connection of nodes starts, so let's call our leading node head. The head is important because we need it to access every node in our list. In other words, a linked-list only points to next not prev, so it can't go back. (That's a doubly-linked list, and is beyond the scope of this post.) So if we access the middle node as a starting point to find a value, half of our list is lost in memory!

And that is a linked-list: a collection of nodes where the first node is called head and the last node's next points to NULL. Every node in between these two will contain an address to another node. This address will tell us where exactly the next node is in memory, thus allowing access to the next node in constant time.

Memory Allocation

Now you're probably wondering, "What about all that talk of the programmer allocating memory in the heap?"

I am glad you asked. Now we will return to some C code to understand how a program will implement a linked-list.

First, let's see what a node looks like in C:

struct node {
  int value;
  struct node* next;
};

That * after node is saying that next isn't a node itself, but rather a memory address for another node — let's call this a pointer. Why a pointer? Becuase we only have the address of the node, not the node itself. The address points us to the where the node is in memory, thus the name pointer.

Let's take a look at some code from Nick Parlante's Linked Lists Basics, then we will break it down.

struct node* BuildOneTwoThree() {
  struct node* head = NULL;
  struct node* second = NULL;
  struct node* third = NULL;
  
  head = malloc(sizeof(struct node)); 
  second = malloc(sizeof(struct node));
  third = malloc(sizeof(struct node));

  head->data = 1; 
  head->next = second;

  second->data = 2; 
  second->next = third;

  third->data = 3; 
  third->next = NULL;  

  return head;
}

struct node* head = NULL;

With struct node* head = NULL, we are declaring our first node pointer named head. It is very important to know that when we declare a pointer, it does not have a memory address to point to yet. If we were to try and assign a value to head at this point, we would get a Segmentation Fault error in our program, because there is no where in memory for our value to go.

Think of the memory address as a container that holds your data. At this point, we have not assigned head a container.

Thus before we can start using a node, we need to allocate memory for it. This is where malloc comes in.

malloc

malloc is used to allocate memory on the heap. When we call head = malloc(sizeof(struct node)), we are giving the orders, "head, here is your memory address. malloc, create a space in memory for head the size of struct node at the address given to head."

Now we can fill our nodes with some data since there is a place in memory for that data to go.

Dereferencing

Lets give our nodes some data. In C, if a pointer is a struct (think dict in Python, or Object in JavaScript), its members are accessed via the arrow operater (->) — this is known as dereferencing.

Dereferencing is the process of accessing the memory your pointer has the address to. If we were to try and deference a node, i.e., head->data = 1 where head is a node pointer, before head = malloc(sizeof(struct node)) allocates memory on the heap for the node, we just dereferenced a bad pointer (which is bad)! Deferencing a bad pointer is when you try and put some data in your pointer's reference before creating that reference.

So we allocated memory for our pointers and provided them data, but what happens when we're done with the data? We know the stack automatically destroys variables when a function exits, but that's not the case when using the heap. With the heap, we need to free up that memory once we are done with it.

free

Once you are done with your pointer, you need to free up the memory it was using. Of course, if you do this before you finish using the values being referenced in your pointers, well, you won't be able to use those values anymore. Consider the following code that frees up the memory used by nodes in the linked-list returned by BuildOneTwoThree:

// Creates a pointer to head, i.e., the first node
struct node* LinkedList = BuildOneTwoThree();
struct node* currNode;

while(currNode != NULL) {
  currNode = LinkedList;
  LinkedList = LinkedList->next;
  free(currNode);
}

This code will free every node until we reach NULL, indicating there are no more items in the list.

Linked-Lists Drawbacks

So we clearly see how linked-lists win in the memory allocation debate, though how do they fair in terms of time complexity? In particular, we are concerned with the worst case runtime of accessing an element in a linked-list.

With arrays we saw that indexing any element is a constant time procedure, though with linked-lists our Big O of accessing an element is linear-time. This is because the worst-case scenario is having the element of interest be in the last node of the linked-list, thus meaning we need to traverse the entire linked-list to get our value.

Conclusion

I really enjoyed learning about linked-lists and C in general. C has something that always seemed so enigmatic to me, namely because of all the horror stories I read of people claiming how difficult pointers are. Thankfully someone responded to one of these posts with a link to Nick Parlante's computer science resources: