Data Structures 101
From Real Software Documentation
Aim
We want to make a version of the File Enumerator that can handle folder shortcuts without getting trapped in loops that can be caused by them.
The only practical way to avoid this looping problem is to keep a list of all the folders enumerated so far. Before we enumerate a folder, we check whether it is already in the list.
Now, a modern computer can have thousands of folders, and building and searching that list every time we encounter a folder takes us well into the territory of wanting to be concerned with efficiency[note 1]. If we just keep this list in the kinds of simple data structures we’ve been using so far (say, an array of strings), and search that list from one end to the other (called a linear search), this searching process will slow the whole enumeration process significantly as the list gets long.
With all of this in mind, this lesson will examine a number of different data structures for storing and searching a list of strings. Note that Real Studio has a built-in data structure that can do what we need, the Dictionary. But we will learn a lot by implementing our own solution.
Contents |
The Project
Following the ideas we’ve been talking about, we’re going to develop a class we will call a String Accumulator, that keeps a list of strings. It has one main method, to which you hand a string. If the string is already in the list, it returns True; if not, the string is added to the list, and it returns False. Thus, we can feed the path of the folder we’re about to follow into this method, and it will tell us whether we’ve already searched it.
In this lesson, we will examine three different data structures for storing the list, and two different algorithms for searching each of these data structures. We will use a framework that allows us to test the speed of the data structures and algorithms.
Make sure you spend some time thinking about how you might implement the string accumulator, using what we’ve learned so far.
As we saw in a previous lesson, we don’t have a practical way to understand exactly what Real Studio is doing behind the scenes of our code. When it comes to efficiency questions, the only reliable way to determine what is the most efficient code for a given task is to set up a test.
- Open the project 19_StringAccumulator.
The project’s main window provides two buttons. The top one lets you run a timing test comparing different implementations of a string accumulator. The bottom button lets you run a simple test on a single accumulator. Neither of the buttons have any output deliberately: this project is just the kind of minimal but functional framework a developer would probably put together in practice to run some timing comparisons. You can just apply breakpoints at a suitable point in a method to determine how things are going. For instance, you would put a break point on the last blank line of the top button and then examine the contents of the array Timings if you want to run a timing test.
- Begin by examining the very simple StringAccumulator class.
- Now examine the ArrayStringAccumulator class.
This might have been what you envisaged when you thought about how to create the StringAccumulator class.
Binary Search
One way to speed up a search through a sorted list is a binary search, in which you repeatedly cut the list in half. You start looking at the middle element in the list, and cut your search to the first or last half of the list, according to where the element you’re searching for is. You then look to the middle of that sub-list, and cut the list in half again. Lather, rinse, repeat.
- Examine the code in the BinaryArrayStringAccumulator.
- Examine the code for the RunButton in the window.
Comment out the code to add all but the BinaryArrayStringAccumulator and the ArrayStringAccumulator.
- Make sure you have a break point at the end of the method.
- Run the project and click the button.
Examine the values in the Timings array at the end. You might be surprised at how much faster the binary search is. If your math is up to it, note that a linear search through N items will average about 1/2N comparisons, while a binary search will average about log2N comparisons.
Before we go on, you should review the PowerPoint presentation: Memory Management.
Based on what you learned about memory management, we should consider using a Linked List built out of a series of separate objects. It would be impossible to do a binary search on such a list, but inserting an item into such a list can be done in a constant amount of time, no matter how big the list is. What is not clear is whether such a list would be faster than the other lists we’ve looked at.
- Examine the ListStringAccumulator2 class’s code, and then comment it back in to the button’s code, and run the project.
Ouch. The new class is still half the speed of the ArrayStringAccumulator, let alone the binary search array.
Next, we are going to examine some tree data structures. These are very common types of data structures, and we will look at one of the simplest, a Binary Tree.
About Binary Trees
The data in a binary tree is contained in Nodes, which are joined by Edges (these will be objects as the nodes, and references to the other objects as edges):
In this diagram, the numbers are contained in the nodes, and the arrows represent the edges. Each node has a less than and a greater than edge (either or both might be Nil).
To find something in a binary tree, just follow the edges:
To insert something, find where it should be, and insert it there:
A potential problem: the “shape” of the tree depends on the order in which nodes were added:
The above tree is unbalanced, and will work as badly as or worse than a straight linked list would.
- Examine the code driving the BinaryTreeAccumulator class. Run it from the TestOneButton, tracing through it as you go.
- Examine the non-recursive BinaryTreeAccumulator2 class.
If you are unclear how it works, trace through it from the TestOneButton.
- Rearrange the RunButton method so we can compare the timings for the two BinaryTreeAccumulator classes and the BinaryArrayStringAccumulator.
- Comment out the StackOverflowException handler. Run the application.
Wow. BinaryArrayStringAccumulator is still many times faster than any other approach we’ve tried.
Other Approaches
The BinaryArrayStringAccumulator is reasonably fast; certainly, many times faster than a linearly-searched list. It will suffice for our file enumerator project.
Nevertheless, it is interesting to consider how we might make our ArrayStringAccumulator faster. We could use a smarter type of tree that always stays balanced: a 2-3 Tree. There is also a source of inefficiency in all of our classes: Real Studio doesn’t know that we are never going to resize our strings, or that we are going to need to store a lot of strings. By taking control of our own memory management at a lower level (using the MemoryBlock class), we could make all of these classes faster. Using a scheme like this, we might well find that the BinaryArrayStringAccumulator version is no longer the fastest.
Any textbook on data structures should describe the 2-3 tree — it is one of the most basic data structures in computer science; for the same reason, it should not be hard to find a description of the 2-3 tree online. However, there is a reasonable amount of coding involved in implementing a 2-3 tree, putting it a little out of the scope of this course. Also, Real Studio’s built-in data-storage features (particularly the Dictionary class) means that you are unlikely to need to implement a data structure like that to program in Real Studio.
Other Improvements
You should read the REAL Studio documentation about the MemoryBlock. A MemoryBlock is a very simple chunk of memory, basically just an array of bytes. Because it is so simple, it is also very fast. REAL Studio doesn’t go to the extra trouble required to make strings so much more flexible. This means you can take advantage of your knowledge that the strings you are storing will not need to change size, to make them work faster.
You could even keep an entire binary or 2-3 tree, or for that matter our simple list, in a large MemoryBlock, which would also be substantially faster. There is one more reasonably simple and very flexible and efficient way of speeding up almost any data structure, called hashing. We will employ hashing in a lesson fairly soon, and we will see a substantial improvement over our binary searched list at that time. For the time being, though, we have a class that will provide reasonable performance for our file enumerator, so we will go back to the file enumerator in the next lesson.
Further Exercises
There are a great many potential exercises that can be pursued from here.
A nice one would be a fairly simple way of speeding up either of the array-based classes: resize them rarely and in large increments. This would involve keeping separate track of how much of the allocated size is actually used, allocating say 1000 or 10000 Data Structures elements in the constructor, and, say, doubling the size of the array when it runs out of space. Note that a construct like this (at the end of a SubRangeSearch method, in this example):
is both perfectly acceptable and certainly the easiest way to take care of the array resizing.
References
- Notes




