Programming Pearls Column 1: Cracking the Oyster Link to heading
Programming Pearls is one of those books that is commonly talked about as a must read. I finally bought a copy and wanted to document reading it so I had easily available notes on it in the future.
The Problem Link to heading
This column starts off sounding like a whiteboard interview problem
Input: A file containing at most n positive integers, each less than n,
where n=10^7. It is a fatal error if any integer occurs twice in
the input. No other data is associated with the integer.
Output: A sorted list in increasing order of the input integers.
Constraints: At most (roughly) a megabyte of storage is available in main
memory; ample disk storage is available. The run time can be at
most several minutes; a run time of ten seconds need not be decreased.
The Solution Link to heading
Initially due to the memory constraint it seemed like there was no way to do it without using some intermediate disk storage. With more memory a bit vector would an ideal solution but I left that on the table until later in the reading when they “found 2 million more bits”.
/* phase 1: initialize set to empty */
for i = [0, n)
bit[i] = 0
/* phase 2: insert present elements into the set */
for each i in the input file
bit[i] = 1
/* phase 3: write sorted output */
for i = [0, n)
if bit[i] == 1
write i on the output file
Principles Link to heading
- The Right Problem
- The Bitmap Data Structure
- Multiple-Pass Algorithms
- A Time-Space Tradeoff and One That Isn’t
- A Simple Design
- Stages of Program Design
Of the principles in this column I think that The Right Problem
and A Simple Design
are the most important as they encapsulate what it takes to be a great engineer in the industry. Often problems come to us ambiguous or without context, digging into what the real problem is saves time and frustration from wasted cycles. Most engineers also tend to make things more complicated then neccessary, striving for simplicity makes you and your peers life easier in the long run with easier to understand and maintain code.
Problems Link to heading
- If memory were not scarce, how would you implement a sort in a language with libraries for representing and sorting sets?
In Java I would just insert and then iterate over a TreeSet<Integer>
. This would be slower than the bit vector solution (N vs N log(n)) so probably wouldn’t use it.
- How would you implement bit vectors using bitwise logical operations (such as and, or and shift)?
Intnerally an array or vector of bytes would hold the data. The following psuedo code shows the math
a = byte[]
set(n) => a[n/8] |= (1 << n % 8)
get(n) => a[n/8] & (1 << n % 8)
- Run-time efficiency was an important part of the design goal, and the resulting program was efficient enough. Implement the bitmap sort on your system and measure its run time; how does it compare to the system sort and to the sorts in Problem 1? Assume that n is 10,000,000, and that the input file contains 1,000,000 integers.
bitmap sort ~2 seconds (using go)
system sort ~0.4 seconds (using sort -n
)
- If you take Problem 3 seriously, you will face the problem of generating k integers less than n without duplicates. The simplest approach uses the first k positive integers. This extreme data set won’t alter the run time of the bitmap method by much, but it might skew the run time of a system sort. How could you generate a file of k unique random integers between 0 and n – 1 in random order? Strive for a short program that is also efficient.
package main
import (
"fmt"
"math/rand"
)
func main() {
seen := make(map[int]bool)
for len(seen) < 1_000_000 {
i := rand.Intn(10_000_000)
if !seen[i] {
fmt.Println(i)
}
seen[i] = true
}
}
- The programmer said that he had about a megabyte of free storage, but the code we sketched uses 1.25 megabytes. He was able to scrounge the extra space without much trouble. If the megabyte had been a hard and fast boundary, what would you have recommended? What is the run time of your algorithm?
The same algorithm could be used with two spaces and dividing the space in half, the big O would be the same but runtime would double.
- What would you recommend to the programmer if, instead of saying that each integer could appear at most once, he told you that each integer could appear at most ten times? How would your solution change as a function of the amount of available storage?
Instead of using a bit/bool each index could be a byte to track the total number of occurences. This would cause memory by 8x.With some extra logic bytes could be split into upper and lower halves and only a 4x increase in memory would be needed.
- [R. Weil] The program as sketched has several flaws. The first is that it assumes that no integer appears twice in the input. What happens if one does show up more than once? How could the program be modified to call an error function in that case? What happens when an input integer is less than zero or greater than or equal to n? What if an input is not numeric? What should a program do under those circumstances? What other sanity checks could the program incorporate? Describe small data sets that test the program, including its proper handling of these and other ill-behaved cases.
- The bit vector can be checked for a true value before setting and throw an error if it is true
- In the original version bad input would cause the program to crash. Boundary checks and input validation could be added to address cases outside the happy path.
- When the programmer faced the problem, all toll-free phone numbers in the United States had the 800 area code. Toll-free codes now include 800, 877 and 888, and the list is growing. How would you sort all of the toll-free numbers using only a megabyte? How can you store a set of toll-free numbers to allow very rapid lookup to determine whether a given toll-free number is available or already taken?
- Perform the same process for each prefix and store those in indivdual files.
- Since files are sorted and have a constant length a binary search on the file would allow for fast lookup.
- One problem with trading more space to use less time is that initializing the space can itself take a great deal of time. Show how to circumvent this problem by designing a technique to initialize an entry of a vector to zero the first time it is accessed. Your scheme should use constant time for initialization and for each vector access, and use extra space proportional to the size of the vector. Because this method reduces initialization time by using even more space, it should be considered only when space is cheap, time is dear and the vector is sparse.
- Since most of my experience is with languages that take care of this initializing values I couldn’t really come up with a great way to do this. Eventually some googling led me to this blog which appears to be an explanation of this problem with a solution from later in the book.
- Before the days of low-cost overnight deliveries, a store allowed customers to order items over the telephone, which they picked up a few days later. The store’s database used the customer’s telephone number as the primary key for retrieval (customers know their phone numbers and the keys are close to unique). How would you organize the store’s database to allow orders to be inserted and retrieved efficiently?
- A B-tree would allow for this.
- In the early 1980’s Lockheed engineers transmitted daily a dozen drawings from a Computer Aided Design (CAD) system in their Sunnyvale, California, plant to a test station in Santa Cruz. Although the facilities were just 25 miles apart, an automobile courier service took over an hour (due to traffic jams and mountain roads) and cost a hundred dollars per day. Propose alternative data transmission schemes and estimate their cost.
- This was solved with pigeons which were twice as fast and only a couple bucks a day. They transported 35mm film of the drawings.
- Pioneers of human space flight soon realized the need for writing implements that work well in the extreme environment of space. A popular urban legend asserts that the United States National Aeronautics and Space Administration (NASA) solved the problem with a million dollars of research to develop a special pen. According to the legend, how did the Soviets solve the same problem?
- A pencil
- Like most urban legends this one isn’t accurate.