Skip to content

Rites of Passage in Computer Science

Alex edited this page Dec 16, 2019 · 3 revisions

Computer science majors typically have a couple "rite of passage" projects which demonstrates a mastery (or at least some degree of mild competency) of data structures, algorithms, and the generic ability to code. Since coder Bootcamps are an attempt to condense a CS degree into a couple weeks, these rites of passage are sadly lost.

All you really need to get started is knowledge of C, or any other programming language you'd like.

Interpreters and Compilers

Writing an interpreter (or, for the more daring, a compiler to machine code) may be done in any language you'd like. The basic scheme for an interpreter is done in the following steps:

  • Step 0: read in the file (or REPL input) as a string input_string
  • Step 1: transform the input_string into a list of tokens (done through a lexical scanner)
  • Step 2: transform the list of tokens into an abstract syntax tree (done through a parser)
  • Step 3: evaluate the syntax tree

If you are coming up with a language from scratch, you need to specify its formal grammar (detailing how to form expressions). This is usually done in BNF.

Lisp Interpreters

Lisp greatly simplifies this process by having almost no grammar, representing the abstract syntax tree using...a list, and minimizing the amount of work done lexing and parsing input. That is to say, there's almost no work to be done transforming the list of tokens into an abstract syntax tree.

When you think about it, most languages differ by syntactic sugar. If you want to get to the meat of the language, it boils down to using S-expressions.

Imperative Language Interpreters

Of course, not everyone loves lisp...for reasons I do not pretend to understand...and want to write a language which looks like C/Java/etc.

Databases

Writing a database from scratch requires knowing quite a bit about algorithms and tree data structures. This is probably best done in a higher level language than C (e.g., Java or C++ or whatever). A SQL database is a fine thing to try to implement, there's quite a bit to it: query construction and optimization, data storage and retrieval, implementing indexes, constraint checking, etc. etc. etc.

Operating Systems

Here, C is really the language we must use, since it is small and simple and native. Java, C#, et al., although fashionable and nice, they are not native. C++, while native, is neither small nor simple (but it is possible, just not advisable).

The operating system consists of, at its heart, a process manager plus memory (RAM) management plus a uniform API for dealing with hardware through system calls and drivers and so on. (Memory management is what happens when you call malloc() in C, or new Object() in your favorite object oriented language.) File systems are usually a necessary second component.

Since the '70s, "the" operating system de rigueur has been Unix and its variants (BSD, Linux, Solaris, etc. etc. etc.). Fortunately these have been optimized, unfortunately the optimizations come at a cost of readability. MIT has since produced a free "toy model" of Unix which they call xv6 designed specifically for multicore x64 architecture, oriented towards readability and good code.

File Systems

It helps to know something about hard disk geometry since file systems manage storing data on hard disks.

Usually this is done through inodes, a data structure which represents a file or directory. An inode consists of "metadata" plus the pointers to blocks representing the file's contents. (A block is 2^n data sectors of a disk.)

Puzzle. Suppose:

  • a sector has bytesPerSector = 2^s bytes (e.g., flash drives have s=9 or 2^9 = 512 bytes per sector, CDs have s=11 or 2^11 = 2048 bytes per sector)
  • a block consists of sectorsPerBlock = 2^b sectors (where usually 1 <= b <= 4, but any b is possible), and
  • pointers take up pointerSizeInBytes = 2^p bytes (p=5 or p=6 on most modern architectures).

We require 2^p divides 2^b, which is true if p <= b.

Inodes can have pointers directly to blocks, or have a block consist of pointers to the data (called "singly indirect blocks"). This process can be iterated to several levels of indirectness (a block consisting of pointers which point to other blocks of pointers).

If an inode has at most k direct pointers to blocks, at most l singly indirect blocks, and at most m doubly indirect blocks, then what is the maximum size of the file? (End of Puzzle)

Sketch Solution. A direct pointer points to 2^b sectors or bytesPerDirectBlock = bytesPerSector * sectorsPerBlock = 2^(s) * 2^(b) = 2^(s + b) bytes.

A singly indirect block consists of numberPointersPerSinglyIndirectBlock = bytesPerDirectBlock/pointerSizeInBytes = 2^(s + b - p) pointers to blocks of data. So we get numberPointersPerSinglyIndirectBlock * bytesPerDirectBlock = 2^(s + b - p) * 2^(s + b) = 2^(2*(s+b) - p) bytes.

A doubly indirect block consists of 2^(s+b - p) pointers to singly indirect blocks, or numberPointersPerDoublyIndirectBlock = numberPointersPerSinglyIndirectBlock * numberPointersPerSinglyIndirectBlock. The total bytes which a doubly indirect block can point to is given by numberPointersPerDoublyIndirectBlock * bytesPerDirectBlock = 2^(s+b - p) * 2^(s+b - p) * 2^(s+b) = 2^(3*(s+b) - 2*p).

The total bytes which an inode can support is then a linear combination of these numbers. (End of Sketch of Solution)

Example. When we have 12 direct blocks, 1 singly indirect block, and 1 doubly indirect block, 512 bytes per sector, and 4 sectors per block, and 64 bit (i.e., 8-byte) pointers. A single block then consists of 2048 bytes (or 2 kilobytes) and can contain 2048/64 = 512/16 = 32 pointers.

We have a singly indirect block point to 32 blocks. A doubly indirect block points to 32*32 = 1024 blocks. (A triply indirect block points to 32*32*32 = 32768 blocks.) Thus we have a total of 12 + 32 + 1024 = 1060 blocks, or 2120 kilobytes. (Adding a triply indirect block would increase this by an additional 33836 blocks, or roughly 66 megabytes.)

If we instead had 32-bit pointers (i.e., taking up 4-bytes instead of 8), then a singly indirect block points to 512 blocks. A doubly indirect block points to 512*512 = 262,144 blocks. (A triply indirect block points to 512*512*512 = 134,217,728 blocks or 127 kiloblocks.) We could in this case have a total of 12 + 512 + 262144 = 262668 blocks, or 525336 kilobytes (513 megabytes + 24 kilobytes). (Adding the triply indirect block would boost this to 268,435,456 kilobytes or about 268 gigabytes.)

What if a block is 4096 bytes and a pointer is 4-bytes? What's the largest file size that can be supported with 12 direct blocks, 1 singly indirect block, 1 doubly indirect block, and 1 triply indirect block? (End of Example)

References

Clone this wiki locally