-
Notifications
You must be signed in to change notification settings - Fork 0
Rites of Passage in Computer Science
Computer science majors typically have a couple "rite of passage" projects which demonstrates a mastery (or at least some degree of mild competency) of data structures, algorithms, and the generic ability to code. Since coder Bootcamps are an attempt to condense a CS degree into a couple weeks, these rites of passage are sadly lost.
All you really need to get started is knowledge of C, or any other programming language you'd like.
Writing an interpreter (or, for the more daring, a compiler to machine code) may be done in any language you'd like. The basic scheme for an interpreter is done in the following steps:
- Step 0: read in the file (or REPL input) as a string
input_string - Step 1: transform the
input_stringinto a list of tokens (done through a lexical scanner) - Step 2: transform the list of tokens into an abstract syntax tree (done through a parser)
- Step 3: evaluate the syntax tree
If you are coming up with a language from scratch, you need to specify its formal grammar (detailing how to form expressions). This is usually done in BNF.
- Matt Might, Grammar: The language of languages
Lisp greatly simplifies this process by having almost no grammar, representing the abstract syntax tree using...a list, and minimizing the amount of work done lexing and parsing input. That is to say, there's almost no work to be done transforming the list of tokens into an abstract syntax tree.
When you think about it, most languages differ by syntactic sugar. If you want to get to the meat of the language, it boils down to using S-expressions.
Of course, not everyone loves lisp...for reasons I do not pretend to understand...and want to write a language which looks like C/Java/etc.
Writing a database from scratch requires knowing quite a bit about algorithms and tree data structures. This is probably best done in a higher level language than C (e.g., Java or C++ or whatever). A SQL database is a fine thing to try to implement, there's quite a bit to it: query construction and optimization, data storage and retrieval, implementing indexes, constraint checking, etc. etc. etc.
- UC Berkeley, CS186
- MIT, 6.830 / 6.814
- Hellerstein, Joseph, Michael Stonebraker, and James Hamilton. "Architecture of a Database System." Foundations and Trends in Databases 1, no. 2 (2007): 141-259, http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf
Here, C is really the language we must use, since it is small and simple and native. Java, C#, et al., although fashionable and nice, they are not native. C++, while native, is neither small nor simple (but it is possible, just not advisable).
The operating system consists of, at its heart, a process manager plus memory (RAM) management plus a uniform API for dealing with hardware through system calls and drivers and so on. (Memory management is what happens when you call malloc() in C, or new Object() in your favorite object oriented language.) File systems are usually a necessary second component.
Since the '70s, "the" operating system de rigueur has been Unix and its variants (BSD, Linux, Solaris, etc. etc. etc.). Fortunately these have been optimized, unfortunately the optimizations come at a cost of readability. MIT has since produced a free "toy model" of Unix which they call xv6 designed specifically for multicore x64 architecture, oriented towards readability and good code.
- XV6
- Angrave's crowd-sourced System Programming wiki-book
- Maurice J. Bach's The Design of the UNIX Operating System (a really excellent book)
- Carlos Fenollosa's OS Tutorial
It helps to know something about hard disk geometry since file systems manage storing data on hard disks.
Usually this is done through inodes, a data structure which represents a file or directory. An inode consists of "metadata" plus the pointers to blocks representing the file's contents. (A block is 2^n data sectors of a disk.)
Puzzle. Suppose:
- a sector has
bytesPerSector = 2^sbytes (e.g., flash drives haves=9or2^9 = 512bytes per sector, CDs haves=11or2^11 = 2048bytes per sector) - a block consists of
sectorsPerBlock = 2^bsectors (where usually1 <= b <= 4, but anybis possible), and - pointers take up
pointerSizeInBytes = 2^pbytes (p=5orp=6on most modern architectures).
We require 2^p divides 2^b, which is true if p <= b.
Inodes can have pointers directly to blocks, or have a block consist of pointers to the data (called "singly indirect blocks"). This process can be iterated to several levels of indirectness (a block consisting of pointers which point to other blocks of pointers).
If an inode has at most k direct pointers to blocks, at most l singly indirect blocks, and at most m doubly indirect blocks, then what is the maximum size of the file?
(End of Puzzle)
Sketch Solution. A direct pointer points to 2^b sectors or bytesPerDirectBlock = bytesPerSector * sectorsPerBlock = 2^(s) * 2^(b) = 2^(s + b) bytes.
A singly indirect block consists of numberPointersPerSinglyIndirectBlock = bytesPerDirectBlock/pointerSizeInBytes = 2^(s + b - p) pointers to blocks of data. So we get numberPointersPerSinglyIndirectBlock * bytesPerDirectBlock = 2^(s + b - p) * 2^(s + b) = 2^(2*(s+b) - p) bytes.
A doubly indirect block consists of 2^(s+b - p) pointers to singly indirect blocks, or
numberPointersPerDoublyIndirectBlock = numberPointersPerSinglyIndirectBlock * numberPointersPerSinglyIndirectBlock. The total bytes which a doubly indirect block can point to is given by numberPointersPerDoublyIndirectBlock * bytesPerDirectBlock = 2^(s+b - p) * 2^(s+b - p) * 2^(s+b) = 2^(3*(s+b) - 2*p).
The total bytes which an inode can support is then a linear combination of these numbers. (End of Sketch of Solution)
Example.
When we have 12 direct blocks, 1 singly indirect block, and 1 doubly indirect block, 512 bytes per sector, and 4 sectors per block, and 64 bit (i.e., 8-byte) pointers. A single block then consists of 2048 bytes (or 2 kilobytes) and can contain 2048/64 = 512/16 = 32 pointers.
We have a singly indirect block point to 32 blocks.
A doubly indirect block points to 32*32 = 1024 blocks.
(A triply indirect block points to 32*32*32 = 32768 blocks.)
Thus we have a total of 12 + 32 + 1024 = 1060 blocks, or 2120 kilobytes.
(Adding a triply indirect block would increase this by an additional 33836 blocks, or roughly 66 megabytes.)
If we instead had 32-bit pointers (i.e., taking up 4-bytes instead of 8), then a singly indirect block points to 512 blocks.
A doubly indirect block points to 512*512 = 262,144 blocks.
(A triply indirect block points to 512*512*512 = 134,217,728 blocks or 127 kiloblocks.)
We could in this case have a total of 12 + 512 + 262144 = 262668 blocks, or 525336 kilobytes (513 megabytes + 24 kilobytes).
(Adding the triply indirect block would boost this to 268,435,456 kilobytes or about 268 gigabytes.)
What if a block is 4096 bytes and a pointer is 4-bytes? What's the largest file size that can be supported with 12 direct blocks, 1 singly indirect block, 1 doubly indirect block, and 1 triply indirect block? (End of Example)