Wednesday, September 19, 2007

GATTACA

As usual the source code is a walkthrough. This time we're preparing for a bit of data processing/compression. We're dealing with the 65.14 MB file of the X chromosome from the human genome project.

The warmup (this post) is to write a little preprocessing tool.


// http://www.gutenberg.org/etext/3501

// We got this big DNA file but it's got NNNNNNNNNNNN in it, as well as newlines
// so lets filter out everything that isnt A C T or G

// So for the first we're writing a "real" program which will take command line
// arguments and process files.

// Well take a input filename (if not supplied use stdin),
// output filename (if not supplied use stdout), and a filter string (if not
// supplied error)

// to parse command line arguments you could
// 1) Do what most people do and spend a few mins writing
// a silly watered down version of getopt (every time)
// 2) Spend a few mins using getopt then knowing how to do it for future.
// The choice is obvious, don't make the same mistake as others!

// The basic structure of our command line processing etc shalle be:

/*
* int main(int argc, char ** argv) {
* if(!process_args(argc, argv)) {
* display_usage();
* return EXIT_FAILURE;
* }
*
* // ... do the actual stuff ...
*
* return EXIT_SUCCESS;
* }
* * * * * * * * * * *
*/

// I don't bother declaring the input, output files and filter string
// then passing them around as paramters, because they can be declared
// at the toplevel instead.

// The next stage is to have a quick read through the getopt manpage
// in a console/terminal do: `man 3 getopt`

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

FILE * input_file, * output_file;
char * filter_string;

void display_usage(char * name) {
printf("usage: %s \"filter-string\" [-i input filename] [-o output filename]\n", name);
}

int process_args(int argc, char ** argv) {
int ch;

// Deal with the command line arguments here
// return 1 on success, 0 on failure
// (failure meaning malformed parameters or anything)

if(argc <= 1)
return 0; // we require a filter string

filter_string = argv[1];
input_file = stdin;
output_file = stdout;

while((ch = getopt(argc-1, argv+1, "i:o:")) != -1) {
// getopt returns -1 once it's finished
if(ch == '?') return 0;

switch(ch) {
case 'i':
input_file = fopen(optarg, "r");
if(!input_file) {
printf("Could not open input file \"%s\"\n", optarg);
return 0;
}
break;
case 'o':
output_file = fopen(optarg, "w");
if(!output_file) {
printf("Could not open output file \"%s\"\n", optarg);
return 0;
}
break;
}
}

return 1;
}

int main(int argc, char ** argv) {
int ch;

if(!process_args(argc, argv)) {
display_usage(argv[0]); // ammendment, argv[0] is the program name
// we shall display this in usage
return EXIT_FAILURE;
}

// Now that we've parsed the command line args and set things up
// we now have an open input and output file, and a filter string
// the task now is simply read each char and print it if its in the string
// once we hit EOF, flush the output and finish.

while((ch = fgetc(input_file)) != EOF)
if(strchr(filter_string, ch))
fputc(ch, output_file);
fputc('\n', output_file);
fflush(output_file);
fclose(output_file);

return EXIT_SUCCESS;
}

// It even seems to work!
// % ./filter '(){}' -i filter.c
// ()()())())(){(()){()}}(){()}(){()()((())){()(){()(){()}()(){()}}}}(){(()){()}((()))(())()()()()}

// so now we can process the DNA file, strip out anything thats not "ACTG"
// % time ./filter "ACTG" -i dna.txt -o dna_flat.txt
// ./filter "ACTG" -i dna.txt -o dna_flat.txt 1.74s user 0.21s system 98% cpu 1.985 total
// I was going to write about how to improve the speed of the program next,
// the problem is it only took 2 seconds to process 66.9 MB of text
// It should be noted that the naive approach to solving this problem was
// 1) The easiest to program
// 2) Got the job done quickly
// 3) Required no optimisation
//
// So in general it's good to do the easiest thing (be lazy! + stay simple)
// and optimize when you find out it's too slow/uses to much ram (if it ever is/does)

No comments: