C Programming Tutorials and Style: GATTACA

As usual the source code is a walkthrough. This time we're preparing for a bit of data processing/compression. We're dealing with the 65.14 MB file of the X chromosome from the human genome project.

The warmup (this post) is to write a little preprocessing tool.

// http://www.gutenberg.org/etext/3501

// We got this big DNA file but it's got NNNNNNNNNNNN in it, as well as newlines
// so lets filter out everything that isnt A C T or G

// So for the first we're writing a "real" program which will take command line
// arguments and process files.

// Well take a input filename (if not supplied use stdin),
// output filename (if not supplied use stdout), and a filter string (if not
// supplied error)

// to parse command line arguments you could
//    1) Do what most people do and spend a few mins writing
//       a silly watered down version of getopt (every time)
//    2) Spend a few mins using getopt then knowing how to do it for future.
// The choice is obvious, don't make the same mistake as others!

// The basic structure of our command line processing etc shalle be:

/*
 * int main(int argc, char ** argv) {
 *   if(!process_args(argc, argv)) {
 *     display_usage();
 *     return EXIT_FAILURE;
 *   }
 *   
 *   // ... do the actual stuff ...
 *   
 *   return EXIT_SUCCESS;
 * }
 * * * * * * * * * * * */

// I don't bother declaring the input, output files and filter string
// then passing them around as paramters, because they can be declared
// at the toplevel instead.

// The next stage is to have a quick read through the getopt manpage
// in a console/terminal do: `man 3 getopt`

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

FILE * input_file, * output_file;
char * filter_string;

void display_usage(char * name) {
  printf("usage: %s \"filter-string\" [-i input filename] [-o output filename]\n", name);
}

int process_args(int argc, char ** argv) {
  int ch;
  
  // Deal with the command line arguments here
  // return 1 on success, 0 on failure
  // (failure meaning malformed parameters or anything)
  
  if(argc <= 1)
    return 0; // we require a filter string
  
  filter_string = argv[1];
  input_file = stdin;
  output_file = stdout;
  
  while((ch = getopt(argc-1, argv+1, "i:o:")) != -1) {
    // getopt returns -1 once it's finished
    if(ch == '?') return 0;
    
    switch(ch) {
    case 'i':
      input_file = fopen(optarg, "r");
      if(!input_file) {
        printf("Could not open input file \"%s\"\n", optarg);
        return 0;
      }
      break;
    case 'o':
      output_file = fopen(optarg, "w");
      if(!output_file) {
        printf("Could not open output file \"%s\"\n", optarg);
        return 0;
      }
      break;
    }
  }
  
  return 1;
}

int main(int argc, char ** argv) {
  int ch;
  
  if(!process_args(argc, argv)) {
    display_usage(argv[0]); // ammendment, argv[0] is the program name
                           // we shall display this in usage
    return EXIT_FAILURE;
  }
  
  // Now that we've parsed the command line args and set things up
  // we now have an open input and output file, and a filter string
  // the task now is simply read each char and print it if its in the string
  // once we hit EOF, flush the output and finish.
  
  while((ch = fgetc(input_file)) != EOF)
    if(strchr(filter_string, ch))
      fputc(ch, output_file);
  fputc('\n', output_file);
  fflush(output_file);
  fclose(output_file);
  
  return EXIT_SUCCESS;
}

// It even seems to work!
//   % ./filter '(){}' -i filter.c
//   ()()())())(){(()){()}}(){()}(){()()((())){()(){()(){()}()(){()}}}}(){(()){()}((()))(())()()()()}

// so now we can process the DNA file, strip out anything thats not "ACTG"
//   % time ./filter "ACTG" -i dna.txt -o dna_flat.txt
//   ./filter "ACTG" -i dna.txt -o dna_flat.txt  1.74s user 0.21s system 98% cpu 1.985 total
// I was going to write about how to improve the speed of the program next,
// the problem is it only took 2 seconds to process 66.9 MB of text
// It should be noted that the naive approach to solving this problem was
//   1) The easiest to program
//   2) Got the job done quickly
//   3) Required no optimisation
//
// So in general it's good to do the easiest thing (be lazy! + stay simple)
// and optimize when you find out it's too slow/uses to much ram (if it ever is/does)
C Programming Tutorials and Style

Wednesday, September 19, 2007

GATTACA

No comments:

Blog Archive

About Me