Web Techniques Column 57 (Jan 2001)

[suggested title: Leveraging with open source]

The Open Source movement has been going strong for quite some time, starting long before the term had even been coined. In fact, some have argued that Perl and GNU Emacs are the canonical early success stories for Open Source, having helped spearhead the notion that a community of network-connected people can contribute to a publicly available tool to make it industrial-strength and suitable for use in mission-critical commercial applications. If you haven't been overloaded by the buzzwords in that last sentence, read on.

One of the cool things I find about the Open Source movement is the willingness of the user community to "give back". Most people see Open Source projects as if they were a "potluck picnic". In a potluck, everyone tries to bring some dish, usually something they most enjoy making (or in my case, buying at the store already made, like chips), and from the individual contributions, we get to make a complete meal. Unless it's a bunch of geeks and everyone brought chips and nothing else. That's why geeks usually do potlucks with software (or a sufficient number of non-geeks for a complete meal) instead.

The advantage here is that I can leverage my contribution to the meal. I can bring one easy thing to purchase, and I end up getting a bit of this, a bit of that, and a more-or-less balanced meal, counting the three different kinds of desserts.

Similarly, with a software potluck, a few lines of code I "bring" can result in a complete program, by combining them with copies of the code in the libraries that others have brought, either in the distribution, or in the wonderful Comprehensive Perl Archive Network (CPAN) at search.cpan.org and hundreds of other places. Luckily, the software potluck comes with a built-in replicator, so I can take a lot more than I give and not feel very guilty.

So, while I was thinking about how Open Source has contributed to my work, I started wondering how many lines of code my programs actually used, given that I use a lot of modules in my programs. And just about then, as luck would have it, someone on Usenet posted about the Devel::Modlist module, a debugging aid to see which modules your program had pulled in. One of its features is to dump out a full path listing of all use'd modules, and that gave me the idea of checking out how many lines of code I pulled in for every line of code I wrote. And that brings us to the program in [listing one], to show how much leverage I've gotten from using Open Source libraries.

Lines 1 through 3 start nearly every program I do, enabling warnings, turning on compiler restrictions, and disabling buffering on STDOUT. If this were a CGI program I'd also add taint mode (-T on the first line), but it isn't.

Line 5 brings in the Config module, a standard module that lets us find out some common information about this particular Perl installation. It defines a %Config hash that we'll use later.

Line 6 pulls in the IPC::Open2 module, also a standard module. We'll need that to fire off a child process and babysit it, talking to both STDIN and STDOUT at the same time.

Line 7 uses the Memoize module, from the CPAN. This module enables a subroutine to be "memoized" (and no, that's not a misspelling). This means that successive calls to the subroutine with the same arguments will return the same result, but without reinvoking the body of the subroutine. The call to memoize then modifies the lines_in_file subroutine so that the results are automatically cached. This simplified the design of my program greatly, because I could just write what I wanted, and optimize later.
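To make the caching concrete, here's a minimal sketch that isn't part of the listing. The slow_square routine and the $calls counter are made up for illustration; only the memoize call itself comes from the Memoize module.

```perl
use strict;
use warnings;
use Memoize;

my $calls = 0;    # counts how often the body really runs

# hypothetical stand-in for an expensive routine
sub slow_square {
  my $n = shift;
  $calls++;
  return $n * $n;
}

memoize('slow_square');

# three calls with the same argument, all in list context
my @results = map { slow_square(12) } 1..3;
print "results: @results\n";
print "body ran $calls time(s)\n";
```

The three calls hand back the same answer, but the counter shows the body ran only once. Without the memoize line, it would have run three times.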

Line 8 pulls in the standard File::Basename module, particularly for the basename subroutine.

And now for the only configuration variable. Line 12 defines a pattern for glob matching all filenames to be processed. This is a path to my website's WebTechniques column archive, as viewed from the Unix side, not as a URL. The .listing.txt files are the source code for the various columns, up to col57.listing.txt, containing the source code for this month's column.

Lines 16 through 18 get information about the current Perl installation. First, we get the path to Perl into $perlpath. Next, we grab the two places things are installed: $privlib for distribution modules, and $sitelib for CPAN or local modules.

Line 20 creates a regular expression object that matches files found either in $privlib or $sitelib. Hopefully. It worked for my installation, but your mileage may vary.
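If you want to see what that regex does on your own installation, here's a small sketch along the same lines. The three candidate paths are hypothetical, built from whatever %Config reports locally. The \Q and \E matter because a library path can contain characters like "." that would otherwise be treated as regex metacharacters.

```perl
use strict;
use warnings;
use Config;

my ($privlib, $sitelib) = @Config{qw(privlib sitelib)};
my $files_regex = qr/^(?:\Q$privlib\E|\Q$sitelib\E)/;

# hypothetical paths, built from this installation's own directories
my @candidates = (
  "$privlib/strict.pm",
  "$sitelib/Some/Module.pm",
  "/tmp/not-a-module.pm",
);

for my $path (@candidates) {
  printf "%-8s %s\n", ($path =~ $files_regex ? "library" : "other"), $path;
}
```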

Line 22 loads up the @ARGV array (used for the diamond operator later) with the list of names matching the $PAT configuration variable. This results in the 57 filenames of the column code source listings.

Line 24 undefs the $/ variable, ensuring that each filehandle read returns all available data. This makes the diamond loop grab the entire file into $_ on each read, so that the loop runs once per file, not once per line.
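Here's a quick sketch of the difference that $/ makes. It reads from an in-memory string rather than a real file (a feature of later Perls than the one I'm running the listing on), and it uses local $/ to confine the change to one block, where the listing simply undefs it for the whole program.

```perl
use strict;
use warnings;

my $text = "line one\nline two\nline three\n";

# default $/: one read per line
open my $by_line, '<', \$text or die "cannot open: $!";
my @lines = <$by_line>;
close $by_line;

my @chunks;
{
  local $/;    # undef only inside this block
  open my $whole, '<', \$text or die "cannot open: $!";
  @chunks = <$whole>;    # a single read returns everything
  close $whole;
}

printf "default \$/: %d reads\n", scalar @lines;
printf "undef \$/:   %d read\n",  scalar @chunks;
```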

Lines 26 and 27 hold the grand totals for original lines and used-module lines.

Lines 29 through 48 handle the main operation: processing one file at a time to compare the number of module lines pulled in against the number of source lines. Each iteration looks at the next file, and has the entire contents of the file in $_, and the name of that file automatically in $ARGV as well.

Line 30 breaks apart a few of the listings with multiple programs in the same listing textfile. I wasn't consistent in naming the multiple parts, except for three pound characters at the beginning, and the word listing in either upper or lower case later in the same line. For this loop, $_ is now a single listing.
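To see what that split hands back, here's a sketch using the same pattern on a made-up listing file containing two tiny programs. The file contents are hypothetical; the regex is the one from line 30.

```perl
use strict;
use warnings;

# hypothetical listing file holding two programs
my $file = <<'END';
### Listing One
#!/usr/bin/perl
print "one\n";
### LISTING TWO
#!/usr/bin/perl
print "two\n";
print "done\n";
END

# /i accepts either case of "listing"; /m lets ^ match at each line start
my @parts = split /^\#\#\#.*listing.*\n/im, $file;

for my $i (0..$#parts) {
  printf "part %d: %d bytes\n", $i, length $parts[$i];
}
```

Because the file begins with a separator line, the first item that comes back is empty. The two programs follow as separate items.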

Line 31 skips over counting any program that has Apache:: as part of the source. Apparently, trying to use things that are meant to be used within mod_perl doesn't work very well, so I have to filter them out. But on first run, the addition of this line caused this program to not consider itself! So I added the character class brackets, which still matches Apache:: but doesn't look like Apache::. Nice trick.
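If the trick seems too sneaky to believe, here's a small sketch that checks both halves of the claim. The mod_perl line is a made-up example.

```perl
use strict;
use warnings;

# the pattern as it appears in the program text
my $pattern_source = 'Apach[e]::';

my $mod_perl_code = 'use Apache::Constants;';   # hypothetical mod_perl line

print "pattern matches mod_perl code\n"
  if $mod_perl_code =~ /$pattern_source/;
print "pattern does not match its own source text\n"
  unless $pattern_source =~ /$pattern_source/;
```

Both messages print: the bracketed "e" still matches a plain "e" in the target, but the text of the pattern contains a "[" where the "e" would have to be.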

Line 32 counts the number of source lines, and ignores the now empty items created by the earlier split.
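The tr operator does double duty there, so here's a minimal sketch of both effects on some made-up strings.

```perl
use strict;
use warnings;

my $source = "first\nsecond\nthird\n";
my $count  = ($source =~ tr/\n//);   # counts newlines, leaves $source alone
print "$count lines\n";

# an empty chunk counts as zero, which is false, so "next unless" skips it
my $empty = "";
my $none  = ($empty =~ tr/\n//);
print "empty chunk skipped\n" unless $none;
```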

Lines 33 and 34 launch a child Perl process to run the program, pulling in the Devel::Modlist module (found in the CPAN) and triggering the path output, giving us a list of all the use'd modules as their full paths, on STDERR of the child process. Because it's on STDERR, we need the shell's 2>&1 syntax to merge it with STDOUT. I've also enabled the "compile only" (-c) and "taint mode" (-T) options to keep from actually executing the code and to prevent the "taint mode too late" error.
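If you don't have Devel::Modlist handy, you can still get a feel for the kind of list we're after, because Perl records every file it has loaded in the built-in %INC hash. Here's a minimal sketch using only %INC. I'm not claiming this is how Devel::Modlist does its work internally, and it reports on the running program rather than on a child.

```perl
use strict;
use warnings;
use File::Basename;   # loaded only so there's something to report

# keys look like "File/Basename.pm"; values are the full paths on disk
for my $module (sort keys %INC) {
  print "$module => $INC{$module}\n";
}
```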

Line 35 sends the program source code to the newly launched Perl process. Line 36 closes the input handle for that process. After a short time passes, data is available on RDR, which is read in line 37, and closed in line 38. Note that I know that the child Perl process isn't going to try to write more than 8K before I finish sending the program, so this operation is safe. (If it did try, we'd be both trying to shove data to each other, resulting in a staring match with nobody winning.)
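The write, close, read, close dance looks like this in miniature. This is a sketch, not the listing's code: the child here is just the running Perl ($^X) with a one-line stand-in program that upcases its input, and I've used lexical handles and a waitpid that the listing doesn't have.

```perl
use strict;
use warnings;
use IPC::Open2;

# $^X is the running perl; the one-liner is a stand-in child that upcases its input
my $pid = open2(my $rdr, my $wtr, $^X, '-pe', '$_ = uc');

print $wtr "hello\nworld\n";   # small enough to fit in the pipe buffer
close $wtr;                    # child sees end of file and finishes

my $reply = do { local $/; <$rdr> };   # slurp everything the child wrote
close $rdr;
waitpid $pid, 0;               # reap the child

print $reply;
```

The order matters. If the parent kept writing while the child was blocked trying to write back, neither side would ever get to read.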

Now it's time to see just how many things got used. We'll start with setting the total for this program to zero in line 39. Then for each output line, broken apart in line 40, we see if it's a pathname within a module in line 41. If so, we'll call lines_in_file for that filename in line 42. That returns the number of lines in that file, which gets summed into our total.

Finally in line 44, we'll dump the filename for this program, lines in the program, and total lines of modules sucked in.

Lines 45 and 46 sum the two counts into the grand totals.

And line 49 dumps out the grand totals at the bottom.

And now for the nice little subroutine lines_in_file starting in line 51. The only parameter gets saved in line 52 into $filename.

Line 53 creates a local filehandle. Starting in Perl 5.6, we can simplify this to just my $handle, but since I'm still using Perl 5.5.3 until 5.6.1 comes out, I'll do it in the more traditional way.
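Here are the two styles side by side, as a sketch. It reads from an in-memory string instead of a real file, which needs a newer Perl than the 5.5.3 I'm using for the listing, so treat it as an illustration of the idiom rather than something to paste into old code.

```perl
use strict;
use warnings;

my $text = "alpha\nbeta\n";   # in-memory data standing in for a real file

# the older idiom: a reference to a freshly localized glob
my $old_style = \do { local *STDIN };
open $old_style, '<', \$text or die "cannot open: $!";
my $first = <$old_style>;
close $old_style;

# Perl 5.6 and later: open fills in an undefined scalar for us
open my $new_style, '<', \$text or die "cannot open: $!";
my $also_first = <$new_style>;
close $new_style;

print "old: $first", "new: $also_first";
```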

Line 54 opens the handle, returning 0 if something breaks. Line 55 reads the entire file into a buffer. And line 56 counts the lines in the same way we counted lines above: by changing all newlines into newlines and counting how many of those we did.

Now, this subroutine has been memoized above, meaning that even though it appears to open the file repeatedly as it has been seen in each program (think about how many times we're asked about strict.pm, for example), it's really only going to do this operation once per filename. The savings in a long-running program can be significant. However, the subroutine must have no side effects, or chaos will ensue. See the Memoize module documentation for details.

And this results in something like:

    halfdome.holdit.com>> ./countlines
		 col01.listing.txt     79  12133
		 col02.listing.txt    108   8799
		 col03.listing.txt     36  15529
		 col04.listing.txt     75    104
		 col05.listing.txt    104  14596
		 col06.listing.txt     59   9416
		 col07.listing.txt     91  15529
		 col08.listing.txt     87    104
		 col09.listing.txt     95   1040
		 col10.listing.txt     41    104
		 col10.listing.txt     35   1223
		 col11.listing.txt    101  13229
		 col12.listing.txt     84   9081
		 col13.listing.txt     15      0
		 col13.listing.txt     45   4917
		 col14.listing.txt    167  10014
		 col15.listing.txt     63  11080
		 col16.listing.txt     74  11287
		 col17.listing.txt     91   1796
		 col18.listing.txt     61   3561
		 col19.listing.txt     93  16956
		 col20.listing.txt     88  11287
		 col21.listing.txt     90  11811
		 col22.listing.txt     98  14160
		 col23.listing.txt     58  18531
		 col24.listing.txt     77  18531
		 col25.listing.txt     33    104
		 col25.listing.txt     26   6848
		 col26.listing.txt     63   3561
		 col27.listing.txt    220  12218
		 col28.listing.txt     37   7943
		 col28.listing.txt     47  12218
		 col29.listing.txt     71   3445
		 col30.listing.txt    122  16697
		 col31.listing.txt     36   1223
		 col32.listing.txt    140  10952
		 col33.listing.txt    128   9502
		 col34.listing.txt    193  13229
		 col35.listing.txt    312  12104
		 col36.listing.txt     58   8912
		 col37.listing.txt    118  19999
		 col38.listing.txt     72   9081
		 col38.listing.txt     36   9081
		 col39.listing.txt     76  13068
		 col40.listing.txt    114   9081
		 col42.listing.txt    114  26458
		 col43.listing.txt    164  11666
		 col44.listing.txt     82   9081
		 col45.listing.txt     91   9420
		 col46.listing.txt     85   9081
		 col48.listing.txt    188  15597
		 col51.listing.txt    115   9081
		 col52.listing.txt     70   9736
		 col53.listing.txt     67  10603
		 col54.listing.txt     22  10209
		 col55.listing.txt      5      0
		 col55.listing.txt     41   1223
		 col55.listing.txt     25   9502
		 col56.listing.txt     95  17265
		 col57.listing.txt     57   8145
		    grand total =>   5138 571151
    halfdome.holdit.com>> 

Lookee there. For the 5,138 lines of code I wrote (not counting some of the mod_perl stuff), I got a whopping 571,151 lines of code written by someone else, better than a hundred to one return on effort!

Now you can argue that not all of those 100-plus lines for every line I wrote are actually being used, and that if I had pulled out only the parts of those libraries my programs really exercise, it might be only 1/10th of that. But even then, that's still a 10 to 1 ratio of code other people have written (and hopefully debugged) to code I had to write (and debug).
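For the record, here's the arithmetic as a quick sketch, using the two grand totals from the sample run above. The "one-tenth" figure is just my back-of-the-envelope guess from the previous paragraph, not something measured.

```perl
use strict;
use warnings;

my ($mine, $theirs) = (5138, 571151);   # grand totals from the sample run

my $ratio = $theirs / $mine;
printf "full ratio:      %.1f to 1\n", $ratio;        # prints 111.2 to 1
printf "one-tenth ratio: %.1f to 1\n", $ratio / 10;   # prints 11.1 to 1
```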

I would dare argue that without Open Source, those CPAN modules would not have been as easily shared. And I think I'd find a lot of agreement there.

So, remember leverage. Learn the CPAN modules. And when you take, give back when you can. Contribute your cool reusable modules to the CPAN, and keep the potluck going. Until next time, enjoy!

Listings

col57.pl
	=1=	#!/usr/bin/perl -w
	=2=	use strict;
	=3=	$|++;
	=4=	
	=5=	use Config;
	=6=	use IPC::Open2;
	=7=	use Memoize; memoize('lines_in_file');
	=8=	use File::Basename;
	=9=	
	=10=	## CONFIG
	=11=	
	=12=	my $PAT = "/home/merlyn/Html/merlyn/WebTechniques/col??.listing.txt";
	=13=	
	=14=	## END CONFIG
	=15=	
	=16=	my $perlpath = $Config{perlpath};
	=17=	my $privlib = $Config{privlib};
	=18=	my $sitelib = $Config{sitelib};
	=19=	
	=20=	my $files_regex = qr/^(\Q$privlib\E|\Q$sitelib\E)/;
	=21=	
	=22=	@ARGV = glob $PAT or die "no files?";
	=23=	
	=24=	undef $/;
	=25=	
	=26=	my $source_grand = 0;
	=27=	my $used_grand = 0;
	=28=	
	=29=	while (<>) {
	=30=	  for (split /^\#\#\#.*listing.*\n/im) {
	=31=	    next if /Apach[e]::/; # bleh
	=32=	    next unless my $source_count = tr/\n//;
	=33=	    open2(\*RDR, \*WTR, "$perlpath -cTMDevel::Modlist=path 2>&1")
	=34=	      or die "Cannot create pipe or fork or something: $!";
	=35=	    print WTR $_;
	=36=	    close WTR;
	=37=	    $_ = <RDR>;
	=38=	    close RDR;
	=39=	    my $used_count = 0;
	=40=	    for (split /\n/) {
	=41=	      next unless /$files_regex/;
	=42=	      $used_count += lines_in_file($_);
	=43=	    }
	=44=	    printf "%30s %6d %6d\n", basename($ARGV), $source_count, $used_count;
	=45=	    $source_grand += $source_count;
	=46=	    $used_grand += $used_count;
	=47=	  }
	=48=	}
	=49=	printf "%30s %6d %6d\n", "grand total =>", $source_grand, $used_grand;
	=50=	
	=51=	sub lines_in_file {
	=52=	  my $filename = shift;
	=53=	  my $handle = \do { local *STDIN };
	=54=	  open $handle, "<$filename" or return 0;
	=55=	  read $handle, my $buffer, -s $handle;
	=56=	  $buffer =~ tr/\n//;
	=57=	}