Open Computing ``Hands-On'': ``Wizard's Grabbag'' Column: June, 1994
The Data Shuffle
Need to organize your data? Here's a personal productivity tool for managing lists of information
By Dr. Rebecca Thomas

If there is one thing that the information age has created, it's gobs of data, often in unwieldy chunks. The key to keeping organized is being able to extract the information you need in a format that you can use.
Tom Baker provides a Korn shell script implementation of a tool that helps manage line-oriented textual information. The user constructs a rule file that directs the script how to process the data files in their working directory.
Deal Me In
Dear Dr. Thomas:
My shuffle Korn shell program [Part A of Listing 1] is a rule-directed list processor designed to organize files containing lists. It is especially good for lists that undergo continual growth and revision, such as calendars, phone directories, event logs, and lists of things to do.

The name ``shuffle'' is based on a playing card metaphor. When cards are shuffled, they are swept together, mixed, and dealt back out into random hands. This script sweeps a set of list files together into one big file, then--under direction of regular expressions contained in rules defined by the user--deals the data back out into a new set of list files and, when directed, sorts them.
My lists contain one-line items of information: in principle, anything that can be expressed within a line or sortable sequence of lines [see Part B]. Some of the data are structured by an organizing principle, such as date, name, or priority.
These organizing principles are expressed in an editable set of rules [see Part C]. Minimally, each rule contains a search key, which is used with egrep to extract matching lines from a source file into a target file. Optionally, the rule also specifies a sort command for the target file.

When shuffle is run, it first concatenates all of the files specified as arguments into one big file, named by the Allfiles variable. After making a safety backup, it erases the originals, thus wiping the slate clean for their reconstitution. From this one aggregate file, shuffle extracts an entirely new set of lists.

Figure 1 shows a typical flow diagram. The first rule extracts every line (specified by the ``.'' pattern) from the source file, named by the Allfiles variable, into the target file--here called phone--effectively renaming combined.dat to phone, and then sorts it.

The second rule moves all lines that begin with a hyphen followed by a space character (``- '') from phone into 1993 and sorts it by year. The third rule moves all lines that start with the ``- 1994 '' pattern from 1993 into 1994. After all of the nine rules in our example have been applied, any lines that remain are left in the file named phone.

If you edit a data line to match a different rule, you mark that line for export to a different list. For example, I might expand the information from the line shown [in Part D] into the lines shown [in Part E]. Then when I run shuffle, the event lines will be moved into the 1994 log, Joachim Mann will go to the phone directory, and the article on SGML will end up in a list of things to do later.

When you edit the rules, older lists are merged or new ones created to meet new needs. For instance, the rule shown in Part F creates a separate list of things I need to follow up on, such as the two items from the Smith meeting.
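The sweep-and-deal flow described above can be sketched in a few lines of shell. This is only an illustration, not the script from Listing 1: the function names (deal, shuffle_sketch), the one-rule-per-line format ``source|target|sort command|pattern'', and the hard-coded combined.dat and backup names are assumptions made for the example, and grep -E stands in for egrep.

```shell
#!/bin/sh
# Illustrative sketch only; not the shuffle script from Listing 1.
# Assumed (hypothetical) rule format, one rule per line:
#     source|target|sort command|pattern
# The sort command may be empty. The pattern comes last so that it may
# itself contain '|' for egrep-style alternation.

# deal SOURCE TARGET SORTCMD PATTERN
# Move every line of SOURCE that matches PATTERN into TARGET, leave the
# remaining lines in SOURCE, then sort TARGET if SORTCMD is non-empty.
deal() {
    [ -s "$1" ] || return 0                  # nothing to deal from
    grep -E -e "$4" "$1" >> "$2"             # matching lines go to the target
    grep -E -v -e "$4" "$1" > "$1.tmp"       # the rest stay in the source
    mv "$1.tmp" "$1"
    if [ -n "$3" ]; then
        # SORTCMD is deliberately unquoted so "sort -u" splits into words
        $3 "$2" > "$2.tmp" && mv "$2.tmp" "$2"
    fi
}

# shuffle_sketch RULEFILE FILE...
# Back up the list files, sweep them into one aggregate file, erase the
# originals, then apply each rule in order.
shuffle_sketch() {
    rules=$1
    shift
    allfiles=combined.dat                    # assumed name of the aggregate
    mkdir -p backup
    cp "$@" backup/                          # safety backup
    cat "$@" > "$allfiles"                   # sweep everything together
    rm -f "$@"                               # wipe the slate clean
    while IFS='|' read -r src dst sortcmd pattern; do
        [ -n "$pattern" ] || continue        # skip blank or malformed rules
        deal "$src" "$dst" "$sortcmd" "$pattern"
    done < "$rules"
}
```

With rules such as ``combined.dat|phone|sort|.'' followed by ``phone|1993|sort|^- '', a call like shuffle_sketch rules a b would mirror the first two steps of the flow in Figure 1. Because each rule names its own source, later rules can carve lines out of the targets that earlier rules created.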
Use of line-oriented data files means that I can use simple grep searching commands to locate items that meet certain criteria, for instance, ``show me everything I have on Smith'' (grep Smith) or ``what is my shoe size?'' (grep -i shoe) or ``when is the music library open?'' (grep musikbuecherei).

Furthermore, I often organize the elements of my data lines from general to specific, reading from left to right. This approach means that related items will be grouped together when sorted: lines referring to ``Clothes Shoes'' will remain near the ``Clothes Pants'' and ``Clothes Shirts'' lines in the residual phone file. This general-to-specific arrangement also means that if a search doesn't tell me what I want to know because it was too specific (``Gap'' or ``Bean''), I can search for a more general category (``pants'').
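The habit of retrying a search with a broader term can be wrapped in a small helper. The function below is a hypothetical convenience, not part of shuffle; its name and argument order are assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical helper, not part of the shuffle script.
# search_broadening FILE PATTERN...
# Try each pattern in turn, most specific first, ignoring case. Print
# the matches for the first pattern that hits and stop; return 1 if no
# pattern matches at all.
search_broadening() {
    file=$1
    shift
    for pattern in "$@"; do
        # grep prints any matching lines and exits 0 only if it found some
        if grep -i -e "$pattern" -- "$file"; then
            return 0
        fi
    done
    return 1
}
```

For example, search_broadening phone bean pants first looks for ``bean'' and falls back to ``pants'' only if nothing matched.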
I find that the rule file evolves as I edit the data. And because the rule file is just another list--albeit a special one--stored along with the files to which it refers, the set of lists is largely self-documenting.
Tom Baker / Bonn, Germany

Configuration Notes: The shuffle program was developed under the MKS Toolkit Korn shell running under DOS 3.3 and ported to Korn shell Version 11/16/88d running under System V Release 4.0.3. It has been tested under the environments mentioned in the ``acknowledgments'' paragraph near the end of this column.

The configuration section (lines 12-23) was written to support both DOS-based MKS Toolkit Korn shell and Unix-based Korn shell versions as indicated by the comments. For instance, DOS-MKS Toolkit doesn't have an equivalent for the Unix ``bit-bucket'' file /dev/null, so a temporary file is used instead (line 21 instead of line 15).

Under MKS Toolkit, the rule file is named ``rules'' whereas Unix users can use ``.rules''. The latter usage lets one invoke the script using the asterisk wild card, as in shuffle *, without fear of shuffling the rule file. Also, MKS Toolkit does not have a command named nawk, but one can either copy awk.exe to nawk.exe or edit the script to invoke awk. By now, many implementations use awk as the name of the ``new'' awk program, instead of nawk, a name that was used when the new version was first introduced.

Usage Note: The shuffle program is designed to process data files under direction of a rule file all in the same directory. A backup subdirectory is created when shuffle runs.

Tester's Comments: It's a nice and useful script, and I was able to change it to handle multiline text to shuffle mail files or Usenet news articles. By employing the public-domain agrep program--which is record, not just line, oriented--and using ``^From '' as the record delimiter, I could extract data from our electronic mail support database. The same idea holds for our news archives, although I had to modify shuffle so it wouldn't combine all input into a single file, which could be many megabytes in size. Additionally, I would like to see shuffle allow read-only data files and allow sharing of files with my coworkers. The latter means I would need to remove the restriction to use files in a single directory. Also, there is no lock mechanism to prevent two instances of the program from running at the same time in the same directory.--Kees Hendrikse

The script runs unmodified under Unixware. It doesn't run on my BSD 386 system--which uses Bash instead of the Korn shell--unless you replace the echo statements. I also had to replace the nawk script by one written in Perl, which I obtained by translating the script using the a2p conversion utility provided by the Perl distribution. My guess would be that shuffle could be improved further by translating it completely to Perl.--Endre Bálint Nagy

For AIX 3.2 I had to rename awk to nawk, but both AIX and Ultrix 4.3 required that I not use the unsupported -M sort option in the ``rule'' file.--Steve Wright

This script worked fine under ISC 3.2.2, but had to be changed significantly to run with Coherent (version 4.2.05). [See Part G for the Coherent port of shuffle, which, by the way, should also work with System V Release 2 and later Bourne shells and the old awk.]--Gábor Zahemszky

Wanted: Rewrite Shuffle in Perl
I'm looking for a Perl version of the shuffle program discussed here. We'll pay you US$100 for your trouble. You're welcome to enhance or improve, as long as you coordinate with me.

Acknowledgments
I wish to thank the following readers for their help with testing this month's contributions: Gábor Zahemszky, CoDe Ltd., Budapest, Hungary (ISC 3.2.2 Unix and Coherent 4.2); Kees Hendrikse, Echelon Consultancy, Enschede, The Netherlands (current SCO Unix and Xenix versions); Endre Bálint Nagy, Walton Networking Ltd., Budapest, Hungary (Unixware Application Server 1.0); and Steve Wright, Computer Science Dept., University of South Carolina, Columbia, S.C. (AIX 3.2).
Copyright © 1995-1997 The McGraw-Hill Companies, Inc. All Rights Reserved.
Edited by Becca Thomas / editor@unixworld.com Last Modified: Thursday, 12-Sep-96 20:53:04