Open Computing ``Hands-On'': ``Wizard's Grabbag'' Column: June, 1994: Listings

Listing 1: The shuffle script processes line-oriented data: it catenates the input files into one file, then moves the lines that match each rule into the named target files, optionally sorting each target.

A. Listing of the shuffle Korn shell script:

#!/usr/bin/ksh
# @(#) shuffle Version 5  A rule-based list processor
# Author: Thomas Baker <tbaker@unix.amherst.edu>
# Modified by: Becca Thomas, February 1994
$DBG_SH                         # Dormant debugging directive

trap 'rm -f $Tmpfile $Targetfilenames >|$Devnull 2>&1; \
    exit $Stat' 0
trap 'print -u2 "$(basename $0): Interrupted!"; exit' 1 2 3 15

# CONFIGURATION
Allfiles=combined.dat               # File for all catenated input files
Bkupdir=.backup                     # Unix input-files backup directory
#Bkupdir=backup                     # MKS  input-files backup directory
Devnull="/dev/null"                 # Unix bit-bucket file
Rulefile=.rules                     # Unix rule file
#Rulefile=rules                     # MKS  rule file
Usage="Usage: $(basename $0) datafile [datafile ...]" # Correct usage
# Temporary directory-dependent variables:
Tmpdir=/tmp                         # MKS/Unix temporary directory
#Devnull=$Tmpdir/null               # MKS  bit-bucket file
Targetfilenames=$Tmpdir/sht$$.tmp   # MKS/Unix target-names file
Tmpfile=$Tmpdir/shf$$.tmp           # MKS/Unix temporary work file

# FUNCTION DEFINITIONS:
function usage_exit {
    print -u2 "$Usage"; Stat=1 ; exit
}
function movelines { # Args: $Searchkey $Source $Target $Sortcmd
    print -n "Lines with [$1] moved from \""$2"\" to \""$3"\""
    egrep "$1" $2 >>$3; egrep -v "$1" $2 >|$Tmpfile; mv $Tmpfile $2
    [ "$4" ] && print ", ${4}." || print "." # Print sort command
    [ "$4" ] && { eval $4 -o $3 $3 ||
        { print "\aBad rule-file sort command: $4"; Stat=2; exit;};}
}

# PROCESS COMMAND-LINE ARGUMENTS:
case $# in      # User must specify at least one file-name argument
    0)  usage_exit ;;
esac

# SANITY CHECK: Rule file:
[ -r $Rulefile ] ||
    { print -u2 "\aCannot read \"$Rulefile\" file!"; Stat=4; exit;}
sed 's/#.*$//' $Rulefile |          # Remove comments.
egrep -v '^$' |                     # Remove blank lines.
nawk -F\| '                         # Rules separated by vertical bar
NR == 1 && ($1 != "." || $2 != "$Allfiles") {   # Check first rule
    print $0, ": rule 1 is illegal!" }
NF != 3 && NF != 4 {                # All rules have 3 or 4 fields.
    print $0, ": must have 3 or 4 fields!" }
$2 == $3 {                          # Source different from target.
    print $0, ": source cannot equal target!" }
$4 != "" && $4 !~ /^sort/ {         # Field 4 is for sort commands.
    print $0, ": field 4 is only for sort!" }
$1 == "" || $2 == "" || $3 == "" {  # First three fields are non-empty.
    print $0, ": 1 of first 3 fields is empty!" }
{ target[$3] = 1 }                  # Note names of target files
NR > 1 {                            # For all lines after the first
    if ($2 in target)               # If source file is also a target
        next;                       # No problem, fetch next input line
    else print $0, ": ", $2, "has no precedent!"
}' >| $Tmpfile                      # Save any rule-format error messages
[ -s $Tmpfile ] &&
    { print -u2 "Bad rule format:\n$(cat $Tmpfile)"; Stat=5; exit;}

# SANITY CHECKS: Current directory, combined data, backup directory:
[ -w "." ] ||                       # Current (data) directory
    { print -u2 "\aCannot write to current directory!"; Stat=6; exit;}
[ -f $Allfiles ] &&                 # Combined data file
    { print -u2 "\a\"$Allfiles\" should not yet exist!"; Stat=7; exit;}
[ -d $Bkupdir ] || mkdir $Bkupdir 2>|$Devnull ||
    { print -u2 "\aCannot make directory \"$Bkupdir\"!"; Stat=8; exit;}
[ "$(ls $Bkupdir)" ] && {           # if there are files in backup dir
print -n "Okay to erase files in $Bkupdir (y*|Y*/n)? "; read ans
case $ans in
    y*|Y*)  rm -f $Bkupdir/* >|$Devnull 2>&1 ;; # Remove old backups
    *)      print "Exiting, check $Bkupdir directory."; Stat=0; exit ;;
esac;}

# CHECK DATA FILES, BACK UP, THEN COMBINE INTO A COMMON FILE:
for File in "$@"; do
    [ -d $File ] && continue                # Ignore directories.
    [ "$File" = "$Rulefile" ] && continue   # Ignore rules (just data).
    [ "$(dirname $File)" = "." ] || [ "$(dirname $File)" = "$PWD" ] ||
        { print -u2 "\aData files must be in current directory!"
        Stat=9; exit;}
    [ -r $File ] ||
        { print -u2 "\a\"$File\" file not readable."; Stat=10; exit;}
    { file $File | egrep 'text|empty' >|$Devnull 2>&1;} ||
        { print -u2 "\a\"$File\" not text nor empty."; Stat=11; exit;}
    egrep '^[   ]*$' $File >|$Devnull 2>&1 &&
        { print -u2 "\a\"$File\" has blank lines!"; Stat=12; exit;}
    cp $File $Bkupdir ||    # Copy to backup directory.
        { print -u2 "\aCannot back up $File!"; Stat=13; exit;}
    cat $File >> $Allfiles; rm $File   # Combine into common file.
done

# CHECK COMBINED DATA FILE:
[ -s $Allfiles ] || { print -u2 "\aNo data to process!"; Stat=14; exit;}
Beforesize=$(wc -c <$Allfiles | awk '{ print $1 }') # Data size before
print "Data backed up to \"$Bkupdir\", concatenated in \"$Allfiles\"."

# PROCESS DATA FILES under direction of rule file:
OldIFS="$IFS"               # Save old internal field separator char(s)
IFS="|"                     # Rule-file field separator for "read"
sed 's/#.*$//' $Rulefile |          # Remove rule-file comments
egrep -v '^$' |                     # Remove blank lines
while read Searchkey From To Sortcmd ; do   # put fields into variables
    eval Source=$From; eval Target=$To      # interpolate these var.
    movelines $Searchkey $Source $Target $Sortcmd # Do the shuffle
    print -u3 "$Target"             # Output goes to fd 3.
done 3>| $Targetfilenames           # Store fd3 output in a file.
IFS="$OldIFS"                       # Restore original IFS values.
Targetnames=$(sort -u $Targetfilenames) # Place unique list in variable.

# CONCLUSION: Cleanup and exit message:
for File in $Targetnames $Allfiles; do
    [ -s $File ] || rm $File        # Erase data files if empty
done
if [ $Beforesize -ne $(cat $Targetnames 2>|$Devnull | wc -c) ]; then
    print -u2 "Warning: data may have been lost--use backup!\a\a\a"
else
    print -u2 "Done: data shuffled and intact!"
fi
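
The listing in Part A closes by comparing the byte count of the combined data with the total byte count of the target files. A minimal sketch of that idea in POSIX sh follows. The scratch directory, the use of mktemp, and the variable names "before" and "after" are assumptions made for illustration; they are not taken from the listing.

```shell
#!/bin/sh
# Sketch of the closing integrity check: moving lines between files
# must not change the total number of bytes.
demo=$(mktemp -d) && cd "$demo" || exit 1    # Scratch directory (assumed)
printf '%s\n' 'NOW Renew passport!' 'Feb 10 BDAY Sarah (1956)' >phone
before=$(wc -c <phone | tr -d ' ')           # Data size before the move

# One "move" step: matching lines go to the target, the rest stay behind.
grep -E '^NOW ' phone >>now
grep -E -v '^NOW ' phone >phone.tmp && mv phone.tmp phone

after=$(cat phone now | wc -c | tr -d ' ')   # Data size after the move
if [ "$before" -ne "$after" ]; then
    echo "Warning: data may have been lost" >&2
else
    echo "Intact: $after bytes"
fi
```

A byte count catches lost or duplicated lines, but it would not notice two errors that happen to cancel out, which may be why the script also keeps backups.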

B. A sample data file:

- 1994 Feb 23 Smith 01 John Lunch at Panda East.
- 1994 Jan 23 Smith 02 Not coming to session, but writing paper.
- 1994 Feb 23 Smith 03 FOLLOWUP Read Sep 1993 SCILS article on SGML
Smith John 432 E43rd St, New York NY 01002 212-555-5555, fax 666-6666
Feb 10 BDAY Sarah (1956)
LATER Read SCILS article on SGML.
NOW Renew passport!
Beans stock and info 800-221-4221, customer service 800-341-4341
Clothes Shoes Timberland "Blucher" size W12
Convert US Ounces to Grams: 1 oz = 28.35 gm
Wallet [07 Sep 93] NY Drivers' # A01234 56789 123456 78, exp 7/96
- 1993 Dec 20 10am Called John Smith, set appt and faxed letter.
Wallet [07 Sep 93] Visa 1234-5678-1234-5678, lost: 1-800-423-3823
Fastback differential backup of C: c:/fastback/fb ')c)b)d)s))'
Clothes Shoes Adidas Marath.Train.II 1CA, size 12.5(D) 48(F) 13(USA)

C. A sample rule file:

# Rule file for "Shuffle: a rule-based list processor"
# 1. Rules contain: searchkey|source|target|optional_sort_command
# 2. First rule must have "." in first field, "$Allfiles" in second.
# 3. Common sort types:
#    sort                        Straight alphabetic.
#    sort +0M -1 +1n -2          Data format: Jun 25
#    sort +1n -2 +2M -3 +3n -4   Data format: - 1992 Jun 25
.|$Allfiles|phone|sort
^- |phone|1993|sort +1n -2 +2M -3 +3n -4
^- 1994 |1993|1994|sort +1n -2 +2M -3 +3n -4
^Jan |phone|calendar
^Feb |phone|calendar
^Dec |phone|calendar|sort +0M -1 +1n -2
BDAY|calendar|bday|sort +0M -1 +1n -2
^NOW |phone|now|sort
^LATER |phone|later|sort
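
The sort commands in this rule file use the older "+pos -pos" key notation, which counts fields from zero. Some current sort implementations no longer accept that notation. Assuming a sort that supports "-k" keys and the "M" (month-name) modifier, as GNU sort does, the third common sort type might be rewritten as in the sketch below. The sample lines come from Part B; the variable name "sorted" is an assumption.

```shell
#!/bin/sh
# Sketch: "sort +1n -2 +2M -3 +3n -4" rewritten with -k keys.
# Old-style "+1 -2" means "field 2 only" in the 1-based -k notation.
# Assumes a sort with the M (month-name) modifier, such as GNU sort.
LC_ALL=C; export LC_ALL        # English month names, predictable order

sorted=$(printf '%s\n' \
    '- 1994 Feb 23 Smith 01 John Lunch at Panda East.' \
    '- 1993 Dec 20 10am Called John Smith, set appt and faxed letter.' \
    '- 1994 Jan 23 Smith 02 Not coming to session, but writing paper.' |
    sort -k 2,2n -k 3,3M -k 4,4n)       # Year, then month, then day
printf '%s\n' "$sorted"
```

Under the same assumption, "sort +0M -1 +1n -2" would become "sort -k 1,1M -k 2,2n".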

D. Another example of a data-file line:

Jan 23 Smith John Lunch at Panda East.

E. Some transformations of the data-file line shown above in Part D:

- 1994 Jan 23 Smith 01 John Lunch at Panda East.
- 1994 Jan 23 Smith 02 Not coming to session, but writing paper.
- 1994 Jan 23 Smith 03 FOLLOWUP Sep 1993 SCILS article on SGML
- 1994 Jan 23 Smith 04 FOLLOWUP Call Joachim Mann 321-4567
Mann Joachim, tel 321-4567
LATER Read Sep 1993 SCILS article on SGML.

F. Another example of a rule-file line:

FOLLOWUP|1994|followup|sort +1n -2 +2M -3 +3n -4

G. A version of shuffle written for Coherent that runs under the Bourne shell with the ``old'' awk:

#!/usr/bin/sh
# @(#) shuffle Version 5  A rule-based list processor
# Author: Thomas Baker <tbaker@unix.amherst.edu>
# Modified by: Becca Thomas, February 1994
# Modified by: Ga'bor Zahemszky, March 1994 to use sh and "old" awk
$DBG_SH                             # Dormant debugging directive

trap 'rm -f $Tmpfile $Targetfilenames >$Devnull 2>&1; exit $Stat' 0
trap 'echo "`basename $0`: Interrupted!" >&2 ; exit' 1 2 3 15

# CONFIGURATION
Allfiles=combined.dat               # File for all catenated input files
Bkupdir=.backup                     # Unix input-files backup directory
#Bkupdir=backup                     # MKS  input-files backup directory
Devnull="/dev/null"                 # Unix bit-bucket file
Rulefile=.rules                     # Unix rule file
#Rulefile=rules                     # MKS  rule file
Usage="Usage: `basename $0` datafile [datafile ...]" # Correct usage
# Temporary directory-dependent variables:
Tmpdir=/tmp                         # MKS/Unix temporary directory
#Devnull=$Tmpdir/null               # MKS  bit-bucket file
Targetfilenames=$Tmpdir/sht$$.tmp   # MKS/Unix target-names file
Tmpfile=$Tmpdir/shf$$.tmp           # MKS/Unix temporary work file

# FUNCTION DEFINITIONS:
usage_exit() {
    echo "$Usage" >&2 ; Stat=1 ; exit
}
movelines() { # Args: $Searchkey $Source $Target $Sortcmd
    echo "Lines with [$1] moved from \""$2"\" to \""$3"\"\c"
    egrep "$1" $2 >>$3; egrep -v "$1" $2 >$Tmpfile; mv $Tmpfile $2
    [ "$4" ] && echo ", ${4}." || echo "." # Print sort command
    [ "$4" ] && { eval $4 -o $3 $3 ||
        { echo "\007Bad rule-file sort command: $4"; Stat=2; exit;};}
}

# PROCESS COMMAND-LINE ARGUMENTS:
case $# in      # User must specify at least one file-name argument
    0)  usage_exit ;;
esac

# SANITY CHECK: Rule file:
[ -r $Rulefile ] ||
    { echo "\007Cannot read \"$Rulefile\" file!" >&2 ; Stat=4; exit;}
sed 's/#.*$//' $Rulefile |          # Remove comments.
egrep -v '^$' |                     # Remove blank lines.
oawk -F\| '                         # Rules separated by vertical bar
NR == 1 && ($1 != "." || $2 != "$Allfiles") {   # Check first rule
    print $0, ": rule 1 is illegal!" }
NF != 3 && NF != 4 {                # All rules have 3 or 4 fields.
    print $0, ": must have 3 or 4 fields!" }
$2 == $3 {                          # Source different from target.
    print $0, ": source cannot equal target!" }
$4 != "" && $4 !~ /^sort/ {         # Field 4 is for sort commands.
    print $0, ": field 4 is only for sort!" }
$1 == "" || $2 == "" || $3 == "" {  # First three fields are non-empty.
    print $0, ": 1 of first 3 fields is empty!" }
{ target[$3] = 1 }                  # Note names of target files
NR > 1 {                            # For all lines after the first
    ZGvar2 = 0
    for (ZGvar1 in target) {
        if (ZGvar1 == $2) {
            next
        } else {
            ZGvar2 = 1
        }
     }
    if (ZGvar2 == 1) {
        print $0, ": ", $2, "has no precedent!"
    }
}' > $Tmpfile                       # Save any rule-format error messages
[ -s $Tmpfile ] &&
    { echo "Bad rule format:\n`cat $Tmpfile`" >&2 ; Stat=5; exit;}

# SANITY CHECKS: Current directory, combined data, backup directory:
[ -w "." ] ||                       # Current (data) directory
    { echo "\007Cannot write to current directory!" >&2 ; Stat=6; exit;}
[ -f $Allfiles ] &&                 # Combined data file
    { echo "\007\"$Allfiles\" shouldn't exist!" >&2 ; Stat=7; exit;}
[ -d $Bkupdir ] || mkdir $Bkupdir 2>$Devnull ||
    { echo "\007Can't make directory \"$Bkupdir\"!" >&2 ; Stat=8; exit;}
[ "`ls $Bkupdir`" ] && {            # if there are files in backup dir
echo "Okay to erase files in $Bkupdir (y*|Y*/n)? \c"; read ans
case $ans in
    y*|Y*)  rm -f $Bkupdir/* >$Devnull 2>&1 ;;  # Remove old backups
    *)      echo "Exiting, check $Bkupdir directory."; Stat=0; exit ;;
esac;}

# CHECK DATA FILES, BACK UP, THEN COMBINE INTO A COMMON FILE:
for File in $*; do
    [ -d $File ] && continue                # Ignore directories.
    [ "$File" = "$Rulefile" ] && continue   # Ignore rules (just data).
    [ "`dirname $File`" = "." ] || [ "`dirname $File`" = "`pwd`" ] ||
        { echo "\007Data files must be in current directory!" >&2
        Stat=9; exit;}
    [ -r $File ] ||
        { echo "\007\"$File\" file not readable." >&2 ; Stat=10; exit;}
    { file $File | egrep 'text|empty' >$Devnull 2>&1;} ||
        { echo "\007\"$File\" not text nor empty." >&2 ; Stat=11; exit;}
    egrep '^[   ]*$' $File >$Devnull 2>&1 &&
        { echo "\007\"$File\" has blank lines!" >&2 ; Stat=12; exit;}
    cp $File $Bkupdir ||    # Copy to backup directory.
        { echo "\007Cannot back up $File!" >&2 ; Stat=13; exit;}
    cat $File >> $Allfiles; rm $File   # Combine into common file.
done

# CHECK COMBINED DATA FILE:
[ -s $Allfiles ] || { echo "\007No data to process!">&2; Stat=14; exit;}
Beforesize=`wc -c <$Allfiles | oawk '{ print $1 }'` # Data size before
echo "Data backed up to \"$Bkupdir\", concatenated in \"$Allfiles\"."

# PROCESS DATA FILES under direction of rule file:
OldIFS="$IFS"               # Save old internal field separator char(s)
IFS="|"                     # Rule-file field separator for "read"
sed 's/#.*$//' $Rulefile |          # Remove rule-file comments
egrep -v '^$' |                     # Remove blank lines
while read Searchkey From To Sortcmd ; do   # put fields into variables
    eval Source=$From; eval Target=$To      # interpolate these var.
    movelines $Searchkey $Source $Target $Sortcmd # Do the shuffle
    echo "$Target" >&3              # Output goes to fd 3.
done 3> $Targetfilenames            # Store fd3 output in a file.
IFS="$OldIFS"                       # Restore original IFS values.
Targetnames=`sort -u $Targetfilenames`  # Place unique list in variable.

# CONCLUSION: Cleanup and exit message:
for File in $Targetnames $Allfiles; do
    [ -s $File ] || rm $File        # Erase data files if empty
done
if [ $Beforesize -ne `cat $Targetnames 2>$Devnull | wc -c` ]; then
    echo "Warning: data may have been lost--use backup!\007" >&2
else
    echo "Done: data shuffled and intact!" >&2
fi
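
The main awk change in Part G is the "has no precedent" check: the nawk test "$2 in target" is replaced by a loop over the array's indices, presumably because the "old" awk lacks the "in" operator outside a for statement. The sketch below isolates that substitution so it can be tried with any current awk. The three sample rules, the "found" flag, and the variable name "errors" are assumptions; the listing itself uses "next" inside the loop rather than a flag.

```shell
#!/bin/sh
# Sketch: array membership test without the "in" operator.
# Each rule's source (field 2) must already have appeared as the target
# (field 3) of a rule read so far; rule 1 is exempt.
errors=$(printf '%s\n' \
    '.|$Allfiles|phone|sort' \
    '^NOW |phone|now|sort' \
    '^Jan |diary|calendar' |
awk -F'|' '
{ target[$3] = 1 }              # Note names of target files
NR > 1 {
    found = 0
    for (name in target)        # Loop over indices instead of "$2 in target"
        if (name == $2) found = 1
    if (!found) print $0 ": " $2 " has no precedent!"
}')
printf '%s\n' "$errors"
```

Only the third sample rule is reported, because no earlier rule names "diary" as a target.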

Figure 1: A data-flow diagram for the example discussed in Tom Baker's introductory letter.

$Allfiles
   |
   V                                      Sorted by year:
 phone ---> 1993 [^- ] -----\-----------> 1994 [^- 1994 ]
   |                         \----------> 1993 (everything else) 
   |
   V                                      Sorted by month:
 phone ---> calendar [^Jan,^Feb..] \----> bday [BDAY]
   |                                \---> calendar (everything else)
   |
   V        Sorted alphabetically:
 phone ---> now [^NOW ]
   |
 phone ---> later [^LATER ]
   |
   \------> phone (everything else)
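
The flow in Figure 1 can be traced with a stripped-down rule loop. The sketch below is an illustration in POSIX sh, not the column's script: it reads rules from a here-document instead of a rule file, skips the sort field, the backups, and every sanity check, and works in a scratch directory created with mktemp. The variable names "key", "from", "to", and "src" are assumptions.

```shell
#!/bin/sh
# Sketch: apply "searchkey|source|target" rules in order.
demo=$(mktemp -d) && cd "$demo" || exit 1    # Scratch directory (assumed)
Allfiles=combined.dat
printf '%s\n' \
    '- 1994 Feb 23 Smith 01 John Lunch at Panda East.' \
    '- 1993 Dec 20 10am Called John Smith, set appt and faxed letter.' \
    'Feb 10 BDAY Sarah (1956)' \
    'NOW Renew passport!' \
    'Beans stock and info 800-221-4221' >"$Allfiles"

while IFS='|' read -r key from to; do
    eval "src=$from"            # Expands the literal $Allfiles in rule 1;
                                # eval trusts the rule text completely
    grep -E -- "$key" "$src" >>"$to"            # Matching lines to target
    grep -E -v -- "$key" "$src" >"$src.tmp"     # The rest stay in source
    mv "$src.tmp" "$src"
done <<'EOF'
.|$Allfiles|phone
^- |phone|1993
^- 1994 |1993|1994
^Feb |phone|calendar
BDAY|calendar|bday
^NOW |phone|now
EOF

for f in 1993 1994 bday now phone; do
    echo "$f: $(wc -l <"$f" | tr -d ' ') line(s)"
done
```

Because each rule's source is the target of an earlier rule, the order of the rules determines where a line ends up: the 1994 entry first passes through "phone" and "1993" before it reaches "1994".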

Copyright © 1995-1997 The McGraw-Hill Companies, Inc. All Rights Reserved.
Edited by Becca Thomas / editor@unixworld.com

Last Modified: Thursday, 12-Sep-96 20:50:47