Tuesday, August 16, 2011

Episode #156: Row, Row, Row... You're Columns!

Hal receives stroking via email

I recently received an email from my old friend Frank McClain:

It is with much humility that I kneel before the masters and ask this request, which I am certain is but a simple task for such honored figures.


Well, a little sucking up never hurts, Frank. Let's see what your issue is:

Tab-delimited text file containing multiple email addresses per row. The first such field is sender, and that's fine. The following fields are recipients. The first recipient can stay where it is, but the following for that row need to be moved individually into column-format below the first recipient, in new rows. If there is only one recipient in a row, nothing more needs to be done with that row.

Example:

7/27/2011 15:40:00 steve.jobes@place.com jmarcus@someplace.com ronsmith@someplace.com pgonzalez@someplace.com
6/17/2011 15:19:00 ssummers@someplace.com kevin.smart@provider.com Pamla.Barras@store.com pamlabs@webmail.com
5/14/2011 12:35:00 amartelli@someplace.com apiska@business.com jmilch@provider.net pampwanla@webmail.com

What I need to end up with is:

7/27/2011 15:40:00 steve.jobes@place.com jmarcus@someplace.com
7/27/2011 15:40:00 steve.jobes@place.com ronsmith@someplace.com
7/27/2011 15:40:00 steve.jobes@place.com pgonzalez@someplace.com
6/17/2011 15:19:00 ssummers@someplace.com kevin.smart@provider.com
6/17/2011 15:19:00 ssummers@someplace.com Pamla.Barras@store.com
6/17/2011 15:19:00 ssummers@someplace.com pamlabs@webmail.com
5/14/2011 12:35:00 amartelli@someplace.com apiska@business.com
5/14/2011 12:35:00 amartelli@someplace.com jmilch@provider.net
5/14/2011 12:35:00 amartelli@someplace.com pampwanla@webmail.com


No worries, Frank. I got this one.

It's pretty clear to me that two nested loops are going to be required. We'll need one loop to read each line, and then another loop to output a series of lines listing each recipient individually:

$ while read date time from recips; do
    for r in $recips; do
        echo -e "$date\t$time\t$from\t$r";
    done;
done <input-file

7/27/2011 15:40:00 steve.jobes@place.com jmarcus@someplace.com
7/27/2011 15:40:00 steve.jobes@place.com ronsmith@someplace.com
7/27/2011 15:40:00 steve.jobes@place.com pgonzalez@someplace.com
6/17/2011 15:19:00 ssummers@someplace.com kevin.smart@provider.com
6/17/2011 15:19:00 ssummers@someplace.com Pamla.Barras@store.com
6/17/2011 15:19:00 ssummers@someplace.com pamlabs@webmail.com
5/14/2011 12:35:00 amartelli@someplace.com apiska@business.com
5/14/2011 12:35:00 amartelli@someplace.com jmilch@provider.net
5/14/2011 12:35:00 amartelli@someplace.com pampwanla@webmail.com

So the outer "while read ..." loop is what we're using to read the input file-- notice the "<input-file" hiding at the end of the loop construct. Since read will automatically split up fields on whitespace for us, we can quickly pull out the date, time, and from address. We then have one more variable, recips, that gobbles up everything else on the line-- i.e., all of the recipient addresses.
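If you want to see that splitting in isolation, here is a small sketch. The show_fields name is made up for this example, and it feeds read a single line through a here-document instead of a file. It should work in any POSIX-style shell:

```shell
# show_fields is a hypothetical helper, named only for this sketch.
show_fields() {
    # A here-document hands the line to read in the current shell,
    # so the variables are still set when printf runs.
    read -r date time from recips <<EOF
$1
EOF
    printf 'date=%s\ntime=%s\nfrom=%s\nrecips=%s\n' "$date" "$time" "$from" "$recips"
}

# Part of the first row of Frank's sample, with real tabs supplied by printf
show_fields "$(printf '7/27/2011\t15:40:00\tsteve.jobes@place.com\tjmarcus@someplace.com\tronsmith@someplace.com')"
```

The first three variables each get one field, and recips keeps whatever is left over, embedded tab and all.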

But the recipient addresses are themselves whitespace delimited, so we can just whack $recips down into our for loop and iterate over each email address in the list. For each one of those recipients we output a tab-delimited line of output containing $date, $time, $from, and the current recipient, $r. We need to use "echo -e" here so that the "\t"s get expanded as tabs.
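One caveat: "echo -e" is not honored by every /bin/sh out there. If you need the same loops in a script that may run under something other than bash, a slightly more defensive sketch might look like the following. The split_recipients function name is an invention for this example. It swaps in printf for the tabs and adds "-r" to read so that backslashes pass through untouched:

```shell
# split_recipients is a hypothetical helper name, not part of the original solution.
split_recipients() {
    # Default IFS splits on spaces and tabs; -r keeps backslashes literal.
    while read -r date time from recips; do
        # $recips is unquoted on purpose so the shell splits it into addresses.
        for r in $recips; do
            printf '%s\t%s\t%s\t%s\n' "$date" "$time" "$from" "$r"
        done
    done
}
```

You would call it as "split_recipients <input-file". The logic is the same two nested loops; only the output command changes.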

Nothing could be easier. In fact, I bet Tim could even handle this one in CMD.EXE. But Frank was so moved by our solution that he replied:

Your meaningless servant is like unto a worm to be crushed beneath the might of your foot, nay, even but a toe. The mere fact that the Master has deemed to write an honored response to this insignificant gnat has caused tears of joy to stream in a veritable rain from my eyes, too blind to look upon the shining radiance of the Master.

Not much we can add to that.

Tim crushes worms

Because Frank asked so nicely (and because Hal threw me under the bus) I'll do some ugly cmd, first.

C:\> cmd.exe /v:on /c "for /f "tokens=1-25" %a in (input.txt) do @(
echo %a %b %c %d &&
echo %e | find "@" > NUL && echo %a %b %c %e &&
echo %f | find "@" > NUL && echo %a %b %c %f &&
echo %g | find "@" > NUL && echo %a %b %c %g &&
...
echo %y | find "@" > NUL && echo %a %b %c %y)"



In this command, we start off by reading our input file. The default delimiters of tab and space will work fine for us because 1) the only space we have is between the date and time and 2) using just the tab as a delimiter is a pain. We can do it, but we have to start a new shell with tab completion disabled, and I like tab completion.

Once we read the file we output the date (%a), time (%b), sender (%c), and the first recipient (%d). Next, we echo the token that would hold the second recipient (%e) into find to see if it contains an "@". If it doesn't, then our short-circuit Logical And (&&) will stop the rest of the line from executing. If it does, then we output a row for the second recipient (%e). We do the same for the third (%f) through 22nd (%y) recipient (Frank said 22 was the max).

It isn't a brief command, but I do think it is quite elegant in its form and function. Building such a big command with just basic building blocks is like building fire with sticks. And many times I feel that with cmd all I have is sticks.

Now for PowerShell...

The PowerShell version is pretty similar to what Hal did but with his Foreach loop replaced with a For loop and a little extra math.

PS C:\> gc input.txt | % {$s = $_.split("`t");
for ($i=2; $i -lt $s.length; $i++) { write-host $s[0] $s[1] $s[$i] } }


7/27/2011 15:40:00 steve.jobes@place.com jmarcus@someplace.com
7/27/2011 15:40:00 steve.jobes@place.com ronsmith@someplace.com
7/27/2011 15:40:00 steve.jobes@place.com pgonzalez@someplace.com
6/17/2011 15:19:00 ssummers@someplace.com kevin.smart@provider.com
...


We use Get-Content (alias gc) to read in our file. We then use the ForEach-Object cmdlet (alias %) to operate on each line. Each line is split, using tab as the delimiter, and held in the array $s. Because we split only on tabs, the date and time stay together in the 0th element. We then use a for loop to output the 0th element (date and time), the 1st element (sender), and the recipient held in the Nth element (Ok, so technically the Ith element). This gives us output, but of course with PowerShell the right way to do it is with objects.
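For readers following along on the Unix side, the same tab-only split can be sketched with awk. This is only an illustration of the idea, not part of the PowerShell solution. The split_on_tabs name is made up for this example, and it assumes, as above, that date and time are separated by a space while everything else is separated by tabs:

```shell
# split_on_tabs is a hypothetical helper name used only for this sketch.
split_on_tabs() {
    awk -F '\t' -v OFS='\t' '{
        # $1 = "date time", $2 = sender, $3..$NF = recipients
        for (i = 3; i <= NF; i++) print $1, $2, $i
    }' "$@"
}
```

Running "split_on_tabs input.txt" should print one tab-delimited line per recipient. Note that awk fields are numbered from 1, so the loop starts at 3 where the PowerShell loop starts at index 2.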

PS C:\> $r = "" | Select Date, Sender, Recipient

PS C:\> gc input.txt | % {$s = $_.split("`t"); $r.Date = (Get-Date $s[0]); $r.Sender = $s[1];
for ($i=2; $i -lt $s.length; $i++) {$r.Recipient = $s[$i]; $r}}

Date Sender Recipient
---- ------ ---------
7/27/2011 3:40:00 PM steve.jobes@place.com jmarcus@someplace.com
7/27/2011 3:40:00 PM steve.jobes@place.com ronsmith@someplace.com
7/27/2011 3:40:00 PM steve.jobes@place.com pgonzalez@someplace.com
6/17/2011 3:19:00 PM ssummers@someplace.com kevin.smart@provider.com
...


The approach is very similar to our original. The notable difference is the use of our custom object $r. To create this basic object we pipe an empty string ("") into the Select-Object cmdlet (alias select) and select our new property names. This gives us the shell of an object with the properties we need, but with no values.

Next, we use our same Get-Content cmdlet with our ForEach-Object loop. Instead of outputting the results, we set the relevant property in our object. In addition, the Date string is converted to a Date object so we could later use PowerShell's date comparisons and operators. Finally, we output the object.

Now, back to enjoying the groveling.