Tuesday, October 13, 2009

Episode #64: The Times (OK, Dates) They Are a Changing

Hal finds an interesting topic:

Recently Rich Shepard, one of my colleagues on the Portland Linux User Group mailing list, posted an interesting problem. He had a data set with pipe-delimited records like:

1993-1|Water Quality|WVR|Yamhill, City of|Yamhill|Hamlin|Holt|Npv|
NPDES-Waste Discharge Limits|7/6/1993|0


All he wanted to do was convert the date column in the 10th field to YYYY-MM-DD format so that the file could be more easily imported into a relational database. He was curious if there was a simple command-line he could use to accomplish this.

To me, this seemed like a task that was tailor made for awk (apparently Joe Pruett agreed, since his solution on the mailing list was essentially identical to the one I'm presenting here). While awk normally splits fields on whitespace, we can use the "-F" option to specify an alternate delimiter. Once we've got the fields split up, we can work a little magic with the built-in split() and sprintf() operators in awk:

$ awk -F'|' '{split($10,f,"/");
$10=sprintf("%d-%02d-%02d", f[3], f[1], f[2]);
print}' data

1993-1 Water Quality WVR Yamhill, City of Yamhill Hamlin Holt Npv
NPDES-Waste Discharge Limits 1993-07-06 0
...

The split() function in the example breaks up field #10 on "/" characters and puts the results into the array named "f". Actually the last argument to split() can be a full-on egrep-style regular expression delimited with "/.../". But since we're just splitting on literal slash characters, "/" is a lot easier to type than "/\//".

Once we have the year, month, and day split into an array, we then replace the contents of the 10th field with the output of the sprintf() routine. This puts our data in the desired format. The final "print" statement outputs all of the fields from the original line, including our reformatted field.

Now you'll notice that the output is space-delimited rather than pipe-delimited. That's because awk's default "output field separator" (OFS for short) is space. You can actually change this by changing the value of the OFS variable. The trick is you need to set variables like this in a "BEGIN" block at the front of your awk code so that the new value is set before you begin processing your input file:

$ awk -F'|' 'BEGIN { OFS="|" }
{split($10,f,"/");
$10=sprintf("%d-%02d-%02d", f[3], f[1], f[2]);
print}' data

1993-1|Water Quality|WVR|Yamhill, City of|Yamhill|Hamlin|Holt|Npv|
NPDES-Waste Discharge Limits|1993-07-06|0
...

Of course we could set OFS to anything. For example, we could set it to comma to produce CSV files (though there are possible quoting issues if your data contains commas). There are other variables we can set to control awk's splitting behavior. For instance, the "-F" option is equivalent to setting the FS ("field separator") variable. Similarly, there are the RS ("record separator") and ORS ("output record separator") variables, which are normally set to newline since awk operates on a line-at-a-time basis.

Anyway, if your task is chopping up data and dumping it into a different format, awk is always one good tool to reach for. I could have solved this a bit more tersely using Perl, but that would be breaking the rules for this blog. For those of you who are thinking that even my awk code is breaking the "no scripting languages" rule, it is possible to do this with cut instead of awk or sed, but the result is pretty nasty:

$ IFS='|'
$ while read -a F; do
printf -v d "%d-%02d-%02d" \
`echo ${F[9]} | cut -d/ -f3` \
`echo ${F[9]} | cut -d/ -f1` \
`echo ${F[9]} | cut -d/ -f2`;
F[9]=$d;
echo "${F[*]}";
done < data

1993-1|Water Quality|WVR|Yamhill, City of|Yamhill|Hamlin|Holt|Npv|
NPDES-Waste Discharge Limits|1993-07-06|0
...

"read -a F" splits each line using the delimiter we specified in the IFS variable and assigns the fields to elements of the array named F. Notice, however, that bash indexes its arrays starting at zero (like C programs) while awk starts with one. So the date we're reformatting is in F[9], not F[10].

The real difficulty here is that cut doesn't let us reorder multiple fields in a single command, so we're forced to do three instances of the "echo ... | cut ..." pipeline to get the date fields in the order we want. Another minor annoyance is that "printf -v ..." doesn't let us assign directly to array variables, so I have to use $d as a temporary variable.

It's also worth pointing out that the double quotes in the last echo statement in the loop are significant. If I just wrote "echo ${F[*]}" without the double quotes, then I'd get space-separated output. Using the double quotes causes the output to be delimited with the first character of $IFS (similar to setting OFS in awk).

So there you go: an awk solution and a nasty shell-only version. Somehow I think that Tim's Windows solution is going to look even uglier though...

Tim brings the ugly:

First off, the date format of our sample wasn't specified, so I will assume the sample date is July 6th, 1993. My apologies to military and European followers who think the date should be June 7th, 1993.

Linux may have all sorts of different "cool" commands to use, but in the windows world we use the FOR loop...for everything.

We use our FOR loop to split the fields using the "|" and "/" characters as a delimiters. Then all we need to do is rearrange the date parts and put it all back together.

C:\> for /F "tokens=1-14 delims=|/" %a in (c:\file.txt) do @echo
%a^|%b^|%c^|%d^|%e^|%f^|%g^|%h^|%i^|%l-%j-%k^|%m

1993-1|Water Quality|WVR|Yamhill, City of|Yamhill|Hamlin|Holt|Npv|
NPDES-Waste Discharge Limits|1993-7-6|0


Regular readers will remember the FOR loop represents the tokens by using sequential letters of alphabet. We have chosen %a to represent the first token, so %b will represent the second token, %j the 10th (month), %k the 11th (day), and %l the 12th (year). We recreate the original format by adding the "|" and "-" characters between the rearranged tokens. The problem is, if there is a "/" character in any of the text fields our results will be messed up. If we change "Water Quality" to be "Water Quality/Temp" we get these results.

C:\> for /F "tokens=1-14 delims=|/" %a in (c:\file.txt) do @echo
%a^|%b^|%c^|%d^|%e^|%f^|%g^|%h^|%i^|%l-%j-%k^|%m

1993-1|Water Quality|Temp|WVR|Yamhill, City of|Yamhill|Hamlin|Holt|Npv|
6-NPDES-Waste Discharge Limits-7|1993


We need a more robust solution that will only use the "/" character to split the date, but not the rest of the string. How do we do that? Well, if one FOR loop is good, then two must be better.

C:\> for /F "tokens=1-12 delims=|" %a in (c:\file.txt) do @for /F
"tokens=1-3 delims=/" %x in ('echo %j') do
@echo %a^|%b^|%c^|%d^|%e^|%f^|%g^|%h^|%i^|%z-%x-%y^|%k

1993-1|Water Quality|WVR|Yamhill, City of|Yamhill|Hamlin|Holt|Npv|
NPDES-Waste Discharge Limits|1993-7-6|0


The first FOR loop is used to split the string using the "|" character as the delimiter. The second FOR loop is used to split only the date field using the "/" character as a delimiter. The variables can be a little confusing so let's take a deeper look in to the second FOR loop.

..for /F "tokens=1-3 delims=/" %x in ('echo %j') do...


This FOR loop operates on the output of
echo %j
, which is the entire date field. Using the delims option we slice the date field using the "/" character as our delimiter. The iterator in this loop is %x and it will contain the first token (month). The second and third tokens are represented by %y (day) and %z (year). Finally, we glue it all back together in the order we like using the variables created by both FOR loops.

Some of you detail oriented folks may have noticed that I neglected one point, the month and day need a leading zero. I ignored this point because this tiny change makes things really ugly. We have to use our old friend "delayed environment variable expansion" which you can read about in episodes #48, #12, and #46. Since it has been covered so many times I'll skip some of the details for sake of brevity (ironic I know). Here is our final result:

C:\> cmd.exe /v:on /c "for /F "tokens=1-12 delims=^|" %a in (c:\file.txt) do
@for /F "tokens=1-3 delims=/" %x in ('echo %j') do @set month=0%x& @set day=0%y&
@echo %a^|%b^|%c^|%d^|%e^|%f^|%g^|%h^|%i^|%z-!month:~-2!-!day:~-2!^|%k"

1993-1|Water Quality|WVR|Yamhill, City of|Yamhill|Hamlin|Holt|Npv|
NPDES-Waste Discharge Limits|1993-07-06|0


That is a big mess and it may be difficult to see where and how the leading zero was added. Using the delayed variable expansion we set the month variable, with a leading zero, like this:

... set month=0%x& ...!month:~-2!....


The variable %x could contain 7 (July) or 11 (November). We set the variable, month, equal to the concatenation of zero and %x. The month variable would contain 07 (July) or 011 (November). Notice when the month variable is set there is no space between the variable (%x) and the "&" character. If we did leave a space then our month variable would contain a trailing space which would later have to be removed. When we echo the month variable we only want the two rightmost characters so July is displayed as 07 and November as 11. The same process is used for the day of the month.

Powershell:

Powershell gives us the ability to use regular expressions which makes everything much easier. We can reformat any date in our file using this command:

PS C:\> Get-Content file.txt | ForEach-Object { $_ -replace
'(\d{1,2})/(\d{1,2})/(\d{4})','$3-$1-$2' }

1993-1|Water Quality|WVR|Yamhill, City of|Yamhill|Hamlin|Holt|Npv|
NPDES-Waste Discharge Limits|1993-7-6|0


The Get-Content commandlet (alias gc) returns each line of the file given. Using the ForEach-Object (alias %) we operate on each line of the file. The "current pipeline object", represented as $_, contains the content of the current line in the file (in our example we only have one line in our file).

Our regular expression search and replace finds the month/day/year and rearranges it as year-month-day. Again we have the problem of adding that pesky leading zero so we need to use a slightly different command.

PS C:\> gc file.txt | % { $_ -replace '(\d{1,2})/(\d{1,2})/(\d{4})',
'$3-0$1-0$2' } | % { $_ -replace '(\d{4}-)\d?(\d{2}-)\d?(\d{2})','$1$2$3'}

1993-1|Water Quality|WVR|Yamhill, City of|Yamhill|Hamlin|Holt|Npv|
NPDES-Waste Discharge Limits|1993-07-06|0


We use two replace commands in order to add our leading zero. The first replace command adds a leading zero and rearranges our month/day/year, resulting in year-0month-0day. The second command removes the leading zeros if they are unnecessary.