Tuesday, September 1, 2009

Episode #58: That's Pretty Random

Hal has too much time on his hands:

So there I was trolling the CommandLineFu blog looking for an idea for this week's Episode, and I came across this posting by user "curiousstranger" with an idea for generating a random 8-digit number:

jot -s '' -r -n 8 0 9

As useful as the jot command is in this situation, it's not a common command on most Unix systems I encounter. So that got me thinking how I'd answer this question within the rules of our blog.

My first notion was to just use the special $RANDOM variable to spit out an 8-digit number. But $RANDOM only produces values up to 32K. I could use $RANDOM in a loop to produce the individual digits I need, just like curiousstranger does with the jot command:

for ((i=0; $i<8; i++)); do echo -n $(( $RANDOM % 10 )); done

Actually the fact that the leading digit might be zero and the lack of a trailing newline are bugging me, so let's change that to:

echo -n $(( $RANDOM % 9 + 1)); \
for ((i=0; $i<7; i++)); do echo -n $(( $RANDOM % 10 )); done; \
echo


Here's a different loop idea that uses repeated multiplication so that we do fewer iterations:

for (( i=1; i<10000000; i=$(($i * $RANDOM)) )); do :; done; \
echo $(( $i % 100000000 ))

Notice that we're multiplying a series of $RANDOM values together until we get at least an 8-digit value. Then we cut off the first 8 digits of the number we generated.

But these loops are all pretty ugly, and anyway I was feeling bad about hitting Ed with the "no loops" requirement back in Episode #56. So I started thinking about ways that I could accomplish this without a loop, and also about ways that I could generate sequences of any arbitrary length.

Now most modern Unix systems have a /dev/urandom device for generating pseudo-random data, but I needed to figure out a way to convert the binary data coming out of /dev/urandom into decimal numbers. And then it hit me: the "od" command. "od" stands for "octal dump", but it's really a very flexible binary data dumper that can output a wide variety of formats:

$ od -A n -N 16 -t u8 /dev/urandom
5073535022611155147 14542989994974172695

Here I'm reading the first 16 bytes ("-N 16") of data from /dev/urandom and formatting the output as 8-byte unsigned integers ("-t u8"). The "-A n" flag suppresses the byte offset markers that would normally appear in the first column.

Neat! All I need to do is remove the whitespace and chop off the first 8 digits, right?

$ od -A n -N 16 -t u8 /dev/urandom | sed 's/ //g' | cut -c1-8
13065760

Experimenting with this command a little bit, it looked to me like the distribution of random numbers wasn't really even: there seemed to be an awful lot of numbers starting with 1 coming up. Since the maximum 64-bit unsigned value is 18,446,744,073,709,551,615 this isn't too surprising-- roughly half the random numbers we generate are going to start with 1. You can confirm this with a little loop:

$ for ((i=0; $i<1000; i++)); do \
od -A n -N 16 -t u8 /dev/urandom | sed 's/ //g' | cut -c1; \
done | sort | uniq -c

512 1
65 2
61 3
53 4
63 5
62 6
64 7
61 8
59 9

Here I'm using od to generate the random digit strings but only pulling off the first digit. I do that 1000 times in a loop and then use "sort | uniq -c" to count the number of times each digit appears in the output. As you can see, the number 1 does indeed account for roughly half the output.

To fix this problem, I decided to simply throw away the first digit of each number. Since I still don't want any leading zeroes in my output, I'm also going to throw away any zeroes after the first digit. This just means a slightly more complicated sed expression:

$ od -A n -N 16 -t u8 /dev/urandom | sed -r 's/ +[0-9]0*//g' | cut -c1-8
28413748

If you run a test on this data, you'll find it yields a nice, even distribution of leading digits.

Now this method will give us a fairly long string of digits-- up to at least 30 digits or so. But what if we wanted a really long string of numbers? The answer is just to suck more data out of /dev/urandom:

$ od -A n -N 64 -t u8 /dev/urandom | sed -r 's/ +[0-9]0*//g'
6364786111067772577530871020346806310
5788133289679762529333695157109594
360158217882969288611779773921873084
16718162743276945583432311789153227

As you can see, od puts each 16 bytes of data on its own line. So to make this one continuous stream of digits we need to remove the newlines:

$ od -A n -N 64 -t u8 /dev/urandom | tr \\n ' ' | sed -r 's/ +[0-9]0*//g'
3646335685510840910494490828374553833127833877255807760836799932166902525984...

I have to admit that I was feeling pretty good about myself at this point. By changing the "-N" parameter we can suck any amount of data we need out of /dev/urandom and produce an arbitrarily long string of random digits.

Then I went back to the CommandLineFu blog and noticed the follow-up comment by user "bubo", who shows us a much simpler method:

head /dev/urandom | tr -dc 0-9 | cut -c1-8

In the tr command, "-c 0-9" means "take the compliment of the set 0-9"-- in other words, the set of everything that is not a digit. The "-d" option means delete all characters in this set from the output. So basically we're sucking data from /dev/urandom and throwing out everything that's not a digit. Much tighter than my od solution. Good call, bubo!

Now what's that I hear? Ah yes, that's Ed's soul making the sound of ultimate suffering. Don't worry, big guy, I'll let you use loops for this one. I'm pretty sure Windows doesn't offer anything close to our /dev/urandom solution.

Update from a Loyal Reader: Jeff Haemer, who apparently has even more time on his hands than I do, came up with yet another solution for this problem that uses only built-in shell math operators. You can read about his solution in this blog post. Mmmm, tasty!

Ed responds:
Wow, Hal... you really do have too much time on your hands, don't ya?

And, yes, the sound of ultimate suffering started about the time you dumped your loops and went, well, loopy with /dev/urandom. Still, it was cool stuff, and I'll take you up on your reprieve from the loop embargo.

I frequently worry about writing shell fu that uses random numbers for fear that some of my shell gyrations will lower the entropy of an already-questionable entropy source. If the distribution of pseudo-random digits isn't quite random enough, and I compound that by using it multiple times, my result will be less random. Thus, I wouldn't use these solutions for industrial-grade pseudo-randomness, but they'll pass for casual coin flippers.

As I discussed in Episode #49, you can generate a random digit between 0 and 32767 using the %random% variable, and then apply modulo arithmetic (via the % operator) to get it between 0 and the number you want. Thus, we can create a single random digit between 0 and 9 using:

C:\> set /a %random% % 10
5

Now, Hal wants 8 of these bad boys, which we can do with a FOR loop thusly:

C:\> cmd.exe /v:on /c "for /L %i in (1,1,8) do @set /a !random! % 10"
03133742

At first, I thought that this would be harder, because echo'ing output always adds an annoying extra Carriage Return / Linefeed (CRLF) at the end. To see what I mean, try running:

C:\> echo hello & echo hello
hello
hello

But, in my random digit generator, I don't actually echo the output here. I just use "set /a" to do the math, which does display its result *without the extra CRLF*. That's nice. Makes my command easier. If you need more insight into the cmd.exe /v:on stuff up front, check out Episode #46, which describes delayed variable expansion.

While my solution above is simple and readily extensible to more digits, it does have some problems. The biggest problem is that it just displays the number, but doesn't set it in a variable. In fact, my loop never knows what the full random number it generates is, as it just steps through each unrelated digit.

A solution that actually puts the number in a variable is more useful for us, because then we can do further math with our random number, strip off leading zero digits, and use it in a script.

I can think of a whole bunch of ways to put our results in a variable for further processing. One that maintains our existing logic is:

C:\> for /f %j in ('cmd.exe /v:on /c "for /L %i in (1,1,8) do 
@set /a !random! % 10"') do @echo %j


Here, I'm just using a for /f loop to extract the output of our previous command into the iterator variable %j.

Another rather different approach that puts our value into a variable involves generating each digit and appending it to a string:

C:\> cmd.exe /v:on /c "set value= & (for /L %i in (1,1,8) do 
@set /a digit = !random! % 10 > nul & set value=!digit!!value! & echo !value!)"
0
90
490
5490
55490
055490
1055490
71055490

Here, after turning on delayed variable expansion, I clear the variable called "value" by simply setting it to nothing with "set value=". Then, I generate each digit using %random% % 10. Finally, I simply append my value to the new current digit appended with the old value, building my string digit by digit. For fun, I echo it out at each iteration so we can see the number being built. I put an extra set of parens () around my FOR loop to indicate where it ends, because after that close paren, you can put in more logic.

The nice thing about this approach is that you can now use !value! in follow-up logic and commands to do stuff with it. Stuff like what? Well, how about this?

C:\> cmd.exe /v:on /c "set value= & (for /L %i in (1,1,8) do 
@set /a digit = !random! % 10 > nul & set value=!digit!!value!
& echo !value!) & echo. & echo. & set /a !value! + 10"
1
61
961
0961
90961
590961
8590961
78590961


78590971

Good! So, we can gen numbers and then do math with them. Uh... but we have a problem. Let's try running this again to see what might happen:

C:\> cmd.exe /v:on /c "set value= & (for /L %i in (1,1,8) do 
@set /a digit = !random! % 10 > nul & set value=!digit!!value!
& echo !value!) & echo. & echo. & set /a !value! + 10"
5
45
445
3445
33445
233445
0233445
00233445


79663

What? My sum at the end is waaaaay wrong. What happened? Well, as we discussed in Episode #25 on shell math, if you use "set /a" to do math on a number, and your number has a leading zero, the shell interprets it as frickin' octal. Not just octal, my friends, but frickin' octal. What's the difference? Octal is a fun and nice way to refer to numbers. Frickin' octal, on the other hand, happens when your shell starts using octal and you don't want it to.

So, what can we do here? Well, we can use a variant of the kludge I devised in Episode #56 for dealing with leading zeros. We can put a digit in front of them, and then simply subtract that digit at the end. Here goes:

C:\> cmd.exe /v:on /c "set value= & (for /L %i in (1,1,8) do
@set /a digi
t = !random! % 10 > nul & set value=!digit!!value! & echo !value!)
& echo. & ech
o. & echo !value! & set interim=1!value! & echo !interim!
& set /a result=!inter
im! - 100000000"
6
86
986
8986
58986
058986
9058986
09058986


09058986
109058986
9058986

I build my random number as before. Then, I build an interim result using string operations to prepend a 1 in front of it. Then, I subtract 100000000 at the end. Now, the variable called result has what I'm looking for. This approach also has the interest side-effect of removing leading zeros from my random number because of the math that I do. Also, thankfully, we can add and remove 100000000 without tripping past our signed 32-bit integer limit here of around 2 billion. If you need more than 8 digits in your random number, I suggest you start appending them together.

By the way, I included many echo's of my output to help make this more understandable. I've cut those down in the following command.

C:\> cmd.exe /v:on /c "set value= & (for /L %i in (1,1,8) do 
@set /a digit = !random! % 10 > nul & set value=!digit!!value!)
& set interim=1!value! >nul & set /a result=!interim! - 100000000"
76019162

And that one is my final answer, Regis.

Addendum from Ed: In my SANS class yesterday, an attendee asked how he could fill the screen with Matrix-like spew. Based on this episode, I came up with:

C:\> cmd.exe /v:on /c "for /L %i in (1,0,2) do @set /a !random!"