Hello World, Utterly Overdone

erick @

I thought I knew everything there is to know about Hello World. It’s so simple, how couldn’t I? I was wrong.

Stick with me to the end and we’ll take hello world as deep as human readable software will let us go.

Hello World

Hello World is the canonical first program we all write. It takes 0 inputs, produces 1 output. Literally the simplest useful program that you can write. But over time I’d written it off as something that I had no more use for.

I’d overlooked that analyzing simple things can lead to deep insights. Complex topics are too difficult to analyze deeply, at least I’m not smart enough to do that.

But Hello World is perfect for this.

So what’s this post about? Well if I were ever to teach a programming course, this is how I would teach Hello World (HW from now on). As a series of questions:

  1. What is HW?
  2. What does HW do?
  3. How well did HW do what it did?
  4. Can HW do it better?

This post is my answers to these questions.

NB: Please know that I’m not a C programmer. And I’m certainly no assembler or gcc expert. If I’ve got something totally wrong then please let me know and correct me if I’m wrong (CMIIW).

As with my other work, feel free to dig into the details in my HW scratch pad.

Info For Newbies

Hello World is historically intended for newcomers. That’s exactly who this post is for. If you read from start to finish then there should be no trouble following along. But each section is built upon the previous. So as you read along, I’ll show my work:

  • Linux users - you should be able to follow along just fine.
  • Mac users - I’d assume some tools will just work, others will be frustratingly different.
  • Windows users - bash on windows might be your best bet to follow along, but I have no idea.

When I write something like:

$ ./dostuff
did stuff

That means I’m running the command ./dostuff in a bash terminal and it output did stuff. You should try to run ./dostuff too.

What is HW?

If you’ve written even 1 line of software, in any language, then it was probably ‘Hello World’. It’s exposes newbies to the syntax and runtime environment (more on this later) of a programming language.

Polyglot HW

I’m going to cover HW in 4 languages: C, Java, Ruby, and Javascript.

I’d like to say that I went through a careful evaluation process, selecting each languages for an important aspect… but really I just picked the first few that I thought of (I apologize to the functional language fans).

C

C is the Illuminati of programming languages. It’s the one we rarely think about anymore, but underneath it all, runs the whole world.

hello.c:

#include <stdio.h>

int main() {
  printf("hello world\n");
}

Let’s compile and run it, I’ll assume you already have gcc installed.

$ gcc hello.c -o hello
$ ./hello
hello world

info for newbies1:

  • man 3 printf
  • man 3 stdio
  • man gcc

Java

Java might be the most popular programming language on the planet. It’s certainly the enterprise heavy weight champion of the world.

Hello.java:

public class Hello {
  public static void main(String[] args) {
    System.out.println("hello world");
  }
}

Let’s compile and run it. I suggest running on OpenJDK, but it’s your machine so use whatever you want.

$ javac Hello.java
$ java Hello
hello world

info for newbies:

Ruby

Ruby is the language lover’s language. I find the syntax to be beautiful. It has interesting origins and is my choice language to prototype in.

hello.rb:

puts 'hello world'

I’m running on Ruby MRI (currently v2.3.3), the standard Ruby implementation.

Let’s run it.

$ ruby hello.rb
hello world

info for newbies:

Javascript

Javascript, IMO, is the most important language of the decade.

Javascript is also the first language where I’ll admit that HW is more complicated than it appears. Javascript’s standard library, unlike C, Java, and Ruby, is so sparse that there isn’t even a function to print a line of text… which is crazy!

It’s the job of the JS runtime environment to provide a library for writing text. In this case I’ll be running on nodejs to use the console.log function. This same code would also run in any browser, and would log to the browser console.

hello.js:

console.log('hello world');

Let’s run it:

$ node hello.js
hello world

info for newbies:

OK, so what?

So we wrote 4 different HWs. They all did the same thing. So what? Well, let’s try to see how they’re different.

What did HW do?

Each program uses the same mechanism to output the HW text to console. On Linux, the write system call. See man 2 write for more.

So then what’s the difference between HW in C, Java, Ruby, and Javascript?

Well there’s probably a 1000 ways to answer the question. I’ll try to compare the difference in execution time2.

Execution Time (ms)
0.73 C 54.23 Java 47.89 Ruby 62.00 Javascript

I’m not surprised that C is fastest. But is C really 100x faster? That doesn’t seem right.

I think a better question is, why are Java, Ruby, and JS so damn slow? If I print HW twice, then each program wouldn’t be twice as slow. So where does the extra overhead come from?

Startup -> Setup -> Write -> Teardown

In order to evaluate this question, it’s important to understand the difference between a programming language and its runtime environment. Breaking up the languages from their runtime environments makes it easy to reason that it’s not the Java language (or, more precisely, its execution) that’s slow, but everything that goes into starting up a Java program.

So how does each HW program run?

Well I’ve made up this arbitrary model to explain HW (I think it works well):

hello world execution flow

The yellow’ish box is where we’ll see the interesting differences, the blue’ish box is exactly the same for each program, in fact it’s just 1 function call. Maybe the white boxes would be interesting for another post.

More definitively:

program startup
execution of execve when running each HW program
runtime setup
any user and kernel space execution from the end of execve until the write
write ‘hello world’
the write system call that prints ‘hello world’ to screen
program teardown
everything that happens after write, until the program exits

So how do we measure [Startup -> Setup -> Write -> Teardown]?

Strace

Just like other interesting thing in life, ‘it depends’. Since I know a system call to execve starts each program, and another system call for write, then I think a reasonable way to investigate is to measure the system calls in the [Startup -> Setup -> Write -> Teardown] flow.

I used a tool called strace for this.

info for newbies:

This is the strace I captured for hello.c. It’s colored with the same scheme as the [Startup -> Setup -> Write -> Teardown] flow above. The blue’ish-green write is always a single line immediately following the yellow’ish setup sections.

$ strace -f -ttt ./hello

Click through the tabs under the trace to see what it looks like in the the other languages. Reading the details of each line isn’t so important, but it’s important get a sense of scale.

strace for: hello.c | Hello.java | hello.rb | hello.js

I don’t want to imply that this is always a good indication of efficiency, but the good thing about HW is that it’s so simple that simple investigations will work well.

Because of the simplicity I can easily draw out some interesting info:

Line length of strace
29 C 3079 Java 3154 Ruby 517 Javascript

The line lenghts for C, Java, and Ruby correlate well to their execution times. JS is the odd outlier.

Here’s a breakdown of the strace captures from all of the HW programs.

  # strace lines # processes % runtime setup % teardown
C 29 1 87% 3%
Java 3079 16 96% 3%
Ruby 3154 3 99% <1%
JS 517 7 96% 3%

Easy Insights:

  • C might be ~100x faster than the others because it’s doing ~100x less work
  • Java starts 15(!) additional child processes to bring up the JVM
  • Ruby appears to be surprisingly light weight compared to Java and JS (in terms of child processes)
  • Node.js spends a lot more time in user space than Ruby MRI & Java

How well did HW do it?

Now that I have an idea of the HW details, I want to know more about the “hello” part of HW. The write. How fast is it? Hint, the answer surprised me.

To simplify I’ll focus only on the C and Ruby HW programs for the rest of the post. They represent a reasonable cross section of languages, e.g.: static vs dynamic, compiled vs interpreted. They’re both imperative programming languages, but unfortunately I didn’t start out with a good functional language choice.

info for newbies:

timed_hello.c

To measure the time for each write I’ve written 2 new programs, beefier versions of HW. timed_hello.c and timed_hello.rb.

timed_hello.c:

#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main() {
  struct timespec t_start;
  struct timespec t_end;

  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t_start);

  printf("hello world\n");

  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t_end);

  printf("time to print: %.6f\n",
     ((double)t_end.tv_sec + t_end.tv_nsec / 1.0e9) -
     ((double)t_start.tv_sec + t_start.tv_nsec / 1.0e9));
}

timed_hello.c uses clock_gettime to get a nanosecond precision timestamp just before and after the call to printf. Then prints the time difference before exiting.

info for newbies:

  • man clock_gettime

timed_hello.rb

The Ruby variant does the same thing. Except it uses Ruby Time.now to get the current time. This can give nanosecond time resolution, but I’m only measuring up to microsecond accuracy.

timed_hello.rb:

# marker to find the code in strace
puts '>>>> before print >>>>>'

t_start = Time.now
puts 'hello world'
STDOUT.flush
t_end = Time.now

puts "time to print: %0.06f " % (t_end.to_f - t_start.to_f)

timed_hello.rb operates more in user space than timed_hello.c, so I need to add an extra marker in order to identify the start of the section in strace.

Comparisons

To compile timed_hello.c:

$ gcc timed_hello.c -o timed_hello
$ ./timed_hello

To execute them I used a small bash script. It runs the command 10 times. I take the lowest number as the optimal ‘time to print’. Simple methods for simple questions.

… drum roll …

  best time (μs)
C 42
Ruby 6

WTF!? Ruby is an order of magnitude faster to print than C, with all default GCC optimizations enabled!

That’s not a misprint. It’s not compiler optimizations fucking with the results. It’s not Ruby cheating and giving a false time.

This made me scratch my head for a while… so I took out my strace hammer and started to swing again.

Straces

These are the traces for timed_hello.c and timed_hello.rb.

excerpt taken from: strace -f -ttt ./timed_hello

...
1490703703.440017 clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=653652}) = 0
1490703703.440046 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
1490703703.440092 brk(NULL)             = 0xca3000
1490703703.440109 brk(0xcc4000)         = 0xcc4000
1490703703.440128 brk(NULL)             = 0xcc4000
1490703703.440147 write(1, "hello world\n", 12hello world
  ) = 12
1490703703.440168 clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=727646}) = 0
1490703703.440194 write(1, "time to print: 0.000074\n", 24time to print: 0.000074
) = 24
...

excerpt taken from: strace -f -ttt ruby timed_hello.rb

...
[pid 17660] 1490704226.726643 write(1, ">>>> before print >>>>>", 23>>>> before print >>>>>) = 23
[pid 17660] 1490704226.726668 write(1, "\n", 1
) = 1
[pid 17660] 1490704226.726695 write(1, "hello world", 11hello world) = 11
[pid 17660] 1490704226.726716 write(1, "\n", 1
  ) = 1
[pid 17660] 1490704226.726761 write(1, "time to print: 0.000046 ", 24time to print: 0.000046 ) = 24
[pid 17660] 1490704226.726783 write(1, "\n", 1
) = 1
...

The trace clearly shows that the C printf implementation first calls fstat on STDOUT and then allocates some memory before calling the write system call. As opposed to Ruby which only calls write twice, once for the ‘hello world’ string and once for the trailing newline.

Well this is an implemenation detail of glibc printf that I didn’t count on. To understand, I need to go deeper.

Ruby Source and C Assembly

I can’t let this mystery go. How is it that Ruby puts results in a direct call to write, but the equivalent C is nearly 10x slower?

Ruby MRI Source

I decided to find the Ruby source code that implements puts to confirm that this is actually the case. After digging through the MRI code for a while (this is only based on reading code, so please CMIIW) I found a recent version of io.c.

Knowing that Kernel#puts is defined as rb_io_puts I was able to dig up this code path:

rb_io_puts => rb_io_write => id_write

Where id_write seems to be an alias for the C write function. So this all seems to make sense.

rb_io_puts:

rb_io_puts(int argc, const VALUE *argv, VALUE out)
{
    int i;
    VALUE line;

    /* if no argument given, print newline. */
    if (argc == 0) {
    rb_io_write(out, rb_default_rs);
        return Qnil;
    }
    for (i=0; i<argc; i++) {
        if (RB_TYPE_P(argv[i], T_STRING)) {
            line = argv[i];
            goto string;
        }
        if (rb_exec_recursive(io_puts_ary, argv[i], out)) {
            continue;
        }
        line = rb_obj_as_string(argv[i]);
      string:
    rb_io_write(out, line);
        if (RSTRING_LEN(line) == 0 ||
            !rb_str_end_with_asciichar(line, '\n')) {
           rb_io_write(out, rb_default_rs);
        }
    }

    return Qnil;
}

rb_io_write:

rb_io_write(VALUE io, VALUE str)
{
    return rb_funcallv(io, id_write, 1, &str);
}

id_write:

id_write = rb_intern("write");
hello.c Assembly

So now that I’m reasonably confident that the Ruby trace is accurate, it’s time to look deeper at hello.c. After seeing the fstat call above in timed_hello.c I went back to look at my strace from hello.c and saw the same pattern. Seeing the assembly code will yield a better understanding.

$ gcc -S hello.c

hello.s:

    .file        "hello.c"
    .section        .rodata
.LC0:
    .string        "hello world"
    .text
    .globl        main
    .type        main, @function
main:
.LFB0:
    .cfi_startproc
    pushq        %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq        %rsp, %rbp
    .cfi_def_cfa_register 6
    movl        $.LC0, %edi
    call        puts
    movl        $0, %eax
    popq        %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE0:
    .size        main, .-main
    .ident        "GCC: (GNU) 6.3.1 20161221 (Red Hat 6.3.1-1)"
    .section        .note.GNU-stack,"",@progbits

This is all irrelevant. Except call puts.

This line call puts has me totally confused. My C program was implemented with printf, not puts.

Thankfully, way back in 2005, Peter Seiderer explained the whole mystery (thanks Peter).

The [assembler] output indicates that puts is called instead of the original printf from the source code. To verify this we take a look at the intermediate assembler listing. First with optimization off.

This is a GCC optimization. Which I verified by disabling all GCC optimizations:

$ gcc -fno-builtin -O0 -S hello.c -o hello_unopt.s

hello_unopt.s:

    ...
    call        printf
    ...

phewww… disaster averted! But actually this makes even less sense.

Remember how we got here? I’m investingating why C printf is slower than Ruby Kernel#puts. But a GCC optimization is replacing printf with puts. I verified the unopt’d compiled hello.c version too it is in fact marginally slower than C puts so no loss there.

So now it seems that both C puts and printf are 10x slower than Ruby Kernel#puts. Interesting.

In any case, what’s the deal with puts? Well on my laptop I’m using glibc 2.24, and I found this unofficial glibc repository on github. There puts is aliased to _IO_puts:

_IO_puts (const char *str)
{
  int result = EOF;
  _IO_size_t len = strlen (str);
  _IO_acquire_lock (_IO_stdout);

  if ((_IO_vtable_offset (_IO_stdout) != 0
       || _IO_fwide (_IO_stdout, -1) == -1)
      && _IO_sputn (_IO_stdout, str, len) == len
      && _IO_putc_unlocked ('\n', _IO_stdout) != EOF)
    result = MIN (INT_MAX, len + 1);

  _IO_release_lock (_IO_stdout);
  return result;
}

Reading the first part of that conditional seems to explain the extra fstat. I’m not sure which code path hello.c binary goes down, but I’m satisfied, I’ve seen enough of this. I’ll buy this explanation.

But what I’m not satisfied with is whether we can go faster?

Can HW do it better?

It seems pretty obvious to me that hello.c is not the simplest HW C implementation there is. There is a simpler one.

hello_write.c:

#include <unistd.h>

int main() {
  write(1, "hello world\n", 12);
}

When I verify this implementation with a new version of timed_write.c, effectively timed_hello.c from above with a call to write instead of printf, I see the same 6μs time that Ruby Kernel#puts reported. Looking at an strace I’m still happy with hello_write.c:

info for newbies:

  • Please notice, I’m trading maintainability for optimizations. Often optimized code is much lower level. Speed gains may come at the risk of bugs

excerpt taken from: strace -f -ttt ./hello_write

...
1490715358.029252 munmap(0x7f10b786e000, 148457) = 0
1490715358.029301 write(1, "hello world\n", 12) = 12
1490715358.029324 exit_group(0)         = ?
...

No more call to fstat.

Not Fast Enough

I REALLY want to stop here, but I’ve come too far to quit before making as fast a HW as I can reasonably think of.

Looking at the strace for hello_write is still 85% setup. It’s a waste of time. The set up for C is still loading libc and linking unused shared object libraries. What if I just wrote my own HW in GCC assembly?

After a bit of linux x86 assembly reasearch:

hello_gas32.S:

# system calls @ /usr/include/asm/unistd_32.h
# gcc -m32 -nostdlib hello.gas32.S -o hello_gas32
.data
msg:
    .string "hello world\n"

.text

.global _start

_start:
    movl    $1, %ebx    # set fd to stdout
    movl	$msg, %ecx  # set $msg to 2nd argument
    movl	$12, %edx	# set length to 12
    movl	$4, %eax    # system call write
    int	$0x80           # execute interrupt

    movl	$0, %ebx	# set exit code
    movl	$1, %eax	# system call exit
    int	$0x80           # execute interrupt

Compiling and running is relatively easy:

$ gcc -m32 -nostdlib hello.gas32.S -o hello_gas32
$ ./hello_gas32
hello world

And finally the strace is about as clean as it can get:

strace -f -ttt ./hello_gas32

446 μs

So what are the final runtime tallies for hello_gas32 vs hello_write vs hello?

execve time 739 ./hello execve time 623 ./hello_write execve time 446 ./hello_gas32
Execution Time (μs)

446 μs! This includes roughtly 300 μs for execve to run, which is a constant for all programs. Subtracting time for execve, then hello_gas32 is roughly 66% faster than hello.c.

  1. Newbies, there’s no shortcut to learning this stuff. RTFM is the mantra the most talented coders I’ve known to take to heart. It’s a mantra I try to live to as well. 

  2. For details on my implementation and measurements, see the work in my scratchpad