Friday, October 2, 2009

GNU make trick for handling long lists

If you compile a big Java project using Make, you might have discovered that there is a limit to the length of a command line.

The problem is that the Java virtual machine is so slow to start up, and it is so difficult to determine Java dependencies reliably, that it is advantageous to compile all of your Java files in a single invocation of javac. You also know that recursive make is considered harmful, so you want to create your big list of Java files by including module makefiles in the project directories [1]. So you naively do the following:

JAVA_SRC :=

# ...here include lots of module makefiles that add files to JAVA_SRC...

.PHONY: java_class_files
java_class_files: $(JAVA_SRC)
      $(JAVAC) $(JAVAC_FLAGS) $(JAVA_SRC)

But sooner or later this starts to fail, because the javac command line gets too long. How long is too long? On my Linux computer, the maximum length of a command line is about 128 kB, which (at, say, 40 characters per filename) comes out to about 3200 files. Sure, that's a lot of files, but our main project at work passed that number a couple of years ago. So what do you do?

javac has the nice feature that you can put the list of files to compile into a text file, and specify the name of this text file on the command line preceded by an "@" sign: javac @java-file-list. This sounds promising, so you try writing the list of Java source files to a file:

JAVA_SRC :=

# ...here include lots of module makefiles that add files to JAVA_SRC...

.PHONY: java_class_files
java_class_files: $(JAVA_SRC)
      @echo $JAVA_SRC >java-filelist
      $(JAVAC) $(JAVAC_FLAGS) @java-filelist

But that obviously won't work, because the "echo" command line is just as long as the old "javac" command line. Similarly, xargs(1) doesn't get you anywhere because there is no obvious way to get the list of filenames onto the standard input of the xargs command. It's a tough nut to crack.

(At about this time you wonder why you ever decided to use make on a Java project, but we'll leave that topic for another day.)

It turns out that there is a solution. Recent versions of GNU make provide just enough functionality to barely allow a general solution to this problem. It relies on a few of make's primitive functions:

  • $(if CONDITION,THEN-PART[,ELSE-PART]) — If CONDITION is a non-empty string, then expand to THEN-PART; otherwise expand to ELSE-PART.
  • $(call VARIABLE,PARAM1,PARAM2,...) — Expand the contents of VARIABLE, substituting the PARAM1 wherever $(1) appears, PARAM2 for $(2), etc.
  • $(words TEXT) — Return the number of whitespace-separated words in TEXT.
  • $(word N,TEXT) — Return the Nth whitespace-separated word of TEXT, or the empty string if there are fewer than N words in TEXT.
  • $(wordlist S,E,TEXT) — Return the list of words in TEXT with indices S through E.

With these tools and a little bit of recursion, we can write a make function that is the approximate equivalent of the unix xargs command:

# Macro for invoking a command on groups of 1000 words at a time
# (analogous to xargs(1)).  The macro invokes itself recursively
# until the list of words is depleted.
#
# Usage: $(call xargs,COMMAND,LIST)
#
# COMMAND should be a shell command to which the words will be
# appended as arguments in groups of 1000.
define xargs
$(1) $(wordlist 1,1000,$(2))
$(if $(word 1001,$(2)),$(call xargs,$(1),$(wordlist 1001,$(words $(2)),$(2))))
endef

How does this work? The first line invokes the command (i.e., $(1)) with the first 1000 words as arguments. The second line checks whether there is a word number 1001; if so, it invokes xargs recursively with words 1001 through the last word in the list. [2]

Now we use the new xargs function to define a function that writes all of the words from a list into a file:

# Macro for writing the contents of a word list to a file, one word
# per line.  This macro will work even for very long lists of words.
#
# Usage: $(call write-to-file,FILENAME,LIST)
define write-to-file
@: >$(1)
$(call xargs,@printf "%s\n" >>$(1),$(2))
endef

The first line uses the shell builtin ":" to create the new file (and erase any old contents that the file might have had). The second line uses xargs to invoke the shell builtin printf with the filenames as arguments, 1000 at a time, and appends the results to the new file. The formatting string passed to printf causes it to print a newline after each filename rather than, say, 1000 files on each line.

With these functions in hand, the Java files can be compiled as follows:

JAVA_SRC :=

# ...here include lots of module makefiles that add files to JAVA_SRC...

JAVA_FILE_LIST := java-file-list
.INTERMEDIATE: $(JAVA_FILE_LIST)
$(JAVA_FILE_LIST): $(JAVA_SRC)
       $(call write-to-file,$@,$(JAVA_SRC))

.PHONY: java_class_files
java_class_files: $(JAVA_FILE_LIST)
      $(JAVAC) $(JAVAC_FLAGS) @$(JAVA_FILE_LIST)

Well, as the Germans say, "nicht schön aber selten". But don't you get the feeling that this would all be a lot easier in a real programming language?

[1] If you want to compile every Java file within a set of subdirectories, then you can generate the list of files using the find(1) command. But I'm assuming that you want to have finer control over exactly which Java files are compiled in a particular build.

[2] It is possible to make the number "1000" another parameter to the xargs function, but it is ugly because the macro also needs the analogue of "1001". Instead, to get the list of words starting at 1001, you can first get the list of words starting at 1000, then get the sublist of that list starting at word number 2. Or, at the cost of starting another process, you could use shell facilities for doing arithmetic.

1 comment:

Alexander said...

The post is quite useful.

I had another problem. I had to create a "for" loop which operates with very long list of words inside Make rule.

$(foreach ...) is not a way because it "joins" all the words to one common line during the makefile preprocessing stage, and then executes this very long line inside a rule during rule execution.

To call Shell "for" also is not a way, because it leads to "execvp: argument list too long".

I searched for the solution for a long time and I almost didn't hope to find it, and already planned to write the line into file that makes the build procedure slower. But this post helps me.

Thank you very much !!

--
Alexander.