Java boilerplate

Posted December 5, 2006 by Alastair

I fixed a Wicket bug today for compressing whitespace in HTML. The wrinkle is that you need to avoid doing that inside <pre> tags, otherwise your code formatting goes all wrong.

You should probably just gzip your responses instead, as it’ll be much more efficient than this, but apparently, someone cares enough to raise a bug report, so I rolled up my sleeves. I mean, how hard can it be to strip out some whitespace?

You can do this fairly nicely in Perl or PHP or some other language that understands regexp callbacks. However, I wanted to do it in Java without the gnu-regexp library, and the neatest version I could come up with was this:

/**
 * Remove whitespace from raw markup
 * 
 * @param rawMarkup
 * @return rawMarkup with compressed whitespace.
 */
protected String compressWhitespace(String rawMarkup)
{
	// We don't want to compress whitespace inside <pre> tags, so we look
	// for matches and:
	//  - Do whitespace compression on everything before the first match.
	//  - Append the pre match with no compression.
	//  - Loop to find the next match.
	//  - Append with compression everything between the two matches.
	//  - Repeat until no match, then special-case the fragment after the
	//    last pre.
 
	Pattern preBlock = Pattern.compile("<pre>.*?</pre>", Pattern.DOTALL | Pattern.MULTILINE);
	Matcher m = preBlock.matcher(rawMarkup);
	int lastend = 0;
	StringBuffer sb = null;
	while (true)
	{
		boolean matched = m.find();
		String nonPre = matched
				? rawMarkup.substring(lastend, m.start())
				: rawMarkup.substring(lastend);
		nonPre = nonPre.replaceAll("[ \\t]+", " ");
		nonPre = nonPre.replaceAll("( ?[\\r\\n] ?)+", "\n");
 
		// Don't create a StringBuffer if we don't actually need one.
		// This optimises the trivial common case where there is no <pre>
		// tag at all down to just doing the replaceAlls above.
		if (lastend == 0)
		{
			if (matched)
			{
				sb = new StringBuffer(rawMarkup.length());
			}
			else
			{
				return nonPre;
			}
		}
		sb.append(nonPre);
		if (matched)
		{
			sb.append(m.group());
			lastend = m.end();
		}
		else
		{
			break;
		}
	}
	return sb.toString();
}

And something vaguely equivalent in Perl:

$_ = "text to compress";
s{(.*?)(<pre>.*?</pre>|\z)}{my ($text, $pre) = ($1, $2); $text =~ s/\s+/ /g; $text . $pre}seg;
print "$_";

Ugh. Java really sucks sometimes. :-(

It’s not that Java’s regular expressions are much less powerful than Perl’s. It’s that exposing everything as classes and methods, without such niceties as closures, makes it all so very verbose.
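For comparison, here is a minimal sketch of the same idea in callback style, assuming Java 9 or later, where `Matcher.replaceAll` accepts a function. This overload did not exist when the post was written. The class name `LambdaWhitespaceCompressor` is made up for illustration, and the pattern mirrors the Perl one: group 1 is the markup before a `<pre>` block, group 2 is the block itself (or nothing at end of input).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name; not part of Wicket.
class LambdaWhitespaceCompressor {
    // Group 1: markup before the next <pre> block (or before end of input).
    // Group 2: the <pre> block itself, or the empty string at end of input.
    private static final Pattern CHUNK =
            Pattern.compile("(.*?)(<pre>.*?</pre>|\\z)", Pattern.DOTALL);

    static String compressWhitespace(String rawMarkup) {
        return CHUNK.matcher(rawMarkup).replaceAll(match -> {
            // Same two substitutions as the original method.
            String nonPre = match.group(1)
                    .replaceAll("[ \\t]+", " ")
                    .replaceAll("( ?[\\r\\n] ?)+", "\n");
            // The callback's result is parsed as a replacement string, so
            // escape any '$' or '\' that happens to occur in the markup.
            return Matcher.quoteReplacement(nonPre + match.group(2));
        });
    }

    public static void main(String[] args) {
        System.out.println(
                compressWhitespace("<p>a   b</p>\n\n<pre>  x  </pre>  <p>c</p>"));
    }
}
```

The `quoteReplacement` call is easy to forget: without it, a literal `$` in the markup would be read as a group reference.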

Post Details

  • Post Title: Java boilerplate
  • Author: Alastair
  • Filed As: Java

3 Opinions have been expressed on “Java boilerplate”.

  1. Jonathan Locke commented:

    I like the Java code more than the Perl code here. It may be more verbose, but it’s clear and simple and it’s obvious that it does the job.

    In general, less is more, but that doesn’t necessarily mean fewer characters. Less complicated is an important value too.

  2. Jonathan Locke commented:

    I think it’s not Java that’s the problem, but Matcher. You need a place to put that callback logic so the use case looks like this (which is superior to either solution above):

    Replacer r = new Replacer(markup, "regexp") {
        protected String onMatch(String s) {
            return s;
        }
        protected String onNonMatching(String s) {
            return s.replaceAll(...);
        }
    };
    return r.replaceAll();

    It’s a little bit more work to write Replacer, but in the end it will pay off, as I bet this problem crops up more than once.

  3. Alastair commented:

    It’s probably easier to understand what the Java is trying to do, especially if you code Java for a living not Perl. However, it’s not that obvious to me that it doesn’t have off-by-one errors and the like. You don’t have to think about that in the Perl version, so I actually find it easier to fit the whole thing in my head and be sure it works just by looking at it with Perl.

    Perl regular expression usage is certainly more dense – you have to think much harder about what each line of code does. But is that necessarily bad? As Java coders with quite a simple language, we’re used to skimming over hundreds of lines of code that typically do little, as much of it is boilerplate. Is having to stop and stare at a single line of code for a minute or so to understand it necessarily worse than having to stare at one hundred lines of code that achieve the same thing, only broken out into more verbose steps? Does it take longer to do? I’m not sure it does – the Perl and Java are equally readable, in their own way. The only real difference to me is that with the Java one I spend a lot of the time implementing it typing, rather than thinking.
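The `Replacer` proposed in the second comment was never spelled out, so here is one way it might look. This is a sketch under assumptions: the constructor arguments and the `onMatch` / `onNonMatching` / `replaceAll` names come from the comment, while the choice of `Pattern.DOTALL`, making both hooks abstract, and the `ReplacerDemo` wrapper are guesses made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper in the spirit of the Replacer from the comments. */
abstract class Replacer {
    private final String input;
    private final Pattern pattern;

    Replacer(String input, String regex) {
        this.input = input;
        // DOTALL is an assumption, so that a match can span several lines.
        this.pattern = Pattern.compile(regex, Pattern.DOTALL);
    }

    /** Called with each stretch of text that matched the pattern. */
    protected abstract String onMatch(String s);

    /** Called with each stretch of text between matches (possibly empty). */
    protected abstract String onNonMatching(String s);

    String replaceAll() {
        Matcher m = pattern.matcher(input);
        StringBuilder out = new StringBuilder(input.length());
        int lastEnd = 0;
        while (m.find()) {
            out.append(onNonMatching(input.substring(lastEnd, m.start())));
            out.append(onMatch(m.group()));
            lastEnd = m.end();
        }
        // Trailing text after the last match (the whole input if none matched).
        out.append(onNonMatching(input.substring(lastEnd)));
        return out.toString();
    }
}

class ReplacerDemo {
    static String compressWhitespace(String markup) {
        Replacer r = new Replacer(markup, "<pre>.*?</pre>") {
            @Override
            protected String onMatch(String s) {
                return s; // leave <pre> blocks untouched
            }

            @Override
            protected String onNonMatching(String s) {
                return s.replaceAll("[ \\t]+", " ")
                        .replaceAll("( ?[\\r\\n] ?)+", "\n");
            }
        };
        return r.replaceAll();
    }

    public static void main(String[] args) {
        System.out.println(compressWhitespace("a  b<pre>  x  </pre>c \n d"));
    }
}
```

The index bookkeeping that the third comment worries about still exists, but it lives in one place (`Replacer.replaceAll`) instead of being rewritten at every call site.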
