
Converting Object-Oriented Code to Corinna






Introduction

This post is mainly my findings on finally porting some real code to Corinna, but I confess I was concerned when I first started.

I recently tried to convert a bless object hierarchy to Corinna and I failed, badly. The code worked, but the design was a mess. As the lead designer of Corinna, after years of effort, I found that Corinna was flawed. After hours of trying to write and rewrite the explanation of the flaw, I found the flaw: me.

Corinna is not flawed. I was thinking in terms of blessed hashes, not Corinna. When I realized what was going on, my thoughts shifted. I started to see the old code was riddled with flaws. Admittedly, the code I was porting was something I designed (*cough*) two decades ago, but it was still humbling.

Programming in the Wrong Language

It used to be that plenty of C developers flocked to the Perl language because it’s so quick-n-easy to whip up a prototype and test out an idea. Of course, experienced Perl programmers would point out that the C programmers were writing Perl as if it was C, not Perl. They’d look at this:

for ( my $i = 0; $i <= @array; $i++ ) {
    my $element = $array[$i];
    ...
}

By the way, did you spot the bug?

And they’d say, “just use the elements directly”:

for my $element (@array) {
    ...
}

That’s easier to read, more likely to be correct, and faster.
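
For anyone who did not spot the bug, here is a small sketch that runs both loops over the same data. The variable names (@c_style, @perl_style) are made up for this illustration.

```perl
use strict;
use warnings;

my @array = qw(a b c);

# C-style loop with the buggy condition: `<=` lets $i reach
# scalar(@array), which is one index past the last element.
my @c_style;
for ( my $i = 0; $i <= @array; $i++ ) {
    push @c_style, $array[$i];    # the final read returns undef
}

# Iterating over the elements directly cannot run off the end.
my @perl_style;
for my $element (@array) {
    push @perl_style, $element;
}

printf "C-style loop visited %d items, foreach visited %d\n",
    scalar @c_style, scalar @perl_style;
```

The C-style loop visits one item too many because @array in numeric context is the element count, not the last index.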

In the 90s, I didn’t have that problem when I started with Perl. At the time, it had been over a decade since I had done any serious C hacking, so I didn’t write my Perl like C. I wrote it like COBOL. Seriously! I’d declare all of my variables at the top of the program, as if I were setting up my WORKING-STORAGE SECTION. Then I’d set the appropriate variables and call the procedure, er, subroutine I needed, though I at least returned data instead of setting a global.

Needless to say, my early days in writing Perl were pretty awful, but it was still a hell of a lot easier than COBOL.

Today, many Perl developers are excited about the Corinna project, for which I have written a short tutorial. Just as one should stop thinking in C or COBOL when writing Perl, one should stop thinking in terms of using bless when writing Corinna. If that seems like it’s not too hard, I can assure you, many will stumble as I did.

I Failed at Corinna

My HTML::TokeParser::Simple module is modestly popular. There are over 30 distributions which depend on it and I frequently see it used by clients I assist. Given that it’s been around for over two decades with no bugs reported, I’m pretty happy with that module. So when Paul Evans created a PR for a subset of Corinna, I thought it was time to port something. Instead of mocked-up code in the RFC or my experiments with Object::Pad, I was going to write real Corinna code.

I had been following along and giving Paul some feedback on development work. I found a couple of bugs (not many, which is impressive), but now it was time to really push things. Hell, I’m the lead designer of the Corinna project, which grew out of my original 2019 (mis)design and the research and work I had done before that. I’m pretty comfortable saying that if anyone can port something to Corinna, I am that guy.

And that’s when I discovered that I’m Prax, not Amos. I am not that guy. (If you haven’t watched or read The Expanse, you’re missing out. Trust me.)

What I Was Trying to Do

The point of this module is that HTML::TokeParser could parse HTML into a stream of tokens which look like this:

["S",  $tag,    \%attr, \@attrseq, $text]
["E",  $tag,    $text]
["T",  $text,   $is_data]
["PI", $token0, $text]
["C",  $text]
["D",  $text]

The code I wrote using that was constantly breaking, so I blessed those and put an OO wrapper on them so that $token->as_is (renamed to $token->to_string in my new code) always returned what it was supposed to return, instead of using $token->[4] (start tag), $token->[2] (end tag), $token->[1] (text), and so on. You can’t even use $token->[-1] to read the last item thanks to the T token (text) ironically being the token which had the plaintext in an unpredictable position.

That’s much easier than using HTML::TokeParser directly. I did this by calling bless with the array references and blessing them into appropriate classes. This meant the array reference was still available and HTML::TokeParser::Simple was a drop-in replacement for the original module. You could switch from HTML::TokeParser to HTML::TokeParser::Simple with no other changes in your code. You then gradually converted array reference lookups to method calls. I was doing a lot of web scraping in the pre-API days of the Web and this saved me much grief.
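
A rough sketch of that bless-the-array-reference approach follows. It is not the module’s actual source: the Demo::* package names, the %class_for table, and wrap_token() are hypothetical, and only three of the six token types are covered.

```perl
use strict;
use warnings;

# One tiny class per token type; each knows where its text lives.
package Demo::Token::Tag::Start { sub to_string { $_[0]->[4] } }
package Demo::Token::Tag::End   { sub to_string { $_[0]->[2] } }
package Demo::Token::Text       { sub to_string { $_[0]->[1] } }

package main;

# Map the type marker in slot 0 to a class (hypothetical table).
my %class_for = (
    S => 'Demo::Token::Tag::Start',
    E => 'Demo::Token::Tag::End',
    T => 'Demo::Token::Text',
);

sub wrap_token {
    my ($token) = @_;
    my $class = $class_for{ $token->[0] }
        or die "Unknown token type '$token->[0]'";
    return bless $token, $class;    # still an array reference underneath
}

my @tokens = map { wrap_token($_) } (
    [ 'S', 'p', {}, [], '<p>' ],
    [ 'T', 'Hello', '' ],
    [ 'E', 'p', '</p>' ],
);

my $html = join '', map { $_->to_string } @tokens;
print "$html\n";               # method-based access
print "$tokens[1][1]\n";       # old array-based access still works
```

Because the blessed token is still an array reference, existing code that indexes into it keeps working while new code moves to method calls.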

So when I started designing HTML::TokeParser::Corinna, I hit my first snag.

Since Corinna is designed to be encapsulated, you can’t call $token->[1]. No “reaching inside” the object is allowed. But that’s fine! Since HTML::TokeParser::Corinna is a new module, I can create any interface I want. That’s when I hit my next problem.

For each of the array reference types listed above, I have a corresponding class:

  • HTML::TokeParser::Corinna::Token::Tag::Start
  • HTML::TokeParser::Corinna::Token::Tag::End
  • HTML::TokeParser::Corinna::Token::Text
  • HTML::TokeParser::Corinna::Token::Comment
  • HTML::TokeParser::Corinna::Token::Declaration
  • HTML::TokeParser::Corinna::Token::ProcessInstruction

There are some common behaviors there and since we don’t yet have roles for Corinna, I used abstract base classes. (I’ll shorten the HTML::TokeParser::Corinna namespace prefix to HTC to make it easier to read):

  • HTC::Token
  • HTC::Token::Tag :isa(HTC::Token)

I can instantiate a corresponding class like this, with all constructors having the same interface:

my $end_tag = HTC::Token::Tag::End->new(
    token => $token
);

Since HTC::Token is the base class for everything, I have this:

class HTC::Token {
    field $token :param;
    method _get_token () {$token}
    ...
}

It also has the methods common to all token classes.

Subclasses look like this:

class HTC::Token::Tag :isa(HTC::Token) {
    ...
}
class HTC::Token::Tag::Start :isa(HTC::Token::Tag) {
    ...
}

Even ignoring the fact that my objects were mutable, my code is flawed. The “Tag” classes need to be able to access the $token from the parent class. I have no way to do that, so I have a _get_token method. Untrusted code can call $token->_get_token and change the array reference in unexpected ways. That kills one of the major points of Corinna, but I’ve no easy way of sharing that data otherwise.
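
To make the leak concrete, here is a minimal sketch, assuming perl 5.38 or later with the experimental class feature enabled. Demo::Token is a hypothetical stand-in for the real base class and models only a text token.

```perl
use v5.38;
use feature 'class';
no warnings 'experimental::class';

# Hypothetical, cut-down stand-in for the HTC::Token base class.
class Demo::Token {
    field $token :param;

    method _get_token () { $token }         # meant for subclasses only
    method to_string ()  { $token->[1] }    # text tokens keep text in slot 1
}

my $text = Demo::Token->new( token => [ 'T', 'Hello', '' ] );

# Nothing stops outside code from calling the "private" accessor
# and changing the object's internal state through the reference.
$text->_get_token->[1] = 'Goodbye';

say $text->to_string;
```

The leading underscore is only a naming convention, so the field is encapsulated in name but its contents are not.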

Realizing I could not fix this was my crushing blow, leading me to naïvely believe Corinna was flawed. What follows is how I worked through the issue, but it took longer for me to have clarity than what is shown here.

How I Fixed It

One way of handling this is the following:

class HTC::Token {
    field $token :param;
    # clone() is assumed to be a deep-copy helper, such as the one in the Clone module
    method _get_token () {clone($token)}
    ...
}

But that still leaves _get_token() callable outside the class and it’s now part of the interface. It becomes an implementation detail I don’t have the freedom to change (classes should be open for extension, not modification). It’s part of the class contract and should not be violated.

Corinna doesn’t have a clean way of handling this case, but it’s not a flaw. It’s a limitation and one we can easily fix. Adding a :trusted attribute to methods would make this much easier, but that’s still an open discussion.

A trusted method, whether provided by an abstract base class or a role, should propagate to the first non-abstract subclass and become a private method in that class. If it’s defined directly in a concrete (non-abstract) class, then the first concrete class which inherits it gains it as a private method.

This isn’t quite how trusted methods work in other languages, but that’s OK. Perl is not like other languages and we have to adapt.

Lacking trusted methods, I cut-n-pasted the field $token :param; line into each of my concrete classes. That solved the problem, but if they each took multiple parameters, having to sync them across multiple classes would be fragile. A single role (or base class) providing those as :trusted would make this issue go away.
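
The cut-n-paste workaround looks roughly like this. It is a sketch under assumptions: perl 5.38 or later with the experimental class feature, hypothetical Demo::* class names, and only two token types.

```perl
use v5.38;
use feature 'class';
no warnings 'experimental::class';

# The base class keeps only shared behavior; it no longer holds the token.
class Demo::Token {
    method is_tag ()  { 0 }
    method is_text () { 0 }
}

# Each concrete class repeats the same field declaration.
class Demo::Token::Text :isa(Demo::Token) {
    field $token :param;
    method is_text ()   { 1 }
    method to_string () { $token->[1] }
}

class Demo::Token::Tag::End :isa(Demo::Token) {
    field $token :param;
    method is_tag ()    { 1 }
    method to_string () { $token->[2] }
}

my @tokens = (
    Demo::Token::Text->new( token => [ 'T', 'Hello', '' ] ),
    Demo::Token::Tag::End->new( token => [ 'E', 'p', '</p>' ] ),
);

say join '', map { $_->to_string } @tokens;
say $tokens[0]->can('_get_token') ? 'accessor exposed' : 'no accessor exposed';
```

Each class reads its own $token directly, so no accessor has to be exposed, at the cost of repeating the declaration.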

So, bullet dodged. Corinna isn’t irrevocably broken, but it did give me a bit of a scare at first. However, it also pleased me. “No plan survives first contact with the enemy,” but I confess I had a small concern. Despite years of research and design, maybe we had missed something critical. Finding only a tiny limitation has been a relief (though whether this holds remains to be seen).

Types (or the lack thereof)

This next part isn’t about a limitation of Corinna, but of my not understanding object-oriented design when I first wrote HTML::TokeParser::Simple. This is related to type theory.

A type system is nothing more than a way of evaluating expressions to ensure they do not produce unwanted behavior. Perl’s primary type system is based on data structures, not data types. For example, you can’t access an array element the way you access a hash element (though Perl being Perl, there are ways you can change that, too). But what is two plus two? We know that the answer is four, yet the computer often needs help. Let’s look at the following:

use strict;
use warnings;
use feature 'say';
my @array = qw(10 11 12);
my $var   = \@array;
say 2 + 2;                  # Int + Int
say 2 + "2";                # Int + String
say "2 pears" + "2 weeks";  # Str + Str
say 2 + @array;             # Int + Int!
say 2 + $var;               # Int + Ref

That prints something like:

4
4
Argument "2 weeks" isn't numeric in addition (+) at two.pl line 8.
Argument "2 pears" isn't numeric in addition (+) at two.pl line 8.
4
5
5201037554

For many languages, only 2 + 2 would be valid in the above. Perl is heavily optimized for text manipulation, so if you’re reading in a bunch of data from a text file, you can often treat numeric strings as numbers. Thus, 2 + "2" is 4. The ASCII value of "2" is 50, but Perl understands what you mean and casts the string as an integer instead of calculating 2 + 50.

The "2 pears" + "2 weeks" is clearly nonsense, but at least you get a warning.

2 + @array surprises many people new to Perl, but it’s evaluating @array in scalar context. Since it has three elements, this reduces to 2 + 3, printing 5. I know several programmers who write this as 2 + scalar @array to be explicit about the intent.

But what’s with that 5201037554 in the output? Your number will vary if you run this code, but what’s happening is that $var, in the expression 2 + $var, evaluates in numeric context to the memory address of the array it refers to. You don’t even get a warning. This is useless (no pointer math in Perl) and yes, I’ve been bitten by this in production code.

For many languages this expression would prevent your program from compiling, but Perl is Perl. For the poor maintenance programmer seeing my $result = $var1 + $var2; buried deep in the code, it may not be immediately obvious there’s an issue.

So this gets us back to a fundamental question: what is a type? A type is nothing more than:

  1. A name for the type
  2. A set of allowed values for that type
  3. A set of operations allowed to be called on that type

If we think of an integer as an object and addition as a method, let’s play with some pseudocode and pretend we have multimethods and a way of declaring data types.

class Int {
    field $int :param :isa(Int);

    multimethod add ($value :isa(Int)) {
        return $int + $value;
    }
    multimethod add ($value :isa(Str) :coerce(Int)) {
        return $int + $value;
    }
}

my $int = Int->new( int => 2 );
say $int->add(3);    # 5
say $int->add("4");  # 6

# runtime error because we can't coerce
say $int->add("4 apples");

# problematic because arrays flatten to lists and
# an array with one element will work here, but
# zero or two or more elements are fatal
say $int->add(@array);

# fatal because there is no multimethod dispatch target
say $int->add(\@array);

In the above, we simply don’t provide methods for behaviors we don’t want. Yes, the developer may very well have to check that they’re not passing in bad data, but this is not a bad thing. At their core, objects are experts about a problem domain and you need to take care to get them right.
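
Perl has neither multimethods nor declared data types today, so the pseudocode cannot run as written. A rough runnable approximation, assuming perl 5.38 or later with the experimental class feature, is to validate arguments by hand. Demo::Int and its _is_int helper are hypothetical names, and the integer check is a deliberately simple regex.

```perl
use v5.38;
use feature 'class';
no warnings 'experimental::class';

class Demo::Int {
    field $int :param;

    # Plain helper subroutine (not a method): true only for defined,
    # non-reference scalars that look like a whole number.
    sub _is_int ($value) {
        return defined($value) && !ref($value) && $value =~ /\A-?[0-9]+\z/;
    }

    ADJUST {
        die "Demo::Int needs an integer\n" unless _is_int($int);
    }

    # Stands in for the two add() multimethods: exactly one argument,
    # which must be an integer or a string that coerces cleanly to one.
    method add (@args) {
        die "add() takes exactly one argument\n" unless @args == 1;
        die "add() cannot use '$args[0]'\n"      unless _is_int( $args[0] );
        return $int + $args[0];
    }
}

my $two = Demo::Int->new( int => 2 );
say $two->add(3);
say $two->add("4");

# A non-numeric string, a flattened three-element array, and a
# reference are all refused instead of producing nonsense.
my @array = qw(10 11 12);
for my $args ( ["4 apples"], [@array], [ \@array ] ) {
    eval { $two->add(@$args); 1 } or print "rejected: $@";
}
```

The object refuses everything it has no sensible behavior for, which is the effect the missing multimethod targets have in the pseudocode.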

This also fits with the principle that we want to minimize our interfaces as much as possible. The more methods you expose, the more methods you have to maintain. If you later need to change those methods, you may break existing code. So let’s look at my abstract HTC::Token base class, a more-or-less straight port of the original code:

class HTML::TokeParser::Corinna::Token {
    field $token :param;

    method to_string              { $token->[1] }
    method _get_token             {$token}
    method is_tag                 {false}
    method is_start_tag           {false}
    method is_end_tag             {false}
    method is_text                {false}
    method is_comment             {false}
    method is_declaration         {false}
    method is_pi                  {false}
    method is_process_instruction {false}
    method rewrite_tag            { }
    method delete_attr            { }
    method set_attr               { }
    method tag                    { }
    method attr (@)               { {} }
    method attrseq                { [] }
    method token0                 { }
}

That ensures that every class has a stub of every method available to it, so you won’t get a “method not found” error. But what if you have a token representing text in your HTML? Why on earth would you want to call $token->rewrite_tag if it’s not a tag? It’s like the above example of adding an integer to a reference: you can do it, but it’s not helpful.

What is helpful is knowing what kind of token you have. So my base class is now:

class HTML::TokeParser::Corinna::Token {
    method is_tag                 {false}
    method is_start_tag           {false}
    method is_end_tag             {false}
    method is_text                {false}
    method is_comment             {false}
    method is_declaration         {false}
    method is_pi                  {false}
    method is_process_instruction {false}
}

This is cleaner and easier to maintain. In fact, I could delete this class, but those predicate methods are much easier to use.

if ( $token isa 'HTC::Token::Tag::Start' ) { ... }
# versus
if ( $token->is_start_tag ) { ... }

I’ve also taken the trouble to make all tokens immutable. We generally want immutable objects, but in reality, sometimes it’s cumbersome. If you want to replace the class newz with news everywhere, here’s what you do:

my $parser = HTC->new( file => $file );
while ( my $token = $parser->next ) {
    if (   $token->is_start_tag
        && $token->attr('class') eq 'newz'
    ) {
        $token = $token->set_attrs( class => 'news' );
    }
    print $token->to_string;
}

The mutators such as set_attrs now return a new instance instead of mutating the token directly. That makes it safer because you don’t worry about unrelated code mutating your data. For example, if you call $object->munge(3), you never worry that the value of 3 has suddenly changed in your code. However, $object->munge($other_object) offers no such guarantee.

In the code snippet above, however, always having to remember to assign the return value feels, well, clumsy. In fact, if you call set_attrs in void context (i.e., you don’t assign the return value to anything), the code will throw an HTML::TokeParser::Corinna::Exception::VoidContext exception (yes, it now has true exceptions, but they’re part of this module, not part of Corinna).
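
One way such a mutator could be written is sketched below, assuming perl 5.38 or later with the experimental class feature. Demo::StartTag is a hypothetical class, the void-context check uses Perl’s built-in wantarray, and a plain die stands in for the module’s exception class.

```perl
use v5.38;
use feature 'class';
no warnings 'experimental::class';

class Demo::StartTag {
    field $tag   :param;
    field $attrs :param = {};

    method attr ($name) { $attrs->{$name} }

    # Returns a modified copy and leaves the original untouched.
    method set_attrs (%new) {
        # wantarray is undef when the caller discards the return value.
        die "set_attrs() called in void context\n"
            unless defined wantarray;
        return ref($self)->new(
            tag   => $tag,
            attrs => { %$attrs, %new },    # fresh hash, new values win
        );
    }
}

my $old = Demo::StartTag->new( tag => 'div', attrs => { class => 'newz' } );
my $new = $old->set_attrs( class => 'news' );

say $old->attr('class');    # the original is unchanged
say $new->attr('class');    # the copy carries the new value

# Forgetting to keep the result is an error, not a silent no-op.
eval { $old->set_attrs( class => 'oops' ); 1 }
    or print "Caught: $@";
```

Dying in void context turns a forgotten assignment into an immediate failure.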

So my interfaces are smaller and we no longer provide useless, potentially confusing methods. I think that’s a win.

Exceptions

As a final note, I was proud of an odd little trick. I wanted to use exceptions as much as possible. They fix a very common bug in production code. If someone calls die or croak, you often see code like this:

try {
    some_code();
}
catch ($e) {
    if ( $e =~ /connection gone away/ ) {
        # retry connection or rethrow exception
    }
    die $e;
};

Now if some maintenance programmer changes the error message to connection interrupted, all code that matches on the previous message breaks. But if they throw an exception of an Exception::Connection::Lost class, the code can check the class of the exception and the developers are free to change the actual error message any way they like.
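
Checking the class instead of the message might look like the following sketch. It uses plain blessed-hash exceptions and eval so that it runs on older perls. The Demo::Exception::Connection::Lost class and fetch_page() are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical exception class: callers match on this name,
# never on the wording of the message.
package Demo::Exception::Connection::Lost {
    sub new     { my ( $class, %args ) = @_; return bless {%args}, $class }
    sub throw   { my ( $class, @args ) = @_; die $class->new(@args) }
    sub message { $_[0]{message} }
}

package main;
use Scalar::Util 'blessed';

# Stand-in for code that loses its connection; the wording is free to change.
sub fetch_page {
    Demo::Exception::Connection::Lost->throw(
        message => 'connection interrupted',
    );
}

my $action = 'none';
my $ok     = eval { fetch_page(); 1 };
if ( !$ok ) {
    my $error = $@;
    if ( blessed($error)
        && $error->isa('Demo::Exception::Connection::Lost') )
    {
        $action = 'retry';    # decided by class, not by regex
    }
    else {
        die $error;           # anything else is rethrown
    }
}

print "Action: $action\n";
```

Rewording the message in fetch_page() leaves the handler untouched.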

So here’s my exception base class:

class HTML::TokeParser::Corinna::Exception {
  use overload '""' => 'to_string', fallback => 1;
  use Devel::StackTrace;

  field $message :param = undef;
  field $trace = Devel::StackTrace->new->as_string;

  method error ()       {"An unexpected error occurred"}
  method message ()     {$message}
  method stack_trace () {$trace}

  method to_string {
    # error() can be overridden
    my $error = $self->error;
    # but $message is universal
    if ($message) {
      $error .= "\n$message";
    }
    return "Error: $error\n\nStack Trace:\n$trace";
  }
}

Because stringification is overloaded, you can print the exception or check it with a regex. Because it’s an object, you can also check the class of the exception to decide what to do next.

I used the above to provide a MethodNotFound exception:

class HTC::Exception::MethodNotFound
:isa(HTC::Exception) {
  field $method :param;
  field $class  :param;

  method error ()  {
    "No such method '$method' for class '$class'"
  }
  method method () {$method}
  method class ()  {$class}
}

And in my base class, I have this:

method AUTOLOAD {
  our $AUTOLOAD;
  my ( $class, $method )
    = ( $AUTOLOAD =~ /^(.*)::(.*)$/ );
  return if $method eq 'DESTROY';
  throw(
    'MethodNotFound',
    method => $method,
    class  => $class,
  );
}

And now, $token->no_such_method throws an exception instead of causing you to die inside.
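
The same AUTOLOAD pattern can be tried on perls without the class feature. The sketch below uses plain packages with hypothetical Demo::* names, and it calls die directly because the throw() helper is not shown in this post.

```perl
use strict;
use warnings;

package Demo::Exception::MethodNotFound {
    # Stringify to the error text so the exception can be printed.
    use overload '""' => sub { $_[0]->error }, fallback => 1;

    sub new    { my ( $class, %args ) = @_; return bless {%args}, $class }
    sub error  { "No such method '$_[0]{method}' for class '$_[0]{class}'" }
    sub method { $_[0]{method} }
    sub class  { $_[0]{class} }
}

package Demo::Token {
    our $AUTOLOAD;

    sub new     { return bless {}, shift }
    sub is_text { 0 }

    # Called for any method Perl cannot find in this class.
    sub AUTOLOAD {
        my ( $class, $method ) = ( $AUTOLOAD =~ /^(.*)::(.*)$/ );
        return if $method eq 'DESTROY';    # object destruction is not an error
        die Demo::Exception::MethodNotFound->new(
            method => $method,
            class  => $class,
        );
    }
}

package main;

my $token = Demo::Token->new;
my $ok    = eval { $token->no_such_method; 1 };
my $error = $@;

print "$error\n";
```

The caller gets an object it can inspect with $error->method and $error->class, and overloading still lets it print as text.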

Conclusion

The hours of writing and rewriting I described earlier covered much more than what I’ve discussed here, but I wanted to keep this short. Of course, I threw in a few other things I noticed along the way.

The encapsulation violation seemed to break the main strength of Corinna, but spending a few hours porting a class hierarchy quickly exposed the design limitation and a solution presented itself. Perhaps the Corinna design team or someone else might offer a more elegant solution than what I presented. I’m OK with that.

So far, the Corinna code is simpler, easier to read, provides strong encapsulation, and was generally a pleasure to write. I’m looking forward to the day it’s production ready. I expect there will be teething pain for others, since thinking in terms of blessed hashes is ingrained in Perl culture, but if we keep living in the past, Perl will become a thing of the past.

The first PR for use feature 'class' is here and I’m sure any and all feedback would be appreciated.





Copyright © 2018-2024 by Curtis “Ovid” Poe.