Parsing Mathematical Expressions

A while ago, I was looking for a library to parse NSString objects as mathematical expressions. I wanted to be able to take a string like @"2+3*4" and get back 14. I found two easy ways to do this: GCMathParser and NSPredicate (of course!).

GCMathParser is a library by Graham Cox for parsing strings as equations. It uses a parser generated by BISON as the primary engine, with a straight-forward Objective-C wrapper for easy interaction. In terms of raw speed, it’s blazingly fast. However, I found one major flaw with it: it is impossible to extend. It has support for a bunch of built-in functions (all your basic trigonometric functions, rounding, etc), but if you wanted to add support for bitwise operators (bitwise AND, OR, NOT, XOR, etc), you’re totally out of luck. There’s no way to do it without regenerating an entirely new parser.

NSPredicate, as I’ve mentioned before, can also parse numerical expressions. Like GCMathParser, it has a couple of flaws:

  1. The associativity of the power operator is wrong. The string @"2 ** 3 ** 2" is parsed as (2 ** 3) ** 2, or 64. This is “left associativity”. In practical terms, it means that when you have a bunch of the same operator in a row like that, the left-most one is evaluated first. However, exponentiation is supposed to be right associative. In other words, that expression should’ve been parsed as 2 ** (3 ** 2), or 512. This is a major flaw, and it’s been reported: rdar://8692313.
  2. The syntax for custom functions is absurdly bizarre. NSExpression, the class that makes up the basic components of a predicate, does not have any trigonometric functions built-in. However, NSExpression does support defining your own functions, but with a ridiculously complex syntax. For my purposes, I wanted to take a regular user-entered string and parse it, without requiring the user to know about FUNCTION() syntax. I could’ve tried manipulating the user’s string to transform sin(<#expression#>) into FUNCTION(<#expression#>, 'sinValue') and then written a sinValue category method on NSNumber, but if I was going to write a transformer to manipulate the string, I might as well write a full-fledged parser.

So, I wrote DDMathParser to be extensible, correct, and easy to understand.

The Guts

I tried writing a formal grammar for parsing a mathematical expression, and actually got pretty far. However, when writing the recursive descent parser to match strings against this grammar, I came up against some conceptual hurdles that I couldn’t work around. I also didn’t like that modifying the grammar, even slightly, usually meant making extremely large changes to the parser. Eventually I gave up with that sort of parser and came up with something else.

There are several phases to parsing a string using this parser:

  1. Tokenization
  2. Grouping
  3. Resolving

Tokenization

Tokenization is simply the process of breaking up the string into chunks. If I have the string @"1 + 2 - 3 * 4 / 5", then after tokenization I’m going to end up with 9 tokens: 1, +, 2, -, 3, *, 4, /, and 5. No attempt is made to try to resolve anything. I will note, however, that this is the point at which implicit multiplication is made explicit. More on that later.

The tokenization process simply reads through the string character-by-character, and decides what to do with the characters it finds. The general rules it follows are:

  • If it reads a digit or a decimal point, attempt to parse a number (a number is defined as anything recognizable by NSNumberFormatter).
  • If parsing a number failed and the character is matched by [a-zA-Z0-9] (any alphanumeric character), then attempt to parse the name of a function.
  • If parsing a function name failed and the character is the $ character, then attempt to parse a variable name.
  • If parsing a variable name failed, then attempt to parse an operator.
  • If parsing an operator failed, then abort with an error.

It’s also at this point that we figure out if - and + are acting, for example, as negation or subtraction. The basic rule is: if a - or + is preceded by another operator that’s not a closing parenthesis (or it’s the first token in the steam), then it’s the unary operator (negation, etc). Otherwise it’s the binary operator.

Grouping

If tokenization succeeds, the tokens are grouped into “terms”. To explain what goes on, we’ll look at an example. Let’s say we’re parsing the string @"2 * (pi()sin(dtor(45)) + max($a, 42)/54)". The token stream will have generated these tokens:

2 * ( pi ( ) * sin ( dtor ( 45 ) ) + max ( $a , 42 ) / 54 )

Grouping organizes the terms hierarchically. This is done recursively, and is also where parentheses are checked to make sure they’re balanced. Visually, we’ll end up with something like this:

First phase of grouping

As you can see, no attempt has been made to define precedence. The step is only to break up the tokens into parenthetical “levels”.

Resolving

After the terms have been grouped, we can break things up further. We take the root group and iterate over the terms in its “subterms" array to find the indices of the operators with the highest precedence. Precedence is defined via an enum:

enum {
    DDPrecedenceBitwiseOr = 0,
    DDPrecedenceBitwiseXor,
    DDPrecedenceBitwiseAnd,
    DDPrecedenceLeftShift,
    DDPrecedenceRightShift,
    DDPrecedenceSubtraction,
    DDPrecedenceAddition = DDPrecedenceSubtraction,
    DDPrecedenceDivision,
    DDPrecedenceMultiplication = DDPrecedenceDivision,
    DDPrecedenceModulo,
    DDPrecedenceUnary,
    DDPrecedenceFactorial,
    DDPrecedencePower,
    DDPrecedenceParentheses,

    DDPrecedenceUnknown = -1,
    DDPrecedenceNone = -2
};

Parentheses produce the highest precedence, followed by the power operator, then the factorial operator, then any generic unary operator (negation, bitwise NOR, etc), and then down on through the standard mathematical operators, with bitwise OR having the lowest precedence. Subtraction has the same precedence as addition, since subtraction is really just the addition of a negative number. Likewise, division is the same as multiplication by the number’s multiplicative inverse.

Once we have the indices of the operators with the highest precedence, we’ll need to look and see if we have more than one. If we do, then we look at the associativity of the operator to see if we should choose the first (ie, left associative) or last (ie, right associative) operator. After picking an operator, we can look at the terms around the operator, recursively resolve those terms (if necessary), and then combine those resolved terms and the operator into a new group term. This is when we check to make sure that operators have all the necessary terms. It’s at this point, for example, that @"2+" would fail to parse, because + is a binary operator, but the string only has a single term. @"+2", however, would parse, because + would already have been recognized as a unary operator.

After resolution has finished, our object hierarchy now looks like this:

After term resolution

At this point, the term hierarchy is fully organized, and it’s a simple matter to convert it into expression objects for storage and/or evaluation.

Implicit Multiplication

When writing mathematical expressions, we understand that we can express the multiplication of two terms without actually writing a multiplication operator. For example, 5x is 5 * x, and 6(9) is the same thing as 6 * 9. I really wanted the parser to handle implicit multiplication, and it turns out it was very simple to implement. When tokenizing a string, a * token is inserted between two tokens if a Number, Variable, or Close Parenthesis token is followed by anything except an operator token (unless it’s an Open Parenthesis token). With this setup:

  • 5$x becomes 5 * $x
  • 6(9) becomes 6 * (9)
  • 2sin($x) becomes 2 * sin($x)

Analysis

As I see it, there are several major advantages to this parser:

  • The only limit on valid numbers is that they be recognizable by NSNumberFormatter. This means that we can use numerical types with far more precision than a double, and much larger than a long long. For all but the trigonometric functions, calculations are done with NSDecimal structs boxed in NSDecimalNumber objects.
  • Function resolution doesn’t happen until evaluation time. This allows us to parse strings like bogus(42), but not care that the bogus function doesn’t exist. With the ability to delay function recognition, we’re free to define any extra functions that we want.
  • Parsing is not defined by a grammar. If we need to add a new operator, it’s very feasible, and won’t require generating a whole new parser.
  • Operator associativity can be changed on-the-fly. If, for some reason, you need exponentiation to be left associative rather than right associative, it’s extremely easy to change that with a simple property change.

The primary downside of this parser, however, is speed. Since we’re operating with several different layers of Objective-C objects, rather than on a C string, there’s going to be an enormous* speed discrepancy between GCMathParser and DDMathParser. However, what we lose in speed is more than made up for in extensibility and accuracy.

The code is available for your perusal at GitHub: https://github.com/davedelong/DDMathParser

* by “enormous”, I mean “about 2 orders of magnitude, but still able to parse and evaluate a complex string in a fraction of a second”


Updates

  • 4 Jun 2011: original posting
  • 5 Jun 2011: added description of recognizing unary operators

  1. serkanerkan reblogged this from funwithobjc
  2. anjerodesu reblogged this from funwithobjc
  3. do-nothing reblogged this from funwithobjc
  4. quatermain reblogged this from funwithobjc and added:
    There’s so much awesome in this post that I think I just peed myself.
  5. funwithobjc posted this
Short URL for this post: http://tmblr.co/Zt522y5nLv-e
blog comments powered by Disqus