## Parsing Mathematical Expressions

A while ago, I was looking for a library to parse `NSString`

objects as mathematical expressions. I wanted to be able to take a string like `@"2+3*4"`

and get back `14`

. I found two easy ways to do this: `GCMathParser`

and `NSPredicate`

(of course!).

`GCMathParser`

is a library by Graham Cox for parsing strings as equations. It uses a parser generated by BISON as the primary engine, with a straight-forward Objective-C wrapper for easy interaction. In terms of raw speed, it’s blazingly fast. However, I found one major flaw with it: it is impossible to extend. It has support for a bunch of built-in functions (all your basic trigonometric functions, rounding, etc), but if you wanted to add support for bitwise operators (bitwise AND, OR, NOT, XOR, etc), you’re totally out of luck. There’s no way to do it without regenerating an entirely new parser.

`NSPredicate`

, as I’ve mentioned before, can also parse numerical expressions. Like `GCMathParser`

, it has a couple of flaws:

- The associativity of the power operator is wrong. The string
`@"2 ** 3 ** 2"`

is parsed as`(2 ** 3) ** 2`

, or`64`

. This is “left associativity”. In practical terms, it means that when you have a bunch of the same operator in a row like that, the left-most one is evaluated first. However, exponentiation is supposed to be right associative. In other words, that expression should’ve been parsed as`2 ** (3 ** 2)`

, or`512`

. This is a major flaw, and it’s been reported: rdar://8692313. - The syntax for custom functions is absurdly bizarre.
`NSExpression`

, the class that makes up the basic components of a predicate, does not have any trigonometric functions built-in. However,`NSExpression`

does support defining your own functions, but with a ridiculously complex syntax. For my purposes, I wanted to take a regular user-entered string and parse it, without requiring the user to know about`FUNCTION()`

syntax. I could’ve tried manipulating the user’s string to transform`sin(<#expression#>)`

into`FUNCTION(<#expression#>, 'sinValue')`

and then written a`sinValue`

category method on`NSNumber`

, but if I was going to write a transformer to manipulate the string, I might as well write a full-fledged parser.

So, I wrote `DDMathParser`

to be extensible, correct, and easy to understand.

## The Guts

I tried writing a formal grammar for parsing a mathematical expression, and actually got pretty far. However, when writing the recursive descent parser to match strings against this grammar, I came up against some conceptual hurdles that I couldn’t work around. I also didn’t like that modifying the grammar, even slightly, usually meant making extremely large changes to the parser. Eventually I gave up with that sort of parser and came up with something else.

There are several phases to parsing a string using this parser:

- Tokenization
- Grouping
- Resolving

### Tokenization

Tokenization is simply the process of breaking up the string into chunks. If I have the string `@"1 + 2 - 3 * 4 / 5"`

, then after tokenization I’m going to end up with 9 tokens: `1`

, `+`

, `2`

, `-`

, `3`

, `*`

, `4`

, `/`

, and `5`

. No attempt is made to try to resolve anything. I will note, however, that this is the point at which implicit multiplication is made explicit. More on that later.

The tokenization process simply reads through the string character-by-character, and decides what to do with the characters it finds. The general rules it follows are:

- If it reads a digit or a decimal point, attempt to parse a number (a number is defined as anything recognizable by
`NSNumberFormatter`

). - If parsing a number failed and the character is matched by
`[a-zA-Z0-9]`

(any alphanumeric character), then attempt to parse the name of a function. - If parsing a function name failed and the character is the
`$`

character, then attempt to parse a variable name. - If parsing a variable name failed, then attempt to parse an operator.
- If parsing an operator failed, then abort with an error.

It’s also at this point that we figure out if `-`

and `+`

are acting, for example, as negation or subtraction. The basic rule is: if a `-`

or `+`

is preceded by another operator that’s not a closing parenthesis (or it’s the first token in the steam), then it’s the unary operator (negation, etc). Otherwise it’s the binary operator.

### Grouping

If tokenization succeeds, the tokens are grouped into “terms”. To explain what goes on, we’ll look at an example. Let’s say we’re parsing the string `@"2 * (pi()sin(dtor(45)) + max($a, 42)/54)"`

. The token stream will have generated these tokens:

```
2 * ( pi ( ) * sin ( dtor ( 45 ) ) + max ( $a , 42 ) / 54 )
```

Grouping organizes the terms hierarchically. This is done recursively, and is also where parentheses are checked to make sure they’re balanced. Visually, we’ll end up with something like this:

As you can see, no attempt has been made to define precedence. The step is only to break up the tokens into parenthetical “levels”.

### Resolving

After the terms have been grouped, we can break things up further. We take the root group and iterate over the terms in its “`subterms`

" array to find the indices of the operators with the highest precedence. Precedence is defined via an `enum`

:

```
enum {
DDPrecedenceBitwiseOr = 0,
DDPrecedenceBitwiseXor,
DDPrecedenceBitwiseAnd,
DDPrecedenceLeftShift,
DDPrecedenceRightShift,
DDPrecedenceSubtraction,
DDPrecedenceAddition = DDPrecedenceSubtraction,
DDPrecedenceDivision,
DDPrecedenceMultiplication = DDPrecedenceDivision,
DDPrecedenceModulo,
DDPrecedenceUnary,
DDPrecedenceFactorial,
DDPrecedencePower,
DDPrecedenceParentheses,
DDPrecedenceUnknown = -1,
DDPrecedenceNone = -2
};
```

Parentheses produce the highest precedence, followed by the power operator, then the factorial operator, then any generic unary operator (negation, bitwise NOR, etc), and then down on through the standard mathematical operators, with bitwise OR having the lowest precedence. Subtraction has the same precedence as addition, since subtraction is really just the addition of a negative number. Likewise, division is the same as multiplication by the number’s multiplicative inverse.

Once we have the indices of the operators with the highest precedence, we’ll need to look and see if we have more than one. If we do, then we look at the associativity of the operator to see if we should choose the first (ie, left associative) or last (ie, right associative) operator. After picking an operator, we can look at the terms around the operator, recursively resolve those terms (if necessary), and then combine those resolved terms and the operator into a new group term. This is when we check to make sure that operators have all the necessary terms. It’s at this point, for example, that `@"2+"`

would fail to parse, because `+`

is a binary operator, but the string only has a single term. `@"+2"`

, however, *would* parse, because `+`

would already have been recognized as a unary operator.

After resolution has finished, our object hierarchy now looks like this:

At this point, the term hierarchy is fully organized, and it’s a simple matter to convert it into expression objects for storage and/or evaluation.

## Implicit Multiplication

When writing mathematical expressions, we understand that we can express the multiplication of two terms without actually writing a multiplication operator. For example, `5x`

is `5 * x`

, and `6(9)`

is the same thing as `6 * 9`

. I really wanted the parser to handle implicit multiplication, and it turns out it was very simple to implement. When tokenizing a string, a `*`

token is inserted between two tokens if a Number, Variable, or Close Parenthesis token is followed by anything except an operator token (unless it’s an Open Parenthesis token). With this setup:

`5$x`

becomes`5 * $x`

`6(9)`

becomes`6 * (9)`

`2sin($x)`

becomes`2 * sin($x)`

## Analysis

As I see it, there are several major advantages to this parser:

- The only limit on valid numbers is that they be recognizable by
`NSNumberFormatter`

. This means that we can use numerical types with far more precision than a`double`

, and much larger than a`long long`

. For all but the trigonometric functions, calculations are done with`NSDecimal`

structs boxed in`NSDecimalNumber`

objects. - Function resolution doesn’t happen until evaluation time. This allows us to parse strings like
`bogus(42)`

, but not care that the`bogus`

function doesn’t exist. With the ability to delay function recognition, we’re free to define any extra functions that we want. - Parsing is not defined by a grammar. If we need to add a new operator, it’s very feasible, and won’t require generating a whole new parser.
- Operator associativity can be changed on-the-fly. If, for some reason, you need exponentiation to be left associative rather than right associative, it’s extremely easy to change that with a simple property change.

The primary downside of this parser, however, is speed. Since we’re operating with several different layers of Objective-C objects, rather than on a C string, there’s going to be an enormous^{*} speed discrepancy between `GCMathParser`

and `DDMathParser`

. However, what we lose in speed is more than made up for in extensibility and accuracy.

The code is available for your perusal at GitHub: https://github.com/davedelong/DDMathParser

^{* by “enormous”, I mean “about 2 orders of magnitude, but still able to parse and evaluate a complex string in a fraction of a second”}

**Updates**

- 4 Jun 2011: original posting
- 5 Jun 2011: added description of recognizing unary operators

## Mathematically evaluating NSStrings

I’ve recently posted the code to a project I’ve written called ** DDMathParser**. It’s a way whereby you can take an

`NSString`

and evaluate it as a mathematical expression. For example, you can do:```
NSLog(@"%@", [@"1 + 3 * 4 - 5" numberByEvaluatingString]); //logs "8"
```

There are several other ways to evaluate strings, including methods that allow you to use variables (`1 + $a`

, and then substitute `$a`

for a value later). You can even define custom functions, like `mySuperSecretFunction(42)`

.

This was a really fun project, because there were some really interesting challenges to work around (like how can I parse a left associative expression using a recursive descent parser without getting caught in an infinite recursion?). I’ll save those for another post.

For now, go check out the source on github and let me know what you think!