Writing an Interpreter: Lexing Floats

Ruby division

Playing around with division in a ruby console (as one does), we see the following:

irb(main)> 7 / 2
=> 3
irb(main)> 7.0 / 2
=> 3.5

Digging in a little more, this is due to a critical difference between the class of 7 and 7.0:

irb(main)> 7.class
=> Integer
irb(main)> 7.0.class
=> Float

I’ve seen this nuance around dividing numbers in a few different languages (see below for a Python aside). And so when I read, “We only read in integers. What about floats? … We ignore them and just say that [we] don’t support them,” in the Writing an Interpreter in Go book I’m currently working through, my curiosity piqued!

In this post, I’ll step through how I implemented reading in floats as part of the interpreter I’m currently writing in Go.

Tokens

Tokens are a representation of source code in a way that a parser can interpret. Lexing is a phase during interpretation where source code is converted to tokens. Lexers operate on the source code, and then pass the tokens they create on to parsers.

In the book I’m working through, tokens are implemented as structs with two fields, Type and Literal. The Type tells us what we are lexing, and the Literal is the content. In some cases, like semicolons, the same Types will always have the same Literals. In other cases, like integers, the same Types can have infinite possible Literals:

type Token struct {
        Type TokenType
        Literal string
}

type TokenType string

As an example, an integer is be represented as:

token.Token{Type: "INT", Literal: 15}

NextToken()

The crux of the lexer relies on a NextToken() function. This reads the input source code at a current position and produces a token based on that source code. In some cases, like when reading a ;, it will simply need to look at one character at a time and increment the current position by one. In other cases, like when reading a variable name, it will need to read multiple characters and increment the current position by more than one.

Reading an integer is implemented as follows:

func (l *Lexer) NextToken() token.Token {
	var tok token.Token
	...
	// Lexer.ch is the current character
	if isDigit(l.ch) {
		tok.Type = token.INT
		tok.Literal = l.readNumber()
		return tok
	}
	...
}

func (l *Lexer) readNumber() string {
	// Lexer.position is the position of the current character
	position := l.position
	for isDigit(l.ch) {
		// Lexer.readChar() will increment our lexer position
		// appropriately
		l.readChar()
	}
	return l.input[position:l.position]
}

This is looking at the first appearance of a digit, and parsing that number for as long as there are still digits. As soon as digits stop, the token will end. If we therefore saw 5;, we would lex (unsure if this is a verb, but I can pretend) 5 as its own token, and ; as the subsequent token. Similarly, due to the isDigit(l.ch) loop, 544; would be lexed as one token for 544 and another token for ;.

But, in its current implementation, 5.44; would be illegal and un-lexable (definitively sure this is not a word) syntax. That’s because, as the author mentioned, the book does not delve into lexing floats!

Float requirements

Before we solve the problem of lexing both ints and floats, it is important to fully understand the scope of the problem! We want the current behavior around integers to remain untouched. In the absence of a decimal, we still want the TokenType to resolve to a token.INT.

But… we also want to introduce a new TokenType: token.FLOAT

Input should be lexed to a float if it is any number of digits, followed by a decimal, followed by any number of digits. For example: 1.6180.

There are still some illegal inputs. These consist of:

Decimals without a preceding number (.123)
Numbers with . without decimal values (1.)
Numbers with multiple decimals (1.1.1 or 1..1)

We want to refactor the above code to appropriately handle all of these requirements!

Implementing Floats!

As it’s implemented, readNumber() has the return type of a string representing the Token’s literal. This is because there is only one possible TokenType for reading numbers. However, since we are now introducing a second TokenType for numbers, it makes sense to refactor readNumber() to return a Token itself so that we can assign both the Literal and the TokenType fields within readNumber().

We can therefore refactor NextToken() to expect our new signature for readNumber():

func (l *Lexer) NextToken() token.Token {
	...
	if isDigit(l.ch) {
		return l.readNumber()
	}
	...
}

We now need to refactor readNumber(). We can do this by reading a number as we previously were, and then looking for a .. If there is no ., we can just continue treating this like we were before floats. If there is a ., however, we have entered the world of floats! In this case, we want to check if there are more numbers following. If not, the input is illegal. If there are, we want to keep parsing numbers like we would for integers, but return a different type.

I’m a programmer and not an English major, so I’ll let the code (and comments) do the rest of the talking! Note that, as discussed earlier, readNumber() now returns a token.Token.

func (l *Lexer) readNumber() token.Token {
	position := l.position

	// A small Go aside here (since I'm also learning Go while working on
	// this): if we had just written
	// tokenType := token.INT
	// and let Go infer the type of tokenType, it would infer that tokenType
	// was a string since token.INT is a string. But, since Token expects
	// TokenType to be a TokenType, we need to therefore explicitly define
	// the type of tokenType
	var tokenType token.TokenType = token.INT

	for isDigit(l.ch) {
		l.readChar()
	}

	if isDecimal(l.ch) {
		// Checking if the character following the decimal is also
		// a digit
		if isDigit(l.peekChar()) {
			// We're now in FLOAT territory!
			tokenType = token.FLOAT
			l.readChar()
			for isDigit(l.ch) {
				l.readChar()
			}
		} else {
			// This would be the case where we saw input like `1.`
			// as there're no digits following the decimal
			tokenType = token.ILLEGAL
		}
	}

	return token.Token{
		Type:  tokenType,
		Literal: l.input[position:l.position],
	}
}

func isDecimal(ch byte) bool {
	return '.' == ch
}

func (l *Lexer) peekChar() byte {
	if l.readPosition >= len(l.input) {
		return 0
	} else {
		return l.input[l.readPosition]
	}
}

And 🎉, we’ve implemented floats in our interpreter too.

>> 1
{Type:INT Literal:1}
>> 1.0
{Type:FLOAT Literal:1.0}

Be on the lookout for a future post when I do the parsing section of this book and implement division like in the ruby and python examples.

Python aside

In doing a little research for this post, I noticed the following:

$ python
>>> 7 / 2
3
>>> 7.0 / 2
3.5

$ python3
>>> 7 / 2
3.5
>>> 6 / 2
3.0

So I read a little more, and learned that Python actually changed how they implemented division from what they call “classic division” in the Python 2.x series to what they call “true division” in the Python 3.x series. The Python team wrote up a really interesting explanation in their original proposal to change the division operator.