Wednesday, November 7, 2007

Parsing Integers

Every once in a while, it becomes necessary to parse an integer from a string. The trick is doing so correctly and without the need to use an integer type larger than what you are parsing into (i.e. if you are trying to parse a 32bit integer, you don't want to have to use a temporary 64bit int value for checking overflow - this is especially desirable if you are trying to parse larger int values, such as a 64bit int because you won't likely have access to a 128bit int type).

The following macro correctly implements an integer parsing routine for a given number of bits:

#define STRTOINT(bits, max)                                     \
static int                                                      \
strtoint##bits (const char *in, char **inend,                   \
                int##bits##_t *retval, int *err)                \
{                                                               \
    register const char *inptr = in;                            \
    int##bits##_t val = 0;                                      \
    int digit, sign = 1;                                        \
                                                                \
    while (*inptr == ' ')                                       \
        inptr++;                                                \
                                                                \
    if (*inptr == '-') {                                        \
        sign = -1;                                              \
        inptr++;                                                \
    }                                                           \
                                                                \
    in = inptr;                                                 \
    while (*inptr >= '0' && *inptr <= '9')                      \
        digit = (*inptr - '0');                                 \
        if (val > (max / 10)) {                                 \
            *err = EOVERFLOW;                                   \
            return -1;                                          \
        } else if (val == (max / 10)) {                         \
            if (digit > (max % 10) &&                           \
                (sign > 0 || digit > ((max % 10) + 1))) {       \
                *err = EOVERFLOW;                               \
                return -1;                                      \
            }                                                   \
                                                                \
            if (sign < 0)                                       \
                val = (val * sign * 10) - digit;                \
            else                                                \
                val = (val * 10) + digit;                       \
                                                                \
            inptr++;                                            \
            if (*inptr >= '0' && *inptr <= '9') {               \
                *err = EOVERFLOW;                               \
                return -1;                                      \
            }                                                   \
                                                                \
            goto ret;                                           \
        } else {                                                \
            val = (val * 10) + digit;                           \
        }                                                       \
                                                                \
        inptr++;                                                \
    }                                                           \
                                                                \
    if (inptr == in) {                                          \
        *err = EINVAL;                                          \
        return -1;                                              \
    }                                                           \
                                                                \
    val *= sign;                                                \
                                                                \
 ret:                                                           \
    *inend = (char *) inptr;                                    \
    *retval = val;                                              \
                                                                \
    return 0;                                                   \
}

To expand this macro to parse a 32bit integer value from a string, you would do:

STRTOINT(32, 2147483647L)

The resulting function would be named strtoint32().

Why implement my own integer parsing routine instead of using libc's strtol(), you ask?

For a variety of reasons:

  • strtol() might not be available or you might want to implement an integer parsing routine in a language other than c/c++ (a variation of my routine is now used in Mono for all of the Int.Parse() methods).
  • strtol() is locale dependent which is not always a desirable trait.
  • strtol() requires annoying error checking that using my function makes trivial.
  • it's easy to extend my function macro to support integers of any bit size, e.g. 64bit integers (or someday 128bit integers).

To generate functions to parse 8, 16, 32, and 64bit integers, you would write:

STRTOINT(8, 127)
STRTOINT(16, 32767)
STRTOINT(32, 2147483647L)
STRTOINT(64, 9223372036854775807LL)

What about unsigned integers, you ask? Not to worry, I have an implementation for that as well:

#define STRTOUINT(bits, max)                                    \
static int                                                      \
strtouint##bits (const char *in, char **inend,                  \
                 uint##bits##_t *retval, int *err)              \
{                                                               \
    register const char *inptr = in;                            \
    uint##bits##_t val = 0;                                     \
    int digit;                                                  \
                                                                \
    while (*inptr == ' ')                                       \
        inptr++;                                                \
                                                                \
    in = inptr;                                                 \
    while (*inptr >= '0' && *inptr <= '9') {                    \
        digit = (*inptr - '0');                                 \
        if (val > (max / 10)) {                                 \
            *err = EOVERFLOW;                                   \
            return -1;                                          \
        } else if (val == (max / 10) && digit > (max % 10)) {   \
            *err = EOVERFLOW;                                   \
            return -1;                                          \
        } else {                                                \
            val = (val * 10) + digit;                           \
        }                                                       \
                                                                \
        inptr++;                                                \
    }                                                           \
                                                                \
    if (inptr == in) {                                          \
        *err = EINVAL;                                          \
        return -1;                                              \
    }                                                           \
                                                                \
    *inend = (char *) inptr;                                    \
    *retval = val;                                              \
                                                                \
    return 0;                                                   \
}

STRTOUINT(8, 255)
STRTOUINT(16, 65535)
STRTOUINT(32, 4294967295UL)
STRTOUINT(64, 18446744073709551615ULL)

And there you have it.

Note: these routines require [u]int[8,16,32,64]_t to be defined. If you are on Linux, #include <stdint.h> should do the trick.

Update: Miguel's blog goes into more detail on why this code was important for Mono.

19 comments:

Stuart said...

Why do you need to pass the maximum number? Surely some variant on (1 << bits) would do the job?

Jeffrey Stedfast said...

I suppose you could use:

int##bits##_t max = (1 << (bits - 1)) - 1;

but it's likely harder for the c compiler to optimize that than a constant.

Plus I'm not sure my bit hack will always work for signed integer types.

Jeffrey Stedfast said...

Nevermind, it should always work.

Anonymous said...

What license might this be under?

Jeffrey Stedfast said...

Ah, good point... I should probably state on my blog that all code that I post is licensed under MIT/X11 unless otherwise specified.

Anonymous said...

Where did you learn to program?

Every one of your code snippets is incredible. They are all so clean, concise and all so elegant!

Please please please! Post more code snippets!

Anonymous said...

Yeah, but please work on your snippet layout.
The above is almost unreadable. What's with all those backslashes every other line?

Benjamin said...

I think this is missing support for MININT (- MAXINT - 1).
Also, strtol() does radians other than 10, in particular hex numbers. And I think it also uses isspace() for skipping spaces at the beginning of the string.

Joaquim Rendeiro said...

WOW!

@Anonymous: it a macro, thus all the slashes!

Jeffrey Stedfast said...

Anonymous,

The code is a C-macro template, the idea is that you could expand it to implement a parser routine for any size integer (8, 16, 32, 64, 128, etc) without the need to rewrite any new code, just allowing the preprocessor to do it for you.

Benjamin,

Yes, strtol() and friends do indeed handle many more bases than just 10, but typically you want to parse base10 integers (or at least that's what I usually find myself parsing ;-)

It should be fairly trivial to change the code to parse base8 or base16, etc (which are probably the 2 most common other than 10).

As far as eating white space before the numeric value, my code already eats 0x20, but you could fix that up easily to use isspace() :)

As far as MIN_INT, if I'm understanding what you are asking, my code does handle MIN_INT properly. See the following logic:

if (digit > (max % 10) && (sign > 0 || digit > ((max % 10) + 1))) {

You'll see there that my code recognizes that MIN_INT is (-MAX_INT - 1).

Anonymous said...

The code is unreadable. The right hand sides of the code blocks are hidden behind the sidebar. The lines are cut off well before the backslashes a previous commentator complained about.

Turly said...

Jeffrey - the code's tab size is set too large (8 spaces). Could you replace the tabs with 3 or 4 spaces? Cheers!

Jeffrey Stedfast said...

Okay, I've just replaced the tab size to be 4 spaces rather than 8, hope that helps alleviate the viewing problems people are having.

Anonymous said...

I think that signed integer parsing routine is screwed up. Probably it should go like this (i omitted backslashes at the end of every line):

#define STRTOINT(bits, max)
static int
strtoint##bits (const char *in, char **inend,
int##bits##_t *retval, int *err)
{
register const char *inptr = in;
int##bits##_t val = 0;
int digit, sign = 1;

while (*inptr == ' ')
inptr++;

if (*inptr == '-') {
sign = -1;
inptr++;
}

in = inptr;
while (*inptr >= '0' && *inptr <= '9') {
digit = (*inptr - '0');
if (val > (max / 10)) {
*err = EOVERFLOW;
return -1;
} else if (val == (max / 10)) {
if (digit > (max % 10) &&
(sign > 0 || digit > ((max % 10) + 1))) {
*err = EOVERFLOW;
return -1;
}

if (sign < 0)
val = (val * sign * 10) - digit;
else
val = (val * 10) + digit;

inptr++;
}

if (inptr == in) {
*err = EINVAL;
return -1;
}

*inend = (char *) inptr;
*retval = val;

return 0;
}

Jeffrey Stedfast said...

Anonymous,

1. What is "screwed up" in my implementation?

2. Your code doesn't have the proper number of closing braces so I'm not sure I can even follow what you are trying to do... I can only assume you meant to close off the while-loop before the "if (inptr == in)" check.

If my assumption is correct, then your code is broken for negative values. Every iteration through the loop will swap the sign. For example, if the number we are parsing is "-39", your function would parse the value as 21 which is clearly wrong.

Jeffrey Stedfast said...

Let me try to explain better what I had meant by "swap the sign":

if (sign < 0)
val = (val * sign * 10) - digit;
...

since sign == -1 and the number we are parsing is -39, here's what happens:

first pass we get:
val = (0 * -1 * 10) - 3;

as desired, this results in -3. Good so far... However, the next time through your loop we get:

val = (-3 * -1 * 10) - 9;

which results in 21.

This is why it is important not to apply the sign until we are done with the loop.

Hope that helps explain the logic in my implementation.

Anonymous said...

you say that this code is useful because some systems may lack strtol, well, haha, your code relies on stdint.h which is less likely to be available than strtol ;D

Jeffrey Stedfast said...

You only need stdint.h for the [u]int*_t definitions, which you can typedef yourself easily enough.

Anonymous said...

excelente código la importancia en peerfomance que se alcanza vale más que la legibiidad del código que se logro o acaso ustedes lo pueden mejorar tabulando el codigo
jajaja
good work
excellent fine

Code Snippet Licensing

All code posted to this blog is licensed under the MIT/X11 license unless otherwise stated in the post itself.