Split a string


The vast majority of questions about splitting strings are about tokenization: splitting a string into substrings containing only related characters, called, depending on context, tokens or fields.

But first, some basic concepts:

There are so many ways to split strings that the choice can be overwhelming. Here are some of the most common, in no particular order, with examples:

Method                                 Iterated or      Delimiter                                           Empty fields
                                       all-at-once      char  string  function  quotable  offset  no case   elided  trailing
Boost String Algorithms: Split         all-at-once       Y     Y       Y                                       4       Y
Boost String Algorithms: Split Regex   all-at-once       1     Y       regex     2                  2          2       ?
Boost Tokenizer                        iterated(3)       Y             Y         Y          Y                 opt      Y
Trolltech Qt’s QString::split()        all-at-once       Y     Y       regex     2                  Y         opt      Y
GNU String Utility Functions: Split    all-at-once       1     Y                                                       Y
iostreams and getline()                iterated(3)       Y                                                     Y       Y
string::find_first_of()                all-at-once(3)    Y     Y                                                       Y
strtok()                               iterated(3)       1     Y                                             always   NO
Roll your own C tokenizer              iterated(3)       1     Y                                              opt      Y

(1) Searching for a single character is, of course, possible wherever searching for more than one is, but the function does not provide a direct way to do it.
(2) Via regex capabilities.
(3) While the primary method may be one or the other, it is easy enough to write a wrapper that provides the other behavior.
(4) Empty fields are not elided (as in ‘omitted’), but adjacent delimiters are compressed (as in ‘combined’).

A simple Google search reveals many more for perusal.

What’s the difference between a token and a field?

Simply put, a token and a field are two different things, though you will often see them confused. Most of what is on this page concerns fields.

tokens

Tokens relate to lexing and parsing, where the text being decoded is an ordered list of tokens, not necessarily all of the same type.

A token is typically a string of characters that possess some common characteristic, such as being all alphabetic, or being a specific list of characters, like "->" and "<<". Tokens also have semantic value (or meaning) attached to them. (Fields do not.)

For example, in C++, the text string s = "Hello world!"; contains five tokens:

  string           a type
  s                an identifier
  =                an operator
  "Hello world!"   a constant string literal
  ;                a statement terminator

It is worth noting here that for the C++ example each token has different characteristics and whitespace is treated specially. This kind of tokenizing requires careful lexing; typically token-specific functions are used to determine whether or not a character belongs to the current token and to classify the token’s type.
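For illustration, a hypothetical toy lexer, just enough for the five-token example above, might look like this (the function name and the exact character classes are my own choices, not a standard API):

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// A tiny lexer for the example above. Each branch is a token-specific test;
// whitespace separates tokens but is never itself a token.
std::vector<std::string> lex( const std::string& s )
{
  std::vector<std::string> tokens;
  std::size_t i = 0;
  while (i < s.size())
  {
    if (std::isspace( (unsigned char)s[i] )) { ++i; continue; }  // skip whitespace
    std::size_t start = i;
    if (std::isalpha( (unsigned char)s[i] ))                     // identifier or type name
      while (i < s.size() && std::isalnum( (unsigned char)s[i] )) ++i;
    else if (s[i] == '"')                                        // string literal, quotes kept
    {
      ++i;
      while (i < s.size() && s[i] != '"') ++i;
      if (i < s.size()) ++i;                                     // consume the closing quote
    }
    else                                                         // single-character operator etc.
      ++i;
    tokens.push_back( s.substr( start, i - start ) );
  }
  return tokens;
}
```

Notice how each token type gets its own rule for deciding where the token ends; that is the essential difference from field splitting.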

fields

A field is typically a string of characters that do not include a set of special characters called delimiters. An ordered collection of fields, separated by delimiters, is called a record. A collection of records is called, variously, a database, table, spreadsheet, etc.

Depending on your requirements specifications, it may be possible for some fields to include characters that would normally be considered delimiters; the characters are in some way encoded into the field to prevent them from being understood as delimiters.

For example, here is a record of six fields, delimited by commas and quoted by double-quote characters, where (unquoted) leading and trailing whitespace is ignored:

Nalleli Andrade, 12 Jun 1989,,"piña colada, long walks in the rain" ,ID    1589-73AYN,

The six fields are, in order:

  1. Nalleli Andrade
  2. 12 Jun 1989
  3. (empty)
  4. piña colada, long walks in the rain
  5. ID    1589-73AYN
  6. (empty)

One of the most common questions about this kind of data structure concerns CSV files, specifically in relation to Microsoft Excel. The C++ I/O library actually makes handling this kind of thing relatively easy for simple data, but for more advanced handling, see the topic Parse CSV data?
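To make the rules concrete, here is a hypothetical split_record() sketch for exactly the conventions of the record shown above (comma delimiters, double-quote quoting, unquoted leading and trailing whitespace ignored). It is an illustration, not a full CSV parser:

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Split one record into fields: commas delimit, double quotes protect
// embedded commas, and unquoted leading/trailing whitespace is dropped.
std::vector<std::string> split_record( const std::string& line )
{
  std::vector<std::string> fields;
  std::size_t i = 0;
  while (true)
  {
    while (i < line.size() && std::isspace( (unsigned char)line[i] )) ++i;  // leading ws
    std::string field;
    if (i < line.size() && line[i] == '"')           // quoted field: take all until '"'
    {
      ++i;
      while (i < line.size() && line[i] != '"') field += line[i++];
      if (i < line.size()) ++i;                      // consume the closing quote
    }
    else                                             // unquoted field: take all until ','
    {
      while (i < line.size() && line[i] != ',') field += line[i++];
      while (!field.empty() && std::isspace( (unsigned char)field.back() ))
        field.pop_back();                            // trailing ws
    }
    while (i < line.size() && line[i] != ',') ++i;   // skip anything after a closing quote
    fields.push_back( field );
    if (i >= line.size()) break;
    ++i;                                             // skip the comma
  }
  return fields;
}
```

Run against the record above it produces the six fields listed, including the two empty ones.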

What is the essential algorithm?

The split/tokenize algorithm involves very little:

You are given:
  • a string to split
  • a way of determining whether a character “is a token” or “is a delimiter”
You also need:
  • a “start index” into the string to remember where to start the next token
To get a token:

    index = start index
    while index < length( s )
    {
      if s[index] is a delimiter (or is not part of the current token)
      {
        break out of the loop
      }
      index = index + 1
    }
    result = substring of s from (start index) to (index - 1)
    start index = index + something (see comments below)

something
You must be careful when handling characters that are not part of the current token. For tokens, the last character we examined in the loop might actually be part of the next token. (Your something might be zero – but you would have to guarantee to never return an empty token.) For fields, the last character is not part of any token. (Your something is at least one.) When you adjust your start index, make sure to account for these kinds of things.

Other things to account for are multi-character delimiters and elided delimiters (where multiple delimiters are treated as only one).
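Rendered as real C++ for the field case with a single-character delimiter (so the something is one, since each delimiter is consumed), the algorithm might look like this; the function name is mine, for illustration:

```cpp
#include <string>
#include <vector>

// The pseudocode above, for fields split on a single delimiter character.
// Every delimiter separates two fields, so empty fields are preserved.
std::vector<std::string> essential_split( const std::string& s, char delimiter )
{
  std::vector<std::string> result;
  std::string::size_type start = 0;                      // the "start index"
  while (true)
  {
    std::string::size_type index = start;
    while (index < s.length() && s[index] != delimiter)  // still part of this field
      index += 1;
    result.push_back( s.substr( start, index - start ) );
    if (index == s.length()) break;                      // ran off the end: last field
    start = index + 1;                                   // the "something" is one here
  }
  return result;
}
```

Note how a trailing delimiter yields a final empty field, exactly as the field semantics require.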

The basic tokenize/split algorithm is illustrated with the strtok() FAQ. The only issue is that strtok() modifies the original string by sticking '\0's in it (which you should not do).

Other examples are found with the string::find_first_of() and roll your own C tokenizer examples below. (Be warned, though, that you should have a pretty solid understanding of the basic algorithm before you look at these examples, as they also provide the option to eliminate empty fields.)

Boost String Algorithms: Split

The Boost String Algorithms Library is a comprehensive library to do useful things to strings. Included is a nifty function called split(). Here are three simple examples of how to use it:

#include <boost/algorithm/string.hpp>
#include <iostream>
#include <string>
#include <vector>

using namespace std;
using namespace boost;

void print( vector <string> & v )
{
  for (size_t n = 0; n < v.size(); n++)
    cout << "\"" << v[ n ] << "\"\n";
  cout << endl;
}

int main()
{
  string s = "a,b, c ,,e,f,";
  vector <string> fields;

  cout << "Original = \"" << s << "\"\n\n";

  cout << "Split on \',\' only\n";
  split( fields, s, is_any_of( "," ) );
  print( fields );

  cout << "Split on \" ,\"\n";
  split( fields, s, is_any_of( " ," ) );
  print( fields );

  cout << "Split on \" ,\" and elide delimiters\n"; 
  split( fields, s, is_any_of( " ," ), token_compress_on );
  print( fields );

  return 0;
}
Original = "a,b, c ,,e,f,"

Split on ',' only
"a"
"b"
" c "
""
"e"
"f"
""

Split on " ,"
"a"
"b"
""
"c"
""
""
"e"
"f"
""

Split on " ," and elide delimiters
"a"
"b"
"c"
"e"
"f"
""

Notice in that last example that the empty field at the end is still among the results? Multiple delimiters may be treated as if they were one delimiter (or ‘compressed’), which is not the same as simply removing empty fields.

Boost String Algorithms: Split Regex

Again, the Boost String Algorithms Library – Regex Variants comes to the rescue. This version of split() is much more powerful than the one above... but it requires you to have compiled the Boost Regex library and to link it to your executable.

Here is an example of using it to split on a multi-character match:

#include <boost/regex.hpp>
#include <boost/algorithm/string/regex.hpp>
#include <iostream>
#include <string>
#include <vector>

using namespace std;
using namespace boost;

void print( vector <string> & v )
{
  for (size_t n = 0; n < v.size(); n++)
    cout << "\"" << v[ n ] << "\"\n";
  cout << endl;
}

int main()
{
  string s = "one->two->thirty-four";
  vector <string> fields;

  split_regex( fields, s, regex( "->" ) );
  print( fields );

  return 0;
}
"one"
"two"
"thirty-four"

Remember, we are finding delimiters by matching against a regular expression. And again, make sure to link with libboost_regex. (Need help Compiling and Linking?)

Boost Tokenizer

The Boost.Tokenizer library is a small library designed to handle common tokenizing tasks. Unlike the split() functions, it is used with an iterator to work your way through the tokens in a string.

There are three general tokenizing methods that come with it:
  1. Break on delimiter characters using char_separator.
    It allows you to specify:
    • delimiter characters
    • delimiter characters to keep in the extracted field
    • whether adjacent delimiter characters indicate an empty field or are treated as a single delimiter
  2. Break on delimiter characters but allowing quoted fields using escaped_list_separator.
    This method is nice if you are following the C/C++ CSV convention, but it unfortunately cannot handle the industry ‘standard’ Microsoft Excel CSV quoting conventions. (Nor can it be made to fully... a failure of the Boost Tokenizer library’s design, alas.)
  3. Break based on position using offset_separator.
    This method is unique: you can split a string based solely upon offsets (or, more accurately, field width counts) into the string. You get everything, though, so you must decide what to keep. Unfortunately, using iterators makes it a little more clumsy than just using std::string::substr() a few times...
I refer you to the Boost.Tokenizer documentation for full explanations and examples.

For properly dealing with Microsoft Excel CSV data, see the next FAQ.

Trolltech Qt’s QString

Remember that Trolltech’s Qt is dual-licensed: it is free to use for open-source development, but if you plan to sell closed-source software built with it, they expect handsome payment for their (very nicely designed) libraries, right at the beginning.

Qt’s QString handles Unicode and Regular Expression parsing easily.

See QString::split() for full explanations and examples.

As nice as it is, however, the Qt Framework is not designed to handle the full Microsoft Excel CSV file format. Again, to properly handle Microsoft Excel CSV data, see the next FAQ.

GNU String Utility Functions: Split

GLib’s String Utility Functions also include useful routines, among them g_strsplit() and g_strsplit_set().
Here is a simple example using the first:

/* example.c */
#include <stdio.h>
#include <glib.h>

int main()
{
  const char* s = ",,three,,five,,";
  char** fields = g_strsplit( s, ",", 0 );  /* delimiter is a string; 0 = no token limit */

  gint n = 0;
  for (char** field = fields; *field; ++field, ++n)  /* the vector is NULL-terminated */
  {
    printf( "\"%s\"\n", *field );
  }
  printf( "%d tokens\n", n );

  g_strfreev( fields );
  fields = NULL;

  return 0;
}
""
""
"three"
""
"five"
""
""
7 tokens

Make sure to link with GLib when you compile! (On non-Linux systems this can be tricky for beginners: binary development packages do exist for Windows, but GLib/GTK+ is not trivial to install... Good luck!)

iostreams and getline()

The simplest method of tokenizing strings in C++ is to use the standard iostream capabilities. The std::getline() function has a very rudimentary capacity to break strings up using a single delimiter character each time you call the function.

AN IMPORTANT WARNING ABOUT THE EXAMPLE CODE 
Because getline() is designed to ignore empty fields at the end of input, we must do something very brash – something that is often the Wrong Thing to do – and loop on EOF. In this case, though, it is actually the Right Thing to do, as we want the odd behavior and we are being very careful to get it.

The basic algorithm to print every field to the standard output is this:

string s = "string, to, split";
istringstream ss( s );
while (!ss.eof())         // See the WARNING above for WHY we're doing this!
{
  string x;               // here's a nice, empty string
  getline( ss, x, ',' );  // try to read the next field into it
  cout << x << endl;      // print it out, EVEN IF WE ALREADY HIT EOF
}

Let’s put that into a convenient function that splits a string into a container of your choice (such as a std::vector). Let’s also add the ability to elide (or omit) empty fields. Here is the complete code.

#include <sstream>
#include <string>

struct split
{
  enum empties_t { empties_ok, no_empties };
};

template <typename Container>
Container& split(
  Container&                                 result,
  const typename Container::value_type&      s,
  typename Container::value_type::value_type delimiter,
  split::empties_t                           empties = split::empties_ok )
{
  result.clear();
  std::istringstream ss( s );
  while (!ss.eof())
  {
    typename Container::value_type field;
    getline( ss, field, delimiter );
    if ((empties == split::no_empties) && field.empty()) continue;
    result.push_back( field );
  }
  return result;
}

(We could also add the ability to trim() leading and trailing whitespace from each field, just before it is pushed into the container, but we’ll leave that to you.) Here is an example demonstrating how to use it.
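If you do want trimming, a minimal sketch might look like this (the whitespace set is my own choice; adjust to taste), applied to each field before push_back():

```cpp
#include <string>

// Strip leading and trailing whitespace; an all-whitespace string becomes "".
std::string trim( const std::string& s )
{
  const char* whitespace = " \t\r\n";
  std::string::size_type begin = s.find_first_not_of( whitespace );
  if (begin == std::string::npos) return "";
  std::string::size_type end = s.find_last_not_of( whitespace );
  return s.substr( begin, end - begin + 1 );
}
```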

#include <iostream>
#include <vector>
using namespace std;

void print( vector <string> & v )
{
  for (size_t n = 0; n < v.size(); n++)
    cout << "\"" << v[ n ] << "\"\n";
  cout << endl;
}

int main()
{
  string s = "One, two,, four , five,";

  vector <string> fields;
  split( fields, s, ',' );

  cout << "\"" << s << "\"\n\n";
  print( fields );
  cout << fields.size() << " fields.\n";

  return 0;
}
"One, two,, four , five,"

"One"
" two"
""
" four "
" five"
""

6 fields.

By the way, how would you like to be able to deduce the return type and just say fields = split( s, ',' );? Read about it here.
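That link describes a general deduction technique; as a simpler, hedged sketch for the common std::vector&lt;std::string&gt; case, a plain overload that builds and returns the container already gives that call syntax:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Same EOF-looping logic as the container-filling version above, but the
// result is built locally and returned, so callers can write
//   std::vector<std::string> fields = split( s, ',' );
std::vector<std::string> split( const std::string& s, char delimiter )
{
  std::vector<std::string> result;
  std::istringstream ss( s );
  while (!ss.eof())
  {
    std::string field;
    std::getline( ss, field, delimiter );
    result.push_back( field );
  }
  return result;
}
```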

If you are unsure how to actually use any of these functions in your own programs, make sure to read all about it at this other spot.

string::find_first_of()

The version of split() built using std::getline() was pretty slick, but we can actually do better – much better. We would also like to be able to match on any of a set of delimiters. That is much easier using the STL string find functions.

The basic algorithm is this:

string s = "string, to, split";
string delimiters = " ,";
size_t current;
size_t next = -1;
do
{
  current = next + 1;
  next = s.find_first_of( delimiters, current );
  cout << s.substr( current, next - current ) << endl;
}
while (next != string::npos);

Let’s do as we did above and put that into a function that fills a container of your choice, again adding the ability to elide empty fields.

#include <cstddef>

struct split
{
  enum empties_t { empties_ok, no_empties };
};

template <typename Container>
Container& split(
  Container&                            result,
  const typename Container::value_type& s,
  const typename Container::value_type& delimiters,
  split::empties_t                      empties = split::empties_ok )
{
  result.clear();
  size_t current;
  size_t next = -1;
  do
  {
    if (empties == split::no_empties)
    {
      next = s.find_first_not_of( delimiters, next + 1 );
      if (next == Container::value_type::npos) break;
      next -= 1;
    }
    current = next + 1;
    next = s.find_first_of( delimiters, current );
    result.push_back( s.substr( current, next - current ) );
  }
  while (next != Container::value_type::npos);
  return result;
}

#include <iostream>
#include <string>
#include <vector>
using namespace std;

void print( vector <string> & v )
{
  for (size_t n = 0; n < v.size(); n++)
    cout << "\"" << v[ n ] << "\"\n";
  cout << endl;
}

int main()
{
  string s = "One, two,, four , five,";

  vector <string> fields;

  cout << "\"" << s << "\"\n\n";

  split( fields, s, "," );
  print( fields );
  cout << fields.size() << " fields.\n\n";

  split( fields, s, ",", split::no_empties );
  print( fields );
  cout << fields.size() << " fields.\n";

  return 0;
}
"One, two,, four , five,"

"One"
" two"
""
" four "
" five"
""

6 fields.

"One"
" two"
" four "
" five"

4 fields.

Remember to read up and learn how you can deduce the return type and call the function like this:

fields = split( s, ",", split::no_empties );

strtok()

This is the old C library function. There is an extensive overview in a later FAQ. Here is an example of using it.

/* This is C code */
#include <stdio.h>
#include <string.h>

int main()
{
  char s[] = "one, two,, four , five,"; /* mutable! */
  const char* p;

  for (p = strtok( s, "," );  p;  p = strtok( NULL, "," ))
  {
    printf( "\"%s\"\n", p );
  }

  return 0;
}
"one"
" two"
" four "
" five"

Notice how strtok() is too stupid to treat adjacent delimiters as an empty field? And it misses that empty field at the end!

These problems are beyond fixing if you use strtok().

Roll your own C tokenizer

You can always get better results by rolling your own tokenizer from the other functions available in <string.h>. Here’s a simple one you are free to use:

/* c_tokenizer.h */

#pragma once
#ifndef C_TOKENIZER_H
#define C_TOKENIZER_H

typedef struct
{
  char*       s;
  const char* delimiters;
  char*       current;
  char*       next;
  int         is_ignore_empties;
}
tokenizer_t;

enum { TOKENIZER_EMPTIES_OK, TOKENIZER_NO_EMPTIES };

tokenizer_t tokenizer( const char* s, const char* delimiters, int empties );
const char* free_tokenizer( tokenizer_t* tokenizer );
const char* tokenize( tokenizer_t* tokenizer );

#endif 

/* c_tokenizer.c */

#include <stdlib.h>
#include <string.h>

#include "c_tokenizer.h"

#ifndef strdup
#define strdup sdup
static char* sdup( const char* s )
{
  size_t n = strlen( s ) + 1;
  char*  p = malloc( n );
  return p ? memcpy( p, s, n ) : NULL;
}
#endif

tokenizer_t tokenizer( const char* s, const char* delimiters, int empties )
{
  char* strdup( const char* );

  tokenizer_t result;

  result.s                 = (s && delimiters) ? strdup( s ) : NULL;
  result.delimiters        = delimiters;
  result.current           = NULL;
  result.next              = result.s;
  result.is_ignore_empties = (empties != TOKENIZER_EMPTIES_OK);

  return result;
}

const char* free_tokenizer( tokenizer_t* tokenizer )
{
  free( tokenizer->s );
  return tokenizer->s = NULL;
}

const char* tokenize( tokenizer_t* tokenizer )
{
  if (!tokenizer->s) return NULL;

  if (!tokenizer->next)
    return free_tokenizer( tokenizer );

  tokenizer->current = tokenizer->next;
  tokenizer->next = strpbrk( tokenizer->current, tokenizer->delimiters );

  if (tokenizer->next)
  {
    *tokenizer->next = '\0';
    tokenizer->next += 1;

    if (tokenizer->is_ignore_empties)
    {
      tokenizer->next += strspn( tokenizer->next, tokenizer->delimiters );
      if (!(*tokenizer->current))
        return tokenize( tokenizer );
    }
  }
  else if (tokenizer->is_ignore_empties && !(*tokenizer->current))
    return free_tokenizer( tokenizer );

  return tokenizer->current;
}

And here is a simple example of how to use it.

#include <stdio.h>
#include "c_tokenizer.h"

int main()
{
  const char* s = ",,a,,b,,";  /* see notes with accompanying text below */ 

  tokenizer_t tok = tokenizer( s, ",", TOKENIZER_EMPTIES_OK );
  const char* token;
  unsigned    n;

  n = 0;
  for (token = tokenize( &tok ); token; token = tokenize( &tok ))
  {
    printf( "\"%s\"\n", token );
    n += 1;
  }
  free_tokenizer( &tok );
  printf( "%u tokens\n", n );

  return 0;
}
""
""
"a"
""
"b"
""
""
7 tokens

Remember, if you need help using this stuff in your own programs, make sure to read all about it here.

The tokenizer here is smart enough to tokenize on more than one delimiter; it does not suffer state problems (including threading issues); it allows you to collapse adjacent delimiters (what strtok() does automatically) or to treat them as delimiting empty fields; and it works properly on all input. Try replacing the string to test with things like "" and NULL. Try tokenizing the examples using TOKENIZER_NO_EMPTIES too.