A word about strings and pointers in C++

This question just came up on Stack Overflow. It reflects a pretty common misunderstanding of how C-style strings are represented by char pointers in both C and C++.

Greatly condensed, it goes:

  • You read some data into a std::string object. You display the contents; it’s all there.
  • You invoke c_str() on that std::string, and display its contents; it’s not all there.

For example:


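// Assuming (file) is an open std::ifstream here.
// Needs <fstream>, <iostream>, <iterator>, and <string>.
// The extra parentheses around the first argument keep the compiler from
// parsing this line as a function declaration.
std::string theFile((std::istreambuf_iterator<char>(file)), std::istreambuf_iterator<char>());
for (std::string::iterator it = theFile.begin(); it != theFile.end(); ++it) {
        std::cout << *it;
} // Outputs the entire file

std::cout << theFile.c_str(); // Outputs only the first few lines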

The Way C Does It

In C, all the way back to the original K&R, a string is just an array of char, and you pass it around as a simple pointer to its first character. C has no way to automatically allocate and deallocate storage for complex objects, so the char pointer is all you get.

By convention, strings in C don’t contain null (ASCII 0) bytes. Also by convention, the first null indicates the end of a C string. So in C if you want to represent the word “foo” in a string, it’s four bytes: one for each letter, all followed by a null.
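To make the byte count concrete, here is a minimal sketch. The helper name bytes_for_c_string is made up for this illustration; strlen and sizeof are the standard ones.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: bytes needed to store `text` as a C string,
// that is, its visible characters plus the terminating null byte.
std::size_t bytes_for_c_string(const char* text) {
    assert(text != nullptr);
    return std::strlen(text) + 1;  // strlen counts up to, but not including, the null
}

// The literal "foo" is an array of four chars: 'f', 'o', 'o', '\0'.
static_assert(sizeof("foo") == 4, "three letters plus the terminator");
```

Calling bytes_for_c_string("foo") gives 4, while strlen("foo") alone reports 3, because the terminator is stored but never counted as part of the string.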

There’s an obvious problem when your data actually include a null byte, which is fairly common when you’re processing images, word processing files, really anything that’s not plain text. A null byte is a perfectly reasonable value in the middle of a Microsoft Word document or in telemetry chunks from the space shuttle.

The Way C Libraries Do It

That’s why any well-designed function library that might need to handle null bytes doesn’t do so in terms of C strings per se. It may accept a char pointer, but only along with a length argument. Not MyAPIFunc (char *string), but MyAPIFunc (char *databuffer, int datalength).

See the difference? When your data might contain a null byte, then you can’t count on a null byte to indicate end of data. It’s that simple.
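Here is a rough sketch of what the pointer-plus-length style looks like in practice. The function count_null_bytes and the sample buffer are invented for this example; they are not from any real library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical length-aware function: counts the null bytes in a buffer.
// It trusts datalength, not a terminator, to know where the data ends.
std::size_t count_null_bytes(const char* databuffer, std::size_t datalength) {
    assert(databuffer != nullptr || datalength == 0);
    std::size_t nulls = 0;
    for (std::size_t i = 0; i < datalength; ++i) {
        if (databuffer[i] == '\0') {
            ++nulls;
        }
    }
    return nulls;
}

// Sample data with a null byte in the middle: 'a', 'b', '\0', 'c', 'd'.
const char sample[] = {'a', 'b', '\0', 'c', 'd'};
const std::size_t sample_length = sizeof(sample);  // all five bytes
```

With this data, count_null_bytes(sample, sample_length) walks all five bytes and finds one null, whereas strlen(sample) stops at the null and reports a length of 2.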

So our friend could perhaps have written out the contents of the file like this:

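const char *cstring = theFile.c_str();   // c_str() returns a pointer to const char
std::size_t clength = theFile.length();  // length() is a member function, so it needs parentheses
for (std::size_t i = 0; i < clength; i++)
{
        std::cout << cstring[i];         // walk by length, not by terminator
}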

That does call into question why the questioner doesn’t just use a C++ string in the first place! I asked him that, and he replied that the real objective is to use a library call that expects the old conventions. So I referred him to the information above.
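I don't know which library the questioner was calling, so the sketch below uses a made-up stand-in named legacy_checksum that follows the old (pointer, length) convention. The point is only the call site: hand over data() together with size(), and never let the callee guess the length.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a library call that follows the old
// (pointer, length) convention. Here it just sums the bytes it is given.
int legacy_checksum(const char* databuffer, int datalength) {
    assert(databuffer != nullptr || datalength == 0);
    int sum = 0;
    for (int i = 0; i < datalength; ++i) {
        sum += static_cast<unsigned char>(databuffer[i]);
    }
    return sum;
}

// Hypothetical wrapper: pass the string's buffer and its real length together.
int checksum_of(const std::string& text) {
    return legacy_checksum(text.data(), static_cast<int>(text.size()));
}
```

The cast to int is there only because the stand-in takes an int length, as in the MyAPIFunc signature above; a real library may well use a different integer type.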

The Way C++ Does It

C++ string objects, such as std::string, avoid the null-terminator problem by keeping an internal data buffer together with its length. Since C++ supports object constructors and destructors, the allocation and deallocation of that data buffer can be handled out of your view.
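The two views of the same string can be compared directly. This is a small sketch; the function names stored_length and c_visible_length are invented for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Length as the std::string object itself records it.
std::size_t stored_length(const std::string& s) {
    return s.size();
}

// Length as a C function sees it through c_str(): everything up to the first null.
std::size_t c_visible_length(const std::string& s) {
    const std::size_t n = std::strlen(s.c_str());
    assert(n <= s.size());  // the C view can only be shorter, never longer
    return n;
}
```

For a string built as std::string("line1\0line2", 11), stored_length reports 11 but c_visible_length reports 5. That gap is the same thing the questioner saw when only the first part of the file came out.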

For backward compatibility (geek talk for “letting your old code work without rewriting it”), literal strings in C++ are compiled just as in C: as null-terminated arrays of char that get handed around as pointers to char. So code like

std::string foo = "We print anything!";

is compiled as though you had written

std::string foo = std::string("We print anything!");

which invokes the std::string(char const *) constructor on the literal, not the std::string(std::string const &) copy constructor.
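One consequence worth spelling out: the char const * constructor has only a pointer to go on, so it reads up to the first null byte like any C function would. std::string also has a constructor that takes a pointer and a count. The wrapper names below are made up for this sketch; the two constructors are the standard ones.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Built with std::string(const char*): copying stops at the first null byte.
std::string from_c_string(const char* text) {
    assert(text != nullptr);
    return std::string(text);
}

// Built with std::string(const char*, std::size_t): copies exactly `length` bytes,
// null bytes included.
std::string from_buffer(const char* data, std::size_t length) {
    assert(data != nullptr || length == 0);
    return std::string(data, length);
}
```

Given the bytes 'a', 'b', '\0', 'c', 'd', from_c_string produces a string of size 2, while from_buffer with a length of 5 keeps all five bytes.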