Convert blocks of text to lines.
Typedefs
typedef struct cgul_crlf * | cgul_crlf_t |
Convert blocks of DOS, Mac, and Unix text to lines of arbitrary length.
This class operates on in-memory buffers. Typically, you will want to use cgul_crlf_file instead of this class.
For efficiency, this class overwrites parts of the blocks passed into cgul_crlf__convert() with NUL characters. Thus, if the block is important, the client should pass in a copy of the block instead of the original.
Furthermore, because this class inserts NUL characters into the buffer it converts, the buffer must be writable. Passing read-only memory, such as a string literal, can lead to segfaults; a dynamically allocated buffer is the safe choice.
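The in-place overwrite described above can be illustrated with a minimal sketch. This is not cgul's implementation: split_in_place() is a hypothetical helper that ends each line by writing NUL over the EOL bytes, which is why the buffer has to be writable and why the original text does not survive.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of cgul): split a writable, NUL-terminated
 * buffer into lines in place by overwriting the EOL bytes with NUL.
 * Stores at most max line pointers in lines[] and returns how many it stored.
 */
size_t split_in_place(char *buf, const char **lines, size_t max)
{
    size_t count = 0;
    char *p = buf;

    assert(buf != NULL && lines != NULL);
    while (*p != '\0' && count < max) {
        lines[count++] = p;          /* the line starts here, inside buf */
        p += strcspn(p, "\r\n");     /* advance to the EOL marker or the end */
        if (*p == '\0')
            break;                   /* final line had no EOL marker */
        if (p[0] == '\r' && p[1] == '\n')
            *p++ = '\0';             /* "\r\n": overwrite the '\r' as well */
        *p++ = '\0';                 /* terminate the line in place */
    }
    return count;
}
```

Every pointer stored in lines[] points into buf, so buf has to outlive the lines, and a caller who still needs the original bytes should split a copy.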
typedef struct cgul_crlf* cgul_crlf_t
Opaque pointer to a cgul_crlf instance.
CGUL_EXPORT cgul_crlf_t cgul_crlf__new(cgul_exception_t *cex)
Create a new cgul_crlf object. The caller is responsible for freeing the object by calling cgul_crlf__delete(). If memory cannot be allocated, NULL is returned, and an exception is thrown.
[in,out] | cex | c-style exception |
Returns: cgul_crlf instance
Referenced by cgul_crlf_cxx::cgul_crlf_cxx().
CGUL_EXPORT void cgul_crlf__delete(cgul_crlf_t crlf)
This method frees all internally allocated memory. Do not try to access crlf after calling this method.
[in] | crlf | cgul_crlf instance |
Referenced by cgul_crlf_cxx::set_obj(), and cgul_crlf_cxx::~cgul_crlf_cxx().
CGUL_EXPORT void cgul_crlf__reset(cgul_exception_t *cex, cgul_crlf_t crlf, unsigned long offset)
This method resets the crlf object so that it can process a new stream of text or process the same stream of text after seeking to a different location in the stream.
The client must inform this class of the new offset into the underlying file so that subsequent calls to cgul_crlf__get_line_offset() can return correct values. The value of offset should be zero-based. Thus, to start processing at the beginning of a new file, call this method with offset set to 0.
Calling this method resets the line count to zero. This can be used by the client to implement a line counter that does not overflow.
Calling this method does not change the setting that controls whether a leading UTF-8 byte-order mark (BOM) is stripped.
[in] | cex | c-style exception |
[in] | crlf | cgul_crlf instance |
[in] | offset | zero-based offset |
Referenced by cgul_crlf_cxx::reset().
CGUL_EXPORT int cgul_crlf__get_strip_utf8_bom(cgul_exception_t *cex, cgul_crlf_t crlf)
This method returns whether the leading UTF-8 byte-order mark (BOM) should be stripped from the first line if it is present.
[in] | cex | c-style exception |
[in] | crlf | cgul_crlf instance |
Referenced by cgul_crlf_cxx::get_strip_utf8_bom().
CGUL_EXPORT void cgul_crlf__set_strip_utf8_bom(cgul_exception_t *cex, cgul_crlf_t crlf, int strip_utf8_bom)
By default, this class detects the leading UTF-8 byte-order mark (BOM) and strips it from the first line returned by cgul_crlf__get_line() if it is present. It then clears its internal flag so that BOMs internal to the text file will be returned. This is generally what you want because the leading BOM is not significant but the internal BOMs are.
You can alter the way this class handles the leading BOM by calling this method with strip_utf8_bom set to 0. This will cause the leading BOM to be returned as part of the first line. This can be useful, for example, if you just want to convert the text file and are not interested in its contents.
The value of strip_utf8_bom is remembered across calls to cgul_crlf__reset().
Note that most operating systems do not save UTF-8 text files with a leading BOM because UTF-8 is a byte-oriented encoding and, as such, does not have byte-order problems. However, Microsoft Windows programs often add the BOM to UTF-8 text files, presumably to help distinguish them from text files with different encodings.
[in] | cex | c-style exception |
[in] | crlf | cgul_crlf instance |
[in] | strip_utf8_bom | whether to strip leading UTF-8 byte-order mark |
Referenced by cgul_crlf_cxx::set_strip_utf8_bom().
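For readers unfamiliar with the byte sequence involved, the following minimal sketch shows what detecting a leading UTF-8 BOM amounts to. The BOM is the three bytes EF BB BF. skip_utf8_bom() is a hypothetical helper for illustration, not a cgul function, and it makes no claim about how cgul implements the check.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of cgul): return a pointer just past a
 * leading UTF-8 BOM (bytes EF BB BF) if one is present, otherwise return s
 * unchanged. Only the start of the text is examined, so a BOM that appears
 * later in the text is left alone.
 */
const char *skip_utf8_bom(const char *s, size_t len)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };

    assert(s != NULL);
    if (len >= 3 && memcmp(s, bom, 3) == 0)
        return s + 3;
    return s;
}
```

A caller would apply such a check to the first line only, which mirrors the default behavior described above: the leading BOM is dropped and internal BOMs are preserved.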
CGUL_EXPORT void cgul_crlf__convert(cgul_exception_t *cex, cgul_crlf_t crlf, char *buf, unsigned long int bsize)
The caller feeds this method a block of text in buf of size bsize. The blocks you feed this method can end anywhere; they do not have to end exactly on a line boundary. This method splices together partial lines left over from the previous call, so it can form arbitrarily long lines terminated by any of the common EOL markers: "\n", "\r", or "\r\n".
After each call to this method, you MUST call cgul_crlf__get_line() iteratively until it returns NULL before feeding this method another block.
Do not alter buf until after you have exhausted cgul_crlf__get_line(). This restriction lets cgul_crlf__convert() avoid making a copy of each block: the method often (but not always) inserts NUL characters directly into buf to produce the lines returned by cgul_crlf__get_line().
After feeding the last block to this method and exhausting cgul_crlf__get_line(), you should call cgul_crlf__get_remainder() to fetch what remained if the last line had no trailing EOL marker.
This method dynamically allocates space to hold the lines that are split across calls to this method. If an error occurs, an exception is thrown, and the crlf object will be in an undefined state.
WARNING: Because this class embeds NUL characters directly into buf, buf must be writable. In particular, buf must not point to read-only memory such as a string literal, because writing to it can crash the program. A dynamically allocated buffer is the safe choice.
[in,out] | cex | c-style exception |
[in] | crlf | cgul_crlf instance |
[in] | buf | buffer |
[in] | bsize | buffer size |
Referenced by cgul_crlf_cxx::convert().
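To make the splicing of partial lines concrete, here is a minimal sketch of one way to assemble lines across block boundaries, including a "\r\n" pair that is split between two blocks. It is a stand-in for illustration only, not cgul's implementation: struct splicer, splicer_feed(), and collect() are hypothetical names, and unlike cgul_crlf__convert() this sketch copies every line into a carry buffer instead of writing NUL characters into the block.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef void (*line_fn)(const char *line, void *ctx);

/* Hypothetical state; initialize with { NULL, 0, 0 }. */
struct splicer {
    char  *carry;   /* partial line carried over from earlier blocks */
    size_t len;     /* number of bytes currently held in carry */
    int    saw_cr;  /* nonzero if the previous block ended with '\r' */
};

/* Append n bytes to carry and keep it NUL-terminated; -1 if out of memory. */
static int carry_append(struct splicer *s, const char *p, size_t n)
{
    char *grown = realloc(s->carry, s->len + n + 1);

    if (grown == NULL)
        return -1;
    memcpy(grown + s->len, p, n);
    s->len += n;
    grown[s->len] = '\0';
    s->carry = grown;
    return 0;
}

/* Feed one block; emit() is called once for every completed line. */
int splicer_feed(struct splicer *s, const char *block, size_t n,
                 line_fn emit, void *ctx)
{
    size_t start = 0, i = 0;

    assert(s != NULL && block != NULL && emit != NULL);
    if (s->saw_cr && n > 0) {
        if (block[0] == '\n')
            start = i = 1;          /* second half of a split "\r\n" */
        s->saw_cr = 0;
    }
    for (; i < n; i++) {
        if (block[i] != '\r' && block[i] != '\n')
            continue;
        if (carry_append(s, block + start, i - start) != 0)
            return -1;
        emit(s->carry, ctx);        /* carry now holds one whole line */
        s->len = 0;
        if (block[i] == '\r') {
            if (i + 1 == n)
                s->saw_cr = 1;      /* a '\n' may open the next block */
            else if (block[i + 1] == '\n')
                i++;                /* consume the '\n' of "\r\n" */
        }
        start = i + 1;
    }
    /* Whatever follows the last EOL marker waits for the next block. */
    return carry_append(s, block + start, n - start);
}

/* Example callback: append each line, followed by '|', to a string. */
void collect(const char *line, void *ctx)
{
    strcat((char *)ctx, line);
    strcat((char *)ctx, "|");
}
```

In this sketch, any bytes still held in carry after the last block (len greater than zero) play the role of the remainder, and the caller releases carry with free().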
CGUL_EXPORT const char* cgul_crlf__get_line(cgul_exception_t *cex, cgul_crlf_t crlf)
After seeding this object by calling cgul_crlf__convert(), you call this method to fetch the next line. If a line is ready, this method returns it. If no line is ready, this method returns NULL. The caller should not try to call free() on the line returned because it is really just a pointer back to the contents of the buffer passed into cgul_crlf__convert().
If this method does not return NULL, you should keep calling it until it does. Once it returns NULL, you can either refill this object by calling cgul_crlf__convert() with the next block or call cgul_crlf__get_remainder() to finish.
[in] | cex | c-style exception |
[in] | crlf | cgul_crlf instance |
Referenced by cgul_crlf_cxx::get_line().
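Because a returned line points back into the caller's buffer, a line that must be kept after the buffer is refilled presumably has to be copied first. A minimal sketch of such a copy follows; keep_line() is a hypothetical helper, not part of cgul.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of cgul): make a heap copy of a
 * NUL-terminated line. The caller owns the copy and releases it with
 * free(). Returns NULL if memory cannot be allocated.
 */
char *keep_line(const char *line)
{
    size_t n;
    char *copy;

    assert(line != NULL);
    n = strlen(line) + 1;           /* include the terminating NUL */
    copy = malloc(n);
    if (copy != NULL)
        memcpy(copy, line, n);
    return copy;
}
```

The copy is independent of the buffer, so calling free() on it is correct, whereas the pointer returned by cgul_crlf__get_line() must never be freed.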
CGUL_EXPORT unsigned long cgul_crlf__get_line_count(cgul_exception_t *cex, cgul_crlf_t crlf)
This method returns the total number of lines returned by cgul_crlf__get_line() and cgul_crlf__get_remainder(). The line count is one-based. No attempt is made to prevent the return value from overflowing, so the caller is responsible for verifying the return value.
Calls to cgul_crlf__reset() reset the line count to zero. This can be used by the client to implement a line counter that does not overflow.
[in,out] | cex | c-style exception |
[in] | crlf | cgul_crlf instance |
Referenced by cgul_crlf_cxx::get_line_count().
CGUL_EXPORT unsigned long cgul_crlf__get_line_offset(cgul_exception_t *cex, cgul_crlf_t crlf)
This method returns the offset of the last line returned by cgul_crlf__get_line() or cgul_crlf__get_remainder(). The offset is zero-based. If you are feeding a binary stream into cgul_crlf__convert() and if the stream is also a random access stream, you can use the return value to directly seek to the line as follows:
fseek(f, offset, SEEK_SET);
Because the prototype for fseek() requires a long for the offset parameter, no attempt is made to prevent the return value from overflowing, so the caller is responsible for verifying the return value (for example, that it fits in a long).
Note that the offset returned is the number of bytes from the start of the file to the current line. This is not necessarily the same as the number of characters, which depends on how the file is encoded.
To get the offset of the remainder, just call cgul_crlf__get_remainder() before calling this method.
This method throws an exception if, after converting a new block, it is called before cgul_crlf__get_line() is called.
[in,out] | cex | c-style exception |
[in] | crlf | cgul_crlf instance |
Referenced by cgul_crlf_cxx::get_line_offset().
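The verification mentioned above can be as small as a range check before the seek. The following sketch assumes the offset is held in an unsigned long, as returned by this method; seek_to_offset() is a hypothetical helper, not part of cgul.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of cgul): seek a binary stream to a byte
 * offset held in an unsigned long. Returns -1 without seeking if the
 * offset does not fit in the long that fseek() expects; otherwise returns
 * the result of fseek().
 */
int seek_to_offset(FILE *f, unsigned long offset)
{
    assert(f != NULL);
    if (offset > (unsigned long)LONG_MAX)
        return -1;                  /* too large to pass to fseek() */
    return fseek(f, (long)offset, SEEK_SET);
}
```

The stream should be open in binary mode so that byte offsets correspond to positions in the file.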
CGUL_EXPORT const char* cgul_crlf__get_remainder(cgul_exception_t *cex, cgul_crlf_t crlf)
This is the last method you should call, and it should only be called once. It should be called only after all the blocks have been fed to cgul_crlf__convert() and only after cgul_crlf__get_line() has been exhausted. At this point, all that is left is the remainder.
This method returns NULL if a remainder does not exist or if this method has already been called. The only time a remainder exists is if the last line in the file is missing the final EOL marker.
After calling this function, use cgul_crlf__get_line_offset() to get the offset of the remainder.
The caller should not try to call free() on the pointer returned because it points to an internal string that will be freed when cgul_crlf__delete() is called.
[in] | cex | c-style exception |
[in] | crlf | cgul_crlf instance |
Referenced by cgul_crlf_cxx::get_remainder().
CGUL_EXPORT void cgul_crlf__convert_file(cgul_exception_t *cex, FILE *fin, FILE *fout, const char *eol)
This method copies fin to fout, stripping the original EOL markers and replacing them with eol. Both fin and fout must have been opened in binary mode. This method internally uses a cgul_crlf object to perform the conversion. If an error occurs, an exception is thrown.
[in,out] | cex | c-style exception |
[in] | fin | input file |
[out] | fout | output file |
[in] | eol | new EOL marker |
Referenced by cgul_crlf_cxx::convert_file().
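As an illustration of the conversion this method performs, here is a self-contained sketch that rewrites EOL markers while copying one stream to another. It is a stand-in built on plain stdio calls, not cgul's implementation, and copy_with_eol() is a hypothetical name.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical stand-in (not cgul's implementation): copy in to out,
 * replacing each "\n", "\r", or "\r\n" with eol. Both streams are assumed
 * to be open in binary mode. Returns 0 on success, -1 on a read or write
 * error.
 */
int copy_with_eol(FILE *in, FILE *out, const char *eol)
{
    int c;

    assert(in != NULL && out != NULL && eol != NULL);
    while ((c = fgetc(in)) != EOF) {
        if (c == '\r') {
            int next = fgetc(in);

            if (next != '\n' && next != EOF)
                ungetc(next, in);   /* lone '\r': keep the following byte */
            c = '\n';               /* treat "\r" and "\r\n" like "\n" */
        }
        if (c == '\n') {
            if (fputs(eol, out) == EOF)
                return -1;
        } else if (fputc(c, out) == EOF) {
            return -1;
        }
    }
    return ferror(in) ? -1 : 0;
}
```

Binary mode matters here because a text-mode stream may translate EOL markers on its own on some platforms, which would interfere with the replacement.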