cgul_crlf.h File Reference

convert blocks of text to lines

#include "cgul_common.h"
#include "cgul_exception.h"
#include "cgul_string.h"

Typedefs

typedef struct cgul_crlf * cgul_crlf_t
 

Functions

CGUL_EXPORT cgul_crlf_t cgul_crlf__new (cgul_exception_t *cex)
 
CGUL_EXPORT void cgul_crlf__delete (cgul_crlf_t crlf)
 
CGUL_EXPORT void cgul_crlf__reset (cgul_exception_t *cex, cgul_crlf_t crlf, unsigned long offset)
 
CGUL_EXPORT int cgul_crlf__get_strip_utf8_bom (cgul_exception_t *cex, cgul_crlf_t crlf)
 
CGUL_EXPORT void cgul_crlf__set_strip_utf8_bom (cgul_exception_t *cex, cgul_crlf_t crlf, int strip_utf8_bom)
 
CGUL_EXPORT void cgul_crlf__convert (cgul_exception_t *cex, cgul_crlf_t crlf, char *buf, unsigned long int bsize)
 
CGUL_EXPORT const char * cgul_crlf__get_line (cgul_exception_t *cex, cgul_crlf_t crlf)
 
CGUL_EXPORT unsigned long cgul_crlf__get_line_count (cgul_exception_t *cex, cgul_crlf_t crlf)
 
CGUL_EXPORT unsigned long cgul_crlf__get_line_offset (cgul_exception_t *cex, cgul_crlf_t crlf)
 
CGUL_EXPORT const char * cgul_crlf__get_remainder (cgul_exception_t *cex, cgul_crlf_t crlf)
 
CGUL_EXPORT void cgul_crlf__convert_file (cgul_exception_t *cex, FILE *fin, FILE *fout, const char *eol)
 

Detailed Description

Convert blocks of DOS, Mac, and Unix text to lines of arbitrary length.

This class operates on in-memory buffers. Typically, you will want to use cgul_crlf_file instead of this class.

For efficiency, this class overwrites parts of the blocks passed into cgul_crlf__convert() with NUL characters. Thus, if the block is important, the client should pass in a copy of the block instead of the original.

Furthermore, because this class inserts NUL characters into the buffer it converts, the buffer must be writable. Passing read-only memory, such as a string literal or other constant data, can lead to segfaults; the safest choice is a dynamically allocated buffer.
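One way to satisfy both requirements (a writable buffer, and an original block that stays intact) is to convert a heap copy. The helper below is a minimal sketch; copy_block() is a hypothetical name and is not part of cgul.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of cgul): return a writable heap copy of
 * the first n bytes of src.  The caller frees the result with free().
 * Returns NULL if memory cannot be allocated. */
char *copy_block(const char *src, size_t n)
{
    char *dst = malloc(n > 0 ? n : 1);   /* avoid a zero-byte request */
    if (dst != NULL && n > 0) {
        memcpy(dst, src, n);
    }
    return dst;
}
```

The copy, not the original, would then be passed to cgul_crlf__convert() and freed only after cgul_crlf__get_line() has been exhausted.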

Author
Paul Serice
See also
cgul_crlf_file

Typedef Documentation

§ cgul_crlf_t

typedef struct cgul_crlf* cgul_crlf_t

Opaque pointer to a cgul_crlf instance.

Function Documentation

§ cgul_crlf__new()

CGUL_EXPORT cgul_crlf_t cgul_crlf__new ( cgul_exception_t *  cex)

Create a new cgul_crlf object. The caller is responsible for freeing the object by calling cgul_crlf__delete(). If memory cannot be allocated, NULL is returned, and an exception is thrown.

Parameters
[in,out]  cex  C-style exception
Returns
new cgul_crlf instance

Referenced by cgul_crlf_cxx::cgul_crlf_cxx().

§ cgul_crlf__delete()

CGUL_EXPORT void cgul_crlf__delete ( cgul_crlf_t  crlf)

This method frees all internally allocated memory. Do not try to access crlf after calling this method.

Parameters
[in]  crlf  cgul_crlf instance

Referenced by cgul_crlf_cxx::set_obj(), and cgul_crlf_cxx::~cgul_crlf_cxx().

§ cgul_crlf__reset()

CGUL_EXPORT void cgul_crlf__reset ( cgul_exception_t *  cex,
cgul_crlf_t  crlf,
unsigned long  offset 
)

This method is used to reset the crlf object so that it can process a new stream of text or process the same stream of text after seeking to a different location in the stream.

The client must inform this class of the new offset into the underlying file so that subsequent calls to cgul_crlf__get_line_offset() can return correct values. The value of offset should be zero-based. Thus, to start processing at the beginning of a new file call this method with offset set to 0.

Calling this method resets the line count to zero. This can be used by the client to implement a line counter that does not overflow.

Calling this method does not reset whether a leading UTF-8 byte-order mark (BOM) is stripped.

Parameters
[in]  cex  C-style exception
[in]  crlf  cgul_crlf instance
[in]  offset  zero-based offset

Referenced by cgul_crlf_cxx::reset().

§ cgul_crlf__get_strip_utf8_bom()

CGUL_EXPORT int cgul_crlf__get_strip_utf8_bom ( cgul_exception_t *  cex,
cgul_crlf_t  crlf 
)

This method returns whether the leading UTF-8 byte-order mark (BOM) should be stripped from the first line if it is present.

Parameters
[in]  cex  C-style exception
[in]  crlf  cgul_crlf instance
Returns
whether to strip a leading UTF-8 byte-order mark

Referenced by cgul_crlf_cxx::get_strip_utf8_bom().

§ cgul_crlf__set_strip_utf8_bom()

CGUL_EXPORT void cgul_crlf__set_strip_utf8_bom ( cgul_exception_t *  cex,
cgul_crlf_t  crlf,
int  strip_utf8_bom 
)

By default, this class detects the leading UTF-8 byte-order mark (BOM) and strips it from the first line returned by cgul_crlf__get_line() if it is present. It then clears its internal flag so that BOMs internal to the text file will be returned. This is generally what you want because the leading BOM is not significant but the internal BOMs are.

You can alter the way this class handles the leading BOM by calling this method with strip_utf8_bom set to 0. This will cause the leading BOM to be returned as part of the first line. This can be useful, for example, if you just want to convert the text file and are not interested in its contents.

The value of strip_utf8_bom is remembered across calls to cgul_crlf__reset().

Note that most operating systems do not save UTF-8 text files with a leading BOM because UTF-8 is a byte-oriented encoding and, as such, does not have byte-order problems. Microsoft Windows, however, adds the BOM to its UTF-8 text files, presumably to help distinguish them from text files with other encodings.
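For reference, the UTF-8 BOM is the three-byte sequence EF BB BF. The check below is a minimal standalone sketch of how a leading BOM can be detected in a block; utf8_bom_length() is a hypothetical name, and this is not necessarily how cgul performs the test internally.

```c
#include <stddef.h>

/* Hypothetical helper (not part of cgul): return 3 if the first n bytes
 * of buf begin with the UTF-8 byte-order mark EF BB BF, otherwise 0. */
size_t utf8_bom_length(const char *buf, size_t n)
{
    const unsigned char *p = (const unsigned char *)buf;
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF) {
        return 3;
    }
    return 0;
}
```

A client that leaves stripping disabled could use such a check to skip the BOM itself by advancing past the returned number of bytes in the first line.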

Parameters
[in]  cex  C-style exception
[in]  crlf  cgul_crlf instance
[in]  strip_utf8_bom  whether to strip leading UTF-8 byte-order mark

Referenced by cgul_crlf_cxx::set_strip_utf8_bom().

§ cgul_crlf__convert()

CGUL_EXPORT void cgul_crlf__convert ( cgul_exception_t *  cex,
cgul_crlf_t  crlf,
char *  buf,
unsigned long int  bsize 
)

The caller feeds this method a block of text in buf that is bsize bytes long. The blocks you feed this method can end anywhere; they do not have to end exactly on a line boundary. This method knows how to splice together partial lines from the last call to form arbitrarily long lines using any of the common EOL markers: "\n", "\r", or "\r\n".

After each call to this method, you MUST call cgul_crlf__get_line() iteratively until it returns NULL before feeding this method another block.

Do not alter buf until after you have exhausted cgul_crlf__get_line(). To avoid making a copy of each block, this method often (but not always) inserts NUL characters directly into buf to produce the lines returned by cgul_crlf__get_line().

After feeding the last block to this method and exhausting cgul_crlf__get_line(), you should call cgul_crlf__get_remainder() to fetch what remained if the last line had no trailing EOL marker.

This method dynamically allocates space to hold the lines that are split across calls to this method. If an error occurs, an exception is thrown, and the crlf object will be in an undefined state.

WARNING: Because this class embeds NUL characters directly into buf, buf must be writable. In practice, this means buf should be dynamically allocated rather than point at a string literal or other constant data, which many operating systems place in read-only memory.
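The feed-then-drain protocol described above can be illustrated with a small standalone sketch. This is not the cgul implementation: the splitter type and its functions are hypothetical names invented for illustration, and error handling is reduced to a flag. It shows the same ideas, namely terminating lines in place with NUL, keeping a heap copy of a line that is split across blocks, and treating "\r\n" as a single marker even when the two bytes arrive in different blocks.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical type (not part of cgul).  Initialize with {0}. */
typedef struct {
    char  *buf;         /* caller's current block, modified in place */
    size_t len, pos;    /* block length and scan position */
    char  *carry;       /* heap copy of a line split across blocks */
    size_t carry_len;
    int    carry_done;  /* carry was handed out; clear it on the next call */
    int    pending_cr;  /* last block ended with '\r' */
    int    error;       /* set if memory could not be allocated */
} splitter;

/* Append n bytes to the carry buffer, keeping it NUL-terminated. */
static int carry_append(splitter *s, const char *p, size_t n)
{
    char *t = realloc(s->carry, s->carry_len + n + 1);
    if (t == NULL) { s->error = 1; return -1; }
    memcpy(t + s->carry_len, p, n);
    s->carry = t;
    s->carry_len += n;
    s->carry[s->carry_len] = '\0';
    return 0;
}

/* Hand the splitter a new writable block. */
void splitter_feed(splitter *s, char *buf, size_t len)
{
    s->buf = buf; s->len = len; s->pos = 0;
    if (s->pending_cr && len > 0) {
        if (buf[0] == '\n') s->pos = 1;   /* second half of a split "\r\n" */
        s->pending_cr = 0;
    }
}

/* Return the next complete line, or NULL when the block is exhausted. */
const char *splitter_get_line(splitter *s)
{
    size_t i, start = s->pos;
    if (s->carry_done) { s->carry_len = 0; s->carry_done = 0; }
    for (i = start; i < s->len; ++i) {
        char c = s->buf[i];
        if (c != '\n' && c != '\r') continue;
        s->buf[i] = '\0';                 /* terminate the line in place */
        s->pos = i + 1;
        if (c == '\r') {
            if (s->pos == s->len) s->pending_cr = 1;
            else if (s->buf[s->pos] == '\n') s->pos++;
        }
        if (s->carry_len == 0) return s->buf + start;
        if (carry_append(s, s->buf + start, i - start) != 0) return NULL;
        s->carry_done = 1;
        return s->carry;                  /* spliced line lives on the heap */
    }
    /* No EOL left: stash the partial line until the next block arrives. */
    if (start < s->len) carry_append(s, s->buf + start, s->len - start);
    s->pos = s->len;
    return NULL;
}

/* After the final block: text left over when the input lacks a final EOL. */
const char *splitter_get_remainder(splitter *s)
{
    if (s->carry_done) { s->carry_len = 0; s->carry_done = 0; }
    return s->carry_len > 0 ? s->carry : NULL;
}
```

A caller would feed each block, call splitter_get_line() until it returns NULL, and after the last block call splitter_get_remainder() before releasing the carry buffer with free(). Note that a line contained entirely in one block is returned as a pointer into the caller's buffer, which is why that buffer must stay untouched until the block is drained.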

Parameters
[in,out]  cex  C-style exception
[in]  crlf  cgul_crlf instance
[in,out]  buf  buffer (overwritten in place with NUL characters)
[in]  bsize  buffer size

Referenced by cgul_crlf_cxx::convert().

§ cgul_crlf__get_line()

CGUL_EXPORT const char* cgul_crlf__get_line ( cgul_exception_t *  cex,
cgul_crlf_t  crlf 
)

After seeding this object by calling cgul_crlf__convert(), you call this method to fetch the next line. If a line is ready, this method returns it. If no line is ready, this method returns NULL. The caller should not try to call free() on the line returned because it is really just a pointer back to the contents of the buffer passed into cgul_crlf__convert().

If this method does not return NULL, you should keep calling it until it does. Once it returns NULL, you can either refill this object by calling cgul_crlf__convert() with the next block or call cgul_crlf__get_remainder() to finish.

Parameters
[in]  cex  C-style exception
[in]  crlf  cgul_crlf instance
Returns
next line of text

Referenced by cgul_crlf_cxx::get_line().

§ cgul_crlf__get_line_count()

CGUL_EXPORT unsigned long cgul_crlf__get_line_count ( cgul_exception_t *  cex,
cgul_crlf_t  crlf 
)

This method returns the total number of lines returned by cgul_crlf__get_line() and cgul_crlf__get_remainder(). The line count is one-based. No attempt is made to prevent the return value from overflowing, so the caller is responsible for verifying it.

Calls to cgul_crlf__reset() reset the line count to zero. This can be used by the client to implement a line counter that does not overflow.
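One possible shape for such a client-side counter is sketched below: the client reads the line count just before each reset and adds it to its own, wider running total. The line_total type and line_total_add() are hypothetical names and are not part of cgul.

```c
#include <stdint.h>

/* Hypothetical client-side counter (not part of cgul). */
typedef struct {
    uintmax_t total;       /* lines accumulated across resets */
    int       overflowed;  /* set if even the wide total would wrap */
} line_total;

/* Add the line count read just before a reset to the running total.
 * The total saturates at UINTMAX_MAX instead of wrapping around. */
void line_total_add(line_total *t, unsigned long count)
{
    if (count > UINTMAX_MAX - t->total) {
        t->overflowed = 1;
        t->total = UINTMAX_MAX;
    } else {
        t->total += count;
    }
}
```

The client would call line_total_add() with the value of cgul_crlf__get_line_count() and then call cgul_crlf__reset() with the current offset, so the internal count restarts from zero.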

Parameters
[in,out]  cex  C-style exception
[in]  crlf  cgul_crlf instance
Returns
line count

Referenced by cgul_crlf_cxx::get_line_count().

§ cgul_crlf__get_line_offset()

CGUL_EXPORT unsigned long cgul_crlf__get_line_offset ( cgul_exception_t *  cex,
cgul_crlf_t  crlf 
)

This method returns the offset of the last line returned by cgul_crlf__get_line() or cgul_crlf__get_remainder(). The offset is zero-based. If you are feeding a binary stream into cgul_crlf__convert() and if the stream is also a random access stream, you can use the return value to directly seek to the line as follows:

    fseek(f, offset, SEEK_SET);

The prototype for fseek() requires a long for the offset parameter, whereas this method returns an unsigned long. No attempt is made to prevent the return value from overflowing, so the caller is responsible for verifying the return value, including that it fits in a long, before seeking.
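A minimal sketch of such a check follows. It assumes the offset was obtained from cgul_crlf__get_line_offset() and that the stream was opened in binary mode; seek_to_line() is a hypothetical helper and is not part of cgul.

```c
#include <limits.h>
#include <stdio.h>

/* Hypothetical helper (not part of cgul): seek to a zero-based byte
 * offset held in an unsigned long.  Returns 0 on success, or -1 if the
 * offset does not fit in a long or if fseek() itself fails. */
int seek_to_line(FILE *f, unsigned long offset)
{
    if (offset > (unsigned long)LONG_MAX) {
        return -1;   /* cannot be represented in fseek()'s long parameter */
    }
    return fseek(f, (long)offset, SEEK_SET) == 0 ? 0 : -1;
}
```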

Note that the offset returned is the number of bytes from the start of the file to the current line. This is not necessarily the same as the number of characters, which depends on how the file is encoded.

To get the offset of the remainder, just call cgul_crlf__get_remainder() before calling this method.

This method throws an exception if, after converting a new block, it is called before cgul_crlf__get_line() is called.

Parameters
[in,out]  cex  C-style exception
[in]  crlf  cgul_crlf instance
Returns
zero-based offset into file for the current line

Referenced by cgul_crlf_cxx::get_line_offset().

§ cgul_crlf__get_remainder()

CGUL_EXPORT const char* cgul_crlf__get_remainder ( cgul_exception_t *  cex,
cgul_crlf_t  crlf 
)

This is the last method you should call, and it should only be called once. It should be called only after all the blocks have been fed to cgul_crlf__convert() and only after cgul_crlf__get_line() has been exhausted. At this point, all that is left is the remainder.

This method returns NULL if a remainder does not exist or if this method has already been called. The only time a remainder exists is if the last line in the file is missing the final EOL marker.

After calling this function, use cgul_crlf__get_line_offset() to get the offset of the remainder.

The caller should not try to call free() on the pointer returned because it points to an internal string that will be freed when cgul_crlf__delete() is called.

Parameters
[in]  cex  C-style exception
[in]  crlf  cgul_crlf instance
Returns
text remaining after the final EOL

Referenced by cgul_crlf_cxx::get_remainder().

§ cgul_crlf__convert_file()

CGUL_EXPORT void cgul_crlf__convert_file ( cgul_exception_t *  cex,
FILE *  fin,
FILE *  fout,
const char *  eol 
)

This method copies fin to fout stripping the original EOL markers and replacing them with eol. fin and fout must have been opened in binary mode. This method internally uses a cgul_crlf object to perform the conversion. If an error occurs, an exception is thrown.
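The effect of this conversion can be illustrated with a standalone sketch that does not use cgul at all. It processes one byte at a time, whereas this method works through a cgul_crlf object; convert_eol() is a hypothetical name. The sketch writes nothing extra after a final line that lacks an EOL marker, and whether cgul_crlf__convert_file() behaves the same way in that case is not stated here.

```c
#include <stdio.h>

/* Hypothetical helper (not part of cgul): copy fin to fout, replacing
 * each "\n", "\r", or "\r\n" with eol.  Both streams are assumed to be
 * open in binary mode so that the C library does not translate EOL
 * markers itself.  Returns 0 on success, -1 on an I/O error. */
int convert_eol(FILE *fin, FILE *fout, const char *eol)
{
    int c;
    int prev_cr = 0;   /* previous byte was '\r' */

    while ((c = getc(fin)) != EOF) {
        if (c == '\n' && prev_cr) {
            prev_cr = 0;   /* second half of "\r\n", already replaced */
            continue;
        }
        prev_cr = (c == '\r');
        if (c == '\r' || c == '\n') {
            if (fputs(eol, fout) == EOF) return -1;
        } else if (putc(c, fout) == EOF) {
            return -1;
        }
    }
    return ferror(fin) ? -1 : 0;
}
```

Opening the streams with "rb" and "wb" matters on platforms whose text mode rewrites "\r\n", because that translation would otherwise happen before or after the conversion and change the result.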

Parameters
[in,out]  cex  C-style exception
[in]  fin  input file
[out]  fout  output file
[in]  eol  new EOL marker

Referenced by cgul_crlf_cxx::convert_file().